How to implement sentiment analysis using a naive Bayes classifier.

Naive Bayes classifiers are used in various natural language processing tasks. These include sentiment analysis, spam filtering, and other types of document classifications. In this post, I will present Bayes’ theorem, the building block of naive Bayes classifiers, describe how naive Bayes classifiers work, and show how to implement these algorithms.

The Naive Bayes classifier represents an application of Bayes’ theorem. Given two events $A$ and $B$, this theorem states that the conditional probability $P(A \ | \ B)$ that event $A$ occurs given that event $B$ has occurred is expressed as

$\displaystyle P(A | B) = \frac{P(B | A) \times P(A)}{P(B)}.$

Here, $P(A)$ and $P(B)$ are the probabilities that $A$ and $B$ occur, respectively. Bayes’ theorem allows us to take an unknown quantity $P(A \ | \ B)$ and define it in terms of other relevant probabilities that we may already know.

If reading this equation makes you feel like you’re reading hieroglyphics, try the following example:

$\displaystyle P(\textrm{fun year} \;|\; \textrm{no covid}) = \frac{P(\textrm{no covid} \; | \; \textrm{fun year}) \times P(\textrm{fun year})}{P(\textrm{no covid})}$

In other words, the probability that 2020 would have been a fun year, given that the COVID-19 pandemic hadn’t occurred, can be computed if we have prior knowledge of the terms on the right-hand side. These include:

- $P(\textrm{no covid} \ | \ \textrm{fun year})$: the probability that COVID-19 doesn’t occur in a fun year
- $P(\textrm{fun year})$: the probability that any year is fun
- $P(\textrm{no covid})$: the probability that COVID-19 doesn’t occur in any given year

The term $P(\textrm{no covid})$ can be written as the probability that COVID-19 doesn’t occur regardless of whether the year is fun or not. This is given as

$\displaystyle P(\textrm{no covid}) = P(\textrm{no covid} \; | \; \textrm{fun year}) \times P(\textrm{fun year}) + P(\textrm{no covid} \; | \; \textrm{bad year}) \times P(\textrm{bad year})$

Now, let’s assume we know the following:

- $P(\textrm{fun year}) = 0.8$
- $P(\textrm{bad year}) = 1 - P(\textrm{fun year}) = 0.2$
- $P(\textrm{no covid} \; | \; \textrm{fun year}) = 0.95$
- $P(\textrm{no covid} \; | \; \textrm{bad year}) = 0.05$

These probabilities suggest that

$\begin{aligned} P(\textrm{no covid}) &= 0.95 \times 0.8 + 0.05 \times 0.2 \\ &= 0.77 \end{aligned}$

Using these results, we can determine that

$\begin{aligned}\displaystyle P(\textrm{fun year} \;|\; \textrm{no covid}) &= \frac{0.95 \times 0.8}{0.77} \\ &\approx 0.99 \end{aligned}$

In other words, the probability that any year is fun, given that the COVID-19 pandemic doesn’t occur during that same year, is about 99%. Next, I will describe how Bayes’ theorem relates to the Naive Bayes classifier.
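The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the input probabilities are the assumed values listed above.

```python
# Assumed values, copied from the example above
p_fun = 0.8                   # P(fun year)
p_bad = 1 - p_fun             # P(bad year)
p_no_covid_given_fun = 0.95   # P(no covid | fun year)
p_no_covid_given_bad = 0.05   # P(no covid | bad year)

# Law of total probability: P(no covid)
p_no_covid = p_no_covid_given_fun * p_fun + p_no_covid_given_bad * p_bad

# Bayes' theorem: P(fun year | no covid)
p_fun_given_no_covid = p_no_covid_given_fun * p_fun / p_no_covid

print(round(p_no_covid, 2))            # 0.77
print(round(p_fun_given_no_covid, 2))  # 0.99
```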

In sentiment analysis applications, the goal of the Naive Bayes classifier is to predict the most probable sentiment for a given text. This requires a training set of text samples that have been labeled as having either a positive or negative sentiment. Using Bayes’ theorem, the probability of interest is expressed as

$\displaystyle P(s \ | \ \textrm{text}) = \frac{P(\textrm{text} \ | \ s) \times P(s)}{P(\textrm{text})} \quad s \in \{ \textrm{pos},\;\textrm{neg} \}$

where the sentiment $s$ can correspond to either a positive ($\mathrm{pos}$) or negative ($\mathrm{neg}$) sentiment. The probability $P(s)$ is the fraction of text samples in the training set that have sentiment $s$.

It may not always be straightforward to determine $P(\textrm{text} | s)$ and $P(\textrm{text})$. One way to simplify this is to assume that the words occurring in the text are independent of each other.

In practice, this assumption is typically not correct, which is why this algorithm is known as *naive* Bayes. Regardless, this simplification allows us to define the conditional probability $P(s \ | \ w)$, the probability of a sentiment $s$ given the word $w$, as

$\displaystyle P(s \ | \ w) = \frac{P(w \ | \ s) \times P(s)}{P(w)},$

where $P(w)$ represents the probability that the word $w$ occurs in the training set. Now we can determine the probability that a piece of text has sentiment $s$ as

$\displaystyle P(s\;| \; \mathrm{text}) = P(s) \times \prod_i \frac{P(w_i \; | \; s) }{P(w_i)},$

where $w_i$ represents the $i$-th word in the text.

Naive Bayes performs classifications by first computing $\displaystyle P(\textrm{pos} \ | \ \mathrm{text})$ and $\displaystyle P(\textrm{neg} \ | \ \mathrm{text})$. We can calculate the ratio of these two probabilities, known as the likelihood, to obtain

$\displaystyle \frac{P(\textrm{pos}\;| \; \mathrm{text})}{P(\textrm{neg}\;| \; \mathrm{text})} = \frac{P(\mathrm{pos})}{P(\mathrm{neg})} \prod_i \frac{P(w_i \; | \; \mathrm{pos}) \times \cancel{P(w_i)}}{P(w_i \; | \; \mathrm{neg}) \times \cancel{P(w_i)} }$

If the likelihood is greater than 1, the model predicts a positive sentiment. But if the likelihood is less than 1, the model instead predicts a negative sentiment. One advantage of computing the likelihood function is that we can avoid calculating $P(w_i)$ as it occurs in the numerator and the denominator.

The term $P(\mathrm{pos})/P(\mathrm{neg})$ is known as a prior. It is relevant if there are more samples having one particular sentiment than the other. If the training set contains an equal number of positive and negative text samples, $P(\mathrm{pos}) = P(\mathrm{neg}) = 0.5$ and thus, $P(\mathrm{pos})/P(\mathrm{neg}) = 1$.
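To make the prior concrete, here is a minimal sketch that estimates $P(\mathrm{pos})/P(\mathrm{neg})$ from a list of labels. The helper name `prior_ratio` and the label strings `"positive"` and `"negative"` are assumptions made for this illustration.

```python
from collections import Counter


def prior_ratio(labels):
    """Estimate P(pos)/P(neg) from a list of sentiment labels."""
    counts = Counter(labels)
    # The total number of samples cancels, so the ratio of counts suffices.
    # Assumes at least one negative label; otherwise this divides by zero.
    return counts["positive"] / counts["negative"]


balanced = ["positive", "positive", "negative", "negative"]
imbalanced = ["positive", "positive", "positive", "negative"]

print(prior_ratio(balanced))    # 1.0
print(prior_ratio(imbalanced))  # 3.0
```

With a balanced training set the prior is 1 and drops out of the product; with an imbalanced one it tilts every prediction toward the more common sentiment.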

The conditional probability $\displaystyle P(w \ | \ s)$ that a word $w$ occurs in the presence of sentiment $s$ is given by the expression

$\displaystyle P(w \;|\; s) = \frac{\mathrm{freq}(w, \; s)}{N_s}$

The function $\displaystyle \mathrm{freq}(w, \ s)$ represents how often the word $w$ occurs with sentiment $s$ in the training set. Furthermore, $N_s$ is the total number of words in text samples that have sentiment $s.$

These are the building blocks needed to implement naive Bayes for sentiment classification. In the next section, I will describe how to implement a Naive Bayes classifier from scratch.

Let’s now implement this in some code. I’ll start by creating a small dataset:

```
import pandas as pd

dataset = pd.DataFrame(
    {
        "text": [
            "2020 was a fun year",
            "Cats was a great movie",
            "NLP is not fun",
            "I hate tacos",
        ],
        "sentiment": ["positive", "positive", "negative", "negative"],
    }
)
```

The table below represents this dataset:

| | text | sentiment |
|---|---|---|
| 0 | 2020 was a fun year | positive |
| 1 | Cats was a great movie | positive |
| 2 | NLP is not fun | negative |
| 3 | I hate tacos | negative |

Now, we need to determine how often each word in our dataset occurs for both sentiment labels. The function `get_freqs` defined below takes care of this. It uses the functions `word_tokenize` and `FreqDist` imported from the NLTK library. The function `word_tokenize` splits a string into a list of words, also known as word tokens. Furthermore, the function `FreqDist` takes these word tokens to generate a dictionary showing how often each word occurs.

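```
from nltk import word_tokenize
from nltk import FreqDist

# Note: word_tokenize relies on NLTK's Punkt tokenizer data, which may
# need to be fetched once with nltk.download before the first call.


def get_freqs(dataset, sentiment):
    # Keep only the rows labeled with the requested sentiment
    senti_dataset = dataset.loc[dataset["sentiment"] == sentiment]
    # Join all matching texts into one string and split it into word tokens
    total_text = senti_dataset["text"].to_list()
    total_text = " ".join(total_text)
    words = word_tokenize(total_text)
    # Count how often each token occurs
    return FreqDist(words)
```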

Here’s how `get_freqs` works in practice. First, we can generate a frequency dictionary for the words in the text samples with a positive sentiment.

```
pos_freqs = get_freqs(dataset, "positive")
pos_freqs
```

```
FreqDist({'was': 2, 'a': 2, '2020': 1, 'fun': 1, 'year': 1, 'Cats': 1, 'great': 1, 'movie': 1})
```

Similarly, we can compute a frequency dictionary for text samples with a negative sentiment.

```
neg_freqs = get_freqs(dataset, "negative")
neg_freqs
```

```
FreqDist({'NLP': 1, 'is': 1, 'not': 1, 'fun': 1, 'I': 1, 'hate': 1, 'tacos': 1})
```

For convenience, let’s combine these dictionaries in a DataFrame. The function `get_freq_table` defined below achieves this. It first creates two dataframes using `pos_freqs` and `neg_freqs`, respectively. These are then merged using the `DataFrame.merge` function. In the final step, this implementation uses the `DataFrame.fillna` function to replace any missing values with a `0`.

```
def get_freq_table(dataset):
    # Word counts for each sentiment label
    pos_freqs = get_freqs(dataset, "positive")
    neg_freqs = get_freqs(dataset, "negative")
    # One single-column dataframe per sentiment, indexed by word
    pos_freq_table = pd.DataFrame.from_dict(pos_freqs, orient="index")
    pos_freq_table.columns = ["positive"]
    neg_freq_table = pd.DataFrame.from_dict(neg_freqs, orient="index")
    neg_freq_table.columns = ["negative"]
    # The outer merge keeps words that occur with only one sentiment
    freq_table = pos_freq_table.merge(
        neg_freq_table, how="outer", left_index=True, right_index=True
    )
    # Words missing from one sentiment get a count of 0
    freq_table = freq_table.fillna(0)
    return freq_table
```

We can now compute a frequency table showing how often words in the training set occur with positive and negative sentiments.

```
freq_table = get_freq_table(dataset)
freq_table
```

| | positive | negative |
|---|---|---|
| 2020 | 1.0 | 0.0 |
| Cats | 1.0 | 0.0 |
| I | 0.0 | 1.0 |
| NLP | 0.0 | 1.0 |
| a | 2.0 | 0.0 |
| fun | 1.0 | 1.0 |
| great | 1.0 | 0.0 |
| hate | 0.0 | 1.0 |
| is | 0.0 | 1.0 |
| movie | 1.0 | 0.0 |
| not | 0.0 | 1.0 |
| tacos | 0.0 | 1.0 |
| was | 2.0 | 0.0 |
| year | 1.0 | 0.0 |

Given this frequency table, we can now define a function `prob` to compute the conditional probability $P(w \ | \ s) = \mathrm{freq}(w, \ s)/N_s$ that a word $w$ occurs with a sentiment $s$.

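```
def prob(word, freq_table, sentiment):
    vocab = freq_table.index.to_list()
    # Words outside the training vocabulary get a frequency of 0
    if word in vocab:
        word_freq = freq_table.loc[word, sentiment]
    else:
        word_freq = 0
    # N_s: total number of words seen with this sentiment
    n_senti = freq_table[sentiment].sum()
    return word_freq / n_senti
```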

As an example, we can use this function to compute the probability that the word “was” occurred given that the sentiment was positive. In this case, the probability is 20% since “was” occurred twice and the total number of words in the positive text samples is 10.

```
prob("was", freq_table, "positive")
```

```
0.2
```

Similarly, we can compute the probability that the word “was” occurred given that the sentiment was negative. The probability here is 0 because this word didn’t occur in the samples with a negative sentiment.

```
prob("was", freq_table, "negative")
```

```
0
```

Next, let’s define a `likelihood` function that we can use to predict the sentiment of any given text:

```
def likelihood(text, freq_table, prior=1):
    word_lst = word_tokenize(text)
    # Start from the prior ratio P(pos)/P(neg)
    output = prior
    for word in word_lst:
        pos_prob = prob(word, freq_table, sentiment="positive")
        neg_prob = prob(word, freq_table, sentiment="negative")
        # Multiply in the ratio P(w | pos) / P(w | neg) for each word
        output *= pos_prob / neg_prob
    return output
```

Here, the argument `prior` corresponds to the ratio $P(\mathrm{pos})/P(\mathrm{neg})$. If the training set contains more text samples having one sentiment than the other, this argument will have to be adjusted to account for this. We can compute the likelihood for the string `"Cats was great"` by running the following expression:

```
likelihood("Cats was great", freq_table)
```

```
RuntimeWarning: divide by zero encountered in double_scalars
output *= pos_prob/neg_prob
```

Uh oh! What happened here? The problem with this implementation is that if $P(w_i \ | \ s) = 0$ for any word in the input text, then $P(s \ | \ \mathrm{text}) = 0$. In this case, `neg_prob` is zero because none of the words in the string `"Cats was great"` ever occur with a negative sentiment in the training set, so the ratio `pos_prob/neg_prob` divides by zero. This can be addressed using a method known as Laplacian smoothing.

When using Laplacian smoothing, the probability $P(w \ | \ s)$ is revised to become

$\displaystyle P(w \;|\; \textrm{s}) = \frac{\mathrm{freq}(w,\; \mathrm{s}) + 1}{N_\mathrm{s} + V} \quad \textrm{s} \in \{ \textrm{pos}, \;\textrm{neg} \}.$

In this expression, $V$ represents the number of unique words that occur in the training set. Now, if $\mathrm{freq}(w, \ s) = 0$, then $P(w \ | \ s) = 1/(N_s + V)$ instead of 0. Let’s now define a new probability function `prob_lps` to implement this:

```
def prob_lps(word, freq_table, sentiment):
    vocab = freq_table.index.to_list()
    # V: number of unique words in the training set
    v = len(vocab)
    if word in vocab:
        word_freq = freq_table.loc[word, sentiment]
    else:
        word_freq = 0
    n_senti = freq_table[sentiment].sum()
    # Laplacian smoothing: add 1 to the count and V to the total
    return (word_freq + 1) / (n_senti + v)
```

Let’s see how this works on the word “was.” Before, the probability was 20%. Now, it’s 12.5%.

```
prob_lps("was", freq_table, "positive")
```

```
0.125
```

For negative sentiments, we originally got a probability of 0. After including Laplacian smoothing, the probability is now about 4.76%.

```
prob_lps("was", freq_table, "negative")
```

```
0.047619047619047616
```
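One property worth checking is that the smoothed probabilities still sum to 1 over the vocabulary, since adding 1 to each of the $V$ counts is balanced by adding $V$ to the denominator. The sketch below verifies this for the positive counts from the frequency table earlier; the helper `smoothed` and the hard-coded dictionaries are stand-ins written for this check, not part of the implementation in this post.

```python
# Word counts for the positive samples, copied from the frequency table
pos_counts = {"2020": 1, "Cats": 1, "a": 2, "fun": 1, "great": 1,
              "movie": 1, "was": 2, "year": 1}
# Full vocabulary of the training set (14 unique words)
vocab = ["2020", "Cats", "I", "NLP", "a", "fun", "great", "hate",
         "is", "movie", "not", "tacos", "was", "year"]

n_pos = sum(pos_counts.values())  # N_pos = 10
v = len(vocab)                    # V = 14


def smoothed(word):
    # (freq + 1) / (N_s + V), with freq = 0 for words never seen as positive
    return (pos_counts.get(word, 0) + 1) / (n_pos + v)


total = sum(smoothed(w) for w in vocab)
print(smoothed("was"))   # 0.125
print(round(total, 12))  # 1.0
```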

Let’s now define a new likelihood function that uses Laplacian smoothing.

```
def likelihood_lps(text, freq_table, prior=1):
    word_lst = word_tokenize(text)
    # Start from the prior ratio P(pos)/P(neg)
    output = prior
    for word in word_lst:
        pos_prob = prob_lps(word, freq_table, sentiment="positive")
        neg_prob = prob_lps(word, freq_table, sentiment="negative")
        # Smoothed probabilities are never zero, so this ratio is always defined
        output *= pos_prob / neg_prob
    return output
```

This time, we don’t run into any runtime warnings when computing likelihoods:

```
likelihood_lps("Cats was great", freq_table)
```

```
8.0390625
```

How does the likelihood change if we instead input the string `"Cats was not great"`?

```
likelihood_lps("Cats was not great", freq_table)
```

```
3.51708984375
```

The likelihood drops by about 56% in this case, from roughly 8.04 to 3.52. Here, the model will still predict a positive sentiment because the likelihood remains greater than 1. This is because the word “not” is the only word in the input text that appears with a negative sentiment in the training set.

It is best practice to compute log probabilities for numerical stability. As the number of words in the vocabulary $V$ increases, the probability of any word will become smaller and smaller.

Thus, when computing the likelihood for a very long string of text, the product of many small probabilities can become so small that there is not enough numerical precision to represent it accurately.

This is known as underflow. It represents one of the subtle nuances of performing arithmetic on decimal numbers using a computer. A more in-depth review of floating-point computations can be found in this article.
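A small experiment makes the problem visible. The numbers here are hypothetical (100 words, each with probability $10^{-5}$), chosen only so that the true product, $10^{-500}$, falls below the smallest positive value a double-precision float can hold. The sum of the logarithms of the same probabilities is shown for contrast.

```python
import math

# Hypothetical example: 100 words, each with probability 1e-5
probs = [1e-5] * 100

product = math.prod(probs)                 # true value is 1e-500
log_sum = sum(math.log(p) for p in probs)  # log of the same product

print(product)            # 0.0
print(round(log_sum, 2))  # -1151.29
```

The product collapses to exactly zero, while the sum of logs remains an ordinary, well-represented number.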

Computing the logarithm of each probability mitigates underflow errors. This is due to the following property of the logarithm function:

$\displaystyle \log \left( \prod_{i}^{m}p_i \right)= \sum_i^m \log(p_i).$

In the event that we have to multiply several small probabilities $p_i$, we can prevent underflow by taking the log of this product and instead computing the sum of the log probabilities $\log(p_i)$. Using this property, we can then define the log-likelihood as:

$\displaystyle \log\left(\frac{P(\mathrm{pos})}{P(\mathrm{neg})} \times \prod_{i}^{m}\frac{P(w_i \;|\; \mathrm{pos})}{P(w_i \;|\; \mathrm{neg})} \right) = \log \frac{P(\mathrm{pos})}{P(\mathrm{neg})} + \sum_i^m \log \frac{P(w_i \;|\; \mathrm{pos})}{P(w_i \;|\; \mathrm{neg})}.$

This is implemented in the function `log_likelihood_lps` shown below:

```
import numpy as np


def log_likelihood_lps(text, freq_table, prior=1):
    word_lst = word_tokenize(text)
    # Start from the log of the prior ratio P(pos)/P(neg)
    output = np.log(prior)
    for word in word_lst:
        numerator = prob_lps(word, freq_table, sentiment="positive")
        denom = prob_lps(word, freq_table, sentiment="negative")
        # Add the log of each word's ratio instead of multiplying the ratios
        output += np.log(numerator / denom)
    return output
```

In this case, a positive sentiment is predicted when the log-likelihood is greater than zero. Otherwise, the model will predict a negative sentiment, as seen in the following examples:

```
log_likelihood_lps("Cats are great", freq_table)
```

```
0.9857001832463227
```

```
log_likelihood_lps("I hate pineapple pizza", freq_table)
```

```
-1.9204199316179813
```
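As a recap, the whole pipeline can be condensed into a dependency-free sketch. This is not the implementation used in this post: `str.split` stands in for `word_tokenize` (which only behaves the same on simple, punctuation-free text like these samples), plain `Counter` objects replace the pandas frequency table, and the names `smoothed_prob` and `predict` are made up for this example.

```python
import math
from collections import Counter

# The toy training set from this post
samples = [
    ("2020 was a fun year", "positive"),
    ("Cats was a great movie", "positive"),
    ("NLP is not fun", "negative"),
    ("I hate tacos", "negative"),
]

# Count word occurrences per sentiment; str.split stands in for word_tokenize
counts = {"positive": Counter(), "negative": Counter()}
for text, sentiment in samples:
    counts[sentiment].update(text.split())

vocab = set(counts["positive"]) | set(counts["negative"])


def smoothed_prob(word, sentiment):
    # Laplacian smoothing: (freq + 1) / (N_s + V)
    n_s = sum(counts[sentiment].values())
    return (counts[sentiment][word] + 1) / (n_s + len(vocab))


def predict(text, prior=1.0):
    # Sum the log ratios; a score above zero means a positive prediction
    # (a score of exactly zero falls through to "negative" in this sketch)
    score = math.log(prior)
    for word in text.split():
        ratio = smoothed_prob(word, "positive") / smoothed_prob(word, "negative")
        score += math.log(ratio)
    return ("positive" if score > 0 else "negative"), score


print(predict("Cats are great"))
print(predict("I hate pineapple pizza"))
```

The scores come out at roughly 0.986 and -1.920, in line with the `log_likelihood_lps` results above.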

In this post, I introduced Bayes’ theorem and showed how it’s used to build a simple classifier, known as naive Bayes, for sentiment analysis. I also showed how to implement Laplacian smoothing and compute log-likelihoods to make the classifier more numerically robust.