How to implement sentiment analysis using a naive Bayes classifier.
Naive Bayes classifiers are used in various natural language processing tasks, including sentiment analysis, spam filtering, and other document classification problems. In this post, I will present Bayes’ theorem, the building block of naive Bayes classifiers, describe how naive Bayes classifiers work, and show how to implement these algorithms.
The Naive Bayes classifier represents an application of Bayes’ theorem. Given two events $A$ and $B$, this theorem states that the conditional probability that event $A$ occurs given that event $B$ has occurred is expressed as

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Here, $P(A)$ and $P(B)$ are the probabilities that $A$ and $B$ occur, respectively. Bayes’ theorem allows us to take an unknown quantity and define it in terms of other relevant probabilities that we may already know.
If reading this equation makes you feel like you’re reading hieroglyphics, try the following example:

$$P(\text{fun year} \mid \text{no COVID}) = \frac{P(\text{no COVID} \mid \text{fun year})\,P(\text{fun year})}{P(\text{no COVID})}$$

In other words, the probability that 2020 would have been a fun year, given that the COVID-19 pandemic hadn’t occurred, can be computed if we have prior knowledge of the terms on the right-hand side. These include:

- $P(\text{no COVID} \mid \text{fun year})$, the probability that COVID-19 doesn’t occur given that the year is fun;
- $P(\text{fun year})$, the probability that the year is fun;
- $P(\text{no COVID})$, the probability that COVID-19 doesn’t occur.

The term $P(\text{no COVID})$ can be written as the probability that COVID-19 doesn’t occur regardless of whether the year is fun or not. This is given as

$$P(\text{no COVID}) = P(\text{no COVID} \mid \text{fun year})\,P(\text{fun year}) + P(\text{no COVID} \mid \text{not fun year})\,P(\text{not fun year})$$
Now, let’s assume we know the values of $P(\text{no COVID} \mid \text{fun year})$, $P(\text{fun year})$, and $P(\text{no COVID} \mid \text{not fun year})$. These probabilities allow us to compute $P(\text{no COVID})$ from the expression above. Using these results, we can determine that

$$P(\text{fun year} \mid \text{no COVID}) \approx 0.99$$
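As a quick sanity check, here is a minimal sketch of this calculation in Python. The specific probability values below are hypothetical assumptions chosen only so that the final answer comes out to roughly 99%; they are not from the original example.

```python
# Hypothetical probabilities (assumptions for illustration only)
p_no_covid_given_fun = 0.99      # P(no COVID | fun year)
p_no_covid_given_not_fun = 0.01  # P(no COVID | not a fun year)
p_fun = 0.5                      # P(fun year)

# Law of total probability: P(no COVID)
p_no_covid = (p_no_covid_given_fun * p_fun
              + p_no_covid_given_not_fun * (1 - p_fun))

# Bayes' theorem: P(fun year | no COVID)
p_fun_given_no_covid = p_no_covid_given_fun * p_fun / p_no_covid
print(p_fun_given_no_covid)  # 0.99
```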
In other words, the probability that any year is fun, given that the COVID-19 pandemic doesn’t occur during that same year, is about 99%. Next, I will describe how Bayes’ theorem relates to the Naive Bayes classifier.
In sentiment analysis applications, the goal of the Naive Bayes classifier is to predict the most probable sentiment for a given text. This requires a training set of text samples that have been labeled as having either a positive or negative sentiment. Using Bayes’ theorem, the probability of interest is expressed as

$$P(s \mid \text{text}) = \frac{P(\text{text} \mid s)\,P(s)}{P(\text{text})}$$

where the sentiment $s$ can correspond to either a positive ($\text{pos}$) or negative ($\text{neg}$) sentiment. The probability $P(s)$ is the fraction of text samples in the training set that have sentiment $s$.
It may not always be straightforward to determine $P(\text{text} \mid s)$ and $P(\text{text})$. One way to simplify this is to assume that the words occurring in the text are independent of each other.

In practice, this assumption is typically not correct, which is why this algorithm is known as naive Bayes. Regardless, this simplification allows us to define the conditional probability $P(s \mid w)$, the probability of a sentiment $s$ given the word $w$, as

$$P(s \mid w) = \frac{P(w \mid s)\,P(s)}{P(w)}$$

where $P(w)$ represents the probability that the word $w$ occurs in the training set. Now we can determine the probability that a piece of text has sentiment $s$ as

$$P(s \mid \text{text}) = \frac{P(s)\,\prod_i P(w_i \mid s)}{P(\text{text})}$$

where $w_i$ represents the $i$-th word in the text.
Naive Bayes performs classifications by first computing $P(\text{pos} \mid \text{text})$ and $P(\text{neg} \mid \text{text})$. We can calculate the ratio of these two probabilities, known as the likelihood, to obtain

$$\frac{P(\text{pos} \mid \text{text})}{P(\text{neg} \mid \text{text})} = \frac{P(\text{pos})}{P(\text{neg})}\,\prod_i \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

If the likelihood is greater than 1, the model predicts a positive sentiment. But if the likelihood is less than 1, the model instead predicts a negative sentiment. One advantage of computing the likelihood is that we can avoid calculating $P(\text{text})$, as it occurs in both the numerator and the denominator and cancels out.
The term $\frac{P(\text{pos})}{P(\text{neg})}$ is known as a prior. It is relevant if there are more samples having one particular sentiment than the other. If the training set contains an equal number of positive and negative text samples, $P(\text{pos}) = P(\text{neg})$ and thus, $\frac{P(\text{pos})}{P(\text{neg})} = 1$.
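For example, since $P(s)$ is just the fraction of training samples with sentiment $s$, the prior reduces to a ratio of sample counts. A hypothetical training set with 300 positive and 100 negative samples would give

$$\frac{P(\text{pos})}{P(\text{neg})} = \frac{300/400}{100/400} = 3$$

so the classifier would favor positive predictions before looking at any words.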
The conditional probability that a word $w$ occurs in the presence of sentiment $s$ is given by the expression

$$P(w \mid s) = \frac{\text{freq}(w, s)}{N_s}$$

The function $\text{freq}(w, s)$ represents how often the word $w$ occurs with sentiment $s$ in the training set. Furthermore, $N_s$ is the total number of words in text samples that have sentiment $s$.
These are the building blocks needed for sentiment classification with naive Bayes. In the next section, I will describe how to implement a naive Bayes classifier from scratch.
Let’s now implement this in some code. I’ll start by creating a small dataset:
import pandas as pd
dataset = pd.DataFrame(
{
"text": [
"2020 was a fun year",
"Cats was a great movie",
"NLP is not fun",
"I hate tacos",
],
"sentiment": ["positive",
"positive",
"negative",
"negative"],
}
)
The table below represents this dataset:
| | text | sentiment |
|---|---|---|
| 0 | 2020 was a fun year | positive |
| 1 | Cats was a great movie | positive |
| 2 | NLP is not fun | negative |
| 3 | I hate tacos | negative |
Now, we need to determine how often each word in our dataset occurs for both sentiment labels. The function `get_freqs` defined below takes care of this. It uses the functions `word_tokenize` and `FreqDist` imported from the NLTK library. The function `word_tokenize` splits a string into a list of words, also known as word tokens. Furthermore, the function `FreqDist` takes these word tokens to generate a dictionary showing how often each word occurs.
from nltk import word_tokenize
from nltk import FreqDist
def get_freqs(dataset, sentiment):
    # Keep only the text samples labeled with the requested sentiment
    senti_dataset = dataset.loc[dataset["sentiment"] == sentiment]
    # Join the samples into a single string and split it into word tokens
    total_text = senti_dataset["text"].to_list()
    total_text = " ".join(total_text)
    words = word_tokenize(total_text)
    # Count how often each word token occurs
    return FreqDist(words)
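As a quick aside, here is what these two NLTK helpers return on their own for an arbitrary example string of my choosing (this assumes the NLTK tokenizer data has already been downloaded):

```python
from nltk import word_tokenize, FreqDist

tokens = word_tokenize("NLP is fun and NLP is useful")
print(tokens)            # ['NLP', 'is', 'fun', 'and', 'NLP', 'is', 'useful']
print(FreqDist(tokens))  # FreqDist({'NLP': 2, 'is': 2, 'fun': 1, 'and': 1, 'useful': 1})
```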
Here’s how `get_freqs` works in practice. First, we can generate a frequency dictionary for the words in the text samples with a positive sentiment.
pos_freqs = get_freqs(dataset, "positive")
pos_freqs
FreqDist({'was': 2, 'a': 2, '2020': 1, 'fun': 1, 'year': 1, 'Cats': 1, 'great': 1, 'movie': 1})
Similarly, we can compute a frequency dictionary for text samples with a negative sentiment.
neg_freqs = get_freqs(dataset, "negative")
neg_freqs
FreqDist({'NLP': 1, 'is': 1, 'not': 1, 'fun': 1, 'I': 1, 'hate': 1, 'tacos': 1})
For convenience, let’s combine these dictionaries in a DataFrame. The function `get_freq_table` defined below achieves this. It first creates two DataFrames using `pos_freqs` and `neg_freqs`, respectively. These are then merged using `DataFrame.merge`. In the final step, this implementation uses `DataFrame.fillna` to replace any missing values with a `0`.
def get_freq_table(dataset):
pos_freqs = get_freqs(dataset, "positive")
neg_freqs = get_freqs(dataset, "negative")
pos_freq_table = pd.DataFrame.from_dict(pos_freqs, orient="index")
pos_freq_table.columns = ["positive"]
neg_freq_table = pd.DataFrame.from_dict(neg_freqs, orient="index")
neg_freq_table.columns = ["negative"]
freq_table = pos_freq_table.merge(
neg_freq_table, how="outer", left_index=True, right_index=True
)
freq_table = freq_table.fillna(0)
return freq_table
We can now compute a frequency table showing how often words in the training set occur with positive and negative sentiments.
freq_table = get_freq_table(dataset)
freq_table
| | positive | negative |
|---|---|---|
| 2020 | 1.0 | 0.0 |
| Cats | 1.0 | 0.0 |
| I | 0.0 | 1.0 |
| NLP | 0.0 | 1.0 |
| a | 2.0 | 0.0 |
| fun | 1.0 | 1.0 |
| great | 1.0 | 0.0 |
| hate | 0.0 | 1.0 |
| is | 0.0 | 1.0 |
| movie | 1.0 | 0.0 |
| not | 0.0 | 1.0 |
| tacos | 0.0 | 1.0 |
| was | 2.0 | 0.0 |
| year | 1.0 | 0.0 |
Given this frequency table, we can now define a function `prob` to compute the conditional probability $P(w \mid s)$ that a word $w$ occurs with a sentiment $s$.
def prob(word, freq_table, sentiment):
    vocab = freq_table.index.to_list()
    # Words that never appear in the training set get a frequency of 0
    if word in vocab:
        word_freq = freq_table.loc[word, sentiment]
    else:
        word_freq = 0
    # Total number of words across the text samples with this sentiment
    n_senti = freq_table[sentiment].sum()
    return word_freq/n_senti
As an example, we can use this function to compute the probability that the word “was” occurred given that the sentiment was positive. In this case, the probability is 20% since “was” occurred twice and the total number of words in the positive text samples is 10.
prob("was", freq_table, "positive")
0.2
Similarly, we can compute the probability that the word “was” occurred given that the sentiment was negative. The probability here is 0 because this word didn’t occur in the samples with a negative sentiment.
prob("was", freq_table, "negative")
0
Next, let’s define a `likelihood` function that we can use to predict the sentiment of any given text:
def likelihood(text, freq_table, prior=1):
word_lst = word_tokenize(text)
output = prior
for word in word_lst:
pos_prob = prob(word, freq_table, sentiment="positive")
neg_prob = prob(word, freq_table, sentiment="negative")
output *= pos_prob/neg_prob
return output
Here, the argument `prior` corresponds to the ratio $\frac{P(\text{pos})}{P(\text{neg})}$. If the training set contains more text samples having one sentiment than the other, this argument will have to be adjusted to account for this. We can compute the likelihood for the string “Cats was great” by running the following expression:
likelihood("Cats was great", freq_table)
RuntimeWarning: divide by zero encountered in double_scalars
output *= pos_prob/neg_prob
Uh oh! What happened here? The problem with this implementation is that if $P(w_i \mid \text{neg}) = 0$ for any word $w_i$ in the input text, then the ratio $P(w_i \mid \text{pos}) / P(w_i \mid \text{neg})$ is undefined. In this case, `neg_prob` is zero because none of the words in the string “Cats was great” ever occur with a negative sentiment in the training set. This can be addressed using a method known as Laplacian smoothing.
When using Laplacian smoothing, the probability $P(w \mid s)$ is revised to become

$$P(w \mid s) = \frac{\text{freq}(w, s) + 1}{N_s + V}$$

In this expression, $V$ represents the number of unique words that occur in the training set. Now, if $\text{freq}(w, s) = 0$, then $P(w \mid s) = \frac{1}{N_s + V}$ instead of 0. Let’s now define a new probability function `prob_lps` to implement this:
def prob_lps(word, freq_table, sentiment):
    vocab = freq_table.index.to_list()
    # V: the number of unique words in the training set
    v = len(vocab)
    if word in vocab:
        word_freq = freq_table.loc[word, sentiment]
    else:
        word_freq = 0
    n_senti = freq_table[sentiment].sum()
    # Laplacian (add-one) smoothing keeps the probability from ever being 0
    return (word_freq + 1)/(n_senti + v)
Let’s see how this works on the word “was.” Before, the probability was 20%. Now, it’s 12.5%.
prob_lps("was", freq_table, "positive")
0.125
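To see where this number comes from: “was” appears twice in the positive text samples, those samples contain 10 words in total, and the training set has 14 unique words, so

$$P(\text{was} \mid \text{pos}) = \frac{2 + 1}{10 + 14} = \frac{3}{24} = 0.125$$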
For negative sentiments, we originally got a probability of 0. After including Laplacian smoothing, the probability is now about 4.76%.
prob_lps("was", freq_table, "negative")
0.047619047619047616
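The arithmetic here is similar: “was” never appears in the negative text samples, which contain 7 words in total, so

$$P(\text{was} \mid \text{neg}) = \frac{0 + 1}{7 + 14} = \frac{1}{21} \approx 0.0476$$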
Let’s now define a new likelihood function that uses Laplacian smoothing.
def likelihood_lps(text, freq_table, prior=1):
word_lst = word_tokenize(text)
output = prior
for word in word_lst:
pos_prob = prob_lps(word, freq_table, sentiment="positive")
neg_prob = prob_lps(word, freq_table, sentiment="negative")
output *= pos_prob/neg_prob
return output
This time, we don’t run into any divide-by-zero warnings when computing likelihoods:
likelihood_lps("Cats was great", freq_table)
8.0390625
How does the likelihood change if we instead input the string “Cats was not great”?
likelihood_lps("Cats was not great", freq_table)
3.51708984375
The likelihood drops by more than half in this case, but the model will still predict a positive sentiment. This is because the word “not” is the only word in the input text that appears with a negative sentiment in the training set.
It is best practice to compute log probabilities for numerical stability. As the number of words in the vocabulary increases, the probability of any individual word becomes smaller and smaller. Thus, when computing the likelihood for a very long string of text, the product of many small probabilities can yield a number so small that there isn’t enough floating-point precision to represent it accurately. This is known as underflow, one of the subtle nuances of performing arithmetic on real numbers with a computer. A more in-depth review of floating-point computations can be found in this article.
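As a quick illustration of underflow (my own sketch, not from the original post), multiplying 400 probabilities of 0.1 underflows to zero, while summing their logarithms stays well within range:

```python
import math

probs = [0.1] * 400

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value, 1e-400, is too small for a float

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -921.03, which is easily representable
```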
Computing the logarithm of the probability will mitigate underflow errors. This is due to the following property of the logarithm function:

$$\log(a \times b) = \log(a) + \log(b)$$

In the event that we have to multiply several small probabilities $p_1, p_2, \ldots, p_n$, we can prevent underflow by taking the log of this product and instead computing the sum of the log probabilities $\log(p_1) + \log(p_2) + \cdots + \log(p_n)$. Using this property, we can then define the log-likelihood as:

$$\log\left(\frac{P(\text{pos})}{P(\text{neg})}\right) + \sum_i \log\left(\frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}\right)$$
This is implemented in the function `log_likelihood_lps` shown below:
import numpy as np
def log_likelihood_lps(text, freq_table, prior=1):
word_lst = word_tokenize(text)
output = np.log(prior)
for word in word_lst:
numerator = prob_lps(word, freq_table, sentiment="positive")
denom = prob_lps(word, freq_table, sentiment="negative")
output += np.log(numerator/denom)
return output
In this case, a positive sentiment is predicted when the log-likelihood is greater than zero. Otherwise, the model will predict a negative sentiment, as seen in the following examples:
log_likelihood_lps("Cats are great", freq_table)
0.9857001832463227
log_likelihood_lps("I hate pineapple pizza", freq_table)
-1.9204199316179813
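To wrap this up, here is a small convenience helper (my own addition, not part of the original post; the name `predict_sentiment` is hypothetical) that maps the log-likelihood to a sentiment label using the functions defined above:

```python
def predict_sentiment(text, freq_table, prior=1):
    # Positive log-likelihood -> positive sentiment, otherwise negative
    score = log_likelihood_lps(text, freq_table, prior=prior)
    return "positive" if score > 0 else "negative"

print(predict_sentiment("Cats are great", freq_table))          # positive
print(predict_sentiment("I hate pineapple pizza", freq_table))  # negative
```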
In this post, I introduced Bayes’ theorem and showed how it’s used to build a simple classifier, known as naive Bayes, for sentiment analysis. I also showed how to implement Laplacian smoothing and compute log-likelihoods to make the classifier more numerically robust.