
Naive Bayes Sentiment Analysis in Python After Preparing Data Using SQL

Machine learning (ML) refers to the use of existing data, computing power, and effective algorithms to identify patterns in data, recognize those patterns when they occur again, and correctly predict an outcome based on those patterns. A frequent type of problem encountered in machine learning is the classification problem. In these problems, we attempt to predict whether an object or an event belongs to a certain category. Some examples of classification problems are detecting whether a credit card transaction is fraudulent, detecting whether an email is spam, and detecting whether a customer is likely to churn.

Sentiment analysis is a classification problem where data teams attempt to predict whether text is positive or negative in tone. Many companies use sentiment analysis to automatically analyze product reviews, social media comments, and survey responses to quantify feedback about their products and services. In this post, we will build a sentiment analyzer using Python after preparing text data using SQL. We will use the Naive Bayes algorithm, a popular algorithm for sentiment analysis problems. Let’s get started.

The ML process

The ML process involves three major steps — preparing data, training a model, and testing the model. After the model is tested, it is deployed. Once the model is deployed, applications use it to answer a question — in this case, whether a piece of text is positive or negative. But it does not stop there; the ML process is highly iterative. A successful model needs to be constantly tested, retrained, and recreated as the world changes!
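The prepare/train/test loop above can be sketched in a few lines with scikit-learn. This is only an illustration on invented toy data — the post builds the real model further down with its own dataset:

```python
# A minimal sketch of the prepare/train/test loop (toy data invented for illustration)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

reviews = ["loved it", "great service", "terrible food", "awful experience"]
sentiments = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Prepare: turn text into a term-frequency matrix
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

# Train on one half of the data, test on the other
X_train, X_test, y_train, y_test = train_test_split(
    X, sentiments, test_size=0.5, random_state=0)
model = BernoulliNB().fit(X_train, y_train)
print(model.score(X_test, y_test))  # fraction of test reviews classified correctly
```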

Preparing data

The first step in any ML process is preparing the training data. We will use the Sentiment Labelled Sentences Dataset from the UCI Machine Learning Repository. That dataset contains user reviews from Amazon, IMDB, and Yelp, plus a judgment about whether each review is positive (score of 1) or negative (score of 0). The dataset is available as a CSV, so we can import the data using Periscope's CSV upload feature.

Once the data has been imported, it needs to be cleaned to remove duplicates and missing data. This is best done using SQL, the most popular language for data analysts. Here’s a look at the SQL I used to prepare this dataset for ML analysis:

select
 review
 , sentiment
from
 [govind_amazon_reviews]
where
 review is not null
 and sentiment is not null
union
select
 review
 , sentiment
from
 [govind_yelp_restaurant_reviews]
where
 review is not null
 and sentiment is not null
union
select
 review
 , sentiment
from
 [govin_imdb_movie_reviews]
where
 review is not null
 and sentiment is not null
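The same cleanup — combining the three sources, dropping missing values, and de-duplicating (which the SQL `union` does implicitly) — could also be sketched in pandas. The dataframes below are invented stand-ins for the three source tables:

```python
import pandas as pd

# Hypothetical dataframes, one per source, each with 'review' and 'sentiment' columns
amazon = pd.DataFrame({"review": ["Great product", None], "sentiment": [1, None]})
yelp = pd.DataFrame({"review": ["Slow service"], "sentiment": [0]})
imdb = pd.DataFrame({"review": ["Great product"], "sentiment": [1]})

# Concatenate, drop rows with missing values, then de-duplicate (like UNION)
df = (
    pd.concat([amazon, yelp, imdb], ignore_index=True)
    .dropna(subset=["review", "sentiment"])
    .drop_duplicates()
)
print(len(df))  # 2 rows survive: the null row and the duplicate are removed
```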

Once the data has been cleaned, we will use it as our training data. It’s ready to be fed into our ML algorithm (Naive Bayes) to build our model. Before we do that, let me spend a little time explaining the Naive Bayes algorithm.

Understanding Naive Bayes

Training phase

If we pick a review from the Labelled Sentences Dataset at random, the probability of it being positive is P and the probability of it being negative (N) is 1-P. Reviews are made up of words. Using the frequency of a specific word across all the reviews, we can compute a positive score and a negative score for each word. For example, here’s the calculation for P, N, and the positive and negative scores of the word “love.”

P = Number of Positive Reviews / Total Number of Reviews
N = 1 - P

Positive Score("Love") = Sum of freq. of "Love" in Positive Reviews / Sum of freq. of "Love" in All Reviews
Negative Score("Love") = 1 - Positive Score("Love")

After going through our entire training data, we will have P and N, the probabilities that any review picked at random from the dataset is positive or negative respectively, along with a Positive Score and Negative Score for every individual word present in our training data. Let's assume that at the end of our training phase, P is 60%, N is 40%, Positive Score("Love") is 90%, and Positive Score("Periscope") is 80%.
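The training-phase counts described above are easy to compute directly. Here is a toy sketch on an invented labelled dataset (not the UCI data), just to make the formulas concrete:

```python
# Toy labelled reviews: (text, sentiment), where 1 = positive and 0 = negative
reviews = [
    ("love this place", 1),
    ("love the food", 1),
    ("hate the wait", 0),
    ("food was fine", 1),
]

positive = [text for text, label in reviews if label == 1]
P = len(positive) / len(reviews)  # probability a random review is positive
N = 1 - P

def positive_score(word):
    # Fraction of the word's occurrences that appear in positive reviews
    in_pos = sum(text.split().count(word) for text, label in reviews if label == 1)
    in_all = sum(text.split().count(word) for text, label in reviews)
    return in_pos / in_all

print(P, positive_score("love"))  # 0.75 and 1.0 for this toy data
```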

Testing phase

Given a new review, the algorithm determines a positive score and a negative score for that review based on the individual words in it. If the positive score is greater than the negative score, it treats the overall review as positive.

To compute the positive and negative score for a comment, our model uses the information obtained in the training phase. For example,

Positive Score("Love Periscope") = Positive Score("Love") * Positive Score("Periscope") * P
Negative Score("Love Periscope") =(1-Positive Score("Love")) * (1-Positive Score("Periscope")) *(1-P)

Positive Score("Love Periscope") = 0.9 * 0.8 * 0.6 = 0.43
Negative Score("Love Periscope") = 0.1 * 0.2 * 0.4 = 0.008

Hence the review “Love Periscope” is classified as a positive review. The Naive Bayes algorithm assumes that each word contributes independently to the positive or negative score of a review; it does not consider the dependencies between the words. Despite this, Naive Bayes is a simple algorithm that generates strong results, especially when we don’t have a large amount of training data or much information about the problem domain.
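Plugging the assumed training-phase numbers into the formulas confirms the arithmetic (0.432 rounds to the 0.43 shown above):

```python
# Word scores and prior assumed from the training phase in the text
p_love, p_periscope, P = 0.9, 0.8, 0.6

positive = p_love * p_periscope * P                        # ≈ 0.432
negative = (1 - p_love) * (1 - p_periscope) * (1 - P)      # ≈ 0.008

print(positive > negative)  # the review is classified as positive
```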

Building the model

Let’s get back to building our model using the Naive Bayes algorithm. The output of our SQL query is available as a dataframe (df). The first step in building the Naive Bayes model is to represent each review in a term frequency representation. The scikit-learn package has a built-in class named CountVectorizer that will represent our reviews as a term frequency matrix.

# SQL output is imported as a dataframe variable called 'df'
import pandas as pd
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb

reviews = df['REVIEW']
sentiments = df['SENTIMENT']

count_vectorizer = skltext.CountVectorizer(binary=True)
transformed_reviews = count_vectorizer.fit_transform(reviews)
print(transformed_reviews.shape)

Each review has been converted into a vector of 4,812 numbers — one per unique word in the dataset. Most of the 4,812 entries will be 0, since any single review contains only a few of those words. If we print the vector for one review, we see only the elements that are 1.

The scikit-learn package also contains the Naive Bayes classifier. We instantiate this classifier (BernoulliNB) and pass the reviews in term frequency representation, along with the sentiments, to the fit method. This builds a model that is capable of classifying text as positive or negative.

import pandas as pd
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb

reviews = df['REVIEW']
sentiments = df['SENTIMENT']

count_vectorizer = skltext.CountVectorizer(binary=True)
transformed_reviews = count_vectorizer.fit_transform(reviews)

classifier = sklnb.BernoulliNB().fit(transformed_reviews, sentiments)

Testing the model

Now that the model has been built, we are ready to test it. This is done by calling the predict method on the classifier and passing the review to test in term frequency representation. The method returns whether the review is positive or negative.

import pandas as pd
import sklearn.feature_extraction.text as skltext
import sklearn.naive_bayes as sklnb

reviews = df['REVIEW']
sentiments = df['SENTIMENT']

count_vectorizer = skltext.CountVectorizer(binary=True)
transformed_reviews = count_vectorizer.fit_transform(reviews)

classifier = sklnb.BernoulliNB().fit(transformed_reviews, sentiments)

result = classifier.predict(count_vectorizer.transform(['I love Periscope']))
periscope.text('POSITIVE' if result[0] == 1 else 'NEGATIVE')
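If you also want the model's confidence rather than just the label, BernoulliNB exposes a predict_proba method. Here is a self-contained sketch on invented training data (not the classifier built above):

```python
# Sketch: class probabilities from BernoulliNB (toy training data invented for illustration)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

reviews = ["I love this", "really great", "I hate this", "really awful"]
sentiments = [1, 1, 0, 0]

vectorizer = CountVectorizer(binary=True)
classifier = BernoulliNB().fit(vectorizer.fit_transform(reviews), sentiments)

# One probability per class, in the order given by classifier.classes_
probs = classifier.predict_proba(vectorizer.transform(["I love Periscope"]))
print(classifier.classes_, probs[0])  # the positive class (1) gets the higher probability
```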

Instead of modifying the Python code each time to supply the text for testing, we can set up a filter that the user can enter free-form or load from a data source and pass the text from a filter into our analysis code.

Once the filter is set up, we modify the SQL to pass the input values from a filter into Python code.

select
 review
 , sentiment
 , '[InputText]' as InputText
from
 [govind_amazon_reviews]
where
 review is not null
 and sentiment is not null
union
select
 review
 , sentiment
 , '[InputText]' as InputText
from
 [govind_yelp_restaurant_reviews]
where
 review is not null
 and sentiment is not null
union
select
 review
 , sentiment
 , '[InputText]' as InputText
from
 [govin_imdb_movie_reviews]
where
 review is not null
 and sentiment is not null

Then in the Python code, we replace our test text “I love Periscope” with the filter input received through the data frame as df['INPUTTEXT'][0].

result = classifier.predict(count_vectorizer.transform([df['INPUTTEXT'][0]]))

This allows us to test our sentiment analyzer by entering text directly from the user interface!

Summary

Using a few lines of SQL, we have prepared data to be analyzed; using a few lines of Python, we have trained a model that is capable of analyzing the sentiment of that text. This shows the power of tools in our hands that help us perform data analysis today. Periscope Data supports dozens of R and Python libraries made for data analysis and visualization, ready and waiting for your next data project!



Govind Rajagopalan
Govind is Senior Engineering Manager at Sisense. He is a passionate software professional living in the East Bay with his wife and daughter. He is excited to teach, help his teammates thrive and have fun improving his craft. In his leisure time, he enjoys an outing with his family, hiking, and exploring parks around the Bay Area.