Sentiment Analysis is a technique that uses Natural Language Processing (NLP), Text Mining, and Computational Linguistics to identify and extract the emotions present in the text. It has become increasingly valuable in today's digital age, as the proliferation of reviews, blogs, ratings, and feedback on the internet has created a wealth of information for businesses looking to understand their customers, identify new opportunities, and manage their reputation.
This technique has a wide range of applications and is used by many different industries, such as Market Research, Customer Feedback, Brand Monitoring, Employee Engagement, and Social Media Monitoring. By analyzing the emotions expressed in customer feedback, for example, businesses can gain insight into how their products or services are perceived and make improvements accordingly.
In this blog, we will explore the following topics:
- The conventional approach to sentiment classification
- Text preprocessing, tokenization, and lemmatization
- Exploratory analysis of the review text
- Word embeddings and the TF-IDF vectorizer
- Building and evaluating a Light GBM sentiment classifier
- Industrial use cases of sentiment analysis
Traditionally, sentiment classification involves a multi-step process that includes organizing text data and understanding customer emotions. However, with the arrival of deep learning, sentiment analysis has been revolutionized. The introduction of advanced techniques such as Transformers and Transfer Learning has made it possible to quickly build models for sentiment classification.
While the new deep-learning approaches have greatly simplified the process, it is still beneficial to have a basic understanding of sentiment classification. This understanding can help to fine-tune and improve the model, as well as provide a deeper understanding of customer sentiment.
Let’s build a sentiment classification model using the conventional approach!
In this tutorial, we will use Kaggle’s IMDB movie review dataset for demonstration. It contains more than 40,000 reviews with sentiment labels, and most of the reviews run to 200-plus words.
Let’s load the dataset!
import pandas as pd
imdb_reviews = pd.read_csv('train.csv')
imdb_reviews.head()
TEXT | LABEL
---------------------------------------------------------------------
0 grew up (b. 1965) watching and loving the Th... | 0
1 When I put this movie in my DVD player, and sa... | 0
2 Why do people who do not know what a particula... | 0
3 Even though I have great interest in Biblical... | 0
4 Im a die hard Dads Army fan and nothing will e... | 1
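Before cleaning anything, it's worth a quick look at the size and class balance of the data. A minimal check, assuming the lowercase 'label' column name used in the rest of this tutorial:

print(imdb_reviews.shape)                    # (rows, columns)
print(imdb_reviews['label'].value_counts())  # how many positive vs. negative reviews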
It is important to clean text data before applying machine learning models, because machines cannot make sense of raw, unstructured text. To prepare the text data, we will create a preprocessing pipeline that applies the following operations to our movie review corpus:
- Lowercasing all text
- Removing URLs
- Removing non-alphanumeric characters
- Removing stopwords
- Correcting spelling mistakes
import nltk
from textblob import TextBlob
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def text_preprocessing_pipeline(corpus):
    corpus['text'] = corpus['text'].str.lower()                                         # lowercase
    corpus['text'] = corpus['text'].str.replace(r"http\S+", "", regex=True)             # strip URLs
    corpus['text'] = corpus['text'].str.replace(r'[^A-Za-z0-9]+', ' ', regex=True)      # keep only alphanumerics
    corpus['text'] = corpus['text'].apply(
        lambda words: ' '.join(w for w in words.split() if w not in stop_words))        # drop stopwords
    corpus['text'] = corpus['text'].apply(lambda x: str(TextBlob(x).correct()))         # fix spelling mistakes
    return corpus

reviews = text_preprocessing_pipeline(imdb_reviews)
reviews.head()
TEXT | LABEL
---------------------------------------------------------------------
0 grew b 1965 watching loving thunderbirds mates... | 0
1 put movie dvd player sat coke chips expectatio... | 0
2 people know particular time past like feel nee... | 0
3 even though great interest biblical movies bor... | 0
4 im die hard dads army fan nothing ever change ... | 1
Tokenization is the process of breaking down a sentence into individual words, known as tokens. These tokens are used to understand the context of the sentence and to create a vocabulary. Tokenization is achieved by separating the words in a sentence using spaces or punctuation marks. This process helps to make the text more structured, which makes it easier for machine learning models to understand and analyze the data.
Text:    "The cat sat on the mat."
              |
              v
Tokens:  "the", "cat", "sat", "on", "the", "mat", "."
Lemmatization is a process that helps to reduce a word to its most basic root form. It uses linguistic analysis to determine the root form of a word, and it is necessary to have a comprehensive dictionary for the algorithm to reference in order to link the word form to its root. This process can help to improve the accuracy and performance of machine learning models by reducing the number of variations of a word and making the text more structured.
Studying ---\
Studies  ----> Study
Study    ---/
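One caveat worth knowing: NLTK's WordNetLemmatizer treats words as nouns by default, so verb forms like "studying" only reduce to "study" when a part-of-speech tag is supplied:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lem = WordNetLemmatizer()

print(lem.lemmatize('studies'))            # 'study'    (noun, the default POS)
print(lem.lemmatize('studying'))           # 'studying' (unchanged without a POS tag)
print(lem.lemmatize('studying', pos='v'))  # 'study'    (treated as a verb)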
Applying tokenization and lemmatization to our cleaned movie reviews:
import nltk

nltk.download('wordnet')  # lemma dictionary used by WordNetLemmatizer
nltk.download('punkt')

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Tokenize on whitespace, then reduce each token to its lemma
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

reviews['lemmatized_tokens'] = reviews['text'].apply(lemmatize_text)
reviews.head()
Now, we have a clean dataset ready for Exploratory data analysis.
Since the stopwords have already been removed, we are interested in the words that remain highly frequent across the reviews. Let’s find those words!
import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt

# Flatten the per-review token lists into one long list of tokens
lemmatized_tokens = list(reviews["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))

counts_no = collections.Counter(token_list)
clean_reviews = pd.DataFrame(counts_no.most_common(30),
                             columns=['words', 'count'])

fig, ax = plt.subplots(figsize=(12, 8))
clean_reviews.sort_values(by='count').plot.barh(x='words',
                                                y='count',
                                                ax=ax,
                                                color="purple")
ax.set_title("Most Frequently used words in Reviews")
plt.show()
Since our dataset contains movie reviews, the resultant word frequency plot is pretty intuitive.
A bigram is a sequence of two adjacent elements from a string of tokens, typically letters, syllables, or words. Let’s also check the highly frequent bigrams in our data.
bigrams = zip(token_list, token_list[1:])  # pair each token with its successor
bigram_counts = collections.Counter(bigrams)
print(bigram_counts.most_common(20))
Almost all the above bigrams make sense in our data. We could go further with trigrams, but that would not be as informative as these bigrams and unigrams.
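For completeness, trigrams can be counted the same way by zipping the token list against two shifted copies of itself:

trigrams = zip(token_list, token_list[1:], token_list[2:])
print(collections.Counter(trigrams).most_common(10))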
Let’s visualize the most practical words representing positive or negative sentiment in reviews.
import scattertext as st
from IPython.display import IFrame

# Labels are 0/1 in this dataset, so map them to readable category names on a copy
viz_df = reviews.iloc[:2000, :].copy()
viz_df['label'] = viz_df['label'].map({0: 'Negative', 1: 'Positive'})

corpus = st.CorpusFromPandas(viz_df,
                             category_col='label',
                             text_col='text',
                             nlp=st.whitespace_nlp_with_sentences).build()

html = st.produce_scattertext_explorer(corpus,
                                       category='Positive',
                                       category_name='Positive',
                                       not_category_name='Negative',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)

file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)
The scatter plot gives a quick summary of our findings: terms characteristic of positive reviews cluster toward one axis, terms characteristic of negative reviews toward the other, and words common to both sentiments sit in between.
Word embedding is a technique used to represent words as numerical vectors. This method encodes words in real-valued vectors, such that words with similar meaning and context are located close to each other in the vector space. In other words, word embeddings connect the way humans understand language to the way machines understand it. They are critical for solving natural language processing (NLP) tasks, as they provide a way for machines to understand the meaning and context of words in a text.
man  ----------------> woman
 |                       |
 |                       |
king ----------------> queen
There are several methods available for producing word embeddings, but their main idea is the same: to capture as much contextual and semantic information as possible. Choosing the best word embedding method often requires experimentation and can be a difficult task.
Some popular and straightforward methods for creating vector representations of words include:
- Bag of Words (count vectorization)
- TF-IDF
- Word2Vec
- GloVe
- FastText
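To make the idea of "similar words end up close together" concrete, here is a minimal Word2Vec sketch using gensim on a tiny hypothetical corpus (we won't use Word2Vec elsewhere in this blog):

from gensim.models import Word2Vec

# Toy corpus: each "document" is a list of tokens
sentences = [["king", "queen", "royal", "palace"],
             ["man", "woman", "person"],
             ["movie", "film", "actor", "plot"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=100)
print(model.wv.most_similar("king", topn=3))  # nearest neighbours of "king" in vector space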
In this blog, we will keep ourselves confined to the TF-IDF Vectorizer.
TF-IDF is short for "Term Frequency–Inverse Document Frequency". It is commonly used to transform text into a meaningful numeric-vector representation. Originally an information-retrieval method, it relies on Term Frequency (TF) and Inverse Document Frequency (IDF) to measure the importance of a word in a document.
Term Frequency (TF) tracks the occurrence of words in a document; Inverse Document Frequency (IDF) assigns a weightage to each word in the corpus. The IDF weightage is high for infrequently appearing words and low for frequent words. This allows us to detect how important a word is to a document.
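Concretely, scikit-learn's TfidfVectorizer (with its default smooth_idf=True) computes the score of a term t in a document d as:

tfidf(t, d) = tf(t, d) × idf(t),  where  idf(t) = ln((1 + n) / (1 + df(t))) + 1

Here n is the total number of documents and df(t) is the number of documents containing t; each document vector is then L2-normalized.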
Let’s implement TF-IDF on our movie reviews:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_converter = TfidfVectorizer(max_features=2000)  # keep the 2,000 most frequent terms
features = tfidf_converter.fit_transform(reviews['text']).toarray()
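A quick sanity check on the resulting feature matrix (get_feature_names_out requires scikit-learn 1.0 or newer):

print(features.shape)                                # (number of reviews, 2000)
print(tfidf_converter.get_feature_names_out()[:10])  # first few terms in the vocabulary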
We are ready to build our Sentiment Classification model, but first, we must select a supervised classification model that satisfies our requirements.
We have several algorithms for classification tasks, each with its own pros and cons. One algorithm may produce superior results compared to others but sacrifice explainability. Even when explainability is preserved, deploying a complex algorithm can be tedious. In other words, there is a trade-off between performance, model complexity, and model explainability. The ideal algorithm would be explainable, reliable, and easy to deploy, but there is no such thing as a perfect algorithm.
For example, XGBoost is a high-performance and explainable algorithm, but it is quite complex and requires substantial computational power. Logistic Regression, by contrast, is fast, simple to implement, and explainable, but its performance on non-linear datasets is considerably disappointing, and as the number of features grows it tends to become slower and less accurate.
For this blog, we will be using the Light GBM Classifier!
Light GBM is a gradient-boosting framework that, like XGBoost, utilizes tree-based learning algorithms. It is designed to be distributed and efficient, with the following benefits:
- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support for parallel, distributed, and GPU learning
- Capable of handling large-scale data
Light GBM is an excellent alternative to XGBoost as it is roughly six times faster than XGBoost without compromising performance. It can handle large datasets and requires low memory to operate.
Let’s implement Light-GBM for Sentiment Classification:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

target = reviews['label']
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.3)

clf = lgb.LGBMClassifier(max_depth=20,
                         n_estimators=25,
                         min_child_weight=0.0016,
                         n_jobs=-1)
clf.fit(x_train, y_train)

pred = clf.predict(x_test)
print("Test data Accuracy is : ", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
#############
Test data Accuracy is : 0.816916666666

Accuracy and classification report on the testing dataset
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)
# Wrap the raw matrix in a labelled DataFrame so the heatmap axes are readable
cm_matrix = pd.DataFrame(data=cm,
                         columns=['Predicted Negative', 'Predicted Positive'],
                         index=['Actual Negative', 'Actual Positive'])

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()  # confusion matrix plot
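To close the loop, here is a minimal sketch of scoring a new, hypothetical review with the fitted vectorizer and classifier from above:

# Hypothetical unseen review; reuse the fitted tfidf_converter and clf
new_review = ["This movie was absolutely wonderful, great acting and a gripping plot"]
new_features = tfidf_converter.transform(new_review).toarray()

print(clf.predict(new_features))        # e.g. [1] -> positive sentiment
print(clf.predict_proba(new_features))  # class probabilities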
Twitter allows businesses to engage personally with consumers, and real-time sentiment classification models help support and manage brands' marketing strategies. With so much data available, sentiment analysis on Twitter enables companies to understand their customers, keep track of what's being said about their brand and competitors, and discover trends in the market.
IBM is one of the few companies that uses sentiment analysis to understand employee concerns, and it is also developing programs to improve employees' likelihood of staying on the job. This helps human-resource managers figure out how workers feel about the company and where management can make changes to improve the employee experience.
Nielsen relies on sentiment analysis to discover market trends and gauge the popularity of its customers' products. Based on sentiment trends, it also provides consultation on marketing strategies and campaigns.
Sentiment analysis projects are common on beginners' resumes, so it's important to be prepared for potential questions on the topic, such as why you chose a particular vectorizer or classifier, how you cleaned and tokenized the text, and how you evaluated the resulting model.
We started with a brief introduction to sentiment analysis and why industries need it. We then applied a text preprocessing pipeline to our movie review dataset to remove redundant expressions from the text, and implemented tokenization and lemmatization to capture the context of the words used in the reviews and collapse recurring words that appear in diverse forms. Further, we performed an exploratory text analysis to understand the frequent unigrams and bigrams used in the reviews and to visualize the clusters of positive and negative words they contain.
Finally, we applied the TF-IDF vectorizer to the processed reviews, built a Light GBM model to classify them, and evaluated its performance on the testing dataset. We also looked at some industrial use cases of sentiment analysis.