Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and human languages such as English and Hindi. NLP allows computers to understand and process human language in a way similar to how humans do. It has many practical applications in the business world, including language translation, document summarization, sentiment analysis, and virtual assistants like Siri and Cortana.
Text is a form of data, but raw text is messy, and preprocessing it can be a challenging and time-consuming part of any NLP project. Proper preprocessing turns raw text into something your algorithms can work with and can greatly improve the results of your analysis. Fortunately, Python has several NLP libraries, such as NLTK, spaCy, and Gensim, that make text analysis and preprocessing easier.
Let's start with text pre-processing!
Why do we need to clean the text? Unlike humans, machines cannot make sense of unstructured text, so the text data must be cleaned before it is fed to any machine learning algorithm. To understand the concept better, let's follow a "learning by doing" strategy. In this blog, we will use the Coronavirus Tweets NLP Text Classification dataset for demonstration.
Let's start by loading the data!
import pandas as pd
tweets = pd.read_csv('Corona_NLP_train.csv')
print(tweets.head())
# This prints the first five rows of the dataframe
   UserName  ScreenName   Location     TweetAt                                      OriginalTweet           Sentiment
0      3799       48751     London  16-03-2020  @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...             Neutral
1      3800       48752         UK  16-03-2020  advice Talk to your neighbours family to excha...            Positive
2      3801       48753  Vagabonds  16-03-2020  Coronavirus Australia: Woolworths to give elde...            Positive
3      3802       48754        NaN  16-03-2020  My food stock is not the only one which is emp...            Positive
4      3803       48755        NaN  16-03-2020  Me, ready to go at supermarket during the #COV...  Extremely Negative
For this blog, we are only concerned with two columns: the unstructured text of the tweets and their sentiment labels. We can drop the other columns and rename the remaining ones for clarity.
tweets = tweets[['OriginalTweet', 'Sentiment']] #extraction
tweets.columns = ['Text', 'Sentiment'] #renaming
We will design a pre-processing pipeline, a sequence of steps that gradually cleans our unstructured text.
The first step is transforming the tweets into lowercase to maintain consistency during the NLP tasks and text mining. For example, 'Virus' and 'virus' would otherwise be treated as two different words, so we make all the words in the tweets lowercase to prevent this duplication.
tweets['Text'] = tweets['Text'].str.lower()
tweets.head()
Text Sentiment
------------------------------------------------------------------------
0 @menyrbie @phil_gahan @chrisitv https://t.co/i... Neutral
1 advice talk to your neighbours family to excha... Positive
2 coronavirus australia: woolworths to give elde... Positive
3 my food stock is not the only one which is emp... Positive
4 me, ready to go at supermarket during the #cov... Extremely Negative
Hyperlinks are very common in tweets and don't add any information useful for sentiment analysis. For other problem statements, we may need to preserve them; it depends on the task at hand. But for sentiment analysis, let's remove them!
tweets['Text'] = tweets['Text'].str.replace(r"http\S+", "", regex=True)
tweets.head()
Text Sentiment
------------------------------------------------------------------------
0 @menyrbie @phil_gahan @chrisitv and and Neutral
1 advice talk to your neighbours family to excha... Positive
2 coronavirus australia: woolworths to give elde... Positive
3 my food stock is not the only one which is emp... Positive
4 me, ready to go at supermarket during the #cov... Extremely Negative
For most NLP problems, punctuation does not carry additional linguistic information, and it is not crucial for sentiment analysis either, so removing it before text modeling is recommended. The regex below keeps only letters and digits, which also strips symbols such as '@', '#', and '_'.
tweets['Text'] = tweets['Text'].str.replace('[^A-Za-z0-9]+',' ', regex=True)
tweets.head()
Text Sentiment
------------------------------------------------------------------------
0 menyrbie phil gahan chrisitv and and Neutral
1 advice talk to your neighbours family to excha... Positive
2 coronavirus australia woolworths to give elder... Positive
3 my food stock is not the only one which is emp... Positive
4 me ready to go at supermarket during the covid... Extremely Negative
Stopwords are common English words, such as "the", "he", and "have", that do not add much meaning to a sentence and can be safely removed without changing its meaning. They are also among the most frequently appearing words in any paragraph.
Let's remove the stopwords from the text.
import nltk
from nltk.corpus import stopwords
## The NLTK library provides a list of stop words for English
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tweets['Text'] = tweets['Text'].apply(lambda text: ' '.join(word for word in text.split() if word not in stop_words))
print(tweets.head())
Text Sentiment
------------------------------------------------------------------------
0 menyrbie phil gahan chrisitv Neutral
1 advice talk neighbours family exchange phone n... Positive
2 coronavirus australia woolworths give elderly Positive
3 food stock one empty please panic enough food Positive
4 ready go supermarket covid19 outbreak paranoid... Extremely Negative
These days, text editors are smart enough to correct your documents, but spelling mistakes are still widespread in text data, and tweets are no exception. Fortunately, misspelled words can be handled with the help of the textblob library.
from textblob import TextBlob
tweets['Text'] = tweets['Text'].apply(lambda x: str(TextBlob(x).correct()))
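To see what this step does, here is a minimal, standalone sketch on a made-up sentence (our own illustration, not from the dataset). Be aware that running correct() over every tweet in a large corpus can take a long time.
from textblob import TextBlob
# A deliberately misspelled sample sentence (illustration only)
sample = TextBlob("I havv goood speling in my twets")
print(sample.correct())  # TextBlob replaces each misspelled word with its most likely correction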
Tokenization breaks sentences down into words and paragraphs into sentences. These pieces are called tokens (word tokens or sentence tokens), and they help in understanding the context and building a vocabulary. Tokenization typically separates words on spaces and punctuation.
Text:   "The cat sat on the mat."
Tokens: "The", "cat", "sat", "on", "the", "mat", "."
import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize
word_data = "Enjoyalgorithms is a nice platform for computer science education."
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)
# Output
['Enjoyalgorithms', 'is', 'a', 'nice', 'platform', 'for', 'computer', 'science', 'education', '.']
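Tokenization also works at the sentence level. As a quick illustration (the paragraph below is our own sample, not from the dataset), NLTK's sent_tokenize splits text into sentence tokens:
paragraph = "Tokenization is simple. It splits text into smaller pieces. Those pieces are called tokens."
print(nltk.sent_tokenize(paragraph))
# Output
['Tokenization is simple.', 'It splits text into smaller pieces.', 'Those pieces are called tokens.']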
Stemming and lemmatization are commonly used in search engines, keyword extraction, grouping of similar words, and other NLP tasks.
Both processes aim to reduce a word to a common base or root form. However, they follow very different approaches: stemming chops off word endings using simple rules, while lemmatization uses a vocabulary and morphological analysis to return a valid dictionary word (the lemma).
Note: Lemmatization is almost always preferred over stemming unless we need very fast execution on a massive corpus of text data.
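To make the difference concrete, here is a small sketch (the word list is our own illustration) comparing NLTK's PorterStemmer with the WordNetLemmatizer:
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studies", "leaves"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))
# The stemmer crudely chops suffixes ('studies' -> 'studi'), while the
# lemmatizer maps words to valid dictionary forms ('studies' -> 'study').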
Applying tokenization and Lemmatization to tweets:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
tweets['lemmatized_tokens'] = tweets['Text'].apply(lemmatize_text)
tweets.head()
Text Sentiment lemmatized_tokens
-----------------------------------------------------------------------------------------------------------------------------------------
0 menyrbie phil gahan chrisitv Neutral [menyrbie, phil, gahan, chrisitv]
1 advice talk neighbours family exchange phone n... Positive [advice, talk, neighbour, family, exchange, ph...
2 coronavirus australia woolworths give elderly ... Positive [coronavirus, australia, woolworth, give, elde...
3 food stock one empty please panic enough food Positive [food, stock, one, empty, please, panic, enoug...
4 ready go supermarket covid19 outbreak paranoid... Extremely Negative [ready, go, supermarket, covid19, outbreak, pa...
With this, we covered the text pre-processing section, and now we're ready to draw insights from our data.
Let's start by analyzing the text length for different sentiments. We create a new column containing the number of words in each tweet.
tweets['word_length'] = tweets['Text'].str.split().str.len()
tweets.head()
Text Sentiment word_length lemmatized_tokens
-----------------------------------------------------------------------------------------------------------------------------------------
0 menyrbie phil gahan chrisitv Neutral 4 [menyrbie, phil, gahan, chrisitv]
1 advice talk neighbours family exchange phone n... Positive 27 [advice, talk, neighbour, family, exchange, ph...
2 coronavirus australia woolworths give elderly ... Positive 13 [coronavirus, australia, woolworth, give, elde...
3 food stock one empty please panic enough food Positive 23 [food, stock, one, empty, please, panic, enoug...
4 ready go supermarket covid19 outbreak paranoid... Extremely Negative 21 [ready, go, supermarket, covid19, outbreak, pa...
Our objective is to explore the distribution of the tweet length for different sentiments.
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
plt.figure(figsize=(15,7))
cmap = ["red", "green", "blue"]
labels = ["Neutral", "Positive", "Negative"]
for label, clr in zip(labels, cmap):
    sns.kdeplot(tweets.loc[tweets['Sentiment'] == label, 'word_length'], color=clr, fill=True, label=label)
plt.xlabel('Text Length')
plt.ylabel('Density')
plt.legend()
plt.show()
From the above distribution plot, one can conclude that Neutral tweets have a shorter average text length than Positive and Negative tweets.
We are also interested in the words that appear most frequently across the tweets now that the stopwords are gone. Let's find those words!
import itertools
import collections
import pandas as pd
import matplotlib.pyplot as plt
lemmatized_tokens = list(tweets["lemmatized_tokens"])
token_list = list(itertools.chain(*lemmatized_tokens))
counts_no = collections.Counter(token_list)
clean_tweets = pd.DataFrame(counts_no.most_common(30), columns=['words', 'count'])
fig, ax = plt.subplots(figsize=(8, 8))
clean_tweets.sort_values(by='count').plot.barh(x='words', y='count', ax=ax, color="blue")
ax.set_title("Most Frequently used words in Tweets")
plt.show()
Since our tweets belong to the pandemic timeline, the resultant word frequency plot is pretty intuitive.
A word cloud is a cluster of words displayed in different sizes: the bigger and bolder a word appears, the more often it occurs in the text and the more important it is.
from wordcloud import WordCloud
all_words = ' '.join(token_list)  # combine all lemmatized tokens into a single string
wordcloud = WordCloud(width=1200, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(all_words)
plt.figure(figsize=(15, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
We highly recommend the scattertext library for visualizing sentimental words. In the resulting plot, one can look for the words that characterize each sentiment. It arranges words by their frequency in the documents and, at the same time, clusters them with their corresponding sentiment. In our case, red dots represent the cluster of positive words, blue dots the cluster of negative words, and words in the yellow cluster are close to neutral.
import scattertext as st
import spacy
from IPython.display import IFrame
# Keep only the two extreme classes for a clearer positive vs. negative contrast
tweets = tweets.loc[tweets['Sentiment'].isin(['Extremely Negative', 'Extremely Positive'])].iloc[:10000, :].copy()
# scattertext needs parsed documents, so parse the tweets with spaCy first
nlp = spacy.load('en_core_web_sm')
tweets['parsed'] = tweets['Text'].apply(nlp)
corpus = st.CorpusFromParsedDocuments(tweets, category_col='Sentiment', parsed_col='parsed').build()
html = st.produce_scattertext_explorer(corpus,
                                       category='Extremely Negative',
                                       category_name='Negative',
                                       not_category_name='Positive',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)
file_name = 'Sentimental Words Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width=1000, height=700)
That's enough for this first part on text pre-processing. We have cleaned and visualized our data, but computers do not understand English; they only understand numbers. So there must be a way to convert this cleaned text into a machine-readable numeric format, and that's where word embeddings come into the picture. We will learn about word embeddings in the next part.
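As a small preview of what a machine-readable format looks like, here is a minimal sketch using scikit-learn's CountVectorizer on a toy corpus (a simple bag-of-words count for illustration only; the next part covers proper word vector encodings):
from sklearn.feature_extraction.text import CountVectorizer
# Toy corpus for illustration, not taken from the tweet dataset
docs = ["food stock empty", "ready go supermarket", "food supermarket stock"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # each row represents one document as word counts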
Text data preprocessing is a crucial topic in NLP. If you are applying for positions as an NLP engineer or data scientist, interviewers often ask about these preprocessing steps, so make sure you understand each of them well.
In this article, we focused on one of the most essential steps in natural language processing: text preprocessing. To give you practical experience, we applied the text cleaning steps one by one in Python to a dataset of tweets for Covid-19 sentiment analysis. By visualizing the hidden trends and significant words in the dataset, we demonstrated the importance of text preprocessing. We hope you found this article enjoyable and informative.
Next Blog: Word vector encoding