The expansion of the internet and the easy availability of internet-enabled smartphones have led to a boom in the content available online. Although this technological shift has brought many positives, finding the relevant information has become a pain point. This is where topic modelling shines.
Consider a case of 30 news articles where 5 focus on cricket, 4 on football, 3 on hockey, and the remaining articles focus on laptops and mobiles. Topic modelling helps classify the articles focusing on cricket, football and hockey under sports and the remaining under technology.
In this blog, we explore and compare two techniques for topic modelling: Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).
Events happen daily, and most of them are reported by news agencies and the general public. These reports cover the Ukraine war, financial crises, elections, ecological disasters, and more. The amount of data added every day about such events is staggering, and it is almost impossible to classify these events into different topics manually. That's where topic modelling comes into the picture.
Topic modelling is a type of statistical modelling that groups similar textual content together. It is currently a hot topic in NLP and is in great demand, especially given the increasing amount of unstructured data.
In this blog, we will discuss how LDA, a basic algorithm used for topic modelling, works. We then run it on the 'A Million News Headlines' dataset and categorize the headlines into different topics.
LDA and LSA are algorithms that help us discover topics. Let's learn more about how they work, but before doing so, let's define a few basic terms to set the context for further discussion.
The content available these days is mostly unlabelled, which justifies using unsupervised techniques. LDA and LSA are unsupervised learning methods, making them suitable for the task.
Please note that the number of topics is a required parameter in both algorithms. It is the number of categories into which the content is expected to be classified. In this blog, we set the total number of topics to 10.
The LDA algorithm exploits word frequencies in documents to generate topics and is represented as the black box in the diagram above. Let's use an example, under a few assumptions, to understand this black box. The assumptions are:
Doc i: GT won the IPL Cup in 2022.
Step 1: A random topic for each word will be assigned:
Word/Token | GT | won | the | IPL | cup | in | 2022 |
-----------------------------------------------------
Topic      | 1  | 4   | 1   | 2   | 3   | 4  | 1    |
Step 2: The count of topics per document is prepared:
Topic | Topic 1 | Topic 2 | Topic 3 | Topic 4 |
-----------------------------------------------
Count | 3       | 1       | 1       | 2       |
Step 3: Across all the documents, the frequency of every topic for each unique word is calculated.
Words | Topic 1 | Topic 2 | Topic 3 | Topic 4 |
------------------------------------------------------
GT | 5 | 3 | 2 | 9 |
won | 3 | 7 | 4 | 14 |
the | 6 | 8 | 14 | 18 |
IPL | 8 | 4 | 2 | 27 |
cup | 5 | 9 | 19 | 6 |
in | 10 | 12 | 9 | 7 |
2022 | 13 | 15 | 4 | 2 |
Step 4: A random word is picked, and the topic assigned to that word is reset, one document at a time, in every document where the word appears. In this example, let's select 'IPL'; it will temporarily have no topic in the i-th document. Correspondingly, the counts from Steps 2 and 3 will also change.
Step 5: A new topic must be assigned to the word 'IPL'. It is chosen based on the score
Overall score = metricA * metricB
where metricA is the count of the candidate topic in document i (Step 2) and metricB is the count of the word 'IPL' under that topic across all documents (Step 3). This score is calculated for every topic for document i, and the topic with the maximum score is assigned to the word 'IPL', which here is topic 4:
Overall score for topic 4 = 2 * 27 = 54
Step 6: Steps 4 and 5 are repeated for every unique word in the corpus.
Step 7: Step 6 is repeated a fixed number of times, after which the topic assignments stabilize.
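To make Steps 4 and 5 concrete, here is a toy sketch of the scoring for the word 'IPL'. The dictionaries below are illustrative numbers taken from the tables above (with the counts for IPL's old topic already decremented), not objects from any library.

# Toy illustration of Steps 4 and 5 for the word 'IPL' in document i.
# After removing IPL's current assignment (topic 2), the counts become:
doc_topic_counts = {1: 3, 2: 0, 3: 1, 4: 2}    # topic counts in document i (Step 2)
ipl_topic_counts = {1: 8, 2: 3, 3: 2, 4: 27}   # counts of 'IPL' per topic across all documents (Step 3)

# Overall score = metricA * metricB for every topic
scores = {t: doc_topic_counts[t] * ipl_topic_counts[t] for t in doc_topic_counts}
new_topic = max(scores, key=scores.get)

print(scores)      # {1: 24, 2: 0, 3: 2, 4: 54}
print(new_topic)   # 4 -> 'IPL' is re-assigned to topic 4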
LSA is a method that helps us convert unstructured data into a structured form. The diagram below describes the LSA process.
Let's look into some of the key components mentioned above →
With the basics of LDA and LSA covered, let's start implementing them. The basic steps we will follow are:
The dataset we chose for this blog is "A Million News Headlines". It contains news headlines published over the past nineteen years by the Australian Broadcasting Corporation (ABC), a reputable Australian news source. A sample from the data is shown below.
This dataset contains the following fields →
The file "abcnews-date-text.csv" can be downloaded from the website and read in a DataFrame using the read_csv function from the Pandas module. One can read more about Pandas here.
import pandas as pd
raw_data = pd.read_csv("abcnews-date-text.csv")
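A quick peek at the loaded DataFrame confirms its size and columns (on the version of the dataset we use, the columns are expected to be publish_date and headline_text):

# Inspect the loaded headlines; publish_date is an integer in YYYYMMDD format
print(raw_data.shape)
print(raw_data.head())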
The site from which we obtained the data claims that the dataset contains every news article published on the ABC website, at a rate of more than 200 articles per day. Let's verify this through EDA in the next section.
The dataset should generally be pre-processed before it is used for any downstream tasks. Some of the common pre-processing steps are as follows →
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as stopwords

## here x1 refers to a single news headline; keep only the non-stopword tokens
x1 = ' '.join([w for w in x1.split() if w not in stopwords])
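Putting the pre-processing steps together, here is a minimal sketch of a cleaning function applied to the whole dataset. It assumes the headline column is named headline_text and stores the cleaned text in a new, hypothetical clean_headline column.

import re

from spacy.lang.en.stop_words import STOP_WORDS as stopwords

def preprocess(headline: str) -> str:
    """Lowercase, keep only letters, and drop stopwords."""
    headline = headline.lower()
    headline = re.sub(r"[^a-z\s]", " ", headline)
    return " ".join(w for w in headline.split() if w not in stopwords)

# Apply the cleaning function to every headline in the DataFrame
raw_data["clean_headline"] = raw_data["headline_text"].apply(preprocess)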
Pre-processing is now complete. Let's perform an EDA on critical characteristics to understand the data better.
If we do not remove the stopwords, the top words consist mostly of stopwords like 'the', 'is', 'are', and 'an', as they are the most frequently used words in any sentence. After their removal, the top words in the corpus change drastically, as shown below.
The top words are 'police', 'man', 'govt', etc. At a high level, these words hint at the topics to which the respective news items can belong. For example, 'police' can be related to law and order, crime investigation, etc., but has a very small chance of being connected to finance.
Previously we noted that ABC published around 200 news items daily. Let's verify this. A plot of the daily news count against its date will serve the purpose, so it is shown below.
We can visually infer from the plot above that the figure of 200 is an average over the roughly nineteen years of data. Hence, we can confidently confirm that approximately 200 news items were published on the ABC website daily.
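For reference, a minimal sketch of how such a daily-count plot can be produced from the raw DataFrame (assuming the publish_date column holds integers in YYYYMMDD format):

import matplotlib.pyplot as plt
import pandas as pd

# Count the number of headlines published on each date
daily_counts = raw_data.groupby("publish_date").size()
daily_counts.index = pd.to_datetime(daily_counts.index, format="%Y%m%d")

daily_counts.plot(figsize=(12, 4), title="Headlines published per day")
plt.axhline(y=200, color="red", linestyle="--", label="200 per day")
plt.legend()
plt.show()

print("Average headlines per day:", round(daily_counts.mean()))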
The number of words per headline is important, as both LDA and LSA depend on words to infer the topic. Extremely low or high word counts per headline will not be informative, so we need to visualize the distribution of words per headline.
The bar graph above shows that 7 is the most likely number of words in a news headline. It is also evident that extreme word counts occur very rarely, confirming our intuition.
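A minimal sketch of how this word-count distribution can be computed, assuming the cleaned headlines sit in the hypothetical clean_headline column from the pre-processing sketch:

import matplotlib.pyplot as plt

# Number of words in each pre-processed headline
word_counts = raw_data["clean_headline"].str.split().str.len()

# Bar chart of how many headlines have 1, 2, 3, ... words
word_counts.value_counts().sort_index().plot(kind="bar", figsize=(10, 4))
plt.xlabel("Words per headline")
plt.ylabel("Number of headlines")
plt.show()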
The EDA has been completed. Let's build a model for LDA and LSA now.
Please note the following before we start building the models →
As mentioned in point 1, the text needs to be vectorized using the CountVectorizer class from the sklearn library. It converts a collection of text documents to a matrix of token counts. More details can be found on the official page.
from sklearn.feature_extraction.text import CountVectorizer

# Build the document-term matrix for the mini corpus of headlines
vectorizer = CountVectorizer()
sample_document_term_matrix = vectorizer.fit_transform(small_text_sample)

print('Headline before vectorization: {}'.format(small_text_sample[5]))
print('Headline after vectorization: \n{}'.format(sample_document_term_matrix[5]))
## Output
Headline before vectorization: council considers hospital development plan
Headline after vectorization --> (index of occurrence in the matrix), token count
(0, 2665) 1
(0, 2521) 1
(0, 5258) 1
(0, 3153) 1
(0, 8042) 1
Please note that the vectorized text is stored as a sparse matrix (most entries are zero) because of the large unique vocabulary. Each non-zero element in the matrix represents the count of some token. The dataset small_text_sample is a mini corpus formed by randomly selecting 10,000 news headlines from our main dataset.
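One possible way to build such a mini corpus (clean_headline is the hypothetical column introduced in the pre-processing sketch):

# Randomly pick 10,000 cleaned headlines to keep the vectorization fast
small_text_sample = (
    raw_data["clean_headline"]
    .sample(n=10000, random_state=0)
    .tolist()
)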
We have already discussed the theoretical workings of LSA at the start of this blog. Now let's build the LSA model using the sklearn library in Python. The input to this step is the document-term matrix constructed in the previous section.
from sklearn.decomposition import TruncatedSVD

# Initialising the model
lsa_model = TruncatedSVD(n_components=10)
# Fit the model on the document-term matrix
lsa_topic_matrix = lsa_model.fit_transform(sample_document_term_matrix)
Truncated SVD is an sklearn method used for dimensionality reduction and, unlike PCA, it works well with sparse matrices. Hence, it is well suited to the term-count/TF-IDF matrices returned by the vectorizers in sklearn.feature_extraction.text.
Applying truncated SVD to a document-term matrix is known as LSA, which is precisely what we are doing here.
In the code snippet above, n_components refers to the number of topics into which the text can be classified, which we have already fixed at 10.
lsa_topic_matrix, as mentioned, contains a vector of length 10 for each document: a score for each topic. Selecting the topic with the maximum score from this vector gives us the predicted topic for that document.
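For example, the predicted topic for each document can be read off with a simple argmax (lsa_topics is just an illustrative variable name):

import numpy as np

# Index of the highest-scoring topic for every document
lsa_topics = np.argmax(lsa_topic_matrix, axis=1)
print(lsa_topics[:10])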
These topic vectors are also an excellent topic-wise representation of the actual data in the documents. To visualize the clusters formed by LSA and compare cluster quality, we represent each vector in a 2-d space using t-SNE. t-SNE is an unsupervised dimensionality reduction technique used for exploring high-dimensional data. More about it can be read in the t-SNE blog.
from sklearn.manifold import TSNE
tsne_lsa_model = TSNE(n_components=2, perplexity=50, learning_rate=100,
n_iter=2000, verbose=1, random_state=0, angle=0.75)
tsne_lsa_vectors = tsne_lsa_model.fit_transform(lsa_topic_matrix)
Let's see what each component signifies in the above t-SNE algorithm →
Each colour in the graph above represents a topic. It is visually evident that the demarcations are not clear.
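A minimal sketch of how such a scatter plot can be drawn, colouring each point by its predicted LSA topic (lsa_topics comes from the argmax sketch above):

import matplotlib.pyplot as plt

# 2-d t-SNE vectors, coloured by the LSA-predicted topic of each headline
plt.figure(figsize=(8, 6))
plt.scatter(tsne_lsa_vectors[:, 0], tsne_lsa_vectors[:, 1],
            c=lsa_topics, cmap="tab10", s=5)
plt.title("t-SNE projection of the LSA topic vectors")
plt.show()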
Let's now use LDA to visualize and compare its results with LSA.
from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=10, learning_method='online',
                                      random_state=0, verbose=0)
lda_topic_matrix = lda_model.fit_transform(sample_document_term_matrix)
We discussed the theoretical workings of LDA at the start of this blog. The snippet above builds the LDA model using the sklearn library in Python; its input is the same document-term matrix constructed in the vectorization step.
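A common way to sanity-check the fitted model is to look at the top words per topic. Here is a minimal sketch using the vectorizer fitted earlier (get_feature_names_out requires a recent sklearn version; older versions use get_feature_names):

import numpy as np

# Top 10 words for each of the 10 LDA topics
words = np.array(vectorizer.get_feature_names_out())
for topic_idx, weights in enumerate(lda_model.components_):
    top_words = words[np.argsort(weights)[::-1][:10]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")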
Let's understand the various components utilized here →
Here, the t-SNE vector clusters for the obtained lda_topic_matrix are plotted in the same way as we plotted them earlier for the lsa_topic_matrix.
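For reference, the corresponding transformation simply reuses the t-SNE settings from the LSA section:

from sklearn.manifold import TSNE

# Same settings as before, applied to the LDA topic matrix
tsne_lda_model = TSNE(n_components=2, perplexity=50, learning_rate=100,
                      n_iter=2000, verbose=1, random_state=0, angle=0.75)
tsne_lda_vectors = tsne_lda_model.fit_transform(lda_topic_matrix)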
Here it is visible that the topics are distinguishable and well-demarcated, unlike the haphazard demarcation in LSA, thus proving that LDA performs better than LSA for topic modelling.
Let's also construct a PCA diagram for LDA to analyze the results and compare them with those obtained using t-SNE. PCA stands for Principal Component Analysis and is a prevalent method to analyze data with high dimensional features. The code is shown below.
from sklearn.decomposition import PCA

# Project the 10-dimensional LDA topic vectors onto 2 principal components
pca = PCA(n_components=2)
pca.fit(lda_topic_matrix)
x_pca_lda = pca.transform(lda_topic_matrix)
The PCA clusters obtained for lda_topic_matrix are shown below.
It is visible from the PCA diagram above that the demarcation between the categories is less distinguishable than in the t-SNE diagram.
This can be attributed mainly to two factors →
More details about the above differences can be read in this blog.
This company helps its customers' sales representatives increase their chances of winning a deal by analyzing various revenue-related risks and by analyzing the calls between the sales representatives and the buyer. The latter part, analyzing the calls, falls under Conversational Intelligence (CI).
Topic modelling plays a vital role here. Consider a call between an Aviso sales rep and a buyer P1. If someone from higher management wants to review the call but only wants to know what was discussed about a particular topic, such as cost, topic modelling comes in very handy: they can directly check all the data under the topic "cost" and quickly get the information they need.
Considering the boom in daily data, topic modelling has become a vital part of today's NLP industry for extracting relevant and meaningful insights from raw data. Most of these datasets are unlabelled and require unsupervised learning methods to assign topics. In this blog, we developed topic models using two unsupervised learning algorithms, LSA and LDA, discussed them in detail, implemented them in Python on a real dataset, and compared their performance. We hope you enjoyed the article.