Training a Machine Learning model involves refining the data and selecting valuable features from the available dataset. However, if we treat every feature as a separate dimension to visualize, we can only picture up to three dimensions, which becomes a challenge when a dataset has more than three features. To overcome this, we need dimensionality reduction techniques that compress the features for better data visualization. One popular method is t-SNE, which we will discuss in this blog.
After going through this article, we will be able to understand the following things:
So let’s start without further delay and try to find the answer to our first question.
Let’s look at the official definition of t-SNE and then unpack the constituent elements present in this definition.
t-distributed Stochastic Neighbor Embedding, popularly known as the t-SNE algorithm, is an unsupervised, non-linear dimensionality reduction technique used for exploring high-dimensional data.
Now let’s understand the terms one-by-one to know t-SNE completely.
In simple terms, t-SNE tries to make the probability distribution over data samples in the lower dimension match the probability distribution over the same data samples in the higher dimension.
Considering we may have many feature dimensions, how will we visualize them? Imagining 2 or 3 dimensions is easy, but we have no direct way to visualize more than three; the difficulties that come with many dimensions are often called the Curse of Dimensionality. Moreover, a higher number of feature dimensions increases training time. Hence, algorithms like PCA, t-SNE, and some autoencoders help us reduce the number of dimensions while keeping as much of the information carried by the original features as possible. This process of reducing the number of dimensions is known as dimensionality reduction.
One common question we might be wondering about: PCA is already a prevalent technique in machine learning, so why do we even need t-SNE, and how do we decide which algorithm should be preferred?
PCA and t-SNE are two popular dimensionality reduction techniques for visualizing high-dimensional data. While PCA is a linear method, t-SNE is a non-linear method.
Linear vs. Non-Linear Dimension Reduction Methods
Linear methods apply linear transformations to the data samples, such as scaling, projection, and shifting. However, when the goal is to preserve the similarity between samples in lower dimensions, linear techniques may fail to do so.
Preserving Similarity in t-SNE
In t-SNE, similar data points are represented close together: points with similar characteristics stay together whether we look at the higher or the lower dimension. This is achieved using non-linear transformations, which makes it easier to preserve the similarity between samples in lower dimensions.
PCA’s Aim vs. t-SNE’s Aim
On the other hand, PCA aims to keep the dimensions of maximum variance, retaining as much overall information as possible rather than preserving the similarity between neighbors.
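To make the difference concrete, here is a minimal sketch (not from the original article, and assuming scikit-learn and matplotlib are available) that reduces a 3D S-curve to 2D with both PCA and t-SNE. The linear projection tends to overlap distant parts of the curve, while t-SNE tends to keep nearby points together.

import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, color = make_s_curve(n_samples=500, noise=0.05, random_state=0)

pca_2d = PCA(n_components=2).fit_transform(X)                     # linear projection
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(X)   # non-linear embedding

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=color)
axes[0].set_title("PCA (linear)")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=color)
axes[1].set_title("t-SNE (non-linear)")
plt.show()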
Sophistication
t-SNE is a much more recent and sophisticated technique than PCA: it was introduced in 2008, whereas PCA is an old algorithm dating from 1933. So let's understand how t-SNE works.
We will see an example of dimensionality reduction from 2D to 1D using t-SNE. As shown in the figure below, there are 3 clusters (a cluster is defined as a set of points having similar characteristics) of 2D points.
Dimensionality reduction is more than just projecting data onto a single axis, because a naive projection loses information. For example, take the three points (23, 31), (47, 31.5), (24, 32) and their corresponding projections onto the y-axis: (0, 31), (0, 31.5), (0, 32). In 1D, these three points now appear near each other, even though in 2D the second point is far from the other two. t-SNE avoids such a situation and ensures that the relationships between points that are evident in the higher dimension are preserved in the lower dimension.
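As a quick illustration of this loss of information, here is a small sketch (using the three hypothetical points from the example above) that compares pairwise distances before and after the naive projection onto the y-axis.

import numpy as np

points_2d = np.array([[23.0, 31.0], [47.0, 31.5], [24.0, 32.0]])
points_1d = points_2d[:, [1]]            # naive projection: keep only the y-coordinate

def pairwise_distances(pts):
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

print(pairwise_distances(points_2d))     # (23, 31) and (47, 31.5) are ~24 units apart
print(pairwise_distances(points_1d))     # after projection they are only 0.5 apart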
Let’s break it down into steps to simplify and visualize the process thoroughly.
We plot all the points randomly on a 1D line, as shown in the image above. Next, we have to move them step by step so that points with similar characteristics gather together, whereas points with different characteristics move far apart. The points are moved based on a similarity measure that we will calculate.
The similarity between two points is defined in terms of a probability distribution: the similarity of point A to point B is the conditional probability that A would pick B as its neighbor.
Suppose we choose a point and build a normal (Gaussian) curve centered on that point (the black dot in the above figure). In that case, the height at which every other point lands when projected onto the curve reflects its similarity, i.e., the more similar a point is, the closer it lies to the center and the higher its value on the y-axis. Here, the point with a score of 0.24 is the closest in similarity, so it lies near the central point.
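A rough sketch of that idea, using a handful of hypothetical 1D positions for simplicity and a fixed bandwidth (t-SNE actually tunes the bandwidth per point, as discussed next), might look like this:

import numpy as np

x = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 9.0])     # hypothetical positions
centre = x[0]                                     # the chosen (black-dot) point
sigma = 1.0                                       # fixed bandwidth for this sketch

raw_similarity = np.exp(-(x - centre) ** 2 / (2 * sigma ** 2))
raw_similarity[0] = 0.0                           # a point is not its own neighbour
print(np.round(raw_similarity, 3))                # closer points get higher scores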
Perplexity describes the density of points around a particular point. If 4 points with similar characteristics are densely clustered, they will have a higher perplexity than points that are not. Points with less density around them produce flatter normal curves than the curves formed for points with more density. In the figure below, the purple points are the sparse ones.
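For reference, the original t-SNE paper defines the perplexity of a point's similarity distribution as 2 raised to its Shannon entropy, which can be read as the effective number of neighbors. A tiny sketch with made-up probabilities:

import numpy as np

def perplexity(p):
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()    # Shannon entropy in bits
    return 2.0 ** entropy

print(perplexity(np.array([0.25, 0.25, 0.25, 0.25])))   # 4 equally likely neighbours -> 4.0
print(perplexity(np.array([0.97, 0.01, 0.01, 0.01])))   # one dominant neighbour -> close to 1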
Suppose we take 4 points from the denser plot and 4 points from the less dense plot with the same relative spacing (the less dense group's distances are double those of the dense group). In simple terms, we have two sets, and the average distance in the denser group is half the average distance in the less dense group. Seen separately, these two sets of 4 points look the same and deserve the same treatment. But with the existing approach, that will not be the case, because the y-values, i.e., the similarities, will differ.
So we scale the similarity values so that they sum to one for each curve. This is done by dividing the similarity of one point to another by the sum of all similarity scores obtained from that curve: Scaled Score = Score / (Sum of all scores).
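A minimal sketch of this scaling, using made-up raw scores:

import numpy as np

raw_scores = np.array([0.24, 0.05, 0.01])          # hypothetical similarity scores for one point
scaled_scores = raw_scores / raw_scores.sum()      # Scaled Score = Score / (Sum of all scores)
print(scaled_scores, scaled_scores.sum())          # the scaled scores now sum to 1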
Please note that the similarity score of point A with respect to point B will generally not be the same as that of point B with respect to point A. This is because the density of points around the two points may differ, giving two different curves, and hence the values can differ. In these cases, t-SNE takes the average of the two scores. The above image shows that the curve for the blue point varies from the curve for the yellow point.
We end up with a similarity matrix in which points with high similarity scores belong to the same cluster. Please note that the matrix also defines the similarity of each point to itself (the green squares); since that is not meaningful, t-SNE sets these similarity scores to 0.
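Putting these pieces together, here is a rough sketch (with hypothetical 2D points and a fixed bandwidth, whereas t-SNE tunes the bandwidth per point) of building such a similarity matrix: Gaussian scores, scaling each point's scores to sum to one, a zero diagonal, and averaging the two directional scores.

import numpy as np

X = np.array([[23.0, 31.0], [24.0, 32.0], [45.0, 30.0], [47.0, 31.5]])  # hypothetical 2D points
sigma = 5.0                                    # fixed bandwidth for this sketch

sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
P = np.exp(-sq_dist / (2 * sigma ** 2))
np.fill_diagonal(P, 0.0)                       # similarity of a point to itself is set to 0
P = P / P.sum(axis=1, keepdims=True)           # each point's scores sum to 1
P = (P + P.T) / 2                              # average A->B and B->A so the matrix is symmetric
print(np.round(P, 3))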
In the above image, we plotted the similarity matrix for the 2D data samples; the 1D data samples were randomly initialized in the first step. The objective of the t-SNE algorithm is to move the points in 1D left or right so that the similarity matrix for the 1D data starts looking like the 2D similarity matrix.
According to the similarity score matrix, the orange point is currently moving to the right when, ideally, it should move left. This is because the blue point has a higher perplexity score: even though attraction acts in both directions, the perplexity values determine which point dominates. The image below illustrates the resulting point positions.
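The movement itself comes from the gradient of t-SNE's cost function (introduced formally later in this article). A compact sketch with toy inputs shows how each 1D point gets nudged left or right: points whose high-dimensional similarity exceeds their current low-dimensional similarity are pulled together, and otherwise pushed apart.

import numpy as np

def tsne_gradient(y, P):
    """y: (n,) 1D embedding; P: (n, n) symmetric high-dimensional similarities."""
    diff = y[:, None] - y[None, :]                    # y_i - y_j
    inv = 1.0 / (1.0 + diff ** 2)                     # Student-t kernel
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                               # low-dimensional similarities
    return 4.0 * ((P - Q) * inv * diff).sum(axis=1)   # standard t-SNE gradient dC/dy_i

rng = np.random.default_rng(0)
P = np.full((4, 4), 1.0 / 12.0)                       # made-up high-dimensional similarities
np.fill_diagonal(P, 0.0)
y = rng.normal(size=4)                                # random 1D starting positions
y = y - 0.1 * tsne_gradient(y, P)                     # one step: each point moves left or right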
That's enough theory. Let's implement the t-SNE algorithm on the MNIST dataset using Python.
# Importing Necessary Modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import seaborn as sn
The dataset used for this project is the MNIST dataset, an acronym for the Modified National Institute of Standards and Technology dataset. It contains 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9. That gives 60,000 data points with 784 (28 × 28) pixel features each, plus a label column that we separate from the features.
# Reading the data using pandas
df = pd.read_csv('mnist_train.csv')
print(df.head(4))
labels = df['label']
# Drop the label feature and store the pixel data in dataframe
data = df.drop("label", axis = 1)
Here we standardize the data so that no feature is given extra weight simply because of its scale: after standardization, values that were larger to begin with will not affect our model disproportionately. For this, we use sklearn's StandardScaler. We also keep only the first 1,000 data points to demonstrate the effect of our model quickly.
standardized_data = StandardScaler().fit_transform(data)
print(standardized_data.shape)
data_1000 = standardized_data[0:1000, :]
labels_1000 = labels[0:1000]
Here we train the model with sklearn's default settings on those top 1,000 data points.
model = TSNE(n_components = 2, random_state = 0)
# configuring the parameters
# the number of components = 2
# default perplexity = 30
# default learning rate = 200
# default Maximum number of iterations
# for the optimization = 1000
tsne_data = model.fit_transform(data_1000)
Here, we draw a 2D scatter plot using matplotlib, since our t-SNE embedding has two components: the features are reduced from 784 to 2. We can see that similar points are grouped together.
# Attach the labels to the 2D embedding for plotting
tsne_data = np.vstack((tsne_data.T, labels_1000)).T
tsne_df = pd.DataFrame(data = tsne_data,
                       columns = ("Dim_1", "Dim_2", "label"))

# Plotting the result of t-SNE
# (recent seaborn versions use height instead of the old size argument)
sn.FacetGrid(tsne_df, hue = "label", height = 6).map(
    plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
The resulting plot is shown below. The labels are the digits 0 to 9, and we can see many of these digit clusters.
Now we will see the mathematical theory of how the t-SNE algorithm works. The formula for the normal distribution is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
The above formula is written with respect to the mean. But in t-SNE, we calculate it with respect to every point, not the mean. So if we replace the mean with a point $x_i$ and scale by the sum over the remaining points, we get the following formula for the similarity of point $x_j$ to point $x_i$:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
Next, we need to define a matrix of dimension n_samples × n_components and fill it with random values. In our case, we have 12 samples and the number of components is 1 (the final dimension we are reducing to), so the matrix is 12 × 1, filled with random values. For the similarities between these low-dimensional points, we use the following formula, based on the Student t-distribution:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
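A small sketch of this step, assuming 12 randomly initialized 1D values:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(12, 1))                  # 12 samples x 1 component, filled randomly

sq_dist = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
kernel = 1.0 / (1.0 + sq_dist)                # (1 + ||y_i - y_j||^2)^(-1)
np.fill_diagonal(kernel, 0.0)
Q = kernel / kernel.sum()                     # q_ij: all entries sum to 1
print(Q.shape, round(Q.sum(), 6))             # (12, 12) 1.0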
We must make these probability distributions, defined with respect to every point, as similar as possible in the two spaces. For this, we use the Kullback-Leibler (KL) divergence, which measures how much one probability distribution differs from another: the lower the value, the more alike the distributions are. For example, a value of 0 means that the two distributions are precisely the same.
So we now need to minimize the KL divergence, which gives us the following cost function:

$$C = \mathrm{KL}(P \,\|\, Q) = \sum_{i}\sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
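A short sketch of this cost, with small made-up similarity matrices P and Q:

import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0
    return (P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))).sum()

P = np.array([[0.0, 0.2, 0.3],
              [0.2, 0.0, 0.1],
              [0.3, 0.1, 0.0]]) / 1.2          # made-up high-dimensional similarities
Q_same = P.copy()
Q_diff = np.array([[0.0, 0.05, 0.45],
                   [0.05, 0.0, 0.2],
                   [0.45, 0.2, 0.0]]) / 1.4    # made-up low-dimensional similarities

print(kl_divergence(P, Q_same))                # ~0.0: identical distributions
print(kl_divergence(P, Q_diff))                # > 0: the distributions differ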
Minimizing this cost function gives us the direction in which each point should proceed, i.e., to the left or right on the single number line. That's all for this topic.
t-SNE is quite popular among data scientists. If we are applying for data scientist or machine learning roles, interviewers can ask the following questions on this topic:
This article discussed one of the most famous non-linear dimensionality reduction techniques used to extract valuable features from the available dataset, i.e., t-SNE. We walked through the step-wise working of the t-SNE algorithm on a dummy dataset and on the MNIST dataset using Python. Later, we discussed the underlying mathematical details of the topic. We hope you found the article informative and enjoyable.