In today’s world, most data samples are unlabelled because labeling them requires human effort and additional cost. To tackle this challenge, ML engineers developed unsupervised learning techniques that remove the dependency on data labeling. One such technique is clustering, where machines automatically form groups of data samples based on their similarities.
The K-means algorithm is one of the most widely used clustering algorithms in machine learning. It separates data samples into K distinct clusters, and in this blog, we will learn the procedure through which it does so.
After successfully going through this blog, we will be able to understand the following things:
Let’s start with clustering, as K-means is an algorithm to perform clustering.
Clustering is an unsupervised learning technique used to find the subgroups (also known as Clusters) in the dataset. Let’s understand this idea with an analogy — suppose you have a box of mixed candies. Some candies are salty, some are sweet, and some are chocolate.
Now, if someone asks you to give them chocolate candy, finding it in the mix would be difficult. But what if we have already separated the candies into different boxes based on flavor? This would make it much easier to find chocolate candy when someone asks for it.
Similarly, clustering is a technique in which we group similar data points in a dataset based on certain criteria, such as their shape, size, taste, or color. The image below shows an example of clustering: the data samples have been divided into three clusters, and each cluster is assigned a different color. Yellow squares correspond to the 1st cluster, blue to the 2nd, and red to the 3rd.
The clustering technique is prevalent in many fields, and many algorithms exist to perform it. K-means is one of them, which is the main topic of this blog. So let’s discuss this in greater detail.
K-means is a simple and elegant approach to partitioning data samples into a pre-defined number (“K”) of distinct and non-overlapping clusters. The value of K in the K-means algorithm is the user’s choice. If we use the K-means algorithm to segregate data samples into three categories, we need to specify K = 3 in our code.
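For instance, with scikit-learn’s KMeans (which we also use later in this blog), this choice is passed through the n_clusters parameter:

from sklearn.cluster import KMeans

# We pre-decide that the data should be split into K = 3 clusters
kmeans = KMeans(n_clusters=3)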
James MacQueen first used the term “K-means” in 1967. It is an older technique but still very popular in the data science and machine learning industries.
In the K-means algorithm, every data sample from the dataset will satisfy two fundamental properties: each sample belongs to at least one of the K clusters, and no sample belongs to more than one cluster (the clusters are non-overlapping).
Because of its simplicity, relative robustness (it works on a wide variety of datasets), and “good enough” results, it is still one of the first algorithms data analysts use to investigate any new dataset. Explainability is one of its key characteristics and makes it even more popular.
Suppose we have some data samples (X1, X2, …, Xn), and we want to divide these data samples into “K” clusters. We can use the K-means algorithm to perform this task based on iterative logic. The complete implementation can be summarized in five steps:
Accept data inputs and pre-define the number of groups we want to form. In our case, we have “n” data samples, and we want to divide these samples into K clusters.
K-means inherently uses the statistical mean of the samples present in a cluster. But computing those means is only possible once the clusters have already been formed. To start somewhere, we need initial values for these means, which will be updated as the algorithm progresses.
Initialize the first K means for K clusters. There can be two ways to do this:
These K samples will be treated as temporary centroids (also called the mean).
Now, we have (n - k) data samples left. For each sample, we calculate its distance to all K centroids and assign it to the cluster whose centroid is closest. We can use any distance measure to calculate this distance. Some of the popular ones are shown in the image below.
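As a quick illustration (the variable names and toy values below are hypothetical), here is how two common measures, Euclidean and Manhattan distance, could be computed between one sample and each centroid:

import numpy as np

sample = np.array([1.0, 2.0])                    # one data point (made-up values)
centroids = np.array([[0.0, 0.0], [3.0, 4.0]])   # K = 2 temporary centroids

euclidean = np.linalg.norm(centroids - sample, axis=1)   # square root of the sum of squared differences
manhattan = np.abs(centroids - sample).sum(axis=1)       # sum of absolute differences
closest_cluster = euclidean.argmin()             # index of the nearest centroid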
What will happen if a sample’s minimum distance is the same for two clusters? (Think!) In that case, we can break the tie arbitrarily and place the sample in either of them.
At this stage, every sample has been assigned to one of the K clusters.
After all the data points have been assigned to a cluster, we need to calculate the new centroids for every cluster. But this time, the centroid calculation will not be random. The new centroid is calculated as the mean of all the data points assigned to that cluster. At this step, we will have K centroids that can differ from the earlier chosen random K samples.
Please note that not all K centroids will necessarily change. It is possible that no other data sample was assigned to a particular cluster; in that case, the centroid for that cluster will not be updated.
We will repeat steps 3 and 4 until the algorithm converges. By convergence, we mean that there is no (or negligible) movement of samples between clusters.
In this way, K-means groups the data samples into the desired number of clusters. The steps are straightforward and intuitive, one of the most important reasons for the algorithm’s popularity.
Putting these steps together, here is a minimal NumPy-based sketch of the algorithm:

import numpy as np

def k_means(data, k, max_iter=100):
    # Steps 1 & 2: accept the data and pick K random samples as the initial centroids (means)
    centroids = data[np.random.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: compute each sample's distance to every centroid and
        # assign the sample to the closest cluster
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute every centroid as the mean of its assigned samples;
        # a cluster that received no samples keeps its old centroid
        new_centroids = np.array([data[labels == i].mean(axis=0)
                                  if np.any(labels == i) else centroids[i]
                                  for i in range(k)])
        # Step 5: stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
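As a quick usage check, the function above can be called on a small toy array (the values are made up purely for illustration):

import numpy as np

points = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
                   [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
labels, centroids = k_means(points, k=2)
print(labels)      # cluster index assigned to each sample
print(centroids)   # final cluster means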
K-means is the most popular algorithm for clustering data points. But it has certain limitations:
Given the limitations discussed above, there is scope to improve K-means further. The K-means algorithm computes the distance between every data sample and every centroid in each iteration, which makes it computationally very expensive for massive datasets. However, we can utilize the distances calculated in the previous iteration to make it computationally cheaper. Can you think of how?
In the first iteration, we calculated the distance of each data point to its nearest cluster; let’s call it previous_distance[i] for the i-th sample. In the next iteration, the centroids of all the clusters will change because samples move between clusters.
Here, we first calculate the distance of the same i-th sample to its previous cluster (but with the updated centroid) and call it current_distance[i]. We then compare it with previous_distance[i].
There are two scenarios:
1. If current_distance[i] <= previous_distance[i], the sample is at least as close to its old cluster as before, so we keep it in the same cluster and skip the distance calculations for the other centroids.
2. If current_distance[i] > previous_distance[i], the sample may now be closer to some other cluster, so we calculate its distance to all K centroids and reassign it in the usual way.
By making this comparison at the start, we save the time required to compute the distances to the other K - 1 centroids for many samples, making the algorithm slightly better in computational cost.
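Below is a minimal sketch of how this check might be wired into the assignment step, assuming the heuristic described above; the function and parameter names (assign_with_cache, prev_labels, and so on) are hypothetical and used only for illustration:

import numpy as np

def assign_with_cache(data, centroids, prev_labels, prev_distance):
    # Hypothetical assignment step that reuses last iteration's distances
    labels = prev_labels.copy()
    distance = prev_distance.copy()
    for i, x in enumerate(data):
        # Distance to the (updated) centroid of the sample's previous cluster
        current = np.linalg.norm(x - centroids[prev_labels[i]])
        if current <= prev_distance[i]:
            # Scenario 1: at least as close as before, so keep the assignment
            # and skip the remaining K-1 distance computations (heuristic)
            distance[i] = current
        else:
            # Scenario 2: possibly closer to another cluster, so do the full check
            all_distances = np.linalg.norm(centroids - x, axis=1)
            labels[i] = all_distances.argmin()
            distance[i] = all_distances[labels[i]]
    return labels, distance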
K-means is an unsupervised learning algorithm; hence, we do not have any data labels. How, then, can we pre-decide the number of clusters? Choosing the optimal number of clusters is subjective and depends on the method used to measure the similarity on which we base the clustering. Some algorithmic approaches can help us decide the value of K, but they only suggest a possible number of clusters; the final choice of how many clusters we want is up to us.
One popular approach is the Elbow method: we run K-means for a range of K values, plot the sum of squared errors (SSE) within clusters against K, and pick the value of K at the “elbow”, where increasing K stops reducing the SSE sharply. Let’s discuss the code we can use to implement the Elbow method:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# data is assumed to be the feature matrix (a NumPy array or DataFrame)
SSE = {}
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(data)
    # Inertia: sum of squared distances of samples to their closest cluster center
    SSE[k] = kmeans.inertia_

plt.figure()
plt.plot(list(SSE.keys()), list(SSE.values()))
plt.xlabel("Number of clusters (k)")
plt.ylabel("SSE")
plt.show()
The optimal number of clusters for the data used in the above code is 3 (the elbow point of the resulting plot). Let’s learn about another famous method:
The Silhouette Coefficient (or Silhouette Score) is a metric used to evaluate the performance of a clustering algorithm. It is defined for every sample in the dataset based on two calculations: a(i), the average distance between sample i and all other samples in its own cluster, and b(i), the average distance between sample i and all samples in the nearest neighbouring cluster. The coefficient for sample i is then s(i) = (b(i) - a(i)) / max(a(i), b(i)), and the score for the whole clustering is the average of s(i) over all samples; values close to 1 indicate well-separated clusters.
The code for this method:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# data is assumed to be the same feature matrix used above
for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(data)
    labels = kmeans.labels_
    # Average silhouette score over all samples for this number of clusters
    AvS = silhouette_score(data, labels, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, AvS))
##############################################################
'''
For n_clusters=2, The Silhouette Coefficient is 0.680813620271
For n_clusters=3, The Silhouette Coefficient is 0.552591944521
For n_clusters=4, The Silhouette Coefficient is 0.496992849949
For n_clusters=5, The Silhouette Coefficient is 0.488517550854
For n_clusters=6, The Silhouette Coefficient is 0.370380309351
For n_clusters=7, The Silhouette Coefficient is 0.356303270516
For n_clusters=8, The Silhouette Coefficient is 0.365164535737
For n_clusters=9, The Silhouette Coefficient is 0.346583642095
For n_clusters=10, The Silhouette Coefficient is 0.328266088778
'''
###############################################################
Here, n_clusters = 2 has the highest coefficient, so ideally it should be selected. However, based on our knowledge of the Iris data, we know there are three species, so the next value of 3 was chosen as optimal. When dealing with higher-dimensional data, this method tends to produce better results.
This example shows that these methods can only help us decide the appropriate number of clusters, and we cannot rely entirely on their suggestions.
Let’s try to implement this algorithm using the Scikit-learn library on one of its famous built-in datasets, the Iris dataset, shown in the image below.
Step 1: Import the required libraries: pandas, the Iris dataset loader, the KMeans model, and matplotlib.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
Step 2: Load the Iris data and separate the features from the target variable.
iris = load_iris()
X = iris.data
y = iris.target
Step 3: Fit the Scikit-learn K-means model.
kmeans = KMeans(n_clusters=3, max_iter=1000).fit(X)
labels = kmeans.labels_
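As an optional sanity check (not part of the original steps), we can inspect the learned cluster centers and cross-tabulate the predicted cluster labels against the true species using pandas:

print(kmeans.cluster_centers_)   # the 3 learned centroids (one row per cluster, 4 feature values each)
print(pd.crosstab(y, labels))    # how the clusters line up with the true species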
Step 4: Plot the predicted clusters in a 3D scatter plot.
fig = plt.figure(figsize=(4, 3))
# Create a 3D subplot and set the viewing angle
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
           c=labels.astype(float), edgecolor='k')
# Hide the numeric tick labels to keep the plot uncluttered
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
We can see that the 3 clusters are formed based on the Petal width, Petal length, and Sepal length. The average silhouette method suggested 2 as the best number of clusters, and we can also see in the image above that the yellow and purple samples are very close to each other. That is likely why the silhouette method got confused and suggested two clusters.
There is no doubt that clustering is one of the most important processes in data science and machine learning, but it comes with practical challenges that are hard to solve. Some of these practical issues are:
K-means is a popular algorithm widely used to analyze real-life data in domains such as medicine, sales, and finance. Some notable applications of K-means are:
K-means is considered one of the fundamental algorithms that every interviewer expects a candidate to know. Analyzing data and finding patterns in massive datasets is a crucial step in many data roles. Possible interview questions on K-means would be:
In this blog, we have discussed the K-means clustering algorithm in detail. We have learned the step-by-step procedure of the K-means algorithm, its inherent problems, and methods to improve its time performance. K-means requires a pre-decided value of K, which is tough to choose because the data is not labeled, so we also discussed algorithmic approaches that help find an optimal value of K.
If you have any queries/doubts/feedback, please write us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy algorithms!