Clustering
As many of you know, clustering is an unsupervised machine learning technique that groups data without any prior knowledge of labels. One key challenge in clustering is determining k, the optimal number of clusters. In this session, we'll explore various methods for selecting the best value for k.
About the Data
Up to this point, all the data you've seen has been provided by Crystal Ball, which searches throughout the universe. It can also generate its own data to mimic real-world scenarios, a process known as "data simulation". The clustering data we're using here is simulated data.
As shown in the code below, we can create clusters in the shapes of circles, moons, and blobs.

💻 Check clusters simulation code here >>
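To get a quick feel for how such data can be simulated, here is a minimal sketch using scikit-learn's built-in generators. The sample sizes, noise levels, and random seeds are illustrative and may differ from the linked code.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_circles, make_moons

# Circles: one ring nested inside another
X_circles, y_circles = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=10)

# Moons: two interleaving half-moons
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=10)

# Blobs: isotropic Gaussian clusters; cluster_std controls how spread out each blob is
X_blobs, y_blobs = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)

for X, y, title in [(X_circles, y_circles, "circles"),
                    (X_moons, y_moons, "moons"),
                    (X_blobs, y_blobs, "blobs")]:
    plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
    plt.title(title)
    plt.show()
```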
Now, let's take a look at the different clusters that Lady H. has simulated. In the 3rd, 4th, and 6th plots, the clusters have similar standard deviations (cluster_std) and are more distinct from one another, making these easier clustering scenarios. To introduce more complexity, the clusters in the 5th plot are more intermixed. The most challenging clusters appear in the 1st and 2nd plots: the circles and moons. Although we can see clear separations in these plots, they present a greater challenge for clustering algorithms to handle effectively.

💻 Check purposeful clusters simulation code here >>
With these simulated clusters, let's explore how to find the optimal number of clusters, k.
Finding Optimal K
There are countless data science tutorials available online, but how do you typically use them? While these tutorials often spark inspiration, Lady H. is cautious about taking them at face value. When she finds something intriguing, she delves deeper into the theory and validates it through additional experiments.
The inspiration for this experiment came from an article titled "Are You Still Using the Elbow Method?". In the article, the author compares several k-estimation algorithms using five datasets with clusters ranging from 2 to 25. The visualizations are quite compelling, and the author concludes that the elbow method, a popular k-estimation approach, performed the worst. Since the datasets in the article consist of well-separated blobs, Lady H. wondered how these algorithms would perform on more complex datasets.
Before diving into Lady H.'s experiments, let's first understand how each k-estimation algorithm works. To identify the optimal k, the goal is to achieve clusters with high between-cluster variance and low within-cluster variance.
Elbow Method
The elbow method examines WCSS (Within-Cluster Sum of Squares), which is the sum of the squared distances between the data points within a cluster and the cluster's centroid. A lower WCSS indicates that data points within a cluster are closer together. The optimal k is typically chosen at the 'elbow' of the plot, the point where the decrease in WCSS slows down sharply and further increases in k bring only marginal gains. Choosing a larger k with an even smaller WCSS isn't ideal, as it would create more clusters than necessary.
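Here is a minimal sketch of the elbow plot, assuming a scikit-learn KMeans workflow where WCSS is exposed as inertia_; the blobs dataset is only illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)  # illustrative data

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=10).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```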

As we can see, the elbow method doesn't evaluate between-cluster variance. Additionally, it can sometimes be difficult to identify the optimal k on the plot, as shown in the example below.

Calinski-Harabasz Index
The Calinski-Harabasz Index, also known as the Variance Ratio Criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom: Calinski-Harabasz Index = (between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k)), where dispersion is a sum of squared distances, n is the number of data points, and k is the number of clusters. A higher Calinski-Harabasz Index indicates better clustering, as it reflects greater between-cluster variance and smaller within-cluster variance.
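A minimal sketch of using this index to estimate k, assuming k-means labels and scikit-learn's calinski_harabasz_score (the dataset is illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)  # illustrative data

scores = {}
for k in range(2, 11):  # the score is only defined for k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)  # higher is better
print(best_k, scores[best_k])
```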
This index is quick to compute and tends to provide more accurate k estimations on clusters that are convex, dense, and well-separated.
Do you know what "convex" means for clusters?


Davies-Bouldin Index
The Davies-Bouldin Index (DBI) measures the average similarity between each cluster and the cluster that is most similar to it. Similarity is calculated using the ratio of within-cluster distance to between-cluster distance. A lower DBI indicates better clustering, as it reflects smaller within-cluster distances and larger between-cluster distances.
DBI is straightforward to calculate and interpret. However, it only considers pairwise distances between cluster centroids and their members, making it sensitive to outliers and neglectful of data distribution or structure (e.g., nested clusters or non-linear relationships). Additionally, it assumes clusters have similar density and size, which is often not the case in real-world scenarios.
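A similar sketch for DBI, this time picking the k that minimizes scikit-learn's davies_bouldin_score (again on illustrative data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)  # illustrative data

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # lower is better
print(best_k, scores[best_k])
```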
Silhouette Coefficient
The Silhouette Coefficient measures how similar a point is to its own cluster compared with other clusters. For each point, Silhouette Coefficient = (b - a) / max(a, b), where a is the mean distance from the point to the other points in its own cluster and b is the mean distance from the point to the points in the nearest neighboring cluster. The coefficient ranges from -1 to 1, with higher values indicating better clustering and values near 0 suggesting overlapping clusters.
Some online tutorials describe the drawbacks of DBI and then suggest the Silhouette Coefficient as an alternative. However, looking at the definitions of DBI, the Silhouette Coefficient, and the Calinski-Harabasz Index, they share the same drawbacks: they work best on convex clusters, are sensitive to outliers, and ignore the data distribution or structure.
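And a matching sketch for the Silhouette Coefficient, choosing the k that maximizes the mean score over all points via scikit-learn's silhouette_score (illustrative data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)  # illustrative data

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all samples

best_k = max(scores, key=scores.get)  # higher is better
print(best_k, scores[best_k])
```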
BIC
The Bayesian Information Criterion (BIC) is commonly used for model selection and is based on the maximized likelihood of a model. Formally, BIC = p * ln(n) - 2 * ln(L), where p is the number of model parameters (which grows with the number of clusters k), n is the number of data points, and L is the maximized likelihood. In clustering, BIC balances the model's fit against its complexity, applying a penalty to prevent overfitting. A lower BIC score indicates a better model fit.
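A minimal sketch of BIC-based k selection follows. Note this setup is an assumption: scikit-learn's KMeans does not report BIC, so a Gaussian mixture model is used here as a stand-in, and the linked implementation may compute BIC differently.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)  # illustrative data

bics = {}
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=10).fit(X)
    bics[k] = gmm.bic(X)  # lower BIC = better trade-off between fit and complexity

best_k = min(bics, key=bics.get)
print(best_k, bics[best_k])
```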
Estimated K Comparison
Now let's apply these k-estimation algorithms to our clusters.

Here's the comparison code:

๐ป Check detailed implementation of each k-estimation algorithm here >>
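For reference, here's a condensed sketch of what such a comparison can look like on a single dataset; the linked code runs it over all the simulated clusters. The elbow method is omitted here because its bend is usually read off the plot rather than computed directly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score, silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=10)  # ground-truth k = 4

k_range = range(2, 11)
labels_by_k = {k: KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X) for k in k_range}

estimates = {
    "calinski_harabasz": max(k_range, key=lambda k: calinski_harabasz_score(X, labels_by_k[k])),
    "davies_bouldin": min(k_range, key=lambda k: davies_bouldin_score(X, labels_by_k[k])),
    "silhouette": max(k_range, key=lambda k: silhouette_score(X, labels_by_k[k])),
    "bic_gmm": min(k_range, key=lambda k: GaussianMixture(n_components=k, random_state=10).fit(X).bic(X)),
}
print(estimates)  # compare each estimate against the ground-truth k
```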
Let's take a look at the k-estimation results. Is the elbow method really the worst? Obviously not! In the plot below, the closer a line is to the grey diagonal, the closer that estimator's k is to the ground truth, and the elbow method clearly outperforms the other algorithms.

💻 Check all the code here >>
Recall that the article "Are You Still Using the Elbow Method?" only uses datasets made of well-separated blobs, with each blob forming a convex cluster. However, such ideal datasets are rare in real-world applications.
So what should you consider when estimating k for clustering problems in the future? Lady H. suggests:
The elbow method remains a simple and effective starting point.
If you're familiar with coding in R, Lady H. has experimented with four methods, one of which applies 30 k-estimation algorithms and returns the most frequently selected k. 💻 See R code here >>
Similarly, in Python, you can apply multiple k-estimation algorithms and select the most frequently recommended k, as shown in the sketch below.
Additionally, consider exploring clustering algorithms beyond k-means.
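Here is a hedged sketch of that voting idea, reusing per-metric estimates like the ones computed earlier (the values shown are hypothetical):

```python
from statistics import mode

# Hypothetical per-metric estimates; in practice, fill these in from your own runs.
estimates = {"calinski_harabasz": 4, "davies_bouldin": 4, "silhouette": 4, "bic_gmm": 5}
voted_k = mode(estimates.values())
print(voted_k)  # the most frequently recommended k
```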