What is Clustering?
Clustering is a machine learning technique used to group data points so that those within the same group (referred to as a cluster) are more similar to each other than to those in other clusters. It aims to find patterns, relationships, or structures in data without having predefined categories or labels.
Common clustering algorithms like $K$-means and hierarchical clustering are often the first to come to mind, as they are widely used. However, these methods have certain limitations, particularly when dealing with irregularly shaped data. This limitation often leads us to explore other techniques, such as DBSCAN, which we will examine further in this post.
$K$-Means and Hierarchical Clustering
Traditional algorithms sometimes struggle to identify clusters when the data has irregular shapes. To illustrate this drawback, let’s look at a simple example: a dataset resembling two intertwined half circles with slight noise. We will apply the $K$-means and hierarchical clustering algorithms to group the data into two clusters. Let’s evaluate how well they perform the task.
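As a point of reference, the snippet below is a minimal sketch of how this experiment could be set up with scikit-learn; the sample size, noise level, and random seed are assumptions rather than the exact values behind the figures.

```python
# Sketch: two noisy half-moons clustered with K-means and hierarchical clustering.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering

# Two intertwined half circles with slight noise (parameters are assumptions)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Plot both partitions side by side for comparison
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_labels)
axes[0].set_title("K-means")
axes[1].scatter(X[:, 0], X[:, 1], c=hier_labels)
axes[1].set_title("Hierarchical (agglomerative)")
plt.show()
```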
It is clear that both algorithms fail to identify the two clusters that we expected. Some data points that belong to one circle are mistakenly classified as part of the other. Additionally, the algorithms fail to detect outliers, as all points are simply classified as either purple or yellow. Given the shortcomings of $K$-means and hierarchical clustering with data of arbitrary shapes, it is easy to imagine how their performance might degrade when applied to high-dimensional data.
DBSCAN
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), as mentioned earlier, provides an alternative to the conventional clustering methods we’ve discussed. DBSCAN forms clusters based on the density of data points: it groups closely packed points into clusters while identifying points in low-density regions as outliers.
A key advantage of DBSCAN is that it doesn’t require the user to specify the number of clusters in advance, unlike methods such as $K$-means, which rely on a predefined number of centroids. The clustering process can be summarized as follows:
- Select an arbitrary point in the dataset.
- Check whether it has enough neighboring points (based on $\epsilon$ and $minPts$) to be considered a core point.
  - $\epsilon$ is the maximum distance between two points for them to be considered neighbors. This hyperparameter controls the size of the neighborhood used to form clusters.
  - The distance between two points is commonly measured using Euclidean distance, although other distance metrics can be used.
  - To determine a suitable value for $\epsilon$, a typical approach is to plot a $K$-distance graph and identify the point of maximum curvature (also called the “elbow point”).
  - $minPts$ is the minimum number of points required in a neighborhood for a point to be considered a core point.
  - A common guideline is to set $minPts$ to at least the number of dimensions in the data plus one. Another rule of thumb is to double the number of dimensions, and domain knowledge can further refine the choice.
- If the point is a core point, a cluster is formed by including all points in its neighborhood.
  - If the point is not a core point and lacks sufficient neighbors, it is temporarily classified as noise, though it may later be assigned to a cluster as a border point.
- Randomly select a core point and assign it to the first cluster. Then add every core point that lies within $\epsilon$ of a point already in the cluster, and keep expanding until no more core points can be added.
  - Note that border points are not considered during this step.
- Afterward, include all border points that are close to the core points of the first cluster.
- Select another core point that has not yet been assigned to any cluster and form a new cluster, repeating the expansion process described above.
- Any remaining points that do not belong to any cluster are classified as noise (outliers).
Following the approach described above, we’ll apply DBSCAN to the same dataset. Initially, we set the minimum number of points in a neighborhood ($minPts$) to 4, based on the two-dimensional nature of the data. We then examine the $K$-distance graph with $K = 4$ to determine an appropriate neighborhood radius ($\epsilon$). On this graph, the point of maximum curvature appears at roughly $\epsilon = 0.05$.
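One way to construct this $K$-distance graph is sketched below using scikit-learn’s `NearestNeighbors`; the variable `X` refers to the moons data from the earlier sketch, and the choice of $K = 4$ follows the $minPts$ value above.

```python
# Sketch of the K-distance graph used to pick epsilon (assumes X from above).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # matches minPts; the query point counts as one of its own neighbors here
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)

# Sort the distance to the k-th nearest neighbor for every point and plot it;
# the "elbow" of this curve suggests a value for epsilon.
k_distances = np.sort(distances[:, -1])
plt.plot(k_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.show()
```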
Using this observation, we fit the DBSCAN model and visualize the results, expecting clear cluster separation. The outcome, however, is underwhelming: contrary to our expectations, the algorithm identifies only a single cluster.
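In scikit-learn, this fit might look like the sketch below, where `min_samples` plays the role of $minPts$ and `X` is again the assumed moons data.

```python
# Sketch: DBSCAN with the elbow-based estimate eps = 0.05 (assumes X from above).
import numpy as np
from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.05, min_samples=4).fit_predict(X)  # min_samples ~ minPts

# DBSCAN marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```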
In light of these results, we’ll explore an alternative method for selecting $\epsilon$, using the silhouette score to guide our choice. The silhouette score is a metric used to evaluate the quality of a clustering result. It measures how similar each data point is to its own cluster compared to other clusters. Specifically, for each sample, it calculates two values: the average distance to all other points within the same cluster, $a(i)$, and the average distance to points in the nearest neighboring cluster, $b(i)$.
The silhouette score for a given sample $i$ is defined as:
\[\frac{b(i)-a(i)}{\max(a(i),b(i))}\]

This score ranges from -1 to 1, where a score of 1 indicates that the point is well-clustered (i.e., the mean distance to its own cluster is much smaller than the distance to the nearest cluster), while a score near 0 means the point lies on the boundary between two clusters. Negative values suggest that the point may have been assigned to the wrong cluster.
By calculating the average silhouette score across all data points, we can assess the overall effectiveness of the clustering. An ideal silhouette score of 1 implies perfect clustering, where points are maximally separated from the nearest cluster and tightly grouped within their own cluster.
To explore the impact of different values of $\epsilon$ (with a minimum core point requirement of 4), we calculate the silhouette scores over a range of $\epsilon$ values, from 0.05 to 0.2 with a step size of 0.01. The table below shows the silhouette scores corresponding to each value of $\epsilon$.
| $\epsilon$ | Silhouette Score |
|---|---|
| 0.05 | -0.71 |
| 0.06 | -0.58 |
| 0.07 | -0.30 |
| 0.08 | -0.13 |
| 0.09 | -0.04 |
| 0.10 | 0.05 |
| 0.11 | 0.31 |
| 0.12 | 0.15 |
| 0.13 | 0.16 |
| 0.14 | 0.17 |
| 0.15 | 0.22 |
| 0.16 | 0.21 |
| 0.17 | 0.24 |
| 0.18 | 0.25 |
| 0.19 | 0.17 |
| 0.20 | 0.17 |
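A sketch of the loop that could produce such a sweep is shown below; it keeps the assumed `X` from the earlier snippets and treats noise points (label -1) as their own group when scoring, which is only one of several possible conventions.

```python
# Sketch: sweep epsilon from 0.05 to 0.20 and compute a silhouette score per fit.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

for eps in np.arange(0.05, 0.205, 0.01):
    labels = DBSCAN(eps=eps, min_samples=4).fit_predict(X)
    # silhouette_score needs at least two distinct labels; skip degenerate fits
    if len(set(labels)) > 1:
        print(f"eps={eps:.2f}  silhouette={silhouette_score(X, labels):.2f}")
```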
From this analysis, the silhouette score reaches its highest value of 0.31 when $\epsilon$ is set to 0.11. Based on this result, we refit the model using $\epsilon=0.11$ and visualize the clustering outcome. Compared to the previous result with $\epsilon= 0.05$, the new outcome is notably more promising. The model successfully identifies two clusters, represented by the yellow and green points, while the purple points are classified as outliers, indicating that they do not belong to either cluster.
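For completeness, a final sketch of the refit and visualization, under the same assumptions as the earlier snippets:

```python
# Sketch: refit DBSCAN with the best epsilon from the sweep and plot the result.
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

final_labels = DBSCAN(eps=0.11, min_samples=4).fit_predict(X)

# Points labelled -1 are outliers; coloring by label shows the two moons separately.
plt.scatter(X[:, 0], X[:, 1], c=final_labels)
plt.title("DBSCAN with eps = 0.11, minPts = 4")
plt.show()
```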