# Comparing Clustering

I’ve done a fair amount of work with unsupervised learning recently. Most of my code has leveraged scikit-learn, so I am becoming more familiar with that code base and its conventions.

Among other positive qualities, scikit-learn is particularly good at attracting contributions because of its excellent documentation. When I was first exposed to scikit-learn’s clustering suite a few years ago, I really appreciated this demo showing how the various clustering algorithms perform on a few canonical datasets. I’ve copied the main plot below.

After working in this area for a while, I felt that this plot does not tell the whole story. For one thing, Gaussian Mixture Model (GMM) clustering is not included. The absence of GMMs is striking to me because I find GMMs to be the most intuitive clustering idea. GMMs are also very robust in applications where clusters have varying densities.

I made a pull request to add GMMs, but it has not been merged yet. In the process of making the PR, I received several good suggestions on how to add more canonical datasets to the plot in order to show a wider range of performance.

## Code
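Since the PR isn’t merged, here is a minimal sketch of the kind of setup the demo uses, with GMM added alongside the other algorithms. The dataset parameters roughly follow the scikit-learn comparison demo, but the cluster counts, `eps`, and the anisotropic shear matrix here are my assumptions, and the plotting code is omitted.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

n = 500
rng = np.random.RandomState(170)

# Canonical datasets, roughly following the comparison demo
noisy_circles, _ = datasets.make_circles(n_samples=n, factor=0.5, noise=0.05)
noisy_moons, _ = datasets.make_moons(n_samples=n, noise=0.05)
blobs, _ = datasets.make_blobs(n_samples=n, random_state=8)
# Anisotropic: shear well-separated blobs with a linear transform
aniso = np.dot(datasets.make_blobs(n_samples=n, random_state=170)[0],
               [[0.6, -0.6], [-0.4, 0.8]])
# Varying variance: three blobs with very different spreads
varied, _ = datasets.make_blobs(
    n_samples=n, cluster_std=[1.0, 2.5, 0.5], random_state=170)
no_structure = rng.rand(n, 2)

algorithms = {
    "KMeans": KMeans(n_clusters=3, n_init=10),
    "DBSCAN": DBSCAN(eps=0.3),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "GMM": GaussianMixture(n_components=3, covariance_type="full"),
}

for X in [noisy_circles, noisy_moons, varied, aniso, blobs, no_structure]:
    X = StandardScaler().fit_transform(X)
    for name, algo in algorithms.items():
        labels = algo.fit_predict(X)
        # DBSCAN labels noise points -1; don't count them as a cluster
        n_found = len(set(labels)) - (1 if -1 in labels else 0)
        print(name, n_found)
```

The real demo scatter-plots each (dataset, algorithm) cell colored by label; the loop above just reports how many clusters each algorithm found.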

## Commentary

I really like this plot, and I think the addition of the anisotropic and varying-variance datasets in the third and fourth rows tells a better story.

### Convex Clusters

The plot clearly shows how well GMMs perform on convex clusters that are distinct. GMM is the only algorithm able to properly cluster the anisotropic dataset, which is not something I would have expected.
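You can reproduce that anisotropic result in a few lines. This is a sketch using my own choice of seed and shear matrix (the same ones scikit-learn’s docs use elsewhere), not the PR code: a full-covariance GMM can stretch each component along the shear direction, while K-Means, which implicitly assumes spherical clusters, cannot.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three well-separated blobs, sheared by a linear transform
X, y = make_blobs(n_samples=600, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

gmm_labels = GaussianMixture(n_components=3, covariance_type="full",
                             n_init=5, random_state=0).fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# GMM should recover the true labels almost exactly; K-Means
# typically splits the elongated clusters across centroids
print(adjusted_rand_score(y, gmm_labels))
print(adjusted_rand_score(y, km_labels))
```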

DBSCAN is also a consistent high performer and has the advantage of not requiring the number of clusters as an input to the algorithm. This helps for the “no structure” dataset in the bottom row, where there is only one cluster. DBSCAN fails on the varying-variance dataset because DBSCAN clusters based on areas of similar density.
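A quick sketch of that advantage, with `eps=0.3` on standardized data as in the demo (the seeds are my assumptions): DBSCAN is never told how many clusters to find, yet it recovers the three blobs, and on uniform noise it lumps everything into a single cluster rather than inventing structure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

def count_clusters(labels):
    # DBSCAN marks noise points -1; exclude them from the count
    return len(set(labels)) - (1 if -1 in labels else 0)

# Three well-separated blobs: DBSCAN finds 3 clusters on its own
X, _ = make_blobs(n_samples=600, centers=3, random_state=8)
blob_labels = DBSCAN(eps=0.3).fit_predict(StandardScaler().fit_transform(X))
print(count_clusters(blob_labels))

# Uniform noise ("no structure"): one big connected region of
# similar density, so DBSCAN reports a single cluster
rng = np.random.RandomState(0)
noise = StandardScaler().fit_transform(rng.rand(600, 2))
noise_labels = DBSCAN(eps=0.3).fit_predict(noise)
print(count_clusters(noise_labels))
```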

### Non-Convex Clusters

The “noisy circles” and “noisy moons” datasets in the first two rows are the non-convex datasets. GMM does poorly on both. This is because the underlying concept of GMM clustering is to model each cluster as a set of realizations of a Gaussian random variable, and Gaussians have convex level curves, so they will never fit a non-convex shape well.
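The moons dataset makes this concrete. In this sketch (my parameter choices, following the demo’s `eps=0.3` on standardized data), a two-component GMM scores poorly on adjusted Rand index because each Gaussian carves out a convex region cutting across both moons, while density-based DBSCAN traces the two arcs almost perfectly.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: non-convex clusters
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)
Xs = StandardScaler().fit_transform(X)

# GMM: each component has convex (elliptical) level curves,
# so it cannot follow the crescent shapes
gmm_ari = adjusted_rand_score(
    y, GaussianMixture(n_components=2, n_init=5,
                       random_state=0).fit_predict(Xs))

# DBSCAN: follows the dense arcs regardless of their shape
db_ari = adjusted_rand_score(y, DBSCAN(eps=0.3).fit_predict(Xs))

print(gmm_ari, db_ari)
```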

### Agglomerative Clustering

Agglomerative Clustering is another consistent high performer, handling both convex and non-convex clusters well. The drawback with Agglomerative Clustering is that it is much slower than DBSCAN or GMM.
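The non-convex performance in the demo comes from constraining the merges with a nearest-neighbor connectivity graph, so that only nearby points can be linked and the hierarchy follows the shape of the data. A sketch on the moons dataset (the linkage and neighbor count are my guesses at reasonable settings, not the demo’s exact configuration):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import kneighbors_graph

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

# Restrict merges to each point's 10 nearest neighbors so the
# hierarchy grows along the crescents instead of across the gap
conn = kneighbors_graph(X, n_neighbors=10, include_self=False)
labels = AgglomerativeClustering(
    n_clusters=2, linkage="average", connectivity=conn).fit_predict(X)

print(adjusted_rand_score(y, labels))
```

Building the full merge hierarchy is what makes this approach slower than DBSCAN or GMM as the dataset grows.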

### K-Means Clustering

The plot shows that K-Means, arguably the most popular default clustering algorithm, is fast but is also a consistently poor performer. It doesn’t handle non-convex clusters, and it also does not handle clusters that are not well separated. K-Means has the further drawback that you have to specify the number of clusters as an input to the algorithm.
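The varying-variance dataset shows one of these failures directly. In this sketch (seed and spreads borrowed from scikit-learn’s K-Means assumptions example, the comparison is my own), K-Means assigns fringe points of the wide cluster to the tighter ones because it only compares distances to centroids, while a GMM, which also fits each cluster’s spread, scores higher:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three blobs with very different spreads
X, y = make_blobs(n_samples=600, cluster_std=[1.0, 2.5, 0.5],
                  random_state=170)

# K-Means: equal-variance assumption misassigns the wide cluster's edges
km_ari = adjusted_rand_score(
    y, KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X))

# GMM: per-cluster covariances let the wide cluster stay wide
gm_ari = adjusted_rand_score(
    y, GaussianMixture(n_components=3, n_init=5,
                       random_state=0).fit_predict(X))

print(km_ari, gm_ari)
```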