I’ve done a fair amount with unsupervised learning recently. Most of my code has
leveraged the scikit-learn libraries, so I am becoming more familiar with that
code base.
Among other positive qualities, scikit-learn is particularly good at attracting
contributions thanks to its excellent documentation. When I was first exposed to scikit-
learn’s clustering suite a few years ago, I really appreciated this
demo showing how the various clustering algorithms perform on a few
canonical datasets. I’ve copied the main plot below.
After working in this area for a while, I felt this plot did not tell the
whole story. For one thing, Gaussian Mixture Model (GMM)
clustering is not included. The absence of GMMs is striking to me because I
find GMMs to be the most intuitive clustering idea. GMMs are also very robust
in applications where clusters have varying densities.
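To make the varying-density point concrete, here is a minimal sketch (the dataset and parameter choices are mine, not from the demo): each GMM component fits its own covariance matrix, so blobs with very different spreads can each be modeled on their own terms.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three blobs with very different spreads ("varying variance").
X, y = make_blobs(
    n_samples=600,
    centers=[(-6.0, 0.0), (0.0, 0.0), (6.0, 0.0)],
    cluster_std=[0.4, 2.0, 0.8],
    random_state=0,
)

# covariance_type="full" lets every component learn its own
# covariance, so clusters of different densities are handled naturally.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)
```

The fitted `gmm.covariances_` will differ sharply between the tight and the diffuse blob, which is exactly what a single shared-scale method like K-Means cannot express.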
I made a pull
request to add GMMs, but it has not been merged yet. In the process of
making the PR, there were several good suggestions on how to add more canonical
datasets to the plot in order to show a wider range of performance.
I really like this plot, and I think the addition of the anisotropic and
varying-variance datasets in the third and fourth rows tells a better story.
The plot clearly shows how well GMMs perform for convex clusters that are
distinct. GMM is the only algorithm able to properly cluster the
anisotropic dataset, which is not something I would have expected.
DBSCAN is also a consistent high performer and has the advantage of not
requiring that the number of clusters be an input to the algorithm. This helps for
the “no structure” dataset in the bottom row, where there is only one cluster.
DBSCAN fails for the varying-variance dataset because DBSCAN clusters based on
areas of similar density.
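A quick sketch of the “no cluster count required” property (dataset and `eps`/`min_samples` values are my own illustrative choices): DBSCAN discovers however many dense regions the data contains.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two well-separated blobs; we never tell DBSCAN how many to expect.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5.0, 0.0), (5.0, 0.0)],
    cluster_std=0.5,
    random_state=0,
)

# eps and min_samples define what counts as a dense region;
# points that fit no dense region get the noise label -1.
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```

The flip side is visible in the same parameters: because `eps` is a single global scale, a dataset whose clusters have very different densities has no one `eps` that works everywhere, which is why DBSCAN struggles on the varying-variance row.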
The “noisy circles” and “noisy moons” datasets in the first two rows are the
non-convex datasets, and GMM does poorly on both. This is because the
underlying concept of GMM clustering is to model each cluster as a set of
realizations from a Gaussian random variable; Gaussians have convex level
curves, so they will never fit a non-convex shape well.
Agglomerative Clustering is another consistent high performer, handling both
convex and non-convex clusters well. The drawback with Agglomerative Clustering
is that it is considerably slower than DBSCAN or GMM.
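As a sketch of why agglomerative methods can follow non-convex shapes (this uses single linkage on my own example data, not necessarily the exact configuration in the demo): merging nearest pairs first lets the clustering chain along a curved structure.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

# Single linkage merges the closest pair of clusters at each step,
# so it can chain along each curved moon instead of requiring
# compact, convex groups.
agg = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = agg.fit_predict(X)
```

The speed drawback comes from the same mechanism: the hierarchy is built by repeated pairwise merges over the distance structure of the data, which scales worse than a single EM fit or one density sweep.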
The plot shows that K-Means, arguably the most popular default clustering
algorithm, is fast but also a consistently poor performer. It handles neither
non-convex clusters nor clusters that are not well separated. K-Means has the
further drawback that you have to specify the number of clusters as an input
to the algorithm.
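Both drawbacks show up in even the smallest K-Means sketch (dataset and parameters are my own choices): `n_clusters` must be supplied up front, and the partition it produces is built from convex cells, so the moons get split across the middle.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)

# n_clusters is a required modeling decision, not something the
# algorithm discovers. Each point is assigned to its nearest
# centroid, which always yields convex (Voronoi) regions.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Even so, its speed and simplicity explain its popularity as a default: when clusters really are convex and well separated, the nearest-centroid rule is hard to beat.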