Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of data points, we can use a clustering algorithm to classify each data point into a specific group. In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Clustering is a method of unsupervised learning and is a common technique for statistical data analysis used in many fields.
In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing what groups the data points fall into when we apply a clustering algorithm. Today, we’re going to look at 5 popular clustering algorithms that data scientists need to know and their pros and cons!
Types of clustering algorithms
There are different types of clustering algorithms that handle all kinds of unique data.
In density-based clustering, data is grouped by areas of high concentrations of data points surrounded by areas of low concentrations of data points. The algorithm finds the places that are dense with data points and calls those clusters.
The great thing about this approach is that the clusters can be any shape; you aren't constrained to a preset shape such as a circle or sphere.
Algorithms of this type don't try to assign outliers to clusters, so outliers are simply treated as noise and ignored.
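As a sketch of the density-based idea, here is scikit-learn's DBSCAN applied to a tiny made-up dataset: two dense areas plus one isolated point. The data and parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated outlier (toy data).
X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.9],   # dense area A
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.1], [8.0, 8.1],   # dense area B
    [4.0, 15.0],                                       # isolated outlier
])

# eps: neighborhood radius; min_samples: points needed to call a region dense.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # the two dense areas become clusters 0 and 1; the outlier is -1
```

Note how the outlier is labeled -1 rather than forced into a cluster, exactly as described above.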
With a distribution-based clustering approach, all of the data points are considered parts of a cluster based on the probability that they belong to a given cluster.
It works like this: there is a center point, and as the distance of a data point from the center increases, the probability of it being a part of that cluster decreases.
If you aren’t sure what distribution your data might follow, you should consider a different type of algorithm.
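A minimal sketch of the distribution-based idea, using scikit-learn's GaussianMixture on two invented Gaussian blobs: each point gets a probability of belonging to each component, and that probability falls off as the point gets farther from the component's center.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two invented Gaussian blobs with well-separated centers.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# predict_proba gives soft membership: one probability per cluster, summing to 1.
probs = gmm.predict_proba(X)
print(probs[0])  # first point: near-certain membership in one component
```

The soft, probabilistic assignment is what distinguishes this family from hard-assignment methods like K-Means.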
Centroid-based clustering is the one you probably hear about the most. It’s a little sensitive to the initial parameters you give it, but it’s fast and efficient.
These types of algorithms separate data points based on multiple centroids in the data. Each data point is assigned to a cluster based on its squared distance from the centroid. This is the most commonly used type of clustering.
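The assignment step can be sketched in a few lines of NumPy: each point goes to whichever centroid is closest by squared distance (the points and centroids here are made up for the demo).

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # assumed fixed for this demo

# Squared distance from every point to every centroid, then pick the closest.
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
assignments = sq_dists.argmin(axis=1)
print(assignments)  # → [0 0 1 1]
```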
Hierarchical-based clustering is typically used on hierarchical data, such as you would get from a company database or taxonomies. It builds a tree of clusters so everything is organized from the top down.
This is more restrictive than the other clustering types, but it’s perfect for specific kinds of data sets.
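As a sketch, scikit-learn's AgglomerativeClustering builds such a tree bottom-up by repeatedly merging the closest clusters, then cuts it at a chosen number of clusters (the data below is invented):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two small, well-separated groups of toy points.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 5.1], [5.2, 5.0]])

# Merge the closest pairs repeatedly, then cut the tree into 2 clusters.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # first three points share one label, last three the other
```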
K-Means is probably the most well-known clustering algorithm. It’s taught in a lot of introductory data science and machine learning classes. It’s easy to understand and implement in code! Check out the graphic below for an illustration.
To begin, we select a number of classes/groups to use and randomly initialize their respective center points. To figure out the number of classes to use, it’s good to take a quick look at the data and try to identify any distinct groupings. The center points are vectors of the same length as each data point vector and are the “X’s” in the graphic above.
Each data point is classified by computing the distance between that point and each group center and then classifying the point to be in the group whose center is closest to it.
Based on these classified points, we recompute the group center by taking the mean of all the vectors in the group.
Repeat these steps for a set number of iterations or until the group centers don’t change much between iterations. You can also opt to randomly initialize the group centers a few times, and then select the run that looks like it provided the best results.
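The steps above can be sketched directly in NumPy. This is a toy implementation on invented two-blob data, not production code:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its closest center (squared distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return labels, centers

# Two invented, well-separated blobs of 20 points each.
X = np.vstack([np.random.default_rng(1).normal([0, 0], 0.3, (20, 2)),
               np.random.default_rng(2).normal([4, 4], 0.3, (20, 2))])
labels, centers = kmeans(X, k=2)
print(labels)
```

To implement the random-restart idea, you would call `kmeans` with several seeds and keep the run with the lowest total within-cluster squared distance.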
K-Means has the advantage that it’s pretty fast, as all we’re doing is computing the distances between points and group centers; very few computations! It thus has a linear complexity O(n).
On the other hand, K-Means has a couple of disadvantages. First, you have to select how many groups/classes there are. This isn’t always trivial; ideally, we’d want the clustering algorithm to figure that out for us, since the whole point is to gain insight from the data. K-Means also starts from a random choice of cluster centers, so different runs of the algorithm may yield different clustering results. The results may therefore not be repeatable and can lack consistency; other clustering methods are more consistent.
K-Medians is another clustering algorithm related to K-Means, except that instead of recomputing the group center points using the mean, we use the median vector of the group. This method is less sensitive to outliers (because it uses the median) but is much slower for larger datasets, as sorting is required on each iteration when computing the median vector.
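The outlier behavior is easy to see in the update step itself. On an invented group that contains one extreme outlier, the mean center is dragged far away while the median center stays near the bulk of the points:

```python
import numpy as np

# Points assigned to one group, including one extreme outlier.
group = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [50.0, 50.0]])

mean_center = group.mean(axis=0)           # pulled far toward the outlier
median_center = np.median(group, axis=0)   # stays near the bulk of points

print(mean_center)    # roughly [13.28, 13.23]
print(median_center)  # [1.1, 1.05]
```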
Mean shift clustering is a sliding window-based algorithm that attempts to find dense areas of data points. It is a centroid-based algorithm meaning that the goal is to locate the center points of each group/class, which works by updating candidates for center points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of center points and their corresponding groups. Check out the graphic below for an illustration.
1. To explain mean-shift we will consider a set of points in two-dimensional space like the above illustration. We begin with a circular sliding window centered at a point C (randomly selected) and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves iteratively shifting this kernel to a higher-density region on each step until convergence.
2. At every iteration, the sliding window is shifted towards regions of higher density by shifting the center point to the mean of the points within the window (hence the name). The density within the sliding window is proportional to the number of points inside it. Naturally, by shifting to the mean of the points in the window, it will gradually move towards areas of higher point density.
3. We continue shifting the sliding window according to the mean until there is no direction in which a shift can accommodate more points inside the kernel. Check out the graphic above; we keep moving the circle until we are no longer increasing the density (i.e., the number of points in the window).
4. This process of steps 1 to 3 is done with many sliding windows until all points lie within a window. When multiple sliding windows overlap, the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside.
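The loop for a single sliding window can be sketched in NumPy. This toy version uses one invented Gaussian blob and a flat circular kernel; a full implementation would run many windows and merge near-duplicate centers:

```python
import numpy as np

def mean_shift_one_window(X, start, radius, max_iters=100):
    """Shift one circular window to the mean of the points it contains,
    repeating until it stops moving (a dense area has been found)."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iters):
        in_window = X[np.linalg.norm(X - center, axis=1) <= radius]
        new_center = in_window.mean(axis=0)
        if np.allclose(new_center, center):  # no shift increases density
            break
        center = new_center
    return center

rng = np.random.default_rng(0)
X = rng.normal(loc=[3.0, 3.0], scale=0.4, size=(200, 2))  # one dense blob

# Start the window off-center; it hill-climbs toward the dense region.
center = mean_shift_one_window(X, start=[2.5, 2.5], radius=1.0)
print(center)  # ends up near the blob center around (3, 3)
```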
An illustration of the entire process from end to end with all of the sliding windows is shown below. Each black dot represents the centroid of a sliding window and each grey dot is a data point.
The entire process of Mean-Shift Clustering
In contrast to K-Means clustering, there is no need to select the number of clusters, as mean-shift discovers this automatically. That’s a massive advantage. The fact that the cluster centers converge towards the points of maximum density is also quite desirable, as it is intuitive to understand and fits a naturally data-driven notion of a cluster. The drawback is that the selection of the window size/radius “r” can be non-trivial.