In unsupervised learning, we have no target or outcome variable to predict or estimate. Instead, the algorithm is used for clustering a population into different groups, which is widely applied to segmenting customers into groups for specific interventions. Examples of unsupervised learning: the Apriori algorithm, k-means. Unsupervised learning is very important, because most of the time, the data you get in the real world doesn't come with little flags attached that tell you the correct answer.
When we look at this type of data, it looks like there are clumps, or clusters, in it. And if we could identify those clusters, we could maybe say something about a new, unknown data point based on what its neighbours look like.
This is called unsupervised learning.
The most basic clustering algorithm, and by far the most widely used, is k-means.
In k-means, we randomly place the cluster centers; our first initial guess is as shown in the picture above.
The red points are the data points, and the green ones are the assumed centers.
These are obviously not the correct cluster centers, so we're not done yet.
k-means operates in two steps.
Assignment: we divide the points between the two centers based on distance. For example, we assign to cluster one the points that are closer to center one than to center two.
Optimization: we are now free to move the cluster centers, and we move each one to minimize the total quadratic (squared) distance from the center to its assigned points; that minimizing position is the mean of the assigned points.
We repeat these two steps iteratively until the assignments stop changing, i.e. until the algorithm converges.
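The two alternating steps above can be sketched in plain NumPy. This is a minimal sketch, not scikit-learn's implementation: the `kmeans` function and its `init_centers` parameter are made up for illustration, and empty clusters are not handled.

```python
import numpy as np

def kmeans(points, k, init_centers=None, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate the assignment and optimization steps."""
    rng = np.random.default_rng(seed)
    if init_centers is None:
        # Initial random guess: k distinct data points serve as the centers.
        centers = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
    for _ in range(n_iter):
        # Assignment: each point joins its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Optimization: move each center to the mean of its assigned points,
        # which minimizes the total squared distance for that assignment.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, labels
```

With two well-separated blobs of points and a reasonable starting guess, the centers converge to the blob means after a couple of iterations.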
class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=2)
clf.fit(features)             # learn the cluster centers from the data
pred = clf.predict(features)  # assign each point to its nearest center
Important parameters of k-means:
1. n_clusters: The default value is eight. The number of clusters is something we need to set ourselves, based on what we think makes sense for the data.
2. max_iter: Its default value is 300. max_iter caps how many iterations of the two-step loop (assign each point to a centroid, then move the centroid) the algorithm runs while finding the clusters.
3. n_init: The number of different initializations. k-means has the challenge that, depending on exactly what the initial conditions are, you can end up with different clusterings. So you want to run the algorithm several times: any single run might land in a poor clustering, but the best result over all the runs will generally make sense. That's what this parameter controls: how many times the algorithm is initialized and run, with the best run (by total squared distance, the inertia) kept. By default it runs ten times.
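Putting the three parameters together, here is a short sketch with scikit-learn (the two-blob data set is made up for illustration, and the parameter values shown are just the defaults made explicit):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of synthetic data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

km = KMeans(
    n_clusters=2,   # we choose the number of clusters ourselves
    max_iter=300,   # cap on assign/move iterations per run
    n_init=10,      # ten random initializations; the best run is kept
    random_state=0,
)
labels = km.fit_predict(X)
print(km.cluster_centers_)  # one center near each blob
print(km.inertia_)          # total squared distance, used to pick the best run
```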
Limitations of k-means clustering: local minima
Given a fixed data set and a fixed number of cluster centers, running k-means does not always produce the same result. K-means is what's called a hill-climbing algorithm, and as a result it is very dependent on where we place the initial cluster centers.
Here the same points are divided into different clusterings, even though we used three centers in both cases, simply because the centers started at different places. So it is important to avoid bad local minima when forming clusters.
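This local-minimum behaviour can be reproduced deterministically. The sketch below (plain NumPy; the `lloyd` helper, the data, and the two starting configurations are made up for illustration) clusters three tight blobs twice: once starting with one center per blob, and once starting with two centers dropped into the same blob. Both runs converge, but to different clusterings with very different total squared distances.

```python
import numpy as np

def lloyd(points, centers, n_iter=50):
    """Plain k-means iterations from a given set of initial centers."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # assignment step
        centers = np.array([points[labels == j].mean(axis=0)
                            for j in range(len(centers))])  # optimization step
    # Total squared distance of every point to its final center.
    inertia = (np.linalg.norm(points - centers[labels], axis=1) ** 2).sum()
    return labels, inertia

# Three tight, well-separated blobs: four points around each fixed center.
blob = np.array([[-.1, -.1], [-.1, .1], [.1, -.1], [.1, .1]])
points = np.vstack([blob + c for c in [(0, 0), (10, 0), (5, 8)]])

good = [(0, 0), (10, 0), (5, 8)]   # one center per blob
bad = [(-.1, 0), (.1, 0), (5, 8)]  # two centers dropped into the same blob
_, inertia_good = lloyd(points, good)
_, inertia_bad = lloyd(points, bad)
```

From the bad start, two centers split the first blob between them while the third center gets stuck between the other two blobs; k-means converges, but to a much worse clustering than from the good start.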