Interpreting the Result of the K-means Algorithm

machine_learning

#1

Hi All,

I need your help understanding the result of the K-means algorithm, especially what the percentage at the bottom of this output tells us:

Within cluster sum of squares by cluster:
[1] 23.15862 17.33362 149.25899
(between_SS / total_SS = 68.2 %)


#2

If we calculate the sum of squared distances of each observation to the overall sample average, we get total_SS.

If, instead of a single overall average, we compute one average per cluster (your output lists three within-cluster values, so you have 3 clusters) and then calculate the sum of squared distances of these three averages to the overall average, we get between_SS. (In this sum, the squared distance of each cluster average to the overall average is multiplied by the number of observations in that cluster.)

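The 68.2 % is the share of the total variance in the data set that is explained by the clustering. The remaining 31.8 % is the variation left inside the clusters: the three within-cluster values in your output add up to about 189.75, and total_SS is between_SS plus that sum.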
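To make the definitions above concrete, here is a minimal sketch in plain Python. The points and the cluster labels are made up for illustration (they are not your data), and the helper names `mean`, `sq_dist` and `ss_decomposition` are my own. As far as I know, R's `kmeans` object stores the same quantities as `totss`, `withinss` and `betweenss`.

```python
# Toy 2-D data with a hypothetical cluster assignment (not the poster's data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),   # cluster 0
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),   # cluster 1
          (1.0, 9.0), (0.0, 8.0)]               # cluster 2
labels = [0, 0, 0, 1, 1, 1, 2, 2]


def mean(pts):
    """Coordinate-wise average of a list of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ss_decomposition(points, labels):
    """Return (total_SS, list of within_SS per cluster, between_SS)."""
    overall = mean(points)
    total_ss = sum(sq_dist(p, overall) for p in points)

    # Group the observations by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    within_ss = []
    between_ss = 0.0
    for lab in sorted(clusters):
        members = clusters[lab]
        centre = mean(members)
        within_ss.append(sum(sq_dist(p, centre) for p in members))
        # Weight each centre's squared distance by its cluster size.
        between_ss += len(members) * sq_dist(centre, overall)
    return total_ss, within_ss, between_ss


total_ss, within_ss, between_ss = ss_decomposition(points, labels)
ratio = between_ss / total_ss
print(f"within_SS by cluster: {within_ss}")
print(f"between_SS / total_SS = {100 * ratio:.1f} %")
```

On any data set and any labelling, `total_ss` should equal `sum(within_ss) + between_ss` up to floating-point error, which is why the percentage can be read as "explained by the clustering".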


#3

Thanks Mathan for the reply. What exactly does the 68.2% tell us here?


#4

@sirishan: I have edited my answer.


#5

Thank you Mathan. It helped.

If the percentage is high, can we conclude that it is a good clustering?


#6

Hey sirishan,

In clustering, the goal is usually to get high similarity within each group and low similarity between groups.
Let’s translate that into statistical terms:
high similarity within a group = low variance within the cluster, i.e. a low within_SS.
low similarity between groups = high variance between the clusters, i.e. a high between_SS.

Now, let’s say you compute all the variance in the data, and call it total_SS.
In an optimal clustering, the clusters are very different from each other, so most of the total variance is explained by the variance between the groups. And since the variance within each group is very small, it accounts for only a small fraction of the total variance in the data.

In summary - your objective is to maximize between_SS/total_SS.
However, if you choose the number of clusters to be the same as the number of observations, then total_SS is exactly equal to between_SS (can you tell why?) and the desired ratio will be 1 (or 100%).

So, to get a high percentage you can just increase the number of clusters, but then you miss the point of clustering. One way to deal with this is to use the elbow method https://en.wikipedia.org/wiki/Elbow_method_(clustering) to choose a reasonable number of clusters.
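If you want to see this effect for yourself, here is a rough sketch in plain Python. It is not R's `kmeans`: the `kmeans` function below is a bare-bones Lloyd's algorithm I wrote for illustration, and the data are three made-up blobs. It prints between_SS/total_SS for a range of k, so you can look for the point where the gain flattens out (the "elbow").

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(pts):
    """Coordinate-wise average of a list of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))


def kmeans(points, k, iters=50, seed=0):
    """Bare-bones Lloyd's algorithm (illustrative); returns one label per point."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # k distinct observations as starting centres
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                  for p in points]
        # Update step: move each centre to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centres.append(mean(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def explained_ratio(points, labels):
    """between_SS / total_SS, using total_SS = within_SS + between_SS."""
    overall = mean(points)
    total_ss = sum(sq_dist(p, overall) for p in points)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        centre = mean(members)
        within_ss += sum(sq_dist(p, centre) for p in members)
    return (total_ss - within_ss) / total_ss


# Made-up data: three Gaussian blobs of 10 points each.
rng = random.Random(42)
blob_centres = [(0.0, 0.0), (6.0, 6.0), (0.0, 6.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in blob_centres for _ in range(10)]

ratios = {k: explained_ratio(points, kmeans(points, k))
          for k in range(1, len(points) + 1)}
for k in (1, 2, 3, 4, 5, 10, len(points)):
    print(f"k = {k:2d}  between_SS / total_SS = {100 * ratios[k]:5.1f} %")
```

With one cluster the ratio is 0, and with as many clusters as observations it is 1, because every point is then its own centre. In between, the ratio tends to climb quickly at first and then level off; the k where it levels off is the usual elbow choice.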

Hope this helps.


#7

Also, a quick silhouette plot helps to visualize how well each observation fits its own cluster compared with the nearest other cluster!
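For reference, the numbers behind a silhouette plot can be sketched in a few lines of plain Python. The data and labels below are made up, and `silhouette_values` is my own helper name, not a library function. For each observation it compares a (the mean distance to the other members of its own cluster) with b (the smallest mean distance to the members of any other cluster).

```python
import math

# Toy 2-D data with a hypothetical cluster assignment.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
          (1.0, 9.0), (0.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1, 2, 2]


def silhouette_values(points, labels):
    """Silhouette value (b - a) / max(a, b) for every observation.

    Assumes at least two clusters and that not all points coincide.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so dividing by len - 1 is right).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        values.append((b - a) / max(a, b))
    return values


sil = silhouette_values(points, labels)
for p, lab, s in zip(points, labels, sil):
    print(f"cluster {lab}  point {p}  silhouette = {s:.2f}")
print(f"average silhouette = {sum(sil) / len(sil):.2f}")
```

Values near 1 mean the observation sits comfortably in its cluster, values near 0 mean it lies between two clusters, and negative values suggest it may be in the wrong one. A silhouette plot is just these values drawn as bars, grouped by cluster.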