Hey sirishan,

In clustering, the goal is usually to get high similarity within each group, and low similarity between groups.

Let’s translate that into statistical terms:

high similarity within a group = low variance within each cluster, i.e. a small within_SS.

low similarity between groups = high variance between the clusters, i.e. a large between_SS.

Now, suppose you compute the total variance in the data and call it total_SS.

In a good clustering the clusters are very different from each other, so most of the total variance is explained by the variance between the groups. And since the variance within each group is small, it accounts for only a small fraction of the total variance in the data.

In summary - your objective is to maximize between_SS/total_SS.
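To make this concrete, here is a small numpy sketch (the toy data and labels are made up for illustration) that computes all three quantities and checks the identity total_SS = within_SS + between_SS:

```python
import numpy as np

# Toy data and an arbitrary 3-cluster assignment (assumed, just for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)

grand_mean = X.mean(axis=0)
total_ss = ((X - grand_mean) ** 2).sum()  # total variance (sum of squares)

within_ss = 0.0
between_ss = 0.0
for k in np.unique(labels):
    cluster = X[labels == k]
    centroid = cluster.mean(axis=0)
    # spread of the points around their own cluster center
    within_ss += ((cluster - centroid) ** 2).sum()
    # spread of the cluster centers around the grand mean (weighted by size)
    between_ss += len(cluster) * ((centroid - grand_mean) ** 2).sum()

# The ANOVA-style decomposition: total_SS = within_SS + between_SS,
# so maximizing between_SS/total_SS is the same as minimizing within_SS.
ratio = between_ss / total_ss
```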

However, if you choose the number of clusters to be equal to the number of observations, then total_SS is exactly equal to between_SS (can you tell why?) and the desired ratio will be 1 (or 100%).
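You can verify this extreme case numerically (again with assumed toy data): give every observation its own cluster, and the ratio comes out as exactly 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
labels = np.arange(len(X))  # one cluster per observation

grand_mean = X.mean(axis=0)
total_ss = ((X - grand_mean) ** 2).sum()

# Each "cluster" is a single point, so its centroid is the point itself
# and every within-cluster sum of squares is 0.
between_ss = sum(
    len(X[labels == k]) * ((X[labels == k].mean(axis=0) - grand_mean) ** 2).sum()
    for k in np.unique(labels)
)

ratio = between_ss / total_ss  # equals 1 up to floating-point error
```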

So, in order to get a high percentage you could just increase the number of clusters - but then you defeat the purpose of clustering. A way to deal with this is to use the elbow method https://en.wikipedia.org/wiki/Elbow_method_(clustering) to choose a reasonable number of clusters.
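The elbow method can be sketched like this: run the clustering for several values of k, record within_SS for each, and look for the k after which the curve flattens out. The mini k-means (Lloyd's algorithm) and the blob data below are my own illustration, not a particular library's implementation; in practice you'd typically use something like scikit-learn's KMeans and its inertia_ attribute.

```python
import numpy as np

# Three well-separated blobs (assumed toy data), so the "true" k is 3.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

def kmeans_within_ss(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means; returns the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))

wss = [kmeans_within_ss(X, k) for k in range(1, 7)]
# Plot k against wss and look for the "elbow": the drop in within_SS
# should flatten sharply after k = 3 for this data.
```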

Hope this helps.