I am working on K-means in R, but I am not able to understand the “Within cluster sum of squares by cluster” output when I look at the model:
Within cluster sum of squares by cluster:
 15.15100 39.82097 23.87947
(between_SS / total_SS = 88.4 %)
The K-means algorithm identifies clusters of data points by minimizing the within-cluster sum of squares. Let’s say a data set has n observations of m variables. We first identify the initial cluster centers, following the steps below:
Identify Initial Clusters
- First pick k initial cluster centers; they can be chosen at random.
- Refine these initial centers iteratively. If the distance between an observation and its closest cluster center is greater than the distance between the two closest cluster centers (Cluster 1, Cluster 2, …), the observation replaces one of those two centers, whichever is closer to the observation.
Allocate Observations to the Closest Cluster
Each observation is allocated to the closest cluster, where the distance between an observation and a cluster is the Euclidean distance between the observation and the cluster center. Each cluster center is then updated to the mean of the observations in that cluster.
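The within-cluster sum of squares is:

$$W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of observations in cluster $j$ and $\mu_j$ is its center (the mean of those observations). The inner sum alone is the within-cluster sum of squares for cluster $j$.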
We perform this exercise in a loop to find updated cluster centers and allocation of each observation. The iteration will stop when the maximum number of iterations is reached or the change of within-cluster sum of squares in two successive iterations is less than the threshold value. The updated cluster centers for the last iteration are called Final Cluster Centers.
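The allocate-and-update loop described above can be sketched in a few lines of R. This is only an illustration under some assumptions: `simple_kmeans` is a made-up name, the initial centers are just k random rows, and the built-in `iris` data stands in for a real data set. It is not how R's own `kmeans()` is implemented.

```r
# Minimal sketch of the allocate/update loop; simple_kmeans is a hypothetical helper.
simple_kmeans <- function(x, k, max_iter = 100, tol = 1e-8) {
  x <- as.matrix(x)
  # Assumption: initial centers are k rows picked at random.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  prev_wss <- Inf
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center (n x k).
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Allocate each observation to its closest center.
    cluster <- max.col(-d2, ties.method = "first")
    # Update each center to the mean of its members (an empty cluster keeps its center).
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
    # Within-cluster sum of squares for the current allocation and centers.
    wss <- sum((x - centers[cluster, , drop = FALSE])^2)
    # Stop when the change between successive iterations is below the threshold.
    if (abs(prev_wss - wss) < tol) break
    prev_wss <- wss
  }
  list(cluster = cluster, centers = centers, tot.withinss = wss, iter = iter)
}

set.seed(1)
res <- simple_kmeans(iris[, 1:4], k = 3)
res$centers       # final cluster centers
res$tot.withinss  # total within-cluster sum of squares
```

Because the starting centers are random, different seeds can end in different final centers, which is why `kmeans()` offers the `nstart` argument to try several starts.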
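The 88.4 % is the share of the total variance in your data set that is explained by the clustering. The total sum of squares splits into a within-cluster part and a between-cluster part, so by minimizing the within-cluster dispersion k-means also maximizes the between-cluster dispersion. In your output the within-cluster sums of squares add up to 15.15100 + 39.82097 + 23.87947 = 78.85144, which is the remaining 11.6 % of the total. In other words, assigning the samples to k clusters rather than treating them all as a single cluster reduced the sum of squares by 88.4 %.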
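If it helps, here is a small sketch of where that percentage comes from in the fitted object. The components `withinss`, `tot.withinss`, `betweenss` and `totss` are part of what `kmeans()` returns; the use of the built-in `iris` data and the seed are my assumptions, since the data set was not stated.

```r
# Assumption: the built-in iris measurements stand in for the actual data.
x <- iris[, 1:4]
set.seed(42)
fit <- kmeans(x, centers = 3, nstart = 25)

fit$withinss      # one within-cluster sum of squares per cluster
fit$tot.withinss  # their sum
fit$betweenss     # between-cluster sum of squares
fit$totss         # total sum of squares around the overall mean

# The total splits into the within and between parts.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Total sum of squares by hand: squared deviations from the column means.
total_ss <- sum(scale(x, scale = FALSE)^2)
all.equal(total_ss, fit$totss)

# The percentage shown in the printed model.
ratio <- fit$betweenss / fit$totss
round(100 * ratio, 1)
```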
Hope this helps!
In k-means clustering, does the algorithm calculate the distance from one observation to another observation, or from an observation to the cluster center?
If it computes distances between observations, I don’t understand why it should do so. Can someone shed some light on this?
K-means computes the distance between each observation and each cluster centroid, and assigns the observation to the cluster with the nearest centroid. It does not compute distances between pairs of observations.
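As a toy illustration of that assignment step, with made-up numbers (the observation `obs` and the two centroids `c1` and `c2` are hypothetical):

```r
# Hypothetical observation and centroids in two dimensions.
obs <- c(5.0, 3.4)
centers <- rbind(c1 = c(5.0, 3.5), c2 = c(6.5, 3.0))

# Euclidean distance from the observation to each centroid.
d <- sqrt(rowSums(sweep(centers, 2, obs)^2))
d

# The observation goes to the cluster whose centroid is nearest.
nearest <- names(which.min(d))
nearest  # "c1"
```

No other observation appears in the calculation; only the centroids do.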
You can use the following resource to learn more comprehensively how exactly K means works and its comparison to hierarchical clustering:
I can’t understand how these figures are calculated:
15.15100 39.82097 23.87947.
Can you please explain? Your other comments cleared my doubts, but I can’t understand how the within-cluster sums of squares are calculated.
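One way to see how such figures arise is to recompute them by hand from a fitted model: for each cluster, take its members, subtract the cluster center, square the differences and add them up. A rough sketch, assuming the built-in `iris` data since the data set behind the figures above is not stated:

```r
# Assumption: the built-in iris measurements stand in for the actual data.
x <- as.matrix(iris[, 1:4])
set.seed(42)
fit <- kmeans(x, centers = 3, nstart = 25)

# For each cluster: squared deviations of its members from the cluster center, summed.
manual_withinss <- sapply(seq_len(nrow(fit$centers)), function(j) {
  members <- x[fit$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[j, ])^2)
})

# Side by side with the values stored in the model.
cbind(manual = manual_withinss, kmeans = fit$withinss)
```

There is one number per cluster, so a model with three clusters prints three values.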