The K-Means algorithm uses the minimum sum of squares criterion to identify clusters of data points. Let's say a data set has n observations of m variables. We first identify the initial cluster centers, which we can do with the steps below:
Identify Initial Clusters
- First identify k cluster centers; these can be chosen at random.
- Identify the significant clusters; this process is iterative. If the distance between an observation and its closest cluster center is greater than the distance between the two closest cluster centers (Cluster 1, Cluster 2, …), the observation replaces one of those two centers, whichever is closer to the observation.
Allocate Observations to the Closest Cluster
Each observation is allocated to the closest cluster, where the distance between an observation and a cluster is the Euclidean distance between the observation and the cluster center. Each cluster center is then updated to the mean of the observations in that cluster.
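The within-cluster sum of squares is:

$$W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$$

where C_j is the set of observations allocated to cluster j and c_j is the center (mean) of that cluster.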
We repeat these two steps in a loop, updating the cluster centers and the allocation of each observation. The iteration stops when the maximum number of iterations is reached or when the change in the within-cluster sum of squares between two successive iterations is less than a threshold value. The cluster centers from the last iteration are called the Final Cluster Centers.
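The loop described above can be sketched in plain Python. This is only an illustration, not the implementation of any particular package: the function names (`squared_distance`, `kmeans`), the parameters `max_iter`, `tol` and `seed`, the random choice of initial centers, and the sample `points` are all assumptions made for the example.

```python
import math
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means: returns (centers, labels, within-cluster sum of squares)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct observations picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    previous_wss = math.inf
    labels, wss = [], math.inf
    for _ in range(max_iter):
        # Allocation step: each observation goes to its closest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Within-cluster sum of squares for the current allocation.
        wss = sum(
            squared_distance(p, centers[label])
            for p, label in zip(points, labels)
        )
        # Stop when the improvement falls below the threshold.
        if previous_wss - wss < tol:
            break
        previous_wss = wss
    return centers, labels, wss


# Made-up data: two well-separated groups of three observations each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, wss = kmeans(points, k=2)
print(centers, labels, wss)
```

The centers returned after the last pass play the role of the Final Cluster Centers. Note that a different `seed` can give different cluster numbering, and on harder data a different local optimum.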
The 88.4 % is a measure of the total variance in your data set that is explained by the clustering. K-means minimizes the within-group dispersion, which, because the total dispersion is fixed, also maximizes the between-group dispersion. By assigning the samples to k clusters rather than leaving them all in a single cluster, you achieved a reduction in the sum of squares of 88.4 %.
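To see where a percentage like this comes from, here is a small sketch that computes the ratio of between-cluster to total sum of squares for a given allocation. It assumes the reported figure is that ratio (R's `kmeans` output, for example, prints it as `between_SS / total_SS`). The helper names and the data are made up for illustration and will not reproduce your 88.4 %.

```python
def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in points)


def explained_ratio(points, labels):
    """(total SS - within SS) / total SS for a given cluster allocation."""
    # Total SS: every observation measured against the overall mean,
    # i.e. the dispersion if all the data formed a single cluster.
    total_ss = sum_of_squares(points, mean_point(points))
    # Within SS: each observation measured against its own cluster mean.
    within_ss = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        within_ss += sum_of_squares(members, mean_point(members))
    return (total_ss - within_ss) / total_ss


# Hypothetical data and allocation: two groups of three observations.
sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
allocation = [0, 0, 0, 1, 1, 1]
ratio = explained_ratio(sample, allocation)
print(f"{100 * ratio:.1f} %")
```

A value close to 100 % means the clusters are tight relative to the overall spread of the data; putting every observation in its own cluster would trivially give 100 %, so the number is only meaningful for a sensible k.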
Hope this helps!