What is "Within cluster sum of squares by cluster" in K-means

r
k-meansclustering

#1

I am working on K-means in R but I am not able to understand the feature “Within cluster sum of squares by cluster” when I look at the model
data(iris)
irisf<-iris
irisf$Species<-NULL
m1<-kmeans(irisf,3)
m1

Within cluster sum of squares by cluster:
[1] 15.15100 39.82097 23.87947
(between_SS / total_SS = 88.4 %)


#2

@hinduja1234,

K-Means algorithm go with minimum sum of squares to identify clusters of data points. Le’s say a data set has n observations of m variables. Here, we first identify initial centers of clusters. To perform this we can follow below steps:

Identify Initial Clusters

  • First identify k clusters, it can be random
  • Identify the significant clusters and this process is iterative. If the distance between the observation and its closest cluster center is greater than the distance between the others closest cluster centers(Cluster 1, Cluster 2 …), then the observation will replace the cluster center depending on which one is closer to the observation.

Allocate Observations to the Closest Cluster
Each observation is allocated to the closest cluster, and the distance between an observation and a cluster is calculated from the Euclidean distance between the observation and the cluster center.Each cluster center will then be updated as the mean for observations in each cluster.
The within-cluster sum of squares is:

We perform this exercise in a loop to find updated cluster centers and allocation of each observation. The iteration will stop when the maximum number of iterations is reached or the change of within-cluster sum of squares in two successive iterations is less than the threshold value. The updated cluster centers for the last iteration are called Final Cluster Centers.

The 88.4 % is a measure of the total variance in your data set that is explained by the clustering. k-means minimize the within group dispersion and maximize the between-group dispersion. By assigning the samples to k clusters rather than n (number of samples) clusters achieved a reduction in sums of squares of 88.4 %.

Hope this helps!

Regards,
Imran


#3

very well explained!!!


#4

in the case of kmeans clustering, does the algorithm calculates the distance between one observation to other observation or one observation to the cluster center.

If the algorithm computes the distance between one observation to other observation is true, I don’t understand why it should compute between observations ? Can someone throw some light on this.
Thanks
Raja


#5

Hi @raja200k,

K Means computes the distance between a cluster centroid and each observation based on which it assigns the observation to the nearest cluster.

You can use the following resource to learn more comprehensively how exactly K means works and its comparison to hierarchical clustering: