How to find starting centroid and value of k in K means

k-means, clustering

#1

I am currently studying K-means and I am unsure how to find the starting centroids and the value of k so that I get the best clustering result.

For example, how can I select the starting centroids for the blue and red groups so that the points are clustered correctly?


#2

@harry,

In K-means, each cluster has its own centroid. The sum of squared distances between a cluster's centroid and the data points in that cluster is the within sum of squares for that cluster, and adding these values over all clusters gives the total within sum of squares for the cluster solution. This total keeps decreasing as the number of clusters increases, but if you plot it against k you will usually see it drop quite sharply up to some value of k and much more slowly after that. The last value of k that gave a sharp decrease (the "elbow" of the curve) is then the most plausible value of k.
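To make the definition concrete, here is a minimal sketch that computes the within sum of squares by hand and compares it with what kmeans() reports. It uses R's built-in iris data purely as a stand-in (an assumption, since the Wholesale file is not available here); the choice of 3 clusters and the seed are arbitrary.

```r
# Illustration only: built-in iris data, not the Wholesale data from this post
X <- scale(iris[, 1:4])          # standardize the four numeric columns

set.seed(1)                      # arbitrary seed, only for reproducibility
fit <- kmeans(X, centers = 3, nstart = 10)

# Within sum of squares per cluster: squared distances of the
# cluster's points to that cluster's centroid, summed up
wss_by_hand <- sapply(1:3, function(j) {
  pts <- X[fit$cluster == j, , drop = FALSE]
  sum(sweep(pts, 2, fit$centers[j, ])^2)   # sweep() subtracts the centroid from each row
})

# Compare with the values stored in the kmeans result
all.equal(wss_by_hand, fit$withinss, check.attributes = FALSE)
all.equal(sum(wss_by_hand), fit$tot.withinss)
```

The per-cluster values correspond to fit$withinss and their sum to fit$tot.withinss, which is the quantity plotted against k further down.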

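Wholesale <- read.csv(file="c:/TheDataIWantToReadIn.csv", header=TRUE, sep=",")
# Assumption: Wholesale.New is not defined in this post; here it is taken to be the numeric columns of Wholesale
Wholesale.New <- Wholesale[sapply(Wholesale, is.numeric)]
Wholesale.New.Std <- scale(Wholesale.New)  # Standardize the variables so that no single variable dominates the distances
# Run K-means for 2 to 15 clusters and record the total within sum of squares of each solution
Wss <- sapply(2:15, function(k) kmeans(Wholesale.New.Std, k, nstart=40)$tot.withinss)
plot(2:15, Wss, type="l", xlab="Number of clusters k", ylab="Total within sum of squares")  # Plot the required graph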

In the plot above, you can see that there is a sharp change at cluster number 5. Hence, a five-cluster solution could be a good solution.
The code for getting the five-cluster solution is given below.
Model=kmeans(Wholesale.New.Std,5,nstart=40)

Once clustering is done, each data point is assigned to a cluster. In the code above, the extra argument nstart is used because k-means can deliver a different result when the starting centroid positions change. With nstart=40, kmeans starts from 40 different random sets of centroids and keeps the solution with the lowest total within sum of squares. This makes the result much more consistent across runs and more likely to be close to the global optimum, although it does not guarantee it.
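As a rough sketch of how much the starting centroids matter, the code below compares single-start runs with one multi-start run, and also shows that kmeans() accepts an explicit matrix of starting centroids. The built-in iris data again stands in for the Wholesale data (an assumption), and the seed and the rows picked as starting centroids are arbitrary choices for illustration.

```r
# Illustration only: built-in iris data stands in for the Wholesale data
X <- scale(iris[, 1:4])

# 20 independent runs, each from a single random set of starting centroids
set.seed(42)                     # arbitrary seed
single <- replicate(20, kmeans(X, centers = 5, nstart = 1)$tot.withinss)
range(single)                    # a nonzero spread means the result depends on the start

# One call that tries 40 random starts and keeps the best of them
multi <- kmeans(X, centers = 5, nstart = 40)
multi$tot.withinss

# Starting centroids can also be supplied explicitly, one row per cluster
start <- X[c(1, 51, 101), ]      # hypothetical choice: three distinct rows of the data
fixed <- kmeans(X, centers = start)
fixed$centers                    # final centroids after the algorithm has run
table(fixed$cluster)             # number of points assigned to each cluster
```

In practice, letting nstart pick among many random starts is usually easier than choosing starting centroids by hand, unless you already have a good idea of where the clusters lie.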

Hope this helps!

Regards,
Imran