How many clusters are enough in k-means clustering?

k-means clustering

#1

Is there a set formula for this, or does it depend on the data?


#2

We use the elbow method to find k.


#3

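# Load the data and keep columns 4 and 5 as the clustering features
dataset = read.csv('Mall_Customers.csv')
dataset = dataset[4:5]
set.seed(6)
# Total within-cluster sum of squares (WCSS) for k = 1..10
wcss = numeric(10)
for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)
# Plot WCSS against k and look for the bend in the curve
plot(1:10,
     wcss,
     type = 'b',
     main = 'The Elbow Method',
     xlab = 'Number of clusters',
     ylab = 'WCSS')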

Plot the WCSS (within-cluster sum of squares) against the number of clusters and look for the point after which the curve flattens and only descends gradually. That point is your k.
This is known as the elbow method.


#4

Although I'm no expert, I think you should go for the 'elbow' method. In this method, you plot SSE against k and choose the k at which the decrease in SSE slows abruptly, i.e. the bend in the curve.
Sometimes the rule of thumb k = \sqrt{n/2} is also used, where n is the number of data points.
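
To make the rule of thumb concrete, here is a tiny sketch. The helper name `rule_of_thumb_k` is made up for illustration, and rounding to the nearest whole number is my own assumption, since the rule itself does not say how to round.

```r
# Rule of thumb: k is roughly sqrt(n / 2), rounded to a whole number.
# rule_of_thumb_k is a hypothetical helper, not part of any package.
rule_of_thumb_k <- function(n) max(1, round(sqrt(n / 2)))

rule_of_thumb_k(200)   # sqrt(100) = 10
rule_of_thumb_k(1000)  # sqrt(500) is about 22.4, so 22
```

Treat the result as a starting point to check against the elbow plot, not as a final answer.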

Remember that the 'elbow' method doesn't always work. There may be more than one elbow, or no elbow at all, but give it a try. You may also try the average silhouette method.
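
A minimal sketch of the average silhouette method, in case it helps. It assumes the `cluster` package is available (it normally ships with R) and uses synthetic blob data as a stand-in, not the dataset from this thread. The idea is to pick the k with the highest average silhouette width.

```r
library(cluster)  # provides silhouette()

set.seed(6)
# Synthetic stand-in data: three 2-D blobs (an assumption, not the thread's dataset)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5))
)
d <- dist(x)  # pairwise Euclidean distances, computed once

ks <- 2:8  # silhouette width is not defined for k = 1
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  # Average silhouette width over all points for this k
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

best_k <- ks[which.max(avg_sil)]  # k with the highest average width
plot(ks, avg_sil, type = 'b',
     xlab = 'Number of clusters', ylab = 'Average silhouette width')
```

On well-separated blobs like these I would expect the peak at the true number of groups, but on real data the curve can be much flatter, so it is worth comparing with the elbow plot.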


#5

Hi @p22,

In k-means, each cluster has its own centroid. The sum of the squared distances between the centroid and the data points within a cluster is the within-cluster sum of squares for that cluster. Adding these values across all clusters gives the total within-cluster sum of squares for the clustering solution.
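
Written out as a formula, the description above looks like this. The symbols are my own notation, not from the thread: C_j is the set of points in cluster j, \mu_j is its centroid, and W(k) is the total for a solution with k clusters.

```latex
W_j = \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
W(k) = \sum_{j=1}^{k} W_j
```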

As the number of clusters increases, this value keeps decreasing, but if you plot the result you may see that the sum of squared distances decreases sharply up to some value of k and then much more slowly after that.

That value of k, where the curve bends, is a good choice for the number of clusters.
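
Here is a rough sketch of the description above. It uses synthetic blob data as a stand-in (an assumption, not the dataset from this thread), and the helper `manual_wcss` is a made-up name. It computes the total within-cluster sum of squares by hand, checks it against the `tot.withinss` value that `kmeans()` reports, and prints how much the total drops with each extra cluster.

```r
set.seed(6)
# Synthetic stand-in data: three 2-D blobs (an assumption, not the thread's dataset)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5))
)

# WCSS by hand: squared distance of every point to its own cluster's centroid.
# manual_wcss is a hypothetical helper written for this example.
manual_wcss <- function(x, km) {
  sum((x - km$centers[km$cluster, , drop = FALSE])^2)
}

ks <- 1:8
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 10))
wcss <- sapply(fits, function(km) km$tot.withinss)
by_hand <- sapply(fits, function(km) manual_wcss(x, km))

# The hand computation should agree with the total reported by kmeans()
print(all.equal(wcss, by_hand))
# Drop in WCSS from k to k + 1; the elbow is where these drops become small
print(round(-diff(wcss), 1))
```

The printed drops are what the elbow plot shows visually: large at first, then small once extra clusters only split groups that already belong together.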