Cluster Validation

machine_learning

#1

Hi experts,
I am learning clustering. Please let me know which metrics are used to evaluate clusters, and point me to some resources on cluster validation.

Regards,
tony


#2

@tillutony,

Can you please ask a specific question so that we can guide you in the right direction? It would be great if you could also share the details of your attempt and your current understanding.

Regards,
Faizan


#4

Hi jal faizy,

I have run a cluster analysis on the Iris data set and partitioned the data into three clusters.
Now I would like to understand which cluster is a good one.

fit <- kmeans(iris.features, 3)
fit

K-means clustering with 3 clusters of sizes 38, 62, 50

Cluster means:
  sepal.length sepal.width petal.length petal.width
1     6.850000    3.073684     5.742105    2.071053
2     5.901613    2.748387     4.393548    1.433871
3     5.006000    3.418000     1.464000    0.244000

As I understand it, each row is the mean of each feature over all the objects within that cluster (the cluster centre).

Clustering vector:
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[38] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
[112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
[149] 1 2

Within cluster sum of squares by cluster:

[1] 23.87947 39.82097 15.24040

(between_SS / total_SS = 88.4 %)

I learnt that fit$withinss is the most important metric for indicating how good a cluster is. Is that correct?

From the values [1] 23.87947 39.82097 15.24040, how do we conclude or validate which cluster is good? Please let me know.

Regards,
tony


#5

Hi @tillutony,

withinss gives you, for each cluster, the sum of squared Euclidean distances from the points to the cluster mean. In plain words, it measures the dispersion of the points in each cluster, which you can read as how compact the cluster is. Because it is a sum rather than an average, larger clusters tend to have larger values.
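
To make that concrete, here is a rough sketch that recomputes withinss by hand. It assumes R's built-in iris data frame in place of your iris.features, so the numbers may differ slightly from yours, and the seed and nstart = 25 are arbitrary choices of mine.

```r
# Assumption: R's built-in iris data stands in for iris.features.
features <- iris[, 1:4]

set.seed(42)                       # arbitrary seed, only for reproducibility
fit <- kmeans(features, centers = 3, nstart = 25)

# Sum of squared Euclidean distances to the cluster mean, one value per cluster.
manual_withinss <- sapply(seq_len(3), function(k) {
  members <- as.matrix(features[fit$cluster == k, , drop = FALSE])
  centred <- sweep(members, 2, fit$centers[k, ])   # subtract the cluster mean
  sum(centred^2)
})

print(manual_withinss)
print(fit$withinss)                # should agree with the manual values

# Dividing by cluster size gives an average that is easier to compare
# across clusters of different sizes.
print(fit$withinss / fit$size)

# The percentage shown in the printout: between_SS / total_SS.
print(100 * fit$betweenss / fit$totss)
```

A smaller per-point value means a tighter cluster, but on its own it says nothing about whether the grouping is meaningful.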

This does not tell you whether the clustering is relevant to the question you have. If your aim is to recover the Iris species using k-means, you should check whether the observations in each cluster match the species, for example whether the majority of cluster one is of type setosa.

Hope this helps.
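
A minimal way to do that check, again assuming the built-in iris data frame (whose Species column holds the true labels) rather than your own iris.features object. The seed and nstart value are arbitrary.

```r
# Assumption: built-in iris data; Species holds the true labels.
set.seed(42)                       # arbitrary seed
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Rows are clusters, columns are species.
confusion <- table(cluster = fit$cluster, species = iris$Species)
print(confusion)

# Purity: share of observations that belong to the majority species
# of their cluster (1 would mean every cluster holds a single species).
purity <- sum(apply(confusion, 1, max)) / sum(confusion)
print(purity)
```

A row dominated by one species means that cluster matches the species well; a row split between two species shows where k-means mixes them up.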

Alain