Clustering Ordinal/Categorical data

r
machine_learning

#1

Hi,

I am working on a dataset to identify clusters of people based on their ratings on a Likert scale (1-5), i.e. Strongly disagree to Strongly agree. It consists of 1000 observations and 19 features, all measured on the same scale. I am trying to find answers to the following questions:

a) Is normalization necessary/mandatory before measuring dissimilarity?

b) Which similarity/dissimilarity metric should be applied here for hierarchical clustering, e.g. Euclidean, Manhattan, Gower, etc.? And what does a correlation-based distance measure mean?

c) Can the kmeans function in R cluster this data with its default metric? If not, what is the alternative?

d) What is the best way to do this in R?


#2

Clustering in this case does not require normalization, as the scale is the same for all variables.
Normalization is only required when the features are on different scales, e.g. age and income: there is a huge difference in scale between these two features, which can distort the Euclidean distances and hence produce misleading clusters.
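
To make the scale point concrete, here is a rough sketch with made-up numbers (the `people` and `likert` data frames are hypothetical, not from your dataset). It only illustrates how one large-scale column can swamp a Euclidean distance:

```r
# Hypothetical toy data: three people, two features on very different scales
people <- data.frame(
  age    = c(25, 60, 26),
  income = c(50000, 50500, 58000)
)

# Raw Euclidean distances: the income column dominates
d_raw <- as.matrix(dist(people))
d_raw[1, 2]  # 35 years apart, yet the distance is close to the income gap of 500
d_raw[1, 3]  # 1 year apart, distance is close to the income gap of 8000

# Standardising each column (mean 0, sd 1) lets both features contribute
d_scaled <- as.matrix(dist(scale(people)))

# Likert items already share the same 1-5 range, so no single item dominates
likert <- data.frame(q1 = c(1, 5, 2), q2 = c(4, 4, 1))
d_likert <- as.matrix(dist(likert))
d_likert[1, 2]  # differs only on q1, by 4 points
```

In the raw version the 35-year age gap barely moves the distance, which is the distortion described above.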

I think you should perform conjoint analysis to figure out which features users prefer most, and use the highly preferred features for clustering.
You can also use PCA or t-SNE for dimensionality reduction and then cluster on the reduced data.
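
A minimal sketch of the PCA route in base R, on simulated data standing in for your survey. The 80% variance cut-off and `centers = 3` are assumptions for illustration, not recommendations, and note that PCA treats the 1-5 ratings as plain numbers:

```r
set.seed(42)

# Simulated stand-in for the survey: 1000 respondents, 19 Likert items (1-5)
ratings <- as.data.frame(
  matrix(sample(1:5, 1000 * 19, replace = TRUE), ncol = 19)
)

# PCA on the centred ratings; no rescaling since all items share one scale
pca <- prcomp(ratings, center = TRUE, scale. = FALSE)

# Keep enough components to reach an assumed 80% of the total variance
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(var_explained >= 0.80)[1]
scores <- pca$x[, 1:n_comp, drop = FALSE]

# k-means on the component scores; the number of clusters is assumed
km <- kmeans(scores, centers = 3, nstart = 25)
table(km$cluster)
```

On real data you would inspect `var_explained` and try several values of `centers` before settling on anything.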


#3

@Dhillon, I agree with not scaling the features. If I continue with the cluster analysis without PCA, which distance metric should I select for hierarchical clustering: Euclidean, Manhattan or Gower?
Also, which function should I choose: kmeans(), kmodes() or something else? I understand kmeans() is not the ideal choice as it doesn't support ordinal data.
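
For reference, this is roughly what I had in mind for the Gower route, using the `cluster` package on simulated data (200 rows to keep it quick). The average linkage and `k = 3` are placeholders I picked, and I am not sure this is the right approach:

```r
library(cluster)
set.seed(1)

# Simulated stand-in for my data: 200 respondents, 19 Likert items (1-5)
ratings <- as.data.frame(
  matrix(sample(1:5, 200 * 19, replace = TRUE), ncol = 19)
)

# Store each item as an ordered factor so daisy() treats it as ordinal
ratings[] <- lapply(ratings, factor, levels = 1:5, ordered = TRUE)

# Gower dissimilarity between respondents
d <- daisy(ratings, metric = "gower")

# Hierarchical clustering on the dissimilarities, cut into 3 groups
hc <- hclust(d, method = "average")
hc_groups <- cutree(hc, k = 3)

# PAM (k-medoids) as a possible kmeans() alternative; it accepts
# a dissimilarity object directly
pm <- pam(d, k = 3, diss = TRUE)

# Cross-tabulate the two solutions
table(hc_groups, pm$clustering)
```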