Clustering techniques for mixed numeric and categorical variables

cluster
k-meansclustering
hierarchical_cluster
clustering

#1

I want to perform clustering on mixed variables in R. Kindly guide me on this. The dataset consists of >100,000 observations and almost 20 variables.


#2

@nehak,

The easiest technique would be to convert the categorical variables into numeric variables with magnitudes similar to the numeric values, and then perform clustering.

You might need to play a bit with the scaling before you can zero in on a particular result.
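
A minimal sketch of this approach, assuming a hypothetical mixed data frame df (the column handling and centers = 4 are placeholders to adapt to your data):

# convert factor/character columns to integer codes, then scale
# everything so all variables have a comparable magnitude
df.num <- data.frame(lapply(df, function(col) {
  if (is.numeric(col)) col else as.numeric(as.factor(col))
}))
df.scaled <- scale(df.num)
set.seed(42)  # kmeans depends on random initial centres
fit <- kmeans(df.scaled, centers = 4, nstart = 25)
table(fit$cluster)  # cluster sizes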

Regards,
Kunal


#3

Thanks and Happy New Year!


#4

If you have access to SPSS, then I would recommend two-step clustering in SPSS. Based on my experience, I have found it to be pretty effective on mixed data. I am not sure if R has anything similar to SPSS two-step. Hope this helps!


#5

Hello Kunal Sir and Nehak,
Can you please help me with how to convert categorical variables to numeric and then perform clustering? I am very new to data science, so I am a little confused about how to do this.

It would be of great help if you could enlighten me on this topic with some code or an example.

@Nehak - since you have your doubt sorted out, can you post the code here or mail me the code and dataset at rohitn220@gmail.com?

Thanks,


#6

Hi Nehak and Rohit,

The method for mixed clustering (numerical and categorical) is k-modes; if you work in R, look at the klaR package, where the method is implemented. Be careful about the initial conditions. If you want to learn more, check the paper linked below; go straight to the empirical results if you want to skip over the formulas (pretty ones in this paper).

Good luck
Alain

Approximation algorithms for K-Modes clustering
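
A minimal sketch with klaR, assuming a hypothetical data frame cat.data in which the variables are treated as categorical (modes = 3 is an arbitrary placeholder):

library(klaR)
# k-modes clustering: uses simple matching dissimilarity and modes
# instead of means, so it suits categorical variables
fit <- kmodes(cat.data, modes = 3, iter.max = 10)
fit$cluster  # cluster assignment for each observation
fit$modes    # the modal "centroids" of each cluster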


#7

K-modes is more suitable for categorical variables, I suppose?


#8

Hi Nehak,
k-modes works with categorical and numerical variables; in a few words, it works with mixed types. If you do not want k-modes because of its initial-conditions limits, another way is to do as Kunal mentioned, but then you should use a distance other than Euclidean to build your clusters. Check the Mahalanobis distance, and check the NbClust package: the function NbClust() can suggest the number of clusters using multiple distances and indices, which gives you a good insight to start with.
Hope this helps.
Alain


#9

How can I calculate the Mahalanobis distance, and how do I use it? Would hclust() be appropriate for this distance matrix?


#10

Also, what method can I use in NbClust if I use the Mahalanobis distance as the distance matrix?


#11

Hi Nehak,

You first calculate the dissimilarity matrix using the base function mahalanobis(); then in NbClust() you do not set the distance (leave it NULL), since you pass the dissimilarity via the diss argument instead.
Then you have to set the method. I do not know what you want to do, but I guess you would use kmeans or centroid. What is good with NbClust is that you can try multiple indices, not just the one metric you use to test your clusters.

And then you are done. You can go via hclust() as well, using the same dissimilarity matrix, and then prune the tree.

Hope this helps.

Alain
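
A sketch of this pipeline, assuming a numeric matrix newdata as in the later posts. Note that the base function mahalanobis() returns distances of each row to a single centre, not a pairwise matrix, so the pairwise Mahalanobis dissimilarity is built here by whitening the data first (Euclidean distance in the whitened space equals Mahalanobis distance in the original space):

library(NbClust)
# whiten with the inverse Cholesky factor of the covariance
X <- scale(newdata, center = TRUE, scale = FALSE)
U <- chol(cov(X))
diss <- dist(X %*% solve(U))  # pairwise Mahalanobis distances as a "dist" object
# supply the dissimilarity via diss and set distance = NULL
res <- NbClust(data = X, diss = diss, distance = NULL,
               min.nc = 2, max.nc = 6, method = "ward.D2", index = "silhouette")
# the same dissimilarity feeds hclust(), then prune with cutree()
hc <- hclust(diss, method = "ward.D2")
clusters <- cutree(hc, k = res$Best.nc[1])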


#12

Hello @nehak,

For mixed variables many distance measures are available, like "euclidean", "manhattan", "gower", etc., and many clustering methods are available in R via various packages.
K-medoids is robust in the sense that it is not affected by the presence of outliers in the dataset, and the medoids are real data points in the dataset.
That said, converting all variables to numeric and then clustering is often the better option.
We can use the average silhouette width to measure cluster validity; anything above 0.5 is good.

Consider the below example:

We use the fpc package here to do k-medoid clustering and find the optimal number of clusters. In this example the dataset is a mix of numeric and categorical variables.

# load cluster for daisy() and pam(), and fpc for pamk()
library(cluster)
library(fpc)
# use daisy to calculate the dissimilarity matrix
insurance.daisy <- daisy(insurance2)
# use pamk to determine the optimal number of clusters
pamk.best <- pamk(insurance.daisy, krange = 1:15, usepam = TRUE)
cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
plot(pam(insurance.daisy, pamk.best$nc))

As you can see, the average silhouette width here is 0.14, which signifies that the clustering quality is not good.

But if I convert to numeric data and then do the clustering:

# coerce factor columns to integer codes so everything is numeric
insurance2.numeric <- data.frame(data.matrix(insurance2))
# usepam = FALSE makes pamk use clara(), which scales to large data
pamk.best.num <- pamk(insurance2.numeric, krange = 1:4, usepam = FALSE)
plot(pam(insurance2.numeric, pamk.best.num$nc))

Summary:
1. Use k-medoids when you need the centroids to be real data points in the dataset.
2. Convert to numeric for better clustering.

Hope this helps!!


#13

Hi Kunal,

I have converted the categorical variables into numeric and also done the scaling.

Can I simply use kmeans for clustering?


#14

Hi, while running this code I get an error that daisy doesn't exist.
pam in the plot code doesn't exist either.

How should I run the code?


#15

Hi @nehak

Load the package cluster for the daisy function; pam is also part of this package.
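
For example:

# install.packages("cluster")  # once, if not already installed
library(cluster)  # provides daisy(), pam(), and clusplot()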

Hope it helps.

Alain


#16

The plot code is not working; it's giving no output for the plot.


#17

Hi, the plot code in this is not working. I'm not getting any output when running it.


#18

Can you post a few lines of code and the data?
Alain


#19

Here are a few records:

I have calculated the Mahalanobis distance as follows:

mean <- colMeans(newdata)
Sx <- cov(newdata)
D2 <- mahalanobis(newdata, mean, Sx)

Trying to run the further code as:

res <- NbClust(newdata, distance = D2, min.nc = 2, max.nc = 6, method = "kmeans")

I am getting the following error:

Error in NbClust(newdata, distance = D2, min.nc = 2, max.nc = 6, method = "kmeans") :
invalid distance
In addition: Warning message:
In if (is.na(distanceM)) { :
the condition has length > 1 and only the first element will be used


#20

Hi @nehak

You can try this to check if it works on your data set:

library(cluster)

# dissimilarity matrix calculation
daisy.mat <- as.matrix(daisy(your.dataset, metric = "gower"))

# clustering by the pam algorithm; replace 4 with your desired number of clusters
my.cluster <- pam(daisy.mat, k = 4, diss = TRUE)

# cluster plot
clusplot(daisy.mat, my.cluster$clustering, diss = TRUE, color = TRUE)

You may also try the method proposed here, which is a smarter way to perform clustering on mixed and large datasets.

Let me know if it helped.

Thanks,
Debarati.