How to use binary variables in k-means/hierarchical clustering in SAS/R?



I need to use binary variables (values 0 and 1) in k-means, but k-means is designed for continuous variables.
I know some people still feed binary variables into k-means, ignoring the fact that it is designed only for continuous variables. This is unacceptable to me.

  • Questions

1) What is the statistically/mathematically correct way of using binary variables in k-means/hierarchical clustering?
2) How can the solution be implemented in SAS/R?



I have not tried it yet, but I know Latent Class Analysis (LCA) is a way to handle categorical variables (including binary ones) in clustering. Pennsylvania State University has developed a SAS procedure, PROC LCA, that is available for download. Just google it.
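To give a feel for what LCA does with binary variables, here is a minimal sketch in plain Python. It is not PROC LCA and not any package's API: it is a bare-bones EM fit of a latent class model in which each class has its own probability of a 1 on every item. The function name `fit_latent_classes`, the toy data, and the tiny smoothing constant are all assumptions made up for illustration.

```python
import random

def fit_latent_classes(rows, n_classes=2, n_iter=200, seed=0):
    """EM for a latent class model with binary (0/1) indicators.

    rows: list of equal-length lists of 0/1 values.
    Returns (class_weights, item_probs, posteriors).
    """
    rng = random.Random(seed)
    n_items = len(rows[0])
    weights = [1.0 / n_classes] * n_classes
    # Random start for P(item = 1 | class), kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(n_items)]
             for _ in range(n_classes)]
    post = []
    for _ in range(n_iter):
        # E-step: posterior class membership for every row.
        post = []
        for x in rows:
            joint = []
            for k in range(n_classes):
                like = weights[k]
                for j, v in enumerate(x):
                    like *= probs[k][j] if v == 1 else 1.0 - probs[k][j]
                joint.append(like)
            total = sum(joint)
            post.append([p / total for p in joint])
        # M-step: update class sizes and per-class item probabilities.
        for k in range(n_classes):
            nk = sum(r[k] for r in post)
            weights[k] = nk / len(rows)
            for j in range(n_items):
                num = sum(r[k] * x[j] for r, x in zip(post, rows))
                # Assumed small smoothing keeps estimates inside (0, 1).
                probs[k][j] = (num + 1e-3) / (nk + 2e-3)
    return weights, probs, post

# Hypothetical toy data: six yes/no items, two obvious response patterns.
data = (
    [[1, 1, 1, 0, 0, 0]] * 6 + [[1, 1, 0, 0, 0, 0]] * 2 + [[1, 1, 1, 1, 0, 0]] * 2
    + [[0, 0, 0, 1, 1, 1]] * 6 + [[0, 0, 1, 1, 1, 1]] * 2 + [[0, 0, 0, 0, 1, 1]] * 2
)
weights, probs, post = fit_latent_classes(data)
# Assign each row to its most probable class (class numbering is arbitrary).
labels = [max(range(len(p)), key=lambda k: p[k]) for p in post]
print(labels)
```

The point of the sketch is that the "clusters" come from a probability model for 0/1 responses rather than from Euclidean distances, which is why LCA sidesteps the objection to k-means. For real work a maintained implementation (PROC LCA in SAS, or for example the poLCA package in R) is the safer choice.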


You can try building principal components from all the variables first and then feeding these into your k-means algorithm. This way you bring in all the variance the binary independent variables have to offer, and you don't feed any categorical variable directly into k-means.


@tavish PCA also takes continuous variables as input, not binary variables, so we can't run PCA on binary variables, right?


PCA can take any kind of variable (binary or continuous) as long as you standardize the continuous variables. Still, what you should do when you have binary variables is not an exact science. The only role of PCA is to find the directions of maximum variance, and these can also lie along the binary variables.
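A rough sketch of the PCA-then-k-means idea, in plain Python so it runs without any packages. Everything here is an assumption for illustration: the toy columns (two 0/1 flags and an income-like value), the helper names, the choice to z-score every column including the binary ones, and keeping only the first component, which is found by power iteration on the correlation matrix.

```python
import math

def standardize(rows):
    """Z-score every column (binary columns included)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

def first_component(z, n_iter=500):
    """Leading eigenvector of the correlation matrix by power iteration."""
    n, p = len(z), len(z[0])
    corr = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

def two_means_1d(values, n_iter=20):
    """Plain k-means with k = 2 on one-dimensional scores."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

# Hypothetical rows: [binary flag, binary flag, continuous value].
rows = [
    [0, 0, 20.0], [0, 0, 22.0], [0, 1, 25.0], [0, 0, 21.0],
    [1, 1, 48.0], [1, 1, 52.0], [1, 0, 50.0], [1, 1, 55.0],
]
z = standardize(rows)
vec = first_component(z)
# Project each row onto the first principal component.
scores = [sum(a * b for a, b in zip(r, vec)) for r in z]
labels = two_means_1d(scores)
print(labels)
```

The binary columns go through PCA like any other column, and k-means then only ever sees the continuous component scores. In R the same pipeline would normally be `prcomp(x, scale. = TRUE)` followed by `kmeans()` on as many components as you decide to keep; one component is used here only to keep the sketch short.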

Hope this helps.



I think the TwoStep clustering option in IBM SPSS or the CHAID algorithm can help you cluster using both numeric and categorical variables.

Hope this helps.

Aayush Agrawal


I think Aayush's suggestion is right. CHAID could help you out.