Algorithm for segmentation of categorical variables?

If segmentation has to be done on the basis of gender, age, or any other categorical variable, which algorithm should be used? Is there an example of this?

Hi @shivanihmcl

Well, you should first explain what you want to achieve. But why not start with clustering? This will give you a kind of segmentation: you will only know which observations are close or dissimilar, based on the distance you use.
Then you can use the resulting clusters in your model.


The data is categorical. I believe that for clustering the data should be numeric. If a categorical variable has multiple levels, which clustering algorithm can be used? Could you please give an example?

The columns in the data are:

ID Age Sex Product Location

ID- Primary Key
Age- 20-60
Sex- M/F
Product- Multiple(Around 100 types)
Location - 30 different locations

Could you please help me with the approach, @Lesaffrea?


Hi @shivanihmcl,

You can use the k-modes algorithm for clustering categorical variables. For an overview of the algorithm, you can refer to this. To run k-modes clustering on categorical data, you can use the kmodes function from the klaR package. I would also suggest reducing the number of levels of the Product and Location variables by combining similar types of products into a single category and combining nearby locations into a single category.
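To make the idea concrete, here is a minimal Python sketch of the k-modes loop. It is not the klaR implementation; the function names, the random initialisation, and the toy rows (with Age already binned into bands and Product/Location already grouped) are all assumptions made for illustration.

```python
import random
from collections import Counter

def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value in each column of a non-empty list of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(rows, k, max_iter=20, seed=0):
    """Cluster tuples of categorical values; returns (labels, modes)."""
    rng = random.Random(seed)
    # Assumed initialisation: k distinct observations picked at random.
    modes = rng.sample(sorted(set(rows)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its closest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_distance(r, modes[j]))
            for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: recompute each mode column by column.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes

# Hypothetical rows: (age band, sex, product group, location group).
customers = [
    ("20-29", "F", "phone", "north"),
    ("20-29", "F", "phone", "south"),
    ("20-29", "M", "phone", "north"),
    ("50-60", "M", "laptop", "south"),
    ("50-60", "M", "laptop", "north"),
    ("50-60", "F", "laptop", "south"),
]
labels, modes = k_modes(customers, k=2)
print(labels)
print(modes)
```

The result depends on the initial modes, so in practice it is worth running with several seeds and comparing the clusterings.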

Hope this helps.


Hi @shivanihmcl

As mentioned, k-modes could be one method; it takes time to compute, if I remember well.
This is if you go with clustering, which is a kind of filter in a way. One point to keep in mind with clustering: the algorithms do not care about what you want to do. They take a distance and then find the closest points to some references (the references depend on the method you use: single, complete, etc.).
By the way, with clustering you can use many types of variables (ordinal, binary, categorical); after that, it is a question of distance.
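As an illustration of "it is a question of distance", here is a small Python sketch of a Gower-style dissimilarity for mixed ordinal and categorical columns. The function name, the `ordinal_levels` argument, and the age bands are hypothetical; treating ordinal values by their rank is one common convention, assumed here.

```python
def mixed_dissimilarity(a, b, ordinal_levels):
    """Average per-column dissimilarity between two rows, in [0, 1].

    ordinal_levels maps a column index to its ordered list of levels.
    Columns not listed there are treated as nominal (binary included).
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in ordinal_levels:
            levels = ordinal_levels[i]
            # Ordinal: rank difference scaled by the largest possible gap.
            total += abs(levels.index(x) - levels.index(y)) / (len(levels) - 1)
        else:
            # Nominal or binary: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical columns: (age band, sex, location group).
AGE_BANDS = ["20-29", "30-39", "40-49", "50-60"]
ordinal = {0: AGE_BANDS}

young = ("20-29", "F", "north")
older = ("50-60", "F", "north")
print(mixed_dissimilarity(young, older, ordinal))
```

A matrix of such pairwise dissimilarities can then be passed to a hierarchical clustering routine, where the linkage (single, complete, and so on) decides how distances between groups are derived from it.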

Now, in your case, is clustering the best option? I am assuming that you would like to go more in the direction of basket analysis. If that is the case and you are looking for associations between people and products, one way is association rules. In R, the arules package allows you to do such an analysis.
I have included the introduction document for this package (framework). If you have access to “Practical Data Science with R”, go to chapter 8; it has a good explanation.
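To show what an association rule measures, here is a short Python sketch that computes support, confidence, and lift for single-item rules by brute force. It is not the arules API; the function names, the thresholds, and the "variable=level" item encoding of each customer are assumptions for illustration.

```python
from itertools import permutations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Enumerate rules lhs -> rhs between single items.

    Returns tuples (lhs, rhs, support, confidence, lift).
    """
    items = sorted(set().union(*transactions))
    rules = []
    for lhs, rhs in permutations(items, 2):
        s = support({lhs, rhs}, transactions)
        if s < min_support:
            continue
        confidence = s / support({lhs}, transactions)
        if confidence >= min_confidence:
            lift = confidence / support({rhs}, transactions)
            rules.append((lhs, rhs, s, confidence, lift))
    return rules

# Hypothetical encoding: one set of "variable=level" items per customer.
customers = [
    {"sex=F", "product=phone", "location=north"},
    {"sex=F", "product=phone", "location=south"},
    {"sex=M", "product=laptop", "location=south"},
    {"sex=M", "product=laptop", "location=north"},
    {"sex=F", "product=laptop", "location=south"},
]
for lhs, rhs, s, c, l in pair_rules(customers):
    print(f"{lhs} -> {rhs}  support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Brute-force enumeration is only workable for toy data; with around 100 products and 30 locations, an Apriori-style search such as the one arules provides is the practical route.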

Hope this helps.


Introduction to arules.pdf (0 Bytes)

@Lesaffrea There is no attachment. Could you please share it again?
I don't have access to the book you mentioned. Is there an alternative or a write-up?


Hi @shivanihmcl
Sorry about this, I did not notice. I hope it is OK this time.
Alainarules.pdf (288.0 KB)

Hi @shivanihmcl

It seems OK this time; the file size is not exactly the same.

Have a good time reading, hope it helps to solve your issue.


Hi @shivanihmcl,
You can use clustering algorithms for segmentation of categorical variables. I have attached a figure. For further information, you can consult books and articles. The figure was taken from the following book:

Data Clustering-Algorithms and Applications (Eds: C. C. Aggarwal & C. K. Reddy)


The following table shows the properties of the algorithms.


I am using the klaR package to do k-modes clustering. Once I have my clusters, I want to take new data and check which cluster each observation belongs to. I wasn't able to figure out how to achieve that. Could anyone suggest how to do it?
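One possible approach, sketched in Python rather than R: assign each new observation to the cluster whose mode differs from it in the fewest attributes, which mirrors the assignment step of k-modes. The sketch assumes the cluster modes have been extracted from the fitted model as rows of categorical values (whether and how klaR exposes them should be checked in its documentation); the function names and example modes are hypothetical.

```python
def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_to_cluster(row, modes):
    """Index (0-based) of the closest mode; ties go to the first one."""
    distances = [matching_distance(row, m) for m in modes]
    return distances.index(min(distances))

# Hypothetical modes taken from an already fitted clustering.
modes = [
    ("20-29", "F", "phone", "north"),
    ("50-60", "M", "laptop", "south"),
]
new_customer = ("20-29", "M", "phone", "north")
print(assign_to_cluster(new_customer, modes))
```

The new data must use exactly the same columns, level names, and any level grouping that were used when the clusters were fitted, otherwise every comparison counts as a mismatch.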