Regarding clustering



If I have to cluster products based on cost, volume, revenue, and ship set count, do I need to convert the product into a number before performing k-means clustering?


Can I take cost, volume, and ship set count, perform the clustering, and then add the cluster membership back to the original dataset?


@rumsinha, it’s fine to leave categorical features out of the clustering if you don’t find them significant.

But if you want to include categorical features, you can encode them. Another technique you might adopt here is to compute the mean response for each level and use that in the clustering instead.
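
To make the "mean response for each level" idea concrete, here is a minimal sketch in plain Python. The rows, the categorical column name `family` and the choice of `revenue` as the response are all made up for illustration; they are not taken from the poster's data.

```python
from collections import defaultdict

# Hypothetical rows: "family" is an assumed categorical feature and
# "revenue" is the numeric response that gets averaged per level.
rows = [
    {"family": "A", "revenue": 100.0},
    {"family": "A", "revenue": 300.0},
    {"family": "B", "revenue": 50.0},
]

def mean_encode(rows, cat_key, response_key):
    """Return {level: mean of response over the rows with that level}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[cat_key]] += row[response_key]
        counts[row[cat_key]] += 1
    return {level: totals[level] / counts[level] for level in totals}

level_means = mean_encode(rows, "family", "revenue")

# Replace each categorical value with the mean response of its level,
# giving a numeric column that can enter a distance calculation.
encoded = [level_means[row["family"]] for row in rows]

print(level_means)
print(encoded)
```

The encoded column is on the same scale as the response, so it would still need to be scaled together with the other features before clustering.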


Saurav, the purpose of the clustering analysis is to group products with similar attributes into one cluster. Even so, do I need to include the product itself as a variable in the cluster analysis, along with the other factors like volume/revenue/cost?


The answer is… if your aim is to perform a clustering analysis then, as I said before, you should check whether the distribution of the categorical variable adds any value. If it does, then you should include it. What you must make sure of is that, if you encode your categorical feature, it is on a scale comparable to your other numerical and categorical variables, because in the end you are just calculating distances between points in the data space.

For example, if you encode two very different levels of a categorical feature as 0 and 1 while another numerical feature ranges from 0 to 1000, the effect of the categorical feature on cluster formation will shrink to almost zero. So you should keep this in mind.
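
The scale effect can be checked with a few lines of plain Python. The three points below are illustrative only, and min-max scaling is just one possible way to put the features on a comparable scale; it is an assumption here, not something prescribed in this thread.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(points):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Illustrative points: (category encoded as 0/1, numeric feature in 0-1000)
points = [(0, 0), (1, 0), (0, 1000)]

# Unscaled: a change of category moves a point by 1 unit,
# while the numeric feature can move it by 1000 units.
raw_cat_gap = euclidean(points[0], points[1])
raw_num_gap = euclidean(points[0], points[2])

# Scaled: both features now span the same range.
scaled = min_max_scale(points)
scaled_cat_gap = euclidean(scaled[0], scaled[1])
scaled_num_gap = euclidean(scaled[0], scaled[2])

print(raw_cat_gap, raw_num_gap)
print(scaled_cat_gap, scaled_num_gap)
```

Before scaling, the categorical difference is a thousandth of the numeric one; after scaling, the two differences carry equal weight.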

Hope this helps.


If you are using R, you can use the clara function in the cluster package. There you can include the categorical variables as input by using the “gower” distance metric. It calculates the distance between each pair of data points and forms the clusters from those distances.
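
For readers who want to see what a Gower-style distance does with mixed data, here is a rough sketch in plain Python. It is not the R implementation: it uses the basic definition only (range-normalised absolute difference for numeric features, simple mismatch for categorical ones, averaged with equal weights). The `family` labels are hypothetical; the volume and revenue values are borrowed from three rows of the sample data posted in this thread.

```python
# "family" labels are made up for illustration; volume and revenue
# come from three rows of the sample data in the thread.
records = [
    {"volume": 318, "revenue": 62, "family": "F1"},
    {"volume": 168, "revenue": 612, "family": "F1"},
    {"volume": 531, "revenue": 38034, "family": "F2"},
]
numeric_keys = ["volume", "revenue"]
categorical_keys = ["family"]

# Range of each numeric feature over the whole data set.
ranges = {
    key: max(r[key] for r in records) - min(r[key] for r in records)
    for key in numeric_keys
}

def gower(a, b):
    """Equal-weight Gower distance between two records, in [0, 1]."""
    parts = []
    for key in numeric_keys:
        span = ranges[key]
        parts.append(abs(a[key] - b[key]) / span if span else 0.0)
    for key in categorical_keys:
        parts.append(0.0 if a[key] == b[key] else 1.0)
    return sum(parts) / len(parts)

# Pairwise distance matrix that a medoid-based clusterer could consume.
matrix = [[gower(a, b) for b in records] for a in records]
for row in matrix:
    print([round(d, 3) for d in row])
```

Because every feature contributes a value between 0 and 1, no single variable dominates the distance merely because of its units.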


My question is: if I need distinct clusters of products based on volume/revenue etc., why would I need to include the product itself while clustering? Each product is a distinct PID name, which does not add any value to the clustering, does it?

Sample data:
Product, Volume, Revenue, SalesOrderCount, ShipSetCount, bookingcost
AXSFANBLWR, 16, 605, 5, 5, 105
NEBSVTR, 318, 62, 14, 14, 9209
FANSY, 168, 612, 29, 24, 476
AEBS, 531, 38034, 10, 6, 998
2NASS2, 387, 42, 19, 13, 1247
ZEL, 230, 13340, 3, 3, 596
39SSBS, 157, 233, 53, 41, 3383
3ASBS, 46, 1576, 15, 13, 4,96
NAY, 184, 1903, 46, 36, 76
AEBS, 4, 102, 2, 2, 20
OUTA, 152, 242, 25, 24, 856


I think you don’t need to consider PID in this case. It will not add any value from an analytical point of view. If you use the other variables for clustering then, from a business perspective, you can get clusters based on the profit range of the products, and I think there should be strong associations between the variables used for clustering (like vol, rev, #salesorder…).
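
As a rough illustration of clustering on the numeric columns only and then attaching the cluster membership back to each product, here is a bare-bones k-means (Lloyd's algorithm) sketch in plain Python. It uses six rows from the sample data above. The choice of k = 2, the min-max scaling and the "first k rows" initialisation are assumptions made to keep the example short and deterministic; a real analysis would more likely use a library implementation.

```python
# Six rows copied from the sample data in this thread.
# Columns: Product, Volume, Revenue, SalesOrderCount, ShipSetCount, bookingcost
data = [
    ["NEBSVTR", 318, 62, 14, 14, 9209],
    ["FANSY", 168, 612, 29, 24, 476],
    ["AEBS", 531, 38034, 10, 6, 998],
    ["ZEL", 230, 13340, 3, 3, 596],
    ["NAY", 184, 1903, 46, 36, 76],
    ["OUTA", 152, 242, 25, 24, 856],
]

def min_max_scale(rows):
    """Rescale every column to [0, 1] so no feature dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(r, lows, highs)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# The product name (PID) is left out of the features on purpose.
features = [row[1:] for row in data]
labels = kmeans(min_max_scale(features), k=2)

# Attach the cluster membership back to the original rows.
clustered = [row + [label] for row, label in zip(data, labels)]
for row in clustered:
    print(row)
```

The product name never enters the distance calculation; it is only carried along so that each cluster label can be read against a product.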


Hi @rumsinha

As Paul mentioned, the PID will only create confusion, as it will enter the distance calculation. The PID has a meaning only if there is a relationship between the PIDs and you build a custom distance matrix from it. I mean, if you can say that, for example, ZEL differs from OUTA by x units, then you can build a distance metric on that and add it to your distance matrix.
Another point: your revenue is related to ShipSetCount, am I right? If so, and if you use the Euclidean distance, you have an interaction effect. If you are clustering products, that is perhaps not what you want; if it were for customers, it could be OK.
Hope this helps.


Thanks Paul and Alain,

The units booked for each product are stored in the variable volume. The product names serve only to identify which products fall into which clusters. My understanding was that if we convert the name to a number, it will participate in the distance calculation and give different clusters.

One last question: say we bring in one more variable, product family, which is the parent of the products. Each product can belong to one and only one product family. Now, if we want to see the clustering at the product family level, how do we need to treat the categorical variables product family and product, in addition to volume, revenue, shipsetcount, salesordercount and bookingcost?

Appreciate all your inputs