Regarding profiling of products using unsupervised learning


I have the following data:

Product, which takes character values like "abc", "xyz", etc.
Quantity, which takes numeric values
Amount, which takes numeric values
ShipSet count, which takes numeric values

I am converting the product to a number with as.numeric(product).
Then I scale the data and pick the optimal number of clusters using silhouette width.
Running k-means on this is not giving me good clusters.

How can I improve on this so that I get 3 groups of products: one with high amount/high quantity/high ShipSet, another with medium values, and a third with low values?

thanks and regards


Hi @rumsinha,

I don’t think that converting product names to numbers and using the distance between them for clustering is a good idea. The numeric codes are arbitrary, so the distance between two products' codes doesn’t mean anything. You should probably leave the product out of your clustering analysis altogether.
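To illustrate the point, here is a minimal sketch in R. The data frame and its column names (`product`, `qty`, `amt`, `shipset`) are made up for illustration and are not taken from your data. It shows that the numeric codes are just alphabetical positions, and then clusters on the scaled measures only, relabelling the three clusters as low/medium/high by their centers.

```r
# Hypothetical toy data; column names and values are assumptions
set.seed(42)
df <- data.frame(
  product = c("xyz", "abc", "mno", "def", "pqr", "ghi", "jkl", "stu", "vwx"),
  qty     = c(5, 6, 4, 50, 55, 48, 500, 520, 480),
  amt     = c(10, 12, 9, 100, 110, 95, 1000, 1050, 990),
  shipset = c(1, 1, 2, 10, 11, 9, 100, 105, 98)
)

# Converting a factor to numeric only returns the alphabetical level codes,
# so "abc" becomes 1 and "xyz" becomes 9; the gap between them means nothing.
product_codes <- as.numeric(factor(df$product))

# Cluster on the scaled numeric measures only, leaving product out
measures <- scale(df[, c("qty", "amt", "shipset")])
km <- kmeans(measures, centers = 3, nstart = 25)

# Relabel clusters as low/medium/high by the mean of their scaled centers
cluster_level <- rank(rowMeans(km$centers))
df$segment <- factor(cluster_level[km$cluster],
                     levels = 1:3, labels = c("low", "medium", "high"))
print(df)
```

Ranking the cluster centers is just one simple way to attach the low/medium/high labels; it assumes the three measures tend to move together, which may not hold on real data.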

If this doesn’t help, please share a snapshot and/or a few more details of the data.



Thanks Saurav,

Based on my reading on clustering and Euclidean distance, my understanding was along the same lines. But when I leave the products out and run the clustering, I get one cluster with most of the products and others with very few, like 1, 2, or 10, so I was not able to make much sense of it. When I include the numeric form of the products, 2 clusters look reasonably fine.

I get your explanation, and my understanding is along the same lines.

Some dummy data is below:
sampledata.csv (259 Bytes)

Volatility is the standard deviation of the booking quantity divided by its mean, computed over 2 years of data.
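In R, this ratio (the coefficient of variation) could be computed per product roughly as follows. The `bookings` data frame, its `Product` column, and the values are assumptions for illustration only.

```r
# Hypothetical booking history: one row per product per period
bookings <- data.frame(
  Product    = rep(c("abc", "xyz"), each = 4),
  BookingQty = c(10, 10, 10, 10, 5, 15, 5, 15)
)

# Volatility = standard deviation / mean of the booking quantity, per product
volatility <- aggregate(BookingQty ~ Product, data = bookings,
                        FUN = function(q) sd(q) / mean(q))
names(volatility)[2] <- "Volatility"
print(volatility)
```

A product with perfectly steady bookings gets a volatility of 0, and the value grows as the bookings swing more widely around their mean.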

My need is to segment the products on a quarterly basis and see whether any patterns exist across the 4 quarters, so that we can decide which products to prioritize for manufacturing.

All feedback and suggestions are welcome.

Thanks and Regards


Hey @rumsinha

I have had a look at the data, and I would still not convert the product names to numbers and use them in clustering.

That said, I can suggest something that might help you get a more even number of objects in each cluster. I see that both BookingQty and BookingAmt show high variation, which might be causing this problem. Scaling, normalizing, and/or capping these variables will probably produce better results.

This will most likely help, but if it doesn’t, an alternative strategy is to look at different proximity functions (distance measures), which might be better suited to this situation.
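As a rough sketch of what capping plus scaling could look like in R (the data below is simulated, and the 1%/99% cut-offs and the `cap` helper are my own illustrative choices, not a recommendation for your data):

```r
# Hypothetical skewed data with one extreme product; values are simulated
set.seed(1)
bookings <- data.frame(
  BookingQty = c(rlnorm(60, meanlog = 3, sdlog = 1), 50000),
  BookingAmt = c(rlnorm(60, meanlog = 6, sdlog = 1), 9e6)
)

# Cap (winsorize) a vector at chosen lower and upper percentiles
cap <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Cap the extremes, compress the remaining skew with log1p, then standardize
prepared <- scale(log1p(sapply(bookings, cap)))

km <- kmeans(prepared, centers = 3, nstart = 25)
print(table(km$cluster))  # cluster sizes
```

Without the capping and log step, a single extreme product can end up in a cluster of its own, which is the "one big cluster plus a few tiny ones" pattern you described.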
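One way to try a different proximity function in R is hierarchical clustering, since `hclust` accepts any distance matrix. The sketch below uses Manhattan distance on made-up values; the choice of Manhattan distance and average linkage is only an example, not something I have checked against your data.

```r
# Hypothetical measures for a handful of products, standardized
measures <- scale(data.frame(
  BookingQty = c(5, 6, 4, 50, 55, 48, 500, 520, 480),
  BookingAmt = c(10, 12, 9, 100, 110, 95, 1000, 1050, 990)
))

# Manhattan (city-block) distance instead of the Euclidean distance k-means uses
d <- dist(measures, method = "manhattan")

# Hierarchical clustering on that distance matrix, cut into 3 groups
tree <- hclust(d, method = "average")
groups <- cutree(tree, k = 3)
print(table(groups))
```

Other values of `method` in `dist` (for example "maximum" or "canberra") can be swapped in the same way to compare how the groups change.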



Thank you… I was doing the normalization using the scale function from R. Let me read up on proximity functions and try them out.

This discussion was very helpful.