Help with clustering listings



I am trying to cluster a number of listings on our platform. Basically, each listing has:
city: numeric
abstract: 200-500 words
title: 6-10 words
bid price: numeric
I want to use all of these parameters for unsupervised clustering.
In the end, we want to find similar listings and suggest them to users.
Our main concern is how to use the abstract, which is about 200 words long. Do we need to apply some kind of NLP algorithm to extract keywords from it, or does the clustering algorithm handle everything? Also, what would be the best way to implement this in production, i.e. what kind of library should we use?


@rjcrystal - This is a very tricky question because the answer depends on many parameters of the clustering. As I understand it, the best method would be to extract keywords that are relevant to each abstract and cluster on those.

Hope this helps!
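
A minimal sketch of the keyword idea above, assuming TF-IDF is an acceptable way to score how relevant a word is to one abstract. The stop-word list, the three sample abstracts and the function names are made up for illustration. In production, a library such as scikit-learn (its TfidfVectorizer) would probably replace this hand-rolled version.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real one is much longer.
STOP_WORDS = {"and", "the", "for", "with", "that", "this", "their", "they"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stop words."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_keywords(abstracts, k=5):
    """Return the k highest-scoring TF-IDF terms for each abstract."""
    docs = [tokenize(a) for a in abstracts]
    n_docs = len(docs)
    # Document frequency: number of abstracts each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        if not doc:
            keywords.append([])
            continue
        tf = Counter(doc)
        scores = {
            t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:k])
    return keywords


# Hypothetical sample abstracts, shortened for the example.
abstracts = [
    "Retail outlet and e-commerce website for ladies apparel and accessories in Pune.",
    "Profitable restaurant in Mumbai with a loyal customer base and a large kitchen.",
    "Online apparel store selling designer wear and jewellery to customers abroad.",
]

for kws in top_keywords(abstracts, k=3):
    print(kws)
```

Words that appear in many abstracts (here "apparel") get a lower score, so the keywords kept for each listing are the ones that set it apart from the others.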



hello @rjcrystal,

I think topic modelling might be able to help. Suppose you create a corpus of all the abstracts and then use the lda package in R to extract, say, 8 topics. Then, for each abstract, you decide which topic it belongs to based on the top words (say the top 10-20) in each topic and the words in the abstract itself. This gives you a categorical variable, Topic: Abstract 1: Topic 2, Abstract 2: Topic 7, etc.
I have not done this myself, though, so I can’t be 100% sure. What is the volume of your data, and could you please give an example of an abstract here for better understanding?
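
To make the "one topic per abstract" idea concrete, here is a rough Python sketch. It is not the R lda package mentioned above. It is a small from-scratch collapsed Gibbs sampler for LDA, and the hyperparameters (alpha, beta, number of iterations), the function names and the sample abstracts are all assumptions for illustration. A maintained library (for example scikit-learn's LatentDirichletAllocation) would be the more sensible choice for real data.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens of 3+ letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def lda_dominant_topics(abstracts, n_topics=8, n_iter=200,
                        alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic id per abstract."""
    rng = random.Random(seed)
    docs = [tokenize(a) for a in abstracts]
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]          # words per topic, per doc
    topic_word = [Counter() for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                        # total words per topic

    # Start from a random topic assignment for every word occurrence.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Dominant topic = the topic holding most of the abstract's words.
    return [max(range(n_topics), key=row.__getitem__) for row in doc_topic]


# Hypothetical sample abstracts, shortened for the example.
abstracts = [
    "Retail outlet selling ladies apparel and jewellery",
    "Online store for designer apparel and accessories",
    "Restaurant with a large kitchen and loyal diners",
    "Family restaurant and cafe with a busy kitchen",
]
print(lda_dominant_topics(abstracts, n_topics=2))
```

The returned list is the categorical Topic variable described above, one value per abstract, which can then sit alongside city and bid price as a clustering feature.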


For this type of problem, Excel is certainly a better fit.
In Excel you can use row to column, and conditional formatting with the words you want to pick out.


Use the Text to Columns function in the Data menu. It will help us with the column data, and by applying a filter you can choose the word you want from the column and get the filtered rows.


Hi, we’ve got around 2000 of them. Here is a sample abstract:

This is an exclusive retail outlet and e-commerce website for ladies apparel and accessories based in Pune that is up for sale. It was established in 2011 and is running profitably since then. Their product category consists of women ethnic wear, designer & casual wear, jewellery and accessories. They believe in very high customer service and provide personalized care and service to all their esteemed customers. The outlet has a cosy and comfortable ambience. They have a very strong client base of more than 13000 customers, most of which are entrepreneurs and professionals. Their core competency being high product quality and competitive pricing, which comes from their strong networking with the finest of manufacturers across the country. They also have a strong customer base abroad.


hello @rjcrystal,

I think topic modelling will be able to help you. Please see the image below for an intuitive understanding of how it works and whether or not it will suit your purpose.


Hi @rjcrystal

It seems that you are trying to build a kind of recommender system, am I right? For the 200-word abstracts, you could preprocess them with standard text processing (stemming, stop-word removal, etc.) and then build a distance matrix using cosine or Jaccard distance. Then do a hierarchical clustering and prune the tree to find the categories. Then build a second model by combining the demographic variables with the categories from the hierarchical clustering and recluster. In the second model you will have a mix of ordinal and categorical variables; there are a few discussions in this forum that explain how to handle this.
Hope this helps.
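
The first stage described above could be sketched roughly as follows, assuming Jaccard distance on stop-word-filtered word sets and average-linkage agglomerative clustering. Stemming is left out to keep it short, "pruning the tree" is approximated by stopping merges once the closest pair of clusters is further apart than max_distance, and the 0.8 threshold, stop-word list and sample texts are made-up values. For 2000 abstracts a library implementation (for example SciPy's scipy.cluster.hierarchy) would be much faster than this pure-Python loop.

```python
import re
from itertools import combinations

# Tiny illustrative stop-word list (an assumption); a real one is much longer.
STOP_WORDS = {"and", "the", "for", "with", "that", "this"}


def token_set(text):
    """Set of lowercase alphabetic tokens (3+ letters) minus stop words."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def hierarchical_clusters(texts, max_distance=0.8):
    """Average-linkage agglomerative clustering; returns one label per text."""
    sets = [token_set(t) for t in texts]
    n = len(sets)
    dist = [[jaccard_distance(sets[i], sets[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per text

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        if linkage(clusters[a], clusters[b]) > max_distance:
            break  # cut the tree here: remaining clusters are too far apart
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical sample texts, shortened for the example.
texts = [
    "Ladies apparel retail outlet in Pune",
    "Retail outlet for ladies apparel and jewellery",
    "Restaurant with large kitchen in Mumbai",
    "Family restaurant in Mumbai with kitchen staff",
]
print(hierarchical_clusters(texts))
```

The labels from this step are the categories that the second model would combine with the other listing fields before reclustering.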