Topic Modelling in R

text_mining
textanalytics

#1

Hi,

I am implementing LDA on incident ticket descriptions, using R.

My approach is as follows:
.csv > corpus > clean (remove punctuation, stop words and numbers, convert to lower case, etc.) > stemming > DTM > find the number of topics (k) using hmean > apply topicmodels::LDA on the DTM > check the topics and their terms > visualize using LDAvis.
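For readers unfamiliar with the "hmean" step: it usually refers to the harmonic-mean estimate of the marginal likelihood, computed from the log-likelihoods of Gibbs samples for each candidate k, with the highest-scoring k being kept. A minimal sketch of that calculation, written in plain Python so it does not depend on any package; the log-likelihood values are made up for illustration and `harmonic_mean_log` is a hypothetical helper name, not part of any library:

```python
import math

def harmonic_mean_log(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logs.

    log HM = log(n) - logsumexp(-ll_i), computed stably by factoring
    out the largest term so exp() never overflows.
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(neg)) - log_sum

# Hypothetical Gibbs-sample log-likelihoods for each candidate k.
samples_by_k = {
    5: [-10520.0, -10510.0, -10515.0],
    10: [-10210.0, -10205.0, -10215.0],
    20: [-10340.0, -10330.0, -10350.0],
}

# Score every candidate k and keep the one with the highest estimate.
scores = {k: harmonic_mean_log(lls) for k, lls in samples_by_k.items()}
best_k = max(scores, key=scores.get)
print(best_k)
```

The harmonic mean is dominated by the lowest-likelihood samples, so the estimate can be unstable; it is worth comparing several runs before settling on k.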

My questions are:

1. Many words are repeated across topics. How should I interpret this, and how can I remove this correlation?
2. How do I name the topics using the text?
3. How do I check the accuracy of the topic model, and how do I test it on a test data set?
4. Can I apply SVM, NB, XGBoost etc. to the output of LDA to classify new incident tickets?
5. How do I deploy it to a server so that I can see my model working in the real world?

Has anyone heard of TWC-LDA, NMF, or t-SNE implementations in R?

Kindly answer each point with an approach or code in R.
This is a live project I am working on, so I would appreciate a reply as soon as possible.
Sincere Regards
Manish Sharma


#2

Question 1: That's the way it is. Different topics may use the same words.
Question 2: A rule of thumb is to sort the words in each topic in descending order of probability and use the first two as the name. The other way is to read through some of the documents in a topic and find a business- or context-specific name for it.
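The rule of thumb for naming can be sketched as follows. This is plain Python with made-up term probabilities; in R the top terms would typically come from the fitted model (for example via `terms(model, 2)` in topicmodels), and `label_topic` is a hypothetical helper name:

```python
# Hypothetical topic-term probabilities (one dictionary per topic).
topic_terms = {
    "topic_1": {"password": 0.30, "reset": 0.25, "login": 0.20, "user": 0.10},
    "topic_2": {"printer": 0.35, "paper": 0.20, "user": 0.15, "jam": 0.12},
}

def label_topic(term_probs, n=2):
    """Join the n most probable terms into a short label."""
    top = sorted(term_probs, key=term_probs.get, reverse=True)[:n]
    return "_".join(top)

labels = {topic: label_topic(probs) for topic, probs in topic_terms.items()}
print(labels)

# Terms that appear in more than one topic, which is expected with LDA.
shared = set(topic_terms["topic_1"]) & set(topic_terms["topic_2"])
print(shared)
```

Automatic labels like these are only a starting point; a shared word such as "user" carries little meaning for either topic, which is why reading a few documents per topic usually gives better names.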

Questions 3, 4 and 5: After naming the topics, build a model on top of them using a supervised learning technique. You can then use that model to classify new documents in production.
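As an illustration of building a supervised model on top of the topics, the sketch below treats each ticket's topic proportions as its feature vector. A nearest-centroid classifier stands in for SVM, NB or XGBoost purely to keep the example dependency-free; the data and labels are invented, and in R the proportions for new tickets would typically come from `posterior(model, newdata)$topics` in topicmodels:

```python
import math

# Hypothetical per-ticket topic proportions paired with hand-assigned labels.
train = [
    ([0.80, 0.15, 0.05], "access"),
    ([0.70, 0.20, 0.10], "access"),
    ([0.10, 0.85, 0.05], "hardware"),
    ([0.20, 0.70, 0.10], "hardware"),
]

def fit_centroids(rows):
    """Average the topic vectors of each class."""
    sums, counts = {}, {}
    for vec, label in rows:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Return the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))

centroids = fit_centroids(train)
print(predict(centroids, [0.75, 0.20, 0.05]))
```

Accuracy is then measured in the usual supervised way: hold out labelled tickets, predict their class from their topic proportions, and compare with the true labels.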


#3

@manishceeri
Why do you start with topic modelling directly from the corpus? Incidents have attributes you can usually use, with TF-IDF for example. Then you could use a dissimilarity or distance (hmean does only TF in a way) to do clustering, take the few attributes of the incidents together with the clusters, and check whether the clustering matches the attributes.
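A minimal sketch of the TF-IDF plus dissimilarity idea, in plain Python with made-up ticket text. The resulting pairwise dissimilarities are what a clustering algorithm would take as input; the weighting shown (relative term frequency times log inverse document frequency) is one common variant, assumed here for illustration:

```python
import math
from collections import Counter

# Hypothetical, already-cleaned ticket descriptions.
docs = [
    "password reset login fail",
    "login password expire",
    "printer paper jam",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many tickets each term occurs.
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    """Relative term frequency times log inverse document frequency."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}

vectors = [tfidf(toks) for toks in tokenized]

def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    norm = norm_a * norm_b
    return 1.0 - dot / norm if norm else 1.0

print(cosine_dissimilarity(vectors[0], vectors[1]))  # tickets sharing terms
print(cosine_dissimilarity(vectors[0], vectors[2]))  # tickets with no shared terms
```

Once tickets are clustered on these dissimilarities, cross-tabulating cluster membership against an existing incident attribute shows whether the clusters line up with it.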
If you stick with topic modelling (Dirichlet in this case), you make an assumption about the distribution of your terms (the prior), and that is perhaps not the case. How to test it is a good question: the model is right if your prior distribution is effectively Dirichlet, and this is not easy to test.
Then you can apply any model (question 3), since you have a set of features. You could easily have thousands of features with text, so start with something fast for the first attempt. If you go for ensembles (boosting, random forest), XGBoost could be fast, but ranger is even faster.
Deployment is no problem. If you are lucky enough to be on Azure, for example, it is easy: Microsoft did a good job of supporting R at the server level. The issue is the model. Keep in mind that R puts a lot of information in the model object, so if you use the model at run time and you had a massive training set, it could be slow to load even as RDS (I am assuming you save your model as RDS once it is built). The bigger problem over time is that your incidents could change, and you will start to get false negatives and false positives, so think about monitoring the accuracy of your model.
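One simple way to monitor accuracy is to track it over the most recent labelled predictions and flag when it drops. A minimal sketch in plain Python; the class name, window size, threshold and example labels are all hypothetical choices for illustration:

```python
from collections import deque

class RollingAccuracy:
    """Track accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window=100, alert_below=0.8):
        self.hits = deque(maxlen=window)  # old results drop off automatically
        self.alert_below = alert_below

    def update(self, predicted, actual):
        self.hits.append(predicted == actual)

    def accuracy(self):
        return sum(self.hits) / len(self.hits) if self.hits else None

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.alert_below

monitor = RollingAccuracy(window=4, alert_below=0.75)
for predicted, actual in [("access", "access"), ("hardware", "hardware"),
                          ("access", "hardware"), ("access", "network")]:
    monitor.update(predicted, actual)

print(monitor.accuracy(), monitor.needs_retraining())
```

This only works if some tickets keep receiving a trusted label after deployment, for example from the team that finally resolves them.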

Hope this helps
Alain


#4

Thanks @lesaffrea1 for your valuable input.

However, I do not understand the "Incidents have attributes" and "dissimilarity or distance (hmean does only TF in a way) to do clustering" parts of your answer.

Are you talking about doing clustering, then topic modelling, and then comparing the groups with the topics?

Can you send me a link (GitHub, for example) where I could get an idea of how to apply questions #3 and #4 in R specifically?

Once again, thanks for your prompt reply.
Looking forward to learning a lot from you.

Regards