Text mining and clustering



How to find the top 5 words in each of 20 clusters with R
Can someone help me with the R code?


Hi @Najeeb

Something that could help:

Note: the following script goes through the directory, picks up each file, takes its first 10000 lines, then displays the distribution of the term frequencies (log scale) as a histogram.

library(tm)

numoccurence <- c(0)

# corpusUSdir is assumed to hold the path of the directory with the text files
usfiles <- list.files(corpusUSdir)
for (curfile in usfiles) {
        # Read at most the first 10000 lines, one line per row
        curdoc <- read.table(file.path(corpusUSdir, curfile), sep = "\n", stringsAsFactors = FALSE, quote = "", skipNul = TRUE, nrows = 10000)
        # One document per line; VectorSource avoids the doc_id/text columns that recent tm versions require for DataframeSource
        curcorpus <- Corpus(VectorSource(curdoc[[1]]))
        # We do a tolower on the corpus; content_transformer keeps the documents valid for tm
        curcorpus <- tm_map(curcorpus, content_transformer(tolower))
        curcorpus <- tm_map(curcorpus, removePunctuation)
        wordstodocument <- TermDocumentMatrix(curcorpus)
        wordsmatrix <- as.matrix(wordstodocument)
        # Total count of each term over all lines of the file
        freqterms <- rowSums(wordsmatrix)
        freqdisp <- data.frame(namesterms = names(freqterms), freqterms, stringsAsFactors = FALSE)
        mainlabel <- paste("Terms distribution log transformation", curfile, sep = " ")
        disp <- hist(log(freqdisp$freqterms), breaks = 30, freq = TRUE, labels = TRUE, main = mainlabel, xlab = "log frequency terms")
}
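
Since the question asks for the top 5 words rather than a histogram, here is a minimal sketch of that one step: sort the frequency vector and keep the first five entries. The freqterms values below are made-up data standing in for the rowSums result, and top5 / top5words are just illustrative names.

```r
# Made-up term frequencies standing in for rowSums(wordsmatrix)
freqterms <- c(error = 12, disk = 9, user = 7, login = 4, write = 3, full = 1)

# Sort by decreasing frequency and keep the five most frequent terms
top5 <- head(sort(freqterms, decreasing = TRUE), 5)
top5words <- names(top5)
print(top5)
```

If fewer than five distinct terms exist, head simply returns what is there.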

Have a good day


Hi Lesaffrea

OK, but suppose I want to write a for loop that will:
subset the records belonging to each cluster, i.e. assign all the records of a particular cluster to a variable,
then apply TermDocumentMatrix to each subset,
inspect the elements in it,
find the count of the words,
and then output the LogGroup, LogCount, Top Words, WordCount, Counter to a file.

Can you tell me how to start with this?


Hi @Najeeb

Use code similar to the one above, but instead of drawing a graph as I do, you prepare a row for your results data frame and bind it. In a few words:

  1. Define your results data frame with the metrics you want.

  2. Enter the loop for(curdle in XXXX), where XXXX is the list of your clusters.

  3. Process the freqdisp data frame to compute your metrics.

  4. Bind the row to the data frame declared in 1.

  5. Loop back to 2.
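
A rough sketch of these steps, under several assumptions: the records data frame, its text and cluster columns, the output file name and the loop variable curcluster are all made up for illustration. I also guess that LogGroup is the cluster id, LogCount the number of records in the cluster, and Counter the rank of the word inside its cluster; adjust if they mean something else in your data.

```r
library(tm)

# Hypothetical input: one row per record, with its text and its cluster
records <- data.frame(
  text = c("error disk full", "disk error on write", "user login ok",
           "user logout ok", "login failed for user"),
  cluster = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)
topn <- 5

# 1. Results data frame with the metrics to report
results <- data.frame(LogGroup = numeric(0), LogCount = integer(0),
                      TopWord = character(0), WordCount = numeric(0),
                      Counter = integer(0), stringsAsFactors = FALSE)

# 2. Loop over the clusters
for (curcluster in sort(unique(records$cluster))) {
  # Subset the records of the current cluster
  curtext <- records$text[records$cluster == curcluster]
  curcorpus <- VCorpus(VectorSource(curtext))
  curcorpus <- tm_map(curcorpus, content_transformer(tolower))
  curcorpus <- tm_map(curcorpus, removePunctuation)
  # By default TermDocumentMatrix drops words shorter than 3 characters
  freqterms <- rowSums(as.matrix(TermDocumentMatrix(curcorpus)))

  # 3. Keep the most frequent terms of this cluster
  topterms <- head(sort(freqterms, decreasing = TRUE), topn)

  # 4. Bind one row per top word to the results
  results <- rbind(results, data.frame(
    LogGroup = curcluster, LogCount = length(curtext),
    TopWord = names(topterms), WordCount = as.numeric(topterms),
    Counter = seq_along(topterms), stringsAsFactors = FALSE))
}   # 5. Back to step 2 for the next cluster

# Write everything to a file (a temporary location is used here)
outfile <- file.path(tempdir(), "cluster_top_words.csv")
write.csv(results, outfile, row.names = FALSE)
```

Growing a data frame with rbind inside a loop is fine for 20 clusters; for many more, collecting the pieces in a list and binding once at the end would be faster.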

Hope this helps.


Thanks a lot Alain