Text Mining Problem - urgent

r

#1

Hello to all my friends, senior and junior. I have a problem and would value your input. The problem is:

I have around 50k records (comments) submitted by users. I also have five predefined categories, each with its own list of keywords supplied by the client. I need to classify the comments into those five categories based on the keywords: if a comment contains one or more keywords belonging to a category, that comment is assigned to that category. I then need to show the term frequency for those keywords. How can I write code in R to do this? Please reply. Thanks in advance to everybody.


#2

Hi,

In R, two packages, tm and RTextTools, are often used for this. There are other, more specialised packages (for example for Dirichlet signatures), but your case is a good fit for tm.

Here are some lines that may solve the first part of your problem, that is, the frequency of words per document. The variable directorydocument points to the directory that holds the documents. If you check namesterms, you will find the words that occur in your comments; after that you can add your own code to match them against your categories.

Hope this helps.

Alain


library(tm)

# directorydocument must point to the directory that holds the documents
usfiles <- list.files(directorydocument)

# Show up to three histograms side by side
par(mfrow = c(1, 3))

for (curfile in usfiles) {
  # Read one record per line (at most 10000 lines per file)
  curdoc <- read.table(file.path(directorydocument, curfile), sep = "\n",
                       stringsAsFactors = FALSE, quote = "",
                       skipNul = TRUE, nrows = 10000)
  # Each line becomes one document; VectorSource is used because recent
  # tm versions expect doc_id/text columns for DataframeSource
  curcorpus <- VCorpus(VectorSource(curdoc[[1]]))
  # Lower-case the corpus; content_transformer() keeps the tm document structure
  curcorpus <- tm_map(curcorpus, content_transformer(tolower))
  curcorpus <- tm_map(curcorpus, removePunctuation)
  wordstodocument <- TermDocumentMatrix(curcorpus)
  wordsmatrix <- as.matrix(wordstodocument)
  # Total frequency of each term over all documents in the file
  freqterms <- rowSums(wordsmatrix)
  namesterms <- names(freqterms)
  freqdisp <- data.frame(namesterms, freqterms, stringsAsFactors = FALSE)
  mainlabel <- paste("Terms distribution log transformation", curfile)
  disp <- hist(log(freqdisp$freqterms), breaks = 30, freq = TRUE, labels = TRUE,
               main = mainlabel, xlab = "log frequency terms")
}
par(mfrow = c(1, 1))
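
For the second part of the question (assigning comments to categories by keyword and counting the keywords), a minimal sketch in base R could look like the one below. The category names, keyword lists and comments are made up for illustration, and so are the helper name tokenize and the variables counts, matched and keyword_freq; the real keyword lists would come from the client. It assumes single-word keywords and exact matching after lower-casing, so multi-word phrases or word stems would need a different approach.

```r
# Hypothetical keyword lists: replace with the five lists from the client.
categories <- list(
  billing  = c("invoice", "refund", "charge"),
  delivery = c("shipping", "late", "courier"),
  quality  = c("broken", "defect", "faulty")
)

# Hypothetical comments standing in for the ~50k user records.
comments <- c(
  "The invoice was wrong and I want a refund",
  "Shipping was late and the item arrived broken",
  "Great service, thank you"
)

# Lower-case, replace punctuation with spaces, split on whitespace.
tokenize <- function(text) {
  cleaned <- gsub("[[:punct:]]", " ", tolower(text))
  tokens <- strsplit(cleaned, "[[:space:]]+")[[1]]
  tokens[nzchar(tokens)]
}

tokens_per_comment <- lapply(comments, tokenize)

# counts[i, j] = how many tokens of comment i are keywords of category j
counts <- vapply(categories, function(keywords) {
  vapply(tokens_per_comment,
         function(tokens) sum(tokens %in% keywords),
         numeric(1))
}, numeric(length(comments)))

# A comment belongs to every category for which it has at least one keyword,
# so one comment can fall into several categories (or none).
matched <- counts > 0

# Term frequency of each keyword over all comments, per category.
all_tokens <- unlist(tokens_per_comment)
keyword_freq <- lapply(categories, function(keywords) {
  table(factor(all_tokens[all_tokens %in% keywords], levels = keywords))
})

print(matched)
print(keyword_freq)
```

Here matched is a logical matrix with one row per comment and one column per category, and keyword_freq holds one frequency table per category, including keywords that never occur. With the real 50k comments the same idea should work, although a TermDocumentMatrix from tm, as in the code above, may be more memory-efficient.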