How do I set up a corpus of documents using the 'tm' package in R?

text_mining
r

#1

Hello,

While using the tm package for text mining, I got stuck on how to create the required corpus from, say, 5 blog posts. I can copy the content into separate text files, but after that how do I construct a corpus that can be fed into the tm functions?
I am sorry if this is a basic question, but I can't seem to figure it out yet.


#2

The following should work:

myCorpus<-Corpus(VectorSource(myFile$myColumn))
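If the posts are saved as separate text files, one way to get to myFile$myColumn is to read each file into a data frame first. This is only a minimal sketch: the temporary folder, the file names, the two made-up posts, and the blogName column are all assumptions added so the example runs on its own.

```r
library(tm)

# Hypothetical setup: write two stand-in blog posts to a temporary folder.
# In practice, point blogDir at the folder that holds your own .txt files.
blogDir <- file.path(tempdir(), "blogs")
dir.create(blogDir, showWarnings = FALSE)
writeLines("R makes text mining easy.", file.path(blogDir, "post1.txt"))
writeLines("The tm package builds a corpus from text.", file.path(blogDir, "post2.txt"))

# Read every .txt file and collapse it into one string per post
files <- list.files(blogDir, pattern = "\\.txt$", full.names = TRUE)
texts <- vapply(
  files,
  function(f) paste(readLines(f, warn = FALSE), collapse = " "),
  character(1)
)

# One row per post: a name column and a text column
myFile <- data.frame(
  blogName = basename(files),
  myColumn = unname(texts),
  stringsAsFactors = FALSE
)

# Same call as above, now with a real myFile behind it
myCorpus <- Corpus(VectorSource(myFile$myColumn))
length(myCorpus) # number of documents in the corpus
```

tm can also read a folder directly with Corpus(DirSource(blogDir)), which skips the data frame if you do not need the blog names in a separate column.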


#3

Hello @Nalin ,

Thanks for the reply, but my question was more like this:
Say I have five blogs and I am interested in, say, 5 words in each. How do I arrange the words from each blog so that I can do the myFile$myColumn part?


#4

Suppose you have a file with two columns. The first column has, say, the name of the blog and the second column has a paragraph of text from the blog. If the second column is called myColumn, you can get the common words into a separate data frame, with each column holding the frequency of one word, using the following code:

library(tm) # loads the tm package

myCorpus<-Corpus(VectorSource(myFile$myColumn)) #converts the relevant part of your file into a corpus

myCorpus = tm_map(myCorpus, content_transformer(tolower)) # converts all text to lower case; in tm 0.6 and later, wrapping tolower in content_transformer() replaces the separate PlainTextDocument step

myCorpus = tm_map(myCorpus, removePunctuation) #removes punctuation

myCorpus = tm_map(myCorpus, removeWords, stopwords("english")) #removes common words like "a", "the" etc

myCorpus = tm_map(myCorpus, stemDocument) # reduces related words such as get, getting, gets to a common stem (needs the SnowballC package installed)

dtm = DocumentTermMatrix(myCorpus) #turns the corpus into a document term matrix

notSparse = removeSparseTerms(dtm, 0.99) # drops terms that are missing from more than 99% of the documents, so the more frequently occurring words remain

finalWords=as.data.frame(as.matrix(notSparse)) # most frequent words remain in a dataframe, with one column per word
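
If you only care about a handful of words, as in the original question, it may be simpler to restrict the document term matrix to those words through the dictionary control option instead of filtering by sparsity. A small sketch under stated assumptions: the three posts, the blogName values, and the myWords vector are made up for illustration, VCorpus is used in place of Corpus, and stemming is left out so that no extra package is needed.

```r
library(tm)

# Made-up stand-ins for the blog posts (hypothetical content)
myFile <- data.frame(
  blogName = c("blogA", "blogB", "blogC"),
  myColumn = c("Data science needs good data.",
               "Text mining in R is fun.",
               "R and data go well together."),
  stringsAsFactors = FALSE
)

# Build the corpus and normalise case and punctuation
myCorpus <- VCorpus(VectorSource(myFile$myColumn))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)

# Hypothetical words of interest; only these become columns
myWords <- c("data", "mining", "text")
dtm <- DocumentTermMatrix(myCorpus, control = list(dictionary = myWords))

# One row per blog, one column per word of interest
wordCounts <- as.data.frame(as.matrix(dtm))
rownames(wordCounts) <- myFile$blogName
wordCounts
```

Each row of wordCounts is one blog and each column is one of the chosen words, so the counts for your five words across your five blogs can be read off directly.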