How to remove plural words from the training data for forming bag of words?

bagofwords
r

#1

I am currently studying the bag-of-words technique in R, and I have used the tm package to build the word list. But after using it, the training data contains lots of similar words. I want to remove them.

library(jsonlite)
library(dplyr)
library(ggplot2)
library(tm) 

train <- fromJSON("train.json", flatten = TRUE)
ingredients <- Corpus(VectorSource(train$ingredients))
ingredients
 <<VCorpus>>
 Metadata:  corpus specific: 0, document level (indexed): 0
 Content:  documents: 39774 

It contains 39774 documents, and many of the words in them are plurals. I want to remove them.


#2

@harry, you can stem the words so that similar forms (such as singular and plural) are reduced to a common root in the documents.

ingredients <- tm_map(ingredients, stemDocument)
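
To see what stemming does to plurals before applying it to the whole corpus, here is a minimal sketch. It assumes the SnowballC package is installed (tm's stemDocument uses it for stemming); the word list is made up for illustration and is not taken from train.json.

```r
# Minimal sketch: stem a few plural ingredient names directly.
# Assumes SnowballC is installed; the words below are illustrative only.
library(SnowballC)

words <- c("eggs", "tomatoes", "onions", "olives")

# wordStem() applies the Snowball stemmer to each element of the vector
stems <- wordStem(words, language = "english")

# Show each word next to its stem
print(data.frame(word = words, stem = stems, stringsAsFactors = FALSE))
```

Note that a stem is not always a dictionary word, so some terms may look truncated in the resulting matrix.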

Hope this helps!

Regards,
Hinduja


#3

Hi Harry,

Look at the function tm_map() and stemming. Stemming extracts the root of each word and therefore reduces the number of distinct terms; after that you can build a term-document matrix and pick out the terms with the highest frequencies.
tm_map() can also apply other clean-up steps, such as converting to lower case, removing punctuation, etc.

ingredients <- tm_map(ingredients , stemDocument)
textdocumentsmodel <- TermDocumentMatrix(ingredients)
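
The clean-up steps mentioned above could be chained roughly as follows. This is only a sketch on three made-up documents (not the real train.json data), and it assumes both tm and SnowballC are installed; the names docs, corpus, tdm and freq are chosen for illustration.

```r
# Minimal sketch: lower-case, strip punctuation, stem, then count terms.
# The three documents are illustrative stand-ins for train$ingredients.
library(tm)

docs <- c("Eggs, tomatoes and onions", "egg and tomato", "Onion; eggs!")
corpus <- VCorpus(VectorSource(docs))

# Base R functions such as tolower must be wrapped in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)   # requires the SnowballC package

# Build the term-document matrix and total the counts per term
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(freq)
```

On a corpus as large as the one in the question, as.matrix() may use a lot of memory; tm's findFreqTerms(tdm, lowfreq = n) lists frequent terms without converting to a dense matrix.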

Hope this helps.
Alain