How to provide a custom stop-word list in R?

textmining
r

#1

Hi,
I am currently working on an email-classification problem based on keywords in the contents of the mail. I am new to R and need some help. I have a very long list of stop-words in a text file that I would like to use as stop-words in addition to the built-in list in R.

Say the file that has additional keywords is stop.txt. Now, how do I modify the code below to accommodate this?

corpus <- tm_map(corpus, removeWords, c(stopwords('english')))

Can anyone help me with the revised code?

Regards,
Sharath


#2

@SD1 - You can pass it as a value to the combine function c().

For example, if you want to remove the word stop.txt:

corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "stop.txt"))

Hope this helps!

Regards,
Hinduja


#3

@hinduja1234, I doubt if that’s what @SD1 is looking for. The code you have provided will only remove the exact string “stop.txt”, if it appears in the corpus, whereas what Sharath wants is to remove the words contained within the file stop.txt.

One option is to read the words into a vector and concatenate it with the built-in stop words. Something like:

corpus <- tm_map(corpus, removeWords, c(words_read_from_file, stopwords('english')))
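For the "read the words into a vector" part, here is a minimal sketch. It assumes the file has one stop word per line and no header, which may not match the real stop.txt. The temporary file and the small builtin_stop vector are stand-ins I made up so the snippet runs without tm installed.

```r
# Hypothetical stand-in for stop.txt: one stop word per line, no header.
stop_file <- tempfile(fileext = ".txt")
writeLines(c("regards", "thanks", "hello", ""), stop_file)

# readLines() returns a plain character vector, one element per line.
words_read_from_file <- trimws(readLines(stop_file))

# Drop empty lines so they do not end up in the stop-word list.
words_read_from_file <- words_read_from_file[nzchar(words_read_from_file)]

# Stand-in for stopwords('english'), used here only to keep the sketch tm-free.
builtin_stop <- c("the", "a", "and")

# Concatenate both vectors and drop duplicates.
all_stop_words <- unique(c(words_read_from_file, builtin_stop))
```

With tm loaded, the same idea would be to pass c(words_read_from_file, stopwords('english')) to removeWords, as in the line above.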


#4

Sorry about that, and thanks @anon for correcting me.


#5

Thanks both for the reply.

However, I encountered the following error when I tried your suggestion. Let me know if I am missing something.

words_read_from_file <- read.table("stop.txt", header=F, sep="\t")
corpus_clean <- tm_map(corpus_clean, removeWords, c(words_read_from_file, stopwords('english')))

Error in sort.int(x, na.last = na.last, decreasing = decreasing, …) :
‘x’ must be atomic

Thanks.
Regards,
Sharath


#6

@SD1 Could you let us know the format of stop.txt? If it's not too big, then just attach the file to your post. Most likely it's complaining that the new variable is actually a data frame and not an atomic vector.
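To illustrate that suspicion, here is a small sketch using base R only. The data frame stop_df and the vector builtin are made-up stand-ins for what read.table() returns and for stopwords('english'). Combining a data frame with a character vector via c() gives a list, which is not atomic, so it would fit the error message.

```r
# Hypothetical stand-in for the data frame returned by read.table()/read.csv().
stop_df <- data.frame(V1 = c("stop_word1", "stop_word2"),
                      stringsAsFactors = FALSE)

# Stand-in for stopwords('english').
builtin <- c("the", "and")

# c() on a data frame treats it as a list of columns, so the result is a list.
combined <- c(stop_df, builtin)
class(combined)      # "list"
is.atomic(combined)  # FALSE

# Pulling out the column first gives an ordinary character vector.
fixed <- c(stop_df$V1, builtin)
is.atomic(fixed)     # TRUE
```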


#7

stop.csv (559 Bytes)

Hey Anon,

Was unable to upload the .txt file as the forum doesn’t allow me to do so. Hence, uploaded it in csv format.

I tried the following code by reading in the csv, but the same error shows up again. I am definitely doing something wrong.

stop <- read.csv("stop.csv", stringsAsFactors = FALSE)
corpus_clean <- tm_map(corpus_clean, removeWords, c(stop, stopwords('english')))

Error in sort.int(x, na.last = na.last, decreasing = decreasing, …) :
‘x’ must be atomic

Thanks for your time.

Regards,
SD


#8

For some reason, I’m not able to download the file. (Ends up being an empty, 0B file.)

In any case the issue is the same: the function expects a vector, whereas stop is a data frame.

Here’s an example with a simple stop.txt that I made myself.

stop.txt file

CUSTOM_STOP_WORDS
stop_word1
stop_word2
stop_word3
stop_word4
stop_word5
stop_word6
stop_word7
stop_word8
stop_word9
stop_word10

What is needed for concatenation is a character vector, like stop_vec shown below:

> stop = read.table("stop.txt", header = TRUE)
> class(stop)
[1] "data.frame"
> stop
   CUSTOM_STOP_WORDS
1         stop_word1
2         stop_word2
3         stop_word3
4         stop_word4
5         stop_word5
6         stop_word6
7         stop_word7
8         stop_word8
9         stop_word9
10       stop_word10
> stop_vec = as.vector(stop$CUSTOM_STOP_WORDS)
> class(stop_vec)
[1] "character"
> stop_vec
 [1] "stop_word1"  "stop_word2"  "stop_word3"  "stop_word4"  "stop_word5"  "stop_word6"  "stop_word7" 
 [8] "stop_word8"  "stop_word9"  "stop_word10"
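
Putting it together with tm_map, here is a rough end-to-end sketch. It assumes the tm package is installed. The two-line corpus and the temporary file are invented for illustration, and only the column name CUSTOM_STOP_WORDS comes from the example file above. I use as.character() instead of as.vector() so the result is a character vector whether or not the column was read as a factor.

```r
library(tm)

# Hypothetical stop-word file with a header row, like the stop.txt above.
stop_file <- tempfile(fileext = ".txt")
writeLines(c("CUSTOM_STOP_WORDS", "stop_word1", "stop_word2"), stop_file)

# read.table() gives a data frame; extract the column as a character vector.
stop <- read.table(stop_file, header = TRUE)
stop_vec <- as.character(stop$CUSTOM_STOP_WORDS)

# Tiny made-up corpus standing in for the real emails.
corpus <- VCorpus(VectorSource(c("please send stop_word1 invoice",
                                 "meeting stop_word2 agenda")))

# Remove the custom words together with the built-in English stop words.
corpus <- tm_map(corpus, removeWords, c(stop_vec, stopwords("english")))

# Inspect the cleaned text of the first document.
first_doc <- content(corpus[[1]])
```

removeWords leaves the surrounding spaces behind, so a follow-up tm_map(corpus, stripWhitespace) is a common next step.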

#9

Hi anon,
Your code works like a charm.

Thanks buddy!

Regards,
SD