How to randomly split corpus data?

text_mining
r

#1

Hi,
Another query on text mining.

I found a piece of code that would help me in randomly splitting a data frame into 70%-30%. The code ran successfully.

 dt=sort(sample(nrow(atac_raw),nrow(atac_raw)*.7))
 atac_raw_train <- atac_raw[dt,]
 atac_raw_test <- atac_raw[-dt,]

However, when I use the same code to split the corresponding corpus (corpus_clean), it fails. Maybe the code doesn’t work on corpus data?

 dt_corpus=sort(sample(nrow(corpus_clean),nrow(corpus_clean)*.7))

 Error in sample.int(length(x), size, replace, prob) : invalid 'size' argument

Can anyone help? Couldn’t find any solution on the web.

I can think of a workaround (jugaad!): modify the data file so that I select the first n records as training data and the remaining ones as testing data. But I would like to know whether there is a way to fix the code and make it work instead.

Regards,
SD


#2

A corpus that you create is not a data frame; it’s an object of type VCorpus (more about that here), so you cannot expect it to have rows and columns to which you can apply the nrow/ncol functions.

With my limited exposure to the tm package, I’d ask you to convert the corpus into a data frame before splitting it into training and testing sets. Others may have better ideas. :slightly_smiling:
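To illustrate the point about nrow, here is a minimal sketch in base R. It uses a plain list as a stand-in for the corpus (the documents are made up), on the assumption that a tm VCorpus behaves like a list with respect to length() and [ ] indexing. It is one possible way to split by document count, not necessarily the best one.

```r
# Stand-in for a cleaned corpus: a plain list of documents (hypothetical data).
# Assumption: a tm VCorpus supports length() and [ ] in the same way.
corpus_clean <- as.list(paste("document", 1:10))

# A list has no dim attribute, so nrow() gives NULL rather than a count.
is.null(nrow(corpus_clean))

# Count documents with length() instead.
n_docs <- length(corpus_clean)

# Fix the seed so the split can be reproduced, then draw 70% of the indices.
set.seed(123)
train_idx <- sort(sample(n_docs, floor(0.7 * n_docs)))

corpus_train <- corpus_clean[train_idx]
corpus_test  <- corpus_clean[-train_idx]
```

If the data frame and the corpus hold the same documents in the same order, reusing the same train_idx on both should keep the two splits aligned.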


#3

@anon - I did a jugaad in the interest of time :grin:

Thanks!
SD


#4

But wouldn’t that remove the element of randomness in the split? You would most likely end up with biased datasets on which to train and test. I think it’s advisable to create a corpus and then split it into dataframes rather than the other way around. Otherwise, as stated above, you may end up with a test data set that doesn’t contain a word that exists in the training set (as a column of the sparse matrix, e.g.), and vice-versa.
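A rough sketch of what I mean, in base R only so that it runs without extra packages. The documents and variable names are made up, and the term matrix is built by hand here; with tm you would presumably get it from DocumentTermMatrix() instead. The point is only that the vocabulary is fixed before the random split, so both sets end up with the same columns.

```r
# Toy documents standing in for the cleaned corpus (hypothetical data).
docs <- c("flight delayed again", "great service", "delayed baggage",
          "great flight", "service was slow", "baggage lost again",
          "slow boarding", "great crew", "lost my baggage", "crew was great")

# Tokenise and build the vocabulary from ALL documents before splitting.
tokens <- strsplit(tolower(docs), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term.
dtm <- t(vapply(tokens,
                function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                integer(length(vocab))))
colnames(dtm) <- vocab
dtm_df <- as.data.frame(dtm)

# Random 70/30 split of the rows; the columns are shared by construction.
set.seed(42)
train_idx <- sort(sample(nrow(dtm_df), floor(0.7 * nrow(dtm_df))))
train <- dtm_df[train_idx, , drop = FALSE]
test  <- dtm_df[-train_idx, , drop = FALSE]

# Both sets have exactly the same term columns.
identical(names(train), names(test))
```

Some terms may still have all-zero counts in one of the two sets, but the column layout matches, which is what a model fitted on the training set needs at prediction time.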

(Or have I misunderstood your jugaad?)


#5

Hey!

I used the Kutools add-in in Excel to randomly sort the data before I used it for the analysis. I tried doing the random sorting in R but ran into errors and did not know how to resolve them.

I have been a SAS user for the last 8 years. I had to use R for a text mining project and started programming in R right away without learning the basics. Actually, I am enjoying learning R the hard way :). Also, with advice from seasoned programmers like you, my learning gets faster.

Link for Kutools - http://www.extendoffice.com/documents/excel/644-excel-random-cell.html

Regards,
SD


#6

Thanks for the link to kutools.

In case you need help using tm, please refer to the attached files (in the zip file) which contain the data and code for a simple bag-of-words approach to Twitter sentiment analysis. (This was the demo used in the EdX course The Analytics Edge). Hope it helps.

using_tm.zip (46.7 KB)


#7

Hi @anon,
Thanks. However, I am unable to download the file. Can you send me the link instead?

P.S. - I initially thought Anon was a name, and it later struck me that it is the abbreviation for anonymous! :slightly_smiling:

Regards,
SD


#8

Well, then try these

Source code
CSV

And regarding anon, technically it is my username, so in this case it is indeed a name. :wink:


#9

Thanks bud!


#10

I’m glad you’ve found the solution to your problem. I’ll address the R error itself. A fractional size is not what triggers it: sample() truncates a non-integer size, so nrow(atac_raw)*.7 would have worked whether or not it came out as a whole number. The error comes from nrow(corpus_clean), which returns NULL because a corpus has no rows, so the size argument ends up empty. Using length(corpus_clean) in place of nrow(corpus_clean) should make the code work, and you can wrap the size in floor() or ceiling() if you want the rounding to be explicit. Whether it’s the correct thing to do with your corpus depends on how you’ve read it in.