What are the ways to handle huge data in R?

big_data
r

#1

I have been using R for some time now and have built various predictive models with it. I have recently started competing in Kaggle competitions and found out that I cannot process data more than 3 GB in size. Upon researching, I found out that R loads data into its RAM, and hence this limit is driven by the RAM of my machine.

Is there any way I can overcome this limitation? How? Some of the Kaggle competitions have had data of more than 20 GB, and I have seen people using R to solve those problems.

Any help would be greatly appreciated!


#2

@jon

You are right, one of the limitations of R is that it loads data into the computer’s RAM before processing it. This puts a limit on the size of data you can work with. Here are a few ways you can get around it:

  1. Split the CSV file into smaller files and process it in chunks.
  2. I have used the ff package and found it very effective while working with big datasets (see the sketch after this list). Here is a good tutorial to get you started: http://ff.r-forge.r-project.org/bit&ff2.1-2_WU_Vienna2010.pdf
  3. You can also try the bigmemory package. Though I have not used it myself, I know a few people who have. This presentation is a good place to start: http://www.slideshare.net/bytemining/r-hpc
  4. If you have a cluster of computers, you can use packages like ‘snow’ to process data in parallel.
  5. You can also use Hadoop and MapReduce to overcome this limitation, in case none of the above helps.
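
To make the first two points concrete, here is a minimal sketch using the ff package to read a large CSV in chunks, so the data is memory-mapped on disk rather than held entirely in RAM ("train.csv" and the chunk size are placeholders for your own file):

library(ff)

big <- read.csv.ffdf(
  file      = "train.csv",   # placeholder path to the large file
  header    = TRUE,
  next.rows = 100000         # parse 100k rows at a time to keep RAM usage flat
)

dim(big)      # dimensions are available without loading everything into RAM
big[1:5, ]    # pulling a few rows into memory returns a regular data.frame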

Hope these give you the right direction to proceed.

Regards,
Kunal


#3

In addition to what @kunal has already mentioned, you can use packages like ‘doMC’ and ‘parallel’ if you have a multi-core machine.
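
For example, a minimal sketch with the parallel package (mtcars and the tiny model below are just stand-ins for your own data chunks and fitting code):

library(parallel)

chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))  # toy stand-in for data chunks
cl     <- makeCluster(max(1, detectCores() - 1))              # leave one core free

# Fit a small model on every chunk in parallel, then shut the workers down
fits <- parLapply(cl, chunks, function(d) lm(mpg ~ wt, data = d))
stopCluster(cl)
length(fits)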

Also, instead of doing data exploration on the entire dataset, you can read samples instead of entire files at exploration stage.

yoursample <- yourdata[sample(nrow(yourdata), 2000), ]

will reduce your data to a random sample of 2,000 observations for exploration.


#4

You can also try H2O.
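
For reference, a minimal sketch of how that might look, assuming the h2o package is installed and "train.csv" stands in for your file (H2O keeps the parsed data in its own JVM-based cluster, so R only holds a reference to it):

library(h2o)

h2o.init(max_mem_size = "8g")          # start a local H2O cluster with 8 GB of memory
train <- h2o.importFile("train.csv")   # data is parsed outside of R's RAM
dim(train)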


#6

I would just go to the cloud on AWS (aws.amazon.com), use a Linux instance with 20 GB of RAM, and finish the work much quicker rather than learn a lot of new programming. Try it!


#7

@ajay_ohri

I tried to get the hang of it, but got confused about how to do this. Apparently, there are various types of instances, and I have no idea whether I would be able to upload data of this size without fail.

The upload time might end up being far higher.

Any suggestions to overcome these hurdles?


#8

Valid point. I would split the file into many parts and compress them before uploading. I would also look into the ff package, as mentioned above.
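
For example, a rough sketch of writing the data out as gzip-compressed chunks before uploading (mtcars and the chunk size stand in for the real dataset):

yourdata   <- mtcars                      # stand-in for the real large dataset
chunk_size <- 10                          # illustrative chunk size
starts     <- seq(1, nrow(yourdata), by = chunk_size)

for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(yourdata))
  con  <- gzfile(sprintf("part_%02d.csv.gz", i), open = "wt")   # compressed output file
  write.csv(yourdata[rows, ], con, row.names = FALSE)
  close(con)
}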


#9

Hi,

Please have a look at the R package data.table; it is a more advanced version of data.frame and should be able to handle data of GB size.
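
A minimal sketch, assuming "train.csv" is your file and the columns used for grouping below are hypothetical:

library(data.table)

DT <- fread("train.csv")                 # fast CSV reader, much quicker than read.csv
DT[, .(avg = mean(value)), by = group]   # hypothetical columns, shown only to illustrate grouping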

Best wishes,