Is there any way to overcome system limitations when running random forest on a large dataset in R?



Hi all,

I have a dataset with 1200+ variables. I want to find the variable importance using random forest for this dataset. I am facing memory issues / my system hangs completely when I run the predictor importance.

The total number of records is 120000. I could not sample the data for two reasons: 1. The ratio of records to columns already seems low. 2. There is huge variation of patterns within the dataset.

My system has 8 GB of RAM. I cannot use a cloud machine as the data is client-confidential. How can I overcome this issue and still find the variable importance?

Please help.



Here are a few things you can try in R:

  1. Check out the bigrf package on CRAN. It looks like it is meant for exactly the problem you are facing.
  2. If you are calculating the proximity matrix by any chance and you don't need it (likely the case), try adding `proximity = FALSE` to your `randomForest()` call.
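To make point 2 concrete, here is a minimal sketch of a memory-friendly call using the randomForest package. It assumes your data frame is named `df` with a target column `y` (both hypothetical names); the specific values for `ntree` and `nodesize` are just illustrative starting points, not tuned recommendations.

```r
library(randomForest)

# proximity = FALSE avoids building the n x n proximity matrix: for
# n = 120000 that matrix alone would be 120000^2 * 8 bytes (~115 GB),
# far beyond 8 GB of RAM.
fit <- randomForest(
  y ~ ., data = df,
  ntree       = 100,    # fewer trees than the default 500
  nodesize    = 50,     # larger terminal nodes -> smaller trees
  proximity   = FALSE,  # skip the proximity matrix
  keep.forest = FALSE,  # drop the forest itself if you only need importance
  importance  = TRUE    # compute permutation importance per variable
)

importance(fit)   # variable importance table
varImpPlot(fit)   # quick visual ranking
```

If even this runs out of memory, reducing `ntree` further or fitting on a subset of columns at a time are the usual fallbacks before moving to bigrf.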

If none of these work, let me know.



Thank you @kunal sir. One more thing I am planning to do is to remove the variables that are dominated by null values and zeroes. I hope this will considerably reduce the dataset size.
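That pruning step can be sketched in a few lines of base R. This assumes a data frame `df` (hypothetical name), and the 95% threshold is an arbitrary choice you would tune for your data:

```r
# Flag columns where more than 95% of entries are NA, and numeric
# columns where more than 95% of entries are zero.
too_many_na   <- sapply(df, function(col) mean(is.na(col)) > 0.95)
too_many_zero <- sapply(df, function(col)
  is.numeric(col) && mean(col == 0, na.rm = TRUE) > 0.95)

# Keep only the informative columns.
df_reduced <- df[, !(too_many_na | too_many_zero), drop = FALSE]
```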

Thank you.

Karthikeyan P