Predictive Modelling on Large Data Set



Hi everybody,
I want to do predictive modelling on a Kaggle data set with 29 million observations. When I try to apply KNN or logistic regression in Python, my screen freezes. After checking System Monitor, I found that RAM was full and swap space was also nearly used up.

After searching a bit online, I have found the following possible solutions:

  • Upgrading my laptop with more RAM and/or an SSD (currently I have 8 GB RAM and a 1 TB HDD).

  • Parallel computing on the GPU using PyCUDA (I have an Nvidia GT 740M).

  • Dividing the data into small chunks and then fitting the algorithm on each chunk individually.

  • Using a different language or library (like Julia, PySpark, H2O, etc.).

Please suggest the best possible solution. If you have any other ideas, feel free to share them.

Thanks in advance


Hi @ravi_6767,

You could reduce the amount of data used for training by applying methods like dimensionality reduction or by dropping redundant features.
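A small sketch of that idea with scikit-learn, assuming a synthetic feature matrix: first drop near-constant features, then project the rest onto a few principal components (the sizes and thresholds here are illustrative, not a recommendation for this particular data set):

```python
# Sketch: shrink the feature matrix before training.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
X[:, 10] = 0.0  # simulate a constant, uninformative feature

# Drop features with (near-)zero variance ...
X_reduced = VarianceThreshold(threshold=1e-3).fit_transform(X)

# ... then compress the rest into 10 principal components.
X_small = PCA(n_components=10, random_state=0).fit_transform(X_reduced)
print(X.shape, X_small.shape)  # (1000, 50) (1000, 10)
```

With far fewer columns, each chunk of training data occupies much less memory.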


Maybe you can also check the related discussions on Kaggle as a starting point.

Also, when faced with large data, the best approach is often to be a bit innovative about how you use your limited resources on the dataset. There is no end to how much computational power you may need if you do not use your resources efficiently.
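One concrete example of using limited RAM efficiently is downcasting pandas dtypes: by default pandas loads integers and floats as 64-bit and strings as Python objects, which is often several times larger than necessary. A sketch on a made-up frame (the column names are hypothetical):

```python
# Sketch: cut a DataFrame's memory footprint by downcasting dtypes.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "clicks": np.arange(100_000, dtype=np.int64),  # fits easily in int32
    "rate": np.linspace(0, 1, 100_000),            # float64 by default
    "country": ["IN", "US"] * 50_000,              # few distinct strings
})

before = df.memory_usage(deep=True).sum()
df["clicks"] = pd.to_numeric(df["clicks"], downcast="integer")
df["rate"] = df["rate"].astype(np.float32)
df["country"] = df["country"].astype("category")  # codes, not strings
after = df.memory_usage(deep=True).sum()
print(after < before)  # True
```

On real data with many string columns, tricks like this can shrink the frame by 50% or more before any modelling starts.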



Thanks @kunal and @jalFaizy.