Predictive Modelling on Large Data Set

julia
python

#1

Hi everybody,
I want to do predictive modelling on a Kaggle data set with 29 million observations. When I try to apply KNN or logistic regression in Python, my screen freezes. After checking System Monitor, I found that RAM was full and swap space was also almost used up.

After searching online, I found the following possible solutions:

  • Upgrading my laptop's RAM and/or adding an SSD (currently I have 8 GB RAM and a 1 TB HDD).

  • Parallel computing with PyCUDA on my Nvidia GPU (I have an Nvidia GT 740M).

  • Dividing the data into small chunks and then fitting the algorithm on the individual chunks (see the sketch after this list).

  • Using a different language or library (such as Julia, PySpark, H2O, etc.).
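
For the chunking option, this is roughly what I had in mind: a minimal sketch using scikit-learn's out-of-core SGD-based logistic regression, assuming the data sits in a train.csv with numeric features and a binary target column (the file and column names below are placeholders):

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression fitted incrementally with SGD, one chunk at a time,
# so the full 29 million rows never need to sit in RAM together.
model = SGDClassifier(loss="log_loss")  # use loss="log" on older scikit-learn

classes = [0, 1]  # assumed binary target; replace with the actual labels

for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
    X = chunk.drop(columns=["target"])  # "target" is a placeholder column name
    y = chunk["target"]
    model.partial_fit(X, y, classes=classes)
```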

Please suggest the best possible solution; if you have any other solution, feel free to share it.

Thanks in advance


#2

Hi @ravi_6767,

You could reduce the amount of data used for training by applying methods like dimensionality reduction or dropping redundant features.
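
For example, something like scikit-learn's IncrementalPCA can do the dimensionality reduction chunk by chunk, so it also fits in limited RAM. A rough sketch, assuming numeric features and placeholder file/column names:

```python
import pandas as pd
from sklearn.decomposition import IncrementalPCA

# Fit PCA incrementally so the whole dataset never has to be in memory at once.
ipca = IncrementalPCA(n_components=20)  # 20 components is just an example

for chunk in pd.read_csv("train.csv", chunksize=500_000):
    X = chunk.drop(columns=["target"])  # "target" is a placeholder column name
    ipca.partial_fit(X)

# Afterwards, transform each chunk with ipca.transform(X) before training a model.
```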


#3

Maybe you can check the related discussion on Kaggle as a starting point: https://www.kaggle.com/svpons/facebook-v-predicting-check-ins/grid-knn/comments

Also, when you are faced with large data, the best thing is usually to be a bit innovative about how you use your limited resources on the dataset - there is no end to how much computational power you may need if you do not use your resources efficiently.
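
For example, just reading the columns you actually need and downcasting the dtypes when loading the file can roughly halve pandas' memory footprint. A rough sketch (the column names below are taken from the check-ins competition linked above, so adjust them to your file):

```python
import pandas as pd

# Load only the required columns and use 32-bit dtypes instead of
# pandas' default 64-bit ones to roughly halve the memory usage.
cols = ["x", "y", "accuracy", "time", "place_id"]
dtypes = {"x": "float32", "y": "float32", "accuracy": "int32", "time": "int32"}

df = pd.read_csv("train.csv", usecols=cols, dtype=dtypes)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
```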

Regards,
Kunal


#4

Thanks @kunal and @jalFaizy.