Starting At Kaggle

r
kaggle
machine_learning
data_mining
random_forest

#1

I have started competing on Kaggle. I have gone through http://trevorstephens.com/post/72916401642/titanic-getting-started-with-r and I also know the main machine learning algorithms. Now I am trying to compete in various data science competitions. My problem is that I cannot apply machine learning algorithms like Random Forest to such large datasets because I only have 4 GB of RAM. What should I do?
The datasets in the Analytics Vidhya practice competitions are small, so I can easily apply these algorithms. But what about the Kaggle datasets?


#2

Hi @gau2112

You should be able to do quite a lot with 4 GB already (use gc() to check memory usage). If the data set is big you can reach R's memory limit, and this limit depends on the OS you have; in that case you have to work in chunks. If you want to go bigger, look at Amazon AWS (check the OS), where you can bid for capacity at a cheap price.
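Here is a minimal sketch of what working in chunks could look like in base R. The temporary CSV, the column names and the chunk size are made up for illustration; in a real competition the file would come from Kaggle and the per-chunk step would be whatever summary or feature extraction you need.

```r
# Hypothetical stand-in for a large competition file.
csv_path <- tempfile(fileext = ".csv")
set.seed(1)
write.csv(data.frame(id = 1:1000, value = runif(1000)),
          csv_path, row.names = FALSE)

chunk_size <- 250                      # rows per chunk; tune to your RAM
con <- file(csv_path, open = "r")

# Read the header line once and strip the quotes written by write.csv().
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

total_rows <- 0
value_sum  <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break        # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  # Keep only small running summaries, never the whole data set.
  total_rows <- total_rows + nrow(chunk)
  value_sum  <- value_sum + sum(chunk$value)
}
close(con)

mean_value <- value_sum / total_rows   # mean computed without loading all rows
```

Only one chunk is held in memory at a time, so peak usage is set by chunk_size rather than by the size of the file.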
If you hit the size limit for R objects, you can use a package such as bigmemory for some matrix operations. And once more, use rm() and gc() to clean up whatever you no longer use.
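As a rough illustration of the rm() and gc() habit, assuming a throwaway matrix named big that stands in for an intermediate object you no longer need:

```r
# Hypothetical intermediate object: 1e6 doubles, roughly 8 MB.
big <- matrix(runif(1e6), ncol = 100)
print(object.size(big), units = "MB")  # see how much memory it takes

# Keep the small result you actually need ...
col_means <- colMeans(big)

# ... then drop the large object and ask R to run garbage collection.
rm(big)
invisible(gc())                        # gc() also returns a memory usage table

exists("big")                          # the object is gone from the workspace
```

Calling gc() without invisible() prints the usage table, which is a quick way to see how close you are to the limit.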
Hope this helps.
Alain


#3

Thanks @Lesaffrea
So, you are suggesting that I optimize my code and global environment?
Edit -
I came across another R package today named ‘h2o’, and most of the scripts I have read on Kaggle use this package.


#4

Yes, in a way. It is the first place I would look before changing platform. In my experience I only run into issues when I start working on optimisation problems. With 4 GB of RAM I am surprised that you have issues with data sets from Kaggle competitions; with business data, well, that is another story.
H2O has good documentation and is certainly well known among Kaggle people. I tried it twice on my own problems and did not see an advantage. They have good integration with Hadoop, but how they manage the data I do not know. The algorithms do have a very good reputation, though. Some people in this forum can certainly give you good advice about H2O.
Have a good day
Alain


#5

Hi @gau2112
You were speaking about h2o. I was just reading about neural networks because I am facing an optimisation problem, and a colleague sent me this presentation about h2o.
I think it gives a good overview of what you can do with this package for deep learning.
Have a good read
Alain
h2o slides


#6

Thanks @Lesaffrea