How can I access large datasets on Kaggle without downloading them to my system?



Are there any cloud drives that offer more than 20 GB of space?

I have been using the competition data on Kaggle to explore and practice analytics with R. The datasets in recent Kaggle competitions are running above 1 GB; for example, the Microsoft competition's data is about 17 GB. Is there any way to access this data without actually downloading it onto my personal system?

For example, could I download or copy the data to some cloud drive and then access it from there using R?





Most cloud services will provide 20 GB of space, though they might charge for it.

You can use AWS or Microsoft's own Azure to set up a server and do your machine learning there. I have also come across IBM SoftLayer, which provides a free one-month trial and 5 TB of data download per month at no cost; all uploads are free as well.

Any of these should be a good way to work on these datasets. Have fun building an anti-virus for Microsoft :smile:




Yesterday I set up a bucket on AWS, but I couldn't find a way to transfer the data from the Kaggle website to AWS without first downloading it onto my laptop. I tried the same with Google Drive.

Given the limitations of my Internet speed, it would take 3 days of continuous downloading to get the 17 GB zip file onto my laptop, and another 3-4 days to upload it to AWS.
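For scale: 17 GB over three days of continuous downloading works out to roughly half a megabit per second, which you can sanity-check with a one-liner:

```shell
# Implied link speed: 17 GB expressed in megabits, spread over 3 days.
# (Using 1 GB = 1024 * 8 Mbit here; the exact convention barely matters.)
awk 'BEGIN { printf "%.2f Mbit/s\n", 17 * 8 * 1024 / (3 * 24 * 3600) }'
```

So on a connection like that, moving the archive twice (down, then back up) really is the bottleneck.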

Have you ever tried this? Is it possible to transfer the data from the Kaggle website directly to AWS, Google, or any other cloud drive?




What you should do is spin up an EC2 instance; a medium-sized instance should be enough. Pick an Ubuntu image (or any other OS you are comfortable with) and give it a generous hard disk (I think there is an SSD option as well). After that, you can use wget to download the data directly onto the instance:

wget -c url    # -c is short for --continue; resumes an interrupted download

I think you should have the entire dataset on that instance in less than an hour. You can then run R / RStudio and start building your models.
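One caveat worth sketching out: Kaggle only serves competition files to a logged-in session, so a bare wget against the download link will usually get an error page rather than the zip. A rough sketch of the steps on the instance follows; `DATA_URL` is a placeholder for the competition file link, and `cookies.txt` is assumed to be your exported browser cookies (both are assumptions, not real values):

```shell
# 1. Check there is room for the archive *plus* the extracted files
#    (for a 17 GB zip, plan on several times that once unpacked).
avail_kb=$(df -P . | awk 'NR==2 {print $4}')
echo "free space: ${avail_kb} KB"

# 2. Download with resume support; -c picks up where a dropped
#    connection left off instead of starting over. Guarded so the
#    sketch is a no-op until DATA_URL is actually set.
if [ -n "${DATA_URL:-}" ]; then
  wget -c --load-cookies=cookies.txt "$DATA_URL"
fi
```

Since the instance sits on Amazon's backbone, the transfer that would take you days at home should finish in minutes.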

Hope this helps.