Best ML tool for large datasets (10-100 GB)



Hi, what is the best tool for machine learning on datasets in the 10-100 GB range?

This falls between the up to 10 GB solutions (available in R/Python) and the 100 GB+ solutions (Amazon + Spark).

My configuration is an i7 dual-core processor with 8 GB RAM and a 1 TB HDD, running Windows 7.

I have looked into the following:

  1. data.table/ff/SQLite packages with R (unwieldy syntax)
  2. Python with pandas (not tried, as I believe it is also in-memory)
  3. Smart sampling with R to reduce the dataset size (tried, but ran into issues)
  4. Hiring an Amazon instance with free RAM (good idea, but how do I transfer the data from my laptop to the server? It is also too expensive)

These do not seem to fit my requirements. I need something that:

  1. Is open source and free
  2. Has consistent, sensible syntax and documentation
  3. Has a good library of algorithms, especially random forest, logistic regression and clustering
  4. Works out of memory, i.e. from the hard disk, so it can handle data sizes of 10 GB+

I’ve heard about Vowpal Wabbit, but I have no idea how to go about learning it, what its list of algorithms is, etc.
Also, are there any other out-of-memory tools that work well?

Please help, it’s a little urgent.



Apache Spark is probably your best bet. It is moderately easy to use and handles larger-than-memory datasets automatically; see Beginners Guide: Apache Spark Machine Learning Scenario With A Large Input Dataset. It also has wrappers for R and Python, so there is no need to learn a new language's syntax on top of Spark.

Another alternative is to use learning methods that can learn incrementally, i.e. update the model one batch at a time so the full dataset never has to fit in RAM. In this case, sklearn can solve your problem: Scikit-learn: Strategies to scale computationally

I also really recommend using a cluster. Even though the methods above make training on large datasets possible, they will be painfully slow on a laptop. If you have the time to train a model with the data on the HDD, you probably have the time to upload the data to a cloud cluster. gcloud offers a 60-day trial with $300 of credit that you can use to run 8-core VMs with 50 GB of RAM.