Hi, what is the best tool for performing machine learning on datasets in the 10-100 GB range?
This falls between the up-to-10 GB range that in-memory R/Python handles and the 100 GB+ range where Amazon + Spark solutions apply.
My configuration is an i7 dual-core processor with 8 GB RAM and a 1 TB HDD, running Windows 7.
I have researched the following:
- The data.table/ff/SQLite packages with R (unwieldy syntax)
- Python with pandas (not tried, as I believe it is also in-memory; see the chunked sketch after this list)
- Smart sampling with R to reduce the dataset size, but I ran into issues
- Renting an Amazon instance with plenty of free RAM (a good idea, but how do I transfer the data from my laptop to the server? It also gets expensive)
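
For context, this is the kind of chunked workflow I was imagining for pandas, paired with scikit-learn's incremental SGDClassifier. It is only a sketch; the file name and the "label" column are made up:

```python
# Minimal sketch of out-of-core learning with pandas + scikit-learn.
# Assumptions (mine): a CSV called big_dataset.csv with numeric feature
# columns and a binary "label" column.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")  # logistic regression fit by SGD

# chunksize makes read_csv return an iterator of DataFrames, so only
# ~100k rows sit in RAM at a time instead of the whole 10-100 GB file.
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    X = chunk.drop(columns=["label"])
    y = chunk["label"]
    clf.partial_fit(X, y, classes=[0, 1])  # incremental update per chunk
```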
These do not seem to fit my requirements. I need something that:
- Is open source and free
- Has consistent, sensible syntax and good documentation
- Has a good library of algorithms, especially random forests, logistic regression, and clustering
- Works out-of-core, i.e. streams from the hard disk, so it can handle datasets of 10 GB+
I’ve heard about Vowpal Wabbit, but I have no idea how to go about learning it, what its list of algorithms covers, etc.
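
From skimming the VW docs, a first attempt through its Python bindings (pip install vowpalwabbit) might look roughly like the sketch below. The train.vw file, the feature names, and the flags are my assumptions, not something I have tested:

```python
# Rough sketch of Vowpal Wabbit via its Python bindings.
# VW's native text format is "label | name:value name:value ...",
# one example per line; logistic loss expects labels of -1 or 1.
import vowpalwabbit

model = vowpalwabbit.Workspace("--loss_function logistic --quiet")

# Feeding one line at a time keeps memory use flat regardless of file size,
# which is the whole point of VW's streaming design.
with open("train.vw") as f:  # e.g. lines like "-1 | age:42 income:55000"
    for line in f:
        model.learn(line.strip())

print(model.predict("| age:35 income:48000"))  # raw score for a new example
model.finish()
```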
Also, are there any other out-of-core tools that work well?
Please help; this is a little urgent.