I have used Python for data analysis on moderately sized datasets (thousands of data points), mostly through toy examples and IPython notebooks I gathered from around the web.
Now I am working on a huge dataset (13.5 GB), and some of the things that used to work with ease are taking ages. Could people share some tips and tricks to help me work with this dataset more efficiently?
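For context, this is roughly the pattern I have been following so far (the file and column names below are made up for illustration):

```python
import pandas as pd

# Roughly my current workflow (file and column names are illustrative):
# load everything into memory at once, then slice and aggregate with Pandas.
df = pd.read_csv("big_dataset.csv")              # the full 13.5 GB file
subset = df[df["value"] > 0]                     # simple row filtering
summary = subset.groupby("category")["value"].mean()
print(summary.head())
```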
- Do you have any thoughts on how I can slice and dice the data more efficiently (compared to plain Pandas)?
- How can I use sparse algorithms, given that quite a few columns have missing values?
- Are there any libraries that handle data of this size better by processing it in chunks or in parallel? (See the sketch after this list for the kind of chunked approach I have in mind.)
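To make that last point concrete, this is the kind of chunked processing I am thinking of, again sketched with made-up file and column names; I am not sure whether this is the idiomatic way to do it:

```python
import pandas as pd

# Read the CSV in pieces so the whole 13.5 GB never sits in memory at once,
# and combine partial per-chunk aggregates at the end.
totals, counts = {}, {}
for chunk in pd.read_csv("big_dataset.csv", chunksize=1_000_000):
    grouped = chunk.groupby("category")["value"]
    for key, partial_sum in grouped.sum().items():
        totals[key] = totals.get(key, 0.0) + partial_sum
    for key, partial_count in grouped.count().items():
        counts[key] = counts.get(key, 0) + partial_count

# Per-category means assembled from the chunk-level sums and counts.
means = {key: totals[key] / counts[key] for key in totals if counts[key]}
print(means)
```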
Any help is greatly appreciated.