Tips for handling Big Data efficiently in Python

big_data
python

#1

I have used Python for data analysis on moderate-sized datasets (thousands of data points). These have mostly been toy examples and IPython notebooks I gathered from around the web.

Now I am working on a huge dataset (13.5 GB), and some of the things that used to work with ease end up taking ages. Can people help me with some tips and tricks to work more efficiently on this dataset?

Specifically:

  • Any thoughts on how I can slice and dice the data more efficiently than with plain Pandas?
  • How can I use sparse data structures or algorithms, since quite a few columns have missing values?
  • Are there any libraries that work better on a dataset of this size by processing it in chunks or in parallel?

Any help is greatly appreciated.


#2

Have you looked at PyDoop (http://crs4.github.io/pydoop/)? Pydoop is a package that provides a Python API for Hadoop.
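
For example, here is a minimal sketch of streaming a file straight from HDFS with Pydoop's `hdfs` module; the path and column handling are placeholders, and the exact open mode can vary between Pydoop versions:

```python
import pydoop.hdfs as hdfs

# Stream a (hypothetical) CSV stored on HDFS line by line instead of
# pulling the whole 13.5 GB onto the local machine at once.
missing = 0
with hdfs.open("/user/srini/big_dataset.csv", "rt") as f:
    f.readline()  # skip the header line
    line = f.readline()
    while line:
        fields = line.rstrip("\n").split(",")
        missing += fields.count("")  # count empty cells as missing values
        line = f.readline()

print("cells with missing values:", missing)
```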

Or, for relational databases, try pyodbc (http://code.google.com/p/pyodbc/). pyodbc is a Python 2.x and 3.x module that allows you to use ODBC to connect to almost any database from Windows, Linux, OS X, and more.
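
A rough sketch of pulling a large table in batches with pyodbc; the DSN, credentials, table, and column names here are made up for illustration, and `fetchmany` keeps memory use bounded:

```python
import pyodbc

# Hypothetical connection string and table; adjust for your own database.
conn = pyodbc.connect("DSN=mydatasource;UID=user;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT user_id, amount FROM transactions")

# Fetch rows in fixed-size batches so the whole result set never
# has to sit in memory at once.
while True:
    rows = cursor.fetchmany(50000)
    if not rows:
        break
    for user_id, amount in rows:
        pass  # process one row at a time

conn.close()
```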

And for MongoDB, try PyMongo (http://api.mongodb.org/python/current/).
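
If the data lives in MongoDB, a minimal PyMongo sketch would look something like the following; the database, collection, and field names are placeholders:

```python
from pymongo import MongoClient

# Hypothetical connection, database, and collection names.
client = MongoClient("mongodb://localhost:27017/")
coll = client["analytics"]["events"]

# Project only the fields you need and let the cursor stream results
# from the server in batches instead of loading everything at once.
cursor = coll.find({"amount": {"$gt": 100}},
                   {"user_id": 1, "amount": 1}).batch_size(10000)
for doc in cursor:
    pass  # process one document at a time
```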


#3

Srini,

You can look at GraphLab Create (from Dato) for slicing and dicing really big datasets. SFrames are similar to Pandas DataFrames, but are far more efficient at handling big data because they work out of core. I have sliced a 22 GB dataset with GraphLab on my laptop (3rd-gen i7, 8 GB RAM) in under 10 minutes. A rough sketch of what that looks like is below.
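
This is only a sketch; the file, column, and aggregate names are placeholders, but since SFrames are stored on disk the dataset does not have to fit in RAM:

```python
import graphlab as gl

# SFrames are disk-backed (out of core), so reading a file larger
# than RAM is fine; file and column names here are hypothetical.
sf = gl.SFrame.read_csv("big_dataset.csv")

# Slice and dice much like a Pandas DataFrame.
big_spenders = sf[sf["amount"] > 100][["user_id", "amount"]]
totals = big_spenders.groupby("user_id",
                              {"total": gl.aggregate.SUM("amount")})
print(totals.head())
```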

J