Handling large data files in python with low RAM low resource, creating large datafiles at local PC from SQLserver database using Python / ODBC



I am very new to Python. In our company we use Base SAS for data analysis (ETL, EDA, basic model building). We want to check whether replacing it with Python is possible for big chunk of data. With respect to that i have following few questions :

How do python handle large files? my PC has RAM of 8gb and i have a flat file of 30gb (say a csv file). I would generally do operations like left join, deleting, group by etc. on such file. This is easily doable in SAS i.e. I don’t have to be worried about low RAM. Are the same operations doable in python? would appreciate if somebody can provide the list of libraries & code for the same.

How can i perform SAS operations like “PROC SQL” in python to create dataset in my local PC while fetching the data from server. i.e. In sas i would download 10mln rows (7.5 gb of data) from SQL server by performing following operation

libname aa ODBC dsn =sql user = pppp pwd = XXXX;
libname bb ‘<>’;

proc sql outobs = 10000000;
create table bb.foo as
select * from aa.bar

What is the method to perform the same in python. Again just to remind you - my PC has only 8 gb RAM



I would recommend you to read this article, it will help you a lot to take decision between SAS Vs R (Vs Python).

Below are the list of libraries, you should focus for Data Science tasks (Data Cleaning, Exploration and Modeling):

  • Pandas: Used for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas were added relatively recently to Python and have been instrumental in boosting Python’s usage in data scientist community

  • Numpy: It stands for Numerical Python. The most powerful feature of NumPy is n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low level languages like Fortran, C and C++.

  • Matplotlib: Used for plotting vast variety of graphs, starting from histograms to line plots to heat plots… You can use Pylab feature in ipython notebook (ipython notebook –pylab = inline) to use these plotting features inline. If you ignore the inline option, then pylab converts ipython environment to an environment, very similar to Matlab. You can also use Latex commands to add math to your plot.

  • pyodbc: Used to retrieve data from SQL

  • pandasql: Allows you to query pandas DataFrames using SQL syntax. pandasql seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.

  • Scikit Learn: Used for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of effiecient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.

I would also recommend you to go through below listed cheat sheets for basic data exploration and visualization in python!

Hope this helps!