Exploratory analysis on 100GB data

big_data
python

#1

Hi,
I have around 100 GB of log data in CSV format and I wish to do exploratory analysis on it. Since pandas loads data into memory, I am looking for possible alternatives. I have tried GraphLab's SFrame on my 8 GB RAM machine, but it takes too much time to process even a subset of the data. Would a Spark DataFrame or an MPP database be a better alternative?

Can you please suggest the best approach for handling this amount of data? Also, since the data set is large, which visualization libraries can be used to visualize it?


#2

Hi @mtare,

pandas lets you load only a subset of the data (refer to the nrows or usecols arguments of the read_csv function).
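For reference, here is a minimal sketch of that idea (the file name and column names are made up for illustration); chunksize is another read_csv argument that streams the file in pieces so memory use stays bounded:

```python
import pandas as pd

# Illustrative file and column names; substitute your own.
LOG_FILE = "ad_logs.csv"

# Read only the first 100,000 rows.
sample = pd.read_csv(LOG_FILE, nrows=100000)

# Read only the columns you actually need.
subset = pd.read_csv(LOG_FILE, usecols=["timestamp", "campaign_id", "clicks"])

# Or stream the file in chunks so memory use stays bounded.
total_clicks = 0
for chunk in pd.read_csv(LOG_FILE, usecols=["clicks"], chunksize=1000000):
    total_clicks += chunk["clicks"].sum()
print(total_clicks)
```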

Alternatively, you can move on to big data visualization tools such as Tableau. Read this infographic on big-data visualization tools.


#3

Hi @jalFaizy,

For some of the analyses we wish to work on the entire data set, not just a subset. Can you suggest any other approach?


#4

Did you check out the alternative approach? Does it work for your problem?

PS: According to Tableau devs

Whether it’s structured or unstructured, petabytes or terabytes, millions or billions of rows, you can turn big data into big ideas. Tableau helps people unlock the value in today’s information flows, from clickstreams to sensor networks to infrastructure logs. Connect directly to local and cloud data sources, or import data for fast in-memory performance.


#5

Tableau would work for visualization purposes, but I still need an approach for processing and performing statistical analysis on the data.


#6

These links might help you:

TL;DR: I recommend relying on paid data analysis software (such as Pentaho) that works with big data technologies like Hadoop.


#7

I would suggest using Spark DataFrames for this task.
Spark DataFrames are distributed across a cluster, so they can handle data sets that do not fit in one machine's memory, and you can do EDA quite efficiently.
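To make that concrete, here is a minimal PySpark sketch, assuming Spark 2.x where the CSV reader is built in (on Spark 1.6 you would need the external spark-csv package); the path and column name are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-eda").getOrCreate()

# Read the raw CSV logs from HDFS (illustrative path).
df = spark.read.csv("hdfs:///data/ad_logs/*.csv", header=True, inferSchema=True)

# Basic exploration, executed in parallel across the cluster.
df.printSchema()
print(df.count())
df.describe().show()  # summary statistics for numeric columns

# Example aggregation on an assumed column.
df.groupBy("campaign_id").count().orderBy(F.desc("count")).show(20)
```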


#8

Thanks for all the inputs. I have been trying Spark DataFrames and I am able to operate on the data successfully.


#9

@mtare,

Glad that Spark was useful - would you be able to share your learnings and insights with the community?

Regards,
Kunal


#10

@kunal and others
Problem statement: I have around 100 GB of digital ad log data that I wanted to explore.
Approach followed so far: I found that Apache Spark is widely used for analysis of large-scale data. Apache Spark has a powerful set of libraries such as MLlib, GraphX, SQL and DataFrames, Spark Streaming, etc. I am primarily using the SQL and DataFrame APIs on Spark 1.6.
Spark DataFrames can handle large-scale data, and the API has good support for a wide range of DataFrame functions, which are very similar to those in the pandas library.
Converting data to Apache Parquet format: I first converted the data to Apache Parquet format with GZIP compression. The Parquet format is columnar and helps speed up operations. Converting to Parquet with GZIP/SNAPPY compression also reduced the size of the data (100 GB -> 20 GB), which reduces IO and improves performance. You can use Spark's read.parquet() method to read these files from HDFS on a multi-node cluster; a rough sketch of this step is shown below.
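A sketch of that conversion step, written with Spark 2.x syntax (on Spark 1.6, as used above, the compression codec is typically set via the spark.sql.parquet.compression.codec configuration rather than a writer option); the paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV logs (illustrative path).
df = spark.read.csv("hdfs:///data/ad_logs/*.csv", header=True, inferSchema=True)

# Write the data back out as compressed, columnar Parquet files.
df.write.option("compression", "gzip").parquet("hdfs:///data/ad_logs_parquet")

# Later analysis reads the much smaller Parquet files.
logs = spark.read.parquet("hdfs:///data/ad_logs_parquet")
```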
Visualizing data: After researching a bit, I found that in most cases visualizing all the data points at once doesn't make sense, so I am visualizing aggregated data only. The Spark DataFrame API has a convenient toPandas() method that converts a (small) result to a pandas DataFrame, which can then be plotted via matplotlib/seaborn; a sketch of this workflow follows.
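A sketch of that aggregate-then-plot pattern (the column names, grouping, and path are assumptions for illustration and will depend on the actual log schema):

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-viz").getOrCreate()
logs = spark.read.parquet("hdfs:///data/ad_logs_parquet")  # illustrative path

# Aggregate in Spark so only a small summary is pulled back to the driver.
daily = (logs.groupBy("date")
             .agg(F.count("*").alias("impressions"),
                  F.sum("clicks").alias("clicks"))
             .orderBy("date"))

# toPandas() is safe here because the aggregated result is small.
pdf = daily.toPandas()
pdf.plot(x="date", y=["impressions", "clicks"], figsize=(10, 4))
plt.show()
```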

Please let me know your comments, and do suggest any more efficient approach.