Large dataset Python

machine_learning
python

#1

Hi everyone, I have a question about working with a large dataset in Python. I have a JSON dataset of around 3.5 GB; it contains 4 million rows and 10 columns. I need to traverse all 4 million rows in my code, but I am finding it difficult even to load the whole dataset. I tried using Dask, but could not find a way to traverse those 4 million rows (am I missing something with Dask?).

What is the likely solution here? Should I split my dataset and work on the smaller pieces individually, or is there a better alternative? I am not confident about using Hadoop/Spark for a dataset of less than 10 GB.
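
To make the "split and work on smaller pieces" idea concrete, here is a minimal sketch using only the standard library. It assumes the file is newline-delimited JSON (one record per line), which may not match the actual dataset. The function names `iter_chunks`, `process_chunk` and `count_rows` are hypothetical, chosen for illustration only.

```python
import json
from itertools import islice


def iter_chunks(path, chunk_size=100_000):
    """Yield lists of parsed records, at most chunk_size records each.

    Assumption: the file holds one JSON object per line (JSON Lines).
    Only one chunk is held in memory at a time.
    """
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # Read up to chunk_size raw lines without loading the whole file.
            lines = list(islice(f, chunk_size))
            if not lines:
                break
            # Skip blank lines, parse the rest.
            yield [json.loads(line) for line in lines if line.strip()]


def process_chunk(rows):
    """Hypothetical placeholder for the per-row work; here it just counts rows."""
    return len(rows)


def count_rows(path, chunk_size=100_000):
    """Traverse the whole file chunk by chunk and return the total row count."""
    total = 0
    for rows in iter_chunks(path, chunk_size):
        total += process_chunk(rows)
    return total
```

If the file really is line-delimited, `pandas.read_json(path, lines=True, chunksize=...)` offers the same pattern and yields DataFrames instead of lists. If the file is a single large JSON array, neither approach applies as written and a streaming parser would be needed.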


#2

Hi.

I have not tried this myself but it seems reasonable.

HTH

Dave