AWS Hadoop JSON Data Processing

Tags: json, hadoop

#1

Hi Friends,
I am new to big data and have recently started working on a cloud-based (AWS) Hadoop project. I have set up the AWS environment (4 t2.large EC2 instances with a 100 GB data volume per instance) and installed the Cloudera distribution. I have tested a couple of examples using word count, CSV files, etc.

Now, my main project is to analyze research-article data in JSON files. I have around 4 million JSON files, close to 70 GB of data, with each JSON file containing all the information for one article (i.e. around 4 million articles). The files are unrelated to each other, run to 340+ lines each in a multi-level (nested) structure, and are spread across 400 folders, with each folder containing 10,000 JSON files. I want to analyze this data (bring it to a form that can be analyzed), but I am a bit stuck here and not sure how to move forward.

Maybe I should convert to CSV, but converting all of this into CSV may take a long time. I am not sure whether dumping it into HDFS and running MapReduce on top of it is a good idea, or whether I should move it to Hive. The number of files and the total size have made me a little hesitant about moving forward.
Please advise on a possible approach.

Looking forward to hearing from you. Thanks in advance.


#2

Hi Rashnil,

If you want to analyse the data in R, please try this workaround.

**Read the JSON File**
A JSON file is read by R using the fromJSON() function and is stored as a list in R.

# Load the package required to read JSON files.
library("rjson")

# Give the input file name to the function.
result <- fromJSON(file = "input.json")

# Print the result.
print(result)


**Convert JSON to a Data Frame**
We can convert the extracted data above to an R data frame for further analysis using the as.data.frame() function.

# Convert the parsed JSON list from above to a data frame.
json_data_frame <- as.data.frame(result)

# Print the data frame.
print(json_data_frame)
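
If you want to read more than one file, you can loop over a folder. Here is a rough sketch (the folder path is just a placeholder, and for 4 million files this will be slow, but it works for a sample):

# Load the package required to read JSON files.
library("rjson")

# Placeholder folder holding some of the JSON files.
files <- list.files("articles/folder_001", pattern = "\\.json$", full.names = TRUE)

# Parse each file into one element of a list.
articles <- lapply(files, function(f) fromJSON(file = f))

# How many articles were parsed.
length(articles)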

Hope this helps.

Regards,
tony


#3

Thanks tony,
This is good. However, the number of files I have to process is pretty huge, so even if I try this it won't work in R; R is usable when the data volume is manageable. I know how to process CSV data in a Hive warehouse for analysis, but I am not sure how to process JSON data. If you know how we can convert JSON to CSV (considering the 4 million files), please let me know.

By the way, I have tried fromJSON before, and it usually fails when the number of files is huge.


#4

@Rashnil: just curious, is converting JSON to CSV format actually necessary for the analysis?

Depending on which software you use for data analysis, you could maybe process the JSON directly, right?


#5

Hi,
That was my initial question. I know how to process CSV on Hadoop/Hive, but I am not sure how it will work out with JSON. I have tried a JSON SerDe and ran into many issues processing it. That is the reason I talked about converting to CSV.

Do you have any suggestions for processing JSON on Hive/Hadoop, especially for many large JSON files?


#6

Hi Rashnil,

Please try this:

Create an external table in Hive and load the JSON files into it. When setting up a Hive external table, you just specify the data source as the folder that contains all the files (regardless of their names), so there is no need to load the files one by one.

You can then analyse the data in Hive, or bring the results into R and analyse them there.
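
For example, something along these lines. This is only a sketch: it assumes the hive-hcatalog-core JSON SerDe is available (you may need to ADD JAR it first), the column names and the HDFS path are hypothetical placeholders for your real article schema, and note that this SerDe expects one JSON document per line, so pretty-printed multi-line files would need to be compacted first.

-- Hypothetical schema; replace with the real top-level JSON keys.
CREATE EXTERNAL TABLE articles (
  title    STRING,
  authors  ARRAY<STRING>,
  pub_year INT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/articles/';

-- Let Hive recurse into the 400 subfolders under LOCATION.
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

-- Example query once the table is in place.
SELECT pub_year, COUNT(*) AS n_articles
FROM articles
GROUP BY pub_year;

Nested levels of the JSON can be modelled with STRUCT and ARRAY column types instead of flattening everything to CSV.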

Hope this helps.

Regards,
tony


#7

Thanks tony. I will try this and get back.