I am new to big data and have recently started working on a cloud-based (AWS) Hadoop project. I have set up the AWS environment (four t2.large EC2 instances with a 100 GB data volume per instance) and installed the Cloudera distribution. I have tested a couple of examples, such as word count on CSV files.
Now, my main project is to analyze research-article data stored in JSON files. I have around 4 million JSON files, close to 70 GB in total, with each file containing all the information for one article (i.e., around 4 million articles). The files are unrelated to each other, and each is around 340+ lines with a multi-level (nested) structure. They are spread across 400 folders, each containing 10,000 JSON files. I want to analyze this data, i.e., bring it into a form that can be analyzed, but I am a bit stuck here and not sure how to move forward.
Maybe I could convert everything to CSV, but that conversion may take a long time. I am also not sure whether dumping the files into HDFS and running MapReduce on top of them is a good idea, or whether I should move the data into Hive. The number of files and the total size have made me a little hesitant to move forward.
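To show what I mean by converting to CSV: each nested article would have to be flattened into one row. A minimal Python sketch of that flattening step is below; the article structure shown is made up, since my real files are much larger (~340 lines each).

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten a nested dict into dot-separated column names."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))   # recurse into nested objects
        elif isinstance(value, list):
            # join list values so each article stays a single CSV row
            flat[name] = "|".join(map(str, value))
        else:
            flat[name] = value
    return flat

# hypothetical article -- my real JSON files are multi-level and much bigger
article = json.loads(
    '{"id": 1, "title": "X", "authors": ["A", "B"], "meta": {"year": 2018}}'
)
row = flatten(article)
# row == {"id": 1, "title": "X", "authors": "A|B", "meta.year": 2018}
```

My worry is that running something like this over 4 million small files, one file at a time, would take very long, which is why I am asking whether HDFS/MapReduce or Hive is the better route.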
Please advise on a possible approach.
Looking forward to hearing from you. Thanks in advance.