I am working on the following use case and need some help, as I am new to the data science field.
To analyze the behavior of the users of our music web application, we have collected all of their plays since 2009.
We store these plays in flat files. Each file contains the plays performed in one day.
Each file contains 50M lines.
We have 19M users.
Our catalog contains 35M tracks.
The format of these files is as follows: id-user | country | id-artist | id-track
Question: I would like to represent each user by a profile (a play profile). This profile would be used by the production site.
How can I implement the whole chain? Could you please describe the whole process from beginning to end, and which tools I could use, for example Pig or Hive…
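To make the idea concrete, here is a minimal sketch of what I mean by a play profile. It is only an assumption about one possible shape: per-user play counts by artist and by track. The function names, the dictionary layout, and the sample lines are all hypothetical, and it runs in memory on a handful of lines, which would not scale to the real files.

```python
from collections import Counter, defaultdict


def parse_play(line):
    """Split one 'id-user | country | id-artist | id-track' line into four fields."""
    user_id, country, artist_id, track_id = (field.strip() for field in line.split("|"))
    return user_id, country, artist_id, track_id


def build_profiles(lines):
    """Count plays per artist and per track for each user (hypothetical profile shape)."""
    profiles = defaultdict(
        lambda: {"country": None, "artists": Counter(), "tracks": Counter()}
    )
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        user_id, country, artist_id, track_id = parse_play(line)
        profile = profiles[user_id]
        profile["country"] = country  # assumption: the last seen country wins
        profile["artists"][artist_id] += 1
        profile["tracks"][track_id] += 1
    return dict(profiles)


# Made-up sample lines in the file format described above.
sample = [
    "u1 | FR | a10 | t100",
    "u1 | FR | a10 | t101",
    "u1 | FR | a20 | t200",
    "u2 | US | a10 | t100",
]
profiles = build_profiles(sample)
print(profiles["u1"]["artists"].most_common(1))  # [('a10', 2)]
```

With 50M lines per day, I assume the same per-user grouping would have to run as a distributed job rather than in a single Python process.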
Any ideas, please?
Thank you in advance