At Analytics Vidhya, we love creating new content. That is the way we have grown!
While we have relied mostly on listening to our audience and simplifying what is difficult for them, how about putting a little science into this? That is exactly what we will be doing today for our friend Magazino!
Brief about Magazino
Magazino was started about a year ago by a few technopreneurs who wanted to bring data science to publishing. They believed that current media houses are not doing justice to what should be served to people, and that with the use of Big Data they can clearly differentiate their offering.
In the last year, Magazino has experimented a lot, publishing content on several topics. They have made sure that they capture all the data at the back end. Now, they want to put a bit of science into predicting which articles get shared on social media and which ones don’t! To that end, they have stored the social media performance of the 24,875 articles they have published to date, along with various attributes.
What is the gem of Auxesia?
Auxesia is the goddess of growth and prosperity in Greek mythology! Finding articles that can go viral in today’s age is like being blessed with a gem from Auxesia. So, you have to find these gems for Magazino!
Description of the dataset:
Here is a brief description of the attributes:
- id: Unique id of the article
- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- n_unique_tokens: Rate of unique words in the content
- n_non_stop_words: Rate of non-stop words in the content
- n_non_stop_unique_tokens: Rate of unique non-stop words in the content
- num_hrefs: Number of links
- num_self_hrefs: Number of links to other articles published by Magazino
- num_imgs: Number of images
- num_videos: Number of videos
- average_token_length: Average length of the words in the content
- num_keywords: Number of keywords in the metadata
- Category_article: Various categories of article
- kw_min_min: Worst keyword (min. shares)
- kw_max_min: Worst keyword (max. shares)
- kw_avg_min: Worst keyword (avg. shares)
- kw_min_max: Best keyword (min. shares)
- kw_max_max: Best keyword (max. shares)
- kw_avg_max: Best keyword (avg. shares)
- kw_min_avg: Avg. keyword (min. shares)
- kw_max_avg: Avg. keyword (max. shares)
- kw_avg_avg: Avg. keyword (avg. shares)
- self_reference_min_shares: Min. shares of referenced articles in Magazino
- self_reference_max_shares: Max. shares of referenced articles in Magazino
- self_reference_avg_shares: Avg. shares of referenced articles in Magazino
- Day_of_publishing: Day of week - Monday, Tuesday…Sunday
- LDA_00: Closeness to LDA topic 0
- LDA_01: Closeness to LDA topic 1
- LDA_02: Closeness to LDA topic 2
- LDA_03: Closeness to LDA topic 3
- LDA_04: Closeness to LDA topic 4
- global_subjectivity: Text subjectivity
- global_sentiment_polarity: Text sentiment polarity
- global_rate_positive_words: Rate of positive words in the content
- global_rate_negative_words: Rate of negative words in the content
- rate_positive_words: Rate of positive words among non-neutral tokens
- rate_negative_words: Rate of negative words among non-neutral tokens
- avg_positive_polarity: Avg. polarity of positive words
- min_positive_polarity: Min. polarity of positive words
- max_positive_polarity: Max. polarity of positive words
- avg_negative_polarity: Avg. polarity of negative words
- min_negative_polarity: Min. polarity of negative words
- max_negative_polarity: Max. polarity of negative words
- title_subjectivity: Title subjectivity
- title_sentiment_polarity: Title polarity
- abs_title_subjectivity: Absolute subjectivity level
- abs_title_sentiment_polarity: Absolute polarity level
- shares: Number of shares (target)
Here, polarity and subjectivity are measured on a relative scale. All the remaining terminology is standard in the publishing world. You have the titles of the articles and the keywords associated with them. If you are not sure about something, just google it!
They also have more than 10,000 articles in the pipeline, and they want to predict which of these are piping hot to share on social media! So, here you go: the train dataset contains the information on all 24,875 published articles, while the test dataset contains the 10,089 articles in the pipeline. You need to predict the number of shares for each article in the pipeline.
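To make the task concrete, here is a minimal baseline sketch in Python. It assumes the data arrives as CSV with the `id` and `shares` columns described above; the tiny in-memory samples below are made up for illustration and are not taken from the real files. It predicts the training mean of `shares` for every article in the pipeline, which is the best constant prediction under squared error and a useful floor to beat.

```python
import csv
import io
from statistics import mean

# Illustrative stand-ins for the train and test files (values are made up).
TRAIN_CSV = """id,n_tokens_title,shares
1,12,1200
2,9,800
3,10,3000
"""
TEST_CSV = """id,n_tokens_title
101,11
102,8
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


train_rows = read_rows(TRAIN_CSV)
test_rows = read_rows(TEST_CSV)

# Constant baseline: the mean of the target over the training articles.
baseline = mean(float(row["shares"]) for row in train_rows)

# One prediction per pipeline article, keyed by its id.
submission = [{"id": row["id"], "shares": baseline} for row in test_rows]
```

Any real model should score better than this constant prediction; if it does not, something is probably wrong with the features or the training setup.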
The evaluation criterion for this problem is Root Mean Squared Error (RMSE). We will put up a solution checker online on Saturday afternoon, where you can upload your solutions and check their accuracy based on 20% of the submission. The URL of the solution checker is:
You can also access the leaderboard at http://www.datahack.club:8000/learderboard
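If you want to score your predictions locally before uploading, RMSE is easy to compute yourself: take the square root of the average squared difference between actual and predicted shares. The sketch below is a plain-Python version; the function name `rmse` and the sample numbers are our own illustration, and the online checker may differ in details such as which 20% of rows it scores.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example values, for illustration only.
actual_shares = [1200, 800, 3000]
predicted_shares = [1000, 1000, 2500]
score = rmse(actual_shares, predicted_shares)
```

Because the errors are squared before averaging, a few badly missed viral articles can dominate the score, so it pays to look at your largest errors first.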
Expected code of conduct:
- The aim of this hackathon is to create an environment of collective learning in the Analytics Vidhya community. If you are facing a conflict at any point in time, remember this and take the right decision.
- If you are facing a challenge that you need help with, please use the discussion portal. Keep the discussions specific and useful. If they are not, they will be deleted. You should post these discussions in the category - Hackathons - Online Hackathon
- For non-technical discussions and for giving high-fives to the community, use the Slack channel. People are already rolling up their sleeves there. You should already have received an invite to the Slack channel. If not, please drop me a PM - you should join the fun there. http://analyticsvidhya.slack.com
- You are free to use whatever software you wish, as long as you have rightful access to it.
- The winner takes away Amazon vouchers worth INR 10,000 (~$200). You would need to share your solution and the code with the community.
- We are learning as much about community building as you are about data science. So, if you see a bug (yes, those creepy creatures!), please raise it with us through Slack or PM any member of the team directly.
- At the end of the contest, we will share the larger dataset and the correct answers for the problem.