Knocktober - sharing solutions


#1

Traditionally most hackathons have a solutions-sharing thread, but I couldn’t find one for Knocktober. So, here goes:

Congrats to the winners and thanks for your wishes! Really, really happy about getting my hat-trick on AV :slight_smile:

This was a well-crafted dataset, with plenty of ideas for features and EDA. Unfortunately, I was busy at the World Championships on Friday-Saturday and had my return flight to India on Sunday, so I couldn’t participate in the channel discussions and other sharing of posts and ideas.

I decided to team up for the first time since I only had a few hours on Saturday. I managed to build a single XGB with 13 features (6 raw + 7 engineered) that scored 0.8375 on the public LB, which ranked 1st, and handed over the rest to SRK :slight_smile:

Here’s my full code: https://github.com/rohanrao91/AnalyticsVidhya_Knocktober

Our final submission was a weighted average of SRK’s best model and my best model.
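In code, the blend is just a weighted average of the two submission files. A minimal sketch in R (the file names, the Outcome column name and the 0.6/0.4 weights below are placeholders, not the actual values we used):

```r
# Weighted-average blend of two submission files (illustrative file names and weights).
sub_rohan <- read.csv("submission_rohan.csv")  # my single-XGB submission
sub_srk   <- read.csv("submission_srk.csv")    # SRK's best model

blend <- sub_rohan
blend$Outcome <- 0.6 * sub_rohan$Outcome + 0.4 * sub_srk$Outcome  # assumed weights

write.csv(blend, "submission_blend.csv", row.names = FALSE)
```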

I’ll write a detailed post about my approach in this competition and also a general post on my hat-trick sometime later this week.


#2

Congratulations…


#3

Congratulations!! Thanks for sharing.


#4

Congrats Rohan!
You make it seem so simple!

Do we understand the difference between the public and final LB? Was there a separation by time between them?


#5

Can anyone share their Python code?


#6

Congratulations Rohan & SRK. Thanks Rohan for sharing the solution.


#7

Congrats Rohan and SRK.

Regards
Deepak


#9

Public LB is when your model is tested on 30% of the data available to them. Private LB is when they run your model on the entire data available to them. That’s why you see a huge rank difference between the two.


#10

Congrats @Rohan_Rao. I am wondering how you were able to manage the entire competition in just two hours. Great!!!


#11

Congratulations Rohan and SRK, thanks for sharing the solution!


#12

For this hackathon I started by analysing the business problem.
This challenge was a little different from the other DataHacks in the sense that the value to be predicted is a calculated column based on business rules.
I analysed the Patient Profile information and formed a common-sense view of which features would contribute to the model.

For example:

City_Type may be a good predictor, as it could indicate a city with low work-life balance.

People who are following the event on social media (Online_Follower, Linkedin_Shared, Twitter_Shared, Facebook_Shared) might be interested in showing up.

Income can be a good predictor, as people with low income might want to use the health camp to get a free health score.

Age might be a good predictor, but I am not sure about education.

I also thought about what might cause a high drop-off between registration and the number of people attending the camps. Since these are people with low work-life balance:

  1. They might forget about the event, or their interest level might drop, as the duration between the date of registration and the start date of the camp increases.

  2. They might not show up if the camp is not held during a weekend.

  3. There might be seasonality in show-ups during holidays like Christmas and Thanksgiving.

Based on the above analysis I created new features (a rough sketch of the code is below, after this list):

  1. Number of days between camp start date and end date.

  2. Number of weekend days between camp start date and end date.

  3. Number of days between first interaction and Registration_Date.

  4. Number of days between registration and camp start date.

  5. First interaction day of week.

  6. Registration day of week.

  7. Social media: the sum of all four social-media fields.

  8. Registration and first interaction dates split into day, month and year (helps capture seasonality).
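Roughly, the feature creation in R looked like the sketch below. The file names, column names and the date format string are assumptions about the raw Knocktober files and may need adjusting:

```r
# Build the date / social-media features listed above.
# Column names and the "%d-%b-%y" date format are assumed, not verified.
train   <- read.csv("Train.csv")
profile <- read.csv("Patient_Profile.csv")
camps   <- read.csv("Health_Camp_Detail.csv")

df <- merge(merge(train, profile, by = "Patient_ID"), camps, by = "Health_Camp_ID")

to_date <- function(x) as.Date(x, format = "%d-%b-%y")
df$Registration_Date <- to_date(df$Registration_Date)
df$First_Interaction <- to_date(df$First_Interaction)
df$Camp_Start_Date   <- to_date(df$Camp_Start_Date)
df$Camp_End_Date     <- to_date(df$Camp_End_Date)

# Gap / duration features (in days)
df$camp_duration   <- as.numeric(df$Camp_End_Date - df$Camp_Start_Date)
df$interact_to_reg <- as.numeric(df$Registration_Date - df$First_Interaction)
df$reg_to_camp     <- as.numeric(df$Camp_Start_Date - df$Registration_Date)

# Number of weekend days between camp start and end ("6" = Sat, "7" = Sun)
df$camp_weekend_days <- mapply(function(s, e) {
  if (is.na(s) || is.na(e) || e < s) return(NA_integer_)
  sum(format(seq(s, e, by = "day"), "%u") %in% c("6", "7"))
}, df$Camp_Start_Date, df$Camp_End_Date)

# Day-of-week and day/month/year splits (helps capture seasonality)
df$reg_dow      <- as.numeric(format(df$Registration_Date, "%u"))
df$interact_dow <- as.numeric(format(df$First_Interaction, "%u"))
df$reg_day      <- as.numeric(format(df$Registration_Date, "%d"))
df$reg_month    <- as.numeric(format(df$Registration_Date, "%m"))
df$reg_year     <- as.numeric(format(df$Registration_Date, "%Y"))

# Combined social-media engagement
df$social_media <- df$Online_Follower + df$Linkedin_Shared +
                   df$Twitter_Shared + df$Facebook_Shared
```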

Training an xgboost model without the new features got a leaderboard score of 0.49. Running logistic regression with the new features gave an ROC AUC of 0.78. An h2o Gradient Boosting Machine got 0.812, but on analysing the variable importance I noticed I had included Health_Camp_ID by mistake.
After excluding Patient_ID and Health_Camp_ID from the model, the h2o.gbm with the new and given features got 0.814 ROC AUC. An xgboost model with the new features and cross-validation got 0.812 ROC AUC.
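A rough sketch of the h2o.gbm fit with the ID columns excluded (the "Outcome" target name and the hyper-parameter values below are illustrative, not the exact settings I used):

```r
# h2o GBM with Patient_ID / Health_Camp_ID dropped from the predictors.
# "Outcome" as the target and the hyper-parameters are assumptions.
library(h2o)
h2o.init(nthreads = -1)

train_h2o <- as.h2o(df)   # df: engineered training frame with a constructed Outcome target
train_h2o$Outcome <- as.factor(train_h2o$Outcome)

predictors <- setdiff(colnames(train_h2o),
                      c("Patient_ID", "Health_Camp_ID", "Outcome"))

gbm_model <- h2o.gbm(x = predictors, y = "Outcome",
                     training_frame = train_h2o,
                     ntrees = 300, max_depth = 5, learn_rate = 0.05,
                     nfolds = 5, seed = 2016)

h2o.auc(gbm_model, xval = TRUE)   # cross-validated AUC
h2o.varimp(gbm_model)             # confirm no ID columns sneak into the importances
```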
From previous winners’ posts I learnt that xgboost with model ensembling has usually worked for contests like this, so my final submission was a rank average of the submission files from the h2o.gbm and xgboost models.
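The rank average itself is only a few lines of R (the file and column names below are placeholders):

```r
# Rank-average blend of the two submission files.
sub_gbm <- read.csv("submission_h2o_gbm.csv")
sub_xgb <- read.csv("submission_xgboost.csv")

# Average the ranks of the predicted probabilities, then rescale to [0, 1]
# so the blended column still looks like a probability (AUC only cares about order).
rank_avg <- (rank(sub_gbm$Outcome) + rank(sub_xgb$Outcome)) / 2
sub_gbm$Outcome <- (rank_avg - min(rank_avg)) / (max(rank_avg) - min(rank_avg))

write.csv(sub_gbm, "submission_rank_average.csv", row.names = FALSE)
```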
I did data mining in my Masters program using JMP. I am a Business Intelligence and Analytics professional with strong SQL skills, working on building up my data science skills for my next job. For the past few months I have been learning and teaching myself R with the help of DataCamp and Coursera, and made my first attempts on Kaggle and Analytics Vidhya in August.
This is my second DataHack, and I am happy about the progress in making it to the top 5; it keeps me motivated to keep working on improving my skills to become a great data scientist.
This was possible mostly by following the tips from Analytics Vidhya articles by past winners and others.
Thank you all!


#13

@Rohan_Rao: Hello Rohan, congrats on your victory.
I have three questions about your code:

  1. Where exactly do you perform cross-validation in your code and what technique do you use? How do you assess the performance of the cross-validation?
  2. In the step where you look at the importance gain, how do you make sense of those numbers? For example, category 1 has a gain of 0.425, a cover of 0.11, etc. What does this mean and how do we know it is good?
  3. If you were to build an ensemble, how would you measure the correlation between different techniques and ensure the correlation was minimal?
    Also, any links / reusable code for the regular EDA work that goes on before model building would help a lot for beginners like myself.

Thanks
Kailash


#14
  1. I didn’t have much time in this hackathon, so my CV code isn’t part of it. SRK’s code has it; you can view it from there. I followed a similar approach using xgb.cv, which I’ve used multiple times in my previous hackathon codes (a rough sketch of that approach is below, after this list).

  2. You can read about XGB’s variable importance function here: https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd

  3. Simple correlation between SRK’s prediction file and mine. The lower the correlation, the better the ensemble, usually.
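For reference, a rough sketch of the xgb.cv approach mentioned in (1). The feature selection, eval metric and parameter values below are illustrative assumptions, not my exact settings:

```r
# Cross-validation with xgb.cv on an (assumed) engineered training frame `train`
# that has a constructed Outcome label plus the ID columns; all feature columns
# are assumed to be numeric / already encoded.
library(xgboost)

feature_cols <- setdiff(colnames(train), c("Patient_ID", "Health_Camp_ID", "Outcome"))

dtrain <- xgb.DMatrix(data = as.matrix(train[, feature_cols]), label = train$Outcome)

params <- list(objective = "binary:logistic",
               eval_metric = "auc",
               eta = 0.05, max_depth = 5,
               subsample = 0.8, colsample_bytree = 0.8)

cv <- xgb.cv(params = params, data = dtrain,
             nrounds = 1000, nfold = 5,
             early_stopping_rounds = 50, verbose = 0)

cv$best_iteration                       # number of rounds to use for the final model
cv$evaluation_log[cv$best_iteration]    # mean and sd of the CV AUC at that point
```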

EDA is very specific to the data and varies with every contest. I don’t follow any standard scripts / routines.


#15

Thank you very much. Appreciate the answers!


#16

Can anyone share the datasets? Thanks for sharing


#17

@fertueros

You can find the datasets on the competition page itself: https://datahack.analyticsvidhya.com/contest/knocktober-2016/


#18

Dear, the datasets on that page are not available now.


#19

@fertueros Yes, they are. Check under the data section.

If you still can’t get your hands on the data, it might be because you didn’t register for it.


#20

You’re right, I arrived late. I would appreciate it if anyone has a link to the data. Thanks!!