Hackathon 3.0 - Share your approach / learning

hackathon

#1

This hackathon was fun to say the least! :exclamation:

The movements on the leaderboard, the confusion between model improvement and over-fitting, and the fight to be the best! Every one of us would have learnt a lot in the process - now is the time to share it with the larger community.

Please share your approach, findings and learnings here so that the larger audience can benefit from it.

Remember, there is a special prize for the person who spreads the most knowledge during / after the hackathon :slight_smile:


#3

I used an ensemble of two methods:

Method 1: No feature engineering at all. I just used feature hashing for dimensionality reduction, followed by ridge regression.

Method 2: A little feature engineering. I truncated Project_Valuation to a maximum value of US$30,000. I fear this could be the source of my overfitting: if the 100% test set has several instances over 30,000, then I'll get a huge error. I also removed the latitude and longitude features. After this feature engineering I used gradient-boosted trees.

My final submission was the simple average of the predictions from these two methods.
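
A rough sketch of this two-method ensemble in R - a sketch under assumptions, not the poster's actual code. It assumes data frames train and test with identical columns including the target Project_Valuation, and uses the FeatureHashing, glmnet and gbm packages; all column names are illustrative.

library(FeatureHashing)  # feature hashing
library(glmnet)          # ridge regression
library(gbm)             # gradient-boosted trees

y <- train$Project_Valuation

# Method 1: hash all predictors into a fixed-size sparse matrix, then ridge
X_hash <- hashed.model.matrix(~ . - Project_Valuation, data = train, hash.size = 2^18)
ridge <- cv.glmnet(X_hash, y, alpha = 0)  # alpha = 0 gives the ridge penalty

# Method 2: truncate the target at US$30,000, drop lat/long, boost trees
train2 <- train[, setdiff(names(train), c("Latitude", "Longitude"))]
train2$Project_Valuation <- pmin(train2$Project_Valuation, 30000)
boost <- gbm(Project_Valuation ~ ., data = train2, distribution = "gaussian",
             n.trees = 500, interaction.depth = 5, shrinkage = 0.05)

# Final submission: simple average of the two sets of predictions
X_test <- hashed.model.matrix(~ . - Project_Valuation, data = test, hash.size = 2^18)
pred <- (as.numeric(predict(ridge, X_test, s = "lambda.min")) +
         predict(boost, test, n.trees = 500)) / 2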


#4

My final solution was an ensemble of 2 linear methods.

Feature Generation -

Spent most of the time here.

  • Created second-order and log features of the Similar_project metric (see the sketch after this list).

  • Also created a metric for each of the categorical (more than 2 level)
    variables: e.g. sum(Project_Evaluation) grouped by city, to generate a
    probability table per variable.

  • Used Latitude, Longitude and institute_county (all three proxying for
    institute_name) to get a similar probability table for college.
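
A minimal sketch of the first bullet in R, with a hypothetical column name Similar_Project (the probability table idea is sketched in post #6 below):

# Squared and log-transformed copies of the metric let a linear model pick up curvature
train$similar_sq  <- train$Similar_Project^2
train$similar_log <- log1p(train$Similar_Project)  # log1p is safe at zero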

Model 1
Linear model of all the above features including some interaction variables.

Model 2
Notice the relation between Project Evaluation and Similar project: the two are very highly correlated for Similar_project values >300 and <118. I used a linear model in this range to bump up the score. It is a clear pattern and not by chance, hence this approach.
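
A minimal sketch of that range-restricted model, with the same hypothetical column names as above:

# Fit a dedicated linear model only where the correlation is strong
in_range <- train$Similar_Project > 300 | train$Similar_Project < 118
range_lm <- lm(Project_Evaluation ~ Similar_Project, data = train[in_range, ])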

The final model was an ensemble of the above two.

Learnings

  • I was in two minds whether to submit the model that was placing me in the top 5 on the leaderboard, or a model that was more logical and giving better CV results. I ended up sending the logical model.

  • Spent a lot of time evaluating features. Found interesting relationships - a few helped, while a few others (the lat/long/county proxy for college) I gave up on in the final model.

  • Shall keep visiting this blog to check other solutions and see other interesting approaches.

Agenda for next hackathon

  • Manage the code and bookkeeping better.

  • Implement a robust CV methodology

  • Try out unexplored models

Update - Please find my code here.

Thanks, AV team, for organizing this. Looking forward to participating in future hackathons!

-Nayan


#5

@nayan - thanks for an informative post. Your treatment of the Similar_project metric seems like a good idea. Wish I had done that! What exactly do you mean by a probability table? And is there an R package to do this?


#6

@Nalin -

For every categorical variable I created a probability table by giving weights based on the target variable.

For example, take subject_area and Project Evaluation: sum Project Evaluation for each level to create an index of sorts. Finally we should have a table mapping each level to its weight.

Use this table to create one more column in the training dataset (the values will repeat, for sure). A few variables have a significant impact even in the linear frame. I tried this approach because the basic linear model was strong, and I was trying to add more information from the categorical variables using the table above.
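
A minimal sketch of such a probability table in R, assuming a data frame train with hypothetical columns Subject_Area and Project_Evaluation, and using dplyr (the post does not say which package was used):

library(dplyr)

# Sum the target per level, then normalise so the weights behave like probabilities
prob_table <- train %>%
  group_by(Subject_Area) %>%
  summarise(level_sum = sum(Project_Evaluation)) %>%
  mutate(subject_weight = level_sum / sum(level_sum))

# Join the weight back as one more column; values repeat within each level
train <- left_join(train,
                   prob_table[, c("Subject_Area", "subject_weight")],
                   by = "Subject_Area")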

Hope this helps.

Nayan


#7

If my understanding is right, this probability table approach is similar to what a decision tree would use? In your case it made sense to use it, as you were not using a tree model for your submission - you were simply adding information to your linear model.


#8

I believe that, to improve our learning, we should be allowed to keep submitting our prediction results to check the improvement of our models.


#10

Ya, it was primarily strengthening the linear model!


#11

Thanks Analytics Vidhya and Kunal for the Hackathon. Congrats to all the participants. It was fun and a good learning opportunity.

My approach for the hackathon is as follows:

  1. Converted all the categorical variables into one-hot encoded variables
  2. Truncated the “Project Evaluation” value at the 99.9th percentile (6121) - as Nalin mentioned in his post, if the DV distribution is different in the test set, then I am done.
  3. Built tree based models by selecting the params through cross validation
    a. Random Forest (2 models with different params - 1 with shorter trees and 1 with deep trees)
    b. Gradient Boosting (2 models with different params)
    c. Extreme Gradient Boosting (2 models with different params)
  4. Simple weighted average of all six models, based on local validation (a rough sketch follows this list)
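
A simplified sketch of steps 1-4 in R (only two of the six models shown; all column names, parameters and ensemble weights are illustrative, not SRK's actual values):

library(randomForest)
library(gbm)

# 1. One-hot encode the categoricals (model.matrix expands factor levels)
X <- model.matrix(Project_Evaluation ~ . - 1, data = train)

# 2. Truncate the target at its 99.9th percentile
y <- pmin(train$Project_Evaluation, quantile(train$Project_Evaluation, 0.999))

# 3. Tree-based models (the real params were chosen by cross-validation)
rf_short <- randomForest(x = X, y = y, ntree = 300, maxnodes = 32)  # shorter trees
rf_deep  <- randomForest(x = X, y = y, ntree = 300)                 # fully grown trees
df <- data.frame(X, y = y)
gbm_fit <- gbm(y ~ ., data = df, distribution = "gaussian",
               n.trees = 500, interaction.depth = 6, shrinkage = 0.05)

# 4. Weighted average, weights picked on local validation
pred <- 0.4 * predict(rf_deep, X) +
        0.3 * predict(rf_short, X) +
        0.3 * predict(gbm_fit, df, n.trees = 500)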

Please find the code which I have used in this Github link. Thank you all once again.


#12

@SRK - saw your code on the GitHub link. Very professionally done! You surely deserve to win.


#13

@Nalin @SRK @nayan1247 @aayushmnit Just out of interest, did any of you try building a classifier first to predict projects unlikely to get funded (a significant percentage of them were not funded), and then build a regression over it?

That was the first thought which had come to my mind, when I saw the dataset. But no one has mentioned that apparently. Just curious!

Regards,
Kunal


#14

I tried it, but my best score with that approach was 586, so I didn’t use it.


#15

@kunal: That thought came to my mind after the competition was over. I was thinking of two-step modelling: first a classifier for 3 categories (project not funded, project funded, blockbuster funding >$6k), and then regression modelling to estimate the exact amount. I think the two-day time frame restricted people from going for it; at least that’s the case with me.


#16

I tried a two-stage approach as well, a classifier followed by a regressor. As Nalin mentioned, it didn’t beat the standalone regression score.

My reasoning is that, since I am using tree-based regression models, the results are the average of the observations present in the leaf node. So if all the observations in a leaf node are 0 (not funded), the average will also be 0, and the regression prediction will be 0. The tree splits nodes to minimise mean squared error, so it will tend to put all the 0s together. I therefore think a tree-based regression model is by itself able to perform similarly to a two-stage model. However, I didn’t try much on the two-stage approach. Eager to see if it worked for anyone else.
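
A toy illustration of that point (made-up numbers, nothing from the actual data): a regression tree predicts the mean of the training targets in a leaf, so a leaf containing only unfunded (0) projects predicts exactly 0.

library(rpart)

# Six toy projects: three unfunded (0) and three funded
toy <- data.frame(x = c(1, 2, 3, 10, 11, 12),
                  y = c(0, 0, 0, 500, 600, 700))

# A single split is enough to isolate the zeros in their own leaf
fit <- rpart(y ~ x, data = toy, minsplit = 2, maxdepth = 1)
predict(fit, data.frame(x = c(2, 11)))  # 0 for the unfunded leaf, 600 (the leaf mean) for the other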


#17

Hi All,

Thanks AV for organizing this Hackathon. It was a good learning opportunity for me.

Here is how I approached this problem:

  • I looked into the levels of the data and created a data dictionary noting the level gaps, as I figured out that the levels differ between the training and testing datasets (some cities are only in the training dataset but missing from testing, and vice versa)
  • Ran a simple linear model to see if some of the categories with a greater number of levels were impacting the funding, and found that the state column has some impact on the valuation
  • Converted some of the categorical variables into 1/0 encoded variables
  • Looked into the distribution of the Valuation column and identified it is a log-normal distribution, so removed outliers whose valuation is >$5,750 (experimentally optimized)
  • Ran rpart over Similar project valuation to see its impact on subsequent funding and found a significant shift in mean values between Similar project valuation >$549 and <$549
  • Made two Random forest models, one for Similar project valuation >$549 and one for <$549, and simply merged their results for the final output (see the sketch after this list)
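
A minimal sketch of the last two bullets in R, assuming data frames train and test with hypothetical columns Project_Valuation and Similar_Project_Valuation:

library(randomForest)

# Two forests, one per side of the $549 threshold found above
lo_fit <- randomForest(Project_Valuation ~ .,
                       data = subset(train, Similar_Project_Valuation <= 549))
hi_fit <- randomForest(Project_Valuation ~ .,
                       data = subset(train, Similar_Project_Valuation > 549))

# Route each test row to the forest trained on its side of the threshold
pred <- ifelse(test$Similar_Project_Valuation <= 549,
               predict(lo_fit, test),
               predict(hi_fit, test))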

Sharing the data dictionary: data_dictionary_myversion.csv (102.4 KB)
Please find my code here

Regards,
Aayush


#18

Well, my ranking is 12; it is not good but not too bad. My method is simple and quite poor, but I think it is better to share it with you all. My language is R (with the randomForest package). I am a total newbie in this field.

1. I converted all variables to numeric so I could run regressions more easily, e.g.:
training_convert_set$city = as.numeric(training_set$institute_city)
It is not a good method at all (without more processing); hope I can improve my method :"<

2. I tried two methods, glm and random forest, but glm worked poorly, so my solution is all from random forest.
I tried tuneRF to find a better mtry, but my computer is too old (a Pentium :"<), so my randomForest has ntree = 300, nodesize = 70.
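
A minimal sketch of that final model, assuming a data frame train with a hypothetical target column Project_Valuation (ntree and nodesize are the values quoted above):

library(randomForest)

# tuneRF searches over mtry by growing trial forests; left commented out,
# as in the post it was too slow on old hardware (assumes the target is column 1)
# tuned <- tuneRF(x = train[, -1], y = train$Project_Valuation, ntreeTry = 100)

fit <- randomForest(Project_Valuation ~ ., data = train,
                    ntree = 300, nodesize = 70)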

@kunal is the ranking on the leaderboard the final ranking?


#19

Tried logistic / XGBoost / kNN classification for Not_Funded vs Funded on the raw data. Overall classification accuracy hovered around 50-60 percent, so I did not think that would be a wise option and dropped it there.
The ones which got funded but were classified as not funded would directly get a 0 for Project_Evaluation (hurting the RMSE).

Would love to understand if someone did a two-step approach and was able to tune it to get <500 RMSE.

-Nayan


#20

Time to announce the results now. Last weekend, it was amazing to watch and learn from data scientists fighting hard and trying till the end.
The new champion of Data Hackathon 3.0 is Mr Sudalai Rajkumar (@SRK). Congratulations, Raj, on winning an Amazon voucher worth Rs. 10,000 (~$200)!
Here are the final rankings: http://goo.gl/NVyCfK


#21

Congratz @SRK, that is what we expect from someone with a Kaggle world rank of 42 :smiley: :raised_hands: . Btw, thank you for your brilliantly written code :smiley:


#22

**Data Hackathon 3.0 Results**

Out of 300 participants, here are the Top 10 rankings of the participants:
Rank 1 - @SRK
Rank 2 - @aayushmnit
Rank 3 - @vikash
Rank 4 - @nayan1247
Rank 5 - @ankurv857
Rank 6 - @Nalin
Rank 7 - @Tulip4attoo
Rank 8 - @joshij
Rank 9 - @Saikrishna
Rank 10 - @Hemant_Rupani