Hackathon 3.0 - Share your approach / learning

hackathon

#10

Yes, it was primarily about strengthening the linear model!


#11

Thanks Analytics Vidhya and Kunal for the Hackathon. Congrats to all the participants. It was fun and a good learning opportunity.

My approach for the hackathon is as follows:

  1. Converted all the categorical variables into one-hot encoded variables
  2. Truncated the “Project Evaluation” values at the 99.9th percentile (6121). As Nalin mentioned in his post, if the DV distribution is different in the test set, then I am done for.
  3. Built tree-based models, selecting the parameters through cross-validation:
    a. Random Forest (2 models with different params - 1 with shorter trees and 1 with deep trees)
    b. Gradient Boosting (2 models with different params)
    c. Extreme Gradient Boosting (2 models with different params)
  4. Took a simple weighted average of all six models, with weights based on local validation (a minimal sketch of steps 2 and 4 is included below)

Please find the code which I have used in this Github link. Thank you all once again.
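
A minimal sketch of steps 2 and 4, assuming a data frame named `train` with the target in a `Project_Evaluation` column (placeholder names; the actual code is in the GitHub link above):

```r
# Step 2: truncate the target at its 99.9th percentile (~6121 on the train set).
# 'train', the file name and the column name are placeholders, not taken from
# the real solution.
train <- read.csv("train.csv")
cap <- quantile(train$Project_Evaluation, 0.999)
train$Project_Evaluation <- pmin(train$Project_Evaluation, cap)

# Step 4: simple weighted average of the six model predictions,
# with weights chosen from local validation scores
blend <- function(preds, weights) {
  weights <- weights / sum(weights)
  Reduce(`+`, Map(`*`, preds, weights))
}
# e.g. final <- blend(list(pred_rf1, pred_rf2, pred_gbm1, pred_gbm2,
#                          pred_xgb1, pred_xgb2),
#                     weights = c(1, 1, 1, 1, 1, 1))
```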


#12

@SRK - saw your code on the GitHub link. Very professionally done! You surely deserve to win.


#13

@Nalin @SRK @nayan1247 @aayushmnit Just out of interest, did any of you try first building a classifier to predict projects unlikely to get funded (a significant percentage of them were not funded), and then building a regression on top of it?

That was the first thought that came to my mind when I saw the dataset, but apparently no one has mentioned it. Just curious!

Regards,
Kunal


#14

I tried it. But my best score on that was 586. So I didn’t use it.


#15

@kunal: That thought came to my mind after the competition was over. I was thinking of two-step modelling: first building a classifier for 3 categories (project not funded, project funded, and blockbuster funding (> $6k)), and then doing regression modelling to predict the exact amount. I think the two-day time frame restricted people from going for it; at least that was the case for me.


#16

I tried a two-stage approach as well, a classifier followed by a regressor. As Nalin mentioned, it didn't beat the standalone regression score.

My reasoning is that, since I am using tree-based regression models, the prediction is the average of the observations in the leaf node. So if all the observations in a leaf node are 0 (not funded), the average, and hence the regression prediction, will also be 0. The tree splits nodes to minimize mean squared error, so it will tend to group the 0s together. I therefore think a tree-based regression model by itself can perform similarly to a two-stage model. However, I didn't spend much time on the two-stage approach; eager to hear whether it worked for anyone else.
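
For concreteness, a hypothetical sketch of the classifier-then-regressor idea being discussed (placeholder data frames and column names; not anyone's actual solution):

```r
library(randomForest)

# Hypothetical two-stage sketch: classify funded vs. not funded, then regress
# the amount on the funded subset only. 'train'/'test' and the column names
# are placeholders.
train$funded <- factor(train$Project_Evaluation > 0)

# Stage 1: classifier
clf <- randomForest(funded ~ . - Project_Evaluation, data = train, ntree = 300)

# Stage 2: regressor fitted only on the funded projects
reg <- randomForest(Project_Evaluation ~ . - funded,
                    data = train[train$funded == "TRUE", ], ntree = 300)

# Predict 0 where the classifier says "not funded", the regression output otherwise
is_funded <- predict(clf, newdata = test) == "TRUE"
pred <- ifelse(is_funded, predict(reg, newdata = test), 0)
```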


#17

Hi All,

Thanks AV for organizing this Hackathon. It was a good learning opportunity for me.

Here is how I approached this problem (a rough sketch of the last two steps follows the list):

  • I looked into the levels of the categorical variables and created a data dictionary noting the level gaps, as I found that the levels differ between the training and test sets (e.g. some cities appear only in the training set and are missing from the test set, and vice versa)
  • Ran a simple linear model to see whether some of the high-cardinality categorical variables impact the funding, and found that the state column has some impact on the valuation
  • Converted some of the categorical variables into 1/0 encoded variables
  • Looked into the distribution of the Valuation column, identified that it is log-normal, and removed outliers whose valuation is > $5750 (experimentally optimized)
  • Ran rpart over Similar Project Valuation to see its impact on subsequent funding and found that there is a significant shift in mean values between Similar Project Valuation > $549 and < $549
  • Made two Random Forest models, one for Similar Project Valuation > $549 and one for < $549, and simply merged their results for the final output
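
A rough sketch of those last two steps (placeholder data frame and column names; the real code is in the link below):

```r
library(rpart)
library(randomForest)

# Rough sketch of the last two bullets above. 'train'/'test' and the column
# names are placeholders, and $549 is the cut point reported above.

# Regression tree on Similar Project Valuation alone, to inspect the split
split_tree <- rpart(Project_Evaluation ~ Similar_Project_Valuation, data = train)

# Two Random Forests, one per segment of Similar Project Valuation
low <- train$Similar_Project_Valuation < 549
rf_low  <- randomForest(Project_Evaluation ~ ., data = train[low, ],  ntree = 500)
rf_high <- randomForest(Project_Evaluation ~ ., data = train[!low, ], ntree = 500)

# Score each test row with the model for its segment and merge the results
test_low <- test$Similar_Project_Valuation < 549
pred <- numeric(nrow(test))
pred[test_low]  <- predict(rf_low,  newdata = test[test_low, ])
pred[!test_low] <- predict(rf_high, newdata = test[!test_low, ])
```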

Sharing the data dictionary: data_dictionary_myversion.csv (102.4 KB)
Please find my code here

Regards,
Aayush


#18

Well, my rank is 12; it is not great, but not too bad either. My method is simple and fairly basic, but I think it is still worth sharing with you all. My language is R (with the randomForest package). I am a total newbie in this field.

1. I converted all variables to numeric so I could do regression more easily, e.g.:
training_convert_set$city <- as.numeric(training_set$institute_city)
It is not a good method at all (without more processing); I hope I can improve it. :"<

2. I tried two methods, glm and random forest, but glm worked poorly, so my solution is entirely from random forest.
I tried tuneRF to find a better mtry, but my computer is too old (a Pentium :"<), so my randomForest has ntree = 300 and nodesize = 70.
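
For anyone curious, a small sketch of that setup, assuming the converted data lives in `training_convert_set` and the target column is named `Project_Evaluation` (the column name is my assumption):

```r
library(randomForest)

# Small sketch of the setup above; the target column name is an assumption.
x <- training_convert_set[, setdiff(names(training_convert_set), "Project_Evaluation")]
y <- training_convert_set$Project_Evaluation

# tuneRF reports the out-of-bag error for each mtry it tries
mtry_search <- tuneRF(x, y, ntreeTry = 100, stepFactor = 1.5, improve = 0.01)

# Final forest with the settings mentioned above
rf <- randomForest(x, y, ntree = 300, nodesize = 70)
```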

@kunal, is the ranking on the leaderboard the final ranking?


#19

Tried logistic regression / XGBoost / kNN classification for Not_funded vs Funded on the raw data. Overall classification accuracy was hovering around 50-60 percent, so I did not think that would be a wise option and dropped it there.
The projects that got funded but get classified as not funded directly get a 0 for Project_Evaluation (hurting RMSE).

Would love to understand if someone did a 2 step approach and was able to tune it to get <500 RMSE.

-Nayan


#20

Time to announce the results now. Last weekend, it was amazing to watch and learn from data scientists fighting hard and trying till the end.
The new champion of Data Hackathon 3.0 is Mr. Sudalai Rajkumar (@SRK). Congratulations, Raj, on winning an Amazon voucher worth Rs. 10,000 (~$200)!
Here are the final rankings: http://goo.gl/NVyCfK


#21

Congrats @SRK, that is what we expect from a Kaggle world rank 42 :smiley: :raised_hands:. Btw, thank you for your brilliantly written code :smiley:


#22

**Data Hackathon 3.0 Results**

Out of 300 participants, here are the Top 10 rankings of the participants:
Rank 1 - @SRK
Rank 2 - @aayushmnit
Rank 3 - @vikash
Rank 4 - @nayan1247
Rank 5 - @ankurv857
Rank 6 - @Nalin
Rank 7 - @Tulip4attoo
Rank 8 - @joshij
Rank 9 - @Saikrishna
Rank 10 - @Hemant_Rupani


#23

Congratulations @SRK :smile: :smile:


#24

Tried benchmarking first and went with random forest.
It gave an OK RMSE.

Then for feature engineering I removed a lot of redundant and expensive-to-use features, like geo-locations and extra details about those locations. (Now that I think of it, I could have tried the geo-locations, because RF could have handled them anyway.)

I then moved ahead and tried creating a couple of features, such as the average funding in the city and in the state. A few of these variables became prominent in the feature importance and reduced the RMSE.
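
Roughly, those aggregate features can be built like this (placeholder column names; not the exact code used):

```r
# Rough sketch of the city / state average-funding features; column names are
# placeholders, not the actual field names from the data set.
city_avg  <- aggregate(Project_Evaluation ~ institute_city,  data = train, FUN = mean)
state_avg <- aggregate(Project_Evaluation ~ institute_state, data = train, FUN = mean)
names(city_avg)[2]  <- "avg_funding_city"
names(state_avg)[2] <- "avg_funding_state"

# Merge the train-set averages onto both train and test (unseen cities get NA)
train <- merge(train, city_avg,  by = "institute_city",  all.x = TRUE)
test  <- merge(test,  city_avg,  by = "institute_city",  all.x = TRUE)
train <- merge(train, state_avg, by = "institute_state", all.x = TRUE)
test  <- merge(test,  state_avg, by = "institute_state", all.x = TRUE)
```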

Would love to see @SRK's and @aayushmnit's code… And it would be awesome if we could have that hangout I was talking about… with the winner…


#25

Wow, SRK is rank 42 on Kaggle? That is amazing.

And I am so happy with the final result :slight_smile: it is so much better than I expected.


#26

@vikash… both @srk and @aayushmnit have shared their approaches and code… let us do the hangout if there is a specific question / need that is unaddressed right now.

Also, what was the change which pushed you to top 3 in the leaderboard?

Kunal


#27

Thank you all :smile:

Congrats to other top finishers as well. Thanks Analytics Vidhya team for the Hackathon and helping us out when needed. :smile:


#28

@kunal I tried the approach you mentioned: I classified funded vs. not funded, then classified the valuation bucket, and then trained my regression model on top of that, i.e.
Class > Class > Regression
I was able to score 497 with that, and it would have performed better if I had used a few parts of @SRK's approach.
After ensembling the results of my 501, 497 and 507 submissions, I was able to score 483, but I couldn't make it into the top 10.
Will try to be more efficient in the next competition.

My code link: Github
The files with the scores are also available there.

