My final solution was an ensemble of 2 linear methods.
Feature Generation -
Spent most of the time here.
Created second-order and log features of the Similar_project metric.
Also created a metric for each of the categorical variables, as below -
city, sum(Project_Evaluation) group by City, to generate a probability
table for each categorical variable with more than 2 levels. Used
Latitude, Longitude and institute_county (all three proxying for
institute_name) to get a similar probability table for College.
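A minimal sketch of the feature steps described above, using only the standard library. It reads "probability table" as each level's share of the total Project_Evaluation sum, which is an assumption; the per-level mean is a common alternative. The function names, the `City` key and the toy rows are hypothetical and are not taken from my actual code.

```python
import math
from collections import defaultdict

def numeric_features(similar_project):
    """Return second-order and log transforms of one Similar_project value."""
    return {
        "similar_sq": similar_project ** 2,
        # log1p keeps zero values finite; assumes the metric is non-negative
        "similar_log": math.log1p(similar_project),
    }

def probability_table(rows, key, target="Project_Evaluation"):
    """Map each level of `key` to its share of the total target sum."""
    sums = defaultdict(float)
    for row in rows:
        sums[row[key]] += row[target]
    total = sum(sums.values())
    return {level: s / total for level, s in sums.items()}

# Toy rows (made up for illustration only)
rows = [
    {"City": "A", "Similar_project": 0, "Project_Evaluation": 10.0},
    {"City": "A", "Similar_project": 3, "Project_Evaluation": 30.0},
    {"City": "B", "Similar_project": 9, "Project_Evaluation": 60.0},
]

city_table = probability_table(rows, "City")
for row in rows:
    row.update(numeric_features(row["Similar_project"]))
    row["city_prob"] = city_table[row["City"]]
```

Since the table is built from the target, it should be computed on training rows only and then looked up for the test rows; otherwise the feature leaks the answer.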
Built a linear model on all of the above features, including some interaction variables.
Notice the relation between Project Evaluation and Similar project: the two are very highly correlated for similar_proj values above 300 or below 118. Used a separate linear model in this range to bump up the score. It is a clear pattern and not by chance, hence I took this approach.
Final model was an ensemble of the above two.
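A rough sketch of how the range-specific model and the ensemble could fit together, again with only the standard library. The thresholds 118 and 300 come from the pattern noted above; everything else is an assumption for illustration: the equal blend weight, falling back to the general model outside the range, the function names and the toy data.

```python
def in_range(x, low=118, high=300):
    """True where the strong linear pattern was observed (below low or above high)."""
    return x < low or x > high

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def ensemble(x, general_pred, line, weight=0.5):
    """Blend the range-specific line with the general model's prediction."""
    if not in_range(x):
        return general_pred  # outside the range only the general model is used
    a, b = line
    return weight * (a + b * x) + (1 - weight) * general_pred

# Toy data (made up): y = 2x + 1 inside the range, one unrelated point outside it
train = [(50, 101.0), (100, 201.0), (350, 701.0), (400, 801.0), (200, 10.0)]
xs = [x for x, _ in train if in_range(x)]
ys = [y for x, y in train if in_range(x)]
line = fit_line(xs, ys)
```

Here `general_pred` stands in for the prediction of the linear model built on all features. The blend weight is something to tune on CV rather than fix at 0.5.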
I was in two minds about whether to submit the model that put me in the top 5 on the leaderboard or a model that was more logical and gave better CV results. Ended up submitting the logical model.
Spent a lot of time evaluating features. Found some interesting relationships: a few helped, while a few others (the proxies for college - lat, long, county) I gave up on for the final model.
Shall keep visiting this blog to check other solutions and see other interesting approaches.
Agenda for next hackathon
Manage the code and bookkeeping better.
Implement a robust CV methodology
Try out unexplored models
Update - Please find my code here.
Thanks to the AV team for organizing this. Looking forward to participating in future hackathons!