Share your approach - The Ultimate Student Hunt


#1

Dear all,

If you participated in “The Ultimate Student Hunt”, please use this thread to share your approach and code. Feel free to share what worked and what did not.

What else would you have tried and what were your learnings during the contest?

Regards,
Kunal


#2

Hey,

Here are the scripts:

The following feature engineering helped:

  1. Time-specific aggregates across units (like s.d., mean, max and min) – these helped with the missing value problem.
  2. Without these, the day-of-year variable led to overfitting.
  3. The day-of-month and day-of-week variables also helped.
  4. Moving windows on the exogenous (weather) variables also helped.
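The aggregate and moving-window features above can be sketched in pandas roughly like this (the column names, window size and data are made up for illustration, not taken from the actual scripts):

```python
import pandas as pd

# Hypothetical columns: "Date", "Park_ID", and one weather variable "Var1".
df = pd.DataFrame({
    "Date": ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"],
    "Park_ID": [12, 13, 12, 13],
    "Var1": [10.0, 14.0, 11.0, None],
})

# Time-specific aggregates across units: for each date, summarise the variable
# over all parks. These stay defined even when one park's own value is missing,
# which is why they help with the missing-value problem.
grp = df.groupby("Date")["Var1"]
df["Var1_date_mean"] = grp.transform("mean")
df["Var1_date_sd"] = grp.transform("std")
df["Var1_date_max"] = grp.transform("max")
df["Var1_date_min"] = grp.transform("min")

# A moving window on an exogenous (weather) variable, computed per park.
df = df.sort_values(["Park_ID", "Date"])
df["Var1_roll2"] = df.groupby("Park_ID")["Var1"].transform(
    lambda s: s.rolling(window=2, min_periods=1).mean()
)
```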

Best,

Benedek


#4

The approaches are mentioned in the Readme file.


#5

Our approach was experimental. We tried things like removing outliers using the mean of every variable; to remove the outliers in var1, we predicted var1 with a linear model and used that value instead of the conventional one. There were also additive and multiplicative patterns of 3, 4 and 0.83, 0.76 in the data. We later combined a decision tree prediction with two XGBoost models, each of which gave an RMSE of approximately 112, and took the harmonic mean of the two XGBoost models. It might not be the best approach, but that’s all we could think of.
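The harmonic-mean blend of two models’ predictions can be sketched like this (the prediction values here are made up, not the team’s actual outputs):

```python
import numpy as np

# Hypothetical predictions from two XGBoost models (each scoring ~112 RMSE).
pred_a = np.array([100.0, 250.0, 80.0])
pred_b = np.array([120.0, 200.0, 80.0])

# Harmonic mean of the two prediction vectors. Unlike the arithmetic mean,
# it pulls the blend toward the smaller of the two values.
blend = 2.0 / (1.0 / pred_a + 1.0 / pred_b)
```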
Github link : https://github.com/aman1391/ultimate-studenthunt
Public LB - 112.xx, Rank 28

Private LB - 96.xx, Rank 26

Team : Aegis2.0


#6

Hey everyone,
The most important thing that improved our score was features in the form of percentage changes and rolling means of the direction of wind. Since these features gave us an immense jump in score, we tried to build similar features for every weather condition. This improved our score further, up to 98.xx on the public LB. Other than these, we had day, day_of_year, month etc. as features. Finally, we implemented a single XGBoost model for prediction. We thought a lot about features and our hard work paid off. :slight_smile:

Here’s the script
Public LB - 96.76 , Rank 2
Private LB - 86.41, Rank 2

Team : AK


#7

Care to elaborate? What do you mean by “features in the form of percentage_changes and rolling means of direction of wind”?


#8

By percentage changes I mean the percentage change from the previous day to the next day, and by rolling means I mean what are better known as moving averages, over windows of 3, 7 and 30 days.
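A minimal pandas sketch of these two feature types, on a made-up daily series for a single park (the values and window shown are for illustration only):

```python
import pandas as pd

# Hypothetical daily values of one weather variable for a single park,
# already sorted by date.
s = pd.Series([100.0, 110.0, 121.0, 121.0, 133.1])

# Day-over-day percentage change (first value is NaN, having no previous day).
pct = s.pct_change()

# Moving average over a 3-day window; 7- and 30-day windows work the same way.
ma3 = s.rolling(window=3).mean()
```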


#9

I got to know about a new approach, thanks a lot, and congrats :slight_smile:


#10

Hi everyone,

First of all, thanks Kunal for hosting such an amazing contest.
I spent my initial time on imputing the missing values because some variables had 40% of their data missing. The pattern observed was that any feature for a park depends on that feature for the other parks in the same location on that day.
I submitted my first solution using just Date, Month and Park ID, which gave me a public LB score of 146. As I kept imputing missing values and adding features, I got a huge boost to 113 just from the missing-value treatment.
There was some noise, which was cleaned.
As the features varied a lot, I scaled them down to a range between 0 and 10.
Binning of categorical variables was important: when you observe the medians (using a boxplot), you see decent similarities between parks, months and dates.
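The location-based imputation idea can be sketched in pandas like this (the column names "Date", "Location_Type" and "Var1" and the data are assumptions for illustration): a park’s missing value is filled with the mean of the other parks in the same location on the same day.

```python
import pandas as pd

# Made-up data: two dates, two location types, one weather variable.
df = pd.DataFrame({
    "Date": ["2000-01-01"] * 3 + ["2000-01-02"] * 3,
    "Location_Type": [1, 1, 2, 1, 1, 2],
    "Var1": [10.0, None, 30.0, None, 14.0, 20.0],
})

# Fill each missing value with the mean of its (date, location) group,
# i.e. the same feature for the other parks at that location on that day.
df["Var1"] = df.groupby(["Date", "Location_Type"])["Var1"].transform(
    lambda s: s.fillna(s.mean())
)
```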

I didn’t spend much time on my model owing to some commitments. I tried a GradientBoostingRegressor, but I am sure that XGB, after parameter tuning, could have given me a gain of 2-3 more points.
Cross-validation was key, and I used the data from 2000-01 to validate my model.

The solution can be found here.
Public LB - 105.82, Rank 4
Private LB - 92.70, Rank 4

Thanks,
Akhil Gupta


#11

Excellent… When I read this, I wonder why I didn’t think of it. I tried hard last night but somehow couldn’t get there.
Just so I understand this correctly - you computed the change from the day before (for all weather conditions). What was your inference? I mean, did a high change mean the footfall also increased (or something of that sort)? I am still trying to understand this clearly.


#12

Hi Sriram,
Yes, you got it right. What we were thinking was that if not the weather condition itself, maybe its change could prove a better feature. We didn’t think about it in much depth, just gave it a try, and it worked. :wink:


#13

I started by filling the missing values with the pad method (i.e. copying from the previous row), since the attributes of a park should be similar to what they were the previous day. A better approach would have been to do this individually for each park. For feature selection, I observed that the month of the year and the day of the month were highly influential in determining the footfall. I made a 0/1 feature for winters (months 11, 12, 1, 2, 3). For the dates, I observed some pattern in the variation of mean footfall with increasing dates, but there were some anomalies, which I tried to treat by averaging across adjacent days. (I did this for all the parks together; a better approach would have been to do it for each park.) I also binned the direction of wind to represent the 4 directions.

For the model, I started with gradient boosted trees and tuned them to get the best result in CV (by testing on the years 2000-2001), and then I tried XGBoost and tuned it. Finally, I made a neural network with a single, wide hidden layer. In addition, for the GBM and XGBoost models, I trained the regressors for each park independently, as I believed that each park would have an independent pattern and relationship with the other variables. For the neural net model, I trained on all parks together, giving the park ID as a feature, as it needed a larger number of samples to be trained.

I averaged the results of these 3 models to get the final output. You can find the code here: https://github.com/akashgupta222/gardern_pred_analyticsvidhya
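Two of the features described above, the winter indicator and the 4-direction wind binning, can be sketched in pandas like this (the column names "Direction_Of_Wind" and "Month" and the data are assumptions for illustration):

```python
import pandas as pd

# Made-up rows with a wind direction in degrees (0-360) and a month number.
df = pd.DataFrame({"Direction_Of_Wind": [10, 95, 200, 350],
                   "Month": [1, 6, 11, 4]})

# 0/1 winter indicator for months 11, 12, 1, 2, 3.
df["is_winter"] = df["Month"].isin([11, 12, 1, 2, 3]).astype(int)

# Bin the wind direction into the 4 quadrants (roughly N/E/S/W).
df["wind_bin"] = pd.cut(df["Direction_Of_Wind"],
                        bins=[0, 90, 180, 270, 360],
                        labels=[0, 1, 2, 3],
                        include_lowest=True)
```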


#14

Hey, everyone.

This repo contains our final code submission along with comments.

Team : Little little (Next time we will be more careful with the team name)

Regards
Yash


#15

Hi everyone,

Thanks AV for hosting such an amazing contest.
Let me walk you through my approach for this problem. I imputed missing values with the median and mean, then through visualization I found a yearly pattern, so I created a Season variable along with day and month, and binned the direction-of-wind variable. I used gradient boosting with 10,000 trees and a learning rate of 0.01, which had an RMSE of 117 on the public LB.
After that I removed the direction-bin variable, which was given low importance, and increased the number of trees to 15,000. This gave me an RMSE of 111 on the public LB and 93.46 on the private LB, Rank 7.
After that I made a model using dummy variables with GBM, but it was not helpful. Then I built many models using XGB but couldn’t reduce the RMSE further. I think I missed the day-of-year variable in my model, which could have helped me.
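The first configuration above corresponds to something like the following in scikit-learn; this is a sketch of the stated hyperparameters, not the author’s actual script, and the feature matrix (season, day, month, wind bins) is not shown:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Configuration described in the post: 10,000 trees, learning rate 0.01.
# A small learning rate paired with many trees trades training time for
# smoother, more regularized boosting.
model = GradientBoostingRegressor(n_estimators=10000, learning_rate=0.01)
```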

Regards,
Vaibhav


#16

Hi Everyone,

Many of you sent me messages asking about my 3rd place solution to this competition. I’ve written a blog post that goes in-depth into my solution, the thought process I went through to find my features, and my general approach to competitions.

You can read it here: https://medium.com/implodinggradients/how-i-got-3rd-place-in-the-ultimate-student-hunt-3ecf827375a6#.vyp3mujk1

Any feedback or sharing is much appreciated!

- Mikel (anokas)


#17

Is there any place to download the dataset?