First of all, thanks Kunal for hosting such an amazing contest.
I spent my initial time on imputing the missing values, because some variables had 40% of their data missing. The pattern I observed was that any feature for a park depends on that same feature for the other parks in the same location on that day.
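To make the imputation idea concrete, here is a minimal sketch of filling a missing value with the average of the same feature over the other parks in the same location on the same day. The column names (Date, Location_ID, Park_ID, Moisture) and the toy values are made up for illustration, and using the group mean is an assumption; the actual solution may have used a different statistic.

```python
import numpy as np
import pandas as pd

# Toy data: all column names and values here are hypothetical.
df = pd.DataFrame({
    "Date": ["2000-01-01"] * 3 + ["2000-01-02"] * 3,
    "Location_ID": [1, 1, 1, 1, 1, 1],
    "Park_ID": [10, 11, 12, 10, 11, 12],
    "Moisture": [5.0, np.nan, 7.0, np.nan, 4.0, 4.0],
})

# Mean of the feature over parks sharing the same location and date.
# Missing values are skipped when the mean is computed.
group_mean = df.groupby(["Location_ID", "Date"])["Moisture"].transform("mean")

# Fill only the missing entries; observed values are left untouched.
df["Moisture_imputed"] = df["Moisture"].fillna(group_mean)

print(df)
```

If every park in a location is missing the feature on a given day, the group mean is also missing, so a fallback (for example the park's own average over nearby dates) would still be needed.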
My first submission used just Date, Month and Park ID, which gave me a public LB score of 146. As I kept on imputing missing values and adding features, the score improved; handling the missing values alone gave a huge boost to 113.
There was also some noise in the data, which I cleaned.
As the features varied a lot in scale, I scaled them to a range of 0 to 10.
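One way to do this kind of scaling is min-max scaling with scikit-learn's MinMaxScaler and feature_range=(0, 10). This is only a sketch on made-up numbers; I am assuming min-max scaling here, and the arrays X_train and X_test are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two features on very different scales.
X_train = np.array([[1.0, 200.0],
                    [3.0, 400.0],
                    [5.0, 1000.0]])
X_test = np.array([[2.0, 600.0]])

# Map each feature's training minimum to 0 and its maximum to 10.
scaler = MinMaxScaler(feature_range=(0, 10))
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training min/max on the test data instead of refitting.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Fitting the scaler on the training data only keeps the test rows from influencing the scaling.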
Binning of categorical variables was important: when you look at the medians (using a boxplot), you see decent similarities between parks, months and dates.
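A rough sketch of median-based binning for one categorical variable: compute the median of the target per park, then group parks with similar medians into the same bin. The column names (Park_ID, Footfall), the toy values and the choice of two quantile bins via pd.qcut are all assumptions for illustration; in practice the bins could just as well be picked by eye from the boxplot.

```python
import pandas as pd

# Toy data: column names and values are hypothetical.
df = pd.DataFrame({
    "Park_ID": [10, 10, 11, 11, 12, 12, 13, 13],
    "Footfall": [100, 110, 105, 115, 300, 320, 310, 330],
})

# Median of the target for each park (what the boxplot midline shows).
park_median = df.groupby("Park_ID")["Footfall"].median()

# Split parks into 2 quantile bins by their median; labels=False gives integer codes.
park_bin = pd.qcut(park_median, q=2, labels=False)

# Attach the bin to every row as a new, coarser categorical feature.
df["Park_bin"] = df["Park_ID"].map(park_bin)

print(park_median)
print(df)
```

The same steps apply to months and dates by swapping the grouping column.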
I didn’t spend much time on my model owing to some commitments. I tried a GradientBoostingRegressor, but I am sure that XGB after parameter tuning could have given me a gain of 2-3 more points.
Cross-validation was key and I used data for 2000-01 to check my model.
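The validation setup can be sketched as a time-based holdout: train on everything outside the held-out period and score on that period. Everything below is an assumption for illustration: the data is synthetic, the column names are made up, "2000-01" is read as the years 2000 and 2001, and RMSE is used as the metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the contest data; all names and values are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Year": rng.integers(1995, 2002, size=n),   # years 1995..2001
    "Month": rng.integers(1, 13, size=n),
    "Park_ID": rng.integers(10, 15, size=n),
})
df["Footfall"] = (1000 + 20 * df["Month"] + 5 * df["Park_ID"]
                  + rng.normal(0, 10, size=n))

features = ["Month", "Park_ID"]

# Hold out the chosen period for validation; train on the rest.
is_valid = df["Year"].isin([2000, 2001])
train, valid = df[~is_valid], df[is_valid]

# Default hyperparameters, no tuning.
model = GradientBoostingRegressor(random_state=0)
model.fit(train[features], train["Footfall"])

# Score the held-out period.
pred = model.predict(valid[features])
rmse = float(np.sqrt(mean_squared_error(valid["Footfall"], pred)))
print(f"Validation RMSE: {rmse:.2f}")
```

Holding out a whole period, rather than random rows, keeps the check closer to how the model is scored on unseen dates.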
The solution can be found here.
Public LB - 105.82, Rank 4
Private LB - 92.70, Rank 4