3rd Place Solution [Club Mahindra DataOlympics]


Thanks to Analytics Vidhya and Club Mahindra for organising such a wonderful hackathon. The competition was quite intense and the dataset was very clean to work with.

Approach :

Step-1 :

I started with a very basic approach: casting all the ordinal features (resort_id, persontravellingID, main_product_code, and others) to category dtype, and deriving some common features from each of the date columns [booking_date, checkin_date, checkout_date]:

  1. Weekday
  2. Month
  3. Day
  4. Day of year
  5. Week of year
  6. Is month end
  7. Year
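The date parts above can be sketched with pandas. This is a minimal illustration, not the author's code; `df` and the column name `checkin_date` stand in for the competition data (the same pattern applies to booking_date and checkout_date).

```python
import pandas as pd

# Minimal sketch: derive the date-part features for one date column.
df = pd.DataFrame({"checkin_date": pd.to_datetime(["2018-03-31", "2019-01-01"])})
d = df["checkin_date"].dt
df["checkin_weekday"] = d.weekday            # Monday = 0
df["checkin_month"] = d.month
df["checkin_day"] = d.day
df["checkin_dayofyear"] = d.dayofyear
df["checkin_weekofyear"] = d.isocalendar().week
df["checkin_is_month_end"] = d.is_month_end
df["checkin_year"] = d.year
```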

Step-2: Intuitive features

  1. in_out : checkout_date - checkin_date
  2. book_in : checkout_date - booking_date
  3. Roomnights per stay : roomnights / in_out
  4. Roomnights per book span : roomnights / book_in
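These durations and ratios could look as follows (a sketch; the column names are my assumptions based on the write-up):

```python
import pandas as pd

# Sketch of the intuitive duration/ratio features on a single dummy row.
df = pd.DataFrame({
    "booking_date": pd.to_datetime(["2019-01-01"]),
    "checkin_date": pd.to_datetime(["2019-01-10"]),
    "checkout_date": pd.to_datetime(["2019-01-14"]),
    "roomnights": [8],
})
df["in_out"] = (df["checkout_date"] - df["checkin_date"]).dt.days
df["book_in"] = (df["checkout_date"] - df["booking_date"]).dt.days
df["roomnights_per_stay"] = df["roomnights"] / df["in_out"]
df["roomnights_per_book_span"] = df["roomnights"] / df["book_in"]
```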

Step-3: Time-based features

  1. prev_resort_time = time when the resort was previously booked.
  2. prev_resort_member_time = time when the resort was previously booked by that particular member.
  3. next_resort_time = time when the resort will next be booked.
  4. next_resort_member_time = time when the resort will next be booked by that particular member.
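One plausible way to compute these features (my own sketch, not necessarily the author's code, and the column names are assumptions): sort by check-in date, then shift within each group to find the previous/next booking.

```python
import pandas as pd

# Hypothetical sketch: gap in days to the previous/next booking, per resort
# and per (resort, member), via a sorted groupby + shift.
df = pd.DataFrame({
    "resort_id": ["A", "A", "A"],
    "memberid": [1, 2, 1],
    "checkin_date": pd.to_datetime(["2019-01-01", "2019-01-05", "2019-01-09"]),
}).sort_values("checkin_date")

by_resort = df.groupby("resort_id")["checkin_date"]
df["prev_resort_time"] = (df["checkin_date"] - by_resort.shift(1)).dt.days
df["next_resort_time"] = (by_resort.shift(-1) - df["checkin_date"]).dt.days

by_member = df.groupby(["resort_id", "memberid"])["checkin_date"]
df["prev_resort_member_time"] = (df["checkin_date"] - by_member.shift(1)).dt.days
df["next_resort_member_time"] = (by_member.shift(-1) - df["checkin_date"]).dt.days
```

The first booking in each group gets NaN for the "prev" features and the last gets NaN for the "next" features, which gradient-boosting models handle natively.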

Step-4: Groupby features

| S.No. | Type | Value column | On |
|---|---|---|---|
| 3 | COUNT | _ | [resort_id, checkout_dateyear, checkout_datemonth] |
| 4 | COUNT | _ | [memberid, checkout_dateyear] |
| 5 | VAR | roomnights | [resort_id] |
| 6 | MEDIAN | roomnights | [resort_id, memberid] |
| 7 | MAX | roomnights | [resort_id, checkout_dateyear, checkout_datemonth] |
| 8 | MIN | roomnights | [memberid, checkout_dateyear] |
| 9 | VAR | in_out | [resort_id] |
| 10 | MEDIAN | in_out | [resort_id, memberid] |
| 11 | MAX | in_out | [resort_id, checkout_dateyear, checkout_datemonth] |
| 12 | MIN | in_out | [memberid, checkout_dateyear] |
| 13 | VAR | total_pax | [resort_id] |
| 14 | MEDIAN | total_pax | [resort_id, memberid] |
| 15 | MAX | total_pax | [resort_id, checkout_dateyear, checkout_datemonth] |
| 16 | MIN | total_pax | [memberid, checkout_dateyear] |

…… In a similar fashion, approximately 72 combinations were tried, which improved the RMSE from 96 to 95.3 on the LB, with nearly the same improvement in local CV.
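A batch of such groupby aggregates could be generated with a small loop like the one below (my own sketch, not the author's code; the column names follow the table above, and `transform` broadcasts each group statistic back onto the rows):

```python
import pandas as pd

# Sketch: build many groupby-aggregate features from (agg, column, keys)
# specs; transform() maps each group's statistic back onto every row.
df = pd.DataFrame({
    "resort_id": ["A", "A", "B"],
    "memberid": [1, 2, 1],
    "roomnights": [3, 5, 2],
    "in_out": [4, 6, 2],
    "total_pax": [2, 3, 1],
})
specs = [
    ("var", "roomnights", ["resort_id"]),
    ("median", "roomnights", ["resort_id", "memberid"]),
    ("max", "in_out", ["resort_id"]),
]
for agg, col, keys in specs:
    name = f"{agg}_{col}_by_{'_'.join(keys)}"
    df[name] = df.groupby(keys)[col].transform(agg)
```

Extending `specs` with more (aggregate, value column, key) triples yields the ~72 combinations mentioned above.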


My final model consists of an ensemble of:

[lightGBM, CatBoost, 5_fold_lightGBM, 5_fold_CatBoost, and a stack of [xgb, catboost, lightGBM]]

  1. Params tuned using Bayesian optimization:

{"objective": "regression", "metric": "rmse", "n_estimators": 3000,
 "early_stopping_rounds": 200, "num_leaves": 31, "learning_rate": 0.05,
 "bagging_fraction": 0.9, "bagging_seed": 0, "num_threads": 4,
 "colsample_bytree": 0.8, "lambda_l1": 25, "min_child_weight": 44.9,
 "min_child_samples": 5}


Thanks Kanav


What have you done with the features in Step 4? Can you please explain the basic approach?


I have made groupby / aggregate features.
For example, S.No. 1 means:

merge = merge.merge(pd.DataFrame(merge.groupby(by=['resort_id', 'memberid']).roomnights.count()).reset_index(), suffixes=('', 'res_mem'), on=['resort_id', 'memberid'], how='left')

This gives a new column holding the booking count per (resort_id, memberid) pair, and the other features can be built in a similar way.
For reference, see the pandas groupby documentation.


Hi Kanav,

Great Work!!

Can you please explain the final step? How did you ensemble the various models?


Good that you are sharing this information. Can you please explain how groupby is helpful here? I also calculated the date difference between checkout and booking, but I couldn't see the information gain from the various groupbys. Can you please explain why groupby is important here?


Hi Bipin,
I chose weighted averaging for ensembling, and the logic behind choosing the weights was that they minimized the local CV.

(0.3 * model_1 + 0.2 * model_2 + 0.3 * model_3 + 0.2 * model_4) * 0.8 + 0.2 * model_5


Hi Kanav,

Thanks for sharing your approach! It's quite helpful to me. A quick question: which library did you use for Bayesian optimization (spearmint, hyperopt, or something else)? Could you elaborate on your approach?

Once again, thanks!


Thanks, Kanav, for sharing your approach. Really appreciate it.


How do I get to know about the groupby method? I can't figure out how to come up with the features in Step 4. Please share your intuition behind this. Thanks.


Hey, would you please share your code?