3rd Place Solution [Club Mahindra DataOlympics]

solution
#1

Thanks Analytics Vidhya and Club Mahindra for organising such a wonderful hackathon,The competition was quite intense and dataset was very clean to work.

Approach :

Step-1 :

I started my problem with very basic approach changing all the features (resort id , persontravellingID , main_product_code and other ordinal features to category . Made some common for each date columns [booking date , checkin_date, checkout_date ]:

  1. Weekday
  2. Month
  3. Day
  4. Day of year
  5. Week of year
  6. Is month end
  7. Year

Step-2: Intuitive features

  1. In_out : Checkout_Date - Checkin_Date
  2. book_in:Checkout_date - booking_date
  3. Roomnights per stay : roomnights/in_out
  4. Roomnights per book span : roomnights / book_out

Step - 3: Time Based Features :

  1. Prev_resort_time = Time when the resort was previously booked.
  2. Prev_resort_member_time = Time when the resort was previously booked by a particular member.
  3. Next_resort_time = Time when the resort will Next booked.
  4. Next_resort_member_time = Time when the resort will next booked by a particular member.

Step-4 : Groupby Features

S.No. TYPE Value_column ON
1. COUNT _ RESORT_ID
2 COUNT _ RESORT_ID,MemberID
3. COUNT _ [‘resort_id’,‘checkout_dateyear’,‘checkout_datemonth’]
4. COUNT _ [‘memberid’,‘checkout_dateyear’]
5 VAR roomnights RESORT_ID
6 Median roomnights RESORT_ID,MemberID
7. MAX roomnights [resort_id,checkout_dateyear,checkout_datemonth]
8. MIN roomnights [memberid’,‘checkout_dateyear’]
9 VAR in_out RESORT_ID
10 Median in_out RESORT_ID,MemberID
11. MAX in_out [‘resort_id’,‘checkout_dateyear’,‘checkout_datemonth’]
12. MIN in_out [‘memberid’,‘checkout_dateyear’]
13 VAR total_pax RESORT_ID
14 Median total_pax RESORT_ID,MemberID
15 MAX total_pax [‘resort_id’,‘checkout_dateyear’,‘checkout_datemonth’]
16 MIN total_pax [‘memberid’,‘checkout_dateyear’]

…… in Similar fashion approx ~ 72 combinations were tried which gave a boost of rmse from 96 to 95.3 on LB and nearly same change in Local CV.

Modeling:

My final model consist of ensemble of

[ lightGBM , Catboost , 5_fold_Light GBM , 5_fold_Catboost and stacking of [xgb,catboost,lightGBM]

  1. Params tuned using Bayesian Optimization :

“objective” : “regression”, “metric” : “rmse”, ‘n_estimators’:3000, ‘early_stopping_rounds’:200,

“num_leaves” : 31, “learning_rate” : 0.05, “bagging_fraction” : 0.9,

“bagging_seed” : 0, “num_threads” : 4,“colsample_bytree” : 0.8,“lambda_l1”: 25,“min_child_weight” : 44.9, “min_child_samples”: 5 }

16 Likes
#2

Thanks Kanav

#3

What have you done with the features in step4. Can you please explain the basic approach?

#4

Hi
I have made groupby / aggregate features.
Example S.no 1 means ->
merge = merge.merge(pd.DataFrame(merge.groupby(by=[‘resort_id’,‘memberid’]).roomnights.count()).reset_index(),suffixes=(’’,‘res_mem’),on=[‘resort_id’,‘memberid’],how=‘left’)

This will give me a new column which have resort_id count and in similar way you can get other features as well.
refer groupby:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

1 Like
#5

Hi Kanav,

Great Work!!

can u please explain the final step? How did u ensemble the various models?

#6

Good that you are sharing this information. Can you please explain how Groupby helpful here? I also calculated date difference between checkout and booking. But couldn’t realize information gain from several groupby. Can you please explain why Groupby important here?

#7

Hi Bipin,
I chose weighted averaging for ensembling and logic behind chosing weights was that they minimzed the local CV.

(model_1 X *0.3+model_2 X *0.2+model_3 X *0.3 +model_4 X *0.2)*0.8 + model_5 X *0.2

2 Likes
#8

Hi Kanav,

Thanks for sharing your approach! It’s quite helpful to me. A quick question, which algorithm did you use for Bayesian optimization,spearmint, hyperopt or anything else? Could you elaborate your approach?

Once again, thanks!

#9

thanks Kanav,
for sharing your approach, really appreciate

#10

How do I get to know about groupby method? I can’t be able to decide step 4. Please share your intuition behind this. Thanks

#11

Hey would you please share your code?