5th Place solution Amexpert

machine_learning

#1

Thanks to American Express and Analytics Vidhya for organizing such a great learning hackathon. The dataset was really amazing and offered endless new things to explore.
Approach:

Step 1
I started with a very basic approach: converting all the features (user_id, product, webpage_id, campaign_id and the other user features) to category type. I derived common time features such as hour, weekday and day, and applied a simple CatBoost model. Initially I used a simple train/test split and got a local CV of .61 and a public score of .601.
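A minimal sketch of this first step, assuming pandas. The timestamp column name `DateTime` and the helper name `add_basic_features` are assumptions for illustration; the post does not give them.

```python
import pandas as pd

def add_basic_features(df, time_col="DateTime",
                       cat_cols=("user_id", "product", "webpage_id", "campaign_id")):
    """Derive hour/weekday/day and cast the id-like columns to category dtype."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.weekday  # Monday=0 ... Sunday=6
    out["day"] = ts.dt.day
    for col in cat_cols:
        out[col] = out[col].astype("category")
    return out

# Toy rows (made up) just to show the resulting columns
toy = pd.DataFrame({
    "DateTime": ["2017-07-02 13:45:00", "2017-07-03 08:10:00"],
    "user_id": [101, 102],
    "product": ["A", "B"],
    "webpage_id": [10, 11],
    "campaign_id": [7, 7],
})
basic = add_basic_features(toy)
```

Gradient boosting libraries such as CatBoost and LightGBM can consume categorical columns directly, which is presumably why the cast to category comes first.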

Then I introduced groupby count features, such as:

How many times a user appeared, how many times they came across a particular product, webpage or campaign, how many times they appeared in a day, and so on. In this manner I tried various combinations and made a total of 16 new features.
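The groupby counts described above could look roughly like this in pandas. The key combinations shown are only examples; the exact 16 combinations are not listed in the post, and `add_count_features` is a hypothetical helper name.

```python
import pandas as pd

def add_count_features(df, key_sets):
    """Add one column per key combination: number of rows sharing those key values."""
    out = df.copy()
    for keys in key_sets:
        name = "cnt_" + "_".join(keys)
        # transform keeps the original row order and length
        out[name] = out.groupby(keys)[keys[0]].transform("count")
    return out

# Toy data (made up)
toy_counts = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product": ["A", "A", "B", "A", "A"],
    "webpage_id": [10, 10, 11, 10, 12],
})
toy_counts = add_count_features(
    toy_counts,
    [["user_id"], ["user_id", "product"], ["user_id", "webpage_id"]],
)
```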

Using History data:

I made similar count features again using {day, minute, date, user_id, product, week_day} as above. In addition, I made features such as how many times a person showed interest (mean, sum and count). Taking various combinations of features in the same way, I made a total of 19 new intuitive variables. This boosted my local CV to .6460 and my public leaderboard score to .631+.
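A sketch of the "showed interest" aggregates, assuming the history log has one row per user event with an `action` column whose value is `"interest"` when the user showed interest. That column name and its values are assumptions, as is the helper name.

```python
import pandas as pd

def history_interest_features(history, train):
    """Per-user mean, sum and count of 'interest' events, merged onto train."""
    hist = history.copy()
    hist["is_interest"] = (hist["action"] == "interest").astype(int)
    agg = (
        hist.groupby("user_id")["is_interest"]
        .agg(["mean", "sum", "count"])
        .add_prefix("hist_interest_")
        .reset_index()
    )
    # Left join: users with no history get NaN, which tree models can handle
    return train.merge(agg, on="user_id", how="left")

# Toy data (made up)
history_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "action": ["view", "interest", "interest", "view"],
})
train_toy = pd.DataFrame({"user_id": [1, 2, 3]})
hist_feats = history_interest_features(history_toy, train_toy)
```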

At this stage I was pretty sure that my local and public scores were well in sync and that feature engineering was the key to winning.

I started brainstorming plenty of new features; some worked and some didn't.

Time (in seconds) since the previous appearance. For example, for each row, if the same user has an earlier row, the time difference between the two.

Another example: the time since the previous session for a particular product or webpage, and so on. I did this again for various combinations of product, user and webpage. Local CV = .647, and the leaderboard score stayed almost the same.
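One way to compute the time since the previous appearance, as a sketch: sort by time, then take the difference within each group. `DateTime` is an assumed column name and `add_prev_time_gap` a hypothetical helper.

```python
import pandas as pd

def add_prev_time_gap(df, keys, time_col="DateTime"):
    """Seconds since the previous row with the same key values (NaN for the first)."""
    out = df.sort_values(time_col).copy()
    name = "secs_since_prev_" + "_".join(keys)
    out[name] = out.groupby(keys)[time_col].diff().dt.total_seconds()
    return out.sort_index()  # restore the original row order

# Toy data (made up)
toy_prev = pd.DataFrame({
    "user_id": [1, 2, 1, 1],
    "product": ["A", "A", "B", "A"],
    "DateTime": pd.to_datetime([
        "2017-07-02 00:00:00", "2017-07-02 00:00:30",
        "2017-07-02 00:01:00", "2017-07-02 00:05:00",
    ]),
})
toy_prev = add_prev_time_gap(toy_prev, ["user_id"])
toy_prev = add_prev_time_gap(toy_prev, ["user_id", "product"])
```

In the toy data, user 1 appears at 00:00, 00:01 and 00:05, so the per-user gaps are NaN, 60 and 240 seconds; for the (user, product) pair (1, A) the gap on the last row is 300 seconds.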

Then I added the time to the next click in the same way, and voilà, my local score rose to .652 and my public score to .637.
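The next-click variant can be sketched by shifting timestamps backwards within each group. Note that this feature looks at rows that come later in time, so it presumably has to be computed over all available rows (an assumption; the post does not say how train and test were handled). `DateTime` is again an assumed column name.

```python
import pandas as pd

def add_next_time_gap(df, keys, time_col="DateTime"):
    """Seconds until the next row with the same key values (NaN for the last)."""
    out = df.sort_values(time_col).copy()
    name = "secs_to_next_" + "_".join(keys)
    next_ts = out.groupby(keys)[time_col].shift(-1)  # timestamp of the following row
    out[name] = (next_ts - out[time_col]).dt.total_seconds()
    return out.sort_index()

# Toy data (made up)
toy_next = pd.DataFrame({
    "user_id": [1, 2, 1, 1],
    "DateTime": pd.to_datetime([
        "2017-07-02 00:00:00", "2017-07-02 00:00:30",
        "2017-07-02 00:01:00", "2017-07-02 00:05:00",
    ]),
})
toy_next = add_next_time_gap(toy_next, ["user_id"])
```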

At this stage I was really happy with my score and thought of starting to prepare for my exam the next day, but all thanks to the Russian masters and Mohsin sir. _/_ Again the cycle of feature engineering started.

I created new features such as which category (product / webpage / product_categ) was used previously. {7 combinations}
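A "previous category" feature can be sketched as a one-step shift within each user's time-ordered rows. The grouping by `user_id` and the `DateTime` column name are assumptions; the seven combinations the author used are not spelled out.

```python
import pandas as pd

def add_prev_value(df, col, user_col="user_id", time_col="DateTime"):
    """Value of `col` on the same user's previous row (NaN for the first row)."""
    out = df.sort_values(time_col).copy()
    out["prev_" + col] = out.groupby(user_col)[col].shift(1)
    return out.sort_index()

# Toy data (made up)
toy_shift = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "product": ["A", "B", "C", "A"],
    "DateTime": pd.to_datetime([
        "2017-07-02 00:00:00", "2017-07-02 00:01:00",
        "2017-07-02 00:02:00", "2017-07-02 00:03:00",
    ]),
})
toy_shift = add_prev_value(toy_shift, "product")
```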

Target encoding using the trick below: https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
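For readers unfamiliar with target encoding, here is a minimal sketch of one common smoothed variant, which shrinks each category's click rate towards the global rate. It is not necessarily the exact formula from the linked PDF; the target column name `is_click` and the smoothing strength `m` are assumptions.

```python
import pandas as pd

def smoothed_target_encode(train, col, target, m=10.0):
    """Return (category -> smoothed target mean, global prior).

    encoding = (sum of target in category + m * prior) / (count in category + m)
    """
    prior = train[target].mean()
    stats = train.groupby(col)[target].agg(["sum", "count"])
    enc = (stats["sum"] + m * prior) / (stats["count"] + m)
    return enc, prior

# Toy data (made up)
toy_te = pd.DataFrame({
    "product": ["A", "A", "A", "A", "B"],
    "is_click": [1, 1, 0, 0, 1],
})
enc, prior = smoothed_target_encode(toy_te, "product", "is_click", m=5.0)
# Unseen categories fall back to the global prior
toy_te["product_te"] = toy_te["product"].map(enc).fillna(prior)
```

In practice the encoding is usually fitted out-of-fold (on rows other than the one being encoded) to limit target leakage; the sketch omits that for brevity.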

Calculated each category's contribution, building a log-confidence feature.

Concatenated the history and merged files and calculated total-count and count-till-now features.
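One possible reading of this step, as a sketch: stack the history and train rows, then compute a per-user total count and a running count of earlier rows. The column names, the per-user grouping and the helper name are assumptions.

```python
import pandas as pd

def add_running_counts(history, train, key="user_id", time_col="DateTime"):
    """Total and 'so far' row counts per key over history + train, returned for train rows."""
    both = pd.concat(
        [history.assign(_src="history"), train.assign(_src="train")],
        ignore_index=True,
    )
    both = both.sort_values(time_col, kind="stable")
    both["total_count"] = both.groupby(key)[key].transform("count")
    # cumcount numbers rows 0, 1, 2, ... within each group, i.e. earlier rows seen
    both["count_till_now"] = both.groupby(key).cumcount()
    both = both.sort_index()
    return both[both["_src"] == "train"].drop(columns="_src").reset_index(drop=True)

# Toy data (made up)
history_rc = pd.DataFrame({
    "user_id": [1, 1, 2],
    "DateTime": pd.to_datetime(["2017-06-01", "2017-06-02", "2017-06-01"]),
})
train_rc = pd.DataFrame({
    "user_id": [1, 2, 1],
    "DateTime": pd.to_datetime(["2017-07-03", "2017-07-04", "2017-07-05"]),
})
running = add_running_counts(history_rc, train_rc)
```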

These features got me into the .64 range on the leaderboard, along with a very strong local score of .659.

Finally, I tuned the LightGBM model using Bayesian optimization and got the following optimal parameters:

import lightgbm as lgb
clf2 = lgb.LGBMClassifier(max_depth=9, num_leaves=44, n_estimators=200, learning_rate=.1, reg_alpha=.5914, subsample=.8747, colsample_bytree=.3668, reg_lambda=.14, min_split_gain=.0083, min_child_weight=36)

Training 5 LightGBM, 4 CatBoost and 1 XGBoost model and ensembling them crossed .661 locally and .6418 on the public leaderboard. At this stage I realized that I had built a very robust model, which proved to be very stable on the private LB as well, with a score of .6433.
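The post does not say how the ten models were combined. One common option for blending models whose scores live on different scales is rank averaging, sketched here with made-up predictions; `rank_average` is a hypothetical helper, not the author's code.

```python
import numpy as np
import pandas as pd

def rank_average(pred_lists, weights=None):
    """Convert each model's scores to percentile ranks, then take a (weighted) mean."""
    ranks = np.column_stack(
        [pd.Series(p).rank(pct=True).to_numpy() for p in pred_lists]
    )
    return np.average(ranks, axis=1, weights=weights)

# Two hypothetical models scoring the same four rows on different scales
preds_lgb = [0.10, 0.40, 0.35, 0.80]
preds_cat = [0.02, 0.03, 0.05, 0.04]
blend = rank_average([preds_lgb, preds_cat])
```

Because only the ordering of each model's scores matters after ranking, this kind of blend suits ranking-based metrics; a plain mean of predicted probabilities is the simpler alternative.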
Thanks again, AV & Amex.


#2

Hi @kanav

Thanks for sharing this - added it to the leaderboard, so that more people can read it.

You can do that at your end as well.

Regards,
Kunal


#3

@kanav Thanks for sharing your feature engineering techniques.
Can you explain more about this point of yours: Time (in seconds) between previous appearance?