Can the winners share their approach?
What are the features that worked for you?
What did not work?
Were the text features important?
Some participants have shared their solution for Lord Of The Machine on the slack channel. Here are a few!
3rd rank solution, by SRK and Mark:
Most of our time was spent on creating new features. We did the validation split based on campaign IDs. Our best single model is a LightGBM that scored 0.7051 on the LB. The important features we used are:
- Target encoding on the user ID, user ID - communication type
- Min, max, mean and standard deviation of the mail sent time.
- One hot encoding of the campaigns.
- Time between current mail and previous mail
- Number of campaigns in between the current mail and the previous mail
- Total number of mail campaigns per user ID
- Cumulative count of the mail at user level
- Hour of the mail
Here’s the code!
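The time-based features in that list can be sketched in a few lines. This is only an illustration, not the team's actual code; column names like `send_date` are assumptions:

```python
from collections import defaultdict
from datetime import datetime

def time_features(rows):
    """Two of the time features above: hours since the user's previous
    mail, and the cumulative mail count at user level.
    Rows must be sorted by send time; column names are assumptions."""
    last_seen = {}
    counts = defaultdict(int)
    for r in rows:
        u = r["user_id"]
        t = datetime.fromisoformat(r["send_date"])
        prev = last_seen.get(u)
        r["hours_since_prev"] = (t - prev).total_seconds() / 3600 if prev else -1.0
        counts[u] += 1
        r["user_cumcount"] = counts[u]
        last_seen[u] = t
    return rows

rows = time_features([
    {"user_id": "u1", "send_date": "2018-03-01T10:00:00"},
    {"user_id": "u2", "send_date": "2018-03-01T12:00:00"},
    {"user_id": "u1", "send_date": "2018-03-01T16:00:00"},
])
```

The `-1.0` sentinel marks a user's first mail, where no gap exists.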
4th rank solution by Akash Gupta and Aditya Kumar Sinha:
I posed it as a sequence prediction problem, where we want to find whether a user will click on an email given their past interactions on the platform. The first thing that comes to mind for sequence prediction problems is an RNN, or more specifically an LSTM, which is what I went with.
More details can be found here:
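A minimal sketch of the sequence-building step that would precede any LSTM: for each mail, in time order, collect the padded history of the user's previous click outcomes. Column names are assumptions, not the competition's actual schema:

```python
from collections import defaultdict

def build_sequences(events, maxlen=5, pad=0):
    """For each mail (in time order), yield the user's padded sequence of
    previous click outcomes plus the current label -- the kind of input
    an LSTM would consume. Column names are assumptions."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: e["send_time"]):
        hist = by_user[e["user_id"]]
        seq = ([pad] * maxlen + hist)[-maxlen:]  # left-pad, keep last maxlen
        yield e["user_id"], seq, e["is_click"]
        hist.append(e["is_click"])               # only now reveal the label

events = [
    {"user_id": "u1", "send_time": 1, "is_click": 0},
    {"user_id": "u1", "send_time": 2, "is_click": 1},
    {"user_id": "u1", "send_time": 3, "is_click": 0},
]
samples = list(build_sequences(events, maxlen=3))
```

Appending the label only after yielding keeps the current mail's outcome out of its own input sequence.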
9th rank solution, by Soham:
We focused on features that would characterize a user's behavior and time patterns.
- used RENN (repeated edited nearest neighbours) for undersampling the data
- click confidence for each user (ratio of no. of clicks to that of total mails received by that user)
- is open confidence for each user
- subscription period
- total mails sent to the user
- specific user mailing frequency
- basic NLP features: number of capital letters, punctuation percentage, unique words, stop words
- combined feature which included text subjectivity and polarity (boosted score on CV and LB)
- unsupervised clustering for making groups of similar mail titles and subjects
- time based features like day, month, day of week, buckets by time of day

We used LightGBM and XGBoost on these features. The final submission was the average of the predictions from the two models.
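The basic NLP features from the list above might look like this; a sketch, not Soham's actual code:

```python
import string

def subject_features(text):
    """Basic NLP features of a mail subject: capital-letter count,
    punctuation percentage and unique-word count (a sketch of the
    features listed above, not the original code)."""
    n = max(len(text), 1)
    caps = sum(c.isupper() for c in text)
    punct = sum(c in string.punctuation for c in text)
    words = text.lower().split()
    return {
        "n_caps": caps,
        "punct_pct": 100.0 * punct / n,
        "n_unique_words": len(set(words)),
    }

feats = subject_features("Free Webinar: Intro to ML, today!")
```

Subjectivity/polarity, as mentioned in the list, would typically come from a sentiment library such as TextBlob rather than hand-rolled counts.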
11th rank solution, by Balajisr:
I segmented the test data into new and old user IDs and ran a separate model for each segment, using XGBoost: 7 features for model 1 and 4 features for model 2.
Features used for user IDs common to train and test: mean click, mean open, number of campaigns the user was part of, number of users in the campaign, communication type, total links, internal links.
Features for new user IDs: total links, number of images, internal links and communication type.
What did not work: word count of emails, bag of words, reordering communication type to exclude conference mails, difference in time between two emails for a given user ID, recency of clicks.
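The old/new segmentation step can be sketched as follows (hypothetical column names, not the original code):

```python
def split_by_seen_users(train_user_ids, test_rows):
    """Route each test row to the 'old' or 'new' segment depending on
    whether its user_id appeared in training -- each segment then gets
    its own model with its own feature set."""
    seen = set(train_user_ids)
    old = [r for r in test_rows if r["user_id"] in seen]
    new = [r for r in test_rows if r["user_id"] not in seen]
    return old, new

old, new = split_by_seen_users(
    ["u1", "u2"],
    [{"user_id": "u1"}, {"user_id": "u3"}],
)
```

The point of the split is that history-based features (mean click, mean open) only exist for users seen in training, so the "new" model must fall back on campaign-level features.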
Hi @AishwaryaSingh, I’m a newbie. I wonder where I can find the open-source solutions after a competition. Via Slack? I find the AV discussion a little different from Kaggle’s; on Kaggle it’s easier to find all the open-source solutions and discussion after a competition ends.
All solutions to the competitions will be shared by the participants here. I have added a few solutions which were discussed on slack. We hope that other participants will also share their approach to the problem statement.
Thanks AV for this wonderful competition. I secured 34th and struggled to get past my LB score of 0.662. Nevertheless, here is my approach:
- is_click and is_open mean encoding for user_id and the user-communication type combo, with 2-fold CV
- Count encodings for user, user/campaign communication.
- Extracted day of week and time of day from the send date.
- A simple weighted ensemble of logistic regression ( with one hot encoded vars) model and a RF model.
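The K-fold mean encoding in the first bullet can be sketched like this; a minimal illustration with assumed field names, not the poster's actual code:

```python
from collections import defaultdict

def kfold_mean_encode(rows, key, target, n_folds=2, prior=0.5):
    """Out-of-fold mean encoding: each row's encoding is computed only
    from rows in the *other* folds, which limits target leakage.
    Adds a new field in place and returns the rows."""
    feat = f"{key}_mean_{target}"
    for i in range(n_folds):
        in_fold = [r for j, r in enumerate(rows) if j % n_folds == i]
        rest = [r for j, r in enumerate(rows) if j % n_folds != i]
        sums, counts = defaultdict(float), defaultdict(int)
        for r in rest:
            sums[r[key]] += r[target]
            counts[r[key]] += 1
        for r in in_fold:
            k = r[key]
            r[feat] = sums[k] / counts[k] if counts[k] else prior
    return rows

rows = kfold_mean_encode(
    [{"user_id": "u1", "is_click": c} for c in (1, 0, 1)],
    "user_id", "is_click")
```

Keys unseen in the other folds fall back to a global `prior`, the usual way to handle rare IDs.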
Tried but didn’t work
- Bag of words for subject
- Cosine similarity bw subject and body
- Predicted is_open from the train data to add a new is_open column in the test data (added a lot of unnecessary bias and overfitting)
Should have tried
- XGBoost/LightGBM
- Playing with the send date more to calculate features like the mean send time between two mails, etc.
- Expanding mean technique as many above suggested.
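The expanding-mean technique mentioned in the last bullet can be sketched like this (an illustration with assumed field names): each row sees only the running mean of earlier rows for its user, which avoids leaking the row's own target.

```python
from collections import defaultdict

def expanding_mean(rows, key, target, prior=0.5):
    """Expanding-mean encoding: each row's feature is the mean of the
    target over *earlier* rows with the same key. Rows must be in time
    order; field names are assumptions."""
    sums, counts = defaultdict(float), defaultdict(int)
    out = []
    for r in rows:
        k = r[key]
        out.append(sums[k] / counts[k] if counts[k] else prior)
        sums[k] += r[target]
        counts[k] += 1
    return out

enc = expanding_mean(
    [{"user_id": "u1", "is_click": c} for c in (1, 0, 1)],
    "user_id", "is_click")
```

The first occurrence of each key gets the `prior`, since there is no history yet.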
Lastly, thanks everyone for sharing your approaches.
Thanks for your amazing work, and to all the top AVers for sharing; I can’t wait to learn tricks from them. Hoping every competition has such a “Share your Approach” thread.
Hi, posting our 18th place solution below. (Team: alpha1995)
We used an ensemble of logistic regression and Naive Bayes models based on the open and click probabilities of the user. The Naive Bayes estimate was further used as an input to the LR model to predict the outputs.
Averaging all the models, we got our final solution.
More details shared here:
1st place solution
I created several features based on textual information and user behavior to arrive at my final solution.
The features created were:
- Target encoding of user_id with respect to is_open and is_click
- Target encoding of campaign_id with respect to is_open and is_click
- Target encoding of communication_type with respect to is_open and is_click
- Length of the email body (in words)
- Length of subject
- Key feature: I pre-processed the text in the subject by removing stop words, lemmatizing, removing punctuation, etc. After that I used a bag-of-words (unigram) representation of the different campaign_ids based on their subject. This was followed by merging this dataset with the campaign_ids present in the train and test data. After the merge, I used a groupby sum on user_id to obtain a unique representation for every user, then applied PCA to reduce the dimensions to 50. This operation added the biggest jump to my score.
- Number of mails received by different users
- Cross tab of user_id vs communication type
- Numerical features present in the campaign_data
This became my general framework for data preparation before feeding the data into any model. An XGBoost model with this set of features gave me a score of 0.695+ on the public leaderboard. What followed was sheer pragmatism: I created several models based on approximately the same framework and differentiated them by adding variability. Some of the important variations were:
- Using bi-grams for BOW representation
- Using tri-grams for BOW representation
- Using all three of them
- Using tf-idf with the same n-grams (unigram, bigram, trigram)
- Using LightGBM, XGBoost and CatBoost on each of the three representations above
- Using truncated SVD instead of PCA for dimension reduction
- I even dropped the best performing feature and tuned the hyper-parameters in such a way to arrive at similar scores using remaining features
- Target encoding of weekday of sent mail
- Cosine distance among the GloVe vector representations of different campaign_ids.
These are just some of them. I created many notebooks, added/dropped/modified many features and ran many experiments, which most of the time gave me a public LB score in the vicinity of 0.685 - 0.69. Even though the performance of all the models was similar, their predictions were not highly correlated. This gave me the opportunity to use weighted ensembles to arrive at a higher score: I took the most similar-scoring prediction files with the least correlation and took their weighted average, continuing this process in an uphill fashion. I ended up with the four best-performing predictions, with scores of 0.699 - 0.7011, and followed the same heuristic again to arrive at my final public leaderboard score of 0.704. This entire process is very similar to model stacking, where the predictions of diverse base classifiers are fed to a meta classifier to arrive at better predictions; only in my case, it was me manually adjusting the weights assigned to the different models by validating them against the public leaderboard.
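The correlation-guided weighted averaging can be sketched as follows; a toy illustration of the heuristic, not the author's code:

```python
import numpy as np

def least_correlated_pair(preds):
    """Among similarly scoring models, pick the pair whose predictions
    are least correlated -- those gain the most from averaging."""
    preds = np.asarray(preds, dtype=float)
    best, pair = np.inf, (0, 1)
    for i in range(len(preds)):
        for j in range(i + 1, len(preds)):
            c = np.corrcoef(preds[i], preds[j])[0, 1]
            if c < best:
                best, pair = c, (i, j)
    return pair

def weighted_blend(preds, weights):
    """Weighted average of prediction vectors; weights are normalised."""
    w = np.asarray(weights, dtype=float)
    return np.average(np.asarray(preds, dtype=float), axis=0,
                      weights=w / w.sum())

preds = [[0.1, 0.9, 0.2], [0.1, 0.8, 0.3], [0.9, 0.1, 0.8]]
pair = least_correlated_pair(preds)      # models 0 and 2 disagree the most
blend = weighted_blend([preds[pair[0]], preds[pair[1]]], [0.6, 0.4])
```

In the thread's process the weights were tuned by hand against the public leaderboard rather than chosen analytically.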
Congratulations! Thanks for your sharing. Wondering if you can share your github repository?
Congrats, and thanks for sharing. It would be great if the code for point 6 (the key feature) were shared.
Exceptional! Great job Kunal and many thanks for sharing your solution! As I was reading these lines I thought, Hey he did the same things I did! Why did I end up 40th? And then I read your ‘key feature’ on Natural Language Processing which, as you said, gave a big boost to your outcome…So, my respects for sharing your course of action with us and for teaching us how to think and act in Machine Learning Projects!
I could not register for the event during the competition. Could you please share the problem statement and data ? I am not able to access data now without registration.
I’m wondering if the score is the same as an accuracy measure comparing predicted click rate vs actual click rate. I don’t have the competition data set, but email click-through is usually low, creating an imbalanced data set. So if the accuracy is around 0.7, doesn’t that mean the algorithm could just predict “no click” all the time for all users and still yield this accuracy in aggregate?
The evaluation metric for this competition was the AUC-ROC score, not accuracy.
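A quick illustration of the difference on made-up data: the always-predict-no strategy gets high accuracy on an imbalanced set but only 0.5 AUC, because AUC measures how well positives are *ranked* above negatives, not the raw hit rate.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # ~10% click rate, imbalanced
always_no = [0.0] * len(y_true)            # predict "no click" for everyone
ranked = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1, 0.9]

# accuracy rewards the degenerate predictor on imbalanced data...
accuracy = sum(int(p >= 0.5) == t
               for p, t in zip(always_no, y_true)) / len(y_true)

# ...but AUC-ROC does not: constant scores have no ranking skill
auc_const = roc_auc_score(y_true, always_no)
auc_ranked = roc_auc_score(y_true, ranked)
```

Here the constant predictor reaches 90% accuracy yet 0.5 AUC, while the model that ranks the lone click on top scores 1.0 AUC.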