I finished 2nd on the private LB in the weekend competition.
My public score: 0.86126, private score: 0.84138. Missed out on 1st place by a tiny 0.0004.
Anyway, coming to the approach:
- I used Python throughout.
- I started off with only a small set of variables, mostly numeric. I decided to use 4-fold CV for the whole competition (4 because my laptop has 4 cores, and anything more would have meant computationally expensive, slow iterations). Treated missing values as -3.14. (A minimal sketch of this setup follows this list.)
- Tried out LR, RF and XGB. Found that RF scored around 0.83 CV / 0.844 LB, while XGB reached 0.84 CV / 0.849 public LB.
- On Saturday evening, the others overtook me quite easily, and I realized they must be using the other variables. So I decided to use the dates (day + month + year, etc.), the cities and, finally, the employer names, removing rare levels along the way. (A sketch of this feature handling also follows the list.)
- That's when I managed to push XGB to 0.8557 CV and 0.8557 public LB. It was an Aha! moment because my CV exactly matched my LB score. But then I was stuck there for a long time. Eventually, I realized I had made a mistake tuning the XGB, and instantly saw an improvement to 0.859 CV and 0.8594 LB.
- On Sunday, I realized I couldn't push the XGB any further, so I thought of ensembling. Sadly, my RF, which had been scoring up to 0.83 the previous day, wasn't even reaching 0.79 CV; that was the big disaster of Sunday afternoon. I just couldn't fix it by the end of Sunday.
- Instead of spending more time on that, I tried other algorithms, and luckily a simple Logistic Regression scored 0.836 on the LB. I was pleasantly surprised (shocked, too!) that it performed so well. So I decided to use an ensemble of XGB + LR, and the public score went to 0.8612. I couldn't cross-validate due to lack of time, so I just picked some weights for each model by hand. (The blend, and the weight search I skipped, are sketched after the list.)
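
First, a minimal sketch of the CV setup from the first few bullets: numeric variables only, missing values as -3.14, and a fixed 4-fold split. The file and column names ("train.csv", "target") are placeholders, and AUC as the metric is my stand-in assumption, not necessarily the competition's metric:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Placeholder file/column names; numeric features only,
# missing values filled with -3.14, fixed 4-fold CV.
train = pd.read_csv("train.csv")
y = train.pop("target")
X = train.select_dtypes(include="number").fillna(-3.14)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = []
for tr_idx, va_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=500, n_jobs=4, random_state=42)
    model.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    preds = model.predict_proba(X.iloc[va_idx])[:, 1]
    scores.append(roc_auc_score(y.iloc[va_idx], preds))  # AUC is an assumption

print("mean CV:", sum(scores) / len(scores))
```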
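
Next, roughly what the date split and rare-level removal can look like. The column names ("DOB", "City", "Employer_Name") and the rarity threshold are illustrative guesses, not the actual ones I used:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # placeholder file name

# Break a date column into day/month/year components
dob = pd.to_datetime(df["DOB"], errors="coerce")
df["dob_day"] = dob.dt.day
df["dob_month"] = dob.dt.month
df["dob_year"] = dob.dt.year

# Collapse rare levels of high-cardinality categoricals into one bucket
for col in ["City", "Employer_Name"]:
    counts = df[col].value_counts()
    rare = counts[counts < 10].index  # threshold of 10 is arbitrary here
    df.loc[df[col].isin(rare), col] = "RARE"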
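
And finally the blend itself, plus the weight search I skipped. The labels and probabilities below are dummy stand-ins for real out-of-fold predictions, and AUC is again my stand-in metric; the point is just that the blend weight can (and should) be picked by CV rather than by gut feeling:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Dummy out-of-fold labels/probabilities, standing in for the real ones
rng = np.random.default_rng(42)
y_val = rng.integers(0, 2, size=1000)
oof_xgb = np.clip(0.6 * y_val + 0.4 * rng.random(1000), 0, 1)
oof_lr = np.clip(0.5 * y_val + 0.5 * rng.random(1000), 0, 1)

# Grid-search the blend weight on validation predictions
best_w, best_auc = 0.0, 0.0
for w in np.linspace(0, 1, 21):
    auc = roc_auc_score(y_val, w * oof_xgb + (1 - w) * oof_lr)
    if auc > best_auc:
        best_w, best_auc = w, auc

print(f"best weight for XGB: {best_w:.2f}, blended CV AUC: {best_auc:.4f}")
# Final test predictions would then be: best_w * p_xgb + (1 - best_w) * p_lr
```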
I was pretty sure my score wouldn't drop by much because I used an ensemble. Unfortunately, I still fell behind by a tiny margin. I should have cross-validated to find the right weights.
Big mistake: I should have saved my RF code/features/params.
Big learning: Logistic Regression was working well. I was lucky to try it in the last few hours and add it to the ensemble.
What didn't work: KNN - 0.79 CV. I think it can be tuned to do better than this.
Tuning and CV strategy for XGB:
Typically, people use 5 folds, but it's your call; to check the reliability of the CV estimate, some people use 10-fold as well. (A code sketch of the whole procedure follows the steps below.)
Steps: 1. Decide 'n' in n-fold. Stick to it for the complete analysis.
2. Create a baseline score using a simple model.
3. Now run XGBoost with its default settings and establish an XGB baseline score.
4. Set num_trees to 10000 and a tiny learning rate of 0.01.
5. Repeat step (4) for various values of max_depth.
6. While doing step (4), monitor the progress and note at what tree # the model starts overfitting.
7. After steps 1-6, you will have reached a saturation score.
8. Now comes some magic! Start using subsample and, tada, your score improves.
9. Then use colsample_bytree and scale_pos_weight to improve your score further.
10. Finally, try max_delta_step and gamma too (a little tricky to tune).
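
To make steps 3-10 concrete, here's a minimal sketch using xgboost's native train API on synthetic data. Every value here (the depth, the early-stopping window, AUC as the metric) is illustrative, not my exact configuration:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real competition data.
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_va, label=y_va)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",  # metric is my stand-in assumption
    "eta": 0.01,           # step 4: tiny learning rate
    "max_depth": 6,        # step 5: sweep this value
    # steps 8-10: switch these on one at a time once the score saturates
    # "subsample": 0.8,
    # "colsample_bytree": 0.8,
    # "scale_pos_weight": 1.0,
    # "max_delta_step": 1,
    # "gamma": 0.1,
}

# Step 4: 10000 trees; steps 6-7: the eval log plus early stopping
# show the tree count where the validation score stops improving.
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=10000,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=200,
    verbose_eval=500,
)
print("best iteration:", bst.best_iteration)
```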