I wanted to have a discussion on the LTFS Hackathon. My preferred language is R. My score is currently around 0.6, while the leaderboard is around 0.66.
This is a big difference which makes me feel like there is something very obvious which I am doing wrong. I have tried different techniques like random forest, xgboost, logistic regression, neural net.
Is it simply tuning? In my experience tuning makes only an incremental difference.
What is it that everybody seems to know but I don’t? I am self-taught.
Any ideas will be appreciated.
Have you added any new features? In the first test I ran, I changed DOB to age and ran some label encoding on other variables. Ran it through LGBM and hit .65ish. I also added some other features and fine-tuned LGBM but can’t break that mark right now. I’m looking into adding more features as well and running different models.
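For anyone who wants to try the same two steps, here is a minimal pandas sketch of turning a date of birth into an age and label encoding a categorical column. The column names, the toy rows, and the dd-mm-YYYY date format are assumptions for illustration, not the actual competition schema.

```python
import pandas as pd

# Toy rows. Column names and the dd-mm-YYYY date format are assumptions.
df = pd.DataFrame({
    "Date.of.Birth": ["01-01-1984", "31-07-1985", "24-08-1993"],
    "DisbursalDate": ["03-08-2018", "26-09-2018", "01-08-2018"],
    "Employment.Type": ["Salaried", "Self employed", None],
})

dob = pd.to_datetime(df["Date.of.Birth"], format="%d-%m-%Y")
disbursed = pd.to_datetime(df["DisbursalDate"], format="%d-%m-%Y")

# Approximate age in whole years at the time of disbursal.
df["age"] = (disbursed - dob).dt.days // 365

# Label encoding: treat missing values as their own level, then map each
# level to an integer code (levels are sorted alphabetically).
df["employment_code"] = (
    df["Employment.Type"].fillna("Missing").astype("category").cat.codes
)

print(df[["age", "employment_code"]])
```

Tree models such as LGBM generally cope with integer codes like these, though the ordering of the codes carries no meaning.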
Doesn’t seem to be much chatter on here. I finally found a little something yesterday that shot my score to .6582. I feel silly because I spent a day or 2 trying to figure it out but using .groupby helped me get there. Now I’m trying to add some more features but can’t seem to break this mark currently. How’s everyone else doing with only 1.5 days or so left?
Thanks for sharing your thoughts. I have been working on this problem for the last 5 days but can’t get my AUC any higher than 0.6. I have used LightGBM as well and did some feature engineering like customer_age, loan_amt, avg_disbursal_amt, etc. I am not sure where I am going wrong. Have you done any cross-validation on your train dataset? Any other thoughts you could share would be really helpful. This problem is very interesting. I would like to learn from the winner’s solution after the competition is completed. Thanks again.
Hi, any idea about State_ID?
Thank you all for your replies! I finally figured it out on the last day. The submission was for probabilities rather than classifications, i.e. it accepts a range of values from 0 to 1, while I was submitting just 0 or 1. This is why my score was capped.
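To see why hard labels cap an AUC score, here is a small sketch on synthetic data (not the competition data). It scores the same predictions twice with scikit-learn's `roc_auc_score`, once as probabilities and once thresholded to 0/1. The data-generating choices are assumptions picked only to make the effect visible.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic binary target and a weakly informative score:
# positives are shifted upward by 0.6 standard deviations.
y_true = rng.integers(0, 2, size=2000)
raw_score = rng.normal(loc=y_true * 0.6, scale=1.0)

# Squash to (0, 1) so the values look like predicted probabilities.
proba = 1.0 / (1.0 + np.exp(-raw_score))

# Hard class labels, as in a 0/1 submission.
labels = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)
auc_labels = roc_auc_score(y_true, labels)

print(f"AUC with probabilities: {auc_proba:.3f}")
print(f"AUC with 0/1 labels:    {auc_labels:.3f}")
```

AUC measures how well the predictions rank positives above negatives. Hard labels collapse the ranking to two levels, so the ROC curve has a single interior point and most of the ranking information is thrown away.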
@badri.prudhvi27 this may be your mistake also.
@gaurav207 Thanks for the valuable advice. This was my mistake as well. Now my score is .65 ish range. This suggestion helped me a lot. Thanks Again.
@dhariyalbhaskar As @gaurav207 mentioned you should use probabilities in your submission file and upload it instead of class values [0,1]. If you are already doing that then you might be overfitting the model or there might be data leakage due to your feature engineering step. Please make sure you avoid these issues.
Hi, thanks for the reply. I’m using cross-validation to guard against overfitting. When I submitted output from predict_proba, my score went to around 0.35.
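One possible explanation, and this is only a guess: an AUC of about 0.35 is close to 1 minus 0.65, which is what you would see if the submitted column held the probability of class 0 instead of class 1. In scikit-learn style APIs, `predict_proba` returns one column per class in the order of `model.classes_`. The sketch below uses synthetic data and logistic regression as a stand-in model to show the mirror effect.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data and a stand-in model, not the competition setup.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shape (n_samples, 2); columns follow model.classes_, here [0, 1].
proba = model.predict_proba(X_te)

auc_class0 = roc_auc_score(y_te, proba[:, 0])  # wrong column for class 1
auc_class1 = roc_auc_score(y_te, proba[:, 1])  # probability of class 1

print(model.classes_, round(auc_class0, 3), round(auc_class1, 3))
```

If that is the cause, submitting `proba[:, 1]` should flip the score back above 0.5.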
How did everyone do? I ended up not having as much time the last 3 days, but I was able to reach rank 118 on the private LB with a score of 0.6631884673.
I used LGBM as my model and built some features off of that. When I ran my first test, Current_pincode_ID, state_id, emp_id, etc. kept coming up as some of the top features. So I grouped each of those features by the other features like PRI.NO.ACCTS, SEC, etc. This helped improve my score, and with some fine-tuning I finally got past the .66 mark on the public LB. After looking over a few submissions, I saw that I missed a few variables I could have added, including LTV * disbursed amount, and then using that variable to make a few extras.
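A rough sketch of what grouping one feature by another can look like with pandas `.groupby` and `transform`. The column names follow the ones mentioned in this thread but should be treated as placeholders, the rows are made up, and the interaction column is only one reading of "LTV * disbursed amount".

```python
import pandas as pd

# Made-up rows; column names are placeholders, not the exact dataset schema.
df = pd.DataFrame({
    "Current_pincode_ID": [1, 1, 2, 2, 2],
    "PRI.NO.ACCTS": [0, 4, 1, 3, 5],
    "ltv": [80.0, 75.0, 60.0, 85.0, 90.0],
    "disbursed_amount": [50000, 40000, 60000, 55000, 45000],
})

grouped = df.groupby("Current_pincode_ID")["PRI.NO.ACCTS"]

# transform() broadcasts the per-group statistic back onto every row,
# so the result lines up with the original frame.
df["pincode_mean_pri_accts"] = grouped.transform("mean")
df["pincode_row_count"] = grouped.transform("count")

# Simple interaction feature between LTV and the disbursed amount.
df["ltv_x_disbursed"] = df["ltv"] * df["disbursed_amount"]

print(df)
```

The same pattern can be repeated for other ID columns and other numeric columns, at the cost of a wider feature table.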
This was a really fun time, I just wish I had more time!
Congrats to the winners.
Sorry I didn’t get back to you earlier. I did use 5-fold and did one submission on 10-fold, which scored lower. So I kept 5-fold throughout. Definitely look through some of the solutions posted already, I learned a lot! I posted my approach below as well. Not the greatest but it was still fun.
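For reference, a minimal 5-fold cross-validation sketch scored with AUC. It uses synthetic, imbalanced data and scikit-learn's logistic regression as a stand-in for LGBM, so the numbers mean nothing for the competition. The fold count is the only detail taken from the post above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data; a stand-in for the real training set.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="roc_auc" evaluates predicted scores, not hard labels.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

print(scores.round(3), scores.mean().round(3))
```

Comparing the mean fold score with the public LB score is a quick check for overfitting or leakage.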
I decided to upload my solution if anyone would like to check it out. This was my first hackathon and I learned quite a bit. I tried a few different variations but in the end this was the best performance on the LB.
How did you steer through correlation?
Better way to impute Employment.Type?
And is there any specific reason to use Catboost most of the time instead of XGBoost and LightGBM?
I built a polynomial model (degree=3) from the negatively correlated variables like Age and so forth. This improved my model a bit, but due to time constraints I wasn’t able to investigate further.
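One way to read "polynomial model (degree=3)" is a degree-3 feature expansion with scikit-learn's `PolynomialFeatures`. That reading is an assumption, and the two columns below are hypothetical examples of selected variables.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical selected columns with made-up values.
df = pd.DataFrame({
    "age": [34, 33, 24, 41],
    "ltv": [89.5, 73.2, 61.7, 88.4],
})

# degree=3 adds all products of the inputs up to total degree 3:
# age, ltv, age^2, age*ltv, ltv^2, age^3, age^2*ltv, age*ltv^2, ltv^3.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(df[["age", "ltv"]])

print(X_poly.shape)
```

The expanded columns can then be appended to the feature table before fitting the model. The feature count grows quickly with more inputs, so it helps to keep the selected set small.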
The first thing I did to employment type was create an ‘unemployed’ variable. I decided to change this after re-reading the Data Dictionary, which said this variable was for 2 groups. So I went with that assumption. I think making more cat features from employment type, length of credit history, credit history description, and other variables like those is what makes CatBoost work so well. CatBoost is great for data with a lot of cat variables, like this dataset has. You just have to convert those variables into categories that CatBoost can recognize. Check out the private leaderboard and some of the top 50 solutions. Pretty good learning.
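A small sketch of the conversion step, assuming the goal is to hand CatBoost its categorical columns as strings with no missing values. The column names and rows are illustrative, and the CatBoost call is left as a comment so the snippet runs with pandas alone.

```python
import pandas as pd

# Illustrative rows; column names are assumptions based on this thread.
df = pd.DataFrame({
    "Employment.Type": ["Salaried", None, "Self employed"],
    "State_ID": [6, 13, 6],
    "ltv": [89.5, 73.2, 61.7],
})

# Columns to treat as categorical, including numeric IDs.
cat_cols = ["Employment.Type", "State_ID"]

for col in cat_cols:
    # Give missing values their own level, then cast everything to string
    # so IDs are not treated as ordered numbers.
    df[col] = df[col].fillna("Missing").astype(str)

print(df.dtypes)

# With catboost installed, the list would be passed along these lines:
#   CatBoostClassifier(cat_features=cat_cols).fit(df, y)
```

Keeping missing values as a separate "Missing" level is one simple option for Employment.Type, not necessarily the best imputation.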