Hi everyone, I am working on a highly imbalanced dataset (response rate 0.27%). I have tried algorithms like LR, SVM, AdaBoost (with decision trees), KNN, etc., oversampling the minority class before training, but the best F-score I have got so far is 0.29. Can you suggest some ways/tricks to improve the model's performance? Number of predictors = 9, number of observations = 2 million. Thanks.
Having witnessed a similar situation, here is what I’d suggest:
You have an extremely low response rate. If possible, try using SMOTE to generate new synthetic observations for the responders.
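In practice you would use a library implementation (e.g. imblearn's SMOTE), but the core idea is simple: interpolate between a minority sample and one of its nearest minority neighbours. A minimal sketch in plain Python; the function name and the toy `responders` data are just for illustration:

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between a random sample and one of its k nearest minority
    neighbours (the core idea of SMOTE, sketched in plain Python)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, nb)])
    return synthetic

# Toy responder observations (2 features each)
responders = [[0.1, 1.2], [0.3, 0.9], [0.2, 1.0], [0.4, 1.1]]
new_points = smote(responders, n_new=3, k=2)
```

Each synthetic point lies on the line segment between two real responders, so the new observations stay inside the region the minority class already occupies rather than being arbitrary noise.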
Use a balanced random forest for modelling. It is a modified random forest technique that samples in such a way that responders and non-responders are roughly equal in each bootstrap sample. In R, this can be done by setting the “strata” and “sampsize” parameters of the randomForest function. You can also try increasing the number of trees.
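The sampling trick behind a balanced random forest is just a class-stratified bootstrap: each tree is grown on a sample that draws the same number of rows from each class. A small self-contained sketch (function name and toy data are mine, not from any library):

```python
import random

def balanced_bootstrap(X, y, n_per_class, seed=0):
    """Draw one bootstrap sample with an equal number of rows from
    each class -- the sampling idea behind a balanced random forest
    (in R roughly: randomForest(..., strata=y, sampsize=c(n, n)))."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    sample_X, sample_y = [], []
    for label, rows in by_class.items():
        for _ in range(n_per_class):
            sample_X.append(rng.choice(rows))  # sampling with replacement
            sample_y.append(label)
    return sample_X, sample_y

# 1 responder among 10 observations, as an extreme toy example
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [0]]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bX, by_ = balanced_bootstrap(X, y, n_per_class=4)
```

Each tree then sees a 50/50 class mix even though the full data is 0.27% responders, which is why the forest's splits stop ignoring the minority class.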
Boosting is a strong choice in this scenario. You can try XGBoost too. AdaBoost will take time (it may take a lifetime!!) while XGBoost is much faster.
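With XGBoost you can also handle the imbalance inside the loss instead of resampling, via its `scale_pos_weight` parameter; a common starting point is the ratio of negatives to positives. Computing that ratio for the numbers in the question (the classifier call is shown only as a comment, since fitting 2 million rows is out of scope here):

```python
# scale_pos_weight is XGBoost's built-in alternative to resampling:
# it up-weights positive (responder) examples in the loss.
# A common starting value is (# negatives) / (# positives).

n_obs = 2_000_000
response_rate = 0.0027               # 0.27% responders, from the question
n_pos = round(n_obs * response_rate)
n_neg = n_obs - n_pos
scale_pos_weight = n_neg / n_pos     # ~369 for this dataset

# It would then be passed along the lines of:
# xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight, n_estimators=500)
print(round(scale_pos_weight, 1))
```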
Look at the variable importance and reduce the number of variables accordingly.
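If your model doesn't report importances directly, permutation importance works with any fitted model: shuffle one feature and measure how much the score drops. A self-contained sketch using a deliberately trivial stand-in "model" (the threshold rule and toy data are hypothetical):

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Rank features by how much shuffling each one degrades accuracy.
    `predict` can be any fitted model's prediction function; a toy
    rule is used below to keep the sketch self-contained."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    base = accuracy(X)
    drops = []
    for j in range(n_features):
        col = [r[j] for r in X]
        rng.shuffle(col)
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
        drops.append(base - accuracy(permuted))  # bigger drop = more important
    return drops

# Toy "model": predicts responder iff feature 0 > 0.5; feature 1 is noise.
predict = lambda row: int(row[0] > 0.5)
X = [[0.9, 5], [0.8, 1], [0.1, 5], [0.2, 1], [0.7, 3], [0.3, 2]]
y = [1, 1, 0, 0, 1, 0]
drops = permutation_importance(predict, X, y, n_features=2)
```

Features whose drop is near zero (like the noise feature here) are candidates for removal.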
Try ensemble modelling to see if there is any improvement.
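The simplest ensemble over the models you already have is a majority vote on their class predictions. A sketch; the three prediction lists are hypothetical outputs (e.g. from LR, XGBoost, and a balanced random forest), and the tie-break toward the responder class is a design choice you may want to change:

```python
def majority_vote(predictions):
    """Combine class predictions from several models by voting.
    Ties are broken toward the positive (responder) class, on the
    assumption that missing a rare responder costs more than a
    false alarm."""
    combined = []
    for votes in zip(*predictions):
        ones = sum(votes)
        combined.append(1 if ones * 2 >= len(votes) else 0)
    return combined

# Hypothetical predictions from three already-trained models
lr  = [0, 1, 0, 0, 1]
xgb = [0, 1, 1, 0, 0]
brf = [0, 1, 0, 1, 1]
voted = majority_vote([lr, xgb, brf])  # → [0, 1, 0, 0, 1]
```

Averaging predicted probabilities instead of hard votes is usually a bit stronger, especially when you then tune the decision threshold for F-score.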
Hope this helps.
That’s unfortunate. So how are you proceeding on this?