Imbalanced classification!



Hi folks,

I've got a classification problem in which one class (failures) has only 1% of the instances. I tried various classification algorithms with upsampling and downsampling (1:1, 1:2, 1:4, 1:6); unfortunately, nothing has worked out (the model overfits on the train set). The problem is that my model is able to capture the failure pattern, but a small chunk of misclassified 0s (majority class) makes the precision very, very low.

I'm currently using random forest, but precision is coming out very low and it overfits on the train set.

On the net, people talk about penalized SVM/LDA or using a cost function in SVM, but I don't know how to implement them in R.
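For the cost-sensitive SVM part, here is a minimal sketch in R, assuming the e1071 package is available; the toy data, the RBF kernel, and the 99:1 weights (mirroring a 1% failure rate) are illustrative assumptions to tune, not a definitive recipe.

```r
library(e1071)

set.seed(42)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 > 2.3, 1, 0))  # roughly 1% "failures"

# class.weights penalises mistakes on the rare class more heavily;
# the names must match the factor levels of y
fit <- svm(y ~ ., data = df, kernel = "radial",
           class.weights = c("0" = 1, "1" = 99))

pred <- predict(fit, df)
table(truth = df$y, pred = pred)
```

The same idea works with `e1071::svm`'s `cost` parameter for the overall penalty, but `class.weights` is the piece that makes it asymmetric between classes.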

Could anybody help if they have faced the same scenario?


As you are going the tree way, why not bagging? You stratify the same way as for random forest. Then, in the control of bagging, which is based on rpart, you set parms = list(loss = your_penalty_matrix).
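To make that concrete, here is a rough sketch of bagged rpart trees with a loss matrix, hand-rolled in base R plus rpart (which ships with R); the toy data, the 20 trees, and the 10x penalty on missed failures are all assumptions to adapt.

```r
library(rpart)

set.seed(1)
n  <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n, sd = 0.5) > 2, 1, 0))  # rare class 1

# loss matrix: rows = true class, cols = predicted class (levels "0", "1");
# misclassifying a true 1 as 0 costs 10, the reverse costs 1
loss <- matrix(c(0,  1,
                 10, 0), nrow = 2, byrow = TRUE)

n_trees <- 20
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- df[sample(nrow(df), replace = TRUE), ]  # bootstrap resample
  rpart(y ~ ., data = boot, method = "class",
        parms = list(loss = loss))
})

# majority vote across the bagged trees
votes <- sapply(trees, function(t) as.character(predict(t, df, type = "class")))
pred  <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                levels = levels(df$y))
table(truth = df$y, pred = pred)
```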

Before doing this, I would check the oversampling again: which method do you use? You can over-generalise if you use a simple method. By the way, you can go further in the imbalance; 1:10 should be OK for random forest.



Hey, thanks for your time.

I tried bagging using random forest with 20 iterations. Still, it's not improving; moreover, bagging is taking a lot of time to run.

For sampling, I take random observations from the majority class using the sample function in R. I used SMOTE once, but it kept running on, so I killed the process after a few hours' wait.
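For reference, that sample-based downsampling can be sketched in base R like this; the synthetic df and the 1:10 ratio (suggested above) are assumptions.

```r
set.seed(7)
n  <- 100000
df <- data.frame(x = rnorm(n), y = factor(rbinom(n, 1, 0.01)))  # ~1% minority

minority <- df[df$y == 1, ]
majority <- df[df$y == 0, ]

# keep every minority row, sample the majority down to a 1:10 ratio
ratio <- 10
keep  <- majority[sample(nrow(majority), ratio * nrow(minority)), ]
train <- rbind(minority, keep)
train <- train[sample(nrow(train)), ]  # shuffle rows

table(train$y)
```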

Will see now; running the same thing again with a 1:10 downsample ratio. :smile:


Try oversampling the minority; this will allow you to increase the majority as well. Strange about SMOTE, it usually works OK. Be careful with over-generalisation there; look up Tomek links to avoid it.
If SMOTE does not work for whatever reason, you can try an ensemble as well.
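SMOTE itself lives in packages (e.g. DMwR or smotefamily), but its core idea fits in a few lines of base R: for each synthetic point, pick a minority observation, choose one of its k nearest minority neighbours, and interpolate between them. This is a simplified sketch on toy data, not a drop-in replacement for a library implementation.

```r
set.seed(3)
minority <- matrix(rnorm(40, mean = 3), ncol = 2)  # 20 minority points

smote_like <- function(X, n_new, k = 5) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf                                 # a point is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(X), 1)                    # a random minority point
    nns <- order(d[a, ])[seq_len(k)]             # its k nearest minority neighbours
    b   <- sample(nns, 1)
    gap <- runif(1)
    synth[i, ] <- X[a, ] + gap * (X[b, ] - X[a, ])  # interpolate between the pair
  }
  synth
}

new_pts <- smote_like(minority, n_new = 40)
dim(new_pts)
```

If the library SMOTE hangs on 0.4 million rows, running a hand-rolled version like this only on the (much smaller) minority class is one way around the runtime.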


Oversampling can't be done fully, as the data has 0.4 million rows, and it didn't work with the classification algos.
But I tried duplicating the rows in the original dataset twice and thrice; it overfits on train.



It's never a good idea to duplicate; it brings no new pattern. You mean you have 0.4 million rows in the minority, am I right? If not, then you have a minority of about 4,000, and there you can oversample. How many variables do you have? That could be the issue, but if you have overfitting, perhaps you do not have enough. Do a learning curve; it will help tell you whether you need more observations or variables.
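A quick learning-curve sketch along those lines: train on growing subsets, score train and held-out error, and see whether the two curves converge (then more features may help) or still have a gap that is closing (then more data may help). Uses rpart on toy data; every name and size here is illustrative.

```r
library(rpart)

set.seed(9)
n  <- 5000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(n) > 2.5, 1, 0))

test_idx <- sample(n, 1000)
test <- df[test_idx, ]
pool <- df[-test_idx, ]

# misclassification rate of a fitted tree on a data frame
err <- function(fit, data) mean(predict(fit, data, type = "class") != data$y)

sizes <- c(500, 1000, 2000, 4000)
curve <- t(sapply(sizes, function(m) {
  tr  <- pool[sample(nrow(pool), m), ]
  fit <- rpart(y ~ ., data = tr, method = "class")
  c(train = err(fit, tr), test = err(fit, test))
}))
cbind(size = sizes, curve)
```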
If the oversampling did not work, check it: did you use Tomek links? As mentioned, you could have over-generalised.
If SMOTE does not work, build a function with a kNN model that you clean after prediction using the Tomek algorithm; you can even do multiple passes to converge.
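The Tomek cleaning step mentioned above can be sketched in base R: a pair of points from opposite classes form a Tomek link when each is the other's nearest neighbour, and removing the majority member of each link cleans the class boundary after oversampling. The toy data and the brute-force dist() are assumptions for illustration; on 0.4 million rows you would want a proper nearest-neighbour index instead.

```r
set.seed(5)
X <- rbind(matrix(rnorm(100), ncol = 2),           # 50 majority points
           matrix(rnorm(20, mean = 1), ncol = 2))  # 10 minority points
y <- c(rep(0, 50), rep(1, 10))

d <- as.matrix(dist(X))
diag(d) <- Inf
nn <- apply(d, 1, which.min)  # nearest neighbour of each point

# Tomek link: i and its nearest neighbour j are mutual NNs with opposite labels;
# flag only the majority member of each link for removal
is_link_majority <- sapply(seq_along(y), function(i) {
  j <- nn[i]
  nn[j] == i && y[i] != y[j] && y[i] == 0
})

X_clean <- X[!is_link_majority, ]
y_clean <- y[!is_link_majority]
table(y_clean)
```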

Hope this helps.