I wanted to know which methods are useful for increasing the accuracy of the minority class in binary classification. I’ve heard about resampling, weighting, etc. Can anyone briefly explain these to me?
I did some searching a while back and found a paper that discusses many methods. Here is the link:
I’m sure you’ll be able to access it with some effort, but I haven’t tried the methods myself yet. The most popular seems to be SMOTE, which you can research further.
Another approach is to use AUC, rather than accuracy, as your evaluation metric.
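AUC is a better fit for imbalanced data because it looks at how well the classifier ranks positives above negatives, rather than at a single accuracy number dominated by the majority class. A minimal sketch of the idea, with hypothetical toy scores (the thread itself works in R, but plain Python keeps the example self-contained):

```python
def auc(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of the ROC AUC: the probability
    that a randomly chosen positive outscores a randomly chosen negative,
    counting ties as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that scores the single minority example highest gets AUC 1.0,
# even though the classes are badly imbalanced (4 negatives, 1 positive).
print(auc([0, 0, 0, 0, 1], [0.1, 0.2, 0.3, 0.8, 0.9]))  # -> 1.0
```

In practice you would use a library routine (e.g. scikit-learn's `roc_auc_score` in Python, or `twoClassSummary` with caret in R) rather than this hand-rolled version.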
Here’s a link to lecture notes from the University of Rhode Island (USA) on the topic: Topic Notes on dealing with imbalanced learning
A little experiment I did with some data from the Hackathon-3 bank problem: various rebalancing methods and their results, with learning curves.
Hope this helps.
Alain
Unbalance Class.pdf (231.8 KB)
There are several re-sampling techniques for handling imbalanced data, each with its advantages and limitations. Here is a list of the methods available in the caret package for R.
Down-sampling: For example, suppose 9000 rows of the training set belong to class1 and 1000 rows belong to class2. Down-sampling randomly removes majority-class entries so that class1 ends up the same size as class2 (i.e., 1000 rows per class). This loses some data, since many class1 rows are discarded.
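The mechanics of down-sampling are just random sampling without replacement from the majority class. A minimal sketch in Python with toy data (in R, caret's `downSample` does this for you):

```python
import random

def down_sample(majority, minority, seed=0):
    """Randomly keep only len(minority) rows of the majority class,
    without replacement, so both classes end up the same size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

class1 = list(range(9000))  # majority rows (toy stand-in for real data)
class2 = list(range(1000))  # minority rows
kept1, kept2 = down_sample(class1, class2)
# both classes now have 1000 rows; 8000 class1 rows were discarded
```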
Up-sampling: This is the opposite of down-sampling. It randomly samples (with replacement) the minority class so that it is the same size as the majority class. In the above example, the classes will end up with 9000 rows each, with duplicate rows for class2.
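Up-sampling is the same idea in reverse: sample the minority class with replacement until it matches the majority. A minimal sketch with the same toy numbers (caret's `upSample` is the R equivalent); because 1000 unique rows are drawn 9000 times, duplicates are guaranteed:

```python
import random

def up_sample(majority, minority, seed=0):
    """Randomly resample the minority class, with replacement,
    until it is the same size as the majority class."""
    rng = random.Random(seed)
    return majority, rng.choices(minority, k=len(majority))

class1 = list(range(9000))  # majority rows (toy stand-in for real data)
class2 = list(range(1000))  # minority rows
kept1, kept2 = up_sample(class1, class2)
# both classes now have 9000 rows; class2 contains duplicate rows
```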
Hybrid methods: Methods such as SMOTE and ROSE down-sample the majority class1 and synthesize new rows for class2, so class2 gains no exact duplicates, unlike with up-sampling above.
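The key idea behind SMOTE's synthesis step is interpolation: each new minority row is placed at a random point on the line segment between an existing minority row and one of its k nearest neighbours. A simplified sketch of that step (not the full SMOTE algorithm, and the toy 2-D points are made up for illustration; in practice use caret's `sampling = "smote"` or a dedicated library):

```python
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a
    random minority row and one of its k nearest neighbours."""
    rng = random.Random(seed)
    new_rows = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest neighbours of a by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbours)
        gap = rng.random()  # random point between a and b
        new_rows.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return new_rows

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]  # toy 2-D rows
synthetic = smote_like(minority, n_new=3, k=2)
# each synthetic row lies between two existing minority rows,
# so no row is an exact duplicate of an original (with probability 1)
```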
Here is a document with example R code using the caret package:
Subsampling for class imbalance:
I hope I’m not misinterpreting your question: you want to know how to improve the accuracy of predicting true negatives.
The algorithm below may help, and you can even see which tree is optimized to improve the different positive or negative rates.
They have very good examples; please see their PDF file.