Machine learning with imbalanced classification



I have an outcome variable with three levels, and only 9% of my sample belongs to one of them. Is that too low? Do techniques such as bagging or boosting inherently account for this distribution in the algorithm? I've also read that using kappa instead of accuracy as the metric to evaluate the model may be a good idea. Is that right?

Thank you!!



Boosting techniques can deal with imbalanced datasets. Try using XGBoost on your data.

About the evaluation metric: you're correct, accuracy is not the right metric for this case, and kappa can be used. You can also use the F1 score or precision/recall to judge the performance of your model.
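To illustrate why accuracy misleads here, consider a toy three-class problem (labels below are made up) where the classifier simply never predicts the ~9% minority class. Accuracy still looks good, while kappa and per-class F1 expose the failure:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

y_true = [0] * 45 + [1] * 46 + [2] * 9   # ~9% minority class, as in the question
y_pred = [0] * 45 + [1] * 46 + [0] * 9   # classifier that never predicts class 2

print(accuracy_score(y_true, y_pred))            # 0.91 despite ignoring class 2
print(cohen_kappa_score(y_true, y_pred))         # lower: corrected for chance agreement
print(f1_score(y_true, y_pred, average=None))    # per-class F1; class 2 scores 0.0
```

The per-class F1 vector makes the problem impossible to miss: the minority class gets an F1 of exactly 0, even though overall accuracy is 91%.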

I would recommend going through this article.


Do you have enough data? And did you cross-validate on the training set?
If you are classifying texts as positive, negative, or neutral, you will need a very large training set to get good accuracy.
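When cross-validating with a ~9% minority class, it's worth stratifying the folds so each one keeps the original class proportions. A minimal sketch on synthetic data (the model choice here is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data with a ~9% minority class, for illustration.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[0.46, 0.45, 0.09], random_state=0)

# Stratified folds keep the class ratios intact in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```

Without stratification, a small fold can end up with almost no minority-class samples, making the fold's score meaningless for that class.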