Imbalanced Class Classification using Random Forest


#1

Hi,

I have a dataset with a highly imbalanced class distribution (binary outcome). I know that to fit a model we have to balance the classes by over-sampling or under-sampling, but I have 3 specific questions:

  1. Is Random Forest not resistant to imbalanced classes? I ask because it already uses bootstrapping to create multiple trees. If it isn't, which algorithm is better than RF?
  2. How can we implement k-fold cross-validation together with SMOTE over-sampling in Python?
  3. Can I use stratified sampling while splitting the dataset?

Please help. Thanks in advance

Tarun Singh


#2

@TarunSingh,

  1. It’s true that RF uses bootstrapping to create multiple trees, but in practice that alone is not enough to beat the imbalanced-class problem. We usually go for a boosting-based algorithm, since boosting pays more attention to the points wrongly classified while creating successive trees. Some of the popular algos are listed below, with a small sketch after the list:

    a. XGBoost - https://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/
    b. LightGBM - https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
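
For illustration, here is a minimal sketch (assuming the xgboost Python package and synthetic data, not your dataset) of how XGBoost's built-in scale_pos_weight parameter can compensate for binary imbalance:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Synthetic binary data with a heavy class imbalance (~98% vs ~2%)
    X, y = make_classification(n_samples=10000, n_features=20,
                               weights=[0.98, 0.02], random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              test_size=0.25, random_state=42)

    # scale_pos_weight = (#negatives / #positives) upweights the rare class
    ratio = (y_tr == 0).sum() / (y_tr == 1).sum()
    clf = XGBClassifier(n_estimators=200, scale_pos_weight=ratio)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))

(LightGBM exposes a similar lever via its scale_pos_weight / is_unbalance parameters.)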


#3

Thanks a lot. Let me try these; if I get stuck I will get back to you!

Tarun Singh


#4

I agree that using boosting algorithms is the better approach, but on its own it is often not enough in practice. SMOTE would be a good starting point (I would definitely opt for an over-sampling strategy), but there are others.

Here you can find a nice implementation of solutions for imbalanced data in Python (scikit-learn-contrib). The success of any of these techniques depends largely on the nature of your data, so I would suggest you try different approaches and see how they affect your results.
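
To your second question, here is a minimal sketch (assuming the imbalanced-learn package from scikit-learn-contrib, installed via pip install imbalanced-learn) of SMOTE combined with k-fold cross-validation. The imblearn Pipeline applies SMOTE only when fitting on each training fold, so no synthetic samples leak into the validation folds:

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic stand-in for an imbalanced binary dataset
    X, y = make_classification(n_samples=10000, n_features=20,
                               weights=[0.98, 0.02], random_state=42)

    # SMOTE runs inside each training fold only; validation folds stay untouched
    pipe = Pipeline([
        ('smote', SMOTE(random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')
    print(scores.mean(), scores.std())

This also touches your third question: StratifiedKFold (or the stratify option when splitting) keeps the class ratio the same in every fold, which is usually what you want with a rare class.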

Best of luck!

Jose


#5

Thanks @TitoOrt for the link. Initially I will work with random forest only and try to optimize it by over-sampling or SMOTE. I have started with the class_weight parameter set to ‘balanced’. Is this approach of manipulating the class_weight parameter correct? I still get misclassifications for the lower-frequency class. Please find the snapshot below. If this does not work I will go for SMOTE and maybe bagging/boosting techniques.
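
In essence my setup is something like this (a rough sketch, not the exact code from the snapshot):

    from sklearn.ensemble import RandomForestClassifier

    # 'balanced' reweights classes inversely to their training frequencies;
    # 'balanced_subsample' recomputes those weights on each bootstrap sample
    rf = RandomForestClassifier(n_estimators=200,
                                class_weight='balanced',
                                random_state=42)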

Thanks
Tarun Singh


#6

You are right. The parameter class_weight set to ‘balanced’ should work for you in this case. It computes the weights with the formula:

n_samples / (n_classes * np.bincount(y))

thus balancing all the classes. However, it is not always easy to find the best ‘balance’ settings. These posts are based on R packages, but they might give you an intuition of what I mean: here and here.

If ‘balanced’ doesn’t work, try custom weighting, and if everything fails move on to more complex options such as SMOTE or SMOTE+ENN. A small sketch of both weighting variants follows.
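
To make the formula concrete, here is a sketch (the 98/2 split is just an assumption matching your case, and the custom dict values are placeholders to tune):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.utils.class_weight import compute_class_weight

    y = np.array([0] * 98 + [1] * 2)  # 98% vs 2%, as in your data

    # n_samples / (n_classes * np.bincount(y)) -> [~0.51, 25.0]
    print(len(y) / (2 * np.bincount(y)))
    # sklearn computes the same thing internally for 'balanced'
    print(compute_class_weight('balanced', classes=np.array([0, 1]), y=y))

    # Custom weighting: replace 'balanced' with a hand-tuned dict
    rf = RandomForestClassifier(class_weight={0: 1, 1: 40}, random_state=42)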

Best of luck!


#7

Thanks @TitoOrt. I used the balanced class_weight and the results have improved, but not significantly.
To elaborate, I have an imbalanced class distribution of 98% vs 2%. With such high imbalance I am really skeptical that it will be possible to achieve the desired accuracy.

I will try SMOTE and bagging classifiers next. Hope they perform better!
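
For the bagging part I am planning to look at something like imbalanced-learn’s BalancedBaggingClassifier, which under-samples the majority class inside each bootstrap (a sketch on synthetic data, untested on my own set):

    from imblearn.ensemble import BalancedBaggingClassifier
    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10000, n_features=20,
                               weights=[0.98, 0.02], random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=42)

    # Each bagged tree is trained on a balanced, random-under-sampled bootstrap
    bbc = BalancedBaggingClassifier(n_estimators=100, random_state=42)
    bbc.fit(X_tr, y_tr)
    print(classification_report(y_te, bbc.predict(X_te)))

Since plain accuracy is misleading at 98/2, I will judge the results on per-class precision and recall instead.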


#8

Hi @TarunSingh. Did SMOTE and/or bagging help? Very keen to know since I too have similar problems with my dataset.