Hi everyone - Wanted to say hi and ask a question - Class Imbalance


#1

Hi everyone,
This is my first post, and I wanted to ask a question about class imbalance.
Assume I have a dataset with two output classes, 0 and 1, where one class is the majority and the other the minority.
What is the threshold at which you would apply class-imbalance techniques to make the dataset more balanced for training?
Do you consider the threshold to be a 95% to 5% imbalance?

Thank you


#2

Algorithms like gradient boosting tend not to be badly impacted by class imbalance, because the boosting process automatically puts more weight on hard-to-classify samples, which usually come from the minority class. However, if you intend to use other algorithms, there is no short answer. You need to try the class-imbalance techniques and find out whether they improve the cross-validation score. Here are some techniques you could try: oversampling the minority class, undersampling the majority class, SMOTE, and class weighting (a sketch of the cross-validation comparison follows below).
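For example, a minimal sketch of that comparison, assuming scikit-learn and a synthetic imbalanced dataset standing in for your own X and y:

```python
# Minimal sketch: compare cross-validation scores with and without class weighting.
# Assumes scikit-learn; replace the synthetic data with your own feature matrix X and labels y.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 95% / 5% binary dataset, for illustration only.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

baseline = RandomForestClassifier(random_state=0)
weighted = RandomForestClassifier(class_weight="balanced", random_state=0)

# F1 on the minority class is more informative than plain accuracy here.
print("baseline F1:", cross_val_score(baseline, X, y, cv=5, scoring="f1").mean())
print("weighted F1:", cross_val_score(weighted, X, y, cv=5, scoring="f1").mean())
```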


#3

Thank you for your reply.
So if I use XGBoost, it automatically accounts for the class imbalance?
Thank you


#4

Yes, it does.


#5

Thank you very much - I have posted a follow-up question.


#6

Please help with understanding …

1. In logistic regression, when applying the Python sklearn option “balanced” (link below), does this option automatically weight the different classes to make them balanced?

2. If the first option balances the classes, do I still need to use techniques like SMOTE etc.?

3. If I use a technique like XGBoost, do I need to balance my classes? You mentioned it does not require any balancing because these techniques account for class imbalance.

Thank you

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


#7
  1. Yes, the “balanced” option accounts for class imbalance by weighting each class inversely proportional to its frequency, so minority-class samples count more during training.
  2. SMOTE is a different kind of class-balancing technique: it creates synthetic minority-class observations based on feature-space similarities among the existing minority samples, shifting the classifier's learning bias towards the minority class. You could try both and see which technique gets better results; the first sketch below compares the two.
  3. XGBoost's boosting process gives higher weight to hard-to-classify samples as it builds trees, and those tend to come from the minority class, so it largely accounts for the imbalance on its own; you can also set the weighting explicitly (see the last sketch below).
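To make points 1 and 2 concrete, here is a minimal sketch comparing the two approaches under cross-validation. It assumes scikit-learn plus the separate imbalanced-learn package for SMOTE, and uses a synthetic dataset in place of your own X and y:

```python
# Minimal sketch: class_weight="balanced" vs. SMOTE oversampling.
# Assumes scikit-learn and imbalanced-learn (pip install imbalanced-learn).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # resamples only inside the training folds
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 90% / 10% binary dataset, for illustration only.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

# Option 1: re-weight classes inversely proportional to their frequency.
weighted = LogisticRegression(class_weight="balanced", max_iter=1000)

# Option 2: synthesize new minority samples with SMOTE, then fit a plain model.
smote = Pipeline([("smote", SMOTE(random_state=0)),
                  ("clf", LogisticRegression(max_iter=1000))])

print("class_weight F1:", cross_val_score(weighted, X, y, cv=5, scoring="f1").mean())
print("SMOTE        F1:", cross_val_score(smote, X, y, cv=5, scoring="f1").mean())
```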
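On point 3, if you prefer to control the weighting explicitly rather than rely on the boosting behaviour alone, XGBoost's scikit-learn wrapper has a scale_pos_weight parameter, commonly set to the ratio of negative to positive samples. A minimal sketch, assuming the xgboost package and a synthetic dataset in place of your own data:

```python
# Minimal sketch: explicitly up-weighting the positive (minority) class in XGBoost.
# Assumes the xgboost package; replace the synthetic data with your own X and y.
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

neg, pos = np.bincount(y)                           # samples in class 0 and class 1
model = XGBClassifier(scale_pos_weight=neg / pos)   # ~19 for a 95/5 split
model.fit(X, y)
```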