Dealing with imbalanced categorical data for machine learning predictions



My question regards the use of techniques to improve imbalanced data for machine learning prediction. Wouldn't techniques like random over-sampling,… bring unreal information to the problem?
Isn't this just making synthetic, artificial data? I am worried about the correctness of the predictions on my data. Thanks


Hi @amin1,

Over-sampling increases the number of instances in the minority class by randomly replicating them, in order to give the minority class a higher representation in the sample. Here we do not lose any information.

It is true that we add extra weight to the minority class, but that information is not unreal. We replicate existing minority-class observations, which increases the total number of observations and often leads to better performance on the minority class; no new feature values are invented.

There is one disadvantage of random over-sampling: it increases the likelihood of overfitting, since the model sees exact copies of the same minority-class events. In such cases we can look at other techniques to handle imbalanced classes.
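To make the replication idea concrete, here is a minimal sketch in plain Python (no external libraries; the function name and the toy data are my own, not from any package): the minority rows are sampled with replacement until both classes have the same count.

```python
import random

def random_oversample(rows, labels, minority, seed=0):
    """Replicate minority-class rows (sampling with replacement)
    until both classes have the same number of observations."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_count = len(labels) - len(minority_idx)
    extra_needed = majority_count - len(minority_idx)
    # Duplicates are drawn only from existing minority rows:
    # no new information is invented, observations are repeated.
    extras = [rng.choice(minority_idx) for _ in range(extra_needed)]
    new_rows = rows + [rows[i] for i in extras]
    new_labels = labels + [minority] * extra_needed
    return new_rows, new_labels

# Toy imbalanced data: 6 majority (0) vs 2 minority (1) observations
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [9.0], [9.5]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y, minority=1)
print(y_bal.count(0), y_bal.count(1))  # classes now balanced: 6 6
```

Note that every row in the balanced set already existed in the original data, which is exactly why the overfitting risk mentioned above arises: the same minority events are seen many times.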

To learn more about random over-sampling and other techniques that can be used to handle imbalanced class problems, you can refer to this article:


Thank you, Pulkits, for your answer.


What if the data categories are not binary? I mean, what if I have several classes in my data, say categories A, B, C, D, E and F, where A has the majority of observations and F the fewest?
Can I treat this problem as binary and use R packages such as ROSE or "unbalanced"?


In such cases you can train multiple one-vs-all classifiers. For example, you may treat A as one class and the combination of B, C, D, E and F as the other class, then handle it as a binary classification problem. Repeat the same process for each of the remaining classes, and combine the per-class predictions (e.g. pick the class whose classifier is most confident).
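The relabeling step behind one-vs-all can be sketched in a few lines of plain Python (the function name is mine, and the labels come from the A–F example above); each resulting binary problem can then be rebalanced on its own, e.g. with ROSE, before training its classifier:

```python
def one_vs_all_labels(labels, positive):
    """Relabel a multi-class vector for one one-vs-all run:
    `positive` becomes 1, every other class becomes 0."""
    return [1 if y == positive else 0 for y in labels]

y = ["A", "A", "B", "C", "D", "E", "F"]
# One binary problem per class; each gets its own classifier.
binary_problems = {cls: one_vs_all_labels(y, cls) for cls in set(y)}
print(binary_problems["A"])  # [1, 1, 0, 0, 0, 0, 0]
```

Note that the "rest" class in each binary problem pools several original classes, so the imbalance ratio can differ from run to run; each one-vs-all problem may need its own resampling.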


Thank you.