How to handle data that is unevenly distributed


#1

Suppose I have a dataset where the target variable has 2 categories (1, 0) and the classes are imbalanced, say distributed at 99:1. How should I build a model on this data? Do I need to take a subset of the data? Are there any methods for handling this type of data?


#2

99:1.

What are you trying to predict? Is it even significant enough to cause a business impact? If 1 event in 100 goes awry, that is perfectly acceptable and not something you should really worry about, as 99% of the events are fine :)

On a serious note, explain the problem. This is not imbalanced data; this is just an aberration.


#3

Please explain the problem clearly. How is the data distributed in your dataset? In 99:1, does 99 refer to instances of the negative class and 1 to instances of the positive class?


#4

Wow, 99:1, that’s a huge imbalance.

I usually use SMOTE (Synthetic Minority Over-sampling Technique) in my models when I have cases of imbalance (usually 80:20). SMOTE oversamples the minority class by generating synthetic examples, and it is often combined with undersampling of the majority class. You can read more on this.
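To make the "synthetic" part concrete, here is a minimal base R sketch of the interpolation step behind SMOTE. It is only an illustration: the data, the choice of `k`, and all variable names are made up, and a real implementation repeats this for every minority case and handles non-numeric features as well.

```r
set.seed(7)  # fixed seed so the sketch is reproducible

# Made-up minority-class points with two numeric features
minority <- matrix(rnorm(20), ncol = 2)

# Pick one minority point and find its k nearest minority neighbours
k <- 3
i <- 1
diffs <- sweep(minority, 2, minority[i, ])   # subtract point i from every row
d <- sqrt(rowSums(diffs^2))                  # Euclidean distances to point i
neighbours <- order(d)[2:(k + 1)]            # position 1 is the point itself

# Choose one neighbour at random and interpolate at a random fraction
j <- neighbours[sample(k, 1)]
gap <- runif(1)
synthetic <- minority[i, ] + gap * (minority[j, ] - minority[i, ])

synthetic  # a new point on the line segment between points i and j
```

The synthetic point always lies between an existing minority case and one of its neighbours, which is why SMOTE adds variety that plain duplication of rows does not.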


#5

I assume from your question that 99% of the examples are negative and 1% are positive. This happens in medical diagnosis, credit card fraud detection, gene pattern analysis and manufacturing defect detection. It is called the class imbalance problem.
The possible solutions are oversampling, undersampling and creating a synthetic dataset for the minority examples.
Another approach is one-class classification.
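As a rough illustration of the first two options, random oversampling and random undersampling can be sketched in base R as follows. The data frame `ds` and its columns are made up for the example; this shows the idea and is not a recommendation of specific ratios.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Toy data (made up for illustration): 990 negatives, 10 positives
ds <- data.frame(x = rnorm(1000),
                 target = factor(c(rep(0, 990), rep(1, 10))))

minority <- ds[ds$target == 1, ]
majority <- ds[ds$target == 0, ]

# Random oversampling: draw minority rows with replacement
# until they match the majority count
over_idx <- sample(nrow(minority), nrow(majority), replace = TRUE)
oversampled <- rbind(majority, minority[over_idx, ])

# Random undersampling: keep only as many majority rows as minority rows
under_idx <- sample(nrow(majority), nrow(minority))
undersampled <- rbind(majority[under_idx, ], minority)

table(oversampled$target)   # 990 rows of each class
table(undersampled$target)  # 10 rows of each class
```

Oversampling keeps all the data but repeats the same few minority rows many times, while undersampling throws away most of the majority class. Resample only the training data, so that the evaluation set keeps the real 99:1 distribution.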


#6

Thanks for the suggestion Malathi.


#7

Thank you for the SMOTE suggestion.


#8

That’s not true in all cases. Ham/spam prediction is often about finding small amounts of spam in large amounts of ham. Certain cancers can be infrequent but are still very important to detect for the unfortunate sufferer.
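A quick way to see why the rare class cannot simply be ignored: with a 99:1 split, a model that always predicts the majority class looks excellent on accuracy while finding none of the cases you care about. The labels below are made up for illustration.

```r
# Made-up labels: 990 negatives (0) and 10 positives (1)
actual <- c(rep(0, 990), rep(1, 10))

# A "model" that ignores its input and always predicts the majority class
predicted <- rep(0, length(actual))

# Share of all predictions that are correct
accuracy <- mean(predicted == actual)

# Share of the true positives that were actually found
recall <- sum(predicted == 1 & actual == 1) / sum(actual == 1)

accuracy  # 0.99
recall    # 0
```

This is why measures such as recall and precision are usually more informative than plain accuracy on data like this.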


#9

How about trying models that can deal with non-linear data? Random Forests, SVMs and such?


#10

You can use biased samples (samples in which the minority class is deliberately over-represented) and run multiple models on them before finalizing a stable model.
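One possible reading of this suggestion, sketched in base R: fit several models, each on all the positives plus a different random draw of negatives, and average their predictions. The toy data, the use of logistic regression, the number of models and the 0.5 cut-off are all assumptions made for the example.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Toy data (made up): positives are rare (1%) and have larger x on average
n_neg <- 4950
n_pos <- 50
ds <- data.frame(
  x = c(rnorm(n_neg, mean = 0), rnorm(n_pos, mean = 2)),
  target = c(rep(0, n_neg), rep(1, n_pos))
)
pos <- ds[ds$target == 1, ]
neg <- ds[ds$target == 0, ]

# Fit one logistic regression per biased sample: all positives plus an
# equally sized random draw of negatives
n_models <- 25
models <- lapply(seq_len(n_models), function(m) {
  biased <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])
  glm(target ~ x, data = biased, family = binomial)
})

# Average the predicted probabilities across all models
probs <- sapply(models, predict, newdata = ds, type = "response")
avg_prob <- rowMeans(probs)

# Confusion table at a 0.5 cut-off (the cut-off is an arbitrary choice here)
table(actual = ds$target, predicted = as.integer(avg_prob > 0.5))
```

If the individual models give broadly similar predictions, that is some evidence the final model is stable and not an artefact of one particular sample. In practice you would score a held-out set and not the training rows used here.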


#11

SMOTE is one of the most common ways to handle the class imbalance problem. Here is the command usage in R, with the SMOTE function from the DMwR package.

library(DMwR)  # provides SMOTE(); the target column should be a factor
ds <- SMOTE(target ~ ., ds, perc.over = 400, perc.under = 100)
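To see what those two parameters do to the class counts, here is a small arithmetic sketch. It follows my reading of the DMwR documentation: `perc.over` sets how many synthetic minority cases are created per original minority case, and `perc.under` sets how many majority cases are kept relative to the number of synthetic cases. The starting counts are hypothetical.

```r
# Hypothetical starting point: 990 majority and 10 minority rows (99:1)
n_minority <- 10
perc_over  <- 400
perc_under <- 100

# perc.over = 400 -> 4 synthetic cases per original minority case
n_synthetic <- n_minority * perc_over / 100

# perc.under = 100 -> as many majority cases as synthetic cases
n_majority_kept <- as.integer(perc_under / 100 * n_synthetic)

# Original minority cases are kept alongside the synthetic ones
n_minority_total <- n_minority + n_synthetic

c(minority = n_minority_total, majority = n_majority_kept)
# minority = 50, majority = 40
```

So with these settings only 40 of the 990 majority rows would survive. A higher `perc.under` keeps more of the majority class at the cost of a less balanced result.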