How to handle data that is unevenly distributed



Suppose I have a dataset where the target variable has 2 categories (1, 0) and their distribution is imbalanced, say 99:1. How should I build a model on this data? Do I need to take a subset of the data? Are there any methods to handle this type of data?



What are you trying to predict? Is it even significant enough to have a business impact? If 1 event in 100 goes awry, that is perfectly acceptable and not something you should really worry about, as 99% of the events are fine :)

On a serious note, explain the problem. This is not imbalanced data, this is just an aberration.


Explain the problem clearly. How is the data distributed in your dataset? In 99:1, does 99 refer to instances of the negative class and 1 to instances of the positive class?


Wow, 99:1, that’s a huge imbalance.

I usually use “SMOTE” (Synthetic Minority Over-sampling Technique) when I have cases of imbalance (usually around 80:20). SMOTE oversamples the minority class by creating synthetic examples between existing minority examples, and it is commonly combined with undersampling of the majority class. You can read more on this.
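
To make the oversampling half of that concrete, here is a minimal sketch of the interpolation idea behind SMOTE in plain Python. It is not the reference implementation: the function name `smote_sketch`, the toy points and the parameter values are all made up for illustration, and a real project would more likely use an existing library.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Illustrative only: assumes numeric features and at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # random position on the line segment between base and its neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points with two numeric features
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_points, n_new=8, k=2)
```

Each synthetic point lies between two real minority points, so the new examples stay inside the region the minority class already occupies instead of being exact duplicates.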


I assume from your question that 99% of the examples are negative and 1% are positive. This happens in medical diagnosis, credit card fraud detection, gene pattern analysis and manufacturing defect detection. It is called the class imbalance problem.
The possible solutions are oversampling the minority class, undersampling the majority class, and creating synthetic examples for the minority class.
Another approach is one-class classification.
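
As a rough sketch of the first two options, random oversampling and random undersampling can be written in a few lines of plain Python. The helper names and the toy 990:10 dataset below are invented for illustration only.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority rows the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Hypothetical 99:1 dataset: each row is just a (row_id, label) pair here.
majority_rows = [(i, 0) for i in range(990)]
minority_rows = [(i, 1) for i in range(990, 1000)]

over_major, over_minor = random_oversample(majority_rows, minority_rows)
under_major, under_minor = random_undersample(majority_rows, minority_rows)
```

Oversampling keeps all the data but repeats minority rows, while undersampling throws away most of the majority rows. Either way, resample only the training split so that the test set keeps the real class proportions.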


Thanks for the suggestion Malathi.


Thank you for the SMOTE suggestion.


That’s not true in all cases. Ham/spam prediction is often about finding small amounts of spam in large amounts of ham. Certain cancers can be infrequent but are still very important to detect for the unfortunate sufferer.


How about trying models that can deal with nonlinear data? Random Forests, SVMs and such?


You can draw biased (rebalanced) samples and run multiple models on them before finalizing a stable model.


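SMOTE is a good way to handle the class imbalance problem. Here is the command usage in R (this looks like the SMOTE function from the DMwR package, where the target column must be a factor).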

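# perc.over = 400: 4 synthetic cases per minority case; perc.under = 100: 1 majority case kept per synthetic case
ds <- SMOTE(target ~ ., ds, perc.over = 400, perc.under = 100)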
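
To see what those two percentages do to the class counts, here is a small back-of-the-envelope sketch in Python. It assumes the command above is `SMOTE` from the DMwR package and that its `perc.over` / `perc.under` arguments behave as that package documents them. The helper name and the 10,000-row dataset are hypothetical.

```python
def smote_class_counts(n_minority, perc_over, perc_under):
    """Class sizes after SMOTE, assuming DMwR-style perc.over / perc.under semantics."""
    # perc.over: synthetic minority cases generated, as a percentage of the minority size
    n_synthetic = n_minority * perc_over // 100
    # perc.under: majority cases kept, as a percentage of the synthetic cases generated
    n_majority_kept = n_synthetic * perc_under // 100
    return n_minority + n_synthetic, n_majority_kept


# Hypothetical 10,000-row dataset at 99:1, i.e. 100 minority rows.
minority_total, majority_total = smote_class_counts(100, perc_over=400, perc_under=100)
print(minority_total, majority_total)  # 500 400
```

Under these assumptions the 9,900:100 split becomes roughly 400:500, so most of the majority rows are dropped. If that is too aggressive, a larger `perc.under` keeps more of them.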


An imbalanced dataset is mostly a problem for how you evaluate the model. To make sure your model is evaluated properly, don’t use accuracy. Use recall, precision or F1 instead: they focus on the minority (positive) class, so they are not inflated by the large number of correctly classified majority examples.
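
A quick worked example shows why accuracy misleads at 99:1. The confusion-matrix counts below are invented for illustration: one "model" always predicts the majority class, the other is a hypothetical model that catches most positives at the cost of some false alarms.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no predicted or actual positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# 1,000 examples: 990 negatives and 10 positives.
# A "model" that always predicts the negative (majority) class:
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A hypothetical model that finds 8 of the 10 positives with 12 false alarms:
some_model = classification_metrics(tp=8, fp=12, fn=2, tn=978)

for name, m in [("always negative", always_negative), ("some model", some_model)]:
    print(name, [round(x, 3) for x in m])
```

The always-negative model scores 0.99 accuracy with zero recall, while the second model has slightly lower accuracy (0.986) but a recall of 0.8 and a precision of 0.4. Only the second set of numbers tells you anything about the class you actually care about.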