Suppose I have a dataset where the target variable has two categories (1, 0) and their distribution is imbalanced, say 99:1. How do I build a model on this data, or do I need to take a subset of the data? Are there any methods for handling this type of data?
What are you trying to predict? Is it even significant enough to have a business impact? If 1 event in 100 goes awry, that is perfectly acceptable and not something you should really worry about, as 99% of the events are fine.
On a serious note, explain the problem. This is not imbalanced data, this is just an aberration.
Explain the problem clearly. How is the data distributed in your dataset? In 99:1, does the 99 refer to instances of the negative class and the 1 to instances of the positive class?
wow 99:1, that’s a huge imbalance.
I usually use SMOTE when I have cases of imbalance (usually 80:20). SMOTE oversamples the minority class by generating synthetic examples, and implementations often combine it with undersampling of the majority class. You can read more on this
I assume from your question that 99% of the examples are negative and 1% are positive. This happens in medical diagnosis, credit card fraud detection, gene pattern analysis, and manufacturing defect detection. It is called the class imbalance problem.
The possible solutions are oversampling, undersampling, and creating a synthetic dataset for
Another approach is one class classification.
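To make the oversampling and undersampling options concrete, here is a minimal base R sketch. The toy data frame `ds`, its column `x`, and the 10,000-row size are assumptions for illustration only; they stand in for the real dataset.

```r
set.seed(42)  # make the random sampling reproducible

# Hypothetical toy data with a 99:1 class split.
n <- 10000
ds <- data.frame(
  x = rnorm(n),
  target = factor(c(rep(0, 9900), rep(1, 100)))
)

minority <- ds[ds$target == "1", ]
majority <- ds[ds$target == "0", ]

# Random undersampling: keep every minority row and draw an
# equally sized random subset of the majority rows.
under <- rbind(minority,
               majority[sample(nrow(majority), nrow(minority)), ])

# Random oversampling: keep every majority row and resample the
# minority rows with replacement until the classes are the same size.
over <- rbind(majority,
              minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

table(under$target)  # 100 rows per class
table(over$target)   # 9900 rows per class
```

Undersampling throws away most of the majority rows, while oversampling repeats the same few minority rows many times, so which one is acceptable depends on how much data there is to begin with.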
Thanks for the suggestion Malathi.
Thank you for SMOTE suggestion.
That’s not true in all cases - ham/spam prediction is often about finding small amounts of spam in large amounts of ham. Certain cancers can be infrequent but are still very important to detect for the unfortunate sufferer.
How about trying models that can deal with nonlinear data? Random Forests, SVMs and such?
You can draw biased (rebalanced) samples and fit multiple models on them before finalizing a stable model.
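One possible reading of this suggestion, sketched in base R: draw several balanced samples, fit a model on each, and average the predictions. The toy data, the choice of logistic regression via `glm`, and the count of 10 models are all assumptions made for the sketch, not something stated in the thread.

```r
set.seed(1)

# Hypothetical toy data with a 99:1 split; x is shifted upward for the rare class.
n <- 5000
target <- c(rep(0, 4950), rep(1, 50))
ds <- data.frame(x = rnorm(n, mean = 2 * target), target = target)

minority <- ds[ds$target == 1, ]
majority <- ds[ds$target == 0, ]

# Fit one logistic regression on a balanced sample:
# all minority rows plus an equal number of random majority rows.
fit_one <- function() {
  sampled <- majority[sample(nrow(majority), nrow(minority)), ]
  glm(target ~ x, data = rbind(minority, sampled), family = binomial)
}

# Repeat on 10 different balanced samples.
models <- replicate(10, fit_one(), simplify = FALSE)

# Average the predicted probabilities across the models.
probs <- sapply(models, predict, newdata = ds, type = "response")
avg_prob <- rowMeans(probs)
```

Because each model is trained on a 50:50 sample, the averaged probabilities reflect that balance rather than the original 1% rate, so they are better used for ranking cases than as calibrated probabilities.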
SMOTE is one of the most common ways to handle the class imbalance problem. Here is the command usage in R:
library(DMwR)  # provides SMOTE(); the target column must be a factor
ds <- SMOTE(target ~ ., ds, perc.over = 400, perc.under = 100)
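To see what `perc.over = 400` and `perc.under = 100` do to the class counts, here is a small base R sketch of the arithmetic as I understand the DMwR implementation. The helper `smote_counts` is hypothetical (it is not part of any package), it assumes `perc.over >= 100`, and the 100 minority rows are an assumed example.

```r
# Expected class counts after a DMwR-style SMOTE call.
# Hypothetical helper for illustration; assumes perc.over >= 100.
smote_counts <- function(n_minority, perc.over, perc.under) {
  # Each minority row yields floor(perc.over / 100) synthetic rows.
  n_synthetic <- as.integer(perc.over / 100) * n_minority
  # Majority rows are sampled in proportion to the synthetic rows.
  n_majority <- as.integer((perc.under / 100) * n_synthetic)
  c(minority = n_minority + n_synthetic, majority = n_majority)
}

# Example: 100 minority rows out of 10,000 (a 99:1 split).
counts <- smote_counts(n_minority = 100, perc.over = 400, perc.under = 100)
print(counts)
# minority majority
#      500      400
```

So with these settings the 9,900 majority rows are cut down to 400, and the result is slightly tilted toward the minority class. Raising `perc.under` keeps more of the majority rows.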