This is an imbalanced data set: the ratio of positive to negative classes is highly skewed. You asked how to handle such a data set. If you fit a model on the entire imbalanced data set, you will often end up with a poor model, because the majority class dominates the fit. There are different methods to deal with this imbalance. Let's look at them:
- The first approach is to keep all the positives and draw a sample of negatives that is the same size or slightly larger (for example, around a 60/40 ratio of negatives to positives). This method works well, but one concern is that you have no positive observations left for validating the quality of your model, because all of them were used for training.
- Another approach is bagging (bootstrap aggregating). Here we do not put all the positives in one sample: we work with 90% of the positives and about the same number of negatives (we can also take slightly more negatives than positives, to get a 40/60 ratio of positives to negatives). Another important point is that we do not work with a single sample: we draw n samples, fit n models, and then take the average of the n model outputs. It is advisable to take 10-20 samples. With this method, you can test each individual model on the 10% of positives it did not see, together with the remaining negatives.
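Both approaches can be sketched in a few lines of plain Python. This is only a minimal illustration under some assumptions: the data are lists of 1-D numbers, `fit_centroid` is a toy stand-in for whatever model you actually use, and all function and parameter names (`split_for_one_model`, `fit_bagged`, `bagged_score`, `pos_frac`, `neg_per_pos`) are made up for this example.

```python
import random

def split_for_one_model(positives, negatives, rng, pos_frac=0.9, neg_per_pos=1.5):
    """Draw one training sample and keep the unused rows for testing."""
    pos = positives[:]
    rng.shuffle(pos)
    cut = int(round(pos_frac * len(pos)))
    train_pos, test_pos = pos[:cut], pos[cut:]
    # neg_per_pos=1.5 gives roughly a 60/40 negative/positive training mix.
    n_neg = min(len(negatives), int(round(neg_per_pos * len(train_pos))))
    chosen = set(rng.sample(range(len(negatives)), n_neg))
    train_neg = [x for i, x in enumerate(negatives) if i in chosen]
    test_neg = [x for i, x in enumerate(negatives) if i not in chosen]
    return (train_pos, train_neg), (test_pos, test_neg)

def fit_centroid(train_pos, train_neg):
    """Toy stand-in model (an assumption, not a recommendation):
    score 1.0 if x is nearer the positive mean, else 0.0."""
    mu_pos = sum(train_pos) / len(train_pos)
    mu_neg = sum(train_neg) / len(train_neg)
    return lambda x: 1.0 if abs(x - mu_pos) < abs(x - mu_neg) else 0.0

def fit_bagged(positives, negatives, n_models=10, seed=0):
    """Fit n_models models, each on its own balanced sample."""
    rng = random.Random(seed)
    models, holdouts = [], []
    for _ in range(n_models):
        (train_pos, train_neg), holdout = split_for_one_model(positives, negatives, rng)
        models.append(fit_centroid(train_pos, train_neg))
        holdouts.append(holdout)  # rows this model never saw, usable for testing it
    return models, holdouts

def bagged_score(models, x):
    """Average the outputs of the individual models."""
    return sum(m(x) for m in models) / len(models)
```

Calling `split_for_one_model` with `pos_frac=1.0` corresponds to the first approach: every positive goes into training and the positive hold-out is empty, which is exactly the validation concern mentioned above. With the default `pos_frac=0.9`, each model keeps its own 10% of positives aside for testing.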
The methods discussed above usually help when dealing with an unbalanced data set.
Hope this helps!