How to predict proportionate amounts of 1/0 in logistic regression




While running logistic regression on some sample data:

The proportion of Churned records is too low, and because of this the proportion of records predicted as 1 is also very high in the predicted values.

Ideally, if the predicted probability is < 70%, I would like to label the record as not churned.
How do I go about achieving this?



This is a case of an imbalanced data set, where the ratio of positive to negative classes is highly skewed. Your question is really how to handle an unbalanced data set. When you model on the entire imbalanced data set, you will end up with a poor model. There are different methods for dealing with this imbalance. Let's look at them:

  • The first approach is to take all the positives and a sample of the same or a slightly larger number of negatives (for example, around a 60/40 ratio of negatives to positives). This method works well, but one concern is that no positives are left over for validating the quality of your model.
  • Another approach is bagging (bootstrap aggregating). In this method we do not take all the positives in a sample; we work with 90% of the positives and about the same number of negatives (we can also take slightly more negatives than positives, for a 40/60 ratio of positives to negatives). Another important point is that we do not work with a single sample: we draw n samples, fit n models, and then average the n model outputs. It is advisable to take 10-20 samples. With this method you can test each individual model on the remaining 10% of the positives together with the rest of the negatives.

The methods discussed above generally help in dealing with an unbalanced data set.
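
The two sampling schemes above can be sketched in plain Python roughly as follows. This is only an illustration under assumptions: `balanced_bag`, `bagged_predict`, and `fit_base_rate` are hypothetical helper names, labels are coded 1 for the positive class and 0 for the negative class, and `fit_base_rate` is a trivial stand-in for where a real logistic regression fit would go.

```python
import random

def balanced_bag(labels, rng, pos_fraction=0.9, neg_per_pos=1.5):
    """Draw one training bag: a share of the positives plus a sample of negatives.

    Returns (train_idx, test_idx). The held-out positives and the unused
    negatives form the test set for this bag.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = max(1, round(pos_fraction * len(pos)))
    n_neg = min(len(neg), round(neg_per_pos * n_pos))
    train = set(rng.sample(pos, n_pos)) | set(rng.sample(neg, n_neg))
    test = [i for i in range(len(labels)) if i not in train]
    return sorted(train), test

def bagged_predict(X, labels, fit, n_bags=10, seed=0, **bag_kwargs):
    """Fit one model per bag and average the predicted probabilities."""
    rng = random.Random(seed)
    totals = [0.0] * len(X)
    for _ in range(n_bags):
        train_idx, _ = balanced_bag(labels, rng, **bag_kwargs)
        # `fit` is assumed to return a function mapping one row to a probability.
        predict = fit([X[i] for i in train_idx], [labels[i] for i in train_idx])
        for i, row in enumerate(X):
            totals[i] += predict(row)
    return [t / n_bags for t in totals]

def fit_base_rate(rows, ys):
    """Placeholder model: ignores the features and predicts the training churn rate."""
    rate = sum(ys) / len(ys)
    return lambda row: rate

# Made-up data: 20 churned records out of 200.
labels = [1] * 20 + [0] * 180
X = [[float(i)] for i in range(len(labels))]

# One bag: 90% of the positives and 1.5 negatives per positive (a 40/60 split).
train_idx, test_idx = balanced_bag(labels, random.Random(42))

# Ten bags, ten models, averaged output.
avg = bagged_predict(X, labels, fit_base_rate, n_bags=10)
```

Calling `balanced_bag` once with `pos_fraction=1.0` corresponds to the first approach; the held-out set then contains no positives, which is the concern noted above.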

Hope this helps!



Hi Imran,

Can you please illustrate the above methods with code for better understanding?


Amol Gothe



You mentioned that you use logistic regression to build your model. Logistic regression is not sensitive to an unbalanced training set (the intercept will be influenced by the imbalance, but you can correct this).
Now, what could be happening? How many predictors do you have in your model? It could be that you do not have enough patterns. I do not mean observations here, but enough variables to allow accurate estimation of the betas of the logistic function.
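
One common way to correct the intercept, assuming the model was fit on a resampled training set and the true churn rate is known, is prior correction: subtract ln[((1 - τ)/τ) · (ȳ/(1 - ȳ))] from the fitted intercept, where τ is the population churn rate and ȳ is the churn rate in the training sample. The sketch below applies that shift to predicted probabilities and then applies the 70% cutoff asked about in the question. The function names and all numbers are made up for illustration.

```python
import math

def correct_probability(p_sample, sample_rate, population_rate):
    """Shift a probability from the training-sample prior to the population prior.

    Equivalent to adjusting the intercept on the log-odds scale.
    Assumes 0 < p_sample < 1 and both rates strictly between 0 and 1.
    """
    offset = math.log(((1 - population_rate) / population_rate)
                      * (sample_rate / (1 - sample_rate)))
    logit = math.log(p_sample / (1 - p_sample)) - offset
    return 1 / (1 + math.exp(-logit))

def label_churn(probabilities, threshold=0.7):
    """Label a record 1 (churned) only if its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical scores from a model trained on a 50/50 resampled set,
# with an assumed true churn rate of 10%.
raw = [0.30, 0.75, 0.99]
corrected = [correct_probability(p, sample_rate=0.5, population_rate=0.1)
             for p in raw]
labels = label_churn(corrected, threshold=0.7)
```

If the model was fit on the full data set without resampling, the sample rate and population rate coincide, the offset is zero, and only the cutoff step applies.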

Hope this helps