Query About Cutoff Probability in Logistic Regression Techniques in R

logistic_regression

#1

Hi Team,

While working with multiple datasets for logistic regression, I keep running into the following problem.

When I build a gain chart for the test/training set, I can see that the model only predicts true positives when the cutoff probability is kept as low as 15% (i.e. >= 15% is 1, else 0). If I choose a higher cutoff such as 45%, the model predicts the true negatives but not the true positives.
E.g. when I draw the gain chart with the probabilities sorted in descending order, the top 48% of my data has 70% predicted as 1. However, the quantiles of the probability show that the top 48% of probabilities fall in the range 0.15 to 0.99 (the top 10% is in the range 0.37 to 0.99), so I decided to choose the lower cutoff of 0.15. This gives me a confusion matrix in which 63% of the actual events and 51% of the actual non-events are correctly predicted. However, if I increase the cutoff probability, this is what I get:

Cut off probability is 0.5


            Reference
Prediction  0        1
  0        1233    307
  1         0       0

This means that at higher probabilities my model is only good for true negatives and not for true positives. That does not meet my expectation: the model was built for the positives, so at higher probabilities it should give more true positives, which is not the case here. Kindly note that the Kappa value is 0.17 (< 0.6), which again indicates that the model is not that good, although I am able to keep all significant variables in the model with the correct and expected correlation coefficients. Could you please suggest the potential reason for this and what can be done to fix it? The source dataset has an 80:20 ratio of non-events (0) to events (1).


#2

Could anyone please help me with this?


#3

@arnabitsme,

As I can see from the description above, at a 0.5 probability cutoff your model predicts everything as 0. This is most likely because of the class imbalance in your dataset, by which I mean that only 20% of it is in the positive class. Generally, you can change the probability cutoff to get more predictions in your positive class, depending on the statistic you are trying to optimize. Now a mythbuster for you:

  • The probability cutoff has nothing to do with model performance; it's just a way to segment the positive and negative classes. All the above interpretations that a higher probability should give more true positives are wrong!
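
To make the point above concrete, here is a minimal Python sketch with made-up numbers (the labels and probabilities are invented for illustration and are not taken from the dataset in this thread). The hypothetical model ranks both events above every non-event, yet no predicted probability reaches 0.5, so a 0.5 cutoff labels everything as 0. Only the cutoff changes between the printed rows, not the model.

```python
# Toy data, invented for illustration: 10 cases, 2 events (an 80:20 split).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predicted probabilities; note that none of them reaches 0.5.
p_hat = [0.05, 0.08, 0.10, 0.12, 0.14, 0.18, 0.20, 0.22, 0.30, 0.42]


def confusion(y_true, p_hat, cutoff):
    """Return (tn, fp, fn, tp) when cases with p >= cutoff are labelled 1."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= cutoff else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1 and pred == 0:
            fn += 1
        elif y == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Same scores, three different cutoffs: only the segmentation changes.
for cutoff in (0.15, 0.25, 0.50):
    tn, fp, fn, tp = confusion(y_true, p_hat, cutoff)
    print(f"cutoff={cutoff:.2f}  TN={tn} FP={fp} FN={fn} TP={tp}")
```

With an event rate of 20%, predicted probabilities tend to sit well below 0.5, so a lower cutoff is usually the sensible starting point.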

Typically you measure your model's performance on an unbalanced class with the F-score, which is the harmonic mean of precision and recall. The formula is at this link: https://en.wikipedia.org/wiki/F1_score
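
As a quick illustration, the Python sketch below computes accuracy and F1 from the confusion matrix posted in #1 at the 0.5 cutoff (TN = 1233, FN = 307, FP = 0, TP = 0). The function name is my own, and returning 0.0 when a ratio is undefined (no predicted positives) is a convention I am assuming here, not the only possible choice.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1; undefined ratios are reported as 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read off the confusion matrix in post #1 (cutoff 0.5).
tn, fn, fp, tp = 1233, 307, 0, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy looks respectable simply because 80% of the cases are non-events, while F1 shows that no event is being caught.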

Also, sometimes you can build cost-based models in which you specify the cost of misclassifying the positive class relative to the negative class, so that the model becomes biased towards the under-represented class.
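
One simple way to act on misclassification costs without refitting anything is to move the cutoff. Note that this Python sketch shows cost-based thresholding, not a cost-sensitive model fit. The costs (1 for a false alarm, 5 for a missed event) and the toy data are assumptions made up for illustration, and the cutoff formula assumes the predicted probabilities are reasonably calibrated and that correct predictions cost nothing.

```python
def cost_based_cutoff(cost_fp, cost_fn):
    """Cutoff that minimises expected cost for calibrated probabilities.

    Predicting 1 is cheaper than predicting 0 when
    (1 - p) * cost_fp <= p * cost_fn, i.e. p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def total_cost(y_true, p_hat, cutoff, cost_fp, cost_fn):
    """Sum the misclassification costs when cases with p >= cutoff are labelled 1."""
    cost = 0.0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 0:
            cost += cost_fp  # false alarm
        elif pred == 0 and y == 1:
            cost += cost_fn  # missed event
    return cost


# Toy data and costs, invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_hat = [0.05, 0.08, 0.10, 0.12, 0.14, 0.18, 0.20, 0.22, 0.30, 0.42]
COST_FP, COST_FN = 1.0, 5.0

cutoff = cost_based_cutoff(COST_FP, COST_FN)
print("cost-based cutoff:", round(cutoff, 3))
print("cost at that cutoff:", total_cost(y_true, p_hat, cutoff, COST_FP, COST_FN))
print("cost at 0.5:", total_cost(y_true, p_hat, 0.5, COST_FP, COST_FN))
```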

Hope this helps clarify some of your confusion.

Regards,
Aayush Agrawal


#4

@aayushmnit

Many thanks for your help on this. Just one question regarding the probability cutoff: is the cutoff always kept constant when the model is deployed on real-time data? E.g. let's consider a model that has been created to detect fraudulent transactions. If I use the same model for every set of new transactions, should we have a fixed cutoff probability (p) such that > p signifies fraud and genuine otherwise? If not, how is a cutoff probability used in real-time industry settings?

Regards
Arnab Roy


#5

For the same model, the cutoff probability should stay the same. If you refresh the model with a new dataset, you can optimize the cutoff again.
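
A rough Python sketch of that workflow, assuming the cutoff is chosen to maximise F1 on a held-out validation set (that criterion, the function names, and the toy numbers are all illustrative assumptions): pick the cutoff once, freeze it, and apply the same value to every new batch.

```python
def f1_at(y_true, p_hat, cutoff):
    """F1 score when cases with p >= cutoff are labelled 1 (0.0 if there are no true positives)."""
    tp = fp = fn = 0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def pick_cutoff(y_val, p_val, candidates):
    """Return the candidate cutoff with the highest validation F1 (first one on ties)."""
    return max(candidates, key=lambda c: f1_at(y_val, p_val, c))


# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.08, 0.10, 0.12, 0.14, 0.18, 0.20, 0.22, 0.30, 0.42]

# Chosen once, at model-building time, then kept fixed in deployment.
CUTOFF = pick_cutoff(y_val, p_val, [i / 100 for i in range(5, 100, 5)])


def classify(p_new):
    """Label a new batch of predicted probabilities with the frozen cutoff."""
    return [1 if p >= CUTOFF else 0 for p in p_new]


print("frozen cutoff:", CUTOFF)
print(classify([0.10, 0.26, 0.90]))
```

When the model is refit on new data, rerun `pick_cutoff` on a fresh validation set and freeze the new value.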