While working with multiple datasets for logistic regression, I keep running into the following problem.
When I build a gain chart for the test/training set, I can see that the model only predicts true positives when the cut-off probability is kept as low as 15% (i.e. >= 15% is 1, else 0). If I choose a higher cut-off such as 45%, the model predicts the true negatives but not the true positives.
For example, when I draw the gain chart with the probabilities sorted in descending order, the top 48% of my data has 70% predicted as 1. However, the quantiles of the probability show that the top 48% of probabilities fall in the range 0.15 to 0.99 (the top 10% is in the range 0.37 to 0.99), so I chose the lower cut-off of 0.15. That gives me a confusion matrix in which 63% of the actual events and 51% of the actual non-events are correctly predicted. If I increase the cut-off probability, however, this is what I get:
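The trade-off between the two cut-offs can be reproduced with a minimal sketch. It uses synthetic scores and labels (not the data from this question) with roughly 20% events, and the helper name `rates_at_cutoff` is hypothetical. It only illustrates how sensitivity and specificity move in opposite directions as the cut-off rises.

```python
import random

def rates_at_cutoff(probs, labels, cutoff):
    """Return (sensitivity, specificity) when p >= cutoff is predicted as 1."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= cutoff and y == 1)
    fn = sum(1 for p, y in pairs if p < cutoff and y == 1)
    tn = sum(1 for p, y in pairs if p < cutoff and y == 0)
    fp = sum(1 for p, y in pairs if p >= cutoff and y == 0)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec

# Synthetic data (an assumption for illustration): about 20% events,
# with events scoring only somewhat higher than non-events on average.
random.seed(0)
labels = [1 if random.random() < 0.2 else 0 for _ in range(2000)]
probs = [
    random.betavariate(2, 5) if y == 1 else random.betavariate(1.2, 6)
    for y in labels
]

for cutoff in (0.15, 0.30, 0.45, 0.50):
    sens, spec = rates_at_cutoff(probs, labels, cutoff)
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Raising the cut-off can only move cases from predicted 1 to predicted 0, so sensitivity never increases and specificity never decreases along the sweep.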
Cut off probability is 0.5
          Reference
Prediction    0    1
         0 1233  307
         1    0    0
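As a check on the arithmetic, here is a minimal sketch that derives the usual summary measures from this matrix. It assumes the layout has predictions in rows and reference (actual) values in columns, and the variable names are illustrative.

```python
# Counts read from the 0.5 cut-off matrix, assuming rows are predictions
# and columns are the reference (actual) classes.
tn, fn = 1233, 307   # predicted 0: actual 0, actual 1
fp, tp = 0, 0        # predicted 1: actual 0, actual 1
n = tn + fn + fp + tp

sensitivity = tp / (tp + fn)   # share of actual events predicted as 1
specificity = tn / (tn + fp)   # share of actual non-events predicted as 0
accuracy = (tp + tn) / n
event_rate = (tp + fn) / n

# Cohen's kappa: observed agreement versus agreement expected by chance
# given the row and column totals.
expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n**2
kappa = (accuracy - expected) / (1 - expected)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
print(f"accuracy={accuracy:.3f} event_rate={event_rate:.3f} kappa={kappa:.3f}")
```

For this matrix the sketch gives a sensitivity of 0, a specificity of 1, and a kappa of 0, because every case is predicted as 0. The accuracy of about 0.80 simply mirrors the share of non-events.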
This means that at higher probabilities my model is only good at identifying true negatives, not true positives. That does not meet my expectation: the model was built to identify positives, so higher probabilities should yield more true positives, which is not the case here. Note that the Kappa value is 0.17 (< 0.6), which again indicates that the model is not very good, even though I was able to keep all the significant variables in the model with the correct and expected coefficients. Could you please suggest the potential reason for this and what can be done to fix it? The source dataset has an 80:20 ratio of non-events (0) to events (1).