Logistic Regression Model Validation in layman terms

regression
logistic

#1

Hi,

Can anyone please explain logistic regression validation methods, such as K-S curves and gain charts (or any other methods), in simple terms, with examples and SAS code?

Thanks in advance.


#2

Hello @Kalyan ,

There are several ways to assess model fit for logistic regression. I will try to explain some of them below.

1. Concordance:
Does the model predict a higher probability of churn/non-default when the actual data shows churn/non-default?
Here the dependent variable is churn or non-default, where churn = 1 means the customer has left.

proc logistic data = churn descending outest = model_churn;
model churn = purchase_hist number_of_complaints_last6months /* ... more predictors ... */ / lackfit ctable;
output out = pred predicted = p;
run;

SAS takes every possible pair of one event and one non-event observation and compares their predicted probabilities.
So if you are modelling churn = 1 (customer has left), the predicted probability for the churn = 1 record should be higher than for the churn = 0 record in the pair.

Churn   Predicted p
1       0.83
0       0.75    (concordant pair: the event scored higher)

1       0.64
0       0.87    (discordant pair: the non-event scored higher)

1       0.65
0       0.65    (tied pair)

The higher the concordance (i.e. the more concordant pairs; above 70% is a common benchmark), the better the model.
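The thread asks for SAS, where proc logistic reports these pair counts automatically in the "Association of Predicted Probabilities and Observed Responses" table. Purely to illustrate the pairing logic, here is a small language-agnostic Python sketch (the function name and inputs are my own):

```python
def concordance(actuals, probs):
    """Return (% concordant, % discordant, % tied) over all event/non-event pairs."""
    events = [p for a, p in zip(actuals, probs) if a == 1]
    non_events = [p for a, p in zip(actuals, probs) if a == 0]
    conc = disc = tied = 0
    for pe in events:          # every event is paired with every non-event
        for pn in non_events:
            if pe > pn:
                conc += 1      # event scored higher: concordant
            elif pe < pn:
                disc += 1      # non-event scored higher: discordant
            else:
                tied += 1
    total = conc + disc + tied
    return 100 * conc / total, 100 * disc / total, 100 * tied / total
```

For a model that separates the classes perfectly, this returns 100% concordant pairs.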

2. Hosmer-Lemeshow Test: (generated by the lackfit option)

This performs a chi-square test on the difference between observed and expected event counts across groups of predicted probability.
A higher p-value indicates a good fit (you fail to reject the hypothesis that the model fits).
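In SAS the lackfit option does all of this for you; as an illustration of what the test computes, here is a sketch of the statistic itself (compare the result against a chi-square distribution with g − 2 degrees of freedom to get the p-value; the function name and grouping details are my own simplification):

```python
def hosmer_lemeshow_stat(actuals, probs, g=10):
    """Hosmer-Lemeshow chi-square statistic: sort by predicted probability,
    split into g groups, and compare observed vs expected event counts."""
    pairs = sorted(zip(probs, actuals))       # ascending predicted probability
    size = len(pairs) // g
    stat = 0.0
    for i in range(g):
        # last group absorbs any leftover records
        chunk = pairs[i * size:(i + 1) * size] if i < g - 1 else pairs[(g - 1) * size:]
        n = len(chunk)
        obs = sum(y for _, y in chunk)        # observed events in the group
        exp = sum(p for p, _ in chunk)        # expected events = sum of probabilities
        stat += (obs - exp) ** 2 / exp + ((n - obs) - (n - exp)) ** 2 / (n - exp)
    return stat
```

A well-calibrated model (observed event rates matching predicted probabilities in every group) yields a statistic near zero, hence a high p-value.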

3. Lift Chart:

Step 1: Export the output dataset pred to a csv file.

proc export data = pred outfile = 'path\filename.csv'
dbms = csv replace;
run;

Step 2: Sort the data in descending order of predicted p.
Step 3: Divide the data into 10 deciles. So if there are 1000 records, there will be 100 records in each decile: the first 100 form decile 1, the next 100 decile 2, and so on.

The last highlighted column is the decile, and the column immediately to its left contains the predicted probabilities.
Step 4: Create a pivot table to capture the number of 1s and 0s in each decile.
Step 5: Create a table like the one shown in the image.


In this example there are 699 customers who have left and 300 who have not, so the baseline probability of churn is roughly 70%. Under random chance, as you progress from one decile to the next you would expect to capture 10%, 20%, ... 100% of all the churned customers, as shown in the column Cumulative Expected.
But decile 1 contains 85 churn cases instead of the expected 70, and decile 2 contains 81, as shown in the column #eventspredicted.
So in the first row of the Cumulative Predicted column I capture 85/699, or 12%, of the total churned customers. By the third decile I have captured (85 + 81 + 82)/699, or 35%. As you can see in the image, the cumulative expected value at the third decile was 30%, whereas the model's cumulative predicted value is 35%, which suggests the model does better than random chance at predicting churn.
The lift chart is generated from the last two columns of the table in Excel using the line chart functionality. The larger the gap between cumulative expected and cumulative predicted, the better the model.
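Steps 2 to 5 above (sort, split into deciles, compare captured vs expected events) can also be sketched outside Excel; this is an illustrative Python version with names of my own choosing, not part of the SAS workflow in the thread:

```python
def gains_table(actuals, probs, n_deciles=10):
    """Sort by predicted probability (highest first), split into deciles, and
    compare the cumulative % of events captured against the 10%, 20%, ...
    expected under random targeting."""
    pairs = sorted(zip(probs, actuals), reverse=True)   # best scores first
    total_events = sum(a for _, a in pairs)
    size = len(pairs) // n_deciles
    rows, cum_events = [], 0
    for d in range(n_deciles):
        chunk = pairs[d * size:(d + 1) * size] if d < n_deciles - 1 else pairs[(n_deciles - 1) * size:]
        cum_events += sum(a for _, a in chunk)
        rows.append({
            "decile": d + 1,
            "cum_expected_pct": round(100 * (d + 1) / n_deciles, 1),
            "cum_captured_pct": round(100 * cum_events / total_events, 1),
        })
    return rows
```

Plotting cum_expected_pct and cum_captured_pct against the decile number gives the lift (gains) chart described above.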


Hope this helps!!


#3

Sensitivity, a.k.a. True Positive Rate, is the proportion of the events (ones) that a model predicted correctly as events, for a given prediction probability cut-off.

Specificity, a.k.a. *1 - False Positive Rate*, is the proportion of the non-events (zeros) that a model predicted correctly as non-events, for a given prediction probability cut-off.

False Positive Rate is the proportion of non-events (zeros) that were predicted as events (ones)

False Negative Rate is the proportion of events (ones) that were predicted as non-events (zeros)

Mis-classification error is the proportion of observations (both events and non-events) that were not predicted correctly.
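All five rates above come from the same confusion matrix at a chosen cut-off. As an illustrative sketch (the function name and dictionary keys are my own):

```python
def classification_metrics(actuals, probs, cutoff=0.5):
    """Confusion-matrix rates at a given probability cut-off."""
    tp = sum(1 for a, p in zip(actuals, probs) if a == 1 and p >= cutoff)
    fn = sum(1 for a, p in zip(actuals, probs) if a == 1 and p < cutoff)
    tn = sum(1 for a, p in zip(actuals, probs) if a == 0 and p < cutoff)
    fp = sum(1 for a, p in zip(actuals, probs) if a == 0 and p >= cutoff)
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # 1 - false positive rate
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "misclassification": (fp + fn) / len(actuals),
    }
```

Note that every metric except misclassification is conditioned on the actual class, which is why sensitivity + false negative rate always equals 1 (and likewise specificity + false positive rate).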

Concordance is the percentage of all possible pairs of actual ones and zeros where the predicted score of the actual one is greater than the predicted score of the actual zero. It represents the predictive power of a binary classification model.

Weights of Evidence (WOE) provides a method of recoding a categorical x variable into a continuous variable. For each category of the variable, WOE = ln(percentage of all goods falling in that category / percentage of all bads falling in that category).

Information Value (IV) is a measure of the predictive capability of a categorical x variable to accurately predict the goods and bads. For each category of x, information value is computed as:

IV = (perc good of all goods − perc bad of all bads) * WOE, summed over the categories of x
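The WOE and IV formulas above can be sketched in a few lines; this is an illustration only, and it assumes the credit-scoring convention that outcome 0 is "good" and outcome 1 is "bad" (flip it if your convention differs), plus no empty cells in any category:

```python
import math

def woe_iv(categories, outcomes):
    """WOE per category and total IV for one categorical predictor.
    Assumes outcome 0 = good, 1 = bad, and every category contains
    at least one good and one bad (otherwise the log is undefined)."""
    total_good = sum(1 for o in outcomes if o == 0)
    total_bad = sum(1 for o in outcomes if o == 1)
    woe, iv = {}, 0.0
    for cat in set(categories):
        good = sum(1 for c, o in zip(categories, outcomes) if c == cat and o == 0)
        bad = sum(1 for c, o in zip(categories, outcomes) if c == cat and o == 1)
        pct_good, pct_bad = good / total_good, bad / total_bad
        w = math.log(pct_good / pct_bad)
        woe[cat] = w
        iv += (pct_good - pct_bad) * w     # per-category contribution, summed
    return woe, iv
```

Because the sign of (pct_good − pct_bad) always matches the sign of its WOE, every category contributes non-negatively, so IV is never negative.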

KS Statistic or Kolmogorov-Smirnov statistic is the maximum difference between the cumulative true positive rate and the cumulative false positive rate. It is often used as the deciding metric to judge the efficacy of models in credit scoring. The higher the KS statistic, the more efficient the model is at capturing the responders (ones). This should not be confused with R's ks.test function.
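The KS definition above translates directly into code: rank the population by predicted probability and track the running gap between the two cumulative rates. A minimal sketch (function name is my own):

```python
def ks_statistic(actuals, probs):
    """KS = max gap between cumulative TPR and cumulative FPR when the
    population is ranked by predicted probability, highest first."""
    pairs = sorted(zip(probs, actuals), reverse=True)
    total_events = sum(a for _, a in pairs)
    total_non_events = len(pairs) - total_events
    tpr = fpr = ks = 0.0
    for _, a in pairs:
        if a == 1:
            tpr += 1 / total_events        # one more event captured
        else:
            fpr += 1 / total_non_events    # one more non-event swept in
        ks = max(ks, tpr - fpr)
    return ks
```

A model that perfectly separates the classes gives KS = 1, while a model no better than random gives KS near 0.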

I have written a detailed explanation of all these concepts here.