Interpreting the output of a logistic regression


#1

I am working on a POC in which I am using logistic regression to predict a response (YES/NO). There are a couple of categorical variables among the independent variables. When I look at the output for each categorical variable, a few of its levels are significant while others are insignificant. So my question is: how should I handle this in the model equation? Should we ignore the categories that are insignificant? Also, should I use some other ML technique that can handle such data more efficiently?

Thanks.

Regards,
Aparna


#2

Hi @waparna

The variable importance results show which particular factors of a variable have a significant impact on the outcome of the dependent variable. Now, it is up to you how you use that information.
I can only tell you what I would do. I would probably combine the factors that do not have a significant impact on the dependent variable, so that my model is not affected by the noise of the less significant factors. For example, if out of the factors ‘A’, ‘B’, ‘C’, ‘D’ only ‘A’ and ‘B’ are significant, I would combine ‘C’ and ‘D’ as ‘Others’, so that we are left with the factors ‘A’, ‘B’ and ‘Others’.
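The recoding described above can be sketched in a few lines. This is only an illustration in plain Python: the function name `collapse_levels`, the sample values, and the label 'Others' are assumptions made for the example, and the model would still need to be refit on the recoded variable.

```python
def collapse_levels(values, keep, other_label="Others"):
    """Replace every level that is not in `keep` with `other_label`."""
    keep = set(keep)
    return [v if v in keep else other_label for v in values]


# Hypothetical categorical column with levels A, B, C, D
factor = ["A", "C", "B", "D", "A", "C"]

# Keep the levels found significant (A and B here); merge the rest
collapsed = collapse_levels(factor, keep=["A", "B"])
print(collapsed)
```

Keeping the list of levels to retain as an argument makes it easy to try different groupings and compare the refitted models.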

I am sure others from the community would have some brilliant ideas on this.

Regards,
Shashwat


#3

Hi Aparna,

Take a look at the example below.

Interpreting the results of logistic regression model

summary(model)
Call:
glm(formula = Survived ~ ., family = binomial(link = "logit"),
    data = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.6064  -0.5954  -0.4254   0.6220   2.4165

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
Age         -0.037267   0.008195  -4.547 5.43e-06 ***
SibSp       -0.292920   0.114642  -2.555   0.0106 *
Parch       -0.116576   0.128127  -0.910   0.3629
Fare         0.001528   0.002353   0.649   0.5160
EmbarkedQ   -0.002656   0.400882  -0.007   0.9947
EmbarkedS   -0.318786   0.252960  -1.260   0.2076

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1065.39  on 799  degrees of freedom
Residual deviance:  709.39  on 791  degrees of freedom
AIC: 727.39

Number of Fisher Scoring iterations: 5

Now we can analyze the fitting and interpret what the model is telling us.

First of all, we can see that Parch, Fare and Embarked are not statistically significant at the 0.05 level (SibSp is significant, with p = 0.0106).
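To see where the z value and Pr(>|z|) columns come from, here is a minimal sketch that recomputes them from the estimates and standard errors in the output above. It is written in Python purely for illustration (the model in this thread was fit in R), and the 0.05 cutoff stored in `ALPHA` is an assumption, not something stated in the output.

```python
import math

# (estimate, standard error) pairs copied from the coefficient table above
coefficients = {
    "(Intercept)": (5.137627, 0.594998),
    "Pclass": (-1.087156, 0.151168),
    "Sexmale": (-2.756819, 0.212026),
    "Age": (-0.037267, 0.008195),
    "SibSp": (-0.292920, 0.114642),
    "Parch": (-0.116576, 0.128127),
    "Fare": (0.001528, 0.002353),
    "EmbarkedQ": (-0.002656, 0.400882),
    "EmbarkedS": (-0.318786, 0.252960),
}

ALPHA = 0.05  # assumed significance threshold


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided p-value."""
    z = estimate / std_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


results = {name: wald_test(est, se) for name, (est, se) in coefficients.items()}
not_significant = [name for name, (z, p) in results.items() if p >= ALPHA]

for name, (z, p) in results.items():
    print(f"{name:12s} z = {z:8.3f}   p = {p:.4g}")
print("Not significant at", ALPHA, ":", not_significant)
```

Note that these are tests of each dummy level against the reference level. Judging a whole categorical variable such as Embarked would need a test on the full variable, which this sketch does not attempt.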

As for the statistically significant variables, sex has the lowest p-value, suggesting a strong association between the sex of the passenger and the probability of having survived. The negative coefficient for this predictor suggests that, all other variables being equal, a male passenger is less likely to have survived.

Remember that in the logit model the response variable is log odds:
ln(odds) = ln(p/(1-p)) = a*x1 + b*x2 + … + z*xn.
Since male is a dummy variable, being male reduces the log odds by 2.76, while a unit increase in age reduces the log odds by 0.037.
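Log odds are hard to read directly, so it can help to convert them. The sketch below (Python, for illustration only) exponentiates the coefficients from the output above to get odds ratios, and turns the linear predictor into a probability with the logistic function. The two passengers are made up for the example, and the intercept from the output is added to the linear predictor.

```python
import math

# Coefficients copied from the summary(model) output above
coef = {
    "(Intercept)": 5.137627,
    "Pclass": -1.087156,
    "Sexmale": -2.756819,
    "Age": -0.037267,
    "SibSp": -0.292920,
    "Parch": -0.116576,
    "Fare": 0.001528,
    "EmbarkedQ": -0.002656,
    "EmbarkedS": -0.318786,
}

# exp(coefficient) = factor by which the odds are multiplied
# for a one-unit increase in that predictor, others held fixed
odds_ratio = {k: math.exp(b) for k, b in coef.items() if k != "(Intercept)"}


def predict_probability(passenger):
    """Apply the inverse logit to the linear predictor for one passenger."""
    log_odds = coef["(Intercept)"] + sum(coef[k] * v for k, v in passenger.items())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical passengers, identical except for sex
woman = {"Pclass": 1, "Sexmale": 0, "Age": 30, "SibSp": 0,
         "Parch": 0, "Fare": 50, "EmbarkedQ": 0, "EmbarkedS": 1}
man = dict(woman, Sexmale=1)

p_woman = predict_probability(woman)
p_man = predict_probability(man)

print(f"Odds ratio, male vs female: {odds_ratio['Sexmale']:.3f}")
print(f"Odds ratio, per year of age: {odds_ratio['Age']:.3f}")
print(f"P(survived), woman: {p_woman:.3f}   man: {p_man:.3f}")
```

An odds ratio below 1 means the predictor lowers the odds of survival, which matches the negative signs on Sexmale and Age.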

You can also try a decision tree and other machine learning techniques.

Hope this helps

Regards,
Bharath