What to select: a model with statistically insignificant variables but better accuracy, or a model with statistically significant variables but slightly lower accuracy?



Hi folks,
I am new to machine learning and was trying to fit a logistic regression on the Boston data set available in the MASS library in R.

I tried forward selection: I started with one predictor and kept adding predictor variables one at a time.

I am using the first half of the data set for training and the second half as the test set.
I introduced a new response variable, crim_class, which is 1 if crim >= median(crim) and 0 if crim < median(crim).
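For reference, a minimal sketch of this setup (assuming the split is simply the first half of the rows; the test_error helper is not my original code, just an illustration of how each forward step can be scored against the test set):

library(MASS)   # provides the Boston data set

# response: 1 if crim is at or above its median, 0 otherwise
Boston$crim_class = as.integer(Boston$crim >= median(Boston$crim))

# first half of the rows for training, second half for testing
train = 1:(nrow(Boston) / 2)
Boston.test = Boston[-train, ]
y.test = Boston$crim_class[-train]

# hypothetical helper: test error of a candidate model, used to compare
# models while adding predictors one at a time
test_error = function(f) {
  fit = glm(f, data = Boston, family = binomial, subset = train)
  prob = predict(fit, Boston.test, type = 'response')
  mean(as.integer(prob > 0.5) != y.test)
}

test_error(crim_class ~ nox)           # one-predictor starting point
test_error(crim_class ~ nox + indus)   # candidate one-step extension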

–> With the following model I got the lowest prediction error on the test set, but some of the predictors are statistically insignificant, going by the significance codes on the p-values.

glm.fit = glm(crim_class~zn+indus+chas+nox+rm+age, data = Boston, family = binomial, subset = train)

summary(glm.fit)

Call:
glm(formula = crim_class ~ zn + indus + chas + nox + rm + age,
family = binomial, data = Boston, subset = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1280  -0.4903  -0.0315   0.4073   3.7807

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -30.344933   5.569194  -5.449 5.07e-08 ***
zn           -0.108906   0.066974  -1.626  0.10393
indus        -0.186455   0.061034  -3.055  0.00225 **
chas          1.563326   0.725541   2.155  0.03118 *
nox          53.472313  10.465525   5.109 3.23e-07 ***
rm            0.620567   0.306502   2.025  0.04290 *
age          -0.004282   0.011314  -0.378  0.70511

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 329.37  on 252  degrees of freedom
Residual deviance: 162.08  on 246  degrees of freedom
AIC: 176.08

Number of Fisher Scoring iterations: 8

glm.pred = rep(0, length(y.test))
glm.prob = predict(glm.fit, Boston.test, type = 'response')
glm.pred[glm.prob > 0.5] = 1
mean(glm.pred != y.test)
[1] 0.1106719
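To see where that 11% error comes from, the two error types can be split out with a simple confusion table (not part of my original output, just table() on the predictions):

# cross-tabulate predicted vs. actual class to see both kinds of mistakes
table(predicted = glm.pred, actual = y.test)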

–> With the following model the error increased a little, but all of the predictors are statistically significant.

glm.fit = glm(crim_class~.-crim-chas-tax, data = Boston, family = binomial, subset = train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)

Call:
glm(formula = crim_class ~ . - crim - chas - tax, family = binomial,
data = Boston, subset = train)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.07309  -0.06280   0.00000   0.04518   2.52250

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -95.004862  20.048738  -4.739 2.15e-06 ***
zn           -0.850179   0.184521  -4.607 4.08e-06 ***
indus         0.413284   0.157156   2.630 0.008544 **
nox          89.142048  19.552114   4.559 5.13e-06 ***
rm           -4.631311   1.678561  -2.759 0.005796 **
age           0.050660   0.023105   2.193 0.028337 *
dis           4.513311   0.954809   4.727 2.28e-06 ***
rad           2.968052   0.677653   4.380 1.19e-05 ***
ptratio       1.483118   0.369531   4.014 5.98e-05 ***
black        -0.016615   0.006504  -2.554 0.010636 *
lstat         0.209340   0.086666   2.415 0.015714 *
medv          0.631766   0.183235   3.448 0.000565 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 329.367  on 252  degrees of freedom
Residual deviance: 70.529  on 241  degrees of freedom
AIC: 94.529

Number of Fisher Scoring iterations: 10

glm.pred = rep(0, length(y.test))
glm.prob = predict(glm.fit, Boston.test, type = 'response')
glm.pred[glm.prob > 0.5] = 1
mean(glm.pred != y.test)
[1] 0.1857708

My question is: in this scenario, which model should I choose?
–> The model with higher accuracy but some statistically insignificant predictors?
or
–> The model with slightly lower accuracy but only statistically significant predictors?