Hi Folks,

I am new to machine learning and was trying to fit a logistic regression on the Boston data set available in the MASS library of R.

I tried forward selection: I started with one predictor and kept adding predictor variables one at a time.

I am using the first half of the data set for training and the second half as the test set.

I have introduced a new variable, "crim_class", which is the response variable: it is 1 if crim >= median(crim) and 0 if crim < median(crim).
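For reference, the setup above looks roughly like this (a sketch; I am assuming the standard 506-row Boston data, and the names train, Boston.test, and y.test match the code further down):

```
library(MASS)                              # provides the Boston data set
Boston$crim_class = as.numeric(Boston$crim >= median(Boston$crim))

n = nrow(Boston)                           # 506 observations
train = 1:(n / 2)                          # first half for training
Boston.test = Boston[-train, ]             # second half as test set
y.test = Boston.test$crim_class

# One step of forward selection, picking the predictor that gives the
# lowest test error when added to the current set:
candidates = setdiff(names(Boston), c("crim", "crim_class"))
selected   = c()                           # predictors chosen so far
step_error = sapply(candidates, function(v) {
  f   = as.formula(paste("crim_class ~", paste(c(selected, v), collapse = " + ")))
  fit = glm(f, data = Boston, family = binomial, subset = train)
  mean((predict(fit, Boston.test, type = "response") > 0.5) != y.test)
})
selected = c(selected, names(which.min(step_error)))
```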

–> With the following model I got the least prediction error on the test set, but some of the predictors are statistically insignificant according to the significance codes for the p-values.

```
glm.fit = glm(crim_class ~ zn + indus + chas + nox + rm + age,
              data = Boston, family = binomial, subset = train)
summary(glm.fit)
```

```
Call:
glm(formula = crim_class ~ zn + indus + chas + nox + rm + age,
    family = binomial, data = Boston, subset = train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1280  -0.4903  -0.0315   0.4073   3.7807

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -30.344933   5.569194  -5.449 5.07e-08 ***
zn           -0.108906   0.066974  -1.626  0.10393
indus        -0.186455   0.061034  -3.055  0.00225 **
chas          1.563326   0.725541   2.155  0.03118 *
nox          53.472313  10.465525   5.109 3.23e-07 ***
rm            0.620567   0.306502   2.025  0.04290 *
age          -0.004282   0.011314  -0.378  0.70511
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 329.37  on 252  degrees of freedom
Residual deviance: 162.08  on 246  degrees of freedom
AIC: 176.08

Number of Fisher Scoring iterations: 8
```

```
glm.pred = rep(0, length(y.test))
glm.prob = predict(glm.fit, Boston.test, type = "response")
glm.pred[glm.prob > 0.5] = 1
mean(glm.pred != y.test)
[1] 0.1106719
```

–> With the following model the test error increased a little, but all the predictors are statistically significant.

```
glm.fit = glm(crim_class ~ . - crim - chas - tax, data = Boston,
              family = binomial, subset = train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.fit)
```

```
Call:
glm(formula = crim_class ~ . - crim - chas - tax, family = binomial,
    data = Boston, subset = train)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.07309  -0.06280   0.00000   0.04518   2.52250

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -95.004862  20.048738  -4.739 2.15e-06 ***
zn           -0.850179   0.184521  -4.607 4.08e-06 ***
indus         0.413284   0.157156   2.630 0.008544 **
nox          89.142048  19.552114   4.559 5.13e-06 ***
rm           -4.631311   1.678561  -2.759 0.005796 **
age           0.050660   0.023105   2.193 0.028337 *
dis           4.513311   0.954809   4.727 2.28e-06 ***
rad           2.968052   0.677653   4.380 1.19e-05 ***
ptratio       1.483118   0.369531   4.014 5.98e-05 ***
black        -0.016615   0.006504  -2.554 0.010636 *
lstat         0.209340   0.086666   2.415 0.015714 *
medv          0.631766   0.183235   3.448 0.000565 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 329.367  on 252  degrees of freedom
Residual deviance:  70.529  on 241  degrees of freedom
AIC: 94.529

Number of Fisher Scoring iterations: 10
```

```
glm.pred = rep(0, length(y.test))
glm.prob = predict(glm.fit, Boston.test, type = "response")
glm.pred[glm.prob > 0.5] = 1
mean(glm.pred != y.test)
[1] 0.1857708
```

My question is: in this scenario, which model should I choose?

–> The model with higher accuracy but some statistically insignificant predictors?

or

–> The model with lower accuracy but all statistically significant predictors?