Keeping insignificant variables in a model

boosting
predictive_model
random_forest
decision_trees
logistic_regression

#1

Hi All,

Let’s say we have one target variable and 10 predictors. While building the model, however, we find that only 2 of the variables are significant. In such cases:

  1. Is it feasible to build the model using only those two variables?
  2. Will it affect the result if we include all the variables? (Note: no multicollinearity exists.)
  3. Does it depend on the number of variables?

The reason I ask is that in most of the articles I have seen, the model is built using all the available variables and only a few of them turn out to be significant, yet all the variables are still used for prediction, whichever model it may be (logistic regression, decision tree, random forest or boosting).

Kindly correct my understanding.


#2

Hi @Surya1987,

I think this depends on the model you are using. If you’re using a regression model like linear or logistic regression, then keeping insignificant variables can hurt the outcome, because each one adds a coefficient that is fit mostly to noise.

But if you’re using tree-based models like random forest, GBM, XGBoost, etc., you can tune the parameters so that the model doesn’t overfit the data. This way, the model takes only the useful signal from the low-importance variables. If a variable has 0 importance, you can remove it for sure. Otherwise, I would prefer tuning the model to reduce overfitting instead of manually checking which variables to remove.
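To make the "0 importance" check concrete, here is a rough sketch using scikit-learn's `RandomForestClassifier`. Everything about the data is made up for illustration: 10 predictors, of which only the first two drive the target, and the last one is a constant column (which a tree can never split on). The parameter values (`max_depth`, `min_samples_leaf`, etc.) are just placeholder choices for limiting overfitting, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption, for illustration only):
# columns 0 and 1 are informative, columns 2-8 are noise, column 9 is constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 9] = 1.0
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Limit tree depth and leaf size so the forest is less prone to fitting noise.
forest = RandomForestClassifier(
    n_estimators=200, max_depth=5, min_samples_leaf=5, random_state=0
)
forest.fit(X, y)

importances = forest.feature_importances_
for idx, score in enumerate(importances):
    print(f"feature {idx}: importance {score:.3f}")

# Drop only the features the forest never used (importance exactly 0).
keep = np.flatnonzero(importances > 0)
X_reduced = X[:, keep]

# Indices of the two most important features, for a quick sanity check.
top_two = sorted(int(i) for i in np.argsort(importances)[-2:])
```

The noise columns typically end up with small but non-zero importances, so they stay in; only the constant column is dropped. That matches the idea above: remove what is provably unused, and rely on tuning for the rest.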

However, keeping every variable can be a problem in production. If you want the algorithm to use fewer features and run faster, you can use L1 regularization. Check this article for the basics of regularization: http://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/#four. Similar concepts apply to other kinds of models as well.
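To show why L1 regularization leaves you with fewer features, here is a small pure-Python sketch of lasso fitted by coordinate descent. This is a toy illustration, not the code from the linked article and not how any particular library implements it; the data, the function names, and `alpha=0.5` are all assumptions made up for the example. In practice you would likely use a library implementation such as scikit-learn's `Lasso`.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1, one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iters):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            # Correlation of feature j with the residual that ignores j's own contribution.
            rho = sum(v * (r + w[j] * v) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * v for v, r in zip(col, residual)]
                w[j] = new_w
    return w


# Synthetic data (assumption): only the first two of five predictors matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.5) for row in X]

w = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, coef in enumerate(w) if coef != 0.0]
print("coefficients:", [round(coef, 3) for coef in w])
print("selected features:", selected)
```

The soft-thresholding step is what sets weak coefficients to exactly zero, so the fitted model only needs the remaining features at prediction time. Note that the surviving coefficients are also shrunk toward zero, which is the price paid for the sparsity.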

Hope this helps.

Thanks,
Aarshay