Variable selection

machine_learning

#1

Hello,
Recently I found this study: https://rstudio-pubs-static.s3.amazonaws.com/203258_d20c1a34bc094151a0a1e4f4180c5f6f.html#feature-engineering

In the Feature Engineering section the author writes: “drop irrelevant features (features that do not appear at the issued time)”. I don’t understand why. What’s wrong with installment or loan amount? I know that with these variables I get a good model fit, which suggests that they are associated with loan_status. Thanks for your help.


#2

Hi @anna19,

We generally drop ID variables, variables with a large number of missing values, and variables that are highly correlated with other independent variables. So please check whether these variables are highly correlated with the other independent variables. That may be the reason they were dropped. If not, you can keep them for building the model.
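
A minimal sketch of that correlation check in plain Python, in case it helps. The values below are made up for illustration (they are not taken from the loan data), the 0.8 cutoff is an arbitrary assumption, and `pearson` and `correlated_pairs` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Constant columns (zero variance) are not handled in this sketch.
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Made-up toy values, only to show the mechanics.
features = {
    "loan_amnt": [5000, 10000, 15000, 20000, 25000, 30000],
    "installment": [160, 330, 480, 650, 800, 990],
    "int_rate": [13.5, 7.9, 11.2, 15.1, 9.4, 12.0],
}

for a, b, r in correlated_pairs(features):
    print(f"{a} vs {b}: r = {r:.2f}")
```

With these toy numbers only the loan_amnt and installment pair crosses the cutoff. On the real data you would run the same check over all numeric columns.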


#3

Hi @PulkitS

Thank you for your answer. I checked and found a high correlation for these variables.


#4

Hi @anna19,

That is the reason for dropping the variable. We don’t want our independent variables to be correlated with each other, so if two independent variables are correlated, we drop the one that has the lower correlation with the target variable.
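
As a rough sketch of that rule in plain Python: given two correlated features, compare each one's absolute correlation with the target and drop the weaker one. The numbers are made up for illustration, loan_status is assumed to be encoded as 0/1, and `pearson` and `feature_to_drop` are hypothetical helper names.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def feature_to_drop(feat_a, feat_b, features, target):
    """Of two correlated features, name the one less correlated with the target."""
    r_a = abs(pearson(features[feat_a], target))
    r_b = abs(pearson(features[feat_b], target))
    return feat_a if r_a < r_b else feat_b


# Made-up toy values, only to show the mechanics.
features = {
    "loan_amnt": [5000, 10000, 15000, 20000, 25000, 30000],
    "installment": [160, 330, 480, 650, 800, 990],
}
loan_status = [0, 0, 1, 0, 1, 1]  # assumed encoding: 1 = bad loan, 0 = good loan

dropped = feature_to_drop("loan_amnt", "installment", features, loan_status)
print("drop:", dropped)
```

When the two correlations with the target are nearly equal, as they are in this toy example, the choice between the two features matters very little.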

Hope it is clear now!


#5

Yes, it is clear to me now :slight_smile: Thanks!