When building models from large datasets, feature engineering is a must. I understand this can mean creating new features or removing existing ones. I am focusing on removal and would like to know which criteria to consider when removing features. For example, I am currently looking at variables where more than 50-60% of the values are the same, as I think these might not contribute to the model. Are there any other things we need to consider when removing variables, and overall, is it good practice to remove features?
A feature is not necessarily unimportant just because 60% of its values are the same and 40% are different (I am assuming it has only two distinct values). Here are a few things you should consider:
Check the correlation of the feature with the target variable and with the other independent variables. If two features are highly correlated with each other, you can choose to drop one of them.
The number of missing values. If, say, 90% of the values are missing, you can consider dropping the feature.
Whether the feature is an identifier such as User_ID, which is unique for each observation. Such a feature will not play a major role in prediction.
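The checks in the list above can be sketched in pandas roughly as follows. This is only an illustration: the function name `removal_candidates`, the thresholds (0.9 for missing values, 0.95 for correlation), the toy column names, and the assumption of a numeric target are all mine, not something from the question. The "unique per row" test is a heuristic, so treat the output as a list of columns to inspect rather than to drop blindly.

```python
import numpy as np
import pandas as pd


def removal_candidates(df, target, missing_thresh=0.9, corr_thresh=0.95):
    """Return {column: reason} for features worth reviewing before dropping.

    Thresholds are illustrative defaults, not recommended values.
    """
    features = df.drop(columns=[target])
    reasons = {}

    # 1. Columns where most values are missing.
    missing_frac = features.isna().mean()
    for col in missing_frac[missing_frac >= missing_thresh].index:
        reasons[col] = "mostly missing"

    # 2. Identifier-like columns: not floating point, and a distinct value in every row.
    for col in features.columns:
        if col in reasons:
            continue
        s = features[col]
        if not pd.api.types.is_float_dtype(s) and s.nunique(dropna=False) == len(s):
            reasons[col] = "unique per row"

    # 3. Highly correlated numeric pairs: flag the one less correlated with the target.
    numeric = features.drop(columns=list(reasons)).select_dtypes(include="number")
    corr = numeric.corr().abs()
    target_corr = numeric.corrwith(df[target]).abs()
    cols = list(numeric.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in reasons or b in reasons:
                continue
            if corr.loc[a, b] >= corr_thresh:
                weaker, other = (a, b) if target_corr[a] < target_corr[b] else (b, a)
                reasons[weaker] = f"highly correlated with {other}"
    return reasons


# Toy data (made up for this example).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame({
    "user_id": np.arange(n),                              # identifier
    "x": x,                                               # informative feature
    "x_copy": 2 * x + rng.normal(scale=0.01, size=n),     # near-duplicate of x
    "mostly_nan": [1.0] * 10 + [np.nan] * (n - 10),       # 95% missing
    "noise": rng.normal(size=n),                          # unrelated feature
    "y": (x > 0).astype(int),                             # numeric target
})

candidates = removal_candidates(df, target="y")
for col, reason in candidates.items():
    print(f"{col}: {reason}")
```

On this toy data the identifier, the mostly-missing column, and one of the two near-duplicate columns are flagged, while the unrelated `noise` column is not. Keep in mind that Pearson correlation only captures linear relationships, so a low correlation with the target does not by itself prove a feature is useless.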
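We can also use PCA to reduce the number of columns by looking at the explained variance (for example, explained_variance_ratio_ in scikit-learn's PCA). Note that PCA replaces the original features with a smaller set of components rather than dropping individual columns.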
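To make the explained-variance idea concrete, here is a minimal NumPy sketch that computes the explained variance ratio from the singular values of the centred data and picks the smallest number of components reaching a chosen threshold. The function names, the 0.99 threshold, and the toy data are assumptions for illustration. If the features are on very different scales, they would normally be standardized first, which this sketch skips.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                      # centre each column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2 / (X.shape[0] - 1)              # variance along each component
    return var / var.sum()


def n_components_for(X, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cum = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cum, threshold)) + 1
    return min(k, len(cum))                      # guard against rounding at 1.0


# Toy data (made up): five columns built from only two underlying factors.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 500))
X = np.column_stack([z1, z2, z1 + z2, z1 - z2, 2 * z1])
X = X + rng.normal(scale=0.01, size=X.shape)     # small measurement noise

ratio = explained_variance_ratio(X)
k = n_components_for(X, threshold=0.99)
print(np.round(ratio, 4), k)
```

Here two components are enough to pass the 0.99 threshold, because the five columns are linear combinations of two factors. The price is interpretability: each component mixes all the original columns, so this is a different trade-off from removing individual features.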
Thanks for your answers. I will try a few more things and get back to you.