Categorical variables and predictive modelling

r
modelling
prediction
linearregression

#1

Hi,
Scenario:
I have a categorical variable x1 with four levels (a, b, c, and d). I ran a few tests and found that one of those levels (say, d) does not contribute much towards predicting the target variable.

Question:
Now is it possible for me to include just the first three categories of x1 to predict my target variable?

I read that for linear regression, reference coding is used, and removing one level from a predictor variable would change the coding of the other levels. Here’s a link to the article:

And is the same true for tree-based methods too?
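
To illustrate the reference coding behaviour I’m asking about, here is a minimal sketch in R with made-up data (not my actual dataset):

```r
# Made-up data: x1 has four levels (a, b, c, d)
set.seed(1)
dat <- data.frame(
  x1 = factor(sample(c("a", "b", "c", "d"), 100, replace = TRUE)),
  y  = rnorm(100)
)

# With the default treatment (reference) coding, lm() uses "a" as the
# reference level and estimates one coefficient each for b, c and d
fit_all <- lm(y ~ x1, data = dat)
coef(fit_all)   # (Intercept), x1b, x1c, x1d

# If I simply drop the rows with level "d", the remaining coefficients
# are re-estimated on less data, still relative to "a"
fit_no_d <- lm(y ~ x1, data = droplevels(subset(dat, x1 != "d")))
coef(fit_no_d)  # (Intercept), x1b, x1c
```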

Thank you for the answer!!


#2

Statistically, it’s not a good practice. Can it be done? Yes, it can, but an alternative approach would be to collapse your levels.

Example: collapse D with, say, A or B or C and create a new combined level, call it E.

Example: if there are four categories such as Black, Yellow, Green, and Blue and you want to drop Blue, my suggestion is to collapse the categories and create Yellow & Blue, Green, and Black as the new three categories.

This way you are not losing data or information, and you can use your functional knowledge to decide which levels to collapse.
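
As a rough sketch in base R (made-up data and level names, just to illustrate the idea):

```r
# Four colour categories; suppose Blue is the level you would have dropped
colour <- factor(c("Black", "Yellow", "Green", "Blue", "Blue", "Green"))
levels(colour)   # "Black" "Blue" "Green" "Yellow"

# Collapse Blue and Yellow into a single combined level
levels(colour)[levels(colour) %in% c("Blue", "Yellow")] <- "Yellow_Blue"

levels(colour)   # "Black" "Yellow_Blue" "Green"
table(colour)    # every original observation is still counted
```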

Secondly, the savings are probably in degrees of freedom only (one fewer dummy coefficient to estimate).
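
A quick sketch of that point (again with made-up data): the collapsed factor just needs one fewer coefficient.

```r
set.seed(1)
dat <- data.frame(
  colour = factor(sample(c("Black", "Yellow", "Green", "Blue"), 200, replace = TRUE)),
  y      = rnorm(200)
)

# Same variable, but with Blue and Yellow collapsed into one level
dat$colour3 <- dat$colour
levels(dat$colour3)[levels(dat$colour3) %in% c("Blue", "Yellow")] <- "Yellow_Blue"

length(coef(lm(y ~ colour,  data = dat)))   # 4: intercept + 3 dummies
length(coef(lm(y ~ colour3, data = dat)))   # 3: intercept + 2 dummies
```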