Categorical variables and predictive modelling



I have a categorical variable x1 with four levels (a, b, c, and d). I run a few tests and find that one of those levels (say, d) does not contribute much towards predicting the target variable.

Now is it possible for me to include just the first three categories of x1 to predict my target variable?

I read that for linear regression, reference coding is used, and that removing one level from a predictor variable would change the coding of the other levels. Here’s a link to the article:
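For context, here is a small sketch of what reference (dummy) coding looks like, assuming a pandas workflow; the variable name x1 and its levels are taken from the question, and `drop_first=True` is one common way to get reference coding:

```python
import pandas as pd

# Hypothetical data for the four-level predictor x1 from the question.
x1 = pd.Series(["a", "b", "c", "d", "a", "c"], name="x1")

# Reference (dummy) coding: drop_first=True makes level "a" the reference,
# so each remaining indicator column is interpreted relative to "a".
coded = pd.get_dummies(x1, prefix="x1", drop_first=True)
print(coded.columns.tolist())  # ['x1_b', 'x1_c', 'x1_d']
```

Dropping a different level (rather than letting pandas drop the first) would change which level all the other coefficients are compared against, which is the coding change the article refers to.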

And is this the same for tree-based methods too?

Thank you for the answer!!


Statistically, it's not good practice. Can it be done? Yes, it can, but an alternative approach would be to collapse your levels.

Example: collapse D with, say, A, B, or C, and create a new combined level called E.

Example: if there are four categories like Black, Yellow, Green, and Blue, and you want to drop Blue, my suggestion is to collapse the categories, probably creating Yellow&Blue, Green, and Black as the three new categories.

This way you are not losing data or information, and you can use your domain knowledge to decide which levels to collapse.
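A minimal sketch of this collapsing step, assuming pandas and using the colour levels from the example above (the mapping itself is the modeller's choice):

```python
import pandas as pd

# Hypothetical colour predictor; instead of dropping "Blue", merge it with
# "Yellow" into one combined level, keeping every row.
colour = pd.Series(["Black", "Yellow", "Green", "Blue", "Blue"], name="colour")

collapsed = colour.replace({"Blue": "Yellow&Blue", "Yellow": "Yellow&Blue"})
print(sorted(collapsed.unique()))  # ['Black', 'Green', 'Yellow&Blue']
```

The result has three levels rather than four, so the model fits one fewer coefficient, but no observations are discarded.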

Secondly, the savings are probably only in degrees of freedom.