Handling Categorical Variables in Regression Models


#1

How should categorical variables be handled in the following two scenarios for multiple linear regression models?

  1. Suppose we have a categorical variable (X1) with 100 factor levels (e.g. the names of cities in a country). How do we convert it into dummy variables? I understand that we should not create 99 dummy variables (DX1, DX2, …, DX99) to represent these categorical values. How should we handle this situation?

  2. Suppose we have a categorical variable (X1) with 4 factor levels. What would be the R code to create the dummy variables for it?


#2

Hi Marin,

Here are my suggestions for your two questions:

  1. When a variable has as many categories as the cities you mention, club the levels into a smaller number of groups, such as regions or zones. Then create dummy variables for the groups and include those in the model.

  2. You can use if/else logic (for example ifelse()) to create dummy variables in R. There are many other ways to create them as well.
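
The two suggestions above could be sketched roughly as follows. This is only an illustration: the data frame, the city-to-region lookup, and the column names (region_of, D_North, and so on) are made up for the example, and model.matrix() from base R is shown as one of the "other ways" alongside ifelse().

```r
# Hypothetical data: the cities and the region lookup are illustrative only.
df <- data.frame(
  city = c("Pune", "Mumbai", "Delhi", "Chennai", "Kolkata", "Jaipur"),
  stringsAsFactors = FALSE
)

# Point 1: club many levels into a few groups with a named lookup vector.
region_of <- c(Pune = "West", Mumbai = "West", Delhi = "North",
               Jaipur = "North", Chennai = "South", Kolkata = "East")
df$region <- factor(unname(region_of[df$city]))

# Point 2, by hand: one ifelse() per level, leaving "East" out as the baseline.
df$D_North <- ifelse(df$region == "North", 1, 0)
df$D_South <- ifelse(df$region == "South", 1, 0)
df$D_West  <- ifelse(df$region == "West", 1, 0)

# Point 2, in one call: model.matrix() builds the same dummies;
# [, -1] drops the intercept column it adds.
dummies <- model.matrix(~ region, data = df)[, -1]
```

With 4 region levels this gives 3 dummy columns, since one level always serves as the reference.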

Thanks
Bimlesh Singh


#3

Hi Marin,

Since you are working in R, keep in mind which algorithm you will use. For example, if you use glm (you mentioned linear regression) and your variable is defined as a factor, glm will encode it for you: 99 dummy variables in case 1 and 3 in case 2, which you will notice in the summary of your model (lm does the same). If you do not use glm, lm or caret, it is different: you have to apply one-hot encoding yourself, for example with xgboost when it is not called through caret. My advice is to use caret. You do not get all the tuning options of using a package directly, but for fitting many models it is enough.
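
As a small sketch of the difference, assuming a made-up data frame with a 4-level factor x1 and a random response y (both hypothetical), lm expands the factor on its own, while an explicit one-hot matrix can be built with base R's model.matrix():

```r
# Hypothetical data: a 4-level factor x1 and a random numeric response y.
set.seed(1)
df <- data.frame(
  x1 = factor(rep(c("A", "B", "C", "D"), each = 5)),
  y  = rnorm(20)
)

# lm() expands the factor itself: an intercept plus 3 dummies (x1B, x1C, x1D),
# with level "A" as the reference.
fit <- lm(y ~ x1, data = df)
names(coef(fit))

# Explicit one-hot encoding for functions that need a numeric matrix.
# "- 1" removes the intercept so every level gets its own column.
X <- model.matrix(~ x1 - 1, data = df)
colnames(X)
```

A matrix like X is the kind of numeric input that would then be handed to a matrix-based learner such as xgboost.
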
Best regards
Alain