Categorical variables with a large number of levels

machine_learning
python

#1

Hi, I am working on a logistic regression based binary classification problem where I need to predict customer churn. Some categorical variables in the dataset have a large number of levels, such as area (75 levels), district (135 levels), and sub-area (180 levels). Creating dummy variables doesn’t seem sensible, as the number of columns would explode. Is there any way to handle such high-cardinality categorical variables? Also, keeping both ‘area’ and ‘sub-area’ seems redundant, since each sub-area belongs to an area. If so, does it make sense to remove the ‘area’ variable?
Thanks in advance


#2

Hi,

The best way to handle this situation is to check the distribution of each variable first and then decide on a cut-off for regrouping its levels.

For example, if your area variable has 75 levels, it is often the case that only about 15-20 of them account for about 80% of the rows. If so, it is better to keep only those 15-20 levels in that variable and tag the rest of the values as “Others”.
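The cut-off idea above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `frequent_levels` and `collapse`, the 0.80 coverage value, and the toy `area` data are all made up for the example, not taken from the original data.

```python
from collections import Counter

def frequent_levels(values, coverage=0.80):
    """Return the most frequent levels that together cover `coverage` of the rows."""
    counts = Counter(values)
    total = sum(counts.values())
    kept, running = set(), 0
    # Walk the levels from most to least frequent until the target share is reached.
    for level, n in counts.most_common():
        if running / total >= coverage:
            break
        kept.add(level)
        running += n
    return kept

def collapse(values, kept, other="Others"):
    """Replace every level that is not in `kept` with a single catch-all level."""
    return [v if v in kept else other for v in values]

# Hypothetical toy data: five areas with very uneven frequencies.
area = ["A"] * 50 + ["B"] * 30 + ["C"] * 10 + ["D"] * 6 + ["E"] * 4

kept = frequent_levels(area, coverage=0.80)
area_collapsed = collapse(area, kept)
print(sorted(kept), Counter(area_collapsed))
```

After this step, dummy variables are only needed for the kept levels plus “Others”, so the column count stays manageable.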

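One thing to keep in mind: make sure that in your scoring data the same 15-20 levels also account for roughly 80% of the rows. Apply the grouping learned from the training data to the scoring data, so that any level the model has not seen falls into “Others”; otherwise your model won’t be able to score the test data.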
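A minimal sketch of that check, assuming the set of kept levels was already derived from the training data. The names `apply_levels`, `kept_from_training`, and `scoring_area` are hypothetical, and the values are toy data.

```python
def apply_levels(values, kept, other="Others"):
    """Map scoring data onto the levels kept at training time; anything else becomes `other`."""
    return [v if v in kept else other for v in values]

# Assumed to come from the training data (hypothetical values).
kept_from_training = {"A", "B"}

# Hypothetical scoring data containing a level ("Z") never seen in training.
scoring_area = ["A", "Z", "B", "C", "A"]

scored = apply_levels(scoring_area, kept_from_training)

# Share of scoring rows that ended up in the catch-all level. If this is far
# from the share seen in training, the level distribution has probably shifted.
other_share = scored.count("Others") / len(scored)
print(scored, other_share)
```

Comparing `other_share` with the corresponding share in the training data is one simple way to notice that the scoring data no longer looks like the data the cut-off was chosen on.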

Also, if you include all three variables (area, district, and sub-area), they are going to be correlated, so it is better to keep just area as a variable. If you are interested in the effects of district and sub-area, you can break your predictions down by these two variables to see how they affect the outcome.
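One way to break predictions down by a variable that was left out of the model is to compare the mean predicted churn probability with the observed churn rate in each group. The sketch below assumes you already have per-customer predictions; the function name `summarize_by_group` and the toy values are hypothetical.

```python
from collections import defaultdict

def summarize_by_group(groups, predicted, actual):
    """Per group: row count, mean predicted probability, and observed churn rate."""
    acc = defaultdict(lambda: [0, 0.0, 0])  # [count, sum of predictions, sum of outcomes]
    for g, p, y in zip(groups, predicted, actual):
        acc[g][0] += 1
        acc[g][1] += p
        acc[g][2] += y
    return {
        g: {"n": n, "mean_predicted": sp / n, "observed_rate": sy / n}
        for g, (n, sp, sy) in acc.items()
    }

# Hypothetical toy data: district per customer, model output, and true churn flag.
district = ["D1", "D1", "D2", "D2"]
predicted = [0.2, 0.4, 0.7, 0.9]
actual = [0, 1, 1, 1]

summary = summarize_by_group(district, predicted, actual)
for name, stats in summary.items():
    print(name, stats)
```

A district whose observed rate sits well above or below its mean prediction suggests that district carries information the area variable alone does not capture.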

Hope this helps.

Peace!


#3

@sree1986,

Another approach would be to label encode the levels within the same column. For example:

Levels a1, a2, b1, b2 can be encoded as 0, 1, 2, 3 in the area column. Label encoding doesn’t require you to create n extra columns.
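As a rough illustration, label encoding amounts to a level-to-integer lookup. The helper `fit_label_encoder` is a hypothetical name, and assigning the codes in alphabetical order is an assumption of this sketch, not something the data dictates.

```python
def fit_label_encoder(values):
    """Assign one integer code per distinct level, in sorted (alphabetical) order."""
    return {level: code for code, level in enumerate(sorted(set(values)))}

# Toy column using the level names from the example above.
area = ["b1", "a2", "a1", "b2", "a1"]

mapping = fit_label_encoder(area)
encoded = [mapping[v] for v in area]
print(mapping)
print(encoded)
```

Keep in mind that a logistic regression reads these codes as numbers, so whatever order the codes are assigned in becomes part of the model.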

Also, I feel there may be an order in your levels. For example, in the area column, level 1 might have some relation to level 75 (less than, greater than, or something else). Label encoding captures that nicely, provided the integer codes are assigned in that order.

You could read the following answer for more info on Label Encoding and One Hot Encoding (creating dummies)