Label Encoding vs One Hot Encoding in Machine Learning Model



I am working on a data set comprising multiple variables, including 10 categorical (2-level) variables and 5 categorical (3-level) variables. I read about how to handle them for machine learning modelling.

I came to know about Label Encoding and One Hot Encoding.
I learnt that Label Encoding is best used when we have categorical variables with 2 levels (e.g. Male/Female, Yes/No), and that One Hot Encoding creates an altogether separate column for each level of a category. Have I understood it right?

My Question:
Q. Why is Label Encoding or One Hot Encoding required? Can't the algorithm identify Male/Female, or any other binary-level categorical value, as separate values?
Q. One Hot Encoding leads to redundancy of variables. I'm sure that adds noise too. How do I deal with the redundancy created by One Hot Encoding?




Q1: Certain algorithms like XGBoost can only take numerical values as their predictor variables, so Label Encoding or One Hot Encoding becomes necessary. You could of course use Naive Bayes or other algorithms that support categorical predictors, but they are often not as accurate as XGBoost.
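To make Q1 concrete, here is a minimal sketch using pandas (the thread doesn't name a library, and the `gender` column is a made-up example of a 2-level variable) showing a string column turned into integer codes before it could be fed to a model:

```python
import pandas as pd

# Hypothetical two-level column like those in the question
df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# Label encode: pandas assigns integer codes over the sorted categories,
# so "Female" -> 0 and "Male" -> 1
df["gender_code"] = df["gender"].astype("category").cat.codes

print(df["gender_code"].tolist())  # [1, 0, 0, 1]
```

`sklearn.preprocessing.LabelEncoder` does the same job and is handy when you need to apply the identical mapping to a test set.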
Q2: I wouldn't say it adds redundancy. Consider this example: suppose you have a country attribute with the following possible values: India, UK, China, US. Here we have 4 possible values, so we create 4 new attributes. By doing this we can check the impact of nationality on the target variable. Hope this helps.
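The country example above can be sketched with `pd.get_dummies` (a common way to one-hot encode in pandas; the column name is just from the example):

```python
import pandas as pd

df = pd.DataFrame({"country": ["India", "UK", "China", "US"]})

# One Hot Encoding: one new indicator column per level
dummies = pd.get_dummies(df["country"], prefix="country")

print(sorted(dummies.columns))
# ['country_China', 'country_India', 'country_UK', 'country_US']
```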



Agree with @Akash_Haldankar on #1.

For Q2, I think it adds redundancy, which can be dealt with. For example, if there are 3 levels (High, Medium and Low), we can create only 2 variables:

  1. High - 1 if high 0 otherwise
  2. Medium - 1 if medium 0 otherwise

A third variable for Low is not required, because a 0 in both High and Medium already indicates Low. Making a separate Low variable would be redundant.

I'm not convinced that it would add noise. It's the same data, just represented differently. Not 100% sure though.



@supra_minion, I guess others have answered your question well. Here's another take on Label Encoding vs One Hot Encoding (when to use which).

Label Encoding gives numerical aliases (I guess we can call them that) to the different classes. If I have 'eggs', 'butter' and 'milk' in my column, it will encode them as 0, 1 and 2. The problem with this approach is that there is no ordinal relation between these three classes, yet our algorithm might treat them as ordered, i.e. 0 < 1 < 2, implying 'eggs' < 'butter' < 'milk'. This doesn't make sense, right?

So, in this case, I'd rather go for One Hot Encoding. I will get three columns, and the presence of a class will be represented in a binary-like format. Here the three classes are separated out into three different columns (features), so the algorithm only cares about their presence or absence without making any assumptions about their relationship, which is what I wanted in the first place.
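The contrast described above, using the same eggs/butter/milk example, might look like this in pandas (a sketch; note the integer codes follow alphabetical order, so butter=0, eggs=1, milk=2):

```python
import pandas as pd

s = pd.Series(["eggs", "butter", "milk"])

# Label Encoding: integer codes over the sorted classes, which
# implies an order ('butter' < 'eggs' < 'milk') that isn't real
codes = s.astype("category").cat.codes
print(codes.tolist())  # [1, 0, 2]

# One Hot Encoding: one indicator column per class, no implied order
onehot = pd.get_dummies(s)
print(list(onehot.columns))  # ['butter', 'eggs', 'milk']
```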



You laid the dog down bro… thanks a lot
