Hello

I am working on a data set comprising of multiple variables including 10 categorical (2 level) variables and 5 categorical (3 level) variables. I read about dealing them for machine learning modeling.

I came to know about Label Encoding and One Hot Encoding.

I learnt that Label Encoding is best used we have categorical variables with 2 levels (i.e. Male /Female, Yes/No). And, one hot encoding, creates all together separate column for different levels of a category. Have I understood it right ?

My Question:

Q. Why is Label Encoding or One Hot Encoding required? Can’t the algorithm identify Male/Female, or any other binary level categorical value as separate values ?

Q. One Hot Encoding leads to redundancy of variables. I’m sure that adds noise too. How to deal with redundancy created by one hot encoding?

Thanks.