Advantages of One-Hot Encoding for GBM or XGBoost

ensemble_methods
gbm
one-hot-encoding
xgboost

#1

Hi AVians,

Many people are confused by this topic (including me), and I would like to discuss it further.

The issue is that people say tree-based models can extract individual categories on their own, so there should be no need for one-hot encoding.

I personally prefer one-hot encoding because:

  1. If not separated, a tree will consider all the categories every time the variable is randomly selected for a split. So it might end up putting a higher emphasis on the most important categories and ignoring the rest.
  2. If we make separate variables, there will be splits where the most important categories are not selected, and the model will derive insights from the less important categories as well (see the sketch after this list).
  3. Since the whole idea behind boosting is combining weak learners, this should work better.
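
To make points 1 and 2 concrete, here is a minimal sketch (scikit-learn on synthetic data; the 10-category column and its names are made up for illustration). On a label-encoded column the tree can only split at thresholds over the arbitrary codes, while one-hot columns let a single split isolate one category:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
codes = rng.integers(0, 10, size=500)     # 10 categories as codes 0..9
y = (codes == 7).astype(int)              # category 7 drives the target

# Label-encoded: a single numeric column; splits are thresholds over codes.
tree_label = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_label.fit(codes.reshape(-1, 1), y)
print(export_text(tree_label, feature_names=["cat_code"]))
# needs two threshold splits (e.g. cat_code <= 6.50, then <= 7.50)
# just to carve out code 7

# One-hot: one binary column per category; a split isolates one category.
X_ohe = pd.get_dummies(pd.Series(codes), prefix="cat")
tree_ohe = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_ohe.fit(X_ohe, y)
print(export_text(tree_ohe, feature_names=list(X_ohe.columns)))
# isolates the category in a single split: cat_7 <= 0.50
```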

Also, I have found one-hot encoding to generate better results. But I am not 100% sure whether this works in all cases.

Please share your thoughts and experience.

@SRK, @kunal, @Nalin, @aayushmnit, @binga, @vikash - pls comment…

Thanks,
Aarshay


#2

As far as XGBoost is concerned, one-hot encoding becomes necessary because XGBoost accepts only numeric features. If you have a categorical variable and you use numeric placeholders for the categories (e.g. {Male, Female, Unknown} as {1, 2, 3}), that isn't an accurate representation: it implies 3 > 2 > 1, while the original categories have no inherent order.
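
A minimal sketch of this difference (pandas + XGBoost; the gender column and "bought" target are hypothetical):

```python
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({"gender": ["Male", "Female", "Unknown", "Female"],
                   "bought": [1, 0, 1, 0]})

# Numeric placeholders impose an order (3 > 2 > 1) the categories don't have.
bad_codes = df["gender"].map({"Male": 1, "Female": 2, "Unknown": 3})

# One-hot encoding gives each category its own 0/1 column instead.
X = pd.get_dummies(df["gender"], prefix="gender")
model = xgb.XGBClassifier(n_estimators=10, max_depth=2)
model.fit(X, df["bought"])
```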

Random Forest, on the other hand, does accept categorical variables in some implementations (e.g. R's randomForest; scikit-learn still requires numeric input). I don't know all the reasons why OHE would help with random forests, but one I can think of is that it lets the forest apply its feature sub-sampling to the individual categories of this feature as well. Just as it takes only a subset of all features for each tree, it would take only a subset of the one-hot-encoded category columns too. This creates more randomness and can throw light on more trends rather than just the single optimal split.
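
Here is a minimal sketch of that idea (scikit-learn on synthetic data; the names are made up). With one-hot columns, max_features sub-samples over individual categories at every split:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
cat = rng.integers(0, 10, size=1000)               # 10 categories
y = (cat == 2).astype(int)                         # one dominant category
X = pd.get_dummies(pd.Series(cat), prefix="cat")   # 10 binary columns

# max_features="sqrt" means each split considers only ~3 of the 10 category
# columns, so trees are often forced to try splits on non-dominant categories.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=1)
rf.fit(X, y)

# The dominant category's column still matters most, but the others pick up
# small, nonzero importances because sub-sampling made trees explore them.
print(dict(zip(X.columns, rf.feature_importances_.round(3))))
```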


#3

Thanks for your insights… I get your points… 🙂


#4

"If not separated, a tree will consider all the categories every time the variable comes is randomly selected for a split."
I guess tree should not consider all the categories even with label encoding. for instance {male, female} will have 1,2 (2 branches) so either it should take 1 or 2. For one hot encoding, it will increase the number of levels.
Please correct me if I am wrong, I am still new to this field so may not interpreting correctly


#5

What I meant was that if a column has 10 categories and we treat them as a single numeric variable, the trees might split on the same value again and again. If the column is one-hot encoded, the most significant category might not get selected in some trees, and those trees can still form weak learners from the other categories. This can be useful in techniques like GBM and XGBoost.
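
As a minimal sketch of this (XGBoost on synthetic data, with made-up names): XGBoost's colsample_bytree makes the idea explicit, since each boosting round sees only a random subset of the one-hot columns, so some rounds never see the dominant category and must build a weak learner from the rest:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
cat = rng.integers(0, 10, size=1000)              # a column with 10 categories
y = (cat == 3).astype(int)                        # one "most significant" category
X = pd.get_dummies(pd.Series(cat), prefix="cat")  # 10 binary columns

# colsample_bytree=0.5: each boosting round samples half of the one-hot
# columns, so some rounds never see cat_3 and must build their (weak) tree
# from the remaining categories.
model = xgb.XGBClassifier(n_estimators=50, max_depth=2, colsample_bytree=0.5)
model.fit(X, y)
```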