Many people (including me) are confused about this topic, and I would like to discuss it further.
The usual claim is that tree-based models can separate out individual categories on their own, so there should be no need for one-hot encoding.
I personally prefer one-hot encoding because:
- If the categories are not separated, a tree will consider all of them every time the variable is randomly selected for a split. So it might end up putting a higher emphasis on the most important categories and ignoring the rest.
- If we make separate variables, there will be splits where the most important categories are not among the selected variables, so the model can derive insights from the less important categories as well.
- Since the whole idea behind boosting is to combine weak learners, this should work better.
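To make the reasoning in the points above concrete, here is a rough simulation sketch. It only models random column subsampling, not an actual boosting library, and it assumes the un-encoded variable is handled as a single column whose categories are all available whenever that column is sampled. The function name `candidate_stats`, the feature names, and the numbers (5 numeric features, 6 categories, 50% column sampling) are made up for illustration. It says nothing about accuracy; it only counts how often the less important categories are split candidates while the dominant one is not.

```python
import random


def candidate_stats(n_numeric, n_categories, colsample, one_hot,
                    trials=20000, seed=0):
    """Estimate, over many random column draws, how often
    (a) the dominant category is a split candidate, and
    (b) some minor category is a candidate while the dominant one is not.
    """
    rng = random.Random(seed)
    if one_hot:
        # One indicator column per category; "cat_0" stands for the dominant one.
        features = [f"cat_{i}" for i in range(n_categories)]
    else:
        # A single column carrying every category at once.
        features = ["cat"]
    features += [f"num_{i}" for i in range(n_numeric)]

    # Number of columns drawn for each split (hypothetical subsampling rule).
    n_sampled = max(1, int(colsample * len(features)))

    dominant_seen = 0
    minor_only = 0
    for _ in range(trials):
        sampled = set(rng.sample(features, n_sampled))
        if one_hot:
            has_dominant = "cat_0" in sampled
            has_minor = any(f"cat_{i}" in sampled
                            for i in range(1, n_categories))
        else:
            # All categories come together with the single column.
            has_dominant = has_minor = "cat" in sampled
        dominant_seen += has_dominant
        minor_only += has_minor and not has_dominant
    return dominant_seen / trials, minor_only / trials


if __name__ == "__main__":
    for one_hot in (False, True):
        dom, minor = candidate_stats(5, 6, 0.5, one_hot=one_hot)
        label = "one-hot" if one_hot else "single column"
        print(f"{label}: dominant available={dom:.3f}, "
              f"minor without dominant={minor:.3f}")
```

With a single column the second number is always zero, because the dominant category is available whenever any category is. With one-hot columns it is clearly above zero, which is the effect described in the second point. Whether that actually translates into better predictions is a separate question that this sketch does not answer.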
Also, I have found one-hot encoding to generate better results, but I am not 100% sure this holds in all cases.
Please share your thoughts and experience.