How to correctly encode categorical features while avoiding data leakage?
Using a Pipeline in sklearn, it is possible to write concise and clear code that improves usability and readability. It lays the path to a framework that can be reused for almost every machine learning problem, all while avoiding data leakage.
What I haven’t been able to figure out so far is how to correctly encode categorical features while avoiding data leakage.
Almost all the code examples I have found follow this approach:
Merge the training and testing datasets into a single pandas DataFrame
Identify the categorical features and choose a suitable encoding for them (one-hot encoding, binary encoding, etc.)
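To make this concrete, here is a minimal sketch of that pattern as I understand it. The toy data and the column name `color` are made up for illustration, and `pd.get_dummies` is just one of the encoders these examples tend to use.

```python
import pandas as pd

# Hypothetical toy data; "color" is an invented column name.
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

# Step 1: merge training and testing data into a single DataFrame,
# keeping a key so the two parts can be separated again later.
merged = pd.concat([train, test], keys=["train", "test"])

# Step 2: one-hot encode the categorical feature on the merged data.
encoded = pd.get_dummies(merged, columns=["color"])

# Split back into the two parts.
train_encoded = encoded.loc["train"]
test_encoded = encoded.loc["test"]

# The training part now carries a column for "blue",
# a level that only appears in the test data.
print(list(train_encoded.columns))
```

In this sketch the encoded training data ends up with a column whose existence is determined by the test set, which is what makes me suspicious.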
But doesn't this way of proceeding introduce data leakage, since the training and testing datasets have been merged? On the other hand, a categorical feature in the training dataset can contain levels that are not present in the test dataset, and vice versa.
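The mismatch in levels can be seen in a small sketch where each dataset is encoded on its own. Again, the toy data and the column name `color` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "red" occurs only in train, "blue" only in test.
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

# Encode each dataset separately, without merging them first.
train_cols = set(pd.get_dummies(train, columns=["color"]).columns)
test_cols = set(pd.get_dummies(test, columns=["color"]).columns)

# Columns produced for one dataset but not for the other.
only_in_train = sorted(train_cols - test_cols)
only_in_test = sorted(test_cols - train_cols)
print(only_in_train, only_in_test)
```

The two encoded datasets end up with different columns, so a model fitted on the first cannot be applied directly to the second.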
How should this scenario be handled?