How to correctly encode categorical features while avoiding data leakage?




With sklearn pipelines it is possible to write concise, clear code that improves usability and readability, laying the path to a framework that can be reused for almost every machine learning problem, all while avoiding data leakage.

What I haven't been able to figure out so far is how to correctly encode categorical features while avoiding data leakage.

Almost all the code examples I have found follow this approach:

  1. Merge the training and test datasets into a single pandas DataFrame

  2. Conduct EDA

  3. Identify the categorical features and choose an appropriate encoding for them (OneHotEncoding, Binary, etc.)

But doesn't this way of proceeding introduce data leakage, since the training and test datasets have been merged? A categorical feature in the training dataset can contain levels not present in the test set, and vice versa.
How should this scenario be handled?


To handle the scenario where some levels of a categorical variable are present in the training set but not in the test set, or vice versa, the best method in my view is to separately analyze the unique values present in each feature/column of the training and test sets; if some unique values are present in training but not in test, remove those from training.
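The comparison of unique values described above could be sketched roughly as follows. This is only an illustration: the toy DataFrames, the column name `color`, and the helper `level_differences` are hypothetical and not taken from any real dataset or library. It also assumes the test set is available for inspection.

```python
import pandas as pd

# Hypothetical toy data; the column name "color" is only for illustration.
train = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
test = pd.DataFrame({"color": ["red", "green", "yellow"]})


def level_differences(train_df, test_df, columns):
    """Return, per column, the levels found in only one of the two sets."""
    report = {}
    for col in columns:
        train_levels = set(train_df[col].dropna().unique())
        test_levels = set(test_df[col].dropna().unique())
        report[col] = {
            "train_only": train_levels - test_levels,
            "test_only": test_levels - train_levels,
        }
    return report


diff = level_differences(train, test, ["color"])
print(diff)
```

The report only lists the mismatched levels; what to do with them (dropping rows, grouping rare levels, and so on) is a separate decision.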


Thanks blasteraj for your reply.
In my opinion, your suggestion does not seem appropriate for avoiding data leakage.
Using a pipeline I can put an encoding step inside a GridSearchCV, but I get an error when a level of a specific categorical feature is found in the test set while not being present in training. It would be nice to introduce a custom transformer to handle this scenario. skutil.preprocessing provides SafeLabelEncoder, but I was not able to install skutil correctly (Python 3.6).
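One possible way around the unseen-level error, without a custom transformer, might be scikit-learn's own `OneHotEncoder` with `handle_unknown="ignore"`, placed inside the pipeline so that it is fitted only on the training folds. The sketch below is a minimal example under assumptions: the toy data, the column names `color` and `size`, and the choice of `LogisticRegression` are hypothetical, and it assumes a scikit-learn version that provides `ColumnTransformer` (0.20 or later).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "yellow" appears only in the test set.
X_train = pd.DataFrame({
    "color": ["red", "green", "blue", "red", "green", "blue", "red", "green"],
    "size": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5, 1.2, 2.2],
})
y_train = [0, 1, 1, 0, 1, 1, 0, 1]
X_test = pd.DataFrame({"color": ["red", "yellow"], "size": [1.1, 2.9]})

# The encoder learns its categories from whatever data the pipeline is
# fitted on; levels unseen at fit time are encoded as all zeros.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",  # keep the numeric column unchanged
)
pipe = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# GridSearchCV refits the whole pipeline, encoder included, on each
# training fold, so validation folds never influence the encoding.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X_train, y_train)

predictions = search.predict(X_test)
encoded_test = search.best_estimator_.named_steps["preprocess"].transform(X_test)
```

Because the encoder lives inside the pipeline, the test set is never merged with the training data, and the unseen level `yellow` simply gets zeros in all one-hot columns instead of raising an error.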