Data transformation while predicting on new data (one-hot encoding issue)

machine_learning

#1

I have one-hot encoded my data and trained a model. When I pass new (test) data through my transformation pipeline (including the one-hot encoder), new columns are created because the new data contains categories that were not seen during training.
As a result the model throws an error, since it was trained on fewer columns. How do I handle these unknown categories during data transformation (one-hot encoding)?


#2

Hi @syammohan2103,

You can combine the train and test files and then perform one-hot encoding on the complete data. This way, the number of columns will be the same for the train and test data. Afterwards, you can simply split the data back into train and test sets.
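
A minimal sketch of this combine-encode-split idea with pandas. The frames and the column names `color` and `size` are made up for illustration; it assumes both files are available at the same time.

```python
import pandas as pd

# Hypothetical toy data: "blue" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["blue", "red"], "size": [4, 5]})

# Stack both frames, tagging each row with its origin so we can split later.
combined = pd.concat([train, test], keys=["train", "test"])

# Encode once on the combined data so both parts share the same columns.
encoded = pd.get_dummies(combined, columns=["color"])

# Split back into the original parts using the outer index level.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Both printed column lists should be identical, including a `color_blue` column that is all zeros (or all False) in the train part.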


#3

Hi @syammohan2103
I faced the same issue in my early stages. The solution I found was to take the intersection of the columns of the encoded train (df) and test (test) dataframes and call it col_inter.
Then train on df[col_inter] and test on test[col_inter].
Explanation: columns that are not present in col_inter play no role in training or predicting.
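
Roughly, in pandas this could look like the sketch below. The toy data and the names `df_enc` and `test_enc` are assumptions for illustration; only `df`, `test` and `col_inter` come from the description above.

```python
import pandas as pd

# Hypothetical toy data: "green" is only in train, "blue" is only in test.
df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["blue", "red"], "size": [4, 5]})

# One-hot encode each frame separately; their columns will differ.
df_enc = pd.get_dummies(df, columns=["color"])
test_enc = pd.get_dummies(test, columns=["color"])

# Keep only the columns present in both frames, in the train column order.
col_inter = [c for c in df_enc.columns if c in test_enc.columns]

X_train = df_enc[col_inter]
X_test = test_enc[col_inter]

print(col_inter)
```

Note that this also discards training columns that happen to be missing from the test set (here `color_green`), so the model sees less information than it could.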


#4

@kanav and @AishwaryaSingh
Yes, I understand the scenario where I have both the training and test data with me. My doubt is about the case where I have trained a model and deployed it in a system that will receive completely new values for prediction. (some data in the train set is not there in the test set)


#5

I guess you’ll have to drop the particular column which is not present in the train data. On a side note, which model are you using?
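
One way to do that dropping at prediction time is to store the training column list and reindex the new data to it. This is only a sketch: the toy data and the names `train_columns` and `aligned` are made up, and it assumes the encoding is done with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy data: the new data contains an unseen category "blue".
train = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})
new = pd.DataFrame({"color": ["blue", "red"], "size": [4, 5]})

train_enc = pd.get_dummies(train, columns=["color"])

# Save the training column layout alongside the model at training time.
train_columns = list(train_enc.columns)

# At prediction time, encode the new data and force it into that layout:
# columns unseen in training are dropped, missing ones are filled with 0.
new_enc = pd.get_dummies(new, columns=["color"])
aligned = new_enc.reindex(columns=train_columns, fill_value=0)

print(list(aligned.columns))
```

With this, a row with an unknown category simply has zeros in all the one-hot columns for that feature.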


#6

I am using a Random Forest classifier, but this issue will be there irrespective of the model whenever the data is one-hot encoded.
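
Since the problem sits in the encoding step rather than in the model, one option that may fit the deployed case is to fit scikit-learn's `OneHotEncoder` on the training data only, with `handle_unknown="ignore"`, and reuse the fitted encoder for new data. The toy data below is made up, and this is a sketch rather than a full pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "blue" never appears in the training set.
train = pd.DataFrame({"color": ["red", "green", "red"]})
new = pd.DataFrame({"color": ["blue", "red"]})

# Fit on training data only; unknown categories are encoded as all zeros
# at transform time instead of raising an error or adding a column.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])

# The output is sparse by default, so convert it to a dense array to inspect.
train_matrix = encoder.transform(train[["color"]]).toarray()
new_matrix = encoder.transform(new[["color"]]).toarray()

print(encoder.categories_)
print(new_matrix)
```

The number of output columns is fixed by the categories seen during `fit`, so the model always receives the same shape at prediction time.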