How to handle unknown categories, i.e. those that are in test but not in train?



Hi everyone,

I have been facing this problem since I started participating in machine learning competitions on Kaggle and AnalyticsVidhya. I don’t know the proper way to deal with categories that are present in test but not in train for a categorical feature. So I am asking a simple question here: what strategies do people follow when they face this problem? Is there any research paper published that covers it?

Ankit Gupta


In such cases you combine the train and test records for preprocessing. Once they are combined, you apply a vectorisation such as one-hot encoding, or assign a weight to each unique value in each categorical column. After encoding the variables, separate the train and test cases again and go ahead with training.
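
A rough sketch of the combine, encode and split steps, assuming pandas is available. The toy frames, the `city` and `target` column names, and the `"train"`/`"test"` keys are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical toy data: "Pune" appears in test but never in train.
train = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi"], "target": [1, 0, 1]})
test = pd.DataFrame({"city": ["Mumbai", "Pune"]})

# Stack the feature columns and tag each row with its origin
# (the keys become the outer level of a MultiIndex).
combined = pd.concat(
    [train.drop(columns="target"), test],
    keys=["train", "test"],
)

# One-hot encode the combined frame so both parts get the same columns.
encoded = pd.get_dummies(combined, columns=["city"])

# Separate the train and test rows again using the origin tag.
X_train = encoded.xs("train")
X_test = encoded.xs("test")
y_train = train["target"]
```

Here `X_train` and `X_test` end up with identical columns, including `city_Pune`. That column is all zeros in `X_train`, so this only keeps the shapes aligned; the model still has no training signal for the unseen category.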