What are some ways to handle different factor levels in train and test data?




What are some ways to tackle the problem of a variable having different factor levels in the train and test datasets, specifically when a variable has more factor levels in the test data than in the train data?



Hi @adityashrm21,

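Two ways I can think of:

  1. You can append the training and test data, which brings both to the same set of factor levels.

  2. Set the levels of the test factor from the training factor in R. Example:

    # Rebuild the test factor on the training levels.
    # Values that never occur in the training data become NA.
    test_1$Var4 <- factor(test_1$Var4, levels = levels(train_1$Var4))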

Hope this helps.
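The first option can be sketched roughly as follows. This is only an illustration on made-up toy data: `train_1`, `test_1` and `Var4` reuse the names from the example above, the values are hypothetical, and the level set is built from the union of both data sets rather than by physically row-binding them.

```r
# Hypothetical toy data; the values are made up for illustration
train_1 <- data.frame(Var4 = c("a", "b", "c", "a"), stringsAsFactors = FALSE)
test_1  <- data.frame(Var4 = c("a", "b", "d"), stringsAsFactors = FALSE)

# Collect every level that occurs in either data set
all_levels <- union(unique(train_1$Var4), unique(test_1$Var4))

# Build both factors on the same, combined level set
train_1$Var4 <- factor(train_1$Var4, levels = all_levels)
test_1$Var4  <- factor(test_1$Var4,  levels = all_levels)

# Both columns now report identical levels
print(identical(levels(train_1$Var4), levels(test_1$Var4)))
```

Note that this only makes the level sets consistent. The model still has no training rows for a level that appears only in the test data, so it cannot learn anything about that level.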




Create dummy variables for each factor variable; you will then get a coefficient for each level. That should be a good way to handle it.
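One possible way to build such dummy variables in base R is `model.matrix()`. The sketch below assumes a hypothetical `color` column and a shared level set `levs` chosen up front, so that the train and test matrices end up with the same columns; none of these names come from the question.

```r
# Hypothetical shared level set and toy data
levs  <- c("blue", "green", "red", "yellow")
train <- data.frame(color = factor(c("red", "blue", "green", "red"), levels = levs))
test  <- data.frame(color = factor(c("yellow", "red"), levels = levs))

# "- 1" drops the intercept so that every level gets its own 0/1 column
X_train <- model.matrix(~ color - 1, data = train)
X_test  <- model.matrix(~ color - 1, data = test)

# Same columns in both matrices, including all-zero columns for absent levels
print(colnames(X_train))
print(identical(colnames(X_train), colnames(X_test)))
```

A level that never occurs in the training rows shows up as an all-zero column in `X_train`, so a model fitted on it cannot estimate a meaningful coefficient for that level.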



My training set has 28 levels for one of the categorical columns, while my test set has 30. Applying the second option would change the levels on the test column, but there would still be an error when predicting on the test set.
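Before predicting, it may help to check which levels and rows are actually affected. A minimal sketch, using small hypothetical level sets in place of the 28 and 30 levels mentioned above:

```r
# Hypothetical level sets, much smaller than the real ones
train_levels <- c("a", "b", "c")
test_values  <- c("a", "d", "b", "e", "d")

# Levels that appear in the test data but were never seen in training
unseen_levels <- setdiff(unique(test_values), train_levels)

# Test rows that carry one of those unseen levels
affected_rows <- which(test_values %in% unseen_levels)

print(unseen_levels)
print(affected_rows)
```

How to treat the affected rows (drop them, set them to `NA`, or retrain with more data) depends on the model and the problem; the check above only identifies them.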


It's best to create the dummy variables, or convert the columns to factors, first and only then split the data into train and test sets.
Otherwise, you can also specify the split ratio for the dummy variables.
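As a rough sketch of the factor-first approach: the data frame `full`, the column names and the 70/30 split below are all assumptions made for illustration, not something taken from the question.

```r
set.seed(42)  # make the toy example reproducible

# Hypothetical full data set; Var4 is converted to a factor before splitting
full <- data.frame(
  Var4 = factor(sample(letters[1:5], 20, replace = TRUE)),
  y    = rnorm(20)
)

# Assumed 70/30 split by row index
idx   <- sample(seq_len(nrow(full)), size = 14)
train <- full[idx, , drop = FALSE]
test  <- full[-idx, , drop = FALSE]

# Row subsetting keeps the factor's full level set in both parts
print(identical(levels(train$Var4), levels(test$Var4)))
```

Because the factor is created once on the full data, both parts carry the same levels even if some level happens to have no rows in one of them.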