Mismatch in levels of categorical variable in train and test data

dataexploration
one-hot-encoding
categorical

#1

Hi,
I have a data set divided into two parts, train and test, and I want to know how one should handle extra levels present in the test and/or train data. Three cases are possible:

Case 1 :
In train there is a variable “Education” with 4 levels: U.G., P.G., 12, PHD.
In test data the “Education” variable has 3 levels: U.G., P.G., PHD.
Level 12 is present in train but not in test.

Case 2 :
In train there is a variable “Education” with 3 levels: U.G., P.G., PHD.
In test data the “Education” variable has 4 levels: U.G., P.G., 12, PHD.
Level 12 is present in test but not in train.

Case 3 :
In train there is a variable “Education” with 4 levels: U.G., P.G., 12, PHD.
In test data the “Education” variable has 4 levels: U.G., P.G., 12, Diploma.
Level PHD is present in train but not in test and Diploma is present in test but not in train.
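
To find out which of the three cases applies to a given column, the levels of train and test can be compared as sets. The following is a minimal sketch using pandas; the toy data frames reproduce Case 3 and are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data reproducing Case 3.
train = pd.DataFrame({"Education": ["U.G.", "P.G.", "12", "PHD"]})
test = pd.DataFrame({"Education": ["U.G.", "P.G.", "12", "Diploma"]})

train_levels = set(train["Education"].unique())
test_levels = set(test["Education"].unique())

# Levels that appear on one side only.
only_in_train = train_levels - test_levels
only_in_test = test_levels - train_levels

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

If only the first set is non-empty it is Case 1, if only the second it is Case 2, and if both are non-empty it is Case 3.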

What would be the suitable operation in each case?

  1. Apply one-hot encoding only to the levels that are present in both test and train.
  2. Drop the observations in test/train whose levels are absent from the corresponding train/test.
  3. Build a prediction model without doing anything about the extra levels.

Please suggest if there is any other way to handle this problem.

Thanks in advance.


#2

Hi @syed.danish,

I am not very sure about my methodology, but here is what I would do in such a scenario.

Case 1:
No. of levels in the variable in training set > No. of levels in the variable in test set.

Here, since the test set has one level fewer, we need not include the extra level in our model.
That is, build the model only on “U.G.”, “P.G.” and “PHD”, and then apply it as is to the test data. There is a high probability that the extra level only adds noise.
You can also club some of the levels together (if their counts are very different) and then use one-hot encoding.
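
One possible way to encode only the shared levels is sketched below with pandas. The toy data and the helper name `encode_common` are assumptions for illustration; rows whose level is not shared (here “12”) end up with zeros in every dummy column instead of getting a column of their own.

```python
import pandas as pd

# Hypothetical toy data matching Case 1: "12" appears only in train.
train = pd.DataFrame({"Education": ["U.G.", "P.G.", "12", "PHD", "U.G."]})
test = pd.DataFrame({"Education": ["U.G.", "P.G.", "PHD"]})

# Levels present in both sets.
common_levels = sorted(set(train["Education"]) & set(test["Education"]))

def encode_common(df, levels):
    """One-hot encode only the given levels (hypothetical helper).

    Values outside `levels` become missing in the categorical,
    so their dummy row is all zeros.
    """
    as_cat = pd.Series(
        pd.Categorical(df["Education"], categories=levels), index=df.index
    )
    return pd.get_dummies(as_cat, prefix="Education", dtype=int)

train_encoded = encode_common(train, common_levels)
test_encoded = encode_common(test, common_levels)
```

Because both frames are encoded against the same list of levels, they come out with identical columns in the same order, which is what most models expect at prediction time.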

Case 2:
No. of levels in the variable in training set < No. of levels in the variable in test set.

In this case, you have no choice but to apply the model using only the levels you have in the training set, since the model cannot learn anything about a level it has never seen.

Case 3:
I am not sure about this one, but I would recommend leaving out the extra levels and building the model on the levels that are common.

Another approach that can be taken in all of these cases is to merge a level with another level of similar frequency if the frequency of that level (say “PHD” in Case 3) is low.
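
A rough sketch of merging low-frequency levels is given below. Note that it uses a simpler variant than the one described above: instead of pairing a rare level with another level of similar frequency, it folds every rare level into a single catch-all bucket. The function name `merge_rare_levels`, the `min_count` threshold and the “Other” label are all assumptions for illustration.

```python
import pandas as pd

def merge_rare_levels(train_col, test_col, min_count=2, other_label="Other"):
    """Replace levels that are rare in train with a catch-all label.

    Frequencies are measured on train only; the same mapping is then
    applied to test, so levels unseen in train also fall into the bucket.
    """
    counts = train_col.value_counts()
    keep = set(counts[counts >= min_count].index)

    def merge(col):
        return col.where(col.isin(keep), other_label)

    return merge(train_col), merge(test_col)

# Hypothetical toy data resembling Case 3.
train_edu = pd.Series(["U.G."] * 3 + ["P.G."] * 3 + ["12"] * 2 + ["PHD"])
test_edu = pd.Series(["U.G.", "P.G.", "12", "Diploma"])

train_merged, test_merged = merge_rare_levels(train_edu, test_edu)
```

Here “PHD” (seen once in train) and “Diploma” (never seen in train) both end up in the same bucket, so train and test share one set of levels afterwards.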

Regards,
CC


#3

I think:
For case 1: There is no problem, as the levels of the test data are a subset of those of the train data. But if class 12 is high in frequency in the train data, you can merge it with ug, i.e. create a new variable which has 3 levels: “up to ug”, pg and Ph.D. You also have to change the test data in the same way.
For case 2:
Again, you can do the same thing as before. In the train data mark ug as “up to ug”, and in the test data mark ug and class 12 together as “up to ug”.
Or you can convert the levels to columns and, for the train data, put zero in every row of the column “12”. But this may not be the right way.
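
The second option for case 2, adding an all-zero column for the level that is missing in train, could look roughly like this with pandas. The toy data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data matching Case 2: "12" appears only in test.
train = pd.DataFrame({"Education": ["U.G.", "P.G.", "PHD"]})
test = pd.DataFrame({"Education": ["U.G.", "P.G.", "12", "PHD"]})

train_dummies = pd.get_dummies(train["Education"], prefix="Education", dtype=int)
test_dummies = pd.get_dummies(test["Education"], prefix="Education", dtype=int)

# Give both frames the union of the dummy columns; a column that is
# missing on one side is created there and filled with zeros.
all_columns = sorted(set(train_dummies.columns) | set(test_dummies.columns))
train_dummies = train_dummies.reindex(columns=all_columns, fill_value=0)
test_dummies = test_dummies.reindex(columns=all_columns, fill_value=0)
```

This makes the two frames line up, but the “Education_12” column is constant in train, so a model fitted on it has nothing to learn from that column. That fits the caution above that this may not be the right way.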
For case 3:
We have to decide after looking at the problem. If education is an important variable for prediction, then we have to tweak the data. That may mean creating a new variable by grouping the levels, or adding dummy columns for the level that is absent.
Thanks.


#4

Thanks @Tapojyoti_Paul and @Corporate_Cowboy.
Regards,
Danish