Different train and test set data labels for categorical data



I faced this issue while creating dummies for categorical variables. Let's say that in the train set I have 2 categorical columns (A and B).
‘A’ has 3 distinct categories: A1, A2, A3.
‘B’ has 2 distinct categories: B1, B2.
I then dummified them and got 5 binary columns (3 for A, 2 for B) in the train dataset.

Now I have the same columns in the test data, but they have a different number of categories. Let's say
‘A’ has A1, A2, A3, A4 as categories
‘B’ has only B1 as a category.
The test dataframe will now have a different set of columns.

So how do I predict on the test dataset if the columns become different after category treatment (dummifying)?

Please answer :grinning:


Hi @syed_f_aziz

Can you share the dataset or explain what columns A and B represent?
I have a few suggestions, which might or might not work.

  1. If A and B are ordinal variables, you can replace the labels with numbers, and so avoid creating dummies for them.
  2. If A4 in the test set is very similar to A1, A2 or A3, you can replace all A4 values with one of these (and similarly for B in the train dataset).
  3. If these variables do not have any significant effect on your target, you can consider dropping them (although keep this as your last option).
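
Suggestions 1 and 2 could look roughly like this in pandas. This is only a sketch on toy data: the ordering A1 < A2 < A3 < A4 and the idea that A4 behaves like A3 are assumptions made for illustration, not something known about the real dataset.

```python
import pandas as pd

# Toy frames mirroring the example in the question (hypothetical data).
train = pd.DataFrame({"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B1"]})
test = pd.DataFrame({"A": ["A1", "A4", "A2"], "B": ["B1", "B1", "B1"]})

# Suggestion 1: if A is ordinal, map labels to numbers instead of dummies.
# The ordering below is an assumption for illustration.
order = {"A1": 1, "A2": 2, "A3": 3, "A4": 4}
train_ord = train["A"].map(order)
test_ord = test["A"].map(order)

# Suggestion 2: fold the unseen category into a similar, known one.
# Treating A4 like A3 is an assumption for illustration.
test_merged = test["A"].replace({"A4": "A3"})
```

Any label missing from the `order` mapping would come out as NaN, so the mapping has to cover every category in both sets.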



As per my understanding, you are solving a supervised learning problem. So when you apply the model for prediction, the dependent variable should not be included in the test data, and all test columns should match the train columns, except for the dependent variable.
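
One possible way to make the test columns match the train columns is to dummify each set and then reindex the test frame to the train columns. This is a sketch on toy data; the target column name `y` is hypothetical. Categories seen only in test (A4) are dropped, and categories missing from test (B2) are filled with 0.

```python
import pandas as pd

# Toy frames mirroring the example in the question (hypothetical data).
train = pd.DataFrame({"A": ["A1", "A2", "A3"],
                      "B": ["B1", "B2", "B1"],
                      "y": [0, 1, 0]})  # "y" is a hypothetical target name
test = pd.DataFrame({"A": ["A1", "A4", "A2"], "B": ["B1", "B1", "B1"]})

# Dummify the predictors of each set separately.
X_train = pd.get_dummies(train.drop(columns="y"), dtype=int)
X_test = pd.get_dummies(test, dtype=int)

# Force the test frame onto the train columns:
# extra test-only dummies are dropped, missing ones are added as 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

Note that a test row with A4 ends up with all A dummies equal to 0, so the model sees it as "none of the known categories".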

I hope it will be helpful for you.


Hi @AishwaryaSingh, thanks for your prompt response. I'm afraid I can't share the actual data with you, as it’s on my office PC. :pensive:
I did get the gist of your reply. But how do we achieve suggestion 2 in your response, given that I have about 300 columns and multiple category levels within each column?
How do we select what to keep and what to leave out?
Also, there could be multiple categories that are not common to the train data columns and the final submission columns.



Hi Syed,

First combine the test and train sets, then dummify the categorical variables, and split them back apart before prediction.
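
A minimal sketch of this in pandas, on toy data (the target column name `y` is hypothetical). It assumes the full test set is available before training.

```python
import pandas as pd

# Toy frames mirroring the example in the question (hypothetical data).
train = pd.DataFrame({"A": ["A1", "A2", "A3"],
                      "B": ["B1", "B2", "B1"],
                      "y": [0, 1, 0]})  # "y" is a hypothetical target name
test = pd.DataFrame({"A": ["A1", "A4", "A2"], "B": ["B1", "B1", "B1"]})

# Stack the predictors; the keys label which rows came from which set.
combined = pd.concat([train.drop(columns="y"), test], keys=["train", "test"])

# Dummify once, so both sets share the same columns.
dummies = pd.get_dummies(combined, dtype=int)

# Split back into train and test using the outer index level.
X_train = dummies.loc["train"]
X_test = dummies.loc["test"]
y_train = train["y"]
```

Both frames now have 6 columns. The A4 column is all zeros in train, so the model cannot learn anything about A4 from it; this fixes the shape mismatch but not the missing information.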