Error in predict.randomForest(fit, test) : Type of predictors in new data do not match that of the training data

random_forest

#1

Hello,

When I use a randomForest model fitted on my training data to predict a categorical variable for my test dataset, I get this error:

pred=predict(fit,test)

Error in predict.randomForest(fit, test) :
Type of predictors in new data do not match that of the training data.

Which predictors in the new data is this error referring to? Do the variables in my test dataset contain values that the training data doesn't contain?
If so, how do I handle this problem?

Thanks.


#2

@adityashrm21,

This is happening because some factor variable has a different number of levels in your test and train datasets. You can remove the error by finding that variable and using

levels(test$variableName) <- levels(train$variableName)
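
One caveat worth checking before relying on direct level assignment: `levels(x) <- value` replaces the level labels by position, so if the test factor's levels are not already a leading subset of the training levels in the same order, values can be silently relabelled. A minimal sketch with toy data (the vectors `train_x` and `test_x` are hypothetical, not from this thread) contrasts that with rebuilding the factor via `factor(..., levels = ...)`, which keeps each value and only extends the level set:

```r
# Toy data (hypothetical): the test set lacks the level "a" seen in training.
train_x <- factor(c("a", "b", "c", "b"))
test_x  <- factor(c("b", "c", "c"))

levels(train_x)  # "a" "b" "c"
levels(test_x)   # "b" "c"

# Direct assignment relabels by position: "b" becomes "a", "c" becomes "b".
relabelled <- test_x
levels(relabelled) <- levels(train_x)
as.character(relabelled)  # "a" "b" "b"

# Rebuilding the factor keeps the values and only extends the level set.
realigned <- factor(as.character(test_x), levels = levels(train_x))
as.character(realigned)   # "b" "c" "c"
levels(realigned)         # "a" "b" "c"
```

If both level sets are identical and in the same order, the two approaches give the same result; the rebuilt factor is simply the safer default when that is not guaranteed.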

Hope this helps.


#3

This worked. Thanks.


#4

Dear Administrator and dear members,

Hello!

I am trying the random forest model in my research thesis, but I ran into a problem during the validation (testing) phase.
When I used the final random forest to predict on an independent dataset, I received this message: "Type of predictors in new data do not match that of the training data".

I tried this method to detect the different categories in my factors/variables:
levels(Train$Aquifer.media)
levels(Test$Aquifer.media)
For the factor "Aquifer.media", I have:
Train dataset: "Carbonates rocks" "Crystalline rocks" "Siliciclastic sedimentary rocks" "Unconsolisated sediments rocks" "Volcanic rocks"
Test dataset: "Crystalline rocks" "Siliciclastic sedimentary rocks" "Unconsolisated sediments rocks" "Volcanic rocks"
I found that the categories of this variable are not the same in the two datasets. I would like to know how I can solve this problem.
Is it possible for me to delete some categories from the factors?

Best regards


#5

Dear members,

Hello!

I tried the random forest model in my research topic, but I ran into a problem during the validation phase.
When I used the final random forest model to predict on an independent dataset, I received this message: "Type of predictors in new data do not match that of the training data".
To detect the different categories in my factors/variables, I used:
levels(Train$Aquifer.media)
levels(Test$Aquifer.media)
For the factor "Aquifer.media", I have:
Train dataset: "Carbonates rocks" "Crystalline rocks" "Siliciclastic sedimentary rocks" "Unconsolisated sediments rocks" "Volcanic rocks"
Test dataset: "Crystalline rocks" "Siliciclastic sedimentary rocks" "Unconsolisated sediments rocks" "Volcanic rocks"
I found that the predictor has different categories in the two datasets. I would like to know how I can solve this problem.
Is it possible to delete some categories from the factors?

Best regards


#6

Hi @spacemodel,

Could you please elaborate on your problem?

As I understand it, you got the "Type of predictors in new data do not match" error. You applied the leveling step (as given above), which should have resolved it. What happens then? Does the error appear again?


#7

Hi JalFaizy,

Thank you for your response. The error "Type of predictors in new data do not match that of the training data" still appears every time. Yes, I applied the leveling step given above. I don't understand why the error occurs.

See my short code:

library(randomForest)

# Read the training data and fit the model
Traindata <- read.table("C:/Users/iouedraogo/Desktop/Tester/MoyData_correction final.txt", header=TRUE, sep="\t", na.strings="NA", dec=",", strip.white=TRUE)
rf <- randomForest(Ln.NO3._mean ~ Aquifer.media + Recharge + Climat.Class + Population.density…people.km2. + Rainfall.Class, mtry=4, ntree=1000, data=Traindata, importance=TRUE)
rf
predict(rf)

# Read the independent test data and predict on it
Testdata <- read.table("C:/Users/iouedraogo/Desktop/Tester/Random_Forest_Factors_Fin.txt", header=TRUE, sep="\t", na.strings="NA", dec=",", strip.white=TRUE)
predict(rf, Testdata)

When I run the step predict(rf, Testdata), the message "Type of predictors in new data do not match that of the training data" appears.

Best regards.
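
Because the two files are read separately, each factor column can end up with its own level set (or, in recent R versions where `stringsAsFactors` defaults to `FALSE`, as a character column). A minimal sketch of aligning every factor predictor in the test data to the training data follows; `align_factor_levels` is a hypothetical helper, not part of randomForest, and the small data frames are toy stand-ins for the real files:

```r
# `align_factor_levels` is a hypothetical helper, not part of randomForest.
align_factor_levels <- function(train, test) {
  for (col in intersect(names(train), names(test))) {
    if (is.factor(train[[col]])) {
      # Rebuild the test column with the training levels.
      # Values unseen in training become NA and need separate handling.
      test[[col]] <- factor(as.character(test[[col]]),
                            levels = levels(train[[col]]))
    }
  }
  test
}

# Toy stand-ins for the real files (values are illustrative only).
Traindata <- data.frame(
  Aquifer.media = factor(c("Carbonates rocks", "Crystalline rocks", "Volcanic rocks")),
  Recharge = c(10, 20, 30)
)
Testdata <- data.frame(
  Aquifer.media = factor(c("Crystalline rocks", "Volcanic rocks")),
  Recharge = c(15, 25)
)

Testdata <- align_factor_levels(Traindata, Testdata)
levels(Testdata$Aquifer.media)
```

After this step the test factor carries the same levels as the training factor, including those that do not occur in the test rows, which is what the prediction step appears to require.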


#8

Hi @spacemodel,

Does your short code include the leveling step?

What I would do is inspect the train and test data.
Print the first five rows of both. Do they match?
Print the unique values of each column of both. Do they match?
If all of the above fails, check whether you have applied leveling. Check both directions: train<-test and test<-train.
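
The checks above can be sketched as a small comparison of column classes and factor levels. `compare_predictors` is a hypothetical helper name, and the two data frames are toy examples rather than the data from this thread:

```r
# Report, for each shared column, the class in each dataset and
# whether the factor levels are identical.
# `compare_predictors` is a hypothetical helper, not part of randomForest.
compare_predictors <- function(train, test) {
  shared <- intersect(names(train), names(test))
  do.call(rbind, lapply(shared, function(col) {
    tr <- train[[col]]
    te <- test[[col]]
    data.frame(
      column      = col,
      train_class = class(tr)[1],
      test_class  = class(te)[1],
      same_levels = identical(levels(tr), levels(te)),
      stringsAsFactors = FALSE
    )
  }))
}

# Toy data: "rock" has fewer levels in test; "climate" is character in test.
train <- data.frame(rock = factor(c("a", "b", "c")),
                    climate = factor(c("x", "y", "x")))
test  <- data.frame(rock = factor(c("b", "c")),
                    climate = c("x", "y"),
                    stringsAsFactors = FALSE)

result <- compare_predictors(train, test)
print(result)
```

Any row where the classes differ or `same_levels` is `FALSE` points to a predictor worth re-leveling or converting before calling `predict`.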

Does this solve your problem?

PS: Refer here. Maybe this can help you.


#9

Thank you for your response. I checked the leveling step as well.
For example, you can see the levels of these two variables with:
levels(Traindata$Aquifer.media)
levels(Testdata$Aquifer.media)
levels(Traindata$Climat.Class)
levels(Testdata$Climat.Class)

The first five rows are:

For Traindata
Aquifer.media                    Climat.Class
Crystalline rocks                Dry sub-Humid
Crystalline rocks                Humid
Crystalline rocks                Humid
Crystalline rocks                Dry sub-Humid
Unconsolisated sediments rocks   Arid

For Testdata
Aquifer.media                    Climat.Class
Crystalline rocks                Semi-arid
Crystalline rocks                Semi-arid
Crystalline rocks                Semi-arid
Crystalline rocks                Arid
Crystalline rocks                Semi-arid

We can see that the values do not match between Traindata and Testdata.
In my study, I think it would perhaps be a mistake to combine Traindata and Testdata, because Traindata are observed at the regional scale and Testdata at the local (small) scale. My objective was to develop a regional-scale model using Traindata, and then to use the independent data (here Testdata) to validate the random forest that was developed.

If you look closely at the climate class in Traindata, you can see that it covers several climatic conditions because of the large study scale, compared with the climate class in Testdata, which corresponds to the local scale. I think this difference in scale is the fundamental reason for the data mismatch.

Please, how can I solve this problem?

Best regards.
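
When the training data covers more classes than the test data, re-leveling the test factor to the training levels is usually enough. The case that needs attention is the reverse one, where the test data contains a class the model never saw. A minimal sketch of that check follows; the vectors are toy stand-ins, and whether "Semi-arid" actually occurs in the real training data is not stated in this thread:

```r
# Toy stand-ins (illustrative only): regional training data vs. local test data.
train_climate <- factor(c("Arid", "Dry sub-Humid", "Humid", "Arid"))
test_climate  <- c("Semi-arid", "Arid", "Semi-arid")

# Classes present in the test data but never seen during training.
unseen <- setdiff(unique(test_climate), levels(train_climate))
unseen

# Re-leveling turns rows with unseen classes into NA, which flags them for review.
aligned <- factor(test_climate, levels = levels(train_climate))
which(is.na(aligned))
```

Rows flagged this way cannot be scored on that predictor as-is; they would have to be dropped, recoded to a class the model knows, or handled by retraining on data that includes the class.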


#10

Hi @spacemodel,

Does this help? (Try @bluenote10's answer.)