What does the error "factor Loan_ID has new levels" mean?



Hi,
I started working on this yesterday.
I started with a simple glm() call like this:

m1 <- glm(Loan_Status ~ . -Loan_ID, data = train_1, family = "binomial")

test_1$Loan_Status <- "Y"

test_1$Loan_Status <- predict(m1, test_1, type = "response")

I got the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor Loan_ID has new levels LP001015, LP001022, LP001031, LP001035, LP001051, LP001054, LP001055, LP001056, LP001059, LP001067, LP001078, LP001082, LP001083, LP001094, LP001096, LP001099, LP001105, LP001107, LP001108, LP001115, LP001121, LP001124, LP001128, LP001135, LP001149, LP001153, LP001163, LP001169, LP001174, LP001176, LP001177, LP001183, LP001185, LP001187, LP001190, LP001203, LP001208, LP001210, LP001211, LP001219, LP001220, LP001221, LP001226, LP001230, LP001231, LP001232, LP001237, LP001242, LP001268, LP001270, LP001284, LP001287, LP001291, LP001298, LP001312, LP001313, LP001317, LP001321, LP001323, LP001324, LP001332, LP001335, LP001338, LP001347, LP001348, LP001351, LP001352, LP001358, LP001359, LP001361, LP001366, LP001368, LP001375, LP001380, LP001386, LP001400, LP001407, LP001413, LP001415, LP001419, LP001420, LP001428, LP001445, LP001446, LP001450, LP001452, LP001455, LP001466, LP001471, LP001472, LP001475, LP001483, LP001486, LP001490, LP001496, LP001499,

Please suggest how to fix this error.


@Rishabh0709 You should not use identifier variables such as ID when building your models. When the model is built, ID is treated as a factor (a variable with levels). Because every ID is unique, the IDs in the test set do not appear among the levels seen in training, which is why you are getting this error.
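A minimal sketch of how you could check this yourself. The two data frames are toy stand-ins for train_1 and test_1: the training IDs are made up, and the test IDs are the first two from your error message.

```r
# Toy stand-ins for train_1 and test_1 (the training IDs are invented).
train_toy <- data.frame(
  Loan_ID = factor(c("LP001002", "LP001003", "LP001005"))
)
test_toy <- data.frame(
  Loan_ID = factor(c("LP001015", "LP001022"))
)

# Each row has its own ID, so there are as many levels as rows.
n_levels <- nlevels(train_toy$Loan_ID)

# IDs present in the test set that never appeared in training.
unseen_ids <- setdiff(levels(test_toy$Loan_ID), levels(train_toy$Loan_ID))
print(unseen_ids)
```

If unseen_ids is non-empty for a factor that predict() checks, you would expect the "has new levels" error.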


@sowmiyanm Thanks for the explanation.
I have one more doubt, though.
In the glm() call above, I wrote
Loan_Status ~ . -Loan_ID

Doesn’t that mean I have excluded Loan_ID?


@Rishabh0709 I am not sure about this, but you can run summary(modelname) to see whether Loan_ID is included when the model is built.
You can also explore the update() function in R for variable selection.
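A rough sketch of both suggestions on a toy data set. The columns ApplicantIncome and LoanAmount and all values are hypothetical, since I have not seen your actual data.

```r
set.seed(1)

# Toy data; column names other than Loan_Status are assumptions.
toy <- data.frame(
  Loan_Status     = factor(rep(c("Y", "N"), 10)),
  ApplicantIncome = round(runif(20, 1000, 9000)),
  LoanAmount      = round(runif(20, 50, 400))
)

m_toy <- glm(Loan_Status ~ ApplicantIncome + LoanAmount,
             data = toy, family = "binomial")

# The coefficient names show which terms were actually estimated.
# summary(m_toy) prints the same names together with their estimates.
coef_names <- names(coef(m_toy))
has_id_term <- any(grepl("Loan_ID", coef_names))
print(has_id_term)

# update() refits the model with a term removed from the formula.
m_small <- update(m_toy, . ~ . - LoanAmount)
print(names(coef(m_small)))
```

On your own model, the same grepl() check on names(coef(m1)) would tell you whether any Loan_ID coefficient was estimated.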


@Rishabh0709 To your first question: Loan ID would ideally have no bearing on Loan Status, unless it contains some pattern that can be extracted through feature engineering. As @sowmiyanm rightly mentioned, when imported into R it is treated as a factor, and since every ID is unique, the factor levels differ between the training and validation sets.

For the second part: yes, Loan_Status ~ . -Loan_ID is read as a formula that uses all variables except Loan_ID to explain Loan_Status.
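One possible explanation, which I have not confirmed on your data, is that subtracting a term in the formula removes it from the fitted terms but can still leave the variable in the model frame that predict() checks for new levels. A workaround you could try is to drop the column from the data before fitting, so the model never sees Loan_ID at all. This is only a sketch on toy data: ApplicantIncome, the generated IDs, and the 0.5 cutoff are all assumptions.

```r
set.seed(42)
n_train <- 40

# Toy stand-ins for train_1 and test_1 with unique, non-overlapping IDs.
train_toy <- data.frame(
  Loan_ID         = factor(sprintf("LP%06d", 1:n_train)),
  ApplicantIncome = round(runif(n_train, 1000, 9000)),
  Loan_Status     = factor(sample(c("N", "Y"), n_train, replace = TRUE))
)
test_toy <- data.frame(
  Loan_ID         = factor(sprintf("LP%06d", 101:110)),
  ApplicantIncome = round(runif(10, 1000, 9000))
)

# Remove the identifier from the data itself, not just from the formula.
train_no_id <- train_toy[, setdiff(names(train_toy), "Loan_ID")]

m_toy <- glm(Loan_Status ~ ., data = train_no_id, family = "binomial")

# test_toy still carries Loan_ID, but the model has no record of it,
# so predict() only looks up the predictors it was trained on.
pred <- predict(m_toy, newdata = test_toy, type = "response")

# Convert probabilities to labels; the 0.5 cutoff is an arbitrary choice.
test_toy$Loan_Status <- ifelse(pred > 0.5, "Y", "N")
```

Keeping Loan_ID in the test data frame is still useful, because you can pair each prediction with its ID afterwards.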

Can you please share some more of your code, such as what you have assigned to the objects, so that we can work out a possible reason for the error?