Error while making a prediction using a logistic regression model

r
logistic_regression

#1

Hi all,

I am new to predictive analytics, and this is the first time I am participating in a hackathon.
I am trying to predict on my test dataset with the following command:
predtest <- predict(logmodel, newdata = test, type = "response")
and I am getting this error:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor Loan_ID has new levels

What is the reason behind it, and how can I resolve it?


#2

@Killing_Machine
Hey there, it looks like you used Loan_ID as one of your predictors when training your model, which you shouldn't.

Please have a look and let me know if that was the case.


#3

You are using a factor variable that has a level present in the test/validation data but not in the train data, so the model fails to predict.
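
For example, here is a toy reproduction of the error (hypothetical data; glm will also warn here because every row is perfectly identified by its own ID):

train <- data.frame(Loan_ID = factor(c("id1", "id2", "id3", "id4")),
                    ApplicantIncome = c(5000, 3000, 4000, 6000),
                    Loan_Status = factor(c("Y", "N", "Y", "Y")))
test <- data.frame(Loan_ID = factor("id5"), ApplicantIncome = 4500)

fit <- glm(Loan_Status ~ ., data = train, family = binomial)
predict(fit, newdata = test, type = "response")
# Error in model.frame.default(...) : factor Loan_ID has new levels id5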


#4

No, man. The model I have used here is
logmodel <- glm(Loan_Status ~ . - Loan_ID, data = train, family = binomial)


#5

@Tapojyoti_Paul: yes, I realized that. There is no Loan_Status factor variable available in the test data set.

In that case, what do I have to do to predict on the test data set?


#6

@Killing_Machine

Then try one thing: remove the Loan_ID variable from the test data and predict again.

Please let me know what happens.
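
For example (a minimal sketch, using the logmodel and test objects from your earlier post; keep a copy of the IDs if you need them for a submission file):

loan_ids <- test$Loan_ID                          # save the IDs for later
test_noid <- test[ , names(test) != "Loan_ID"]    # drop the ID column
predtest <- predict(logmodel, newdata = test_noid, type = "response")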


#7

See, you are fitting a model whose dependent variable is Loan_Status.
So obviously it will not be present in the test data set, because that is exactly what you are going to predict using all the other independent variables.
But regarding your first question: "factor Loan_ID has new levels" means that you used Loan_ID in the model. Suppose Loan_ID has the following levels in the train data:
id1, id2, id3, …, id10
Now, if you try to predict Loan_Status for a new level id11 in the test data, this type of error message will usually appear.
I am not sure whether Loan_ID is a genuine factor variable or an ID variable; i.e., if Loan_ID is unique for each customer, then you should remove it.
You can do it in the following way:
fit <- glm(Loan_Status ~ ., data = train[ , -which(names(train) == "Loan_ID")], family = binomial)
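
Once the model is fit, predict() with type = "response" gives probabilities; a 0.5 cutoff (an assumption you can tune) turns them into class labels. "Y"/"N" are assumed here to be the levels of Loan_Status:

predtest <- predict(fit, newdata = test, type = "response")   # predicted probabilities
predclass <- ifelse(predtest > 0.5, "Y", "N")                 # hard class labels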


#8

@NSS & @Tapojyoti_Paul: thank you guys, that resolved my problem.

I have two more questions:
Q1) How do I impute values in a factor variable? With the help of the mice package I was able to impute only continuous variables.
Q2) Apart from logistic regression and decision trees, is there any other method that we can use here?


#9

@Killing_Machine
You can use the mice package to impute categorical variables as well. Just make sure that they have already been converted into factors before applying the package.
The number of missing values is comparatively large only in the variables "Credit_History" and "Self_Employed". Convert these variables into factors first:

total$Credit_History <- as.factor(total$Credit_History)
total$Self_Employed <- as.factor(total$Self_Employed)

Now you can use the mice package:
imputed_Data_cat <- mice(data.frame(total$Credit_History, total$Self_Employed), m = 5, maxit = 5, method = 'logreg')

Since the other variables have very few missing values that are missing at random, you can impute them using the median.
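
As a sketch of getting the imputed values back (note that data.frame(total$Credit_History, ...) mangles the column names to total.Credit_History and total.Self_Employed; LoanAmount is an assumed numeric column for the median example):

library(mice)
completed <- complete(imputed_Data_cat, 1)   # first of the m = 5 completed datasets
total$Credit_History <- completed$total.Credit_History
total$Self_Employed <- completed$total.Self_Employed

# median imputation for a numeric column with few missing values
total$LoanAmount[is.na(total$LoanAmount)] <- median(total$LoanAmount, na.rm = TRUE)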


#10

@Killing_Machine I am glad it helped.

Now to answer your questions:
1- Go through this very well-written article on missing value imputation, and I am pretty sure that all your doubts will be cleared.
2- For classification you can use SVM, Bayes classification, etc., but since you are new to analytics, it would be better to first invest your time in understanding these algorithms and then move on to the application part.
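
For instance, a minimal SVM sketch with the e1071 package (assumes missing values are already imputed and categoricals converted to factors; just one way to set this up):

library(e1071)
svmfit <- svm(Loan_Status ~ . - Loan_ID, data = train, kernel = "radial")
svmpred <- predict(svmfit, newdata = test)   # predicted class labels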


#11

For missing values, you can follow these two links:
1)http://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
2)http://www.analyticsvidhya.com/blog/tag/missing-values-imputation-in-r/

Usually, for a factor variable you can handle missing values in different ways:
1) treat NAs as a separate factor level (see the sketch below);
2) use association rules.
There are many more techniques. Hope those two links help.
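
For option 1, base R's addNA() does this directly (a sketch, assuming total$Self_Employed is already a factor):

total$Self_Employed <- addNA(total$Self_Employed)                             # NA becomes its own level
levels(total$Self_Employed)[is.na(levels(total$Self_Employed))] <- "Missing"  # give it a readable name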

And apart from logistic regression and decision trees there are many more techniques:
for bagging, you can use random forest;
for boosting there are AdaBoost, GBM, XGBoost, etc.,
but boosting usually works better.
You can even use an SVM or a neural network.
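
For example, a minimal random forest sketch (randomForest cannot handle NAs, so impute first; Loan_Status must be a factor for classification):

library(randomForest)
set.seed(123)                                                   # for reproducibility
rf <- randomForest(Loan_Status ~ . - Loan_ID, data = train, ntree = 500)
rfpred <- predict(rf, newdata = test)                           # predicted class labels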


#12

Code:

Predictlog <- predict(logistic_model, newdata = newtestDF, type = "response")
Predictlog
table(Predictlog,newtestDF)

Error:
Error in table(Predictlog, newtestDF) :
all arguments must have the same length