How do I deal with missing values in the ‘Test’ dataset when predicting the output, once the model has been built from the ‘Train’ dataset? I am new to the analytics world. I am using ‘SAS’; will I have to learn ‘R’ too?
There are multiple ways to deal with missing values.
- Replacing them with mean/mode.
- Replacing them with a constant say -1.
- Using classifier models to predict them.
I have no idea about SAS, but R provides various options for missing value imputation, such as kNN imputation and the Amelia package.
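The first two options in the list above can be sketched in base R as follows. This is only a minimal illustration: the data frame, its column names and the helper `mode_value` are made up for the example and do not come from any dataset in this thread.

```r
# Toy data frame; the column names are hypothetical.
df <- data.frame(
  income = c(50, NA, 70, 60),
  gender = c("Male", NA, "Female", "Male"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value of a vector.
mode_value <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# 1. Replace missing numeric values with the column mean.
df$income_mean <- ifelse(is.na(df$income),
                         mean(df$income, na.rm = TRUE),
                         df$income)

# 2. Replace missing categorical values with the mode.
df$gender_mode <- ifelse(is.na(df$gender),
                         mode_value(df$gender),
                         df$gender)

# 3. Replace missing numeric values with a constant, here -1.
df$income_const <- ifelse(is.na(df$income), -1, df$income)

print(df)
```

The third option, predicting the missing values with a model, needs a fitted model per column and is a larger piece of work than this sketch.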
Thanks Akash for the reply.
But my question was about missing values in the ‘Test’ dataset. In the ‘Hackathons’, both the ‘Train’ and ‘Test’ datasets have missing values, and they come from the same population. If we build our model on the ‘Train’ dataset, we can then predict values for the ‘Test’ dataset. The problem is the missing values in the ‘Test’ dataset. If we have to impute the values in ‘Test’ all over again, then why did we impute the ‘Train’ dataset? Is it because of the larger size of the ‘Train’ dataset?
Can you recommend a book for ‘R’? I want to start from the basics; I have no prior knowledge of ‘R’.
How does replacing missing values with -1 help us? What is the concept behind imputing -1, and why -1 in particular? What happens if the column with missing values already contains -1?
Raghu: This method was used in ‘Last Man Standing’ Data Hack recently by @Rohan_Rao aka ‘Vopani’.
He stated in his approach:
"Imputing missing values with -1 is very common. It essentially informs the model that the value is missing and the model handles it differently. I’ve found this works better for tree-based models".
He finished 2nd and had been leading for most of this data hack, but in the last hour or two the winner, @Bishwarup, surprised us all.
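On the earlier question of what happens if the column already contains -1: the constant only works as a "missing" signal when no real observation equals it, so it is worth checking first. A minimal sketch in base R, with made-up values; the extra indicator column is an addition for illustration and is not part of the approach quoted above.

```r
# Toy column; the values are hypothetical.
x <- c(120, NA, 66, NA, 141)

sentinel <- -1

# Stop early if a real observation already equals the sentinel;
# in that case pick another constant (e.g. -999) instead.
stopifnot(!any(x == sentinel, na.rm = TRUE))

# Optional: keep an explicit 0/1 flag marking which values were missing.
x_missing <- as.integer(is.na(x))

# Replace the missing values with the sentinel.
x_filled <- ifelse(is.na(x), sentinel, x)

print(data.frame(x, x_missing, x_filled))
```

Because every real value here is positive, a tree-based model can separate the imputed rows with a single split on `x_filled`, which is one way to read the remark that the model "handles it differently".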
I usually take an iterative approach: first I build the model on the training data with missing values removed (i.e. I eliminate rows with missing values). Then I replace the missing values with a constant, followed by mean and kNN imputation. For the test dataset I avoid row elimination, since we need those rows to predict values. Also, in R I have observed that some algorithms, like random forest, throw an error about missing values in the test/train data. I recommend R Cookbook and Beginning R. Try watching R tutorial videos on YouTube as well.
Thanks a lot, Akash, for your approach and for suggesting R Cookbook. I will give it a try. If you do not eliminate the missing values from the test dataset, then how do you predict the values? I have not been able to find this anywhere.
give this a chance
I appreciate your patience and cooperation. I have just started learning ‘R’ through ‘Swirl’, but I will definitely give this a try.
Thanks a lot.
Please take a look at the R code below for missing value imputation using models:

    rm(list = ls())
    setwd('/home/shuvayan/Downloads/AV/Loan')
    train <- read.csv('train_u6lujuX.csv', stringsAsFactors = T)
    #Read test with factors too, so that both data sets have matching column types:
    test <- read.csv('test_Y3wMUE5.csv', stringsAsFactors = T)
    label <- train['Loan_Status']
    #Combine train (without the target, column 13) and test so both are imputed the same way:
    df_combi <- rbind(train[,-13], test)
    str(train)
    summary(train)
    summary(df_combi)

    #Treat missing levels in data:
    #(assuming blank entries were read in as the empty level '',
    # rename only that level to the most frequent category)
    #Gender:
    levels(df_combi$Gender)[levels(df_combi$Gender) == ''] <- 'Male'
    levels(df_combi$Gender)
    #Married:
    levels(df_combi$Married)[levels(df_combi$Married) == ''] <- 'Yes'
    levels(df_combi$Married)
    #Dependents:
    levels(df_combi$Dependents)[levels(df_combi$Dependents) == ''] <- '0'
    levels(df_combi$Dependents)
    #Self_Employed:
    levels(df_combi$Self_Employed)[levels(df_combi$Self_Employed) == ''] <- 'No'

    #Replace missing values in Loan_Amount_term:
    summary(df_combi$Loan_Amount_Term)
    #Use rpart (column 1, the ID, is dropped from the predictors):
    library(rpart)
    library(rpart.plot)
    loan_duration.rpart <- rpart(Loan_Amount_Term ~ .,
                                 data = df_combi[!is.na(df_combi$Loan_Amount_Term),-1],
                                 method = "anova")
    loan_duration.rpart$cptable
    rpart.plot(loan_duration.rpart)
    #Use this to replace missing values in Loan Amount Term:
    df_combi$Loan_Amount_Term[is.na(df_combi$Loan_Amount_Term)] <-
      predict(loan_duration.rpart, df_combi[is.na(df_combi$Loan_Amount_Term),])

    #Use lm to predict Loan Amount:
    lm.loan_amount <- lm(LoanAmount ~ ., data = df_combi[!is.na(df_combi$LoanAmount),-1])
    summary(lm.loan_amount)
    #Use this to replace missing values in Loan Amount:
    df_combi$LoanAmount[is.na(df_combi$LoanAmount)] <-
      predict(lm.loan_amount, df_combi[is.na(df_combi$LoanAmount),])
    summary(df_combi)

    #Treat missing values in CreditHistory:
    rpart.credit_hist <- rpart(Credit_History ~ .,
                               data = df_combi[!is.na(df_combi$Credit_History),-1],
                               method = "anova")
    summary(rpart.credit_hist)
    rpart.plot(rpart.credit_hist)
    rpart.credit_hist$cptable
    #Use the data not containing missing values to predict missing values
    df_combi$Credit_History[is.na(df_combi$Credit_History)] <-
      predict(rpart.credit_hist, df_combi[is.na(df_combi$Credit_History),])
    summary(df_combi)

    #Split back into train and test:
    df_train <- df_combi[1:614,-1]
    df_test <- df_combi[615:981,]
Alternatively you can use a single value like -1/-999 to replace all missing values.
Hope this helps!!
@shuvayan: Thanks for the help. I am new to ‘R’ and am trying to learn it with the help of ‘Swirl’.
I will go through your code and try to understand it.
Great. I would also suggest Machine Learning with R as a book to have by your side.
I am facing exactly the same problem as the topic of this thread. Having read the thread, I found one solution, which combines the train and test sets, imputes the missing values, and then splits them back again. However, what is the point of imputing the test set this way? What if the data on which our model is tested later has NaN values? This does not help with prediction there. Can anyone suggest other alternatives for dealing with null values in the test set?
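One commonly used alternative, sketched here in base R, is to learn the fill values from the training set only and reuse them on any data that arrives later, so nothing has to be combined with the training data. The data, the column names and the helper `impute_amount` are all made up for this illustration; mean imputation and `lm` simply stand in for whatever imputation rule and model are actually used.

```r
# Toy data; column names and values are hypothetical.
train <- data.frame(amount = c(100, NA, 140, 120),
                    y      = c(1.0, 1.6, 2.1, 1.9))
test  <- data.frame(amount = c(NA, 110))

# Learn the fill value from the training set only.
fill_amount <- mean(train$amount, na.rm = TRUE)

# Hypothetical helper: apply a stored fill value to any data frame.
impute_amount <- function(df, fill) {
  df$amount[is.na(df$amount)] <- fill
  df
}

train_imp <- impute_amount(train, fill_amount)
test_imp  <- impute_amount(test, fill_amount)   # same value reused on new data

fit <- lm(y ~ amount, data = train_imp)

# Without imputation, the row with a missing predictor gets no prediction.
pred_raw <- predict(fit, newdata = test)
# With the stored fill value, every row gets a prediction.
pred_imp <- predict(fit, newdata = test_imp)

print(pred_raw)
print(pred_imp)
```

The same `fill_amount` can be stored alongside the model and applied to future data with missing values, which is the situation the question above describes.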