Steps after Imputation

adaboost
decision_trees
logistic_regression
randomforest

#1

I have created 5 imputed datasets on the loan prediction data using the Amelia package.
How should I proceed with using them in a model?

  1. Should I combine the results from all the imputed datasets to get the best outcome? or
  2. Should I choose the imputed dataset that gives the highest accuracy?
  3. Is it good practice to apply logistic regression, decision trees, random forest and AdaBoost on each imputed dataset one by one to find the best model? (Currently I am doing that.)

#2

Hi @Surya1987

Do you mind sharing your code?

I too used the Amelia package for missing value imputation in Loan Prediction, since choosing an ad-hoc method such as mean imputation can lead to serious biases in variances and covariances. In this case, instead of building a separate model for each dataset, I chose to impute the missing values in my train dataset with the average of the imputed values.

For example: when Credit_History gets imputed by Amelia, you get 5 different datasets. Just average the imputed values and use the result in your train dataset. This will save you time, since you no longer need to run the different algorithms on each of the new datasets one by one.
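As a minimal base-R sketch of this averaging step (the toy data and the `combine_imputations` helper are hypothetical names; it assumes the imputed data frames are collected in a list, as in Amelia's `imputations` slot): numeric columns are averaged element-wise, while factor columns such as Credit_History take the majority vote across imputations, since factors cannot be averaged directly.

```r
# Toy stand-ins for two imputed datasets (Amelia would give five):
imps <- list(
  data.frame(LoanAmount = c(100, 150), Credit_History = factor(c("1", "0"))),
  data.frame(LoanAmount = c(110, 150), Credit_History = factor(c("1", "1")))
)

# Combine the imputed datasets: element-wise mean for numeric columns,
# majority vote (mode) for factor columns.
combine_imputations <- function(imps) {
  out <- imps[[1]]
  for (col in names(out)) {
    if (is.numeric(out[[col]])) {
      out[[col]] <- rowMeans(sapply(imps, function(d) d[[col]]))
    } else {
      votes <- sapply(imps, function(d) as.character(d[[col]]))
      out[[col]] <- factor(apply(votes, 1, function(r) names(which.max(table(r)))))
    }
  }
  out
}

combined <- combine_imputations(imps)
combined$LoanAmount  # 105 150
```

On ties in the factor vote, `which.max` simply picks the first level in table order, so with an odd number of imputations (such as Amelia's default 5) ties on a binary variable cannot occur.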

Alternatively, you can also use missForest package for missing value imputation. I found it quite robust and better than amelia. missForest uses random forest trained on observed values of a data matrix to predict missing values. Like Amelia, it can be run in parallel to save computation time.

Here’s the code you can use:

missForest(data, maxiter = 10, ntree = 100, decreasing = FALSE,
           mtry = floor(sqrt(ncol(data))), replace = TRUE,
           parallelize = 'forests')

data - a data matrix or data frame with missing values
maxiter - the maximum number of iterations to be performed. Default is 10
ntree - number of trees to grow in each forest
mtry - number of variables to be sampled at each node
replace - if 'TRUE', leads to bootstrap sampling; if 'FALSE', leads to sub-sampling (without replacement). It should be TRUE
parallelize - activates parallel processing. You should use 'forests' because this data set does not have many variables. Had there been many variables, you should have used 'variables' instead of 'forests'.
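One note on the parallelize option: missForest relies on a registered foreach backend for parallel runs, so 'forests' only works after you register one; otherwise you have to fall back to 'no'. A sketch of the setup, assuming the doParallel CRAN package (the missForest call is shown commented out since it needs the package and your train data):

```r
# The 'parallel' package ships with base R and reports available cores:
library(parallel)
n_cores <- detectCores()

# missForest's parallelize = 'forests' needs a foreach backend registered
# first, e.g. via the doParallel CRAN package (doSNOW also works):
# library(doParallel)
# registerDoParallel(cores = max(1, n_cores - 1))
# train.impute <- missForest(train, parallelize = 'forests')
```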

Thanks


#3

Thank you very much, Manish. I will work on the suggestions mentioned.


#4

How do I take the average of imputed factor variables?


#5

Hi @Surya1987

I’m working on a similar loan prediction problem and have imputed the values using the Amelia package. Can you tell me how you averaged the values from the 5 datasets in R?

Thank you.


#6

Hi Manish,

Thanks for your help on this.

I have tried the missForest package to impute the missing data in the train data with the following code:

train.impute <- missForest(train, maxiter = 10, ntree = 50, decreasing = FALSE, mtry = floor(sqrt(ncol(train))), replace = TRUE, parallelize = 'no')

  1. R asked me to change parallelize = 'forests' to parallelize = 'no'.
  2. After changing to parallelize = 'no', I get the following error:
    Can not handle categorical predictors with more than 53 categories.

I understand this is because the "Loan_ID" variable is a factor with 614 levels.

Should we convert "Loan_ID" to a string? Or is there an option to select a few variables and run missForest to impute missing values on them, rather than feeding in the entire train data? If yes, please let me know how to do that.

Thanks.

Regards,
Muthuraj. D


#8

Hi dmuthuraj,

You can exclude the Loan_ID variable from your data set, as it is only used for identification.
Exclude it, create a new data set, and run missForest on the new data set:

new_dataSet <- mydata[-1]
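Dropping the column by name rather than by position is a bit safer in case the column order ever changes. A small self-contained sketch (the toy mydata frame below is hypothetical; the missForest call is commented out since it needs the package and the real data):

```r
# Hypothetical stand-in for the train data; the real set has 614 rows.
mydata <- data.frame(
  Loan_ID     = c("LP001002", "LP001003"),
  LoanAmount  = c(NA, 128),
  Loan_Status = c("Y", "N")
)

# Drop the identifier column by name rather than by position:
new_dataSet <- mydata[, setdiff(names(mydata), "Loan_ID")]

# Then impute on the reduced data set (requires the missForest package):
# library(missForest)
# train.impute <- missForest(new_dataSet)
```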