How to impute categorical missing values?



Can anybody explain how to impute missing values in categorical variables without using central tendencies such as the mode or median?
I tried random forest imputation with the missForest package in R, but it does not work with more than 53 categories, and in the loan prediction dataset the LOANID variable has 600 categories.


@Gurpreet_amity- You can still use the missForest package. You need not include LoanId, because it is not informative: it is unique for every applicant and is only used to tell the loan applications apart.

Hope this helps!



Should I create another data set without the loan_id variable and then try to impute?
If so, how will I merge loan_id back without any common variable?

And can we use a clustering technique? If so, how would we weight the categorical variables?


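@Gurpreet_amity- Yes, you can create a new data frame under a new name that excludes loanid, run the imputation on it, and then add the loanid variable back to the new data frame. Imputation does not reorder the rows, so the IDs can be attached by position and no merge key is needed.
Alternatively, you can keep the existing data frame and simply exclude the column by index when you pass it to the imputation function:
train[, -1]  # all columns except the first (the loan ID)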
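
As a rough sketch of this workflow in R: the toy data below and the column names Loan_ID, Gender, Property_Area and LoanAmount are assumptions for illustration, not the real dataset. The idea is to impute without the ID column and then attach the IDs back by row position, which works because missForest returns the rows in their original order.

```r
library(missForest)

set.seed(42)
n <- 60

# Hypothetical stand-in for the training data; column names are assumed.
train <- data.frame(
  Loan_ID       = sprintf("LP%04d", seq_len(n)),
  Gender        = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  Property_Area = factor(sample(c("Urban", "Rural", "Semiurban"), n, replace = TRUE)),
  LoanAmount    = round(rnorm(n, mean = 140, sd = 40)),
  stringsAsFactors = FALSE
)

# Knock out a few values so there is something to impute.
train$Gender[c(3, 17, 40)]     <- NA
train$Property_Area[c(8, 25)]  <- NA
train$LoanAmount[c(5, 30)]     <- NA

# Drop the ID column (first column) before imputing.
# Categorical columns must be factors, not character vectors.
features <- train[, -1]
imp <- missForest(features)

# imp$ximp holds the imputed data in the same row order,
# so the IDs can be put back by position without a merge.
train_imputed <- data.frame(
  Loan_ID = train$Loan_ID,
  imp$ximp,
  stringsAsFactors = FALSE
)
```

After this, train_imputed has the same number of rows as train, the original Loan_ID values in the first column, and no missing values in the remaining columns.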

Hope this helps!



Can you do the same thing in python? If yes, how?


@jalFaizy- I request you to post this as a new topic in the Discussion forum

How to impute categorical missing values in python?

Have you checked out the mice and Hmisc packages? They are pretty good for imputing categorical data. For all purposes, discard the LoanID column until you are creating the submission file.
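
For what it is worth, a minimal sketch with mice could look like the following. The toy data and column names are assumptions for illustration only. With default settings, mice picks a method per column type (predictive mean matching for numeric columns, logistic regression for two-level factors, polytomous regression for unordered factors with more levels), so categorical columns should be stored as factors.

```r
library(mice)

set.seed(1)
n <- 80

# Hypothetical data; the ID column is kept aside and not imputed.
loans <- data.frame(
  Loan_ID       = sprintf("LP%04d", seq_len(n)),
  Gender        = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  Property_Area = factor(sample(c("Urban", "Rural", "Semiurban"), n, replace = TRUE)),
  LoanAmount    = round(rnorm(n, mean = 140, sd = 40)),
  stringsAsFactors = FALSE
)
loans$Gender[c(2, 19, 44)]        <- NA
loans$Property_Area[c(7, 33, 61)] <- NA
loans$LoanAmount[c(11, 52)]       <- NA

# Impute everything except the ID column.
features <- loans[, -1]
imp <- mice(features, m = 5, seed = 123, printFlag = FALSE)

# Take the first of the m completed data sets and re-attach the IDs by position.
completed <- complete(imp, 1)
loans_imputed <- data.frame(
  Loan_ID = loans$Loan_ID,
  completed,
  stringsAsFactors = FALSE
)
```

Using only the first completed data set is a simplification for a submission file; mice produces m of them so that the uncertainty of the imputation can be taken into account.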


@Gurpreet_amity Check out this comment