What to do for missing values in important variables?




I built a predictive model on a dataset, and when I looked at variable importance I saw that some of the most important variables are also the ones with the most missing values, roughly 50-70%. What can or should I do (if anything) to handle those variables?



@adityashrm21 - Treating missing values is very important for improving the performance of a classifier.
There are four common methods for handling missing values.

Case Deletion (CD) - Also known as complete case analysis. It is available in all statistical packages and is the default method in many programs. This method consists of discarding all instances (cases) with missing values for at least one feature. A variation of this method consists of determining the extent of missing data on each instance and attribute and deleting the instances and/or attributes with high levels of missing data. Before deleting any attribute, it is necessary to evaluate its relevance to the analysis.
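
A minimal sketch of case deletion and its column-dropping variation, in plain Python. The helper names and the 0.5 threshold are assumptions made for illustration, not part of any package. With data like the toy table below, strict case deletion can leave nothing to train on, which is why checking the missing fraction per attribute first is useful.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_fraction_by_column(rows):
    """Return the fraction of missing entries for each column index."""
    n_cols = len(rows[0])
    return [sum(is_missing(r[j]) for r in rows) / len(rows) for j in range(n_cols)]

def complete_cases(rows):
    """Case deletion: keep only rows with no missing value."""
    return [r for r in rows if not any(is_missing(v) for v in r)]

def drop_sparse_columns(rows, max_missing=0.5):
    """Variation: drop columns whose missing fraction exceeds max_missing."""
    fractions = missing_fraction_by_column(rows)
    keep = [j for j, f in enumerate(fractions) if f <= max_missing]
    return [[r[j] for j in keep] for r in rows], keep

# Toy data: the middle column is missing in 3 of 4 rows.
rows = [
    [1.0, None, 3.0],
    [2.0, 5.0, None],
    [3.0, None, 1.0],
    [4.0, None, 2.0],
]

print(missing_fraction_by_column(rows))  # per-column missing fraction
print(len(complete_cases(rows)))         # rows surviving strict case deletion
reduced, kept = drop_sparse_columns(rows)
print(kept, len(complete_cases(reduced)))
```

Note that dropping the sparse column here also throws away whatever signal it carried, which is the trade-off the question is about.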

Mean Imputation (MI) - This is one of the most frequently used methods. It consists of replacing the missing data for a given feature (attribute) with the mean of all known values of that attribute in the class to which the instance with the missing attribute belongs.

Median Imputation (MDI) - Since the mean is affected by the presence of outliers it seems natural to use the median instead just to assure robustness. In this case, the missing data for a given feature is replaced by the median of all known values of that attribute in the class where the instance with the missing feature belongs. This method is also a recommended choice when the distribution of the values of a given feature is skewed.
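
Class-wise mean and median imputation can be sketched as follows. This is an illustrative helper (the name `impute_by_class` and its `strategy` parameter are assumptions), and it assumes every class has at least one known value for the feature.

```python
import statistics

def impute_by_class(values, labels, strategy="median"):
    """Fill None entries with the mean or median of known values in the same class.

    Assumes each class has at least one known value; otherwise a KeyError is raised.
    """
    stat = statistics.median if strategy == "median" else statistics.mean
    known = {}
    for v, c in zip(values, labels):
        if v is not None:
            known.setdefault(c, []).append(v)
    fill = {c: stat(vs) for c, vs in known.items()}
    return [fill[c] if v is None else v for v, c in zip(values, labels)]

# Class "a" contains an outlier (100.0), so mean and median fill values differ a lot.
values = [1.0, 2.0, None, 100.0, 10.0, None, 12.0]
labels = ["a", "a", "a", "a", "b", "b", "b"]

print(impute_by_class(values, labels, strategy="mean"))
print(impute_by_class(values, labels, strategy="median"))
```

In class "a" the mean is pulled toward the outlier while the median stays near the bulk of the values, which is the robustness argument made above.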

KNN Imputation (KNNI) - In this method, the missing values of an instance are imputed from a given number of instances that are most similar to the instance of interest. The similarity of two instances is determined using a distance function.
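
A rough sketch of KNN imputation in plain Python. The choices here are assumptions made for illustration: Euclidean distance computed only over features observed in both instances, and the imputed value taken as the mean over the `k` nearest donors. Library implementations may differ in both respects.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows."""
    def distance(a, b):
        # Root mean squared difference over the features observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows that have feature j observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

rows = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [1.05, None],  # closest to the first two rows on the observed feature
]
print(knn_impute(rows, k=2)[3])
```

Because the distance mixes features, it usually makes sense to put them on a comparable scale first. When a feature is missing in 50-70% of instances, the pool of donors for that feature is correspondingly small.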

Hope this helps!



All the methods mentioned by Hinduja are valid. The main point to be aware of is the distribution of your dataset after imputation: will it match your population? Because the methods mentioned are applied to one variable at a time, you could end up with a joint (multivariate) distribution that does not resemble your population.
If your data follow a multivariate normal distribution and your NAs are missing at random (MAR), then you can use a different algorithm, one whose imputed distribution will match the population. Since you mentioned multiple variables, this method could be of interest to you. Simplifying a lot: you bootstrap your data, estimate the posterior distribution, and do a conditional draw.
Do not panic, all this has already been implemented and works well. Check the R packages Amelia for imputation and MVN for the multivariate normal distribution.
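
To make the "bootstrap, estimate, conditional draw" idea concrete, here is a heavily simplified two-variable sketch in plain Python. It is not what Amelia does internally: it assumes `x` is fully observed, `y` is partly missing, the pair is roughly bivariate normal, and it estimates parameters from the complete cases of each bootstrap resample. The function name and parameters are hypothetical.

```python
import math
import random

def bootstrap_conditional_impute(pairs, n_imputations=5, seed=0):
    """Return n_imputations completed copies of (x, y) pairs where y may be None.

    Assumes at least two complete cases with differing x values.
    """
    rng = random.Random(seed)
    complete = [(x, y) for x, y in pairs if y is not None]
    completed_sets = []
    while len(completed_sets) < n_imputations:
        # Bootstrap the complete cases so parameter uncertainty carries over.
        sample = [rng.choice(complete) for _ in complete]
        n = len(sample)
        mx = sum(x for x, _ in sample) / n
        my = sum(y for _, y in sample) / n
        sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
        syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
        sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
        if sxx == 0:
            continue  # degenerate resample (all x equal); draw another one
        # For a bivariate normal: y | x ~ N(my + (sxy/sxx)(x - mx), syy - sxy^2/sxx)
        slope = sxy / sxx
        cond_sd = math.sqrt(max(syy - sxy ** 2 / sxx, 0.0))
        filled = []
        for x, y in pairs:
            if y is None:
                y = rng.gauss(my + slope * (x - mx), cond_sd)
            filled.append((x, y))
        completed_sets.append(filled)
    return completed_sets

# Synthetic data: y depends on x, and every second y is hidden.
data_rng = random.Random(1)
pairs = []
for i in range(100):
    x = data_rng.gauss(0, 1)
    y = 2.0 * x + data_rng.gauss(0, 0.5)
    pairs.append((x, None if i % 2 else y))

imputations = bootstrap_conditional_impute(pairs)
print(len(imputations), imputations[0][1])
```

Drawing from the conditional distribution, rather than plugging in the conditional mean, is what keeps the spread of the imputed variable and its relationship to the other variable closer to the original data. Each completed copy would then be analysed separately and the results combined.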
Hope this helps.


I wasn’t really asking about the various ways to impute missing values. I was asking about the case where the most important predictor variables have a large number of missing values. Imputing them made my prediction worse, even after using the Amelia package and the like. So what would be a way to tackle this situation?



The first step is to build a learning curve. If your train and test metrics are close, the gain will not be great; if the gap is wide, you can improve your prediction.
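
A learning curve can be sketched in a few lines. The example below uses a one-variable least-squares line on synthetic data purely as a stand-in model; the function names and the chosen training sizes are assumptions. The idea carries over to any model: train on growing subsets and compare the training error with the error on held-out data.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, valid, sizes):
    """Return (size, train_error, validation_error) for each training size."""
    vx, vy = zip(*valid)
    curve = []
    for size in sizes:
        xs, ys = zip(*train[:size])
        model = fit_line(xs, ys)
        curve.append((size, mse(model, xs, ys), mse(model, vx, vy)))
    return curve

# Synthetic data: a noisy linear relationship, split into train and validation.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + rng.gauss(0, 1)))
train, valid = data[:150], data[150:]

curve = learning_curve(train, valid, sizes=[10, 25, 50, 100, 150])
for size, train_err, valid_err in curve:
    print(f"n={size:3d}  train MSE={train_err:.3f}  validation MSE={valid_err:.3f}")
```

Reading the output: if the two error columns are already close at the largest size, the gap between train and test is small; a wide gap that is still narrowing as the size grows points to room for improvement.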

Hope this helps.