What percentage of missing values is acceptable for modeling?



I am currently trying to build a classification model. As a first step, I checked the number of NA (missing) values in each column, and one variable has 43.4% missing values. I want to know whether a variable with this many missing values should be used for modeling or not.
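For reference, a per-column missing share like the one above can be computed with pandas. This is only a minimal sketch on a made-up data frame; the column names and values are placeholders, not the actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31],
    "income": [50.0, 62.0, np.nan, 48.0, 55.0],
    "label": [0, 1, 0, 1, 1],
})

# isna() marks missing cells; the column-wise mean of that boolean
# mask is the fraction of missing values in each column.
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```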


Hi @harry,

You need to work out whether that variable is important for the classification problem. If it is not, you can drop it; if it is, you will have to impute its values.
You can use a random forest to estimate variable importance:

In the image above, the circle marks the top 3 most important variables.
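A ranking like this could be produced along the following lines. This is a rough sketch using scikit-learn on synthetic data, so the feature names `x0` to `x3` and the way the label is generated are assumptions for illustration only. Many random forest implementations do not accept NA values directly, so rows with missing values may have to be dropped or imputed before fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): the label depends
# mostly on x0, a little on x1, and not at all on x2 and x3.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
names = ["x0", "x1", "x2", "x3"]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```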
Also, if you impute the variable, you might use linear regression if it is continuous, or any classification model if it is categorical.
It would be great if you could provide a snapshot of the data.
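As a rough sketch of the regression idea for a continuous variable, assuming scikit-learn and a made-up data frame (the columns `a`, `b` and `target_var` are hypothetical): fit the regression on the rows where the variable is known, then predict it for the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): target_var depends linearly on a and b.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
df["target_var"] = 2.0 * df["a"] - df["b"] + rng.normal(scale=0.1, size=n)

# Knock out roughly 40% of the values to mimic the situation in the question.
df.loc[rng.random(n) < 0.4, "target_var"] = np.nan

predictors = ["a", "b"]
known = df["target_var"].notna()

# Fit on rows where the variable is observed, predict where it is missing.
reg = LinearRegression().fit(df.loc[known, predictors], df.loc[known, "target_var"])
df.loc[~known, "target_var"] = reg.predict(df.loc[~known, predictors])
```

For a categorical variable the same pattern should work with a classifier (for example `LogisticRegression`) in place of `LinearRegression`.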


Hi Harry,
If you use a random forest, Shuvayan's advice is good. Keep in mind that some random forest implementations will do an imputation on their own (it is better to know which one). But as we do not know the model and metric you want to use (linear, non-linear, or even rule-based), I would use another method which is more universal.

  1. I would build a subset of your data set with only the non-NA observations. This means you should still have enough observations compared to the number of variables you want to use in your model.

  2. I would build the model on the subset without the variable that has NAs, then do the prediction on the train set (you do not really need a test set at this stage) with the metric you want to use. This is a baseline. Its value will certainly differ from what you would get on the full dataset with only the non-NA variables; it should show more error, as you have fewer observations. If it does not, you should check for identical observations.

  3. I do the same as in 2, but this time I add the variable with NAs (still on the subset without NA rows). Then do a prediction on the train set with the same metric. You have a new value.

Now, if your error is lower in 3 than in 2, then the variable which has NAs in the original dataset has an effect, and how important it is can be judged from the delta between 2 and 3.
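The three steps can be sketched as follows. This is only an illustration under assumptions: the data are synthetic, the column names (`x1`, `x2`, `x_na`, `y`) are made up, logistic regression stands in for whatever model you choose, and training error (1 minus accuracy) stands in for your metric.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data (assumption): the label depends on x1 and on x_na,
# and about 43% of x_na is then set to NA.
rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x_na": rng.normal(size=n)})
df["y"] = (df["x1"] + 2.0 * df["x_na"] > 0).astype(int)
df.loc[rng.random(n) < 0.43, "x_na"] = np.nan

# Step 1: keep only the complete observations.
subset = df.dropna()

def train_error(features):
    """Fit on the subset and return the error measured on that same subset."""
    model = LogisticRegression(max_iter=1000).fit(subset[features], subset["y"])
    return 1.0 - accuracy_score(subset["y"], model.predict(subset[features]))

# Step 2: baseline without the variable that has NAs.
err_without = train_error(["x1", "x2"])
# Step 3: same subset, same metric, with the variable added.
err_with = train_error(["x1", "x2", "x_na"])

delta = err_without - err_with
print(f"without: {err_without:.3f}  with: {err_with:.3f}  delta: {delta:.3f}")
```

Because step 3 fits a superset of the features used in step 2, the training error will rarely go up, so only a clear delta is informative.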

If you discover that the variable with NAs in the original dataset is important, the next step is to impute that variable. The method to use depends on the model you use; let's say nearest neighbour (simple, but perhaps not the best). Then redo the modeling with the full dataset, train and test. You should be on the way to a better model.
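For the nearest-neighbour option, one possible sketch uses scikit-learn's `KNNImputer`. The small matrix below is invented for illustration, and `n_neighbors=2` is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up matrix: two clusters of rows, the last column has missing values.
X = np.array([
    [1.0, 2.0, 1.5],
    [1.1, 2.1, np.nan],
    [5.0, 6.0, 5.5],
    [5.2, 6.1, np.nan],
    [0.9, 1.9, 1.4],
    [5.1, 5.9, 5.6],
])

# Each missing value is replaced by the mean of that column over the
# n_neighbors closest rows that have the column observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

With a train/test split, it would be usual to call `fit` on the training rows only and then apply `transform` to both parts, so that the test rows do not influence the imputation.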

Keep in mind that I am assuming the subset is big enough compared to the number of variables you have. If it is not, then you should start with imputation.

Hope this helps. If not, you will have to give more details about the model, the number of variables, the data size, etc.