What should be the allowed percentage of Missing Values?

Deleting columns containing more than a certain proportion of missing values is one technique for dimensionality reduction.
What percentage of values should be missing in a column before we drop it completely?


@shuvayan - Theoretically, 25 to 30% is the maximum proportion of missing values allowed, beyond which we might want to drop the variable from the analysis. Practically, this varies. At times we get variables with ~50% missing values, but the customer still insists on keeping them for the analysis. In those cases we might want to treat them accordingly.
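As a quick illustration of that threshold-based drop, here is a minimal pandas sketch (the toy data and the 30% cutoff are assumptions for illustration, not from the original post):

```python
import pandas as pd
import numpy as np

# Hypothetical data: column "b" is 75% missing, "c" is 25% missing
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

threshold = 0.30  # drop columns with more than 30% missing values
keep = df.columns[df.isna().mean() <= threshold]
df_reduced = df[keep]
print(list(df_reduced.columns))  # ['a', 'c']
```

`df.isna().mean()` gives the fraction of missing values per column, so the comparison keeps only columns at or below the chosen cutoff.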



As @karthe1 suggested, this varies from case to case and depends on the amount of information you think the variable holds. For example, suppose you are working on a dataset which contains a column for date of marriage. It may be blank for 50% (or even more) of the population, but might carry very high information about the lifestyle of the person. In such cases, you would still use the variable.

If the information contained in the variable is not that high, you can drop the variable if it has more than 50% missing values. I have seen projects / models where imputation of even 20 - 30% missing values provided better results - the famous Titanic dataset on Kaggle being one such case. Age is missing in ~20% of cases, but you benefit by imputing it rather than ignoring the variable.
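A minimal sketch of that kind of imputation, using a toy stand-in for the Titanic Age column (the values here are made up, and median imputation is assumed as the strategy; the real data is on Kaggle):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the Titanic "Age" column with missing entries
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0])

# Median imputation keeps the column usable instead of dropping it
age_imputed = age.fillna(age.median())
print(age_imputed.isna().sum())  # 0
```

The median is often preferred over the mean here because it is robust to outliers such as a few very old passengers.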

Hope this helps.



If a column has 85% missing values, what should be done in that case? @kunal

If the column contains 85% missing values then it should be dropped. But if it is an important column, then request the business to provide a new data set.


@siddhesh1991 it totally depends upon the importance of the variable.

I think it also depends on the nature of the variable. That is: do you know the expected distribution of your values? If so, can you impute them with a confident strategy (linear, polynomial…)? For instance, for numerical values in time series, a good first step is plotting the values, so you can make a guess at the correct strategy and apply interpolation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
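A minimal sketch of that interpolation step with pandas (the series values are made up for illustration):

```python
import pandas as pd
import numpy as np

# Made-up time-series readings with a gap in the middle
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

# Linear interpolation fills each gap from its neighbours
filled = s.interpolate(method="linear")
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Other `method` values (e.g. "polynomial" with an `order` argument) are available when a straight line is a poor fit for the series.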

Other, simpler strategies for categorical variables, when no informed guess is possible, could be, for instance, the majority class… You can always apply different methods and check the model performance :wink:
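The majority-class (mode) fill for a categorical column can be sketched like this (toy data assumed):

```python
import pandas as pd
import numpy as np

# Hypothetical categorical column with missing entries
color = pd.Series(["red", "blue", np.nan, "red", np.nan, "red"])

# Fill missing values with the majority class (the mode)
mode_filled = color.fillna(color.mode()[0])
print(mode_filled.tolist())  # ['red', 'blue', 'red', 'red', 'red', 'red']
```

`Series.mode()` returns a Series (there can be ties), so `[0]` picks the first most frequent value.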

Hey Karthe1,
So happy to find your answer to this question. I am doing some data processing now, and I was wondering: do you have references to support the claim that "Theoretically, 25 to 30% is the maximum missing values allowed, beyond which we might want to drop the variable from analysis"?
In my analysis, I selected 25% as the threshold, but I need a reference to support it. I haven't found any reference so far.
Hope to hear from you. Many thanks.

@rajshukla you can tell more on this. :slightly_smiling_face:

This is a very subjective question @manish7273; it depends upon intuition. If the variable is important, then consider it even with a high missing value percentage, and vice versa.

Do you know what was the method used for imputation in this dataset ?

What if we don’t know the names or anything else about our features? Then how can we know their importance for predicting the target variable?


Check the correlation with the target from the data you have.
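One way to sketch that correlation check when the features are anonymous (synthetic data here; `f1` drives the target by construction, `f2` is pure noise):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Anonymous features: f1 is informative, f2 is noise (synthetic example)
f1 = rng.normal(size=200)
f2 = rng.normal(size=200)
target = 3 * f1 + rng.normal(scale=0.1, size=200)

df = pd.DataFrame({"f1": f1, "f2": f2, "target": target})

# Rank features by absolute correlation with the target
corrs = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(corrs.index[0])  # f1 ranks highest
```

Keep in mind that plain correlation only captures linear relationships; tree-based feature importances are a common alternative for non-linear effects.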


I am afraid there is no rule of thumb for this threshold.

I read in Stef van Buuren's book on multiple imputation:

King et al. (2001) estimated that the percentage of incomplete records in the political sciences exceeded 50% on average, with some studies having over 90% incomplete records.

Features with even that much missing data can still be valuable, and should not automatically be dropped.

© Copyright 2013-2021 Analytics Vidhya