How can clustering be used to impute missing values in a categorical variable?



I am currently studying methods for filling in missing values in a categorical variable. So far I have looked at methods such as:

  1. Imputing with the most frequent value of the variable.
  2. A model-building approach.
  3. A user-defined approach (based on the correlation between variables).
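For reference, the first approach (most frequent value) can be sketched roughly as follows. The `colors` series is made-up data for illustration only, and pandas is an assumed choice of library.

```python
import pandas as pd

# Made-up categorical column with two missing entries (illustration only).
colors = pd.Series(["red", "blue", None, "red", None, "green", "red"])

# mode() returns the most frequent value(s); take the first one if there are ties.
most_frequent = colors.mode().iloc[0]

# Replace every missing entry with that value.
filled = colors.fillna(most_frequent)
print(filled.tolist())
```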

I want to know whether clustering can also be used as a method for filling in the missing values of such a variable.


I am not sure clustering would help you impute missing values, as it is an unsupervised learning technique. In clustering, you aim to identify group behavior.
I would recommend trying classification and regression trees (CART) if using an “algorithm” for imputation is your primary goal.
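A minimal sketch of the tree idea, assuming scikit-learn's `DecisionTreeClassifier` and a made-up table whose column names (`age`, `income`, `segment`) are hypothetical: train on the rows where the category is known, then predict it for the rows where it is missing.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data (illustration only); "segment" is the categorical column with gaps.
df = pd.DataFrame({
    "age":     [22, 25, 47, 52, 46, 56, 23, 51],
    "income":  [20, 24, 60, 75, 58, 80, 21, 70],
    "segment": ["basic", "basic", "premium", "premium",
                "premium", None, None, "premium"],
})

features = ["age", "income"]
known = df["segment"].notna()

# Fit the tree on rows where the category is observed...
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(df.loc[known, features], df.loc[known, "segment"])

# ...and use its predictions to fill the rows where it is missing.
df.loc[~known, "segment"] = tree.predict(df.loc[~known, features])
print(df)
```

On real data the predictor columns would need their own missing values handled first, and the tree depth is something to tune rather than a recommended setting.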

Hope it helps


Hi @harry,

Clustering can be used, but not directly.
For example, let’s say you have a dataset that contains consumer demographic information, and Age has some missing values. You can run clustering without the Age variable to find groups of similar consumers.
Once that is done, you can impute the missing values of Age in each cluster with the mean age of the other consumers in that cluster, or fit a regression within each cluster to predict it.
This is usually more accurate than imputing with the mean or a single regression over the whole dataset, because each fill value comes from similar consumers.
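Here is a rough sketch of those steps, assuming scikit-learn's `KMeans` with two clusters and a made-up consumer table; the column names and the number of clusters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up consumer data (illustration only); "age" has two missing values.
df = pd.DataFrame({
    "income":   [20, 22, 21, 80, 85, 78, 23, 82],
    "spending": [5, 6, 5, 40, 42, 39, 6, 41],
    "age":      [24, 26, np.nan, 51, 49, 53, 25, np.nan],
})

# Cluster on every column except the one being imputed.
X = StandardScaler().fit_transform(df[["income", "spending"]])
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fill each missing age with the mean age of the other members of its cluster.
cluster_mean_age = df.groupby("cluster")["age"].transform("mean")
df["age"] = df["age"].fillna(cluster_mean_age)
print(df)
```

For a categorical variable, the same pattern should work with the most frequent value in each cluster in place of the mean.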
Hope this helps!!


Hi guys,

I totally agree with shuvayan!

I would like to add that before going into clustering, you should first try to form groups simply using domain knowledge. For example, in the Titanic dataset, Age has many missing values. One approach could be to replace them with the mean/median age for each combination of “Sex” and “Pclass”, which are two other variables.
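A small sketch of that grouping idea with pandas. The rows below are made up and only borrow the Titanic column names; they are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows using Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female",
               "female", "female", "male", "female"],
    "Pclass": [1, 1, 1, 3, 3, 3, 1, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, 24.0, np.nan, 45.0, 22.0],
})

# Median age of each (Sex, Pclass) group, aligned row by row with df.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)
print(df)
```

If a group has no observed ages at all, its median is also missing, so those rows would still need a fallback such as the overall median.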

However, if the dataset is too large to make such judgements, we should go for clustering. Just beware that clustering on many variables for imputation might overfit the data.

Hope this helps!