How to handle missing values of categorical variables?

Hi,

For missing values in continuous variables, we use one of the following approaches:

  1. Ignore these observations
  2. Replace with general average
  3. Replace with similar type of averages
  4. Build model to predict missing values

Can you suggest methods to handle missing values when the data is binary (1/0 or M/F) or categorical?

Regards,
Imran
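
For reference, the first three options listed in the question for continuous variables could look roughly like this in pandas. This is only a minimal sketch: the DataFrame, the column names "city" and "income", and the choice of "same city" as the notion of "similar type" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" is continuous with gaps,
# "city" is used as the grouping column for "similar" rows.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "income": [10.0, np.nan, 30.0, 100.0, 200.0, np.nan],
})

# 1. Ignore the observations that have a missing value.
dropped = df.dropna(subset=["income"])

# 2. Replace missing values with the overall average.
overall = df["income"].fillna(df["income"].mean())

# 3. Replace missing values with the average of similar rows (same city here).
group_mean = df.groupby("city")["income"].transform("mean")
by_group = df["income"].fillna(group_mean)
```

Option 4 (building a model) needs a proper train/predict step and depends heavily on the data, so it is left out of this sketch.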

@Imran

There are various ways to handle missing values of categorical variables.

  1. Ignore observations with missing values, if we are dealing with a large data set and only a small number of records have missing values
  2. Ignore the variable, if it is not significant
  3. Develop a model to predict the missing values
  4. Treat missing data as just another category

Regards,
Steve
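
A minimal pandas sketch of options 1, 2 and 4 from the list above. The column name "gender" and the label "Missing" are assumptions chosen for illustration, not something prescribed by the reply.

```python
import numpy as np
import pandas as pd

# Hypothetical binary/categorical column with two missing entries.
df = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F", np.nan],
    "age": [23, 35, 41, 29, 52],
})

# Option 1: ignore (drop) the observations with a missing value.
dropped_rows = df.dropna(subset=["gender"])

# Option 2: ignore (drop) the variable altogether.
dropped_col = df.drop(columns=["gender"])

# Option 4: treat missing data as just another category.
df["gender_filled"] = df["gender"].fillna("Missing")
counts = df["gender_filled"].value_counts()
```

Keeping "Missing" as its own level lets a model learn whether the fact that a value is absent carries information by itself.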

Imran,
The same steps apply for a categorical variable as well.

  1. Ignore observation
  2. Replace by most frequent value
  3. Replace using an algorithm like KNN, based on the values of the nearest neighbours.
  4. Predict the missing value using a multiclass classifier.

Hope this helps.
Tavish
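
To illustrate the KNN idea in option 3, here is a small hand-rolled sketch using only numpy and pandas. It is not a library API: the function knn_impute_category, the toy columns "height", "weight" and "size", and the choice of k=3 with Euclidean distance on standardised features are all assumptions. It also assumes the feature columns are numeric and have no missing values themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "size" is categorical and missing in the last two rows.
df = pd.DataFrame({
    "height": [150.0, 152.0, 155.0, 180.0, 182.0, 185.0, 151.0, 183.0],
    "weight": [50.0, 52.0, 54.0, 80.0, 82.0, 85.0, 51.0, 84.0],
    "size": ["S", "S", "S", "L", "L", "L", None, None],
})

def knn_impute_category(df, target, features, k=3):
    """Fill missing `target` values by majority vote among the k nearest complete rows."""
    out = df.copy()
    # Standardise the features so that no single column dominates the distance.
    X = (df[features] - df[features].mean()) / df[features].std()
    known = df[target].notna()
    X_known = X[known].to_numpy()
    labels = df.loc[known, target].to_numpy()
    for idx in df.index[~known]:
        # Euclidean distance from this row to every row with a known label.
        dist = np.linalg.norm(X_known - X.loc[idx].to_numpy(), axis=1)
        nearest = labels[np.argsort(dist)[:k]]
        # Majority vote; on a tie, mode() is sorted and the first value is used.
        out.loc[idx, target] = pd.Series(nearest).mode()[0]
    return out

filled = knn_impute_category(df, "size", ["height", "weight"])
```

The same vote works for nominal and ordinal categories; for ordinal data one might instead take the median of the neighbours' ranks, which this sketch does not do.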

You can also look at this article:

Generalised low rank models (GLRM) can impute missing values by themselves. You can have a look at:

http://learn.h2o.ai/content/tutorials/glrm/glrm-tutorial.html

Hi @arpitqw,
Thanks for sharing the Stanford paper; chapter 5 is great.
Alain

Hi srivastava,

Can you explain how to replace by the most frequent value?
That is the second option you mentioned.

I would appreciate your reply.

Thanks
Haneesh

Hi @haneeshb,

It simply means replacing the missing values with the mode of the column. You can calculate the mode using df['col_name'].mode(); it returns a Series (there can be ties), so take the first entry with df['col_name'].mode()[0].

Hi @haneeshb, to replace by the most frequent value you can do this:

df["example"] = df["example"].fillna(df["example"].mode()[0])

where mode()[0] represents the most frequent value out of the n values.
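
As a small self-contained illustration of replacing by the most frequent value, something like the following should work. The toy column "example" and its values are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"example": ["red", "blue", np.nan, "red", np.nan]})

# mode() returns a Series because several values can tie; take the first one.
most_frequent = df["example"].mode()[0]

# fillna returns a new Series, so assign the result back to the column.
df["example"] = df["example"].fillna(most_frequent)
```

If several values are equally frequent, mode() returns them sorted, so [0] picks the first in sort order rather than anything more meaningful.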

Hi @AishwaryaSingh,
I am new to machine learning; please help me out with my model.
I have a dataset with two categorical columns: one has 100 unique entries and the other has 136 unique entries. The dataset has 20k observations. One-hot encoding or dummy variables make the number of columns explode. How should I preprocess these columns to fit my linear regression model?

Are there any working examples of using KNN to treat missing values in categorical data, for both nominal and ordinal types?