Imran
February 4, 2015, 7:31am
1
Hi,
For missing values in continuous variables, we use the following approaches:
Ignore these observations
Replace with the overall average
Replace with the average of similar observations (for example, a group-wise mean)
Build a model to predict the missing values
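As a minimal sketch of the two averaging options above, assuming pandas and a made-up data frame (the column names `city` and `income` are hypothetical, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" is continuous and has two missing values.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],
    "income": [10.0, np.nan, 30.0, 50.0, np.nan],
})

# Replace with the overall average of the observed values.
overall = df["income"].fillna(df["income"].mean())

# Replace with the average of similar observations: here, the mean
# of the same "city" group, computed row by row with transform.
group_mean = df.groupby("city")["income"].transform("mean")
by_group = df["income"].fillna(group_mean)
```

Which grouping counts as "similar" depends on the data; `city` is only an illustration.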
Can you suggest methods for handling missing values when the variable is binary (1/0 or M/F) or categorical?
Regards,
Imran
Steve
February 5, 2015, 7:11am
2
@Imran
There are various ways to handle missing values in categorical variables:
Ignore the observations with missing values, if the data set is large and only a small number of records have missing values
Drop the variable, if it is not significant
Build a model to predict the missing values
Treat missing data as just another category
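A minimal sketch of the first and last options, assuming pandas; the column name `gender` and the label `"Missing"` are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries in a categorical column.
df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", np.nan]})

# Ignore the observations: drop rows where the column is missing.
dropped = df.dropna(subset=["gender"])

# Treat missing data as just another category: fill with an explicit label.
as_category = df["gender"].fillna("Missing")
```

Keeping a separate "Missing" level lets the model learn whether the missingness itself carries information, at the cost of one extra category.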
Regards,
Steve
Imran,
The same steps apply for a categorical variable as well.
Ignore the observation
Replace with the most frequent value
Replace using an algorithm such as KNN, based on the nearest neighbours
Predict the missing value using a multiclass classifier
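The last two options can be sketched together, since a KNN classifier is one way of predicting the missing category from the other columns. This is only an illustration under stated assumptions: it uses scikit-learn's `KNeighborsClassifier`, and the data and column names (`height`, `weight`, `gender`) are invented.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: "gender" is missing in the last two rows.
df = pd.DataFrame({
    "height": [150.0, 152.0, 155.0, 180.0, 182.0, 185.0, 151.0, 183.0],
    "weight": [50.0, 52.0, 55.0, 80.0, 82.0, 85.0, 51.0, 83.0],
    "gender": ["F", "F", "F", "M", "M", "M", np.nan, np.nan],
})

features = ["height", "weight"]
known = df[df["gender"].notna()]    # rows used to train the classifier
unknown = df[df["gender"].isna()]   # rows whose category we want to fill

# Fit on the complete rows. On real data the features would usually be
# scaled first, because KNN relies on distances.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(known[features], known["gender"])

# Write the predicted categories back into the missing cells.
df.loc[unknown.index, "gender"] = model.predict(unknown[features])
```

Any other multiclass classifier could be swapped in for `KNeighborsClassifier`; the split into known and unknown rows stays the same.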
Hope this helps.
Tavish
kunal
February 20, 2015, 5:02pm
4
You can also look at this article:
Tutorial on data exploration that comprises missing value imputation, outliers, feature engineering, variable creation in data science and machine learning
Generalised low rank models (GLRM) can also impute missing values themselves. You can have a look at:
http://learn.h2o.ai/content/tutorials/glrm/glrm-tutorial.html
Hi @arpitqw,
Thanks for sharing the Stanford paper; chapter 5 is great.
Alain
Hi srivastava,
Can you explain how to replace missing values with the most frequent value?
That was the second option you mentioned.
I would appreciate your reply.
Thanks,
Haneesh
Hi @haneeshb,
It simply means replacing the missing values with the mode of the column. You can calculate the mode using df['col_name'].mode(). This returns a Series (there can be ties), so take the first element with df['col_name'].mode()[0].
vinodrs
February 24, 2020, 6:09am
9
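Hi @haneeshb, to replace missing values with the most frequent value
you can do something like:
df["example"] = df["example"].fillna(df["example"].mode()[0])
where mode()[0] is the most frequent of the n values. Note that mode is a method, so it needs parentheses, and inplace=True should be dropped when the result is assigned back to the column.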
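A self-contained sketch of that mode replacement, assuming pandas and an invented column of colours:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values.
df = pd.DataFrame({"example": ["red", "blue", np.nan, "red", np.nan]})

# mode() ignores missing values by default and returns a Series,
# so [0] picks the first (most frequent) value.
most_frequent = df["example"].mode()[0]

# Assign the filled column back instead of using inplace=True.
df["example"] = df["example"].fillna(most_frequent)
```

If several values tie for most frequent, `mode()` returns all of them in sorted order and `[0]` simply takes the first.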
Hi @AishwaryaSingh,
I am new to machine learning; please help me out with my model.
I have a dataset with two categorical columns, one with 100 unique values and the other with 136. The dataset has 20k observations. One-hot encoding (or creating dummy variables) makes the number of columns explode. How should I preprocess these columns for my linear regression model?
Are there any working examples that use KNN to treat missing values in categorical data, for both nominal and ordinal types?