Dealing with missing categorical data

r
machine_learning
data_wrangling
data_science

#1

Hi everybody,

I’m new to Data Science and I have a problem statement.

I’ve got three categorical variables, one of which has missing data. I need to predict the missing values in that feature using the other two. Which algorithm should I use to predict the values?

I need to predict Outlet_size based on Outlet_location_type and Outlet_type. Please guide me on which algorithm I should use, or are there alternative ways to fill in the missing data?

Thanks,
Prabakar


#2

Hi @prabakarsas, if you are using R then you can use rpart for missing data imputation.
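
A minimal sketch of that idea, not a definitive implementation: fit a classification tree on the rows where Outlet_size is known and predict it for the rows where it is missing. The data frame name `outlets` and all the values in it are made up for illustration; only the column names come from the question. The `rpart.control` settings are an assumption chosen so that the tree can split on such a tiny toy dataset.

```r
library(rpart)

# Hypothetical toy data: column names follow the question, values are invented.
outlets <- data.frame(
  Outlet_location_type = factor(rep(c("Tier 1", "Tier 2", "Tier 3"), times = 4)),
  Outlet_type = factor(rep(c("Grocery Store", "Supermarket Type1"), each = 6)),
  Outlet_size = factor(c("Small", "Small", "Medium", NA, "Small", "Medium",
                         "Medium", "High", NA, "Medium", "High", "High"))
)

# Rows where the target is observed are used for training.
known <- !is.na(outlets$Outlet_size)

# Classification tree predicting Outlet_size from the other two columns.
# minsplit and cp are relaxed only because the toy data is so small.
fit <- rpart(Outlet_size ~ Outlet_location_type + Outlet_type,
             data = outlets[known, ],
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

# Predict a class label for each row with a missing Outlet_size and fill it in.
predicted <- predict(fit, newdata = outlets[!known, ], type = "class")
outlets$Outlet_size[!known] <- as.character(predicted)

table(outlets$Outlet_size, useNA = "ifany")
```

On real data you would normally keep the default control settings and check the tree on held-out rows before trusting the filled-in values.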


#3

@prabakarsas
Option 1: Mode imputation, though it may introduce bias.
Option 2: Create a new category named “unknown”, so that you don’t lose samples.
Option 3: Leave the samples with missing categorical values out of the training data and include them in the test set.
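
Options 1 and 2 can be sketched in base R as follows. The vector `size` and its values are made up for illustration; in practice it would be the Outlet_size column.

```r
# Hypothetical values, for illustration only.
size <- factor(c("Small", "Medium", NA, "Medium", "High", NA, "Medium"))

# Option 1: replace each NA with the most frequent observed level (the mode).
# table() ignores NA by default, so the mode is taken over observed values only.
mode_level <- names(which.max(table(size)))
size_mode <- size
size_mode[is.na(size_mode)] <- mode_level

# Option 2: keep every row and turn missingness into its own level.
size_unknown <- addNA(size)  # adds NA as an explicit factor level
levels(size_unknown)[is.na(levels(size_unknown))] <- "unknown"

table(size_mode)
table(size_unknown)
```

Option 2 keeps the information that a value was missing, which a model can then use as a signal, whereas Option 1 hides it.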


#4

Hi,
You can use the mice package for this! It imputes missing values based on the other columns in the dataset.
In the code below, my categoricalColumns data frame holds all the categorical columns together.

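#R code
library(mice)
library(lattice)

# Multiple imputation: build 5 imputed datasets, 40 iterations each.
# Note: 'logreg' is meant for two-level factors; for a factor with more
# than two levels, mice provides 'polyreg'.
tempdata <- mice(categoricalColumns, m = 5, maxit = 40, method = "logreg", seed = 500)

# Take the first of the 5 completed datasets.
completedData <- complete(tempdata, 1)
table(is.na(completedData))        # check that no missing values remain
categoricalColumns <- completedData
table(is.na(categoricalColumns))
#-----------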

#5

Thanks to @pjoshi15 @A.Malathi and @aakashahuja30794 for their answers! You can also refer to this answer in a previous thread


#6

I used the mice package to impute the values. I used the PMM (predictive mean matching) method, but several other methods are available as well. How can I measure the accuracy of these methods?

Thanks @pjoshi15, @A.Malathi, @aakashahuja30794 and @jalFaizy for the valuable inputs.