How to impute missing values for a variable like Gender?

missing_values

#1

Hi all,

How should one think about missing values in variables such as gender or marital status, especially dichotomous ones? I understand that mode imputation might yield bad results, and correlation analysis cannot be applied to these variables either.

What other ways are there to do this kind of imputation? Please advise.


#2

Hi Karthikeyan,

Depending on the problem at hand, we can try one of the following:

  1. As you mentioned, mode imputation is one option.
  2. Missing values can be treated as a separate category: create an extra level for them and use it like any other level.
  3. If the number of missing values is small relative to the number of samples, and the total number of samples is large, we can remove those rows from the analysis.
  4. We can impute based on the values of other variables in the dataset: identify rows similar to the given row and use their values.
  5. We can also fit a model that predicts the missing values, using all the other variables as inputs.

Depending on the nature of the problem and the dataset, we can choose whichever seems most suitable.
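
As a rough illustration, options 1 to 4 might look like this in pandas. This is only a sketch: the toy data and the column names (Gender, Married) are made up, and option 4 is reduced to its simplest form, taking the mode among rows that share the value of one related column.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male", np.nan, "Female", "Male"],
    "Married": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes"],
})

# Option 1: fill with the mode (most frequent observed value).
mode_value = df["Gender"].mode(dropna=True)[0]
gender_mode = df["Gender"].fillna(mode_value)

# Option 2: treat "missing" as its own level.
gender_flagged = df["Gender"].fillna("Missing")

# Option 3: drop the rows where Gender is missing.
df_dropped = df.dropna(subset=["Gender"])

# Option 4 (crude version): fill with the mode among rows that share
# the same value of a related column, here Married.
def fill_with_group_mode(s):
    m = s.mode()
    # Leave the group untouched if it has no observed value at all.
    return s.fillna(m.iloc[0]) if not m.empty else s

gender_grouped = df.groupby("Married")["Gender"].transform(fill_with_group_mode)
```

On this toy data the plain mode fills both blanks with the same value, while the group-wise version can fill them differently depending on Married. Which behaviour is preferable depends on the dataset.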

Thanks,
SRK


How to impute categorical missing values?
#3

Awesome !! Thank you @SRK


#4

Hi @karthe1, R provides the MICE (multiple imputation by chained equations) package for handling missing values. The steps are as follows:

  1. Convert the variables with missing values to factors using as.factor().
  2. Create a data set containing the fully known variables and the variable with missing values.
  3. Run mice() on the new data set, then read about the complete() command from the MICE package, which extracts the imputed data from the result.

Check this article: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
Hope this helps.

#5

I think MICE doesn’t impute missing factor values.


#6

We can convert the variable to numeric with as.numeric(), since gender takes only two values, male or female (0 or 1).


#7

That we can do, but I want to know how to fill in the values of factor variables. I am unable to increase the accuracy of my model. So upset :(


#8

I tried using as.numeric(), but it converts only Boolean variables to numeric. So we have to convert it into T or F.


#9

Try converting to a factor and then to numeric; that might help…
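
For anyone doing the same conversion in pandas rather than R, here is a minimal sketch of two ways to turn a two-level variable into numbers. The sample values are invented. The point to watch is how each approach represents the missing entries.

```python
import numpy as np
import pandas as pd

gender = pd.Series(["Male", "Female", np.nan, "Female"])

# Explicit mapping: entries that are missing (or not in the dict) stay NaN,
# so they can still be imputed afterwards.
gender_num = gender.map({"Female": 0, "Male": 1})

# Category codes: categories are sorted alphabetically, and missing values
# get the code -1, which must not be mistaken for a real level.
codes = gender.astype("category").cat.codes
```

The explicit mapping keeps the blanks visible as NaN, whereas the category codes silently turn them into -1, so the second route needs an extra step before modelling.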


#10

It's better to use an "Others" category for blanks!


#12

Hey SRK,
I appreciate your skills at hackathons!
As I am new to R, this is the first hackathon I am participating in. I have learned from the Analytics Edge MOOC on R, but I am still struggling with data preparation and cleaning, as the MOOC did not focus on them. Can you help me with your scripts from any previous hackathons, or with links, so that I can learn in a better way without surfing randomly?
Thanks


#13

Hi, is it possible that if I don't convert the Gender or loan status variable to numeric (after reading the dataset and running a regression on it), I will get an incorrect or bad equation?


#14

Hi Gurpreet,

Since most of my code is in Python, these R scripts by Rohan Rao will be of help to you. Thank you.

Thanks,
SRK


#15

@riteshmehr logistic regression will do fine, as our variables are factors.
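
In R, glm() dummy-codes factor columns on its own. In Python the same step is usually done explicitly before fitting. A minimal pandas sketch, assuming invented column names and values (Property_Area is only there to show a variable with more than two levels):

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

# drop_first=True keeps k-1 indicator columns per variable, which avoids
# perfectly collinear columns in a regression that has an intercept.
X = pd.get_dummies(df, columns=["Gender", "Property_Area"], drop_first=True)
```

Each categorical variable becomes a set of 0/1 indicator columns, so the model does not read an ordering into arbitrary numeric codes.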


#16

How do I perform the analysis in SAS? I am a complete newbie to this and am getting quite low accuracy, somewhere around 64%. How can I improve it? I assigned numeric values to the blanks before running the test, converted every two-level categorical variable to 0 and 1, and converted those with more than two options to numeric codes. Please tell me how to approach this and improve.


#17

Hi @abh99, there may be many reasons why your accuracy is low. For example: are you doing all the steps correctly? Are you using the right ML algorithm?

First, I would recommend following a basic tutorial on machine learning using SAS (like this) and then trying the loan prediction problem again.


#18

How about counting the blank values and splitting them according to the percentage of males and females in the overall dataset? Deciding which rows get which value could be another issue, but for now we could assign them randomly.
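
A minimal sketch of that idea in pandas and NumPy, assuming invented sample values: each blank gets a label drawn at random with the proportions observed among the non-missing entries.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

gender = pd.Series(
    ["Male", "Male", "Male", "Female", np.nan, np.nan, "Male", "Female"]
)

# Observed class proportions, ignoring the blanks.
probs = gender.value_counts(normalize=True)

# Draw one label per missing entry according to those proportions.
missing = gender.isna()
draws = rng.choice(probs.index.to_numpy(), size=int(missing.sum()), p=probs.to_numpy())

gender_filled = gender.copy()
gender_filled.loc[missing] = draws
```

Unlike mode imputation, this roughly preserves the male/female split of the column, although individual rows are still assigned by chance.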


#19

Use pandas, like this:

data.Self_Employed = data.Self_Employed.fillna('No')