How to handle missing/undefined values in our data?

missing_values
data_wrangling

#1

Hi,

I have a lot of trouble dealing with missing and undefined values in my data. What are the most general and reliable ways to deal with missing values? And how should I handle undefined values that arise when transforming variables, for example with log()? Which is the better choice: replacing them with the mean of the remaining values (or something similar), or replacing them with a constant such as 0? And how do I know which approach to use in which situation?

Thanks


#2

Hi,

It actually involves various techniques, because data can be missing for various reasons. If you understand the origin of your data and why some values are missing, it becomes much easier to handle them.
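As a first step toward understanding why values are missing, here is a minimal sketch in base R. The data frame `df` and its columns are made up for illustration; it only counts missing values per column and checks whether missingness in one column seems related to another.

```r
# Hypothetical toy data frame; column names are illustrative only.
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(50, 62, NA, 80, 58),
  city   = c("A", "B", NA, "A", "B")
)

# Number of missing values per column.
na_counts <- colSums(is.na(df))

# Share of rows that have no missing value at all.
complete_share <- mean(complete.cases(df))

# Is `age` missing more often for some cities than others?
# A cross-tabulation gives a rough hint.
age_missing_by_city <- table(
  city = df$city,
  age_missing = is.na(df$age),
  useNA = "ifany"
)

print(na_counts)
print(complete_share)
print(age_missing_by_city)
```

If missingness clusters in particular groups, as `age` does for city "B" in this toy data, simple fills such as the overall mean are more likely to mislead.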

I am attaching a PDF, missing.pdf (558.9 KB), which will help you understand the best techniques for handling and imputing missing data through R.

Hope this helps.

Regards,
Aayush Agrawal


#3

Try out the Amelia II package in R.


#4

Hi Aayush,

For some reason, I am unable to download the PDF file. Could you please send it to my personal ID, krishna.nitt@gmail.com?

Thanks much for all the help.

Regards,
Krishna


#5

@krishna1072,

I don’t have the file right now, but for missing value treatment you can read this link:
http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/

Otherwise, tree-based algorithms like xgboost often do well these days if you just replace missing values with a placeholder value such as -1 or -999, chosen so that it cannot be confused with a real value.
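To make the placeholder idea concrete, and to compare it with the mean replacement asked about in the original question, here is a small base R sketch. The vector `x` and the names `placeholder`, `x_was_missing`, `x_filled` and `x_mean_filled` are assumptions for illustration, not taken from any package.

```r
# Hypothetical feature column with gaps; all real values are non-negative.
x <- c(3.2, NA, 7.5, 0.0, NA, 4.1)

# The placeholder must lie outside the range of real values.
placeholder <- -999

# Keep a flag so the information "this value was missing" is not lost.
x_was_missing <- is.na(x)

# Option 1: replace missing entries with the placeholder.
x_filled <- ifelse(x_was_missing, placeholder, x)

# Option 2: replace missing entries with the mean of the observed values.
x_mean_filled <- ifelse(x_was_missing, mean(x, na.rm = TRUE), x)

print(x_filled)
print(x_mean_filled)
```

A placeholder like this tends to suit tree-based models, which can split it off from the real values. For linear models, a value such as -999 would distort the fit, so mean or model-based imputation is usually the safer choice there.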

Hope this helps.

Regards,
Aayush


#6

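Hey @adityashrm21, try looking at the mice package. All you have to do is convert all the variables with missing values to numeric, then run mice() and extract the imputed data with complete().
For example:
require(mice)
set.seed(45)
imputed = mice(data)      # fit the imputation models on the data frame
data = complete(imputed)  # extract one completed data set
That will impute all the missing values based on the other variables.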