Missing value treatment in the dataset

r
missing_values

#1

Hi
I am working with the attached dataset on a linear regression problem and have performed the steps below:

dataset <- read.csv("LinearRegressionCase.csv")
View(dataset)
summary(dataset)

Function to replace #NULL! with NA

nullrep <- function(x) {
  ifelse(x == "#NULL!", NA, x)
}

df<- as.data.frame(apply(dataset,2,nullrep))
summary(df)

Now I need to replace the NAs with the mean or with 0, but this throws an error saying that the columns need to be converted to numeric.

Please help me remove the NAs (using a function) and convert the variables to numeric.


#2

LinearRegressionCase.csv (3.4 MB)

Attached is the dataset.


#3

The dataset cannot be downloaded, but the problem is probably a factor or character variable in the dataset. apply() converts the data frame to a matrix first, so if there is even one character or factor variable, every column is turned into character when you apply the function.
You can restrict your function to the numeric columns like this, where non_numeric_cols is a placeholder for the indices of the non-numeric columns:
dataset[-non_numeric_cols] <- apply(dataset[-non_numeric_cols], c(1, 2), nullrep)
Now you can perform your task.
OR,
You can convert the affected columns of df to numeric, where cols_to_convert is a placeholder for the indices of the columns that need to be changed:
df[cols_to_convert] <- apply(df[cols_to_convert], 2, as.numeric)
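Since the real file could not be opened, here is a minimal sketch on a made-up four-row CSV (the columns other than lnwireten are invented for illustration). It shows two options: telling read.csv() to treat "#NULL!" as NA while reading, or converting the affected columns afterwards. The second option assumes that any column containing "#NULL!" is otherwise numeric.

```r
# Toy stand-in for LinearRegressionCase.csv; the real columns are not known here
csv_text <- "id,region,lnwireten,income
1,North,3.2,50
2,South,#NULL!,62
3,North,4.1,#NULL!
4,East,#NULL!,48"

# Option 1: declare the marker as NA at read time, so numeric columns stay numeric
dataset <- read.csv(text = csv_text, na.strings = "#NULL!")
str(dataset)

# Option 2: the data was already read with the marker as text
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Find the columns that contain the marker
has_marker <- vapply(raw, function(col) any(col == "#NULL!", na.rm = TRUE), logical(1))

# Replace the marker with NA, then convert those columns to numeric
# (assumption: these columns hold only numbers apart from the marker)
raw[has_marker] <- lapply(raw[has_marker], function(col) {
  col[col == "#NULL!"] <- NA
  as.numeric(col)
})
str(raw)
```

Working column by column with lapply() avoids the matrix conversion that apply() performs, so columns such as region keep their type.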


#4

Please try this attachment Linear Regression Case.csv (241 Bytes)

Thanks for the quick response though.


#5

Hi @anilkumarqa86

The problem is that the variables in df that contain NA values are all factor variables. You need to convert them to numeric, because 0 or the mean of the values is not an existing factor level.

For example, to impute the mean value in the variable lnwireten, you can use the following code:

table(is.na(df$lnwireten))
FALSE  TRUE 
1344  3656 

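# Go through as.character() first: as.numeric() on a factor returns the level codes, not the values
df$lnwireten <- as.numeric(as.character(df$lnwireten))
df$lnwireten[is.na(df$lnwireten)] <- mean(df$lnwireten, na.rm = TRUE)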

table(is.na(df$lnwireten))
FALSE 
5000 
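If many variables need the same treatment, the two steps can be wrapped in a small helper and applied to a set of columns with lapply(). This is only a sketch: impute_mean, num_cols and the toy data frame are names made up for illustration, and the list of columns to convert is chosen by hand.

```r
# Hypothetical helper: convert a column to numeric and fill its NAs with the column mean
impute_mean <- function(x) {
  # as.character() first, so factor labels (not level codes) are converted
  x <- as.numeric(as.character(x))
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data frame standing in for df; only lnwireten is a name from the thread
toy <- data.frame(
  lnwireten = factor(c("3.2", NA, "4.1", NA)),
  income    = factor(c("50", "62", NA, "48")),
  region    = factor(c("North", "South", "North", "East"))
)

# Columns that should be numeric (picked by hand in this sketch)
num_cols <- c("lnwireten", "income")
toy[num_cols] <- lapply(toy[num_cols], impute_mean)

# Count the remaining NAs per column
colSums(is.na(toy))
```

Genuinely categorical columns such as region are left out of num_cols on purpose, since converting them to numeric would only produce NAs.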

Hope this helps!


#6

Thanks @shashwat.2014

I want to create a function for this, since a huge number of variables need to be converted and imputed; doing them one at a time would be inefficient and time-consuming.
Please help me create such a function if you can.


#7

You can use the mice package instead. This will save you a lot of time.
Convert all the factors to numeric type and select the pmm (predictive mean matching) method.

imputed_Data <- mice(data.frame(df$var1,df$var2,.......), m=5, maxit = 5, method = 'pmm', seed = 200)
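For reference, a minimal sketch of the full round trip on simulated data, assuming the mice package is installed (the block is skipped otherwise). The toy data frame and its column names are invented; mice() returns an object holding m imputed versions, and complete() extracts one of them as an ordinary data frame.

```r
# Simulated numeric data with a few values knocked out (illustrative only)
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(50)
toy$x1[c(3, 10, 27)] <- NA
toy$y[c(5, 40)] <- NA

if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Same settings as in the call above; printFlag = FALSE only silences the iteration log
  imp <- mice(toy, m = 5, maxit = 5, method = "pmm", seed = 200, printFlag = FALSE)

  # Extract the first of the five completed datasets
  completed <- mice::complete(imp, 1)
  print(colSums(is.na(completed)))
}
```

Using a single completed dataset is the simplest route; pooling results over all m datasets is the more rigorous option that mice is designed for.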

Regards


#8

@shashwat.2014

Thank you very much. I am now able to impute the missing data after converting the variables (with NA) to numeric.

Now my question is: do I need to convert the other factor variables to numeric as well, or can I proceed as is? And how do I apply the lm function to the dataset?


#9

@anilkumarqa86

You can apply regression models to the dataset directly; you do not need to convert categorical variables to numeric for that. However, I would recommend doing some data exploration and generating some new features before fitting the model.

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

You can apply your linear model using this.
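As a small illustration of why factors can stay as they are, here is a sketch on simulated data (all column names are invented). lm() expands a factor predictor into indicator variables on its own, using the first level as the reference.

```r
# Simulated data with one numeric and one categorical predictor (illustrative only)
set.seed(42)
toy <- data.frame(
  region = factor(rep(c("North", "South", "East"), each = 10)),
  income = rnorm(30, mean = 50, sd = 5)
)
toy$spend <- 10 + 0.5 * toy$income + ifelse(toy$region == "South", 3, 0) + rnorm(30)

# The factor goes into the formula unchanged
fit <- lm(spend ~ income + region, data = toy)
summary(fit)

# One coefficient per non-reference level; "East" is the reference (first level alphabetically)
names(coef(fit))
```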
All the best!