R - removing NA values


#1

Hi Guys, total noob here. How would you go about removing NA values in R?

Getting an error here -> “Error in na.fail.default(list(as.factor(Disbursed) = c(1L, 1L, 1L, 1L, :
missing values in object”


#2

hello @Mitalee,

There are several options in R to remove NA values, like na.rm, na.omit, etc. Which option is appropriate for your case depends on your requirement.
Please try to be a little more specific with your questions so that better answers can be given. :stuck_out_tongue:
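For anyone landing here later, here is a minimal sketch of the two options mentioned above. The vector and data frame are made up for illustration and do not come from the hackathon data.

```r
# Toy data, invented for illustration only
x <- c(1, 2, NA, 4)

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # the missing value is skipped for this one calculation

df <- data.frame(amount = c(100, NA, 300),
                 city   = c("A", "B", NA))

# na.omit() drops every row that has an NA in any column
df_clean <- na.omit(df)
nrow(df_clean)          # 1
```

Roughly: na.rm is an argument that makes a single calculation ignore NAs, while na.omit() returns a new object with the incomplete rows removed.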


#3

Well, use is.na(). Example:
traindata <- traindata[!is.na(traindata$Loan_Amount_Applied), ]

This removes all the NAs for Loan_Amount_Applied (you have 71 in the train data set); after this the data is a little cleaner.
I have included the markdown so you can see the code; it generates the report I shared in my mail before.

AlainFirstAnalysis.Rmd.zip (2.9 KB)
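Before filtering like this, it can help to count how many NAs each column has. A small sketch, assuming a toy stand-in for traindata (the values below are invented; only the column names follow the thread):

```r
# Toy stand-in for traindata; the real file is not reproduced here
traindata <- data.frame(
  Loan_Amount_Applied = c(5000, NA, 12000, NA, 8000),
  Disbursed           = c(0, 1, 0, 0, 1)
)

# Number of NAs per column before filtering
na_counts <- colSums(is.na(traindata))
na_counts

# Keep only the rows where the loan amount is present
traindata <- traindata[!is.na(traindata$Loan_Amount_Applied), ]
nrow(traindata)   # 3
```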


#4

hi,
And what about NAs in the test file?


#5

awesome Lesaffrea… thanks… will try.

Shuvayan, I’m THAT kind of noob that doesn’t know how to be specific yet… kindly bear with me till I catch up! In this case, the loan_amount_applied had NA errors but I didn’t want to put it out there as I wasn’t sure whether it was the issue… got advice from a gyani to try RF, so working on it.

The goal in this hackathon is to get 0.5!


#6

Mitalee,
use a simple rule for the NAs on the loan amount in the test file. It could work: the number is quite small, 33, and you have 37000 rows to answer, so even if a few are wrong your ROC will not be influenced a lot. I did not do the calculation, but it should be around .002%.
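As a quick check on how small that is, the share of test rows affected can be computed from the two counts quoted above. Note this is only the share of rows, not the change in ROC, which was not calculated in this thread.

```r
n_na_test <- 33      # rows with a missing loan amount, as quoted above
n_test    <- 37000   # rows to answer, as quoted above

# Share of test rows touched by the simple rule, in percent
share_pct <- 100 * n_na_test / n_test
round(share_pct, 3)  # 0.089
```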


#7

what about just removing it using complete.cases?
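A rough sketch of what that would look like on a data frame. The testdata frame and its ID and City columns are hypothetical, made up for the example:

```r
# Hypothetical test file with a few missing values
testdata <- data.frame(
  ID                  = 1:5,
  Loan_Amount_Applied = c(5000, NA, 12000, 7000, NA),
  City                = c("A", "B", NA, "D", "E")
)

# TRUE for rows with no NA in any column
keep <- complete.cases(testdata)

testdata_complete <- testdata[keep, ]
dropped_ids <- testdata$ID[!keep]

nrow(testdata_complete)   # 2
dropped_ids               # 2 3 5
```

One thing to keep in mind: any row dropped from a test file gets no prediction, so it is worth keeping track of the dropped IDs.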


#8

I'm not the judge!! My belief is that if I were writing the scoring script for the test, I would expect all results and reject the submission if any were missing. I have not submitted yet, but I should tell you that for the first submission I use a simple rule, no imputation yet.
In the training set, as you saw, we have 40 with money out of the 71.
If you really want to play with it, you say OK, I do not know, and you flip a coin slightly biased to reproduce the training set distribution: build a sample with this, cbind() it with the NA rows, and done!!! Actually I could try this and then submit (very predictive method, by the way!!!)
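The biased coin idea could be sketched like this. It assumes that "40 with money out of the 71" means 40 of the 71 NA rows in training had Disbursed = 1, and it uses the 33 NA test rows mentioned earlier; the na_rows frame and its ID column are hypothetical.

```r
set.seed(1)          # any seed, only so the draw is reproducible

p_one <- 40 / 71     # assumed share of Disbursed = 1 among NA rows in training
n_na  <- 33          # NA rows in the test file, as quoted earlier in the thread

# Hypothetical test rows with a missing loan amount
na_rows <- data.frame(ID = seq_len(n_na), Loan_Amount_Applied = NA_real_)

# One biased coin flip per row
guess <- rbinom(n_na, size = 1, prob = p_one)

# Attach the guesses to the NA rows
na_rows <- cbind(na_rows, Disbursed = guess)
head(na_rows)
```

This only reproduces the training proportion on average; it uses no information about the individual rows.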


#9

makes sense. guess i’m just stumbling in the dark here.


#10

Hi Guys,

Out of 87020 rows, some variables have as many as 59600 missing values.

Should I :-

  1. Drop these entire rows
  2. Impute with 0
  3. Impute with median
  4. Run a model to predict the missing values
  5. Create a new variable, as flag to indicate missing or non-missing

This is a case where almost 70% of the data is missing; I have thus far only worked with around 1% missing data. I am absolutely clueless as to what must be done in this case.
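For reference, options 3 and 5 from the list above can be combined in a few lines. This is only a sketch on an invented income column, not a recommendation for this data set:

```r
# Toy column with half of the values missing (invented data)
df <- data.frame(income = c(1200, NA, 3400, NA, NA, 2500))

# Share of missing values in the column
mean(is.na(df$income))   # 0.5

# Option 5: flag recording which values were missing
df$income_missing <- as.integer(is.na(df$income))

# Option 3: fill the gaps with the median of the observed values
med <- median(df$income, na.rm = TRUE)   # 2500
df$income_imputed <- ifelse(is.na(df$income), med, df$income)

df
```

Creating the flag before imputing keeps the information about which values were originally missing.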

Any help, guidance or insight would be much appreciated.

Thanks in advance


#11

Well, I spent time today looking at this in more detail.
Point 3 gives some increase in ROC. Not a lot, but you can do it, and it is already something. I work with a small model.

Point 4: this is sometimes suggested. I work with a small model of 7 variables. So far it has taken a lot of time with no gain compared to point 3, but my model is already at 1 on accuracy, so on paper the model is perfect… That does not mean it would not bring something with more variables.
It also depends on which method you use; more later about this topic.

Point 5: if you create a new variable and you leave the NAs in, do you know how your model behaves? If you use random forest, just as an example, what happens? How do you set na.action? And if you use caret on top, how does caret set na.action? Your observation could be ignored anyway, and then your new variable has no effect whatsoever!!

Hope this helps.
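To make the na.action point concrete, here is a small base R sketch using model.frame() on invented data. As far as I know, randomForest's formula interface defaults to na.action = na.fail, which would match the error in the first post, but treat that as an assumption and check the documentation of the package you use.

```r
# Invented data: x has NAs, x_missing flags them
df <- data.frame(y = c(1, 0, 1, 0, 1, 0),
                 x = c(10, NA, 30, NA, 50, 60))
df$x_missing <- as.integer(is.na(df$x))

# With na.omit, rows with an NA in x are dropped before any fitting
mf <- model.frame(y ~ x + x_missing, data = df, na.action = na.omit)
nrow(mf)               # 4
unique(mf$x_missing)   # 0, so the flag is constant and carries no information

# With na.fail, the same call stops with an error instead
msg <- tryCatch(
  {
    model.frame(y ~ x + x_missing, data = df, na.action = na.fail)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

So a missing-value flag only helps if the NAs in the original variable are also filled in or that variable is left out of the model.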
Alain


#12

Thanks for the help @Lesaffrea


#13

haha…
it's ok
1 month ago I was a total noob here too.

Well, to remove NA values you have many options:

  1. is.na()
  2. na.rm (an argument to functions such as mean() and sum(), not a function itself)

They all look the same, but there is a difference.

x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
print(bad)
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE
x[!bad]
[1] 1 2 4 5

What if there are multiple R objects and you want to take the subset with no missing values in any
of those objects?

x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
good
[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE
x[good]
[1] 1 2 4 5
y[good]
[1] "a" "b" "d" "f"

Well dear, there are many other ways, but they will just confuse you so be


#14

Cool… thanks much!


#15

You're welcome.