Hello R Users
I’ve read many blog posts on variable selection, but couldn’t find a single approach that is recommended everywhere.
I dug deeper and came across a few methods in R for finding the most significant variables to use in model building:
#using knn imputation
library(DMwR)
inputData <- knnImputation(Data)
and,
#use random forest to find set of predictors
library(party)
cf1 <- cforest(ozone_reading ~ ., data = inputData, controls = cforest_unbiased(mtry = 2, ntree = 50))
#get variable importance based on mean decrease in accuracy
varimp(cf1)
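#conditional permutation importance (adjusts for correlation between predictors)
varimp(cf1, conditional = TRUE)
#AUC-based importance, more robust to class imbalance (applies to binary classification responses only)
varimpAUC(cf1)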
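To check my understanding of what "mean decrease in accuracy" measures, here is a minimal base-R sketch of the general permutation idea. It is not what party does internally: it assumes a plain lm() as a stand-in model, a simple train/test split instead of out-of-bag samples, and the built-in airquality data (with Ozone as the response) in place of my inputData.

```r
# Sketch only: permutation importance with lm() as a stand-in model.
# Assumptions: airquality replaces inputData, Ozone replaces ozone_reading.
set.seed(1)
demoData <- na.omit(airquality)
trainIdx <- sample(nrow(demoData), size = floor(0.7 * nrow(demoData)))
trainSet <- demoData[trainIdx, ]
testSet  <- demoData[-trainIdx, ]

fit <- lm(Ozone ~ ., data = trainSet)

# Mean squared error of the fitted model on a data frame
mse <- function(model, df) mean((df$Ozone - predict(model, newdata = df))^2)
baseMse <- mse(fit, testSet)

predictors <- setdiff(names(demoData), "Ozone")

# For each predictor: shuffle it in the held-out set and record the MSE increase
permImp <- sapply(predictors, function(v) {
  shuffled <- testSet
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - baseMse
})

print(sort(permImp, decreasing = TRUE))
```

A larger increase in error after shuffling suggests the model relies more on that variable.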
and,
#using relaimpo package
#check for relative importance of variable
install.packages("relaimpo")
library(relaimpo)
#fit lm model
lmMod <- lm(ozone_reading ~ ., data = inputData)
#calculate relative importance, scaled to sum to 1 (i.e. 100%)
relImportance <- calc.relimp(lmMod, type = 'lmg', rela = TRUE)
#relative importance
sort(relImportance$lmg, decreasing = TRUE)
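As far as I understand it, type = 'lmg' averages each predictor's contribution to R-squared over all orders in which the predictors can enter the model. A minimal base-R sketch of that idea for just two predictors, assuming the built-in airquality data (Ozone as response, Wind and Temp as predictors) as a stand-in for my inputData:

```r
# Sketch only: LMG shares computed by hand for two predictors.
# Assumptions: airquality replaces inputData, Ozone replaces ozone_reading.
demoData <- na.omit(airquality[, c("Ozone", "Wind", "Temp")])

# R-squared of a linear model for a given formula
r2 <- function(f) summary(lm(f, data = demoData))$r.squared

r2Wind <- r2(Ozone ~ Wind)
r2Temp <- r2(Ozone ~ Temp)
r2Full <- r2(Ozone ~ Wind + Temp)

# Average each predictor's R-squared gain over both entry orders
lmg <- c(
  Wind = mean(c(r2Wind, r2Full - r2Temp)),
  Temp = mean(c(r2Temp, r2Full - r2Wind))
)

print(lmg)             # shares add up to the full-model R-squared
print(lmg / sum(lmg))  # normalised to sum to 1, as with rela = TRUE
```

With more predictors the number of orderings grows quickly, which is why I use the package rather than doing this by hand.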
and finally,
#boruta method
install.packages("Boruta")
library(Boruta)
boruta_output <- Boruta(ozone_reading ~ ., data = na.omit(inputData), doTrace = 2)
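The Boruta call on its own only runs the search. A minimal sketch of how the result could be inspected, assuming the Boruta package is installed and using the built-in airquality data (Ozone as response) as a stand-in for my inputData:

```r
# Sketch only: inspecting a Boruta result.
# Assumptions: airquality replaces inputData, Ozone replaces ozone_reading.
library(Boruta)

set.seed(42)
demoData <- na.omit(airquality)
boruta_demo <- Boruta(Ozone ~ ., data = demoData, doTrace = 0)

# Names of predictors confirmed as important (tentative ones excluded)
confirmed <- getSelectedAttributes(boruta_demo, withTentative = FALSE)

# Per-variable importance statistics and the final decision
stats <- attStats(boruta_demo)

print(confirmed)
print(stats)
```

Variables left as tentative can be resolved with TentativeRoughFix() if a hard decision is needed.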
My question is:
Is it necessary to select significant variables using these methods? What other options do I have? Is there a robust method that can be applied in all situations to find the most significant variables?
Somebody help!