Hello R users,
I've read many blog posts on variable selection, but I couldn't find one universal approach to follow.
I dug deeper and came across a few methods in R for finding the most significant variables to use in model building:
    # Impute missing values using kNN imputation
    library(DMwR)
    inputData <- knnImputation(Data)
    # Use a random forest to find a set of predictors
    library(party)
    cf1 <- cforest(ozone_reading ~ ., data = inputData,
                   controls = cforest_unbiased(mtry = 2, ntree = 50))
    # Variable importance based on mean decrease in accuracy
    varimp(cf1)
    # Conditional importance, which adjusts for correlations between predictors
    varimp(cf1, conditional = TRUE)
    # AUC-based importance, more robust to class imbalance (applies to binary classification)
    varimpAUC(cf1)
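To check my understanding of what "mean decrease in accuracy" measures, here is a minimal base-R sketch of the permutation idea. It is not what `party` does internally: I am assuming a made-up toy data frame (`toy`, with columns `x1`, `x2`, `noise`, `y`), a plain `lm` fit instead of a forest, and in-sample error instead of out-of-bag error. All names below are hypothetical.

```r
# Permutation importance sketch on hypothetical toy data (base R only)
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

# Fit any model; a linear model keeps the sketch dependency-free
fit <- lm(y ~ ., data = toy)

# Mean squared error of a model on a data frame that has a column y
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
baseline <- mse(fit, toy)

# For each predictor: shuffle its values, re-score the fitted model,
# and average the increase in error over 20 shuffles
perm_importance <- sapply(c("x1", "x2", "noise"), function(v) {
  increases <- replicate(20, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])  # break the link between v and y
    mse(fit, shuffled) - baseline
  })
  mean(increases)
})

# Larger values mean the model relies more on that variable
sort(perm_importance, decreasing = TRUE)
```

If I read it correctly, a variable whose shuffling barely changes the error is one the model does not depend on, which is the intuition behind `varimp(cf1)`.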
    # Use the relaimpo package to check the relative importance of variables
    install.packages("relaimpo")
    library(relaimpo)
    # Fit a linear model
    lmMod <- lm(ozone_reading ~ ., data = inputData)
    # Calculate relative importance, scaled to sum to 100%
    relImportance <- calc.relimp(lmMod, type = 'lmg', rela = TRUE)
    # Relative importance, largest first
    sort(relImportance$lmg, decreasing = TRUE)
    # Boruta method
    install.packages("Boruta")
    library(Boruta)
    # Use the same response column as in the models above
    boruta_output <- Boruta(ozone_reading ~ ., data = na.omit(inputData), doTrace = 2)
My questions are:
Is it necessary to select significant variables using these methods? What other options do I have? Is there a robust method that can be applied in all situations to find the most significant variables?