Robust way to find significant variables to build a model in R

r
random_forest

#1

Hello R Users

I’ve read many blog posts on variable selection, but couldn’t find a universally agreed-upon approach.
I dug deeper and came across a few methods in R for finding the most significant variables for model building:

    # impute missing values with kNN before modelling
    library(DMwR)
    inputData <- knnImputation(Data)

and,

    # use a random forest to find a set of important predictors
    library(party)
    cf1 <- cforest(ozone_reading ~ ., data = inputData, controls = cforest_unbiased(mtry = 2, ntree = 50))
    # variable importance based on mean decrease in accuracy
    varimp(cf1)
    # conditional importance, which adjusts for correlated predictors
    varimp(cf1, conditional = TRUE)
    varimpAUC(cf1)  # AUC-based importance, more robust to class imbalance

and,

    # use the relaimpo package to check the relative importance of variables
    install.packages("relaimpo")
    
    library(relaimpo)
    
    # fit a linear model
    lmMod <- lm(ozone_reading ~ ., data = inputData)
    
    # relative importance (lmg), scaled so the shares sum to 100%
    relImportance <- calc.relimp(lmMod, type = "lmg", rela = TRUE)
    
    # sort predictors by relative importance
    sort(relImportance$lmg, decreasing = TRUE)

and finally,

    # Boruta: all-relevant feature selection
    install.packages("Boruta")
    library(Boruta)
    
    boruta_output <- Boruta(ozone_reading ~ ., data = na.omit(inputData), doTrace = 2)
    # extract the confirmed (and tentative) attributes
    boruta_signif <- getSelectedAttributes(boruta_output, withTentative = TRUE)
    print(boruta_signif)

My question is:
Is it necessary for me to select significant variables using these methods? What other options do I have? Is there any robust method that can be applied in all situations to find the most significant variables?

Somebody help!


#2

Hi @indutaneja11,

The best way is to understand the data and then form hypotheses about which variables you think (or assume) affect your response variable.
For example, if you are predicting the price of a car, it is safe to assume that the number of cylinders, mpg, etc. are going to be significant. In this case, it is not necessary to use the packages.
Also, for cases like clustering, I am not aware of any way to find variable importance. You can of course assign weights, but that again depends on domain experience.
In short, I do not think there is one method that fits all cases. Dealing with variable importance is more a matter of trial and error, judgement, and packages.
Say you are running a linear regression model predicting sales from some advertising data. The advertising medium can be TV, radio, etc. It can very well happen that TV comes out insignificant, even though that contradicts common sense.
In such cases, we need to dig deeper and find out WHY.
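The advertising example above can be sketched as follows. This is a minimal illustration on simulated data (the `ads` data frame and its coefficients are made up for demonstration), showing where to read predictor significance in a fitted `lm`:

```r
# hypothetical sketch: simulate advertising data, then check which
# media come out significant in the regression summary
set.seed(1)
ads <- data.frame(TV    = runif(100, 0, 300),
                  radio = runif(100, 0, 50))
ads$sales <- 5 + 0.05 * ads$TV + 0.1 * ads$radio + rnorm(100)

fit <- lm(sales ~ TV + radio, data = ads)
summary(fit)$coefficients   # p-values are in the "Pr(>|t|)" column
```

If a predictor you expect to matter shows a large p-value here, that is the signal to dig deeper (collinearity, data quality, omitted variables) rather than to drop it blindly.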
Hope this helps!!


#3

@indutaneja11,

This is a common and important question, relevant for both beginners and intermediate data scientists. Business sense always matters a lot in feature selection, and it performs well too. Apart from that, there are various methods to deal with it; I structure the approach based on the number of features available:

  • Fewer than 25 features
    Here I would suggest a chi-squared test or a correlation table to identify the most significant variables. Another approach: build a decision tree and visualize it (this always gives you quick information about the significance of variables).

  • 25 to 100 features
    Perform principal component analysis (or another dimensionality reduction technique), which collapses similar variables. This reduces the number of variables, and we then identify the significant ones using a chi-squared test or correlation table.

  • More than 100 features
    Some analyses start with more than 1,000 variables. The first step for an initial shortlist is usually to compute the information value (IV); variables with a high IV are kept. This generally reduces the variable list by 70-80%, removing non-varying variables and variables independent of the target. Once this is done, we apply PCA and the chi-squared test.
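Two of the suggestions above, visualizing a decision tree and running PCA, can be sketched in a few lines. This is a minimal example using the built-in `mtcars` data as a hypothetical stand-in for your own dataset:

```r
# decision tree: rpart reports its own variable importance scores
library(rpart)

tree_fit <- rpart(mpg ~ ., data = mtcars)
tree_fit$variable.importance   # quick ranking of predictors

# principal component analysis on the predictors (scaled, since the
# variables are on very different units)
pca <- prcomp(mtcars[, -1], scale. = TRUE)
summary(pca)   # cumulative proportion of variance per component
```

Plotting the tree (`plot(tree_fit); text(tree_fit)`) gives the quick visual check of which variables drive the splits.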

Apart from the methods discussed above, you can try the backward selection technique if you are using regression, or use the variable importance plots and measures available with various decision tree / random forest techniques. I have seen people use random forest specifically for feature selection.
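Backward selection for regression can be sketched with base R's `step()`, which drops predictors stepwise by AIC. Again, the built-in `mtcars` data stands in for your own:

```r
# start from the full model and eliminate predictors backward by AIC
full_mod <- lm(mpg ~ ., data = mtcars)
back_mod <- step(full_mod, direction = "backward", trace = 0)

formula(back_mod)   # the predictors that survive elimination
```

Note that AIC-based selection is one convention; p-value-based elimination or cross-validated selection are alternatives, and the surviving set can differ between them.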

Hope this helps!

Regards,
Sunil