How to do feature selection in R using Random Forest for classification and regression?

r
machine_learning
randomforest

#1

Hi All,

Can you please help me understand how to do feature selection in R using Random Forest for classification and regression?


#2

Hi Nagu,

You can use the variable importance measures in Random Forest to score each of the input features. Based on those importance values, you can choose which features to keep in your model.
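For example, something along these lines (a rough sketch only, assuming the randomForest package; df and target are placeholder names for your data frame and response column):

library(randomForest)

# Classification: the response should be a factor; for regression, leave it numeric
rf_model <- randomForest(target ~ ., data = df, importance = TRUE, ntree = 500)

# Importance table: MeanDecreaseAccuracy / MeanDecreaseGini for classification,
# %IncMSE / IncNodePurity for regression
importance(rf_model)

# Quick visual comparison of the variables
varImpPlot(rf_model)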

Hope this helps!

Thanks,
SRK


#3

Thank you so much, SRK. I tried this out using Random Forest and got the importance value for each feature. My question is: is there a threshold value above which a feature should be selected for inclusion in the model?

For example:
Feature1 16.342
Feature2 12.323
Feature3 3.45
Feature4 10.52

In the above example, Features 1, 2, and 4 have higher values, but Feature3 has a much lower value. So do I need to eliminate Feature3 from the predictors? If that's the case, is there a threshold value below which I should not include a variable in the model?

Thank you…


#4

Hi Nagu,

It is hard to make a decision based on the raw importance values, since their scale changes from problem to problem. It is better to convert each raw value to a percentage of the total importance and base the decision on that percentage.

The threshold depends on the problem at hand. In some cases we don't really care about the number of variables, so it is fine to keep them all, since even the variable with the lowest importance might add some value. If we do need to reduce the number of variables, the threshold should still be chosen with the problem in mind; generally, variables whose percentage share is close to 0 can be eliminated.
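A quick sketch of the percentage conversion (assuming rf_model was fit with importance = TRUE as above; the column name differs for regression):

imp <- importance(rf_model)

# Share of total importance, in percent (MeanDecreaseGini is the classification
# column; use '%IncMSE' or 'IncNodePurity' for regression)
imp_pct <- 100 * imp[, 'MeanDecreaseGini'] / sum(imp[, 'MeanDecreaseGini'])

# Sort descending; variables with a share close to 0% are candidates for removal
sort(imp_pct, decreasing = TRUE)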

Hope this helps!

Thanks,
SRK


#5

Hi Nagu,

As SRK indicated, there is no hard limit below which you should ignore a variable. In your situation, you could create two models, one with the top 3 predictors and the other with all 4, and then select the model that gives the lower cross-validation error; one way to run that comparison is sketched below.
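A sketch only, using the caret package for k-fold cross-validation (an assumption, since no package was specified; df, target and the feature names are placeholders taken from your example):

library(caret)
library(randomForest)

ctrl <- trainControl(method = "cv", number = 5)

# Top 3 predictors vs. all 4 predictors
fit_top3 <- train(target ~ Feature1 + Feature2 + Feature4,
                  data = df, method = "rf", trControl = ctrl)
fit_top4 <- train(target ~ Feature1 + Feature2 + Feature3 + Feature4,
                  data = df, method = "rf", trControl = ctrl)

# Compare resampled accuracy (classification) or RMSE (regression)
# and keep the feature set that performs better
summary(resamples(list(top3 = fit_top3, top4 = fit_top4)))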


#6

Thank you all, that was really informative.


#7

Hi Nagu, you can try this code snippet:

# Assumes rf_model was fit with randomForest() on a classification problem
# (MeanDecreaseGini); for regression, use the '%IncMSE' or 'IncNodePurity' column
library(randomForest)
library(dplyr)
library(ggplot2)
library(ggthemes)   # for theme_few()

importance    <- importance(rf_model)
varImportance <- data.frame(Variables = row.names(importance), 
                            Importance = round(importance[ ,'MeanDecreaseGini'], 2))

# Create a rank variable based on importance
rankImportance <- varImportance %>%
  mutate(Rank = paste0('#', dense_rank(desc(Importance))))

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance), 
                           y = Importance, fill = Importance)) +
  geom_bar(stat = 'identity') + 
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust = 0, vjust = 0.55, size = 4, colour = 'red') +
  labs(x = 'Variables') +
  coord_flip() + 
  theme_few()

Hope this helps. :)


#8

Hi SRK,

I just need some guidance in terms of pursuing my career in data analytics.

I am new to this field, but I have read a few books and am also learning R using DataCamp.
However, I need some real practice to apply the concepts to data.
It would be great if you could guide me on where to start.


#9

Hi,

You can also select the variables based on mutual information.
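For example, a rough sketch using the FSelector package (an assumption on my part; information.gain scores each feature by its mutual information with a categorical target; df and target are placeholder names):

library(FSelector)

# Mutual-information-style score for each feature against the class
weights <- information.gain(target ~ ., data = df)
print(weights)

# Keep, say, the 5 highest-scoring features; the cutoff is a judgment call
selected <- cutoff.k(weights, 5)
selected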

Best!

Ankit Gupta