How to extract important variables from random forest model using varImpPlot in R?




While building a random forest model on the dataset from the Kaggle problem ‘bike-sharing-demand’, I used varImpPlot to see the important variables in my model:

fit <- randomForest(logreg ~ season + weather + temp + humidity + holiday + workingday + atemp + m + hour + day_part + year + day_type + windspeed, data = train, importance = TRUE, ntree = 250)

and I get a plot with two panels, %IncMSE and IncNodePurity.

I can see that the topmost variable is the most significant, but what is the difference between these two plots? The variables have different values and sit at different positions in each. Which one should I follow when selecting the significant variables?



The first graph (%IncMSE) shows how much the MSE increases when the values of a variable are randomly permuted. So in your case, if you randomly permute year (i.e. an observation that had year = 2014 is randomly assigned year = 2012, if that value is present in the out-of-bag sample used for the check, and so on), the MSE will increase by 100% on average. That makes sense, since bike demand may have increased in recent years. The higher the value, the more important the variable.
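To make the permutation idea concrete, here is a minimal sketch in base R. It is a simplified stand-in, not what randomForest does internally: it uses `lm()` on the built-in `mtcars` data with a single hold-out split, whereas the package permutes each variable on the out-of-bag rows of every tree. The helper names `mse` and `permutation_importance` are made up for this illustration.

```r
# Simplified permutation importance: lm() on a hold-out split of mtcars.
# (Illustration only; randomForest works per tree on out-of-bag rows.)
set.seed(42)

test_idx <- sample(nrow(mtcars), 10)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

model <- lm(mpg ~ wt + hp + qsec, data = train)

mse <- function(actual, predicted) mean((actual - predicted)^2)
baseline_mse <- mse(test$mpg, predict(model, newdata = test))

# Shuffle one column at a time and record the average rise in MSE.
permutation_importance <- function(variable, n_repeats = 50) {
  increases <- replicate(n_repeats, {
    shuffled <- test
    shuffled[[variable]] <- sample(shuffled[[variable]])
    mse(test$mpg, predict(model, newdata = shuffled)) - baseline_mse
  })
  mean(increases)
}

predictors <- c("wt", "hp", "qsec")
inc_mse <- sapply(predictors, permutation_importance)
sort(inc_mse, decreasing = TRUE)
```

A variable whose shuffling barely changes the MSE contributes little to the predictions; one whose shuffling inflates the MSE a lot is one the model relies on.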

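On the other hand, IncNodePurity is the total decrease in node impurity from splits on that variable, averaged over all trees. For regression, as here, impurity is measured by the residual sum of squares (RSS), so it is the difference between the RSS before and after the split; the Gini index plays this role for classification.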
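As a small worked example of the RSS difference for a single split, the sketch below uses the built-in `mtcars` data. The threshold `wt < 3` is an arbitrary choice for illustration, not a split a fitted forest would necessarily pick, and `rss` is a helper defined here.

```r
# RSS of a node when it predicts the mean of its responses.
rss <- function(y) sum((y - mean(y))^2)

y <- mtcars$mpg
left <- mtcars$wt < 3                 # hypothetical split threshold

rss_before <- rss(y)                          # impurity of the parent node
rss_after  <- rss(y[left]) + rss(y[!left])    # impurity of the two children
purity_gain <- rss_before - rss_after         # credited to the split variable
purity_gain
```

Summing such gains over every split that uses a variable, and averaging over the trees, gives the kind of quantity plotted as IncNodePurity.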

Since the two measures define variable importance differently, they rank the variables differently.
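If you want the two rankings as data rather than reading them off the plot, something along these lines should work. It is a sketch that assumes the randomForest package is installed and uses `mtcars` as a stand-in for the bike-sharing training data; the object names `by_mse` and `by_purity` are just illustrative.

```r
library(randomForest)

set.seed(1)
# mtcars stands in for the bike-sharing training data.
fit <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 250)

# For a regression forest fitted with importance = TRUE, importance()
# returns a matrix with the columns "%IncMSE" and "IncNodePurity".
imp <- importance(fit)

by_mse    <- sort(imp[, "%IncMSE"], decreasing = TRUE)
by_purity <- sort(imp[, "IncNodePurity"], decreasing = TRUE)

names(by_mse)[1:5]      # top five variables by permutation importance
names(by_purity)[1:5]   # top five variables by node purity

varImpPlot(fit, type = 1)   # draw only the %IncMSE panel
```

Comparing `names(by_mse)` with `names(by_purity)` shows directly where the two measures disagree.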

There is no fixed criterion for selecting the “best” measure of variable importance; it depends on the problem at hand.

Since I expect this to be time series data, you should look at %IncMSE, as you cannot randomly permute any year/day/hour to predict the demand for bikes.