Multicollinearity in Random Forest



Is Random Forest affected by multicollinearity, and do we need to remove the multicollinear variables?
I fed all variables into RF and, within each set of multicollinear variables, retained the ones with higher variable importance. Is that the right approach?


@asha_vish - Yes, multicollinearity can be present among the variables when we build a random forest. You are on the right path, but you should choose a threshold value: keep all variables whose importance is above it and discard the rest.
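For illustration (not from the original thread), the threshold idea could be sketched with scikit-learn; the synthetic data and the 0.05 cut-off are arbitrary assumptions, so tune the threshold on your own data:

```python
# Sketch: keep only features whose random-forest importance exceeds a threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the real problem (an assumption for the demo).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

threshold = 0.05  # placeholder value, not a recommendation
keep = np.where(rf.feature_importances_ > threshold)[0]
X_reduced = X[:, keep]  # retained columns for the next modelling step
print("kept feature indices:", keep)
```

Importances sum to 1 across all features, so a sensible threshold depends on how many features you have.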

Hope this helps!



Hi @asha_vish

Multicollinearity is quite tricky, even with Random Forest (I am not even speaking about lm, where it is a definite no-no!). You have to ask a few questions before even thinking of adjusting with variable importance.

  1. Will your validation and prediction sets have the same pattern of collinearity as the training set? If yes, no problem.
  2. If not, or if you cannot verify this, you have an issue: the collinearity structure may change, which makes prediction unreliable. If you know you will be extrapolating heavily, your model will not be accurate at all. Conclusion: in this case you should build a model with the collinearity removed.

Concerning reduction via variable importance: there you cannot treat the variables as independent, since they are multicollinear, so which one has the most influence? Random forest builds the variable importance from the OOB error, which is good, but the candidate variables at each split are picked randomly. So one tree may use one variable and another tree its correlated twin (take the case of two correlated variables, to simplify); in each case one of them is excluded, but since they have the same effect, which one should you remove? You do not know exactly. The rule is to consider the multicollinear variables as one set, not as individuals: you keep the set or you remove the set.
In a few words: if you have the choice, I would remove the collinearity by keeping only the variables of interest; since that is not always possible (for example in ecological or genetic studies), I otherwise treat them as a set.
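One way to identify such sets in practice (my own sketch, not from this thread) is to cluster the predictors on their pairwise correlation; the 0.2 distance cut-off below is an arbitrary assumption:

```python
# Sketch: group correlated predictors into sets via hierarchical clustering
# on the distance (1 - |correlation|), so each cluster is one "set".
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Six features: two noisy copies of each base column -> three correlated pairs.
X = np.hstack([base + 0.05 * rng.normal(size=base.shape),
               base + 0.05 * rng.normal(size=base.shape)])

corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=0.2, criterion="distance")  # cut-off is an assumption
print("cluster labels:", labels)  # correlated columns share a label
```

Each resulting cluster can then be kept or dropped as a whole, as described above.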

Hope this helps a little.



Thanks @Lesaffrea and @hinduja1234

Got back to the problem just now… so please excuse the delay in replying.

Regarding the point Alain raised…

If we were to consider the multicollinear variables as a set and remove all of them, would it not lead to losing a good predictor (versus actually retaining one from the set)?

Also, I read that multicollinearity doesn't affect the overall fit of the model, i.e. it would not result in bad predictions.


Hi @asha_vish

Yes, you are right: if you remove the whole set, prediction could worsen. The point, which I certainly did not explain very well, is that you treat the set as one entity: you keep it or you do not, but you do not remove only one or a few variables out of the set (or you are just lucky!).

On your second point, I do not entirely agree. If you only want predictions, without looking at the correlations, fair enough, that is true. But if you want to act on the model, for example by interpreting the coefficients of a linear model, then I do not agree. If you have time, look at page 74 of “An Introduction to Statistical Learning” (I gave the link to the pdf before); Prof. Hastie explains it very well, better than I do :slight_smile:
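To illustrate that distinction (my own numeric sketch, not taken from the book): with two nearly collinear predictors, predictions can stay fine while the individual coefficients become unreliable, which the variance inflation factor (VIF) makes visible:

```python
# Sketch: VIF blows up for collinear predictors, stays near 1 for independent ones.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent predictor

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(X.shape[0]), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])  # x1 and x2 large, x3 near 1
```

A large VIF means the coefficient's variance is inflated by collinearity, so you cannot trust its individual value even when the overall fit looks good.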

Hope this helps