When can you say a random forest is saturated and won't improve?

random_forest
overfitting

#1

Hi Guys,

I think that a model saturates when your training accuracy becomes close to the CV accuracy, so there is little scope for improvement. But my thought process got challenged recently.

I faced an issue where I was using an RF model and got high training accuracy (~95%) but a low CV score (~80%). I thought the model was overfitting, so I started tuning parameters - min_samples_split, min_samples_leaf, max_depth, max_features. The training accuracy went down, but the CV score went down as well. I was expecting CV to increase as I reduced overfitting, but this wasn't the case.

Does this mean that the RF has saturated? It doesn't look that way to me, but almost nothing worked after hours of trying. The best I got was accuracy 91%, CV 78%, which is still a pretty big gap I guess. Is this normal, or am I missing a trick here?
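For reference, a minimal sketch of the train-vs-CV comparison described above, using scikit-learn on a synthetic dataset (the data and parameter values are hypothetical stand-ins, not the original problem):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset standing in for the real one
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

train_acc = rf.score(X, y)                       # accuracy on the training set
cv_acc = cross_val_score(rf, X, y, cv=5).mean()  # 5-fold cross-validated accuracy

# A large gap (like the ~95% vs ~80% above) is what prompts the
# overfitting suspicion in the first place
print(f"train={train_acc:.2f}  cv={cv_acc:.2f}  gap={train_acc - cv_acc:.2f}")
```

With fully grown trees (the default), training accuracy for an RF is typically near 100%, so some train/CV gap is expected even for a well-behaved model.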

Thanks,
Aarshay


#2

Hi @Aarshay,

I don't think the RF is saturated; to my mind, saturation means a model has zero training and testing error. As you said, after tuning the parameters your training and testing accuracy both decreased, which does not signify over-fitting. Over-fitting usually shows up as a high standard deviation across the CV folds. Since your training and testing accuracy went down together, you can assume you are not over-fitting.
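One way to look at the fold-to-fold standard deviation mentioned above (a sketch with hypothetical data; the threshold for "high" depends on your problem):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Per-fold accuracies, not just the mean
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)

# A large spread across folds suggests an unstable / overfit model
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}")
```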

Instead of tuning the parameters, what you should do is try to include more features in your feature space.


#3

Predictive modeling/forecasting is not an exact science, unlike the physical sciences.

Since the data contains some noise (or considerable noise) along with the signal, you will not get 100% accuracy or a 100% CV result. The more noise there is, the less accuracy you will achieve. You could only get 100% accuracy if the training and testing data were artificially generated and contained zero noise. Many budding data scientists seem to forget this fact.


#4

You can also use the “Learning Curves” concept for that.

Check these (for R):

And for Python:
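In case the links above don't load, here is a minimal sketch of the learning-curve idea in Python with scikit-learn's `learning_curve` (synthetic data, illustrative sizes): if the train and CV curves have converged at the largest training size, adding more of the same data is unlikely to help.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

# Train and CV scores at increasing training-set sizes
sizes, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, cv in zip(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  cv={cv:.3f}")
```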


#5

Assuming that NO model variable will be removed from or added to the existing model, I would consider an RF model to be saturated when changing parameters such as “tree depth”, “no. of trees grown” and “learning rate” has no effect on model accuracy on the test dataset.

I would also make sure that the train and test dataset accuracies do not differ by a huge margin, as in that case you have ended up overfitting the RF model.
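The plateau check described above can be sketched like this (hypothetical data and a small illustrative grid over tree count and depth; RF has no learning rate in scikit-learn, so only those two are varied here): if CV accuracy barely moves across settings, the model has saturated for this feature set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Scan a few settings; a flat CV score across the grid indicates saturation
for n_trees in (50, 100, 200):
    for depth in (5, 10, None):
        rf = RandomForestClassifier(n_estimators=n_trees, max_depth=depth,
                                    random_state=0)
        score = cross_val_score(rf, X, y, cv=3).mean()
        print(f"trees={n_trees:3d}  depth={depth}  cv={score:.3f}")
```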