Choosing number of trees in Random Forest

machine_learning
random_forest

#1

Hi,
I was building a random forest model and experimenting with the number of estimators used in the algorithm, and I came across a situation.
When I reach a number of trees that gives the highest cross-validation score so far, and the validation score starts decreasing after it, is it safe to assume that no larger forest will produce a better validation score?
If so, could you please provide a proof?
Thanks in advance
Danish


#2

Hi @syed.danish: I hope you already know about RF. If not, a random forest is an ensemble method that combines many decision trees, each trained on a bootstrap sample of your data. Up to a point, the more trees you grow, the more samples of your data you create, and averaging over them reduces the variance of the predictions.

Eventually, though, you have created enough samples that the data is effectively being duplicated: the same data points, ordinary observations and outliers alike, keep reappearing in different samples, so they exert a repeated, outsized influence on the extra trees without adding new information. That is the main reason the CV score stops improving and can start to drop. It is good practice to stay within 200 to 300 estimators in RF if your data has more than 100,000 (1 lakh) rows, or to use grid search to find the right number of estimators, as in the sketch below.
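Here is a minimal sketch of that grid search, assuming scikit-learn and a synthetic dataset from make_classification (the grid values and dataset are illustrative, not taken from this thread):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for your data; replace with your own X, y.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Cross-validate a handful of forest sizes and keep the best one.
param_grid = {"n_estimators": [50, 100, 200, 300, 500]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best n_estimators:", search.best_params_["n_estimators"])
print("Best mean CV accuracy:", search.best_score_)
```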

There is no formal proof of this; to see it for yourself, prepare a learning curve over the number of estimators (a sketch follows the links below) or follow the given links:

http://scikit-learn.org/stable/auto_examples/model_selection/randomized_search.html

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
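And here is a rough validation-curve sketch along the lines of those examples (again assuming scikit-learn and synthetic data). Plotting or printing the mean CV score against n_estimators typically shows it rising, plateauing, and then fluctuating with noise rather than falling monotonically:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for your data; replace with your own X, y.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Mean 5-fold CV score for each forest size.
n_trees = [10, 50, 100, 200, 300, 500]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name="n_estimators", param_range=n_trees, cv=5,
)

for n, score in zip(n_trees, val_scores.mean(axis=1)):
    print(f"{n:>4} trees: mean CV accuracy = {score:.4f}")
```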


#3

Thanks @Swapnil_Sharma, it helped a lot.