How much does the accuracy of the model depend upon the number of trees in Random Forest?

r
random_forest

#1

Hi,

I wanted to know if the number of trees in our randomForest model affects the accuracy of our model and if it does, then to what extent?

For example, if the model with 250 trees is more accurate than the one with 100 trees, how big will the difference be?

model<- randomForest(formula~. , data=train, ntree=250)
model<- randomForest(formula~. , data=train, ntree=100)

Thanks.


#2

Hi,

As far as I know, that depends on a lot of factors. What you can do is build multiple models, each with a different number of trees (say 200, 300, ...), and then settle on the ntree value that gives the lowest OOB error rate. After that you can tune the mtry option as well to get a more accurate model.
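A minimal sketch of that comparison, assuming the built-in iris data as a stand-in for your train set and an arbitrary grid of ntree values (both are illustrative choices, not something from your problem):

```r
library(randomForest)

set.seed(42)  # make the bootstrap samples reproducible

# Candidate forest sizes to compare (arbitrary illustrative values)
ntree_grid <- c(50, 100, 250, 500)

# Fit one forest per size and record the OOB error after the last tree
oob_error <- sapply(ntree_grid, function(n) {
  fit <- randomForest(Species ~ ., data = iris, ntree = n)
  fit$err.rate[n, "OOB"]  # row n holds the cumulative OOB error using n trees
})

results <- data.frame(ntree = ntree_grid, oob_error = oob_error)
print(results)
```

Differences between the rows can be partly random noise, so it is worth repeating this with a few different seeds before picking a value. The same package also ships a tuneRF() helper that searches over mtry in a similar spirit.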


#3

I would answer by applying ceteris paribus - i.e. with all other things kept constant, how would ntree = 250 differ from ntree = 100.

Assume there are two data sets, A and B. Both have the same number of predictors (columns), but A has 1,000 records (rows) while B has 100,000 records. In other words, A is a small data set while B is quite large. Now let us apply random forest to both, keeping all other function parameters the same except ntree.

Case 1. On A, ntree = 250; on B, ntree = 250.
Case 2. On A, ntree = 100; on B, ntree = 100.

The Random Forest algorithm builds each tree on a bootstrap sample. If we choose to build 250 trees, then for each tree, every record that was NOT selected into that tree's bootstrap sample (and hence went to the Out of Bag sample) will be scored by that tree. Since this selection is purely random, we will never get 250 predictions for every record. Each record is left out of roughly 37% of the bootstrap samples, so Record #24 might be predicted 90 times, Record #305 might be predicted 96 times… it is random. Eventually majority voting is employed to decide on the final prediction for each record.
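To check how often each record actually lands in the Out of Bag sample, you can look at the oob.times component of a fitted randomForest object. A small sketch, assuming the built-in iris data (150 records) as a stand-in data set:

```r
library(randomForest)

set.seed(42)  # reproducible bootstrap samples

model <- randomForest(Species ~ ., data = iris, ntree = 250)

# oob.times[i] = number of trees for which record i was out of bag
summary(model$oob.times)

# Average share of trees in which a record was out of bag.
# With bootstrap sampling this is expected to be close to
# (1 - 1/n)^n, i.e. roughly 0.37.
oob_fraction <- mean(model$oob.times) / model$ntree
print(oob_fraction)
```

The per-record counts vary from record to record, which is the randomness described above.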

So to answer the question at hand: as long as we ensure that each record has a good enough chance to appear in the Out of Bag sample a sufficient number of times, we are good! As you can see, the choice of the number of trees then depends on how large our data set is. In the two cases above, I would suspect even 250 trees may not be enough for data set B, which has 100K records.

This leads us to another question then - is there a mathematical formula or rule of thumb that relates the number of records to the number of trees to build, to make our lives easier? So far I haven’t come across such a thing and continue to search :-) Trial and error it is.


#4

Hi @adityashrm21 ,

I'd just like to add one more point to what everyone has said. You can see the impact for yourself and check how many trees are sufficient. There are two approaches:

  • Use plot(model) on the model with 250 trees. It plots the Out of Bag error against the number of trees. Once the curve flattens out, you can take that number of trees as enough and assume that adding more will not improve the model much.
  • The second is in real time, while randomForest is running: use model <- randomForest(formula~. , data=train, ntree=250, do.trace = 1). This prints the running Out of Bag error after each tree is built. How can it help? If you see that the error has stopped going down after, say, 50 trees, there is no reason to run the whole forest to 250. Just stop it and rerun with ntree=50, which can save some time!
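Both approaches can be sketched together as below. This assumes the built-in iris data in place of your train set, and the 0.005 tolerance used to decide when the error has "become constant" is a hypothetical choice for illustration:

```r
library(randomForest)

set.seed(42)  # reproducible bootstrap samples

# do.trace = 50 prints the running OOB error every 50 trees;
# do.trace = 1 would print it after every single tree
model <- randomForest(Species ~ ., data = iris, ntree = 250, do.trace = 50)

# Approach 1: error curves (OOB and per class) against number of trees
plot(model)

# The same OOB curve as numbers: element i is the OOB error using i trees
oob <- model$err.rate[, "OOB"]

# Smallest number of trees from which the OOB error stays within
# `tol` of its final value (tol is an arbitrary illustrative threshold)
tol <- 0.005
within_tol <- abs(oob - oob[length(oob)]) <= tol
stable_from <- max(which(!within_tol), 0) + 1
print(stable_from)
```

If stable_from comes out well below 250, rerunning with a smaller ntree should give a similar OOB error in less time.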

Hope this helps!

Regards,
Aayush