What is good in a decision tree, a large or a small leaf size?




While reading about the tuning of the randomForest model here, it says,

“If you have built a decision tree before, you can appreciate the importance of minimum sample leaf size. Leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in train data. Generally, I prefer a minimum leaf size of more than 50. However, you should try multiple leaf sizes to find the most optimum for your use case.”

What does it mean to have a small leaf size? Is the leaf size the number of levels in the tree or the number of leaves at the end of the tree? I am also not clear on which leaf size is better, larger or smaller, and why.



Hi Pravin,
Leaf size = the number of cases, or observations, in that leaf.
Consider this simplified example for illustration purposes.
We start with 1000 rows/observations and are building a decision tree to predict yes/no.
Split 1: variable 1 splits the 1000 into 700Y, 300N.
Split 2: variable 2 splits the above 700 into 550Y, 150N.
Split 3: variable 3 splits the above 550 into 400Y, 150N.

Question: how long does this go on? If our data set is kind to us, maybe at split 4 we get all Y or all N and the story ends. But if not, how much more splitting will be needed? This is where the concept of a minimum leaf size, the smallest number of observations a node may end up with, comes in. If we choose too small a leaf size, say 20, you can see it may take us 10 or more splits to get down to 20 observations. Too deep a tree means overfitting! On the flip side, if we choose too large a leaf size, say 500 in the above example, the tree stops growing after the second split itself, which means underfitting and poor predictive performance.
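To see this trade-off for yourself, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on a synthetic dataset (not your data, just an illustration). The `min_samples_leaf` parameter is scikit-learn's name for the minimum leaf size discussed above:

```python
# Sketch: how min_samples_leaf controls tree depth and overfitting.
# Synthetic data for illustration only; numbers will differ on real data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for leaf in (1, 20, 50, 500):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"min_samples_leaf={leaf:>3}  depth={tree.get_depth():>2}  "
          f"train acc={tree.score(X_tr, y_tr):.2f}  "
          f"test acc={tree.score(X_te, y_te):.2f}")
```

With a tiny leaf size the tree grows deep and fits the training set almost perfectly (overfitting); with a leaf size of 500 it cannot make even one valid split on 750 training rows, so it stays a stump (underfitting). The sweet spot is somewhere in between, which is why you tune it.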
Hope that helps you see why we need to search for the optimal minimum leaf size.