What does min_samples_split mean in a decision tree?



I am currently working on a classification problem using the decision tree algorithm in Python. When I calculated the accuracy of my model, I got 90.8%, so I searched the decision tree documentation. While searching, I found a parameter, min_samples_split, whose default value is 2. I want to know what it means and how its value can improve model performance.


hello @hinduja1234,

Please see the below image from the sklearn documentation:

So, to be split, a node must contain at least 2 records by default. Splitting such a node in two gives 1 record in each child, and that minimum number of records per child is what min_samples_leaf specifies.
In other words, a split will not happen if the node contains fewer records than the number specified by min_samples_split.
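To make the two rules concrete, here is a minimal plain-Python sketch. The helper name is_split_allowed is hypothetical and this is not how sklearn implements it internally; it only mirrors the rule described above, using sklearn's default values for the two parameters.

```python
def is_split_allowed(n_node, n_left, min_samples_split=2, min_samples_leaf=1):
    """Return True if a node holding n_node samples may be split so that
    n_left samples go to the left child and the rest go to the right child.

    Hypothetical helper for illustration only, not part of sklearn.
    """
    n_right = n_node - n_left
    # Rule 1: the node itself must hold at least min_samples_split samples.
    if n_node < min_samples_split:
        return False
    # Rule 2: each child must keep at least min_samples_leaf samples.
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


# Defaults: a node with 2 records can be split into 1 + 1.
print(is_split_allowed(2, 1))  # True
# A node with 9 records is too small when min_samples_split=10.
print(is_split_allowed(9, 4, min_samples_split=10))  # False
# A node with 10 records is large enough, but a 1 + 9 split violates min_samples_leaf=2.
print(is_split_allowed(10, 1, min_samples_split=10, min_samples_leaf=2))  # False
```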
Hope this helps!!


Just to add that min_samples_split=2 is the minimum integer value of the argument. Obviously, you need at least two observations to consider a split.

To see how this parameter differs from min_samples_leaf consider the following tree:

That tree was constructed with min_samples_split=10. Notice that the leaves at the bottom would not be created if min_samples_leaf > 1. I personally only change max_depth and min_samples_leaf, because min_samples_split is (very) slightly more expensive to train: the sub-tree needs to be constructed before checking whether samples_leaf >= min_samples_leaf, and if that check fails, the sub-tree needs to be replaced by a leaf.
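If you want to check this difference yourself, the rough sketch below fits two trees and looks at how many samples end up in each node. The iris dataset and the value min_samples_leaf=5 are just assumptions for illustration, and node_sizes is a hypothetical helper; it reads the fitted tree_ attribute, where children_left is -1 for a leaf.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is used only as a stand-in dataset.
X, y = load_iris(return_X_y=True)


def node_sizes(clf):
    """Return (sample counts of internal nodes, sample counts of leaves)."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # -1 marks a leaf in the fitted tree
    return tree.n_node_samples[~is_leaf], tree.n_node_samples[is_leaf]


# Tree 1: only nodes with at least 10 samples may be split.
split_tree = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)
# Tree 2: every leaf must keep at least 5 samples.
leaf_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

split_internal, split_leaves = node_sizes(split_tree)
leaf_internal, leaf_leaves = node_sizes(leaf_tree)

print("min_samples_split=10 -> smallest split node:", split_internal.min())
print("min_samples_split=10 -> smallest leaf:", split_leaves.min())
print("min_samples_leaf=5   -> smallest leaf:", leaf_leaves.min())
```

With min_samples_split alone, nothing stops a split from leaving a very small leaf, while min_samples_leaf bounds the leaf size directly.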

Anyhow, you should probably tune all of those parameters with cross-validation and then choose the best values.
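A minimal sketch of that cross-validated search with GridSearchCV could look like the following. The dataset (iris), the candidate values in the grid, and cv=5 are all assumptions for illustration; swap in your own data and ranges.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own classification data.
X, y = load_iris(return_X_y=True)

# Illustrative candidate values, not recommended defaults.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
```

The reported score is an average over the held-out folds, so it is a fairer estimate than accuracy measured on the training data.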