I am currently solving a classification problem using the decision tree algorithm in Python. When I calculated the accuracy of my model I got 90.8%, so I searched the decision tree documentation and found a parameter, min_samples_split, whose default value is 2. I want to know what it means and how its value can improve model performance.
Please see the below image from the sklearn documentation:
So a node must have at least 2 records before it can be split. With the defaults, splitting such a node gives two child nodes with 1 record each, which is the minimum leaf size allowed by min_samples_leaf (default 1).
So, a split will not happen if a node contains fewer records than the number specified by min_samples_split.
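To make the two checks concrete, here is a minimal sketch in plain Python. The function name split_allowed and its arguments are made up for illustration; this is a simplified picture of the rule, not scikit-learn's actual implementation.

```python
def split_allowed(n_node, n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a candidate split passes both size checks (illustrative only)."""
    if n_left + n_right != n_node:
        raise ValueError("children must account for every record in the node")
    # min_samples_split looks at the node being split (the parent).
    if n_node < min_samples_split:
        return False
    # min_samples_leaf looks at each child the split would create.
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


# Defaults: a node with 2 records may be split into 1 + 1.
print(split_allowed(2, 1, 1))
# Parent has fewer than min_samples_split records, so no split.
print(split_allowed(9, 5, 4, min_samples_split=10))
# Parent is large enough, but one child would be smaller than min_samples_leaf.
print(split_allowed(10, 9, 1, min_samples_split=10, min_samples_leaf=2))
```

The first check depends only on the size of the parent node, the second only on the sizes of the children, which is why the two parameters can give different trees.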
Hope this helps!!
Just to add that min_samples_split=2 is the minimum value of the argument (when given as an integer). You obviously need at least two observations to consider a split.
To see how this parameter differs from min_samples_leaf consider the following tree:
That tree was constructed with min_samples_split=10. Notice, however, that the leaves at the bottom would not have been created if min_samples_leaf > 1. I personally only change max_depth and min_samples_leaf, because min_samples_split is (very) slightly more expensive to train: the sub-tree needs to be constructed before checking whether samples_leaf >= min_samples_leaf, and if the check fails the sub-tree needs to be replaced by a leaf.
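If you want to check the difference between the two parameters yourself, a rough sketch like the following may help. It assumes the iris dataset purely as a stand-in (it is not the data behind the tree described here), and node_sizes is a hypothetical helper name.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset, chosen only because it ships with scikit-learn.
X, y = load_iris(return_X_y=True)


def node_sizes(clf):
    """Return (smallest internal node, smallest leaf) measured in training samples."""
    tree = clf.tree_
    is_leaf = tree.children_left == -1  # -1 marks a node without children
    sizes = tree.n_node_samples
    return int(sizes[~is_leaf].min()), int(sizes[is_leaf].min())


split_tree = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)
leaf_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

for name, clf in [("min_samples_split=10", split_tree), ("min_samples_leaf=5", leaf_tree)]:
    smallest_internal, smallest_leaf = node_sizes(clf)
    print(name, "| smallest split node:", smallest_internal, "| smallest leaf:", smallest_leaf)
```

With min_samples_split=10 every node that was split holds at least 10 samples, but nothing stops a leaf from being much smaller. With min_samples_leaf=5 every leaf holds at least 5 samples.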
Anyhow, you should probably run cross-validation over all of those parameters and then choose the values that perform best.
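One possible way to do that is a grid search with GridSearchCV. The sketch below assumes the iris dataset as a placeholder for your own data, and the grid values are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; replace with your own features and labels.
X, y = load_iris(return_X_y=True)

# Arbitrary candidate values for the three parameters discussed here.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

The cross-validated score is a fairer basis for comparing settings than accuracy measured on the training data, since larger values of these parameters mainly act by limiting overfitting.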