What’s the process used by R to split the data into different buckets after an appropriate complexity parameter is selected in classification trees?
The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If adding another split from the current node would not improve the fit by at least cp, tree building does not continue along that branch. In other words, a split is not attempted unless it would decrease the overall lack of fit by a factor of cp.
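To see the effect of cp on tree size, here is a small sketch. It assumes the kyphosis data that ships with rpart as a stand-in for the bankloan data in this thread, and n_splits is just a helper made up for the illustration:

```r
library(rpart)

# kyphosis ships with rpart; it stands in for the thread's bankloan data
loose <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class", cp = 0.001)
strict <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                method = "class", cp = 0.5)

# Hypothetical helper: count the splits in a fitted tree
# (leaf rows are labelled "<leaf>" in the fit's $frame)
n_splits <- function(fit) sum(fit$frame$var != "<leaf>")

n_splits(loose)   # small cp: more splits are allowed
n_splits(strict)  # large cp: splits must improve the fit a lot to be kept
```

A larger cp demands a bigger improvement per split, so the tree stops growing earlier.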
In case you are using R:
library(rpart)
tree <- rpart(default ~ ., data = bankloan, method = "class")
plot(tree)
text(tree, pretty = 2)
In case we need to see the optimal value of cp, printcp(tree) prints the cp table and plotcp(tree) plots it.
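A minimal sketch of inspecting the cp table, again assuming the built-in kyphosis data rather than bankloan. The cross-validated error depends on random folds, so the seed here is only to make the run repeatable:

```r
library(rpart)

set.seed(1)  # cross-validation folds are random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# One row per candidate tree size: CP, nsplit, rel error, xerror, xstd
printcp(fit)

# Cross-validated error (xerror) plotted against cp, with tree size on the top axis
plotcp(fit)
```

The same table is also stored in fit$cptable if you want to work with it programmatically.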
Hope this helps!
Thanks shuvayan. It is now clear that cp decides the number of splits in the tree, but which factor or algorithm decides what each split should be? E.g., in your example above, where do the numbers 24.65, 9.5, etc. come from? What is the algorithm behind them?
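That is decided by criteria such as Gini impurity or information gain (entropy). For each candidate variable and cut-off value, the algorithm checks how pure the class distribution within each resulting bucket becomes, and picks the variable and value that separate the classes best. In rpart, classification trees use the Gini index by default; parms = list(split = "information") switches to the entropy-based criterion.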
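To make the idea concrete, here is a simplified sketch of a Gini-based threshold search on a single numeric variable. It is not rpart's actual implementation; gini and best_threshold are made-up helper names and the data are toy values. It assumes candidate cut-offs sit at midpoints between consecutive observed values, which is why split values such as 9.5 tend to fall between data points:

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every midpoint between consecutive distinct values of x and
# return the threshold with the lowest weighted impurity
best_threshold <- function(x, y) {
  v <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2
  score <- sapply(cuts, function(cut) {
    left <- y[x < cut]
    right <- y[x >= cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  cuts[which.min(score)]
}

# Made-up toy data: the classes separate cleanly between 9 and 10
x <- c(3, 5, 8, 9, 10, 12, 15)
y <- c("no", "no", "no", "no", "yes", "yes", "yes")
best_threshold(x, y)
```

A real tree repeats this search over every predictor at every node and keeps the best variable and threshold pair.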
I would request you to post this as a separate question in the forum.
Can you please explain on what basis you selected the 4th row from the printcp output?
For pruning the tree, we select the cp value with the lowest cross-validation error, shown in the 'xerror' column. The 4th row had the lowest xerror, so we selected it.
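A short sketch of doing that selection in code instead of reading the row off by eye. It assumes the kyphosis data from rpart rather than bankloan, so the winning row will not necessarily be the 4th one:

```r
library(rpart)

set.seed(1)  # xerror comes from random cross-validation folds
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Find the row of the cp table with the lowest cross-validated error
cp_table <- fit$cptable
best_row <- which.min(cp_table[, "xerror"])
best_cp <- cp_table[best_row, "CP"]

# Prune the tree back to that complexity
pruned <- prune(fit, cp = best_cp)
```

Because xerror varies with the folds, a different seed can pick a different row.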
Can you explain what the plotcp(tree) function does and what information its graph shows?