Decision Tree, Gini Impurity, purity

r
machine_learning

#1

Hi,

I’ve read that the Gini index is a measure of purity. What is “purity”? What exactly does it mean?

Thank you!


#2

Think of purity as a node containing mostly one class after a split; a perfectly pure node contains only one class. The Gini index indicates how well a split separates the classes.



#3

@Nivedan

The Gini index asks: if we select two items from a population at random, what is the probability that they are of the same class? That probability is 1 if the population is pure.

In other words, a population is pure if all of its members belong to a single class. Take gold purity, for example: we measure the purity of a gold ornament by the amount of other metals mixed with the gold when the ornament is made. The smaller the quantity of other metals, the higher the purity.

Now imagine you split two different ornaments (one of higher purity than the other) into individual atoms and randomly pick one atom from each sample. Which sample gives you the higher probability of picking a gold atom? The one with higher purity. That’s purity for you.
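To put numbers on the "two random items" idea, here is a minimal sketch. It assumes the two items are drawn with replacement, and the function name and the gold/copper samples are made up for illustration.

```python
from collections import Counter

def same_class_probability(labels):
    """Probability that two items drawn at random (with replacement)
    from `labels` belong to the same class: the sum of squared
    class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return sum((n / total) ** 2 for n in counts.values())

# Hypothetical samples: a pure ornament and one mixed with copper.
pure = ["gold"] * 10
mixed = ["gold"] * 7 + ["copper"] * 3

print(same_class_probability(pure))   # exactly 1.0 for a pure sample
print(same_class_probability(mixed))  # about 0.58 (0.7^2 + 0.3^2)
```

The purer the sample, the closer this probability gets to 1.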

Read this for a deeper understanding of tree based modelling - https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/

Hope this helps
Sanad :slight_smile:


#4

@mohdsanadzakirizvi Thank you for elaborating on the first comment. It helped me understand this better!


#5

Hi, I was going through the decision tree link that you shared in the comment above. It says that a splitting variable is chosen if its Gini index is higher. But shouldn’t we split on the variable with the lower Gini score?


#6

Hey Nivedan,

No. As you can read in the article:

Higher the value of Gini higher the homogeneity.

Homogeneity is what we want when splitting a node: the higher the homogeneity (purity) at a split, the easier it is to decide whether to go left (maybe class A) or right (maybe class B) from that node.

If the homogeneity is low, the decision at that node is ambiguous, which makes it a bad split.

Imagine you are at some node and have to go either right or left, but in both branches the probabilities of class A and class B are 50-50; the choice is ambiguous. Ideally, one branch should have a higher probability of class A and the other a higher probability of class B.
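As a rough sketch of that comparison, the snippet below scores a split as the size-weighted sum of squared class proportions in each child node, which is my reading of the convention the article uses. The function names and the class counts are hypothetical, and every child node is assumed to be non-empty.

```python
def node_score(counts):
    """Sum of squared class proportions in one node (higher = more homogeneous)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def split_score(children):
    """Average of the child node scores, weighted by child size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * node_score(c) for c in children)

# Each tuple is (count of class A, count of class B) in one child node.
ambiguous_split = [(5, 5), (5, 5)]  # both branches are 50-50
clean_split = [(9, 1), (1, 9)]      # each branch is dominated by one class

print(split_score(ambiguous_split))  # about 0.5
print(split_score(clean_split))      # about 0.82
```

Under this convention the clean split gets the higher score, which matches the article's "higher is better" wording.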

Hope this helps,
Sanad :slight_smile:


#8

But “many” other sites say a good variable split has a low Gini index, for example: http://www.learnbymarketing.com/481/decision-tree-flavors-gini-info-gain/

“You want a variable split that has a low Gini Index”.

Or am I misunderstanding this statement? (It’s from the link above.)

I understand it as: the lower the Gini value, the more homogeneous the node. Am I wrong?


#9

You are right. I read the link you shared and found something interesting: there seem to be two different things, Gini impurity (used by RF) and the Gini coefficient.

Gini impurity has the formula 1 - (sum of squared class probabilities), which is the correct one here. In the article, the “1 -” is missing, which looks to me more like the formula for the Gini coefficient given on Wikipedia.

RF/CART uses Gini impurity, so your definition makes sense. According to Wikipedia,

It reaches its minimum (zero) when all cases in the node fall into a single target category.

Here is the Wikipedia link to the same - https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
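A small sketch of how the two conventions relate, assuming the article's score is the plain sum of squared class proportions. The function names and the example nodes are made up for illustration.

```python
def sum_of_squares(proportions):
    """The score without the "1 -": higher means more homogeneous."""
    return sum(p * p for p in proportions)

def gini_impurity(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum_of_squares(proportions)

# Hypothetical two-class nodes, given as class proportions.
nodes = {"pure": [1.0, 0.0], "skewed": [0.8, 0.2], "even": [0.5, 0.5]}

for name, props in nodes.items():
    print(name, gini_impurity(props))  # "pure" gives 0.0, "even" gives 0.5

# Ranking nodes from best to worst gives the same order either way.
by_low_impurity = sorted(nodes, key=lambda k: gini_impurity(nodes[k]))
by_high_squares = sorted(nodes, key=lambda k: sum_of_squares(nodes[k]), reverse=True)
print(by_low_impurity == by_high_squares)  # True
```

So "pick the split with low Gini impurity" and "pick the split with a high sum of squares" select the same split; the two statements only differ in which quantity they call Gini.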


#10

@jalFaizy, Would you please verify this and check if the article needs an update?


#11

So it’s Gini Impurity! Great! Thank you so much! :grinning:


#12

Also, when it comes to pruning a classification tree, which parameters should we focus on? When selecting the best CP value, do we consider the lowest rel.error, the lowest xerror, or both together? What’s the best way to determine the optimal CP value for pruning?

And is the plot produced by plotcp() reliable for selecting the best CP value? If yes, which point do we pick from the image below: the point where the curve crosses the horizontal line, or the point that I’ve indicated with the red arrow? Also, what does the horizontal line mean?