Calculating Information gain in Decision Trees while choosing which attribute to split on



Suppose we do this by calculating the entropy as -(P+)log(P+) - (P-)log(P-) and comparing the uncertainty. What happens if we encounter a pure set like 4 Yes / 0 No? In that case the second term in the above expression would be 0*(-Infinity). Do we just have to take it to be 0 in that case?



Hi there,

There are two things to focus on here: entropy and information gain.

Entropy is a measure of the impurity of a dataset. And yes, by convention the 0*log(0) term is taken to be 0 (since p*log(p) tends to 0 as p tends to 0), so for a pure dataset with 4 Yes and 0 No the entropy (impurity) is 0.
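A minimal sketch of this in Python (the function name `entropy` is my own, and I assume base-2 logarithms, as is common for decision trees):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a node with `pos` positive and `neg` negative samples.
    By convention 0 * log(0) is taken to be 0, so a pure node has entropy 0."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:  # skip the 0 * log(0) term instead of evaluating it
            h -= p * math.log2(p)
    return h

print(entropy(4, 0))  # pure node -> 0.0
print(entropy(4, 7))  # mixed node -> roughly 0.946
```

Skipping the zero-probability term in the loop is exactly the "assume it as 0" convention from the question.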

What we actually consider when making splits in models like decision trees is the information gain, i.e. the change in entropy after a split. The larger the drop in entropy (the closer the child nodes get to purity), the better the split. When the entropy reaches zero, as in the example you mentioned, no further useful splits are possible.

Hope I made things clear.



Suppose I have to choose between two attributes, where one attribute produces a subset of 4 Yes / 0 No and the other produces a subset of 0 Yes / 7 No. Both are pure, so how do we decide which attribute to use for splitting?



What you are trying to do is compare a single sub-node from each of two independent splits, which is not how splits are evaluated. In a binary problem, the Yes and No counts within a node always complement each other. It is easier to understand with the help of an example.

Suppose initially you had 11 targets: 4 Yes and 7 No. You have two features to split on, namely feature 1 and feature 2. Let's work out the cases.

Case 1: Split on feature 1: 3 Yes and 4 No in the left node, 1 Yes and 3 No in the right node.

Inference: feature 1 was not able to split our dataset homogeneously.

Case 2: Split on feature 2: 4 Yes and 0 No in the left node, 0 Yes and 7 No in the right node.

Inference: feature 2 was able to separate the two classes effectively.

The thing to focus on is that you never look at the entropy of a single sub-node; you look at the weighted sum of the entropies of all the sub-nodes produced by a split.
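That weighted sum can be sketched like this (a minimal illustration; the function names `entropy` and `weighted_entropy` are my own, and I assume base-2 logs):

```python
import math

def entropy(pos, neg):
    """Binary entropy with the 0 * log(0) = 0 convention."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def weighted_entropy(left, right):
    """Entropy of a split: sub-node entropies weighted by sub-node size.
    `left` and `right` are (yes_count, no_count) tuples."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * entropy(*left) + (n_right / n) * entropy(*right)

# The two splits from the example above; both start from the same 11 samples.
print(weighted_entropy((3, 4), (1, 3)))  # feature 1: both sub-nodes mixed
print(weighted_entropy((4, 0), (0, 7)))  # feature 2: both sub-nodes pure -> 0.0
```

Because both sub-nodes of feature 2's split are pure, its weighted entropy is 0, which is why it is the perfect split here.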

And comparing sub-nodes taken from two different splits in isolation is pointless anyway.

Hope this made things clear.



Change your example to:
Case 1: Split on feature 1: 3 Yes and 0 No in the left node, 1 Yes and 7 No in the right node.
Case 2: Split on feature 2: 4 Yes and 1 No in the left node, 0 Yes and 6 No in the right node.

and it becomes exactly my question.

Both feature 1 and feature 2 now produce one pure sub-node. So on what basis would we decide which feature to choose for splitting?



OK, then the question was not clear in the first part. Now that it is clear, let's answer it.

Intuitively, feature 2 creates more homogeneity than feature 1, so feature 2 should provide the better split. But let's evaluate it mathematically.

Entropy before splitting (all logs base 2)

Yes: 4 No: 7

Entropy_before = -[4/11*log2(4/11) + 7/11*log2(7/11)] = 0.946

Entropy after splitting

On feature 1

Left Node
Yes: 3 No: 0

Entropy_Left = -[3/3*log2(3/3) + 0/3*log2(0/3)] = 0 (taking 0*log(0) = 0)

Right Node
Yes: 1 No: 7

Entropy_Right = -[1/8*log2(1/8) + 7/8*log2(7/8)] = 0.544

Total_Entropy_feature1 = n_left/(n_left + n_right) * Entropy_Left + n_right/(n_left + n_right) * Entropy_Right = 3/11*0 + 8/11*0.544 = 0.395

On feature 2

Left Node
Yes: 4 No: 1

Entropy_Left = -[4/5*log2(4/5) + 1/5*log2(1/5)] = 0.722

Right Node
Yes: 0 No: 6

Entropy_Right = -[0/6*log2(0/6) + 6/6*log2(6/6)] = 0

Total_Entropy_feature2 = 5/11*0.722 + 6/11*0 = 0.328

Information gain_feature1 = 0.946 - 0.395 = 0.551
Information gain_feature2 = 0.946 - 0.328 = 0.618

So it is evident that feature 2 provides us with a better split than feature 1.
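The whole calculation can be verified with a short Python sketch (recomputing with base-2 logs; the names `entropy` and `info_gain` are my own):

```python
import math

def entropy(pos, neg):
    """Binary entropy (base-2 log), with 0 * log(0) taken as 0."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c > 0)

def info_gain(parent, children):
    """Information gain = parent entropy minus the size-weighted
    entropy of the child nodes. Counts are (yes, no) tuples."""
    n = sum(sum(child) for child in children)
    weighted = sum(sum(child) / n * entropy(*child) for child in children)
    return entropy(*parent) - weighted

# Parent node: 4 Yes / 7 No, split two different ways as in the example.
gain_f1 = info_gain((4, 7), [(3, 0), (1, 7)])  # roughly 0.550
gain_f2 = info_gain((4, 7), [(4, 1), (0, 6)])  # roughly 0.618
print(f"feature 1: {gain_f1:.3f}, feature 2: {gain_f2:.3f}")
```

Since feature 2 yields the higher gain, a decision tree would split on it first.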

I hope I made things clear.



Thank you very much. That’s exactly what I was trying to ask in the first place when I asked about taking 0/6*log(0/6) to be equal to zero