Lecture 7: Impurity Measures for Decision Trees
Madhavan Mukund
https://www.cmi.ac.in/~madhavan
A better impurity function
Entropy
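A standard statement of the entropy impurity, added here as a reference and assuming the usual base-2 convention: for a node with class proportions p0 and p1,

H(p_0, p_1) = -p_0 \log_2 p_0 - p_1 \log_2 p_1

With the convention 0 \log 0 = 0, H = 0 for a pure node (p0 = 0 or p1 = 0) and H reaches its maximum value 1 when p0 = p1 = 0.5.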
Gini Index
Measure of unequal distribution of wealth, from economics [Corrado Gini]
As before, n data items
n0 with c = 0, p0 = n0/n
n1 with c = 1, p1 = n1/n
Gini index G = 1 − (p0² + p1²)
G = 0 when p0 = 0, p1 = 1, or vice versa
G = 0.5 when p0 = p1 = 0.5
The entropy curve is slightly steeper, but the Gini index is easier to compute
Decision tree libraries usually use the Gini index
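A minimal sketch in Python, not taken from the slides, comparing the two impurity measures on a binary node; it reproduces the endpoint and midpoint values quoted above.

from math import log2

def gini(p1):
    """Gini index G = 1 - (p0^2 + p1^2) for a two-class node."""
    p0 = 1.0 - p1
    return 1.0 - (p0 ** 2 + p1 ** 2)

def entropy(p1):
    """Entropy H = -p0 log2 p0 - p1 log2 p1, taking 0 log 0 = 0."""
    p0 = 1.0 - p1
    return sum(-p * log2(p) for p in (p0, p1) if p > 0.0)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p1 = {p:.2f}  Gini = {gini(p):.3f}  Entropy = {entropy(p):.3f}")

# Both measures are 0 for a pure node and peak at p1 = 0.5
# (Gini = 0.5, entropy = 1.0); the Gini index avoids logarithms,
# which is one reason libraries tend to default to it.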
Information gain
Greedy strategy: choose the attribute that maximizes the reduction in impurity, i.e., maximizes information gain
Suppose an attribute is a unique identifier: roll number, passport number, Aadhaar, . . .
Querying this attribute produces partitions of size 1
Each partition is guaranteed to be pure, so the new impurity is zero
Maximum possible impurity reduction, but useless! (see the sketch below)
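A minimal sketch, not taken from the slides, showing why a unique identifier wins on raw information gain; the labels and attribute values below are made up purely for illustration.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels (base 2, with 0 log 0 = 0)."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Parent entropy minus the size-weighted entropy of the partitions
    induced by the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

labels  = [0, 0, 1, 1, 1, 0]
weather = ["sunny", "sunny", "rain", "rain", "sunny", "rain"]  # an ordinary attribute
roll_no = [101, 102, 103, 104, 105, 106]                       # unique identifier

print(information_gain(labels, weather))  # modest gain, well below the parent entropy
print(information_gain(labels, roll_no))  # equals the parent entropy (1.0): every
                                          # partition has size 1 and is pure, so the
                                          # gain is maximal -- but the split is useless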
Information gain
Attribute Impurity
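The worked example from these slides is not reproduced; the standard definitions it relies on, stated here as a reference: if an attribute A splits the data D into parts D_1, ..., D_k, the attribute impurity is the size-weighted impurity of the parts, and the information gain of A is the resulting drop in impurity.

\mathrm{Imp}(A) = \sum_{i=1}^{k} \frac{|D_i|}{|D|}\,\mathrm{Imp}(D_i),
\qquad
\mathrm{Gain}(A) = \mathrm{Imp}(D) - \mathrm{Imp}(A)

Here Imp can be either the entropy or the Gini index from the earlier slides.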
Summary