The lecture discusses impurity measures for decision trees. While the misclassification rate is the most natural measure, entropy and the Gini index work better because they increase more sharply and so register improvements that the linear misclassification rate misses. Entropy comes from information theory, while the Gini index originated as a measure of unequal distribution. Information gain is used to select splitting attributes, but it overvalues unique identifiers; the information gain ratio addresses this by dividing the information gain by the attribute's own impurity, penalizing highly scattered attributes.


Lecture 7: Impurity Measures for Decision Trees

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


August–December 2020
Misclassification rate

Goal: partition the data so that each leaf has a uniform category c in {0, 1}, that is, a pure leaf
In an impure node, the best prediction is the majority value; the minority fraction is the misclassification rate
Heuristic: reduce impurity as much as possible
For each attribute, compute the weighted average misclassification rate of the children, and choose the attribute that minimizes it
[Figure: misclassification rate against the fraction of inputs with c = 1; the curve is linear]
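To make the heuristic concrete, here is a minimal Python sketch (the function names and the counts-per-child representation are my own, not from the slides): the misclassification rate of a node computed from its class counts, and the weighted average over the children of a candidate split.

    def misclassification_rate(n0, n1):
        # Best prediction is the majority class; the minority fraction is the error.
        n = n0 + n1
        return min(n0, n1) / n if n else 0.0

    def weighted_misclassification(children):
        # children: list of (n0, n1) class counts, one pair per child of a candidate split.
        total = sum(n0 + n1 for n0, n1 in children)
        return sum((n0 + n1) / total * misclassification_rate(n0, n1)
                   for n0, n1 in children)

    # Example: 10 rows (6 with c=0, 4 with c=1) split into two children.
    print(misclassification_rate(6, 4))                  # 0.4
    print(weighted_misclassification([(5, 1), (1, 3)]))  # (6/10)*(1/6) + (4/10)*(1/4) = 0.2
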
A better impurity function

Misclassification rate is linear
An impurity measure that increases more sharply performs better, empirically
Entropy [Quinlan]
Gini index [Breiman]
[Figure: a sharper, concave impurity curve against the fraction of inputs with c = 1]
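A small worked example of why sharpness matters (the numbers are illustrative, and the Gini index used here is only defined two slides later): a split can leave the weighted misclassification rate unchanged even though it makes the children purer, while a concave measure registers the improvement.

    # Parent node: 600 rows with c = 0 and 400 with c = 1.
    # Candidate split: child A = (300, 100), child B = (300, 300).

    def miscl(n0, n1):
        return min(n0, n1) / (n0 + n1)

    def gini(n0, n1):
        p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)
        return 1 - (p0**2 + p1**2)

    children = [(300, 100), (300, 300)]
    total = sum(n0 + n1 for n0, n1 in children)
    for name, f in [("misclassification", miscl), ("gini", gini)]:
        parent = f(600, 400)
        kids = sum((n0 + n1) / total * f(n0, n1) for n0, n1 in children)
        print(name, parent, kids, parent - kids)
    # misclassification: 0.4 -> 0.4, reduction 0 (up to rounding; the split looks worthless)
    # gini:              0.48 -> 0.45, reduction 0.03 (the improvement is registered)
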
Entropy

Information-theoretic measure of randomness: the minimum number of bits needed to transmit a message [Shannon]
n data items: n0 with c = 0, p0 = n0/n; n1 with c = 1, p1 = n1/n
Entropy E = -(p0 log2 p0 + p1 log2 p1)
Minimum when p0 = 1, p1 = 0 or vice versa (note: declare 0 log2 0 to be 0)
Maximum when p0 = p1 = 0.5
[Figure: entropy against the fraction of inputs with c = 1]
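A direct transcription of the two-class entropy formula into Python (hypothetical helper name), with the 0 log2 0 = 0 convention handled explicitly:

    import math

    def entropy(n0, n1):
        # E = -(p0 log2 p0 + p1 log2 p1), with 0 log2 0 taken to be 0.
        n = n0 + n1
        e = 0.0
        for ni in (n0, n1):
            p = ni / n
            if p > 0:
                e -= p * math.log2(p)
        return e

    print(entropy(10, 0))  # 0.0   (pure node: minimum)
    print(entropy(5, 5))   # 1.0   (evenly mixed: maximum)
    print(entropy(9, 1))   # about 0.469
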
Gini Index

Measure of unequal distribution of wealth, from economics [Corrado Gini]
As before, n data items: n0 with c = 0, p0 = n0/n; n1 with c = 1, p1 = n1/n
Gini index G = 1 - (p0^2 + p1^2)
G = 0 when p0 = 1, p1 = 0 or vice versa
G = 0.5 when p0 = p1 = 0.5
The entropy curve is slightly steeper, but the Gini index is easier to compute
Decision tree libraries usually use the Gini index
[Figure: Gini index against the fraction of inputs with c = 1]
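The Gini index is simpler by comparison, needing no logarithm; a sketch with the same illustrative two-class setup as above:

    def gini(n0, n1):
        # G = 1 - (p0^2 + p1^2)
        n = n0 + n1
        p0, p1 = n0 / n, n1 / n
        return 1 - (p0**2 + p1**2)

    print(gini(10, 0))  # 0.0   (pure node: minimum)
    print(gini(5, 5))   # 0.5   (evenly mixed: maximum)
    print(gini(9, 1))   # 0.18

For instance, scikit-learn's DecisionTreeClassifier uses criterion="gini" by default, with "entropy" available as an option.
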
Information gain

Greedy strategy: choose the attribute that maximizes the reduction in impurity, that is, maximizes information gain
Suppose an attribute is a unique identifier: roll number, passport number, Aadhaar, ...
Querying this attribute produces partitions of size 1
Each partition is guaranteed to be pure, so the new impurity is zero
Maximum possible impurity reduction, but useless!
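A sketch of information gain for a categorical attribute (the data layout and helper names are illustrative): partition the rows by attribute value and subtract the weighted child entropy from the parent entropy. Splitting on a unique identifier makes every child pure, so its gain equals the full parent entropy, the largest value possible.

    import math
    from collections import Counter

    def entropy_of_labels(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(attr_values, labels):
        # Impurity of the parent minus the weighted impurity of the children.
        groups = {}
        for v, c in zip(attr_values, labels):
            groups.setdefault(v, []).append(c)
        children = sum(len(g) / len(labels) * entropy_of_labels(g) for g in groups.values())
        return entropy_of_labels(labels) - children

    labels  = [0, 0, 0, 1, 1, 1]
    weather = ["sun", "sun", "rain", "rain", "sun", "rain"]  # an ordinary attribute
    row_id  = [1, 2, 3, 4, 5, 6]                             # a unique identifier
    print(information_gain(weather, labels))  # about 0.08: a modest gain
    print(information_gain(row_id, labels))   # 1.0: the maximum possible, yet useless
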
Information gain

The tree building algorithm blindly picks the attribute that maximizes information gain
We need a correction that penalizes attributes whose values are highly scattered
Idea: extend the notion of impurity to attributes
Attribute Impurity

The attribute takes values {v1, v2, ..., vk}
vi appears ni times across the n rows; pi = ni/n

Entropy across the k values: - Σ_{i=1}^{k} pi log2 pi

Gini index across the k values: 1 - Σ_{i=1}^{k} pi^2
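These are the same formulas as before, applied to the distribution of the attribute's own values rather than to the class labels. A minimal sketch (hypothetical helper name):

    import math
    from collections import Counter

    def attribute_impurity(attr_values, measure="entropy"):
        # Impurity of the attribute's own value distribution: pi = ni / n for each value vi.
        n = len(attr_values)
        ps = [c / n for c in Counter(attr_values).values()]
        if measure == "entropy":
            return -sum(p * math.log2(p) for p in ps)
        return 1 - sum(p**2 for p in ps)  # Gini

    print(attribute_impurity(["sun", "sun", "rain", "rain", "sun", "rain"]))  # 1.0 (two equally frequent values)
    print(attribute_impurity([1, 2, 3, 4, 5, 6]))                             # log2 6, about 2.585
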
Attribute Impurity

Extreme case: each value occurs exactly once, so each pi = 1/n

Entropy: - Σ_{i=1}^{n} (1/n) log2 (1/n) = -n * (1/n) * (-log2 n) = log2 n
Gini index: 1 - Σ_{i=1}^{n} (1/n)^2 = 1 - n * (1/n^2) = (n - 1)/n
Both increase as n increases

Penalizing scattered attributes: divide the information gain by the attribute impurity

Information gain ratio(A) = Information-Gain(A) / Impurity(A)

Scattered attributes have a high denominator, counteracting their high numerator
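Putting the pieces together in a self-contained sketch (helper names and example data are illustrative): the gain ratio divides the information gain by the attribute's impurity, so a unique identifier, whose impurity log2 n grows with the number of rows, no longer wins automatically.

    import math
    from collections import Counter

    def entropy_of_counts(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def information_gain(attr_values, labels):
        parent = entropy_of_counts(Counter(labels).values())
        groups = {}
        for v, c in zip(attr_values, labels):
            groups.setdefault(v, []).append(c)
        children = sum(len(g) / len(labels) * entropy_of_counts(Counter(g).values())
                       for g in groups.values())
        return parent - children

    def gain_ratio(attr_values, labels):
        # Scattered attributes have a high impurity (denominator),
        # which counteracts their high information gain (numerator).
        return information_gain(attr_values, labels) / entropy_of_counts(Counter(attr_values).values())

    labels  = [0, 0, 0, 1, 1, 1]
    weather = ["sun", "sun", "sun", "rain", "rain", "rain"]  # splits the classes cleanly
    row_id  = [1, 2, 3, 4, 5, 6]                             # unique identifier
    # Both attributes have information gain 1.0, but the gain ratio tells them apart:
    print(gain_ratio(weather, labels))  # 1.0 / 1.0 = 1.0
    print(gain_ratio(row_id, labels))   # 1.0 / log2(6), about 0.39
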
Summary

We can find better measures of impurity than the misclassification rate
A non-linear impurity function works better in practice: entropy, Gini index
The Gini index is used in most decision tree libraries

Blindly using information gain can be problematic
Attributes that are unique identifiers for rows produce maximum information gain but have little utility
Dividing the information gain by the impurity of the attribute gives the information gain ratio
