The lecture discusses impurity measures for decision trees. While the misclassification rate is the most natural measure, entropy and the Gini index work better because they increase more sharply and so register improvements that the linear misclassification rate misses. Entropy comes from information theory, while the Gini index originated as a measure of unequal distribution. Information gain is used to select splitting attributes, but it overvalues unique identifiers; the information gain ratio addresses this by dividing the information gain by the attribute's own impurity, penalizing highly scattered attributes.


Lecture 7: Impurity Measures for Decision Trees

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


August–December 2020
Misclassification rate

Goal: partition the data so that each leaf has a uniform category c in {0, 1}, that is, a pure leaf
In an impure node, the best prediction is the majority value; the minority fraction is the misclassification rate
Heuristic: reduce impurity as much as possible
For each attribute, compute the weighted average misclassification rate of the children, and choose the attribute that minimizes it
[Figure: misclassification rate against the fraction of inputs with c = 1; the curve is linear]
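To make the heuristic concrete, here is a minimal Python sketch (the function names and the counts-per-child representation are my own, not from the slides): the misclassification rate of a node computed from its class counts, and the weighted average over the children of a candidate split.

    def misclassification_rate(n0, n1):
        # Best prediction is the majority class; the minority fraction is the error.
        n = n0 + n1
        return min(n0, n1) / n if n else 0.0

    def weighted_misclassification(children):
        # children: list of (n0, n1) class counts, one pair per child of a candidate split.
        total = sum(n0 + n1 for n0, n1 in children)
        return sum((n0 + n1) / total * misclassification_rate(n0, n1)
                   for n0, n1 in children)

    # Example: 10 rows (6 with c=0, 4 with c=1) split into two children.
    print(misclassification_rate(6, 4))                  # 0.4
    print(weighted_misclassification([(5, 1), (1, 3)]))  # (6/10)*(1/6) + (4/10)*(1/4) = 0.2
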
A better impurity function

Misclassification rate is linear
An impurity measure that increases more sharply performs better, empirically
Entropy [Quinlan]
Gini index [Breiman]
[Figure: a sharper, concave impurity curve against the fraction of inputs with c = 1]
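A small worked example of why sharpness matters (the numbers are illustrative, and the Gini index used here is only defined two slides later): a split can leave the weighted misclassification rate unchanged even though it makes the children purer, while a concave measure registers the improvement.

    # Parent node: 600 rows with c = 0 and 400 with c = 1.
    # Candidate split: child A = (300, 100), child B = (300, 300).

    def miscl(n0, n1):
        return min(n0, n1) / (n0 + n1)

    def gini(n0, n1):
        p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)
        return 1 - (p0**2 + p1**2)

    children = [(300, 100), (300, 300)]
    total = sum(n0 + n1 for n0, n1 in children)
    for name, f in [("misclassification", miscl), ("gini", gini)]:
        parent = f(600, 400)
        kids = sum((n0 + n1) / total * f(n0, n1) for n0, n1 in children)
        print(name, parent, kids, parent - kids)
    # misclassification: 0.4 -> 0.4, reduction 0 (up to rounding; the split looks worthless)
    # gini:              0.48 -> 0.45, reduction 0.03 (the improvement is registered)
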
Entropy

Information-theoretic measure of randomness: the minimum number of bits needed to transmit a message [Shannon]
n data items: n0 with c = 0, p0 = n0/n; n1 with c = 1, p1 = n1/n
Entropy E = -(p0 log2 p0 + p1 log2 p1)
Minimum when p0 = 1, p1 = 0 or vice versa (note: declare 0 log2 0 to be 0)
Maximum when p0 = p1 = 0.5
[Figure: entropy against the fraction of inputs with c = 1]
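A direct transcription of the two-class entropy formula into Python (hypothetical helper name), with the 0 log2 0 = 0 convention handled explicitly:

    import math

    def entropy(n0, n1):
        # E = -(p0 log2 p0 + p1 log2 p1), with 0 log2 0 taken to be 0.
        n = n0 + n1
        e = 0.0
        for ni in (n0, n1):
            p = ni / n
            if p > 0:
                e -= p * math.log2(p)
        return e

    print(entropy(10, 0))  # 0.0   (pure node: minimum)
    print(entropy(5, 5))   # 1.0   (evenly mixed: maximum)
    print(entropy(9, 1))   # about 0.469
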
Gini Index

Measure of unequal distribution of wealth, from economics [Corrado Gini]
As before, n data items: n0 with c = 0, p0 = n0/n; n1 with c = 1, p1 = n1/n
Gini index G = 1 - (p0^2 + p1^2)
G = 0 when p0 = 1, p1 = 0 or vice versa
G = 0.5 when p0 = p1 = 0.5
The entropy curve is slightly steeper, but the Gini index is easier to compute
Decision tree libraries usually use the Gini index
[Figure: Gini index against the fraction of inputs with c = 1]
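The Gini index is simpler by comparison, needing no logarithm; a sketch with the same illustrative two-class setup as above:

    def gini(n0, n1):
        # G = 1 - (p0^2 + p1^2)
        n = n0 + n1
        p0, p1 = n0 / n, n1 / n
        return 1 - (p0**2 + p1**2)

    print(gini(10, 0))  # 0.0   (pure node: minimum)
    print(gini(5, 5))   # 0.5   (evenly mixed: maximum)
    print(gini(9, 1))   # 0.18

For instance, scikit-learn's DecisionTreeClassifier uses criterion="gini" by default, with "entropy" available as an option.
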
Information gain

Greedy strategy: choose the attribute that maximizes the reduction in impurity, that is, maximizes information gain
Suppose an attribute is a unique identifier: roll number, passport number, Aadhaar, ...
Querying this attribute produces partitions of size 1
Each partition is guaranteed to be pure, so the new impurity is zero
Maximum possible impurity reduction, but useless!
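A sketch of information gain for a categorical attribute (the data layout and helper names are illustrative): partition the rows by attribute value and subtract the weighted child entropy from the parent entropy. Splitting on a unique identifier makes every child pure, so its gain equals the full parent entropy, the largest value possible.

    import math
    from collections import Counter

    def entropy_of_labels(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(attr_values, labels):
        # Impurity of the parent minus the weighted impurity of the children.
        groups = {}
        for v, c in zip(attr_values, labels):
            groups.setdefault(v, []).append(c)
        children = sum(len(g) / len(labels) * entropy_of_labels(g) for g in groups.values())
        return entropy_of_labels(labels) - children

    labels  = [0, 0, 0, 1, 1, 1]
    weather = ["sun", "sun", "rain", "rain", "sun", "rain"]  # an ordinary attribute
    row_id  = [1, 2, 3, 4, 5, 6]                             # a unique identifier
    print(information_gain(weather, labels))  # about 0.08: a modest gain
    print(information_gain(row_id, labels))   # 1.0: the maximum possible, yet useless
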
Information gain

The tree building algorithm blindly picks the attribute that maximizes information gain
We need a correction that penalizes attributes whose values are highly scattered
Idea: extend the notion of impurity to attributes
Attribute Impurity

The attribute takes values {v1, v2, ..., vk}
vi appears ni times across the n rows; pi = ni/n

Entropy across the k values: - Σ_{i=1}^{k} pi log2 pi

Gini index across the k values: 1 - Σ_{i=1}^{k} pi^2
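These are the same formulas as before, applied to the distribution of the attribute's own values rather than to the class labels. A minimal sketch (hypothetical helper name):

    import math
    from collections import Counter

    def attribute_impurity(attr_values, measure="entropy"):
        # Impurity of the attribute's own value distribution: pi = ni / n for each value vi.
        n = len(attr_values)
        ps = [c / n for c in Counter(attr_values).values()]
        if measure == "entropy":
            return -sum(p * math.log2(p) for p in ps)
        return 1 - sum(p**2 for p in ps)  # Gini

    print(attribute_impurity(["sun", "sun", "rain", "rain", "sun", "rain"]))  # 1.0 (two equally frequent values)
    print(attribute_impurity([1, 2, 3, 4, 5, 6]))                             # log2 6, about 2.585
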
Attribute Impurity

Extreme case: each value occurs exactly once, so each pi = 1/n

Entropy: - Σ_{i=1}^{n} (1/n) log2 (1/n) = -n * (1/n) * (-log2 n) = log2 n
Gini index: 1 - Σ_{i=1}^{n} (1/n)^2 = 1 - n * (1/n^2) = (n - 1)/n
Both increase as n increases

Penalizing scattered attributes: divide the information gain by the attribute impurity

Information gain ratio(A) = Information-Gain(A) / Impurity(A)

Scattered attributes have a high denominator, counteracting their high numerator
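Putting the pieces together in a self-contained sketch (helper names and example data are illustrative): the gain ratio divides the information gain by the attribute's impurity, so a unique identifier, whose impurity log2 n grows with the number of rows, no longer wins automatically.

    import math
    from collections import Counter

    def entropy_of_counts(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

    def information_gain(attr_values, labels):
        parent = entropy_of_counts(Counter(labels).values())
        groups = {}
        for v, c in zip(attr_values, labels):
            groups.setdefault(v, []).append(c)
        children = sum(len(g) / len(labels) * entropy_of_counts(Counter(g).values())
                       for g in groups.values())
        return parent - children

    def gain_ratio(attr_values, labels):
        # Scattered attributes have a high impurity (denominator),
        # which counteracts their high information gain (numerator).
        return information_gain(attr_values, labels) / entropy_of_counts(Counter(attr_values).values())

    labels  = [0, 0, 0, 1, 1, 1]
    weather = ["sun", "sun", "sun", "rain", "rain", "rain"]  # splits the classes cleanly
    row_id  = [1, 2, 3, 4, 5, 6]                             # unique identifier
    # Both attributes have information gain 1.0, but the gain ratio tells them apart:
    print(gain_ratio(weather, labels))  # 1.0 / 1.0 = 1.0
    print(gain_ratio(row_id, labels))   # 1.0 / log2(6), about 0.39
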
Summary

We can find better measures of impurity than the misclassification rate
A non-linear impurity function works better in practice: entropy, Gini index
The Gini index is used in most decision tree libraries

Blindly using information gain can be problematic
Attributes that are unique identifiers for rows produce maximum information gain but have little utility
Dividing the information gain by the impurity of the attribute gives the information gain ratio
