Machine Learning in the Real World: C4.5
Outline
▪ Handling Numeric Attributes
▪ Finding Best Split(s)
▪ Dealing with Missing Values
▪ Pruning
▪ Pre-pruning, Post-pruning, Error Estimates
▪ From Trees to Rules
Industrial-strength algorithms
▪ For an algorithm to be useful in a wide range of real-
world applications it must:
▪ Permit numeric attributes
▪ Allow missing values
▪ Be robust in the presence of noise
▪ Be able to approximate arbitrary concept descriptions (at least
in principle)
▪ Basic schemes need to be extended to fulfill these
requirements
C4.5 History
▪ ID3, CHAID – 1960s
▪ C4.5 innovations (Quinlan):
▪ permit numeric attributes
▪ deal sensibly with missing values
▪ pruning to deal with noisy data
▪ C4.5 – one of the best-known and most widely used learning algorithms
▪ Last research version: C4.8, implemented in Weka as J4.8 (Java)
▪ Commercial successor: C5.0 (available from Rulequest)
Numeric attributes
▪ Standard method: binary splits
▪ E.g. temp < 45
▪ Unlike nominal attributes, every numeric attribute has many possible split points
▪ Solution is straightforward extension:
▪ Evaluate info gain (or other measure)
for every possible split point of attribute
▪ Choose “best” split point
▪ Info gain for best split point is info gain for attribute
▪ Computationally more demanding
Weather data – nominal values
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild Normal False Yes
… … … … …
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
Weather data - numeric
Outlook Temperature Humidity Windy Play
Sunny 85 85 False No
Sunny 80 90 True No
Overcast 83 86 False Yes
Rainy 75 80 False Yes
… … … … …
If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
Example
▪ Split on temperature attribute:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
▪ E.g. temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3
▪ Info([4,2],[5,3])
= 6/14 info([4,2]) + 8/14 info([5,3])
= 0.939 bits
▪ Place split points halfway between values
▪ Can evaluate all split points in one pass!
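A minimal sketch in Python (not C4.5 itself), using the toy temperature data above, of evaluating every candidate midpoint split of a sorted numeric attribute; the 71.5 midpoint reproduces the 0.939 bits computed above.

```python
# Sketch: weighted class entropy ("info") for every midpoint split of a
# sorted numeric attribute.  Data taken from the temperature example above.
from math import log2
from collections import Counter

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes","No","Yes","Yes","Yes","No","No","Yes","Yes","Yes","No","Yes","Yes","No"]

def info(labels):
    """Entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def split_info(values, labels, threshold):
    """Weighted entropy of the two subsets produced by value < threshold."""
    left  = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return len(left)/n * info(left) + len(right)/n * info(right)

# Candidate split points: midpoints halfway between adjacent distinct values
candidates = sorted({(a + b) / 2 for a, b in zip(temps, temps[1:]) if a != b})
for t in candidates:
    print(f"temperature < {t}: info = {split_info(temps, labels, t):.3f} bits")
# The 71.5 midpoint gives 6/14*info([4,2]) + 8/14*info([5,3]) ≈ 0.939 bits.
```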
Avoid repeated sorting!
▪ Sort instances by the values of the numeric attribute
▪ Time complexity for sorting: O (n log n)
▪ Q. Does this have to be repeated at each node of
the tree?
▪ A: No! Sort order for children can be derived from sort
order for parent
▪ Time complexity of derivation: O (n)
▪ Drawback: need to create and store an array of sorted indices
for each numeric attribute
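A small sketch of the idea, assuming the parent stores one array of instance indices per numeric attribute in sorted order (the data layout here is an illustrative assumption, not Weka's or C4.5's internals):

```python
# One O(n) pass over the parent's sorted index array yields the children's
# sorted index arrays -- no re-sorting at child nodes.
def derive_child_sort_orders(parent_sorted_indices, child_of):
    """child_of[i] is the child (e.g. "left"/"right") that instance i goes to."""
    children = {}
    for i in parent_sorted_indices:          # already in sorted attribute order
        children.setdefault(child_of[i], []).append(i)
    return children                          # each list is still sorted

# Example: instances 0..4 sorted by a numeric attribute as [3, 0, 4, 1, 2];
# the split sends {0, 1} left and {2, 3, 4} right.
print(derive_child_sort_orders([3, 0, 4, 1, 2],
                               {0: "left", 1: "left",
                                2: "right", 3: "right", 4: "right"}))
# {'right': [3, 4, 2], 'left': [0, 1]}
```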
More speeding up
▪ Entropy only needs to be evaluated between points
of different classes (Fayyad & Irani, 1992)
value 64 65 68 69 70 71 72 72 75 75 80 81 83 85
class Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
(Figure: potential optimal breakpoints are marked only where the class changes.)
Breakpoints between values of the same class cannot be optimal
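A sketch of the Fayyad & Irani shortcut on the temperature data above. The one subtlety in this sketch is that a value observed with mixed classes (such as 72 here) still creates candidate breakpoints on both sides:

```python
# Only midpoints at class boundaries need to be evaluated as split points.
from itertools import groupby

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]   # sorted
labels = ["Yes","No","Yes","Yes","Yes","No","No","Yes","Yes","Yes","No","Yes","Yes","No"]

# For each distinct value, collect the set of classes observed at that value.
groups = [(v, {c for _, c in grp})
          for v, grp in groupby(zip(temps, labels), key=lambda p: p[0])]

# Keep a midpoint unless both neighbouring value groups are pure and share
# the same single class.
candidates = [
    (v1 + v2) / 2
    for (v1, cs1), (v2, cs2) in zip(groups, groups[1:])
    if not (len(cs1) == 1 and cs1 == cs2)
]
print(candidates)   # [64.5, 66.5, 70.5, 71.5, 73.5, 77.5, 80.5, 84.0]
```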
Binary vs. multi-way splits
▪ Splitting (multi-way) on a nominal attribute
exhausts all information in that attribute
▪ Nominal attribute is tested (at most) once on any path
in the tree
▪ Not so for binary splits on numeric attributes!
▪ Numeric attribute may be tested several times along a
path in the tree
▪ Disadvantage: tree is hard to read
▪ Remedy:
▪ pre-discretize numeric attributes, or
▪ use multi-way splits instead of binary ones
Missing as a separate value
▪ Missing value denoted “?” in C4.X
▪ Simple idea: treat missing as a separate value
▪ Q: When is this not appropriate?
▪ A: When values are missing due to different
reasons
▪ Example 1: gene expression could be missing when it is
very high or very low
▪ Example 2: field IsPregnant=missing for a male
patient should be treated differently (no) than for a
female patient of age 25 (unknown)
Missing values - advanced
Split instances with missing values into pieces
▪ A piece going down a branch receives a weight
proportional to the popularity of the branch
▪ weights sum to 1
▪ Info gain works with fractional instances
▪ use sums of weights instead of counts
▪ During classification, split the instance into pieces
in the same way
▪ Merge probability distribution using weights
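A minimal sketch (Python, with an assumed nested-dict tree representation rather than C4.5's internals) of classifying an instance whose tested value is missing: the instance is split into weighted pieces and the leaves' class distributions are merged.

```python
# "Fractional instances": a case with a missing value is split across all
# branches, each piece weighted by the branch's share of training instances.
def classify(node, instance, weight=1.0):
    """Return {class: probability * weight} for one (possibly fractional) instance."""
    if "leaf" in node:                                    # leaf: class distribution
        return {c: p * weight for c, p in node["leaf"].items()}
    value = instance.get(node["attribute"])               # None encodes "missing"
    if value is None:
        total = sum(node["branch_counts"].values())       # popularity of each branch
        pieces = [(b, weight * n / total) for b, n in node["branch_counts"].items()]
    else:
        pieces = [(value, weight)]
    merged = {}
    for branch, w in pieces:
        for c, p in classify(node["children"][branch], instance, w).items():
            merged[c] = merged.get(c, 0.0) + p
    return merged

# Tiny tree for the weather data: 5 sunny / 4 overcast / 5 rainy training instances.
tree = {
    "attribute": "outlook",
    "branch_counts": {"sunny": 5, "overcast": 4, "rainy": 5},
    "children": {
        "sunny":    {"leaf": {"yes": 0.4, "no": 0.6}},
        "overcast": {"leaf": {"yes": 1.0, "no": 0.0}},
        "rainy":    {"leaf": {"yes": 0.6, "no": 0.4}},
    },
}
print(classify(tree, {"outlook": None}))   # weighted mix of the three leaves
```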
Pruning
▪ Goal: Prevent overfitting to noise in the
data
▪ Two strategies for “pruning” the decision
tree:
◆ Postpruning - take a fully-grown decision tree
and discard unreliable parts
◆ Prepruning - stop growing a branch when
information becomes unreliable
▪ Postpruning preferred in practice—
prepruning can “stop too early”
Prepruning
▪ Based on statistical significance test
▪ Stop growing the tree when there is no statistically significant
association between any attribute and the class at a particular
node
▪ Most popular test: chi-squared test
▪ ID3 used chi-squared test in addition to information gain
▪ Only statistically significant attributes were allowed to be
selected by information gain procedure
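A sketch of such a significance check at a single node, assuming scipy is available; the contingency table uses the weather data's outlook-vs-play counts, and the 5% threshold is an illustrative choice, not ID3's exact procedure.

```python
# Pre-pruning check: chi-squared test of association between one candidate
# attribute and the class among the instances reaching this node.
from scipy.stats import chi2_contingency

# Rows: attribute values (outlook = sunny/overcast/rainy);
# columns: class counts (play = yes/no).
contingency = [
    [2, 3],   # sunny:    2 yes, 3 no
    [4, 0],   # overcast: 4 yes, 0 no
    [3, 2],   # rainy:    3 yes, 2 no
]
chi2, p_value, dof, expected = chi2_contingency(contingency)
if p_value > 0.05:
    print(f"p = {p_value:.3f}: no significant association -- stop growing here")
else:
    print(f"p = {p_value:.3f}: attribute is significant -- allow the split")
```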
Early stopping
#  a  b  class
1  0  0  0
2  0  1  1
3  1  0  1
4  1  1  0
▪ Pre-pruning may stop the growth process
prematurely: early stopping
▪ Classic example: XOR/Parity-problem
▪ No individual attribute exhibits any significant
association to the class
▪ Structure is only visible in fully expanded tree
▪ Pre-pruning won’t expand the root node
▪ But: XOR-type problems rare in practice
▪ And: pre-pruning faster than post-pruning
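A tiny check of the XOR table above (Python): computed on all four rows, splitting on a or b alone yields zero information gain, which is why pre-pruning stops at the root.

```python
# Neither attribute of the XOR table reduces the class entropy on its own.
from math import log2
from collections import Counter

rows = [  # (a, b, class) -- the XOR table above
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

def info(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

classes = [c for _, _, c in rows]
for name, col in (("a", 0), ("b", 1)):
    split = {}
    for row in rows:
        split.setdefault(row[col], []).append(row[2])
    remainder = sum(len(s)/len(rows) * info(s) for s in split.values())
    print(f"gain({name}) = {info(classes) - remainder:.2f} bits")   # 0.00 for both
```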
Post-pruning
▪ First, build full tree
▪ Then, prune it
▪ Fully-grown tree shows all attribute interactions
▪ Problem: some subtrees might be due to chance effects
▪ Two pruning operations:
1. Subtree replacement
2. Subtree raising
▪ Possible strategies:
▪ error estimation
▪ significance testing
▪ MDL principle
Subtree replacement, 1
▪ Bottom-up
▪ Consider replacing a tree
only after considering all
its subtrees
▪ Ex: labor negotiations
Subtree replacement, 2
What subtree can we replace?
Estimating error rates
▪ Prune only if it reduces the estimated error
▪ Error on the training data is NOT a useful
estimator
Q: Why would it result in very little pruning?
▪ Use hold-out set for pruning
(“reduced-error pruning”)
▪ C4.5’s method
▪ Derive confidence interval from training data
▪ Use a heuristic limit, derived from this, for pruning
▪ Standard Bernoulli-process-based method
▪ Shaky statistical assumptions (based on training data)
*Mean and variance
▪ Mean and variance for a Bernoulli trial:
p, p (1–p)
▪ Expected success rate f=S/N
▪ Mean and variance for f : p, p (1–p)/N
▪ For large enough N, f follows a Normal distribution
▪ The c% confidence interval [−z ≤ X ≤ z] for a random
variable X with 0 mean is given by:
Pr[−z ≤ X ≤ z] = c
▪ With a symmetric distribution:
Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
*Confidence limits
▪ Confidence limits for the normal distribution with 0 mean and
a variance of 1:
Pr[X ≥ z]    z
0.1%      3.09
0.5%      2.58
1%        2.33
5%        1.65
10%       1.28
20%       0.84
25%       0.69
40%       0.25
(Figure: standard normal density with 0 mean and unit variance; the central region between −1.65 and 1.65 covers 90%.)
▪ Thus:
Pr[−1.65 ≤ X ≤ 1.65] = 90%
▪ To use this we have to reduce our random variable f to have
0 mean and unit variance
*Transforming f
▪ Transformed value for f:  (f − p) / √(p(1 − p)/N)
(i.e. subtract the mean and divide by the standard deviation)
▪ Resulting equation:
Pr[ −z ≤ (f − p) / √(p(1 − p)/N) ≤ z ] = c
▪ Solving for p:
p = ( f + z²/(2N) ± z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
C4.5’s method
▪ Error estimate for subtree is weighted sum of error
estimates for all its leaves
▪ Error estimate for a node (upper bound):
e = ( f + z²/(2N) + z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
▪ If c = 25% then z = 0.69 (from normal distribution)
▪ f is the error on the training data
▪ N is the number of instances covered by the leaf
Example
Parent node: f = 5/14, e = 0.46
Leaf nodes (covering 6, 2 and 6 instances): f = 0.33, e = 0.47;  f = 0.5, e = 0.72;  f = 0.33, e = 0.47
Combined using ratios 6:2:6 gives 0.51
Since 0.46 < 0.51, prune!
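A short sketch of the error estimate e from the previous slide applied to this example (z = 0.69 for c = 25%); it reproduces the leaf estimates 0.47, 0.72, 0.47 and the combined 0.51.

```python
# C4.5-style pessimistic error estimate for a leaf with training error f over
# N instances, and the weighted combination used in subtree replacement.
from math import sqrt

def error_estimate(f, N, z=0.69):
    """Upper confidence bound on the error of a leaf (formula on the previous slide)."""
    return (f + z*z/(2*N) + z * sqrt(f/N - f*f/N + z*z/(4*N*N))) / (1 + z*z/N)

# The three leaves cover 6, 2 and 6 instances with 2, 1 and 2 training errors.
leaves = [(2/6, 6), (1/2, 2), (2/6, 6)]
estimates = [error_estimate(f, N) for f, N in leaves]
print([round(e, 2) for e in estimates])            # [0.47, 0.72, 0.47]

combined = sum(e * N for e, (_, N) in zip(estimates, leaves)) / 14
print(round(combined, 2))                          # 0.51 -- worse than the parent's
# estimate quoted above, so the subtree is replaced by a single leaf.
```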
From trees to rules – how?
How can we produce a set of rules from
a decision tree?
From trees to rules – simple
▪ Simple way: one rule for each leaf
▪ C4.5rules: greedily prune conditions from each rule
if this reduces its estimated error
▪ Can produce duplicate rules
▪ Check for this at the end
▪ Then
▪ look at each class in turn
▪ consider the rules for that class
▪ find a “good” subset (guided by MDL)
▪ Then rank the subsets to avoid conflicts
▪ Finally, remove rules (greedily) if this decreases
error on the training data
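A sketch of the "one rule for each leaf" starting point, using an assumed nested-dict tree for the weather data (the representation and names are illustrative, not C4.5rules itself):

```python
# Walk the tree and collect the tests along each root-to-leaf path as a rule.
def tree_to_rules(node, conditions=()):
    if "leaf" in node:                               # leaf: emit one rule
        return [(list(conditions), node["leaf"])]
    rules = []
    for value, child in node["children"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules

weather_tree = {
    "attribute": "outlook",
    "children": {
        "sunny":    {"attribute": "humidity",
                     "children": {"high": {"leaf": "no"}, "normal": {"leaf": "yes"}}},
        "overcast": {"leaf": "yes"},
        "rainy":    {"attribute": "windy",
                     "children": {"true": {"leaf": "no"}, "false": {"leaf": "yes"}}},
    },
}
for conds, cls in tree_to_rules(weather_tree):
    print("If", " and ".join(conds), "then play =", cls)
# e.g. "If outlook = sunny and humidity = high then play = no"
```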
C4.5rules: choices and options
▪ C4.5rules slow for large and noisy datasets
▪ Commercial version C5.0rules uses a different technique
▪ Much faster and a bit more accurate
▪ C4.5 has two parameters
▪ Confidence value (default 25%):
lower values incur heavier pruning
▪ Minimum number of instances in the two most popular
branches (default 2)
Summary
▪ Decision Trees
▪ splits – binary, multi-way
▪ split criteria – entropy, gini, …
▪ missing value treatment
▪ pruning
▪ rule extraction from trees
▪ Both C4.5 and CART are robust tools
▪ No method is always superior –
experiment!
Reference:
• Machine Learning in Real World: C4.5. (n.d.). Retrieved from https://info.psu.edu.sa/psu/cis/asameh/cs-500/dm7-decision-tree-c45.ppt