
Decision Tree Classifier

(C4.5 Algorithm)

TIET, PATIALA
C4.5 Algorithm
▪ The C4.5 algorithm was also proposed by Quinlan and is an extension of the earlier ID3 algorithm.

▪ It removes the restrictions and limitations of the ID3 variant of the decision tree algorithm.

▪ It can work with both discrete and continuous data.

▪ C4.5 handles the issue of incomplete (missing) data very well.

▪ The algorithm is not biased towards features with a high number of distinct values.

▪ The algorithm inherently employs a single-pass pruning process to mitigate overfitting.
Evaluating Splits – C4.5
▪ The ID3 algorithm's Information Gain metric is biased towards features with a large number of distinct values.
▪ C4.5 handles this limitation by normalizing the Information Gain metric with a term called SplitInfo. The normalized Information Gain is called the Gain Ratio.
▪ Gain Ratio of an attribute A for a given dataset is computed as:
Gain Ratio(S, A) = Information Gain(S, A) / SplitInfo(S, A)

Information Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)

SplitInfo(S, A) = − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · log₂(|Sᵥ| / |S|)
where Sᵥ is the set of rows in S for which the feature column A has value v, |Sᵥ| is the number of
rows in Sᵥ and likewise |S| is the number of rows in S.
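
For reference, here is a minimal Python sketch of these formulas. It assumes the dataset is a list of rows with the class label in the last column; all names (entropy, gain_ratio, attr_index) are illustrative and not taken from the slides.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum of p_i * log2(p_i) over the class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr_index):
    # Gain Ratio(S, A) = Information Gain(S, A) / SplitInfo(S, A) for a nominal attribute A.
    labels = [r[-1] for r in rows]
    n = len(rows)
    # Partition S into the subsets S_v, one per distinct value v of the attribute.
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr_index], []).append(r[-1])
    avg_info = sum(len(s) / n * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - avg_info
    split_info = -sum((len(s) / n) * log2(len(s) / n) for s in subsets.values())
    return info_gain / split_info if split_info else 0.0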
Handling Continuous Values – C4.5 Algorithm
▪ The C4.5 algorithm partitions a continuous attribute's values into a discrete set of intervals.

▪ C4.5 performs a binary split based on a threshold value for features with continuous values.

▪ The threshold should be the value that offers the maximum gain ratio (or information gain) for that attribute.
C4.5 Algorithm – Pseudocode
1. Check for the base cases (as discussed for the ID3 algorithm).
2. For each attribute a, find the normalized information gain ratio from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain ratio.
4. Create a decision node that splits on a_best.
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
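
A simplified Python sketch of this recursion is shown below. It handles nominal attributes only and omits continuous splits, missing values and pruning, so it illustrates the pseudocode above rather than reproducing Quinlan's full implementation; the row format (class label in the last column) and all names are assumptions.

from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def _gain_ratio(rows, a):
    # Normalized information gain for splitting the rows on attribute index a.
    parts = {}
    for r in rows:
        parts.setdefault(r[a], []).append(r[-1])
    n = len(rows)
    gain = _entropy([r[-1] for r in rows]) - sum(len(p) / n * _entropy(p) for p in parts.values())
    split = -sum(len(p) / n * log2(len(p) / n) for p in parts.values())
    return gain / split if split else 0.0

def build_tree(rows, attributes):
    labels = [r[-1] for r in rows]
    # Step 1: base cases -- pure node, or no attributes left -> majority-class leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick a_best, the attribute with the highest gain ratio.
    a_best = max(attributes, key=lambda a: _gain_ratio(rows, a))
    # Steps 4-5: create a decision node and recur on each sublist.
    node = {'attribute': a_best, 'children': {}}
    for value in set(r[a_best] for r in rows):
        sublist = [r for r in rows if r[a_best] == value]
        node['children'][value] = build_tree(sublist, [a for a in attributes if a != a_best])
    return node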
Numerical Example 1 – C4.5 Algorithm
Consider a dataset that records the factors used to decide whether to play tennis outside over the previous 14 days.
The dataset is similar to the one used for the ID3 algorithm.
The difference is that the temperature and humidity columns have continuous values instead of nominal ones.
Train a C4.5 decision tree classifier on this data.
Solution- Example 1
Compute the entropy of the entire dataset:

Entropy(S) = − Σᵢ pᵢ · log₂(pᵢ)

Positive examples = 9
Negative examples = 5
Total = 14

Entropy(S) = − (9/14) · log₂(9/14) − (5/14) · log₂(5/14) = 0.940
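
This value can be checked quickly in Python (a one-off sanity check, not part of the original slides):

from math import log2

p_yes, p_no = 9 / 14, 5 / 14
entropy_s = -p_yes * log2(p_yes) - p_no * log2(p_no)
print(round(entropy_s, 3))  # 0.94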
Solution- Example 1 (Contd…)
Outlook Attribute
Outlook is a nominal attribute. Its possible values are Rainy, Overcast and Sunny.
Entropy(Outlook = Sunny) = − (2/5) · log₂(2/5) − (3/5) · log₂(3/5) = 0.971

Entropy(Outlook = Rainy) = − (3/5) · log₂(3/5) − (2/5) · log₂(2/5) = 0.971

Entropy(Outlook = Overcast) = − (4/4) · log₂(4/4) − (0/4) · log₂(0/4) = 0 (taking 0 · log₂ 0 = 0)

Average Information Entropy = I(S, Outlook) = Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

I(S, Outlook) = (5/14) × 0.971 + (5/14) × 0.971 + (4/14) × 0 = 0.693

Information Gain(S, Outlook) = Entropy(S) − I(S, Outlook) = 0.940 − 0.693 = 0.247

SplitInfo(S, Outlook) = − (5/14) · log₂(5/14) − (5/14) · log₂(5/14) − (4/14) · log₂(4/14) = 1.577

Gain Ratio(S, Outlook) = Information Gain(S, Outlook) / SplitInfo(S, Outlook) = 0.247 / 1.577 = 0.155
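
The same numbers can be reproduced with a short snippet; the per-value (yes, no) counts are taken from the slide, and the helper name H is just illustrative:

from math import log2

def H(pos, neg):
    # Entropy of a (yes, no) count pair; empty classes contribute 0.
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c)

counts = {'Sunny': (2, 3), 'Rainy': (3, 2), 'Overcast': (4, 0)}   # (yes, no) per Outlook value
n = 14
info = sum((p + q) / n * H(p, q) for p, q in counts.values())      # 0.693
gain = H(9, 5) - info                                              # 0.247
split_info = -sum((p + q) / n * log2((p + q) / n) for p, q in counts.values())  # 1.577
print(round(gain / split_info, 3))  # approx. 0.156 (the slide rounds intermediate values and shows 0.155)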
Solution- Example 1 (Contd…)
In the ID3 algorithm, we calculated gains for each attribute. Here, we calculate gain ratios instead.
Windy Attribute
Windy is a nominal attribute. Its possible values are Weak and Strong.
Entropy(Windy = Strong) = − (3/6) · log₂(3/6) − (3/6) · log₂(3/6) = 1

Entropy(Windy = Weak) = − (6/8) · log₂(6/8) − (2/8) · log₂(2/8) = 0.811

Average Information Entropy = I(S, Windy) = Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

I(S, Windy) = (6/14) × 1 + (8/14) × 0.811 = 0.892

Information Gain(S, Windy) = Entropy(S) − I(S, Windy) = 0.940 − 0.892 = 0.048

SplitInfo(S, Windy) = − (6/14) · log₂(6/14) − (8/14) · log₂(8/14) = 0.985

Gain Ratio(S, Windy) = Information Gain(S, Windy) / SplitInfo(S, Windy) = 0.048 / 0.985 = 0.049
Solution Example – 1 (Contd….)
▪ Humidity is a continuous attribute.

▪ We need to convert the continuous values to nominal ones.

▪ First, we sort the humidity values from smallest to largest (as shown in the table).

▪ Then we iterate over all humidity values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value.

▪ We calculate the gain ratio (or gain) for every step. The value that maximizes it becomes the threshold.

Humidity   Play
65         Yes
70         No
70         Yes
70         Yes
75         Yes
78         Yes
80         Yes
80         Yes
80         No
85         No
90         No
90         Yes
95         No
96         Yes
Solution Example – 1 (Contd….)
Check 65 as a threshold for Humidity:

                 Yes   No
Humidity <= 65    1     0
Humidity > 65     8     5

Entropy(Humidity ≤ 65) = − (1/1) · log₂(1/1) − (0/1) · log₂(0/1) = 0

Entropy(Humidity > 65) = − (8/13) · log₂(8/13) − (5/13) · log₂(5/13) = 0.961

Average Information Entropy = I(S, Humidity<>65) = (1/14) × 0 + (13/14) × 0.961 = 0.892

Information Gain(S, Humidity<>65) = Entropy(S) − I(S, Humidity<>65) = 0.940 − 0.892 = 0.048

SplitInfo(S, Humidity<>65) = − (1/14) · log₂(1/14) − (13/14) · log₂(13/14) = 0.371

Gain Ratio(S, Humidity<>65) = 0.048 / 0.371 = 0.126

Note: the notation Humidity<>65 refers to the two branches of the decision tree, one for humidity less than or equal to 65 and one for humidity greater than 65. It does not mean that humidity is not equal to 65.
Solution Example – 1 (Contd….)
Similarly, the process is repeated for each unique value of Humidity, giving the following gain ratios:
For 70, Gain_Ratio(S, Humidity<>70) = 0.016
For 75, Gain_Ratio(S, Humidity<>75) = 0.047
For 78, Gain_Ratio(S, Humidity<>78) = 0.090
For 80, Gain_Ratio(S, Humidity<>80) = 0.107
For 85, Gain_Ratio(S, Humidity<>85) = 0.027
For 90, Gain_Ratio(S, Humidity<>90) = 0.016
For 95, Gain_Ratio(S, Humidity<>95) = 0.118
The value 96 is ignored as a threshold, because no humidity value in the dataset is greater than it.
As seen, the gain ratio is maximized when the threshold is 65 (0.126). So Humidity<>65 is the candidate split for this attribute, and it is compared against the other attributes when choosing a branch for the tree.
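
The whole threshold search can be reproduced with the sketch below, using the sorted humidity column from the earlier table. Small differences from the slide values (for example roughly 0.128 rather than 0.126 at threshold 65) come from intermediate rounding.

from math import log2

# Sorted (humidity, play) pairs from the slide.
data = [(65, 'Yes'), (70, 'No'), (70, 'Yes'), (70, 'Yes'), (75, 'Yes'), (78, 'Yes'),
        (80, 'Yes'), (80, 'Yes'), (80, 'No'), (85, 'No'), (90, 'No'), (90, 'Yes'),
        (95, 'No'), (96, 'Yes')]

def H(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n) for c in set(labels))

labels = [play for _, play in data]
n = len(data)
for t in sorted({h for h, _ in data})[:-1]:          # the largest value (96) is skipped
    left = [play for h, play in data if h <= t]
    right = [play for h, play in data if h > t]
    gain = H(labels) - (len(left) / n * H(left) + len(right) / n * H(right))
    split_info = -(len(left) / n) * log2(len(left) / n) - (len(right) / n) * log2(len(right) / n)
    print(t, round(gain / split_info, 3))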
Solution Example – 1 (Contd….)
▪ Temperature is also a continuous attribute.

▪ We need to convert the continuous values to nominal ones.

▪ First, we sort the temperature values from smallest to largest (as shown in the table).

▪ Then we iterate over all temperature values and separate the dataset into two parts: instances less than or equal to the current value, and instances greater than the current value.

▪ We calculate the gain ratio (or gain) for every step. The value that maximizes it becomes the threshold.

Temperature   Play
64            Yes
65            No
68            Yes
69            Yes
70            Yes
71            No
72            No
72            Yes
75            Yes
75            Yes
80            No
81            Yes
83            Yes
85            No
Solution Example – 1 (Contd….)
Check 68 as a threshold for Temperature:

                     Yes   No
Temperature <= 68     2     1
Temperature > 68      7     4

Entropy(Temp ≤ 68) = − (2/3) · log₂(2/3) − (1/3) · log₂(1/3) = 0.918

Entropy(Temp > 68) = − (7/11) · log₂(7/11) − (4/11) · log₂(4/11) = 0.946

Average Information Entropy = I(S, Temp<>68) = (3/14) × 0.918 + (11/14) × 0.946 = 0.940

Information Gain(S, Temp<>68) = Entropy(S) − I(S, Temp<>68) = 0.940 − 0.940 ≈ 0.000

SplitInfo(S, Temp<>68) = − (3/14) · log₂(3/14) − (11/14) · log₂(11/14) = 0.750

Gain Ratio(S, Temp<>68) = 0.000 / 0.750 ≈ 0.001

So 68 is a poor threshold for Temperature; this split barely reduces the entropy.
Solution Example – 1 (Contd….)
Similarly, the process is repeated for each unique value of Temperature, and the maximum gain ratio is obtained for 83:

Gain_Ratio(S, Temperature<>83) = 0.305

The dataset of continuous values is transformed into nominal values: Humidity is converted to Humidity>65 and Temperature is converted to Temperature>83, each with values Yes or No (as shown in the table).

Outlook    Temperature>83   Humidity>65   Windy    Play
Overcast   No               Yes           Weak     Yes
Overcast   No               No            Strong   Yes
Overcast   No               Yes           Strong   Yes
Overcast   No               Yes           Weak     Yes
Rainy      No               Yes           Weak     Yes
Rainy      No               Yes           Weak     Yes
Rainy      No               Yes           Strong   No
Rainy      No               Yes           Weak     Yes
Rainy      No               Yes           Strong   No
Sunny      Yes              Yes           Weak     No
Sunny      No               Yes           Strong   No
Sunny      No               Yes           Weak     No
Sunny      No               Yes           Weak     Yes
Sunny      No               Yes           Strong   Yes
Solution Example – 1 (Contd….)
Summary of Gain Ratio for Each Attribute

Attribute          Gain Ratio
Outlook            0.155
Temperature<>83    0.305
Humidity<>65       0.126
Windy              0.049

The Temperature<>83 split has the maximum gain ratio. This means that the Temperature>83 decision is placed at the root of the decision tree.
Solution Example – 1 (Contd….)
So for each value of Temperature>83, i.e. Yes and No, a branch is added, and the C4.5 algorithm is applied recursively to the instances corresponding to (Temperature>83) = No and to (Temperature>83) = Yes.

Instances with (Temperature>83) = No:

Temperature>83   Outlook    Humidity>65   Windy    Play
No               Overcast   Yes           Weak     Yes
No               Overcast   No            Strong   Yes
No               Overcast   Yes           Strong   Yes
No               Overcast   Yes           Weak     Yes
No               Rainy      Yes           Weak     Yes
No               Rainy      Yes           Weak     Yes
No               Rainy      Yes           Strong   No
No               Rainy      Yes           Weak     Yes
No               Rainy      Yes           Strong   No
No               Sunny      Yes           Strong   No
No               Sunny      Yes           Weak     No
No               Sunny      Yes           Weak     Yes
No               Sunny      Yes           Strong   Yes

Instances with (Temperature>83) = Yes:

Temperature>83   Outlook    Humidity>65   Windy    Play
Yes              Sunny      Yes           Weak     No
Solution Example – 1 (Contd….)
The Final Decision Tree is as follows:
def predict(temperature, outlook, humidity, wind):
    if temperature <= 83:                    # (Temperature>83) = No
        if outlook == 'Rainy':
            if wind == 'Weak':
                return 'Yes'
            elif wind == 'Strong':
                return 'No'
        elif outlook == 'Overcast':
            return 'Yes'
        elif outlook == 'Sunny':
            if humidity > 65:                # (Humidity>65) = Yes
                if wind == 'Strong':
                    return 'Yes'
                elif wind == 'Weak':
                    return 'Yes'
    else:                                    # (Temperature>83) = Yes
        return 'No'
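
For illustration, two hypothetical inputs and the labels the tree above returns for them:

print(predict(temperature=70, outlook='Rainy', humidity=80, wind='Weak'))    # Yes
print(predict(temperature=85, outlook='Sunny', humidity=70, wind='Strong'))  # No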
Pruning in C4.5 Algorithm
▪ The C4.5 algorithm employs single-pass statistical testing for pruning.
▪ It makes use of a pessimistic error rate based on confidence intervals.
▪ In C4.5, the pessimistic error estimate for a node is computed as:

e = ( f + z²/(2N) + z · √( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

▪ z is derived from the desired confidence value
  ▪ If c = 25% then z = 0.69 (from the normal distribution)
▪ f is the error rate on the training data
▪ N is the number of instances covered by the leaf
Pruning in C4.5 Algorithm (Contd….)
▪ The error estimate for a subtree is the weighted sum of the error estimates of all its leaves.
▪ A subtree is pruned (replaced by a single leaf) if the error estimate of that leaf is lower than the error estimate of the subtree.
▪ Consider the unpruned subtree shown in the accompanying figure, where + marks a correct classification and − marks an incorrect classification on the training set.
Pruning in C4.5 Algorithm (Contd….)
We target the health plan node near the bottom of the tree for pruning. First we calculate the estimated upper error rate for the unpruned subtree:
for the health plan = none leaf node:
    f = 2/6, N = 6  =>  upper error estimate ≈ 0.47
    (using z = 0.69 for the 75th-percentile estimate)
for the health plan = half leaf node:
    f = 1/2, N = 2  =>  upper error estimate ≈ 0.72
for the health plan = full leaf node:
    f = 2/6, N = 6  =>  upper error estimate ≈ 0.47
Weighted average of the upper error rates over the three leaves (weights 6 : 2 : 6): ≈ 0.51
Pruning in C4.5 Algorithm (Contd….)
On the other hand, if the tree were pruned by replacing the health plan node with a single leaf (9+, 5−), the confidence interval calculation would be:
f = 5/14, N = 14  =>  upper error estimate ≈ 0.45
Since the pruned node gives a lower upper estimate of the error rate (0.45 < 0.51), the subtree is indeed pruned and replaced by the leaf.
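
These pessimistic error estimates can be checked with a direct implementation of the formula (a small sketch using z = 0.69; it reproduces the values above up to rounding):

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    # Upper confidence bound on a leaf's error rate, as used by C4.5 pruning.
    return (f + z * z / (2 * N) + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

leaves = [(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]                          # health plan = none / half / full
subtree = sum(N * pessimistic_error(f, N) for f, N in leaves) / 14     # weighted 6:2:6
pruned = pessimistic_error(5 / 14, 14)                                 # single leaf (9+, 5-)
print(round(subtree, 2), round(pruned, 2))                             # approx. 0.51 and 0.45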
