The document discusses the Gini index, a measure of impurity used in decision tree algorithms like CART, to determine the best binary splits for attributes in a dataset. It explains how to compute the Gini index for both discrete and continuous attributes and how to select the attribute that maximizes the reduction in impurity. An example illustrates the process of calculating Gini indices for various splits to identify the optimal splitting criterion for a decision tree.
[Handwritten worked example, only partially legible in the scan: a training set S is split on the attributes stream, slope, and the continuous attribute elevation (binarized at candidate thresholds such as elevation >= 750). The entropy H(S) is computed first, and the information gain of each candidate split is H(S) minus the weighted entropy of the resulting partitions.]
The information gain measure is biased by its preference for attributes with many levels: such attributes will split the data into many subsets, each of which will tend to be pure irrespective of any correlation between the attribute values and the target class. C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a "split information" value defined as

SplitInfo_A(D) = - \sum_{j=1}^{v} (|D_j|/|D|) \log_2(|D_j|/|D|).

This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. For each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D).
[The handwritten notes then apply this to the previous example, computing GainRatio for the stream split by dividing its information gain by its split information; the numeric values are not legible in the scan.]
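To make the preceding definitions concrete, here is a minimal Python sketch of the gain-ratio computation; the list-of-dicts data layout and the function names are assumptions made for this illustration, not part of the original notes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """Gain(A) / SplitInfo_A(D), per the formulas above.
    rows is assumed to be a list of dicts keyed by attribute name."""
    n = len(rows)
    info_d = entropy([r[target] for r in rows])   # Info(D)
    partitions = {}
    for r in rows:                                # group D into D_j by value of A
        partitions.setdefault(r[attr], []).append(r)
    info_a = 0.0                                  # Info_A(D)
    split_info = 0.0                              # SplitInfo_A(D)
    for dj in partitions.values():
        w = len(dj) / n                           # |D_j| / |D|
        info_a += w * entropy([r[target] for r in dj])
        split_info -= w * log2(w)
    gain = info_d - info_a                        # Gain(A)
    return gain / split_info if split_info > 0 else 0.0
```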
Gini Index
The Gini index is used in CART. Using the notation previously described, the Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2,        (8.7)

where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. The sum is computed over m classes.
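A minimal Python sketch of Eq. (8.7), assuming the class labels of a partition are supplied as a plain list:

```python
from collections import Counter

def gini(labels):
    """Gini impurity per Eq. (8.7): 1 - sum_i p_i^2, p_i = |C_i,D| / |D|."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# For the AllElectronics data of Example 8.3: 9 'yes' and 5 'no' tuples.
print(round(gini(["yes"] * 9 + ["no"] * 5), 3))  # -> 0.459
```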
"The Gini index considers a binary split for each attribute. Let’s first consider the case
where A is a discrete-valued attribute having v distinct values, {a, a2,..., ay}, occur-
ring in D. To determine the best binary split on A, we examine all the possible subsets
that can be formed using known values of A. Each subset, S4, can be considered as a
binary test for attribute A of the form “A € S4?” Given a tuple, this test is satisfied if
the value of A for the tuple is among the values listed in S4. If A has v possible val-
ues, then there are 2” possible subsets. For example, if income has three possible values, —
namely {low, medium, high}, then the possible subsets are {low, medium, high}, {
medium}, (low, high}, {medium, high}, {low}, {medium}, {high}, and {}. We exch
power set, {low, medium, high}, and the empty set from consideration since, co’
ally, they do not represent a split. Therefore, there are 2” — 2 possible ways
partitions of the data, D, based on a binary split on A. eWhen considering a binary split, we compute a weighted sum of the im
resulting partition, For example, if a binary split on A partitions D into PUT ofeach
Gini index of D given that partitioning is Day the
j IDI Gini IDal
Gini(D) = —Gini(D,) + —— Gi 7
inig(D) TDI ini(D1) + DI iini(D»). (ea)
For each attribute, each of the possible binary splits is considered. For a discrete-valued
attribute, the subset that gives the minimum Gini index for that attribute is selected as
its splitting subset.
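The subset enumeration and the weighted sum of Eq. (8.8) might be sketched as follows; best_discrete_split is a hypothetical helper name, and the rows are again assumed to be dicts keyed by attribute name.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_discrete_split(rows, attr, target):
    """Evaluate Eq. (8.8) for every proper, non-empty subset S_A of
    attr's values and return the subset with the minimum Gini index."""
    values = sorted({r[attr] for r in rows})
    best_gini, best_subset = float("inf"), None
    # k runs from 1 to v-1, skipping the empty set and the full set, so
    # 2^v - 2 subsets are tried (each two-way split is seen twice, which
    # is harmless since both orderings give the same weighted Gini).
    for k in range(1, len(values)):
        for subset in combinations(values, k):
            d1 = [r[target] for r in rows if r[attr] in subset]
            d2 = [r[target] for r in rows if r[attr] not in subset]
            w = (len(d1) * gini(d1) + len(d2) * gini(d2)) / len(rows)
            if w < best_gini:
                best_gini, best_subset = w, set(subset)
    return best_gini, best_subset
```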
For continuous-valued attributes, each possible split-point must be considered. The strategy is similar to that described earlier for information gain, where the midpoint between each pair of (sorted) adjacent values is taken as a possible split-point. The point giving the minimum Gini index for a given (continuous-valued) attribute is taken as the split-point of that attribute. Recall that for a possible split-point of A, D_1 is the set of tuples in D satisfying A <= split_point, and D_2 is the set of tuples in D satisfying A > split_point.
The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is

\Delta Gini(A) = Gini(D) - Gini_A(D).        (8.9)
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute. This attribute and either its splitting subset (for a discrete-valued splitting attribute) or split-point (for a continuous-valued splitting attribute) together form the splitting criterion.
Example 8.3 Induction of a decision tree using the Gini index. Let D be the training data shown in Table 8.1, where there are nine tuples belonging to the class buys_computer = yes and the remaining five tuples belong to the class buys_computer = no. A (root) node N is created for the tuples in D. We first use Eq. (8.7) for the Gini index to compute the impurity of D:

Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459.

To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute. Let's start with the attribute income and consider each of the possible splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in partition D_1 satisfying the condition "income ∈ {low, medium}." The remaining four tuples of D would be assigned to partition D_2.

Table 8.1 Class-Labeled Training Tuples from the AllElectronics Customer Database
RID   age           income   student   credit_rating   Class: buys_computer
1     youth         high     no        fair            no
2     youth         high     no        excellent       no
3     middle_aged   high     no        fair            yes
4     senior        medium   no        fair            yes
5     senior        low      yes       fair            yes
6     senior        low      yes       excellent       no
7     middle_aged   low      yes       excellent       yes
8     youth         medium   no        fair            no
9     youth         low      yes       fair            yes
10    senior        medium   yes       fair            yes
11    youth         medium   yes       excellent       yes
12    middle_aged   medium   no        excellent       yes
13    middle_aged   high     yes       fair            yes
14    senior        medium   no        excellent       no
The Gini index value computed based on this partitioning is

Gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D_1) + (4/14) Gini(D_2)
    = (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2)
    = 0.443
    = Gini_{income ∈ {high}}(D).
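This arithmetic is easy to verify directly:

```python
# Weighted Gini for the split income ∈ {low, medium} vs. {high}, Eq. (8.8):
d1 = 10/14 * (1 - (7/10)**2 - (3/10)**2)   # D1: 7 yes, 3 no
d2 = 4/14 * (1 - (2/4)**2 - (2/4)**2)      # D2: 2 yes, 2 no
print(round(d1 + d2, 3))                   # -> 0.443
```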
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index. Evaluating age, we obtain {youth, senior} (or {middle_aged}) as the best split for age with a Gini index of 0.357; the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.

The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 - 0.357 = 0.102. The binary split "age ∈ {youth, senior}" results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion. Node N is labeled with the criterion, two branches are grown from it, and the tuples are partitioned accordingly.
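As a closing check on the chosen criterion, the same arithmetic applied to the age split (from Table 8.1, D_1 = {youth, senior} contains 5 yes and 5 no tuples, while D_2 = {middle_aged} is pure):

```python
d1 = 10/14 * (1 - (5/10)**2 - (5/10)**2)   # D1: youth + senior, 5 yes / 5 no
d2 = 4/14 * (1 - (4/4)**2 - (0/4)**2)      # D2: middle_aged, all yes (pure)
print(round(d1 + d2, 3))                   # -> 0.357, the minimum Gini index
print(round(0.459 - (d1 + d2), 3))         # reduction in impurity -> 0.102
```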