Continuing the worked example: for the seven-tuple training set S described by the attributes stream (true/false), slope (flat, moderate, steep), and the continuous attribute elevation, the entropy of S is

H(S) = -(3/7) log2(3/7) - (2/7) log2(2/7) - (2/7) log2(2/7) = 1.5567 bits

The information gain of each attribute is then computed. For the continuous attribute elevation, the tuples are first sorted by elevation so that candidate split-points can be taken between adjacent values:

Gain(S, stream) = 1.5567 - 1.2507 = 0.3060
Gain(S, slope) = 1.5567 - 0.9793 = 0.5774
Gain(S, elevation >= 750) = 1.5567 - 1.2507 = 0.3060
Gain(S, elevation >= 4175) = 1.5567 - 0.6936 = 0.8631

The split on elevation >= 4175 yields the largest gain and is chosen as the splitting criterion.

A weakness of information gain is its preference for attributes with many levels: such an attribute will split the data into many subsets, each of which will tend to be pure irrespective of any correlation between the attribute values and the target class. C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a "split information" value defined analogously with Info(D) as

SplitInfo_A(D) = - \sum_{j=1}^{v} (|D_j|/|D|) log2(|D_j|/|D|)

This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. For each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

As per the previous example,

GainRatio(stream) = 0.3060 / 0.9852 = 0.3106

Gini Index

The Gini index is used in CART. Using the notation previously described, the Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 - \sum_{i=1}^{m} p_i^2        (8.7)

where p_i is the probability that a tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. The sum is computed over m classes.

The Gini index considers a binary split for each attribute. Let's first consider the case where A is a discrete-valued attribute having v distinct values, {a_1, a_2, ..., a_v}, occurring in D. To determine the best binary split on A, we examine all the possible subsets that can be formed using known values of A. Each subset, S_A, can be considered as a binary test for attribute A of the form "A ∈ S_A?". Given a tuple, this test is satisfied if the value of A for the tuple is among the values listed in S_A. If A has v possible values, then there are 2^v possible subsets. For example, if income has three possible values, namely {low, medium, high}, then the possible subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high}, and {}. We exclude the power set, {low, medium, high}, and the empty set from consideration since, conceptually, they do not represent a split. Therefore, there are 2^v - 2 possible ways to form two partitions of the data, D, based on a binary split on A.

When considering a binary split, we compute a weighted sum of the impurity of each resulting partition. For example, if a binary split on A partitions D into D_1 and D_2, the Gini index of D given that partitioning is

Gini_A(D) = (|D_1|/|D|) Gini(D_1) + (|D_2|/|D|) Gini(D_2)        (8.8)

For each attribute, each of the possible binary splits is considered.
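To make these quantities concrete, here is a minimal Python sketch of Eqs. (8.7) and (8.8) together with the split-information and gain-ratio definitions above. The function names and the final check are illustrative, not from the text.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a collection of class labels, Eq. (8.7):
    Gini(D) = 1 - sum_i p_i^2, with p_i estimated from class counts."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(d1, d2):
    """Weighted Gini index of a binary partition (D1, D2), Eq. (8.8)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def split_info(partition_sizes):
    """SplitInfo_A(D): potential information generated by splitting D
    into v partitions of the given sizes."""
    n = sum(partition_sizes)
    return -sum(s / n * log2(s / n) for s in partition_sizes if s > 0)

def gain_ratio(info_gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D)."""
    return info_gain / split_info(partition_sizes)

# GainRatio(stream) from the worked example: Gain = 0.3060, and the
# stream test splits the 7 tuples into partitions of sizes 4 and 3.
print(round(gain_ratio(0.3060, [4, 3]), 4))  # -> 0.3106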
For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is selected as its splitting subset. For continuous-valued attributes, each possible split-point must be considered. The strategy is similar to that described earlier for information gain, where the midpoint between each pair of (sorted) adjacent values is taken as a possible split-point. The point giving the minimum Gini index for a given (continuous-valued) attribute is taken as the split-point of that attribute. Recall that for a possible split-point of A, D_1 is the set of tuples in D satisfying A <= split_point, and D_2 is the set of tuples in D satisfying A > split_point.

The reduction in impurity that would be incurred by a binary split on a discrete- or continuous-valued attribute A is

ΔGini(A) = Gini(D) - Gini_A(D)        (8.9)

The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute. This attribute and either its splitting subset (for a discrete-valued splitting attribute) or split-point (for a continuous-valued splitting attribute) together form the splitting criterion.

Example 8.3: Induction of a decision tree using the Gini index. Let D be the training data shown earlier in Table 8.1, where there are nine tuples belonging to the class buys_computer = yes and the remaining five tuples belong to the class buys_computer = no.

Table 8.1 Class-labeled training tuples from the AllElectronics customer database

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

A (root) node N is created for the tuples in D. We first use Eq. (8.7) for the Gini index to compute the impurity of D:

Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

To find the splitting criterion for the tuples in D, we need to compute the Gini index for each attribute. Let's start with the attribute income and consider each of the possible splitting subsets. Consider the subset {low, medium}. This would result in 10 tuples in partition D_1 satisfying the condition "income ∈ {low, medium}." The remaining four tuples of D would be assigned to partition D_2. The Gini index value computed based on this partitioning is

Gini_{income ∈ {low, medium}}(D)
  = (10/14) Gini(D_1) + (4/14) Gini(D_2)
  = (10/14)(1 - (7/10)^2 - (3/10)^2) + (4/14)(1 - (2/4)^2 - (2/4)^2)
  = 0.443
  = Gini_{income ∈ {high}}(D)

Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index.

Evaluating age, we obtain {youth, senior} (or {middle_aged}) as the best split for age, with a Gini index of 0.357; the attributes student and credit_rating are both binary, with Gini index values of 0.367 and 0.429, respectively.

The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 - 0.357 = 0.102. The binary split "age ∈ {youth, senior}" results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion.
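The subset search in Example 8.3 can be reproduced directly. The sketch below (helper and variable names are mine) rebuilds Table 8.1 in memory and enumerates every binary split per attribute, recovering the values derived above: 0.459 for Gini(D), 0.443 for income, 0.357 for age, and 0.367 and 0.429 for student and credit_rating.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(d1, d2):
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Table 8.1 as (age, income, student, credit_rating, buys_computer), RIDs 1-14.
data = [
    ("youth",       "high",   "no",  "fair",      "no"),
    ("youth",       "high",   "no",  "excellent", "no"),
    ("middle_aged", "high",   "no",  "fair",      "yes"),
    ("senior",      "medium", "no",  "fair",      "yes"),
    ("senior",      "low",    "yes", "fair",      "yes"),
    ("senior",      "low",    "yes", "excellent", "no"),
    ("middle_aged", "low",    "yes", "excellent", "yes"),
    ("youth",       "medium", "no",  "fair",      "no"),
    ("youth",       "low",    "yes", "fair",      "yes"),
    ("senior",      "medium", "yes", "fair",      "yes"),
    ("youth",       "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no",  "excellent", "yes"),
    ("middle_aged", "high",   "yes", "fair",      "yes"),
    ("senior",      "medium", "no",  "excellent", "no"),
]

print(round(gini([row[-1] for row in data]), 3))  # 0.459, the impurity of D

def best_binary_split(attr):
    """Enumerate the 2^v - 2 binary splits on attribute `attr` (each split
    appears twice, once per side) and return the minimum-Gini one."""
    values = sorted({row[attr] for row in data})
    best = None
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            d1 = [row[-1] for row in data if row[attr] in subset]
            d2 = [row[-1] for row in data if row[attr] not in subset]
            score = gini_split(d1, d2)
            if best is None or score < best[0]:
                best = (score, subset)
    return best

for name, attr in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    score, subset = best_binary_split(attr)
    print(f"{name}: {subset} -> {round(score, 3)}")
# age: ('middle_aged',) -> 0.357   (equivalently {youth, senior})
# income: ('high',) -> 0.443       (equivalently {low, medium})
# student: ('no',) -> 0.367
# credit_rating: ('excellent',) -> 0.429
```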
Node N is labeled with the criterion, two branches are grown from it, and the tuples are partitioned accordingly.
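Finally, the midpoint strategy for continuous-valued attributes described earlier can be sketched as follows. The data values here are hypothetical, and gini_split is the helper defined in the previous sketches.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(d1, d2):
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_split_point(values, labels):
    """Candidate split-points are midpoints between sorted adjacent values;
    D1 holds tuples with A <= point, D2 holds tuples with A > point.
    Returns the (gini, point) pair minimizing the weighted Gini index."""
    pairs = sorted(zip(values, labels))
    best = None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue  # identical adjacent values give no boundary
        point = (v1 + v2) / 2
        d1 = [lab for v, lab in pairs if v <= point]
        d2 = [lab for v, lab in pairs if v > point]
        score = gini_split(d1, d2)
        if best is None or score < best[0]:
            best = (score, point)
    return best

# Hypothetical continuous attribute with a binary class label:
values = [300, 1200, 1500, 3000, 3900, 4450, 5000]
labels = ["yes", "yes", "no", "yes", "yes", "no", "no"]
score, point = best_split_point(values, labels)
print(point, round(score, 3))  # -> 4175.0 0.229
```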
