6 Decision Tree Learning

"Prediction is very difficult, especially if it's about the future." — Niels Bohr

Decision tree learning is a widely used predictive model for classification tasks. The decision tree model basically represents logical rules that predict the value of a target variable by inferring from data features. This chapter provides insight into how to construct a decision tree and how to infer knowledge from the tree.

Learning Objectives
• Understand the structure of the decision tree
• Know about the fundamentals of Entropy
• Learn and understand popular univariate decision tree induction algorithms such as ID3 and C4.5, and the multivariate decision tree algorithm CART
• Deal with continuous attributes using the improved C4.5 algorithm
• Construct Classification and Regression Trees (CART) for classifying both categorical and continuous-valued target variables
• Construct regression trees where the target feature is a continuous-valued variable
• Understand the basics of validating and pruning decision trees

6.1 INTRODUCTION TO DECISION TREE LEARNING MODEL

The decision tree learning model, one of the most popular supervised learning models, classifies data instances with high accuracy and consistency. The model performs an inductive inference that reaches a general conclusion from observed examples. It is applied for solving complex classification applications.

A decision tree is a concept tree which summarizes the information contained in the training dataset in the form of a tree structure. Once the concept model is built, test data can be easily classified. This model can be used to classify both categorical target variables and continuous-valued target variables. Given a training dataset X, the model computes a hypothesis function f(x) in the form of a decision tree. Inputs to the model are data instances or objects with a set of features or attributes, which can be discrete or continuous, and the output of the model is a decision tree which predicts or classifies the target class for the test data object.

In statistical terms, attributes or features are called independent variables. The target attribute or target class is called the response variable, which indicates the category we need to predict for a test object.

The decision tree learning model generates a complete hypothesis space in the form of a tree structure with the given training dataset and allows us to search through the possible hypotheses, which in fact would be a smaller decision tree as we walk through the tree. This kind of search bias is called preference bias.

6.1.1 Structure of a Decision Tree

A decision tree has a structure that consists of a root node, internal nodes (decision nodes), branches, and terminal nodes (leaf nodes). The topmost node in the tree is the root node. Internal nodes are the test nodes and are also called decision nodes. These nodes represent a choice or test of an attribute, and the outcomes of the test condition are the branches emanating from the decision node. The branches are labelled as per the outcomes or output values of the test condition. Each branch represents a sub-tree or subsection of the entire tree. Every decision node is part of a path to a leaf node. The leaf nodes represent the labels or the outcome of a decision path. The labels of the leaf nodes are the different target classes a data instance can belong to.
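As a rough illustration of this structure, the sketch below shows one possible in-memory representation of such a tree in Python: a decision node records the attribute it tests and one branch per outcome, while a leaf node carries only a target class label. The representation, attribute names and class labels are ours, not from the text.

```python
# Illustrative only: one possible representation of the structure described
# above. A decision node is a dict holding the tested attribute and one child
# per branch (outcome value); a leaf node is just a target class label.
# The attribute names and class labels below are hypothetical.

leaf_pass = "Pass"                      # leaf node: carries only the target class
leaf_fail = "Fail"

decision_node = {                       # internal/decision node: tests one attribute
    "attribute": "Class attendance",
    "branches": {"Good": leaf_pass, "Average": leaf_pass, "Poor": leaf_fail},
}

root_node = {                           # root node: the topmost decision node
    "attribute": "Assessment",
    "branches": {"Good": leaf_pass, "Moderate": decision_node, "Poor": leaf_fail},
}
```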
Every path from the root to a leaf node represents a logical rule that corresponds to a conjunction of test attributes, and the whole tree represents a disjunction of these conjunctions. The decision tree model, in general, represents a collection of logical rules of classification in the form of a tree structure.

Decision networks, otherwise called influence diagrams, have a directed graph structure with nodes and links. They are an extension of Bayesian belief networks and represent information about each node's current state, its possible actions, the possible outcomes of those actions, and their utility. The concept of Bayesian Belief Network (BBN) is discussed in Chapter 9.

Figure 6.1 shows the symbols that are used in this book to represent the different nodes in the construction of a decision tree. A circle is used to represent a root node, a diamond symbol is used to represent a decision node or internal node, and all leaf nodes are represented with a rectangle.

Figure 6.1: Nodes in a Decision Tree (circle: root node; diamond: decision node; rectangle: leaf node)

Building the Tree
Goal: Construct a decision tree with the given training dataset.
The tree is constructed in a top-down fashion. It starts from the root node. At every level of tree construction, we need to find the best split attribute or best decision node among all attributes. This process is recursive and is continued until we end up in the last level of the tree or find a leaf node which cannot be split further. The tree construction is complete when all the test conditions lead to a leaf node. The leaf node contains the target class or the output of classification.
Output: Decision tree representing the complete hypothesis space.

Knowledge Inference or Classification
Goal: Given a test instance, infer the target class it belongs to.
Classification: Inferring the target class for the test instance or object is based on inductive inference on the constructed decision tree. In order to classify an object, we need to start traversing the tree from the root. We traverse as we evaluate the test condition on every decision node with the test object's attribute value and walk to the branch corresponding to the test's outcome. This process is repeated until we end up in a leaf node which contains the target class of the test object (a short code sketch of this traversal is given at the end of this subsection).
Output: Target label of the test instance.

Advantages of Decision Trees
1. Easy to model and interpret
2. Simple to understand
3. The input and output attributes can be discrete or continuous predictor variables
4. Can model a high degree of nonlinearity in the relationship between the target variables and the predictor variables
5. Quick to train

Disadvantages of Decision Trees
Some of the issues that generally arise with decision tree learning are:
1. It is difficult to determine how deeply a decision tree can be grown or when to stop growing it.
2. If the training data has errors or missing attribute values, then the decision tree constructed may become unstable or biased.
3. If the training data has continuous-valued attributes, handling them is computationally complex and they have to be discretized.
4. A complex decision tree may also over-fit the training data.
5. Decision tree learning is not well suited for classifying multiple output classes.
6. Learning an optimal decision tree is also known to be NP-complete.
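The classification procedure described above (start at the root, follow the branch matching each test outcome, and stop at a leaf) can be sketched as follows. The nested-dictionary tree layout and the attribute names are hypothetical, chosen only for illustration.

```python
# A minimal sketch of the classification procedure: evaluate the test
# condition at each decision node against the test instance, follow the
# matching branch, and stop when a leaf node is reached.

def classify(node, instance):
    while isinstance(node, dict):              # still at a decision node
        tested_attribute = node["attribute"]   # test condition of this node
        outcome = instance[tested_attribute]   # the test object's attribute value
        node = node["branches"][outcome]       # walk down the matching branch
    return node                                # leaf node: the predicted target class

# Hypothetical tree and test instance, for illustration only.
tree = {
    "attribute": "Assessment",
    "branches": {
        "Good": "Pass",
        "Poor": "Fail",
        "Moderate": {"attribute": "Class attendance",
                     "branches": {"Good": "Pass", "Average": "Pass", "Poor": "Fail"}},
    },
}

print(classify(tree, {"Assessment": "Moderate", "Class attendance": "Poor"}))  # -> Fail
```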
Example 6.1: Predict a student's academic performance, that is, whether the student will pass or fail in the final examination, given information such as class attendance, class assignments, home-work assignments, assessment, participation in competitions or other events, and group activities such as projects and presentations.

Solution: The target feature is the student's performance in the final examination, that is, whether the student will pass or fail. The decision nodes are test nodes which check for conditions such as 'What is the student's class attendance?', 'How did he perform in his class assignments?', 'Did he do his home-work assignments properly?', 'What about his assessment results?', 'Did he participate in competitions or other events?' and 'What is the performance rating in group activities such as projects and presentations?'. Table 6.1 shows the attributes and the set of values for each attribute.

Table 6.1: Attributes and Associated Values
Attribute                                            | Values
Class attendance                                     | Good, Average, Poor
Class assignments                                    | Good, Moderate, Poor
Home-work assignments                                | Yes, No
Assessment                                           | Good, Moderate, Poor
Participation in competitions or other events        | Yes, No
Group activities such as projects and presentations  | Yes, No
Exam Result                                          | Pass, Fail

The leaf nodes represent the outcomes, that is, either 'Pass' or 'Fail'. A decision tree would be constructed by following a set of if-else conditions which may or may not include all the attributes, and the outcomes of a decision node can be two or more than two. Hence, the tree is not a binary tree.

Note: A decision tree is not always a binary tree. It is a tree in which a node can have more than two branches.

Example 6.2: Predict a student's academic performance of whether he will pass or fail based on the given information such as 'Assessment' and 'Assignment'. Table 6.2 shows the independent variables, Assessment and Assignment, and the target variable Exam Result with their values. Draw a binary decision tree.

Table 6.2: Attributes and Associated Values
Attribute    | Values
Assessment   | ≥50, <50
Assignment   | Yes, No
Exam Result  | Pass, Fail

Solution: Consider the root node to be 'Assessment'. If a student's marks are ≥50, the root node is branched to the leaf node 'Pass', and if the assessment marks are <50, it is branched to another decision node. The decision node in the next level of the tree is 'Assignment': if a student has submitted his assignment, the node branches to 'Pass', and if not submitted, the node branches to 'Fail'. Figure 6.2 depicts this rule.

Figure 6.2: Illustration of a Decision Tree

This tree can be interpreted as a sequence of logical rules as follows:
    if (Assessment ≥ 50) then 'Pass'
    else if (Assessment < 50) then
        if (Assignment = Yes) then 'Pass'
        else if (Assignment = No) then 'Fail'

Now, if a test instance is given, such as a student who has scored 42 marks in his assessment and has not submitted his assignment, then it is predicted with the decision tree that his exam result is 'Fail'.
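The logical rules above translate directly into code; the function and argument names in this minimal sketch are ours, while the rule itself is the one read off Figure 6.2.

```python
# Direct translation of the logical rules of Example 6.2.

def exam_result(assessment_marks, assignment_submitted):
    if assessment_marks >= 50:
        return "Pass"
    else:                                   # assessment_marks < 50
        return "Pass" if assignment_submitted else "Fail"

# The test instance from the text: 42 marks, assignment not submitted.
print(exam_result(42, assignment_submitted=False))   # -> Fail
```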
Many algorithms exist for constructing decision trees; these are studied in the sections below.

6.1.2 Fundamentals of Entropy

Given the training dataset with a set of attributes or features, the decision tree is constructed by finding the attribute or feature that best describes the target class for the given test instances. The best split feature is the one which contains more information about how to split the dataset among all features, so that the target class is accurately identified for the test instances. In other words, the best split attribute is the most informative one for splitting the dataset into sub-datasets, and this process is continued until the stopping criterion is reached. This splitting should be pure at every stage of selecting the best feature.

The best feature is selected based on the amount of information among the features, which is basically calculated from probabilities. Quantifying information is closely related to information theory. In the field of information theory, the features are quantified by a measure called Shannon entropy, which is calculated based on the probability distribution of the events.

Entropy is the amount of uncertainty or randomness in the outcome of a random variable or an event. Moreover, entropy describes the homogeneity of the data instances. The best feature is selected based on the entropy value. For example, when a coin is flipped, head or tail are the only two outcomes, hence its entropy is lower when compared to rolling a die, which has six outcomes. Hence, the interpretation is:
Higher the entropy → Higher the uncertainty
Lower the entropy → Lower the uncertainty

Similarly, if all instances are homogeneous, say (1, 0), which means all instances belong to the same class (here it is positive), or (0, 1), where all instances are negative, then the entropy is 0. On the other hand, if the instances are equally distributed, say (0.5, 0.5), which means 50% positive and 50% negative, then the entropy is 1. If there are 10 data instances, out of which 6 belong to the positive class and 4 belong to the negative class, then the entropy is calculated as shown in Eq. (6.1):

Entropy = -[6/10 log2(6/10) + 4/10 log2(4/10)]    (6.1)

It is concluded that if the dataset has instances that are completely homogeneous, the entropy is 0, and if the dataset has samples that are equally divided (i.e., 50%–50%), it has an entropy of 1. Thus, the entropy value ranges between 0 and 1 based on the randomness of the samples in the dataset. If the entropy is 0, then the split is pure, which means that all samples in the set belong to one class or category. But if the entropy is 1, the split is impure and the distribution of the samples is more random. The stopping criterion is based on the entropy value.

Let P be the probability distribution of the data instances over the classes 1 to m, as shown in Eq. (6.2):

P = (P_1, P_2, ..., P_m)    (6.2)

The entropy of P is the information measure of this probability distribution, given in Eq. (6.3):

Entropy_Info(P) = Entropy_Info(P_1, ..., P_m) = -(P_1 log2(P_1) + P_2 log2(P_2) + ... + P_m log2(P_m))    (6.3)

where P_1 is the probability of data instances classified as class 1, P_2 is the probability of data instances classified as class 2, and so on, with
P_1 = |No. of data instances belonging to class 1| / |Total no. of data instances in the training dataset|

Entropy_Info(P) can be computed as shown in Eq. (6.4). Thus, Entropy_Info(6, 4) is calculated as:

-[6/10 log2(6/10) + 4/10 log2(4/10)]    (6.4)

Mathematically, entropy is defined in Eq. (6.5) as:

Entropy(X) = Σ_{x ∈ values(X)} Pr[X = x] · log2(1 / Pr[X = x])    (6.5)

where Pr[X = x] is the probability of a random variable X taking a possible outcome x. Equivalently,

Entropy_Info(X) = Σ_{x ∈ values(X)} -Pr[X = x] · log2(Pr[X = x])    (6.6)
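A small sketch of this entropy computation (Eqs. (6.3) and (6.6)) is given below; the function name is ours, and zero-probability terms are simply skipped since they contribute nothing.

```python
import math

def entropy(probabilities):
    """Shannon entropy of a discrete distribution, as in Eqs. (6.3)/(6.6)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0, 0.0]))   # homogeneous instances, e.g. (1, 0)    -> 0.0
print(entropy([0.5, 0.5]))   # equally divided instances (0.5, 0.5)  -> 1.0
print(entropy([0.6, 0.4]))   # 6 positive and 4 negative instances   -> 0.9709... (Eq. 6.1/6.4)
```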
Algorithm 6.1: General Algorithm for Decision Trees
1. Find the best attribute from the training dataset using an attribute selection measure and place it at the root of the tree.
2. Split the training dataset into subsets based on the outcomes of the test attribute; each subset in a branch contains the data instances or tuples with the same value for the selected test attribute.
3. Repeat step 1 and step 2 on each subset until we end up in leaf nodes in all the branches of the tree.
4. This splitting process is recursive and continues until the stopping criterion is reached.

Stopping Criteria
The following are some of the common stopping conditions:
1. The data instances are homogeneous, which means all belong to the same class C_i, and hence the entropy of the node is 0.
2. A node with some defined minimum number of data instances becomes a leaf (for example, when the number of data instances in a node is between 0.25% and 1.00% of the full training dataset).
3. The maximum tree depth is reached, so further splitting is not done and the node becomes a leaf node.
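The sketch below puts Algorithm 6.1 and the stopping criteria together in code. It is only schematic: the attribute selection measure is left as a pluggable function (Section 6.2 introduces information gain, gain ratio and GINI index), the data layout (a list of dictionaries with the class label stored under the key "target") and all names are our assumptions, and the minimum-instances threshold is expressed as an absolute count rather than a percentage.

```python
# Schematic sketch of Algorithm 6.1 with the stopping criteria listed above.

def build_tree(instances, attributes, select_best_attribute,
               min_instances=1, max_depth=10, depth=0):
    labels = [row["target"] for row in instances]

    # Stopping criteria: homogeneous node, no attributes left,
    # too few instances, or maximum tree depth reached -> leaf node.
    if (len(set(labels)) == 1 or not attributes
            or len(instances) <= min_instances or depth >= max_depth):
        return max(set(labels), key=labels.count)        # leaf: majority target class

    # Step 1: find the best split attribute and place it at this node.
    best = select_best_attribute(instances, attributes)
    node = {"attribute": best, "branches": {}}

    # Step 2: split the dataset into subsets, one branch per attribute value.
    for value in set(row[best] for row in instances):
        subset = [row for row in instances if row[best] == value]
        remaining = [a for a in attributes if a != best]
        # Step 3: repeat the same procedure recursively on each subset.
        node["branches"][value] = build_tree(subset, remaining, select_best_attribute,
                                             min_instances, max_depth, depth + 1)
    return node
```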
Both C4.5 and CaRy & putes and contin' 0 outliers whereas CART can handle outliers, aS We called ‘Information Gain’ with to classify the test data. It is ar" attributes and with no missing dataset is small, overfitting m, FF tor classificaty No pruning : CART can handle both categorical attr! Seo handle missing values, but C45 is prone PP bia ee Pe acannon Info Eq. (6.8) for the whole training dataset based on the target attribute. tion Gain Eq, (6.10) for each of the attribute Compute Entropy_] Compute Entropy_Info Eq. (6.9) and Informal the training dataset. Choose the attribute for which entropy is minimu the best split attribute. The best split attribute is placed as the root node. The root node is branched into subtrees with each subtree as an outcome of the tet condition of the root node attribute. Accordingly, the training dataset is also spii into subsets. Recursively apply the same operation for the subset of the training set with the remaining attributes until a leaf node is derived or no more training instances are available the subset. m and therefore the gain is maximums wg o = Note: We stop branching a node if entropy is 0. The best split attribute at every iteration is the attribute with the highest information ga" Definitions Let T be the training dataset. Let A be the set of attributes A=(A,,A,, A A} pAb Asn Als Let m be the number of classes in the training dataset, Let P,be the probability that a data instance or a tupl ple ‘d’ It is calculated as, belongs to class C,. P.=Total ni ins : © of data instances that belongs to class Cj ini 8 C, in T/Total no of tuples in the tai"® Mathematically, it is rey , presented as shown in Eq. (6.7). ry Decision Tree Learning * 163 ected information or Entropy needed to cl toy Info(T) given in Eq. (6.8). assity a data instance d’ in T is denoted as Entropy Info(T)= — YP tog, P (6.8) snuropy of every attribute denoted as Entropy_Info(T, A) is shown in Eq, (6.9) as Entropy_Info(T, A) = a x Entropy_Info (A,) (6.9) »re, the attribute A has got ‘v’ di wheres . istinct values (a,, a, ....a,), |A|| is the number of instances for vi v ttriby ansunet value “Tin attribute A, and Entropy_Info (A) is the entropy for that set of instances. Information_Gain is a metric that measures how mu tribute A. In other words, ch information is gained by branching on it measures the reduction in impurity in an arbitrary subset of data. Itis calculated as given in Eq. (6.10): Information_Gain(A) = Entropy_Info(T) — Entropy_Info(T, A) (6.10) It can be noted that as entropy increases, portional to each other. information gain decreases. They are inversely Pro 5 aR Scan for ‘Additional Exomples’ Ee | Es | a oo Assess a student's performance during his course of study and predict whether a student will get a job offer or not in his final year of the course. The training dataset T consists of 10 data instances with attributes such as ‘CGPA’, ‘Interactiveness’, ‘Practical Knowledge’ and ‘Communication Skills’ as shown in Table 6.3. The target class attribute is the ‘Job Offer’. Table 6.3: Training Dataset T CoE eee eT Te eee ee eT en ey ‘Yes Very good Good Yes 2. No Good Moderate Yes 3. 29 No Average Poor No : 4 <8 No. Average Good No 5. 
28 [Yes Good ~ | Moderate Yes L$ 29 | Yes Good Moderate Yes [2 <8 | Yes Good Poor No a 29 [No Very good Good Yes ° 28 | Yes Good Good Yes Ce] 28 [Yes ‘Average Good Yes 164 + Machine Learning ———___ Solution: Step 1; Calculate the Entropy for the target class ‘Job Offer’ Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3) = = -|Z ve 24 F tog 3] = ~(0.3599 + -0.5208) = 0.8807 Iteration 1: 10 710 10-7? 10 Step 2: Calculate the Entropy_Info and Gain(Information_Gain) for each of the attribute in the training dataset. Table 6.4 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA. Table 6.4: Entropy Information for CGPA CGPA__Job Offer= Yes Job Offer=No Total Entropy EI 3 1 4 28 4 0 4 o <8 0 2 2 0 Entropy_Info(T, CGPA) 473, 3-1, 1], 4/4, 4.0, 0]. 2[ 0.0 2 = Fp] los. 5 - zlos. = | +] Slog, 4 - Stog 2} 42/2 tog 9 2) -4 1°84 paz] A] 41°82 g~ 4'°82 g | * 79] 71°82 9 - 51°82 5 = + (03111 +0.4997)+0+0 10 = 0.3243 Gain (CGPA) = 0.8807 — 0.3243 0.5564 Table 6.5 shows the number of data instances classified with Job Offer as Yes or No for the attribute Interactiveness. Table 6.5: Entropy Information for Interactiveness ee YES 5 1 6 [No 2 2 [4 | 2 . 6/5, 5 1 1], 4f_2, > 2.2 3] Entropy_Info(T, Interacti = 79] ~ gloss = - Slog, 1] + 4] ~2 tog, 2 - Zt0g, py_Info(T, Interactiveness) é 51°85 sie. 2|+ 4] 710824 4°74 = Soin + 0.4306) + + (0.4997 + 0.4997) = 0.3898 + 0.3998 = 0.7896 Gain(Interactiveness) = 0.8807 - 0.7896 = 0.0911 the Table 6.6 shows the number of data instances classified with Job Offer as Yes or No for attribute Practical Knowledge —— Decision Tree Learning + 165 Table 6.6: Entropy Information for Practical Knowledge eed Very Good_ Average Good 3] Uo 12, 2 3)_1, 12 5f 4. 44 4 ToL 3°83 35 3| . woes f 3) o, 0 2°89 20 + o2sr +0.4641) 0+ 0.2753 + 0.3608 0.6361 Gain(Practical Knowledge) = 0.8807 — 0.6361 = 0.2446 Table 6.7 shows the number of data instances classified with Job Offer as Yes or No for the atnbute Communication Skills. Table 6.7: Entropy Information for Communication Skills Communication Skills Job Offer = Yes Job Offer=No Total Good 4 1 5 | Moderate 3 0 3 Poor 0 2 2 Entropy_Info(T, Communication Skills) 5] 4 4 1 1 3] 2 3 _9 oO 2] aor o- 2) 2 3 |-tioe, 4 Log, 3] * A]-F985 31°83 |* yo] 7/82 21°82 5 3 2 = (05 a 0) 79 (05280 + 0.3897) + 10 (0) + 10° ) = 0.3609 Gain(Communication Skills) = 0.8813 - 0.36096 = 0.5203 The Gain calculated for all the attributes i Table 6.8: Gain Pe is shown in Table 6.8: CGPA Interactiveness Practical Knowledge ‘Communication Skills | Communicaien hich entropy is minimum and therefore the bh yr wh iy 166 + Machine Learning . : : ple 6.8, choose ain. So, we choose CG) Step 3: From heey best split ati jt has the a Sones 29, 28 and <8. The &y ‘GPA GPA wil fer = Yes for 28 and Job Offer sk of js maximum as it attribute . The best split attribu! ccra wt » are three d ified a5 or the subset of > root node. There are vec Of of value is 0 for >8 and <8 oo a a ws noe. THE ree 8 " <8. Hence, both 28 and
