UNIT 2 - Groups (Decision Tree)

This document discusses the concepts of relationships and groups in data analysis, focusing on decision trees as a supervised learning technique for classification and regression. It outlines the structure of decision trees, including root and leaf nodes, and explains the process of building a decision tree using the CART algorithm and attribute selection measures like Information Gain and Gini Index. The document emphasizes the ease of understanding decision trees due to their tree-like structure, which mimics human decision-making.


UNIT 2 – RELATIONSHIPS AND GROUPS AMONG DATA

• Understanding relationships:
  – Exploring relationships between variables
  – Visualizing relationships

• Understanding groups:
  – Clustering
  – Association Rules
  – Learning Decision Trees from Data

• CO2: Illustrate the relationship and groups among the data for
decision making. [K3]
Classification - Learning Decision Trees from Data

Decision Tree
• Decision Tree is a Supervised learning technique that can be used for both Classification and Regression problems, but it is mostly preferred for solving Classification problems.

• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.

• Decision nodes are used to make decisions and have multiple branches, whereas Leaf nodes are the outputs of those decisions and do not contain any further branches.
Decision Tree
• The decisions or tests are performed on the basis of the features of the given dataset.
• The tree asks a question and, based on the answer (Yes/No), further splits into subtrees.
• It is a graphical representation of all the possible solutions to a problem/decision based on the given conditions.
• It is called a decision tree because, like a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.
• To build the tree, we use the CART algorithm, which stands for Classification And Regression Tree (a brief code sketch follows the figure below).
[Figure: the general structure of a decision tree, with a root node branching through decision nodes down to leaf nodes]
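Since the slides name CART but include no code, here is a minimal, hedged sketch of fitting a CART-style tree with scikit-learn, whose DecisionTreeClassifier implements an optimized version of CART; the tiny integer-encoded dataset below is invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy, invented data: each row is [age_group, income_level] encoded as integers
X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
y = [0, 0, 1, 1, 0, 1]          # class labels: 0 = "no", 1 = "yes"

# criterion="gini" is the CART default; criterion="entropy" uses information gain
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["age_group", "income_level"]))
print(clf.predict([[1, 1]]))    # classify a new, unseen example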
Why use Decision Trees?

• There are various algorithms in Machine learning, so choosing the algorithm best suited to the given dataset and problem is a key step in creating a machine learning model. Below are two reasons for using the Decision tree:

• Decision Trees mimic the human thinking process while making a decision, so they are easy to understand.

• The logic behind a decision tree is easy to follow because it is displayed as a tree-like structure.
Decision Tree Terminologies

• Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.

• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split any further after a leaf node is reached.

• Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.

• Branch/Sub-Tree: A subtree formed by splitting the tree.

• Pruning: Pruning is the process of removing unwanted branches from the tree.

• Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are called its child nodes.
How does the Decision Tree algorithm Work?
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

Step-3: Divide S into subsets, one for each possible value of the best attribute.

Step-4: Generate the decision tree node that contains the best attribute.

Step-5: Recursively build new decision trees using the subsets created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified any further; each such final node is called a leaf node. (A recursive sketch of this procedure follows.)
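Taken together, the five steps describe a simple recursion. Below is a minimal Python sketch of that recursion; the data layout and the select_best argument (which stands in for any Attribute Selection Measure) are our assumptions, not from the slides.

from collections import Counter

def build_tree(rows, attributes, select_best):
    """Recursive sketch of Steps 1-5.

    rows:        list of (features_dict, label) training examples
    attributes:  attribute names still available for splitting
    select_best: any Attribute Selection Measure, e.g. information gain
    """
    labels = [label for _, label in rows]

    # Stop: the node is pure, or no attributes remain -> majority-class leaf
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    best = select_best(rows, attributes)           # Step-2: pick the best attribute
    node = {"attribute": best, "branches": {}}     # Step-4: create a decision node

    # Step-3: partition the rows by each value of the best attribute
    for value in {features[best] for features, _ in rows}:
        subset = [(f, c) for f, c in rows if f[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, select_best)  # Step-5

    return node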
Attribute Selection Measures or Measure of the goodness of split

• While implementing a Decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve this problem, there is a technique called the Attribute Selection Measure, or ASM. Using such a measure, we can easily select the best attribute for each node of the tree. There are two popular techniques for ASM (a short code sketch of the Gini Index follows this list):

• Information Gain

• Gini Index
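The slides define Information Gain in the following sections but do not define the Gini Index, so the sketch below uses the standard definition Gini(S) = 1 - sum(p_i^2), the impurity measure that CART minimizes by default; the sample label lists are invented for illustration.

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class fractions."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# A pure node has impurity 0; an even binary split has the maximum, 0.5
print(gini(["P", "P", "P", "P"]))    # 0.0
print(gini(["P", "P", "N", "N"]))    # 0.5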
Information Gain
• This measure is used to select the test attribute at each node of the tree.

• The attribute with the highest information gain is chosen as the test attribute for the current node.

Assume there are two classes, P and N


• Let the set of examples S contain p elements of class P and n elements of class N

• The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Entropy
• Entropy: Entropy is a metric that measures the impurity of a given attribute; it specifies the randomness in the data.

• Assume that, using attribute A, a set S will be partitioned into the sets {S1, S2, …, Sv}.

• If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all the subtrees Si, is

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

• The encoding information that would be gained by branching on A is

$$\mathrm{Gain}(A) = I(p, n) - E(A)$$
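Putting I(p, n), E(A), and Gain(A) together: assuming the classic buys_computer class counts that the next slide's age example appears to reference (age <= 30: 2 P and 3 N; age 31..40: 4 P and 0 N; age > 40: 3 P and 2 N, with p = 9 and n = 5 overall), a minimal sketch of the gain computation is:

from math import log2

def info(p, n):
    """I(p, n): expected information to classify an example in S."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c)

def entropy(partitions, p, n):
    """E(A): information weighted over the subsets produced by attribute A."""
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partitions)

# Assumed (p_i, n_i) class counts for the three age branches
age_partitions = [(2, 3), (4, 0), (3, 2)]
p, n = 9, 5

gain_age = info(p, n) - entropy(age_partitions, p, n)
print(round(gain_age, 3))   # approximately 0.246

Repeating this computation for the remaining attributes and picking the largest gain is exactly the selection step described on the next slide.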
Attribute Selection by Information Gain Computation

• Since age has the highest information gain among the attributes, it is selected as the test attribute.

• A node is created and labeled with age, and branches are grown for each of the attribute's values.

• The samples are then partitioned accordingly.
