
CLASSIFICATION METHODS

Classification
 Classification is a process of categorizing a given set of data into classes

 The process starts with predicting the class of given data points

 The classes are often referred to as target, label or categories


Binary vs. Multi-Classification
Classification Methods

 There is no “best” method. Methods can be selected based on metrics (accuracy,


precision, recall, F-measure), speed, robustness, and scalability.

 We will cover some of the more classic and state-of-the-art techniques in the following
slides, including:
 Logistic Regression (Covered with Regression Topic)
 Naïve Bayes
 K-Nearest Neighbor
 Support Vector Machine (SVM)
 Decision Tree
 Random Forest (Not in Syllabus)
Classification Vs. Regression

| Parameter | Classification | Regression |
| Basic | Mapping function is used for mapping values to predefined classes | Mapping function is used for mapping values to continuous output |
| Involves prediction of | Discrete values | Continuous values |
| Method of calculation | By measuring accuracy | By measuring root mean square error |
| Example algorithms | Logistic Regression, Naïve Bayes, K-Nearest Neighbor, Support Vector Machines, Decision Trees, etc. | Simple Linear Regression, Multiple Linear Regression, Non-linear Regression, etc. |
Naïve Bayes

 Naïve Bayes is a probabilistic classifier applying Bayes’ theorem.

 Assumes that the values of features are independent of other features and that features have equal importance.
 Hence “Naive”

 Scales and performs well in text categorization tasks.


 E.g., spam or legitimate email, sports or politics, etc.

 Also has extensions such as Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli
Naïve Bayes.
 Naïve Bayes and Multinomial Naïve Bayes are part of WEKA (Data Mining Tool)
Naïve Bayes – Bayes Theorem

 Naïve Bayes is based on Bayes' theorem, where a posterior is calculated from a prior, a likelihood, and the evidence:

P(A|B) = P(B|A) · P(A) / P(B)

In English: posterior = (likelihood × prior) / evidence

 Example – If a patient has a stiff neck, what is the probability he/she has meningitis (P(M|S)), given that:
 A doctor knows that meningitis causes stiff neck (P(S|M)) 50% of the time
 Prior probability of any patient having meningitis (P(M)) is 1/50,000
 Prior probability of any patient having stiff neck (P(S)) is 1/20
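Plugging these numbers into Bayes' theorem gives the posterior (a worked check of the example above):

P(M | S) = P(S | M) · P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002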
Naïve Bayes – Approach to Classification
"If a 60-year-old patient returned from Beijing, China to Tucson yesterday and has fever, nagging cough, and shortness of breath, what is the probability that he has COVID-19?"

• Compute the posterior probability P(C | A1, A2, …, An) (with multiple pieces of evidence) for all values of C (i.e., the class) using Bayes' theorem.
• After computing the posteriors for all values, choose the value of C that maximizes P(C | A1, A2, …, An).
• This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) · P(C), since the evidence term is the same for every class.
• The "naïve" assumption is that all attributes Ai are independent of each other given the class, so the likelihood factorizes: P(A1, A2, …, An | C) = P(A1 | C) · P(A2 | C) · … · P(An | C).
Naïve Bayes – Example
[Worked example slides not reproduced in this extract.]
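The original worked example is not reproduced here. As a stand-in, a minimal scikit-learn sketch of Naïve Bayes on the spam-vs-legitimate task mentioned earlier; the tiny dataset and its wording are made up purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (made up for illustration): 1 = spam, 0 = legitimate
emails = ["win money now", "cheap meds win prize", "meeting agenda attached", "lunch tomorrow?"]
labels = [1, 1, 0, 0]

# Bag-of-words counts; each word becomes a feature
vec = CountVectorizer()
X = vec.fit_transform(emails)

# Multinomial Naive Bayes suits word-count features
clf = MultinomialNB()
clf.fit(X, labels)

test = vec.transform(["win a cheap prize"])
print(clf.predict(test))        # predicted class
print(clf.predict_proba(test))  # posterior probabilities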
Nearest Neighbors
• k-nearest neighbors: k is a parameter chosen by the analyst.
• For a given test instance, use the k "closest" points (nearest neighbors) for performing classification.
• "Closest" points: defined by some proximity metric, such as Euclidean distance.
[Figure: an unknown record plotted among the labeled training records.]
• Requires three things:
  1. The set of stored records
  2. A distance metric to compute the distance between records
  3. The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  1. Compute its distance to the other training records
  2. Identify the k nearest neighbors
  3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
• The k-nearest neighbors of a given example x are the k points that are closest to x.
• The classification changes depending on the chosen k; majority voting is used among the neighbors.
• Tie scenario: randomly choose a classification? For binary problems, an odd k is usually used to avoid ties.
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.]
As the differences between the values of the descriptive features grow, so too does the distance between the points in the feature space that represent these instances.
Distance Metric
• Distance metric: a function that returns the distance between two instances a and b.
• Mathematically, it must satisfy the criteria:
  1. Non-negativity: d(a, b) >= 0
  2. Identity: d(a, b) = 0 if and only if a = b
  3. Symmetry: d(a, b) = d(b, a)
  4. Triangular inequality: d(a, b) <= d(a, c) + d(c, b)
Dissimilarities between Data Objects
• A common measure of the proximity between two objects is the Euclidean distance:

  d(x, y) = sqrt( Σ (x_k − y_k)² ),   summing over the n dimensions k = 1 … n, where x and y are two data objects

• In high school, we typically used this for calculating the distance between two objects when there were only two dimensions.
• It is defined for one dimension, two dimensions, three dimensions, …, any n-dimensional space.

Example
[Worked example not reproduced in this extract.]
Dissimilarities between Data Objects
• Typically the Euclidean distance is used as a first choice when applying nearest neighbors and clustering:

  d(x, y) = sqrt( Σ (x_k − y_k)² ),   summing over the n dimensions k = 1 … n, where x and y are two data objects

• Other distance metrics are generalized by the Minkowski distance metric:

  d(x, y) = ( Σ |x_k − y_k|^r )^(1/r),   where r is a parameter
Dissimilarities between Data Objects
• Minkowski distance metric:  d(x, y) = ( Σ |x_k − y_k|^r )^(1/r), summing over k = 1 … n
• r = 1: the L1 norm, "Manhattan" or "taxicab" distance
• r = 2: the Euclidean distance, the L2 norm
• The larger the value of r, the more emphasis is placed on the features with large differences in values, because these differences are raised to the power of r.
Distance Matrix
 Once a distance metric is chosen, the proximity
between all of the objects in the dataset can be
computed
 Can be represented in a distance matrix
 Pairwise distances between points
Distance Matrix
[Figure: four points p1–p4 plotted in two-dimensional space (x axis 0–6, y axis 0–3).]

L1 norm distance ("Manhattan" distance):

| L1 | p1 | p2 | p3 | p4 |
| p1 | 0 | 4 | 4 | 6 |
| p2 | 4 | 0 | 2 | 4 |
| p3 | 4 | 2 | 0 | 2 |
| p4 | 6 | 4 | 2 | 0 |

L2 norm distance (Euclidean distance):

| L2 | p1 | p2 | p3 | p4 |
| p1 | 0 | 2.828 | 3.162 | 5.099 |
| p2 | 2.828 | 0 | 1.414 | 3.162 |
| p3 | 3.162 | 1.414 | 0 | 2 |
| p4 | 5.099 | 3.162 | 2 | 0 |
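A hedged sketch of how such a distance matrix can be computed with SciPy; the point coordinates below are an assumption chosen so that they reproduce both matrices above:

import numpy as np
from scipy.spatial.distance import cdist

# Coordinates assumed for illustration; they reproduce the two matrices above
pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1, p2, p3, p4

print(cdist(pts, pts, metric="cityblock"))               # L1 / Manhattan distances
print(np.round(cdist(pts, pts, metric="euclidean"), 3))  # L2 / Euclidean distances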
Using Weights
• So far, all attributes are treated equally when computing proximity.
• In some situations, some features are more important than others.
• The decision is up to the analyst.
• Minkowski distance definition modified to include weights:

  d(x, y) = ( Σ w_k |x_k − y_k|^r )^(1/r),   summing over k = 1 … n
K-Nearest Neighbors Pseudocode
[Pseudocode figure not reproduced in this extract; a sketch follows below.]
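A minimal Python sketch of the k-NN procedure described above (the tiny dataset is made up for illustration):

import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Minimal k-NN sketch: majority vote over the k closest training points."""
    # 1. Compute the distance from the query to every training record
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Identify the k nearest neighbors
    dists.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in dists[:k]]
    # 3. Majority vote among the neighbors' class labels
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny made-up dataset for illustration
X = [(1, 1), (2, 1), (8, 9), (9, 8)]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, (8, 8), k=3))   # -> "B"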
Different Values of k
• Which is the ideal value of k?
• Setting k to a high value is riskier with an imbalanced dataset: the majority target level begins to dominate the feature space.
[Figure: decision boundaries for k = 1, k = 3, k = 5, and k = 15, compared against the ideal decision boundary.]
• Choose k by running evaluation experiments on a training or validation set.
Choosing the right k
 If k is too small, sensitive to
noise points in the training
data
 Susceptible to overfitting
 If k is too large,
neighborhood may include
points from other classes
 Susceptible to
misclassification
Standardization
• Treat all features "equally" so that one feature doesn't dominate the others.
• Common treatment for all variables:
  • Standardize each variable: mean = 0, standard deviation = 1
  • Normalize each variable: max = 1, min = 0
• Query / test instance to classify: Salary = 56000, Age = 35 (without scaling, the Salary feature would dominate the distance computation).
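A hedged scikit-learn sketch of the two scalings applied to the Salary/Age query above; the training values are made up for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Made-up (Salary, Age) training data for illustration
X = np.array([[45000, 28], [52000, 41], [70000, 35], [61000, 50]], dtype=float)
query = np.array([[56000, 35]])   # the test instance from the slide

# Standardize: mean 0, standard deviation 1 per feature
std = StandardScaler().fit(X)
print(std.transform(query))

# Normalize: min 0, max 1 per feature (values outside the training range can fall outside [0, 1])
mm = MinMaxScaler().fit(X)
print(mm.transform(query))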
Classifier Comparison

| | Eager Learners | Lazy Learners |
| Examples | Decision Trees, SVMs | Nearest Neighbors |
| Model building | Potentially slow | Fast (because there is none!) |
| Classifying a test instance | Fast | Slow |

Classifier Comparison

| | Eager Learners | Lazy Learners |
| Examples | Decision Trees, SVMs | Nearest Neighbors |
| Approach | Finding a global model that fits the entire input space | Classification decisions are made locally (small k values) and are more susceptible to noise |
Support Vector Machines (SVM)
 What are they?
 Developed in the 1990s
 Computer Science community
 Very popular
 Performance:
 Often considered one of the best “out of the box” classifiers
 Applications: handwritten digit recognition, text
categorization
Support Vector Machines (SVM)
 Comparing to other statistical learning methods:
 SVMs work well with high-dimensional data

 Unique:
 Represents “decision boundary” using a subset of
training examples
Terminology
1. Maximal Margin Classifier
2. Support Vector Classifier
3. Support Vector Machine

 Often all three are referred to as “Support Vector


Machine”
The Path Ahead
1. Maximal Margin Classifier
2. Support Vector Classifier
 Generalization of Maximal Margin Classifier
3. Support Vector Machine
 Generalization of Support Vector Classifier
Maximal Margin Classifier
 First, need to define a hyperplane

 What is a hyperplane?
 Hyperplane has p-1 dimensions in a p dimensional
space
 Example: in 2 dimension space, a hyperplane has 1
dimension (and thus, is a line)
Hyperplane Mathematical Definition
• For two dimensions, a hyperplane is defined as:

  B0 + B1 X1 + B2 X2 = 0    (B0, B1, B2 are parameters; X1, X2 are variables)

• Note that this equation is a line. Writing X2 as Y:

  B0 + B1 X1 + B2 Y = 0
  B2 Y = −B1 X1 − B0
  Y = (−B1 X1 − B0) / B2

• The hyperplane here has one dimension: it is a line.
Hyperplane Mathematical Definition
B0 + B1 X1 + B2 X2 = 0
 We’re going to “find” values for B0, B1, B2.
 Then, for any values X1 and X2:
1. if B0 + B1X1 + B2X2 = 0
 Point is on the line.
Hyperplane Mathematical Definition
B0 + B1 X1 + B2 X2 = 0
 We’re going to “find” values for B0, B1, B2.
 Then, for any values X1 and X2:
2. if B0 + B1X1 + B2X2 > 0
 Point is not on the line. On one side of the line.
3. if B0 + B1X1 + B2X2 < 0
 Point is on the other side of the line.
Hyperplane
• A hyperplane divides 2-dimensional space into two halves by a line.
Separating Hyperplane
Note: a separating hyperplane means zero training errors.
• Dataset with two classes: (1) squares, (2) circles.
• We can find a separating hyperplane with all squares on one side and all circles on the other.
• Infinitely many such hyperplanes are possible.
Classification Using a Separating
Hyperplane
 For a new test
instance, which side of
the line is it on?

 B0 + B1X1 + B2X2 > 0


 B0 + B1X1 + B2X2 < 0
Classification Using a Separating Hyperplane
• Standard SVM approach:
  • Label class data as either +1 or −1, depending on which class an instance belongs to.
  • Prediction:

    y_i = +1,  if B0 + B1 x1 + B2 x2 + … + Bn xn > 0
    y_i = −1,  if B0 + B1 x1 + B2 x2 + … + Bn xn < 0
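A minimal sketch of this sign-based prediction rule, assuming the coefficients B0, B1, …, Bn have already been found (the numbers are made up for illustration):

import numpy as np

def svm_predict(B0, B, x):
    """Sign-based prediction sketch: +1 on one side of the hyperplane, -1 on the other."""
    score = B0 + np.dot(B, x)      # B0 + B1*x1 + ... + Bn*xn
    return 1 if score > 0 else -1  # the magnitude of `score` indicates confidence

# Made-up coefficients and point for illustration
print(svm_predict(-1.0, np.array([2.0, 0.5]), np.array([1.0, 3.0])))   # -> 1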
Classification Using a Separating
Hyperplane
 For a new test instance, which side of the line is it on?

 B0 + B1X1 + B2X2 > 0


 B0 + B1X1 + B2X2 < 0

 Can also look at the magnitude.


 How far from zero?
 Greater magnitude means more confident prediction.
Some Concerns with this Approach:
• Datasets with more than 2 target classes
• What if a "separating hyperplane" can't be formed?
• Data with more than two dimensions
• Regression instead of classification

SVMs can deal with each of these.
What if Data is more than 2-Dimensions?
• The mathematical definition of a hyperplane generalizes to n dimensions:

  B0 + B1 X1 + B2 X2 = 0
  B0 + B1 X1 + B2 X2 + … + Bn Xn = 0

  B0 + B1 X1 + B2 X2 + … + Bn Xn > 0
  B0 + B1 X1 + B2 X2 + … + Bn Xn < 0
Maximum Margin Hyperplane
 What’s the best separating hyperplane?

Intuition: the one that is farthest from


the training observations.

Called the maximum margin


hyperplane.
The Margin
 B1 and B2 are each
separating hyperplanes
 B1 is better
 Margin: the smallest
distance from the
hyperplane to the
training data
Represents the mid-line of the widest “slab” that can be inserted between the two classes.

Maximal Margin Hyperplane


 We want the
hyperplane that has the
greatest margin.
 That is, B1 instead of B2
or any of the other
infinitely many
separating hyperplanes
Maximal Margin Hyperplane
• Support vectors: the points in the data that, if moved, would move the maximal margin hyperplane as well.
• Moving any of the other data points would not affect the model.
Support Vector Classifier
 Maximum Margin Classifier is natural way to perform
classification if a separating hyperplane exists.
 Perfect segmentation between two classes
 In many cases, no separating hyperplane will exist
 Find a hyperplane that almost perfectly segments the classes
 This generalization is called: support vector classifier
Support Vector Classifier
• Maximal Margin Classifier: no training errors allowed
• Support Vector Classifier: tolerates training errors
• Approach: a soft margin
• Allows construction of a linear decision boundary even when the classes are not linearly separable.
Support Vector Classifier
Additional motivation:
• When a new data point is added, the maximal margin hyperplane can shift dramatically.
[Figure: a maximum margin classifier that perfectly segments the training data; the model has high variance when trying to maintain perfect segmentation.]
Support Vector Classifier
• So, we are interested in:
  • Greater robustness to individual data instances
  • Better classification of most of the training data
• Some misclassifications are permitted:
  • "Soft" margin: because the margin can be violated by some of the instances
[Figure, red instances: 3, 4, 5, 6 are on the correct side of the margin; 2 is on the margin; 1 is on the wrong side of the margin; in the second panel, 11 is also on the wrong side of the hyperplane.]
Using Support Vector Classifier for
Classification
 Same as before.
 Which side of the line is the test instance on?
Constructing the Support Vector Classifier
 More interesting.
 How much “softness” (misclassifications) in the soft margin is
ideal?
 Complicated math, but python figures it out.
 Specification of nonnegative tuning parameter C
 Generally chosen by analyst following cross-validation
 Large C: wider margin; more instances violate margin
 Small C: narrower margin; less tolerance for instances that violate
margin
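A hedged scikit-learn sketch of choosing the tuning parameter by cross-validation. Note that scikit-learn's C is a penalty on margin violations, i.e., the inverse of the "budget" C described above (a large scikit-learn C gives a narrower, harder margin). The dataset is synthetic, for illustration only:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Made-up two-feature data for illustration
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=0)

# Cross-validate over several values of the (inverse-budget) penalty C
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)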
[Figure: the same data points fit with a larger C (wider margin, lower variance) down to a smaller C (narrower margin, higher variance).]
Support Vector Machines
 What if a non-linear decision boundary is needed?

Poor performance using


this decision boundary.
Support Vector Machines
• Idea: transform the data from its original coordinate space X into a new space Φ(X) so that a linear decision boundary can separate the two classes.
• Φ: a nonlinear transformation
• Instead of fitting a support vector classifier using n features X1, X2, …, Xn, use 2n features: X1, X1², X2, X2², …, Xn, Xn².
Support Vector Machines
• Enlarged "feature space" compared to the original "feature space".
• Can even extend to higher-order polynomial terms.
• Downside: can easily end up with a huge number of features, and thus overfitting.

https://towardsdatascience.com/support-vector-machine-svm-and-the-multi-dimensional-wizardry-b1563ccbc127

Φ : (x1, x2) → (x1², x2², √2·x1, √2·x2, 1)
Attribute Transformation
SVM Kernel functions
Popular SVM kernel functions:
• Linear kernel: just the dot product of all the features. It doesn't transform the data.
• Polynomial kernel: a simple non-linear transformation of the data with a polynomial degree added.
• Gaussian or Radial Basis kernel: the most commonly used SVM kernel, usually applied to non-linear data.
• Sigmoid kernel: similar to a neural network with a sigmoid activation function.
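A hedged scikit-learn sketch comparing these kernels on synthetic non-linear data (the data and settings are made up for illustration):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Made-up non-linear data: two interleaving half-moons
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X_tr, y_tr)
    print(kernel, round(clf.score(X_te, y_te), 3))
# The RBF kernel typically separates this data much better than the linear kernel.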
Support Vector Machine – Kernel Functions
What if a straight line or a flat plane does not fit?
• The simplest way to divide two groups is with a straight line, flat plane or an N-dimensional hyperplane. But what if the points are separated by a nonlinear region?
• Rather than fitting nonlinear curves to the data, SVM handles this by using a kernel function to map the data into a different space where a hyperplane can be used to do the separation.
[Figure: two groups separated by a nonlinear, not flat, region.]
Support Vector Machine – Kernel Functions
• Kernel function Φ: map data into a different space to enable linear separation.
• Kernel functions are very powerful. They allow SVM models to perform separations even with very complex boundaries.
• Some popular kernel functions are linear, polynomial, and radial basis.
• For data in a structured representation, convolution kernels (e.g., string, tree, etc.) are frequently used.
• While you can construct your own kernel functions according to the data structure, WEKA provides a variety of in-built kernels.
Learning a Nonlinear SVM Model
 Once again:
 Complicated, but python does it for us.

https://github.com/swapnilin/SVM-Demo-and-Kernel-Functions/blob/main/SVM_Demo.ipynb
https://www.analyticsvidhya.com/blog/2021/07/svm-support-vector-machine-algorithm/
Other extensions to SVMs:
 Regression instead of classification
 Categorical variables instead of continuous
 Multiclass problems instead of binary
 Radial “kernels”
 Circle instead of hyperplane
Decision Trees
 Simple, yet widely used classification technique
 For nominal target variables
 There also are Regression trees, for continuous target
variables
 Predictor features: binary, nominal, ordinal, discrete, continuous
 Evaluating the model:
 One metric: error rate in predictions
Decision Trees
A tree has: 1. a root node, 2. internal nodes, 3. leaf / terminal nodes.

Training Data:

| Tid | Home Owner | Marital Status | Annual Income | Defaulted Borrower |
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |

Decision Tree Model #1 (the internal nodes are the "splitting attributes"; the tree is consistent with the training dataset):
• Home Owner = Yes → NO
• Home Owner = No → Marital Status:
  • Married → NO
  • {Single, Divorced} → Annual Income:
    • < 80K → NO
    • > 80K → YES
Decision Trees

Decision Tree Model #2 (also consistent with the same training dataset):
• Marital Status = Married → NO
• Marital Status = {Single, Divorced} → Home Owner:
  • Yes → NO
  • No → Annual Income:
    • < 80K → NO
    • > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data
Start from the root of the tree.

Test Data:

| Home Owner | Marital Status | Annual Income | Defaulted Borrower |
| No | Married | 80K | ? |

Using Decision Tree Model #1: Home Owner = No → Marital Status = Married → leaf NO.
• Predict that this person will not default.
Formally…
 A decision tree has three types of nodes:
1. A root node that has no incoming edges and zero or more outgoing edges
2. Internal nodes, each of which has exactly one incoming edge
and two or more outgoing edges
3. Leaf nodes (or terminal nodes), each of which has exactly one
incoming edge and no outgoing edges
 Each leaf node is assigned a class label
 Non-terminal nodes contain attribute test conditions to
separate records that have different characteristics
How to Build a Decision Tree?
 Referred to as decision tree induction.
 Exponentially many decision trees can be constructed
from a given set of attributes
 Infeasible to try them all to find the optimal tree
 Different “decision tree building” algorithms:
 ID3, CART, C4.5 etc.
 Usually a greedy strategy is employed
Design Issues of Decision Tree Induction
1. How should the training records be split?
 Greedy strategy: split the records based on some
attribute test (always choose immediate best option)
 Need to evaluate the “goodness” of each attribute test
and select the best one.
2. How should the splitting procedure stop?
 For now, we’ll keep splitting until we can’t split anymore.
Splitting Based on Nominal Attributes
• Multiway split: use as many partitions as distinct values.
• Binary split: group attribute values; need to find the optimal partitioning.
• The CART decision tree algorithm only creates binary splits.
Splitting Based on Ordinal Attributes
 Ordinal attributes can also produce binary or
multiway splits.
 Grouping should not violate ordering in the ordinal set
Splitting Based on Continuous Attributes
• Continuous attributes can also have a binary or multiway split.
• Binary: the decision tree algorithm must consider all possible split positions v, and it selects the best one.
  • Comparison test: (A <= v) or (A > v), where e.g. v = 80K
  • Computationally intensive
Entropy
• Defined by mathematician Claude Shannon.
• Measures the impurity (heterogeneity) of the elements of a set.
• "What is the uncertainty of guessing the result of a random selection from a set?"
[Figure: example sets ranging from very pure (not impure), through a little impure, to completely impure.]
Entropy
• Entropy is the (negative) weighted sum of the logs of the probabilities of each of the possible outcomes:

  H(S) = − Σ p_i · log2(p_i),   summing over the possible outcomes i
Entropy Examples
1. Entropy of the set of 52 playing cards:
   • The probability of randomly selecting any specific card i is 1/52.
2. Entropy if only the 4 suits matter:
   • The probability of randomly selecting any suit is 1/4.

The higher the impurity, the higher the entropy.

Why the log?
Because we want a low "score" when something is highly probable or certain.
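The numerical values, using log base 2 as elsewhere in these slides, work out as follows:

H(cards) = − Σ (1/52) · log2(1/52) = log2(52) ≈ 5.70 bits
H(suits) = − Σ (1/4) · log2(1/4) = log2(4) = 2 bits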


How to determine the Best Split?
How does entropy help us?
• We can calculate the entropy (impureness) of Defaulted Borrower for each candidate split of the training data shown earlier:
  • Home Owner? (Yes / No)
  • Marital Status? (Single / Divorced / Married)
  • Annual Income? (<= 85K or not)
Information Gain
• Try to split on each possible feature in the dataset, and see which split works "best".
• Measure the reduction in the overall entropy of a set of instances: the entropy of each resulting partition is weighted by the fraction of instances that fall into it (the weighting term).
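The information-gain formula referred to above, in its standard form (S_v is the subset of S for which the candidate feature takes value v, and |S_v| / |S| is the weighting term):

IG(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) · Entropy(S_v)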
Information Gain Example

Entropy(S) = −Σ p·log2(p) = −( (1/2)·log2(1/2) + (1/2)·log2(1/2) ) = 1

InformationGain = Entropy(S) − Σ (|S_v| / |S|) · E(S_v)

IG = 1 − [ (2/6) · ( −( (1/2)·log2(1/2) + (1/2)·log2(1/2) ) ) + (4/6) · ( −( (2/4)·log2(2/4) + (2/4)·log2(2/4) ) ) ] = 0

• The 0 means there was no "information gain".
• Nothing was learned by splitting on "Contains Images".
Calculate the Information Gain on Each
Feature

 “Suspicious Words” is the best split.


ID3 Algorithm
1. Figure out the best feature to split on by using information gain (see the sketch below).
2. Add this root node to the tree; label it with the selected test feature.
3. Partition the dataset using this test.
4. For each partition, grow a branch from this node.
5. Recursively repeat the process for each of these branches using the remaining partition of the dataset.
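The ID3 steps above hinge on the entropy and information-gain computations. A minimal Python sketch; the tiny dataset echoes the "Suspicious Words" / "Contains Images" example and is made up for illustration:

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum(p * log2(p)) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in entropy from partitioning `rows` on `feature`."""
    total = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return total - remainder

# Tiny made-up dataset for illustration
rows = [{"SuspiciousWords": True,  "ContainsImages": True},
        {"SuspiciousWords": True,  "ContainsImages": False},
        {"SuspiciousWords": False, "ContainsImages": True},
        {"SuspiciousWords": False, "ContainsImages": False}]
labels = ["spam", "spam", "ham", "ham"]
print(information_gain(rows, labels, "SuspiciousWords"))  # 1.0
print(information_gain(rows, labels, "ContainsImages"))   # 0.0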
ID3 Algorithm: Stopping Condition
Stop the recursion and construct a leaf node when:
1. All of the instances in the remaining dataset have the same classification (target feature value).
   • Create a leaf node with that classification as its label.
2. The set of features left to test is empty.
   • Create a leaf node with the majority class of the dataset as its classification.
3. The remaining dataset is empty.
   • Create a leaf node one level up (at the parent node), with the majority class.
Solved Example

There are four independent variables to determine the dependent variable. The independent
variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether
to play football or not.
Solution

Find the entropy of the class variable.

E(S) = -[(9/14)log(9/14) + (5/14)log(5/14)] = 0.94

NOTE:
• Here we typically take log to base 2.
• In total there are 14 yes/no examples, of which 9 are yes and 5 are no.
• The probabilities above are calculated from these counts.
Solution

Now we calculate the average weighted entropy of the Outlook feature, i.e., the sum, over each value of the feature, of the weight of that value (its fraction of the examples) multiplied by the entropy of its partition.

E(S, outlook) = (5/14)*E(3,2) + (4/14)*E(4,0) + (5/14)*E(2,3)
             = (5/14)(−(3/5)log(3/5) − (2/5)log(2/5)) + (4/14)(0) + (5/14)(−(2/5)log(2/5) − (3/5)log(3/5))
             = 0.693
Solution

The next step is to find the information gain. It is the difference between parent entropy and
average weighted entropy we found above.

IG(S, outlook) = 0.94 - 0.693 = 0.247


Solution

Similarly find Information gain for Temperature, Humidity, and Windy.

IG(S, Temperature) = 0.940 - 0.911 = 0.029

IG(S, Humidity) = 0.940 - 0.788 = 0.152

IG(S, Windy) = 0.940 - 0.8932 = 0.048


Solution

Now select the feature having the largest information gain. Here it is Outlook, so it forms the first node (root node) of our decision tree.

IG(S, outlook) = 0.94 - 0.693 = 0.247

IG(S, Temperature) = 0.940 - 0.911 = 0.029

IG(S, Humidity) = 0.940 - 0.788 = 0.152

IG(S, Windy) = 0.940 - 0.8932 = 0.048
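These numbers can be reproduced in Python, assuming the (not reproduced) data table is the standard 14-row play-tennis dataset; that assumption is consistent with every figure quoted above:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Standard play-tennis dataset assumed to match the table in the slides:
# (Outlook, Temperature, Humidity, Wind) -> Play
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rainy", "Mild", "High", "Weak", "Yes"),
    ("Rainy", "Cool", "Normal", "Weak", "Yes"),    ("Rainy", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rainy", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rainy", "Mild", "High", "Strong", "No"),
]
labels = [row[-1] for row in data]
features = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

print(round(entropy(labels), 3))   # 0.94
for name, idx in features.items():
    groups = {}
    for row in data:
        groups.setdefault(row[idx], []).append(row[-1])
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    print(name, round(entropy(labels) - remainder, 3))  # information gain per feature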


Solution
Now our data look as follows:
Solution

Since Overcast contains only examples of class 'Yes', we can set it as Yes. That means if the outlook is overcast, football will be played. Now our decision tree looks as follows.
Solution

The next step is to find the next node in our decision tree.
Now we will find the node under Sunny.
We have to determine which of Temperature, Humidity, or Wind has the highest information gain.
Solution

Calculate the parent entropy E(sunny):

E(sunny) = −(3/5)log(3/5) − (2/5)log(2/5) = 0.971

Now calculate the information gain of Temperature, IG(sunny, Temperature).
Solution
E(sunny, Temperature) = (2/5)*E(0,2) + (1/5)*E(1,0) + (2/5)*E(1,1) = 2/5 = 0.4

Now calculate the information gain:

IG(sunny, Temperature) = 0.971 − 0.4 = 0.571

Similarly we get

IG(sunny, Humidity) = 0.971

IG(sunny, Windy) = 0.020

Here IG(sunny, Humidity) is the largest value. So Humidity is the node that comes under sunny.
Solution

For humidity from the above table, we can say that play will occur if humidity is normal
and will not occur if it is high.
Similarly, find the nodes under rainy.
Note: A branch with entropy more than 0 needs further splitting.
Solution

Exercise
https://medium.com/analytics-vidhya/how-exactly-decision-trees-are-built-with-complete-example-dbda4a34cf1d
Advantages and Disadvantages of Trees

Advantages:
• Trees are very easy to explain (easier to explain than linear regression).
• Trees can be displayed graphically and interpreted by a non-expert.
• Decision trees may more closely mirror human decision-making.
• Trees can easily handle qualitative predictors (no dummy variables needed).

Disadvantages:
• Trees usually do not have the same level of predictive accuracy as other data mining algorithms.
• But the predictive performance of decision trees can be improved by aggregating trees. Techniques: bagging, boosting, random forests.
Decision Tree Advantages
 Inexpensive to construct
 Extremely fast at classifying unknown records
 O(d) where d is the depth of the tree
 Presence of redundant attributes does not adversely affect the
accuracy of decision trees
 One of the two redundant attributes will not be used for splitting once
the other attribute is chosen
 Nonparametric approach
 Does not require any prior assumptions regarding probability
distributions, means, variances, etc.
Summary of Classification Methods

| Classifier | Pros | Cons | WEKA Support? |
| Naïve Bayes | Easy to implement; less model complexity; fast | No variable dependency; over-simplification | Yes |
| Decision Tree | Easily interpretable; generally performs well | Tends to overfit; little training data for lower nodes | Yes |
| Random Forest | Strong performance; simple to implement | Few hyper-parameters to tune; a little harder to interpret than decision trees | Yes |
| K-Nearest Neighbors | Simple and powerful; no training involved | Slow and expensive | Yes |
| Support Vector Machine | Tends to have better performance than other methods; works well on text classification; works well with a large feature set | Can be computationally intensive; choice of kernel may not be obvious | Yes |
