Classification (NaiveBayes KNN SVM DecisionTrees)
Classification
Classification is a process of categorizing a given set of data into classes
The process starts with predicting the class of given data points
We will cover some of the more classic and state-of-the-art techniques in the following
slides, including:
Logistic Regression (Covered with Regression Topic)
Naïve Bayes
K-Nearest Neighbor
Support Vector Machine (SVM)
Decision Tree
Random Forest (Not in Syllabus)
Classification Vs. Regression
Parameter             | Classification                                                        | Regression
Basic                 | The mapping function is used for mapping values to predefined classes | The mapping function is used for mapping values to continuous output
Method of calculation | By measuring accuracy                                                  | By measurement of root mean square error
Naïve Bayes
Assumes that the values of the features are independent of one another and that all features are of equal importance.
Hence "Naïve".
Also has extensions such as Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli
Naïve Bayes.
Naïve Bayes and Multinomial Naïve Bayes are part of WEKA (Data Mining Tool)
Naïve Bayes – Bayes Theorem
Bayes' theorem: P(M|S) = P(S|M) P(M) / P(S)
Example – If a patient has a stiff neck, what is the probability that he/she has
meningitis, P(M|S), given that:
A doctor knows that meningitis causes a stiff neck (P(S|M)) 50% of the time
The prior probability of any patient having meningitis (P(M)) is 1/50,000
The prior probability of any patient having a stiff neck (P(S)) is 1/20
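Plugging these numbers into Bayes' theorem gives the posterior directly:

P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)} = \frac{0.5 \times \tfrac{1}{50{,}000}}{\tfrac{1}{20}} = 0.0002

So even though meningitis causes a stiff neck half the time, the probability that a stiff-neck patient has meningitis is only 0.02%.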
Naïve Bayes – Approach to Classification
"If a 60-year old patient returned from Beijing, China to Tucson yesterday and has fever, nagging cough,
and shortness of breath, what is the probability that he has COVID-19?"
After computing the posteriors for all values, choose the value of C that maximizes the posterior probability.
The equivalent product form (shown below) follows from the "naïve" assumption that all
attributes (Ai) are conditionally independent of each other given the class.
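Written out, the two forms referred to above are the standard naïve Bayes decision rule (the denominator can be dropped because it is the same for every class):

P(C \mid A_1, \dots, A_n) = \frac{P(A_1, \dots, A_n \mid C)\, P(C)}{P(A_1, \dots, A_n)}
\hat{c} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(A_i \mid C)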
Naïve Bayes – Example
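The worked example on the original slide was a figure; as a stand-in, here is a minimal sketch using scikit-learn's GaussianNB on made-up toy data (the feature values, labels, and column meanings are purely illustrative, not from the slides):

```python
# Minimal Gaussian Naive Bayes sketch with scikit-learn (illustrative toy data).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two numeric features per patient (e.g., age, body temperature) -- hypothetical values.
X_train = np.array([[25, 36.6], [60, 38.9], [45, 37.1], [70, 39.4], [33, 36.8]])
y_train = np.array(["healthy", "sick", "healthy", "sick", "healthy"])

model = GaussianNB()
model.fit(X_train, y_train)

# predict() picks the class with the highest posterior; predict_proba() shows the posteriors.
X_new = np.array([[58, 39.0]])
print(model.predict(X_new))
print(model.predict_proba(X_new))
```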
Nearest Neighbors
k-nearest neighbors
k = parameter, chosen by analyst
For a given test instance, use the k “closest” points
(nearest neighbors) for performing classification
“closest” points: defined by some proximity metric, such as
Euclidean Distance
Requires three things:
1. The set of stored records
2. A distance metric to compute the distance between records
3. The value of k, the number of nearest neighbors to retrieve
[Figure: an unknown record and its (a) 1-nearest, (b) 2-nearest, and (c) 3-nearest neighbors]
Tie scenario:
• Randomly choose the classification?
• For binary problems, an odd k is usually used to avoid ties.
As the differences between the values of the descriptive features grow, so too does the
distance between the points in the feature space that represent those instances.
Distance Metric
Distance Metric: A function that returns the distance
between two instances a and b.
Mathematically, it must satisfy the following criteria:
1. Non-negativity: d(a, b) ≥ 0
2. Identity: d(a, b) = 0 if and only if a = b
3. Symmetry: d(a, b) = d(b, a)
4. Triangle Inequality: d(a, b) ≤ d(a, c) + d(c, b)
Dissimilarities between Data Objects
A common measure for the proximity between two
objects is the Euclidean Distance:
d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
where x and y are two data objects with n dimensions (attributes).
In high school, we typically used this for calculating the distance
between two objects, when there were only two dimensions.
Defined for one-dimension, two-dimensions, three-dimensions, …,
any n-dimensional space
Example
Dissimilarities between Data Objects
Typically the Euclidean Distance is used as a first
choice when applying Nearest Neighbors and
Clustering
d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
where x and y are two data objects with n dimensions.
Generalized by the Minkowski distance metric:
d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}
where r is a parameter.
Dissimilarities between Data Objects
Minkowski Distance Metric:
d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}
r = 1: L1 norm, the "Manhattan" or "taxicab" distance
r = 2: L2 norm, the Euclidean distance
The larger the value of r the more emphasis is placed on the features with large differences
in values because these differences are raised to the power of r.
Distance Matrix
Once a distance metric is chosen, the proximity
between all of the objects in the dataset can be
computed
Can be represented in a distance matrix
Pairwise distances between points
Distance Matrix
[Figure: four points p1–p4 plotted in a 2-D feature space]

L1 norm distance ("Manhattan" distance):

L1   p1   p2   p3   p4
p1   0    4    4    6
p2   4    0    2    4
p3   4    2    0    2
p4   6    4    2    0

L2 norm distance (Euclidean distance):

L2   p1      p2      p3      p4
p1   0       2.828   3.162   5.099
p2   2.828   0       1.414   3.162
p3   3.162   1.414   0       2
p4   5.099   3.162   2       0
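Reading the coordinates off the original scatter plot, the four points appear to be p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1); a small sketch (assuming NumPy and SciPy are available) that reproduces both matrices and also shows the general Minkowski form:

```python
# Reproduce the L1 and L2 distance matrices for four 2-D points.
import numpy as np
from scipy.spatial.distance import cdist

# Coordinates assumed from the original scatter plot.
points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

print(np.round(cdist(points, points, metric="cityblock"), 3))       # L1 / Manhattan
print(np.round(cdist(points, points, metric="euclidean"), 3))       # L2 / Euclidean
print(np.round(cdist(points, points, metric="minkowski", p=3), 3))  # general Minkowski, r = 3
```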
Using Weights
So far, all attributes treated equally when
computing proximity
In some situations: some features are more important
than others
Decision up to the analyst
Modified Minkowski distance definition to include weights:
d(x, y) = \left( \sum_{k=1}^{n} w_k \, |x_k - y_k|^r \right)^{1/r}
K-Nearest Neighbors Pseudocode
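The pseudocode on the original slide was a figure; as a stand-in, a minimal from-scratch sketch of the same procedure (plain NumPy, Euclidean distance, majority vote; the dataset is made up for illustration):

```python
# Minimal k-nearest-neighbors classifier: majority vote over the k closest stored records.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Compute the Euclidean distance from x_new to every stored record.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Take the indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # 3. Return the majority class label among those k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([5.5, 6.0]), k=3))  # -> "B"
```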
Different Values of k
Which is the ideal value of k?
Setting k to a high value is riskier with an imbalanced dataset.
• The majority target level begins to dominate the feature space.
Support Vector Machine (SVM)
Unique: represents the "decision boundary" using a subset of the
training examples
Terminology
1. Maximal Margin Classifier
2. Support Vector Classifier
3. Support Vector Machine
What is a hyperplane?
Hyperplane has p-1 dimensions in a p dimensional
space
Example: in a 2-dimensional space, a hyperplane has 1
dimension (and is thus a line)
Hyperplane Mathematical Definition
For two dimensions, hyperplane defined as:
B0 + B1 X1 + B2 X2 = 0
B0, B1, B2 are parameters.
X1, X2 are variables.
y_i = \begin{cases} 1, & \text{if } B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n > 0 \\ -1, & \text{if } B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n < 0 \end{cases}
Classification Using a Separating
Hyperplane
For a new test instance, which side of the line is it on?
B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n > 0   (one class)
B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n < 0   (the other class)
Maximum Margin Hyperplane
What’s the best separating hyperplane?
Larger C to Smaller C
Kernel functions are very powerful. They allow SVM models to perform separations even with very complex
boundaries.
Some popular kernel functions are linear, polynomial, and radial basis.
For data in a structured representation, convolution kernels (e.g., string, tree,
etc.) are frequently used.
While you can construct your own kernel functions according to the data
structure, WEKA provides a variety of built-in kernels.
Learning a Nonlinear SVM Model
Once again:
Complicated, but Python does it for us.
https://github.com/swapnilin/SVM-Demo-and-Kernel-Functions/blob/main/SVM_Demo.ipynb
https://www.analyticsvidhya.com/blog/2021/07/svm-support-vector-machine-algorithm/
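As an illustration of "Python does it for us" (my own minimal sketch, not taken from the linked notebook), an RBF-kernel SVM fitted to scikit-learn's make_moons toy data; the C and gamma values are just defaults worth tuning:

```python
# Minimal nonlinear SVM sketch: RBF kernel on a toy, nonlinearly separable dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel="rbf" lets the model learn a curved decision boundary;
# C and gamma control regularization and kernel width.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```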
Other extensions to SVMs:
Regression instead of classification
Categorical variables instead of continuous
Multiclass problems instead of binary
Radial “kernels”
Circle instead of hyperplane
Decision Trees
Simple, yet widely used classification technique
For nominal target variables
There also are Regression trees, for continuous target
variables
Predictor features: binary, nominal, ordinal, discrete, continuous
Evaluating the model:
One metric: error rate in predictions
Decision Trees
A decision tree has:
1. Root node
2. Internal nodes
3. Leaf / terminal nodes

Training dataset:

Tid   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1     Yes          Single           125K            No
2     No           Married          100K            No
3     No           Single           70K             No
4     Yes          Married          120K            No
5     No           Divorced         95K             Yes
6     No           Married          60K             No
7     Yes          Divorced         220K            No
8     No           Single           85K             Yes
9     No           Married          75K             No
10    No           Single           90K             Yes

A tree consistent with this training dataset (splitting attributes: Home Owner, Marital Status, Annual Income):

Home Owner = Yes → NO
Home Owner = No:
    Marital Status = Married → NO
    Marital Status ∈ {Single, Divorced}:
        Annual Income < 80K → NO
        Annual Income > 80K → YES
Apply Model to Test Data
Start from the root of the tree.

Test Data:
Home Owner   Marital Status   Annual Income   Defaulted Borrower
No           Married          80K             ?

Home Owner = No → Marital Status = Married → leaf NO.
• Predict that this person will not default.
Formally…
A decision tree has three types of nodes:
1. A root node that has no incoming edges and zero or more
outgoing edges
2. Internal nodes, each of which has exactly one incoming edge
and two or more outgoing edges
3. Leaf nodes (or terminal nodes), each of which has exactly one
incoming edge and no outgoing edges
Each leaf node is assigned a class label
Non-terminal nodes contain attribute test conditions to
separate records that have different characteristics
How to Build a Decision Tree?
Referred to as decision tree induction.
Exponentially many decision trees can be constructed
from a given set of attributes
Infeasible to try them all to find the optimal tree
Different “decision tree building” algorithms:
ID3, CART, C4.5 etc.
Usually a greedy strategy is employed
Design Issues of Decision Tree Induction
1. How should the training records be split?
Greedy strategy: split the records based on some
attribute test (always choose immediate best option)
Need to evaluate the “goodness” of each attribute test
and select the best one.
2. How should the splitting procedure stop?
For now, we’ll keep splitting until we can’t split anymore.
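As a concrete sketch of greedy induction, the defaulted-borrower table from the earlier slides can be fed to scikit-learn's DecisionTreeClassifier (a greedy, CART-style learner); one-hot encoding the nominal attributes is just one reasonable choice, not something prescribed by the slides:

```python
# Fit a greedy decision tree to the defaulted-borrower table with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "HomeOwner":     ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "AnnualIncome":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
    "Defaulted":     ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# One-hot encode the nominal attributes; keep income numeric.
X = pd.get_dummies(data[["HomeOwner", "MaritalStatus"]]).join(data["AnnualIncome"])
y = data["Defaulted"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # text dump of the induced splits
```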
Splitting Based on Nominal Attributes
Multiway Split: Use as many partitions as distinct values
[Figure: a binary split on the continuous attribute Annual Income (<= 85K), with Yes/No branches]
Information Gain
Try to split on each possible feature in a dataset.
See which split works “best”.
Measures the reduction in the overall entropy of a set of instances after splitting on a feature:
IG(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} \, Entropy(S_v)
The factor |S_v| / |S| is the weighting term for each partition.
Information Gain Example
Entropy(S) = -\sum_i p_i \log_2 p_i = -\left( \tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2} \right) = 1

InformationGain = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} \, Entropy(S_v)

IG = 1 - \left[ \frac{2}{6}\left( -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} \right) + \frac{4}{6}\left( -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} \right) \right] = 1 - \left[ \tfrac{2}{6}\cdot 1 + \tfrac{4}{6}\cdot 1 \right] = 0

This split yields no information gain: each partition is still an even mix of the two classes.
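A small helper (my own utility, not from the slides) to check entropy and information-gain calculations like the one above, working from class counts with log base 2:

```python
# Entropy and information gain from class counts, using log base 2.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

# The 6-instance example above: parent is 3 vs 3, split into a (1,1) and a (2,2) partition.
print(information_gain([3, 3], [[1, 1], [2, 2]]))  # -> 0.0
```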
There are four independent variables to determine the dependent variable. The independent
variables are Outlook, Temperature, Humidity, and Wind. The dependent variable is whether
to play football or not.
Solution
NOTE:
Here we typically take log to base 2.
In total there are 14 instances, of which 9 are "Yes" and 5 are "No".
The probabilities used above (9/14 and 5/14) are based on these counts, giving a parent
entropy of −(9/14)log₂(9/14) − (5/14)log₂(5/14) ≈ 0.94.
Solution
Class counts for each Outlook value:

Outlook    Yes   No
Sunny      2     3
Overcast   4     0
Rain       3     2
Now we calculate the average weighted entropy, i.e., the sum over a feature's values of each partition's weight multiplied by its entropy.
The next step is to find the information gain. It is the difference between the parent entropy and the average weighted entropy we found above.
Now select the feature with the largest information gain. Here it is Outlook, so it forms the first node (root node) of our decision tree.
Since Overcast contains only examples of class "Yes", we can set it as a "Yes" leaf: if the outlook is overcast, football will be played. Now our decision tree looks as follows.
Solution
The next step is to find the next node in our decision tree.
Now we find the split under the Sunny branch.
We have to determine which of Temperature, Humidity, or Wind has the
highest information gain.
Solution
[Table: Yes/No counts under the Sunny branch for each candidate feature]
Solution
E(sunny, Temperature) = (2/5)·E(0,2) + (1/5)·E(1,0) + (2/5)·E(1,1) = (2/5)·0 + (1/5)·0 + (2/5)·1 = 0.4
Similarly, we get the weighted entropies (and hence information gains) for Humidity and Wind.
Here IG(sunny, Humidity) is the largest value, so Humidity is the node that comes under Sunny.
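To double-check this arithmetic with the same entropy-from-counts idea as the earlier helper (the Sunny-branch counts of 2 Yes / 3 No and the Temperature partitions (0,2), (1,0), (1,1) are taken from the slides):

```python
# Verify E(sunny, Temperature) and the corresponding information gain.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Sunny subset: 2 Yes / 3 No overall; Temperature partitions it into (0,2), (1,0), (1,1).
weighted = (2/5) * entropy([0, 2]) + (1/5) * entropy([1, 0]) + (2/5) * entropy([1, 1])
print(weighted)                        # 0.4
print(entropy([2, 3]) - weighted)      # IG(sunny, Temperature) ≈ 0.571
```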
Solution
For humidity from the above table, we can say that play will occur if humidity is normal
and will not occur if it is high.
Similarly, find the nodes under rainy.
Note: A branch with entropy more than 0 needs further splitting.
Solution
Exercise
https://medium.com/analytics-vidhya/how-exactly-decision-trees-are-built-with-complete-example-dbda4a34cf1d
Advantages and Disadvantages of Trees
Advantages:
• Trees are very easy to explain (easier to explain than linear regression)
• Trees can be displayed graphically and interpreted by a non-expert
• Decision trees may more closely mirror human decision-making
• Trees can easily handle qualitative predictors (no dummy variables needed)

Disadvantages:
• Trees usually do not have the same level of predictive accuracy as other data mining algorithms
• But the predictive performance of decision trees can be improved by aggregating trees
  • Techniques: bagging, boosting, random forests
Decision Tree Advantages
Inexpensive to construct
Extremely fast at classifying unknown records
O(d) where d is the depth of the tree
Presence of redundant attributes does not adversely affect the
accuracy of decision trees
One of the two redundant attributes will not be used for splitting once
the other attribute is chosen
Nonparametric approach
Does not require any prior assumptions regarding probability
distributions, means, variances, etc.
Summary of Classification Methods
Classifier | Pros | Cons | WEKA Support?
Naïve Bayes | Easy to implement; less model complexity; fast | No variable dependency; over-simplification; tends to overfit | Yes
Decision Tree | Easily interpretable; generally performs well | Little training data for lower nodes | Yes
Random Forest | Strong performance; simple to implement | Few hyper-parameters to tune; a little harder to interpret than decision trees | Yes
K-Nearest Neighbors | Simple and powerful; no training involved | Slow and expensive | Yes
Support Vector Machine | Tends to have better performance than other methods; works well on text classification; works well with a large feature set | Can be computationally intensive; choice of kernel may not be obvious | Yes