Konsep Data Mining

Lecture 9-10 | Classification: Basic Concepts

Slides adapted from Jiawei Han, Computer Science, Univ. of Illinois at Urbana-Champaign, 2017
1
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

2
Supervised vs. Unsupervised Learning (1)
q Supervised learning (classification)
q Supervision: The training data, such as observations or measurements, are
accompanied by labels indicating the classes to which they belong
q New data is classified based on the models built from the training set
Training Data with class label:

  age      income  student  credit_rating  buys_computer
  <=30     high    no       fair           no
  <=30     high    no       excellent      no
  31...40  high    no       fair           yes
  >40      medium  no       fair           yes
  >40      low     yes      fair           yes
  >40      low     yes      excellent      no
  31...40  low     yes      excellent      yes
  <=30     medium  no       fair           no
  <=30     low     yes      fair           yes
  >40      medium  yes      fair           yes
  <=30     medium  yes      excellent      yes
  31...40  medium  no       excellent      yes
  31...40  high    yes      fair           yes
  >40      medium  no       excellent      no

(Figure: training instances are fed to a learning algorithm to build the model; test instances are
then given to the model, which predicts Positive or Negative)
3
Supervised vs. Unsupervised Learning (2)
q Unsupervised learning (clustering)
q The class labels of training data are unknown
q Given a set of observations or measurements, establish the possible existence
of classes or clusters in the data

4
Prediction Problems: Classification vs. Numeric
Prediction
q Classification
q Predict categorical class labels (discrete or nominal)
q Construct a model based on the training set and the class labels (the values in a
classifying attribute) and use it in classifying new data
q Numeric prediction
q Model continuous-valued functions (i.e., predict unknown or missing values)
q Typical applications of classification
q Credit/loan approval
q Medical diagnosis: if a tumor is cancerous or benign
q Fraud detection: if a transaction is fraudulent
q Web page categorization: which category it is
5
Classification—Model Construction, Validation and Testing
q Model construction
q Each sample is assumed to belong to a predefined class (shown by the class label)
q The set of samples used for model construction is training set
q Model: Represented as decision trees, rules, mathematical formulas, or other forms
q Model Validation and Testing:
q Test: Estimate accuracy of the model
q The known label of test sample is compared with the classified result from the
model
q Accuracy: % of test set samples that are correctly classified by the model
q Test set is independent of training set
q Validation: If the test set is used to select or refine models, it is called a validation
(or development) set
q Model Deployment: If the accuracy is acceptable, use the model to classify new data
6
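As a concrete illustration of the construction/validation/testing split described above, here is a minimal Python sketch (not from the slides); the 60/20/20 split proportions and the toy `records` list are illustrative assumptions.

```python
import random

def split_dataset(records, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle labeled records and split into training/validation/test sets."""
    rng = random.Random(seed)
    shuffled = records[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]                      # build the model here
    val = shuffled[n_train:n_train + n_val]         # select/refine models here
    test = shuffled[n_train + n_val:]               # estimate final accuracy here
    return train, val, test

# Example usage with hypothetical (features, label) records
records = [({"age": "<=30", "income": "high"}, "no"),
           ({"age": ">40", "income": "low"}, "yes")] * 10
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```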
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

7
Decision Tree Induction: An Example
q Decision tree construction: a top-down, recursive, divide-and-conquer process
q Training data set: Who buys a computer? (the same 14-tuple buys_computer data shown earlier)
q Resulting tree:

  age?
    <=30:    student?
               no  -> Not-buy
               yes -> Buy
    31..40:  Buy
    >40:     credit_rating?
               excellent -> Not-buy
               fair      -> Buy

Note: The data set is adapted from the "Playing Tennis" example of R. Quinlan
8
From Entropy to Info Gain: A Brief Review of Entropy
q Entropy (Information Theory)
q A measure of uncertainty associated with a random variable
q Calculation: For a discrete random variable Y taking m distinct values {y1, y2, ..., ym}
  with probabilities p_i = P(Y = y_i):
  H(Y) = − Σ_{i=1..m} p_i log_2(p_i)
q Interpretation
q Higher entropy → higher uncertainty
q Lower entropy → lower uncertainty
q Conditional entropy
  H(Y | X) = Σ_x P(X = x) H(Y | X = x)
(Figure: for the binary case m = 2, entropy is maximal, 1 bit, when the two values are equally likely)
9
Information Gain: An Attribute Selection Measure
q Select the attribute with the highest information gain (used in typical
decision tree induction algorithm: ID3/C4.5)
q Let pi be the probability that an arbitrary tuple in D belongs to class Ci,
estimated by |Ci, D|/|D|
q Expected information (entropy) needed to classify a tuple in D:
  Info(D) = − Σ_{i=1..m} p_i log_2(p_i)

q Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)

q Information gained by branching on attribute A:
  Gain(A) = Info(D) − Info_A(D)
10
Example: Attribute Selection with Information Gain
q Class P: buys_computer = "yes"; Class N: buys_computer = "no"
q Info(D) = I(9,5) = −(9/14) log_2(9/14) − (5/14) log_2(5/14) = 0.940
q Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

  age      p_i   n_i   I(p_i, n_i)
  <=30     2     3     0.971
  31...40  4     0     0
  >40      3     2     0.971

q (5/14) I(2,3) means the branch "age <= 30" has 5 out of the 14 samples, with 2 yes's and 3 no's
q Hence Gain(age) = Info(D) − Info_age(D) = 0.246
q Similarly, we can get
  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
(Computed on the same 14-tuple buys_computer training data shown earlier)
11
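The gains above can be reproduced with a short Python sketch (not part of the slides); it implements Info(D), Info_A(D), and Gain(A) directly on the 14-tuple buys_computer data.

```python
from math import log2
from collections import Counter

# The 14-tuple buys_computer training data: (age, income, student, credit_rating, class)
DATA = [
    ("<=30", "high", "no", "fair", "no"),    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),   (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(tuples):
    """Expected information (entropy) to classify a tuple: -sum p_i log2 p_i over the classes."""
    counts = Counter(t[-1] for t in tuples)
    n = len(tuples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(tuples, attr):
    """Information gain of splitting on one attribute: Info(D) - Info_A(D)."""
    idx = ATTRS[attr]
    n = len(tuples)
    partitions = {}
    for t in tuples:
        partitions.setdefault(t[idx], []).append(t)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(tuples) - info_a

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))
# Gains match the slide values (age 0.246, income 0.029, student 0.151, credit_rating 0.048), up to rounding
```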
Decision Tree Induction: Algorithm
q Basic algorithm
q Tree is constructed in a top-down, recursive, divide-and-conquer manner
q At start, all the training examples are at the root
q Examples are partitioned recursively based on selected attributes
q On each node, attributes are selected based on the training examples on that
node, and a heuristic or statistical measure (e.g., information gain)
q Conditions for stopping partitioning
q All samples for a given node belong to the same class
q There are no remaining attributes for further partitioning
q There are no samples left
q Prediction
q Majority voting is employed for classifying the leaf

12
How to Handle Continuous-Valued Attributes?
q Method 1: Discretize continuous values and treat them as categorical values
q E.g., age: < 20, 20..30, 30..40, 40..50, > 50
q Method 2: Determine the best split point for continuous-valued attribute A
q Sort the values of A in increasing order, e.g., 15, 18, 21, 22, 24, 25, 29, 31, …
q Possible split point: the midpoint between each pair of adjacent values
q (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
q e.g., (15+18)/2 = 16.5, 19.5, 21.5, 23, 24.5, 27, 30, …
q The point with the maximum information gain for A is selected as the split-
point for A
q Split: Based on split point P
q The set of tuples in D satisfying A ≤ P vs. those with A > P
13
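As a small illustration (not from the slides), the candidate split points for a continuous attribute can be generated as the midpoints of adjacent sorted values; the midpoint with the maximum information gain would then be chosen as the split point.

```python
def candidate_split_points(values):
    """Midpoints between adjacent sorted values of a continuous attribute."""
    vs = sorted(values)
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

print(candidate_split_points([15, 18, 21, 22, 24, 25, 29, 31]))
# [16.5, 19.5, 21.5, 23.0, 24.5, 27.0, 30.0]
```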
Gain Ratio: A Refined Measure for Attribute Selection
q Information gain measure is biased towards attributes with a large number of
values
q Gain ratio: Overcomes the problem (as a normalization to information gain)
  SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log_2(|D_j| / |D|)
q GainRatio(A) = Gain(A)/SplitInfo(A)
q The attribute with the maximum gain ratio is selected as the splitting attribute
q Gain ratio is used in a popular algorithm C4.5 (a successor of ID3) by R. Quinlan
q Example
q SplitInfo_income(D) = −(4/14) log_2(4/14) − (6/14) log_2(6/14) − (4/14) log_2(4/14) = 1.557
q GainRatio(income) = 0.029/1.557 = 0.019

14
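A minimal sketch of SplitInfo and GainRatio (not from the slides); the 4/6/4 partition sizes for income come from the 14-tuple dataset, and the 0.029 gain is the value computed on the earlier slide.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|) over the partitions induced by A."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s > 0)

# income splits the 14 tuples into |high| = 4, |medium| = 6, |low| = 4
si = split_info([4, 6, 4])
print(round(si, 3))             # 1.557
print(round(0.029 / si, 3))     # GainRatio(income) = Gain(income) / SplitInfo_income(D) ≈ 0.019
```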
Another Measure: Gini Index
q Gini index: Used in CART, and also in IBM IntelligentMiner
q If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 − Σ_{j=1..n} p_j²
  where p_j is the relative frequency of class j in D
q If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is
defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
q Reduction in impurity:
  Δgini(A) = gini(D) − gini_A(D)
q The attribute that provides the smallest gini_split(D) (or the largest reduction in
impurity) is chosen to split the node (need to enumerate all the possible splitting
points for each attribute)
15
Computation of Gini Index
q Example: D has 9 tuples in buys_computer = "yes" and 5 in "no"
  gini(D) = 1 − (9/14)² − (5/14)² = 0.459
q Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
  gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2)
    = (10/14) (1 − (7/10)² − (3/10)²) + (4/14) (1 − (2/4)² − (2/4)²)
    = 0.443
    = gini_{income ∈ {high}}(D)
q Gini_{low,high} is 0.458; Gini_{medium,high} is 0.450
q Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
q All attributes are assumed continuous-valued
q May need other tools, e.g., clustering, to get the possible split values
q Can be modified for categorical attributes
16
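The Gini numbers on this slide can be checked with a few lines of Python (not from the slides); the 7/3 and 2/2 class counts for the {low, medium} vs. {high} income split come from the same 14-tuple dataset.

```python
def gini(counts):
    """Gini index from class counts: 1 - sum p_j^2."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(groups):
    """Weighted Gini of a binary split; each group is a list of class counts."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(round(gini([9, 5]), 3))                    # 0.459 for the whole data set D
# D1 = {low, medium}: 7 yes / 3 no;  D2 = {high}: 2 yes / 2 no
print(round(gini_split([[7, 3], [2, 2]]), 3))    # 0.443
```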
Comparing Three Attribute Selection Measures
q The three measures, in general, return good results but
q Information gain:
q biased towards multivalued attributes
q Gain ratio:
q tends to prefer unbalanced splits in which one partition is much smaller than
the others
q Gini index:
q biased to multivalued attributes
q has difficulty when # of classes is large
q tends to favor tests that result in equal-sized partitions and purity in both
partitions

17
Other Attribute Selection Measures
q Minimal Description Length (MDL) principle
q Philosophy: The simplest solution is preferred
q The best tree as the one that requires the fewest # of bits to both (1) encode
the tree, and (2) encode the exceptions to the tree
q CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
q Multivariate splits (partition based on multiple variable combinations)
q CART: finds multivariate splits based on a linear combination of attributes
q There are many other measures proposed in research and applications
q E.g., G-statistics, C-SEP
q Which attribute selection measure is the best?
q Most give good results, but none is significantly superior to the others
18
Overfitting and Tree Pruning
q Overfitting: An induced tree may overfit the training data
q Too many branches, some may reflect anomalies due to noise or
outliers
q Poor accuracy for unseen samples
q Two approaches to avoid overfitting
q Prepruning: Halt tree construction early: do not split a node if this
would result in the goodness measure falling below a threshold
q Difficult to choose an appropriate threshold
q Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
q Use a set of data different from the training data to decide which is
the “best pruned tree”
19
Classification in Large Databases
q Classification—a classical problem extensively studied by statisticians and machine
learning researchers
q Scalability: Classifying data sets with millions of examples and hundreds of
attributes with reasonable speed
q Why is decision tree induction popular?
q Relatively fast learning speed
q Convertible to simple and easy to understand classification rules
q Easy to be adapted to database system implementations (e.g., using SQL)
q Comparable classification accuracy with other methods
q RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
q Builds an AVC-list (attribute, value, class label)
20
RainForest: A Scalable Classification Framework
q The criteria that determine the quality of the tree can be computed separately
q Builds an AVC-list: AVC (Attribute, Value, Class_label)
q AVC-set (of an attribute X)
q Projection of the training dataset onto the attribute X and the class label, where counts
of individual class labels are aggregated
q AVC-group (of a node n)
q Set of AVC-sets of all predictor attributes at the node n

  AVC-set on Age                    AVC-set on Income
  Age      Buy_Computer             income   Buy_Computer
           yes    no                         yes    no
  <=30     2      3                 high     2      2
  31..40   4      0                 medium   4      2
  >40      3      2                 low      3      1

  AVC-set on Student                AVC-set on Credit_Rating
  student  Buy_Computer             credit_rating  Buy_Computer
           yes    no                               yes    no
  yes      6      1                 fair           6      2
  no       3      4                 excellent      3      3

(The training data is the same 14-tuple buys_computer dataset shown earlier; the tables above
are its AVC sets)
21


Presentation of Classification Results

22
Visualization of a Decision Tree (in SGI/MineSet 3.0)

23
Interactive Visual Mining by Perception-Based
Classification (PBC)
q Perception-based classifier (PBC):
developed at Univ. of Munich (1999)
q One color represents one class label
q One pie represents one attribute (or
variable)
q The pie with random spread implies
weak classification power
q The pie with clearly partitioned color
strips implies good classification power
q One can select a good attribute and
regenerate new pie charts for
classification at the subsequent levels

24
CS412-Fall 2017: Midterm Statistics
  Range    Count
  0-30     1
  30-40    4
  40-50    20
  50-60    35
  60-70    45
  70-80    50
  80-90    32
  90-100   13

  Mean           68.64
  Median         69.5
  1st quartile   57.75
  3rd quartile   79.5

25
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

26
What Is Bayesian Classification?
q A statistical classifier
q Perform probabilistic prediction (i.e., predict class membership probabilities)
q Foundation—Based on Bayes’ Theorem
q Performance
q A simple Bayesian classifier, the naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
q Incremental
q Each training example can incrementally increase/decrease the probability that
a hypothesis is correct—prior knowledge can be combined with observed data
q Theoretical Standard
q Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be
measured
27
Bayes’ Theorem: Basics
q Total probability theorem:
  P(B) = Σ_i P(B | A_i) P(A_i)
q Bayes’ Theorem:
  P(H | X) = P(X | H) P(H) / P(X)  ∝  P(X | H) P(H)
  (posteriori probability = likelihood × prior probability / evidence;
   i.e., what we should choose depends on what we just see and what we knew previously)
q X: a data sample (“evidence”); H: the hypothesis that X belongs to class C
q Prediction can be done based on Bayes’ Theorem; classification is to derive the maximum posteriori
28
Naïve Bayes Classifier: Making a Naïve Assumption
q Practical difficulty of Bayesian inference: It requires initial knowledge of many
probabilities, which may not be available or may involve significant computational cost
q A Naïve Special Case
q Make an additional assumption to simplify the model, but achieve comparable
performance.

attributes are conditionally independent


(i.e., no dependence relation between attributes)

  P(X | C_i) = Π_{k=1..n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × ... × P(x_n | C_i)

q Only need to count the class distribution w.r.t. features


29
Naïve Bayes Classifier: Categorical vs. Continuous
Valued Features
q If feature x_k is categorical, P(x_k = v_k | C_i) is the number of tuples in C_i with x_k = v_k,
divided by |C_i,D| (the number of tuples of C_i in D)

  P(X | C_i) = Π_{k=1..n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × ... × P(x_n | C_i)

q If feature x_k is continuous-valued, P(x_k = v_k | C_i) is usually computed based on a
Gaussian distribution with mean μ and standard deviation σ:

  P(x_k = v_k | C_i) = N(v_k; μ_Ci, σ_Ci) = ( 1 / (√(2π) σ_Ci) ) · e^( −(v_k − μ_Ci)² / (2 σ_Ci²) )
30
Naïve Bayes Classifier: Training Dataset

q Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
q Training data: the same 14-tuple buys_computer dataset shown earlier
q Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

31
Naïve Bayes Classifier: An Example
q P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
         P(buys_computer = "no") = 5/14 = 0.357
q Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
q X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):         P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
                   P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci): P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
                   P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
  Therefore, X belongs to class ("buys_computer = yes")
(Probabilities are estimated from the same 14-tuple buys_computer training data shown earlier)
32
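The whole example can be reproduced with a small naïve Bayes sketch in Python (not part of the slides); it estimates the priors and conditional probabilities from the 14-tuple dataset and scores the query tuple X.

```python
from collections import Counter, defaultdict

# The 14-tuple buys_computer data: (age, income, student, credit_rating, class)
DATA = [
    ("<=30", "high", "no", "fair", "no"),    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),   (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def train_naive_bayes(data):
    priors = Counter(t[-1] for t in data)            # class counts
    cond = defaultdict(Counter)                      # counts of (feature index, class) -> value
    for t in data:
        for k, v in enumerate(t[:-1]):
            cond[(k, t[-1])][v] += 1
    return priors, cond

def score(x, c, priors, cond, n):
    """P(X|c) * P(c) under the conditional-independence assumption."""
    p = priors[c] / n
    for k, v in enumerate(x):
        p *= cond[(k, c)][v] / priors[c]
    return p

priors, cond = train_naive_bayes(DATA)
x = ("<=30", "medium", "yes", "fair")
for c in priors:
    print(c, round(score(x, c, priors, cond, len(DATA)), 3))
# yes -> about 0.028, no -> about 0.007, so X is classified as buys_computer = yes
```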
Avoiding the Zero-Probability Problem
q Naïve Bayesian prediction requires each conditional probability be non-zero
q Otherwise, the predicted probability will be zero
  P(X | C_i) = Π_{k=1..n} P(x_k | C_i) = P(x_1 | C_i) × P(x_2 | C_i) × ... × P(x_n | C_i)
q Example. Suppose a dataset with 1000 tuples:
  income = low (0), income = medium (990), and income = high (10)
q Use Laplacian correction (or Laplacian estimator)
q Adding 1 to each case
  Prob(income = low) = 1/(1000 + 3)
  Prob(income = medium) = (990 + 1)/(1000 + 3)
  Prob(income = high) = (10 + 1)/(1000 + 3)
q The “corrected” probability estimates are close to their “uncorrected”
counterparts
33
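A tiny sketch of the Laplacian correction (not from the slides), using the 0/990/10 income counts from the example:

```python
def laplace_probs(counts):
    """Add-one (Laplacian) smoothing of category counts into probabilities."""
    total = sum(counts.values()) + len(counts)   # add 1 for each of the q categories
    return {k: (v + 1) / total for k, v in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_probs(income_counts))
# low: 1/1003, medium: 991/1003, high: 11/1003 -- no zero probabilities remain
```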
Naïve Bayes Classifier: Strength vs. Weakness
q Strength
q Easy to implement
q Good results obtained in most of the cases
q Weakness
q Assumption: class-conditional independence of attributes, therefore loss of accuracy
q Practically, dependencies exist among variables
q E.g., Patients: Profile: age, family history, etc.
Symptoms: fever, cough etc.
Disease: lung cancer, diabetes, etc.
q Dependencies among these cannot be modeled by Naïve Bayes Classifier
q How to deal with these dependencies?
q Use Bayesian Belief Networks (to be covered in the next chapter)
34
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

35
Linear Regression vs. Linear Classifier
q Linear regression
q Data modeled to fit a straight line
q Linear equation: Y = w X + b
q Often uses the least-square method to fit the line
q Used to predict continuous values

q Linear Classifier
q Builds a classification model using a straight line
q Used for binary classification (of categorical class labels)
(Figure: a straight line separating a scatter of 'x' points from 'o' points)
36
Linear Classifier: General Ideas
q Binary Classification
q f(x) is a linear function based on the example's attribute values
q The prediction is based on the value of f(x)
q Data above the blue line belongs to class 'x' (i.e., f(x) > 0)
q Data below the blue line belongs to class 'o' (i.e., f(x) < 0)
(Figure: 'x' points above and 'o' points below the separating line)
q Classical Linear Classifiers
q Linear Discriminant Analysis (LDA) (not covered)
q Logistic Regression
q Perceptron (later)
q SVM (later)
37
Linear Classifier: An Example
q A toy rule to determine whether a faculty member has tenure
q Year >= 6 or Title = "Professor" ⇔ Tenure
q How to express the rule as a linear classifier?
q Features
q x_1 (x_1 ≥ 0) is an integer denoting the year
q x_2 is a Boolean (0 or 1) denoting whether the title is "Professor"
q A feasible linear classifier: f(x) = (x_1 − 5) + 6 · x_2
q When x_2 is True, because x_1 ≥ 0, f(x) is always greater than 0
q When x_2 is False, f(x) > 0 ⇔ x_1 ≥ 6
q There are many more feasible classifiers
q f(x) = (x_1 − 5.5) + 6 · x_2
q f(x) = 2 · (x_1 − 5) + 11 · x_2
q ......
38
Key Question: Which Line Is Better?
q There might be many feasible linear
functions
q Both H1 and H2 will work
q Which one is better?
q H2 looks “better” in the sense that it is
also furthest from both groups
q We will introduce more in the SVM section

39
Logistic Regression: General Ideas
q Key idea: Turns linear predictions into probabilities
q Sigmoid function:
  S(x) = 1 / (1 + e^(−x)) = e^x / (e^x + 1)
q Projects (−∞, +∞) to [0, 1]
q Compared to the linear probability model, the sigmoid is smoother
(Figure: the S-shaped logistic regression model vs. the straight-line linear probability model)
40
Logistic Regression: An Example
q Suppose we only consider the year as a feature
(Figure: probability of tenure as an S-shaped function of year; the vertical axis runs from 0 to
1 (Tenured) and the curve crosses 0.5 around year = 6)
41
Logistic Regression: Maximum Likelihood
q The prediction function to learn
  p(Y = 1 | X = x; w) = S( w_0 + Σ_{i=1..n} w_i · x_i )
q w = (w_0, w_1, w_2, ..., w_n) are the parameters
q Maximum Likelihood
q Log likelihood:
  l(w) = Σ_{i=1..N} [ y_i log p(Y = 1 | X = x_i; w) + (1 − y_i) log (1 − p(Y = 1 | X = x_i; w)) ]
q There is no closed-form solution
q Gradient Descent
q Update w based on the training data
q Chain rule for the gradient
42
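Since there is no closed-form solution, the parameters are fit iteratively. Below is a minimal batch gradient-ascent sketch for the log likelihood (not from the slides); the toy tenure data, learning rate, and epoch count are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit w = (w0, w1, ..., wn) by gradient ascent on the log likelihood."""
    n_features = len(xs[0])
    w = [0.0] * (n_features + 1)                 # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(xs, ys):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = y - p                          # gradient of the log likelihood
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        w = [wi + lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy tenure data: feature = years, label = 1 if tenured
xs = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
w = fit_logistic(xs, ys)
print([round(v, 2) for v in w])                  # decision boundary near year = 5.5
print(round(sigmoid(w[0] + w[1] * 7), 2))        # estimated probability of tenure at year 7
```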
Gradient Descent
q Gradient descent is an iterative optimization algorithm for finding the minimum
of a function (e.g., the negative log likelihood)
q For a function F(x) at a point a, F(x) decreases fastest if we go in the direction of
the negative gradient at a
q When the gradient is zero, we arrive at a local minimum
43
Generative vs. Discriminative Classifiers
q X: observed variables (features)
q Y: target variables (class labels)
q A generative classifier models p(Y, X)
q It models how the data was "generated": "what is the likelihood that this or that
class generated this instance?" and picks the one with the higher probability
q Naïve Bayes
q Bayesian Networks
q A discriminative classifier models p(Y|X)
q It uses the data to create a decision boundary
q Logistic Regression
q Support Vector Machines
44
Further Comments on Discriminative Classifiers
q Strength
q Prediction accuracy is generally high
q As compared to generative models
q Robust, works when training examples contain errors
q Fast evaluation of the learned target function
q Compared to Bayesian networks (covered later), which are normally slow
q Criticism
q Long training time
q Difficult to understand the learned function (weights)
q Bayesian networks can be used easily for pattern discovery
q Not easy to incorporate domain knowledge
q Easy in the form of priors on the data or distributions
45
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

46
Model Evaluation and Selection
q Evaluation metrics
q How can we measure accuracy?
q Other metrics to consider?
q Use validation test set of class-labeled tuples instead of training set when assessing
accuracy
q Methods for estimating a classifier’s accuracy
q Holdout method
q Cross-validation
q Bootstrap
q Comparing classifiers:
q ROC Curves
47
Classifier Evaluation Metrics: Confusion Matrix
q Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
q In a confusion matrix w. m classes, CMi,j indicates # of tuples in class i that
were labeled by the classifier as class j
q May have extra rows/columns to provide totals
q Example of Confusion Matrix:

Actual class\Predicted class buy_computer = yes buy_computer = no Total


buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000
48
Classifier Evaluation Metrics: Accuracy, Error Rate,
Sensitivity and Specificity
  A\P   C    ¬C
  C     TP   FN    P
  ¬C    FP   TN    N
        P'   N'    All

q Classifier accuracy, or recognition rate: percentage of test set tuples that are correctly classified
  Accuracy = (TP + TN)/All
q Error rate: 1 − accuracy, or
  Error rate = (FP + FN)/All
q Class imbalance problem
q One class may be rare, e.g., fraud, or HIV-positive
q Significant majority of the negative class and minority of the positive class
q Measures that handle the class imbalance problem
q Sensitivity (recall): True positive recognition rate
  Sensitivity = TP/P
q Specificity: True negative recognition rate
  Specificity = TN/N
49
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
q Precision: Exactness: what % of tuples that the classifier labeled as positive are
actually positive? TP
P = Precision =
TP + FP
q Recall: Completeness: what % of positive tuples did the classifier label as positive?
TP
R = Recall =
TP + FN
q Range: [0, 1]
q The “inverse” relationship between precision & recall
q F measure (or F-score): harmonic mean of precision and recall
q In general, it is the weighted measure of precision & recall
  F_β = 1 / ( α · (1/P) + (1 − α) · (1/R) ) = (β² + 1) P R / (β² P + R)
  (assigning β times as much weight to recall as to precision)
q F1-measure (balanced F-measure)
q That is, when β = 1, F_1 = 2 P R / (P + R)
50
Classifier Evaluation Metrics: Example
q Use the same confusion matrix, calculate the measure just introduced
Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)
cancer = yes 90 210 300 30.00 (sensitivity)
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.50 (accuracy)

q Sensitivity = TP/P = 90/300 = 30%


q Specificity = TN/N = 9560/9700 = 98.56%
q Accuracy = (TP + TN)/All = (90+9560)/10000 = 96.50%
q Error rate = (FP + FN)/All = (140 + 210)/10000 = 3.50%
q Precision = TP/(TP + FP) = 90/(90 + 140) = 90/230 = 39.13%
q Recall = TP/ (TP + FN) = 90/(90 + 210) = 90/300 = 30.00%
q F1 = 2 P × R /(P + R) = 2 × 39.13% × 30.00%/(39.13% + 30%) = 33.96%
51
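These calculations can be verified with a few lines of Python (not from the slides):

```python
TP, FN = 90, 210      # actual cancer = yes
FP, TN = 140, 9560    # actual cancer = no
P, N = TP + FN, FP + TN
ALL = P + N

sensitivity = TP / P                       # recall
specificity = TN / N
accuracy = (TP + TN) / ALL
error_rate = (FP + FN) / ALL
precision = TP / (TP + FP)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.2%} specificity={specificity:.2%} accuracy={accuracy:.2%}")
print(f"error={error_rate:.2%} precision={precision:.2%} recall={sensitivity:.2%} F1={f1:.2%}")
# sensitivity 30.00%, specificity 98.56%, accuracy 96.50%, error 3.50%,
# precision 39.13%, recall 30.00%, F1 33.96%
```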
Classifier Evaluation: Holdout & Cross-Validation
q Holdout method
q Given data is randomly partitioned into two independent sets
q Training set (e.g., 2/3) for model construction
q Test set (e.g., 1/3) for accuracy estimation
q Repeated random sub-sampling validation: a variation of holdout
q Repeat holdout k times, accuracy = avg. of the accuracies obtained
q Cross-validation (k-fold, where k = 10 is most popular)
q Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
q At i-th iteration, use Di as test set and others as training set
q Leave-one-out: k folds where k = # of tuples, for small sized data
q Stratified cross-validation: folds are stratified so that the class distribution
in each fold is approximately the same as that in the initial data
52
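A minimal k-fold cross-validation sketch in plain Python (not from the slides); `train_and_evaluate` is a hypothetical placeholder for whatever classifier is being assessed, and the dummy majority-class evaluator is only there to make the example runnable.

```python
import random

def k_fold_cross_validation(records, k, train_and_evaluate, seed=0):
    """Partition records into k folds; each fold serves once as the test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]       # k roughly equal-sized subsets
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        accuracies.append(train_and_evaluate(train, test))
    return sum(accuracies) / k

# Dummy evaluator: accuracy of always predicting the training set's majority class
def majority_baseline(train, test):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

data = [((i,), "yes" if i % 3 else "no") for i in range(30)]
print(round(k_fold_cross_validation(data, 10, majority_baseline), 3))
```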
Classifier Evaluation: Bootstrap
q Bootstrap
q Works well with small data sets
q Samples the given training tuples uniformly with replacement
q Each time a tuple is selected, it is equally likely to be selected again and re-added
to the training set
q Several bootstrap methods, and a common one is .632 bootstrap
q A data set with d tuples is sampled d times, with replacement, resulting in a training
set of d samples. The data tuples that did not make it into the training set end up
forming the test set. About 63.2% of the original data end up in the bootstrap, and
the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
q Repeat the sampling procedure k times; overall accuracy of the model:
  Acc(M) = Σ_{i=1..k} ( 0.632 × Acc(M_i)_test + 0.368 × Acc(M_i)_train )
53
Model Selection: ROC Curves
q ROC (Receiver Operating Characteristics) curves:
for visual comparison of classification models
q Originated from signal detection theory
q Shows the trade-off between the true positive
rate and the false positive rate
q The area under the ROC curve (AUC: Area Under
Curve) is a measure of the accuracy of the model
q Rank the test tuples in decreasing order: the one that is most likely to belong to the
positive class appears at the top of the list
q The vertical axis represents the true positive rate; the horizontal axis represents the
false positive rate; the plot also shows a diagonal line
q The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less
accurate is the model
q A model with perfect accuracy will have an area of 1.0
54
Issues Affecting Model Selection
q Accuracy
q classifier accuracy: predicting class label
q Speed
q time to construct the model (training time)
q time to use the model (classification/prediction time)
q Robustness: handling noise and missing values
q Scalability: efficiency in disk-resident databases
q Interpretability
q understanding and insight provided by the model
q Other measures, e.g., goodness of rules, such as decision tree size or compactness
of classification rules
55
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

56
Ensemble Methods: Increasing the Accuracy
q Ensemble methods
q Use a combination of models to increase accuracy
q Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
q Popular ensemble methods
q Bagging: Trains each model using a subset of the
training set, and models learned in parallel
q Boosting: Trains each new model instance to
emphasize the training instances that previous models
mis-classified, and models learned in order

57
Bagging: Bootstrap Aggregation
q Analogy: Diagnosis based on multiple doctors’ majority vote
q Training
q Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
q A classifier model Mi is learned for each training set Di
q Classification: classify an unknown sample X
q Each classifier Mi returns its class prediction
q The bagged classifier M* counts the votes and assigns the class with the most
votes to X
q Prediction: It can be applied to the prediction of continuous values by taking the
average value of each prediction for a given test tuple
q Accuracy: Improved accuracy in prediction
q Often significantly better than a single classifier derived from D
q For noisy data: Not considerably worse, more robust
58
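A minimal bagging sketch (not from the slides); `train_classifier` is a hypothetical base-learner function, and the trivial majority-class learner at the end is only there to make the example runnable.

```python
import random
from collections import Counter

def bagging(data, k, train_classifier, seed=0):
    """Train k models, each on a bootstrap sample (sampling with replacement) of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in range(len(data))]   # bootstrap sample D_i
        models.append(train_classifier(sample))
    return models

def bagged_predict(models, x):
    """Each model votes; the class with the most votes wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Illustrative base learner: always predicts the sample's majority class
def train_classifier(sample):
    labels = [y for _, y in sample]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [((i,), "yes" if i % 3 else "no") for i in range(30)]
models = bagging(data, k=5, train_classifier=train_classifier)
print(bagged_predict(models, (7,)))
```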
Random Forest: Basic Concepts
q Random Forest (first proposed by L. Breiman in 2001)
q A variation of bagging for decision trees
q Data bagging
q Use a subset of training data by sampling with replacement for each tree
q Feature bagging
q At each node use a random selection of attributes as candidates and split by
the best attribute among them
q Compared to original bagging, increases the diversity among generated trees
q During classification, each tree votes and the most popular class is returned
59
Random Forest
q Two Methods to construct Random Forest:
q Forest-RI (random input selection): Randomly select, at each node, F attributes
as candidates for the split at the node. The CART methodology is used to grow
the trees to maximum size
q Forest-RC (random linear combinations): Creates new attributes (or features)
that are a linear combination of the existing attributes (reduces the correlation
between individual classifiers)
q Comparable in accuracy to Adaboost, but more robust to errors and outliers
q Insensitive to the number of attributes selected for consideration at each split, and
faster than typical bagging or boosting

60
Boosting
q Analogy: Consult several doctors, based on a combination of weighted diagnoses—
weight assigned based on the previous diagnosis accuracy
q How boosting works?
q Weights are assigned to each training tuple
q A series of k classifiers is iteratively learned
q After a classifier Mi is learned, the weights are updated to allow the subsequent
classifier, Mi+1, to pay more attention to the training tuples that were
misclassified by Mi
q The final M* combines the votes of each individual classifier, where the weight
of each classifier's vote is a function of its accuracy
q Boosting algorithm can be extended for numeric prediction
q Comparing with bagging: Boosting tends to have greater accuracy, but it also risks
overfitting the model to misclassified data
61
Adaboost (Freund and Schapire, 1997)
q Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
q Initially, all the weights of tuples are set the same (1/d)
q Generate k classifiers in k rounds. At round i,
q Tuples from D are sampled (with replacement) to form a training set Di of
the same size
q Each tuple’s chance of being selected is based on its weight
q A classification model Mi is derived from Di
q Its error rate is calculated using Di as a test set
q If a tuple is misclassified, its weight is increased; otherwise, it is decreased
q Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate
is the sum of the weights of the misclassified tuples:
  error(M_i) = Σ_{j=1..d} w_j × err(X_j)
q The weight of classifier Mi's vote is
  log( (1 − error(M_i)) / error(M_i) )
62
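A compact AdaBoost-style sketch of the weight updates described above (not from the slides); `train_weak_learner` is a hypothetical base learner that accepts a weighted training set, and the error-rate clipping is an implementation detail added for numerical safety.

```python
import math

def adaboost(data, k, train_weak_learner):
    """Train k weak classifiers; re-weight tuples after each round (Freund & Schapire style)."""
    d = len(data)
    weights = [1.0 / d] * d                          # initially all tuples weigh 1/d
    models, alphas = [], []
    for _ in range(k):
        model = train_weak_learner(data, weights)
        err = sum(w for w, (x, y) in zip(weights, data) if model(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)        # guard against error rates of exactly 0 or 1
        alpha = math.log((1 - err) / err)            # weight of this classifier's vote
        models.append(model)
        alphas.append(alpha)
        # Increase weights of misclassified tuples (relatively decrease the rest), then normalize
        weights = [w * math.exp(alpha) if model(x) != y else w
                   for w, (x, y) in zip(weights, data)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models, alphas

def boosted_predict(models, alphas, x):
    """Weighted vote of the k classifiers."""
    score = {}
    for model, alpha in zip(models, alphas):
        label = model(x)
        score[label] = score.get(label, 0.0) + alpha
    return max(score, key=score.get)
```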
Classification of Class-Imbalanced Data Sets
q Class-imbalance problem: Rare positive examples but numerous negative ones
q E.g., medical diagnosis, fraud transaction, accident (oil-spill), and product fault
q Traditional methods assume a balanced distribution of classes and equal error
costs: not suitable for class-imbalanced data
q Typical methods on imbalanced data in two-class classification
q Oversampling: Re-sampling of data from the positive class
q Under-sampling: Randomly eliminate tuples from the negative class
q Threshold-moving: Move the decision threshold, t, so that the
rare class tuples are easier to classify, and hence, less chance of
costly false negative errors
q Ensemble techniques: Ensemble multiple classifiers introduced above
(Figure: a scatter of many 'x' (negative) points and a few 'o' (positive) points)
q Still difficult for the class imbalance problem on multiclass tasks
63
Classifying Data Streams with Skewed Distribution
q Classify data streams with skewed distribution (i.e., rare events)
q Using Sm alone as training data? Too few positive examples!
q Biased sampling: Save only the positive examples in the streams
q Ensemble: Partition the negative examples of Sm into k portions to build k classifiers
q Effectively reduces classification errors on the minority class
(Figure: stream chunks S1, S2, ..., Sm, Sm+1 are classified by biased sampling plus an ensemble of
classifiers C1, C2, ..., Ck)
J. Gao, et al., "A General Framework for Mining Concept-Drifting Data Streams with Skewed
Distributions", SDM'07
64
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

65
Multiclass Classification
q Classification involving more than two classes (i.e., > 2 Classes)
q Methodology: Reducing the multi-class problem into multiple binary problems
q Method 1. One-vs.-rest (or one-vs.-all)
q Given m classes, train m classifiers: one for each class
q Classifier j: treat tuples in class j as positive & all the rest as negative
q To classify a tuple X, the set of classifiers vote as an ensemble
q Method 2. one-vs.-one (or all-vs.-all): Learn a classifier for each pair of classes
q Given m classes, construct m(m − 1)/2 binary classifiers
q A classifier is trained using tuples of the two classes
q To classify a tuple X, each classifier votes
q X is assigned to the class with maximal vote
q Comparison: One-vs.-one tends to perform better than one-vs.-rest
q Many new algorithms have been developed to go beyond binary classifier method
66
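A one-vs.-rest sketch (not from the slides); `train_binary` stands for any of the binary classifiers discussed earlier and is assumed to return a scoring function f(x) measuring how strongly a tuple looks positive.

```python
def one_vs_rest_train(data, classes, train_binary):
    """Train one binary classifier per class: class j is positive, all others negative."""
    models = {}
    for c in classes:
        relabeled = [(x, 1 if y == c else 0) for x, y in data]
        models[c] = train_binary(relabeled)          # assumed to return a scoring function f(x)
    return models

def one_vs_rest_predict(models, x):
    """Assign x to the class whose classifier gives the highest positive score."""
    return max(models, key=lambda c: models[c](x))
```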
Semi-Supervised Classification
q Semi-supervised: Uses labeled and unlabeled data to build a classifier
q Self-training
q Build a classifier using the labeled data
(Figure: a few labeled '+' / '−' examples among many unlabeled points)
q Use it to label the unlabeled data, and those with the most confident
label prediction are added to the set of labeled data
q Repeat the above process
q Adv.: easy to understand; Disadv.: may reinforce errors
q Co-training: Use two or more classifiers to teach each other
q Each learner uses a mutually independent set of features of each tuple to
train a good classifier, say f1 and f2
q Then f1 and f2 are used to predict the class label for unlabeled data X
q Teach each other: The tuple having the most confident prediction from f1
is added to the set of labeled data for f2 & vice versa
q Other methods include joint probability distribution of features and labels
67
Active Learning
q A special case of semi-supervised learning
q Unlabeled data: Abundant
q Class labels are expensive to obtain
q Active learner: Interactively query teachers (oracle) for labels
q Pool-based approach: Uses a pool of unlabeled data
q L: a small subset of D is labeled, U: a pool of unlabeled data in D
q Use a query function to carefully select one or more tuples from U and
request labels from an oracle (a human annotator)
q The newly labeled samples are added to L, and learn a model
q Goal: Achieve high accuracy using as few labeled data as possible
q Evaluated using learning curves: Accuracy as a function of the number of
instances queried (# of tuples to be queried should be small)
q A lot of algorithms have been developed for active learning
68
Transfer Learning: Conceptual Framework
q Transfer learning: Extract knowledge from one or more source tasks (e.g.,
recognizing cars) and apply the knowledge to a target task (e.g., recognizing trucks)
q Traditional learning: Build a new classifier for each new task
q Transfer learning: Build new classifier by applying existing knowledge learned from
source tasks
q Many algorithms are developed, applied to text classification, spam filtering, etc.
(Figure: in the traditional learning framework, each of several different tasks gets its own learning
system; in the transfer learning framework, knowledge learned by the source-task learning systems
is passed to the learning system for the target task)
69
Weak Supervision: A New Programming Paradigm for
Machine Learning
q Overcome the training data bottleneck
q Leverage higher-level and/or noisier input from experts
q Exploring weak label distributions provided more cheaply and efficiently by
q Higher-level, less precise supervision (e.g., heuristic rules, expected label
distributions)
q Cheaper, lower-quality supervision (e.g., crowdsourcing)
q Existing resources (e.g., knowledge bases, pre-trained models)
q These weak label distributions could take many forms
q Weak labels from crowd workers, output of heuristic rules, or the result
of distant supervision (from KBs), or the output of other classifiers, etc.
q Constraints and invariances (e.g., from physics, logic, or other experts)
q Probability distributions (e.g., from weak or biased classifiers or user-
provided label or feature expectations or measurements)
70
Relationships Among Different Kinds of Supervisions

(Figure: relationships among different kinds of supervision)
Courtesy: A. Ratner et al., Stanford Blog, July 2017

71
Chapter 8. Classification: Basic Concepts
q Classification: Basic Concepts

q Decision Tree Induction

q Bayes Classification Methods

q Linear Classifier

q Model Evaluation and Selection

q Techniques to Improve Classification Accuracy: Ensemble Methods

q Additional Concepts on Classification

q Summary

72
Summary
q Classification: Model construction from a set of training data
q Effective and scalable methods
q Decision tree induction, Bayes classification methods, linear classifier, …

q No single method has been found to be superior over all others for all data sets

q Evaluation metrics: Accuracy, sensitivity, specificity, precision, recall, F measure

q Model evaluation: Holdout, cross-validation, bootstrapping, ROC curves (AUC)

q Improve Classification Accuracy: Bagging, boosting

q Additional concepts on classification: Multiclass classification, semi-supervised


classification, active learning, transfer learning, weak supervision
73
References (1)
q C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future
Generation Computer Systems, 13, 1997
q P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for
scaling machine learning. KDD'95
q A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.
q R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley, 2001
q U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI’94.
q Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and
an application to boosting. J. Computer and System Sciences, 1997.
q J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. VLDB’98.
q J. Gehrke, V. Gant, R. Ramakrishnan, and W.-Y. Loh, BOAT -- Optimistic Decision Tree
Construction. SIGMOD'99.
q T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer-Verlag, 2001.
74
References (2)
q T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and
training time of thirty-three old and new classification algorithms. Machine Learning, 2000
q J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic
interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research,
Blackwell Business, 1994
q M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining.
EDBT'96
q T. M. Mitchell. Machine Learning. McGraw Hill, 1997
q S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary
Survey, Data Mining and Knowledge Discovery 2(4): 345-389, 1998
q J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
q J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
q J. R. Quinlan. Bagging, boosting, and c4.5. AAAI'96.
75
References (3)
q R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and
pruning. VLDB’98
q J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data
mining. VLDB’96
q J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990
q P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005
q S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and
Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems.
Morgan Kaufman, 1991
q S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997
q I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques,
2ed. Morgan Kaufmann, 2005

76
Bayes’ Theorem: Basics
q Total probability theorem:
  P(B) = Σ_{i=1..M} P(B | A_i) P(A_i)
q Bayes' Theorem:
  P(H | X) = P(X | H) P(H) / P(X)
q Let X be a data sample (“evidence”): class label is unknown
q Let H be a hypothesis that X belongs to class C
q Classification is to determine P(H|X) (i.e., the posteriori probability): the probability
that the hypothesis holds given the observed data sample X
q P(H) (prior probability): the initial probability
q E.g., X will buy computer, regardless of age, income, …
q P(X): probability that sample data is observed
q P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
q E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
77
Classification Is to Derive the Maximum Posteriori
q Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
q Suppose there are m classes C1, C2, …, Cm.
q Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
q This can be derived from Bayes’ theorem
  P(C_i | X) = P(X | C_i) P(C_i) / P(X)

q Since P(X) is constant for all classes, only
  P(C_i | X) ∝ P(X | C_i) P(C_i)
needs to be maximized

78
Linear Discriminant Analysis (LDA)
q Linear Discriminant Analysis (LDA) works when the attributes are all continuous
q For categorical attributes, discriminant correspondence analysis is the
equivalent technique
q Basic idea: Project all samples on a line such that different classes are well separated
q Example: Suppose we have 2 classes and 2-dimensional samples x_1, ..., x_n
q n_1 samples come from class 1
q n_2 samples come from class 2
q Let the line direction be given by a unit vector v
q There are two candidate projections
q Vertical: v = (0, 1)
q Horizontal: v = (1, 0)
q Which one looks better?
q How to mathematically measure it?
79
Fisher’s LDA (Linear Discriminant Analysis)
q v^T x_i is the distance of the projection of x_i from the origin
q Let μ_1 and μ_2 be the means of class 1 and class 2 in the original space
  μ_1 = (1/n_1) Σ_{x_i ∈ class 1} x_i
  μ_2 = (1/n_2) Σ_{x_i ∈ class 2} x_i
q The distance between the means of the projected points is
  | v^T μ_1 − v^T μ_2 |
q Good? No. The horizontal projection may have a larger distance
80
Fisher’s LDA (con’t)
q Normalization needed
q Scatter: sample variance multiplied by n
  s_1 = Σ_{x_i ∈ class 1} (v^T x_i − v^T μ_1)²
  s_2 = Σ_{x_i ∈ class 2} (v^T x_i − v^T μ_2)²
q Fisher's LDA
q Maximize J(v) = ( v^T μ_1 − v^T μ_2 )² / (s_1 + s_2)
q Closed-form optimal solution
(Figure: the preferred projection gives a larger distance between projected means and smaller
scatter within each class)
81
Fisher’s LDA: Summary
q Advantages
q Useful for dimension reduction
q Easy to extend to multiple classes

q Fisher's LDA will fail
q When μ_1 = μ_2, J(v) is always 0
q When classes have a large overlap when projected onto any line

82
