Pattern
Classification
All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John Wiley
& Sons, 2000
with the permission of the authors and the
publisher
Chapter 2 :
Bayesian Decision Theory
(Sections 1-6)
1. Introduction – Bayesian Decision Theory
• Pure statistics, probabilities known, optimal decision
2. Bayesian Decision Theory–Continuous Features
3. Minimum-Error-Rate Classification (omit)
4. Classifiers, Discriminant Functions, Decision Surfaces
5. The Normal Density
Bayesian Decision Theory
• Design classifiers to make decisions that minimize an expected "risk".
– The simplest risk is the classification error.
– When misclassification errors are not equally
important, the risk can include the cost associated
with different misclassification errors.
Terminology
• State of nature ω (class label):
– e.g., ω1 for sea bass, ω2 for salmon
• Probabilities P(ω1) and P(ω2) (priors):
– e.g., prior knowledge of how likely it is to get a sea bass
or a salmon
• Probability density function p(x) (evidence):
– e.g., how frequently we will measure a pattern with
feature value x (e.g., x corresponds to lightness)
Terminology (cont’d)
• Conditional probability density p(x | ωj) (likelihood):
– e.g., how frequently we will measure a pattern with
feature value x given that the pattern belongs to class ωj
e.g., lightness distributions
between salmon/sea-bass
populations
Terminology (cont’d)
• Conditional probability P(ωj | x) (posterior):
– e.g., the probability that the fish belongs to class
ωj given feature x.
Decision Rule Using
Conditional Probabilities
• Using Bayes’ rule:
P(ωj | x) = p(x | ωj) P(ωj) / p(x)    (posterior = likelihood × prior / evidence)
where p(x) = Σj=1..2 p(x | ωj) P(ωj)   (i.e., a scale factor so that the posteriors sum to 1)
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
or
Decide ω1 if p(x | ω1)P(ω1) > p(x | ω2)P(ω2); otherwise decide ω2
or
Decide ω1 if p(x | ω1)/p(x | ω2) > P(ω2)/P(ω1); otherwise decide ω2
(likelihood ratio on the left, threshold on the right)
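A minimal Python sketch of this decision rule for a single feature value (lightness); the Gaussian class-conditional densities and the priors below are illustrative assumptions, not values from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, used here as an assumed likelihood p(x | w)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide(x, priors=(0.5, 0.5)):
    """Return 1 or 2: the class whose posterior P(w_j | x) is larger."""
    # Illustrative likelihoods: lightness of sea bass (class 1) vs. salmon (class 2).
    likelihoods = (gaussian_pdf(x, mu=7.0, sigma=1.0),   # p(x | w1), assumed
                   gaussian_pdf(x, mu=4.0, sigma=1.0))   # p(x | w2), assumed
    evidence = sum(l * p for l, p in zip(likelihoods, priors))        # p(x)
    posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
    return 1 if posteriors[0] > posteriors[1] else 2

print(decide(6.2))   # -> 1 (closer to the assumed sea-bass lightness distribution)
```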
1. Introduction
• The sea bass/salmon example
• State of nature, prior
• State of nature is a random variable
• The catch of salmon and sea bass is equiprobable
• P(ω1) = P(ω2) (uniform priors)
• P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• Decision rule with only the prior information
• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
• Use of the class-conditional information
• p(x | ω1) and p(x | ω2) describe the difference in
lightness between the sea bass and salmon populations
• Posterior, likelihood, evidence
• P(ωj | x) = P(x | ωj) P(ωj) / P(x)    (Bayes rule)
• where, in the case of two categories, P(x) = Σj=1..2 P(x | ωj) P(ωj)
• Posterior = (Likelihood × Prior) / Evidence
• Decision given the posterior probabilities
x is an observation for which:
if P(ω1 | x) > P(ω2 | x), decide the true state of nature is ω1
if P(ω1 | x) < P(ω2 | x), decide the true state of nature is ω2
2. Bayesian Decision Theory –
Continuous Features
• Generalization of the preceding ideas
1. Use of more than one feature (feature vector x)
2. Use of more than two states of nature (c classes)
3. Allowing actions other than deciding the state of nature
(skip)
4. Introduce a loss function which is more general than
the probability of error (skip)
4. Classifiers, Discriminant Functions,
and Decision Surfaces
• The multi-category case
• Set of discriminant functions gi(x), i = 1, …, c
• The classifier assigns a feature vector x to class ωi if:
gi(x) > gj(x) for all j ≠ i
• Discriminant function
gi(x) = P(ωi | x)
(maximum discriminant corresponds to maximum posterior!)
gi(x) = P(x | ωi) P(ωi)
gi(x) = ln P(x | ωi) + ln P(ωi)
(ln: natural logarithm!)
• All three forms give the same decision, since the evidence p(x) is the same for every class and ln is monotonically increasing.
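A small Python sketch of a multi-category classifier built from the last (log) form; the univariate Gaussian class-conditional densities, means, and priors are illustrative assumptions:

```python
import math

def log_gaussian(x, mu, sigma):
    """ln p(x | w_i) for an assumed univariate normal class-conditional density."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def classify(x, params, priors):
    """Return the index i that maximizes g_i(x) = ln p(x | w_i) + ln P(w_i)."""
    scores = [log_gaussian(x, mu, sigma) + math.log(p)
              for (mu, sigma), p in zip(params, priors)]
    return max(range(len(scores)), key=scores.__getitem__)

# Three hypothetical classes with different means and priors.
params = [(2.0, 1.0), (5.0, 1.0), (9.0, 2.0)]
priors = [0.5, 0.3, 0.2]
print(classify(4.0, params, priors))   # -> 1 (the second class wins here)
```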
• Feature space divided into c decision regions
if gi(x) > gj(x) for all j ≠ i, then x is in Ri
(Ri means assign x to ωi)
• The two-category case
• A classifier is a “dichotomizer” that has two discriminant
functions g1 and g2
Let g(x) = g1(x) – g2(x)
Decide ω1 if g(x) > 0; otherwise decide ω2
• The computation of g(x)
g(x) = P(ω1 | x) – P(ω2 | x)
or, equivalently,
g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
5. The Normal Density
• Univariate density
• Density which is analytically tractable
• Continuous density
• Many processes are asymptotically Gaussian
• Handwritten characters and speech sounds can be viewed as an ideal
or prototype pattern corrupted by a random process (central limit theorem)
p(x) = [1 / (√(2π) σ)] exp[ −(1/2) ((x − μ)/σ)² ]
where:
μ = mean (or expected value) of x
σ² = variance (the expected squared deviation from the mean)
• Multivariate normal density p(x) ~ N(μ, Σ)
• The multivariate normal density in d dimensions is:
p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −(1/2) (x − μ)ᵗ Σ⁻¹ (x − μ) ]
where:
x = (x1, x2, …, xd)ᵗ (t stands for the transpose)
μ = (μ1, μ2, …, μd)ᵗ is the mean vector
Σ = d×d covariance matrix
|Σ| and Σ⁻¹ are its determinant and inverse, respectively
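A small NumPy sketch that evaluates this density directly from the formula above; the mean vector and covariance matrix are illustrative assumptions:

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """Multivariate normal density p(x) ~ N(mu, Sigma) in d dimensions."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    mahal = diff @ np.linalg.inv(sigma) @ diff       # (x - mu)^t Sigma^-1 (x - mu)
    return norm * np.exp(-0.5 * mahal)

mu = np.array([0.0, 0.0])                            # illustrative mean vector
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                       # illustrative covariance matrix
print(mvn_pdf(np.array([0.5, -0.5]), mu, sigma))
```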
Properties of the Normal Density
Covariance Matrix Σ
• Always symmetric
• Always positive semi-definite, but for our purposes Σ is positive definite:
– its determinant is positive
– it is invertible
• Its eigenvalues are real and positive, and the principal axes of the
hyperellipsoidal loci of points of constant density are the eigenvectors of Σ
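A quick NumPy sketch that checks these properties on an illustrative covariance matrix:

```python
import numpy as np

sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                 # illustrative covariance matrix

print(np.allclose(sigma, sigma.T))             # symmetric -> True
print(np.linalg.det(sigma) > 0)                # determinant positive -> invertible
eigvals, eigvecs = np.linalg.eigh(sigma)       # eigh is meant for symmetric matrices
print(eigvals)                                 # real and positive -> positive definite
# The columns of eigvecs are the principal axes of the constant-density hyperellipsoids.
```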
6. Discriminant Functions for the
Normal Density
• We saw that the minimum error-rate classification
can be achieved by the discriminant function
gi(x) = ln p(x | ωi) + ln P(ωi)
• Case of a multivariate normal p(x | ωi) ~ N(μi, Σi):
gi(x) = −(1/2)(x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
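A NumPy sketch of this general discriminant; the per-class means, covariances, and priors below are illustrative assumptions:

```python
import numpy as np

def g(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu_i)^t Sigma_i^-1 (x-mu_i) - d/2 ln 2pi - 1/2 ln|Sigma_i| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Two hypothetical classes; decide the one whose discriminant is larger.
x = np.array([1.0, 2.0])
g1 = g(x, np.array([0.0, 0.0]), np.eye(2), 0.6)
g2 = g(x, np.array([3.0, 3.0]), 2.0 * np.eye(2), 0.4)
print(1 if g1 > g2 else 2)
```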
1. Case Σi = σ²I (I stands for the identity matrix)
gi(x) = wiᵗ x + wi0    (linear discriminant function)
where:
wi = μi / σ²;    wi0 = −(1/(2σ²)) μiᵗ μi + ln P(ωi)
(wi0 is called the threshold for the i-th category!)
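A NumPy sketch of this linear discriminant; σ², the class means, and the priors are illustrative assumptions:

```python
import numpy as np

def linear_discriminant(mu, sigma2, prior):
    """Return (w_i, w_i0) for the case Sigma_i = sigma^2 I."""
    w = mu / sigma2                                   # w_i = mu_i / sigma^2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)    # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
    return w, w0

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # assumed class means
priors = [0.5, 0.5]
sigma2 = 1.0

x = np.array([1.0, 1.0])
scores = [w @ x + w0
          for w, w0 in (linear_discriminant(m, sigma2, p) for m, p in zip(mus, priors))]
print(1 if scores[0] > scores[1] else 2)   # with equal priors this is the nearest-mean rule
```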
• A classifier that uses linear discriminant functions is
called “a linear machine”
• The decision surfaces for a linear machine are
pieces of hyperplanes defined by:
gi(x) = gj(x)
• If the priors are equal for all classes, this reduces to the
minimum-distance classifier: an unknown pattern is assigned
to the class of the nearest mean
• The hyperplane separating Ri and Rj passes through the point
x0 = (1/2)(μi + μj) − [σ² / ||μi − μj||²] ln [P(ωi)/P(ωj)] (μi − μj)
and is always orthogonal to the line linking the means!
if P(ωi) = P(ωj), then x0 = (1/2)(μi + μj)
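A short NumPy sketch computing x0 for two illustrative means and unequal priors:

```python
import numpy as np

mu_i, mu_j = np.array([2.0, 0.0]), np.array([0.0, 0.0])   # assumed class means
sigma2 = 1.0                                               # assumed common variance
p_i, p_j = 0.7, 0.3                                        # assumed priors

diff = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - (sigma2 / (diff @ diff)) * np.log(p_i / p_j) * diff
print(x0)   # shifted toward mu_j, the less probable class, enlarging region R_i
# The separating hyperplane is { x : diff @ (x - x0) == 0 }, orthogonal to mu_i - mu_j.
```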
2. Case Σi = Σ (the covariance matrices of all classes are identical
but otherwise arbitrary!)
• Hyperplane separating Ri and Rj passes through
x0 = (1/2)(μi + μj) − [ln (P(ωi)/P(ωj)) / ((μi − μj)ᵗ Σ⁻¹ (μi − μj))] (μi − μj)
(the hyperplane separating Ri and Rj is generally not
orthogonal to the line between the means!)
• If the priors are equal, classification reduces to minimizing the squared
Mahalanobis distance (x − μi)ᵗ Σ⁻¹ (x − μi)
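A NumPy sketch of this case, classifying with the discriminant that drops all terms independent of i (squared Mahalanobis distance plus log-prior); the shared covariance, means, and priors are illustrative assumptions:

```python
import numpy as np

def g(x, mu, sigma_inv, prior):
    """Case Sigma_i = Sigma: g_i(x) = -1/2 (x-mu_i)^t Sigma^-1 (x-mu_i) + ln P(w_i)."""
    diff = x - mu
    return -0.5 * diff @ sigma_inv @ diff + np.log(prior)

sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])                       # shared covariance (assumed)
sigma_inv = np.linalg.inv(sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]   # assumed class means
priors = [0.5, 0.5]

x = np.array([1.2, 0.8])
scores = [g(x, m, sigma_inv, p) for m, p in zip(mus, priors)]
print(1 if scores[0] > scores[1] else 2)   # equal priors: smallest Mahalanobis distance wins
```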
3. Case Σi = arbitrary
• The covariance matrices are different for each category
gi(x) = xᵗ Wi x + wiᵗ x + wi0
where:
Wi = −(1/2) Σi⁻¹
wi = Σi⁻¹ μi
wi0 = −(1/2) μiᵗ Σi⁻¹ μi − (1/2) ln |Σi| + ln P(ωi)
(the decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes,
hyperspheres, hyperellipsoids, hyperparaboloids,
hyperhyperboloids)
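A NumPy sketch of this quadratic discriminant; the per-class means, covariances, and priors are illustrative assumptions:

```python
import numpy as np

def quad_discriminant(mu, sigma, prior):
    """Return (W_i, w_i, w_i0) for an arbitrary class covariance Sigma_i."""
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                              # W_i = -1/2 Sigma_i^-1
    w = sigma_inv @ mu                                # w_i = Sigma_i^-1 mu_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))                            # w_i0
    return W, w, w0

def g(x, W, w, w0):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0."""
    return x @ W @ x + w @ x + w0

# Two hypothetical classes with different covariances -> a hyperquadric boundary.
params = [quad_discriminant(np.array([0.0, 0.0]), np.eye(2), 0.5),
          quad_discriminant(np.array([2.0, 2.0]), np.diag([3.0, 0.5]), 0.5)]
x = np.array([1.0, 1.5])
scores = [g(x, *p) for p in params]
print(1 if scores[0] > scores[1] else 2)
```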