
CSCE 970 Lecture 2:
Bayesian-Based Classifiers

Stephen D. Scott

January 16, 2003


Introduction

• A Bayesian classifier assigns an instance to the most probable class

• Given M classes ω1, . . . , ωM and feature vector x, find the conditional probabilities

      P(ωi | x)  ∀ i = 1, . . . , M,

  called a posteriori (posterior) probabilities, and predict with the largest

• Will use training data to estimate the probability density function (pdf) that yields P(ωi | x) and classify to the ωi that maximizes it


Bayesian Decision Theory

• Use ω1 and ω2 only

• Need a priori (prior) probabilities of the classes: P(ω1) and P(ω2)

• Estimate from training data:

      P(ωi) ≈ Ni/N,   Ni = no. of class-ωi examples,   N = N1 + N2

  (will be accurate for sufficiently large N)

• Also need the likelihood of x given class = ωi: p(x | ωi) (is a pdf if x ∈ ℝ^ℓ)

• Now apply Bayes Rule:

      P(ωi | x) = p(x | ωi) P(ωi) / p(x)

  and classify to the ωi that maximizes it


Bayesian Decision Theory
(Cont'd)

• But p(x) is the same for all ωi, so since we want the max:

      If p(x | ω1) P(ω1) > p(x | ω2) P(ω2), classify x as ω1
      If p(x | ω1) P(ω1) < p(x | ω2) P(ω2), classify x as ω2

• If the prior probs. are equal (P(ω1) = P(ω2) = 1/2), then decide based on:

      p(x | ω1) ≷ p(x | ω2)

• Since we can estimate P(ωi), now only need p(x | ωi)
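
The two-class rule above can be turned into a short program. Below is a minimal sketch, assuming hypothetical 1-D Gaussian likelihoods and made-up class counts (none of these numbers come from the lecture):

    import numpy as np
    from scipy.stats import norm

    # Hypothetical 1-D two-class problem: class-conditional densities are
    # Gaussian with assumed (made-up) parameters, priors estimated as Ni/N.
    N1, N2 = 60, 40                      # training counts per class (assumed)
    priors = {1: N1 / (N1 + N2), 2: N2 / (N1 + N2)}
    likelihoods = {1: norm(loc=0.0, scale=1.0),   # p(x | ω1)
                   2: norm(loc=2.0, scale=1.0)}   # p(x | ω2)

    def classify(x):
        """Predict the class with the largest p(x | ωi) P(ωi).

        Dividing by p(x) is unnecessary: it is the same for every class.
        """
        scores = {i: likelihoods[i].pdf(x) * priors[i] for i in (1, 2)}
        return max(scores, key=scores.get)

    print(classify(0.3))   # near the ω1 mean -> 1
    print(classify(1.8))   # near the ω2 mean -> 2
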
Bayesian Decision Theory
Example

• ℓ = 1 feature, P(ω1) = P(ω2), so predict at the dotted line (the threshold x0 where the two class-conditional pdfs cross)

• Total error probability = shaded area:

      Pe = ∫_{−∞}^{x0} p(x | ω2) dx + ∫_{x0}^{+∞} p(x | ω1) dx


Bayesian Decision Theory
Probability of Error

• In general, the error is

      Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)
         = P(x ∈ R2 | ω1) P(ω1) + P(x ∈ R1 | ω2) P(ω2)
         = P(ω1) ∫_{R2} p(x | ω1) dx + P(ω2) ∫_{R1} p(x | ω2) dx
         = ∫_{R2} P(ω1 | x) p(x) dx + ∫_{R1} P(ω2 | x) p(x) dx

• Since R1 and R2 cover the entire space,

      ∫_{R1} P(ω1 | x) p(x) dx + ∫_{R2} P(ω1 | x) p(x) dx = P(ω1)

• Thus

      Pe = P(ω1) − ∫_{R1} (P(ω1 | x) − P(ω2 | x)) p(x) dx,

  which is minimized if

      R1 = { x ∈ ℝ^ℓ : P(ω1 | x) > P(ω2 | x) },

  which is what the Bayesian classifier does!
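
As a quick numeric check of the error formula, the sketch below evaluates Pe for the hypothetical two-Gaussian, equal-prior setup used earlier (all parameters made up):

    from scipy.stats import norm

    # Numeric check of Pe = P(ω1)·∫_{R2} p(x|ω1)dx + P(ω2)·∫_{R1} p(x|ω2)dx for the
    # hypothetical 1-D setup above: equal priors, likelihoods N(0,1) and N(2,1).
    P1 = P2 = 0.5
    f1, f2 = norm(0.0, 1.0), norm(2.0, 1.0)
    x0 = 1.0   # equal priors and variances -> threshold where the two pdfs cross

    Pe = P1 * (1 - f1.cdf(x0)) + P2 * f2.cdf(x0)   # R1 = (-inf, x0), R2 = (x0, +inf)
    print(Pe)                                       # ~0.159 for these made-up numbers
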

Bayesian Decision Theory
M > 2 Classes

• If the number of classes M > 2, then classify according to

      argmax_{ωi} P(ωi | x)

• Proof of optimality still holds


Bayesian Decision Theory
Minimizing Risk

• What if different errors have different penalties, e.g. cancer diagnosis?

  – False negative worse than false positive

• Define λki as the loss (penalty, risk) if we predict ωi when the correct answer is ωk (forms the loss matrix L)

• Can minimize the average loss

      r = Σ_{k=1}^{M} P(ωk) Σ_{i=1}^{M} λki ∫_{Ri} p(x | ωk) dx
        = Σ_{i=1}^{M} ∫_{Ri} [ Σ_{k=1}^{M} λki p(x | ωk) P(ωk) ] dx

  (here ∫_{Ri} p(x | ωk) dx is the probability of the (k, i) error: an x from class ωk falling in Ri)

  by minimizing each integral:

      Ri = { x ∈ ℝ^ℓ : Σ_{k=1}^{M} λki p(x | ωk) P(ωk) < Σ_{k=1}^{M} λkj p(x | ωk) P(ωk)  ∀ j ≠ i }
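
Below is a minimal sketch of this risk-minimizing rule for a hypothetical two-class, 1-D Gaussian problem; the loss matrix values are made up to mimic the "false negative worse than false positive" situation:

    import numpy as np
    from scipy.stats import norm

    # Risk-minimizing classification (assumed 1-D Gaussian likelihoods, made-up
    # numbers).  lam[k, i] is the loss of predicting ω_{i+1} when the truth is ω_{k+1}.
    priors = np.array([0.5, 0.5])
    likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]
    lam = np.array([[0.0, 1.0],     # true ω1: false positive costs 1
                    [5.0, 0.0]])    # true ω2: false negative costs 5 (worse)

    def predict_min_risk(x):
        """Pick the class i minimizing sum_k lam[k, i] p(x | ωk) P(ωk)."""
        weighted = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
        expected_loss = lam.T @ weighted          # one entry per candidate class i
        return int(np.argmin(expected_loss)) + 1  # 1-based class label

    # The costly false negative pushes the decision boundary toward ω1's side.
    print(predict_min_risk(0.8))   # predicts 2 here even though ω1 is more probable
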
Bayesian Decision Theory
Minimizing Risk
Example

• Let M = 2 classes, P(ω1) = P(ω2) = 1/2,

      L = [ 0    λ12 ]
          [ λ21   0  ],

  and λ21 > λ12

• Then

      R2 = { x ∈ ℝ^ℓ : λ21 p(x | ω2) > λ12 p(x | ω1) }
         = { x ∈ ℝ^ℓ : p(x | ω2) > p(x | ω1) λ12/λ21 },

  which slides the threshold to the left of x0 in the earlier ℓ = 1 example, since λ12/λ21 < 1


Discriminant Functions

• Rather than using probabilities (or risk functions) directly, it is sometimes easier to work with a function of them, e.g.

      gi(x) = f(P(ωi | x)),

  where f(·) is a monotonically increasing function; gi(x) is called a discriminant function

• Then Ri = { x ∈ ℝ^ℓ : gi(x) > gj(x) ∀ j ≠ i }

• A common choice of f(·) is the natural logarithm (multiplications become sums)

• Still requires a good estimate of the pdf

  – Will look at a tractable case next

  – In general, cannot necessarily easily estimate the pdf, so use other cost functions (Chapters 3 & 4)
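
As a small check that a monotonically increasing f(·) leaves the decision unchanged, the sketch below (reusing the hypothetical two-Gaussian setup from earlier) compares the argmax of the posterior with the argmax of the log-discriminant ln p(x | ωi) + ln P(ωi):

    import numpy as np
    from scipy.stats import norm

    # argmax_i P(ωi|x) should equal argmax_i [ln p(x|ωi) + ln P(ωi)]
    # for any x; same made-up two-Gaussian setup as above.
    priors = np.array([0.5, 0.5])
    likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]

    for x in (-1.0, 0.9, 1.1, 3.0):
        post = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
        g = np.array([lik.logpdf(x) + np.log(pr) for lik, pr in zip(likelihoods, priors)])
        assert np.argmax(post) == np.argmax(g)
    print("log-discriminant picks the same class at every test point")
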

Normal Distributions

• Assume the pdfs of the likelihood functions follow a normal (Gaussian) distribution for 1 ≤ i ≤ M:

      p(x | ωi) = 1 / ((2π)^{ℓ/2} |Σi|^{1/2}) · exp( −(1/2) (x − µi)^T Σi^{−1} (x − µi) )

  – ℓ = dimension of x

  – µi = E[x | ωi] = mean value of the ωi class

  – |Σi| = determinant of Σi, ωi's covariance matrix:

        Σi = E[ (x − µi)(x − µi)^T ]

  – Assume we know µi and Σi ∀ i

• Using the discriminant function gi(x) = ln(p(x | ωi) P(ωi)), we get:

      gi(x) = −(1/2)(x − µi)^T Σi^{−1}(x − µi) + ln(P(ωi)) − (ℓ/2) ln(2π) − (1/2) ln |Σi|


Normal Distributions
Minimum Distance Classifiers

• If the P(ωi)'s are equal and the Σi's are equal, can use:

      gi(x) = −(1/2)(x − µi)^T Σ^{−1}(x − µi)

• If the features are statistically independent with the same variance, then Σ = σ²I and can instead use

      gi(x) = −(1/2) Σ_{j=1}^{ℓ} (xj − µij)²

• Finding the ωi maximizing this implies finding the µi that minimizes the Euclidean distance to x

  – Constant distance = circle centered at µi

• If Σ is not diagonal (but the Σi's and P(ωi)'s are still equal), then maximizing gi(x) is the same as minimizing the Mahalanobis distance:

      √( (x − µi)^T Σ^{−1} (x − µi) )

  – Constant distance = ellipse centered at µi
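
A minimal sketch of this Gaussian discriminant, using made-up 2-D means, covariances, and priors (the slides assume the true µi and Σi are known):

    import numpy as np

    # Hypothetical 2-D, two-class problem with assumed parameters.
    means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
    covs  = [np.array([[1.0, 0.3], [0.3, 1.0]]),
             np.array([[1.5, 0.0], [0.0, 0.5]])]
    priors = [0.5, 0.5]
    ell = 2  # dimension of x

    def g(x, mu, Sigma, prior):
        """gi(x) = -1/2 (x-µi)^T Σi^{-1} (x-µi) + ln P(ωi) - ℓ/2 ln 2π - 1/2 ln|Σi|"""
        diff = x - mu
        maha_sq = diff @ np.linalg.solve(Sigma, diff)
        return (-0.5 * maha_sq + np.log(prior)
                - 0.5 * ell * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(Sigma)[1])

    def classify(x):
        scores = [g(x, m, S, p) for m, S, p in zip(means, covs, priors)]
        return int(np.argmax(scores)) + 1   # 1-based class label

    print(classify(np.array([0.5, 0.2])))   # -> 1
    print(classify(np.array([2.5, 1.0])))   # -> 2
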
Estimating Unknown pdf's
Maximum Likelihood Parameter Estimation

• If we know the covariance matrix but not the mean for a class ωi, can parameterize ωi's pdf on the mean µ:

      p_i(xk; µ) = 1 / ((2π)^{ℓ/2} |Σ|^{1/2}) · exp( −(1/2)(xk − µ)^T Σ^{−1}(xk − µ) )

  and use data x1, . . . , xN from ωi to estimate µ

• The maximum likelihood (ML) method estimates µ such that the following likelihood function is maximized:

      p(X; µ) = p(x1, . . . , xN; µ) = Π_{k=1}^{N} p(xk; µ)

• Taking the logarithm and setting the gradient = 0:

      ∂/∂µ [ −(N/2) ln((2π)^ℓ |Σ|) − (1/2) Σ_{k=1}^{N} (xk − µ)^T Σ^{−1}(xk − µ) ] = 0

  (call the bracketed quantity L)


Estimating Unknown pdf's
ML Param Est (cont'd)

• Assuming statistical independence of the xki's, Σ^{−1}_{ij} = 0 for i ≠ j, so

      ∂L/∂µ = ( ∂L/∂µ1, . . . , ∂L/∂µℓ )^T
            = ∂/∂µ [ −(1/2) Σ_{k=1}^{N} Σ_{j=1}^{ℓ} (xkj − µj)² Σ^{−1}_{jj} ]
            = Σ_{k=1}^{N} Σ^{−1}(xk − µ) = 0,

  yielding

      µ̂_ML = (1/N) Σ_{k=1}^{N} xk

• Solve the above for each class independently

• Can generalize the technique to other distributions and parameters

• Has many nice properties (p. 30) as N → ∞
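
A quick numeric illustration of the ML result on made-up data: with the covariance known, µ̂_ML is simply the sample mean.

    import numpy as np

    # Made-up Gaussian data for one class; the ML estimate of the mean is the
    # sample mean, which approaches the true mean as N grows.
    rng = np.random.default_rng(0)
    true_mu = np.array([1.0, -2.0])
    Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
    X = rng.multivariate_normal(true_mu, Sigma, size=1000)   # x1, ..., xN

    mu_ml = X.mean(axis=0)      # µ̂_ML = (1/N) Σ_k xk
    print(mu_ml)                # close to [1.0, -2.0] for this N
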

Estimating Unknown pdf's
Maximum A Posteriori Parameter Estimation

• If µ is normally distributed with covariance σµ²I and mean µ0:

      p(µ) = 1 / ((2π)^{ℓ/2} σµ^ℓ) · exp( −(µ − µ0)^T(µ − µ0) / (2σµ²) )

• Maximizing p(µ | X) is the same as maximizing

      p(µ) p(X | µ) = Π_{k=1}^{N} p(xk | µ) p(µ)

• Again, take the log and set the gradient = 0 (with Σ = σ²I):

      Σ_{k=1}^{N} (1/σ²)(xk − µ) − (1/σµ²)(µ − µ0) = 0,

  so

      µ̂_MAP = ( µ0 + (σµ²/σ²) Σ_{k=1}^{N} xk ) / ( 1 + (σµ²/σ²) N )

• µ̂_MAP ≈ µ̂_ML if p(µ) is almost uniform (σµ² ≫ σ²) or N → ∞ (Fig. 2.7)


Estimating Unknown pdf's
(Nonparametric Approach)
Parzen Windows

• Histogram-based technique to approximate a pdf: partition the space into "bins" and count the number of training vectors per bin

  (Figure: staircase histogram estimate of p(x) plotted against x)

• Let φ(x) = 1 if |xj| ≤ 1/2 ∀ j = 1, . . . , ℓ, and 0 otherwise

• Now approximate the pdf p(x) with

      p̂(x) = (1/h^ℓ) [ (1/N) Σ_{i=1}^{N} φ( (xi − x)/h ) ],

  where h is the bin size and φ((xi − x)/h) indicates whether xi falls in the bin centered at x
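
A minimal sketch of the MAP estimate above, with Σ = σ²I and a Gaussian prior on µ (all numbers made up); it also shows the µ̂_MAP ≈ µ̂_ML behavior when σµ² ≫ σ²:

    import numpy as np

    # Made-up 2-D data with known data variance σ² and a broad Gaussian prior on µ.
    rng = np.random.default_rng(1)
    sigma2, sigma_mu2 = 1.0, 10.0          # data variance and prior variance
    mu0 = np.array([0.0, 0.0])             # prior mean
    X = rng.normal(loc=[1.0, -2.0], scale=np.sqrt(sigma2), size=(20, 2))
    N = len(X)

    ratio = sigma_mu2 / sigma2
    mu_map = (mu0 + ratio * X.sum(axis=0)) / (1 + ratio * N)   # MAP formula above
    mu_ml = X.mean(axis=0)
    print(mu_map, mu_ml)   # nearly equal here because σµ² >> σ² (nearly flat prior)
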
Estimating Unknown pdf's
Parzen Windows
(cont'd)

      p̂(x) = (1/h^ℓ) [ (1/N) Σ_{i=1}^{N} φ( (xi − x)/h ) ]

• I.e., given x, to compute p̂(x):

  – Count the number of training vectors in the size-h (per side) hypercube H centered at x

  – Divide by N to estimate the probability of getting a point in H

  – Divide by the volume of H

• Problem: approximating the continuous function p(x) with a discontinuous p̂(x)

• Solution: substitute a smooth function for φ(·), e.g. φ(x) = 1/(2π)^{ℓ/2} · exp(−x^T x / 2) = N(0, 1) for each dimension independently


Estimating Unknown pdf's
Parzen Windows
Numeric Example
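
Below is a small illustrative computation of p̂(x), assuming made-up 1-D training points and evaluating both the hypercube kernel φ and the smooth Gaussian kernel:

    import numpy as np

    # Parzen estimate p̂(x) with made-up 1-D data (ℓ = 1), for both kernels.
    X = np.array([0.1, 0.4, 0.5, 0.9, 1.1, 1.2, 2.0])   # training points (assumed)
    h, ell = 0.5, 1

    def phi_box(u):
        return np.all(np.abs(u) <= 0.5) * 1.0            # 1 inside the unit hypercube

    def phi_gauss(u):
        return np.exp(-0.5 * np.dot(u, u)) / (2 * np.pi) ** (ell / 2)

    def parzen(x, phi):
        return (1 / h**ell) * np.mean([phi((xi - x) / h) for xi in X])

    for x in (0.5, 1.0, 3.0):
        print(x, parzen(x, phi_box), parzen(x, phi_gauss))
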

k-Nearest Neighbor Techniques

• Classify an unlabeled feature vector x according to a majority vote of its k nearest neighbors

  (Figure: k = 3 example using Euclidean distance; most of the unclassified point's nearest neighbors are Class B, so predict B)

• As N → ∞,

  – 1-NN error is at most twice the Bayes optimal error (PB)

  – k-NN error is ≤ PB + 1/√(ke)

• Can also weight votes by relative distance

• Complexity issues: research into more efficient algorithms, approximation algorithms
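
A minimal sketch of k-NN majority voting with Euclidean distance, on made-up 2-D training data:

    import numpy as np

    # Hypothetical training points and labels for two classes.
    X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                        [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
    y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

    def knn_predict(x, k=3):
        """Majority vote among the k training points nearest to x."""
        dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distances
        nearest = y_train[np.argsort(dists)[:k]]          # labels of the k nearest
        labels, counts = np.unique(nearest, return_counts=True)
        return labels[np.argmax(counts)]

    print(knn_predict(np.array([0.8, 0.8])))   # -> 'B'
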
