
CSCE 970 Lecture 2:
Bayesian-Based Classifiers

Stephen D. Scott

January 16, 2003


Introduction

• A Bayesian classifier assigns an instance to the most probable class

• Given M classes ω1, . . . , ωM and feature vector x, find the conditional probabilities

      P(ωi | x)  ∀ i = 1, . . . , M,

  called a posteriori (posterior) probabilities, and predict with the largest

• Will use training data to estimate the probability density function (pdf) that yields P(ωi | x) and classify to the ωi that maximizes it


Bayesian Decision Theory

• Use ω1 and ω2 only

• Need a priori (prior) probabilities of the classes: P(ω1) and P(ω2)

• Estimate from training data:

      P(ωi) ≈ Ni/N,   Ni = no. of class-ωi examples,   N = N1 + N2

  (will be accurate for sufficiently large N)

• Also need the likelihood of x given class = ωi: p(x | ωi) (is a pdf if x ∈ ℝ^ℓ)

• Now apply Bayes Rule:

      P(ωi | x) = p(x | ωi) P(ωi) / p(x)

  and classify to the ωi that maximizes it


Bayesian Decision Theory
(Cont'd)

• But p(x) is the same for all ωi, so since we want the max:

      If p(x | ω1) P(ω1) > p(x | ω2) P(ω2), classify x as ω1
      If p(x | ω1) P(ω1) < p(x | ω2) P(ω2), classify x as ω2

• If the prior probs. are equal (P(ω1) = P(ω2) = 1/2), then decide based on:

      p(x | ω1) ≷ p(x | ω2)

• Since we can estimate P(ωi), now only need p(x | ωi)
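
The two-class rule above can be turned into a short program. Below is a minimal sketch, assuming hypothetical 1-D Gaussian likelihoods and made-up class counts (none of these numbers come from the lecture):

    import numpy as np
    from scipy.stats import norm

    # Hypothetical 1-D two-class problem: class-conditional densities are
    # Gaussian with assumed (made-up) parameters, priors estimated as Ni/N.
    N1, N2 = 60, 40                      # training counts per class (assumed)
    priors = {1: N1 / (N1 + N2), 2: N2 / (N1 + N2)}
    likelihoods = {1: norm(loc=0.0, scale=1.0),   # p(x | ω1)
                   2: norm(loc=2.0, scale=1.0)}   # p(x | ω2)

    def classify(x):
        """Predict the class with the largest p(x | ωi) P(ωi).

        Dividing by p(x) is unnecessary: it is the same for every class.
        """
        scores = {i: likelihoods[i].pdf(x) * priors[i] for i in (1, 2)}
        return max(scores, key=scores.get)

    print(classify(0.3))   # near the ω1 mean -> 1
    print(classify(1.8))   # near the ω2 mean -> 2
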
Bayesian Decision Theory
Example

• ℓ = 1 feature, P(ω1) = P(ω2), so predict at the dotted line (the threshold x0 where the two class-conditional pdfs cross)

• Total error probability = shaded area:

      Pe = ∫_{−∞}^{x0} p(x | ω2) dx + ∫_{x0}^{+∞} p(x | ω1) dx


Bayesian Decision Theory
Probability of Error

• In general, the error is

      Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)
         = P(x ∈ R2 | ω1) P(ω1) + P(x ∈ R1 | ω2) P(ω2)
         = P(ω1) ∫_{R2} p(x | ω1) dx + P(ω2) ∫_{R1} p(x | ω2) dx
         = ∫_{R2} P(ω1 | x) p(x) dx + ∫_{R1} P(ω2 | x) p(x) dx

• Since R1 and R2 cover the entire space,

      ∫_{R1} P(ω1 | x) p(x) dx + ∫_{R2} P(ω1 | x) p(x) dx = P(ω1)

• Thus

      Pe = P(ω1) − ∫_{R1} (P(ω1 | x) − P(ω2 | x)) p(x) dx,

  which is minimized if

      R1 = { x ∈ ℝ^ℓ : P(ω1 | x) > P(ω2 | x) },

  which is what the Bayesian classifier does!
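
As a quick numeric check of the error formula, the sketch below evaluates Pe for the hypothetical two-Gaussian, equal-prior setup used earlier (all parameters made up):

    from scipy.stats import norm

    # Numeric check of Pe = P(ω1)·∫_{R2} p(x|ω1)dx + P(ω2)·∫_{R1} p(x|ω2)dx for the
    # hypothetical 1-D setup above: equal priors, likelihoods N(0,1) and N(2,1).
    P1 = P2 = 0.5
    f1, f2 = norm(0.0, 1.0), norm(2.0, 1.0)
    x0 = 1.0   # equal priors and variances -> threshold where the two pdfs cross

    Pe = P1 * (1 - f1.cdf(x0)) + P2 * f2.cdf(x0)   # R1 = (-inf, x0), R2 = (x0, +inf)
    print(Pe)                                       # ~0.159 for these made-up numbers
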

Bayesian Decision Theory
M > 2 Classes

• If the number of classes M > 2, then classify according to

      argmax_{ωi} P(ωi | x)

• Proof of optimality still holds


Bayesian Decision Theory
Minimizing Risk

• What if different errors have different penalties, e.g. cancer diagnosis?

  – False negative worse than false positive

• Define λki as the loss (penalty, risk) if we predict ωi when the correct answer is ωk (forms the loss matrix L)

• Can minimize the average loss

      r = Σ_{k=1}^{M} P(ωk) Σ_{i=1}^{M} λki ∫_{Ri} p(x | ωk) dx
        = Σ_{i=1}^{M} ∫_{Ri} [ Σ_{k=1}^{M} λki p(x | ωk) P(ωk) ] dx

  (here ∫_{Ri} p(x | ωk) dx is the probability of the (k, i) error: an x from class ωk falling in Ri)

  by minimizing each integral:

      Ri = { x ∈ ℝ^ℓ : Σ_{k=1}^{M} λki p(x | ωk) P(ωk) < Σ_{k=1}^{M} λkj p(x | ωk) P(ωk)  ∀ j ≠ i }
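
Below is a minimal sketch of this risk-minimizing rule for a hypothetical two-class, 1-D Gaussian problem; the loss matrix values are made up to mimic the "false negative worse than false positive" situation:

    import numpy as np
    from scipy.stats import norm

    # Risk-minimizing classification (assumed 1-D Gaussian likelihoods, made-up
    # numbers).  lam[k, i] is the loss of predicting ω_{i+1} when the truth is ω_{k+1}.
    priors = np.array([0.5, 0.5])
    likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]
    lam = np.array([[0.0, 1.0],     # true ω1: false positive costs 1
                    [5.0, 0.0]])    # true ω2: false negative costs 5 (worse)

    def predict_min_risk(x):
        """Pick the class i minimizing sum_k lam[k, i] p(x | ωk) P(ωk)."""
        weighted = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
        expected_loss = lam.T @ weighted          # one entry per candidate class i
        return int(np.argmin(expected_loss)) + 1  # 1-based class label

    # The costly false negative pushes the decision boundary toward ω1's side.
    print(predict_min_risk(0.8))   # predicts 2 here even though ω1 is more probable
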
Bayesian Decision Theory
Minimizing Risk
Example

• Let M = 2 classes, P(ω1) = P(ω2) = 1/2,

      L = [ 0    λ12 ]
          [ λ21   0  ],

  and λ21 > λ12

• Then

      R2 = { x ∈ ℝ^ℓ : λ21 p(x | ω2) > λ12 p(x | ω1) }
         = { x ∈ ℝ^ℓ : p(x | ω2) > p(x | ω1) λ12/λ21 },

  which slides the threshold to the left of x0 in the earlier ℓ = 1 example, since λ12/λ21 < 1


Discriminant Functions

• Rather than using probabilities (or risk functions) directly, it is sometimes easier to work with a function of them, e.g.

      gi(x) = f(P(ωi | x)),

  where f(·) is a monotonically increasing function; gi(x) is called a discriminant function

• Then Ri = { x ∈ ℝ^ℓ : gi(x) > gj(x) ∀ j ≠ i }

• A common choice of f(·) is the natural logarithm (multiplications become sums)

• Still requires a good estimate of the pdf

  – Will look at a tractable case next

  – In general, cannot necessarily easily estimate the pdf, so use other cost functions (Chapters 3 & 4)
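
As a small check that a monotonically increasing f(·) leaves the decision unchanged, the sketch below (reusing the hypothetical two-Gaussian setup from earlier) compares the argmax of the posterior with the argmax of the log-discriminant ln p(x | ωi) + ln P(ωi):

    import numpy as np
    from scipy.stats import norm

    # argmax_i P(ωi|x) should equal argmax_i [ln p(x|ωi) + ln P(ωi)]
    # for any x; same made-up two-Gaussian setup as above.
    priors = np.array([0.5, 0.5])
    likelihoods = [norm(0.0, 1.0), norm(2.0, 1.0)]

    for x in (-1.0, 0.9, 1.1, 3.0):
        post = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
        g = np.array([lik.logpdf(x) + np.log(pr) for lik, pr in zip(likelihoods, priors)])
        assert np.argmax(post) == np.argmax(g)
    print("log-discriminant picks the same class at every test point")
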

Normal Distributions

• Assume the pdfs of the likelihood functions follow a normal (Gaussian) distribution for 1 ≤ i ≤ M:

      p(x | ωi) = 1 / ((2π)^{ℓ/2} |Σi|^{1/2}) · exp( −(1/2) (x − µi)^T Σi^{−1} (x − µi) )

  – ℓ = dimension of x

  – µi = E[x | ωi] = mean value of the ωi class

  – |Σi| = determinant of Σi, ωi's covariance matrix:

        Σi = E[ (x − µi)(x − µi)^T ]

  – Assume we know µi and Σi ∀ i

• Using the discriminant function gi(x) = ln(p(x | ωi) P(ωi)), we get:

      gi(x) = −(1/2)(x − µi)^T Σi^{−1}(x − µi) + ln(P(ωi)) − (ℓ/2) ln(2π) − (1/2) ln |Σi|


Normal Distributions
Minimum Distance Classifiers

• If the P(ωi)'s are equal and the Σi's are equal, can use:

      gi(x) = −(1/2)(x − µi)^T Σ^{−1}(x − µi)

• If the features are statistically independent with the same variance, then Σ = σ²I and can instead use

      gi(x) = −(1/2) Σ_{j=1}^{ℓ} (xj − µij)²

• Finding the ωi maximizing this implies finding the µi that minimizes the Euclidean distance to x

  – Constant distance = circle centered at µi

• If Σ is not diagonal (but the Σi's and P(ωi)'s are still equal), then maximizing gi(x) is the same as minimizing the Mahalanobis distance:

      √( (x − µi)^T Σ^{−1} (x − µi) )

  – Constant distance = ellipse centered at µi
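
A minimal sketch of this Gaussian discriminant, using made-up 2-D means, covariances, and priors (the slides assume the true µi and Σi are known):

    import numpy as np

    # Hypothetical 2-D, two-class problem with assumed parameters.
    means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
    covs  = [np.array([[1.0, 0.3], [0.3, 1.0]]),
             np.array([[1.5, 0.0], [0.0, 0.5]])]
    priors = [0.5, 0.5]
    ell = 2  # dimension of x

    def g(x, mu, Sigma, prior):
        """gi(x) = -1/2 (x-µi)^T Σi^{-1} (x-µi) + ln P(ωi) - ℓ/2 ln 2π - 1/2 ln|Σi|"""
        diff = x - mu
        maha_sq = diff @ np.linalg.solve(Sigma, diff)
        return (-0.5 * maha_sq + np.log(prior)
                - 0.5 * ell * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(Sigma)[1])

    def classify(x):
        scores = [g(x, m, S, p) for m, S, p in zip(means, covs, priors)]
        return int(np.argmax(scores)) + 1   # 1-based class label

    print(classify(np.array([0.5, 0.2])))   # -> 1
    print(classify(np.array([2.5, 1.0])))   # -> 2
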
Estimating Unknown pdf's
Maximum Likelihood Parameter Estimation

• If we know the covariance matrix but not the mean for a class ωi, can parameterize ωi's pdf on the mean µ:

      p_i(xk; µ) = 1 / ((2π)^{ℓ/2} |Σ|^{1/2}) · exp( −(1/2)(xk − µ)^T Σ^{−1}(xk − µ) )

  and use data x1, . . . , xN from ωi to estimate µ

• The maximum likelihood (ML) method estimates µ such that the following likelihood function is maximized:

      p(X; µ) = p(x1, . . . , xN; µ) = Π_{k=1}^{N} p(xk; µ)

• Taking the logarithm and setting the gradient = 0:

      ∂/∂µ [ −(N/2) ln((2π)^ℓ |Σ|) − (1/2) Σ_{k=1}^{N} (xk − µ)^T Σ^{−1}(xk − µ) ] = 0

  (call the bracketed quantity L)


Estimating Unknown pdf's
ML Param Est (cont'd)

• Assuming statistical independence of the xki's, Σ^{−1}_{ij} = 0 for i ≠ j, so

      ∂L/∂µ = ( ∂L/∂µ1, . . . , ∂L/∂µℓ )^T
            = ∂/∂µ [ −(1/2) Σ_{k=1}^{N} Σ_{j=1}^{ℓ} (xkj − µj)² Σ^{−1}_{jj} ]
            = Σ_{k=1}^{N} Σ^{−1}(xk − µ) = 0,

  yielding

      µ̂_ML = (1/N) Σ_{k=1}^{N} xk

• Solve the above for each class independently

• Can generalize the technique to other distributions and parameters

• Has many nice properties (p. 30) as N → ∞
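
A quick numeric illustration of the ML result on made-up data: with the covariance known, µ̂_ML is simply the sample mean.

    import numpy as np

    # Made-up Gaussian data for one class; the ML estimate of the mean is the
    # sample mean, which approaches the true mean as N grows.
    rng = np.random.default_rng(0)
    true_mu = np.array([1.0, -2.0])
    Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
    X = rng.multivariate_normal(true_mu, Sigma, size=1000)   # x1, ..., xN

    mu_ml = X.mean(axis=0)      # µ̂_ML = (1/N) Σ_k xk
    print(mu_ml)                # close to [1.0, -2.0] for this N
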

Estimating Unknown pdf's
Maximum A Posteriori Parameter Estimation

• If µ is normally distributed with covariance σµ²I and mean µ0:

      p(µ) = 1 / ((2π)^{ℓ/2} σµ^ℓ) · exp( −(µ − µ0)^T(µ − µ0) / (2σµ²) )

• Maximizing p(µ | X) is the same as maximizing

      p(µ) p(X | µ) = Π_{k=1}^{N} p(xk | µ) p(µ)

• Again, take the log and set the gradient = 0 (with Σ = σ²I):

      Σ_{k=1}^{N} (1/σ²)(xk − µ) − (1/σµ²)(µ − µ0) = 0,

  so

      µ̂_MAP = ( µ0 + (σµ²/σ²) Σ_{k=1}^{N} xk ) / ( 1 + (σµ²/σ²) N )

• µ̂_MAP ≈ µ̂_ML if p(µ) is almost uniform (σµ² ≫ σ²) or N → ∞ (Fig. 2.7)


Estimating Unknown pdf's
(Nonparametric Approach)
Parzen Windows

• Histogram-based technique to approximate a pdf: partition the space into "bins" and count the number of training vectors per bin

  (Figure: staircase histogram estimate of p(x) plotted against x)

• Let φ(x) = 1 if |xj| ≤ 1/2 ∀ j = 1, . . . , ℓ, and 0 otherwise

• Now approximate the pdf p(x) with

      p̂(x) = (1/h^ℓ) [ (1/N) Σ_{i=1}^{N} φ( (xi − x)/h ) ],

  where h is the bin size and φ((xi − x)/h) indicates whether xi falls in the bin centered at x
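
A minimal sketch of the MAP estimate above, with Σ = σ²I and a Gaussian prior on µ (all numbers made up); it also shows the µ̂_MAP ≈ µ̂_ML behavior when σµ² ≫ σ²:

    import numpy as np

    # Made-up 2-D data with known data variance σ² and a broad Gaussian prior on µ.
    rng = np.random.default_rng(1)
    sigma2, sigma_mu2 = 1.0, 10.0          # data variance and prior variance
    mu0 = np.array([0.0, 0.0])             # prior mean
    X = rng.normal(loc=[1.0, -2.0], scale=np.sqrt(sigma2), size=(20, 2))
    N = len(X)

    ratio = sigma_mu2 / sigma2
    mu_map = (mu0 + ratio * X.sum(axis=0)) / (1 + ratio * N)   # MAP formula above
    mu_ml = X.mean(axis=0)
    print(mu_map, mu_ml)   # nearly equal here because σµ² >> σ² (nearly flat prior)
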
Estimating Unknown pdf's
Parzen Windows
(cont'd)

      p̂(x) = (1/h^ℓ) [ (1/N) Σ_{i=1}^{N} φ( (xi − x)/h ) ]

• I.e., given x, to compute p̂(x):

  – Count the number of training vectors in the size-h (per side) hypercube H centered at x

  – Divide by N to estimate the probability of getting a point in H

  – Divide by the volume of H

• Problem: approximating the continuous function p(x) with a discontinuous p̂(x)

• Solution: substitute a smooth function for φ(·), e.g. φ(x) = 1/(2π)^{ℓ/2} · exp(−x^T x / 2) = N(0, 1) for each dimension independently


Estimating Unknown pdf's
Parzen Windows
Numeric Example
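
Below is a small illustrative computation of p̂(x), assuming made-up 1-D training points and evaluating both the hypercube kernel φ and the smooth Gaussian kernel:

    import numpy as np

    # Parzen estimate p̂(x) with made-up 1-D data (ℓ = 1), for both kernels.
    X = np.array([0.1, 0.4, 0.5, 0.9, 1.1, 1.2, 2.0])   # training points (assumed)
    h, ell = 0.5, 1

    def phi_box(u):
        return np.all(np.abs(u) <= 0.5) * 1.0            # 1 inside the unit hypercube

    def phi_gauss(u):
        return np.exp(-0.5 * np.dot(u, u)) / (2 * np.pi) ** (ell / 2)

    def parzen(x, phi):
        return (1 / h**ell) * np.mean([phi((xi - x) / h) for xi in X])

    for x in (0.5, 1.0, 3.0):
        print(x, parzen(x, phi_box), parzen(x, phi_gauss))
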

k-Nearest Neighbor Techniques

• Classify an unlabeled feature vector x according to a majority vote of its k nearest neighbors

  (Figure: k = 3 example using Euclidean distance; most of the unclassified point's nearest neighbors are Class B, so predict B)

• As N → ∞,

  – 1-NN error is at most twice the Bayes optimal error (PB)

  – k-NN error is ≤ PB + 1/√(ke)

• Can also weight votes by relative distance

• Complexity issues: research into more efficient algorithms, approximation algorithms
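
A minimal sketch of k-NN majority voting with Euclidean distance, on made-up 2-D training data:

    import numpy as np

    # Hypothetical training points and labels for two classes.
    X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                        [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
    y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

    def knn_predict(x, k=3):
        """Majority vote among the k training points nearest to x."""
        dists = np.linalg.norm(X_train - x, axis=1)       # Euclidean distances
        nearest = y_train[np.argsort(dists)[:k]]          # labels of the k nearest
        labels, counts = np.unique(nearest, return_counts=True)
        return labels[np.argmax(counts)]

    print(knn_predict(np.array([0.8, 0.8])))   # -> 'B'
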
