CSCE 970 Lecture 2: Bayesian-Based Classifiers: Most Probable
• Estimate from training data:

  P(ωi) ≈ Ni/N, where Ni = no. of class ωi examples and N = N1 + N2

• Now apply Bayes Rule:

  P(ωi | x) = p(x | ωi) P(ωi) / p(x)

  and classify x to the ωi that maximizes this posterior:

  If p(x | ω1)P(ω1) > p(x | ω2)P(ω2), classify x as ω1
  If p(x | ω1)P(ω1) < p(x | ω2)P(ω2), classify x as ω2

• Since we can estimate P(ωi), we now only need p(x | ωi)
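As a concrete illustration (not from the slides), a minimal sketch of this two-class rule in Python, with priors estimated from hypothetical training counts and the class-conditional densities assumed to be 1-D Gaussians:

```python
from scipy.stats import norm

# Illustrative sketch: two-class Bayes decision rule with priors estimated
# from hypothetical training counts and assumed 1-D Gaussian class-conditional
# densities p(x | omega_i).
N1, N2 = 60, 40                 # hypothetical class counts in the training set
N = N1 + N2
P1, P2 = N1 / N, N2 / N         # P(omega_i) ~= N_i / N

p1 = norm(loc=0.0, scale=1.0)   # assumed p(x | omega_1)
p2 = norm(loc=2.0, scale=1.0)   # assumed p(x | omega_2)

def classify(x):
    """Assign x to the class with the larger p(x | omega_i) * P(omega_i)."""
    return 1 if p1.pdf(x) * P1 > p2.pdf(x) * P2 else 2

print(classify(0.5), classify(1.8))   # -> 1, then 2
```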
Bayesian Decision Theory
Example

(figure: two class-conditional pdfs with decision regions R1, R2 and threshold x0)

Bayesian Decision Theory
Probability of Error

• In general, error is

  Pe = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)
     = P(x ∈ R2 | ω1) P(ω1) + P(x ∈ R1 | ω2) P(ω2)
     = P(ω1) ∫_R2 p(x | ω1) dx + P(ω2) ∫_R1 p(x | ω2) dx
     = ∫_R2 P(ω1 | x) p(x) dx + ∫_R1 P(ω2 | x) p(x) dx
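In the 1-D Gaussian case with equal variances, the Bayes regions reduce to a single threshold x0, so Pe can be evaluated with normal CDFs. A minimal sketch with made-up means, variance, and priors:

```python
import numpy as np
from scipy.stats import norm

# Sketch (made-up parameters): Pe for two 1-D Gaussian classes with equal
# variance.  With mu1 < mu2 the Bayes rule is a threshold x0, so
# R1 = (-inf, x0) and R2 = (x0, +inf).
mu1, mu2, sigma = 0.0, 2.0, 1.0
P1, P2 = 0.6, 0.4

# Threshold where p(x|w1)P(w1) = p(x|w2)P(w2), from equating the log terms:
x0 = (mu1 + mu2) / 2 + (sigma**2 / (mu2 - mu1)) * np.log(P1 / P2)

# Pe = P(w1) * P(x in R2 | w1) + P(w2) * P(x in R1 | w2)
Pe = P1 * (1 - norm.cdf(x0, mu1, sigma)) + P2 * norm.cdf(x0, mu2, sigma)
print(f"x0 = {x0:.3f}, Pe = {Pe:.4f}")
```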
• Then

  R2 = {x ∈ ℜℓ : λ21 p(x | ω2) > λ12 p(x | ω1)}
     = {x ∈ ℜℓ : p(x | ω2) > p(x | ω1) · λ12/λ21},

  which slides the threshold left of x0 on slide 5, since λ12/λ21 < 1

• Common choice of f(·) is the natural logarithm (multiplications become sums)

• Still requires a good estimate of the pdf

  – Will look at a tractable case next
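A quick numerical sketch (made-up losses, means, and variance) of this shift: with equal priors and a common variance, the loss-weighted rule moves the 1-D threshold left of x0 whenever λ12/λ21 < 1:

```python
import numpy as np

# Sketch with made-up numbers: decide omega_2 when
#   p(x|w2) > p(x|w1) * (lambda_12 / lambda_21)    (equal priors assumed).
# For 1-D Gaussians with a common variance, taking logs and solving for x
# gives a threshold shifted relative to the equal-loss threshold x0.
mu1, mu2, sigma = 0.0, 2.0, 1.0
lam12, lam21 = 0.5, 1.0                      # lambda_12 / lambda_21 < 1

x0     = (mu1 + mu2) / 2                     # equal-loss threshold
x_loss = x0 + (sigma**2 / (mu2 - mu1)) * np.log(lam12 / lam21)

print(x0, x_loss)                            # x_loss lies to the left of x0
```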
• Using gi(x) = ln(p(x | ωi) P(ωi)) with Gaussian p(x | ωi), we get:

  gi(x) = −(1/2)(x − µi)^T Σi^{−1}(x − µi) + ln P(ωi) − (ℓ/2) ln(2π) − (1/2) ln |Σi|

• If Σ not diagonal (but the Σi's and P(ωi)'s still equal), then maximizing gi(x) is the same as minimizing the Mahalanobis distance:

  sqrt( (x − µi)^T Σ^{−1} (x − µi) )

  – Constant distance = ellipse centered at µi
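A minimal numerical sketch of gi(x) and the Mahalanobis distance for a single class; the mean, covariance, prior, and query point are made up:

```python
import numpy as np

# Illustrative sketch of the quadratic discriminant g_i(x) for a Gaussian
# class-conditional pdf, and the Mahalanobis distance it reduces to when the
# covariances and priors are shared.  All values below are made up.
mu  = np.array([1.0, 2.0])
Sig = np.array([[2.0, 0.3],
                [0.3, 1.0]])
P_w = 0.5
x   = np.array([0.5, 1.0])

Sig_inv = np.linalg.inv(Sig)
d       = x - mu
ell     = len(x)

# g_i(x) = -1/2 (x-mu)^T Sig^-1 (x-mu) + ln P(w_i) - l/2 ln(2*pi) - 1/2 ln|Sig|
g = (-0.5 * d @ Sig_inv @ d + np.log(P_w)
     - 0.5 * ell * np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(Sig)))

# Mahalanobis distance sqrt((x-mu)^T Sig^-1 (x-mu))
d_mahal = np.sqrt(d @ Sig_inv @ d)
print(g, d_mahal)
```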
Estimating Unknown pdf's
Maximum Likelihood Parameter Estimation

• If we know the cov. matrix but not the mean for a class ωi, we can parameterize ωi's pdf on the mean µ:

  p(xk; µ) = 1 / ((2π)^(ℓ/2) |Σ|^(1/2)) · exp( −(1/2)(xk − µ)^T Σ^{−1}(xk − µ) )

  and use data x1, ..., xN from ω to estimate µ

• The maximum likelihood (ML) method estimates µ such that the following likelihood function is maximized:

  p(X; µ) = p(x1, ..., xN; µ) = ∏_{k=1}^N p(xk; µ)

Estimating Unknown pdf's
ML Param Est (cont'd)

• Assuming statistical indep. of the xki's, Σ^{−1}_{ij} = 0 for i ≠ j, so (writing L(µ) for the log-likelihood ln p(X; µ))

  ∂L/∂µ = [ ∂/∂µ1 ( −(1/2) ∑_{k=1}^N ∑_{j=1}^ℓ (xkj − µj)² Σ^{−1}_{jj} ) , … ,
            ∂/∂µℓ ( −(1/2) ∑_{k=1}^N ∑_{j=1}^ℓ (xkj − µj)² Σ^{−1}_{jj} ) ]^T
        = ∑_{k=1}^N Σ^{−1}(xk − µ) = 0,

  yielding

  µ̂_ML = (1/N) ∑_{k=1}^N xk
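A quick sketch (simulated data, made-up parameters) confirming that µ̂_ML is just the sample mean:

```python
import numpy as np

# Illustrative sketch: the ML estimate of the mean of a Gaussian with known
# covariance is the sample mean (the closed form derived above).
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
Sigma   = np.diag([1.0, 4.0])          # known (diagonal) covariance

X = rng.multivariate_normal(true_mu, Sigma, size=500)   # x_1, ..., x_N

mu_ml = X.mean(axis=0)                 # mu_hat_ML = (1/N) * sum_k x_k
print(mu_ml)                           # close to true_mu for large N
```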
• Again, take the log and set the gradient = 0 (with Σ = σ²I):

  ∑_{k=1}^N (1/σ²)(xk − µ) − (1/σµ²)(µ − µ0) = 0

  so

  µ̂_MAP = ( µ0 + (σµ²/σ²) ∑_{k=1}^N xk ) / ( 1 + (σµ²/σ²) N )

• µ_MAP ≈ µ_ML if p(µ) is almost uniform (σµ² ≫ σ²) or N → ∞ (Fig. 2.7)

Estimating Unknown pdf's
Parzen Windows

• Let φ(x) = 1 if |xj| ≤ 1/2 for all j = 1, ..., ℓ, and 0 otherwise

• Now approximate the pdf p(x) with

  p̂(x) = (1/hℓ) · (1/N) ∑_{i=1}^N φ( (xi − x)/h )

  where h is the bin size and each term φ((xi − x)/h) marks the bin (hypercube of side h) centered at x
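A small sketch of the MAP estimate above with a Gaussian prior µ ~ N(µ0, σµ²I); all numbers are made up, and the sample mean (the ML estimate) is printed for comparison:

```python
import numpy as np

# Illustrative sketch of the MAP estimate of a Gaussian mean with a Gaussian
# prior mu ~ N(mu0, sigma_mu^2 I) and likelihood covariance sigma^2 I,
# using the closed form derived above.  All numbers are made up.
rng = np.random.default_rng(1)
sigma, sigma_mu = 1.0, 2.0
mu0 = np.array([0.0, 0.0])             # prior mean
X   = rng.normal(loc=[1.0, -1.0], scale=sigma, size=(20, 2))
N   = len(X)

ratio  = sigma_mu**2 / sigma**2
mu_map = (mu0 + ratio * X.sum(axis=0)) / (1 + ratio * N)
mu_ml  = X.mean(axis=0)

print(mu_map, mu_ml)   # mu_map shrinks toward mu0; the two agree as N grows
```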
Estimating Unknown pdf's
Parzen Windows (cont'd)

  p̂(x) = (1/hℓ) · (1/N) ∑_{i=1}^N φ( (xi − x)/h )

  – Divide by volume of H

• Solution: substitute a smooth function for φ(·), e.g.

  φ(x) = 1/(2π)^(ℓ/2) · exp(−x^T x / 2) = N(0, 1) for each dimension independently
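A minimal 1-D sketch of the Parzen estimate with the smooth Gaussian kernel; the sample, the bin size h, and the query points are made up:

```python
import numpy as np

# Illustrative sketch of a 1-D Parzen-window density estimate with the smooth
# Gaussian kernel phi(u) = (2*pi)^(-1/2) * exp(-u^2 / 2).  The sample data and
# the bin size h are made up.
rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=200)        # x_1, ..., x_N
h = 0.3                                         # "bin size" h

def phi(u):
    """Gaussian kernel, one dimension."""
    return np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)

def p_hat(x):
    """p_hat(x) = (1/h^l)(1/N) * sum_i phi((x_i - x) / h), with l = 1."""
    return phi((samples - x) / h).mean() / h

print(p_hat(0.0), p_hat(2.0))   # higher estimated density near 0 than near 2
```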
Estimating Unknown pdf's
Parzen Windows: Numeric Example

(figure: k = 3, Euclidean distance; legend: Class A, Class B, unclassified point — predict B)
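The legend (k = 3, Euclidean distance, predicted class B) reads like a 3-nearest-neighbor decision, so here is a minimal sketch of that kind of majority-vote classifier on made-up 2-D points (not the figure's actual data):

```python
import numpy as np
from collections import Counter

# Hedged sketch of 3-nearest-neighbor voting with Euclidean distance.
# The points and labels below are made up, not the figure's actual data.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],     # Class A
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])    # Class B
y = np.array(["A", "A", "A", "B", "B", "B"])

def knn_predict(x, k=3):
    """Vote among the k training points closest to x (Euclidean distance)."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

print(knn_predict(np.array([4.0, 3.0])))   # -> "B"
```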
• As N → ∞,