Statistical Learning
Lecture 2: More on Bayesian Inference
Mário A. T. Figueiredo
Instituto Superior Técnico & Instituto de Telecomunicações
University of Lisbon, Portugal
2018
Bayesian Decision Rules

Posterior expected loss of a decision $a \in \mathcal{A}$:

$$
\rho(f_S, a \mid x) = E\left[L(S,a) \mid x\right] =
\begin{cases}
\displaystyle\sum_{s \in \mathcal{S}} f_{S|X}(s|x)\, L(s,a), & \text{if } S \text{ is discrete}\\[6pt]
\displaystyle\int_{\mathcal{S}} f_{S|X}(s|x)\, L(s,a)\, ds, & \text{if } S \text{ is continuous}
\end{cases}
$$

where, via Bayes law, $f_{S|X}(s|x) = f_{X|S}(x|s)\, f_S(s)/f_X(x)$.

Bayes optimal decision rule:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathcal{A}} \rho(f_S, a \mid x) = \arg\min_{a \in \mathcal{A}} E\left[L(S,a) \mid x\right]
$$

...the minimizer of the expected loss (with respect to the unknown $S$), conditioned on the observations $X = x$.
MAP Classification

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (discrete/classification problem)

Observations: $X \in \mathcal{X}$, with $\{f_{X|S}(\cdot|s),\ s \in \mathcal{S}\}$ (set of class-conditional pdfs or pmfs)

Prior pmf $f_S$ (with $f_S(s) \ge 0$ and $\sum_{s \in \mathcal{S}} f_S(s) = 1$)

Adopt the 0/1 loss: $L(s,a) = 0$ if $s = a$, and $1$ otherwise.

Optimal Bayesian decision:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathcal{S}} \sum_{s=1}^{K} L(s,a)\, f_{S|X}(s|x)
= \arg\min_{a \in \mathcal{S}} \left(1 - f_{S|X}(a|x)\right)
= \arg\max_{a \in \mathcal{S}} f_{S|X}(a|x) \equiv \delta_{\mathrm{MAP}}(x)
$$

MAP = maximum a posteriori criterion.
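For concreteness, a minimal Python sketch of the MAP rule, assuming a hypothetical pair of scalar Gaussian classes with made-up priors (none of these numbers are from the lecture):

import numpy as np
from scipy.stats import norm

# Hypothetical setup: K = 2 classes with Gaussian class-conditionals.
priors = np.array([0.3, 0.7])                      # f_S(s)
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def map_classify(x):
    # Posterior is proportional to likelihood times prior; the argmax
    # is unaffected by the normalizing constant f_X(x).
    scores = norm.pdf(x, means, stds) * priors
    return np.argmax(scores) + 1                   # classes labeled 1..K

print(map_classify(0.2))   # the likelihood near mu_1 overcomes the smaller prior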
Binary MAP Classification

Let $\mathcal{A} = \mathcal{S} = \{1, 2\}$ (binary classification problem)

Observation model $\{f_{X|S}(\cdot|1),\ f_{X|S}(\cdot|2)\}$

Prior $f_S(1)$ (of course, $f_S(2) = 1 - f_S(1)$)

MAP rule:

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \{1,2\}} f_{X|S}(x|a)\, f_S(a) =
\begin{cases} 1, & \text{if } l(x) \ge t\\ 2, & \text{if } l(x) < t \end{cases}
$$

where

$$
l(x) = \underbrace{\frac{f_{X|S}(x|1)}{f_{X|S}(x|2)}}_{\text{likelihood ratio}}
\qquad\text{and}\qquad
t = \underbrace{\frac{f_S(2)}{f_S(1)}}_{\text{threshold}}
$$

If $f_S(1) = f_S(2) = 1/2$, we have a maximum likelihood classifier, a.k.a. a likelihood ratio test (LRT).
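A small sketch of the LRT form of the rule, again with hypothetical Gaussian class-conditionals (assumed parameters):

import numpy as np
from scipy.stats import norm

# Hypothetical two-class model (assumed parameters).
f1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # f_{X|S}(x|1)
f2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # f_{X|S}(x|2)

def lrt(x, p1=0.5):
    l = f1(x) / f2(x)          # likelihood ratio l(x)
    t = (1 - p1) / p1          # MAP threshold t = f_S(2)/f_S(1)
    return 1 if l >= t else 2

print(lrt(0.9))   # with p1 = 0.5 this is the ML classifier (t = 1)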
Binary Classification with General Costs

Let $\mathcal{A} = \mathcal{S} = \{1, 2\}$ (binary classification problem)

Observation model $\{f_{X|S}(\cdot|1),\ f_{X|S}(\cdot|2)\}$ and prior $f_S(1)$

Optimal Bayesian decision:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \{1,2\}} \big[ L(1,a)\, f_{X|S}(x|1)\, f_S(1) + L(2,a)\, f_{X|S}(x|2)\, f_S(2) \big]
= \begin{cases} 1, & \text{if } l(x) \ge t\\ 2, & \text{if } l(x) < t \end{cases}
$$

where $l(x)$ is as before and the threshold becomes

$$
t = \frac{f_S(2)\,\big(L(2,1) - L(2,2)\big)}{f_S(1)\,\big(L(1,2) - L(1,1)\big)}
$$

...i.e., the loss structure affects the decision threshold.
MAP Classification with Gaussian Classes

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (classification problem)

Prior class probabilities: $\{f_S(s),\ s = 1, \dots, K\}$

Gaussian class-conditional densities, for observations $X \in \mathbb{R}^p$:

$$
f_{X|S}(x|s) = \mathcal{N}(x; \mu_s, C_s)
= \frac{1}{\sqrt{\det(2\pi C_s)}} \exp\Big( -\frac{1}{2} (x - \mu_s)^T C_s^{-1} (x - \mu_s) \Big)
$$

MAP rule (taking logs and dropping constants):

$$
\delta_{\mathrm{MAP}}(x)
= \arg\min_{s \in \{1,\dots,K\}} \Big\{ \underbrace{-\log f_S(s) + \tfrac{1}{2}\log\det(C_s)}_{\gamma_s}
+ \tfrac{1}{2} \underbrace{(x - \mu_s)^T C_s^{-1} (x - \mu_s)}_{\text{(squared) Mahalanobis distance } \|x - \mu_s\|^2_{C_s}} \Big\}
= \arg\min_{s \in \{1,\dots,K\}} \Big\{ \gamma_s + \tfrac{1}{2} \|x - \mu_s\|^2_{C_s} \Big\}
$$
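A sketch of the resulting quadratic discriminant score, with assumed (hypothetical) means, covariances, and priors:

import numpy as np

def qda_score(x, mu, C, prior):
    # gamma_s + (1/2) squared Mahalanobis distance, as in the MAP rule above;
    # C is assumed symmetric positive-definite, so slogdet's sign is +1.
    sign, logdet = np.linalg.slogdet(C)
    d = x - mu
    maha = d @ np.linalg.solve(C, d)
    return -np.log(prior) + 0.5 * logdet + 0.5 * maha

# Hypothetical binary problem in R^2 (assumed parameters).
mus = [np.zeros(2), np.array([2.0, 0.0])]
Cs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 0.5])
s_hat = 1 + int(np.argmin([qda_score(x, m, C, p)
                           for m, C, p in zip(mus, Cs, priors)]))
print(s_hat)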
MAP Classification with Gaussian Classes (II)

MAP rule: $\delta_{\mathrm{MAP}}(x) = \arg\min_{s \in \{1,\dots,K\}} \big\{ \gamma_s + \tfrac{1}{2} \|x - \mu_s\|^2_{C_s} \big\}$

Decision regions: $\{R_s,\ s = 1, \dots, K\}$ (a partition of $\mathbb{R}^p$), such that $R_s = \{x \in \mathbb{R}^p : \delta_{\mathrm{MAP}}(x) = s\}$

Quadratic discrimination (quadratic boundaries); a.k.a. quadratic discriminant analysis (QDA). Example: binary classification in $\mathbb{R}^2$:

[Figure: "Parabolic Boundary"; level curves of the two class-conditional Gaussian densities (and a few samples); the boundary is quadratic (parabolic); increasing (resp. decreasing) $f_S(1)$ shifts the boundary away from (resp. towards) $\mu_1$.]
MAP Classification with Gaussian Classes (III)

In the case of a common covariance matrix, $C_s = C$:

$$
\|x - \mu_s\|^2_C = x^T C^{-1} x + \mu_s^T C^{-1} \mu_s - 2 (C^{-1}\mu_s)^T x
$$

MAP rule:

$$
\delta_{\mathrm{MAP}}(x) = \arg\min_{s \in \{1,\dots,K\}} \big\{ \gamma_s + x^T \beta_s \big\},
$$

where $\beta_s = -(C^{-1}\mu_s)$ and $\gamma_s = -\log f_S(s) + \tfrac{1}{2}\mu_s^T C^{-1} \mu_s$

...the MAP classifier is linear in $x$; region boundaries are hyperplanes.

[Figure: "Linear Boundary" (two classes) and "All Linear Boundaries" (several classes); examples of linear decision boundaries in $\mathbb{R}^2$.]
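A sketch of the linear (shared-covariance) case, precomputing $\beta_s$ and $\gamma_s$ for assumed parameters:

import numpy as np

# Shared covariance C: the quadratic term x^T C^{-1} x is common to all
# classes, so each class score is affine in x (hypothetical parameters).
C = np.array([[1.0, 0.2], [0.2, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.4, 0.6]
Cinv = np.linalg.inv(C)

betas = [-(Cinv @ mu) for mu in mus]
gammas = [-np.log(p) + 0.5 * mu @ Cinv @ mu for mu, p in zip(mus, priors)]

def lda_classify(x):
    scores = [g + x @ b for g, b in zip(gammas, betas)]
    return 1 + int(np.argmin(scores))

print(lda_classify(np.array([1.0, 0.4])))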
MAP Classification with Multinomial Classes

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (classification problem)

Prior class probabilities: $\{f_S(s),\ s = 1, \dots, K\}$

Multinomial class-conditional pmf, for observations $X \in \{0, \dots, n\}^p$:

$$
f_{X|S}(x|s) = \binom{n}{x_1\, x_2 \cdots x_p} \prod_{i=1}^{p} \big(\theta_i^{(s)}\big)^{x_i}\, \mathbb{1}_{x_1 + \dots + x_p = n},
$$

where each $\theta^{(s)}$ is in the canonical simplex,

$$
\theta^{(s)} = \begin{bmatrix} \theta_1^{(s)} \\ \vdots \\ \theta_p^{(s)} \end{bmatrix} \in \Delta_{p-1}
\equiv \Big\{ v \in \mathbb{R}^p : v_i \ge 0,\ \sum_{i=1}^p v_i = 1 \Big\}
$$

MAP rule (taking logs and dropping constants):

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{s \in \{1,\dots,K\}} \big\{ \log f_S(s) + x^T \log\theta^{(s)} \big\}
$$

...again, linear in $x$.
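A sketch of the multinomial MAP rule as a linear score in the count vector $x$, with hypothetical class parameters $\theta^{(s)}$:

import numpy as np

# Hypothetical K = 2 classes over p = 4 count features (assumed thetas).
thetas = np.array([[0.4, 0.3, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.4]])   # each row in the simplex
priors = np.array([0.5, 0.5])

def multinomial_map(x):
    # log f_S(s) + x^T log theta^{(s)}; the multinomial coefficient is
    # common to all classes and can be dropped.
    scores = np.log(priors) + x @ np.log(thetas).T
    return 1 + int(np.argmax(scores))

print(multinomial_map(np.array([5, 1, 1, 0])))   # counts favor class 1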
MAP Classification with Exponential Family Classes

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (classification problem)

Prior class probabilities: $\{f_S(s),\ s = 1, \dots, K\}$

Exponential family class-conditional pdf or pmf, for observations $X \in \mathcal{X}$:

$$
f_{X|S}(x|s) = \frac{1}{Z(\eta^{(s)})}\, h(x) \exp\big( (\eta^{(s)})^T \phi(x) \big)
$$

MAP rule (taking logs and dropping constants):

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{s \in \{1,\dots,K\}} \big\{ \log f_S(s) - \log Z(\eta^{(s)}) + (\eta^{(s)})^T \phi(x) \big\}
$$

...linear in the sufficient statistics $\phi(x)$.

The previous examples (Gaussian and multinomial) are in this family.
Class Posteriors for Exponential Family Classes

Same scenario as in the previous slide.

Class posterior probabilities (from Bayes law):

$$
f_{S|X}(s|x) = \frac{\dfrac{f_S(s)}{Z(\eta^{(s)})} \exp\big( (\eta^{(s)})^T \phi(x) \big)}
{\displaystyle\sum_{r=1}^{K} \frac{f_S(r)}{Z(\eta^{(r)})} \exp\big( (\eta^{(r)})^T \phi(x) \big)}
$$

...taking logs,

$$
\log f_{S|X}(s|x) = (\eta^{(s)})^T \phi(x) + \zeta^{(s)} - \xi(x),
$$

thus

$$
f_{S|X}(s|x) = \frac{\exp\big( (\eta^{(s)})^T \phi(x) + \zeta^{(s)} \big)}
{\displaystyle\sum_{r=1}^{K} \exp\big( (\eta^{(r)})^T \phi(x) + \zeta^{(r)} \big)}
$$

...called a generalized linear model (GLM).
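A numerically stable sketch of these softmax-form posteriors, for assumed (hypothetical) values of $\eta^{(s)}$, $\zeta^{(s)}$, and $\phi(x)$:

import numpy as np

def class_posteriors(phi_x, etas, zetas):
    # Softmax of the linear scores (eta^(s))^T phi(x) + zeta^(s);
    # subtracting the max keeps the exponentials numerically stable.
    scores = etas @ phi_x + zetas
    scores -= scores.max()
    w = np.exp(scores)
    return w / w.sum()

# Hypothetical 3-class model with 2-dimensional sufficient statistics.
etas = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
zetas = np.array([0.0, 0.1, -0.2])
print(class_posteriors(np.array([0.5, -0.3]), etas, zetas))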
Bayesian MAP Estimation

Let $\mathcal{A} = \mathcal{S} = \mathbb{R}$ (scalar estimation problem)

Likelihood function $f_{X|S}$, where $X \in \mathcal{X}$, and prior pdf $f_S$

$\varepsilon$-loss:
$$
L_\varepsilon(s, a) = \begin{cases} 1, & \text{if } |s - a| \ge \varepsilon\\ 0, & \text{if } |s - a| < \varepsilon \end{cases}
$$

Bayes rule for given $\varepsilon$:

$$
\delta_\varepsilon(x) = \arg\min_{a \in \mathbb{R}} \int_{-\infty}^{\infty} f_{S|X}(s|x)\, L_\varepsilon(s, a)\, ds
= \arg\min_{a \in \mathbb{R}} \Big( 1 - \int_{a-\varepsilon}^{a+\varepsilon} f_{S|X}(s|x)\, ds \Big)
= \arg\max_{a \in \mathbb{R}} \int_{a-\varepsilon}^{a+\varepsilon} f_{S|X}(s|x)\, ds
$$

In the limit,

$$
\lim_{\varepsilon \to 0} \delta_\varepsilon(x) = \arg\max_{a \in \mathbb{R}} f_{S|X}(a|x) = \delta_{\mathrm{MAP}}(x)
$$
Bayesian MAP Estimation

Let $\mathcal{A} = \mathcal{S} = \mathbb{R}$ (scalar estimation problem)

Likelihood function $f_{X|S}$, where $X \in \mathcal{X}$, and prior pdf $f_S$

MAP estimation criterion: $\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \mathbb{R}} f_{S|X}(a|x)$

From Bayes law:

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \mathbb{R}} \log \frac{f_{X|S}(x|a)\, f_S(a)}{f_X(x)}
= \arg\max_{a \in \mathbb{R}} \Big[ \underbrace{\log f_{X|S}(x|a)}_{\text{log-likelihood}} + \underbrace{\log f_S(a)}_{\text{log-prior}} \Big]
$$

...a form (there are others) of trade-off between the information in the observation $x$ and the prior knowledge/belief.
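A sketch of MAP estimation as a one-dimensional optimization, assuming (for illustration only) a Gaussian likelihood and a Laplacian prior:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical scalar model (assumed choices): Gaussian likelihood around s,
# Laplace(0, 1) prior; MAP maximizes log-likelihood + log-prior.
x_obs = 1.5

def neg_log_posterior(a):
    log_lik = norm.logpdf(x_obs, loc=a, scale=1.0)
    log_prior = -np.abs(a)            # Laplace log-density, up to a constant
    return -(log_lik + log_prior)

res = minimize_scalar(neg_log_posterior)
print(res.x)   # the Laplace prior shrinks x_obs toward 0 (soft-thresholding)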
Bayesian MAP Estimation: Vector Case

Let $\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ (vector estimation problem)

Likelihood function $f_{X|S}$, where $X \in \mathcal{X}$, and prior pdf $f_S$

$\varepsilon$-loss:
$$
L_\varepsilon(s, a) = \begin{cases} 1, & \text{if } \|s - a\|_2 \ge \varepsilon\\ 0, & \text{if } \|s - a\|_2 < \varepsilon \end{cases}
$$

Bayes rule for given $\varepsilon$:

$$
\delta_\varepsilon(x) = \arg\min_{a \in \mathbb{R}^p} \int_{\mathbb{R}^p} f_{S|X}(s|x)\, L_\varepsilon(s, a)\, ds
= \arg\min_{a \in \mathbb{R}^p} \Big( 1 - \int_{\{s \in \mathbb{R}^p :\, \|s - a\|_2 < \varepsilon\}} f_{S|X}(s|x)\, ds \Big)
= \arg\max_{a \in \mathbb{R}^p} \int_{\{s \in \mathbb{R}^p :\, \|s - a\|_2 < \varepsilon\}} f_{S|X}(s|x)\, ds
$$

In the limit,

$$
\lim_{\varepsilon \to 0} \delta_\varepsilon(x) = \arg\max_{a \in \mathbb{R}^p} f_{S|X}(a|x) = \delta_{\mathrm{MAP}}(x)
$$
Bayesian MMSE Estimation: Scalar and Vector Case

$\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ ($p \ge 1$); likelihood $f_{X|S}$, for $X \in \mathcal{X}$; prior $f_S$

Squared error loss: $L(s, a) = \|s - a\|_2^2 = \sum_{i=1}^p (s_i - a_i)^2$

The Bayes optimal rule, minimum mean squared error (MMSE):

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathbb{R}^p} \underbrace{E\big[ \|S - a\|_2^2 \,\big|\, x \big]}_{\text{mean squared error (MSE)}} = \delta_{\mathrm{MMSE}}(x)
$$

...can be easily derived by equating the gradient to zero:

$$
\delta_{\mathrm{MMSE}}(x) = \arg\min_{a \in \mathbb{R}^p} \Big\{ \underbrace{E\big[ \|S\|_2^2 \,\big|\, x \big]}_{\text{indep. of } a} - 2\, a^T E[S|x] + \underbrace{E\big[ \|a\|_2^2 \,\big|\, x \big]}_{\|a\|_2^2} \Big\}
$$

Setting the gradient to zero gives $2a - 2E[S|x] = 0$, thus

$$
\delta_{\mathrm{MMSE}}(x) = E[S|x]
$$
MAP vs MMSE Estimation: Optimization vs Integration

$\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ ($p \ge 1$); likelihood $f_{X|S}$, for $X \in \mathcal{X}$; prior $f_S$

MAP estimate: $\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \mathbb{R}^p} \big[ \log f_{X|S}(x|a) + \log f_S(a) \big]$
- Optimization problem (maybe large-scale, if $p \gg 1$)
- Does not require knowing $f_X(x)$ (i.e., $f_{S|X}$ may be un-normalized)

MMSE estimate:

$$
\delta_{\mathrm{MMSE}}(x) = E[S|x] = \int_{\mathcal{S}} s\, f_{S|X}(s|x)\, ds
= \frac{\displaystyle\int_{\mathcal{S}} s\, f_{X|S}(x|s)\, f_S(s)\, ds}{\displaystyle\int_{\mathcal{S}} f_{X|S}(x|s)\, f_S(s)\, ds}
$$

- Integration problem (maybe large-scale, if $p \gg 1$)
- In general, much harder than MAP estimation
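To make the optimization-vs-integration contrast concrete, a grid-based sketch for a hypothetical scalar model (Gamma prior, Gaussian likelihood; assumed choices), whose skewed posterior makes the two estimates differ:

import numpy as np
from scipy.stats import norm, gamma

x_obs = 1.0
s = np.linspace(1e-4, 10, 100000)
unnorm = norm.pdf(x_obs, loc=s, scale=1.0) * gamma.pdf(s, a=2.0)

s_map = s[np.argmax(unnorm)]                       # mode: no normalization needed
s_mmse = np.sum(s * unnorm) / np.sum(unnorm)       # ratio of integrals (grid sums)
print(s_map, s_mmse)   # the mode and the mean differ for a skewed posterior

In one dimension the integral is trivial on a grid; the point of the slide is that for $p \gg 1$ such grids are infeasible, while optimization often still is feasible.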
A Classic: Linear-Gaussian Likelihood and Gaussian Prior

$\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ ($p \ge 1$); linear-Gaussian likelihood

$$
f_{X|S}(x|s) = \mathcal{N}(x; As, C) = \frac{\exp\big( -\frac{1}{2} (x - As)^T C^{-1} (x - As) \big)}{\sqrt{\det(2\pi C)}}
$$

where $x \in \mathbb{R}^n$, thus $A \in \mathbb{R}^{n \times p}$,

...and Gaussian prior: $f_S(s) = \mathcal{N}(s; \mu, B)$

Somewhat tedious manipulation yields the posterior

$$
f_{S|X}(s|x) = \mathcal{N}(s; \hat{s}(x), D),
$$

where

$$
D = \big( A^T C^{-1} A + B^{-1} \big)^{-1} \qquad\text{and}\qquad \hat{s}(x) = D \big( A^T C^{-1} x + B^{-1} \mu \big)
$$

Of course, $\delta_{\mathrm{MMSE}}(x) = \delta_{\mathrm{MAP}}(x) = \hat{s}(x)$
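A direct numpy transcription of these posterior formulas, with assumed (hypothetical) sizes and parameter values:

import numpy as np

def gaussian_posterior(x, A, C, mu, B):
    # D = (A^T C^{-1} A + B^{-1})^{-1},  s_hat = D (A^T C^{-1} x + B^{-1} mu)
    Ci, Bi = np.linalg.inv(C), np.linalg.inv(B)
    D = np.linalg.inv(A.T @ Ci @ A + Bi)
    s_hat = D @ (A.T @ Ci @ x + Bi @ mu)
    return s_hat, D

# Hypothetical sizes n = 3, p = 2 (assumed values).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C, B = 0.5 * np.eye(3), 2.0 * np.eye(2)
x, mu = np.array([1.0, 2.0, 3.2]), np.zeros(2)
print(gaussian_posterior(x, A, C, mu, B))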
Linear-Gaussian Observation, Gaussian Prior: Example 1

Let $X = (X_1, \dots, X_n)$ be i.i.d. noisy (with variance $\sigma^2$) observations of $S \in \mathbb{R}$, i.e.,

$$
f_{X|S}(x|s) = \prod_{i=1}^n \mathcal{N}(x_i; s, \sigma^2) = \mathcal{N}(x; As, \sigma^2 I),
\qquad\text{with } A = [1, \dots, 1]^T
$$

With $f_S(s) = \mathcal{N}(s; \mu, \lambda^2)$ (i.e., $B = \lambda^2$),

$$
f_{S|X}(s|x) = \mathcal{N}(s; \hat{s}(x), \tau^2), \qquad
\hat{s}(x) = \frac{\dfrac{n}{\sigma^2}\, \bar{x} + \dfrac{1}{\lambda^2}\, \mu}{\dfrac{n}{\sigma^2} + \dfrac{1}{\lambda^2}}, \qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$

$\hat{s}(x)$ is a weighted average of the observation average and the prior mean.
Linear-Gaussian Observation, Gaussian Prior: Example 1 (cont.)

Let's study limit cases of $\hat{s}(x) = \Big( \dfrac{n}{\sigma^2}\, \bar{x} + \dfrac{1}{\lambda^2}\, \mu \Big) \Big/ \Big( \dfrac{n}{\sigma^2} + \dfrac{1}{\lambda^2} \Big)$:

- Large sample limit, the prior is ignored: $\lim_{n \to \infty} \hat{s}(x) = \bar{x}$
- Low noise limit, the prior is ignored: $\lim_{\sigma^2 \to 0} \hat{s}(x) = \bar{x}$
- Large noise limit, the data is ignored: $\lim_{\sigma^2 \to \infty} \hat{s}(x) = \mu$
- High prior confidence limit, the data is ignored: $\lim_{\lambda^2 \to 0} \hat{s}(x) = \mu$
- Low prior confidence limit, the prior is ignored: $\lim_{\lambda^2 \to \infty} \hat{s}(x) = \bar{x}$

Note: the prior is equivalent to one extra observation equal to $\mu$ with variance $\lambda^2$.
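A quick numeric sketch of these limit cases (the values of $\bar{x}$, $\mu$, and the variances are made up):

import numpy as np

def s_hat(xbar, n, sigma2, mu, lam2):
    # Posterior mean: precision-weighted average of xbar and mu
    return (n / sigma2 * xbar + mu / lam2) / (n / sigma2 + 1.0 / lam2)

xbar, mu = 1.0, 5.0
print(s_hat(xbar, n=10**6, sigma2=1.0, mu=mu, lam2=1.0))   # ~ xbar (large n)
print(s_hat(xbar, n=10, sigma2=10**6, mu=mu, lam2=1.0))    # ~ mu (large noise)
print(s_hat(xbar, n=10, sigma2=1.0, mu=mu, lam2=1e-9))     # ~ mu (tight prior)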
Linear-Gaussian Observation, Gaussian Prior: Example 2

Let $X = (X_1, \dots, X_n)$ be independent noisy (with variances $\sigma_i^2$) observations of $S \in \mathbb{R}$, i.e.,

$$
f_{X|S}(x|s) = \mathcal{N}(x; As, C), \qquad\text{with } A = [1, \dots, 1]^T,\quad C = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)
$$

With $f_S(s) = \mathcal{N}(s; \mu, \lambda^2)$ (i.e., $B = \lambda^2$),

$$
\hat{s}(x) = \frac{\displaystyle\sum_{i=1}^n \frac{x_i}{\sigma_i^2} + \frac{\mu}{\lambda^2}}{\displaystyle\sum_{i=1}^n \frac{1}{\sigma_i^2} + \frac{1}{\lambda^2}}
$$

Each sample is weighted proportionally to $\frac{1}{\sigma_i^2}$ (the precision).

Notice that, again, $\lim_{n \to \infty} \hat{s}(x)$ is independent of the prior.
Bernstein-von Mises Theorem

Notice that:
- Example 1: $\delta_{\mathrm{MLE}}(x) = \frac{1}{n} \sum_{i=1}^n x_i$
- Example 2: $\delta_{\mathrm{MLE}}(x) = \Big( \sum_{i=1}^n \frac{1}{\sigma_i^2} \Big)^{-1} \sum_{i=1}^n \frac{x_i}{\sigma_i^2}$

In the previous examples, we noticed that (independently of the prior)

$$
\lim_{n \to \infty} \delta_{\mathrm{MMSE}}(x) = \lim_{n \to \infty} \delta_{\mathrm{MAP}}(x) = \delta_{\mathrm{MLE}}(x),
$$

i.e., the mean or the weighted mean of the observations.

This is a general result, known as the Bernstein-von Mises theorem; a sufficient condition is that the prior is continuous and non-zero everywhere.
Conjugate Priors

In the previous examples, the posterior and the prior were both Gaussian; were we lucky? No, the prior was conjugate!

Consider a likelihood/observation model $f_{X|S}$, for $S \in \mathcal{S}$ and $X \in \mathcal{X}$,

...and a family of priors $\mathcal{P} = \{f_S(s; \theta),\ \theta \in \Theta\}$

$\mathcal{P}$ is conjugate for $f_{X|S}$ if, for any $\theta \in \Theta$ and $x \in \mathcal{X}$,

$$
f_{S|X}(\cdot|x) = \frac{f_{X|S}(x|\cdot)\, f_S(\cdot\,; \theta)}{f_X(x; \theta)} \in \mathcal{P}
\quad\Leftrightarrow\quad \exists_{\theta' \in \Theta} :\ f_{S|X}(\cdot|x) = f_S(\cdot\,; \theta')
$$

Example: in Example 1 (above), $f_S(s) = \mathcal{N}(s; \mu, \lambda^2)$ and $f_{S|X}(s|x) = \mathcal{N}(s; \hat{s}(x), \tau^2)$
Conjugate Priors: Example

$S \in [0, 1]$ is the probability of success in $n$ i.i.d. Bernoulli samples

$X = (X_1, \dots, X_n) \in \{0,1\}^n$:
$$
f_{X|S}(x|s) = s^{\sum_{i=1}^n x_i} (1 - s)^{n - \sum_{i=1}^n x_i}
$$

Conjugate prior:
$$
\mathrm{Beta}(s; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, s^{\alpha - 1} (1 - s)^{\beta - 1}
$$

Expected value: $E[S] = \dfrac{\alpha}{\alpha + \beta}$

Mode:
$$
\arg\max_s \mathrm{Beta}(s; \alpha, \beta) = \frac{\alpha - 1}{\alpha + \beta - 2}, \qquad \text{if } \alpha, \beta > 1
$$
Conjugate Priors: Example (cont.)

$S \in [0, 1]$ is the probability of success in $n$ i.i.d. Bernoulli samples

$X = (X_1, \dots, X_n) \in \{0,1\}^n$:
$$
f_{X|S}(x|s) = s^{\sum_{i=1}^n x_i} (1 - s)^{n - \sum_{i=1}^n x_i}
$$

Conjugate prior:
$$
\mathrm{Beta}(s; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, s^{\alpha - 1} (1 - s)^{\beta - 1}
$$

Posterior:

$$
f_{S|X}(s|x) \propto s^{\sum_i x_i} (1 - s)^{n - \sum_i x_i}\, s^{\alpha - 1} (1 - s)^{\beta - 1}
\;\propto\; \mathrm{Beta}\Big( s;\ \alpha + \sum_{i=1}^n x_i,\ \beta + n - \sum_{i=1}^n x_i \Big) = \mathrm{Beta}(s; \alpha', \beta')
$$

Estimates:

$$
\delta_{\mathrm{MMSE}}(x) = \frac{\alpha + \sum_{i=1}^n x_i}{\alpha + \beta + n}, \qquad
\delta_{\mathrm{MAP}}(x) = \frac{\alpha - 1 + \sum_{i=1}^n x_i}{\alpha + \beta - 2 + n}
$$
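A sketch of this Beta-Bernoulli update and the two point estimates, tried on the all-successes sample used in the plots on the next slide:

import numpy as np

def beta_update(x, alpha, beta):
    # Posterior hyperparameters, then the MMSE and MAP estimates above
    a = alpha + np.sum(x)
    b = beta + len(x) - np.sum(x)
    mmse = a / (a + b)
    map_ = (a - 1) / (a + b - 2) if a + b > 2 else None
    return a, b, mmse, map_

x = np.array([1, 1, 1, 1, 1])            # five successes, as in the plots
print(beta_update(x, alpha=1, beta=1))   # flat prior: MAP coincides with the MLE
print(beta_update(x, alpha=5, beta=5))   # stronger prior pulls both toward 1/2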
Conjugate Priors: Example (cont.)

[Figure: plots of $f_{S|X}(s|x)$ for $n = 1, 5, 10, 20, 50, 500$ samples; solid: $\alpha = \beta = 1$; dotted: $\alpha = \beta = 5$; notice: $(x_1, \dots, x_5) = (1, 1, 1, 1, 1)$.]

Observe the Bernstein-von Mises effect.

Note that $\delta_{\mathrm{MLE}}(x) = \delta_{\mathrm{MAP}}(x)$, for $\alpha = \beta = 1$ (flat prior).
Conjugate Priors: Exponential Families

Let $S \in \mathbb{R}^p$ be the parameter of an exponential family; for a single observation $X$,

$$
f_{X|S}(x|s) = \frac{1}{Z(s)}\, h(x) \exp\big( \eta(s)^T \phi(x) \big)
$$

For $n$ i.i.d. observations $(X_1, \dots, X_n)$,

$$
f_{X|S}(x|s) = \frac{1}{Z(s)^n} \Big( \prod_{i=1}^n h(x_i) \Big) \exp\Big( \eta(s)^T \sum_{i=1}^n \phi(x_i) \Big)
$$

Conjugate prior has the form:

$$
f_S(s; \gamma, \nu) \propto \frac{1}{Z(s)^\nu} \exp\big( \eta(s)^T \gamma \big)
$$

Posterior:

$$
f_{S|X}(s|x; \gamma, \nu) \propto \frac{1}{Z(s)^{n+\nu}} \exp\Big( \eta(s)^T \Big( \gamma + \sum_{i=1}^n \phi(x_i) \Big) \Big)
$$

...the conjugate prior behaves like virtual additional observations.
Conjugate Priors for Exponential Families: Bernoulli/Binomial

$S \in [0, 1]$ is the probability of success in $n$ i.i.d. Bernoulli samples (see the Bernoulli/Beta example above)

In exponential family form,

$$
f_{X|S}(x|s) = \underbrace{(1 - s)^n}_{1/Z(s)^n} \exp\Big( \underbrace{\log\frac{s}{1-s}}_{\eta(s)} \sum_{i=1}^n x_i \Big)
$$

Conjugate prior:

$$
f_S(s; \gamma, \nu) \propto \frac{1}{Z(s)^\nu} \exp\big( \eta(s)^T \gamma \big)
= (1 - s)^\nu \exp\Big( \gamma \log\frac{s}{1-s} \Big)
= s^\gamma (1 - s)^{\nu - \gamma}
\propto \mathrm{Beta}(s; \gamma + 1, \nu - \gamma + 1)
$$
An Important Example: Multinomial Observations

Let $S \in \Delta_{p-1} \subset [0,1]^p$ be estimated from $X \in \{0, \dots, n\}^p$:

$$
f_{X|S}(x|s) = \underbrace{\binom{n}{x_1\, x_2 \cdots x_p}\, \mathbb{1}_{(x_1 + \dots + x_p = n)}}_{h(x)} \prod_{i=1}^p s_i^{x_i}
$$

In exponential family form (notice that $Z(s) = 1$),

$$
f_{X|S}(x|s) = h(x) \exp\big( x^T \underbrace{\log s}_{\eta(s)} \big),
\qquad \log s = [\log s_1, \dots, \log s_p]^T
$$

Conjugate prior:

$$
f_S(s; \gamma) \propto \mathbb{1}_{s \in \Delta_{p-1}} \exp\big( \gamma^T \log s \big)
= \mathbb{1}_{s \in \Delta_{p-1}} \prod_{i=1}^p s_i^{\gamma_i}
\propto \mathrm{Dirichlet}(s; \gamma + 1)
$$
Multinomial Observations and Dirichlet Prior

Multinomial observations with Dirichlet prior:

$$
f_{X|S}(x|s) = \mathrm{Multi}(x; n, s), \qquad f_S(s; \gamma) = \mathrm{Dirichlet}(s; \gamma + 1)
$$

[Figure: Dirichlet examples in $\Delta_2 \subset \mathbb{R}^3$; the simplex $\Delta_2$ (axes $s_1, s_2, s_3$), $\mathrm{Dirichlet}(s; [2,2,2])$, and $\mathrm{Dirichlet}(s; [20,2,2])$.]

Posterior and estimates: $f_{S|X}(s|x) = \mathrm{Dirichlet}(s; \gamma + x + 1)$

$$
\delta_{\mathrm{MAP}}(x) = \frac{\gamma + x}{\mathbb{1}_p^T (\gamma + x)}, \qquad
\delta_{\mathrm{MMSE}}(x) = \frac{\gamma + x + \mathbb{1}_p}{\mathbb{1}_p^T (\gamma + x + \mathbb{1}_p)}
\qquad (\mathbb{1}_p = [1, \dots, 1]^T)
$$
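A sketch of the Dirichlet-multinomial update and the two estimates above, with hypothetical counts and a flat $\gamma$:

import numpy as np

def dirichlet_estimates(x, gamma):
    # Posterior is Dirichlet(gamma + x + 1); point estimates from the slide
    map_ = (gamma + x) / np.sum(gamma + x)
    mmse = (gamma + x + 1) / np.sum(gamma + x + 1)
    return map_, mmse

x = np.array([8, 1, 1])                 # hypothetical counts, n = 10
gamma = np.array([1.0, 1.0, 1.0])       # prior Dirichlet(s; gamma + 1)
print(dirichlet_estimates(x, gamma))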
Example of Multinomial Model: Bag of Words (BoW)

Vocabulary with $p$ words (stemmed, without stop-words, etc.)

Multinomial language model: $S \in \Delta_{p-1}$ contains the probability of occurrence of each of the $p$ words.

Probability of a set of documents with a total of $n$ words and word counts $x_1, \dots, x_p$: $\mathrm{Multi}(x; n, s)$

Example: an observed text [figure omitted] yields a vector $x$ of term counts (the bag of words), with "unk" collecting all other words; the MAP estimate of the word probabilities then uses a $\mathrm{Dirichlet}(s; [2, 2, \dots, 2])$ prior [figure omitted].
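A minimal BoW sketch with a tiny hypothetical vocabulary (the words and text below are made up); with the $\mathrm{Dirichlet}(s; [2, \dots, 2])$ prior, i.e., $\gamma = [1, \dots, 1]$, the MAP estimate reduces to add-one (Laplace) smoothing of the counts:

import numpy as np
from collections import Counter

vocab = ["bayes", "prior", "posterior", "unk"]       # hypothetical vocabulary
tokens = ["bayes", "prior", "bayes", "posterior", "bayes"]
counts = Counter(t if t in vocab else "unk" for t in tokens)

x = np.array([counts[w] for w in vocab], dtype=float)
gamma = np.ones(len(vocab))                          # Dirichlet([2,...,2]) prior
s_map = (gamma + x) / np.sum(gamma + x)
print(dict(zip(vocab, s_map)))   # no word gets probability zero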
Additive Loss Functions and Marginal Inference

Recall the optimal Bayes rule:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathcal{A}} E\left[ L(S, a) \mid x \right]
$$

Let $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \times \dots \times \mathcal{S}_d$ and $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_d$

$S = (S_1, \dots, S_d)$, where $S_i \in \mathcal{S}_i$, and $a = (a_1, \dots, a_d)$, where $a_i \in \mathcal{A}_i$

Let the loss have an additive structure:

$$
L(s, a) = \sum_{i=1}^d L_i(s_i, a_i), \qquad\text{with } L_i : \mathcal{S}_i \times \mathcal{A}_i \to \mathbb{R}_+
$$

Example of additive loss: $L(s, a) = \|s - a\|_2^2 = \sum_{i=1}^d (s_i - a_i)^2$
Additive Loss Functions and Marginal Inference

Bayes optimal rule:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a_1 \in \mathcal{A}_1, \dots, a_d \in \mathcal{A}_d} E\Big[ \sum_{i=1}^d L_i(S_i, a_i) \,\Big|\, x \Big]
= \arg\min_{a_1 \in \mathcal{A}_1, \dots, a_d \in \mathcal{A}_d} \sum_{i=1}^d E\big[ L_i(S_i, a_i) \,\big|\, x \big]
= \Big( \arg\min_{a_1 \in \mathcal{A}_1} E\big[ L_1(S_1, a_1) \,\big|\, x \big],\ \dots,\ \arg\min_{a_d \in \mathcal{A}_d} E\big[ L_d(S_d, a_d) \,\big|\, x \big] \Big)
$$

Marginal Bayes decisions:

$$
\arg\min_{a_i \in \mathcal{A}_i} E\big[ L_i(S_i, a_i) \,\big|\, x \big]
= \arg\min_{a_i \in \mathcal{A}_i} \int_{\mathcal{S}_i} L_i(s_i, a_i)\, \underbrace{f_{S_i|X}(s_i|x)}_{\text{posterior marginal}}\, ds_i
$$

If $L_i(s_i, a_i)$ is the 0/1 loss, then $\delta_{\mathrm{Bayes}} = \delta_{\mathrm{MPM}}$ (maximizer of the posterior marginals).
Example of Marginal Inference: Estimation of Variance

Estimation: $S = (\mu, \sigma^2) \in \mathcal{A} = \mathcal{S} = \mathbb{R} \times \mathbb{R}_+$

Observations $X = (X_1, \dots, X_n)$, with $f_{X|S}(x|\mu, \sigma^2) = \prod_{i=1}^n \mathcal{N}(x_i; \mu, \sigma^2)$

Flat prior $f_S(s) = K$, thus $f_{S|X}(s|x) \propto f_{X|S}(x|s)$

MAP estimate: $(\hat\mu_{\mathrm{MAP}}, \hat\sigma^2_{\mathrm{MAP}}) = (\hat\mu_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}})$:

$$
\hat\mu_{\mathrm{MAP}} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad
\hat\sigma^2_{\mathrm{MAP}} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat\mu_{\mathrm{MLE}})^2
$$

with $\hat\sigma^2_{\mathrm{MAP}} = \hat\sigma^2_{\mathrm{MLE}}$ known to be biased: $E[\hat\sigma^2_{\mathrm{MLE}}] = \frac{n-1}{n}\, \sigma^2$

Posterior marginal:

$$
f_{\Sigma^2|X}(\sigma^2|x) \propto \int_{-\infty}^{\infty} \prod_{i=1}^n \mathcal{N}(x_i; \mu, \sigma^2)\, d\mu
\;\propto\; \big( 2\pi\sigma^2 \big)^{-\frac{n-1}{2}} \exp\Big( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \hat\mu_{\mathrm{MLE}})^2 \Big)
$$

The MPM estimate, $\hat\sigma^2_{\mathrm{MPM}} = \frac{n}{n-1}\, \hat\sigma^2_{\mathrm{MAP}}$, is unbiased.
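A Monte Carlo sketch (assumed $\sigma^2$ and $n$) checking the bias claim: the MAP/MLE variance estimate concentrates around $\frac{n-1}{n}\sigma^2$, while the MPM-corrected estimate is unbiased:

import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200000
xs = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

# MAP/MLE variance estimate per trial, then the MPM correction n/(n-1)
var_map = np.mean((xs - xs.mean(axis=1, keepdims=True)) ** 2, axis=1)
var_mpm = var_map * n / (n - 1)
print(var_map.mean(), var_mpm.mean())   # ~ (n-1)/n * 4 = 3.2  vs  ~ 4.0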