Statistical Learning

Lecture 2: More on Bayesian Inference

Mário A. T. Figueiredo

Instituto Superior Técnico & Instituto de Telecomunicações

University of Lisbon, Portugal

2018



Bayesian Decision Rules

Posterior expected loss of a decision (action) a ∈ A:

ρ(fS, a | x) = E[L(S, a) | x] = Σ_{s∈S} fS|X(s|x) L(s, a),   if S is discrete,
ρ(fS, a | x) = E[L(S, a) | x] = ∫_S fS|X(s|x) L(s, a) ds,    if S is continuous,

where, via Bayes law, fS|X(s|x) = fX|S(x|s) fS(s) / fX(x).

Bayes optimal decision rule:

δBayes(x) = arg min_{a∈A} ρ(fS, a | x) = arg min_{a∈A} E[L(S, a) | x]

...the minimizer of the expected loss (with respect to the unknown S), conditioned on the observations X = x.



MAP Classification

Let A = S = {1, ..., K} (discrete/classification problem).

Observations: X ∈ X, with class-conditional pdfs or pmfs {fX|S(·|s), s ∈ S}.

Prior pmf fS (fS(s) ≥ 0 and Σ_{s∈S} fS(s) = 1).

Adopt the 0/1 loss: L(s, a) = 0 if s = a, and 1 otherwise.

Optimal Bayesian decision:

δBayes(x) = arg min_{a∈S} Σ_{s=1}^{K} L(s, a) fS|X(s|x)
          = arg min_{a∈S} ( 1 − fS|X(a|x) ) = arg max_{a∈S} fS|X(a|x) ≡ δMAP(x)

MAP = maximum a posteriori criterion.

Binary MAP Classification

Let A = S = {1, 2} (binary classification problem).

Observation model {fX|S(·|1), fX|S(·|2)}.

Prior fS(1) (of course, fS(2) = 1 − fS(1)).

MAP rule:

δMAP(x) = arg max_{a∈{1,2}} fX|S(x|a) fS(a) = 1 if l(x) ≥ t, and 2 if l(x) < t,

where

l(x) = fX|S(x|1) / fX|S(x|2)   (likelihood ratio)   and   t = fS(2) / fS(1)   (threshold).

If fS(1) = fS(2) = 1/2, we have a maximum likelihood classifier, a.k.a. likelihood ratio test (LRT).

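As a concrete illustration (not from the slides), here is a minimal sketch of this binary MAP rule for two univariate Gaussian class-conditional densities; the means, variances and prior below are arbitrary assumed values.

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities and prior (illustrative values only)
mu1, sigma1 = 0.0, 1.0     # class 1: f_{X|S}(x|1) = N(x; mu1, sigma1^2)
mu2, sigma2 = 2.0, 1.5     # class 2
prior1 = 0.7               # f_S(1); f_S(2) = 1 - prior1

def map_classify(x):
    """Binary MAP rule: compare the likelihood ratio l(x) with t = f_S(2)/f_S(1)."""
    l = norm.pdf(x, mu1, sigma1) / norm.pdf(x, mu2, sigma2)  # likelihood ratio
    t = (1 - prior1) / prior1                                # threshold
    return 1 if l >= t else 2

print([map_classify(x) for x in [-1.0, 0.5, 1.2, 3.0]])
```
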
Binary Classification with General Costs

Let A = S = {1, 2} (binary classification problem).

Observation model {fX|S(·|1), fX|S(·|2)} and prior fS(1).

Optimal Bayesian decision:

δBayes(x) = arg min_{a∈{1,2}} [ L(1, a) fX|S(x|1) fS(1) + L(2, a) fX|S(x|2) fS(2) ]
          = 1 if l(x) ≥ t, and 2 if l(x) < t,

where l(x) is as before and the threshold becomes

t = [ fS(2) (L(2, 1) − L(2, 2)) ] / [ fS(1) (L(1, 2) − L(1, 1)) ]

...i.e., the loss structure affects the decision threshold.

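A tiny sketch of the threshold formula above, with arbitrary assumed loss values; it only shows how an asymmetric loss moves the threshold away from fS(2)/fS(1).

```python
def cost_threshold(prior1, L):
    """Threshold t for the general-cost binary rule.

    L[s-1][a-1] is the loss of deciding a when the true class is s."""
    prior2 = 1 - prior1
    return (prior2 * (L[1][0] - L[1][1])) / (prior1 * (L[0][1] - L[0][0]))

# 0/1 loss recovers t = f_S(2)/f_S(1)
print(cost_threshold(0.7, [[0, 1], [1, 0]]))    # approx. 0.43
# Making L(2,1) = 10 demands much stronger evidence before deciding class 1
print(cost_threshold(0.7, [[0, 1], [10, 0]]))   # approx. 4.29
```
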
MAP Classification with Gaussian Classes

Let A = S = {1, ..., K} (classification problem).

Prior class probabilities: {fS(s), s = 1, ..., K}.

Gaussian class-conditional densities, for observations X ∈ R^p:

fX|S(x|s) = N(x; µs, Cs) = (det(2π Cs))^{−1/2} exp( −(1/2) (x − µs)^T Cs^{−1} (x − µs) )

MAP rule (taking logs and dropping constants):

δMAP(x) = arg min_{s∈{1,...,K}} { −log fS(s) + (1/2) log det(Cs) + (1/2) (x − µs)^T Cs^{−1} (x − µs) }
        = arg min_{s∈{1,...,K}} { γs + (1/2) ||x − µs||²_{Cs} }

where γs = −log fS(s) + (1/2) log det(Cs) and ||x − µs||²_{Cs} = (x − µs)^T Cs^{−1} (x − µs) is the (squared) Mahalanobis distance.



MAP Classification with Gaussian Classes (II)

MAP rule: δMAP(x) = arg min_{s∈{1,...,K}} { γs + (1/2) ||x − µs||²_{Cs} }

Decision regions: {Rs, s = 1, ..., K} (a partition of R^p), such that

Rs = {x ∈ R^p : δMAP(x) = s}

Quadratic discrimination (quadratic boundaries); a.k.a. quadratic discriminant analysis (QDA). Example: binary classification in R²:

[Figure: "Parabolic Boundary"; level curves of the two class-conditional Gaussian densities (and a few samples); the boundary is quadratic (parabolic); increasing (resp. decreasing) fS(1) shifts the boundary away from (resp. towards) µ1.]

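A minimal sketch of the quadratic discriminant rule above, with assumed class parameters; it evaluates γs + (1/2)‖x − µs‖²_{Cs} for each class and picks the minimizer.

```python
import numpy as np

def qda_classify(x, means, covs, priors):
    """MAP rule for Gaussian classes: argmin_s { gamma_s + 0.5 * squared Mahalanobis distance }."""
    best_s, best_score = None, np.inf
    for s, (mu, C, p) in enumerate(zip(means, covs, priors), start=1):
        diff = x - mu
        maha2 = diff @ np.linalg.solve(C, diff)            # (x - mu)^T C^{-1} (x - mu)
        gamma = -np.log(p) + 0.5 * np.log(np.linalg.det(C))
        score = gamma + 0.5 * maha2
        if score < best_score:
            best_s, best_score = s, score
    return best_s

# Illustrative (assumed) parameters for two classes in R^2
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[1.5, 0.4], [0.4, 0.8]])]
priors = [0.5, 0.5]
print(qda_classify(np.array([1.0, 0.5]), means, covs, priors))
```
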


MAP Classification with Gaussian Classes (III)

In the case of a common covariance matrix, Cs = C,

||x − µs||²_C = x^T C^{−1} x + µs^T C^{−1} µs − 2 (C^{−1} µs)^T x

MAP rule: δMAP(x) = arg min_{s∈{1,...,K}} { γs + x^T βs },

where βs = −(C^{−1} µs) and γs = −log fS(s) + (1/2) µs^T C^{−1} µs.

...the MAP classifier is linear in x; region boundaries are hyperplanes.

[Figures: "Linear Boundary" and "All Linear Boundaries"; with a common covariance, the boundaries between decision regions are straight lines (hyperplanes).]

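A sketch of the common-covariance (linear) case, with assumed parameters; each class reduces to the affine score γs + x^T βs from the slide.

```python
import numpy as np

def lda_classify(x, means, C, priors):
    """MAP rule with a shared covariance: argmin_s { gamma_s + x^T beta_s } (linear in x)."""
    Cinv = np.linalg.inv(C)
    best_s, best_score = None, np.inf
    for s, (mu, p) in enumerate(zip(means, priors), start=1):
        beta = -Cinv @ mu
        gamma = -np.log(p) + 0.5 * mu @ Cinv @ mu
        score = gamma + x @ beta
        if score < best_score:
            best_s, best_score = s, score
    return best_s

# Assumed common covariance and class means (illustrative values only)
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([-1.0, 2.0])]
C = np.array([[1.0, 0.2], [0.2, 1.0]])
print(lda_classify(np.array([0.5, 0.5]), means, C, [1/3, 1/3, 1/3]))
```
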
MAP Classification with Multinomial Classes

Let A = S = {1, ..., K} (classification problem).

Prior class probabilities: {fS(s), s = 1, ..., K}.

Multinomial class-conditional pmf, for observations X ∈ {0, ..., n}^p:

fX|S(x|s) = ( n choose x1 x2 ··· xp ) Π_{i=1}^{p} (θi^{(s)})^{xi} · 1[x1 + ... + xp = n],

where each θ^{(s)} = (θ1^{(s)}, ..., θp^{(s)})^T is in the canonical simplex

∆^{p−1} ≡ { v ∈ R^p : vi ≥ 0, Σ_{i=1}^{p} vi = 1 }.

MAP rule (taking logs and dropping constants):

δMAP(x) = arg max_{s∈{1,...,K}} { log fS(s) + x^T log θ^{(s)} }

...again, linear in x.

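A minimal sketch of the multinomial MAP rule (the score log fS(s) + x^T log θ^(s)), with assumed class parameters; this scoring is what multinomial naive-Bayes-style text classifiers compute.

```python
import numpy as np

def multinomial_map_classify(x, thetas, priors):
    """argmax_s log f_S(s) + x^T log theta^(s); x holds counts, each row of thetas sums to 1."""
    scores = np.log(priors) + x @ np.log(thetas).T
    return int(np.argmax(scores)) + 1

# Assumed parameters: 2 classes over a vocabulary of 4 symbols (illustrative only)
thetas = np.array([[0.5, 0.3, 0.1, 0.1],
                   [0.1, 0.2, 0.3, 0.4]])
priors = np.array([0.6, 0.4])
x = np.array([3, 1, 0, 2])          # observed counts (n = 6)
print(multinomial_map_classify(x, thetas, priors))
```
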
MAP Classification with Exponential Family Classes

Let A = S = {1, ..., K} (classification problem).

Prior class probabilities: {fS(s), s = 1, ..., K}.

Exponential family class-conditional pdf or pmf, for observations X ∈ X:

fX|S(x|s) = h(x) exp( (η^{(s)})^T φ(x) ) / Z(η^{(s)})

MAP rule (taking logs and dropping constants):

δMAP(x) = arg max_{s∈{1,...,K}} { log fS(s) − log Z(η^{(s)}) + (η^{(s)})^T φ(x) }

...linear in the sufficient statistics φ(x).

The previous examples (Gaussian and multinomial) are in this family.



Class Posteriors for Exponential Family Classes

Same scenario as in the previous slide.

Class posterior probabilities (from Bayes law):

fS|X(s|x) = [ fS(s) exp( (η^{(s)})^T φ(x) ) / Z(η^{(s)}) ] / Σ_{r=1}^{K} [ fS(r) exp( (η^{(r)})^T φ(x) ) / Z(η^{(r)}) ]

...taking logs,

log fS|X(s|x) = (η^{(s)})^T φ(x) + ζ^{(s)} − ξ(x),

where ζ^{(s)} = log fS(s) − log Z(η^{(s)}) and ξ(x) collects the terms that do not depend on s; thus

fS|X(s|x) = exp( (η^{(s)})^T φ(x) + ζ^{(s)} ) / Σ_{r=1}^{K} exp( (η^{(r)})^T φ(x) + ζ^{(r)} )

...called a generalized linear model (GLM).

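A small sketch of this softmax form of the class posteriors, for hypothetical natural parameters η^(s) and offsets ζ^(s); it only illustrates the normalization over classes.

```python
import numpy as np

def class_posteriors(phi_x, etas, zetas):
    """Softmax of the linear scores (eta^(s))^T phi(x) + zeta^(s) over the K classes."""
    scores = etas @ phi_x + zetas                 # one score per class
    scores -= scores.max()                        # subtract the max for numerical stability
    w = np.exp(scores)
    return w / w.sum()

# Assumed values: 3 classes, 2-dimensional sufficient statistic phi(x)
etas = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 1.0]])
zetas = np.log(np.array([0.5, 0.3, 0.2]))         # e.g., log priors minus log partition terms
print(class_posteriors(np.array([0.4, 1.2]), etas, zetas))
```
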
Bayesian MAP Estimation

Let A = S = R (scalar estimation problem).

Likelihood function fX|S, where X ∈ X, and prior pdf fS.

ε-loss: Lε(s, a) = 1 if |s − a| ≥ ε, and 0 if |s − a| < ε.

Bayes rule for a given ε:

δε(x) = arg min_{a∈R} ∫_{−∞}^{∞} fS|X(s|x) Lε(s, a) ds
      = arg min_{a∈R} ( 1 − ∫_{a−ε}^{a+ε} fS|X(s|x) ds )
      = arg max_{a∈R} ∫_{a−ε}^{a+ε} fS|X(s|x) ds

lim_{ε→0} δε(x) = arg max_{a∈R} fS|X(a|x) = δMAP(x)

Bayesian MAP Estimation

Let A = S = R (scalar estimation problem).

Likelihood function fX|S, where X ∈ X, and prior pdf fS.

MAP estimation criterion: δMAP(x) = arg max_{a∈R} fS|X(a|x)

From Bayes law:

δMAP(x) = arg max_{a∈R} log ( fX|S(x|a) fS(a) / fX(x) )
        = arg max_{a∈R} { log fX|S(x|a) + log fS(a) }   (log-likelihood + log-prior)

...a form (there are others) of trade-off between the information from the observation x and the prior knowledge/belief.

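A sketch of MAP estimation as the optimization of log-likelihood plus log-prior, for a toy setup assumed purely for illustration (Gaussian likelihood, Laplace prior), solved numerically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed toy model: x_i ~ N(s, 1) i.i.d., Laplace prior f_S(s) proportional to exp(-|s|/b)
x = np.array([1.2, 0.7, 1.9, 1.4])
b = 0.5

def neg_log_posterior(s):
    log_lik = -0.5 * np.sum((x - s) ** 2)      # log f_{X|S}(x|s), constants dropped
    log_prior = -abs(s) / b                    # log f_S(s), constants dropped
    return -(log_lik + log_prior)

s_map = minimize_scalar(neg_log_posterior).x   # MAP estimate as a 1-D optimization
print(s_map)
```
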


Bayesian MAP Estimation: Vector Case

Let A = S = R^p (vector estimation problem).

Likelihood function fX|S, where X ∈ X, and prior pdf fS.

ε-loss: Lε(s, a) = 1 if ||s − a||₂ ≥ ε, and 0 if ||s − a||₂ < ε.

Bayes rule for a given ε:

δε(x) = arg min_{a∈R^p} ∫_{R^p} fS|X(s|x) Lε(s, a) ds
      = arg min_{a∈R^p} ( 1 − ∫_{{s∈R^p : ||s−a||₂ < ε}} fS|X(s|x) ds )
      = arg max_{a∈R^p} ∫_{{s∈R^p : ||s−a||₂ < ε}} fS|X(s|x) ds

lim_{ε→0} δε(x) = arg max_{a∈R^p} fS|X(a|x) = δMAP(x)



Bayesian MMSE Estimation: Scalar and Vector Case

A = S = R^p (p ≥ 1); likelihood fX|S, for X ∈ X; prior fS.

Squared error loss: L(s, a) = ||s − a||₂² = Σ_{i=1}^{p} (si − ai)²

The Bayes optimal rule, the minimum mean squared error (MMSE) estimator,

δBayes(x) = arg min_{a∈R^p} E[ ||S − a||₂² | x ] = δMMSE(x)

(the quantity being minimized is the mean squared error, MSE)

...can be easily derived by equating the gradient to zero:

δMMSE(x) = arg min_{a∈R^p} { E[||S||₂² | x] − 2 a^T E[S|x] + ||a||₂² }
           (the first term does not depend on a, and E[||a||₂² | x] = ||a||₂²)
         = solution of 2a − 2 E[S|x] = 0
         = E[S|x]



MAP vs MMSE Estimation: Optimization vs Integration

A = S = R^p (p ≥ 1); likelihood fX|S, for X ∈ X; prior fS.

MAP estimate: δMAP(x) = arg max_{a∈R^p} { log fX|S(x|a) + log fS(a) }

- Optimization problem (maybe large scale, if p ≫ 1)
- Does not require knowing fX(x) (i.e., fS|X may be un-normalized)

MMSE estimate:

δMMSE(x) = E[S|x] = ∫_S s fS|X(s|x) ds = ( ∫_S s fX|S(x|s) fS(s) ds ) / ( ∫_S fX|S(x|s) fS(s) ds )

- Integration problem (maybe large scale, if p ≫ 1)
- In general, much harder than MAP estimation

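A sketch contrasting the two computations on a 1-D toy posterior (Gaussian likelihood, Laplace prior, with assumed values), approximating the MAP by a grid argmax and the MMSE by numerical integration of an un-normalized posterior.

```python
import numpy as np

x_obs = np.array([1.2, 0.7, 1.9, 1.4])        # assumed observations, x_i ~ N(s, 1) i.i.d.
s_grid = np.linspace(-5.0, 5.0, 20001)

# Un-normalized log posterior: log f_{X|S}(x|s) + log f_S(s), Laplace prior with scale 0.5
log_post = -0.5 * ((x_obs[None, :] - s_grid[:, None]) ** 2).sum(axis=1) - np.abs(s_grid) / 0.5
post = np.exp(log_post - log_post.max())       # un-normalized posterior on the grid

s_map = s_grid[np.argmax(post)]                        # MAP: an optimization (argmax)
s_mmse = np.sum(s_grid * post) / np.sum(post)          # MMSE: a (numerical) integration
print(s_map, s_mmse)
```

Note that the MMSE computation needs the normalizing sum in the denominator, while the argmax does not; this is exactly the optimization-vs-integration distinction on the slide.
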


A Classic: Linear-Gaussian Likelihood and Gaussian Prior

A = S = R^p (p ≥ 1); linear-Gaussian likelihood

fX|S(x|s) = N(x; As, C) = (det(2π C))^{−1/2} exp( −(1/2) (x − As)^T C^{−1} (x − As) ),

where x ∈ R^n, thus A ∈ R^{n×p},

...and Gaussian prior: fS(s) = N(s; µ, B).

Somewhat tedious manipulation yields the posterior

fS|X(s|x) = N(s; ŝ(x), D),

where

D = (A^T C^{−1} A + B^{−1})^{−1}   and   ŝ(x) = D (A^T C^{−1} x + B^{−1} µ).

Of course, δMMSE(x) = δMAP(x) = ŝ(x).

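A direct sketch of the posterior formulas above, with assumed A, C, B and µ; in practice one would solve linear systems rather than form explicit inverses, but the explicit form mirrors the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed problem sizes and parameters (illustrative only)
n, p = 5, 2
A = rng.standard_normal((n, p))
C = 0.5 * np.eye(n)                 # observation noise covariance
B = 2.0 * np.eye(p)                 # prior covariance
mu = np.zeros(p)                    # prior mean

s_true = np.array([1.0, -2.0])
x = A @ s_true + rng.multivariate_normal(np.zeros(n), C)

Cinv, Binv = np.linalg.inv(C), np.linalg.inv(B)
D = np.linalg.inv(A.T @ Cinv @ A + Binv)            # posterior covariance
s_hat = D @ (A.T @ Cinv @ x + Binv @ mu)            # posterior mean = MAP = MMSE estimate
print(s_hat)
```
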
Linear-Gaussian Observation, Gaussian Prior: Example 1

Let X = (X1, ..., Xn) be i.i.d. noisy (with variance σ²) observations of S ∈ R, i.e.,

fX|S(x|s) = Π_{i=1}^{n} N(xi; s, σ²) = N(x; As, σ² I),   with A = [1, ..., 1]^T.

With fS(s) = N(s; µ, λ²) (i.e., B = λ²),

fS|X(s|x) = N(s; ŝ(x), τ²),   ŝ(x) = ( (n/σ²) x̄ + (1/λ²) µ ) / ( n/σ² + 1/λ² ),   x̄ = (1/n) Σ_{i=1}^{n} xi.

ŝ(x) is a weighted average of the observation average and the prior mean.

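A tiny sketch of this scalar posterior mean, showing the weighted average between x̄ and µ; the numbers are arbitrary assumptions.

```python
import numpy as np

def posterior_mean(x, sigma2, mu, lam2):
    """Weighted average of x_bar (weight n/sigma^2) and mu (weight 1/lambda^2)."""
    n, x_bar = len(x), np.mean(x)
    w_data, w_prior = n / sigma2, 1.0 / lam2
    return (w_data * x_bar + w_prior * mu) / (w_data + w_prior)

x = np.array([2.1, 1.8, 2.4, 2.0])
print(posterior_mean(x, sigma2=1.0, mu=0.0, lam2=0.25))   # strong prior: pulled toward mu = 0
print(posterior_mean(x, sigma2=1.0, mu=0.0, lam2=100.0))  # weak prior: close to x_bar
```
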


Linear-Gaussian Observation, Gaussian Prior: Example 1 (cont.)

Let's study limit cases of ŝ(x) = ( (n/σ²) x̄ + (1/λ²) µ ) / ( n/σ² + 1/λ² ):

Large sample limit, the prior is ignored: lim_{n→∞} ŝ(x) = x̄

Low noise limit, the prior is ignored: lim_{σ²→0} ŝ(x) = x̄

Large noise limit, the data is ignored: lim_{σ²→∞} ŝ(x) = µ

High prior confidence limit, the data is ignored: lim_{λ²→0} ŝ(x) = µ

Low prior confidence limit, the prior is ignored: lim_{λ²→∞} ŝ(x) = x̄

Note: the prior is equivalent to an observation equal to µ with variance λ².



Linear-Gaussian Observation, Gaussian Prior: Example 2

Let X = (X1, ..., Xn) be independent noisy (with variances σi²) observations of S ∈ R, i.e.,

fX|S(x|s) = N(x; As, C),   with A = [1, ..., 1]^T and C = diag(σ1², ..., σn²).

With fS(s) = N(s; µ, λ²) (i.e., B = λ²),

ŝ(x) = ( Σ_{i=1}^{n} xi/σi² + µ/λ² ) / ( Σ_{i=1}^{n} 1/σi² + 1/λ² ).

Each sample is weighted proportionally to 1/σi² (the precision).

Notice that, again, lim_{n→∞} ŝ(x) is independent of the prior.

Bernstein-Von Mises Theorem

Notice that

Example 1: δMLE(x) = (1/n) Σ_{i=1}^{n} xi

Example 2: δMLE(x) = ( Σ_{i=1}^{n} 1/σi² )^{−1} Σ_{i=1}^{n} xi/σi²

In the previous examples, we noticed that (independently of the prior)

lim_{n→∞} δMMSE(x) = lim_{n→∞} δMAP(x) = δMLE(x),

where δMLE(x) is the mean or the weighted mean of the observations.

This is a general result, known as the Bernstein-Von Mises theorem; a sufficient condition is that the prior is continuous and non-zero everywhere.



Conjugate Priors

In the previous examples, the posterior and the prior were both Gaussian; were we lucky? No, the prior was conjugate!

Consider a likelihood/observation model fX|S, for S ∈ S and X ∈ X,

...and a family of priors P = {fS(s; θ), θ ∈ Θ}.

P is conjugate for fX|S if, for any θ ∈ Θ and x ∈ X,

fS|X(·|x) = fX|S(x|·) fS(·; θ) / fX(x; θ) ∈ P   ⇔   ∃ θ′ ∈ Θ : fS|X(·|x) = fS(·; θ′).

Example: in Example 1 (above), fS(s) = N(s; µ, λ²) and fS|X(s|x) = N(s; ŝ(x), τ²).



Conjugate Priors: Example

S ∈ [0, 1] is the probability of success in n i.i.d. Bernoulli samples.

X = (X1, ..., Xn) ∈ {0, 1}^n:   fX|S(x|s) = s^{Σ_{i=1}^{n} xi} (1 − s)^{n − Σ_{i=1}^{n} xi}

Conjugate prior: Beta(s; α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] s^{α−1} (1 − s)^{β−1}

Expected value: E[S] = α / (α + β)

Mode: arg max_s Beta(s; α, β) = (α − 1) / (α + β − 2),   if α, β > 1.



Conjugate Priors: Example (cont.)

S ∈ [0, 1] is the probability of success in n i.i.d. Bernoulli samples.

X = (X1, ..., Xn) ∈ {0, 1}^n:   fX|S(x|s) = s^{Σ_{i=1}^{n} xi} (1 − s)^{n − Σ_{i=1}^{n} xi}

Conjugate prior: Beta(s; α, β) = [ Γ(α + β) / (Γ(α) Γ(β)) ] s^{α−1} (1 − s)^{β−1}

Posterior:

fS|X(s|x) ∝ s^{Σ_i xi} (1 − s)^{n − Σ_i xi} s^{α−1} (1 − s)^{β−1}
          = Beta( s; α + Σ_{i=1}^{n} xi, β + n − Σ_{i=1}^{n} xi ) ≡ Beta(s; α′, β′)

Estimates:

δMMSE(x) = ( α + Σ_{i=1}^{n} xi ) / ( α + β + n ),     δMAP(x) = ( α − 1 + Σ_{i=1}^{n} xi ) / ( α + β − 2 + n ).

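A small sketch of these closed-form Beta posterior updates and the resulting MAP/MMSE point estimates; the data and hyperparameters are the ones used in the plots on the next slide.

```python
import numpy as np

def beta_bernoulli_posterior(x, alpha, beta):
    """Posterior Beta(alpha + #successes, beta + #failures) and its MMSE/MAP point estimates."""
    k, n = int(np.sum(x)), len(x)
    a_post, b_post = alpha + k, beta + n - k
    mmse = a_post / (a_post + b_post)                   # posterior mean
    map_ = (a_post - 1) / (a_post + b_post - 2)         # posterior mode (requires a_post, b_post > 1)
    return a_post, b_post, mmse, map_

x = np.array([1, 1, 1, 1, 1])        # five successes, as in the slide's plots
print(beta_bernoulli_posterior(x, alpha=1, beta=1))   # flat prior: MAP = MLE = 1
print(beta_bernoulli_posterior(x, alpha=5, beta=5))   # stronger prior pulls both estimates toward 1/2
```
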


Conjugate Priors: Example (cont.)

[Figure: plots of fS|X(s|x) for n = 1, 5, 10, 20, 50, 500 samples; solid curves: α = β = 1; dotted curves: α = β = 5. Notice: (x1, ..., x5) = (1, 1, 1, 1, 1).]

Observe the Bernstein-Von Mises effect.

Note that δMLE(x) = δMAP(x) for α = β = 1 (flat prior).

Conjugate Priors: Exponential Families

Let S ∈ R^p be the parameter of an exponential family; for a single observation X,

fX|S(x|s) = h(x) exp( η(s)^T φ(x) ) / Z(s).

For n i.i.d. observations (X1, ..., Xn),

fX|S(x|s) = (1/Z(s))^n ( Π_{i=1}^{n} h(xi) ) exp( η(s)^T Σ_{i=1}^{n} φ(xi) ).

The conjugate prior has the form

fS(s; γ, ν) ∝ (1/Z(s))^ν exp( η(s)^T γ ).

Posterior:

fS|X(s|x; γ, ν) ∝ (1/Z(s))^{n+ν} exp( η(s)^T ( γ + Σ_{i=1}^{n} φ(xi) ) )

...the conjugate prior behaves like virtual additional observations.

Conjugate Priors for Exponential Families: Bernoulli/Binomial

S ∈ [0, 1] is the probability of success in n i.i.d. Bernoulli samples (see the earlier example).

In exponential family form:

fX|S(x|s) = (1 − s)^n exp( log(s/(1 − s)) Σ_{i=1}^{n} xi ),   with 1/Z(s) = 1 − s and η(s) = log(s/(1 − s)).

Conjugate prior:

fS(s; γ, ν) ∝ (1/Z(s))^ν exp( η(s)^T γ ) = (1 − s)^ν exp( γ log(s/(1 − s)) )
            = s^γ (1 − s)^{ν−γ} ∝ Beta(s; γ + 1, ν − γ + 1).



An Important Example: Multinomial Observations

Let S ∈ ∆^{p−1} ⊂ [0, 1]^p be estimated from X ∈ {0, ..., n}^p:

fX|S(x|s) = ( n choose x1 x2 ··· xp ) 1[x1 + ... + xp = n] Π_{i=1}^{p} si^{xi},

where the multinomial coefficient and the indicator together form h(x).

In exponential family form (notice that Z(s) = 1):

fX|S(x|s) = h(x) exp( x^T log s ) = h(x) exp( x^T η(s) ),   with log s = (log s1, ..., log sp)^T.

Conjugate prior:

fS(s; γ) ∝ 1[s ∈ ∆^{p−1}] exp( γ^T log s ) = 1[s ∈ ∆^{p−1}] Π_{i=1}^{p} si^{γi} ∝ Dirichlet(s; γ + 1).



Multinomial Observations and Dirichlet Prior

Multinomial observations with Dirichlet prior:

fX|S(x|s) = Multi(x; n, s),   fS(s; γ) = Dirichlet(s; γ + 1).

[Figure: Dirichlet examples on the simplex ∆² ⊂ R³ (coordinates s1, s2, s3): the simplex itself, Dirichlet(s; [2, 2, 2]), and Dirichlet(s; [20, 2, 2]).]

Posterior and estimates: fS|X(s|x) = Dirichlet(s; γ + x + 1),

δMAP(x) = (γ + x) / ( 1p^T (γ + x) ),   δMMSE(x) = (γ + x + 1) / ( 1p^T (γ + x + 1) ),   where 1p = [1, ..., 1]^T.

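A sketch of these Dirichlet posterior updates and point estimates for assumed counts and hyperparameters.

```python
import numpy as np

def dirichlet_multinomial_estimates(x, gamma):
    """Posterior Dirichlet(gamma + x + 1) and the MAP / MMSE point estimates from the slide."""
    x, gamma = np.asarray(x, float), np.asarray(gamma, float)
    s_map = (gamma + x) / np.sum(gamma + x)
    s_mmse = (gamma + x + 1) / np.sum(gamma + x + 1)
    return s_map, s_mmse

x = np.array([7, 2, 1, 0])            # assumed category counts, n = 10
gamma = np.ones_like(x)               # prior Dirichlet(s; gamma + 1) = Dirichlet(s; [2, 2, 2, 2])
print(dirichlet_multinomial_estimates(x, gamma))
```
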


Example of Multinomial Model: Bag of Words (BoW)

Vocabulary with p words (stemmed, without stop-words, etc.).

Multinomial language model: S ∈ ∆^{p−1} contains the probability of occurrence of each of the p words.

Probability of a set of documents with a total of n words, with word counts x1, ..., xp: Multi(x; n, s).

Example: an observed text, its vector x of term counts (bag of words, with "unk" standing for all other words), and the MAP estimate of the word probabilities with a Dirichlet(s; [2, 2, ..., 2]) prior. [The example text and count tables appear as figures on the slide and are not reproduced here.]

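A minimal sketch of the bag-of-words pipeline described above: count words against a small assumed vocabulary (with "unk" for everything else) and form the Dirichlet-MAP estimate of the word probabilities; the vocabulary and text are invented for illustration.

```python
import numpy as np
from collections import Counter

vocab = ["bayes", "prior", "posterior", "loss", "unk"]     # assumed toy vocabulary
text = "the posterior combines the prior with the likelihood via bayes law"

# Term counts x (bag of words); words outside the vocabulary are mapped to "unk"
counts = Counter(w if w in vocab else "unk" for w in text.split())
x = np.array([counts[w] for w in vocab], dtype=float)

# MAP estimate of the word probabilities under a Dirichlet(s; [2, 2, ..., 2]) prior,
# i.e., gamma = [1, ..., 1] in the notation above: s_MAP = (x + 1) / sum(x + 1)
s_map = (x + 1) / np.sum(x + 1)
print(dict(zip(vocab, np.round(s_map, 3))))
```
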


Additive Loss Functions and Marginal Inference

Recall the optimal Bayes rule:

δBayes(x) = arg min_{a∈A} E[L(S, a) | x]

Let S = S1 × S2 × ··· × Sd and A = A1 × A2 × ··· × Ad,

S = (S1, ..., Sd), where Si ∈ Si, and a = (a1, ..., ad), where ai ∈ Ai.

Let the loss have an additive structure:

L(s, a) = Σ_{i=1}^{d} Li(si, ai),   with Li : Si × Ai → R+.

Example of additive loss: L(s, a) = ||s − a||₂² = Σ_{i=1}^{d} (si − ai)².



Additive Loss Functions and Marginal Inference

Bayes optimal rule:

δBayes(x) = arg min_{a1∈A1,...,ad∈Ad} E[ Σ_{i=1}^{d} Li(Si, ai) | x ]
          = arg min_{a1∈A1,...,ad∈Ad} Σ_{i=1}^{d} E[ Li(Si, ai) | x ]
          = ( arg min_{a1∈A1} E[L1(S1, a1) | x], ..., arg min_{ad∈Ad} E[Ld(Sd, ad) | x] )

Marginal Bayes decisions:

arg min_{ai∈Ai} E[Li(Si, ai) | x] = arg min_{ai∈Ai} ∫_{Si} Li(si, ai) fSi|X(si|x) dsi,

where fSi|X is the posterior marginal.

If Li(si, ai) is the 0/1 loss, δBayes = δMPM (the maximizer of posterior marginals).

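To make the MPM idea concrete, here is a tiny discrete sketch (not from the slides) with an assumed joint posterior table over two binary variables, comparing the joint MAP with the maximizer of the posterior marginals.

```python
import numpy as np

# Assumed joint posterior f_{S1,S2|X}(s1, s2 | x) over {0,1} x {0,1} (rows: s1, columns: s2)
joint = np.array([[0.35, 0.05],
                  [0.30, 0.30]])

s_joint_map = np.unravel_index(np.argmax(joint), joint.shape)   # maximizer of the joint posterior
s_mpm = (int(np.argmax(joint.sum(axis=1))),                     # maximizer of the marginal of S1
         int(np.argmax(joint.sum(axis=0))))                     # maximizer of the marginal of S2

print(s_joint_map, s_mpm)   # (0, 0) vs (1, 0): the two criteria can disagree
```
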
Example of Marginal Inference: Estimation of Variance

Estimation: S = (µ, σ²) ∈ A = S = R × R+.

Observations X = (X1, ..., Xn), fX|S(x|µ, σ²) = Π_{i=1}^{n} N(xi; µ, σ²).

Flat prior fS(s) = K, thus fS|X(s|x) ∝ fX|S(x|s).

MAP estimate: (µ̂MAP, σ̂²MAP) = (µ̂MLE, σ̂²MLE), with

µ̂MAP = (1/n) Σ_{i=1}^{n} xi,   σ̂²MAP = (1/n) Σ_{i=1}^{n} (xi − µ̂MLE)²,

where σ̂²MAP = σ̂²MLE is known to be biased: E[σ̂²MLE] = ((n − 1)/n) σ².

Posterior marginal (integrating out µ):

fΣ²|X(σ²|x) ∝ ∫_{−∞}^{∞} Π_{i=1}^{n} N(xi; µ, σ²) dµ = (2πσ²)^{−(n−1)/2} exp( −(1/(2σ²)) Σ_{i=1}^{n} (xi − µ̂MLE)² ).

The MPM estimate, σ̂²MPM = (n/(n−1)) σ̂²MAP, is unbiased.

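A quick simulation sketch of the bias claim above: the MAP/MLE variance underestimates σ² by a factor (n−1)/n, while the MPM estimate (the usual unbiased sample variance) corrects it; the sample size, variance and number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
var_map = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)   # (1/n) sum (x_i - x_bar)^2
var_mpm = var_map * n / (n - 1)                                       # unbiased MPM correction

print(var_map.mean())   # close to (n-1)/n * sigma2 = 3.2
print(var_mpm.mean())   # close to sigma2 = 4.0
```
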
