Statistical Learning
Lecture 2: More on Bayesian Inference
Mário A. T. Figueiredo
Instituto Superior Técnico & Instituto de Telecomunicações
University of Lisbon, Portugal
2018
Bayesian Decision Rules

Posterior expected loss of a decision $a \in \mathcal{A}$:

$$
\rho(f_S, a \mid x) = E\left[L(S,a) \mid x\right] =
\begin{cases}
\displaystyle\sum_{s \in \mathcal{S}} f_{S|X}(s|x)\, L(s,a), & \text{if } S \text{ is discrete}\\[6pt]
\displaystyle\int_{\mathcal{S}} f_{S|X}(s|x)\, L(s,a)\, ds, & \text{if } S \text{ is continuous}
\end{cases}
$$

where, via Bayes law, $f_{S|X}(s|x) = f_{X|S}(x|s)\, f_S(s)/f_X(x)$.

Bayes optimal decision rule:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathcal{A}} \rho(f_S, a \mid x) = \arg\min_{a \in \mathcal{A}} E\left[L(S,a) \mid x\right]
$$

...the minimizer of the expected loss (with respect to the unknown $S$), conditioned on the observations $X = x$.
MAP Classification

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (discrete/classification problem)

Observations: $X \in \mathcal{X}$, with $\{f_{X|S}(\cdot|s),\ s \in \mathcal{S}\}$ (set of class-conditional pdfs or pmfs)

Prior pmf $f_S$ (with $f_S(s) \ge 0$ and $\sum_{s \in \mathcal{S}} f_S(s) = 1$)

Adopt the 0/1 loss: $L(s,a) = 0$ if $s = a$, and $1$ otherwise.

Optimal Bayesian decision:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathcal{S}} \sum_{s=1}^{K} L(s,a)\, f_{S|X}(s|x)
= \arg\min_{a \in \mathcal{S}} \left(1 - f_{S|X}(a|x)\right)
= \arg\max_{a \in \mathcal{S}} f_{S|X}(a|x) \equiv \delta_{\mathrm{MAP}}(x)
$$

MAP = maximum a posteriori criterion.
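For concreteness, a minimal Python sketch of the MAP rule, assuming a hypothetical pair of scalar Gaussian classes with made-up priors (none of these numbers are from the lecture):

import numpy as np
from scipy.stats import norm

# Hypothetical setup: K = 2 classes with Gaussian class-conditionals.
priors = np.array([0.3, 0.7])                      # f_S(s)
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def map_classify(x):
    # Posterior is proportional to likelihood times prior; the argmax
    # is unaffected by the normalizing constant f_X(x).
    scores = norm.pdf(x, means, stds) * priors
    return np.argmax(scores) + 1                   # classes labeled 1..K

print(map_classify(0.2))   # the likelihood near mu_1 overcomes the smaller prior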
Binary MAP Classification

Let $\mathcal{A} = \mathcal{S} = \{1, 2\}$ (binary classification problem)

Observation model $\{f_{X|S}(\cdot|1),\ f_{X|S}(\cdot|2)\}$

Prior $f_S(1)$ (of course, $f_S(2) = 1 - f_S(1)$)

MAP rule:

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \{1,2\}} f_{X|S}(x|a)\, f_S(a) =
\begin{cases} 1, & \text{if } l(x) \ge t\\ 2, & \text{if } l(x) < t \end{cases}
$$

where

$$
l(x) = \underbrace{\frac{f_{X|S}(x|1)}{f_{X|S}(x|2)}}_{\text{likelihood ratio}}
\qquad\text{and}\qquad
t = \underbrace{\frac{f_S(2)}{f_S(1)}}_{\text{threshold}}
$$

If $f_S(1) = f_S(2) = 1/2$, we have a maximum likelihood classifier, a.k.a. a likelihood ratio test (LRT).
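A small sketch of the LRT form of the rule, again with hypothetical Gaussian class-conditionals (assumed parameters):

import numpy as np
from scipy.stats import norm

# Hypothetical two-class model (assumed parameters).
f1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # f_{X|S}(x|1)
f2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # f_{X|S}(x|2)

def lrt(x, p1=0.5):
    l = f1(x) / f2(x)          # likelihood ratio l(x)
    t = (1 - p1) / p1          # MAP threshold t = f_S(2)/f_S(1)
    return 1 if l >= t else 2

print(lrt(0.9))   # with p1 = 0.5 this is the ML classifier (t = 1)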
Binary Classification with General Costs

Let $\mathcal{A} = \mathcal{S} = \{1, 2\}$ (binary classification problem)

Observation model $\{f_{X|S}(\cdot|1),\ f_{X|S}(\cdot|2)\}$ and prior $f_S(1)$

Optimal Bayesian decision:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \{1,2\}} \big[ L(1,a)\, f_{X|S}(x|1)\, f_S(1) + L(2,a)\, f_{X|S}(x|2)\, f_S(2) \big]
= \begin{cases} 1, & \text{if } l(x) \ge t\\ 2, & \text{if } l(x) < t \end{cases}
$$

where $l(x)$ is as before and the threshold becomes

$$
t = \frac{f_S(2)\,\big(L(2,1) - L(2,2)\big)}{f_S(1)\,\big(L(1,2) - L(1,1)\big)}
$$

...i.e., the loss structure affects the decision threshold.
MAP Classification with Gaussian Classes

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (classification problem)

Prior class probabilities: $\{f_S(s),\ s = 1, \dots, K\}$

Gaussian class-conditional densities, for observations $X \in \mathbb{R}^p$:

$$
f_{X|S}(x|s) = \mathcal{N}(x; \mu_s, C_s)
= \frac{1}{\sqrt{\det(2\pi C_s)}} \exp\Big( -\frac{1}{2} (x - \mu_s)^T C_s^{-1} (x - \mu_s) \Big)
$$

MAP rule (taking logs and dropping constants):

$$
\delta_{\mathrm{MAP}}(x)
= \arg\min_{s \in \{1,\dots,K\}} \Big\{ \underbrace{-\log f_S(s) + \tfrac{1}{2}\log\det(C_s)}_{\gamma_s}
+ \tfrac{1}{2} \underbrace{(x - \mu_s)^T C_s^{-1} (x - \mu_s)}_{\text{(squared) Mahalanobis distance } \|x - \mu_s\|^2_{C_s}} \Big\}
= \arg\min_{s \in \{1,\dots,K\}} \Big\{ \gamma_s + \tfrac{1}{2} \|x - \mu_s\|^2_{C_s} \Big\}
$$
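A sketch of the resulting quadratic discriminant score, with assumed (hypothetical) means, covariances, and priors:

import numpy as np

def qda_score(x, mu, C, prior):
    # gamma_s + (1/2) squared Mahalanobis distance, as in the MAP rule above;
    # C is assumed symmetric positive-definite, so slogdet's sign is +1.
    sign, logdet = np.linalg.slogdet(C)
    d = x - mu
    maha = d @ np.linalg.solve(C, d)
    return -np.log(prior) + 0.5 * logdet + 0.5 * maha

# Hypothetical binary problem in R^2 (assumed parameters).
mus = [np.zeros(2), np.array([2.0, 0.0])]
Cs = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 0.5])
s_hat = 1 + int(np.argmin([qda_score(x, m, C, p)
                           for m, C, p in zip(mus, Cs, priors)]))
print(s_hat)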
MAP Classification with Gaussian Classes (II)

MAP rule: $\delta_{\mathrm{MAP}}(x) = \arg\min_{s \in \{1,\dots,K\}} \big\{ \gamma_s + \tfrac{1}{2} \|x - \mu_s\|^2_{C_s} \big\}$

Decision regions: $\{R_s,\ s = 1, \dots, K\}$ (a partition of $\mathbb{R}^p$), such that $R_s = \{x \in \mathbb{R}^p : \delta_{\mathrm{MAP}}(x) = s\}$

Quadratic discrimination (quadratic boundaries); a.k.a. quadratic discriminant analysis (QDA). Example: binary classification in $\mathbb{R}^2$:

[Figure: "Parabolic Boundary"; level curves of the two class-conditional Gaussian densities (and a few samples); the boundary is quadratic (parabolic); increasing (resp. decreasing) $f_S(1)$ shifts the boundary away from (resp. towards) $\mu_1$.]
MAP Classification with Gaussian Classes (III)

In the case of a common covariance matrix, $C_s = C$:

$$
\|x - \mu_s\|^2_C = x^T C^{-1} x + \mu_s^T C^{-1} \mu_s - 2 (C^{-1}\mu_s)^T x
$$

MAP rule:

$$
\delta_{\mathrm{MAP}}(x) = \arg\min_{s \in \{1,\dots,K\}} \big\{ \gamma_s + x^T \beta_s \big\},
$$

where $\beta_s = -(C^{-1}\mu_s)$ and $\gamma_s = -\log f_S(s) + \tfrac{1}{2}\mu_s^T C^{-1} \mu_s$

...the MAP classifier is linear in $x$; region boundaries are hyperplanes.

[Figure: "Linear Boundary" (two classes) and "All Linear Boundaries" (several classes); examples of linear decision boundaries in $\mathbb{R}^2$.]
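A sketch of the linear (shared-covariance) case, precomputing $\beta_s$ and $\gamma_s$ for assumed parameters:

import numpy as np

# Shared covariance C: the quadratic term x^T C^{-1} x is common to all
# classes, so each class score is affine in x (hypothetical parameters).
C = np.array([[1.0, 0.2], [0.2, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.4, 0.6]
Cinv = np.linalg.inv(C)

betas = [-(Cinv @ mu) for mu in mus]
gammas = [-np.log(p) + 0.5 * mu @ Cinv @ mu for mu, p in zip(mus, priors)]

def lda_classify(x):
    scores = [g + x @ b for g, b in zip(gammas, betas)]
    return 1 + int(np.argmin(scores))

print(lda_classify(np.array([1.0, 0.4])))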
MAP Classification with Multinomial Classes

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (classification problem)

Prior class probabilities: $\{f_S(s),\ s = 1, \dots, K\}$

Multinomial class-conditional pmf, for observations $X \in \{0, \dots, n\}^p$:

$$
f_{X|S}(x|s) = \binom{n}{x_1\, x_2 \cdots x_p} \prod_{i=1}^{p} \big(\theta_i^{(s)}\big)^{x_i}\, \mathbb{1}_{x_1 + \dots + x_p = n},
$$

where each $\theta^{(s)}$ is in the canonical simplex,

$$
\theta^{(s)} = \begin{bmatrix} \theta_1^{(s)} \\ \vdots \\ \theta_p^{(s)} \end{bmatrix} \in \Delta_{p-1}
\equiv \Big\{ v \in \mathbb{R}^p : v_i \ge 0,\ \sum_{i=1}^p v_i = 1 \Big\}
$$

MAP rule (taking logs and dropping constants):

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{s \in \{1,\dots,K\}} \big\{ \log f_S(s) + x^T \log\theta^{(s)} \big\}
$$

...again, linear in $x$.
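A sketch of the multinomial MAP rule as a linear score in the count vector $x$, with hypothetical class parameters $\theta^{(s)}$:

import numpy as np

# Hypothetical K = 2 classes over p = 4 count features (assumed thetas).
thetas = np.array([[0.4, 0.3, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.4]])   # each row in the simplex
priors = np.array([0.5, 0.5])

def multinomial_map(x):
    # log f_S(s) + x^T log theta^{(s)}; the multinomial coefficient is
    # common to all classes and can be dropped.
    scores = np.log(priors) + x @ np.log(thetas).T
    return 1 + int(np.argmax(scores))

print(multinomial_map(np.array([5, 1, 1, 0])))   # counts favor class 1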
MAP Classification with Exponential Family Classes

Let $\mathcal{A} = \mathcal{S} = \{1, \dots, K\}$ (classification problem)

Prior class probabilities: $\{f_S(s),\ s = 1, \dots, K\}$

Exponential family class-conditional pdf or pmf, for observations $X \in \mathcal{X}$:

$$
f_{X|S}(x|s) = \frac{1}{Z(\eta^{(s)})}\, h(x) \exp\big( (\eta^{(s)})^T \phi(x) \big)
$$

MAP rule (taking logs and dropping constants):

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{s \in \{1,\dots,K\}} \big\{ \log f_S(s) - \log Z(\eta^{(s)}) + (\eta^{(s)})^T \phi(x) \big\}
$$

...linear in the sufficient statistics $\phi(x)$.

The previous examples (Gaussian and multinomial) are in this family.
Class Posteriors for Exponential Family Classes

Same scenario as in the previous slide.

Class posterior probabilities (from Bayes law):

$$
f_{S|X}(s|x) = \frac{\dfrac{f_S(s)}{Z(\eta^{(s)})} \exp\big( (\eta^{(s)})^T \phi(x) \big)}
{\displaystyle\sum_{r=1}^{K} \frac{f_S(r)}{Z(\eta^{(r)})} \exp\big( (\eta^{(r)})^T \phi(x) \big)}
$$

...taking logs,

$$
\log f_{S|X}(s|x) = (\eta^{(s)})^T \phi(x) + \zeta^{(s)} - \xi(x),
$$

thus

$$
f_{S|X}(s|x) = \frac{\exp\big( (\eta^{(s)})^T \phi(x) + \zeta^{(s)} \big)}
{\displaystyle\sum_{r=1}^{K} \exp\big( (\eta^{(r)})^T \phi(x) + \zeta^{(r)} \big)}
$$

...called a generalized linear model (GLM).
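A numerically stable sketch of these softmax-form posteriors, for assumed (hypothetical) values of $\eta^{(s)}$, $\zeta^{(s)}$, and $\phi(x)$:

import numpy as np

def class_posteriors(phi_x, etas, zetas):
    # Softmax of the linear scores (eta^(s))^T phi(x) + zeta^(s);
    # subtracting the max keeps the exponentials numerically stable.
    scores = etas @ phi_x + zetas
    scores -= scores.max()
    w = np.exp(scores)
    return w / w.sum()

# Hypothetical 3-class model with 2-dimensional sufficient statistics.
etas = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
zetas = np.array([0.0, 0.1, -0.2])
print(class_posteriors(np.array([0.5, -0.3]), etas, zetas))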
Bayesian MAP Estimation

Let $\mathcal{A} = \mathcal{S} = \mathbb{R}$ (scalar estimation problem)

Likelihood function $f_{X|S}$, where $X \in \mathcal{X}$, and prior pdf $f_S$

$\varepsilon$-loss:
$$
L_\varepsilon(s, a) = \begin{cases} 1, & \text{if } |s - a| \ge \varepsilon\\ 0, & \text{if } |s - a| < \varepsilon \end{cases}
$$

Bayes rule for given $\varepsilon$:

$$
\delta_\varepsilon(x) = \arg\min_{a \in \mathbb{R}} \int_{-\infty}^{\infty} f_{S|X}(s|x)\, L_\varepsilon(s, a)\, ds
= \arg\min_{a \in \mathbb{R}} \Big( 1 - \int_{a-\varepsilon}^{a+\varepsilon} f_{S|X}(s|x)\, ds \Big)
= \arg\max_{a \in \mathbb{R}} \int_{a-\varepsilon}^{a+\varepsilon} f_{S|X}(s|x)\, ds
$$

In the limit,

$$
\lim_{\varepsilon \to 0} \delta_\varepsilon(x) = \arg\max_{a \in \mathbb{R}} f_{S|X}(a|x) = \delta_{\mathrm{MAP}}(x)
$$
Bayesian MAP Estimation

Let $\mathcal{A} = \mathcal{S} = \mathbb{R}$ (scalar estimation problem)

Likelihood function $f_{X|S}$, where $X \in \mathcal{X}$, and prior pdf $f_S$

MAP estimation criterion: $\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \mathbb{R}} f_{S|X}(a|x)$

From Bayes law:

$$
\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \mathbb{R}} \log \frac{f_{X|S}(x|a)\, f_S(a)}{f_X(x)}
= \arg\max_{a \in \mathbb{R}} \Big[ \underbrace{\log f_{X|S}(x|a)}_{\text{log-likelihood}} + \underbrace{\log f_S(a)}_{\text{log-prior}} \Big]
$$

...a form (there are others) of trade-off between the information in the observation $x$ and the prior knowledge/belief.
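A sketch of MAP estimation as a one-dimensional optimization, assuming (for illustration only) a Gaussian likelihood and a Laplacian prior:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical scalar model (assumed choices): Gaussian likelihood around s,
# Laplace(0, 1) prior; MAP maximizes log-likelihood + log-prior.
x_obs = 1.5

def neg_log_posterior(a):
    log_lik = norm.logpdf(x_obs, loc=a, scale=1.0)
    log_prior = -np.abs(a)            # Laplace log-density, up to a constant
    return -(log_lik + log_prior)

res = minimize_scalar(neg_log_posterior)
print(res.x)   # the Laplace prior shrinks x_obs toward 0 (soft-thresholding)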
Bayesian MAP Estimation: Vector Case

Let $\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ (vector estimation problem)

Likelihood function $f_{X|S}$, where $X \in \mathcal{X}$, and prior pdf $f_S$

$\varepsilon$-loss:
$$
L_\varepsilon(s, a) = \begin{cases} 1, & \text{if } \|s - a\|_2 \ge \varepsilon\\ 0, & \text{if } \|s - a\|_2 < \varepsilon \end{cases}
$$

Bayes rule for given $\varepsilon$:

$$
\delta_\varepsilon(x) = \arg\min_{a \in \mathbb{R}^p} \int_{\mathbb{R}^p} f_{S|X}(s|x)\, L_\varepsilon(s, a)\, ds
= \arg\min_{a \in \mathbb{R}^p} \Big( 1 - \int_{\{s \in \mathbb{R}^p :\, \|s - a\|_2 < \varepsilon\}} f_{S|X}(s|x)\, ds \Big)
= \arg\max_{a \in \mathbb{R}^p} \int_{\{s \in \mathbb{R}^p :\, \|s - a\|_2 < \varepsilon\}} f_{S|X}(s|x)\, ds
$$

In the limit,

$$
\lim_{\varepsilon \to 0} \delta_\varepsilon(x) = \arg\max_{a \in \mathbb{R}^p} f_{S|X}(a|x) = \delta_{\mathrm{MAP}}(x)
$$
Bayesian MMSE Estimation: Scalar and Vector Case

$\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ ($p \ge 1$); likelihood $f_{X|S}$, for $X \in \mathcal{X}$; prior $f_S$

Squared error loss: $L(s, a) = \|s - a\|_2^2 = \sum_{i=1}^p (s_i - a_i)^2$

The Bayes optimal rule, minimum mean squared error (MMSE):

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathbb{R}^p} \underbrace{E\big[ \|S - a\|_2^2 \,\big|\, x \big]}_{\text{mean squared error (MSE)}} = \delta_{\mathrm{MMSE}}(x)
$$

...can be easily derived by equating the gradient to zero:

$$
\delta_{\mathrm{MMSE}}(x) = \arg\min_{a \in \mathbb{R}^p} \Big\{ \underbrace{E\big[ \|S\|_2^2 \,\big|\, x \big]}_{\text{indep. of } a} - 2\, a^T E[S|x] + \underbrace{E\big[ \|a\|_2^2 \,\big|\, x \big]}_{\|a\|_2^2} \Big\}
$$

Setting the gradient to zero gives $2a - 2E[S|x] = 0$, thus

$$
\delta_{\mathrm{MMSE}}(x) = E[S|x]
$$
MAP vs MMSE Estimation: Optimization vs Integration

$\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ ($p \ge 1$); likelihood $f_{X|S}$, for $X \in \mathcal{X}$; prior $f_S$

MAP estimate: $\delta_{\mathrm{MAP}}(x) = \arg\max_{a \in \mathbb{R}^p} \big[ \log f_{X|S}(x|a) + \log f_S(a) \big]$
- Optimization problem (maybe large-scale, if $p \gg 1$)
- Does not require knowing $f_X(x)$ (i.e., $f_{S|X}$ may be un-normalized)

MMSE estimate:

$$
\delta_{\mathrm{MMSE}}(x) = E[S|x] = \int_{\mathcal{S}} s\, f_{S|X}(s|x)\, ds
= \frac{\displaystyle\int_{\mathcal{S}} s\, f_{X|S}(x|s)\, f_S(s)\, ds}{\displaystyle\int_{\mathcal{S}} f_{X|S}(x|s)\, f_S(s)\, ds}
$$

- Integration problem (maybe large-scale, if $p \gg 1$)
- In general, much harder than MAP estimation
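To make the optimization-vs-integration contrast concrete, a grid-based sketch for a hypothetical scalar model (Gamma prior, Gaussian likelihood; assumed choices), whose skewed posterior makes the two estimates differ:

import numpy as np
from scipy.stats import norm, gamma

x_obs = 1.0
s = np.linspace(1e-4, 10, 100000)
unnorm = norm.pdf(x_obs, loc=s, scale=1.0) * gamma.pdf(s, a=2.0)

s_map = s[np.argmax(unnorm)]                       # mode: no normalization needed
s_mmse = np.sum(s * unnorm) / np.sum(unnorm)       # ratio of integrals (grid sums)
print(s_map, s_mmse)   # the mode and the mean differ for a skewed posterior

In one dimension the integral is trivial on a grid; the point of the slide is that for $p \gg 1$ such grids are infeasible, while optimization often still is feasible.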
A Classic: Linear-Gaussian Likelihood and Gaussian Prior

$\mathcal{A} = \mathcal{S} = \mathbb{R}^p$ ($p \ge 1$); linear-Gaussian likelihood

$$
f_{X|S}(x|s) = \mathcal{N}(x; As, C) = \frac{\exp\big( -\frac{1}{2} (x - As)^T C^{-1} (x - As) \big)}{\sqrt{\det(2\pi C)}}
$$

where $x \in \mathbb{R}^n$, thus $A \in \mathbb{R}^{n \times p}$,

...and Gaussian prior: $f_S(s) = \mathcal{N}(s; \mu, B)$

Somewhat tedious manipulation yields the posterior

$$
f_{S|X}(s|x) = \mathcal{N}(s; \hat{s}(x), D),
$$

where

$$
D = \big( A^T C^{-1} A + B^{-1} \big)^{-1} \qquad\text{and}\qquad \hat{s}(x) = D \big( A^T C^{-1} x + B^{-1} \mu \big)
$$

Of course, $\delta_{\mathrm{MMSE}}(x) = \delta_{\mathrm{MAP}}(x) = \hat{s}(x)$
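A direct numpy transcription of these posterior formulas, with assumed (hypothetical) sizes and parameter values:

import numpy as np

def gaussian_posterior(x, A, C, mu, B):
    # D = (A^T C^{-1} A + B^{-1})^{-1},  s_hat = D (A^T C^{-1} x + B^{-1} mu)
    Ci, Bi = np.linalg.inv(C), np.linalg.inv(B)
    D = np.linalg.inv(A.T @ Ci @ A + Bi)
    s_hat = D @ (A.T @ Ci @ x + Bi @ mu)
    return s_hat, D

# Hypothetical sizes n = 3, p = 2 (assumed values).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C, B = 0.5 * np.eye(3), 2.0 * np.eye(2)
x, mu = np.array([1.0, 2.0, 3.2]), np.zeros(2)
print(gaussian_posterior(x, A, C, mu, B))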
Linear-Gaussian Observation, Gaussian Prior: Example 1

Let $X = (X_1, \dots, X_n)$ be i.i.d. noisy (with variance $\sigma^2$) observations of $S \in \mathbb{R}$, i.e.,

$$
f_{X|S}(x|s) = \prod_{i=1}^n \mathcal{N}(x_i; s, \sigma^2) = \mathcal{N}(x; As, \sigma^2 I),
\qquad\text{with } A = [1, \dots, 1]^T
$$

With $f_S(s) = \mathcal{N}(s; \mu, \lambda^2)$ (i.e., $B = \lambda^2$),

$$
f_{S|X}(s|x) = \mathcal{N}(s; \hat{s}(x), \tau^2), \qquad
\hat{s}(x) = \frac{\dfrac{n}{\sigma^2}\, \bar{x} + \dfrac{1}{\lambda^2}\, \mu}{\dfrac{n}{\sigma^2} + \dfrac{1}{\lambda^2}}, \qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$

$\hat{s}(x)$ is a weighted average of the observation average and the prior mean.
Linear-Gaussian Observation, Gaussian Prior: Example 1 (cont.)

Let's study limit cases of $\hat{s}(x) = \Big( \dfrac{n}{\sigma^2}\, \bar{x} + \dfrac{1}{\lambda^2}\, \mu \Big) \Big/ \Big( \dfrac{n}{\sigma^2} + \dfrac{1}{\lambda^2} \Big)$:

- Large sample limit, the prior is ignored: $\lim_{n \to \infty} \hat{s}(x) = \bar{x}$
- Low noise limit, the prior is ignored: $\lim_{\sigma^2 \to 0} \hat{s}(x) = \bar{x}$
- Large noise limit, the data is ignored: $\lim_{\sigma^2 \to \infty} \hat{s}(x) = \mu$
- High prior confidence limit, the data is ignored: $\lim_{\lambda^2 \to 0} \hat{s}(x) = \mu$
- Low prior confidence limit, the prior is ignored: $\lim_{\lambda^2 \to \infty} \hat{s}(x) = \bar{x}$

Note: the prior is equivalent to one extra observation equal to $\mu$ with variance $\lambda^2$.
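A quick numeric sketch of these limit cases (the values of $\bar{x}$, $\mu$, and the variances are made up):

import numpy as np

def s_hat(xbar, n, sigma2, mu, lam2):
    # Posterior mean: precision-weighted average of xbar and mu
    return (n / sigma2 * xbar + mu / lam2) / (n / sigma2 + 1.0 / lam2)

xbar, mu = 1.0, 5.0
print(s_hat(xbar, n=10**6, sigma2=1.0, mu=mu, lam2=1.0))   # ~ xbar (large n)
print(s_hat(xbar, n=10, sigma2=10**6, mu=mu, lam2=1.0))    # ~ mu (large noise)
print(s_hat(xbar, n=10, sigma2=1.0, mu=mu, lam2=1e-9))     # ~ mu (tight prior)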
Linear-Gaussian Observation, Gaussian Prior: Example 2

Let $X = (X_1, \dots, X_n)$ be independent noisy (with variances $\sigma_i^2$) observations of $S \in \mathbb{R}$, i.e.,

$$
f_{X|S}(x|s) = \mathcal{N}(x; As, C), \qquad\text{with } A = [1, \dots, 1]^T,\quad C = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)
$$

With $f_S(s) = \mathcal{N}(s; \mu, \lambda^2)$ (i.e., $B = \lambda^2$),

$$
\hat{s}(x) = \frac{\displaystyle\sum_{i=1}^n \frac{x_i}{\sigma_i^2} + \frac{\mu}{\lambda^2}}{\displaystyle\sum_{i=1}^n \frac{1}{\sigma_i^2} + \frac{1}{\lambda^2}}
$$

Each sample is weighted proportionally to $\frac{1}{\sigma_i^2}$ (the precision).

Notice that, again, $\lim_{n \to \infty} \hat{s}(x)$ is independent of the prior.
Bernstein-von Mises Theorem

Notice that:
- Example 1: $\delta_{\mathrm{MLE}}(x) = \frac{1}{n} \sum_{i=1}^n x_i$
- Example 2: $\delta_{\mathrm{MLE}}(x) = \Big( \sum_{i=1}^n \frac{1}{\sigma_i^2} \Big)^{-1} \sum_{i=1}^n \frac{x_i}{\sigma_i^2}$

In the previous examples, we noticed that (independently of the prior)

$$
\lim_{n \to \infty} \delta_{\mathrm{MMSE}}(x) = \lim_{n \to \infty} \delta_{\mathrm{MAP}}(x) = \delta_{\mathrm{MLE}}(x),
$$

i.e., the mean or the weighted mean of the observations.

This is a general result, known as the Bernstein-von Mises theorem; a sufficient condition is that the prior is continuous and non-zero everywhere.
Conjugate Priors

In the previous examples, the posterior and the prior were both Gaussian; were we lucky? No, the prior was conjugate!

Consider a likelihood/observation model $f_{X|S}$, for $S \in \mathcal{S}$ and $X \in \mathcal{X}$,

...and a family of priors $\mathcal{P} = \{f_S(s; \theta),\ \theta \in \Theta\}$

$\mathcal{P}$ is conjugate for $f_{X|S}$ if, for any $\theta \in \Theta$ and $x \in \mathcal{X}$,

$$
f_{S|X}(\cdot|x) = \frac{f_{X|S}(x|\cdot)\, f_S(\cdot\,; \theta)}{f_X(x; \theta)} \in \mathcal{P}
\quad\Leftrightarrow\quad \exists_{\theta' \in \Theta} :\ f_{S|X}(\cdot|x) = f_S(\cdot\,; \theta')
$$

Example: in Example 1 (above), $f_S(s) = \mathcal{N}(s; \mu, \lambda^2)$ and $f_{S|X}(s|x) = \mathcal{N}(s; \hat{s}(x), \tau^2)$
Conjugate Priors: Example

$S \in [0, 1]$ is the probability of success in $n$ i.i.d. Bernoulli samples

$X = (X_1, \dots, X_n) \in \{0,1\}^n$:
$$
f_{X|S}(x|s) = s^{\sum_{i=1}^n x_i} (1 - s)^{n - \sum_{i=1}^n x_i}
$$

Conjugate prior:
$$
\mathrm{Beta}(s; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, s^{\alpha - 1} (1 - s)^{\beta - 1}
$$

Expected value: $E[S] = \dfrac{\alpha}{\alpha + \beta}$

Mode:
$$
\arg\max_s \mathrm{Beta}(s; \alpha, \beta) = \frac{\alpha - 1}{\alpha + \beta - 2}, \qquad \text{if } \alpha, \beta > 1
$$
Conjugate Priors: Example (cont.)

$S \in [0, 1]$ is the probability of success in $n$ i.i.d. Bernoulli samples

$X = (X_1, \dots, X_n) \in \{0,1\}^n$:
$$
f_{X|S}(x|s) = s^{\sum_{i=1}^n x_i} (1 - s)^{n - \sum_{i=1}^n x_i}
$$

Conjugate prior:
$$
\mathrm{Beta}(s; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, s^{\alpha - 1} (1 - s)^{\beta - 1}
$$

Posterior:

$$
f_{S|X}(s|x) \propto s^{\sum_i x_i} (1 - s)^{n - \sum_i x_i}\, s^{\alpha - 1} (1 - s)^{\beta - 1}
\;\propto\; \mathrm{Beta}\Big( s;\ \alpha + \sum_{i=1}^n x_i,\ \beta + n - \sum_{i=1}^n x_i \Big) = \mathrm{Beta}(s; \alpha', \beta')
$$

Estimates:

$$
\delta_{\mathrm{MMSE}}(x) = \frac{\alpha + \sum_{i=1}^n x_i}{\alpha + \beta + n}, \qquad
\delta_{\mathrm{MAP}}(x) = \frac{\alpha - 1 + \sum_{i=1}^n x_i}{\alpha + \beta - 2 + n}
$$
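A sketch of this Beta-Bernoulli update and the two point estimates, tried on the all-successes sample used in the plots on the next slide:

import numpy as np

def beta_update(x, alpha, beta):
    # Posterior hyperparameters, then the MMSE and MAP estimates above
    a = alpha + np.sum(x)
    b = beta + len(x) - np.sum(x)
    mmse = a / (a + b)
    map_ = (a - 1) / (a + b - 2) if a + b > 2 else None
    return a, b, mmse, map_

x = np.array([1, 1, 1, 1, 1])            # five successes, as in the plots
print(beta_update(x, alpha=1, beta=1))   # flat prior: MAP coincides with the MLE
print(beta_update(x, alpha=5, beta=5))   # stronger prior pulls both toward 1/2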
Conjugate Priors: Example (cont.)

[Figure: plots of $f_{S|X}(s|x)$ for $n = 1, 5, 10, 20, 50, 500$ samples; solid: $\alpha = \beta = 1$; dotted: $\alpha = \beta = 5$; notice: $(x_1, \dots, x_5) = (1, 1, 1, 1, 1)$.]

Observe the Bernstein-von Mises effect.

Note that $\delta_{\mathrm{MLE}}(x) = \delta_{\mathrm{MAP}}(x)$, for $\alpha = \beta = 1$ (flat prior).
Conjugate Priors: Exponential Families

Let $S \in \mathbb{R}^p$ be the parameter of an exponential family; for a single observation $X$,

$$
f_{X|S}(x|s) = \frac{1}{Z(s)}\, h(x) \exp\big( \eta(s)^T \phi(x) \big)
$$

For $n$ i.i.d. observations $(X_1, \dots, X_n)$,

$$
f_{X|S}(x|s) = \frac{1}{Z(s)^n} \Big( \prod_{i=1}^n h(x_i) \Big) \exp\Big( \eta(s)^T \sum_{i=1}^n \phi(x_i) \Big)
$$

Conjugate prior has the form:

$$
f_S(s; \gamma, \nu) \propto \frac{1}{Z(s)^\nu} \exp\big( \eta(s)^T \gamma \big)
$$

Posterior:

$$
f_{S|X}(s|x; \gamma, \nu) \propto \frac{1}{Z(s)^{n+\nu}} \exp\Big( \eta(s)^T \Big( \gamma + \sum_{i=1}^n \phi(x_i) \Big) \Big)
$$

...the conjugate prior behaves like virtual additional observations.
Conjugate Priors for Exponential Families: Bernoulli/Binomial

$S \in [0, 1]$ is the probability of success in $n$ i.i.d. Bernoulli samples (see the Bernoulli/Beta example above)

In exponential family form,

$$
f_{X|S}(x|s) = \underbrace{(1 - s)^n}_{1/Z(s)^n} \exp\Big( \underbrace{\log\frac{s}{1-s}}_{\eta(s)} \sum_{i=1}^n x_i \Big)
$$

Conjugate prior:

$$
f_S(s; \gamma, \nu) \propto \frac{1}{Z(s)^\nu} \exp\big( \eta(s)^T \gamma \big)
= (1 - s)^\nu \exp\Big( \gamma \log\frac{s}{1-s} \Big)
= s^\gamma (1 - s)^{\nu - \gamma}
\propto \mathrm{Beta}(s; \gamma + 1, \nu - \gamma + 1)
$$
An Important Example: Multinomial Observations

Let $S \in \Delta_{p-1} \subset [0,1]^p$ be estimated from $X \in \{0, \dots, n\}^p$:

$$
f_{X|S}(x|s) = \underbrace{\binom{n}{x_1\, x_2 \cdots x_p}\, \mathbb{1}_{(x_1 + \dots + x_p = n)}}_{h(x)} \prod_{i=1}^p s_i^{x_i}
$$

In exponential family form (notice that $Z(s) = 1$),

$$
f_{X|S}(x|s) = h(x) \exp\big( x^T \underbrace{\log s}_{\eta(s)} \big),
\qquad \log s = [\log s_1, \dots, \log s_p]^T
$$

Conjugate prior:

$$
f_S(s; \gamma) \propto \mathbb{1}_{s \in \Delta_{p-1}} \exp\big( \gamma^T \log s \big)
= \mathbb{1}_{s \in \Delta_{p-1}} \prod_{i=1}^p s_i^{\gamma_i}
\propto \mathrm{Dirichlet}(s; \gamma + 1)
$$
Multinomial Observations and Dirichlet Prior

Multinomial observations with Dirichlet prior:

$$
f_{X|S}(x|s) = \mathrm{Multi}(x; n, s), \qquad f_S(s; \gamma) = \mathrm{Dirichlet}(s; \gamma + 1)
$$

[Figure: Dirichlet examples in $\Delta_2 \subset \mathbb{R}^3$; the simplex $\Delta_2$ (axes $s_1, s_2, s_3$), $\mathrm{Dirichlet}(s; [2,2,2])$, and $\mathrm{Dirichlet}(s; [20,2,2])$.]

Posterior and estimates: $f_{S|X}(s|x) = \mathrm{Dirichlet}(s; \gamma + x + 1)$

$$
\delta_{\mathrm{MAP}}(x) = \frac{\gamma + x}{\mathbb{1}_p^T (\gamma + x)}, \qquad
\delta_{\mathrm{MMSE}}(x) = \frac{\gamma + x + \mathbb{1}_p}{\mathbb{1}_p^T (\gamma + x + \mathbb{1}_p)}
\qquad (\mathbb{1}_p = [1, \dots, 1]^T)
$$
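A sketch of the Dirichlet-multinomial update and the two estimates above, with hypothetical counts and a flat $\gamma$:

import numpy as np

def dirichlet_estimates(x, gamma):
    # Posterior is Dirichlet(gamma + x + 1); point estimates from the slide
    map_ = (gamma + x) / np.sum(gamma + x)
    mmse = (gamma + x + 1) / np.sum(gamma + x + 1)
    return map_, mmse

x = np.array([8, 1, 1])                 # hypothetical counts, n = 10
gamma = np.array([1.0, 1.0, 1.0])       # prior Dirichlet(s; gamma + 1)
print(dirichlet_estimates(x, gamma))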
Example of Multinomial Model: Bag of Words (BoW)

Vocabulary with $p$ words (stemmed, without stop-words, etc.)

Multinomial language model: $S \in \Delta_{p-1}$ contains the probability of occurrence of each of the $p$ words.

Probability of a set of documents with a total of $n$ words and word counts $x_1, \dots, x_p$: $\mathrm{Multi}(x; n, s)$

Example: an observed text [figure omitted] yields a vector $x$ of term counts (the bag of words), with "unk" collecting all other words; the MAP estimate of the word probabilities then uses a $\mathrm{Dirichlet}(s; [2, 2, \dots, 2])$ prior [figure omitted].
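A minimal BoW sketch with a tiny hypothetical vocabulary (the words and text below are made up); with the $\mathrm{Dirichlet}(s; [2, \dots, 2])$ prior, i.e., $\gamma = [1, \dots, 1]$, the MAP estimate reduces to add-one (Laplace) smoothing of the counts:

import numpy as np
from collections import Counter

vocab = ["bayes", "prior", "posterior", "unk"]       # hypothetical vocabulary
tokens = ["bayes", "prior", "bayes", "posterior", "bayes"]
counts = Counter(t if t in vocab else "unk" for t in tokens)

x = np.array([counts[w] for w in vocab], dtype=float)
gamma = np.ones(len(vocab))                          # Dirichlet([2,...,2]) prior
s_map = (gamma + x) / np.sum(gamma + x)
print(dict(zip(vocab, s_map)))   # no word gets probability zero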
Additive Loss Functions and Marginal Inference

Recall the optimal Bayes rule:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a \in \mathcal{A}} E\left[ L(S, a) \mid x \right]
$$

Let $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \times \dots \times \mathcal{S}_d$ and $\mathcal{A} = \mathcal{A}_1 \times \mathcal{A}_2 \times \dots \times \mathcal{A}_d$

$S = (S_1, \dots, S_d)$, where $S_i \in \mathcal{S}_i$, and $a = (a_1, \dots, a_d)$, where $a_i \in \mathcal{A}_i$

Let the loss have an additive structure:

$$
L(s, a) = \sum_{i=1}^d L_i(s_i, a_i), \qquad\text{with } L_i : \mathcal{S}_i \times \mathcal{A}_i \to \mathbb{R}_+
$$

Example of additive loss: $L(s, a) = \|s - a\|_2^2 = \sum_{i=1}^d (s_i - a_i)^2$
Additive Loss Functions and Marginal Inference

Bayes optimal rule:

$$
\delta_{\mathrm{Bayes}}(x) = \arg\min_{a_1 \in \mathcal{A}_1, \dots, a_d \in \mathcal{A}_d} E\Big[ \sum_{i=1}^d L_i(S_i, a_i) \,\Big|\, x \Big]
= \arg\min_{a_1 \in \mathcal{A}_1, \dots, a_d \in \mathcal{A}_d} \sum_{i=1}^d E\big[ L_i(S_i, a_i) \,\big|\, x \big]
= \Big( \arg\min_{a_1 \in \mathcal{A}_1} E\big[ L_1(S_1, a_1) \,\big|\, x \big],\ \dots,\ \arg\min_{a_d \in \mathcal{A}_d} E\big[ L_d(S_d, a_d) \,\big|\, x \big] \Big)
$$

Marginal Bayes decisions:

$$
\arg\min_{a_i \in \mathcal{A}_i} E\big[ L_i(S_i, a_i) \,\big|\, x \big]
= \arg\min_{a_i \in \mathcal{A}_i} \int_{\mathcal{S}_i} L_i(s_i, a_i)\, \underbrace{f_{S_i|X}(s_i|x)}_{\text{posterior marginal}}\, ds_i
$$

If $L_i(s_i, a_i)$ is the 0/1 loss, then $\delta_{\mathrm{Bayes}} = \delta_{\mathrm{MPM}}$ (maximizer of the posterior marginals).
Example of Marginal Inference: Estimation of Variance

Estimation: $S = (\mu, \sigma^2) \in \mathcal{A} = \mathcal{S} = \mathbb{R} \times \mathbb{R}_+$

Observations $X = (X_1, \dots, X_n)$, with $f_{X|S}(x|\mu, \sigma^2) = \prod_{i=1}^n \mathcal{N}(x_i; \mu, \sigma^2)$

Flat prior $f_S(s) = K$, thus $f_{S|X}(s|x) \propto f_{X|S}(x|s)$

MAP estimate: $(\hat\mu_{\mathrm{MAP}}, \hat\sigma^2_{\mathrm{MAP}}) = (\hat\mu_{\mathrm{MLE}}, \hat\sigma^2_{\mathrm{MLE}})$:

$$
\hat\mu_{\mathrm{MAP}} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad
\hat\sigma^2_{\mathrm{MAP}} = \frac{1}{n} \sum_{i=1}^n (x_i - \hat\mu_{\mathrm{MLE}})^2
$$

with $\hat\sigma^2_{\mathrm{MAP}} = \hat\sigma^2_{\mathrm{MLE}}$ known to be biased: $E[\hat\sigma^2_{\mathrm{MLE}}] = \frac{n-1}{n}\, \sigma^2$

Posterior marginal:

$$
f_{\Sigma^2|X}(\sigma^2|x) \propto \int_{-\infty}^{\infty} \prod_{i=1}^n \mathcal{N}(x_i; \mu, \sigma^2)\, d\mu
\;\propto\; \big( 2\pi\sigma^2 \big)^{-\frac{n-1}{2}} \exp\Big( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \hat\mu_{\mathrm{MLE}})^2 \Big)
$$

The MPM estimate, $\hat\sigma^2_{\mathrm{MPM}} = \frac{n}{n-1}\, \hat\sigma^2_{\mathrm{MAP}}$, is unbiased.
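A Monte Carlo sketch (assumed $\sigma^2$ and $n$) checking the bias claim: the MAP/MLE variance estimate concentrates around $\frac{n-1}{n}\sigma^2$, while the MPM-corrected estimate is unbiased:

import numpy as np

rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200000
xs = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

# MAP/MLE variance estimate per trial, then the MPM correction n/(n-1)
var_map = np.mean((xs - xs.mean(axis=1, keepdims=True)) ** 2, axis=1)
var_mpm = var_map * n / (n - 1)
print(var_map.mean(), var_mpm.mean())   # ~ (n-1)/n * 4 = 3.2  vs  ~ 4.0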