[go: up one dir, main page]

0% found this document useful (0 votes)
54 views40 pages

Estimation Theory Eng

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 40

Estimation Theory for Engineers

Roberto Togneri

August 16, 2001

1 Applications
Modern estimation theory can be found at the heart of many electronic signal
processing systems designed to extract information. These systems include:

Radar where the delay of the received pulse echo has to be estimated in the
presence of noise
Sonar where the delay of the received signal from each sensor has to estimated
in the presence of noise
Speech where the parameters of the speech model have to be estimated in the
presence of speech/speaker variability and environmental noise
Image where the position and orientation of an object from a camera image
has to be estimated in the presence of lighting and background noise
Biomedicine where the heart rate of a fetus has to be estimated in the presence
of sensor and environmental noise
Communications where the carrier frequency of a signal has to be estimated
for demodulation to the baseband in the presence of degradation noise
Control where the position of a powerboat for corrective navigation correction
has to be estimated in the presence of sensor and environmental noise
Siesmology where the underground distance of an oil deposit has to be es-
timated from noisy sound reflections due to different densities of oil and
rock layers

The majority of applications require estimation of an unknown parameter θ


from a collection of observation data x[n] which also includes “artificacts” due
to to sensor inaccuracies, additive noise, signal distortion (convolutional noise),
model imperfections, unaccounted source variability and multiple interfering
signals.

1
2 Introduction
Define:

• x[n] ≡ observation data at sample time n,


• x = (x[0] x[1] . . . x[N − 1])T ≡ vector of N observation samples (N -point
data set), and
• p(x; θ) ≡ mathemtical model (i.e. PDF) of the N -point data set parametrized
by θ.

The problem is to find a function of the N -point data set which provides an
estimate of θ, that is:
θ̂ = g(x = {x[0], x[1], . . . , x[N − 1]})
where θ̂ is an estimate of θ, and g(x) is known as the estimator function.
Once a candidate g(x) is found, then we usually ask:

1. How close will θ̂ be to θ (i.e. how good or optimal is our estimator)?


2. Are there better (i.e. closer ) estimators?

A natural optimal criterion is minimisation of the mean square error :


mse(θ̂) = E[(θ̂ − θ)2 ]
But this does not yield realisable estimator functions which can be written as
functions of the data only:
h   i2  h i2
mse(θ̂) = E θ̂ − E(θ̂) + E(θ̂) − θ = var(θ̂) + E(θ̂) − θ
h i2
However although E(θ̂) − θ is a function of θ the variance of the estimator,
var(θ̂), is only a function of data. Thus an alternative approach is to assume
E(θ̂) − θ = 0 and minimise var(θ̂). This produces the Minimum Variance
Unbiased (MVU) estimator.

2.1 Minimum Variance Unbiased (MVU) Estimator


1. Estimator has to be unbiased , that is:
E(θ̂) = θ for a < θ < b
where [a,b] is the range of interest
2. Estimator has to have minimum variance:
θ̂M V U = arg min{var(θ̂)} = arg min{E(θ̂ − E(θ̂))2 }
θ̂ θ̂

2
2.2 EXAMPLE
Consider a fixed signal, A, embedded in a WGN (White Gaussian Noise) signal,
w[n] :
x[n] = A + w[n] n = 0, 1, . . . , N − 1
where θ = A is the parameter to be estimated from the observed data, x[n].
Consider the sample-mean estimator function:
N −1
1 X
θ̂ = x[n]
N n=0
Is the sample-mean an MVU estimator for A?

Unbiased?

E(θ̂) = E( N1 1 1 1
P P P
x[n]) = N E(x[n]) = N A= N NA =A

Minimum Variance?
σ2
var(θ̂) = var( N1 1 1 Nσ
P P P
x[n]) = N2 var(x[n]) = N2 σ= N = N

But is the sample-mean θ̂ the MVU estimator? It is unbiased but is it minimum


2
variance? That is, is var(θ̃) ≥ σN for all other unbiased estimator functions θ̃ ?

3 Cramer-Rao Lower Bound (CLRB)

The variance of any unbiased estimater θ̂ must be lower bounded by the CLRB,
with the variance of the MVU estimator attaining the CLRB. That is:
1
var(θ̂) ≥ h i
∂2 ln p(x;θ)
−E ∂θ 2

and
1
var(θ̂M V U ) = h i
∂ 2 ln p(x;θ)
−E ∂θ 2

Furthermore if, for some functions g and I:


∂ ln p(x; θ)
= I(θ)(g(x) − θ)
∂θ
then we can find the MVU estimator as: θ̂M V U = g(x) and the minimum
variance is 1/I(θ). For a p-dimensional vector parameter,θ, the equivalent
condition is:
Cθ̂ − I−1 (θ) ≥ 0

3
i.e. Cθ̂ − I−1 (θ) is postitive semidefinite where Cθ̂ = E[(θ̂ − E(θ̂)T (θ̂ − E(θ̂))
is the covariance matrix. The Fisher matrix, I(θ), is given as:
 2 
∂ ln p(x; θ)
[I(θ)]ij = −E
∂θi ∂θj
Furthermore if, for some p-dimensional function g and p × p matrix I:
∂ ln p(x; θ)
= I(θ)(g(x) − θ)
∂θ
then we can find the MVU estimator as: θ M V U = g(x) and the minimum
covariance is I−1 (θ).

3.1 EXAMPLE
Consider the case of a signal embedded in noise:
x[n] = A + w[n] n = 0, 1, . . . , N − 1
where w[n] is a WGN with variance σ 2 , and thus:

N −1  
Y 1 1 2
p(x; θ) = √ exp − 2 (x[n] − θ)
n=0 2πσ 2 2σ
N −1
" #
1 1 X 2
= N exp − (x[n] − θ)
(2πσ 2 ) 2 2σ 2 n=0

where p(x; θ) is considered a function of the parameter θ = A (for known x)


and is thus termed the likelihood function. Taking the first and then second
derivatives:
∂ ln p(x; θ) N 1 X N
= ( x[n] − θ) = 2 (θ̂ − θ)
∂θ σ2 N σ
∂ 2 ln p(x; θ) N
2
= − 2
∂θ σ
For a MVU estimator the lower bound has to apply, that is:
1 σ2
var(θ̂M V U ) = h i=
−E ∂2 ln p(x;θ) N
∂θ 2
2
but we know from the previous example that var(θ̂) = σN and thus the sample-
mean is a MVU estimator. Alternatively we can show this by considering the
first derivative:
∂ ln p(x; θ) N 1 X
= 2( x[n] − θ) = I(θ)(g(x) − θ)
∂θ σ N
where I(θ) = σN2 and g(x) = N1
P
x[n]. Thus the MVU estimator is indeed
2
1 1
= σN .
P
θ̂M V U = N x[n] with minimum variance I(θ)

4
4 Linear Models
If N-point samples of data are observed and modeled as:

x = Hθ + w

where

x = N ×1 observation vector
H = N ×p observation matrix
θ = p×1 vector of parameters to be estimated
w = N ×1 noise vector with PDF N(0, σ 2 I)

then using the CRLB theorem θ = g(x) will be an MVU estimator if:

∂ ln p(x; θ)
= I(θ)(g(x) − θ)
∂θ
with Cθ̂ = I−1 (θ). So we need to factor:
 
∂ ln p(x; θ) ∂ N 1
= − ln(2πσ 2 ) 2 − 2 (x − Hθ)T (x − Hθ)
∂θ ∂θ 2σ

into the form I(θ)(g(x) − θ). When we do this the MVU estimator for θ is:

θ̂ = (HT H)−1 HT x

and the covariance matrix of θ is:

Cθ̂ = σ 2 (HT H)−1

4.1 EXAMPLES

4.1.1 Curve Fitting

Consider fitting the data, x(t), by a pth order polynomial function of t:

x(t) = θ0 + θ1 t + θ2 t2 + · · · + θp tp + w(t)

Say we have N samples of data, then:


T
x = [x(t0 ), x(t1 ), x(t2 ), . . . , x(tN −1 )]
T
w = [w(t0 ), w(t1 ), w(t2 ), . . . , w(tN −1 )]
T
θ = [θ0 , θ1 , θ2 , . . . . θp ]

5
so x = Hθ + w, where H is the N × p matrix:
 
1 t0 t2 0 ··· tp 0
 1 t1 t1 2 ··· tp 1
H= .
 
.. .. ..
 ..

. . . ··· 
1 tN −1 t2N −1 ··· tpN −1

Hence the MVU estimate of the polynomial coefficients based on the N samples
of data is:
θ̂ = (HT H)−1 HT x

4.1.2 Fourier Analysis

Consider fitting or representing the N samples of data, x[n], by a linear combi-


nation of sine and cosine functions at different harmonics of the fundamental
with period N samples. This implies that x[n] is a periodic time series with
period N and this type of analysis is known as Fourier analysis. We consider
our model as:
M   M  
X 2πkn X 2πkn
x[n] = ak cos + bk sin + w[n]
N N
k=1 k=1

so x = Hθ + w, where:
T
x = [x[0], x[1], x[2], . . . , x[N − 1]]
T
w = [w[0], w[1], w[2], . . . , w[N − 1]]
T
θ = [a1 , a2 , . . . , aM , b1 , b2 , . . . , bM ]

where H is the N × 2M matrix:

H = ha1 ha2 · · · haM hb1 hb2 · · · hbM


 

where:
   
1 0
cos 2πk sin 2πk
 
   
 N    N  
a
 cos 2πk2N
 b
 sin 2πk2N

hk = 
 ..
,
 hk = 
 ..


 .  .
   
     
−1) −1)
cos 2πk(N N sin 2πk(N N

Hence the MVU estimate of the Fourier co-efficients based on the N samples of
data is:
θ̂ = (HT H)−1 HT x

6
After carrying out the simplification the solution can be shown to be:
 2 a T 
N (h1 ) x
 .. 
 2 a. T 
 
 (h ) x 
θ̂ =  N M 
 2 (hb1 )T x 
 N 
 .. 
 . 
2 b T
N (hM ) x

which is none other than the standard solution found in signal processing text-
books, usually expressed directly as:
N −1  
2 X 2πkn
âk = x[n] cos
N n=0 N
N −1  
2 X 2πkn
b̂k = x[n] sin
N n=0 N

2σ 2
and Cθ̂ = N I.

4.1.3 System Identification

We assume a tapped delay line (TDL) or finite impulse response (FIR) model
with p stages or “taps” of an unknown system with output, x[n]. To identify
the system a known input, u[n], is used to “probe” the system which produces
output:
p−1
X
x[n] = h[k]u[n − k] + w[n]
k=0

We assume N input samples are used to yield N output samples and our iden-
tification problem is the same as estimation of the linear model parameters for
x = Hθ + w, where:
T
x = [x[0], x[1], x[2], . . . , x[N − 1]]
T
w = [w[0], w[1], w[2], . . . , w[N − 1]]
T
θ = [h[0], h[1], h[2], . . . , h[p − 1]]

and H is the N × p matrix:


 
u[0] 0 ··· 0
 u[1] u[0] ··· 0 
H=
 
.. .. .. .. 
 . . . . 
u[N − 1] u[N − 2] · · · u[N − p]

7
The MVU estimate of the system model co-efficients is given by:

θ̂ = (HT H)−1 HT x

where Cθ̂ = σ 2 (HT H)−1 . Since H is a function of u[n] we would like to choose
u[n] to achieve minimum variance. It can be shown that the signal we need is a
pseudorandom noise (PRN) sequence which has the property that the autocor-
relation function is zero for k 6= 0, that is:
−1−k
NX
1
ruu [k] = u[n]u[n + k] = 0 k 6= 0
N n=0

and hence HT H = N ruu [0]I. Define the crosscorrelation function:


−1−k
NX
1
rux [k] = u[n]x[n + k]
N n=0

then the system model co-efficients are given by:

rux [i]
ĥ[i] =
ruu [0]

4.2 General Linear Models

In a general linear model two important extensions are:

1. The noise vector, w, is no longer white and has PDF N(0, C) (i.e. general
Gaussian noise)
2. The observed data vector, x, also includes the contribution of known signal
components, s.

Thus the general linear model for the observed data is expressed as:

x = Hθ + s + w

where:

s N ×1 vector of known signal samples


w N ×1 noise vector with PDF N(0, C)

Our solution for the simple linear model where the noise is assumed white can
be used after applying a suitable whitening transformation. If we factor the
noise covariance matrix as:
C−1 = DT D

8
then the matrix D is the required transformation since:
−1
E[wwT ] = C ⇒ E[(Dw)(Dw)T ] = DCDT = (DD−1 )(DT DT ) = I

that is w0 = Dw has PDF N(0, I). Thus by transforming the general linear
model:
x = Hθ + s + w
to:

x0 = Dx = DHθ + Ds + Dw
x0 = H0 θ + s0 + w0

or:
x00 = x0 − s0 = H0 θ + w0

we can then write the MVU estimator of θ given the observed data x00 as:
T T
θ̂ = (H0 H0 )−1 H0 x00
= (HT DT DH)−1 HT DT D(x − s)

That is:
θ̂ = (HT C−1 H)−1 HT C−1 (x − s)
and the covariance matrix is:

Cθ̂ = (HT C−1 H)−1

5 General MVU Estimation

5.1 Sufficient Statistic

For the cases where the CRLB cannot be established a more general approach
to finding the MVU estimator is required. We need to first find a sufficient
statistic for the unknown parameter θ :
Neyman-Fisher Factorization: If we can factor the PDF p(x; θ) as:

p(x; θ) = g(T (x), θ)h(x)

where g(T (x), θ) is a function of T (x) and θ only and h(x) is a function of x
only, then T (x) is a sufficient statistic for θ. Conceptually one expects that the
PDF after the sufficient statistic has been observed, p(x|T (x) = T0 ; θ), should
not depend on θ since T (x) is sufficient for the estimation of θ and no more
knowledge can be gained about θ once we know T (x).

9
5.1.1 EXAMPLE

Consider a signal embedded in a WGN signal:

x[n] = A + w[n]

Then:
N −1
" #
1 1 X
p(x; θ) = N exp − 2 (x[n] − θ)2
(2πσ 2 ) 2 2σ n=0
where θ = A is the unknown parameter we want to estimate. We factor as
follows:
−1 N −1
" N
# " #
1 1 2
X 1 X 2
p(x; θ) = N exp − (N θ − 2θ x[n] exp − 2 x [n]
(2πσ 2 ) 2 2σ 2 n=0
2σ n=0
= g(T (x), θ) · h(x)
PN −1
where we define T (x) = n=0 x[n] which is a sufficient statistic for θ.

5.2 MVU Estimator

There are two different ways one may derive the MVU estimator based on the
sufficient statistic, T (x):

R
1. Let θ̆ be any unbiased estimator of θ. Then θ̂ = E(θ̆|T (x)) = θ̆p(θ̆|T (x))dθ̆
is the MVU estimator.
2. Find some function g such that θ̂ = g(T (x)) is an unbiased estimator of
θ, that is E[g(T (x))] = θ, then θ̂ is the MVU estimator.

The Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem tells us that θ̂ =


E(θ̆|T (x)) is:

1. A valid estimator for θ


2. Unbiased
3. Of lesser or equal variance that that of θ̆, for all θ.
4. The MVU estimator if the sufficient statistic, T (x), is complete.

The sufficient statistic, T (x), is complete if there is only one function g(T (x))
that is unbiased. That is, if h(T (x)) is another unbiased estimator (i.e. E[h(T (x))] =
θ) then we must have that g = h if T (x) is complete.

10
5.2.1 EXAMPLE

Consider the previous example of a signal embedded in a WGN signal:

x[n] = A + w[n]
PN −1
where we derived the sufficient statistic, T (x) = n=0 x[n]. Using method 2
we need to find a function g such that E[g(T (x))] = θ = A. Now:
"N −1 # N −1
X X
E[T (x)] = E x[n] = E[x[n]] = N θ
n=0 n=0

It is obvious that:
N −1
" #
1 X
E x[n] = θ
N n=0
PN −1
and thus θ̂ = g(T (x)) = N1 n=0 x[n], which is the sample mean we have
already seen before, is the MVU estimator for θ.

6 Best Linear Unbiased Estimators (BLUEs)


It may occur that the MVU estimator or a sufficient statistic cannot be found or,
indeed, the PDF of the data is itself unknown (only the second-order statistics
are known). In such cases one solution is to assume a functional model of the
estimator, as being linear in the data, and find the linear estimator which is
both unbiased and has minimum variance, i.e. the BLUE.
For the general vector case we want our estimator to be a linear function of the
data, that is:
θ̂ = Ax
Our first requirement is that the estimator be unbiased, that is:

E(θ̂) = AE(x) = θ

which can only be satisfied if:

E(x) = Hθ

i.e. AH = I. The BLUE is derived by finding the A which minimises the


variance, Cθ̂ = ACAT , where C = E[(x − E(x))(x − E(x))T ] is the covariance
of the data x, subject to the constraint AH = I. Carrying out this minimisation
yields the following for the BLUE:

θ̂ = Ax = (HT C−1 H)−1 HT C−1 x

where Cθ̂ = (HT C−1 H)−1 . The form of the BLUE is identical to the MVU
estimator for the general linear model. The crucial difference is that the BLUE

11
does not make any assumptions on the PDF of the data (or noise) whereas the
MVU estimator was derived assuming Gaussian noise. Of course, if the data is
truly Gaussian then the BLUE is also the MVU estimator. The BLUE for the
general linear model can be stated as follows:

Gauss-Markov Theorem Consider a general linear model of the form:

x = Hθ + w

where H is known, and w is noise with covariance C (the PDF of w is otherwise


arbitrary), then the BLUE of θ is:

θ̂ = (HT C−1 H)−1 HT C−1 x

where Cθ̂ = (HT C−1 H)−1 is the minimum covariance.

6.1 EXAMPLE

Consider a signal embedded in noise:

x[n] = A + w[n]

Where w[n] is of unspecified PDF with var(w[n]) = σn2 and the unknown pa-
rameter θ = A is to be estimated. We assume a BLUE estimate and we derive
H by noting:
E[x] = 1θ
T T
where x = [x[0], x[1], x[2], . . . , x[N − 1]] , 1 = [1, 1, 1, . . . , 1] and we have
H ≡ 1. Also:
 1
0 ···

 2
σ0 0 · · · 0

σ02
0
1
 0 σ12 · · · 0 
 0
σ12
··· 0 
−1
 
C= . ⇒ C =
 
. . .  . . . . 
 .. .. .. ..  .. .. .. .. 
 
 
2
0 0 · · · σN −1 0 0 · · · σ21
N −1

and hence the BLUE is:


PN −1 x[n]
T −1 −1 T −1 1T C−1 x n=0 σ 2
θ̂ = (H C H) H C x = T −1 = PN −1 1n
1 C 1 n=0 σ 2 n

and the minimum covariance is:


1 1
Cθ̂ = var(θ̂) = = PN −1
1T C−1 1 n=0
1
σn2

12
and we note that in the case of white noise where σn2 = σ 2 then we get the
sample mean:
N −1
1 X
θ̂ = x[n]
N n=0
σ2
and miminum variance var(θ̂) = N .

7 Maximum Likelihood Estimation (MLE)


7.1 Basic MLE Procedure
In some cases the MVU estimator may not exist or it cannot be found by any of
the methods discussed so far. The MLE approach is an alternative method in
cases where the PDF is known. With MLE the unknown parameter is estimated
by maximising the PDF. That is define θ̂ such that:
θ̂ = arg max p(x; θ)
θ

where x is the vector of observed data (of N samples). It can be shown that θ̂
is asymptotically unbiased:
lim E(θ̂) = θ
N →∞
and asymptotically efficient:
lim var(θ̂) = CRLB
N →∞

An important result is that if an MVU estimator exists, then the MLE


procedure will produce it. An important observation is that unlike the
previous estimates the MLE does not require an explicit expression for
p(x; θ)! Indeed given a histogram plot of the PDF as a function of θ one can
numerically search for the θ that maximises the PDF.

7.1.1 EXAMPLE

Consider the signal embedded in noise problem:


x[n] = A + w[n]
where w[n] is WGN with zero mean but unknown variance which is also A, that
is the unknown parameter, θ = A, manifests itself both as the unknown signal
and the variance of the noise. Although a highly unlikely scenario, this simple
example demonstrates the power of the MLE approach since finding the MVU
estimator by the previous procedures is not easy. Consider the PDF:
N −1
" #
1 1 X 2
p(x; θ) = N exp − (x[n] − θ)
(2πθ) 2 2θ n=0

13
We consider p(x; θ) as a function of θ, thus it is a likelihood function and we
need to maximise it wrt to θ. For Gaussian PDFs it easier to find the maximum
of the log-likelihood function:
N −1
!
1 1 X
ln p(x; θ) = ln N − (x[n] − θ)2
(2πθ) 2 2θ n=0

Differentiating we have:
N −1 N −1
∂ ln p(x; θ) N 1 X 1 X
=− + (x[n] − θ) + 2 (x[n] − θ)2
∂θ 2θ θ n=0 2θ n=0

and setting the derivative to zero and solving for θ, produces the MLE estimate:
v
u N −1
1 u 1 X 2 1
θ̂ = − + t x [n] +
2 N n=0 4

where we have assumed θ > 0. It can be shown that:

lim E(θ̂) = θ
N →∞

and:
θ2
lim var(θ̂) = CRLB =
N →∞ N (θ + 12 )

7.1.2 EXAMPLE

Consider a signal embedded in noise:

x[n] = A + w[n]

where w[n] is WGN with zero mean and known variance σ 2 . We know the MVU
estimator for θ is the sample mean. To see that this is also the MLE, we consider
the PDF:
N −1
" #
1 1 X 2
p(x; θ) = N exp − (x[n] − θ)
(2πσ 2 ) 2 2σ 2 n=0
and maximise the log-likelihood function be setting it to zero:
N −1 N −1
∂ ln p(x; θ) 1 X X
= 2 (x[n] − θ) = 0 ⇒ x[n] − N θ = 0
∂θ σ n=0 n=0

1
PN −1
thus θ̂ = N n=0 x[n] which is the sample-mean.

14
7.2 MLE for Transformed Parameters

The MLE of the transformed parameter, α = g(θ), is given by:

α̂ = g(θ̂)

where θ̂ is the MLE of θ. If g is not a one-to-one function (i.e. not invertible)


then α̂ is obtained as the MLE of the transformed likelihood function, pT (x; α),
which is defined as:
pT (x; α) = max p(x; θ)
{θ:α=g(θ)}

7.3 MLE for the General Linear Model

Consider the general linear model of the form:

x = Hθ + w

where H is a known N × p matrix, x is the N × 1 observation vector with N


samples, and w is a noise vector of dimension N × 1 with PDF N(0, C). The
PDF is:
 
1 1 T −1
p(x; θ) = N 1 exp − (x − Hθ) C (x − Hθ)
(2π) 2 det 2 (C) 2

and the MLE of θ is found by differentiating the log-likelihood which can be


shown to yield:
∂ ln p(x; θ) ∂(Hθ)T −1
= C (x − Hθ)
∂θ ∂θ
which upon simplification and setting to zero becomes:

HT C−1 (x − Hθ) = 0

and this we obtain the MLE of θ as:

θ̂ = (HT C−1 H)−1 HT C−1 x

which is the same as the MVU estimator.

7.4 EM Algorithm

We wish to use the MLE procedure to find an estimate for the unknown param-
eter θ which requires maximisation of the log-likelihood function, ln px (x; θ).
However we may find that this is either too difficult or there aredifficulties in
finding an expression for the PDF itself. In such circumstances where direct
expression and maxmisation of the PDF in terms of the observed data x is

15
difficult or intractable, an iterative solution is possible if another data set, y,
can be found such that the PDF in terms of y set is much easier to express in
closed form and maximise. We term the data set y the complete data and the
original data x the incomplete data. In general we can find a mapping from the
complete to the incomplete data:

x = g(y)

however this is usually a many-to-one transformation in that a subset of the


complete data will map to the same incomplete data (e.g. the incomplete data
may represent an accumulation or sum of the complete data). This explains the
terminology: x is incomplete (or is “missing” something) relative to the data,
y, which is complete (for performing the MLE procedure). This is usually not
evident, however, until one is able to define what the complete data set. Un-
fortuneately defining what constitutes the complete data is usually an arbitrary
procedure which is highly problem specific.

7.4.1 EXAMPLE

Consider spectral analysis where a known signal, x[n], is composed of an un-


known summation of harmonic components embedded in noise:
p
X
x[n] = cos 2πfi n + w[n] n = 0, 1, . . . , N − 1
i=1

where w[n] is WGN with known variance σ 2 and the unknown parameter vector
to be estimated is the group of frequencies: θ = f = [f1 f2 . . . fp ]T . The
standard MLE would require maximisation of the log-likelihood of a multi-
variate Gaussian distribution which is equivalent to minimising the argument
of the exponential:
N −1 p
!2
X X
J(f ) = x[n] − cos 2πfi n
n=0 i=1

which is a p-dimensional minimisation problem (hard!). On the other hand if


we had access to the individual harmonic signal embedded in noise:

i = 1, 2, . . . , p
yi [n] = cos 2πfi n + wi [n]
n = 0, 1, . . . , N − 1

where wi [n] is WGN with known variance σi2 then the MLE procedure would
result in minimisation of:
N
X −1
J(fi ) = (yi [n] − cos 2πfi n)2 i = 1, 2, . . . , p
n=0

16
which are p independent one-dimensional minimisation problems (easy!). Thus
y is the complete data set that we are looking for to facilitate the MLE proce-
dure. However we do not have access to y. The relationship to the known data
x is:
X p
x = g(y) ⇒ x[n] = yi [n] n = 0, 1, . . . , N − 1
i=1

and we further assume that:


p
X
w[n] = wi [n]
i=1
Xp
σ2 = σi2
i=1

However since the mapping from y to x is many-to-one we cannot directly form


an expression for the PDF py (y; θ) in terms of the known x (since we can’t do
the obvious substitution of y = g−1 (x)).

♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦

Once we have found the complete data set, y, even though an expression for
ln py (y; θ) can now easily be derived we can’t directly maxmise ln py (y; θ) wrt
θ since y is unavailable. However we know x and if we further assume that we
have a good “guess” estimate for θ then we can consider the expected value of
ln py (y; θ) conditioned on what we know:
Z
E[ln py (y; θ)|x; θ] = ln py (y; θ)p(y|x; θ)dy

and we attempt to maximise this expectation to yield, not the MLE, but next
“best-guess” estimate for θ.
What we do is iterate through both an E-step (find the expression for the
Expectation) and M-step (Maxmisation of the expectation), hence the name
EM algorithm. Specifically:
Expectation(E): Determine the average or expectation of the log-likelihood
of the complete data given the known incomplete or observed data and current
estimate of the parameter vector
Z
Q(θ, θ k ) = ln py (y; θ)p(y|x; θ k )dy

Maxmisation (M):Maximise the average log-likelihood of the complete date


or “Q-function” to obtain the next estimate of the parameter vector

θ k+1 = arg max Q(θ, θ k )


θ

17
Convergence of the EM algorithm is guaranteed (under mild conditions) in the
sense that the average log-likelihood of the complete data does not decrease at
each iteration, that is:
Q(θ, θ k+1 ) ≥ Q(θ, θ k )
with equality when θ k is the MLE. The three main attributes of the EM algo-
rithm are:

1. An initial value for the unknown parameter is needed and as with most
iterative procedures a good initial estimate is required for good conver-
gence
2. The selection of the complete data set is arbitrary
3. Although ln py (y; θ) can usually be easily expressed in closed form finding
the closed form expression for the expectation is usually harder.

7.4.2 EXAMPLE

Applying the EM algorithm to the previous example requires finding a closed


expression for the average log-likelihood.

E-step We start with finding an expression for ln py (y; θ) in terms of the


complete data:
p N −1
X 1 X
ln py (y; θ) ≈ h(y) + yi [n] cos 2πfi n
σ2
i=1 i n=0
p
X 1 T
≈ h(y) + c yi
σ2 i
i=1 i

where the terms in h(y) do not depend on θ , ci = [1, cos 2πfi , cos 2πfi (2), . . . , cos 2πfi (N −
1)]T and yi = [yi [0], yi [1], yi [2], . . . , yi [N − 1]]T . We write the conditional expec-
tation as:

Q(θ, θ k ) = E[ln py (y; θ)|x; θ k ]


p
X 1 T
= E(h(y)|x; θ k ) +
σ 2 ci E(yi |x; θ k )
i=1 i

Since we wish to maximise Q(θ, θ k ) wrt to θ, then this is equivalent to maxim-


ing:
p
X
Q0 (θ, θ k ) = cTi E(yi |x; θ k )
i=1

We note that E(yi |x; θ k ) can be thought as as an estimate of the yi [n] data set
given the observed data set x[n] and current estimate θ k . Since y is Gaussian

18
then x is a sum of Gaussians and thus x and y are jointly Gaussian and one of
the standard results is:

E(y|x; θ k ) = E(y) + Cyx C−1


xx (x − E(x))

and application of this yields:


p
σi2 X
ŷi = E(yi |x; θ k ) = ci + 2 (x − ci )
σ i=1

and
p
!
σ2 X
ŷi [n] = cos 2πfik n + i2 x[n] − cos 2πfik n
σ i=1

Thus:
p
X
Q0 (θ, θ k ) = cTi ŷi
i=1

M-step Maximisation of Q0 (θ, θ k ) consists of maximising each term in the


sum separately or:
fik+1 = arg max cTi ŷi
fi
Pp
Furthermore since we assumed σ 2 = i=1 σi2 we still have the problem that we
don’t know what the σi2 are. However as long as:
p p
X X σ2 i
σ2 = σi2 ⇒ =1
i=1 i=1
σ2

then we can chose these values arbitrarily.

8 Least Squares Estimation (LSE)

8.1 Basic LSE Procedure

The MVU, BLUE and MLE estimators developed previously required an ex-
pression for the PDF p(x; θ) in order to estimate the unknown parameter θ in
some optimal fashion. An alternative approach is to assume a signal model
(rather than probabilistic assumptions about the data) and achieve a design
goal assuming this model. With the Least Squares (LS) approach we assume
that the signal model is a function of the unknown parameter θ and produces a
signal:
s[n] ≡ s(n; θ)

19
where s(n; θ) is a function of n and parameterised by θ. Due to noise and model
inaccuracies, w[n], the signal s[n] can only be observed as:

x[n] = s[n] + w[n]

Unlike previous approaches no statement is made on the probabilistic distribu-


tion nature of w[n]. We only state that what we have is an “error”: e[n] =
x[n] − s[n] which with the appropriate choice of θ should be minimised in a
least-squares sense. That is we choose θ = θ̂ so that the criterion:
N
X −1
J(θ) = (x[n] − s[n])2
n=0

is minimised over the N observation samples of interest and we call this the LSE
of θ. More precisely we have:

θ̂ = arg min J(θ)


θ

and the minimum LS error is given by:

Jmin = J(θ̂)

An important assumption to produce a meaningful unbiassed estimate is that


the noise and model inaccuracies, w[n], have zero-mean. However no other
probabilistic assumptions about the data are made (i.e. LSE is valid for both
Gaussian and non-Gaussian noise), although by the same token we also can-
not make any optimality claims with LSE. (since this would depend on the
distribution of the noise and modelling errors ).
A problem that arises from assuming a signal model function s(n; θ) rather than
knowledge of p(x; θ) is the need to choose an appropriate signal model. Then
again in order to obtain a closed form or “parameteric” expression for p(x; θ)
one usually needs to know what the underlying model and noise characteristics
are anyway.

8.1.1 EXAMPLE

Consider observations, x[n], arising from a DC-level signal model, s[n] = s(n; θ) =
θ:
x[n] = θ + w[n]
where θ is the unknown parameter to be estimated. Then we have:
N
X −1 N
X −1
J(θ) = (x[n] − s[n])2 = (x[n] − θ)2
n=0 n=0

20
The differentiating wrt θ and setting to zero:
N −1 N −1
∂J(θ) X X
=0 ⇒ −2 (x[n] − θ̂) = 0 ⇒ x[n] − N θ̂ = 0
∂θ θ=θ̂ n=0 n=0

1
PN −1
and hence θ̂ = N n=0 x[n] which is the sample-mean. We also have that:

N −1 N −1
!2
X 1 X
Jmin = J(θ̂) = x[n] − x[n]
n=0
N n=0

8.2 Linear Least Squares

We assume the signal model is a linear function of the estimator, that is:

s = Hθ

where s = [s[0], s[1], . . . , s[N − 1]]T and H is a known N × p matrix with θ =


[θ1 , θ2 , . . . , θp ]. Now:
x = Hθ + w
and with x = [x[0], x[1], . . . , x[N − 1]]T we have:
N
X −1
J(θ) = (x[n] − s[n])2 = (x − Hθ)T (x − Hθ)
n=0

Differentiating and setting to zero:



∂J(θ)
=0 ⇒ −2HT x + 2HT Hθ̂ = 0
∂θ θ=θ̂

yields the required LSE:


θ̂ = (HT H)−1 HT x
which, surprise, surprise, is the identical functional form of the MVU estimator
for the linear model.
An interesting extension to the linear LS is the weighted LS where the contribu-
tion to the error from each component of the parameter vector can be weighted
in importance by using a different from of the error criterion:

J(θ) = (x − Hθ)T W(x − Hθ)

where W is an N × N postive definite (symmetric) weighting matrix.

21
8.3 Order-Recursive Least Squares

In many cases the signal model is unknown and must be assumed. Obviously
we would like to choose the model, s(θ), that minimises Jmin , that is:

sbest (θ) = arg min Jmin


s(θ̂)

We can do this arbitrarily by simply choosing models, obtaining the LSE θ̂, and
then selecting the model which provides the smallest Jmin . However, models
are not arbitrary and some models are more “complex” (or more precisely have
a larger number of parameters or degrees of freedom) than others. The more
complex a model the lower the Jmin one can expect but also the more likely
the model is to overfit the data or be overtrained (i.e. fit the noise and not
generalise to other data sets).

8.3.1 EXAMPLE

Consider the case of “line fitting” where we have observations x(t) plotted
against the sample time index t and we would like to fit the “best” line to
the data. But what line do we fit: s(t; θ) = θ1 , constant? s(t; θ) = θ1 + θ2 t,
a straight line? s(t; θ) = θ1 + θ2 t + θ3 t2 , a quadratic? etc. Each case repre-
sents an increase in the order of the model (i.e. order of the polynomial fit),
or the number of parameters to be estimated and consequent increase in the
“modelling power”. A polynomial fit represents a linear model, s = Hθ, where
s = [s(0), s(1), . . . , s(N − 1)]T and:

 
1
 1 
Constant θ = [θ1 ]T and H =   is a N × 1 matrix
 
..
 . 
1
 
1 0

 1 1 

T
Linear θ = [θ1 , θ2 ] and H = 
 1 2  is a N × 2 matrix

 .. .. 
 . . 
1 N −1
 
1 0 0

 1 1 1 

T
Quadratic θ = [θ1 , θ2 , θ3 ] and H = 
 1 2 4  is a N × 3

 .. .. .. 
 . . . 
1 N −1 (N − 1)2
matrix

22
and so on. If the underlying model is indeed a straight line then we would expect
not only that the minimum Jmin result with a straight line model but also that
higher order polynomial models (e.g. quadratic, cubic, etc.) will yield the same
Jmin (indeed higher-order models would “degenerate” to a straight line model,
except in cases of overfitting). Thus the straight line model is the “best” model
to use.
♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦

An alternative to providing an independent LSE for each possible signal model


a more efficient order-recursive LSE is possible if the models are different orders
of the same base model (e.g. polynomials of different degree). In this method
the LSE is updated in order (of increasing parameters). Specifially define θ̂ k as
the LSE of order k (i.e. k parameters to be estimated). Then for a linear model
we can derive the order-recursive LSE as:

θ̂ k+1 = θ̂ k + UPDATEk

However the success of this approach depends on proper formulation of the linear
models in order to facilitate the derivation of the recursive update. For example,
if H has othonormal column vectors then the LSE is equivalent to projecting
the observation x onto the space spanned by the orthonormal column vectors of
H. Since increasing the order implies increasing the dimensionality of the space
by just adding another column to H this allows a recursive update relationship
to be derived.

8.4 Sequential Least Squares

In most signal processing applications the observations samples arrive as a


stream of data. All our estimation strategies have assumed a batch or block
mode of processing whereby we wait for N samples to arrive and then form our
estimate based on these samples. One problem is the delay in waiting for the
N samples before we produce our estimate, another problem is that as more
data arrives we have to repeat the calculations on the larger blocks of data (N
increases as more data arrives). The latter not only implies a growing computa-
tional burden but also the fact that we have to buffer all the data we have seen,
both will grow linearly with the number of samples we have. Since in signal
processing applications samples arise from sampling a continuous process our
computational and memory burden will grow linearly with time! One solution
is to use a sequential mode of processing where the parameter estimate for n
samples, θ̂[n], is derived from the previous parameter estimate for n−1 samples,
θ̂[n − 1]. For linear models we can represent sequential LSE as:

θ̂[n] = θ̂[n − 1] + K[n](x[n] − s[n|n − 1])

where s[n|n − 1] ≡ s(n; θ[n − 1]). The K[n] is the correction gain and (x[n] −
s[n|n − 1]) is the prediction error. The magnitude of the correction gain K[n] is

23
usually directly related to the value of the estimator error variance, var(θ̂[n−1]),
with a larger variance yielding a larger correction gain. This behaviour is rea-
sonable since a larger variance implies a poorly estimated parameter (which
should have minumum variance) and a larger correction gain is expected. Thus
one expects the variance to decrease with more samples and the estimated pa-
rameter to converge to the “true” value (or the LSE with an infinite number of
samples).

8.4.1 EXAMPLE

We consider the specific case of linear model with a vector parameter:

s[n] = H[n]θ[n]

The most interesting example of sequential LS arises with the weighted LS


error criterion with W[n] = C−1 [n] where C[n] is the covariance matrix of the
zero-mean noise, w[n], which is assumed to be uncorrelated. The argument [n]
implies that the vectors are based on n sample observations. We also consider:

C[n] = diag(σ02 , σ12 , . . . , σn2 )


 
H[n − 1]
H[n] =
hT [n]

where hT [n] is the nth row vector of the n × p matrix H[n]. It should be obvious
that:
s[n − 1] = H[n − 1]θ[n − 1]

We also have that s[n|n − 1] = hT [n]θ[n − 1]. So the estimator update is:

θ̂[n] = θ̂[n − 1] + K[n](x[n] − hT [n]θ[n − 1])

Let Σ[n] = Cθ̂ [n] be the covariance matrix of θ̂ based on n samples of data and
the it can be shown that:
Σ[n − 1]h[n]
K[n] =
σn2 + hT [n]Σ[n − 1]h[n]

and furthermore we can also derive a covariance update:

Σ[n] = (I − K[n]hT [n])Σ[n − 1]

yielding a wholly recursive procedure requiring only knowledge of the observa-


tion data x[n] and the initialisation values: θ̂[−1] and Σ[−1], the initial estimate
of the parameter and a initial estimate of the parameter covariance matrix.

24
8.5 Constrained Least Squares

Assume that in the vector parameter LSE problem we are aware of constraints
on the individual parameters, that is p-dimensional θ is subject to r < p inde-
pendent linear constraints. The constraints can be summarised by the condition
that θ satisfy the following system of linear “constraint” equations:

Aθ = b

Then using the technique of Lagrangian multipliers our LSE problem is that of
miminising the following Lagrangian error criterion:

Jc = (x − Hθ)T (x − Hθ) + λT (Aθ − b)

Let θ̂ be the unconstrained LSE of a linear model, then the expression for θ̂ c
the constrained estimate is:

θ̂ c = θ̂ + (HT H)−1 AT [A(HT H)−1 AT ]−1 (Aθ − b)

where θ̂ = (HT H)−1 HT x for unconstrained LSE.

8.6 Nonlinear Least Squares

So far we have assumed a linear signal model: s(θ) = Hθ where the notation
for the signal model, s(θ) , explicitly shows its dependence on the parameter θ.
In general the signal model will be an N -dimensional nonlinear function of the
p-dimensional parameter θ. In such a case the minimisation of:

J = (x − s(θ))T (x − s(θ))

becomes much more difficult. Differentiating wrt θ and setting to zero yields:

∂J ∂s(θ)T
=0 ⇒ (x − s(θ)) = 0
∂θ ∂θ
which requires solution of N nonlinear simultaneous equations. Approximate
solutions based on linearization of the problem exist which require iteration
until convergence.

8.6.1 Newton-Rhapson Method

Define:
∂s(θ)T
g(θ) = (x − s(θ))
∂θ
So the problem becomes that of finding the zero of a nonlinear function, that
is: g(θ) = 0. The Newton-Rhapson method linearizes the function about the

25
initialised value θ k and directly solves for the zero of the function to produce
the next estimate:
 −1
∂g(θ)
θ k+1 = θ k − g(θ)

∂θ
θ=θ k

Of course if the function was linear the next estimate would be the correct value,
but since the function is nonlinear this will not be the case and the procedure
is iterated until the estimates converge.

8.6.2 Gauss-Newton Method

We linearize the signal model about the known (i.e. initial guess or current
estimate) θ k :
∂s(θ)
s(θ) ≈ s(θ k ) + (θ − θ k )
∂θ θ=θk
and the LSE minimisation problem which can be shown to be:

J = (x̂(θ k ) − H(θ k )θ)T (x̂(θ k ) − H(θ k )θ)

where x̂(θ k ) = x − s(θ k ) + H(θ k )θ k is known and:



∂s(θ)
H(θ k ) =
∂θ θ=θk

Based on the standard solution to the LSE of the linear model, we have that:

θ k+1 = θ k + (HT (θ k )H(θ k ))−1 HT (θ k )(x − s(θ k ))

9 Method of Moments
Although we may not have an expression for the PDF we assume that we can
use the natrual estimator for the k th moment, µk = E(xk [n]), that is:
N −1
1 X k
µ̂k = x [n]
N n=0

If we can write an expression for the k th moment as a function of the unknown


parameter θ:
µk = h(θ)
and assuming h−1 exists then we can derive the an estimate by:
N −1
!
−1 −1 1 X k
θ̂ = h (µ̂k ) = h x [n]
N n=0

26
If θ is a p-dimensional vector then we require p equations to solve for the p
unknowns. That is we need some set of p moment equations. Using the lowest
order p moments what we would like is:

θ̂ = h−1 (µ̂)

where: PN −1
 1 
N P n=0 x[n]
1 N −1 2
n=0 x [n]
 
N
µ̂ = 
 
.. 
 . 
1
PN −1
N n=0 xp [n]

9.1 EXAMPLE

Consider a 2-mixture Gaussian PDF:

p(x; θ) = (1 − θ)g1 (x) + θg2 (x)

where g1 = N(µ1 , σ12 ) and g2 = N(µ2 , σ22 ) are two different Gaussian PDFs and
θ is the unknown parameter that has to be estimated. We can write the second
moment as a function of θ as follows:
Z
µ2 = E(x2 [n]) = x2 p(x; θ)dx = (1 − θ)σ12 + θσ22 = h(θ)

and hence: PN −1
1
µ̂2 − σ12 N n=0 x2 [n] − σ12
θ̂ = 2 =
σ2 − σ12 σ22 − σ12

10 Bayesian Philosophy

10.1 Minimum Mean Square Estimator (MMSE)

The classic approach we have been using so far has assumed that the parameter
θ is unknown but deterministic. Thus the optimal estimator θ̂ is optimal irre-
spective and independent of the actual value of θ. But in cases where the actual
value or prior knowledge of θ could be a factor (e.g. where an MVU estimator
does not exist for certain values or where prior knowledge would improve the
estimation) the classic approach would not work effectively.
In the Bayesian philosophy the θ is treated as a random variable with a known
prior pdf, p(θ). Such prior knowledge concerning the distribution of the estima-
tor should provide better estimators than the deterministic case.

27
In the classic approach we derived the MVU estimator by first considering mimi-
sation of the mean square eror, i.e. θ̂ = arg minθ̂ mse(θ̂) where:
Z
mse(θ̂) = E[(θ̂ − θ)2 ] = (θ̂ − θ)p(x; θ)dx

and p(x; θ) is the pdf of x parametrised by θ. In the Bayesian approach we


similarly derive an estimator by minimising θ̂ = arg minθ̂ Bmse(θ̂) where:
Z Z
Bmse(θ̂) = E[(θ − θ̂)2 ] = (θ − θ̂)2 p(x, θ)dxdθ

is the Bayesian mse and p(x, θ) is the joint pdf of x and θ (since θ is now a
random variable). It should be noted that the Bayesian squared error (θ − θ̂)2
and classic squared error (θ̂−θ)2 are the same. The minimum Bmse(θ̂) estimator
or MMSE is derived by differentiating the expression for Bmse(θ̂) with respect
to θ̂ and setting this to zero to yield:
Z
θ̂ = E(θ|x) = θp(θ|x)dθ

where the posterior pdf, p(θ|x), is given by:

p(x, θ) p(x, θ) p(x|θ)p(θ)


p(θ|x) = =R =R
p(x) p(x, θ)dθ p(x|θ)p(θ)dθ

Apart from the computational (and analytical!) requirements in deriving an


expression for the posterior pdf and then evaluating the expectation E(θ|x)
there is also the problem of finding an appropriate prior pdf . The usual choice
is to assume the joint pdf, p(x, θ), is Gaussian and hence both the prior pdf, p(θ),
and posterior pdf, p(θ|x), are also Gaussian (this property imples the Gaussian
pdf is a conjugate prior distribution). Thus the form of the pdfs remains the
same and all that changes are the means and variances.

10.1.1 EXAMPLE

Consider signal embedded in noise:

x[n] = A + w[n]

where as before w[n] = N(0, σ 2 ) is a WGN process and the unknown parameter
θ = A is to be estimated. However in the Bayesian approach we also assume
the parameter A is a random variable with a prior pdf which in this case is the
2
Gaussian pdf p(A) = N(µA , σA ) . We also have that p(x|A) = N(A, σ 2 ) and we
can assume that x and A are jointly Gaussian. Thus the posterior pdf:
p(x|A)p(A) 2
p(A|x) = R = N(µA|x , σA|x )
p(x|A)p(A)dA

28
is also a Gaussian pdf and after the required simplification we have that:
 
2 1 N µA 2
σA|x = N 1
and µ A|x = 2
x̄ + 2 σA|x
σ2 + σ2
σ σ A
A

and hence the MMSE is:


Z
 = E[A|x] = Ap(A|x)dA = µA|x

= αx̄ + (1 − α)µA
2
σA
where α = 2 +σ 2 . Upon closer examination of the MMSE we observe the
σA N
2
following (assume σA  σ 2 ):

2
1. With few data (N is small) then σA  σ 2 /N and  → µA , that is the
MMSE tends towards the mean of the prior pdf (and effectively ignores
2
the contribution of the data). Also p(A|x) ≈ N(µA , σA ).
2
2. With large amounts of data (N is large)σA  σ 2 /N and  → x̄, that is
the MMSE tends towards the sample mean x̄ (and effectively ignores the
contribution of the prior information). Also p(A|x) ≈ N(x̄, σ 2 /N ).

♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦

If x and y are jointly Gaussian, where x is k × 1 and y is l × 1, with mean vector


[E(x)T , E(y)T ]T and partitioned covariance matrix
   
Cxx Cxy k×k k×l
C= =
Cyx Cyy l×k l×l
then the conditional pdf, p(y|x), is also Gaussian and:
E(y|x) = E(y) + Cyx C−1
xx (x − E(x))
Cy|x = Cyy − Cyx C−1
xx Cxy

10.2 Bayesian Linear Model


Now consider the Bayesian Llinear Model:
x = Hθ + w
where θ is the unknown parameter to be estimated with prior pdf N(µθ , Cθ )
and w is a WGN with N(0, Cw ). The MMSE is provided by the expression for
E(y|x) where we identify y ≡ θ. We have that:
E(x) = Hµθ
E(y) = µθ

29
and we can show that:

Cxx = E[(x − E(x))(x − E(x))T ] = HCθ HT + Cw


Cyx = E[(y − E(y))(x − E(x)T ] = Cθ HT

and hence since x and θ are jointly Gaussian we have that:

θ̂ = E(θ|x) = µθ + Cθ HT (HCθ HT + Cw )−1 (x − Hµθ )

♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦ ♦

10.3 Relation with Classic Estimation


In classical estimation we cannot make any assumptions on the prior, thus all
possible θ have to be considered. The equivalent prior pdf would be a flat
distribution, essentially σθ2 = ∞. This so-called noninformative prior pdf will
yield the classic estimator where such is defined.

10.3.1 EXAMPLE

Consider the signal embedded in noise problem:

x[n] = A + w[n]
where we have shown that the MMSE is:

 = αx̄ + (1 − α)µA

2
σA 2
where α = 2 +σ 2 . If the prior pdf is noninformative then σA = ∞ and α = 1
σA N

with  = x̄ which is the classic estimator.

10.4 Nuisance Parameters


Suppose that both θ and α were unknown parameters but we are only interested
in θ. Then α is a nuisance parameter. We can deal with this by “integrating α
out of the way”. Consider:
p(x|θ)p(θ)
p(θ|x) = R
p(x|θ)p(θ)dθ
Now p(x|θ) is, in reality, p(x|θ, α), but we can obtain the true p(x|θ) by:
Z
p(x|θ) = p(x|θ, α)p(α|θ)dα

30
and if α and θ are independent then:
Z
p(x|θ) = p(x|θ, α)p(α)dα

11 General Bayesian Estimators

The Bmse(θ̂) given by:


Z Z
Bmse(θ̂) = E[(θ − θ̂)2 ] = (θ − θ̂)2 p(x, θ)dxdθ

is one specific case for a general estimator that attempts to minimise the average
of the cost function, C(), that is the Bayes risk R = E[C()] where  = (θ − θ̂).
There are three different cost functions of interest:

1. Quadratic: C() = 2 which yields R = Bmse(θ̂). We already know that


the estimate to minimise R = Bmse(θ̂) is:
Z
θ̂ = θp(θ|x)dθ

which is the mean of the posterior pdf.


2. Absolute: C() = ||. The estimate, θ̂, that minimises R = E[|θ − θ̂|]
satisfies:
Z θ̂ Z ∞
1
p(θ|x)dθ = p(θ|x)dθ or Pr{θ ≤ θ̂|x} =
−∞ θ̂ 2

that is, the median of the posterior pdf.



0 || < δ
3. Hit-or-miss: C() = . The estimate that minimises the
1 || > δ
Bayes risk can be shown to be:

θ̂ = arg max p(θ|x)


θ

which is the mode of the posterior pdf (the value that maxmises the pdf).

For the Gaussian posterior pdf it should be noted that the mean, mediam and
mode are identical. Of most interest are the quadratic and hit-or-miss cost
functions which, together with a special case of the latter, yield the following
three important classes of estimator:

31
1. MMSE (Minimum Mean Square Error ) estimator which we have
already introduced as the mean of the posterior pdf :
Z Z
θ̂ = θp(θ|x)dθ = θp(x|θ)p(θ)dθ

2. MAP (Maximum A Posteriori ) estimator which is the mode or maximum


of the posterior pdf :

θ̂ = arg max p(θ|x)


θ
= arg max p(x|θ)p(θ)
θ

3. Bayesian ML (Bayesian Maximum Likelihood ) estimator which is


the special case of the MAP estimator where the prior pdf, p(θ), is uniform
or noninformative:
θ̂ = arg max p(x|θ)
θ

Noting that the conditional pdf of x given θ, p(x|θ), is essentially equiva-


lent to the pdf of x parametrized by θ, p(x; θ), the Bayesian ML estimator
is equivalent to the classic MLE.

Comparing the three estimators:

• The MMSE is preferred due to its least-squared cost function but it is


also the most difficult to derive and compute due to the need to find an
expression
R or measurements of the posterior pdf p(θ|x)in order to integrate
θp(θ|x)dθ
• The MAP hit-or-miss cost function is more “crude” but the MAP esti-
mate is easier to derive since there is no need to integrate, only find the
maximum of the posterior pdf p(θ|x) either analytically or numerically.
• The Bayesian ML is equivalent in performance to the MAP only in the
case where the prior pdf is noninformative, otherwise it is a sub-optimal
estimator. However, like the classic MLE, the expression for the condi-
tional pdf p(x|θ) is usually easier to obtain then that of the posterior pdf,
p(θ|x). Since in most cases knowledge of the the prior pdf is unavailable
so, not surprisingly, ML estimates tend to be more prevalent. However it
may not always be prudent to assume the prior pdf is uniform, especially
in cases where prior knowledge of the estimate is available even though
the exact pdf is unknown. In these cases a MAP estimate may perform
better even if an “artificial” prior pdf is assumed (e.g. a Gaussian prior
which has the added benefit of yielding a Gaussian posterior).

32
12 Linear Bayesian Estimators

12.1 Linear Minimum Mean Square Error (LMMSE) Es-


timator
We assume that the parameter θ is to be estimated based on the data set
x = [x[0] x[1] . . . x[N − 1]]T rather than assume any specific form for the joint
pdf p(x, θ). We consider the class of all affine estimators of the form:
N
X −1
θ̂ = an x[n] + aN = aT x + aN
n=0

where a = [a1 a2 . . . aN −1 ]T and choose the weight co-efficients {a, aN } to min-


imize the Bayesian MSE:
h i
Bmse(θ̂) = E (θ − θ̂)2

The resultant estimator is termed the linear minimum mean square error (LMMSE)
estimator. The LMMSE will be sub-optimal unless the MMSE is also linear.
Such would be the case if the Bayesian linear model applied:

x = Hθ + w
∂Bmse(θ̂)
The weight co-efficients are obtained from ∂ai = 0 for i = 1, 2, . . . , N
which yields:
N
X −1
aN = E(θ) − an E(x[n]) = E(θ) − aT E(x) and a = C−1
xx Cxθ
n=0
 
where Cxx = E (x − E(x))(x − E(x))T is the N × N covariance matrix and
Cxθ = E [(x − E(x))(θ − E(θ))] is the N × 1 cross-covariance vector. Thus the
LMMSE estimator is:

θ̂ = aT x + aN = E(θ) + Cθx C−1


xx (x − E(x))
 
where we note Cθx = CTxθ = E (θ − E(θ))(x − E(x))T . For the 1 × p vector
parameter θ an equivalent expression for the LMMSE estimator is derived:

θ̂ = E(θ) + Cθx C−1


xx (x − E(x))
 
where now Cθx = E (θ − E(θ))(x − E(x))T is the p × N cross-covariance
matrix. And the Bayesian MSE matrix is:
h i
Mθ̂ = E (θ − θ̂)(θ − θ̂)T
= Cθθ − Cθx C−1xx Cxθ
 
where Cθθ = E (θ − E(θ))(θ − E(θ))T is the p × p covariance matrix.

33
12.1.1 Bayesian Gauss-Markov Theorem

If the data are described by the Bayesian linear model form:


x = Hθ + w
where x is the N × 1 data vector, H is a known N × p observation matrix, θ
is a p × 1 random vector of parameters with mean E(θ) and covariance matrix
Cθθ and w is an N × 1 random vector with zero mean and covariance matrix
Cw which is uncorrelated with θ (the joint pdf p(w, θ) and hence also p(x, θ)
are otherwise arbitrary), then noting that:
E(x) = HE(θ)
Cxx = HCθθ HT + Cw
Cθx = Cθθ HT
the LMMSE estimator of θ is:
θ̂ = E(θ) + Cθθ HT (HCθθ HT + Cw )−1 (x − HE(θ))
= E(θ) + (C−1 T −1
θθ + H Cw H)
−1 T −1
H Cw (x − HE(θ))
and the covariance of the error which is the Bayesian MSE matrix is:
Mθ̂ = (C−1 T −1
θθ + H Cw H)
−1

12.2 Wiener Filtering


We assume N samples of time-series data x = [x[0] x[1] . . . x[N − 1]]T which are
wide-sense stationary (WSS). As such the N × N covariance matrix takes the
symmetric Toeplitz form:
Cxx = Rxx where [Rxx ]ij = rxx [i − j]
where rxx [k] = E(x[n]x[n − k]) is the autocorrelation function (ACF) of the
x[n] process and Rxx denotes the autocorrelation matrix. Note that since x[n]
is WSS the expectation E(x[n]x[n − k]) is independent of the absolute time
index n. In signal processing the estimated ACF is used:

1
PN −1−|k|
n=0 x[n]x[n + |k|] |k| ≤ N − 1
r̂xx [k] = N
0 |k| ≥ N

Both the data x and the parameter to be estimated θ̂ are assumed zero mean.
Thus the LMMSE estimator is:
θ̂ = Cθx C−1
xx x

Application of the LMMSE estimator to the three signal processing estimation


problems of filtering, smoothing and prediction gives rise to the Wiener filtering
equation solutions.

34
12.2.1 Smoothing

The problem is to estimate the signal θ = s = [s[0] s[1] . . . s[N − 1]]T based on
the noisy data x = [x[0] x[1] . . . x[N − 1]]T where:

x=s+w

and w = [w[0] w[1] . . . w[N − 1]]T is the noise process. An important difference
between smoothing and filtering is that the signal estimate s[n] can use the
entire data set: the past values (x[0], x[1], . . . x[n − 1]), the present x[n] and
future values (x[n + 1], x[n + 2], . . . , x[N − 1]). This means that the solution
cannot be cast as “filtering” problem since we cannot apply a causal filter to
the data.
We assume that the signal and noise processes are uncorrelated. Hence:

rxx [k] = rss [k] + rww [k]

and thus:
Cxx = Rxx = Rss + Rww
also:
Cθx = E(sxT ) = E(s(s + w)T ) = Rss
Hence the LMMSE estimator (also called the Wiener estimator) is:

ŝ = Rss (Rss + Rww )−1 x = Wx

and the N × N matrix:

W = Rss (Rss + Rww )−1

is referred to as the Wiener smoothing matrix.

12.2.2 Filtering

The problem is to estimate the signal θ = s[n] based only on the present and
past noisy data x = [x[0] x[1] . . . x[n]]T . As n increases this allows us to view
the estimation process as an application of a causal filter to the data and we
need to cast the LMMSE estimator expression in the form of a filter.
Assuming the signal and noise processes are uncorrelated we have:

Cxx = Rss + Rww

where Cxx is an (n + 1) × (n + 1) autocorrelation matrix. Also:

Cθx = E(sxT ) = E(s(s + w)T ) = E(ssT )


= E(s[n] [s[0] s[1] . . . s[n]])
= [rss [n] rss [n − 1] . . . rss [0]] = řTss

35
is a 1 × (n + 1) row vector. The LMMSE estimator is:

ŝ[n] = řTss (Rss + Rww )−1 x = aT x

where a = [a0 a1 . . . an ]T = (Rss + Rww )−1 řTss is the (n + 1) × 1 vector of


weights. Note that the “check” subscript is used to denote time-reversal. Thus:

rTss = [rss [0] rss [1] . . . rss [n]]

We interpret the process of forming the estimator as time evolves (n increases)


as a filtering operation. Specifically we let h(n) [k], the time-varying impulse
response, be the response of the filter at time n to an impulse applied k samples
before (i.e. at time n − k). We note that ai can be intrepreted as the response
of the filter at time n to the signal (or impulse) applied at time i = n − k . Thus
we can make the following correspondence:

h(n) [k] = an−k k = 0, 1, . . . , n.

Then:
n
X n
X n
X
(n)
ŝ[n] = ak x[k] = h [n − k]x[k] = h(n) [k]x[n − k]
k=0 k=0 k=0
 T
We define the vector h = h(n) [0] h(n) [1] . . . h(n) [n] . Then we have that h = ǎ,
that is, h is a time-reversed version of a. To explicitly find the impulse response
h we note that since:
(Rss + Rww )a = řTss
then it also true that:
(Rss + Rww )h = rss

When written out we get the Wiener-Hopf filtering equations:

h(n) [0]
    
rxx [0] rxx [1] ··· rxx [n] rss [0]
 rxx [1] rxx [0] ··· rxx [n − 1]  h(n) [1]   rss [1] 
=
    
 .. .. .. ..  .. .. 
 . . . .  .   . 
rxx [n] rxx [n − 1] · · · rxx [0] h(n) [n] rss [n]
where rxx [k] = rss [k] + rww [k]. A computationally efficient solution for solving
the equations is the Levinson recursion which solves the equations recursively
to avoid resolving them for each value of n.

12.2.3 Prediction

The problem is to estimate θ = x[N − 1 + l] based on the current and past x =


T
[x[0] x[1] . . . x[N − 1]] at sample l ≥ 1 in the future. The resulting estimator is
termed the l-step linear predictor.

36
As before we have Cxx = Rxx where Rxx is the N × N autocorrelation matrix,
and:

Cθx = E(xxT )
= E(x[N − 1 + l] [x[0] x[1] . . . x[N − 1]])
= [rxx [N − 1 + l] rxx [N − 2 + l] . . . rxx [l]] = řTxx

Then the LMMSE estimator is:

x̂[N − 1 + l] = řTxx R−1 T


xx x = a x

T
where a = [a0 a1 . . . aN −1 ] = R−1
xx řxx . We can interpret the process of forming
the estimator as a filtering operation where h(N ) [k] = h[k] = an−k ⇒ h[N −
k] = ak and then:
N
X −1 N
X
x̂[N − 1 + l] = h[N − k]x[k] = h[k]x[N − k]
k=0 k=1

T T
Defining h = [h[1] h[2] . . . h[N ]] = [aN −1 aN −2 . . . a0 ] = ǎ as before we can
find an explicit expression for h by noting that:

Rxx a = rxx ⇒ Rxx h = rxx


T
where rxx = [rxx [l] rxx [l + 1] . . . rxx [l + N − 1]] is the time-reversed version of
řxx . When written out we get the Wiener-Hopf prediction equations:
    
rxx [0] rxx [1] · · · rxx [N − 1] h[1] rxx [l]
 rxx [1] rxx [0] · · · rxx [N − 2]    h[2]  
   rxx [l + 1] 
=
 
.. .. . .. .
.. . .
  ..   ..
    
 . . 
rxx [N − 1] rxx [N − 2] · · · rxx [0] h[N ] rxx [l + N − 1]

A computationally efficient solution for solving the equations is the Levinson


recursion which solves the equations recursively to avoid resolving them for
each value of n. The special case for l = 1, the one-step linear predictor, covers
two important cases in signal processing:

• the values of −h[n] are termed the linear prediction coefficients which are
used extensively in speech coding, and
• the resulting equations are identical to the Yule-Walker equations used to
solve the AR filter parameters of an AR(N ) process.

12.3 Sequential LMMSE


We can derive a recursive re-estimation of θ̂[n] (i.e. θ̂ based on n data samples)
from:

37
• θ̂[n − 1], the estimate based on n − 1 data samples,
• x[n], the nth data sample, and
• x̂[n|n − 1], an estimate of x[n] based on n − 1 data samples.

Using a vector space analogy where we define the following inner product:
(x, y) = E(xy)
the procedure is as follows:

1. From the previous iteration we have the estimate, θ̂[n − 1], or we have
an initial estimate θ̂[0]. We can consider θ̂[n − 1] as the true value of θ
projected on the subspace spanned by {x[0], , x[1], . . . , x[n − 1]}.
2. We find the LMMSE estimator of x[n] based on the previous n−1 samples,
that is the one-step linear predictor, x̂[n|n − 1], or we have an initial
estimate, x̂[0| − 1]. We can consider x̂[n|n − 1] as the true value x[n]
projected on the subspace spanned by {x[0], , x[1], . . . , x[n − 1]}.
3. We form the innovation x[n] − x̂[n|n − 1] which we can consider as being
orthogonal to the space spanned by {x[0], , x[1], . . . , x[n − 1]} and rep-
resenting the direction of correction based exclusively on the new data
sample x[n].
4. We calculate the correction gain K[n] by the normalised projection of θ
on the innovation x[n] − x̂[n|n − 1], that is:
E[θ(x[n] − x̂[n|n − 1])]
K[n] =
E[(x[n] − x̂[n|n − 1])2 ]
5. Finally we update the estimator by adding the correction:
θ̂[n] = θ̂[n − 1] + K[n](x[n] − x̂[n|n − 1])

12.3.1 Sequential LMMSE for the Bayesian Linear Model

For the vector parameter-scalar data form of the Bayesian linear model:
x[n] = hT θ + w[n]

we can derive the sequential vector LMMSE:


θ̂[n] = θ̂[n − 1] + K[n](x[n] − hT [n]θ̂[n − 1])
M[n − 1]h[n]
K[n] =
σn2 + hT [n]M[n − 1]h[n]
M[n] = (I − K[n]hT [n])M[n − 1]

where M[n] is the error covariance estimate:


M[n] = E[(θ − θ̂[n])(θ − θ̂[n])T ]

38
13 Kalman Filters
The Kalman filter can be viewed as an extension of the sequential LMMSE to
the case where the parameter estimate, or state, varies with time in a known but
stochastic way and unknown initialisation. We consider the vector state-vector
observation case and consider the unknown state (to be estimated) at time n,
s[n], to vary according to the Gauss-Markov signal model process equation:

s[n] = As[n − 1] + Bu[n] n ≥ 0

where s[n] is the p×1 state vector with unknown initialisation s[−1] = N(µs , Cs )
, u[n] = N(0, Q) is the WGN r × 1 driving noise vector and A, B are known
p × p and p × r matrices. The observed data, x[n], is then assumed a linear
function of the state vector with the following observation equation:

x[n] = H[n]s[n] + w[n] n ≥ 0

where x[n] is the M × 1 observation vector, w[n] = N(0, R) is the WGN M × 1


observation noise sequence and H is a known M × p matrix.
 
We wish to estimate s[n] based on the observations X[n] = xT [0] xT [1] . . . xT [n] .
Our criterion of optimality is the Bayesian MMSE:

E[(s[n] − ŝ[n|n])2 ]

where ŝ[n|n] = E(s[n]|X[n]) is the estimate of s[n] based on the observation


sequence X[n]. We define the innovation sequence:

x̃[n] = x[n] − x̂[n|n − 1]

where x̂[n|n − 1] = E(x[n]|X[n − 1]) is the predictor of x[n] based on the


observation sequence X[n − 1]. We can consider ŝ[n|n] as being derived from
two separate components:

ŝ[n|n] = E(s[n]|X[n − 1]) + E(s[n]|x̃[n])

where E(s[n]|X[n − 1]) is the “state prediction” and E(s[n]|x̃[n]) is the “inno-
vation correction”. By definition ŝ[n|n − 1] = E(s[n]|X[n − 1]) and from the
process equation we can show that:

ŝ[n|n − 1] = Aŝ[n − 1|n − 1]

From our sequential LMMSE results we also have that :

E(s[n]|x̃[n]) = K[n](x[n] − x̂[n|n − 1])

and:

K[n] = E(s[n]x̃T [n])E(x̃[n]x̃T [n])−1 = M[n|n−1]HT [n](H[n]M[n|n−1]HT [n]+R)−1

39
where:
M[n|n − 1] = E (s[n] − ŝ[n|n − 1])(s[n] − ŝ[n|n − 1])T
 

is the state error covariance predictor at time n based on the observation se-
quence X[n − 1]. From the process equation we can show that:

M[n|n − 1] = AM[n − 1|n − 1]AT + BQBT

and by appropriate substitution we can derive the following expression for the
state error covariance estimate at time n:

M[n|n] = (I − K[n]H[n])M[n|n − 1]

Kalman Filter Procedure

Prediction

ŝ[n|n − 1] = Aŝ[n − 1|n − 1]


M[n|n − 1] = AM[n − 1|n − 1]AT + BQBT

Gain
K[n] = M[n|n − 1]HT [n](H[n]M[n|n − 1]HT [n] + R)−1

Correction

ŝ[n|n] = ŝ[n|n − 1] + K[n](x[n] − x̂[n|n − 1])


M[n|n] = (I − K[n]H[n])M[n|n − 1]

14 MAIN REFERENCE
Steven M. Kay, “Fundamentals of Statistical Signal Processing: Estimation The-
ory”, Prentice-Hall, 1993

40

You might also like