
Lecture notes on Bayesian estimation for SC42025:

Filtering & Identification

Manon Kok and Frida Viset

December 1, 2023

1 Introduction

In this document, the Bayesian interpretation and derivation of the filtering part of this course is introduced.
A higher level introduction to Bayesian filtering and smoothing can also be found in Sections 1.1–1.3 in [1].

2 Notation preliminaries

Throughout this document we will assume that all distributions are Gaussian. We use the notation N (a ; µa , Pa )
to denote the variable a ∈ Rn being Gaussian distributed with mean µa and covariance Pa . In other words,
N (a ; µa , Pa ) = (1/√((2π)^n det(Pa ))) exp(−½ (a − µa )^T Pa^{-1} (a − µa )) = (1/√((2π)^n det(Pa ))) exp(−½ ∥a − µa ∥²_{Pa^{-1}}),   (1)

where we introduced the weighted norm notation ∥a − µa ∥²_{Pa^{-1}} = (a − µa )^T Pa^{-1} (a − µa ), and where det(Pa )
denotes the determinant of the matrix Pa . Sometimes we simplify this notation to N (µa , Pa ) if there can’t
be any confusion about what the random variable is.

When working with probabilities, we will make use of three fundamental relations:
Marginalization:  p(a | c) = ∫ p(a, b | c) db                        (2a)
Conditioning:     p(a, b | c) = p(a | b, c) p(b | c)                 (2b)
Bayes' theorem:   p(a | b, c) = p(b | a, c) p(a | c) / p(b | c)      (2c)

Furthermore, we will often make use of the following three theorems and one lemma:

Theorem 1: Joint distribution of Gaussian variables Assume that both xa and xb conditioned on
xa are distributed according to

p(xa ) = N (xa ; µa , Pa ), (3a)


p(xb | xa ) = N (xb ; M xa + b, Pb|a ), (3b)

where M and b are deterministic and known. Then the joint distribution of xa and xb is given by

p(xa , xb ) = N ( [xa ; xb ] ; [µa ; M µa + b], P ),   P = [ Pa , Pa M^T ; M Pa , M Pa M^T + P_{b|a} ].   (4)

Proof: See Appendix A.2 of [2].

Theorem 2: Marginal distribution of Gaussian variables Assume that the random variables xa
and xb are jointly distributed as
     
p(xa , xb ) = N ( [xa ; xb ] ; [µa ; µb ], [Paa , Pab ; Pba , Pbb ] ).   (5)

Then the marginal densities p(xa ) and p(xb ) are given by

p(xa ) = N (xa ; µa , Paa ), (6a)


p(xb ) = N (xb ; µb , Pbb ). (6b)

Proof: See Appendix A.1 of [2].

Theorem 3: Conditional distribution of Gaussian variables Assume that the random variables
xa and xb are jointly distributed as (5), then the conditional densities p(xa | xb ) and p(xb | xa ) are given by
p(xa | xb ) = N (xa ; µa + Pab Pbb^{-1} (xb − µb ), Paa − Pab Pbb^{-1} Pba ),   (7a)
p(xb | xa ) = N (xb ; µb + Pba Paa^{-1} (xa − µa ), Pbb − Pba Paa^{-1} Pab ).   (7b)

Proof: See Appendix A.1 of [2].
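These identities are easy to check numerically. The sketch below is not part of the notes' derivation; it assumes numpy and uses arbitrary example matrices. It builds the joint distribution (4) from the model (3) using Theorem 1, and then verifies that conditioning via (7b) recovers the conditional model p(xb | xa) exactly.

```python
import numpy as np

# Hypothetical 2D example: p(xa) = N(mu_a, Pa), p(xb | xa) = N(M xa + b, P_b_given_a)
mu_a = np.array([1.0, -0.5])
Pa = np.array([[2.0, 0.3], [0.3, 1.0]])
M = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([0.5, 0.5])
P_b_given_a = np.array([[0.5, 0.0], [0.0, 0.2]])

# Theorem 1, eq. (4): joint mean and covariance of (xa, xb)
mu_b = M @ mu_a + b
Paa, Pab = Pa, Pa @ M.T
Pba, Pbb = M @ Pa, M @ Pa @ M.T + P_b_given_a

# Theorem 3, eq. (7b): condition xb on a particular value of xa
xa = np.array([2.0, 0.0])
cond_mean = mu_b + Pba @ np.linalg.solve(Paa, xa - mu_a)
cond_cov = Pbb - Pba @ np.linalg.solve(Paa, Pab)

# Round trip: conditioning the joint recovers the original conditional model
assert np.allclose(cond_mean, M @ xa + b)
assert np.allclose(cond_cov, P_b_given_a)
```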

Lemma 1: Distribution of the linear transform of a Gaussian variable Given the distribution

p(xa ) = N (xa ; µa , Pa ) (8)

and the model


xb = Gxa + w, (9)
with G being deterministic and w ∼ N (0, Pw ), it holds that

p(xb ) = N (xb ; Gµa , GPa GT + Pw ). (10)

In the special case of w = 0 (in other words, in the absence of noise), p(xb ) = N (xb ; Gµa , GPa GT ).

Proof: We can derive this lemma using a combination of Theorems 1 and 2, by setting M = G, b = 0
and Pb|a = Pw .

3 Stochastic least squares

3.1 Probabilistic models

Let us first consider the standard least squares problem of estimating the unknown parameters θ ∈ Rn from
measurements y ∈ RN modelled as

y = F θ + ε, ε ∼ N (0, Σ), (11)

where we assume F to be deterministic and known. If θ in (11) were also deterministic and known, we could make the following observations about y:

• The mean of y would be F θ as F θ is known and the mean of ε is equal to zero.


• The covariance of y would be Σ as F θ is deterministic and the covariance of ε is equal to Σ.
• The distribution of y would be Gaussian as F θ is deterministic and ε is Gaussian.

We can summarise these three observations as

p(y | θ) = N (y ; F θ, Σ), (12)

where p(y | θ) is the conditional distribution of y given θ and the notation N (· ; ·, ·) is introduced in (1).
Note that (11) and (12) contain exactly the same information and can hence be used interchangeably.

In stochastic least squares, the goal is now to combine collected measurements y (modelled as (11) or
equivalently (12)) with prior information about the parameters θ. Throughout the course, we will assume
that this prior information is Gaussian and will model it as θ ∼ N (µθ , Pθ ). In other words, we assume that

p(θ) = N (µθ , Pθ ). (13)

Note that p(θ) is a prior distribution, representing information that is available before any data is observed.
This is in contrast to the posterior distribution p(θ | y) which represents the distribution of θ given the
observed data y. This is the distribution that we are interested in computing in stochastic least squares.

3.2 Computing the posterior

We can write the stochastic least squares problem as


p(θ | y) = p(y | θ) p(θ) / p(y) = p(y, θ) / ∫ p(y, θ) dθ.   (14)

Here, we first used Bayes’ rule (2c) and subsequently used the conditioning and marginalisation relations (2b)
and (2a). Note that we know both p(y | θ) and p(θ) from our measurement model (12) and our prior (13),
respectively. In fact, they are exactly of the form of Theorem 1. Hence, we can write

p(θ, y) = N ( [θ ; y] ; [µθ ; F µθ ], P ),   P = [Pθ , Pθ F^T ; F Pθ , F Pθ F^T + Σ].   (15)

Now using Theorem 3, we find that the posterior p(θ | y) is given by


p(θ | y) = N ( θ ; µθ + Pθ F^T (F Pθ F^T + Σ)^{-1} (y − F µθ ), Pθ − Pθ F^T (F Pθ F^T + Σ)^{-1} F Pθ ).   (16)

The mean and covariance can now be seen to be equal to the well-known stochastic least squares solution.
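As an illustration, the posterior (16) can be implemented directly. The sketch below assumes numpy; the function name `sls_posterior` and the example matrices are our own and purely illustrative.

```python
import numpy as np

def sls_posterior(y, F, Sigma, mu_theta, P_theta):
    """Posterior mean and covariance of theta given y, following eq. (16)."""
    S = F @ P_theta @ F.T + Sigma           # innovation covariance F Pθ Fᵀ + Σ
    K = P_theta @ F.T @ np.linalg.inv(S)    # gain Pθ Fᵀ (F Pθ Fᵀ + Σ)⁻¹
    mean = mu_theta + K @ (y - F @ mu_theta)
    cov = P_theta - K @ F @ P_theta
    return mean, cov

# Hypothetical example: two parameters, three noisy measurements
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Sigma = 0.1 * np.eye(3)
mu_theta = np.zeros(2)
P_theta = np.eye(2)
y = np.array([1.0, 2.0, 3.1])
mean, cov = sls_posterior(y, F, Sigma, mu_theta, P_theta)
```

By Lemmas 2 and 3 in Appendix A, this gain form agrees exactly with the information form (F^T Σ^{-1} F + Pθ^{-1})^{-1}(F^T Σ^{-1} y + Pθ^{-1} µθ ) derived in Section 3.3.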

3.3 Interpretation in terms of solving an optimisation problem

We can alternatively interpret the stochastic least squares problem as a maximum a posteriori (MAP)
estimation problem. In other words, we can interpret computing an estimate θ̂ as computing the maximum
of the posterior distribution p(θ | y) which can be formulated as

θ̂ = arg max_θ p(θ | y) = arg max_θ p(y | θ) p(θ) / p(y),   (17)

where arg max_θ denotes the maximising argument. Since p(y) doesn't explicitly depend on θ, it does not
affect the location of the maximum. Hence,

θ̂ = arg max_θ p(θ | y) = arg max_θ p(y | θ) p(θ).   (18)

Now using the fact that maximising p(·) is equivalent to minimising − log p(·), we can rewrite this problem
as
θ̂ = arg max_θ p(θ | y) = arg min_θ − log p(y | θ) − log p(θ).   (19)

Now it is possible to insert the known expressions for p(y | θ) and p(θ) from (12) and (13) using (1):

θ̂ = arg max_θ p(θ | y) = arg min_θ ½ ∥y − F θ∥²_{Σ^{-1}} + ½ ∥θ − µθ ∥²_{Pθ^{-1}} + ½ log det(Σ) + ½ log det(Pθ ) + ((N + n)/2) log 2π.   (20)

Again, as constants do not affect the location of the minimum they can be omitted giving the resulting
optimisation problem

θ̂ = arg max_θ p(θ | y) = arg min_θ ½ ∥y − F θ∥²_{Σ^{-1}} + ½ ∥θ − µθ ∥²_{Pθ^{-1}}.   (21)

Note that throughout this course we sometimes omit the pre-multiplication by ½ for notational brevity as it does not affect the minimum. Note also that (21) is again a weighted stochastic least squares problem. To
solve (21), we can first equivalently write it as
θ̂ = arg min_θ ½ (y − F θ)^T Σ^{-1} (y − F θ) + ½ (θ − µθ )^T Pθ^{-1} (θ − µθ )
  = arg min_θ ½ ( [y ; µθ ] − [F ; I] θ )^T W ( [y ; µθ ] − [F ; I] θ )   (22)
  = arg min_θ ½ ∥ [y ; µθ ] − [F ; I] θ ∥²_W ,   W = [Σ , 0 ; 0 , Pθ ]^{-1}.

This is a least squares problem with solution


θ̂ = ( [F ; I]^T [Σ , 0 ; 0 , Pθ ]^{-1} [F ; I] )^{-1} [F ; I]^T [Σ , 0 ; 0 , Pθ ]^{-1} [y ; µθ ].   (23)

Some algebraic manipulation of (23) gives

θ̂ = (F^T Σ^{-1} F + Pθ^{-1})^{-1} (F^T Σ^{-1} y + Pθ^{-1} µθ )   (24)
  = (F^T Σ^{-1} F + Pθ^{-1})^{-1} F^T Σ^{-1} y + (F^T Σ^{-1} F + Pθ^{-1})^{-1} Pθ^{-1} µθ .

We can use Lemma 3 from Appendix A, with A = Pθ−1 , B = F T , C = Σ−1 , and D = F , to see that
(F T Σ−1 F + Pθ−1 )−1 F T Σ−1 = Pθ F T (F Pθ F T + Σ)−1 . We will denote this matrix K, such that

θ̂ = K y + (F^T Σ^{-1} F + Pθ^{-1})^{-1} Pθ^{-1} µθ
  = K y + (F^T Σ^{-1} F + Pθ^{-1})^{-1} (Pθ^{-1} + F^T Σ^{-1} F − F^T Σ^{-1} F ) µθ   (25)
  = K y + (I − K F ) µθ .

Similarly, the covariance of the least squares estimate is given by

E[(θ − θ̂)(θ − θ̂)^T] = ( [F ; I]^T [Σ , 0 ; 0 , Pθ ]^{-1} [F ; I] )^{-1} = (F^T Σ^{-1} F + Pθ^{-1})^{-1},   (26)

where applying Lemma 2 from Appendix A with A = Pθ^{-1}, B = F^T, C = Σ^{-1} and D = F gives that

E[(θ − θ̂)(θ − θ̂)^T] = Pθ − Pθ F^T (Σ + F Pθ F^T)^{-1} F Pθ ,   (27)

which corresponds exactly to the mean and covariance derived in Section 3.2. Note that when we don't have any prior information available, or equivalently, if the prior is flat (independent of θ), the solution to this problem is trivially the least squares solution θ̂ = (F^T Σ^{-1} F)^{-1} F^T Σ^{-1} y, with variance E[(θ − θ̂)(θ − θ̂)^T] = (F^T Σ^{-1} F)^{-1}.
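The equivalence between the information form (24), (26) and the gain form (25), (27) is easy to confirm numerically; a sketch assuming numpy, with arbitrary (hypothetical) example matrices:

```python
import numpy as np

# Arbitrary example matrices, for illustration only
rng = np.random.default_rng(1)
F = rng.standard_normal((4, 2))
Sigma = np.diag([0.5, 0.2, 0.3, 0.4])
P_theta = np.array([[1.0, 0.2], [0.2, 2.0]])
mu_theta = np.array([0.5, -1.0])
y = rng.standard_normal(4)

Si = np.linalg.inv(Sigma)
Pi = np.linalg.inv(P_theta)

# Information form, eq. (24)
theta_info = np.linalg.solve(F.T @ Si @ F + Pi, F.T @ Si @ y + Pi @ mu_theta)

# Gain form, eq. (25), with K = Pθ Fᵀ (F Pθ Fᵀ + Σ)⁻¹ from Lemma 3
K = P_theta @ F.T @ np.linalg.inv(F @ P_theta @ F.T + Sigma)
theta_gain = K @ y + (np.eye(2) - K @ F) @ mu_theta
assert np.allclose(theta_info, theta_gain)

# Covariance forms (26) and (27) agree as well (Lemma 2, the Woodbury identity)
cov_info = np.linalg.inv(F.T @ Si @ F + Pi)
cov_gain = P_theta - P_theta @ F.T @ np.linalg.inv(Sigma + F @ P_theta @ F.T) @ F @ P_theta
assert np.allclose(cov_info, cov_gain)
```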

3.4 Extension to nonlinear models

Let us now assume that instead of the linear measurement model (12), we have a nonlinear model

y = f (θ) + ϵ, ϵ ∼ N (0, Σ), (28)

or equivalently, p(y | θ) = N (y ; f (θ), Σ). In this case, the computations in Section 3.2 no longer hold as
Theorem 1 can not be used. The posterior p(θ | y) is generally no longer Gaussian. However, it is still possible
to compute a MAP estimate using nonlinear least squares (NLS) similar to the approach in Section 3.3. For
notational brevity, let us assume there is no prior knowledge available on θ (or in other words the prior on
θ is flat). We can now straightforwardly adapt (21) to the nonlinear model (28) as

θ̂ = arg max_θ p(θ | y) = arg min_θ ½ ∥y − f (θ)∥²_{Σ^{-1}}.   (29)

We can solve this problem by solving a sequence of linear least squares problems if we first use a first order
Taylor series expansion of f (θ) around a linearisation point θ̂(i) as

f (θ) = f (θ̂^{(i)} + ∆θ) ≈ f (θ̂^{(i)} ) + df (θ)/dθ |_{θ=θ̂^{(i)}} ∆θ.   (30)

Inserting this first order Taylor expansion (30) into (28) and reordering terms, we can recognise an approxi-
mate least squares problem

y − f (θ̂^{(i)} ) ≈ df (θ)/dθ |_{θ=θ̂^{(i)}} ∆θ + ϵ,   ϵ ∼ N (0, Σ),   (31)

where we denote the residual e^{(i)} = y − f (θ̂^{(i)} ) and the Jacobian F^{(i)} = df (θ)/dθ |_{θ=θ̂^{(i)}}.

The NLS problem can now be solved by solving a sequence of linear least squares problems. Starting from
an initial estimate θ̂(0) , for each iteration i, we can first compute ∆θ from (31) and subsequently update
the linearisation point as θ̂(i+1) = θ̂(i) + ∆θ. The iterations can be terminated for instance when ∥∆θ∥22 is
smaller than a certain value or when a maximum number of iterations has been reached. The challenge in
practice is how to choose θ̂(0) as different choices of θ̂(0) can lead to convergence to different local minima.
In practice, some NLS problems are not that sensitive to the choice of θ̂^{(0)} and standard choices like the zero- or the one-vector are sufficient. In other NLS problems, a more elaborate scheme to find a proper θ̂^{(0)} is needed. This typically involves making use of domain knowledge and is outside the scope of our course.

Algorithm 1: Nonlinear least squares


Input: An initial estimate θ̂(0) and a stopping criterion.
Output: A final estimate θ̂.
Set i = 0
while stopping criterion not fulfilled
    Compute F^{(i)} = df (θ)/dθ |_{θ=θ̂^{(i)}} and e^{(i)} = y − f (θ̂^{(i)} ).
    Solve the least squares problem e^{(i)} = F^{(i)} ∆θ + ϵ to compute ∆θ.
    Update the estimate θ̂^{(i+1)} = θ̂^{(i)} + ∆θ.
    if stopping criterion fulfilled
        Set θ̂ = θ̂^{(i+1)}.
    else
        Set i = i + 1.
    end if
end while
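Algorithm 1 can be sketched in a few lines. The implementation below is our own illustration, assuming numpy, a flat prior and Σ = I (so each linear subproblem is an ordinary least squares solve); the exponential model in the example is purely hypothetical.

```python
import numpy as np

def nls(f, jac, y, theta0, max_iter=100, tol=1e-14):
    """Algorithm 1 with a flat prior and Sigma = I: Gauss-Newton iterations."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        Fi = jac(theta)              # F(i) = df/dtheta at the linearisation point
        ei = y - f(theta)            # e(i) = y - f(theta_hat(i))
        # Solve the linear least squares problem e(i) = F(i) dtheta for dtheta
        dtheta, *_ = np.linalg.lstsq(Fi, ei, rcond=None)
        theta = theta + dtheta
        if dtheta @ dtheta < tol:    # stopping criterion on the squared step norm
            break
    return theta

# Hypothetical example: fit amplitude and rate of y = a * exp(-b * t)
t = np.linspace(0.0, 2.0, 20)
f = lambda th: th[0] * np.exp(-th[1] * t)
jac = lambda th: np.column_stack([np.exp(-th[1] * t),
                                  -th[0] * t * np.exp(-th[1] * t)])
y = f(np.array([2.0, 1.5]))          # noise-free data, so the fit should be exact
theta_hat = nls(f, jac, y, theta0=np.array([1.0, 1.0]))
```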

4 Kalman filtering

4.1 Probabilistic models

In Kalman filtering, we are interested in recursively computing the posterior distribution of a state xk . The
dynamics of this state is modelled as
xk+1 = Axk + wk , wk ∼ N (0, Q), (32)
where A and Q are assumed to be known and deterministic. Similarly to the reasoning in Section 3.1, we
can equivalently write this model in terms of the conditional distribution p(xk+1 | xk ) as
p(xk+1 | xk ) = N (xk+1 ; Axk , Q). (33)
We also assume that measurements yk providing information about the state xk are available. The measure-
ment model is given by
yk = Cxk + vk , vk ∼ N (0, R), (34)
where C and R are assumed to be known and deterministic. We also assume that vk and wk are uncorrelated.
Again, similarly to the reasoning in Section 3.1, we can equivalently write this model in terms of the
conditional distribution p(yk | xk ) as
p(yk | xk ) = N (yk ; Cxk , R). (35)
We refer to the combination of (32) and (34), or equivalently the combination of (33) and (35), as a state-space model.
An additional essential element of a state-space model is a prior on the state at the first time instance. We
model this as
x1 ∼ N (x̂1 , P1 ), (36)
or equivalently as the prior distribution
p(x1 ) = N (x1 ; x̂1 , P1 ). (37)
Here, x̂1 encodes our prior belief on the state at time instance 1 and P1 encodes the uncertainty of this belief.
Note that in practice it is often the case that no prior information is available. It is therefore not uncommon
that x̂1 is chosen as the zero-vector and P1 is chosen to be fairly large. Another commonly used method for
choosing x̂1 , P1 is by basing it on the first available measurement(s).

4.2 Computing the posterior

In filtering, we are interested in computing the conditional (filtering) distribution p(xk | y1:k ), or in other
words the posterior distribution of xk given all the measurements up to and including time k. Note that
we use the notation y1:k to denote the sequence of measurements y1 , y2 , . . . yk . Hence, p(xk | y1:k ) = p(xk |
y1 , y2 , . . . yk ). To compute p(xk | y1:k ), we will make use of the dynamic model p(xk+1 | xk ), the measurement
model p(yk | xk ) and the prior p(x1 ) defined in (33), (35), (37). Throughout our derivation, we will heavily
make use of two properties of state-space models:

Property 1 Given xk , the state xk+1 is conditionally independent of x1:k−1 and y1:k , or in other words
p(xk+1 | x1:k , y1:k ) = p(xk+1 | xk ). This is also called the Markov property and implies that we assume
that the evolution of the state only depends on the current state and not on its history or on previous
measurements. This property can directly be seen in (33), where xk+1 only depends on xk , the system
matrices and the noise wk .
Property 2 Given xk , the measurement yk is conditionally independent of x1:k−1 and y1:k−1 , or in other words p(yk | x1:k , y1:k−1 ) = p(yk | xk ). This property can directly be seen in (35), where yk only depends on xk , the system matrices and the noise vk .

As a first step to derive p(xk | y1:k ) in terms of these three distributions given by our model, we first use Bayes' theorem (2c) and subsequently Property 2 to write

p(xk | y1:k ) = p(yk | xk , y1:k−1 ) p(xk | y1:k−1 ) / p(yk | y1:k−1 )
             = p(yk | xk ) p(xk | y1:k−1 ) / p(yk | y1:k−1 ).   (38)

Here, we can recognise the so-called predictive distribution p(xk | y1:k−1 ), or in other words the distribution
of xk given all the measurements up to and including time k − 1. This is called the predictive distribution
as it predicts the state one time-step into the future. The term p(yk | y1:k−1 ) can be written
p(yk | y1:k−1 ) = ∫ p(yk , xk | y1:k−1 ) dxk
               = ∫ p(yk | xk , y1:k−1 ) p(xk | y1:k−1 ) dxk
               = ∫ p(yk | xk ) p(xk | y1:k−1 ) dxk ,   (39)

where we first used (2a), then (2b) and finally Property 2. Here, we can again recognise the measurement
model p(yk | xk ) and the predictive distribution p(xk | y1:k−1 ). From (38) and (39) it can be concluded that
the filtering distribution p(xk | y1:k ) can be computed using the measurement model (35) and a predictive
distribution p(xk | y1:k−1 ). We will therefore first focus on computing p(xk | y1:k−1 ) and subsequently on
computing p(xk | y1:k ).

Deriving the predictive distribution The predictive distribution p(xk | y1:k−1 ) can be written as
p(xk | y1:k−1 ) = ∫ p(xk , xk−1 | y1:k−1 ) dxk−1
               = ∫ p(xk | xk−1 , y1:k−1 ) p(xk−1 | y1:k−1 ) dxk−1
               = ∫ p(xk | xk−1 ) p(xk−1 | y1:k−1 ) dxk−1 ,   (40)

by first marginalising using (2a), then using conditioning (2b) and then using the Markov property (Property
1 above). In (40) we can recognise the dynamic model p(xk | xk−1 ) (33) and the filtering distribution at time
instance k − 1, as p(xk−1 | y1:k−1 ) is the distribution of xk−1 given the measurements y1:k−1 . Let us for now
assume that this distribution is given by p(xk−1 | y1:k−1 ) = N (xk−1 ; x̂k−1|k−1 , Pk−1|k−1 ). In other words,
we assume that the distribution is Gaussian and denote its mean and covariance with x̂k−1|k−1 , Pk−1|k−1 ,
respectively. We use the double subscript k − 1 | k − 1 to denote the state estimate and covariance at time
instance k − 1 using all measurements up to and including time k − 1. For the moment, the assumption that
the distribution is Gaussian is ad hoc, but we will later see that the distribution is indeed Gaussian for our
linear Gaussian state space model (33), (35), (37). We can now conclude that in (40) we have written the
predictive distribution p(xk | y1:k−1 ) in terms of the dynamic model p(xk | xk−1 ) and the filtering distribution
p(xk−1 | y1:k−1 ), which we assume to be Gaussian. Hence, we can in a straightforward but cumbersome
way compute p(xk | y1:k−1 ) by inserting (1) into (40), for more details see [2]. We will use an alternative
approach here and use Lemma 1.

Comparing to (3), we can use xk−1 as xa in (3) and xk as xb in (3). Note that compared to Theorem 1,
we don’t have a distribution p(xk−1 ) but rather p(xk−1 | y1:k−1 ). However, Theorem 1 equally holds if we
add conditioning to both equations in (3). Hence, we can write our model in the form

p(xk−1 | y1:k−1 ) = N (xk−1 ; x̂k−1|k−1 , Pk−1|k−1 ),   (41a)
p(xk | xk−1 , y1:k−1 ) = N (xk ; A xk−1 , Q).   (41b)

Note that the Markov property allowed us to add the conditioning on y1:k−1 in p(xk | xk−1 ), see Property 1
above. We can now equivalently write (41) as

p(xk−1 | y1:k−1 ) = N (xk−1 ; x̂k−1|k−1 , Pk−1|k−1 ),   (42a)
xk = A xk−1 + wk ,   wk ∼ N (0, Q),   (42b)
which is exactly of the form of Lemma 1 with xb = xk , xa = xk−1 , G = A and Pw = Q if we condition both
the distributions in (8) and (10) on y1:k−1 . Hence, we can directly conclude that the predictive distribution
is given by


p(xk | y1:k−1 ) = N (xk ; x̂k|k−1 , Pk|k−1 ),   (43)
with
x̂k|k−1 = A x̂k−1|k−1 ,   (44a)
Pk|k−1 = A Pk−1|k−1 A^T + Q.   (44b)
Note that the shorthand notation for the mean and covariance in (43) is introduced for ease of notation.
The double subscript k | k − 1 is used to denote the state estimate and covariance at time instance k using
all measurements up to and including time k − 1, which in other words is a one-step ahead prediction.

Deriving the filtering distribution The filtering distribution p(xk | y1:k ) was given in (38) and can be computed using the measurement model (35) and the predictive distribution (43). Again, we could in a straightforward but cumbersome way compute p(xk | y1:k ) by inserting (1) into (38), for more details see [2]. We will instead again use Theorems 1 and 3.

Comparing to (3), we can use xk as xa and yk as xb . Note that compared to Theorem 1, we don’t have
a distribution p(xk ) but rather p(xk | y1:k−1 ). However, Theorem 1 equally holds if we add conditioning to
both equations in (3). Hence, we can write our model in the form

p(xk | y1:k−1 ) = N (xk ; x̂k|k−1 , Pk|k−1 ),   (45a)
p(yk | xk , y1:k−1 ) = N (yk ; C xk , R).   (45b)
Note that Property 2 allowed us to add the conditioning on y1:k−1 in p(yk | xk ). Hence, Theorem 1 directly
results in the joint distribution
p(xk , yk | y1:k−1 ) = N ( [xk ; yk ] ; [x̂k|k−1 ; C x̂k|k−1 ], [Pk|k−1 , Pk|k−1 C^T ; C Pk|k−1 , C Pk|k−1 C^T + R] ).   (46)
Using Theorem 3, using xk as xa and yk as xb , we can directly read off the filtering distribution from (46) as

p(xk | y1:k ) = N (xk ; x̂k|k , Pk|k ),   (47)
with
x̂k|k = x̂k|k−1 + Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} (yk − C x̂k|k−1 ),   (48a)
Pk|k = Pk|k−1 − Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} C Pk|k−1 .   (48b)
Note that we still need to revisit the assumption in the derivation of the predictive distribution that the
filtering distribution from the previous time step was Gaussian. This can be argued recursively. Assuming that the prior is Gaussian, the filtering distribution p(x1 | y1 ) is Gaussian. Hence, when computing the predictive distribution p(x2 | y1 ) it can be assumed that the filtering distribution was Gaussian. As this is possible, p(x2 | y1 ) will be Gaussian and hence p(x2 | y1:2 ) will be too, etc.

The resulting Kalman filtering algorithm is summarised in Algorithm 2. Note also that in some Kalman filters the order of the time update and the measurement update is reversed. This is especially natural if the first measurement is used to determine the prior distribution.

Algorithm 2: Kalman filter
Input: An initial estimate x̂1 , covariance P1 and a known model with matrices A, C, Q, R.
Output: For every time instance a filtered estimate x̂k|k and uncertainty Pk|k .
for k = 1, . . . , NT
    Measurement update: Compute the mean and covariance of the filtering distribution as

        x̂k|k = x̂k|k−1 + Kk (yk − C x̂k|k−1 ),
        Pk|k = Pk|k−1 − Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} C Pk|k−1 ,

    with Kk = Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1}.
    Time update: Compute the mean and covariance of the predictive distribution as

        x̂k+1|k = A x̂k|k ,
        Pk+1|k = A Pk|k A^T + Q.

end for
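Algorithm 2 translates directly into code; a minimal sketch assuming numpy, with a hypothetical scalar random-walk example:

```python
import numpy as np

def kalman_filter(y_seq, A, C, Q, R, x1, P1):
    """Algorithm 2: for each yk, a measurement update followed by a time update."""
    x_pred, P_pred = x1, P1                 # the prior acts as the first prediction
    means, covs = [], []
    for yk in y_seq:
        # Measurement update, eqs. (48a)-(48b)
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
        x_filt = x_pred + K @ (yk - C @ x_pred)
        P_filt = P_pred - K @ C @ P_pred
        means.append(x_filt)
        covs.append(P_filt)
        # Time update, eqs. (44a)-(44b)
        x_pred = A @ x_filt
        P_pred = A @ P_filt @ A.T + Q
    return means, covs

# Hypothetical scalar random-walk example
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[0.25]])
y_seq = [np.array([0.9]), np.array([1.1]), np.array([1.0])]
means, covs = kalman_filter(y_seq, A, C, Q, R, np.array([0.0]), np.array([[1.0]]))
```

With the numbers above, the first update gives K1 = 1/(1 + 0.25) = 0.8, so x̂1|1 = 0.8 · 0.9 = 0.72 and P1|1 = 0.2, and the filtered covariance shrinks over the following steps.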

4.3 Interpretation in terms of solving an optimisation problem

We can alternatively interpret the Kalman filter as a MAP estimation problem. In other words, we can
interpret computing an estimate x̂k as computing the maximum of the posterior distribution p(xk | y1:k ).
One way to formulate this problem is to write it using Bayes’ rule as

x̂k|k = arg max_{xk} p(xk | y1:k ) = arg max_{xk} p(yk | xk , y1:k−1 ) p(xk | y1:k−1 ) / p(yk | y1:k−1 ).   (49)

Using Property 2, we can recognise that p(yk | xk , y1:k−1 ) = p(yk | xk ) which is given by the measurement
model (35). Furthermore, let us assume that p(xk | y1:k−1 ) = N (x̂k|k−1 , Pk|k−1 ) and that we have previously estimated x̂k|k−1 and Pk|k−1 . Again using the fact that maximising p(·) is equivalent to minimising − log p(·) and omitting constants, we can write this optimisation problem equivalently as

x̂k|k = arg max_{xk} p(yk | xk ) p(xk | y1:k−1 ) / p(yk | y1:k−1 )
     = arg max_{xk} p(yk | xk ) p(xk | y1:k−1 )
     = arg min_{xk} − log p(yk | xk ) − log p(xk | y1:k−1 )
     = arg min_{xk} ½ ∥yk − C xk ∥²_{R^{-1}} + ½ ∥xk − x̂k|k−1 ∥²_{P_{k|k−1}^{-1}}.   (50)

Note that (50) is a weighted stochastic least squares problem of exactly the same form as the weighted stochastic least squares problem in (21), with y = yk , F = C, θ = xk , Σ = R, µθ = x̂k|k−1 and Pθ = Pk|k−1 . Therefore, the derivation in Section 3.3 gives that the optimal solution to this problem is

x̂k|k = Kk yk + (I − Kk C) x̂k|k−1 ,   (51)

with Kk = (C^T R^{-1} C + P_{k|k−1}^{-1})^{-1} C^T R^{-1} = Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1}. Similarly, the variance of this estimate is given by

E[(xk − x̂k|k )(xk − x̂k|k )^T] = Pk|k−1 − Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} C Pk|k−1 .   (52)

The optimal solution and its variance therefore correspond to the Kalman filter measurement update from Section 4.2.

4.4 Extension to nonlinear models

Let us now assume that instead of the linear state-space model (32), (34), we have a nonlinear model

xk+1 = f (xk ) + wk , wk ∼ N (0, Q), (53a)


yk = h(xk ) + vk , vk ∼ N (0, R), (53b)

where f (·) and h(·) are nonlinear functions. We can equivalently write (53) as

p(xk+1 | xk ) = N (xk+1 ; f (xk ), Q), (54a)


p(yk | xk ) = N (yk ; h(xk ), R). (54b)

For this nonlinear model the computations in Section 4.2 no longer hold as Theorem 1 can not be used.
The filtering distribution p(xk | y1:k ) is generally no longer Gaussian. There are now a number of strategies
that can be used for nonlinear filtering. The nonlinear filtering problem can for instance be solved using
linearisation, see Section 4.4.1, or as a nonlinear least squares problem, see Section 4.4.2. Other techniques,
such as particle filtering, unscented Kalman filtering and moving horizon estimation, are not explicitly
discussed in the course.

4.4.1 Extended Kalman filtering

The Extended Kalman filter (EKF) is an extension of Kalman filtering that can be used for nonlinear state
space models like (53). The EKF relies on linearisation of both the dynamic model and the measurement
model around the currently best estimate. In other words, in the time update the following first order Taylor
approximation is made
xk+1 ≈ f (x̂k|k ) + df (xk )/dxk |_{xk =x̂k|k} (xk − x̂k|k ) + wk .   (55)

In the measurement update the first order Taylor approximation is instead

yk ≈ h(x̂k|k−1 ) + dh(xk )/dxk |_{xk =x̂k|k−1} (xk − x̂k|k−1 ) + vk .   (56)

The EKF is summarised in Algorithm 3. Note that although there are no guarantees on the performance of
an EKF, it is widely used in practice and is known to work well as long as the functions f (·), h(·) are not
“too nonlinear” and the sampling frequency is not “too low.”

4.4.2 Nonlinear least squares for Kalman filtering

An alternative approach to approximating the filtering distribution for the nonlinear state-space model (54)
is to compute a MAP estimate using nonlinear least squares (NLS) similar to the approach in Section 3.4.
Using the same reasoning as in Section 4.3, we can solve the optimisation problem

x̂k|k = arg max_{xk} p(xk | y1:k )
     = arg min_{xk} − log p(yk | xk ) − log p(xk | y1:k−1 )
     = arg min_{xk} ½ ∥yk − h(xk )∥²_{R^{-1}} + ½ ∥xk − x̂k|k−1 ∥²_{P_{k|k−1}^{-1}},   (57)

assuming that p(xk | y1:k−1 ) ≈ N (x̂k|k−1 , Pk|k−1 ) and that we have previously estimated x̂k|k−1 and Pk|k−1 .
Note that this is an approximation as the predictive distribution for nonlinear state-space models is generally
not Gaussian. The optimisation problem (57) is a nonlinear least squares problem that can be solved using
Algorithm 1. In general it can not be guaranteed that this optimisation problem converges to the correct

Algorithm 3: Extended Kalman filter
Input: An initial estimate x̂1 , covariance P1 and a known model with nonlinear functions f (xk ), h(xk )
and matrices Q, R.
Output: For every time instance a filtered estimate x̂k|k and uncertainty Pk|k .
for k = 1, . . . , NT
    Measurement update: Compute the mean and covariance of the filtering distribution as

        x̂k|k = x̂k|k−1 + Kk (yk − h(x̂k|k−1 )),
        Pk|k = Pk|k−1 − Pk|k−1 Hk^T (Hk Pk|k−1 Hk^T + R)^{-1} Hk Pk|k−1 ,

    with Kk = Pk|k−1 Hk^T (Hk Pk|k−1 Hk^T + R)^{-1} and Hk = dh(xk )/dxk |_{xk =x̂k|k−1}.
    Time update: Compute an approximate mean and covariance of the predictive distribution as

        x̂k+1|k = f (x̂k|k ),
        Pk+1|k = Fk Pk|k Fk^T + Q,   Fk = df (xk )/dxk |_{xk =x̂k|k}.

end for

local optimum. However, in practice (57) often converges when using x̂k|k−1 as an initial estimate, as long as the functions f (·), h(·) are not "too nonlinear" and the sampling frequency is not "too low." Note that the optimisation problem (57) is only one example of an optimisation problem that can be solved for nonlinear estimation. Commonly used alternatives include moving horizon estimation and smoothing. These do not optimise only over xk , as in (57), but instead optimise over a sequence of states.
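A single cycle of Algorithm 3 can be sketched as follows. The code assumes numpy; the helper name `ekf_step` and the scalar model with h(x) = x² are purely illustrative.

```python
import numpy as np

def ekf_step(x, P, yk, f, F_jac, h, H_jac, Q, R):
    """One EKF cycle (Algorithm 3): measurement update, then time update."""
    # Measurement update with Hk = dh/dx evaluated at the predicted estimate
    H = H_jac(x)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x_filt = x + K @ (yk - h(x))
    P_filt = P - K @ H @ P
    # Time update with Fk = df/dx evaluated at the filtered estimate
    F = F_jac(x_filt)
    x_pred = f(x_filt)
    P_pred = F @ P_filt @ F.T + Q
    return x_filt, P_filt, x_pred, P_pred

# Hypothetical scalar example: linear dynamics, squared-magnitude measurement
f = lambda x: 0.9 * x
F_jac = lambda x: np.array([[0.9]])
h = lambda x: x ** 2
H_jac = lambda x: np.array([[2.0 * x[0]]])
Q = np.array([[0.01]]); R = np.array([[0.1]])
x_filt, P_filt, x_pred, P_pred = ekf_step(
    np.array([1.0]), np.array([[0.5]]), np.array([1.2]), f, F_jac, h, H_jac, Q, R)
```

With these numbers, H1 = 2, so the innovation covariance is 2 · 0.5 · 2 + 0.1 = 2.1 and the filtered mean becomes 1 + 0.2/2.1.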

A Matrix inversion

In this appendix we give two useful lemmas for matrix inversion. Note that Lemma 2 is equivalent to Lemma 2.3 in [3].

Lemma 2: The matrix inversion lemma, a.k.a. the Woodbury identity Consider matrices
A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×m and D ∈ Rm×n . If A, C and (A + BCD) are invertible, then

(A + BCD)−1 = A−1 − A−1 B(C −1 + DA−1 B)−1 DA−1 . (58)

Lemma 3: Matrix inversion identity Consider matrices A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×m and
D ∈ Rm×n . If A, C and (A + BCD) are invertible, then

(A + BCD)−1 BC = A−1 B(C −1 + DA−1 B)−1 . (59)

Proof of Lemmas 2 and 3: Both lemmas can be proven using the Schur complement, using a similar strategy as on page 19 of [3] and the fact that

[I , 0 ; X , I]^{-1} = [I , 0 ; −X , I],   [I , X ; 0 , I]^{-1} = [I , −X ; 0 , I].   (60)

For matrices A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{m×m} and D ∈ R^{m×n}, with C assumed to be invertible, let us construct the matrix M = [A , B ; −D , C^{-1}]. Using the Schur complement, we can write M equivalently as

M = [A , B ; −D , C^{-1}] = [I , BC ; 0 , I] [A + BCD , 0 ; 0 , C^{-1}] [I , 0 ; −CD , I].   (61)

We can now first recognise that using property (iii) on page 20 of [3],

det(M ) = det(A + BCD) det(C^{-1} ).   (62)

As C and (A + BCD) are assumed to be invertible, their determinant is non-zero. Hence, the determinant
of M is non-zero and M is invertible. This inverse can be written as
  −1  
−1 I 0 A + BCD 0 I −BC
M =
CD I 0 C −1 0 I
 −1 −1 −1 −1 −1 −1
−A−1 B(C −1 + DA−1 B)−1

A − A B(C + DA B) DA
= . (63)
(C −1 + DA−1 B)−1 DA−1 −(C −1 + DA−1 B)−1

Alternatively, assuming A to be invertible allows us to write

M = [A , B ; −D , C^{-1}] = [I , 0 ; −DA^{-1} , I] [A , 0 ; 0 , C^{-1} + DA^{-1}B] [I , A^{-1}B ; 0 , I].   (64)

Using property (iii) on page 20 of [3],

det(M ) = det(A) det(C^{-1} + DA^{-1}B).   (65)

As A and M are known to be invertible, C^{-1} + DA^{-1}B can be concluded to be invertible. We can now obtain a second expression for M^{-1} given by
M^{-1} = [I , −A^{-1}B ; 0 , I] [A , 0 ; 0 , C^{-1} + DA^{-1}B]^{-1} [I , 0 ; DA^{-1} , I]
       = [ A^{-1} − A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} , −A^{-1}B(C^{-1} + DA^{-1}B)^{-1} ; (C^{-1} + DA^{-1}B)^{-1}DA^{-1} , (C^{-1} + DA^{-1}B)^{-1} ].   (66)

We can now directly read off from (63) and (66) that

(A + BCD)^{-1} = A^{-1} − A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} ,   (67a)
−(A + BCD)^{-1}BC = −A^{-1}B(C^{-1} + DA^{-1}B)^{-1} ,   (67b)

which is equivalent to Lemmas 2 and 3.
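Both lemmas are easy to spot-check numerically; a sketch assuming numpy, with arbitrary well-conditioned example matrices (the scaling of B and D is only there to keep the inverses well-behaved):

```python
import numpy as np

# Numeric spot-check of Lemmas 2 and 3 on arbitrary example matrices
rng = np.random.default_rng(2)
n, m = 4, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
B = 0.3 * rng.standard_normal((n, m))
C = np.eye(m) + 0.1 * rng.standard_normal((m, m))
D = 0.3 * rng.standard_normal((m, n))

Ai, Ci = np.linalg.inv(A), np.linalg.inv(C)
lhs = np.linalg.inv(A + B @ C @ D)
inner = np.linalg.inv(Ci + D @ Ai @ B)

# Lemma 2 (the Woodbury identity), eq. (58)
assert np.allclose(lhs, Ai - Ai @ B @ inner @ D @ Ai)
# Lemma 3, eq. (59)
assert np.allclose(lhs @ B @ C, Ai @ B @ inner)
```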

References
[1] S. Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[2] T. B. Schön and F. Lindsten. Manipulating the multivariate Gaussian density. 2011.
[3] M. Verhaegen and V. Verdult. Filtering and System Identification: A Least Squares Approach. Cambridge
University Press, 2007.

