
Lecture notes on Bayesian estimation for SC42025:

Filtering & Identification

Manon Kok and Frida Viset

December 1, 2023

1 Introduction

In this document, the Bayesian interpretation and derivation of the filtering part of this course is introduced.
A higher level introduction to Bayesian filtering and smoothing can also be found in Sections 1.1–1.3 in [1].

2 Notation preliminaries

Throughout this document we will assume that all distributions are Gaussian. We use the notation N (a ; µa , Pa )
to denote the variable a ∈ Rn being Gaussian distributed with mean µa and covariance Pa . In other words,
N (a ; µa , Pa ) = (1/√((2π)^n det(Pa ))) exp(−½ (a − µa )^T Pa^{-1} (a − µa )) = (1/√((2π)^n det(Pa ))) exp(−½ ∥a − µa ∥²_{Pa^{-1}}),   (1)

where we introduced the weighted norm notation ∥a − µa ∥²_{Pa^{-1}} = (a − µa )^T Pa^{-1} (a − µa ), and where det(Pa )
denotes the determinant of the matrix Pa . Sometimes we simplify this notation to N (µa , Pa ) if there can’t
be any confusion about what the random variable is.

When working with probabilities, we will make use of three fundamental relations:
Marginalization:  p(a | c) = ∫ p(a, b | c) db                        (2a)
Conditioning:     p(a, b | c) = p(a | b, c) p(b | c)                 (2b)
Bayes' theorem:   p(a | b, c) = p(b | a, c) p(a | c) / p(b | c)      (2c)

Furthermore, we will often make use of the following three theorems and one lemma:

Theorem 1: Joint distribution of Gaussian variables Assume that both xa and xb conditioned on
xa are distributed according to

p(xa ) = N (xa ; µa , Pa ), (3a)


p(xb | xa ) = N (xb ; M xa + b, Pb|a ), (3b)

where M and b are deterministic and known. Then the joint distribution of xa and xb is given by

p(xa , xb ) = N ( [xa ; xb ] ; [µa ; M µa + b], P ),   P = [ Pa , Pa M^T ; M Pa , M Pa M^T + P_{b|a} ].   (4)

Proof: See Appendix A.2 of [2].

Theorem 2: Marginal distribution of Gaussian variables Assume that the random variables xa
and xb are jointly distributed as
     
p(xa , xb ) = N ( [xa ; xb ] ; [µa ; µb ], [Paa , Pab ; Pba , Pbb ] ).   (5)

Then the marginal densities p(xa ) and p(xb ) are given by

p(xa ) = N (xa ; µa , Paa ), (6a)


p(xb ) = N (xb ; µb , Pbb ). (6b)

Proof: See Appendix A.1 of [2].

Theorem 3: Conditional distribution of Gaussian variables Assume that the random variables
xa and xb are jointly distributed as (5), then the conditional densities p(xa | xb ) and p(xb | xa ) are given by
p(xa | xb ) = N (xa ; µa + Pab Pbb^{-1} (xb − µb ), Paa − Pab Pbb^{-1} Pba ),   (7a)
p(xb | xa ) = N (xb ; µb + Pba Paa^{-1} (xa − µa ), Pbb − Pba Paa^{-1} Pab ).   (7b)

Proof: See Appendix A.1 of [2].
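These identities are easy to check numerically. The sketch below is not part of the notes' derivation; it assumes numpy and uses arbitrary example matrices. It builds the joint distribution (4) from the model (3) using Theorem 1, and then verifies that conditioning via (7b) recovers the conditional model p(xb | xa) exactly.

```python
import numpy as np

# Hypothetical 2D example: p(xa) = N(mu_a, Pa), p(xb | xa) = N(M xa + b, P_b_given_a)
mu_a = np.array([1.0, -0.5])
Pa = np.array([[2.0, 0.3], [0.3, 1.0]])
M = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([0.5, 0.5])
P_b_given_a = np.array([[0.5, 0.0], [0.0, 0.2]])

# Theorem 1, eq. (4): joint mean and covariance of (xa, xb)
mu_b = M @ mu_a + b
Paa, Pab = Pa, Pa @ M.T
Pba, Pbb = M @ Pa, M @ Pa @ M.T + P_b_given_a

# Theorem 3, eq. (7b): condition xb on a particular value of xa
xa = np.array([2.0, 0.0])
cond_mean = mu_b + Pba @ np.linalg.solve(Paa, xa - mu_a)
cond_cov = Pbb - Pba @ np.linalg.solve(Paa, Pab)

# Round trip: conditioning the joint recovers the original conditional model
assert np.allclose(cond_mean, M @ xa + b)
assert np.allclose(cond_cov, P_b_given_a)
```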

Lemma 1: Distribution of the linear transform of a Gaussian variable Given the distribution

p(xa ) = N (xa ; µa , Pa ) (8)

and the model


xb = Gxa + w, (9)
with G being deterministic and w ∼ N (0, Pw ), it holds that

p(xb ) = N (xb ; Gµa , GPa GT + Pw ). (10)

In the special case of w = 0 (in other words, in the absence of noise), p(xb ) = N (xb ; Gµa , GPa GT ).

Proof: We can derive this lemma using a combination of Theorems 1 and 2, by setting M = G, b = 0
and Pb|a = Pw .

3 Stochastic least squares

3.1 Probabilistic models

Let us first consider the standard least squares problem of estimating the unknown parameters θ ∈ Rn from
measurements y ∈ RN modelled as

y = F θ + ε, ε ∼ N (0, Σ), (11)

where we assume F to be deterministic and known. If θ in (11) were also deterministic and known, we could make the following observations about y:

• The mean of y would be F θ as F θ is known and the mean of ε is equal to zero.


• The covariance of y would be Σ as F θ is deterministic and the covariance of ε is equal to Σ.
• The distribution of y would be Gaussian as F θ is deterministic and ε is Gaussian.

We can summarise these three observations as

p(y | θ) = N (y ; F θ, Σ), (12)

where p(y | θ) is the conditional distribution of y given θ and the notation N (· ; ·, ·) is introduced in (1).
Note that (11) and (12) contain exactly the same information and can hence be used interchangeably.

In stochastic least squares, the goal is now to combine collected measurements y (modelled as (11) or
equivalently (12)) with prior information about the parameters θ. Throughout the course, we will assume
that this prior information is Gaussian and will model it as θ ∼ N (µθ , Pθ ). In other words, we assume that

p(θ) = N (µθ , Pθ ). (13)

Note that p(θ) is a prior distribution, representing information that is available before any data is observed.
This is in contrast to the posterior distribution p(θ | y) which represents the distribution of θ given the
observed data y. This is the distribution that we are interested in computing in stochastic least squares.

3.2 Computing the posterior

We can write the stochastic least squares problem as


p(θ | y) = p(y | θ) p(θ) / p(y) = p(y, θ) / ∫ p(y, θ) dθ.   (14)

Here, we first used Bayes’ rule (2c) and subsequently used the conditioning and marginalisation relations (2b)
and (2a). Note that we know both p(y | θ) and p(θ) from our measurement model (12) and our prior (13),
respectively. In fact, they are exactly of the form of Theorem 1. Hence, we can write

p(θ, y) = N ( [θ ; y] ; [µθ ; F µθ ], P ),   P = [Pθ , Pθ F^T ; F Pθ , F Pθ F^T + Σ].   (15)

Now using Theorem 3, we find that the posterior p(θ | y) is given by


p(θ | y) = N ( θ ; µθ + Pθ F^T (F Pθ F^T + Σ)^{-1} (y − F µθ ), Pθ − Pθ F^T (F Pθ F^T + Σ)^{-1} F Pθ ).   (16)

The mean and covariance can now be seen to be equal to the well-known stochastic least squares solution.
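As an illustration, the posterior (16) can be implemented directly. The sketch below assumes numpy; the function name `sls_posterior` and the example matrices are our own and purely illustrative.

```python
import numpy as np

def sls_posterior(y, F, Sigma, mu_theta, P_theta):
    """Posterior mean and covariance of theta given y, following eq. (16)."""
    S = F @ P_theta @ F.T + Sigma           # innovation covariance F Pθ Fᵀ + Σ
    K = P_theta @ F.T @ np.linalg.inv(S)    # gain Pθ Fᵀ (F Pθ Fᵀ + Σ)⁻¹
    mean = mu_theta + K @ (y - F @ mu_theta)
    cov = P_theta - K @ F @ P_theta
    return mean, cov

# Hypothetical example: two parameters, three noisy measurements
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Sigma = 0.1 * np.eye(3)
mu_theta = np.zeros(2)
P_theta = np.eye(2)
y = np.array([1.0, 2.0, 3.1])
mean, cov = sls_posterior(y, F, Sigma, mu_theta, P_theta)
```

By Lemmas 2 and 3 in Appendix A, this gain form agrees exactly with the information form (F^T Σ^{-1} F + Pθ^{-1})^{-1}(F^T Σ^{-1} y + Pθ^{-1} µθ ) derived in Section 3.3.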

3.3 Interpretation in terms of solving an optimisation problem

We can alternatively interpret the stochastic least squares problem as a maximum a posteriori (MAP)
estimation problem. In other words, we can interpret computing an estimate θ̂ as computing the maximum
of the posterior distribution p(θ | y) which can be formulated as

θ̂ = arg max_θ p(θ | y) = arg max_θ p(y | θ) p(θ) / p(y),   (17)

where arg max_θ denotes the maximising argument. Since p(y) doesn't explicitly depend on θ, it does not
affect the location of the maximum. Hence,

θ̂ = arg max_θ p(θ | y) = arg max_θ p(y | θ) p(θ).   (18)

Now using the fact that maximising p(·) is equivalent to minimising − log p(·), we can rewrite this problem
as
θ̂ = arg max_θ p(θ | y) = arg min_θ − log p(y | θ) − log p(θ).   (19)

Now it is possible to insert the known expressions for p(y | θ) and p(θ) from (12) and (13) using (1):

θ̂ = arg max_θ p(θ | y) = arg min_θ ½ ∥y − F θ∥²_{Σ^{-1}} + ½ ∥θ − µθ ∥²_{Pθ^{-1}} + ½ log det(Σ) + ½ log det(Pθ ) + ((N + n)/2) log 2π.   (20)

Again, as constants do not affect the location of the minimum they can be omitted giving the resulting
optimisation problem

θ̂ = arg max_θ p(θ | y) = arg min_θ ½ ∥y − F θ∥²_{Σ^{-1}} + ½ ∥θ − µθ ∥²_{Pθ^{-1}}.   (21)

Note that throughout this course we sometimes omit the pre-multiplication by ½ for notational brevity as it does not affect the minimum. Note also that (21) is again a weighted stochastic least squares problem. To
solve (21), we can first equivalently write it as
θ̂ = arg min_θ ½ (y − F θ)^T Σ^{-1} (y − F θ) + ½ (θ − µθ )^T Pθ^{-1} (θ − µθ )
  = arg min_θ ½ ( [y ; µθ ] − [F ; I] θ )^T W ( [y ; µθ ] − [F ; I] θ )   (22)
  = arg min_θ ½ ∥ [y ; µθ ] − [F ; I] θ ∥²_W ,   W = [Σ , 0 ; 0 , Pθ ]^{-1}.

This is a least squares problem with solution


θ̂ = ( [F ; I]^T [Σ , 0 ; 0 , Pθ ]^{-1} [F ; I] )^{-1} [F ; I]^T [Σ , 0 ; 0 , Pθ ]^{-1} [y ; µθ ].   (23)

Some algebraic manipulation of (23) gives

θ̂ = (F^T Σ^{-1} F + Pθ^{-1})^{-1} (F^T Σ^{-1} y + Pθ^{-1} µθ )   (24)
  = (F^T Σ^{-1} F + Pθ^{-1})^{-1} F^T Σ^{-1} y + (F^T Σ^{-1} F + Pθ^{-1})^{-1} Pθ^{-1} µθ .

We can use Lemma 3 from Appendix A, with A = Pθ−1 , B = F T , C = Σ−1 , and D = F , to see that
(F T Σ−1 F + Pθ−1 )−1 F T Σ−1 = Pθ F T (F Pθ F T + Σ)−1 . We will denote this matrix K, such that

θ̂ = K y + (F^T Σ^{-1} F + Pθ^{-1})^{-1} Pθ^{-1} µθ
  = K y + (F^T Σ^{-1} F + Pθ^{-1})^{-1} (Pθ^{-1} + F^T Σ^{-1} F − F^T Σ^{-1} F ) µθ   (25)
  = K y + (I − K F ) µθ .

Similarly, the covariance of the least squares estimate is given by

E[(θ − θ̂)(θ − θ̂)^T] = ( [F ; I]^T [Σ , 0 ; 0 , Pθ ]^{-1} [F ; I] )^{-1} = (F^T Σ^{-1} F + Pθ^{-1})^{-1},   (26)

where applying Lemma 2 from Appendix A with A = Pθ^{-1}, B = F^T, C = Σ^{-1} and D = F gives that

E[(θ − θ̂)(θ − θ̂)^T] = Pθ − Pθ F^T (Σ + F Pθ F^T)^{-1} F Pθ ,   (27)

which corresponds exactly to the mean and covariance derived in Section 3.2. Note that when we don't have any prior information available, or equivalently, if the prior is flat (independent of θ), the solution to this problem is trivially the least squares solution θ̂ = (F^T Σ^{-1} F)^{-1} F^T Σ^{-1} y, with variance E[(θ − θ̂)(θ − θ̂)^T] = (F^T Σ^{-1} F)^{-1}.
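The equivalence between the information form (24), (26) and the gain form (25), (27) is easy to confirm numerically; a sketch assuming numpy, with arbitrary (hypothetical) example matrices:

```python
import numpy as np

# Arbitrary example matrices, for illustration only
rng = np.random.default_rng(1)
F = rng.standard_normal((4, 2))
Sigma = np.diag([0.5, 0.2, 0.3, 0.4])
P_theta = np.array([[1.0, 0.2], [0.2, 2.0]])
mu_theta = np.array([0.5, -1.0])
y = rng.standard_normal(4)

Si = np.linalg.inv(Sigma)
Pi = np.linalg.inv(P_theta)

# Information form, eq. (24)
theta_info = np.linalg.solve(F.T @ Si @ F + Pi, F.T @ Si @ y + Pi @ mu_theta)

# Gain form, eq. (25), with K = Pθ Fᵀ (F Pθ Fᵀ + Σ)⁻¹ from Lemma 3
K = P_theta @ F.T @ np.linalg.inv(F @ P_theta @ F.T + Sigma)
theta_gain = K @ y + (np.eye(2) - K @ F) @ mu_theta
assert np.allclose(theta_info, theta_gain)

# Covariance forms (26) and (27) agree as well (Lemma 2, the Woodbury identity)
cov_info = np.linalg.inv(F.T @ Si @ F + Pi)
cov_gain = P_theta - P_theta @ F.T @ np.linalg.inv(Sigma + F @ P_theta @ F.T) @ F @ P_theta
assert np.allclose(cov_info, cov_gain)
```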

3.4 Extension to nonlinear models

Let us now assume that instead of the linear measurement model (12), we have a nonlinear model

y = f (θ) + ϵ, ϵ ∼ N (0, Σ), (28)

or equivalently, p(y | θ) = N (y ; f (θ), Σ). In this case, the computations in Section 3.2 no longer hold as
Theorem 1 can not be used. The posterior p(θ | y) is generally no longer Gaussian. However, it is still possible
to compute a MAP estimate using nonlinear least squares (NLS) similar to the approach in Section 3.3. For
notational brevity, let us assume there is no prior knowledge available on θ (or in other words the prior on
θ is flat). We can now straightforwardly adapt (21) to the nonlinear model (28) as

θ̂ = arg max_θ p(θ | y) = arg min_θ ½ ∥y − f (θ)∥²_{Σ^{-1}}.   (29)

We can solve this problem by solving a sequence of linear least squares problems if we first use a first order
Taylor series expansion of f (θ) around a linearisation point θ̂(i) as

f (θ) = f (θ̂^{(i)} + ∆θ) ≈ f (θ̂^{(i)} ) + df (θ)/dθ |_{θ=θ̂^{(i)}} ∆θ.   (30)

Inserting this first order Taylor expansion (30) into (28) and reordering terms, we can recognise an approxi-
mate least squares problem

y − f (θ̂^{(i)} ) ≈ df (θ)/dθ |_{θ=θ̂^{(i)}} ∆θ + ϵ,   ϵ ∼ N (0, Σ),   (31)

where we denote the residual e^{(i)} = y − f (θ̂^{(i)} ) and the Jacobian F^{(i)} = df (θ)/dθ |_{θ=θ̂^{(i)}}.

The NLS problem can now be solved by solving a sequence of linear least squares problems. Starting from
an initial estimate θ̂(0) , for each iteration i, we can first compute ∆θ from (31) and subsequently update
the linearisation point as θ̂(i+1) = θ̂(i) + ∆θ. The iterations can be terminated for instance when ∥∆θ∥22 is
smaller than a certain value or when a maximum number of iterations has been reached. The challenge in
practice is how to choose θ̂(0) as different choices of θ̂(0) can lead to convergence to different local minima.
In practice, some NLS problems are not that sensitive to the choice of θ̂^{(0)} and standard choices like the zero- or the one-vector are sufficient. In other NLS problems, a more elaborate scheme to find a proper θ̂^{(0)} is needed. This typically involves making use of domain knowledge and is outside the scope of our course.

Algorithm 1: Nonlinear least squares


Input: An initial estimate θ̂(0) and a stopping criterion.
Output: A final estimate θ̂.
Set i = 0
while stopping criterion not fulfilled
    Compute F^{(i)} = df (θ)/dθ |_{θ=θ̂^{(i)}} and e^{(i)} = y − f (θ̂^{(i)} ).
    Solve the least squares problem e^{(i)} = F^{(i)} ∆θ + ϵ to compute ∆θ.
    Update the estimate θ̂^{(i+1)} = θ̂^{(i)} + ∆θ.
    if stopping criterion fulfilled
        Set θ̂ = θ̂^{(i+1)}.
    else
        Set i = i + 1.
    end if
end while
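Algorithm 1 can be sketched in a few lines. The implementation below is our own illustration, assuming numpy, a flat prior and Σ = I (so each linear subproblem is an ordinary least squares solve); the exponential model in the example is purely hypothetical.

```python
import numpy as np

def nls(f, jac, y, theta0, max_iter=100, tol=1e-14):
    """Algorithm 1 with a flat prior and Sigma = I: Gauss-Newton iterations."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        Fi = jac(theta)              # F(i) = df/dtheta at the linearisation point
        ei = y - f(theta)            # e(i) = y - f(theta_hat(i))
        # Solve the linear least squares problem e(i) = F(i) dtheta for dtheta
        dtheta, *_ = np.linalg.lstsq(Fi, ei, rcond=None)
        theta = theta + dtheta
        if dtheta @ dtheta < tol:    # stopping criterion on the squared step norm
            break
    return theta

# Hypothetical example: fit amplitude and rate of y = a * exp(-b * t)
t = np.linspace(0.0, 2.0, 20)
f = lambda th: th[0] * np.exp(-th[1] * t)
jac = lambda th: np.column_stack([np.exp(-th[1] * t),
                                  -th[0] * t * np.exp(-th[1] * t)])
y = f(np.array([2.0, 1.5]))          # noise-free data, so the fit should be exact
theta_hat = nls(f, jac, y, theta0=np.array([1.0, 1.0]))
```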

4 Kalman filtering

4.1 Probabilistic models

In Kalman filtering, we are interested in recursively computing the posterior distribution of a state xk . The
dynamics of this state is modelled as
xk+1 = Axk + wk , wk ∼ N (0, Q), (32)
where A and Q are assumed to be known and deterministic. Similarly to the reasoning in Section 3.1, we
can equivalently write this model in terms of the conditional distribution p(xk+1 | xk ) as
p(xk+1 | xk ) = N (xk+1 ; Axk , Q). (33)
We also assume that measurements yk providing information about the state xk are available. The measure-
ment model is given by
yk = Cxk + vk , vk ∼ N (0, R), (34)
where C and R are assumed to be known and deterministic. We also assume that vk and wk are uncorrelated.
Again, similarly to the reasoning in Section 3.1, we can equivalently write this model in terms of the
conditional distribution p(yk | xk ) as
p(yk | xk ) = N (yk ; Cxk , R). (35)
We refer to the combination of (32) and (34), or equivalently the combination of (33) and (35), as a state-space model.
An additional essential element of a state-space model is a prior on the state at the first time instance. We
model this as
x1 ∼ N (x̂1 , P1 ), (36)
or equivalently as the prior distribution
p(x1 ) = N (x1 ; x̂1 , P1 ). (37)
Here, x̂1 encodes our prior belief on the state at time instance 1 and P1 encodes the uncertainty of this belief.
Note that in practice it is often the case that no prior information is available. It is therefore not uncommon
that x̂1 is chosen as the zero-vector and P1 is chosen to be fairly large. Another commonly used method for
choosing x̂1 , P1 is by basing it on the first available measurement(s).

4.2 Computing the posterior

In filtering, we are interested in computing the conditional (filtering) distribution p(xk | y1:k ), or in other
words the posterior distribution of xk given all the measurements up to and including time k. Note that
we use the notation y1:k to denote the sequence of measurements y1 , y2 , . . . yk . Hence, p(xk | y1:k ) = p(xk |
y1 , y2 , . . . yk ). To compute p(xk | y1:k ), we will make use of the dynamic model p(xk+1 | xk ), the measurement
model p(yk | xk ) and the prior p(x1 ) defined in (33), (35), (37). Throughout our derivation, we will heavily
make use of two properties of state-space models:

Property 1 Given xk , the state xk+1 is conditionally independent of x1:k−1 and y1:k , or in other words
p(xk+1 | x1:k , y1:k ) = p(xk+1 | xk ). This is also called the Markov property and implies that we assume
that the evolution of the state only depends on the current state and not on its history or on previous
measurements. This property can directly be seen in (33), where xk+1 only depends on xk , the system
matrices and the noise wk .
Property 2 Given xk , the measurement yk is conditionally independent of x1:k−1 and y1:k−1 , or in other words p(yk | x1:k , y1:k−1 ) = p(yk | xk ). This property can directly be seen in (35), where yk only depends on xk , the system matrices and the noise vk .

As a first step to derive p(xk | y1:k ) in terms of these three distributions given by our model, we first use Bayes' theorem (2c) and subsequently Property 2 to write

p(xk | y1:k ) = p(yk | xk , y1:k−1 ) p(xk | y1:k−1 ) / p(yk | y1:k−1 )
             = p(yk | xk ) p(xk | y1:k−1 ) / p(yk | y1:k−1 ).   (38)

Here, we can recognise the so-called predictive distribution p(xk | y1:k−1 ), or in other words the distribution
of xk given all the measurements up to and including time k − 1. This is called the predictive distribution
as it predicts the state one time-step into the future. The term p(yk | y1:k−1 ) can be written
p(yk | y1:k−1 ) = ∫ p(yk , xk | y1:k−1 ) dxk
               = ∫ p(yk | xk , y1:k−1 ) p(xk | y1:k−1 ) dxk
               = ∫ p(yk | xk ) p(xk | y1:k−1 ) dxk ,   (39)

where we first used (2a), then (2b) and finally Property 2. Here, we can again recognise the measurement
model p(yk | xk ) and the predictive distribution p(xk | y1:k−1 ). From (38) and (39) it can be concluded that
the filtering distribution p(xk | y1:k ) can be computed using the measurement model (35) and a predictive
distribution p(xk | y1:k−1 ). We will therefore first focus on computing p(xk | y1:k−1 ) and subsequently on
computing p(xk | y1:k ).

Deriving the predictive distribution The predictive distribution p(xk | y1:k−1 ) can be written as
p(xk | y1:k−1 ) = ∫ p(xk , xk−1 | y1:k−1 ) dxk−1
               = ∫ p(xk | xk−1 , y1:k−1 ) p(xk−1 | y1:k−1 ) dxk−1
               = ∫ p(xk | xk−1 ) p(xk−1 | y1:k−1 ) dxk−1 ,   (40)

by first marginalising using (2a), then using conditioning (2b) and then using the Markov property (Property
1 above). In (40) we can recognise the dynamic model p(xk | xk−1 ) (33) and the filtering distribution at time
instance k − 1, as p(xk−1 | y1:k−1 ) is the distribution of xk−1 given the measurements y1:k−1 . Let us for now
assume that this distribution is given by p(xk−1 | y1:k−1 ) = N (xk−1 ; x̂k−1|k−1 , Pk−1|k−1 ). In other words,
we assume that the distribution is Gaussian and denote its mean and covariance with x̂k−1|k−1 , Pk−1|k−1 ,
respectively. We use the double subscript k − 1 | k − 1 to denote the state estimate and covariance at time
instance k − 1 using all measurements up to and including time k − 1. For the moment, the assumption that
the distribution is Gaussian is ad hoc, but we will later see that the distribution is indeed Gaussian for our
linear Gaussian state space model (33), (35), (37). We can now conclude that in (40) we have written the
predictive distribution p(xk | y1:k−1 ) in terms of the dynamic model p(xk | xk−1 ) and the filtering distribution
p(xk−1 | y1:k−1 ), which we assume to be Gaussian. Hence, we can in a straightforward but cumbersome
way compute p(xk | y1:k−1 ) by inserting (1) into (40), for more details see [2]. We will use an alternative
approach here and use Lemma 1.

Comparing to (3), we can use xk−1 as xa in (3) and xk as xb in (3). Note that compared to Theorem 1,
we don’t have a distribution p(xk−1 ) but rather p(xk−1 | y1:k−1 ). However, Theorem 1 equally holds if we
add conditioning to both equations in (3). Hence, we can write our model in the form

p(xk−1 | y1:k−1 ) = N (xk−1 ; x̂k−1|k−1 , Pk−1|k−1 ),   (41a)
p(xk | xk−1 , y1:k−1 ) = N (xk ; A xk−1 , Q).   (41b)

Note that the Markov property allowed us to add the conditioning on y1:k−1 in p(xk | xk−1 ), see Property 1
above. We can now equivalently write (41) as

p(xk−1 | y1:k−1 ) = N (xk−1 ; x̂k−1|k−1 , Pk−1|k−1 ),   (42a)
xk = A xk−1 + wk ,   wk ∼ N (0, Q),   (42b)
which is exactly of the form of Lemma 1 with xb = xk , xa = xk−1 , G = A and Pw = Q if we condition both
the distributions in (8) and (10) on y1:k−1 . Hence, we can directly conclude that the predictive distribution
is given by


p(xk | y1:k−1 ) = N (xk ; x̂k|k−1 , Pk|k−1 ),   (43)
with
x̂k|k−1 = A x̂k−1|k−1 ,   (44a)
Pk|k−1 = A Pk−1|k−1 A^T + Q.   (44b)
Note that the shorthand notation for the mean and covariance in (43) is introduced for ease of notation.
The double subscript k | k − 1 is used to denote the state estimate and covariance at time instance k using
all measurements up to and including time k − 1, which in other words is a one-step ahead prediction.

Deriving the filtering distribution The filtering distribution p(xk | y1:k ) was given in (38) and can be computed using the measurement model (35) and the predictive distribution (43). Again, we could in a straightforward but cumbersome way compute p(xk | y1:k ) by inserting (1) into (38), for more details see [2]. We will instead again use Theorems 1 and 3.

Comparing to (3), we can use xk as xa and yk as xb . Note that compared to Theorem 1, we don’t have
a distribution p(xk ) but rather p(xk | y1:k−1 ). However, Theorem 1 equally holds if we add conditioning to
both equations in (3). Hence, we can write our model in the form

p(xk | y1:k−1 ) = N (xk ; x̂k|k−1 , Pk|k−1 ),   (45a)
p(yk | xk , y1:k−1 ) = N (yk ; C xk , R).   (45b)
Note that Property 2 allowed us to add the conditioning on y1:k−1 in p(yk | xk ). Hence, Theorem 1 directly
results in the joint distribution
p(xk , yk | y1:k−1 ) = N ( [xk ; yk ] ; [x̂k|k−1 ; C x̂k|k−1 ], [Pk|k−1 , Pk|k−1 C^T ; C Pk|k−1 , C Pk|k−1 C^T + R] ).   (46)
Using Theorem 3, using xk as xa and yk as xb , we can directly read off the filtering distribution from (46) as

p(xk | y1:k ) = N (xk ; x̂k|k , Pk|k ),   (47)
with
x̂k|k = x̂k|k−1 + Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} (yk − C x̂k|k−1 ),   (48a)
Pk|k = Pk|k−1 − Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} C Pk|k−1 .   (48b)
Note that we still need to revisit the assumption in the derivation of the predictive distribution that the
filtering distribution from the previous time step was Gaussian. This can be argued recursively. Assuming that the prior is Gaussian, the filtering distribution p(x1 | y1 ) is Gaussian. Hence, when computing the predictive distribution p(x2 | y1 ) it can be assumed that the filtering distribution was Gaussian. As this is possible, p(x2 | y1 ) will be Gaussian and hence p(x2 | y1:2 ) will be too, etc.

The resulting Kalman filtering algorithm is summarised in Algorithm 2. Note also that in some Kalman filters the order of the time update and the measurement update is reversed. This is especially natural if the first measurement is used to determine the prior distribution.

Algorithm 2: Kalman filter
Input: An initial estimate x̂1 , covariance P1 and a known model with matrices A, C, Q, R.
Output: For every time instance a filtered estimate x̂k|k and uncertainty Pk|k .
for k = 1, . . . , NT
    Measurement update: Compute the mean and covariance of the filtering distribution as

        x̂k|k = x̂k|k−1 + Kk (yk − C x̂k|k−1 ),
        Pk|k = Pk|k−1 − Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} C Pk|k−1 ,

    with Kk = Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1}.
    Time update: Compute the mean and covariance of the predictive distribution as

        x̂k+1|k = A x̂k|k ,
        Pk+1|k = A Pk|k A^T + Q.

end for
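Algorithm 2 translates directly into code; a minimal sketch assuming numpy, with a hypothetical scalar random-walk example:

```python
import numpy as np

def kalman_filter(y_seq, A, C, Q, R, x1, P1):
    """Algorithm 2: for each yk, a measurement update followed by a time update."""
    x_pred, P_pred = x1, P1                 # the prior acts as the first prediction
    means, covs = [], []
    for yk in y_seq:
        # Measurement update, eqs. (48a)-(48b)
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
        x_filt = x_pred + K @ (yk - C @ x_pred)
        P_filt = P_pred - K @ C @ P_pred
        means.append(x_filt)
        covs.append(P_filt)
        # Time update, eqs. (44a)-(44b)
        x_pred = A @ x_filt
        P_pred = A @ P_filt @ A.T + Q
    return means, covs

# Hypothetical scalar random-walk example
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[0.25]])
y_seq = [np.array([0.9]), np.array([1.1]), np.array([1.0])]
means, covs = kalman_filter(y_seq, A, C, Q, R, np.array([0.0]), np.array([[1.0]]))
```

With the numbers above, the first update gives K1 = 1/(1 + 0.25) = 0.8, so x̂1|1 = 0.8 · 0.9 = 0.72 and P1|1 = 0.2, and the filtered covariance shrinks over the following steps.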

4.3 Interpretation in terms of solving an optimisation problem

We can alternatively interpret the Kalman filter as a MAP estimation problem. In other words, we can
interpret computing an estimate x̂k as computing the maximum of the posterior distribution p(xk | y1:k ).
One way to formulate this problem is to write it using Bayes’ rule as

x̂k|k = arg max_{xk} p(xk | y1:k ) = arg max_{xk} p(yk | xk , y1:k−1 ) p(xk | y1:k−1 ) / p(yk | y1:k−1 ).   (49)

Using Property 2, we can recognise that p(yk | xk , y1:k−1 ) = p(yk | xk ) which is given by the measurement
model (35). Furthermore, let us assume that p(xk | y1:k−1 ) = N (x̂k|k−1 , Pk|k−1 ) and that we have previously estimated x̂k|k−1 and Pk|k−1 . Again using the fact that maximising p(·) is equivalent to minimising − log p(·) and omitting constants, we can write this optimisation problem equivalently as

x̂k|k = arg max_{xk} p(yk | xk ) p(xk | y1:k−1 ) / p(yk | y1:k−1 )
     = arg max_{xk} p(yk | xk ) p(xk | y1:k−1 )
     = arg min_{xk} − log p(yk | xk ) − log p(xk | y1:k−1 )
     = arg min_{xk} ½ ∥yk − C xk ∥²_{R^{-1}} + ½ ∥xk − x̂k|k−1 ∥²_{P_{k|k−1}^{-1}}.   (50)

Note that (50) is a weighted stochastic least squares problem of exactly the same form as the weighted stochastic least squares problem in (21), with y = yk , F = C, θ = xk , Σ = R, µθ = x̂k|k−1 and Pθ = Pk|k−1 . Therefore, the derivation in Section 3.3 gives that the optimal solution to this problem is

x̂k|k = Kk yk + (I − Kk C) x̂k|k−1 ,   (51)

with Kk = (C^T R^{-1} C + P_{k|k−1}^{-1})^{-1} C^T R^{-1} = Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1}. Similarly, the variance of this estimate is given by

E[(xk − x̂k|k )(xk − x̂k|k )^T] = Pk|k−1 − Pk|k−1 C^T (C Pk|k−1 C^T + R)^{-1} C Pk|k−1 .   (52)

The optimal solution and its variance therefore correspond to the Kalman filter measurement update from Section 4.2.

4.4 Extension to nonlinear models

Let us now assume that instead of the linear state-space model (32), (34), we have a nonlinear model

xk+1 = f (xk ) + wk , wk ∼ N (0, Q), (53a)


yk = h(xk ) + vk , vk ∼ N (0, R), (53b)

where f (·) and h(·) are nonlinear functions. We can equivalently write (53) as

p(xk+1 | xk ) = N (xk+1 ; f (xk ), Q), (54a)


p(yk | xk ) = N (yk ; h(xk ), R). (54b)

For this nonlinear model the computations in Section 4.2 no longer hold as Theorem 1 can not be used.
The filtering distribution p(xk | y1:k ) is generally no longer Gaussian. There are now a number of strategies
that can be used for nonlinear filtering. The nonlinear filtering problem can for instance be solved using
linearisation, see Section 4.4.1, or as a nonlinear least squares problem, see Section 4.4.2. Other techniques,
such as particle filtering, unscented Kalman filtering and moving horizon estimation, are not explicitly
discussed in the course.

4.4.1 Extended Kalman filtering

The Extended Kalman filter (EKF) is an extension of Kalman filtering that can be used for nonlinear state
space models like (53). The EKF relies on linearisation of both the dynamic model and the measurement
model around the currently best estimate. In other words, in the time update the following first order Taylor
approximation is made
xk+1 ≈ f (x̂k|k ) + df (xk )/dxk |_{xk =x̂k|k} (xk − x̂k|k ) + wk .   (55)

In the measurement update the first order Taylor approximation is instead

yk ≈ h(x̂k|k−1 ) + dh(xk )/dxk |_{xk =x̂k|k−1} (xk − x̂k|k−1 ) + vk .   (56)

The EKF is summarised in Algorithm 3. Note that although there are no guarantees on the performance of
an EKF, it is widely used in practice and is known to work well as long as the functions f (·), h(·) are not
“too nonlinear” and the sampling frequency is not “too low.”

4.4.2 Nonlinear least squares for Kalman filtering

An alternative approach to approximating the filtering distribution for the nonlinear state-space model (54)
is to compute a MAP estimate using nonlinear least squares (NLS) similar to the approach in Section 3.4.
Using the same reasoning as in Section 4.3, we can solve the optimisation problem

x̂k|k = arg max_{xk} p(xk | y1:k )
     = arg min_{xk} − log p(yk | xk ) − log p(xk | y1:k−1 )
     = arg min_{xk} ½ ∥yk − h(xk )∥²_{R^{-1}} + ½ ∥xk − x̂k|k−1 ∥²_{P_{k|k−1}^{-1}},   (57)

assuming that p(xk | y1:k−1 ) ≈ N (x̂k|k−1 , Pk|k−1 ) and that we have previously estimated x̂k|k−1 and Pk|k−1 .
Note that this is an approximation as the predictive distribution for nonlinear state-space models is generally
not Gaussian. The optimisation problem (57) is a nonlinear least squares problem that can be solved using
Algorithm 1. In general it can not be guaranteed that this optimisation problem converges to the correct

Algorithm 3: Extended Kalman filter
Input: An initial estimate x̂1 , covariance P1 and a known model with nonlinear functions f (xk ), h(xk )
and matrices Q, R.
Output: For every time instance a filtered estimate x̂k|k and uncertainty Pk|k .
for k = 1, . . . , NT
    Measurement update: Compute the mean and covariance of the filtering distribution as

        x̂k|k = x̂k|k−1 + Kk (yk − h(x̂k|k−1 )),
        Pk|k = Pk|k−1 − Pk|k−1 Hk^T (Hk Pk|k−1 Hk^T + R)^{-1} Hk Pk|k−1 ,

    with Kk = Pk|k−1 Hk^T (Hk Pk|k−1 Hk^T + R)^{-1} and Hk = dh(xk )/dxk |_{xk =x̂k|k−1}.
    Time update: Compute an approximate mean and covariance of the predictive distribution as

        x̂k+1|k = f (x̂k|k ),
        Pk+1|k = Fk Pk|k Fk^T + Q,   Fk = df (xk )/dxk |_{xk =x̂k|k}.

end for

local optimum. However, in practice (57) often converges when using x̂k|k−1 as an initial estimate, as long as the functions f (·), h(·) are not "too nonlinear" and the sampling frequency is not "too low." Note that the optimisation problem (57) is only one example of an optimisation problem that can be solved for nonlinear estimation. Commonly used alternatives include moving horizon estimation and smoothing. These do not optimise only over xk , as in (57), but instead optimise over a sequence of states.
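A single cycle of Algorithm 3 can be sketched as follows. The code assumes numpy; the helper name `ekf_step` and the scalar model with h(x) = x² are purely illustrative.

```python
import numpy as np

def ekf_step(x, P, yk, f, F_jac, h, H_jac, Q, R):
    """One EKF cycle (Algorithm 3): measurement update, then time update."""
    # Measurement update with Hk = dh/dx evaluated at the predicted estimate
    H = H_jac(x)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x_filt = x + K @ (yk - h(x))
    P_filt = P - K @ H @ P
    # Time update with Fk = df/dx evaluated at the filtered estimate
    F = F_jac(x_filt)
    x_pred = f(x_filt)
    P_pred = F @ P_filt @ F.T + Q
    return x_filt, P_filt, x_pred, P_pred

# Hypothetical scalar example: linear dynamics, squared-magnitude measurement
f = lambda x: 0.9 * x
F_jac = lambda x: np.array([[0.9]])
h = lambda x: x ** 2
H_jac = lambda x: np.array([[2.0 * x[0]]])
Q = np.array([[0.01]]); R = np.array([[0.1]])
x_filt, P_filt, x_pred, P_pred = ekf_step(
    np.array([1.0]), np.array([[0.5]]), np.array([1.2]), f, F_jac, h, H_jac, Q, R)
```

With these numbers, H1 = 2, so the innovation covariance is 2 · 0.5 · 2 + 0.1 = 2.1 and the filtered mean becomes 1 + 0.2/2.1.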

A Matrix inversion

In this appendix we give two useful lemmas for matrix inversion. Note that Lemma 2 is equivalent to Lemma 2.3 in [3].

Lemma 2: The matrix inversion lemma, a.k.a. the Woodbury identity Consider matrices
A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×m and D ∈ Rm×n . If A, C and (A + BCD) are invertible, then

(A + BCD)−1 = A−1 − A−1 B(C −1 + DA−1 B)−1 DA−1 . (58)

Lemma 3: Matrix inversion identity Consider matrices A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×m and
D ∈ Rm×n . If A, C and (A + BCD) are invertible, then

(A + BCD)−1 BC = A−1 B(C −1 + DA−1 B)−1 . (59)

Proof of Lemmas 2 and 3: Both lemmas can be proven using the Schur complement, using a similar strategy as on page 19 of [3] and the fact that

[I , 0 ; X , I]^{-1} = [I , 0 ; −X , I],   [I , X ; 0 , I]^{-1} = [I , −X ; 0 , I].   (60)

For matrices A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{m×m} and D ∈ R^{m×n}, with C assumed to be invertible, let us construct the matrix M = [A , B ; −D , C^{-1}]. Using the Schur complement, we can write M equivalently as

M = [A , B ; −D , C^{-1}] = [I , BC ; 0 , I] [A + BCD , 0 ; 0 , C^{-1}] [I , 0 ; −CD , I].   (61)

We can now first recognise that using property (iii) on page 20 of [3],

det(M ) = det(A + BCD) det(C^{-1} ).   (62)

As C and (A + BCD) are assumed to be invertible, their determinant is non-zero. Hence, the determinant
of M is non-zero and M is invertible. This inverse can be written as
  −1  
−1 I 0 A + BCD 0 I −BC
M =
CD I 0 C −1 0 I
 −1 −1 −1 −1 −1 −1
−A−1 B(C −1 + DA−1 B)−1

A − A B(C + DA B) DA
= . (63)
(C −1 + DA−1 B)−1 DA−1 −(C −1 + DA−1 B)−1

Alternatively, assuming A to be invertible allows us to write

M = [A , B ; −D , C^{-1}] = [I , 0 ; −DA^{-1} , I] [A , 0 ; 0 , C^{-1} + DA^{-1}B] [I , A^{-1}B ; 0 , I].   (64)

Using property (iii) on page 20 of [3],

det(M ) = det(A) det(C^{-1} + DA^{-1}B).   (65)

As A and M are known to be invertible, C^{-1} + DA^{-1}B can be concluded to be invertible. We can now obtain a second expression for M^{-1} given by
M^{-1} = [I , −A^{-1}B ; 0 , I] [A , 0 ; 0 , C^{-1} + DA^{-1}B]^{-1} [I , 0 ; DA^{-1} , I]
       = [ A^{-1} − A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} , −A^{-1}B(C^{-1} + DA^{-1}B)^{-1} ; (C^{-1} + DA^{-1}B)^{-1}DA^{-1} , (C^{-1} + DA^{-1}B)^{-1} ].   (66)

We can now directly read off from (63) and (66) that

(A + BCD)^{-1} = A^{-1} − A^{-1}B(C^{-1} + DA^{-1}B)^{-1}DA^{-1} ,   (67a)
−(A + BCD)^{-1}BC = −A^{-1}B(C^{-1} + DA^{-1}B)^{-1} ,   (67b)

which is equivalent to Lemmas 2 and 3.
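Both lemmas are easy to spot-check numerically; a sketch assuming numpy, with arbitrary well-conditioned example matrices (the scaling of B and D is only there to keep the inverses well-behaved):

```python
import numpy as np

# Numeric spot-check of Lemmas 2 and 3 on arbitrary example matrices
rng = np.random.default_rng(2)
n, m = 4, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
B = 0.3 * rng.standard_normal((n, m))
C = np.eye(m) + 0.1 * rng.standard_normal((m, m))
D = 0.3 * rng.standard_normal((m, n))

Ai, Ci = np.linalg.inv(A), np.linalg.inv(C)
lhs = np.linalg.inv(A + B @ C @ D)
inner = np.linalg.inv(Ci + D @ Ai @ B)

# Lemma 2 (the Woodbury identity), eq. (58)
assert np.allclose(lhs, Ai - Ai @ B @ inner @ D @ Ai)
# Lemma 3, eq. (59)
assert np.allclose(lhs @ B @ C, Ai @ B @ inner)
```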

References
[1] S. Särkkä. Bayesian Filtering and Smoothing. Cambridge University Press, 2013.
[2] T. B. Schön and F. Lindsten. Manipulating the multivariate Gaussian density. 2011.
[3] M. Verhaegen and V. Verdult. Filtering and System Identification: A Least Squares Approach. Cambridge
University Press, 2007.

