State Estimation for Robotics
A Matrix Lie Group Approach
Timothy D. Barfoot
© 2016
Contents
1 Introduction 1
1.1 A Little History 1
1.2 Sensors, Measurements, and Problem Definition 3
1.3 How This Book is Organized 4
1.4 Relationship to Other Books 5
3 Linear-Gaussian Estimation 35
3.1 Batch Discrete-Time Estimation 35
3.1.1 Problem Setup 35
3.1.2 Maximum A Posteriori 37
3.1.3 Bayesian Inference 42
3.1.4 Existence, Uniqueness, and Observability 44
3.1.5 MAP Covariance 48
3.2 Recursive Discrete-Time Smoothing 49
3.2.1 Exploiting Sparsity in the Batch Solution 50
3.2.2 Cholesky Smoother 51
3.2.3 Rauch-Tung-Striebel Smoother 53
3.3 Recursive Discrete-Time Filtering 56
3.3.1 Factoring the Batch Solution 57
3.3.2 Kalman Filter via MAP 61
3.3.3 Kalman Filter via Bayesian Inference 66
3.3.4 Kalman Filter via Gain Optimization 67
3.3.5 Kalman Filter Discussion 68
3.3.6 Error Dynamics 69
3.3.7 Existence, Uniqueness, and Observability 70
3.4 Batch Continuous-Time Estimation 72
3.4.1 Gaussian Process Regression 72
3.4.2 A Class of Exactly Sparse Gaussian Process Priors 75
3.4.3 Linear Time-Invariant Case 81
3.4.4 Relationship to Batch Discrete-Time Estimation 85
3.5 Summary 86
3.6 Exercises 86
References 361
Index 367
Acronyms and Abbreviations
Notation
– General Notation –
– Matrix-Lie-Group Notation –
[Frontispiece: Geographica by Petrus Apianus (1495–1552), a German mathematician, astronomer, and cartographer. Much of three-dimensional state estimation has to do with triangulation and/or trilateration; we measure some angles and lengths and infer the others through trigonometry.]
Tim Barfoot
Toronto, May 9, 2016
1 Introduction
Robotics inherently deals with things that move in the world. We live
in an era of rovers on Mars, drones surveying the Earth, and soon
self-driving cars. And, although specific robots have their subtleties,
there are also some common issues we must face in all applications,
particularly state estimation and control.
The state of a robot is a set of quantities, such as position, orien-
tation, and velocity, that if known fully describe that robot’s motion
over time. Here we will focus entirely on the problem of estimating
the state of a robot, putting aside the notion of control. Yes, control
is essential, as we would like to make our robots behave in a certain
way. But, the first step in doing so is often the process of determining
the state. Moreover, the difficulty of state estimation is often underes-
timated for real-world problems and thus it is important to put it on
an equal footing with control.
In this book, we will introduce the classic estimation results for linear
systems corrupted by Gaussian measurement noise. We will then ex-
amine some of the extensions to nonlinear systems with non-Gaussian
noise. In a departure from typical estimation texts, we will take a de-
tailed look at how to tailor general estimation results to robots oper-
ating in three-dimensional space, advocating a particular approach to
handling rotations: matrix Lie groups.
The rest of this introduction will provide a little history of estimation,
discuss types of sensors and measurements, and introduce the problem
of state estimation. We conclude with a breakdown of the contents of
the book and provide some other suggested reading.
1 There is some debate as to whether Adrien Marie Legendre might have come up with
least squares before Gauss.
There are many specific versions of this problem and just as many
solutions. The goal will be to understand which methods work well in
which situations, in order to pick the best tool for the job.
I: Estimation Machinery
II: Three-Dimensional Machinery
III: Applications
The first part, Estimation Machinery, presents classic and state-of-the-
art estimation tools, without the complication of dealing with things
that live in three-dimensional space (and therefore translate and ro-
tate); the state to be estimated is assumed to be a generic vector.
For those not interested in the details of working in three-dimensional
space, this first part can be read in a standalone manner. It covers both
recursive state estimation techniques as well as batch methods (less
common in classic estimation books). As is commonplace in robotics
and machine learning today, we adopt a Bayesian approach to estima-
tion in this book. We contrast (full) Bayesian methods with maximum
a posteriori (MAP) methods, and attempt to make clear the difference
between these when faced with nonlinear problems. The book also con-
nects continuous-time estimation with Gaussian process regression from the machine-learning world.
Part I: Estimation Machinery

2 Primer on Probability Theory

(Footnote: That is, it satisfies the axiom of total probability. Note that this is probability density, not probability.)
Figure 2.1: A probability density, p(x), over a finite interval [a, b], with ∫_a^b p(x) dx = 1 (left), and the probability Pr(c ≤ x ≤ d) of being within a sub-interval [c, d] (right).
Figure 2.1 depicts a general PDF over a finite interval as well as the
probability of being within a sub-interval. We use PDFs to represent
the likelihood of x being in all possible states in the interval, [a, b].
We can also introduce a conditioning variable. Let p(x|y) be a PDF over x ∈ [a, b] conditioned on y ∈ [r, s] such that

(∀y)  ∫_a^b p(x|y) dx = 1.    (2.3)
where E[·] denotes the expectation operator. For a general matrix function, F(x), the expectation is written as

E[F(x)] = ∫ F(x) p(x) dx,    (2.12)
The next two moments are called the skewness and kurtosis, but for
the multivariate case these get quite complicated and require tensor
representations. We will not need them here, but it should be mentioned
that there are an infinite number of these probability moments.
of densities.³ We will often exploit (or assume) that variables are statistically independent to simplify computations.
and

Σ = E[(x − µ)(x − µ)^T]
  = ∫_{−∞}^{∞} (x − µ)(x − µ)^T (1 / √((2π)^N det Σ)) exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ) dx.    (2.31)
We may also write that x is normally (a.k.a., Gaussian) distributed
using the following notation:
x ∼ N (µ, Σ) .
We say a random variable is standard normally distributed if
x ∼ N (0, 1) ,
where 1 is an N × N identity matrix.
Similarly, we have

E[x x₂^T x₂ x^T] = Σ tr(Σ₂₂) + 2 [ Σ₁₂ Σ₁₂^T   Σ₁₂ Σ₂₂ ; Σ₂₂ Σ₁₂^T   Σ₂₂² ]
                 = Σ ( tr(Σ₂₂) 1 + 2 [ 0   0 ; Σ₁₂^T   Σ₂₂ ] ),    (2.39)

and as a final check,

E[x x^T x x^T] = E[x (x₁^T x₁ + x₂^T x₂) x^T] = E[x x₁^T x₁ x^T] + E[x x₂^T x₂ x^T].    (2.40)
We furthermore have that

E[x x^T A x x^T] = E[ [ Σ_{k=1}^N Σ_{ℓ=1}^N x_i x_j x_k a_{kℓ} x_ℓ ]_{ij} ]
  = [ Σ_{k=1}^N Σ_{ℓ=1}^N a_{kℓ} E[x_i x_j x_k x_ℓ] ]_{ij}
  = [ Σ_{k=1}^N Σ_{ℓ=1}^N a_{kℓ} ( E[x_i x_j] E[x_k x_ℓ] + E[x_i x_k] E[x_j x_ℓ] + E[x_i x_ℓ] E[x_j x_k] ) ]_{ij}
  = [E[x_i x_j]]_{ij} Σ_{k=1}^N Σ_{ℓ=1}^N a_{kℓ} E[x_k x_ℓ]
    + [ Σ_{k=1}^N Σ_{ℓ=1}^N E[x_i x_k] a_{kℓ} E[x_ℓ x_j] ]_{ij}
    + [ Σ_{k=1}^N Σ_{ℓ=1}^N E[x_i x_ℓ] a_{kℓ} E[x_k x_j] ]_{ij}
  = Σ tr(AΣ) + Σ A Σ + Σ A^T Σ
  = Σ ( tr(AΣ) 1 + A Σ + A^T Σ ),    (2.41)

where A is a compatible square matrix.
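The quartic-moment identity (2.41) is easy to spot-check numerically by drawing many samples and averaging. The following is a minimal sketch (my own, not from the book); it assumes a zero-mean x ∼ N(0, Σ), which is the setting of the identities above, and uses only numpy with arbitrary test matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 3
# Build a random positive-definite covariance and a square matrix A (assumed test values).
S = rng.standard_normal((N, N))
Sigma = S @ S.T + N * np.eye(N)
A = rng.standard_normal((N, N))

# Monte Carlo estimate of E[x x^T A x x^T] for x ~ N(0, Sigma).
x = rng.multivariate_normal(np.zeros(N), Sigma, size=200000)
mc = np.mean([np.outer(xi, xi) @ A @ np.outer(xi, xi) for xi in x], axis=0)

# Closed-form expression from (2.41).
closed = Sigma @ (np.trace(A @ Sigma) * np.eye(N) + A @ Sigma + A.T @ Sigma)

# Relative error should be small (it shrinks as the sample count grows).
print(np.max(np.abs(mc - closed)) / np.max(np.abs(closed)))
```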
factors, p(x, y) = p(x|y) p(y), and we can work out the details for the joint Gaussian case by using the Schur complement⁶. We begin by noting that

[ Σxx  Σxy ; Σyx  Σyy ] = [ 1   Σxy Σyy^{−1} ; 0   1 ] [ Σxx − Σxy Σyy^{−1} Σyx   0 ; 0   Σyy ] [ 1   0 ; Σyy^{−1} Σyx   1 ],    (2.43)

where 1 is the identity matrix. We then invert both sides to find that

[ Σxx  Σxy ; Σyx  Σyy ]^{−1} = [ 1   0 ; −Σyy^{−1} Σyx   1 ] [ (Σxx − Σxy Σyy^{−1} Σyx)^{−1}   0 ; 0   Σyy^{−1} ] [ 1   −Σxy Σyy^{−1} ; 0   1 ].    (2.44)

(Issai Schur (1875–1941) was a German mathematician who worked on group representations, the subject with which he is most closely associated, but also in combinatorics and number theory and even theoretical physics.)
Looking just at the quadratic part (inside the exponential) of the joint PDF, p(x, y), we have

[ x − µx ; y − µy ]^T [ Σxx  Σxy ; Σyx  Σyy ]^{−1} [ x − µx ; y − µy ]
  = [ x − µx ; y − µy ]^T [ 1   0 ; −Σyy^{−1} Σyx   1 ] [ (Σxx − Σxy Σyy^{−1} Σyx)^{−1}   0 ; 0   Σyy^{−1} ] [ 1   −Σxy Σyy^{−1} ; 0   1 ] [ x − µx ; y − µy ]
  = ( x − µx − Σxy Σyy^{−1} (y − µy) )^T ( Σxx − Σxy Σyy^{−1} Σyx )^{−1} ( x − µx − Σxy Σyy^{−1} (y − µy) )
    + ( y − µy )^T Σyy^{−1} ( y − µy ),    (2.45)
which is the sum of two quadratic terms. Since the exponential of a
sum is the product of two exponentials, we have that
p(x, y) = p(x|y) p(y), (2.46a)
p(x|y) = N( µx + Σxy Σyy^{−1} (y − µy),  Σxx − Σxy Σyy^{−1} Σyx ),    (2.46b)
p(y) = N( µy, Σyy ).    (2.46c)
It is important to note that both factors, p(x|y) and p(y), are Gaussian
PDFs. Also, if we happen to know the value of y (i.e., it is measured),
we can work out the likelihood of x given this value of y by computing
p(x|y) using (2.46b).
This is in fact the cornerstone of Gaussian inference: we start with
a prior about our state, x ∼ N (µx , Σxx ), then narrow this down based
on some measurements, ymeas . In (2.46b), we see that an adjustment is
made to the mean, µx , and the covariance, Σxx (it is made smaller).
6 In this case, we have that the Schur complement of Σyy is the expression Σxx − Σxy Σyy^{−1} Σyx.
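As a concrete illustration of the conditioning formula (2.46b), here is a small sketch (my own, not from the book) that conditions a jointly Gaussian variable on a measured value of y using only numpy; the block sizes and numerical values are arbitrary assumptions.

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, y_meas):
    """Return mean and covariance of p(x | y = y_meas) per (2.46b)."""
    # Solve Syy * Z = Syx instead of forming the inverse explicitly.
    K = np.linalg.solve(Syy, Sxy.T).T          # K = Sxy Syy^{-1}
    mean = mu_x + K @ (y_meas - mu_y)          # adjusted mean
    cov = Sxx - K @ Sxy.T                      # adjusted (smaller) covariance
    return mean, cov

# Example with assumed numbers.
mu_x, mu_y = np.array([1.0, 2.0]), np.array([0.5])
Sxx = np.array([[2.0, 0.3], [0.3, 1.0]])
Sxy = np.array([[0.4], [0.2]])
Syy = np.array([[0.5]])
print(condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, np.array([1.0])))
```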
we have that

v ∼ N( Σxx^{−1} µx, Σxx^{−1} ).    (2.58)

Since the mapping from y to x is not unique, we need to specify what we want to do. One choice is to let

v = G^T u  ⇔  Σxx^{−1} x = G^T Σyy^{−1} y.    (2.59)

We then take expectations:

Σxx^{−1} µx = E[v] = E[G^T u] = G^T E[u] = G^T Σyy^{−1} µy,    (2.60a)
Σxx^{−1} = E[ (v − Σxx^{−1} µx)(v − Σxx^{−1} µx)^T ] = G^T E[ (u − Σyy^{−1} µy)(u − Σyy^{−1} µy)^T ] G = G^T Σyy^{−1} G.    (2.60b)

Note, if Σxx^{−1} is not full rank, we cannot actually recover Σxx and µx and must keep them in information form. However, multiple such estimates can be fused together, which is the subject of the next section.
where

Σ^{−1} = Σ_{k=1}^K G_k^T Σ_k^{−1} G_k,    (2.64a)
Σ^{−1} µ = Σ_{k=1}^K G_k^T Σ_k^{−1} µ_k,    (2.64b)
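The fusion rule (2.64) maps directly to code. Below is a minimal sketch (mine, not the book's); it assumes each of the K estimates comes with its own projection matrix G_k, mean µ_k, and covariance Σ_k, and accumulates them in information form before converting back.

```python
import numpy as np

def fuse_information_form(estimates):
    """Fuse estimates [(G_k, mu_k, Sigma_k), ...] per (2.64); return (mu, Sigma)."""
    n = estimates[0][0].shape[1]
    Lam = np.zeros((n, n))   # accumulates Sigma^{-1}
    eta = np.zeros(n)        # accumulates Sigma^{-1} mu
    for G, mu, Sig in estimates:
        Sig_inv = np.linalg.inv(Sig)
        Lam += G.T @ Sig_inv @ G
        eta += G.T @ Sig_inv @ mu
    Sigma = np.linalg.inv(Lam)   # assumes the fused information matrix is full rank
    return Sigma @ eta, Sigma

# Two scalar estimates of the same state (assumed numbers): fused mean lies between them,
# weighted toward the more certain one.
ests = [(np.eye(1), np.array([1.0]), np.array([[2.0]])),
        (np.eye(1), np.array([3.0]), np.array([[1.0]]))]
print(fuse_information_form(ests))
```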
which is invertible:

x = ln(y).    (2.73)

The infinitesimal integration volumes for x and y are then related by

dy = exp(x) dx,    (2.74)

or

dx = (1/y) dy.    (2.75)

According to the axiom of total probability we have

1 = ∫_{−∞}^{∞} p(x) dx
  = ∫_{−∞}^{∞} (1/√(2πσ²)) exp( −x²/(2σ²) ) dx
  = ∫_0^{∞} (1/√(2πσ²)) exp( −(ln(y))²/(2σ²) ) (1/y) dy      (the integrand is p(y))
  = ∫_0^{∞} p(y) dy,    (2.76)
giving us the exact expression for p(y), which is plotted in Figure 2.4 for σ² = 1 as the black curve; the area under this curve, from y = 0 to ∞, is 1. The grey histogram is a numerical approximation of the PDF generated by sampling x a large number of times, passing these samples through the nonlinearity individually, and then binning. The two approaches agree very well, validating our method of changing variables.
Note, p(y) is no longer Gaussian owing to the nonlinear change of variables. We can verify numerically that the area under this function is indeed 1 (i.e., it is a valid PDF). It is worth noting that had we not been careful about handling the change of variables and including the 1/y factor, we would not have a valid PDF.

Figure 2.4: The PDF, p(y), resulting from passing p(x) = (1/√(2π)) exp(−x²/2) through the nonlinearity y = exp(x).
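The numerical check described above is easy to reproduce. The sketch below (my own, assuming σ² = 1 and an arbitrary grid/bin choice) samples x, pushes the samples through y = exp(x), and compares the histogram with the analytical density from (2.76), including the unit-area check.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0
x = rng.normal(0.0, np.sqrt(sigma2), size=1_000_000)
y = np.exp(x)                                   # pass samples through the nonlinearity

def p_y(y, sigma2=1.0):
    """Analytical density from (2.76): Gaussian in ln(y) times the 1/y volume factor."""
    return np.exp(-0.5 * np.log(y) ** 2 / sigma2) / (np.sqrt(2.0 * np.pi * sigma2) * y)

# Compare a histogram of the samples to the analytical density on bin centres.
counts, edges = np.histogram(y, bins=200, range=(0.0, 8.0), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(counts - p_y(centres, sigma2))))   # small sampling noise only

# The area under p(y) should be 1 (numerical quadrature check).
grid = np.linspace(1e-6, 50.0, 200000)
print(np.trapz(p_y(grid, sigma2), grid))
```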
we may complete the square for the part inside the integral such that

exp( −(1/2)(x − µx)^T ( Σxx^{−1} + G^T R^{−1} G )(x − µx) ) × exp( (y − µy)^T R^{−1} G (x − µx) )
  = exp( −(1/2)( (x − µx) − F(y − µy) )^T ( G^T R^{−1} G + Σxx^{−1} ) ( (x − µx) − F(y − µy) ) )
    × exp( (1/2)(y − µy)^T F^T ( G^T R^{−1} G + Σxx^{−1} ) F (y − µy) ).    (2.80)
As we will see later, the two equations, (2.63) and (2.81), constitute the
observation and predictive steps of the classic discrete-time (extended)
Kalman filter (Kalman, 1960b). These two steps may be thought of as
the creation and destruction of information in the filter, respectively.
(x − µ)^T Σ^{−1} (x − µ) = 1/M²,    (2.87)

where M is a factor applied to scale the nominal (M = 1) covariance. In this case we have that

p(x) = 1 / √( (2π)^N e^{1/M²} det Σ ),    (2.88)

which is independent of x and thus constant. Looking at this last expression, we note that as the dimension, N, of the data increases, we need to very rapidly increase M in order to keep p(x) constant.
By inserting (2.86) into (2.22) we can easily see that the mutual information for the joint Gaussian is given by

I(x, y) = (1/2) ln( (2πe)^N det Σxx ) + (1/2) ln( (2πe)^M det Σyy ) − (1/2) ln( (2πe)^{M+N} det Σ )
        = −(1/2) ln( det Σ / (det Σxx det Σyy) ).    (2.90)

Looking back to (2.43), we can also note that

det Σ = det Σxx det( Σyy − Σyx Σxx^{−1} Σxy )
      = det Σyy det( Σxx − Σxy Σyy^{−1} Σyx ).    (2.91)
I(x, y) = −(1/2) ln det( 1 − Σxx^{−1} Σxy Σyy^{−1} Σyx )
        = −(1/2) ln det( 1 − Σyy^{−1} Σyx Σxx^{−1} Σxy ),    (2.92)

where the two versions can be seen to be equivalent through Sylvester's determinant theorem.

(James Joseph Sylvester (1814–1897) was an English mathematician who made fundamental contributions to matrix theory, invariant theory, number theory, partition theory, and combinatorics. Sylvester's determinant theorem says that det(1 − AB) = det(1 − BA), even when A and B are not square.)

2.2.11 Cramér-Rao Lower Bound Applied to Gaussian PDFs

Suppose that we have K samples (i.e., measurements), xmeas,k ∈ R^N, drawn from a Gaussian PDF. The K statistically independent random variables associated with these measurements are thus

(∀k)  x_k ∼ N(µ, Σ).    (2.93)

The term statistically independent implies that E[(x_k − µ)(x_ℓ − µ)^T] = 0 for k ≠ ℓ. Now suppose our goal is to estimate the mean of this PDF,
µ, from the measurements, xmeas,1 , . . . , xmeas,K . For the joint density of
all the random variables, x = (x₁, . . . , x_K), we in fact have

ln p(x|µ, Σ) = −(1/2)(x − Aµ)^T B^{−1} (x − Aµ) − ln √( (2π)^{NK} det B ),    (2.94)

where

A = [ 1  1  · · ·  1 ]^T  (K blocks),    B = diag( Σ, Σ, . . . , Σ )  (K blocks).    (2.95)
which we can see is just K times the inverse covariance of the Gaussian density. The CRLB thus says

cov( x̂ | xmeas,1, . . . , xmeas,K ) ≥ (1/K) Σ.    (2.98)

In other words, the lower limit of the uncertainty in the estimate of the mean, x̂, becomes smaller and smaller the more measurements we have (as we would expect).
Note, in computing the CRLB we did not need to actually specify the
form of the unbiased estimator at all; the CRLB is the lower bound for
any unbiased estimator. In this case, it is not hard to find an estimator
that performs right at the CRLB:

x̂ = (1/K) Σ_{k=1}^K xmeas,k.    (2.99)
For the mean of this estimator we have

E[x̂] = E[ (1/K) Σ_{k=1}^K x_k ] = (1/K) Σ_{k=1}^K E[x_k] = (1/K) Σ_{k=1}^K µ = µ,    (2.100)

which shows the estimator is indeed unbiased. For the covariance we have

cov( x̂ | xmeas,1, . . . , xmeas,K ) = E[ (x̂ − µ)(x̂ − µ)^T ]
  = E[ ( (1/K) Σ_{k=1}^K x_k − µ ) ( (1/K) Σ_{k=1}^K x_k − µ )^T ]
  = (1/K²) Σ_{k=1}^K Σ_{ℓ=1}^K E[ (x_k − µ)(x_ℓ − µ)^T ]      (each term is Σ when k = ℓ, 0 otherwise)
  = (1/K) Σ,    (2.101)

which is right at the CRLB.
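This is straightforward to illustrate numerically. The sketch below (mine, with arbitrary µ, Σ, K, and trial counts) repeats the K-sample experiment many times and compares the empirical scatter of the sample-mean estimator against the CRLB value Σ/K from (2.98).

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
K, trials = 20, 50000

# For each trial, draw K measurements and form the sample-mean estimator (2.99).
samples = rng.multivariate_normal(mu, Sigma, size=(trials, K))
xhat = samples.mean(axis=1)                      # shape (trials, 2)

emp_cov = np.cov(xhat.T)                         # empirical covariance of the estimator
print(emp_cov)
print(Sigma / K)                                 # CRLB from (2.98); the two should agree
```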
2.4 Summary
The main take-away points from this chapter are:
2.5 Exercises
2.5.1 Show that for any two columns of the same length, u and v,
that
uT v = tr(vuT ).
2.5.2 Show that if two random variables, x and y, are statistically
independent, then the Shannon information of the joint density,
p(x, y), is the sum of the Shannon informations of the individual
densities, p(x) and p(y):
H(x, y) = H(x) + H(y).
2.5.3 For a Gaussian random variable, x ∼ N (µ, Σ), show that
E[xxT ] = Σ + µµT .
2.5.4 Show directly for a Gaussian random variable, x ∼ N(µ, Σ), that

µ = E[x] = ∫_{−∞}^{∞} x p(x) dx.
where

Σ^{−1} = Σ_{k=1}^K Σ_k^{−1},    Σ^{−1} µ = Σ_{k=1}^K Σ_k^{−1} µ_k,
3 Linear-Gaussian Estimation
This chapter will introduce some of the classic results from estima-
tion theory for linear models and Gaussian random variables includ-
ing the Kalman filter (KF) (Kalman, 1960b). We will begin with a
batch, discrete-time estimation problem that will provide important
insights into the nonlinear extension of the work in subsequent chap-
ters. From the batch approach, we will show how the recursive methods
can be developed. Finally, we will circle back to the more general case
of handling continuous-time motion models and connect these to the
discrete-time results as well as to Gaussian process regression from the
machine-learning world. Classic books that cover linear estimation in-
clude Bryson (1975), Maybeck (1994), and Stengel (1994).
The rest of this chapter will present a suite of techniques for addressing
this state estimation problem. Our approach will always be to attempt
to come up with not only a state estimate, but also to quantify the
uncertainty in that estimate.
To set ourselves up for what is to follow in the later chapters on
nonlinear estimation, we will begin by formulating a batch linear-
Gaussian (LG) estimation problem. The batch solution is very useful
for computing state estimates after the fact because it uses all the mea-
surements in the estimation of all the states at once (hence the usage of
‘batch’). However, a batch method cannot be used in real-time since we
cannot employ future measurements to estimate past states. For this
we will need recursive state estimators, which will be covered later in
this chapter.
1 Sometimes the input is specialized to be of the form vk = Bk uk , where uk ∈ RU is
now the input and Bk ∈ RN ×U is called the control matrix. We will use this form as
needed in our development.
2 We will use (·)ˆ to indicate posterior estimates (including measurements) and (·)ˇ to indicate prior estimates (not including measurements).
3 In robotics, this input is sometimes replaced by an interoceptive measurement. This is
a bit of a dangerous thing to do since it then conflates two sources of uncertainty:
process noise and measurement noise. If this is done, we must be careful to inflate Q
appropriately to reflect the two uncertainties.
which is to say that we want to find the best single estimate for the state of the system (at all timesteps), x̂, given the prior information, v, and measurements, y.⁴ Note that we have

x̂ = arg max_x p(x|v, y) = arg max_x  p(y|x, v) p(x|v) / p(y|v)
  = arg max_x p(y|x) p(x|v),    (3.3)

p(y|x) = Π_{k=0}^K p(y_k | x_k),    (3.4)

p(x|v) = p(x₀ | x̌₀) Π_{k=1}^K p(x_k | x_{k−1}, v_k).    (3.5)

p(x₀ | x̌₀) = (1 / √((2π)^N det P̌₀)) exp( −(1/2)(x₀ − x̌₀)^T P̌₀^{−1} (x₀ − x̌₀) ),    (3.6a)

p(x_k | x_{k−1}, v_k) = (1 / √((2π)^N det Q_k)) exp( −(1/2)(x_k − A_{k−1} x_{k−1} − v_k)^T Q_k^{−1} (x_k − A_{k−1} x_{k−1} − v_k) ),    (3.6b)

p(y_k | x_k) = (1 / √((2π)^M det R_k)) exp( −(1/2)(y_k − C_k x_k)^T R_k^{−1} (y_k − C_k x_k) ).    (3.6c)

Note that we must have P̌₀, Q_k, and R_k invertible; they are in fact positive-definite by assumption and therefore invertible.
We will work with J(x) as is, but note that it is possible to add all
kinds of additional terms to this expression that will influence the so-
lution for the best estimate (e.g., constraints, penalty terms). From an
6 A logarithm is a monotonically increasing function and therefore will not affect our
optimization problem.
which will result in the same solution for the best estimate, x̂, as (3.2).
In other words, we are still finding the best estimate in order to max-
imize the joint likelihood of all the data we have. This is an uncon-
strained optimization problem in that we do not have to satisfy any
constraints on the design variable, x.
To further simplify our problem, we make use of the fact that equations (3.9) are quadratic in x. To make this more clear, we stack all the known data into a lifted column, z, and recall that x is also a tall column consisting of all the states:

z = [ x̌₀^T   v₁^T   · · ·   v_K^T   y₀^T   y₁^T   · · ·   y_K^T ]^T,    x = [ x₀^T   · · ·   x_K^T ]^T.    (3.12)
We then define the following block-matrix quantities:

H = [  1                              ]
    [ −A₀    1                        ]
    [        ⋱     ⋱                  ]
    [            −A_{K−1}    1        ]
    [  C₀                             ]
    [        C₁                       ]
    [              ⋱                  ]
    [                     C_K         ],    (3.13a)

W = diag( P̌₀, Q₁, . . . , Q_K, R₀, R₁, . . . , R_K ),    (3.13b)

where only non-zero blocks are shown; the partition separates the block-rows associated with the prior and inputs (top) from those associated with the measurements (bottom).
We only care about the first factor, which is the full Bayesian posterior. This can be written, using the approach outlined in Section 2.2.3, as

p(x|v, y) = N( x̌ + P̌C^T (CP̌C^T + R)^{−1} (y − Cx̌),  P̌ − P̌C^T (CP̌C^T + R)^{−1} CP̌ ).    (3.28)

Using the SMW identity from equations (2.68), this can be manipulated into the following form:

p(x|v, y) = N( ( P̌^{−1} + C^T R^{−1} C )^{−1} ( P̌^{−1} x̌ + C^T R^{−1} y ),  ( P̌^{−1} + C^T R^{−1} C )^{−1} ),    (3.29)

where the first argument is the mean, x̂, and the second is the covariance, P̂, and we see the inverse covariance appearing on the left-hand side. Substituting in x̌ = Av and P̌^{−1} = (AQA^T)^{−1} = A^{−T} Q^{−1} A^{−1}, we can rewrite this as

( A^{−T} Q^{−1} A^{−1} + C^T R^{−1} C ) x̂ = A^{−T} Q^{−1} v + C^T R^{−1} y,    (3.31)

where the quantity in parentheses on the left-hand side is P̂^{−1}.
We see that this requires computing A^{−1}. It turns out this has a beautifully simple form¹¹,

A^{−1} = [  1                               ]
         [ −A₀     1                        ]
         [        −A₁     1                 ]
         [               −A₂    ⋱           ]
         [                      ⋱     1     ]
         [                    −A_{K−1}   1  ],    (3.32)

which is still lower-triangular, but also very sparse (only the main diagonal and the one below are non-zero). If we define

z = [ v ; y ],    H = [ A^{−1} ; C ],    W = [ Q   0 ; 0   R ],    (3.33)

we can rewrite our system of equations as

H^T W^{−1} H x̂ = H^T W^{−1} z,    (3.34)

which is identical to the optimization solution discussed earlier.
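To make the lifted construction concrete, here is a small sketch (not from the book; the scalar random-walk model, noise values, and trajectory length are arbitrary assumptions) that assembles H, W, and z and solves (3.34) for the batch MAP estimate, with (H^T W^{−1} H)^{−1} as the associated covariance.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 5                       # timesteps 0..K
A, C = 1.0, 1.0             # scalar motion and observation models (assumed)
Q, R, P0 = 0.1, 0.5, 1.0    # process, measurement, and prior variances (assumed)

# Simulate a trajectory and measurements.
x = np.zeros(K + 1)
v = 0.1 * np.ones(K)                               # known inputs (assumed constant)
for k in range(K):
    x[k + 1] = A * x[k] + v[k] + rng.normal(0, np.sqrt(Q))
y = C * x + rng.normal(0, np.sqrt(R), K + 1)

# Lifted quantities from (3.12)-(3.13): z = [x0_prior; v; y], W = diag(P0, Q.., R..).
x0_prior = 0.0
z = np.concatenate(([x0_prior], v, y))
W = np.diag([P0] + [Q] * K + [R] * (K + 1))

H = np.zeros((1 + K + (K + 1), K + 1))
H[0, 0] = 1.0                                      # prior row
for k in range(K):                                 # motion rows: -A x_k + x_{k+1} = v_{k+1}
    H[1 + k, k], H[1 + k, k + 1] = -A, 1.0
for k in range(K + 1):                             # measurement rows
    H[1 + K + k, k] = C

Winv = np.linalg.inv(W)
lhs = H.T @ Winv @ H                               # inverse covariance, as in (3.49)
rhs = H.T @ Winv @ z
xhat = np.linalg.solve(lhs, rhs)                   # batch MAP estimate (3.34)
Phat = np.linalg.inv(lhs)                          # MAP covariance (3.54)
print(xhat)
print(np.sqrt(np.diag(Phat)))
```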
Again, it must be stressed that the reason the Bayesian approach
produces the same answer as the optimization solution for our LG esti-
mation problem is that the full Bayesian posterior is exactly Gaussian
and the mean and mode (i.e., maximum) of a Gaussian are one and
the same.
rank H^T = rank [ 1  −A₀^T                                  C₀^T                        ]
                [     1    −A₁^T                                  C₁^T                  ]
                [           1     ⋱                                    C₂^T             ]
                [                 ⋱    −A_{K−1}^T                            ⋱          ]
                [                        1                                       C_K^T  ],    (3.38)

rank H^T = rank [ −A₀^T                                     C₀^T                        ]
                [  1    −A₁^T                                     C₁^T                  ]
                [        1     ⋱                                        C₂^T            ]
                [              ⋱    −A_{K−1}^T                               ⋱          ]
                [                     1                                          C_K^T  ],    (3.40)

which we note has K + 1 block-rows (each of size N). Moving the top block-row to the bottom does not alter the rank:

rank H^T = rank [ 1   −A₁^T                                 C₁^T                        ]
                [      1     ⋱                                    C₂^T                  ]
                [            ⋱    −A_{K−1}^T                            ⋱               ]
                [                   1                                        C_K^T      ]
                [ −A₀^T                                     C₀^T                        ].    (3.41)

Except for the bottom block-row, this is in row-echelon form. Again without altering the rank, we can add to the bottom block-row, A₀^T times the first block-row, A₀^T A₁^T times the second block-row, . . . , and A₀^T · · · A_{K−1}^T times the Kth block-row, to see that

rank H^T = rank [ 1   −A₁^T                          C₁^T                                                         ]
                [      1     ⋱                             C₂^T                                                   ]
                [            ⋱    −A_{K−1}^T                     ⋱                                                ]
                [                   1                                  C_K^T                                      ]
                [ 0                                  C₀^T   A₀^T C₁^T   A₀^T A₁^T C₂^T   · · ·   A₀^T · · · A_{K−1}^T C_K^T ].    (3.42)

Examining this last expression, we notice immediately that the lower-left partition is zero. Moreover, the upper-left partition is in row-echelon form and in fact is of full rank, NK, since every row has a 'leading one'. Our overall rank condition for H^T therefore collapses to showing that the lower-right partition has rank N:

rank [ C₀^T   A₀^T C₁^T   A₀^T A₁^T C₂^T   · · ·   A₀^T · · · A_{K−1}^T C_K^T ] = N.    (3.43)
If we further assume the system is time-invariant, such that for all k we have A_k = A and C_k = C (we use italicized symbols to avoid confusion with lifted form), and we make the not-too-restrictive assumption that K ≥ N, we may further simplify this condition.
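For the time-invariant case, the resulting condition can be checked numerically by stacking C, CA, . . . , CA^{N−1} and testing the rank. The following is a small sketch (mine, not from the book) using a constant-velocity cart as an assumed example.

```python
import numpy as np

def is_observable(A, C):
    """Rank test on the observability matrix [C; CA; ...; CA^{N-1}]."""
    N = A.shape[0]
    blocks = [C @ np.linalg.matrix_power(A, n) for n in range(N)]
    O = np.vstack(blocks)
    return np.linalg.matrix_rank(O) == N

# Constant-velocity cart: state (position, velocity).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C_pos = np.array([[1.0, 0.0]])     # position measurements: observable (velocity enters via CA)
C_vel = np.array([[0.0, 1.0]])     # velocity-only measurements: position is unobservable
print(is_observable(A, C_pos), is_observable(A, C_vel))   # True False
```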
Interpretation
We can return to the mass-spring analogy to better understand the
observability issue. Figure 3.2 shows a few examples. With the initial
state and all the inputs (top example), the system is always observable
since it is impossible to move any group of carts left or right without
altering the length of at least one spring. This means there is a unique
minimum-energy state. The same is true for the middle example, even
though there is no knowledge of the initial state. The bottom example
is unobservable since the entire chain of carts can be moved left or right without changing the amount of energy stored in the springs. This means the minimum-energy state is not unique.

Figure 3.2: In a single dimension, the mass-spring system is observable if there is no group of carts that can be shifted left or right without altering the energy state of at least one spring. The top example uses the initial state and inputs, so it is always observable. The middle example, with no initial state knowledge, is also observable, since any movement changes at least one spring. The bottom example is not observable, since the whole chain of carts can be moved left-right together without changing any spring lengths; in one dimension, this only happens with no initial state and no measurements.

3.1.5 MAP Covariance

Looking back to (3.35), x̂ represents the most likely estimate of x, the true state. One important question to ask is how confident are we in x̂? It turns out we can re-interpret the least-squares solution as a Gaussian estimate for x in the following way:
( H^T W^{−1} H ) x̂ = H^T W^{−1} z,    (3.49)

where H^T W^{−1} H plays the role of an inverse covariance, x̂ of the mean, and H^T W^{−1} z of an information vector.

x − ( H^T W^{−1} H )^{−1} H^T W^{−1} z = ( H^T W^{−1} H )^{−1} H^T W^{−1} ( Hx − z ),    (3.52)

where the subtracted quantity on the left is E[x] and Hx − z plays the role of the stacked noise,

s = [ w ; n ].    (3.53)

We then have

P̂ = E[ (x − E[x]) (x − E[x])^T ]
  = ( H^T W^{−1} H )^{−1} H^T W^{−1} E[ s s^T ] W^{−1} H ( H^T W^{−1} H )^{−1}
  = ( H^T W^{−1} H )^{−1},    (3.54)

where we have used E[ s s^T ] = W,
⋯ + A_{K−1}^T Q_K^{−1} A_{K−1},    (3.61e)
L_{K,K−1} L_{K−1}^T = −Q_K^{−1} A_{K−1},    (3.61f)
L_K L_K^T = −L_{K,K−1} L_{K,K−1}^T + Q_K^{−1} + C_K^T R_K^{−1} C_K,    (3.61g)

where the right-hand side of (3.61g) is the quantity I_K appearing in (3.66d). Similarly,

L₁ d₁ = −L₁₀ d₀ + Q₁^{−1} v₁ + C₁^T R₁^{−1} y₁ − A₁^T Q₂^{−1} v₂,    (3.63b)
  ⋮
L_{K−1} d_{K−1} = −L_{K−1,K−2} d_{K−2} + Q_{K−1}^{−1} v_{K−1} + C_{K−1}^T R_{K−1}^{−1} y_{K−1} − A_{K−1}^T Q_K^{−1} v_K,    (3.63c)
L_K d_K = −L_{K,K−1} d_{K−1} + Q_K^{−1} v_K + C_K^T R_K^{−1} y_K,    (3.63d)

where, in each case, the first three terms on the right-hand side constitute the quantity q_k appearing in (3.66e).
forwards (k = 1 . . . K):

L_{k−1} L_{k−1}^T = I_{k−1} + A_{k−1}^T Q_k^{−1} A_{k−1},    (3.66a)
L_{k−1} d_{k−1} = q_{k−1} − A_{k−1}^T Q_k^{−1} v_k,    (3.66b)
L_{k,k−1} L_{k−1}^T = −Q_k^{−1} A_{k−1},    (3.66c)
I_k = −L_{k,k−1} L_{k,k−1}^T + Q_k^{−1} + C_k^T R_k^{−1} C_k,    (3.66d)
q_k = −L_{k,k−1} d_{k−1} + Q_k^{−1} v_k + C_k^T R_k^{−1} y_k,    (3.66e)

backwards (k = K . . . 1):

L_{k−1}^T x̂_{k−1} = −L_{k,k−1}^T x̂_k + d_{k−1},    (3.66f)

with

I₀ = P̌₀^{−1} + C₀^T R₀^{−1} C₀,    (3.67a)
q₀ = P̌₀^{−1} x̌₀ + C₀^T R₀^{−1} y₀,    (3.67b)
x̂_K = L_K^{−T} d_K.    (3.67c)

The forwards pass maps {q_{k−1}, I_{k−1}} to the same pair at the next time, {q_k, I_k}. The backwards pass maps x̂_k to the same quantity at the previous timestep, x̂_{k−1}. In the process, we solve for all the blocks of L and d. The only linear algebra operations required to implement this smoother are Cholesky decomposition, multiplication, addition, and solving a linear system via forward/backward substitution.
As we will see in the next section, these six recursive equations are algebraically equivalent to the canonical Rauch-Tung-Striebel smoother; the five equations forming the forwards pass are algebraically equivalent to the famous Kalman filter.

(Herbert E. Rauch (1935-2011) was a pioneer in the area of control and estimation. Frank F. Tung (1933-2006) was a research scientist working in the area of computing and control. Charlotte T. Striebel (1929-2014) was a statistician and professor of mathematics. All three co-developed the Rauch-Tung-Striebel smoother while working at Lockheed Missiles and Space Company in order to estimate spacecraft trajectories.)

3.2.3 Rauch-Tung-Striebel Smoother

While the Cholesky smoother is a convenient implementation and is easy to understand when starting from the batch solution, it does not represent the canonical form of the smoothing equations. It is, however, algebraically equivalent to the canonical Rauch-Tung-Striebel (RTS) smoother, which we now show. This requires several uses of the different forms of the SMW identity in (2.68).
We begin by working on the forwards pass. Solving for Lk,k−1 in (3.66c)
K_k = ( P̌_{k,f}^{−1} + C_k^T R_k^{−1} C_k )^{−1} C_k^T R_k^{−1}
    = P̌_{k,f} C_k^T ( C_k P̌_{k,f} C_k^T + R_k )^{−1},    (3.71)

where the last expression requires a use of the SMW identity from (2.68). Then (3.69b) can be rewritten as

P̌_{k,f}^{−1} = P̂_{k,f}^{−1} − C_k^T R_k^{−1} C_k = P̂_{k,f}^{−1} ( 1 − P̂_{k,f} C_k^T R_k^{−1} C_k ) = P̂_{k,f}^{−1} ( 1 − K_k C_k ),    (3.72)

where we have used the definition of K_k. Substituting (3.66a) into L_{k,k−1} d_{k−1} and then this into (3.66e), we have

q_k = Q_k^{−1} A_{k−1} ( I_{k−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} q_{k−1}
      + ( Q_k^{−1} − Q_k^{−1} A_{k−1} ( I_{k−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} A_{k−1}^T Q_k^{−1} ) v_k
      + C_k^T R_k^{−1} y_k,    (3.75)

where we have used two versions of the SMW identity from (2.68): the coefficient of q_{k−1} equals ( A_{k−1} I_{k−1}^{−1} A_{k−1}^T + Q_k )^{−1} A_{k−1} I_{k−1}^{−1}, and the coefficient of v_k equals ( A_{k−1} I_{k−1}^{−1} A_{k−1}^T + Q_k )^{−1}. By letting P̂_{k,f}^{−1} x̂_{k,f} = q_k, this can be written in two steps as
or

x̂_{k,f} = x̌_{k,f} + K_k ( y_k − C_k x̌_{k,f} ),    (3.78)

forwards (k = 1 . . . K):

P̌_{k,f} = A_{k−1} P̂_{k−1,f} A_{k−1}^T + Q_k,    (3.82a)
x̌_{k,f} = A_{k−1} x̂_{k−1,f} + v_k,    (3.82b)
K_k = P̌_{k,f} C_k^T ( C_k P̌_{k,f} C_k^T + R_k )^{−1},    (3.82c)
P̂_{k,f} = ( 1 − K_k C_k ) P̌_{k,f},    (3.82d)
x̂_{k,f} = x̌_{k,f} + K_k ( y_k − C_k x̌_{k,f} ),    (3.82e)

backwards (k = K . . . 1):

x̂_{k−1} = x̂_{k−1,f} + P̂_{k−1,f} A_{k−1}^T P̌_{k,f}^{−1} ( x̂_k − x̌_{k,f} ),    (3.82f)
As will be discussed in more detail in the next section, the five equations
in the forwards pass are known as the Kalman filter. However, the
important message to take away from this section on smoothing is that
these six equations representing the RTS smoother can be used to solve
the original batch problem that we set up in a very efficient manner,
with no approximation. This is possible precisely because of the block-
tridiagonal sparsity pattern in the left-hand side of the batch problem.
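The six recursions in (3.82) translate almost line-for-line into code. The sketch below is my own (the scalar system, noise values, and initialization of the k = 0 estimate are arbitrary assumptions): it runs the forward Kalman-filter pass and then the backward smoothing pass.

```python
import numpy as np

def rts_smoother(x0_check, P0_check, A, C, Q, R, v, y):
    """Forward/backward RTS smoother per (3.82) for a scalar LG system."""
    K = len(v)                                   # timesteps 1..K (k = 0 carries the prior)
    xf = np.zeros(K + 1); Pf = np.zeros(K + 1)   # filtered (forward) estimates
    xp = np.zeros(K + 1); Pp = np.zeros(K + 1)   # predicted estimates
    # Correct the prior at k = 0 with the first measurement.
    Pp[0], xp[0] = P0_check, x0_check
    G0 = Pp[0] * C / (C * Pp[0] * C + R)
    Pf[0] = (1 - G0 * C) * Pp[0]
    xf[0] = xp[0] + G0 * (y[0] - C * xp[0])
    # Forward pass (3.82a)-(3.82e).
    for k in range(1, K + 1):
        Pp[k] = A * Pf[k - 1] * A + Q
        xp[k] = A * xf[k - 1] + v[k - 1]
        Gk = Pp[k] * C / (C * Pp[k] * C + R)
        Pf[k] = (1 - Gk * C) * Pp[k]
        xf[k] = xp[k] + Gk * (y[k] - C * xp[k])
    # Backward pass (3.82f).
    xs = xf.copy()
    for k in range(K, 0, -1):
        xs[k - 1] = xf[k - 1] + Pf[k - 1] * A / Pp[k] * (xs[k] - xp[k])
    return xs

# Assumed scalar example.
rng = np.random.default_rng(4)
A, C, Q, R = 1.0, 1.0, 0.1, 0.5
v = 0.1 * np.ones(10)
x_true = np.zeros(11)
for k in range(10):
    x_true[k + 1] = A * x_true[k] + v[k] + rng.normal(0, np.sqrt(Q))
y = C * x_true + rng.normal(0, np.sqrt(R), 11)
print(rts_smoother(0.0, 1.0, A, C, Q, R, v, y))
```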
p(xk |v, y) = p(xk |x̌0 , v1:k , y0:k ) p(xk |vk+1:K , yk+1:K ). (3.86)
Thus, it should be possible to take our batch solution and factor it into
the product of two Gaussian PDFs. To do this, we need to exploit the
sparse structure of H in (3.3.1).
We begin by partitioning H into 12 blocks (only 6 of which are non-zero):

H = [ H₁₁   0     0   ]   ← information from 0 . . . k−1
    [ H₂₁   H₂₂   0   ]   ← information from k
    [ 0     H₃₂   H₃₃ ]   ← information from k+1
    [ 0     0     H₄₃ ]   ← information from k+2 . . . K,    (3.87)

where the block-columns correspond to the states from 0 . . . k−1, from k, and from k+1 . . . K, respectively. For example, with k = 2 and K = 4, the partitions are

H = [  1                  ]
    [  C₀                 ]
    [ −A₀  1              ]
    [      C₁             ]
    [     −A₁  1          ]
    [          C₂         ]
    [         −A₂  1      ]
    [              C₃     ]
    [             −A₃  1  ]
    [                  C₄ ].    (3.88)

With W partitioned accordingly as diag(W₁, W₂, W₃, W₄), we have

H^T W^{−1} H =
[ H₁₁^T W₁^{−1} H₁₁ + H₂₁^T W₂^{−1} H₂₁    H₂₁^T W₂^{−1} H₂₂                           0                  ]
[ H₂₂^T W₂^{−1} H₂₁                        H₂₂^T W₂^{−1} H₂₂ + H₃₂^T W₃^{−1} H₃₂       H₃₂^T W₃^{−1} H₃₃   ]
[ 0                                        H₃₃^T W₃^{−1} H₃₂                           H₃₃^T W₃^{−1} H₃₃ + H₄₃^T W₄^{−1} H₄₃ ].
Our overall batch system of equations now looks like the following:

[ L₁₁     L₁₂     0     ] [ x̂_{0:k−1} ]   [ r₁ ]
[ L₁₂^T   L₂₂     L₃₂^T ] [ x̂_k       ] = [ r₂ ],    (3.93)
[ 0       L₃₂     L₃₃   ] [ x̂_{k+1:K} ]   [ r₃ ]

where we have added the (·)ˆ to indicate this is the solution to the optimization estimation problem considered earlier. Our short-term goal, in making progress towards a recursive LG estimator, is to solve for x̂_k. To isolate x̂_k, we left-multiply both sides of (3.93) by

[ 1                  0     0                 ]
[ −L₁₂^T L₁₁^{−1}    1     −L₃₂^T L₃₃^{−1}   ]
[ 0                  0     1                 ],    (3.94)
which is an elementary row operation and does not alter the solution. The resulting system of equations is

[ L₁₁     L₁₂                                          0     ] [ x̂_{0:k−1} ]   [ r₁                                           ]
[ 0       L₂₂ − L₁₂^T L₁₁^{−1} L₁₂ − L₃₂^T L₃₃^{−1} L₃₂  0     ] [ x̂_k       ] = [ r₂ − L₁₂^T L₁₁^{−1} r₁ − L₃₂^T L₃₃^{−1} r₃  ],    (3.95)
[ 0       L₃₂                                          L₃₃   ] [ x̂_{k+1:K} ]   [ r₃                                           ]

so that the middle block-row gives

P̂_k^{−1} x̂_k = q_k,    (3.96)

where we have defined P̂_k (by its inverse) as well as q_k. We have essentially marginalized out x̂_{0:k−1} and x̂_{k+1:K} just as in (3.85). We can now substitute the values of the L_ij blocks back into P̂_k^{−1} to see that

P̂_k^{−1} = L₂₂ − L₁₂^T L₁₁^{−1} L₁₂ − L₃₂^T L₃₃^{−1} L₃₂
  = H₂₂^T ( W₂^{−1} − W₂^{−1} H₂₁ ( H₁₁^T W₁^{−1} H₁₁ + H₂₁^T W₂^{−1} H₂₁ )^{−1} H₂₁^T W₂^{−1} ) H₂₂
    + H₃₂^T ( W₃^{−1} − W₃^{−1} H₃₃ ( H₃₃^T W₃^{−1} H₃₃ + H₄₃^T W₄^{−1} H₄₃ )^{−1} H₃₃^T W₃^{−1} ) H₃₂
  = H₂₂^T P̂_{k,f}^{−1} H₂₂ + H₃₂^T P̂_{k,b}^{−1} H₃₂,    (3.97)

where, by (2.68), the bracketed 'forwards' and 'backwards' factors are

P̂_{k,f}^{−1} = ( W₂ + H₂₁ ( H₁₁^T W₁^{−1} H₁₁ )^{−1} H₂₁^T )^{−1},
P̂_{k,b}^{−1} = ( W₃ + H₃₃ ( H₄₃^T W₄^{−1} H₄₃ )^{−1} H₃₃^T )^{−1}.

Similarly,

q_k = r₂ − L₁₂^T L₁₁^{−1} r₁ − L₃₂^T L₃₃^{−1} r₃
    = H₂₂^T q_{k,f} + H₃₂^T q_{k,b},    (3.98)

with corresponding forwards and backwards parts.
P̂_{k,f}^{−1} x̂_{k,f} = q_{k,f},    (3.100a)
P̂_{k,b}^{−1} x̂_{k,b} = q_{k,b},    (3.100b)

where x̂_{k,f} depends only on quantities up to time k and x̂_{k,b} depends only on quantities from time k + 1 to K. Under these definitions we have that

P̂_k^{−1} = H₂₂^T P̂_{k,f}^{−1} H₂₂ + H₃₂^T P̂_{k,b}^{−1} H₃₂,    (3.101)
x̂_k = P̂_k ( H₂₂^T P̂_{k,f}^{−1} x̂_{k,f} + H₃₂^T P̂_{k,b}^{−1} x̂_{k,b} ),    (3.102)
where P̂k , P̂k,f , and P̂k,b are the covariances associated with x̂k , x̂k,f ,
and x̂k,b . In other words we have Gaussian estimators with the MAP
estimators as the means.
In the next section, we will examine how we can turn the forwards
Gaussian estimator, x̂k,f , into a recursive filter16 .
Recall that these estimates are based on all the data up to and including those at time k − 1. Our goal will be to compute

{ x̂_k, P̂_k },    (3.105)

using all the data up to and including those at time k. It turns out we do not need to start all over again, but rather can simply incorporate the new data at time k, v_k and y_k, into the estimate at time k − 1:

{ x̂_{k−1}, P̂_{k−1}, v_k, y_k } ↦ { x̂_k, P̂_k }.    (3.106)
We do not really care what x̂′_{k−1} is in this context, because we seek a recursive estimator appropriate to online estimation, and this quantity incorporates future data; we can marginalize it out by left-multiplying both sides of (3.110) by

[ 1                                                                    0 ]
[ Q_k^{−1} A_{k−1} ( P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1}   1 ],    (3.111)

which is just an elementary row operation and will not alter the solution to the linear system of equations¹⁸. Equation (3.110) then becomes

[ P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1}     −A_{k−1}^T Q_k^{−1}                                                                                   ] [ x̂′_{k−1} ]
[ 0     Q_k^{−1} − Q_k^{−1} A_{k−1} ( P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} A_{k−1}^T Q_k^{−1} + C_k^T R_k^{−1} C_k                         ] [ x̂_k      ]
  = [ P̂_{k−1}^{−1} x̂_{k−1} − A_{k−1}^T Q_k^{−1} v_k                                                                                                   ]
    [ Q_k^{−1} A_{k−1} ( P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} ( P̂_{k−1}^{−1} x̂_{k−1} − A_{k−1}^T Q_k^{−1} v_k ) + Q_k^{−1} v_k + C_k^T R_k^{−1} y_k ].    (3.112)

The second block-row no longer involves x̂′_{k−1}:

( Q_k^{−1} − Q_k^{−1} A_{k−1} ( P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} A_{k−1}^T Q_k^{−1} + C_k^T R_k^{−1} C_k ) x̂_k
  = Q_k^{−1} A_{k−1} ( P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} ( P̂_{k−1}^{−1} x̂_{k−1} − A_{k−1}^T Q_k^{−1} v_k )
    + Q_k^{−1} v_k + C_k^T R_k^{−1} y_k.    (3.113)

Applying the SMW identity (2.68) to both sides, this simplifies to

( P̌_k^{−1} + C_k^T R_k^{−1} C_k ) x̂_k = P̌_k^{−1} ( A_{k−1} x̂_{k−1} + v_k ) + C_k^T R_k^{−1} y_k,    (3.115)

where we have defined x̌_k = A_{k−1} x̂_{k−1} + v_k as the 'predicted' value of the state. We also made use of the following logic in simplifying the above:

Q_k^{−1} A_{k−1} ( P̂_{k−1}^{−1} + A_{k−1}^T Q_k^{−1} A_{k−1} )^{−1} P̂_{k−1}^{−1}
  = Q_k^{−1} A_{k−1} ( P̂_{k−1} − P̂_{k−1} A_{k−1}^T ( Q_k + A_{k−1} P̂_{k−1} A_{k−1}^T )^{−1} A_{k−1} P̂_{k−1} ) P̂_{k−1}^{−1}      (by (2.68); the inner inverse is P̌_k^{−1})
  = ( Q_k^{−1} − Q_k^{−1} A_{k−1} P̂_{k−1} A_{k−1}^T P̌_k^{−1} ) A_{k−1}      (and A_{k−1} P̂_{k−1} A_{k−1}^T = P̌_k − Q_k)
  = ( Q_k^{−1} − Q_k^{−1} ( P̌_k − Q_k ) P̌_k^{−1} ) A_{k−1}
  = P̌_k^{−1} A_{k−1}.    (3.116)
Bringing together all of the above, we have for the recursive filter update the following:

predictor:   P̌_k = A_{k−1} P̂_{k−1} A_{k−1}^T + Q_k,    (3.117a)
             x̌_k = A_{k−1} x̂_{k−1} + v_k,    (3.117b)

corrector:   P̂_k^{−1} = P̌_k^{−1} + C_k^T R_k^{−1} C_k,    (3.117c)
             P̂_k^{−1} x̂_k = P̌_k^{−1} x̌_k + C_k^T R_k^{−1} y_k,    (3.117d)
which we will refer to as inverse covariance or information form for the Kalman filter. Figure 3.5 depicts the predictor-corrector form of the Kalman filter graphically.

Figure 3.5: The Kalman filter works in two steps: prediction then correction. The prediction step propagates the old estimate, N(x̂_{k−1}, P̂_{k−1}), forward in time using the motion model and latest input, N(v_k, Q_k), to arrive at the prediction, N(x̌_k, P̌_k); the estimate becomes more uncertain. The correction step fuses the prediction with the latest measurement, N(y_k, R_k), to arrive at the new estimate, N(x̂_k, P̂_k); this step is carried out using a direct product of Gaussians (clear from the inverse covariance version of the KF), and the estimate becomes more certain.

To get to canonical form, we manipulate these equations slightly. Begin by defining the Kalman gain, K_k, as

K_k = P̂_k C_k^T R_k^{−1}.    (3.118)

We then manipulate:

1 = P̂_k ( P̌_k^{−1} + C_k^T R_k^{−1} C_k ) = P̂_k P̌_k^{−1} + K_k C_k,    (3.119a)
P̂_k = ( 1 − K_k C_k ) P̌_k,    (3.119b)
P̂_k C_k^T R_k^{−1} = ( 1 − K_k C_k ) P̌_k C_k^T R_k^{−1}      (the left-hand side is K_k),    (3.119c)
K_k ( 1 + C_k P̌_k C_k^T R_k^{−1} ) = P̌_k C_k^T R_k^{−1}.    (3.119d)
Solving for K_k in this last expression, we can rewrite the recursive filter equations as

predictor:     P̌_k = A_{k−1} P̂_{k−1} A_{k−1}^T + Q_k,    (3.120a)
               x̌_k = A_{k−1} x̂_{k−1} + v_k,    (3.120b)
Kalman gain:   K_k = P̌_k C_k^T ( C_k P̌_k C_k^T + R_k )^{−1},    (3.120c)
corrector:     P̂_k = ( 1 − K_k C_k ) P̌_k,    (3.120d)
               x̂_k = x̌_k + K_k ( y_k − C_k x̌_k ),    (3.120e)

where the term y_k − C_k x̌_k is the 'innovation'; it is the difference between the actual and expected measurements. The role of the Kalman gain is to properly weight the innovation's contribution to the estimate (in comparison to the prediction). In this form, these five equations (and their extension to nonlinear systems) have been the workhorse of estimation since Kalman's initial paper (Kalman, 1960b).
p(x_{k−1} | x̌₀, v_{1:k−1}, y_{0:k−1}) = N( x̂_{k−1}, P̂_{k−1} ).    (3.121)
These are identical to the prediction equations from the previous sec-
tion. These last two expressions can be found by exactly passing the
prior at k − 1 through the linear motion model. For the mean we have
Next, for the correction step, we express the joint density of our state
19 In the next chapter, we will generalize this section to present the Bayes filter, which
can handle non-Gaussian PDFs as well as nonlinear motion and observation models.
We can think of this section as a special case of the Bayes filter, one that requires no
approximations to be made.
where we have defined x̂_k as the mean and P̂_k as the covariance. Substituting in the moments from above, we have

K_k = P̌_k C_k^T ( C_k P̌_k C_k^T + R_k )^{−1},    (3.128a)
P̂_k = ( 1 − K_k C_k ) P̌_k,    (3.128b)
x̂_k = x̌_k + K_k ( y_k − C_k x̌_k ),    (3.128c)
which are identical to the correction equations from the previous section
on MAP. Again, this is because the motion and measurement models
are linear and the noises and prior are Gaussian. Under these condi-
tions, the posterior density is exactly Gaussian. Thus, the mean and
mode of the posterior are one and the same. This property does not
hold if we switch to a nonlinear measurement model, which we discuss
in the next chapter.
then we have

E[ê_k ê_k^T] = ( 1 − K_k C_k ) P̌_k ( 1 − K_k C_k )^T + K_k R_k K_k^T.    (3.131)

We then define a cost function of the form

J(K_k) = (1/2) tr E[ê_k ê_k^T] = (1/2) E[ê_k^T ê_k],    (3.132)

which quantifies (in some sense) the magnitude of the covariance of ê_k. We can minimize this cost directly with respect to K_k, to generate the 'minimum variance' estimate. We will make use of the identities

∂ tr(XY) / ∂X ≡ Y^T,    ∂ tr(XZX^T) / ∂X ≡ 2XZ,    (3.133)

where Z is symmetric. Then we have

∂J(K_k)/∂K_k = −( 1 − K_k C_k ) P̌_k C_k^T + K_k R_k.    (3.134a)

Setting this to zero and solving for K_k, we have

K_k = P̌_k C_k^T ( C_k P̌_k C_k^T + R_k )^{−1},    (3.135)

which is our usual expression for the Kalman gain.
Using (3.1) and (3.120), we can then write out the 'error dynamics', where we note that ê₀ = x̂₀ − x₀. From this system it is not hard to see that E[ê_k] = 0 for k > 0 so long as E[ê₀] = 0. This means our estimator is unbiased; we can use proof by induction, since the claim is true for k = 0 by assertion and, assuming it holds at k − 1, it follows at k. Similarly,

E[ě_k ě_k^T] = P̌_k,    (3.140a)
E[ê_k ê_k^T] = P̂_k,    (3.140b)

for k > 0 so long as E[ê₀ ê₀^T] = P̂₀. This means our estimator is consistent. We again use proof by induction. It is true for k = 0 by assertion. Assume E[ê_{k−1} ê_{k−1}^T] = P̂_{k−1}. Then

E[ě_k ě_k^T] = E[ ( A_{k−1} ê_{k−1} − w_k ) ( A_{k−1} ê_{k−1} − w_k )^T ]
  = A_{k−1} E[ê_{k−1} ê_{k−1}^T] A_{k−1}^T − A_{k−1} E[ê_{k−1} w_k^T] − E[w_k ê_{k−1}^T] A_{k−1}^T + E[w_k w_k^T]
  = A_{k−1} P̂_{k−1} A_{k−1}^T + Q_k      (the cross terms vanish by independence)
  = P̌_k,    (3.141)
and

E[ê_k ê_k^T] = E[ ( (1 − K_k C_k) ě_k + K_k n_k ) ( (1 − K_k C_k) ě_k + K_k n_k )^T ]
  = (1 − K_k C_k) E[ě_k ě_k^T] (1 − K_k C_k)^T + (1 − K_k C_k) E[ě_k n_k^T] K_k^T + K_k E[n_k ě_k^T] (1 − K_k C_k)^T + K_k E[n_k n_k^T] K_k^T
  = (1 − K_k C_k) P̌_k (1 − K_k C_k)^T + K_k R_k K_k^T      (the cross terms vanish by independence)
  = P̂_k.    (3.142)
It is therefore true for all k. This means that the true uncertainty in the
system (i.e., the covariance of the error, E [êk êTk ]) is perfectly modeled
by our estimate of the covariance, P̂k . In this sense, the Kalman filter
is an optimal filter. This is why it is sometimes referred to as best
linear unbiased estimate (BLUE). Yet another way of saying this is
that the covariance of the Kalman filter is right at the Cramér-Rao
Lower Bound; we cannot be any more certain in our estimate given the
uncertainty in the measurements we have used in that estimate.
A final important point to make is that the expectations we have
employed in this section are over the possible outcomes of the random
variables. They are not time averages. If we ran an infinite number of
trials and averaged over the trials (i.e., an ensemble average), then we
should see an average performance of zero error (i.e., an unbiased esti-
mator). This does not imply that within a single trial (i.e., a realization)
that the error will be zero or decay to zero over time.
which means that we must have |λ| < 1, and thus the steady-
state error dynamics are stable. Technically the right-hand side
could be zero, but after N repetitions of this process, we build
up a copy of the observability Grammian on the right-hand side
making it invertible (if the system is observable).
Figure 3.6: State estimation with a continuous-time prior, x(t) ∼ GP(x̌(t), P̌(t, t′)), can be viewed as one-dimensional Gaussian process regression with time as the independent variable. We have data about the trajectory at a number of asynchronous measurement times, t₀ < t₁ < . . . < t_K, and would like to query the state, x(τ), at some other time of interest, τ.

3.4 Batch Continuous-Time Estimation

In this section, we circle back to consider a more general problem than the discrete-time setup in the earlier part of this chapter. In particular, we consider what happens when we choose to use a continuous-time motion model as the prior. We approach the problem from a Gaussian process regression perspective (Rasmussen and Williams, 2006). We show that for linear-Gaussian systems, the discrete-time formulation is implementing the continuous-time one exactly, under certain special conditions (Tong et al., 2013; Barfoot et al., 2014).
(τ₀ < τ₁ < . . . < τ_J) that may or may not be different from the measurement times (t₀ < t₁ < . . . < t_K). Figure 3.6 depicts our problem setup. The joint likelihood between the state (at the query times) and the measurements (at the measurement times) is written as

p( [ x_τ ; y ] ) = N( [ x̌_τ ; C x̌ ],  [ P̌_{ττ}   P̌_τ C^T ; C P̌_τ^T   R + C P̌ C^T ] ),    (3.149)

where

x = [ x(t₀) ; . . . ; x(t_K) ],    x̌ = [ x̌(t₀) ; . . . ; x̌(t_K) ],    x_τ = [ x(τ₀) ; . . . ; x(τ_J) ],    x̌_τ = [ x̌(τ₀) ; . . . ; x̌(τ_J) ],
y = [ y₀ ; . . . ; y_K ],    C = diag( C₀, . . . , C_K ),    R = diag( R₀, . . . , R_K ),
P̌ = [ P̌(t_i, t_j) ]_{ij},    P̌_τ = [ P̌(τ_i, t_j) ]_{ij},    P̌_{ττ} = [ P̌(τ_i, τ_j) ]_{ij}.
for the likelihood of the predicted state at the query times, given the
measurements.
The expression simplifies further if we take the query times to be
exactly the same as the measurement times (i.e., τk = tk , K = J). This
implies that
P̌ = P̌τ = P̌τ τ , (3.151)
this as

p(x|y) = N( ( P̌^{−1} + C^T R^{−1} C )^{−1} ( P̌^{−1} x̌ + C^T R^{−1} y ),  ( P̌^{−1} + C^T R^{−1} C )^{−1} ),    (3.153)

where the first argument is the mean, x̂, and the second is the covariance, P̂,
to produce (3.156).
In general, this GP approach has complexity O(K 3 + K 2 J) since the
initial solve is O(K 3 ) and the query is O(K 2 J); this is quite expensive
and next we will seek to improve the cost by exploiting the structure
of the matrices involved.
Mean Function
For the mean function we have

E[x(t)] = Φ(t, t₀) E[x(t₀)] + ∫_{t₀}^t Φ(t, s) ( v(s) + L(s) E[w(s)] ) ds,    (3.165)

where E[x(t₀)] = x̌₀ is the initial value of the mean at t₀ and E[w(s)] = 0. Thus, the mean function is

x̌(t) = Φ(t, t₀) x̌₀ + ∫_{t₀}^t Φ(t, s) v(s) ds,    (3.166)

where

v_k = ∫_{t_{k−1}}^{t_k} Φ(t_k, s) v(s) ds,    k = 1 . . . K.    (3.168)
x̌ = A v,    (3.169)

where

x̌ = [ x̌(t₀) ; x̌(t₁) ; . . . ; x̌(t_K) ],    v = [ x̌₀ ; v₁ ; . . . ; v_K ],

A = [ 1                                                                       ]
    [ Φ(t₁, t₀)       1                                                       ]
    [ Φ(t₂, t₀)       Φ(t₂, t₁)       1                                       ]
    [     ⋮               ⋮            ⋱      ⋱                               ]
    [ Φ(t_{K−1}, t₀)  Φ(t_{K−1}, t₁)  Φ(t_{K−1}, t₂)   · · ·   1              ]
    [ Φ(t_K, t₀)      Φ(t_K, t₁)      Φ(t_K, t₂)   · · ·   Φ(t_K, t_{K−1})   1 ].    (3.170)
and

B_k = ∫_{t_{k−1}}^{t_k} Φ(t_k, s) B(s) ds,    k = 1 . . . K,    (3.172)

and

x̌ = A B u,    (3.174)
Covariance Function
For the covariance function we have

P̌(t, t′) = E[ (x(t) − E[x(t)]) (x(t′) − E[x(t′)])^T ]
  = Φ(t, t₀) E[ (x(t₀) − E[x(t₀)]) (x(t₀) − E[x(t₀)])^T ] Φ(t′, t₀)^T
    + ∫_{t₀}^t ∫_{t₀}^{t′} Φ(t, s) L(s) E[ w(s) w(s′)^T ] L(s′)^T Φ(t′, s′)^T ds′ ds,    (3.175)

where the first expectation is P̌₀, the initial covariance at t₀, E[ w(s) w(s′)^T ] = Q δ(s − s′), and we have made the assumption that E[ x(t₀) w(t)^T ] = 0. Putting this together we have the following expression for the covariance:
where H(·) is the Heaviside step function. There are now three cases to worry about: t < t′, t = t′, and t > t′. In the first case, the upper integration limit terminates the integration, while in the last the Heaviside step function does the same job. The result is that the second term in the covariance function can be written as

∫_{t₀}^{min(t,t′)} Φ(t, s) L(s) Q L(s)^T Φ(t′, s)^T ds
  = Φ(t, t′) ∫_{t₀}^{t′} Φ(t′, s) L(s) Q L(s)^T Φ(t′, s)^T ds                     for t′ < t,
  = ∫_{t₀}^{t} Φ(t, s) L(s) Q L(s)^T Φ(t, s)^T ds                                for t = t′,
  = ( ∫_{t₀}^{t} Φ(t, s) L(s) Q L(s)^T Φ(t, s)^T ds ) Φ(t′, t)^T                 for t < t′.    (3.178)
where

Q_k = ∫_{t_{k−1}}^{t_k} Φ(t_k, s) L(s) Q L(s)^T Φ(t_k, s)^T ds,    k = 1 . . . K,    (3.180)

where

A^{−1} = [  1                                          ]
         [ −Φ(t₁, t₀)     1                            ]
         [                −Φ(t₂, t₁)    1              ]
         [                              −Φ(t₃, t₂)  ⋱  ]
         [                                     ⋱     1 ]
         [                              −Φ(t_K, t_{K−1})   1 ].    (3.186)
Since A−1 has only the main diagonal and the one below non-zero,
and Q−1 is block-diagonal, the block-tridiagonal structure of P̌−1 can
be verified by carrying out the multiplication. This is precisely the
structure we had at the start of the chapter for the batch discrete-time
case.
Summary of Prior
We can write our final GP for x(t) as

x(t) ∼ GP( Φ(t, t₀) x̌₀ + ∫_{t₀}^t Φ(t, s) v(s) ds,
            Φ(t, t₀) P̌₀ Φ(t′, t₀)^T + ∫_{t₀}^{min(t,t′)} Φ(t, s) L(s) Q L(s)^T Φ(t′, s)^T ds ),    (3.187)

where the first argument is the mean function, x̌(t), and the second is the covariance function, P̌(t, t′). At the measurement times, t₀ < t₁ < · · · < t_K, we can also then write

x ∼ N( x̌, P̌ ) = N( A v, A Q A^T ),    (3.188)

and we can further substitute v = Bu in the case that the inputs are constant between measurement times.
Querying the GP
As discussed above, if we solve for the trajectory at the measurement
times, we may want to query it at other times of interest as well. This
can be done through the GP linear interpolation equations in (3.156).
Without loss of generality, we consider a single query time, tk ≤ τ <
tk+1 and so in this case we write
Since Q^{−1} is block-diagonal and A^{−1} has only the main diagonal and the one below it non-zero, we can evaluate the product very efficiently. Working it out, we have

P̌(τ) P̌^{−1} = [ 0   · · ·   0   Λ(τ)   Ψ(τ)   0   · · ·   0 ],    (3.198)

where

Λ(τ) = Φ(τ, t_k) − Q_τ Φ(t_{k+1}, τ)^T Q_{k+1}^{−1} Φ(t_{k+1}, t_k)      (block column k),
Ψ(τ) = Q_τ Φ(t_{k+1}, τ)^T Q_{k+1}^{−1}      (block column k + 1),
which has exactly two non-zero block columns. Inserting this into (3.189),
we have
x̂(τ) = x̌(τ) + [ Λ(τ)   Ψ(τ) ] ( [ x̂_k ; x̂_{k+1} ] − [ x̌(t_k) ; x̌(t_{k+1}) ] ),    (3.199a)

P̂(τ, τ) = P̌(τ, τ) + [ Λ(τ)   Ψ(τ) ] ( [ P̂_{k,k}   P̂_{k,k+1} ; P̂_{k+1,k}   P̂_{k+1,k+1} ]
            − [ P̌(t_k, t_k)   P̌(t_k, t_{k+1}) ; P̌(t_{k+1}, t_k)   P̌(t_{k+1}, t_{k+1}) ] ) [ Λ(τ)^T ; Ψ(τ)^T ],    (3.199b)
which is a simple combination of just the two terms from tk and tk+1 .
Thus, to query the trajectory at a single time of interest is O(1) com-
plexity.
which we note depends only on the difference of the two times (i.e., it
is stationary). We can therefore write,
to simplify matters.
21 We use italicized symbols for the time-invariant system matrices to avoid confusion
with the lifted-form quantities.
Mean Function
For the mean function we have the following simplification:
v_k = ∫_0^{Δt_{k:k−1}} exp( A (Δt_{k:k−1} − s) ) B u(s) ds,    k = 1 . . . K.    (3.209)

If we assume that u(t) is constant between each pair of consecutive measurement times, we can further simplify the expression. Let u_k be the constant input when t ∈ (t_{k−1}, t_k]. Then we can define

B = diag( 1, B₁, . . . , B_K ),    u = [ x̌₀ ; u₁ ; . . . ; u_K ],    (3.210)

and

B_k = ( ∫_0^{Δt_{k:k−1}} exp( A (Δt_{k:k−1} − s) ) ds ) B
    = Φ(t_k, t_{k−1}) ( 1 − Φ(t_k, t_{k−1})^{−1} ) A^{−1} B,    k = 1 . . . K.    (3.211)
x̌ = ABu, (3.212)
Covariance Function
For the covariance function, we have the following simplification:

Q_k = ∫_0^{Δt_{k:k−1}} exp( A (Δt_{k:k−1} − s) ) L Q L^T exp( A (Δt_{k:k−1} − s) )^T ds,    (3.213)

for k = 1 . . . K. This is relatively straightforward to evaluate, particularly if A is nilpotent. Letting

Q = diag( P̌₀, Q₁, Q₂, . . . , Q_K ),    (3.214)

we then have

P̌ = A Q A^T,    (3.215)
Querying the GP
To query the GP we need the following quantities:

Φ(t_{k+1}, τ) = exp( A Δt_{k+1:τ} ),    Δt_{k+1:τ} = t_{k+1} − τ,    (3.216)
Φ(τ, t_k) = exp( A Δt_{τ:k} ),    Δt_{τ:k} = τ − t_k,    (3.217)
Q_τ = ∫_0^{Δt_{τ:k}} exp( A (Δt_{τ:k} − s) ) L Q L^T exp( A (Δt_{τ:k} − s) )^T ds.    (3.218)

Our interpolation equation is still

x̂(τ) = x̌(τ) + ( Φ(τ, t_k) − Q_τ Φ(t_{k+1}, τ)^T Q_{k+1}^{−1} Φ(t_{k+1}, t_k) ) ( x̂_k − x̌_k )
               + Q_τ Φ(t_{k+1}, τ)^T Q_{k+1}^{−1} ( x̂_{k+1} − x̌_{k+1} ),    (3.219)
which is a linear combination of just the two terms from t_k and t_{k+1}.

Example 3.2  Consider the case,

p̈(t) = w(t),    (3.220)

where p(t) corresponds to position and

w(t) ∼ GP( 0, Q δ(t − t′) )    (3.221)

is white noise as before. This corresponds to white noise on acceleration (i.e., the 'constant velocity' model). We can cast this in the form

ẋ(t) = A x(t) + B u(t) + L w(t),    (3.222)

by taking

x(t) = [ p(t) ; ṗ(t) ],    A = [ 0   1 ; 0   0 ],    B = 0,    L = [ 0 ; 1 ].    (3.223)

In this case we have

exp(A Δt) = 1 + A Δt + (1/2) A² Δt² + · · · = 1 + A Δt = [ 1   Δt 1 ; 0   1 ],    (3.224)

since A is nilpotent (A² = 0). Therefore, we have

Φ(t_k, t_{k−1}) = [ 1   Δt_{k:k−1} 1 ; 0   1 ].    (3.225)

For the Q_k we have

Q_k = ∫_0^{Δt_{k:k−1}} [ 1   (Δt_{k:k−1} − s) 1 ; 0   1 ] [ 0 ; 1 ] Q [ 0   1 ] [ 1   0 ; (Δt_{k:k−1} − s) 1   1 ] ds
    = ∫_0^{Δt_{k:k−1}} [ (Δt_{k:k−1} − s)² Q   (Δt_{k:k−1} − s) Q ; (Δt_{k:k−1} − s) Q   Q ] ds
    = [ (1/3) Δt_{k:k−1}³ Q   (1/2) Δt_{k:k−1}² Q ; (1/2) Δt_{k:k−1}² Q   Δt_{k:k−1} Q ],    (3.226)
which we note is positive-definite even though L Q L^T is not. The inverse is

Q_k^{−1} = [ 12 Δt_{k:k−1}^{−3} Q^{−1}   −6 Δt_{k:k−1}^{−2} Q^{−1} ;  −6 Δt_{k:k−1}^{−2} Q^{−1}   4 Δt_{k:k−1}^{−1} Q^{−1} ],    (3.227)
are the Hermite basis functions. The bottom row (corresponding to ve-
locity) is only quadratic in α, and the basis functions are the derivatives
of the ones used to interpolate position. It is very important to note
that this Hermite interpolation scheme arises automatically from using
the GP regression approach and our choice of prior motion model. At
implementation, we may work directly with the general matrix equa-
tions and avoid working out the details of the resulting interpolation
scheme.
It is also easy to verify that when α = 0 we have
x̂τ = x̌τ + (x̂k − x̌k ), (3.235)
and when α = 1 we have
x̂τ = x̌τ + (x̂k+1 − x̌k+1 ), (3.236)
which seem to be sensible boundary conditions.
3.5 Summary
The main take-away points from this chapter are:
1. When the motion and observation models are linear, and the mea-
surement and process noises are zero-mean Gaussian, the batch and
recursive solutions to state estimation are straightforward, requiring
no approximation.
2. The Bayesian posterior of a linear-Gaussian estimation problem is
exactly Gaussian. This implies that the MAP solution is the same
as the mean of the full Bayesian solution, since the mode and the
mean of a Gaussian are one and the same.
3. The batch, discrete-time, linear-Gaussian solution can exactly imple-
ment (at the measurements times) the case where a continuous-time
motion model is employed; appropriate prior terms must be used for
this to be true.
The next chapter will investigate what happens when the motion and
observation models are nonlinear.
3.6 Exercises
3.6.1 Consider the discrete-time system,
which could represent a cart moving back and forth along the
x-axis. The initial state, x̌0 , is unknown. Set up the system of
equations for the batch least-squares estimation approach:
HT W−1 H x̂ = HT W−1 z.
In other words, work out the details of H, W, z, and x̂, for this
system. Take the maximum timestep to be K = 5. Assume all the
noises are uncorrelated with one another. Will a unique solution
exist to the problem?
3.6.2 Using the same system as the first question, set Q = R = 1 and
show that

H^T W^{−1} H = [  2  −1   0   0   0 ]
               [ −1   3  −1   0   0 ]
               [  0  −1   3  −1   0 ]
               [  0   0  −1   3  −1 ]
               [  0   0   0  −1   2 ].
3.6.3 Using the same system as the first question, modify the least-squares solution for the case in which the measurement noises are correlated with one another in the following way:

E[y_k y_ℓ] = R      if |k − ℓ| = 0,
           = R/2    if |k − ℓ| = 1,
           = R/4    if |k − ℓ| = 2,
           = 0      otherwise.
3.6.4 Using the same system as the first question, work out the details
of the Kalman filter solution. In this case, assume that the initial
conditions for the mean and covariance are x̌0 and P̌0 , respectively.
Show that the steady-state values for the prior and posterior co-
variances, P̌ and P̂ , as K → ∞ are the solutions to the following
quadratics:

P̌² − Q P̌ − Q R = 0,
P̂² + Q P̂ − Q R = 0.
4 Nonlinear Non-Gaussian Estimation

This chapter is one of the most important ones contained in this book.
Here we examine how to deal with the fact that in the real world
there are no linear-Gaussian systems. It should be stated up front that
nonlinear, non-Gaussian (NLNG) estimation is very much still an ac-
tive research topic. The ideas in this chapter provide only some of
the more common approaches to dealing with nonlinear and/or non-
Gaussian systems1 . We begin by contrasting full Bayesian to maximum
a posteriori (MAP) estimation for nonlinear systems. We then intro-
duce a general theoretical framework for recursive filtering problems
called the Bayes filter. Several of the more common filtering tech-
niques are shown to be approximations of the Bayes filter: extended
Kalman filter, sigmapoint Kalman filter, particle filter. We then return
to batch estimation for nonlinear systems, both in discrete and con-
tinuous time. Some books that address nonlinear estimation include
Jazwinski (1970), Maybeck (1994), and Simon (2006).
4.1 Introduction
In the linear-Gaussian chapter, we discussed two perspectives to es-
timation: full Bayesian and maximum a posteriori. We saw that for
linear motion and observation models driven by Gaussian noise, these
two paradigms come to the same answer (i.e., the MAP point was the
mean of the full Bayesian approach); this is because the full posterior
is exactly Gaussian and therefore the mean and mode (i.e., maximum)
are the same point.
This is not true once we move to nonlinear models, since the full
Bayesian posterior is no longer Gaussian. To provide some intuition on
this topic, this section considers a simplified, one-dimensional, nonlin-
ear estimation problem: estimating the position of a landmark from a
stereo camera.
1 Even most of the methods in this chapter actually assume the noise is Gaussian.
Figure 4.1: Idealized stereo camera model relating the landmark depth, x, to the (noise-free) disparity measurement, y = u − v = fb/x, where f is the focal length, b is the baseline, and u and v are the horizontal image coordinates of the landmark in the left and right pinhole cameras.
[Figure: The posterior for the stereo camera example, p(x|y), as well as the negative log-likelihood of the posterior, − ln(p(x|y)) (dashed). We see that the MAP solution is simply the value of x that maximizes (or minimizes) either of these functions. In other words, the MAP solution is the mode of the posterior, which is not generally the same as the mean.]

not tractable for real problems. As a result, a variety of tactics have been built up over the years to compute an approximate posterior. For example, the MAP approach is concerned with finding only the most likely state, or in other words the mode or 'peak' of the posterior. We discuss this next.
which can be easier when the PDFs involved are from the exponential family. As we are seeking only the most likely state, we can use Bayes' rule to write

x̂_map = arg min_x ( − ln(p(y|x)) − ln(p(x)) ),    (4.8)

with

J(x) = (1/(2R)) ( y − fb/x )² + (1/(2P̌)) ( x̌ − x )²,    (4.10)
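The scalar stereo problem makes a convenient sandbox. The following sketch is my own (the values of f, b, the prior, and the noise variance are assumptions, not the book's experiment): it evaluates the cost (4.10) on a grid, takes the minimizer as the MAP estimate, and compares it with the mean of the gridded posterior, illustrating that mode and mean differ for this nonlinear measurement model.

```python
import numpy as np

# Assumed stereo parameters and prior.
f, b = 400.0, 0.25            # focal length [pixels], baseline [m]
R = 9.0                       # disparity noise variance [pixels^2]
x_prior, P_prior = 20.0, 9.0  # prior mean [m] and variance [m^2]

x_true = 22.0
rng = np.random.default_rng(5)
y_meas = f * b / x_true + rng.normal(0.0, np.sqrt(R))   # noisy disparity measurement

x = np.linspace(1.0, 40.0, 20000)                       # grid over depth
J = (y_meas - f * b / x) ** 2 / (2 * R) + (x_prior - x) ** 2 / (2 * P_prior)  # cost (4.10)

x_map = x[np.argmin(J)]                                 # mode of the posterior (MAP)
post = np.exp(-J); post /= np.trapz(post, x)            # gridded posterior p(x|y)
x_mean = np.trapz(x * post, x)                          # mean of the posterior
print(x_map, x_mean)                                    # close, but not identical
```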
[Figure 4.4: Histogram of estimator values, p(x̂_map), for 1,000,000 trials of the stereo experiment; the values E_{XN}[x̂_map] and x̌ are marked.]
Figure 4.5: Markov process representation of the NLNG system described by (4.14).
models:
motion model: xk = f (xk−1 , vk , wk ) , k = 1 . . . K (4.14a)
observation model: yk = g (xk , nk ) , k = 0...K (4.14b)
where k is again the discrete-time index and K its maximum. The
function f (·) is the nonlinear motion model and the function g(·) is the
nonlinear observation model. The variables take on the same mean-
ings as in the linear-Gaussian chapter. For now we do not make any
assumption about any of the random variables being Gaussian.
Figure 4.5 provides a graphical representation of the temporal evolu-
tion of the system described by (4.14). From this picture we can observe
a very important characteristic of the system, the Markov property:
In the simplest sense, a stochastic process has the Markov property if the conditional
probability density functions (PDFs) of future states of the process, given the present
state, depend only upon the present state, but not on any other past states, i.e.,
they are conditionally independent of these older states. Such a process is called
Markovian or a Markov process.
where x̂k is the mean and P̂k the covariance. Next, we assume that the
noise variables, wk and nk (∀k), are in fact Gaussian as well:
wk ∼ N (0, Qk ), (4.22a)
nk ∼ N (0, Rk ). (4.22b)
Note, a Gaussian PDF can be transformed through a nonlinearity to
be non-Gaussian. In fact, we will look at this in more detail a bit later
in this chapter. We assume this is the case for the noise variables; in
other words, the nonlinear motion and observation models may affect
wk and nk . They are not necessarily added after the nonlinearities as
in
xk = f (xk−1 , vk ) + wk , (4.23a)
yk = g (xk ) + nk , (4.23b)
but rather appear inside the nonlinearities as in (4.14). Equations (4.23)
are in fact a special case of (4.14). However, we can recover additive
noise (approximately) through linearization, which we show next.
With g(·) and f (·) nonlinear, we still cannot compute the integral in
the Bayes filter in closed form so we turn to linearization. We linearize
the motion and observation models about the current state estimate
mean:
f(x_{k−1}, v_k, w_k) ≈ x̌_k + F_{k−1} (x_{k−1} − x̂_{k−1}) + w'_k,   (4.24a)
g(x_k, n_k) ≈ y̌_k + G_k (x_k − x̌_k) + n'_k,   (4.24b)
where
x̌_k = f(x̂_{k−1}, v_k, 0),   F_{k−1} = ∂f(x_{k−1}, v_k, w_k)/∂x_{k−1} |_{x̂_{k−1}, v_k, 0},   (4.25a)
w'_k = ∂f(x_{k−1}, v_k, w_k)/∂w_k |_{x̂_{k−1}, v_k, 0} w_k,   (4.25b)
and
y̌_k = g(x̌_k, 0),   G_k = ∂g(x_k, n_k)/∂x_k |_{x̌_k, 0},   (4.26a)
n'_k = ∂g(x_k, n_k)/∂n_k |_{x̌_k, 0} n_k.   (4.26b)
From here the statistical properties of the current state, xk , given the
old state and latest input, are
Using our identity for Gaussian inference (2.82), we can see that the
integral is also Gaussian:
We are now left with the direct product of two Gaussian PDFs, which
(i) The nonlinear motion and observation models are used to prop-
agate the mean of our estimate.
(ii) There are Jacobians embedded in the Q0k and R0k covariances
for the noise. This comes from the fact that we allowed the
noise to be applied within the nonlinearities in (4.14).
It should be noted that there is no proof that the EKF will work in
general for any nonlinear system6 . In order to gauge the performance
of the EKF on a particular nonlinear system, it often comes down to
simply trying it out. The main problem with the EKF is that the operating point of the linearization is the mean of our estimate of the state, not the true state. This seemingly small difference can cause the EKF to diverge wildly in some cases. Sometimes the result is less dramatic, with the estimate being biased or inconsistent, or, most often, both.
where we have defined x̂k as the mean and P̂k as the covariance. The
nonlinear observation model, g(·), is used in the computation of µy,k .
From here, we can write the generalized Gaussian correction-step equa-
tions as
K_k = Σ_xy,k Σ_yy,k^{−1},   (4.37a)
P̂_k = P̌_k − K_k Σ_xy,k^T,   (4.37b)
x̂_k = x̌_k + K_k ( y_k − µ_y,k ),   (4.37c)
where we have let µx,k = x̌k , Σxx,k = P̌k , and Kk is still known as the
Kalman gain. Unfortunately, unless the motion and observation models
are linear, we cannot compute all the remaining quantities required
exactly: µy,k , Σyy,k , and Σxy,k . This is because putting a Gaussian PDF
into a nonlinearity generally results in something non-Gaussian coming
where again we have defined x̂k as the mean and P̂k as the covariance.
As shown in the last section, the generalized Gaussian correction-step
equations are
K_k = Σ_xy,k Σ_yy,k^{−1},   (4.44a)
P̂_k = P̌_k − K_k Σ_xy,k^T,   (4.44b)
x̂_k = x̌_k + K_k ( y_k − µ_y,k ).   (4.44c)
Substituting in the moments, µy,k , Σyy,k , and Σxy,k , from above we
have
K_k = P̌_k G_k^T ( G_k P̌_k G_k^T + R'_k )^{−1},   (4.45a)
P̂_k = (1 − K_k G_k) P̌_k,   (4.45b)
x̂_k = x̌_k + K_k ( y_k − y_op,k − G_k (x̌_k − x_op,k) ).   (4.45c)
These equations are very similar to the Kalman gain and corrector
equations in (4.32); the only difference is the operating point of the
linearization. If we set the operating point of the linearization to be
the mean of the predicted prior, xop,k = x̌k , then (4.45) and (4.32) are
identical.
However, it turns out that we can do much better if we iteratively
recompute (4.45), each time setting the operating point to be the mean
of the posterior at the last iteration:
xop,k ← x̂k . (4.46)
At the first iteration we take x_op,k = x̌_k. This allows us to linearize about better and better estimates, thereby improving our approximation at each iteration.
from one iteration to the next is sufficiently small. Note, the covariance
equation need only be computed once, after the other two equations
converge.
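The following is a minimal sketch of this iterated correction step, under the notation of (4.45) and (4.46); the observation model g, its Jacobian G_jac, and the mapped noise covariance R_prime are assumed to be supplied by the user.

import numpy as np

def iekf_correction(x_check, P_check, y, g, G_jac, R_prime, tol=1e-8, max_iter=20):
    x_op = x_check.copy()                      # first iteration uses the predicted mean
    for _ in range(max_iter):
        G = G_jac(x_op)                        # Jacobian of g at the operating point
        S = G @ P_check @ G.T + R_prime
        K = P_check @ G.T @ np.linalg.inv(S)   # Kalman gain, (4.45a)
        x_new = x_check + K @ (y - g(x_op) - G @ (x_check - x_op))   # (4.45c)
        if np.linalg.norm(x_new - x_op) < tol:
            x_op = x_new
            break
        x_op = x_new                           # (4.46): relinearize about the new mean
    P_hat = (np.eye(len(x_check)) - K @ G) @ P_check   # covariance computed once, (4.45b)
    return x_op, P_hat

On the first pass the operating point equals x̌_k, so stopping after one iteration reproduces the ordinary EKF correction.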
x_true = 26 [m],   y_meas = fb/x_true − 0.6 [pixel],
to exaggerate the difference between the methods. As discussed above,
the mean of the IEKF corresponds to the MAP (i.e., mode) solution,
while the EKF is not easily relatable to the full posterior.
To understand why the IEKF is the same as the MAP estimate,
we require some optimization tools that we will introduce later in the
chapter. For now, the important take-away message from this section is
that our choice to iteratively relinearize about our best guess leads to a
MAP solution. Thus, the ‘mean’ of our IEKF Gaussian estimator does
not actually match the mean of the full Bayesian posterior; it matches
the mode.
Linearization
The most popular method of transforming a Gaussian PDF through a
nonlinearity is linearization, which we have already used to derive the
EKF/ IEKF. Technically, the mean is actually passed through the non-
linearity exactly, while the covariance is approximately passed through
a linearized version of the function. Typically, the operating point of
the linearization process is the mean of the PDF. This procedure is
depicted in Figure 4.9 (repeat of Figure 2.5 for convenience). This pro-
cedure is highly inaccurate for the following reasons:
(i) The outcome of passing a Gaussian PDF through a nonlin-
ear function will not be another Gaussian PDF. By keeping
only the mean and covariance of the posterior PDF, we are ap-
proximating the posterior (by throwing away higher statistical
moments).
(ii) We are approximating the covariance of the true output PDF
by linearizing the nonlinear function.
(iii) The operating point about which we linearize the nonlinear
function is often not the true mean of the prior PDF, but rather
our estimate of the mean of the input PDF. This is an approx-
imation that introduces error.
(iv) We are approximating the mean of the true output PDF by
simply passing the mean of the prior PDF through the nonlinear
function. This does not represent the true mean of the output.
Another disadvantage of linearization is that we need to be able to ei-
ther calculate the Jacobian of the nonlinearity in closed form, or com-
pute it numerically (which introduces yet another approximation).
Despite all these approximations and disadvantages, if the function
is only slightly nonlinear, and the input PDF is Gaussian, the lineariza-
tion method is very simple to understand and quick to implement. One
advantage9 is that the procedure is actually reversible (if the nonlinear-
ity is locally invertible). That is, we can recover the input PDF exactly
9 It might be more accurate to say this is a by-product than an advantage, since it is a
direct result of the specific approximations made in linearization.
Figure 4.10 One-dimensional Gaussian PDF, p(x), transformed through a deterministic nonlinear function, g(·), to give p(y). Here the basic sigmapoint transformation is used, in which only two deterministic samples (one on either side of the mean) approximate the input density: draw deterministic samples from the input density, pass each sample through the nonlinearity, y_i = g(x_i), and combine the samples to form the output density, with
µ_y = ( g(µ_x − σ_x) + g(µ_x + σ_x) ) / 2,   σ_y² = ( g(µ_x − σ_x) − g(µ_x + σ_x) )² / 4.
Sigmapoint Transformation
In a sense, the sigmapoint (SP) or unscented transformation (Julier
and Uhlmann, 1996) is the compromise between the Monte Carlo and
linearization methods when the input density is roughly a Gaussian
PDF. It is more accurate than linearization, but for a comparable com-
putational cost to linearization. Monte Carlo is still the most accurate
method, but the computational cost is prohibitive in most situations.
It is actually a bit misleading to refer to ‘the’ sigmapoint transfor-
mation as there is actually a whole family of such transformations. Fig-
ure 4.10 depicts the very simplest version in one dimension. In general,
a version of the SP transformation is used that includes one additional
sample beyond the basic version at the mean of the input density. The
steps are as follows:
where
α_i = { κ/(L + κ),           i = 0
        (1/2) · 1/(L + κ),   otherwise },   (4.49)
which we note sums to 1. The user-definable parameter, κ, will be
explained in the next section.
2. Each of the sigmapoints is individually passed through the nonlin-
earity, g(·):
y_i = g(x_i),   i = 0 . . . 2L.   (4.50)
3. The mean of the output density, µ_y, is computed as
µ_y = Σ_{i=0}^{2L} α_i y_i.   (4.51)
y_i = f(x_i) = f(µ_x + δx_i) = (µ_x + δx_i)² = µ_x² + 2µ_x δx_i + δx_i².   (4.54)
µ_y = E[y_i] = µ_x² + 2µ_x E[δx_i] + E[δx_i²] = µ_x² + σ_x²,   (4.55)
(using E[δx_i] = 0 and E[δx_i²] = σ_x²).
where E [δx3i ] = 0 and E [δx4i ] = 3σx4 are the well-known third and
fourth moments for a Gaussian PDF.
In truth, the resulting output density is not Gaussian. We could go
on to compute higher moments of the output (and they would not all
match a Gaussian). However, if we want to approximate the output
as Gaussian by not considering the moments beyond the variance, we can. In this case, the resulting output density is N(µ_y, σ_y²). We have
effectively used the Monte Carlo method with an infinite number of
samples to carry out the computation of the first two moments of the
posterior exactly in closed form. Let us now see how linearization and
the sigmapoint transformation perform.
Linearization
Linearizing the nonlinearity about the mean of the input density we
have
y_i = f(µ_x + δx_i) ≈ f(µ_x) + ∂f/∂x |_{µ_x} δx_i = µ_x² + 2µ_x δx_i.   (4.57)
Taking the expectation, the mean of the output is
µ_y = E[y_i] = µ_x²,   (4.58)
which is just the mean of the input passed through the nonlinearity: µ_y = f(µ_x). For the variance of the output we have
σ_y² = E[(y_i − µ_y)²] = E[(2µ_x δx_i)²] = 4µ_x² σ_x².   (4.59)
Comparing (4.55) with (4.58), and (4.56) with (4.59), we see there
are some discrepancies. In fact, the linearized mean has a bias and the
variance is too small (i.e., overconfident). Let us see what happens with
the sigmapoint transformation.
Sigmapoint Transformation
There are 2L + 1 = 3 sigmapoints in dimension L = 1:
x_0 = µ_x,   x_1 = µ_x + √(1 + κ) σ_x,   x_2 = µ_x − √(1 + κ) σ_x,   (4.60)
where κ is a user-definable parameter that we discuss below. We pass
each sigmapoint through the nonlinearity:
y_0 = f(x_0) = µ_x²,   (4.61a)
y_1 = f(x_1) = ( µ_x + √(1 + κ) σ_x )² = µ_x² + 2µ_x √(1 + κ) σ_x + (1 + κ) σ_x²,   (4.61b)
y_2 = f(x_2) = ( µ_x − √(1 + κ) σ_x )² = µ_x² − 2µ_x √(1 + κ) σ_x + (1 + κ) σ_x².   (4.61c)
The mean of the output is given by
µ_y = 1/(1 + κ) ( κ y_0 + (1/2) Σ_{i=1}^{2} y_i )   (4.62a)
    = 1/(1 + κ) ( κ µ_x² + (1/2) ( µ_x² + 2µ_x √(1 + κ) σ_x + (1 + κ) σ_x² + µ_x² − 2µ_x √(1 + κ) σ_x + (1 + κ) σ_x² ) )   (4.62b)
    = 1/(1 + κ) ( κ µ_x² + µ_x² + (1 + κ) σ_x² )   (4.62c)
    = µ_x² + σ_x²,   (4.62d)
which is independent of κ and exactly the same as (4.55). For the
variance we have
σ_y² = 1/(1 + κ) ( κ (y_0 − µ_y)² + (1/2) Σ_{i=1}^{2} (y_i − µ_y)² )   (4.63a)
     = 1/(1 + κ) ( κ σ_x⁴ + (1/2) ( ( 2µ_x √(1 + κ) σ_x + κ σ_x² )² + ( −2µ_x √(1 + κ) σ_x + κ σ_x² )² ) )   (4.63b)
     = 1/(1 + κ) ( κ σ_x⁴ + 4(1 + κ) µ_x² σ_x² + κ² σ_x⁴ )   (4.63c)
     = 4µ_x² σ_x² + κ σ_x⁴,   (4.63d)
which can be made to be identical to (4.56) by selecting the user-
definable parameter, κ, to be 2. Thus, for this nonlinearity, the un-
scented transformation can exactly capture the correct mean and vari-
ance of the output.
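A quick numerical check (not from the text) of the three transformations for this scalar example, y = x², with an assumed input N(5, (3/2)²):

import numpy as np

mu_x, sigma_x = 5.0, 1.5
f = lambda x: x ** 2

# Monte Carlo (large-sample reference)
samples = np.random.default_rng(0).normal(mu_x, sigma_x, 1_000_000)
y = f(samples)
print("Monte Carlo: mean", y.mean(), "var", y.var())

# Linearization: mu_y = f(mu_x), sigma_y^2 = (df/dx)^2 sigma_x^2
print("Linearized : mean", f(mu_x), "var", (2 * mu_x) ** 2 * sigma_x ** 2)

# Sigmapoint transformation with L = 1 and kappa = 2 (three sigmapoints)
L, kappa = 1, 2.0
pts = np.array([mu_x,
                mu_x + np.sqrt(L + kappa) * sigma_x,
                mu_x - np.sqrt(L + kappa) * sigma_x])
w = np.array([kappa, 0.5, 0.5]) / (L + kappa)
ypts = f(pts)
mu_y = w @ ypts
print("Sigmapoint : mean", mu_y, "var", w @ (ypts - mu_y) ** 2)
# Closed form for comparison: mean = mu_x^2 + sigma_x^2, var = 4 mu_x^2 sigma_x^2 + 2 sigma_x^4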
Figure 4.12 Graphical depiction of passing a Gaussian PDF, p(x) = N(5, (3/2)²), through the nonlinearity, y = x², using various methods. We see that the Monte Carlo and sigmapoint methods match the true mean while linearization does not. We also show the exact transformed PDF, which is not Gaussian and therefore does not have its mean at its mode.
To understand why we should pick κ = 2, we need look no further than the input density. The parameter κ scales how far away the sigmapoints are from the mean. This does not affect the first three moments of the sigmapoints (i.e., µ_x, σ_x², and the zero skewness). However, changing κ does influence the fourth moment, kurtosis. We already used the fact that for a Gaussian PDF, the fourth moment is 3σ_x⁴. We can choose κ to make the fourth moment of the sigmapoints match the true kurtosis of the Gaussian prior density:
3σ_x⁴ = 1/(1 + κ) ( κ (x_0 − µ_x)⁴ + (1/2) Σ_{i=1}^{2} (x_i − µ_x)⁴ )   (4.64a)
      = 1/(2(1 + κ)) ( ( √(1 + κ) σ_x )⁴ + ( −√(1 + κ) σ_x )⁴ )   (4.64b)
      = (1 + κ) σ_x⁴,   (4.64c)
where (x_0 − µ_x)⁴ = 0; solving (1 + κ) σ_x⁴ = 3σ_x⁴ gives κ = 2.
In the next few sections, we return to the Bayes filter and use our
new knowledge about the different methods of passing PDFs through
nonlinearities to make some useful improvements to the EKF. We will
begin with the particle filter, which makes use of the Monte Carlo
method. We will then try to implement a Gaussian filter using the SP
transformation.
PF Algorithm
Using the notation from the section on the Bayes filter, the main steps
in the particle filter are as follows:
1. Draw M samples from the joint density comprised of the prior and
the motion noise:
[ x̂_{k−1,m} ; w_{k,m} ] ← p(x_{k−1} | x̌_0, v_{1:k−1}, y_{1:k−1}) p(w_k),   (4.65)
where m is the unique particle index. In practice we can just draw
from each factor of this joint density separately.
2. Generate a prediction of the posterior PDF by using vk . This is done
by passing each prior particle/noise sample through the nonlinear
motion model:
x̌k,m = f (x̂k−1,m , vk , wk,m ) . (4.66)
These new ‘predicted particles’ together approximate the density,
p (xk |x̌0 , v1:k , y1:k−1 ).
3. Correct the posterior PDF by incorporating yk . This is done indi-
rectly in two steps:
– First, assign a scalar weight, wk,m , to each predicted particle based
on the divergence between the desired posterior and the predicted
posterior for each particle:
w_{k,m} = p(x̌_{k,m} | x̌_0, v_{1:k}, y_{1:k}) / p(x̌_{k,m} | x̌_0, v_{1:k}, y_{1:k−1}) = η p(y_k | x̌_{k,m}),   (4.67)
Figure 4.13 Block-diagram representation of the particle filter: the input, v_k, and sampled process noise, w_{k,m}, pass the sampled prior belief, x̂_{k−1,m}, through the motion model, f(x, v, w), to give the predicted belief, x̌_{k,m}; the measurement, y_k, together with the observation model, g(x, 0), gives the sample weights, w_{k,m} = η p(y_k | x̌_{k,m}), which are used to resample the posterior belief, x̂_{k,m}.
Resampling
A key aspect of particle filters is the need to resample the posterior
density according to weights assigned to each current sample. One way
to do this is to use the systematic resampling method described by
Madow (1949). We assume we have M samples and each of these is
assigned an unnormalized weight, w_m > 0. From the weights, we create bins with boundaries, β_m, according to
β_m = ( Σ_{n=1}^{m} w_n ) / ( Σ_{ℓ=1}^{M} w_ℓ ).   (4.70)
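A minimal sketch of this weighting-and-resampling step, following (4.67) and (4.70), might look as follows; the predicted particles and the measurement likelihood are assumed to be supplied by the caller.

import numpy as np

def resample_systematic(particles, weights, rng=np.random.default_rng()):
    """Resample particles with probability proportional to their weights."""
    M = len(particles)
    beta = np.cumsum(weights) / np.sum(weights)    # bin boundaries from (4.70)
    # One uniform draw, then M evenly spaced points cover [0, 1) (Madow, 1949).
    u = (rng.random() + np.arange(M)) / M
    idx = np.searchsorted(beta, u)                 # bin that each point falls into
    return particles[idx]

# Usage sketch: weights w_{k,m} proportional to p(y_k | x_check_{k,m}), cf. (4.67):
# x_post = resample_systematic(x_pred, likelihood(y_k, x_pred))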
Prediction Step
The prediction step is a fairly straightforward application of the sigmapoint transformation since we are simply trying to bring our prior forward in time through the motion model. We employ the following steps to go from the prior belief, {x̂_{k−1}, P̂_{k−1}}, to the predicted belief, {x̌_k, P̌_k}:
1. Both the prior belief and the motion noise have uncertainty so these
are stacked together in the following way:
µ_z = [ x̂_{k−1} ; 0 ],   Σ_zz = [ P̂_{k−1}  0 ; 0  Q_k ],   (4.71)
where we see that {µz , Σzz } is still a Gaussian representation. We
let L = dim µz .
2. Convert {µz , Σzz } to a sigmapoint representation:
L L^T = Σ_zz   (Cholesky decomposition, L lower-triangular),   (4.72a)
z_0 = µ_z,   (4.72b)
z_i = µ_z + √(L + κ) col_i L,   i = 1 . . . L,   (4.72c)
z_{i+L} = µ_z − √(L + κ) col_i L,   i = 1 . . . L.   (4.72d)
10 These are sometimes handled together in a single step, but we prefer to think of each of
these as a separate application of the sigmapoint transformation.
where
α_i = { κ/(L + κ),           i = 0
        (1/2) · 1/(L + κ),   otherwise }.   (4.76)
Next we will look at a second application of the sigmapoint transfor-
mation to implement the correction step.
Correction Step
This step is a little more complicated. We look back to Section 4.2.4 and
recall that the conditional Gaussian density for xk (i.e., the posterior)
is
where we have defined x̂k as the mean and P̂k as the covariance. In this
form, we can write the generalized Gaussian correction-step equations
as
K_k = Σ_xy,k Σ_yy,k^{−1},   (4.78a)
P̂_k = P̌_k − K_k Σ_xy,k^T,   (4.78b)
x̂_k = x̌_k + K_k ( y_k − µ_y,k ).   (4.78c)
We will use the SP transformation to come up with better versions of
µy,k , Σyy,k , and Σxy,k . We employ the following steps:
1. Both the predicted belief and the observation noise have uncertainty
so these are stacked together in the following way:
µ_z = [ x̌_k ; 0 ],   Σ_zz = [ P̌_k  0 ; 0  R_k ],   (4.79)
where we see that {µz , Σzz } is still a Gaussian representation. We
let L = dim µz .
2. Convert {µz , Σzz } to a sigmapoint representation:
L L^T = Σ_zz   (Cholesky decomposition, L lower-triangular),   (4.80a)
z_0 = µ_z,   (4.80b)
z_i = µ_z + √(L + κ) col_i L,   i = 1 . . . L,   (4.80c)
z_{i+L} = µ_z − √(L + κ) col_i L,   i = 1 . . . L.   (4.80d)
3. Unstack each sigmapoint into state and observation noise,
z_i = [ x̌_{k,i} ; n_{k,i} ],   (4.81)
where
α_i = { κ/(L + κ),           i = 0
        (1/2) · 1/(L + κ),   otherwise }.   (4.84)
These are plugged into the generalized Gaussian correction-step
equations above to complete the correction step.
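The following is a minimal sketch of the correction-step computations just listed: stack the predicted belief with the measurement noise, draw 2L + 1 sigmapoints, pass them through a user-supplied observation model g(x, n), and form the moments needed by the generalized Gaussian correction equations (4.78). It is a sketch of the standard sigmapoint construction, not the book's implementation.

import numpy as np

def spkf_correction(x_check, P_check, R, y, g, kappa=2.0):
    N = len(x_check)
    mu_z = np.concatenate([x_check, np.zeros(R.shape[0])])      # (4.79)
    Sigma_zz = np.block([[P_check, np.zeros((N, R.shape[0]))],
                         [np.zeros((R.shape[0], N)), R]])
    L = len(mu_z)
    S = np.linalg.cholesky(Sigma_zz)                            # L L^T = Sigma_zz, (4.80a)
    Z = np.vstack([mu_z,
                   mu_z + np.sqrt(L + kappa) * S.T,             # (4.80c)
                   mu_z - np.sqrt(L + kappa) * S.T])            # (4.80d)
    alpha = np.full(2 * L + 1, 0.5 / (L + kappa))
    alpha[0] = kappa / (L + kappa)                              # (4.84)
    Y = np.array([g(z[:N], z[N:]) for z in Z])                  # unstack and apply g, (4.81)
    mu_y = alpha @ Y
    dY, dX = Y - mu_y, Z[:, :N] - x_check
    Sigma_yy = (alpha[:, None] * dY).T @ dY
    Sigma_xy = (alpha[:, None] * dX).T @ dY
    K = Sigma_xy @ np.linalg.inv(Sigma_yy)                      # (4.78a)
    return x_check + K @ (y - mu_y), P_check - K @ Sigma_xy.T   # (4.78c), (4.78b)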
One huge advantage of the SPKF is that it does not require any analytical derivatives and uses only basic linear algebra operations in the implementation. Moreover, we do not even need the nonlinear motion and observation models in closed form; they could just be black-box software functions.
+ G_k Σ_{i=0}^{2L} α_i (x̌_{k,i} − x̌_k) n'^T_{k,i} + Σ_{i=0}^{2L} α_i n'_{k,i} (x̌_{k,i} − x̌_k)^T G_k^T,   (4.87)
with both of these sums equal to zero,
where some of the terms are zero owing to the block-diagonal structure
of Σzz above. For Σxy,k , by substituting our approximation into (4.83c),
we have
Σ_xy,k ≈ Σ_{i=0}^{2L} α_i (x̌_{k,i} − x̌_k)(x̌_{k,i} − x̌_k)^T G_k^T + Σ_{i=0}^{2L} α_i (x̌_{k,i} − x̌_k) n'^T_{k,i},   (4.88)
where the first (state) sum evaluates to P̌_k and the second to 0, so that
K_k = Σ_xy,k Σ_yy,k^{−1} ≈ P̌_k G_k^T ( G_k P̌_k G_k^T + R'_k )^{−1},   (4.89)
yk = g(xk ) + nk , (4.90)
the SPKF correction step can be greatly sped up. Without loss of gen-
erality, we can break the sigmapoints into two categories based on the
block-diagonal partitioning in the matrix, Σzz , above; we say there
are 2N + 1 sigmapoints coming from the dimension of the state and
2(L − N ) additional sigmapoints coming from the dimension of the
measurements. To make this convenient, we will re-order the indexing
on the sigmapoints accordingly:
y̌_{k,j} = { g(x̌_{k,j}),          j = 0 . . . 2N
            g(x̌_k) + n_{k,j},    j = 2N + 1 . . . 2L + 1 }.   (4.91)
µ_y,k = Σ_{j=0}^{2N} α_j y̌_{k,j} + Σ_{j=2N+1}^{2L+1} α_j y̌_{k,j}   (4.92a)
      = Σ_{j=0}^{2N} α_j y̌_{k,j} + Σ_{j=2N+1}^{2L+1} α_j ( g(x̌_k) + n_{k,j} )   (4.92b)
      = Σ_{j=0}^{2N} α_j y̌_{k,j} + g(x̌_k) Σ_{j=2N+1}^{2L+1} α_j   (4.92c)
      = Σ_{j=0}^{2N} β_j y̌_{k,j},   (4.92d)
where
β_i = { α_i + Σ_{j=2N+1}^{2L+1} α_j,   i = 0
        α_i,                           otherwise }   (4.93a)
    = { (κ + L − N)/(N + (κ + L − N)),         i = 0
        (1/2) · 1/(N + (κ + L − N)),           otherwise }.   (4.93b)
This is the same form as the original weights (and they still sum to 1). We can then easily verify that
Σ_yy,k = Σ_{j=0}^{2N} β_j ( y̌_{k,j} − µ_y,k )( y̌_{k,j} − µ_y,k )^T + R_k,   (4.94)
By the SMW identity from Section 2.2.7, we can then show that
Σ_yy,k^{−1} = ( Z_k Z_k^T + R_k )^{−1}   (4.97a)
            = R_k^{−1} − R_k^{−1} Z_k ( Z_k^T R_k^{−1} Z_k + 1 )^{−1} Z_k^T R_k^{−1},   (4.97b)
where the bracketed matrix, Z_k^T R_k^{−1} Z_k + 1, is (2N + 1) × (2N + 1), so we now only need to invert a (2N + 1) × (2N + 1) matrix (assuming R_k^{−1} is known).
1. Both the predicted belief and the observation noise have uncertainty
so these are stacked together in the following way:
µ_z = [ x_op,k ; 0 ],   Σ_zz = [ P̌_k  0 ; 0  R_k ],   (4.98)
where
α_i = { κ/(L + κ),           i = 0
        (1/2) · 1/(L + κ),   otherwise }.   (4.103)
At this point, Sibley et al. use the relationships between the SPKF and
EKF quantities to update the IEKF correction equations, (4.45), using
Figure 4.14 Stereo camera example, comparing the inference (i.e., 'corrective') step of the various estimators (x̂_map, x̂_iekf, x̂_spkf, x̂_ispkf) against the prior, p(x), and the full posterior, p(x|y).
which results in
K_k = Σ_xy,k Σ_yy,k^{−1},   (4.105a)
P̂_k = Σ_xx,k − K_k Σ_yx,k,   (4.105b)
x̂_k = x̌_k + K_k ( y_k − µ_y,k − Σ_yx,k Σ_xx,k^{−1} (x̌_k − x_op,k) ).   (4.105c)
Initially, we set the operating point to be the mean of the prior: xop,k =
x̌k . At subsequent iterations we set it to be the best estimate so far:
The process terminates when the change from one iteration to the next
becomes sufficiently small.
We have seen that the first iteration of the IEKF results in the EKF
method, and this is also true for the SPKF/ ISPKF. Setting xop,k = x̌k
in (4.105c) results in
x̂_k = x̌_k + K_k ( y_k − µ_y,k ),   (4.107)
Figure 4.15 Histogram of estimator values, p(x̂_ispkf), for the stereo camera experiment, where each time a new x_true is randomly drawn from the prior and a new y_meas is randomly drawn from the measurement model. The dashed line marks the mean of the prior, x̌, and the solid line marks the expected value of the iterated sigmapoint estimator, x̂_ispkf, over all the trials. The gap between dashed and solid is e_mean ≈ −3.84 cm, which indicates a small bias, and the average squared error is e_sq ≈ 4.32 m².
4.2.11 ISPKF Seeks the Posterior Mean
Now, the question we must ask ourselves is, how do the sigmapoint estimates relate to the full posterior? Figure 4.14 compares the sigmapoint methods to the full posterior and iterated linearization (i.e., MAP) on our stereo camera example, where we used
x_true = 26 [m],   y_meas = fb/x_true − 0.6 [pixel],
again to exaggerate the differences between the various methods. Our implementations of the sigmapoint methods used κ = 2, which is appropriate for a Gaussian prior. Much like the EKF, we see that the one-shot SPKF method bears no obvious relationship to the full posterior. However, the ISPKF method appears to come closer to the mean, x̄, rather than the mode (i.e., MAP) value, x̂_map. Numerically, the numbers of interest are:
x̂_map = 24.5694,   x̄ = 24.7770,
x̂_iekf = 24.5694,   x̂_ispkf = 24.7414.
We see that the IEKF solution matches the MAP one, and the ISPKF solution is close to (but not exactly) the mean.
Now, we consider the question, how well does the iterated sigmapoint
method capture xtrue ? Once again, we compute the performance over
a large number of trials (using the parameters in (4.5)). The results
are shown in Figure 4.15. We see that the average difference of the
estimator, x̂ispkf , and the ground-truth, xtrue , is emean ≈ −3.84 cm,
demonstrating a small bias. This is significantly better than the MAP
estimator, which had a bias of −33.0 cm on this same metric. The
average squared error is approximately the same, with esq ≈ 4.32 m2 .
Although it is difficult to show analytically, it is plausible to think
that the iterated sigmapoint method is trying to converge to the mean
of the full posterior rather than the mode. If what we care about is
Figure 4.16 Taxonomy of the different filtering methods, showing their relationships to the Bayes filter: one branch approximates the PDFs as Gaussian, leading to the (iterated) extended Kalman filter and the (iterated) sigmapoint Kalman filter; the other approximates the PDFs using a large number of random samples.
Objective Function
We seek to construct an objective function that we will minimize with
respect to
x = [ x_0 ; x_1 ; ⋮ ; x_K ],   (4.108)
J_v,k(x) = (1/2) e_v,k(x)^T W_v,k^{−1} e_v,k(x),   (4.110a)
J_y,k(x) = (1/2) e_y,k(x)^T W_y,k^{−1} e_y,k(x).   (4.110b)
The overall objective function is then
J(x) = Σ_{k=0}^{K} ( J_v,k(x) + J_y,k(x) ).   (4.111)
We further define
e(x) = [ e_v(x) ; e_y(x) ],   e_v(x) = [ e_v,0(x) ; ⋮ ; e_v,K(x) ],   e_y(x) = [ e_y,0(x) ; ⋮ ; e_y,K(x) ],   (4.112a)
W = diag(W_v, W_y),   W_v = diag(W_v,0, . . . , W_v,K),   (4.112b)
W_y = diag(W_y,0, . . . , W_y,K),   (4.112c)
so that the objective function can be written as
J(x) = (1/2) e(x)^T W^{−1} e(x).   (4.113)
We can further define the modified error term,
u(x) = L e(x), (4.114)
where L^T L = W^{−1} (i.e., from a Cholesky decomposition, since W is symmetric positive-definite). Using these definitions, we can write the objective function simply as
J(x) = (1/2) u(x)^T u(x).   (4.115)
This is precisely in a quadratic form, but not with respect to the design
variables, x. Our goal is to determine the optimum design parameter,
x̂, that minimizes the objective function:
x̂ = arg min J(x). (4.116)
x
Newton’s Method
Newton’s method works by iteratively approximating the (differen-
tiable) objective function by a quadratic function and then jumping to
(or moving towards) the minimum. Suppose we have an initial guess,
or operating point, for the design parameter, xop . We use a three-term
Taylor-series expansion to approximate J as a quadratic function,
J(x_op + δx) ≈ J(x_op) + ∂J(x)/∂x |_{x_op} δx + (1/2) δx^T ∂²J(x)/∂x∂x^T |_{x_op} δx,   (4.117)
where the first-derivative factor is the Jacobian and the second-derivative factor the Hessian. The expansion is in terms
of δx, a ‘small’ change to the initial guess, xop . We note that the sym-
metric Hessian matrix needs to be positive-definite for this method
to work (otherwise there is no well-defined minimum to the quadratic
approximation).
The next step is to find the value of δx that minimizes this quadratic
approximation. We can do this by taking the derivative with respect
to δx and setting to zero to find a critical point:
∂J(x_op + δx)/∂δx = ∂J(x)/∂x |_{x_op} + δx*^T ∂²J(x)/∂x∂x^T |_{x_op} = 0
⇒ ∂²J(x)/∂x∂x^T |_{x_op} δx* = − ( ∂J(x)/∂x |_{x_op} )^T.   (4.118)
The last line is just a linear system of equations and can be solved
when the Hessian is invertible (which it must be, since it was assumed
to be positive-definite above). We may then update our operating point
according to:
This procedure iterates until δx∗ becomes sufficiently small. A few com-
ments about Newton’s method:
Gauss-Newton Method
Let us now return to the nonlinear quadratic objective function we have
in Equation (4.115). In this case, the Jacobian and Hessian matrices
are
Jacobian:   ∂J(x)/∂x |_{x_op} = u(x_op)^T ∂u(x)/∂x |_{x_op},   (4.120a)
Hessian:   ∂²J(x)/∂x∂x^T |_{x_op} = ( ∂u(x)/∂x |_{x_op} )^T ( ∂u(x)/∂x |_{x_op} ) + Σ_{i=1}^{M} u_i(x_op) ∂²u_i(x)/∂x∂x^T |_{x_op},   (4.120b)
where u(x) = ( u_1(x), . . . , u_i(x), . . . , u_M(x) ). We have so far not made any approximations.
Looking to the expression for the Hessian, we assert that near the
minimum of J, the second term is small relative to the first. One intu-
ition behind this is that near the optimum we should have ui (x) small
(and ideally zero). We thus approximate the Hessian according to
∂²J(x)/∂x∂x^T |_{x_op} ≈ ( ∂u(x)/∂x |_{x_op} )^T ( ∂u(x)/∂x |_{x_op} ),   (4.121)
J(x_op + δx) ≈ (1/2) ( u(x_op) + ∂u(x)/∂x |_{x_op} δx )^T ( u(x_op) + ∂u(x)/∂x |_{x_op} δx ).   (4.124)
We can also combine both of these patches together to give us the most
options in controlling convergence.
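A minimal sketch of the resulting Gauss-Newton iteration for J(x) = (1/2) u(x)^T u(x) is given below; the error function u and its Jacobian are assumed to be supplied by the caller, and no damping or line search (the 'patches' mentioned above) is included.

import numpy as np

def gauss_newton(u, u_jac, x0, tol=1e-9, max_iter=50):
    x_op = x0.copy()
    for _ in range(max_iter):
        U = u_jac(x_op)                          # d u / d x at the operating point
        r = u(x_op)
        # Approximate Hessian U^T U, cf. (4.121); gradient U^T r, cf. (4.120a)
        dx = np.linalg.solve(U.T @ U, -U.T @ r)
        x_op = x_op + dx
        if np.linalg.norm(dx) < tol:
            break
    return x_op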
Batch Estimation
We now return to our specific estimation problem and apply the Gauss-
Newton optimization method. We will use the ‘shortcut’ approach and
thus begin by approximating our error expressions:
e_v,k(x_op + δx) ≈ { e_v,0(x_op) − δx_0,                          k = 0
                     e_v,k(x_op) + F_{k−1} δx_{k−1} − δx_k,        k = 1 . . . K },   (4.133)
e_y,k(x_op + δx) ≈ e_y,k(x_op) − G_k δx_k,   k = 0 . . . K,   (4.134)
where
e_v,k(x_op) ≈ { x̌_0 − x_op,0,                         k = 0
                f(x_op,k−1, v_k, 0) − x_op,k,          k = 1 . . . K },   (4.135)
ey,k (xop ) ≈ yk − g (xop,k , 0) , k = 0 . . . K, (4.136)
and we require definitions of the Jacobians of the nonlinear motion and observation models, given by
F_{k−1} = ∂f(x_{k−1}, v_k, w_k)/∂x_{k−1} |_{x_op,k−1, v_k, 0},   G_k = ∂g(x_k, n_k)/∂x_k |_{x_op,k, 0}.   (4.137)
we can define
δx = [ δx_0 ; δx_1 ; δx_2 ; ⋮ ; δx_K ],
H = [  1
      −F_0    1
             −F_1    ⋱
                     ⋱       1
                          −F_{K−1}    1
       G_0
             G_1
                    G_2
                           ⋱
                                  G_K  ],   (4.139a)
e(x_op) = [ e_v,0(x_op) ; e_v,1(x_op) ; ⋮ ; e_v,K(x_op) ; e_y,0(x_op) ; e_y,1(x_op) ; ⋮ ; e_y,K(x_op) ],   (4.139b)
and
W = diag( P̌_0, Q'_1, . . . , Q'_K, R'_0, R'_1, . . . , R'_K ),   (4.140)
where
y_op = [ g(x_op,0, 0) ; g(x_op,1, 0) ; ⋮ ; g(x_op,K, 0) ],   (4.149a)
G = diag( G_0, G_1, G_2, . . . , G_K ),   (4.149b)
R' = diag( R'_0, R'_1, R'_2, . . . , R'_K ),   (4.149c)
and n' ∼ N(0, R'). It is fairly easy to see that
E[y] = y_op + G (x̌ − x_op),   (4.150a)
E[(y − E[y])(y − E[y])^T] = G P̌ G^T + R',   (4.150b)
E[(y − E[y])(x − E[x])^T] = G P̌.   (4.150c)
Again, the (·)' notation is used to indicate the Jacobian with respect to the noise is incorporated into the quantity.
With these quantities in hand, we can write a joint density for the
lifted trajectory and measurements as
p(x, y | v) = N( [ x̌ ; y_op + G (x̌ − x_op) ],  [ P̌   P̌G^T ; GP̌   GP̌G^T + R' ] ),   (4.151)
which is quite similar to the expression for the IEKF situation in (4.42),
but now for the whole trajectory rather than just one timestep. Us-
ing the usual relationship from (2.46b), we can immediately write the
Gaussian posterior as
p(x | v, y) = N( x̂, P̂ ),   (4.152)
where
K = P̌ G^T ( G P̌ G^T + R' )^{−1},   (4.153a)
P̂ = (1 − K G) P̌,   (4.153b)
x̂ = x̌ + K ( y − y_op − G (x̌ − x_op) ).   (4.153c)
Using the SMW identity from (2.68), we can rearrange the equation
for the posterior mean to be
( P̌^{−1} + G^T R'^{−1} G ) δx* = P̌^{−1} (x̌ − x_op) + G^T R'^{−1} (y − y_op),   (4.154)
where δx* = x̂ − x_op. Inserting the details of the prior, this becomes
( F^{−T} Q'^{−1} F^{−1} + G^T R'^{−1} G ) δx* = F^{−T} Q'^{−1} ( ν − F^{−1} x_op ) + G^T R'^{−1} ( y − y_op ),   (4.155)
where the matrix on the left-hand side is block-tridiagonal.
for a minimum.
We will come back to this ML setup when discussing a problem called
bundle adjustment in the later chapters.
Maximum Likelihood Bias Estimation
We have already seen in the simple example at the start of this chapter
that the MAP method is biased with respect to average mean error. It
turns out that the ML method is biased as well (unless the measurement
model is linear). There is a classic paper by Box (1971) that derives an
approximate expression for the bias in ML, and we use this section to
present it.
We will see below that we need a second-order Taylor expansion of
g(x) while only a first-order expansion of G(x). Thus, we have the
following approximate expressions:
g_k(x̂) = g_k(x + δx) ≈ g_k(x) + G_k(x) δx + (1/2) Σ_j 1_j δx^T G_jk(x) δx,   (4.165a)
G_k(x̂) = G_k(x + δx) ≈ G_k(x) + Σ_j 1_j δx^T G_jk(x),   (4.165b)
11 This is because the logarithm is a monotonically increasing function. Another way of
looking at ML is that it is MAP with a uniform prior over all possible solutions.
where
g_k(x) = [ g_jk(x) ]_j,   G_k(x) = ∂g_k(x)/∂x,   G_jk(x) = ∂²g_jk(x)/∂x∂x^T,   (4.166)
and 1j is the jth column of the identity matrix. We have indicated
whether each Jacobian/Hessian is evaluated at x (the true state) or x̂
(our estimate). In this section, the quantity, δx = x̂ − x, will be the
difference between our estimate and the true state, on a given trial.
Each time we change the measurement noise, we will get a different
value for the estimate and hence δx. We will seek an expression for the
expected value of this difference, E[δx], over all possible realizations of
the measurement noise; this represents the systematic error or bias.
As discussed above, after convergence of Gauss-Newton, the esti-
mate, x̂, will satisfy the following optimality criterion:
Σ_k G_k(x̂)^T R_k^{−1} ( y_k − g_k(x̂) ) = 0,   (4.167)
or
Σ_k ( G_k(x) + Σ_j 1_j δx^T G_jk(x) )^T R_k^{−1} ( y_k − g_k(x) − G_k(x) δx − (1/2) Σ_j 1_j δx^T G_jk(x) δx ) ≈ 0,   (4.168)
where y_k − g_k(x) is just the measurement noise, n_k.
Σ_k ( G_k(x) + Σ_j 1_j ( A(x) n + b(n) )^T G_jk(x) )^T R_k^{−1} ( P_k n − G_k(x) ( A(x) n + b(n) ) − (1/2) Σ_j 1_j ( A(x) n + b(n) )^T G_jk(x) ( A(x) n + b(n) ) ) ≈ 0.   (4.170)
and then noting that q1 (−n) = q1 (n) and q2 (−n) = q2 (n) owing to
the quadratic nature of these terms. Subtracting the second case from
the first we have 2 L n = 0 and since n can take on any value, it follows
that L = 0 and thus
A(x) = W(x)^{−1} Σ_k G_k(x)^T R_k^{−1} P_k,   (4.174)
where
W(x) = Σ_k G_k(x)^T R_k^{−1} G_k(x).   (4.175)
k
Choosing this value for A(x) and taking the expectation (over all values
of n) we are left with
Fortunately, it turns out that E [q2 (n)] = 0. To see this, we need two
identities:
= 0, (4.178)
where we have employed the above identities. We are thus left with
E [q1 (n)] = 0, (4.179)
or
E[b(n)] = −(1/2) W(x)^{−1} Σ_k Σ_j G_k(x)^T R_k^{−1} 1_j E[ n^T A(x)^T G_jk(x) A(x) n ]
        = −(1/2) W(x)^{−1} Σ_k Σ_j G_k(x)^T R_k^{−1} 1_j E[ tr( G_jk(x) A(x) n n^T A(x)^T ) ]
        = −(1/2) W(x)^{−1} Σ_k Σ_j G_k(x)^T R_k^{−1} 1_j tr( G_jk(x) A(x) E[n n^T] A(x)^T )
        = −(1/2) W(x)^{−1} Σ_k Σ_j G_k(x)^T R_k^{−1} 1_j tr( G_jk(x) W(x)^{−1} ),   (4.180)
where we have used E[n n^T] = R and A(x) R A(x)^T = W(x)^{−1},
and so our final expression for the systematic part of the bias is
E[δx] = −(1/2) W(x)^{−1} Σ_k Σ_j G_k(x)^T R_k^{−1} 1_j tr( G_jk(x) W(x)^{−1} ).   (4.182)
To use this expression in operation, we will need to substitute our
estimate, x̂, in place of x when computing (4.182). Then we can update
our estimate according to
x̂ ← x̂ − E[δx], (4.183)
to subtract off the bias. Note, this expression is only approximate and
may only work well in mildly nonlinear situations.
4.3.4 Discussion
If we think of the EKF as an approximation of the full nonlinear Gauss-
Newton (or even Newton) method applied to our estimation problem,
we can see that it is really quite inferior, mainly because it does not
iterate to convergence. The Jacobians are evaluated only once (at the
best estimate so far). In truth, the EKF can do better than just one
iteration of Gauss-Newton because the EKF does not evaluate all the
Jacobians at once, but the lack of iteration is its main downfall. This
is obvious from an optimization perspective; we need to iterate to con-
verge. However, the EKF was originally derived from the Bayes filter
earlier in this chapter, which used something called the Markov as-
sumption to achieve its recursive form. The problem with the Markov
assumption is that once this is built into the estimator, we cannot get
rid of it. It is a fundamental constraint that cannot be overcome.
There have been many attempts to patch the EKF, including the
Iterated EKF described earlier in this chapter. However, for very non-
linear systems, these may not help much. The problem with the IEKF
is that it still clings to the Markov assumption. It is iterating at a single
time-step, not over the whole trajectory at once. The difference between
Gauss-Newton and the IEKF can be seen plainly in Figure 4.17.
Batch estimation via the Gauss-Newton method has its own prob-
lems. In particular, it must be run offline and is not a constant-time
algorithm, whereas the EKF is both online and a constant-time method.
So-called sliding-window filters (SWFs) (Sibley, 2006) seek to get the
best of both worlds by iterating over a window of time-steps and sliding
this window along to allow for online/constant-time implementation.
SWFs are really still an active area of research, but when viewed from
Figure 4.17 Comparison of the iterative schemes used in various estimation paradigms. Gauss-Newton iterates over the entire trajectory, x_0 . . . x_K, but runs offline and not in constant time. Sliding-window filters iterate over several timesteps at once, and run online and in constant time. The IEKF iterates at only one timestep at a time, but runs online and in constant time.
with
nk ∼ N (0, Rk ). (4.188)
Linearization
Linearizing our motion model about this trajectory, we have
where ν(t), F(t), and L(t) are now known functions of time (since xop (t)
is known). Thus, approximately, our process model is of the form
x(t) ∼ GP( Φ(t, t_0) x̌_0 + ∫_{t_0}^{t} Φ(t, s) ν(s) ds,
           Φ(t, t_0) P̌_0 Φ(t', t_0)^T + ∫_{t_0}^{min(t,t')} Φ(t, s) L(s) Q L(s)^T Φ(t', s)^T ds ),   (4.191)
where the first argument is the mean function, x̌(t), and the second the covariance function, P̌(t, t').
where
K = P̌ G^T ( G P̌ G^T + R' )^{−1},   (4.203a)
P̂ = (1 − K G) P̌,   (4.203b)
x̂ = x̌ + K ( y − y_op − G (x̌ − x_op) ).   (4.203c)
Using the SMW identity from (2.68), we can rearrange the equation
for the posterior mean to be
( P̌^{−1} + G^T R'^{−1} G ) δx* = P̌^{−1} (x̌ − x_op) + G^T R'^{−1} (y − y_op),   (4.204)
where δx∗ = x̂ − xop . Inserting the details of the prior this becomes
( F^{−T} Q'^{−1} F^{−1} + G^T R'^{−1} G ) δx* = F^{−T} Q'^{−1} ( ν − F^{−1} x_op ) + G^T R'^{−1} ( y − y_op ),   (4.205)
where the matrix on the left-hand side is block-tridiagonal.
This result is identical in form to the nonlinear, discrete-time batch
solution discussed earlier in this chapter. The only difference is that we
started with a continuous-time motion model and integrated it directly
to evaluate the prior at the measurement times.
The most expensive step in this whole process is building ν, F^{−1}, and Q'^{−1}. However, the cost (at each iteration) will still be linear in the length of the trajectory and therefore should be manageable.
4.5 Summary
The main take-away points from this chapter are:
The next chapter will look briefly at how to handle estimator bias,
measurement outliers, and data correspondences.
14 We did not work this out for the nonlinear case, but it should follow from the GP
section in the linear-Gaussian chapter.
4.6 Exercises
4.6.1 Consider the discrete-time system,
[ x_k ; y_k ; θ_k ] = [ x_{k−1} ; y_{k−1} ; θ_{k−1} ] + T [ cos θ_{k−1}  0 ; sin θ_{k−1}  0 ; 0  1 ] [ v_k ; ω_k ] + w_k,   w_k ∼ N(0, Q),
[ r_k ; φ_k ] = [ √(x_k² + y_k²) ; atan2(−y_k, −x_k) − θ_k ] + n_k,   n_k ∼ N(0, R),
which could represent a mobile robot moving around on the xy-
plane and measuring the range and bearing to the origin. Set up
the EKF equation to estimate the pose of the mobile robot. In
particular, work out expressions for the Jacobians, Fk−1 and Gk ,
and modified covariances, Q0k and R0k .
4.6.2 Consider transforming the prior Gaussian density, N (µx , σx2 )
through the nonlinearity, f (x) = x3 . Use the Monte Carlo, lin-
earization, and sigmapoint methods to determine the transformed
mean and covariance and comment on the results. Hint: use Is-
serlis’ theorem to compute the higher-order Gaussian moments.
4.6.3 Consider transforming the prior Gaussian density, N (µx , σx2 )
through the nonlinearity, f (x) = x4 . Use the Monte Carlo, lin-
earization, and sigmapoint methods to determine the transformed
mean (and optionally covariance) and comment on the results.
Hint: use Isserlis’ theorem to compute the higher-order Gaussian
moments.
4.6.4 From the section on the sigmapoint Kalman filter, we learned
that the measurement covariance could be written as
Σ_yy,k = Σ_{j=0}^{2N} β_j ( y̌_{k,j} − µ_y,k )( y̌_{k,j} − µ_y,k )^T + R_k,
E[ê_0] = 0,   E[ê_0 ê_0^T] = P̂_0,   (5.7)
E[ě_1] = A E[ê_0] − B ū + E[w_1] = −B ū,   (5.8a)
E[ê_1] = (1 − K_1 C) E[ě_1] + K_1 ȳ + E[n_1] = −(1 − K_1 C) B ū + K_1 ȳ,   (5.8b)
which are already biased in the case that ū 6= 0 and/or ȳ 6= 0. For the
covariance of the ‘predicted error’ we have
E[ě_1 ě_1^T] = E[ (A ê_0 − (B ū + w_1)) (A ê_0 − (B ū + w_1))^T ]
            = E[ (A ê_0 − w_1)(A ê_0 − w_1)^T ] + (−B ū) E[ (A ê_0 − w_1) ]^T + ⋯,
where the first term is P̌_1 and the expectation in the second term is 0, and
P̌_1 = E[ě_1 ě_1^T] − E[ě_1] E[ě_1]^T   (bias effect in the second term),   (5.10)
error’ we have
E[ê_1 ê_1^T] = E[ ( (1 − K_1 C) ě_1 + K_1 (ȳ + n_1) ) ( (1 − K_1 C) ě_1 + K_1 (ȳ + n_1) )^T ]
            = E[ ( (1 − K_1 C) ě_1 + K_1 n_1 ) ( (1 − K_1 C) ě_1 + K_1 n_1 )^T ] + (K_1 ȳ) E[ (1 − K_1 C) ě_1 + K_1 n_1 ]^T + ⋯
            = P̂_1 + ( −(1 − K_1 C) B ū + K_1 ȳ ) ( −(1 − K_1 C) B ū + K_1 ȳ )^T,   (5.11a)
where the first expectation in the middle line is P̂_1 and the second is −(1 − K_1 C) B ū, and so
P̂_1 = E[ê_1 ê_1^T] − E[ê_1] E[ê_1]^T   (bias effect in the second term),   (5.12)
where we can see again the KF’s estimate of the covariance is overcon-
fident and thus inconsistent. It is interesting to note that the KF will
become overconfident, regardless of the sign of the bias. Moreover, it
is not hard to see that as k gets bigger, the effects of the biases grow
without bound. It is tempting to modify the KF to be
y_k = [ C  0 ] x'_k + n_k,   (5.18)
where the combined matrix is denoted C'.
Let us assume these conditions do indeed hold for the system in the
Figure 5.1 Input bias on acceleration: x_k = x_{k−1} + v_k, v_k = v_{k−1} + a_k + ā_k, with interoceptive measurement d_k = x_k. In this case we can successfully estimate the bias as part of our state estimation problem.
case that the bias is zero, i.e., ū = 0. Defining
O' = [ C' ; C'A' ; ⋮ ; C'A'^{N+U−1} ],   (5.20)
we are required to show that
Q' > 0,   R > 0   (both true by definition),   rank O' = N + U,   (5.21)
at, is much more difficult than in the GPS case. Because the
star chart can be generated in advance, this system becomes
practical.
There are essentially two main approaches to data association: external
and internal.
5.3.1 RANSAC
Random sample consensus (RANSAC) is an iterative method to fit
a parameterized model to a set of observed data containing outliers.
Outliers are measurements that do not ‘fit’ a model, while inliers do
‘fit’. RANSAC is a probabilistic algorithm in the sense that its ability
to find a reasonable answer can only be guaranteed to occur with a
certain probability that improves with more time spent in the search.
Figure 5.6 provides a classic line-fitting example in the presence of
outliers.
RANSAC proceeds in an iterative manner. In the basic version, each
iteration consists of the following five steps:
1. Select a (small) random subset of the original data to be hypothe-
sized inliers (e.g., pick two points if fitting a line to xy-data).
2. Fit a model to the hypothesized inliers (e.g., a line is fit to two
points).
3. Test the rest of the original data against the fitted model and classify
as either inliers or outliers. If too few inliers are found, the iteration
is labelled invalid and aborted.
4. Refit the model using both the hypothesized and classified inliers.
5. Evaluate the refit model in terms of the residual error of all the inlier
data.
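A minimal sketch of these five steps for the line-fitting example might look as follows; the threshold, iteration count, and minimum inlier count are tuning assumptions.

import numpy as np

def ransac_line(points, n_iters=200, threshold=0.1, min_inliers=20,
                rng=np.random.default_rng()):
    best_line, best_err = None, np.inf
    for _ in range(n_iters):
        p1, p2 = points[rng.choice(len(points), 2, replace=False)]    # step 1
        a, b = np.polyfit([p1[0], p2[0]], [p1[1], p2[1]], 1)          # step 2: y = a x + b
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = points[resid < threshold]                           # step 3
        if len(inliers) < min_inliers:
            continue                                                  # iteration invalid
        a, b = np.polyfit(inliers[:, 0], inliers[:, 1], 1)            # step 4: refit
        err = np.mean((inliers[:, 1] - (a * inliers[:, 0] + b)) ** 2) # step 5
        if err < best_err:
            best_line, best_err = (a, b), err
    return best_line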
5.3.2 M-Estimation
Many of our earlier estimation techniques were shown to be minimizing
a sum-of-squared-error cost function. The trouble with sum-of-squared-error cost functions is that they are highly sensitive to outliers. A
single large outlier can exercise a huge influence on the estimate because
it dominates the cost. M-estimation3 modifies the shape of the cost
function so that outliers do not dominate the solution.
Recall that in Section 4.3.1 we wrote our overall nonlinear objective
function (for batch estimation) in the form
J(x) = (1/2) u(x)^T u(x),   (5.38)
which is quadratic. We also showed (in the linear case) that minimizing
this is equivalent to maximizing the likelihood of all the measurements.
The gradient of our original objective function was
∂J(x)/∂x = u(x)^T ∂u(x)/∂x.   (5.39)
Let us now generalize this objective function and write it as
J 0 (x) = ρ (u(x)) , (5.40)
where ρ(·) is some nonlinear cost function; assume it is bounded, posi-
tive definite, has a unique zero at u(x) = 0, and increases more slowly
3 ‘M’ stands for ‘maximum likelihood-type’, i.e., a generalization of maximum likelihood
(which we saw earlier was equivalent to the least-squares solution).
Figure: the Geman-McClure cost function, ρ(u) = (1/2) u² / (1 + u²).
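As an illustration (not from the text), a robust cost such as the Geman-McClure function shown in the figure can be folded into a Gauss-Newton-style loop via iteratively reweighted least squares, where each residual is weighted by ρ'(u)/u so that large residuals (outliers) contribute little; the error function u and its Jacobian are assumed to be supplied by the caller.

import numpy as np

def gm_weight(u):
    # rho(u) = 0.5 * u^2 / (1 + u^2)  gives  rho'(u)/u = 1 / (1 + u^2)^2
    return 1.0 / (1.0 + u ** 2) ** 2

def irls(u, u_jac, x0, n_iter=20):
    x = x0.copy()
    for _ in range(n_iter):
        r = u(x)
        U = u_jac(x)
        W = np.diag(gm_weight(r))                     # robust weights per residual
        dx = np.linalg.solve(U.T @ W @ U, -U.T @ W @ r)
        x = x + dx
    return x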
5.4 Summary
The main take-away points from this chapter are:
1. There are always non-idealities (e.g., biases, outliers) that make the
real estimation problem different from the clean mathematical se-
tups discussed in this book. Sometimes these deviations result in
performance reductions that are the main source of error in prac-
tice.
2. In some situations, we can fold the estimation of a bias into our
estimation framework and in others we cannot. This comes down to
the question of observability.
3. In most practical estimation problems, outliers are a reality and thus using some form of preprocessing (e.g., RANSAC) as well as a robust cost function that downplays the effect of outliers is a necessity.
The next part of the book will introduce techniques for handling state
estimation in a three-dimensional world where objects are free to trans-
late and rotate.
5.5 Exercises
5.5.1 Consider the discrete-time system,
xk = xk−1 + vk + v̄k ,
d k = xk ,
where v̄k is an unknown input bias. Set up the augmented-state
system and determine if this system is observable.
5.5.2 Consider the discrete-time system,
xk = xk−1 + vk ,
vk = vk−1 + ak ,
d1,k = xk ,
d2,k = xk + d¯k ,
where d¯k is an unknown input bias (on just one of the two mea-
surement equations). Set up the augmented-state system and de-
termine if this system is observable.
5.5.3 How many RANSAC iterations, k, would be needed to pick a
set of n = 3 inlier points with probability p = 0.999, given that
each point has probability w = 0.1 of being an inlier?
5.5.4 What advantage might the robust cost function,
ρ(u) = { (1/2) u²,                  u² ≤ 1
         2u²/(1 + u²) − 1/2,        u² ≥ 1 },
have over the Geman-McClure cost function?
Part II
Three-Dimensional Machinery
6
Primer on Three-Dimensional Geometry
Figure 6.1 Vehicle and typical reference frames: the moving vehicle frame, F→_v, and the stationary frame, F→_i, related by the vector r→^{vi}.
A vector, r→, can be expressed in a reference frame, F→_1, as
r→ = r_1 1→_1 + r_2 1→_2 + r_3 1→_3 = [ r_1  r_2  r_3 ] [ 1→_1 ; 1→_2 ; 1→_3 ] = r^T F→_1.   (6.1)
The quantity
r = [ r_1 ; r_2 ; r_3 ]
is a column of coordinates. A second vector, s→, can be written similarly as
s→ = [ 1→_1  1→_2  1→_3 ] [ s_1 ; s_2 ; s_3 ],
and the dot product of the two vectors is
r→ · s→ = r^T 1 s = r^T s = r_1 s_1 + r_2 s_2 + r_3 s_3.
The notation 1 will be used to designate the identity matrix. Its dimension can be inferred from context.
6.2 Rotations
Critical to our ability to estimate how objects are moving in the world
is the ability to parameterize the orientation, or rotation, of those ob-
jects. We begin by introducing rotation matrices and then provide some
alternative representations.
r_1 = C_21^{−1} r_2 = C_12 r_2.
C_12 = C_21^{−1} = C_21^T.   (6.4)
6.2.3 Alternate Rotation Representations
We have seen one way of discussing the orientation of one reference
frame with respect to another: the rotation matrix. This requires nine
parameters (they are not independent). There are a number of other
alternatives.
The key thing to realize about the different representations of rota-
tions, is that there are always only three underlying degrees of freedom.
The representations that have more than three parameters must have
associated constraints to limit the number of degrees of freedom to
three. The representations that have exactly three parameters have as-
sociated singularities. There is no perfect representation that is minimal (i.e., having only three parameters) and that is also free of singularities (Stuelpnagel, 1964).
Leonhard Euler (1707-1783) is considered to be the preeminent mathematician of the 18th century and one of the greatest mathematicians to have ever lived. He made important discoveries in fields as diverse as infinitesimal calculus and graph theory. He also introduced much of the modern mathematical terminology and notation, particularly for mathematical analysis, such as the notion of a mathematical function. He is also renowned for his work in mechanics, fluid dynamics, optics, astronomy, and music theory.
Euler Angles
The orientation of one reference frame with respect to another can also be specified by a sequence of three principal rotations. One possible sequence is as follows:
(i) A rotation ψ about the original 3-axis,
(ii) A rotation γ about the intermediate 1-axis,
(iii) A rotation θ about the transformed 3-axis.
This transformation is called a 3-1-3 transformation and is the sequence originally used by Euler. In the classical mechanics literature,
Infinitesimal Rotations
Consider the 1-2-3 transformation when the angles θ_1, θ_2, θ_3 ≪ 1 (i.e., 'small' angles). In this case, we make the approximations c_i ≈ 1, s_i ≈ θ_i
Euler Parameters
Euler’s theorem says that the most general motion of a rigid body with
one point fixed is a rotation about an axis through that point.
Let us denote the axis of rotation by a = [ a_1  a_2  a_3 ]^T and assume that it is a unit vector:
a^T a = a_1² + a_2² + a_3² = 1.   (6.11)
The angle of rotation is φ. We state, without proof, that the rotation matrix in this case is given by
C_21 = cos φ 1 + (1 − cos φ) a a^T − sin φ a^×.   (6.12)
It does not matter in which frame a is expressed because
C_21 a = a.   (6.13)
The combination of variables,
η = cos(φ/2),   ε = a sin(φ/2) = [ a_1 sin(φ/2) ; a_2 sin(φ/2) ; a_3 sin(φ/2) ] = [ ε_1 ; ε_2 ; ε_3 ],   (6.14)
is particularly useful. The four parameters {ε, η} are called the Eu-
ler parameters associated with a rotation2 . They are not independent
because they satisfy the constraint
η 2 + ε21 + ε22 + ε23 = 1.
2 These are sometimes referred to as unit-length quaternions when stacked as q = [ ε ; η ]. These are discussed in more detail below.
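A minimal sketch of building the rotation matrix in (6.12) and the Euler parameters in (6.14) from an axis and angle:

import numpy as np

def skew(a):
    return np.array([[0, -a[2], a[1]],
                     [a[2], 0, -a[0]],
                     [-a[1], a[0], 0]])

def rotation_axis_angle(a, phi):
    a = np.asarray(a, float) / np.linalg.norm(a)
    return (np.cos(phi) * np.eye(3)
            + (1 - np.cos(phi)) * np.outer(a, a)
            - np.sin(phi) * skew(a))                 # (6.12)

def euler_parameters(a, phi):
    a = np.asarray(a, float) / np.linalg.norm(a)
    eta = np.cos(phi / 2.0)
    eps = a * np.sin(phi / 2.0)                      # (6.14)
    return eps, eta                                  # satisfy eta^2 + eps.eps = 1

# Sanity check of (6.13): the rotation leaves its own axis unchanged, C a = a.
a = np.array([1.0, 2.0, 2.0]) / 3.0
print(rotation_axis_angle(a, 0.7) @ a)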
Angular Velocity
Let frame F→_2 rotate with respect to frame F→_1. The angular velocity of frame 2 with respect to frame 1 is denoted by ω→_21. The angular velocity of frame 1 with respect to frame 2 is ω→_12 = −ω→_21.
The magnitude of ω→_21, |ω→_21| = √(ω→_21 · ω→_21), is the rate of rotation. The direction of ω→_21 (i.e., the unit vector in the direction of ω→_21, which is |ω→_21|^{−1} ω→_21) is the instantaneous axis of rotation.
Observers in the frames F→_2 and F→_1 do not see the same motion because of their own relative motions. Let us denote the vector time derivative as seen in F→_1 by (·)^• and that seen in F→_2 by (·)^◦. Therefore,
F→_1^• = 0→,   F→_2^◦ = 0→.
It can be shown that
2→_1^• = ω→_21 × 2→_1,   2→_2^• = ω→_21 × 2→_2,   2→_3^• = ω→_21 × 2→_3,
or equivalently
[ 2→̇_1  2→̇_2  2→̇_3 ] = ω→_21 × [ 2→_1  2→_2  2→_3 ],
or
F→_2^{•T} = ω→_21 × F→_2^T.   (6.36)
r→ = F→_1^T r_1 = F→_2^T r_2.
Therefore, the time derivative as seen in F→_1 is
r→^• = F→_1^{•T} r_1 + F→_1^T ṙ_1 = F→_1^T ṙ_1.   (6.37)
In a similar way,
r→^◦ = F→_2^{◦T} r_2 + F→_2^T r_2^◦ = F→_2^T ṙ_2.   (6.38)
(Note that for nonvectors, (˙) = (◦), i.e., r_2^◦ = ṙ_2.)
Alternatively, the time derivative as seen in F→_1, but expressed in F→_2, is
r→^• = F→_2^T ṙ_2 + F→_2^{•T} r_2
     = F→_2^T ṙ_2 + ω→_21 × F→_2^T r_2
     = r→^◦ + ω→_21 × r→.   (6.39)
The above is true for any vector r→. The most important application occurs when r→ denotes position, F→_1 is a nonrotating inertial reference frame, and F→_2 is a frame that rotates with a body, vehicle, etc. In this case, (6.39) expresses the velocity in the inertial frame in terms of the motion in the second frame.
Now, express the angular velocity in F→_2:
ω→_21 = F→_2^T ω_2^{21}.   (6.40)
Therefore,
r→^• = F→_1^T ṙ_1 = F→_2^T ṙ_2 + ω→_21 × r→
     = F→_2^T ṙ_2 + F→_2^T ω_2^{21×} r_2
     = F→_2^T ( ṙ_2 + ω_2^{21×} r_2 ).   (6.41)
If we want to express the 'inertial time derivative' (that seen in F→_1) in F→_1, then we can use the rotation matrix C_12:
ṙ_1 = C_12 ( ṙ_2 + ω_2^{21×} r_2 ).   (6.42)
Acceleration
Let us denote the velocity by
v→ = r→^• = r→^◦ + ω→_21 × r→.
r→^{••} = F→_1^T r̈_1,   r→^{◦◦} = F→_2^T r̈_2,   ω→_21^◦ = F→_2^T ω̇_2^{21}.
0→ = ω→_21 × F→_2^T C_21 + F→_2^T Ċ_21.
Now use (6.40) to get
0→ = ( ω_2^{21T} F→_2 ) × F→_2^T C_21 + F→_2^T Ċ_21 = F→_2^T ( ω_2^{21×} C_21 + Ċ_21 ).
Siméon Denis Poisson (1781-1840) was a French mathematician, geometer, and physicist.
Therefore, we conclude that
Ċ_21 = −ω_2^{21×} C_21,   (6.45)
which is known as Poisson's equation. Given the angular velocity as
ω_2^{21×} = −Ċ_21 C_21^{−1} = −Ċ_21 C_21^T,   (6.46)
which gives the angular velocity when the rotation matrix is known as a function of time.
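As a small numerical illustration (not from the text), one can integrate Poisson's equation (6.45) forward with an assumed constant angular velocity and then recover that angular velocity from (6.46):

import numpy as np

def skew(w):
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

omega = np.array([0.1, -0.2, 0.3])     # omega_2^{21}, assumed constant here
dt, C = 1e-4, np.eye(3)
for _ in range(10000):
    C = C + dt * (-skew(omega) @ C)    # Cdot_21 = -omega^x C_21, (6.45)
U, _, Vt = np.linalg.svd(C)            # re-orthogonalize: first-order integration
C = U @ Vt                             # drifts slightly off a valid rotation

Cdot = -skew(omega) @ C
Omega = -Cdot @ C.T                    # (6.46): omega^x = -Cdot C^T
print(np.array([Omega[2, 1], Omega[0, 2], Omega[1, 0]]))   # recovers omega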
Euler Angles
Consider the 1-2-3 Euler angle sequence and its associated rotation matrix. In this case (6.46) becomes
ω_2^{21×} = −C_3 C_2 Ċ_1 C_1^T C_2^T C_3^T − C_3 Ċ_2 C_2^T C_3^T − Ċ_3 C_3^T.   (6.47)
Then, using
−Ċ_i C_i^T = 1_i^× θ̇_i,   (6.48)
which gives the angular velocity in terms of the Euler angles and the Euler rates, θ̇. In scalar detail we have
S(θ_2, θ_3) = [  cos θ_2 cos θ_3   sin θ_3   0
                −cos θ_2 sin θ_3   cos θ_3   0
                 sin θ_2           0         1 ].   (6.52)
4 This is termed 'strapdown navigation' because the sensors that measure ω_2^{21} are strapped down in the rotating frame, F→_2.
θ̇ = S^{−1}(θ_2, θ_3) ω_2^{21}
   = [  sec θ_2 cos θ_3   −sec θ_2 sin θ_3   0
        sin θ_3            cos θ_3           0
       −tan θ_2 cos θ_3    tan θ_2 sin θ_3   1 ] ω_2^{21}.   (6.53)
Note that S^{−1} does not exist at θ_2 = π/2, which is precisely the singularity associated with the 1-2-3 sequence.
It should be noted that the above developments hold true in general for any Euler sequence. If we pick an α-β-γ set,
C_21(θ_1, θ_2, θ_3) = C_γ(θ_3) C_β(θ_2) C_α(θ_1),   (6.54)
then
S(θ_2, θ_3) = [ C_γ(θ_3) C_β(θ_2) 1_α   C_γ(θ_3) 1_β   1_γ ],   (6.55)
and S^{−1} does not exist at the singularities of S.
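A minimal sketch of the mapping S(θ_2, θ_3) in (6.52) and its singularity for the 1-2-3 sequence:

import numpy as np

def S_123(t2, t3):
    return np.array([[np.cos(t2) * np.cos(t3),  np.sin(t3), 0],
                     [-np.cos(t2) * np.sin(t3), np.cos(t3), 0],
                     [np.sin(t2),               0,          1]])

print(np.linalg.cond(S_123(0.5, 0.3)))               # well conditioned
print(np.linalg.cond(S_123(np.pi / 2 - 1e-6, 0.3)))  # blows up near theta2 = pi/2
# Euler rates from body angular velocity: theta_dot = inv(S) @ omega, cf. (6.53)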
r^× s ≡ −s^× r,   (6.62a)
(R s)^× ≡ R s^× R^T,   (6.62b)
for any vectors r, s and any rotation matrix, R. Combining the results
in (6.61) we have
∂(C(θ)v)/∂θ = [ ∂(C(θ)v)/∂θ_1   ∂(C(θ)v)/∂θ_2   ∂(C(θ)v)/∂θ_3 ]
            = (C(θ)v)^× [ C_γ(θ_3) C_β(θ_2) 1_α   C_γ(θ_3) 1_β   1_γ ],   (6.63)
where the bracketed factor is S(θ_2, θ_3).
we have
C(θ̄ + δθ) v ≈ C(θ̄) v + ∂(C(θ)v)/∂θ |_{θ̄} δθ
            = C(θ̄) v + ( (C(θ)v)^× S(θ_2, θ_3) ) |_{θ̄} δθ
            = C(θ̄) v + ( C(θ̄) v )^× S(θ̄_2, θ̄_3) δθ
            = C(θ̄) v − ( S(θ̄_2, θ̄_3) δθ )^× C(θ̄) v
            = ( 1 − ( S(θ̄_2, θ̄_3) δθ )^× ) C(θ̄) v,   (6.67)
where we have used (6.64) to get to the second line. Observing that v is arbitrary, we can drop it from both sides and write
C(θ̄ + δθ) ≈ ( 1 − ( S(θ̄_2, θ̄_3) δθ )^× ) C(θ̄),   (6.68)
where the first factor is an infinitesimal rotation matrix.
Example 6.1 The following example shows how we can apply our
linearized rotation expression in an arbitrary expression. Suppose we
have a scalar function, J, given by
where
we see the factor in front of δθ is indeed constant; in fact, it is ∂J/∂θ |_{θ̄}, the Jacobian of J with respect to θ.
6.3 Poses
We have spent considerable effort discussing the rotational aspect of a moving body. We now introduce the notion of translation. Together, the translation and rotation of a body are referred to as the pose. Pose estimation problems are often concerned with transforming the coordinates of a point, P, between a moving (in translation and rotation) vehicle frame and a stationary frame, as depicted in Figure 6.2.
We can relate the vectors in Figure 6.2 as follows:
r→^{pi} = r→^{pv} + r→^{vi},   (6.73)
where we have not yet selected any particular reference frame in which
to express the relationship. Writing the relationship in the stationary
frame, F→_i, we have
r_i^{pi} = r_i^{pv} + r_i^{vi}.   (6.74)
r_i^{pi} = C_iv r_v^{pv} + r_i^{vi},   (6.75)
{ r_i^{vi}, C_iv },   (6.76)
It makes sense to use θ_vi for orientation; it naturally describes the heading of the vehicle since it is F→_v that is moving with respect to F→_i. However, as discussed in the last section, the rotation matrix that we really care about when constructing the pose is C_iv = C_vi^T = C_3(−θ_vi) = C_3(θ_iv). Importantly, we note that θ_iv = −θ_vi. We do not want to use θ_iv as the heading as that will be quite confusing.
Sticking with θ_vi, the pose of the vehicle can then be written in transformation matrix form as
T_iv = [ C_iv  r_i^{vi} ; 0^T  1 ] = [ cos θ_vi  −sin θ_vi  0  x
                                       sin θ_vi   cos θ_vi  0  y
                                       0           0        1  0
                                       0           0        0  1 ],   (6.86)
where we note the change in sign of the third term due to the skew-symmetric property, (a^×)^T = −a^×. In other words, we are free to use θ_vi rather than θ_iv to construct C_iv.
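A minimal sketch of building T_iv from (6.86) and using it to map a point's homogeneous coordinates from the vehicle frame to the stationary frame:

import numpy as np

def T_iv(theta_vi, x, y):
    c, s = np.cos(theta_vi), np.sin(theta_vi)
    return np.array([[c, -s, 0, x],
                     [s,  c, 0, y],
                     [0,  0, 1, 0],
                     [0,  0, 0, 1]])

p_v = np.array([1.0, 0.0, 0.0, 1.0])     # a point one metre ahead of the vehicle
print(T_iv(np.pi / 2, 2.0, 3.0) @ p_v)   # its coordinates in the stationary frame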
Confusion arises, however, when all the subscripts are dropped and
we simply write
of the path. Stacking the axes into a frame, F→_v,
F→_v = [ t→ ; n→ ; b→ ],   (6.92)
we can write the Frenet-Serret equations as
d/ds F→_v = [  0   κ   0
              −κ   0   τ
               0  −τ   0 ] F→_v.   (6.93)
Multiplying both sides by the speed along the path, v = ds/dt, and right-multiplying by F→_i^T, we have
d/dt ( F→_v · F→_i^T ) = [  0    vκ   0
                           −vκ   0    vτ
                            0   −vτ   0 ] F→_v · F→_i^T,   (6.94)
where the left-hand side is Ċ_vi, the bracketed matrix is −ω_v^{vi∧}, and F→_v · F→_i^T = C_vi,
where we have applied the chain rule. We see that this has recov-
ered Poisson’s equation for rotational kinematics as given previously
in (6.45); the angular velocity expressed in the moving frame,
ω_v^{vi} = [ vτ ; 0 ; vκ ],   (6.95)
is constrained to only two degrees of freedom since the middle entry is
zero. We also have the translational kinematics,
ṙ_i^{vi} = C_vi^T ν_v,   ν_v = [ v ; 0 ; 0 ].   (6.96)
To express this in the body frame, we note that
ṙ_v^{iv} = d/dt ( −C_vi r_i^{vi} ) = −Ċ_vi r_i^{vi} − C_vi ṙ_i^{vi}
         = ω_v^{vi∧} C_vi r_i^{vi} − C_vi C_vi^T ν_v = −ω_v^{vi∧} r_v^{iv} − ν_v.   (6.97)
Integrating this forward in time provides both the translation and ro-
tation of the moving frame. We can think of (v, κ, τ ) as three inputs in
this case as they determine the shape of the curve that is traced out.
We will see in the next chapter that these kinematic equations can be
generalized to the form
Ṫ = [ ω^∧  −ν ; 0^T  0 ] T,   (6.99)
where
ϖ = [ ν ; ω ],   (6.100)
ẋ = v cos θ, (6.103a)
ẏ = v sin θ, (6.103b)
θ̇ = ω, (6.103c)
tance between the pinhole, S, and the image plane center, C, called the
focal length, is 1 for this idealized camera model.
If the coordinates of P in F→_s are
ρ = r_s^{ps} = [ x ; y ; z ],   (6.104)
with the z-axis orthogonal to the image plane, then the coordinates of
O in the image plane are
xn = x/z, (6.105a)
yn = y/z. (6.105b)
These are called the normalized image coordinates, and are sometimes
provided in a homogeneous form as
xn
p = yn . (6.106)
1
Essential Matrix
If a point, P , is observed by a camera, the camera moved, and then
the same point observed again, the two normalized image coordinates
Figure 6.7 Two camera observations, p_a and p_b, of the same point, P, from camera frames F→_a and F→_b, related by the transformation T_ba.
Lens Distortion
In general, lens effects can distort camera images so that the normal-
ized image coordinate equations are only approximately true. A variety
of analytical models of this distortion are available and these can be
used to correct the raw images such that the resulting images appear
as though they come from an idealized pinhole camera, and thus the
normalized image coordinate equations hold. We will assume this undis-
tortion procedure has been applied to the images and avoid elaborating
on the distortion models.
Intrinsic Parameters
The normalized image coordinates are really associated with a hypo-
thetical camera with unit focal length and image origin at the opti-
cal axis intersection. We can map the normalized image coordinates,
196 Primer on Three-Dimensional Geometry
Figure 6.8
Camera model
showing intrinsic
parameters: f is (cu , cv ) P
the focal length,
(cu , cv ) is the f (x, y, z)
optical axis
intersection. (u, v)
(xn , yn ), to the actual pixel coordinates, (u, v), through the following
relation:
u fu 0 cu xn
v = 0 fv cv yn , (6.113)
1 0 0 1 1
| {z }
K
where K is called the intrinsic parameter matrix and contains the ac-
tual camera focal length expressed in horizontal pixels, fu , and vertical
pixels, fv , as well as the actual offset of the image origin from the
optical axis intersection, (cu , cv ), also expressed in horizontal, vertical
pixels7 . These intrinsic parameters are typically determined during the
calibration procedure used to remove the lens effects, so that we can
assume K is known.
Fundamental Matrix
Similarly to the essential matrix constraint, there is a constraint that
can be expressed between the homogeneous pixel coordinates of two
observations of a point from different camera perspectives (and possibly
even different cameras). Let
qi = Ki pi , (6.114)
where
−1
Fab = K−T
a Eab Kb , (6.116)
7 On many imaging sensors, the pixels are not square, resulting in different units in the
horizontal and vertical directions.
6.4 Sensor Models 197
Complete Model
Combining everything but the lens effects, the perspective camera model
can be written as
u 1
= s(ρ) = P K ρ, (6.118)
v z
where
fu 0 cu x
1 0 0
P= , K=0 fv cv , ρ = y . (6.119)
0 1 0
0 0 1 z
P is simply a projection matrix to remove the bottom row from the
homogeneous point representation. This form of the model makes it
clear that with a single camera, there is a loss of information as we are
Figure 6.9 If a
point is observed in
one image, qa , and
epipolar line the fundamental
qTa Fab qb = 0 matrix, Fab , is
qa known, this can be
used to define a
line in the second
Fa image along which
! the second
Fb
! observation, qb ,
must lie.
Tba
198 Primer on Three-Dimensional Geometry
going from three parameters in ρ to just two in (u, v); we are unable
to determine depth from just one camera.
Homography
Although we cannot determine depth from just one camera, if assume
that the point a camera is observing lies on the surface of a plane whose
geometry is known, we can work out the depth and then how that point
will look to another camera. The geometry of this situation is depicted
in Figure 6.10.
The homogeneous observations for the two cameras can be written
as
xi
1
qi = Ki ρi , ρi = yi , (6.120)
zi zi
where ρi are the coordinates of P in each camera frame with i =
a, b. Let us assume we know the equation of the plane containing P ,
expressed in both camera frames; this can be parameterized as
{ni , di } , (6.121)
where di is the distance of camera i from the plane and ni are the
coordinates of the plane normal in frame i. This implies that
nTi ρi + di = 0, (6.122)
since P is in the plane. Solving for ρi in (6.120) and substituting into
the plane equation we have
zi nTi K−1
i qi + di = 0, (6.123)
or
di
zi = − , (6.124)
nTi K−1
i qi
Figure 6.10 If
the point observed Tba
by a camera lies on B
a plane whose A
geometry is known,
it is possible to Fa r pb Fb
!
work out what ! r pa !
that point will look ! db
like after the da
camera makes a
pose change using
a transform called n
!
a homography. P
6.4 Sensor Models 199
or
ρb = Cba ρa + rab
b . (6.127)
Inserting (6.120), we have that
zb K−1 −1 ab
b qb = za Cba Ka qa + rb . (6.128)
We can then isolate for qb in terms of qa :
za 1
qb = Kb Cba K−1
a qa + Kb rba
b . (6.129)
zb zb
Then, substituting zb from (6.124) we have
za 1 ba T
qb = Kb Cba 1 + ra na K−1 a qa , (6.130)
zb da
where we used that rab ba
b = −Cba ra . Finally, we can write
qb = Kb Hba K−1
a qa , (6.131)
where
za 1 ba T
Hba = Cba 1 + ra na , (6.132)
zb da
is called the homography matrix. Since the factor za /zb just scales qb , it
can be dropped in practice owing to the fact that qb are homogeneous
coordinates and the true pixel coordinates can always be recovered by
dividing the first two entries by the third; doing so means that Hba is
only a function of the pose change and the plane parameters.
It is worth noting that in the case of a pure rotation, rba
a = 0, so that
the homography matrix simplifies to
Hba = Cba , (6.133)
when the za /zb factor is dropped.
200 Primer on Three-Dimensional Geometry
Midpoint Model
F s as
If we express the coordinates of the point, P , in →
−
x
ρ = rpss = y , (6.135)
z
then the model for the left camera is
x + 2b
u` 1
= PK y , (6.136)
v` z z
and the model for the right camera is
x − 2b
ur 1
= PK y , (6.137)
vr z z
Figure 6.11
Stereo camera rig.
P
Two cameras are left image
mounted pointing
in the same
direction but with q`
a known separation
F` qr
of b along the
!
x-axis. We choose Fs
!
the sensor frame to
Fr right image
be located at the b/2 !
midpoint between b/2
the two cameras.
stereo baseline
6.4 Sensor Models 201
where we assume the two cameras have the same intrinsic parameter
matrix. Stacking the two observations together we can write the stereo
camera model as
u` fu 0 cu fu 2b x
v` 0 fv cv 0 1 y
= s(ρ) =
ur fu 0 cu −fu b z z , (6.138)
2
vr 0 fv cv 0 1
| {z }
M
Left Model
We could also choose to locate the sensor frame at the left camera
rather than the midpoint between the two cameras. In this case, the
Figure 6.12
P
Alternate stereo
left image model with the
sensor frame
located at the left
q` camera.
F s, !
F` qr
!
Fr right image
b !
stereo baseline
202 Primer on Three-Dimensional Geometry
6.4.3 Range-Azimuth-Elevation
Some sensors, such as lidar (light detection and ranging) can be mod-
elled as a range-azimuth-elevation (RAE), which essentially observes a
point, P , in spherical coordinates. For lidar, which can measure dis-
tance by reflecting laser pulses off a scene, the azimuth and elevation
are the angles of the mirrors that are used to steer the laser beam and
the range is the reported distance determined by time of flight. The
geometry of this sensor type is depicted in Figure 6.13.
The coordinates of point P in the sensor frame, → F s , are
−
x
ps
ρ = rs = y . (6.143)
z
8 The disparity equation can be used as a one-dimensional stereo camera model, as we
have already seen in the earlier chapter on nonlinear estimation.
Figure 6.13 A
range-azimuth- z r range P
elevation sensor Fs r ps
!
model observes a
! y
point P in ✏ elevation
spherical S
coordinates. x ↵ azimuth
6.4 Sensor Models 203
Naturally, if the offset between the sensor and vehicle frames, rsv
v , is
sufficiently small, we may choose to neglect the last two terms.
10 If the angular acceleration quantity, ω̇ vi
v , is not actually part of the state, it could be
determined from two or more gyro measurements.
6.5 Summary 205
Figure 6.15 For
vehicle high-performance
topocentric
inertial-
Ft Fv V measurement-unit
! r vt
! ! applications, it is
r sv
! necessary to
r ti
!
account for the
sensor rotation of the
Earth-centred inertial F i S Earth, in which
! Fs case a topometric
!
I reference frame is
g located on the
! Earth’s surface and
an inertial frame is
located at the
Earth’s center.
where Csv and rsv v are the (known) pose change between the vehicle
and sensor frames and gi is gravity in the inertial frame.
For some high-performance inertial-measurement-unit applications,
the above model is insufficient since it assumes an inertial reference
frame can be located conveniently on the Earth’s surface, for example.
High-end IMU units, however, are sensitive enough to detect the ro-
tation of the Earth and thus a more elaborate model of the sensor is
required. The typical setup is depicted in Figure 6.15, where the iner-
tial frame is located at the Earth’s center of mass (but not rotating)
and then a convenient (non-inertial) reference frame (used to track the
vehicle’s motion) is located on the Earth’s surface. This requires gen-
eralizing the sensor model to account for this more sophisticated setup
(not shown).
6.5 Summary
The main take-away points from this chapter are:
6.6 Exercises
6.6.1 Show that u v ≡ −v∧ u for any two 3 × 1 columns u and v.
∧
is
zb 1
H−1
ba = Cab 1 + rab n T
.
za db b b
6.6.8 Work out the stereo camera model for the case when the sensor
frame is located at the right camera instead of the left or the
midpoint.
6.6.9 Work out the inverse of the left stereo camera model. In other
words, how can we go from (u` , v` , d) back to the point coordinates,
(x, y, z)?
6.6.10 Work out an IMU model for the situation depicted in Fig-
ure 6.15, where it is necessary to account for the rotation of the
Earth about its axis.
7
we have that
det C = ±1, (7.3)
209
210 Matrix Lie Groups
1 There is another case in which det C = −1, sometimes called an improper rotation or
rotary reflection, but we shall not be concerned with it here.
2 A subspace of a vectorspace is also a vectorspace.
3 They are actually non-Abelian (or non-commutative) groups since the order in which
we compound elements matters.
4 Smoothness implies that we can use differential calculus on the manifold; or, roughly, if
we change the input to any group operation by a little bit, the output will only change
by a little bit.
7.1 Geometry 211
closure: [X, Y] ∈ V,
bilinearity: [aX + bY, Z] = a[X, Z] + b[Y, Z],
[Z, aX + bY] = a[Z, X] + b[Z, Y],
alternating: [X, X] = 0,
Jacobi identity: [X, [Y, Z]] + [Z, [Y, X]] + [Y, [Z, X]] = 0,
for all X, Y, Z ∈ V and a, b ∈ F. The vectorspace of a Lie algebra is
the tangent space of the associated Lie group at the identity element of
the group, and it completely captures the local structure of the group.
Rotations
The Lie algebra associated with SO(3) is given by
vectorspace: so(3) = Φ = φ∧ ∈ R3×3 | φ ∈ R3 , ,
field: R,
Lie bracket: [Φ1 , Φ2 ] = Φ1 Φ2 − Φ2 Φ1 ,
where
∧
φ1 0 −φ3 φ2
φ∧ = φ2 = φ3 0 −φ1 ∈ R3×3 , φ ∈ R3 . (7.10)
φ3 −φ2 φ1 0
We already saw this linear, skew-symmetric operator in the previous
chapter during our discussion of cross products and rotations. Later,
we will also make use of the inverse of this operator, denoted (·)∨ , so
that
Φ = φ∧ ⇒ φ = Φ∨ . (7.11)
We will omit proving that so(3) is a vectorspace, but will briefly show
that the four Lie bracket properties hold. Let Φ, Φ1 = φ∧1 , Φ2 = φ∧2 ∈
so(3). Then for the closure property we have
∧
[Φ1 , Φ2 ] = Φ1 Φ2 − Φ2 Φ1 = φ∧1 φ∧2 − φ∧2 φ∧1 = φ∧1 φ2 ∈ so(3). (7.12)
| {z }
∈R3
where
f ∧
f ρ φ ρ∧
ξ = = ∈ R6×6 , ρ, φ ∈ R3 . (7.17)
φ 0 φ∧
Bilinearity follows directly from the fact that (·)∧ is a linear operator.
The alternating property can be seen easily through
[Ξ, Ξ] = ΞΞ − ΞΞ = 0 ∈ se(3). (7.18)
Finally, the Jacobi identity can be verified by substituting and applying
the definition of the Lie bracket. Again, we will refer to se(3) as the Lie
algebra, although technically this is only the associated vectorspace.
In the next section, we will make clear the relationships between our
matrix Lie groups and their associated Lie algebras:
SO(3) ↔ so(3),
SE(3) ↔ se(3),
For this we require the exponential map.
Rotations
For rotations, we can relate elements of SO(3) to elements of so(3)
through the exponential map:
∞
X 1 n
C = exp φ∧ = φ∧ , (7.21)
n=0
n!
a∧ a∧ ≡ −1 + aaT , (7.24a)
a∧ a∧ a∧ ≡ −a∧ , (7.24b)
the proofs of which are left to the reader. This shows that every φ ∈ R3
will generate a valid C ∈ SO(3). It also shows that if we add a multiple
of 2π to the angle of rotation, we will generate the same C. In detail,
we have
C = exp((φ + 2πm) a∧ ), (7.25)
with m any positive integer, since cos(φ + 2πm) = cos φ and sin(φ +
2πm) = sin φ. If we limit the angle of rotation, |φ| < π, of the input,
then each C can only be generated by one φ.
Additionally, we would like to show that every C ∈ SO(3) can be
generated by some φ ∈ R3 and for that we need the inverse (logarith-
mic) mapping: φ = ln(C)∨ . We can also work this out in closed form.
Since a rotation matrix applied to its own axis does not alter the axis,
Ca = a, (7.26)
this implies that a is a (unit-length) eigenvector of C corresponding to
an eigenvalue of 1. Thus, by solving the eigenproblem associated with
C, we can find a9 . The angle can be found by exploiting the trace (sum
of the diagonal elements) of a rotation matrix:
tr(C) = tr cos φ 1 + (1 − cos φ)aaT + sin φ a∧
= cos φ tr(1) +(1 − cos φ) tr aaT + sin φ tr (a∧ ) = 2 cos φ + 1.
| {z } | {z } | {z }
3 aT a=1 0
(7.27)
Solving we have
−1 tr(C) − 1
φ = cos + 2πm, (7.28)
2
which indicates there are many solutions for φ. By convention, we will
pick the one such that |φ| < π. To complete the process, we combine
a and φ according to φ = φ a. This shows that every C ∈ SO(3) can
be built from at least one φ ∈ R3 .
Figure 7.1 provides a simple example of the relationship between the
Lie group and Lie algebra for the case of rotation constrained to the
plane. We see that in a neighbourhood near the zero-rotation point,
θvi = 0, the Lie algebra vectorspace is a just a line that is tangent
9 There are some subtleties that occur when there is more than one eigenvalue equal to
1. E.g., C = 1, whereupon a is not unique and can be any unit vector.
216 Matrix Lie Groups
Figure 7.1
Example of the
SO(3)
relationship so(3)
between the Lie
group and Lie Fv (1, ✓vi , 0)
algebra for the
!
case of rotation
02 3^ 1 2 3
✓vi 0 1
constrained to the C3 (✓vi ) 11 = exp @4 0 5 A 405
plane. In a small
Fi
!
✓vi 0
neighbourhood 0 2 3^ 1 2 3 2 3
around the θvi = 0 0 1 1
point, the ⇡ @1 + 4 0 5 A 405 = 4✓vi 5
vectorspace ✓vi 0 0
associated with the
Lie algebra is a
line tangent to the
circle. to the circle of rotation. We see that indeed, near zero rotation, the
Lie algebra captures the local structure of the Lie group. It should be
pointed out that this example is constrained to the plane (i.e., a single
rotational degree of freedom) but in general the dimension of the Lie
algebra vectorspace is three. Put another way, the line in the figure is
a one-dimensional subspace of the full three-dimensional Lie algebra
vectorspace.
Connecting rotation matrices with the exponential map makes it easy
to show that det(C) = 1 using Jacobi’s formula, which for a general
square complex matrix, A, says
det (exp (A)) = exp (tr(A)) . (7.29)
In the case of rotations, we have
det(C) = det exp (φ∧ ) = exp tr(φ∧ ) = exp(0) = 1, (7.30)
∧
since φ is skew-symmetric and therefore has zeros on its diagonal
making its trace zero.
Poses
For poses, we can relate elements of SE(3) to elements of se(3), again
through the exponential map:
∞
X 1 ∧ n
T = exp ξ ∧ = ξ , (7.31)
n=0
n!
where T ∈ SE(3) and ξ ∈ R6 (and hence ξ ∧ ∈ se(3)). We can also go
in the other direction10 using
∨
ξ = ln (T) . (7.32)
The exponential map from se(3) to SE(3) is also surjective: every ξ ∈
R6 maps to some T ∈ SE(3) (many-to-one) and every T ∈ SE(3) can
be generated by at least one ξ ∈ R6 .
10 Again, not uniquely.
7.1 Geometry 217
where
∞
X 1 n
r = Jρ ∈ R3 , J= φ∧ . (7.34)
n=0
(n + 1)!
Figure 7.2 By
x y z varying each of the
components of ξ,
constructing
T = exp (ξ∧ ), and
2 3 2 3 2 3 then using this to
x 0 0 transform the
607 6y 7 607
6 7 6 7 6 7 points comprising
607 607 6z 7
translation ⇠=6
607
7 ⇠=6
607
7 ⇠=6
607
7
the corners of a
6 7 6 7 6 7
405 405 405 rectangular prism,
0 0 0 we see that the
prism’s pose can
be translated and
rotated.
Combining these
basic movements
can result in any
2 3 2 3 2 3
0 0 0 arbitrary pose
607 607 607
6 7 6 7 6 7 change of the
607 607 607
rotation ⇠=6
6↵ 7
7 ⇠=6
607
7 ⇠=6
607
7 prism.
6 7 6 7 6 7
405 4 5 405
0 0
218 Matrix Lie Groups
Jacobian
The matrix, J, described just above plays an important role in allow-
ing us to convert the translation component of pose in se(3) into the
translation component of pose in SE(3) through r = Jρ. This quantity
appears in other situations as well when dealing with our matrix Lie
groups and we will learn later on in this chapter that this is called the
(left) Jacobian of SO(3). In this section, we will derive some alternate
forms of this matrix that are sometimes useful.
We have defined J as
X∞
1 n
J= φ∧ . (7.36)
n=0
(n + 1)!
By expanding this series and manipulating, we can show the following
closed-form expressions for J and its inverse:
sin φ sin φ 1 − cos φ ∧
J= 1+ 1− aaT + a , (7.37a)
φ φ φ
φ φ φ φ φ
J−1 = cot 1 + 1 − cot aaT − a∧ , (7.37b)
2 2 2 2 2
where φ = |φ| is the angle of rotation and a = φ/φ is the unit-length
axis of rotation. Due to the nature of the cot(φ/2) function, there are
singularities associated with J (i.e., the inverse does not exist) at φ =
2πm with m a non-zero integer.
Occasionally, we will come across the matrix, JJT and its inverse.
Starting with (7.37a), we can manipulate to show
−1 1 1
JJT = γ1 + (1 − γ)aaT , JJT = 1+ 1− aaT ,
γ γ
1 − cos φ
γ=2 . (7.38)
φ2
7.1 Geometry 219
since the first term is only zero when a and x are perpendicular and
the second term is only zero when x and a are parallel (these cannot
happen at the same time). This shows JJT is positive-definite.
It turns out that we can also write J in terms of the rotation matrix,
C, associated with φ in the following way:
Z 1
J= Cα dα. (7.40)
0
7.1.4 Adjoints
There is a 6 × 6 transformation matrix, T , that can be constructed
directly from the components of the 4 × 4 transformation matrix. We
call this the adjoint of an element of SE(3):
C r C r∧ C
T = Ad(T) = Ad = . (7.43)
0T 1 0 C
We will abuse notation a bit and say that the set of adjoints of all the
elements of SE(3) is denoted
Ad(SE(3)) = {T = Ad(T) | T ∈ SE(3)} . (7.44)
220 Matrix Lie Groups
It turns out that Ad(SE(3)) is also a matrix Lie Group, which we show
next.
For closure we let T 1 = Ad(T1 ), T 2 = Ad(T2 ) ∈ Ad(SE(3)), and
then
C1 r1 C2 r2
T 1 T 2 = Ad(T1 )Ad(T2 ) = Ad Ad
0T 1 0T 1
C1 r∧1 C1 C2 r∧2 C2 C1 C2 C1 r∧2 C2 + r∧1 C1 C2
= =
0 C1 0 C2 0 C1 C2
∧
C1 C2 (C1 r2 + r1 ) C1 C2
=
0 C1 C2
C1 C2 C1 r2 + r1
= Ad ∈ Ad(SE(3)), (7.45)
0T 1
where we have used the nice property that
∧
Cv∧ CT = (Cv) , (7.46)
for any C ∈ SO(3) and v ∈ R3 . Associativity follows from basic proper-
ties of matrix multiplication and the identity element of the group is the
6 × 6 identity matrix. For invertibility, we let T = Ad(T) ∈ Ad(SE(3))
and then we have
−1 −1
−1 −1 C r C r∧ C
T = Ad(T) = Ad =
0T 1 0 C
T T ∧
T ∧
C −C r C (−C r) CT
T
= =
0 CT 0 CT
T
C −CT r
= Ad T = Ad T−1 ∈ Ad(SE(3)). (7.47)
0 1
Other than smoothness, these four properties show that Ad(SE(3)) is
a matrix Lie group.
We can also talk about the adjoint of an element of se(3). Let Ξ =
∧
ξ ∈ se(3); then the adjoint of this element is
ad(Ξ) = ad ξ ∧ = ξ f , (7.48)
where
f ∧
f ρ φ ρ∧
ξ = = ∈ R6×6 , ρ, φ ∈ R3 . (7.49)
φ 0 φ∧
Note that we have used uppercase, Ad(·), for the adjoint of SE(3) and
lowercase, ad(·), for the adjoint of se(3).
The Lie algebra associated with Ad(SE(3)) is given by
vectorspace: ad(se(3)) = {Ψ = ad(Ξ) ∈ R6×6 | Ξ ∈ se(3), },
field: R,
Lie bracket: [Ψ1 , Ψ2 ] = Ψ1 Ψ2 − Ψ2 Ψ1 ,
7.1 Geometry 221
∞
X 1 f n
T = exp ξ f = ξ , (7.52)
n=0
n!
n=0 m=0
n! m! 0
∧
where we have used that ∧ is linear and that (Cv) = Cv∧ CT . After
several integrations by parts we can show that
Z 1
n! m!
αn (1 − α)m dα = , (7.59)
0 (n + m + 1)!
∧
and therefore K = (Jρ) C, which is the desired result.
7.1 Geometry 223
7.1.5 Baker-Campbell-Hausdorff
We can combine two scalar exponential functions as follows:
In later sections and chapters, we will (arbitrarily) work with the left
Jacobian and it will therefore be useful to write out the BCH approxi-
mations in (7.71) using only the left Jacobian:
∨ ∨
ln (C1 C2 ) = ln exp(φ∧1 ) exp(φ∧2 )
J(φ2 )−1 φ1 + φ2 if φ1 small
≈ , (7.78)
φ1 + J(−φ1 )−1 φ2 if φ2 small
where it is now implied that J = J` , by convention12 .
Poses
In the particular cases of SE(3) and Ad(SE(3)), we can show that
∨ ∨
ln (T1 T2 ) = ln exp(ξ ∧1 ) exp(ξ ∧2 )
1 1 f f 1
= ξ1 + ξ2 + ξf 1 ξ2 + ξ1 ξ1 ξ2 + ξf ξ f ξ + · · · , (7.79a)
2 12 12 2 2 1
g f g
ln (T 1 T 2 ) = ln exp(ξ f1 ) exp(ξ 2 )
1 1 f f 1
= ξ1 + ξ2 + ξf 1 ξ2 + ξ1 ξ1 ξ2 + ξf ξ f ξ + · · · , (7.79b)
2 12 12 2 2 1
where T1 = exp(ξ ∧1 ), T2 = exp(ξ ∧2 ) ∈ SE(3) and T 1 = exp(ξ f 1 ), T 2 =
exp(ξ f
2 ) ∈ Ad(SE(3)). Alternatively, if we assume that ξ 1 or ξ 2 is
small, then using the approximate BCH formulas we can show that
∨ ∨
ln (T1 T2 ) = ln exp(ξ ∧1 ) exp(ξ ∧2 )
J ` (ξ 2 )−1 ξ 1 + ξ 2 if ξ 1 small
≈ , (7.80a)
ξ 1 + J r (ξ 1 )−1 ξ 2 if ξ 2 small
g f g
ln (T 1 T 2 ) = ln exp(ξ f1 ) exp(ξ 2 )
J ` (ξ 2 )−1 ξ 1 + ξ 2 if ξ 1 small
≈ , (7.80b)
ξ 1 + J r (ξ 1 )−1 ξ 2 if ξ 2 small
where
∞
X
−1 Bn n
J r (ξ) = −ξ f , (7.81a)
n=0
n!
∞
X Bn f n
J ` (ξ)−1 = ξ . (7.81b)
n=0
n!
12 We will use this convention throughout the book and only show the subscript on the
Jacobian when making specific points.
7.1 Geometry 227
In Lie group theory, J r and J ` are referred to as the right and left
Jacobians of SE(3), respectively. Inverting, we have the following ex-
pressions for the Jacobians:
X∞ Z 1
1 f n −α Jr Qr
J r (ξ) = −ξ = T dα = , (7.82a)
(n + 1)! 0 0 Jr
n=0
∞
X Z 1
1 f n α J` Q`
J ` (ξ) = ξ = T dα = , (7.82b)
(n + 1)! 0 0 J`
n=0
where
∞ X
X ∞
1 n m
Q` (ξ) = φ∧ ρ∧ φ∧ (7.83a)
n=0 m=0
(n + m + 2)!
1 φ − sin φ ∧ ∧ ∧ ∧ ∧ ∧ ∧
= ρ∧ + φ ρ + ρ φ + φ ρ φ
2 φ3
φ2
1− − cos φ ∧ ∧ ∧
− 2
4
φ φ ρ + ρ∧ φ∧ φ∧ − 3φ∧ ρ∧ φ∧
φ
2 3 !
1 1 − φ2 − cos φ φ − sin φ − φ6
− −3
2 φ4 φ5
× φ∧ ρ∧ φ∧ φ∧ + φ∧ φ∧ ρ∧ φ∧ , (7.83b)
∧
Qr (ξ) = Q` (−ξ) = C Q` (ξ) + (J` ρ) C J` , (7.83c)
ρ
and T = exp ξ f , T = exp ξ ∧ , C = exp φ∧ , ξ = . The expres-
φ
sion for Q` comes from expanding the series and grouping terms into
the series forms of the trigonometric functions13 . The relations for Qr
come from the relationships between the left and right Jacobians:
J ` (ξ) = T J r (ξ), J ` (−ξ) = J r (ξ). (7.84)
The first can be seen to be true from
Z 1 Z 1
T J r (ξ) = T T −α dα = T 1−α dα
0 0
Z 0 Z 1
β
=− T dβ = T β dβ = J ` (ξ), (7.85)
1 0
In later sections and chapters, we will (arbitrarily) work with the left
Jacobian and it will therefore be useful to write out the BCH approxi-
mations in (7.80) using only the left Jacobian:
∨ ∨
ln (T1 T2 ) = ln exp(ξ ∧1 ) exp(ξ ∧2 )
J (ξ 2 )−1 ξ 1 + ξ 2 if ξ 1 small
≈ , (7.87a)
ξ 1 + J (−ξ 1 )−1 ξ 2 if ξ 2 small
g f g
ln (T 1 T 2 ) = ln exp(ξ f
1 ) exp(ξ 2 )
J (ξ 2 )−1 ξ 1 + ξ 2 if ξ 1 small
≈ , (7.87b)
ξ 1 + J (−ξ 1 )−1 ξ 2 if ξ 2 small
where we have used that JJT > 0, which was shown previously.
14 We will use this convention throughout the book and only show the subscript on the
Jacobian when making specific points.
7.1 Geometry 229
Rotations
There are two common ways to define the difference of two rotations:
∨
φ12 = ln CT1 C2 , (7.92a)
∨
φ21 = ln C2 CT1 . (7.92b)
∧ ∧ 1 T
φ1 , φ2 = − tr φ∧1 φ2∧ = φT1 φ2 . (7.93)
2
The metric distance between two rotations can be thought of in two
ways: (i) the square root of the inner product of the difference with
itself, or (ii) the Euclidean norm of the difference:
q q
q
φ12 = hln (CT1 C2 ) , ln (CT1 C2 )i = φ∧12 , φ∧12 = φT12 φ12 = |φ12 |,
(7.94a)
q q
q
φ21 = hln (C2 CT1 ) , ln (C2 CT1 )i = φ∧21 , φ∧21 = φT21 φ21 = |φ21 |.
(7.94b)
This can also be viewed as the magnitude of the angle of the rotation
difference.
To consider integrating functions of rotations, we parametrize C =
∧
exp φ ∈ SO(3). Perturbing φ by a little bit results in the new ro-
∧
tation matrix, C0 = exp (φ + δφ) ∈ SO(3). We have that the right
and left differences (relative to C) are
∨ ∧ ∨
ln(δCr )∨ = ln CT C0 = ln CT exp (φ + δφ)
∧ ∨
≈ ln CT C exp (Jr δφ) = Jr δφ, (7.95a)
∨
0 T ∨ ∧ ∨
ln(δC` ) = ln C C = ln exp (φ + δφ) CT
∧ ∨
≈ ln exp (J` δφ) CCT = J` δφ, (7.95b)
minant15 :
dCr = |det(Jr )| dφ , (7.96a)
dC` = |det(J` )| dφ . (7.96b)
We note that
det(J` ) = det (C Jr ) = det (C) det (Jr ) = det (Jr ) , (7.97)
| {z }
1
where we are careful to ensure |φ| < π so as to sweep out all of SO(3)
just once (due to the surjective nature of the exponential map).
Poses
We briefly summarize the SE(3) and Ad(SE(3)) results as they are
very similar to SO(3). We can define right and left distance metrics:
∨ g
ξ 12 = ln T−1
1 T2 = ln T −1
1 T 2 , (7.101a)
−1 ∨
−1 g
ξ 21 = ln T2 T1 = ln T 2 T 1 . (7.101b)
The 4 × 4 and 6 × 6 inner products are
1
∧ ∧ ∧ 21 0 ∧T
ξ 1 , ξ 2 = −tr ξ 1 T ξ = ξ T1 ξ 2 , (7.102a)
0 1 2
1
f f 1 0 fT
ξ 1 , ξ 2 = −tr ξ f 4 ξ = ξ T1 ξ 2 . (7.102b)
1
0 12 1 2
15 We are slightly abusing notation here by writing dC, but hopefully it is clear from
context what is meant.
7.1 Geometry 231
7.1.7 Interpolation
We will have occasion later to interpolate between two elements of a ma-
trix Lie group. Unfortunately, the typical linear interpolation scheme,
Rotations
There are many possible interpolation scheme that we could define.
One of these is the following:
α
C = C2 CT1 C1 , α ∈ [0, 1], (7.114)
where C, C1 , C2 ∈ SO(3). We see that when α = 0 we have C = C1 and
when α = 1 we have C2 . The nice thing about this scheme is that we
guarantee closure, meaning C∈ SO(3) for all α ∈ [0, 1]. This is because
we know that C21 = exp φ∧21 = C2 CT1 is still a rotation matrix due to
closure of the Lie group. Exponentiating by the interpolation variable
keeps the result in SO(3),
α
Cα21 = exp φ∧21 = exp α φ∧21 ∈ SO(3), (7.115)
and finally compounding with C1 results in a member of SO(3), again
due to closure of the group. We can also see that we are essentially just
scaling the rotation angle of C21 by α, which is appealing intuitively.
Our scheme in (7.114) is actually similar to (7.112), if we rearrange
it a bit:
x = α(x2 − x1 ) + x1 . (7.116)
Or, letting x = ln(y), x1 = ln(y1 ), x2 = ln(y2 ), we can rewrite it as
α
y = y2 y1−1 y1 , (7.117)
which is very similar to our proposed scheme. Given our understanding
of the relationship between so(3) and SO(3) (i.e., through the expo-
nential map), it is therefore not a leap to understand that (7.114) is
somehow defining linear-like interpolation in the Lie algebra, where we
can treat elements as vectors.
To examine
this further, we let C = exp φ∧ , C1 = exp φ∧1 , C2 =
exp φ∧2 ∈ SO(3) with φ, φ1 , φ2 ∈ R3 . If we are able to make the
7.1 Geometry 233
assumption that φ21 is small (in the sense of distance from the previous
section), then we have
α ∨
∨
φ = ln (C) = ln C2 CT1 C1
∨
= ln exp α φ∧21 exp φ∧1 ≈ α J(φ1 )−1 φ21 + φ1 , (7.118)
which is comparable to (7.116) and is a form of linear interpolation.
Another case worth noting is when C1 = 1, whereupon
C = Cα2 , φ = α φ2 , (7.119)
with no approximation.
Another way to interpret our interpolation scheme is that it is enforc-
ing a constant angular velocity, ω. If we think of our rotation matrix
as being a function of time, C(t), then the scheme is
α t − t1
C(t) = C(t2 )C(t1 )T C(t1 ), α = . (7.120)
t2 − t1
Defining the constant angular velocity as
1
ω= φ , (7.121)
t2 − t1 21
the scheme becomes
C(t) = exp ((t − t1 ) ω ∧ ) C(t1 ). (7.122)
This is exactly the solution to Poisson’s equation, (6.45),
Ċ(t) = ω ∧ C(t), (7.123)
with constant angular velocity16 . Thus, while other interpolation schemes
are possible, this one has a strong physical connection.
Perturbed Rotations
Another thing that will be very useful to investigate, is what hap-
pens to C if we perturb C1 and/or C2 a little bit. Suppose now that
C0 , C01 , C02 ∈ SO(3) are the perturbed rotation matrices with the (left)
differences17 given as
∨ ∨ ∨
δφ = ln C0 CT , δφ1 = ln C01 CT1 , δφ2 = ln C02 CT2 .
(7.124)
The interpolation scheme must hold for the perturbed rotation matri-
ces: T α
C0 = C02 C01 C01 , α ∈ [0, 1]. (7.125)
16 Kinematics will be discussed in further detail later in this chapter.
17 In anticipation of how we will use this result, we will consider perturbations on the left,
but we saw in the previous section that there are equivalent perturbations on the right
and in the middle.
234 Matrix Lie Groups
where
n
! α−1
1 X n+1 X
Fn (α) = Bm αn+1−m = βn, (7.133)
n + 1 m=0 m β=0
is a version of Faulhaber’s formula. The first few Faulhaber coefficients Johann Faulhaber
(as we will call them) are (1580-1635) was a
German
mathematician
α(α − 1) α(α − 1)(2α − 1) whose major
F0 (α) = α, F1 (α) = , F2 (α) = ,
2 6 contribution
α2 (α − 1)2 involved
F3 (α) = , . . . . (7.134) calculating the
4 sums of powers of
integers. Jakob
Putting these back into A(α, φ), we have Bernoulli makes
references to
α(α − 1) ∧ α(α − 1)(2α − 1) ∧ ∧ Faulhaber in his
A(α, φ) = α 1 + φ + φ φ Ars Conjectandi.
2 12
α2 (α − 1)2 ∧ ∧ ∧
+ φ φ φ + · · · , (7.135)
24
where we likely would not need many terms if φ is small.
α−1
! !∧ !
∧ α X
exp δφ C ≈ 1+ Cβ δφ Cα
β=0
∧
= 1 + (A(α, φ) δφ) Cα , (7.138)
236 Matrix Lie Groups
where
α−1 α−1 ∞
X X α−1XX 1 n ∧ n
A(α, φ) = Cβ = exp βφ∧ = β φ
β=0 β=0 β=0 n=0
n!
∞ α−1
! ∞
X 1 X n ∧ n
X Fn (α) ∧ n
= β φ = φ , (7.139)
n=0
n! β=0 n=0
n!
| {z }
Fn (α)
Poses
Interpolation for elements of SE(3) parallels the SO(3) case. We define
the interpolation scheme as the following:
α
T = T2 T−11 T1 , α ∈ [0, 1]. (7.141)
Again, this scheme ensures that T = exp ξ ∧ ∈ SE(3) as long as
∧ ∧
T1 = exp ξ 1 , T2 = exp ξ ∧2 ∈ SE(3). Let T21 = T1 T−1 2 = exp ξ 21 ,
so that
∨ α ∨ ∨
ξ = ln (T) = ln T2 T−1 1 T1 = ln exp α ξ ∧21 exp ξ ∧1
≈ α J (ξ 1 )−1 ξ 21 + ξ 1 , (7.142)
where the approximation on the right holds if ξ 21 is small. When T1 =
1, the scheme becomes
T = Tα2 , ξ = α ξ2 , (7.143)
with no approximation.
Perturbed Poses
As in the SO(3) case, it will be useful to investigate what happens to T
if we perturb T1 and/or T2 a little bit. Suppose now that T0 , T01 , T02 ∈
7.1 Geometry 237
SE(3) are the perturbed transformation matrices with the (left) differ-
ences given as
∨ ∨ ∨
δξ = ln T0 T−1 , δξ 1 = ln T01 T−1 1 , δξ 2 = ln T02 T−1
2 .
(7.144)
The interpolation scheme must hold for the perturbed transformation
matrices:
−1 α
T0 = T02 T10 T01 , α ∈ [0, 1]. (7.145)
lating 4 × 1 columns,
}
ε η1 −ε∧ ε 0 ε
= T , = , (7.149)
η 0 0T η −ε∧ 0
which result in a 4 × 6 and 6 × 4, respectively. With these definitions,
we have the following useful identities,
ξ ∧ p ≡ p ξ, pT ξ ∧ ≡ ξ T p} , (7.150)
where ξ ∈ R6 and p ∈ R4 , which will prove useful when manipulat-
ing expressions involving points and poses together. We also have the
identity,
(Tp) ≡ Tp T −1 , (7.151)
which is similar to some others we have already seen.
Rotations
We have already seen in Section 6.2.5 a preview of perturbing expres-
sions in terms of their Euler angles. We first consider directly taking
the Jacobian of a rotated point with respect to the Lie algebra vector
representing the rotation:
∂(Cv)
, (7.152)
∂φ
where C = exp(φ∧ ) ∈ SO(3) and v ∈ R3 is some arbitrary three-
dimensional point.
To do this, we can start by taking the derivative with respect to a
single element of φ = (φ1 , φ2 , φ3 ). Applying the definition of a derivative
along the 1i direction we have
∂(Cv) exp ((φ + h1i )∧ ) v − exp φ∧ v
= lim , (7.153)
∂φi h→0 h
which we will refer to as a directional derivative. Since we are interested
in the limit of h infinitely small, we can use the approximate BCH
formula to write
exp ((φ + h1i )∧ ) ≈ exp ((J h1i )∧ ) exp φ∧
≈ (1 + h(J1i )∧ ) exp φ∧ , (7.154)
7.1 Geometry 239
The previous update can actually be cast in this form by using the
19 A right-hand version is also possible.
240 Matrix Lie Groups
with α > 0 a small step size and D > 0 any positive-definite matrix.
Then apply the perturbation within the scheme to update the rotation,
Cop ← exp −αDδ ∧ Cop , (7.169)
and iterate to convergence. Our scheme guarantees Cop ∈ SO(3) at
each iteration.
The perturbation idea generalizes to more interesting optimization
schemes than basic gradient descent, which can be quite slow. Consider
the alternate derivation of the Gauss-Newton optimization method
from Section 4.3.1. Suppose we have a general nonlinear, quadratic
cost function of a rotation of the form,
1X 2
J(C) = (um (Cvm )) , (7.170)
2 m
where um (·) are scalar nonlinear functions and vm ∈ R3 are three-
dimensional points. We begin with an initial guess for the optimal ro-
tation, Cop ∈ SO(3), and then perturb this (on the left) according to
C = exp ψ ∧ Cop , (7.171)
where ψ is the perturbation. We then apply our perturbation scheme
inside each um (·) so that
um (Cvm ) = um exp(ψ ∧ )Cop vm ≈ um 1 + ψ ∧ Cop vm
∂um ∧
≈ um (Cop vm ) − (Cop vm ) ψ, (7.172)
| {z } ∂x x=Cop vm
βm | {z }
δT
m
Poses
The same concepts can also be applied to poses. The Jacobian of a
transformed point with respect to the Lie algebra vector representing
the transformation is
∂(Tp)
= (Tp) J , (7.177)
∂ξ
then the Jacobian with respect to this perturbation (i.e., the (left) Lie
derivative) is simply
∂(Tp)
= (Tp) , (7.179)
∂
which removes the need to calculate the J matrix.
Finally, for optimization, suppose we have a general nonlinear, quadratic
cost function of a transformation of the form,
1X 2
J(T) = (um (Tpm )) , (7.180)
2 m
Gauss-Newton Discussion
This approach to Gauss-Newton optimization for our matrix Lie groups
where we use a customized perturbation scheme has three key proper-
ties:
(i) we are storing our rotation or pose in a singularity-free format,
(ii) at each iteration we are performing unconstrained optimiza-
tion,
(iii) our manipulations occur at the matrix level so that we do not
need to worry about taking the derivatives of a bunch of scalar
trigonometric functions, which can easily lead to mistakes.
SO(3) Identities and Approximations
u1 0 −u3 u2
0
∧
u3 −u2 u1 0 φ = φa
∧
u∧ = u2 = u3 −u1
∧ ∧
T 1
(αu + βv) ≡ αu + βv aT a ≡ 1
u∧ ≡ −u∧ CT C ≡ 1 ≡ CCT J= 0
Cα dα ≡n=0 (n+1)!
φ∧ ≈ 1 + 12 φ∧
φ ∧
a
R1 P∞ n
φ
u∧ v ≡ −v∧ u tr(C) ≡ 2 cos φ + 1 J ≡ sinφ φ 1 + 1 − sinφ φ aaT + 1−cos
∞ B ∧ n 1 ∧
n
u∧ u ≡ 0 J−1 ≡ n=0 n!n φ
∧ ∧
(Wu) ≡ u (tr(W) 1 − W) − WT u∧ C = exp φ∧ ≡ n=0 n! φ∧ ≈ 1 + φ∧ −1
P
∧
n
det(C)
P∞ 1
≡1
u∧ v∧ ≡ −(uT v) 1 + vuT
≈ 1 − 2 φ
C ≡ 1 + φ∧ J
C ≡ cos φ1 + (1 − cos φ)aaT + sin φa∧
P∞ 1
∧
u∧ v∧ v∧ − v∧ v∧ u∧ ≡ (v∧ u∧ v)∧ (Cu) ≡ Cu∧ CT
∧
P∞ n
n
u∧ ≡ (u∧ v) exp ((Cu)∧ ) ≡ C exp (u∧ ) CT
∧ ∧
u , u , . . . u , v∧ . . . ≡ ((u∧ ) v)
n
∧ ∧
[u∧ , v∧ ] ≡u∧ v∧ − v∧
| {z }
α, β ∈ R, u, v, φ, δφ ∈ R3 , W, A, J ∈ R3×3 , C ∈ SO(3)
SE(3) Identities and Approximations
ρ
ξ=
φ
≈ 1 + ξ∧
u v u C Jρ 1
∧
x = = T 0 n=0 ξf
P∞ 1 ∧ n
v 0 0
J = T α dα ≡ ≈ 1 + 21 ξ f
0
T≡ T
∧ ∧
J Q
T = exp ξ ∧ ≡ n=1 n! ξ
f
R1 P∞ n
∧
f u v u
x = = 0 J
T = exp ξ f ≡ n=1 J ≡
(n+1)!
v 0 v∧ C (Jρ) C ξ
f ∧ P∞ 1 1 f n
∧ n!
J −1 ≡ n=0
0 C
T = Ad (T) ≡
−1
J
n! ξ ∧≈ 1 +ξ
f
(αx + βy) ≡ αx∧ + βy∧
P∞ Bn f n
−J QJ−1
0
(αx + βy) ≡ αxf + βyf tr(T) ≡ 2 cos φ + 2 J −1 ≡
f n
−1 ≈ 1 −12 ξ f
1
φ∧ ρ∧ φ∧
x y ≡ −yf x det(T) ≡ 1
Q = n=0 m=0 (n+m+2)!
xf x ≡ 0 φ
∧
φ3
φ∧ ρ∧ + ρ∧ φ∧ + φ∧ ρ∧ φ∧
J−1 m
P∞ P∞
∧ ∧ ∧ ∧
2
≡ 12 ρ∧ + φ−sin
f
[x , y ] ≡ x y − y∧ x∧ ≡ (xf y)
f f f f f f f 1− φ2 −cos φ
Ad (T1 T2 ) = Ad (T1 ) Ad (T
P∞ 1 n2 )
CT −CT r
n
x ≡ (x y) − φ∧ φ∧ ρ∧ + ρ∧ φ∧ φ∧ − 3φ∧ ρ∧ φ∧
∧ 2 3
0 1
T−1 ≡ T
1− φ2 −cos φ φ−sin φ− φ6
T−1 ≡ exp −ξ ∧ ≡ n=1 n! −ξ∧ ≈ 1 − ξ ∧
3 φ∧ ρ∧ φ∧ φ∧
x , x , . . . x∧ , y∧ . . . ≡ ((xf ) y)
−1 φ4 φ5
− 12 −
T
∧ ∧
[x , y ] ≡ x y −
y
φ4
n f
f P∞ 1 n
CT −CT (Jρ)∧
| {z }
x , x , . . . x , y . . . ≡ ((xf ) y) ∧
0 CT
T −1 ≡
≡ exp −ξ ≡ n=1 n! −ξ f ≈ 1 − ξ f
n
f
f fn f f
+ φ∧ φ∧ ρ∧ φ∧
Tξ≡ξ exp (ξ + δξ) ≈ exp ((J δξ)f ) exp ξ f
| {z }
ε
Tξ ∧ ≡ ξ ∧ T
exp (ξ + δξ) ≈ exp ((J δξ)∧ ) exp ξ ∧
T ≡ 1 + ξf J
p =
η1 −ε∧
= T
η 0 0T
T ξf ≡ ξf T J ξf ≡ ξf J
∧
ε 0 ε
J (ξ) ≡ T J (−ξ)
} ∧
p = =
f
exp (T x) ≡ T exp (x∧ ) T−1
η
}
−ε∧ 0
(T x) ≡ Tx∧ T−1
∧
f
exp δξ ∧ T ≈ 1 + (A(α, ξ) δξ) Tα
x p≡p x
α ∧
Ad (x∧ T) ≡ xf T
P∞ n
T
(Tp) ≡ Tp T −1
T
(Tp) (Tp) ≡ T −T p p T −1
7.1.10 Identities
We have seen many identities and expressions in this section related
to our matrix Lie groups, SO(3) and SE(3). The previous two pages
summarize these. The first page provides identities for SO(3) and the
second for SE(3).
7.2 Kinematics
We have seen how the geometry of a Lie group works. The next step
is to allow the geometry to change over time. We will work out the
kinematics associated with our two Lie groups, SO(3) and SE(3).
7.2.1 Rotations
We have already seen the kinematics of rotations in the previous chap-
ter, but this was before we introduced Lie groups.
Lie Group
We know that a rotation matrix can be written as
C = exp φ∧ , (7.187)
We will refer to this as kinematics of the Lie group; these equations are
singularity-free since they are in terms of C, but have the constraint
that CCT = 1. Due to the surjective property of the exponential map
from so(3) to SO(3), we can also work out the kinematics in terms of
the Lie Algebra.
21 Compared to (6.45) in our earlier development, this ω is opposite in sign. This is
because we have adopted the robotics convention described in Section 6.3.2 for the
angle of rotation and this leads to the form in (7.187); this in turn means we must use
the angular velocity associated with that angle and this is opposite in sign to the one
we discussed earlier.
7.2 Kinematics 247
Lie Algebra
To see the equivalent kinematics in terms of the Lie algebra, we need
to differentiate C:
Z 1
d ∧ ∧
Ċ = exp φ = exp αφ∧ φ̇ exp (1 − α)φ∧ dα, (7.189)
dt 0
where the last relationship comes from the general expression for the
time derivative of the matrix exponential:
Z 1
d dA(t)
exp (A(t)) = exp (αA(t)) exp ((1 − α)A(t)) dα. (7.190)
dt 0 dt
From (7.189) we can rearrange to have
Z 1 Z 1 ∧
T α ∧ −α
ĊC = C φ̇ C dα = Cα φ̇ dα
0 0
Z 1 ∧ ∧
= Cα dα φ̇ = J φ̇ , (7.191)
0
R1
where J = 0 Cα dα is the (left) Jacobian for SO(3) that we saw earlier.
Comparing (7.188) and (7.191) we have the pleasing result that
ω = J φ̇, (7.192)
or
φ̇ = J−1 ω, (7.193)
which is an equivalent expression for the kinematics, but in terms of
the Lie algebra. Note that J−1 does not exist at |φ| = 2πm, where m
is a non-zero integer, due to singularities of the 3 × 1 representation of
rotation; the good news is that we no longer have constraints to worry
about.
Numerical Integration
Because φ has no constraints, we can use any numerical method we like
to integrate (7.193). The same is not true if we want to integrate (7.188)
directly, since we must enforce the constraint that CCT = 1. There are
a few simple strategies we can use to do this.
One approach is to assume that ω is piecewise constant. Suppose
ω is constant between two times, t1 and t2 . In this case, (7.188) is a
linear, time-invariant, ordinary differential equation and we know the
solution will be of the form
C(t2 ) = exp ((t2 − t1 ) ω ∧ ) C(t1 ), (7.194)
| {z }
C21 ∈ SO(3)
where the Lagrange multiplier terms are necessary to enforce the RRT =
1 constraint. Note that δij is the Kronecker delta and
RT = r1 r2 r3 , CT = c1 c2 c3 . (7.199)
We also note that
tr CRT = rT1 c1 + rT2 c2 + rT3 c3 . (7.200)
We then take the derivative of J with respect to the three rows of R
revealing
X3
∂J
= c i − λij rj , ∀i = 1 . . . 3. (7.201)
∂rTi j=1
− 1
R = CCT 2 C,
which simply looks like we are ‘normalizing’ C. Computing the projec-
tion whenever the orthogonality constraint is not satisfied (to within
some threshold) and then overwriting the integrated value,
C ← R, (7.203)
ensures we do not stray too far from SO(3)22 .
7.2.2 Poses
There is an analogous approach to kinematics for SE(3) that we will
develop next.
Lie Group
We have seen that a transformation matrix can be written as
C r C Jρ
T= T = T = exp ξ ∧ (7.204)
0 1 0 1
where
ρ
ξ= .
φ
Suppose the kinematics in terms of separated translation and rotation
are given by
ṙ = ω ∧ r + ν, (7.205a)
Ċ = ω ∧ C, (7.205b)
22 Technically, this matrix square-root approach only works under certain conditions. For
some pathological C matrices, it can produce an R where det R = −1 instead of
det R = 1, as desired. This is because we have not enforced the det R = 1 constraint in
our optimization properly. A more rigorous method, based on singular-value
decomposition, is presented later in Section 8.1.3 that handles the difficult cases. A
sufficient test to know whether this matrix square-root approach will work is to check
that det C > 0 before applying it. This should almost always be true in real situations
where our integration step is small. If it is not true, the detailed method in
Section 8.1.3 should be used.
250 Matrix Lie Groups
Lie Algebra
Again, we can find an equivalent set of kinematics in terms of the Lie
algebra. As in the rotation case, we have that
Z 1
d ∧ ∧
Ṫ = exp ξ = exp αξ ∧ ξ̇ exp (1 − α)ξ ∧ dα, (7.207)
dt 0
or equivalently
Z 1 Z 1 ∧
∧
ṪT−1 = Tα ξ̇ T−α dα = T α ξ̇ dα
0 0
Z 1 ∧ ∧
α
= T dα ξ̇ = J ξ̇ , (7.208)
0
R1
where J = 0 T α dα is the (left) Jacobian for SE(3). Comparing (7.206)
and (7.208) we have that
$ = J ξ̇, (7.209)
or
ξ̇ = J −1 $, (7.210)
for our equivalent kinematics in terms of the Lie algebra. Again, these
equations are now free of constraints.
Hybrid
There is, however, a another way to propagate the kinematics by noting
that the equation for ṙ is actually linear in the velocity. By combining
the equations for ṙ and φ̇ we have
ṙ 1 −r∧ ν
= , (7.211)
φ̇ 0 J−1 ω
23 We can also write the kinematics equivalently in 6 × 6 format:
Ṫ = $ f T .
7.2 Kinematics 251
Numerical Integration
Similarly to the SO(3) approach, we can integrate (7.210) without wor-
rying about constraints, but integrating (7.206) takes a little more care.
Just as in the SO(3) approach we could assume that $ is piecewise
constant. Suppose $ is constant between two times, t1 and t2 . In this
case, (7.206) is a linear, time-invariant, ordinary differential equation
and we know the solution will be of the form
T(t2 ) = exp ((t2 − t1 ) $ ∧ ) T(t1 ), (7.212)
| {z }
T21 ∈ SE(3)
With Dynamics
We can augment our kinematic equation for pose with an equation for
the translational/rotational dynamics (i.e., Newton’s second law) as
24 See the numerical integration section on rotations, above, for the details.
252 Matrix Lie Groups
Lie Algebra
Perturbing the kinematics in the Lie algebra is more difficult but equiv-
alent. In terms of quantities in the Lie algebra, we have
φ0 = φ + J(φ)−1 δφ. (7.222)
0
where φ = ln(C ) is the perturbed rotation vector, φ = ln(C)∨ the
0 ∨
We start with the perturbed kinematics, φ̇0 = J(φ0 )−1 ω 0 , and then
inserting our perturbation scheme we have
d −1
φ + J(φ)−1 δφ ≈ J(φ) + δJ (ω + δω) . (7.223)
|dt {z } | {z } | {z }
J(φ0 ) ω0
φ̇0
Z 1 Z 1
α α
J(φ0 ) = C0 dα = exp δφ∧ C dα
0 0
Z 1
∧
≈ 1 − (A(α, φ) δφ) Cα dα
0
Z 1 Z 1
∧
= α
C dα − α J(αφ)J(φ)−1 δφ Cα dα, (7.224)
| 0 {z } | 0 {z }
J(φ) δJ
∂ω
J̇(φ) − ω ∧ J(φ) ≡ , (7.227)
∂φ
we have
∂ω
δ φ̇ = ω ∧ δφ + δω + J(φ)−1 δφ − δJ φ̇, (7.228)
∂φ
| {z }
extra term
which is the same as the Lie group result for the perturbation kinemat-
ics, but with an extra term; it turns out this extra term is zero:
∂ω ∂
J(φ)−1 δφ = J(φ) φ̇ J(φ)−1 δφ
∂φ ∂φ
Z 1 Z 1
∂ −1 ∂ α
= α
C φ̇ dα J(φ) δφ = C φ̇ dα J(φ)−1 δφ
∂φ 0 0 ∂φ
Z 1 ∧
= α Cα φ̇ J(αφ) dα J(φ)−1 δφ
0
Z 1
∧
=− α J(αφ)J(φ)−1 δφ Cα dα φ̇ = δJ φ̇, (7.229)
| 0 {z }
δJ
where we have used an identity derived back in Section 6.2.5 for the
derivative of a rotation matrix times a vector with respect to a three-
parameter representation of rotation. Thus, our pair of equations is
Solutions Commute
It is worth asking whether integrating the full solution is (approxi-
mately) equivalent to integrating the nominal and perturbation equa-
tions separately and then combining them. We show this for the Lie
group kinematics. The perturbed solution will be given by:
Z t
0 0
C (t) = C (0) + ω 0 (s)∧ C0 (s) ds (7.231)
0
which is the desired result. The rightmost integral on the second line
can be computed by noting that
d ∧ ∧
δφ∧ C = δ φ̇ C + δφ∧ Ċ = ω ∧ δφ + δω C + δφ∧ ω ∧
C
| {z }
dt | {z }
perturbation nom.
= ω ∧ δφ∧ C − δφ∧ ω ∧ C + δω ∧ C + δφ∧ ω ∧ C
= ω ∧ δφ∧ C + δω ∧ C, (7.233)
The state transition matrix always exists and is unique, but it cannot al-
ways be found analytically. Fortunately, for our particular perturbation
equation, we can express the 3×3 state transition matrix analytically26 :
26 The nominal rotation matrix, C(t), is the fundamental matrix of the state transition
matrix.
256 Matrix Lie Groups
We need the solution to the nominal equation, C(t), but this is readily
available. To see this is indeed the correct solution, we can differentiate:
Z t
T
δ φ̇(t) = Ċ(t)C(0) δφ(0)+Ċ(t) C(s)T δω(s) ds+C(t) C(t)T δω(t)
0
Z t
= ω(t)∧ C(t)C(0)T δφ(0) + C(t) C(s)T δω(s) ds + δω(t)
0
| {z }
δφ(t)
which is the original differential equation for δφ(t). We also see that
our state transition matrix satisfies the required conditions:
d
C(t)C(s)T = Ċ(t)C(s)T = ω(t)∧ C(t)C(s)T , (7.240a)
|dt {z } | {z }
Φ(t,s)
Φ̇(t,s)
C(t)C(t)T = 1. (7.240b)
| {z }
Φ(t,t)
Lie Group
We will use the perturbation,
T0 = exp δξ ∧ T ≈ 1 + δξ ∧ T, (7.241)
Ṫ0 = $ 0∧ T0 , (7.242)
With Dynamics
We can also perturb the joint kinematics/dynamics equations in (7.217).
We consider perturbing all of the quantities around some operating
points as follows:
T0 = exp δξ ∧ T, $ 0 = $ + δ$, a0 = a + δa, (7.247)
so that the kinematics/dynamics are
∧
Ṫ0 = $ 0 T0 , (7.248a)
f
$̇ 0 = M−1 $ 0 M$ 0 + a0 . (7.248b)
If we think of δa as an unknown noise input, then we would like to
know how this turns into uncertainty on the pose and velocity vari-
ables through the chain of dynamics and kinematics. Substituting the
perturbations into the motion models, we can separate into a (nonlin-
ear) nominal motion model,
nominal kinematics: Ṫ = $ ∧ T, (7.249a)
nominal dynamics: $̇ = M−1 $ f M$ + a, (7.249b)
and (linear) perturbation motion model,
perturbation kinematics: δ ξ̇ = $ f δξ + δ$, (7.250a)
−1 f f
perturbation dynamics: δ $̇ = M $ M − (M$) δ$ + δa,
(7.250b)
which we can write in combined matrix form:
f
δ ξ̇ $ 1 δξ 0
= −1 f + . (7.251)
δ $̇ 0 M $ M − (M$)
f
δ$ δa
Finding the transition matrix for this LTV SDE may be difficult, but
it can be integrated numerically.
258 Matrix Lie Groups
Rotations
We have seen several times the dual nature of rotations/poses in the
sense that they can be described in terms of a Lie group or a Lie al-
gebra, each having advantages and disadvantages. Lie groups are nice
because they are free of singularities, but have constraints; this is also
the form that is usually required in order to rotate/transform some-
thing in the real world. Lie algebras are nice because we can treat
them as vectorspaces (for which there are many useful mathematical
tools27 ), and they are free of constraints, but we need to worry about
singularities.
It seems logical to exploit the vectorspace character of a Lie alge-
bra in defining our random variables for rotations and poses. In this
way, we can leverage all the usual tools from probability and statistics,
rather than starting over. Given this decision, and using (7.253) for
27 Including probability and statistics.
7.3 Probability and Statistics 259
(7.261)
where we indicate the induced p(C). It is important to realize that
p(C) looks like this due to our choice to define p() directly31 .
31 It is also possible to work in the other direction by first defining p(C). (Chirikjian,
2009)
7.3 Probability and Statistics 261
where
C̄0 = R C̄, 0 = R ∼ N 0, R Σ RT . (7.267)
Poses
Similarly to the rotation case, we choose to define a Gaussian random
variable for poses as
T = exp (∧ ) T̄, (7.268)
where T̄ ∈ SE(3) is a ‘large’ mean transformation and ∈ R6 ∼
N (0, Σ) is a ‘small’ Gaussian random variable (i.e., in a vectorspace).
The mean transformation, M ∈ SE(3), is the unique solution of the
equation,
Z
∨
ln exp (∧ ) T̄M−1 p() d = 0. (7.269)
∧
T0 = R T = R exp (∧ ) T̄ = exp (R) R T̄ = exp (0∧ ) T̄0 , (7.272)
where
T̄0 = R T̄, 0 = R ∼ N 0, R Σ RT , (7.273)
and R = Ad(R).
7.3 Probability and Statistics 263
and
E [∧ ∧ ∧ ∧ ] = E −(T ) ∧ ∧
= E −(T ) −(T )1 + T
= tr E T T 1 − E T T
= tr (Σ (tr (Σ) 1 + 2Σ)) 1 − Σ (tr (Σ) 1 + 2Σ)
2
= (tr (Σ)) + 2 tr Σ2 1 − Σ (tr (Σ) 1 + 2Σ) , (7.279)
Figure 7.4
T̄2 , ⌃2 T̄, ⌃ Combining a chain
) of two poses into a
single compound
T̄1 , ⌃1 pose.
Theory
Consider two noisy poses, T1 and T2 ; we keep track of their nominal
values and associated uncertainties:
T̄1 , Σ1 , T̄2 , Σ2 .
since everything except the fourth-order term has zero mean. Thus, to
third order, we can safely assume that E [] = 0 and thus (7.283) seems
to be a reasonable way to compound the mean transformations32 .
The next task is to compute Σ = E [T ]. Multiplying out to fourth
order we have
T 1 0 0T fT
E T ≈ E 1 T1 + 02 02 + f 1
4 1 2 2
1 f f 0 0T 0 0T f f T
+ (1 1 ) 2 2 + 2 2 (1 1 )
12
+ (0f
2 2
0f
)
1 1
T
+ ( 0f 0f
2 2 )
1 1
T
(7.287)
where we have omitted showing any terms that have an odd power in
either 1 or 02 since these will by definition have expectation zero. This
expression may look daunting, but we can take it term by term. To save
space, we define and make use of the following two linear operators:
32 It is also possible to show that the fourth-order term has zero mean,
E 0f f f 0
2 1 1 2 = 0, if Σ1 is of the special form
" #
Σ1,ρρ 0
Σ1 = 2 ,
0 σ1,φφ 1
where the ρ and φ subscripts indicate a partitioning of the covariance into the
translation and rotation components, respectively. This is a common situation for Σ1
when we are, for example, propagating uncertainty on velocity through the kinematics
equations presented for SE(3); from (7.212) we have, T1 = exp ((t2 − t1 )$ ∧ ), where
$ is the (noisy) generalized velocity. In this case, we are justified in assuming E [] = 0
all the way out to fifth order (and possibly further).
7.3 Probability and Statistics 267
where
Sigmapoint Method
We can also make use of the sigmapoint transformation (Julier and
Uhlmann, 1996) to pass uncertainty through the compound pose change.
In this section, we tailor this to our specific type of SE(3) perturbation.
Our approach to handling sigmapoints is quite similar to that taken by
Hertzberg et al. (2013) and also Brookshire and Teller (2012). In our
case, we begin by approximating the joint input Gaussian using a finite
33 The sixth order terms require a lot more work, but it is possible to compute them
using Isserlis’ theorem.
268 Matrix Lie Groups
Note, we have assumed that the output sigmapoint samples have zero
mean in this formula, to be consistent with our mean propagation. In-
terestingly, this turns out to be algebraically equivalent to the second-
order method (from the previous section) for this particular nonlinear-
ity since the noise sources on T1 and T2 are assumed to be uncorrelated.
34 For all experiments in this section, we used λ = 1; we need to ensure the sigmapoints
associated with the rotational degrees of freedom have length less than π to avoid
numerical problems.
7.3 Probability and Statistics 269
Compound Experiment
To quantitatively evaluate the pose-compounding techniques, we ran
a second numerical experiment in which we compounded two poses
270 Matrix Lie Groups
Figure 7.5
Example of
compounding 50
K = 100 uncertain
transformations 40
(Section 7.3.3).
The light blue lines
30
and blue dots show
20
1000 individual
sampled
10
trajectories
starting from (0, 0)
0
y
and moving
nominally to the
-10
right at constant
translational
-20
speed, but with
some uncertainty
-30
on rotational
velocity. The grey
-40 samples
1-sigma covariance second-order
ellipse is simply fourth-order
-50
fitted to the
samples to show 0 20 40 60 80 100
what keeping x
xy-covariance
relative to the including their associated covariance matrices,
start looks like. ∧ T
The dotted T̄1 = exp ξ̄ 1 , ξ̄ 1 = 0 2 0 π/6 0 0 ,
(second-order) and
dash-dotted 1 1
(fourth-order) lines
Σ1 = α × diag 10, 5, 5, , 1, , (7.298a)
2 2
are the principal ∧ T
great circles of the T̄2 = exp ξ̄ 2 , ξ̄ 2 = 0 0 1 0 π/4 0 ,
1-sigma covariance
ellipsoid, given by 1 1
ΣK , mapped to
Σ2 = α × diag 5, 10, 5, , , 1 , (7.298b)
2 2
the xy-plane.
Looking to the where α ∈ [0, 1] is a scaling parameter that increases the magnitude of
area (95, 0),
the input covariances parametrically.
corresponding to
straight ahead, the We compounded these two poses according to (7.281), which results
fourth-order in a mean of T̄ = T̄1 T̄2 . The covariance, Σ, was computed using four
scheme has some methods:
non-zero
uncertainty (as do
(i) Monte Carlo: We drew a large number, M = 1000000, of random
the samples),
whereas the samples (m1 and m2 ) from the input covariance matrices, com-
second-order pounded the resulting
PM transformations, and computed
the covari-
scheme does not. ance as Σmc = M1 m=1 m Tm with Tm = exp ∧m1 T̄1 exp ∧m2 T̄2 ,
We used r = 1 and ∨
σ = 0.03.
and m = ln Tm T̄−1 . This slow-but-accurate approach served
as our benchmark to which the other three much faster methods
were compared.
(ii) Second-Order: We used the second-order method described above
to compute Σ2nd .
7.3 Probability and Statistics 271
4 Figure 7.6
sigmapoint Results from
3.5 second-order Compound
fourth-order Experiment: Error,
Covariance error, ε
3 ε, in computing
covariance
2.5
associated with
2 compounding two
poses using three
1.5 methods, as
compared to
1 Monte Carlo. The
sigmapoint and
0.5 second-order
methods are
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 algebraically
Noise scaling, α equivalent for this
problem and thus
appear the same
(iii) Fourth-Order: We used the fourth-order method described above on the plot. The
input covariances
to compute Σ4th . were gradually
(iv) Sigmapoint: We used the sigmapoint transformation described scaled up via the
parameter, α,
above to compute Σsp . highlighting the
improved
We compared each of the last three covariance matrices to the Monte performance of the
fourth-order
Carlo one, using the Frobenius norm:
method.
r
T
ε= tr (Σ − Σmc ) (Σ − Σmc ) .
Figure 7.6 shows that for small input covariance matrices (i.e., α small)
there is very little difference between the various methods and the errors
are all low compared to our benchmark. However, as we increase the
magnitude of the input covariances, all the methods get worse, with the
fourth-order method faring the best by about a factor of seven based
on our error metric. Note, since α is scaling the covariance, the applied
noise is increasing quadratically.
The second-order method and the sigmapoint method have indistin-
guishable performance, as they are algebraically equivalent. The fourth-
order method goes beyond both of these by considering higher-order
terms in the input covariance matrices. We did not compare the compu-
tational costs of the various methods as they are all extremely efficient
as compared to Monte Carlo.
It is also worth noting that our ability to correctly keep track of
uncertainties on SE(3) decreases with increasing uncertainty. This can
be seen directly in Figure 7.6, as error increases with increasing un-
certainty. This suggests that it may be wise to use only relative pose
variables in order to keep uncertainties small.
272 Matrix Lie Groups
Theory
Suppose that we have K estimates of a pose and associated uncertain-
ties:
T̄1 , Σ1 , T̄2 , Σ2 , . . . , T̄K , ΣK . (7.299)
If we think of these as uncertain (pseudo)-measurements of the true
pose,
T
true , how can we optimally combine these into a single estimate,
T̄, Σ ?
As we have seen in the first part of this book, vectorspace solution
to fusion is straightforward and can be found exactly in closed form:
K K
!−1
X −1
X −1
x̄ = Σ Σk x̄k , Σ = Σk . (7.300)
k=1 k=1
While this may appear strange compared to (7.300), the Jacobian terms
appear because our choice of error definition is in fact nonlinear owing
to the presence of the matrix exponentials. We then apply this optimal
perturbation to our current guess,
Top ← exp (?∧ ) Top , (7.306)
which ensures Top remains in SE(3), and iterate to convergence. At
the last iteration, we take T̄ = Top as the mean of our fused estimate
and
K
!−1
X −1
T
Σ= Gk Σk Gk , (7.307)
k=1
274 Matrix Lie Groups
for the covariance matrix. This approach has the form of a Gauss-
Newton method as discussed in Section 7.1.9.
This fusion problem is similar to one investigated by Smith et al.
(2003), but they only discuss the K = 2 case. Our approach is closer
to that of Long et al. (2012), who discuss the N = 2 case and derive
closed-form expressions for the fused mean and covariance for an ar-
bitrary number of individual measurements, K; however, they do not
iterate their solution and they are tracking a slightly different PDF.
Wolfe et al. (2011) also discuss fusion at length, albeit again using a
slightly different PDF than us. They discuss non-iterative methods of
fusion for arbitrary K and show numerical results for K = 2. We be-
lieve our approach generalizes all of these previous works by (i) allowing
the number of individual estimates, K, to be arbitrary, (ii) keeping an
arbitrary number of terms in the approximation of the inverse Jaco-
bian, N , and (iii) iterating to convergence via a Gauss-Newton style
optimization method. Our approach may also be simpler to implement
than some of these previous methods.
Fusion Experiment
To validate the pose fusion method from the previous subsection, we
used a true pose given by
Ttrue = exp( ξtrue^∧ ),   ξtrue = [1 0 0 0 0 π/6]^T,   (7.308)

and then generated 3 random pose measurements,

T̄1 = exp( ε1^∧ ) Ttrue,   T̄2 = exp( ε2^∧ ) Ttrue,   T̄3 = exp( ε3^∧ ) Ttrue,   (7.309)

where

ε1 ∼ N( 0, diag(10, 5, 5, 1/2, 1, 1/2) ),
ε2 ∼ N( 0, diag(5, 15, 5, 1/2, 1/2, 1) ),
ε3 ∼ N( 0, diag(5, 5, 25, 1, 1/2, 1/2) ).   (7.310)
We then solved for the pose using our Gauss-Newton technique (it-
erating until convergence), using the initial condition, Top = 1. We
repeated this for N = 1 . . . 6, the number of terms kept in Gk =
J (−ek (Top ))−1 . We also used the closed-form expression to compute
J analytically (and then inverted numerically) and this is denoted by
‘N = ∞’.
Figure 7.8 plots two performance metrics. First, it plots the final converged value of the cost function, Jm, averaged over M = 1000 random trials, J = (1/M) ∑_{m=1}^{M} Jm. Second, it plots the root-mean-squared
pose error (with respect to the true pose) of our estimate, T̄m, again averaged over the same M random trials:

ε = √( (1/M) ∑_{m=1}^{M} εm^T εm ),   εm = ln( Ttrue T̄m^{-1} )^∨.
The plots show that both measures of error are monotonically reduced
with increasing N . Moreover, we see that for this example almost all
of the benefit is gained with just four terms (or possibly even two).
The results for N = 2, 3 are identical as are those for N = 4, 5. This
is because in the Bernoulli number sequence, B3 = 0 and B5 = 0, so
these terms make no additional contribution to J −1 . It is also worth
stating that if we make the rotational part of the covariances in (7.310)
any bigger, we end up with a lot of samples that have rotated by more
than angle π, and this can be problematic for the performance metrics
we are using.
Figure 7.9 shows the convergence history of the cost, J, for a single
random trial. The left side shows the strong benefit of iterating over
the solution, while the right side shows that the cost converges to a
lower value by keeping more terms in the approximation of J −1 (cases
N = 2, 4, ∞ shown). It would seem that taking N = 4 for about seven
iterations gains most of the benefit, for this example.
[Figure 7.9: Results from the fusion experiment: (left) convergence of the cost function, J, with iterations; (right) average final cost function, J, for the second-order, fourth-order, and infinite-order methods.]

= G Ξ G^T − (1/4) ∑_{i=1}^{J} ∑_{j=1}^{J} tr( G_i Ξ ) 1_i tr( G_j Ξ ) 1_j^T
  + (1/4) ∑_{i,j=1}^{J} ∑_{k,ℓ,m,n=1}^{9} G_{ikℓ} G_{jmn} ( Ξ_{kℓ} Ξ_{mn} + Ξ_{km} Ξ_{ℓn} + Ξ_{kn} Ξ_{ℓm} ) 1_i 1_j^T,   (7.328)
where G_{ikℓ} is the (k,ℓ)th element of G_i and Ξ_{kℓ} is the (k,ℓ)th element of Ξ. The first- and third-order terms in the covariance expansion are
identically zero owing to the symmetry of the Gaussian density. The last
term in the above makes use of Isserlis’ theorem for Gaussian variables.
Sigmapoint Method
Finally, we can also make use of the sigmapoint transformation to pass
uncertainty through the nonlinear camera model. As in the pose com-
pounding problem, we tailor this to our specific type of SE(3) pertur-
bation. We begin by approximating the input Gaussian using a finite
number of samples, {T` , p` }:
L L^T = Ξ,   (Cholesky decomposition; L lower-triangular)   (7.329a)
θ_0 = 0,   (7.329b)
θ_ℓ = √(L + κ) col_ℓ L,   ℓ = 1 . . . L,   (7.329c)
θ_{ℓ+L} = −√(L + κ) col_ℓ L,   ℓ = 1 . . . L,   (7.329d)
[ ε_ℓ ; ζ_ℓ ] = θ_ℓ,   (7.329e)
T_ℓ = exp( ε_ℓ^∧ ) T̄,   (7.329f)
p_ℓ = p̄ + D ζ_ℓ,   (7.329g)
where κ is a user-definable constant37 and L = 9. We then pass each of
these samples through the nonlinear camera model:
y` = s (T` p` ) , ` = 0 . . . 2L. (7.330)
These are combined to create the output mean and covariance accord-
ing to
ȳsp = (1/(L + κ)) ( κ y_0 + (1/2) ∑_{ℓ=1}^{2L} y_ℓ ),   (7.331a)

Rsp = (1/(L + κ)) ( κ (y_0 − ȳsp)(y_0 − ȳsp)^T + (1/2) ∑_{ℓ=1}^{2L} (y_ℓ − ȳsp)(y_ℓ − ȳsp)^T ).   (7.331b)
2 `=1
The next section will provide the details for a specific nonlinear camera model, s(·), representing a stereo camera.
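The following Python sketch implements the sigmapoint pass in (7.329)-(7.331) for a generic camera model. It assumes the camera model s(·), the mean pose T̄, the mean homogeneous landmark p̄, and the 9 × 9 input covariance Ξ are given; κ and the helper functions are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import expm, cholesky

def wedge(xi):
    rho, phi = xi[:3], xi[3:]
    P = np.array([[0, -phi[2], phi[1]], [phi[2], 0, -phi[0]], [-phi[1], phi[0], 0]])
    X = np.zeros((4, 4)); X[:3, :3] = P; X[:3, 3] = rho
    return X

def sigmapoint_camera(s, T_bar, p_bar, Xi, kappa=3.0):
    """Pass a joint pose/landmark Gaussian through a camera model s(.)."""
    L = 9
    D = np.vstack([np.eye(3), np.zeros((1, 3))])     # dilation matrix
    Lmat = cholesky(Xi, lower=True)                   # L L^T = Xi
    thetas = [np.zeros(L)]                            # theta_0 = 0
    for ell in range(L):
        col = np.sqrt(L + kappa) * Lmat[:, ell]
        thetas.append(col)                            # theta_ell
        thetas.append(-col)                           # theta_{ell+L}
    ys = []
    for theta in thetas:
        eps, zeta = theta[:6], theta[6:]
        T_ell = expm(wedge(eps)) @ T_bar              # sampled pose
        p_ell = p_bar + D @ zeta                      # sampled homogeneous landmark
        ys.append(s(T_ell @ p_ell))                   # pass through camera model
    ys = np.array(ys)
    y_mean = (kappa * ys[0] + 0.5 * ys[1:].sum(axis=0)) / (L + kappa)
    dy = ys - y_mean
    R_sp = (kappa * np.outer(dy[0], dy[0])
            + 0.5 * sum(np.outer(d, d) for d in dy[1:])) / (L + kappa)
    return y_mean, R_sp
```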
model given by

s(ρ) = (1/z_3) M z,   (7.332)

where

s = [ s_1 ; s_2 ; s_3 ; s_4 ],   z = [ z_1 ; z_2 ; z_3 ; z_4 ] = [ ρ ; 1 ],

M = [ f_u   0    c_u    f_u b/2
      0     f_v  c_v    0
      f_u   0    c_u   −f_u b/2
      0     f_v  c_v    0       ],   (7.333)
and fu , fv are the horizontal, vertical focal lengths (in pixels), (cu , cv )
is the optical center of the images (in pixels), and b is the separation
between the cameras (in metres). The optical axis of the camera is along the z_3 direction.
The Jacobian of this measurement model is given by

∂s/∂z = (1/z_3) M [ 1  0  −z_1/z_3  0
                    0  1  −z_2/z_3  0
                    0  0   0        0
                    0  0  −z_4/z_3  1 ],   (7.334)
and the Hessian is given by

∂²s_1/∂z∂z^T = (f_u/z_3²) [  0  0  −1                 0
                             0  0   0                 0
                            −1  0  (2z_1 + b z_4)/z_3  −b/2
                             0  0  −b/2               0   ],

∂²s_2/∂z∂z^T = ∂²s_4/∂z∂z^T = (f_v/z_3²) [ 0   0   0         0
                                            0   0  −1         0
                                            0  −1  2z_2/z_3   0
                                            0   0   0         0 ],

∂²s_3/∂z∂z^T = (f_u/z_3²) [  0  0  −1                 0
                             0  0   0                 0
                            −1  0  (2z_1 − b z_4)/z_3   b/2
                             0  0   b/2               0   ],   (7.335)
where we have shown each component separately.
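As a check on (7.332)-(7.334), the sketch below codes the stereo model and its analytic Jacobian and compares the latter to a numerical Jacobian. The intrinsic values are illustrative placeholders.

```python
import numpy as np

fu, fv, cu, cv, b = 400.0, 400.0, 256.0, 192.0, 0.24   # assumed intrinsics

M = np.array([[fu, 0.0, cu,  fu * b / 2],
              [0.0, fv, cv,  0.0],
              [fu, 0.0, cu, -fu * b / 2],
              [0.0, fv, cv,  0.0]])

def s(z):
    """Stereo camera model (7.332): z is the homogeneous point [z1, z2, z3, 1]."""
    return (M @ z) / z[2]

def jacobian_s(z):
    """Analytic Jacobian of s with respect to z, equation (7.334)."""
    z1, z2, z3, z4 = z
    inner = np.array([[1, 0, -z1 / z3, 0],
                      [0, 1, -z2 / z3, 0],
                      [0, 0,  0,       0],
                      [0, 0, -z4 / z3, 1]])
    return (M @ inner) / z3

# Compare against a central-difference numerical Jacobian.
z = np.array([0.5, -0.2, 4.0, 1.0])
num = np.zeros((4, 4))
h = 1e-6
for i in range(4):
    dz = np.zeros(4); dz[i] = h
    num[:, i] = (s(z + dz) - s(z - dz)) / (2 * h)
print(np.allclose(num, jacobian_s(z), atol=1e-5))
```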
Camera Experiment
We used the following methods to pass a Gaussian uncertainty on cam-
era pose and landmark position through the nonlinear stereo camera
model:
[Figure 7.10: (left) Mean and (right) covariance errors, εmean and εcov, for three methods of passing a Gaussian uncertainty through a nonlinear stereo camera model, as compared to Monte Carlo. The parameter, α, scales the magnitude of the input covariance matrix.]

[Figure: Image plane of a stereo camera showing the mean and covariance (as a one-standard-deviation ellipse) for four methods of imaging a landmark with Gaussian uncertainty on the camera's pose and the landmark's position. This case corresponds to the α = 1 data point in Figure 7.10.]
7.4 Summary
The main take-away points from this chapter are:
7.5 Exercises

7.5.1 Prove that (C u)^∧ ≡ C u^∧ C^T.
7.5.2 Prove that (C u)^∧ ≡ (2 cos φ + 1) u^∧ − u^∧ C − C^T u^∧.
7.5.3 Prove that exp( (C u)^∧ ) ≡ C exp( u^∧ ) C^T.
7.5.4 Prove that (T x)^∧ ≡ T x^∧ T^{-1}.
7.5.5 Prove that exp( (T x)^∧ ) ≡ T exp( x^∧ ) T^{-1}.
7.5.6 Work out the expression for Q_ℓ(ξ) in (7.83b).
7.5.7 Prove that x^∧ p ≡ p^⊙ x.
7.5.8 Prove that p^T x^∧ ≡ x^T p^⊛.
Applications

8
Pose Estimation Problems
In this last part of the book, we will address some key three-dimensional
estimation problems from robotics. We will bring together the ideas
from Part I on classic state estimation with the three-dimensional ma-
chinery of Part II.
This chapter will start by looking at a key problem, aligning two
point-clouds (i.e., collections of points) using the principle of least
squares. We will then return to the EKF and batch state estimators and
adjust these to work with rotation and pose variables, in the context of
a specific pose estimation problem. Our focus will be on localization of
a vehicle when the geometry of the world is known. The next chapter
will address the more difficult scenario of unknown world geometry.
[Figure 8.1: Definition of reference frames for a point-cloud alignment problem. There is a stationary reference frame and a moving reference frame, attached to a vehicle. A known stationary collection of points, Pj, are observed in both frames and the goal is to determine the relative pose of the moving frame with respect to the stationary one by aligning the two point-clouds.]

8.1.1 Problem Setup

We will use the setup in Figure 8.1. There are two reference frames, one non-moving, F→i, and one attached to a moving vehicle, F→vk. In particular, we have M measurements, r_vk^{pj vk}, where j = 1 . . . M, of points on the vehicle (expressed in the moving frame F→vk). We assume these measurements could have been corrupted by noise.

Let us assume that we know r_i^{pj i}, the position of each point, Pj, located and expressed in the non-moving frame, F→i. For example, in the ICP algorithm these points are determined by finding the closest point on the model to each observed point. Thus, we seek to align a collection of M points expressed in two different reference frames. In
other words, we want to find the translation and rotation that best align
the two point-clouds2 . Note, in this first problem we are only carrying
out the alignment at a single time, k. We will consider a point-cloud
tracking problem later in the chapter.
e_j = y_j − (q^{-1})^+ (p_j − r)^+ q,   (8.4)

but instead we can manipulate the above to generate an error that appears linear in q,

e'_j = q^+ e_j = ( y_j^⊕ − (p_j − r)^+ ) q,   (8.5)
where the wj are unique scalar weights assigned to each of the point
pairs. We have included the Lagrange multiplier term on the right to
ensure the unit-length constraint on the rotation quaternion. It is also
worth noting that selecting e0j over ej has no effect on our objective
function since
e'_j{}^T e'_j = ( q^+ e_j )^T ( q^+ e_j ) = e_j^T q^{+T} q^+ e_j = e_j^T (q^{-1})^+ q^+ e_j = e_j^T e_j.   (8.7)
Inserting the expression for e'_j into the objective function we see

J(q, r, λ) = (1/2) ∑_{j=1}^{M} w_j q^T ( y_j^⊕ − (p_j − r)^+ )^T ( y_j^⊕ − (p_j − r)^+ ) q − (1/2) λ ( q^T q − 1 ).   (8.8)
Taking the derivative of the objective function with respect to q, r, and λ we find

∂J/∂q^T = ∑_{j=1}^{M} w_j ( y_j^⊕ − (p_j − r)^+ )^T ( y_j^⊕ − (p_j − r)^+ ) q − λ q,   (8.9a)

∂J/∂r^T = q^{-1⊕} ∑_{j=1}^{M} w_j ( y_j^⊕ − (p_j − r)^+ ) q,   (8.9b)

∂J/∂λ = −(1/2) ( q^T q − 1 ).   (8.9c)
Setting the second to zero we find

r = p − q^+ y^+ q^{-1},   (8.10)
where p and y are defined below. Thus, the optimal translation is the
difference of the centroids of the two point-clouds, in the stationary
frame.
Substituting r into the first and setting to zero we can show

W q = λ q,   (8.11)

where

W = (1/w) ∑_{j=1}^{M} w_j ( (y_j − y)^⊕ − (p_j − p)^+ )^T ( (y_j − y)^⊕ − (p_j − p)^+ ),   (8.12a)

y = (1/w) ∑_{j=1}^{M} w_j y_j,   p = (1/w) ∑_{j=1}^{M} w_j p_j,   w = ∑_{j=1}^{M} w_j.   (8.12b)
Thus, picking the smallest possible value for λ will minimize the objec-
tive function.
However, there are some complications if W is singular or the small-
est eigenvalue is not distinct. Then there can be multiple choices for the
eigenvector corresponding to the smallest eigenvalue, and therefore the
solution may not be unique. We will forgo discussing this further for the quaternion method, as it would require advanced linear algebra techniques (e.g., Jordan normal form), and instead return to this issue in
the next section when using rotation matrices.
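Assuming W has already been assembled from the weighted point pairs as in (8.12a), the optimal quaternion can be extracted numerically as the eigenvector associated with the smallest eigenvalue; the sketch below is one minimal way to do this and to flag the non-unique case.

```python
import numpy as np

def optimal_quaternion(W):
    """Return the unit quaternion minimizing q^T W q, assuming W is 4x4 symmetric."""
    eigenvalues, eigenvectors = np.linalg.eigh(W)     # ascending eigenvalues
    if not np.isclose(eigenvalues[0], eigenvalues[1]):
        q = eigenvectors[:, 0]                        # unique up to sign
        return q / np.linalg.norm(q)
    raise ValueError("smallest eigenvalue not distinct: solution not unique")
```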
Note, we have not made any approximations or linearizations in our
technique, but this depends heavily on the fact that the weights are
scalar not matrices. Figure 8.2 shows the process to align two point-
clouds. Once we have the r and q, we can construct the final estimates
of the rotation matrix, Ĉ_{vk i}, and translation, r̂_i^{vk i}, from

[ Ĉ_{vk i}   0 ; 0^T   1 ] = (q^{-1})^+ q^⊕,   [ r̂_i^{vk i} ; 0 ] = r,   (8.15)

and then we can construct an estimated transformation matrix according to

T̂_{vk i} = [ Ĉ_{vk i}   −Ĉ_{vk i} r̂_i^{vk i} ; 0^T   1 ],   (8.16)
which combines our rotation and translation into a single answer for
Also, we define

y = (1/w) ∑_{j=1}^{M} w_j y_j,   p = (1/w) ∑_{j=1}^{M} w_j p_j,   w = ∑_{j=1}^{M} w_j,   (8.19)
where the wj are scalar weights for each point. Note that, as compared
to the last section, some of the symbols are now 3 × 1 rather than 4 × 1.
We define an error term for each point:
e_j = y_j − C(p_j − r),   (8.20)

and the objective function we seek to minimize is

J(C, r) = (1/2) ∑_{j=1}^{M} w_j e_j^T e_j = (1/2) ∑_{j=1}^{M} w_j ( y_j − C(p_j − r) )^T ( y_j − C(p_j − r) ),   (8.21)
subject to C ∈ SO(3) (i.e., CCT = 1 and det C = 1).
Before carrying out the optimization, we will make a change of vari-
ables for the translation parameter. Define
d = r + CT y − p, (8.22)
which is easy to isolate for r if all the other quantities are known.
on the right; these are used to ensure that the resulting C ∈ SO(3).
Note, when CCT = 1 and det C = 1, these terms have no effect on the
resulting cost. It is also worth noting that Λ is symmetric since we only
need to enforce six orthogonality constraints. This new cost function
will be minimized by the same C as our original one.
Taking the derivative of J(C, Λ, γ) with respect to C, Λ, and γ, we
have5
∂J/∂C = −W + 2ΛC + γ det(C) C^{-T} = −W + LC   (using det C = 1 and C^{-T} = C),   (8.29a)

∂J/∂Λ = CC^T − 1,   (8.29b)

∂J/∂γ = det C − 1,   (8.29c)
where we have lumped together the Lagrange multipliers as L = 2Λ +
γ1. Setting the first equation to zero, we find that
LC = W. (8.30)
At this point, our explanation can proceed in a simplified or detailed
manner, depending on the level of fidelity we want to capture.
Before moving forward, we show that it is possible to arrive at (8.30)
using our Lie group tools without the use of Lagrange multipliers. We
consider a perturbation of the rotation matrix of the form
C0 = exp φ∧ C, (8.31)
and then take the derivative of the objective function with respect to φ
and set this to zero for a critical point. For the derivative with respect
to the ith element of φ we have

∂J/∂φ_i = lim_{h→0} ( J(C') − J(C) ) / h
        = lim_{h→0} ( −tr( C' W^T ) + tr( C W^T ) ) / h
        = lim_{h→0} ( −tr( exp( h 1_i^∧ ) C W^T ) + tr( C W^T ) ) / h
        ≈ lim_{h→0} ( −tr( (1 + h 1_i^∧) C W^T ) + tr( C W^T ) ) / h
        = lim_{h→0} ( −tr( h 1_i^∧ C W^T ) ) / h
        = −tr( 1_i^∧ C W^T ).   (8.32)
5 We require these useful facts to take the derivatives: ∂/∂A det A = det(A) A^{-T}, ∂/∂A tr(A B^T) = B, ∂/∂A tr(B A A^T) = (B + B^T) A.
Simplified Explanation
If we somehow knew that det W > 0, then we could proceed as follows.
First, we postmultiply (8.30) by itself transposed to find
L CC^T L^T = W W^T   (with CC^T = 1).   (8.34)
Detailed Explanation
The detailed explanation begins by first carrying out a singular-value
decomposition (SVD)6 on the (square, real) matrix, W, so that
W = UDVT , (8.37)
6 The singular-value decomposition of a real M × N matrix, A, is a factorization of the
form A = UDVT where U is an M × M real, orthogonal matrix (i.e., UT U = 1), D is
an M × N matrix with real entries di ≥ 0 on the main diagonal (all other entries zero),
and V is an N × N real, orthogonal matrix (i.e., VT V = 1). The di are called the
singular values and are typically ordered from largest to smallest along the diagonal of
D. Note, the SVD is not unique.
Since the singular values are positive, we have that det D > 0. Or, in other words, the signs of the determinants of S and W must be the same, which implies that

det S = sgn( det S ) = sgn( det W ) = sgn( det( U D V^T ) ) = sgn( det U det D det V ) = det U det V = ±1,   (8.46)

since det U = ±1, det D > 0, and det V = ±1.
Summary:
We have provided all of the solutions for C in our point-alignment
problem; depending on the properties of W, there can be one or in-
finitely many global solutions. Looking back through all the cases and
subcases, we can see that if there is a unique global solution for C, it
is always of the form
C = USVT (8.60)
with S = diag(1, 1, det U det V) and W = UDVT is a singular-value
decomposition of W. The necessary and sufficient conditions for this
unique global solution to exist are:
(i) det W > 0, or
(ii) det W < 0 and minimum singular value distinct: d1 ≥ d2 >
d3 > 0, or
(iii) rank W = 2.
If none of these conditions is true, there will be infinitely many solutions for C.
However, these cases are fairly pathological and do not occur frequently
in practical situations.
Once we have solved for the optimal rotation matrix, we take Ĉvk i =
C as our estimated rotation. We build the estimated translation as
r̂_i^{vk i} = p − Ĉ_{vk i}^T y,   (8.61)
and, if desired, combine the translation and rotation into an estimated
transformation matrix,

T̂_{vk i} = [ Ĉ_{vk i}   −Ĉ_{vk i} r̂_i^{vk i} ; 0^T   1 ],   (8.62)
that provides the optimal alignment of the two point-clouds in a sin-
gle quantity. Again, as mentioned in Section 6.3.2 we may actually be
interested in T̂_{i vk}, which can be recovered using

T̂_{i vk} = T̂_{vk i}^{-1} = [ Ĉ_{i vk}   r̂_i^{vk i} ; 0^T   1 ].   (8.63)
Both forms of the transformation matrix are useful, depending on how
the solution will be used.
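The following Python sketch carries out the closed-form alignment summarized above: it builds W from the weighted, centroid-removed point pairs, takes its SVD, forms C = U S V^T with S = diag(1, 1, det U det V), and recovers the translation via (8.61). The function and variable names are our own.

```python
import numpy as np

def align_point_clouds(p, y, w=None):
    """p, y: (M, 3) arrays of corresponding points (stationary, moving); w: (M,) weights."""
    M = p.shape[0]
    w = np.ones(M) if w is None else np.asarray(w, dtype=float)
    wsum = w.sum()
    p_bar = (w[:, None] * p).sum(axis=0) / wsum
    y_bar = (w[:, None] * y).sum(axis=0) / wsum
    # W = (1/w) sum_j w_j (y_j - y)(p_j - p)^T
    W = (w[:, None, None]
         * (y - y_bar)[:, :, None]
         * (p - p_bar)[:, None, :]).sum(axis=0) / wsum
    U, d, Vt = np.linalg.svd(W)
    S = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])   # det V = det V^T
    C = U @ S @ Vt                       # estimated rotation C_{vk,i}
    r = p_bar - C.T @ y_bar              # estimated translation r_i^{vk,i}, eq (8.61)
    T = np.eye(4)
    T[:3, :3] = C
    T[:3, 3] = -C @ r                    # transformation T_{vk,i}, eq (8.62)
    return C, r, T
```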
Example 8.1 We provide an example of subcase (i-b) to make things
tangible. Consider the following two point-clouds that we wish to align,
each consisting of six points:
p_1 = 3·1_1,  p_2 = 2·1_2,  p_3 = 1_3,  p_4 = −3·1_1,  p_5 = −2·1_2,  p_6 = −1_3,
y_1 = −3·1_1,  y_2 = −2·1_2,  y_3 = −1_3,  y_4 = 3·1_1,  y_5 = 2·1_2,  y_6 = 1_3,
where

ϕ = [ ϕ_1 ; ϕ_2 ; ϕ_3 ] = Y^T U^T φ,   (8.73)

and owing to the properties of a skew-symmetric matrix (zeros on the diagonal). For the second term, we use the identity u^∧ u^∧ = −u^T u 1 + u u^T to write

δJ = −(1/2) tr( φ^∧ φ^∧ U Y D S Y^T U^T )
   = −(1/2) tr( Y^T U^T ( −φ^T φ 1 + φ φ^T ) U Y D S )
   = −(1/2) tr( ( −ϕ² 1 + ϕ ϕ^T ) D S ),   (8.74)

where ϕ² = ϕ^T ϕ = ϕ_1² + ϕ_2² + ϕ_3².
Iterative Approach
We can also consider using an iterative approach to solve for the optimal
rotation matrix, C. We will use our SO(3)-sensitive scheme to do this.
Importantly, the optimization we carry out is unconstrained, thereby
avoiding the difficulties of the previous two approaches8 . Technically,
the result is not valid globally, only locally, as we require an initial
guess that is refined from one iteration to the next; typically only a few
iterations are needed. However, based on our discussion of local minima
in the last section, we know that in all the important situations where
there is a unique global minimum, there are no additional local minima
to worry about.
Starting from the cost function where the translation has been elim-
inated,
J(C) = (1/2) ∑_{j=1}^{M} w_j ( (y_j − y) − C(p_j − p) )^T ( (y_j − y) − C(p_j − p) ),   (8.81)
we can insert the SO(3)-sensitive perturbation,
C = exp ψ ∧ Cop ≈ 1 + ψ ∧ Cop , (8.82)
where Cop is the current guess and ψ is the perturbation; we will seek
an optimal value to update the guess (and then iterate). Inserting the
approximate perturbation scheme into the cost function turns it into
a quadratic in ψ for which the minimizing value, ψ ? , is given by the
solution to
C_op ( −(1/w) ∑_{j=1}^{M} w_j (p_j − p)^∧ (p_j − p)^∧ ) C_op^T ψ^⋆ = −(1/w) ∑_{j=1}^{M} w_j (y_j − y)^∧ C_op (p_j − p),   (8.83)

where the bracketed sum on the left is constant (it can be computed once from the points).
Since each term in the sum is non-negative, the total must be non-
negative. The only way to have the total be zero is if every term in the
sum is also zero, or
(∀j) (pj − p)∧ x = 0. (8.92)
In other words, we must have x = 0 (not true by assumption), pj = p,
or x parallel to pj − p. The last two conditions are never true as long
as there are at least three points and they are not collinear.
Note, having three non-collinear points only provides a sufficient con-
dition for a unique solution for ψ ? at each iteration, and does not tell
us about the number of possible global solutions to minimize our ob-
jective function in general. This was discussed at length in the previous
sections, where we learned there could be one or infinitely many global
solutions. Moreover, if there is a unique global minimum, there are no
local minima to worry about.
where wj > 0 are the usual scalar weights. We seek to minimize J with
respect to T ∈ SE(3). Notably, this objective function is equivalent to
the ones for the unit-quaternion and rotation-matrix parameterizations,
so the minima should be the same.
To do this, we use our SE(3)-sensitive perturbation scheme,
T = exp (∧ ) Top ≈ (1 + ∧ ) Top , (8.96)
where Top is some initial guess (i.e., operating point of our lineariza-
tion) and is a small perturbation to that guess. Inserting this into the
objective function we then have
J(T) ≈ (1/2) ∑_{j=1}^{M} w_j ( (y_j − z_j) − z_j^⊙ ε )^T ( (y_j − z_j) − z_j^⊙ ε ),   (8.97)

where z_j = T_op p_j.
Setting this to zero, we have the following system of equations for the optimal ε^⋆:

( (1/w) ∑_{j=1}^{M} w_j z_j^{⊙T} z_j^⊙ ) ε^⋆ = (1/w) ∑_{j=1}^{M} w_j z_j^{⊙T} ( y_j − z_j ).   (8.100)
While we could use this to compute the optimal update, both the left-
and right-hand sides require construction from the original points at
each iteration. As in the previous section on the iterative solution using
rotation matrices, it turns out we can manipulate both sides into forms
that do not require the original points.
Looking to the left-hand side first, we can show that

(1/w) ∑_{j=1}^{M} w_j z_j^{⊙T} z_j^⊙ = 𝒯_op^{-T} ( (1/w) ∑_{j=1}^{M} w_j p_j^{⊙T} p_j^⊙ ) 𝒯_op^{-1} = 𝒯_op^{-T} M 𝒯_op^{-1},   (8.101)

where

𝒯_op = Ad(T_op),   M = [ 1   0 ; −p^∧   1 ] [ 1   0 ; 0   I ] [ 1   p^∧ ; 0   1 ],

w = ∑_{j=1}^{M} w_j,   p = (1/w) ∑_{j=1}^{M} w_j p_j,   I = −(1/w) ∑_{j=1}^{M} w_j (p_j − p)^∧ (p_j − p)^∧.   (8.102)
The 6×6 matrix, M, has the form of a generalized mass matrix (Murray
et al., 1994) with the weights as surrogates for masses. Notably, it is
only a function of the points in the stationary frame and is therefore a
constant.
Looking to the right-hand side, we can also show that

a = (1/w) ∑_{j=1}^{M} w_j z_j^{⊙T} ( y_j − z_j ) = [ y − C_op (p − r_op) ; b − y^∧ C_op (p − r_op) ],   (8.103)

where

b = [ tr( 1_i^∧ C_op W^T ) ]_i,   T_op = [ C_op   −C_op r_op ; 0^T   1 ],   (8.104)

W = (1/w) ∑_{j=1}^{M} w_j (y_j − y)(p_j − p)^T,   y = (1/w) ∑_{j=1}^{M} w_j y_j.   (8.105)
Both W and y we have seen before and can be computed in advance
from the points and then used at each iteration of the scheme.
Once again, we can write the solution for the optimal update down in closed form:

ε^⋆ = 𝒯_op M^{-1} 𝒯_op^T a.   (8.106)

Once computed, we simply update our operating point,

T_op ← exp( ε^⋆∧ ) T_op,   (8.107)
and iterate the procedure to convergence. The estimated transformation is then T̂_{vk i} = T_op at the final iteration. Alternatively, T̂_{i vk} = T̂_{vk i}^{-1} may be the output of interest.
Note, applying the optimal perturbation through the exponential
map ensures that Top remains in SE(3) at each iteration. Also, looking
back to Section 4.3.1, we can see that our iterative optimization of T
is exactly in the form of a Gauss-Newton style estimator, but adapted
to work with SE(3).
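A minimal Python sketch of this iterative SE(3) alignment is given below. For readability it rebuilds the normal equations (8.100) directly from the homogeneous points at every iteration rather than using the constant closed-form quantities in (8.101)-(8.106); the helper names are our own.

```python
import numpy as np
from scipy.linalg import expm

def wedge(xi):
    rho, phi = xi[:3], xi[3:]
    P = np.array([[0, -phi[2], phi[1]], [phi[2], 0, -phi[0]], [-phi[1], phi[0], 0]])
    X = np.zeros((4, 4)); X[:3, :3] = P; X[:3, 3] = rho
    return X

def circ_dot(p):
    """The 4x6 'circle-dot' operator so that xi^wedge p = p^circ-dot xi."""
    eps, eta = p[:3], p[3]
    eps_hat = np.array([[0, -eps[2], eps[1]], [eps[2], 0, -eps[0]], [-eps[1], eps[0], 0]])
    out = np.zeros((4, 6)); out[:3, :3] = eta * np.eye(3); out[:3, 3:] = -eps_hat
    return out

def align_iterative(p_hom, y_hom, w, iters=10):
    """p_hom, y_hom: (M, 4) homogeneous point pairs; w: (M,) positive weights."""
    T_op = np.eye(4)
    for _ in range(iters):
        LHS = np.zeros((6, 6)); RHS = np.zeros(6)
        for pj, yj, wj in zip(p_hom, y_hom, w):
            zj = T_op @ pj
            Zj = circ_dot(zj)
            LHS += wj * Zj.T @ Zj                   # left-hand side of (8.100)
            RHS += wj * Zj.T @ (yj - zj)            # right-hand side of (8.100)
        eps_star = np.linalg.solve(LHS, RHS)        # optimal perturbation
        T_op = expm(wedge(eps_star)) @ T_op         # update stays on SE(3)
    return T_op
```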
of the vehicle, ω_{vk}^{vk i}, which we note are expressed in the vehicle frame. We combine these as

ϖ_k = [ ν_{vk}^{vk i} ; ω_{vk}^{vk i} ],   k = 1 . . . K,   (8.112)

at a number of discrete times (we will assume the inputs
Continuous Time

We will start with the SE(3) kinematics10,

Ṫ = ϖ^∧ T,   (8.116)

where the quantities involved are perturbed by process noise according to:

T = exp( δξ^∧ ) T̄,   (8.117a)
ϖ = ϖ̄ + δϖ.   (8.117b)

We can separate these into nominal and perturbation kinematics as in (7.243):

Discrete Time

If we assume quantities remain constant between discrete times, then we can use the ideas from Section 7.2.2 to write

nominal kinematics:   T̄_k = exp( Δt_k ϖ̄_k^∧ ) T̄_{k−1} ≡ Ξ_k T̄_{k−1},   (8.119a)
Nonlinear
Our 3 × 1 measurement model can be written compactly as
yjk = DT Tk pj + njk , (8.120)
where the position of the known points on the moving vehicle are ex-
pressed in 4 × 1 homogeneous coordinates (bottom row equal to 1),
p_j = [ r_i^{pj i} ; 1 ],   (8.121)

and

D^T = [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ],   (8.122)

is a projection matrix used to ensure the measurements are indeed 3 × 1
by removing the 1 on the bottom row. We have also now included,
njk ∼ N (0, Rjk ), which is Gaussian measurement noise.
Linearized
We linearize (8.120) in much the same way as the motion model through
the use of perturbations:
Tk = exp δξ ∧k T̄k , (8.123a)
yjk = ȳjk + δyjk . (8.123b)
Nomenclature

To match the notation used in our derivations of our nonlinear estimators, we define the following symbols:

T̂_k : 4 × 4 corrected estimate of pose at time k
P̂_k : 6 × 6 covariance of corrected estimate at time k (for both translation and rotation)
Ť_k : 4 × 4 predicted estimate of pose at time k
P̌_k : 6 × 6 covariance of predicted estimate at time k (for both translation and rotation)
Ť_0 : 4 × 4 prior input as pose at time 0
ϖ_k : 6 × 1 prior input as generalized velocity at time k
Q_k : 6 × 6 covariance of process noise (for both translation and rotation)
y_jk : 3 × 1 measurement of point j from vehicle at time k
R_jk : 3 × 3 covariance of measurement j at time k
We will use these in two different estimators, the EKF and batch
discrete-time MAP estimation.
Prediction Step
Predicting the mean forwards in time is not difficult in the case of
the EKF; we simply pass our prior estimate and latest measurements
Thus, in this case the coefficient matrix of the linearized motion model is

F_{k−1} = exp( Δt_k ϖ_k^⋏ ),   (8.130)

which depends only on the input and not the state, due to our convenient choice of representing uncertainty via the exponential map. The covariance prediction proceeds in the usual EKF manner as

P̌_k = F_{k−1} P̂_{k−1} F_{k−1}^T + Q_k.   (8.131)
The corrective step is where we must pay particular attention to the
pose variables.
Correction Step
Looking back to (8.126) for the perturbation measurement model,

δy_jk = D^T ( Ť_k p_j )^⊙ δξ̌_k + n_jk,   (8.132)

where we identify G_jk = D^T ( Ť_k p_j )^⊙.
ε_k = ln( T̂_k Ť_k^{-1} )^∨ = K_k ( y_k − y̌_k ),   (8.137)

where the left-hand side is the update and the term in parentheses on the right is the innovation; ε_k = ln( T̂_k Ť_k^{-1} )^∨ is the difference of the corrected and predicted means and y̌_k is the nonlinear, nominal measurement model evaluated at the predicted mean:

y̌_k = [ y̌_{1k} ; . . . ; y̌_{Mk} ],   y̌_jk = D^T Ť_k p_j,   (8.138)
where we have again accounted for the fact that there could be M observations of points on the vehicle. Once we have computed the mean correction, ε_k, we apply it according to
Summary
Putting the pieces from the last two sections together, we have our
canonical five EKF equations for this system:
We have essentially modified the EKF so that all the mean calculations
occur in SE(3), the Lie group, and all of the covariance calculations
occur in se(3), the Lie algebra. As usual, we must initialize the filter
at the first timestep using Ť0 . Although we do not show it, we could
easily turn this into an iterated EKF by relinearizing about the latest
estimate and iterating over the correction step. Finally, the algorithm
has T̂_{vk i} = T̂_k, so we can compute T̂_{i vk} = T̂_k^{-1} if desired.
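The Python sketch below illustrates one predict/correct cycle of this SE(3) EKF: the mean is stored as a 4 × 4 transform, the covariance as a 6 × 6 matrix on se(3), and the measurement Jacobian uses G_jk = D^T (Ť_k p_j)^⊙. The gain and covariance updates are written in the standard EKF forms; the helper names and interfaces are our own assumptions.

```python
import numpy as np
from scipy.linalg import expm, block_diag

def wedge(xi):
    rho, phi = xi[:3], xi[3:]
    P = np.array([[0, -phi[2], phi[1]], [phi[2], 0, -phi[0]], [-phi[1], phi[0], 0]])
    X = np.zeros((4, 4)); X[:3, :3] = P; X[:3, 3] = rho
    return X

def adjoint(T):
    C, r = T[:3, :3], T[:3, 3]
    r_hat = np.array([[0, -r[2], r[1]], [r[2], 0, -r[0]], [-r[1], r[0], 0]])
    Ad = np.zeros((6, 6)); Ad[:3, :3] = C; Ad[3:, 3:] = C; Ad[:3, 3:] = r_hat @ C
    return Ad

def circ_dot(p):
    eps, eta = p[:3], p[3]
    eps_hat = np.array([[0, -eps[2], eps[1]], [eps[2], 0, -eps[0]], [-eps[1], eps[0], 0]])
    out = np.zeros((4, 6)); out[:3, :3] = eta * np.eye(3); out[:3, 3:] = -eps_hat
    return out

D = np.vstack([np.eye(3), np.zeros((1, 3))])   # dilation matrix

def ekf_step(T_hat, P_hat, varpi, dt, Q, points, y_meas, R_list):
    # Prediction: mean compounds with the input; F = Ad(Xi) = exp(dt * varpi^curlywedge).
    Xi = expm(dt * wedge(varpi))
    T_check = Xi @ T_hat
    F = adjoint(Xi)
    P_check = F @ P_hat @ F.T + Q

    # Correction: stack all M point observations.
    G = np.vstack([D.T @ circ_dot(T_check @ p) for p in points])
    y_check = np.concatenate([D.T @ (T_check @ p) for p in points])
    R = block_diag(*R_list)
    K = P_check @ G.T @ np.linalg.inv(G @ P_check @ G.T + R)
    eps = K @ (np.concatenate(y_meas) - y_check)
    T_hat_new = expm(wedge(eps)) @ T_check        # mean update stays in SE(3)
    P_hat_new = (np.eye(6) - K @ G) @ P_check     # covariance stays on se(3)
    return T_hat_new, P_hat_new
```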
so that
ey,jk (x) ∼ N (0, Rjk ) . (8.149)
These noise properties allow us to next construct the objective func-
tion that we want to minimize in our batch MAP problem:
J_{v,k}(x) = { (1/2) e_{v,0}(x)^T P̌_0^{-1} e_{v,0}(x),   k = 0
            { (1/2) e_{v,k}(x)^T Q_k^{-1} e_{v,k}(x),   k = 1 . . . K,   (8.150a)

J_{y,k}(x) = (1/2) e_{y,k}(x)^T R_k^{-1} e_{y,k}(x),   (8.150b)

where we have stacked the M point quantities together according to

e_{y,k}(x) = [ e_{y,1k}(x) ; . . . ; e_{y,Mk}(x) ],   R_k = diag( R_{1k}, . . . , R_{Mk} ).   (8.151)

The overall objective function that we will seek to minimize is then

J(x) = ∑_{k=0}^{K} ( J_{v,k}(x) + J_{y,k}(x) ).   (8.152)
The next section will look at linearizing our error terms in order to carry out Gauss-Newton optimization.
but this approximation will get better as ε_0 goes to zero, which will happen as the Gauss-Newton algorithm converges11.
For the input errors at the later times, we have

e_{v,k}(x) = ln( Ξ_k T_{k−1} T_k^{-1} )^∨
           = ln( Ξ_k exp( ε_{k−1}^∧ ) T_{op,k−1} T_{op,k}^{-1} exp( −ε_k^∧ ) )^∨
           = ln( Ξ_k T_{op,k−1} T_{op,k}^{-1} exp( ( Ad( T_{op,k} T_{op,k−1}^{-1} ) ε_{k−1} )^∧ ) exp( −ε_k^∧ ) )^∨
           ≈ e_{v,k}(x_op) + Ad( T_{op,k} T_{op,k−1}^{-1} ) ε_{k−1} − ε_k,   (8.156)

where e_{v,k}(x_op) = ln( Ξ_k T_{op,k−1} T_{op,k}^{-1} )^∨ is the error evaluated at the operating point and F_{k−1} ≡ Ad( T_{op,k} T_{op,k−1}^{-1} ).
For the measurement errors, we have
where

e_{y,k}(x) = [ e_{y,1k}(x) ; . . . ; e_{y,Mk}(x) ],   e_{y,k}(x_op) = [ e_{y,1k}(x_op) ; . . . ; e_{y,Mk}(x_op) ],   G_k = [ G_{1k} ; . . . ; G_{Mk} ].   (8.159)
Next, we will insert these approximations into our objective function
to complete the Gauss-Newton derivation.
11 We could also include the SE(3) Jacobian to do better here, as was done in
Section 7.3.4, but this is a reasonable starting point.
Gauss-Newton Update

To set up the Gauss-Newton update, we define the following stacked quantities:

δx = [ ε_0 ; ε_1 ; ε_2 ; . . . ; ε_K ],

H = [ 1
      −F_0   1
             −F_1   1
                    ⋱   ⋱
                        −F_{K−1}   1
      G_0
             G_1
                    G_2
                               ⋱
                                   G_K ],

e(x_op) = [ e_{v,0}(x_op) ; e_{v,1}(x_op) ; . . . ; e_{v,K−1}(x_op) ; e_{v,K}(x_op) ; e_{y,0}(x_op) ; e_{y,1}(x_op) ; . . . ; e_{y,K}(x_op) ],   (8.160)

and

W = diag( P̌_0, Q_1, . . . , Q_K, R_0, R_1, . . . , R_K ),   (8.161)
so that the objective function is approximately

J(x) ≈ J(x_op) − b^T δx + (1/2) δx^T A δx,   (8.162)

where

A = H^T W^{-1} H   (block-tridiagonal),   b = H^T W^{-1} e(x_op).   (8.163)

The minimizing perturbation, δx^⋆, is the solution to

A δx^⋆ = b,   (8.164)
which ensures that Top,k stays in SE(3). We then iterate the entire
scheme to convergence. As a reminder we note that at the final iteration
we have T̂_{vk i} = T_{op,k} as our estimate, but if we prefer we can compute T̂_{i vk} = T̂_{vk i}^{-1}.
Once again, the main concept that we have used to derive this Gauss-
Newton optimization problem involving pose variables is to compute
the update in the Lie algebra, se(3), but store the mean in the Lie
group, SE(3).
where we note that there will be one term in the sum for each relative
criterion is met. Once converged, we set T̂k0 = Top,k at the last iteration
as the final estimates for the vehicle poses relative to pose 0.
8.3.3 Initialization
There are several ways to initialize the operating point, xop at the
start of the Gauss-Newton procedure. A common method is to find a
spanning tree as in Figure 8.4; the initial values of the pose variables can
be found by compounding (a subset of) the relative pose measurements
outward from the chosen privileged node 0. Note, the spanning tree is
not unique, and as such different initializations can be computed. A shallow spanning tree is preferable over a deep one so that as little uncertainty as possible is accumulated at any given node.
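One minimal way to realize this initialization is a breadth-first traversal from the privileged node 0, compounding relative pose measurements outward, as in the Python sketch below. The data structure (a dictionary of measured 4 × 4 transforms keyed by pose pairs) is our own assumption.

```python
from collections import deque
import numpy as np

def initialize_poses(num_poses, rel_meas):
    """rel_meas[(k, l)] holds a measured 4x4 transform T_{kl} between connected poses k and l."""
    # Build an undirected adjacency list from the available relative measurements.
    adj = {k: [] for k in range(num_poses)}
    for (k, l) in rel_meas:
        adj[k].append(l)
        adj[l].append(k)

    T_init = {0: np.eye(4)}            # privileged node 0
    queue = deque([0])                 # BFS gives a shallow spanning tree
    while queue:
        l = queue.popleft()
        for k in adj[l]:
            if k in T_init:
                continue
            # Compound outward: T_{k0} = T_{kl} T_{l0}.
            T_kl = rel_meas[(k, l)] if (k, l) in rel_meas else np.linalg.inv(rel_meas[(l, k)])
            T_init[k] = T_kl @ T_init[l]
            queue.append(k)
    return T_init
```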
A = [ Σ'_{10}^{-1} + 𝒯_{21}^T Σ'_{21}^{-1} 𝒯_{21}      −𝒯_{21}^T Σ'_{21}^{-1}
      −Σ'_{21}^{-1} 𝒯_{21}      Σ'_{21}^{-1} + 𝒯_{32}^T Σ'_{32}^{-1} 𝒯_{32}      −𝒯_{32}^T Σ'_{32}^{-1}
                                −Σ'_{32}^{-1} 𝒯_{32}      Σ'_{32}^{-1} + 𝒯_{43}^T Σ'_{43}^{-1} 𝒯_{43}      −𝒯_{43}^T Σ'_{43}^{-1}
                                                           −Σ'_{43}^{-1} 𝒯_{43}      Σ'_{43}^{-1} + 𝒯_{54}^T Σ'_{54}^{-1} 𝒯_{54} ],   (8.189)

where

Σ'_{kℓ}^{-1} = 𝒥_{kℓ}^{-T} Σ_{kℓ}^{-1} 𝒥_{kℓ}^{-1},   (8.190a)
𝒯_{kℓ} = 𝒯_{op,k} 𝒯_{op,ℓ}^{-1},   (8.190b)
𝒥_{kℓ} = 𝒥( −e_{kℓ}(x_op) ).   (8.190c)
A = UUT , (8.192)
9
Pose-and-Point Estimation Problems

In this chapter of the book, we will address one of the most fundamen-
tal problems in mobile robotics, estimating the trajectory of a robot
and the structure of the world around it (i.e., point landmarks) to-
gether. In robotics, this is called the simultaneous localization and map-
ping (SLAM) problem. However, in computer vision an almost identical
problem came to prominence through the application of aligning aerial
photographs into a mosaic; the classic solution to this problem is called
bundle adjustment (BA). We will look at BA through the lens of our
SE(3) estimation techniques.
x = {T1 , . . . , TK , p1 , . . . , pM } , (9.1)
Nonlinear Model
The measurement, y_jk, will correspond to some observation of point j from pose k (i.e., some function of r_{vk}^{pj vk}). The measurement model for this problem will be of the form
yjk = g (xjk ) + njk , (9.2)
where g(·) is the nonlinear model and njk ∼ N (0, Rjk ) is additive
Gaussian noise. We can use the shorthand
y = {y10 , . . . yM 0 , . . . , y1K , . . . , yM K } , (9.3)
to capture all the measurements that we have available.
As discussed in Section 7.3.5, we can think of the overall observation model as the composition of two nonlinearities: one to transform the point into the vehicle frame and one to turn that point into the actual sensor measurement through a camera (or other sensor) model. Letting
z(xjk ) = Tk pj , (9.4)
we can write
g (xjk ) = s (z(xjk )) , (9.5)
where s(·) is the nonlinear camera model1 . In other words, we have
g = s ◦ z, in terms of the composition of functions.
Perturbed Model
We will go one step beyond simply linearizing our model and work out the perturbed model to second order. This could be used, for example, to estimate the bias in using ML estimation, as discussed in Section 4.3.3.

We define the following perturbations to our state variables:

T_k = exp( ε_k^∧ ) T_{op,k} ≈ ( 1 + ε_k^∧ + (1/2) ε_k^∧ ε_k^∧ ) T_{op,k},   (9.6a)
p_j = p_{op,j} + D ζ_j,   (9.6b)

where

D = [ 1 0 0 ; 0 1 0 ; 0 0 1 ; 0 0 0 ],   (9.7)
is a dilation matrix so that our landmark perturbation, ζ j , is 3 × 1.
We will use the shorthand xop = {Top,1 , . . . , Top,K , pop,1 , . . . , pop,M }
to indicate the entire trajectory’s linearization operating point as well
as xop,jk = {Top,k , pop,j } to indicate the subset of the operating point
1 See Section 6.4 for several possibilities for the camera (or sensor) model.
including the kth pose and jth landmark. The perturbations will be
denoted

δx = [ ε_1 ; . . . ; ε_K ; ζ_1 ; . . . ; ζ_M ],   (9.8)

with the pose quantities on top and the landmark quantities on the bottom. We will also use

δx_jk = [ ε_k ; ζ_j ],   (9.9)
to indicate just the perturbations associated with the kth pose and the
jth landmark.
Using the perturbation schemes above, we have that

z(x_jk) ≈ ( 1 + ε_k^∧ + (1/2) ε_k^∧ ε_k^∧ ) T_{op,k} ( p_{op,j} + D ζ_j )
        ≈ T_{op,k} p_{op,j} + ε_k^∧ T_{op,k} p_{op,j} + T_{op,k} D ζ_j
          + (1/2) ε_k^∧ ε_k^∧ T_{op,k} p_{op,j} + ε_k^∧ T_{op,k} D ζ_j
        = z(x_{op,jk}) + Z_jk δx_jk + (1/2) ∑_i 1_i ( δx_jk^T 𝒵_ijk δx_jk ),   (9.10)
and i is an index over the rows of z(·), and 1i is the ith column of the
identity matrix, 1.
To then apply the nonlinear camera model, we use the chain rule (for
g(x_jk) = s( z(x_jk) )

≈ s( z_{op,jk} + Z_jk δx_jk + (1/2) ∑_m 1_m δx_jk^T 𝒵_mjk δx_jk )   (the argument's perturbation is δz_jk)

≈ s(z_{op,jk}) + S_jk δz_jk + (1/2) ∑_i 1_i δz_jk^T 𝒮_ijk δz_jk

= s(z_{op,jk}) + ∑_i 1_i 1_i^T S_jk ( Z_jk δx_jk + (1/2) ∑_m 1_m δx_jk^T 𝒵_mjk δx_jk )
  + (1/2) ∑_i 1_i ( Z_jk δx_jk + (1/2) ∑_m 1_m δx_jk^T 𝒵_mjk δx_jk )^T 𝒮_ijk ( Z_jk δx_jk + (1/2) ∑_m 1_m δx_jk^T 𝒵_mjk δx_jk )

≈ g(x_{op,jk}) + G_jk δx_jk + (1/2) ∑_i 1_i ( δx_jk^T 𝒢_ijk δx_jk ),   (9.12)
and i is an index over the rows of s(·), and 1i is the ith column of the
identity matrix, 1.
If we only care about the linearized (i.e., first-order) model, then we
can simply use
where x is the full state that we wish to estimate (all poses and land-
marks) and Rjk is the symmetric, positive-definite covariance matrix
associated with the jkth measurement. If a particular landmark is not
actually observed from a particular pose, we can simply delete the ap-
propriate term from the objective function. The usual approach to this
estimation problem is to apply the Gauss-Newton method. Here we
will derive the full Newton’s method and then approximate to arrive
at Gauss-Newton.
Newton’s Method
Approximating the error function, we have

e_{y,jk}(x) ≈ y_jk − g(x_{op,jk}) − G_jk δx_jk − (1/2) ∑_i 1_i δx_jk^T 𝒢_ijk δx_jk,   (9.17)

where e_{y,jk}(x_op) = y_jk − g(x_{op,jk}).
Gauss-Newton Method
Typically in practice, the Gauss-Newton approximation to the Hessian
is taken so that at each iteration we solve the linear system
A δx? = b, (9.24)
3 In practice, including this extra term sometimes makes the numerical stability of the
whole procedure worse, so it should be added with caution.
with

b = ∑_{j,k} P_jk^T G_jk^T R_jk^{-1} e_{y,jk}(x_op),   (9.25a)

A = ∑_{j,k} P_jk^T G_jk^T R_jk^{-1} G_jk P_jk,   (9.25b)
with

G_jk = [ G_{1,jk}   G_{2,jk} ],   G_{1,jk} = S_jk ( T_{op,k} p_{op,j} )^⊙,   G_{2,jk} = S_jk T_{op,k} D,   (9.27)

using the definitions from earlier.

In the case of K = 3 free poses (plus fixed pose 0) and M = 2 landmarks, the matrices have the form

G = [ G_1   G_2 ],

G_1 = [ 0         0         0
        0         0         0
        G_{1,11}  0         0
        G_{1,21}  0         0
        0         G_{1,12}  0
        0         G_{1,22}  0
        0         0         G_{1,13}
        0         0         G_{1,23} ],

G_2 = [ G_{2,10}  0
        0         G_{2,20}
        G_{2,11}  0
        0         G_{2,21}
        G_{2,12}  0
        0         G_{2,22}
        G_{2,13}  0
        0         G_{2,23} ],

e_y(x_op) = [ e_{y,10}(x_op) ; e_{y,20}(x_op) ; e_{y,11}(x_op) ; e_{y,21}(x_op) ; e_{y,12}(x_op) ; e_{y,22}(x_op) ; e_{y,13}(x_op) ; e_{y,23}(x_op) ],

R = diag( R_{10}, R_{20}, R_{11}, R_{21}, R_{12}, R_{22}, R_{13}, R_{23} ),   (9.28)

under one particular ordering of the measurements.
In general, multiplying out the left-hand side, A = G^T R^{-1} G, we see that

A = [ A_{11}   A_{12} ; A_{12}^T   A_{22} ],   (9.29)
where
The fact that both A11 and A22 are block-diagonal means this system
has a very special sparsity pattern that can be exploited to efficiently
solve for δx? at each iteration. This will be discussed in detail in the
next section.
where the state, δx? , has been partitioned into parts corresponding to
(1) the pose perturbation, δx?1 = ? , and (2) the landmark perturba-
tions, δx?2 = ζ ? .
It turns out that the Hessian of the objective function, A, has a
very special sparsity pattern as depicted in Figure 9.2; it is sometimes
referred to as an arrowhead matrix. This pattern is due to the presence
of the projection matrices, Pjk , in each term of A; they embody the
fact that each measurement involves just one pose variable and one
landmark.
As seen in Figure 9.2, we have that A11 and A22 are both block-
diagonal because each measurement involves only one pose and one
landmark at a time. We can exploit this sparsity to efficiently solve (9.21)
for δx? ; this is sometimes referred to as sparse bundle adjustment. There
are a few different ways to do this; we will discuss the Schur complement
and a Cholesky technique.
[Figure 9.2: Sparsity pattern of A. Non-zero entries are indicated by ∗. This structure is often referred to as an arrowhead matrix, because the ζ part is large compared to the ε part.]
Schur Complement

Typically, the Schur complement is used to manipulate (9.31) into a form that is more efficiently solved. This can be seen by premultiplying both sides by

[ 1   −A_{12} A_{22}^{-1} ; 0   1 ],

so that

[ A_{11} − A_{12} A_{22}^{-1} A_{12}^T   0 ; A_{12}^T   A_{22} ] [ δx_1^⋆ ; δx_2^⋆ ] = [ b_1 − A_{12} A_{22}^{-1} b_2 ; b_2 ],

which has the same solution as (9.31). We may then easily solve for δx_1^⋆, and since A_{22} is block-diagonal, A_{22}^{-1} is cheap to compute. Finally, δx_2^⋆ (if desired) can also be efficiently computed through back-substitution, again owing to the sparsity of A_{22}. This procedure brings the complexity of each solve down from O((K + M)^3) without sparsity to O(K^3 + K^2 M) with sparsity, which is most beneficial when K ≪ M.
A similar procedure can be had by exploiting the sparsity of A11 , but
in robotics problems we may also have some additional measurements
that perturb this structure and, more importantly, δx?2 is usually much
larger than δx?1 in bundle adjustment. While the Schur complement
method works well, it does not directly provide us with an explicit
method of computing A−1 , the covariance matrix associated with δx? ,
should we desire it. The Cholesky approach is better suited to this end.
Cholesky Decomposition

Every symmetric positive-definite matrix, including A, can be factored as follows through a Cholesky decomposition:

[ A_{11}   A_{12} ; A_{12}^T   A_{22} ] = [ U_{11}   U_{12} ; 0   U_{22} ] [ U_{11}^T   0 ; U_{12}^T   U_{22}^T ],   (9.32)

which can again be computed efficiently due to the fact that U_{22} is block-diagonal and U_{11} is small and in upper-triangular form. Then we have that

U^T δx^⋆ = U^{-1} b,   (9.36)

or

[ U_{11}^T   0 ; U_{12}^T   U_{22}^T ] [ δx_1^⋆ ; δx_2^⋆ ] = [ U_{11}^{-1} ( b_1 − U_{12} U_{22}^{-1} b_2 ) ; U_{22}^{-1} b_2 ],   (9.37)

which allows us to compute δx_1^⋆ and then back-substitute for δx_2^⋆, similarly to the Schur complement method.
[Figure 9.3: A BA problem with only three (non-collinear) point landmarks and two free poses (1 and 2). Pose 0 is fixed. It turns out this problem does not have a unique solution as there are too few landmarks to constrain the two free poses. There is one term in the cost function, J_{y,jk}, for each measurement, as shown.]

However, unlike the Schur complement method, A^{-1} is now computed easily:

A^{-1} = ( U U^T )^{-1} = U^{-T} U^{-1} = L L^T

       = [ U_{11}^{-T}                           0
           −U_{22}^{-T} U_{12}^T U_{11}^{-T}     U_{22}^{-T} ] [ U_{11}^{-1}   −U_{11}^{-1} U_{12} U_{22}^{-1}
                                                                 0             U_{22}^{-1}                    ]

       = [ U_{11}^{-T} U_{11}^{-1}                              −U_{11}^{-T} U_{11}^{-1} U_{12} U_{22}^{-1}
           −U_{22}^{-T} U_{12}^T U_{11}^{-T} U_{11}^{-1}        U_{22}^{-T} ( U_{12}^T U_{11}^{-T} U_{11}^{-1} U_{12} + 1 ) U_{22}^{-1} ],   (9.38)

where L = U^{-T}.
We will first set up the equations as though it were possible to solve for
both poses 1 and 2 and then introduce the pose-interpolation scheme.
The state variables to be estimated are
x = {T1 , T2 , p1 , p2 , p3 } . (9.39)
δx = [ ε_1 ; ε_2 ; ζ_1 ; ζ_2 ; ζ_3 ].   (9.41)
A δx? = b, (9.42)
where the A and b matrices for this problem have the form

A = [ A_{11}    0         A_{13}   A_{14}   0
      0         A_{22}    0        A_{24}   A_{25}
      A_{13}^T  0         A_{33}   0        0
      A_{14}^T  A_{24}^T  0        A_{44}   0
      0         A_{25}^T  0        0        A_{55} ],   b = [ b_1 ; b_2 ; b_3 ; b_4 ; b_5 ],   (9.43)
with

A_{11} = G_{1,11}^T R_{11}^{-1} G_{1,11} + G_{1,21}^T R_{21}^{-1} G_{1,21},

and

b_1 = G_{1,11}^T R_{11}^{-1} e_{y,11}(x_op) + G_{1,21}^T R_{21}^{-1} e_{y,21}(x_op),
b_2 = G_{1,22}^T R_{22}^{-1} e_{y,22}(x_op) + G_{1,32}^T R_{32}^{-1} e_{y,32}(x_op),
b_3 = G_{2,10}^T R_{10}^{-1} e_{y,10}(x_op) + G_{2,11}^T R_{11}^{-1} e_{y,11}(x_op),
b_4 = G_{2,21}^T R_{21}^{-1} e_{y,21}(x_op) + G_{2,22}^T R_{22}^{-1} e_{y,22}(x_op),
b_5 = G_{2,30}^T R_{30}^{-1} e_{y,30}(x_op) + G_{2,32}^T R_{32}^{-1} e_{y,32}(x_op).
where we will call I the interpolation matrix. Our new set of state
variables to be estimated is
x0 = {T, p1 , p2 , p3 } , (9.50)
now that we have eliminated T1 as a free variable. Returning to our
original ML cost function, we can now rewrite it as
J(x') ≈ J(x'_op) − b'^T δx' + (1/2) δx'^T A' δx',   (9.51)

where

A' = I^T A I,   b' = I^T b.   (9.52)

The optimal perturbation (that minimizes the cost function), δx'^⋆, is now the solution to

A' δx'^⋆ = b'.   (9.53)

We update all the operating points in

x'_op = { T_op, p_{op,1}, p_{op,2}, p_{op,3} },   (9.54)

using the usual schemes,

T_op ← exp( ε^⋆∧ ) T_op,   p_{op,j} ← p_{op,j} + D ζ_j^⋆,   (9.55)
and iterate to convergence.
Importantly, applying the interpolation matrix on either side of A to
create A0 does not completely destroy the sparsity. In fact, the bottom-
right block corresponding to the landmarks remains block-diagonal, and
thus A0 is still an arrowhead matrix:
A' = [ ∗  ∗  ∗  ∗
       ∗  ∗
       ∗     ∗
       ∗        ∗ ],   (9.56)
where ∗ indicates a non-zero block. This means that we can still exploit
the sparsity using the methods of the previous section, while interpo-
lating poses.
It turns out that we can use this interpolation scheme (and others) for
more complicated BA problems as well. We just need to decide which
x = {T0 , . . . , TK , p1 , . . . , pM } . (9.57)
We will adopt the motion model from Section 8.2 and the inputs are
given by
v = Ť0 , $ 1 , $ 2 , . . . , $ K . (9.59)
4 We could also choose not to estimate it and simply hold it fixed, which is very common.
δx = [ δx_1 ; δx_2 ],   H = [ F^{-1}   0 ; G_1   G_2 ],   W = [ Q   0 ; 0   R ],   e(x_op) = [ e_v(x_op) ; e_y(x_op) ],   (9.60)

where

δx_1 = [ ε_0 ; ε_1 ; . . . ; ε_K ],   δx_2 = [ ζ_1 ; ζ_2 ; . . . ; ζ_M ],

e_v(x_op) = [ e_{v,0}(x_op) ; e_{v,1}(x_op) ; . . . ; e_{v,K}(x_op) ],   e_y(x_op) = [ e_{y,10}(x_op) ; e_{y,20}(x_op) ; . . . ; e_{y,MK}(x_op) ],

Q = diag( P̌_0, Q_1, . . . , Q_K ),   R = diag( R_{10}, R_{20}, . . . , R_{MK} ),

F^{-1} = [ 1
           −F_0   1
                  −F_1   1
                         ⋱   ⋱
                             −F_{K−1}   1 ],

G_1 = [ G_{1,10}
        ⋮
        G_{1,M0}
                  G_{1,11}
                  ⋮
                  G_{1,M1}
                            ⋱
                              G_{1,1K}
                              ⋮
                              G_{1,MK} ],

and G_2 places G_{2,jk} in the block row for measurement (j, k) and the block column for landmark j (cf. (9.28)).   (9.61)
From Sections 8.2.5 for the motion priors and 9.1.3 for the measurements, the detailed blocks are

F_{k−1} = Ad( T_{op,k} T_{op,k−1}^{-1} ),   k = 1 . . . K,

e_{v,k}(x_op) = { ln( Ť_0 T_{op,0}^{-1} )^∨,   k = 0
               { ln( exp( (t_k − t_{k−1}) ϖ_k^∧ ) T_{op,k−1} T_{op,k}^{-1} )^∨,   k = 1 . . . K,   (9.62)

G_{1,jk} = S_jk ( T_{op,k} p_{op,j} )^⊙,   G_{2,jk} = S_jk T_{op,k} D,

e_{y,jk}(x_op) = y_jk − s( T_{op,k} p_{op,j} ).

Finally, the objective function can be written as usual as

J(x) ≈ J(x_op) − b^T δx + (1/2) δx^T A δx,   (9.63)

where

A = H^T W^{-1} H,   b = H^T W^{-1} e(x_op),   (9.64)

whereupon the minimizing perturbations, δx^⋆, are the solutions to

A δx^⋆ = b.   (9.65)

We solve for δx^⋆, then update our operating points according to

T_{op,k} ← exp( ε_k^⋆∧ ) T_{op,k},   p_{op,j} ← p_{op,j} + D ζ_j^⋆,   (9.66)
9.2.4 Example
Figure 9.4 shows a simple SLAM problem with three point landmarks
and three free poses. In contrast to the BA example of Figure 9.3, we
now allow T 0 to be estimated as we have some prior information about
it, Ť0 , P̌0 , in relation to an external reference frame (shown as the
black, fixed pose). We have shown graphically all of the terms in the
objective function5 , one for each measurement and input, totalling nine
terms:
J = J_{v,0} + J_{v,1} + J_{v,2} + J_{y,10} + J_{y,30} + J_{y,11} + J_{y,21} + J_{y,22} + J_{y,32},   (9.69)

where the first three are prior terms and the remaining six are measurement terms.
Also, with the motion priors we have used, A is always well condi-
tioned and will provide a solution for the trajectory, even without any
measurements.
5 Sometimes this type of diagram is called a factor graph with each ‘factor’ from the
posterior likelihood over the states becoming a ‘term’ in the objective function, which
is really just the negative log-likelihood of the posterior over states.
10
Continuous-Time Estimation
All of our examples in this last part of the book have been in discrete
time, which is sufficient for many applications. However, it is worth
investigating how we might make use of the continuous-time estimation
tools from Sections 3.4 and 4.4 when working with state variables in
SE(3). To this end, we show one way to start from a specific nonlinear,
stochastic differential equation and build motion priors that encourage
trajectory smoothness1 . We then show where these motion priors could
be used within a trajectory estimation problem. Finally, we show how
to interpolate for query poses at times in between the main solution
times, using Gaussian process interpolation.
10.1.1 General
Ideally, we would like to use the following system of nonlinear, stochas-
tic, differential equations to build our motion prior2 :
Ṫ(t) = ϖ(t)^∧ T(t),   (10.1a)
ϖ̇(t) = w(t),   (10.1b)
w(t) ∼ GP( 0, Q δ(t − t′) ).   (10.1c)

To use this to build our motion priors, we will need to estimate the pose, T(t), and the body-centric, generalized angular velocity, ϖ(t), at some times of interest: t_0, t_1, . . . , t_K. White noise, w(t), enters the system through the generalized angular acceleration; in the absence of noise,
the body-centric, generalized angular velocity is constant. The trouble
1 See Furgale et al. (2015) for a survey of continuous-time methods.
2 It is this model that we approximated as a discrete-time system in the previous two
chapters in order to build discrete-time motion priors.
Transition Function

With ϖ̌ constant, the transition function for the perturbation system is

Φ(t, s) = [ exp( (t − s) ϖ̌^⋏ )   (t − s) J( (t − s) ϖ̌ ) ; 0   1 ],   (10.5)
Error Terms

With the transition function in hand, we can define error terms for use within a maximum-a-posteriori estimator. The error term at the first timestep will be

e_{v,0}(x) = −γ(t_0) = −[ ln( T_0 Ť_0^{-1} )^∨ ; ϖ_0 − ϖ̌ ],   (10.7)
which does not contain any terms associated with the measurements.
where { T_{op,k}, ϖ_{op,k} } is the operating point and ( ε_k, ψ_k ) is the perturbation. Using the pose perturbation scheme we see that

ξ_k = ln( T_k Ť_k^{-1} )^∨ = ln( exp( ε_k^∧ ) T_{op,k} Ť_k^{-1} )^∨
    = ln( exp( ε_k^∧ ) exp( ξ_{op,k}^∧ ) )^∨ ≈ ξ_{op,k} + ε_k,   (10.13)

where ξ_{op,k} = ln( T_{op,k} Ť_k^{-1} )^∨. We have used a very approximate version of the BCH formula here, which is only valid if both ε_k and ξ_{op,k} are quite small. The former is reasonable since ε_k → 0 as our estimation scheme converges. The latter will be so if the motion prior has low uncertainty; we have essentially already made this assumption in separating our SDE into the nominal and perturbation parts. Inserting this linearization results in the following linearized error terms:

e_{v,k}(x) ≈ { e_{v,0}(x_op) − θ_0,   k = 0
             { e_{v,k}(x_op) + F_{k−1} θ_{k−1} − θ_k,   k = 1 . . . K,   (10.14)

where

θ_k = [ ε_k ; ψ_k ],   (10.15)
is the stacked perturbation for both the pose and generalized angular
velocity at time k and
Fk−1 = Φ (tk , tk−1 ) . (10.16)
Defining

δx_1 = [ θ_0 ; θ_1 ; . . . ; θ_K ],   e_v(x_op) = [ e_{v,0}(x_op) ; e_{v,1}(x_op) ; . . . ; e_{v,K}(x_op) ],

F^{-1} = [ 1
           −F_0   1
                  −F_1   1
                         ⋱   ⋱
                             −F_{K−1}   1 ],

Q = diag( P̌_0, Q_1, . . . , Q_K ),   (10.17)

we can write the approximate motion-prior part of the objective function in block form as

J_u(x) ≈ (1/2) ( e_v(x_op) − F^{-1} δx_1 )^T Q^{-1} ( e_v(x_op) − F^{-1} δx_1 ),   (10.18)
which is quadratic in the perturbation, δx1 .
10.1.2 Simplification
Computing the Qk blocks in the general case can be done numerically.
However, we can evaluate them in closed form fairly easily for the
specific case of no rotational motion (in the mean of the prior only).
To do this, we define the (constant) generalized angular velocity to be
of the form

ϖ̌ = [ ν̌ ; 0 ].   (10.19)

The mean function will be a constant, linear velocity (i.e., no angular rotation), ν̌. We then have that

ϖ̌^⋏ ϖ̌^⋏ = [ 0   ν̌^∧ ; 0   0 ] [ 0   ν̌^∧ ; 0   0 ] = 0,   (10.20)

so that

exp( (t − s) ϖ̌^⋏ ) = 1 + (t − s) ϖ̌^⋏,   (10.21a)
J( (t − s) ϖ̌ ) = 1 + (1/2) (t − s) ϖ̌^⋏,   (10.21b)
how to solve for the state at the measurement times. Then, we show
how to use Gaussian process interpolation to solve for the state (and
covariance) at other query times.
x = {T0 , $ 0 , . . . , TK , $ K , p1 , . . . , pM } , (10.26)
pj = pop,j + Dζ j , (10.28)
and

the measurement Jacobians G_1 and G_2 are stacked exactly as in (9.61),   (10.30)
where we see that the only change from the SLAM case is that the G1,jk
matrix has a padding 0 to account for the fact that the ψ k perturbation
variable (associated with $ k ) is not involved in the observation of
landmark j from pose k. The part of the objective function associated
with the measurements is then approximately

J_y(x) ≈ (1/2) ( e_y(x_op) − G_1 δx_1 − G_2 δx_2 )^T R^{-1} ( e_y(x_op) − G_1 δx_1 − G_2 δx_2 ),   (10.31)
with
and
δx = [ δx_1 ; δx_2 ],   H = [ F^{-1}   0 ; G_1   G_2 ],   W = [ Q   0 ; 0   R ],   e(x_op) = [ e_v(x_op) ; e_y(x_op) ].   (10.34)
The minimizing perturbation, δx? , is the solution to
A δx? = b. (10.35)
As usual, we solve for δx^⋆, then apply the optimal perturbations using the appropriate schemes,

T_{op,k} ← exp( ε_k^⋆∧ ) T_{op,k},   (10.36a)
ϖ_{op,k} ← ϖ_{op,k} + ψ_k^⋆,   (10.36b)
p_{op,j} ← p_{op,j} + D ζ_j^⋆,   (10.36c)

and iterate to convergence. Similarly to the SLAM case, once converged we set T̂_{vk i} = T_{op,k}, ϖ̂_{vk}^{vk i} = ϖ_{op,k}, and p̂_i^{pj i} = p_{op,j} at the last iteration as the final estimates for the vehicle poses, generalized angular velocity, and landmark positions of interest at the measurement times.
for example. In this case, the Cholesky method is preferred over the Schur one as we do not need to construct A_{11}^{-1}, which is actually dense.
Yan et al. (2014) explain how to use the sparse-GP method within the
incremental approach of Kaess et al. (2008, 2012).
10.2.5 Interpolation
After we have solved for the state at the measurement times, we can also
now use our Gaussian process framework to interpolate for the state
at one or more query times. The situation is depicted in Figure 10.1,
where our goal is to interpolate the posterior pose (and generalized
velocity) at query time, τ .
Because we have deliberately estimated a Markovian state for our
chosen SDE defining the prior, {Tk , $ k }, we know that to interpolate
at time τ , we need only consider the two measurements times on either
side. Without loss of generality, we assume
tk ≤ τ < tk+1 . (10.39)
The difference between the posterior and the prior at times τ , tk , and
tk+1 we write as
" ∨ # " ∨ #
ln T̂τ Ť−1
τ ln T̂k Ť−1
k
γτ = , γk = ,
$̂ τ − $̌ $̂ k − $̌
" ∨ #
ln T̂k+1 Ť−1
k+1
γ k+1 = , (10.40)
$̂ k+1 − $̌
where we note that the posterior values at the two measurement times
come from the operating
n opoint values at the last iteration of the main
MAP solution: T̂k , $̂ k = {Top,k , $ op,k }.
Using these definitions, we can go ahead and carry out state interpo-
lation (for the mean) on se(3)× R3 using the approach from Section 3.4:
where

Λ(τ) = Φ(τ, t_k) − Q_τ Φ(t_{k+1}, τ)^T Q_{k+1}^{-1} Φ(t_{k+1}, t_k),   (10.42a)
Ψ(τ) = Q_τ Φ(t_{k+1}, τ)^T Q_{k+1}^{-1}.   (10.42b)

We have all the required pieces to build these two matrices except Q_τ, which is given by

Q_τ = ∫_{t_k}^{τ} Φ(τ, s) L Q L^T Φ(τ, s)^T ds.   (10.43)
10.2.6 Postscript
It is worth pointing out that while our underlying approach in this
chapter considers a trajectory that is continuous in time, we are still
discretizing it in order to carry out the batch MAP solution at the mea-
surement times and also the interpolation at the query times. The point
is that we have a principled way to query the trajectory at any time
of interest, not just the measurement times. Moreover, the interpola-
tion scheme is chosen up front and provides the abilities to (i) smooth
the solution based on a physically motivated prior, and (ii) carry out
interpolation at any time of interest.
It is also worth noting that the Gaussian process approach taken in
this chapter is quite different from the interpolation approach taken in
Section 9.1.5. There we forced the motion between measurement times
to have constant body-centric generalized velocity: it was a constraint-
based interpolation method. Here we are defining a whole distribution
of possible trajectories and encouraging the solution to find one that
balances the prior likelihood with the likelihood of the measurements:
this is a penalty-term approach. Both approaches have their merits.
Finally, it is worth making a point about the estimation philoso-
phy used in this chapter. We have claimed that we presented a MAP
method. However, in the nonlinear chapter, the MAP approach always
linearized the motion model about the current MAP estimate. On the
surface, it appears that we have done something slightly different in this
chapter: to separate the desired nonlinear SDE into the nominal and
perturbation components, we essentially linearized about the mean of
the prior. We then built an error term for the motion prior and lin-
earized that about the current MAP estimate. However, the other way
to look at it is that we simply replaced the desired SDE with a slightly
different one that was easier to work with and then applied the MAP
approach for SE(3). This is not the only way to apply the Gaussian
process, continuous-time approach to estimation on SE(3), but we hope
it provides one useful example; Anderson and Barfoot (2015) provide
an alternative.