
STATE ESTIMATION FOR

ROBOTICS
A Matrix Lie Group Approach

Timothy D. Barfoot

Copyright © 2016

In preparation for publication by Cambridge University Press


Send errata to <tim.barfoot@utoronto.ca>
Draft compiled on May 9, 2016
Revision History

1 Sept 2015   First draft released
15 Dec 2015   Added a new section to Chapter 3 on recursive discrete-time smoothers and their relationship to the batch solution; fixed a few typos
17 Dec 2015   Fixed a few typos in the new section on smoothers
14 Jan 2016   Added historical note regarding Stanley Schmidt's role in EKF to Chapter 4
20 Mar 2016   Clarified in the introduction and probability chapter that we use a Bayesian view of probability and approach to estimation in this book
26 Mar 2016   Fixed subscript typos in (3.126), (3.127), (4.33), (4.34), (4.42)
29 Mar 2016   Added a note on Jacobi's formula to the section on the matrix exponential in Chapter 7
30 Mar 2016   Added "squared" in front of "Mahalanobis distance" to match actual definition
8 Apr 2016    Added a footnote at start of probability chapter acknowledging that we work with probability densities although the classical formal approach is to start from probability distributions; also made a table in Chapter 7 fit inside the margins
11 Apr 2016   Fixed typo in x∧ definition on SE(3) identity page in Chapter 7
19 Apr 2016   Added acronym list, clarified ISPKF experiment section, removed embarrassing uses of 'maximum a priori'
20 Apr 2016   Added index to back
21 Apr 2016   Ran spellchecker on whole book
28 Apr 2016   Adjusted sentence at start of introduction to reflect actual contents of intro
2 May 2016    Added missing y0 to z in (3.12); added missing negative sign to (3.14a)
9 May 2016    Fixed a bunch more little typos while proofreading

Contents

Acronyms and Abbreviations xi


Notation xiii
Foreword xv

1 Introduction 1
1.1 A Little History 1
1.2 Sensors, Measurements, and Problem Definition 3
1.3 How This Book is Organized 4
1.4 Relationship to Other Books 5

Part I Estimation Machinery 7

2 Primer on Probability Theory 9


2.1 Probability Density Functions 9
2.1.1 Definitions 9
2.1.2 Bayes’ Rule and Inference 10
2.1.3 Moments of PDFs 11
2.1.4 Sample Mean and Covariance 12
2.1.5 Statistically Independent, Uncorrelated 12
2.1.6 Shannon and Mutual Information 13
2.1.7 Cramér-Rao Lower Bound and Fisher Information 13
2.2 Gaussian Probability Density Functions 14
2.2.1 Definitions 14
2.2.2 Isserlis’ Theorem 15
2.2.3 Joint Gaussian PDFs, Their Factors, and Inference 17
2.2.4 Statistically Independent, Uncorrelated 19
2.2.5 Linear Change of Variables 19
2.2.6 Product of Gaussians 21
2.2.7 Sherman-Morrison-Woodbury Identity 22
2.2.8 Passing a Gaussian Through a Nonlinearity 23
2.2.9 Shannon Information of a Gaussian 27
2.2.10 Mutual Information of a Joint Gaussian PDF 28
2.2.11 Cramér-Rao Lower Bound Applied to Gaussian PDFs 29
2.3 Gaussian Processes 30
2.4 Summary 31
2.5 Exercises 32


3 Linear-Gaussian Estimation 35
3.1 Batch Discrete-Time Estimation 35
3.1.1 Problem Setup 35
3.1.2 Maximum A Posteriori 37
3.1.3 Bayesian Inference 42
3.1.4 Existence, Uniqueness, and Observability 44
3.1.5 MAP Covariance 48
3.2 Recursive Discrete-Time Smoothing 49
3.2.1 Exploiting Sparsity in the Batch Solution 50
3.2.2 Cholesky Smoother 51
3.2.3 Rauch-Tung-Striebel Smoother 53
3.3 Recursive Discrete-Time Filtering 56
3.3.1 Factoring the Batch Solution 57
3.3.2 Kalman Filter via MAP 61
3.3.3 Kalman Filter via Bayesian Inference 66
3.3.4 Kalman Filter via Gain Optimization 67
3.3.5 Kalman Filter Discussion 68
3.3.6 Error Dynamics 69
3.3.7 Existence, Uniqueness, and Observability 70
3.4 Batch Continuous-Time Estimation 72
3.4.1 Gaussian Process Regression 72
3.4.2 A Class of Exactly Sparse Gaussian Process Priors 75
3.4.3 Linear Time-Invariant Case 81
3.4.4 Relationship to Batch Discrete-Time Estimation 85
3.5 Summary 86
3.6 Exercises 86

4 Nonlinear Non-Gaussian Estimation 89


4.1 Introduction 89
4.1.1 Full Bayesian Estimation 90
4.1.2 Maximum a Posteriori Estimation 92
4.2 Recursive Discrete-Time Estimation 94
4.2.1 Problem Setup 94
4.2.2 Bayes Filter 95
4.2.3 Extended Kalman Filter 98
4.2.4 Generalized Gaussian Filter 102
4.2.5 Iterated Extended Kalman Filter 103
4.2.6 IEKF is a MAP Estimator 104
4.2.7 Alternatives for Passing PDFs through Nonlinearities 105
4.2.8 Particle Filter 114
4.2.9 Sigmapoint Kalman Filter 116
4.2.10 Iterated Sigmapoint Kalman Filter 121
4.2.11 ISPKF Seeks the Posterior Mean 124
4.2.12 Taxonomy of Filters 125
4.3 Batch Discrete-Time Estimation 125
4.3.1 Maximum A Posteriori 126
4.3.2 Bayesian Inference 133
4.3.3 Maximum Likelihood 135
4.3.4 Discussion 140

4.4 Batch Continuous-Time Estimation 141


4.4.1 Motion Model 141
4.4.2 Observation Model 144
4.4.3 Bayesian Inference 144
4.4.4 Algorithm Summary 145
4.5 Summary 146
4.6 Exercises 147

5 Biases, Correspondences, and Outliers 148


5.1 Handling Input/Measurement Biases 149
5.1.1 Bias Effects on the Kalman Filter 149
5.1.2 Unknown Input Bias 152
5.1.3 Unknown Measurement Bias 154
5.2 Data Association 156
5.2.1 External Data Association 157
5.2.2 Internal Data Association 157
5.3 Handling Outliers 158
5.3.1 RANSAC 159
5.3.2 M-Estimation 160
5.4 Summary 162
5.5 Exercises 162

Part II Three-Dimensional Machinery 165

6 Primer on Three-Dimensional Geometry 167


6.1 Vectors and Reference Frames 167
6.1.1 Reference Frames 168
6.1.2 Dot Product 168
6.1.3 Cross Product 169
6.2 Rotations 170
6.2.1 Rotation Matrices 170
6.2.2 Principal Rotations 171
6.2.3 Alternate Rotation Representations 172
6.2.4 Rotational Kinematics 178
6.2.5 Perturbing Rotations 182
6.3 Poses 186
6.3.1 Transformation Matrices 187
6.3.2 Robotics Conventions 188
6.3.3 Frenet-Serret Frame 190
6.4 Sensor Models 193
6.4.1 Perspective Camera 193
6.4.2 Stereo Camera 200
6.4.3 Range-Azimuth-Elevation 202
6.4.4 Inertial Measurement Unit 203
6.5 Summary 205
6.6 Exercises 206

7 Matrix Lie Groups 209


7.1 Geometry 209
7.1.1 Special Orthogonal and Special Euclidean Groups 209
7.1.2 Lie Algebras 211
7.1.3 Exponential Map 213
7.1.4 Adjoints 219
7.1.5 Baker-Campbell-Hausdorff 223
7.1.6 Distance, Volume, Integration 229
7.1.7 Interpolation 232
7.1.8 Homogeneous Points 237
7.1.9 Calculus and Optimization 238
7.1.10 Identities 246
7.2 Kinematics 246
7.2.1 Rotations 246
7.2.2 Poses 249
7.2.3 Linearized Rotations 252
7.2.4 Linearized Poses 256
7.3 Probability and Statistics 258
7.3.1 Gaussian Random Variables and PDFs 258
7.3.2 Uncertainty on a Rotated Vector 263
7.3.3 Compounding Poses 265
7.3.4 Fusing Poses 272
7.3.5 Propagating Uncertainty Through a Nonlinear Camera Model 276
7.4 Summary 283
7.5 Exercises 284

Part III Applications 287

8 Pose Estimation Problems 289


8.1 Point-Cloud Alignment 289
8.1.1 Problem Setup 290
8.1.2 Unit-Length Quaternion Solution 290
8.1.3 Rotation Matrix Solution 294
8.1.4 Transformation Matrix Solution 308
8.2 Point-Cloud Tracking 311
8.2.1 Problem Setup 311
8.2.2 Motion Priors 312
8.2.3 Measurement Model 313
8.2.4 EKF Solution 314
8.2.5 Batch Maximum a Posteriori Solution 317
8.3 Pose-Graph Relaxation 321
8.3.1 Problem Setup 321
8.3.2 Batch Maximum Likelihood Solution 322
8.3.3 Initialization 325
8.3.4 Exploiting Sparsity 325
8.3.5 Chain Example 326

9 Pose-and-Point Estimation Problems 329


9.1 Bundle Adjustment 329
9.1.1 Problem Setup 330
9.1.2 Measurement Model 330
9.1.3 Maximum Likelihood Solution 334
9.1.4 Exploiting Sparsity 337
9.1.5 Interpolation Example 340
9.2 Simultaneous Localization and Mapping 344
9.2.1 Problem Setup 344
9.2.2 Batch Maximum a Posteriori Solution 345
9.2.3 Exploiting Sparsity 346
9.2.4 Example 347

10 Continuous-Time Estimation 349


10.1 Motion Prior 349
10.1.1 General 349
10.1.2 Simplification 353
10.2 Simultaneous Trajectory Estimation and Mapping 354
10.2.1 Problem Setup 355
10.2.2 Measurement Model 355
10.2.3 Batch Maximum a Posteriori Solution 356
10.2.4 Exploiting Sparsity 357
10.2.5 Interpolation 358
10.2.6 Postscript 359

References 361
Index 367
Acronyms and Abbreviations

BA bundle adjustment 329


BCH Baker-Campbell-Hausdorff 223
BLUE best linear unbiased estimate 68
CRLB Cramér-Rao lower bound 14
DARCES data-aligned rigidity-constrained exhaustive search 158
EKF extended Kalman filter 68
GP Gaussian process 31
GPS global positioning system 4
ICP iterative closest point 289
IEKF iterated extended Kalman filter 103
IMU inertial measurement unit 203
ISPKF iterated sigmapoint Kalman filter 121
KF Kalman filter 35
LDU lower-diagonal-upper 22
LG linear-Gaussian 36
LTI linear time-invariant 81
LTV linear time-varying 75
MAP maximum a posteriori 4
ML maximum likelihood 135
NASA National Aeronautics and Space Administration 3
NLNG nonlinear, non-Gaussian 89
PDF probability density function 9
RAE range-azimuth-elevation 202
RANSAC random sample consensus
RTS Rauch-Tung-Striebel 53
SDE stochastic differential equation 75
SLAM simultaneous localization and mapping 155
SMW Sherman-Morrison-Woodbury 22
SP sigmapoint 108
SPKF sigmapoint Kalman filter 116
STEAM simultaneous trajectory estimation and mapping 354
SWF sliding-window filter 140
UDL upper-diagonal-lower 22
UKF unscented Kalman filter (also called SPKF) 117

Notation

– General Notation –

a This font is used for quantities that are real


scalars
a This font is used for quantities that are real col-
umn vectors
A This font is used for quantities that are real ma-
trices
A This font is used for time-invariant system quan-
tities
p(a) The probability density of a
p(a|b) The probability density of a given b
N (a, B) Gaussian probability density with mean a and
covariance B
GP(µ(t), K(t, t′)) Gaussian process with mean function, µ(t), and
covariance function, K(t, t′)
O Observability matrix
(·)k The value of a quantity at timestep k
(·)k1 :k2 The set of values of a quantity from timestep k1
to timestep k2 , inclusive
F⃗a A vectrix representing a reference frame in three
dimensions
a⃗ A vector quantity in three dimensions
(·)× The cross-product operator, which produces a
skew-symmetric matrix from a 3 × 1 column
1 The identity matrix
0 The zero matrix
RM ×N The vectorspace of real M × N matrices
(·)ˆ A posterior (estimated) quantity
(·)ˇ A prior quantity


– Matrix-Lie-Group Notation –

SO(3) The special orthogonal group, a matrix Lie group


used to represent rotations
so(3) The Lie algebra associated with SO(3)
SE(3) The special Euclidean group, a matrix Lie group
used to represent poses
se(3) The Lie algebra associated with SE(3)
(·)∧ An operator associated with the Lie algebra for
rotations and poses
(·)⋏ An operator associated with the adjoint of an
element from the Lie algebra for poses
Ad(·) An operator producing the adjoint of an element
from the Lie group for rotations and poses
ad(·) An operator producing the adjoint of an element
from the Lie algebra for rotations and poses
Foreword

My interest in state estimation stems from the field of mobile robotics,


particularly for space exploration. Within mobile robotics, there has
been an explosion of research referred to as probabilistic robotics. With
computing resources becoming very inexpensive, and the advent of rich
new sensing technologies such as digital cameras and laser rangefinders,
robotics has been at the forefront of developing exciting new ideas in
the area of state estimation.
In particular, this field was probably the first to find practical appli-
cations of the so-called Bayes filter, a much more general technique than
the famous Kalman filter. In just the last few years, mobile robotics
has even started going beyond the Bayes filter to batch, nonlinear
optimization-based techniques, with very promising results.

[Margin note: Introductio Geographica by Petrus Apianus (1495-1552), a German mathematician, astronomer, and cartographer. Much of three-dimensional state estimation has to do with triangulation and/or trilateration; we measure some angles and lengths and infer the others through trigonometry.]

Because my primary area of interest is navigation of robots in outdoor environments, I have often been faced with vehicles operating in three dimensions. Accordingly, I have attempted to provide a detailed look at how
to approach state estimation in three dimensions. In particular, I show
how to treat rotations and poses in a simple and practical way using
matrix Lie groups. The reader should have a background in under-
graduate linear algebra and calculus, but otherwise this book is fairly
standalone. I hope readers of these pages will find something useful; I
know I learned a great deal while creating them.
I have provided some historical notes in the margins throughout
the book, mostly in the form of biographical sketches of some of the
researchers after whom various concepts and techniques are named;
I primarily used Wikipedia as the source for this information. Also,
the first part of Chapter 6 (up to the alternate rotation parameteriza-
tions), which introduces three-dimensional geometry, is based heavily
on notes originally produced by Prof. Chris Damaren at the University
of Toronto Institute for Aerospace Studies.
This book would not have been possible without the collaborations
of many fantastic graduate students along the way. Paul Furgale’s PhD
thesis extended my understanding of matrix Lie groups significantly by
introducing me to their use for describing poses; this led us on an inter-
esting journey into the details of transformation matrices and how to
use them effectively in estimation problems. Paul’s later work led me
to become interested in continuous-time estimation. Chi Hay Tong’s
PhD thesis introduced me to the use of Gaussian processes in esti-
mation theory and he helped immensely in working out the details of
the continuous-time methods presented herein; my knowledge in this
area was further improved through collaborations with Simo Särkkä
from Aalto University while on sabbatical at the University of Oxford.
Additionally, I learned a great deal by working with Sean Anderson,
Patrick Carle, Hang Dong, Andrew Lambert, Keith Leung, Colin Mc-
Manus, and Braden Stenning; each of their projects added to my un-
derstanding of state estimation. Colin, in particular, encouraged me
several times to turn my notes from my graduate course on state esti-
mation into this book. Finally, I am indebted to Gabriele D’Eleuterio,
who set me on the path of studying rotations and reference frames in
the context of dynamics; many of the tools he showed me transferred
effortlessly to state estimation; he also taught me the importance of
clean, unambiguous notation. Thank you all, for your help and encour-
agement.

Tim Barfoot
May 9, 2016
Toronto
1

Introduction

Robotics inherently deals with things that move in the world. We live
in an era of rovers on Mars, drones surveying the Earth, and soon
self-driving cars. And, although specific robots have their subtleties,
there are also some common issues we must face in all applications,
particularly state estimation and control.
The state of a robot is a set of quantities, such as position, orien-
tation, and velocity, that if known fully describe that robot’s motion
over time. Here we will focus entirely on the problem of estimating
the state of a robot, putting aside the notion of control. Yes, control
is essential, as we would like to make our robots behave in a certain
way. But, the first step in doing so is often the process of determining
the state. Moreover, the difficulty of state estimation is often underes-
timated for real-world problems and thus it is important to put it on
an equal footing with control.
In this book, we will introduce the classic estimation results for linear
systems corrupted by Gaussian measurement noise. We will then ex-
amine some of the extensions to nonlinear systems with non-Gaussian
noise. In a departure from typical estimation texts, we will take a de-
tailed look at how to tailor general estimation results to robots oper-
ating in three-dimensional space, advocating a particular approach to
handling rotations: matrix Lie groups.
The rest of this introduction will provide a little history of estimation,
discuss types of sensors and measurements, and introduce the problem
of state estimation. We conclude with a breakdown of the contents of
the book and provide some other suggested reading.

1.1 A Little History


About 4,000 years ago, the early seafarers were faced with a vehicular
state estimation problem: how to determine a ship’s position while at
sea. Early attempts to develop primitive charts and make observations
of the sun allowed local navigation along coastlines. However, it was
not until the fifteenth century that global navigation on the open sea
became possible with the advent of key technologies and tools. The
Mariner’s compass, an early form of the magnetic compass, allowed
crude measurements of direction to be made. Together with coarse
nautical charts, the compass made it possible to sail along rhumb lines
between key destinations (i.e., following a compass bearing). A series
of instruments were then gradually invented that made it possible to
measure the angle between distant points (i.e., cross-staff, astrolabe,
quadrant, sextant, theodolite) with increasing accuracy.

[Figure 1.1: Quadrant. A tool used to measure angles.]
These instruments allowed latitude to be determined at sea fairly readily using celestial navigation. For example, in the Northern hemisphere the angle between the North Star, Polaris, and the horizon provides the latitude. Longitude, however, was a much more difficult problem. It was known early on that an accurate timepiece was the missing piece of the puzzle for the determination of longitude. The behaviours of key celestial bodies appear differently at different locations on the Earth. Knowing the time of day therefore allows longitude to be inferred. In 1764, British clockmaker John Harrison built the first accurate portable timepiece that effectively solved the longitude problem; a ship's longitude could be determined to within about ten nautical miles.

[Figure 1.2: Harrison's H4. The first clock able to keep accurate time at sea, enabling determination of longitude.]

Estimation theory also finds its roots in astronomy. The method of least squares was pioneered¹ by Gauss, who developed the technique to minimize the impact of measurement error in the prediction of orbits. Gauss reportedly used least squares to predict the position of the dwarf planet Ceres after passing behind the Sun, accurate to within half a degree (about nine months after it was last seen). The year was 1801 and Gauss was 23. Later, in 1809, he proved that the least-squares method is optimal under the assumption of normally distributed errors. Most of the classic estimation techniques in use today can be directly related to Gauss' least-squares method.

[Margin note: Carl Friedrich Gauss (1777-1855) was a German mathematician who contributed significantly to many fields including statistics and estimation.]

The idea of fitting models to minimize the impact of measurement error carried forward, but it was not until the middle of the twentieth century that estimation really took off. This was likely correlated with the dawn of the computer age. In 1960, Kálmán published two landmark papers that have defined much of what has followed in the field of state estimation. First, he introduced the notion of observability (Kalman, 1960a), which tells us when a state can be inferred from a set of measurements in a dynamic system. Second, he introduced an optimal framework for estimating a system's state in the presence of measurement noise (Kalman, 1960b); this classic technique for linear systems (whose measurements are corrupted by Gaussian noise) is famously known as the Kalman filter, and has been the workhorse of estimation for the more than 50 years since its inception. Although used in many fields, it has been widely adopted in aerospace applications.

[Margin note: Rudolf Emil Kálmán (1930-) is a Hungarian-born American electrical engineer, mathematician, and inventor.]

Researchers at National Aeronautics and Space Administration (NASA) were the first to employ the Kalman filter to aid in the estimation of spacecraft trajectories on the Ranger, Mariner, and Apollo programs. In particular, the on-board computer on the Apollo 11 lunar module, the first spacecraft to land on the surface of the Moon, employed a Kalman filter to estimate the module's position above the lunar surface based on noisy radar measurements.

[Margin note: Early Estimation Milestones. 1654 Pascal and Fermat lay foundations of probability theory; 1764 Bayes' rule; 1801 Gauss uses least-squares to estimate the orbit of the planetoid Ceres; 1805 Legendre publishes "least-squares"; 1913 Markov chains; 1933 (Chapman-)Kolmogorov equations; 1949 Wiener filter; 1960 Kalman (Bucy) filter; 1965 Rauch-Tung-Striebel smoother; 1970 Jazwinski coins "Bayes filter".]

Many incremental improvements have been made to the field of state estimation since these early milestones. Faster and cheaper computers have allowed much more computationally complex techniques to be implemented in practical systems. However, until about fifteen years ago, it seemed that estimation was possibly waning as an active research area. But, something has happened to change that; exciting new sensing technologies are coming along (e.g., digital cameras, laser imaging, global positioning satellites) that pose new challenges to this old field.

¹ There is some debate as to whether Adrien Marie Legendre might have come up with least squares before Gauss.

1.2 Sensors, Measurements, and Problem Definition

To understand the need for state estimation is to understand the nature of sensors. All sensors have a limited precision. Therefore, all measurements derived from real sensors have associated uncertainty. Some sensors are better at measuring specific quantities than others, but even the best sensors still have a degree of imprecision. When we combine various sensor measurements into a state estimate, it is important to keep track of all the uncertainties involved and therefore (hopefully) know how confident we can be in our estimate.

In some sense, state estimation is about doing the best we can with the sensors we have. This, however, does not prevent us from improving the quality of our sensors in parallel. A good example is the theodolite sensor that was developed in 1787 to allow triangulation across the English Channel. It was much more precise than its predecessors and helped show that much of England was poorly mapped by tying measurements to well-mapped France.

[Figure 1.3: Theodolite. A better tool to measure angles.]
It is useful to put sensors into two categories: interoceptive and exte-
roceptive. These are actually terms borrowed from human physiology,
but they have become somewhat common in engineering. Some definitions²:

in·tero·cep·tive [ˌin-tə-rō-ˈsep-tiv], adjective: of, relating to, or being stimuli arising within the body.

ex·tero·cep·tive [ˌek-stə-rō-ˈsep-tiv], adjective: relating to, being, or activated by stimuli received by an organism from outside.

Typical interoceptive sensors are the accelerometer (measures translational acceleration), gyroscope (measures angular rate), and wheel odometer (measures angular rate). Typical exteroceptive sensors are the camera (measures range/bearing to a landmark or landmarks) and time-of-flight transmitter/receiver (e.g., laser rangefinder, pseudolites, global positioning system (GPS) transmitter/receiver). Roughly speaking, we can think of exteroceptive measurements as being of the position and orientation of a vehicle while interoceptive ones are of a vehicle's velocity or acceleration. In most cases, the best state estimation concepts make use of both interoceptive and exteroceptive measurements. For example, the combination of a GPS receiver (exteroceptive) and an inertial measurement unit (3 linear accelerometers and 3 rate gyros; interoceptive) is a popular means of estimating a vehicle's position/velocity on Earth. And, the combination of a sun/star sensor (exteroceptive) and three rate gyros (interoceptive) is commonly used to carry out pose determination on satellites.

² Merriam-Webster's Dictionary.
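As a toy illustration of combining two noisy measurements while keeping track of their uncertainties (this sketch and its numbers are illustrative only, not from the text), inverse-variance weighting fuses a pair of independent scalar measurements of the same quantity:

```python
# Fuse two independent, noisy scalar measurements of the same quantity
# using inverse-variance (minimum-variance) weighting. Illustrative only.

def fuse(z1, var1, z2, var2):
    """Return the fused estimate and its variance."""
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    x = (w1 * z1 + w2 * z2) / (w1 + w2)  # weighted mean
    var = 1.0 / (w1 + w2)                # fused variance, smaller than either input
    return x, var

# e.g., a coarse "exteroceptive" position fix and a more certain
# "interoceptive" dead-reckoned one (hypothetical numbers)
x, var = fuse(10.0, 4.0, 12.0, 1.0)
print(x, var)  # the fused estimate lies closer to the more certain measurement
```

Note how the fused variance is smaller than either input variance: combining sensors, while properly bookkeeping uncertainty, can only increase our confidence.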
Now that we understand a little bit about sensors, we are prepared
to define the problem that will be investigated in this book:
Estimation is the problem of reconstructing the underlying state of a system given a
sequence of measurements as well as a prior model of the system.

There are many specific versions of this problem and just as many
solutions. The goal will be to understand which methods work well in
which situations, in order to pick the best tool for the job.

1.3 How This Book is Organized


The book is broken into three main parts:

I: Estimation Machinery
II: Three-Dimensional Machinery
III: Applications
The first part, Estimation Machinery, presents classic and state-of-the-
art estimation tools, without the complication of dealing with things
that live in three-dimensional space (and therefore translate and ro-
tate); the state to be estimated is assumed to be a generic vector.
For those not interested in the details of working in three-dimensional
space, this first part can be read in a standalone manner. It covers both
recursive state estimation techniques as well as batch methods (less
common in classic estimation books). As is commonplace in robotics
and machine learning today, we adopt a Bayesian approach to estima-
tion in this book. We contrast (full) Bayesian methods with maximum
a posteriori (MAP) methods, and attempt to make clear the difference
between these when faced with nonlinear problems. The book also con-
nects continuous-time estimation with Gaussian process regression from
the machine-learning world. Finally, it touches on some practical issues


such as robust estimation and biases.
The second part, Three-Dimensional Machinery, provides a basic
primer on three-dimensional geometry and gives a detailed but accessi-
ble introduction to matrix Lie groups. To represent an object in three-
dimensional space, we need to talk about that object’s translation and
rotation. The rotational part turns out to be a problem for our estima-
tion tools because rotations are not vectors in the usual sense and so
we cannot naively apply the methods from Part I to three-dimensional
robotics problems involving rotations. Part II, therefore, examines the
geometry, kinematics, and probability/statistics of rotations and poses
(translation plus rotation).
Finally, in the third part, Applications, the first two parts of the book
are brought together. We look at a number of classic three-dimensional
estimation problems involving objects translating and rotating in three-
dimensional space. We show how to adapt the methods from Part I
based on the knowledge gained in Part II. The result is a suite of
easy-to-implement methods for three-dimensional state estimation. The
spirit of these examples can also hopefully be adapted to create other
novel techniques moving forwards.

1.4 Relationship to Other Books


There are many other good books on state estimation and robotics,
but very few that cover both topics simultaneously. We briefly describe
a few recent works that do cover both topics and their relationships to
this book.
Probabilistic Robotics by Thrun et al. (2006) is a great introduction
to mobile robotics, with a large focus on state estimation in relation to
mapping and localization. It covers the probabilistic paradigm that is
dominant in much of robotics today. It mainly describes robots operat-
ing in the two-dimensional, horizontal plane. The probabilistic methods
described are not necessarily limited to the two-dimensional case, but
the details of extending to three dimensions are not provided.
Computational Principles of Mobile Robotics by Dudek and Jenkin
(2010) is a great overview book on mobile robotics that touches on state
estimation, again in relation to localization and mapping methods. It
does not work out the details of performing state estimation in three
dimensions.
Mobile Robotics: Mathematics, Models, and Methods by Kelly (2013)
is another excellent book on mobile robotics and covers state estimation
extensively. Three-dimensional situations are covered, particularly in
relation to satellite-based and inertial navigation. As the book covers
all aspects of robotics, it does not delve deeply into how to handle
rotational variables within three-dimensional state estimation.

Robotics, Vision, and Control by Corke (2011) is another great and


comprehensive book that covers state estimation for robotics, including
in three dimensions. Similarly to the previously mentioned book, the
breadth of Corke’s book necessitates that it not delve too deeply into
the specific aspects of state estimation treated herein.
Bayesian Filtering and Smoothing by Särkkä (2013) is a superb book
focused on recursive Bayesian methods. It covers the recursive methods
in far more depth than this book, but does not cover batch methods
nor focus on the details of carrying out estimation in three dimensions.
Stochastic Models, Information Theory, and Lie Groups: Classical
Results and Geometric Methods by Chirikjian (2009), an excellent two-
volume work, is perhaps the closest in content to the current book. It
explicitly investigates the consequences of carrying out state estima-
tion on matrix Lie groups (and hence rotational variables). It is quite
theoretical in nature and goes beyond the current book in this sense,
covering applications beyond robotics.
The current book is somewhat unique in focusing only on state es-
timation and working out the details of common three-dimensional
robotics problems in enough detail to be easily implemented.
Part I

Estimation Machinery

2

Primer on Probability Theory

In what is to follow, we will be using a number of basic concepts from


probability and statistics. This chapter serves to provide a review of
these concepts. For a classic book on probability and random processes
see Papoulis (1965). For a light read on the history of probability theory,
Devlin (2008) provides a wonderful introduction; this book also helps to
understand the difference between the frequentist and Bayesian views
of probability. We will primarily adopt the latter in our approach to
estimation, although this chapter mentions some basic frequentist sta-
tistical concepts in passing. We begin by discussing general probability
density functions (PDFs) and then focus on the specific case of Gaus-
sian PDFs. The chapter concludes by introducing Gaussian processes,
the continuous-time version of Gaussian random variables.

2.1 Probability Density Functions


2.1.1 Definitions
We say that a random variable, x, is distributed according to a particular PDF. Let p(x) be a PDF for the random variable, x, over the interval [a, b]. This is a non-negative function that satisfies

∫_a^b p(x) dx = 1.    (2.1)

That is, it satisfies the axiom of total probability. Note that this is probability density, not probability.

[Figure 2.1: Probability density over a finite interval (left). Probability of being within a sub-interval (right).]

Probability is given by the area under the density function¹. For example, the probability that x lies between c and d, Pr(c ≤ x ≤ d), is given by

\Pr(c \leq x \leq d) = \int_c^d p(x) \, dx. \qquad (2.2)

Figure 2.1 depicts a general PDF over a finite interval as well as the
probability of being within a sub-interval. We use PDFs to represent
the likelihood of x being in all possible states in the interval, [a, b].
We can also introduce a conditioning variable. Let p(x|y) be a PDF over x ∈ [a, b] conditioned on y ∈ [r, s] such that

(\forall y) \quad \int_a^b p(x|y) \, dx = 1. \qquad (2.3)

We may also denote joint probability densities for N-dimensional continuous variables in our framework as p(x), where x = (x1, ..., xN) with xi ∈ [ai, bi]. Note that we can also use the notation

p(x_1, x_2, \ldots, x_N), \qquad (2.4)

in place of p(x). Sometimes we even mix and match the two and write

p(x, y), \qquad (2.5)

for the joint density of x and y. In the N-dimensional case, the axiom of total probability requires

\int_a^b p(x) \, dx = \int_{a_N}^{b_N} \cdots \int_{a_2}^{b_2} \int_{a_1}^{b_1} p(x_1, x_2, \ldots, x_N) \, dx_1 \, dx_2 \cdots dx_N = 1, \qquad (2.6)

where a = (a1, a2, ..., aN) and b = (b1, b2, ..., bN). In what follows, we will sometimes simplify notation by leaving out the integration limits, a and b.
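As a quick numerical illustration of the axiom of total probability, the following sketch integrates a one-dimensional Gaussian density on a grid; the mean and variance are arbitrary illustrative choices, not from the text.

```python
# A quick numerical illustration of the axiom of total probability: a Gaussian
# density integrated over a wide interval should give (approximately) 1.
# The mean and variance are arbitrary illustrative choices.
import numpy as np

mu, sigma = 0.0, 1.0
x = np.linspace(mu - 8.0*sigma, mu + 8.0*sigma, 100001)
p = np.exp(-0.5*((x - mu)/sigma)**2) / np.sqrt(2.0*np.pi*sigma**2)

dx = x[1] - x[0]
total = np.sum(p) * dx        # simple Riemann sum approximating the integral
print(total)                  # close to 1
```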

2.1.2 Bayes' Rule and Inference

We can always factor a joint probability density into a conditional and a non-conditional factor²:

p(x, y) = p(x|y)p(y) = p(y|x)p(x). \qquad (2.7)

¹ The classical treatment of probability theory starts with probability distributions and Kolmogorov's three axioms, and works out the details of probability densities as a consequence of their being the derivatives of probability distributions. As is common in robotics, we will work directly with densities in a Bayesian framework, and therefore we will skip these formalities and present only the results we need using densities. We shall be careful to use the term density, not distribution, as we are working with continuous variables throughout this book.
² In the specific case that x and y are statistically independent, we can factor the joint density as p(x, y) = p(x)p(y).

Rearranging gives Bayes' rule:

p(x|y) = \frac{p(y|x)p(x)}{p(y)}. \qquad (2.8)

We can use this to infer the posterior likelihood of the state given the measurements, p(x|y), if we have a prior PDF over the state, p(x), and the sensor model, p(y|x). We do this by expanding the denominator so that

p(x|y) = \frac{p(y|x)p(x)}{\int p(y|x)p(x) \, dx}. \qquad (2.9)

We compute the denominator, p(y), by marginalization as follows:

p(y) = p(y) \underbrace{\int p(x|y) \, dx}_{1} = \int p(x|y)p(y) \, dx = \int p(x, y) \, dx = \int p(y|x)p(x) \, dx, \qquad (2.10)

which can be quite expensive to do in the general nonlinear case. Note that in Bayesian inference, p(x) is known as the prior density while p(x|y) is known as the posterior density. Thus, all a priori information is encapsulated in p(x), while p(x|y) contains the a posteriori information.

Thomas Bayes (1701-1761) was an English statistician, philosopher, and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name. Bayes never published what would eventually become his most famous accomplishment; his notes were edited and published after his death by Richard Price (Bayes, 1764).
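The inference described above can be illustrated numerically by discretizing the state. The following sketch assumes a made-up scalar example (Gaussian prior, Gaussian sensor model, and a single measurement value); none of these numbers come from the text.

```python
# A numerical sketch of Bayes' rule (2.8)-(2.10) on a discretized scalar state.
# The prior, sensor noise, and measurement value are made-up numbers for
# illustration only.
import numpy as np

x = np.linspace(-5.0, 5.0, 2001)      # discretized state
dx = x[1] - x[0]

def gauss(z, mean, var):
    return np.exp(-0.5*(z - mean)**2/var) / np.sqrt(2.0*np.pi*var)

prior = gauss(x, 0.0, 4.0)            # p(x): prior over the state
y_meas = 1.2                          # a single measurement
likelihood = gauss(y_meas, x, 0.25)   # p(y|x) evaluated at the measurement

p_y = np.sum(likelihood*prior)*dx     # denominator by marginalization (2.10)
posterior = likelihood*prior / p_y    # p(x|y) by Bayes' rule (2.9)

print(np.sum(posterior)*dx)           # posterior integrates to 1
print(np.sum(x*posterior)*dx)         # posterior mean, pulled toward y_meas
```

Because both factors here are Gaussian, the posterior mean lands between the prior mean and the measurement, weighted by their certainties, matching the closed-form results derived later in the chapter.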

2.1.3 Moments of PDFs

When working with mass distributions (a.k.a., density functions) in dynamics, we often keep track of only a few properties called the moments of mass (e.g., mass, center of mass, inertia matrix). The same is true with PDFs. The zeroth probability moment is always 1, since this is exactly the axiom of total probability. The first probability moment is known as the mean, µ:

\mu = E[x] = \int x \, p(x) \, dx, \qquad (2.11)

where E[·] denotes the expectation operator. For a general matrix function, F(x), the expectation is written as

E[F(x)] = \int F(x) \, p(x) \, dx, \qquad (2.12)

but note that we must interpret this as

E[F(x)] = \left[E[f_{ij}(x)]\right] = \left[\int f_{ij}(x) \, p(x) \, dx\right]. \qquad (2.13)

The second probability moment is known as the covariance matrix, Σ:

\Sigma = E\left[(x - \mu)(x - \mu)^T\right]. \qquad (2.14)
12 Primer on Probability Theory

The next two moments are called the skewness and kurtosis, but for
the multivariate case these get quite complicated and require tensor
representations. We will not need them here, but it should be mentioned
that there are an infinite number of these probability moments.

2.1.4 Sample Mean and Covariance

Suppose we have a random variable, x, and an associated PDF, p(x). We can draw samples from this density, which we denote as

x_{meas} \leftarrow p(x). \qquad (2.15)

A sample is sometimes referred to as a realization of the random variable, and we can think of it intuitively as a measurement. If we drew N such samples and wanted to estimate the mean and covariance of the random variable, x, we could use the sample mean and sample covariance to do so:

\mu_{meas} = \frac{1}{N}\sum_{i=1}^{N} x_{i,meas}, \qquad (2.16a)

\Sigma_{meas} = \frac{1}{N-1}\sum_{i=1}^{N} \left(x_{i,meas} - \mu_{meas}\right)\left(x_{i,meas} - \mu_{meas}\right)^T. \qquad (2.16b)
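These two estimators can be sketched in a few lines; the true mean and covariance below are arbitrary illustrative values.

```python
# A sketch of the sample mean (2.16a) and Bessel-corrected sample covariance
# (2.16b). The true mean and covariance below are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(42)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])

N = 100000
x = rng.multivariate_normal(mu_true, Sigma_true, size=N)  # N realizations

mu_meas = x.mean(axis=0)                                  # (2.16a)
d = x - mu_meas
Sigma_meas = d.T @ d / (N - 1)                            # (2.16b)

print(mu_meas)      # close to mu_true
print(Sigma_meas)   # close to Sigma_true
```

Note that NumPy's `np.cov` applies the same N − 1 normalization by default.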
Notably, the normalization in the sample covariance uses N − 1 rather than N in the denominator, which is referred to as Bessel's correction. Intuitively, this is necessary because the sample covariance uses the difference of the measurements with the sample mean, which itself is computed from the same measurements, resulting in a slight correlation. The sample covariance can be shown to be an unbiased estimate of the true covariance, and it is also 'larger' than when N is used in the denominator. It is also worth mentioning that as N becomes large, N − 1 ≈ N, so the bias effect for which the sample covariance compensates becomes less pronounced.

Friedrich Wilhelm Bessel (1784-1846) was a German astronomer and mathematician (systematizer of the Bessel functions, which were discovered by Bernoulli). He was the first astronomer to determine the distance from the sun to another star by the method of parallax. The Bessel correction is technically a factor of N/(N − 1) that is multiplied in front of the 'biased' formula for covariance that divides by N instead of N − 1.

2.1.5 Statistically Independent, Uncorrelated

If we have two random variables, x and y, we say that the variables are statistically independent if their joint density factors as follows:

p(x, y) = p(x) \, p(y). \qquad (2.17)

We say that the variables are uncorrelated if

E\left[xy^T\right] = E[x] \, E[y]^T. \qquad (2.18)

If the variables are statistically independent, this implies they are also uncorrelated. However, the reverse is not true in general for all types
instead of N − 1.
uncorrelated. However, the reverse is not true in general for all types

of densities³. We will often exploit (or assume) that variables are statistically independent to simplify computations.

2.1.6 Shannon and Mutual Information


Often we have estimated a PDF for some random variable and then want to quantify how certain we are in, for example, the mean of that PDF. One method of doing this is to look at the negative entropy or (Shannon) information, H, which is given by

H(x) = -E[\ln p(x)] = -\int p(x) \ln p(x) \, dx. \qquad (2.19)

We will make this expression specific to Gaussian PDFs below.

Claude Elwood Shannon (1916-2001) was an American mathematician, electronic engineer, and cryptographer known as 'the father of information theory' (Shannon, 1948).

Another useful quantity is the mutual information, I(x, y), between two random variables, x and y, given by

I(x, y) = E\left[\ln\left(\frac{p(x, y)}{p(x)p(y)}\right)\right] = \iint p(x, y) \ln\left(\frac{p(x, y)}{p(x)p(y)}\right) dx \, dy. \qquad (2.20)

Mutual information measures how much knowing one of the variables reduces uncertainty about the other. When x and y are statistically independent, we have

I(x, y) = \iint p(x) p(y) \ln\left(\frac{p(x)p(y)}{p(x)p(y)}\right) dx \, dy = \iint p(x) p(y) \underbrace{\ln(1)}_{0} \, dx \, dy = 0. \qquad (2.21)

When x and y are correlated, we have I(x, y) ≥ 0. We also have the useful relationship,

I(x, y) = H(x) + H(y) - H(x, y), \qquad (2.22)

relating mutual information and Shannon information.

2.1.7 Cramér-Rao Lower Bound and Fisher Information


Suppose we have a deterministic parameter, θ, that influences the out-
come of a random variable, x. This can be captured by writing the
PDF for x as depending on θ:
p(x|θ). (2.23)
Further suppose we now draw a sample, xmeas , from p(x|θ):
xmeas ← p(x|θ). (2.24)
3 It is true for Gaussian PDFs, as discussed below.
The x_{meas} is sometimes called a realization of the random variable x; we can think of it as a 'measurement'⁴. Then, the Cramér-Rao lower bound (CRLB) says that the covariance of any unbiased estimate⁵, θ̂ (based on the measurement, x_{meas}), of the deterministic parameter, θ, is bounded by the Fisher information matrix, I(x|θ):

\mathrm{cov}(\hat{\theta}|x_{meas}) = E\left[(\hat{\theta} - \theta)(\hat{\theta} - \theta)^T\right] \geq I^{-1}(x|\theta), \qquad (2.25)

where 'unbiased' implies E[\hat{\theta} - \theta] = 0 and 'bounded' means

\mathrm{cov}(\hat{\theta}|x_{meas}) - I^{-1}(x|\theta) \geq 0, \qquad (2.26)

i.e., positive-semidefinite. The Fisher information matrix is given by

I(x|\theta) = E\left[\left(\frac{\partial \ln p(x|\theta)}{\partial \theta}\right)^T\left(\frac{\partial \ln p(x|\theta)}{\partial \theta}\right)\right]. \qquad (2.27)

The CRLB therefore sets a fundamental limit on how certain we can be about an estimate of a parameter, given our measurements.

Harald Cramér (1893-1985) was a Swedish mathematician, actuary, and statistician, specializing in mathematical statistics and probabilistic number theory. Calyampudi Radhakrishna Rao (1920-present) is an Indian American mathematician and statistician. Cramér and Rao were amongst the first to derive what is now known as the CRLB.
Sir Ronald Aylmer Fisher (1890-1962) was an English statistician, evolutionary biologist, geneticist, and eugenicist. His contributions to statistics include the analysis of variance, the method of maximum likelihood, fiducial inference, and the derivation of various sampling distributions.

⁴ We use the subscript, 'meas', to indicate it is a measurement.
⁵ We will use (ˆ·) to indicate an estimated quantity.

2.2 Gaussian Probability Density Functions

2.2.1 Definitions

In much of what is to follow, we will be working with Gaussian PDFs. In one dimension, a Gaussian PDF is given by

p(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2}\right), \qquad (2.28)

where µ is the mean and σ² is the variance (σ is called the standard deviation). Figure 2.2 shows a one-dimensional Gaussian PDF.

A multivariate Gaussian PDF, p(x|µ, Σ), over the random variable x ∈ R^N, may be expressed as

p(x|\mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N \det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \qquad (2.29)

where µ ∈ R^N is the mean and Σ ∈ R^{N×N} is the (symmetric, positive-definite) covariance matrix. Thus, for a Gaussian we have that

\mu = E[x] = \int_{-\infty}^{\infty} x \, \frac{1}{\sqrt{(2\pi)^N \det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) dx, \qquad (2.30)
[Figure 2.2: One-dimensional Gaussian PDF. A notable property of a Gaussian is that the mean and mode (most likely x) are both at µ.]

and

\Sigma = E\left[(x - \mu)(x - \mu)^T\right] = \int_{-\infty}^{\infty} (x - \mu)(x - \mu)^T \frac{1}{\sqrt{(2\pi)^N \det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) dx. \qquad (2.31)
We may also write that x is normally (a.k.a., Gaussian) distributed
using the following notation:
x ∼ N (µ, Σ) .
We say a random variable is standard normally distributed if
x ∼ N (0, 1) ,
where 1 is an N × N identity matrix.

2.2.2 Isserlis’ Theorem


Moments of multivariate Gaussian PDFs get a little messy to compute beyond the usual mean and covariance, but there are some specific cases that we will make use of later that are worth discussing. We can use Isserlis' theorem to compute higher-order moments of a Gaussian random variable, x = (x1, x2, ..., x2M) ∈ R^{2M}. In general, this theorem says

E[x_1 x_2 x_3 \cdots x_{2M}] = \sum \prod E[x_i x_j], \qquad (2.32)

where this implies summing over all distinct ways of partitioning into a product of M pairs. This implies that there are (2M)!/(2^M M!) terms in the sum. With four variables we have

E[x_i x_j x_k x_\ell] = E[x_i x_j]E[x_k x_\ell] + E[x_i x_k]E[x_j x_\ell] + E[x_i x_\ell]E[x_j x_k]. \qquad (2.33)

Leon Isserlis (1881-1966) was a Russian-born British statistician known for his work on the exact distribution of sample moments.
We can apply this theorem to work out some useful results for matrix expressions. Assume we have x ∼ N(0, Σ) ∈ R^N. We will have occasion to compute expressions of the form

E\left[x \left(x^T x\right)^p x^T\right], \qquad (2.34)
16 Primer on Probability Theory

where p is a non-negative integer. Trivially, when p = 0 we simply have E[xx^T] = Σ. When p = 1 we have

E\left[x x^T x x^T\right] = E\left[\left[x_i x_j \left(\sum_{k=1}^{N} x_k^2\right)\right]_{ij}\right] = \left[\sum_{k=1}^{N} E[x_i x_j x_k^2]\right]_{ij}
  = \left[\sum_{k=1}^{N} \left(E[x_i x_j]E[x_k^2] + 2E[x_i x_k]E[x_k x_j]\right)\right]_{ij}
  = [E[x_i x_j]]_{ij} \sum_{k=1}^{N} E[x_k^2] + 2\left[\sum_{k=1}^{N} E[x_i x_k]E[x_k x_j]\right]_{ij}
  = \Sigma \, \mathrm{tr}(\Sigma) + 2\Sigma^2
  = \Sigma\left(\mathrm{tr}(\Sigma)\mathbf{1} + 2\Sigma\right). \qquad (2.35)

Note, in the scalar case we have x ∼ N(0, σ²) and hence E[x⁴] = σ²(σ² + 2σ²) = 3σ⁴, a well-known result. Results for p > 1 are possible using a similar approach, but we do not compute them for now.
We also consider the case where

x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(0, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^T & \Sigma_{22} \end{bmatrix}\right), \qquad (2.36)

with dim(x1) = N1 and dim(x2) = N2. We will need to compute expressions of the form

E\left[x \left(x_1^T x_1\right)^p x^T\right], \qquad (2.37)

where p is a non-negative integer. Again, when p = 0 we trivially have E[xx^T] = Σ. When p = 1 we have

E\left[x x_1^T x_1 x^T\right] = E\left[\left[x_i x_j \left(\sum_{k=1}^{N_1} x_k^2\right)\right]_{ij}\right] = \left[\sum_{k=1}^{N_1} E[x_i x_j x_k^2]\right]_{ij}
  = \left[\sum_{k=1}^{N_1} \left(E[x_i x_j]E[x_k^2] + 2E[x_i x_k]E[x_k x_j]\right)\right]_{ij}
  = [E[x_i x_j]]_{ij} \sum_{k=1}^{N_1} E[x_k^2] + 2\left[\sum_{k=1}^{N_1} E[x_i x_k]E[x_k x_j]\right]_{ij}
  = \Sigma \, \mathrm{tr}(\Sigma_{11}) + 2\begin{bmatrix} \Sigma_{11}^2 & \Sigma_{11}\Sigma_{12} \\ \Sigma_{12}^T \Sigma_{11} & \Sigma_{12}^T \Sigma_{12} \end{bmatrix}
  = \Sigma\left(\mathrm{tr}(\Sigma_{11})\mathbf{1} + 2\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ 0 & 0 \end{bmatrix}\right). \qquad (2.38)

Similarly, we have

E\left[x x_2^T x_2 x^T\right] = \Sigma \, \mathrm{tr}(\Sigma_{22}) + 2\begin{bmatrix} \Sigma_{12}\Sigma_{12}^T & \Sigma_{12}\Sigma_{22} \\ \Sigma_{22}\Sigma_{12}^T & \Sigma_{22}^2 \end{bmatrix} = \Sigma\left(\mathrm{tr}(\Sigma_{22})\mathbf{1} + 2\begin{bmatrix} 0 & 0 \\ \Sigma_{12}^T & \Sigma_{22} \end{bmatrix}\right), \qquad (2.39)

and as a final check,

E\left[x x^T x x^T\right] = E\left[x \left(x_1^T x_1 + x_2^T x_2\right) x^T\right] = E\left[x x_1^T x_1 x^T\right] + E\left[x x_2^T x_2 x^T\right]. \qquad (2.40)
We furthermore have that

E\left[x x^T A x x^T\right] = E\left[\left[x_i x_j \sum_{k=1}^{N}\sum_{\ell=1}^{N} x_k a_{k\ell} x_\ell\right]_{ij}\right] = \left[\sum_{k=1}^{N}\sum_{\ell=1}^{N} a_{k\ell} E[x_i x_j x_k x_\ell]\right]_{ij}
  = \left[\sum_{k=1}^{N}\sum_{\ell=1}^{N} a_{k\ell}\left(E[x_i x_j]E[x_k x_\ell] + E[x_i x_k]E[x_j x_\ell] + E[x_i x_\ell]E[x_j x_k]\right)\right]_{ij}
  = [E[x_i x_j]]_{ij}\left(\sum_{k=1}^{N}\sum_{\ell=1}^{N} a_{k\ell} E[x_k x_\ell]\right) + \left[\sum_{k=1}^{N}\sum_{\ell=1}^{N} E[x_i x_k] a_{k\ell} E[x_\ell x_j]\right]_{ij} + \left[\sum_{k=1}^{N}\sum_{\ell=1}^{N} E[x_i x_\ell] a_{k\ell} E[x_k x_j]\right]_{ij}
  = \Sigma \, \mathrm{tr}(A\Sigma) + \Sigma A \Sigma + \Sigma A^T \Sigma
  = \Sigma\left(\mathrm{tr}(A\Sigma)\mathbf{1} + A\Sigma + A^T\Sigma\right), \qquad (2.41)

where A is a compatible square matrix.
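The identities above can be spot-checked by Monte Carlo. The following sketch verifies (2.35) for an arbitrary 2 × 2 covariance; the test matrix and sample count are illustrative choices, not from the text.

```python
# Monte Carlo spot-check of (2.35): E[x x^T x x^T] = Sigma (tr(Sigma) 1 + 2 Sigma)
# for x ~ N(0, Sigma). The covariance and sample count are illustrative.
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

M = 500000
x = rng.multivariate_normal(np.zeros(2), Sigma, size=M)

sq = np.sum(x*x, axis=1)               # x^T x for each sample
emp = (x*sq[:, None]).T @ x / M        # empirical E[(x^T x) x x^T]

theory = Sigma @ (np.trace(Sigma)*np.eye(2) + 2.0*Sigma)
print(emp)
print(theory)
```

Fourth moments converge slowly, so the agreement is only to a couple of decimal places at this sample count.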

2.2.3 Joint Gaussian PDFs, Their Factors, and Inference


We can also have a joint Gaussian over a pair of variables, (x, y), which we write as

p(x, y) = N\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right), \qquad (2.42)
which has the same exponential form as (2.29). Note that Σyx = ΣTxy .
It is always possible to break a joint density into the product of two

factors, p(x, y) = p(x|y) p(y), and we can work out the details for the joint Gaussian case by using the Schur complement⁶. We begin by noting that

\begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} = \begin{bmatrix} \mathbf{1} & \Sigma_{xy}\Sigma_{yy}^{-1} \\ 0 & \mathbf{1} \end{bmatrix} \begin{bmatrix} \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} & 0 \\ 0 & \Sigma_{yy} \end{bmatrix} \begin{bmatrix} \mathbf{1} & 0 \\ \Sigma_{yy}^{-1}\Sigma_{yx} & \mathbf{1} \end{bmatrix}, \qquad (2.43)

where 1 is the identity matrix. We then invert both sides to find that

\begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{1} & 0 \\ -\Sigma_{yy}^{-1}\Sigma_{yx} & \mathbf{1} \end{bmatrix} \begin{bmatrix} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & 0 \\ 0 & \Sigma_{yy}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{1} & -\Sigma_{xy}\Sigma_{yy}^{-1} \\ 0 & \mathbf{1} \end{bmatrix}. \qquad (2.44)

Issai Schur (1875-1941) was a German mathematician who worked on group representations (the subject with which he is most closely associated), but also in combinatorics and number theory and even theoretical physics.

Looking just to the quadratic part (inside the exponential) of the joint PDF, p(x, y), we have

\begin{bmatrix} x - \mu_x \\ y - \mu_y \end{bmatrix}^T \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}^{-1} \begin{bmatrix} x - \mu_x \\ y - \mu_y \end{bmatrix}
  = \left(x - \mu_x - \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)\right)^T \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} \left(x - \mu_x - \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y)\right) + \left(y - \mu_y\right)^T \Sigma_{yy}^{-1} \left(y - \mu_y\right), \qquad (2.45)
which is the sum of two quadratic terms. Since the exponential of a sum is the product of two exponentials, we have that

p(x, y) = p(x|y) \, p(y), \qquad (2.46a)
p(x|y) = N\left(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y), \ \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right), \qquad (2.46b)
p(y) = N\left(\mu_y, \Sigma_{yy}\right). \qquad (2.46c)

It is important to note that both factors, p(x|y) and p(y), are Gaussian PDFs. Also, if we happen to know the value of y (i.e., it is measured), we can work out the likelihood of x given this value of y by computing p(x|y) using (2.46b).

This is in fact the cornerstone of Gaussian inference: we start with a prior about our state, x ∼ N(µx, Σxx), then narrow this down based on some measurements, y_{meas}. In (2.46b), we see that an adjustment is made to the mean, µx, and the covariance, Σxx (it is made smaller).

⁶ In this case, we have that the Schur complement of Σyy is the expression Σxx − Σxy Σyy⁻¹ Σyx.
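A minimal sketch of this conditioning step, using (2.46b) with made-up scalar blocks:

```python
# A minimal sketch of Gaussian conditioning, equation (2.46b): given a joint
# Gaussian over (x, y) and a measured value of y, compute the mean and
# covariance of p(x|y). All block values are made-up scalars.
import numpy as np

mu_x, mu_y = np.array([0.0]), np.array([0.0])
Sxx = np.array([[4.0]])
Sxy = np.array([[1.5]])
Syy = np.array([[2.0]])
y_meas = np.array([1.0])

Syy_inv = np.linalg.inv(Syy)
mu_post = mu_x + Sxy @ Syy_inv @ (y_meas - mu_y)   # conditional mean
Sigma_post = Sxx - Sxy @ Syy_inv @ Sxy.T           # conditional covariance

print(mu_post)      # [0.75]
print(Sigma_post)   # [[2.875]] -- smaller than Sxx, as expected
```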

2.2.4 Statistically Independent, Uncorrelated


In the case of Gaussian PDFs, statistically independent variables are
also uncorrelated (true in general) and uncorrelated variables are also
statistically independent (not true for all types of PDFs). We can see
this fairly easily by looking at (2.46). If we assume statistical inde-
pendence, p(x, y) = p(x)p(y) and so p(x|y) = p(x) = N (µx , Σxx ).
Looking at (2.46b), this implies

\Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y) = 0, \qquad (2.47a)
\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} = 0, \qquad (2.47b)

which further implies that Σxy = 0. Since

\Sigma_{xy} = E\left[(x - \mu_x)(y - \mu_y)^T\right] = E\left[xy^T\right] - E[x]E[y]^T, \qquad (2.48)

we have the uncorrelated condition:

E\left[xy^T\right] = E[x]E[y]^T. \qquad (2.49)
We can also work through the logic in the other direction by first assum-
ing the variables are uncorrelated, which leads to Σxy = 0, and finally
to statistical independence. Since these conditions are equivalent, we
will often use statistically independent and uncorrelated interchangeably
in the context of Gaussian PDFs.

2.2.5 Linear Change of Variables


Suppose that we have a Gaussian random variable,

x \in \mathbb{R}^N \sim N(\mu_x, \Sigma_{xx}),

and that we have a second random variable, y ∈ R^M, related to x through the linear map,

y = Gx, \qquad (2.50)

where we assume that G ∈ R^{M×N}. We would like to know what the statistical properties of y are. One way to do this is to simply apply the expectation operator directly:

\mu_y = E[y] = E[Gx] = G\,E[x] = G\mu_x, \qquad (2.51a)
\Sigma_{yy} = E\left[(y - \mu_y)(y - \mu_y)^T\right] = G\,E\left[(x - \mu_x)(x - \mu_x)^T\right]G^T = G\Sigma_{xx}G^T, \qquad (2.51b)

so that we have y ∼ N(\mu_y, \Sigma_{yy}) = N(G\mu_x, G\Sigma_{xx}G^T).
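A quick sanity check of (2.51) by sampling; G and the input statistics below are made up.

```python
# Sanity check of (2.51): if y = G x with x ~ N(mu_x, Sigma_xx), then
# y ~ N(G mu_x, G Sigma_xx G^T). G and the input statistics are made up.
import numpy as np

rng = np.random.default_rng(3)
mu_x = np.array([1.0, 2.0, -1.0])
Sigma_xx = np.diag([1.0, 0.5, 2.0])
G = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])        # maps R^3 to R^2

x = rng.multivariate_normal(mu_x, Sigma_xx, size=200000)
y = x @ G.T                            # apply the map to every sample

mu_y = G @ mu_x
Sigma_yy = G @ Sigma_xx @ G.T
print(y.mean(axis=0), mu_y)            # sample mean vs. G mu_x
print(np.cov(y.T), Sigma_yy)           # sample covariance vs. G Sigma G^T
```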
Another way to look at this is a change of variables. We assume that
the linear map is injective meaning two x values cannot map to a single
y value; in fact, let us simplify the injective condition by assuming a

stricter condition, that G is invertible (and hence M = N). The axiom of total probability lets us write

\int_{-\infty}^{\infty} p(x) \, dx = 1. \qquad (2.52)

A small volume of x is related to a small volume of y by

dy = |\det G| \, dx. \qquad (2.53)

We can then make a substitution of variables to have

1 = \int_{-\infty}^{\infty} p(x) \, dx
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^N \det \Sigma_{xx}}} \exp\left(-\frac{1}{2}(x - \mu_x)^T \Sigma_{xx}^{-1} (x - \mu_x)\right) dx
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^N \det \Sigma_{xx}}} \exp\left(-\frac{1}{2}(G^{-1}y - \mu_x)^T \Sigma_{xx}^{-1} (G^{-1}y - \mu_x)\right) |\det G|^{-1} \, dy
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^N \det G \det \Sigma_{xx} \det G^T}} \exp\left(-\frac{1}{2}(y - G\mu_x)^T G^{-T} \Sigma_{xx}^{-1} G^{-1} (y - G\mu_x)\right) dy
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{(2\pi)^N \det\left(G\Sigma_{xx}G^T\right)}} \exp\left(-\frac{1}{2}(y - G\mu_x)^T \left(G\Sigma_{xx}G^T\right)^{-1} (y - G\mu_x)\right) dy, \qquad (2.54)
whereupon we have µy = Gµx and Σyy = GΣxx GT , as before. If
M < N , our linear mapping is no longer injective and the change of
variable approach cannot be used to map statistics from x to y.
We can also think about going in the other direction from y to x,
assuming M < N and rank G = M . This is a bit tricky since the
resulting covariance for x will blow up since we are dilating7 to a larger
space. To get around this, we switch to information form. Letting
u = \Sigma_{yy}^{-1} y, \qquad (2.55)

we have that

u \sim N\left(\Sigma_{yy}^{-1}\mu_y, \Sigma_{yy}^{-1}\right). \qquad (2.56)

Likewise, letting

v = \Sigma_{xx}^{-1} x, \qquad (2.57)

⁷ Dilation is the opposite of projection.

we have that

v \sim N\left(\Sigma_{xx}^{-1}\mu_x, \Sigma_{xx}^{-1}\right). \qquad (2.58)

Since the mapping from y to x is not unique, we need to specify what we want to do. One choice is to let

v = G^T u \quad \Leftrightarrow \quad \Sigma_{xx}^{-1} x = G^T \Sigma_{yy}^{-1} y. \qquad (2.59)

We then take expectations:

\Sigma_{xx}^{-1}\mu_x = E[v] = E[G^T u] = G^T E[u] = G^T \Sigma_{yy}^{-1}\mu_y, \qquad (2.60a)
\Sigma_{xx}^{-1} = E\left[(v - \Sigma_{xx}^{-1}\mu_x)(v - \Sigma_{xx}^{-1}\mu_x)^T\right] = G^T E\left[(u - \Sigma_{yy}^{-1}\mu_y)(u - \Sigma_{yy}^{-1}\mu_y)^T\right] G = G^T \Sigma_{yy}^{-1} G. \qquad (2.60b)

Note, if Σxx⁻¹ is not full rank, we cannot actually recover Σxx and µx and must keep them in information form. However, multiple such estimates can be fused together, which is the subject of the next section.

2.2.6 Product of Gaussians


We now discuss a useful property of Gaussian PDFs; the direct product of K Gaussian PDFs is also a Gaussian PDF:

\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \equiv \eta \prod_{k=1}^{K} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right), \qquad (2.61)

where

\Sigma^{-1} = \sum_{k=1}^{K} \Sigma_k^{-1}, \qquad (2.62a)
\Sigma^{-1}\mu = \sum_{k=1}^{K} \Sigma_k^{-1}\mu_k, \qquad (2.62b)

and η is a normalization constant to enforce the axiom of total probability. The direct product comes up when fusing multiple estimates together. A one-dimensional example is provided in Figure 2.3.
[Figure 2.3: The product of two one-dimensional Gaussian PDFs is another one-dimensional Gaussian PDF.]
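Fusing two estimates with (2.62a) and (2.62b) can be sketched as follows; the two input estimates are arbitrary illustrative values.

```python
# Fusing two Gaussian estimates of the same quantity via the direct product,
# equations (2.62a) and (2.62b). The two input estimates are illustrative.
import numpy as np

mus = [np.array([1.0]), np.array([3.0])]
Sigmas = [np.array([[1.0]]), np.array([[4.0]])]

info = sum(np.linalg.inv(S) for S in Sigmas)                  # (2.62a)
info_mean = sum(np.linalg.inv(S) @ m for S, m in zip(Sigmas, mus))

Sigma = np.linalg.inv(info)
mu = Sigma @ info_mean                                        # (2.62b)

print(mu)      # [1.4] -- pulled toward the more certain estimate
print(Sigma)   # [[0.8]] -- fused estimate is more certain than either input
```

Working with the inverse covariances (the information form) makes the fusion a simple sum, which is why that form is convenient here.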

We also have that

\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \equiv \eta \prod_{k=1}^{K} \exp\left(-\frac{1}{2}(G_k x - \mu_k)^T \Sigma_k^{-1} (G_k x - \mu_k)\right), \qquad (2.63)

where

\Sigma^{-1} = \sum_{k=1}^{K} G_k^T \Sigma_k^{-1} G_k, \qquad (2.64a)
\Sigma^{-1}\mu = \sum_{k=1}^{K} G_k^T \Sigma_k^{-1}\mu_k, \qquad (2.64b)

in the case that the matrices, G_k ∈ R^{M_k×N}, are present, with M_k ≤ N. Again, η is a normalization constant. We also note this generalizes a result from the previous section.

2.2.7 Sherman-Morrison-Woodbury Identity


We will require the Sherman-Morrison-Woodbury (SMW) (Sherman and Morrison, 1949, 1950; Woodbury, 1950) matrix identity (sometimes called the matrix inversion lemma) in what follows. There are actually four different identities that come from a single derivation. We start by noting that we can factor a matrix into either a lower-diagonal-upper (LDU) or upper-diagonal-lower (UDL) form as follows:

\begin{bmatrix} A^{-1} & -B \\ C & D \end{bmatrix} = \begin{bmatrix} \mathbf{1} & 0 \\ CA & \mathbf{1} \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & D + CAB \end{bmatrix} \begin{bmatrix} \mathbf{1} & -AB \\ 0 & \mathbf{1} \end{bmatrix} \qquad \text{(LDU)}
  = \begin{bmatrix} \mathbf{1} & -BD^{-1} \\ 0 & \mathbf{1} \end{bmatrix} \begin{bmatrix} A^{-1} + BD^{-1}C & 0 \\ 0 & D \end{bmatrix} \begin{bmatrix} \mathbf{1} & 0 \\ D^{-1}C & \mathbf{1} \end{bmatrix}. \qquad \text{(UDL)} \qquad (2.65)

The SMW formula is named for American statisticians Jack Sherman, Winifred J. Morrison, and Max A. Woodbury, but was independently presented by English mathematician W. J. Duncan, American statisticians L. Guttman and M. S. Bartlett, and possibly others.

We then invert each of these forms. For the LDU we have
\begin{bmatrix} A^{-1} & -B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{1} & AB \\ 0 & \mathbf{1} \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & (D + CAB)^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{1} & 0 \\ -CA & \mathbf{1} \end{bmatrix}
  = \begin{bmatrix} A - AB(D + CAB)^{-1}CA & AB(D + CAB)^{-1} \\ -(D + CAB)^{-1}CA & (D + CAB)^{-1} \end{bmatrix}. \qquad (2.66)

For the UDL we have

\begin{bmatrix} A^{-1} & -B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{1} & 0 \\ -D^{-1}C & \mathbf{1} \end{bmatrix} \begin{bmatrix} (A^{-1} + BD^{-1}C)^{-1} & 0 \\ 0 & D^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{1} & BD^{-1} \\ 0 & \mathbf{1} \end{bmatrix}
  = \begin{bmatrix} (A^{-1} + BD^{-1}C)^{-1} & (A^{-1} + BD^{-1}C)^{-1}BD^{-1} \\ -D^{-1}C(A^{-1} + BD^{-1}C)^{-1} & D^{-1} - D^{-1}C(A^{-1} + BD^{-1}C)^{-1}BD^{-1} \end{bmatrix}. \qquad (2.67)

Comparing the blocks of (2.66) and (2.67), we have the following identities:

(A^{-1} + BD^{-1}C)^{-1} \equiv A - AB(D + CAB)^{-1}CA, \qquad (2.68a)
(D + CAB)^{-1} \equiv D^{-1} - D^{-1}C(A^{-1} + BD^{-1}C)^{-1}BD^{-1}, \qquad (2.68b)
AB(D + CAB)^{-1} \equiv (A^{-1} + BD^{-1}C)^{-1}BD^{-1}, \qquad (2.68c)
(D + CAB)^{-1}CA \equiv D^{-1}C(A^{-1} + BD^{-1}C)^{-1}. \qquad (2.68d)
These are all used frequently when manipulating expressions involving
the covariance matrices associated with Gaussian PDFs.
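Identity (2.68a) can be verified numerically on random matrices of compatible sizes; the sizes and scalings below are arbitrary choices made to keep the matrices well conditioned.

```python
# Numerical verification of identity (2.68a),
# (A^{-1} + B D^{-1} C)^{-1} = A - A B (D + C A B)^{-1} C A,
# on arbitrary well-conditioned random matrices.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
A = np.eye(n) + 0.1*rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
C = rng.standard_normal((m, n))
D = np.eye(m) + 0.1*rng.standard_normal((m, m))

lhs = np.linalg.inv(np.linalg.inv(A) + B @ np.linalg.inv(D) @ C)
rhs = A - A @ B @ np.linalg.inv(D + C @ A @ B) @ C @ A

print(np.max(np.abs(lhs - rhs)))   # should be near machine precision
```

The practical payoff is that the left side inverts an n × n matrix while the right side only inverts an m × m one, which matters when m is much smaller than n.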

2.2.8 Passing a Gaussian Through a Nonlinearity


We now examine the process of passing a Gaussian PDF through a nonlinear function, namely computing

p(y) = \int_{-\infty}^{\infty} p(y|x)p(x) \, dx, \qquad (2.69)

where we have that

p(y|x) = N(g(x), R), \qquad p(x) = N(\mu_x, \Sigma_{xx}),

and g(·) is a nonlinear map, g : x ↦ y. This procedure is required, for example, in the denominator when carrying out full Bayesian inference.

Scalar Case via Change of Variables


We begin with a Gaussian random variable, x ∈ R¹:

x \sim N(0, \sigma^2). \qquad (2.70)

For the PDF on x we have

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{x^2}{\sigma^2}\right). \qquad (2.71)

Now consider the nonlinear mapping,

y = \exp(x), \qquad (2.72)

which is invertible:

x = \ln(y). \qquad (2.73)

The infinitesimal integration volumes for x and y are then related by

dy = \exp(x) \, dx, \qquad (2.74)

or

dx = \frac{1}{y} \, dy. \qquad (2.75)

According to the axiom of total probability we have

1 = \int_{-\infty}^{\infty} p(x) \, dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{x^2}{\sigma^2}\right) dx
  = \int_{0}^{\infty} \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\frac{(\ln y)^2}{\sigma^2}\right)\frac{1}{y}}_{p(y)} \, dy = \int_{0}^{\infty} p(y) \, dy, \qquad (2.76)

giving us the exact expression for p(y), which is plotted in Figure 2.4
for σ 2 = 1 as the black curve; the area under this curve, from y = 0 to
∞ is 1. The grey histogram is a numerical approximation of the PDF
generated by sampling x a large number of times and passing these
through the nonlinearity individually, then binning. These approaches
agree very well, validating our method of changing variables.
Note, p(y) is no longer Gaussian owing to the nonlinear change of
variables. We can verify numerically that the area under this function
is indeed 1 (i.e., it is a valid PDF). It is worth noting that had we not
[Figure 2.4: The PDF, p(y), resulting from passing p(x) = (1/√(2π)) exp(−x²/2) through the nonlinearity, y = exp(x).]

been careful about handling the change of variables and including the 1/y factor, we would not have a valid PDF.
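The scalar example above can be reproduced numerically: sample x, pass the samples through y = exp(x), and compare against the exact p(y) from (2.76), including the 1/y factor. The grid limits and sample count are arbitrary implementation choices.

```python
# Reproducing the scalar example numerically: draw x ~ N(0, 1), map it through
# y = exp(x), and compare against the exact p(y) from (2.76), which includes
# the 1/y factor from the change of variables.
import numpy as np

rng = np.random.default_rng(2)
y = np.exp(rng.standard_normal(200000))    # samples passed through exp(.)

def p_y(yv):
    return np.exp(-0.5*np.log(yv)**2) / (yv*np.sqrt(2.0*np.pi))

grid = np.linspace(1e-6, 60.0, 400001)
dy = grid[1] - grid[0]
total = np.sum(p_y(grid))*dy               # area under p(y), close to 1
print(total)

lo, hi = 1.0, 1.5
frac = np.mean((y >= lo) & (y < hi))       # empirical Pr(lo <= y < hi)
pred = np.sum(p_y(grid[(grid >= lo) & (grid < hi)]))*dy
print(frac, pred)                          # the two should agree closely
```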

General Case via Linearization


Unfortunately, (2.69) cannot be computed in closed form for every g
and becomes more difficult in the multivariate case than the scalar one.
We thus linearize the nonlinear map such that

g(x) \approx \mu_y + G(x - \mu_x), \qquad G = \left.\frac{\partial g(x)}{\partial x}\right|_{x=\mu_x}, \qquad \mu_y = g(\mu_x), \qquad (2.77)
where G is the Jacobian of g, with respect to x. This allows us to then
pass the Gaussian through the linearized function in closed form; it is
an approximation that works well for mildly nonlinear maps.
Figure 2.5 depicts the process of passing a one-dimensional Gaussian PDF through a deterministic nonlinear function, g(·). In general, we could be making an inference through a stochastic function, one that introduces additional noise.
[Figure 2.5: Passing a one-dimensional Gaussian through a deterministic nonlinear function, g(·). Here we linearize the nonlinearity in order to propagate the variance approximately.]

Returning to (2.69), we have that

p(y) = \int_{-\infty}^{\infty} p(y|x)p(x) \, dx
  = \eta \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}\left(y - (\mu_y + G(x - \mu_x))\right)^T R^{-1}\left(y - (\mu_y + G(x - \mu_x))\right)\right) \exp\left(-\frac{1}{2}(x - \mu_x)^T \Sigma_{xx}^{-1}(x - \mu_x)\right) dx
  = \eta \exp\left(-\frac{1}{2}(y - \mu_y)^T R^{-1}(y - \mu_y)\right) \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}(x - \mu_x)^T\left(\Sigma_{xx}^{-1} + G^T R^{-1} G\right)(x - \mu_x)\right) \exp\left((y - \mu_y)^T R^{-1} G(x - \mu_x)\right) dx, \qquad (2.78)

where η is a normalization constant. Defining F such that

F^T\left(G^T R^{-1} G + \Sigma_{xx}^{-1}\right) = R^{-1} G, \qquad (2.79)

we may complete the square for the part inside the integral such that

\exp\left(-\frac{1}{2}(x - \mu_x)^T\left(\Sigma_{xx}^{-1} + G^T R^{-1} G\right)(x - \mu_x)\right) \exp\left((y - \mu_y)^T R^{-1} G(x - \mu_x)\right)
  = \exp\left(-\frac{1}{2}\left(x - \mu_x - F(y - \mu_y)\right)^T\left(G^T R^{-1} G + \Sigma_{xx}^{-1}\right)\left(x - \mu_x - F(y - \mu_y)\right)\right)
  \times \exp\left(\frac{1}{2}(y - \mu_y)^T F^T\left(G^T R^{-1} G + \Sigma_{xx}^{-1}\right) F(y - \mu_y)\right). \qquad (2.80)

The second factor is independent of x and may be brought outside of the integral. The remaining integral (the first factor) is exactly Gaussian in x and thus will integrate (over x) to a constant and thus can be absorbed in the constant η. Thus, for p(y) we have

p(y) = \rho \exp\left(-\frac{1}{2}(y - \mu_y)^T\left(R^{-1} - F^T\left(G^T R^{-1} G + \Sigma_{xx}^{-1}\right)F\right)(y - \mu_y)\right)
  = \rho \exp\Bigl(-\frac{1}{2}(y - \mu_y)^T \underbrace{\left(R^{-1} - R^{-1} G\left(G^T R^{-1} G + \Sigma_{xx}^{-1}\right)^{-1} G^T R^{-1}\right)}_{(R + G\Sigma_{xx}G^T)^{-1} \text{ by } (2.68)} (y - \mu_y)\Bigr)
  = \rho \exp\left(-\frac{1}{2}(y - \mu_y)^T\left(R + G\Sigma_{xx}G^T\right)^{-1}(y - \mu_y)\right), \qquad (2.81)

where ρ is the new normalization constant. This is a Gaussian for y:

y \sim N(\mu_y, \Sigma_{yy}) = N\left(g(\mu_x), R + G\Sigma_{xx}G^T\right). \qquad (2.82)

As we will see later, the two equations, (2.63) and (2.81), constitute the
observation and predictive steps of the classic discrete-time (extended)
Kalman filter (Kalman, 1960b). These two steps may be thought of as
the creation and destruction of information in the filter, respectively.
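The linearized propagation (2.82) can be compared against Monte Carlo for a mildly nonlinear map. The map y = sin(x), the operating point, and the noise levels below are illustrative assumptions, not from the text.

```python
# Comparing the linearized propagation (2.82) against Monte Carlo for a mildly
# nonlinear map. The map y = sin(x), operating point, and noise levels are
# illustrative assumptions, not from the text.
import numpy as np

rng = np.random.default_rng(4)
mu_x, Sigma_xx = 0.3, 0.01          # tight input Gaussian
R = 0.0004                          # additive measurement noise covariance

G = np.cos(mu_x)                    # Jacobian of sin(.) at mu_x, as in (2.77)
mu_y = np.sin(mu_x)                 # linearized mean
Sigma_yy = R + G*Sigma_xx*G         # linearized covariance (2.82)

n = 500000
x = mu_x + np.sqrt(Sigma_xx)*rng.standard_normal(n)
y = np.sin(x) + np.sqrt(R)*rng.standard_normal(n)

print(mu_y, y.mean())               # means agree closely
print(Sigma_yy, y.var())            # covariances agree closely
```

The agreement degrades as the input variance grows or the map curves more strongly over the support of p(x), which is exactly when the linearization assumption breaks down.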

2.2.9 Shannon Information of a Gaussian


In the case of a Gaussian PDF, we have for the Shannon information:

H(x) = -\int_{-\infty}^{\infty} p(x) \ln p(x) \, dx
  = -\int_{-\infty}^{\infty} p(x)\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu) - \ln\sqrt{(2\pi)^N \det \Sigma}\right) dx
  = \frac{1}{2}\ln\left((2\pi)^N \det \Sigma\right) + \frac{1}{2}\int_{-\infty}^{\infty} (x - \mu)^T \Sigma^{-1}(x - \mu) \, p(x) \, dx
  = \frac{1}{2}\ln\left((2\pi)^N \det \Sigma\right) + \frac{1}{2}E\left[(x - \mu)^T \Sigma^{-1}(x - \mu)\right], \qquad (2.83)
where we have written the second term using an expectation. In fact, this term is exactly a squared Mahalanobis distance, which is like a squared Euclidean distance, but weighted in the middle by the inverse covariance matrix. A nice property of this quadratic function inside the expectation allows us to rewrite it using the (linear) trace operator from linear algebra:

(x - \mu)^T \Sigma^{-1}(x - \mu) = \mathrm{tr}\left(\Sigma^{-1}(x - \mu)(x - \mu)^T\right). \qquad (2.84)

Prasanta Chandra Mahalanobis (1893-1972) was an Indian scientist and applied statistician known for this measure of statistical distance (Mahalanobis, 1936).

Since the expectation is also a linear operator, we may interchange the order of the expectation and trace, arriving at:
   
E\left[(x - \mu)^T \Sigma^{-1}(x - \mu)\right] = \mathrm{tr}\left(E\left[\Sigma^{-1}(x - \mu)(x - \mu)^T\right]\right)
  = \mathrm{tr}\Bigl(\Sigma^{-1}\underbrace{E\left[(x - \mu)(x - \mu)^T\right]}_{\Sigma}\Bigr)
  = \mathrm{tr}\left(\Sigma^{-1}\Sigma\right) = \mathrm{tr}\,\mathbf{1} = N, \qquad (2.85)
which is just the dimension of the variable. Substituting this back into our expression for Shannon information, we have

H(x) = \frac{1}{2}\ln\left((2\pi)^N \det \Sigma\right) + \frac{1}{2}E\left[(x - \mu)^T \Sigma^{-1}(x - \mu)\right]
  = \frac{1}{2}\ln\left((2\pi)^N \det \Sigma\right) + \frac{1}{2}N
  = \frac{1}{2}\left(\ln\left((2\pi)^N \det \Sigma\right) + N \ln e\right)
  = \frac{1}{2}\ln\left((2\pi e)^N \det \Sigma\right), \qquad (2.86)

which is purely a function of Σ, the covariance matrix of the Gaussian PDF. In fact, geometrically, we may interpret √(det Σ) as the volume of the uncertainty ellipsoid formed by the Gaussian PDF. Figure 2.6 shows the uncertainty ellipse for a two-dimensional Gaussian PDF.
shows the uncertainty ellipse for a two-dimensional Gaussian PDF.
[Figure 2.6: Uncertainty ellipse for a two-dimensional Gaussian PDF. The geometric area inside the ellipse is A = M²π√(det Σ). The Shannon information expression is provided for comparison.]

Note that along the boundary of the uncertainty ellipse, p(x) is constant. To see this, consider that the points along this ellipse must satisfy

(x - \mu)^T \Sigma^{-1}(x - \mu) = \frac{1}{M^2}, \qquad (2.87)
where M is a factor applied to scale the nominal (M = 1) covariance.
In this case we have that
p(x) = \frac{1}{\sqrt{(2\pi)^N e^{1/M^2} \det \Sigma}}, \qquad (2.88)
which is independent of x and thus constant. Looking at this last ex-
pression, we note that as the dimension, N , of the data increases, we
need to very rapidly increase M in order to keep p(x) constant.
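Equation (2.86) can be cross-checked against a Monte Carlo estimate of −E[ln p(x)]; the covariance below is an arbitrary example.

```python
# Cross-checking (2.86): the Shannon information of a Gaussian computed from
# its covariance should match a Monte Carlo estimate of -E[ln p(x)].
# The covariance is an arbitrary example.
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.4],
                  [0.4, 1.0]])
N = Sigma.shape[0]

H = 0.5*np.log((2.0*np.pi*np.e)**N * np.linalg.det(Sigma))   # (2.86)

x = rng.multivariate_normal(np.zeros(N), Sigma, size=200000)
Sinv = np.linalg.inv(Sigma)
logp = (-0.5*np.sum((x @ Sinv)*x, axis=1)
        - 0.5*np.log((2.0*np.pi)**N*np.linalg.det(Sigma)))

print(H, -logp.mean())   # the two values should agree closely
```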

2.2.10 Mutual Information of a Joint Gaussian PDF


Assume we have a joint Gaussian for variables x ∈ R^N and y ∈ R^M given by

p(x, y) = N(\mu, \Sigma) = N\left(\begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}\right). \qquad (2.89)

By inserting (2.86) into (2.22), we can easily see that the mutual information for the joint Gaussian is given by

I(x, y) = \frac{1}{2}\ln\left((2\pi e)^N \det \Sigma_{xx}\right) + \frac{1}{2}\ln\left((2\pi e)^M \det \Sigma_{yy}\right) - \frac{1}{2}\ln\left((2\pi e)^{M+N} \det \Sigma\right)
  = -\frac{1}{2}\ln\left(\frac{\det \Sigma}{\det \Sigma_{xx}\det \Sigma_{yy}}\right). \qquad (2.90)
Looking back to (2.43), we can also note that

\det \Sigma = \det \Sigma_{xx} \det\left(\Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right) = \det \Sigma_{yy} \det\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right). \qquad (2.91)

Inserting this into the above, we have

I(x, y) = -\frac{1}{2}\ln\det\left(\mathbf{1} - \Sigma_{xx}^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right) = -\frac{1}{2}\ln\det\left(\mathbf{1} - \Sigma_{yy}^{-1}\Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right), \qquad (2.92)

where the two versions can be seen to be equivalent through Sylvester's determinant theorem.

James Joseph Sylvester (1814-1897) was an English mathematician who made fundamental contributions to matrix theory, invariant theory, number theory, partition theory, and combinatorics. Sylvester's determinant theorem says that det(1 − AB) = det(1 − BA), even when A and B are not square.

2.2.11 Cramér-Rao Lower Bound Applied to Gaussian PDFs

Suppose that we have K samples (i.e., measurements), x_{meas,k} ∈ R^N, drawn from a Gaussian PDF. The K statistically independent random variables associated with these measurements are thus

(\forall k) \quad x_k \sim N(\mu, \Sigma). \qquad (2.93)
µ, from the measurements, xmeas,1 , . . . , xmeas,K . For the joint density of
all the random variables, x = (x1 , . . . , xK ), we in fact have
q
1
ln p(x|µ, Σ) = − (x−Aµ)T B−1 (x−Aµ)−ln (2π)N K det B, (2.94)
2
where
 T
A = 1 1 ··· 1 , B = diag (Σ, Σ, . . . , Σ) . (2.95)
| {z } | {z }
K blocks K blocks

In this case, we have

    (∂ ln p(x|μ, Σ) / ∂μ)^T = A^T B^{-1} (x − Aμ),    (2.96)

and thus the Fisher information matrix is

    I(x|μ) = E[ (∂ ln p(x|μ, Σ)/∂μ)^T (∂ ln p(x|μ, Σ)/∂μ) ]
           = E[ A^T B^{-1} (x − Aμ)(x − Aμ)^T B^{-1} A ]
           = A^T B^{-1} E[(x − Aμ)(x − Aμ)^T] B^{-1} A
           = A^T B^{-1} B B^{-1} A
           = A^T B^{-1} A
           = K Σ^{-1},    (2.97)

which we can see is just K times the inverse covariance of the Gaussian
density. The CRLB thus says
    cov(x̂ | x_meas,1, …, x_meas,K) ≥ (1/K) Σ.    (2.98)
In other words, the lower limit of the uncertainty in the estimate of the
mean, x̂, becomes smaller and smaller the more measurements we have
(as we would expect).
Note, in computing the CRLB we did not need to actually specify the
form of the unbiased estimator at all; the CRLB is the lower bound for
any unbiased estimator. In this case, it is not hard to find an estimator
that performs right at the CRLB:
    x̂ = (1/K) ∑_{k=1}^{K} x_meas,k.    (2.99)
For the mean of this estimator we have
    E[x̂] = E[(1/K) ∑_{k=1}^{K} x_k] = (1/K) ∑_{k=1}^{K} E[x_k] = (1/K) ∑_{k=1}^{K} μ = μ,    (2.100)
which shows the estimator is indeed unbiased. For the covariance we
have
 
    cov(x̂ | x_meas,1, …, x_meas,K) = E[(x̂ − μ)(x̂ − μ)^T]
        = E[ ((1/K) ∑_{k=1}^{K} x_k − μ) ((1/K) ∑_{k=1}^{K} x_k − μ)^T ]
        = (1/K²) ∑_{k=1}^{K} ∑_{ℓ=1}^{K} E[(x_k − μ)(x_ℓ − μ)^T]    (which is Σ when k = ℓ, 0 otherwise)
        = (1/K) Σ,    (2.101)
which is right at the CRLB.
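This estimator, and the fact that it sits right at the CRLB, can be checked numerically. A minimal NumPy sketch (the particular Σ, μ, K, and trial count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

# Arbitrary 2D Gaussian and number of measurements per trial.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu = np.array([1.0, -2.0])
K = 10

# Repeat the estimation many times to approximate cov(x_hat).
trials = 20000
x_meas = rng.multivariate_normal(mu, Sigma, size=(trials, K))  # trials x K x 2
x_hat = x_meas.mean(axis=1)                                    # estimator (2.99), per trial

P_emp = np.cov(x_hat.T)   # empirical estimator covariance
P_crlb = Sigma / K        # CRLB (2.98)
```

The empirical covariance of the sample-mean estimator matches (1/K)Σ closely, and the sample mean of the estimates is unbiased, as (2.100)-(2.101) predict.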

2.3 Gaussian Processes


We have already discussed Gaussian random variables and their asso-
ciated PDFs. We write
x ∼ N (µ, Σ), (2.102)
to say x ∈ RN is Gaussian. We will use this type of random variable
extensively to represent discrete-time quantities. We will also want to
talk about state quantities that are continuous functions of time, t.
To do so, we need to introduce Gaussian processes (GPs) (Rasmussen
and Williams, 2006). Figure 2.7 depicts a trajectory represented by a Gaussian process. There is a mean function, μ(t), and a covariance function, Σ(t, t′).

(Figure 2.7: Continuous-time trajectories, x(t) ∼ GP(μ(t), Σ(t, t′)), can be represented using Gaussian processes, which have a mean function (dark line) and a covariance function (shaded area).)
The idea is that the entire trajectory is a single random variable
belonging to a class of functions. The closer a function is to the mean
function, the more likely it is. The covariance function controls how
smooth the function is by describing the correlation between two times,
t and t′. We write

    x(t) ∼ GP(μ(t), Σ(t, t′)),    (2.103)
to indicate a continuous-time trajectory is a Gaussian process (GP).
The GP concept generalizes beyond one-dimensional functions of time,
but we will only have need of this special case.
If we want to consider a variable at a single particular time of interest,
τ , then we can write
    x(τ) ∼ N(μ(τ), Σ(τ, τ)),    (2.104)

where Σ(τ, τ) is now simply a covariance matrix. We have essentially
marginalized out all of the other instants of time, leaving x(τ ) as a
usual Gaussian random variable.
In general, a GP can take on many different forms. One particular GP that we will use frequently is the zero-mean, white noise process. For w(t) to be zero-mean, white noise, we write

    w(t) ∼ GP(0, Q δ(t − t′)),    (2.105)

where Q is a power spectral density matrix and δ(·) is Dirac's delta function. This is a stationary noise process since it depends only on the difference, t − t′. (Paul Adrien Maurice Dirac (1902-1984) was an English theoretical physicist who made fundamental contributions to the early development of both quantum mechanics and quantum electrodynamics.)
    We will return to GPs when we want to talk about state estimation in continuous time. We will show that estimation in this context can be viewed as an application of Gaussian process regression (Rasmussen and Williams, 2006).
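As a concrete illustration of drawing trajectories from a GP on a grid of times, here is a minimal NumPy sketch; the zero mean and the squared-exponential covariance function are illustrative assumptions, not choices prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Evaluate the GP on a discrete grid of times.
t = np.linspace(0.0, 5.0, 100)

# Illustrative choices: zero mean and a squared-exponential covariance.
mu = np.zeros_like(t)
ell, sigma2 = 1.0, 2.0
Sigma = sigma2 * np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ell**2)

# Each draw is one (discretized) trajectory -- a single random variable.
samples = rng.multivariate_normal(mu, Sigma + 1e-9 * np.eye(t.size), size=3)

# Marginalizing at one time tau just picks out N(mu(tau), Sigma(tau, tau)).
tau_idx = 50
var_at_tau = Sigma[tau_idx, tau_idx]   # equals sigma2 for this kernel
```

The small diagonal jitter is a standard numerical safeguard to keep the covariance matrix positive-definite for sampling.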

2.4 Summary
The main take-away points from this chapter are:

1. We will be using probability density functions (PDFs) over some continuous state space to represent how certain we are that a robot
is in each possible state.
2. We often restrict ourselves to Gaussian PDFs to make the calcula-
tions easier.
3. We will frequently employ Bayes’ rule to carry out so-called Bayesian
inference as an approach to state estimation; we begin with a set of
possible states (the prior) and narrow these down based on actual
measurements (the posterior).
The next chapter will introduce some of the classic linear-Gaussian,
state-estimation methods.

2.5 Exercises
2.5.1 Show that for any two columns of the same length, u and v,
that
uT v = tr(vuT ).
2.5.2 Show that if two random variables, x and y, are statistically
independent, then the Shannon information of the joint density,
p(x, y), is the sum of the Shannon informations of the individual
densities, p(x) and p(y):
H(x, y) = H(x) + H(y).
2.5.3 For a Gaussian random variable, x ∼ N (µ, Σ), show that
E[xxT ] = Σ + µµT .
2.5.4 For a Gaussian random variable, x ∼ N (µ, Σ), show directly
that

    μ = E[x] = ∫_{−∞}^{∞} x p(x) dx.

2.5.5 For a Gaussian random variable, x ∼ N (µ, Σ), show directly


that
    Σ = E[(x − μ)(x − μ)^T] = ∫_{−∞}^{∞} (x − μ)(x − μ)^T p(x) dx.

2.5.6 Show that the direct product of K statistically independent Gaussian PDFs, x_k ∼ N(μ_k, Σ_k), is also a Gaussian PDF:

    exp(−½ (x − μ)^T Σ^{-1} (x − μ)) ≡ η ∏_{k=1}^{K} exp(−½ (x − μ_k)^T Σ_k^{-1} (x − μ_k)),

where

    Σ^{-1} = ∑_{k=1}^{K} Σ_k^{-1},    Σ^{-1} μ = ∑_{k=1}^{K} Σ_k^{-1} μ_k,

and η is a normalization constant to enforce the axiom of total probability.
2.5.7 Show that the weighted sum of K statistically independent random variables, x_k, given by

    x = ∑_{k=1}^{K} w_k x_k,

with ∑_{k=1}^{K} w_k = 1 and w_k ≥ 0, has a PDF that satisfies the axiom of total probability and whose mean is given by

    μ = ∑_{k=1}^{K} w_k μ_k,

where μ_k is the mean of x_k. Determine an expression for the covariance. Note, the random variables are not assumed to be Gaussian.
2.5.8 The random variable,

    y = x^T x,

is chi-squared (of order K) when x ∼ N(0, 1) is of length K. Show that the mean and variance are given by K and 2K, respectively. Hint: use Isserlis' theorem.
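A quick Monte-Carlo sanity check of the claimed mean and variance (not a substitute for the proof the exercise asks for; K and the number of draws are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)

K = 5
x = rng.standard_normal((200000, K))   # rows are draws of x ~ N(0, 1) in R^K
y = np.sum(x**2, axis=1)               # y = x^T x, chi-squared with K dof

mean_y, var_y = y.mean(), y.var()      # should be close to K and 2K
```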
3 Linear-Gaussian Estimation

This chapter will introduce some of the classic results from estima-
tion theory for linear models and Gaussian random variables includ-
ing the Kalman filter (KF) (Kalman, 1960b). We will begin with a
batch, discrete-time estimation problem that will provide important
insights into the nonlinear extension of the work in subsequent chap-
ters. From the batch approach, we will show how the recursive methods
can be developed. Finally, we will circle back to the more general case
of handling continuous-time motion models and connect these to the
discrete-time results as well as to Gaussian process regression from the
machine-learning world. Classic books that cover linear estimation in-
clude Bryson (1975), Maybeck (1994), and Stengel (1994).

3.1 Batch Discrete-Time Estimation


We will begin by setting up the problem that we want to solve and
then discuss methods of solution.

3.1.1 Problem Setup


In much of this chapter, we will consider discrete-time, linear, time-
varying equations. We define the following motion and observation
models:
motion model: xk = Ak−1 xk−1 + vk + wk , k = 1 . . . K (3.1a)
observation model: yk = Ck xk + nk , k = 0 . . . K (3.1b)
where k is the discrete-time index and K its maximum. The variables
in (3.1) have the following meanings:
    system state:        x_k ∈ R^N
    initial state:       x_0 ∈ R^N ∼ N(x̌_0, P̌_0)
    input:               v_k ∈ R^N
    process noise:       w_k ∈ R^N ∼ N(0, Q_k)
    measurement:         y_k ∈ R^M
    measurement noise:   n_k ∈ R^M ∼ N(0, R_k)

These are all random variables except vk , which is deterministic1 . The


noise variables and initial state knowledge are all assumed to be uncor-
related with one another (and with themselves at different timesteps).
The matrix, Ak ∈ RN ×N , is called the transition matrix. The matrix,
Ck ∈ RM ×N , is called the observation matrix.
Although we want to know the state of the system (at all times),
we only have access to the following quantities, and must base our
estimate, x̂k , on just this information:
(i) The initial state knowledge, x̌0 , and the associated covariance
matrix, P̌0 ; sometimes we do not have this piece of information
and must do without2 ,
(ii) The inputs, vk , which typically come from the output of our
controller and so are known3 ; we also have the associated pro-
cess noise covariance, Qk ,
(iii) The measurements, yk,meas , which are realizations of the as-
sociated random variables, yk , and the associated covariance
matrix, Rk .
Based on the models in the previous section, we define the state esti-
mation problem as follows:
The problem of state estimation is to come up with an estimate, x̂k , of the true
state of a system, at one or more timesteps, k, given knowledge of the initial state,
x̌0 , a sequence of measurements, y0:K,meas , a sequence of inputs, v1:K , as well as
knowledge of the system’s motion and observation models.

The rest of this chapter will present a suite of techniques for addressing
this state estimation problem. Our approach will always be to attempt
to come up with not only a state estimate, but also to quantify the
uncertainty in that estimate.
To set ourselves up for what is to follow in the later chapters on
nonlinear estimation, we will begin by formulating a batch linear-
Gaussian (LG) estimation problem. The batch solution is very useful
for computing state estimates after the fact because it uses all the mea-
surements in the estimation of all the states at once (hence the usage of
‘batch’). However, a batch method cannot be used in real-time since we
cannot employ future measurements to estimate past states. For this
we will need recursive state estimators, which will be covered later in
this chapter.
1 Sometimes the input is specialized to be of the form vk = Bk uk , where uk ∈ RU is
now the input and Bk ∈ RN ×U is called the control matrix. We will use this form as
needed in our development.
2 We will use (·)ˆ to indicate posterior estimates (including measurements) and (·)ˇ to indicate prior estimates (not including measurements).
3 In robotics, this input is sometimes replaced by an interoceptive measurement. This is
a bit of a dangerous thing to do since it then conflates two sources of uncertainty:
process noise and measurement noise. If this is done, we must be careful to inflate Q
appropriately to reflect the two uncertainties.
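The models (3.1) are easy to simulate, which is useful for testing any of the estimators in this chapter. A minimal NumPy sketch for a scalar system (all dimensions, gains, inputs, and noise levels are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

K = 100                         # number of timesteps
A, C = 1.0, 1.0                 # scalar transition and observation 'matrices'
Q, R = 0.01, 0.25               # process and measurement noise variances
x0_check, P0_check = 0.0, 1.0   # prior on the initial state

# Simulate the true state and the data available to the estimator.
x_true = np.zeros(K + 1)
x_true[0] = x0_check + np.sqrt(P0_check) * rng.standard_normal()
v = 0.1 * np.ones(K + 1)        # known inputs, v_k (v[0] unused)
y_meas = np.zeros(K + 1)
y_meas[0] = C * x_true[0] + np.sqrt(R) * rng.standard_normal()
for k in range(1, K + 1):
    w_k = np.sqrt(Q) * rng.standard_normal()        # process noise
    x_true[k] = A * x_true[k - 1] + v[k] + w_k      # motion model (3.1a)
    n_k = np.sqrt(R) * rng.standard_normal()        # measurement noise
    y_meas[k] = C * x_true[k] + n_k                 # observation model (3.1b)
```

An estimator only ever sees x0_check, P0_check, v, Q, R, and y_meas; x_true is kept solely for evaluating the estimate afterwards.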

To show the relationship between various concepts, we will set up


the batch LG estimation problem using two different paradigms:

(i) Bayesian inference; here we update a prior density over states


(based on the initial state knowledge, inputs, and motion model)
with our measurements, to produce a posterior (Gaussian) den-
sity over states,
(ii) Maximum A Posteriori (MAP); here we employ optimization
to find the most likely posterior state given the information we
have (initial state knowledge, measurements, inputs).

While these approaches are somewhat different in nature, it turns out


that we arrive at the exact same answer for the LG problem. This
is because the full Bayesian posterior is exactly Gaussian. Therefore,
the optimization approach will find the maximum (i.e., mode) of a
Gaussian, and this is the same as the mean. It is important to pursue
these two avenues because when we move to nonlinear, non-Gaussian
systems in subsequent chapters, the mean and mode of the posterior
will no longer be the same and the two methods will arrive at different
answers. We will start with the MAP optimization method as it is a
bit easier to explain.

3.1.2 Maximum A Posteriori


In batch estimation, our goal is to solve the following MAP problem:

    x̂ = argmax_x p(x|v, y),    (3.2)

which is to say that we want to find the best single estimate for the
state of the system (at all timesteps), x̂, given the prior information,
v, and measurements, y4 . Note that we have

x = x0:K = (x0 , . . . , xK ), v = (x̌0 , v1:K ) = (x̌0 , v1 , . . . , vK ),


y = y0:K = (y0 , . . . , yK ),

where the timestep range may be dropped for convenience of notation


(when the range is the largest possible for that variable)5 . Note that we
have included the initial state information with the inputs to the sys-
tem; together these define our prior over the state. The measurements
serve to improve this prior information.
4 We will be a bit loose on notation here by dropping ‘meas’ from ymeas .
5 We will sometimes refer to lifted form when discussing variables and equations over the
entire trajectory rather than a single timestep. It should be clear when quantities are in
lifted form as they will not have a subscript for the timestep.

We begin by rewriting the MAP estimate using Bayes' rule:

    x̂ = argmax_x p(x|v, y) = argmax_x ( p(y|x, v) p(x|v) / p(y|v) )
      = argmax_x p(y|x) p(x|v),    (3.3)

where we drop the denominator because it does not depend on x. We


also drop v in p(y|x, v) since it does not affect y in our system if x is
known (see observation model).
A vital assumption that we are making, is that all of the noise vari-
ables, wk and nk for k = 0 . . . K, are uncorrelated. This allows us to
use Bayes’ rule to factor p(y|x) in the following way:

    p(y|x) = ∏_{k=0}^{K} p(y_k | x_k).    (3.4)

Furthermore, Bayes’ rule allows us to factor p(x|v) as

    p(x|v) = p(x_0 | x̌_0) ∏_{k=1}^{K} p(x_k | x_{k-1}, v_k).    (3.5)

In this linear system, the component (Gaussian) densities are given by

    p(x_0 | x̌_0) = 1/√((2π)^N det P̌_0) exp(−½ (x_0 − x̌_0)^T P̌_0^{-1} (x_0 − x̌_0)),    (3.6a)

    p(x_k | x_{k-1}, v_k) = 1/√((2π)^N det Q_k)
        × exp(−½ (x_k − A_{k-1} x_{k-1} − v_k)^T Q_k^{-1} (x_k − A_{k-1} x_{k-1} − v_k)),    (3.6b)

    p(y_k | x_k) = 1/√((2π)^M det R_k) exp(−½ (y_k − C_k x_k)^T R_k^{-1} (y_k − C_k x_k)).    (3.6c)

Note that we must have P̌0 , Qk , and Rk invertible; they are in fact
positive-definite by assumption and therefore invertible. To make the

optimization easier, we take the logarithm of both sides6 :


    ln(p(y|x) p(x|v)) = ln p(x_0 | x̌_0) + ∑_{k=1}^{K} ln p(x_k | x_{k-1}, v_k) + ∑_{k=0}^{K} ln p(y_k | x_k),    (3.7)

where

    ln p(x_0 | x̌_0) = −½ (x_0 − x̌_0)^T P̌_0^{-1} (x_0 − x̌_0) − ½ ln((2π)^N det P̌_0),    (3.8a)

    ln p(x_k | x_{k-1}, v_k) = −½ (x_k − A_{k-1} x_{k-1} − v_k)^T Q_k^{-1} (x_k − A_{k-1} x_{k-1} − v_k)
        − ½ ln((2π)^N det Q_k),    (3.8b)

    ln p(y_k | x_k) = −½ (y_k − C_k x_k)^T R_k^{-1} (y_k − C_k x_k) − ½ ln((2π)^M det R_k),    (3.8c)

and where, in each case, the final logarithmic term is independent of x.

Noticing that there are terms in (3.8) that do not depend on x, we define the following quantities:

    J_v,0(x) = ½ (x_0 − x̌_0)^T P̌_0^{-1} (x_0 − x̌_0),
    J_v,k(x) = ½ (x_k − A_{k-1} x_{k-1} − v_k)^T Q_k^{-1} (x_k − A_{k-1} x_{k-1} − v_k),  k = 1…K,    (3.9a)
    J_y,k(x) = ½ (y_k − C_k x_k)^T R_k^{-1} (y_k − C_k x_k),  k = 0…K,    (3.9b)

which are all squared Mahalanobis distances. We then define an overall objective function, J(x), that we will seek to minimize with respect to the design parameter, x:

    J(x) = ∑_{k=0}^{K} (J_v,k(x) + J_y,k(x)).    (3.10)

We will work with J(x) as is, but note that it is possible to add all
kinds of additional terms to this expression that will influence the so-
lution for the best estimate (e.g., constraints, penalty terms). From an
6 A logarithm is a monotonically increasing function and therefore will not affect our
optimization problem.

optimization perspective, we seek to solve the following problem:


    x̂ = argmin_x J(x),    (3.11)

which will result in the same solution for the best estimate, x̂, as (3.2).
In other words, we are still finding the best estimate in order to max-
imize the joint likelihood of all the data we have. This is an uncon-
strained optimization problem in that we do not have to satisfy any
constraints on the design variable, x.
To further simplify our problem, we make use of the fact that equations (3.9) are quadratic in x. To make this more clear, we stack all the known data into a lifted column, z, and recall that x is also a tall column consisting of all the states:

    z = (x̌_0, v_1, …, v_K, y_0, y_1, …, y_K),    x = (x_0, …, x_K).    (3.12)

We then define the following block-matrix quantities:

    H = [  1
          −A_0     1
                   ⋱       ⋱
                  −A_{K-1}    1
          ──────────────────────
           C_0
                C_1
                     ⋱
                          C_K  ],    (3.13a)

    W = diag( P̌_0, Q_1, …, Q_K │ R_0, R_1, …, R_K ),    (3.13b)

where only non-zero blocks are shown. The solid partition lines are used to show the boundaries between the parts of the matrices relevant to the prior, v, and the measurements, y, in the lifted data vector, z.


Under these definitions, we find that
    J(x) = ½ (z − Hx)^T W^{-1} (z − Hx),    (3.14)
which is exactly quadratic in x. We note that we also have
 
    p(z|x) = η exp(−½ (z − Hx)^T W^{-1} (z − Hx)),    (3.15)
where η is a normalization constant.
Since J(x) is exactly a paraboloid, we can find its minimum in closed
form. Simply set the partial derivative with respect to the design vari-
able, x, to zero:

    ∂J(x)/∂x^T |_{x=x̂} = −H^T W^{-1} (z − Hx̂) = 0,    (3.16a)

    ⇒ (H^T W^{-1} H) x̂ = H^T W^{-1} z.    (3.16b)
The solution of (3.16b), x̂, is the classic batch least-squares solution
and is equivalent to the fixed-interval smoother7 from classic estimation
theory. The batch least-squares solution employs the pseudoinverse8 .
Computationally, to solve this linear system of equations, we would
never actually invert HT W−1 H (even if it were densely populated).
As we will see later, we have a special block-tridiagonal structure to
HT W−1 H, and hence a sparse-equation solver can be used to solve this
system efficiently9 .
One intuitive explanation of the batch linear-Gaussian problem is
that it is like a mass-spring system, as shown in Figure 3.1. Each term
in the objective function represents energy stored in one of the springs,
which varies as the masses’ positions are shifted. The optimal posterior
solution corresponds to the minimum-energy state.
(Figure 3.1: The batch linear-Gaussian problem is like a mass-spring system. Each term in the objective function represents energy stored in one of the springs, which varies as the carts' (i.e., masses') positions are shifted. The top row of carts corresponds to the prior (initial state, inputs) and the bottom row to the posterior (initial state, inputs, measurements). The optimal posterior solution corresponds to the minimum-energy state.)

7 The fixed-interval smoother is usually presented in a recursive formulation. We will discuss this in more detail later.
8 Also called the Moore-Penrose pseudoinverse.
9 True for the problem posed; not true for all LG problems.

3.1.3 Bayesian Inference


Now that we have seen the optimization approach to batch LG estima-
tion, we take a look at computing the full Bayesian posterior, p(x|v, y),
not just the maximum. This approach requires us to begin with a prior
density over states, which we will then update based on the measure-
ments.
In our case, a prior can be built up using the knowledge of the initial
state, as well as the inputs to the system: p(x|v). We will use just the
motion model to build this prior:
xk = Ak−1 xk−1 + vk + wk . (3.17)
In lifted matrix form10 , we can write this as
x = A(v + w), (3.18)
where w is the lifted form of the initial state and process noise, and

    A = [ 1
          A_0             1
          A_1 A_0         A_1             1
          ⋮               ⋮               ⋱         ⋱
          A_{K-1} ⋯ A_0   A_{K-1} ⋯ A_1   ⋯   A_{K-1}   1 ]    (3.19)

is the lifted transition matrix, which we see is lower-triangular. The
lifted mean is then

    x̌ = E[x] = E[A(v + w)] = Av,    (3.20)

and the lifted covariance is

    P̌ = E[(x − E[x])(x − E[x])^T] = A Q A^T,    (3.21)

where Q = E[w w^T] = diag(P̌_0, Q_1, …, Q_K). Our prior can then be neatly expressed as

    p(x|v) = N(x̌, P̌) = N(Av, A Q A^T).    (3.22)
We next turn to the measurements.
The measurement model is
yk = Ck xk + nk . (3.23)
This can also be written in lifted form as
y = Cx + n, (3.24)
10 ‘Lifted’ here refers to the fact that we are considering what happens at the entire
trajectory level.

where n is the lifted form of the measurement noise and

C = diag (C0 , C1 , . . . , CK ) , (3.25)

is the lifted observation matrix.


The joint likelihood of the prior lifted state and the measurements can now be written as

    p(x, y|v) = N( [x̌; Cx̌],  [P̌, P̌C^T; CP̌, CP̌C^T + R] ),    (3.26)

where R = E[n n^T] = diag(R_0, R_1, …, R_K). We can factor this according to

    p(x, y|v) = p(x|v, y) p(y|v).    (3.27)

We only care about the first factor, which is the full Bayesian posterior. This can be written, using the approach outlined in Section 2.2.3, as

    p(x|v, y) = N( x̌ + P̌C^T (CP̌C^T + R)^{-1} (y − Cx̌),  P̌ − P̌C^T (CP̌C^T + R)^{-1} CP̌ ).    (3.28)

Using the SMW identity from equations (2.68), this can be manipulated into the following form:

    p(x|v, y) = N( (P̌^{-1} + C^T R^{-1} C)^{-1} (P̌^{-1} x̌ + C^T R^{-1} y),  (P̌^{-1} + C^T R^{-1} C)^{-1} ),    (3.29)

where the first argument is the mean (our estimate, x̂) and the second is the covariance, P̂.

We can actually implement a batch estimator based on this equation,


since it represents the full Bayesian posterior, but this may not be
efficient.
To see the connection to the optimization approach discussed earlier, we rearrange the mean expression to arrive at a linear system for x̂,

    (P̌^{-1} + C^T R^{-1} C) x̂ = P̌^{-1} x̌ + C^T R^{-1} y,    (3.30)

and we see the inverse covariance, P̂^{-1}, appearing on the left-hand side. Substituting in x̌ = Av and P̌^{-1} = (A Q A^T)^{-1} = A^{-T} Q^{-1} A^{-1}, we can rewrite this as

    (A^{-T} Q^{-1} A^{-1} + C^T R^{-1} C) x̂ = A^{-T} Q^{-1} v + C^T R^{-1} y,    (3.31)

where the left-hand-side matrix is again P̂^{-1}.

We see that this requires computing A^{-1}. It turns out this has a beautifully simple form¹¹,

    A^{-1} = [  1
               −A_0    1
                      −A_1    1
                              ⋱       ⋱
                             −A_{K-1}    1 ],    (3.32)

which is still lower-triangular, but also very sparse (only the main diagonal and the one below are non-zero). If we define

    z = [v; y],    H = [A^{-1}; C],    W = diag(Q, R),    (3.33)

we can rewrite our system of equations as

    (H^T W^{-1} H) x̂ = H^T W^{-1} z,    (3.34)

which is identical to the optimization solution discussed earlier.
Again, it must be stressed that the reason the Bayesian approach
produces the same answer as the optimization solution for our LG esti-
mation problem is that the full Bayesian posterior is exactly Gaussian
and the mean and mode (i.e., maximum) of a Gaussian are one and
the same.
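This equivalence is easy to verify numerically: build a small random time-varying system, compute the posterior mean via (3.30), and compare it with the solution of (3.34). Everything below (dimensions, random system matrices, noise covariances) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def block_diag(*mats):
    """Assemble a block-diagonal matrix from (possibly rectangular) blocks."""
    rows, cols = sum(m.shape[0] for m in mats), sum(m.shape[1] for m in mats)
    out, r, c = np.zeros((rows, cols)), 0, 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r, c = r + m.shape[0], c + m.shape[1]
    return out

def rand_spd(n):
    """Random symmetric positive-definite matrix."""
    S = rng.standard_normal((n, n))
    return S @ S.T + n * np.eye(n)

K, N, M = 4, 2, 1  # tiny time-varying system (illustrative)
A_blocks = [rng.standard_normal((N, N)) for _ in range(K)]
C = block_diag(*[rng.standard_normal((M, N)) for _ in range(K + 1)])
Q = block_diag(*[rand_spd(N) for _ in range(K + 1)])   # diag(P0_check, Q_1..Q_K)
R = block_diag(*[rand_spd(M) for _ in range(K + 1)])

# Lifted A^{-1}, eq. (3.32): identity blocks with -A_k on the sub-diagonal.
Ainv = np.eye(N * (K + 1))
for k in range(K):
    Ainv[(k + 1) * N:(k + 2) * N, k * N:(k + 1) * N] = -A_blocks[k]
A = np.linalg.inv(Ainv)

v = rng.standard_normal(N * (K + 1))  # stacked (x0_check, v_1, ..., v_K)
y = rng.standard_normal(M * (K + 1))  # stacked measurements

# Bayesian posterior mean via (3.30), with P_check^{-1} = A^{-T} Q^{-1} A^{-1}.
x_check = A @ v
P_check_inv = Ainv.T @ np.linalg.inv(Q) @ Ainv
Rinv = np.linalg.inv(R)
x_hat_bayes = np.linalg.solve(P_check_inv + C.T @ Rinv @ C,
                              P_check_inv @ x_check + C.T @ Rinv @ y)

# Batch MAP via (3.33)-(3.34): H = [A^{-1}; C], W = diag(Q, R), z = (v; y).
H = np.vstack([Ainv, C])
Winv = block_diag(np.linalg.inv(Q), Rinv)
z = np.concatenate([v, y])
x_hat_map = np.linalg.solve(H.T @ Winv @ H, H.T @ Winv @ z)
```

The two estimates agree to numerical precision, exactly as the algebra above predicts.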

3.1.4 Existence, Uniqueness, and Observability


Most of the classic LG estimation results can be viewed as a special
case of (3.34). It is therefore important to ask when (3.34) has a unique
solution, which is the subject of this section.
Examining (3.34), we have from basic linear algebra that x̂ will exist
and be a unique solution if and only if H^T W^{-1} H is invertible, whereupon

    x̂ = (H^T W^{-1} H)^{-1} H^T W^{-1} z.    (3.35)
The question is then, when is HT W−1 H invertible? From linear algebra
again, we know that a necessary and sufficient condition for invertibility
is

    rank(H^T W^{-1} H) = N(K + 1),    (3.36)
because we have dim x = N (K + 1). Since W−1 is real symmetric
11 The special sparsity of A−1 is in fact critical to all classic LG results, as we will discuss
later. This makes the left-hand side of (3.31) exactly block-tridiagonal. This means we
can solve for x̂ in O(K) time instead of the usual O(K^3) time for solving linear systems.
This leads to the popular recursive solution known as the Kalman filter/smoother. The
sparsity comes from the fact that the system model obeys the Markov property.

positive-definite12 , we know that it can be dropped from the test so


that we only need
 
    rank(H^T H) = rank(H^T) = N(K + 1).    (3.37)

In other words, we need N (K+1) linearly independent rows (or columns)


in the matrix HT .
We now have two cases that should be considered:

(i) We have good prior knowledge of the initial state, x̌0 .


(ii) We do not have good prior knowledge of the initial state.

The first case is much easier than the second.

Case (i): Knowledge of initial state


Writing out HT , our rank test takes the form

    rank H^T
      = rank [ 1  −A_0^T                       C_0^T
                   1      −A_1^T                      C_1^T
                          1       ⋱                          C_2^T
                                  ⋱   −A_{K-1}^T                    ⋱
                                       1                                  C_K^T ],    (3.38)

which we see is exactly in row-echelon form. This means the matrix is


full rank, N (K + 1), since all the block-rows are linearly independent.
This means there will always be a unique solution for x̂ provided that

P̌0 > 0, Qk > 0, (3.39)

where > 0 means a matrix is positive-definite (and hence invertible).


The intuition behind this is that the prior already provides a complete
solution to the problem. The measurements only serve to adjust the
answer. Note, these are sufficient but not necessary conditions.

Case (ii): No knowledge of initial state


Each block-column of HT represents some piece of information that
we have about the system. The first block-column represents our knowledge about the initial state. Thus, removing knowledge of the initial state results in the rank test considering

12 Follows from P̌_0, Q, and R being real symmetric positive-definite.

    rank H^T
      = rank [ −A_0^T                            C_0^T
               1      −A_1^T                            C_1^T
                      1       ⋱                                C_2^T
                              ⋱    −A_{K-1}^T                         ⋱
                                    1                                       C_K^T ],    (3.40)
which we note has K + 1 block-rows (each of size N ). Moving the top
block-row to the bottom does not alter the rank:

    rank H^T
      = rank [ 1   −A_1^T                        C_1^T
                   1       ⋱                            C_2^T
                           ⋱    −A_{K-1}^T                     ⋮
                                 1                                   C_K^T
               −A_0^T                            C_0^T                     ].    (3.41)
Except for the bottom block-row, this is in row-echelon form. Again
without altering the rank, we can add to the bottom block-row, AT0
times the first block-row, AT0 AT1 times the second block-row, . . . , and
AT0 · · · ATK−1 times the Kth block-row, to see that

    rank H^T
      = rank [ 1   −A_1^T                        C_1^T
                   1       ⋱                            C_2^T
                           ⋱    −A_{K-1}^T                     ⋮
                                 1                                   C_K^T
               0    0      ⋯     0      C_0^T  A_0^T C_1^T  A_0^T A_1^T C_2^T  ⋯  A_0^T ⋯ A_{K-1}^T C_K^T ].    (3.42)
Examining this last expression, we notice immediately that the lower-
left partition is zero. Moreover, the upper-left partition is in row-echelon
form and in fact is of full rank, N K, since every row has a ‘leading one’.
Our overall rank condition for HT therefore collapses to showing that
the lower-right partition has rank N :
 
    rank [ C_0^T   A_0^T C_1^T   A_0^T A_1^T C_2^T   ⋯   A_0^T ⋯ A_{K-1}^T C_K^T ] = N.    (3.43)
If we further assume the system is time-invariant such that for all k we
have Ak = A and Ck = C (we use italicized symbols to avoid confusion
with lifted form) and we make the not-too-restrictive assumption that
K ≫ N, we may further simplify this condition.

To do so, we employ the Cayley-Hamilton theorem from linear algebra. (Cayley-Hamilton theorem: every square matrix, A, over the real field satisfies its own characteristic equation, det(λ1 − A) = 0.) Because A is N × N, its characteristic equation has at most N terms, and therefore any power of A greater than or equal to N can be rewritten as a linear combination of 1, A, …, A^{N-1}. By extension, for any k ≥ N, we can write

    (A^T)^{k-1} C^T = a_0 1^T C^T + a_1 A^T C^T + a_2 A^T A^T C^T + ⋯ + a_{N-1} (A^T)^{N-1} C^T,    (3.44)

for some set of scalars, a_0, a_1, …, a_{N-1}, not all zero. Since row-rank and column-rank are the same for any matrix, we can conclude that

    rank [ C^T   A^T C^T   A^T A^T C^T   ⋯   (A^T)^K C^T ]
        = rank [ C^T   A^T C^T   ⋯   (A^T)^{N-1} C^T ].    (3.45)

Defining the observability matrix, O, as

    O = [ C
          CA
          ⋮
          CA^{N-1} ],    (3.46)

our rank condition is

    rank O = N.    (3.47)
Readers familiar with linear control theory will recognize this as precisely the test for observability (Kalman, 1960a). (A system is observable if the initial state can be uniquely inferred based on measurements gathered in a finite amount of time.) Thus, we can see the direct connection between observability and invertibility of H^T W^{-1} H. The overall conditions for existence and uniqueness of a solution to (3.34) are

    Q_k > 0,    R_k > 0,    rank O = N,    (3.48)

where > 0 means a matrix is positive-definite (and hence invertible). Again, these are sufficient but not necessary conditions.
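The rank test (3.47) is straightforward to implement. A small NumPy sketch, using a constant-velocity model with position-only measurements as an illustrative assumption:

```python
import numpy as np

# Time-invariant constant-velocity model: state = [position, velocity].
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])   # we measure position only
N = A.shape[0]

# Observability matrix (3.46): stack C, CA, ..., CA^{N-1}.
O = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(N)])
observable = (np.linalg.matrix_rank(O) == N)   # rank test (3.47)

# Measuring velocity only instead leaves the position unobservable.
C_bad = np.array([[0.0, 1.0]])
O_bad = np.vstack([C_bad @ np.linalg.matrix_power(A, i) for i in range(N)])
```

Measuring position lets velocity be inferred from successive measurements (rank 2), whereas measuring velocity alone can never pin down the absolute position (rank 1).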

Interpretation
We can return to the mass-spring analogy to better understand the observability issue. Figure 3.2 shows a few examples. With the initial state and all the inputs (top example), the system is always observable since it is impossible to move any group of carts left or right without altering the length of at least one spring. This means there is a unique minimum-energy state. The same is true for the middle example, even though there is no knowledge of the initial state. The bottom example is unobservable since the entire chain of carts can be moved left or right without changing the amount of energy stored in the springs. This means the minimum-energy state is not unique.

(Figure 3.2: In a single dimension, the mass-spring system is observable if there is no group of carts that can be shifted left or right without altering the energy state of at least one spring. The top example uses the initial state and inputs, so it is always observable. The middle example, with no initial state knowledge, is also observable since any movement changes at least one spring. The bottom example is not observable since the whole chain of carts can be moved left-right together without changing any spring lengths; in one dimension, this only happens with no initial state and no measurements.)

3.1.5 MAP Covariance

Looking back to (3.35), x̂ represents the most likely estimate of x, the true state. One important question to ask is how confident are we in x̂?

HT W−1 H |{z}x̂ = H T
| W
−1
{z z} . (3.49)
| {z }
mean information
inverse
covariance vector

The right-hand side is referred to as the information vector. To see this,


we employ Bayes’ rule to rewrite (3.15) as
 
    p(x|z) = β exp(−½ (Hx − z)^T W^{-1} (Hx − z)),    (3.50)

where β is a new normalization constant. We then substitute (3.35) in and, after a little manipulation, find that

    p(x|x̂) = κ exp(−½ (x − x̂)^T (H^T W^{-1} H) (x − x̂)),    (3.51)

where κ is yet another normalization constant. We see from this that N(x̂, P̂) is a Gaussian estimator for x whose mean is the optimization solution and whose covariance is P̂ = (H^T W^{-1} H)^{-1}.

Another way to explain this is to directly take the expectation of the estimate. We notice that

    x − (H^T W^{-1} H)^{-1} H^T W^{-1} z = (H^T W^{-1} H)^{-1} H^T W^{-1} (Hx − z),    (3.52)

where the subtracted term on the left-hand side is E[x] and the factor (Hx − z) on the right-hand side is denoted s, with

    s = [w; n].    (3.53)

In this case we have

    P̂ = E[(x − E[x])(x − E[x])^T]
       = (H^T W^{-1} H)^{-1} H^T W^{-1} E[s s^T] W^{-1} H (H^T W^{-1} H)^{-1}
       = (H^T W^{-1} H)^{-1},    (3.54)

where we have used E[s s^T] = W; this is the same result as above.
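The claim that P̂ = (H^T W^{-1} H)^{-1} can be checked by Monte Carlo: redraw the noise many times (so that z − Hx has covariance W, consistent with the noise model), re-solve for x̂ each time, and compare the scatter of the estimates against P̂. The scalar system below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

# Tiny scalar chain; all numbers illustrative.
K = 5
A_k, C_k = 1.0, 1.0
Q_k, R_k, P0 = 0.1, 0.5, 1.0

H_prior = np.eye(K + 1) - np.diag(A_k * np.ones(K), -1)
H = np.vstack([H_prior, C_k * np.eye(K + 1)])
Winv = np.diag(1.0 / np.concatenate(([P0], Q_k * np.ones(K), R_k * np.ones(K + 1))))
P_hat = np.linalg.inv(H.T @ Winv @ H)   # claimed estimator covariance

# Monte Carlo: z - Hx has covariance W, so redraw that noise many times.
x = np.zeros(K + 1)  # fixed true trajectory (zero, for simplicity)
stds = np.sqrt(np.concatenate(([P0], Q_k * np.ones(K), R_k * np.ones(K + 1))))
trials = 20000
s = stds * rng.standard_normal((trials, 2 * (K + 1)))
z = H @ x + s  # one lifted data vector per row
x_hat = np.linalg.solve(H.T @ Winv @ H, (H.T @ Winv) @ z.T).T
P_emp = np.cov((x_hat - x).T)
```

The empirical covariance of the estimation errors matches the analytical P̂ to within sampling error.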

3.2 Recursive Discrete-Time Smoothing


The batch solution is appealing in that it is fairly easy to set up and un-
derstand from a least-squares perspective. However, brute-force solving
the resulting system of linear equations will likely not be very efficient
for most situations. Fortunately, since the inverse covariance matrix on
the left-hand side is sparse (i.e., block-tridiagonal), we can use this to
solve the system of equations very efficiently. This typically involves a
forwards recursion followed by a backwards recursion. When the equa-
tions are solved in this way, the method is typically referred to as a
fixed-interval smoother. It is useful to think of smoothers as efficiently
implementing the full batch solution, with no approximation. We use
the rest of this section to show that this can be done, first by a sparse
Cholesky approach and then by the algebraically equivalent classical
Rauch-Tung-Striebel smoother (Rauch et al., 1965). Särkkä (2013) pro-
vides an excellent reference on smoothing and filtering.
50 Linear-Gaussian Estimation

3.2.1 Exploiting Sparsity in the Batch Solution


As discussed earlier, the left-hand side of (3.34), H^T W^{-1} H, is
block-tridiagonal (under our chronological variable ordering for x):

                  [ *  *                ]
                  [ *  *  *             ]
                  [    *  *  *          ]
  H^T W^{-1} H =  [       .  .  .       ],                        (3.55)
                  [          *  *  *    ]
                  [             *  *    ]

where * indicates a non-zero block. There are solvers that can exploit
this structure and therefore solve for x̂ efficiently.

André-Louis Cholesky (1875-1918) was a French military officer and
mathematician. He is primarily remembered for a particular decomposition
of matrices, which he used in his military map-making work. The Cholesky
decomposition of a Hermitian positive-definite matrix, A, is a
decomposition of the form A = LL*, where L is a lower-triangular matrix
with real and positive diagonal entries and L* denotes the conjugate
transpose of L. Every Hermitian positive-definite matrix (and thus also
every real-valued symmetric positive-definite matrix) has a unique
Cholesky decomposition.

One way to solve the batch equations efficiently is to do a sparse
Cholesky decomposition followed by forward and backward passes. It turns
out we can efficiently factor H^T W^{-1} H into

  H^T W^{-1} H = L L^T,                                           (3.56)

where L is a block-lower-triangular matrix called the Cholesky factor13.
Owing to the block-tridiagonal structure of H^T W^{-1} H, L will have
the form

      [ *              ]
      [ *  *           ]
  L = [    *  *        ],                                         (3.57)
      [       .  .     ]
      [        *  *    ]
      [           *  * ]

and the decomposition can be computed in O(N(K + 1)) time. Next, we
solve

  L d = H^T W^{-1} z,                                             (3.58)

for d. This can again be done in O(N(K + 1)) time through forward
substitution owing to the sparse lower-triangular form of L; this is
called the forward pass. Finally, we solve

  L^T x̂ = d,                                                     (3.59)

for x̂, which again can be done in O(N(K + 1)) time through backward
substitution owing to the sparse upper-triangular form of L^T; this is
called the backward pass. Thus, the batch equations can be solved in
computation time that scales linearly with the size of the state. The
next section will make the details of this sparse Cholesky approach
specific.
13 We could just as easily factor into HT W−1 H = UUT with U upper-triangular and
then carry out backward and forward passes.
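The three-step recipe above can be sketched as follows (hypothetical numbers; np.linalg.solve stands in for true triangular forward/backward substitution to keep the sketch dependency-free, and a production implementation would exploit the block-tridiagonal sparsity, e.g., via scipy.linalg.solve_triangular on the blocks):

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((8, 4))
M = B.T @ B + np.eye(4)        # stand-in for H^T W^{-1} H (symmetric positive-definite)
b = rng.standard_normal(4)     # stand-in for H^T W^{-1} z

L = np.linalg.cholesky(M)      # factorization, M = L L^T, as in (3.56)
d = np.linalg.solve(L, b)      # forward pass, L d = b, as in (3.58)
x = np.linalg.solve(L.T, d)    # backward pass, L^T x = d, as in (3.59)

assert np.allclose(M @ x, b)   # x solves the original system
```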

3.2.2 Cholesky Smoother


In this section, we work out the details of the sparse Cholesky solution
to the batch estimation problem. The result will be a set of forwards-
backwards recursions that we will refer to as the Cholesky smoother.
There are several similar square-root information smoothers described
in the literature and Bierman (1974) is a classic reference on the topic.
Let us begin by defining the non-zero sub-blocks of L as

      [ L0                               ]
      [ L10  L1                          ]
  L = [      L21  L2                     ].                       (3.60)
      [           .    .                 ]
      [          LK−1,K−2  LK−1          ]
      [                LK,K−1  LK        ]

Using the definitions of H and W from (3.13), when we multiply out
H^T W^{-1} H = L L^T and compare at the block level we have

  L0 L0^T = P̌0^{-1} + C0^T R0^{-1} C0 + A0^T Q1^{-1} A0,         (3.61a)
  L10 L0^T = −Q1^{-1} A0,                                         (3.61b)
  L1 L1^T = −L10 L10^T + Q1^{-1} + C1^T R1^{-1} C1
            + A1^T Q2^{-1} A1,                                    (3.61c)
  L21 L1^T = −Q2^{-1} A1,                                         (3.61d)
    ...
  LK−1 LK−1^T = −LK−1,K−2 LK−1,K−2^T + QK−1^{-1}
                + CK−1^T RK−1^{-1} CK−1 + AK−1^T QK^{-1} AK−1,    (3.61e)
  LK,K−1 LK−1^T = −QK^{-1} AK−1,                                  (3.61f)
  LK LK^T = −LK,K−1 LK,K−1^T + QK^{-1} + CK^T RK^{-1} CK,         (3.61g)

where we define the quantities14

  I0 = P̌0^{-1} + C0^T R0^{-1} C0,
  Ik = −Lk,k−1 Lk,k−1^T + Qk^{-1} + Ck^T Rk^{-1} Ck,   (k = 1 . . . K)

whose purpose will be revealed shortly. From these equations, we can
first solve for L0 by doing a small (dense) Cholesky decomposition in
the first equation, then substitute this into the second to solve for
L10, then substitute this into the third to solve for L1, and so on all
the way down to LK. This confirms that we can work out all the blocks of
L in a single forwards pass in O(N(K + 1)) time.
14 In this book, 1 is the identity matrix, which should not be confused with the use of I,
which in this instance stands for information matrix (i.e., inverse covariance matrix).

Next we solve L d = H^T W^{-1} z for d, where

  d = [d0; d1; . . . ; dK].                                       (3.62)

Multiplying out and comparing at the block level we have

  L0 d0 = P̌0^{-1} x̌0 + C0^T R0^{-1} y0 − A0^T Q1^{-1} v1,        (3.63a)
  L1 d1 = −L10 d0 + Q1^{-1} v1 + C1^T R1^{-1} y1
          − A1^T Q2^{-1} v2,                                      (3.63b)
    ...
  LK−1 dK−1 = −LK−1,K−2 dK−2 + QK−1^{-1} vK−1
              + CK−1^T RK−1^{-1} yK−1 − AK−1^T QK^{-1} vK,        (3.63c)
  LK dK = −LK,K−1 dK−1 + QK^{-1} vK + CK^T RK^{-1} yK,            (3.63d)

where we define q0 = P̌0^{-1} x̌0 + C0^T R0^{-1} y0 and, for
k = 1 . . . K, qk = −Lk,k−1 dk−1 + Qk^{-1} vk + Ck^T Rk^{-1} yk; these
quantities will be used shortly. From these equations, we can solve for
d0 in the first equation, then substitute this into the second to solve
for d1, and so on all the way down to dK. This confirms that we can work
out all of the blocks of d in a single forwards pass in O(N(K + 1))
time.
The last step in the Cholesky approach is to solve L^T x̂ = d for x̂,
where

  x̂ = [x̂0; x̂1; . . . ; x̂K].                                    (3.64)
Multiplying out and comparing at the block level we have

  LK^T x̂K = dK,                                                  (3.65a)
  LK−1^T x̂K−1 = −LK,K−1^T x̂K + dK−1,                            (3.65b)
    ...
  L1^T x̂1 = −L21^T x̂2 + d1,                                     (3.65c)
  L0^T x̂0 = −L10^T x̂1 + d0.                                     (3.65d)
From these equations, we can solve for x̂K in the first equation, then
substitute this into the second to solve for x̂K−1 , and so on all the way
down to x̂0 . This confirms that we can work out all of the blocks of x̂
in a single backwards pass in O(N (K + 1)) time.

In terms of the Ik and qk quantities, we can combine the two forwards
passes (to solve for L and d) and also write the backwards pass as

forwards: (k = 1 . . . K)

  Lk−1 Lk−1^T = Ik−1 + Ak−1^T Qk^{-1} Ak−1,                       (3.66a)
  Lk−1 dk−1 = qk−1 − Ak−1^T Qk^{-1} vk,                           (3.66b)
  Lk,k−1 Lk−1^T = −Qk^{-1} Ak−1,                                  (3.66c)
  Ik = −Lk,k−1 Lk,k−1^T + Qk^{-1} + Ck^T Rk^{-1} Ck,              (3.66d)
  qk = −Lk,k−1 dk−1 + Qk^{-1} vk + Ck^T Rk^{-1} yk,               (3.66e)

backwards: (k = K . . . 1)

  Lk−1^T x̂k−1 = −Lk,k−1^T x̂k + dk−1,                            (3.66f)

which are initialized with

  I0 = P̌0^{-1} + C0^T R0^{-1} C0,                                 (3.67a)
  q0 = P̌0^{-1} x̌0 + C0^T R0^{-1} y0,                             (3.67b)
  x̂K = LK^{-T} dK.                                               (3.67c)
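To see the recursions in action, here is a sketch for a scalar (one-dimensional) state, where each block Cholesky decomposition reduces to a square root; all model numbers are hypothetical, and the result is checked against a brute-force solve of the batch system:

```python
import numpy as np

# Scalar LG model: x_k = A x_{k-1} + v_k + w_k, y_k = C x_k + n_k.
A, C, Q, R, P0, x0_prior = 0.9, 1.0, 0.1, 0.5, 1.0, 0.5
K = 5
rng = np.random.default_rng(1)
v = rng.normal(0.0, 1.0, K + 1)   # inputs (v[0] unused)
y = rng.normal(0.0, 1.0, K + 1)   # measurements

Lk = np.empty(K + 1)              # diagonal Cholesky blocks L_k
Lo = np.empty(K + 1)              # off-diagonal blocks L_{k,k-1} (Lo[0] unused)
d = np.empty(K + 1)
I = 1.0/P0 + C*C/R                # I_0, (3.67a)
q = x0_prior/P0 + C*y[0]/R        # q_0, (3.67b)
for k in range(1, K + 1):         # forwards pass
    Lk[k-1] = np.sqrt(I + A*A/Q)              # (3.66a)
    d[k-1] = (q - A*v[k]/Q)/Lk[k-1]           # (3.66b)
    Lo[k] = -(A/Q)/Lk[k-1]                    # (3.66c)
    I = -Lo[k]**2 + 1.0/Q + C*C/R             # (3.66d)
    q = -Lo[k]*d[k-1] + v[k]/Q + C*y[k]/R     # (3.66e)
Lk[K] = np.sqrt(I)
d[K] = q/Lk[K]

xhat = np.empty(K + 1)            # backwards pass
xhat[K] = d[K]/Lk[K]              # (3.67c)
for k in range(K, 0, -1):
    xhat[k-1] = (-Lo[k]*xhat[k] + d[k-1])/Lk[k-1]   # (3.66f)

# Brute-force check: build the tridiagonal batch system and solve directly.
M = np.zeros((K + 1, K + 1)); b = np.zeros(K + 1)
for k in range(K + 1):
    M[k, k] = (1.0/P0 if k == 0 else 1.0/Q) + C*C/R + (A*A/Q if k < K else 0.0)
    b[k] = (x0_prior/P0 if k == 0 else v[k]/Q) + C*y[k]/R - (A*v[k+1]/Q if k < K else 0.0)
    if k < K:
        M[k, k+1] = M[k+1, k] = -A/Q
assert np.allclose(xhat, np.linalg.solve(M, b))
```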

The forwards pass maps {qk−1, Ik−1} to the same pair at the next time,
{qk, Ik}. The backwards pass maps x̂k to the same quantity at the
previous timestep, x̂k−1. In the process, we solve for all the blocks of
L and d. The only linear algebra operations required to implement this
smoother are Cholesky decomposition, multiplication, addition, and
solving a linear system via forward/backward substitution.

As we will see in the next section, these six recursive equations are
algebraically equivalent to the canonical Rauch-Tung-Striebel smoother;
the five equations forming the forwards pass are algebraically
equivalent to the famous Kalman filter.

3.2.3 Rauch-Tung-Striebel Smoother

Herbert E. Rauch (1935-2011) was a pioneer in the area of control and
estimation. Frank F. Tung (1933-2006) was a research scientist working
in the area of computing and control. Charlotte T. Striebel (1929-2014)
was a statistician and professor of mathematics. All three co-developed
the Rauch-Tung-Striebel smoother while working at Lockheed Missiles and
Space Company in order to estimate spacecraft trajectories.

While the Cholesky smoother is a convenient implementation and is easy
to understand when starting from the batch solution, it does not
represent the canonical form of the smoothing equations. It is, however,
algebraically equivalent to the canonical Rauch-Tung-Striebel (RTS)
smoother, which we now show. This requires several uses of the different
forms of the SMW identity in (2.68).
trajectories.
We begin by working on the forwards pass. Solving for Lk,k−1 in (3.66c)

and substituting this and (3.66a) into (3.66d) we have


  Ik = Qk^{-1} − Qk^{-1} Ak−1 (Ik−1 + Ak−1^T Qk^{-1} Ak−1)^{-1} Ak−1^T Qk^{-1}
       + Ck^T Rk^{-1} Ck,                                         (3.68)

where, by a version of the SMW identity (2.68), the first two terms
together equal (Ak−1 Ik−1^{-1} Ak−1^T + Qk)^{-1}. By letting
P̂k,f = Ik^{-1}, this can be written in two steps as

  P̌k,f = Ak−1 P̂k−1,f Ak−1^T + Qk,                                (3.69a)
  P̂k,f^{-1} = P̌k,f^{-1} + Ck^T Rk^{-1} Ck,                       (3.69b)

where P̌k,f represents a 'predicted' covariance and P̂k,f a 'corrected'
one. We have added the subscript, (·)f, to indicate these quantities
come from the forwards pass (i.e., a filter). The second of these
equations is written in information (i.e., inverse covariance) form. To
reach the canonical version, we define the Kalman gain matrix, Kk, as

  Kk = P̂k,f Ck^T Rk^{-1}.                                        (3.70)

Substituting in (3.69b) this can also be written as

  Kk = (P̌k,f^{-1} + Ck^T Rk^{-1} Ck)^{-1} Ck^T Rk^{-1}
     = P̌k,f Ck^T (Ck P̌k,f Ck^T + Rk)^{-1},                       (3.71)

where the last expression requires a use of the SMW identity
from (2.68). Then (3.69b) can be rewritten as

  P̌k,f^{-1} = P̂k,f^{-1} − Ck^T Rk^{-1} Ck
            = P̂k,f^{-1} (1 − P̂k,f Ck^T Rk^{-1} Ck)
            = P̂k,f^{-1} (1 − Kk Ck),                              (3.72)

and finally rearranging for P̂k,f we have

  P̂k,f = (1 − Kk Ck) P̌k,f,                                       (3.73)

which is the canonical form for the covariance correction step.

Next, solving for Lk,k−1 in (3.66c) and dk−1 in (3.66b) we have

  Lk,k−1 dk−1 = −Qk^{-1} Ak−1 (Lk−1 Lk−1^T)^{-1}
                   × (qk−1 − Ak−1^T Qk^{-1} vk).                   (3.74)

Substituting (3.66a) into Lk,k−1 dk−1 and then this into (3.66e) we have

  qk = Qk^{-1} Ak−1 (Ik−1 + Ak−1^T Qk^{-1} Ak−1)^{-1} qk−1
       + (Qk^{-1} − Qk^{-1} Ak−1 (Ik−1 + Ak−1^T Qk^{-1} Ak−1)^{-1} Ak−1^T Qk^{-1}) vk
       + Ck^T Rk^{-1} yk,                                          (3.75)

where we have used two versions of the SMW identity (2.68): the
coefficient of qk−1 equals (Ak−1 Ik−1^{-1} Ak−1^T + Qk)^{-1} Ak−1 Ik−1^{-1},
and the coefficient of vk equals (Ak−1 Ik−1^{-1} Ak−1^T + Qk)^{-1}. By
letting P̂k,f^{-1} x̂k,f = qk, this can be written in two steps as

  x̌k,f = Ak−1 x̂k−1,f + vk,                                       (3.76a)
  P̂k,f^{-1} x̂k,f = P̌k,f^{-1} x̌k,f + Ck^T Rk^{-1} yk,            (3.76b)

where x̌k,f represents a 'predicted' mean and x̂k,f a 'corrected' one.
Again, the second of these is in information (i.e., inverse covariance)
form. To get to the canonical form we rewrite it as

  x̂k,f = P̂k,f P̌k,f^{-1} x̌k,f + P̂k,f Ck^T Rk^{-1} yk,           (3.77)

where P̂k,f P̌k,f^{-1} = 1 − Kk Ck and P̂k,f Ck^T Rk^{-1} = Kk, or

  x̂k,f = x̌k,f + Kk (yk − Ck x̌k,f),                              (3.78)

which is the canonical form for the mean correction step.


The last step is to resolve the backwards pass into its canonical form.
We begin by premultiplying (3.66f) by Lk−1 and solve for x̂k−1 :
−1 
x̂k−1 = Lk−1 LTk−1 Lk−1 −LTk,k−1 x̂k + dk−1 . (3.79)

Substituting in (3.66a), (3.66b), and (3.66c), we have


−1 T
x̂k−1 = Ik−1 + ATk−1 Q−1 k Ak−1 Ak−1 Q−1 k (x̂k − vk )
| {z }
−1
k−1 Ak−1 (Ak−1 Ik−1 Ak−1 +Qk )
I−1 T −1 T , by (2.68)
−1
+ Ik−1 + ATk−1 Q−1 k Ak−1 qk−1 . (3.80)
| {z }
−1
k−1 −Ik−1 Ak−1 (Ak−1 Ik−1 Ak−1 +Qk )
I−1 k−1 , by (2.68)
−1 −1
T T Ak−1 I−1

Using our symbols from above, this can be written as

x̂k−1 = x̂k−1,f + P̂k−1,f ATk−1 P̌−1


k−1,f (x̂k − x̌k,f ) , (3.81)

which is the canonical form for the backwards smoothing equation.



Together, equations (3.69a), (3.71), (3.73), (3.76a), (3.78), and (3.81)
constitute the Rauch-Tung-Striebel smoother:

forwards: (k = 1 . . . K)

  P̌k,f = Ak−1 P̂k−1,f Ak−1^T + Qk,                                (3.82a)
  x̌k,f = Ak−1 x̂k−1,f + vk,                                       (3.82b)
  Kk = P̌k,f Ck^T (Ck P̌k,f Ck^T + Rk)^{-1},                       (3.82c)
  P̂k,f = (1 − Kk Ck) P̌k,f,                                       (3.82d)
  x̂k,f = x̌k,f + Kk (yk − Ck x̌k,f),                              (3.82e)

backwards: (k = K . . . 1)

  x̂k−1 = x̂k−1,f + P̂k−1,f Ak−1^T P̌k,f^{-1} (x̂k − x̌k,f),        (3.82f)

which are initialized with

  P̌0,f = P̌0,                                                     (3.83a)
  x̌0,f = x̌0,                                                     (3.83b)
  x̂K = x̂K,f.                                                     (3.83c)

As will be discussed in more detail in the next section, the five
equations in the forwards pass are known as the Kalman filter. However,
the important message to take away from this section on smoothing is
that these six equations representing the RTS smoother can be used to
solve the original batch problem that we set up in a very efficient
manner, with no approximation. This is possible precisely because of the
block-tridiagonal sparsity pattern in the left-hand side of the batch
problem.
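As a sanity check of this claim, the following sketch runs the six RTS equations on a scalar model (all numbers hypothetical) and compares the smoothed estimate against a brute-force solve of the batch system; note the filter also corrects at k = 0 using y0, matching the batch setup, which includes a measurement at time 0:

```python
import numpy as np

# Scalar LG model: x_k = A x_{k-1} + v_k + w_k, y_k = C x_k + n_k.
A, C, Q, R, P0, x0_prior = 0.9, 1.0, 0.1, 0.5, 1.0, 0.5
K = 5
rng = np.random.default_rng(2)
v = rng.normal(0.0, 1.0, K + 1)   # inputs (v[0] unused)
y = rng.normal(0.0, 1.0, K + 1)   # measurements

xp = np.empty(K + 1); Pp = np.empty(K + 1)   # predicted quantities
xf = np.empty(K + 1); Pf = np.empty(K + 1)   # corrected (filtered) quantities
xp[0], Pp[0] = x0_prior, P0                  # initialization, (3.83a,b)
for k in range(K + 1):                       # forwards pass: the Kalman filter
    if k > 0:
        Pp[k] = A*Pf[k-1]*A + Q              # (3.82a)
        xp[k] = A*xf[k-1] + v[k]             # (3.82b)
    Kk = Pp[k]*C/(C*Pp[k]*C + R)             # (3.82c)
    Pf[k] = (1.0 - Kk*C)*Pp[k]               # (3.82d)
    xf[k] = xp[k] + Kk*(y[k] - C*xp[k])      # (3.82e)

xs = xf.copy()                               # backwards pass, init (3.83c)
for k in range(K, 0, -1):
    xs[k-1] = xf[k-1] + Pf[k-1]*A/Pp[k]*(xs[k] - xp[k])   # (3.82f)

# Brute-force batch solution for comparison.
M = np.zeros((K + 1, K + 1)); b = np.zeros(K + 1)
for k in range(K + 1):
    M[k, k] = (1.0/P0 if k == 0 else 1.0/Q) + C*C/R + (A*A/Q if k < K else 0.0)
    b[k] = (x0_prior/P0 if k == 0 else v[k]/Q) + C*y[k]/R - (A*v[k+1]/Q if k < K else 0.0)
    if k < K:
        M[k, k+1] = M[k+1, k] = -A/Q
assert np.allclose(xs, np.linalg.solve(M, b))
```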

3.3 Recursive Discrete-Time Filtering


The batch solution (and the corresponding smoother implementations)
outlined above is really the best we can do. It makes use of all the data
in the estimate of every state. However, it has one major drawback:
it cannot be used online15 because it employs future data to estimate
past states (i.e., it is not causal). To be used online, the estimate of
the current state can only employ data up to the current timestep. The
Kalman filter is the classical solution to this problem. We have already
seen a preview of the KF; it is the forwards pass of the Rauch-Tung-
Striebel smoother. However, there are several other ways of deriving it,
some of which we provide in this section.
15 It is preferable to say ‘online’ rather than ‘real-time’ in this context.
[Figure 3.3: The batch LG solution is a smoother: smoothers use all
available information, x̌0, y0, v1, y1, . . . , vK, yK, to estimate the
state x̂k, whereas filters only use past/current information, up to vk
and yk, to produce x̂k,f. To develop an estimator appropriate to online
estimation, we require a filter.]

3.3.1 Factoring the Batch Solution


We do not need to start from scratch in our search for a recursive
LG estimator. It turns out we can re-use the batch solution and ex-
actly factor it into two recursive estimators, one that runs forwards in
time and the other backwards. The backwards pass is a little different
than the one presented in the smoother section, as it is not correcting
the forwards pass, but rather producing an estimate using only future
measurements.
To set things up for our development of the recursive solutions, we
will reorder some of our variables from the batch solution. We redefine
z, H, and W as
   
  z = [x̌0; y0; v1; y1; v2; y2; . . . ; vK; yK],

      [  1                    ]
      [ C0                    ]
      [ −A0    1              ]
      [       C1              ]
  H = [       −A1    1        ],
      [             C2        ]
      [              .   .    ]
      [           −AK−1    1  ]
      [                   CK  ]

  W = diag(P̌0, R0, Q1, R1, Q2, R2, . . . , QK, RK),               (3.84)
where the partition lines now show divisions between timesteps. This
re-ordering does not change the ordering of x, so HT W−1 H is still
block-tridiagonal.
We now consider the factorization at the probability density level.

As discussed in Section 3.1.5, we have an expression for p(x|v, y). If we


want to consider only the state at time k, we can marginalize out the
other states by integrating over all possible values:
  p(xk | v, y) = ∫ p(x0, . . . , xK | v, y) dx_{i, ∀i≠k}.          (3.85)

We can now factor this probability density into two parts:

p(xk |v, y) = p(xk |x̌0 , v1:k , y0:k ) p(xk |vk+1:K , yk+1:K ). (3.86)

Thus, it should be possible to take our batch solution and factor it into
the product of two Gaussian PDFs. To do this, we need to exploit the
sparse structure of H in (3.84).
We begin by partitioning H into 12 blocks (only 6 of which are
non-zero):

      [ H11            ]   information from 0 . . . k−1
      [ H21  H22       ]   information from k
  H = [      H32  H33  ]   information from k+1                   (3.87)
      [           H43  ]   information from k+2 . . . K

where the three block-columns correspond, in order, to the states from
0 . . . k−1, the states from k, and the states from k+1 . . . K.

The sizes of each block-row and block-column are indicated above. For
example, with k = 2 and K = 4, the partitions are
 
      [  1                     ]
      [ C0                     ]
      [ −A0    1               ]
      [       C1               ]
      [       −A1    1         ]
  H = [             C2         ],                                  (3.88)
      [             −A2    1   ]
      [                  C3    ]
      [                  −A3  1]
      [                     C4 ]

where the first two columns (x0, x1) form the first partition, the third
column (x2) the second, and the last two columns (x3, x4) the third.

We use compatible partitions for z and W:


   
  z = [z1; z2; z3; z4],   W = diag(W1, W2, W3, W4).                (3.89)
z4 W4

For H^T W^{-1} H we then have

  H^T W^{-1} H =
    [ H11^T W1^{-1} H11 + H21^T W2^{-1} H21   H21^T W2^{-1} H22                       ]
    [ H22^T W2^{-1} H21   H22^T W2^{-1} H22 + H32^T W3^{-1} H32   H32^T W3^{-1} H33   ]
    [                     H33^T W3^{-1} H32   H33^T W3^{-1} H33 + H43^T W4^{-1} H43   ]

      [ L11    L12         ]
    = [ L12^T  L22   L32^T ],                                      (3.90)
      [        L32   L33   ]

where we have assigned the blocks to some useful intermediate variables,
Lij. For H^T W^{-1} z we have

                 [ H11^T W1^{-1} z1 + H21^T W2^{-1} z2 ]   [ r1 ]
  H^T W^{-1} z = [ H22^T W2^{-1} z2 + H32^T W3^{-1} z3 ] = [ r2 ],  (3.91)
                 [ H33^T W3^{-1} z3 + H43^T W4^{-1} z4 ]   [ r3 ]

where we have assigned the blocks to some useful intermediate variables,
ri. Next, we partition the states, x, in the following way:

      [ x0:k−1  ]   states from 0 . . . k−1
  x = [ xk      ]   states from k                                  (3.92)
      [ xk+1:K  ]   states from k+1 . . . K

Our overall batch system of equations now looks like the following:
    
  [ L11    L12         ] [ x̂0:k−1  ]   [ r1 ]
  [ L12^T  L22   L32^T ] [ x̂k      ] = [ r2 ],                     (3.93)
  [        L32   L33   ] [ x̂k+1:K  ]   [ r3 ]

where we have added the (·) ˆ to indicate this is the solution to the op-
timization estimation problem considered earlier. Our short-term goal,
in making progress towards a recursive LG estimator, is to solve for x̂k .
To isolate x̂k , we left-multiply both sides of (3.93) by
 
  [ 1                                                ]
  [ −L12^T L11^{-1}      1      −L32^T L33^{-1}      ],            (3.94)
  [                                       1          ]

which can be viewed as performing an elementary row operation (and


therefore will not change the solution to (3.93)). The resulting system

of equations is

  [ L11   L12                                             ] [ x̂0:k−1  ]
  [       L22 − L12^T L11^{-1} L12 − L32^T L33^{-1} L32   ] [ x̂k      ]
  [       L32   L33                                       ] [ x̂k+1:K  ]

      [ r1                                          ]
    = [ r2 − L12^T L11^{-1} r1 − L32^T L33^{-1} r3  ],             (3.95)
      [ r3                                          ]

and the solution for x̂k is therefore given by

  (L22 − L12^T L11^{-1} L12 − L32^T L33^{-1} L32) x̂k
      = r2 − L12^T L11^{-1} r1 − L32^T L33^{-1} r3,                (3.96)

where we define P̂k (by its inverse) as the matrix on the left-hand side
and qk as the right-hand side. We have essentially marginalized out
x̂0:k−1 and x̂k+1:K just as in (3.85). We can now substitute the values
of the Lij blocks back into P̂k^{-1} to see that

  P̂k^{-1} = L22 − L12^T L11^{-1} L12 − L32^T L33^{-1} L32
    = H22^T (W2^{-1} − W2^{-1} H21 (H11^T W1^{-1} H11
              + H21^T W2^{-1} H21)^{-1} H21^T W2^{-1}) H22
      + H32^T (W3^{-1} − W3^{-1} H33 (H33^T W3^{-1} H33
              + H43^T W4^{-1} H43)^{-1} H33^T W3^{-1}) H32
    = H22^T P̂k,f^{-1} H22 + H32^T P̂k,b^{-1} H32,                  (3.97)

where, by (2.68),

  P̂k,f = W2 + H21 (H11^T W1^{-1} H11)^{-1} H21^T,
  P̂k,b = W3 + H33 (H43^T W4^{-1} H43)^{-1} H33^T,

and where the term involving P̂k,f ('forwards') depends only on the
blocks of H and W up to time k and the term involving P̂k,b
('backwards') depends only on the blocks of H and W from k+1 to K.
Turning now to qk, we substitute in the values of the Lij and ri blocks:

  qk = r2 − L12^T L11^{-1} r1 − L32^T L33^{-1} r3
     = H22^T qk,f + H32^T qk,b,                                    (3.98)

where again the term labelled 'forwards' depends only on quantities up
to time k and the term labelled 'backwards' depends only on quantities

from time k + 1 to K. We made use of the following definitions:


  qk,f = −W2^{-1} H21 (H11^T W1^{-1} H11 + H21^T W2^{-1} H21)^{-1} H11^T W1^{-1} z1
         + (W2^{-1} − W2^{-1} H21 (H11^T W1^{-1} H11
             + H21^T W2^{-1} H21)^{-1} H21^T W2^{-1}) z2,          (3.99a)

  qk,b = (W3^{-1} − W3^{-1} H33 (H33^T W3^{-1} H33
             + H43^T W4^{-1} H43)^{-1} H33^T W3^{-1}) z3
         − W3^{-1} H33 (H33^T W3^{-1} H33
             + H43^T W4^{-1} H43)^{-1} H43^T W4^{-1} z4.           (3.99b)
Now let us define the following two 'forwards' and 'backwards'
estimators, x̂k,f and x̂k,b, respectively:

  P̂k,f^{-1} x̂k,f = qk,f,                                         (3.100a)
  P̂k,b^{-1} x̂k,b = qk,b,                                         (3.100b)

where x̂k,f depends only on quantities up to time k and x̂k,b depends
only on quantities from time k+1 to K. Under these definitions we have
that

  P̂k^{-1} = H22^T P̂k,f^{-1} H22 + H32^T P̂k,b^{-1} H32,          (3.101)
  x̂k = P̂k (H22^T P̂k,f^{-1} x̂k,f + H32^T P̂k,b^{-1} x̂k,b),     (3.102)

which is precisely the product of two Gaussian PDFs as in (2.63).
Referring back to (3.86), we have that

  p(xk | v, y)            → N(x̂k, P̂k),                           (3.103a)
  p(xk | x̌0, v1:k, y0:k)  → N(x̂k,f, P̂k,f),                      (3.103b)
  p(xk | vk+1:K, yk+1:K)  → N(x̂k,b, P̂k,b),                       (3.103c)

where P̂k, P̂k,f, and P̂k,b are the covariances associated with x̂k,
x̂k,f, and x̂k,b. In other words, we have Gaussian estimators with the
MAP estimators as the means.
In the next section, we will examine how we can turn the forwards
Gaussian estimator, x̂k,f , into a recursive filter16 .
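The fusion in (3.101)-(3.102) can be sketched numerically as follows; the specific matrices below, including the stand-ins for the blocks H22 and H32, are hypothetical:

```python
import numpy as np

# Forwards estimate (from past data) and backwards estimate (from future data).
Pf = np.diag([0.4, 0.9]); xf = np.array([1.0, 2.0])
Pb = np.diag([0.8, 0.3]); xb = np.array([1.4, 1.8])
H22 = np.eye(2)                               # hypothetical stand-ins for the
H32 = np.array([[0.5, 0.0], [0.1, 0.5]])      # blocks of H in (3.87)

# Fuse in information (inverse covariance) form, (3.101)-(3.102).
Pk_inv = H22.T @ np.linalg.inv(Pf) @ H22 + H32.T @ np.linalg.inv(Pb) @ H32
xk = np.linalg.solve(Pk_inv, H22.T @ np.linalg.inv(Pf) @ xf
                             + H32.T @ np.linalg.inv(Pb) @ xb)

# The fused information matrix is symmetric positive-definite,
# i.e., a valid inverse covariance (Cholesky succeeds).
assert np.allclose(Pk_inv, Pk_inv.T)
np.linalg.cholesky(Pk_inv)
```

The key design point is that the two information matrices simply add, which is what makes splitting the batch solution into forwards and backwards estimators possible.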

3.3.2 Kalman Filter via MAP


In this section, we will show how to turn the forwards estimator from
the last section into a recursive filter called the Kalman filter (Kalman,
1960b), using our MAP approach. To simplify the notation slightly, we
will use x̂k instead of x̂k,f and P̂k instead of P̂k,f , but these new symbols
should not be confused with the batch/smoothed estimates discussed
16 A similar thing can be done for the backwards estimator, but the recursion is
backwards in time rather than forwards.

previously. Let us assume we already have a forwards estimate and the


associated covariance at some time k − 1:
  {x̂k−1, P̂k−1}.                                                  (3.104)

Recall that these estimates are based on all the data up to and including
those at time k − 1. Our goal will be to compute
  {x̂k, P̂k},                                                      (3.105)

using all the data up to and including those at time k. It turns out we
do not need to start all over again, but rather can simply incorporate
the new data at time k, vk and yk , into the estimate at time k − 1:
  {x̂k−1, P̂k−1}, vk, yk  ↦  {x̂k, P̂k}.                            (3.106)

To see this, we define


     
x̂k−1 1 P̂k−1
z =  vk  , H = −Ak−1 1  , W= Qk ,
yk Ck Rk
n o (3.107)
where x̂k−1 , P̂k−1 serve as substitutes for all the data up to time
k − 117 . Figure 3.4 depicts this graphically.
[Figure 3.4: A recursive filter replaces past data with an estimate: the
estimate x̂k−1 stands in for x̌0, y0, v1, y1, . . . , vk−1, yk−1, and is
combined with the new data, vk and yk, to produce x̂k.]
Our usual MAP solution to the problem is x̂ given by



  (H^T W^{-1} H) x̂ = H^T W^{-1} z.                                (3.108)
We then define
  x̂ = [x̂′k−1; x̂k],                                              (3.109)

where we carefully distinguish x̂′k−1 from x̂k−1. The addition of the
prime indicates that x̂′k−1 is the estimate at time k − 1 incorporating
data up to and including time k, whereas x̂k−1 is the estimate at time
k − 1
17 To do this we have actually employed something called the Markov property. Further
discussion of this will be left to the chapter on nonlinear-non-Gaussian estimation
techniques. For now it suffices to say that for LG estimation, this assumption is valid.

using data up to and including time k−1. Substituting in our quantities


from (3.107) to the least-squares solution we have
  [ P̂k−1^{-1} + Ak−1^T Qk^{-1} Ak−1    −Ak−1^T Qk^{-1}            ] [ x̂′k−1 ]
  [ −Qk^{-1} Ak−1                      Qk^{-1} + Ck^T Rk^{-1} Ck  ] [ x̂k    ]

      [ P̂k−1^{-1} x̂k−1 − Ak−1^T Qk^{-1} vk ]
    = [ Qk^{-1} vk + Ck^T Rk^{-1} yk        ].                     (3.110)

We do not really care what x̂′k−1 is in this context, because we seek a
recursive estimator appropriate to online estimation, and this quantity
incorporates future data; we can marginalize it out by left-multiplying
both sides by
" #
1 0
 −1 , (3.111)
Q−1 −1 T −1
k Ak−1 P̂k−1 + Ak−1 Qk Ak−1 1

which is just an elementary row operation and will not alter the solution
to the linear system of equations18 . Equation (3.110) then becomes
 −1 
P̂k−1 + ATk−1 Q−1
k Ak−1 −ATk−1 Q−1
  k
−1  x̂0 
 Q−1 −1 −1 T
k − Qk Ak−1 P̂k−1 + Ak−1 Qk Ak−1
−1
 k−1
0 x̂k
× ATk−1 Q−1 T −1
k + Ck Rk Ck
 
P̂−1 T −1
k−1 x̂k−1 − Ak−1 Qk vk
   
= 
−1
 Q−1 −1 T −1
k Ak−1 P̂k−1 + Ak−1 Qk Ak−1 P̂−1 T −1
k−1 x̂k−1 − Ak−1 Qk vk
.
+ Q−1 T −1
k vk + Ck Rk yk
(3.112)

The solution for x̂k is given by


  (Qk^{-1} − Qk^{-1} Ak−1 (P̂k−1^{-1} + Ak−1^T Qk^{-1} Ak−1)^{-1} Ak−1^T Qk^{-1}
     + Ck^T Rk^{-1} Ck) x̂k
   = Qk^{-1} Ak−1 (P̂k−1^{-1} + Ak−1^T Qk^{-1} Ak−1)^{-1}
       × (P̂k−1^{-1} x̂k−1 − Ak−1^T Qk^{-1} vk)
     + Qk^{-1} vk + Ck^T Rk^{-1} yk,                               (3.113)

where, by (2.68), the first bracketed group on the left-hand side equals
(Qk + Ak−1 P̂k−1 Ak−1^T)^{-1}.

18 This is also sometimes called the Schur complement.



We then define the following helpful quantities:


  P̌k = Qk + Ak−1 P̂k−1 Ak−1^T,                                    (3.114a)
  P̂k = (P̌k^{-1} + Ck^T Rk^{-1} Ck)^{-1}.                         (3.114b)
Equation (3.113) then becomes
  P̂k^{-1} x̂k = Qk^{-1} Ak−1 (P̂k−1^{-1} + Ak−1^T Qk^{-1} Ak−1)^{-1}
                   × (P̂k−1^{-1} x̂k−1 − Ak−1^T Qk^{-1} vk)
                + Qk^{-1} vk + Ck^T Rk^{-1} yk
     = P̌k^{-1} Ak−1 x̂k−1
       + (Qk^{-1} − Qk^{-1} Ak−1 (P̂k−1^{-1} + Ak−1^T Qk^{-1} Ak−1)^{-1} Ak−1^T Qk^{-1}) vk
       + Ck^T Rk^{-1} yk
     = P̌k^{-1} (Ak−1 x̂k−1 + vk) + Ck^T Rk^{-1} yk
     = P̌k^{-1} x̌k + Ck^T Rk^{-1} yk,                              (3.115)

where we have defined x̌k = Ak−1 x̂k−1 + vk as the 'predicted' value of
the state; the coefficient of vk reduces to P̌k^{-1} by (2.68). We also
made use of the following logic in simplifying the above:
  Qk^{-1} Ak−1 (P̂k−1^{-1} + Ak−1^T Qk^{-1} Ak−1)^{-1} P̂k−1^{-1}
    = Qk^{-1} Ak−1 (P̂k−1 − P̂k−1 Ak−1^T (Qk + Ak−1 P̂k−1 Ak−1^T)^{-1}
                        × Ak−1 P̂k−1) P̂k−1^{-1}         [by (2.68)]
    = Qk^{-1} Ak−1 − Qk^{-1} Ak−1 P̂k−1 Ak−1^T P̌k^{-1} Ak−1
    = Qk^{-1} Ak−1 − Qk^{-1} (P̌k − Qk) P̌k^{-1} Ak−1
    = (Qk^{-1} − Qk^{-1} + P̌k^{-1}) Ak−1
    = P̌k^{-1} Ak−1.                                               (3.116)
Bringing together all of the above, we have for the recursive filter
update the following:

predictor:
  P̌k = Ak−1 P̂k−1 Ak−1^T + Qk,                                    (3.117a)
  x̌k = Ak−1 x̂k−1 + vk,                                           (3.117b)

corrector:
  P̂k^{-1} = P̌k^{-1} + Ck^T Rk^{-1} Ck,                           (3.117c)
  P̂k^{-1} x̂k = P̌k^{-1} x̌k + Ck^T Rk^{-1} yk,                    (3.117d)
which we will refer to as inverse covariance or information form for the
Kalman filter. Figure 3.5 depicts the predictor-corrector form of the
Kalman filter graphically.

[Figure 3.5: The Kalman filter works in two steps: prediction, then
correction. The prediction step propagates the old estimate,
N(x̂k−1, P̂k−1), forward in time using the motion model and the latest
input, vk, to arrive at the prediction, N(x̌k, P̌k); the estimate
becomes more uncertain. The correction step fuses the prediction with
the latest measurement, N(yk, Rk), to arrive at the new estimate,
N(x̂k, P̂k); the estimate becomes more certain. This step is carried out
using a direct product of Gaussians (clear from the inverse covariance
version of the KF).]

To get to canonical form, we manipulate these equations slightly. Begin
by defining the Kalman gain, Kk, as

  Kk = P̂k Ck^T Rk^{-1}.                                           (3.118)

We then manipulate:

  1 = P̂k (P̌k^{-1} + Ck^T Rk^{-1} Ck) = P̂k P̌k^{-1} + Kk Ck,     (3.119a)
  P̂k = (1 − Kk Ck) P̌k,                                           (3.119b)
  Kk = P̂k Ck^T Rk^{-1} = (1 − Kk Ck) P̌k Ck^T Rk^{-1},            (3.119c)
  Kk (1 + Ck P̌k Ck^T Rk^{-1}) = P̌k Ck^T Rk^{-1}.                 (3.119d)

Solving for Kk in this last expression, we can rewrite the recursive
filter equations as

predictor:
  P̌k = Ak−1 P̂k−1 Ak−1^T + Qk,                                    (3.120a)
  x̌k = Ak−1 x̂k−1 + vk,                                           (3.120b)

Kalman gain:
  Kk = P̌k Ck^T (Ck P̌k Ck^T + Rk)^{-1},                           (3.120c)

corrector:
  P̂k = (1 − Kk Ck) P̌k,                                           (3.120d)
  x̂k = x̌k + Kk (yk − Ck x̌k),                                    (3.120e)

where the 'innovation', yk − Ck x̌k, is the difference between the
actual and expected measurements. The role of the Kalman
gain is to properly weight the innovation’s contribution to the estimate
(in comparison to the prediction). In this form, these five equations
(and their extension to nonlinear systems) have been the workhorse
of estimation since Kalman’s initial paper (Kalman, 1960b). These are

identical to the forwards pass of the Rauch-Tung-Striebel smoother


discussed previously (with the (·)f subscripts dropped).
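A one-step numerical check (hypothetical numbers) that the information form (3.117) and the canonical form (3.120) give identical answers:

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01*np.eye(2); R = np.array([[0.25]])
P_prev = 0.5*np.eye(2); x_prev = np.array([1.0, -0.5])
v = np.array([0.0, 0.1]); y = np.array([1.2])

# Prediction, (3.117a,b) = (3.120a,b).
P_pred = A @ P_prev @ A.T + Q
x_pred = A @ x_prev + v

# Corrector in information (inverse covariance) form, (3.117c,d).
P_hat_inv = np.linalg.inv(P_pred) + C.T @ np.linalg.inv(R) @ C
x_hat_info = np.linalg.solve(P_hat_inv,
                             np.linalg.inv(P_pred) @ x_pred
                             + C.T @ np.linalg.inv(R) @ y)

# Corrector in canonical form, (3.120c-e).
K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
P_hat = (np.eye(2) - K @ C) @ P_pred
x_hat = x_pred + K @ (y - C @ x_pred)

assert np.allclose(x_hat, x_hat_info)
assert np.allclose(P_hat, np.linalg.inv(P_hat_inv))
```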

3.3.3 Kalman Filter via Bayesian Inference


A cleaner, simpler derivation of the Kalman filter can be had using our
Bayesian inference approach19 . Our Gaussian prior estimate at k − 1 is

 
  p(xk−1 | x̌0, v1:k−1, y0:k−1) = N(x̂k−1, P̂k−1).                  (3.121)

First, for the prediction step, we incorporate the latest input, vk , to


write a ‘prior’ at time k:

  p(xk | x̌0, v1:k, y0:k−1) = N(x̌k, P̌k),                          (3.122)

where

P̌k = Ak−1 P̂k−1 ATk−1 + Qk , (3.123a)


x̌k = Ak−1 x̂k−1 + vk . (3.123b)

These are identical to the prediction equations from the previous sec-
tion. These last two expressions can be found by exactly passing the
prior at k − 1 through the linear motion model. For the mean we have

  x̌k = E[xk] = E[Ak−1 xk−1 + vk + wk]
      = Ak−1 E[xk−1] + vk + E[wk] = Ak−1 x̂k−1 + vk,               (3.124)

since E[xk−1] = x̂k−1 and E[wk] = 0, and for the covariance we have

  P̌k = E[(xk − E[xk])(xk − E[xk])^T]
      = E[(Ak−1 xk−1 + vk + wk − Ak−1 x̂k−1 − vk)
          × (Ak−1 xk−1 + vk + wk − Ak−1 x̂k−1 − vk)^T]
      = Ak−1 E[(xk−1 − x̂k−1)(xk−1 − x̂k−1)^T] Ak−1^T + E[wk wk^T]
      = Ak−1 P̂k−1 Ak−1^T + Qk.                                    (3.125)

Next, for the correction step, we express the joint density of our state
19 In the next chapter, we will generalize this section to present the Bayes filter, which
can handle non-Gaussian PDFs as well as nonlinear motion and observation models.
We can think of this section as a special case of the Bayes filter, one that requires no
approximations to be made.
3.3 Recursive Discrete-Time Filtering 67

and latest measurement, at time k, as a Gaussian:


   
  p(xk, yk | x̌0, v1:k, y0:k−1)
      = N( [μx; μy], [Σxx, Σxy; Σyx, Σyy] )                        (3.126)
      = N( [x̌k; Ck x̌k],
           [P̌k, P̌k Ck^T; Ck P̌k, Ck P̌k Ck^T + Rk] ).
Looking back to Section 2.2.3 where we introduced Bayesian inference,
we can then directly write the conditional density for xk (i.e., the pos-
terior) as

  p(xk | x̌0, v1:k, y0:k)
      = N( μx + Σxy Σyy^{-1} (yk − μy),  Σxx − Σxy Σyy^{-1} Σyx ),  (3.127)

where we have defined x̂k as the mean and P̂k as the covariance. Sub-
stituting in the moments from above, we have
  Kk = P̌k Ck^T (Ck P̌k Ck^T + Rk)^{-1},                            (3.128a)
  P̂k = (1 − Kk Ck) P̌k,                                            (3.128b)
  x̂k = x̌k + Kk (yk − Ck x̌k),                                     (3.128c)
which are identical to the correction equations from the previous section
on MAP. Again, this is because the motion and measurement models
are linear and the noises and prior are Gaussian. Under these condi-
tions, the posterior density is exactly Gaussian. Thus, the mean and
mode of the posterior are one and the same. This property does not
hold if we switch to a nonlinear measurement model, which we discuss
in the next chapter.
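The correction step of this derivation can be replayed numerically: build the joint Gaussian (3.126), condition on yk using (3.127), and confirm the result matches the canonical correction (3.128). All numbers below are hypothetical:

```python
import numpy as np

P_pred = np.array([[0.6, 0.1], [0.1, 0.4]])   # predicted covariance
x_pred = np.array([0.3, -0.2])                # predicted mean
C = np.array([[1.0, 0.5]]); R = np.array([[0.2]])
y = np.array([0.5])

# Moments of the joint Gaussian in (3.126).
mu_x, mu_y = x_pred, C @ x_pred
S_xx = P_pred
S_xy = P_pred @ C.T
S_yy = C @ P_pred @ C.T + R

# Condition on y_k, as in (3.127).
x_hat = mu_x + S_xy @ np.linalg.solve(S_yy, y - mu_y)
P_hat = S_xx - S_xy @ np.linalg.solve(S_yy, S_xy.T)

# Identical to the canonical correction equations (3.128).
K = S_xy @ np.linalg.inv(S_yy)
assert np.allclose(x_hat, x_pred + K @ (y - C @ x_pred))
assert np.allclose(P_hat, (np.eye(2) - K @ C) @ P_pred)
```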

3.3.4 Kalman Filter via Gain Optimization


The Kalman filter is often referred to as being optimal. We did indeed
perform an optimization to come up with the recursive relations above
in the MAP derivation. There are also several other ways to look at
the optimality of the KF. We present one of these.
Assume we have an estimator with the correction step taking the
form
x̂k = x̌k + Kk (yk − Ck x̌k ) , (3.129)
but we do not yet know the gain matrix, Kk , to blend the corrective
measurements with the prediction. If we define the error in the state
estimate to be
êk = x̂k − xk , (3.130)

then we have
  E[êk êk^T] = (1 − Kk Ck) P̌k (1 − Kk Ck)^T + Kk Rk Kk^T.         (3.131)
We then define a cost function of the form
  J(Kk) = (1/2) tr E[êk êk^T] = E[(1/2) êk^T êk],                  (3.132)
which quantifies (in some sense) the magnitude of the covariance of êk .
We can minimize this cost directly with respect to Kk , to generate the
‘minimum variance’ estimate. We will make use of the identities
  ∂ tr(XY)/∂X ≡ Y^T,   ∂ tr(X Z X^T)/∂X ≡ 2XZ,                     (3.133)
where Z is symmetric. Then we have
  ∂J(Kk)/∂Kk = −(1 − Kk Ck) P̌k Ck^T + Kk Rk.                       (3.134a)
Setting this to zero and solving for Kk we have
  Kk = P̌k Ck^T (Ck P̌k Ck^T + Rk)^{-1},                            (3.135)
which is our usual expression for the Kalman gain.
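A quick numeric confirmation (hypothetical numbers) that the cost J in (3.132) is stationary and minimal at this gain:

```python
import numpy as np

P_pred = np.array([[0.7, 0.2], [0.2, 0.5]])   # predicted covariance
C = np.array([[1.0, -1.0]]); R = np.array([[0.3]])
rng = np.random.default_rng(3)

def J(K):
    # Cost (3.132) built from the error covariance (3.131).
    e_cov = (np.eye(2) - K @ C) @ P_pred @ (np.eye(2) - K @ C).T + K @ R @ K.T
    return 0.5*np.trace(e_cov)

K_star = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)   # gain (3.135)

# The gradient (3.134a) vanishes at K_star, and random perturbations
# of the gain can only increase the cost (J is quadratic and convex in K).
grad = -(np.eye(2) - K_star @ C) @ P_pred @ C.T + K_star @ R
assert np.allclose(grad, 0.0)
assert all(J(K_star) <= J(K_star + 0.1*rng.standard_normal(K_star.shape))
           for _ in range(100))
```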

3.3.5 Kalman Filter Discussion


There are a few points worth mentioning:
(i) For a linear system with Gaussian noise, the Kalman filter equa-
tions are the best linear unbiased estimate (BLUE); this means
they are performing right at the Cramér-Rao lower bound.
(ii) Initial conditions must be provided:

  {x̌0, P̌0}.                                                       (3.136)
(iii) The covariance equations can be propagated independently of
the mean equations. Sometimes a steady-state value of Kk is
computed and used for all time-steps to propagate the mean;
this is known as the ‘steady-state Kalman filter’.
(iv) At implementation, we must use yk,meas , the actual readings we
receive from our sensors, in the filter.
(v) A similar set of equations can be developed for the backwards
estimator that runs backwards in time.
It is worth reminding ourselves that we have arrived at the Kalman
filter equations through both an optimization paradigm as well as a full
Bayesian paradigm. The difference between these two will be significant
when we consider what happens in the nonlinear case (and why the
extension of the Kalman filter, the extended Kalman filter (EKF), does
not perform well in many situations).

3.3.6 Error Dynamics

It is useful to look at the difference between the estimated state and
the actual state. We define the following errors:

ěk = x̌k − xk,  (3.137a)
êk = x̂k − xk.  (3.137b)

Using (3.1) and (3.120), we can then write out the ‘error dynamics’:

ěk = Ak−1 êk−1 − wk,  (3.138a)
êk = (1 − Kk Ck) ěk + Kk nk,  (3.138b)
where we note that ê0 = x̂0 − x0 . From this system it is not hard to
see that E [êk ] = 0 for k > 0 so long as E [ê0 ] = 0. This means our
estimator is unbiased. We can use proof by induction. It is true for
k = 0 by assertion. Assume it is also true for k − 1. Then

E[ěk] = Ak−1 E[êk−1] − E[wk] = 0,  (3.139a)
E[êk] = (1 − Kk Ck) E[ěk] + Kk E[nk] = 0,  (3.139b)

since E[êk−1] = 0 by the induction hypothesis and E[wk] = E[nk] = 0.

It is therefore true for all k. It is less obvious that

E[ěk ěkᵀ] = P̌k,  (3.140a)
E[êk êkᵀ] = P̂k,  (3.140b)

for k > 0 so long as E[ê0 ê0ᵀ] = P̂0. This means our estimator is consistent. We again use proof by induction. It is true for k = 0 by assertion.
Assume E[êk−1 êk−1ᵀ] = P̂k−1. Then


E[ěk ěkᵀ] = E[(Ak−1 êk−1 − wk)(Ak−1 êk−1 − wk)ᵀ]
          = Ak−1 E[êk−1 êk−1ᵀ] Ak−1ᵀ − Ak−1 E[êk−1 wkᵀ] − E[wk êk−1ᵀ] Ak−1ᵀ + E[wk wkᵀ]
          = Ak−1 P̂k−1 Ak−1ᵀ + Qk
          = P̌k,  (3.141)

where E[êk−1 êk−1ᵀ] = P̂k−1 by the induction hypothesis, E[êk−1 wkᵀ] = 0 by independence, and E[wk wkᵀ] = Qk.
and

E[êk êkᵀ] = E[((1 − Kk Ck) ěk + Kk nk)((1 − Kk Ck) ěk + Kk nk)ᵀ]
          = (1 − Kk Ck) E[ěk ěkᵀ] (1 − Kk Ck)ᵀ + (1 − Kk Ck) E[ěk nkᵀ] Kkᵀ
            + Kk E[nk ěkᵀ] (1 − Kk Ck)ᵀ + Kk E[nk nkᵀ] Kkᵀ
          = (1 − Kk Ck) P̌k (1 − Kk Ck)ᵀ + Kk Rk Kkᵀ
          = (1 − Kk Ck) P̌k − P̂k Ckᵀ Kkᵀ + Kk Rk Kkᵀ
          = P̂k,  (3.142)

where E[ěk ěkᵀ] = P̌k from above, E[ěk nkᵀ] = 0 by independence, E[nk nkᵀ] = Rk, and the last two terms cancel because Kk = P̂k Ckᵀ Rk⁻¹.
It is therefore true for all k. This means that the true uncertainty in the
system (i.e., the covariance of the error, E [êk êTk ]) is perfectly modeled
by our estimate of the covariance, P̂k . In this sense, the Kalman filter
is an optimal filter. This is why it is sometimes referred to as best
linear unbiased estimate (BLUE). Yet another way of saying this is
that the covariance of the Kalman filter is right at the Cramér-Rao
Lower Bound; we cannot be any more certain in our estimate given the
uncertainty in the measurements we have used in that estimate.
A final important point to make is that the expectations we have
employed in this section are over the possible outcomes of the random
variables. They are not time averages. If we ran an infinite number of
trials and averaged over the trials (i.e., an ensemble average), then we
should see an average performance of zero error (i.e., an unbiased esti-
mator). This does not imply that within a single trial (i.e., a realization)
that the error will be zero or decay to zero over time.
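This ensemble interpretation can be checked by simulation; the sketch below runs many independent trials of a hypothetical scalar system and compares the ensemble statistics of the final error against the filter's own covariance:

```python
import numpy as np

# Ensemble (Monte Carlo) check of the unbiased/consistent claims for a
# hypothetical scalar system; the expectations are over trials, not time.
rng = np.random.default_rng(0)
a, c, q, r = 0.9, 1.0, 0.1, 0.5
steps, trials = 20, 5000

err = np.zeros(trials)
for t in range(trials):
    x = rng.normal(0.0, 1.0)       # true state drawn from the prior N(0, 1)
    xh, P = 0.0, 1.0               # filter initialized with the same prior
    for _ in range(steps):
        x = a * x + rng.normal(0.0, np.sqrt(q))     # true motion
        y = c * x + rng.normal(0.0, np.sqrt(r))     # measurement
        Pp, xp = a * P * a + q, a * xh              # predict
        K = Pp * c / (c * Pp * c + r)               # correct
        xh, P = xp + K * (y - c * xp), (1 - K * c) * Pp
    err[t] = xh - x                # e_hat at the final step

print(err.mean())                  # ≈ 0: unbiased over the ensemble
print(err.var(), P)                # ensemble variance ≈ P_hat: consistent
```

Within any single trial the error is, of course, not zero; only the averages over the ensemble match the filter's claims.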

3.3.7 Existence, Uniqueness, and Observability

A sketch of the stability proof of the KF is provided (Simon, 2006). We
consider only the time-invariant case and use italicized symbols to avoid
confusion with the lifted form: Ak = A, Ck = C, Qk = Q, Rk = R.
The sketch proceeds as follows:

(i) The covariance equation of the KF can be iterated to convergence prior to computing the equations for the mean. A big
question is whether the covariance will converge to a steady-state value and if so, will it be unique? Writing P to mean the
steady-state value for P̌k, we have (by combining the predictive
and corrective covariance equations) that the following must be
true at steady state:

P = A(1 − KC) P (1 − KC)ᵀ Aᵀ + A K R Kᵀ Aᵀ + Q,  (3.143)
which is one form of the Discrete Algebraic Riccati Equation
(DARE). Note that K depends on P in the above equation.
The DARE has a unique positive-semidefinite solution, P , if
and only if the following conditions hold:
– R > 0; note, we already assume this in the batch LG case,
– Q ≥ 0; in the batch LG case we actually assumed that Q > 0,
whereupon the next condition is redundant,
– (A, V ) is stabilizable with V T V = Q; this condition is re-
dundant when Q > 0,
– (A, C) is detectable; same as ‘observable’ except any unob-
servable eigenvalues are stable; we saw a similar observability
condition in the batch LG case.
The proof of the above statement is beyond the scope of this
book.
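Although the proof is beyond our scope, the fixed point itself is easy to compute numerically; the sketch below (hypothetical system values, with Q > 0 so the stabilizability condition is automatic) iterates the combined covariance equation until it satisfies (3.143), and also previews the stability property of point (iii) below:

```python
import numpy as np

# Iterate the combined predict/correct covariance equation to its fixed
# point for a hypothetical observable system; the limit is the unique
# PSD solution of the DARE (3.143).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)       # Q > 0
R = np.array([[0.1]])

P = np.eye(2)
for _ in range(500):
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    I_KC = np.eye(2) - K @ C
    P = A @ (I_KC @ P @ I_KC.T + K @ R @ K.T) @ A.T + Q

# Verify the fixed point satisfies (3.143) ...
K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
I_KC = np.eye(2) - K @ C
resid = A @ (I_KC @ P @ I_KC.T + K @ R @ K.T) @ A.T + Q - P
print(np.abs(resid).max())                          # ≈ 0

# ... and that the error dynamics are stable, even though A itself is
# only marginally stable (both of its eigenvalues equal 1).
print(np.abs(np.linalg.eigvals(A @ I_KC)).max())    # < 1
```
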
(ii) Once the covariance evolves to its steady-state value, P , so does
the Kalman gain. Let K be the steady-state value of Kk. We have

K = P Cᵀ (C P Cᵀ + R)⁻¹,  (3.144)
for the steady-state Kalman gain.
(iii) The error dynamics of the filter are then stable:

E[ěk] = A(1 − KC) E[ěk−1],  (3.145)

where the eigenvalues of A(1 − KC) are less than one in magnitude.
We can see this by noting that for any eigenvector, v, corresponding to an eigenvalue, λ, of (1 − KC)ᵀ Aᵀ, we have

vᵀ P v = vᵀ A(1 − KC) P (1 − KC)ᵀ Aᵀ v + vᵀ (A K R Kᵀ Aᵀ + Q) v,  (3.146a)
(1 − λ²) vᵀ P v = vᵀ (A K R Kᵀ Aᵀ + Q) v,  (3.146b)

where (3.146b) uses (1 − KC)ᵀ Aᵀ v = λ v, and both vᵀ P v and the right-hand side are positive,
which means that we must have |λ| < 1, and thus the steady-
state error dynamics are stable. Technically the right-hand side
could be zero, but after N repetitions of this process, we build
up a copy of the observability Grammian on the right-hand side
making it invertible (if the system is observable).
[Figure 3.6: State estimation with a continuous-time prior, x(t) ∼ GP(x̌(t), P̌(t, t′)), can be viewed as one-dimensional Gaussian process regression with time as the independent variable. We have data about the trajectory at a number of asynchronous measurement times, t0 < t1 < · · · < tK, and would like to query the state, x(τ), at some other time of interest.]

3.4 Batch Continuous-Time Estimation

In this section, we circle back to consider a more general problem than
the discrete-time setup in the earlier part of this chapter. In particular,
we consider what happens when we choose to use a continuous-time
motion model as the prior. We approach the problem from a Gaussian
process regression perspective (Rasmussen and Williams, 2006). We
show that for linear-Gaussian systems, the discrete-time formulation
is implementing the continuous-time one exactly, under certain special
conditions (Tong et al., 2013; Barfoot et al., 2014).

3.4.1 Gaussian Process Regression

We take a Gaussian process regression approach to state estimation.²⁰
This allows us (i) to represent trajectories in continuous time (and
therefore query the solution at any time of interest), and (ii) for the
nonlinear case that we will treat in the next chapter, to optimize our
solution by iterating over the entire trajectory (it is difficult to do this in
the recursive formulation, which typically iterates at just one timestep
at a time). We will show that under a certain special class of prior
motion models, GP regression enjoys a sparse structure that allows for
very efficient solutions.
We will consider systems with a continuous-time GP process model
prior and a discrete-time, linear measurement model:
x(t) ∼ GP(x̌(t), P̌(t, t′)),   t0 < t, t′,  (3.147)
yk = Ck x(tk) + nk,   t0 < t1 < · · · < tK,  (3.148)

where x(t) is the state, x̌(t) is the mean function, P̌(t, t′) is the co-
variance function, yk are measurements, nk ∼ N (0, Rk ) is Gaussian
measurement noise, and Ck is the measurement model coefficient ma-
trix.
We consider that we want to query the state at a number of times
(τ0 < τ1 < . . . < τJ) that may or may not be different from the measurement times (t0 < t1 < . . . < tK). Figure 3.6 depicts our problem setup.

[Footnote 20: There are other ways to represent continuous-time trajectories such as temporal basis functions; see Furgale et al. (2015) for a detailed review.]

The joint likelihood between the state (at the query times) and
the measurements (at the measurement times) is written as

p([xτ; y]) = N( [x̌τ; C x̌],  [P̌ττ   P̌τ Cᵀ;  C P̌τᵀ   R + C P̌ Cᵀ] ),  (3.149)

where

x = [x(t0); … ; x(tK)],   x̌ = [x̌(t0); … ; x̌(tK)],   xτ = [x(τ0); … ; x(τJ)],   x̌τ = [x̌(τ0); … ; x̌(τJ)],
y = [y0; … ; yK],   C = diag(C0, . . . , CK),   R = diag(R0, . . . , RK),
P̌ = [P̌(ti, tj)]ij,   P̌τ = [P̌(τi, tj)]ij,   P̌ττ = [P̌(τi, τj)]ij,

with semicolons denoting vertical stacking.

In GP regression, the matrix, P̌, is known as the kernel matrix. Based
on the factoring discussed in Section 2.2.3, we then have that

p(xτ|y) = N(x̂τ, P̂ττ),  (3.150)
x̂τ = x̌τ + P̌τ Cᵀ (C P̌ Cᵀ + R)⁻¹ (y − C x̌),   (mean)
P̂ττ = P̌ττ − P̌τ Cᵀ (C P̌ Cᵀ + R)⁻¹ C P̌τᵀ,   (covariance)

for the likelihood of the predicted state at the query times, given the
measurements.
The expression simplifies further if we take the query times to be
exactly the same as the measurement times (i.e., τk = tk , K = J). This
implies that
P̌ = P̌τ = P̌τ τ , (3.151)
and then we can write

p(x|y) = N(x̂, P̂),  (3.152)
x̂ = x̌ + P̌ Cᵀ (C P̌ Cᵀ + R)⁻¹ (y − C x̌),   (mean)
P̂ = P̌ − P̌ Cᵀ (C P̌ Cᵀ + R)⁻¹ C P̌.   (covariance)

Or, after an application of the SMW identity in (2.68), we can write
this as

p(x|y) = N(x̂, P̂),  (3.153)
x̂ = (P̌⁻¹ + Cᵀ R⁻¹ C)⁻¹ (P̌⁻¹ x̌ + Cᵀ R⁻¹ y),   (mean)
P̂ = (P̌⁻¹ + Cᵀ R⁻¹ C)⁻¹.   (covariance)

Rearranging the mean expression we have a linear system for x̂:

(P̌⁻¹ + Cᵀ R⁻¹ C) x̂ = P̌⁻¹ x̌ + Cᵀ R⁻¹ y.  (3.154)
This can be viewed as the solution to the following optimization prob-
lem:
x̂ = arg minₓ (1/2)(x̌ − x)ᵀ P̌⁻¹ (x̌ − x) + (1/2)(y − C x)ᵀ R⁻¹ (y − C x).  (3.155)
Note, at implementation we must be careful to use ymeas , the actual
measurements received from our sensors.
If after solving for the estimate at the measurements times we later
want to query the state at some other times of interest (τ0 < τ1 < . . . <
τJ ), we can use the GP interpolation equations to do so:

x̂τ = x̌τ + P̌τ P̌⁻¹ (x̂ − x̌),  (3.156a)
P̂ττ = P̌ττ + P̌τ P̌⁻¹ (P̂ − P̌) (P̌τ P̌⁻¹)ᵀ.  (3.156b)

This is linear interpolation in the state variable (but not necessarily in
time). To arrive at these interpolation equations, we return to (3.153)
and rearrange both the mean and covariance expressions:

P̌⁻¹ (x̂ − x̌) = Cᵀ (C P̌ Cᵀ + R)⁻¹ (y − C x̌),  (3.157a)
P̌⁻¹ (P̂ − P̌) P̌⁻ᵀ = −Cᵀ (C P̌ Cᵀ + R)⁻¹ C.  (3.157b)

These can then be substituted back into (3.150)

x̂τ = x̌τ + P̌τ Cᵀ (C P̌ Cᵀ + R)⁻¹ (y − C x̌),  (3.158a)
P̂ττ = P̌ττ − P̌τ Cᵀ (C P̌ Cᵀ + R)⁻¹ C P̌τᵀ,  (3.158b)

where we recognize Cᵀ(C P̌ Cᵀ + R)⁻¹(y − C x̌) = P̌⁻¹(x̂ − x̌) and Cᵀ(C P̌ Cᵀ + R)⁻¹C = −P̌⁻¹(P̂ − P̌)P̌⁻ᵀ,

to produce (3.156).
In general, this GP approach has complexity O(K 3 + K 2 J) since the
initial solve is O(K 3 ) and the query is O(K 2 J); this is quite expensive
and next we will seek to improve the cost by exploiting the structure
of the matrices involved.
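Before exploiting any structure, we can at least confirm numerically that the two forms of the posterior mean, (3.152) and (3.154), agree; the sketch below uses a small random problem with hypothetical sizes:

```python
import numpy as np

# Verify that the GP-regression form (3.152) and the information
# form (3.154) give the same posterior mean.
rng = np.random.default_rng(3)
n = 4
S = rng.standard_normal((n, n))
P_check = S @ S.T + n * np.eye(n)       # prior (kernel) covariance, SPD
C = np.eye(n)                            # one scalar measurement per time
R = 0.5 * np.eye(n)
x_check = rng.standard_normal(n)         # prior mean
y = rng.standard_normal(n)               # stand-in for actual measurements

# (3.152): x_hat = x_check + P C^T (C P C^T + R)^{-1} (y - C x_check)
W = np.linalg.inv(C @ P_check @ C.T + R)
x_hat_a = x_check + P_check @ C.T @ W @ (y - C @ x_check)

# (3.154): (P^{-1} + C^T R^{-1} C) x_hat = P^{-1} x_check + C^T R^{-1} y
Pi = np.linalg.inv(P_check)
Ri = np.linalg.inv(R)
x_hat_b = np.linalg.solve(Pi + C.T @ Ri @ C, Pi @ x_check + C.T @ Ri @ y)

print(np.allclose(x_hat_a, x_hat_b))    # True, by the SMW identity
```
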
3.4.2 A Class of Exactly Sparse Gaussian Process Priors


Next, we will develop a special class of GP priors that lead to very
efficient implementation. These priors are based on linear time-varying
(LTV) stochastic differential equations (SDEs):
ẋ(t) = A(t)x(t) + v(t) + L(t)w(t), (3.159)
with
w(t) ∼ GP(0, Q δ(t − t0 )), (3.160)
a (stationary) zero-mean GP with (symmetric, positive-definite) power
spectral density matrix, Q. In what follows, we will use an engineering
approach that avoids introducing Itō calculus; we hope that what our
treatment lacks in formality it makes up for in accessibility. For a more
formal treatment of stochastic differential equations in estimation see
Särkkä (2006).

[Margin note: Kiyoshi Itō (1915-2008) was a Japanese mathematician who pioneered the theory of stochastic integration and stochastic differential equations, now known as the Itō calculus.]

The general solution to this LTV ordinary differential equation is

x(t) = Φ(t, t0) x(t0) + ∫_{t0}^{t} Φ(t, s) (v(s) + L(s) w(s)) ds,  (3.161)

where Φ(t, s) is known as the transition function and has the following
properties:
Φ(t, t) = 1, (3.162)
Φ̇(t, s) = A(t)Φ(t, s), (3.163)
Φ(t, s) = Φ(t, r)Φ(r, s). (3.164)
It is usually straightforward to work out the transition function for
systems in practice, but there is no general formula.

Mean Function
For the mean function we have
E[x(t)] = Φ(t, t0) E[x(t0)] + ∫_{t0}^{t} Φ(t, s) (v(s) + L(s) E[w(s)]) ds,  (3.165)

where E[x(t)] = x̌(t), E[x(t0)] = x̌0 is the initial value of the mean at t0, and E[w(s)] = 0. Thus, the mean function is

x̌(t) = Φ(t, t0) x̌0 + ∫_{t0}^{t} Φ(t, s) v(s) ds.  (3.166)

If we now have a sequence of times, t0 < t1 < t2 < · · · < tK , then we
can write the mean at these times as

x̌(tk) = Φ(tk, t0) x̌0 + Σ_{n=1}^{k} Φ(tk, tn) vn,  (3.167)

where

vk = ∫_{tk−1}^{tk} Φ(tk, s) v(s) ds,   k = 1 . . . K.  (3.168)

Or we can write our system in lifted form,

x̌ = Av, (3.169)

where
x̌ = [x̌(t0); x̌(t1); … ; x̌(tK)],   v = [x̌0; v1; … ; vK],

A = [ 1
      Φ(t1, t0)      1
      Φ(t2, t0)      Φ(t2, t1)      1
      ⋮              ⋮              ⋮             ⋱
      Φ(tK−1, t0)    Φ(tK−1, t1)    Φ(tK−1, t2)   ⋯   1
      Φ(tK, t0)      Φ(tK, t1)      Φ(tK, t2)     ⋯   Φ(tK, tK−1)   1 ].  (3.170)

Notably, A, the lifted transition matrix, is lower-triangular.


If we assume that v(t) = B(t)u(t) with u(t) constant between mea-
surement times, we can further simplify the expression. Let uk be the
constant input when t ∈ (tk−1 , tk ]. Then we can define
B = diag(1, B1, . . . , BK),   u = [x̌0; u1; … ; uK],  (3.171)

and

Bk = ∫_{tk−1}^{tk} Φ(tk, s) B(s) ds,   k = 1 . . . K.  (3.172)

This allows us to write

x̌(tk ) = Φ(tk , tk−1 )x̌(tk−1 ) + Bk uk , (3.173)

and
x̌ = ABu, (3.174)

for the vector of means.
Covariance Function
For the covariance function we have

P̌(t, t′) = E[(x(t) − E[x(t)])(x(t′) − E[x(t′)])ᵀ]
         = Φ(t, t0) E[(x(t0) − E[x(t0)])(x(t0) − E[x(t0)])ᵀ] Φ(t′, t0)ᵀ
           + ∫_{t0}^{t} ∫_{t0}^{t′} Φ(t, s) L(s) E[w(s) w(s′)ᵀ] L(s′)ᵀ Φ(t′, s′)ᵀ ds′ ds,  (3.175)

where E[(x(t0) − E[x(t0)])(x(t0) − E[x(t0)])ᵀ] = P̌0 is the initial covariance at t0, E[w(s) w(s′)ᵀ] = Q δ(s − s′), and we have made the assumption that E[x(t0) w(t)ᵀ] = 0. Putting this together we have the following expression for the covariance:

P̌(t, t′) = Φ(t, t0) P̌0 Φ(t′, t0)ᵀ + ∫_{t0}^{t} ∫_{t0}^{t′} Φ(t, s) L(s) Q L(s′)ᵀ Φ(t′, s′)ᵀ δ(s − s′) ds′ ds.  (3.176)

Focusing on the second term, we integrate once to see that it is

∫_{t0}^{t} Φ(t, s) L(s) Q L(s)ᵀ Φ(t′, s)ᵀ H(t′ − s) ds,  (3.177)

where H(·) is the Heaviside step function. There are now three cases to
worry about: t < t′, t = t′, and t > t′. In the first case, the upper integration limit terminates the integration, while in the last the Heaviside
step function does the same job. The result is that the second term in the
covariance function can be written as

∫_{t0}^{min(t,t′)} Φ(t, s) L(s) Q L(s)ᵀ Φ(t′, s)ᵀ ds
  = Φ(t, t′) ∫_{t0}^{t′} Φ(t′, s) L(s) Q L(s)ᵀ Φ(t′, s)ᵀ ds,   t′ < t,
  = ∫_{t0}^{t} Φ(t, s) L(s) Q L(s)ᵀ Φ(t, s)ᵀ ds,   t = t′,
  = (∫_{t0}^{t} Φ(t, s) L(s) Q L(s)ᵀ Φ(t, s)ᵀ ds) Φ(t′, t)ᵀ,   t < t′.  (3.178)

If we now have a sequence of times, t0 < t1 < t2 < · · · < tK , then we
can write the covariance between two of these times as

P̌(ti, tj) = Φ(ti, tj) (Σ_{n=0}^{j} Φ(tj, tn) Qn Φ(tj, tn)ᵀ),   tj < ti,
          = Σ_{n=0}^{i} Φ(ti, tn) Qn Φ(ti, tn)ᵀ,   ti = tj,
          = (Σ_{n=0}^{i} Φ(ti, tn) Qn Φ(ti, tn)ᵀ) Φ(tj, ti)ᵀ,   ti < tj,  (3.179)
where

Qk = ∫_{tk−1}^{tk} Φ(tk, s) L(s) Q L(s)ᵀ Φ(tk, s)ᵀ ds,   k = 1 . . . K,  (3.180)
and we let Q0 = P̌0 to keep the notation in (3.178) clean.


Given this preparation, we are now ready to state the main result of
this section. Let t0 < t1 < t2 < · · · < tK be a monotonically increasing
sequence of time values. Define the kernel matrix to be

P̌ = [ Φ(ti, t0) P̌0 Φ(tj, t0)ᵀ + ∫_{t0}^{min(ti,tj)} Φ(ti, s) L(s) Q L(s)ᵀ Φ(tj, s)ᵀ ds ]ij,  (3.181)

where Q > 0 is symmetric. Note that P̌ has (K + 1) × (K + 1) blocks.
Then, we can factor P̌ according to a block-lower-diagonal-upper decomposition:

P̌ = A Q Aᵀ,  (3.182)

where A is the lower-triangular matrix given in (3.170) and

Qk = ∫_{tk−1}^{tk} Φ(tk, s) L(s) Q L(s)ᵀ Φ(tk, s)ᵀ ds,   k = 1 . . . K,  (3.183)
Q = diag(P̌0, Q1, Q2, . . . , QK).  (3.184)

It follows that P̌−1 is block-tridiagonal and is given by

P̌⁻¹ = (A Q Aᵀ)⁻¹ = A⁻ᵀ Q⁻¹ A⁻¹,  (3.185)

where

A⁻¹ = [  1
        −Φ(t1, t0)   1
                     −Φ(t2, t1)   1
                                  −Φ(t3, t2)   ⋱
                                               ⋱   1
                                                   −Φ(tK, tK−1)   1 ].  (3.186)
Since A−1 has only the main diagonal and the one below non-zero,
and Q−1 is block-diagonal, the block-tridiagonal structure of P̌−1 can
be verified by carrying out the multiplication. This is precisely the
structure we had at the start of the chapter for the batch discrete-time
case.
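This factorization is easy to verify numerically; the sketch below builds the lifted quantities for a hypothetical scalar random walk (all transition functions equal to 1) and checks that P̌⁻¹ is tridiagonal even though P̌ itself is dense:

```python
import numpy as np

# A hypothetical scalar instance of (3.182): x_k = x_{k-1} + w_k.
K = 5
A = np.tril(np.ones((K + 1, K + 1)))       # lifted A: Phi(t_i, t_j) = 1
Q = np.diag([1.0] + [0.5] * K)             # diag(P_0, Q_1, ..., Q_K)

P_check = A @ Q @ A.T                      # dense kernel matrix (3.182)
P_inv = np.linalg.inv(P_check)

# Entries more than one off the main diagonal vanish (block-tridiagonal
# in general; plain tridiagonal in this scalar case).
mask = np.abs(np.subtract.outer(np.arange(K + 1), np.arange(K + 1))) > 1
print(np.abs(P_inv[mask]).max())           # ≈ 0
```
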
Summary of Prior
We can write our final GP for x(t) as

x(t) ∼ GP( Φ(t, t0) x̌0 + ∫_{t0}^{t} Φ(t, s) v(s) ds,
           Φ(t, t0) P̌0 Φ(t′, t0)ᵀ + ∫_{t0}^{min(t,t′)} Φ(t, s) L(s) Q L(s)ᵀ Φ(t′, s)ᵀ ds ),  (3.187)

where the first argument is x̌(t) and the second is P̌(t, t′).

At the measurement times, t0 < t1 < · · · < tK , we can also then write

x ∼ N(x̌, P̌) = N(A v, A Q Aᵀ),  (3.188)

and we can further substitute v = Bu in the case that the inputs are
constant between measurement times.

Querying the GP
As discussed above, if we solve for the trajectory at the measurement
times, we may want to query it at other times of interest as well. This
can be done through the GP linear interpolation equations in (3.156).
Without loss of generality, we consider a single query time, tk ≤ τ <
tk+1 and so in this case we write

x̂(τ) = x̌(τ) + P̌(τ) P̌⁻¹ (x̂ − x̌),  (3.189a)
P̂(τ, τ) = P̌(τ, τ) + P̌(τ) P̌⁻¹ (P̂ − P̌) P̌⁻ᵀ P̌(τ)ᵀ.  (3.189b)

For the mean function at the query time we simply have

x̌(τ) = Φ(τ, tk) x̌(tk) + ∫_{tk}^{τ} Φ(τ, s) v(s) ds,  (3.190)

which has complexity O(1) to evaluate. For the covariance function at
the query time we have

P̌(τ, τ) = Φ(τ, tk) P̌(tk, tk) Φ(τ, tk)ᵀ + ∫_{tk}^{τ} Φ(τ, s) L(s) Q L(s)ᵀ Φ(τ, s)ᵀ ds,  (3.191)
which is also O(1) to evaluate.
We now examine the sparsity of the product P̌(τ )P̌−1 in the case of
a general LTV process model. The matrix, P̌(τ ), can be written as

P̌(τ) = [ P̌(τ, t0)   P̌(τ, t1)   · · ·   P̌(τ, tK) ].  (3.192)
The individual blocks are given by

P̌(τ, tj) = Φ(τ, tk) Φ(tk, tj) (Σ_{n=0}^{j} Φ(tj, tn) Qn Φ(tj, tn)ᵀ),   tj < tk,
          = Φ(τ, tk) (Σ_{n=0}^{k} Φ(tk, tn) Qn Φ(tk, tn)ᵀ),   tj = tk,
          = Φ(τ, tk) (Σ_{n=0}^{k} Φ(tk, tn) Qn Φ(tk, tn)ᵀ) Φ(tk+1, tk)ᵀ + Qτ Φ(tk+1, τ)ᵀ,   tj = tk+1,
          = Φ(τ, tk) (Σ_{n=0}^{k} Φ(tk, tn) Qn Φ(tk, tn)ᵀ) Φ(tj, tk)ᵀ + Qτ Φ(tk+1, τ)ᵀ Φ(tj, tk+1)ᵀ,   tk+1 < tj,  (3.193)

where

Qτ = ∫_{tk}^{τ} Φ(τ, s) L(s) Q L(s)ᵀ Φ(τ, s)ᵀ ds.  (3.194)

Although this looks difficult to work with, we may write

P̌(τ) = V(τ) Aᵀ,  (3.195)

where A was defined before and

V(τ) = [ Φ(τ, tk)Φ(tk, t0)P̌0   Φ(τ, tk)Φ(tk, t1)Q1   · · ·   Φ(τ, tk)Φ(tk, tk−1)Qk−1   Φ(τ, tk)Qk   Qτ Φ(tk+1, τ)ᵀ   0   · · ·   0 ].  (3.196)

Returning to the desired product, we have

P̌(τ) P̌⁻¹ = V(τ) Aᵀ A⁻ᵀ Q⁻¹ A⁻¹ = V(τ) Q⁻¹ A⁻¹.  (3.197)

Since Q−1 is block diagonal and A−1 has only the main diagonal and
the one below it non-zero, we can evaluate the product very efficiently.
Working it out, we have

P̌(τ) P̌⁻¹ = [ 0   · · ·   0   Λ(τ)   Ψ(τ)   0   · · ·   0 ],  (3.198)
Λ(τ) = Φ(τ, tk) − Qτ Φ(tk+1, τ)ᵀ Q_{k+1}⁻¹ Φ(tk+1, tk),   (block column k)
Ψ(τ) = Qτ Φ(tk+1, τ)ᵀ Q_{k+1}⁻¹,   (block column k + 1)

which has exactly two non-zero block columns. Inserting this into (3.189),
we have

x̂(τ) = x̌(τ) + [Λ(τ)  Ψ(τ)] ( [x̂k; x̂k+1] − [x̌(tk); x̌(tk+1)] ),  (3.199a)
P̂(τ, τ) = P̌(τ, τ) + [Λ(τ)  Ψ(τ)] ( [P̂k,k   P̂k,k+1;  P̂k+1,k   P̂k+1,k+1]
          − [P̌(tk, tk)   P̌(tk, tk+1);  P̌(tk+1, tk)   P̌(tk+1, tk+1)] ) [Λ(τ)ᵀ; Ψ(τ)ᵀ],  (3.199b)

which is a simple combination of just the two terms from tk and tk+1 .
Thus, to query the trajectory at a single time of interest is O(1) com-
plexity.

Example 3.1 As a simple example, consider the system

ẋ(t) = w(t), (3.200)

which can be written as

ẋ(t) = A(t)x(t) + v(t) + L(t)w(t), (3.201)

with A(t) = 0, v(t) = 0, L(t) = 1. In this case, the query equation becomes

x̂τ = (1 − α) x̂k + α x̂k+1,  (3.202)

assuming the mean function is zero everywhere and where

α = (τ − tk)/(tk+1 − tk) ∈ [0, 1],  (3.203)

which is a familiar interpolation scheme that is linear in τ. More complicated process models lead to more complicated interpolation equations.
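The collapse to linear interpolation can be traced through the weights in (3.198); the sketch below evaluates them for this scalar model with hypothetical times:

```python
# Scalar check of Example 3.1: for the process model x_dot = w, the
# interpolation weights in (3.198) collapse to Lambda = 1 - alpha and
# Psi = alpha (all transition functions are 1). Numbers are hypothetical.
Qc = 2.0                        # power spectral density, Q
tk, tk1, tau = 0.0, 4.0, 1.0    # bracketing times and query time

Q_tau = (tau - tk) * Qc         # (3.194) with Phi = 1, L = 1
Q_k1 = (tk1 - tk) * Qc          # (3.180)
Lam = 1.0 - Q_tau / Q_k1        # (3.198)
Psi = Q_tau / Q_k1
alpha = (tau - tk) / (tk1 - tk)
print(Lam, Psi, alpha)          # 0.75 0.25 0.25
```

Note that the power spectral density cancels out, leaving weights that depend only on the query fraction α.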

3.4.3 Linear Time-Invariant Case


Naturally, the equations simplify considerably in the linear time-invariant
(LTI) case:
ẋ(t) = Ax(t) + Bu(t) + Lw(t), (3.204)

with A, B, and L constant.²¹ The transition function is simply

Φ(t, s) = exp (A(t − s)) , (3.205)

which we note depends only on the difference of the two times (i.e., it
is stationary). We can therefore write,

∆tk:k−1 = tk − tk−1 , k = 1 . . . K, (3.206)


Φ(tk , tk−1 ) = exp (A ∆tk:k−1 ) , k = 1 . . . K, (3.207)
Φ(tk , tj ) = Φ(tk , tk−1 )Φ(tk−1 , tk−2 ) · · · Φ(tj+1 , tj ), (3.208)

to simplify matters.
21 We use italicized symbols for the time-invariant system matrices to avoid confusion
with the lifted-form quantities.
Mean Function
For the mean function we have the following simplification:

vk = ∫_0^{∆tk:k−1} exp(A(∆tk:k−1 − s)) B u(s) ds,   k = 1 . . . K.  (3.209)

If we assume that u(t) is constant between each pair of consecutive
measurement times, we can further simplify the expression. Let uk be
the constant input when t ∈ (tk−1, tk]. Then we can define

B = diag(1, B1, . . . , BK),   u = [x̌0; u1; … ; uK],  (3.210)

and

Bk = ∫_0^{∆tk:k−1} exp(A(∆tk:k−1 − s)) ds B
   = Φ(tk, tk−1)(1 − Φ(tk, tk−1)⁻¹) A⁻¹ B,   k = 1 . . . K.  (3.211)

This allows us to write

x̌ = ABu, (3.212)

for the vector of means.

Covariance Function
For the covariance function, we have the following simplification:

Qk = ∫_0^{∆tk:k−1} exp(A(∆tk:k−1 − s)) L Q Lᵀ exp(A(∆tk:k−1 − s))ᵀ ds,  (3.213)
for k = 1 . . . K. This is relatively straightforward to evaluate, particu-
larly if A is nilpotent. Letting

Q = diag(P̌0 , Q1 , Q2 , . . . , QK ), (3.214)

we then have

P̌ = AQAT , (3.215)

for the covariance matrix.

Querying the GP
To query the GP we need the following quantities:
Φ(tk+1 , τ ) = exp (A ∆tk+1:τ ) , ∆tk+1:τ = tk+1 − τ, (3.216)
Φ(τ, tk ) = exp (A ∆tτ :k ) , ∆tτ :k = τ − tk , (3.217)
Qτ = ∫_0^{∆tτ:k} exp(A(∆tτ:k − s)) L Q Lᵀ exp(A(∆tτ:k − s))ᵀ ds.  (3.218)

Our interpolation equation is still

x̂(τ) = x̌(τ) + (Φ(τ, tk) − Qτ Φ(tk+1, τ)ᵀ Q_{k+1}⁻¹ Φ(tk+1, tk)) (x̂k − x̌k)
        + Qτ Φ(tk+1, τ)ᵀ Q_{k+1}⁻¹ (x̂k+1 − x̌k+1),  (3.219)

which is a linear combination of just the two terms from tk and tk+1 .
Example 3.2 Consider the case,
p̈(t) = w(t), (3.220)
where p(t) corresponds to position and
w(t) ∼ GP(0, Q δ(t − t′)),  (3.221)
is white noise as before. This corresponds to white noise on acceleration
(i.e., the ‘constant velocity’ model). We can cast this in the form
ẋ(t) = Ax(t) + Bu(t) + Lw(t), (3.222)
by taking

x(t) = [p(t); ṗ(t)],   A = [0  1; 0  0],   B = 0,   L = [0; 1].  (3.223)

In this case we have

exp(A∆t) = 1 + A∆t + (1/2) A²∆t² + · · · = 1 + A∆t = [1  ∆t1; 0  1],  (3.224)

since A² = 0 (A is nilpotent). Therefore, we have

Φ(tk, tk−1) = [1  ∆tk:k−1 1; 0  1].  (3.225)
For the Qk we have
Substituting (3.225) and L = [0; 1] into (3.213), we have

Qk = ∫_0^{∆tk:k−1} [ (∆tk:k−1 − s)² Q   (∆tk:k−1 − s) Q;  (∆tk:k−1 − s) Q   Q ] ds
   = [ (1/3) ∆tk:k−1³ Q   (1/2) ∆tk:k−1² Q;  (1/2) ∆tk:k−1² Q   ∆tk:k−1 Q ],  (3.226)
which we note is positive definite even though L Q Lᵀ is not. The inverse is

Qk⁻¹ = [ 12 ∆tk:k−1⁻³ Q⁻¹   −6 ∆tk:k−1⁻² Q⁻¹;  −6 ∆tk:k−1⁻² Q⁻¹   4 ∆tk:k−1⁻¹ Q⁻¹ ],  (3.227)
which is needed to construct P̌−1 . For the mean function we have

x̌k = Φ(tk, t0) x̌0,   k = 1 . . . K,  (3.228)

which can be stacked and written as

x̌ = A [x̌0; 0; … ; 0],  (3.229)
for convenience.
For trajectory queries we also need

Φ(τ, tk) = [1  ∆tτ:k 1; 0  1],   Φ(tk+1, τ) = [1  ∆tk+1:τ 1; 0  1],   x̌τ = Φ(τ, tk) x̌k,
Qτ = [ (1/3) ∆tτ:k³ Q   (1/2) ∆tτ:k² Q;  (1/2) ∆tτ:k² Q   ∆tτ:k Q ],  (3.230)
which we see will result in a scheme that is not linear in τ . Substituting
these into the interpolation equation, we have

x̂τ = x̌τ + (Φ(τ, tk) − Qτ Φ(tk+1, τ)ᵀ Q_{k+1}⁻¹ Φ(tk+1, tk)) (x̂k − x̌k)
        + Qτ Φ(tk+1, τ)ᵀ Q_{k+1}⁻¹ (x̂k+1 − x̌k+1)  (3.231)
   = x̌τ + [ (1 − 3α² + 2α³)1   T(α − 2α² + α³)1;  (6/T)(−α + α²)1   (1 − 4α + 3α²)1 ] (x̂k − x̌k)
        + [ (3α² − 2α³)1   T(−α² + α³)1;  (6/T)(α − α²)1   (−2α + 3α²)1 ] (x̂k+1 − x̌k+1),

where

α = (τ − tk)/(tk+1 − tk) ∈ [0, 1],   T = ∆tk+1:k = tk+1 − tk.  (3.232)
Remarkably, the top row (corresponding to position) is precisely cubic
Hermite polynomial interpolation:

p̂τ − p̌τ = h00(α)(p̂k − p̌k) + h10(α) T (ṗ̂k − ṗ̌k)
         + h01(α)(p̂k+1 − p̌k+1) + h11(α) T (ṗ̂k+1 − ṗ̌k+1),  (3.233)

where

h00(α) = 1 − 3α² + 2α³,   h10(α) = α − 2α² + α³,  (3.234a)
h01(α) = 3α² − 2α³,   h11(α) = −α² + α³,  (3.234b)

[Margin note: Charles Hermite (1822-1901) was a French mathematician who did research on a variety of topics including orthogonal polynomials.]

are the Hermite basis functions. The bottom row (corresponding to ve-
locity) is only quadratic in α, and the basis functions are the derivatives
of the ones used to interpolate position. It is very important to note
that this Hermite interpolation scheme arises automatically from using
the GP regression approach and our choice of prior motion model. At
implementation, we may work directly with the general matrix equa-
tions and avoid working out the details of the resulting interpolation
scheme.
It is also easy to verify that when α = 0 we have
x̂τ = x̌τ + (x̂k − x̌k ), (3.235)
and when α = 1 we have
x̂τ = x̌τ + (x̂k+1 − x̌k+1 ), (3.236)
which seem to be sensible boundary conditions.
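The claimed equivalence between the GP interpolation weights and the Hermite basis functions can be confirmed numerically; the sketch below builds Λ(τ) and Ψ(τ) from (3.198) using (3.225), (3.226), and (3.230), with a hypothetical interval and query fraction:

```python
import numpy as np

T, alpha = 2.0, 0.3                  # interval length and query fraction
dt = alpha * T                       # tau - t_k
Qc = 1.0                             # scalar power spectral density, Q

Phi = lambda h: np.array([[1.0, h], [0.0, 1.0]])   # transition function (3.225)
Qblk = lambda h: Qc * np.array([[h**3 / 3, h**2 / 2],
                                [h**2 / 2, h]])    # covariance block (3.226)

Q_tau = Qblk(dt)
Q_k1_inv = np.linalg.inv(Qblk(T))
Psi = Q_tau @ Phi(T - dt).T @ Q_k1_inv             # Psi(tau) from (3.198)
Lam = Phi(dt) - Psi @ Phi(T)                       # Lambda(tau) from (3.198)

# Hermite basis functions from (3.234); the top (position) rows must match.
h00 = 1 - 3*alpha**2 + 2*alpha**3
h01 = 3*alpha**2 - 2*alpha**3
h10 = alpha - 2*alpha**2 + alpha**3
h11 = -alpha**2 + alpha**3
print(np.allclose(Lam[0], [h00, T*h10]), np.allclose(Psi[0], [h01, T*h11]))
# -> True True
```

The bottom (velocity) rows similarly match the derivatives of the Hermite basis functions, as the text states.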

3.4.4 Relationship to Batch Discrete-Time Estimation


Now that we have seen how to efficiently represent the prior, we can
revisit the GP optimization problem described by (3.155). Substituting
in our prior terms, the problem becomes

x̂ = arg minₓ (1/2)(Av − x)ᵀ A⁻ᵀ Q⁻¹ A⁻¹ (Av − x) + (1/2)(y − Cx)ᵀ R⁻¹ (y − Cx),  (3.237)

where we recognize x̌ = Av and P̌⁻¹ = A⁻ᵀ Q⁻¹ A⁻¹. Rearranging, we have

x̂ = arg minₓ (1/2)(v − A⁻¹x)ᵀ Q⁻¹ (v − A⁻¹x) + (1/2)(y − Cx)ᵀ R⁻¹ (y − Cx).  (3.238)

The solution to this optimization problem is given by

(A⁻ᵀ Q⁻¹ A⁻¹ + Cᵀ R⁻¹ C) x̂ = A⁻ᵀ Q⁻¹ v + Cᵀ R⁻¹ y.  (3.239)
Because the left-hand side is block-tridiagonal, we can solve this system
of equations in O(K) time with a sparse solver (e.g., sparse Cholesky
decomposition followed by sparse forward-backward passes). To query
the trajectory at J extra times will be O(J) since each query is O(1).
This means that we can solve for the state at the measurement and
query times in O(K + J) time. This is a big improvement over the
O(K 3 + K 2 J) cost when we did not exploit the sparse structure of our
particular class of GP priors.
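To make the O(K) claim concrete, the sketch below implements a scalar tridiagonal solver (the Thomas algorithm, a stand-in for the sparse Cholesky forward-backward passes mentioned above) and checks it against a dense solve on the matrix from Exercise 3.6.2:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system in O(K): a = sub-, b = main,
    c = super-diagonal, d = right-hand side (a[0], c[-1] unused)."""
    n = len(b)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # backward substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Tridiagonal system matching Exercise 3.6.2's H^T W^{-1} H (scalar case).
b = np.array([2.0, 3.0, 3.0, 3.0, 2.0])        # main diagonal
a = np.array([0.0, -1.0, -1.0, -1.0, -1.0])    # sub-diagonal
c = np.array([-1.0, -1.0, -1.0, -1.0, 0.0])    # super-diagonal
d = np.ones(5)                                 # a hypothetical right-hand side

M = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.allclose(thomas(a, b, c, d), np.linalg.solve(M, d)))   # True
```

The block version of this idea (each scalar replaced by an N × N block, each division by a block solve) is exactly the O(N³K) sparse solver referred to in the text.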
This is identical to the system of equations we had to solve in the


discrete-time approach earlier. Thus, the discrete-time approach can ex-
actly capture the continuous-time approach (at the measurement times)
and both can be viewed as carrying out Gaussian process regression.

3.5 Summary
The main take-away points from this chapter are:

1. When the motion and observation models are linear, and the mea-
surement and process noises are zero-mean Gaussian, the batch and
recursive solutions to state estimation are straightforward, requiring
no approximation.
2. The Bayesian posterior of a linear-Gaussian estimation problem is
exactly Gaussian. This implies that the MAP solution is the same
as the mean of the full Bayesian solution, since the mode and the
mean of a Gaussian are one and the same.
3. The batch, discrete-time, linear-Gaussian solution can exactly imple-
ment (at the measurements times) the case where a continuous-time
motion model is employed; appropriate prior terms must be used for
this to be true.

The next chapter will investigate what happens when the motion and
observation models are nonlinear.

3.6 Exercises
3.6.1 Consider the discrete-time system,

xk = xk−1 + vk + wk , wk ∼ N (0, Q),


yk = xk + nk , nk ∼ N (0, R),

which could represent a cart moving back and forth along the
x-axis. The initial state, x̌0 , is unknown. Set up the system of
equations for the batch least-squares estimation approach:

HT W−1 H x̂ = HT W−1 z.

In other words, work out the details of H, W, z, and x̂, for this
system. Take the maximum timestep to be K = 5. Assume all the
noises are uncorrelated with one another. Will a unique solution
exist to the problem?
3.6.2 Using the same system as the first question, set Q = R = 1 and
show that

HᵀW⁻¹H = [  2  −1   0   0   0
           −1   3  −1   0   0
            0  −1   3  −1   0
            0   0  −1   3  −1
            0   0   0  −1   2 ].

What will be the sparsity pattern of the Cholesky factor, L, such


that LLT = HT W−1 H?

3.6.3 Using the same system as the first question, modify the least-
squares solution for the case in which the measurement noises are
correlated with one another in the following way:

E[yk yℓ] = R,     |k − ℓ| = 0,
         = R/2,   |k − ℓ| = 1,
         = R/4,   |k − ℓ| = 2,
         = 0,     otherwise.

Will a unique least-squares solution still exist?

3.6.4 Using the same system as the first question, work out the details
of the Kalman filter solution. In this case, assume that the initial
conditions for the mean and covariance are x̌0 and P̌0 , respectively.
Show that the steady-state values for the prior and posterior co-
variances, P̌ and P̂ , as K → ∞ are the solutions to the following
quadratics:

P̌ 2 − QP̌ − QR = 0,
P̂ 2 + QP̂ − QR = 0,

which are two versions of the discrete algebraic Riccati equations.


Explain why only one of the two roots to each quadratic is physi-
cally possible.

3.6.5 Using the MAP approach of Section 3.3.2, derive a version of


the Kalman filter that recurses backwards in time rather than
forwards.
3.6.6 Show that

[ 1
  A         1
  A²        A         1
  ⋮         ⋮         ⋮        ⋱
  A^(K−1)   A^(K−2)   A^(K−3)  ⋯   1
  A^K       A^(K−1)   A^(K−2)  ⋯   A   1 ]⁻¹
=
[  1
  −A   1
       −A   1
            −A   ⋱
                 ⋱   1
                     −A   1 ].
3.6.7 We have seen that for the batch least-squares solution, the pos-
terior covariance is given by
−1
P̂ = HT W−1 H .
We have also seen that the computational cost of performing a
Cholesky decomposition,
LLT = HT W−1 H,
is O(N (K + 1)) owing to the sparsity of the system. Inverting, we
have
P̂ = L−T L−1 .
Comment on the computational cost of computing P̂ by this ap-
proach.
4

Nonlinear Non-Gaussian Estimation

This chapter is one of the most important ones contained in this book.
Here we examine how to deal with the fact that in the real world
there are no linear-Gaussian systems. It should be stated up front that
nonlinear, non-Gaussian (NLNG) estimation is very much still an ac-
tive research topic. The ideas in this chapter provide only some of
the more common approaches to dealing with nonlinear and/or non-
Gaussian systems1 . We begin by contrasting full Bayesian to maximum
a posteriori (MAP) estimation for nonlinear systems. We then intro-
duce a general theoretical framework for recursive filtering problems
called the Bayes filter. Several of the more common filtering tech-
niques are shown to be approximations of the Bayes filter: extended
Kalman filter, sigmapoint Kalman filter, particle filter. We then return
to batch estimation for nonlinear systems, both in discrete and con-
tinuous time. Some books that address nonlinear estimation include
Jazwinski (1970), Maybeck (1994), and Simon (2006).

4.1 Introduction
In the linear-Gaussian chapter, we discussed two perspectives to es-
timation: full Bayesian and maximum a posteriori. We saw that for
linear motion and observation models driven by Gaussian noise, these
two paradigms come to the same answer (i.e., the MAP point was the
mean of the full Bayesian approach); this is because the full posterior
is exactly Gaussian and therefore the mean and mode (i.e., maximum)
are the same point.
This is not true once we move to nonlinear models, since the full
Bayesian posterior is no longer Gaussian. To provide some intuition on
this topic, this section considers a simplified, one-dimensional, nonlin-
ear estimation problem: estimating the position of a landmark from a
stereo camera.

1 Even most of the methods in this chapter actually assume the noise is Gaussian.

89
90 Nonlinear Non-Gaussian Estimation
[Figure 4.1: Idealized stereo camera model relating the landmark depth, x, to the (noise-free) disparity measurement, y = u − v = fb/x. The diagram shows left and right pinhole cameras on an image plane, separated by baseline b, with focal length f; u and v are the horizontal image coordinates of the landmark in the left and right images.]

4.1.1 Full Bayesian Estimation


To gain some intuition, consider a simple estimation problem using a
nonlinear camera model,

y = fb/x + n.   (4.1)
This is the type of nonlinearity present in a stereo camera (cf., Fig-
ure 4.1), where the state, x, is the position of a landmark (in metres),
the measurement, y, is the disparity between the horizontal coordinates
of the landmark in the left and right images (in pixels), f is the focal
length (in pixels), b is the baseline (horizontal distance between left and
right cameras; in metres), and n is the measurement noise (in pixels).
To perform Bayesian inference,
p(x|y) = p(y|x) p(x) / ∫_{−∞}^{∞} p(y|x) p(x) dx,   (4.2)
we require expressions for p(y|x) and p(x). We meet this requirement
by making two assumptions. First, we assume that the measurement
noise is zero-mean Gaussian, n ∼ N (0, R), such that
p(y|x) = N(fb/x, R) = (1/√(2πR)) exp( −(1/(2R)) (y − fb/x)² ),   (4.3)
and second, we assume that the prior is Gaussian, where
 
p(x) = N(x̌, P̌) = (1/√(2πP̌)) exp( −(1/(2P̌)) (x − x̌)² ).   (4.4)
Before we continue, we note that the Bayesian framework provides
an implied order of operations that we would like to make explicit:
assign prior → draw xtrue → draw ymeas → compute posterior
In words, we start with a prior. The ‘true’ state is then drawn from
the prior, and the measurement is generated by observing the true
state through the camera model and adding noise. The estimator then
reconstructs the posterior from the measurement and prior, without
knowing xtrue. This process is necessary to ensure ‘fair’ comparison between state estimation algorithms.

[Figure 4.2: Example of Bayesian inference on the one-dimensional stereo camera example. We see that the full posterior is not Gaussian, owing to the nonlinear measurement model.]
To put these mathematical models into practical terms, let us assign
the following numerical values to the problem:
x̌ = 20 [m],   P̌ = 9 [m²],   f = 400 [pixel],   b = 0.1 [m],   R = 0.09 [pixel²].   (4.5)
As discussed above, the true state, xtrue , and (noise-corrupted) mea-
surement, ymeas , are drawn randomly from p(x) and p(y|x), respectively.
Each time we repeat the experiment, these values will change. In order
to plot the posterior for a single experiment, we used the particular
values,
xtrue = 22 [m],   ymeas = fb/xtrue + 1 [pixel],
which are fairly typical given the noise characteristics.
Figure 4.2 plots the prior and posterior for this example. Since we
are considering a one-dimensional scenario, the denominator integral
in (4.2) was computed numerically and thus we effectively have a view
of the full Bayesian posterior with no approximation. We can observe
that even though the prior and measurement densities are Gaussian,
the posterior is asymmetrical; it is skewed to one side by the nonlin-
ear observation model. However, since the posterior is still unimodal
(a single peak), we might still be justified in approximating it as Gaus-
sian. This idea is discussed later in the chapter. We also see that the
incorporation of the measurement results in a posterior that is more
concentrated (i.e., more ‘certain’) about the state than the prior; this is the main idea behind Bayesian state estimation: we want to incorporate
measurements into the prior to become more certain about the posterior
state.
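Because the state here is one-dimensional, the posterior in (4.2) can be evaluated numerically on a dense grid, which is essentially how Figure 4.2 is produced. A minimal sketch using the values in (4.5); the grid bounds and resolution are arbitrary choices, not from the text:

```python
import numpy as np

# Stereo camera example parameters from (4.5)
f, b = 400.0, 0.1             # focal length [pixel], baseline [m]
x_check, P_check = 20.0, 9.0  # prior mean [m] and variance [m^2]
R = 0.09                      # measurement noise variance [pixel^2]

x_true = 22.0
y_meas = f * b / x_true + 1.0  # the particular measurement used in the text

# Evaluate prior and likelihood on a dense grid of candidate states
x = np.linspace(5.0, 35.0, 10000)
dx = x[1] - x[0]
prior = np.exp(-0.5 * (x - x_check) ** 2 / P_check)
likelihood = np.exp(-0.5 * (y_meas - f * b / x) ** 2 / R)

# Bayes' rule: normalize the product so the posterior integrates to one
posterior = prior * likelihood
posterior /= posterior.sum() * dx

x_map = x[np.argmax(posterior)]       # mode (the MAP estimate)
x_mean = np.sum(x * posterior) * dx   # mean of the full posterior
```

The normalized product of prior and likelihood is the full Bayesian posterior; its mode and mean differ because the density is skewed by the nonlinearity.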
Unfortunately, while we were able to effectively compute the exact Bayesian posterior in our simple stereo camera example, this is typically not tractable for real problems. As a result, a variety of tactics have been built up over the years to compute an approximate posterior. For example, the MAP approach is concerned with finding only the most likely state, or in other words the mode or ‘peak’ of the posterior. We discuss this next.

[Figure 4.3: Posterior from the stereo camera example, p(x|y), as well as the negative log-likelihood of the posterior, −ln(p(x|y)) (dashed). The MAP solution (mode) is simply the value of x that maximizes (or minimizes, respectively) either of these functions; in other words, the MAP solution is the mode of the posterior, which is not generally the same as the mean.]

4.1.2 Maximum a Posteriori Estimation


As mentioned above, computing the full Bayesian posterior can be in-
tractable in general. A very common approach is to seek out only the
value of the state that maximizes the true posterior. This is called
maximum a posteriori (MAP) estimation and is depicted graphically
in Figure 4.3.
In other words, we want to compute
x̂map = arg max_x p(x|y).   (4.6)

Equivalently, we can try minimizing the negative log likelihood:


x̂map = arg min_x (− ln(p(x|y))),   (4.7)

which can be easier when the PDFs involved are from the exponential
family. As we are seeking only the most likely state, we can use Bayes’
rule to write
x̂map = arg min_x (− ln(p(y|x)) − ln(p(x))),   (4.8)

which drops p(y) since it does not depend on x.


Relating this back to the stereo camera example presented earlier,
we can write
x̂map = arg min J(x), (4.9)
x

with
 
J(x) = (1/(2R)) (y − fb/x)² + (1/(2P̌)) (x̌ − x)²,   (4.10)
where we have dropped any further normalization constants that do not depend on x. We can then find x̂map using any number of numerical optimization techniques.

[Figure 4.4: Histogram of estimator values for 1,000,000 trials of the stereo camera experiment, where each time a new xtrue is randomly drawn from the prior and a new ymeas is randomly drawn from the measurement model. The dashed line marks the mean of the prior, x̌, and the solid line marks the expected value of the MAP estimator, x̂map, over all the trials. The gap between dashed and solid is emean ≈ −33.0 cm, which indicates a bias. The average squared error is esq ≈ 4.41 m².]

Since the MAP estimator, x̂map, finds the most likely state given the data and prior, a question we might ask is, how well does this estimator actually capture xtrue? In robotics, we often report the average performance of our estimators, x̂, with respect to some ‘ground truth’. In other words, we compute

emean (x̂) = EXN [x̂ − xtrue ],   (4.11)
where EXN [·] is the expectation operator; we explicitly include the sub-
scripts XN to indicate that we are averaging over both the random
draw of xtrue from the prior, as well as the random draw of n from the
measurement noise. Since xtrue is assumed to be independent of n, we
have EXN [xtrue ] = EX [xtrue ] = x̌, and so
emean (x̂) = EXN [x̂] − x̌. (4.12)
It may be surprising to learn that under this performance measure,
MAP estimation is biased (i.e., emean (x̂map ) ≠ 0). This can be attributed
to the presence of a nonlinear measurement model, h(·), and the fact
that the mode and mean of the posterior PDF are not the same. As
discussed in the last chapter, when h(·) is linear then emean (x̂map ) = 0.
However, since we can trivially set the estimate to the prior, x̂ = x̌,
and obtain emean (x̌) = 0, we need to define a secondary performance
metric. This metric is typically the average squared error, esq , where
esq (x̂) = EXN [(x̂ − xtrue )2 ]. (4.13)
In other words, the first metric, emean , captures the mean of the es-
timator error, while the second, esq , captures the combined effects of
bias and variance. Trading off performance between these two metrics is known in the machine learning literature as the bias-variance tradeoff (Bishop, 2006).

Good performance on both metrics is necessary for a practical state estimator.
Figure 4.4 shows the MAP bias for the stereo camera example. We
see that over a large number of trials (using the parameters in (4.5)),
the average difference between the estimator, x̂map , and the ground-
truth, xtrue = 20 m, is emean ≈ −33.0 cm, demonstrating a small bias.
The average squared error is esq ≈ 4.41 m2 .
Note, in this experiment we have drawn the true state from the prior
used in the estimator and we still see a bias. The bias can be worse
in practice, as we often do not really know from which prior the true
state is drawn and must invent one.
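The experiment behind Figure 4.4 can be sketched in a few lines. The version below uses far fewer trials than the figure's 1,000,000 and a simple grid search over J(x) from (4.10) in place of a proper optimizer; both are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
f, b, R = 400.0, 0.1, 0.09        # camera and noise parameters from (4.5)
x_check, P_check = 20.0, 9.0      # prior mean and variance
x_grid = np.linspace(1.0, 60.0, 6000)  # candidate states for the MAP search

def x_map(y):
    # J(x) from (4.10): measurement term plus prior term; take the minimizer
    J = (y - f * b / x_grid) ** 2 / (2 * R) \
        + (x_check - x_grid) ** 2 / (2 * P_check)
    return x_grid[np.argmin(J)]

trials = 10000
x_true = x_check + np.sqrt(P_check) * rng.standard_normal(trials)   # draw from prior
y_meas = f * b / x_true + np.sqrt(R) * rng.standard_normal(trials)  # draw measurement

errors = np.array([x_map(y) for y in y_meas]) - x_true
e_mean = errors.mean()        # negative: the MAP bias discussed in the text
e_sq = (errors ** 2).mean()   # average squared error
```

With enough trials, the mean error approaches the roughly −33 cm bias reported in the text.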
In the rest of this chapter, we will be discussing various estimation approaches for nonlinear, non-Gaussian systems. We must be careful to try
to understand what aspect of the full Bayesian posterior each method
captures: mean, mode, something else? We prefer to make distinctions
in these terms rather than by saying one method is more accurate than
another. Accuracy can only really be compared fairly if two methods
are trying to get to the same answer.

4.2 Recursive Discrete-Time Estimation


4.2.1 Problem Setup
[Figure 4.5: Markov process representation of the NLNG system described by (4.14).]

Just as in the chapter on linear-Gaussian estimation, we require a set of motion and observation models upon which to base our estimator. We will consider discrete-time, time-invariant equations, but this time we will allow nonlinear equations (we will return to continuous time at the end of this chapter). We define the following motion and observation models:
motion model: xk = f (xk−1 , vk , wk ) , k = 1 . . . K (4.14a)
observation model: yk = g (xk , nk ) , k = 0...K (4.14b)
where k is again the discrete-time index and K its maximum. The
function f (·) is the nonlinear motion model and the function g(·) is the
nonlinear observation model. The variables take on the same mean-
ings as in the linear-Gaussian chapter. For now we do not make any
assumption about any of the random variables being Gaussian.
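As a concrete, hypothetical instantiation of (4.14) (not from the text), the noise can enter inside the nonlinearities, e.g., a speed perturbation in the motion model and a depth perturbation inside a stereo-style observation model:

```python
f_cam, b = 400.0, 0.1  # focal length [pixel] and baseline [m] from (4.5)

def motion_model(x_prev, v, w, dt=0.1):
    # x_k = f(x_{k-1}, v_k, w_k): drive at commanded speed v, with the noise
    # w perturbing the speed, so it passes through the model rather than
    # simply being added to the state afterwards
    return x_prev + dt * (v + w)

def observation_model(x, n):
    # y_k = g(x_k, n_k): disparity measurement with the noise n perturbing
    # the depth inside the nonlinearity, not added to the disparity
    return f_cam * b / (x + n)
```

Setting w = 0 and n = 0 recovers the noise-free models; the additive-noise forms in (4.23) below cannot represent this example exactly.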
Figure 4.5 provides a graphical representation of the temporal evolu-
tion of the system described by (4.14). From this picture we can observe
a very important characteristic of the system, the Markov property:
In the simplest sense, a stochastic process has the Markov property if the conditional
probability density functions (PDFs) of future states of the process, given the present
state, depend only upon the present state, but not on any other past states, i.e.,
they are conditionally independent of these older states. Such a process is called
Markovian or a Markov process.

Our system is such a Markov process. For example, once we know


the value of xk−1 , we do not need to know the value of any previous
states in order to evolve the system forwards in time to compute xk .
This property was exploited fully in the section on linear-Gaussian
estimation. There it was assumed that we could employ this property
in our estimator design and this led to an elegant recursive estimator,
the Kalman filter. But what about NLNG systems? Can we still have
a recursive solution? The answer is yes, but only approximately. The
next few sections will examine this claim.

4.2.2 Bayes Filter


In the chapter on linear-Gaussian estimation, we started with a batch
estimation technique and worked down to the recursive Kalman filter.
In this section, we will start by deriving a recursive filter, the Bayes
filter (Jazwinski, 1970), and return to batch methods near the end
of the chapter. This order reflects the historical sequence of events in
the estimation world and will allow us to highlight exactly where the
limiting assumptions and approximations have been made.
The Bayes filter seeks to come up with an entire PDF to represent the
likelihood of the state, xk , using only measurements up to and including
the current time. Using our notation from before, we want to compute

p(xk |x̌0 , v1:k , y0:k ),   (4.15)


which is also sometimes called the belief for xk . Recall from the section

on factoring the batch linear-Gaussian solution that


p(xk|v, y) = p(xk|x̌0, v1:k, y0:k) p(xk|vk+1:K, yk+1:K),   (4.16)

where the first factor is the ‘forwards’ PDF and the second is the ‘backwards’ PDF.
Thus, in this section we will focus on turning the ‘forwards’ PDF into a
recursive filter (for nonlinear non-Gaussian systems). By employing the
independence of all the measurements2 , we may factor out the latest
measurement to have
p(xk |x̌0 , v1:k , y0:k ) = η p(yk |xk ) p(xk |x̌0 , v1:k , y0:k−1 ), (4.17)
where we have employed Bayes’ rule to reverse the dependence and η
serves to preserve the axiom of total probability. Turning our attention
to the second factor, we introduce the hidden state, xk−1 , and integrate
over all possible values:
p(xk|x̌0, v1:k, y0:k−1) = ∫ p(xk, xk−1|x̌0, v1:k, y0:k−1) dxk−1
    = ∫ p(xk|xk−1, x̌0, v1:k, y0:k−1) p(xk−1|x̌0, v1:k, y0:k−1) dxk−1.   (4.18)

The introduction of the hidden state can be viewed as the opposite


of marginalization. So far we have not introduced any approximations.
The next step is subtle and is the cause of many limitations in recursive
estimation. Since our system enjoys the Markov property, we use said
property (on the estimator) to say that
p(xk |xk−1 , x̌0 , v1:k , y0:k−1 ) = p(xk |xk−1 , vk ), (4.19a)
p(xk−1 |x̌0 , v1:k , y0:k−1 ) = p(xk−1 |x̌0 , v1:k−1 , y0:k−1 ), (4.19b)
which seems entirely reasonable given the depiction in Figure 4.5. How-
ever, we will come back to examine (4.19) later in this chapter. Substi-
tuting (4.19) and (4.18) into (4.17), we have the Bayes filter3 :

p(xk|x̌0, v1:k, y0:k) = η p(yk|xk) ∫ p(xk|xk−1, vk) p(xk−1|x̌0, v1:k−1, y0:k−1) dxk−1,   (4.20)

where the left-hand side is the posterior belief, p(yk|xk) provides the observation correction using g(·), p(xk|xk−1, vk) provides the motion prediction using f(·), and p(xk−1|x̌0, v1:k−1, y0:k−1) is the prior belief.
2 We will continue to assume all the measurements are statistically uncorrelated as in the
LG case.
3 There is a special case at the first timestep, k = 0, that only involves the observation
correction with y0 , but we omit it to avoid complexity. We assume the filter is
initialized with p(x0 |x̌0 , y0 ).
[Figure 4.6: Graphical depiction of the Bayes filter. The dashed line indicates that in practice a hint could be passed to the ‘correction step’ about the states in which the probability mass of the belief function is concentrated; this can be used to reduce the need to work out the likelihood of the observation, yk, having been produced for all possible states, xk.]

We can see that (4.20) takes on a predictor-corrector form. In the prediction step, the prior4 belief, p(xk−1 |x̌0 , v1:k−1 , y0:k−1 ), is propagated forwards in time using the input, vk , and the motion model, f (·). In the correction step, the predicted estimate is then updated using the measurement, yk , and the measurement model, g(·). The result is the posterior belief, p(xk |x̌0 , v1:k , y0:k ). Figure 4.6 provides a graphical depiction of the information flow in the Bayes filter. The important message to take away from these diagrams is that we require methods of passing PDFs through the nonlinear functions, f (·) and g(·).
passing PDFs through the nonlinear functions, f (·) and g(·)
Although exact, the Bayes filter is really nothing more than a math-
ematical artifact; it can never be implemented in practice, except for
the linear-Gaussian case. There are two primary reasons for this, and
as such we need to make appropriate approximations:

(i) Probability density functions live in an infinite-dimensional space


(as do all continuous functions) and as such an infinite amount
of memory (i.e., infinite number of parameters) would be needed
to completely represent the belief, p(xk |x̌0 , v1:k , y0:k ). To over-
come this memory issue, the belief is approximately represented.
One approach is to approximate this function as a Gaussian
PDF (i.e., keep track of the first two moments, mean and co-
variance). Another approach is to approximate the PDF using
a finite number of random samples. We will look into both of
these later on.
(ii) The integral in the Bayes filter is computationally very expen-
sive; it would require infinite computing resources to evaluate
exactly. To overcome this computational resource issue, the in-
tegral must be evaluated approximately. One approach is to
linearize the motion and observation models and then evalu-
4 To be clear, the Bayes filter is using Bayesian inference, but just at a single timestep.
The batch methods discussed in the previous chapter performed inference over the
whole trajectory at once. We will return to the batch situation later in this chapter.

ate the integrals in closed form. Another approach is to employ


Monte Carlo integration. We will look into both of these later
on as well.

Much of the research in recursive state estimation has focused on better


and better approximations to handle these two issues. Considerable
gains have been made that are worth examining in more detail. As
such, we will look at some of the classic and modern approaches to
approximate the Bayes filter in the next few sections. However, we
must keep in our minds the assumption on which the Bayes filter is
predicated: the Markov property. A question we must ask ourselves
is, what happens to the Markov property when we start making these
approximations to the Bayes filter? We will return to this later. For
now, let us assume it holds.
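For a low-dimensional state, the Bayes filter (4.20) can be implemented almost literally by discretizing the state onto a grid (a histogram filter). A sketch with simple made-up scalar models; the models, grid, and noise values are illustrative assumptions, not from the text:

```python
import numpy as np

x = np.linspace(0.0, 30.0, 300)   # discretized scalar state (finite-memory belief)
dx = x[1] - x[0]
Q, R = 0.1, 0.09                  # motion and measurement noise variances (assumed)

def gaussian(z, var):
    return np.exp(-0.5 * z ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bayes_filter_step(belief, v, y):
    # Prediction: integrate the motion kernel p(x_k | x_{k-1}, v_k) against the
    # prior belief; motion model x_k = x_{k-1} + v_k + w_k gives a Gaussian kernel
    trans = gaussian(x[:, None] - (x[None, :] + v), Q)   # trans[i, j] = p(x_i | x_j, v)
    predicted = trans @ belief * dx
    # Correction: multiply by the likelihood p(y_k | x_k) for a stereo-style
    # observation model y_k = 40/x_k + n_k, then renormalize (the eta factor)
    posterior = gaussian(y - 40.0 / np.maximum(x, 1e-6), R) * predicted
    return posterior / (posterior.sum() * dx)

belief = gaussian(x - 20.0, 9.0)                       # Gaussian prior belief
belief = bayes_filter_step(belief, 1.0, 40.0 / 21.0)   # one predict + correct cycle
```

Both approximations discussed above appear here explicitly: the belief is stored with finite memory (300 grid values), and the integral becomes a finite sum.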

4.2.3 Extended Kalman Filter


We now show that if the belief is constrained to be Gaussian, the noise
Gaussian, and we linearize the motion and observation models in order
to carry out the integral (and also the direct product) in the Bayes filter,
we arrive at the famous extended Kalman filter (EKF)5 . The EKF is
still the mainstay of estimation and data fusion in many circles, and
can often be effective for mildly nonlinear, non-Gaussian systems. For
a good reference on the EKF see Maybeck (1994).
[Margin note: Stanley F. Schmidt (1926–) is an American aerospace engineer who adapted Kalman's filter early on to estimate spacecraft trajectories on the Apollo program. It was his work that led to what is now called the extended Kalman filter.]

The EKF was a key tool used to estimate spacecraft trajectories on the NASA Apollo program. Shortly after Kalman's original paper (Kalman, 1960b) was published, he met with Stanley F. Schmidt of NASA Ames Research Center. Schmidt was impressed with Kalman's filter and his team went on to modify it to work for their task of spacecraft navigation. In particular, they (i) extended it to work for nonlinear motion and observation models, (ii) came up with the idea of linearizing about the best current estimate to reduce nonlinear effects, and (iii) reformulated the original filter to the now-standard separate prediction and correction steps (McGee and Schmidt, 1985). For these significant contributions, the EKF was sometimes called the Schmidt-Kalman filter, but this name has fallen out of favour due to confusion with another similarly named contribution later made by Schmidt (to account for unobservable biases while keeping state dimension low). Schmidt also went on to work on the square-root formulation of the EKF to improve numerical stability (Bierman, 1974). Later at Lockheed Missiles and Space Company, Schmidt's popularization of Kalman's work also inspired Charlotte Striebel to begin work on connecting the KF
also inspired Charlotte Striebel to begin work on connecting the KF
to other types of trajectory estimation, which ultimately led to the Rauch-Tung-Striebel smoother discussed in the previous chapter.

5 The EKF is called ‘extended’ because it is the extension of the Kalman filter to nonlinear systems.
To derive the EKF, we first limit (i.e., constrain) our belief function
for xk to be Gaussian:
 
p(xk|x̌0, v1:k, y0:k) → N(x̂k, P̂k),   (4.21)

where x̂k is the mean and P̂k the covariance. Next, we assume that the
noise variables, wk and nk (∀k), are in fact Gaussian as well:
wk ∼ N (0, Qk ), (4.22a)
nk ∼ N (0, Rk ). (4.22b)
Note, a Gaussian PDF can be transformed through a nonlinearity to
be non-Gaussian. In fact, we will look at this in more detail a bit later
in this chapter. We assume this is the case for the noise variables; in
other words, the nonlinear motion and observation models may affect
wk and nk . They are not necessarily added after the nonlinearities as
in
xk = f (xk−1 , vk ) + wk , (4.23a)
yk = g (xk ) + nk , (4.23b)
but rather appear inside the nonlinearities as in (4.14). Equations (4.23)
are in fact a special case of (4.14). However, we can recover additive
noise (approximately) through linearization, which we show next.
With g(·) and f (·) nonlinear, we still cannot compute the integral in
the Bayes filter in closed form so we turn to linearization. We linearize
the motion and observation models about the current state estimate
mean:
f(xk−1, vk, wk) ≈ x̌k + Fk−1(xk−1 − x̂k−1) + w′k,   (4.24a)
g(xk, nk) ≈ y̌k + Gk(xk − x̌k) + n′k,   (4.24b)
where

x̌k = f(x̂k−1, vk, 0),   Fk−1 = ∂f(xk−1, vk, wk)/∂xk−1 |(x̂k−1, vk, 0),   (4.25a)

w′k = ( ∂f(xk−1, vk, wk)/∂wk |(x̂k−1, vk, 0) ) wk,   (4.25b)

and

y̌k = g(x̌k, 0),   Gk = ∂g(xk, nk)/∂xk |(x̌k, 0),   (4.26a)

n′k = ( ∂g(xk, nk)/∂nk |(x̌k, 0) ) nk.   (4.26b)

From here the statistical properties of the current state, xk , given the
old state and latest input, are

xk ≈ x̌k + Fk−1(xk−1 − x̂k−1) + w′k,   (4.27a)

E[xk] ≈ x̌k + Fk−1(xk−1 − x̂k−1) + E[w′k], with E[w′k] = 0,   (4.27b)

E[(xk − E[xk])(xk − E[xk])^T] ≈ E[w′k w′k^T] = Q′k,   (4.27c)

p(xk|xk−1, vk) ≈ N( x̌k + Fk−1(xk−1 − x̂k−1), Q′k ).   (4.27d)

For the statistical properties of the current measurement, yk , given the


current state, we have

yk ≈ y̌k + Gk(xk − x̌k) + n′k,   (4.28a)

E[yk] ≈ y̌k + Gk(xk − x̌k) + E[n′k], with E[n′k] = 0,   (4.28b)

E[(yk − E[yk])(yk − E[yk])^T] ≈ E[n′k n′k^T] = R′k,   (4.28c)

p(yk|xk) ≈ N( y̌k + Gk(xk − x̌k), R′k ).   (4.28d)

Substituting in these results, the Bayes filter becomes

p(xk|x̌0, v1:k, y0:k) = η p(yk|xk) ∫ p(xk|xk−1, vk) p(xk−1|x̌0, v1:k−1, y0:k−1) dxk−1,   (4.29)

where the posterior on the left is N(x̂k, P̂k), p(yk|xk) is N(y̌k + Gk(xk − x̌k), R′k), p(xk|xk−1, vk) is N(x̌k + Fk−1(xk−1 − x̂k−1), Q′k), and the prior belief is N(x̂k−1, P̂k−1).

Using our identity for Gaussian inference (2.82), we can see that the
integral is also Gaussian:

p(xk|x̌0, v1:k, y0:k) = η p(yk|xk) ∫ p(xk|xk−1, vk) p(xk−1|x̌0, v1:k−1, y0:k−1) dxk−1,   (4.30)

where the integral now evaluates to N(x̌k, Fk−1 P̂k−1 Fk−1^T + Q′k).

We are now left with the direct product of two Gaussian PDFs, which
we also discussed previously. Applying (2.63), we find that

p(xk|x̌0, v1:k, y0:k) = η p(yk|xk) ∫ p(xk|xk−1, vk) p(xk−1|x̌0, v1:k−1, y0:k−1) dxk−1
    = N( x̌k + Kk(yk − y̌k), (1 − Kk Gk)(Fk−1 P̂k−1 Fk−1^T + Q′k) ),   (4.31)

where Kk is known as the Kalman gain matrix (given below). Getting


to this last line takes quite a bit of tedious algebra and is left to the
reader. Comparing left and right sides of our posterior expression above
we have

predictor:     P̌k = Fk−1 P̂k−1 Fk−1^T + Q′k,   (4.32a)
               x̌k = f(x̂k−1, vk, 0),   (4.32b)
Kalman gain:   Kk = P̌k Gk^T (Gk P̌k Gk^T + R′k)⁻¹,   (4.32c)
corrector:     P̂k = (1 − Kk Gk) P̌k,   (4.32d)
               x̂k = x̌k + Kk (yk − g(x̌k, 0)),   (4.32e)

where the quantity yk − g(x̌k, 0) is called the innovation.

Equations (4.32) are known as the classic recursive update equations for the EKF. The update equations allow us to compute {x̂k, P̂k} from {x̂k−1, P̂k−1}. We notice immediately the similar structure to (3.120) for linear-Gaussian estimation. There are two main differences here:

(i) The nonlinear motion and observation models are used to prop-
agate the mean of our estimate.
(ii) There are Jacobians embedded in the Q′k and R′k covariances
for the noise. This comes from the fact that we allowed the
noise to be applied within the nonlinearities in (4.14).

It should be noted that there is no proof that the EKF will work in
general for any nonlinear system6 . In order to gauge the performance
of the EKF on a particular nonlinear system, it often comes down to
simply trying it out. The main problem with the EKF is that the operating point of the linearization is the mean of our estimate of the state, not
the true state. This seemingly small difference can cause the EKF to
diverge wildly in some cases. Sometimes the result is less dramatic,
with the estimate being biased or inconsistent, or most often, both.

6 To the best knowledge of the author.
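As a sketch, the classic EKF equations (4.32) for a scalar system; the specific motion and observation models here are illustrative assumptions in the spirit of the stereo example, with the Jacobians computed analytically:

```python
f_cam, b, dt = 400.0, 0.1, 0.1
Q, R = 0.01, 0.09  # variances of the noises w_k and n_k (assumed values)

def ekf_step(x_hat, P_hat, v, y):
    # Predictor, (4.32a)-(4.32b), for motion model x_k = x_{k-1} + dt*(v + w):
    F = 1.0                       # df/dx
    L = dt                        # df/dw, so the inflated covariance Q' = L Q L^T
    x_check = x_hat + dt * v
    P_check = F * P_hat * F + L * Q * L
    # Corrector, (4.32c)-(4.32e), for observation model y_k = f*b/x_k + n_k:
    G = -f_cam * b / x_check ** 2            # dg/dx evaluated at the predicted mean
    Rp = R                                   # dg/dn = 1 here, so R' = R
    K = P_check * G / (G * P_check * G + Rp) # Kalman gain
    x_new = x_check + K * (y - f_cam * b / x_check)  # innovation update
    P_new = (1.0 - K * G) * P_check
    return x_new, P_new

x_hat, P_hat = ekf_step(20.0, 9.0, 0.0, 40.0 / 22.0)  # one predict + correct cycle
```

Note how the Jacobians F and G are evaluated at the estimate mean, which is exactly the source of the bias and divergence issues discussed above.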



4.2.4 Generalized Gaussian Filter


The Bayes filter is appealing because it can be written out exactly.
We can then reach a number of implementable filters through different
approximations on the form of the estimated PDF and handling meth-
ods. There is, however, a cleaner approach to deriving those filters that
assume up front that the estimated PDF is Gaussian. We have actu-
ally already seen this in practice in Section 3.3.3, where we derived the
Kalman filter using Bayesian inference.
In general, we begin with a Gaussian prior at time k − 1:
 
p(xk−1|x̌0, v1:k−1, y0:k−1) = N(x̂k−1, P̂k−1).   (4.33)

We pass this forwards in time through the nonlinear motion model,


f (·), to propose a Gaussian prior at time k:

p(xk|x̌0, v1:k, y0:k−1) = N(x̌k, P̌k).   (4.34)
This is the prediction step and incorporates the latest input, vk .
For the correction step, we employ the method from Section 2.2.3
and write a joint Gaussian for the state and latest measurement, at
time k:
   
p(xk, yk|x̌0, v1:k, y0:k−1) = N( [µx,k; µy,k], [Σxx,k Σxy,k; Σyx,k Σyy,k] ).   (4.35)
We then write the conditional Gaussian density for xk (i.e., the poste-
rior) directly as

p(xk|x̌0, v1:k, y0:k) = N( µx,k + Σxy,k Σyy,k⁻¹ (yk − µy,k),  Σxx,k − Σxy,k Σyy,k⁻¹ Σyx,k ),   (4.36)

where we have defined x̂k as the mean and P̂k as the covariance. The
nonlinear observation model, g(·), is used in the computation of µy,k .
From here, we can write the generalized Gaussian correction-step equa-
tions as
Kk = Σxy,k Σyy,k⁻¹,   (4.37a)
P̂k = P̌k − Kk Σxy,k^T,   (4.37b)
x̂k = x̌k + Kk (yk − µy,k),   (4.37c)

where we have let µx,k = x̌k , Σxx,k = P̌k , and Kk is still known as the
Kalman gain. Unfortunately, unless the motion and observation models
are linear, we cannot compute all the remaining quantities required
exactly: µy,k , Σyy,k , and Σxy,k . This is because putting a Gaussian PDF
into a nonlinearity generally results in something non-Gaussian coming

out the other end. We therefore need to consider approximations at this


stage.
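However the moments µy,k, Σyy,k, and Σxy,k are approximated, the correction step (4.37) itself is only a few matrix operations; a generic sketch:

```python
import numpy as np

def gaussian_correction(x_check, P_check, y, mu_y, Sigma_yy, Sigma_xy):
    # Generalized Gaussian correction step, equations (4.37); the moments
    # mu_y, Sigma_yy, Sigma_xy come from some approximation of passing the
    # predicted Gaussian through g(.) (linearization, sigmapoints, ...)
    K = Sigma_xy @ np.linalg.inv(Sigma_yy)   # Kalman gain, (4.37a)
    P_hat = P_check - K @ Sigma_xy.T         # posterior covariance, (4.37b)
    x_hat = x_check + K @ (y - mu_y)         # posterior mean, (4.37c)
    return x_hat, P_hat
```

With linear models and exact moments, this reduces to the Kalman filter correction step.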
The next section will revisit linearizing the motion and observation
models, to complete this cleaner derivation of the EKF. After that,
we will discuss other methods of passing PDFs through nonlinearities,
which lead to other flavours of the Bayes and Kalman filters.

4.2.5 Iterated Extended Kalman Filter


Continuing on from the last section, we complete our alternate deriva-
tion of the iterated extended Kalman filter (IEKF). The prediction step
is fairly straightforward and essentially the same as Section 4.2.3. We
therefore omit it but note that the prior, at time k, is

p(xk|x̌0, v1:k, y0:k−1) = N(x̌k, P̌k),   (4.38)

which incorporates the latest input, vk.
The correction step is where things become a little more interesting.
Our nonlinear measurement model is given by
yk = g(xk , nk ). (4.39)
We linearize about an arbitrary operating point, xop,k :
g(xk, nk) ≈ yop,k + Gk(xk − xop,k) + n′k,   (4.40)
where

yop,k = g(xop,k, 0),   Gk = ∂g(xk, nk)/∂xk |(xop,k, 0),   (4.41a)

n′k = ( ∂g(xk, nk)/∂nk |(xop,k, 0) ) nk.   (4.41b)

Note, the observation model and Jacobians are evaluated at xop,k .


Using this linearized model, we can then express the joint density for
the state and the measurement at time k as approximately Gaussian:
   
p(xk, yk|x̌0, v1:k, y0:k−1) ≈ N( [µx,k; µy,k], [Σxx,k Σxy,k; Σyx,k Σyy,k] )
    = N( [x̌k; yop,k + Gk(x̌k − xop,k)], [P̌k, P̌k Gk^T; Gk P̌k, Gk P̌k Gk^T + R′k] ).   (4.42)
Once again, if the measurement, yk , is known, we can use (2.46b) to
write the Gaussian conditional density for xk (i.e., the posterior) as

p(xk|x̌0, v1:k, y0:k) = N( µx,k + Σxy,k Σyy,k⁻¹ (yk − µy,k),  Σxx,k − Σxy,k Σyy,k⁻¹ Σyx,k ),   (4.43)

where again we have defined x̂k as the mean and P̂k as the covariance.
As shown in the last section, the generalized Gaussian correction-step
equations are
Kk = Σxy,k Σyy,k⁻¹,   (4.44a)
P̂k = P̌k − Kk Σxy,k^T,   (4.44b)
x̂k = x̌k + Kk (yk − µy,k).   (4.44c)
Substituting in the moments, µy,k , Σyy,k , and Σxy,k , from above we
have
Kk = P̌k Gk^T (Gk P̌k Gk^T + R′k)⁻¹,   (4.45a)
P̂k = (1 − Kk Gk) P̌k,   (4.45b)
x̂k = x̌k + Kk (yk − yop,k − Gk(x̌k − xop,k)).   (4.45c)
These equations are very similar to the Kalman gain and corrector
equations in (4.32); the only difference is the operating point of the
linearization. If we set the operating point of the linearization to be
the mean of the predicted prior, xop,k = x̌k , then (4.45) and (4.32) are
identical.
However, it turns out that we can do much better if we iteratively
recompute (4.45), each time setting the operating point to be the mean
of the posterior at the last iteration:
xop,k ← x̂k . (4.46)
At the first iteration we take xop,k = x̌k . This allows us to be linearizing
about better and better estimates, thereby improving our approxima-
tion each iteration. We terminate the process when the change to xop,k
from one iteration to the next is sufficiently small. Note, the covariance
equation need only be computed once, after the other two equations
converge.
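The iteration of (4.45) with the update (4.46) is straightforward to implement; a sketch of the correction step for the scalar stereo model, with parameter values assumed from (4.5) and the Jacobian G recomputed at each operating point:

```python
f_cam, b, R = 400.0, 0.1, 0.09

def iekf_correct(x_check, P_check, y, tol=1e-8, max_iter=50):
    x_op = x_check  # first iteration linearizes about the predicted mean
    for _ in range(max_iter):
        G = -f_cam * b / x_op ** 2                # dg/dx at the operating point
        K = P_check * G / (G * P_check * G + R)   # (4.45a)
        # (4.45c): note the extra -G*(x_check - x_op) term relative to the EKF
        x_hat = x_check + K * (y - f_cam * b / x_op - G * (x_check - x_op))
        if abs(x_hat - x_op) < tol:
            break
        x_op = x_hat   # relinearize about the latest posterior mean, (4.46)
    P_hat = (1.0 - K * G) * P_check   # covariance computed once at the end, (4.45b)
    return x_hat, P_hat
```

On this mildly nonlinear example the iteration converges in a few steps, and the converged mean is the MAP estimate discussed in the next section.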

4.2.6 IEKF is a MAP Estimator


A great question to ask at this point is, what is the relationship between
the EKF/ IEKF estimate and the full Bayesian posterior? It turns out
that the IEKF estimate corresponds to a (local) maximum of the full
posterior7 ; in other words, it is a MAP estimate. On the other hand,
since the EKF is not iterated, it can be very far from a local maximum;
there is actually very little we can say about its relationship to the full
posterior.
These relations are illustrated in Figure 4.7, where we compare the
correction steps of the IEKF and EKF to the full Bayesian posterior on
7 To be clear, this is only true for the correction step at a single timestep.
our stereo camera example introduced in Section 4.1.2.

[Figure 4.7: Stereo camera example, comparing the inference (i.e., ‘corrective’) step of the EKF and IEKF to the full Bayesian posterior, p(x|y). We see that the mean of the IEKF matches up against the MAP solution, x̂map, while the EKF does not. The actual mean of the posterior is denoted x̄.]

In this version of the example, we used

xtrue = 26 [m],   ymeas = fb/xtrue − 0.6 [pixel],
to exaggerate the difference between the methods. As discussed above,
the mean of the IEKF corresponds to the MAP (i.e., mode) solution,
while the EKF is not easily relatable to the full posterior.
To understand why the IEKF is the same as the MAP estimate,
we require some optimization tools that we will introduce later in the
chapter. For now, the important take-away message from this section is
that our choice to iteratively relinearize about our best guess leads to a
MAP solution. Thus, the ‘mean’ of our IEKF Gaussian estimator does
not actually match the mean of the full Bayesian posterior; it matches
the mode.

4.2.7 Alternatives for Passing PDFs through Nonlinearities


In our derivation of the EKF/ IEKF, we used one particular technique
to pass a PDF through a nonlinearity. Specifically, we linearized the
nonlinear model about an operating point and then passed our Gaus-
sian PDFs through the linearized model analytically. This is certainly
one approach, but there are others. This section will discuss three com-
mon approaches: the Monte Carlo method (brute force), linearization
(as in the EKF), and the sigmapoint or unscented8 transformation. Our
motivation is to introduce some tools that can be used within our Bayes
filter framework to derive alternatives to the EKF/IEKF.
8 This name lives on in the literature; apparently, Simon Julier named it after an
unscented deodorant to make the point that often we take names for granted without
knowing their origins.
106 Nonlinear Non-Gaussian Estimation

Figure 4.8 Monte Carlo method to transform a PDF through a nonlinearity, y = g(x). A large number of random samples are drawn from the input density, passed through the nonlinearity, and then used to form the output density.

Monte Carlo Method


The Monte Carlo method of transforming a PDF through a nonlinear-
ity is essentially the ‘brute force’ approach. The process is depicted in
Figure 4.8. We draw a large number of samples from the input density,
transform each one of these samples through the nonlinearity exactly,
and then build the output density from the transformed samples (e.g.,
by computing the statistical moments). Loosely, the law of large num-
bers ensures this procedure will converge to the correct answer as the
number of samples used approaches infinity.
The obvious problem with this method is that it can be terribly
inefficient, particularly in higher dimensions. Aside from this obvious
disadvantage, there are actually some advantages to this method:
(i) It works with any PDF, not just Gaussian.
(ii) It handles any type of nonlinearity (there is no requirement that
it be differentiable or even continuous).
(iii) We do not need to know the mathematical form of the nonlinear
function – in practice the nonlinearity could be any software
function.
(iv) It is an ‘anytime’ algorithm – we can easily trade off accuracy
against speed by choosing the number of samples appropriately.
Due to the fact that we can make this method highly accurate, we can
also use it to gauge the performance of other methods.
The other point worth mentioning at this stage is that the mean of
the output density is not the same as the mean of the input density
after being passed through the nonlinearity. This can be seen by way
of a simple example. Consider the input density for x to be uniform
over the interval [0, 1]; in other words, p(x) = 1, x ∈ [0, 1]. Let the
nonlinearity be y = x². The mean of the input is µx = 1/2 and
passing this through the nonlinearity gives µy = 1/4. However, the
actual mean of the output is µy = ∫₀¹ p(x) x² dx = 1/3. Similar things
happen to the higher statistical moments. The Monte Carlo method is
able to approach the correct answer with a large number of samples,
but as we will see, some of the other methods cannot.

Figure 4.9 One-dimensional Gaussian PDF transformed through a deterministic nonlinear function, g(·). Here we linearize the nonlinearity in order to propagate the variance approximately: µy = g(µx), δy ≈ (∂g(x)/∂x)|x=µx δx = a δx, so that σy² = E[δy²] = a² E[δx²] = a² σx².
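As a quick numerical check of the uniform example above, the brute-force Monte Carlo procedure can be sketched in a few lines (an illustrative sketch of ours, not from the text; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples from p(x) = 1 on [0, 1], pass each through y = x^2,
# and form the output moments from the transformed samples.
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x ** 2

mu_y = y.mean()    # tends to 1/3, not (1/2)^2 = 1/4
var_y = y.var()
```

With enough samples, mu_y approaches the true output mean of 1/3 rather than the naively transformed input mean of 1/4.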

Linearization
The most popular method of transforming a Gaussian PDF through a
nonlinearity is linearization, which we have already used to derive the
EKF/ IEKF. Technically, the mean is actually passed through the non-
linearity exactly, while the covariance is approximately passed through
a linearized version of the function. Typically, the operating point of
the linearization process is the mean of the PDF. This procedure is
depicted in Figure 4.9 (repeat of Figure 2.5 for convenience). This pro-
cedure is highly inaccurate for the following reasons:
(i) The outcome of passing a Gaussian PDF through a nonlin-
ear function will not be another Gaussian PDF. By keeping
only the mean and covariance of the posterior PDF, we are ap-
proximating the posterior (by throwing away higher statistical
moments).
(ii) We are approximating the covariance of the true output PDF
by linearizing the nonlinear function.
(iii) The operating point about which we linearize the nonlinear
function is often not the true mean of the prior PDF, but rather
our estimate of the mean of the input PDF. This is an approx-
imation that introduces error.
(iv) We are approximating the mean of the true output PDF by
simply passing the mean of the prior PDF through the nonlinear
function. This does not represent the true mean of the output.
Another disadvantage of linearization is that we need to be able to ei-
ther calculate the Jacobian of the nonlinearity in closed form, or com-
pute it numerically (which introduces yet another approximation).
Despite all these approximations and disadvantages, if the function
is only slightly nonlinear, and the input PDF is Gaussian, the lineariza-
tion method is very simple to understand and quick to implement. One
advantage9 is that the procedure is actually reversible (if the nonlinear-
ity is locally invertible). That is, we can recover the input PDF exactly
9 It might be more accurate to say this is a by-product than an advantage, since it is a
direct result of the specific approximations made in linearization.
Figure 4.10 One-dimensional Gaussian PDF transformed through a deterministic nonlinear function, g(·). Here the basic sigmapoint transformation is used in which only two deterministic samples (one on either side of the mean) approximate the input density: µy = (g(µx − σx) + g(µx + σx))/2 and σy² = (g(µx − σx) − g(µx + σx))²/4.

by passing the output PDF through the inverse of the nonlinearity


(using the same linearization procedure). This is not true for all meth-
ods of passing PDFs through nonlinearities since they do not all make
the same approximations as linearization. For example, the sigmapoint
transformation is not reversible in this way.

Sigmapoint Transformation
In a sense, the sigmapoint (SP) or unscented transformation (Julier
and Uhlmann, 1996) is the compromise between the Monte Carlo and
linearization methods when the input density is roughly a Gaussian
PDF. It is more accurate than linearization, but for a comparable com-
putational cost to linearization. Monte Carlo is still the most accurate
method, but the computational cost is prohibitive in most situations.
It is actually a bit misleading to refer to ‘the’ sigmapoint transfor-
mation as there is actually a whole family of such transformations. Fig-
ure 4.10 depicts the very simplest version in one dimension. In general,
a version of the SP transformation is used that includes one additional
sample beyond the basic version at the mean of the input density. The
steps are as follows:

1. A set of 2L + 1 sigmapoints is computed from the input density,


N (µx , Σxx ), according to

LLᵀ = Σxx,  (Cholesky decomposition, L lower-triangular)  (4.47a)
x0 = µx,  (4.47b)
xi = µx + √(L + κ) coli L,  i = 1 . . . L  (4.47c)
xi+L = µx − √(L + κ) coli L,  i = 1 . . . L  (4.47d)
where L = dim(µx). We note that

µx = ∑_{i=0}^{2L} αi xi,  (4.48a)
Σxx = ∑_{i=0}^{2L} αi (xi − µx)(xi − µx)ᵀ,  (4.48b)

where

αi = { κ/(L + κ),        i = 0
     { (1/2) · 1/(L + κ),  otherwise,  (4.49)
which we note sums to 1. The user-definable parameter, κ, will be
explained in the next section.
2. Each of the sigmapoints is individually passed through the nonlinearity, g(·):

yi = g(xi),  i = 0 . . . 2L.  (4.50)
3. The mean of the output density, µy, is computed as

µy = ∑_{i=0}^{2L} αi yi.  (4.51)

4. The covariance of the output density, Σyy, is computed as

Σyy = ∑_{i=0}^{2L} αi (yi − µy)(yi − µy)ᵀ.  (4.52)

5. The output density, N(µy, Σyy), is returned.
This method of transforming a PDF through a nonlinearity has a num-
ber of advantages over linearization:
(i) By approximating the input density instead of linearizing, we
avoid the need to compute the Jacobian of the nonlinearity
(either in closed form or numerically). Figure 4.11 provides an
example of the sigmapoints for a two-dimensional Gaussian.
(ii) We employ only standard linear algebra operations (Cholesky
decomposition, outer products, matrix summations).
(iii) The computation cost is similar to linearization (when a nu-
merical Jacobian is used).
(iv) There is no requirement that the nonlinearity be smooth and
differentiable.
The next section will furthermore show that the unscented transfor-
mation can also more accurately capture the posterior density than
linearization (by way of an example).
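The five steps above can be collected into a short function. This is an illustrative NumPy sketch of ours (the function name and interface are assumptions, not from the text):

```python
import numpy as np

def sigmapoint_transform(mu_x, Sigma_xx, g, kappa=2.0):
    """Pass N(mu_x, Sigma_xx) through y = g(x) via the sigmapoint
    transformation, following steps (4.47)-(4.52)."""
    L = mu_x.size
    Lc = np.linalg.cholesky(Sigma_xx)        # L L^T = Sigma_xx, (4.47a)
    s = np.sqrt(L + kappa)
    # 2L + 1 sigmapoints: the mean, then +/- scaled Cholesky columns
    X = [mu_x] + [mu_x + s * Lc[:, i] for i in range(L)] \
               + [mu_x - s * Lc[:, i] for i in range(L)]
    alpha = np.array([kappa / (L + kappa)] + [0.5 / (L + kappa)] * (2 * L))
    Y = np.array([g(x) for x in X])          # (4.50)
    mu_y = alpha @ Y                         # (4.51)
    dY = Y - mu_y
    Sigma_yy = (alpha[:, None] * dY).T @ dY  # (4.52)
    return mu_y, Sigma_yy
```

For the one-dimensional nonlinearity y = x² with κ = 2 (see Example 4.1 below), this reproduces µy = µx² + σx² and σy² = 4µx²σx² + κσx⁴ exactly.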
Figure 4.11 Two-dimensional (L = 2) Gaussian PDF, whose covariance is displayed using elliptical equiprobable contours of 1, 2, and 3 standard deviations, and the corresponding 2L + 1 = 5 sigmapoints for κ = 2. Here Σ = LLᵀ with L = [ℓ1 ℓ2], x0 = µ, x1,2 = µ + √(L + κ) ℓ1,2, and x3,4 = µ − √(L + κ) ℓ1,2.

Example 4.1 We will use a simple one-dimensional nonlinearity,


f (x) = x2 , as an example and compare the various transformation
methods. Let the prior density be N (µx , σx2 ).

Monte Carlo Method


In fact, for this particular nonlinearity, we can essentially use the
Monte Carlo method in closed form (i.e., we do not actually draw any
samples) to get the exact answer for transforming the input density
through the nonlinearity. An arbitrary sample (a.k.a., realization) of
the input density is given by

xi = µx + δxi , δxi ← N (0, σx2 ). (4.53)

Transforming this sample through the nonlinearity we get

yi = f (xi ) = f (µx + δxi ) = (µx + δxi )2 = µ2x + 2µx δxi + δx2i . (4.54)

Taking the expectation of both sides we arrive at the mean of the


output:

 
µy = E[yi] = µx² + 2µx E[δxi] + E[δxi²] = µx² + σx²,  (4.55)

since E[δxi] = 0 and E[δxi²] = σx².
We do a similar thing for the variance of the output:

σy² = E[(yi − µy)²]  (4.56a)
    = E[(2µx δxi + δxi² − σx²)²]  (4.56b)
    = E[δxi⁴] + 4µx E[δxi³] + (4µx² − 2σx²) E[δxi²] − 4µx σx² E[δxi] + σx⁴  (4.56c)
    = 4µx² σx² + 2σx⁴,  (4.56d)

where E[δxi³] = 0 and E[δxi⁴] = 3σx⁴ are the well-known third and
fourth moments for a Gaussian PDF.
In truth, the resulting output density is not Gaussian. We could go
on to compute higher moments of the output (and they would not all
match a Gaussian). However, if we want to approximate the output
as Gaussian by not considering the moments beyond the variance, we
can. In this case, the resulting output density is N(µy, σy²). We have
effectively used the Monte Carlo method with an infinite number of
samples to carry out the computation of the first two moments of the
posterior exactly in closed form. Let us now see how linearization and
the sigmapoint transformation perform.

Linearization
Linearizing the nonlinearity about the mean of the input density we
have

yi = f(µx + δxi) ≈ f(µx) + (∂f/∂x)|µx δxi = µx² + 2µx δxi.  (4.57)

Taking expectation we arrive at the mean of the output:

µy = E[yi] = µx² + 2µx E[δxi] = µx²,  (4.58)

which is just the mean of the input passed through the nonlinearity:
µy = f (µx ). For the variance of the output we have
   
σy² = E[(yi − µy)²] = E[(2µx δxi)²] = 4µx² σx².  (4.59)

Comparing (4.55) with (4.58), and (4.56) with (4.59), we see there
are some discrepancies. In fact, the linearized mean has a bias and the
variance is too small (i.e., overconfident). Let us see what happens with
the sigmapoint transformation.
Sigmapoint Transformation
There are 2L + 1 = 3 sigmapoints in dimension L = 1:

x0 = µx,  x1 = µx + √(1 + κ) σx,  x2 = µx − √(1 + κ) σx,  (4.60)

where κ is a user-definable parameter that we discuss below. We pass
each sigmapoint through the nonlinearity:

y0 = f(x0) = µx²,  (4.61a)
y1 = f(x1) = (µx + √(1 + κ) σx)² = µx² + 2µx √(1 + κ) σx + (1 + κ) σx²,  (4.61b)
y2 = f(x2) = (µx − √(1 + κ) σx)² = µx² − 2µx √(1 + κ) σx + (1 + κ) σx².  (4.61c)
The mean of the output is given by

µy = (1/(1 + κ)) (κ y0 + (1/2) ∑_{i=1}^{2} yi)  (4.62a)
   = (1/(1 + κ)) (κ µx² + (1/2)(µx² + 2µx √(1 + κ) σx + (1 + κ) σx²
       + µx² − 2µx √(1 + κ) σx + (1 + κ) σx²))  (4.62b)
   = (1/(1 + κ)) (κ µx² + µx² + (1 + κ) σx²)  (4.62c)
   = µx² + σx²,  (4.62d)
which is independent of κ and exactly the same as (4.55). For the
variance we have

σy² = (1/(1 + κ)) (κ (y0 − µy)² + (1/2) ∑_{i=1}^{2} (yi − µy)²)  (4.63a)
    = (1/(1 + κ)) (κ σx⁴ + (1/2)((2µx √(1 + κ) σx + κ σx²)²
        + (−2µx √(1 + κ) σx + κ σx²)²))  (4.63b)
    = (1/(1 + κ)) (κ σx⁴ + 4(1 + κ) µx² σx² + κ² σx⁴)  (4.63c)
    = 4µx² σx² + κ σx⁴,  (4.63d)
which can be made to be identical to (4.56) by selecting the user-
definable parameter, κ, to be 2. Thus, for this nonlinearity, the un-
scented transformation can exactly capture the correct mean and vari-
ance of the output.
Figure 4.12 Graphical depiction of passing a Gaussian PDF, p(x) = N(5, (3/2)²), through the nonlinearity, y = x², using various methods. We see that the Monte Carlo and sigmapoint methods match the true mean while linearization does not. We also show the exact transformed PDF, which is not Gaussian and therefore does not have its mean at its mode.

To understand why we should pick κ = 2, we need look no further
than the input density. The parameter κ scales how far away the
sigmapoints are from the mean. This does not affect the first three
moments of the sigmapoints (i.e., µx, σx², and the zero skewness).
However, changing κ does influence the fourth moment, kurtosis. We
already used the fact that for a Gaussian PDF, the fourth moment is
3σx⁴. We can choose κ to make the fourth moment of the sigmapoints
match the true kurtosis of the Gaussian prior density:
kurtosis of the Gaussian prior density:

 
3σx⁴ = (1/(1 + κ)) (κ (x0 − µx)⁴ + (1/2) ∑_{i=1}^{2} (xi − µx)⁴)  (4.64a)
     = (1/(2(1 + κ))) ((√(1 + κ) σx)⁴ + (−√(1 + κ) σx)⁴)  (4.64b)
     = (1 + κ) σx⁴,  (4.64c)

where we note that (x0 − µx)⁴ = 0.

Comparing the desired and actual kurtosis, we should pick κ = 2 to
make them match exactly. Not surprisingly, this has a positive effect
on the accuracy of the transformation.
In summary, this example shows that linearization is an inferior
method of transforming a PDF through a nonlinearity if the goal is
to capture the true mean of the output. Figure 4.12 provides a graph-
ical depiction of this example.
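The closed-form moments derived in this example can also be verified by brute force (an illustrative sketch of ours; the prior parameters, sample size, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x, sigma_x = 5.0, 1.5   # illustrative Gaussian prior N(mu_x, sigma_x^2)

# Sample the Gaussian prior, pass through f(x) = x^2, and compare the
# empirical moments with the closed-form results of the example.
y = rng.normal(mu_x, sigma_x, size=2_000_000) ** 2

mu_y_exact = mu_x**2 + sigma_x**2                        # (4.55)
var_y_exact = 4 * mu_x**2 * sigma_x**2 + 2 * sigma_x**4  # (4.56)
mu_y_lin = mu_x**2                                       # (4.58), biased
```

The empirical mean and variance agree with (4.55) and (4.56), while the linearized mean is biased by exactly σx².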

In the next few sections, we return to the Bayes filter and use our
new knowledge about the different methods of passing PDFs through
nonlinearities to make some useful improvements to the EKF. We will
begin with the particle filter, which makes use of the Monte Carlo
method. We will then try to implement a Gaussian filter using the SP
transformation.

4.2.8 Particle Filter


We have seen above that drawing a large number of samples is one
way to approximate a PDF. We further saw that we could pass each
sample through a nonlinearity and recombine them on the other side to
get an approximation of the transformation of a PDF. In this section,
we extend this idea to an approximation of the Bayes filter, called the
particle filter (Thrun et al., 2001).
The particle filter is one of the only practical techniques able to han-
dle non-Gaussian noise and nonlinear observation and motion models.
It is practical in that it is very easy to implement; we do not even need
to have analytical expressions for f (·) and g(·), nor their derivatives.
There are actually many different flavours of the particle filter; we
will outline a basic version and indicate where the variations typically
occur. The approach taken here is based on sample importance resam-
pling where the so-called proposal PDF is the prior PDF in the Bayes
filter, propagated forward using the motion model and the latest motion
measurement, vk . This version of the particle filter is sometimes called
the bootstrap algorithm, the condensation algorithm, or the survival-of-
the-fittest algorithm.

PF Algorithm
Using the notation from the section on the Bayes filter, the main steps
in the particle filter are as follows:
1. Draw M samples from the joint density comprised of the prior and
the motion noise:
 
[x̂k−1,m; wk,m] ← p(xk−1|x̌0, v1:k−1, y1:k−1) p(wk),  (4.65)
where m is the unique particle index. In practice we can just draw
from each factor of this joint density separately.
2. Generate a prediction of the posterior PDF by using vk . This is done
by passing each prior particle/noise sample through the nonlinear
motion model:
x̌k,m = f (x̂k−1,m , vk , wk,m ) . (4.66)
These new ‘predicted particles’ together approximate the density,
p (xk |x̌0 , v1:k , y1:k−1 ).
3. Correct the posterior PDF by incorporating yk . This is done indi-
rectly in two steps:
– First, assign a scalar weight, wk,m , to each predicted particle based
on the divergence between the desired posterior and the predicted
posterior for each particle:
wk,m = p(x̌k,m|x̌0, v1:k, y1:k) / p(x̌k,m|x̌0, v1:k, y1:k−1) = η p(yk|x̌k,m),  (4.67)
Figure 4.13 Block-diagram representation of the particle filter.

where η is a normalization constant. This is typically accomplished


in practice by simulating an expected sensor reading, y̌k,m , using
the nonlinear observation model:
y̌k,m = g (x̌k,m , 0) . (4.68)
We then assume p (yk |x̌k,m ) = p (yk |y̌k,m ), where the right-hand
side is a known density (e.g., Gaussian).
– Resample the posterior based on the weight assigned to each pre-
dicted posterior particle:
resample
x̂k,m ←− {x̌k,m , wk,m } . (4.69)
This can be done in several different ways. Madow provides a
simple systematic technique to do resampling, which we describe
below.
Figure 4.13 captures these steps in a block diagram.
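A single predict-correct cycle of this bootstrap filter might be sketched as follows for a hypothetical 1D model; the models, numbers, and the use of simple multinomial resampling via rng.choice (rather than the systematic scheme described later) are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 500  # number of particles

# Hypothetical 1D models (our own, not from the text):
#   motion:      x_k = x_{k-1} + v_k + w_k,   w_k ~ N(0, Q)
#   observation: y_k = x_k + n_k,             n_k ~ N(0, R)
Q, R = 0.1, 0.5
v_k, y_k = 1.0, 4.2   # latest input and measurement

# 1. draw samples from the prior and the motion noise
x_prev = rng.normal(3.0, 1.0, size=M)
w = rng.normal(0.0, np.sqrt(Q), size=M)

# 2. predict: pass each sample through the motion model
x_pred = x_prev + v_k + w

# 3. correct: weight each particle by the measurement likelihood
wgt = np.exp(-0.5 * (y_k - x_pred) ** 2 / R)
wgt /= wgt.sum()

# resample according to the weights (simple multinomial resampling here)
x_post = x_pred[rng.choice(M, size=M, p=wgt)]
```

The resampled particles x_post approximate the corrected posterior; their mean lies between the predicted mean and the measurement, as a Bayesian fusion should.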
There are some additional comments that should be made at this
point to help get this basic version of the particle filter working in
practical situations:
(i) How do we know how many particles to use? It depends very
much on the specific estimation problem. Typically hundreds
of particles are used for low-dimensional problems (e.g., x =
(x, y, θ)).
(ii) We can dynamically pick the number of particles online using a
heuristic such as ∑m wk,m ≥ wthresh, a threshold. In other words,
we keep adding particles/samples and repeating steps 1 through
4(a) until the sum of the weights exceeds an experimentally-
determined threshold.
(iii) We do not necessarily need to resample every time we go through
the algorithm. We can delay resampling, but then need to carry
the weights forward to the next iteration of the algorithm.
(iv) To be on the safe side, it is wise to add a small percentage of
samples in during Step 1 that are uniformly drawn from the
entire state sample space. This protects against outlier sensor


measurements / vehicle movements.
(v) For high-dimensional state estimation problems, the particle
filter can become computationally intractable. If too few par-
ticles are used, the densities involved are undersampled and
give highly skewed results. The number of samples needed goes
up exponentially with the dimension of the state space. Thrun
et al. (2001) offer some alternative flavours of the particle filter
to combat sample impoverishment.
(vi) The particle filter is an ‘anytime’ algorithm. That is, we can just
keep adding particles until we run out of time, then resample
and give an answer. Using more particles always helps, but
comes with a computational cost.
(vii) The Cramér-Rao lower bound (CRLB) is set by the uncertainty
in the measurements that we have available. Using more sam-
ples does not allow us to do better than the CRLB. See Sec-
tion 2.2.11 for some discussion of the CRLB.

Resampling
A key aspect of particle filters is the need to resample the posterior
density according to weights assigned to each current sample. One way
to do this is to use the systematic resampling method described by
Madow (1949). We assume we have M samples and each of these is
assigned an unnormalized weight, wm > 0. From the weights, we
create bins with boundaries, βm, according to

βm = (∑_{n=1}^{m} wn) / (∑_{ℓ=1}^{M} wℓ).  (4.70)

The βm define the boundaries of M bins on the interval [0, 1]:


0 ≤ β1 ≤ β2 ≤ . . . ≤ βM −1 ≤ 1,
where we note that we will have βM ≡ 1. We then select a random
number, ρ, sampled from a uniform density on [0, 1). For M iterations
we add to the new list of samples, the sample whose bin contains ρ.
At each iteration we step ρ forward by 1/M . The algorithm guarantees
that all bins whose size is greater than 1/M will have a sample in the
new list.
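This procedure can be sketched in a few lines (our own implementation; we draw ρ on [0, 1) and wrap the stepped pointers modulo 1, which is equivalent to the description above):

```python
import numpy as np

def systematic_resample(w, rng=None):
    """Madow-style systematic resampling: returns M particle indices
    given M unnormalized weights."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(w)
    beta = np.cumsum(w) / np.sum(w)                   # bin boundaries, (4.70)
    picks = (rng.uniform() + np.arange(M) / M) % 1.0  # rho stepped by 1/M
    return np.searchsorted(beta, picks)               # bin containing each pointer
```

As guaranteed by the method, every bin larger than 1/M receives at least one sample.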

4.2.9 Sigmapoint Kalman Filter


Another way we can attempt to improve on the basic EKF is to get
rid of the idea of linearizing altogether and instead use the sigmapoint
transformation to pass PDFs through the nonlinear motion and obser-
vation models. The result is the sigmapoint Kalman filter (SPKF), also
sometimes called the unscented Kalman filter (UKF). We will discuss


the prediction and correction steps separately10 :

Prediction Step
The prediction step is a fairly straightforward application of the sigmapoint
transformation since we are simply trying to bring our prior
forward in time through the motion model. We employ the following
steps to go from the prior belief, {x̂k−1, P̂k−1}, to the predicted belief,
{x̌k, P̌k}:

1. Both the prior belief and the motion noise have uncertainty so these
are stacked together in the following way:
   
µz = [x̂k−1; 0],  Σzz = [P̂k−1 0; 0 Qk],  (4.71)
where we see that {µz , Σzz } is still a Gaussian representation. We
let L = dim µz .
2. Convert {µz , Σzz } to a sigmapoint representation:
LLᵀ = Σzz,  (Cholesky decomposition, L lower-triangular)  (4.72a)
z0 = µz,  (4.72b)
zi = µz + √(L + κ) coli L,  i = 1 . . . L  (4.72c)
zi+L = µz − √(L + κ) coli L,  i = 1 . . . L.  (4.72d)

3. Unstack each sigmapoint into state and motion noise,


 

zi = [x̂k−1,i; wk,i],  (4.73)

and then pass each sigmapoint through the nonlinear motion model
exactly:

x̌k,i = f(x̂k−1,i, vk, wk,i),  i = 0 . . . 2L.  (4.74)
Note that the latest input, vk , is required.
4. Recombine the transformed sigmapoints into the predicted belief,
{x̌k, P̌k}, according to

x̌k = ∑_{i=0}^{2L} αi x̌k,i,  (4.75a)
P̌k = ∑_{i=0}^{2L} αi (x̌k,i − x̌k)(x̌k,i − x̌k)ᵀ,  (4.75b)

10 These are sometimes handled together in a single step, but we prefer to think of each of
these as a separate application of the sigmapoint transformation.
where

αi = { κ/(L + κ),        i = 0
     { (1/2) · 1/(L + κ),  otherwise.  (4.76)
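Steps 1 through 4 of the prediction step can be collected into one function (an illustrative NumPy sketch of ours; names and interface are assumptions):

```python
import numpy as np

def spkf_predict(x_hat, P_hat, v_k, f, Q, kappa=2.0):
    """One SPKF prediction step following (4.71)-(4.76); f(x, v, w) is the
    nonlinear motion model and Q the motion-noise covariance."""
    N, W = x_hat.size, Q.shape[0]
    # 1. stack state and motion noise
    mu_z = np.concatenate([x_hat, np.zeros(W)])
    Sigma_zz = np.block([[P_hat, np.zeros((N, W))],
                         [np.zeros((W, N)), Q]])
    # 2. convert to sigmapoints
    L = N + W
    Lc = np.linalg.cholesky(Sigma_zz)
    s = np.sqrt(L + kappa)
    Z = [mu_z] + [mu_z + s * Lc[:, i] for i in range(L)] \
               + [mu_z - s * Lc[:, i] for i in range(L)]
    alpha = np.array([kappa / (L + kappa)] + [0.5 / (L + kappa)] * (2 * L))
    # 3. unstack and pass each sigmapoint through the motion model
    X = np.array([f(z[:N], v_k, z[N:]) for z in Z])
    # 4. recombine into the predicted belief
    x_check = alpha @ X
    dX = X - x_check
    P_check = (alpha[:, None] * dX).T @ dX
    return x_check, P_check
```

For a linear motion model the sigmapoint recombination is exact, so the result matches the usual Kalman prediction, A P Aᵀ + Q.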
Next we will look at a second application of the sigmapoint transfor-
mation to implement the correction step.

Correction Step
This step is a little more complicated. We look back to Section 4.2.4 and
recall that the conditional Gaussian density for xk (i.e., the posterior)
is

p(xk|x̌0, v1:k, y0:k)
  = N( µx,k + Σxy,k Σyy,k⁻¹ (yk − µy,k),  Σxx,k − Σxy,k Σyy,k⁻¹ Σyx,k ),  (4.77)

where we have defined x̂k as the mean and P̂k as the covariance. In this
form, we can write the generalized Gaussian correction-step equations
as
Kk = Σxy,k Σyy,k⁻¹,  (4.78a)
P̂k = P̌k − Kk Σxy,kᵀ,  (4.78b)
x̂k = x̌k + Kk (yk − µy,k).  (4.78c)
We will use the SP transformation to come up with better versions of
µy,k , Σyy,k , and Σxy,k . We employ the following steps:

1. Both the predicted belief and the observation noise have uncertainty
so these are stacked together in the following way:
   
µz = [x̌k; 0],  Σzz = [P̌k 0; 0 Rk],  (4.79)
where we see that {µz , Σzz } is still a Gaussian representation. We
let L = dim µz .
2. Convert {µz , Σzz } to a sigmapoint representation:
LLᵀ = Σzz,  (Cholesky decomposition, L lower-triangular)  (4.80a)
z0 = µz,  (4.80b)
zi = µz + √(L + κ) coli L,  i = 1 . . . L  (4.80c)
zi+L = µz − √(L + κ) coli L,  i = 1 . . . L.  (4.80d)
3. Unstack each sigmapoint into state and observation noise,
 

zi = [x̌k,i; nk,i],  (4.81)

and then pass each sigmapoint through the nonlinear observation
model exactly:

y̌k,i = g(x̌k,i, nk,i).  (4.82)
4. Recombine the transformed sigmapoints into the desired moments:

µy,k = ∑_{i=0}^{2L} αi y̌k,i,  (4.83a)
Σyy,k = ∑_{i=0}^{2L} αi (y̌k,i − µy,k)(y̌k,i − µy,k)ᵀ,  (4.83b)
Σxy,k = ∑_{i=0}^{2L} αi (x̌k,i − x̌k)(y̌k,i − µy,k)ᵀ,  (4.83c)

where

αi = { κ/(L + κ),        i = 0
     { (1/2) · 1/(L + κ),  otherwise.  (4.84)
These are plugged into the generalized Gaussian correction-step
equations above to complete the correction step.
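The correction step, including the generalized Gaussian correction equations (4.78), can be sketched in the same style as the prediction step (again an illustrative implementation of ours, with assumed names):

```python
import numpy as np

def spkf_correct(x_check, P_check, y_k, g, R, kappa=2.0):
    """One SPKF correction step following (4.78)-(4.84); g(x, n) is the
    nonlinear observation model and R the observation-noise covariance."""
    N, D = x_check.size, R.shape[0]
    L = N + D
    # 1. stack predicted belief and observation noise
    mu_z = np.concatenate([x_check, np.zeros(D)])
    Sigma_zz = np.block([[P_check, np.zeros((N, D))],
                         [np.zeros((D, N)), R]])
    # 2. convert to sigmapoints
    Lc = np.linalg.cholesky(Sigma_zz)
    s = np.sqrt(L + kappa)
    Z = [mu_z] + [mu_z + s * Lc[:, i] for i in range(L)] \
               + [mu_z - s * Lc[:, i] for i in range(L)]
    alpha = np.array([kappa / (L + kappa)] + [0.5 / (L + kappa)] * (2 * L))
    # 3. unstack and pass each sigmapoint through the observation model
    Y = np.array([g(z[:N], z[N:]) for z in Z])
    # 4. recombine into the required moments
    mu_y = alpha @ Y
    dY = Y - mu_y
    dX = np.array([z[:N] for z in Z]) - x_check
    Sigma_yy = (alpha[:, None] * dY).T @ dY
    Sigma_xy = (alpha[:, None] * dX).T @ dY
    # generalized Gaussian correction (4.78)
    K = Sigma_xy @ np.linalg.inv(Sigma_yy)
    P_hat = P_check - K @ Sigma_xy.T
    x_hat = x_check + K @ (y_k - mu_y)
    return x_hat, P_hat
```

For a linear observation model this reduces exactly to the usual Kalman update, since the sigmapoint moments are exact in that case.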
One huge advantage of the SPKF is that it does not require any
analytical derivatives and uses only basic linear algebra operations in
the implementation. Moreover, we do not even need the nonlinear
motion and observation models in closed form; they could just be black-box
software functions.

Comparing Terms to EKF


We see in the correction step that the matrix Σyy,k takes on the role of
Gk P̌k Gkᵀ + R′k in the EKF. We can see this more directly by linearizing
the observation model (about the predicted state) as in the EKF:

y̌k,i = g(x̌k,i, nk,i) ≈ g(x̌k, 0) + Gk (x̌k,i − x̌k) + n′k,i.  (4.85)

Substituting this approximation into (4.83a) we can see that

y̌k,i − µy,k ≈ Gk (x̌k,i − x̌k) + n′k,i.  (4.86)

Substituting this into (4.83b) we have that

Σyy,k ≈ Gk (∑_{i=0}^{2L} αi (x̌k,i − x̌k)(x̌k,i − x̌k)ᵀ) Gkᵀ + ∑_{i=0}^{2L} αi n′k,i n′k,iᵀ
      + Gk ∑_{i=0}^{2L} αi (x̌k,i − x̌k) n′k,iᵀ + ∑_{i=0}^{2L} αi n′k,i (x̌k,i − x̌k)ᵀ Gkᵀ,  (4.87)

where the first bracketed sum is P̌k, the second sum is R′k, and the two
cross-term sums are zero owing to the block-diagonal structure of Σzz
above. For Σxy,k, by substituting our approximation into (4.83c), we have
Σxy,k ≈ ∑_{i=0}^{2L} αi (x̌k,i − x̌k)(x̌k,i − x̌k)ᵀ Gkᵀ + ∑_{i=0}^{2L} αi (x̌k,i − x̌k) n′k,iᵀ,  (4.88)

where the first sum is P̌k and the second is zero, so that

Kk = Σxy,k Σyy,k⁻¹ ≈ P̌k Gkᵀ (Gk P̌k Gkᵀ + R′k)⁻¹,  (4.89)

which is what we had in the EKF.

Special Case of Linear Dependence on Measurement Noise


In the case that our nonlinear observation model has the special form

yk = g(xk ) + nk , (4.90)

the SPKF correction step can be greatly sped up. Without loss of gen-
erality, we can break the sigmapoints into two categories based on the
block-diagonal partitioning in the matrix, Σzz , above; we say there
are 2N + 1 sigmapoints coming from the dimension of the state and
2(L − N ) additional sigmapoints coming from the dimension of the
measurements. To make this convenient, we will re-order the indexing
on the sigmapoints accordingly:

g (x̌k,j ) j = 0 . . . 2N
y̌k,j = . (4.91)
g (x̌k ) + nk,j j = 2N + 1 . . . 2L + 1

We can then write our expression for µy,k as

µy,k = ∑_{j=0}^{2N} αj y̌k,j + ∑_{j=2N+1}^{2L} αj y̌k,j  (4.92a)
     = ∑_{j=0}^{2N} αj y̌k,j + ∑_{j=2N+1}^{2L} αj (g(x̌k) + nk,j)  (4.92b)
     = ∑_{j=0}^{2N} αj y̌k,j + g(x̌k) ∑_{j=2N+1}^{2L} αj  (4.92c)
     = ∑_{j=0}^{2N} βj y̌k,j,  (4.92d)

(the noise terms vanish in (4.92c) since the noise sigmapoints come in
± pairs with equal weights, and the remaining g(x̌k) = y̌k,0 term is
absorbed into the j = 0 weight in (4.92d))
where

βi = { αi + ∑_{j=2N+1}^{2L} αj,  i = 0
     { αi,  otherwise  (4.93a)
   = { (κ + L − N)/(N + (κ + L − N)),      i = 0
     { (1/2) · 1/(N + (κ + L − N)),  otherwise.  (4.93b)

This is the same form as the original weights (and they still sum to 1).
We can then easily verify that

Σyy,k = ∑_{j=0}^{2N} βj (y̌k,j − µy,k)(y̌k,j − µy,k)ᵀ + Rk  (4.94)

with no approximation. This is already helpful in that we do not really


need all 2L + 1 sigmapoints but only 2N + 1 of them. This means we
do not need to call g(·) as many times, which can be expensive in some
situations.
We still have a problem, however. It is still necessary to invert Σyy,k ,
which is size (L − N ) × (L − N ), to compute the Kalman gain matrix.
If the number of measurements, L − N , is large, this could be very
expensive. We can make further gains if we assume that the inverse of
Rk can be computed cheaply. For example, if Rk = σ²1 then Rk⁻¹ = σ⁻²1.
We proceed by noting that Σyy,k can be conveniently written as

Σyy,k = Zk Zkᵀ + Rk,  (4.95)

where

colj Zk = √βj (y̌k,j − µy,k).  (4.96)

By the SMW identity from Section 2.2.7, we can then show that

Σyy,k⁻¹ = (Zk Zkᵀ + Rk)⁻¹  (4.97a)
        = Rk⁻¹ − Rk⁻¹ Zk (Zkᵀ Rk⁻¹ Zk + 1)⁻¹ Zkᵀ Rk⁻¹,  (4.97b)

where we now only need to invert the (2N + 1) × (2N + 1) matrix,
Zkᵀ Rk⁻¹ Zk + 1 (assuming Rk⁻¹ is known).
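A quick numerical check that the two forms of (4.97) agree (illustrative sketch; the dimensions and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
D, S = 6, 3            # measurement dimension (L - N) and 2N + 1 columns
sigma2 = 0.25          # R = sigma^2 * 1 inverts cheaply

Z = rng.standard_normal((D, S))
R_inv = np.eye(D) / sigma2

# direct inverse of Sigma_yy = Z Z^T + R, a (L - N) x (L - N) system
direct = np.linalg.inv(Z @ Z.T + sigma2 * np.eye(D))

# SMW form (4.97b): only a (2N + 1) x (2N + 1) system is inverted
small = np.linalg.inv(Z.T @ R_inv @ Z + np.eye(S))
smw = R_inv - R_inv @ Z @ small @ Z.T @ R_inv
```

The two inverses agree to machine precision, while the SMW route only ever factors the small (2N + 1) × (2N + 1) matrix.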

4.2.10 Iterated Sigmapoint Kalman Filter


An iterated sigmapoint Kalman filter (ISPKF) has been proposed by
Sibley et al. (2006) that does better than the one-shot version. In this
case, we compute input sigmapoints around an operating point, xop,k ,
at each iteration. At the first iteration, we let xop,k = x̌k , but this is
then improved with each subsequent iteration. We show all the steps
to avoid confusion:
1. Both the predicted belief and the observation noise have uncertainty
so these are stacked together in the following way:

µz = [xop,k; 0],  Σzz = [P̌k 0; 0 Rk],  (4.98)

where we see that {µz, Σzz} is still a Gaussian representation. We
let L = dim µz.
2. Convert {µz, Σzz} to a sigmapoint representation:

LLᵀ = Σzz,  (Cholesky decomposition, L lower-triangular)  (4.99a)
z0 = µz,  (4.99b)
zi = µz + √(L + κ) coli L,  i = 1 . . . L  (4.99c)
zi+L = µz − √(L + κ) coli L,  i = 1 . . . L.  (4.99d)

3. Unstack each sigmapoint into state and observation noise,

zi = [xop,k,i; nk,i],  (4.100)

and then pass each sigmapoint through the nonlinear observation
model exactly:

yop,k,i = g(xop,k,i, nk,i).  (4.101)

4. Recombine the transformed sigmapoints into the desired moments:

µy,k = ∑_{i=0}^{2L} αi yop,k,i,  (4.102a)
Σyy,k = ∑_{i=0}^{2L} αi (yop,k,i − µy,k)(yop,k,i − µy,k)ᵀ,  (4.102b)
Σxy,k = ∑_{i=0}^{2L} αi (xop,k,i − xop,k)(yop,k,i − µy,k)ᵀ,  (4.102c)
Σxx,k = ∑_{i=0}^{2L} αi (xop,k,i − xop,k)(xop,k,i − xop,k)ᵀ,  (4.102d)

where

αi = { κ/(L + κ),        i = 0
     { (1/2) · 1/(L + κ),  otherwise.  (4.103)

At this point, Sibley et al. use the relationships between the SPKF and
EKF quantities to update the IEKF correction equations, (4.45), using
the statistical rather than analytical Jacobian quantities:

Kk = P̌k Gkᵀ (Gk P̌k Gkᵀ + R′k)⁻¹,  (4.104a)
P̂k = (1 − Kk Gk) P̌k,  (4.104b)
x̂k = x̌k + Kk (yk − g(xop,k, 0) − Gk (x̌k − xop,k)),  (4.104c)

where we make the substitutions P̌k Gkᵀ → Σxy,k, Gk P̌k Gkᵀ + R′k → Σyy,k,
Gk → Σyx,k Σxx,k⁻¹, P̌k → Σxx,k, and g(xop,k, 0) → µy,k, which results in

Kk = Σxy,k Σyy,k⁻¹,  (4.105a)
P̂k = Σxx,k − Kk Σyx,k,  (4.105b)
x̂k = x̌k + Kk (yk − µy,k − Σyx,k Σxx,k⁻¹ (x̌k − xop,k)).  (4.105c)

Figure 4.14 Stereo camera example, comparing the inference (i.e., 'corrective') step of the IEKF, SPKF, and ISPKF to the full Bayesian posterior, p(x|y). We see that neither of the sigmapoint methods matches up against the MAP solution, x̂map. Superficially, the ISPKF seems to come closer to the mean of the full posterior, x̄.

Initially, we set the operating point to be the mean of the prior: xop,k =
x̌k . At subsequent iterations we set it to be the best estimate so far:

xop,k ← x̂k . (4.106)

The process terminates when the change from one iteration to the next
becomes sufficiently small.
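The resulting iterated correction, (4.105)-(4.106), can be sketched as follows for a scalar state; the stereo-like model y = fb/x + n and all numbers are assumptions for illustration, with a compact scalar sigmapoint moment computation inlined so the sketch is self-contained.

```python
import numpy as np

def moments(x_op, P, R, g, kappa=2.0):
    # Scalar sigmapoint moments of z = [x; n] through y = g(x) + n, per (4.98)-(4.103).
    L, s = 2, np.sqrt(2 + kappa)
    xs = np.array([x_op, x_op + s * np.sqrt(P), x_op - s * np.sqrt(P), x_op, x_op])
    ns = np.array([0.0, 0.0, 0.0, s * np.sqrt(R), -s * np.sqrt(R)])
    a = np.full(5, 0.5 / (L + kappa)); a[0] = kappa / (L + kappa)
    ys = g(xs) + ns
    mu_y = a @ ys
    return mu_y, a @ (ys - mu_y)**2, a @ ((xs - x_op) * (ys - mu_y)), a @ (xs - x_op)**2

def ispkf_update(x_check, P_check, R, y, g, iters=20):
    # Iterated correction (4.105)-(4.106): statistically relinearize about x_op.
    x_op = x_check
    for _ in range(iters):
        mu_y, Syy, Sxy, Sxx = moments(x_op, P_check, R, g)
        K = Sxy / Syy                         # (4.105a); Syx = Sxy in the scalar case
        x_new = x_check + K * (y - mu_y - (Sxy / Sxx) * (x_check - x_op))  # (4.105c)
        if abs(x_new - x_op) < 1e-10:
            x_op = x_new
            break
        x_op = x_new
    return x_op, Sxx - K * Sxy                # (4.105b)

fb = 400.0                                    # made-up stereo parameter
g = lambda x: fb / x
x_hat, P_hat = ispkf_update(x_check=20.0, P_check=9.0, R=0.09,
                            y=fb / 22.0 - 0.6, g=g)
```

With x_op initialized to the prior mean, the first pass of the loop reproduces the one-shot update.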
We have seen that the first iteration of the IEKF results in the EKF
method, and this is also true for the SPKF/ISPKF. Setting xop,k = x̌k
in (4.105c) results in

x̂k = x̌k + Kk (yk − µy,k ) ,  (4.107)

which is the same as the one-shot method in (4.78c).


124 Nonlinear Non-Gaussian Estimation

Figure 4.15: Histogram of estimator values for 1,000,000 trials of the stereo camera experiment where each time a new xtrue is randomly drawn from the prior and a new ymeas is randomly drawn from the measurement model. The dashed line marks the mean of the prior, x̌, and the solid line marks the expected value of the iterated sigmapoint estimator, x̂ispkf , over all the trials. The gap between dashed and solid is emean ≈ −3.84 cm, which indicates a small bias, and the average squared error is esq ≈ 4.32 m2 .

4.2.11 ISPKF Seeks the Posterior Mean

Now, the question we must ask ourselves is, how do the sigmapoint estimates relate to the full posterior? Figure 4.14 compares the sigmapoint methods to the full posterior and iterated linearization (i.e., MAP) on our stereo camera example where we used

xtrue = 26 [m],  ymeas = fb/xtrue − 0.6 [pixel],

again to exaggerate the differences between the various methods. Our implementations of the sigmapoint methods used κ = 2, which is appropriate for a Gaussian prior. Much like the EKF, we see that the one-shot SPKF method bears no obvious relationship to the full posterior. However, the ISPKF method appears to come closer to the mean,
x̄, rather than the mode (i.e., MAP) value, x̂map . Numerically, the num-
bers of interest are:
x̂map = 24.5694, x̄ = 24.7770,
x̂iekf = 24.5694, x̂ispkf = 24.7414.
We see that the IEKF solution matches the MAP one, and the ISPKF
solution is close to (but not exactly) the mean.
Now, we consider the question, how well does the iterated sigmapoint
method capture xtrue ? Once again, we compute the performance over
a large number of trials (using the parameters in (4.5)). The results
are shown in Figure 4.15. We see that the average difference of the
estimator, x̂ispkf , and the ground-truth, xtrue , is emean ≈ −3.84 cm,
demonstrating a small bias. This is significantly better than the MAP
estimator, which had a bias of −33.0 cm on this same metric. The
average squared error is approximately the same, with esq ≈ 4.32 m2 .
Although it is difficult to show analytically, it is plausible to think
that the iterated sigmapoint method is trying to converge to the mean
of the full posterior rather than the mode. If what we care about is
matching up against groundtruth, matching the mean of the full posterior could be an interesting avenue.

Figure 4.16: Taxonomy of the different filtering methods, showing their relationships to the Bayes filter. Approximating the PDFs as Gaussian leads to the Gaussian filters: passing PDFs through linearized motion and observation models gives the (iterated) extended Kalman filter, while deterministically sampling the PDFs and passing the samples through the nonlinear motion and observation models gives the (iterated) sigmapoint Kalman filter. Approximating the PDFs using a large number of random samples leads instead to the particle filter.

4.2.12 Taxonomy of Filters


Figure 4.16 provides a summary of the methods we have discussed in
this section on nonlinear recursive state estimation. We can think of
each of the methods having a place in a larger taxonomy with the Bayes
filter at the top position. Depending on the approximations made, we
wind up with the different filters discussed.
It is also worth recalling the role of iteration in our filter methods.
Without iteration, both the EKF and SPKF were difficult to relate
back to the full Bayesian posterior. However, we saw that the ‘mean’ of
the IEKF converges to the MAP solution, while the mean of the ISPKF
comes quite close to the mean of the full posterior. We will use these
lessons in the next section on batch estimation, where we will attempt
to estimate entire trajectories at once.

4.3 Batch Discrete-Time Estimation


In this section, we take a step back and question how valid the Bayes
filter really is given the fact that we always have to implement it approx-
imately and therefore are violating the Markov assumption on which it
is predicated. We propose that a much better starting point in deriv-
ing nonlinear filters (i.e., better than the Bayes filter) is the nonlinear
version of the batch estimator we first introduced in the chapter on
linear-Gaussian estimation. Setting the estimation problem up as an
optimization problem affords a different perspective that helps explain
the shortcomings of all variants of the EKF.

4.3.1 Maximum A Posteriori


In this section, we revisit our approach to linear-Gaussian estimation problems, batch optimization, and introduce the Gauss-Newton
method to solve our nonlinear version of this estimation problem. This
optimization approach can be viewed as the MAP approach once again.
We first set up our objective function that we seek to minimize, then
consider methods to solve it.

Objective Function
We seek to construct an objective function that we will minimize with
respect to
 
x = [x0 ; x1 ; . . . ; xK ] ,  (4.108)

which represents the entire trajectory that we want to estimate.


Recall the linear-Gaussian objective function given by Equations (3.10)
and (3.9). It took the form of a squared Mahalanobis distance and was
proportional to the negative log likelihood of all the measurements. For
the nonlinear case, we define the errors with respect to the prior and
measurements to be

ev,k (x) = x̌0 − x0 for k = 0, and ev,k (x) = f (xk−1 , vk , 0) − xk for k = 1 . . . K,  (4.109a)
ey,k (x) = yk − g (xk , 0) ,  k = 0 . . . K,  (4.109b)

so that the contributions to the objective function are

Jv,k (x) = (1/2) ev,k (x)T W−1v,k ev,k (x),  (4.110a)
Jy,k (x) = (1/2) ey,k (x)T W−1y,k ey,k (x).  (4.110b)

The overall objective function is then

J(x) = Σ_{k=0}^{K} (Jv,k (x) + Jy,k (x)) .  (4.111)

Note, we can generally think of Wv,k and Wy,k simply as symmetric


positive-definite matrix weights. By choosing these to be related to
the covariances of the measurement noises, minimizing the objective
function is equivalent to maximizing the joint likelihood of all the data.

We further define

e(x) = [ev (x) ; ey (x)] ,  ev (x) = [ev,0 (x) ; . . . ; ev,K (x)] ,  ey (x) = [ey,0 (x) ; . . . ; ey,K (x)] ,  (4.112a)
W = diag (Wv , Wy ) ,  Wv = diag (Wv,0 , . . . , Wv,K ) ,  (4.112b)
Wy = diag (Wy,0 , . . . , Wy,K ) ,  (4.112c)

so that the objective function can be written as

J(x) = (1/2) e(x)T W−1 e(x).  (4.113)
We can further define the modified error term,
u(x) = L e(x), (4.114)
where LT L = W−1 (i.e., from a Cholesky decomposition since W is
symmetric positive-definite). Using these definitions, we can write the
objective function simply as
J(x) = (1/2) u(x)T u(x).  (4.115)
This is precisely in a quadratic form, but not with respect to the design
variables, x. Our goal is to determine the optimum design parameter,
x̂, that minimizes the objective function:
x̂ = arg min_x J(x).  (4.116)

There are many nonlinear optimization techniques that we can apply


to minimize this expression due to its quadratic nature. A typical tech-
nique to use is Gauss-Newton optimization, but there are many other
possibilities. The more important issue is that we are considering this
as a nonlinear optimization problem. We will derive the Gauss-Newton
algorithm by way of Newton’s method.

Newton’s Method
Newton’s method works by iteratively approximating the (differen-
tiable) objective function by a quadratic function and then jumping to
(or moving towards) the minimum. Suppose we have an initial guess,
or operating point, for the design parameter, xop . We use a three-term
Taylor-series expansion to approximate J as a quadratic function,
   
J(xop + δx) ≈ J(xop ) + (∂J(x)/∂x)|xop δx + (1/2) δxT (∂²J(x)/∂x∂xT )|xop δx,  (4.117)

of δx, a ‘small’ change to the initial guess, xop ; here (∂J(x)/∂x) is the Jacobian and (∂²J(x)/∂x∂xT ) is the Hessian. We note that the symmetric Hessian matrix needs to be positive-definite for this method to work (otherwise there is no well-defined minimum to the quadratic approximation).
The next step is to find the value of δx that minimizes this quadratic
approximation. We can do this by taking the derivative with respect
to δx and setting to zero to find a critical point:

   2 
∂J(xop + δx)/∂δx = (∂J(x)/∂x)|xop + δx∗T (∂²J(x)/∂x∂xT )|xop = 0
⇒ (∂²J(x)/∂x∂xT )|xop δx∗ = − ((∂J(x)/∂x)|xop )T .  (4.118)

The last line is just a linear system of equations and can be solved
when the Hessian is invertible (which it must be, since it was assumed
to be positive-definite above). We may then update our operating point
according to:

xop ← xop + δx∗ . (4.119)

This procedure iterates until δx∗ becomes sufficiently small. A few com-
ments about Newton’s method:

(i) It is ‘locally convergent’, which means the successive approximations are guaranteed to converge to a solution when the initial guess is already close enough to the solution. For a complex nonlinear objective function, this is really the best we can expect (i.e., global convergence is difficult to achieve).
(ii) The rate of convergence is quadratic (i.e., it converges much
faster than simple gradient descent).
(iii) It can be difficult to implement because the Hessian must be
computed.
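As a minimal illustration of the update (4.118)-(4.119), here is a sketch of Newton's method on a made-up scalar objective, J(x) = (x − 2)^4 + (x − 2)^2, whose minimum is at x = 2:

```python
# Newton's method (4.118)-(4.119) on J(x) = (x - 2)**4 + (x - 2)**2 (made-up).
def newton(J_grad, J_hess, x_op, tol=1e-10, max_iters=50):
    for _ in range(max_iters):
        dx = -J_grad(x_op) / J_hess(x_op)   # solve (Hessian) dx* = -(Jacobian)^T
        x_op += dx                          # update the operating point (4.119)
        if abs(dx) < tol:                   # iterate until dx* is sufficiently small
            break
    return x_op

J_grad = lambda x: 4 * (x - 2)**3 + 2 * (x - 2)
J_hess = lambda x: 12 * (x - 2)**2 + 2      # always positive here
x_min = newton(J_grad, J_hess, x_op=5.0)
```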

The Gauss-Newton method approximates Newton’s method further, in


the case of a special form of objective function.

Gauss-Newton Method
Let us now return to the nonlinear quadratic objective function we have
in Equation (4.115). In this case, the Jacobian and Hessian matrices

are

Jacobian:  (∂J(x)/∂x)|xop = u(xop )T (∂u(x)/∂x)|xop ,  (4.120a)

Hessian:  (∂²J(x)/∂x∂xT )|xop = ((∂u(x)/∂x)|xop )T ((∂u(x)/∂x)|xop ) + Σ_{i=1}^{M} ui (xop ) (∂²ui (x)/∂x∂xT )|xop ,  (4.120b)

where u(x) = (u1 (x), . . . , ui (x), . . . , uM (x))T . We have so far not made any approximations.
Looking to the expression for the Hessian, we assert that near the
minimum of J, the second term is small relative to the first. One intu-
ition behind this is that near the optimum we should have ui (x) small
(and ideally zero). We thus approximate the Hessian according to
(∂²J(x)/∂x∂xT )|xop ≈ ((∂u(x)/∂x)|xop )T ((∂u(x)/∂x)|xop ) ,  (4.121)

which does not involve any second derivatives. Substituting (4.120a)


and (4.121) into the Newton update represented by (4.118), we have
((∂u(x)/∂x)|xop )T ((∂u(x)/∂x)|xop ) δx∗ = − ((∂u(x)/∂x)|xop )T u(xop ),  (4.122)

which is the classic Gauss-Newton update method. Again, this is iterated to convergence.
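A minimal sketch of the update (4.122) on a made-up problem: estimating a scalar x from several stereo-like measurements y_i = fb/x + n_i, with fb = 400 and fabricated measurement values.

```python
import numpy as np

fb = 400.0                                  # made-up stereo parameter
y = np.array([16.1, 15.8, 16.4, 15.9])      # fabricated measurements y_i = fb/x + n_i

def gauss_newton(x_op, iters=20):
    for _ in range(iters):
        u = y - fb / x_op                   # stacked residuals u(x_op)
        A = np.full_like(y, fb / x_op**2)   # Jacobian du/dx, one entry per residual
        dx = -(A @ u) / (A @ A)             # (A^T A) dx* = -A^T u, i.e., (4.122)
        x_op += dx
        if abs(dx) < 1e-12:
            break
    return x_op

x_hat = gauss_newton(x_op=10.0)             # converges to fb / mean(y)
```

Since the residuals sum to zero at the optimum here, the neglected second-order Hessian term vanishes there and the iteration converges very quickly.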

Gauss-Newton Method – Alternative Derivation


The other way to think about the Gauss-Newton method is to start
with a Taylor expansion of u(x), instead of J(x). The approximation
in this case is
 
u(xop + δx) ≈ u(xop ) + (∂u(x)/∂x)|xop δx.  (4.123)

Substituting into J we have

J(xop + δx) ≈ (1/2) (u(xop ) + (∂u(x)/∂x)|xop δx)T (u(xop ) + (∂u(x)/∂x)|xop δx) .  (4.124)

Minimizing with respect to δx gives


∂J(xop + δx)/∂δx = (u(xop ) + (∂u(x)/∂x)|xop δx∗ )T ((∂u(x)/∂x)|xop ) = 0
⇒ ((∂u(x)/∂x)|xop )T ((∂u(x)/∂x)|xop ) δx∗ = − ((∂u(x)/∂x)|xop )T u(xop ),  (4.125)
which is the same update as in (4.122). We will employ this shortcut
to the Gauss-Newton method in a later chapter when confronted with
dealing with nonlinearities in the form of rotations.

Practical Patches to Gauss-Newton


Since the Gauss-Newton method is not guaranteed to converge (owing
to the approximate Hessian matrix), we can make two practical patches
to help with convergence:
(i) Once the optimal update is computed, δx∗ , we perform the
actual update according to
xop ← xop + α δx∗ , (4.126)
where α ∈ [0, 1] is a user-definable parameter. Performing a
line search for the best value of α works well in practice. This
works because δx∗ is a descent direction; we are just adjusting
how far we step in this direction to be a bit more conservative
towards robustness (rather than speed).
(ii) We can use the Levenberg-Marquardt modification to the Gauss-
Newton method:
(((∂u(x)/∂x)|xop )T ((∂u(x)/∂x)|xop ) + λD) δx∗ = − ((∂u(x)/∂x)|xop )T u(xop ),  (4.127)

where D is a positive diagonal matrix. When D = 1, we can see that as λ becomes very big the Hessian is relatively small and we have

δx∗ ≈ − (1/λ) ((∂u(x)/∂x)|xop )T u(xop ),  (4.128)
which corresponds to a very small step in the direction of steep-
est descent (i.e., the gradient). When λ = 0 we recover the usual
Gauss-Newton update. The Levenberg-Marquardt method can
work well in situations when the Hessian approximation is poor
or is poorly conditioned.

We can also combine both of these patches together to give us the most
options in controlling convergence.
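A sketch of both patches combined, on the same made-up scalar stereo-like problem as before (fb = 400, fabricated measurements); the accept/reject schedule for λ (shrink on success, grow on failure) is one common heuristic, not the only choice.

```python
import numpy as np

fb = 400.0                                  # made-up stereo parameter
y = np.array([16.1, 15.8, 16.4, 15.9])      # fabricated measurements
cost = lambda x: 0.5 * np.sum((y - fb / x)**2)

def levenberg_marquardt(x_op, lam=1e-2, iters=50):
    for _ in range(iters):
        u = y - fb / x_op
        A = np.full_like(y, fb / x_op**2)   # Jacobian du/dx
        dx = -(A @ u) / (A @ A + lam)       # (4.127) with D = 1
        if cost(x_op + dx) < cost(x_op):
            x_op += dx                      # accept; trust the quadratic model more
            lam *= 0.1
        else:
            lam *= 10.0                     # reject; back off towards gradient descent
        if abs(dx) < 1e-12:
            break
    return x_op

x_hat = levenberg_marquardt(x_op=5.0)
```

The accept/reject test guarantees a monotone cost decrease, which is exactly the robustness the plain Gauss-Newton update lacks.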

Gauss-Newton Update in Terms of Errors


Recalling that
u(x) = L e(x), (4.129)
with L a constant, we substitute this into the Gauss-Newton update
to see that in terms of the error, e(x), we have

HT W−1 H δx∗ = HT W−1 e(xop ), (4.130)
with

H = − (∂e(x)/∂x)|xop  (4.131)

and where we have used LT L = W−1 .


Another way to view this is to notice that
J(xop + δx) ≈ (1/2) (e(xop ) − H δx)T W−1 (e(xop ) − H δx) ,  (4.132)
where u(xop ) = L e(xop ); this is the quadratic approximation of the objective function in terms of the error.
yields the Gauss-Newton update.

Batch Estimation
We now return to our specific estimation problem and apply the Gauss-
Newton optimization method. We will use the ‘shortcut’ approach and
thus begin by approximating our error expressions:

ev,k (xop + δx) ≈ ev,0 (xop ) − δx0 for k = 0, and ev,k (xop + δx) ≈ ev,k (xop ) + Fk−1 δxk−1 − δxk for k = 1 . . . K,  (4.133)
ey,k (xop + δx) ≈ ey,k (xop ) − Gk δxk ,  k = 0 . . . K,  (4.134)

where

ev,k (xop ) ≈ x̌0 − xop,0 for k = 0, and ev,k (xop ) ≈ f (xop,k−1 , vk , 0) − xop,k for k = 1 . . . K,  (4.135)
ey,k (xop ) ≈ yk − g (xop,k , 0) ,  k = 0 . . . K,  (4.136)
and we require definitions of the Jacobians of the nonlinear motion and observation models given by

Fk−1 = (∂f (xk−1 , vk , wk )/∂xk−1 )|xop,k−1 ,vk ,0 ,  Gk = (∂g(xk , nk )/∂xk )|xop,k ,0 .  (4.137)

Then, if we let the matrix weights be given by

Wv,k = Q0k ,  Wy,k = R0k ,  (4.138)

we can define

δx = [δx0 ; δx1 ; δx2 ; . . . ; δxK ] ,  (4.139a)

H as the stacked Jacobian whose top block is block-lower-bidiagonal, with identity blocks, 1, on its main diagonal and −F0 , −F1 , . . . , −FK−1 on its subdiagonal, and whose bottom block is diag (G0 , G1 , G2 , . . . , GK ),

e(xop ) = [ev,0 (xop ) ; ev,1 (xop ) ; . . . ; ev,K (xop ) ; ey,0 (xop ) ; ey,1 (xop ) ; . . . ; ey,K (xop )] ,  (4.139b)

and

W = diag (P̌0 , Q01 , . . . , Q0K , R00 , R01 , . . . , R0K ) ,  (4.140)

which are identical in structure to the matrices in the linear batch


case, summarized in (3.3.1), with a few extra subscripts to show time
dependence as well as the Jacobians of the motion/observation models
with respect to the noise variables. Under these definitions, our Gauss-
Newton update is given by

HT W−1 H δx∗ = HT W−1 e(xop ),  (4.141)

where HT W−1 H is block-tridiagonal.

This is very comparable to the linear-Gaussian batch case. The key


difference to remember here is that we are in fact iterating our solution
for the entire trajectory, x. We could at this point recover the recursive
EKF from our batch solution using similar logic to the linear-Gaussian
case.
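To make the structure of (4.139)-(4.141) concrete, here is a sketch for a made-up scalar problem (motion xk = xk−1 + vk + wk, so Fk = 1, and a nonlinear measurement yk = sqrt(xk² + 1) + nk); all numbers are fabricated and the measurements are taken noise-free purely to keep the illustration simple.

```python
import numpy as np

K = 4
v = np.ones(K)                               # inputs v_k, k = 1..K (fabricated)
x_check0, P0, Q, R = 0.0, 1.0, 0.1, 0.01     # prior/noise variances (fabricated)
g = lambda x: np.sqrt(x**2 + 1.0)            # nonlinear measurement model
G = lambda x: x / np.sqrt(x**2 + 1.0)        # its Jacobian, dg/dx
x_true = np.cumsum(np.concatenate([[0.2], v]))
y = g(x_true)                                # noise-free measurements, for simplicity

x_op = np.zeros(K + 1)                       # initial guess for the whole trajectory
for _ in range(10):                          # Gauss-Newton iterations
    # errors (4.135)-(4.136)
    e_v = np.concatenate([[x_check0 - x_op[0]], x_op[:-1] + v - x_op[1:]])
    e = np.concatenate([e_v, y - g(x_op)])
    # H (4.139a): bidiagonal motion block (F_k = 1) stacked on diag(G_k)
    A = np.eye(K + 1) - np.diag(np.ones(K), -1)
    H = np.vstack([A, np.diag(G(x_op))])
    Winv = np.diag(1.0 / np.concatenate([[P0], np.full(K, Q), np.full(K + 1, R)]))
    dx = np.linalg.solve(H.T @ Winv @ H, H.T @ Winv @ e)   # (4.141)
    x_op += dx
    if np.max(np.abs(dx)) < 1e-10:
        break
x_hat = x_op
```

In a real implementation one would exploit the block-tridiagonal structure of H^T W^{-1} H (e.g., via a sparse Cholesky solver) rather than forming and solving the dense system as done here.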

4.3.2 Bayesian Inference


We can also get to the same batch update equations from a Bayesian-
inference perspective. Assume we begin with an initial guess for the
entire trajectory, xop . We can linearize the motion model about this
guess and construct a prior over the whole trajectory using all the
inputs. The linearized motion model is
xk ≈ f (xop,k−1 , vk , 0) + Fk−1 (xk−1 − xop,k−1 ) + wk0 , (4.142)
where the Jacobian, Fk−1 , is the same as the previous section. After a
bit of manipulation, we can write this in lifted form as
x = F (ν + w0 ) , (4.143)
where
 
ν = [x̌0 ; f (xop,0 , v1 , 0) − F0 xop,0 ; f (xop,1 , v2 , 0) − F1 xop,1 ; . . . ; f (xop,K−1 , vK , 0) − FK−1 xop,K−1 ] ,  (4.144a)

F is the block-lower-triangular lifted transition matrix, with identity blocks, 1, on its main diagonal and the product Fi−1 Fi−2 · · · Fj as the block in row i, column j (for i > j),  (4.144b)

Q0 = diag (P̌0 , Q01 , Q02 , . . . , Q0K ) ,  (4.144c)
and w0 ∼ N (0, Q0 ). For the mean of the prior, x̌, we then simply have

x̌ = E [x] = E [F (ν + w0 )] = Fν. (4.145)


For the covariance of the prior, P̌, we have
P̌ = E[(x − E[x])(x − E[x])T ] = F E[w0 w0T ] FT = FQ0 FT .  (4.146)

Thus, the prior can be summarized as x ∼ N (Fν, FQ0 FT ). The (·)0


notation is used to indicate the Jacobian with respect to the noise is
incorporated into the quantity.
The linearized observation model is
yk ≈ g (xop,k , 0) + Gk (xk − xop,k ) + n0k ,  (4.147)
which can be written in lifted form as
y = yop + G (x − xop ) + n0 , (4.148)

where
 
yop = [g(xop,0 , 0) ; g(xop,1 , 0) ; . . . ; g(xop,K , 0)] ,  (4.149a)
G = diag (G0 , G1 , G2 , . . . , GK ) ,  (4.149b)
R0 = diag (R00 , R01 , R02 , . . . , R0K ) ,  (4.149c)
and n0 ∼ N (0, R0 ). It is fairly easy to see that
E[y] = yop + G (x̌ − xop ) ,  (4.150a)
E[(y − E[y])(y − E[y])T ] = GP̌GT + R0 ,  (4.150b)
E[(y − E[y])(x − E[x])T ] = GP̌.  (4.150c)
Again, the (·)0 notation is used to indicate the Jacobian with respect
to the noise is incorporated into the quantity.
With these quantities in hand, we can write a joint density for the
lifted trajectory and measurements as
   
p(x, y|v) = N ( [x̌ ; yop + G (x̌ − xop )] , [P̌ , P̌GT ; GP̌ , GP̌GT + R0 ] ) ,  (4.151)
which is quite similar to the expression for the IEKF situation in (4.42),
but now for the whole trajectory rather than just one timestep. Us-
ing the usual relationship from (2.46b), we can immediately write the
Gaussian posterior as
 
p(x|v, y) = N (x̂, P̂),  (4.152)

where
K = P̌GT (GP̌GT + R0 )−1 ,  (4.153a)
P̂ = (1 − KG) P̌,  (4.153b)
x̂ = x̌ + K (y − yop − G(x̌ − xop )) .  (4.153c)
Using the SMW identity from (2.68), we can rearrange the equation
for the posterior mean to be
(P̌−1 + GT (R0 )−1 G) δx∗ = P̌−1 (x̌ − xop ) + GT (R0 )−1 (y − yop ) ,  (4.154)

where δx∗ = x̂ − xop . Inserting the details of the prior this becomes
(F−T (Q0 )−1 F−1 + GT (R0 )−1 G) δx∗ = F−T (Q0 )−1 (ν − F−1 xop ) + GT (R0 )−1 (y − yop ) ,  (4.155)

where the left-hand side matrix is block-tridiagonal.

Then, under the definitions


H = [F−1 ; G] ,  W = diag (Q0 , R0 ) ,  e(xop ) = [ν − F−1 xop ; y − yop ] ,  (4.156)
we can rewrite this as

HT W−1 H δx∗ = HT W−1 e(xop ),  (4.157)

where HT W−1 H is block-tridiagonal,

which is identical to the update equation from the previous section. As


usual, we iterate to convergence. The difference between the Bayesian
and MAP approaches basically comes down to on which side of the
SMW identity one begins; plus, the Bayesian approach produces a co-
variance explicitly, although we have shown the same thing can be ex-
tracted from the MAP approach. Note, it was our choice to iteratively
relinearize about the mean of the best estimate so far that caused the
Bayesian approach to have the same ‘mean’ as the MAP solution. We
saw this phenomenon previously in the IEKF section. We could also
imagine making different choices than linearization in the batch case
(e.g., particles, sigmapoints) to compute the required moments for the
update equations, but we will not explore these possibilities here.

4.3.3 Maximum Likelihood


In this section, we consider a simplified version of our batch estimation
problem where we throw away the prior and use only the measurements
for our solution.

Maximum Likelihood via Gauss-Newton


We will assume the observation model takes on a simplified form in
which the measurement noise is purely additive (i.e., outside the non-
linearity):
yk = gk (x) + nk , (4.158)
where nk ∼ N (0, Rk ). Note that in this case, we allow for the possibility
that the measurement function is changing with k and that it could
depend on an arbitrary portion of the state, x. We do not even need
to think of k as a time index anymore, simply a measurement index.
Without the prior, our objective function takes the form
J(x) = (1/2) Σk (yk − gk (x))T R−1k (yk − gk (x)) = − log p(y|x) + const.  (4.159)
Without the prior term in the objective function, we refer to this as
a maximum likelihood (ML) problem because finding the solution that

minimizes the objective function also maximizes the likelihood of the


measurements11 :
x̂ = arg min_x J(x) = arg max_x p(y|x).  (4.160)

We can still use the Gauss-Newton algorithm to solve the ML problem, just as in the MAP case. We begin with an initial guess for the
solution, xop . We then compute an optimal update, δx∗ , by solving
(Σk Gk (xop )T R−1k Gk (xop )) δx∗ = Σk Gk (xop )T R−1k (yk − gk (xop )) ,  (4.161)
where
Gk (x) = ∂gk (x)/∂x ,  (4.162)
is the Jacobian of the observation model with respect to the state.
Finally, we apply the optimal update to our guess
xop ← xop + δx∗ , (4.163)
and iterate to convergence. Once converged, we take x̂ = xop as our
estimate. At convergence, we should have that

(∂J(x)/∂xT )|x̂ = − Σk Gk (x̂)T R−1k (yk − gk (x̂)) = 0,  (4.164)

for a minimum.
We will come back to this ML setup when discussing a problem called
bundle adjustment in the later chapters.
Maximum Likelihood Bias Estimation
We have already seen in the simple example at the start of this chapter
that the MAP method is biased with respect to average mean error. It
turns out that the ML method is biased as well (unless the measurement
model is linear). There is a classic paper by Box (1971) that derives an
approximate expression for the bias in ML, and we use this section to
present it.
We will see below that we need a second-order Taylor expansion of
g(x) while only a first-order expansion of G(x). Thus, we have the
following approximate expressions:
gk (x̂) = gk (x + δx) ≈ gk (x) + Gk (x) δx + (1/2) Σj 1j δxT G jk (x) δx,  (4.165a)
Gk (x̂) = Gk (x + δx) ≈ Gk (x) + Σj 1j δxT G jk (x),  (4.165b)
11 This is because the logarithm is a monotonically increasing function. Another way of
looking at ML is that it is MAP with a uniform prior over all possible solutions.

where
gk (x) = [gjk (x)]j ,  Gk (x) = ∂gk (x)/∂x ,  G jk (x) = ∂²gjk (x)/∂x∂xT ,  (4.166)
and 1j is the jth column of the identity matrix. We have indicated
whether each Jacobian/Hessian is evaluated at x (the true state) or x̂
(our estimate). In this section, the quantity, δx = x̂ − x, will be the
difference between our estimate and the true state, on a given trial.
Each time we change the measurement noise, we will get a different
value for the estimate and hence δx. We will seek an expression for the
expected value of this difference, E[δx], over all possible realizations of
the measurement noise; this represents the systematic error or bias.
As discussed above, after convergence of Gauss-Newton, the esti-
mate, x̂, will satisfy the following optimality criterion:
Σk Gk (x̂)T R−1k (yk − gk (x̂)) = 0,  (4.167)

or
Σk (Gk (x) + Σj 1j δxT G jk (x))T R−1k (nk − Gk (x) δx − (1/2) Σj 1j δxT G jk (x) δx) ≈ 0,  (4.168)

where nk = yk − gk (x), and this follows

after substituting (4.165a) and (4.165b). We will assume that δx has up


to quadratic dependence12 on the stacked noise variable, n ∼ N (0, R):

δx = A(x) n + b(n), (4.169)

where A(x) is an unknown coefficient matrix and b(n) is an unknown


quadratic function of n. We will use Pk , a projection matrix, to extract
the kth noise variable from the stacked version: nk = Pk n. Substitut-
ing (4.169) we have that

Σk (Gk (x) + Σj 1j (A(x) n + b(n))T G jk (x))T R−1k (Pk n − Gk (x)(A(x) n + b(n)) − (1/2) Σj 1j (A(x) n + b(n))T G jk (x) (A(x) n + b(n))) ≈ 0.  (4.170)

12 In reality there are an infinite number of terms so this expression is a big


approximation but may work for mildly nonlinear observation models.

Multiplying out and keeping terms up to quadratic in n we have


L n + q1 (n) + q2 (n) ≈ 0,  (4.171)

where the linear term and the two quadratic terms are

L n = Σk Gk (x)T R−1k (Pk − Gk (x)A(x)) n,
q1 (n) = Σk Gk (x)T R−1k (−Gk (x) b(n) − (1/2) Σj 1j nT A(x)T G jk (x) A(x) n),
q2 (n) = Σj,k G jk (x)T A(x) n 1Tj R−1k (Pk − Gk (x)A(x)) n.

To make the expression identically zero (up to second order in n),

L n + q1 (n) + q2 (n) = 0, (4.172)

we must have L = 0. This follows by considering the case of the op-


posing sign of n,

−L n + q1 (−n) + q2 (−n) = 0, (4.173)

and then noting that q1 (−n) = q1 (n) and q2 (−n) = q2 (n) owing to
the quadratic nature of these terms. Subtracting the second case from
the first we have 2 L n = 0 and since n can take on any value, it follows
that L = 0 and thus
A(x) = W(x)−1 Σk Gk (x)T R−1k Pk ,  (4.174)

where
W(x) = Σk Gk (x)T R−1k Gk (x).  (4.175)

Choosing this value for A(x) and taking the expectation (over all values
of n) we are left with

E [q1 (n)] + E [q2 (n)] = 0. (4.176)

Fortunately, it turns out that E [q2 (n)] = 0. To see this, we need two
identities:

A(x)RA(x)T ≡ W(x)−1 ,  (4.177a)
A(x)RPTk ≡ W(x)−1 Gk (x)T .  (4.177b)

The proofs of these are left to the reader. We then have


" #
X
T T −1
E [q2 (n)] = E G jk (x) A(x) n 1j Rk (Pk − Gk (x)A(x)) n
j,k
X   
= G jk (x)T A(x) E n nT PTk − A(x)T Gk (x)T R−1
k 1j
j,k
| {z }
R
X  
= G jk (x)T A(x)RPTk − A(x)RA(x)T Gk (x)T R−1 k 1j
| {z } | {z }
j,k
W(x)−1 Gk (x) W(x)−1

= 0, (4.178)
where we have employed the above identities. We are thus left with
E [q1 (n)] = 0, (4.179)
or
E[b(n)] = − (1/2) W(x)−1 Σk Σj Gk (x)T R−1k 1j E[nT A(x)T G jk (x) A(x) n]
= − (1/2) W(x)−1 Σk Σj Gk (x)T R−1k 1j E[tr(G jk (x) A(x) n nT A(x)T )]
= − (1/2) W(x)−1 Σk Σj Gk (x)T R−1k 1j tr(G jk (x) A(x) E[n nT ] A(x)T )
= − (1/2) W(x)−1 Σk Σj Gk (x)T R−1k 1j tr(G jk (x) W(x)−1 ),  (4.180)

where tr(·) indicates the trace of a matrix. Looking back to (4.169), we


see that
E[δx] = A(x) E[n] + E[b(n)] = E[b(n)],  (4.181)

since E[n] = 0,
and so our final expression for the systematic part of the bias is
E[δx] = − (1/2) W(x)−1 Σk Σj Gk (x)T R−1k 1j tr(G jk (x) W(x)−1 ).  (4.182)
To use this expression in operation, we will need to substitute our
estimate, x̂, in place of x when computing (4.182). Then we can update
our estimate according to
x̂ ← x̂ − E[δx], (4.183)
to subtract off the bias. Note, this expression is only approximate and
may only work well in mildly nonlinear situations.
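For a single scalar stereo-like measurement, y = fb/x + n, the bias (4.182) collapses to a one-line formula that can be checked by Monte Carlo; the model and all numbers here are made-up (fb = 400, R = 0.09, true state x = 25, with Pk = 1 and 1j = 1 in the scalar, single-measurement case).

```python
import numpy as np

fb, R, x = 400.0, 0.09, 25.0            # made-up parameters and true state

G = -fb / x**2                          # Jacobian of g(x) = fb/x
Gjk = 2.0 * fb / x**3                   # Hessian of g(x)
W = G * (1.0 / R) * G                   # (4.175), one scalar measurement
bias = -0.5 * (1.0 / W) * G * (1.0 / R) * Gjk * (1.0 / W)   # (4.182)

# Monte Carlo check: with one measurement the ML estimate is x_hat = fb / y.
rng = np.random.default_rng(42)
ys = fb / x + np.sqrt(R) * rng.standard_normal(200_000)
mc_bias = np.mean(fb / ys - x)          # sample estimate of E[x_hat - x]
```

For this model the closed form reduces to R x³/fb², so the bias grows rapidly with range, which matches the intuition that stereo depth estimates become increasingly skewed far from the camera.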

4.3.4 Discussion
If we think of the EKF as an approximation of the full nonlinear Gauss-
Newton (or even Newton) method applied to our estimation problem,
we can see that it is really quite inferior, mainly because it does not
iterate to convergence. The Jacobians are evaluated only once (at the
best estimate so far). In truth, the EKF can do better than just one
iteration of Gauss-Newton because the EKF does not evaluate all the
Jacobians at once, but the lack of iteration is its main downfall. This
is obvious from an optimization perspective; we need to iterate to con-
verge. However, the EKF was originally derived from the Bayes filter
earlier in this chapter, which used something called the Markov as-
sumption to achieve its recursive form. The problem with the Markov
assumption is that once this is built into the estimator, we cannot get
rid of it. It is a fundamental constraint that cannot be overcome.
There have been many attempts to patch the EKF, including the
Iterated EKF described earlier in this chapter. However, for very non-
linear systems, these may not help much. The problem with the IEKF
is that it still clings to the Markov assumption. It is iterating at a single
time-step, not over the whole trajectory at once. The difference between
Gauss-Newton and the IEKF can be seen plainly in Figure 4.17.
Batch estimation via the Gauss-Newton method has its own prob-
lems. In particular, it must be run offline and is not a constant-time
algorithm, whereas the EKF is both online and a constant-time method.
So-called sliding-window filters (SWFs) (Sibley, 2006) seek to get the
best of both worlds by iterating over a window of time-steps and sliding
this window along to allow for online/constant-time implementation.
SWFs are really still an active area of research, but when viewed from

an optimization perspective, it is hard to imagine that they do not offer a drastic improvement over the EKF and its variants.

Figure 4.17: Comparison of the iterative schemes used in various estimation paradigms, over the states x0 , . . . , xK : Gauss-Newton iterates over the entire trajectory, but runs offline and not in constant time; sliding-window filters iterate over several timesteps at once, and run online and in constant time; the IEKF iterates at only one timestep at a time, but runs online and in constant time.

4.4 Batch Continuous-Time Estimation


We saw in the previous chapter how to handle continuous-time pri-
ors through Gaussian process regression. Our priors were generated by
linear stochastic differential equations of the form

ẋ(t) = A(t)x(t) + v(t) + L(t)w(t), (4.184)

with

w(t) ∼ GP(0, Q δ(t − t′ )), (4.185)

and Q, the usual power spectral density matrix.


In this section, we show how we can extend our results to nonlinear,
continuous-time motion models of the form

ẋ(t) = f (x(t), v(t), w(t), t), (4.186)

where f (·) is a nonlinear function. We will still receive observations at


discrete times,

yk = g(x(tk ), nk , tk ), (4.187)

where g(·) is a nonlinear function and

nk ∼ N (0, Rk ). (4.188)

We will begin by linearizing both models and constructing their lifted


forms, then carry out GP regression (Bayesian inference). See Anderson
et al. (2015) for applications of this section.

4.4.1 Motion Model


We will linearize the motion model about an operating point, xop (t),
which we note is an entire continuous-time trajectory. We will then
construct our motion prior (mean and covariance) in lifted form at the
measurement times.

Linearization
Linearizing our motion model about this trajectory, we have

ẋ(t) = f (x(t), v(t), w(t), t)
≈ f (xop (t), v(t), 0, t) + (∂f /∂x)|xop (t),v(t),0,t (x(t) − xop (t)) + (∂f /∂w)|xop (t),v(t),0,t w(t)
= ν(t) + F(t) x(t) + L(t) w(t),  (4.189)

with

ν(t) = f (xop (t), v(t), 0, t) − (∂f /∂x)|xop (t),v(t),0,t xop (t),
F(t) = (∂f /∂x)|xop (t),v(t),0,t ,  L(t) = (∂f /∂w)|xop (t),v(t),0,t ,

where ν(t), F(t), and L(t) are now known functions of time (since xop (t) is known). Thus, approximately, our process model is of the form

ẋ(t) ≈ F(t)x(t) + ν(t) + L(t)w(t). (4.190)

Thus, after linearization, this is in the LTV form we studied in the


linear-Gaussian chapter.

Mean and Covariance Functions


Since the SDE for the motion model is approximately in the LTV form
studied earlier, we can go ahead and write

 Z t
x(t) ∼ GP Φ(t, t0 )x̌0 + Φ(t, s)ν(s) ds,
t0
| {z }
x̌(t)
Z min(t,t0 ) 
Φ(t, t0 )P̌0 Φ(t0 , t0 )T + Φ(t, s)L(s)QL(s)T Φ(t0 , s)T ds .
t0
| {z }
P̌(t,t0 )
(4.191)

where Φ(t, s) is the transition function associated with F(t). At the


measurement times, t0 < t1 < · · · < tK , we can also then write

x ∼ N (x̌, P̌) = N (Fν, FQ0 FT ), (4.192)

for the usual lifted form of the prior where


 
1
 Φ(t1 , t0 ) 1 
 
 Φ(t2 , t0 ) Φ(t2 , t1 ) 1 
 
F= .
.. .
.. .. .. ,
 . . 
 
Φ(tK−1 , t0 ) Φ(tK−1 , t1 ) Φ(tK−1 , t2 ) · · · 1 
Φ(tK , t0 ) Φ(tK , t1 ) Φ(tK , t2 ) · · · Φ(tK , tK−1 ) 1
(4.193a)
 
x̌0
 ν1 
 
ν =  ..  , (4.193b)
 . 
νK
Z tk
νk = Φ(tk , s)ν(s) ds, k = 1 . . . K, (4.193c)
tk−1

Q0 = diag P̌0 , Q01 , Q02 , . . . , Q0K , (4.193d)
Z tk
Q0k = Φ(tk , s)L(s)QL(s)T Φ(tk , s)T ds, k = 1 . . . K. (4.193e)
tk−1

Unfortunately, we have a bit of a problem. To compute x̌ and P̌, we


require an expression for xop (s) for all s ∈ [t0 , tK ]. This is because ν(s),
F(s) (through Φ(t, s)), and L(s) appear inside the integrals for x̌(t)
and P̌(t, t0 ), and these depend in turn on xop (s). If we are performing
iterated GP regression as discussed earlier, we will only have xop from
the previous iteration, which is evaluated only at the measurement
times.
Fortunately, the whole point of GP regression is that we can query
the state at any time of interest. Moreover, we showed previously that
this can be done very efficiently (i.e., O(1) time) for our particular
choice of process model:
xop (s) = x̌(s) + P̌(s)P̌−1 (xop − x̌). (4.194)
Because we are making use of this inside an iterative process, we use
the x̌(s), P̌(s), x̌, and P̌ from the previous iteration to evaluate this
expression.
The biggest challenge will be to identify Φ(t, s), which is problem-
specific. As we will already be carrying out numerical integration in our
scheme, we can compute the transition function numerically as well, via
a normalized fundamental matrix (of control theory)13 , Υ(t). In other
words, we will integrate
Υ̇(t) = F(t)Υ(t), Υ(0) = 1, (4.195)
13 Not to be confused with the fundamental matrix of computer vision.

ensuring to save Υ(t) at all the times of interest in our GP regression.


The transition function is then given by
Φ(t, s) = Υ(t) Υ(s)−1 . (4.196)
For specific systems, analytical expressions for the transition function
will be possible.
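As a concrete illustration of this numerical route, the sketch below (our own construction; the function names and the constant-velocity test system are illustrative choices, not from the text) integrates Υ̇(t) = F(t)Υ(t) with a classical Runge-Kutta step, saving Υ(t) at the times of interest, and then forms Φ(t, s) = Υ(t)Υ(s)⁻¹:

```python
import numpy as np

def fundamental_matrices(F, times):
    """Integrate Upsilon_dot(t) = F(t) Upsilon(t), Upsilon(t_0) = identity,
    with RK4, saving Upsilon at every entry of 'times'.  F is a callable
    returning the (possibly time-varying) system matrix at time t."""
    Ups = [np.eye(F(times[0]).shape[0])]
    for t0, t1 in zip(times[:-1], times[1:]):
        h, U = t1 - t0, Ups[-1]
        k1 = F(t0) @ U
        k2 = F(t0 + 0.5 * h) @ (U + 0.5 * h * k1)
        k3 = F(t0 + 0.5 * h) @ (U + 0.5 * h * k2)
        k4 = F(t1) @ (U + h * k3)
        Ups.append(U + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4))
    return Ups

def transition(Ups, i, j):
    """Transition function Phi(t_i, t_j) = Upsilon(t_i) Upsilon(t_j)^{-1}."""
    return Ups[i] @ np.linalg.inv(Ups[j])

# Hypothetical constant-velocity test: F = [[0, 1], [0, 0]] has the exact
# transition function Phi(t, s) = [[1, t - s], [0, 1]].
times = np.linspace(0.0, 1.0, 101)
Ups = fundamental_matrices(lambda t: np.array([[0.0, 1.0], [0.0, 0.0]]), times)
Phi_10 = transition(Ups, 100, 50)
```

Because this F is constant and nilpotent, the RK4 steps reproduce the analytical transition function essentially exactly; for a general time-varying F(t), the result is accurate to integration tolerance.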

4.4.2 Observation Model


The linearized observation model is
yk ≈ g(xop,k , 0) + Gk (xk − xop,k ) + n′k , (4.197)
which can be written in lifted form as
y = yop + G (x − xop ) + n′ , (4.198)
where
 
\[
y_{op} = \begin{bmatrix} g(x_{op,0}, 0) \\ g(x_{op,1}, 0) \\ \vdots \\ g(x_{op,K}, 0) \end{bmatrix}, \tag{4.199a}
\]
\[
G = \mathrm{diag}\left(G_0, G_1, G_2, \ldots, G_K\right), \tag{4.199b}
\]
\[
R' = \mathrm{diag}\left(R'_0, R'_1, R'_2, \ldots, R'_K\right), \tag{4.199c}
\]
and n′ ∼ N (0, R′ ). It is fairly easy to see that
\[
E[y] = y_{op} + G(\check{x} - x_{op}), \tag{4.200a}
\]
\[
E\left[(y - E[y])(y - E[y])^T\right] = G\check{P}G^T + R', \tag{4.200b}
\]
\[
E\left[(y - E[y])(x - E[x])^T\right] = G\check{P}. \tag{4.200c}
\]

4.4.3 Bayesian Inference


With these quantities in hand, we can write a joint density for the lifted
trajectory and measurements as
   
\[
p(x, y \,|\, v) = \mathcal{N}\left(
\begin{bmatrix} \check{x} \\ y_{op} + G(\check{x} - x_{op}) \end{bmatrix},
\begin{bmatrix} \check{P} & \check{P}G^T \\ G\check{P} & G\check{P}G^T + R' \end{bmatrix}
\right), \tag{4.201}
\]
which is quite similar to the expression for the IEKF situation in (4.42),
but now for the whole trajectory rather than just one timestep. Us-
ing the usual relationship from (2.46b), we can immediately write the
Gaussian posterior as
 
\[
p(x \,|\, v, y) = \mathcal{N}\left(\hat{x}, \hat{P}\right), \tag{4.202}
\]

where
\[
K = \check{P}G^T\left(G\check{P}G^T + R'\right)^{-1}, \tag{4.203a}
\]
\[
\hat{P} = (1 - KG)\,\check{P}, \tag{4.203b}
\]
\[
\hat{x} = \check{x} + K\left(y - y_{op} - G(\check{x} - x_{op})\right). \tag{4.203c}
\]
Using the SMW identity from (2.68), we can rearrange the equation for the posterior mean to be
\[
\left(\check{P}^{-1} + G^T R'^{-1} G\right)\delta x^* = \check{P}^{-1}(\check{x} - x_{op}) + G^T R'^{-1}(y - y_{op}), \tag{4.204}
\]
where δx∗ = x̂ − xop . Inserting the details of the prior, this becomes
\[
\underbrace{\left(F^{-T} Q'^{-1} F^{-1} + G^T R'^{-1} G\right)}_{\text{block-tridiagonal}} \delta x^* = F^{-T} Q'^{-1}\left(\nu - F^{-1} x_{op}\right) + G^T R'^{-1}\left(y - y_{op}\right). \tag{4.205}
\]
This result is identical in form to the nonlinear, discrete-time batch
solution discussed earlier in this chapter. The only difference is that we
started with a continuous-time motion model and integrated it directly
to evaluate the prior at the measurement times.

4.4.4 Algorithm Summary


We summarize the steps needed to carry out GP regression with a
nonlinear motion model and/or measurement model:
1. Start with an initial guess for the posterior mean over the whole
trajectory, xop (t). We will only need this over the whole trajectory
to initialize the process. We will only be updating our estimate at the
measurement times, xop , then using the GP interpolation to fill in
the other times.
2. Calculate ν, F⁻¹, and Q′⁻¹ for the new iteration. This will
likely be done numerically and will involve determining ν(s), F(s)
(through Φ(t, s)), and L(s), which in turn will require xop (s) and
hence x̌(s), P̌(s), x̌, P̌ from the previous iteration to do the
interpolation inside the required integrals.
3. Calculate yop , G, and R′⁻¹ for the new iteration.
4. Solve for δx∗ in the following equation:
\[
\underbrace{\left(F^{-T} Q'^{-1} F^{-1} + G^T R'^{-1} G\right)}_{\text{block-tridiagonal}} \delta x^* = F^{-T} Q'^{-1}\left(\nu - F^{-1} x_{op}\right) + G^T R'^{-1}\left(y - y_{op}\right). \tag{4.206}
\]
In practice, we will prefer to build only the non-zero blocks in the
products of matrices that appear in this equation.

5. Update the guess at the measurement times using

\[
x_{op} \leftarrow x_{op} + \delta x^*, \tag{4.207}
\]

and check for convergence. If not converged, return to Step 2. If


converged, output x̂ = xop .
6. If desired, compute the covariance at the measurement times, P̂.
7. Use the GP interpolation equation14 to also compute the estimate
at other times of interest, x̂τ , P̂τ τ .

The most expensive step in this whole process is building ν, F⁻¹, and
Q′⁻¹. However, the cost (at each iteration) will still be linear in the
length of the trajectory and therefore should be manageable.
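To make Steps 4 and 5 concrete, here is a deliberately tiny numerical sketch (our own construction, not from the text): a scalar random-walk prior, x_k = x_{k−1} + w_k, so that Φ(t_k, t_{k−1}) = 1 and F⁻¹ has a simple banded form, paired with a hypothetical exponential measurement model, y_k = exp(x_k) + n_k. The noise values are illustrative. The loop builds and solves (4.206) and applies the update (4.207):

```python
import numpy as np

K = 5
rng = np.random.default_rng(1)
x_true = np.cumsum(rng.normal(0.0, 0.3, K))        # simulated trajectory
y = np.exp(x_true) + rng.normal(0.0, 0.01, K)      # hypothetical measurements

Finv = np.eye(K) - np.eye(K, k=-1)                 # F^{-1} when all Phi = 1
Qpinv = np.diag([1.0] + [10.0] * (K - 1))          # Q'^{-1} = diag(P0, Q, ...)^{-1}
Rinv = np.eye(K) / 0.01**2                         # R'^{-1}
nu = np.zeros(K)                                   # prior mean: x0 = 0, no input

x_op = np.zeros(K)                                 # initial guess (Step 1)
for _ in range(50):
    G = np.diag(np.exp(x_op))                      # measurement Jacobian
    A = Finv.T @ Qpinv @ Finv + G.T @ Rinv @ G     # block-tridiagonal LHS
    b = Finv.T @ Qpinv @ (nu - Finv @ x_op) + G.T @ Rinv @ (y - np.exp(x_op))
    dx = np.linalg.solve(A, b)                     # Step 4: solve for dx*
    x_op = x_op + dx                               # Step 5: update the guess
    if np.linalg.norm(dx) < 1e-12:                 # convergence check
        break
```

For a real problem, the dense products above would be replaced by building only the non-zero blocks, as noted in Step 4; the structure of the update is unchanged.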

4.5 Summary
The main take-away points from this chapter are:

1. Unlike the linear-Gaussian case, the Bayesian posterior is not, in


general, a Gaussian PDF when the motion and observation models
are nonlinear and/or the measurement and process noises are non-
Gaussian.
2. To carry out nonlinear estimation, some form of approximation is
required. The different techniques vary in their choices of (i) how
to approximate the posterior (Gaussian, mixture of Gaussians, set
of samples), and (ii) how to approximately carry out inference (lin-
earization, Monte Carlo, sigmapoint transformation) or MAP esti-
mation.
3. There are a variety of methods, both batch and recursive, that ap-
proximate the posterior as a Gaussian. Some of these methods, par-
ticularly the ones that iterate the solution (i.e., batch MAP, IEKF)
converge to a ‘mean’ that is actually at the maximum of the Bayesian
posterior (which is not the same as the true mean of the Bayesian
posterior). This can be a point of confusion when comparing differ-
ent methods since, depending on the approximations made, we may
be asking the methods to find different answers.
4. Batch methods are able to iterate over the whole trajectory whereas
recursive methods can only iterate at one time-step at a time, mean-
ing they will converge to different answers on most problems.

The next chapter will look briefly at how to handle estimator bias,
measurement outliers, and data correspondences.
14 We did not work this out for the nonlinear case, but it should follow from the GP
section in the linear-Gaussian chapter.

4.6 Exercises
4.6.1 Consider the discrete-time system,
     
\[
\begin{bmatrix} x_k \\ y_k \\ \theta_k \end{bmatrix} =
\begin{bmatrix} x_{k-1} \\ y_{k-1} \\ \theta_{k-1} \end{bmatrix} +
T \begin{bmatrix} \cos\theta_{k-1} & 0 \\ \sin\theta_{k-1} & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} v_k \\ \omega_k \end{bmatrix} + w_k,
\qquad w_k \sim \mathcal{N}(0, Q),
\]
\[
\begin{bmatrix} r_k \\ \phi_k \end{bmatrix} =
\begin{bmatrix} \sqrt{x_k^2 + y_k^2} \\ \mathrm{atan2}(-y_k, -x_k) - \theta_k \end{bmatrix} + n_k,
\qquad n_k \sim \mathcal{N}(0, R),
\]
which could represent a mobile robot moving around on the xy-
plane and measuring the range and bearing to the origin. Set up
the EKF equation to estimate the pose of the mobile robot. In
particular, work out expressions for the Jacobians, Fk−1 and Gk ,
and modified covariances, Q′k and R′k .
4.6.2 Consider transforming the prior Gaussian density, N (µx , σx² ),
through the nonlinearity, f (x) = x³ . Use the Monte Carlo, lin-
earization, and sigmapoint methods to determine the transformed
mean and covariance and comment on the results. Hint: use Is-
serlis’ theorem to compute the higher-order Gaussian moments.
4.6.3 Consider transforming the prior Gaussian density, N (µx , σx² ),
through the nonlinearity, f (x) = x⁴ . Use the Monte Carlo, lin-
earization, and sigmapoint methods to determine the transformed
mean (and optionally covariance) and comment on the results.
Hint: use Isserlis’ theorem to compute the higher-order Gaussian
moments.
4.6.4 From the section on the sigmapoint Kalman filter, we learned
that the measurement covariance could be written as
\[
\Sigma_{yy,k} = \sum_{j=0}^{2N} \beta_j \left(\check{y}_{k,j} - \mu_{y,k}\right)\left(\check{y}_{k,j} - \mu_{y,k}\right)^T + R_k,
\]

when the measurement model has linear dependence on the mea-


surement noise. Verify that this can also be written as
\[
\Sigma_{yy,k} = Z_k Z_k^T + R_k, \qquad \text{where} \qquad \mathrm{col}_j\, Z_k = \sqrt{\beta_j}\left(\check{y}_{k,j} - \mu_{y,k}\right).
\]
4.6.5 Show that the below two identities used in the section on ML
bias estimation are true:
\[
A(x) R A(x)^T \equiv W(x)^{-1}, \qquad A(x) R P_k^T \equiv W(x)^{-1} G_k(x).
\]
5

Biases, Correspondences, and Outliers

In the last chapter, we learned that our estimation machinery can be


biased, particularly when our motion/observation models are nonlinear.
On our simple stereo camera example, we saw that MAP estimation is
biased with respect to the mean of the full posterior. We also saw that
the batch ML method is biased with respect to the groundtruth and
derived an expression to try to quantify that bias. Unfortunately, these
are not the only sources of bias.
In many of our estimation techniques, we make the assumption that
the noise corrupting the inputs or the measurements is zero-mean Gaus-
sian. In reality, our inputs and/or measurements may also be corrupted
with unknown biases. If we do not account for these, our estimate will
also be biased. The classic example of this is the typical accelerometer,
which can have temperature-dependent biases that change over time.
Another huge issue in many estimation problems is determining cor-
respondences between measurements and a model. For example, if we
are measuring the range to a landmark, we might assume we know
which landmark is being measured. This is a very big assumption. An-
other good example is a star tracker, which detects points of lights;
how do we know which point of light corresponds to which star in our
star chart? The pairing of a measurement with a part of a model/map
is termed determining correspondences or data association.
Finally, despite our best efforts to negate the effects of biases and
find proper correspondences, something deleterious can always happen
to our measurements so that we are stuck with a datum that is highly
improbable according to our noise model; we call this an outlier mea-
surement. If we do not properly detect and remove outliers, many of
our estimation techniques will fail, often catastrophically.
This chapter will investigate how to deal with inputs/measurements
that are not well-behaved. It will present some of the classic tactics
for handling these types of biases, determining correspondences, and
detecting/rejecting outliers. A handful of examples will be provided as
illustrations.

5.1 Handling Input/Measurement Biases


In this section, we will investigate the impact of a bias on both the
inputs and measurements. We will see that the case of the input bias
is less difficult to deal with than the measurement bias, but both can
be handled. We will use linear, time-invariant motion and observation
models with non-zero-mean Gaussian noise for the purpose of our dis-
cussion, but many of the concepts to be discussed can also be extended
to nonlinear systems.

5.1.1 Bias Effects on the Kalman Filter


As an example of the effect of an input/measurement bias, we return
to the error dynamics discussed in Section 3.3.6 and see what happens
to the Kalman filter (if we do not explicitly account for bias) when we
introduce non-zero-mean noise. In particular, we will now assume that

xk = Axk−1 + B(uk + ū) + wk , (5.1a)


yk = Cxk + ȳ + nk . (5.1b)
where ū is an input bias and ȳ a measurement bias. We will continue to
assume that all measurements are corrupted with zero-mean Gaussian
noise,
wk ∼ N (0, Q), nk ∼ N (0, R), (5.2)
that is statistically-independent, i.e.,
E[wk wl ] = 0, E[nk nl ] = 0, E[wk nk ] = 0, E[wk nl ] = 0, (5.3)
for all k ≠ l (violations of these assumptions could be another source of filter inconsistency).
We earlier defined the estimation errors,
ěk = x̌k − xk , (5.4a)
êk = x̂k − xk , (5.4b)
and constructed the ‘error dynamics’, which in this case are
ěk = Aêk−1 − (B ū + wk ), (5.5a)
êk = (1 − Kk C) ěk + Kk (ȳ + nk ). (5.5b)
where ê0 = x̂0 −x0 . Furthermore, as discussed earlier, for our estimator
to be unbiased and consistent we would like to have for all k = 1 . . . K
that
   
E [êk ] = 0, E [ěk ] = 0, E êk êTk = P̂k , E ěk ěTk = P̌k , (5.6)
which we showed was true in the case that ū = ȳ = 0. Let us see what
happens when this zero-bias condition does not necessarily hold. We

will still assume that

 
\[
E[\hat{e}_0] = 0, \qquad E\left[\hat{e}_0\hat{e}_0^T\right] = \hat{P}_0, \tag{5.7}
\]
although this initial condition is another place a bias could be introduced. At k = 1 we have
\[
E[\check{e}_1] = A\underbrace{E[\hat{e}_0]}_{0} - B\bar{u} - \underbrace{E[w_1]}_{0} = -B\bar{u}, \tag{5.8a}
\]
\[
E[\hat{e}_1] = (1 - K_1 C)\underbrace{E[\check{e}_1]}_{-B\bar{u}} + K_1\big(\bar{y} + \underbrace{E[n_1]}_{0}\big) = -(1 - K_1 C)B\bar{u} + K_1\bar{y}, \tag{5.8b}
\]

which are already biased in the case that ū ≠ 0 and/or ȳ ≠ 0. For the
covariance of the ‘predicted error’ we have
\[
\begin{aligned}
E\left[\check{e}_1\check{e}_1^T\right] &= E\left[\big(A\hat{e}_0 - (B\bar{u} + w_1)\big)\big(A\hat{e}_0 - (B\bar{u} + w_1)\big)^T\right] \\
&= \underbrace{E\left[(A\hat{e}_0 - w_1)(A\hat{e}_0 - w_1)^T\right]}_{\check{P}_1}
 + (-B\bar{u})\underbrace{E\left[(A\hat{e}_0 - w_1)^T\right]}_{0}
 + \underbrace{E\left[(A\hat{e}_0 - w_1)\right]}_{0}(-B\bar{u})^T
 + (-B\bar{u})(-B\bar{u})^T \\
&= \check{P}_1 + (-B\bar{u})(-B\bar{u})^T. \tag{5.9}
\end{aligned}
\]
Rearranging, we see that
\[
\check{P}_1 = E\left[\check{e}_1\check{e}_1^T\right] - \underbrace{E[\check{e}_1]\,E[\check{e}_1]^T}_{\text{bias effect}}, \tag{5.10}
\]

and therefore the KF will ‘underestimate’ the true uncertainty in the


error and become inconsistent. For the covariance of the ‘corrected

error’ we have
\[
\begin{aligned}
E\left[\hat{e}_1\hat{e}_1^T\right] &= E\Big[\big((1 - K_1 C)\check{e}_1 + K_1(\bar{y} + n_1)\big)\big((1 - K_1 C)\check{e}_1 + K_1(\bar{y} + n_1)\big)^T\Big] \\
&= \underbrace{E\Big[\big((1 - K_1 C)\check{e}_1 + K_1 n_1\big)\big((1 - K_1 C)\check{e}_1 + K_1 n_1\big)^T\Big]}_{\hat{P}_1}
 + (K_1\bar{y})\underbrace{E\Big[\big((1 - K_1 C)\check{e}_1 + K_1 n_1\big)\Big]^T}_{\left(-(1 - K_1 C)B\bar{u}\right)^T} \\
&\qquad + \underbrace{E\Big[\big((1 - K_1 C)\check{e}_1 + K_1 n_1\big)\Big]}_{-(1 - K_1 C)B\bar{u}}(K_1\bar{y})^T + (K_1\bar{y})(K_1\bar{y})^T \\
&= \hat{P}_1 + \big(-(1 - K_1 C)B\bar{u} + K_1\bar{y}\big)\big(-(1 - K_1 C)B\bar{u} + K_1\bar{y}\big)^T, \tag{5.11}
\end{aligned}
\]
and so
\[
\hat{P}_1 = E\left[\hat{e}_1\hat{e}_1^T\right] - \underbrace{E[\hat{e}_1]\,E[\hat{e}_1]^T}_{\text{bias effect}}, \tag{5.12}
\]

where we can see again the KF’s estimate of the covariance is overcon-
fident and thus inconsistent. It is interesting to note that the KF will
become overconfident, regardless of the sign of the bias. Moreover, it
is not hard to see that as k gets bigger, the effects of the biases grow
without bound. It is tempting to modify the KF to be

predictor:
\[
\check{P}_k = A\hat{P}_{k-1}A^T + Q, \tag{5.13a}
\]
\[
\check{x}_k = A\hat{x}_{k-1} + Bu_k + \underbrace{B\bar{u}}_{\text{bias}}, \tag{5.13b}
\]
Kalman gain:
\[
K_k = \check{P}_k C^T\left(C\check{P}_k C^T + R\right)^{-1}, \tag{5.13c}
\]
corrector:
\[
\hat{P}_k = (1 - K_k C)\,\check{P}_k, \tag{5.13d}
\]
\[
\hat{x}_k = \check{x}_k + K_k\big(y_k - C\check{x}_k - \underbrace{\bar{y}}_{\text{bias}}\big), \tag{5.13e}
\]

whereupon we recover an unbiased and consistent estimate. The prob-


lem is that we must know the value of the bias exactly for this to be
a viable means of counteracting its effects. In most cases we do not
know the exact value of the bias (it may even change with time). Given
that we already have an estimation problem, it is logical to attempt to
include estimation of the bias into our problem. The next few sections
will investigate this possibility for both inputs and measurements.
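The effect above is easy to reproduce in simulation. The sketch below (our own toy setup, not from the text: a scalar system with A = C = B = 1, illustrative noise values, and an unmodelled input bias ū = 1) runs a standard KF that ignores the bias; the estimation error settles near −(1 − K)ū/K, far outside the filter's reported ±3σ envelope:

```python
import numpy as np

A = C = B = 1.0
Q, R, u_bar = 0.01, 1.0, 1.0       # process/measurement noise, unmodelled bias
rng = np.random.default_rng(0)

x, x_hat, P = 0.0, 0.0, 1.0        # true state, estimate, estimator covariance
for _ in range(200):
    # True system: the input (here zero) is corrupted by the bias u_bar.
    x = A * x + B * (0.0 + u_bar) + rng.normal(0.0, np.sqrt(Q))
    y = C * x + rng.normal(0.0, np.sqrt(R))
    # Standard KF that is unaware of the bias:
    x_chk, P_chk = A * x_hat + B * 0.0, A * P * A + Q     # predictor
    K = P_chk * C / (C * P_chk * C + R)                   # Kalman gain
    x_hat, P = x_chk + K * (y - C * x_chk), (1.0 - K) * P_chk  # corrector

err = x_hat - x   # settles near -(1 - K) * u_bar / K, i.e., about -9.5 here
```

The filter's own standard deviation, √P, is roughly 0.3 here, so the steady-state error of about nine units is wildly inconsistent with the reported covariance, regardless of the sign of the bias.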

5.1.2 Unknown Input Bias


Continuing from the previous section, suppose we had ȳ = 0 but not
necessarily ū = 0. Rather than estimating just the state of the system,
xk , we augment the state to be
\[
x'_k = \begin{bmatrix} x_k \\ \bar{u}_k \end{bmatrix}, \tag{5.14}
\]
where we have made the bias now a function of time as we want it to
be part of our state. As the bias is now a function of time, we need to
define a motion model for it. A typical one is to assume that

ūk = ūk−1 + sk , (5.15)

where sk ∼ N (0, W ); this corresponds to Brownian motion (a.k.a.,


random walk) of the bias. In some sense, we are simply pushing the
problem back through an integrator as we now have zero-mean Gaus-
sian noise influencing the motion of the interoceptive bias. In practice,
this type of trick can be effective. Other motion models for the bias
could also be assumed, but often we do not have a lot of information
as to their temporal behaviour. Under this bias motion model, we have
for the motion model for our augmented state that
     
\[
x'_k = \underbrace{\begin{bmatrix} A & B \\ 0 & 1 \end{bmatrix}}_{A'} x'_{k-1}
 + \underbrace{\begin{bmatrix} B \\ 0 \end{bmatrix}}_{B'} u_k
 + \underbrace{\begin{bmatrix} w_k \\ s_k \end{bmatrix}}_{w'_k}, \tag{5.16}
\]

where we have defined several new symbols for convenience. We note


that
 
\[
w'_k \sim \mathcal{N}\left(0, Q'\right), \qquad Q' = \begin{bmatrix} Q & 0 \\ 0 & W \end{bmatrix}, \tag{5.17}
\]
so we are back to an unbiased system. The observation model is simply

 
\[
y_k = \underbrace{\begin{bmatrix} C & 0 \end{bmatrix}}_{C'} x'_k + n_k, \tag{5.18}
\]

in terms of the augmented state.


A critical question to ask is whether or not this augmented-state
filter will converge to the correct answer. Will the above trick really
work? The conditions we saw for existence and uniqueness of the linear-
Gaussian batch estimation earlier (with no prior on the initial condi-
tion) were
Q > 0, R > 0, rank O = N. (5.19)

Figure 5.1 Input bias on acceleration: the cart obeys xk = xk−1 + vk and vk = vk−1 + ak + āk (interoceptive bias āk ) and measures dk = xk . In this case we can successfully estimate the bias as part of our state estimation problem.

Let us assume these conditions do indeed hold for the system in the case that the bias is zero, i.e., ū = 0. Defining
\[
\mathcal{O}' = \begin{bmatrix} C' \\ C'A' \\ \vdots \\ C'A'^{\,N+U-1} \end{bmatrix}, \tag{5.20}
\]
we are required to show that
\[
\underbrace{Q' > 0, \quad R > 0}_{\text{true by definitions}}, \qquad \mathrm{rank}\; \mathcal{O}' = N + U, \tag{5.21}
\]

for existence and uniqueness of the solution to the batch estimation


problem for the augmented-state system. The first two conditions are
true by the definitions of these covariance matrices. For the last con-
dition, the rank needs to be N + U since the augmented state now
includes the bias, where U = dim ūk . In general this condition does
not hold. The next two examples will illustrate this.
Example 5.1 Take the system matrices to be
   
\[
A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 \end{bmatrix}, \tag{5.22}
\]
such that N = 2 and U = 1. This example roughly corresponds to
a simple one-dimensional unit-mass cart, whose state is its position
and velocity. The input is the acceleration and the measurement is the
distance back to the origin. The bias is on the input. See Figure 5.1
for an illustration. We have
   
\[
\mathcal{O} = \begin{bmatrix} C \\ CA \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}
\quad \Rightarrow \quad \mathrm{rank}\; \mathcal{O} = 2 = N, \tag{5.23}
\]
so the unbiased system is observable1 . For the augmented-state system
we have
     
\[
\mathcal{O}' = \begin{bmatrix} C' \\ C'A' \\ C'A'^2 \end{bmatrix}
= \begin{bmatrix} C & 0 \\ CA & CB \\ CA^2 & CAB + CB \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 2 & 1 \end{bmatrix}
\quad \Rightarrow \quad \mathrm{rank}\; \mathcal{O}' = 3 = N + U, \tag{5.24}
\]
1 It is also controllable.
Figure 5.2 Input biases on both speed and acceleration: the cart obeys xk = xk−1 + vk + v̄k and vk = vk−1 + ak + āk (interoceptive biases v̄k , āk ) and measures dk = xk . In this case we cannot estimate the biases as the system is unobservable.
 
so it, too, is observable. Note, taking B = [1 0]ᵀ is observable, too².

Example 5.2 Take the system matrices to be


   
\[
A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 \end{bmatrix}, \tag{5.25}
\]
such that N = 2 and U = 2. This is a strange system wherein the
command to the system is a function of both speed and acceleration,
and we have biases on both of these quantities. See Figure 5.2 for an
illustration. We still have that the unbiased system is observable since
A and C are unchanged. For the augmented-state system we have
     
\[
\mathcal{O}' = \begin{bmatrix} C' \\ C'A' \\ C'A'^2 \\ C'A'^3 \end{bmatrix}
= \begin{bmatrix} C & 0 \\ CA & CB \\ CA^2 & C(A + 1)B \\ CA^3 & C(A^2 + A + 1)B \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 2 & 2 & 1 \\ 1 & 3 & 3 & 3 \end{bmatrix}
\quad \Rightarrow \quad \mathrm{rank}\; \mathcal{O}' = 3 < 4 = N + U, \tag{5.26}
\]
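Both examples are easy to check numerically. The helper below (our own sketch) builds the observability matrix for a given pair and confirms that the augmented system of Example 5.1 is observable while that of Example 5.2 is not:

```python
import numpy as np

def obs_matrix(A, C):
    """Stack C, CA, CA^2, ..., CA^{n-1} for an n-dimensional state."""
    rows = [C]
    for _ in range(A.shape[0] - 1):
        rows.append(rows[-1] @ A)
    return np.vstack(rows)

A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])

# Example 5.1: single input bias, B = [0 1]^T, augmented state [x, v, u_bar].
B1 = np.array([[0.0], [1.0]])
A1 = np.block([[A, B1], [np.zeros((1, 2)), np.eye(1)]])
C1 = np.hstack([C, np.zeros((1, 1))])
r1 = np.linalg.matrix_rank(obs_matrix(A1, C1))   # 3 = N + U: observable

# Example 5.2: biases on both speed and acceleration, B = identity.
B2 = np.eye(2)
A2 = np.block([[A, B2], [np.zeros((2, 2)), np.eye(2)]])
C2 = np.hstack([C, np.zeros((1, 2))])
r2 = np.linalg.matrix_rank(obs_matrix(A2, C2))   # 3 < 4 = N + U: unobservable
```

The same helper can be reused to test any candidate bias-augmentation before committing to it in a filter.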
so it is not observable (since columns 2 and 3 are the same).

5.1.3 Unknown Measurement Bias


Suppose now we have ū = 0 but not necessarily ȳ = 0. The augmented state is
\[
x'_k = \begin{bmatrix} x_k \\ \bar{y}_k \end{bmatrix}, \tag{5.27}
\]
where we have again made the bias a function of time. We again assume
a random-walk motion model
ȳk = ȳk−1 + sk , (5.28)
where sk ∼ N (0, W). Under this bias motion model, we have for the
motion model for our augmented state that
     
\[
x'_k = \underbrace{\begin{bmatrix} A & 0 \\ 0 & 1 \end{bmatrix}}_{A'} x'_{k-1}
 + \underbrace{\begin{bmatrix} B \\ 0 \end{bmatrix}}_{B'} u_k
 + \underbrace{\begin{bmatrix} w_k \\ s_k \end{bmatrix}}_{w'_k}, \tag{5.29}
\]

2 But now the unbiased system is not controllable.


Figure 5.3 Measurement bias on position: the cart obeys xk = xk−1 + vk and vk = vk−1 + ak and measures dk = xk + d̄k (exteroceptive bias d̄k ). In this case we cannot estimate the bias as the system is unobservable.

where we have defined several new symbols for convenience. We note


that
 
\[
w'_k \sim \mathcal{N}\left(0, Q'\right), \qquad Q' = \begin{bmatrix} Q & 0 \\ 0 & W \end{bmatrix}. \tag{5.30}
\]
The observation model is
\[
y_k = \underbrace{\begin{bmatrix} C & 1 \end{bmatrix}}_{C'} x'_k + n_k, \tag{5.31}
\]

in terms of the augmented state. We again examine the observability


of the system in the context of an example.
Example 5.3 Take the system matrices to be
   
\[
A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 \end{bmatrix}, \tag{5.32}
\]
such that N = 2 and U = 1. This corresponds to our cart measuring
its distance from a landmark (whose position it does not know – see
Figure 5.3). In the context of mobile robotics this is a very simple
example of simultaneous localization and mapping (SLAM), a popular
estimation research area. The ‘localization’ is the cart state and the
‘map’ is the landmark position (here the negative of the bias).
We have
   
\[
\mathcal{O} = \begin{bmatrix} C \\ CA \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}
\quad \Rightarrow \quad \mathrm{rank}\; \mathcal{O} = 2 = N, \tag{5.33}
\]
so the unbiased system is observable. For the augmented-state system
we have
     
\[
\mathcal{O}' = \begin{bmatrix} C' \\ C'A' \\ C'A'^2 \end{bmatrix}
= \begin{bmatrix} C & 1 \\ CA & 1 \\ CA^2 & 1 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 2 & 1 \end{bmatrix}
\quad \Rightarrow \quad \mathrm{rank}\; \mathcal{O}' = 2 < 3 = N + U, \tag{5.34}
\]
so it is not observable (since columns 1 and 3 are the same). Since
we are rank-deficient by 1, this means that dim( null O0 ) = 1; the
nullspace of the observability matrix corresponds to those vectors that

produce outputs of zero. Here we see that
\[
\mathrm{null}\; \mathcal{O}' = \mathrm{span}\left\{ \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \right\}, \tag{5.35}
\]
which means that we can shift the cart and landmark together (left or
right) and the measurement will not change. Does this mean our esti-
mator will fail? Not if we are careful to interpret the solutions properly;
we do so for both batch LG estimation and the KF:
(i) In the batch LG estimator, the left-hand side cannot be in-
verted, but recalling basic linear algebra, every system of the
form Ax = b can have zero, one, or infinitely many solutions.
In this case we have infinitely many solutions rather than a
single unique solution.
(ii) In the KF, we need to start with an initial guess for the state.
The final answer we get will depend on the initial conditions
selected. In other words, the value of the bias will remain at its
initial guess.
In both cases, we have a way forwards.
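The unobservable direction can also be recovered numerically: the right singular vector associated with the zero singular value of the augmented observability matrix spans its nullspace. A short sketch (our own) for Example 5.3:

```python
import numpy as np

# Augmented observability matrix from (5.34): state is [position, velocity, bias].
O = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 1.0]])
_, s, Vt = np.linalg.svd(O)
rank = int(np.sum(s > 1e-10))   # 2 < 3: rank-deficient by one
null_dir = Vt[-1]               # proportional to [1, 0, -1]^T
```

The recovered direction has equal and opposite components on the cart position and the bias (and none on the velocity): shifting the cart and landmark together is invisible to the measurement, exactly as argued above.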

5.2 Data Association


As discussed above, the data association problem has to do with fig-
uring out which measurements correspond to which parts of a model.
Virtually all real estimation techniques, particularly for robotics, em-
ploy some form of model or map in order to determine a vehicle’s state,
and in particular its position/orientation. Some common examples of
these models/maps are:
(i) Positioning using GPS satellites. Here the positions of the GPS
satellites are assumed to be known (as a function of time) in a
reference frame attached to the Earth (e.g., using their orbital
elements). A GPS receiver on the ground measures range to the
satellites (e.g., using time of flight based on a timing message
sent by the satellite) and then triangulates for position. In this
case, it is easy to know which range measurement is associated
with which satellite because the whole system is engineered and
therefore unique codes are embedded in the timing messages to
indicate which satellite has sent which message.
(ii) Attitude determination using celestial observation. A map (or
chart) of all the brightest stars in the sky is used by a star sensor
to determine which direction the sensor is pointing. Here the
natural world is being used as a map (surveyed in advance) and
thus data association, or knowing which star you are looking
Figure 5.4 A measurement and point-cloud model with two possible data associations shown.

at, is much more difficult than in the GPS case. Because the
star chart can be generated in advance, this system becomes
practical.
There are essentially two main approaches to data association: external
and internal.

5.2.1 External Data Association


In external data association, specialized knowledge of the model/mea-
surements is used for association. This knowledge is ‘external’ to the
estimation problem. This is sometimes called ‘known data association’
because from the perspective of the estimation problem, the job has
been done.
For example, a bunch of targets could be painted with unique colours;
a stereo camera could be used to observe the targets and the colour in-
formation used to do data association. The colour information would
not be used in the estimation problem. Other examples of external data
association include visual bar codes and unique transmission frequen-
cies/codes (e.g., GPS satellites).
External data association can work well if the model can be modified
in advance to be cooperative; this makes the estimation problem a lot
easier. However, cutting-edge computer vision techniques can be used
as external data association on unprepared models, too, although the
results are more prone to misassociations.

5.2.2 Internal Data Association


In internal data association, only the measurements/model are used
to do data association. This is sometimes called ‘unknown data as-
sociation’. Typically, association is based on the likelihood of a given
measurement, given the model. In the simplest version, the most likely
association is accepted and the other possibilities are ignored. More
Figure 5.5 A pathological configuration of buildings can trick a GPS system into using an incorrect range measurement: the line-of-sight path from the transmitter is blocked and the receiver instead picks up an erroneous longer path via a reflection.

sophisticated techniques allow multiple data association hypotheses to


be carried forward into the estimation problem.
In the case of certain types of models, such as three-dimensional
landmarks or star charts, ‘constellations’ of landmarks are sometimes
used to help perform data association (see Figure 5.4). The data-aligned
rigidity-constrained exhaustive search (DARCES) algorithm (Chen et al.,
1999) is an example of a constellation-based data association method.
The idea is that the distances between pairs of points in a constellation
can be used as a type of unique identifier for data association.
Regardless of the type of data association employed, it is highly likely
that if an estimation technique fails, the blame can be squarely placed
on bad data association. For this reason, it is very important to ac-
knowledge that misassociations will occur in practice, and therefore
design in techniques to make the estimation problem robust to these
occurrences. The next section on outlier detection and rejection will
discuss some methods to help deal with this type of problem.

5.3 Handling Outliers


Data misassociation can certainly lead to an estimator completely di-
verging. However, data association is not the only cause of divergence.
There are other factors that can cause any particular measurement to
be very poor/incorrect. A classic example is multipath reflections of
GPS timing signals near tall buildings. Figure 5.5 illustrates this point.
A reflected signal can give a range measurement that is too long. In
the absence of additional information, when the line-of-sight path is
blocked the receiver has no way of knowing the longer path is incor-
rect.
We call measurements that are extremely improbable (according to
our measurement model), outliers. Just how improbable is a matter
of choice, but a common criterion (for one-dimensional data) is to
consider measurements that are more than three standard deviations
away from the mean to be outliers.
If we accept that a portion (possibly large) of our measurements
Figure 5.6 Line-fitting example (left panel: example dataset with inliers and outliers; right panel: RANSAC finds the line with the most inliers). If a line is fit to all the data, the outliers will have a large impact on the result. The RANSAC approach is to classify the data as either an inlier or an outlier and then only use the inliers in the line fit.

could be outliers, we need to devise a means to detect and reduce/remove


the influence of outliers on our estimators. We will discuss the two most
common techniques to handle outliers:
(i) Random sample consensus (Fischler and Bolles, 1981)
(ii) M-Estimation (Zhang, 1997) (also called iteratively reweighted
least-squares)
These can be used separately or in tandem.

5.3.1 RANSAC
Random sample consensus (RANSAC) is an iterative method to fit
a parameterized model to a set of observed data containing outliers.
Outliers are measurements that do not ‘fit’ a model, while inliers do
‘fit’. RANSAC is a probabilistic algorithm in the sense that its ability
to find a reasonable answer can only be guaranteed to occur with a
certain probability that improves with more time spent in the search.
Figure 5.6 provides a classic line-fitting example in the presence of
outliers.
RANSAC proceeds in an iterative manner. In the basic version, each
iteration consists of the following five steps:
1. Select a (small) random subset of the original data to be hypothe-
sized inliers (e.g., pick two points if fitting a line to xy-data).
2. Fit a model to the hypothesized inliers (e.g., a line is fit to two
points).
3. Test the rest of the original data against the fitted model and classify
as either inliers or outliers. If too few inliers are found, the iteration
is labelled invalid and aborted.
4. Refit the model using both the hypothesized and classified inliers.
5. Evaluate the refit model in terms of the residual error of all the inlier
data.

This is repeated for a large number of iterations and the hypothesis


with the lowest residual error is selected as the best.
A critical question to ask is how many iterations, k, are needed to
ensure a subset is selected comprised solely of inliers, with probability
p? In general this is difficult to answer. However, if we assume that
each measurement is selected independently, and each has probability
w of being an inlier, then the following relation holds:
\[
1 - p = \left(1 - w^n\right)^k, \tag{5.36}
\]
where n is the number of data points in the random subset and k is
the number of iterations. Solving for k gives
\[
k = \frac{\ln(1 - p)}{\ln\left(1 - w^n\right)}. \tag{5.37}
\]
In reality this can be thought of as an upper bound as the data points
are typically selected sequentially, not independently. There can also
be constraints between the data points that complicate the selection of
random subsets.
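The five steps above can be sketched compactly for the line-fitting example of Figure 5.6 (the data, thresholds, and iteration count below are our own illustrative choices):

```python
import numpy as np

def ransac_line(x, y, n_iters, thresh, rng):
    """Fit y = m*x + b by basic RANSAC: repeatedly fit a line to a random
    2-point subset, classify the remaining points by residual, keep the
    hypothesis with the most inliers, and refit on that inlier set."""
    best = None
    for _ in range(n_iters):
        i, j = rng.choice(len(x), size=2, replace=False)   # step 1
        if x[i] == x[j]:
            continue
        m = (y[j] - y[i]) / (x[j] - x[i])                  # step 2: fit subset
        b = y[i] - m * x[i]
        inliers = np.abs(y - (m * x + b)) < thresh         # step 3: classify
        if best is None or inliers.sum() > best.sum():
            best = inliers
    m, b = np.polyfit(x[best], y[best], 1)                 # step 4: refit inliers
    return m, b, best

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, 100)   # inliers on y = 2x + 1
y[:20] = rng.uniform(-10.0, 10.0, 20)            # 20% gross outliers
m, b, _ = ransac_line(x, y, n_iters=50, thresh=0.3, rng=rng)
```

With w = 0.8 and n = 2, (5.37) says only about seven iterations are needed for p = 0.999; fifty is comfortably conservative here.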

5.3.2 M-Estimation
Many of our earlier estimation techniques were shown to be minimizing
a sum-of-squared-error cost function. The trouble with sum-of-squared-
error cost functions, is that they are highly sensitive to outliers. A
single large outlier can exercise a huge influence on the estimate because
it dominates the cost. M-estimation3 modifies the shape of the cost
function so that outliers do not dominate the solution.
Recall that in Section 4.3.1 we wrote our overall nonlinear objective
function (for batch estimation) in the form
\[
J(x) = \frac{1}{2}\, u(x)^T u(x), \tag{5.38}
\]
which is quadratic. We also showed (in the linear case) that minimizing
this is equivalent to maximizing the likelihood of all the measurements.
The gradient of our original objective function was
\[
\frac{\partial J(x)}{\partial x} = u(x)^T \frac{\partial u(x)}{\partial x}. \tag{5.39}
\]
Let us now generalize this objective function and write it as
\[
J'(x) = \rho\left(u(x)\right), \tag{5.40}
\]
where ρ(·) is some nonlinear cost function; assume it is bounded, posi-
tive definite, has a unique zero at u(x) = 0, and increases more slowly
3 ‘M’ stands for ‘maximum likelihood-type’, i.e., a generalization of maximum likelihood
(which we saw earlier was equivalent to the least-squares solution).

than squared. Remember, u(x) = L e(x), so u(x) is just proportional


to the difference between the actual and expected measurements, e(x).
The gradient of our new function is simply
\[
\frac{\partial J'(x)}{\partial x} = \frac{\partial \rho}{\partial u} \frac{\partial u}{\partial x}, \tag{5.41}
\]
using the chain rule. We want the gradient to go to zero if we are
seeking a minimum:
    (∂ρ/∂u) (∂u/∂x) = 0.    (5.42)
We could solve this system of equations directly. Instead, we can solve
it iteratively by minimizing
    J''(x) = (1/2) u(x)^T W(x_op) u(x),    (5.43)

where

    W(x_op) = diag{ (1/u_i) (∂ρ/∂u_i) |_(x_op) },    (5.44)
is a weighting matrix that is evaluated using the estimated state from
the previous iteration, xop . At each iteration we are simply solving the
original least-squares problem (with an additional weighting matrix).
To see why this iterative scheme might work, we can examine the
gradient of J''(x):

    ∂J''(x)/∂x = u(x)^T W(x_op) ∂u(x)/∂x = 0.    (5.45)
If the iterative scheme converges, we will have x̂ = x_op, so

    (∂J'(x)/∂x)|_x̂ = (∂J''(x)/∂x)|_x̂ = 0,    (5.46)
and thus the two systems should arrive at the same solution, if a unique
minimum exists (which it will if the cost function conditions mentioned
above are met). To be clear, however, the path taken to get to the
optimum will differ if we minimize J''(x) rather than J'(x). For
example, if we apply Newton's method to both objective functions,
the Hessians are not the same.
There are many possible cost functions including
    ρ(u) = (1/2) u^2,    ρ(u) = (1/2) u^2/(1 + u^2),    ρ(u) = (1/q) |u|^q  (0 < q < 2).    (5.47)
Figure 5.7 compares the usual quadratic cost function to one less sus-
ceptible to outliers. Refer to Zhang (1997) for a more complete list
of cost functions and MacTavish and Barfoot (2015) for a comparison
study.
[Figure 5.7: Comparison of the quadratic cost function, ρ(u) = (1/2) u^2, and
the Geman-McClure cost function, ρ(u) = (1/2) u^2/(1 + u^2).]
5.4 Summary
The main take-away points from this chapter are:
1. There are always non-idealities (e.g., biases, outliers) that make the
real estimation problem different from the clean mathematical se-
tups discussed in this book. Sometimes these deviations result in
performance reductions that are the main source of error in prac-
tice.
2. In some situations, we can fold the estimation of a bias into our
estimation framework and in others we cannot. This comes down to
the question of observability.
3. In most practical estimation problems, outliers are a reality and thus
using some form of preprocessing (e.g., RANSAC) as well as a robust
cost function that downplays the effect of outliers is a necessity.
The next part of the book will introduce techniques for handling state
estimation in a three-dimensional world where objects are free to translate
and rotate.
5.5 Exercises
5.5.1 Consider the discrete-time system,
xk = xk−1 + vk + v̄k ,
d k = xk ,
where v̄k is an unknown input bias. Set up the augmented-state
system and determine if this system is observable.
5.5.2 Consider the discrete-time system,
xk = xk−1 + vk ,
vk = vk−1 + ak ,
d1,k = xk ,
d2,k = xk + d¯k ,
where d¯k is an unknown input bias (on just one of the two mea-
surement equations). Set up the augmented-state system and de-
termine if this system is observable.
5.5.3 How many RANSAC iterations, k, would be needed to pick a
set of n = 3 inlier points with probability p = 0.999, given that
each point has probability w = 0.1 of being an inlier?
5.5.4 What advantage might the robust cost function,
           { (1/2) u^2,                 u^2 ≤ 1
    ρ(u) = {
           { 2u^2/(1 + u^2) − 1/2,      u^2 ≥ 1
have over the Geman-McClure cost function?
Part II

Three-Dimensional Machinery
6

Primer on Three-Dimensional Geometry
This chapter will introduce three-dimensional geometry and specifically
the concept of a rotation and some of its representations. It pays
particular attention to the establishment of reference frames. Sastry (1999) is
a comprehensive reference on control for robotics that includes a back-
ground on three-dimensional geometry. Hughes (1986) also provides a
good first-principles background.

6.1 Vectors and Reference Frames


Vehicles (e.g., robots, satellites, aircraft) are typically free to translate
and rotate. Mathematically, they have six degrees of freedom: three in
translation and three in rotation. This six-degree-of-freedom geometric
configuration is known as the pose (position and orientation) of the
vehicle. Some vehicles may have multiple bodies connected together;
in this case each body has its own pose. We will consider only the
single-body case here.

[Figure 6.1: Vehicle and typical reference frames. The figure shows the
vehicle frame F_v, the inertial frame F_i, and the position vector r_vi
between them.]
6.1.1 Reference Frames

The position of a point on a vehicle can be described with a vector,
\vec{r}_{vi}, consisting of three components. Rotational motion is
described by expressing the orientation of a reference frame on the
vehicle, \vec{F}_v, with respect to another static (inertial) frame,
\vec{F}_i. Figure 6.1 shows the typical setup for a single-body vehicle.

We will take a vector to be a quantity \vec{r} having length and
direction. This vector can be expressed in a reference frame as

    \vec{r} = r_1 \vec{1}_1 + r_2 \vec{1}_2 + r_3 \vec{1}_3
            = [r_1  r_2  r_3] [\vec{1}_1; \vec{1}_2; \vec{1}_3]
            = r^T \vec{F}_1.    (6.1)

(Here and below, [a; b; c] denotes a column stacking of the quantities a,
b, c.) The quantity

    r = [r_1; r_2; r_3],

is a column matrix containing the components of \vec{r}. The quantity

    \vec{F}_1 = [\vec{1}_1; \vec{1}_2; \vec{1}_3],

is a column containing the basis (or unit) vectors forming the reference
frame \vec{F}_1. We shall refer to \vec{F}_1 as a vectrix (Hughes, 1986).
The vector can also be written as

    \vec{r} = [\vec{1}_1  \vec{1}_2  \vec{1}_3] [r_1; r_2; r_3] = \vec{F}_1^T r.
6.1.2 Dot Product

Consider two vectors, \vec{r} and \vec{s}, expressed in the same
reference frame \vec{F}_1:

    \vec{r} = [r_1  r_2  r_3] [\vec{1}_1; \vec{1}_2; \vec{1}_3],
    \vec{s} = [\vec{1}_1  \vec{1}_2  \vec{1}_3] [s_1; s_2; s_3].

The dot product (a.k.a., inner product) is given by

    \vec{r} · \vec{s} = [r_1  r_2  r_3] [ \vec{1}_1·\vec{1}_1   \vec{1}_1·\vec{1}_2   \vec{1}_1·\vec{1}_3 ]
                                        [ \vec{1}_2·\vec{1}_1   \vec{1}_2·\vec{1}_2   \vec{1}_2·\vec{1}_3 ] [s_1; s_2; s_3].
                                        [ \vec{1}_3·\vec{1}_1   \vec{1}_3·\vec{1}_2   \vec{1}_3·\vec{1}_3 ]

But

    \vec{1}_1·\vec{1}_1 = \vec{1}_2·\vec{1}_2 = \vec{1}_3·\vec{1}_3 = 1,

and

    \vec{1}_1·\vec{1}_2 = \vec{1}_2·\vec{1}_3 = \vec{1}_3·\vec{1}_1 = 0.

Therefore,

    \vec{r} · \vec{s} = r^T 1 s = r^T s = r_1 s_1 + r_2 s_2 + r_3 s_3.

The notation 1 will be used to designate the identity matrix. Its
dimension can be inferred from context.
6.1.3 Cross Product

The cross product of two vectors expressed in the same reference frame
is given by

    \vec{r} × \vec{s} = [r_1  r_2  r_3] [ \vec{1}_1×\vec{1}_1   \vec{1}_1×\vec{1}_2   \vec{1}_1×\vec{1}_3 ]
                                        [ \vec{1}_2×\vec{1}_1   \vec{1}_2×\vec{1}_2   \vec{1}_2×\vec{1}_3 ] [s_1; s_2; s_3]
                                        [ \vec{1}_3×\vec{1}_1   \vec{1}_3×\vec{1}_2   \vec{1}_3×\vec{1}_3 ]

                      = [r_1  r_2  r_3] [  \vec{0}      \vec{1}_3   −\vec{1}_2 ]
                                        [ −\vec{1}_3    \vec{0}      \vec{1}_1 ] [s_1; s_2; s_3]
                                        [  \vec{1}_2   −\vec{1}_1    \vec{0}   ]

                      = [\vec{1}_1  \vec{1}_2  \vec{1}_3] [  0    −r_3    r_2 ]
                                                          [  r_3    0    −r_1 ] [s_1; s_2; s_3]
                                                          [ −r_2   r_1     0  ]

                      = \vec{F}_1^T r^× s.

Hence, if \vec{r} and \vec{s} are expressed in the same reference frame,
the 3 × 3 matrix

    r^× = [  0    −r_3    r_2 ]
          [  r_3    0    −r_1 ]    (6.2)
          [ −r_2   r_1     0  ],

can be used to construct the components of the cross product. This
matrix is skew-symmetric^1; that is,

    (r^×)^T = −r^×.

It is easy to verify that

    r^× r = 0,

where 0 is a column matrix of zeros and

    r^× s = −s^× r.
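A small sketch (helper names ours) confirming that r^× s reproduces the cross product, that r^× r = 0, and that r^× is skew-symmetric:

```python
def skew(r):
    """The 3 x 3 skew-symmetric matrix r^x of (6.2)."""
    r1, r2, r3 = r
    return [[0.0, -r3, r2],
            [r3, 0.0, -r1],
            [-r2, r1, 0.0]]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

r = [1.0, 2.0, 3.0]
s = [4.0, 5.0, 6.0]
r_cross_s = matvec(skew(r), s)   # components of the cross product
r_cross_r = matvec(skew(r), r)   # the zero column
```

This matrix form of the cross product is used constantly in the remainder of the book, so it is worth having at one's fingertips.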

6.2 Rotations

Critical to our ability to estimate how objects are moving in the world
is the ability to parameterize the orientation, or rotation, of those
objects. We begin by introducing rotation matrices and then provide some
alternative representations.
6.2.1 Rotation Matrices

Let us consider two frames \vec{F}_1 and \vec{F}_2 with a common origin,
and let us express \vec{r} in each frame:

    \vec{r} = \vec{F}_1^T r_1 = \vec{F}_2^T r_2.

We seek to discover a relationship between the components in \vec{F}_1,
r_1, and those in \vec{F}_2, r_2. We proceed as follows:

    \vec{F}_2^T r_2 = \vec{F}_1^T r_1,
    \vec{F}_2 · \vec{F}_2^T r_2 = \vec{F}_2 · \vec{F}_1^T r_1,
    r_2 = C_21 r_1.

We have defined

    C_21 = \vec{F}_2 · \vec{F}_1^T
         = [ \vec{2}_1·\vec{1}_1   \vec{2}_1·\vec{1}_2   \vec{2}_1·\vec{1}_3 ]
           [ \vec{2}_2·\vec{1}_1   \vec{2}_2·\vec{1}_2   \vec{2}_2·\vec{1}_3 ]
           [ \vec{2}_3·\vec{1}_1   \vec{2}_3·\vec{1}_2   \vec{2}_3·\vec{1}_3 ].
− −12 →23·→
− −13
1 There are many equivalent notations in the literature for this skew-symmetric
definition: r× = r̂ = r∧ = −[[r]] = [r]× . For now, we use the first one since it makes an
obvious connection to the cross product; later we will also use (·)∧ as this is in common
use in robotics.
The matrix C_21 is called a rotation matrix. It is sometimes referred to
as a 'direction cosine matrix' since the dot product of two unit vectors
is just the cosine of the angle between them.
The unit vectors in \vec{F}_2 can be related to those in \vec{F}_1:

    \vec{F}_1^T = \vec{F}_2^T C_21.    (6.3)

Rotation matrices possess some special properties:

    r_1 = C_21^{−1} r_2 = C_12 r_2.

But, C_21^T = C_12. Hence,

    C_12 = C_21^{−1} = C_21^T.    (6.4)

We say that C_21 is an orthonormal matrix because its inverse is equal
to its transpose (and det C_12 = 1).

Consider three reference frames \vec{F}_1, \vec{F}_2, and \vec{F}_3.
The components of a vector \vec{r} in these three frames are r_1, r_2,
and r_3. Now,

    r_3 = C_32 r_2 = C_32 C_21 r_1.

But, r_3 = C_31 r_1, and therefore

    C_31 = C_32 C_21.

6.2.2 Principal Rotations

The most important rotations of one frame with respect to another are
those about one of the coordinate axes. For the situation where
\vec{F}_2 has been rotated from \vec{F}_1 through an angle θ_3 about the
3-axis, the rotation matrix is

    C_3(θ_3) = [  cos θ_3   sin θ_3   0 ]
               [ −sin θ_3   cos θ_3   0 ]    (6.5)
               [     0         0      1 ].

For a rotation about the 2-axis, the rotation matrix is

    C_2(θ_2) = [ cos θ_2   0   −sin θ_2 ]
               [    0      1       0    ]    (6.6)
               [ sin θ_2   0    cos θ_2 ].

For a rotation about the 1-axis, the rotation matrix is

    C_1(θ_1) = [ 1      0          0     ]
               [ 0   cos θ_1    sin θ_1  ]    (6.7)
               [ 0  −sin θ_1    cos θ_1  ].
6.2.3 Alternate Rotation Representations
We have seen one way of discussing the orientation of one reference
frame with respect to another: the rotation matrix. This requires nine
parameters (they are not independent). There are a number of other
alternatives.
The key thing to realize about the different representations of rotations
is that there are always only three underlying degrees of freedom.
The representations that have more than three parameters must have
associated constraints to limit the number of degrees of freedom to
three. The representations that have exactly three parameters have as-
sociated singularities. There is no perfect representation that is minimal
(i.e., having only three parameters) and that is also free of singularities
(Stuelpnagel, 1964).

[Margin note: Leonhard Euler (1707-1783) is considered to be the preeminent
mathematician of the 18th century and one of the greatest mathematicians to
have ever lived. He made important discoveries in fields as diverse as
infinitesimal calculus and graph theory. He also introduced much of the
modern mathematical terminology and notation, particularly for mathematical
analysis, such as the notion of a mathematical function. He is also renowned
for his work in mechanics, fluid dynamics, optics, astronomy, and music
theory.]

Euler Angles

The orientation of one reference frame with respect to another can also
be specified by a sequence of three principal rotations. One possible
sequence is as follows:

(i) A rotation ψ about the original 3-axis,
(ii) A rotation γ about the intermediate 1-axis,
(iii) A rotation θ about the transformed 3-axis.

This transformation is called a 3-1-3 transformation and is the sequence
originally used by Euler. In the classical mechanics literature,
the angles are referred to by the following names:

    θ : spin angle,
    γ : nutation angle,
    ψ : precession angle.

The rotation matrix from frame 1 to frame 2 is given by

    C_21(θ, γ, ψ) = C_2T C_TI C_I1
                  = C_3(θ) C_1(γ) C_3(ψ)
                  = [ c_θ c_ψ − s_θ c_γ s_ψ      s_ψ c_θ + c_γ s_θ c_ψ     s_γ s_θ ]
                    [ −c_ψ s_θ − c_θ c_γ s_ψ    −s_ψ s_θ + c_θ c_γ c_ψ     s_γ c_θ ]    (6.8)
                    [ s_ψ s_γ                   −s_γ c_ψ                   c_γ     ].

We have made the abbreviations s = sin, c = cos.
Another possible sequence that can be used is as follows:

(i) A rotation θ_1 about the original 1-axis ('roll' rotation),
(ii) A rotation θ_2 about the intermediate 2-axis ('pitch' rotation),
(iii) A rotation θ_3 about the transformed 3-axis ('yaw' rotation).

This sequence, which is very common in aerospace applications, is called
the 1-2-3 attitude sequence or the 'roll-pitch-yaw' convention. In this
case, the rotation matrix from frame 1 to frame 2 is given by

    C_21(θ_3, θ_2, θ_1) = C_3(θ_3) C_2(θ_2) C_1(θ_1)
                        = [  c_2 c_3    c_1 s_3 + s_1 s_2 c_3    s_1 s_3 − c_1 s_2 c_3 ]
                          [ −c_2 s_3    c_1 c_3 − s_1 s_2 s_3    s_1 c_3 + c_1 s_2 s_3 ]    (6.9)
                          [  s_2        −s_1 c_2                 c_1 c_2               ],

where s_i = sin θ_i, c_i = cos θ_i.
The above transformations have singularities. If γ = 0 for the 3-1-3,
then the angles θ and ψ become associated with the same degree of
freedom and cannot be uniquely determined.
For the 1-2-3, a singularity occurs when θ_2 = π/2. In this case,

    C_21(θ_3, π/2, θ_1) = [ 0    sin(θ_1 + θ_3)    −cos(θ_1 + θ_3) ]
                          [ 0    cos(θ_1 + θ_3)     sin(θ_1 + θ_3) ]
                          [ 1          0                  0        ].
Therefore, θ1 and θ3 are associated with the same rotation. However,
this is only a problem if we want to recover the rotation angles from
the rotation matrix.
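We can see the θ_2 = π/2 singularity numerically with a short sketch of (6.9) (the function name is ours): at the singularity, two different (θ_1, θ_3) pairs with the same sum θ_1 + θ_3 produce the same rotation matrix.

```python
import math

def rpy_to_rotation(th1, th2, th3):
    """C_21 for the 1-2-3 ('roll-pitch-yaw') sequence, per (6.9)."""
    s1, c1 = math.sin(th1), math.cos(th1)
    s2, c2 = math.sin(th2), math.cos(th2)
    s3, c3 = math.sin(th3), math.cos(th3)
    return [[c2 * c3, c1 * s3 + s1 * s2 * c3, s1 * s3 - c1 * s2 * c3],
            [-c2 * s3, c1 * c3 - s1 * s2 * s3, s1 * c3 + c1 * s2 * s3],
            [s2, -s1 * c2, c1 * c2]]

# At th2 = pi/2 only the sum th1 + th3 matters:
A = rpy_to_rotation(0.3, math.pi / 2, 0.4)
B = rpy_to_rotation(0.5, math.pi / 2, 0.2)   # same th1 + th3 = 0.7
```

Away from the singularity, the two angle sets would of course produce different matrices.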

Infinitesimal Rotations

Consider the 1-2-3 transformation when the angles θ_1, θ_2, θ_3 ≪ 1
(i.e., 'small' angles). In this case, we make the approximations
c_i ≈ 1, s_i ≈ θ_i and neglect products of small angles, θ_i θ_j ≈ 0.
Then we have

    C_21 ≈ [   1     θ_3   −θ_2 ]
           [ −θ_3     1     θ_1 ]
           [  θ_2   −θ_1     1  ]
         ≈ 1 − θ^×,    (6.10)

where

    θ = [θ_1; θ_2; θ_3],

which is referred to as a rotation vector.
It is easy to show that the form of the rotation matrix for infinitesimal
rotations (i.e., ‘small angle approximation’) does not depend on the
order in which the rotations are performed. For example, we can show
that the same result is obtained for a 2-1-3 Euler sequence.
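A numerical check of (6.10), using small angles of our own choosing: the exact 1-2-3 rotation matrix and the 2-1-3 one both agree with 1 − θ^× to first order.

```python
import math

def C1(th):
    c, s = math.cos(th), math.sin(th)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def C2(th):
    c, s = math.cos(th), math.sin(th)
    return [[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]]

def C3(th):
    c, s = math.cos(th), math.sin(th)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

th = [0.01, -0.02, 0.015]
# first-order approximation 1 - theta^x from (6.10)
approx = [[1.0, th[2], -th[1]],
          [-th[2], 1.0, th[0]],
          [th[1], -th[0], 1.0]]
exact_123 = mat_mul(C3(th[2]), mat_mul(C2(th[1]), C1(th[0])))
exact_213 = mat_mul(C3(th[2]), mat_mul(C1(th[0]), C2(th[1])))

err_123 = max(abs(exact_123[i][j] - approx[i][j])
              for i in range(3) for j in range(3))
err_213 = max(abs(exact_213[i][j] - approx[i][j])
              for i in range(3) for j in range(3))
```

Both residuals are on the order of the squares of the angles, confirming that the approximation error, and the dependence on rotation order, are second-order effects.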

Euler Parameters
Euler’s theorem says that the most general motion of a rigid body with
one point fixed is a rotation about an axis through that point.
Let us denote the axis of rotation by a = [a1 a2 a3 ]T and assume
that it is a unit vector:
    a^T a = a_1^2 + a_2^2 + a_3^2 = 1.    (6.11)
The angle of rotation is φ. We state, without proof, that the rotation
matrix in this case is given by

    C_21 = cos φ 1 + (1 − cos φ) a a^T − sin φ a^×.    (6.12)

It does not matter in which frame a is expressed because

    C_21 a = a.    (6.13)
The combination of variables,

    η = cos(φ/2),    ε = a sin(φ/2) = [a_1 sin(φ/2); a_2 sin(φ/2); a_3 sin(φ/2)] = [ε_1; ε_2; ε_3],    (6.14)

is particularly useful. The four parameters {ε, η} are called the Euler
parameters associated with a rotation^2. They are not independent because
they satisfy the constraint

    η^2 + ε_1^2 + ε_2^2 + ε_3^2 = 1.
 
2 These are sometimes referred to as unit-length quaternions when stacked
as q = [ε; η]. These are discussed in more detail below.
The rotation matrix can be expressed in terms of the Euler parameters as

    C_21 = (η^2 − ε^T ε) 1 + 2 ε ε^T − 2 η ε^×
         = [ 1 − 2(ε_2^2 + ε_3^2)    2(ε_1 ε_2 + ε_3 η)      2(ε_1 ε_3 − ε_2 η)   ]
           [ 2(ε_2 ε_1 − ε_3 η)      1 − 2(ε_3^2 + ε_1^2)    2(ε_2 ε_3 + ε_1 η)   ]    (6.15)
           [ 2(ε_3 ε_1 + ε_2 η)      2(ε_3 ε_2 − ε_1 η)      1 − 2(ε_1^2 + ε_2^2) ].

Euler parameters are useful in many spacecraft applications. There are
no singularities associated with them and the calculation of the rotation
matrix does not involve trigonometric functions, which is a significant
numerical advantage. The only drawback is the use of four parameters
instead of three as is the case with Euler angles; this makes it
challenging to perform some estimation problems because the constraint
must be enforced.

[Margin note: Quaternions were first described by Sir William Rowan
Hamilton (1805-1865) in 1843 and applied to mechanics in three-dimensional
space. Hamilton was an Irish physicist, astronomer, and mathematician, who
made important contributions to classical mechanics, optics, and algebra.
His studies of mechanical and optical systems led him to discover new
mathematical concepts and techniques. His best known contribution to
mathematical physics is the reformulation of Newtonian mechanics, now
called Hamiltonian mechanics. This work has proven central to the modern
study of classical field theories such as electromagnetism, and to the
development of quantum mechanics. In pure mathematics, he is best known as
the inventor of quaternions.]

Quaternions

We will use the notation of Barfoot et al. (2011) for this section. A
quaternion will be a 4 × 1 column that may be written as

    q = [ε; η],    (6.16)

where ε is 3 × 1 and η is a scalar. The quaternion left-hand compound
operator, +, and the right-hand compound operator, ⊕, will be defined as

    q^+ = [ η1 − ε^×    ε ]        q^⊕ = [ η1 + ε^×    ε ]
          [   −ε^T      η ],              [   −ε^T      η ].    (6.17)

The inverse operator, −1, will be defined by

    q^{−1} = [−ε; η].    (6.18)

Let u, v, and w be quaternions. Then some useful identities are

    u^+ v ≡ v^⊕ u,    (6.19)

and

    (u^+)^T ≡ (u^+)^{−1} ≡ (u^{−1})^+,          (u^⊕)^T ≡ (u^⊕)^{−1} ≡ (u^{−1})^⊕,
    (u^+ v)^{−1} ≡ (v^{−1})^+ u^{−1},           (u^⊕ v)^{−1} ≡ (v^{−1})^⊕ u^{−1},
    (u^+ v)^+ w ≡ u^+ (v^+ w) ≡ u^+ v^+ w,      (u^⊕ v)^⊕ w ≡ u^⊕ (v^⊕ w) ≡ u^⊕ v^⊕ w,
    α u^+ + β v^+ ≡ (α u + β v)^+,              α u^⊕ + β v^⊕ ≡ (α u + β v)^⊕,    (6.20)

where α and β are scalars. We also have

    u^+ v^⊕ ≡ v^⊕ u^+.    (6.21)

The proofs are left to the reader.
Quaternions form a non-commutative group^3 under both the + and ⊕
operations. Many of the identities above are prerequisites to showing
this fact. The identity element of this group, ι = [0 0 0 1]^T, is such
that

    ι^+ = ι^⊕ = 1,    (6.22)

where 1 is the 4 × 4 identity matrix.

Rotations may be represented in this notation by using a unit-length
quaternion, q, such that

    q^T q = 1.    (6.23)

These form a sub-group that can be used to represent rotations. To
rotate a point (in homogeneous form),

    v = [x; y; z; 1],    (6.24)

to another frame using the rotation, q, we compute

    u = q^+ v^+ q^{−1} = q^+ (q^{−1})^⊕ v = R v,    (6.25)

where

    R = q^+ (q^{−1})^⊕ = (q^{−1})^⊕ q^+ = q^{⊕T} q^+ = [ C   0 ]
                                                        [ 0   1 ],    (6.26)
and C is the 3 × 3 rotation matrix with which we are now familiar. We
have included various forms for R to show the different structures this
transformation can take.

[Margin note: Josiah Willard Gibbs (1839-1903) was an American scientist
who made important theoretical contributions to physics, chemistry, and
mathematics. As a mathematician, he invented modern vector calculus
(independently of the British scientist Oliver Heaviside, who carried out
similar work during the same period). The Gibbs vector is also sometimes
known as the Cayley-Rodrigues parameters.]
theoretical
contributions to Gibbs Vector
physics, chemistry,
and mathematics. Yet another way that we can parameterize rotation is through the Gibbs
As a vector. In terms of axis/angle parameters discussed earlier, the Gibbs
mathematician, he vector, g, is given by
invented modern
vector calculus φ
(independently of
g = a tan , (6.27)
2
the British
scientist Oliver which we note blows up at φ = π, so this parameterization does not
Heaviside, who work well for all angles. The rotation matrix, C, can then be written
carried out similar in terms of the Gibbs vector as
work during the
same period). The −1  1 
Gibbs vector is
C = 1 + g× 1 − g× = T
(1 − gT g)1 + 2ggT − 2g× .
1+g g
also sometimes (6.28)
known as the
Cayley-Rodrigues 3 The next chapter will discuss group theory as it pertains to rotations in much more
parameters. detail.
6.2 Rotations 177

Substituting in the Gibbs vector definition, the right-hand expression
becomes

    C = (1/(1 + tan^2(φ/2))) ((1 − tan^2(φ/2)) 1 + 2 tan^2(φ/2) a a^T − 2 tan(φ/2) a^×),    (6.29)

where we have used that a^T a = 1. Utilizing that (1 + tan^2(φ/2))^{−1} =
cos^2(φ/2), we have

    C = (cos^2(φ/2) − sin^2(φ/2)) 1 + 2 sin^2(φ/2) a a^T − 2 sin(φ/2) cos(φ/2) a^×
      = cos φ 1 + (1 − cos φ) a a^T − sin φ a^×,    (6.30)

where we have used the double-angle identities cos^2(φ/2) − sin^2(φ/2) =
cos φ, 2 sin^2(φ/2) = 1 − cos φ, and 2 sin(φ/2) cos(φ/2) = sin φ. This is
our usual expression for the rotation matrix in terms of the axis/angle
parameters.

To relate the two expressions for C in terms of g given in (6.28), we
first note that

    (1 + g^×)^{−1} = 1 − g^× + g^× g^× − g^× g^× g^× + ··· = Σ_{n=0}^∞ (−g^×)^n.    (6.31)

Then we observe that

    g^T g (1 + g^×)^{−1}
      = (g^T g) 1 − (g^T g) g^× + (g^T g) g^× g^× − (g^T g) g^× g^× g^× + ···
      = 1 + g g^T − g^× − (1 + g^×)^{−1},    (6.32)

where we have used the following manipulation several times:

    (g^T g) g^× = (−g^× g^× + g g^T) g^× = −g^× g^× g^× + g (g^T g^×) = −g^× g^× g^×,    (6.33)

since g^T g^× = 0. Therefore we have that

    (1 + g^T g) (1 + g^×)^{−1} = 1 + g g^T − g^×,    (6.34)

and thus

    (1 + g^T g) (1 + g^×)^{−1} (1 − g^×) = (1 + g g^T − g^×)(1 − g^×)
      = 1 + g g^T − 2 g^× − g (g^T g^×) + g^× g^×
      = (1 − g^T g) 1 + 2 g g^T − 2 g^×,    (6.35)

where we have again used g^T g^× = 0 and g^× g^× = −(g^T g) 1 + g g^T.
Dividing both sides by (1 + g^T g) provides the desired result.
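A quick numerical check of (6.28) against the axis/angle formula (6.12), with an arbitrary axis and angle of our choosing (function names ours):

```python
import math

def skew(v):
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def axis_angle_to_C(a, phi):
    """C = cos(phi) 1 + (1 - cos(phi)) a a^T - sin(phi) a^x, per (6.12)."""
    c, s = math.cos(phi), math.sin(phi)
    ax = skew(a)
    return [[c * (1.0 if i == j else 0.0) + (1.0 - c) * a[i] * a[j] - s * ax[i][j]
             for j in range(3)] for i in range(3)]

def gibbs_to_C(g):
    """Right-hand form of (6.28)."""
    gtg = sum(gi * gi for gi in g)
    gx = skew(g)
    return [[((1.0 - gtg) * (1.0 if i == j else 0.0)
              + 2.0 * g[i] * g[j] - 2.0 * gx[i][j]) / (1.0 + gtg)
             for j in range(3)] for i in range(3)]

# unit axis and an angle well away from pi (where the Gibbs vector blows up)
a = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
phi = 0.8
g = [ai * math.tan(phi / 2.0) for ai in a]

Ca = axis_angle_to_C(a, phi)
Cg = gibbs_to_C(g)
```

The two matrices agree to machine precision for angles away from φ = π; near π the Gibbs components grow without bound, mirroring the analytic singularity.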
6.2.4 Rotational Kinematics

In the last section, we showed that the orientation of one frame
\vec{F}_2 with respect to another \vec{F}_1 could be parameterized in
different ways. In other words, the rotation matrix could be written as
a function of Euler angles or Euler parameters. However, in most
applications the orientation changes with time and thus we must
introduce the vehicle kinematics, which form an important part of the
vehicle's motion model.

We will first introduce the concept of angular velocity, then
acceleration in a rotating frame. We will finish with expressions that
relate the rate of change of the orientation parameterization to angular
velocity.
Angular Velocity

Let frame \vec{F}_2 rotate with respect to frame \vec{F}_1. The angular
velocity of frame 2 with respect to frame 1 is denoted by \vec{ω}_{21}.
The angular velocity of frame 1 with respect to 2 is
\vec{ω}_{12} = −\vec{ω}_{21}.

The magnitude of \vec{ω}_{21}, |\vec{ω}_{21}| = (\vec{ω}_{21} ·
\vec{ω}_{21})^{1/2}, is the rate of rotation. The direction of
\vec{ω}_{21} (i.e., the unit vector in the direction of \vec{ω}_{21},
which is |\vec{ω}_{21}|^{−1} \vec{ω}_{21}) is the instantaneous axis of
rotation.

Observers in the frames \vec{F}_2 and \vec{F}_1 do not see the same
motion because of their own relative motions. Let us denote the vector
time derivative as seen in \vec{F}_1 by (·)^• and that seen in
\vec{F}_2 by (·)^◦. Therefore,

    \vec{F}_1^• = \vec{0},    \vec{F}_2^◦ = \vec{0}.

It can be shown that

    \vec{2}_1^• = \vec{ω}_{21} × \vec{2}_1,
    \vec{2}_2^• = \vec{ω}_{21} × \vec{2}_2,
    \vec{2}_3^• = \vec{ω}_{21} × \vec{2}_3,

or equivalently,

    \vec{F}_2^{•T} = \vec{ω}_{21} × \vec{F}_2^T.    (6.36)
We want to determine the time derivative of an arbitrary vector
expressed in both frames:

    \vec{r} = \vec{F}_1^T r_1 = \vec{F}_2^T r_2.

Therefore, the time derivative as seen in \vec{F}_1 is

    \vec{r}^• = \vec{F}_1^{•T} r_1 + \vec{F}_1^T ṙ_1 = \vec{F}_1^T ṙ_1.    (6.37)

In a similar way,

    \vec{r}^◦ = \vec{F}_2^{◦T} r_2 + \vec{F}_2^T r_2^◦ = \vec{F}_2^T ṙ_2.    (6.38)

(Note that for nonvectors, (˙) = (◦), i.e., r_2^◦ = ṙ_2.)

Alternatively, the time derivative as seen in \vec{F}_1, but expressed
in \vec{F}_2, is

    \vec{r}^• = \vec{F}_2^T ṙ_2 + \vec{F}_2^{•T} r_2
              = \vec{F}_2^T ṙ_2 + \vec{ω}_{21} × \vec{F}_2^T r_2
              = \vec{r}^◦ + \vec{ω}_{21} × \vec{r}.    (6.39)

The above is true for any vector \vec{r}. The most important application
occurs when \vec{r} denotes position, \vec{F}_1 is a nonrotating
inertial reference frame, and \vec{F}_2 is a frame that rotates with a
body, vehicle, etc. In this case, (6.39) expresses the velocity in the
inertial frame in terms of the motion in the second frame.

Now, express the angular velocity in \vec{F}_2:

    \vec{ω}_{21} = \vec{F}_2^T ω_2^{21}.    (6.40)

Therefore,

    \vec{r}^• = \vec{F}_1^T ṙ_1 = \vec{F}_2^T ṙ_2 + \vec{ω}_{21} × \vec{r}
              = \vec{F}_2^T ṙ_2 + \vec{F}_2^T (ω_2^{21})^× r_2
              = \vec{F}_2^T (ṙ_2 + (ω_2^{21})^× r_2).    (6.41)

If we want to express the 'inertial time derivative' (that seen in
\vec{F}_1) in \vec{F}_1, then we can use the rotation matrix C_12:

    ṙ_1 = C_12 (ṙ_2 + (ω_2^{21})^× r_2).    (6.42)
Acceleration

Let us denote the velocity by

    \vec{v} = \vec{r}^• = \vec{r}^◦ + \vec{ω}_{21} × \vec{r}.
The acceleration can be calculated by applying (6.39) to \vec{v}:

    \vec{r}^{••} = \vec{v}^• = \vec{v}^◦ + \vec{ω}_{21} × \vec{v}
                 = (\vec{r}^{◦◦} + \vec{ω}_{21}^◦ × \vec{r} + \vec{ω}_{21} × \vec{r}^◦)
                   + (\vec{ω}_{21} × \vec{r}^◦ + \vec{ω}_{21} × (\vec{ω}_{21} × \vec{r}))
                 = \vec{r}^{◦◦} + 2 \vec{ω}_{21} × \vec{r}^◦ + \vec{ω}_{21}^◦ × \vec{r}
                   + \vec{ω}_{21} × (\vec{ω}_{21} × \vec{r}).    (6.43)

The matrix equivalent in terms of components can be had by making the
following substitutions:

    \vec{r}^{••} = \vec{F}_1^T r̈_1,    \vec{r}^{◦◦} = \vec{F}_2^T r̈_2,    \vec{ω}_{21}^◦ = \vec{F}_2^T ω̇_2^{21}.

The result for the components is

    r̈_1 = C_12 (r̈_2 + 2 (ω_2^{21})^× ṙ_2 + (ω̇_2^{21})^× r_2 + (ω_2^{21})^× (ω_2^{21})^× r_2).    (6.44)
The various terms in the expression for the acceleration have been given
special names:
    \vec{r}^{◦◦} : acceleration with respect to \vec{F}_2
    2 \vec{ω}_{21} × \vec{r}^◦ : Coriolis acceleration
    \vec{ω}_{21}^◦ × \vec{r} : angular acceleration
    \vec{ω}_{21} × (\vec{ω}_{21} × \vec{r}) : centripetal acceleration
Angular Velocity Given Rotation Matrix

[Margin note: Siméon Denis Poisson (1781-1840) was a French
mathematician, geometer, and physicist.]

Begin with (6.3), which relates two reference frames via the rotation
matrix:

    \vec{F}_1^T = \vec{F}_2^T C_21.

Now take the time derivative of both sides as seen in \vec{F}_1:

    \vec{0} = \vec{F}_2^{•T} C_21 + \vec{F}_2^T Ċ_21.

Substitute (6.36) for \vec{F}_2^{•T}:

    \vec{0} = \vec{ω}_{21} × \vec{F}_2^T C_21 + \vec{F}_2^T Ċ_21.

Now use (6.40) to get

    \vec{0} = (\vec{F}_2^T ω_2^{21}) × \vec{F}_2^T C_21 + \vec{F}_2^T Ċ_21
            = \vec{F}_2^T ((ω_2^{21})^× C_21 + Ċ_21).

Therefore, we conclude that

    Ċ_21 = −(ω_2^{21})^× C_21,    (6.45)

which is known as Poisson's equation. Given the angular velocity as
measured in the frame \vec{F}_2, the rotation matrix relating
\vec{F}_1 to \vec{F}_2 can be determined by integrating the above
expression^4.

We can also rearrange to obtain an explicit function of ω_2^{21}:

    (ω_2^{21})^× = −Ċ_21 C_21^{−1} = −Ċ_21 C_21^T,    (6.46)

which gives the angular velocity when the rotation matrix is known as
a function of time.
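As a sketch of such 'strapdown' propagation under our own simplifying assumptions (constant angular velocity, simple first-order Euler steps), integrating (6.45) recovers the corresponding principal rotation:

```python
import math

def skew(v):
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def integrate_poisson(omega, T, steps):
    """First-order integration of Cdot = -omega^x C, (6.45), for constant
    omega, starting from C = 1."""
    dt = T / steps
    C = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    W = skew(omega)
    for _ in range(steps):
        # C <- (1 - dt * omega^x) C, cf. the infinitesimal rotation (6.10)
        A = [[(1.0 if i == j else 0.0) - dt * W[i][j] for j in range(3)]
             for i in range(3)]
        C = mat_mul(A, C)
    return C

# Constant rotation about the 3-axis; the result should approach C_3(w*T)
w, T = 0.5, 2.0
C = integrate_poisson([0.0, 0.0, w], T, steps=5000)
```

A practical implementation would periodically re-orthonormalize C (or integrate a constrained parameterization instead), since the first-order update slowly violates C C^T = 1.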

Euler Angles
Consider the 1-2-3 Euler angle sequence and its associated rotation
matrix. In this case, (6.46) becomes

    (ω_2^{21})^× = −C_3 C_2 Ċ_1 C_1^T C_2^T C_3^T − C_3 Ċ_2 C_2^T C_3^T − Ċ_3 C_3^T.    (6.47)

Then, using

    −Ċ_i C_i^T = 1_i^× θ̇_i,    (6.48)

for each principal-axis rotation (where 1_i is column i of 1) and the
identity

    (C_i r)^× ≡ C_i r^× C_i^T,    (6.49)

we can show that

    (ω_2^{21})^× = (C_3 C_2 1_1 θ̇_1)^× + (C_3 1_2 θ̇_2)^× + (1_3 θ̇_3)^×,    (6.50)

which can be simplified to

    ω_2^{21} = [ C_3(θ_3) C_2(θ_2) 1_1    C_3(θ_3) 1_2    1_3 ] [θ̇_1; θ̇_2; θ̇_3]
             = S(θ_2, θ_3) θ̇,    (6.51)

which gives the angular velocity in terms of the Euler angles and the
Euler rates, θ̇. In scalar detail, we have

    S(θ_2, θ_3) = [  cos θ_2 cos θ_3    sin θ_3    0 ]
                  [ −cos θ_2 sin θ_3    cos θ_3    0 ]    (6.52)
                  [  sin θ_2               0       1 ].

By inverting the matrix S, we arrive at a system of differential
equations that can be integrated to yield the Euler angles, assuming
ω_2^{21} is known:

    θ̇ = S^{−1}(θ_2, θ_3) ω_2^{21}
       = [  sec θ_2 cos θ_3    −sec θ_2 sin θ_3    0 ]
         [  sin θ_3              cos θ_3           0 ]  ω_2^{21}.    (6.53)
         [ −tan θ_2 cos θ_3     tan θ_2 sin θ_3    1 ]

Note that S^{−1} does not exist at θ_2 = π/2, which is precisely the
singularity associated with the 1-2-3 sequence.

It should be noted that the above developments hold true in general
for any Euler sequence. If we pick an α-β-γ set,

    C_21(θ_1, θ_2, θ_3) = C_γ(θ_3) C_β(θ_2) C_α(θ_1),    (6.54)

then

    S(θ_2, θ_3) = [ C_γ(θ_3) C_β(θ_2) 1_α    C_γ(θ_3) 1_β    1_γ ],    (6.55)

and S^{−1} does not exist at the singularities of S.

4 This is termed 'strapdown navigation' because the sensors that measure
ω_2^{21} are strapped down in the rotating frame, \vec{F}_2.
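The closed forms (6.52) and (6.53) can be checked against each other numerically; a sketch (function names ours) that also shows the sec θ_2 blow-up near the singularity:

```python
import math

def S(th2, th3):
    """Mapping from Euler rates to angular velocity, per (6.52)."""
    c2, s2 = math.cos(th2), math.sin(th2)
    c3, s3 = math.cos(th3), math.sin(th3)
    return [[c2 * c3, s3, 0.0],
            [-c2 * s3, c3, 0.0],
            [s2, 0.0, 1.0]]

def S_inv(th2, th3):
    """Closed-form inverse from (6.53); undefined at th2 = pi/2."""
    sec2, tan2 = 1.0 / math.cos(th2), math.tan(th2)
    c3, s3 = math.cos(th3), math.sin(th3)
    return [[sec2 * c3, -sec2 * s3, 0.0],
            [s3, c3, 0.0],
            [-tan2 * c3, tan2 * s3, 1.0]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

I = mat_mul(S_inv(0.4, -0.9), S(0.4, -0.9))     # should be the identity
blow_up = S_inv(math.pi / 2 - 1e-6, 0.0)[0][0]  # sec(th2) diverges at pi/2
```

Near θ_2 = π/2 the inverse entries become enormous, which is how the Euler-rate singularity manifests itself in an integrator: tiny angular velocities demand huge Euler rates.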

6.2.5 Perturbing Rotations


Now that we have some basic notation built up for handling quantities
in three-dimensional space, we will turn our focus to an issue that is
often handled incorrectly or simply ignored altogether. We have shown
in the previous section that the state of a single-body vehicle involves
a translation, which has three degrees of freedom, as well as a rotation,
which also has three degrees of freedom. The problem is that the degrees
of freedom associated with rotations are a bit unique and must be
handled carefully. The reason is that rotations do not live in a vector
space5 ; rather, they form the non-commutative group called SO(3).
As we have seen above, there are many ways of representing rota-
tions mathematically including rotation matrices, axis-angle formula-
tion, Euler parameters, and Euler angles/unit-length quaternions. The
most important fact to remember is that all these representations have
the same underlying rotation, which only has three degrees of freedom.
A 3 × 3 rotation matrix has nine elements, but only three are indepen-
dent. Euler parameters have four scalar parameters, but only three are
independent. Of all the common rotation representations, Euler angles
are the only ones that have exactly three parameters; the problem is
that Euler sequences have singularities, so for some problems, one must
choose an appropriate sequence that avoids the singularities.
The fact that rotations do not live in a vector space is actually quite
fundamental when it comes to linearizing motion and observation mod-
els involving rotations. What are we to do about linearizing rotations?
5 Here we mean a vector space in the sense of linear algebra.
Fortunately, there is a way forward. The key is to consider what is hap-


pening on a small, in fact infinitesimal, level. We will begin by deriving
a few key identities and then turn to linearizing a rotation matrix built
from a sequence of Euler angles.

Some Key Identities


Euler’s theorem allows us to write a rotation matrix, C, in terms of a
rotation about an axis, a, through an angle, φ:
C = cos φ1 + (1 − cos φ)aaT − sin φa× . (6.56)
We now take the partial derivative of C with respect to the angle, φ:

    ∂C/∂φ = −sin φ 1 + sin φ a a^T − cos φ a^×    (6.57a)
          = sin φ (−1 + a a^T) − cos φ a^×    (6.57b)
          = −cos φ a^× + (1 − cos φ) a^× a a^T + sin φ a^× a^×    (6.57c)
          = −a^× (cos φ 1 + (1 − cos φ) a a^T − sin φ a^×) = −a^× C,    (6.57d)

where we have used a^× a^× = −1 + a a^T and a^× a = 0. Thus, our first
important identity is

    ∂C/∂φ ≡ −a^× C.    (6.58)

An immediate application of this is that for any principal-axis rotation,
about axis α, we have

    ∂C_α(θ)/∂θ ≡ −1_α^× C_α(θ),    (6.59)
where 1α is column α of the identity matrix.
Let us now consider an α-β-γ Euler sequence:
C(θ) = Cγ (θ3 )Cβ (θ2 )Cα (θ1 ), (6.60)
where θ = (θ1 , θ2 , θ3 ). Furthermore, we select an arbitrary constant
vector, v. Applying (6.59) we have

    ∂(C(θ)v)/∂θ_3 = −1_γ^× C_γ(θ_3) C_β(θ_2) C_α(θ_1) v = (C(θ)v)^× 1_γ,    (6.61a)
    ∂(C(θ)v)/∂θ_2 = −C_γ(θ_3) 1_β^× C_β(θ_2) C_α(θ_1) v = (C(θ)v)^× C_γ(θ_3) 1_β,    (6.61b)
    ∂(C(θ)v)/∂θ_1 = −C_γ(θ_3) C_β(θ_2) 1_α^× C_α(θ_1) v = (C(θ)v)^× C_γ(θ_3) C_β(θ_2) 1_α,    (6.61c)

where we have made use of the two general identities,

    r^× s ≡ −s^× r,    (6.62a)
    (R s)^× ≡ R s^× R^T,    (6.62b)

for any vectors r, s and any rotation matrix, R. Combining the results
in (6.61) we have

    ∂(C(θ)v)/∂θ = [ ∂(C(θ)v)/∂θ_1   ∂(C(θ)v)/∂θ_2   ∂(C(θ)v)/∂θ_3 ]
                = (C(θ)v)^× [ C_γ(θ_3) C_β(θ_2) 1_α    C_γ(θ_3) 1_β    1_γ ],    (6.63)

where the bracketed matrix is S(θ_2, θ_3),

and thus another very important identity that we can state is

    ∂(C(θ)v)/∂θ ≡ (C(θ)v)^× S(θ_2, θ_3),    (6.64)
which we note is true regardless of the choice of Euler set. This will
prove critical in the next section when we discuss linearization of a
rotation matrix.

Perturbing a Rotation Matrix


Let us return to first principles and consider carefully how to linearize a
rotation. If we have a function, f (x), of some variable, x, then perturb-
ing x slightly from its nominal value, x̄, by an amount δx will result in
a change in the function. We can express this in terms of a Taylor-series
expansion of f about x̄:

    f(x̄ + δx) = f(x̄) + (∂f(x)/∂x)|_x̄ δx + (higher-order terms),    (6.65)

and so if δx is small, a 'first-order' approximation is

    f(x̄ + δx) ≈ f(x̄) + (∂f(x)/∂x)|_x̄ δx.    (6.66)
This presupposes that δx is not constrained in any way. The trouble
with carrying out the same process with rotations is that most of the
representations involve constraints and thus are not easily perturbed
(without enforcing the constraint). The notable exceptions are the Eu-
ler angle sets. These contain exactly three parameters and thus each
can be varied independently. For this reason, we choose to use Euler
angles in our perturbation of functions involving rotations.
Consider perturbing C(θ)v with respect to Euler angles θ, where v is an arbitrary constant vector. Letting θ̄ = (θ̄_1, θ̄_2, θ̄_3) and δθ = (δθ_1, δθ_2, δθ_3), then applying a first-order Taylor-series approximation, we have

    C(θ̄ + δθ) v ≈ C(θ̄) v + (∂(C(θ)v)/∂θ)|_θ̄ δθ
                = C(θ̄) v + ((C(θ)v)^× S(θ_2, θ_3))|_θ̄ δθ
                = C(θ̄) v + (C(θ̄)v)^× S(θ̄_2, θ̄_3) δθ
                = C(θ̄) v − (S(θ̄_2, θ̄_3) δθ)^× C(θ̄) v
                = (1 − (S(θ̄_2, θ̄_3) δθ)^×) C(θ̄) v,    (6.67)

where we have used (6.64) to get to the second line. Observing that v is arbitrary, we can drop it from both sides and write

    C(θ̄ + δθ) ≈ (1 − (S(θ̄_2, θ̄_3) δθ)^×) C(θ̄),    (6.68)

where the first factor on the right-hand side is an infinitesimal rotation matrix,

which we see is the product (not the sum) of an infinitesimal rotation matrix and the unperturbed rotation matrix, C(θ̄). Notationally, it is simpler to write

    C(θ̄ + δθ) ≈ (1 − δφ^×) C(θ̄),    (6.69)

with δφ = S(θ̄_2, θ̄_3) δθ. Equation (6.68) is extremely important. It tells us exactly how to perturb a rotation matrix (in terms of perturbations to its Euler angles) when it appears inside any function.
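As a quick numerical sanity check of (6.68), the following sketch (NumPy-based; the helper names are illustrative and assume the principal-axis rotation conventions used in this chapter, specialized to a 1-2-3 Euler sequence) compares an exactly perturbed rotation against the first-order approximation:

```python
import numpy as np

def C1(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, s], [0, -s, c]])

def C2(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, -s], [0, 1, 0], [s, 0, c]])

def C3(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]])

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def C(th):
    """1-2-3 Euler sequence: C(theta) = C3(th3) C2(th2) C1(th1)."""
    return C3(th[2]) @ C2(th[1]) @ C1(th[0])

def S(th):
    """S(th2, th3) from (6.63), specialized to the 1-2-3 sequence."""
    e1, e2, e3 = np.eye(3)
    return np.stack([C3(th[2]) @ C2(th[1]) @ e1, C3(th[2]) @ e2, e3], axis=1)

theta_bar = np.array([0.3, -0.2, 0.5])
dtheta = 1e-4 * np.array([1.0, -2.0, 0.5])

exact = C(theta_bar + dtheta)               # true perturbed rotation
dphi = S(theta_bar) @ dtheta                # delta-phi from (6.69)
approx = (np.eye(3) - skew(dphi)) @ C(theta_bar)
err = np.max(np.abs(exact - approx))        # residual is second-order in |dtheta|
```

The residual `err` shrinks quadratically as the perturbation shrinks, as expected of a first-order approximation.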

Example 6.1 The following example shows how we can apply our linearized rotation expression in an arbitrary expression. Suppose we have a scalar function, J, given by

    J(θ) = u^T C(θ) v,    (6.70)

where u and v are arbitrary vectors. Applying our approach to linearizing rotations, we have

    J(θ̄ + δθ) ≈ u^T (1 − δφ^×) C(θ̄) v = u^T C(θ̄) v + u^T (C(θ̄)v)^× δφ,    (6.71)

where the first term is J(θ̄) and the second is the perturbation, δJ(δθ), so that the linearized function is

    δJ(δθ) = u^T (C(θ̄)v)^× S(θ̄_2, θ̄_3) δθ,    (6.72)

where we see the factor in front of δθ is indeed constant; in fact, it is (∂J/∂θ)|_θ̄, the Jacobian of J with respect to θ.
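Example 6.1 can also be checked numerically. This sketch (again assuming a 1-2-3 Euler sequence and the principal-axis conventions above; all names are illustrative) compares the analytic Jacobian from (6.72) with a central-difference approximation:

```python
import numpy as np

def C1(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, s], [0, -s, c]])

def C2(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, -s], [0, 1, 0], [s, 0, c]])

def C3(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s, 0], [-s, c, 0], [0, 0, 1]])

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def C(th):
    # 1-2-3 Euler sequence
    return C3(th[2]) @ C2(th[1]) @ C1(th[0])

def S(th):
    # S(th2, th3) from (6.63) for the 1-2-3 sequence
    e1, e2, e3 = np.eye(3)
    return np.stack([C3(th[2]) @ C2(th[1]) @ e1, C3(th[2]) @ e2, e3], axis=1)

u = np.array([1.0, -0.5, 2.0])
v = np.array([0.3, 0.7, -0.2])
th_bar = np.array([0.1, 0.2, 0.3])

# Analytic Jacobian of J(theta) = u^T C(theta) v, per (6.72): u^T (C v)^x S
jac = u @ skew(C(th_bar) @ v) @ S(th_bar)

# Central-difference Jacobian for comparison
h = 1e-6
num = np.zeros(3)
for i in range(3):
    d = np.zeros(3)
    d[i] = h
    num[i] = (u @ C(th_bar + d) @ v - u @ C(th_bar - d) @ v) / (2 * h)
```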

6.3 Poses
We have spent considerable effort discussing the rotational aspect of a
moving body. We now introduce the notion of translation. Together,
the translation and rotation of a body are referred to as the pose. Pose
estimation problems are often concerned with transforming the coor-
dinates of a point, P , between a moving (in translation and rotation)
vehicle frame, and a stationary frame, as depicted in Figure 6.2.
We can relate the vectors in Figure 6.2 as follows:

    r→^pi = r→^pv + r→^vi,    (6.73)

where we have not yet selected any particular reference frame in which to express the relationship. Writing the relationship in the stationary frame, F→_i, we have

    r_i^pi = r_i^pv + r_i^vi.    (6.74)

If the point, P, is attached to the vehicle, we typically know its coordinates in F→_v, which is rotated with respect to F→_i. Letting C_iv represent this rotation, we can rewrite the relationship as

    r_i^pi = C_iv r_v^pv + r_i^vi,    (6.75)

which tells us how to convert the coordinates of P in F→_v to its coordinates in F→_i, given knowledge of the translation, r_i^vi, and rotation, C_iv, between the two frames. We will refer to

    {r_i^vi, C_iv},    (6.76)

as the pose of the vehicle.

Figure 6.2: Pose estimation problems are often concerned with transforming the coordinates of a point, P, between a moving vehicle frame and a stationary frame.

6.3.1 Transformation Matrices

We can also write the relationship expressed in (6.75) in another convenient form:

    [r_i^pi; 1] = T_iv [r_v^pv; 1],    T_iv = [C_iv, r_i^vi; 0^T, 1],    (6.77)

where T_iv is referred to as a 4 × 4 transformation matrix.

To make use of a transformation matrix, we must augment the coordinates of a point with a 1,

    [x; y; z; 1],    (6.78)

which is referred to as a homogeneous point representation. An interesting property of homogeneous point representations is that each entry can be multiplied by a scale factor, s:

    [sx; sy; sz; s].    (6.79)

To recover the original (x, y, z) coordinates, one needs only to divide the first three entries by the fourth. In this way, as the scale factor approaches 0, we can represent points arbitrarily far away from the origin. Hartley and Zisserman (2000) discuss the use of homogeneous coordinates at length for computer-vision applications.

To transform the coordinates back the other way, we require the inverse of a transformation matrix:

    [r_v^pv; 1] = T_iv^{-1} [r_i^pi; 1],    (6.80)

where

    T_iv^{-1} = [C_iv, r_i^vi; 0^T, 1]^{-1} = [C_iv^T, −C_iv^T r_i^vi; 0^T, 1] = [C_vi, −r_v^vi; 0^T, 1] = [C_vi, r_v^iv; 0^T, 1] = T_vi,    (6.81)

where we have used that r_v^iv = −r_v^vi, which simply flips the direction of the vector.

We can also compound transformation matrices:

    T_iv = T_ia T_ab T_bv,    (6.82)

Margin note: Homogeneous coordinates were introduced by August Ferdinand Möbius (1790–1868) in his work entitled Der Barycentrische Calcul, published in 1827. Möbius parameterized a point on a plane, (x, y), by considering masses, m_1, m_2, and m_3, that must be placed at the vertices of a fixed triangle to make the point the triangle's center of mass. The coordinates (m_1, m_2, m_3) are not unique, as scaling the three masses equally does not change the point location. When the equation of a curve is written in this coordinate system, it becomes homogeneous in (m_1, m_2, m_3). For example, a circle centered at (a, b) with radius r is (x − a)^2 + (y − b)^2 = r^2. Written in homogeneous coordinates with x = m_1/m_3 and y = m_2/m_3, the equation becomes (m_1 − m_3 a)^2 + (m_2 − m_3 b)^2 = m_3^2 r^2, where every term is now quadratic in the homogeneous coordinates. Taken from (Furgale, 2011).
which makes it easy to chain an arbitrary number of pose changes together:

    F→_i ←(T_iv)– F→_v  =  F→_i ←(T_ia)– F→_a ←(T_ab)– F→_b ←(T_bv)– F→_v.    (6.83)

For example, each frame could represent the pose of a mobile vehicle at a different instant in time, and this relation tells us how to combine relative motions into a global one.
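Building, inverting, and compounding transformation matrices can be sketched as follows (a minimal NumPy illustration; the specific poses and helper names are hypothetical):

```python
import numpy as np

def T_from(C, r):
    """Build a 4x4 transformation matrix [C, r; 0^T, 1], as in (6.77)."""
    T = np.eye(4)
    T[:3, :3] = C
    T[:3, 3] = r
    return T

def T_inv(T):
    """Invert a transformation matrix via (6.81), avoiding a general 4x4 inverse."""
    C, r = T[:3, :3], T[:3, 3]
    return T_from(C.T, -C.T @ r)

def rotz(t):
    """A rotation about the 3-axis (sign convention immaterial for this demo)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Compound three hypothetical relative poses into T_iv = T_ia T_ab T_bv, as in (6.82)
T_ia = T_from(rotz(0.2), np.array([1.0, 0.0, 0.0]))
T_ab = T_from(rotz(-0.5), np.array([0.0, 2.0, 0.0]))
T_bv = T_from(rotz(0.1), np.array([0.5, 0.5, 0.0]))
T_iv = T_ia @ T_ab @ T_bv

# Map a homogeneous point from the vehicle frame to the inertial frame
p_v = np.array([1.0, 2.0, 3.0, 1.0])
p_i = T_iv @ p_v
```

Applying `T_inv(T_iv)` to `p_i` recovers `p_v`, illustrating (6.80).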
Transformation matrices are very appealing because they tell us to first apply the rotation and then the translation. Poses are nonetheless often a source of ambiguity in practice, because the subscripts and superscripts are typically dropped, and then it is difficult to know the exact meaning of each quantity.

6.3.2 Robotics Conventions


There is an important subtlety that must be mentioned in order to
conform with standard practice in robotics. We can understand this
in the context of a simple example. Imagine a vehicle travelling in the
xy-plane, as depicted in Figure 6.3.
Figure 6.3: Simple planar example with a mobile vehicle whose state is given by position, (x, y), and orientation, θ_vi. It is standard for 'forward' to be the 1-axis of the vehicle frame and 'left' to be the 2-axis; the 3-axis comes out of the page.

The position of the vehicle can be written in a straightforward manner as

    r_i^vi = [x; y; 0].    (6.84)
The z-coordinate is zero for this planar example.
The rotation of F→_v with respect to F→_i is a principal-axis rotation about the 3-axis, through an angle θ_vi (we add the subscript to demonstrate a point). Following our convention from before, the angle of rotation is positive (according to the right-hand rule). Thus, we have

    C_vi = C_3(θ_vi) = [cos θ_vi, sin θ_vi, 0; −sin θ_vi, cos θ_vi, 0; 0, 0, 1].    (6.85)
It makes sense to use θ_vi for orientation; it naturally describes the heading of the vehicle, since it is F→_v that is moving with respect to F→_i. However, as discussed in the last section, the rotation matrix that we really care about when constructing the pose is C_iv = C_vi^T = C_3(−θ_vi) = C_3(θ_iv). Importantly, we note that θ_iv = −θ_vi. We do not want to use θ_iv as the heading, as that will be quite confusing.

Sticking with θ_vi, the pose of the vehicle can then be written in transformation-matrix form as

    T_iv = [C_iv, r_i^vi; 0^T, 1] = [cos θ_vi, −sin θ_vi, 0, x; sin θ_vi, cos θ_vi, 0, y; 0, 0, 1, 0; 0, 0, 0, 1],    (6.86)

which is perfectly fine. In general, even when the axis of rotation, a, is not i→_3, we are free to write

    C_iv = C_vi^T = (cos θ_vi 1 + (1 − cos θ_vi) a a^T − sin θ_vi a^×)^T
                  = cos θ_vi 1 + (1 − cos θ_vi) a a^T + sin θ_vi a^×,    (6.87)

where we note the change in sign of the third term due to the skew-symmetric property, (a^×)^T = −a^×. In other words, we are free to use θ_vi rather than θ_iv to construct C_iv.

Confusion arises, however, when all the subscripts are dropped and we simply write

    C = cos θ 1 + (1 − cos θ) a a^T + sin θ a^×,    (6.88)

which is very common in robotics. There is absolutely nothing wrong with this expression; we must simply realize that, when written in this form, the rotation is the other way around from our earlier development⁶.
There is another slight change in notation that is common in robotics as well. Often, the (·)^× symbol is replaced with the (·)^∧ symbol (Murray et al., 1994), particularly when dealing with transformation matrices. The expression for a rotation matrix is then written as

    C = cos θ 1 + (1 − cos θ) a a^T + sin θ a^∧.    (6.89)

We need to be quite careful with angular velocity as well, since this should in some way match the convention we are using for the angle of rotation.

⁶ Our goal in this section is to make things clear, rather than to argue in favour of one convention over another. However, it is worth noting that this convention, with the third term in (6.88) positive, conforms to a left-hand rotation rather than a right-hand one.
Finally, the pose is written as

    T = [C, r; 0^T, 1],    (6.90)

with all the subscripts removed. We simply need to be careful to remember what all of the quantities actually mean when using them in practice.

In an effort to be relevant to robotics, we will adopt the conventions in (6.89) moving forward in this book. However, we believe it has been worthwhile to begin at first principles in order to better understand what all the quantities associated with pose really mean.

6.3.3 Frenet-Serret Frame

It is worth drawing the connection between our pose variables (represented as transformation matrices) and the classical Frenet-Serret moving frame. Figure 6.4 depicts a point, V, moving smoothly through space. The Frenet-Serret frame is attached to the point with the first axis in the direction of motion (the curve tangent), the second axis pointing in the direction of the tangent derivative with respect to arc length (the curve normal), and the third axis completing the frame (the binormal).
Figure 6.4: The classical Frenet-Serret moving frame can be used to describe the motion of a point. The frame axes point in the tangent (t→), normal (n→), and binormal (b→ = t→ × n→) directions of the curve traced out by the point. This frame and its motion equations are named after the two French mathematicians who independently discovered them: Jean Frédéric Frenet, in his thesis of 1847, and Joseph Alfred Serret in 1851.

The Frenet-Serret equations describe how the frame axes change with arc length:

    d t→/ds = κ n→,    (6.91a)

    d n→/ds = −κ t→ + τ b→,    (6.91b)

    d b→/ds = −τ n→,    (6.91c)

where κ is called the curvature of the path and τ is called the torsion
of the path. Stacking the axes into a frame, F→_v,

    F→_v = [t→; n→; b→],    (6.92)

we can write the Frenet-Serret equations as

    d F→_v/ds = [0, κ, 0; −κ, 0, τ; 0, −τ, 0] F→_v.    (6.93)

Multiplying both sides by the speed along the path, v = ds/dt, and right-multiplying by F→_i^T, we have

    d(F→_v · F→_i^T)/dt = [0, vκ, 0; −vκ, 0, vτ; 0, −vτ, 0] (F→_v · F→_i^T),    (6.94)

where we have applied the chain rule; the left-hand side is Ċ_vi, the matrix in the middle is −ω_v^vi∧, and the right-hand factor is C_vi. We see that this has recovered Poisson's equation for rotational kinematics, as given previously in (6.45); the angular velocity expressed in the moving frame,

    ω_v^vi = [vτ; 0; vκ],    (6.95)

is constrained to only two degrees of freedom, since the middle entry is zero. We also have the translational kinematics,

    ṙ_i^vi = C_vi^T ν_v^vi,    ν_v^vi = [v; 0; 0].    (6.96)
To express this in the body frame, we note that

    ṙ_v^iv = d(−C_vi r_i^vi)/dt = −Ċ_vi r_i^vi − C_vi ṙ_i^vi
           = ω_v^vi∧ C_vi r_i^vi − C_vi C_vi^T ν_v^vi = −ω_v^vi∧ r_v^iv − ν_v^vi.    (6.97)

We can then combine the translational and rotational kinematics into transformation-matrix form as follows:

    Ṫ_vi = d[C_vi, r_v^iv; 0^T, 1]/dt = [Ċ_vi, ṙ_v^iv; 0^T, 0] = [−ω_v^vi∧ C_vi, −ω_v^vi∧ r_v^iv − ν_v^vi; 0^T, 0]
         = [−ω_v^vi∧, −ν_v^vi; 0^T, 0] [C_vi, r_v^iv; 0^T, 1]
         = [0, vκ, 0, −v; −vκ, 0, vτ, 0; 0, −vτ, 0, 0; 0, 0, 0, 0] T_vi.    (6.98)
Integrating this forward in time provides both the translation and rotation of the moving frame. We can think of (v, κ, τ) as three inputs in this case, as they determine the shape of the curve that is traced out. We will see in the next chapter that these kinematic equations can be generalized to the form

    Ṫ = [ω^∧, −ν; 0^T, 0] T,    (6.99)

where

    ϖ = [ν; ω],    (6.100)

is a generalized six-degree-of-freedom velocity vector (expressed in the moving frame) that allows for all possible curves for T to be traced out. The Frenet-Serret equations can be viewed as a special case of this general kinematic formula.
If we want to use T_iv (for reasons described in the last section) instead of T_vi, we can either integrate the above and then output T_iv = T_vi^{-1}, or we can instead integrate

    Ṫ_iv = T_iv [0, −vκ, 0, v; vκ, 0, −vτ, 0; 0, vτ, 0, 0; 0, 0, 0, 0],    (6.101)

which will achieve the same result (proof left as an exercise). If we constrain the motion to the xy-plane, which can be achieved by setting the initial condition to

    T_iv(0) = [cos θ_vi(0), −sin θ_vi(0), 0, x(0); sin θ_vi(0), cos θ_vi(0), 0, y(0); 0, 0, 1, 0; 0, 0, 0, 1],    (6.102)

and then forcing τ = 0 for all time, the kinematics simplify to

    ẋ = v cos θ,    (6.103a)

    ẏ = v sin θ,    (6.103b)

    θ̇ = ω,    (6.103c)

where ω = vκ and it is understood that θ = θ_vi. This last model is sometimes referred to as the 'unicycle model' for a differential-drive mobile robot. The inputs are the longitudinal speed, v, and the rotational speed, ω. The robot is unable to translate sideways due to the nonholonomic constraint associated with its wheels; it can only roll forwards and turn.
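As a minimal illustration of (6.103), the sketch below integrates the unicycle model with constant inputs, which traces a circular arc of radius v/ω (simple Euler integration with a hypothetical step size; names are illustrative):

```python
import numpy as np

def unicycle_step(x, y, theta, v, om, dt):
    """One Euler-integration step of the unicycle model (6.103)."""
    return (x + dt * v * np.cos(theta),
            y + dt * v * np.sin(theta),
            theta + dt * om)

# Drive a quarter circle with constant speed and turn rate
v, om, dt = 1.0, 0.5, 1e-4
x, y, th = 0.0, 0.0, 0.0
n = int((np.pi / 2) / om / dt)  # integrate until theta reaches pi/2
for _ in range(n):
    x, y, th = unicycle_step(x, y, th, v, om, dt)

# Constant inputs trace a circle of radius v/om centred at (0, v/om)
radius = v / om
```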
6.4 Sensor Models


Now that we have some three-dimensional tools, we will introduce a
few three-dimensional sensor models that can be used inside our state
estimation algorithms. In general, we will be interested in sensors that
are on-board our robot. This situation is depicted in Figure 6.5.
Figure 6.5: Reference frames for a moving vehicle with an on-board sensor that observes a point, P, in the world.
We have an inertial frame, F→_i, a vehicle frame, F→_v, and a sensor frame, F→_s. The pose change between the sensor frame and the vehicle frame, T_sv, called the extrinsic sensor parameters, is typically fixed and is either determined through some form of separate calibration method or is folded directly into the state estimation procedure. In the sensor-model developments to follow, we will focus solely on how a point, P, is observed by a sensor attached to F→_s.

6.4.1 Perspective Camera

One of the most important sensors is the perspective camera, as it is cheap yet can be used to infer motion of a vehicle and also the shape of the world.

Normalized Image Coordinates

Figure 6.6 depicts the observation, O, of a point, P, in an ideal perspective camera. In reality, the image plane is behind the pinhole, but showing it in front avoids the mental effort of working with a flipped image. This is called the frontal projection model. The z-axis of F→_s, called the optical axis, is orthogonal to the image plane, and the distance between the pinhole, S, and the image plane center, C, called the focal length, is 1 for this idealized camera model.

Figure 6.6: Frontal projection model of a camera showing the observation, O, of a point, P, in the image plane. In reality, the image plane is behind the pinhole, but the frontal model avoids flipping the image.
If the coordinates of P in F→_s are

    ρ = r_s^ps = [x; y; z],    (6.104)

with the z-axis orthogonal to the image plane, then the coordinates of O in the image plane are

    x_n = x/z,    (6.105a)

    y_n = y/z.    (6.105b)

These are called the normalized image coordinates, and are sometimes provided in a homogeneous form as

    p = [x_n; y_n; 1].    (6.106)
Essential Matrix

If a point, P, is observed by a camera, the camera moved, and then the same point observed again, the two normalized image coordinates corresponding to the observations, p_a and p_b, are related to one another through the following constraint:

    p_a^T E_ab p_b = 0,    (6.107)

where E_ab is called the essential matrix (of computer vision),

    E_ab = C_ba^T r_b^ab∧,    (6.108)

and is related to the pose change of the camera,

    T_ba = [C_ba, r_b^ab; 0^T, 1].    (6.109)

To see that the constraint is true, we let

    p_j = (1/z_j) ρ_j,    ρ_j = [x_j; y_j; z_j],    (6.110)

for j = a, b. We also have

    ρ_a = C_ba^T (ρ_b − r_b^ab),    (6.111)

for the change in coordinates of P due to the camera moving. Then, returning to the constraint, we see

    p_a^T E_ab p_b = (1/(z_a z_b)) ρ_a^T E_ab ρ_b = (1/(z_a z_b)) (ρ_b − r_b^ab)^T C_ba C_ba^T r_b^ab∧ ρ_b
                  = −(1/(z_a z_b)) (ρ_b^T ρ_b^∧ r_b^ab + r_b^abT r_b^ab∧ ρ_b) = 0,    (6.112)

since C_ba C_ba^T = 1 and both remaining terms vanish identically (ρ_b^T ρ_b^∧ = 0^T and r_b^abT r_b^ab∧ = 0^T).

Figure 6.7: Two camera observations of the same point, P.

The essential matrix can be useful in some pose-estimation problems, including camera calibration.
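The epipolar constraint (6.107) can be verified numerically for a hypothetical camera motion (the specific pose and point values below are illustrative only):

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

# A hypothetical camera motion: rotation C_ba and translation r_b^ab
th = 0.3
C_ba = np.array([[np.cos(th), np.sin(th), 0],
                 [-np.sin(th), np.cos(th), 0],
                 [0, 0, 1]])
r_ab_b = np.array([0.5, -0.2, 0.1])

# A world point expressed in frame b, then mapped into frame a via (6.111)
rho_b = np.array([1.0, 2.0, 5.0])
rho_a = C_ba.T @ (rho_b - r_ab_b)

# Normalized image coordinates (6.110)
p_a = rho_a / rho_a[2]
p_b = rho_b / rho_b[2]

# Essential matrix (6.108) and the constraint residual (6.107)
E_ab = C_ba.T @ skew(r_ab_b)
residual = p_a @ E_ab @ p_b  # should be zero up to round-off
```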

Lens Distortion
In general, lens effects can distort camera images so that the normal-
ized image coordinate equations are only approximately true. A variety
of analytical models of this distortion are available and these can be
used to correct the raw images such that the resulting images appear
as though they come from an idealized pinhole camera, and thus the
normalized image coordinate equations hold. We will assume this undis-
tortion procedure has been applied to the images and avoid elaborating
on the distortion models.

Intrinsic Parameters

The normalized image coordinates are really associated with a hypothetical camera with unit focal length and image origin at the optical-axis intersection. We can map the normalized image coordinates, (x_n, y_n), to the actual pixel coordinates, (u, v), through the following relation:

    [u; v; 1] = K [x_n; y_n; 1],    K = [f_u, 0, c_u; 0, f_v, c_v; 0, 0, 1],    (6.113)

where K is called the intrinsic parameter matrix and contains the actual camera focal length expressed in horizontal pixels, f_u, and vertical pixels, f_v, as well as the actual offset of the image origin from the optical-axis intersection, (c_u, c_v), also expressed in horizontal and vertical pixels⁷. These intrinsic parameters are typically determined during the calibration procedure used to remove the lens effects, so that we can assume K is known.

Figure 6.8: Camera model showing the intrinsic parameters: f is the focal length and (c_u, c_v) is the optical-axis intersection.

Fundamental Matrix

Similarly to the essential matrix constraint, there is a constraint that can be expressed between the homogeneous pixel coordinates of two observations of a point from different camera perspectives (and possibly even different cameras). Let

    q_i = K_i p_i,    (6.114)

with i = a, b for the pixel coordinates of two camera observations with different intrinsic parameter matrices. Then the following constraint holds:

    q_a^T F_ab q_b = 0,    (6.115)

where

    F_ab = K_a^{-T} E_ab K_b^{-1},    (6.116)

is called the fundamental matrix (of computer vision). It is fairly easy to see the constraint is true by substitution:

    q_a^T F_ab q_b = p_a^T K_a^T K_a^{-T} E_ab K_b^{-1} K_b p_b = p_a^T E_ab p_b = 0,    (6.117)

where we use the essential-matrix constraint for the last step.

⁷ On many imaging sensors, the pixels are not square, resulting in different units in the horizontal and vertical directions.


The constraint associated with the fundamental matrix is also sometimes called the epipolar constraint and is depicted geometrically in Figure 6.9. If a point is observed in one camera, q_a, and the fundamental matrix between the first and a second camera is known, the constraint describes a line, called the epipolar line, along which the observation of the point in the second camera, q_b, must lie. This property can be used to limit the search for a matching point to just the epipolar line. This is possible because the camera model is a projective transformation, implying a straight line in Euclidean space projects to a straight line in image space. The fundamental matrix is also useful in developing methods to determine the intrinsic parameter matrix, for example.

Complete Model

Combining everything but the lens effects, the perspective camera model can be written as

    [u; v] = s(ρ) = P K (1/z) ρ,    (6.118)

where

    P = [1, 0, 0; 0, 1, 0],    K = [f_u, 0, c_u; 0, f_v, c_v; 0, 0, 1],    ρ = [x; y; z].    (6.119)

P is simply a projection matrix to remove the bottom row from the homogeneous point representation. This form of the model makes it clear that with a single camera there is a loss of information, as we are going from three parameters in ρ to just two in (u, v); we are unable to determine depth from just one camera.

Figure 6.9: If a point is observed in one image, q_a, and the fundamental matrix, F_ab, is known, this can be used to define an epipolar line in the second image along which the second observation, q_b, must lie.
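A minimal sketch of the complete model (6.118), using illustrative intrinsic values:

```python
import numpy as np

# Intrinsics for a hypothetical camera (values are illustrative only)
fu, fv, cu, cv = 500.0, 480.0, 320.0, 240.0
K = np.array([[fu, 0, cu], [0, fv, cv], [0, 0, 1]])
P = np.array([[1.0, 0, 0], [0, 1.0, 0]])  # drops the bottom row

def project(rho):
    """Perspective camera model (6.118): pixel coordinates of a point rho = (x, y, z)."""
    return P @ K @ (rho / rho[2])

uv = project(np.array([0.2, -0.1, 2.0]))
```

Here the normalized coordinates are (0.1, −0.05), so the pixel coordinates come out to (370, 216).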

Homography

Although we cannot determine depth from just one camera, if we assume that the point a camera is observing lies on the surface of a plane whose geometry is known, we can work out the depth and then how that point will look to another camera. The geometry of this situation is depicted in Figure 6.10.

The homogeneous observations for the two cameras can be written as

    q_i = (1/z_i) K_i ρ_i,    ρ_i = [x_i; y_i; z_i],    (6.120)

where ρ_i are the coordinates of P in each camera frame with i = a, b. Let us assume we know the equation of the plane containing P, expressed in both camera frames; this can be parameterized as

    {n_i, d_i},    (6.121)

where d_i is the distance of camera i from the plane and n_i are the coordinates of the plane normal in frame i. This implies that

    n_i^T ρ_i + d_i = 0,    (6.122)

since P is in the plane. Solving for ρ_i in (6.120) and substituting into the plane equation, we have

    z_i n_i^T K_i^{-1} q_i + d_i = 0,    (6.123)

or

    z_i = −d_i / (n_i^T K_i^{-1} q_i),    (6.124)

for the depth of point P in each camera, i = a, b. This further implies that we can write the coordinates of P, expressed in each camera frame, as

    ρ_i = −(d_i / (n_i^T K_i^{-1} q_i)) K_i^{-1} q_i.    (6.125)

This shows that the knowledge of the plane parameters, {n_i, d_i}, allows us to recover the coordinates of P even though a single camera cannot determine depth on its own.

Figure 6.10: If the point observed by a camera lies on a plane whose geometry is known, it is possible to work out what that point will look like after the camera makes a pose change, using a transform called a homography.
Let us also assume we know the pose change, T_ba, from F→_a to F→_b, so that

    [ρ_b; 1] = T_ba [ρ_a; 1],    T_ba = [C_ba, r_b^ab; 0^T, 1],    (6.126)

or

    ρ_b = C_ba ρ_a + r_b^ab.    (6.127)

Inserting (6.120), we have that

    z_b K_b^{-1} q_b = z_a C_ba K_a^{-1} q_a + r_b^ab.    (6.128)

We can then isolate for q_b in terms of q_a:

    q_b = (z_a/z_b) K_b C_ba K_a^{-1} q_a + (1/z_b) K_b r_b^ab.    (6.129)

Then, substituting z_a from (6.124), we have

    q_b = (z_a/z_b) K_b C_ba (1 + (1/d_a) r_a^ba n_a^T) K_a^{-1} q_a,    (6.130)

where we used that r_b^ab = −C_ba r_a^ba. Finally, we can write

    q_b = K_b H_ba K_a^{-1} q_a,    (6.131)

where

    H_ba = (z_a/z_b) C_ba (1 + (1/d_a) r_a^ba n_a^T),    (6.132)

is called the homography matrix. Since the factor z_a/z_b just scales q_b, it can be dropped in practice owing to the fact that q_b are homogeneous coordinates and the true pixel coordinates can always be recovered by dividing the first two entries by the third; doing so means that H_ba is only a function of the pose change and the plane parameters.

It is worth noting that in the case of a pure rotation, r_a^ba = 0, so that the homography matrix simplifies to

    H_ba = C_ba,    (6.133)

when the z_a/z_b factor is dropped.
The homography matrix is invertible, and its inverse is given by

    H_ba^{-1} = H_ab = (z_b/z_a) C_ab (1 + (1/d_b) r_b^ab n_b^T).    (6.134)

This allows us to transform observations in the other direction.
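The homography relation (6.131)–(6.132) can be checked numerically; the plane, pose change, and intrinsics below are hypothetical values chosen for illustration, and the z_a/z_b scale is dropped and then recovered by normalizing the homogeneous result:

```python
import numpy as np

# Hypothetical setup: shared intrinsics, a pose change, and a known plane
K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1]])
th = 0.2
C_ba = np.array([[np.cos(th), 0, -np.sin(th)],
                 [0, 1, 0],
                 [np.sin(th), 0, np.cos(th)]])
r_ba_a = np.array([0.4, 0.1, -0.3])   # r_a^ba
n_a = np.array([0.0, 0.0, -1.0])      # plane normal in frame a
d_a = 5.0                             # plane: n_a . rho + d_a = 0, i.e., z = 5

# A point on the plane, expressed in frame a, then in frame b via (6.127)
rho_a = np.array([1.0, -2.0, 5.0])
rho_b = C_ba @ (rho_a - r_ba_a)       # uses r_b^ab = -C_ba r_a^ba

q_a = K @ (rho_a / rho_a[2])
q_b = K @ (rho_b / rho_b[2])

# Homography (6.132) without the z_a/z_b scale; predict q_b via (6.131)
H_ba = C_ba @ (np.eye(3) + np.outer(r_ba_a, n_a) / d_a)
q_b_pred = K @ H_ba @ np.linalg.inv(K) @ q_a
q_b_pred = q_b_pred / q_b_pred[2]     # restore the homogeneous scale
```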

6.4.2 Stereo Camera

Another common three-dimensional sensor is a stereo camera, which consists of two perspective cameras rigidly connected to one another with a known transformation between them. Figure 6.11 depicts one of the most common stereo configurations, where the two cameras are separated along the x-axis by a stereo baseline of b. Unlike a single camera, it is possible to determine depth to a point from a stereo observation.

Midpoint Model

If we express the coordinates of the point, P, in F→_s as

    ρ = r_s^ps = [x; y; z],    (6.135)

then the model for the left camera is

    [u_ℓ; v_ℓ] = (1/z) P K [x + b/2; y; z],    (6.136)

and the model for the right camera is

    [u_r; v_r] = (1/z) P K [x − b/2; y; z],    (6.137)
Figure 6.11: Stereo camera rig. Two cameras are mounted pointing in the same direction but with a known separation of b along the x-axis. We choose the sensor frame to be located at the midpoint between the two cameras.
where we assume the two cameras have the same intrinsic parameter matrix. Stacking the two observations together, we can write the stereo camera model as

    [u_ℓ; v_ℓ; u_r; v_r] = s(ρ) = (1/z) M [x; y; z; 1],    M = [f_u, 0, c_u, f_u b/2; 0, f_v, c_v, 0; f_u, 0, c_u, −f_u b/2; 0, f_v, c_v, 0],    (6.138)

where M is now a combined parameter matrix for the stereo rig. It is worth noting that M is not invertible, since two of its rows are the same. In fact, because of the stereo setup, the vertical coordinates of the two observations will always be the same; this corresponds to the fact that epipolar lines in this configuration are horizontal, such that if a point is observed in one image, the observation in the other image can be found by searching along the line with the same vertical pixel coordinate. We can see this using the fundamental-matrix constraint; for this stereo setup, we have C_rℓ = 1 and r_r^ℓr = [−b; 0; 0], so that the constraint is

    [u_ℓ, v_ℓ, 1] [f_u, 0, 0; 0, f_v, 0; c_u, c_v, 1] [0, 0, 0; 0, 0, b; 0, −b, 0] [f_u, 0, c_u; 0, f_v, c_v; 0, 0, 1] [u_r; v_r; 1] = 0,    (6.139)

where the three inner factors are K_ℓ^T, E_ℓr, and K_r, respectively. Multiplying this out, we see that it simplifies to v_r = v_ℓ.
Left Model

Figure 6.12: Alternate stereo model with the sensor frame located at the left camera.

We could also choose to locate the sensor frame at the left camera rather than the midpoint between the two cameras. In this case, the
camera model becomes

    [u_ℓ; v_ℓ; u_r; v_r] = (1/z) [f_u, 0, c_u, 0; 0, f_v, c_v, 0; f_u, 0, c_u, −f_u b; 0, f_v, c_v, 0] [x; y; z; 1].    (6.140)

Typically, in this form, the v_r equation is dropped and the u_r equation is replaced with one for the disparity⁸, d, given by

    d = u_ℓ − u_r = (1/z) f_u b,    (6.141)

so that we can write

    [u_ℓ; v_ℓ; d] = f(ρ) = (1/z) [f_u, 0, c_u, 0; 0, f_v, c_v, 0; 0, 0, 0, f_u b] [x; y; z; 1],    (6.142)

for the stereo model. This form has the appealing property that we are going from three point parameters, (x, y, z), to three observations, (u_ℓ, v_ℓ, d). A similar model can be developed for the right camera.
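Because (6.142) maps three point parameters to three observations, it can be inverted to recover the point from a stereo observation. A short sketch, with hypothetical stereo parameters:

```python
import numpy as np

# Hypothetical stereo parameters (illustrative values)
fu, fv, cu, cv, b = 400.0, 400.0, 300.0, 200.0, 0.24

def stereo_left(rho):
    """Left stereo model (6.142): (u_l, v_l, d) from rho = (x, y, z)."""
    x, y, z = rho
    return np.array([fu * x / z + cu, fv * y / z + cv, fu * b / z])

def stereo_left_inv(obs):
    """Invert (6.142): depth from disparity, then x and y from the pixel coordinates."""
    ul, vl, d = obs
    z = fu * b / d
    return np.array([(ul - cu) * z / fu, (vl - cv) * z / fv, z])

rho = np.array([0.5, -0.3, 4.0])
obs = stereo_left(rho)
rho_back = stereo_left_inv(obs)
```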

6.4.3 Range-Azimuth-Elevation

Some sensors, such as lidar (light detection and ranging), can be modelled as a range-azimuth-elevation (RAE) sensor, which essentially observes a point, P, in spherical coordinates. For lidar, which can measure distance by reflecting laser pulses off a scene, the azimuth and elevation are the angles of the mirrors that are used to steer the laser beam, and the range is the reported distance determined by time of flight. The geometry of this sensor type is depicted in Figure 6.13.

The coordinates of point P in the sensor frame, F→_s, are

    ρ = r_s^ps = [x; y; z].    (6.143)

⁸ The disparity equation can be used as a one-dimensional stereo camera model, as we have already seen in the earlier chapter on nonlinear estimation.

Figure 6.13: A range-azimuth-elevation sensor model observes a point, P, in spherical coordinates: r is the range, α the azimuth, and ε the elevation.
These can also be written as

    ρ = C_3^T(α) C_2^T(−ε) [r; 0; 0],    (6.144)

where α is the azimuth, ε is the elevation, r is the range, and C_i is the principal rotation about axis i. The elevation rotation indicated in Figure 6.13 is negative according to the right-hand rule. Inserting the principal-axis rotation formulas and multiplying out, we find that

    [x; y; z] = [r cos α cos ε; r sin α cos ε; r sin ε],    (6.145)

which are the common spherical-coordinate expressions. Unfortunately, this is the inverse of the sensor model we desire. We can invert this expression to show that the RAE sensor model is

    [r; α; ε] = s(ρ) = [√(x^2 + y^2 + z^2); tan^{-1}(y/x); sin^{-1}(z/√(x^2 + y^2 + z^2))].    (6.146)

In the case that the point P lies in the xy-plane, we have z = 0 and hence ε = 0, so that the RAE model simplifies to the range-bearing model:

    [r; α] = s(ρ) = [√(x^2 + y^2); tan^{-1}(y/x)],    (6.147)

which is commonly used in mobile robotics.
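A small sketch of (6.145) and (6.146); note that `arctan2` is used in place of tan⁻¹(y/x) so the azimuth quadrant is resolved correctly (an implementation choice, not from the text):

```python
import numpy as np

def rae_model(rho):
    """RAE sensor model (6.146): (r, alpha, eps) from rho = (x, y, z)."""
    x, y, z = rho
    r = np.sqrt(x**2 + y**2 + z**2)
    return np.array([r, np.arctan2(y, x), np.arcsin(z / r)])

def rae_inverse(r, alpha, eps):
    """Inverse model (6.145): Cartesian coordinates from spherical ones."""
    return np.array([r * np.cos(alpha) * np.cos(eps),
                     r * np.sin(alpha) * np.cos(eps),
                     r * np.sin(eps)])

rho = np.array([3.0, 4.0, 1.0])
r, alpha, eps = rae_model(rho)
rho_back = rae_inverse(r, alpha, eps)  # round-trips back to rho
```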

6.4.4 Inertial Measurement Unit

Another common sensor that functions in three-dimensional space is the inertial measurement unit (IMU). An ideal IMU comprises three orthogonal linear accelerometers and three orthogonal rate gyros⁹. All quantities are measured in a sensor frame, F→_s, which is typically not located at the vehicle frame, F→_v, as shown in Figure 6.14.

To model an IMU, we assume that the state of the vehicle can be captured by the quantities

    r_i^vi, C_vi (pose),    ω_v^vi (angular velocity),    ω̇_v^vi (angular acceleration),    (6.148)

and that we know the fixed pose change between the vehicle and sensor frames, given by r_v^sv and C_sv, which is typically determined by calibration.

⁹ Typically, calibration is required, as the axes are never perfectly orthogonal due to manufacturing tolerances.
Figure 6.14: An inertial measurement unit has three linear accelerometers and three rate gyros that measure quantities in a sensor frame that is typically not coincident with the vehicle frame.
The gyro sensor model is simpler than that of the accelerometers, so we will discuss it first. Essentially, the measured angular rates, ω, are the body rates of the vehicle, expressed in the sensor frame:

    ω = C_sv ω_v^vi.    (6.149)

This exploits the fact that the sensor frame is fixed with respect to the vehicle frame, so that Ċ_sv = 0.

Because accelerometers typically use test masses as part of the measurement principle, the resulting observations, a, can be written as

    a = C_si (r̈_i^si − g_i),    (6.150)

where r̈_i^si is the inertial acceleration of the sensor point, S, and g_i is gravity. Notably, in freefall the accelerometers will measure a = 0, whereas at rest they will measure only gravity (in the sensor frame).

Unfortunately, this accelerometer model is not in terms of the vehicle state quantities that we identified above, and must be modified to account for the offset between the sensor and vehicle frames. To do this, we note that

    r_i^si = r_i^vi + C_vi^T r_v^sv.    (6.151)

Differentiating twice (and using Poisson's equation from (6.45) and that ṙ_v^sv = 0) provides

    r̈_i^si = r̈_i^vi + C_vi^T ω̇_v^vi∧ r_v^sv + C_vi^T ω_v^vi∧ ω_v^vi∧ r_v^sv,    (6.152)

where the right-hand side is now in terms of our state quantities¹⁰ and known calibration parameters. Inserting this into (6.150) gives our final model for the accelerometers:

    a = C_sv (C_vi (r̈_i^vi − g_i) + ω̇_v^vi∧ r_v^sv + ω_v^vi∧ ω_v^vi∧ r_v^sv).    (6.153)

Naturally, if the offset between the sensor and vehicle frames, r_v^sv, is sufficiently small, we may choose to neglect the last two terms.

¹⁰ If the angular acceleration quantity, ω̇_v^vi, is not actually part of the state, it could be determined from two or more gyro measurements.
[Figure 6.15: For high-performance inertial-measurement-unit applications, it is necessary to account for the rotation of the Earth, in which case a topocentric reference frame is located on the Earth's surface and an inertial frame is located at the Earth's center.]
To summarize, we can stack the accelerometer and gyro models into the following combined IMU sensor model:
$$\begin{bmatrix}\mathbf{a}\\ \boldsymbol{\omega}\end{bmatrix} = \mathbf{s}\left(\mathbf{r}_i^{vi}, \mathbf{C}_{vi}, \boldsymbol{\omega}_v^{vi}, \dot{\boldsymbol{\omega}}_v^{vi}\right) = \begin{bmatrix}\mathbf{C}_{sv}\left(\mathbf{C}_{vi}\left(\ddot{\mathbf{r}}_i^{vi} - \mathbf{g}_i\right) + \dot{\boldsymbol{\omega}}_v^{vi^\wedge}\mathbf{r}_v^{sv} + \boldsymbol{\omega}_v^{vi^\wedge}\boldsymbol{\omega}_v^{vi^\wedge}\mathbf{r}_v^{sv}\right)\\ \mathbf{C}_{sv}\boldsymbol{\omega}_v^{vi}\end{bmatrix}, \tag{6.154}$$
where $\mathbf{C}_{sv}$ and $\mathbf{r}_v^{sv}$ are the (known) pose change between the vehicle and sensor frames and $\mathbf{g}_i$ is gravity in the inertial frame.
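As a concrete illustration, the combined model (6.154) can be coded directly. The following is a minimal pure-Python sketch (the function and helper names are ours, and matrices are nested lists rather than a linear algebra type); it reproduces the two sanity checks mentioned above: at rest the accelerometers measure only gravity (in the sensor frame), and in freefall they measure zero.

```python
def matvec(M, v):
    # multiply a 3x3 matrix (list of rows) by a 3-vector
    return [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]

def skew(v):
    # the wedge operator: 3-vector -> 3x3 skew-symmetric matrix
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def imu_model(C_sv, r_sv, C_vi, r_ddot_i, omega, omega_dot, g_i):
    """Combined IMU model (6.154): returns (accelerometer, gyro) readings."""
    # specific force expressed in the vehicle frame
    f_v = matvec(C_vi, [r_ddot_i[k] - g_i[k] for k in range(3)])
    # lever-arm terms due to the sensor/vehicle offset
    lever = [matvec(skew(omega_dot), r_sv)[k]
             + matvec(skew(omega), matvec(skew(omega), r_sv))[k]
             for k in range(3)]
    a = matvec(C_sv, [f_v[k] + lever[k] for k in range(3)])
    w = matvec(C_sv, omega)
    return a, w

I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
g = [0.0, 0.0, -9.81]

# at rest: accelerometers read only (minus) gravity in the sensor frame
a, w = imu_model(I, [0.0] * 3, I, [0.0] * 3, [0.0] * 3, [0.0] * 3, g)
print(a)  # -> [0.0, 0.0, 9.81]

# in freefall (r_ddot = g): accelerometers read zero
a, w = imu_model(I, [0.0] * 3, I, g, [0.0] * 3, [0.0] * 3, g)
print(a)  # -> [0.0, 0.0, 0.0]
```

The lever-arm terms only matter when the sensor is offset from the vehicle frame; with `r_sv` zero, as here, they vanish.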
For some high-performance inertial-measurement-unit applications, the above model is insufficient, since it assumes, for example, that an inertial reference frame can be located conveniently on the Earth's surface. High-end IMUs, however, are sensitive enough to detect the rotation of the Earth, and thus a more elaborate model of the sensor is required. The typical setup is depicted in Figure 6.15, where the inertial frame is located at the Earth's center of mass (but not rotating) and a convenient (non-inertial) reference frame (used to track the vehicle's motion) is located on the Earth's surface. This requires generalizing the sensor model to account for the more sophisticated setup (not shown).
6.5 Summary

The main take-away points from this chapter are:

1. Objects that are able to rotate in three dimensions pose a problem for our state estimation techniques in the first part of the book. This is because we cannot, in general, use a vectorspace to describe the three-dimensional orientation of an object.
2. There are several ways to parameterize rotations (e.g., rotation matrix, Euler angles, unit-length quaternions). They all have advantages and disadvantages; some have singularities while others have constraints. Our choice in this book is to favour the rotation matrix, since this is the quantity most commonly used to rotate vectors from one reference frame to another.
3. There are many different notational conventions in use in different fields (i.e., robotics, computer vision, aerospace). Coupled with the variety of rotational parameterizations, this can often be a source of miscommunication in practice. Our goal in this book is to explain three-dimensional state estimation consistently and clearly in just one of the notational possibilities.

The next chapter will explore more deeply the mathematics of rotations and poses by introducing matrix Lie groups.
6.6 Exercises

6.6.1 Show that $\mathbf{u}^\wedge\mathbf{v} \equiv -\mathbf{v}^\wedge\mathbf{u}$ for any two $3 \times 1$ columns $\mathbf{u}$ and $\mathbf{v}$.
6.6.2 Show that $\mathbf{C}^{-1} = \mathbf{C}^T$ starting from
$$\mathbf{C} = \cos\theta\,\mathbf{1} + (1 - \cos\theta)\,\mathbf{a}\mathbf{a}^T + \sin\theta\,\mathbf{a}^\wedge.$$
6.6.3 Show that $(\mathbf{C}\mathbf{v})^\wedge \equiv \mathbf{C}\mathbf{v}^\wedge\mathbf{C}^T$ for any $3 \times 1$ column $\mathbf{v}$ and rotation matrix, $\mathbf{C}$.
6.6.4 Show that
$$\dot{\mathbf{T}}_{iv} = \mathbf{T}_{iv}\begin{bmatrix} 0 & -v\kappa & 0 & v\\ v\kappa & 0 & -v\tau & 0\\ 0 & v\tau & 0 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}.$$
6.6.5 Show that if we constrain the motion to the $xy$-plane, the Frenet-Serret equations simplify to
$$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega,$$
where $\omega = v\kappa$.
6.6.6 Show that the single-camera model is an affine transformation, meaning straight lines in Euclidean space project to straight lines in image space.
6.6.7 Show directly that the inverse of the homography matrix,
$$\mathbf{H}_{ba} = \frac{z_a}{z_b}\mathbf{C}_{ba}\left(\mathbf{1} + \frac{1}{d_a}\mathbf{r}_a^{ba}\mathbf{n}_a^T\right),$$
is
$$\mathbf{H}_{ba}^{-1} = \frac{z_b}{z_a}\mathbf{C}_{ab}\left(\mathbf{1} + \frac{1}{d_b}\mathbf{r}_b^{ab}\mathbf{n}_b^T\right).$$
6.6.8 Work out the stereo camera model for the case in which the sensor frame is located at the right camera instead of the left or the midpoint.
6.6.9 Work out the inverse of the left stereo camera model. In other words, how can we go from $(u_\ell, v_\ell, d)$ back to the point coordinates, $(x, y, z)$?
6.6.10 Work out an IMU model for the situation depicted in Figure 6.15, where it is necessary to account for the rotation of the Earth about its axis.
7 Matrix Lie Groups

We have already introduced rotations and poses in the previous chapter on three-dimensional geometry. In this chapter, we look deeper into the nature of these quantities. It turns out that rotations are quite different from the usual vector quantities with which we are familiar. The set of rotations is not a vectorspace in the sense of linear algebra. However, rotations do form another mathematical object called a non-commutative group, which possesses some but not all of the usual vectorspace properties.

We will focus our efforts in this chapter on two sets known as matrix Lie groups. Stillwell (2008) provides an accessible account of Lie theory and Chirikjian (2009) is an excellent reference from the robotics perspective.

[Margin note: Marius Sophus Lie (1842-1899) was a Norwegian mathematician. He largely created the theory of continuous symmetry and applied it to the study of geometry and differential equations.]

7.1 Geometry

We will work with two specific matrix Lie groups in this chapter: the special orthogonal group, denoted SO(3), which can represent rotations, and the special Euclidean group, SE(3), which can represent poses.
7.1.1 Special Orthogonal and Special Euclidean Groups

The special orthogonal group, representing rotations, is simply the set of valid rotation matrices:
$$SO(3) = \left\{\mathbf{C} \in \mathbb{R}^{3\times3} \mid \mathbf{C}\mathbf{C}^T = \mathbf{1},\ \det\mathbf{C} = 1\right\}. \tag{7.1}$$
The $\mathbf{C}\mathbf{C}^T = \mathbf{1}$ orthogonality condition is needed to impose six constraints on the nine-parameter rotation matrix, thereby reducing the number of degrees of freedom to three. Noticing that
$$(\det\mathbf{C})^2 = \det\left(\mathbf{C}\mathbf{C}^T\right) = \det\mathbf{1} = 1, \tag{7.2}$$
we have that
$$\det\mathbf{C} = \pm 1, \tag{7.3}$$
allowing for two possibilities. Choosing $\det\mathbf{C} = 1$ ensures that we have a proper rotation¹.
Although the set of all matrices can be shown to be a vectorspace, SO(3) is not a valid subspace². For example, SO(3) is not closed under addition, so adding two rotation matrices does not result in a valid rotation matrix:
$$\mathbf{C}_1, \mathbf{C}_2 \in SO(3) \;\nRightarrow\; \mathbf{C}_1 + \mathbf{C}_2 \in SO(3). \tag{7.4}$$
Also, the zero matrix is not a valid rotation matrix: $\mathbf{0} \notin SO(3)$. Without these properties (and some others), SO(3) cannot be a vectorspace (at least not a subspace of $\mathbb{R}^{3\times3}$).
The special Euclidean group, representing poses (i.e., translation and rotation), is simply the set of valid transformation matrices:
$$SE(3) = \left\{\mathbf{T} = \begin{bmatrix}\mathbf{C} & \mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix} \in \mathbb{R}^{4\times4}\ \middle|\ \mathbf{C} \in SO(3),\ \mathbf{r} \in \mathbb{R}^3\right\}. \tag{7.5}$$
By similar arguments to SO(3), we can show that SE(3) is not a vectorspace (at least not a subspace of $\mathbb{R}^{4\times4}$).
While SO(3) and SE(3) are not vectorspaces, they can be shown
to be matrix Lie groups3 . We next show what this means. In math-
ematics, a group is a set of elements together with an operation that
combines any two of its elements to form a third element also in the
set, while satisfying four conditions called the group axioms, namely
closure, associativity, identity, and invertibility. A Lie group is a group
that is also a differential manifold, with the property that the group
operations are smooth4 . A matrix Lie group further specifies that the
elements of the group are matrices, the combination operation is matrix
multiplication, and the inversion operation is matrix inversion.
The four group properties are then as shown in Table 7.1 for our two candidate matrix Lie groups. Closure for SO(3) actually follows directly from Euler's theorem, which says a compounding of rotations can always be replaced by a single rotation. Or, we can note that
$$(\mathbf{C}_1\mathbf{C}_2)(\mathbf{C}_1\mathbf{C}_2)^T = \mathbf{C}_1\underbrace{\mathbf{C}_2\mathbf{C}_2^T}_{\mathbf{1}}\mathbf{C}_1^T = \mathbf{C}_1\mathbf{C}_1^T = \mathbf{1}, \qquad \det(\mathbf{C}_1\mathbf{C}_2) = \underbrace{\det(\mathbf{C}_1)}_{1}\underbrace{\det(\mathbf{C}_2)}_{1} = 1, \tag{7.6}$$
1 There is another case in which det C = −1, sometimes called an improper rotation or
rotary reflection, but we shall not be concerned with it here.
2 A subspace of a vectorspace is also a vectorspace.
3 They are actually non-Abelian (or non-commutative) groups since the order in which
we compound elements matters.
4 Smoothness implies that we can use differential calculus on the manifold; or, roughly, if
we change the input to any group operation by a little bit, the output will only change
by a little bit.
Table 7.1: Matrix Lie group properties for SO(3) (rotations) and SE(3) (poses).

    property       | SO(3)                                  | SE(3)
    ---------------|----------------------------------------|---------------------------------------
    closure        | C1, C2 ∈ SO(3) ⇒ C1 C2 ∈ SO(3)          | T1, T2 ∈ SE(3) ⇒ T1 T2 ∈ SE(3)
    associativity  | C1 (C2 C3) = (C1 C2) C3 = C1 C2 C3      | T1 (T2 T3) = (T1 T2) T3 = T1 T2 T3
    identity       | C, 1 ∈ SO(3) ⇒ C 1 = 1 C = C            | T, 1 ∈ SE(3) ⇒ T 1 = 1 T = T
    invertibility  | C ∈ SO(3) ⇒ C⁻¹ ∈ SO(3)                 | T ∈ SE(3) ⇒ T⁻¹ ∈ SE(3)
such that $\mathbf{C}_1\mathbf{C}_2 \in SO(3)$ if $\mathbf{C}_1, \mathbf{C}_2 \in SO(3)$. Closure for SE(3) can be seen by simply multiplying:
$$\mathbf{T}_1\mathbf{T}_2 = \begin{bmatrix}\mathbf{C}_1 & \mathbf{r}_1\\ \mathbf{0}^T & 1\end{bmatrix}\begin{bmatrix}\mathbf{C}_2 & \mathbf{r}_2\\ \mathbf{0}^T & 1\end{bmatrix} = \begin{bmatrix}\mathbf{C}_1\mathbf{C}_2 & \mathbf{C}_1\mathbf{r}_2 + \mathbf{r}_1\\ \mathbf{0}^T & 1\end{bmatrix} \in SE(3), \tag{7.7}$$
since $\mathbf{C}_1\mathbf{C}_2 \in SO(3)$ and $\mathbf{C}_1\mathbf{r}_2 + \mathbf{r}_1 \in \mathbb{R}^3$. Associativity follows for both groups from the properties of matrix multiplication⁵. The identity matrix is the identity element of both groups, which again follows from the properties of matrix multiplication. Finally, since $\mathbf{C}^{-1} = \mathbf{C}^T$, which follows from $\mathbf{C}\mathbf{C}^T = \mathbf{1}$, we know that the inverse of an element of SO(3) is still in SO(3). This can be seen through
$$\mathbf{C}^{-1}\left(\mathbf{C}^{-1}\right)^T = \mathbf{C}^T\left(\mathbf{C}^T\right)^T = \underbrace{\mathbf{C}^T\mathbf{C}}_{\mathbf{1}} = \mathbf{1}, \qquad \det\left(\mathbf{C}^{-1}\right) = \det\left(\mathbf{C}^T\right) = \underbrace{\det\mathbf{C}}_{1} = 1. \tag{7.8}$$
The inverse of an element of SE(3) is
$$\mathbf{T}^{-1} = \begin{bmatrix}\mathbf{C} & \mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix}^{-1} = \begin{bmatrix}\mathbf{C}^T & -\mathbf{C}^T\mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix} \in SE(3), \tag{7.9}$$
since $\mathbf{C}^T \in SO(3)$ and $-\mathbf{C}^T\mathbf{r} \in \mathbb{R}^3$, so this also holds. Other than the smoothness criterion, this establishes SO(3) and SE(3) as matrix Lie groups.
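The group properties above are easy to probe numerically. Below is a small pure-Python sketch (helper names such as `is_rotation` are ours) that checks closure and invertibility for SO(3), and confirms that a sum of two rotation matrices is not a rotation.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def is_rotation(C, tol=1e-12):
    # check C C^T = 1 and det C = 1, as in (7.1)
    CCt = matmul(C, transpose(C))
    orth = all(abs(CCt[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(3) for j in range(3))
    det = (C[0][0] * (C[1][1] * C[2][2] - C[1][2] * C[2][1])
           - C[0][1] * (C[1][0] * C[2][2] - C[1][2] * C[2][0])
           + C[0][2] * (C[1][0] * C[2][1] - C[1][1] * C[2][0]))
    return orth and abs(det - 1.0) < tol

C1, C2 = rot_z(0.3), rot_x(-1.2)
print(is_rotation(matmul(C1, C2)))   # closure -> True
print(is_rotation(transpose(C1)))    # inverse, C^T -> True
# ...but SO(3) is not closed under addition:
S = [[C1[i][j] + C2[i][j] for j in range(3)] for i in range(3)]
print(is_rotation(S))                # -> False
```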

7.1.2 Lie Algebras

To every matrix Lie group there is associated a Lie algebra, which consists of a vectorspace⁶, V, over some field⁷, F, together with a binary operation, $[\cdot,\cdot]$, called the Lie bracket (of the algebra), that satisfies four properties:

5 The set of all real matrices can be shown to be an algebra, and associativity of matrix multiplication is a required property.
6 We can take this to be a subspace of the square real matrices, which is a vectorspace.
7 We can take this to be the field of real numbers, $\mathbb{R}$.
closure: $[\mathbf{X}, \mathbf{Y}] \in V$,
bilinearity: $[a\mathbf{X} + b\mathbf{Y}, \mathbf{Z}] = a[\mathbf{X}, \mathbf{Z}] + b[\mathbf{Y}, \mathbf{Z}]$, $\quad [\mathbf{Z}, a\mathbf{X} + b\mathbf{Y}] = a[\mathbf{Z}, \mathbf{X}] + b[\mathbf{Z}, \mathbf{Y}]$,
alternating: $[\mathbf{X}, \mathbf{X}] = \mathbf{0}$,
Jacobi identity: $[\mathbf{X}, [\mathbf{Y}, \mathbf{Z}]] + [\mathbf{Z}, [\mathbf{Y}, \mathbf{X}]] + [\mathbf{Y}, [\mathbf{Z}, \mathbf{X}]] = \mathbf{0}$,

for all $\mathbf{X}, \mathbf{Y}, \mathbf{Z} \in V$ and $a, b \in F$. The vectorspace of a Lie algebra is the tangent space of the associated Lie group at the identity element of the group, and it completely captures the local structure of the group.
Rotations
The Lie algebra associated with SO(3) is given by

vectorspace: $\mathfrak{so}(3) = \left\{\boldsymbol{\Phi} = \boldsymbol{\phi}^\wedge \in \mathbb{R}^{3\times3} \mid \boldsymbol{\phi} \in \mathbb{R}^3\right\}$,
field: $\mathbb{R}$,
Lie bracket: $[\boldsymbol{\Phi}_1, \boldsymbol{\Phi}_2] = \boldsymbol{\Phi}_1\boldsymbol{\Phi}_2 - \boldsymbol{\Phi}_2\boldsymbol{\Phi}_1$,

where
$$\boldsymbol{\phi}^\wedge = \begin{bmatrix}\phi_1\\ \phi_2\\ \phi_3\end{bmatrix}^\wedge = \begin{bmatrix}0 & -\phi_3 & \phi_2\\ \phi_3 & 0 & -\phi_1\\ -\phi_2 & \phi_1 & 0\end{bmatrix} \in \mathbb{R}^{3\times3}, \quad \boldsymbol{\phi} \in \mathbb{R}^3. \tag{7.10}$$
We already saw this linear, skew-symmetric operator in the previous chapter during our discussion of cross products and rotations. Later, we will also make use of the inverse of this operator, denoted $(\cdot)^\vee$, so that
$$\boldsymbol{\Phi} = \boldsymbol{\phi}^\wedge \;\Rightarrow\; \boldsymbol{\phi} = \boldsymbol{\Phi}^\vee. \tag{7.11}$$
We will omit proving that $\mathfrak{so}(3)$ is a vectorspace, but will briefly show that the four Lie bracket properties hold. Let $\boldsymbol{\Phi}, \boldsymbol{\Phi}_1 = \boldsymbol{\phi}_1^\wedge, \boldsymbol{\Phi}_2 = \boldsymbol{\phi}_2^\wedge \in \mathfrak{so}(3)$. Then for the closure property we have
$$[\boldsymbol{\Phi}_1, \boldsymbol{\Phi}_2] = \boldsymbol{\Phi}_1\boldsymbol{\Phi}_2 - \boldsymbol{\Phi}_2\boldsymbol{\Phi}_1 = \boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_2^\wedge - \boldsymbol{\phi}_2^\wedge\boldsymbol{\phi}_1^\wedge = \bigl(\underbrace{\boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_2}_{\in\mathbb{R}^3}\bigr)^\wedge \in \mathfrak{so}(3). \tag{7.12}$$
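The closure identity (7.12) rests on the fact that $\boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_2^\wedge - \boldsymbol{\phi}_2^\wedge\boldsymbol{\phi}_1^\wedge = (\boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_2)^\wedge$, which can be spot-checked numerically; a small sketch (helper names are ours):

```python
def skew(v):
    # the wedge operator of (7.10)
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def matvec(M, v):
    return [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]

p1, p2 = [0.1, -0.4, 0.7], [0.5, 0.2, -0.3]
P1, P2 = skew(p1), skew(p2)
# left-hand side: the Lie bracket [phi1^, phi2^]
lhs = [[matmul(P1, P2)[i][j] - matmul(P2, P1)[i][j] for j in range(3)] for i in range(3)]
# right-hand side: (phi1^ phi2)^
rhs = skew(matvec(P1, p2))
print(max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3)) < 1e-12)  # -> True
```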

Bilinearity follows directly from the fact that $(\cdot)^\wedge$ is a linear operator. The alternating property can be seen easily through
$$[\boldsymbol{\Phi}, \boldsymbol{\Phi}] = \boldsymbol{\Phi}\boldsymbol{\Phi} - \boldsymbol{\Phi}\boldsymbol{\Phi} = \mathbf{0} \in \mathfrak{so}(3). \tag{7.13}$$
Finally, the Jacobi identity can be verified by substituting and applying the definition of the Lie bracket. Informally, we will refer to $\mathfrak{so}(3)$ as the Lie algebra, although technically this is only the associated vectorspace.

[Margin note: Carl Gustav Jacob Jacobi (1804-1851) was a German mathematician who made fundamental contributions to elliptic functions, dynamics, differential equations, and number theory.]

Poses
The Lie algebra associated with SE(3) is given by
vectorspace: $\mathfrak{se}(3) = \left\{\boldsymbol{\Xi} = \boldsymbol{\xi}^\wedge \in \mathbb{R}^{4\times4} \mid \boldsymbol{\xi} \in \mathbb{R}^6\right\}$,
field: $\mathbb{R}$,
Lie bracket: $[\boldsymbol{\Xi}_1, \boldsymbol{\Xi}_2] = \boldsymbol{\Xi}_1\boldsymbol{\Xi}_2 - \boldsymbol{\Xi}_2\boldsymbol{\Xi}_1$,

where
$$\boldsymbol{\xi}^\wedge = \begin{bmatrix}\boldsymbol{\rho}\\ \boldsymbol{\phi}\end{bmatrix}^\wedge = \begin{bmatrix}\boldsymbol{\phi}^\wedge & \boldsymbol{\rho}\\ \mathbf{0}^T & 0\end{bmatrix} \in \mathbb{R}^{4\times4}, \quad \boldsymbol{\rho}, \boldsymbol{\phi} \in \mathbb{R}^3. \tag{7.14}$$
This is an overloading of the $(\cdot)^\wedge$ operator (Murray et al., 1994) from before to take elements of $\mathbb{R}^6$ and turn them into elements of $\mathbb{R}^{4\times4}$; it is still linear. We will also make use of the inverse of this operator, denoted $(\cdot)^\vee$, so that
$$\boldsymbol{\Xi} = \boldsymbol{\xi}^\wedge \;\Rightarrow\; \boldsymbol{\xi} = \boldsymbol{\Xi}^\vee. \tag{7.15}$$
Again, we will omit showing that $\mathfrak{se}(3)$ is a vectorspace, but will briefly show that the four Lie bracket properties hold. Let $\boldsymbol{\Xi}, \boldsymbol{\Xi}_1 = \boldsymbol{\xi}_1^\wedge, \boldsymbol{\Xi}_2 = \boldsymbol{\xi}_2^\wedge \in \mathfrak{se}(3)$. Then for the closure property we have
$$[\boldsymbol{\Xi}_1, \boldsymbol{\Xi}_2] = \boldsymbol{\Xi}_1\boldsymbol{\Xi}_2 - \boldsymbol{\Xi}_2\boldsymbol{\Xi}_1 = \boldsymbol{\xi}_1^\wedge\boldsymbol{\xi}_2^\wedge - \boldsymbol{\xi}_2^\wedge\boldsymbol{\xi}_1^\wedge = \bigl(\underbrace{\boldsymbol{\xi}_1^\curlywedge\boldsymbol{\xi}_2}_{\in\mathbb{R}^6}\bigr)^\wedge \in \mathfrak{se}(3), \tag{7.16}$$
where
$$\boldsymbol{\xi}^\curlywedge = \begin{bmatrix}\boldsymbol{\rho}\\ \boldsymbol{\phi}\end{bmatrix}^\curlywedge = \begin{bmatrix}\boldsymbol{\phi}^\wedge & \boldsymbol{\rho}^\wedge\\ \mathbf{0} & \boldsymbol{\phi}^\wedge\end{bmatrix} \in \mathbb{R}^{6\times6}, \quad \boldsymbol{\rho}, \boldsymbol{\phi} \in \mathbb{R}^3. \tag{7.17}$$
Bilinearity follows directly from the fact that $(\cdot)^\wedge$ is a linear operator. The alternating property can be seen easily through
$$[\boldsymbol{\Xi}, \boldsymbol{\Xi}] = \boldsymbol{\Xi}\boldsymbol{\Xi} - \boldsymbol{\Xi}\boldsymbol{\Xi} = \mathbf{0} \in \mathfrak{se}(3). \tag{7.18}$$
Finally, the Jacobi identity can be verified by substituting and applying the definition of the Lie bracket. Again, we will refer to $\mathfrak{se}(3)$ as the Lie algebra, although technically this is only the associated vectorspace.
In the next section, we will make clear the relationships between our matrix Lie groups and their associated Lie algebras:
$$SO(3) \leftrightarrow \mathfrak{so}(3), \qquad SE(3) \leftrightarrow \mathfrak{se}(3).$$
For this we require the exponential map.
7.1.3 Exponential Map

It turns out that the exponential map is the key to relating a matrix Lie group to its associated Lie algebra. The matrix exponential is given by
$$\exp(\mathbf{A}) = \mathbf{1} + \mathbf{A} + \frac{1}{2!}\mathbf{A}^2 + \frac{1}{3!}\mathbf{A}^3 + \cdots = \sum_{n=0}^{\infty}\frac{1}{n!}\mathbf{A}^n, \tag{7.19}$$
where $\mathbf{A} \in \mathbb{R}^{M\times M}$ is a square matrix. There is also a matrix logarithm:
$$\ln(\mathbf{A}) = \sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{n}\left(\mathbf{A} - \mathbf{1}\right)^n. \tag{7.20}$$
Rotations
For rotations, we can relate elements of SO(3) to elements of $\mathfrak{so}(3)$ through the exponential map:
$$\mathbf{C} = \exp\left(\boldsymbol{\phi}^\wedge\right) = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\boldsymbol{\phi}^\wedge\right)^n, \tag{7.21}$$
where $\mathbf{C} \in SO(3)$ and $\boldsymbol{\phi} \in \mathbb{R}^3$ (and hence $\boldsymbol{\phi}^\wedge \in \mathfrak{so}(3)$). We can also go in the other direction (but not uniquely) using
$$\boldsymbol{\phi} = \ln(\mathbf{C})^\vee. \tag{7.22}$$
Mathematically, the exponential map from $\mathfrak{so}(3)$ to SO(3) is surjective (or onto). This means that multiple elements of $\mathfrak{so}(3)$ may map to a single element of SO(3)⁸, and it also means that we can generate every element of SO(3) from at least one element of $\mathfrak{so}(3)$.
It is useful to examine the surjective property a little deeper. We begin by working out the forwards (exponential) mapping in closed form to go from a $\boldsymbol{\phi} \in \mathbb{R}^3$ to a $\mathbf{C} \in SO(3)$. Let $\boldsymbol{\phi} = \phi\,\mathbf{a}$, where $\phi = |\boldsymbol{\phi}|$ is the angle of rotation and $\mathbf{a} = \boldsymbol{\phi}/\phi$ is the unit-length axis of rotation. For the matrix exponential, we then have
$$\begin{aligned}
\exp\left(\boldsymbol{\phi}^\wedge\right) &= \exp\left(\phi\,\mathbf{a}^\wedge\right) = \underbrace{\mathbf{1}}_{\mathbf{a}\mathbf{a}^T - \mathbf{a}^\wedge\mathbf{a}^\wedge} + \phi\,\mathbf{a}^\wedge + \frac{1}{2!}\phi^2\,\mathbf{a}^\wedge\mathbf{a}^\wedge + \frac{1}{3!}\phi^3\underbrace{\mathbf{a}^\wedge\mathbf{a}^\wedge\mathbf{a}^\wedge}_{-\mathbf{a}^\wedge} + \frac{1}{4!}\phi^4\underbrace{\mathbf{a}^\wedge\mathbf{a}^\wedge\mathbf{a}^\wedge\mathbf{a}^\wedge}_{-\mathbf{a}^\wedge\mathbf{a}^\wedge} - \cdots\\
&= \mathbf{a}\mathbf{a}^T + \underbrace{\left(\phi - \frac{1}{3!}\phi^3 + \frac{1}{5!}\phi^5 - \cdots\right)}_{\sin\phi}\mathbf{a}^\wedge - \underbrace{\left(1 - \frac{1}{2!}\phi^2 + \frac{1}{4!}\phi^4 - \cdots\right)}_{\cos\phi}\underbrace{\mathbf{a}^\wedge\mathbf{a}^\wedge}_{-\mathbf{1}+\mathbf{a}\mathbf{a}^T}\\
&= \underbrace{\cos\phi\,\mathbf{1} + (1-\cos\phi)\,\mathbf{a}\mathbf{a}^T + \sin\phi\,\mathbf{a}^\wedge}_{\mathbf{C}},
\end{aligned} \tag{7.23}$$

8 This many-to-one mapping property is related to the concept of singularities in rotation parameterizations. We know that every three-parameter representation of a rotation has singularities. Singularities imply that, given a $\mathbf{C}$, we cannot uniquely find a single $\boldsymbol{\phi} \in \mathbb{R}^3$ that generated it; there are an infinite number of them.
which we see is the canonical axis-angle form of a rotation matrix presented earlier. We have used the useful identities (for unit-length $\mathbf{a}$),
$$\mathbf{a}^\wedge\mathbf{a}^\wedge \equiv -\mathbf{1} + \mathbf{a}\mathbf{a}^T, \tag{7.24a}$$
$$\mathbf{a}^\wedge\mathbf{a}^\wedge\mathbf{a}^\wedge \equiv -\mathbf{a}^\wedge, \tag{7.24b}$$
the proofs of which are left to the reader. This shows that every $\boldsymbol{\phi} \in \mathbb{R}^3$ will generate a valid $\mathbf{C} \in SO(3)$. It also shows that if we add a multiple of $2\pi$ to the angle of rotation, we will generate the same $\mathbf{C}$. In detail, we have
$$\mathbf{C} = \exp\left((\phi + 2\pi m)\,\mathbf{a}^\wedge\right), \tag{7.25}$$
with $m$ any positive integer, since $\cos(\phi + 2\pi m) = \cos\phi$ and $\sin(\phi + 2\pi m) = \sin\phi$. If we limit the angle of rotation of the input, $|\phi| < \pi$, then each $\mathbf{C}$ can only be generated by one $\boldsymbol{\phi}$.
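The closed-form exponential map (7.23) is straightforward to implement. Below is a minimal pure-Python sketch (function names are ours; the small-angle branch avoids dividing by a near-zero angle). It also confirms the $2\pi$-periodicity noted in (7.25).

```python
import math

def skew(v):
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def exp_so3(phi):
    """Closed-form exponential map (7.23): rotation vector phi -> rotation matrix C."""
    angle = math.sqrt(sum(x * x for x in phi))
    if angle < 1e-12:
        # near zero rotation, C ~ 1 + phi^
        A = skew(phi)
        return [[(1.0 if i == j else 0.0) + A[i][j] for j in range(3)] for i in range(3)]
    a = [x / angle for x in phi]  # unit-length axis
    c, s, A = math.cos(angle), math.sin(angle), skew(a)
    return [[c * (1.0 if i == j else 0.0) + (1.0 - c) * a[i] * a[j] + s * A[i][j]
             for j in range(3)] for i in range(3)]

# a rotation of pi/2 about the z-axis takes the x-axis to the y-axis
C = exp_so3([0.0, 0.0, math.pi / 2])
x = [sum(C[i][k] * [1.0, 0.0, 0.0][k] for k in range(3)) for i in range(3)]
print([round(v, 10) for v in x])  # -> [0.0, 1.0, 0.0]

# adding 2*pi to the angle gives the same rotation, as in (7.25)
C2 = exp_so3([0.0, 0.0, math.pi / 2 + 2.0 * math.pi])
print(max(abs(C[i][j] - C2[i][j]) for i in range(3) for j in range(3)) < 1e-9)  # -> True
```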
Additionally, we would like to show that every $\mathbf{C} \in SO(3)$ can be generated by some $\boldsymbol{\phi} \in \mathbb{R}^3$, and for that we need the inverse (logarithmic) mapping: $\boldsymbol{\phi} = \ln(\mathbf{C})^\vee$. We can also work this out in closed form. Since a rotation matrix applied to its own axis does not alter the axis,
$$\mathbf{C}\mathbf{a} = \mathbf{a}, \tag{7.26}$$
this implies that $\mathbf{a}$ is a (unit-length) eigenvector of $\mathbf{C}$ corresponding to an eigenvalue of 1. Thus, by solving the eigenproblem associated with $\mathbf{C}$, we can find $\mathbf{a}$⁹. The angle can be found by exploiting the trace (sum of the diagonal elements) of a rotation matrix:
$$\mathrm{tr}(\mathbf{C}) = \mathrm{tr}\left(\cos\phi\,\mathbf{1} + (1-\cos\phi)\,\mathbf{a}\mathbf{a}^T + \sin\phi\,\mathbf{a}^\wedge\right) = \cos\phi\underbrace{\mathrm{tr}(\mathbf{1})}_{3} + (1-\cos\phi)\underbrace{\mathrm{tr}\left(\mathbf{a}\mathbf{a}^T\right)}_{\mathbf{a}^T\mathbf{a}=1} + \sin\phi\underbrace{\mathrm{tr}\left(\mathbf{a}^\wedge\right)}_{0} = 2\cos\phi + 1. \tag{7.27}$$
Solving, we have
$$\phi = \cos^{-1}\left(\frac{\mathrm{tr}(\mathbf{C}) - 1}{2}\right) + 2\pi m, \tag{7.28}$$
which indicates there are many solutions for $\phi$. By convention, we will pick the one such that $|\phi| < \pi$. To complete the process, we combine $\mathbf{a}$ and $\phi$ according to $\boldsymbol{\phi} = \phi\,\mathbf{a}$. This shows that every $\mathbf{C} \in SO(3)$ can be built from at least one $\boldsymbol{\phi} \in \mathbb{R}^3$.
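The inverse mapping can likewise be sketched from (7.27) and (7.28). For brevity, the sketch below recovers the axis from the skew-symmetric part of $\mathbf{C}$ rather than solving the full eigenproblem, which is valid away from the $\phi = \pi$ case; function names are ours, and the exponential map is re-implemented so the example is self-contained.

```python
import math

def skew(v):
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def exp_so3(phi):
    # closed-form exponential (7.23), used here to generate a test rotation
    angle = math.sqrt(sum(x * x for x in phi))
    a = [x / angle for x in phi] if angle > 0.0 else [0.0, 0.0, 0.0]
    c, s, A = math.cos(angle), math.sin(angle), skew(a)
    return [[c * (1.0 if i == j else 0.0) + (1.0 - c) * a[i] * a[j] + s * A[i][j]
             for j in range(3)] for i in range(3)]

def log_so3(C):
    """Logarithmic map via (7.27)-(7.28): returns phi with |phi| < pi.
    Assumes the rotation is not by exactly pi (axis sign is ambiguous there)."""
    angle = math.acos(max(-1.0, min(1.0, 0.5 * (C[0][0] + C[1][1] + C[2][2] - 1.0))))
    if angle < 1e-12:
        return [0.0, 0.0, 0.0]
    # axis from the skew-symmetric part: C - C^T = 2 sin(angle) a^
    s2 = 2.0 * math.sin(angle)
    a = [(C[2][1] - C[1][2]) / s2,
         (C[0][2] - C[2][0]) / s2,
         (C[1][0] - C[0][1]) / s2]
    return [angle * x for x in a]

# round trip: exp followed by log recovers the rotation vector
phi = [0.3, -0.2, 0.5]
phi_rt = log_so3(exp_so3(phi))
print(max(abs(phi[i] - phi_rt[i]) for i in range(3)) < 1e-9)  # -> True
```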
Figure 7.1 provides a simple example of the relationship between the Lie group and Lie algebra for the case of rotation constrained to the plane. We see that in a neighbourhood near the zero-rotation point, $\theta_{vi} = 0$, the Lie algebra vectorspace is just a line that is tangent to the circle of rotation. We see that indeed, near zero rotation, the Lie algebra captures the local structure of the Lie group. It should be pointed out that this example is constrained to the plane (i.e., a single rotational degree of freedom), but in general the dimension of the Lie algebra vectorspace is three. Put another way, the line in the figure is a one-dimensional subspace of the full three-dimensional Lie algebra vectorspace.

9 There are some subtleties that occur when there is more than one eigenvalue equal to 1, e.g., $\mathbf{C} = \mathbf{1}$, whereupon $\mathbf{a}$ is not unique and can be any unit vector.

[Figure 7.1: Example of the relationship between the Lie group and Lie algebra for the case of rotation constrained to the plane. In a small neighbourhood around the $\theta_{vi} = 0$ point, the vectorspace associated with the Lie algebra is a line tangent to the circle.]
Connecting rotation matrices with the exponential map makes it easy to show that $\det(\mathbf{C}) = 1$ using Jacobi's formula, which for a general square complex matrix, $\mathbf{A}$, says
$$\det\left(\exp(\mathbf{A})\right) = \exp\left(\mathrm{tr}(\mathbf{A})\right). \tag{7.29}$$
In the case of rotations, we have
$$\det(\mathbf{C}) = \det\left(\exp\left(\boldsymbol{\phi}^\wedge\right)\right) = \exp\left(\mathrm{tr}\left(\boldsymbol{\phi}^\wedge\right)\right) = \exp(0) = 1, \tag{7.30}$$
since $\boldsymbol{\phi}^\wedge$ is skew-symmetric and therefore has zeros on its diagonal, making its trace zero.
Poses
For poses, we can relate elements of SE(3) to elements of $\mathfrak{se}(3)$, again through the exponential map:
$$\mathbf{T} = \exp\left(\boldsymbol{\xi}^\wedge\right) = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\boldsymbol{\xi}^\wedge\right)^n, \tag{7.31}$$
where $\mathbf{T} \in SE(3)$ and $\boldsymbol{\xi} \in \mathbb{R}^6$ (and hence $\boldsymbol{\xi}^\wedge \in \mathfrak{se}(3)$). We can also go in the other direction¹⁰ using
$$\boldsymbol{\xi} = \ln(\mathbf{T})^\vee. \tag{7.32}$$
The exponential map from $\mathfrak{se}(3)$ to SE(3) is also surjective: every $\boldsymbol{\xi} \in \mathbb{R}^6$ maps to some $\mathbf{T} \in SE(3)$ (many-to-one) and every $\mathbf{T} \in SE(3)$ can be generated by at least one $\boldsymbol{\xi} \in \mathbb{R}^6$.

10 Again, not uniquely.
To show the surjective property of the exponential map, we first examine the forwards direction. Starting with $\boldsymbol{\xi} = \begin{bmatrix}\boldsymbol{\rho}\\ \boldsymbol{\phi}\end{bmatrix} \in \mathbb{R}^6$, we have
$$\exp\left(\boldsymbol{\xi}^\wedge\right) = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\boldsymbol{\xi}^\wedge\right)^n = \sum_{n=0}^{\infty}\frac{1}{n!}\begin{bmatrix}\boldsymbol{\phi}^\wedge & \boldsymbol{\rho}\\ \mathbf{0}^T & 0\end{bmatrix}^n = \begin{bmatrix}\displaystyle\sum_{n=0}^{\infty}\frac{1}{n!}\left(\boldsymbol{\phi}^\wedge\right)^n & \displaystyle\left(\sum_{n=0}^{\infty}\frac{1}{(n+1)!}\left(\boldsymbol{\phi}^\wedge\right)^n\right)\boldsymbol{\rho}\\ \mathbf{0}^T & 1\end{bmatrix} = \underbrace{\begin{bmatrix}\mathbf{C} & \mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix}}_{\mathbf{T}} \in SE(3), \tag{7.33}$$
where
$$\mathbf{r} = \mathbf{J}\boldsymbol{\rho} \in \mathbb{R}^3, \qquad \mathbf{J} = \sum_{n=0}^{\infty}\frac{1}{(n+1)!}\left(\boldsymbol{\phi}^\wedge\right)^n. \tag{7.34}$$
This shows that every $\boldsymbol{\xi} \in \mathbb{R}^6$ will generate a valid $\mathbf{T} \in SE(3)$. We will discuss the matrix, $\mathbf{J}$, in greater detail just below. Figure 7.2 provides a visualization of how each of the six components of $\boldsymbol{\xi}$ can be varied

[Figure 7.2: By varying each of the components of $\boldsymbol{\xi}$, constructing $\mathbf{T} = \exp(\boldsymbol{\xi}^\wedge)$, and then using this to transform the points comprising the corners of a rectangular prism, we see that the prism's pose can be translated and rotated. Combining these basic movements can result in any arbitrary pose change of the prism.]

to alter the pose of a rectangular prism. By combining these basic translations and rotations, an arbitrary pose change can be achieved.
Next we would like to go in the inverse direction. Starting with $\mathbf{T} = \begin{bmatrix}\mathbf{C} & \mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix}$, we want to show this can be generated by some $\boldsymbol{\xi} = \begin{bmatrix}\boldsymbol{\rho}\\ \boldsymbol{\phi}\end{bmatrix} \in \mathbb{R}^6$; we need the inverse mapping, $\boldsymbol{\xi} = \ln(\mathbf{T})^\vee$. We have already seen how to go from $\mathbf{C} \in SO(3)$ to $\boldsymbol{\phi} \in \mathbb{R}^3$ in the last section. Next, we can compute
$$\boldsymbol{\rho} = \mathbf{J}^{-1}\mathbf{r}, \tag{7.35}$$
where $\mathbf{J}$ is built from $\boldsymbol{\phi}$ (already computed). Finally, we assemble $\boldsymbol{\xi} \in \mathbb{R}^6$ from $\boldsymbol{\rho}, \boldsymbol{\phi} \in \mathbb{R}^3$. This shows that every $\mathbf{T} \in SE(3)$ can be generated by at least one $\boldsymbol{\xi} \in \mathbb{R}^6$.
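Putting the pieces together, the SE(3) exponential can be sketched from the closed-form $\mathbf{C}$ and the closed-form $\mathbf{J}$ given in (7.37a) below. This is a pure-Python sketch with our own function names and a first-order approximation of $\mathbf{J}$ near $\phi = 0$; it checks the pure-translation case and the inverse formula (7.9).

```python
import math

def skew(v):
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def exp_so3(phi):
    # closed-form exponential (7.23)
    angle = math.sqrt(sum(x * x for x in phi))
    a = [x / angle for x in phi] if angle > 0.0 else [0.0, 0.0, 0.0]
    c, s, A = math.cos(angle), math.sin(angle), skew(a)
    return [[c * (1.0 if i == j else 0.0) + (1.0 - c) * a[i] * a[j] + s * A[i][j]
             for j in range(3)] for i in range(3)]

def left_jacobian(phi):
    # closed-form J from (7.37a); series limit J -> 1 as phi -> 0
    angle = math.sqrt(sum(x * x for x in phi))
    if angle < 1e-9:
        A = skew(phi)
        return [[(1.0 if i == j else 0.0) + 0.5 * A[i][j] for j in range(3)] for i in range(3)]
    a = [x / angle for x in phi]
    s, c, A = math.sin(angle) / angle, (1.0 - math.cos(angle)) / angle, skew(a)
    return [[s * (1.0 if i == j else 0.0) + (1.0 - s) * a[i] * a[j] + c * A[i][j]
             for j in range(3)] for i in range(3)]

def exp_se3(xi):
    """T = exp(xi^) via (7.33): rotation block C and translation r = J rho."""
    rho, phi = xi[:3], xi[3:]
    C, J = exp_so3(phi), left_jacobian(phi)
    r = [sum(J[i][k] * rho[k] for k in range(3)) for i in range(3)]
    return [C[i] + [r[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

# pure translation: phi = 0 gives J = 1, so r = rho
T = exp_se3([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
print([row[3] for row in T[:3]])  # -> [1.0, 2.0, 3.0]

# general case: check T against its inverse from (7.9), i.e., T T^{-1} = 1
T = exp_se3([0.5, -0.1, 0.2, 0.3, -0.4, 0.6])
C = [row[:3] for row in T[:3]]
r = [row[3] for row in T[:3]]
Ct = [[C[j][i] for j in range(3)] for i in range(3)]
t = [-sum(Ct[i][k] * r[k] for k in range(3)) for i in range(3)]
Tinv = [Ct[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]
P = [[sum(T[i][k] * Tinv[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
print(max(abs(P[i][j] - (1.0 if i == j else 0.0))
          for i in range(4) for j in range(4)) < 1e-12)  # -> True
```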

Jacobian
The matrix, $\mathbf{J}$, described just above plays an important role in allowing us to convert the translation component of pose in $\mathfrak{se}(3)$ into the translation component of pose in SE(3) through $\mathbf{r} = \mathbf{J}\boldsymbol{\rho}$. This quantity appears in other situations as well when dealing with our matrix Lie groups, and we will learn later on in this chapter that it is called the (left) Jacobian of SO(3). In this section, we will derive some alternate forms of this matrix that are sometimes useful.
We have defined $\mathbf{J}$ as
$$\mathbf{J} = \sum_{n=0}^{\infty}\frac{1}{(n+1)!}\left(\boldsymbol{\phi}^\wedge\right)^n. \tag{7.36}$$
By expanding this series and manipulating, we can show the following closed-form expressions for $\mathbf{J}$ and its inverse:
$$\mathbf{J} = \frac{\sin\phi}{\phi}\mathbf{1} + \left(1 - \frac{\sin\phi}{\phi}\right)\mathbf{a}\mathbf{a}^T + \frac{1-\cos\phi}{\phi}\mathbf{a}^\wedge, \tag{7.37a}$$
$$\mathbf{J}^{-1} = \frac{\phi}{2}\cot\frac{\phi}{2}\,\mathbf{1} + \left(1 - \frac{\phi}{2}\cot\frac{\phi}{2}\right)\mathbf{a}\mathbf{a}^T - \frac{\phi}{2}\mathbf{a}^\wedge, \tag{7.37b}$$
where $\phi = |\boldsymbol{\phi}|$ is the angle of rotation and $\mathbf{a} = \boldsymbol{\phi}/\phi$ is the unit-length axis of rotation. Due to the nature of the $\cot(\phi/2)$ function, there are singularities associated with $\mathbf{J}$ (i.e., the inverse does not exist) at $\phi = 2\pi m$ with $m$ a non-zero integer.
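The closed-form expressions (7.37a) and (7.37b) can be verified against each other numerically: away from the singularities, their product should be the identity. A sketch (function names are ours):

```python
import math

def skew(v):
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def axis_angle(phi):
    angle = math.sqrt(sum(x * x for x in phi))
    return angle, [x / angle for x in phi]

def left_jacobian(phi):
    # closed form (7.37a)
    angle, a = axis_angle(phi)
    s, c, A = math.sin(angle) / angle, (1.0 - math.cos(angle)) / angle, skew(a)
    return [[s * (1.0 if i == j else 0.0) + (1.0 - s) * a[i] * a[j] + c * A[i][j]
             for j in range(3)] for i in range(3)]

def left_jacobian_inv(phi):
    # closed form (7.37b); cot(x) computed as 1/tan(x)
    angle, a = axis_angle(phi)
    t = 0.5 * angle / math.tan(0.5 * angle)
    A = skew(a)
    return [[t * (1.0 if i == j else 0.0) + (1.0 - t) * a[i] * a[j] - 0.5 * angle * A[i][j]
             for j in range(3)] for i in range(3)]

phi = [0.7, -0.3, 0.2]
J, Jinv = left_jacobian(phi), left_jacobian_inv(phi)
P = [[sum(J[i][k] * Jinv[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print(max(abs(P[i][j] - (1.0 if i == j else 0.0))
          for i in range(3) for j in range(3)) < 1e-12)  # -> True
```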
Occasionally, we will come across the matrix $\mathbf{J}\mathbf{J}^T$ and its inverse. Starting with (7.37a), we can manipulate to show
$$\mathbf{J}\mathbf{J}^T = \gamma\mathbf{1} + (1-\gamma)\,\mathbf{a}\mathbf{a}^T, \qquad \left(\mathbf{J}\mathbf{J}^T\right)^{-1} = \frac{1}{\gamma}\mathbf{1} + \left(1 - \frac{1}{\gamma}\right)\mathbf{a}\mathbf{a}^T, \qquad \gamma = 2\,\frac{1-\cos\phi}{\phi^2}. \tag{7.38}$$
It turns out $\mathbf{J}\mathbf{J}^T$ is positive-definite. There are two cases to consider, $\phi = 0$ and $\phi \neq 0$. For $\phi = 0$, we have $\mathbf{J}\mathbf{J}^T = \mathbf{1}$, which is positive-definite. For $\phi \neq 0$, we have for $\mathbf{x} \neq \mathbf{0}$ that
$$\mathbf{x}^T\mathbf{J}\mathbf{J}^T\mathbf{x} = \mathbf{x}^T\left(\gamma\mathbf{1} + (1-\gamma)\,\mathbf{a}\mathbf{a}^T\right)\mathbf{x} = \mathbf{x}^T\left(\mathbf{a}\mathbf{a}^T - \gamma\,\mathbf{a}^\wedge\mathbf{a}^\wedge\right)\mathbf{x} = \mathbf{x}^T\mathbf{a}\mathbf{a}^T\mathbf{x} + \gamma\left(\mathbf{a}^\wedge\mathbf{x}\right)^T\left(\mathbf{a}^\wedge\mathbf{x}\right) = \underbrace{\left(\mathbf{a}^T\mathbf{x}\right)^T\left(\mathbf{a}^T\mathbf{x}\right)}_{\geq 0} + \underbrace{2\,\frac{1-\cos\phi}{\phi^2}}_{>0}\underbrace{\left(\mathbf{a}^\wedge\mathbf{x}\right)^T\left(\mathbf{a}^\wedge\mathbf{x}\right)}_{\geq 0} > 0, \tag{7.39}$$
since the first term is only zero when $\mathbf{a}$ and $\mathbf{x}$ are perpendicular and the second term is only zero when $\mathbf{x}$ and $\mathbf{a}$ are parallel (and these cannot happen at the same time). This shows $\mathbf{J}\mathbf{J}^T$ is positive-definite.
It turns out that we can also write $\mathbf{J}$ in terms of the rotation matrix, $\mathbf{C}$, associated with $\boldsymbol{\phi}$, in the following way:
$$\mathbf{J} = \int_0^1 \mathbf{C}^\alpha\, d\alpha. \tag{7.40}$$
This can be seen through the following sequence of manipulations:
$$\int_0^1 \mathbf{C}^\alpha\, d\alpha = \int_0^1 \exp\left(\boldsymbol{\phi}^\wedge\right)^\alpha d\alpha = \int_0^1 \exp\left(\alpha\,\boldsymbol{\phi}^\wedge\right) d\alpha = \int_0^1 \sum_{n=0}^{\infty}\frac{1}{n!}\alpha^n\left(\boldsymbol{\phi}^\wedge\right)^n d\alpha = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\int_0^1 \alpha^n\, d\alpha\right)\left(\boldsymbol{\phi}^\wedge\right)^n = \sum_{n=0}^{\infty}\frac{1}{(n+1)!}\left(\boldsymbol{\phi}^\wedge\right)^n, \tag{7.41}$$
which is the original series form of $\mathbf{J}$ defined above.
Finally, we can also relate $\mathbf{J}$ and $\mathbf{C}$ through
$$\mathbf{C} = \mathbf{1} + \boldsymbol{\phi}^\wedge\mathbf{J}, \tag{7.42}$$
but it is not possible to solve for $\mathbf{J}$ in this expression, since $\boldsymbol{\phi}^\wedge$ is not invertible.

7.1.4 Adjoints

There is a $6\times6$ transformation matrix, $\boldsymbol{\mathcal{T}}$, that can be constructed directly from the components of the $4\times4$ transformation matrix. We call this the adjoint of an element of SE(3):
$$\boldsymbol{\mathcal{T}} = \mathrm{Ad}(\mathbf{T}) = \mathrm{Ad}\left(\begin{bmatrix}\mathbf{C} & \mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix}\right) = \begin{bmatrix}\mathbf{C} & \mathbf{r}^\wedge\mathbf{C}\\ \mathbf{0} & \mathbf{C}\end{bmatrix}. \tag{7.43}$$
We will abuse notation a bit and say that the set of adjoints of all the elements of SE(3) is denoted
$$\mathrm{Ad}(SE(3)) = \left\{\boldsymbol{\mathcal{T}} = \mathrm{Ad}(\mathbf{T}) \mid \mathbf{T} \in SE(3)\right\}. \tag{7.44}$$
It turns out that Ad(SE(3)) is also a matrix Lie group, which we show next.
For closure, we let $\boldsymbol{\mathcal{T}}_1 = \mathrm{Ad}(\mathbf{T}_1), \boldsymbol{\mathcal{T}}_2 = \mathrm{Ad}(\mathbf{T}_2) \in \mathrm{Ad}(SE(3))$, and then
$$\boldsymbol{\mathcal{T}}_1\boldsymbol{\mathcal{T}}_2 = \begin{bmatrix}\mathbf{C}_1 & \mathbf{r}_1^\wedge\mathbf{C}_1\\ \mathbf{0} & \mathbf{C}_1\end{bmatrix}\begin{bmatrix}\mathbf{C}_2 & \mathbf{r}_2^\wedge\mathbf{C}_2\\ \mathbf{0} & \mathbf{C}_2\end{bmatrix} = \begin{bmatrix}\mathbf{C}_1\mathbf{C}_2 & \mathbf{C}_1\mathbf{r}_2^\wedge\mathbf{C}_2 + \mathbf{r}_1^\wedge\mathbf{C}_1\mathbf{C}_2\\ \mathbf{0} & \mathbf{C}_1\mathbf{C}_2\end{bmatrix} = \begin{bmatrix}\mathbf{C}_1\mathbf{C}_2 & \left(\mathbf{C}_1\mathbf{r}_2 + \mathbf{r}_1\right)^\wedge\mathbf{C}_1\mathbf{C}_2\\ \mathbf{0} & \mathbf{C}_1\mathbf{C}_2\end{bmatrix} = \mathrm{Ad}\left(\begin{bmatrix}\mathbf{C}_1\mathbf{C}_2 & \mathbf{C}_1\mathbf{r}_2 + \mathbf{r}_1\\ \mathbf{0}^T & 1\end{bmatrix}\right) \in \mathrm{Ad}(SE(3)), \tag{7.45}$$
where we have used the nice property that
$$\mathbf{C}\mathbf{v}^\wedge\mathbf{C}^T = \left(\mathbf{C}\mathbf{v}\right)^\wedge, \tag{7.46}$$
for any $\mathbf{C} \in SO(3)$ and $\mathbf{v} \in \mathbb{R}^3$. Associativity follows from basic properties of matrix multiplication, and the identity element of the group is the $6\times6$ identity matrix. For invertibility, we let $\boldsymbol{\mathcal{T}} = \mathrm{Ad}(\mathbf{T}) \in \mathrm{Ad}(SE(3))$, and then we have
$$\boldsymbol{\mathcal{T}}^{-1} = \mathrm{Ad}(\mathbf{T})^{-1} = \begin{bmatrix}\mathbf{C} & \mathbf{r}^\wedge\mathbf{C}\\ \mathbf{0} & \mathbf{C}\end{bmatrix}^{-1} = \begin{bmatrix}\mathbf{C}^T & \left(-\mathbf{C}^T\mathbf{r}\right)^\wedge\mathbf{C}^T\\ \mathbf{0} & \mathbf{C}^T\end{bmatrix} = \mathrm{Ad}\left(\begin{bmatrix}\mathbf{C}^T & -\mathbf{C}^T\mathbf{r}\\ \mathbf{0}^T & 1\end{bmatrix}\right) = \mathrm{Ad}\left(\mathbf{T}^{-1}\right) \in \mathrm{Ad}(SE(3)). \tag{7.47}$$
Other than smoothness, these four properties show that Ad(SE(3)) is a matrix Lie group.
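The closure property (7.45) can be spot-checked numerically by building $\mathrm{Ad}(\cdot)$ from (7.43). A pure-Python sketch (helper names are ours):

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def skew(v):
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def adjoint(C, r):
    # Ad(T) from (7.43): [[C, r^ C], [0, C]]
    rC = matmul(skew(r), C)
    return ([C[i] + rC[i] for i in range(3)]
            + [[0.0] * 3 + C[i] for i in range(3)])

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# compound two poses: T1 T2 has rotation C1 C2 and translation C1 r2 + r1, as in (7.7)
C1, r1 = rot_z(0.7), [1.0, -2.0, 0.5]
C2, r2 = rot_z(-0.3), [0.2, 0.4, -1.0]
C12 = matmul(C1, C2)
r12 = [sum(C1[i][k] * r2[k] for k in range(3)) + r1[i] for i in range(3)]

lhs = matmul(adjoint(C1, r1), adjoint(C2, r2))  # Ad(T1) Ad(T2)
rhs = adjoint(C12, r12)                         # Ad(T1 T2)
print(max(abs(lhs[i][j] - rhs[i][j])
          for i in range(6) for j in range(6)) < 1e-12)  # -> True
```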
We can also talk about the adjoint of an element of $\mathfrak{se}(3)$. Let $\boldsymbol{\Xi} = \boldsymbol{\xi}^\wedge \in \mathfrak{se}(3)$; then the adjoint of this element is
$$\mathrm{ad}(\boldsymbol{\Xi}) = \mathrm{ad}\left(\boldsymbol{\xi}^\wedge\right) = \boldsymbol{\xi}^\curlywedge, \tag{7.48}$$
where
$$\boldsymbol{\xi}^\curlywedge = \begin{bmatrix}\boldsymbol{\rho}\\ \boldsymbol{\phi}\end{bmatrix}^\curlywedge = \begin{bmatrix}\boldsymbol{\phi}^\wedge & \boldsymbol{\rho}^\wedge\\ \mathbf{0} & \boldsymbol{\phi}^\wedge\end{bmatrix} \in \mathbb{R}^{6\times6}, \quad \boldsymbol{\rho}, \boldsymbol{\phi} \in \mathbb{R}^3. \tag{7.49}$$
Note that we have used uppercase, $\mathrm{Ad}(\cdot)$, for the adjoint of SE(3) and lowercase, $\mathrm{ad}(\cdot)$, for the adjoint of $\mathfrak{se}(3)$.
The Lie algebra associated with Ad(SE(3)) is given by

vectorspace: $\mathrm{ad}(\mathfrak{se}(3)) = \left\{\boldsymbol{\Psi} = \mathrm{ad}(\boldsymbol{\Xi}) \in \mathbb{R}^{6\times6} \mid \boldsymbol{\Xi} \in \mathfrak{se}(3)\right\}$,
field: $\mathbb{R}$,
Lie bracket: $[\boldsymbol{\Psi}_1, \boldsymbol{\Psi}_2] = \boldsymbol{\Psi}_1\boldsymbol{\Psi}_2 - \boldsymbol{\Psi}_2\boldsymbol{\Psi}_1$.
Again, we will omit showing that $\mathrm{ad}(\mathfrak{se}(3))$ is a vectorspace, but will briefly show that the four Lie bracket properties hold. Let $\boldsymbol{\Psi}, \boldsymbol{\Psi}_1 = \boldsymbol{\xi}_1^\curlywedge, \boldsymbol{\Psi}_2 = \boldsymbol{\xi}_2^\curlywedge \in \mathrm{ad}(\mathfrak{se}(3))$. Then for the closure property we have
$$[\boldsymbol{\Psi}_1, \boldsymbol{\Psi}_2] = \boldsymbol{\Psi}_1\boldsymbol{\Psi}_2 - \boldsymbol{\Psi}_2\boldsymbol{\Psi}_1 = \boldsymbol{\xi}_1^\curlywedge\boldsymbol{\xi}_2^\curlywedge - \boldsymbol{\xi}_2^\curlywedge\boldsymbol{\xi}_1^\curlywedge = \bigl(\underbrace{\boldsymbol{\xi}_1^\curlywedge\boldsymbol{\xi}_2}_{\in\mathbb{R}^6}\bigr)^\curlywedge \in \mathrm{ad}(\mathfrak{se}(3)). \tag{7.50}$$
Bilinearity follows directly from the fact that $(\cdot)^\curlywedge$ is a linear operator. The alternating property can be seen easily through
$$[\boldsymbol{\Psi}, \boldsymbol{\Psi}] = \boldsymbol{\Psi}\boldsymbol{\Psi} - \boldsymbol{\Psi}\boldsymbol{\Psi} = \mathbf{0} \in \mathrm{ad}(\mathfrak{se}(3)). \tag{7.51}$$
Finally, the Jacobi identity can be verified by substituting and applying the definition of the Lie bracket. Again, we will refer to $\mathrm{ad}(\mathfrak{se}(3))$ as the Lie algebra, although technically this is only the associated vectorspace.
The last issue to discuss is the relationship between Ad(SE(3)) and $\mathrm{ad}(\mathfrak{se}(3))$ through the exponential map. Not surprisingly, we have that
$$\boldsymbol{\mathcal{T}} = \exp\left(\boldsymbol{\xi}^\curlywedge\right) = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\boldsymbol{\xi}^\curlywedge\right)^n, \tag{7.52}$$
where $\boldsymbol{\mathcal{T}} \in \mathrm{Ad}(SE(3))$ and $\boldsymbol{\xi} \in \mathbb{R}^6$ (and hence $\boldsymbol{\xi}^\curlywedge \in \mathrm{ad}(\mathfrak{se}(3))$). We can go in the other direction using
$$\boldsymbol{\xi} = \ln\left(\boldsymbol{\mathcal{T}}\right)^\curlyvee, \tag{7.53}$$
where $(\cdot)^\curlyvee$ undoes the $(\cdot)^\curlywedge$ operation. The exponential mapping is again surjective, which we discuss below.
First, however, we note that there is a nice commutative relationship between the various Lie groups and algebras associated with poses:

                  Lie algebra                      Lie group
                                   exp
    4×4 :   ξ∧ ∈ se(3)        ────────→   T ∈ SE(3)
               │ ad                           │ Ad                    (7.54)
               ↓                   exp        ↓
    6×6 :   ξ⋏ ∈ ad(se(3))    ────────→   𝒯 ∈ Ad(SE(3))

We could draw on this commutative relationship to claim the surjective property of the exponential map from $\mathrm{ad}(\mathfrak{se}(3))$ to Ad(SE(3)) by going the long way around the loop, since we have already shown this path exists. However, it is also possible to show it directly, which amounts to showing that
$$\mathrm{Ad}\bigl(\underbrace{\exp\left(\boldsymbol{\xi}^\wedge\right)}_{\mathbf{T}}\bigr) = \underbrace{\exp\left(\mathrm{ad}\left(\boldsymbol{\xi}^\wedge\right)\right)}_{\exp\left(\boldsymbol{\xi}^\curlywedge\right)}, \tag{7.55}$$
since this implies that we can go from $\boldsymbol{\xi} \in \mathbb{R}^6$ to $\boldsymbol{\mathcal{T}} \in \mathrm{Ad}(SE(3))$ and back.
To see this, let $\boldsymbol{\xi} = \begin{bmatrix}\boldsymbol{\rho}\\ \boldsymbol{\phi}\end{bmatrix}$, and then starting from the right-hand side we have
$$\exp\left(\mathrm{ad}\left(\boldsymbol{\xi}^\wedge\right)\right) = \exp\left(\boldsymbol{\xi}^\curlywedge\right) = \sum_{n=0}^{\infty}\frac{1}{n!}\left(\boldsymbol{\xi}^\curlywedge\right)^n = \sum_{n=0}^{\infty}\frac{1}{n!}\begin{bmatrix}\boldsymbol{\phi}^\wedge & \boldsymbol{\rho}^\wedge\\ \mathbf{0} & \boldsymbol{\phi}^\wedge\end{bmatrix}^n = \begin{bmatrix}\mathbf{C} & \mathbf{K}\\ \mathbf{0} & \mathbf{C}\end{bmatrix}, \tag{7.56}$$
where $\mathbf{C}$ is the usual expression for the rotation matrix in terms of $\boldsymbol{\phi}$, and
$$\mathbf{K} = \sum_{n=0}^{\infty}\sum_{m=0}^{\infty}\frac{1}{(n+m+1)!}\left(\boldsymbol{\phi}^\wedge\right)^n\boldsymbol{\rho}^\wedge\left(\boldsymbol{\phi}^\wedge\right)^m,$$
which can be found through careful manipulation. Starting from the left-hand side we have
$$\mathrm{Ad}\left(\exp\left(\boldsymbol{\xi}^\wedge\right)\right) = \mathrm{Ad}\left(\begin{bmatrix}\mathbf{C} & \mathbf{J}\boldsymbol{\rho}\\ \mathbf{0}^T & 1\end{bmatrix}\right) = \begin{bmatrix}\mathbf{C} & \left(\mathbf{J}\boldsymbol{\rho}\right)^\wedge\mathbf{C}\\ \mathbf{0} & \mathbf{C}\end{bmatrix}, \tag{7.57}$$
where $\mathbf{J}$ is given in (7.36). Comparing (7.56) and (7.57), what remains to be shown is the equivalence of the top-right blocks: $\mathbf{K} = \left(\mathbf{J}\boldsymbol{\rho}\right)^\wedge\mathbf{C}$. To see this, we use the following sequence of manipulations:
$$\begin{aligned}
\left(\mathbf{J}\boldsymbol{\rho}\right)^\wedge\mathbf{C} &= \left(\left(\int_0^1 \mathbf{C}^\alpha\, d\alpha\right)\boldsymbol{\rho}\right)^\wedge\mathbf{C} = \int_0^1 \left(\mathbf{C}^\alpha\boldsymbol{\rho}\right)^\wedge\mathbf{C}\, d\alpha = \int_0^1 \mathbf{C}^\alpha\boldsymbol{\rho}^\wedge\mathbf{C}^{1-\alpha}\, d\alpha\\
&= \int_0^1 \exp\left(\alpha\,\boldsymbol{\phi}^\wedge\right)\boldsymbol{\rho}^\wedge\exp\left((1-\alpha)\,\boldsymbol{\phi}^\wedge\right) d\alpha\\
&= \sum_{n=0}^{\infty}\sum_{m=0}^{\infty}\left(\frac{1}{n!}\frac{1}{m!}\int_0^1 \alpha^n(1-\alpha)^m\, d\alpha\right)\left(\boldsymbol{\phi}^\wedge\right)^n\boldsymbol{\rho}^\wedge\left(\boldsymbol{\phi}^\wedge\right)^m,
\end{aligned} \tag{7.58}$$
where we have used that $(\cdot)^\wedge$ is linear and that $\left(\mathbf{C}\mathbf{v}\right)^\wedge = \mathbf{C}\mathbf{v}^\wedge\mathbf{C}^T$. After several integrations by parts, we can show that
$$\int_0^1 \alpha^n(1-\alpha)^m\, d\alpha = \frac{n!\,m!}{(n+m+1)!}, \tag{7.59}$$
and therefore $\mathbf{K} = \left(\mathbf{J}\boldsymbol{\rho}\right)^\wedge\mathbf{C}$, which is the desired result.
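The identity (7.55) can also be confirmed numerically by evaluating both sides with a truncated exponential series. A pure-Python sketch (function names are ours; 30 series terms is ample for the small inputs used here):

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mexp(A, terms=30):
    # truncated matrix exponential series (7.19)
    n = len(A)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(1, terms):
        term = [[sum(term[i][p] * A[p][j] for p in range(n)) / k
                 for j in range(n)] for i in range(n)]
        R = [[R[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return R

def skew(v):
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

rho, phi = [0.5, -0.2, 0.1], [0.3, 0.2, -0.4]
P, R = skew(phi), skew(rho)

# xi^ in R^{4x4}, as in (7.14), and xi-curlywedge in R^{6x6}, as in (7.49)
xi_hat = [P[i] + [rho[i]] for i in range(3)] + [[0.0] * 4]
xi_curly = [P[i] + R[i] for i in range(3)] + [[0.0] * 3 + P[i] for i in range(3)]

T = mexp(xi_hat)                      # exp(xi^), an element of SE(3)
C = [row[:3] for row in T[:3]]
r = [row[3] for row in T[:3]]
rC = matmul(skew(r), C)
AdT = [C[i] + rC[i] for i in range(3)] + [[0.0] * 3 + C[i] for i in range(3)]

lhs = mexp(xi_curly)                  # exp(ad(xi^)), an element of Ad(SE(3))
print(max(abs(lhs[i][j] - AdT[i][j])
          for i in range(6) for j in range(6)) < 1e-9)  # -> True
```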
7.1.5 Baker-Campbell-Hausdorff

We can combine two scalar exponential functions as follows:
$$\exp(a)\exp(b) = \exp(a + b), \tag{7.60}$$
where $a, b \in \mathbb{R}$. Unfortunately, this is not so easy for the matrix case. To compound two matrix exponentials, we use the Baker-Campbell-Hausdorff (BCH) formula:
$$\ln\left(\exp(\mathbf{A})\exp(\mathbf{B})\right) = \sum_{n=1}^{\infty}\frac{(-1)^{n-1}}{n}\sum_{\substack{r_i + s_i > 0,\\ 1 \leq i \leq n}}\frac{\left(\sum_{i=1}^{n}(r_i + s_i)\right)^{-1}}{\prod_{i=1}^{n} r_i!\,s_i!}\left[\mathbf{A}^{r_1}\mathbf{B}^{s_1}\mathbf{A}^{r_2}\mathbf{B}^{s_2}\cdots\mathbf{A}^{r_n}\mathbf{B}^{s_n}\right], \tag{7.61}$$
where
$$\left[\mathbf{A}^{r_1}\mathbf{B}^{s_1}\cdots\mathbf{A}^{r_n}\mathbf{B}^{s_n}\right] = [\underbrace{\mathbf{A},[\ldots[\mathbf{A}}_{r_1},[\underbrace{\mathbf{B},[\ldots[\mathbf{B}}_{s_1},\ldots[\underbrace{\mathbf{A},[\ldots[\mathbf{A}}_{r_n},[\underbrace{\mathbf{B},[\ldots[\mathbf{B},\mathbf{B}]\ldots]]\ldots]]\ldots]}_{s_n}, \tag{7.62}$$
which is zero if $s_n > 1$ or if $s_n = 0$ and $r_n > 1$. The Lie bracket is the usual
$$[\mathbf{A}, \mathbf{B}] = \mathbf{A}\mathbf{B} - \mathbf{B}\mathbf{A}. \tag{7.63}$$
Note, the BCH formula is an infinite series. In the event that $[\mathbf{A}, \mathbf{B}] = \mathbf{0}$, the BCH formula simplifies to
$$\ln\left(\exp(\mathbf{A})\exp(\mathbf{B})\right) = \mathbf{A} + \mathbf{B}, \tag{7.64}$$
but this case is not particularly useful to us except as an approximation. The first several terms of the general BCH formula are
$$\begin{aligned}
\ln\left(\exp(\mathbf{A})\exp(\mathbf{B})\right) = \mathbf{A} + \mathbf{B} &+ \frac{1}{2}[\mathbf{A},\mathbf{B}] + \frac{1}{12}[\mathbf{A},[\mathbf{A},\mathbf{B}]] - \frac{1}{12}[\mathbf{B},[\mathbf{A},\mathbf{B}]] - \frac{1}{24}[\mathbf{B},[\mathbf{A},[\mathbf{A},\mathbf{B}]]]\\
&- \frac{1}{720}\left([[[[\mathbf{A},\mathbf{B}],\mathbf{B}],\mathbf{B}],\mathbf{B}] + [[[[\mathbf{B},\mathbf{A}],\mathbf{A}],\mathbf{A}],\mathbf{A}]\right)\\
&+ \frac{1}{360}\left([[[[\mathbf{A},\mathbf{B}],\mathbf{B}],\mathbf{B}],\mathbf{A}] + [[[[\mathbf{B},\mathbf{A}],\mathbf{A}],\mathbf{A}],\mathbf{B}]\right)\\
&+ \frac{1}{120}\left([[[[\mathbf{A},\mathbf{B}],\mathbf{A}],\mathbf{B}],\mathbf{A}] + [[[[\mathbf{B},\mathbf{A}],\mathbf{B}],\mathbf{A}],\mathbf{B}]\right) + \cdots.
\end{aligned} \tag{7.65}$$

[Margin note: Henry Frederick Baker (1866-1956) was a British mathematician, working mainly in algebraic geometry, but also remembered for contributions to partial differential equations and Lie groups. John Edward Campbell (1862-1924) was a British mathematician, best known for his contribution to the BCH formula and a 1903 book popularizing the ideas of Sophus Lie. Felix Hausdorff (1868-1942) was a German mathematician who is considered to be one of the founders of modern topology and who contributed significantly to set theory, descriptive set theory, measure theory, function theory, and functional analysis. Henri Poincaré (1854-1912) is also said to have had a hand in the BCH formula.]
If we keep only terms linear in $\mathbf{A}$, the general BCH formula becomes (Klarsfeld and Oteo, 1989)
$$\ln\left(\exp(\mathbf{A})\exp(\mathbf{B})\right) \approx \mathbf{B} + \sum_{n=0}^{\infty}\frac{B_n}{n!}\underbrace{[\mathbf{B},[\mathbf{B},[\ldots[\mathbf{B}}_{n},\mathbf{A}]\ldots]]. \tag{7.66}$$
If we keep only terms linear in $\mathbf{B}$, the general BCH formula becomes
$$\ln\left(\exp(\mathbf{A})\exp(\mathbf{B})\right) \approx \mathbf{A} + \sum_{n=0}^{\infty}(-1)^n\frac{B_n}{n!}\underbrace{[\mathbf{A},[\mathbf{A},[\ldots[\mathbf{A}}_{n},\mathbf{B}]\ldots]]. \tag{7.67}$$
The $B_n$ are the Bernoulli numbers¹¹,
$$B_0 = 1,\ B_1 = -\tfrac{1}{2},\ B_2 = \tfrac{1}{6},\ B_3 = 0,\ B_4 = -\tfrac{1}{30},\ B_5 = 0,\ B_6 = \tfrac{1}{42},\ B_7 = 0,\ B_8 = -\tfrac{1}{30},\ B_9 = 0,\ B_{10} = \tfrac{5}{66},\ B_{11} = 0,\ B_{12} = -\tfrac{691}{2730},\ B_{13} = 0,\ B_{14} = \tfrac{7}{6},\ B_{15} = 0,\ \ldots, \tag{7.68}$$
which appear frequently in number theory. It is also worth noting that $B_n = 0$ for all odd $n > 1$, which reduces the number of terms that need to be implemented in approximations of some of our infinite series.
The Lie product formula,
$$\exp(\mathbf{A} + \mathbf{B}) = \lim_{\alpha\to\infty}\left(\exp(\mathbf{A}/\alpha)\exp(\mathbf{B}/\alpha)\right)^\alpha, \tag{7.69}$$
provides another way of looking at compounding matrix exponentials; compounding is effectively slicing each matrix exponential into an infinite number of infinitely thin slices and then interleaving the slices. We next discuss application of the general BCH formula to the specific cases of rotations and poses.

[Margin note: The Bernoulli numbers were discovered around the same time by the Swiss mathematician Jakob Bernoulli (1655-1705), after whom they are named, and independently by the Japanese mathematician Seki Kōwa (1642-1708). Seki's discovery was posthumously published in 1712 in his work Katsuyo Sampo; Bernoulli's, also posthumously, in his Ars Conjectandi (The Art of Conjecture) of 1713. Ada Lovelace's Note G on the analytical engine from 1842 describes an algorithm for generating Bernoulli numbers with Babbage's machine. As a result, the Bernoulli numbers have the distinction of being the subject of the first computer program.]

11 Technically, the sequence shown is the first Bernoulli sequence. There is also a second sequence, in which $B_1 = \frac{1}{2}$, but we will not need it here.

Rotations
In the particular case of SO(3), we can show that
$$\ln\left(\mathbf{C}_1\mathbf{C}_2\right)^\vee = \ln\left(\exp\left(\boldsymbol{\phi}_1^\wedge\right)\exp\left(\boldsymbol{\phi}_2^\wedge\right)\right)^\vee = \boldsymbol{\phi}_1 + \boldsymbol{\phi}_2 + \frac{1}{2}\boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_2 + \frac{1}{12}\boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_1^\wedge\boldsymbol{\phi}_2 + \frac{1}{12}\boldsymbol{\phi}_2^\wedge\boldsymbol{\phi}_2^\wedge\boldsymbol{\phi}_1 + \cdots, \tag{7.70}$$
where $\mathbf{C}_1 = \exp\left(\boldsymbol{\phi}_1^\wedge\right), \mathbf{C}_2 = \exp\left(\boldsymbol{\phi}_2^\wedge\right) \in SO(3)$. Alternatively, if we assume that $\boldsymbol{\phi}_1$ or $\boldsymbol{\phi}_2$ is small, then using the approximate BCH formulas
7.1 Geometry 225

we can show that

ln(C1 C2)∨ = ln( exp(φ1∧) exp(φ2∧) )∨
           ≈ J`(φ2)⁻¹ φ1 + φ2   if φ1 small,
           ≈ φ1 + Jr(φ1)⁻¹ φ2   if φ2 small,  (7.71)
where

Jr(φ)⁻¹ = Σ_{n=0}^∞ (Bn/n!) (−φ∧)ⁿ = (φ/2) cot(φ/2) 1 + ( 1 − (φ/2) cot(φ/2) ) a aᵀ + (φ/2) a∧,  (7.72a)

J`(φ)⁻¹ = Σ_{n=0}^∞ (Bn/n!) (φ∧)ⁿ = (φ/2) cot(φ/2) 1 + ( 1 − (φ/2) cot(φ/2) ) a aᵀ − (φ/2) a∧.  (7.72b)
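The Bernoulli-number series in (7.72) can be checked against the closed form directly. The sketch below (plain NumPy; the recurrence, helper names, and test values are mine, not the book's) generates the first Bernoulli sequence and compares the truncated series for the inverse left Jacobian with the cot(φ/2) closed form of (7.72b):

```python
import numpy as np
from math import comb, factorial

def bernoulli(n_max):
    # First Bernoulli sequence (B1 = -1/2), via the standard recurrence
    # sum_{k=0}^{m-1} C(m+1, k) B_k = -(m+1) B_m.
    B = [1.0]
    for m in range(1, n_max + 1):
        B.append(-sum(comb(m + 1, k) * B[k] for k in range(m)) / (m + 1))
    return B

def hat(v):
    # The ^ (skew-symmetric) operator for 3x1 columns.
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

B = bernoulli(14)                        # B2 = 1/6, B4 = -1/30, odd terms (>1) vanish

phi_vec = np.array([0.2, -0.1, 0.25])    # a small test rotation vector
phi = np.linalg.norm(phi_vec)
a = phi_vec / phi

# Truncated series sum_n (B_n/n!) (phi^)^n from (7.72b).
J_inv_series = sum(B[n] / factorial(n) * np.linalg.matrix_power(hat(phi_vec), n)
                   for n in range(15))

# Closed form from (7.72b).
c = (phi / 2) / np.tan(phi / 2)
J_inv_closed = c * np.eye(3) + (1 - c) * np.outer(a, a) - (phi / 2) * hat(a)

print(np.max(np.abs(J_inv_series - J_inv_closed)))   # agreement to machine precision
```

Because the odd Bernoulli numbers beyond B1 vanish, only about half of the series terms contribute.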
In Lie group theory, Jr and J` are referred to as the right and left
Jacobians of SO(3), respectively. As noted earlier, due to the nature of
the cot(φ/2) function, there are singularities associated with Jr , J` at
φ = 2πm with m a non-zero integer. Inverting, we have the following
expressions for the Jacobians:
Jr(φ) = Σ_{n=0}^∞ (1/(n+1)!) (−φ∧)ⁿ = ∫₀¹ C^(−α) dα
      = (sin φ/φ) 1 + ( 1 − sin φ/φ ) a aᵀ − ( (1 − cos φ)/φ ) a∧,  (7.73a)

J`(φ) = Σ_{n=0}^∞ (1/(n+1)!) (φ∧)ⁿ = ∫₀¹ C^α dα
      = (sin φ/φ) 1 + ( 1 − sin φ/φ ) a aᵀ + ( (1 − cos φ)/φ ) a∧,  (7.73b)

where C = exp φ∧ , φ = |φ|, and a = φ/φ. We draw attention to the
fact that
J` (φ) = C Jr (φ), (7.74)
which allows us to relate one Jacobian to the other. To show this is
fairly straightforward using the definitions:
C Jr(φ) = C ∫₀¹ C^(−α) dα = ∫₀¹ C^(1−α) dα
        = −∫₁⁰ C^β dβ = ∫₀¹ C^β dβ = J`(φ).  (7.75)

Another relationship between the left and right Jacobians is

J`(−φ) = Jr(φ),  (7.76)

which is again fairly easy to see:

Jr(φ) = ∫₀¹ C(φ)^(−α) dα = ∫₀¹ ( C(φ)⁻¹ )^α dα
      = ∫₀¹ ( C(−φ) )^α dα = J`(−φ).  (7.77)
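A two-line numerical check of (7.76) (NumPy sketch; the helper names and the test vector are mine):

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def jac(phi_vec, side=+1):
    # Closed-form Jacobians from (7.73): side=+1 left, side=-1 right.
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    s = np.sin(phi) / phi
    return (s * np.eye(3) + (1 - s) * np.outer(a, a)
            + side * (1 - np.cos(phi)) / phi * hat(a))

phi_vec = np.array([0.4, 0.1, -0.3])
# (7.76): J_left(-phi) should equal J_right(phi).
assert np.allclose(jac(-phi_vec, +1), jac(phi_vec, -1), atol=1e-12)
```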

In later sections and chapters, we will (arbitrarily) work with the left
Jacobian and it will therefore be useful to write out the BCH approxi-
mations in (7.71) using only the left Jacobian:
ln(C1 C2)∨ = ln( exp(φ1∧) exp(φ2∧) )∨
           ≈ J(φ2)⁻¹ φ1 + φ2   if φ1 small,
           ≈ φ1 + J(−φ1)⁻¹ φ2   if φ2 small,  (7.78)
where it is now implied that J = J` , by convention12 .
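To see the approximation (7.78) in action, the sketch below (NumPy; names and values are mine) composes a small rotation φ1 with a moderate φ2 and compares the exact ln(C1 C2)∨ against J(φ2)⁻¹ φ1 + φ2. The BCH error sits well below the error of the naive sum φ1 + φ2:

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rot(phi_vec):
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    return (np.cos(phi) * np.eye(3) + (1 - np.cos(phi)) * np.outer(a, a)
            + np.sin(phi) * hat(a))

def rot_log(C):
    # Inverse of the exponential map, valid for rotation angles in (0, pi).
    phi = np.arccos((np.trace(C) - 1) / 2)
    return (phi / (2 * np.sin(phi))) * np.array(
        [C[2, 1] - C[1, 2], C[0, 2] - C[2, 0], C[1, 0] - C[0, 1]])

def jac_inv(phi_vec):
    # Closed form of the inverse left Jacobian, (7.72b).
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    c = (phi / 2) / np.tan(phi / 2)
    return c * np.eye(3) + (1 - c) * np.outer(a, a) - (phi / 2) * hat(a)

phi1 = np.array([1e-3, -2e-3, 1.5e-3])   # the "small" rotation
phi2 = np.array([0.3, 0.1, -0.2])

exact = rot_log(rot(phi1) @ rot(phi2))
bch = jac_inv(phi2) @ phi1 + phi2        # approximation (7.78), phi1 small
naive = phi1 + phi2

assert np.linalg.norm(exact - bch) < 1e-5
assert np.linalg.norm(exact - bch) < np.linalg.norm(exact - naive)
```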

Poses
In the particular cases of SE(3) and Ad(SE(3)), we can show that

ln(T1 T2)∨ = ln( exp(ξ1∧) exp(ξ2∧) )∨
           = ξ1 + ξ2 + (1/2) ξ1⋏ ξ2 + (1/12) ξ1⋏ ξ1⋏ ξ2 + (1/12) ξ2⋏ ξ2⋏ ξ1 + · · · ,  (7.79a)

ln(𝒯1 𝒯2)⋎ = ln( exp(ξ1⋏) exp(ξ2⋏) )⋎
           = ξ1 + ξ2 + (1/2) ξ1⋏ ξ2 + (1/12) ξ1⋏ ξ1⋏ ξ2 + (1/12) ξ2⋏ ξ2⋏ ξ1 + · · · ,  (7.79b)

where T1 = exp(ξ1∧), T2 = exp(ξ2∧) ∈ SE(3) and 𝒯1 = exp(ξ1⋏), 𝒯2 = exp(ξ2⋏) ∈ Ad(SE(3)). Alternatively, if we assume that ξ1 or ξ2 is small, then using the approximate BCH formulas we can show that

ln(T1 T2)∨ = ln( exp(ξ1∧) exp(ξ2∧) )∨
           ≈ 𝓙`(ξ2)⁻¹ ξ1 + ξ2   if ξ1 small,
           ≈ ξ1 + 𝓙r(ξ1)⁻¹ ξ2   if ξ2 small,  (7.80a)

ln(𝒯1 𝒯2)⋎ = ln( exp(ξ1⋏) exp(ξ2⋏) )⋎
           ≈ 𝓙`(ξ2)⁻¹ ξ1 + ξ2   if ξ1 small,
           ≈ ξ1 + 𝓙r(ξ1)⁻¹ ξ2   if ξ2 small,  (7.80b)

where

𝓙r(ξ)⁻¹ = Σ_{n=0}^∞ (Bn/n!) (−ξ⋏)ⁿ,  (7.81a)
𝓙`(ξ)⁻¹ = Σ_{n=0}^∞ (Bn/n!) (ξ⋏)ⁿ.  (7.81b)
12 We will use this convention throughout the book and only show the subscript on the
Jacobian when making specific points.

In Lie group theory, 𝓙r and 𝓙` are referred to as the right and left Jacobians of SE(3), respectively. Inverting, we have the following expressions for the Jacobians:

𝓙r(ξ) = Σ_{n=0}^∞ (1/(n+1)!) (−ξ⋏)ⁿ = ∫₀¹ 𝒯^(−α) dα = [ Jr  Qr ; 0  Jr ],  (7.82a)

𝓙`(ξ) = Σ_{n=0}^∞ (1/(n+1)!) (ξ⋏)ⁿ = ∫₀¹ 𝒯^α dα = [ J`  Q` ; 0  J` ],  (7.82b)

where

Q`(ξ) = Σ_{n=0}^∞ Σ_{m=0}^∞ ( 1/(n+m+2)! ) (φ∧)ⁿ ρ∧ (φ∧)ᵐ  (7.83a)
      = (1/2) ρ∧ + ( (φ − sin φ)/φ³ ) ( φ∧ ρ∧ + ρ∧ φ∧ + φ∧ ρ∧ φ∧ )
        − ( (1 − φ²/2 − cos φ)/φ⁴ ) ( φ∧ φ∧ ρ∧ + ρ∧ φ∧ φ∧ − 3 φ∧ ρ∧ φ∧ )
        − (1/2) ( (1 − φ²/2 − cos φ)/φ⁴ − 3 (φ − sin φ − φ³/6)/φ⁵ )
          × ( φ∧ ρ∧ φ∧ φ∧ + φ∧ φ∧ ρ∧ φ∧ ),  (7.83b)

Qr(ξ) = Q`(−ξ) = C Q`(ξ) + (J` ρ)∧ C J`,  (7.83c)

and 𝒯 = exp(ξ⋏), T = exp(ξ∧), C = exp(φ∧), ξ = (ρ; φ). The expression for Q` comes from expanding the series and grouping terms into the series forms of the trigonometric functions¹³. The relations for Qr come from the relationships between the left and right Jacobians:
𝓙`(ξ) = 𝒯 𝓙r(ξ),  𝓙`(−ξ) = 𝓙r(ξ).  (7.84)

The first can be seen to be true from

𝒯 𝓙r(ξ) = 𝒯 ∫₀¹ 𝒯^(−α) dα = ∫₀¹ 𝒯^(1−α) dα
        = −∫₁⁰ 𝒯^β dβ = ∫₀¹ 𝒯^β dβ = 𝓙`(ξ),  (7.85)

and the second from

𝓙r(ξ) = ∫₀¹ 𝒯(ξ)^(−α) dα = ∫₀¹ ( 𝒯(ξ)⁻¹ )^α dα
      = ∫₀¹ ( 𝒯(−ξ) )^α dα = 𝓙`(−ξ).  (7.86)

13 This is a very lengthy derivation, but the result is exact.



In later sections and chapters, we will (arbitrarily) work with the left Jacobian and it will therefore be useful to write out the BCH approximations in (7.80) using only the left Jacobian:

ln(T1 T2)∨ = ln( exp(ξ1∧) exp(ξ2∧) )∨
           ≈ 𝓙(ξ2)⁻¹ ξ1 + ξ2   if ξ1 small,
           ≈ ξ1 + 𝓙(−ξ1)⁻¹ ξ2   if ξ2 small,  (7.87a)

ln(𝒯1 𝒯2)⋎ = ln( exp(ξ1⋏) exp(ξ2⋏) )⋎
           ≈ 𝓙(ξ2)⁻¹ ξ1 + ξ2   if ξ1 small,
           ≈ ξ1 + 𝓙(−ξ1)⁻¹ ξ2   if ξ2 small,  (7.87b)

where it is now implied that 𝓙 = 𝓙`, by convention¹⁴.


Alternate expressions for the inverses are

𝓙r⁻¹ = [ Jr⁻¹  −Jr⁻¹ Qr Jr⁻¹ ; 0  Jr⁻¹ ],  (7.88a)
𝓙`⁻¹ = [ J`⁻¹  −J`⁻¹ Q` J`⁻¹ ; 0  J`⁻¹ ].  (7.88b)

We see that the singularities of 𝓙r and 𝓙` are precisely the same as the singularities of Jr and J`, respectively, since

det(𝓙r) = ( det(Jr) )²,  det(𝓙`) = ( det(J`) )²,  (7.89)

and having a non-zero determinant is a necessary and sufficient condition for invertibility (and therefore no singularity).
We also have that

T = [ C  J` ρ ; 0ᵀ  1 ] = [ C  C Jr ρ ; 0ᵀ  1 ],  (7.90a)

𝒯 = [ C  (J` ρ)∧ C ; 0  C ] = [ C  C (Jr ρ)∧ ; 0  C ],  (7.90b)

which tells us how to relate the ρ variable to the translational component of T or 𝒯. In our work later on, we will only use the left Jacobian.
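The translation block of (7.90a) can be checked by brute force: exponentiate ξ∧ with a plain truncated series and compare the result against the closed-form blocks C and J`ρ. A NumPy sketch (helper names and the test value of ξ are mine):

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def se3_hat(xi):
    # xi = (rho, phi) -> 4x4 matrix [[phi^, rho], [0^T, 0]].
    X = np.zeros((4, 4))
    X[:3, :3] = hat(xi[3:])
    X[:3, 3] = xi[:3]
    return X

def expm(A, terms=40):
    # Plain truncated series for the matrix exponential (fine for small arguments).
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

xi = np.array([0.5, -0.2, 0.3, 0.2, 0.1, -0.4])
rho, phi_vec = xi[:3], xi[3:]
phi = np.linalg.norm(phi_vec)
a = phi_vec / phi
s = np.sin(phi) / phi

# Closed forms: C = exp(phi^) and the left Jacobian J from (7.73b).
C = np.cos(phi) * np.eye(3) + (1 - np.cos(phi)) * np.outer(a, a) + np.sin(phi) * hat(a)
J = s * np.eye(3) + (1 - s) * np.outer(a, a) + (1 - np.cos(phi)) / phi * hat(a)

T = expm(se3_hat(xi))
assert np.allclose(T[:3, :3], C, atol=1e-10)        # rotation block
assert np.allclose(T[:3, 3], J @ rho, atol=1e-10)   # translation block is J rho
```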
It is also worth noting that 𝓙 𝓙ᵀ > 0 (positive-definite), for either the left or right Jacobians. We can see this through the following factorization:

𝓙 𝓙ᵀ = [ 1  Q J⁻¹ ; 0  1 ] [ J Jᵀ  0 ; 0  J Jᵀ ] [ 1  0 ; J⁻ᵀ Qᵀ  1 ] > 0,  (7.91)

where the outer factors are transposes of one another (a congruence) and we have used that J Jᵀ > 0, which was shown previously.
14 We will use this convention throughout the book and only show the subscript on the
Jacobian when making specific points.

7.1.6 Distance, Volume, Integration


We need to think about the concepts of distance, volume, and inte-
gration differently for Lie groups than for vectorspaces. This section
quickly covers these topics for both rotations and poses.

Rotations
There are two common ways to define the difference of two rotations:

φ12 = ln( C1ᵀ C2 )∨,  (7.92a)
φ21 = ln( C2 C1ᵀ )∨,  (7.92b)

where C1, C2 ∈ SO(3). One can be thought of as the right difference and the other the left. We can define the inner product for so(3) as

⟨φ1∧, φ2∧⟩ = −(1/2) tr( φ1∧ φ2∧ ) = φ1ᵀ φ2.  (7.93)
The metric distance between two rotations can be thought of in two ways: (i) the square root of the inner product of the difference with itself, or (ii) the Euclidean norm of the difference:

φ12 = √⟨ln(C1ᵀ C2), ln(C1ᵀ C2)⟩ = √⟨φ12∧, φ12∧⟩ = √(φ12ᵀ φ12) = |φ12|,  (7.94a)
φ21 = √⟨ln(C2 C1ᵀ), ln(C2 C1ᵀ)⟩ = √⟨φ21∧, φ21∧⟩ = √(φ21ᵀ φ21) = |φ21|.  (7.94b)

This can also be viewed as the magnitude of the angle of the rotation
difference.
To consider integrating functions of rotations, we parametrize C = exp(φ∧) ∈ SO(3). Perturbing φ by a little bit results in the new rotation matrix, C′ = exp( (φ + δφ)∧ ) ∈ SO(3). We have that the right and left differences (relative to C) are

ln(δCr)∨ = ln( Cᵀ C′ )∨ = ln( Cᵀ exp((φ + δφ)∧) )∨
         ≈ ln( Cᵀ C exp((Jr δφ)∧) )∨ = Jr δφ,  (7.95a)

ln(δC`)∨ = ln( C′ Cᵀ )∨ = ln( exp((φ + δφ)∧) Cᵀ )∨
         ≈ ln( exp((J` δφ)∧) C Cᵀ )∨ = J` δφ,  (7.95b)

where Jr and J` are evaluated at φ. To compute the infinitesimal vol-


ume element, we want to find the volume of the parallelepiped formed
by the columns of Jr or J`, which is simply the corresponding determinant¹⁵:
dCr = |det(Jr )| dφ , (7.96a)
dC` = |det(J` )| dφ . (7.96b)
We note that
det(J`) = det(C Jr) = det(C) det(Jr) = det(Jr),  (7.97)

since det(C) = 1, which means that regardless of which distance metric we use, right or


left, the infinitesimal volume element is the same. This is true for all
unimodular Lie groups, such as SO(3). Therefore, we can write
dC = |det (J)| dφ, (7.98)
for the calculation of an infinitesimal volume element.
It turns out that

|det(J)| = 2(1 − cos φ)/φ² = (2/φ²)( 1 − 1 + φ²/2! − φ⁴/4! + φ⁶/6! − φ⁸/8! + · · · )
         = 1 − (1/12)φ² + (1/360)φ⁴ − (1/20160)φ⁶ + · · · ,  (7.99)
where φ = |φ|. For most practical situations we can safely use just the
first two or even one term of this expression.
Integrating functions of rotations can then be carried out like this:

∫_{SO(3)} f(C) dC → ∫_{|φ|<π} f(φ) |det(J)| dφ,  (7.100)

where we are careful to ensure |φ| < π so as to sweep out all of SO(3)
just once (due to the surjective nature of the exponential map).
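The volume-element determinant is easy to check numerically (NumPy sketch; helper names and the test value are mine): the determinant of the closed-form left Jacobian matches 2(1 − cos φ)/φ², and the truncated series of (7.99) is already accurate at moderate angles:

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

phi_vec = np.array([0.8, -0.3, 0.5])
phi = np.linalg.norm(phi_vec)
a = phi_vec / phi
s = np.sin(phi) / phi

# Closed-form left Jacobian (7.73b).
J = s * np.eye(3) + (1 - s) * np.outer(a, a) + (1 - np.cos(phi)) / phi * hat(a)

det_exact = 2 * (1 - np.cos(phi)) / phi**2                      # closed form, (7.99)
det_series = 1 - phi**2 / 12 + phi**4 / 360 - phi**6 / 20160    # truncated series

assert abs(np.linalg.det(J) - det_exact) < 1e-10
assert abs(det_exact - det_series) < 1e-5
```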

Poses
We briefly summarize the SE(3) and Ad(SE(3)) results as they are
very similar to SO(3). We can define right and left distance metrics:
ξ12 = ln( T1⁻¹ T2 )∨ = ln( 𝒯1⁻¹ 𝒯2 )⋎,  (7.101a)
ξ21 = ln( T2 T1⁻¹ )∨ = ln( 𝒯2 𝒯1⁻¹ )⋎.  (7.101b)
The 4 × 4 and 6 × 6 inner products are

⟨ξ1∧, ξ2∧⟩ = −tr( ξ1∧ [ (1/2)·1  0 ; 0ᵀ  1 ] ξ2∧ᵀ ) = ξ1ᵀ ξ2,  (7.102a)

⟨ξ1⋏, ξ2⋏⟩ = −tr( ξ1⋏ [ (1/4)·1  0 ; 0  (1/2)·1 ] ξ2⋏ᵀ ) = ξ1ᵀ ξ2.  (7.102b)
15 We are slightly abusing notation here by writing dC, but hopefully it is clear from
context what is meant.

Note, we could adjust the weighting matrix in the middle to weight rotation and translation differently if we so desired. The right and left distances are

ξ12 = √⟨ξ12∧, ξ12∧⟩ = √⟨ξ12⋏, ξ12⋏⟩ = √(ξ12ᵀ ξ12) = |ξ12|,  (7.103a)
ξ21 = √⟨ξ21∧, ξ21∧⟩ = √⟨ξ21⋏, ξ21⋏⟩ = √(ξ21ᵀ ξ21) = |ξ21|.  (7.103b)

Using the parametrization,

T = exp(ξ∧),  (7.104)

and the perturbation,

T′ = exp( (ξ + δξ)∧ ),  (7.105)

the differences (relative to T) are

ln(δTr)∨ = ln( T⁻¹ T′ )∨ ≈ 𝓙r δξ,  (7.106a)
ln(δT`)∨ = ln( T′ T⁻¹ )∨ ≈ 𝓙` δξ.  (7.106b)

The right and left infinitesimal volume elements are

dTr = |det(𝓙r)| dξ,  (7.107a)
dT` = |det(𝓙`)| dξ.  (7.107b)
We have that

det(𝓙`) = det(𝒯 𝓙r) = det(𝒯) det(𝓙r) = det(𝓙r),  (7.108)

since det(𝒯) = (det(C))² = 1. We can therefore write

dT = |det(𝓙)| dξ,  (7.109)

for our integration volume. Finally, we have that

|det(𝓙)| = |det(J)|² = ( 2(1 − cos φ)/φ² )² = 1 − (1/6)φ² + · · · ,  (7.110)

and again we probably will never need more than two terms of this expression.

To integrate functions over SE(3), we can now use our infinitesimal volume in the calculation:

∫_{SE(3)} f(T) dT = ∫_{R³, |φ|<π} f(ξ) |det(𝓙)| dξ,  (7.111)

where we limit φ to the ball of radius π (due to the surjective nature of the exponential map) but let ρ ∈ R³.

7.1.7 Interpolation
We will have occasion later to interpolate between two elements of a matrix Lie group. Unfortunately, the typical linear interpolation scheme,

x = (1 − α) x1 + α x2,  α ∈ [0, 1],  (7.112)

will not work because this interpolation scheme does not satisfy closure (i.e., the result is no longer in the group). In other words,

(1 − α) C1 + α C2 ∉ SO(3),  (7.113a)
(1 − α) T1 + α T2 ∉ SE(3),  (7.113b)

for some values of α ∈ [0, 1] with C1, C2 ∈ SO(3), T1, T2 ∈ SE(3). We must rethink what interpolation means for Lie groups.

Rotations
There are many possible interpolation schemes that we could define. One of these is the following:

C = ( C2 C1ᵀ )^α C1,  α ∈ [0, 1],  (7.114)

where C, C1, C2 ∈ SO(3). We see that when α = 0 we have C = C1 and when α = 1 we have C = C2. The nice thing about this scheme is that we guarantee closure, meaning C ∈ SO(3) for all α ∈ [0, 1]. This is because we know that C21 = exp(φ21∧) = C2 C1ᵀ is still a rotation matrix due to closure of the Lie group. Exponentiating by the interpolation variable keeps the result in SO(3),

C21^α = ( exp(φ21∧) )^α = exp(α φ21∧) ∈ SO(3),  (7.115)

and finally compounding with C1 results in a member of SO(3), again due to closure of the group. We can also see that we are essentially just scaling the rotation angle of C21 by α, which is appealing intuitively.
Our scheme in (7.114) is actually similar to (7.112), if we rearrange it a bit:

x = α(x2 − x1) + x1.  (7.116)

Or, letting x = ln(y), x1 = ln(y1), x2 = ln(y2), we can rewrite it as

y = ( y2 y1⁻¹ )^α y1,  (7.117)

which is very similar to our proposed scheme. Given our understanding of the relationship between so(3) and SO(3) (i.e., through the exponential map), it is therefore not a leap to understand that (7.114) is somehow defining linear-like interpolation in the Lie algebra, where we can treat elements as vectors.
To examine this further, we let C = exp(φ∧), C1 = exp(φ1∧), C2 = exp(φ2∧) ∈ SO(3) with φ, φ1, φ2 ∈ R³. If we are able to make the assumption that φ21 is small (in the sense of distance from the previous section), then we have

φ = ln(C)∨ = ln( (C2 C1ᵀ)^α C1 )∨
  = ln( exp(α φ21∧) exp(φ1∧) )∨ ≈ α J(φ1)⁻¹ φ21 + φ1,  (7.118)

which is comparable to (7.116) and is a form of linear interpolation.
Another case worth noting is when C1 = 1, whereupon

C = C2^α,  φ = α φ2,  (7.119)

with no approximation.
Another way to interpret our interpolation scheme is that it is enforcing a constant angular velocity, ω. If we think of our rotation matrix as being a function of time, C(t), then the scheme is

C(t) = ( C(t2) C(t1)ᵀ )^α C(t1),  α = (t − t1)/(t2 − t1).  (7.120)

Defining the constant angular velocity as

ω = ( 1/(t2 − t1) ) φ21,  (7.121)

the scheme becomes

C(t) = exp( (t − t1) ω∧ ) C(t1).  (7.122)
This is exactly the solution to Poisson’s equation, (6.45),
Ċ(t) = ω ∧ C(t), (7.123)
with constant angular velocity16 . Thus, while other interpolation schemes
are possible, this one has a strong physical connection.
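The scheme (7.114) is straightforward to implement with an exponential map and its inverse. A minimal NumPy sketch (function names and the test rotations are mine):

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rot(phi_vec):
    phi = np.linalg.norm(phi_vec)
    if phi < 1e-12:
        return np.eye(3)
    a = phi_vec / phi
    return (np.cos(phi) * np.eye(3) + (1 - np.cos(phi)) * np.outer(a, a)
            + np.sin(phi) * hat(a))

def rot_log(C):
    phi = np.arccos(np.clip((np.trace(C) - 1) / 2, -1.0, 1.0))
    if phi < 1e-12:
        return np.zeros(3)
    return (phi / (2 * np.sin(phi))) * np.array(
        [C[2, 1] - C[1, 2], C[0, 2] - C[2, 0], C[1, 0] - C[0, 1]])

def interp(C1, C2, alpha):
    # C = (C2 C1^T)^alpha C1 from (7.114), evaluated via exp/log.
    return rot(alpha * rot_log(C2 @ C1.T)) @ C1

C1 = rot(np.array([0.1, 0.2, -0.1]))
C2 = rot(np.array([-0.3, 0.1, 0.4]))

assert np.allclose(interp(C1, C2, 0.0), C1)          # endpoint alpha = 0
assert np.allclose(interp(C1, C2, 1.0), C2)          # endpoint alpha = 1
Chalf = interp(C1, C2, 0.5)
assert np.allclose(Chalf @ Chalf.T, np.eye(3))       # closure: still in SO(3)
```

Unlike the naive blend (7.112), every intermediate value here is a valid rotation matrix.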

Perturbed Rotations

Another thing that will be very useful to investigate is what happens to C if we perturb C1 and/or C2 a little bit. Suppose now that C′, C1′, C2′ ∈ SO(3) are the perturbed rotation matrices with the (left) differences¹⁷ given as

δφ = ln( C′ Cᵀ )∨,  δφ1 = ln( C1′ C1ᵀ )∨,  δφ2 = ln( C2′ C2ᵀ )∨.  (7.124)

The interpolation scheme must hold for the perturbed rotation matrices:

C′ = ( C2′ C1′ᵀ )^α C1′,  α ∈ [0, 1].  (7.125)

16 Kinematics will be discussed in further detail later in this chapter.
17 In anticipation of how we will use this result, we will consider perturbations on the left, but we saw in the previous section that there are equivalent perturbations on the right and in the middle.

We are interested in finding a relationship between δφ and δφ1, δφ2. Substituting in our perturbations we have

exp(δφ∧) C = ( exp(δφ2∧) C2 C1ᵀ exp(−δφ1∧) )^α exp(δφ1∧) C1,  (7.126)

where we have assumed the perturbations are small so that the quantity inside the brackets is approximately exp( (φ21 + J(φ21)⁻¹ (δφ2 − C21 δφ1))∧ ). Bringing the interpolation variable inside the exponential we have

exp(δφ∧) C ≈ exp( (α φ21 + α J(φ21)⁻¹ (δφ2 − C21 δφ1))∧ ) exp(δφ1∧) C1
           ≈ exp( (α J(αφ21) J(φ21)⁻¹ (δφ2 − C21 δφ1))∧ ) exp( (C21^α δφ1)∧ ) C21^α C1,  (7.127)

where C21^α C1 = C. Dropping the C from both sides, expanding the matrix exponentials, distributing the multiplication, and then keeping only first-order terms in the perturbation quantities we have

δφ = α J(αφ21) J(φ21)⁻¹ (δφ2 − C21 δφ1) + C21^α δφ1.  (7.128)
Manipulating a little further (using several identities involving the Jacobians), we can show this simplifies to

δφ = ( 1 − A(α, φ21) ) δφ1 + A(α, φ21) δφ2,  (7.129)

where

A(α, φ) = α J(αφ) J(φ)⁻¹.  (7.130)

We see this has a very nice form that mirrors the usual linear interpolation scheme. Notably, when φ is small then A(α, φ) ≈ α 1.

Although we have a means of computing A(α, φ) in closed form (via J(·)), we can work out a series expression for it as well. In terms of our series expressions for J(·) and its inverse, we have

A(α, φ) = α ( Σ_{k=0}^∞ (αᵏ/(k+1)!) (φ∧)ᵏ ) ( Σ_{ℓ=0}^∞ (Bℓ/ℓ!) (φ∧)ˡ ),  (7.131)

where the first factor is J(αφ) and the second is J(φ)⁻¹. We can use a discrete convolution, or Cauchy product (of two series), to rewrite this as

A(α, φ) = Σ_{n=0}^∞ (Fn(α)/n!) (φ∧)ⁿ,  (7.132)

Baron Augustin-Louis Cauchy (1789-1857) was a French mathematician who pioneered the study of continuity in terms of infinitesimals, almost singlehandedly founded complex analysis, and initiated the study of permutation groups in abstract algebra.

where

Fn(α) = ( 1/(n+1) ) Σ_{m=0}^{n} (n+1 choose m) Bm α^(n+1−m) = Σ_{β=0}^{α−1} βⁿ,  (7.133)

is a version of Faulhaber's formula. The first few Faulhaber coefficients (as we will call them) are

F0(α) = α,  F1(α) = α(α − 1)/2,  F2(α) = α(α − 1)(2α − 1)/6,
F3(α) = α²(α − 1)²/4,  . . . .  (7.134)

Putting these back into A(α, φ), we have

A(α, φ) = α 1 + ( α(α − 1)/2 ) φ∧ + ( α(α − 1)(2α − 1)/12 ) φ∧ φ∧
          + ( α²(α − 1)²/24 ) φ∧ φ∧ φ∧ + · · · ,  (7.135)

where we likely would not need many terms if φ is small.

Johann Faulhaber (1580-1635) was a German mathematician whose major contribution involved calculating the sums of powers of integers. Jakob Bernoulli makes references to Faulhaber in his Ars Conjectandi.
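The closed form (7.130) and the series (7.135) can be compared numerically for a small φ (NumPy sketch; helper names and test values are mine):

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def jac(phi_vec):
    # Closed-form left Jacobian (7.73b).
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    s = np.sin(phi) / phi
    return (s * np.eye(3) + (1 - s) * np.outer(a, a)
            + (1 - np.cos(phi)) / phi * hat(a))

def jac_inv(phi_vec):
    # Closed-form inverse left Jacobian (7.72b).
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    c = (phi / 2) / np.tan(phi / 2)
    return c * np.eye(3) + (1 - c) * np.outer(a, a) - (phi / 2) * hat(a)

alpha = 0.3
phi_vec = np.array([0.05, -0.03, 0.08])
P = hat(phi_vec)

A_closed = alpha * jac(alpha * phi_vec) @ jac_inv(phi_vec)   # (7.130)
A_series = (alpha * np.eye(3)                                 # (7.135), four terms
            + alpha * (alpha - 1) / 2 * P
            + alpha * (alpha - 1) * (2 * alpha - 1) / 12 * P @ P
            + alpha**2 * (alpha - 1)**2 / 24 * P @ P @ P)

assert np.max(np.abs(A_closed - A_series)) < 1e-6
```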

Alternate Interpretation of Perturbed Rotations

Technically speaking, the last sum on the far right of (7.133) does not make much sense since α ∈ [0, 1], but we can also get to this another way. Let us pretend for the moment that α is in fact a positive integer. Then we can expand the exponentiated part of our interpolation formula according to

( exp(δφ∧) C )^α = exp(δφ∧) C · · · exp(δφ∧) C  (α factors),  (7.136)

where C = exp(φ∧). We can then move all of the δφ terms to the far left so that

( exp(δφ∧) C )^α = exp(δφ∧) exp( (C δφ)∧ ) · · · exp( (C^(α−1) δφ)∧ ) C^α,  (7.137)

where we have not yet made any approximations. Expanding each of the exponentials, multiplying out, and keeping only terms first-order in δφ leaves us with

( exp(δφ∧) C )^α ≈ ( 1 + ( Σ_{β=0}^{α−1} C^β δφ )∧ ) C^α = ( 1 + (A(α, φ) δφ)∧ ) C^α,  (7.138)

where

A(α, φ) = Σ_{β=0}^{α−1} C^β = Σ_{β=0}^{α−1} exp(βφ∧) = Σ_{β=0}^{α−1} Σ_{n=0}^∞ (1/n!) βⁿ (φ∧)ⁿ
        = Σ_{n=0}^∞ (1/n!) ( Σ_{β=0}^{α−1} βⁿ ) (φ∧)ⁿ = Σ_{n=0}^∞ (Fn(α)/n!) (φ∧)ⁿ,  (7.139)

which is the same as (7.132). Some examples of Faulhaber's coefficients are:

F0(α) = 0⁰ + 1⁰ + 2⁰ + · · · + (α − 1)⁰ = α,  (7.140a)
F1(α) = 0¹ + 1¹ + 2¹ + · · · + (α − 1)¹ = α(α − 1)/2,  (7.140b)
F2(α) = 0² + 1² + 2² + · · · + (α − 1)² = α(α − 1)(2α − 1)/6,  (7.140c)
F3(α) = 0³ + 1³ + 2³ + · · · + (α − 1)³ = α²(α − 1)²/4,  (7.140d)
which are the same as what we had before. Interestingly, these expres-
sions work even when α ∈ [0, 1].
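For integer α the closed-form Faulhaber coefficients really are the power sums above — a quick pure-Python check (function names are mine):

```python
def power_sum(n, alpha):
    # F_n(alpha) = 0^n + 1^n + ... + (alpha-1)^n for integer alpha (with 0^0 = 1).
    return sum(beta**n for beta in range(alpha))

def F(n, a):
    # Closed forms from (7.140a-d).
    return [a,
            a * (a - 1) / 2,
            a * (a - 1) * (2 * a - 1) / 6,
            a**2 * (a - 1)**2 / 4][n]

for alpha in range(1, 10):
    for n in range(4):
        assert power_sum(n, alpha) == F(n, alpha)
```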

Poses

Interpolation for elements of SE(3) parallels the SO(3) case. We define the interpolation scheme as the following:

T = ( T2 T1⁻¹ )^α T1,  α ∈ [0, 1].  (7.141)

Again, this scheme ensures that T = exp(ξ∧) ∈ SE(3) as long as T1 = exp(ξ1∧), T2 = exp(ξ2∧) ∈ SE(3). Let T21 = T2 T1⁻¹ = exp(ξ21∧), so that

ξ = ln(T)∨ = ln( (T2 T1⁻¹)^α T1 )∨ = ln( exp(α ξ21∧) exp(ξ1∧) )∨
  ≈ α 𝓙(ξ1)⁻¹ ξ21 + ξ1,  (7.142)

where the approximation on the right holds if ξ21 is small. When T1 = 1, the scheme becomes

T = T2^α,  ξ = α ξ2,  (7.143)

with no approximation.

Perturbed Poses

As in the SO(3) case, it will be useful to investigate what happens to T if we perturb T1 and/or T2 a little bit. Suppose now that T′, T1′, T2′ ∈ SE(3) are the perturbed transformation matrices with the (left) differences given as

δξ = ln( T′ T⁻¹ )∨,  δξ1 = ln( T1′ T1⁻¹ )∨,  δξ2 = ln( T2′ T2⁻¹ )∨.  (7.144)

The interpolation scheme must hold for the perturbed transformation matrices:

T′ = ( T2′ T1′⁻¹ )^α T1′,  α ∈ [0, 1].  (7.145)

We are interested in finding a relationship between δξ and δξ1, δξ2. The derivation is very similar to SO(3) so we will simply state the result:

δξ = ( 1 − A(α, ξ21) ) δξ1 + A(α, ξ21) δξ2,  (7.146)

where

A(α, ξ) = α 𝓙(αξ) 𝓙(ξ)⁻¹,  (7.147)

and we note this is a 6 × 6 matrix. Again, we see this has a very nice form that mirrors the usual linear interpolation scheme. Notably, when ξ is small then A(α, ξ) ≈ α 1. In series form, we have

A(α, ξ) = Σ_{n=0}^∞ (Fn(α)/n!) (ξ⋏)ⁿ,  (7.148)

where the Fn(α) are the Faulhaber coefficients discussed earlier.

7.1.8 Homogeneous Points


As discussed in Section 6.3.1, points in R³ can be represented using 4 × 1 homogeneous coordinates (Hartley and Zisserman, 2000) as follows:

p = [ sx ; sy ; sz ; s ] = [ ε ; η ],

where s is some real, nonzero scalar, ε ∈ R³, and η is a scalar. When s is zero, it is not possible to convert back to R³, as this case represents points that are infinitely far away. Thus, homogeneous coordinates can be used to describe near and distant landmarks with no singularities or scaling issues (Triggs et al., 2000). They are also a natural representation in that points may then be transformed from one frame to another very easily using transformation matrices (e.g., p2 = T21 p1).

We will later make use of the following two operators¹⁸ for manipulating 4 × 1 columns,

[ ε ; η ]^⊙ = [ η1  −ε∧ ; 0ᵀ  0ᵀ ],   [ ε ; η ]^⊛ = [ 0  ε ; −ε∧  0 ],  (7.149)

which result in a 4 × 6 and a 6 × 4 matrix, respectively. With these definitions, we have the following useful identities,

ξ∧ p ≡ p^⊙ ξ,   pᵀ ξ∧ ≡ ξᵀ p^⊛,  (7.150)

where ξ ∈ R⁶ and p ∈ R⁴, which will prove useful when manipulating expressions involving points and poses together. We also have the identity,

(T p)^⊙ ≡ T p^⊙ 𝒯⁻¹,  (7.151)

which is similar to some others we have already seen.

18 The ⊛ operator for 4 × 1 columns is similar to the operator defined by Furgale (2011), which did not have the negative sign.
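The identities (7.150) are easy to confirm numerically. The sketch below spells out the two operators of (7.149) as plain matrices (the helper names, and the ASCII stand-ins for the book's operator symbols, are mine):

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def se3_hat(xi):
    # xi = (rho, phi) -> 4x4 [[phi^, rho], [0^T, 0]].
    X = np.zeros((4, 4))
    X[:3, :3] = hat(xi[3:])
    X[:3, 3] = xi[:3]
    return X

def op_4x6(p):
    # First operator of (7.149): [[eta*1, -eps^], [0, 0]] (4x6).
    eps, eta = p[:3], p[3]
    out = np.zeros((4, 6))
    out[:3, :3] = eta * np.eye(3)
    out[:3, 3:] = -hat(eps)
    return out

def op_6x4(p):
    # Second operator of (7.149): [[0, eps], [-eps^, 0]] (6x4).
    eps = p[:3]
    out = np.zeros((6, 4))
    out[:3, 3] = eps
    out[3:, :3] = -hat(eps)
    return out

xi = np.array([0.1, -0.4, 0.2, 0.3, -0.1, 0.5])
p = np.array([1.0, 2.0, -0.5, 1.0])

assert np.allclose(se3_hat(xi) @ p, op_4x6(p) @ xi)   # first identity of (7.150)
assert np.allclose(p @ se3_hat(xi), xi @ op_6x4(p))   # second identity of (7.150)
```

Both identities simply swap which factor carries the unknown, which is what makes them useful when differentiating expressions with respect to ξ.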

7.1.9 Calculus and Optimization


Now that we have introduced homogeneous points, we formulate a bit
of calculus to allow us to optimize functions of rotations and/or poses,
sometimes in combination with three-dimensional points. As usual, we
first study rotations and then poses.

Rotations
We have already seen in Section 6.2.5 a preview of perturbing expres-
sions in terms of their Euler angles. We first consider directly taking
the Jacobian of a rotated point with respect to the Lie algebra vector
representing the rotation:
∂(Cv)/∂φ,  (7.152)
where C = exp(φ∧ ) ∈ SO(3) and v ∈ R3 is some arbitrary three-
dimensional point.
To do this, we can start by taking the derivative with respect to a single element of φ = (φ1, φ2, φ3). Applying the definition of a derivative along the 1i direction we have

∂(Cv)/∂φi = lim_{h→0} ( exp((φ + h1i)∧) v − exp(φ∧) v ) / h,  (7.153)
which we will refer to as a directional derivative. Since we are interested in the limit of h infinitely small, we can use the approximate BCH formula to write

exp( (φ + h1i)∧ ) ≈ exp( (J h1i)∧ ) exp(φ∧) ≈ ( 1 + h (J 1i)∧ ) exp(φ∧),  (7.154)

where J is the (left) Jacobian of SO(3), evaluated at φ. Plugging this back into (7.153), we find that

∂(Cv)/∂φi = (J 1i)∧ C v = −(C v)∧ J 1i.  (7.155)
Stacking the three directional derivatives alongside one another provides the desired Jacobian:

∂(Cv)/∂φ = −(C v)∧ J.  (7.156)
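This Jacobian is easy to validate against central finite differences (NumPy sketch; helper names and test values are mine):

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def rot(phi_vec):
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    return (np.cos(phi) * np.eye(3) + (1 - np.cos(phi)) * np.outer(a, a)
            + np.sin(phi) * hat(a))

def jac(phi_vec):
    # Closed-form left Jacobian (7.73b).
    phi = np.linalg.norm(phi_vec)
    a = phi_vec / phi
    s = np.sin(phi) / phi
    return (s * np.eye(3) + (1 - s) * np.outer(a, a)
            + (1 - np.cos(phi)) / phi * hat(a))

phi_vec = np.array([0.4, -0.2, 0.3])
v = np.array([1.0, 2.0, -1.0])

# Central-difference Jacobian of C(phi) v with respect to phi.
h = 1e-6
numerical = np.zeros((3, 3))
for i in range(3):
    d = np.zeros(3)
    d[i] = h
    numerical[:, i] = (rot(phi_vec + d) @ v - rot(phi_vec - d) @ v) / (2 * h)

analytic = -hat(rot(phi_vec) @ v) @ jac(phi_vec)     # (7.156)
assert np.max(np.abs(numerical - analytic)) < 1e-6
```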
Moreover, if Cv appears inside another scalar function, u(x), with x = Cv, then we have

∂u/∂φ = (∂u/∂x)(∂x/∂φ) = −(∂u/∂x) (C v)∧ J,  (7.157)

by the chain rule of differentiation. The result is the transpose of the gradient of u with respect to φ.
If we wanted to perform simple gradient descent of our function, we could take a step in the direction of the negative gradient, evaluated at our linearization point, Cop = exp(φop∧):

φ = φop − α Jᵀ (Cop v)∧ (∂u/∂x)ᵀ|_{x=Cop v} = φop − α Jᵀ δ,  (7.158)

where α > 0 defines the step size and δ = (Cop v)∧ (∂u/∂x)ᵀ|_{x=Cop v}. We can easily see that stepping in this direction (by a small amount) will reduce the function value:

u( exp(φ∧) v ) − u( exp(φop∧) v ) ≈ −α δᵀ J Jᵀ δ ≤ 0,  (7.159)

since δᵀ J Jᵀ δ ≥ 0.

However, this is not the most streamlined way we could optimize u


with respect to C because it requires that we store our rotation as a
rotation vector, φ, which has singularities associated with it. Plus, we
need to compute the Jacobian matrix, J.
A cleaner way to carry out optimization is to find an update step for
C in the form of a small rotation on the left19 , rather than directly on
the Lie algebra rotation vector representing C:

C = exp ψ ∧ Cop . (7.160)

The previous update can actually be cast in this form by using the
19 A right-hand version is also possible.
240 Matrix Lie Groups

approximate BCH formula once again:


  ∧ 
C = exp φ∧ = exp φop − αJT δ
 ∧ 
≈ exp −α JJT δ Cop , (7.161)

or in other words we could let ψ = −αJJT δ to accomplish the same


thing as before, but this still requires that we compute J. Instead, we
can essentially just drop JJT > 0 from the update and use
ψ = −αδ, (7.162)
which still reduces the function,
T
u (Cv) − u (Cop v) ≈ − α
|δ{z δ} (7.163)
≥0

but takes a slightly different direction to do so.


Another way to look at this is that we are computing the Jacobian
with respect to ψ, where the perturbation is applied on the left20 . Along
the ψi direction we have
∂ (Cv) exp (h1∧i ) Cv − Cv
= lim
∂ψi h→0 h
(1 + h1∧i ) Cv − Cv ∧
≈ lim = − (Cv) 1i . (7.164)
h→0 h
Stacking the three directional derivatives together we have
∂ (Cv) ∧
= − (Cv) , (7.165)
∂ψ
which is the same as our previous expression but without the J.
An even simpler way to think about optimization is to skip the
derivatives altogether and think in terms of perturbations. Choose a
perturbation scheme,

C = exp ψ ∧ Cop , (7.166)
where ψ is a small perturbation applied to an initial guess, Cop . Then
insert this in the function to be optimized:
   
u (Cv) = u exp ψ ∧ Cop v ≈ u 1 + ψ ∧ Cop v
≈ u(Cop v) + δ T ψ. (7.167)
Then pick a perturbation to decrease the function. For example, gra-
dient descent suggests we would like to pick
ψ = −αDδ, (7.168)
20 This is sometimes called a (left) Lie derivative.
7.1 Geometry 241

with α > 0 a small step size and D > 0 any positive-definite matrix.
Then apply the perturbation within the scheme to update the rotation,

Cop ← exp −αDδ ∧ Cop , (7.169)
and iterate to convergence. Our scheme guarantees Cop ∈ SO(3) at
each iteration.
The perturbation idea generalizes to more interesting optimization
schemes than basic gradient descent, which can be quite slow. Consider
the alternate derivation of the Gauss-Newton optimization method
from Section 4.3.1. Suppose we have a general nonlinear, quadratic
cost function of a rotation of the form,
1X 2
J(C) = (um (Cvm )) , (7.170)
2 m
where um (·) are scalar nonlinear functions and vm ∈ R3 are three-
dimensional points. We begin with an initial guess for the optimal ro-
tation, Cop ∈ SO(3), and then perturb this (on the left) according to

C = exp ψ ∧ Cop , (7.171)
where ψ is the perturbation. We then apply our perturbation scheme
inside each um (·) so that
  
um (Cvm ) = um exp(ψ ∧ )Cop vm ≈ um 1 + ψ ∧ Cop vm

∂um ∧
≈ um (Cop vm ) − (Cop vm ) ψ, (7.172)
| {z } ∂x x=Cop vm
βm | {z }
δT
m

is a linearized version of um (·) in terms of our perturbation, ψ. Inserting


this back into our cost function we have
1 X T 2
J(C) ≈ δ m ψ + βm , (7.173)
2 m
which is exactly quadratic in ψ. Taking the derivative of V with respect
to ψ we have
∂J X  
T
= δ m δ Tm ψ + βm . (7.174)
∂ψ m

We can set the derivative to zero to find the optimal perturbation, ψ ? ,


that minimizes J:
!
X X
δmδm ψ? = −
T
βm δ m . (7.175)
m m

This is a linear system of equations, which we can solve for ψ ? . We


242 Matrix Lie Groups

then apply this optimal perturbation to our initial guess, according to


our perturbation scheme:
 ∧
Cop ← exp ψ ? Cop , (7.176)

which ensures that at each iteration we have Cop ∈ SO(3). We iterate


to convergence and then output C? = Cop at the final iteration as our
optimized rotation. This is exactly the Gauss-Newton algorithm, but
adapted to work with the matrix Lie group, SO(3), by exploiting the
surjective property of the exponential map to define an appropriate
perturbation scheme.

Poses
The same concepts can also be applied to poses. The Jacobian of a
transformed point with respect to the Lie algebra vector representing
the transformation is
∂(Tp)
= (Tp) J , (7.177)
∂ξ

where T = exp(ξ ∧ ) ∈ SE(3) and p ∈ R4 is some arbitrary three-


dimensional point, expressed in homogeneous coordinates.
However, if we perturb the transformation matrix on the left,

T ← exp (∧ ) T, (7.178)

then the Jacobian with respect to this perturbation (i.e., the (left) Lie
derivative) is simply
∂(Tp)
= (Tp) , (7.179)
∂
which removes the need to calculate the J matrix.
Finally, for optimization, suppose we have a general nonlinear, quadratic
cost function of a transformation of the form,
1X 2
J(T) = (um (Tpm )) , (7.180)
2 m

where um (·) are nonlinear functions and pm ∈ R4 are three-dimensional


points expressed in homogeneous coordinates. We begin with an initial
guess for the optimal transformation, Top ∈ SE(3), and then perturb
this (on the left) according to

T = exp (∧ ) Top , (7.181)

where  is the perturbation. We then apply our perturbation scheme


7.1 Geometry 243

inside each um (·) so that

um (Tpm ) = um (exp(∧ )Top pm ) ≈ um ((1 + ∧ ) Top pm )



∂um
≈ um (Top pm ) + (Top pm ) , (7.182)
| {z } ∂x x=Top pm
βm | {z }
δT
m

is a linearized version of um (·) in terms of our perturbation, . Inserting


this back into our cost function we have
1 X T 2
J(T) = δ m  + βm , (7.183)
2 m
which is exactly quadratic in . Taking the derivative of J with respect
to  we have
∂J X  
T
= δ m δ m ψ + βm . (7.184)
∂T m

We can set the derivative to zero to find the optimal perturbation, ? ,


that minimizes J:
!
X T
X
δ m δ m ? = − βm δ m . (7.185)
m m

This is a linear system of equations, which we can solve for ? . We then


apply this optimal perturbation to our initial guess, according to our
perturbation scheme:
 ∧
Top ← exp ? Top , (7.186)

which ensures that at each iteration we have Top ∈ SE(3). We iterate


to convergence and then output T? = Top at the final iteration as the
optimal pose. This is exactly the Gauss-Newton algorithm, but adapted
to work with the matrix Lie group, SE(3), by exploiting the surjective
property of the exponential map to define an appropriate perturbation
scheme.

Gauss-Newton Discussion
This approach to Gauss-Newton optimization for our matrix Lie groups
where we use a customized perturbation scheme has three key proper-
ties:
(i) we are storing our rotation or pose in a singularity-free format,
(ii) at each iteration we are performing unconstrained optimiza-
tion,
(iii) our manipulations occur at the matrix level so that we do not
need to worry about taking the derivatives of a bunch of scalar
trigonometric functions, which can easily lead to mistakes.
SO(3) Identities and Approximations

Lie Algebra Lie Group (left) Jacobian

u1 0 −u3 u2
0
 ∧  

u3 −u2 u1 0 φ = φa

u∧ = u2  =  u3 −u1 

∧ ∧
T 1
(αu + βv) ≡ αu + βv aT a ≡ 1
u∧ ≡ −u∧ CT C ≡ 1 ≡ CCT J= 0
Cα dα ≡n=0 (n+1)!
φ∧ ≈ 1 + 12 φ∧
φ ∧
a
R1 P∞ n

φ
u∧ v ≡ −v∧ u tr(C) ≡ 2 cos φ + 1 J ≡ sinφ φ 1 + 1 − sinφ φ aaT + 1−cos
∞ B ∧ n 1 ∧
 

n
u∧ u ≡ 0 J−1 ≡ n=0 n!n φ
∧ ∧
(Wu) ≡ u (tr(W) 1 − W) − WT u∧ C = exp φ∧ ≡ n=0 n! φ∧ ≈ 1 + φ∧ −1
P 


n
 det(C)
P∞ 1
≡1 

u∧ v∧ ≡ −(uT v) 1 + vuT
≈ 1 − 2 φ

exp (φ + δφ) ≈ exp ((J δφ)∧ ) exp φ∧


u∧ W v∧ ≡ − (−tr(vuT ) 1 + vuT ) C−1 ≡ CT ≡ n=0 n! −φ∧ ≈ 1 − φ∧
J ≡ φ2 cot φ2 1 + 1 − φ2 cot φ2 aaT − φ2 a∧

C ≡ 1 + φ∧ J
C ≡ cos φ1 + (1 − cos φ)aaT + sin φa∧
P∞ 1

× (−tr(W) 1 + WT ) Ca ≡ a J(φ) ≡ C J(−φ)


+ tr(WT vuT ) 1 − WT vuT Cφ = φ
∧ ∧ ∧
u v u ≡ u∧ u∧ v∧ + v∧ u∧ u∧ + (uT u) v∧ Ca∧ ≡ a∧ C exp δφ∧ C ≈ 1 + (A(α, φ) δφ) Cα
u∧ u∧ u∧ ≡ −(uT u) u∧ Cφ∧ ≡ φ∧ C
A(α, φ) = α J(αφ)J(φ)−1 = n=0 Fnn!(α) φ∧
 α ∧


u∧ v∧ v∧ − v∧ v∧ u∧ ≡ (v∧ u∧ v)∧ (Cu) ≡ Cu∧ CT

P∞ n

n
u∧ ≡ (u∧ v) exp ((Cu)∧ ) ≡ C exp (u∧ ) CT
∧ ∧
u , u , . . . u , v∧ . . . ≡ ((u∧ ) v)
n
 ∧  ∧
[u∧ , v∧ ] ≡u∧ v∧ − v∧

| {z }

α, β ∈ R, u, v, φ, δφ ∈ R3 , W, A, J ∈ R3×3 , C ∈ SO(3)
SE(3) Identities and Approximations

Below, x = [u ; v] and y are elements of R6 (u the translational part, v the rotational part), ξ = [ρ ; φ], and Ad(T) denotes the 6 × 6 adjoint of T. The 6 × 6 (left) Jacobian is written J and is built from the 3 × 3 blocks J and Q.

Lie Algebra:
x∧ = [u ; v]∧ = [v∧, u ; 0T, 0]  (4 × 4),  xf = [u ; v]f = [v∧, u∧ ; 0, v∧]  (6 × 6)
(αx + βy)∧ ≡ αx∧ + βy∧, (αx + βy)f ≡ αxf + βyf
xf y ≡ −yf x
xf x ≡ 0
[x∧, y∧] ≡ x∧y∧ − y∧x∧ ≡ (xf y)∧
[xf, yf] ≡ xf yf − yf xf ≡ (xf y)f
[x∧, [x∧, . . . [x∧, y∧] . . .]] ≡ ((xf)n y)∧,  [xf, [xf, . . . [xf, yf] . . .]] ≡ ((xf)n y)f
p = [ε ; η], p⊙ = [η1, −ε∧ ; 0T, 0T], and x∧p ≡ p⊙x

Lie Group:
T = [C, Jρ ; 0T, 1] = [C, r ; 0T, 1] = exp(ξ∧) ≡ ∑n≥0 (1/n!)(ξ∧)n ≈ 1 + ξ∧
Ad(T) = [C, (Jρ)∧C ; 0, C] = exp(ξf) ≡ ∑n≥0 (1/n!)(ξf)n ≈ 1 + ξf
T−1 = [CT, −CTr ; 0T, 1] ≡ exp(−ξ∧) ≈ 1 − ξ∧
Ad(T)−1 = [CT, −CT(Jρ)∧ ; 0, CT] ≡ exp(−ξf) ≈ 1 − ξf
tr(T) ≡ 2 cos φ + 2, det(T) ≡ 1
Ad(T1T2) = Ad(T1) Ad(T2)
Ad(T)ξ ≡ ξ, Tξ∧ ≡ ξ∧T, Ad(T)ξf ≡ ξf Ad(T)
(Ad(T)x)∧ ≡ Tx∧T−1, (Ad(T)x)f ≡ Ad(T) xf Ad(T)−1
exp((Ad(T)x)∧) ≡ T exp(x∧) T−1, exp((Ad(T)x)f) ≡ Ad(T) exp(xf) Ad(T)−1
(Tp)⊙ ≡ T p⊙ Ad(T)−1
exp((ξ + δξ)∧) ≈ exp((J δξ)∧) exp(ξ∧), exp((ξ + δξ)f) ≈ exp((J δξ)f) exp(ξf)
(exp(δξ∧) T)α ≈ (1 + (A(α, ξ) δξ)∧) Tα

(left) Jacobian:
J = ∫₀¹ Ad(T)α dα = [J, Q ; 0, J] ≡ ∑n≥0 (1/(n + 1)!)(ξf)n ≈ 1 + (1/2)ξf
J −1 = [J−1, −J−1QJ−1 ; 0, J−1] ≡ ∑n≥0 (Bn/n!)(ξf)n ≈ 1 − (1/2)ξf
Q = ∑n≥0 ∑m≥0 (1/(n + m + 2)!)(φ∧)n ρ∧ (φ∧)m
  = (1/2)ρ∧ + ((φ − sin φ)/φ3)(φ∧ρ∧ + ρ∧φ∧ + φ∧ρ∧φ∧)
  + ((1 − φ2/2 − cos φ)/φ4)(φ∧φ∧ρ∧ + ρ∧φ∧φ∧ − 3φ∧ρ∧φ∧)
  − (1/2)((1 − φ2/2 − cos φ)/φ4 − 3(φ − sin φ − φ3/6)/φ5)(φ∧ρ∧φ∧φ∧ + φ∧φ∧ρ∧φ∧)
Ad(T) ≡ 1 + ξf J
J (ξ) ≡ Ad(T) J (−ξ)
J ξf ≡ ξf J
A(α, ξ) = α J (αξ) J (ξ)−1 = ∑n≥0 (Fn(α)/n!)(ξf)n

(α, β ∈ R; u, v, φ, δφ ∈ R3; p ∈ R4; x, y, ξ, δξ ∈ R6; C ∈ SO(3); J, Q ∈ R3×3; T, T1, T2 ∈ SE(3); Ad(T) ∈ Ad(SE(3)); J, A ∈ R6×6)

246 Matrix Lie Groups

This makes implementation quite straightforward. We can also easily


incorporate both of the practical patches to Gauss-Newton that were
outlined in Section 4.3.1 (a line search and Levenberg-Marquardt) as
well as the ideas from robust estimation described in Section 5.3.2.
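As a concrete illustration of points (i)–(iii), here is a minimal sketch (assuming numpy is available) of this style of iteration for the toy problem of averaging rotation-matrix measurements: the estimate is stored as a rotation matrix (singularity-free), and each update is an unconstrained solve for a small δφ applied through the exponential map. The problem setup and function names are illustrative, not from the text.

```python
import numpy as np

def hat(phi):
    """3x3 skew-symmetric matrix from a 3x1 vector."""
    return np.array([[0.0, -phi[2], phi[1]],
                     [phi[2], 0.0, -phi[0]],
                     [-phi[1], phi[0], 0.0]])

def exp_so3(phi):
    """Closed-form exponential map (Rodrigues' formula)."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3) + hat(phi)
    a = phi / angle
    return (np.cos(angle) * np.eye(3)
            + (1.0 - np.cos(angle)) * np.outer(a, a)
            + np.sin(angle) * hat(a))

def log_so3(C):
    """Inverse of exp_so3, valid for rotation angles below pi."""
    angle = np.arccos(np.clip((np.trace(C) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-12:
        return np.zeros(3)
    axis = np.array([C[2, 1] - C[1, 2], C[0, 2] - C[2, 0], C[1, 0] - C[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))

def average_rotations(C_meas, iters=20):
    """Gauss-Newton-style averaging with a left perturbation update.

    The estimate C stays in SO(3); each iteration is an unconstrained
    solve for dphi (here, treating the residual Jacobians as identity,
    it reduces to the mean residual), applied as C <- exp(dphi^) C.
    """
    C = np.array(C_meas[0])
    for _ in range(iters):
        dphi = np.mean([log_so3(Ci @ C.T) for Ci in C_meas], axis=0)
        C = exp_so3(dphi) @ C
    return C
```

Note how no constraint on C ever has to be enforced explicitly: every intermediate estimate is a product of exponentials and hence in SO(3) by construction.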

7.1.10 Identities
We have seen many identities and expressions in this section related
to our matrix Lie groups, SO(3) and SE(3). The previous two pages
summarize these. The first page provides identities for SO(3) and the
second for SE(3).

7.2 Kinematics
We have seen how the geometry of a Lie group works. The next step
is to allow the geometry to change over time. We will work out the
kinematics associated with our two Lie groups, SO(3) and SE(3).

7.2.1 Rotations
We have already seen the kinematics of rotations in the previous chap-
ter, but this was before we introduced Lie groups.

Lie Group
We know that a rotation matrix can be written as

C = exp φ∧ , (7.187)

where C ∈ SO(3) and φ = φ a ∈ R3 . The rotational kinematic equation


relating angular velocity, ω, to rotation (i.e., Poisson’s equation) is
given by21
Ċ = ω ∧ C or ω ∧ = ĊCT . (7.188)

We will refer to this as kinematics of the Lie group; these equations are
singularity-free since they are in terms of C, but have the constraint
that CCT = 1. Due to the surjective property of the exponential map
from so(3) to SO(3), we can also work out the kinematics in terms of
the Lie Algebra.
21 Compared to (6.45) in our earlier development, this ω is opposite in sign. This is
because we have adopted the robotics convention described in Section 6.3.2 for the
angle of rotation and this leads to the form in (7.187); this in turn means we must use
the angular velocity associated with that angle and this is opposite in sign to the one
we discussed earlier.

Lie Algebra
To see the equivalent kinematics in terms of the Lie algebra, we need
to differentiate C:
Ċ = (d/dt) exp(φ∧) = ∫₀¹ exp(αφ∧) φ̇∧ exp((1 − α)φ∧) dα,  (7.189)

where the last relationship comes from the general expression for the
time derivative of the matrix exponential:
(d/dt) exp(A(t)) = ∫₀¹ exp(αA(t)) (dA(t)/dt) exp((1 − α)A(t)) dα.  (7.190)
From (7.189) we can rearrange to have
ĊCT = ∫₀¹ Cα φ̇∧ C−α dα = ∫₀¹ (Cα φ̇)∧ dα = ((∫₀¹ Cα dα) φ̇)∧ = (J φ̇)∧,  (7.191)
where J = ∫₀¹ Cα dα is the (left) Jacobian for SO(3) that we saw earlier.
Comparing (7.188) and (7.191) we have the pleasing result that
ω = J φ̇, (7.192)
or
φ̇ = J−1 ω, (7.193)
which is an equivalent expression for the kinematics, but in terms of
the Lie algebra. Note that J−1 does not exist at |φ| = 2πm, where m
is a non-zero integer, due to singularities of the 3 × 1 representation of
rotation; the good news is that we no longer have constraints to worry
about.
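To make the role of J concrete, the closed-form expression for the (left) Jacobian (from the SO(3) identities) can be checked against its defining integral, J = ∫₀¹ Cα dα, by simple quadrature. A sketch, assuming numpy and scipy are available; the test vector φ is made up for illustration:

```python
import numpy as np
from scipy.linalg import expm

def hat(phi):
    return np.array([[0.0, -phi[2], phi[1]],
                     [phi[2], 0.0, -phi[0]],
                     [-phi[1], phi[0], 0.0]])

def left_jacobian(phi_vec):
    """Closed-form (left) Jacobian of SO(3)."""
    angle = np.linalg.norm(phi_vec)
    if angle < 1e-8:
        return np.eye(3) + 0.5 * hat(phi_vec)  # series to first order
    a = phi_vec / angle
    s, c = np.sin(angle), np.cos(angle)
    return ((s / angle) * np.eye(3)
            + (1.0 - s / angle) * np.outer(a, a)
            + ((1.0 - c) / angle) * hat(a))

# midpoint-rule quadrature of J = int_0^1 C^alpha dalpha, C^alpha = exp(alpha phi^)
phi = np.array([0.4, -0.3, 0.7])
n = 2000
J_quad = sum(expm(((k + 0.5) / n) * hat(phi)) for k in range(n)) / n
```

The two agree to quadrature accuracy, which is a useful sanity check when implementing these formulas.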

Numerical Integration
Because φ has no constraints, we can use any numerical method we like
to integrate (7.193). The same is not true if we want to integrate (7.188)
directly, since we must enforce the constraint that CCT = 1. There are
a few simple strategies we can use to do this.
One approach is to assume that ω is piecewise constant. Suppose
ω is constant between two times, t1 and t2 . In this case, (7.188) is a
linear, time-invariant, ordinary differential equation and we know the
solution will be of the form
C(t2 ) = exp ((t2 − t1 ) ω ∧ ) C(t1 ),  (7.194)
where we note that the factor C21 = exp((t2 − t1 ) ω ∧ ) is in fact in the correct form to be a rotation matrix. Let the rotation vector be
φ = φ a = (t2 − t1 ) ω,  (7.195)
with angle, φ = |φ|, and axis, a = φ/φ. Then construct the rotation
matrix through our usual closed-form expression:
C21 = cos φ 1 + (1 − cos φ) aaT + sin φ a∧ . (7.196)
The update then proceeds as
C(t2 ) = C21 C(t1 ), (7.197)
which mathematically guarantees that C(t2 ) will be in SO(3) since
C21 , C(t1 ) ∈ SO(3). Repeating this over and over for many small time
intervals allows us to integrate the equation numerically.
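A minimal sketch of this piecewise-constant update (assuming numpy/scipy; the function names are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def hat(phi):
    return np.array([[0.0, -phi[2], phi[1]],
                     [phi[2], 0.0, -phi[0]],
                     [-phi[1], phi[0], 0.0]])

def integrate_rotation(C0, omega_of_t, t0, t1, n_steps):
    """Integrate Cdot = omega^ C, holding omega constant over each step.

    Each update is C <- exp(dt * omega^) C, so C remains in SO(3) by
    construction (up to floating-point error).
    """
    C = np.array(C0)
    dt = (t1 - t0) / n_steps
    for k in range(n_steps):
        omega = omega_of_t(t0 + k * dt)
        C = expm(hat(dt * omega)) @ C
    return C
```

For constant ω this recovers the closed-form solution exactly; for time-varying ω the error shrinks with the step size.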
However, even if we do follow an integration approach (such as the
one above) that claims to keep the computed rotation in SO(3), small
numerical errors may eventually cause the result to depart SO(3) through
violation of the orthogonality constraint. A common solution is to periodically ‘project’ the computed rotation, C ∉ SO(3), back onto SO(3).
In other words, we can try to find the rotation matrix, R ∈ SO(3), that
is closest to C in some sense. We do this by solving the following opti-
mization problem (Green, 1952):
arg maxR J(R),  J(R) = tr(CRT) − (1/2) ∑i=1..3 ∑j=1..3 λij (rTi rj − δij),  (7.198)
(the double sum comprises the Lagrange multiplier terms)

where the Lagrange multiplier terms are necessary to enforce the RRT =
1 constraint. Note that δij is the Kronecker delta and
RT = [r1 r2 r3],  CT = [c1 c2 c3].  (7.199)
We also note that

tr CRT = rT1 c1 + rT2 c2 + rT3 c3 . (7.200)
We then take the derivative of J with respect to the three rows of R
revealing
∂J/∂rTi = ci − ∑j=1..3 λij rj,  ∀i = 1 . . . 3.  (7.201)

Setting this to zero ∀i = 1 . . . 3 we have that
[r1 r2 r3] [λ11, λ12, λ13 ; λ21, λ22, λ23 ; λ31, λ32, λ33] = [c1 c2 c3],  (7.202)
that is, RT ΛT = CT with Λ = [λij].

Note, however, that Λ can be assumed to be symmetric owing to the symmetry of the Lagrange multiplier terms. Thus, what we know so far is that
ΛR = C,  Λ = ΛT,  RTR = RRT = 1.
We can solve for Λ by noticing
Λ2 = ΛΛT = Λ RRT ΛT = CCT  ⇒  Λ = (CCT)1/2,
with (·)1/2 indicating a matrix square root. Finally,
R = (CCT)−1/2 C,
which simply looks like we are ‘normalizing’ C. Computing the projec-
tion whenever the orthogonality constraint is not satisfied (to within
some threshold) and then overwriting the integrated value,
C ← R, (7.203)
ensures we do not stray too far from SO(3)22 .
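The projection R = (CCT)−1/2 C can be implemented directly; a sketch assuming numpy, with the symmetric matrix square root taken via an eigendecomposition of CCT (and det C > 0 assumed, per the footnote):

```python
import numpy as np

def project_to_so3(C):
    """'Normalize' C via R = (C C^T)^(-1/2) C; assumes det(C) > 0."""
    w, V = np.linalg.eigh(C @ C.T)  # C C^T is symmetric positive definite
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return inv_sqrt @ C

# example: a rotation that has drifted slightly off SO(3)
rng = np.random.default_rng(1)
C_drifted = np.eye(3) + 1e-3 * rng.standard_normal((3, 3))
R = project_to_so3(C_drifted)
```

After projection, R satisfies RRT = 1 and det R = 1 to machine precision for small drifts of this kind.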

7.2.2 Poses
There is an analogous approach to kinematics for SE(3) that we will
develop next.

Lie Group
We have seen that a transformation matrix can be written as
T = [C, r ; 0T, 1] = [C, Jρ ; 0T, 1] = exp(ξ∧),  (7.204)
where ξ = [ρ ; φ].
Suppose the kinematics in terms of separated translation and rotation
are given by
ṙ = ω ∧ r + ν, (7.205a)
Ċ = ω ∧ C, (7.205b)
22 Technically, this matrix square-root approach only works under certain conditions. For
some pathological C matrices, it can produce an R where det R = −1 instead of
det R = 1, as desired. This is because we have not enforced the det R = 1 constraint in
our optimization properly. A more rigorous method, based on singular-value
decomposition, is presented later in Section 8.1.3 that handles the difficult cases. A
sufficient test to know whether this matrix square-root approach will work is to check
that det C > 0 before applying it. This should almost always be true in real situations
where our integration step is small. If it is not true, the detailed method in
Section 8.1.3 should be used.
where ν and ω are the translational and rotational velocities, respectively. Using transformation matrices, this can be written equivalently as
Ṫ = $ ∧ T or $ ∧ = ṪT−1 , (7.206)
where $ = [ν ; ω] is the generalized velocity23. Again, these equations are singularity-free
but still have the constraint that CCT = 1.

Lie Algebra
Again, we can find an equivalent set of kinematics in terms of the Lie
algebra. As in the rotation case, we have that
Ṫ = (d/dt) exp(ξ∧) = ∫₀¹ exp(αξ∧) ξ̇∧ exp((1 − α)ξ∧) dα,  (7.207)

or equivalently
ṪT−1 = ∫₀¹ Tα ξ̇∧ T−α dα = ∫₀¹ (T α ξ̇)∧ dα = ((∫₀¹ T α dα) ξ̇)∧ = (J ξ̇)∧,  (7.208)
where T = Ad(T) and J = ∫₀¹ T α dα is the (left) Jacobian for SE(3). Comparing (7.206)
and (7.208) we have that
$ = J ξ̇, (7.209)
or
ξ̇ = J −1 $, (7.210)
for our equivalent kinematics in terms of the Lie algebra. Again, these
equations are now free of constraints.

Hybrid
There is, however, another way to propagate the kinematics by noting
that the equation for ṙ is actually linear in the velocity. By combining
the equations for ṙ and φ̇ we have
[ṙ ; φ̇] = [1, −r∧ ; 0, J−1] [ν ; ω],  (7.211)
23 We can also write the kinematics equivalently in 6 × 6 format:

Ṫ = $ f T .

which still has singularities at the singularities of J−1 , but no longer


requires us to evaluate Q and avoids the conversion, r = Jρ, after we
integrate. This approach is also free of constraints. We can refer to this
as a hybrid method as the translation is kept in the usual space and
the rotation is kept in the Lie algebra.

Numerical Integration
Similarly to the SO(3) approach, we can integrate (7.210) without wor-
rying about constraints, but integrating (7.206) takes a little more care.
Just as in the SO(3) approach we could assume that $ is piecewise
constant. Suppose $ is constant between two times, t1 and t2 . In this
case, (7.206) is a linear, time-invariant, ordinary differential equation
and we know the solution will be of the form
T(t2 ) = exp ((t2 − t1 ) $ ∧ ) T(t1 ),  (7.212)
where we note that the factor T21 = exp((t2 − t1 ) $ ∧ ) is in fact in the correct form to be a transformation matrix. Let
ξ = [ρ ; φ] = (t2 − t1 ) [ν ; ω] = (t2 − t1 ) $,  (7.213)
with angle, φ = |φ|, and axis, a = φ/φ. Then construct the rotation
matrix through our usual closed-form expression:
C = cos φ 1 + (1 − cos φ) aaT + sin φ a∧ . (7.214)
Build J and calculate r = Jρ. Assemble C and r into
T21 = [C, r ; 0T, 1].  (7.215)
The update then proceeds as
T(t2 ) = T21 T(t1 ), (7.216)
which mathematically guarantees that T(t2 ) will be in SE(3) since
T21 , T(t1 ) ∈ SE(3). Repeating this over and over for many small time
intervals allows us to integrate the equation numerically. We can also
project the upper-left, rotation matrix part of T back onto SO(3) pe-
riodically24 , reset the lower left block to 0T , and reset the lower-right
block to 1, to ensure T does not stray too far from SE(3).
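The same piecewise-constant update can be sketched for SE(3) (assuming numpy/scipy); here a generic matrix exponential stands in for the closed-form construction of C, J, and r = Jρ described above:

```python
import numpy as np
from scipy.linalg import expm

def wedge_se3(xi):
    """4x4 matrix xi^ from a 6x1 vector xi = [rho; phi]."""
    rho, phi = xi[:3], xi[3:]
    Xi = np.zeros((4, 4))
    Xi[:3, :3] = [[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]]
    Xi[:3, 3] = rho
    return Xi

def integrate_pose(T0, varpi_of_t, t0, t1, n_steps):
    """Integrate Tdot = varpi^ T with piecewise-constant generalized velocity."""
    T = np.array(T0)
    dt = (t1 - t0) / n_steps
    for k in range(n_steps):
        T = expm(wedge_se3(dt * varpi_of_t(t0 + k * dt))) @ T
    return T
```

Each factor is a valid transformation matrix, so the product stays in SE(3) up to floating-point error.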

With Dynamics
We can augment our kinematic equation for pose with an equation for the translational/rotational dynamics (i.e., Newton’s second law) as follows (D’Eleuterio, 1985):
24 See the numerical integration section on rotations, above, for the details.


kinematics: Ṫ = $ ∧ T, (7.217a)
dynamics: $̇ = M−1 $ f M$ + a, (7.217b)
where T ∈ SE(3) is the pose, $ ∈ R6 is the generalized velocity (in
the body frame), a ∈ R6 is a generalized applied force (per mass, in the
body frame), and M ∈ R6×6 is a generalized mass matrix of the form
M = [m1, −mc∧ ; mc∧, I],  (7.218)
with m the mass, c ∈ R3 the center of mass, and I ∈ R3×3 the inertia
matrix, all in the body frame.

7.2.3 Linearized Rotations


Lie Group
We can also perturb our kinematics about some nominal solution (i.e.,
linearize), both in the Lie group and the Lie algebra. We begin with
the Lie group. Consider the following perturbed rotation matrix, C′ ∈ SO(3):
C′ = exp(δφ∧) C ≈ (1 + δφ∧) C,  (7.219)
where C ∈ SO(3) is the nominal rotation matrix and δφ ∈ R3 is a perturbation as a rotation vector. The perturbed kinematics equation, Ċ′ = ω′∧ C′ (with ω′ = ω + δω), becomes
(d/dt)((1 + δφ∧) C) ≈ (ω + δω)∧ (1 + δφ∧) C,  (7.220)

after inserting our perturbation scheme. Dropping products of small terms, we can manipulate this into a pair of equations:
nominal kinematics: Ċ = ω ∧ C, (7.221a)
perturbation kinematics: δ φ̇ = ω ∧ δφ + δω, (7.221b)
which can be integrated separately and combined to provide the com-
plete solution (approximately).

Lie Algebra
Perturbing the kinematics in the Lie algebra is more difficult but equivalent. In terms of quantities in the Lie algebra, we have
φ′ = φ + J(φ)−1 δφ,  (7.222)
where φ′ = ln(C′)∨ is the perturbed rotation vector, φ = ln(C)∨ the nominal rotation vector, and δφ the same perturbation as in the Lie group case.

We start with the perturbed kinematics, φ̇′ = J(φ′)−1 ω′, and then inserting our perturbation scheme we have
(d/dt)(φ + J(φ)−1 δφ) ≈ (J(φ) + δJ)−1 (ω + δω).  (7.223)

We obtain δJ through a perturbation of J(φ′) directly:
J(φ′) = ∫₀¹ (C′)α dα = ∫₀¹ (exp(δφ∧) C)α dα ≈ ∫₀¹ (1 + (A(α, φ) δφ)∧) Cα dα
 = ∫₀¹ Cα dα + ∫₀¹ (α J(αφ)J(φ)−1 δφ)∧ Cα dα,  (7.224)
the first integral being J(φ) and the second δJ.

where we have used the perturbed interpolation formula from Section 7.1.7. Manipulating the perturbed kinematics equation we have
φ̇ − J(φ)−1 J̇(φ)J(φ)−1 δφ + J(φ)−1 δ φ̇ ≈ (J(φ)−1 − J(φ)−1 δJ J(φ)−1) (ω + δω).  (7.225)
Multiplying out, dropping the nominal solution, φ̇ = J(φ)−1 ω, as well as products of small terms, we have
δ φ̇ = J̇(φ) J(φ)−1 δφ − δJ φ̇ + δω.  (7.226)

Substituting in the identity25
J̇(φ) − ω∧ J(φ) ≡ ∂ω/∂φ,  (7.227)
we have
δ φ̇ = ω∧ δφ + δω + (∂ω/∂φ) J(φ)−1 δφ − δJ φ̇,  (7.228)

25 This identity is well-known in the dynamics literature (Hughes, 1986).
which is the same as the Lie group result for the perturbation kinematics, but with an extra term (the last two terms of (7.228)); it turns out this extra term is zero:
 
(∂ω/∂φ) J(φ)−1 δφ = (∂(J(φ)φ̇)/∂φ) J(φ)−1 δφ
 = (∂/∂φ ∫₀¹ Cα φ̇ dα) J(φ)−1 δφ = (∫₀¹ ∂(Cα φ̇)/∂φ dα) J(φ)−1 δφ
 = −(∫₀¹ α (Cα φ̇)∧ J(αφ) dα) J(φ)−1 δφ
 = (∫₀¹ (α J(αφ)J(φ)−1 δφ)∧ Cα dα) φ̇ = δJ φ̇,  (7.229)

where we have used an identity derived back in Section 6.2.5 for the
derivative of a rotation matrix times a vector with respect to a three-
parameter representation of rotation. Thus, our pair of equations is

nominal kinematics: φ̇ = J(φ)−1 ω, (7.230a)


perturbation kinematics: δ φ̇ = ω ∧ δφ + δω, (7.230b)

which can be integrated separately and combined to provide the complete solution (approximately).

Solutions Commute
It is worth asking whether integrating the full solution is (approxi-
mately) equivalent to integrating the nominal and perturbation equa-
tions separately and then combining them. We show this for the Lie
group kinematics. The perturbed solution will be given by:
C′(t) = C′(0) + ∫₀ᵗ ω′(s)∧ C′(s) ds.  (7.231)

Breaking this into nominal and perturbation parts we have
C′(t) ≈ (1 + δφ(0)∧) C(0) + ∫₀ᵗ (ω(s) + δω(s))∧ (1 + δφ(s)∧) C(s) ds
 ≈ C(0) + ∫₀ᵗ ω(s)∧ C(s) ds + δφ(0)∧ C(0) + ∫₀ᵗ (ω(s)∧ δφ(s)∧ C(s) + δω(s)∧ C(s)) ds
 ≈ (1 + δφ(t)∧) C(t),  (7.232)
where the first two terms on the middle line equal C(t) and the remaining terms equal δφ(t)∧ C(t),

which is the desired result. The rightmost integral on the second line
can be computed by noting that

(d/dt)(δφ∧ C) = δ φ̇∧ C + δφ∧ Ċ = (ω∧ δφ + δω)∧ C + δφ∧ ω∧ C
 = ω∧ δφ∧ C − δφ∧ ω∧ C + δω∧ C + δφ∧ ω∧ C
 = ω∧ δφ∧ C + δω∧ C,  (7.233)

where we have used the nominal and perturbation kinematics.

Integrating the Solutions


In this section, we make some observations about integrating the nom-
inal and perturbation kinematics. The nominal equation is nonlinear
and can be integrated numerically (using either the Lie group or Lie
algebra equations). The perturbation kinematics,

δ φ̇(t) = ω(t)∧ δφ(t) + δω(t), (7.234)

is a linear time-varying (LTV) equation of the form

ẋ(t) = A(t) x(t) + B(t) u(t). (7.235)

The general solution to the initial value problem is given by


x(t) = Φ(t, 0) x(0) + ∫₀ᵗ Φ(t, s) B(s) u(s) ds,  (7.236)

where Φ(t, s) is called the state transition matrix and satisfies

Φ̇(t, s) = A(t) Φ(t, s),


Φ(t, t) = 1.

The state transition matrix always exists and is unique, but it cannot al-
ways be found analytically. Fortunately, for our particular perturbation
equation, we can express the 3×3 state transition matrix analytically26 :

Φ(t, s) = C(t)C(s)T . (7.237)

The solution is therefore given by


δφ(t) = C(t)C(0)T δφ(0) + C(t) ∫₀ᵗ C(s)T δω(s) ds.  (7.238)

26 The nominal rotation matrix, C(t), is the fundamental matrix of the state transition
matrix.

We need the solution to the nominal equation, C(t), but this is readily
available. To see this is indeed the correct solution, we can differentiate:
δ φ̇(t) = Ċ(t)C(0)T δφ(0) + Ċ(t) ∫₀ᵗ C(s)T δω(s) ds + C(t) C(t)T δω(t)
 = ω(t)∧ (C(t)C(0)T δφ(0) + C(t) ∫₀ᵗ C(s)T δω(s) ds) + δω(t)
 = ω(t)∧ δφ(t) + δω(t),  (7.239)

which is the original differential equation for δφ(t). We also see that
our state transition matrix satisfies the required conditions:
Φ̇(t, s) = (d/dt)(C(t)C(s)T) = Ċ(t)C(s)T = ω(t)∧ C(t)C(s)T = A(t) Φ(t, s),  (7.240a)
Φ(t, t) = C(t)C(t)T = 1.  (7.240b)

Thus, we have everything we need to integrate the perturbation kinematics as long as we can also integrate the nominal kinematics.
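As a sketch of how (7.238) can be used in practice (assuming numpy/scipy; the constant nominal ω and the inputs are made up for illustration), the nominal C(t) is available in closed form here, and the integral is approximated by a Riemann sum:

```python
import numpy as np
from scipy.linalg import expm

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

omega = np.array([0.1, -0.2, 0.3])          # constant nominal angular velocity
C = lambda t: expm(hat(omega * t))          # nominal solution
Phi = lambda t, s: C(t) @ C(s).T            # state transition matrix (7.237)

def delta_phi(t, dphi0, domega, n=1000):
    """Solution (7.238) for a constant perturbation input domega."""
    ds = t / n
    integral = sum(C(k * ds).T @ domega * ds for k in range(n))
    return Phi(t, 0.0) @ dphi0 + C(t) @ integral
```

Comparing this against a direct Euler integration of the perturbation ODE is a quick way to confirm the transition-matrix solution.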

7.2.4 Linearized Poses


We will only briefly summarize the perturbed kinematics for SE(3) as
they are quite similar to the SO(3) case.

Lie Group
We will use the perturbation,
T′ = exp(δξ∧) T ≈ (1 + δξ∧) T,  (7.241)
with T′, T ∈ SE(3) and δξ ∈ R6. The perturbed kinematics,
Ṫ′ = $′∧ T′,  (7.242)
can then be broken into nominal and perturbation kinematics:
nominal kinematics: Ṫ = $∧ T,  (7.243a)
perturbation kinematics: δ ξ̇ = $f δξ + δ$,  (7.243b)
where $′ = $ + δ$. These can be integrated separately and combined to provide the complete solution (approximately).

Integrating the Solutions


The 6 × 6 transition matrix for the perturbation equation is
Φ(t, s) = T (t) T (s)−1 , (7.244)
where T = Ad(T). The solution for δξ(t) is
δξ(t) = T (t) T (0)−1 δξ(0) + T (t) ∫₀ᵗ T (s)−1 δ$(s) ds.  (7.245)
Differentiating recovers the perturbation kinematics, where we require the 6 × 6 version of the nominal kinematics in the derivation:
Ṫ (t) = $(t)f T (t), (7.246)
which is equivalent to the 4 × 4 version.

With Dynamics
We can also perturb the joint kinematics/dynamics equations in (7.217).
We consider perturbing all of the quantities around some operating
points as follows:

T′ = exp(δξ∧) T,  $′ = $ + δ$,  a′ = a + δa,  (7.247)
so that the kinematics/dynamics are
Ṫ′ = $′∧ T′,  (7.248a)
$̇′ = M−1 $′f M$′ + a′.  (7.248b)
If we think of δa as an unknown noise input, then we would like to
know how this turns into uncertainty on the pose and velocity vari-
ables through the chain of dynamics and kinematics. Substituting the
perturbations into the motion models, we can separate into a (nonlin-
ear) nominal motion model,
nominal kinematics: Ṫ = $ ∧ T, (7.249a)
nominal dynamics: $̇ = M−1 $ f M$ + a, (7.249b)
and (linear) perturbation motion model,
perturbation kinematics: δ ξ̇ = $f δξ + δ$,  (7.250a)
perturbation dynamics: δ $̇ = M−1($f M − (M$)f) δ$ + δa,  (7.250b)
which we can write in combined matrix form:
[δ ξ̇ ; δ $̇] = [$f, 1 ; 0, M−1($f M − (M$)f)] [δξ ; δ$] + [0 ; δa].  (7.251)
Finding the transition matrix for this LTV SDE may be difficult, but
it can be integrated numerically.
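A numerical-integration sketch for this combined system (assuming numpy; the mass matrix M and the nominal velocity trajectory are placeholders): the 12 × 12 state transition matrix is propagated by simple Euler steps of Φ̇ = A(t) Φ.

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def curly_wedge(x):
    """6x6 operator appearing in the perturbation kinematics/dynamics."""
    rho, phi = x[:3], x[3:]
    out = np.zeros((6, 6))
    out[:3, :3] = hat(phi)
    out[:3, 3:] = hat(rho)
    out[3:, 3:] = hat(phi)
    return out

def transition_matrix(varpi_traj, M, dt):
    """Euler-integrate Phidot = A(t) Phi for the system matrix in (7.251)."""
    Phi = np.eye(12)
    Minv = np.linalg.inv(M)
    for varpi in varpi_traj:
        A = np.zeros((12, 12))
        A[:6, :6] = curly_wedge(varpi)
        A[:6, 6:] = np.eye(6)
        A[6:, 6:] = Minv @ (curly_wedge(varpi) @ M - curly_wedge(M @ varpi))
        Phi = Phi + dt * (A @ Phi)
    return Phi
```

A higher-order integrator (or smaller steps) would normally be used; the Euler form is only meant to show the structure of the system matrix.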

7.3 Probability and Statistics


We have seen throughout this chapter that elements of matrix Lie
groups do not satisfy some basic operations that we normally take for
granted. This theme continues when working with random variables.
For example, we often work with Gaussian random variables, which
typically take the form,
x ∼ N (µ, Σ), (7.252)
where x ∈ RN (i.e., x lives in a vectorspace). An equivalent way to look
at this is that x is comprised of a ‘large’, noise-free component, µ, and a ‘small’, noisy component, ε, that is zero-mean:
x = µ + ε,  ε ∼ N (0, Σ).  (7.253)
This arrangement works because all the quantities involved are vectors
and the vectorspace is closed under the + operation. Unfortunately,
our matrix Lie groups are not closed under this type of addition, and
so we need to think of a different way of defining random variables.
This section will introduce our definitions of random variables and
probability density functions (PDFs) for rotations and poses, and then
present four examples of using our new probability and statistics. We
follow the approach outlined by Barfoot and Furgale (2014).

7.3.1 Gaussian Random Variables and PDFs


We will discuss general random variables and PDFs briefly, and then
focus on Gaussians. We frame the main discussion in terms of rotations
and then state the results for poses afterwards.

Rotations
We have seen several times the dual nature of rotations/poses in the
sense that they can be described in terms of a Lie group or a Lie al-
gebra, each having advantages and disadvantages. Lie groups are nice
because they are free of singularities, but have constraints; this is also
the form that is usually required in order to rotate/transform some-
thing in the real world. Lie algebras are nice because we can treat
them as vectorspaces (for which there are many useful mathematical
tools27 ), and they are free of constraints, but we need to worry about
singularities.
It seems logical to exploit the vectorspace character of a Lie alge-
bra in defining our random variables for rotations and poses. In this
way, we can leverage all the usual tools from probability and statistics,
rather than starting over. Given this decision, and using (7.253) for
27 Including probability and statistics.

inspiration, there are three possible ways to define a random variable for SO(3) based on the different perturbation options:

            SO(3)                     so(3)
left:       C = exp(ε`∧) C̄           φ ≈ µ + J`(µ)−1 ε`
middle:     C = exp((µ + εm)∧)        φ = µ + εm
right:      C = C̄ exp(εr∧)           φ ≈ µ + Jr(µ)−1 εr

where ε`, εm, εr ∈ R3 are random variables in the usual (vectorspace) sense, µ ∈ R3 is a constant, and C = exp(φ∧), C̄ = exp(µ∧) ∈ SO(3). In each of these three cases, we know through the surjective property of the exponential map and the closure property of Lie groups, that we will ensure that C = exp(φ∧) ∈ SO(3). This idea of mapping a random variable onto a Lie group through the exponential map is sometimes referred to as an injection28 of noise onto the group.
Looking to the Lie algebra versions of the perturbations, we can see the usual relationships between the left, middle, and right: εm ≈ J`(µ)−1 ε` ≈ Jr(µ)−1 εr. Based on this, we might conclude that all the options are
equally good. However, in the middle option, we must keep the nominal
component of the variable in the Lie algebra as well as the perturbation,
which means we will have to contend with the associated singularities.
On the other hand, both the left and right perturbation approaches
allow us to keep the nominal component of the variable in the Lie
group. By convention, we will choose the left perturbation approach,
but one could just as easily pick the right.
This approach to defining random variables for rotations/poses in
some sense gets the best of both worlds. We can avoid singularities
for the large, nominal part by keeping it in the Lie group, but we can
exploit the constraint-free, vectorspace character of the Lie algebra for
the small, noisy part. Since the noisy part is assumed to be small, it will
tend to stay away from the singularities associated with the rotation-
vector parameterization29 .
Thus, for SO(3), a random variable, C, will be of the form30
C = exp(ε∧) C̄,  (7.254)
28 The term injection is misleading here since mathematically it means that at most one
element of the Lie algebra should map to each element of the Lie group. As we have
seen, the exponential map linking the Lie algebra to the Lie group is surjective, which
means every element of the Lie algebra maps to some element of the Lie group and
every element of the Lie group is mapped to from at least one element of the Lie
algebra. If we limit the rotation angle magnitude, |φ| < π, then the exponential map is
bijective, meaning both surjective and injective (i.e., one-to-one and onto). However, we
may not want to impose this limit, whereupon the injective property does not hold.
29 This approach works reasonably well as long as the perturbation is small. If this is not
the case, a more global approach to defining a random variable on Lie groups is
required, but these are less well explored. (Lee et al., 2008)
30 We will drop the ` subscript on ε` from here on to keep things clean.

where C̄ ∈ SO(3) is a ‘large’, noise-free, nominal rotation and ε ∈ R3 is a ‘small’, noisy component (i.e., it is just a regular random variable from a vectorspace). This means that we can simply define a PDF for ε, and this will induce a PDF on SO(3):
p(ε) → p(C).  (7.255)
We will mainly be concerned with Gaussian PDFs in our estimation
problems, and in this case we let
p(ε) = (1/√((2π)3 det(Σ))) exp(−(1/2) εT Σ−1 ε),  (7.256)
or ε ∼ N (0, Σ). Note, we can now think of C̄ as the ‘mean’ rotation
and Σ as the associated covariance.
By definition, p(ε) is a valid PDF and so
∫ p(ε) dε = 1.  (7.257)

We deliberately avoid making the integration limits explicit because we have defined ε to be Gaussian, which means it has probability mass out to infinity in all directions. However, we assume that most of the probability mass is encompassed in |ε| < π for this to make sense. Referring back to Section 7.1.6, we know that we can relate an infinitesimal volume element in the Lie algebra to an infinitesimal volume element in the Lie group according to
dC = | det(J(ε))| dε,  (7.258)
where we note that due to our choice of using the left perturbation, the Jacobian, J, is evaluated at ε (which will hopefully be small) rather than at φ (which could be large); this will hopefully keep J very close to 1. We can use this to now work out the PDF that is induced on C.
We have that
1 = ∫ p(ε) dε  (7.259)
 = ∫ (1/√((2π)3 det(Σ))) exp(−(1/2) εT Σ−1 ε) dε  (7.260)
 = ∫ (1/√((2π)3 det(Σ))) exp(−(1/2) ln(CC̄T)∨T Σ−1 ln(CC̄T)∨) (1/| det(J)|) dC,  (7.261)
where the integrand of (7.261) is the induced p(C). It is important to realize that p(C) looks like this due to our choice to define p(ε) directly31.
31 It is also possible to work in the other direction by first defining p(C). (Chirikjian,
2009)

A common method of defining the mean rotation, M ∈ SO(3), is as the unique solution of the equation,
∫ ln(CMT)∨ p(C) dC = 0.  (7.262)
Switching variables from C to ε, this is equivalent to
∫ ln(exp(ε∧) C̄MT)∨ p(ε) dε = 0.  (7.263)
Taking M = C̄ we see that
∫ ln(exp(ε∧) C̄MT)∨ p(ε) dε = ∫ ln(exp(ε∧) C̄C̄T)∨ p(ε) dε = ∫ ε p(ε) dε = E[ε] = 0,  (7.264)

which validates our logic in referring to C̄ as the mean earlier.


The corresponding covariance, Σ, computed about M, can be defined as
Σ = ∫ ln(exp(ε∧) C̄MT)∨ ln(exp(ε∧) C̄MT)∨T p(ε) dε
 = ∫ ln(exp(ε∧) C̄C̄T)∨ ln(exp(ε∧) C̄C̄T)∨T p(ε) dε
 = ∫ ε εT p(ε) dε = E[ε εT],  (7.265)
which implies that choosing ε ∼ N (0, Σ) is a reasonable thing to do and matches nicely with the noise ‘injection’ procedure. In fact, all higher-order statistics defined in an analogous way will produce the statistics associated with ε as well.
As another advantage to this approach to representing random vari-
ables for rotations, consider what happens to our rotation random vari-
ables under a pure (deterministic) rotation mapping. Let R ∈ SO(3) be
a constant rotation matrix that we apply to C to create a new random
variable, C′ = R C. With no approximation we have
C′ = R C = R exp(ε∧) C̄ = exp((Rε)∧) R C̄ = exp(ε′∧) C̄′,  (7.266)
where
C̄′ = R C̄,  ε′ = R ε ∼ N (0, R Σ RT).  (7.267)

This is very appealing as it allows us to carry out this common operation exactly.
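This exactness is easy to confirm numerically (assuming numpy/scipy), since it rests on the identity exp((Rε)∧) = R exp(ε∧) RT; the particular values below are made up for illustration:

```python
import numpy as np
from scipy.linalg import expm

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

R = expm(hat(np.array([0.0, 0.0, 1.2])))       # deterministic rotation
C_bar = expm(hat(np.array([0.3, 0.1, -0.2])))  # nominal ('mean') rotation
eps = np.array([0.05, -0.02, 0.01])            # one sample of the noise

lhs = R @ expm(hat(eps)) @ C_bar               # R C, with C = exp(eps^) C_bar
rhs = expm(hat(R @ eps)) @ R @ C_bar           # exp((R eps)^) (R C_bar)
```

The two products agree to machine precision, which is exactly the statement that the rotated random variable is again of the form exp(ε′∧) C̄′.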

Poses
Similarly to the rotation case, we choose to define a Gaussian random variable for poses as
T = exp(ε∧) T̄,  (7.268)
where T̄ ∈ SE(3) is a ‘large’ mean transformation and ε ∈ R6 ∼ N (0, Σ) is a ‘small’ Gaussian random variable (i.e., in a vectorspace).
The mean transformation, M ∈ SE(3), is the unique solution of the equation,
∫ ln(exp(ε∧) T̄M−1)∨ p(ε) dε = 0.  (7.269)
Taking M = T̄ we see that
∫ ln(exp(ε∧) T̄M−1)∨ p(ε) dε = ∫ ln(exp(ε∧) T̄T̄−1)∨ p(ε) dε = ∫ ε p(ε) dε = E[ε] = 0,  (7.270)

which validates our logic in referring to T̄ as the mean.


The corresponding covariance, Σ, computed about M, can be defined
as
Z
∨ ∨T
Σ = ln exp (∧ ) T̄M−1 ln exp (∧ ) T̄M−1 p() d
Z
∨  ∨T
= ln exp (∧ ) T̄T̄−1 ln exp (∧ ) T̄T̄−1 p() d
Z
= T p() d = E[T ], (7.271)

which implies that choosing  ∼ N (0, Σ) is a reasonable thing to do


and matches nicely with the noise ‘injection’ procedure. In fact, all
higher-order statistics defined in an analogous way will produce the
statistics associated with  as well.
Again, consider what happens to our transformation random variables under a pure (deterministic) transformation mapping. Let R ∈ SE(3) be a constant transformation matrix that we apply to T to create a new random variable, T′ = R T. With no approximation we have
T′ = R T = R exp(ε∧) T̄ = exp((R ε)∧) R T̄ = exp(ε′∧) T̄′,  (7.272)
where
T̄′ = R T̄,  ε′ = R ε ∼ N (0, R Σ RT),  (7.273)
and R = Ad(R).

7.3.2 Uncertainty on a Rotated Vector


Consider the simple mapping from rotation to position given by
y = Cx, (7.274)
where x ∈ R3 is a constant and
C = exp(ε∧) C̄,  ε ∼ N (0, Σ).  (7.275)
Figure 7.3 shows what the resulting density over y looks like for some
particular values of C̄ and Σ. We see that as expected the samples live
on a sphere whose radius is |x| since rotations preserve length.
Figure 7.3 Depiction of uncertainty on a vector y = Cx ∈ R3, where x is constant and C = exp(ε∧) C̄, ε ∼ N (0, Σ) is a random variable. The dots show samples of the resulting density over y. The contours (of varying darkness) show the 1, 2, 3 standard-deviation equiprobable contours of ε mapped to y. The solid black line is the noise-free vector, ȳ = C̄x. The grey, dashed, dotted, and dash-dotted lines show various estimates of E[y] using brute-force sampling, the sigmapoint transformation, a second-order method, and a fourth-order method.

We might be interested in computing E[y], in the vectorspace R3 (i.e., not exploiting special knowledge that y must have length |x|). We can imagine three ways of doing this:
(i) drawing a large number of random samples and then averaging,
(ii) using the sigmapoint transformation,
(iii) an analytical approximation.

For (iii) we consider expanding C in terms of ε so that
y = Cx = (1 + ε∧ + (1/2) ε∧ε∧ + (1/6) ε∧ε∧ε∧ + (1/24) ε∧ε∧ε∧ε∧ + · · ·) C̄x.  (7.276)

Since ε is Gaussian, the odd terms average to zero such that
E[y] = (1 + (1/2) E[ε∧ε∧] + (1/24) E[ε∧ε∧ε∧ε∧] + · · ·) C̄x.  (7.277)
Going term by term we have
E[ε∧ε∧] = E[−(εTε) 1 + εεT] = −tr(E[εεT]) 1 + E[εεT] = −tr(Σ) 1 + Σ,  (7.278)
and
E[ε∧ε∧ε∧ε∧] = E[−(εTε) ε∧ε∧]
 = E[−(εTε)(−(εTε) 1 + εεT)]
 = tr(E[εεT εεT]) 1 − E[εεT εεT]
 = tr(Σ (tr(Σ) 1 + 2Σ)) 1 − Σ (tr(Σ) 1 + 2Σ)
 = ((tr(Σ))2 + 2 tr(Σ2)) 1 − Σ (tr(Σ) 1 + 2Σ),  (7.279)

where we have used the multivariate version of Isserlis’ theorem. Higher-


order terms are also possible but to fourth order in  we have

1
E[y] ≈ 1 + (−tr (Σ) 1 + Σ)
2
1   
2 2
+ (tr (Σ)) + 2 tr Σ 1 − Σ (tr (Σ) 1 + 2Σ) C̄x. (7.280)
24
We refer to the method keeping terms to second order in  as ‘second-
order’ and the method keeping terms to fourth order in  as ‘fourth-
order’ in Figure 7.3. The fourth-order method is very comparable to
the sigmapoint method and random sampling.
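These expansions are easy to check numerically. The sketch below (plain NumPy; the rotation, point, and covariance values are invented for illustration) compares a brute-force sample average of $y$ against the naive mean $\bar{C}x$ and the second- and fourth-order means of (7.280).

```python
import numpy as np

def hat(phi):
    """Skew-symmetric (cross-product) matrix phi^ for a 3x1 vector."""
    return np.array([[0.0, -phi[2], phi[1]],
                     [phi[2], 0.0, -phi[0]],
                     [-phi[1], phi[0], 0.0]])

def exp_so3(phi):
    """Exponential map R^3 -> SO(3) via Rodrigues' formula."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3) + hat(phi)
    a = phi / angle
    return (np.cos(angle) * np.eye(3)
            + (1.0 - np.cos(angle)) * np.outer(a, a)
            + np.sin(angle) * hat(a))

# Made-up nominal rotation, point, and perturbation covariance.
C_bar = exp_so3(np.array([0.3, -0.2, 0.5]))
x = np.array([1.0, 0.5, -0.5])
Sigma = np.diag([0.04, 0.02, 0.03])  # covariance of phi

# (i) brute-force sampling of E[y], y = exp(phi^) C_bar x
rng = np.random.default_rng(1)
phis = rng.multivariate_normal(np.zeros(3), Sigma, size=50000)
y_mc = np.mean([exp_so3(phi) @ C_bar @ x for phi in phis], axis=0)

# (iii) analytical approximations of E[y]
y_naive = C_bar @ x                      # zeroth-order (just the mean rotation)
S2 = 0.5 * (-np.trace(Sigma) * np.eye(3) + Sigma)
y_2nd = (np.eye(3) + S2) @ C_bar @ x     # second-order, cf. (7.280)
T4 = ((np.trace(Sigma)**2 + 2.0 * np.trace(Sigma @ Sigma)) * np.eye(3)
      - Sigma @ (np.trace(Sigma) * np.eye(3) + 2.0 * Sigma))
y_4th = (np.eye(3) + S2 + T4 / 24.0) @ C_bar @ x  # fourth-order
```

With even moderate rotational uncertainty, the second-order mean already lands much closer to the sampled mean than $\bar{C}x$ does, mirroring Figure 7.3.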

7.3.3 Compounding Poses


In this section, we investigate the problem of compounding two poses,
each with associated uncertainty, as depicted in Figure 7.4.

Figure 7.4 Combining a chain of two poses, $\{\bar{T}_1, \Sigma_1\}$ and $\{\bar{T}_2, \Sigma_2\}$, into a single compound pose, $\{\bar{T}, \Sigma\}$.

Theory
Consider two noisy poses, $T_1$ and $T_2$; we keep track of their nominal values and associated uncertainties:

$$ \{\bar{T}_1, \Sigma_1\}, \quad \{\bar{T}_2, \Sigma_2\}. $$

Suppose now we let

$$ T = T_1 T_2, $$

as depicted in Figure 7.4. What is $\{\bar{T}, \Sigma\}$? Under our perturbation scheme we have

$$ \exp(\xi^\wedge)\,\bar{T} = \exp(\xi_1^\wedge)\,\bar{T}_1 \exp(\xi_2^\wedge)\,\bar{T}_2. \tag{7.281} $$

Moving all the uncertain factors to the left side, we have

$$ \exp(\xi^\wedge)\,\bar{T} = \exp(\xi_1^\wedge)\exp\left((\mathcal{T}_1 \xi_2)^\wedge\right)\bar{T}_1\bar{T}_2, \tag{7.282} $$

where $\mathcal{T}_1 = \mathrm{Ad}(\bar{T}_1)$. If we let

$$ \bar{T} = \bar{T}_1\bar{T}_2, \tag{7.283} $$

we are left with

$$ \exp(\xi^\wedge) = \exp(\xi_1^\wedge)\exp\left((\mathcal{T}_1 \xi_2)^\wedge\right). \tag{7.284} $$

Letting $\xi_2' = \mathcal{T}_1 \xi_2$, we can apply the BCH formula to find

$$ \xi = \xi_1 + \xi_2' + \tfrac{1}{2}\xi_1^\curlywedge \xi_2' + \tfrac{1}{12}\xi_1^\curlywedge\xi_1^\curlywedge \xi_2' + \tfrac{1}{12}\xi_2'^\curlywedge\xi_2'^\curlywedge \xi_1 - \tfrac{1}{24}\xi_2'^\curlywedge\xi_1^\curlywedge\xi_1^\curlywedge \xi_2' + \cdots. \tag{7.285} $$

For our approach to hold, we require that $E[\xi] = 0$. Assuming that $\xi_1 \sim \mathcal{N}(0, \Sigma_1)$ and $\xi_2' \sim \mathcal{N}(0, \Sigma_2')$ are uncorrelated with one another, we have

$$ E[\xi] = -\tfrac{1}{24}\,E\left[\xi_2'^\curlywedge\xi_1^\curlywedge\xi_1^\curlywedge \xi_2'\right] + O(\xi^6), \tag{7.286} $$

since everything except the fourth-order term has zero mean. Thus, to third order, we can safely assume that $E[\xi] = 0$, and thus (7.283) seems to be a reasonable way to compound the mean transformations.³²

The next task is to compute $\Sigma = E[\xi\xi^T]$. Multiplying out to fourth order, we have

$$ E[\xi\xi^T] \approx E\Big[\xi_1\xi_1^T + \xi_2'\xi_2'^T + \tfrac{1}{4}\,\xi_1^\curlywedge \xi_2'\xi_2'^T \xi_1^{\curlywedge T} + \tfrac{1}{12}\Big((\xi_1^\curlywedge\xi_1^\curlywedge)\,\xi_2'\xi_2'^T + \xi_2'\xi_2'^T(\xi_1^\curlywedge\xi_1^\curlywedge)^T + (\xi_2'^\curlywedge\xi_2'^\curlywedge)\,\xi_1\xi_1^T + \big((\xi_2'^\curlywedge\xi_2'^\curlywedge)\,\xi_1\xi_1^T\big)^T\Big)\Big], \tag{7.287} $$

where we have omitted showing any terms that have an odd power in either $\xi_1$ or $\xi_2'$ since these will by definition have expectation zero. This expression may look daunting, but we can take it term by term. To save space, we define and make use of the following two linear operators:

$$ \langle\langle A \rangle\rangle = -\mathrm{tr}(A)\,1 + A, \tag{7.288a} $$
$$ \langle\langle A, B \rangle\rangle = \langle\langle A \rangle\rangle\langle\langle B \rangle\rangle + \langle\langle BA \rangle\rangle, \tag{7.288b} $$

with $A, B \in \mathbb{R}^{n\times n}$. These provide the useful identity,

$$ -u^\wedge A\, v^\wedge \equiv \langle\langle v u^T, A^T \rangle\rangle, \tag{7.289} $$

where $u, v \in \mathbb{R}^3$ and $A \in \mathbb{R}^{3\times 3}$. Making use of this repeatedly, we have

³² It is also possible to show that the fourth-order term has zero mean, $E\left[\xi_2'^\curlywedge\xi_1^\curlywedge\xi_1^\curlywedge \xi_2'\right] = 0$, if $\Sigma_1$ is of the special form

$$ \Sigma_1 = \begin{bmatrix} \Sigma_{1,\rho\rho} & 0 \\ 0 & \sigma_{1,\phi\phi}^2\, 1 \end{bmatrix}, $$

where the $\rho$ and $\phi$ subscripts indicate a partitioning of the covariance into the translation and rotation components, respectively. This is a common situation for $\Sigma_1$ when we are, for example, propagating uncertainty on velocity through the kinematics equations presented for $SE(3)$; from (7.212) we have $T_1 = \exp((t_2 - t_1)\varpi^\wedge)$, where $\varpi$ is the (noisy) generalized velocity. In this case, we are justified in assuming $E[\xi] = 0$ all the way out to fifth order (and possibly further).

out to fourth order,

$$ E\left[\xi_1\xi_1^T\right] = \Sigma_1 = \begin{bmatrix} \Sigma_{1,\rho\rho} & \Sigma_{1,\rho\phi} \\ \Sigma_{1,\rho\phi}^T & \Sigma_{1,\phi\phi} \end{bmatrix}, \tag{7.290a} $$
$$ E\left[\xi_2'\xi_2'^T\right] = \Sigma_2' = \begin{bmatrix} \Sigma_{2,\rho\rho}' & \Sigma_{2,\rho\phi}' \\ \Sigma_{2,\rho\phi}'^T & \Sigma_{2,\phi\phi}' \end{bmatrix} = \mathcal{T}_1 \Sigma_2 \mathcal{T}_1^T, \tag{7.290b} $$
$$ E\left[\xi_1^\curlywedge\xi_1^\curlywedge\right] = \mathcal{A}_1 = \begin{bmatrix} \langle\langle \Sigma_{1,\phi\phi} \rangle\rangle & \langle\langle \Sigma_{1,\rho\phi} + \Sigma_{1,\rho\phi}^T \rangle\rangle \\ 0 & \langle\langle \Sigma_{1,\phi\phi} \rangle\rangle \end{bmatrix}, \tag{7.290c} $$
$$ E\left[\xi_2'^\curlywedge\xi_2'^\curlywedge\right] = \mathcal{A}_2' = \begin{bmatrix} \langle\langle \Sigma_{2,\phi\phi}' \rangle\rangle & \langle\langle \Sigma_{2,\rho\phi}' + \Sigma_{2,\rho\phi}'^T \rangle\rangle \\ 0 & \langle\langle \Sigma_{2,\phi\phi}' \rangle\rangle \end{bmatrix}, \tag{7.290d} $$
$$ E\left[\xi_1^\curlywedge \xi_2'\xi_2'^T \xi_1^{\curlywedge T}\right] = \mathcal{B} = \begin{bmatrix} \mathcal{B}_{\rho\rho} & \mathcal{B}_{\rho\phi} \\ \mathcal{B}_{\rho\phi}^T & \mathcal{B}_{\phi\phi} \end{bmatrix}, \tag{7.290e} $$

where

$$ \mathcal{B}_{\rho\rho} = \langle\langle \Sigma_{1,\phi\phi}, \Sigma_{2,\rho\rho}' \rangle\rangle + \langle\langle \Sigma_{1,\rho\phi}^T, \Sigma_{2,\rho\phi}' \rangle\rangle + \langle\langle \Sigma_{1,\rho\phi}, \Sigma_{2,\rho\phi}'^T \rangle\rangle + \langle\langle \Sigma_{1,\rho\rho}, \Sigma_{2,\phi\phi}' \rangle\rangle, \tag{7.291a} $$
$$ \mathcal{B}_{\rho\phi} = \langle\langle \Sigma_{1,\phi\phi}, \Sigma_{2,\rho\phi}'^T \rangle\rangle + \langle\langle \Sigma_{1,\rho\phi}^T, \Sigma_{2,\phi\phi}' \rangle\rangle, \tag{7.291b} $$
$$ \mathcal{B}_{\phi\phi} = \langle\langle \Sigma_{1,\phi\phi}, \Sigma_{2,\phi\phi}' \rangle\rangle. \tag{7.291c} $$

The resulting covariance is then

$$ \Sigma_{4th} \approx \underbrace{\Sigma_1 + \Sigma_2'}_{\Sigma_{2nd}} + \underbrace{\tfrac{1}{4}\mathcal{B} + \tfrac{1}{12}\left(\mathcal{A}_1\Sigma_2' + \Sigma_2'\mathcal{A}_1^T + \mathcal{A}_2'\Sigma_1 + \Sigma_1\mathcal{A}_2'^T\right)}_{\text{additional fourth-order terms}}, \tag{7.292} $$

correct to fourth order.³³ This result is essentially the same as that of Wang and Chirikjian (2008), but worked out for our slightly different PDF; it is important to note that while our method is fourth-order in the perturbation variables, it is only second-order in the covariance. In summary, to compound two poses we propagate the mean using (7.283) and the covariance using (7.292).
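The operator algebra above transcribes directly into code. The following sketch (NumPy; the block partitioning is the $\rho$/$\phi$ split used above, and it takes as input $\Sigma_1$ and the already-transformed $\Sigma_2' = \mathcal{T}_1\Sigma_2\mathcal{T}_1^T$) implements $\langle\langle\cdot\rangle\rangle$ and $\langle\langle\cdot,\cdot\rangle\rangle$ and assembles (7.292).

```python
import numpy as np

def op1(A):
    """<<A>> = -tr(A) 1 + A, cf. (7.288a)."""
    return -np.trace(A) * np.eye(A.shape[0]) + A

def op2(A, B):
    """<<A, B>> = <<A>><<B>> + <<BA>>, cf. (7.288b)."""
    return op1(A) @ op1(B) + op1(B @ A)

def hat(u):
    """3x3 skew-symmetric matrix of a 3x1 vector."""
    return np.array([[0.0, -u[2], u[1]],
                     [u[2], 0.0, -u[0]],
                     [-u[1], u[0], 0.0]])

def blocks(S):
    """Split a 6x6 covariance into its rho/phi (translation/rotation) blocks."""
    return S[:3, :3], S[:3, 3:], S[3:, 3:]

def compound_cov_4th(Sigma1, Sigma2p):
    """Second-order covariance plus the additional fourth-order terms of (7.292).
    Sigma2p is Sigma_2 already mapped through Ad(T_bar_1)."""
    s1rr, s1rp, s1pp = blocks(Sigma1)
    s2rr, s2rp, s2pp = blocks(Sigma2p)
    # A1, A2' from (7.290c,d)
    A1 = np.block([[op1(s1pp), op1(s1rp + s1rp.T)],
                   [np.zeros((3, 3)), op1(s1pp)]])
    A2p = np.block([[op1(s2pp), op1(s2rp + s2rp.T)],
                    [np.zeros((3, 3)), op1(s2pp)]])
    # B from (7.290e) and (7.291)
    Brr = (op2(s1pp, s2rr) + op2(s1rp.T, s2rp)
           + op2(s1rp, s2rp.T) + op2(s1rr, s2pp))
    Brp = op2(s1pp, s2rp.T) + op2(s1rp.T, s2pp)
    Bpp = op2(s1pp, s2pp)
    B = np.block([[Brr, Brp], [Brp.T, Bpp]])
    Sigma_2nd = Sigma1 + Sigma2p
    extra = 0.25 * B + (A1 @ Sigma2p + Sigma2p @ A1.T
                        + A2p @ Sigma1 + Sigma1 @ A2p.T) / 12.0
    return Sigma_2nd + extra
```

The identity (7.289) provides a convenient unit test for the two operators before trusting the assembled covariance.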

Sigmapoint Method
We can also make use of the sigmapoint transformation (Julier and
Uhlmann, 1996) to pass uncertainty through the compound pose change.
In this section, we tailor this to our specific type of SE(3) perturbation.
Our approach to handling sigmapoints is quite similar to that taken by
Hertzberg et al. (2013) and also Brookshire and Teller (2012). In our
case, we begin by approximating the joint input Gaussian using a finite
33 The sixth order terms require a lot more work, but it is possible to compute them
using Isserlis’ theorem.

number of samples, $\{T_{1,\ell}, T_{2,\ell}\}$:

$$ LL^T = \mathrm{diag}(\Sigma_1, \Sigma_2), \quad \text{(Cholesky decomposition; } L \text{ lower-triangular)} $$
$$ \psi_\ell = \sqrt{\lambda}\,\mathrm{col}_\ell\, L, \quad \ell = 1 \ldots L, $$
$$ \psi_{\ell+L} = -\sqrt{\lambda}\,\mathrm{col}_\ell\, L, \quad \ell = 1 \ldots L, $$
$$ \begin{bmatrix} \xi_{1,\ell} \\ \xi_{2,\ell} \end{bmatrix} = \psi_\ell, \quad \ell = 1 \ldots 2L, $$
$$ T_{1,\ell} = \exp\left(\xi_{1,\ell}^\wedge\right)\bar{T}_1, \quad \ell = 1 \ldots 2L, $$
$$ T_{2,\ell} = \exp\left(\xi_{2,\ell}^\wedge\right)\bar{T}_2, \quad \ell = 1 \ldots 2L, $$

where $\lambda$ is a user-definable scaling constant³⁴ and $L = 12$. We then pass each of these samples through the compound pose change and compute the difference from the mean:

$$ \xi_\ell = \ln\left(T_{1,\ell}\, T_{2,\ell}\, \bar{T}^{-1}\right)^\vee, \quad \ell = 1 \ldots 2L. \tag{7.293} $$

These are combined to create the output covariance according to

$$ \Sigma_{sp} = \frac{1}{2\lambda}\sum_{\ell=1}^{2L} \xi_\ell \xi_\ell^T. \tag{7.294} $$

Note, we have assumed that the output sigmapoint samples have zero mean in this formula, to be consistent with our mean propagation. Interestingly, this turns out to be algebraically equivalent to the second-order method (from the previous section) for this particular nonlinearity, since the noise sources on $T_1$ and $T_2$ are assumed to be uncorrelated.
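A sketch of this sigmapoint scheme follows (NumPy/SciPy; generic-matrix `expm`/`logm` stand in for the closed-form $SE(3)$ exponential and logarithm, and $\lambda = 1$ as in the experiments).

```python
import numpy as np
from scipy.linalg import expm, logm, cholesky

def hat4(xi):
    """se(3) hat operator: 6x1 (rho, phi) -> 4x4 matrix."""
    rho, phi = xi[:3], xi[3:]
    X = np.zeros((4, 4))
    X[:3, :3] = [[0.0, -phi[2], phi[1]],
                 [phi[2], 0.0, -phi[0]],
                 [-phi[1], phi[0], 0.0]]
    X[:3, 3] = rho
    return X

def vee4(X):
    """Inverse of hat4 for a 4x4 se(3) matrix."""
    return np.array([X[0, 3], X[1, 3], X[2, 3],
                     X[2, 1], X[0, 2], X[1, 0]])

def compound_sigmapoint(T1_bar, Sigma1, T2_bar, Sigma2, lam=1.0):
    """Covariance of T = T1 T2 via the 2L sigmapoints of (7.293)-(7.294)."""
    L = cholesky(np.block([[Sigma1, np.zeros((6, 6))],
                           [np.zeros((6, 6)), Sigma2]]), lower=True)
    T_bar = T1_bar @ T2_bar
    Sigma_sp = np.zeros((6, 6))
    for ell in range(12):
        for sign in (+1.0, -1.0):
            psi = sign * np.sqrt(lam) * L[:, ell]
            T1 = expm(hat4(psi[:6])) @ T1_bar
            T2 = expm(hat4(psi[6:])) @ T2_bar
            eps = vee4(np.real(logm(T1 @ T2 @ np.linalg.inv(T_bar))))
            Sigma_sp += np.outer(eps, eps)
    return Sigma_sp / (2.0 * lam)
```

For uncorrelated diagonal inputs, each sigmapoint perturbs only one pose, so the result reproduces $\Sigma_1 + \mathcal{T}_1\Sigma_2\mathcal{T}_1^T$ essentially exactly, as the equivalence claim above suggests.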

Simple Compound Example


In this section, we present a simple qualitative example of pose com-
pounding and in Section 7.3.3 we carry out a more quantitative study on
a different setup. To see the qualitative difference between the second-
and fourth-order methods, let us consider the case of compounding
transformations many times in a row:
$$ \exp\left(\xi_K^\wedge\right)\bar{T}_K = \left(\prod_{k=1}^K \exp(\xi^\wedge)\,\bar{T}\right)\exp\left(\xi_0^\wedge\right)\bar{T}_0. \tag{7.295} $$

As discussed earlier, this can be viewed as a discrete-time integration


of the SE(3) kinematic equations as in (7.212). To keep things simple,

34 For all experiments in this section, we used λ = 1; we need to ensure the sigmapoints
associated with the rotational degrees of freedom have length less than π to avoid
numerical problems.

we make the following assumptions:

$$ \bar{T}_0 = 1, \quad \xi_0 \sim \mathcal{N}(0, 0), \tag{7.296a} $$
$$ \bar{T} = \begin{bmatrix} \bar{C} & \bar{r} \\ 0^T & 1 \end{bmatrix}, \quad \xi \sim \mathcal{N}(0, \Sigma), \tag{7.296b} $$
$$ \bar{C} = 1, \quad \bar{r} = \begin{bmatrix} r \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma = \mathrm{diag}\left(0, 0, 0, 0, 0, \sigma^2\right). \tag{7.296c} $$

Although this example uses our three-dimensional tools, it is confined to a plane for the purpose of illustration and ease of plotting; it corresponds to a rigid body moving along the x-axis but with some uncertainty only on the rotational velocity about the z-axis. This could model a unicycle robot driving in the plane with constant translational speed and slightly uncertain rotational speed (centered about zero). We are interested in how the covariance matrix fills in over time.

According to the second-order scheme, we have

$$ \bar{T}_K = \begin{bmatrix} 1 & 0 & 0 & Kr \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{7.297a} $$

$$ \Sigma_K = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & \frac{K(K-1)(2K-1)}{6} r^2\sigma^2 & 0 & 0 & 0 & -\frac{K(K-1)}{2} r\sigma^2 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & -\frac{K(K-1)}{2} r\sigma^2 & 0 & 0 & 0 & K\sigma^2 \end{bmatrix}, \tag{7.297b} $$
where we see that the top-left entry of ΣK , corresponding to uncer-
tainty in the x-direction, does not have any growth of uncertainty.
However, in the fourth-order scheme, the fill-in pattern is such that
the top-left entry is non-zero. This happens for several reasons, but
mainly through the Bρρ submatrix of B. This leaking of uncertainty
into an additional degree of freedom cannot be captured by keeping
only the second-order terms. Figure 7.5 provides a numerical example
of this effect. It shows that both the second- and fourth-order schemes
do a good job of representing the ‘banana’-like density over poses, as
discussed by Long et al. (2012). However, the fourth-order scheme has
some finite uncertainty in the straight-ahead direction (as do the sam-
pled trajectories) while the second-order scheme does not.
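The closed form (7.297b) can be verified by iterating the second-order compounding rule one step at a time, $\Sigma_k = \Sigma + \mathcal{T}\,\Sigma_{k-1}\mathcal{T}^T$ with $\mathcal{T} = \mathrm{Ad}(\bar{T})$; the sketch below (NumPy; $r$, $\sigma$, and $K$ match the values used in Figure 7.5) does exactly this.

```python
import numpy as np

r, sigma2, K = 1.0, 0.03**2, 100

# Ad(T_bar) for C_bar = 1 and r_bar = (r, 0, 0): [[1, r^wedge], [0, 1]]
r_wedge = np.array([[0.0, 0.0, 0.0],
                    [0.0, 0.0, -r],
                    [0.0, r, 0.0]])
Adj = np.block([[np.eye(3), r_wedge],
                [np.zeros((3, 3)), np.eye(3)]])

# Per-step noise: only the z-rotation component is uncertain, cf. (7.296c)
Sigma = np.diag([0.0, 0.0, 0.0, 0.0, 0.0, sigma2])

# Second-order recursion: each step left-compounds one noisy transform.
Sigma_K = np.zeros((6, 6))
for _ in range(K):
    Sigma_K = Sigma + Adj @ Sigma_K @ Adj.T

# Closed form (7.297b); indices are 0-based (row/col 1 is y, 5 is z-rotation)
Sigma_closed = np.zeros((6, 6))
Sigma_closed[1, 1] = K * (K - 1) * (2 * K - 1) / 6 * r**2 * sigma2
Sigma_closed[1, 5] = Sigma_closed[5, 1] = -K * (K - 1) / 2 * r * sigma2
Sigma_closed[5, 5] = K * sigma2
```

The recursion and the closed form agree exactly, and the top-left ($x$-direction) entry stays pinned at zero, which is precisely the second-order limitation discussed above.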

Compound Experiment
To quantitatively evaluate the pose-compounding techniques, we ran
a second numerical experiment in which we compounded two poses
Figure 7.5 Example of compounding K = 100 uncertain transformations (Section 7.3.3). The light blue lines and blue dots show 1000 individual sampled trajectories starting from (0, 0) and moving nominally to the right at constant translational speed, but with some uncertainty on rotational velocity. The grey 1-sigma covariance ellipse is simply fitted to the samples to show what keeping xy-covariance relative to the start looks like. The dotted (second-order) and dash-dotted (fourth-order) lines are the principal great circles of the 1-sigma covariance ellipsoid, given by $\Sigma_K$, mapped to the xy-plane. Looking to the area (95, 0), corresponding to straight ahead, the fourth-order scheme has some non-zero uncertainty (as do the samples), whereas the second-order scheme does not. We used $r = 1$ and $\sigma = 0.03$.

including their associated covariance matrices,

$$ \bar{T}_1 = \exp\left(\bar{\xi}_1^\wedge\right), \quad \bar{\xi}_1 = \begin{bmatrix} 0 & 2 & 0 & \pi/6 & 0 & 0 \end{bmatrix}^T, \quad \Sigma_1 = \alpha \times \mathrm{diag}\left(10, 5, 5, \tfrac{1}{2}, 1, \tfrac{1}{2}\right), \tag{7.298a} $$
$$ \bar{T}_2 = \exp\left(\bar{\xi}_2^\wedge\right), \quad \bar{\xi}_2 = \begin{bmatrix} 0 & 0 & 1 & 0 & \pi/4 & 0 \end{bmatrix}^T, \quad \Sigma_2 = \alpha \times \mathrm{diag}\left(5, 10, 5, \tfrac{1}{2}, \tfrac{1}{2}, 1\right), \tag{7.298b} $$

where $\alpha \in [0, 1]$ is a scaling parameter that increases the magnitude of the input covariances parametrically.

We compounded these two poses according to (7.281), which results in a mean of $\bar{T} = \bar{T}_1\bar{T}_2$. The covariance, $\Sigma$, was computed using four methods:

(i) Monte Carlo: We drew a large number, M = 1000000, of random samples ($\xi_{m,1}$ and $\xi_{m,2}$) from the input covariance matrices, compounded the resulting transformations, and computed the covariance as $\Sigma_{mc} = \frac{1}{M}\sum_{m=1}^M \xi_m\xi_m^T$ with $T_m = \exp(\xi_{m,1}^\wedge)\bar{T}_1\exp(\xi_{m,2}^\wedge)\bar{T}_2$ and $\xi_m = \ln(T_m\bar{T}^{-1})^\vee$. This slow-but-accurate approach served as our benchmark to which the other three much faster methods were compared.
(ii) Second-Order: We used the second-order method described above to compute $\Sigma_{2nd}$.
Figure 7.6 Results from Compound Experiment: Error, $\varepsilon$, in computing covariance associated with compounding two poses using three methods, as compared to Monte Carlo. The sigmapoint and second-order methods are algebraically equivalent for this problem and thus appear the same on the plot. The input covariances were gradually scaled up via the parameter, $\alpha$, highlighting the improved performance of the fourth-order method.

(iii) Fourth-Order: We used the fourth-order method described above to compute $\Sigma_{4th}$.
(iv) Sigmapoint: We used the sigmapoint transformation described above to compute $\Sigma_{sp}$.

We compared each of the last three covariance matrices to the Monte Carlo one, using the Frobenius norm:

$$ \varepsilon = \sqrt{\mathrm{tr}\left((\Sigma - \Sigma_{mc})^T(\Sigma - \Sigma_{mc})\right)}. $$

Figure 7.6 shows that for small input covariance matrices (i.e., α small)
there is very little difference between the various methods and the errors
are all low compared to our benchmark. However, as we increase the
magnitude of the input covariances, all the methods get worse, with the
fourth-order method faring the best by about a factor of seven based
on our error metric. Note, since α is scaling the covariance, the applied
noise is increasing quadratically.
The second-order method and the sigmapoint method have indistin-
guishable performance, as they are algebraically equivalent. The fourth-
order method goes beyond both of these by considering higher-order
terms in the input covariance matrices. We did not compare the compu-
tational costs of the various methods as they are all extremely efficient
as compared to Monte Carlo.
It is also worth noting that our ability to correctly keep track of
uncertainties on SE(3) decreases with increasing uncertainty. This can
be seen directly in Figure 7.6, as error increases with increasing un-
certainty. This suggests that it may be wise to use only relative pose
variables in order to keep uncertainties small.

7.3.4 Fusing Poses


This section will investigate a different type of nonlinearity, the fusing
of several pose estimates, as depicted in Figure 7.7. We will approach
this as an estimation problem, our first involving quantities from a
matrix Lie group. We will use the optimization ideas introduced in
Section 7.1.9.
Figure 7.7 Combining K pose estimates, $\{\bar{T}_1, \Sigma_1\}, \{\bar{T}_2, \Sigma_2\}, \ldots, \{\bar{T}_K, \Sigma_K\}$, into a single fused estimate, $\{\bar{T}, \Sigma\}$.

Theory
Suppose that we have K estimates of a pose and associated uncertainties:

$$ \{\bar{T}_1, \Sigma_1\}, \quad \{\bar{T}_2, \Sigma_2\}, \quad \ldots, \quad \{\bar{T}_K, \Sigma_K\}. \tag{7.299} $$

If we think of these as uncertain (pseudo)-measurements of the true pose, $T_{true}$, how can we optimally combine these into a single estimate, $\{\bar{T}, \Sigma\}$?

As we have seen in the first part of this book, the vectorspace solution to fusion is straightforward and can be found exactly in closed form:

$$ \bar{x} = \Sigma \sum_{k=1}^K \Sigma_k^{-1}\bar{x}_k, \quad \Sigma = \left(\sum_{k=1}^K \Sigma_k^{-1}\right)^{-1}. \tag{7.300} $$

The situation is somewhat more complicated when dealing with $SE(3)$, and we shall resort to an iterative scheme. We define the error (that we will seek to minimize) as $e_k(T)$, which occurs between the individual measurement and the optimal estimate, $T$, so that

$$ e_k(T) = \ln\left(\bar{T}_k T^{-1}\right)^\vee. \tag{7.301} $$

We use our approach to pose optimization outlined earlier,³⁵ wherein we start with an initial guess, $T_{op}$, and perturb this (on the left) by a small amount, $\epsilon$, so that

$$ T = \exp(\epsilon^\wedge)\, T_{op}. \tag{7.302} $$
³⁵ It is worth mentioning that we are using our constraint-sensitive perturbations for matrix Lie groups in two distinct ways in this section. First, the perturbations are used as a means of injecting noise on the Lie group so that probability and statistics can be defined. Second, we are using a perturbation in order to carry out iterative optimization.

Inserting this into the error expression, we have

$$ e_k(T) = \ln\left(\bar{T}_k T^{-1}\right)^\vee = \ln\big(\bar{T}_k T_{op}^{-1} \underbrace{\exp(-\epsilon^\wedge)}_{\text{small}}\big)^\vee = \ln\left(\exp\left(e_k(T_{op})^\wedge\right)\exp(-\epsilon^\wedge)\right)^\vee \approx e_k(T_{op}) - G_k\,\epsilon, \tag{7.303} $$

where $e_k(T_{op}) = \ln\left(\bar{T}_k T_{op}^{-1}\right)^\vee$ and $G_k = \mathcal{J}\left(-e_k(T_{op})\right)^{-1}$. We have used the approximate BCH formula from (7.87) to arrive at the final expression. Since $e_k(T_{op})$ is fairly small, this series will converge rapidly and we can get away with keeping just a few terms. With our iterative scheme, $\epsilon$ will (hopefully) converge to zero and hence we are justified in keeping only terms linear in this quantity.

We define the cost function that we want to minimize as

$$ J(T) = \frac{1}{2}\sum_{k=1}^K e_k(T)^T \Sigma_k^{-1}\, e_k(T) \approx \frac{1}{2}\sum_{k=1}^K \left(e_k(T_{op}) - G_k\epsilon\right)^T \Sigma_k^{-1} \left(e_k(T_{op}) - G_k\epsilon\right), \tag{7.304} $$

which is already (approximately) quadratic in $\epsilon$. It is in fact a squared Mahalanobis distance (Mahalanobis, 1936) since we have chosen the weighting matrices to be the inverse covariance matrices; thus minimizing $J$ with respect to $\epsilon$ is equivalent to maximizing the joint likelihood of the individual estimates. It is worth noting that because we are using a constraint-sensitive perturbation scheme, we do not need to worry about enforcing any constraints on our state variables during the optimization procedure. Taking the derivative with respect to $\epsilon$ and setting to zero results in the following system of linear equations for the optimal value of $\epsilon$:

$$ \left(\sum_{k=1}^K G_k^T \Sigma_k^{-1} G_k\right)\epsilon^\star = \sum_{k=1}^K G_k^T \Sigma_k^{-1}\, e_k(T_{op}). \tag{7.305} $$

While this may appear strange compared to (7.300), the Jacobian terms appear because our choice of error definition is in fact nonlinear owing to the presence of the matrix exponentials. We then apply this optimal perturbation to our current guess,

$$ T_{op} \leftarrow \exp\left(\epsilon^{\star\wedge}\right) T_{op}, \tag{7.306} $$

which ensures $T_{op}$ remains in $SE(3)$, and iterate to convergence. At the last iteration, we take $\bar{T} = T_{op}$ as the mean of our fused estimate and

$$ \Sigma = \left(\sum_{k=1}^K G_k^T \Sigma_k^{-1} G_k\right)^{-1}, \tag{7.307} $$

for the covariance matrix. This approach has the form of a Gauss-
Newton method as discussed in Section 7.1.9.
This fusion problem is similar to one investigated by Smith et al.
(2003), but they only discuss the K = 2 case. Our approach is closer
to that of Long et al. (2012), who discuss the N = 2 case and derive
closed-form expressions for the fused mean and covariance for an ar-
bitrary number of individual measurements, K; however, they do not
iterate their solution and they are tracking a slightly different PDF.
Wolfe et al. (2011) also discuss fusion at length, albeit again using a
slightly different PDF than us. They discuss non-iterative methods of
fusion for arbitrary K and show numerical results for K = 2. We be-
lieve our approach generalizes all of these previous works by (i) allowing
the number of individual estimates, K, to be arbitrary, (ii) keeping an
arbitrary number of terms in the approximation of the inverse Jaco-
bian, N , and (iii) iterating to convergence via a Gauss-Newton style
optimization method. Our approach may also be simpler to implement
than some of these previous methods.
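A compact sketch of this Gauss-Newton fusion loop is below (NumPy/SciPy; generic-matrix `expm`/`logm` stand in for the closed-form $SE(3)$ maps, and $G_k$ keeps only the first three terms of the series for $\mathcal{J}^{-1}$, i.e., N = 2 in the notation of the next subsection).

```python
import numpy as np
from scipy.linalg import expm, logm

def hat4(xi):
    """se(3) hat operator: 6x1 (rho, phi) -> 4x4 matrix."""
    rho, phi = xi[:3], xi[3:]
    X = np.zeros((4, 4))
    X[:3, :3] = [[0.0, -phi[2], phi[1]],
                 [phi[2], 0.0, -phi[0]],
                 [-phi[1], phi[0], 0.0]]
    X[:3, 3] = rho
    return X

def vee4(X):
    """Inverse of hat4 for a 4x4 se(3) matrix."""
    return np.array([X[0, 3], X[1, 3], X[2, 3],
                     X[2, 1], X[0, 2], X[1, 0]])

def curlyhat(xi):
    """6x6 'curly hat' (adjoint) operator of se(3)."""
    rho, phi = xi[:3], xi[3:]
    ph = hat4(np.concatenate([np.zeros(3), phi]))[:3, :3]
    rh = hat4(np.concatenate([np.zeros(3), rho]))[:3, :3]
    return np.block([[ph, rh], [np.zeros((3, 3)), ph]])

def fuse_poses(T_meas, Sigmas, iters=20):
    """Iterative fusion of K pose measurements, cf. (7.303)-(7.307)."""
    T_op = np.eye(4)
    for _ in range(iters):
        LHS = np.zeros((6, 6))
        RHS = np.zeros(6)
        for T_k, S_k in zip(T_meas, Sigmas):
            e_k = vee4(np.real(logm(T_k @ np.linalg.inv(T_op))))
            # G_k = J(-e_k)^{-1}, truncated Bernoulli-number series
            ad = curlyhat(e_k)
            G_k = np.eye(6) + 0.5 * ad + ad @ ad / 12.0
            W = np.linalg.inv(S_k)
            LHS += G_k.T @ W @ G_k
            RHS += G_k.T @ W @ e_k
        eps = np.linalg.solve(LHS, RHS)   # optimal perturbation, (7.305)
        T_op = expm(hat4(eps)) @ T_op     # constraint-preserving update, (7.306)
    return T_op, np.linalg.inv(LHS)       # fused mean and covariance, (7.307)
```

As a basic sanity check, fusing several identical measurements must return that measurement as the mean.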

Fusion Experiment
To validate the pose fusion method from the previous subsection, we used a true pose given by

$$ T_{true} = \exp\left(\xi_{true}^\wedge\right), \quad \xi_{true} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & \pi/6 \end{bmatrix}^T, \tag{7.308} $$

and then generated 3 random pose measurements,

$$ \bar{T}_1 = \exp\left(\epsilon_1^\wedge\right)T_{true}, \quad \bar{T}_2 = \exp\left(\epsilon_2^\wedge\right)T_{true}, \quad \bar{T}_3 = \exp\left(\epsilon_3^\wedge\right)T_{true}, \tag{7.309} $$

where

$$ \epsilon_1 \sim \mathcal{N}\left(0, \mathrm{diag}\left(10, 5, 5, \tfrac{1}{2}, 1, \tfrac{1}{2}\right)\right), \quad \epsilon_2 \sim \mathcal{N}\left(0, \mathrm{diag}\left(5, 15, 5, \tfrac{1}{2}, \tfrac{1}{2}, 1\right)\right), \quad \epsilon_3 \sim \mathcal{N}\left(0, \mathrm{diag}\left(5, 5, 25, 1, \tfrac{1}{2}, \tfrac{1}{2}\right)\right). \tag{7.310} $$

We then solved for the pose using our Gauss-Newton technique (iterating until convergence), using the initial condition, $T_{op} = 1$. We repeated this for $N = 1 \ldots 6$, the number of terms kept in $G_k = \mathcal{J}\left(-e_k(T_{op})\right)^{-1}$. We also used the closed-form expression to compute $\mathcal{J}$ analytically (and then inverted numerically) and this is denoted by '$N = \infty$'.

Figure 7.8 plots two performance metrics. First, it plots the final converged value of the cost function, $J_m$, averaged over M = 1000 random trials, $J = \frac{1}{M}\sum_{m=1}^M J_m$. Second, it plots the root-mean-squared
Figure 7.8 Results from Fusion Experiment: (left) Average final cost function, $J$, as a function of the number of terms, $N$, kept in $\mathcal{J}^{-1}$. (right) Same for the root-mean-squared pose error with respect to the true pose. Both plots show there is benefit in keeping more than one term in $\mathcal{J}^{-1}$. The data point denoted '$\infty$' uses the analytical expression to keep all the terms in the expansion.

pose error (with respect to the true pose) of our estimate, $\bar{T}_m$, again averaged over the same M random trials:

$$ \varepsilon = \sqrt{\frac{1}{M}\sum_{m=1}^M \epsilon_m^T \epsilon_m}, \quad \epsilon_m = \ln\left(T_{true}\,\bar{T}_m^{-1}\right)^\vee. $$

The plots show that both measures of error are monotonically reduced
with increasing N . Moreover, we see that for this example almost all
of the benefit is gained with just four terms (or possibly even two).
The results for N = 2, 3 are identical as are those for N = 4, 5. This
is because in the Bernoulli number sequence, B3 = 0 and B5 = 0, so
these terms make no additional contribution to J −1 . It is also worth
stating that if we make the rotational part of the covariances in (7.310)
any bigger, we end up with a lot of samples that have rotated by more
than angle π, and this can be problematic for the performance metrics
we are using.
Figure 7.9 shows the convergence history of the cost, J, for a single
random trial. The left side shows the strong benefit of iterating over
the solution, while the right side shows that the cost converges to a
lower value by keeping more terms in the approximation of J −1 (cases
N = 2, 4, ∞ shown). It would seem that taking N = 4 for about seven
iterations gains most of the benefit, for this example.
Figure 7.9 Results from Fusion Experiment: (left) Convergence of the cost function, $J$, with successive Gauss-Newton iterations. This is for just one of the M trials used to generate Figure 7.8. (right) Same as left, but zoomed in to show that the $N = 2, 4, \infty$ solutions do converge to progressively lower costs.

7.3.5 Propagating Uncertainty Through a Nonlinear


Camera Model
In estimation problems, we are often faced with passing uncertain quan-
tities through nonlinear measurement models to produce expected mea-
surements. Typically this is carried out via linearization (Matthies and
Shafer, 1987). Sibley (2007) shows how to carry out a second-order
propagation for a stereo camera model accounting for landmark uncer-
tainty, but not pose uncertainty. Here we derive the full second-order
expression for the mean (and covariance) and compare this with Monte
Carlo, the sigmapoint transformation, and linearization. We begin by
discussing our representation of points and then present the Taylor-
series expansion of the measurement (camera) model followed by an
experiment.

Perturbing Homogeneous Points

As we have seen in Section 7.1.8, points in $\mathbb{R}^3$ can be represented using 4 × 1 homogeneous coordinates as follows:

$$ p = \begin{bmatrix} sx \\ sy \\ sz \\ s \end{bmatrix}, \tag{7.311} $$

where $s$ is some real, non-negative scalar.

To perturb points in homogeneous coordinates, we will operate directly on the xyz components by writing

$$ p = \bar{p} + D\,\zeta, \tag{7.312} $$

where $\zeta \in \mathbb{R}^3$ is the perturbation and $D$ is a dilation matrix given by

$$ D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}. \tag{7.313} $$

We thus have that $E[p] = \bar{p}$ and

$$ E\left[(p - \bar{p})(p - \bar{p})^T\right] = D\, E[\zeta\zeta^T]\, D^T, \tag{7.314} $$

with no approximation.
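As a minimal sketch (NumPy; the point and covariance values are made up), the dilation construction is one line of code; note how (7.314) zeroes the fourth row and column, leaving the scale entry deterministic.

```python
import numpy as np

D = np.vstack([np.eye(3), np.zeros((1, 3))])  # dilation matrix, cf. (7.313)

p_bar = np.array([10.0, 5.0, 20.0, 1.0])      # nominal homogeneous point
Sigma_zeta = np.diag([0.01, 0.04, 0.09])      # covariance of the 3x1 perturbation

# Covariance of p = p_bar + D zeta, cf. (7.314)
Sigma_p = D @ Sigma_zeta @ D.T
```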

Taylor-Series Expansion of Camera Model


It is common to linearize a nonlinear observation model for use in pose
estimation. In this section, we show how to do a more general Taylor-
series expansion of such a model and work out the second-order case
in detail. Our camera model will be
y = g(T, p), (7.315)
where T is the pose of the camera and p is the position of a landmark
(as a homogeneous point). Our task will be to pass a Gaussian rep-
resentation of the pose and landmark, given by {T̄, p̄, Ξ} where Ξ is
a 9 × 9 covariance for both quantities, through the camera model to
produce a mean and covariance for the measurement36 , {y, R}.
We can think of this as the composition of two nonlinearities, one
to transfer the landmark into the camera frame, z(T, p) = Tp, and
one to produce the observations from the point in the camera frame,
y = s(z). Thus we have
g(T, p) = s(z(T, p)). (7.316)
We will treat each one in turn. If we change the pose of the camera
and/or the position of the landmark a little bit, we have
 
$$ z = Tp = \exp(\xi^\wedge)\,\bar{T}\,(\bar{p} + D\,\zeta) \approx \left(1 + \xi^\wedge + \tfrac{1}{2}\xi^\wedge\xi^\wedge\right)\bar{T}\,(\bar{p} + D\,\zeta), \tag{7.317} $$
where we have kept the first two terms in the Taylor series for the pose
36 In this example, the only sources of uncertainty come from the pose and the point and
we neglect inherent measurement noise, but this could be incorporated as additive
Gaussian noise, if desired.

perturbation. If we multiply out and continue to keep only those terms that are second-order or lower in $\xi$ and $\zeta$, we have

$$ z \approx \bar{z} + Z\,\theta + \frac{1}{2}\sum_{i=1}^4 \underbrace{\theta^T \mathcal{Z}_i\, \theta}_{\text{scalar}}\, 1_i, \tag{7.318} $$

where $1_i$ is the $i$th column of the 4 × 4 identity matrix and

$$ \bar{z} = \bar{T}\bar{p}, \tag{7.319a} $$
$$ Z = \begin{bmatrix} (\bar{T}\bar{p})^\odot & \bar{T}D \end{bmatrix}, \tag{7.319b} $$
$$ \mathcal{Z}_i = \begin{bmatrix} 1_i^\circledast (\bar{T}\bar{p})^\odot & 1_i^\circledast\, \bar{T}D \\ \left(1_i^\circledast\, \bar{T}D\right)^T & 0 \end{bmatrix}, \tag{7.319c} $$
$$ \theta = \begin{bmatrix} \xi \\ \zeta \end{bmatrix}. \tag{7.319d} $$

Arriving at these expressions requires repeated application of the identities from Section 7.1.8.

To then apply the nonlinear camera model, we use the chain rule (for first and second derivatives), so that

$$ g(T, p) \approx \bar{g} + G\,\theta + \frac{1}{2}\sum_j \underbrace{\theta^T \mathcal{G}_j\, \theta}_{\text{scalar}}\, 1_j, \tag{7.320} $$

correct to second order in $\theta$, where

$$ \bar{g} = s(\bar{z}), \tag{7.321a} $$
$$ G = SZ, \quad S = \left.\frac{\partial s}{\partial z}\right|_{\bar{z}}, \tag{7.321b} $$
$$ \mathcal{G}_j = Z^T \mathcal{S}_j Z + \sum_{i=1}^4 \underbrace{1_j^T S\, 1_i}_{\text{scalar}}\, \mathcal{Z}_i, \tag{7.321c} $$
$$ \mathcal{S}_j = \left.\frac{\partial^2 s_j}{\partial z\,\partial z^T}\right|_{\bar{z}}, \tag{7.321d} $$

$j$ is an index over the rows of $s(\cdot)$, and $1_j$ is the $j$th column of the identity matrix. The Jacobian of $s(\cdot)$ is $S$ and the Hessian of the $j$th row, $s_j(\cdot)$, is $\mathcal{S}_j$.
If we only care about the first-order perturbation, we simply have
g (T, p) = ḡ + G θ, (7.322)
where ḡ and G are unchanged from above.
These perturbed measurement equations can then be used within
any estimation scheme we like; in the next subsection we will use these
with a stereo camera model to show the benefit of the second-order
terms.

Propagating Gaussian Uncertainty Through the Camera

Suppose that the input uncertainties, embodied by $\theta$, are zero-mean, Gaussian,

$$ \theta \sim \mathcal{N}(0, \Xi), \tag{7.323} $$

where we note that in general there could be correlations between the pose, $T$, and the landmark, $p$.

Then, to first order, our measurement is given by

$$ y_{1st} = \bar{g} + G\,\theta, \tag{7.324} $$

and $\bar{y}_{1st} = E[y_{1st}] = \bar{g}$ since $E[\theta] = 0$ by assumption. The (second-order) covariance associated with the first-order camera model is given by

$$ R_{2nd} = E\left[(y_{1st} - \bar{y}_{1st})(y_{1st} - \bar{y}_{1st})^T\right] = G\,\Xi\,G^T. \tag{7.325} $$

For the second-order camera model, we have

$$ y_{2nd} = \bar{g} + G\,\theta + \frac{1}{2}\sum_j \left(\theta^T\mathcal{G}_j\,\theta\right) 1_j, \tag{7.326} $$

and consequently

$$ \bar{y}_{2nd} = E[y_{2nd}] = \bar{g} + \frac{1}{2}\sum_j \mathrm{tr}\left(\mathcal{G}_j\,\Xi\right) 1_j, \tag{7.327} $$

which has an extra non-zero term as compared to the first-order camera model. The larger the input covariance, $\Xi$, is, the larger this term can become, depending on the nonlinearity. For a linear camera model, $\mathcal{G}_j = 0$ and the second- and first-order camera model means are identical.

We will also compute a (fourth-order) covariance, but with just second-order terms in the camera model expansion. To do this properly, we should expand the camera model to third order, as there is an additional fourth-order covariance term involving the product of first- and third-order camera-model terms; however, this would involve a complicated expression employing the third derivative of the camera model. As such, the approximate fourth-order covariance we will use is given by

$$ R_{4th} \approx E\left[(y_{2nd} - \bar{y}_{2nd})(y_{2nd} - \bar{y}_{2nd})^T\right] = G\,\Xi\,G^T - \frac{1}{4}\sum_{i=1}^J\sum_{j=1}^J \mathrm{tr}\left(\mathcal{G}_i\Xi\right)\mathrm{tr}\left(\mathcal{G}_j\Xi\right) 1_i 1_j^T + \frac{1}{4}\sum_{i,j=1}^J\;\sum_{k,\ell,m,n=1}^9 \mathcal{G}_{i,k\ell}\,\mathcal{G}_{j,mn}\left(\Xi_{k\ell}\Xi_{mn} + \Xi_{km}\Xi_{\ell n} + \Xi_{kn}\Xi_{\ell m}\right) 1_i 1_j^T, \tag{7.328} $$

where $\mathcal{G}_{i,k\ell}$ is the $k\ell$th element of $\mathcal{G}_i$ and $\Xi_{k\ell}$ is the $k\ell$th element of $\Xi$. The first- and third-order terms in the covariance expansion are identically zero owing to the symmetry of the Gaussian density. The last term in the above makes use of Isserlis' theorem for Gaussian variables.

Sigmapoint Method
Finally, we can also make use of the sigmapoint transformation to pass uncertainty through the nonlinear camera model. As in the pose-compounding problem, we tailor this to our specific type of $SE(3)$ perturbation. We begin by approximating the input Gaussian using a finite number of samples, $\{T_\ell, p_\ell\}$:

$$ LL^T = \Xi, \quad \text{(Cholesky decomposition; } L \text{ lower-triangular)} \tag{7.329a} $$
$$ \theta_0 = 0, \tag{7.329b} $$
$$ \theta_\ell = \sqrt{L + \kappa}\,\mathrm{col}_\ell\, L, \quad \ell = 1 \ldots L, \tag{7.329c} $$
$$ \theta_{\ell+L} = -\sqrt{L + \kappa}\,\mathrm{col}_\ell\, L, \quad \ell = 1 \ldots L, \tag{7.329d} $$
$$ \begin{bmatrix} \xi_\ell \\ \zeta_\ell \end{bmatrix} = \theta_\ell, \tag{7.329e} $$
$$ T_\ell = \exp\left(\xi_\ell^\wedge\right)\bar{T}, \tag{7.329f} $$
$$ p_\ell = \bar{p} + D\,\zeta_\ell, \tag{7.329g} $$

where $\kappa$ is a user-definable constant³⁷ and $L = 9$. We then pass each of these samples through the nonlinear camera model:

$$ y_\ell = s\left(T_\ell\, p_\ell\right), \quad \ell = 0 \ldots 2L. \tag{7.330} $$

These are combined to create the output mean and covariance according to

$$ \bar{y}_{sp} = \frac{1}{L+\kappa}\left(\kappa\, y_0 + \frac{1}{2}\sum_{\ell=1}^{2L} y_\ell\right), \tag{7.331a} $$
$$ R_{sp} = \frac{1}{L+\kappa}\left(\kappa\,(y_0 - \bar{y}_{sp})(y_0 - \bar{y}_{sp})^T + \frac{1}{2}\sum_{\ell=1}^{2L} (y_\ell - \bar{y}_{sp})(y_\ell - \bar{y}_{sp})^T\right). \tag{7.331b} $$

The next section will provide the details for a specific nonlinear camera model, $s(\cdot)$, representing a stereo camera.

Stereo Camera Model


To demonstrate the propagation of uncertainty through a nonlinear measurement model, $s(\cdot)$, we will employ our midpoint stereo camera
37 For all experiments in this section, we used κ = 0.

model given by

$$ s(\rho) = \frac{1}{z_3} M\, z, \tag{7.332} $$

where

$$ s = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \end{bmatrix}, \quad z = \begin{bmatrix} \rho \\ 1 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \end{bmatrix}, \quad M = \begin{bmatrix} f_u & 0 & c_u & f_u\frac{b}{2} \\ 0 & f_v & c_v & 0 \\ f_u & 0 & c_u & -f_u\frac{b}{2} \\ 0 & f_v & c_v & 0 \end{bmatrix}, \tag{7.333} $$

and $f_u$, $f_v$ are the horizontal, vertical focal lengths (in pixels), $(c_u, c_v)$ is the optical center of the images (in pixels), and $b$ is the separation between the cameras (in metres). The optical axis of the camera is along the $z_3$ direction.

The Jacobian of this measurement model is given by

$$ \frac{\partial s}{\partial z} = \frac{1}{z_3} M \begin{bmatrix} 1 & 0 & -z_1/z_3 & 0 \\ 0 & 1 & -z_2/z_3 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & -z_4/z_3 & 1 \end{bmatrix}, \tag{7.334} $$

and the Hessian is given by

$$ \frac{\partial^2 s_1}{\partial z\,\partial z^T} = \frac{f_u}{z_3^2}\begin{bmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & \frac{2z_1 + b z_4}{z_3} & -\frac{b}{2} \\ 0 & 0 & -\frac{b}{2} & 0 \end{bmatrix}, $$
$$ \frac{\partial^2 s_2}{\partial z\,\partial z^T} = \frac{\partial^2 s_4}{\partial z\,\partial z^T} = \frac{f_v}{z_3^2}\begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & -1 & \frac{2z_2}{z_3} & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, $$
$$ \frac{\partial^2 s_3}{\partial z\,\partial z^T} = \frac{f_u}{z_3^2}\begin{bmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & \frac{2z_1 - b z_4}{z_3} & \frac{b}{2} \\ 0 & 0 & \frac{b}{2} & 0 \end{bmatrix}, \tag{7.335} $$
where we have shown each component separately.
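The model and its Jacobian are straightforward to code and to sanity-check against finite differences; the sketch below (NumPy) uses the same intrinsics and landmark as the experiment in the next subsection.

```python
import numpy as np

def stereo_M(fu, fv, cu, cv, b):
    """Stack the two image planes into the 4x4 matrix M of (7.333)."""
    return np.array([[fu, 0.0, cu, fu * b / 2.0],
                     [0.0, fv, cv, 0.0],
                     [fu, 0.0, cu, -fu * b / 2.0],
                     [0.0, fv, cv, 0.0]])

def s_of_z(z, M):
    """Midpoint stereo camera model, s = M z / z3, cf. (7.332)."""
    return M @ z / z[2]

def jac_s(z, M):
    """Analytical Jacobian (7.334): (1/z3) M [I - (z/z3) e3^T]."""
    z1, z2, z3, z4 = z
    inner = np.array([[1.0, 0.0, -z1 / z3, 0.0],
                      [0.0, 1.0, -z2 / z3, 0.0],
                      [0.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, -z4 / z3, 1.0]])
    return M @ inner / z3

M = stereo_M(fu=200.0, fv=200.0, cu=0.0, cv=0.0, b=0.25)
z0 = np.array([10.0, 10.0, 10.0, 1.0])  # homogeneous point in the camera frame

# Central finite-difference check of the analytical Jacobian
S = jac_s(z0, M)
h = 1e-6
S_fd = np.column_stack([
    (s_of_z(z0 + h * e, M) - s_of_z(z0 - h * e, M)) / (2.0 * h)
    for e in np.eye(4)])
```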

Camera Experiment
We used the following methods to pass a Gaussian uncertainty on cam-
era pose and landmark position through the nonlinear stereo camera
model:

(i) Monte Carlo: We drew a large number, M = 1000000, of random samples from the input density, passed these through the camera model, and then computed the mean, $\bar{y}_{mc}$, and covariance, $R_{mc}$. This slow-but-accurate approach served as our benchmark to which the other three much faster methods were compared.
(ii) First/Second-Order: We used the first-order camera model to compute $\bar{y}_{1st}$ and $R_{2nd}$, as described above.
(iii) Second/Fourth-Order: We used the second-order camera model to compute $\bar{y}_{2nd}$ and $R_{4th}$, as described above.
(iv) Sigmapoint: We used the sigmapoint method described above to compute $\bar{y}_{sp}$ and $R_{sp}$.

Figure 7.10 Results from Stereo Camera Experiment: (left) Mean and (right) covariance errors, $\varepsilon_{mean}$ and $\varepsilon_{cov}$, for three methods of passing a Gaussian uncertainty through a nonlinear stereo camera model, as compared to Monte Carlo. The parameter, $\alpha$, scales the magnitude of the input covariance matrix.
The camera parameters were

$$ b = 0.25\ \text{m}, \quad f_u = f_v = 200\ \text{pixels}, \quad c_u = c_v = 0\ \text{pixels}. $$

We used the camera pose $T = 1$ and let the landmark be located at $p = \begin{bmatrix} 10 & 10 & 10 & 1 \end{bmatrix}^T$. For the combined pose/landmark uncertainty, we used an input covariance of

$$ \Xi = \alpha \times \mathrm{diag}\left(\tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{1}{100}, \tfrac{1}{100}, \tfrac{1}{100}, 1, 1, 1\right), $$
where α ∈ [0, 1] is a scaling parameter that allowed us to parametrically
increase the magnitude of the uncertainty.
To gauge performance, we evaluated both the mean and covariance
of each method by comparing the results to those of the Monte Carlo
Figure 7.11 Results from Stereo Camera Experiment: A portion of the left image of a stereo camera showing the mean and covariance (as a one-standard-deviation ellipse) for four methods of imaging a landmark with Gaussian uncertainty on the camera's pose and the landmark's position. This case corresponds to the $\alpha = 1$ data point in Figure 7.10.

simulation according to the following metrics:

$$ \varepsilon_{mean} = \sqrt{(\bar{y} - \bar{y}_{mc})^T(\bar{y} - \bar{y}_{mc})}, \quad \varepsilon_{cov} = \sqrt{\mathrm{tr}\left((R - R_{mc})^T(R - R_{mc})\right)}, $$

where the latter is the Frobenius norm.


Figure 7.10 shows the two performance metrics, εmean and εcov , for
each of the three techniques over a wide range of noise scalings, α.
We see that the sigmapoint technique does the best on both mean
and covariance. The second-order technique does reasonably well on
the mean, but the corresponding fourth-order technique does poorly
on the covariance (due to our inability to compute a fully fourth-order-
accurate covariance, as explained earlier).
Figure 7.11 provides a snapshot of a portion of the left image of the
stereo camera with the mean and one-standard-deviation covariance
ellipses shown for all techniques. We see that the sigmapoint technique
does an excellent job on both the mean and the covariance, while the
others do not fare as well.

7.4 Summary
The main take-away points from this chapter are:

1. While rotations and poses cannot be described using vector spaces,
we can describe them using the matrix Lie groups, SO(3) and SE(3).
2. We can perturb both rotations and poses conveniently by using the
exponential map, which maps R3 and R6 surjectively to SO(3) and
SE(3), respectively. We can use this mapping for two different pur-
poses within state estimation:
(i) To adjust a point estimate (i.e., mean or MAP) of rotation or
pose by a little bit during an optimal estimation procedure,
(ii) To define Gaussian-like PDFs for rotations and poses by map-
ping Gaussian noise onto SO(3) and SE(3) through the expo-
nential map.
3. Our ability to represent uncertainty for rotations and poses using
the methods in this chapter is limited to only small amounts. We
cannot represent uncertainty globally on SO(3) and SE(3) using
our Gaussian-like PDFs. However, these methods are good enough
to allow us to modify the estimation techniques from the first part
of the book for use with rotations and poses.
The last part of the book will bring together these matrix-Lie-group
tools with the estimation techniques from the first part of the book, in
order to carry out state estimation for practical robotics problems.

7.5 Exercises
7.5.1 Prove that

  (Cu)∧ ≡ C u∧ CT .

7.5.2 Prove that

  (Cu)∧ ≡ (2 cos φ + 1) u∧ − u∧ C − CT u∧ .

7.5.3 Prove that

  exp((Cu)∧) ≡ C exp(u∧) CT .

7.5.4 Prove that

  (T x)∧ ≡ T x∧ T−1 .

7.5.5 Prove that

  exp((T x)∧) ≡ T exp(x∧) T−1 .

7.5.6 Work out the expression for Qℓ(ξ) in (7.83b).
7.5.7 Prove that

  x∧ p ≡ p⊙ x.
7.5.8 Prove that
pT x∧ ≡ xT p} .

7.5.9 Prove that

  ∫₀¹ αⁿ (1 − α)ᵐ dα ≡ n! m! / (n + m + 1)!.

Hint: use integration by parts.
7.5.10 Prove the identity

  J̇(φ) − ω∧ J(φ) ≡ ∂ω/∂φ,

where

  ω = J(φ) φ̇

are the rotational kinematics expressed in so(3). Hint: it can be
shown one term at a time by writing out each quantity as a series.
7.5.11 Show that

  (Tp)⊙ ≡ T p⊙ T −1 .
7.5.12 Show that
T T
(Tp) (Tp) ≡ T −T p p T −1 .
7.5.13 Starting from the SE(3) kinematics,

  Ṫ = ϖ∧ T,

show that the kinematics can also be written using the adjoint
quantities:

  Ṫ = ϖ⋏ T ,

with T here denoting the 6 × 6 adjoint of the pose.
7.5.14 Show that it is possible to work with a modified version of the
homogeneous-point representation when using the adjoint quantities:

  Ad(Tp) = Ad(T) Ad(p),
   6×3      6×6    6×3

(with Tp the usual 4 × 1 homogeneous point), where we abuse notation
and define an adjoint operator for a homogeneous point as

  Ad( [c ; 1] ) = [c∧ ; 1],    Ad−1( [A ; B] ) = [(AB−1)∨ ; 1],

with c a 3 × 1 and A, B both 3 × 3.
Part III

Applications

8

Pose Estimation Problems

In this last part of the book, we will address some key three-dimensional
estimation problems from robotics. We will bring together the ideas
from Part I on classic state estimation with the three-dimensional ma-
chinery of Part II.
This chapter will start by looking at a key problem, aligning two
point-clouds (i.e., collections of points) using the principle of least
squares. We will then return to the EKF and batch state estimators and
adjust these to work with rotation and pose variables, in the context of
a specific pose estimation problem. Our focus will be on localization of
a vehicle when the geometry of the world is known. The next chapter
will address the more difficult scenario of unknown world geometry.

8.1 Point-Cloud Alignment


In this section, we will study a classic result in pose estimation. Specifi-
cally, we present the solution for aligning two sets of three-dimensional
points, or point-clouds, while minimizing a least-squares cost function.
The caveat is that the weights associated with each term in the cost
function must be scalars, not matrices; this can be referred to as ordi-
nary least squares1 .
This result is used commonly in the popular iterative closest point
(ICP) (Besl and McKay, 1992) algorithm for aligning three-dimensional
points to a three-dimensional model. It is also used inside outlier rejec-
tion schemes, such as RANSAC (Fischler and Bolles, 1981) (see Sec-
tion 5.3.1), for rapid pose determination using a minimal set of points.
We will present the solution using three different parameterizations
of the rotation/pose variable: unit-length quaternions, rotation ma-
trices, and then transformation matrices. The (non-iterative) quater-
nion approach comes down to solving an eigenproblem, while the (non-
iterative) rotation-matrix approach turns into a singular-value decom-
position. Finally, the iterative transformation matrix approach only
involves solving a system of linear equations.
1 This problem finds its origin in spacecraft attitude determination, with the famous
Wahba’s problem (Wahba, 1965).

Figure 8.1 Definition of reference frames for a point-cloud alignment
problem. There is a stationary reference frame and a moving reference
frame, attached to a vehicle. A known stationary collection of points,
Pj, are observed in both frames and the goal is to determine the
relative pose of the moving frame with respect to the stationary one by
aligning the two point-clouds.

8.1.1 Problem Setup
We will use the setup in Figure 8.1. There are two reference frames, one
non-moving, → F i , and one attached to a moving vehicle, → F vk . In
particular, we have M measurements, r_vk^{pj vk} , where j = 1 . . . M , of points
on the vehicle (expressed in the moving frame → F vk ). We assume these
measurements could have been corrupted by noise.
Let us assume that we know r_i^{pj i} , the position of each point, Pj ,
Let us assume that we know ri j , the position of each point, Pj ,
located and expressed in the non-moving frame, → F i . For example, in

the ICP algorithm these points are determined by finding the closest
point on the model to each observed point. Thus, we seek to align a
collection of M points expressed in two different references frames. In
other words, we want to find the translation and rotation that best align
the two point-clouds2 . Note, in this first problem we are only carrying
out the alignment at a single time, k. We will consider a point-cloud
tracking problem later in the chapter.

8.1.2 Unit-Length Quaternion Solution


We will present the unit-length quaternion approach to aligning point-
clouds first3 . This solution was first studied by Davenport (1965) in the
aerospace world and later by Horn (1987b) in robotics. We will use our
quaternion notation defined earlier in Section 6.2.3. The quaternion ap-
proach has an advantage over the rotation-matrix case (to be described
in the next section) because the constraints required to produce a valid
rotation are easier for unit-length quaternions.
2 An unknown scale between the two point-clouds is also sometimes folded into the
problem; we will assume the two point-clouds have the same scale.
3 Our focus in this book is on the use of rotation matrices, but this is an example of a
problem where unit-length quaternions make things easier and therefore we include the
derivation.

To work with quaternions, we define the following 4 × 1 homogeneous
versions of our points:

  yj = [ r_vk^{pj vk} ; 1 ],   pj = [ r_i^{pj i} ; 1 ],   (8.1)
where we have dropped the sub- and superscripts except for j, the point
index.
We would like to find the translation, r, and rotation, q, that best
align these points, thereby giving us the relative pose between → F vk
and → F i . We note that the relationships between the quaternion versions
of the translation, r, and rotation, q, and our usual 3 × 1 translation,
r_i^{vk i} , and 3 × 3 rotation matrix, Cvk i , are defined by

  [ r_vk^{pj vk} ; 1 ] = [ Cvk i  0 ; 0^T  1 ] ( [ r_i^{pj i} ; 1 ] − [ r_i^{vk i} ; 0 ] ),   (8.2)

(the four bracketed quantities being yj , q^{-1+} q^⊕ , pj , and r, respectively)

which is just an expression of the geometry of the problem in the
absence of any noise corrupting the measurements. Using the identity
in (6.19) we can rewrite this as

  yj = q^{-1+} (pj − r)^+ q,   (8.3)

which again is in the absence of any noise.
Referring to (8.3), we could form an error quaternion for point Pj as

  ej = yj − q^{-1+} (pj − r)^+ q,   (8.4)

but instead we can manipulate the above to generate an error that
appears linear in q,

  e′j = q^+ ej = ( yj^⊕ − (pj − r)^+ ) q.   (8.5)

We will define the total objective function (to minimize), J, as

  J(q, r, λ) = (1/2) Σ_{j=1}^{M} wj e′j^T e′j − (1/2) λ (q^T q − 1),   (8.6)

where the last term is a Lagrange multiplier term and the wj are
unique scalar weights assigned to each of the point pairs. We have
included the Lagrange multiplier term on the right to ensure the
unit-length constraint on the rotation quaternion. It is also worth
noting that selecting e′j over ej has no effect on our objective
function since

  e′j^T e′j = (q^+ ej)^T (q^+ ej) = ej^T q^{+T} q^+ ej = ej^T q^{-1+} q^+ ej = ej^T ej.   (8.7)

Inserting the expression for e′j into the objective function we see

  J(q, r, λ) = (1/2) Σ_{j=1}^{M} wj q^T ( yj^⊕ − (pj − r)^+ )^T ( yj^⊕ − (pj − r)^+ ) q
               − (1/2) λ (q^T q − 1).   (8.8)
Taking the derivative of the objective function with respect to q, r, and
λ we find

  ∂J/∂q^T = Σ_{j=1}^{M} wj ( yj^⊕ − (pj − r)^+ )^T ( yj^⊕ − (pj − r)^+ ) q − λq,   (8.9a)

  ∂J/∂r^T = q^{-1⊕} Σ_{j=1}^{M} wj ( yj^⊕ − (pj − r)^+ ) q,   (8.9b)

  ∂J/∂λ = −(1/2) (q^T q − 1).   (8.9c)
Setting the second to zero we find

  r = p − q^+ y^+ q^{-1} ,   (8.10)
where p and y are defined below. Thus, the optimal translation is the
difference of the centroids of the two point-clouds, in the stationary
frame.
Substituting r into the first and setting to zero we can show

  W q = λ q,   (8.11)

where

  W = (1/w) Σ_{j=1}^{M} wj ( (yj − y)^⊕ − (pj − p)^+ )^T ( (yj − y)^⊕ − (pj − p)^+ ),   (8.12a)

  y = (1/w) Σ_{j=1}^{M} wj yj ,   p = (1/w) Σ_{j=1}^{M} wj pj ,   w = Σ_{j=1}^{M} wj .   (8.12b)

We can see this is just an eigenproblem4 . If the eigenvalues are positive


and the smallest eigenvalue is distinct (i.e., not repeated), then finding
the smallest eigenvalue and the corresponding unique eigenvector will
yield q to within a multiplicative constant and our constraint that
qT q = 1 makes the solution unique.
4 The eigenproblem associated with an N × N matrix, A, is defined by the equation
Ax = λx. The N (not necessarily distinct) eigenvalues, λi , are found by solving for the
roots of det (A − λ1) = 0 and then for each eigenvalue the corresponding eigenvector,
xi , is found (to within a multiplicative constant) through substitution of the eigenvalue
into the original equation and appropriate manipulation. The case of non-distinct
eigenvalues is tricky and requires advanced linear algebra.
Figure 8.2 Steps involved in aligning two point-clouds: point-clouds
with correspondences {yj , pj} → statistical moments {w, y, p, W} →
pose change {r, q} → aligned point-clouds.

To see that we want the smallest eigenvalue, we first note that W
is both symmetric and positive-semidefinite. Positive-semidefiniteness
implies that all the eigenvalues of W are non-negative. Next, we can
see from setting (8.9a) to zero that an equivalent expression for W is

  W = Σ_{j=1}^{M} wj ( yj^⊕ − (pj − r)^+ )^T ( yj^⊕ − (pj − r)^+ ).   (8.13)
j=1

Substituting this into the objective function in (8.8) we immediately
see that

  J(q, r, λ) = (1/2) q^T W q − (1/2) λ (q^T q − 1) = (1/2) λ,   (8.14)

using Wq = λq and q^T q = 1.

Thus, picking the smallest possible value for λ will minimize the objec-
tive function.
However, there are some complications if W is singular or the small-
est eigenvalue is not distinct. Then there can be multiple choices for the
eigenvector corresponding to the smallest eigenvalue, and therefore the
solution may not be unique. We will forgo discussing this further for the
quaternion method as it would require advanced linear algebra tech-
niques (e.g., Jordan normal form) and instead return to this issue in
the next section when using rotation matrices.
Note, we have not made any approximations or linearizations in our
technique, but this depends heavily on the fact that the weights are
scalars, not matrices. Figure 8.2 shows the process to align two point-
clouds. Once we have the r and q, we can construct the final estimates
of the rotation matrix, Ĉvk i , and translation, r̂_i^{vk i} , from

  [ Ĉvk i  0 ; 0^T  1 ] = q^{-1+} q^⊕ ,   [ r̂_i^{vk i} ; 0 ] = r,   (8.15)

and then we can construct an estimated transformation matrix according to

  T̂vk i = [ Ĉvk i   −Ĉvk i r̂_i^{vk i} ; 0^T  1 ],   (8.16)
which combines our rotation and translation into a single answer for

the best alignment of the point-clouds. Referring back to Section 6.3.2,
we may actually be interested in T̂i vk , which can be recovered using

  T̂i vk = T̂vk i^{−1} = [ Ĉi vk   r̂_i^{vk i} ; 0^T  1 ].   (8.17)

Both forms of the transformation matrix are useful, depending on how
Both forms of the transformation matrix are useful, depending on how
the solution will be used.
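The full quaternion pipeline of Figure 8.2 — build W from the centered homogeneous points, take the eigenvector of its smallest eigenvalue, and recover C and r — can be sketched numerically. The code below is an illustrative implementation, not the book's: it assumes the q^+ and q^⊕ operator conventions of Section 6.2.3 (vector part first, scalar part last), and all function names are invented here:

```python
import numpy as np

def wedge(v):
    # 3x3 skew-symmetric matrix: wedge(v) @ u == np.cross(v, u)
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def q_plus(q):
    # left-hand compound operator q^+, with q = [epsilon; eta] (scalar last)
    e, n = q[:3], q[3]
    out = np.zeros((4, 4))
    out[:3, :3] = n * np.eye(3) + wedge(e)
    out[:3, 3] = e
    out[3, :3] = -e
    out[3, 3] = n
    return out

def q_oplus(q):
    # right-hand compound operator q^oplus
    e, n = q[:3], q[3]
    out = np.zeros((4, 4))
    out[:3, :3] = n * np.eye(3) - wedge(e)
    out[:3, 3] = e
    out[3, :3] = -e
    out[3, 3] = n
    return out

def q_inv(q):
    # inverse of a unit-length quaternion: negate the vector part
    return np.array([-q[0], -q[1], -q[2], q[3]])

def align_quaternion(ps, ys, ws):
    # ps, ys: lists of 4x1 homogeneous points; ws: scalar weights
    w = sum(ws)
    p_bar = sum(wj * pj for wj, pj in zip(ws, ps)) / w
    y_bar = sum(wj * yj for wj, yj in zip(ws, ys)) / w
    W = np.zeros((4, 4))
    for wj, pj, yj in zip(ws, ps, ys):
        A = q_oplus(yj - y_bar) - q_plus(pj - p_bar)  # cf. (8.12a)
        W += (wj / w) * (A.T @ A)
    lam, vecs = np.linalg.eigh(W)      # eigenvalues in ascending order
    q = vecs[:, 0]                     # eigenvector of the smallest eigenvalue
    T = q_plus(q_inv(q)) @ q_oplus(q)  # = [[C, 0], [0^T, 1]], cf. (8.15)
    C = T[:3, :3]
    r = p_bar[:3] - C.T @ y_bar[:3]    # difference of centroids, stationary frame
    return C, r
```

With noiseless correspondences the smallest eigenvalue of W is exactly zero and the alignment is exact; `np.linalg.eigh` is used because W is symmetric positive-semidefinite, and the eigenvector's sign ambiguity is harmless since q and −q give the same rotation.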

8.1.3 Rotation Matrix Solution


The rotation-matrix case was originally studied outside of robotics by
Green (1952) and Wahba (1965) and later within robotics by Horn
(1987a) and Arun et al. (1987) and later by Umeyama (1991) consider-
ing the det C = 1 constraint. We follow the approach of de Ruiter and
Forbes (2014), which captures all of the cases in which C can be deter-
mined uniquely. We also identify how many global and local solutions
can exist for C when there is not a single global solution.
As in the previous section, we will use some simplified notation to
avoid repeating sub- and superscripts:

  yj = r_vk^{pj vk} ,   pj = r_i^{pj i} ,   r = r_i^{vk i} ,   C = Cvk i .   (8.18)

Also, we define

  y = (1/w) Σ_{j=1}^{M} wj yj ,   p = (1/w) Σ_{j=1}^{M} wj pj ,   w = Σ_{j=1}^{M} wj ,   (8.19)
where the wj are scalar weights for each point. Note that, as compared
to the last section, some of the symbols are now 3 × 1 rather than 4 × 1.
We define an error term for each point:

ej = yj − C(pj − r) (8.20)

Our estimation problem is then to globally minimize the cost function,

  J(C, r) = (1/2) Σ_{j=1}^{M} wj ej^T ej
          = (1/2) Σ_{j=1}^{M} wj ( yj − C(pj − r) )^T ( yj − C(pj − r) ),   (8.21)
subject to C ∈ SO(3) (i.e., CCT = 1 and det C = 1).
Before carrying out the optimization, we will make a change of vari-
ables for the translation parameter. Define

d = r + CT y − p, (8.22)

which is easy to isolate for r if all the other quantities are known. In

this case, we can rewrite our cost function as

  J(C, d) = (1/2) Σ_{j=1}^{M} wj ( (yj − y) − C(pj − p) )^T ( (yj − y) − C(pj − p) )
            + (1/2) d^T d,   (8.23)
which is the sum of two positive-definite terms, the first depending only
on C and the second only on d. We can minimize the second trivially
by taking d = 0, which in turn implies that
r = p − CT y. (8.24)
As in the quaternion case, this is simply the difference of the centroids
of the two point-clouds, expressed in the stationary frame.
What is left is to minimize the first term with respect to C. We note
that if we multiply out each smaller term within the first large term,
only one part actually depends on C:
  ( (yj − y) − C(pj − p) )^T ( (yj − y) − C(pj − p) )
      = (yj − y)^T (yj − y) − 2 (yj − y)^T C(pj − p) + (pj − p)^T (pj − p),   (8.25)

where the first and last terms are independent of C and the middle term
equals −2 tr( C(pj − p)(yj − y)^T ).
Summing this middle term over all the (weighted) points, we have

  (1/w) Σ_{j=1}^{M} wj (yj − y)^T C(pj − p) = (1/w) Σ_{j=1}^{M} wj tr( C(pj − p)(yj − y)^T )
      = tr( C (1/w) Σ_{j=1}^{M} wj (pj − p)(yj − y)^T ) = tr( C W^T ),   (8.26)

where

  W = (1/w) Σ_{j=1}^{M} wj (yj − y)(pj − p)^T .   (8.27)
This W matrix plays a similar role to the one in the quaternion section,
by capturing the spread of the points (similar to an inertia matrix in
dynamics), but it is not exactly the same. Therefore, we can define a
new cost function that we seek to minimize with respect to C as

  J(C, Λ, γ) = −tr(CW^T) + tr( Λ(CC^T − 1) ) + γ(det C − 1),   (8.28)
where Λ and γ are Lagrange multipliers associated with the two terms

on the right; these are used to ensure that the resulting C ∈ SO(3).
Note, when CCT = 1 and det C = 1, these terms have no effect on the
resulting cost. It is also worth noting that Λ is symmetric since we only
need to enforce six orthogonality constraints. This new cost function
will be minimized by the same C as our original one.
Taking the derivative of J(C, Λ, γ) with respect to C, Λ, and γ, we
have5
  ∂J/∂C = −W + 2ΛC + γ (det C) C^{−T} = −W + LC   (using det C = 1, C^{−T} = C),   (8.29a)

  ∂J/∂Λ = CC^T − 1,   (8.29b)

  ∂J/∂γ = det C − 1,   (8.29c)
where we have lumped together the Lagrange multipliers as L = 2Λ +
γ1. Setting the first equation to zero, we find that
LC = W. (8.30)
At this point, our explanation can proceed in a simplified or detailed
manner, depending on the level of fidelity we want to capture.
Before moving forward, we show that it is possible to arrive at (8.30)
using our Lie group tools without the use of Lagrange multipliers. We
consider a perturbation of the rotation matrix of the form

  C′ = exp(φ∧) C,   (8.31)

and then take the derivative of the objective function with respect to φ
and set this to zero for a critical point. For the derivative with respect
to the ith element of φ we have

  ∂J/∂φi = lim_{h→0} ( J(C′) − J(C) ) / h
         = lim_{h→0} ( −tr(C′W^T) + tr(CW^T) ) / h
         = lim_{h→0} ( −tr(exp(h 1i∧)CW^T) + tr(CW^T) ) / h
         ≈ lim_{h→0} ( −tr((1 + h 1i∧)CW^T) + tr(CW^T) ) / h
         = lim_{h→0} −tr(h 1i∧ CW^T) / h
         = −tr( 1i∧ CW^T ).   (8.32)
5 We require these useful facts to take the derivatives:
  ∂/∂A det A = det(A) A^{−T},   ∂/∂A tr(AB^T) = B,   ∂/∂A tr(BAA^T) = (B + B^T)A.

Setting this to zero we require

  (∀i)   tr( 1i∧ C W^T ) = 0.   (8.33)

Owing to the skew-symmetric nature of the ∧ operator, this implies


that L = CWT is a symmetric matrix for a critical point. Taking the
transpose and right-multiplying by C we come back to (8.30). We now
continue with the main derivation.
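The directional derivative in (8.32) can be checked numerically with a finite difference. The following sketch (an illustration, not from the text) perturbs an arbitrary C by exp(h 1i∧), implemented via Rodrigues' formula, and compares against the analytic value −tr(1i∧ CW^T):

```python
import numpy as np

def wedge(v):
    # 3x3 skew-symmetric matrix built from v
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(phi):
    # Rodrigues' formula for the matrix exponential exp(phi^) on SO(3)
    a = np.linalg.norm(phi)
    if a < 1e-12:
        return np.eye(3) + wedge(phi)
    n = phi / a
    return (np.cos(a) * np.eye(3) + (1.0 - np.cos(a)) * np.outer(n, n)
            + np.sin(a) * wedge(n))

def J(C, W):
    # the part of the objective that depends on C
    return -np.trace(C @ W.T)

rng = np.random.default_rng(0)
C = exp_so3(rng.standard_normal(3))   # an arbitrary rotation (not a critical point)
W = rng.standard_normal((3, 3))
h = 1e-7
errs = []
for i in range(3):
    e = np.zeros(3)
    e[i] = 1.0
    numeric = (J(exp_so3(h * e) @ C, W) - J(C, W)) / h
    analytic = -np.trace(wedge(e) @ C @ W.T)
    errs.append(abs(numeric - analytic))
```

The discrepancies in `errs` shrink linearly with the step size h, as expected for a one-sided finite difference.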

Simplified Explanation
If we somehow knew that det W > 0, then we could proceed as follows.
First, we postmultiply (8.30) by itself transposed to find

  L CC^T L^T = L L^T = WW^T .   (8.34)

Since L is symmetric, we have that

  L = ( WW^T )^{1/2} ,   (8.35)

which we see involves a matrix square root. Substituting this back
into (8.30), the optimal rotation is

  C = ( WW^T )^{−1/2} W.   (8.36)
This has the same form as the projection onto SO(3) discussed in
Section 7.2.1.
Unfortunately, this approach does not tell the entire story since it
relies on assuming something about W, and therefore does not capture
all of the subtleties of the problem. With lots of non-coplanar points,
this method will typically work well. However, there are some difficult
cases for which we need a more detailed analysis. A common situation
in which this problem occurs is when carrying out alignments using just
three pairs of noisy points in the RANSAC algorithm discussed earlier.
The next section provides a more thorough analysis of the solution that
handles the difficult cases.
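For reference, the projection (8.36) is easy to sketch with a symmetric eigendecomposition of WW^T. This illustrative code (not from the text) assumes det W > 0, the case where the simplified approach is valid:

```python
import numpy as np

def project_to_so3(W):
    # C = (W W^T)^(-1/2) W, computed via the symmetric eigendecomposition
    # W W^T = U diag(lam) U^T; valid only when det(W) > 0
    lam, U = np.linalg.eigh(W @ W.T)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T
    return inv_sqrt @ W

# demonstration on a random full-rank matrix with positive determinant
rng = np.random.default_rng(3)
W = rng.standard_normal((3, 3))
if np.linalg.det(W) < 0:
    W = -W   # flip the sign so that det W > 0 for this demonstration
C = project_to_so3(W)  # orthogonal with det C = +1
```

When det W > 0 this matches the SVD answer C = UV^T exactly; when det W < 0 it returns a reflection (det C = −1), which is precisely the failure mode motivating the detailed analysis.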

Detailed Explanation
The detailed explanation begins by first carrying out a singular-value
decomposition (SVD)6 on the (square, real) matrix, W, so that
W = UDVT , (8.37)
6 The singular-value decomposition of a real M × N matrix, A, is a factorization of the
form A = UDVT where U is an M × M real, orthogonal matrix (i.e., UT U = 1), D is
an M × N matrix with real entries di ≥ 0 on the main diagonal (all other entries zero),
and V is an N × N real, orthogonal matrix (i.e., VT V = 1). The di are called the
singular values and are typically ordered from largest to smallest along the diagonal of
D. Note, the SVD is not unique.

where U and V are square, orthogonal matrices and D = diag(d1 , d2 , d3 )


is a diagonal matrix of singular values, d1 ≥ d2 ≥ d3 ≥ 0.
Returning to (8.30), we can substitute in the SVD of W so that

  L² = LL^T = L CC^T L^T = WW^T = U D V^T V D^T U^T = U D² U^T .   (8.38)
Taking the matrix square-root, we can write that
L = UMUT , (8.39)
where M is the symmetric, matrix square root of D2 . In other words,
M2 = D2 . (8.40)
It can be shown (de Ruiter and Forbes, 2014) that every real, symmetric
M satisfying this condition can be written in the form
M = YDSYT , (8.41)
where S = diag(s1 , s2 , s3 ) with si = ±1 and Y an orthogonal matrix
(i.e., YT Y = YYT = 1). An obvious example of this is Y = 1 with si =
±1 and any values for di ; a less obvious example that is a possibility
when d1 = d2 is

      [ d1 cos θ    d1 sin θ   0  ]
  M = [ d1 sin θ   −d1 cos θ   0  ]
      [    0           0       d3 ]

      [ cos(θ/2)  −sin(θ/2)  0 ] [ d1    0    0 ] [ cos(θ/2)  −sin(θ/2)  0 ]T
    = [ sin(θ/2)   cos(θ/2)  0 ] [  0  −d1    0 ] [ sin(θ/2)   cos(θ/2)  0 ]  ,   (8.42)
      [    0          0      1 ] [  0    0   d3 ] [    0          0      1 ]

where the three factors are Y, DS, and Y^T, respectively,

for any value of the free parameter, θ. This illustrates an important


point, that the structure of Y can become more complex in correspon-
dence with repeated singular values (i.e., we cannot just pick any Y).
Related to this, we always have that
D = YDYT , (8.43)
due to the relationship between the block structure of Y and the mul-
tiplicity of the singular values in D.
Now, we can manipulate the objective function that we want to minimize
as follows:

  J = −tr(CW^T) = −tr(WC^T) = −tr(L) = −tr(U Y DS Y^T U^T)
    = −tr(Y^T U^T U Y DS) = −tr(DS) = −(d1 s1 + d2 s2 + d3 s3).   (8.44)

There are now several cases to consider.

Case (i): det W ≠ 0

Here we have that all of the singular values are positive. From (8.30)
and (8.39) we have that

  det W = det L det C = det L = det(U Y DS Y^T U^T)
        = det(Y^T U^T U Y) det D det S = det D det S,   (8.45)

using det C = 1 and det(Y^T U^T U Y) = 1.

Since the singular values are positive, we have that det D > 0. Or in
other words, the signs of the determinants of S and W must be the
same, which implies that

  det S = sgn(det S) = sgn(det W) = sgn( det U det D det V )
        = det U det V = ±1.   (8.46)

Note, we have det U = ±1 since (det U)2 = det(UT U) = det 1 = 1 and


the same for V. There are now four subcases to consider:
Subcase (i-a): det W > 0
Since det W > 0 by assumption, we must also have det S = 1 and
therefore to uniquely minimize J in (8.44) we must pick s1 = s2 = s3 =
1 since all of the di are positive and therefore we must have Y diagonal.
Thus, from (8.30) we have

  C = L^{−1} W = ( U Y DS Y^T U^T )^{−1} U D V^T
    = U Y S^{−1} D^{−1} Y^T U^T U D V^T = U Y S D^{−1} Y^T D V^T
    = U Y S D^{−1} D Y^T V^T = U Y S Y^T V^T = U S V^T ,   (8.47)

using S^{−1} = S, U^T U = 1, and Y^T D = D Y^T from (8.43), and
with S = diag(1, 1, 1) = 1, which is equivalent to the solution provided


in our ‘simplified explanation’ in the last section.
Subcase (i-b): det W < 0, d1 ≥ d2 > d3 > 0
Since det W < 0 by assumption, we have det S = −1, which means
exactly one of the si must be negative. In this case, we can uniquely
minimize J in (8.44) since the minimum singular value, d3 , is distinct,
whereupon we must pick s1 = s2 = 1 and s3 = −1 for the mini-
mum. Since s1 = s2 = 1, we must have Y diagonal, and therefore
from (8.30) we have that
C = USVT (8.48)
with S = diag(1, 1, −1).

Subcase (i-c): det W < 0, d1 > d2 = d3 > 0


As in the last subcase, we have det S = −1, which means exactly one
of the si must be negative. Looking to (8.44), since d2 = d3 we can pick
either s2 = −1 or s3 = −1 and end up with the same value for J. With
these values for the si we can pick any of the following for Y:

                               [ ±1      0           0      ]
  Y = diag(±1, ±1, ±1),   Y =  [  0   ±cos(θ/2)  ∓sin(θ/2)  ] ,   (8.49)
                               [  0   ±sin(θ/2)  ±cos(θ/2)  ]
where θ is a free parameter. We can plug any of these Y in to find
minimizing solutions for C using (8.30):
C = UYSYT VT (8.50)
with S = diag(1, 1, −1) or S = diag(1, −1, 1). Since θ can be anything,
this means there are an infinite number of solutions that minimize the
objective function.
Subcase (i-d): det W < 0, d1 = d2 = d3 > 0
As in the last subcase, we have det S = −1, which means exactly one
of the si must be negative. Looking to (8.44), since d1 = d2 = d3 we
can pick s1 = −1 or s2 = −1 or s3 = −1 and end up with the same
value for J, implying there are an infinite number of minimizing solutions.

Case (ii): det W = 0


This time there are three subcases to consider depending on how
many singular values are zero.
Subcase (ii-a): rank W = 2
In this case, we have d1 ≥ d2 > d3 = 0. Looking back to (8.44) we
see that we can uniquely minimize J by picking s1 = s2 = 1 and since
d3 = 0, the value of s3 does not affect J and thus it is a free parameter.
Again looking to (8.30) we have
(UYDSYT UT )C = UDVT . (8.51)
Multiplying by UT from the left and V from the right we have
  D U^T C V = D,   i.e.,   D Q = D with Q = U^T C V,   (8.52)

since DS = D due to d3 = 0 and then YDYT = D from (8.43).


The matrix, Q, above will be orthogonal since U, C, and V are all
orthogonal. Since DQ = D, D = diag(d1 , d2 , 0), and QQT = 1, we
know that Q = diag(1, 1, q3 ) with q3 = ±1. We also have that

  q3 = det Q = det U (det C) det V = det U det V = ±1,   (8.53)

and therefore rearranging (and renaming Q as S) we have


C = USVT , (8.54)
with S = diag(1, 1, det U det V).
Subcase (ii-b): rank W = 1
In this case, we have d1 > d2 = d3 = 0. We let s1 = 1 to minimize J
and now s2 and s3 do not affect J and are free parameters. Similarly
to the last subcase, we end up with an equation of the form
DQ = D, (8.55)
which along with D = diag(d1 , 0, 0) and QQT = 1 implies that Q will
have one of the following forms:

      [ 1    0       0    ]        [ 1    0       0    ]
  Q = [ 0  cos θ  −sin θ  ]   or   [ 0  cos θ   sin θ  ] ,   (8.56)
      [ 0  sin θ   cos θ  ]        [ 0  sin θ  −cos θ  ]
       (det Q = 1)                  (det Q = −1)

with θ ∈ R a free parameter. This means there are infinitely many


minimizing solutions. Since

  det Q = det U (det C) det V = det U det V = ±1,   (8.57)

we have (renaming Q as S) that


C = USVT (8.58)
with

      [ 1    0       0    ]
  S = [ 0  cos θ  −sin θ  ]   if det U det V = 1,
      [ 0  sin θ   cos θ  ]
                                                    (8.59)
      [ 1    0       0    ]
  S = [ 0  cos θ   sin θ  ]   if det U det V = −1.
      [ 0  sin θ  −cos θ  ]
Physically, this case corresponds to all of the points being collinear (in
at least one of the frames) so that rotating about the axis formed by
the points through any angle, θ, does not alter the objective function
J.
Subcase (ii-c): rank W = 0
This case corresponds to there being no points or all the points co-
incident and so any C ∈ SO(3) will produce the same value of the
objective function, J.

Summary:
We have provided all of the solutions for C in our point-alignment
problem; depending on the properties of W, there can be one or in-
finitely many global solutions. Looking back through all the cases and
subcases, we can see that if there is a unique global solution for C, it
is always of the form
C = USVT (8.60)
with S = diag(1, 1, det U det V) and W = UDVT is a singular-value
decomposition of W. The necessary and sufficient conditions for this
unique global solution to exist are:
(i) det W > 0, or
(ii) det W < 0 and minimum singular value distinct: d1 ≥ d2 >
d3 > 0, or
(iii) rank W = 2.
If none of these conditions is true, there will be infinitely many solutions for C.
However, these cases are fairly pathological and do not occur frequently
in practical situations.
Once we have solved for the optimal rotation matrix, we take Ĉvk i =
C as our estimated rotation. We build the estimated translation as
  r̂_i^{vk i} = p − Ĉvk i^T y,   (8.61)
and, if desired, combine the translation and rotation into an estimated
transformation matrix,

  T̂vk i = [ Ĉvk i   −Ĉvk i r̂_i^{vk i} ; 0^T  1 ],   (8.62)
that provides the optimal alignment of the two point-clouds in a sin-
gle quantity. Again, as mentioned in Section 6.3.2 we may actually be
interested in T̂i vk , which can be recovered using

  T̂i vk = T̂vk i^{−1} = [ Ĉi vk   r̂_i^{vk i} ; 0^T  1 ].   (8.63)
Both forms of the transformation matrix are useful, depending on how
the solution will be used.
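In the common unique-solution case, the whole rotation-matrix method reduces to a few lines: form W per (8.27), take an SVD, and apply C = USV^T with S = diag(1, 1, det U det V). The following is an illustrative NumPy sketch, not the book's code; the function name is invented here:

```python
import numpy as np

def align_point_clouds(ps, ys, ws):
    """Align two point-clouds: find C in SO(3) and r minimizing the
    weighted least-squares cost. ps, ys are lists of 3x1 points
    (stationary / moving frames); ws are scalar weights. Assumes the
    unique-global-solution conditions identified above hold."""
    w = sum(ws)
    p_bar = sum(wj * pj for wj, pj in zip(ws, ps)) / w
    y_bar = sum(wj * yj for wj, yj in zip(ws, ys)) / w
    W = sum(wj * np.outer(yj - y_bar, pj - p_bar)
            for wj, pj, yj in zip(ws, ps, ys)) / w
    U, d, Vt = np.linalg.svd(W)
    S = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])
    C = U @ S @ Vt                 # guaranteed det C = +1
    r = p_bar - C.T @ y_bar        # difference of centroids
    return C, r
```

The S matrix is what distinguishes this from a naive orthogonal projection: it forces det C = +1 even when the SVD factors would otherwise produce a reflection.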
Example 8.1 We provide an example of subcase (i-b) to make things
tangible. Consider the following two point-clouds that we wish to align,
each consisting of six points:
p1 = 3 × 11 , p2 = 2 × 12 , p3 = 13 , p4 = −3 × 11 ,
p5 = −2 × 12 , p6 = −13 ,
y1 = −3 × 11 , y2 = −2 × 12 , y3 = −13 , y4 = 3 × 11 ,
y5 = 2 × 12 , y6 = 13 ,

where 1i is the ith column of the 3 × 3 identity matrix. The points in


the first point-cloud are the centers of the faces of a rectangular prism
and each point is associated with a point in the second point-cloud on
the opposite face of another prism (that happens to be in the same
location as the first)7 .
Using these points, we have the following:
p = 0, y = 0, W = diag(−18, −8, −2), (8.64)
which means the centroids are already on top of one another so we only
need to rotate to align the point-clouds.
Using the ‘simplified approach’, we have
  C = ( WW^T )^{−1/2} W = diag(−1, −1, −1).   (8.65)

Unfortunately, we can easily see that det C = −1 and so C ∉ SO(3),
/ SO(3),
which indicates this approach has failed.
For the more rigorous approach, a singular-value decomposition of
W is
W = UDVT , U = diag(1, 1, 1), D = diag(18, 8, 2),
V = diag(−1, −1, −1). (8.66)
We have det W = −288 < 0 and see that there is a unique mini-
mum singular value, so we need to use the solution from subcase
(i-b). The minimal solution is therefore of the form C = USVT with
S = diag(1, 1, −1). Plugging this in we find
C = diag(−1, −1, 1), (8.67)
so that det C = 1. This is a rotation about the 13 axis through an angle
π, which brings the error on four of the points to zero and leaves two
of the points with non-zero error. This brings the objective function
(8.21) down to its minimum of J = 4.
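Example 8.1 can be reproduced numerically. The sketch below (not from the text) follows the example in dropping the 1/w normalization of (8.27) — which scales W but does not change the optimal C — and the final C is the same regardless of which signs a particular SVD routine chooses for its factors:

```python
import numpy as np

I = np.eye(3)
ps = [3 * I[:, 0], 2 * I[:, 1], I[:, 2],
      -3 * I[:, 0], -2 * I[:, 1], -I[:, 2]]
ys = [-p for p in ps]   # each y_j is opposite its paired p_j

# centroids are zero, so W is just the (unnormalized) sum of outer products
W = sum(np.outer(y, p) for y, p in zip(ys, ps))
# W = diag(-18, -8, -2): det W < 0 with a distinct minimum singular value
U, d, Vt = np.linalg.svd(W)
S = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt)])
C = U @ S @ Vt   # a rotation through pi about the third axis
```

Here S = diag(1, 1, −1) is what rescues the solution that the simplified square-root approach could not provide.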

Testing for Local Minima


In the previous section, we searched for global minima to the point-
alignment problem and found there could be one or infinitely many.
We did not, however, identify whether it was possible for local minima
to exist, which we study now. Looking back to (8.30), this is the condi-
tion for a critical point in our optimization problem and therefore any
solution that satisfies this criterion could be a minimum, a maximum,
or a saddle point of the objective function, J.
If we have a solution, C ∈ SO(3), that satisfies (8.30), and we want
to characterize it, we can try perturbing the solution slightly and see
7 As a physical interpretation, imagine joining each of the six point pairs by rubber
bands. Finding the C that minimizes our cost metric is the same as finding the
rotation that minimizes the amount of elastic energy stored in the rubber bands.

whether the objective function goes up or down (or both). Consider a


perturbation of the form

  C′ = exp(φ∧) C,   (8.68)

where φ ∈ R³ is a perturbation in an arbitrary direction, but con-
strained to keep C′ ∈ SO(3). The change in the objective function, δJ,
by applying the perturbation is

  δJ = J(C′) − J(C) = −tr(C′W^T) + tr(CW^T) = −tr( (C′ − C)W^T ),   (8.69)

where we have neglected the Lagrange multiplier terms by assuming
the perturbation keeps C′ ∈ SO(3).
Now, approximating the perturbation out to second order, since this
will tell us about the nature of the critical points, we have

  δJ ≈ −tr( ( (1 + φ∧ + (1/2) φ∧ φ∧ ) C − C ) W^T )
     = −tr( φ∧ CW^T ) − (1/2) tr( φ∧ φ∧ CW^T ).   (8.70)
Then, plugging in the conditions for a critical point from (8.30), we
have

  δJ = −tr( φ∧ U Y DS Y^T U^T ) − (1/2) tr( φ∧ φ∧ U Y DS Y^T U^T ).   (8.71)
It turns out that the first term is zero (because it is a critical point),
which we can see from

  tr( φ∧ U Y DS Y^T U^T ) = tr( Y^T U^T φ∧ U Y DS )
                          = tr( (Y^T U^T φ)∧ DS ) = tr( ϕ∧ DS ) = 0,   (8.72)

where

  ϕ = [ ϕ1 ; ϕ2 ; ϕ3 ] = Y^T U^T φ,   (8.73)
and owing to the properties of a skew-symmetric matrix (zeros on the
diagonal). For the second term, we use the identity u∧ u∧ = −u^T u 1 +
uu^T to write

  δJ = −(1/2) tr( φ∧ φ∧ U Y DS Y^T U^T )
     = −(1/2) tr( Y^T U^T ( −φ^T φ 1 + φφ^T ) U Y DS )
     = −(1/2) tr( ( −ϕ² 1 + ϕϕ^T ) DS ),   (8.74)

where ϕ² = ϕ^T ϕ = ϕ1² + ϕ2² + ϕ3².

Manipulating a little further, we have

  δJ = (1/2) ϕ² tr(DS) − (1/2) ϕ^T DS ϕ
     = (1/2) ( ϕ1²(d2 s2 + d3 s3) + ϕ2²(d1 s1 + d3 s3) + ϕ3²(d1 s1 + d2 s2) ),   (8.75)
the sign of which depends entirely on the nature of DS.
We can verify the ability of this expression to test for a minimum
using the unique global minima identified in the previous section. For
subcase (i-a), where d1 ≥ d2 ≥ d3 and s1 = s2 = s3 , we have
1 2 
δJ = ϕ1 (d2 + d3 ) + ϕ22 (d1 + d3 ) + ϕ23 (d1 + d2 ) > 0, (8.76)
2
for all ϕ 6= 0, confirming a minimum. For subcase (i-b) where d1 ≥
d2 > d3 > 0 and s1 = s2 = 1, s3 = −1, we have
1 
δJ = ϕ21 (d2 − d3 ) +ϕ22 (d1 − d3 ) +ϕ23 (d1 + d2 ) > 0, (8.77)
2 | {z } | {z }
>0 >0

for all ϕ ≠ 0, again confirming a minimum. Finally, for subcase (ii-a)
where d1 ≥ d2 > d3 = 0 and s1 = s2 = 1, s3 = ±1, we have

    δJ = ½ ( ϕ₁² d2 + ϕ₂² d1 + ϕ₃²(d1 + d2) ) > 0,                    (8.78)

for all ϕ ≠ 0, once again confirming a minimum.
The more interesting question is whether there are any other local
minima to worry about or not. This will become important when we use
iterative methods to optimize rotation and pose variables. For example,
let us consider subcase (i-a) a little further in the case that d1 > d2 >
d3 > 0. There are some other ways to satisfy (8.30) and generate a
critical point. For example, we could pick s1 = s2 = −1 and s3 = 1 so
that det S = 1. In this case we have

    δJ = ½ ( ϕ₁²(d3 − d2) + ϕ₂²(d3 − d1) + ϕ₃²(−d1 − d2) ) < 0,      (8.79)

which corresponds to a maximum since any ϕ ≠ 0 will decrease the


objective function. The other two cases, S = diag(−1, 1, −1) and S =
diag(1, −1, −1), turn out to be saddle points since depending on the
direction of the perturbation, the objective function can go up or down.
Since there are no other critical points, we can conclude there are no
local minima other than the global one.
Similarly for subcase (i-b), we need det S = −1 and can show that
S = diag(−1, −1, −1) is a maximum and that S = diag(−1, 1, 1) and
S = diag(1, −1, 1) are saddle points. Again, since there are no other
critical points, we can conclude there are no local minima other than
the global one.

Also, for subcase (ii-a) we in general have

    δJ = ½ ( ϕ₁² d2 s2 + ϕ₂² d1 s1 + ϕ₃²(d1 s1 + d2 s2) ),            (8.80)
and so the only way to create a local minimum is to pick s1 = s2 = 1,
which is the global minimum we have discussed earlier. Thus, again
there are no additional local minima.

Iterative Approach
We can also consider using an iterative approach to solve for the optimal
rotation matrix, C. We will use our SO(3)-sensitive scheme to do this.
Importantly, the optimization we carry out is unconstrained, thereby
avoiding the difficulties of the previous two approaches.⁸ Technically,
the result is not valid globally, only locally, as we require an initial
guess that is refined from one iteration to the next; typically only a few
iterations are needed. However, based on our discussion of local minima
in the last section, we know that in all the important situations where
there is a unique global minimum, there are no additional local minima
to worry about.
Starting from the cost function where the translation has been elim-
inated,

    J(C) = ½ Σ_{j=1}^{M} wj ((yj − y) − C(pj − p))T ((yj − y) − C(pj − p)),   (8.81)
we can insert the SO(3)-sensitive perturbation,

    C = exp(ψ∧) Cop ≈ (1 + ψ∧) Cop,                                   (8.82)
where Cop is the current guess and ψ is the perturbation; we will seek
an optimal value to update the guess (and then iterate). Inserting the
approximate perturbation scheme into the cost function turns it into
a quadratic in ψ for which the minimizing value, ψ*, is given by the
solution to

    Cop ( −(1/w) Σ_{j=1}^{M} wj (pj − p)∧ (pj − p)∧ ) CopT ψ*
                           (constant)
        = −(1/w) Σ_{j=1}^{M} wj (yj − y)∧ Cop (pj − p).               (8.83)

⁸ The iterative approach requires neither solving an eigenproblem nor carrying
  out a singular-value decomposition.

At first glance, the right-hand side appears to require recalculation
using the individual points at each iteration. Fortunately, we can
manipulate it into a more useful form. The right-hand side is a 3 × 1
column and its ith row is given by

    1iT ( −(1/w) Σ_{j=1}^{M} wj (yj − y)∧ Cop (pj − p) )
        = (1/w) Σ_{j=1}^{M} wj (yj − y)T 1i∧ Cop (pj − p)
        = (1/w) Σ_{j=1}^{M} wj tr( 1i∧ Cop (pj − p)(yj − y)T )
        = tr( 1i∧ Cop WT ),                                           (8.84)

where

    W = (1/w) Σ_{j=1}^{M} wj (yj − y)(pj − p)T,                       (8.85)

which we already saw in the non-iterative solution. Letting

    I = −(1/w) Σ_{j=1}^{M} wj (pj − p)∧ (pj − p)∧,                    (8.86a)
    b = [ tr( 1i∧ Cop WT ) ]ᵢ,                                        (8.86b)

the optimal update can be written in closed form as

    ψ* = Cop I−1 CopT b.                                              (8.87)
We apply this to the initial guess,

    Cop ← exp(ψ*∧) Cop,                                               (8.88)

and iterate to convergence, taking Ĉvk i = Cop at the final iteration as
our rotation estimate. After convergence, the translation is given as in
the non-iterative scheme:

    r̂_i^{vk i} = p − Ĉ_{vk i}^T y.                                   (8.89)
Notably, both I and W can be computed in advance and therefore
we do not require the original points during execution of the iterative
scheme.
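To make the update equations concrete, here is a minimal numerical sketch in Python/NumPy. This is not code from the book; the helper names (`wedge`, `exp_so3`, `align_iterative`), the unit default weights, and the simple stopping rule are our own assumptions. It precomputes W from (8.85) and I from (8.86a) once, then iterates the closed-form update (8.87)-(8.88), recovering the translation via (8.89):

```python
import numpy as np

def wedge(v):
    """Skew-symmetric (cross-product) matrix of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_so3(phi):
    """Rodrigues' formula: exponential map from R^3 to SO(3)."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3) + wedge(phi)
    a = phi / angle
    return (np.cos(angle) * np.eye(3)
            + (1.0 - np.cos(angle)) * np.outer(a, a)
            + np.sin(angle) * wedge(a))

def align_iterative(P, Y, weights=None, iters=20):
    """Iteratively estimate (C, r) from M corresponding points;
    P and Y are M x 3 arrays of the points p_j and y_j."""
    M = P.shape[0]
    w = np.ones(M) if weights is None else np.asarray(weights, dtype=float)
    wt = w.sum()
    p_bar = (w[:, None] * P).sum(axis=0) / wt       # weighted centroids
    y_bar = (w[:, None] * Y).sum(axis=0) / wt
    dP, dY = P - p_bar, Y - y_bar
    # constant quantities W (8.85) and I (8.86a), computed once in advance
    W_mat = (w[:, None, None] * dY[:, :, None] * dP[:, None, :]).sum(axis=0) / wt
    I_mat = -sum(w[j] * wedge(dP[j]) @ wedge(dP[j]) for j in range(M)) / wt
    C_op = np.eye(3)
    for _ in range(iters):
        # b_i = tr(1_i^wedge C_op W^T)  (8.86b)
        b = np.array([np.trace(wedge(np.eye(3)[:, i]) @ C_op @ W_mat.T)
                      for i in range(3)])
        psi = C_op @ np.linalg.solve(I_mat, C_op.T @ b)   # update (8.87)
        C_op = exp_so3(psi) @ C_op                        # apply (8.88)
        if np.linalg.norm(psi) < 1e-12:
            break
    r = p_bar - C_op.T @ y_bar                            # translation (8.89)
    return C_op, r
```

With noise-free correspondences generated as yj = C(pj − r), a few iterations of this scheme recover C and r to machine precision, consistent with the local-minima discussion above.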

Three Non-collinear Points Required


Clearly, to solve uniquely for ψ* above, we need det I ≠ 0. A sufficient
condition is to have I positive-definite, which implies that for any x ≠ 0,
we must have

    xT I x > 0.                                                       (8.90)

We then notice that

    xT I x = xT ( −(1/w) Σ_{j=1}^{M} wj (pj − p)∧ (pj − p)∧ ) x
           = (1/w) Σ_{j=1}^{M} wj ((pj − p)∧ x)T ((pj − p)∧ x) ≥ 0.   (8.91)

Since each term in the sum is non-negative, the total must be non-
negative. The only way to have the total be zero is if every term in the
sum is also zero, or
(∀j) (pj − p)∧ x = 0. (8.92)
In other words, we must have x = 0 (not true by assumption), pj = p,
or x parallel to pj − p. The last two conditions are never true as long
as there are at least three points and they are not collinear.
Note, having three non-collinear points only provides a sufficient con-
dition for a unique solution for ψ* at each iteration, and does not tell
us about the number of possible global solutions to minimize our ob-
jective function in general. This was discussed at length in the previous
sections, where we learned there could be one or infinitely many global
solutions. Moreover, if there is a unique global minimum, there are no
local minima to worry about.
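The collinearity condition is easy to probe numerically. In this hedged sketch (the helper names are ours, and unit weights are assumed), the matrix I from (8.86a) loses rank for collinear points and is positive-definite for three non-collinear ones:

```python
import numpy as np

def wedge(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def info_matrix(P, w=None):
    """I = -(1/w) sum_j w_j (p_j - p_bar)^wedge (p_j - p_bar)^wedge, as in (8.86a)."""
    M = P.shape[0]
    w = np.ones(M) if w is None else np.asarray(w, dtype=float)
    p_bar = (w[:, None] * P).sum(axis=0) / w.sum()
    return -sum(w[j] * wedge(P[j] - p_bar) @ wedge(P[j] - p_bar)
                for j in range(M)) / w.sum()

# collinear points: I is singular, so the update psi* is not unique
P_line = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0]])
print(np.linalg.matrix_rank(info_matrix(P_line)))          # → 2

# three non-collinear points: I is positive-definite
P_tri = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]])
print(np.all(np.linalg.eigvalsh(info_matrix(P_tri)) > 0))  # → True
```

The rank deficiency in the collinear case corresponds exactly to the unresolvable rotation about the common line through the points.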

8.1.4 Transformation Matrix Solution


Finally, for completeness we also can provide an iterative approach to
solving for the pose change using transformation matrices and their
relationship to the exponential map.⁹ As in the previous two sections,
we will use some simplified notation to avoid repeating sub- and super-
scripts:

    y_j = [ yj ] = [ r_vk^{pj vk} ],    p_j = [ pj ] = [ r_i^{pj i} ],
          [ 1  ]   [      1       ]           [ 1  ]   [     1      ]

    T = Tvk i = [ Cvk i   −Cvk i r_i^{vk i} ]
                [  0T            1          ].                        (8.93)
We have used a different font for the homogeneous representations of
the points; we will be making connections back to the previous section
on rotation matrices, so we also keep the non-homogeneous point represen-
tations around for convenience.
We define our error term for each point as
    e_j = y_j − T p_j,                                                (8.94)

⁹ We will use the optimization approach outlined in Section 7.1.9.
and our objective function as

    J(T) = ½ Σ_{j=1}^{M} wj e_jT e_j = ½ Σ_{j=1}^{M} wj (y_j − T p_j)T (y_j − T p_j),   (8.95)

where wj > 0 are the usual scalar weights. We seek to minimize J with
respect to T ∈ SE(3). Notably, this objective function is equivalent to
the ones for the unit-quaternion and rotation-matrix parameterizations,
so the minima should be the same.
To do this, we use our SE(3)-sensitive perturbation scheme,
    T = exp(ε∧) Top ≈ (1 + ε∧) Top,                                   (8.96)

where Top is some initial guess (i.e., operating point of our lineariza-
tion) and ε is a small perturbation to that guess. Inserting this into the
objective function we then have

    J(T) ≈ ½ Σ_{j=1}^{M} wj ((y_j − z_j) − z_j⊙ ε)T ((y_j − z_j) − z_j⊙ ε),   (8.97)

where z_j = Top p_j and we have used that

    ε∧ z_j = z_j⊙ ε,                                                  (8.98)
which was explained in Section 7.1.8.
Our objective function is now exactly quadratic in ε and therefore
we can carry out a simple, unconstrained optimization for ε. Taking the
derivative we find

    ∂J/∂εT = − Σ_{j=1}^{M} wj z_j⊙T ( (y_j − z_j) − z_j⊙ ε ).         (8.99)

Setting this to zero, we have the following system of equations for the
optimal ε*:

    ( (1/w) Σ_{j=1}^{M} wj z_j⊙T z_j⊙ ) ε* = (1/w) Σ_{j=1}^{M} wj z_j⊙T (y_j − z_j).   (8.100)

While we could use this to compute the optimal update, both the left-
and right-hand sides require construction from the original points at
each iteration. As in the previous section on the iterative solution using
rotation matrices, it turns out we can manipulate both sides into forms
that do not require the original points.
Looking to the left-hand side first, we can show that

    (1/w) Σ_{j=1}^{M} wj z_j⊙T z_j⊙ = 𝒯op^{−T} ( (1/w) Σ_{j=1}^{M} wj p_j⊙T p_j⊙ ) 𝒯op^{−1},   (8.101)

where the (positive-definite) middle factor is the constant matrix M, with

    𝒯op = Ad(Top),    M = [ 1   0 ] [ 1  0 ] [ 1  −p∧ ]
                          [ p∧  1 ] [ 0  I ] [ 0   1  ],

    w = Σ_{j=1}^{M} wj,   p = (1/w) Σ_{j=1}^{M} wj pj,
    I = −(1/w) Σ_{j=1}^{M} wj (pj − p)∧ (pj − p)∧.                    (8.102)
The 6×6 matrix, M, has the form of a generalized mass matrix (Murray
et al., 1994) with the weights as surrogates for masses. Notably, it is
only a function of the points in the stationary frame and is therefore a
constant.
Looking to the right-hand side, we can also show that

    a = (1/w) Σ_{j=1}^{M} wj z_j⊙T (y_j − z_j) = [ y − Cop (p − rop)    ]
                                                 [ b − y∧ Cop (p − rop) ],   (8.103)

where

    b = [ tr( 1i∧ Cop WT ) ]ᵢ,    Top = [ Cop  −Cop rop ]
                                        [ 0T      1     ],            (8.104)

    W = (1/w) Σ_{j=1}^{M} wj (yj − y)(pj − p)T,   y = (1/w) Σ_{j=1}^{M} wj yj.   (8.105)

Both W and y we have seen before and can be computed in advance
from the points and then used at each iteration of the scheme.
Once again, we can write the solution for the optimal update down
in closed form:

    ε* = 𝒯op M−1 𝒯opT a.                                             (8.106)
Once computed, we simply update our operating point,

    Top ← exp(ε*∧) Top,                                               (8.107)

and iterate the procedure to convergence. The estimated transforma-
tion is then T̂vk i = Top at the final iteration. Alternatively,
T̂_{i vk} = T̂_{vk i}^{−1} may be the output of interest.
Note, applying the optimal perturbation through the exponential
map ensures that Top remains in SE(3) at each iteration. Also, looking
back to Section 4.3.1, we can see that our iterative optimization of T
is exactly in the form of a Gauss-Newton style estimator, but adapted
to work with SE(3).
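As a sketch of this Gauss-Newton scheme in Python/NumPy (not the book's code — the names `exp_se3`, `odot`, and `align_se3`, the truncated-series exponential, and unit weights are our own assumptions; for clarity we solve (8.100) directly from the raw point sums rather than via the constant-matrix forms (8.101)-(8.106)):

```python
import numpy as np

def wedge(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_se3(xi):
    """Truncated-series exponential map from R^6 = (rho, phi) to SE(3)."""
    Xi = np.zeros((4, 4))
    Xi[:3, :3] = wedge(xi[3:])
    Xi[:3, 3] = xi[:3]
    T, term = np.eye(4), np.eye(4)
    for n in range(1, 20):          # ample accuracy for moderate xi
        term = term @ Xi / n
        T = T + term
    return T

def odot(z):
    """4 x 6 'circle-dot' operator: eps^wedge z = odot(z) @ eps for homogeneous z."""
    out = np.zeros((4, 6))
    out[:3, :3] = z[3] * np.eye(3)
    out[:3, 3:] = -wedge(z[:3])
    return out

def align_se3(P, Y, iters=30):
    """Gauss-Newton on SE(3) for the alignment problem, solving (8.100)
    from the (homogeneous) points at each iteration."""
    Ph = np.hstack([P, np.ones((P.shape[0], 1))])   # homogeneous p_j
    Yh = np.hstack([Y, np.ones((Y.shape[0], 1))])   # homogeneous y_j
    T_op = np.eye(4)
    for _ in range(iters):
        A = np.zeros((6, 6))
        b = np.zeros(6)
        for pj, yj in zip(Ph, Yh):
            zj = T_op @ pj
            Zj = odot(zj)
            A += Zj.T @ Zj                  # left-hand side of (8.100)
            b += Zj.T @ (yj - zj)           # right-hand side of (8.100)
        eps = np.linalg.solve(A, b)         # optimal perturbation
        T_op = exp_se3(eps) @ T_op          # update stays in SE(3), (8.107)
        if np.linalg.norm(eps) < 1e-12:
            break
    return T_op
```

With exact correspondences y_j = T p_j and at least three non-collinear points, the iteration converges to the true transformation in a handful of steps.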

Three Non-collinear Points Required


It is interesting to consider when (8.100) has a unique solution. It im-
mediately follows from (8.101) that
det M = det I. (8.108)
Therefore, to uniquely solve for ε* above, we need det I ≠ 0. A sufficient


condition is to have I positive-definite, which (we saw in the previous
section on rotation matrices) is true as long as there are at least three
points and they are not collinear.

8.2 Point-Cloud Tracking


In this section, we study a problem very much related to point-cloud
alignment, namely point-cloud tracking. In the alignment problem, we
simply wanted to align two point-clouds to determine the vehicle’s pose
at a single time. In the tracking problem, we want to estimate the pose
of an object over time through a combination of measurements and a
prior (with inputs). Accordingly, we will set up motion and observation
models and then show how we can use these in both recursive (i.e.,
EKF) and batch (i.e., Gauss-Newton) solutions.

8.2.1 Problem Setup


We will continue to use the situation depicted in Figure 8.1. The state
of the vehicle is comprised of
    r_i^{vk i} : translation vector from I to Vk, expressed in F_i
    Cvk i : rotation matrix from F_i to F_vk

or alternatively

    Tk = Tvk i = [ Cvk i   −Cvk i r_i^{vk i} ]
                 [  0T            1          ],                       (8.109)
as a single transformation matrix. We use the shorthand
x = {T0 , T1 , . . . , TK } , (8.110)
for the entire trajectory of poses. Our motion prior/inputs and mea-
surements for this problem are
(i) Motion Prior/Inputs:
– We might assume the known inputs are the initial pose (with
  uncertainty),

      Ť0,                                                             (8.111)

  as well as the translational velocity, ν_vk^{i vk}, and the angular
  velocity of the vehicle, ω_vk^{i vk}, which we note are expressed in
  the vehicle frame. We combine these as

      ϖk = [ ν_vk^{i vk} ]
           [ ω_vk^{i vk} ],    k = 1 . . . K,                         (8.112)
at a number of discrete times (we will assume the inputs

are piecewise-constant in time). Together, the inputs can be


written using the shorthand,

    v = { Ť0, ϖ1, ϖ2, . . . , ϖK }.                                   (8.113)

(ii) Measurements:
– We assume we are capable of measuring the position of a
  particular stationary point, Pj, in the vehicle frame, r_vk^{pj vk}.
  We assume the position of the point is known in the station-
  ary frame, r_i^{pj i}. Note there could also be measurements of
  multiple points, hence the subscript j. We will write

      yjk = r_vk^{pj vk},                                             (8.114)

for the observation of point Pj at discrete time k. Together,
the measurements can be written using the shorthand,
y = {y11 , . . . , yM 1 , . . . , y1K , . . . yM K } . (8.115)
This pose estimation problem is fairly generic and could be used to
describe a variety of situations.

8.2.2 Motion Priors


We will derive a general discrete-time, kinematic motion prior that can
be used within a number of different estimation algorithms. We will
start in continuous time and then move to discrete time.

Continuous Time
We will start with the SE(3) kinematics¹⁰,

    Ṫ = ϖ∧ T,                                                        (8.116)

where the quantities involved are perturbed by process noise according to:

    T = exp(δξ∧) T̄,                                                  (8.117a)
    ϖ = ϖ̄ + δϖ.                                                      (8.117b)

We can separate these into nominal and perturbation kinematics as
in (7.243):

    nominal kinematics:      T̄˙ = ϖ̄∧ T̄,                             (8.118a)
    perturbation kinematics: δξ̇ = ϖ̄⋏ δξ + δϖ,                       (8.118b)

where we will think of δϖ(t) as process noise that corrupts the nominal
kinematics. Thus, integrating the perturbed kinematic equation allows

¹⁰ To be clear, T = Tvk i in this equation and ϖ is the generalized velocity
   expressed in F_vk.


us to track uncertainty in the pose of the system. While we could do


this in continuous time, we will next move to discrete time to prepare
to use this kinematic model in the EKF and batch, discrete-time MAP
estimators.

Discrete Time
If we assume quantities remain constant between discrete times, then
we can use the ideas from Section 7.2.2 to write

    nominal kinematics:      T̄k = exp(∆tk ϖ̄k∧) T̄k−1 = Ξk T̄k−1,     (8.119a)
    perturbation kinematics: δξk = exp(∆tk ϖ̄k⋏) δξk−1 + wk
                                 = Ad(Ξk) δξk−1 + wk,                 (8.119b)

with ∆tk = tk − tk−1 for the nominal and perturbation kinematics in
discrete time. The process noise is now wk ∼ N(0, Qk).
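These two recursions can be sketched as a small dead-reckoning routine (Python/NumPy; a hedged illustration, not the book's code — the helper names and the truncated-series exponential are our own choices, and `Ad(Xi)` implements exp(∆t ϖ⋏) via the identity Ad(exp(ξ∧)) = exp(ξ⋏)):

```python
import numpy as np

def wedge(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_se3(xi):
    """Truncated-series exponential map from R^6 = (rho, phi) to SE(3)."""
    Xi = np.zeros((4, 4))
    Xi[:3, :3] = wedge(xi[3:])
    Xi[:3, 3] = xi[:3]
    T, term = np.eye(4), np.eye(4)
    for n in range(1, 20):
        term = term @ Xi / n
        T = T + term
    return T

def Ad(T):
    """6 x 6 adjoint of a transformation matrix (rho-then-phi ordering)."""
    C, r = T[:3, :3], T[:3, 3]
    out = np.zeros((6, 6))
    out[:3, :3] = C
    out[:3, 3:] = wedge(r) @ C
    out[3:, 3:] = C
    return out

def dead_reckon(T0, P0, varpis, dts, Q):
    """Propagate the nominal pose (8.119a) and the perturbation covariance
    (8.119b): the mean lives in SE(3), the covariance in se(3)."""
    T, P = T0.copy(), P0.copy()
    for varpi, dt in zip(varpis, dts):
        Xi = exp_se3(dt * np.asarray(varpi))
        F = Ad(Xi)                      # Ad(Xi) = exp(dt * varpi^curlywedge)
        T = Xi @ T                      # nominal kinematics
        P = F @ P @ F.T + Q             # perturbation covariance
    return T, P
```

Two half-steps with a constant generalized velocity compound exactly to one full step of the nominal kinematics, while the covariance grows by Q at each step.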

8.2.3 Measurement Model


We next develop a measurement model and then linearize it.

Nonlinear
Our 3 × 1 measurement model can be written compactly as

    yjk = DT Tk pj + njk,                                             (8.120)

where the positions of the known points are expressed in 4 × 1
homogeneous coordinates (bottom row equal to 1),

    pj = [ r_i^{pj i} ]
         [     1      ],                                              (8.121)

and

    DT = [ 1 0 0 0 ]
         [ 0 1 0 0 ]
         [ 0 0 1 0 ],                                                 (8.122)
is a projection matrix used to ensure the measurements are indeed 3×1
by removing the 1 on the bottom row. We have also now included,
njk ∼ N (0, Rjk ), which is Gaussian measurement noise.

Linearized
We linearize (8.120) in much the same way as the motion model through
the use of perturbations:

    Tk = exp(δξk∧) T̄k,                                               (8.123a)
    yjk = ȳjk + δyjk.                                                 (8.123b)

Substituting these in we have

    ȳjk + δyjk = DT exp(δξk∧) T̄k pj + n′jk.                          (8.124)

Subtracting off the nominal solution (i.e., the operating point in our
linearization),

    ȳjk = DT T̄k pj,                                                  (8.125)

we are left with

    δyjk ≈ DT T̄k pj⊙ δξk + njk,                                      (8.126)

correct to first order. This perturbation measurement model relates
small changes in the input to the measurement model to small changes
in the output, in an SE(3)-constraint-sensitive manner.
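The linearized model (8.126) can be sanity-checked against an actual small perturbation. In this sketch (helper names and the truncated-series exponential are our own assumptions), the first-order prediction DT T̄ p⊙ δξ matches the true change in the measurement to second order:

```python
import numpy as np

def wedge(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_se3(xi):
    Xi = np.zeros((4, 4))
    Xi[:3, :3] = wedge(xi[3:])
    Xi[:3, 3] = xi[:3]
    T, term = np.eye(4), np.eye(4)
    for n in range(1, 20):
        term = term @ Xi / n
        T = T + term
    return T

def odot(z):
    """4 x 6 operator with xi^wedge z = odot(z) xi for homogeneous z."""
    out = np.zeros((4, 6))
    out[:3, :3] = z[3] * np.eye(3)
    out[:3, 3:] = -wedge(z[:3])
    return out

Dt = np.hstack([np.eye(3), np.zeros((3, 1))])   # the 3 x 4 projection D^T

# perturbing T_bar on the left by exp(dxi^wedge) should change
# y = D^T T_bar p by approximately G dxi, with G = D^T T_bar p^odot
rng = np.random.default_rng(3)
T_bar = exp_se3(rng.standard_normal(6))
p = np.array([1.0, -2.0, 0.5, 1.0])
G = Dt @ odot(T_bar @ p)                        # Jacobian from (8.126)
dxi = 1e-6 * rng.standard_normal(6)
dy = Dt @ (exp_se3(dxi) @ T_bar @ p) - Dt @ (T_bar @ p)
print(np.allclose(dy, G @ dxi, atol=1e-10))     # → True
```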

Nomenclature
To match the notation used in our derivations of our nonlinear estima-
tors, we define the following symbols:
T̂k : 4 × 4 corrected estimate of pose at time k
P̂k : 6 × 6 covariance of corrected estimate at time k
     (for both translation and rotation)
Ťk : 4 × 4 predicted estimate of pose at time k
P̌k : 6 × 6 covariance of predicted estimate at time k
     (for both translation and rotation)
Ť0 : 4 × 4 prior input as pose at time 0
ϖk : 6 × 1 prior input as generalized velocity at time k
Qk : 6 × 6 covariance of process noise
(for both translation and rotation)
yjk : 3 × 1 measurement of point j from vehicle at time k
Rjk : 3 × 3 covariance of measurement j at time k
We will use these in two different estimators, the EKF and batch
discrete-time MAP estimation.

8.2.4 EKF Solution


In this section, we seek to estimate the pose of our vehicle using the
classic EKF, but carefully applied to our situation involving rotations.

Prediction Step
Predicting the mean forwards in time is not difficult in the case of
the EKF; we simply pass our prior estimate and latest measurements

through the nominal kinematics model in (8.119):

    Ťk = exp(∆tk ϖk∧) T̂k−1 = Ξk T̂k−1.                              (8.127)

To predict the covariance of the estimate,

    P̌k = E[ δξ̌k δξ̌kT ],                                            (8.128)

we require the perturbation kinematics model in (8.119),

    δξ̌k = exp(∆tk ϖk⋏) δξ̂k−1 + wk.                                  (8.129)

Thus, in this case the coefficient matrix of the linearized motion model
is

    Fk−1 = exp(∆tk ϖk⋏) = Ad(Ξk),                                     (8.130)

which depends only on the input and not the state due to our conve-
nient choice of representing uncertainty via the exponential map. The
covariance prediction proceeds in the usual EKF manner as

    P̌k = Fk−1 P̂k−1 Fk−1T + Qk.                                      (8.131)
The corrective step is where we must pay particular attention to the
pose variables.

Correction Step
Looking back to (8.126) for the perturbation measurement model,

    δyjk = DT Ťk pj⊙ δξ̌k + n′jk,                                     (8.132)

we see that the coefficient matrix of the linearized measurement model
is

    Gjk = DT Ťk pj⊙,                                                  (8.133)

which is evaluated at the predicted mean pose, Ťk.
To handle the case in which there are M observations of points from
the vehicle, we can stack the quantities as follows:

    yk = [ y1k ]        Gk = [ G1k ]
         [  ⋮  ] ,           [  ⋮  ] ,   Rk = diag(R1k, . . . , RM k).   (8.134)
         [ yM k ]            [ GM k ]
The Kalman gain and covariance update equations are then unchanged
from the generic case:

    Kk = P̌k GkT ( Gk P̌k GkT + Rk )−1,                               (8.135a)
    P̂k = (1 − Kk Gk) P̌k.                                            (8.135b)

Note, we must be careful to interpret the EKF corrective equations
properly since

    P̂k = E[ δξ̂k δξ̂kT ].                                            (8.136)

In particular, for the mean update we rearrange the equation as follows:

    εk = ln( T̂k Ťk−1 )∨ = Kk (yk − y̌k),                             (8.137)
    (update)              (innovation)

where εk = ln( T̂k Ťk−1 )∨ is the difference of the corrected and predicted
means and y̌k is the nonlinear, nominal measurement model evaluated
at the predicted mean:

    y̌k = [ y̌1k ]
         [  ⋮   ] ,   y̌jk = DT Ťk pj,                                (8.138)
         [ y̌M k ]

where we have again accounted for the fact that there could be M
observations of points from the vehicle. Once we have computed the mean
correction, εk, we apply it according to

    T̂k = exp(εk∧) Ťk,                                                (8.139)

which ensures the mean stays in SE(3).

Summary
Putting the pieces from the last two sections together, we have our
canonical five EKF equations for this system:

    predictor:     P̌k = Fk−1 P̂k−1 Fk−1T + Qk,                       (8.140a)
                   Ťk = Ξk T̂k−1,                                     (8.140b)
    Kalman gain:   Kk = P̌k GkT ( Gk P̌k GkT + Rk )−1,                (8.140c)
    corrector:     P̂k = (1 − Kk Gk) P̌k,                             (8.140d)
                   T̂k = exp( (Kk (yk − y̌k))∧ ) Ťk.                  (8.140e)

We have essentially modified the EKF so that all the mean calculations
occur in SE(3), the Lie group, and all of the covariance calculations
occur in se(3), the Lie algebra. As usual, we must initialize the filter
at the first timestep using Ť0. Although we do not show it, we could
easily turn this into an iterated EKF by relinearizing about the latest
estimate and iterating over the correction step. Finally, the algorithm
has T̂vk i = T̂k so we can compute T̂_{i vk} = T̂k^{−1} if desired.

8.2.5 Batch Maximum a Posteriori Solution


In this section, we return to the discrete-time, batch estimation ap-
proach to see how this works on our pose tracking problem.

Error Terms and Objective Function


As usual for batch MAP problems, we begin by defining an error term
for each of our inputs and measurements. For the inputs, Ť0 and ϖk,
we have

    ev,k(x) = { ln( Ť0 T_0^{−1} )∨               k = 0
              { ln( Ξk T_{k−1} T_k^{−1} )∨       k = 1 . . . K,       (8.141)

where Ξk = exp(∆tk ϖk∧) and we have used the convenient shorthand,
x = {T0, . . . , TK}. For the measurements, yjk, we have

    ey,jk(x) = yjk − DT Tk pj.                                        (8.142)

Next we examine the noise properties of these errors.
Taking the Bayesian point of view, we consider that the true pose
variables are drawn from the prior (see Section 4.1.1) so that

    Tk = exp(δξk∧) Ťk,                                                (8.143)

where δξk ∼ N(0, P̌k).
For the first input error, we have

    ev,0(x) = ln( Ť0 T_0^{−1} )∨ = ln( Ť0 Ť_0^{−1} exp(−δξ0∧) )∨ = −δξ0,   (8.144)

so that

    ev,0(x) ∼ N(0, P̌0).                                              (8.145)

For the later input errors, we have

    ev,k(x) = ln( Ξk T_{k−1} T_k^{−1} )∨
            = ln( Ξk exp(δξ_{k−1}∧) Ť_{k−1} Ť_k^{−1} exp(−δξk∧) )∨
            = ln( Ξk Ť_{k−1} Ť_k^{−1} exp( (Ad(Ξk) δξ_{k−1})∧ ) exp(−δξk∧) )∨
            ≈ Ad(Ξk) δξ_{k−1} − δξk = −wk,                            (8.146)

where we note that Ξk Ť_{k−1} Ť_k^{−1} = 1, so that

    ev,k(x) ∼ N(0, Qk).                                               (8.147)

For the measurement model, we consider that the measurements are
generated by evaluating the noise-free versions (based on the true pose
variables) and then corrupted by noise so that

    ey,jk(x) = yjk − DT Tk pj = njk,                                  (8.148)

so that

    ey,jk(x) ∼ N(0, Rjk).                                             (8.149)

These noise properties allow us to next construct the objective func-
tion that we want to minimize in our batch MAP problem:

    Jv,k(x) = { ½ ev,0(x)T P̌0−1 ev,0(x)      k = 0
              { ½ ev,k(x)T Qk−1 ev,k(x)      k = 1 . . . K,           (8.150a)

    Jy,k(x) = ½ ey,k(x)T Rk−1 ey,k(x),                                (8.150b)

where we have stacked the M point quantities together according to

    ey,k(x) = [ ey,1k(x) ]
              [    ⋮     ] ,   Rk = diag(R1k, . . . , RM k).          (8.151)
              [ ey,M k(x) ]

The overall objective function that we will seek to minimize is then

    J(x) = Σ_{k=0}^{K} ( Jv,k(x) + Jy,k(x) ).                         (8.152)

the next section will look at linearizing our error terms in order to carry
out Gauss-Newton optimization.

Linearized Error Terms


It is fairly straightforward to linearize our error terms (in order to carry
out Gauss-Newton optimization) just as we earlier linearized our mo-
tion and observation models. We will linearize about an operating point
for each pose, Top,k , which we can think of as our current trajectory
guess that will be iteratively improved. Thus, we will take
Tk = exp (∧k ) Top,k , (8.153)
where k will be the perturbation to the current guess that we seek to
optimize at each iteration. We will use the shorthand
xop = {Top,1 , Top,2 , . . . , Top,K } , (8.154)
for the operating point of the entire trajectory.
For the first input error, we have
∨  ∨
ev,0 (x) = ln Ť0 T−1
0 = ln Ť T −1
0 op,0 exp (− ∧
0 ) ≈ ev,0 (xop ) − 0 ,
| {z }
exp(ev,0 (xop )∧ )
(8.155)
∨
where ev,0 (xop ) = ln Ť0 T−1
op,0 is the error evaluated at the operating
point. Note, we have used a very crude version of the BCH formula to
arrive at the approximation on the right (i.e., only the first two terms),
but this approximation will get better as ε0 goes to zero, which will
happen as the Gauss-Newton algorithm converges.¹¹
For the input errors at the later times, we have

    ev,k(x) = ln( Ξk T_{k−1} T_k^{−1} )∨
            = ln( Ξk exp(ε_{k−1}∧) T_{op,k−1} T_{op,k}^{−1} exp(−εk∧) )∨
            = ln( Ξk T_{op,k−1} T_{op,k}^{−1} exp( (Ad(T_{op,k} T_{op,k−1}^{−1}) ε_{k−1})∧ ) exp(−εk∧) )∨
            ≈ ev,k(xop) + Ad( T_{op,k} T_{op,k−1}^{−1} ) ε_{k−1} − εk,   (8.156)

where ev,k(xop) = ln( Ξk T_{op,k−1} T_{op,k}^{−1} )∨ is the error evaluated at the
operating point, Ξk T_{op,k−1} T_{op,k}^{−1} = exp( ev,k(xop)∧ ), and we define
Fk−1 = Ad( T_{op,k} T_{op,k−1}^{−1} ).
For the measurement errors, we have

    ey,jk(x) = yjk − DT Tk pj
             = yjk − DT exp(εk∧) Top,k pj
             ≈ yjk − DT (1 + εk∧) Top,k pj
             = ( yjk − DT Top,k pj ) − DT Top,k pj⊙ εk,               (8.157)

where we identify ey,jk(xop) = yjk − DT Top,k pj and Gjk = DT Top,k pj⊙.
We can stack all of the point measurement errors at time k together so
that

    ey,k(x) ≈ ey,k(xop) − Gk εk,                                      (8.158)

where

    ey,k(x) = [ ey,1k(x) ]               [ ey,1k(xop) ]          [ G1k ]
              [    ⋮     ] , ey,k(xop) = [     ⋮      ] ,   Gk = [  ⋮  ] .   (8.159)
              [ ey,M k(x) ]              [ ey,M k(xop) ]         [ GM k ]
Next, we will insert these approximations into our objective function
to complete the Gauss-Newton derivation.

11 We could also include the SE(3) Jacobian to do better here, as was done in
Section 7.3.4, but this is a reasonable starting point.

Gauss-Newton Update
To set up the Gauss-Newton update, we define the following stacked
quantities:

    δx = [ ε0 ]         H = [  1                        ]
         [ ε1 ]             [ −F0    1                  ]
         [ ε2 ] ,           [       −F1    1            ]
         [  ⋮ ]             [              ⋱    ⋱       ]
         [ εK ]             [                −FK−1   1  ]
                            [ G0                        ]
                            [       G1                  ]
                            [             G2            ]
                            [                  ⋱        ]
                            [                       GK  ] ,

    e(xop) = [ ev,0(xop) ; ev,1(xop) ; . . . ; ev,K(xop) ;
               ey,0(xop) ; ey,1(xop) ; . . . ; ey,K(xop) ],           (8.160)

and

    W = diag( P̌0, Q1, . . . , QK, R0, R1, . . . , RK ),               (8.161)

which are identical in structure to the matrices in the nonlinear version.
The quadratic (in terms of the perturbation, δx) approximation to the
objective function is then

    J(x) ≈ J(xop) − bT δx + ½ δxT A δx,                               (8.162)

where

    A = HT W−1 H  (block-tridiagonal),   b = HT W−1 e(xop).           (8.163)

Minimizing with respect to δx, we have

    A δx* = b,                                                        (8.164)

for the optimal perturbation,

    δx* = [ ε0* ]
          [ ε1* ]
          [  ⋮  ] .                                                   (8.165)
          [ εK* ]

Once we have the optimal perturbation, we update our operating point
through the original perturbation scheme,

    Top,k ← exp(εk*∧) Top,k,                                          (8.166)

which ensures that Top,k stays in SE(3). We then iterate the entire
scheme to convergence. As a reminder we note that at the final iteration
we have T̂vk i = Top,k as our estimate, but if we prefer we can compute
T̂_{i vk} = T̂_{vk i}^{−1}.
Once again, the main concept that we have used to derive this Gauss-
Newton optimization problem involving pose variables is to compute
the update in the Lie algebra, se(3), but store the mean in the Lie
group, SE(3).

8.3 Pose-Graph Relaxation


Another classic problem that is worth investigating in our framework
is that of pose-graph relaxation. Here we do not explicitly measure
any points in the stationary frame, but instead begin directly with a
set of relative pose ‘measurements’ (a.k.a., pseudomeasurements), that
may have come from some form of dead-reckoning. The situation is
depicted in Figure 8.3, where we can consider each white triangle to
be a reference frame in three dimensions. We refer to this diagram as
a pose graph in that it only involves poses and no points.
Importantly, pose graphs can contain closed loops (as well as leaf
nodes), but unfortunately, the relative pose measurements are uncer-
tain and do not necessarily compound to identity around any loop.
Therefore, our task is to ‘relax’ the pose graph with respect to one (ar-
bitrarily selected) pose, called pose 0. In other words, we will determine
an optimal estimate for each pose relative to pose 0, given all of the
relative pose measurements.

8.3.1 Problem Setup


There is an implicit reference frame, F_k, located at pose k in Figure 8.3.
We will use a transformation matrix to denote the pose change from
F_0 to F_k:

    Tk : transformation matrix representing pose of F_k relative to F_0.
[Figure 8.3: In the pose-graph relaxation problem, only relative pose
change 'measurements' (with means and covariances, T̄kℓ, Σkℓ) are
provided and the task is to determine where each pose is in relation to
one privileged pose, labelled 0, which is fixed. We cannot simply
compound the relative transforms due to the presence of closed loops,
around which the relative transforms may not compound to identity.]

Our task will be to estimate this transformation for all the poses (other
than pose 0).
As mentioned above, the measurements will be a set of relative pose
changes between nodes in the pose graph. The measurements will be
assumed to be Gaussian (on SE(3)) and thus have a mean and a co-
variance given by
variance given by

T̄k` , Σk` . (8.167)
Explicitly, a random sample, Tk` , is drawn from this Gaussian density
according to

Tk` = exp ξ ∧k` T̄k` , (8.168)
where
ξ k` ∼ N (0, Σk` ) . (8.169)
Measurements of this type can arise from a lower-level dead-reckoning
method such as wheel odometry, visual odometry, or inertial sensing.
Not all pairs of poses will have relative measurements, making the
problem fairly sparse in practice.

8.3.2 Batch Maximum Likelihood Solution


We will follow a batch ML approach very similar to the pose-fusion
problem described in Section 7.3.4. As usual, for each measurement we
will formulate an error term:

    ekℓ(x) = ln( T̄kℓ ( Tk Tℓ^{−1} )^{−1} )∨ = ln( T̄kℓ Tℓ Tk^{−1} )∨,   (8.170)

where we have used the shorthand

    x = {T1, . . . , TK},                                             (8.171)
for the state to be estimated. We will adopt the usual SE(3)-sensitive
perturbation scheme,

    Tk = exp(εk∧) Top,k,                                              (8.172)

where Top,k is the operating point and εk the small perturbation. In-
serting this into the error expression we have

    ekℓ(x) = ln( T̄kℓ exp(εℓ∧) Top,ℓ Top,k^{−1} exp(−εk∧) )∨.         (8.173)

We can pull the εℓ factor over to the right without approximation:

    ekℓ(x) = ln( T̄kℓ Top,ℓ Top,k^{−1} exp( (𝒯op,k 𝒯op,ℓ^{−1} εℓ)∧ ) exp(−εk∧) )∨,   (8.174)

where 𝒯op,k = Ad(Top,k). Since both εℓ and εk will be converging to
zero, we can combine these approximately and write

    ekℓ(x) ≈ ln( exp( ekℓ(xop)∧ ) exp( (𝒯op,k 𝒯op,ℓ^{−1} εℓ − εk)∧ ) )∨,   (8.175)

where we have also defined

    ekℓ(xop) = ln( T̄kℓ Top,ℓ Top,k^{−1} )∨,                          (8.176a)
    xop = {Top,1, . . . , Top,K}.                                     (8.176b)

Finally, we can use the BCH approximation in (7.87) to write our lin-
earized error as

    ekℓ(x) ≈ ekℓ(xop) − Gkℓ δxkℓ,                                     (8.177)

where

    Gkℓ = [ −J(−ekℓ(xop))^{−1} 𝒯op,k 𝒯op,ℓ^{−1}    J(−ekℓ(xop))^{−1} ],   (8.178a)
    δxkℓ = [ εℓ ]
           [ εk ].                                                    (8.178b)
We may choose to approximate J ≈ 1 to keep things simple, but as we
saw in Section 7.3.4, keeping the full expression has some benefit. This
is because even after convergence ekℓ(xop) ≠ 0; these are the non-zero
residual errors for this least-squares problem.
With our linearized error expression in hand, we can now define the
ML objective function as

    J(x) = ½ Σ_{k,ℓ} ekℓ(x)T Σkℓ^{−1} ekℓ(x),                        (8.179)

where we note that there will be one term in the sum for each relative
pose measurement in the pose graph. Inserting our approximate error
expression, we have

    J(x) ≈ ½ Σ_{k,ℓ} ( ekℓ(xop) − Gkℓ Pkℓ δx )T Σkℓ^{−1} ( ekℓ(xop) − Gkℓ Pkℓ δx ),   (8.180)

or

    J(x) ≈ J(xop) − bT δx + ½ δxT A δx,                               (8.181)

where

    b = Σ_{k,ℓ} PkℓT GkℓT Σkℓ^{−1} ekℓ(xop),                         (8.182a)
    A = Σ_{k,ℓ} PkℓT GkℓT Σkℓ^{−1} Gkℓ Pkℓ,                          (8.182b)
    δxkℓ = Pkℓ δx,                                                    (8.182c)


and Pkℓ is a projection matrix to pick out the kℓth perturbation vari-
ables from the full perturbation state,

    δx = [ ε1 ]
         [  ⋮ ] .                                                     (8.183)
         [ εK ]

Our approximate objective function is now exactly quadratic and we
minimize J(x) with respect to δx by taking the derivative:

    ∂J(x)/∂δxT = −b + A δx.                                           (8.184)

Setting this to zero, the optimal perturbation, δx*, is the solution to
the following linear system:

    A δx* = b.                                                        (8.185)
As usual, the procedure iterates between solving (8.185) for the optimal
perturbation,

    δx* = [ ε1* ]
          [  ⋮  ] ,                                                   (8.186)
          [ εK* ]

and updating the nominal quantities using the optimal perturbations
according to our original scheme,

    Top,k ← exp(εk*∧) Top,k,                                          (8.187)
which ensures that Top,k ∈ SE(3). We continue until some convergence
criterion is met. Once converged, we set T̂k0 = Top,k at the last iteration
as the final estimates for the vehicle poses relative to pose 0.

[Figure 8.4: The pose-graph relaxation procedure can be initialized by
finding a spanning tree (solid lines) within the pose graph. The dotted
measurements are discarded (only for initialization) and all the pose
variables are compounded outward from pose 0.]
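The whole relaxation loop can be sketched compactly. The following is our own illustrative implementation, not the book's: it assumes unit measurement covariances, fixes pose 0 to identity, and uses the J ≈ 1 approximation to (8.178a); the helper names and the truncated-series exponential/logarithm are also our own choices.

```python
import numpy as np

def wedge(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def exp_se3(xi):
    """Truncated-series exponential map from R^6 = (rho, phi) to SE(3)."""
    Xi = np.zeros((4, 4))
    Xi[:3, :3] = wedge(xi[3:])
    Xi[:3, 3] = xi[:3]
    T, term = np.eye(4), np.eye(4)
    for n in range(1, 20):
        term = term @ Xi / n
        T = T + term
    return T

def Ad(T):
    C, r = T[:3, :3], T[:3, 3]
    out = np.zeros((6, 6))
    out[:3, :3] = C
    out[:3, 3:] = wedge(r) @ C
    out[3:, 3:] = C
    return out

def log_so3(C):
    angle = np.arccos(np.clip((np.trace(C) - 1.0) / 2.0, -1.0, 1.0))
    axis_raw = np.array([C[2, 1] - C[1, 2], C[0, 2] - C[2, 0], C[1, 0] - C[0, 1]])
    if angle < 1e-10:
        return 0.5 * axis_raw
    return angle * axis_raw / (2.0 * np.sin(angle))

def log_se3(T):
    """Inverse of exp_se3 (valid away from rotation angle pi)."""
    phi = log_so3(T[:3, :3])
    angle = np.linalg.norm(phi)
    if angle < 1e-10:
        J = np.eye(3) + 0.5 * wedge(phi)
    else:
        a = phi / angle
        s = np.sin(angle) / angle
        J = (s * np.eye(3) + (1.0 - s) * np.outer(a, a)
             + ((1.0 - np.cos(angle)) / angle) * wedge(a))  # left Jacobian of SO(3)
    rho = np.linalg.solve(J, T[:3, 3])
    return np.concatenate([rho, phi])

def relax_pose_graph(K, meas, iters=15):
    """Gauss-Newton pose-graph relaxation with pose 0 fixed;
    meas: {(k, l): T_bar_kl} with T_bar_kl ~ T_k T_l^{-1}."""
    T_op = [np.eye(4) for _ in range(K + 1)]
    for _ in range(iters):
        A = np.zeros((6 * K, 6 * K))
        b = np.zeros(6 * K)
        for (k, l), T_bar in meas.items():
            e = log_se3(T_bar @ T_op[l] @ np.linalg.inv(T_op[k]))   # (8.176a)
            blocks = [(k, np.eye(6)),                               # eps_k block
                      (l, -Ad(T_op[k] @ np.linalg.inv(T_op[l])))]   # eps_l block
            for i, Gi in blocks:
                if i == 0:
                    continue        # pose 0 is fixed, not estimated
                b[6 * (i - 1):6 * i] += Gi.T @ e
                for j, Gj in blocks:
                    if j == 0:
                        continue
                    A[6 * (i - 1):6 * i, 6 * (j - 1):6 * j] += Gi.T @ Gj
        dx = np.linalg.solve(A, b)                                  # (8.185)
        for k in range(1, K + 1):
            T_op[k] = exp_se3(dx[6 * (k - 1):6 * k]) @ T_op[k]      # (8.187)
    return T_op
```

On a small loop with self-consistent measurements, the scheme converges to the exact poses; with inconsistent loop measurements it distributes the residual error across the graph.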

8.3.3 Initialization
There are several ways to initialize the operating point, xop at the
start of the Gauss-Newton procedure. A common method is to find a
spanning tree as in Figure 8.4; the initial values of the pose variables can
be found by compounding (a subset of) the relative pose measurements
outward from the chosen privileged node 0. Note that the spanning tree is
not unique, and as such different initializations can be computed. A
shallow spanning tree is preferable to a deep one so that as little
uncertainty as possible is accumulated at any given node.
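A breadth-first traversal is one simple way to obtain a shallow spanning tree. Below is a minimal sketch (assumptions: relative measurements are stored as 4×4 matrices in a dictionary keyed by edge; the function name and data layout are illustrative, not from the text):

```python
import numpy as np
from collections import deque

def initialize_from_spanning_tree(num_poses, measurements):
    """Breadth-first spanning-tree initialization for pose-graph relaxation.

    measurements: dict mapping (k, l) -> measured T_bar_kl (pose k relative
                  to pose l, as a 4x4 matrix). Pose 0 is the privileged node.
    Returns initial guesses T_op[k] ~ T_k0 for every reachable pose.
    """
    # Undirected adjacency over the measurement edges.
    adj = {k: [] for k in range(num_poses)}
    for (k, l) in measurements:
        adj[k].append(l)
        adj[l].append(k)

    T_op = {0: np.eye(4)}            # privileged pose 0
    queue = deque([0])
    while queue:                     # breadth-first => shallow tree
        l = queue.popleft()
        for k in adj[l]:
            if k in T_op:
                continue             # already initialized; edge left out of tree
            if (k, l) in measurements:
                T_kl = measurements[(k, l)]
            else:                    # stored the other way around; invert it
                T_kl = np.linalg.inv(measurements[(l, k)])
            T_op[k] = T_kl @ T_op[l]   # compound outward: T_k0 = T_kl T_l0
            queue.append(k)
    return T_op
```

Measurements not used by the tree are simply skipped here; they re-enter the problem once the full relaxation is run from this initial guess.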

8.3.4 Exploiting Sparsity


There is inherent sparsity in pose graphs and this can be exploited to
make the pose-graph relaxation procedure more computationally effi-
cient12 . As shown in Figure 8.5, some nodes (shown as open triangles)
in the pose graph have either one or two edges, creating two types of
local chains:

(i) Constrained: both ends of the chain have a junction (closed-
triangle) node and thus the chain's measurements are important
to the rest of the pose graph.
(ii) Cantilevered: only one end of the chain has a junction (closed-
triangle) node and thus the chain's measurements do not affect
the rest of the pose graph.
12 The method in this section should be viewed as an approximation to the brute-force
approach in the previous section, owing to the fact that it is a nonlinear system.
Figure 8.5 The pose-graph relaxation method can be sped up by exploiting the inherent sparsity in the pose graph. The open-triangle nodes only have one or two edges and thus do not need to be solved for initially. Instead, the relative measurements passing through the open-triangle nodes (dotted) are combined, allowing the closed-triangle nodes to be solved for much more efficiently. The open-triangle nodes can also be solved for afterwards.

We can use any of the pose-compounding methods from Section 7.3.3
to combine the relative pose measurements associated with the
constrained local chains and then treat this as a new relative pose
measurement that replaces its constituents. Once this is done, we can use
the pose-graph relaxation approach to solve for the reduced pose graph
formed from only the junction (shown as closed-triangle) nodes.
Afterwards, we can treat all the junction nodes as fixed and solve for
the local chain (open-triangle) nodes. For those nodes in cantilevered
local chains, we can simply use one of the pose-compounding methods
from Section 7.3.3 to compound outward from the one junction (closed-
triangle) node associated with the chain. The cost of this compounding
procedure is linear in the length of the local chain (and iteration is not
required).
For each constrained local chain, we can run a smaller pose-graph
relaxation just for that chain to solve for its nodes. In this case, the two
bounding junction (closed-triangle) nodes will be fixed. If we order the
variables sequentially along the local chain, the A matrix for this pose-
graph relaxation will be block-tridiagonal and thus the cost of each
iteration will be linear in the length of the chain (i.e., sparse Cholesky
decomposition followed by forward-backward passes).
This two-phased approach to pose-graph relaxation is not the only
way to exploit the inherent sparsity in order to gain computational
efficiency. A good sparse solver should be able to exploit the sparsity in
the A matrix for the full system as well, avoiding the need to identify
and bookkeep all of the local chains.

8.3.5 Chain Example


It is worthwhile to provide an example of pose-graph relaxation for the
short constrained chain in Figure 8.6. We only need to solve for poses
1, 2, 3, and 4 since 0 and 5 are fixed. The A matrix for this example is
Figure 8.6 Example pose-graph relaxation problem for a constrained chain of poses. Here the black poses are fixed and we must solve for the white poses, given all the relative pose measurements.
given by

A = \begin{bmatrix} A_{11} & A_{12} & & \\ A_{12}^T & A_{22} & A_{23} & \\ & A_{23}^T & A_{33} & A_{34} \\ & & A_{34}^T & A_{44} \end{bmatrix}
= \begin{bmatrix}
\Sigma_{10}'^{-1} + \mathcal{T}_{21}^T \Sigma_{21}'^{-1} \mathcal{T}_{21} & -\mathcal{T}_{21}^T \Sigma_{21}'^{-1} & & \\
-\Sigma_{21}'^{-1} \mathcal{T}_{21} & \Sigma_{21}'^{-1} + \mathcal{T}_{32}^T \Sigma_{32}'^{-1} \mathcal{T}_{32} & -\mathcal{T}_{32}^T \Sigma_{32}'^{-1} & \\
& -\Sigma_{32}'^{-1} \mathcal{T}_{32} & \Sigma_{32}'^{-1} + \mathcal{T}_{43}^T \Sigma_{43}'^{-1} \mathcal{T}_{43} & -\mathcal{T}_{43}^T \Sigma_{43}'^{-1} \\
& & -\Sigma_{43}'^{-1} \mathcal{T}_{43} & \Sigma_{43}'^{-1} + \mathcal{T}_{54}^T \Sigma_{54}'^{-1} \mathcal{T}_{54}
\end{bmatrix},    (8.189)

where
−1
0
Σk` = J −T −1 −1
k` Σk` J k` , (8.190a)
−1
T k` = T op,k T op,` , (8.190b)
J k` = J (−ek` (xop )) . (8.190c)

The b matrix is given by


   −T −1 
b1 J 10 Σ10 e10 (xop ) − T T21 J −T −1
21 Σ21 e21 (xop )
b2  J −T Σ−1 T −T −1
21 e21 (xop ) − T 32 J 32 Σ32 e32 (xop )

b=  =  21
b3  J −T Σ−1 e32 (xop ) − T T J −T Σ−1 e43 (xop ) . (8.191)
32 32 43 43 43
b4 J −T −1 T −T −1
43 Σ43 e43 (xop ) − T 54 J 54 Σ54 e54 (xop )

We can see that for this chain example, A is block-tridiagonal and


we can therefore solve the A δx? = b equation quite efficiently using a
sparse Cholesky decomposition as follows. Let

A = UUT , (8.192)

where U is an upper-triangular matrix of the form


 
U11 U12
 U22 U23 
U= 
. (8.193)
U33 U34 
U44

The blocks of U can be solved for as follows:


U44 UT44 = A44 : solve for U44 using Cholesky decomposition,
U34 UT44 = A34 : solve for U34 using linear algebra solver,
U33 UT33 + U34 UT34 = A33 : solve for U33 using Cholesky decomposition,
U23 UT33 = A23 : solve for U23 using linear algebra solver,
U22 UT22 + U23 UT23 = A22 : solve for U22 using Cholesky decomposition,
U12 UT22 = A12 : solve for U12 using linear algebra solver,
U11 UT11 + U12 UT12 = A11 : solve for U11 using Cholesky decomposition.
Then we can carry out a backwards pass followed by a forwards pass
to solve for \delta x^\star:

backwards pass, U c = b:             forwards pass, U^T \delta x^\star = c:
U_{44} c_4 = b_4                     U_{11}^T \epsilon_1^\star = c_1
U_{33} c_3 + U_{34} c_4 = b_3        U_{12}^T \epsilon_1^\star + U_{22}^T \epsilon_2^\star = c_2
U_{22} c_2 + U_{23} c_3 = b_2        U_{23}^T \epsilon_2^\star + U_{33}^T \epsilon_3^\star = c_3
U_{11} c_1 + U_{12} c_2 = b_1        U_{34}^T \epsilon_3^\star + U_{44}^T \epsilon_4^\star = c_4

First we proceed down the left column solving for c_4, c_3, c_2, and c_1.
Then we proceed down the right column solving for \epsilon_1^\star, \epsilon_2^\star, \epsilon_3^\star, and \epsilon_4^\star.
The cost of solving for each of U, c, and finally \delta x^\star is linear in the
length of the chain. Once we solve for \delta x^\star, we update the operating
points of each pose variable,

T_{op,k} \leftarrow \exp\left( \epsilon_k^{\star\wedge} \right) T_{op,k},    (8.194)

and iterate the entire procedure to convergence. For this short chain,
exploiting the sparsity may not be worthwhile, but for very long
constrained chains the benefits are obvious.
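The sequence of small solves above can be verified numerically. Below is a minimal sketch for a chain of N blocks (block sizes and values are arbitrary test data, not from the text). Since `np.linalg.cholesky` returns a lower-triangular factor, an upper-triangular factor with U U^T = A is obtained by a flip trick:

```python
import numpy as np

def chol_upper(B):
    """Upper-triangular U with U U^T = B ('reverse' Cholesky via flipping)."""
    L = np.linalg.cholesky(np.flip(B))   # flip reverses both axes
    return np.flip(L)

def chain_cholesky_solve(A_diag, A_off, b):
    """Solve the block-tridiagonal SPD system A dx = b by factoring
    A = U U^T (U upper block-bidiagonal), then the backwards pass
    U c = b and the forwards pass U^T dx = c, as in the text.

    A_diag: [A_11, ..., A_NN]      diagonal blocks
    A_off:  [A_12, ..., A_{N-1,N}] super-diagonal blocks
    b:      [b_1, ..., b_N]
    """
    N = len(A_diag)
    U_diag, U_off = [None] * N, [None] * (N - 1)
    U_diag[N - 1] = chol_upper(A_diag[N - 1])          # e.g. U_44 U_44^T = A_44
    for k in range(N - 2, -1, -1):
        # U_{k,k+1} U_{k+1,k+1}^T = A_{k,k+1}  (linear solve)
        U_off[k] = np.linalg.solve(U_diag[k + 1], A_off[k].T).T
        # U_kk U_kk^T = A_kk - U_{k,k+1} U_{k,k+1}^T  (small Cholesky)
        U_diag[k] = chol_upper(A_diag[k] - U_off[k] @ U_off[k].T)
    c = [None] * N                                     # backwards pass: U c = b
    c[N - 1] = np.linalg.solve(U_diag[N - 1], b[N - 1])
    for k in range(N - 2, -1, -1):
        c[k] = np.linalg.solve(U_diag[k], b[k] - U_off[k] @ c[k + 1])
    dx = [None] * N                                    # forwards pass: U^T dx = c
    dx[0] = np.linalg.solve(U_diag[0].T, c[0])
    for k in range(1, N):
        dx[k] = np.linalg.solve(U_diag[k].T, c[k] - U_off[k - 1].T @ dx[k - 1])
    return dx
```

Every operation touches only one or two blocks at a time, which is why the overall cost is linear in the chain length.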
9

Pose-and-Point Estimation Problems

In this chapter of the book, we will address one of the most fundamen-
tal problems in mobile robotics, estimating the trajectory of a robot
and the structure of the world around it (i.e., point landmarks) to-
gether. In robotics, this is called the simultaneous localization and map-
ping (SLAM) problem. However, in computer vision an almost identical
problem came to prominence through the application of aligning aerial
photographs into a mosaic; the classic solution to this problem is called
bundle adjustment (BA). We will look at BA through the lens of our
SE(3) estimation techniques.

9.1 Bundle Adjustment


Photogrammetry, the process of aerial map building, has been in use
since the 1920s (Dyce, 2012). It involves flying an airplane along a
route, taking hundreds or thousands of pictures of the terrain below,
and then stitching these together into a mosaic. In the early days,
photogrammetry was a highly laborious process; printed photographs
were spread out on a large surface and aligned by hand. From the late
1920s until the 1960s, clever projectors called multiplex stereoplotters
were used to more precisely align photos, but it was still a painstaking,
manual process. The first automated stitching of photos occurred in
the 1960s with the advent of computers and an algorithm called bun-
dle adjustment (Brown, 1958). Starting around the 1970s, aerial map
making was gradually replaced by satellite-based mapping (e.g., the US
Landsat program), but the basic algorithms for image stitching remain
the same (Triggs et al., 2000). It is worth noting that the robustness of
automated photogrammetry was increased significantly with the inven-
tion of modern feature detectors in computer vision, starting with the
work of Lowe (2004). Today, commercial software packages exist that
automate the photogrammetry process well, and they all essentially use
BA for alignment.
Figure 9.1 Definition of reference frames for the bundle adjustment problem. There is a stationary reference frame and a moving reference frame, attached to a vehicle. A collection of points, P_j, are observed by the moving vehicle (using a camera) and the goal is to determine the relative pose of the moving frame with respect to the stationary one (at all of the times of interest) as well as the positions of all of the points in the stationary frame.

9.1.1 Problem Setup

Figure 9.1 shows the setup for our bundle adjustment problem. The
state that we wish to estimate is

T_k = T_{v_k i} : transformation matrix representing the pose of the vehicle at time k,

p_j = \begin{bmatrix} r_i^{p_j i} \\ 1 \end{bmatrix} : homogeneous point representing the position of landmark j,

where k = 1 . . . K and j = 1 . . . M and we will use the cleaner ver-


sion of the notation to avoid writing out all the sub- and super-scripts
throughout the derivation. We will use the shorthand,

x = {T1 , . . . , TK , p1 , . . . , pM } , (9.1)

to indicate the entire state that we wish to estimate as well as xjk =


{Tk , pj } to indicate the subset of the state including the kth pose and
jth landmark. Notably, we exclude T0 from the state to be estimated
as the system is otherwise unobservable.

9.1.2 Measurement Model


There are two main differences between the problem treated here and
the one from the previous chapter. First, we are now estimating the
point positions in addition to the poses. Second, we will introduce a
nonlinear sensor model (e.g., a camera) such that we have a more com-
plicated measurement than simply the point expressed in the vehicle
frame.

Nonlinear Model
The measurement, y_{jk}, will correspond to some observation of point j
from pose k (i.e., some function of r_{v_k}^{p_j v_k}). The measurement model for
this problem will be of the form
yjk = g (xjk ) + njk , (9.2)
where g(·) is the nonlinear model and njk ∼ N (0, Rjk ) is additive
Gaussian noise. We can use the shorthand
y = {y10 , . . . yM 0 , . . . , y1K , . . . , yM K } , (9.3)
to capture all the measurements that we have available.
As discussed in Section 7.3.5, we can think of the overall observation
model as the composition of two nonlinearities: one to transform the
point into the vehicle frame and one to turn that point into the actual
sensor measurement through a camera (or other sensor) model. Letting

z(xjk ) = Tk pj , (9.4)
we can write
g (xjk ) = s (z(xjk )) , (9.5)
where s(·) is the nonlinear camera model1 . In other words, we have
g = s ◦ z, in terms of the composition of functions.
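As a concrete sketch of this composition (assuming, purely for illustration, an ideal pinhole camera with focal length `f` and principal point `(cu, cv)`; these specifics are not from the text, which leaves s(·) general):

```python
import numpy as np

def z(T_k, p_j):
    """Transform a homogeneous landmark into the vehicle frame: z = T_k p_j."""
    return T_k @ p_j

def s(q, f=200.0, cu=320.0, cv=240.0):
    """A simple pinhole camera model (illustrative stand-in for s(.))."""
    x, y, depth = q[0], q[1], q[2]
    return np.array([f * x / depth + cu, f * y / depth + cv])

def g(T_k, p_j):
    """Overall observation model: the composition g = s o z."""
    return s(z(T_k, p_j))

# A landmark 2 m ahead of an identity-pose camera, 0.5 m to the right:
p = np.array([0.5, 0.0, 2.0, 1.0])
print(g(np.eye(4), p))   # -> [370. 240.]
```

Swapping in a stereo or range-azimuth-elevation model for `s` changes only the second stage; the transformation stage `z` is common to all of them.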

Perturbed Model
We will go one step beyond simply linearizing our model and work
out the perturbed model to second order. This could be used, for ex-
ample, to estimate the bias in using ML estimation, as discussed to
Section 4.3.3.
We define the following perturbations to our state variables:

T_k = \exp\left( \epsilon_k^\wedge \right) T_{op,k} \approx \left( 1 + \epsilon_k^\wedge + \frac{1}{2} \epsilon_k^\wedge \epsilon_k^\wedge \right) T_{op,k},    (9.6a)

p_j = p_{op,j} + D \, \zeta_j,    (9.6b)

where

D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},    (9.7)

is a dilation matrix so that our landmark perturbation, \zeta_j, is 3 \times 1.
We will use the shorthand xop = {Top,1 , . . . , Top,K , pop,1 , . . . , pop,M }
to indicate the entire trajectory’s linearization operating point as well
as xop,jk = {Top,k , pop,j } to indicate the subset of the operating point
1 See Section 6.4 for several possibilities for the camera (or sensor) model.
including the kth pose and jth landmark. The perturbations will be
denoted

\delta x = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_K \\ \zeta_1 \\ \vdots \\ \zeta_M \end{bmatrix},    (9.8)

with the pose quantities on top and the landmark quantities on the
bottom. We will also use

\delta x_{jk} = \begin{bmatrix} \epsilon_k \\ \zeta_j \end{bmatrix},    (9.9)

to indicate just the perturbations associated with the kth pose and the
jth landmark.
Using the perturbation schemes above, we have that

z(x_{jk}) \approx \left( 1 + \epsilon_k^\wedge + \frac{1}{2} \epsilon_k^\wedge \epsilon_k^\wedge \right) \left( T_{op,k} \, p_{op,j} + T_{op,k} D \, \zeta_j \right)
\approx T_{op,k} \, p_{op,j} + \epsilon_k^\wedge T_{op,k} \, p_{op,j} + T_{op,k} D \, \zeta_j + \frac{1}{2} \epsilon_k^\wedge \epsilon_k^\wedge T_{op,k} \, p_{op,j} + \epsilon_k^\wedge T_{op,k} D \, \zeta_j
= z(x_{op,jk}) + Z_{jk} \, \delta x_{jk} + \frac{1}{2} \sum_i 1_i \underbrace{\delta x_{jk}^T \mathcal{Z}_{ijk} \, \delta x_{jk}}_{\text{scalar}},    (9.10)

correct to second order, where

z(x_{op,jk}) = T_{op,k} \, p_{op,j},    (9.11a)

Z_{jk} = \begin{bmatrix} \left( T_{op,k} \, p_{op,j} \right)^\odot & T_{op,k} D \end{bmatrix},    (9.11b)

\mathcal{Z}_{ijk} = \begin{bmatrix} 1_i^\circledast \left( T_{op,k} \, p_{op,j} \right)^\odot & 1_i^\circledast T_{op,k} D \\ \left( 1_i^\circledast T_{op,k} D \right)^T & 0 \end{bmatrix},    (9.11c)

and i is an index over the rows of z(\cdot), and 1_i is the ith column of the
identity matrix, 1.
To then apply the nonlinear camera model, we use the chain rule (for
first and second derivatives), so that

g(x_{jk}) = s(z(x_{jk}))
\approx s\Big( z_{op,jk} + \underbrace{Z_{jk} \, \delta x_{jk} + \frac{1}{2} \sum_m 1_m \, \delta x_{jk}^T \mathcal{Z}_{mjk} \, \delta x_{jk}}_{\delta z_{jk}} \Big)
\approx s(z_{op,jk}) + S_{jk} \, \delta z_{jk} + \frac{1}{2} \sum_i 1_i \, \delta z_{jk}^T \mathcal{S}_{ijk} \, \delta z_{jk}
= s(z_{op,jk}) + \sum_i 1_i 1_i^T S_{jk} \left( Z_{jk} \, \delta x_{jk} + \frac{1}{2} \sum_m 1_m \, \delta x_{jk}^T \mathcal{Z}_{mjk} \, \delta x_{jk} \right)
\quad + \frac{1}{2} \sum_i 1_i \left( Z_{jk} \, \delta x_{jk} + \frac{1}{2} \sum_m 1_m \, \delta x_{jk}^T \mathcal{Z}_{mjk} \, \delta x_{jk} \right)^T \mathcal{S}_{ijk} \left( Z_{jk} \, \delta x_{jk} + \frac{1}{2} \sum_m 1_m \, \delta x_{jk}^T \mathcal{Z}_{mjk} \, \delta x_{jk} \right)
\approx g(x_{op,jk}) + G_{jk} \, \delta x_{jk} + \frac{1}{2} \sum_i 1_i \underbrace{\delta x_{jk}^T \mathcal{G}_{ijk} \, \delta x_{jk}}_{\text{scalar}},    (9.12)

correct to second order, where

g(x_{op,jk}) = s(z(x_{op,jk})),    (9.13a)
G_{jk} = S_{jk} Z_{jk},    (9.13b)
S_{jk} = \left. \frac{\partial s}{\partial z} \right|_{z(x_{op,jk})},    (9.13c)
\mathcal{G}_{ijk} = Z_{jk}^T \mathcal{S}_{ijk} Z_{jk} + \sum_m \underbrace{1_i^T S_{jk} 1_m}_{\text{scalar}} \mathcal{Z}_{mjk},    (9.13d)
\mathcal{S}_{ijk} = \left. \frac{\partial^2 s_i}{\partial z \, \partial z^T} \right|_{z(x_{op,jk})},    (9.13e)

and i is an index over the rows of s(\cdot), and 1_i is the ith column of the
identity matrix, 1.
If we only care about the linearized (i.e., first-order) model, then we
can simply use

g(x_{jk}) \approx g(x_{op,jk}) + G_{jk} \, \delta x_{jk},    (9.14)

for our approximate observation model.
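The first-order model G_jk = S_jk Z_jk can be checked against finite differences. Below is a minimal sketch: the camera model `s` is an assumed toy normalized projection (not from the text), and the (·)^⊙ operator follows its definition from Chapter 7 for a homogeneous point.

```python
import numpy as np

def hat3(v):
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def se3_hat(eps):
    E = np.zeros((4, 4))
    E[:3, :3] = hat3(eps[3:])
    E[:3, 3] = eps[:3]
    return E

def expm_series(M, terms=30):
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for n in range(1, terms):
        term = term @ M / n
        out = out + term
    return out

def odot(p):
    """(.)^odot operator: eps_hat p = p^odot eps for homogeneous p (4x6)."""
    O = np.zeros((4, 6))
    O[:3, :3] = p[3] * np.eye(3)
    O[:3, 3:] = -hat3(p[:3])
    return O

D = np.vstack([np.eye(3), np.zeros((1, 3))])       # dilation matrix, Eq. (9.7)

def s(q):                                          # toy camera model (assumed)
    return np.array([q[0] / q[2], q[1] / q[2]])

def S_jac(q):                                      # ds/dz for the toy model
    return np.array([[1 / q[2], 0.0, -q[0] / q[2]**2, 0.0],
                     [0.0, 1 / q[2], -q[1] / q[2]**2, 0.0]])

T_op = expm_series(se3_hat(np.array([0.1, -0.2, 0.3, 0.02, -0.01, 0.03])))
p_op = np.array([0.4, -0.3, 2.0, 1.0])

# Analytic Jacobian: G = S [ (T p)^odot   T D ], Eqs. (9.11b) and (9.13b)
q = T_op @ p_op
G = S_jac(q) @ np.hstack([odot(q), T_op @ D])

# Finite-difference check of the same Jacobian (9 perturbation directions)
h, G_num = 1e-6, np.zeros((2, 9))
for i in range(9):
    d = np.zeros(9); d[i] = h
    T = expm_series(se3_hat(d[:6])) @ T_op
    p = p_op + D @ d[6:]
    G_num[:, i] = (s(T @ p) - s(q)) / h
assert np.allclose(G, G_num, atol=1e-5)
```

The check perturbs the pose on the left (matching the exp(ε^∧) T_op convention) and the landmark through D, so the nine columns line up with δx_jk = [ε_k; ζ_j].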



9.1.3 Maximum Likelihood Solution


We will set up the bundle adjustment problem using the ML framework
described in Section 4.3.3, which means we will not use a motion prior2 .
For each observation of a point from a pose, we define an error term
as
ey,jk (x) = yjk − g (xjk ) , (9.15)
where yjk is the measured quantity and g is our observation model de-
scribed above. We seek to find the values of x to minimize the following
objective function:
J(x) = \frac{1}{2} \sum_{j,k} e_{y,jk}(x)^T R_{jk}^{-1} e_{y,jk}(x),    (9.16)

where x is the full state that we wish to estimate (all poses and land-
marks) and Rjk is the symmetric, positive-definite covariance matrix
associated with the jkth measurement. If a particular landmark is not
actually observed from a particular pose, we can simply delete the ap-
propriate term from the objective function. The usual approach to this
estimation problem is to apply the Gauss-Newton method. Here we
will derive the full Newton’s method and then approximate to arrive
at Gauss-Newton.

Newton’s Method
Approximating the error function, we have

e_{y,jk}(x) \approx \underbrace{y_{jk} - g(x_{op,jk})}_{e_{y,jk}(x_{op})} - G_{jk} \, \delta x_{jk} - \frac{1}{2} \sum_i 1_i \, \delta x_{jk}^T \mathcal{G}_{ijk} \, \delta x_{jk},    (9.17)

and thus for the perturbed objective function we have

J(x) \approx J(x_{op}) - b^T \delta x + \frac{1}{2} \delta x^T A \, \delta x,    (9.18)

correct to second order, where

b = \sum_{j,k} P_{jk}^T G_{jk}^T R_{jk}^{-1} e_{y,jk}(x_{op}),    (9.19a)

A = \sum_{j,k} P_{jk}^T \Big( G_{jk}^T R_{jk}^{-1} G_{jk} - \overbrace{\sum_i \underbrace{\left( 1_i^T R_{jk}^{-1} e_{y,jk}(x_{op}) \right)}_{\text{scalar}} \mathcal{G}_{ijk}}^{\text{Gauss-Newton neglects this term}} \Big) P_{jk},    (9.19b)

\delta x_{jk} = P_{jk} \, \delta x,    (9.19c)
2 In robotics, when a motion prior or odometry smoothing terms are introduced, we
typically call this SLAM.

where Pjk is an appropriate projection matrix to pick off the jkth


components of the overall perturbed state, δx.
It is worth noting that A is symmetric, positive-definite. We can see
the term that Gauss-Newton normally neglects in the Hessian of J.
When ey,jk (xop ) is small, this new term has little effect (and this is the
usual justification for its neglect). However, far from the minimum, this
term will be more significant and could improve the rate and region of
convergence3 . We will consider the Gauss-Newton approximation in the
next section.
We now minimize J(x) with respect to \delta x by taking the derivative:

\frac{\partial J(x)}{\partial \, \delta x^T} = -b + A \, \delta x.    (9.20)

Setting this to zero, the optimal perturbation, \delta x^\star, is the solution to
the following linear system:

A \, \delta x^\star = b.    (9.21)
As usual, the procedure iterates between solving (9.21) for the optimal
perturbation,

\delta x^\star = \begin{bmatrix} \epsilon_1^\star \\ \vdots \\ \epsilon_K^\star \\ \zeta_1^\star \\ \vdots \\ \zeta_M^\star \end{bmatrix},    (9.22)

and updating the nominal quantities using the optimal perturbations
according to our original schemes,

T_{op,k} \leftarrow \exp\left( \epsilon_k^{\star\wedge} \right) T_{op,k},    (9.23a)
p_{op,j} \leftarrow p_{op,j} + D \, \zeta_j^\star,    (9.23b)

which ensure that T_{op,k} \in SE(3) and p_{op,j} keeps its bottom (fourth)
entry equal to 1. We continue until some convergence criterion is met.
Once converged, we set \hat{T}_{v_k i} = T_{op,k} and \hat{p}_i^{p_j i} = p_{op,j} at the last
iteration as the final estimates for the vehicle poses and landmark positions
of interest.

Gauss-Newton Method
Typically in practice, the Gauss-Newton approximation to the Hessian
is taken so that at each iteration we solve the linear system
A \, \delta x^\star = b,    (9.24)
3 In practice, including this extra term sometimes makes the numerical stability of the
whole procedure worse, so it should be added with caution.

with

b = \sum_{j,k} P_{jk}^T G_{jk}^T R_{jk}^{-1} e_{y,jk}(x_{op}),    (9.25a)

A = \sum_{j,k} P_{jk}^T G_{jk}^T R_{jk}^{-1} G_{jk} P_{jk},    (9.25b)

\delta x_{jk} = P_{jk} \, \delta x.    (9.25c)

This has the significant advantage of not requiring the second derivative
of the measurement model to be computed. Assembling the linear
system, we find it has the form

\underbrace{G^T R^{-1} G}_{A} \, \delta x^\star = \underbrace{G^T R^{-1} e_y(x_{op})}_{b},    (9.26)

with

G_{jk} = \begin{bmatrix} G_{1,jk} & G_{2,jk} \end{bmatrix}, \quad
G_{1,jk} = S_{jk} \left( T_{op,k} \, p_{op,j} \right)^\odot, \quad
G_{2,jk} = S_{jk} T_{op,k} D,    (9.27)

using the definitions from earlier.
In the case of K = 3 free poses (plus fixed pose 0) and M = 2
landmarks, the matrices have the form

G = \begin{bmatrix} G_1 & G_2 \end{bmatrix} = \begin{bmatrix}
 & & & G_{2,10} & \\
 & & & & G_{2,20} \\
G_{1,11} & & & G_{2,11} & \\
G_{1,21} & & & & G_{2,21} \\
 & G_{1,12} & & G_{2,12} & \\
 & G_{1,22} & & & G_{2,22} \\
 & & G_{1,13} & G_{2,13} & \\
 & & G_{1,23} & & G_{2,23}
\end{bmatrix},

e_y(x_{op}) = \begin{bmatrix} e_{y,10}(x_{op}) \\ e_{y,20}(x_{op}) \\ e_{y,11}(x_{op}) \\ e_{y,21}(x_{op}) \\ e_{y,12}(x_{op}) \\ e_{y,22}(x_{op}) \\ e_{y,13}(x_{op}) \\ e_{y,23}(x_{op}) \end{bmatrix},

R = \operatorname{diag}\left( R_{10}, R_{20}, R_{11}, R_{21}, R_{12}, R_{22}, R_{13}, R_{23} \right),    (9.28)

under one particular ordering of the measurements.
In general, multiplying out the left-hand side, A = G^T R^{-1} G, we see
that

A = \begin{bmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{bmatrix},    (9.29)

where

A_{11} = G_1^T R^{-1} G_1 = \operatorname{diag}\left( A_{11,1}, \ldots, A_{11,K} \right),    (9.30a)

A_{11,k} = \sum_{j=1}^M G_{1,jk}^T R_{jk}^{-1} G_{1,jk},    (9.30b)

A_{12} = G_1^T R^{-1} G_2 = \begin{bmatrix} A_{12,11} & \cdots & A_{12,M1} \\ \vdots & \ddots & \vdots \\ A_{12,1K} & \cdots & A_{12,MK} \end{bmatrix},    (9.30c)

A_{12,jk} = G_{1,jk}^T R_{jk}^{-1} G_{2,jk},    (9.30d)

A_{22} = G_2^T R^{-1} G_2 = \operatorname{diag}\left( A_{22,1}, \ldots, A_{22,M} \right),    (9.30e)

A_{22,j} = \sum_{k=0}^K G_{2,jk}^T R_{jk}^{-1} G_{2,jk}.    (9.30f)

The fact that both A11 and A22 are block-diagonal means this system
has a very special sparsity pattern that can be exploited to efficiently
solve for δx? at each iteration. This will be discussed in detail in the
next section.
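The block pattern in (9.30) can be checked numerically. Below is a minimal sketch for the K = 3, M = 2 example, with random dense blocks standing in for the true Jacobians and unit measurement covariances (values are placeholders; only the sparsity pattern matters):

```python
import numpy as np

rng = np.random.default_rng(42)
K, M, d = 3, 2, 3                  # free poses, landmarks, measurement dim

rows = d * M * (K + 1)             # one d-dim measurement per (j, k) pair
G1 = np.zeros((rows, 6 * K))       # pose Jacobian blocks (pose 0 fixed)
G2 = np.zeros((rows, 3 * M))       # landmark Jacobian blocks
r = 0
for k in range(K + 1):
    for j in range(M):
        if k > 0:                  # pose 0 contributes no pose block
            G1[r:r + d, 6 * (k - 1):6 * k] = rng.standard_normal((d, 6))
        G2[r:r + d, 3 * j:3 * (j + 1)] = rng.standard_normal((d, 3))
        r += d

A = np.hstack([G1, G2]).T @ np.hstack([G1, G2])   # A = G^T R^{-1} G, R = 1
A11, A22 = A[:6 * K, :6 * K], A[6 * K:, 6 * K:]

def block_diagonal(B, n):
    """True iff B is block-diagonal with n x n blocks."""
    N = B.shape[0] // n
    return all(np.allclose(B[n * i:n * (i + 1), n * j:n * (j + 1)], 0.0)
               for i in range(N) for j in range(N) if i != j)

assert block_diagonal(A11, 6)      # Eq. (9.30a): one block per pose
assert block_diagonal(A22, 3)      # Eq. (9.30e): one block per landmark
```

Each measurement row-block touches exactly one pose and one landmark, which is precisely what makes A11 and A22 block-diagonal while A12 stays dense.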

9.1.4 Exploiting Sparsity


Whether we choose to use Newton's method or Gauss-Newton, we are
faced with solving a system of the following form at each iteration:

\underbrace{\begin{bmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} \delta x_1^\star \\ \delta x_2^\star \end{bmatrix}}_{\delta x^\star}
= \underbrace{\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}}_{b},    (9.31)

where the state, \delta x^\star, has been partitioned into parts corresponding to
(1) the pose perturbations, \delta x_1^\star = \epsilon^\star, and (2) the landmark
perturbations, \delta x_2^\star = \zeta^\star.
It turns out that the Hessian of the objective function, A, has a
very special sparsity pattern as depicted in Figure 9.2; it is sometimes
referred to as an arrowhead matrix. This pattern is due to the presence
of the projection matrices, Pjk , in each term of A; they embody the
fact that each measurement involves just one pose variable and one
landmark.
As seen in Figure 9.2, we have that A11 and A22 are both block-
diagonal because each measurement involves only one pose and one
landmark at a time. We can exploit this sparsity to efficiently solve (9.21)
for δx? ; this is sometimes referred to as sparse bundle adjustment. There
are few different ways to do this; we will discuss the Schur complement
and a Cholesky technique.
Figure 9.2 Sparsity pattern of A. Non-zero entries are indicated by *. This structure is often referred to as an arrowhead matrix, because the ζ part is large compared to the ε part.

Schur Complement
Typically, the Schur complement is used to manipulate (9.31) into a
form that is more efficiently solved. This can be seen by premultiplying
both sides by

\begin{bmatrix} 1 & -A_{12} A_{22}^{-1} \\ 0 & 1 \end{bmatrix},

so that

\begin{bmatrix} A_{11} - A_{12} A_{22}^{-1} A_{12}^T & 0 \\ A_{12}^T & A_{22} \end{bmatrix}
\begin{bmatrix} \delta x_1^\star \\ \delta x_2^\star \end{bmatrix}
= \begin{bmatrix} b_1 - A_{12} A_{22}^{-1} b_2 \\ b_2 \end{bmatrix},

which has the same solution as (9.31). We may then easily solve for
\delta x_1^\star and, since A_{22} is block-diagonal, A_{22}^{-1} is cheap to compute.
Finally, \delta x_2^\star (if desired) can also be efficiently computed through
back-substitution, again owing to the sparsity of A_{22}. This procedure brings
the complexity of each solve down from O((K + M)^3) without sparsity
to O(K^3 + K^2 M) with sparsity, which is most beneficial when K \ll M.
A similar procedure can be had by exploiting the sparsity of A_{11}, but
in robotics problems we may also have some additional measurements
that perturb this structure and, more importantly, \delta x_2^\star is usually much
larger than \delta x_1^\star in bundle adjustment. While the Schur complement
method works well, it does not directly provide us with an explicit
method of computing A^{-1}, the covariance matrix associated with \delta x^\star,
should we desire it. The Cholesky approach is better suited to this end.

Cholesky Decomposition
Every symmetric positive-definite matrix, including A, can be factored
as follows through a Cholesky decomposition:

\underbrace{\begin{bmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{bmatrix}}_{A}
= \underbrace{\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}}_{U}
\underbrace{\begin{bmatrix} U_{11}^T & 0 \\ U_{12}^T & U_{22}^T \end{bmatrix}}_{U^T},    (9.32)

where U is an upper-triangular matrix. Multiplying this out reveals

U_{22} U_{22}^T = A_{22} : cheap to compute U_{22} via Cholesky due to A_{22} block-diagonal,
U_{12} U_{22}^T = A_{12} : cheap to solve for U_{12} due to U_{22} block-diagonal,
U_{11} U_{11}^T + U_{12} U_{12}^T = A_{11} : cheap to compute U_{11} via Cholesky due to small size of \delta x_1^\star,

so that we have a procedure to very efficiently compute U, owing to
the sparsity of A_{22}. Note that U_{22} is also block-diagonal.
If all we cared about was efficiently solving (9.31), then after computing
the Cholesky decomposition we can do so in two steps. First, solve

U c = b,    (9.33)

for a temporary variable, c. This can be done very quickly since U
is upper-triangular and so can be solved from the bottom to the top
through substitution and exploiting the additional known sparsity of
U. Second, solve

U^T \delta x^\star = c,    (9.34)

for \delta x^\star. Again, since U^T is lower-triangular we can solve quickly from
the top to the bottom through substitution and exploiting the sparsity.
Alternatively, we can invert U directly so that

\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}^{-1}
= \begin{bmatrix} U_{11}^{-1} & -U_{11}^{-1} U_{12} U_{22}^{-1} \\ 0 & U_{22}^{-1} \end{bmatrix},    (9.35)

which can again be computed efficiently due to the fact that U_{22} is
block-diagonal and U_{11} is small and in upper-triangular form. Then we
have that

U^T \delta x^\star = U^{-1} b,    (9.36)

or

\begin{bmatrix} U_{11}^T & 0 \\ U_{12}^T & U_{22}^T \end{bmatrix}
\begin{bmatrix} \delta x_1^\star \\ \delta x_2^\star \end{bmatrix}
= \begin{bmatrix} U_{11}^{-1} \left( b_1 - U_{12} U_{22}^{-1} b_2 \right) \\ U_{22}^{-1} b_2 \end{bmatrix},    (9.37)

which allows us to compute \delta x_1^\star and then back-substitute for \delta x_2^\star,
similarly to the Schur complement method.
Figure 9.3 A BA problem with only three (non-collinear) point landmarks and two free poses (1 and 2). Pose 0 is fixed. It turns out this problem does not have a unique solution as there are too few landmarks to constrain the two free poses. There is one term in the cost function, J_{y,jk}, for each measurement, as shown.

However, unlike the Schur complement method, A^{-1} is now computed
easily:

A^{-1} = \left( U U^T \right)^{-1} = U^{-T} U^{-1} = L L^T
= \underbrace{\begin{bmatrix} U_{11}^{-T} & 0 \\ -U_{22}^{-T} U_{12}^T U_{11}^{-T} & U_{22}^{-T} \end{bmatrix}}_{L}
\underbrace{\begin{bmatrix} U_{11}^{-1} & -U_{11}^{-1} U_{12} U_{22}^{-1} \\ 0 & U_{22}^{-1} \end{bmatrix}}_{L^T}
= \begin{bmatrix} U_{11}^{-T} U_{11}^{-1} & -U_{11}^{-T} U_{11}^{-1} U_{12} U_{22}^{-1} \\ -U_{22}^{-T} U_{12}^T U_{11}^{-T} U_{11}^{-1} & U_{22}^{-T} \left( U_{12}^T U_{11}^{-T} U_{11}^{-1} U_{12} + 1 \right) U_{22}^{-1} \end{bmatrix},    (9.38)

where we see additional room for efficiency through repeated products
inside the final matrix.

9.1.5 Interpolation Example


It will be instructive to work out the details for the small BA prob-
lem in Figure 9.3. There are three (non-collinear) point landmarks and
two free poses (1 and 2). We will assume pose 0 is fixed to make the
problem observable. We will also assume that the measurements of our
point landmarks are three-dimensional; thus the sensor could be either
a stereo camera or a range-azimuth-elevation sensor, for example. Un-
fortunately, there is not enough information present to uniquely solve
for the two free poses as well as the positions of the three landmarks.
This type of situation arises in rolling-shutter cameras and scanning-
while-moving laser sensors.
In the absence of any measurements of additional landmarks, our
only recourse is to assume something about the trajectory the vehicle
has taken. There are essentially two possibilities:

(i) Penalty Term: we can take a maximum a posteriori (MAP)


approach which assumes a prior density over trajectories and
encourages the solver to select a likely trajectory that is com-
patible with the measurements by introducing a penalty term
9.1 Bundle Adjustment 341

in the cost function. This is essentially the simultaneous local-


ization and mapping (SLAM) approach and will be treated in
the next section.
(ii) Constraint: we can stick with a maximum likelihood (ML) ap-
proach, but constrain the trajectory to be of a particular form.
Here we will do this by assuming the vehicle has a constant six-
degree-of-freedom velocity between poses 0 and 2 so that we
can use pose interpolation for pose 1. This reduces the number
of free pose variables from two to one and provides a unique
solution.

We will first set up the equations as though it were possible to solve for
both poses 1 and 2 and then introduce the pose-interpolation scheme.
The state variables to be estimated are

x = {T1 , T2 , p1 , p2 , p3 } . (9.39)

We use the usual perturbation schemes,

T_k = \exp\left( \epsilon_k^\wedge \right) T_{op,k}, \quad p_j = p_{op,j} + D \zeta_j,    (9.40)

and stack our perturbation variables as

\delta x = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix}.    (9.41)

At each iteration, the optimal perturbation variables should be the
solution to the linear system,

A \, \delta x^\star = b,    (9.42)

where the A and b matrices for this problem have the form

A = \begin{bmatrix}
A_{11} & & A_{13} & A_{14} & \\
& A_{22} & & A_{24} & A_{25} \\
A_{13}^T & & A_{33} & & \\
A_{14}^T & A_{24}^T & & A_{44} & \\
& A_{25}^T & & & A_{55}
\end{bmatrix}, \quad
b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{bmatrix},    (9.43)
with

A_{11} = G_{1,11}^T R_{11}^{-1} G_{1,11} + G_{1,21}^T R_{21}^{-1} G_{1,21},
A_{22} = G_{1,22}^T R_{22}^{-1} G_{1,22} + G_{1,32}^T R_{32}^{-1} G_{1,32},
A_{33} = G_{2,10}^T R_{10}^{-1} G_{2,10} + G_{2,11}^T R_{11}^{-1} G_{2,11},
A_{44} = G_{2,21}^T R_{21}^{-1} G_{2,21} + G_{2,22}^T R_{22}^{-1} G_{2,22},
A_{55} = G_{2,30}^T R_{30}^{-1} G_{2,30} + G_{2,32}^T R_{32}^{-1} G_{2,32},
A_{13} = G_{1,11}^T R_{11}^{-1} G_{2,11},
A_{14} = G_{1,21}^T R_{21}^{-1} G_{2,21},
A_{24} = G_{1,22}^T R_{22}^{-1} G_{2,22},
A_{25} = G_{1,32}^T R_{32}^{-1} G_{2,32},

and

b_1 = G_{1,11}^T R_{11}^{-1} e_{y,11}(x_{op}) + G_{1,21}^T R_{21}^{-1} e_{y,21}(x_{op}),
b_2 = G_{1,22}^T R_{22}^{-1} e_{y,22}(x_{op}) + G_{1,32}^T R_{32}^{-1} e_{y,32}(x_{op}),
b_3 = G_{2,10}^T R_{10}^{-1} e_{y,10}(x_{op}) + G_{2,11}^T R_{11}^{-1} e_{y,11}(x_{op}),
b_4 = G_{2,21}^T R_{21}^{-1} e_{y,21}(x_{op}) + G_{2,22}^T R_{22}^{-1} e_{y,22}(x_{op}),
b_5 = G_{2,30}^T R_{30}^{-1} e_{y,30}(x_{op}) + G_{2,32}^T R_{32}^{-1} e_{y,32}(x_{op}).

Unfortunately, A is not invertible in this situation, which means that
we cannot solve for \delta x^\star at any iteration.
To remedy the problem, we will assume that the vehicle has followed
a constant-velocity trajectory so that we can write T_1 in terms of T_2
using the pose-interpolation scheme of Section 7.1.7. To do this, we
require the times corresponding to each pose:

t_0, \; t_1, \; t_2.    (9.44)

We then define the interpolation variable,

\alpha = \frac{t_1 - t_0}{t_2 - t_0},    (9.45)

so that we can write

T_1 = T^\alpha,    (9.46)

where T = T_2. Our usual perturbation scheme is

T = \exp\left( \epsilon^\wedge \right) T_{op} \approx \left( 1 + \epsilon^\wedge \right) T_{op},    (9.47)

and for the interpolated variable we have

T^\alpha = \left( \exp\left( \epsilon^\wedge \right) T_{op} \right)^\alpha \approx \left( 1 + \left( A(\alpha, \xi_{op}) \, \epsilon \right)^\wedge \right) T_{op}^\alpha,    (9.48)

where A is the interpolation Jacobian and \xi_{op} = \ln\left( T_{op} \right)^\vee. Using this


pose-interpolation scheme, we can write the old stacked perturbation
variables in terms of a new reduced set:

\underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix}}_{\delta x}
= \underbrace{\begin{bmatrix} A(\alpha, \xi_{op}) & & & \\ 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & 1 \end{bmatrix}}_{\mathcal{I}}
\underbrace{\begin{bmatrix} \epsilon \\ \zeta_1 \\ \zeta_2 \\ \zeta_3 \end{bmatrix}}_{\delta x'},    (9.49)

where we will call I the interpolation matrix. Our new set of state
variables to be estimated is
x' = \{ T, p_1, p_2, p_3 \},    (9.50)

now that we have eliminated T_1 as a free variable. Returning to our
original ML cost function, we can now rewrite it as

J(x') \approx J(x'_{op}) - b'^T \delta x' + \frac{1}{2} \delta x'^T A' \, \delta x',    (9.51)

where

A' = \mathcal{I}^T A \, \mathcal{I}, \quad b' = \mathcal{I}^T b.    (9.52)

The optimal perturbation (that minimizes the cost function), \delta x'^\star, is
now the solution to

A' \, \delta x'^\star = b'.    (9.53)

We update all the operating points in

x'_{op} = \{ T_{op}, p_{op,1}, p_{op,2}, p_{op,3} \},    (9.54)

using the usual schemes,

T_{op} \leftarrow \exp\left( \epsilon^{\star\wedge} \right) T_{op}, \quad p_{op,j} \leftarrow p_{op,j} + D \zeta_j^\star,    (9.55)

and iterate to convergence.
Importantly, applying the interpolation matrix on either side of A to
create A' does not completely destroy the sparsity. In fact, the
bottom-right block corresponding to the landmarks remains block-diagonal, and
thus A' is still an arrowhead matrix:

A' = \begin{bmatrix} * & * & * & * \\ * & * & & \\ * & & * & \\ * & & & * \end{bmatrix},    (9.56)

where * indicates a non-zero block. This means that we can still exploit
the sparsity using the methods of the previous section, while interpolating
poses.
It turns out that we can use this interpolation scheme (and others) for
more complicated BA problems as well. We just need to decide which

pose variables we want to keep in the state and which to interpolate,


then build the appropriate interpolation matrix, I.

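The structure of A' = I^T A I can be checked with placeholder blocks. Below is a minimal sketch for the three-landmark example (random dense blocks stand in for A(α, ξ_op) and the entries of A; the actual values do not matter for the sparsity pattern):

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [6, 6, 3, 3, 3]                       # eps1, eps2, zeta1, zeta2, zeta3
off = np.cumsum([0] + sizes)

# Fill A with the sparsity pattern of Eq. (9.43): diagonal blocks plus
# the off-diagonal blocks A13, A14, A24, A25 (values are placeholders).
A = np.zeros((off[-1], off[-1]))
for i in range(5):
    A[off[i]:off[i+1], off[i]:off[i+1]] = np.eye(sizes[i])
for i, j in [(0, 2), (0, 3), (1, 3), (1, 4)]:
    B = rng.standard_normal((sizes[i], sizes[j]))
    A[off[i]:off[i+1], off[j]:off[j+1]] = B
    A[off[j]:off[j+1], off[i]:off[i+1]] = B.T

# Interpolation matrix of Eq. (9.49): eps1 = A(alpha, xi_op) eps, eps2 = eps.
A_alpha = rng.standard_normal((6, 6))         # placeholder for A(alpha, xi_op)
I_mat = np.zeros((21, 15))
I_mat[0:6, 0:6] = A_alpha
I_mat[6:12, 0:6] = np.eye(6)
I_mat[12:21, 6:15] = np.eye(9)

A_prime = I_mat.T @ A @ I_mat                 # Eq. (9.52)

# The landmark-landmark part of A' stays block-diagonal (Eq. (9.56)) ...
for i in range(3):
    for j in range(3):
        blk = A_prime[6 + 3*i:9 + 3*i, 6 + 3*j:9 + 3*j]
        assert i == j or np.allclose(blk, 0.0)
# ... while the single remaining pose block couples to every landmark.
assert not np.allclose(A_prime[0:6, 6:9], 0.0)
```

Collapsing the two pose columns into one mixes only the pose part of A; the ζ-ζ block passes through untouched, which is why the arrowhead survives.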
9.2 Simultaneous Localization and Mapping


The SLAM problem is essentially the same as BA, except that we
also typically know something about how the vehicle has moved (i.e.,
a motion model) and can therefore include inputs, v, in the problem.
Logistically, we only need to augment the BA cost function with ad-
ditional terms corresponding to the inputs (Sibley, 2006). Smith et al.
(1990) is the classic reference on SLAM and Durrant-Whyte and Bai-
ley (2006); Bailey and Durrant-Whyte (2006) provide a detailed sur-
vey of the area. The difference is essentially that BA is a maximum
likelihood (ML) problem and SLAM is a maximum a posteriori (MAP)
problem. Our approach is a batch-SLAM method (Lu and Milios, 1997)
similar to the Graph SLAM approach of Thrun and Montemerlo (2005),
but using our method of handling pose variables in three-dimensional
space.

9.2.1 Problem Setup


Another minor difference is that by including inputs/priors, we can
also assume that we have a prior on the initial state, T0 , so that it can
be included in the estimation problem (unlike BA)4 . Our state to be
estimated is thus

x = {T0 , . . . , TK , p1 , . . . , pM } . (9.57)

We assume the same measurement model as the BA problem and the


measurements are given by

y = {y10 , . . . yM 0 , . . . , y1K , . . . , yM K } . (9.58)

We will adopt the motion model from Section 8.2 and the inputs are
given by

v = \left\{ \check{T}_0, \varpi_1, \varpi_2, \ldots, \varpi_K \right\}.    (9.59)

We will next set up the batch MAP problem.

4 We could also choose not to estimate it and simply hold it fixed, which is very common.

9.2.2 Batch Maximum a Posteriori Solution


We define the following matrices:

\delta x = \begin{bmatrix} \delta x_1 \\ \delta x_2 \end{bmatrix}, \quad
H = \begin{bmatrix} F^{-1} & 0 \\ G_1 & G_2 \end{bmatrix}, \quad
W = \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix}, \quad
e(x_{op}) = \begin{bmatrix} e_v(x_{op}) \\ e_y(x_{op}) \end{bmatrix},    (9.60)

where

\delta x_1 = \begin{bmatrix} \epsilon_0 \\ \epsilon_1 \\ \vdots \\ \epsilon_K \end{bmatrix}, \quad
\delta x_2 = \begin{bmatrix} \zeta_1 \\ \zeta_2 \\ \vdots \\ \zeta_M \end{bmatrix}, \quad
e_v(x_{op}) = \begin{bmatrix} e_{v,0}(x_{op}) \\ e_{v,1}(x_{op}) \\ \vdots \\ e_{v,K}(x_{op}) \end{bmatrix}, \quad
e_y(x_{op}) = \begin{bmatrix} e_{y,10}(x_{op}) \\ e_{y,20}(x_{op}) \\ \vdots \\ e_{y,MK}(x_{op}) \end{bmatrix},

Q = \operatorname{diag}\left( \check{P}_0, Q_1, \ldots, Q_K \right), \quad
R = \operatorname{diag}\left( R_{10}, R_{20}, \ldots, R_{MK} \right),

F^{-1} = \begin{bmatrix} 1 & & & & \\ -F_0 & 1 & & & \\ & -F_1 & \ddots & & \\ & & \ddots & 1 & \\ & & & -F_{K-1} & 1 \end{bmatrix},    (9.61)

G_1 = \begin{bmatrix} G_{1,10} \\ \vdots \\ G_{1,M0} \\ G_{1,11} \\ \vdots \\ G_{1,M1} \\ \vdots \\ G_{1,1K} \\ \vdots \\ G_{1,MK} \end{bmatrix}, \quad
G_2 = \begin{bmatrix} G_{2,10} \\ \vdots \\ G_{2,M0} \\ G_{2,11} \\ \vdots \\ G_{2,M1} \\ \vdots \\ G_{2,1K} \\ \vdots \\ G_{2,MK} \end{bmatrix}.
From Sections 8.2.5 for the motion priors and 9.1.3 for the measurements, the detailed blocks are

$$
\mathbf{F}_{k-1} = \operatorname{Ad}\left(\mathbf{T}_{\text{op},k} \mathbf{T}_{\text{op},k-1}^{-1}\right), \quad k = 1 \ldots K,
$$

$$
\mathbf{e}_{v,k}(\mathbf{x}_{\text{op}}) =
\begin{cases}
\ln\left(\check{\mathbf{T}}_0 \mathbf{T}_{\text{op},0}^{-1}\right)^\vee & k = 0 \\
\ln\left(\exp\left((t_k - t_{k-1})\boldsymbol{\varpi}_k^\wedge\right) \mathbf{T}_{\text{op},k-1} \mathbf{T}_{\text{op},k}^{-1}\right)^\vee & k = 1 \ldots K
\end{cases},
\tag{9.62}
$$

$$
\mathbf{G}_{1,jk} = \mathbf{S}_{jk} \left(\mathbf{T}_{\text{op},k} \mathbf{p}_{\text{op},j}\right)^\odot, \quad
\mathbf{G}_{2,jk} = \mathbf{S}_{jk} \mathbf{T}_{\text{op},k} \mathbf{D}, \quad
\mathbf{e}_{y,jk}(\mathbf{x}_{\text{op}}) = \mathbf{y}_{jk} - \mathbf{s}\left(\mathbf{T}_{\text{op},k} \mathbf{p}_{\text{op},j}\right).
$$
Finally, the objective function can be written as usual as

$$
J(\mathbf{x}) \approx J(\mathbf{x}_{\text{op}}) - \mathbf{b}^T \delta\mathbf{x} + \frac{1}{2} \delta\mathbf{x}^T \mathbf{A}\, \delta\mathbf{x},
\tag{9.63}
$$

where

$$
\mathbf{A} = \mathbf{H}^T \mathbf{W}^{-1} \mathbf{H}, \quad \mathbf{b} = \mathbf{H}^T \mathbf{W}^{-1} \mathbf{e}(\mathbf{x}_{\text{op}}),
\tag{9.64}
$$

whereupon the minimizing perturbations, $\delta\mathbf{x}^\star$, are the solutions to

$$
\mathbf{A}\, \delta\mathbf{x}^\star = \mathbf{b}.
\tag{9.65}
$$

We solve for $\delta\mathbf{x}^\star$, then update our operating points according to

$$
\mathbf{T}_{\text{op},k} \leftarrow \exp\left(\boldsymbol{\epsilon}_k^{\star\wedge}\right) \mathbf{T}_{\text{op},k}, \quad
\mathbf{p}_{\text{op},j} \leftarrow \mathbf{p}_{\text{op},j} + \mathbf{D} \boldsymbol{\zeta}_j^\star,
\tag{9.66}
$$

and iterate to convergence. As in the BA case, once converged we set $\hat{\mathbf{T}}_{v_k i} = \mathbf{T}_{\text{op},k}$ and $\hat{\mathbf{p}}_i^{p_j i} = \mathbf{p}_{\text{op},j}$ at the last iteration as the final estimates for the vehicle poses and landmark positions of interest.

9.2.3 Exploiting Sparsity


Introducing the motion priors does not destroy the nice sparsity of the
original BA problem. We can see this by noting that
 
$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{12}^T & \mathbf{A}_{22} \end{bmatrix}
= \mathbf{H}^T \mathbf{W}^{-1} \mathbf{H}
= \begin{bmatrix}
\mathbf{F}^{-T} \mathbf{Q}^{-1} \mathbf{F}^{-1} + \mathbf{G}_1^T \mathbf{R}^{-1} \mathbf{G}_1 & \mathbf{G}_1^T \mathbf{R}^{-1} \mathbf{G}_2 \\
\mathbf{G}_2^T \mathbf{R}^{-1} \mathbf{G}_1 & \mathbf{G}_2^T \mathbf{R}^{-1} \mathbf{G}_2
\end{bmatrix}.
\tag{9.67}
$$

Compared to the BA problem, blocks A12 and A22 have not changed at all, showing that A is still an arrowhead matrix with A22 block-diagonal. We can thus exploit this sparsity to solve for the perturbations at each iteration efficiently using either the Schur or Cholesky methods. While block A11 is now different than the BA problem,

$$
\mathbf{A}_{11} = \underbrace{\mathbf{F}^{-T} \mathbf{Q}^{-1} \mathbf{F}^{-1}}_{\text{prior}} + \underbrace{\mathbf{G}_1^T \mathbf{R}^{-1} \mathbf{G}_1}_{\text{measurements}},
\tag{9.68}
$$
we have seen previously (e.g., Section 8.2.5) that it is block-tridiagonal. Thus, we could choose to exploit the sparsity of A11 rather than A22, if the number of poses were large compared to the number of landmarks, for example. In this case, the Cholesky method is preferred over the Schur one as we do not need to construct A11⁻¹, which is actually dense. Kaess et al. (2008, 2012) provide incremental methods of updating the batch-SLAM solution that exploit sparsity beyond the primary block structure discussed here.

[Figure 9.4: A SLAM problem with only three (non-collinear) point landmarks and three free poses. Pose 0 is not fixed as we have some prior information about it. There is one term in the cost function for each measurement, Jy,jk, and motion prior, Jv,k, as shown.]
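The Schur-complement solve mentioned above is cheap precisely because A22 is block-diagonal: it can be inverted one small landmark block at a time. Below is a hedged NumPy sketch of this (the function and variable names are my own, not from the text); it marginalizes out the landmark perturbations, solves the reduced pose system, and back-substitutes.

```python
import numpy as np

def schur_solve(A11, A12, A22_blocks, b1, b2):
    """Solve the arrowhead system [[A11, A12], [A12^T, A22]] [dx1; dx2] = [b1; b2],
    where A22 = diag(A22_blocks) is block-diagonal (one small block per landmark)."""
    n = A22_blocks[0].shape[0]
    m = A12.shape[1]
    # Invert A22 block-by-block (cheap because each block is small).
    A22_inv = np.zeros((m, m))
    for i, blk in enumerate(A22_blocks):
        A22_inv[i*n:(i+1)*n, i*n:(i+1)*n] = np.linalg.inv(blk)
    # Schur complement on the pose block.
    S = A11 - A12 @ A22_inv @ A12.T
    dx1 = np.linalg.solve(S, b1 - A12 @ A22_inv @ b2)
    # Back-substitute for the landmark perturbations.
    dx2 = A22_inv @ (b2 - A12.T @ dx1)
    return dx1, dx2
```

The dominant cost is the dense solve on the pose block alone, rather than on the full pose-plus-landmark system.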

9.2.4 Example
Figure 9.4 shows a simple SLAM problem with three point landmarks and three free poses. In contrast to the BA example of Figure 9.3, we now allow T0 to be estimated as we have some prior information about it, {Ť0, P̌0}, in relation to an external reference frame (shown as the black, fixed pose). We have shown graphically all of the terms in the objective function⁵, one for each measurement and input, totalling nine terms:

$$
J = \underbrace{J_{v,0} + J_{v,1} + J_{v,2}}_{\text{prior terms}} + \underbrace{J_{y,10} + J_{y,30} + J_{y,11} + J_{y,21} + J_{y,22} + J_{y,32}}_{\text{measurement terms}}.
\tag{9.69}
$$

Also, with the motion priors we have used, A is always well conditioned and will provide a solution for the trajectory, even without any measurements.

5 Sometimes this type of diagram is called a factor graph with each ‘factor’ from the
posterior likelihood over the states becoming a ‘term’ in the objective function, which
is really just the negative log-likelihood of the posterior over states.
10

Continuous-Time Estimation

All of our examples in this last part of the book have been in discrete
time, which is sufficient for many applications. However, it is worth
investigating how we might make use of the continuous-time estimation
tools from Sections 3.4 and 4.4 when working with state variables in
SE(3). To this end, we show one way to start from a specific nonlinear,
stochastic differential equation and build motion priors that encourage
trajectory smoothness1 . We then show where these motion priors could
be used within a trajectory estimation problem. Finally, we show how
to interpolate for query poses at times in between the main solution
times, using Gaussian process interpolation.

10.1 Motion Prior


We will begin by discussing how to represent general motion priors on
SE(3). We will do this in the context of a specific nonlinear, stochastic
differential equation. We will also further simplify this to make the
analysis tractable.

10.1.1 General

Ideally, we would like to use the following system of nonlinear, stochastic, differential equations to build our motion prior²:

$$
\dot{\mathbf{T}}(t) = \boldsymbol{\varpi}(t)^\wedge \mathbf{T}(t), \quad
\dot{\boldsymbol{\varpi}}(t) = \mathbf{w}(t), \quad
\mathbf{w}(t) \sim \mathcal{GP}\left(\mathbf{0}, \mathbf{Q}\,\delta(t - t')\right).
\tag{10.1}
$$

To use this to build our motion priors, we will need to estimate the pose, T(t), and the body-centric, generalized angular velocity, ϖ(t), at some times of interest: t0, t1, . . . , tK. White noise, w(t), enters the system through the generalized angular acceleration; in the absence of noise, the body-centric, generalized angular velocity is constant. The trouble
1 See Furgale et al. (2015) for a survey of continuous-time methods.
2 It is this model that we approximated as a discrete-time system in the previous two
chapters in order to build discrete-time motion priors.


with using this model directly in continuous time is that it is nonlinear and the state, {T(t), ϖ(t)} ∈ SE(3) × R⁶. Even the nonlinear tools from Section 4.4 do not directly apply in this situation.

Gaussian Processes for SE(3)

Inspired by the way we have handled Gaussian random variables for Lie groups, we can define a Gaussian process for poses in which the mean function is defined on SE(3) × R⁶ and the covariance function is defined on se(3) × R⁶:

$$
\begin{aligned}
\text{mean function:} &\quad \left\{ \check{\mathbf{T}}(t), \check{\boldsymbol{\varpi}} \right\}, \\
\text{covariance function:} &\quad \check{\mathbf{P}}(t, t'), \\
\text{combination:} &\quad \mathbf{T}(t) = \exp\left(\boldsymbol{\xi}(t)^\wedge\right) \check{\mathbf{T}}(t), \\
&\quad \boldsymbol{\varpi}(t) = \check{\boldsymbol{\varpi}} + \boldsymbol{\eta}(t), \\
&\quad \underbrace{\begin{bmatrix} \boldsymbol{\xi}(t) \\ \boldsymbol{\eta}(t) \end{bmatrix}}_{\boldsymbol{\gamma}(t)} \sim \mathcal{GP}\left(\mathbf{0}, \check{\mathbf{P}}(t, t')\right).
\end{aligned}
\tag{10.2}
$$

While we could attempt to specify the covariance function directly, we prefer to define it via a stochastic differential equation. Fortunately, we can use (7.243) from Section 7.2.4 to break the above SDE into nominal (mean) and perturbed (noise) parts:

$$
\begin{aligned}
\text{nominal (mean):} &\quad \dot{\check{\mathbf{T}}}(t) = \check{\boldsymbol{\varpi}}^\wedge \check{\mathbf{T}}(t), \\
&\quad \dot{\check{\boldsymbol{\varpi}}} = \mathbf{0}, \\
\text{perturbation (cov):} &\quad
\underbrace{\begin{bmatrix} \dot{\boldsymbol{\xi}}(t) \\ \dot{\boldsymbol{\eta}}(t) \end{bmatrix}}_{\dot{\boldsymbol{\gamma}}(t)} =
\underbrace{\begin{bmatrix} \check{\boldsymbol{\varpi}}^\curlywedge & \mathbf{1} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}}_{\mathbf{A}}
\underbrace{\begin{bmatrix} \boldsymbol{\xi}(t) \\ \boldsymbol{\eta}(t) \end{bmatrix}}_{\boldsymbol{\gamma}(t)} +
\underbrace{\begin{bmatrix} \mathbf{0} \\ \mathbf{1} \end{bmatrix}}_{\mathbf{L}} \mathbf{w}(t), \\
&\quad \mathbf{w}(t) \sim \mathcal{GP}\left(\mathbf{0}, \mathbf{Q}\,\delta(t - t')\right).
\end{aligned}
\tag{10.3}
$$
The (deterministic) differential equation defining the mean function is nonlinear and can be integrated in closed form,

$$
\check{\mathbf{T}}(t) = \exp\left((t - t_0)\check{\boldsymbol{\varpi}}^\wedge\right) \check{\mathbf{T}}_0,
\tag{10.4}
$$
while the SDE defining the covariance function is linear, time-invariant3
and thus we can apply the tools from Section 3.4, specifically 3.4.3, to
build a motion prior. It is important to note that this new system of
equations only approximates the (ideal) one at the start of the section;
they will be very similar when ξ(t) remains small. Anderson and Bar-
foot (2015) provide a different method of approximating the desired
nonlinear SDE that employs local variables.
3 While we have that the prior generalized angular velocity does not change over time, in
general we could use whatever time-varying function we like, $̌(t), but the
computation of the transition function will become more complicated. Letting it be
piecewise constant between measurement times would be a straightforward extension.

Transition Function

With ϖ̌ constant, the transition function for the perturbation system is

$$
\boldsymbol{\Phi}(t, s) = \begin{bmatrix}
\exp\left((t-s)\check{\boldsymbol{\varpi}}^\curlywedge\right) & (t-s)\,\mathbf{J}\left((t-s)\check{\boldsymbol{\varpi}}\right) \\
\mathbf{0} & \mathbf{1}
\end{bmatrix},
\tag{10.5}
$$

with no approximation. This can be checked by verifying that Φ(t, t) = 1 and Φ̇(t, s) = A Φ(t, s). We can also see this by working out the transition function directly for this LTI system:

$$
\begin{aligned}
\boldsymbol{\Phi}(t, s) &= \exp\left(\mathbf{A}(t-s)\right)
= \sum_{n=0}^{\infty} \frac{(t-s)^n}{n!} \mathbf{A}^n
= \sum_{n=0}^{\infty} \frac{(t-s)^n}{n!} \begin{bmatrix} \check{\boldsymbol{\varpi}}^\curlywedge & \mathbf{1} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}^n \\
&= \begin{bmatrix}
\sum_{n=0}^{\infty} \frac{(t-s)^n}{n!} \left(\check{\boldsymbol{\varpi}}^\curlywedge\right)^n & (t-s) \sum_{n=0}^{\infty} \frac{(t-s)^n}{(n+1)!} \left(\check{\boldsymbol{\varpi}}^\curlywedge\right)^n \\
\mathbf{0} & \mathbf{1}
\end{bmatrix} \\
&= \begin{bmatrix}
\exp\left((t-s)\check{\boldsymbol{\varpi}}^\curlywedge\right) & (t-s)\,\mathbf{J}\left((t-s)\check{\boldsymbol{\varpi}}\right) \\
\mathbf{0} & \mathbf{1}
\end{bmatrix}.
\end{aligned}
\tag{10.6}
$$

Notably, it is somewhat rare that we can find the transition function so neatly when A is not nilpotent.
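The check suggested above (Φ(t, t) = 1 and Φ̇ = AΦ) is easy to carry out numerically. The following is a small NumPy sketch under my own helper names; it evaluates the two closed-form blocks of (10.5) by truncated power series, which is adequate when the argument is small.

```python
import numpy as np

def hat(v):
    """3x3 skew-symmetric matrix from a 3-vector."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def curly(varpi):
    """6x6 adjoint ('curly wedge') of a generalized velocity (nu, omega)."""
    nu, om = varpi[:3], varpi[3:]
    ad = np.zeros((6, 6))
    ad[:3, :3] = hat(om)
    ad[:3, 3:] = hat(nu)
    ad[3:, 3:] = hat(om)
    return ad

def exp_series(M, N=30):
    """exp(M) by truncated power series (fine for small ||M||)."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, N):
        term = term @ M / n
        out = out + term
    return out

def jac_series(M, N=30):
    """J(M) = sum_n M^n / (n+1)! by truncated series."""
    out = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for n in range(1, N):
        term = term @ M / n          # term is now M^n / n!
        out = out + term / (n + 1)   # contributes M^n / (n+1)!
    return out

def Phi(t, s, varpi):
    """Closed-form transition function (10.5)."""
    x = t - s
    ad = curly(varpi)
    out = np.eye(12)
    out[:6, :6] = exp_series(x * ad)
    out[:6, 6:] = x * jac_series(x * ad)
    return out
```

A central finite difference in t then confirms that Φ̇(t, s) agrees with AΦ(t, s) for the closed form, which is the defining property of a transition function.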

Error Terms

With the transition function in hand, we can define error terms for use within a maximum-a-posteriori estimator. The error term at the first timestep will be

$$
\mathbf{e}_{v,0}(\mathbf{x}) = -\boldsymbol{\gamma}(t_0) = -\begin{bmatrix} \ln\left(\mathbf{T}_0 \check{\mathbf{T}}_0^{-1}\right)^\vee \\ \boldsymbol{\varpi}_0 - \check{\boldsymbol{\varpi}} \end{bmatrix},
\tag{10.7}
$$

where P̌0 is our initial state uncertainty. For later times, k = 1 . . . K, we define the error term to be

$$
\mathbf{e}_{v,k}(\mathbf{x}) = \boldsymbol{\Phi}(t_k, t_{k-1})\,\boldsymbol{\gamma}(t_{k-1}) - \boldsymbol{\gamma}(t_k),
\tag{10.8}
$$

which takes a bit of explanation. The idea is to formulate the error on se(3) × R⁶ since this is where the covariance function lives. It is worth noting that the mean function for γ(t) was defined to be zero, making this error definition straightforward. We can also write this as

$$
\mathbf{e}_{v,k}(\mathbf{x}) = \boldsymbol{\Phi}(t_k, t_{k-1}) \begin{bmatrix} \ln\left(\mathbf{T}_{k-1} \check{\mathbf{T}}_{k-1}^{-1}\right)^\vee \\ \boldsymbol{\varpi}_{k-1} - \check{\boldsymbol{\varpi}} \end{bmatrix} - \begin{bmatrix} \ln\left(\mathbf{T}_k \check{\mathbf{T}}_k^{-1}\right)^\vee \\ \boldsymbol{\varpi}_k - \check{\boldsymbol{\varpi}} \end{bmatrix},
\tag{10.9}
$$

in terms of the quantities to be estimated: Tk, ϖk, Tk−1, and ϖk−1. The covariance matrix associated with this error is given by

$$
\mathbf{Q}_k = \int_{t_{k-1}}^{t_k} \boldsymbol{\Phi}(t_k, s)\, \mathbf{L} \mathbf{Q} \mathbf{L}^T\, \boldsymbol{\Phi}(t_k, s)^T\, ds,
\tag{10.10}
$$

which is described in more detail in Section 3.4.3; again, we are using the SDE defined on se(3) × R⁶ to compute this covariance. Despite having the transition matrix in closed form, this integral is best computed either numerically or by making an approximation or simplification (which we do in the next section). The combined objective function for the entire motion prior is thus given by

$$
J_v(\mathbf{x}) = \frac{1}{2} \mathbf{e}_{v,0}(\mathbf{x})^T \check{\mathbf{P}}_0^{-1} \mathbf{e}_{v,0}(\mathbf{x}) + \frac{1}{2} \sum_{k=1}^{K} \mathbf{e}_{v,k}(\mathbf{x})^T \mathbf{Q}_k^{-1} \mathbf{e}_{v,k}(\mathbf{x}),
\tag{10.11}
$$

which does not contain any terms associated with the measurements.

Linearized Error Terms

To linearize our error terms defined above, we adopt the SE(3) perturbation scheme for poses and the usual one for generalized velocities,

$$
\mathbf{T}_k = \exp\left(\boldsymbol{\epsilon}_k^\wedge\right) \mathbf{T}_{\text{op},k}, \quad
\boldsymbol{\varpi}_k = \boldsymbol{\varpi}_{\text{op},k} + \boldsymbol{\psi}_k,
\tag{10.12}
$$

where {Top,k, ϖop,k} is the operating point and (εk, ψk) is the perturbation. Using the pose perturbation scheme we see that

$$
\boldsymbol{\xi}_k = \ln\left(\mathbf{T}_k \check{\mathbf{T}}_k^{-1}\right)^\vee
= \ln\left(\exp\left(\boldsymbol{\epsilon}_k^\wedge\right) \mathbf{T}_{\text{op},k} \check{\mathbf{T}}_k^{-1}\right)^\vee
= \ln\left(\exp\left(\boldsymbol{\epsilon}_k^\wedge\right) \exp\left(\boldsymbol{\xi}_{\text{op},k}^\wedge\right)\right)^\vee
\approx \boldsymbol{\xi}_{\text{op},k} + \boldsymbol{\epsilon}_k,
\tag{10.13}
$$

where $\boldsymbol{\xi}_{\text{op},k} = \ln\left(\mathbf{T}_{\text{op},k} \check{\mathbf{T}}_k^{-1}\right)^\vee$. We have used a very approximate version of the BCH formula here, which is only valid if εk and ξop,k are both quite small. The former is reasonable since εk → 0 as our estimation scheme converges. The latter will be so if the motion prior has low uncertainty; we have essentially already made this assumption in separating our SDE into the nominal and perturbation parts. Inserting this linearization results in the following linearized error terms:

$$
\mathbf{e}_{v,k}(\mathbf{x}) \approx
\begin{cases}
\mathbf{e}_{v,0}(\mathbf{x}_{\text{op}}) - \boldsymbol{\theta}_0 & k = 0 \\
\mathbf{e}_{v,k}(\mathbf{x}_{\text{op}}) + \mathbf{F}_{k-1} \boldsymbol{\theta}_{k-1} - \boldsymbol{\theta}_k & k = 1 \ldots K
\end{cases},
\tag{10.14}
$$

where

$$
\boldsymbol{\theta}_k = \begin{bmatrix} \boldsymbol{\epsilon}_k \\ \boldsymbol{\psi}_k \end{bmatrix},
\tag{10.15}
$$

is the stacked perturbation for both the pose and generalized angular velocity at time k and

$$
\mathbf{F}_{k-1} = \boldsymbol{\Phi}(t_k, t_{k-1}).
\tag{10.16}
$$

Defining

$$
\delta\mathbf{x}_1 = \begin{bmatrix} \boldsymbol{\theta}_0 \\ \boldsymbol{\theta}_1 \\ \vdots \\ \boldsymbol{\theta}_K \end{bmatrix}, \quad
\mathbf{e}_v(\mathbf{x}_{\text{op}}) = \begin{bmatrix} \mathbf{e}_{v,0}(\mathbf{x}_{\text{op}}) \\ \mathbf{e}_{v,1}(\mathbf{x}_{\text{op}}) \\ \vdots \\ \mathbf{e}_{v,K}(\mathbf{x}_{\text{op}}) \end{bmatrix}, \quad
\mathbf{F}^{-1} = \begin{bmatrix}
\mathbf{1} & & & & \\
-\mathbf{F}_0 & \mathbf{1} & & & \\
& -\mathbf{F}_1 & \ddots & & \\
& & \ddots & \mathbf{1} & \\
& & & -\mathbf{F}_{K-1} & \mathbf{1}
\end{bmatrix},
$$

$$
\mathbf{Q} = \operatorname{diag}\left(\check{\mathbf{P}}_0, \mathbf{Q}_1, \ldots, \mathbf{Q}_K\right),
\tag{10.17}
$$

we can write the approximate motion-prior part of the objective function in block form as

$$
J_v(\mathbf{x}) \approx \frac{1}{2} \left(\mathbf{e}_v(\mathbf{x}_{\text{op}}) - \mathbf{F}^{-1} \delta\mathbf{x}_1\right)^T \mathbf{Q}^{-1} \left(\mathbf{e}_v(\mathbf{x}_{\text{op}}) - \mathbf{F}^{-1} \delta\mathbf{x}_1\right),
\tag{10.18}
$$

which is quadratic in the perturbation, δx1.
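One way to see the structure implied by (10.17) is that the inverse of the block-lower-bidiagonal F⁻¹ is block-lower-triangular, with block (i, j), i ≥ j, given by the compounded transition blocks F_{i−1} · · · F_j. A small NumPy sketch of this (function name mine, not from the text):

```python
import numpy as np

def build_F_inv(F_blocks, n):
    """Assemble the block-lower-bidiagonal matrix F^{-1} from (10.17).
    F_blocks = [F_0, ..., F_{K-1}], each n x n; identity blocks on the diagonal."""
    K = len(F_blocks)
    M = np.eye((K + 1) * n)
    for k, Fk in enumerate(F_blocks):
        M[(k + 1) * n:(k + 2) * n, k * n:(k + 1) * n] = -Fk
    return M

rng = np.random.default_rng(1)
n, K = 4, 3
F_blocks = [rng.standard_normal((n, n)) for _ in range(K)]
F_inv = build_F_inv(F_blocks, n)
# Inverting recovers the 'lifted' form: block (i, j) of F is F_{i-1} ... F_j.
F = np.linalg.inv(F_inv)
```

This is the same lifted-matrix structure encountered for linear-Gaussian priors in Section 3: the sparse bidiagonal factor is what keeps the prior part of A block-tridiagonal.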

10.1.2 Simplification

Computing the Qk blocks in the general case can be done numerically. However, we can evaluate them in closed form fairly easily for the specific case of no rotational motion (in the mean of the prior only). To do this, we define the (constant) generalized angular velocity to be of the form

$$
\check{\boldsymbol{\varpi}} = \begin{bmatrix} \check{\boldsymbol{\nu}} \\ \mathbf{0} \end{bmatrix}.
\tag{10.19}
$$

The mean function will be a constant, linear velocity (i.e., no angular rotation), ν̌. We then have that

$$
\check{\boldsymbol{\varpi}}^\curlywedge \check{\boldsymbol{\varpi}}^\curlywedge
= \begin{bmatrix} \mathbf{0} & \check{\boldsymbol{\nu}}^\wedge \\ \mathbf{0} & \mathbf{0} \end{bmatrix}
\begin{bmatrix} \mathbf{0} & \check{\boldsymbol{\nu}}^\wedge \\ \mathbf{0} & \mathbf{0} \end{bmatrix} = \mathbf{0},
\tag{10.20}
$$

so that

$$
\exp\left((t-s)\check{\boldsymbol{\varpi}}^\curlywedge\right) = \mathbf{1} + (t-s)\check{\boldsymbol{\varpi}}^\curlywedge, \quad
\mathbf{J}\left((t-s)\check{\boldsymbol{\varpi}}\right) = \mathbf{1} + \frac{1}{2}(t-s)\check{\boldsymbol{\varpi}}^\curlywedge,
\tag{10.21}
$$

with no approximation. The transition function is therefore

$$
\boldsymbol{\Phi}(t, s) = \begin{bmatrix}
\mathbf{1} + (t-s)\check{\boldsymbol{\varpi}}^\curlywedge & (t-s)\mathbf{1} + \frac{1}{2}(t-s)^2 \check{\boldsymbol{\varpi}}^\curlywedge \\
\mathbf{0} & \mathbf{1}
\end{bmatrix}.
\tag{10.22}
$$

We now turn to computing the Qk blocks starting from

$$
\mathbf{Q}_k = \int_{t_{k-1}}^{t_k} \boldsymbol{\Phi}(t_k, s)\, \mathbf{L} \mathbf{Q} \mathbf{L}^T\, \boldsymbol{\Phi}(t_k, s)^T\, ds.
\tag{10.23}
$$

Plugging in the quantities involved in the integrand we have

$$
\mathbf{Q}_k = \begin{bmatrix} \mathbf{Q}_{k,11} & \mathbf{Q}_{k,12} \\ \mathbf{Q}_{k,12}^T & \mathbf{Q}_{k,22} \end{bmatrix},
\tag{10.24a}
$$

$$
\mathbf{Q}_{k,11} = \int_0^{\Delta t_{k:k-1}} \left( (\Delta t_{k:k-1} - s)^2\, \mathbf{Q}
+ \frac{1}{2}(\Delta t_{k:k-1} - s)^3 \left( \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} + \mathbf{Q} \check{\boldsymbol{\varpi}}^{\curlywedge T} \right)
+ \frac{1}{4}(\Delta t_{k:k-1} - s)^4\, \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} \check{\boldsymbol{\varpi}}^{\curlywedge T} \right) ds,
\tag{10.24b}
$$

$$
\mathbf{Q}_{k,12} = \int_0^{\Delta t_{k:k-1}} \left( (\Delta t_{k:k-1} - s)\, \mathbf{Q} + \frac{1}{2}(\Delta t_{k:k-1} - s)^2\, \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} \right) ds,
\tag{10.24c}
$$

$$
\mathbf{Q}_{k,22} = \int_0^{\Delta t_{k:k-1}} \mathbf{Q}\, ds, \quad
\Delta t_{k:k-1} = t_k - t_{k-1}.
\tag{10.24d,e}
$$

Carrying out the integrals (of simple polynomials) we have

$$
\begin{aligned}
\mathbf{Q}_{k,11} &= \frac{1}{3}\Delta t_{k:k-1}^3\, \mathbf{Q}
+ \frac{1}{8}\Delta t_{k:k-1}^4 \left( \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} + \mathbf{Q} \check{\boldsymbol{\varpi}}^{\curlywedge T} \right)
+ \frac{1}{20}\Delta t_{k:k-1}^5\, \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} \check{\boldsymbol{\varpi}}^{\curlywedge T}, \\
\mathbf{Q}_{k,12} &= \frac{1}{2}\Delta t_{k:k-1}^2\, \mathbf{Q} + \frac{1}{6}\Delta t_{k:k-1}^3\, \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q}, \\
\mathbf{Q}_{k,22} &= \Delta t_{k:k-1}\, \mathbf{Q}.
\end{aligned}
\tag{10.25}
$$

These expressions can be used to build Qk from the known quantities, ϖ̌, Q, and Δtk:k−1.
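The closed-form blocks (10.25) can be checked against a direct numerical integration of (10.23). Below is a hedged NumPy sketch (helper names are mine) assuming the no-rotation prior (10.19), so the curly-wedge matrix is nilpotent and the simplified transition function (10.22) is exact.

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def Qk_closed_form(nu, Q, dt):
    """Closed-form Qk blocks from (10.25) for varpi = (nu, 0)."""
    ad = np.zeros((6, 6)); ad[:3, 3:] = hat(nu)   # nilpotent curly wedge
    Q11 = (dt**3 / 3) * Q + (dt**4 / 8) * (ad @ Q + Q @ ad.T) \
        + (dt**5 / 20) * ad @ Q @ ad.T
    Q12 = (dt**2 / 2) * Q + (dt**3 / 6) * ad @ Q
    Q22 = dt * Q
    return np.block([[Q11, Q12], [Q12.T, Q22]])

def Qk_numerical(nu, Q, dt, steps=2000):
    """Numerically integrate (10.23) using the transition function (10.22)."""
    ad = np.zeros((6, 6)); ad[:3, 3:] = hat(nu)
    L = np.vstack([np.zeros((6, 6)), np.eye(6)])
    out = np.zeros((12, 12))
    h = dt / steps
    for i in range(steps):
        s = (i + 0.5) * h                          # midpoint rule
        u = dt - s
        Phi = np.eye(12)
        Phi[:6, :6] = np.eye(6) + u * ad
        Phi[:6, 6:] = u * np.eye(6) + 0.5 * u**2 * ad
        out += Phi @ L @ Q @ L.T @ Phi.T * h
    return out
```

Since the integrand is a low-degree matrix polynomial, even a simple midpoint rule agrees with the closed form to high accuracy.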

10.2 Simultaneous Trajectory Estimation and Mapping


Using the motion prior from the previous section to smooth the so-
lution, we can set up a simultaneous trajectory estimation and map-
ping (STEAM) problem. STEAM is really just a variant of simulta-
neous localization and mapping (SLAM), where we have the ability to
inherently query the robot’s underlying continuous-time trajectory at
any time of interest, not just the measurement times. We show first

how to solve for the state at the measurement times. Then, we show
how to use Gaussian process interpolation to solve for the state (and
covariance) at other query times.

10.2.1 Problem Setup


The use of our continuous-time motion prior is a fairly straightforward
modification of the discrete-time approach from Section 9.2.2. The state
to be estimated is now

x = {T0 , $ 0 , . . . , TK , $ K , p1 , . . . , pM } , (10.26)

which includes the poses, the generalized angular velocity variables,


and the landmark positions. The times, t0, t1, . . . , tK correspond to the measurement times and the measurements available in our problem
are

y = {y10 , . . . yM 0 , . . . , y1K , . . . , yM K } , (10.27)

which remain the same as the discrete-time SLAM case.

10.2.2 Measurement Model

We use the same measurement model as the discrete-time SLAM case but slightly modify some of the block matrices to account for the fact that the estimated state now includes the ϖk quantities, which are not required in the measurement error terms. We continue to use the usual perturbation scheme for the landmark positions:

$$
\mathbf{p}_j = \mathbf{p}_{\text{op},j} + \mathbf{D} \boldsymbol{\zeta}_j,
\tag{10.28}
$$

where pop,j is the operating point and ζj is the perturbation.

To build the part of the objective function associated with the measurements, we define the following matrices:

$$
\delta\mathbf{x}_2 = \begin{bmatrix} \boldsymbol{\zeta}_1 \\ \boldsymbol{\zeta}_2 \\ \vdots \\ \boldsymbol{\zeta}_M \end{bmatrix}, \quad
\mathbf{e}_y(\mathbf{x}_{\text{op}}) = \begin{bmatrix} \mathbf{e}_{y,10}(\mathbf{x}_{\text{op}}) \\ \mathbf{e}_{y,20}(\mathbf{x}_{\text{op}}) \\ \vdots \\ \mathbf{e}_{y,MK}(\mathbf{x}_{\text{op}}) \end{bmatrix}, \quad
\mathbf{R} = \operatorname{diag}\left(\mathbf{R}_{10}, \mathbf{R}_{20}, \ldots, \mathbf{R}_{MK}\right),
\tag{10.29}
$$
and

$$
\mathbf{G}_1 = \operatorname{diag}\left(
\begin{bmatrix} \mathbf{G}_{1,10} \\ \vdots \\ \mathbf{G}_{1,M0} \end{bmatrix},
\begin{bmatrix} \mathbf{G}_{1,11} \\ \vdots \\ \mathbf{G}_{1,M1} \end{bmatrix},
\ldots,
\begin{bmatrix} \mathbf{G}_{1,1K} \\ \vdots \\ \mathbf{G}_{1,MK} \end{bmatrix}
\right), \quad
\mathbf{G}_2 = \begin{bmatrix}
\operatorname{diag}\left(\mathbf{G}_{2,10}, \ldots, \mathbf{G}_{2,M0}\right) \\
\operatorname{diag}\left(\mathbf{G}_{2,11}, \ldots, \mathbf{G}_{2,M1}\right) \\
\vdots \\
\operatorname{diag}\left(\mathbf{G}_{2,1K}, \ldots, \mathbf{G}_{2,MK}\right)
\end{bmatrix}.
\tag{10.30}
$$

The detailed blocks are

$$
\mathbf{G}_{1,jk} = \begin{bmatrix} \mathbf{S}_{jk} \left(\mathbf{T}_{\text{op},k} \mathbf{p}_{\text{op},j}\right)^\odot & \mathbf{0} \end{bmatrix}, \quad
\mathbf{G}_{2,jk} = \mathbf{S}_{jk} \mathbf{T}_{\text{op},k} \mathbf{D}, \quad
\mathbf{e}_{y,jk}(\mathbf{x}_{\text{op}}) = \mathbf{y}_{jk} - \mathbf{s}\left(\mathbf{T}_{\text{op},k} \mathbf{p}_{\text{op},j}\right),
$$

where we see that the only change from the SLAM case is that the G1,jk matrix has a padding 0 to account for the fact that the ψk perturbation variable (associated with ϖk) is not involved in the observation of landmark j from pose k. The part of the objective function associated with the measurements is then approximately

$$
J_y(\mathbf{x}) \approx \frac{1}{2} \left(\mathbf{e}_y(\mathbf{x}_{\text{op}}) - \mathbf{G}_1 \delta\mathbf{x}_1 - \mathbf{G}_2 \delta\mathbf{x}_2\right)^T \mathbf{R}^{-1} \left(\mathbf{e}_y(\mathbf{x}_{\text{op}}) - \mathbf{G}_1 \delta\mathbf{x}_1 - \mathbf{G}_2 \delta\mathbf{x}_2\right),
\tag{10.31}
$$

which is again quadratic in the perturbation variables, δx1 and δx2.

10.2.3 Batch Maximum a Posteriori Solution

With both the motion prior and the measurement terms in hand, we can write the full MAP objective function as

$$
J(\mathbf{x}) = J_v(\mathbf{x}) + J_y(\mathbf{x}) \approx J(\mathbf{x}_{\text{op}}) - \mathbf{b}^T \delta\mathbf{x} + \frac{1}{2} \delta\mathbf{x}^T \mathbf{A}\, \delta\mathbf{x},
\tag{10.32}
$$

with

$$
\mathbf{A} = \mathbf{H}^T \mathbf{W}^{-1} \mathbf{H}, \quad \mathbf{b} = \mathbf{H}^T \mathbf{W}^{-1} \mathbf{e}(\mathbf{x}_{\text{op}}),
\tag{10.33}
$$


and

$$
\delta\mathbf{x} = \begin{bmatrix} \delta\mathbf{x}_1 \\ \delta\mathbf{x}_2 \end{bmatrix}, \quad
\mathbf{H} = \begin{bmatrix} \mathbf{F}^{-1} & \mathbf{0} \\ \mathbf{G}_1 & \mathbf{G}_2 \end{bmatrix}, \quad
\mathbf{W} = \begin{bmatrix} \mathbf{Q} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{bmatrix}, \quad
\mathbf{e}(\mathbf{x}_{\text{op}}) = \begin{bmatrix} \mathbf{e}_v(\mathbf{x}_{\text{op}}) \\ \mathbf{e}_y(\mathbf{x}_{\text{op}}) \end{bmatrix}.
\tag{10.34}
$$

The minimizing perturbation, $\delta\mathbf{x}^\star$, is the solution to

$$
\mathbf{A}\, \delta\mathbf{x}^\star = \mathbf{b}.
\tag{10.35}
$$

As usual, we solve for $\delta\mathbf{x}^\star$, then apply the optimal perturbations using the appropriate schemes,

$$
\mathbf{T}_{\text{op},k} \leftarrow \exp\left(\boldsymbol{\epsilon}_k^{\star\wedge}\right) \mathbf{T}_{\text{op},k}, \quad
\boldsymbol{\varpi}_{\text{op},k} \leftarrow \boldsymbol{\varpi}_{\text{op},k} + \boldsymbol{\psi}_k^\star, \quad
\mathbf{p}_{\text{op},j} \leftarrow \mathbf{p}_{\text{op},j} + \mathbf{D} \boldsymbol{\zeta}_j^\star,
\tag{10.36}
$$

and iterate to convergence. Similarly to the SLAM case, once converged we set $\hat{\mathbf{T}}_{v_k i} = \mathbf{T}_{\text{op},k}$, $\hat{\boldsymbol{\varpi}}_i^{v_k v_k} = \boldsymbol{\varpi}_{\text{op},k}$, and $\hat{\mathbf{p}}_i^{p_j i} = \mathbf{p}_{\text{op},j}$ at the last iteration as the final estimates for the vehicle poses, generalized angular velocity, and landmark positions of interest at the measurement times.

10.2.4 Exploiting Sparsity

Introducing the continuous-time motion priors does not destroy the nice sparsity of the discrete-time SLAM problem. We can see this by noting that

$$
\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{12}^T & \mathbf{A}_{22} \end{bmatrix}
= \mathbf{H}^T \mathbf{W}^{-1} \mathbf{H}
= \begin{bmatrix}
\mathbf{F}^{-T} \mathbf{Q}^{-1} \mathbf{F}^{-1} + \mathbf{G}_1^T \mathbf{R}^{-1} \mathbf{G}_1 & \mathbf{G}_1^T \mathbf{R}^{-1} \mathbf{G}_2 \\
\mathbf{G}_2^T \mathbf{R}^{-1} \mathbf{G}_1 & \mathbf{G}_2^T \mathbf{R}^{-1} \mathbf{G}_2
\end{bmatrix}.
\tag{10.37}
$$

Compared to the SLAM problem, blocks A12 and A22 have not changed at all, showing that A is still an arrowhead matrix with A22 block-diagonal. We can thus exploit this sparsity to solve for the perturbations at each iteration efficiently using either the Schur or Cholesky methods. The block A11 looks very similar to the discrete-time SLAM case,

$$
\mathbf{A}_{11} = \underbrace{\mathbf{F}^{-T} \mathbf{Q}^{-1} \mathbf{F}^{-1}}_{\text{prior}} + \underbrace{\mathbf{G}_1^T \mathbf{R}^{-1} \mathbf{G}_1}_{\text{measurements}},
\tag{10.38}
$$

but recall that the G1 matrix is slightly different due to the fact that we are estimating both pose and generalized angular velocity at each measurement time. Nevertheless, A11 is still block-tridiagonal. Thus, we could choose to exploit the sparsity of A11 rather than A22, if the number of poses were large compared to the number of landmarks,
for example. In this case, the Cholesky method is preferred over the Schur one as we do not need to construct A11⁻¹, which is actually dense. Yan et al. (2014) explain how to use the sparse-GP method within the incremental approach of Kaess et al. (2008, 2012).

[Figure 10.1: It is possible to use our continuous-time estimation framework to interpolate (for the mean and the covariance) between the main estimation times (i.e., measurement times).]

10.2.5 Interpolation

After we have solved for the state at the measurement times, we can also now use our Gaussian process framework to interpolate for the state at one or more query times. The situation is depicted in Figure 10.1, where our goal is to interpolate the posterior pose (and generalized velocity) at query time, τ.

Because we have deliberately estimated a Markovian state for our chosen SDE defining the prior, {Tk, ϖk}, we know that to interpolate at time τ, we need only consider the two measurement times on either side. Without loss of generality, we assume

$$
t_k \leq \tau < t_{k+1}.
\tag{10.39}
$$

The difference between the posterior and the prior at times τ, tk, and tk+1 we write as

$$
\boldsymbol{\gamma}_\tau = \begin{bmatrix} \ln\left(\hat{\mathbf{T}}_\tau \check{\mathbf{T}}_\tau^{-1}\right)^\vee \\ \hat{\boldsymbol{\varpi}}_\tau - \check{\boldsymbol{\varpi}} \end{bmatrix}, \quad
\boldsymbol{\gamma}_k = \begin{bmatrix} \ln\left(\hat{\mathbf{T}}_k \check{\mathbf{T}}_k^{-1}\right)^\vee \\ \hat{\boldsymbol{\varpi}}_k - \check{\boldsymbol{\varpi}} \end{bmatrix}, \quad
\boldsymbol{\gamma}_{k+1} = \begin{bmatrix} \ln\left(\hat{\mathbf{T}}_{k+1} \check{\mathbf{T}}_{k+1}^{-1}\right)^\vee \\ \hat{\boldsymbol{\varpi}}_{k+1} - \check{\boldsymbol{\varpi}} \end{bmatrix},
\tag{10.40}
$$

where we note that the posterior values at the two measurement times come from the operating point values at the last iteration of the main MAP solution: {T̂k, ϖ̂k} = {Top,k, ϖop,k}.

Using these definitions, we can go ahead and carry out state interpolation (for the mean) on se(3) × R⁶ using the approach from Section 3.4:

$$
\boldsymbol{\gamma}_\tau = \boldsymbol{\Lambda}(\tau)\, \boldsymbol{\gamma}_k + \boldsymbol{\Psi}(\tau)\, \boldsymbol{\gamma}_{k+1},
\tag{10.41}
$$


where

$$
\boldsymbol{\Lambda}(\tau) = \boldsymbol{\Phi}(\tau, t_k) - \mathbf{Q}_\tau \boldsymbol{\Phi}(t_{k+1}, \tau)^T \mathbf{Q}_{k+1}^{-1} \boldsymbol{\Phi}(t_{k+1}, t_k), \quad
\boldsymbol{\Psi}(\tau) = \mathbf{Q}_\tau \boldsymbol{\Phi}(t_{k+1}, \tau)^T \mathbf{Q}_{k+1}^{-1}.
\tag{10.42}
$$

We have all the required pieces to build these two matrices except Qτ, which is given by

$$
\mathbf{Q}_\tau = \int_{t_k}^{\tau} \boldsymbol{\Phi}(\tau, s)\, \mathbf{L} \mathbf{Q} \mathbf{L}^T\, \boldsymbol{\Phi}(\tau, s)^T\, ds.
\tag{10.43}
$$

We can either integrate this numerically, or adopt the linear-motion simplification,

$$
\check{\boldsymbol{\varpi}} = \begin{bmatrix} \check{\boldsymbol{\nu}} \\ \mathbf{0} \end{bmatrix},
\tag{10.44}
$$

whereupon

$$
\mathbf{Q}_\tau = \begin{bmatrix} \mathbf{Q}_{\tau,11} & \mathbf{Q}_{\tau,12} \\ \mathbf{Q}_{\tau,12}^T & \mathbf{Q}_{\tau,22} \end{bmatrix},
$$

$$
\begin{aligned}
\mathbf{Q}_{\tau,11} &= \frac{1}{3}\Delta t_{\tau:k}^3\, \mathbf{Q}
+ \frac{1}{8}\Delta t_{\tau:k}^4 \left( \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} + \mathbf{Q} \check{\boldsymbol{\varpi}}^{\curlywedge T} \right)
+ \frac{1}{20}\Delta t_{\tau:k}^5\, \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q} \check{\boldsymbol{\varpi}}^{\curlywedge T}, \\
\mathbf{Q}_{\tau,12} &= \frac{1}{2}\Delta t_{\tau:k}^2\, \mathbf{Q} + \frac{1}{6}\Delta t_{\tau:k}^3\, \check{\boldsymbol{\varpi}}^\curlywedge \mathbf{Q}, \\
\mathbf{Q}_{\tau,22} &= \Delta t_{\tau:k}\, \mathbf{Q}, \quad \Delta t_{\tau:k} = \tau - t_k.
\end{aligned}
\tag{10.45}
$$

Plugging this in we can evaluate

$$
\boldsymbol{\gamma}_\tau = \begin{bmatrix} \boldsymbol{\xi}_\tau \\ \boldsymbol{\eta}_\tau \end{bmatrix},
\tag{10.46}
$$

and then compute the interpolated posterior on SE(3) × R⁶ as

$$
\hat{\mathbf{T}}_\tau = \exp\left(\boldsymbol{\xi}_\tau^\wedge\right) \check{\mathbf{T}}_\tau, \quad
\hat{\boldsymbol{\varpi}}_\tau = \check{\boldsymbol{\varpi}} + \boldsymbol{\eta}_\tau.
\tag{10.47}
$$
The cost of this trajectory query is O(1) since it only involves two
measurement times in the interpolation equation. We can repeat this
as many times as we like for different values of τ . A similar approach
can be used to interpolate the covariance at the query time using the
methods from Section 3.4.
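The interpolation equations (10.41)-(10.45) can be sketched directly in code. The following hedged NumPy example (helper names are mine) implements the simplified, no-rotation case and illustrates the expected boundary behaviour: at τ = tk the interpolant returns γk, and at τ = tk+1 it returns γk+1.

```python
import numpy as np

def hat(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def curly_lin(nu):
    """Nilpotent curly wedge for varpi = (nu, 0)."""
    ad = np.zeros((6, 6)); ad[:3, 3:] = hat(nu)
    return ad

def Phi(dt, ad):
    """Simplified transition function (10.22)."""
    out = np.eye(12)
    out[:6, :6] += dt * ad
    out[:6, 6:] = dt * np.eye(6) + 0.5 * dt**2 * ad
    return out

def Q_blocks(dt, ad, Q):
    """Closed-form covariance blocks (10.45)."""
    Q11 = (dt**3/3)*Q + (dt**4/8)*(ad @ Q + Q @ ad.T) + (dt**5/20)*ad @ Q @ ad.T
    Q12 = (dt**2/2)*Q + (dt**3/6)*ad @ Q
    return np.block([[Q11, Q12], [Q12.T, dt*Q]])

def interpolate(tau, tk, tk1, gamma_k, gamma_k1, nu, Q):
    """Posterior mean interpolation (10.41)-(10.42)."""
    ad = curly_lin(nu)
    Q_tau = Q_blocks(tau - tk, ad, Q)
    Q_k1 = Q_blocks(tk1 - tk, ad, Q)
    Qi = np.linalg.inv(Q_k1)
    Lam = Phi(tau - tk, ad) - Q_tau @ Phi(tk1 - tau, ad).T @ Qi @ Phi(tk1 - tk, ad)
    Psi = Q_tau @ Phi(tk1 - tau, ad).T @ Qi
    return Lam @ gamma_k + Psi @ gamma_k1
```

Each query touches only the two bracketing measurement times, which is what makes the trajectory query O(1) as claimed above.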

10.2.6 Postscript
It is worth pointing out that while our underlying approach in this
chapter considers a trajectory that is continuous in time, we are still

discretizing it in order to carry out the batch MAP solution at the mea-
surement times and also the interpolation at the query times. The point
is that we have a principled way to query the trajectory at any time
of interest, not just the measurement times. Moreover, the interpola-
tion scheme is chosen up front and provides the abilities to (i) smooth
the solution based on a physically motivated prior, and (ii) carry out
interpolation at any time of interest.
It is also worth noting that the Gaussian process approach taken in
this chapter is quite different from the interpolation approach taken in
Section 9.1.5. There we forced the motion between measurement times
to have constant body-centric generalized velocity: it was a constraint-
based interpolation method. Here we are defining a whole distribution
of possible trajectories and encouraging the solution to find one that
balances the prior likelihood with the likelihood of the measurements:
this is a penalty-term approach. Both approaches have their merits.
Finally, it is worth making a point about the estimation philoso-
phy used in this chapter. We have claimed that we presented a MAP
method. However, in the nonlinear chapter, the MAP approach always
linearized the motion model about the current MAP estimate. On the
surface, it appears that we have done something slightly different in this
chapter: to separate the desired nonlinear SDE into the nominal and
perturbation components, we essentially linearized about the mean of
the prior. We then built an error term for the motion prior and lin-
earized that about the current MAP estimate. However, the other way
to look at it is that we simply replaced the desired SDE with a slightly
different one that was easier to work with and then applied the MAP
approach for SE(3). This is not the only way to apply the Gaussian
process, continuous-time approach to estimation on SE(3), but we hope
it provides one useful example; Anderson and Barfoot (2015) provide
an alternative.
References

Anderson, S, and Barfoot, T D. 2015 (28 September - 2 October). Full STEAM


Ahead: Exactly Sparse Gaussian Process Regression for Batch Continuous-Time
Trajectory Estimation on SE(3). In: Proceedings of the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS).
Anderson, S, Barfoot, T D, Tong, C H, and Särkkä, S. 2015. Batch Nonlinear
Continuous-Time Trajectory Estimation as Exactly Sparse Gaussian Process Re-
gression. Autonomous Robots, special issue on “Robotics Science and Systems”,
39(3), 221–238.
Arun, K S, Huang, T S, and Blostein, S D. 1987. Least-Squares Fitting of Two 3D
Point Sets. IEEE Transactions on Pattern Analysis and Machine Intelligence,
9(5), 698–700.
Bailey, T, and Durrant-Whyte, H. 2006. Simultaneous Localisation and Mapping
(SLAM): Part II State of the Art. IEEE Robotics and Automation Magazine,
13(3), 108–117.
Barfoot, T D, Forbes, J R, and Furgale, P T. 2011. Pose Estimation using Linearized
Rotations and Quaternion Algebra. Acta Astronautica, 68(1-2), 101–112.
Barfoot, T D, Tong, C H, and Särkkä, S. 2014 (12-16 July). Batch Continuous-
Time Trajectory Estimation as Exactly Sparse Gaussian Process Regression. In:
Proceedings of Robotics: Science and Systems (RSS).
Barfoot, Timothy D, and Furgale, Paul T. 2014. Associating Uncertainty with
Three-Dimensional Poses for use in Estimation Problems. IEEE Transactions on
Robotics, 30(3), 679–693.
Bayes, Thomas. 1764. Essay towards solving a problem in the doctrine of chances.
Philosophical Transactions of the Royal Society of London.
Besl, P J, and McKay, N D. 1992. A Method for Registration of 3-D Shapes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(2), 239–256.
Bierman, G J. 1974. Sequential Square Root Filtering and Smoothing of Discrete
Linear Systems. Automatica, 10(2), 147–158.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Secaucus,
NJ, USA: Springer-Verlag New York, Inc.
Box, M J. 1971. Bias in Nonlinear Estimation. Journal of the Royal Statistical
Society, Series B, 33(2), 171–201.
Brookshire, J, and Teller, S. 2012 (July). Extrinsic Calibration from Per-Sensor
Egomotion. In: Proceedings of Robotics: Science and Systems.
Brown, D C. 1958. A Solution to the General Problem of Multiple Station Analytical
Stereotriangulation. RCA-MTP Data Reduction Technical Report No. 43 (or
AFMTC TR 58-8). Patrick Airforce Base, Florida.
Bryson, A E. 1975. Applied Optimal Control: Optimization, Estimation and Control. Taylor and Francis.
Chen, C S, Hung, Y P, and Cheng, J B. 1999. RANSAC-based DARCES: A new
approach to fast automatic registration of partially overlapping range images.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(11), 1229–1234.
Chirikjian, G S. 2009. Stochastic Models, Information Theory, and Lie Groups:
Classical Results and Geometric Methods. Vol. 1-2. New York: Birkhauser.
Corke, P. 2011. Robotics, Vision, and Control. Springer Tracts in Advanced Robotics
73. Springer.
Davenport, P B. 1965. A Vector Approach to the Algebra of Rotations with Appli-
cations. Tech. rept. X-546-65-437. NASA.
de Ruiter, A H J, and Forbes, J R. 2014. On the Solution of Wahba’s Problem on
SO(n). Journal of the Astronautical Sciences.
D’Eleuterio, G M T. 1985 (June). Multibody Dynamics for Space Station Manip-
ulators: Recursive Dynamics of Topological Chains. Tech. rept. SS-3. Dynacon
Enterprises Ltd.
Devlin, Keith. 2008. The Unfinished Game: Pascal, Fermat, and the Seventeenth-
Century Letter that Made the World Modern. Basic Book.
Dudek, G, and Jenkin, M. 2010. Computational Principles of Mobile Robotics. Cambridge University Press.
Durrant-Whyte, H, and Bailey, T. 2006. Simultaneous Localisation and Mapping
(SLAM): Part I The Essential Algorithms. IEEE Robotics and Automation Mag-
azine, 11(3), 99–110.
Dyce, M. 2012. Canada between the photograph and the map: Aerial photography,
geographical vision and the state. Journal of Historical Geography, In press.
Fischler, M, and Bolles, R. 1981. Random sample consensus: a paradigm for model
fitting with applications to image analysis and automated cartography. Commu-
nications of ACM, 24(6), 381–395.
Furgale, P T. 2011. Extensions to the Visual Odometry Pipeline for the Exploration
of Planetary Surfaces. Ph.D. thesis, University of Toronto.
Furgale, P T, Tong, C H, Barfoot, T D, and Sibley, G. 2015. Continuous-Time Batch
Trajectory Estimation Using Temporal Basis Functions. International Journal of
Robotics Research.
Green, B F. 1952. The Orthogonal Approximation of an Oblique Structure in Factor
Analysis. Psychometrika, 17(4), 429–440.
Hartley, R, and Zisserman, A. 2000. Multiple View Geometry in Computer Vision.
Cambridge University Press.
Hertzberg, C, Wagner, R, Frese, U, and Schröder, L. 2013. Integrating generic
sensor fusion algorithms with sound state representations through encapsulation
of manifolds. Information Fusion, 14(1), 57 – 77.
Horn, B K P. 1987a. Closed-Form Solution of Absolute Orientation using Orthonor-
mal Matrices. Journal of the Optical Society of America A, 5(7), 1127–1135.
Horn, B K P. 1987b. Closed-Form Solution of Absolute Orientation using Unit
Quaternions. Journal of the Optical Society of America A, 4(4), 629–642.
Hughes, Peter C. 1986. Spacecraft Attitude Dynamics. Dover.
Jazwinski, A H. 1970. Stochastic Processes and Filtering Theory. Academic, New
York.
Julier, S, and Uhlmann, J. 1996. A General Method for Approximating Nonlin-
ear Transformations of Probability Distributions. Tech. rept. Robotics Research
Group, University of Oxford.
Kaess, M, Ranganathan, A, and Dellaert, F. 2008. iSAM: Incremental Smoothing and Mapping. IEEE Transactions on Robotics, 24(6), 1365–1378.
Kaess, M, Johannsson, H, Roberts, R, Ila, V, Leonard, J J, and Dellaert, F. 2012. iSAM2: Incremental Smoothing and Mapping Using the Bayes Tree. International Journal of Robotics Research, 31(2), 217–236.
Kalman, R E. 1960a. Contributions to the Theory of Optimal Control. Boletin de la Sociedad Matematica Mexicana, 5, 102–119.
Kalman, R E. 1960b. A New Approach to Linear Filtering and Prediction Problems.
Trans. ASME, Journal of Basic Engineering, 82, 35–45.
Kelly, Alonzo. 2013. Mobile Robotics: Mathematics, Models, and Methods.
Cambridge University Press.
Index

adjoint, 219, 220
affine transformation, 197
algebra, 211
Apianus, Petrus, xv
arrowhead matrix, 337, 338
axiom of total probability, 9

BA, see bundle adjustment
Baker, Henry Frederick, 223
Baker-Campbell-Hausdorff, 223, 224, 226, 228, 238, 240, 265, 273, 318, 323, 352
Bayes filter, xv, 3, 66, 89, 95–100, 102, 105, 113, 114, 125, 140
Bayes' rule, 3, 10, 33, 38, 48, 92, 96
Bayes, Thomas, 11
Bayesian, 9
Bayesian inference, 11, 24, 37, 42, 44, 66–68, 89, 90, 133, 135, 141, 144
BCH, see Baker-Campbell-Hausdorff
belief function, 95
Bernoulli numbers, 224
Bernoulli, Jakob, 224, 235
Bessel's correction, 12
Bessel, Friedrich Wilhelm, 12
best linear unbiased estimate, 68
biased, 101, 137
BLUE, see best linear unbiased estimate
bundle adjustment, 329, 340, 343, 344, 346, 347

camera, 193
Campbell, John Edward, 223
Cauchy product, 234
Cauchy, Baron Augustin-Louis, 234
causal, 56
Cayley-Hamilton theorem, 47
Cayley-Rodrigues parameters, 176
Cholesky decomposition, 50–53, 85, 88, 108, 109, 117, 118, 122, 127, 268, 280, 326, 327, 337, 339, 346, 347, 357
Cholesky smoother, 51
Cholesky, André-Louis, 50
consistent, 69, 149
continuous time, xvi, 4, 9, 31, 32, 35, 72, 86, 89, 94, 141, 145, 312, 313, 349, 350, 354, 355, 357–360
covariance matrix, 11
Cramér, Harold, 14
Cramér-Rao lower bound, 14, 31, 68, 70, 116
CRLB, see Cramér-Rao lower bound
cross product, 169, 170
cubic Hermite polynomial, 84
curvature, 190

DARCES, see data-aligned rigidity-constrained exhaustive search
data association, 148, 156
data-aligned rigidity-constrained exhaustive search, 158
Dirac, Paul Adrien Maurice, 32
directional derivative, 238
discrete time, 27, 31, 35, 49, 56, 72, 78, 85, 86, 94, 95, 125, 141, 145, 147, 268, 311–314, 317, 349, 355, 357
disparity, 202
dot product, 169, 171

early estimation milestones, 3
EKF, see extended Kalman filter
epipolar constraint, 197
epipolar line, 197
essential matrix (of computer vision), 195
estimate, 36
estimation, see state estimation
Euler parameters, see unit-length quaternions
Euler's theorem, 174
Euler, Leonhard, 172
exponential map, 213
extended Kalman filter, 68, 89, 98, 99, 101, 103–105, 107, 113, 116, 119, 120, 122–125, 132, 140, 141, 147, 289, 311, 313–316
exteroceptive, 3
extrinsic sensor parameters, 193

factor graph, 347
Faulhaber's formula, 235
Faulhaber, Johann, 235
filter, 57
Fisher, Sir Ronald Aylmer, 14
fixed-interval smoother, 41, 49
focal length, 194
Frenet, Jean Frédéric, 190
Frenet-Serret frame, 190, 192, 206
frequentist, 9
Frobenius norm, 271, 283
frontal projection model, 193
fundamental matrix (of computer vision), 197
fundamental matrix (of control theory), 143, 255

Gauss, Carl Friedrich, 2, 3
Gauss-Newton optimization, 127–132, 136, 137, 140, 241–243, 246, 274, 310, 311, 318–321, 325, 334, 335, 337
Gaussian estimator, 48, 61, 105
Gaussian filter, 113
Gaussian inference, 19, 100
Gaussian noise, 1, 2, 68, 86, 90, 98, 99, 148, 149, 152, 313, 331
Gaussian probability density function, 9, 13–15, 18–20, 22, 24, 26, 28–31, 33, 58, 61, 91, 97–100, 102, 103, 105–111, 113, 117, 118, 122, 124, 134, 144, 260, 280, 322
Gaussian process, xvi, 4, 9, 31, 32, 72–75, 79, 83, 85, 141, 143–146, 349, 350, 355, 358
Gaussian random variable, 9, 15, 20, 24, 35, 258, 262
generalized mass matrix, 310
generalized velocity, 250
Gibbs vector, 176
Gibbs, Josiah Willard, 176
global positioning system, 4, 156–158
GP, see Gaussian process
GPS, see global positioning system
group, 210

Hamilton, Sir William Rowan, 175
Hausdorff, Felix, 223
Heaviside step function, 77
Heaviside, Oliver, 176
Hermite basis function, 85
Hermite, Charles, 84
homogeneous coordinates, 187, 237, 276
homography matrix, 199

ICP, see iterative closest point
identity matrix, 169
IEKF, see iterated extended Kalman filter
improper rotation, 210
IMU, see inertial measurement unit
inconsistent, 101
inertial measurement unit, 203, 205, 207
information form, 54, 55, 65
information matrix, 51
information vector, 48
injection, 259
injective, 20
inner product, see dot product
interoceptive, 3
interpolation matrix, 343
intrinsic parameter matrix, 196
inverse covariance form, see information form
ISPKF, see iterated sigmapoint Kalman filter
Isserlis' theorem, 15, 280
Isserlis, Leon, 15
Itō calculus, 75
Itō, Kiyoshi, 75
iterated extended Kalman filter, 103–105, 107, 122–125, 134, 135, 140, 144, 146
iterated sigmapoint Kalman filter, 121, 123–125
iterative closest point, 289, 290

Jacobi's formula, 216
Jacobi, Gustav Jacob, 212
Jacobian, 218, 225, 227, 240
John Harrison, 2
joint probability density function, 10

Kálmán, Rudolf Emil, 2
Kōwa, Seki, 224
Kalman filter, xv, 2, 3, 35, 56, 61, 66–68, 70, 150, 151, 156
Kalman gain, 65
kernel matrix, 73, 78
KF, see Kalman filter
kinematics, 178, 191, 192, 246, 247, 249, 250, 252, 253, 255–257, 266, 312, 313, 315
kurtosis, 12, 113

law of large numbers, 106
Levenberg-Marquardt, 130, 246
LG, see linear-Gaussian
Lie algebra, 211
Lie derivative, 240
Lie group, see matrix Lie group
Lie product formula, 224
Lie, Marius Sophus, 209
lifted form, 37, 42, 76
line search, 130, 246
linear time-invariant, 81, 351
linear time-varying, 75, 79, 142, 255, 257
linear, time-varying, 35
linear-Gaussian, 36, 37, 41, 42, 44, 57, 59, 62, 71, 96, 156
Lovelace, Ada, 224
LTI, see linear time-invariant
LTV, see linear time-varying

M-estimation, 160
Möbius, Augustus Ferdinand, 187
Mahalanobis, 273
Mahalanobis distance, 28, 39
Mahalanobis, Prasanta Chandra, 28
MAP, see maximum a posteriori
marginalization, 11
Markov property, 62, 95
matrix inversion lemma, see Sherman-Morrison-Woodbury
matrix Lie group, 209, 210
maximum a posteriori, 37, 38, 61, 62, 67, 86, 87, 89, 92, 93, 104, 105, 123–126, 135, 136, 146, 148, 313, 314, 317, 318, 344, 356, 360
maximum likelihood, 135, 136, 147, 148, 322, 323, 331, 334, 343
mean, 11, 14
mean rotation, 261
ML, see maximum likelihood
Monte Carlo, 98, 106, 270, 271, 281
Moore-Penrose pseudoinverse, see pseudoinverse
mutual information, 13, 29

NASA, see National Aeronautics and Space Administration
National Aeronautics and Space Administration, 3, 98
Newton's method, 127
NLNG, see nonlinear, non-Gaussian
non-commutative group, 176, 182, 209
nonlinear, non-Gaussian, 89, 94, 95
normalized image coordinates, 194

observability, 2, 47, 153
observability matrix, 47
onto, 214
optical axis, 193
outlier, 148, 158, 159

particle filter, 89, 113–116
PDF, see probability density function
point-cloud alignment, 289
point-clouds, 289
Poisson's equation, 180
Poisson, Siméon Denis, 180
pose-graph relaxation, 321
poses, 167, 186, 210, 212, 216, 226, 230, 236, 242, 249, 256, 262, 265, 272
posterior, 11, 36
power spectral density matrix, 32, 75
prior, 11, 36
probability, 9
probability density function, 9–15, 19, 22, 24–26, 28–30, 33, 34, 93, 95–97, 99, 102, 105–111, 113, 114, 146, 260, 267, 274
probability distributions, 10
proper rotation, 210
pseudoinverse, 41

quaternion, 290

RAE, see range-azimuth-elevation
random sample consensus, 159, 162, 163, 289, 297
random variable, 9
range-azimuth-elevation, 202, 203
RANSAC, see random sample consensus
Rao, Calyampudi Radhakrishna, 14
Rauch, Herbert E., 53
Rauch-Tung-Striebel smoother, 3, 49, 53, 56
realization, 12, 14, 36
reference frame, 168
robust cost, 160
robust estimation, 160
rotary reflection, 210
rotation matrix, 170, see also rotations
rotations, 167, 209, 214, 224, 229, 232, 233, 238, 246, 252, 258
RTS, see Rauch-Tung-Striebel

sample covariance, 12
sample mean, 12
Schmidt, Stanley F., 98
Schur complement, 18, 63, 337, 338, 340, 346, 347, 357, 358
Schur, Issai, 18
SDE, see stochastic differential equation
Serret, Joseph Alfred, 190
Shannon information, 13, 28, 29, 33
Shannon, Claude Elwood, 13
Sherman-Morrison-Woodbury, 23, 43, 53–55, 73, 121, 134, 135, 145
sigmapoint, 108, 113, 118
sigmapoint Kalman filter, 89, 116, 119, 120, 122–125
sigmapoint transformation, 108, 111, 112, 117, 118, 146, 263, 264, 267, 271, 276, 280
simultaneous localization and mapping, 334, 344, 347, 355–357
simultaneous trajectory estimation and mapping, 354
singular-value decomposition, 297
skewness, 12, 113
SLAM, see simultaneous localization and mapping
sliding-window filter, 140
smoother, 57
SMW, see Sherman-Morrison-Woodbury
SP, see sigmapoint
sparse bundle adjustment, 337
special Euclidean group, see also poses, 210
special orthogonal group, see also rotations, 209
SPKF, see sigmapoint Kalman filter
state, 1, 36
state estimation, 1, 4, 36
state transition matrix, 255
statistical moments, 11
statistically independent, 10, 12, 20, 30
STEAM, see simultaneous trajectory estimation and mapping
stereo baseline, 200
stereo camera, 90
stochastic differential equation, 142, 257, 350, 352, 358, 360
Striebel, Charlotte T., 53
surjective, 214
SWF, see sliding-window filter
Sylvester's determinant theorem, 30
Sylvester, James Joseph, 30

tangent space, 212
taxonomy of filtering methods, 125
torsion, 190
transformation matrix, 187, see also poses
transition function, 75
transition matrix, 36, 76
Tung, Frank F., 53

UKF, see sigmapoint Kalman filter
unbiased, 14, 69, 149
uncertainty ellipsoid, 29
uncorrelated, 12, 20
unimodular, 230
unit-length quaternions, 174
unscented Kalman filter, see sigmapoint Kalman filter

variance, 14
vector, 168
vectrix, 168

white noise, 32