Chapter 2
Introduction to State Estimation

Reading
1. Barfoot, Chapter 2.1-2.2
2. Thrun, Chapter 2
3. Russell, Chapter 15.1-15.3

2.1 A review of probability
Probability is a very useful construct to reason about real systems which we cannot model at all scales. It is a fundamental part of robotics. No matter how sophisticated your camera, it will have noise in how it measures the real world around it. No matter how good your model for a motor is, there will be unmodeled effects which make it move a little differently than how you expect. We begin with a quick review of probability; you can read more at many sources, e.g., MIT's OCW.
An experiment is a procedure which can be repeated infinitely and has a well-defined set of possible outcomes, e.g., the toss of a coin or the roll of dice. The outcome itself need not be deterministic, e.g., depending upon your experiment, the coin may come up heads or tails. We call the set Ω the sample space; it is the set of all possible outcomes of an experiment. For two coins, this set would be
Ω = {HH, HT, TH, TT}.
We want to pick this set at the right granularity to answer relevant questions, e.g., it is correct but not very useful for Ω to be the position of all the molecules in the coin. After every experiment, in this case tossing the two coins once each, we obtain an event, which is a subset A ⊂ Ω of the sample space, e.g.,
A = {HH} .
Probability theory is a mathematical framework that allows us to reason about phenomena or experiments whose outcome is uncertain. The probability of an event,
P(A),
is a function that maps each event A to a number between 0 and 1: the closer this number is to 1, the stronger our belief that the outcome of the experiment is going to be A.
Axioms   Probability is formalized using a set of three basic axioms that are intuitive and yet very powerful. They are known as Kolmogorov's axioms:
• Non-negativity: P(A) ≥ 0
• Normalization: P(Ω) = 1
• Additivity: If two events A, B are such that A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
You can use these axioms to say things like P(∅) = 0, P(Aᶜ) = 1 − P(A), or that if A ⊆ B then P(A) ≤ P(B).
Conditioning on events   Conditioning helps us answer questions like
P(A | B) := probability of A given that B occurred.
Effectively, the sample space has now shrunk from Ω to the event B. It would be silly to have a null sample space, so let us say that P(B) ≠ 0. We define conditional probability as
P(A | B) = P(A ∩ B) / P(B);     (2.1)
the probability is undefined if P(B) = 0. Using this definition, we can answer questions like "what is the probability of rolling a 2 on a die given that an even number was rolled?"
We can use this trick to get the law of total probability: if a finite number of events {A_i} form a partition of the sample space Ω, i.e.,
A_i ∩ A_j = ∅ for all i ≠ j, and ∪_i A_i = Ω,
then
P(B) = Σ_i P(B | A_i) P(A_i).     (2.2)
Bayes' rule   Imagine that instead of someone telling us that the conditioning event actually happened, we simply had a belief
P(A_i)
about the possibility of such events {A_i}. For each A_i, we can compute the conditional probability P(B | A_i) using (2.1). Say we run our experiment and observe that B occurred; how would our belief about the events A_i change? In other words, we wish to compute
P(A_i | B).
This is the subject of Bayes' rule:
P(A_i | B) = P(A_i ∩ B) / P(B)
           = P(A_i) P(B | A_i) / P(B)     (2.3)
           = P(A_i) P(B | A_i) / Σ_j P(A_j) P(B | A_j).
Bayes' rule naturally leads to the concept of independent events. Two events A, B ⊂ Ω are independent if observing one does not give us any information about the other:
P(A ∩ B) = P(A) P(B).     (2.4)
This is different from disjoint events. Disjoint events never co-occur, i.e., observing one tells us that the other one did not occur.
Probability for experiments with real-valued outcomes   We need some more work to define probability for experiments with real-valued outcomes. The sample space is easy enough to understand, e.g., Ω = [0, 1] for your score at the end of this course. We however run into difficulties if we define the probability of general subsets of Ω in terms of the probabilities of elementary outcomes (elements of Ω). For instance, if we wish to model all elements ω ∈ Ω as equally likely, we are forced to assign each element ω a probability of zero (to be consistent with the second axiom of probability). This is not very helpful in determining the probability of, say, the score being 0.9. If you instead assigned some small non-zero number to P(ω_i), then we have undesirable conclusions such as
P({1, 1/2, 1/3, . . .}) = ∞.
The way to fix this is to avoid defining the probability of a set in terms of the probability of elementary outcomes and to work with more general sets. While we would ideally like to be able to specify the probability of every subset of Ω, it turns out that we cannot do so in a mathematically consistent way. The trick then is to work with a smaller object known as a σ-algebra, that is, the set of "nice" subsets of Ω.
Given a sample space Ω, a σ-algebra F (also called a σ-field) is a collection of subsets of Ω such that
• ∅ ∈ F,
• if A ∈ F, then Aᶜ ∈ F,
• if A_i ∈ F for every i ∈ N, then ∪_{i=1}^{∞} A_i ∈ F.
In short, a σ-algebra is a collection of subsets of Ω that is closed under complements and countable unions. The pair (Ω, F), also called a measurable space, is now used to define the probability of events. A set A that belongs to F is called an event. The probability measure
P : F → [0, 1]
assigns a probability to events in F. We cannot take F to be too small, e.g., F = {∅, Ω} makes it easy to construct P but is not very useful. For technical reasons, the σ-algebra also cannot be too large; this is why we avoid considering every subset of the sample space, i.e., F = 2^Ω. Modern probability is defined using a Borel σ-algebra. Roughly speaking, this is an F that is just large enough to do interesting things but small enough that mathematical technicalities do not occur.
2.1.1 Random variables
A random variable is an assignment of a value to every possible outcome. Mathematically, in our new language of a measurable space, a random variable is a function
X : Ω → R
such that the set {ω : X(ω) ≤ c} is F-measurable for every number c ∈ R. This is equivalent to saying that the preimage of every set in the Borel σ-algebra on the reals B(R) is in F. A statement X(ω) = x = 5 means that the outcome of our experiment happens to be an ω ∈ Ω for which the realized value of the random variable is a particular number x equal to 5. Random variables are typically denoted using capital letters X, Y, Z, although we will be sloppy and not always do so in this course to avoid complicated notation; the distinction between a random variable and the value that it takes will be clear from context.
We can now define functions of random variables, e.g., if X is a random variable, the function Y = X³(ω) for every ω ∈ Ω, or Y = X³ for short, is a new random variable. An indicator random variable is special. If A ⊂ Ω, let I_A : Ω → {0, 1} be the indicator function of this set A, i.e., I_A(ω) = 1 if ω ∈ A and zero otherwise. If our set A ∈ F, then I_A is an indicator random variable.
? Let us check that Y = X³ satisfies our definition of a random variable: the set {ω : Y(ω) ≤ c} equals {ω : X(ω) ≤ c^{1/3}}, which lies in F.
Probability mass functions   The probability law, or probability distribution, of a random variable X is denoted by
p_X(x) := P(X = x) = P({ω ∈ Ω : X(ω) = x}).
We denote the probability distribution using a lower-case p. It is a function of the realized value x in the range of the random variable, with p_X(x) ≥ 0 (the probability is non-negative) and Σ_x p_X(x) = 1 if X takes on a discrete number of values. (The function I_A is not a random variable if A ∉ F, but this is, as we said in the previous section, a mathematical corner case; most subsets of Ω belong to F.) For instance, if X is the number of coin tosses until the first head, and we assume that our tosses are independent with P(H) = p > 0, then we have
p_X(k) = P(X = k) = P(TT···TH) = (1 − p)^{k−1} p
for all k = 1, 2, . . .. This is what is called a geometric probability mass function.
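As a quick sanity check (not part of the original notes), here is a small Python snippet that evaluates this probability mass function; the helper name geometric_pmf and the truncation length are our own choices.

    import numpy as np

    def geometric_pmf(k, p):
        # P(X = k) = (1 - p)^(k - 1) * p: the first head appears on toss k.
        return (1.0 - p) ** (k - 1) * p

    p = 0.3
    ks = np.arange(1, 200)              # truncate the infinite support
    pmf = geometric_pmf(ks, p)
    print(pmf[:3])                      # P(X = 1), P(X = 2), P(X = 3)
    print(pmf.sum())                    # close to 1 (truncation error only)
    print((ks * pmf).sum())             # close to the mean 1/p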
Cumulative distribution function   A cumulative distribution function (CDF) is the probability of a random variable X taking a value less than or equal to a particular x ∈ R, i.e.,
F_X(x) = P(X ≤ x).
The CDF F_X(x) is a non-decreasing function of x. It converges to zero as x → −∞ and goes to 1 as x → ∞. (Margin figure: the CDF of a geometric random variable for different values of p.) Note that CDFs need not be continuous: in the case of a geometric random variable, since the values that X takes are integers, the CDF is constant between any two integers.

Probability density functions   A continuous random variable, i.e., one that takes values in R, is described by a probability density function. If F_X(x) is the CDF of a random variable X that takes values in R, the probability density function (PDF) f_X(x) (sometimes also denoted by p_X(x)) is defined so that
P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx.
We also have the following relationship between the CDF and the PDF, the former being the integral of the latter:
P(−∞ ≤ X ≤ x) = F_X(x) = ∫_{−∞}^{x} f_X(x′) dx′.
This leads to the following interpretation of the probability density function:
P(x ≤ X ≤ x + δ) ≈ f_X(x) δ.
Expectation and Variance   The expected value of a random variable X is
E[X] = Σ_x x p_X(x)
and denotes the center of gravity of the probability mass function. Roughly speaking, it is the average of a large number of repetitions of the same experiment. Expectation is linear, i.e.,
E[aX + b] = a E[X] + b
for any constants a, b. For two independent random variables X, Y we have
E[XY] = E[X] E[Y].
We can also compute the expected value of any function g(X) using the same formula:
E[g(X)] = Σ_x g(x) p_X(x).
In particular, if g(x) = x² we have the second moment E[X²]. The variance is defined to be
Var(X) = E[(X − E[X])²] = Σ_x (x − E[X])² p_X(x) = E[X²] − (E[X])².
The variance is always non-negative, Var(X) ≥ 0. For an affine function of X, we have
Var(aX + b) = a² Var(X).
For continuous-valued random variables, the expectation is defined as
E[X] = ∫_{−∞}^{∞} x p_X(x) dx;
the definition of variance remains the same.
Joint distributions   We often wish to think of the joint probability distribution of multiple random variables, say the location of an autonomous car in all three dimensions. The cumulative distribution function associated with this is
F_{X,Y,Z}(x, y, z) = P(X ≤ x, Y ≤ y, Z ≤ z).
Just like we have the probability density of a single random variable, we can also write the joint probability density of multiple random variables, f_{X,Y,Z}(x, y, z). In this case we have
F_{X,Y,Z}(x, y, z) = ∫_{−∞}^{x} ∫_{−∞}^{y} ∫_{−∞}^{z} f_{X,Y,Z}(x′, y′, z′) dz′ dy′ dx′.
The joint probability density factorizes if two random variables are independent:
f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
Two random variables are uncorrelated if and only if
E[XY] = E[X] E[Y].
Note that independence implies uncorrelatedness, but the two are not equivalent. The covariance is defined as
Cov(X, Y) = E[XY] − E[X] E[Y].
Conditioning   As we saw before, for a single random variable X we have
P(x ≤ X ≤ x + δ) ≈ f_X(x) δ.
For two random variables, by analogy we would like
P(x ≤ X ≤ x + δ | Y ≈ y) ≈ f_{X|Y}(x | y) δ.
The conditional probability density function of X given Y is defined to be
f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y)   if f_Y(y) > 0.
For any given y, the conditional PDF is a normalized section of the joint PDF.
Continuous form of Bayes rule   We can show using the definition of conditional probability that
f_{Y|X}(y | x) = f_{X|Y}(x | y) f_Y(y) / f_X(x).     (2.5)
Similarly, we also have the law of total probability in continuous form:
f_X(x) = ∫_{−∞}^{∞} f_{X|Y}(x | y) f_Y(y) dy.
2.2 Using Bayes rule for combining evidence
We now study a prototypical state estimation problem. Let us consider a robot that is trying to check whether the door to a room is open or not.
We will abstract each observation by the sensors of the robot as a random variable Y. This could be the image from its camera after running some algorithm to check the state of the door, the reading from a laser sensor (if the time-of-flight of the laser is very large then the door is open), or any other mechanism. We have two kinds of conditional probabilities in this problem:
P(open | Y) is a diagnostic quantity, while
P(Y | open) is a causal quantity.
The second one is called a causal quantity because the specific Y we observe depends upon whether the door is open or not. The first one is called a diagnostic quantity because using this observation Y we can infer the state of the environment, i.e., whether the door is open or not. Next imagine how you would calibrate the sensor in a lab: for each value of the state of the door (open, not open) you would record all the different observations Y received and calculate the conditional probabilities. The causal probability is much easier to calculate in this context; one may even use some knowledge of elementary physics to model the probability P(Y | open), or one may count the number of times the observation is Y = y for a given state of the door.
Bayes rule allows us to transform causal knowledge into diagnostic knowledge:
P(open | Y) = P(Y | open) P(open) / P(Y).
Remember that the left-hand side (diagnostic) is typically something that we desire to calculate. Let us put some numbers in this formula. Let P(Y | open) = 0.6 and P(Y | not open) = 0.3. We will imagine that the door is open or closed with equal probability: P(open) = P(not open) = 0.5. We then have
P(open | Y) = P(Y | open) P(open) / P(Y)
            = P(Y | open) P(open) / [P(Y | open) P(open) + P(Y | not open) P(not open)]
            = (0.6 × 0.5) / (0.6 × 0.5 + 0.3 × 0.5) = 2/3.
Notice something very important: the original (prior) probability of the door being open was 0.5. If we have a sensor whose observation is more likely when the door is open, i.e., if
P(Y | open) / P(Y | not open) > 1,
then the probability of the door being open increases after receiving an observation. If this likelihood ratio were less than 1, then observing a realization of Y would reduce our estimate of the probability of the door being open.
Combining evidence for Markov observations   Say we updated the prior probability using our first observation Y_1, and we now take another observation Y_2. How can we integrate this new observation? It is again an application of Bayes rule using two observations, or in general multiple observations Y_1, . . . , Y_n. (The denominator in Bayes rule, i.e., P(Y), is called the evidence in statistics.) Let us imagine this time that X = open. Then
P(X | Y_1, . . . , Y_n) = P(Y_n | X, Y_1, . . . , Y_{n−1}) P(X | Y_1, . . . , Y_{n−1}) / P(Y_n | Y_1, . . . , Y_{n−1}).
Let us make the very natural assumption that our observations from the sensor Y_1, . . . , Y_n are independent given the state of the door X. This is known as the Markov assumption. We now have
P(X | Y_1, . . . , Y_n) = P(Y_n | X) P(X | Y_1, . . . , Y_{n−1}) / P(Y_n | Y_1, . . . , Y_{n−1})
                       = η P(Y_n | X) P(X | Y_1, . . . , Y_{n−1}),
where
η^{−1} = P(Y_n | Y_1, . . . , Y_{n−1})
is the denominator. We can now expand the diagnostic probability on the right-hand side recursively to get
P(X | Y_1, . . . , Y_n) = [ ∏_{i=1}^{n} η_i P(Y_i | X) ] P(X),     (2.6)
where η_i^{−1} = P(Y_i | Y_1, . . . , Y_{i−1}).
The calculation in (2.6) is very neat and you should always
remember it. Given multiple observations Y1 , . . . , Yn of the same
quantity X, we can compute the conditional probability P(X |
Y1 , . . . , Yn ) if we code up two functions to compute
• the causal probability (also called the likelihood of an observation) P(Y_i | X), and
• the denominator η_i^{−1}.
Given these two functions, we can use the recursion to update multiple
observations. The same basic idea also holds if you have two quantities
to estimate, e.g., X1 = open door and X2 = color of the door. The
recursive application of Bayes rule lies at the heart of all state
estimation methods.
Let us again put some numbers into these formulae. Imagine that the observation Y_2 was taken using a different sensor which has
P(Y_2 | open) = 0.5 and P(Y_2 | not open) = 0.6.
We have from our previous calculation that P(open | Y_1) = 2/3, and
P(open | Y_1, Y_2) = P(Y_2 | open) P(open | Y_1) / [P(Y_2 | open) P(open | Y_1) + P(Y_2 | not open) P(not open | Y_1)]
                   = (0.5 × 2/3) / (0.5 × 2/3 + 0.6 × 1/3) = 5/8 = 0.625.
Notice that in this case the probability that the door is open has decreased from P(open | Y_1) = 2/3.
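The two calculations above are just repeated applications of the recursion in (2.6); here is a minimal Python sketch (our own, not part of the notes) that reproduces the numbers 2/3 and 0.625. The helper name update is hypothetical.

    # belief = [P(open), P(not open)]; likelihood = [P(Y | open), P(Y | not open)]
    def update(belief, likelihood):
        unnormalized = [b * l for b, l in zip(belief, likelihood)]
        evidence = sum(unnormalized)          # this is eta^{-1}
        return [u / evidence for u in unnormalized]

    belief = [0.5, 0.5]                       # prior: open or not with equal probability
    belief = update(belief, [0.6, 0.3])       # first sensor  -> [2/3, 1/3]
    print(belief)
    belief = update(belief, [0.5, 0.6])       # second sensor -> [0.625, 0.375]
    print(belief)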
2.2.1 Coherence of Bayes rule
Would the probability change if we used sensor Y_2 before using Y_1? In this case, the answer to this question is no and you are encouraged to perform this computation for yourselves. Bayes rule is coherent: it will give the same result regardless of the order of observations.
? Can you think of a situation where the order of incorporating observations matters?
The order of incorporating observations matters if the state of the world changes while we make observations, e.g., if we have a sensor that tracks the location of a car, the car presumably moves in between two observations and we would get the wrong answer to the question "is there a car at this location?".
As we motivated in the previous chapter, movement is quite fundamental to robotics and we are typically concerned with estimating the state of a dynamic world around us using our observations. We will next study the concept of a Markov chain, which is a mathematical abstraction for the evolution of the state of the world.
2.3 Markov Chains
Consider the Whack-The-Mole game: a mole has burrowed a network of three holes x_1, x_2, x_3 into the ground. It keeps going in and out of the holes and we are interested in finding which hole it will show up at next so that we can give it a nice whack.
This is an example of a Markov chain. There is a transition matrix T which determines the probability T_{ij} of the mole resurfacing at hole x_j given that it resurfaced at hole x_i the last time. The matrix T^k is the k-step transition matrix:
(T^k)_{ij} = P(X_k = x_j | X_0 = x_i).
You can see the animations at https://setosa.io/ev/markov-chains to build more intuition.
The key property of a Markov chain is that the next state X_{k+1} is independent of all the past states X_1, . . . , X_{k−1} given the current state X_k:
X_{k+1} ⊥⊥ X_1, . . . , X_{k−1} | X_k.
This is known as the Markov property, and all systems where we can define a "state" which governs their evolution have this property. Markov chains form a very broad class of systems; for example, all of Newtonian physics fits this assumption.
? Does a deterministic dynamical system, e.g., a simple pendulum, also satisfy the Markov assumption? What is the transition matrix in this case?
? Can you think of a system which does not have the Markov property? What is the state of the following systems?
Consider the paramecium above. Its position depends upon a large number of factors: its own motion from the previous time-step but also the viscosity of the material in which it is floating around. One may model the environment around the paramecium as a liquid whose molecules hit it thousands of times a second, essentially randomly, and cause disturbances in how the paramecium moves. Let us call this disturbance "noise in the dynamics". If the motion of the molecules of the liquid has some correlations across time (does it, usually?), this induces correlations in the position of the paramecium, and the position of the organism is no longer Markov. This example is important to remember: the Markov property defined above also implies that the noise driving the state transitions is independent across time.
Evolution of a Markov chain   The probability of being in a state x_i at time k + 1 can be written as
P(X_{k+1} = x_i) = Σ_{j=1}^{N} P(X_{k+1} = x_i | X_k = x_j) P(X_k = x_j).
This equation governs how the probabilities P(X_k = x_i) change with time k. Let us do the calculations for the Whack-The-Mole example. Say the mole was at hole x_1 at the beginning, so the probability distribution of its presence
π^(k) = [P(X_k = x_1), P(X_k = x_2), P(X_k = x_3)]^⊤
is such that
π^(1) = [1, 0, 0]^⊤.
We can now write the above formula as
π^(k+1) = T′ π^(k),     (2.7)
where we denote the transpose of the matrix T by T′ (Matlab notation) instead of T^⊤ for clarity, and compute the distribution π^(k) for all times:
π^(2) = T′ π^(1) = [0.1, 0.4, 0.5]^⊤
π^(3) = T′ π^(2) = [0.17, 0.34, 0.49]^⊤
π^(4) = T′ π^(3) = [0.153, 0.362, 0.485]^⊤
...
π^(∞) = lim_{k→∞} (T′)^k π^(1) = [0.158, 0.355, 0.487]^⊤.
The numbers P(X_k = x_i) stop changing with time k. Under certain technical conditions (a single communicating class for a Markov chain with a finite number of states), the distribution π^(∞) is unique. We can compute this invariant distribution by writing
π^(∞) = T′ π^(∞).
We can also compute the distribution π^(∞) directly: the invariant distribution is the right-eigenvector of the matrix T′ corresponding to the eigenvalue 1.
? Do we always know that the transition matrix has an eigenvalue that is 1?
Example 2.1. Consider a Markov chain on two states where the transition matrix is given by
T = [ 0.5  0.5
      0.4  0.6 ].
The invariant distribution π = [π_1, π_2]^⊤ satisfies
π_1 = 0.5 π_1 + 0.4 π_2
π_2 = 0.5 π_1 + 0.6 π_2.
These two equations are linearly dependent, so we also impose the constraint that π is a probability distribution, i.e., π_1 + π_2 = 1. Solving for π_1, π_2 gives
π_1 = 4/9,   π_2 = 5/9.
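As a numerical check of Example 2.1 (our own sketch, assuming numpy), the invariant distribution can be found either as the eigenvector of T′ with eigenvalue 1 or by simply iterating (2.7):

    import numpy as np

    T = np.array([[0.5, 0.5],
                  [0.4, 0.6]])                # row-stochastic transition matrix

    # Right-eigenvector of T' corresponding to the eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(T.T)
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    print(v / v.sum())                        # [4/9, 5/9] after normalization

    # The same answer by propagating pi^{(k+1)} = T' pi^{(k)} many times.
    pi = np.array([1.0, 0.0])
    for _ in range(100):
        pi = T.T @ pi
    print(pi)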
2.4 Hidden Markov Models (HMMs)
Markov chains are a good model for how the state of the world evolves with time. We may not always know the exact state of these systems and only have sensors, e.g., cameras, LiDARs, and radars, to record observations. These sensors are typically noisy, so we model the observations as random variables.
Hidden Markov Models (HMMs) are an abstraction to reason about observations of the state of a Markov chain. An HMM is a sequence of random variables Y_1, Y_2, . . . , Y_n such that the distribution of Y_k depends only upon the hidden state X_k of the associated Markov chain.
(Parts of this section closely follow Emilio Frazzoli's course notes at https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec20.pdf and https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-410-principles-of-autonomy-and-decision-making-fall-2010/lecture-notes/MIT16_410F10_lec21.pdf.)
Figure 2.1: A Hidden Markov Model with the underlying Markov chain; the observation at time k only depends upon the hidden state at that time instant. Ignore the notation Z_1, . . . , Z_t; we will denote the observations by Y_k.
Notice that an HMM always has an underlying Markov chain behind it. For example, if we model the position of a car X_k as a Markov chain, our observation of the position at time k would be Y_k. In our example of the robot sensing whether the door is open or closed using multiple observations across time, the Markov chain is trivial: its transition matrix is P(not open | not open) = P(open | open) = 1. Just like Markov chains, HMMs are a very general class of mathematical models that allow us to think about multiple observations across time of a Markov chain.
Let us imagine that the observations of our HMM are also finite in number, e.g., your score in this course ∈ [0, 100], where the associated state of the Markov chain is your expertise in the subject matter. We will write a matrix of observation probabilities
M_{ij} = P(Y_k = y_j | X_k = x_i).     (2.8)
The matrix M has non-negative entries; after all, each entry is a probability. Since each state has to result in some observation, we also have
Σ_j M_{ij} = 1.
The state transition probabilities of the associated Markov chain are
T_{ij} = P(X_{k+1} = x_j | X_k = x_i).
Given the abstraction of an HMM, we may be interested in solving a number of problems. We will consider the problem where the state X_k is the position of a car (which could be stationary or moving) and observations Y_k give us some estimate of the position.
1. Filtering: Given observations up to time k, compute the distribution of the state at time k,
P(X_k | Y_1, . . . , Y_k).
This is the most natural problem to understand: we want to find the probability of the car being at a location at time k given all previous observations. This is a temporally causal prediction, i.e., we are not using any information from the future to reason about the present.
2. Smoothing: Given observations up to time k, compute the distribution of the state at any time j < k,
P(X_j | Y_1, . . . , Y_k) for j < k.
The observations at future times Y_{j+1}, . . . , Y_k give us some indication of where the car might have been at time j. In this case we are interested in using the entire set of observations, from the past Y_1, . . . , Y_j and from the future Y_{j+1}, . . . , Y_k, to estimate the position of the car. Of course, this problem can only be solved ex post facto, i.e., after all observations up to time k have been collected. An important thing to remember is that in smoothing we are interested in the position of the car for all j < k.
3. Prediction: Given observations up to time k, compute the distribution of the state at a time j > k,
P(X_j | Y_1, . . . , Y_k) for j > k.
This is the case when we wish to make predictions about the state of the car at a time j > k given only observations until time k. If we knew the underlying Markov chain for the HMM and its transition matrix T, this would amount to running (2.7) forward using the output of the filtering problem as the initial distribution of the state.
? Why is this true?
4. Decoding: Find the most likely state trajectory X_1, . . . , X_k, i.e., the one that maximizes the probability
P(X_1, . . . , X_k | Y_1, . . . , Y_k)
given observations Y_1, . . . , Y_k. Observe that the smoothing problem is essentially solved independently for each time-step j < k. It stands to reason that if we knew a certain state (say, the car made a right turn) was likely given observations at time k + 1, and that the traffic light was green at time k (given our observations of the traffic light), then we know that the car did not stop at the intersection at time k. The decoding problem allows us to reason about the joint probability of the states and outputs the most likely trajectory given all observations.
5. Likelihood of observations: Given the observation trajectory Y_1, . . . , Y_k, compute the probability
P(Y_1, . . . , Y_k).
As you may recall, this is the denominator that we need for the recursive application of Bayes rule. It is made difficult by the fact that we do not know the state trajectory X_1, . . . , X_k corresponding to these observations.
These problems are closely related to each other and we will next dig deeper into them. We will first discuss two building blocks, called the forward and backward algorithms, that together help solve all the above problems.
2.4.1 The forward algorithm
Consider the problem of computing the likelihood of observations. We can certainly write
P(Y_1, . . . , Y_k) = Σ_{all (x_1,...,x_k)} P(Y_1, . . . , Y_k | X_1 = x_1, . . . , X_k = x_k) P(X_1 = x_1, . . . , X_k = x_k)
                  = Σ_{all (x_1,...,x_k)} [ ∏_{i=1}^{k} P(Y_i = y_i | X_i = x_i) ] [ P(X_1 = x_1) ∏_{i=2}^{k} P(X_i = x_i | X_{i−1} = x_{i−1}) ]
                  = Σ_{all (x_1,...,x_k)} M_{x_1 y_1} M_{x_2 y_2} · · · M_{x_k y_k} π_{x_1} T_{x_1 x_2} · · · T_{x_{k−1} x_k}.
But this is a very large computation: for each possible trajectory (x_1, . . . , x_k) the states could have taken, we need to perform about 2k multiplications.
? How many possible state trajectories are there? What is the total cost of computing the likelihood of observations?
Forward algorithm   We can simplify the above computation using the Markov property of the HMM as follows. We will define a quantity known as the forward variable
α_k(x) = P(Y_1, . . . , Y_k, X_k = x),     (2.9)
where Y_1, . . . , Y_k is our observation sequence up to time k. Observe now that:
1. We can initialize
α_1(x) = π_x M_{x,y_1} for all x.
2. For each time i = 1, . . . , k − 1, for all states x, we can compute
α_{i+1}(x) = M_{x,y_{i+1}} Σ_{x′} α_i(x′) T_{x′ x}
using the law of total probability.
3. Finally, we have
P(Y_1, . . . , Y_k) = Σ_x α_k(x)
by marginalizing over the state variable X_k.
This recursion in the forward algorithm is a powerful idea and is much faster than our naive summation above.
? What is the computational complexity of the forward algorithm?
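Here is a minimal numpy sketch of the forward recursion (our own implementation, not from the notes). It assumes T is the transition matrix, M the observation matrix, pi the initial distribution, and ys a list of 0-indexed observation symbols.

    import numpy as np

    def forward(pi, T, M, ys):
        # alpha[k] holds alpha_{k+1}(x) = P(Y_1, ..., Y_{k+1}, X_{k+1} = x), 0-indexed k.
        alpha = np.zeros((len(ys), len(pi)))
        alpha[0] = pi * M[:, ys[0]]                        # alpha_1(x) = pi_x M_{x,y_1}
        for k in range(1, len(ys)):
            alpha[k] = M[:, ys[k]] * (T.T @ alpha[k - 1])  # law of total probability
        return alpha

    # Likelihood of the observations: P(Y_1, ..., Y_k) = forward(pi, T, M, ys)[-1].sum()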
2.4.2 The backward algorithm
Just like the forward algorithm performs the computation recursively in the forward direction, we can also perform a backward recursion to obtain the probability of the observations. Let us imagine that we have an observation trajectory
Y_1, . . . , Y_t
up to some time t. We first define the so-called backward variables, which are the probability of a future observation trajectory given the state of the Markov chain at a particular time instant:
β_k(x) = P(Y_{k+1}, Y_{k+2}, . . . , Y_t | X_k = x).     (2.10)
Notice that the backward variables β_k, with the conditioning on X_k = x, are slightly different from the forward variables α_k, which are the joint probability of the observation trajectory and X_k = x.
Backward algorithm   We can compute the variables β_k(x) recursively as follows.
1. Initialize
β_t(x) = 1 for all x.
This simply indicates that since we are at the end of the trajectory, the future trajectory Y_{t+1}, . . . does not exist.
2. For all k = t − 1, t − 2, . . . , 1, for all x, update
β_k(x) = Σ_{x′} β_{k+1}(x′) T_{x x′} M_{x′ y_{k+1}}.
3. We can now compute
P(Y_1, . . . , Y_t) = Σ_x β_1(x) π_x M_{x y_1}.
? What is the computational complexity of running the backward algorithm?
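And a matching sketch of the backward recursion, under the same assumptions as the forward sketch above:

    import numpy as np

    def backward(T, M, ys):
        # beta[k] holds beta_{k+1}(x) = P(Y_{k+2}, ..., Y_t | X_{k+1} = x), 0-indexed k.
        n = T.shape[0]
        beta = np.zeros((len(ys), n))
        beta[-1] = 1.0                                     # beta_t(x) = 1 for all x
        for k in range(len(ys) - 2, -1, -1):
            beta[k] = T @ (M[:, ys[k + 1]] * beta[k + 1])  # sum over the next state
        return beta

    # The same likelihood: P(Y_1, ..., Y_t) = (pi * M[:, ys[0]] * backward(T, M, ys)[0]).sum()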
Implementing the forward and backward algorithms in practice   The update equations for both α_k and β_k can be written using matrix-vector multiplications. We maintain the vectors
α_k := [α_k(x_1), α_k(x_2), . . . , α_k(x_N)]
β_k := [β_k(x_1), β_k(x_2), . . . , β_k(x_N)]
and can write the updates as
α_{k+1}^⊤ = M_{·,y_{k+1}}^⊤ ⊙ (α_k^⊤ T),
where ⊙ denotes the element-wise product and M_{·,y_{k+1}} is the y_{k+1}-th column of the matrix M. The update equation for the backward variables is
β_k = T (β_{k+1} ⊙ M_{·,y_{k+1}}).
You must be careful about implementing these recursions directly, however: because we are iteratively multiplying by the matrices T, M whose entries are all smaller than 1 (they are all probabilities after all), the variables α_k, β_k can quickly become too small for some states and we get numerical underflow. You can implement these algorithms in log-space by writing similar update equations for log α_k and log β_k to avoid such numerical issues.
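For concreteness, here is one way the forward update can be carried out in log-space (an illustrative sketch; it assumes all entries of pi, T, M are strictly positive so their logarithms are finite).

    import numpy as np

    def log_forward(log_pi, log_T, log_M, ys):
        # Same recursion as forward(), but in log-space so long sequences do not underflow.
        log_alpha = log_pi + log_M[:, ys[0]]
        for k in range(1, len(ys)):
            # log sum_{x'} exp(log_alpha(x') + log_T(x', x)), computed stably
            s = log_alpha[:, None] + log_T
            m = s.max(axis=0)
            log_alpha = log_M[:, ys[k]] + m + np.log(np.exp(s - m).sum(axis=0))
        return log_alpha                   # log P(Y_1, ..., Y_t, X_t = x)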
2.4.3 Bayes filter
Let us now use the forward and backward algorithms to solve the filtering problem. We want to compute
P(X_k = x | Y_1, . . . , Y_k)
for all states x in the Markov chain. We have that
P(X_k = x | Y_1, . . . , Y_k) = P(X_k = x, Y_1, . . . , Y_k) / P(Y_1, . . . , Y_k) = η α_k(x),     (2.11)
where, since P(X_k = x | Y_1, . . . , Y_k) is a legitimate probability distribution on x, we have
η = ( Σ_x α_k(x) )^{−1}.
As simple as that. In order to estimate the state at time k, we run the forward algorithm to update the variables α_i(x) for i = 1, . . . , k. We can implement this using the matrix-vector multiplication in the previous section.
This is a commonly used algorithm known as the Bayes filter and is our first insight into state estimation.
An important fact   Even if the filtering estimate is computed recursively using each observation as it arrives, the estimate is actually the probability of the current state given all past observations:
P(X_k = x | Y_1, . . . , Y_k) ≠ P(X_k = x | Y_k).
This is an extremely important concept to remember: in state estimation we are always interested in computing the state given all available observations.
? In the same context, is the following statement true?
P(X_k = x | Y_1, . . . , Y_k) = P(X_k = x | Y_k, X_{k−1})
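A minimal sketch of the Bayes filter written as a recursive predict-update loop (our own formulation): normalizing at every step returns exactly η α_k(x) and, as a side benefit, avoids the underflow discussed earlier.

    import numpy as np

    def bayes_filter(pi, T, M, ys):
        # belief[k] = P(X_{k+1} = x | Y_1, ..., Y_{k+1}) for 0-indexed k.
        belief = np.zeros((len(ys), len(pi)))
        b = pi * M[:, ys[0]]
        belief[0] = b / b.sum()
        for k in range(1, len(ys)):
            predicted = T.T @ belief[k - 1]        # propagate the Markov chain
            b = M[:, ys[k]] * predicted            # weight by the new observation
            belief[k] = b / b.sum()                # normalize (the eta in (2.11))
        return belief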
2.4.4 Smoothing
Given observations up to time t, we would like to compute
P(X_k = x | Y_1, . . . , Y_t)
for all time instants k = 1, . . . , t. Observe that
P(X_k = x | Y_1, . . . , Y_t) = P(X_k = x, Y_1, . . . , Y_t) / P(Y_1, . . . , Y_t)
  = P(X_k = x, Y_1, . . . , Y_k, Y_{k+1}, . . . , Y_t) / P(Y_1, . . . , Y_t)
  = P(Y_{k+1}, . . . , Y_t | X_k = x, Y_1, . . . , Y_k) P(X_k = x, Y_1, . . . , Y_k) / P(Y_1, . . . , Y_t)
  = P(Y_{k+1}, . . . , Y_t | X_k = x) P(X_k = x, Y_1, . . . , Y_k) / P(Y_1, . . . , Y_t)
  = β_k(x) α_k(x) / P(Y_1, . . . , Y_t).     (2.12)
Study the first step carefully: the numerator is not equal to α_k(x) because the observations go all the way to time t. The final step uses both the Markov and the HMM properties: future observations Y_{k+1}, . . . , Y_t depend only upon future states X_{k+1}, . . . , X_t (HMM property), which are independent of the past observations and states given the current state X_k = x (Markov property).
Smoothing can therefore be implemented by running the forward algorithm to update α_k from k = 1, . . . , t and the backward algorithm to update β_k from k = t, . . . , 1.
? Both the filtering problem and the smoothing problem give us the probability of the state given observations. Discuss which one we should use in practice and why.
To see an example of smoothing in action, see ORB-SLAM 2. What do you think is the state of the Markov chain in this video?
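In code, smoothing is one line on top of the forward and backward sketches from the previous sections (again our own illustrative helpers):

    import numpy as np

    def smooth(pi, T, M, ys):
        # gamma[k] = P(X_{k+1} = x | Y_1, ..., Y_t), from (2.12).
        alpha = forward(pi, T, M, ys)              # defined in the earlier sketch
        beta = backward(T, M, ys)                  # defined in the earlier sketch
        gamma = alpha * beta
        return gamma / gamma.sum(axis=1, keepdims=True)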
Example for the Whack-the-mole problem   Let us assume that we do not see which hole the mole surfaces from (say it is dark outside) but we can hear it. Our hearing is not very precise, so we have observation probabilities
M = [ 0.6  0.2  0.2
      0.2  0.6  0.2
      0.2  0.2  0.6 ].
Assume that the mole surfaces three times and we make the measurements
Y_1 = 1, Y_2 = 3, Y_3 = 3.
We want to compute the distribution of the states the mole could be in at each time. Assume that we know that the mole was in hole 1 at the first step, i.e., π^(1) = (1, 0, 0) for the Markov chain, like we had in Section 2.3.
Run the forward-backward algorithm and see that
α_1 = (0.6, 0, 0),  α_2 = (0.012, 0.048, 0.18),  α_3 = (0.0041, 0.0226, 0.0641),
and
β_3 = (1, 1, 1),  β_2 = (0.4, 0.44, 0.36),  β_1 = (0.1512, 0.1616, 0.1392).
Using these, we can now compute the filtering and the smoothing state distributions; let us denote them by π^f and π^s respectively:
π_1^f = (1, 0, 0),  π_2^f = (0.05, 0.2, 0.75),  π_3^f = (0.045, 0.2487, 0.7063),
and
π_1^s = (0.999, 0, 0),  π_2^s = (0.0529, 0.2328, 0.7143),  π_3^s = (0.045, 0.2487, 0.7063).
? Do you notice any pattern in the solutions returned by the filtering and the smoothing problems? Explain why that is the case.
2.4.5 Prediction
We would like to compute the future probability of the state given observations up to some time t,
P(X_k = x | Y_1, . . . , Y_t) for t < k.
Here is a typical scenario where you would need this estimate. Imagine that you are tracking the position of a car using images from your camera. You are using a deep network to detect the car in each image Y_k, and since the neural network is quite slow, the car moves multiple time-steps forward before you get the next observation. As you can appreciate, it would help us compute a more accurate estimate of the conditional probability of X_k = x if we propagated the position of the car in between successive observations using our Markov chain. This is easy to do (a short code sketch follows the two steps below):
1. Compute the filtering estimate π_t^f = P(X_t = x | Y_1, . . . , Y_t) using the forward algorithm.
2. Propagate the Markov chain forward for k − t time-steps using π_t^f as the initial condition:
π_{i+1} = T′ π_i.
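A sketch of these two steps, assuming the bayes_filter helper from the earlier sketch:

    import numpy as np

    def predict(pi, T, M, ys, steps):
        # Filter with the observations we have, then roll the Markov chain
        # forward for `steps` time-steps without any new observations.
        belief = bayes_filter(pi, T, M, ys)[-1]    # pi_t^f
        for _ in range(steps):                     # k - t propagation steps
            belief = T.T @ belief                  # pi_{i+1} = T' pi_i
        return belief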
2.4.6 Decoding: Viterbi's Algorithm
Both filtering and smoothing calculate the probability distribution of the state at each time k. For instance, after recording a few observations, we can compute the probability distribution of the position of the car at each time instant. How do we get the most likely trajectory of the car? One option is to choose
X̂_k = argmax_x P(X_k = x | Y_1, . . . , Y_t)
at each instant and output
(X̂_1, . . . , X̂_t)
as the answer. This is however only the point-wise best estimate of the state. This sequence may not be the most likely trajectory of the Markov chain underlying our HMM. In the decoding problem, we are interested in computing the most likely state trajectory, not the point-wise most likely sequence of states. Let us take the example of Whack-the-mole again. We will use a slightly different Markov chain shown below.
There are three states x_1, x_2, x_3 with known initial distribution π = (1, 0, 0) and transition and observation probabilities given by matrices T, M respectively. Let us say that there are only two possible observations {y_2, y_3} this time, and we get the observation sequence
(2, 3, 3, 2, 2, 2, 3, 2, 3)
from our sensor. The filtering estimates are shown in the figure in the notes, with the most likely state at each instant marked in blue. The point-wise most likely sequence of states is
(1, 3, 3, 3, 3, 2, 3, 2, 3).
Observe that this is not even feasible for the Markov chain: the transition x_3 → x_2 is not possible, so this answer is clearly wrong. Let us look at the smoothing estimates.
The point-wise most likely states in this case are feasible:
(1, 2, 2, 2, 2, 2, 3, 3, 3).
Because the smoothing estimate at time k also takes into account the observations from the future t > k, it effectively eliminates the impossible transition x_3 → x_2. This is, however, still not the most likely trajectory.
We will exploit the Markov property again to calculate the most likely state trajectory recursively. Let us define the "decoding variables" as
δ_k(x) = max_{(x_1,...,x_{k−1})} P(X_1 = x_1, . . . , X_{k−1} = x_{k−1}, X_k = x, Y_1, . . . , Y_k);     (2.13)
this is the joint probability of the most likely state trajectory that ends at the state x at time k while generating observations Y_1, . . . , Y_k. We can now see that
δ_{k+1}(x) = max_{x′} δ_k(x′) T_{x′ x} M_{x,y_{k+1}};     (2.14)
the joint probability that the most likely trajectory ends up at state x at time k + 1 is the maximum, among all states x′ at time k, of the joint probability of the most likely trajectory ending at x′ multiplied by the one-step state transition probability T_{x′ x} and the observation probability M_{x y_{k+1}}. We would like to iterate upon this identity to find the most likely path. The key idea is to maintain a pointer to the parent state parent_k(x) of the most likely trajectory, i.e., the state from which we could have reached X_k = x given the observations. Let us see how.
Viterbi's algorithm   First initialize
δ_1(x) = π_x M_{x y_1},  parent_1(x) = null
for all states x. For all times k = 1, . . . , t − 1, for all states x, update
δ_{k+1}(x) = max_{x′} δ_k(x′) T_{x′ x} M_{x,y_{k+1}}
parent_{k+1}(x) = argmax_{x′} (δ_k(x′) T_{x′ x}).
The most likely final state is
x̂_t = argmax_{x′} δ_t(x′)
and we can now backtrack using our parent pointers to find the most likely trajectory that leads to this state:
x̂_k = parent_{k+1}(x̂_{k+1}).
The most likely trajectory given the observations is
x̂_1, x̂_2, . . . , x̂_t
and the joint probability of this trajectory and all observations is
P(X_1 = x̂_1, . . . , X_t = x̂_t, Y_1 = y_1, . . . , Y_t = y_t) = δ_t(x̂_t).
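Here is a compact Python sketch of Viterbi's algorithm (our own, with the same array conventions as the earlier sketches). In practice one would propagate log δ_k instead of δ_k, as the note below explains.

    import numpy as np

    def viterbi(pi, T, M, ys):
        t, n = len(ys), len(pi)
        delta = np.zeros((t, n))
        parent = np.zeros((t, n), dtype=int)
        delta[0] = pi * M[:, ys[0]]                   # delta_1(x) = pi_x M_{x,y_1}
        for k in range(1, t):
            scores = delta[k - 1][:, None] * T        # delta_k(x') T_{x'x}
            parent[k] = scores.argmax(axis=0)         # best predecessor of each state x
            delta[k] = scores.max(axis=0) * M[:, ys[k]]
        path = [int(delta[-1].argmax())]              # most likely final state
        for k in range(t - 1, 0, -1):                 # backtrack via parent pointers
            path.append(int(parent[k][path[-1]]))
        return path[::-1], delta[-1].max()            # trajectory and its joint probability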
This is a very widely used algorithm, both in robotics and in other areas such as speech recognition (given audio, find the most likely sentence spoken by the person), wireless transmission and reception, and DNA analysis (e.g., the state of the Markov chain is the sequence ACTG. . . and our observations are functions of these states at periodic intervals). Its name comes from Andrew Viterbi, who developed the algorithm in the late 60s; he is one of the founders of Qualcomm Inc.
Here is how Viterbi's algorithm looks for our Whack-the-mole example:
δ_1 = (0.6, 0, 0),  δ_2 = (0.012, 0.048, 0.18),  δ_3 = (0.0038, 0.0216, 0.0432)
parent_1 = (null, null, null),  parent_2 = (1, 1, 1),  parent_3 = (2, 3, 3).
The most likely path is the one that ends in state 3 with joint probability 0.0432. This path is (1, 3, 3).
Let us also compute Viterbi's algorithm for a longer observation sequence.
The most likely trajectory is
(1, 3, 3, 3, 3, 3, 3, 3, 3).
Notice that if we had only 8 observations, the most likely trajectory would be
(1, 2, 2, 2, 2, 2, 2, 2, 2).
Just like the Bayes filter, Viterbi's algorithm is typically implemented using log δ_k(x) to avoid numerical underflow. This is particularly important for Viterbi's algorithm: since δ_k(x) is the probability of an entire state and observation trajectory, it can get small very quickly for unlikely states (as we see in this example).
What is the computational complexity of Viterbi's algorithm? It is linear in the time-horizon t and quadratic in the number of states in the Markov chain. We are plucking out the most likely trajectory out of card(X)^t possible trajectories using the δ_k variables. Does this remind you of some other problem that you may have seen before?
2.4.7 Shortest path on a Trellis graph
You may have seen Dijkstra's algorithm before; it computes the shortest path to reach a node in a graph given the costs of traversing every edge.
Figure 2.2: A graph with costs assigned to every edge. Dijkstra's algorithm finds the shortest path in this graph between nodes A and B using dynamic programming.
1 most likely path. For example we can write our joint probabilities as
P(Y1 | X1 ) P(Y2 | X2 ) P(Y3 | X3 ) P(X1 ) P(X2 | X1 ) P(X3 | X2 )
P(X1 , X2 , X3 | Y1 , Y2 , Y3 ) = .
P(Y1 , Y2 , Y3 )
⇒ log P(X1 , X2 , X3 | Y1 , Y2 , Y3 ) = log P(Y1 | X1 ) + log P(Y2 | X2 ) + log P(Y3 | X3 )
+ log P(X1 ) + log P(X2 | X1 ) + log P(X3 | X2 ) − log P(Y1 , Y2 , Y3 ).
2 To find the most likely trajectory, we want to minimize − log P(X1 , X2 , X3 |
3 Y1 , Y2 , Y3 ). The term log P(Y1 , Y2 , Y3 ) does not depend on X1 , X2 , X3
4 and is a constant as far as the most likely path given observations is
5 concerned. We can now write down the “Trellis” graph as shown below.
Figure 2.3: A Trellis graph for a 3-state HMM for a sequence of three observations.
Disregard the subscript x0 .
Each edge cost is either the negative log-probability of a transition of the Markov chain, or the negative log-probability of receiving an observation given a state. We create a dummy initial node A and a dummy terminal node B. The edge costs from the final three states, in this case sunny/cloudy/rainy, to B are zero. The costs from node A to the respective states are the negative log-probabilities of the initial state distribution. Dijkstra's algorithm, which we will study in Module 2 in more detail, now gives the shortest path on the Trellis graph. This approach is the same as that of Viterbi's algorithm: our parent pointers parent_k(x) are the parent nodes in Dijkstra's algorithm, and the node costs maintained by Dijkstra's algorithm correspond to − log δ_k(x).
2.5 Learning an HMM from observations
In the previous sections, given an HMM with an initial distribution π for the Markov chain, a transition matrix T for the Markov chain, and an observation matrix M,
λ = (π, T, M),
we computed various quantities such as
P(Y_1, . . . , Y_t; λ)
for an observation sequence Y_1, . . . , Y_t of the HMM. Given an observation sequence, we can also go back and update our HMM to make this observation sequence more likely. This is the simplest instance of learning an HMM. The prototypical situation to imagine is that our original HMM λ comes from our knowledge of the problem (say a physics model of the dynamics of a robot and its sensors); given more data, namely the observations, we want to update this model. The most natural way to update the model is to maximize the likelihood of the observations given our model, i.e.,
λ* = argmax_λ P(Y_1, . . . , Y_t; λ).
This is known as maximum-likelihood estimation (MLE). In this section we will look at the Baum-Welch algorithm which solves the MLE problem iteratively. Given λ, it finds a new HMM λ′ = (π′, T′, M′) (the ′ denotes a new HMM here, not the transpose) such that
P(Y_1, . . . , Y_t; λ′) ≥ P(Y_1, . . . , Y_t; λ).
Let us consider a simple problem. We are going to imagine that the FBI is trying to catch the dangerous criminal Keyser Soze, who is known to travel between two cities: Los Angeles (LA), which will be state x_1, and New York City (NY), which will be state x_2. The FBI initially have no clue about his whereabouts, so their initial belief on his location is uniform, π = [0.5, 0.5]. His movements are modeled using a Markov chain
T = [ 0.5  0.5
      0.5  0.5 ],
e.g., if Soze is in LA, he is equally likely to stay in LA or go to NY. The FBI can make observations about him: they either observe him to be in LA (y_1), in NY (y_2), or do not observe anything at all (null, y_3). The observation matrix is
M = [ 0.4  0.1  0.5
      0.1  0.5  0.4 ].
Say that they received an observation sequence of 20 periods:
(null, LA, LA, null, NY, null, NY, NY, NY, null, NY, NY, NY, NY, NY, null, null, LA, LA, NY).
Can we say something about the probability of Soze's movements? At each time k we can compute the smoothing probability
γ_k(x) := P(X_k = x | Y_1, . . . , Y_t).
We can also compute the most likely state trajectory he could have taken given our observations using decoding. Let us focus on the smoothing probabilities γ_k(x). The point-wise most likely sequence of states turns out to be
(LA, LA, LA, LA, NY, LA, NY, NY, NY, LA, NY, NY, NY, NY, NY, LA, LA, LA, LA, NY).
Notice how smoothing fills in the missing observations above.
Expected state visitation counts   The next question we should ask is how we should update the model λ given this data. We are going to learn the entries of the state-transition matrix using
T′_{x,x′} = E[number of transitions from x to x′] / E[number of times the Markov chain was in state x].
The denominator is simply the sum of the probabilities that the Markov chain was at state x at times 1, 2, . . . , t − 1 given our observations, i.e.,
E[number of times the Markov chain was in state x] = Σ_{k=1}^{t−1} γ_k(x).
The numerator is given in a similar fashion. We will define the quantity
ξ_k(x, x′) := P(X_k = x, X_{k+1} = x′ | Y_1, . . . , Y_t) = η α_k(x) T_{x,x′} M_{x′,y_{k+1}} β_{k+1}(x′),     (2.15)
where η is a normalizing constant such that Σ_{x,x′} ξ_k(x, x′) = 1.
? Derive the expression for ξ_k(x, x′) yourself.
Observe that ξ_k is the joint probability of X_k and X_{k+1} given all the observations:
ξ_k(x, x′) = P(X_{k+1} = x′ | X_k = x, Y_1, . . . , Y_t) γ_k(x)
          ≠ P(X_{k+1} = x′ | X_k = x) P(X_k = x | Y_1, . . . , Y_t) = T_{x,x′} γ_k(x).
The expected number of transitions between states x and x′ is then
E[number of transitions from x to x′] = Σ_{k=1}^{t−1} ξ_k(x, x′).
This gives us our new state transition matrix; you will see in the homework that it comes out to be
T′ = [ 0.47023  0.52976
       0.35260  0.64739 ].
This is a much better informed FBI than the one we had before beginning the problem, when the transition matrix was all 0.5s.
The new initial distribution   What is the new initial distribution for the HMM? Recall that we are trying to compute the best HMM given the observations, so if the initial distribution was
π = P(X_1)
before receiving any observations from the HMM, it is now
π′ = P(X_1 | Y_1, . . . , Y_t) = γ_1(x),
the smoothing estimate at the first time-step.
Updating the observation matrix   We can use a similar logic as the expected state visitation counts to write
M′_{x,y} = E[number of times in state x when the observation was y] / E[number of times the Markov chain was in state x]
        = Σ_{k=1}^{t} γ_k(x) 1{y_k = y} / Σ_{k=1}^{t} γ_k(x).
You will see in your homework problem that this matrix comes out to be
M′ = [ 0.39024  0.20325  0.40650
       0.06779  0.706214 0.2259 ].
Notice how the probabilities of the null observation y_3 have gone down, because y_3 does not correspond to any actual state of the Markov chain.
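Putting the pieces together, here is an illustrative sketch of one Baum-Welch re-estimation step, reusing the forward and backward helpers from the earlier sketches (our own code with 0-indexed observation symbols, not the course's reference implementation). Iterating this step does not decrease the likelihood of the observation sequence.

    import numpy as np

    def baum_welch_step(pi, T, M, ys):
        ys = np.array(ys)
        alpha, beta = forward(pi, T, M, ys), backward(T, M, ys)
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)            # smoothing probabilities

        # xi[k, x, x'] = P(X_{k+1} = x, X_{k+2} = x' | Y_1, ..., Y_t), see (2.15)
        xi = (alpha[:-1, :, None] * T[None, :, :]
              * M[:, ys[1:]].T[:, None, :] * beta[1:, None, :])
        xi /= xi.sum(axis=(1, 2), keepdims=True)

        pi_new = gamma[0]                                     # new initial distribution
        T_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        M_new = np.zeros_like(M)
        for y in range(M.shape[1]):                           # indicator 1{y_k = y}
            M_new[:, y] = gamma[ys == y].sum(axis=0)
        M_new /= gamma.sum(axis=0)[:, None]
        return pi_new, T_new, M_new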
The ability to start with a rudimentary model of the HMM and update it using observations is quite revolutionary. Baum et al. proved that each such update does not decrease the likelihood of the observations in: Baum, Leonard E., et al. "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains." The Annals of Mathematical Statistics 41.1 (1970): 164-171. Discuss the following questions:
• When do we stop in our iterated application of the Baum-Welch algorithm?
• Are we always guaranteed to find the same HMM irrespective of our initial HMM?
• If our initial HMM λ is the same, are we guaranteed to find the same HMM λ′ across two different iterations of the Baum-Welch algorithm?
• How many observations should we use to update the HMM?