Hidden Markov Models
DHS 3.10
HMMs
Model likelihood of a sequence of observations as a
series of state transitions.
[Figure 3.11: A left-to-right HMM commonly used in speech recognition. For instance, such a model could describe the utterance “viterbi,” where ω1 represents the phoneme /v/, ω2 represents /i/, ..., and ω0 a final silent state.]
HMMs: On-line Handwriting Recognition
(example from Bishop, “Pattern Recognition and Machine Learning”)
Examples generated from the HMM
Symbols
Represented as a sequence of (x,y) locations for each pen
stroke
A Simple HMM
16 states representing a line segment of fixed length in
one of 16 angles: another ‘left-to-right’ model with a fixed
sequence of states
First-Order Markov Models
Represent Probabilistic State Transitions
“First Order:” probability of state for each time step depends only on the previous state:

P(ωj(t+1) | ωi(t)) = aij,    Σj aij = 1    (sum over all |Ω| states j)
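A minimal sketch of such a first-order model in Python (the 3-state transition table below is hypothetical, chosen only so that each row satisfies Σj aij = 1):

```python
import random

# Hypothetical transition probabilities: A[i][j] = a_ij = P(omega_j at t+1 | omega_i at t).
# Each row sums to 1, as required; the actual values are illustrative only.
A = [
    [0.2, 0.3, 0.5],   # transitions out of state 1
    [0.1, 0.6, 0.3],   # transitions out of state 2
    [0.4, 0.4, 0.2],   # transitions out of state 3
]

def sample_state_sequence(A, start_state, length):
    """Generate a state sequence; the next state depends only on the current one."""
    seq = [start_state]
    for _ in range(length - 1):
        current = seq[-1]
        seq.append(random.choices(range(len(A)), weights=A[current])[0])
    return seq

print(sample_state_sequence(A, start_state=0, length=6))
```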
Example: First-Order Markov Model
Consider State Sequence
ω^6 = {ω1, ω3, ω2, ω2, ω1, ω3}
Probability of this Sequence
Given transition probability table θ, first state known: the product of transition probabilities:

P(ω^6 | θ) = a13 a32 a22 a21 a13

[Figure 3.8: The discrete states ωi in a basic Markov model are represented by nodes, and the transition probabilities aij are represented by links. In a first-order Markov model, at any step t the full system is in a particular state ω(t); the state at step t + 1 is a random function that depends solely on the state at step t and the transition probabilities. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
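As a sketch of this product-of-transitions computation (the transition table θ below is hypothetical, not read off the figure), P(ω^6 | θ) is simply a13 · a32 · a22 · a21 · a13:

```python
# Hypothetical transition table theta: A[i][j] = a_ij, with omega_1 at index 0.
A = [
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]

def markov_sequence_prob(A, states):
    """P(state sequence | theta), assuming the first state is known (probability 1)."""
    p = 1.0
    for prev, nxt in zip(states, states[1:]):
        p *= A[prev][nxt]
    return p

omega6 = [0, 2, 1, 1, 0, 2]  # omega_1, omega_3, omega_2, omega_2, omega_1, omega_3
print(markov_sequence_prob(A, omega6))  # a13*a32*a22*a21*a13 = 0.5*0.4*0.6*0.1*0.5 = 0.006
```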
Problem
In practice, we often can’t directly observe the states of interest (e.g. in speech recognition the input is waveform features, not phonemes)
First-Order HMMs
Modification
As we can’t directly observe states, let’s model the likelihood of
observed features for each state
Restriction
We will discuss HMMs where the observations are discrete
Observations
Are now sequences of visible states:

V^6 = {v5, v1, v1, v5, v2, v3}

Probability of transition to a visible state depends on the current (hidden) state; transitions to visible states from each hidden state must sum to one:

bjk = P(vk(t) | ωj(t)),    Σk bjk = 1
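A sketch of how such a discrete-output HMM generates a visible sequence, using hypothetical transition (A) and emission (B) tables; the only constraints carried over from the slide are that each row of A and each row of B sums to 1:

```python
import random

# Hypothetical parameters: 3 hidden states, 4 visible symbols (values are illustrative only).
A = [[0.2, 0.3, 0.5],        # a_ij = P(omega_j(t+1) | omega_i(t)); rows sum to 1
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1],   # b_jk = P(v_k(t) | omega_j(t)); rows sum to 1
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.4, 0.4]]

def generate(A, B, start_state, T):
    """Walk the hidden chain and emit one visible symbol per step."""
    hidden, visible = [start_state], []
    for t in range(T):
        state = hidden[-1]
        visible.append(random.choices(range(len(B[0])), weights=B[state])[0])
        if t < T - 1:
            hidden.append(random.choices(range(len(A)), weights=A[state])[0])
    return hidden, visible

print(generate(A, B, start_state=0, T=6))
```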
Example: HMM
Consider Visible State Sequence
V^6 = {v4, v1, v1, v4, v2, v3}

bjk = P(vk(t) | ωj(t)),    Σk bjk = 1

Sequence of Hidden States
Is non-unique, even if we know that we begin in state one. However, some may be more likely than others.

[Figure 3.9: Three hidden units in an HMM and the transitions between them are shown in black, while the visible states and the emission probabilities of visible states are shown in red. This model shows all transitions as being possible; in other HMMs some such candidate transitions are not allowed. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
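To make the non-uniqueness concrete, the sketch below (hypothetical A and B tables, not the ones in the figure) enumerates every length-6 hidden path that begins in state one and scores its joint probability with the observed V^6; many paths are possible, but their probabilities differ by orders of magnitude:

```python
from itertools import product

# Hypothetical parameters: 3 hidden states, 4 visible symbols (illustrative only).
A = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
V = [3, 0, 0, 3, 1, 2]   # V^6 = {v4, v1, v1, v4, v2, v3}, 0-indexed

def joint_prob(path, V):
    """P(hidden path, visible sequence) = product of transition and emission probabilities."""
    p = B[path[0]][V[0]]
    for t in range(1, len(V)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][V[t]]
    return p

# Every length-6 hidden path that begins in state omega_1 (index 0).
paths = [(0,) + rest for rest in product(range(3), repeat=5)]
scored = sorted(((joint_prob(p, V), p) for p in paths), reverse=True)
print(scored[0])    # the most likely hidden path for this V^6
print(scored[-1])   # a far less likely one
```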
Key Problems
Evaluation
Computing the probability that a sequence of visible states was
generated by an HMM
Decoding
Determine the most likely sequence of hidden states that
produced a sequence of visible states
Learning
Given the HMM structure (number of visible and hidden
states) and a training set of visible state sequences, determine
the transition probabilities for hidden and visible states.
Evaluation
Probability of Visible Sequence V^T
Is the sum of probabilities for the sequence over all possible length-T paths through the hidden states:

P(V^T) = Σr P(V^T | ωr^T) P(ωr^T)    (sum over r = 1, ..., rmax, i.e. all possible hidden state sequences ωr^T)

where
P(ωr^T) = Π_{t=1..T} P(ω(t) | ω(t−1))
P(V^T | ωr^T) = Π_{t=1..T} P(v(t) | ω(t))

Evaluation, Cont’d
Probability of Visible Sequence (alt.)

P(V^T) = Σr Π_{t=1..T} P(v(t) | ω(t)) P(ω(t) | ω(t−1))

Interpretation
Sum over all possible hidden state sequences of: the conditional probability of each transition multiplied by the probability of the visible symbol being emitted from the current hidden state

Problem
Computation is O(c^T T); e.g. c = 10, T = 20: ~10^21 operations
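A brute-force version of this sum, reusing the hypothetical A and B tables from the earlier sketches: it enumerates all c^T hidden paths and accumulates P(V^T | path) · P(path), which is exactly the computation whose O(c^T T) cost becomes unmanageable for realistic c and T:

```python
from itertools import product

# Hypothetical parameters (illustrative only); state index 0 is the known initial state.
A = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
V = [3, 0, 0, 3, 1, 2]

def evaluate_brute_force(A, B, V, initial_state=0):
    """Sum P(V | path) * P(path) over all c**T hidden paths -- O(c^T * T)."""
    c, T = len(A), len(V)
    total = 0.0
    for path in product(range(c), repeat=T):
        if path[0] != initial_state:
            continue                      # the first hidden state is assumed known
        p = B[path[0]][V[0]]
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][V[t]]
        total += p
    return total

print(evaluate_brute_force(A, B, V))
```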
Forward Algorithm
(Much) Faster Evaluation
In O(c^2 T) time
Idea
Compute the probability recursively for each state at each time step from 1...T:

αj(t) =
  0,    if t = 0 and j ≠ initial state
  1,    if t = 0 and j = initial state
  [Σi αi(t−1) aij] bjk v(t),    otherwise    (bjk v(t): prob. of the visible symbol vk = v(t))

Return α0(T) (probability of final state ω0 at the last time step)
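A sketch of the forward recursion under the conventions used in these examples (a single known initial state, hypothetical A and B tables, and no explicit absorbing final state ω0, so the sketch sums α over all states at the last step instead of returning α0(T)); its result matches the brute-force sum above but costs only O(c^2 T):

```python
# Hypothetical parameters (illustrative only).
A = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
V = [3, 0, 0, 3, 1, 2]

def forward(A, B, V, initial_state=0):
    """alpha[j] at step t = P(visible symbols up to t, hidden state j at t)."""
    c = len(A)
    # First step: probability 1 of being in the initial state, times its emission.
    alpha = [B[j][V[0]] if j == initial_state else 0.0 for j in range(c)]
    for t in range(1, len(V)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(c)) * B[j][V[t]]
                 for j in range(c)]
    # With an absorbing final state omega_0 this would just be alpha_0(T).
    return sum(alpha)

print(forward(A, B, V))   # equals the brute-force evaluation
```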
[Figure 3.10: The computation of probabilities by the Forward algorithm can be visualized by means of a trellis, a sort of “unfolding” of the HMM through time. Suppose we seek the probability that the HMM was in state ω2 at t = 3 and generated the observed visible symbol up through that step (including the observed visible symbol vk). The probability the HMM was in state ωj(t = 2) and generated the observed sequence through t = 2 is αj(2) for j = 1, 2, ..., c. To find α2(3) we must sum these and multiply by the probability that state ω2 emitted the observed symbol vk. Formally, for this particular illustration we have α2(3) = b2k Σ_{j=1..c} αj(2) aj2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
Prob. of HMM for Observation
Use Bayes Formula
Prob. of model, given visible sequence:

P(θ | V^T) = P(V^T | θ) P(θ) / P(V^T)

Forward algorithm provides P(V^T | θ)
Other two parameters are provided by domain knowledge
• likelihood of the sequence itself, and prior probability of the model; e.g. using a language model for speech recognition
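A small sketch of that Bayes step with openly made-up numbers: assume the forward algorithm has already produced P(V^T | θ) for two candidate word models, and a hypothetical language model supplies the priors P(θ); treating the two candidates as the only hypotheses, P(V^T) is just the normalizing sum:

```python
# Hypothetical values, purely for illustration:
# forward-algorithm likelihoods P(V^T | theta) for two candidate word models,
# and priors P(theta) from an assumed language model.
likelihoods = {"viterbi": 2.4e-9, "vitamin": 7.0e-10}
priors      = {"viterbi": 0.02,   "vitamin": 0.01}

# P(theta | V^T) = P(V^T | theta) P(theta) / P(V^T); here P(V^T) normalizes over
# the two candidate models (assumed, for this toy example, to be exhaustive).
unnormalized = {w: likelihoods[w] * priors[w] for w in likelihoods}
evidence = sum(unnormalized.values())
posteriors = {w: p / evidence for w, p in unnormalized.items()}
print(posteriors)   # the model with the larger likelihood-times-prior wins
```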
[Figure 3.12: The decoding algorithm finds at each time step t the state that has the highest probability of having come from the previous step and generated the observed visible state vk. The full path is the sequence of such states. Because this is a local optimization (dependent only upon the single previous time step, not the full sequence), the algorithm does not guarantee that the path is indeed allowable. For instance, it might be possible that the maximum at t = 5 is ω1 and at t = 6 is ω2, and thus these would appear in the path. This can even occur if a12 = P(ω2(t+1) | ω1(t)) = 0, precluding that transition. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
Viterbi Algorithm
Purpose
Decoding; computes the most likely
sequence of states for an observation
sequence
• State sequence will be consistent with HMM
transition probabilities
Brief Summary
A dynamic programming algorithm that incrementally computes the most likely sequence of states from the start state to the final state
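A sketch of Viterbi decoding with the same hypothetical tables as the forward sketch: for every state at every step it keeps the probability of the best path ending there plus a back-pointer, then backtracks from the best final state:

```python
# Hypothetical parameters (illustrative only).
A = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
V = [3, 0, 0, 3, 1, 2]

def viterbi(A, B, V, initial_state=0):
    """Most likely hidden state sequence for V, via dynamic programming."""
    c, T = len(A), len(V)
    delta = [B[j][V[0]] if j == initial_state else 0.0 for j in range(c)]
    backptr = []
    for t in range(1, T):
        new_delta, ptr = [], []
        for j in range(c):
            best_i = max(range(c), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][V[t]])
        delta, backptr = new_delta, backptr + [ptr]
    # Backtrack from the best final state.
    path = [max(range(c), key=lambda j: delta[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path)), max(delta)

print(viterbi(A, B, V))   # best hidden path and its joint probability with V
```

Because whole-path scores are kept, a transition with aij = 0 zeroes out every path that uses it, unlike the purely local decoder described in Figure 3.12.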
Backward Algorithm
Computes P(V^T | θ)
Like the forward algorithm, but moving from the final state back to the initial state. The algorithm computes the recurrence:

βi(t) =
  0,    if t = T and i ≠ final state
  1,    if t = T and i = final state
  Σj βj(t+1) aij bjk v(t+1),    otherwise    (bjk v(t+1): prob. of the visible symbol vk = v(t+1))

and returns βi(0) for the known initial state ωi
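A matching sketch of the backward recursion (same hypothetical tables; with no explicit absorbing final state, β is initialized to 1 for every state at the last step); folding in the first emission at the known initial state recovers the same P(V^T | θ) as the forward pass:

```python
# Hypothetical parameters (illustrative only).
A = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
V = [3, 0, 0, 3, 1, 2]

def backward(A, B, V, initial_state=0):
    """beta[i] at step t = P(visible symbols after t | hidden state i at t)."""
    c, T = len(A), len(V)
    beta = [1.0] * c                     # t = T: no more symbols left to explain
    for t in range(T - 2, -1, -1):
        beta = [sum(A[i][j] * B[j][V[t + 1]] * beta[j] for j in range(c))
                for i in range(c)]
    # Fold in the first emission from the known initial state to get P(V^T | theta).
    return B[initial_state][V[0]] * beta[initial_state]

print(backward(A, B, V))   # equals the forward-algorithm result
```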
Learning
Optimal HMM Parameter Setting
There is no known method for computing the optimal transition probabilities directly from the training data.
Forward-Backward Algorithm
A form of expectation-maximization (EM);
iteratively updates transitions to better
explain the observed sequences
• Also known as the Baum-Welch algorithm
Individual Transition Probabilities

γij(t) = αi(t−1) aij bjk v(t) βj(t) / P(V^T | θ)

• numerator: forward probability × transition-and-emission probabilities × backward probability
• denominator: P(V^T | θ)    (evaluation: e.g. by the forward algorithm)

Estimate Updates:

âij = Σ_{t=1..T} γij(t) / Σ_{t=1..T} Σk γik(t)
• expected ωi → ωj transitions (at any t) divided by expected ωi → any transitions (at any t)

b̂jk = Σ_{t=1..T, v(t)=vk} Σl γjl(t) / Σ_{t=1..T} Σl γjl(t)
• expected number of times state ωj emits vk divided by expected number of times state ωj emits any symbol

Do:
• Compute updated hidden and visible state transition estimates, per the last slide
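A sketch of one forward-backward (Baum-Welch) update on a single short training sequence, with the same hypothetical A, B, and V as above. It computes γij(t) as defined above and re-estimates the transition probabilities from expected transition counts; the emission update is written via the state-occupancy probabilities αj(t)βj(t) / P(V^T | θ), i.e. expected visits to state j that emit vk over expected visits to j. No scaling is applied, so this is only suitable for short toy sequences:

```python
# Hypothetical parameters and one short training sequence (illustrative only).
A = [[0.2, 0.3, 0.5], [0.1, 0.6, 0.3], [0.4, 0.4, 0.2]]
B = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
V = [3, 0, 0, 3, 1, 2]
c, m, T = len(A), len(B[0]), len(V)
init = 0                                  # known initial state

# Forward (alpha) and backward (beta) tables for every time step.
alpha = [[B[j][V[0]] if j == init else 0.0 for j in range(c)]]
for t in range(1, T):
    alpha.append([sum(alpha[-1][i] * A[i][j] for i in range(c)) * B[j][V[t]]
                  for j in range(c)])
beta = [[1.0] * c]
for t in range(T - 2, -1, -1):
    beta.insert(0, [sum(A[i][j] * B[j][V[t + 1]] * beta[0][j] for j in range(c))
                    for i in range(c)])
p_v = sum(alpha[-1])                      # P(V^T | theta), from the forward pass

# gamma[t][i][j]: probability of an i -> j transition into step t (t = 1 .. T-1).
gamma = [[[alpha[t - 1][i] * A[i][j] * B[j][V[t]] * beta[t][j] / p_v
           for j in range(c)] for i in range(c)]
         for t in range(1, T)]

# a_hat[i][j]: expected i -> j transitions / expected transitions out of i.
a_hat = [[sum(g[i][j] for g in gamma) /
          sum(g[i][k] for g in gamma for k in range(c))
          for j in range(c)] for i in range(c)]

# occ[t][j]: probability of being in state j at step t (= alpha_j(t) beta_j(t) / P(V)).
occ = [[alpha[t][j] * beta[t][j] / p_v for j in range(c)] for t in range(T)]

# b_hat[j][k]: expected visits to j that emit v_k / expected visits to j.
b_hat = [[sum(occ[t][j] for t in range(T) if V[t] == k) /
          sum(occ[t][j] for t in range(T))
          for k in range(m)] for j in range(c)]

print(a_hat)
print(b_hat)
```

In the full algorithm, this update would sit inside the “Do” loop above and be repeated until the estimates stop changing appreciably.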