
Hidden Markov Models

DHS 3.10
HMMs
Model the likelihood of a sequence of observations as a series of state transitions.

• The set of states is fixed in advance; the likelihoods of state transitions and of the observed features from each state are learned

• Each state has an associated feature space

• Often used to find the most likely sequence of state transitions, according to the model

• Example: recognizing spoken words

FIGURE 3.11. A left-to-right HMM commonly used in speech recognition. For instance, such a model could describe the utterance "viterbi," where ω_1 represents the phoneme /v/, ω_2 represents /i/, ..., and ω_0 a final silent state (states ω_1 ... ω_7, ω_0 emit /v/ /i/ /t/ /e/ /r/ /b/ /i/ /-/).
HMMs: On-line Handwriting Recognition
(example from Bishop, "Pattern Recognition and Machine Learning")

[Figure: handwriting examples generated from the HMM]

Symbols
Represented as a sequence of (x, y) locations for each pen stroke

A Simple HMM
16 states representing a line segment of fixed length in one of 16 angles: another 'left-to-right' model with a fixed sequence of states
First-Order Markov Models
Represent Probabilistic State Transitions
"First order": the probability of the state at each time step depends only on the previous state:

P(ω_j(t+1) | ω_i(t)) = a_ij,    Σ_{j=1}^{|Ω|} a_ij = 1

Transition probabilities coming out of each state sum to one. States at times t and t+1 may be the same.
Example: First-Order Markov Model

Consider the state sequence

ω^6 = {ω_1, ω_3, ω_2, ω_2, ω_1, ω_3}

Probability of this Sequence
Given the transition probability table θ and the first state known, it is the product of the transition probabilities a_ij:

P(ω^6 | θ) = a_13 a_32 a_22 a_21 a_13

FIGURE 3.8. The discrete states, ω_i, in a basic Markov model are represented by nodes, and the transition probabilities, a_ij, are represented by links. In a first-order discrete-time Markov model, at any step t the full system is in a particular state ω(t). The state at step t + 1 is a random function that depends solely on the state at step t and the transition probabilities. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
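To make the product-of-transitions computation concrete, here is a minimal sketch in Python, assuming a hypothetical 3-state transition matrix; the matrix values are invented for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical 3-state transition matrix; a[i, j] = P(omega_j at t+1 | omega_i at t).
# Values are illustrative only; each row sums to one.
a = np.array([
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
    [0.1, 0.6, 0.3],
])

# The state sequence from the slide, 0-indexed: omega1, omega3, omega2, omega2, omega1, omega3.
seq = [0, 2, 1, 1, 0, 2]

# With the first state known, the sequence probability is the product of the
# transition probabilities along the path: a13 * a32 * a22 * a21 * a13.
prob = 1.0
for prev, curr in zip(seq[:-1], seq[1:]):
    prob *= a[prev, curr]

print(prob)  # equals a[0,2] * a[2,1] * a[1,1] * a[1,0] * a[0,2]
```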

Problem
In practice, we often can't directly observe the states of interest (e.g. in speech recognition, the input is waveform features, not phonemes).
First-Order HMMs
Modification
As we can’t directly observe states, let’s model the likelihood of
observed features for each state

• Add ‘visible’ states V for observed features

• States of interest are ‘hidden’ (must be inferred)

Restriction
We will discuss HMMs where the observations are discrete

Observations
Are now sequences of visible states:

V^6 = {v_5, v_1, v_1, v_5, v_2, v_3}

The probability of a transition to a visible state depends on the current (hidden) state; transitions to visible states from each hidden state must sum to one:

b_jk = P(v_k(t) | ω_j(t)),    Σ_k b_jk = 1
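As a concrete illustration of the two normalization constraints, a discrete HMM with c hidden states and m visible symbols can be stored as a c×c transition matrix and a c×m emission matrix, each row summing to one. A minimal sketch with invented values (not from the slides):

```python
import numpy as np

# Hypothetical discrete HMM with 3 hidden states and 4 visible symbols.
# A[i, j] = a_ij = P(omega_j at t+1 | omega_i at t); rows sum to one.
A = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# B[j, k] = b_jk = P(v_k at t | omega_j at t); rows also sum to one.
B = np.array([
    [0.50, 0.20, 0.20, 0.10],
    [0.10, 0.40, 0.40, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# Sanity checks for the two normalization constraints on the slide.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```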

Example: HMM
Consider the visible state sequence

V^6 = {v_4, v_1, v_1, v_4, v_2, v_3}

Sequence of Hidden States
Is non-unique, even if we know that we begin in state one. However, some may be more likely than others.

FIGURE 3.9. Three hidden units in an HMM and the transitions between them are shown in black, while the visible states and the emission probabilities of visible states are shown in red. This model shows all transitions as being possible; in other HMMs some such candidate transitions are not allowed. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Key Problems
Evaluation
Computing the probability that a sequence of visible states was
generated by an HMM

Decoding
Determine the most likely sequence of hidden states that
produced a sequence of visible states

Learning
Given the HMM structure (number of visible and hidden
states) and a training set of visible state sequences, determine
the transition probabilities for hidden and visible states.

Evaluation
Probability of Visible Sequence V^T:
Is the sum of probabilities for the sequence over all possible length-T paths through the hidden states:

P(V^T) = Σ_{r=1}^{r_max} P(V^T | ω_r^T) P(ω_r^T)

where r_max = c^T for c hidden states. The state at time T is a final or absorbing state ω_0 (e.g. silence in speech applications).
Evaluation, Cont'd

Probability of Visible Sequence (alt.):

P(ω_r^T) = Π_{t=1}^T P(ω(t) | ω(t−1))

P(V^T | ω_r^T) = Π_{t=1}^T P(v(t) | ω(t))

P(V^T) = Σ_{r=1}^{r_max} Π_{t=1}^T P(v(t) | ω(t)) P(ω(t) | ω(t−1))

Interpretation
Sum over all possible hidden state sequences of: the conditional probability of each transition multiplied by the probability of the visible symbol being emitted from the current hidden state.

Problem
Computation is O(c^T T); e.g. c = 10, T = 20: ~10^21
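A brute-force evaluator that literally enumerates all c^T hidden paths shows why this is intractable beyond toy sizes. This is a sketch under my own conventions, not the slides': it uses an initial state distribution `pi` and emits a symbol at every step, whereas the slides assume a known initial state and a final absorbing state ω_0; the model values are invented.

```python
import itertools
import numpy as np

def brute_force_evaluate(A, B, pi, obs):
    """P(V^T) by summing over all c^T hidden paths (exponential; illustration only)."""
    c = A.shape[0]
    T = len(obs)
    total = 0.0
    for path in itertools.product(range(c), repeat=T):
        p = pi[path[0]] * B[path[0], obs[0]]                   # initial state, first emission
        for t in range(1, T):
            p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]  # transition, then emission
        total += p
    return total

# Tiny invented model: 2 hidden states, 2 visible symbols.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([1.0, 0.0])   # stands in for the slides' known initial state
obs = [0, 1, 0]             # observed visible symbols (0-indexed)

print(brute_force_evaluate(A, B, pi, obs))
```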
Forward Algorithm
(Much) Faster Evaluation
In O(c^2 T) time
Idea
Compute the probability recursively for each
state at each time step from 1...T:


α_j(t) =
  0,  if t = 0 and j ≠ initial state
  1,  if t = 0 and j = initial state
  [Σ_i α_i(t−1) a_ij] b_jk(v(t)),  otherwise

where b_jk(v(t)) is the probability of the observed visible symbol v_k at time t.

Return α_0(T) (the probability of the final state ω_0 at the last time step).
FIGURE 3.10. The computation of probabilities by the Forward algorithm can be visualized by means of a trellis—a sort of "unfolding" of the HMM through time. Suppose we seek the probability that the HMM was in state ω_2 at t = 3 and generated the observed visible symbol up through that step (including the observed visible symbol v_k). The probability the HMM was in state ω_j(t = 2) and generated the observed sequence through t = 2 is α_j(2) for j = 1, 2, ..., c. To find α_2(3) we must sum these and multiply by the probability that state ω_2 emitted the observed symbol v_k. Formally, for this particular illustration we have α_2(3) = b_2k Σ_{j=1}^c α_j(2) a_j2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
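A minimal sketch of the forward recursion over the trellis above. It assumes an initial distribution `pi` and an emission at every step rather than the slides' dedicated initial/absorbing states, so summing α over all states at the last step plays the role of α_0(T); model values are invented.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: alpha[t, j] is the probability of emitting obs[0..t]
    and being in hidden state j at time t. Runs in O(c^2 T)."""
    c, T = A.shape[0], len(obs)
    alpha = np.zeros((T, c))
    alpha[0] = pi * B[:, obs[0]]                       # initialization at t = 0
    for t in range(1, T):
        # alpha[t, j] = [sum_i alpha[t-1, i] * a_ij] * b_j(obs[t])
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

# Same tiny invented model as in the brute-force sketch.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([1.0, 0.0])
obs = [0, 1, 0]

alpha = forward(A, B, pi, obs)
print(alpha[-1].sum())   # P(V^T | model); matches the brute-force sum
```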
Prob. of HMM for Observation

Probability of the model, given the visible sequence (Bayes formula):

P(θ | V^T) = P(V^T | θ) P(θ) / P(V^T)

The forward algorithm provides P(V^T | θ). The other two terms are provided by domain knowledge:

• the probability of the sequence itself, and the prior probability of the model; e.g. using a language model for speech recognition

• the probability of the HMM (the model prior P(θ)) is often assumed uniform, and ignored for classification
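For classification, one would fit one HMM per class (e.g. per word) and pick the model maximizing P(V^T | θ) P(θ); with uniform priors this reduces to the largest forward probability. The sketch below is an assumption-laden illustration: the two models, their parameters, and the priors are invented, and `sequence_likelihood` is just a compact forward pass.

```python
import numpy as np

def sequence_likelihood(A, B, pi, obs):
    """P(V^T | theta) via the forward recursion."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two hypothetical word models theta_1, theta_2 (parameters invented) and priors
# P(theta), e.g. from a language model; uniform priors would simply drop out.
models = {
    "word_1": (np.array([[0.8, 0.2], [0.3, 0.7]]),
               np.array([[0.9, 0.1], [0.2, 0.8]]),
               np.array([1.0, 0.0])),
    "word_2": (np.array([[0.5, 0.5], [0.5, 0.5]]),
               np.array([[0.3, 0.7], [0.6, 0.4]]),
               np.array([1.0, 0.0])),
}
priors = {"word_1": 0.5, "word_2": 0.5}

obs = [0, 1, 0]
# Pick the class maximizing P(V^T | theta) * P(theta); P(V^T) is a shared constant.
best = max(models, key=lambda w: sequence_likelihood(*models[w], obs) * priors[w])
print(best)
```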
Decoding: Finding Hidden States
Decoding
Find most likely sequence of hidden states in HMM

Brute Force (Enumerate Sequences):
Again O(c^T T); need a more efficient approach

Greedy Algorithm (with modification, O(c^2 T))
Modify the forward algorithm so that we add the most likely hidden state to a list after updating the state probabilities at time t. Return the list.

• Not guaranteed to produce a valid path.
FIGURE 3.12. The decoding algorithm finds at each time step t the state that has the highest probability of having come from the previous step and generated the observed visible state v_k. The full path is the sequence of such states. Because this is a local optimization (dependent only upon the single previous time step, not the full sequence), the algorithm does not guarantee that the path is indeed allowable. For instance, it might be possible that the maximum at t = 5 is ω_1 and at t = 6 is ω_2, and thus these would appear in the path. This can even occur if a_12 = P(ω_2(t+1) | ω_1(t)) = 0, precluding that transition. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
Viterbi Algorithm
Purpose
Decoding; computes the most likely
sequence of states for an observation
sequence
• State sequence will be consistent with HMM
transition probabilities

Brief Summary
A dynamic programming algorithm that
incrementally computes the most likely
sequence of states from start to finish state
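A minimal Viterbi sketch, using the same array conventions as the earlier forward-algorithm sketch (initial distribution `pi`, invented model values). In practice log probabilities are usually used to avoid numerical underflow; this version keeps raw probabilities for clarity.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden state sequence for a discrete observation sequence,
    via dynamic programming over the trellis; O(c^2 T)."""
    c, T = A.shape[0], len(obs)
    delta = np.zeros((T, c))             # best path probability ending in each state
    back = np.zeros((T, c), dtype=int)   # backpointers to the best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A           # scores[i, j] = delta[i] * a_ij
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Trace back from the most probable final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny invented model; output indices are 0-based hidden-state labels.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 1, 0]))
```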
Backward Algorithm

Computes P(V^T | θ). Like the forward algorithm, but moving from the final state back to the initial state. The algorithm computes the recurrence:

β_i(t) =
  0,  if t = T and i ≠ final state
  1,  if t = T and i = final state
  Σ_j β_j(t+1) a_ij b_jk(v(t+1)),  otherwise

where b_jk(v(t+1)) is the probability of the observed visible symbol at time t+1, and returns β_i(0) for the initial state i.
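A sketch of the backward recursion under the same simplified conventions as the earlier forward sketch: no dedicated absorbing state, so the base case sets β(T−1) = 1 for every state, and the check at the bottom recovers P(V^T | θ) by combining β with the (assumed) initial distribution and the first emission.

```python
import numpy as np

def backward(A, B, obs):
    """Backward algorithm: beta[t, i] is the probability of emitting obs[t+1..T-1]
    given hidden state i at time t. Runs from the final step back to the first."""
    c, T = A.shape[0], len(obs)
    beta = np.zeros((T, c))
    beta[-1] = 1.0                                    # base case at the final step
    for t in range(T - 2, -1, -1):
        # beta[t, i] = sum_j a_ij * b_j(obs[t+1]) * beta[t+1, j]
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# With an initial distribution pi (an assumption; the slides fix the initial state),
# P(V^T | theta) = sum_i pi_i * b_i(obs[0]) * beta[0, i], matching the forward result.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([1.0, 0.0])
obs = [0, 1, 0]
beta = backward(A, B, obs)
print((pi * B[:, obs[0]] * beta[0]).sum())
```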
Learning
Optimal HMM Parameter Setting
For the transition probabilities, there is no known method for computing the optimal values directly.

Forward-Backward Algorithm
A form of expectation-maximization (EM); iteratively updates the transition probabilities to better explain the observed sequences

• Also known as the Baum-Welch algorithm
Individual Transition Probabilities

γ_ij(t) = α_i(t−1) a_ij b_jk β_j(t) / P(V^T | θ)

(numerator: forward · transition and emission · backward; denominator: the probability of the observed sequence from the evaluation step, e.g. by the forward algorithm)

Estimate Updates:

â_ij = [Σ_{t=1}^T γ_ij(t)] / [Σ_{t=1}^T Σ_k γ_ik(t)]

(expected state i→j transitions at any t, over expected state i→any transitions at any t)

b̂_jk = [Σ_{t=1, v(t)=v_k}^T Σ_l γ_jl(t)] / [Σ_{t=1}^T Σ_l γ_jl(t)]

(expected state j transmitting v_k, over expected state j transmitting any symbol; the symbol is transmitted after the transition to state l)
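The following is a literal transcription of these update equations as a single re-estimation step, assuming forward and backward tables shaped like those returned by the earlier sketches (emission at every step, 0-indexed time). It glosses over the t = 0 boundary term and does no smoothing, so it shows the shape of the computation rather than a production implementation.

```python
import numpy as np

def baum_welch_step(A, B, obs, alpha, beta):
    """One re-estimation step, transcribing the update formulas on this slide.
    alpha[t, i] and beta[t, j] are forward/backward tables as in the earlier sketches."""
    c, T = A.shape[0], len(obs)
    p_obs = alpha[-1].sum()                            # P(V^T | theta)

    # gamma[t, i, j] = alpha_i(t-1) * a_ij * b_jk(v(t)) * beta_j(t) / P(V^T | theta)
    gamma = np.zeros((T, c, c))
    for t in range(1, T):
        gamma[t] = (alpha[t - 1][:, None] * A
                    * B[:, obs[t]][None, :] * beta[t][None, :]) / p_obs

    # a_hat_ij = expected i->j transitions / expected i->any transitions
    A_hat = gamma.sum(axis=0) / gamma.sum(axis=(0, 2))[:, None]

    # b_hat_jk = (sum over steps where v(t) = v_k of sum_l gamma_jl(t))
    #            / (sum over all steps of sum_l gamma_jl(t))
    occupancy = gamma.sum(axis=2)                      # occupancy[t, j] = sum_l gamma[t, j, l]
    B_hat = np.zeros_like(B)
    for k in range(B.shape[1]):
        steps_with_k = [t for t in range(T) if obs[t] == k]
        B_hat[:, k] = occupancy[steps_with_k].sum(axis=0) / occupancy.sum(axis=0)
    return A_hat, B_hat
```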
Forward-Backward Algorithm
Initialization:
• Given a training sequence V^T and a convergence threshold λ

• Set the transition probabilities randomly

Do:
• Compute updated hidden-state transition and emission estimates, per the last slide

• Compute the largest difference between the previous and current transition (a_ij) and emission (b_jk) estimates

Until: the largest difference is < λ (convergence)

• Return the transition estimates (a_ij) and emission estimates (b_jk)
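A minimal loop skeleton matching this slide. It assumes the `forward`, `backward`, and `baum_welch_step` functions from the earlier sketches (my names, not the slides'), random row-stochastic initialization, and the convergence test on the largest parameter change.

```python
import numpy as np

def train_hmm(obs, c, m, lam=1e-4, seed=0, max_iters=500):
    """Forward-backward (Baum-Welch) training on one observation sequence.
    c hidden states, m visible symbols; relies on forward(), backward() and
    baum_welch_step() as sketched above."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic initial guesses for the transition and emission tables.
    A = rng.random((c, c));  A /= A.sum(axis=1, keepdims=True)
    B = rng.random((c, m));  B /= B.sum(axis=1, keepdims=True)
    pi = np.full(c, 1.0 / c)

    for _ in range(max_iters):
        alpha = forward(A, B, pi, obs)
        beta = backward(A, B, obs)
        A_new, B_new = baum_welch_step(A, B, obs, alpha, beta)
        delta = max(np.abs(A_new - A).max(), np.abs(B_new - B).max())
        A, B = A_new, B_new
        if delta < lam:          # converged: largest parameter change below threshold
            break
    return A, B
```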
