Hidden Markov Models and Sequential Data
Sequential Data
• Often arise through measurement of time series
  • Snowfall measurements on successive days in Buffalo
  • Rainfall measurements in Cherrapunji
  • Daily values of a currency exchange rate
  • Acoustic features at successive time frames in speech recognition
• Non-time series
  • Nucleotide base pairs in a strand of DNA
  • Sequence of characters in an English sentence
  • Parts of speech of successive words

Sound Spectrogram of Spoken Words
• Spectrogram of the spoken words “Bayes Theorem”
• Plot of the intensity of the spectral coefficients versus time index
• Successive observations of the speech spectrum are highly correlated
Task of making a sequence of decisions
• Processes evolve in time; the state at time t is influenced by the state at time t-1
• In many time series applications, e.g., financial forecasting, we wish to predict the next value from previous values
• Impractical to consider a general dependence of future observations on all previous observations
  • Complexity would grow without limit as the number of observations increases
• Markov models assume dependence on only the most recent observations

Model Assuming Independence
• Simplest model: treat the observations as independent
• Corresponds to a graph without links
Markov Model
• Most general Markov model for observations {x_n}
• Use the product rule to express the joint distribution of the sequence of observations:

$$p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1})$$
First Order Markov Model
• Chain of observations {x_n}
• Distribution p(x_n | x_{n-1}) is conditioned on the previous observation
• Joint distribution for a sequence of N variables:

$$p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1})$$

• It can be verified (using the product rule above) that

$$p(x_n \mid x_1, \ldots, x_{n-1}) = p(x_n \mid x_{n-1})$$

• If the model is used to predict the next observation, the distribution of the prediction depends only on the preceding observation and is independent of all earlier observations
Second Order Markov Model
• Conditional distribution of observation x_n depends on the values of the two previous observations x_{n-1} and x_{n-2}:

$$p(x_1, \ldots, x_N) = p(x_1)\, p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2})$$

• Each observation is influenced by the previous two observations
Introducing Latent Variables
• For each observation x_n, introduce a latent variable z_n
• z_n may be of a different type or dimensionality than the observed variable
• The latent variables form the Markov chain
• Gives the “state-space model”

[Figure: chain of latent variables z_1, ..., z_N, with one observation x_n attached to each z_n]

• Joint distribution for this model:

$$p(x_1, \ldots, x_N, z_1, \ldots, z_N) = p(z_1) \prod_{n=2}^{N} p(z_n \mid z_{n-1}) \prod_{n=1}^{N} p(x_n \mid z_n)$$
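This factorization can be read directly as a sampling procedure: draw z_1, then each z_n given z_{n-1}, and each x_n given z_n. Below is a minimal sketch for the discrete (HMM) case, assuming NumPy and hypothetical names pi (initial state distribution), A (transition matrix), and B (emission matrix); none of these names come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_state_space(pi, A, B, N):
    """Ancestral sampling: z_1 ~ p(z_1), z_n ~ p(z_n | z_{n-1}), x_n ~ p(x_n | z_n)."""
    z, x = [], []
    state = rng.choice(len(pi), p=pi)                   # z_1 ~ p(z_1)
    for _ in range(N):
        z.append(state)
        x.append(rng.choice(B.shape[1], p=B[state]))    # x_n ~ p(x_n | z_n)
        state = rng.choice(A.shape[1], p=A[state])      # z_{n+1} ~ p(z_{n+1} | z_n)
    return z, x
```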
Two Models Described by this Graph

[Figure: the same state-space graph, latent variables in a chain with one observation per latent variable]

1. If the latent variables are discrete: Hidden Markov Model (HMM)
   • The observed variables in an HMM may be discrete or continuous
2. If both latent and observed variables are Gaussian: linear dynamical system
Latent variable with three discrete states
• Transition probabilities a_ij are represented by a matrix

[Figure: state transition diagram over the three states]

• This is not a graphical model, since the nodes are not separate variables but states of a single variable
• It can be unfolded over time to get a trellis diagram
Markov model for the production of spoken words
• States represent phonemes
• Production of the word “cat” is represented by the states /k/, /a/, /t/
• Transitions in the Markov model for the word “cat”:
  • /k/ to /a/
  • /a/ to /t/
  • /t/ to a silent state
• Although only the correct pronunciation of “cat” is represented by the model, other transitions could be introduced, e.g., /k/ followed directly by /t/
First-order Markov Models (MMs)
• State at time t: ω(t)
• A particular sequence of length T: ω^T = {ω(1), ω(2), ω(3), ..., ω(T)}
  e.g., ω^6 = {ω_1, ω_4, ω_2, ω_2, ω_1, ω_4}
• Note: the system can revisit a state at different steps, and not every state needs to be visited
• The model for the production of any sequence is described by the transition probabilities

$$P(\omega_j(t+1) \mid \omega_i(t)) = a_{ij}$$

  a time-independent probability of having state ω_j at step t+1 given that the state at step t was ω_i
• There is no requirement that the transition probabilities be symmetric
• A particular model is θ = {a_ij}; an a priori probability of the first state, P(ω(1) = ω_i), can also be included
• Discrete states are nodes; transition probabilities are links
• In a first-order discrete-time Markov model, the system at step t is in state ω(t); the state at step t+1 is a random function that depends on the state at step t and the transition probabilities
• Given the model θ, the probability that it generated the sequence ω^6 = {ω_1, ω_4, ω_2, ω_2, ω_1, ω_4} is
  P(ω^6 | θ) = a_14 · a_42 · a_22 · a_21 · a_14
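As a small computational illustration of this product, the sketch below evaluates P(ω^6 | θ) for the slide's sequence. The numerical a_ij values are made up purely for illustration (a real model's rows must sum to 1); only the indexing mirrors the slide.

```python
import numpy as np

# Hypothetical a_ij values; indices 1..4 stand for omega_1..omega_4.
a = np.zeros((5, 5))
a[1, 4], a[4, 2], a[2, 2], a[2, 1] = 0.4, 0.2, 0.3, 0.1

def sequence_probability(a, states):
    """P(omega^T | theta): product of a_ij over successive state pairs."""
    p = 1.0
    for i, j in zip(states, states[1:]):
        p *= a[i, j]
    return p

# a14 * a42 * a22 * a21 * a14, exactly as on the slide
print(sequence_probability(a, [1, 4, 2, 2, 1, 4]))
```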
First Order Hidden Markov Models
• The perceiver does not have access to the states ω(t)
• Instead we measure properties of the emitted sound
• Need to augment the Markov model to allow for visible states (symbols)
First Order Hidden Markov Models
• Visible states (symbols): V^T = {v(1), v(2), v(3), ..., v(T)}
• For instance, V^6 = {v_5, v_1, v_1, v_5, v_2, v_3}
• In any state ω_j(t), the probability of emitting symbol v_k(t) is b_jk

[Figure: HMM with three hidden states; visible states and their emission probabilities shown in red]
Hidden Markov Model Computation
• Finite state machines with transition probabilities are called Markov networks
• Strictly causal: probabilities depend only on previous states
• A Markov model is ergodic if every state has a non-zero probability of occurring given some starting state
• A final or absorbing state is one which, once entered, is never left
Hidden Markov Model Computation
Three Basic Problems for HMMs
Given an HMM with transition and symbol probabilities:
• Problem 1: The evaluation problem
  • Determine the probability that a particular sequence of symbols V^T was generated by that model
• Problem 2: The decoding problem
  • Given a sequence of symbols V^T, determine the most likely sequence of hidden states ω^T that led to the observations
• Problem 3: The learning problem
  • Given the coarse structure of the model (number of states and number of symbols) but not the probabilities a_ij and b_jk, determine these parameters
Problem 1: Evaluation Problem
• Probability that the model produces a sequence V^T of visible states:

$$P(V^T) = \sum_{r=1}^{r_{\max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T)$$

  where V^T is the visible sequence and each r indexes a particular sequence ω_r^T = {ω(1), ω(2), ..., ω(T)} of T hidden states
• In the general case of c hidden states there will be r_max = c^T possible terms
Evaluation Problem Formula
Probability that the model produces a sequence V^T of visible states:

$$P(V^T) = \sum_{r=1}^{r_{\max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T)$$

where

$$(1)\quad P(V^T \mid \omega_r^T) = \prod_{t=1}^{T} P(v(t) \mid \omega(t)) \qquad\quad (2)\quad P(\omega_r^T) = \prod_{t=1}^{T} P(\omega(t) \mid \omega(t-1))$$

because (1) the output probabilities depend only upon the hidden states, and (2) the process is a first-order hidden Markov process. Substituting,

$$P(V^T) = \sum_{r=1}^{r_{\max}} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))$$

Interpretation: the probability of the sequence V^T is the sum, over all r_max possible sequences of hidden states, of the conditional probability that the system made a particular transition multiplied by the probability that it then emitted the visible symbol in the target sequence.
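Translating this sum literally into code makes the c^T cost concrete. The sketch below is only illustrative, assuming A[i, j] = a_ij, B[j, k] = b_jk, V as a list of symbol indices, and a known state at t = 0; the function name and conventions are mine, not the slides'.

```python
import itertools
import numpy as np

def evaluate_brute_force(A, B, V, start_state):
    """P(V^T) as the explicit sum over all c^T hidden-state sequences.
    Each term is prod_t P(v(t)|omega(t)) * P(omega(t)|omega(t-1))."""
    c, T = A.shape[0], len(V)
    total = 0.0
    for path in itertools.product(range(c), repeat=T):   # one omega_r^T per iteration
        prob, prev = 1.0, start_state
        for t in range(T):
            prob *= A[prev, path[t]] * B[path[t], V[t]]  # transition * emission at step t
            prev = path[t]
        total += prob
    return total
```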
Computationally Simpler Evaluation Algorithm
• Calculate

$$P(V^T) = \sum_{r=1}^{r_{\max}} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))$$

  recursively, because each term P(v(t) | ω(t)) P(ω(t) | ω(t-1)) involves only v(t), ω(t) and ω(t-1)
• Define the forward variable α_j(t) by the recursion

$$\alpha_j(t) = \begin{cases} 1 & t = 0,\ j = \text{start state} \\ 0 & t = 0,\ j \neq \text{start state} \\ \Big[\sum_i \alpha_i(t-1)\, a_{ij}\Big]\, b_{jk}v(t) & t \geq 1 \end{cases}$$

  where b_jk v(t) means the symbol probability b_jk corresponding to v(t)
• Therefore α_j(t) is the probability that the model is in state ω_j(t) and has generated the target sequence up to step t
HMM Forward

[Figure: pseudocode for the HMM Forward algorithm, where b_jk v(t) means the symbol probability b_jk corresponding to v(t)]

• Computational complexity: O(c²T)
• A time-reversed version of the Forward algorithm (the Backward algorithm, used later for learning) runs the analogous recursion backward through the trellis
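A minimal sketch of this forward recursion, under the same assumptions as the brute-force sketch above (A holds a_ij, B holds b_jk, V is a list of symbol indices, and the start state is known; names and array layout are my own):

```python
import numpy as np

def forward(A, B, V, start_state):
    """Forward (alpha) recursion: alpha[t, j] is the probability that the model
    is in state omega_j at step t and has generated v(1), ..., v(t)."""
    c, T = A.shape[0], len(V)
    alpha = np.zeros((T + 1, c))
    alpha[0, start_state] = 1.0                    # alpha_j(0): 1 for the start state, else 0
    for t in range(1, T + 1):
        # alpha_j(t) = b_j,v(t) * sum_i alpha_i(t-1) a_ij   -- O(c^2) work per step
        alpha[t] = B[:, V[t - 1]] * (alpha[t - 1] @ A)
    return alpha
```

Summing alpha[T] over states (or reading off the absorbing state's entry when the sequence ends with the null symbol) gives P(V^T) in O(c²T) time rather than O(c^T T).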
Computation of Probabilities by Forward Algorithm

[Figure: trellis, the unfolding of the HMM through time]
Computation of Probabilities by Forward Algorithm
• In the evaluation trellis we only accumulate (sum) values
• In the decoding trellis we only keep maximum values
Example of Hidden Markov Model
• Four states, with ω_0 an explicit absorber state
• Five symbols, with v_0 a unique null symbol

Transition probabilities a_ij:

         ω_0   ω_1   ω_2   ω_3
  ω_0    1     0     0     0
  ω_1    0.2   0.3   0.1   0.4
  ω_2    0.2   0.5   0.2   0.1
  ω_3    0.8   0.1   0     0.1

Symbol (emission) probabilities b_jk:

         v_0   v_1   v_2   v_3   v_4
  ω_0    1     0     0     0     0
  ω_1    0     0.3   0.4   0.1   0.2
  ω_2    0     0.1   0.1   0.7   0.1
  ω_3    0     0.5   0.2   0.1   0.2
Example Problem: compute the probability of generating the sequence V^4 = {v_1, v_3, v_2, v_0}
• Assume ω_1 is the start state
• At t = 0 the state is ω_1, thus α_1(0) = 1 and α_j(0) = 0 for j ≠ 1
• At each step the visible symbol selects the column of b_jk, so each trellis arrow contributes a_ij b_jk
• The arrows show the calculation of α_j(1); for example, the top arrow gives α_0(1) = α_1(0) a_10 b_01 = 1 · [0.2 × 0] = 0

[Figure: trellis for this example, annotated with the a_ij and b_jk tables above]
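Plugging the example's tables into the forward recursion gives the full trellis numerically. This is a self-contained sketch; the final probability is read from the absorbing state ω_0, which is the only state that emits the closing null symbol v_0.

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0, 0.0],       # a_ij from the example slide
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.8, 0.1, 0.0, 0.1]])
B = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],  # b_jk from the example slide
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])
V = [1, 3, 2, 0]                           # V^4 = {v1, v3, v2, v0}

alpha = np.zeros(4)
alpha[1] = 1.0                             # start state omega_1: alpha_1(0) = 1
for k in V:
    alpha = B[:, k] * (alpha @ A)          # alpha_j(t) = b_j,v(t) * sum_i alpha_i(t-1) a_ij
    print(alpha)                           # one row of the trellis per step; alpha_0(1) = 0 as on the slide
print("P(V^4) =", alpha[0])                # probability accumulated in the absorber omega_0
```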


Problem 2: Decoding Problem
• Given a sequence of visible states V^T, the decoding problem is to find the most probable sequence of hidden states
• Expressed mathematically: find the single “best” sequence of hidden states ω̂(1), ω̂(2), ..., ω̂(T) such that

$$\hat\omega(1), \hat\omega(2), \ldots, \hat\omega(T) = \underset{\omega(1), \omega(2), \ldots, \omega(T)}{\arg\max}\; P\big[\omega(1), \omega(2), \ldots, \omega(T),\, v(1), v(2), \ldots, v(T) \mid \theta\big]$$

• Note that the summation of the evaluation problem is changed to an argmax, since we want to find the best case
Viterbi Algorithm

[Figure: pseudocode for the Viterbi decoding algorithm]

Notes:
1. If a_ij and b_jk are replaced by log probabilities, we add terms rather than multiply them
2. The best path is maintained for each node
3. Complexity is O(c²T)
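A sketch of Viterbi decoding consistent with the notes above: log probabilities turn products into sums, and one back-pointer per node preserves the best path. The interface mirrors the earlier forward sketch; function and variable names are mine.

```python
import numpy as np

def viterbi(A, B, V, start_state):
    """Most probable hidden-state path omega(0), ..., omega(T) for symbols V."""
    with np.errstate(divide="ignore"):                 # log(0) -> -inf is acceptable here
        logA, logB = np.log(A), np.log(B)
    c, T = A.shape[0], len(V)
    delta = np.full((T + 1, c), -np.inf)
    delta[0, start_state] = 0.0
    psi = np.zeros((T + 1, c), dtype=int)              # back-pointers: best predecessor of each node
    for t in range(1, T + 1):
        scores = delta[t - 1][:, None] + logA          # scores[i, j]: best path ending with i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, V[t - 1]]
    path = [int(delta[T].argmax())]                    # backtrack from the best final state
    for t in range(T, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```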
Viterbi Decoding Trellis

[Figure: decoding trellis]
Example of Decoding
• Four states with explicit absorber state ω_0; five symbols with unique null symbol v_0
• What is the most likely state sequence that generated the particular symbol sequence V^4 = {v_1, v_3, v_2, v_0}?
• Assume ω_1 is the start state

Transition probabilities a_ij:

         ω_0   ω_1   ω_2   ω_3
  ω_0    1     0     0     0
  ω_1    0.2   0.3   0.1   0.4
  ω_2    0.2   0.5   0.2   0.1
  ω_3    0.8   0.1   0     0.1

Symbol probabilities b_jk:

         v_0   v_1   v_2   v_3   v_4
  ω_0    1     0     0     0     0
  ω_1    0     0.3   0.4   0.1   0.2
  ω_2    0     0.1   0.1   0.7   0.1
  ω_3    0     0.5   0.2   0.1   0.2
Example 4: HMM Decoding

[Figure: decoding trellis for the example]

• Note: the transition between ω_3 and ω_2 is forbidden by the model (a_32 = 0), yet the decoding algorithm gives it a non-zero probability
Problem 3: Learning Problem
• Goal: determine the model parameters a_ij and b_jk from an ensemble of training samples
• There is no known method for obtaining an optimal or most likely set of parameters
• A good, straightforward solution: the Forward-Backward algorithm
Forward-Backward Algorithm
• An instance of the generalized expectation-maximization (EM) algorithm
• We do not know the states that held when the symbols were generated
• The approach is to iteratively update the weights (the probabilities a_ij and b_jk) in order to better explain the observed training sequences
Backward Probabilities
• α_i(t) is the probability that the model is in state ω_i(t) and has generated the target sequence up to step t
• Analogously, β_i(t) is the probability that the model is in state ω_i(t) and will generate the remainder of the given target sequence, from t+1 to T
• The computation proceeds backward through the trellis (the backward evaluation algorithm)
• This is used in learning: parameter estimation
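A sketch of the backward (β) recursion described here, in the same array layout as the earlier forward sketch; β_i(T) is initialized to 1 because nothing remains to be generated after the final step. The names and layout are my own.

```python
import numpy as np

def backward(A, B, V):
    """Backward (beta) recursion: beta[t, i] is the probability of generating the
    remaining symbols v(t+1), ..., v(T) given the model is in state omega_i at step t."""
    c, T = A.shape[0], len(V)
    beta = np.zeros((T + 1, c))
    beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j a_ij * b_j,v(t+1) * beta_j(t+1)
        beta[t] = A @ (B[:, V[t]] * beta[t + 1])
    return beta
```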


Estimating the a_ij and b_jk
• α_i(t) and β_i(t) are merely estimates of their true values, since we don't know the actual values of a_ij and b_jk
• The probability of a transition between ω_i(t-1) and ω_j(t), given that the model generated the entire training sequence V^T by any path, is denoted γ_ij(t).
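Written in the standard Baum-Welch form, with the α and β defined above (the exact formula here is an assumption consistent with that notation, not taken verbatim from the slide):

$$\gamma_{ij}(t) \;=\; \frac{\alpha_i(t-1)\, a_{ij}\, b_{jk}v(t)\, \beta_j(t)}{P(V^T \mid \theta)}$$

where P(V^T | θ) is the total probability of the training sequence, computed by the Forward algorithm.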
Calculating Improved Estimate of a_ij
• The numerator is the expected number of transitions between state ω_i(t-1) and ω_j(t)
• The denominator is the total expected number of transitions out of ω_i
Calculating Improved Estimate of b_jk
• The ratio between the frequency with which any particular symbol v_k is emitted and that for any symbol
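In terms of γ_ij(t), the standard re-estimation formulas can be written as follows (one common indexing convention; treat the exact index placement as an assumption rather than the slides' notation):

$$\hat a_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{k} \gamma_{ik}(t)} \qquad\quad \hat b_{jk} = \frac{\sum_{t=1}^{T} \Big[\sum_{i} \gamma_{ij}(t)\Big]\, \delta\big(v(t), v_k\big)}{\sum_{t=1}^{T} \sum_{i} \gamma_{ij}(t)}$$

where δ(v(t), v_k) is 1 when the symbol emitted at step t is v_k and 0 otherwise.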
Learning Algorithm (Baum-Welch or Forward-Backward Algorithm)
• Start with rough estimates of a_ij and b_jk
• Calculate improved estimates using the formulas above
• Repeat until there is a sufficiently small change in the estimated values of the parameters
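Below is a sketch of one Baum-Welch iteration for a single training sequence, combining the forward and backward passes with the re-estimation step above. The function name, single-sequence restriction, and the small smoothing constant are my own choices, not part of the slides.

```python
import numpy as np

def baum_welch_step(A, B, V, start_state, eps=1e-12):
    """One forward-backward re-estimation step on a single sequence V."""
    c, T = A.shape[0], len(V)
    alpha = np.zeros((T + 1, c)); alpha[0, start_state] = 1.0
    beta = np.zeros((T + 1, c)); beta[T] = 1.0
    for t in range(1, T + 1):                        # forward pass
        alpha[t] = B[:, V[t - 1]] * (alpha[t - 1] @ A)
    for t in range(T - 1, -1, -1):                   # backward pass
        beta[t] = A @ (B[:, V[t]] * beta[t + 1])
    pV = alpha[T].sum()                              # P(V^T | current parameters)

    # gamma[t, i, j]: probability of an omega_i -> omega_j transition that emits V[t]
    gamma = np.zeros((T, c, c))
    for t in range(T):
        gamma[t] = alpha[t][:, None] * A * B[:, V[t]] * beta[t + 1] / pV

    A_new = gamma.sum(axis=0)                        # expected transition counts
    A_new /= A_new.sum(axis=1, keepdims=True) + eps  # eps guards states that are never visited
    occ = gamma.sum(axis=1)                          # occ[t, j]: prob. of being in j when V[t] is emitted
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = occ[np.asarray(V) == k].sum(axis=0)
    B_new /= B_new.sum(axis=1, keepdims=True) + eps
    return A_new, B_new, pV
```

Repeating this step until P(V^T) stops improving implements the learning loop described on this slide.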
Convergence
• Requires several presentations of each training sequence (fewer than 5 are common in speech)
• Another stopping criterion: the overall probability that the learned model could have generated the training data
HMM Word Recognition
Two approaches:
1. A single HMM models all possible words
   • Each state corresponds to a letter of the alphabet (a-z)
   • Letter transition probabilities are calculated for each pair of letters
   • Letter confusion probabilities are the symbol probabilities
   • The decoding problem gives the most likely word
2. Separate HMMs are used to model each word
   • The evaluation problem gives the probability of the observation, which is used as a class-conditional probability
HMM spoken word recognition
• Each word, e.g., “cat”, “dog”, etc., has an associated HMM
• For a test utterance, determine which model has the highest probability
• HMMs for speech are left-to-right models
• The HMM produces a class-conditional probability P(V^T | model), computed by the Forward algorithm
• Thus it is useful to compute the probability of the model given the sequence using Bayes rule:

$$P(\theta \mid V^T) = \frac{P(V^T \mid \theta)\, P(\theta)}{P(V^T)}$$

  where the class-conditional P(V^T | θ) is computed by the Forward algorithm and P(V^T) is the prior probability of the sequence
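As a sketch of the whole-word approach, each candidate word gets its own (A, B) model and a prior, and the forward pass supplies the class-conditional term. The dictionary layout, function name, and parameters here are hypothetical.

```python
import numpy as np

def classify_word(V, word_models, priors):
    """word_models: {word: (A, B, start_state)}; priors: {word: P(word)}.
    Returns the word whose HMM maximizes P(V | model) * P(model)."""
    scores = {}
    for word, (A, B, start) in word_models.items():
        alpha = np.zeros(A.shape[0]); alpha[start] = 1.0
        for k in V:                                  # forward pass: class-conditional P(V | model)
            alpha = B[:, k] * (alpha @ A)
        scores[word] = alpha.sum() * priors[word]    # proportional to the posterior P(model | V)
    return max(scores, key=scores.get)
```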
Cursive Word Recognition (not HMM)
• The unit is a pre-segment (cusp at the bottom)
• Candidate interpretations of groups of pre-segments, with confidences:
  • Image segment from 1 to 3 is u with 0.5 confidence
  • Image segment from 1 to 4 is w with 0.7 confidence
  • Image segment from 1 to 5 is w with 0.6 confidence and m with 0.3 confidence

[Figure: word image with segmentation points 1-8 and a graph over those points; edges carry candidate letters and confidences such as w[.7], w[.6], m[.3], i[.8], l[.8], u[.5], v[.2], o[.5], r[.4], d[.8], i[.7], u[.3], m[.2], m[.1]]

• The best path in the graph from segment 1 to 8 spells: w o r d
Summary: Key Algorithms for HMM
• Problem 1: HMM Forward
• Problem 2: Viterbi Algorithm
  • An algorithm to compute the optimal (most likely) state sequence in an HMM given a sequence of observed outputs
• Problem 3: Baum-Welch Algorithm
  • An algorithm to find the HMM parameters A, B, and Π with the maximum likelihood of generating the given symbol sequence in the observation vector
