
Introduction to Hidden Markov Models

Markov Models
• Set of states: {s_1, s_2, ..., s_N}
• Process moves from one state to another, generating a sequence of states: s_i1, s_i2, ..., s_ik, ...
• Markov chain property: the probability of each subsequent state depends only on the previous state:

  P(s_ik | s_i1, s_i2, ..., s_ik-1) = P(s_ik | s_ik-1)

• To define a Markov model, the following probabilities have to be specified: transition probabilities a_ij = P(s_j | s_i) and initial probabilities π_i = P(s_i).
Example of Markov Model

[Diagram: two states, Rain and Dry. Self-loops: Rain 0.3, Dry 0.8; transitions Rain --> Dry 0.7 and Dry --> Rain 0.2.]

• Two states: 'Rain' and 'Dry'.
• Transition probabilities: P('Rain'|'Rain') = 0.3, P('Dry'|'Rain') = 0.7, P('Rain'|'Dry') = 0.2, P('Dry'|'Dry') = 0.8
• Initial probabilities: say P('Rain') = 0.4, P('Dry') = 0.6.
Calculation of sequence probability
• By the Markov chain property, the probability of a state sequence can be found by the formula:

  P(s_i1, s_i2, ..., s_ik)
    = P(s_ik | s_i1, s_i2, ..., s_ik-1) · P(s_i1, s_i2, ..., s_ik-1)
    = P(s_ik | s_ik-1) · P(s_i1, s_i2, ..., s_ik-1)
    = ...
    = P(s_ik | s_ik-1) · P(s_ik-1 | s_ik-2) · ... · P(s_i2 | s_i1) · P(s_i1)

• Suppose we want to calculate the probability of a sequence of states in our example, {'Dry','Dry','Rain','Rain'}:

  P({'Dry','Dry','Rain','Rain'})
    = P('Rain'|'Rain') P('Rain'|'Dry') P('Dry'|'Dry') P('Dry')
    = 0.3 * 0.2 * 0.8 * 0.6 = 0.0288
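As a sanity check, the chain-rule computation above can be written as a few lines of Python (a minimal sketch; the dictionary encoding and names are illustrative assumptions, not from the slides):

```python
# Minimal sketch: P(s_i1,...,s_ik) = P(s_i1) * prod_t P(s_it | s_it-1).
trans = {('Rain', 'Rain'): 0.3, ('Rain', 'Dry'): 0.7,   # P(next | prev)
         ('Dry', 'Rain'): 0.2, ('Dry', 'Dry'): 0.8}
init = {'Rain': 0.4, 'Dry': 0.6}                        # P(first state)

def sequence_probability(states):
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    return p

print(sequence_probability(['Dry', 'Dry', 'Rain', 'Rain']))  # 0.6*0.8*0.2*0.3 = 0.0288
```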
Hidden Markov models
• Set of states: {s_1, s_2, ..., s_N}
• Process moves from one state to another, generating a sequence of states: s_i1, s_i2, ..., s_ik, ...
• Markov chain property: the probability of each subsequent state depends only on the previous state:

  P(s_ik | s_i1, s_i2, ..., s_ik-1) = P(s_ik | s_ik-1)

• States are not visible, but each state randomly generates one of M observations (or visible states) {v_1, v_2, ..., v_M}.
• To define a hidden Markov model, the following probabilities have to be specified: a matrix of transition probabilities A = (a_ij), a_ij = P(s_j | s_i); a matrix of observation probabilities B = (b_i(v_m)), b_i(v_m) = P(v_m | s_i); and a vector of initial probabilities π = (π_i), π_i = P(s_i). The model is represented by M = (A, B, π).


Example of Hidden Markov Model

[Diagram: two hidden states, Low and High. Self-loops: Low 0.3, High 0.8; transitions Low --> High 0.7, High --> Low 0.2. Emissions: Low emits Rain 0.6 and Dry 0.4; High emits Rain 0.4 and Dry 0.6.]
Example of Hidden Markov Model
• Two hidden states: 'Low' and 'High' atmospheric pressure.
• Two observations: 'Rain' and 'Dry'.
• Transition probabilities: P('Low'|'Low') = 0.3, P('High'|'Low') = 0.7, P('Low'|'High') = 0.2, P('High'|'High') = 0.8
• Observation probabilities: P('Rain'|'Low') = 0.6, P('Dry'|'Low') = 0.4, P('Rain'|'High') = 0.4, P('Dry'|'High') = 0.6.
• Initial probabilities: say P('Low') = 0.4, P('High') = 0.6.
Calculation of observation sequence probability
• Suppose we want to calculate the probability of a sequence of observations in our example, {'Dry','Rain'}.
• Consider all possible hidden state sequences:

  P({'Dry','Rain'}) = P({'Dry','Rain'}, {'Low','Low'}) + P({'Dry','Rain'}, {'Low','High'})
                    + P({'Dry','Rain'}, {'High','Low'}) + P({'Dry','Rain'}, {'High','High'})

  where the first term is:

  P({'Dry','Rain'}, {'Low','Low'})
    = P({'Dry','Rain'} | {'Low','Low'}) P({'Low','Low'})
    = P('Dry'|'Low') P('Rain'|'Low') P('Low') P('Low'|'Low')
    = 0.4 * 0.6 * 0.4 * 0.3 = 0.0288
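The full sum over hidden sequences can be written as a brute-force Python sketch (illustrative encoding, not from the slides). Note that the enumeration is exponential in the sequence length, which is exactly what the forward algorithm later avoids:

```python
from itertools import product

states = ['Low', 'High']
init = {'Low': 0.4, 'High': 0.6}
trans = {('Low', 'Low'): 0.3, ('Low', 'High'): 0.7,
         ('High', 'Low'): 0.2, ('High', 'High'): 0.8}
emit = {('Low', 'Rain'): 0.6, ('Low', 'Dry'): 0.4,
        ('High', 'Rain'): 0.4, ('High', 'Dry'): 0.6}

def observation_probability(obs):
    total = 0.0
    for path in product(states, repeat=len(obs)):   # all hidden state sequences
        p = init[path[0]] * emit[(path[0], obs[0])]
        for t in range(1, len(obs)):
            p *= trans[(path[t-1], path[t])] * emit[(path[t], obs[t])]
        total += p
    return total

print(observation_probability(['Dry', 'Rain']))  # 0.0288 + 0.0448 + 0.0432 + 0.1152 = 0.232
```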
Main issues using HMMs:
• Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 ... o_K, calculate the probability that model M has generated sequence O.
• Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o_1 o_2 ... o_K, calculate the most likely sequence of hidden states s_i that produced this observation sequence O.
• Learning problem. Given some training observation sequences O = o_1 o_2 ... o_K and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data.

O = o_1 ... o_K denotes a sequence of observations, o_k ∈ {v_1, ..., v_M}.
Example: A Crazy Soft Drink Machine
Suppose you have a crazy soft drink machine: it can be in two states, cola preferring (CP) and iced tea preferring (IP), but it switches between them randomly after each purchase, as shown below:

[Diagram: transition matrix between CP and IP, and the output possibility matrix; the states themselves are NOT OBSERVABLE.]

Three possible outputs (observations): cola, iced tea, lemonade
General Form of an HMM
An HMM is specified by a five-tuple (S, O, π, A, B):
1) S = {1, 2, ..., N}: the set of hidden states, where N is the number of states and s_t is the state at time t.
2) O = {o_1, o_2, ..., o_M}: the set of observation symbols, where M is the number of observation symbols.
3) π = {π_i}, π_i = P(s_0 = i), 1 ≤ i ≤ N: the initial state distribution.
4) A = {a_ij}, a_ij = P(s_t = j | s_t-1 = i), 1 ≤ i, j ≤ N: the state transition probability distribution.
5) B = {b_j(k)}, b_j(k) = P(X_t = o_k | s_t = j), 1 ≤ j ≤ N, 1 ≤ k ≤ M: the observation symbol probability distribution in state j.
General Form of an HMM (Continued)
To sum up, a complete specification of an HMM includes:
two constant-size parameters:
  N and M (the total number of states and the size of the observation alphabet),
three sets of probability distributions:
  A, B, π, written compactly as λ = (A, B, π)
Two assumptions:
1. Markov assumption (on the state sequence):
   P(q_t | q_1 ... q_t-1) = P(q_t | q_t-1)
2. Output independence assumption (on the output sequence):
   P(o_t | o_1 ... o_t-1, q_1 ... q_t) = P(o_t | q_t)
Three Basic Problems in HMM
MM ?
• 1.The Evaluation Problem
ua t ea n H –Given a model anda
l
sequenceHoof o eva
w tobservations orithm X  ( X 1 , X 2 ,..., X T, )what is the
A lg
rward
probability F o P ( X; |i.e., ) the probability of the model
that generates the observations?

• 2.The Decoding Problem – Given aMmodel M? and a
an H
sequence of observation to D ec od Xe  ( X 1 , X 2 ,..., X, Twhat
) is the
How orithm
Siterb( s0 , s1 ,..., sT ) in the model
most likely state sequence
V
i A lg
that produces the observations?

• 3.The Learning Problem –Given a model and


M ?a set of
HMparameter
observations, how can we adjust the model
a n
to Train ithm
H ow l gor
to maximize the joint probability -Welch A
Baum
 X
P (X| ?)
13
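The evaluation problem has an efficient dynamic-programming solution. Below is a minimal NumPy sketch of the forward algorithm (the array layout and function name are illustrative assumptions, not from the slides); it reproduces the P({'Dry','Rain'}) = 0.232 value from the Low/High example earlier:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(obs | model) in O(N^2 * T) time.
    A[i, j] = P(state j at t | state i at t-1)
    B[j, k] = P(symbol k | state j); obs is a list of symbol indices.
    """
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # induction step over time
    return float(alpha.sum())                 # sum over final states

# Low/High example (0 = Low, 1 = High; symbols: 0 = Rain, 1 = Dry):
A = np.array([[0.3, 0.7], [0.2, 0.8]])
B = np.array([[0.6, 0.4], [0.4, 0.6]])
pi = np.array([0.4, 0.6])
print(forward(A, B, pi, [1, 0]))              # P({Dry, Rain}) = 0.232
```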
HMM for Ice Cream
• You are a climatologist in the year 2799
• Studying global warming
• You can't find any records of the weather in Baltimore, MD for summer of 2007
• But you find Jason Eisner's diary
• Which lists how many ice creams Jason ate every day that summer
• Our job: figure out how hot it was
Hidden Markov Model

• For Markov chains, the output symbols are the same as the states.
  – See hot weather: we're in state hot
• But in part-of-speech tagging (and other things)
  – The output symbols are words
  – But the hidden states are part-of-speech tags
• So we need an extension!
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
• This means we don't know which state we are in.
Hidden Markov Models
• States Q = q_1, q_2 ... q_N
• Observations O = o_1, o_2 ... o_N
  – Each observation is a symbol from a vocabulary V = {v_1, v_2, ..., v_V}
• Transition probabilities
  – Transition probability matrix A = {a_ij}, a_ij = P(q_t = j | q_t-1 = i), 1 ≤ i, j ≤ N
• Observation likelihoods
  – Output probability matrix B = {b_i(k)}, b_i(k) = P(X_t = o_k | q_t = i)
• Special initial probability vector π, π_i = P(q_1 = i), 1 ≤ i ≤ N
Hidden Markov Models

• Some constraints:

  Σ_{j=1}^{N} a_ij = 1,  1 ≤ i ≤ N
  π_i = P(q_1 = i),  1 ≤ i ≤ N
  Σ_{k=1}^{M} b_i(k) = 1
  Σ_{j=1}^{N} π_j = 1
Assumptions
• Markov assumption:

  P(q_i | q_1 ... q_i-1) = P(q_i | q_i-1)

• Output-independence assumption:

  P(o_t | O_1^{t-1}, q_1^t) = P(o_t | q_t)
Eisner task
• Given
– Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
• Produce:
– Weather Sequence: H,C,H,H,H,C…

HMM for ice cream

[Figure: the ice-cream HMM, with hot and cold weather states and emission probabilities for the ice-cream counts; not reproduced.]
HMM
A colored ball choosing example:

  Urn 1: # of Red = 30, # of Green = 50, # of Blue = 20
  Urn 2: # of Red = 10, # of Green = 40, # of Blue = 50
  Urn 3: # of Red = 60, # of Green = 10, # of Blue = 30

Probability of transition to another urn after picking a ball (transition probabilities):

        U1    U2    U3
  U1    0.1   0.4   0.5
  U2    0.6   0.2   0.2
  U3    0.3   0.4   0.3
Example (contd.)
Given the transition matrix A and the observation matrix B:

  A:    U1    U2    U3        B:    R     G     B
  U1    0.1   0.4   0.5       U1    0.3   0.5   0.2
  U2    0.6   0.2   0.2       U2    0.1   0.4   0.5
  U3    0.3   0.4   0.3       U3    0.6   0.1   0.3

Observation: RRGGBRGR

State sequence: ??

Not so easily computable.
Example (contd.)

• Here:
  – S = {U1, U2, U3}
  – V = {R, G, B}
• For observations:
  – O = {o_1 ... o_n}
• And state sequence:
  – Q = {q_1 ... q_n}
• π_i = P(q_1 = U_i)

  A:    U1    U2    U3        B:    R     G     B
  U1    0.1   0.4   0.5       U1    0.3   0.5   0.2
  U2    0.6   0.2   0.2       U2    0.1   0.4   0.5
  U3    0.3   0.4   0.3       U3    0.6   0.1   0.3
Model Definition

• Set of states: S, where |S| = N
• Output alphabet: V
• Transition probabilities: A = {a_ij}
• Emission probabilities: B = {b_j(o_k)}
• Initial state probabilities: π

λ = (A, B, π)
Markov Processes
• Properties
  – Limited horizon: given the previous n states, a state is independent of states further back:
    P(X_t = i | X_t-1, X_t-2, ..., X_0) = P(X_t = i | X_t-1, X_t-2, ..., X_t-n)
  – Time invariance (the transition probabilities do not depend on t):
    P(X_t = i | X_t-1 = j) = P(X_1 = i | X_0 = j) = P(X_n = i | X_n-1 = j)
Three Basic Problems of HMM
1. Given observation sequence O = {o_1 ... o_T}
   – Efficiently estimate P(O | λ)

2. Given observation sequence O = {o_1 ... o_T}
   – Get best Q = {q_1 ... q_T}, i.e.
     • Maximize P(Q | O, λ)

3. How to adjust λ = (A, B, π) to best maximize P(O | λ)
   – Re-estimate λ
Three basic problems (contd.)
• Problem 1: Likelihood of a sequence
– Forward Procedure
– Backward Procedure
• Problem 2: Best state sequence
– Viterbi Algorithm
• Problem 3: Re-estimation
– Baum-Welch (Forward-Backward Algorithm)
Problem 2
• Given observation sequence O = {o_1 ... o_T}
  – Get the "best" Q = {q_1 ... q_T}, i.e.:
• Solution:
  1. Best state individually likely at a position i
  2. Best state given all the previously observed states and observations
   Viterbi Algorithm (a sketch follows below)
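Here is the sketch referenced above: a minimal Viterbi implementation under the same NumPy conventions as the earlier forward sketch (the names and layout are assumptions for illustration, not the slides' notation):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return (best_prob, best_path) over all state sequences for obs."""
    N, T = A.shape[0], len(obs)
    delta = pi * B[:, obs[0]]                 # best prob of ending in each state
    back = np.zeros((T, N), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] * A           # scores[i, j]: arrive at j from i
        back[t] = scores.argmax(axis=0)       # best predecessor of each j
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # follow backpointers
        path.append(int(back[t][path[-1]]))
    return float(delta.max()), path[::-1]

# With the Low/High model from the forward sketch:
# viterbi(A, B, pi, [1, 0]) -> (0.1152, [1, 1]), i.e. {High, High}
```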
Example

• Output observed: aabb
• What state sequence is most probable? Since the state sequence cannot be predicted with certainty, the machine is given the qualification "hidden".
• Note: Σ P(outlinks) = 1 for all states
Probabilities for different possible sequences

[Tree diagram: starting from probability 1, the state sequences are expanded one step at a time, e.g. (1,1) with probability 0.4 and (1,2) with 0.15; then (1,1,1) 0.16, (1,1,2) 0.06, (1,2,1) 0.0375, (1,2,2) 0.0225; then the four-state sequences (1,1,1,1), (1,1,1,2), (1,1,2,1), (1,1,2,2) with 0.016, 0.056, 0.018, 0.018 ... and so on.]
CS626-449: NLP, Speech and Web-Topics-in-AI
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
Lecture 38-39: Baum-Welch Algorithm; HMM Training
Baum-Welch algorithm

• Training the Hidden Markov Model (not structure learning, i.e., the structure of the HMM is pre-given). This involves:
  – Learning probability values ONLY

Correspondence with PCFG:
  – Not learning the production rules, but the probabilities associated with them
  – The training algorithm for PCFG is called the Inside-Outside algorithm
Key Intuition

[Diagram: a two-state machine with states q and r; the transitions between them are labeled with the output symbols a and b.]

Given: training sequence
Initialization: probability values
Compute: Pr(state seq | training seq)
  get expected count of each transition
  compute rule probabilities
Approach: initialize the probabilities and recompute them (an EM-like approach)
Building blocks: Probabilities to be used

1. Forward probability α_i(t):

   α_i(t) = P(W_{1,t-1}, S_t = s^i), t > 0
   α_i(1) = 1.0 if i = 0 (the start state), 0 otherwise

   P(W_{1,n}) = Σ_{i=1}^{T} P(W_{1,n}, S_{n+1} = s^i) = Σ_{i=1}^{T} α_i(n+1)

   α_j(t+1) = Σ_{i=1}^{T} α_i(t) · P(s^i --w_t--> s^j)

   (The outputs W_1 W_2 ... W_{n-1} W_n are emitted on the transitions between the states S_1 S_2 ... S_n S_{n+1}.)
Probabilities to be used, contd...

2. Backward probability β_i(t):

   β_i(t) = P(W_{t,n} | S_t = s^i), t ≥ 1
   β_1(1) = P(W_{1,n} | S_1 = s^1) = P(W_{1,n})
   β_i(n+1) = P(ε | S_{n+1} = s^i) = 1

   β_i(t-1) = Σ_{j=1}^{T} P(s^i --w_{t-1}--> s^j) · β_j(t)

Exercise 1: Prove the following: P(W_{1,n}) = Σ_{j=1}^{T} α_j(t) · β_j(t)
Start of Baum-Welch algorithm

[Diagram: the two-state machine with states q and r; transitions labeled with the output symbols a and b.]

String = aab aaa aab aaa

[Table: the output sequence aligned with one possible state sequence; each output symbol labels a transition between consecutive states q and r, e.g. q --a--> r, r --b--> q, and so on through the string.]
Calculating probabilities from table

Table of counts:

  Src  Dest  O/P  Count
  q    r     a    5
  q    q     b    3
  r    q     a    3
  r    q     b    2

  P(q --a--> r) = 5/8
  P(q --b--> q) = 3/8

  P(s^i --w_k--> s^j) = c(s^i --w_k--> s^j) / Σ_{l=1}^{T} Σ_{m=1}^{A} c(s^i --w_m--> s^l)

  T = #states, A = #alphabet symbols

Now, if we have non-deterministic transitions, then multiple state sequences are possible for the given output sequence (refer to the previous slide's figure). Our aim is to find the expected counts in that setting.
Interplay Between Two Equations

  P(s^i --W_k--> s^j) = C(s^i --W_k--> s^j) / Σ_{l=1}^{T} Σ_{m=1}^{A} C(s^i --W_m--> s^l)

  C(s^i --W_k--> s^j) = Σ_{S_{1,n+1}} P(S_{1,n+1} | W_{1,n}) · n(s^i --W_k--> s^j, S_{1,n+1}, W_{1,n})

where n(s^i --W_k--> s^j, S_{1,n+1}, W_{1,n}) is the number of times the transition s^i --> s^j emitting W_k occurs in the state sequence S_{1,n+1} for the string W_{1,n}.
Learning probabilities

[Diagram: the actual (desired) HMM over states q and r, with edges labeled a:0.67, b:0.17, a:0.16 out of q and b:1.0 out of r. Initial guess: the same structure with q --a--> r: 0.04, q --a--> q: 0.48, q --b--> q: 0.48, r --b--> q: 1.0 (values consistent with the P(path) column in the next slide).]
One run of Baum-Welch algorithm: string ababa

Each row is one possible state sequence for the string; each ε entry is P(path) multiplied by the number of times that transition occurs in the path:

  State seq       P(path)   ε(q -a-> r)  ε(r -b-> q)  ε(q -b-> q)  ε(q -a-> q)
  q r q r q q     0.00077   0.00154      0.00154      0            0.00077
  q r q q q q     0.00442   0.00442      0.00442      0.00442      0.00884
  q q q r q q     0.00442   0.00442      0.00442      0.00442      0.00884
  q q q q q q     0.02548   0.0          0.0          0.05096      0.07644
  Rounded total   0.035     0.01         0.01         0.06         0.095
  New prob.                 0.06         1.0          0.36         0.581

  e.g., new P(q -a-> r) = 0.01 / (0.01 + 0.06 + 0.095)

* is considered as the starting and ending symbol of the input string.

This way, through multiple iterations, the probability values will converge.
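The expected-count step in the table can be reproduced by a brute-force sketch that enumerates state sequences, weights each transition count by P(path), and renormalizes per source state. The dictionary encoding of the arc-emission model is an assumption for illustration, and unlike the table this sketch does not fix the final state to q, so the totals differ slightly:

```python
from itertools import product

def em_step(trans, init_state, string, states=('q', 'r')):
    """One EM iteration: trans maps (src, symbol, dst) -> probability."""
    counts = {}                                  # (src, sym, dst) -> expected count
    for path in product(states, repeat=len(string)):
        seq = (init_state,) + path
        p = 1.0                                  # P(path, string)
        for (src, dst), sym in zip(zip(seq, seq[1:]), string):
            p *= trans.get((src, sym, dst), 0.0)
        if p == 0.0:
            continue
        for (src, dst), sym in zip(zip(seq, seq[1:]), string):
            key = (src, sym, dst)
            counts[key] = counts.get(key, 0.0) + p   # adds P(path) per occurrence
    new = {}
    for (src, sym, dst), c in counts.items():        # renormalize per source state
        z = sum(v for (s2, _, _), v in counts.items() if s2 == src)
        new[(src, sym, dst)] = c / z
    return new

# Initial guess consistent with the P(path) column above:
guess = {('q', 'a', 'r'): 0.04, ('q', 'a', 'q'): 0.48,
         ('q', 'b', 'q'): 0.48, ('r', 'b', 'q'): 1.0}
print(em_step(guess, 'q', 'ababa'))
```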
Applying Naïve Bayes

  P(S_{1,n+1} | W_{1,n}) = P(S_{1,n+1}, W_{1,n}) / P(W_{1,n})
    = (1 / P(W_{1,n})) · P(S_1) P(S_2 W_1 | S_1) P(S_3 W_2 | S_1 S_2 W_1) P(S_4 W_3 | S_1 S_2 S_3 W_1 W_2) ...
    = (P(S_1) / P(W_{1,n})) · Π_{i=1}^{n} P(S_{i+1} W_i | S_i)

Hence multiplying the transition probabilities is valid.
Discussions
1. Symmetry breaking:
   Example: a fully symmetric initialization leads to no change in the values across iterations, so the initial guess must break the symmetry.

   [Diagram: the 'Desired' HMM vs. a symmetric 'Initialized' HMM, with edge labels a:0.5, b:0.25, a:0.25, b:1.0, a:1.0, b:0.5, a:0.5, b:0.5.]

2. Getting stuck in local maxima

3. Label bias problem:
   Probabilities have to sum to 1, so values can rise only at the cost of a fall in the values of others.
Computational part

  C(s^i --W_k--> s^j) = (1 / P(W_{1,n})) · Σ_{S_{1,n+1}} [ P(S_{1,n+1}, W_{1,n}) · n(s^i --W_k--> s^j, S_{1,n+1}, W_{1,n}) ]

  Σ_{S_{1,n+1}} P(S_{1,n+1}, W_{1,n}) · n(s^i --W_k--> s^j, S_{1,n+1}, W_{1,n})
    = Σ_{t=1}^{n} P(S_t = s^i, S_{t+1} = s^j, W_t = w_k, W_{1,n})
    = Σ_{t=1}^{n} α_i(t) · P(s^i --w_k--> s^j) · β_j(t+1)

Exercise 2: What is the complexity of calculating the above expression?

Hint: to find this, first solve Exercise 1, i.e., understand how the probability of a given string can be represented as Σ_{i=1}^{T} α_i(t) · β_i(t).
Example: dishonest casino (1)

• Fair die and loaded die
• Loaded die: probability 0.5 of a 6 and probability 0.1 for 1-5
• Switch from fair to loaded: probability 0.05
• Switch back: probability 0.1
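As a small illustration, here is a hedged Python sketch that samples rolls from the casino HMM just described (the state names and function signature are invented for the example):

```python
import random

def sample_casino(n, seed=0):
    rng = random.Random(seed)
    state, rolls, states = 'F', [], []          # F = fair, L = loaded
    for _ in range(n):
        states.append(state)
        if state == 'F':
            rolls.append(rng.randint(1, 6))     # fair die: uniform over 1..6
            state = 'L' if rng.random() < 0.05 else 'F'
        else:
            # loaded die: P(6) = 0.5, P(1..5) = 0.1 each
            rolls.append(6 if rng.random() < 0.5 else rng.randint(1, 5))
            state = 'F' if rng.random() < 0.1 else 'L'
    return rolls, states
```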



Dishonest casino (2)

• Emission probabilities: each state (fair or loaded) emits die outcomes with its own distribution; the HMM model generates (emits) the observed sequences.

[Figure: the casino HMM with its emission probabilities; not reproduced.]


Viterbi Algorithm
• Result with casino example



Three algorithms
• What is the most probable path for generating a
given sequence?
Viterbi Algorithm
• How likely is a given sequence?
Forward Algorithm
• How can we learn the HMM parameters given a
set of sequences?
Forward-Backward (Baum-Welch) Algorithm



Summary (1)
• A Markov chain is a collection of states where each state depends only on the state before it

• A hidden Markov model is a model in which the state sequence is 'hidden'


Summary (2)
• Most probable path: Viterbi algorithm
• How likely is a given sequence?: forward algorithm
• Posterior state probability: forward and backward algorithms (used for the most probable state at each observation)
