Hidden Markov Models
Julia Hirschberg
CS 4705
POS Tagging using HMMs
POS Tagging as a Sequence ID Task
• Out of all sequences of n tags t1…tn, we want the single tag sequence such that P(t1…tn|w1…wn) is highest
Bayes Rule
• By Bayes Rule, rewrite:
  P(t1…tn | w1…wn) = P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)
• Substitute this into the argmax
• Drop the denominator, since P(w1…wn) is the same for every tag sequence t1…tn we consider
• So… the best tag sequence is
  t̂1…tn = argmax over t1…tn of P(w1…wn | t1…tn) P(t1…tn)
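A restatement of the derivation above as a single chain of equalities (display math only; nothing beyond the slide's own steps):

```latex
\hat{t}_{1\ldots n}
  = \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)
  = \arg\max_{t_1 \ldots t_n}
      \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)}
  = \arg\max_{t_1 \ldots t_n} P(w_1 \ldots w_n \mid t_1 \ldots t_n)\, P(t_1 \ldots t_n)
```

The last step drops P(w1…wn) because it is the same for every candidate tag sequence.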
Likelihoods, Priors, and Simplifying Assumptions
Some Data on race
Disambiguating "to race tomorrow"
Look Up the Probabilities
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN) = .00000000032
• So we (correctly) choose the verb reading
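A quick arithmetic check of the comparison above, using only the probabilities listed on this slide (a minimal sketch; the variable names are ours):

```python
# Probabilities taken directly from the slide.
p_verb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"verb reading: {p_verb:.2e}")  # ~2.7e-07
print(f"noun reading: {p_noun:.2e}")  # ~3.2e-10
print("tag race as:", "VB" if p_verb > p_noun else "NN")  # VB -- the verb reading wins
```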
Markov Chains
• Markov Chains
– Have transition probabilities like P(ti|ti-1)
• A special case of weighted FSTs (FSTs which
have probabilities or weights on the arcs)
• Can only represent unambiguous phenomena
A Weather Example: cold, hot, hot
Weather Markov Chain
Markov Chain Defined
• A set of states Q = q1, q2…qN
• Transition probabilities
– A set of probabilities A = a01a02…an1…ann.
– Each aij represents the probability of transitioning from state i
to state j
– The set of these is the transition probability matrix A
– Each row must sum to 1: Σj=1…N aij = 1, for 1 ≤ i ≤ N
• Can have a special initial probability vector π instead of a start state
– πi = P(q1 = i), 1 ≤ i ≤ N
– An initial probability distribution over start states
– Must sum to 1: Σj=1…N πj = 1
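As a concrete sketch of the definition above, here is the two-state weather chain in code. The HOT/COLD probability values are illustrative assumptions (the slides do not give them); only the structure — a transition matrix A whose rows sum to 1 and an initial vector π that sums to 1 — comes from the definition.

```python
import random

states = ["HOT", "COLD"]

# Transition probability matrix A: A[i][j] = probability of moving from state i to state j.
# Each row sums to 1, per the constraint above. Values are illustrative, not from the slides.
A = {
    "HOT":  {"HOT": 0.7, "COLD": 0.3},
    "COLD": {"HOT": 0.4, "COLD": 0.6},
}

# Initial probability vector pi: pi[i] = P(q1 = i); also sums to 1.
pi = {"HOT": 0.8, "COLD": 0.2}

def sample_weather(length):
    """Sample a weather sequence (e.g. cold, hot, hot) from the Markov chain."""
    seq = random.choices(states, weights=[pi[s] for s in states])
    while len(seq) < length:
        prev = seq[-1]
        seq += random.choices(states, weights=[A[prev][s] for s in states])
    return seq

print(sample_weather(3))  # e.g. ['COLD', 'HOT', 'HOT']
```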
Hidden Markov Models
• Markov Chains are useful for representing
problems in which
– Events are directly observable
– Sequences to be labeled are unambiguous
• Problems like POS tagging are neither
• HMMs are useful for computing events we cannot
directly observe in the world, using other events
we can observe
– Unobservable (Hidden): e.g., POS tags
– Observable: e.g., words
– We have to learn the relationships
Hidden Markov Models
• States Q and transition probabilities A, as in a Markov chain: Σj=1…N aij = 1, 1 ≤ i ≤ N
• Plus observation likelihoods B: bj(ot) is the probability of observation ot being generated from state j
First-Order HMM Assumptions
• Markov assumption: the current state depends only on the previous state, P(qi | q1…qi-1) = P(qi | qi-1)
• Output independence: each observation depends only on the state that produced it, P(oi | q1…qT, o1…oT) = P(oi | qi)
Weather Again
Weather and Ice Cream
Weather/Ice Cream HMM
HMM Configurations
• Ergodic = fully-connected
• Bakis = left-to-right
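To make the two configurations concrete, here is a sketch of what their transition matrices look like for three states (the specific values are illustrative; what matters is which entries may be non-zero):

```python
# Ergodic (fully-connected): every state can transition to every other state,
# so any entry of A may be non-zero.
A_ergodic = [
    [0.4, 0.3, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]

# Bakis (left-to-right): a state can only stay put or move to a later state,
# so A is upper-triangular -- no transitions back to earlier states.
A_bakis = [
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
```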
What can HMMs Do?
Likelihood: The Forward Algorithm
• Likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ)
– E.g., what is the probability of the ice-cream sequence 3 – 1 – 3 given our HMM λ?
• Compute the total probability of 3 – 1 – 3 by summing over all possible {H,C} weather sequences Q – since the weather is hidden:
  P(O) = ΣQ P(O | Q) P(Q)
• Forward Algorithm:
– Dynamic Programming algorithm, stores table of
intermediate values so it need not recompute them
– Computes P(O) by summing over the probabilities of all hidden state paths that could generate the observation sequence 3 – 1 – 3
– Each cell αt(j) sums, over all previous states i, the product of:
– The previous forward path probability αt-1(i)
– The transition probability aij from the previous state i to the current state j
– The state observation likelihood bj(ot) of the observation ot given the current state j
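A minimal sketch of the forward algorithm for the weather/ice-cream HMM. The observation sequence 3 – 1 – 3 is from the slides; the HOT/COLD transition, emission, and initial probabilities are assumed illustrative values, since the slides' figures are not reproduced here.

```python
states = ["HOT", "COLD"]

# Assumed illustrative parameters for lambda = (A, B), plus the initial distribution pi.
pi = {"HOT": 0.8, "COLD": 0.2}                       # P(q1 = i)
A  = {"HOT":  {"HOT": 0.7, "COLD": 0.3},             # a_ij = P(state j | state i)
      "COLD": {"HOT": 0.4, "COLD": 0.6}}
B  = {"HOT":  {1: 0.2, 2: 0.4, 3: 0.4},              # b_j(o) = P(o ice creams | state j)
      "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    """Return P(O) by summing over all hidden state paths, reusing partial sums."""
    # alpha[t][j]: probability of the first t+1 observations, ending in state j at time t.
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({
            # alpha_{t-1}(i) * a_ij * b_j(o_t), summed over previous states i
            j: sum(prev[i] * A[i][j] * B[j][o] for i in states)
            for j in states
        })
    return sum(alpha[-1].values())

print(forward([3, 1, 3]))  # likelihood of the ice-cream sequence 3, 1, 3
```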
Decoding: The Viterbi Algorithm
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best sequence of hidden weather states in Q
– E.g., given the observations 3 – 1 – 1 and an HMM, what is the best (most probable) hidden weather sequence of {H,C}?
• Viterbi algorithm
– Dynamic programming algorithm
– Uses a dynamic programming trellis to store the probability that the HMM is in state j after seeing the first t observations, for all states j
– The value in each cell is computed by taking the MAX over all paths leading to that cell – i.e., the best path
– The extension of a path from state i at time t-1 is computed by multiplying:
– The previous Viterbi path probability vt-1(i)
– The transition probability aij from state i to the current state j
– The state observation likelihood bj(ot) of the observation ot given the current state j
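A matching sketch of Viterbi decoding, using the same assumed illustrative HOT/COLD parameters as the forward sketch above (repeated here so the snippet runs on its own). The recursion is the forward recursion with the sum replaced by a max, plus backpointers to recover the best path.

```python
states = ["HOT", "COLD"]
pi = {"HOT": 0.8, "COLD": 0.2}                                              # assumed values
A  = {"HOT": {"HOT": 0.7, "COLD": 0.3}, "COLD": {"HOT": 0.4, "COLD": 0.6}}  # assumed values
B  = {"HOT": {1: 0.2, 2: 0.4, 3: 0.4}, "COLD": {1: 0.5, 2: 0.4, 3: 0.1}}    # assumed values

def viterbi(obs):
    """Return the most probable hidden state sequence for obs, and its probability."""
    # v[t][j]: probability of the best path that ends in state j after t+1 observations.
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    backptr = [{}]
    for o in obs[1:]:
        prev, col, bp = v[-1], {}, {}
        for j in states:
            # Extend each path from state i by v_{t-1}(i) * a_ij * b_j(o_t); keep the max.
            best_i = max(states, key=lambda i: prev[i] * A[i][j])
            col[j] = prev[best_i] * A[best_i][j] * B[j][o]
            bp[j] = best_i
        v.append(col)
        backptr.append(bp)
    # Termination: pick the best final state, then follow backpointers to time 1.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for bp in reversed(backptr[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path)), v[-1][last]

print(viterbi([3, 1, 1]))  # best {H,C} sequence for the observations 3, 1, 1
```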