Hidden Markov Models

Julia Hirschberg
CS4705

POS Tagging using HMMs

• A different approach to POS tagging
  – A special case of Bayesian inference
  – Related to the "noisy channel" model used in MT, ASR, and other applications
POS Tagging as a Sequence ID Task

• Given a sentence (a sequence of words, or observations)
  – Secretariat is expected to race tomorrow
• What is the best sequence of tags which corresponds to this sequence of observations?
• Bayesian approach:
  – Consider all possible sequences of tags
  – Choose the tag sequence t1…tn which is most probable given the observation sequence of words w1…wn
• Out of all sequences of n tags t1…tn, we want the single tag sequence such that P(t1…tn|w1…wn) is highest
• I.e., find our best estimate of the sequence that maximizes P(t1…tn|w1…wn)
• How do we do this?
• Use Bayes' Rule to transform P(t1…tn|w1…wn) into a set of other probabilities that are easier to compute
Bayes Rule
• Bayes' Rule
• Rewrite:

  $P(t_1^n \mid w_1^n) = \dfrac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$

• Substitute:

  $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) = \arg\max_{t_1^n} \dfrac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$
• Drop the denominator, since $P(w_1^n)$ is the same for all tag sequences $t_1^n$ we consider
• So…

  $\hat{t}_1^n = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$
Likelihoods, Priors, and Simplifying Assumptions

• A1: The probability of a word depends only on its own POS tag:

  $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

• A2: The probability of a tag depends only on the previous tag:

  $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
Simplified Equation

  $\hat{t}_1^n = \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$

Now we have two probabilities to calculate:
• Probability of a word occurring given its tag
• Probability of a tag occurring given a previous tag
• We can calculate each of these from a POS-tagged corpus
Tag Transition Probabilities P(ti|ti-1)
• Determiners are likely to precede adjectives and nouns but unlikely to follow adjectives
  – The/DT red/JJ hat/NN; *Red/JJ the/DT hat/NN
  – So we expect P(NN|DT) and P(JJ|DT) to be high but P(DT|JJ) to be low
• Compute P(NN|DT) by counting in a tagged corpus:

  $P(t_i \mid t_{i-1}) = \dfrac{C(t_{i-1}, t_i)}{C(t_{i-1})}$, e.g. $P(NN \mid DT) = \dfrac{C(DT, NN)}{C(DT)}$
Word Likelihood Probabilities P(wi|ti)

• VBZ (3sg present verb) is likely to be is
• Compute P(is|VBZ) by counting in a tagged corpus:

  $P(w_i \mid t_i) = \dfrac{C(t_i, w_i)}{C(t_i)}$, e.g. $P(is \mid VBZ) = \dfrac{C(VBZ, is)}{C(VBZ)}$

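The last few slides describe estimating both probability types by counting in a tagged corpus. Below is a minimal sketch of how those counts might be collected; the corpus format (a list of sentences, each a list of (word, tag) pairs) and all names such as `train_hmm` are illustrative assumptions, not code from the course.

```python
# Minimal sketch: estimating transition and emission probabilities by counting
# in a POS-tagged corpus. Corpus format and names are illustrative assumptions.
from collections import defaultdict

def train_hmm(tagged_sentences):
    tag_bigram = defaultdict(int)   # C(t_{i-1}, t_i)
    tag_word = defaultdict(int)     # C(t_i, w_i)
    tag_count = defaultdict(int)    # C(t_i)

    for sentence in tagged_sentences:
        prev = "<s>"                # start-of-sentence pseudo-tag
        tag_count[prev] += 1
        for word, tag in sentence:
            tag_bigram[(prev, tag)] += 1
            tag_word[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag

    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    trans = {(p, t): c / tag_count[p] for (p, t), c in tag_bigram.items()}
    # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    emit = {(t, w): c / tag_count[t] for (t, w), c in tag_word.items()}
    return trans, emit

corpus = [[("The", "DT"), ("red", "JJ"), ("hat", "NN")]]
trans, emit = train_hmm(corpus)
print(trans[("DT", "JJ")], emit[("NN", "hat")])  # 1.0 1.0 on this toy corpus
```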
Some Data on race

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag for race in new data?
Disambiguating to race tomorrow

Look Up the Probabilities

• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
• So we (correctly) choose the verb reading

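A quick sanity check of the arithmetic above, using the probability values listed on this slide:

```python
# Re-compute the two products from the probabilities listed above.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e}")                 # ~2.7e-07
print(f"{p_nn:.2e}")                 # ~3.2e-10
print(p_vb > p_nn)                   # True: the verb reading wins
```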
Markov Chains
• Markov Chains
  – Have transition probabilities like P(ti|ti-1)
• A special case of weighted FSTs (FSTs which have probabilities or weights on the arcs)
• Can only represent unambiguous phenomena
A Weather Example: cold, hot, hot

Weather Markov Chain

• What is the probability of 4 consecutive rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• $P(3,3,3,3) = \pi_3 \cdot a_{33} \cdot a_{33} \cdot a_{33} = 0.2 \times (0.6)^3 = 0.0432$

Markov Chain Defined
• A set of states Q = q1, q2, …, qN
• Transition probabilities
  – A set of probabilities A = a01, a02, …, an1, …, ann
  – Each aij represents the probability of transitioning from state i to state j
  – The set of these is the transition probability matrix A

  $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$

  $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$

• Distinguished start and end states q0, qF
• Can have a special initial probability vector π instead of a start state

  $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

  – An initial distribution over the probability of start states
  – Must sum to 1:

  $\sum_{j=1}^{N} \pi_j = 1$
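A minimal sketch of this definition and of the rainy-days computation. Only π₃ = 0.2 and a₃₃ = 0.6 come from the slide's arithmetic; every other number below is an assumed placeholder, since the weather figure itself is not reproduced in this text.

```python
# Sketch: a Markov chain as an initial vector pi and transition matrix A.
# Only pi[rainy] = 0.2 and a[rainy][rainy] = 0.6 are from the slides.
import numpy as np

# States: 1 = hot, 2 = cold, 3 = rainy (indices 0, 1, 2 below)
pi = np.array([0.5, 0.3, 0.2])            # initial distribution (sums to 1)
A = np.array([[0.6, 0.2, 0.2],            # each row sums to 1: sum_j a_ij = 1
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def chain_prob(states):
    """P(q1..qT) = pi[q1] * prod_t a[q_{t-1}][q_t]"""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(chain_prob([2, 2, 2, 2]))           # 0.2 * 0.6**3 = 0.0432
```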
Hidden Markov Models
• Markov Chains are useful for representing problems in which
  – Events are observable
  – Sequences to be labeled are unambiguous
• Problems like POS tagging are neither
• HMMs are useful for computing events we cannot directly observe in the world, using other events we can observe
  – Unobservable (hidden): e.g., POS tags
  – Observable: e.g., words
  – We have to learn the relationships between them
Hidden Markov Models

• A set of states Q = q1, q2, …, qN
• Transition probabilities
  – Transition probability matrix A = {aij}

  $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$

• Observations O = o1, o2, …, oN
  – Each one a symbol from a vocabulary V = {v1, v2, …, vV}
• Observation likelihoods or emission probabilities
  – Output probability matrix B = {bi(ot)}
• Special initial probability vector π

  $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

• A set of legal accepting states QA ⊆ Q
First-Order HMM Assumptions

• Markov assumption: the probability of a state depends only on the state that precedes it

  $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$

  – This is the same Markov assumption we made when we decided to represent sentence probabilities as the product of bigram probabilities
• Output-independence assumption: the probability of an output observation depends only on the state that produced the observation

  $P(o_t \mid o_1 \ldots o_{t-1}, q_1 \ldots q_t) = P(o_t \mid q_t)$
Weather Again

Weather and Ice Cream

• You are a climatologist in the year 2799 studying global warming
• You can't find any records of the weather in Baltimore for the summer of 2007
• But you find JE's diary
• Which lists how many ice creams J ate every day that summer
• Your job: determine the (hidden) sequence of weather states that 'led to' J's (observed) ice cream behavior
Weather/Ice Cream HMM

• Hidden states: {Hot, Cold}
• Transition probabilities (A matrix) between H and C
• Observations: {1, 2, 3}, the number of ice creams eaten per day
• Goal: learn observation likelihoods between observations and weather states (output matrix B) by training the HMM on aligned input streams from a training corpus
• Result: a trained HMM for weather prediction given ice cream information alone
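A minimal sketch of how this HMM's parameters might be laid out in code. The numbers follow the usual Eisner-style ice-cream example but should be treated as assumptions; the actual values are in the figure on the next slide.

```python
# Illustrative parameters for the ice-cream/weather HMM; the numbers are
# assumptions, since the figure with the real values is not reproduced here.
import numpy as np

states = ["Hot", "Cold"]          # hidden states
obs_vocab = [1, 2, 3]             # observable ice creams per day

pi = np.array([0.8, 0.2])         # P(first day Hot), P(first day Cold)
A = np.array([[0.6, 0.4],         # P(Hot->Hot),  P(Hot->Cold)
              [0.5, 0.5]])        # P(Cold->Hot), P(Cold->Cold)
B = np.array([[0.2, 0.4, 0.4],    # P(1|Hot),  P(2|Hot),  P(3|Hot)
              [0.5, 0.4, 0.1]])   # P(1|Cold), P(2|Cold), P(3|Cold)
```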
Ice Cream/Weather HMM

HMM Configurations

• Ergodic = fully connected
• Bakis = left-to-right
What can HMMs Do?

• Likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ): How probable is a given sequence of ice cream counts?
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q: Given a sequence of ice cream counts, what was the most likely weather on those days?
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B
Likelihood: The Forward Algorithm
• Likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ)
  – E.g., what is the probability of the ice-cream sequence 3 – 1 – 3 given our weather HMM?
• Compute the probability of 3 – 1 – 3 by summing over all the possible {H,C} weather sequences, since weather is hidden:

  $P(O) = \sum_{Q} P(O \mid Q)\,P(Q)$

• Forward Algorithm:
  – A dynamic programming algorithm; stores a table of intermediate values so it need not recompute them
  – Computes P(O) by summing over the probabilities of all hidden state paths that could generate the observation sequence 3 – 1 – 3
  – Each trellis cell $\alpha_t(j)$ is computed by multiplying:
    – The previous forward path probability, $\alpha_{t-1}(i)$
    – The transition probability from the previous state to the current state, $a_{ij}$
    – The state observation likelihood of the observation $o_t$ given the current state j, $b_j(o_t)$
  – And summing over all previous states i: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\,a_{ij}\,b_j(o_t)$
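A sketch of the Forward recurrence for the illustrative ice-cream HMM above; the parameter values are the same assumed ones, not values from the slides.

```python
# Sketch of the Forward algorithm. Observation symbols 1..3 map to columns
# 0..2 of B. The pi/A/B values repeat the earlier illustrative assumptions.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])

def forward(obs, pi, A, B):
    """Return P(O | lambda) by summing over all hidden state paths."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0] - 1]              # initialization
    for t in range(1, T):
        for j in range(N):
            # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
            alpha[t, j] = np.sum(alpha[t - 1] * A[:, j]) * B[j, obs[t] - 1]
    return np.sum(alpha[T - 1])                   # sum over final states

print(forward([3, 1, 3], pi, A, B))               # P(3 - 1 - 3)
```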
Decoding: The Viterbi Algorithm
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence of weather states Q
  – E.g., given the observations 3 – 1 – 1 and an HMM, what is the best (most probable) hidden weather sequence of {H,C}?
• Viterbi algorithm
  – A dynamic programming algorithm
  – Uses a dynamic programming trellis to store the probability that the HMM is in state j after seeing the first t observations, for all states j
  – The value in each cell is computed by taking the MAX over all paths leading to this cell, i.e., the best path
  – The extension of a path from state i at time t-1 is computed by multiplying the previous Viterbi path probability $v_{t-1}(i)$, the transition probability $a_{ij}$, and the observation likelihood $b_j(o_t)$
  – The most probable path is the max over all possible previous state sequences
• Like the Forward Algorithm, but it takes the max over previous path probabilities rather than the sum
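A matching sketch of Viterbi, differing from the Forward sketch only in taking a max (plus backpointers) instead of a sum; the parameters are again the assumed illustrative ones.

```python
# Sketch of the Viterbi algorithm with the same illustrative HMM parameters.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
states = ["Hot", "Cold"]

def viterbi(obs, pi, A, B):
    """Return the most probable hidden state sequence and its probability."""
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))                          # best path probability so far
    back = np.zeros((T, N), dtype=int)            # backpointers
    v[0] = pi * B[:, obs[0] - 1]
    for t in range(1, T):
        for j in range(N):
            # v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
            scores = v[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            v[t, j] = scores[back[t, j]] * B[j, obs[t] - 1]
    # Follow backpointers from the best final state
    path = [int(np.argmax(v[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [states[i] for i in path], np.max(v[T - 1])

print(viterbi([3, 1, 1], pi, A, B))
```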
HMM Training: The Forward-Backward (Baum-Welch) Algorithm
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A (transition) and B (emission)
• Input: an unlabeled sequence of observations O and a vocabulary of possible hidden states Q
  – E.g., for ice-cream weather:
    • Observations = {1,3,2,1,3,3,…}
    • Hidden states = {H,C,C,C,H,C,…}
• Intuitions
  – Iteratively re-estimate counts, starting from an initialization for the A and B probabilities, e.g. all equiprobable
  – Estimate new probabilities by computing the forward probability for an observation, dividing the probability mass among all paths contributing to it, and computing the backward probability from the same state
• Details: left as an exercise to the reader (a sketch of one re-estimation pass follows)
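For concreteness, here is a sketch of a single Forward-Backward re-estimation pass. The observation sequence and the uniform initialization are illustrative assumptions; note that a fully uniform start leaves the states symmetric, so real implementations usually add a little randomness and iterate until the likelihood stops improving.

```python
# Sketch of one Baum-Welch re-estimation pass for a discrete HMM.
# Illustrative data and initialization; not the course's implementation.
import numpy as np

obs = np.array([3, 1, 2, 1, 3, 3]) - 1            # map ice creams 1..3 to 0..2
N, V, T = 2, 3, len(obs)                          # states, vocab size, length

pi = np.full(N, 1.0 / N)                          # equiprobable initialization
A = np.full((N, N), 1.0 / N)                      # (uniform init keeps states
B = np.full((N, V), 1.0 / V)                      #  symmetric; perturb in practice)

# E-step: forward and backward probabilities
alpha = np.zeros((T, N))
beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
likelihood = alpha[T - 1].sum()

# Expected counts: gamma_t(i) = P(q_t=i | O), xi_t(i,j) = P(q_t=i, q_{t+1}=j | O)
gamma = alpha * beta / likelihood
xi = np.array([np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / likelihood
               for t in range(T - 1)])

# M-step: re-estimate pi, A, B from the expected counts
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.vstack([gamma[obs == k].sum(axis=0) for k in range(V)]).T
B_new /= gamma.sum(axis=0)[:, None]

print(likelihood, pi_new, A_new, B_new, sep="\n")
```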
Very Important

• Check the errata for Chapter 6 – there are some important errors in the equations
• Please use clic.cs.columbia.edu, not cluster.cs.columbia.edu, for the HW
Summary
• HMMs are a major tool for relating observed sequences to hidden information that explains or predicts the observations
• The Forward, Viterbi, and Forward-Backward algorithms are used for computing likelihoods, decoding, and training HMMs
• Next week: Sameer Maskey will discuss Maximum Entropy Models and other Machine Learning approaches to NLP tasks
• These will be very useful in HW2
