Hidden Markov Models

Julia Hirschberg
CS4705

POS Tagging using HMMs

• A different approach to POS tagging
  – A special case of Bayesian inference
  – Related to the "noisy channel" model used in MT, ASR, and other applications
POS Tagging as a Sequence ID Task

• Given a sentence (a sequence of words, or observations)
  – Secretariat is expected to race tomorrow
• What is the best sequence of tags which corresponds to this sequence of observations?
• Bayesian approach:
  – Consider all possible sequences of tags
  – Choose the tag sequence t1…tn which is most probable given the observation sequence of words w1…wn
• Out of all sequences of n tags t1…tn, we want the single tag sequence such that P(t1…tn|w1…wn) is highest
• I.e., find our best estimate of the sequence that maximizes P(t1…tn|w1…wn)
• How do we do this?
• Use Bayes' Rule to transform P(t1…tn|w1…wn) into a set of other probabilities that are easier to compute
Bayes Rule
• Bayes' Rule
• Rewrite:

  $P(t_1^n \mid w_1^n) = \dfrac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$

• Substitute:

  $\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid w_1^n) = \arg\max_{t_1^n} \dfrac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$
• Drop the denominator, since $P(w_1^n)$ is the same for all tag sequences $t_1^n$ we consider
• So…

  $\hat{t}_1^n = \arg\max_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$
Likelihoods, Priors, and Simplifying Assumptions

• A1: The probability of a word depends only on its own POS tag:

  $P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$

• A2: The probability of a tag depends only on the previous tag:

  $P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$
Simplified Equation

  $\hat{t}_1^n = \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$

Now we have two probabilities to calculate:
• Probability of a word occurring given its tag
• Probability of a tag occurring given a previous tag
• We can calculate each of these from a POS-tagged corpus
Tag Transition Probabilities P(ti|ti-1)
• Determiners are likely to precede adjectives and nouns but unlikely to follow adjectives
  – The/DT red/JJ hat/NN; *Red/JJ the/DT hat/NN
  – So we expect P(NN|DT) and P(JJ|DT) to be high but P(DT|JJ) to be low
• Compute P(NN|DT) by counting in a tagged corpus:

  $P(t_i \mid t_{i-1}) = \dfrac{C(t_{i-1}, t_i)}{C(t_{i-1})}$, e.g. $P(NN \mid DT) = \dfrac{C(DT, NN)}{C(DT)}$
Word Likelihood Probabilities P(wi|ti)

• VBZ (3sg present verb) is likely to be is
• Compute P(is|VBZ) by counting in a tagged corpus:

  $P(w_i \mid t_i) = \dfrac{C(t_i, w_i)}{C(t_i)}$, e.g. $P(is \mid VBZ) = \dfrac{C(VBZ, is)}{C(VBZ)}$

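The last few slides describe estimating both probability types by counting in a tagged corpus. Below is a minimal sketch of how those counts might be collected; the corpus format (a list of sentences, each a list of (word, tag) pairs) and all names such as `train_hmm` are illustrative assumptions, not code from the course.

```python
# Minimal sketch: estimating transition and emission probabilities by counting
# in a POS-tagged corpus. Corpus format and names are illustrative assumptions.
from collections import defaultdict

def train_hmm(tagged_sentences):
    tag_bigram = defaultdict(int)   # C(t_{i-1}, t_i)
    tag_word = defaultdict(int)     # C(t_i, w_i)
    tag_count = defaultdict(int)    # C(t_i)

    for sentence in tagged_sentences:
        prev = "<s>"                # start-of-sentence pseudo-tag
        tag_count[prev] += 1
        for word, tag in sentence:
            tag_bigram[(prev, tag)] += 1
            tag_word[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag

    # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    trans = {(p, t): c / tag_count[p] for (p, t), c in tag_bigram.items()}
    # P(w_i | t_i) = C(t_i, w_i) / C(t_i)
    emit = {(t, w): c / tag_count[t] for (t, w), c in tag_word.items()}
    return trans, emit

corpus = [[("The", "DT"), ("red", "JJ"), ("hat", "NN")]]
trans, emit = train_hmm(corpus)
print(trans[("DT", "JJ")], emit[("NN", "hat")])  # 1.0 1.0 on this toy corpus
```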
Some Data on race

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag for race in new data?
Disambiguating to race tomorrow

Look Up the Probabilities

• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
• So we (correctly) choose the verb reading

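A quick sanity check of the arithmetic above, using the probability values listed on this slide:

```python
# Re-compute the two products from the probabilities listed above.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e}")                 # ~2.7e-07
print(f"{p_nn:.2e}")                 # ~3.2e-10
print(p_vb > p_nn)                   # True: the verb reading wins
```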
Markov Chains
• Markov Chains
  – Have transition probabilities like P(ti|ti-1)
• A special case of weighted FSTs (FSTs which have probabilities or weights on the arcs)
• Can only represent unambiguous phenomena
A Weather Example: cold, hot, hot

Weather Markov Chain

• What is the probability of 4 consecutive rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• $P(3,3,3,3) = \pi_3 \cdot a_{33} \cdot a_{33} \cdot a_{33} = 0.2 \times (0.6)^3 = 0.0432$

Markov Chain Defined
• A set of states Q = q1, q2, …, qN
• Transition probabilities
  – A set of probabilities A = a01, a02, …, an1, …, ann
  – Each aij represents the probability of transitioning from state i to state j
  – The set of these is the transition probability matrix A

  $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$

  $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$

• Distinguished start and end states q0, qF
• Can have a special initial probability vector π instead of a start state

  $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

  – An initial distribution over the probability of start states
  – Must sum to 1:

  $\sum_{j=1}^{N} \pi_j = 1$
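A minimal sketch of this definition and of the rainy-days computation. Only π₃ = 0.2 and a₃₃ = 0.6 come from the slide's arithmetic; every other number below is an assumed placeholder, since the weather figure itself is not reproduced in this text.

```python
# Sketch: a Markov chain as an initial vector pi and transition matrix A.
# Only pi[rainy] = 0.2 and a[rainy][rainy] = 0.6 are from the slides.
import numpy as np

# States: 1 = hot, 2 = cold, 3 = rainy (indices 0, 1, 2 below)
pi = np.array([0.5, 0.3, 0.2])            # initial distribution (sums to 1)
A = np.array([[0.6, 0.2, 0.2],            # each row sums to 1: sum_j a_ij = 1
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])

def chain_prob(states):
    """P(q1..qT) = pi[q1] * prod_t a[q_{t-1}][q_t]"""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(chain_prob([2, 2, 2, 2]))           # 0.2 * 0.6**3 = 0.0432
```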
Hidden Markov Models
• Markov Chains are useful for representing problems in which
  – Events are observable
  – Sequences to be labeled are unambiguous
• Problems like POS tagging are neither
• HMMs are useful for computing events we cannot directly observe in the world, using other events we can observe
  – Unobservable (hidden): e.g., POS tags
  – Observable: e.g., words
  – We have to learn the relationships between them
Hidden Markov Models

• A set of states Q = q1, q2, …, qN
• Transition probabilities
  – Transition probability matrix A = {aij}

  $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$

• Observations O = o1, o2, …, oN
  – Each one a symbol from a vocabulary V = {v1, v2, …, vV}
• Observation likelihoods or emission probabilities
  – Output probability matrix B = {bi(ot)}
• Special initial probability vector π

  $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

• A set of legal accepting states QA ⊆ Q
First-Order HMM Assumptions

• Markov assumption: the probability of a state depends only on the state that precedes it

  $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$

  – This is the same Markov assumption we made when we decided to represent sentence probabilities as the product of bigram probabilities
• Output-independence assumption: the probability of an output observation depends only on the state that produced the observation

  $P(o_t \mid o_1 \ldots o_{t-1}, q_1 \ldots q_t) = P(o_t \mid q_t)$
Weather Again

Weather and Ice Cream

• You are a climatologist in the year 2799 studying global warming
• You can't find any records of the weather in Baltimore for the summer of 2007
• But you find JE's diary
• Which lists how many ice creams J ate every day that summer
• Your job: determine the (hidden) sequence of weather states that 'led to' J's (observed) ice cream behavior
Weather/Ice Cream HMM

• Hidden states: {Hot, Cold}
• Transition probabilities (A matrix) between H and C
• Observations: {1, 2, 3}, the number of ice creams eaten per day
• Goal: learn observation likelihoods between observations and weather states (output matrix B) by training the HMM on aligned input streams from a training corpus
• Result: a trained HMM for weather prediction given ice cream information alone
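A minimal sketch of how this HMM's parameters might be laid out in code. The numbers follow the usual Eisner-style ice-cream example but should be treated as assumptions; the actual values are in the figure on the next slide.

```python
# Illustrative parameters for the ice-cream/weather HMM; the numbers are
# assumptions, since the figure with the real values is not reproduced here.
import numpy as np

states = ["Hot", "Cold"]          # hidden states
obs_vocab = [1, 2, 3]             # observable ice creams per day

pi = np.array([0.8, 0.2])         # P(first day Hot), P(first day Cold)
A = np.array([[0.6, 0.4],         # P(Hot->Hot),  P(Hot->Cold)
              [0.5, 0.5]])        # P(Cold->Hot), P(Cold->Cold)
B = np.array([[0.2, 0.4, 0.4],    # P(1|Hot),  P(2|Hot),  P(3|Hot)
              [0.5, 0.4, 0.1]])   # P(1|Cold), P(2|Cold), P(3|Cold)
```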
Ice Cream/Weather HMM

HMM Configurations

• Ergodic = fully connected
• Bakis = left-to-right
What can HMMs Do?

• Likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ): How probable is a given sequence of ice cream counts?
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence Q: Given a sequence of ice cream counts, what was the most likely weather on those days?
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B
Likelihood: The Forward Algorithm
• Likelihood: Given an HMM λ = (A,B) and an observation sequence O, determine the likelihood P(O | λ)
  – E.g., what is the probability of the ice-cream sequence 3 – 1 – 3 given our weather HMM?
• Compute the probability of 3 – 1 – 3 by summing over all the possible {H,C} weather sequences, since weather is hidden:

  $P(O) = \sum_{Q} P(O \mid Q)\,P(Q)$

• Forward Algorithm:
  – A dynamic programming algorithm; stores a table of intermediate values so it need not recompute them
  – Computes P(O) by summing over the probabilities of all hidden state paths that could generate the observation sequence 3 – 1 – 3
  – Each trellis cell $\alpha_t(j)$ is computed by multiplying:
    – The previous forward path probability, $\alpha_{t-1}(i)$
    – The transition probability from the previous state to the current state, $a_{ij}$
    – The state observation likelihood of the observation $o_t$ given the current state j, $b_j(o_t)$
  – And summing over all previous states i: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\,a_{ij}\,b_j(o_t)$
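A sketch of the Forward recurrence for the illustrative ice-cream HMM above; the parameter values are the same assumed ones, not values from the slides.

```python
# Sketch of the Forward algorithm. Observation symbols 1..3 map to columns
# 0..2 of B. The pi/A/B values repeat the earlier illustrative assumptions.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])

def forward(obs, pi, A, B):
    """Return P(O | lambda) by summing over all hidden state paths."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0] - 1]              # initialization
    for t in range(1, T):
        for j in range(N):
            # alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
            alpha[t, j] = np.sum(alpha[t - 1] * A[:, j]) * B[j, obs[t] - 1]
    return np.sum(alpha[T - 1])                   # sum over final states

print(forward([3, 1, 3], pi, A, B))               # P(3 - 1 - 3)
```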
Decoding: The Viterbi Algorithm
• Decoding: Given an observation sequence O and an HMM λ = (A,B), discover the best hidden state sequence of weather states Q
  – E.g., given the observations 3 – 1 – 1 and an HMM, what is the best (most probable) hidden weather sequence of {H,C}?
• Viterbi algorithm
  – A dynamic programming algorithm
  – Uses a dynamic programming trellis to store the probability that the HMM is in state j after seeing the first t observations, for all states j
  – The value in each cell is computed by taking the MAX over all paths leading to this cell, i.e., the best path
  – The extension of a path from state i at time t-1 is computed by multiplying the previous Viterbi path probability $v_{t-1}(i)$, the transition probability $a_{ij}$, and the observation likelihood $b_j(o_t)$
  – The most probable path is the max over all possible previous state sequences
• Like the Forward Algorithm, but it takes the max over previous path probabilities rather than the sum
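A matching sketch of Viterbi, differing from the Forward sketch only in taking a max (plus backpointers) instead of a sum; the parameters are again the assumed illustrative ones.

```python
# Sketch of the Viterbi algorithm with the same illustrative HMM parameters.
import numpy as np

pi = np.array([0.8, 0.2])
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
states = ["Hot", "Cold"]

def viterbi(obs, pi, A, B):
    """Return the most probable hidden state sequence and its probability."""
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))                          # best path probability so far
    back = np.zeros((T, N), dtype=int)            # backpointers
    v[0] = pi * B[:, obs[0] - 1]
    for t in range(1, T):
        for j in range(N):
            # v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
            scores = v[t - 1] * A[:, j]
            back[t, j] = np.argmax(scores)
            v[t, j] = scores[back[t, j]] * B[j, obs[t] - 1]
    # Follow backpointers from the best final state
    path = [int(np.argmax(v[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [states[i] for i in path], np.max(v[T - 1])

print(viterbi([3, 1, 1], pi, A, B))
```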
HMM Training: The Forward-Backward (Baum-Welch) Algorithm
• Learning: Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A (transition) and B (emission)
• Input: an unlabeled sequence of observations O and a vocabulary of possible hidden states Q
  – E.g., for ice-cream weather:
    • Observations = {1,3,2,1,3,3,…}
    • Hidden states = {H,C,C,C,H,C,…}
• Intuitions
  – Iteratively re-estimate counts, starting from an initialization for the A and B probabilities, e.g. all equiprobable
  – Estimate new probabilities by computing the forward probability for an observation, dividing the probability mass among all paths contributing to it, and computing the backward probability from the same state
• Details: left as an exercise to the reader (a sketch of one re-estimation pass follows)
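For concreteness, here is a sketch of a single Forward-Backward re-estimation pass. The observation sequence and the uniform initialization are illustrative assumptions; note that a fully uniform start leaves the states symmetric, so real implementations usually add a little randomness and iterate until the likelihood stops improving.

```python
# Sketch of one Baum-Welch re-estimation pass for a discrete HMM.
# Illustrative data and initialization; not the course's implementation.
import numpy as np

obs = np.array([3, 1, 2, 1, 3, 3]) - 1            # map ice creams 1..3 to 0..2
N, V, T = 2, 3, len(obs)                          # states, vocab size, length

pi = np.full(N, 1.0 / N)                          # equiprobable initialization
A = np.full((N, N), 1.0 / N)                      # (uniform init keeps states
B = np.full((N, V), 1.0 / V)                      #  symmetric; perturb in practice)

# E-step: forward and backward probabilities
alpha = np.zeros((T, N))
beta = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
likelihood = alpha[T - 1].sum()

# Expected counts: gamma_t(i) = P(q_t=i | O), xi_t(i,j) = P(q_t=i, q_{t+1}=j | O)
gamma = alpha * beta / likelihood
xi = np.array([np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / likelihood
               for t in range(T - 1)])

# M-step: re-estimate pi, A, B from the expected counts
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.vstack([gamma[obs == k].sum(axis=0) for k in range(V)]).T
B_new /= gamma.sum(axis=0)[:, None]

print(likelihood, pi_new, A_new, B_new, sep="\n")
```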
Very Important

• Check the errata for Chapter 6 – there are some important errors in the equations
• Please use clic.cs.columbia.edu, not cluster.cs.columbia.edu, for the HW
Summary
• HMMs are a major tool for relating observed sequences to hidden information that explains or predicts the observations
• The Forward, Viterbi, and Forward-Backward algorithms are used for computing likelihoods, decoding, and training HMMs
• Next week: Sameer Maskey will discuss Maximum Entropy Models and other Machine Learning approaches to NLP tasks
• These will be very useful in HW2
