
Machine Learning

for natural language processing


Hidden Markov Models

Laura Kallmeyer

Heinrich-Heine-Universität Düsseldorf

Summer 2016

1 / 33
Introduction

So far, we have classified texts/observations independently of the class of the previous text/observation.
Today: a sequence classifier, assigning a sequence of classes to a sequence of observations.
Jurafsky & Martin (2015), chapter 8

2 / 33
Table of contents

1 Motivation

2 Hidden Markov Models

3 Likelihood computation

4 Decoding

5 Training

3 / 33
Motivation

Hidden Markov Models (HMMs)

are weighted finite state automata with states emitting outputs.
Transitions and emissions are weighted.

4 / 33
Motivation
HMM

[Figure: a weighted automaton with states qa and qb plus special start and end states; transition weights 0.1 and 0.9 on its edges.]

in qa, a is emitted with probability 0.9; in qa, b is emitted with probability 0.1;
in qb, b is emitted with probability 0.9; in qb, a is emitted with probability 0.1.

Given a sequence bb, what is the most probable path traversed by the automaton, resulting in this output?

The states in the automaton correspond to classes, i.e., the search for
the most probable path amounts to the search for the most probable
class sequence.

Example
POS Tagging: the classes are POS tags, the emissions are words
5 / 33
Hidden Markov Models
A Markov chain is a weighted automaton in which
weights are probabilities, i.e., all weights are between 0 and 1 and the sum of the weights of all outgoing edges of a state is 1, and
the input sequence uniquely determines the states the automaton goes through.

A Markov chain is actually a bigram language model.


Markov chain for the LM example from slide 9:

[Figure: a Markov chain with states start, I, am, Sam, . . . , end and transition probabilities such as 1/3, 2/3, and 1/2 on its edges.]
6 / 33
Hidden Markov Models
Markov chains are useful when we want to compute the probability
for a sequence of events that we can observe.

Hidden Markov Models compute probabilities for a sequence of hidden events, given a sequence of observed events.

Example from Jurafsky & Martin (2015), after Eisner (2002)


HMM for inferring weather (hot = H, cold = C) from numbers of ice
creams eaten by Jason.

[Figure: HMM with hidden states H and C plus special start and end states; transition probabilities: start → H 0.8, start → C 0.2; H → H 0.6, H → C 0.3, H → end 0.1; C → H 0.4, C → C 0.5, C → end 0.1.]

P(1∣H) = 0.2   P(1∣C) = 0.5
P(2∣H) = 0.4   P(2∣C) = 0.4
P(3∣H) = 0.4   P(3∣C) = 0.1

7 / 33
Hidden Markov Models

HMM
An HMM is a tuple ⟨Q, A, O, B, q0, qF⟩, with
Q = {q1, . . . , qN } a finite set of states;
A a ∣Q∣ × ∣Q∣ matrix, the transition probability matrix, with 0 ≤ aij ≤ 1 for all 1 ≤ i ≤ ∣Q∣, 1 ≤ j ≤ ∣Q∣;
O a finite set of observations;
a sequence B of ∣Q∣ functions bi : O → R, the emission probabilities, with ∑_{o∈O} bi(o) = 1 for all 1 ≤ i ≤ ∣Q∣;
q0, qF ∉ Q special start and end states, associated with transition probabilities a0i and aiF (0 ≤ a0i, aiF ≤ 1) for all 1 ≤ i ≤ ∣Q∣.

For all i, 0 ≤ i ≤ ∣Q∣, it holds that ∑_{j=1}^{∣Q∣} aij + aiF = 1.
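To make the definition concrete, here is a minimal sketch of one way to represent the tuple ⟨Q, A, O, B, q0, qF⟩ in Python, instantiated with the ice-cream weather example from the previous slide (the variable names are illustrative, not part of the slides):

```python
import numpy as np

Q = ["H", "C"]                      # hidden states
O = [1, 2, 3]                       # observation alphabet: ice creams eaten

# Transition probability matrix A (rows: from-state, columns: to-state),
# plus the special start and end transitions a_0i and a_iF.
A = np.array([[0.6, 0.3],           # from H to H, C
              [0.4, 0.5]])          # from C to H, C
a_start = np.array([0.8, 0.2])      # a_0i: start -> H, C
a_end   = np.array([0.1, 0.1])      # a_iF: H -> end, C -> end

# Emission probabilities B: b_i(o) for each state i and observation o.
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

# Constraints from the definition: each row of A plus the end transition
# sums to 1, and each state's emission distribution sums to 1.
assert np.allclose(A.sum(axis=1) + a_end, 1.0)
assert all(abs(sum(B[s].values()) - 1.0) < 1e-9 for s in Q)
```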

8 / 33
Hidden Markov Models

First order HMMs make the following simplifying assumptions:

1 Markov assumption: In a sequence of states qi1 . . . qin, we assume

P(qin ∣ qi1 . . . qin−1) = P(qin ∣ qin−1) (= ain−1,in)

2 Output independence: In a sequence of states qi1 . . . qij . . . qin with the associated output sequence oi1 . . . oij . . . oin, we assume

P(oij ∣ qi1 . . . qin, oi1 . . . oin) = P(oij ∣ qij) (= bij(oij))

9 / 33
Hidden Markov Models

The following three fundamental problems have to be solved:

1 Likelihood: Given an HMM and an observation sequence,


determine the likelihood of the observation sequence.
2 Decoding: Given an HMM and an observation sequence, dis-
cover the best hidden state sequence leading to these observa-
tions.
3 Learning: Given the set of states of an HMM and an observa-
tion sequence, learn the HMM parameters A and B.

10 / 33
Likelihood computation
Given an HMM and an observation o = o1 . . . on , determine the
likelihood of the observation sequence o.
For a specific state sequence q = q1 . . . qn ∈ Q n , we have

P(o, q) = P(o∣q)P(q)
and with our independence assumptions
P(o, q) = ∏_{i=1}^{n} P(oi ∣ qi) ⋅ ∏_{i=1}^{n} P(qi ∣ qi−1)

For the probability of o, we have to sum over all possible state sequences:

P(o) = ∑_{q∈Q^n} ∏_{i=1}^{n} P(oi ∣ qi) ⋅ ∏_{i=1}^{n} P(qi ∣ qi−1) ⋅ P(qF ∣ qn)
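For very short sequences, this sum can be evaluated directly by enumerating all state sequences; the following sketch (assuming the parameter arrays from the earlier snippet) is only meant as a sanity check, since it is exponential in n:

```python
from itertools import product

def brute_force_likelihood(obs, A, a_start, a_end, B, states):
    """P(o): sum over all |Q|^n state sequences of P(o|q) * P(q),
    including the transitions from the start state and into the end state."""
    n, N = len(obs), len(states)
    total = 0.0
    for seq in product(range(N), repeat=n):
        p = a_start[seq[0]] * B[states[seq[0]]][obs[0]]
        for i in range(1, n):
            p *= A[seq[i - 1], seq[i]] * B[states[seq[i]]][obs[i]]
        p *= a_end[seq[-1]]                  # transition into q_F
        total += p
    return total

# With the ice-cream parameters, brute_force_likelihood([3, 1, 3], A, a_start, a_end, B, Q)
# gives roughly 2.19e-3, the same value the forward algorithm computes more efficiently.
```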

11 / 33
Likelihood computation

Computing each element of the sum independently of the others would lead to exponential time complexity.
Instead, we adopt the forward algorithm, which uses dynamic programming.
Idea: We fill an n × ∣Q∣ matrix α such that

αij = P(o1 . . . oi , qi = qj )

Forward algorithm
1 α1j = a0j bj(o1) for 1 ≤ j ≤ ∣Q∣
2 αij = ∑_{k=1}^{∣Q∣} αi−1,k akj bj(oi) for 1 < i ≤ n and 1 ≤ j ≤ ∣Q∣
3 P(o) = ∑_{k=1}^{∣Q∣} αnk akF
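A compact sketch of the forward recursion, using the same illustrative parameter layout as the earlier snippets (the function name is not from the slides):

```python
import numpy as np

def forward(obs, A, a_start, a_end, B, states):
    """Fill the n x |Q| matrix alpha (0-based rows: alpha[i-1, j] is the slides' alpha_ij)
    and return it together with P(o)."""
    n, N = len(obs), len(states)
    alpha = np.zeros((n, N))
    for j, s in enumerate(states):                  # initialization: alpha_1j = a_0j b_j(o_1)
        alpha[0, j] = a_start[j] * B[s][obs[0]]
    for i in range(1, n):                           # recursion: sum over predecessor states k
        for j, s in enumerate(states):
            alpha[i, j] = (alpha[i - 1] @ A[:, j]) * B[s][obs[i]]
    return alpha, float(alpha[-1] @ a_end)          # termination: P(o) = sum_k alpha_nk a_kF

# forward([3, 1, 3], A, a_start, a_end, B, Q) yields P(313) ≈ 2.19e-3 (cf. the next slide).
```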

12 / 33
Likelihood computation

Ice cream weather example continued


[Figure: the ice-cream weather HMM from slide 7, with its transition and emission probabilities.]

Forward trellis for P(313) (observation sequence 3 1 3):

        o1 = 3       o2 = 1        o3 = 3
H       0.32         4 ⋅ 10−2      1.81 ⋅ 10−2
C       2 ⋅ 10−2     5.3 ⋅ 10−2    3.85 ⋅ 10−3

using α1j = a0j bj(o1), αij = ∑_{k=1}^{∣Q∣} αi−1,k akj bj(oi), P(o) = ∑_{k=1}^{∣Q∣} αnk akF

P(313) = (1.81 ⋅ 10−2 + 3.85 ⋅ 10−3) ⋅ 0.1 ≈ 2.19 ⋅ 10−3

13 / 33
Likelihood computation

Alternatively, we can also compute the backward probability β.
For a given observation oi in our observation sequence, the forward probability considers the part from o1 to oi while the backward probability considers the part from oi+1 to on.
Idea: We fill an n × ∣Q∣ matrix β such that

βij = P(oi+1 . . . on ∣ qi = qj )

Backward algorithm
1 βnj = ajF for 1 ≤ j ≤ ∣Q∣
2 βij = ∑_{k=1}^{∣Q∣} βi+1,k ajk bk(oi+1) for 1 ≤ i < n and 1 ≤ j ≤ ∣Q∣
3 P(o) = ∑_{k=1}^{∣Q∣} a0k bk(o1) β1k
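A matching sketch of the backward recursion (same assumptions as for the forward sketch above):

```python
import numpy as np

def backward(obs, A, a_start, a_end, B, states):
    """Fill the n x |Q| matrix beta (0-based rows: beta[i-1, j] is the slides' beta_ij)
    and return it together with P(o)."""
    n, N = len(obs), len(states)
    beta = np.zeros((n, N))
    beta[-1] = a_end                                # initialization: beta_nj = a_jF
    for i in range(n - 2, -1, -1):                  # recursion, from right to left
        for j in range(N):
            beta[i, j] = sum(beta[i + 1, k] * A[j, k] * B[states[k]][obs[i + 1]]
                             for k in range(N))
    p = sum(a_start[k] * B[states[k]][obs[0]] * beta[0, k] for k in range(N))
    return beta, float(p)

# backward([3, 1, 3], A, a_start, a_end, B, Q) gives the same P(313) as the forward pass.
```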

14 / 33
Likelihood computation

Ice cream weather example continued


[Figure: the ice-cream weather HMM from slide 7, with its transition and emission probabilities.]

Backward trellis for the observed sequence 313:

        o1 = 3           o2 = 1        o3 = 3
H       6.38 ⋅ 10−3      2.7 ⋅ 10−2    0.1
C       7.4 ⋅ 10−3       2.1 ⋅ 10−2    0.1

using βnj = ajF, βij = ∑_{k=1}^{∣Q∣} βi+1,k ajk bk(oi+1), P(o) = ∑_{k=1}^{∣Q∣} a0k bk(o1) β1k

P(313) = 0.8 ⋅ 0.4 ⋅ 6.38 ⋅ 10−3 + 0.2 ⋅ 0.1 ⋅ 7.4 ⋅ 10−3 ≈ 2.19 ⋅ 10−3

15 / 33
Decoding

Given an HMM and an observation o = o1 . . . on, determine the most probable state sequence q = q1 . . . qn ∈ Q^n for o:

arg max_{q∈Q^n} P(o, q) = arg max_{q∈Q^n} P(o∣q) P(q)

and with our independence assumptions

arg max_{q∈Q^n} P(o, q) = arg max_{q∈Q^n} ∏_{i=1}^{n} P(oi ∣ qi) ⋅ ∏_{i=1}^{n} P(qi ∣ qi−1)

16 / 33
Decoding
For the computation, we use the Viterbi algorithm, which is very similar to the forward algorithm except that, instead of computing sums, we compute maxima.
We fill an n × ∣Q∣ matrix v such that

vij = max_{q1 . . . qi−1 ∈ Q^{i−1}} P(o1 . . . oi , q1 . . . qi−1 , qi = qj )

Viterbi algorithm
1 v1j = a0j bj(o1) for 1 ≤ j ≤ ∣Q∣
2 vij = max_{1≤k≤∣Q∣} vi−1,k akj bj(oi) for 1 < i ≤ n and 1 ≤ j ≤ ∣Q∣
3 max_{q∈Q^n} P(o, q) = max_{1≤k≤∣Q∣} vnk akF

In addition, in order to retrieve the best state sequence, each entry vij is equipped with a backpointer to the state that has led to the maximum.
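A sketch of the Viterbi recursion with backpointers, again using the illustrative parameter layout from the earlier snippets:

```python
import numpy as np

def viterbi(obs, A, a_start, a_end, B, states):
    """Return the most probable state sequence for obs and its probability max_q P(o, q)."""
    n, N = len(obs), len(states)
    v = np.zeros((n, N))
    back = np.zeros((n, N), dtype=int)              # backpointers
    for j, s in enumerate(states):                  # initialization
        v[0, j] = a_start[j] * B[s][obs[0]]
    for i in range(1, n):                           # recursion: max instead of sum
        for j, s in enumerate(states):
            scores = v[i - 1] * A[:, j] * B[s][obs[i]]
            back[i, j] = int(np.argmax(scores))
            v[i, j] = scores[back[i, j]]
    last = int(np.argmax(v[-1] * a_end))            # termination, including a_kF
    best_prob = float(v[-1, last] * a_end[last])
    path = [last]                                   # follow the backpointers
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [states[j] for j in reversed(path)], best_prob

# viterbi([3, 1, 3], A, a_start, a_end, B, Q) -> (['H', 'H', 'H'], 9.216e-4), cf. the next slide.
```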
17 / 33
Decoding

Ice cream weather example continued


[Figure: the ice-cream weather HMM from slide 7, with its transition and emission probabilities.]

Viterbi trellis for the output sequence 313 (each cell: vij value, backpointer):

        o1 = 3                o2 = 1               o3 = 3
H       32 ⋅ 10−2, start      384 ⋅ 10−4, H        9216 ⋅ 10−6, H
C       2 ⋅ 10−2, start       480 ⋅ 10−4, H        24 ⋅ 10−4, C

using v1j = a0j bj(o1), vij = max_{1≤k≤∣Q∣} vi−1,k akj bj(oi), max_{q∈Q^n} P(o, q) = max_{1≤k≤∣Q∣} vnk akF

The most probable weather sequence is HHH, with probability 9216 ⋅ 10−7.

18 / 33
Training
Supervised HMM Training: Given a sequence of observations o and a corresponding sequence of states q, learn the parameters A and B of an HMM.
MLE via the counts of qi qj sequences and of pairs qi, oj of states and observations in our training data, possibly with some smoothing.
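A minimal count-based sketch of such supervised estimation for a POS-tagging HMM (the function name and the pseudo-tags <s> and </s> for the start and end states are illustrative; no smoothing is applied):

```python
from collections import Counter

def supervised_mle(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequencies
    from sentences given as lists of (word, tag) pairs."""
    trans, trans_ctx = Counter(), Counter()
    emit, emit_ctx = Counter(), Counter()
    for sentence in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sentence] + ["</s>"]
        for prev, nxt in zip(tags, tags[1:]):
            trans[(prev, nxt)] += 1
            trans_ctx[prev] += 1
        for word, tag in sentence:
            emit[(tag, word)] += 1
            emit_ctx[tag] += 1
    A = {pair: count / trans_ctx[pair[0]] for pair, count in trans.items()}
    B = {pair: count / emit_ctx[pair[0]] for pair, count in emit.items()}
    return A, B

# Example: supervised_mle([[("the", "DT"), ("dog", "NN")]]) gives, among others,
# A[("<s>", "DT")] = 1.0 and B[("NN", "dog")] = 1.0.
```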
More challenging: Unsupervised HMM Training: Given an
observation sequence o and a state set Q, estimate A and B.

Example
ice cream – weather case: we have a sequence of observations
o ∈ {1, 2, 3}n and we know that the possible states are Q =
{H , C}.
POS tagging: we have a sequence of words o ∈
{w1 , w2 , . . . , w∣V ∣ }n and we know that the possible POS tags
(= states) are Q = {NN , NNS, VBD, IN , . . . }.
Task: estimate A and B of the corresponding HMMs.
19 / 33
Training
We use the forward-backward or Baum-Welch algorithm (Baum,
1972), a special case of the EM algorithm (Dempster et al., 1977).
Underlying ideas:
We estimate parameters iteratively: we start with some parameters and use the estimated probabilities to derive better and better parameters.
We get our estimates by computing forward probabilities and then dividing the probability mass among all the different state sequences that have contributed to a forward probability.
Put differently, in each iteration loop, we count all possible state sequences for the given observation sequence. But, instead of counting each of them once, we use their probabilities according to the most recent parameters as counts. We perform some kind of frequency-based MLE with these weighted or fractional counts.
20 / 33
Training

We start with some initial A and B. In each iteration, new parameters âij and b̂i(o) are computed.

Intuition:

âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

and

b̂i(o) = (expected number of times in i observing symbol o) / (expected number of times in state i)

21 / 33
Training

More precisely, instead of the expected numbers, we use the probabilities according to the last parameters A and B and according to the counts we retrieve from the observation sequence o:

âij = (probability of transitions from i to j, given o) / (probability of transitions from state i, given o)

and

b̂i(o) = (probability of emitting o in i, given o) / (probability of passing through state i, given o)

22 / 33
Training

For âij, we first calculate the probability of being in state i at time t (observation ot) and in state j at time t + 1, with the observation sequence given. This is a product of

the probability of being in i at step t after having emitted o1 . . . ot, which is the forward probability αti;
the probability of the transition from i to j, which is aij;
the probability of emitting ot+1 in j, which is bj(ot+1);
and the probability of moving from j to qF while emitting ot+2 . . . on, which is given by the backward probability βt+1,j;

divided by the probability of the observation, since this is taken as given, i.e., divided by P(o).

23 / 33
Training

We define ξt(i, j) as this probability, i.e., the probability of being in i at time t (i.e., at observation ot) and in j when emitting ot+1, given the observation sequence o:

ξt(i, j) = P(qt = qi , qt+1 = qj ∣ o)

According to the previous slide, this can be computed as


Transition E-step
ξt(i, j) = αti aij bj(ot+1) βt+1,j / P(o)   for i, j ∈ {1, . . . , ∣Q∣}

This is the E (expectation) step for the transition probabilities.
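In code, given the alpha and beta matrices and P(o) from the forward and backward sketches above, the ξ values can be collected as follows (a sketch; the function name is illustrative):

```python
import numpy as np

def xi_matrix(obs, alpha, beta, p_obs, A, B, states):
    """xi[t, i, j] = P(q_{t+1} = state i, q_{t+2} = state j | o), 0-based t,
    i.e. xi[t] corresponds to the slides' xi_{t+1}."""
    n, N = len(obs), len(states)
    xi = np.zeros((n - 1, N, N))
    for t in range(n - 1):
        for i in range(N):
            for j in range(N):
                xi[t, i, j] = (alpha[t, i] * A[i, j] * B[states[j]][obs[t + 1]]
                               * beta[t + 1, j]) / p_obs
    return xi
```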

24 / 33
Training
From these ξt(i, j) values, we can estimate new transition probabilities so as to maximize the likelihood of the observed data.
Remember that we want to have

âij = (probability of transitions from i to j, given o) / (probability of transitions from state i, given o)

With the previously defined ξt (i, j), we obtain

Transition M-step

âij = ∑_{t=1}^{n−1} ξt(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) )   for i, j ∈ {1, . . . , ∣Q∣}

â0i = a0i bi(o1) β1,i / ∑_{k=1}^{∣Q∣} a0k bk(o1) β1,k   and   âiF = (αn,i aiF / P(o)) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) )   for 1 ≤ i ≤ ∣Q∣

This is the M (maximization) step for the transition probabilities.
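With the xi array from the E-step sketch above, the transition M-step can be written compactly (illustrative; â0i is simply γ1(i), see the next slides):

```python
def reestimate_transitions(alpha, xi, a_end, p_obs):
    """a_hat_ij = sum_t xi_t(i, j) / (sum_t sum_k xi_t(i, k) + alpha_n,i a_iF / P(o)),
    and analogously a_hat_iF for the transition into the end state."""
    end_mass = alpha[-1] * a_end / p_obs        # expected use of the i -> end transition
    denom = xi.sum(axis=(0, 2)) + end_mass      # expected number of transitions out of i
    A_new = xi.sum(axis=0) / denom[:, None]
    a_end_new = end_mass / denom
    return A_new, a_end_new
```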


25 / 33
Training

For b̂i(o), we calculate the probability of being in state i at time t, given the observation sequence o, which we call γt(i).

Emission E-step
γt(i) = P(qt = qi ∣ o) = αti βti / P(o)

This is the E (expectation) step for the emission probabilities.
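Given the alpha and beta matrices and P(o) from the earlier sketches, this is a one-liner:

```python
# gamma[t, i] = P(q_{t+1} = state i | o) with 0-based t, i.e. the slides' gamma_{t+1}(i):
gamma = alpha * beta / p_obs
```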

26 / 33
Training

From these γt(i) values, we can estimate new emission probabilities so as to maximize the likelihood of the observed data.
Remember that we want to have

b̂i(o) = (probability of emitting o in i, given o) / (probability of passing through state i, given o)

Emission M-step
b̂i(o) = ∑_{1≤t≤n, ot=o} γt(i) / ∑_{t=1}^{n} γt(i)   for 1 ≤ i ≤ ∣Q∣

This is the M (maximization) step for the emission probabilities.
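As a sketch, the emission M-step over the gamma matrix from above (O is the observation alphabet; names are illustrative):

```python
def reestimate_emissions(obs, gamma, states, O):
    """b_hat_i(o) = sum of gamma_t(i) over the positions t with o_t = o,
    divided by the sum of gamma_t(i) over all t."""
    B_new = {}
    for i, s in enumerate(states):
        total = gamma[:, i].sum()
        B_new[s] = {o: sum(gamma[t, i] for t in range(len(obs)) if obs[t] == o) / total
                    for o in O}
    return B_new
```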

27 / 33
Training
Forward-backward algorithm (input: observations o ∈ O^n, state set Q):
initialize A and B
iterate until convergence:
  E-step
    γt(i) = αti βti / P(o) for all 1 ≤ t ≤ n, 1 ≤ i ≤ ∣Q∣
    ξt(i, j) = αti aij bj(ot+1) βt+1,j / P(o) for all 1 ≤ t ≤ n − 1, 1 ≤ i, j ≤ ∣Q∣
  M-step
    âij = ∑_{t=1}^{n−1} ξt(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) ) for all 1 ≤ i, j ≤ ∣Q∣
    â0i = a0i bi(o1) β1,i / ∑_{k=1}^{∣Q∣} a0k bk(o1) β1,k and âiF = (αn,i aiF / P(o)) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) ) for 1 ≤ i ≤ ∣Q∣
    b̂i(o) = ∑_{1≤t≤n, ot=o} γt(i) / ∑_{t=1}^{n} γt(i) for 1 ≤ i ≤ ∣Q∣
return A, B
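Putting the pieces together, here is a compact sketch of the whole loop for a single observation sequence, reusing the forward and backward functions from the earlier snippets (a fixed number of iterations stands in for a convergence test; all names are illustrative):

```python
import numpy as np

def baum_welch(obs, A, a_start, a_end, B, states, iterations=10):
    """Forward-backward / Baum-Welch sketch with explicit start and end states."""
    A, a_start, a_end = A.copy(), a_start.copy(), a_end.copy()
    B = {s: dict(B[s]) for s in states}
    n, N = len(obs), len(states)
    for _ in range(iterations):
        # E-step: forward/backward passes, then gamma and xi
        alpha, p = forward(obs, A, a_start, a_end, B, states)
        beta, _ = backward(obs, A, a_start, a_end, B, states)
        gamma = alpha * beta / p
        xi = np.zeros((n - 1, N, N))
        for t in range(n - 1):
            for i in range(N):
                for j in range(N):
                    xi[t, i, j] = (alpha[t, i] * A[i, j] * B[states[j]][obs[t + 1]]
                                   * beta[t + 1, j]) / p
        # M-step: re-estimate transitions (including start/end) and emissions
        end_mass = alpha[-1] * a_end / p            # expected use of i -> end
        denom = xi.sum(axis=(0, 2)) + end_mass      # expected transitions out of i
        A = xi.sum(axis=0) / denom[:, None]
        a_end = end_mass / denom
        a_start = gamma[0]                          # a_hat_0i = gamma_1(i)
        for i, s in enumerate(states):
            total = gamma[:, i].sum()
            for o in B[s]:
                B[s][o] = sum(gamma[t, i] for t in range(n) if obs[t] == o) / total
    return A, a_start, a_end, B

# One iteration on the simplified ice-cream example shown on the next slides (observation
# sequence 31) essentially reproduces the re-estimated parameters given there.
```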

28 / 33
Training
(Simplified) ice cream weather example
Initialized HMM:
[Figure: initialized HMM with transitions start → H 0.5, start → C 0.5; H → H 0.4, H → C 0.4, H → end 0.2; C → H 0.4, C → C 0.4, C → end 0.2.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2

observed sequence 31:

α:       t = 1       t = 2        β:       t = 1       t = 2
H        0.4         4 ⋅ 10−2     H        8 ⋅ 10−2    0.2
C        1 ⋅ 10−1    0.16         C        8 ⋅ 10−2    0.2

P(31) = 4 ⋅ 10−2

E-step:
γ:    t     H      C             ξ1:            j = H           j = C
      1     0.8    0.2                i = H      0.16            0.64
      2     0.2    0.8                i = C      3.97 ⋅ 10−2     0.16

M-step (re-estimated HMM):
[Figure: transitions start → H 0.8, start → C 0.2; H → H 0.16, H → C 0.64, H → end 0.2; C → H 3.97 ⋅ 10−2, C → C 0.16, C → end 0.8.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2
P(31) = 0.27
29 / 33
Training
(Simplified) ice cream weather example
Second step:
[Figure: current HMM (from the first M-step) with transitions start → H 0.8, start → C 0.2; H → H 0.16, H → C 0.64, H → end 0.2; C → H 3.97 ⋅ 10−2, C → C 0.16, C → end 0.8.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2

α:       t = 1       t = 2           β:       t = 1    t = 2
H        0.64        2.08 ⋅ 10−2     H        0.42     0.2
C        4 ⋅ 10−2    0.33            C        0.1      0.8

E-step:
γ:    t     H              C              ξ1:            j = H           j = C
      1     0.98           1.53 ⋅ 10−2         i = H      1.51 ⋅ 10−2     0.97
      2     1.53 ⋅ 10−2    0.98                i = C      1.7 ⋅ 10−4      1.51 ⋅ 10−2

M-step (re-estimated HMM):
[Figure: transitions start → H 0.98, start → C 1.53 ⋅ 10−2; H → H 1.51 ⋅ 10−2, H → C 0.97, H → end 1.53 ⋅ 10−2; C → H 1.7 ⋅ 10−4, C → C 1.51 ⋅ 10−2, C → end 0.98.]
P(1∣H) = 1.53 ⋅ 10−2   P(1∣C) = 0.98
P(3∣H) = 0.98          P(3∣C) = 1.52 ⋅ 10−2
P(31) = 0.91

30 / 33
Training
(Simplified) ice cream weather example
Third step:
[Figure: current HMM (from the second M-step) with transitions start → H 0.98, start → C 1.53 ⋅ 10−2; H → H 1.51 ⋅ 10−2, H → C 0.97, H → end 1.53 ⋅ 10−2; C → H 1.7 ⋅ 10−4, C → C 1.51 ⋅ 10−2, C → end 0.98.]
P(1∣H) = 1.53 ⋅ 10−2   P(1∣C) = 0.98
P(3∣H) = 0.98          P(3∣C) = 1.52 ⋅ 10−2

α:       t = 1          t = 2           β:       t = 1           t = 2
H        0.97           2.1 ⋅ 10−4      H        0.94            1.53 ⋅ 10−2
C        2.3 ⋅ 10−4     0.93            C        1.46 ⋅ 10−2     0.98

E-step:
γ:    t     H    C        ξ1:            j = H    j = C
      1     1    0             i = H      0        1
      2     0    1             i = C      0        0

M-step (re-estimated HMM):
[Figure: transitions start → H 1, start → C 0; H → H 0, H → C 1, H → end 0; C → H 0, C → C 0, C → end 1.]
P(1∣H) = 6 ⋅ 10−5   P(1∣C) = 1
P(3∣H) = 1          P(3∣C) = 0
P(31) = 1

31 / 33
Training
Note that we can also leave out the factor 1/P(o); the result is the same:
initialize A and B
iterate until convergence:
  E-step
    φt(i) = αti βti for all 1 ≤ t ≤ n, 1 ≤ i ≤ ∣Q∣
    ψt(i, j) = αti aij bj(ot+1) βt+1,j for all 1 ≤ t ≤ n − 1, 1 ≤ i, j ≤ ∣Q∣
  M-step
    âij = ∑_{t=1}^{n−1} ψt(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ψt(i, k) + αn,i aiF ) for all 1 ≤ i, j ≤ ∣Q∣
    â0i = a0i bi(o1) β1,i / ∑_{k=1}^{∣Q∣} a0k bk(o1) β1,k and âiF = αn,i aiF / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ψt(i, k) + αn,i aiF ) for 1 ≤ i ≤ ∣Q∣
    b̂i(o) = ∑_{1≤t≤n, ot=o} φt(i) / ∑_{t=1}^{n} φt(i) for 1 ≤ i ≤ ∣Q∣
return A, B

32 / 33
References
Baum, L. E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In O. Shisha (ed.), Inequalities III: Proceedings of the 3rd Symposium on Inequalities, 1–8. University of California, Los Angeles: Academic Press.
Dempster, A. P., N. M. Laird & D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1). 1–21.
Eisner, Jason. 2002. An interactive spreadsheet for teaching the forward-backward algorithm. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, 10–18.
Jurafsky, Daniel & James H. Martin. 2015. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of the 3rd edition.

33 / 33
