Machine Learning For Natural Language Processing: Hidden Markov Models
Laura Kallmeyer
Heinrich-Heine-Universität Düsseldorf
Summer 2016
Introduction
Table of contents
1 Motivation
2 Hidden Markov Models
3 Likelihood computation
4 Decoding
5 Training
Motivation
HMM
[Figure: an HMM with states q_a and q_b plus start and end states; the transition edges carry probabilities 0.1 and 0.9.]
In q_a, a is emitted with probability 0.9 and b with probability 0.1;
in q_b, b is emitted with probability 0.9 and a with probability 0.1.
Given a sequence bb, what is the most probable path traversed by
the automaton, resulting in this output?
The states in the automaton correspond to classes, i.e., the search for
the most probable path amounts to the search for the most probable
class sequence.
Example
POS Tagging: the classes are POS tags, the emissions are words
Hidden Markov Models
A Markov chain is a weighted automaton in which
weights are probabilities, i.e., all weights are between 0 and 1 and the sum of the weights of all outgoing edges of a state is 1, and
the input sequence uniquely determines the states the automaton goes through.
[Figure: a Markov chain over words (start, I, am, …) with transition probabilities such as 1/2, 2/3 and 1/3.]
Hidden Markov Models
Markov chains are useful when we want to compute the probability for a sequence of events that we can observe. Often, however, the events we are interested in are hidden: in the ice cream task, we observe how many ice creams are eaten per day (1, 2 or 3) and want to reason about the hidden weather states H (hot) and C (cold).
[Figure: HMM for the ice cream task; reading the edge labels off the figure: a_{0H} = 0.8, a_{0C} = 0.2, a_{HH} = 0.6, a_{HC} = 0.3, a_{HF} = 0.1, a_{CH} = 0.4, a_{CC} = 0.5, a_{CF} = 0.1.]
P(1∣H) = 0.2   P(1∣C) = 0.5
P(2∣H) = 0.4   P(2∣C) = 0.4
P(3∣H) = 0.4   P(3∣C) = 0.1
Hidden Markov Models
HMM
An HMM is a tuple ⟨Q, A, O, B, q_0, q_F⟩, with
Q = {q_1, …, q_N} a finite set of states;
A a ∣Q∣ × ∣Q∣ matrix, the transition probability matrix, with 0 ≤ a_{ij} ≤ 1 for all 1 ≤ i ≤ ∣Q∣, 1 ≤ j ≤ ∣Q∣;
O a finite set of observations;
B a sequence of ∣Q∣ functions b_i : O → ℝ, the emission probabilities, with ∑_{o∈O} b_i(o) = 1 for all 1 ≤ i ≤ ∣Q∣;
q_0, q_F ∉ Q special start and end states, associated with transition probabilities a_{0i} and a_{iF} (0 ≤ a_{0i}, a_{iF} ≤ 1) for all 1 ≤ i ≤ ∣Q∣.
For all i, 0 ≤ i ≤ ∣Q∣, it holds that ∑_{j=1}^{∣Q∣} a_{ij} + a_{iF} = 1.
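As a concrete sketch (not part of the definition; numpy is assumed, and the numbers are the ice cream weather HMM from the figure above), such an HMM can be written down in a few lines of Python, with the start and end transitions kept in separate vectors:

# Minimal HMM representation in Python (assumes numpy); parameter
# values taken from the ice cream weather figure above.
import numpy as np

states = ["H", "C"]                       # Q
observations = [1, 2, 3]                  # O
pi  = np.array([0.8, 0.2])                # a_{0i}: start -> q_i
A   = np.array([[0.6, 0.3],               # a_{ij}: q_i -> q_j
                [0.4, 0.5]])
a_F = np.array([0.1, 0.1])                # a_{iF}: q_i -> end
B   = np.array([[0.2, 0.4, 0.4],          # b_i(o) = P(o | q_i)
                [0.5, 0.4, 0.1]])

# The conditions from the definition: each emission distribution sums
# to 1, and for every state the outgoing mass a_{i1} + ... + a_{iF} is 1.
assert np.allclose(B.sum(axis=1), 1.0)
assert np.allclose(A.sum(axis=1) + a_F, 1.0)
assert np.isclose(pi.sum(), 1.0)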
Hidden Markov Models
An HMM relies on two independence assumptions:
Markov assumption: P(q_i ∣ q_1 … q_{i−1}) = P(q_i ∣ q_{i−1});
Output independence: P(o_i ∣ q_1 … q_i … q_n, o_1 … o_{i−1} o_{i+1} … o_n) = P(o_i ∣ q_i) (= b_j(o_i) for q_i = q_j).
Likelihood computation
Given an HMM and an observation o = o_1 … o_n, determine the likelihood P(o) of the observation sequence o.
For a specific state sequence q = q_1 … q_n ∈ Qⁿ, we have
P(o, q) = P(o ∣ q) P(q)
and with our independence assumptions
P(o, q) = ∏_{i=1}^{n} P(o_i ∣ q_i) ⋅ ∏_{i=1}^{n} P(q_i ∣ q_{i−1})
(where q_0 is the start state).
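For instance, with the ice cream HMM from above (transition values as read off the figure) and q = HHC for o = 313, including the final transition into q_F as in the termination step of the forward algorithm below:
P(313, HHC) = b_H(3) b_H(1) b_C(3) ⋅ a_{0H} a_{HH} a_{HC} a_{CF} = (0.4 ⋅ 0.2 ⋅ 0.1) ⋅ (0.8 ⋅ 0.6 ⋅ 0.3 ⋅ 0.1) ≈ 1.15 ⋅ 10⁻⁴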
Likelihood computation
α_{ij} = P(o_1 … o_i, q_i = q_j)
Forward algorithm
1 α_{1j} = a_{0j} b_j(o_1) for 1 ≤ j ≤ ∣Q∣
2 α_{ij} = ∑_{k=1}^{∣Q∣} α_{i−1,k} a_{kj} b_j(o_i) for 2 ≤ i ≤ n and 1 ≤ j ≤ ∣Q∣
3 P(o) = ∑_{k=1}^{∣Q∣} α_{nk} a_{kF}
Likelihood computation
P(313), computed with the forward algorithm (columns t = 1, 2, 3 for o = 3, 1, 3):
H: 0.32       3.92 ⋅ 10⁻²   1.79 ⋅ 10⁻²
C: 2 ⋅ 10⁻²   5.3 ⋅ 10⁻²    3.81 ⋅ 10⁻³
using α_{1j} = a_{0j} b_j(o_1) and α_{ij} = ∑_{k=1}^{∣Q∣} α_{i−1,k} a_{kj} b_j(o_i).
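A direct transcription of the forward recursion into Python might look as follows (a minimal sketch, assuming numpy; the parameters are the ones reconstructed from the partially garbled figure above, so individual cells may deviate marginally from the rounded table entries):

import numpy as np

pi, a_F = np.array([0.8, 0.2]), np.array([0.1, 0.1])   # ice cream HMM as above
A = np.array([[0.6, 0.3], [0.4, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
obs_index = {1: 0, 2: 1, 3: 2}

def forward(obs):
    """alpha[i - 1, j - 1] holds alpha_{ij} = P(o_1 ... o_i, q_i = q_j)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs_index[obs[0]]]        # alpha_{1j} = a_{0j} b_j(o_1)
    for i in range(1, len(obs)):
        # alpha_{ij} = sum_k alpha_{i-1,k} a_{kj} b_j(o_i)
        alpha[i] = (alpha[i - 1] @ A) * B[:, obs_index[obs[i]]]
    return alpha

alpha = forward([3, 1, 3])
print(alpha[0])          # [0.32, 0.02], the first column of the table
print(alpha[-1] @ a_F)   # P(o) = sum_k alpha_{nk} a_{kF}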
Likelihood computation
β_{ij} = P(o_{i+1} … o_n ∣ q_i = q_j)
Backward algorithm
1 β_{nj} = a_{jF} for 1 ≤ j ≤ ∣Q∣
2 β_{ij} = ∑_{k=1}^{∣Q∣} β_{i+1,k} a_{jk} b_k(o_{i+1}) for 1 ≤ i < n and 1 ≤ j ≤ ∣Q∣
3 P(o) = ∑_{k=1}^{∣Q∣} a_{0k} b_k(o_1) β_{1,k}
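The backward recursion can be sketched analogously (same toy HMM and assumptions as in the forward sketch; the last line checks that the termination step yields the same P(o) as the forward pass):

import numpy as np

pi, a_F = np.array([0.8, 0.2]), np.array([0.1, 0.1])
A = np.array([[0.6, 0.3], [0.4, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
obs_index = {1: 0, 2: 1, 3: 2}

def backward(obs):
    """beta[i - 1, j - 1] holds beta_{ij} = P(o_{i+1} ... o_n | q_i = q_j)."""
    beta = np.zeros((len(obs), len(pi)))
    beta[-1] = a_F                                 # beta_{nj} = a_{jF}
    for i in range(len(obs) - 2, -1, -1):
        # beta_{ij} = sum_k beta_{i+1,k} a_{jk} b_k(o_{i+1})
        beta[i] = A @ (B[:, obs_index[obs[i + 1]]] * beta[i + 1])
    return beta

beta = backward([3, 1, 3])
print(pi * B[:, obs_index[3]] @ beta[0])   # P(o) = sum_k a_{0k} b_k(o_1) beta_{1k}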
Decoding
Decoding: given an HMM and an observation o = o_1 … o_n, find the most probable state sequence q = q_1 … q_n.
For this computation, we use the Viterbi algorithm, which is very similar to the forward algorithm except that, instead of computing sums, we compute maxima.
We fill an n × ∣Q∣ matrix v such that
v_{ij} = max_{q_1 … q_{i−1} ∈ Q^{i−1}} P(o_1 … o_i, q_1 … q_{i−1}, q_i = q_j).
Viterbi algorithm
1 v_{1j} = a_{0j} b_j(o_1) for 1 ≤ j ≤ ∣Q∣
2 v_{ij} = max_{1≤k≤∣Q∣} v_{i−1,k} a_{kj} b_j(o_i) for 2 ≤ i ≤ n and 1 ≤ j ≤ ∣Q∣
3 max_{q∈Qⁿ} P(o, q) = max_{1≤k≤∣Q∣} v_{nk} a_{kF}
In addition, in order to retrieve the best state sequence, each entry v_{ij} is equipped with a backpointer to the state that has led to the maximum.
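A sketch of the Viterbi algorithm with backpointers, under the same assumptions as the earlier sketches:

import numpy as np

pi, a_F = np.array([0.8, 0.2]), np.array([0.1, 0.1])
A = np.array([[0.6, 0.3], [0.4, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
states, obs_index = ["H", "C"], {1: 0, 2: 1, 3: 2}

def viterbi(obs):
    n = len(obs)
    v = np.zeros((n, len(pi)))
    back = np.zeros((n, len(pi)), dtype=int)        # backpointers
    v[0] = pi * B[:, obs_index[obs[0]]]             # v_{1j} = a_{0j} b_j(o_1)
    for i in range(1, n):
        scores = v[i - 1][:, None] * A              # scores[k, j] = v_{i-1,k} a_{kj}
        back[i] = scores.argmax(axis=0)             # best predecessor for each j
        v[i] = scores.max(axis=0) * B[:, obs_index[obs[i]]]
    k = int((v[-1] * a_F).argmax())                 # termination: max_k v_{nk} a_{kF}
    best = (v[-1] * a_F)[k]
    path = [k]
    for i in range(n - 1, 0, -1):                   # follow the backpointers
        k = back[i][k]
        path.append(k)
    return best, [states[j] for j in reversed(path)]

print(viterbi([3, 1, 3]))   # most probable weather sequence for observation 313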
Training
Supervised HMM Training: Given a sequence of observations o and a
corresponding sequence of states q, learn the parameters A and B of
an HMM.
MLE via the counts of state bigrams q_i q_j and of pairs (q_i, o_j) of states and observations in our training data, possibly with some smoothing.
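A minimal sketch of this count-based estimation on a hypothetical toy corpus (no smoothing; a_{0i} and a_{iF} would be estimated analogously from sentence-initial and sentence-final tags):

from collections import Counter

# Hypothetical tagged corpus: one (word, tag) sequence per sentence.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

trans, emit, tag_count, bigram_from = Counter(), Counter(), Counter(), Counter()
for sent in corpus:
    tags = [tag for _, tag in sent]
    for prev, nxt in zip(tags, tags[1:]):
        trans[(prev, nxt)] += 1          # count of tag bigram q_i q_j
        bigram_from[prev] += 1
    for word, tag in sent:
        emit[(tag, word)] += 1           # count of pair (q_i, o_j)
        tag_count[tag] += 1

# MLE: relative frequencies of the counts.
a_hat = {(qi, qj): c / bigram_from[qi] for (qi, qj), c in trans.items()}
b_hat = {(qi, o): c / tag_count[qi] for (qi, o), c in emit.items()}

print(a_hat[("DT", "NN")])    # 1.0: DT is always followed by NN in this corpus
print(b_hat[("NN", "dog")])   # 0.5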
More challenging: Unsupervised HMM Training: Given an
observation sequence o and a state set Q, estimate A and B.
Example
ice cream – weather case: we have a sequence of observations o ∈ {1, 2, 3}ⁿ and we know that the possible states are Q = {H, C}.
POS tagging: we have a sequence of words o ∈ {w_1, w_2, …, w_{∣V∣}}ⁿ and we know that the possible POS tags (= states) are Q = {NN, NNS, VBD, IN, …}.
Task: estimate A and B of the corresponding HMMs.
Training
We use the forward-backward or Baum-Welch algorithm (Baum,
1972), a special case of the EM algorithm (Dempster et al., 1977).
Underlying ideas:
We estimate parameters iteratively: we start with some parameters and use the estimated probabilities to derive better and better parameters.
We get our estimates by computing forward probabilities and
then dividing the probability mass among all the different state
sequences that have contributed to a forward probability.
Put differently, in each iteration we consider all possible state sequences for the given observation sequence. But, instead of counting each of them once, we use their probabilities according to the most recent parameters as counts. We then perform a kind of frequency-based MLE with these weighted or fractional counts.
Training
From the forward and backward probabilities we can compute, for every time step t, the probability ξ_t(i, j) of being in state q_i at time t and in state q_j at time t + 1, given the observation o (see the E-step below). From these ξ_t(i, j) values, we can estimate new transition probabilities that increase the likelihood of the observed data:
Remember that we want to have
â_{ij} = (probability of transitions from q_i to q_j, given o) / (probability of transitions from q_i, given o)
Transition M-step
â_{ij} = ∑_{t=1}^{n−1} ξ_t(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξ_t(i, k) + α_{n,i} a_{iF} / P(o) ) for i, j ∈ {1, …, ∣Q∣}
â_{0i} = a_{0i} b_i(o_1) β_{1,i} / ∑_{k=1}^{∣Q∣} a_{0k} b_k(o_1) β_{1,k} and â_{iF} = ( α_{n,i} a_{iF} / P(o) ) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξ_t(i, k) + α_{n,i} a_{iF} / P(o) ) for 1 ≤ i ≤ ∣Q∣
Training
Emission M-step
b̂_i(o) = ∑_{1≤t≤n, o_t=o} γ_t(i) / ∑_{t=1}^{n} γ_t(i) for 1 ≤ i ≤ ∣Q∣
where γ_t(i) is the probability of being in state q_i at time t, given o (see the E-step below).
Training
Forward-backward algorithm (input: observations o ∈ Oⁿ, state set Q):
initialize A and B
iterate until convergence:
E-step
γ_t(i) = α_{ti} β_{ti} / P(o) for all 1 ≤ t ≤ n, 1 ≤ i ≤ ∣Q∣
ξ_t(i, j) = α_{ti} a_{ij} b_j(o_{t+1}) β_{t+1,j} / P(o) for all 1 ≤ t ≤ n − 1, 1 ≤ i, j ≤ ∣Q∣
M-step
â_{ij} = ∑_{t=1}^{n−1} ξ_t(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξ_t(i, k) + α_{n,i} a_{iF} / P(o) ) for all 1 ≤ i, j ≤ ∣Q∣
â_{0i} = a_{0i} b_i(o_1) β_{1,i} / ∑_{k=1}^{∣Q∣} a_{0k} b_k(o_1) β_{1,k} and â_{iF} = ( α_{n,i} a_{iF} / P(o) ) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξ_t(i, k) + α_{n,i} a_{iF} / P(o) ) for 1 ≤ i ≤ ∣Q∣
b̂_i(o) = ∑_{1≤t≤n, o_t=o} γ_t(i) / ∑_{t=1}^{n} γ_t(i)
return A, B
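The following sketch turns one E/M iteration of this pseudocode into Python (numpy assumed; the "iterate until convergence" loop would simply repeat the step while monitoring P(o), and observations are passed as indices into O):

import numpy as np

def forward(obs, pi, A, a_F, B):
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, pi, A, a_F, B):
    beta = np.zeros((len(obs), len(pi)))
    beta[-1] = a_F
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(obs, pi, A, a_F, B):
    """One E-step and M-step; obs is a list of observation indices."""
    alpha = forward(obs, pi, A, a_F, B)
    beta = backward(obs, pi, A, a_F, B)
    p_o = alpha[-1] @ a_F
    # E-step: gamma[t, i] = gamma_t(i), xi[t, i, j] = xi_t(i, j).
    gamma = alpha * beta / p_o
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_o
    # M-step: denominator = expected number of transitions out of q_i.
    out_i = xi.sum(axis=(0, 2)) + alpha[-1] * a_F / p_o
    A_new = xi.sum(axis=0) / out_i[:, None]
    aF_new = (alpha[-1] * a_F / p_o) / out_i
    pi_new = pi * B[:, obs[0]] * beta[0] / p_o
    obs_arr = np.array(obs)
    B_new = np.stack([gamma[obs_arr == o].sum(axis=0) for o in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return pi_new, A_new, aF_new, B_new

# One iteration on the ice cream HMM, observation 313 (as indices into O = [1, 2, 3]):
pi, a_F = np.array([0.8, 0.2]), np.array([0.1, 0.1])
A = np.array([[0.6, 0.3], [0.4, 0.5]])
B = np.array([[0.2, 0.4, 0.4], [0.5, 0.4, 0.1]])
print(baum_welch_step([2, 0, 2], pi, A, a_F, B))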
Training
(Simplified) ice cream weather example
Initialized HMM:
[Figure: symmetric initialization: a_{0H} = a_{0C} = 0.5; a_{HH} = a_{HC} = a_{CH} = a_{CC} = 0.4; a_{HF} = a_{CF} = 0.2.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2
After re-estimation:
[Figure: the probability mass concentrates on start → H → C → end (edge labels 0.98, 0.97, 0.98; remaining labels small, e.g. 1.7 ⋅ 10⁻⁴).]
P(1∣H) = 1.53 ⋅ 10⁻²   P(1∣C) = 0.98
P(3∣H) = 0.98   P(3∣C) = 1.52 ⋅ 10⁻²
P(31) = 0.91
Training
(Simplified) ice cream weather example
Third step:
[Figure: start → H → C → end now carries almost all of the probability mass: a_{0H} ≈ 0.98, a_{HC} ≈ 0.97, a_{CF} ≈ 0.98; the remaining transition probabilities are ≈ 1.5 ⋅ 10⁻² or smaller.]
P(1∣H) = 1.53 ⋅ 10⁻²   P(1∣C) = 0.98
P(3∣H) = 0.98   P(3∣C) = 1.52 ⋅ 10⁻²
Training
Note that we can also leave out the factor 1/P(o); the result is the same:
initialize A and B
iterate until convergence:
E-step
φ_t(i) = α_{ti} β_{ti} for all 1 ≤ t ≤ n, 1 ≤ i ≤ ∣Q∣
ψ_t(i, j) = α_{ti} a_{ij} b_j(o_{t+1}) β_{t+1,j} for all 1 ≤ t ≤ n − 1, 1 ≤ i, j ≤ ∣Q∣
M-step
â_{ij} = ∑_{t=1}^{n−1} ψ_t(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ψ_t(i, k) + α_{n,i} a_{iF} ) for all 1 ≤ i, j ≤ ∣Q∣
â_{0i} = a_{0i} b_i(o_1) β_{1,i} / ∑_{k=1}^{∣Q∣} a_{0k} b_k(o_1) β_{1,k} and â_{iF} = α_{n,i} a_{iF} / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ψ_t(i, k) + α_{n,i} a_{iF} ) for 1 ≤ i ≤ ∣Q∣
b̂_i(o) = ∑_{1≤t≤n, o_t=o} φ_t(i) / ∑_{t=1}^{n} φ_t(i)
return A, B
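In the Python sketch above, this variant corresponds to dropping the divisions by p_o: the factor 1/P(o) cancels in every M-step ratio, and for â_{0i} the explicit denominator ∑_{k} a_{0k} b_k(o_1) β_{1,k} equals P(o).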
References
Baum, L. E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In O. Shisha (ed.), Inequalities III: Proceedings of the 3rd Symposium on Inequalities, 1–8. University of California, Los Angeles: Academic Press.
Dempster, A. P., N. M. Laird & D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1). 1–21.
Eisner, Jason. 2002. An interactive spreadsheet for teaching the forward-backward algorithm. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, 10–18.
Jurafsky, Daniel & James H. Martin. 2015. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of the 3rd edition.