
Machine Learning

for natural language processing


Hidden Markov Models

Laura Kallmeyer

Heinrich-Heine-Universität Düsseldorf

Summer 2016

1 / 33
Introduction

So far, we have classified texts/observations independently of the class of the previous text/observation.
Today: a sequence classifier, assigning a sequence of classes to a sequence of observations.
Jurafsky & Martin (2015), chapter 8

2 / 33
Table of contents

1 Motivation

2 Hidden Markov Models

3 Likelihood computation

4 Decoding

5 Training

3 / 33
Motivation

Hidden Markov Models (HMMs)

are weighted finite state automata with states emitting outputs.
Transitions and emissions are weighted.

4 / 33
Motivation
HMM

[Figure: a weighted automaton with states qa and qb plus special start and end states; transition weights 0.1 and 0.9 on its edges.]

in qa, a is emitted with probability 0.9; in qa, b is emitted with probability 0.1;
in qb, b is emitted with probability 0.9; in qb, a is emitted with probability 0.1.

Given a sequence bb, what is the most probable path traversed by the automaton, resulting in this output?

The states in the automaton correspond to classes, i.e., the search for
the most probable path amounts to the search for the most probable
class sequence.

Example
POS Tagging: the classes are POS tags, the emissions are words
5 / 33
Hidden Markov Models
A Markov chain is a weighted automaton in which
weights are probabilities, i.e., all weights are between 0 and 1 and the sum of the weights of all outgoing edges of a state is 1, and
the input sequence uniquely determines the states the automaton goes through.

A Markov chain is actually a bigram language model.


Markov chain for the LM example from slide 9:

[Figure: a Markov chain with states start, I, am, Sam, . . . , end and transition probabilities such as 1/3, 2/3, and 1/2 on its edges.]
6 / 33
Hidden Markov Models
Markov chains are useful when we want to compute the probability
for a sequence of events that we can observe.

Hidden Markov Models compute probabilities for a sequence of hidden events, given a sequence of observed events.

Example from Jurafsky & Martin (2015), after Eisner (2002)


HMM for inferring weather (hot = H, cold = C) from numbers of ice
creams eaten by Jason.

[Figure: HMM with hidden states H and C plus special start and end states; transition probabilities: start → H 0.8, start → C 0.2; H → H 0.6, H → C 0.3, H → end 0.1; C → H 0.4, C → C 0.5, C → end 0.1.]

P(1∣H) = 0.2   P(1∣C) = 0.5
P(2∣H) = 0.4   P(2∣C) = 0.4
P(3∣H) = 0.4   P(3∣C) = 0.1

7 / 33
Hidden Markov Models

HMM
An HMM is a tuple ⟨Q, A, O, B, q0, qF⟩, with
Q = {q1, . . . , qN } a finite set of states;
A a ∣Q∣ × ∣Q∣ matrix, the transition probability matrix, with 0 ≤ aij ≤ 1 for all 1 ≤ i ≤ ∣Q∣, 1 ≤ j ≤ ∣Q∣;
O a finite set of observations;
a sequence B of ∣Q∣ functions bi : O → R, the emission probabilities, with ∑_{o∈O} bi(o) = 1 for all 1 ≤ i ≤ ∣Q∣;
q0, qF ∉ Q special start and end states, associated with transition probabilities a0i and aiF (0 ≤ a0i, aiF ≤ 1) for all 1 ≤ i ≤ ∣Q∣.

For all i, 0 ≤ i ≤ ∣Q∣, it holds that ∑_{j=1}^{∣Q∣} aij + aiF = 1.
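To make the definition concrete, here is a minimal sketch of one way to represent the tuple ⟨Q, A, O, B, q0, qF⟩ in Python, instantiated with the ice-cream weather example from the previous slide (the variable names are illustrative, not part of the slides):

```python
import numpy as np

Q = ["H", "C"]                      # hidden states
O = [1, 2, 3]                       # observation alphabet: ice creams eaten

# Transition probability matrix A (rows: from-state, columns: to-state),
# plus the special start and end transitions a_0i and a_iF.
A = np.array([[0.6, 0.3],           # from H to H, C
              [0.4, 0.5]])          # from C to H, C
a_start = np.array([0.8, 0.2])      # a_0i: start -> H, C
a_end   = np.array([0.1, 0.1])      # a_iF: H -> end, C -> end

# Emission probabilities B: b_i(o) for each state i and observation o.
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
     "C": {1: 0.5, 2: 0.4, 3: 0.1}}

# Constraints from the definition: each row of A plus the end transition
# sums to 1, and each state's emission distribution sums to 1.
assert np.allclose(A.sum(axis=1) + a_end, 1.0)
assert all(abs(sum(B[s].values()) - 1.0) < 1e-9 for s in Q)
```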

8 / 33
Hidden Markov Models

First order HMMs make the following simplifying assumptions:

1 Markov assumption: In a sequence of states qi1 . . . qin, we assume

P(qin ∣ qi1 . . . qin−1) = P(qin ∣ qin−1) (= ain−1,in)

2 Output independence: In a sequence of states qi1 . . . qij . . . qin with the associated output sequence oi1 . . . oij . . . oin, we assume

P(oij ∣ qi1 . . . qin, oi1 . . . oin) = P(oij ∣ qij) (= bij(oij))

9 / 33
Hidden Markov Models

The following three fundamental problems have to be solved:

1 Likelihood: Given an HMM and an observation sequence,


determine the likelihood of the observation sequence.
2 Decoding: Given an HMM and an observation sequence, dis-
cover the best hidden state sequence leading to these observa-
tions.
3 Learning: Given the set of states of an HMM and an observa-
tion sequence, learn the HMM parameters A and B.

10 / 33
Likelihood computation
Given an HMM and an observation o = o1 . . . on , determine the
likelihood of the observation sequence o.
For a specific state sequence q = q1 . . . qn ∈ Q n , we have

P(o, q) = P(o∣q)P(q)
and with our independence assumptions
P(o, q) = ∏_{i=1}^{n} P(oi ∣ qi) ⋅ ∏_{i=1}^{n} P(qi ∣ qi−1)

For the probability of o, we have to sum over all possible state sequences:

P(o) = ∑_{q∈Q^n} ∏_{i=1}^{n} P(oi ∣ qi) ⋅ ∏_{i=1}^{n} P(qi ∣ qi−1) ⋅ P(qF ∣ qn)
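For very short sequences, this sum can be evaluated directly by enumerating all state sequences; the following sketch (assuming the parameter arrays from the earlier snippet) is only meant as a sanity check, since it is exponential in n:

```python
from itertools import product

def brute_force_likelihood(obs, A, a_start, a_end, B, states):
    """P(o): sum over all |Q|^n state sequences of P(o|q) * P(q),
    including the transitions from the start state and into the end state."""
    n, N = len(obs), len(states)
    total = 0.0
    for seq in product(range(N), repeat=n):
        p = a_start[seq[0]] * B[states[seq[0]]][obs[0]]
        for i in range(1, n):
            p *= A[seq[i - 1], seq[i]] * B[states[seq[i]]][obs[i]]
        p *= a_end[seq[-1]]                  # transition into q_F
        total += p
    return total

# With the ice-cream parameters, brute_force_likelihood([3, 1, 3], A, a_start, a_end, B, Q)
# gives roughly 2.19e-3, the same value the forward algorithm computes more efficiently.
```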

11 / 33
Likelihood computation

Computing each element of the sum independently of the others would lead to exponential time complexity.
Instead, we adopt the forward algorithm, which uses dynamic programming.
Idea: We fill an n × ∣Q∣ matrix α such that

αij = P(o1 . . . oi , qi = qj )

Forward algorithm
1 α1j = a0j bj(o1) for 1 ≤ j ≤ ∣Q∣
2 αij = ∑_{k=1}^{∣Q∣} αi−1,k akj bj(oi) for 1 < i ≤ n and 1 ≤ j ≤ ∣Q∣
3 P(o) = ∑_{k=1}^{∣Q∣} αnk akF
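A compact sketch of the forward recursion, using the same illustrative parameter layout as the earlier snippets (the function name is not from the slides):

```python
import numpy as np

def forward(obs, A, a_start, a_end, B, states):
    """Fill the n x |Q| matrix alpha (0-based rows: alpha[i-1, j] is the slides' alpha_ij)
    and return it together with P(o)."""
    n, N = len(obs), len(states)
    alpha = np.zeros((n, N))
    for j, s in enumerate(states):                  # initialization: alpha_1j = a_0j b_j(o_1)
        alpha[0, j] = a_start[j] * B[s][obs[0]]
    for i in range(1, n):                           # recursion: sum over predecessor states k
        for j, s in enumerate(states):
            alpha[i, j] = (alpha[i - 1] @ A[:, j]) * B[s][obs[i]]
    return alpha, float(alpha[-1] @ a_end)          # termination: P(o) = sum_k alpha_nk a_kF

# forward([3, 1, 3], A, a_start, a_end, B, Q) yields P(313) ≈ 2.19e-3 (cf. the next slide).
```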

12 / 33
Likelihood computation

Ice cream weather example continued


[Figure: the ice-cream weather HMM from slide 7, with its transition and emission probabilities.]

Forward trellis for P(313) (observation sequence 3 1 3):

        o1 = 3       o2 = 1        o3 = 3
H       0.32         4 ⋅ 10−2      1.81 ⋅ 10−2
C       2 ⋅ 10−2     5.3 ⋅ 10−2    3.85 ⋅ 10−3

using α1j = a0j bj(o1), αij = ∑_{k=1}^{∣Q∣} αi−1,k akj bj(oi), P(o) = ∑_{k=1}^{∣Q∣} αnk akF

P(313) = (1.81 ⋅ 10−2 + 3.85 ⋅ 10−3) ⋅ 0.1 ≈ 2.19 ⋅ 10−3

13 / 33
Likelihood computation

Alternatively, we can also compute the backward probability β.
For a given observation oi in our observation sequence, the forward probability considers the part from o1 to oi while the backward probability considers the part from oi+1 to on.
Idea: We fill an n × ∣Q∣ matrix β such that

βij = P(oi+1 . . . on ∣ qi = qj )

Backward algorithm
1 βnj = ajF for 1 ≤ j ≤ ∣Q∣
2 βij = ∑_{k=1}^{∣Q∣} βi+1,k ajk bk(oi+1) for 1 ≤ i < n and 1 ≤ j ≤ ∣Q∣
3 P(o) = ∑_{k=1}^{∣Q∣} a0k bk(o1) β1k
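A matching sketch of the backward recursion (same assumptions as for the forward sketch above):

```python
import numpy as np

def backward(obs, A, a_start, a_end, B, states):
    """Fill the n x |Q| matrix beta (0-based rows: beta[i-1, j] is the slides' beta_ij)
    and return it together with P(o)."""
    n, N = len(obs), len(states)
    beta = np.zeros((n, N))
    beta[-1] = a_end                                # initialization: beta_nj = a_jF
    for i in range(n - 2, -1, -1):                  # recursion, from right to left
        for j in range(N):
            beta[i, j] = sum(beta[i + 1, k] * A[j, k] * B[states[k]][obs[i + 1]]
                             for k in range(N))
    p = sum(a_start[k] * B[states[k]][obs[0]] * beta[0, k] for k in range(N))
    return beta, float(p)

# backward([3, 1, 3], A, a_start, a_end, B, Q) gives the same P(313) as the forward pass.
```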

14 / 33
Likelihood computation

Ice cream weather example continued


[Figure: the ice-cream weather HMM from slide 7, with its transition and emission probabilities.]

Backward trellis for the observed sequence 313:

        o1 = 3           o2 = 1        o3 = 3
H       6.38 ⋅ 10−3      2.7 ⋅ 10−2    0.1
C       7.4 ⋅ 10−3       2.1 ⋅ 10−2    0.1

using βnj = ajF, βij = ∑_{k=1}^{∣Q∣} βi+1,k ajk bk(oi+1), P(o) = ∑_{k=1}^{∣Q∣} a0k bk(o1) β1k

P(313) = 0.8 ⋅ 0.4 ⋅ 6.38 ⋅ 10−3 + 0.2 ⋅ 0.1 ⋅ 7.4 ⋅ 10−3 ≈ 2.19 ⋅ 10−3

15 / 33
Decoding

Given an HMM and an observation o = o1 . . . on, determine the most probable state sequence q = q1 . . . qn ∈ Q^n for o:

arg max_{q∈Q^n} P(o, q) = arg max_{q∈Q^n} P(o∣q) P(q)

and with our independence assumptions

arg max_{q∈Q^n} P(o, q) = arg max_{q∈Q^n} ∏_{i=1}^{n} P(oi ∣ qi) ⋅ ∏_{i=1}^{n} P(qi ∣ qi−1)

16 / 33
Decoding
For the computation, we use the Viterbi algorithm, which is very similar to the forward algorithm except that, instead of computing sums, we compute maxima.
We fill an n × ∣Q∣ matrix v such that

vij = max_{q1 . . . qi−1 ∈ Q^{i−1}} P(o1 . . . oi , q1 . . . qi−1 , qi = qj )

Viterbi algorithm
1 v1j = a0j bj(o1) for 1 ≤ j ≤ ∣Q∣
2 vij = max_{1≤k≤∣Q∣} vi−1,k akj bj(oi) for 1 < i ≤ n and 1 ≤ j ≤ ∣Q∣
3 max_{q∈Q^n} P(o, q) = max_{1≤k≤∣Q∣} vnk akF

In addition, in order to retrieve the best state sequence, each entry vij is equipped with a backpointer to the state that has led to the maximum.
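A sketch of the Viterbi recursion with backpointers, again using the illustrative parameter layout from the earlier snippets:

```python
import numpy as np

def viterbi(obs, A, a_start, a_end, B, states):
    """Return the most probable state sequence for obs and its probability max_q P(o, q)."""
    n, N = len(obs), len(states)
    v = np.zeros((n, N))
    back = np.zeros((n, N), dtype=int)              # backpointers
    for j, s in enumerate(states):                  # initialization
        v[0, j] = a_start[j] * B[s][obs[0]]
    for i in range(1, n):                           # recursion: max instead of sum
        for j, s in enumerate(states):
            scores = v[i - 1] * A[:, j] * B[s][obs[i]]
            back[i, j] = int(np.argmax(scores))
            v[i, j] = scores[back[i, j]]
    last = int(np.argmax(v[-1] * a_end))            # termination, including a_kF
    best_prob = float(v[-1, last] * a_end[last])
    path = [last]                                   # follow the backpointers
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return [states[j] for j in reversed(path)], best_prob

# viterbi([3, 1, 3], A, a_start, a_end, B, Q) -> (['H', 'H', 'H'], 9.216e-4), cf. the next slide.
```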
17 / 33
Decoding

Ice cream weather example continued


[Figure: the ice-cream weather HMM from slide 7, with its transition and emission probabilities.]

Viterbi trellis for the output sequence 313 (each cell: vij value, backpointer):

        o1 = 3                o2 = 1               o3 = 3
H       32 ⋅ 10−2, start      384 ⋅ 10−4, H        9216 ⋅ 10−6, H
C       2 ⋅ 10−2, start       480 ⋅ 10−4, H        24 ⋅ 10−4, C

using v1j = a0j bj(o1), vij = max_{1≤k≤∣Q∣} vi−1,k akj bj(oi), max_{q∈Q^n} P(o, q) = max_{1≤k≤∣Q∣} vnk akF

The most probable weather sequence is HHH, with probability 9216 ⋅ 10−7.

18 / 33
Training
Supervised HMM Training: Given a sequence of observations o and a corresponding sequence of states q, learn the parameters A and B of an HMM.
MLE via the counts of qi qj sequences and of pairs qi, oj of states and observations in our training data, possibly with some smoothing.
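A minimal count-based sketch of such supervised estimation for a POS-tagging HMM (the function name and the pseudo-tags <s> and </s> for the start and end states are illustrative; no smoothing is applied):

```python
from collections import Counter

def supervised_mle(tagged_sentences):
    """Estimate transition and emission probabilities by relative frequencies
    from sentences given as lists of (word, tag) pairs."""
    trans, trans_ctx = Counter(), Counter()
    emit, emit_ctx = Counter(), Counter()
    for sentence in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sentence] + ["</s>"]
        for prev, nxt in zip(tags, tags[1:]):
            trans[(prev, nxt)] += 1
            trans_ctx[prev] += 1
        for word, tag in sentence:
            emit[(tag, word)] += 1
            emit_ctx[tag] += 1
    A = {pair: count / trans_ctx[pair[0]] for pair, count in trans.items()}
    B = {pair: count / emit_ctx[pair[0]] for pair, count in emit.items()}
    return A, B

# Example: supervised_mle([[("the", "DT"), ("dog", "NN")]]) gives, among others,
# A[("<s>", "DT")] = 1.0 and B[("NN", "dog")] = 1.0.
```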
More challenging: Unsupervised HMM Training: Given an
observation sequence o and a state set Q, estimate A and B.

Example
ice cream – weather case: we have a sequence of observations
o ∈ {1, 2, 3}n and we know that the possible states are Q =
{H , C}.
POS tagging: we have a sequence of words o ∈
{w1 , w2 , . . . , w∣V ∣ }n and we know that the possible POS tags
(= states) are Q = {NN , NNS, VBD, IN , . . . }.
Task: estimate A and B of the corresponding HMMs.
19 / 33
Training
We use the forward-backward or Baum-Welch algorithm (Baum,
1972), a special case of the EM algorithm (Dempster et al., 1977).
Underlying ideas:
We estimate parameters iteratively: we start with some parameters and use the estimated probabilities to derive better and better parameters.
We get our estimates by computing forward probabilities and then dividing the probability mass among all the different state sequences that have contributed to a forward probability.
Put differently, in each iteration loop, we count all possible state sequences for the given observation sequence. But, instead of counting each of them once, we use their probabilities according to the most recent parameters as counts. We perform some kind of frequency-based MLE with these weighted or fractional counts.
20 / 33
Training

We start with some initial A and B. In each iteration, new parameters âij and b̂i(o) are computed.

Intuition:

âij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)

and

b̂i(o) = (expected number of times in i observing symbol o) / (expected number of times in state i)

21 / 33
Training

More precisely, instead of the expected numbers, we use the probabilities according to the last parameters A and B and according to the counts we retrieve from the observation sequence o:

âij = (probability of transitions from i to j, given o) / (probability of transitions from state i, given o)

and

b̂i(o) = (probability of emitting o in i, given o) / (probability of passing through state i, given o)

22 / 33
Training

For âij, we first calculate the probability of being in state i at time t (observation ot) and in state j at time t + 1, with the observation sequence given. This is a product of

the probability of being in i at step t after having emitted o1 . . . ot, which is the forward probability αti;
the probability of the transition from i to j, which is aij;
the probability of emitting ot+1 in j, which is bj(ot+1);
and the probability of moving from j to qF while emitting ot+2 . . . on, which is given by the backward probability βt+1,j;

divided by the probability of the observation, since this is taken as given, i.e., divided by P(o).

23 / 33
Training

We define ξt(i, j) as this probability, i.e., the probability of being in i at time t (i.e., at observation ot) and in j when emitting ot+1, given the observation sequence o:

ξt(i, j) = P(qt = qi , qt+1 = qj ∣ o)

According to the previous slide, this can be computed as


Transition E-step
ξt(i, j) = αti aij bj(ot+1) βt+1,j / P(o)   for i, j ∈ {1, . . . , ∣Q∣}

This is the E (expectation) step for the transition probabilities.
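In code, given the alpha and beta matrices and P(o) from the forward and backward sketches above, the ξ values can be collected as follows (a sketch; the function name is illustrative):

```python
import numpy as np

def xi_matrix(obs, alpha, beta, p_obs, A, B, states):
    """xi[t, i, j] = P(q_{t+1} = state i, q_{t+2} = state j | o), 0-based t,
    i.e. xi[t] corresponds to the slides' xi_{t+1}."""
    n, N = len(obs), len(states)
    xi = np.zeros((n - 1, N, N))
    for t in range(n - 1):
        for i in range(N):
            for j in range(N):
                xi[t, i, j] = (alpha[t, i] * A[i, j] * B[states[j]][obs[t + 1]]
                               * beta[t + 1, j]) / p_obs
    return xi
```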

24 / 33
Training
From these ξt(i, j) values, we can estimate new transition probabilities so as to maximize the likelihood of the observed data.
Remember that we want to have

âij = (probability of transitions from i to j, given o) / (probability of transitions from state i, given o)

With the previously defined ξt (i, j), we obtain

Transition M-step

âij = ∑_{t=1}^{n−1} ξt(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) )   for i, j ∈ {1, . . . , ∣Q∣}

â0i = a0i bi(o1) β1,i / ∑_{k=1}^{∣Q∣} a0k bk(o1) β1,k   and   âiF = (αn,i aiF / P(o)) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) )   for 1 ≤ i ≤ ∣Q∣

This is the M (maximization) step for the transition probabilities.
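With the xi array from the E-step sketch above, the transition M-step can be written compactly (illustrative; â0i is simply γ1(i), see the next slides):

```python
def reestimate_transitions(alpha, xi, a_end, p_obs):
    """a_hat_ij = sum_t xi_t(i, j) / (sum_t sum_k xi_t(i, k) + alpha_n,i a_iF / P(o)),
    and analogously a_hat_iF for the transition into the end state."""
    end_mass = alpha[-1] * a_end / p_obs        # expected use of the i -> end transition
    denom = xi.sum(axis=(0, 2)) + end_mass      # expected number of transitions out of i
    A_new = xi.sum(axis=0) / denom[:, None]
    a_end_new = end_mass / denom
    return A_new, a_end_new
```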


25 / 33
Training

For b̂i(o), we calculate the probability of being in state i at time t, given the observation sequence o, which we call γt(i).

Emission E-step
γt(i) = P(qt = qi ∣ o) = αti βti / P(o)

This is the E (expectation) step for the emission probabilities.
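Given the alpha and beta matrices and P(o) from the earlier sketches, this is a one-liner:

```python
# gamma[t, i] = P(q_{t+1} = state i | o) with 0-based t, i.e. the slides' gamma_{t+1}(i):
gamma = alpha * beta / p_obs
```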

26 / 33
Training

From these γt(i) values, we can estimate new emission probabilities so as to maximize the likelihood of the observed data.
Remember that we want to have

b̂i(o) = (probability of emitting o in i, given o) / (probability of passing through state i, given o)

Emission M-step
b̂i(o) = ∑_{1≤t≤n, ot=o} γt(i) / ∑_{t=1}^{n} γt(i)   for 1 ≤ i ≤ ∣Q∣

This is the M (maximization) step for the emission probabilities.
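As a sketch, the emission M-step over the gamma matrix from above (O is the observation alphabet; names are illustrative):

```python
def reestimate_emissions(obs, gamma, states, O):
    """b_hat_i(o) = sum of gamma_t(i) over the positions t with o_t = o,
    divided by the sum of gamma_t(i) over all t."""
    B_new = {}
    for i, s in enumerate(states):
        total = gamma[:, i].sum()
        B_new[s] = {o: sum(gamma[t, i] for t in range(len(obs)) if obs[t] == o) / total
                    for o in O}
    return B_new
```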

27 / 33
Training
Forward-backward algorithm (input: observations o ∈ O^n, state set Q):
initialize A and B
iterate until convergence:
  E-step
    γt(i) = αti βti / P(o) for all 1 ≤ t ≤ n, 1 ≤ i ≤ ∣Q∣
    ξt(i, j) = αti aij bj(ot+1) βt+1,j / P(o) for all 1 ≤ t ≤ n − 1, 1 ≤ i, j ≤ ∣Q∣
  M-step
    âij = ∑_{t=1}^{n−1} ξt(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) ) for all 1 ≤ i, j ≤ ∣Q∣
    â0i = a0i bi(o1) β1,i / ∑_{k=1}^{∣Q∣} a0k bk(o1) β1,k and âiF = (αn,i aiF / P(o)) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ξt(i, k) + αn,i aiF / P(o) ) for 1 ≤ i ≤ ∣Q∣
    b̂i(o) = ∑_{1≤t≤n, ot=o} γt(i) / ∑_{t=1}^{n} γt(i) for 1 ≤ i ≤ ∣Q∣
return A, B
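Putting the pieces together, here is a compact sketch of the whole loop for a single observation sequence, reusing the forward and backward functions from the earlier snippets (a fixed number of iterations stands in for a convergence test; all names are illustrative):

```python
import numpy as np

def baum_welch(obs, A, a_start, a_end, B, states, iterations=10):
    """Forward-backward / Baum-Welch sketch with explicit start and end states."""
    A, a_start, a_end = A.copy(), a_start.copy(), a_end.copy()
    B = {s: dict(B[s]) for s in states}
    n, N = len(obs), len(states)
    for _ in range(iterations):
        # E-step: forward/backward passes, then gamma and xi
        alpha, p = forward(obs, A, a_start, a_end, B, states)
        beta, _ = backward(obs, A, a_start, a_end, B, states)
        gamma = alpha * beta / p
        xi = np.zeros((n - 1, N, N))
        for t in range(n - 1):
            for i in range(N):
                for j in range(N):
                    xi[t, i, j] = (alpha[t, i] * A[i, j] * B[states[j]][obs[t + 1]]
                                   * beta[t + 1, j]) / p
        # M-step: re-estimate transitions (including start/end) and emissions
        end_mass = alpha[-1] * a_end / p            # expected use of i -> end
        denom = xi.sum(axis=(0, 2)) + end_mass      # expected transitions out of i
        A = xi.sum(axis=0) / denom[:, None]
        a_end = end_mass / denom
        a_start = gamma[0]                          # a_hat_0i = gamma_1(i)
        for i, s in enumerate(states):
            total = gamma[:, i].sum()
            for o in B[s]:
                B[s][o] = sum(gamma[t, i] for t in range(n) if obs[t] == o) / total
    return A, a_start, a_end, B

# One iteration on the simplified ice-cream example shown on the next slides (observation
# sequence 31) essentially reproduces the re-estimated parameters given there.
```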

28 / 33
Training
(Simplified) ice cream weather example
Initialized HMM:
[Figure: initialized HMM with transitions start → H 0.5, start → C 0.5; H → H 0.4, H → C 0.4, H → end 0.2; C → H 0.4, C → C 0.4, C → end 0.2.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2

observed sequence 31:

α:       t = 1       t = 2        β:       t = 1       t = 2
H        0.4         4 ⋅ 10−2     H        8 ⋅ 10−2    0.2
C        1 ⋅ 10−1    0.16         C        8 ⋅ 10−2    0.2

P(31) = 4 ⋅ 10−2

E-step:
γ:    t     H      C             ξ1:            j = H           j = C
      1     0.8    0.2                i = H      0.16            0.64
      2     0.2    0.8                i = C      3.97 ⋅ 10−2     0.16

M-step (re-estimated HMM):
[Figure: transitions start → H 0.8, start → C 0.2; H → H 0.16, H → C 0.64, H → end 0.2; C → H 3.97 ⋅ 10−2, C → C 0.16, C → end 0.8.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2
P(31) = 0.27
29 / 33
Training
(Simplified) ice cream weather example
Second step:
[Figure: current HMM (from the first M-step) with transitions start → H 0.8, start → C 0.2; H → H 0.16, H → C 0.64, H → end 0.2; C → H 3.97 ⋅ 10−2, C → C 0.16, C → end 0.8.]
P(1∣H) = 0.2   P(1∣C) = 0.8
P(3∣H) = 0.8   P(3∣C) = 0.2

α:       t = 1       t = 2           β:       t = 1    t = 2
H        0.64        2.08 ⋅ 10−2     H        0.42     0.2
C        4 ⋅ 10−2    0.33            C        0.1      0.8

E-step:
γ:    t     H              C              ξ1:            j = H           j = C
      1     0.98           1.53 ⋅ 10−2         i = H      1.51 ⋅ 10−2     0.97
      2     1.53 ⋅ 10−2    0.98                i = C      1.7 ⋅ 10−4      1.51 ⋅ 10−2

M-step (re-estimated HMM):
[Figure: transitions start → H 0.98, start → C 1.53 ⋅ 10−2; H → H 1.51 ⋅ 10−2, H → C 0.97, H → end 1.53 ⋅ 10−2; C → H 1.7 ⋅ 10−4, C → C 1.51 ⋅ 10−2, C → end 0.98.]
P(1∣H) = 1.53 ⋅ 10−2   P(1∣C) = 0.98
P(3∣H) = 0.98          P(3∣C) = 1.52 ⋅ 10−2
P(31) = 0.91

30 / 33
Training
(Simplified) ice cream weather example
Third step:
[Figure: current HMM (from the second M-step) with transitions start → H 0.98, start → C 1.53 ⋅ 10−2; H → H 1.51 ⋅ 10−2, H → C 0.97, H → end 1.53 ⋅ 10−2; C → H 1.7 ⋅ 10−4, C → C 1.51 ⋅ 10−2, C → end 0.98.]
P(1∣H) = 1.53 ⋅ 10−2   P(1∣C) = 0.98
P(3∣H) = 0.98          P(3∣C) = 1.52 ⋅ 10−2

α:       t = 1          t = 2           β:       t = 1           t = 2
H        0.97           2.1 ⋅ 10−4      H        0.94            1.53 ⋅ 10−2
C        2.3 ⋅ 10−4     0.93            C        1.46 ⋅ 10−2     0.98

E-step:
γ:    t     H    C        ξ1:            j = H    j = C
      1     1    0             i = H      0        1
      2     0    1             i = C      0        0

M-step (re-estimated HMM):
[Figure: transitions start → H 1, start → C 0; H → H 0, H → C 1, H → end 0; C → H 0, C → C 0, C → end 1.]
P(1∣H) = 6 ⋅ 10−5   P(1∣C) = 1
P(3∣H) = 1          P(3∣C) = 0
P(31) = 1

31 / 33
Training
Note that we can also leave out the factor 1/P(o); the result is the same:
initialize A and B
iterate until convergence:
  E-step
    φt(i) = αti βti for all 1 ≤ t ≤ n, 1 ≤ i ≤ ∣Q∣
    ψt(i, j) = αti aij bj(ot+1) βt+1,j for all 1 ≤ t ≤ n − 1, 1 ≤ i, j ≤ ∣Q∣
  M-step
    âij = ∑_{t=1}^{n−1} ψt(i, j) / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ψt(i, k) + αn,i aiF ) for all 1 ≤ i, j ≤ ∣Q∣
    â0i = a0i bi(o1) β1,i / ∑_{k=1}^{∣Q∣} a0k bk(o1) β1,k and âiF = αn,i aiF / ( ∑_{t=1}^{n−1} ∑_{k=1}^{∣Q∣} ψt(i, k) + αn,i aiF ) for 1 ≤ i ≤ ∣Q∣
    b̂i(o) = ∑_{1≤t≤n, ot=o} φt(i) / ∑_{t=1}^{n} φt(i) for 1 ≤ i ≤ ∣Q∣
return A, B

32 / 33
References
Baum, L. E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In O. Shisha (ed.), Inequalities III: Proceedings of the 3rd Symposium on Inequalities, 1–8. University of California, Los Angeles: Academic Press.
Dempster, A. P., N. M. Laird & D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39(1). 1–21.
Eisner, Jason. 2002. An interactive spreadsheet for teaching the forward-backward algorithm. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, 10–18.
Jurafsky, Daniel & James H. Martin. 2015. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of the 3rd edition.

33 / 33
