Natural Language Processing

Fall 2021
Andreas Opedal, aopedal@ethz.ch

Week 5 Exercises
Limited data is one of the biggest problems in estimating language models—this problem
also affects n-gram models. In practice, a corpus over which we estimate a language model
often does not contain instances of all possible n-grams (even for low-order n). The model
then assigns zero probability to any text containing such an n-gram. To avoid this, we
introduce Laplace smoothing (or Lidstone smoothing), which adds imaginary “pseudo”
counts to every n-gram. In the bigram case,
this corresponds to estimating bigram probabilities over a vocabulary V as:

\[
\hat{p}_\lambda(w_m \mid w_{m-1}) = \frac{\mathrm{count}(w_{m-1}, w_m) + \lambda}{\sum_{w' \in V} \mathrm{count}(w_{m-1}, w') + |V|\,\lambda}
\]
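To make this estimator concrete, here is a minimal Python sketch (the function name, argument layout, and toy counts below are our own illustration):

```python
from collections import Counter

def laplace_bigram_prob(prev_word, word, bigram_counts, vocab, lam=1.0):
    """Add-lambda estimate of p(word | prev_word), following the formula above."""
    numerator = bigram_counts[(prev_word, word)] + lam
    denominator = sum(bigram_counts[(prev_word, v)] for v in vocab) + len(vocab) * lam
    return numerator / denominator

# Tiny usage example with made-up counts.
counts = Counter({("the", "cat"): 2, ("the", "dog"): 1})
vocab = {"the", "cat", "dog", "EOS"}
print(laplace_bigram_prob("the", "cat", counts, vocab))  # (2 + 1) / (3 + 4) ≈ 0.43
```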

Question 1: n-gram Language Modeling


Consider the following vocabulary:

{BOS, EOS, here, David, are, you, the}

where BOS and EOS are defined as in the lecture, i.e. BOS is the dummy token indicating
the beginning of a sentence and EOS indicates the end of a sentence. Note that we never
need to compute the (conditional) probability of BOS, so we do not include a unigram count
for BOS, nor bigram counts in which BOS is the predicted (second) token.
Consider the following training data:

BOS here you are EOS
BOS here you are David EOS
BOS are you here EOS
BOS you are here EOS
BOS you are here EOS
BOS David you are here EOS
BOS you are EOS

(a) Compute all n-gram counts up to n = 2.


(b) Calculate the following probabilities:
(i) p(you)
(ii) p(you|are)
(c) Using unigram and bigram language models, compute the probabilities of the following
sentences:
(i) BOS here you are EOS

(ii) BOS are you EOS
What do you observe?
(d) Apply Laplace smoothing with λ = 1 to the bigram model and compute the probabilities of the sentences from part (c) again.

(a) (i) Unigrams:
here 6
David 2
the 0
you 7
are 7
EOS 7

(ii) Bigrams:
BOS here 2
BOS are 1
BOS you 3
BOS David 1
here you 2
you are 6
are EOS 2
are David 1
David EOS 1
are you 1
you here 1
here EOS 4
are here 3
David you 1

The rest are zero.
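These tables can also be reproduced programmatically; the following is a small Python sketch using the training data above (the variable names are our own):

```python
from collections import Counter

corpus = [
    "BOS here you are EOS",
    "BOS here you are David EOS",
    "BOS are you here EOS",
    "BOS you are here EOS",
    "BOS you are here EOS",
    "BOS David you are here EOS",
    "BOS you are EOS",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    # Unigram counts exclude BOS, as stated in the exercise.
    unigram_counts.update(t for t in tokens if t != "BOS")
    # Bigrams keep BOS as a left context only, e.g. ("BOS", "here").
    bigram_counts.update(zip(tokens, tokens[1:]))

print(unigram_counts["you"])            # 7
print(bigram_counts[("you", "are")])    # 6
print(bigram_counts[("you", "EOS")])    # 0
```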

(b) (i) The total number of unigrams is 6 + 2 + 0 + 7 + 7 + 7 = 29 and you occurs 7 times. Therefore p(you) = 7/29.
(ii) The total number of bigrams where are is the first word is 2 + 1 + 1 + 3 = 7 and 1 of them ends with you, so p(you|are) = 1/7.

(c) Unigram model:

\begin{align*}
p(\text{here you are EOS}) &= p(\text{here})\, p(\text{you} \mid \text{here})\, p(\text{are} \mid \text{you, here})\, p(\text{EOS} \mid \text{are, you, here}) \\
&= p(\text{here})\, p(\text{you})\, p(\text{are})\, p(\text{EOS}) \\
&= \frac{6 \cdot 7 \cdot 7 \cdot 7}{29^4} \approx 0.29\%
\end{align*}

\begin{align*}
p(\text{are you EOS}) &= p(\text{are})\, p(\text{you} \mid \text{are})\, p(\text{EOS} \mid \text{you, are}) \\
&= p(\text{are})\, p(\text{you})\, p(\text{EOS}) \\
&= \frac{7 \cdot 7 \cdot 7}{29^3} \approx 1.4\%
\end{align*}
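A quick numeric check of these two values, as a small Python sketch that hard-codes the unigram counts from part (a):

```python
# Unigram counts from part (a); BOS is excluded, so the total is 29.
counts = {"here": 6, "David": 2, "the": 0, "you": 7, "are": 7, "EOS": 7}
total = sum(counts.values())

def unigram_prob(sentence):
    """Probability of a sentence under the (unsmoothed) unigram model."""
    p = 1.0
    for w in sentence.split():
        p *= counts[w] / total
    return p

print(unigram_prob("here you are EOS"))  # ~0.0029
print(unigram_prob("are you EOS"))       # ~0.0141
```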

Bigram model:

\begin{align*}
p(\text{BOS here you are EOS}) &= p(\text{here} \mid \text{BOS})\, p(\text{you} \mid \text{here, BOS})\, p(\text{are} \mid \text{you, here, BOS})\, p(\text{EOS} \mid \text{are, you, here, BOS}) \\
&= p(\text{here} \mid \text{BOS})\, p(\text{you} \mid \text{here})\, p(\text{are} \mid \text{you})\, p(\text{EOS} \mid \text{are}) \\
&= \frac{\text{count(BOS here)}}{\text{count(BOS)}} \cdot \frac{\text{count(here you)}}{\text{count(here)}} \cdot \frac{\text{count(you are)}}{\text{count(you)}} \cdot \frac{\text{count(are EOS)}}{\text{count(are)}} \\
&= \frac{2}{7} \cdot \frac{2}{6} \cdot \frac{6}{7} \cdot \frac{2}{7} = \frac{48}{2058} \approx 2.33\%
\end{align*}

\begin{align*}
p(\text{BOS are you EOS}) &= p(\text{are} \mid \text{BOS})\, p(\text{you} \mid \text{are, BOS})\, p(\text{EOS} \mid \text{you, are, BOS}) \\
&= p(\text{are} \mid \text{BOS})\, p(\text{you} \mid \text{are})\, p(\text{EOS} \mid \text{you}) \\
&= \frac{\text{count(BOS are)}}{\text{count(BOS)}} \cdot \frac{\text{count(are you)}}{\text{count(are)}} \cdot \frac{\text{count(you EOS)}}{\text{count(you)}} \\
&= \frac{1}{7} \cdot \frac{1}{7} \cdot \frac{0}{7} = 0\%
\end{align*}

Observe that the probability of the second sentence is zero under the bigram model, since
the bigram “you EOS” never occurs in the training set, i.e. “you” never ends a sentence there.
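The same observation can be verified numerically; the sketch below hard-codes the relevant counts from part (a):

```python
# Bigram and unigram counts from part (a); count(BOS) = 7 sentences.
bigram = {("BOS", "here"): 2, ("here", "you"): 2, ("you", "are"): 6, ("are", "EOS"): 2,
          ("BOS", "are"): 1, ("are", "you"): 1, ("you", "EOS"): 0}
context = {"BOS": 7, "here": 6, "you": 7, "are": 7}

def bigram_prob(sentence):
    """Probability of a sentence under the (unsmoothed) bigram model."""
    tokens = sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram[(prev, cur)] / context[prev]
    return p

print(bigram_prob("BOS here you are EOS"))  # ~0.0233
print(bigram_prob("BOS are you EOS"))       # 0.0
```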

(d) The new probabilities for the sentences from part (c) are:

\begin{align*}
p_{\text{sm}}(\text{BOS here you are EOS}) &= p_{\text{sm}}(\text{here} \mid \text{BOS})\, p_{\text{sm}}(\text{you} \mid \text{here})\, p_{\text{sm}}(\text{are} \mid \text{you})\, p_{\text{sm}}(\text{EOS} \mid \text{are}) \\
&= \frac{\text{count(BOS here)} + 1}{\text{count(BOS)} + |V|} \cdot \frac{\text{count(here you)} + 1}{\text{count(here)} + |V|} \cdot \frac{\text{count(you are)} + 1}{\text{count(you)} + |V|} \cdot \frac{\text{count(are EOS)} + 1}{\text{count(are)} + |V|} \\
&= \frac{3}{14} \cdot \frac{3}{13} \cdot \frac{7}{14} \cdot \frac{3}{14} \approx 0.53\%
\end{align*}

\begin{align*}
p_{\text{sm}}(\text{BOS are you EOS}) &= p_{\text{sm}}(\text{are} \mid \text{BOS})\, p_{\text{sm}}(\text{you} \mid \text{are})\, p_{\text{sm}}(\text{EOS} \mid \text{you}) \\
&= \frac{\text{count(BOS are)} + 1}{\text{count(BOS)} + |V|} \cdot \frac{\text{count(are you)} + 1}{\text{count(are)} + |V|} \cdot \frac{\text{count(you EOS)} + 1}{\text{count(you)} + |V|} \\
&= \frac{2}{14} \cdot \frac{2}{14} \cdot \frac{1}{14} \approx 0.15\%
\end{align*}
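These smoothed values can be verified with a short sketch, using λ = 1, |V| = 7, and the counts from part (a):

```python
V, lam = 7, 1.0   # |V| = |{BOS, EOS, here, David, are, you, the}|, lambda = 1

# (count(prev cur), count(prev)) for each bigram of the two sentences.
first  = [(2, 7), (2, 6), (6, 7), (2, 7)]   # BOS here, here you, you are, are EOS
second = [(1, 7), (1, 7), (0, 7)]           # BOS are, are you, you EOS

def smoothed_prob(pairs):
    p = 1.0
    for c_bigram, c_prev in pairs:
        p *= (c_bigram + lam) / (c_prev + V * lam)
    return p

print(smoothed_prob(first))   # ~0.0053
print(smoothed_prob(second))  # ~0.0015
```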

Question 2: Unigram Language Models*
In this problem, we consider a language model over a vocabulary V estimated from a
corpus consisting of M tokens.

(a) Suppose that the word wi ∈ V appears mi times. Estimate the unigram probability
of wi with Laplace smoothing. For what values of mi is the smoothed probability of
word wi greater than the unsmoothed probability?
(b) Consider a simple language in which strings are constructed by drawing each token
from the vocabulary V with probability 1/|V|, independent of previous tokens. For
simplicity, assume that neighboring bigrams are independent and exclude bigrams
containing BOS and EOS from your computations (i.e. exclude them from the vocabulary).
(i) What is the expected value of the fraction of bigrams that can be constructed
from V with zero count?
(ii) Determine the value of M such that the expected fraction of bigrams with zero
count is at most ε ∈ (0, 1). (Hint: Use that log(x + 1) ≈ x for |x| ≪ 1.)

(a) With Laplace smoothing, the unigram estimate is (m_i + λ)/(M + |V|λ). It is greater than the unsmoothed estimate m_i/M exactly when

\begin{align*}
\frac{m_i}{M} &< \frac{m_i + \lambda}{M + |V|\lambda} \\
m_i M + m_i |V| \lambda &< M m_i + M \lambda \\
m_i &< \frac{M}{|V|}
\end{align*}

Note that this threshold is independent of λ.
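A quick numeric check of this threshold, as a sketch with arbitrarily chosen M and |V|:

```python
M, V, lam = 1000, 50, 1.0   # arbitrary corpus size, vocabulary size, lambda
threshold = M / V           # smoothing raises the estimate exactly when m_i < 20

for m_i in (5, 20, 40):
    unsmoothed = m_i / M
    smoothed = (m_i + lam) / (M + V * lam)
    print(m_i, smoothed > unsmoothed)   # True, False (equal at the threshold), False
```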

(b) (i) We have M − 1 bigrams in our corpus. With the independence assumption, each
bigram is sampled uniformly at random with probability 1/|V|². Letting N_ij be a
random variable denoting the count of bigram (w_i, w_j), we have that N_ij follows
a binomial distribution:

\[
p(N_{ij} = k) = \binom{M-1}{k} \left(\frac{1}{|V|^2}\right)^{k} \left(1 - \frac{1}{|V|^2}\right)^{M-1-k}
\]

We are interested in the case k = 0:

\[
p(N_{ij} = 0) = \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]

The expected number of bigrams with zero count is then

\[
\mathbb{E}\Big[\sum_{w_i, w_j \in V} \mathbb{1}\{N_{ij} = 0\}\Big] = \sum_{w_i, w_j \in V} \mathbb{E}\big[\mathbb{1}\{N_{ij} = 0\}\big] = \sum_{w_i, w_j \in V} p(N_{ij} = 0) = |V|^2 \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]

Thus, the expected fraction of bigrams with zero count is

\[
\frac{|V|^2 \left(1 - \frac{1}{|V|^2}\right)^{M-1}}{|V|^2} = \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]
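The closed form can be compared against a small simulation under the same independence assumption; the sketch below uses arbitrarily chosen |V| and M:

```python
import random

V, M, trials = 10, 200, 2000
closed_form = (1 - 1 / V**2) ** (M - 1)

total = 0.0
for _ in range(trials):
    # Draw M - 1 bigrams independently and uniformly from the V^2 possibilities,
    # mirroring the exercise's independence assumption.
    seen = {(random.randrange(V), random.randrange(V)) for _ in range(M - 1)}
    total += (V**2 - len(seen)) / V**2

print(round(closed_form, 4))     # ~0.1353
print(round(total / trials, 4))  # should be close to the closed form
```
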
(ii) We seek the value of M for which the expected fraction from part (i) is at most ε.
Fix ε ∈ (0, 1). Requiring

\[
\epsilon \ge \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]

and taking logarithms gives

\[
\log \epsilon \ge (M - 1) \log\left(1 - \frac{1}{|V|^2}\right) \approx (M - 1)\left(-\frac{1}{|V|^2}\right)
\]

so that

\[
M - 1 \gtrsim -|V|^2 \log \epsilon = |V|^2 \log \frac{1}{\epsilon}
\]

\[
M \gtrsim |V|^2 \log \frac{1}{\epsilon} + 1
\]
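Plugging this value of M back into the exact expression from part (i) confirms the bound; the sketch below uses arbitrarily chosen |V| and ε:

```python
import math

V, eps = 10, 0.01
M = math.ceil(V**2 * math.log(1 / eps) + 1)    # ~462 tokens

fraction = (1 - 1 / V**2) ** (M - 1)
print(M, round(fraction, 5))   # slightly below eps, since |log(1 - x)| > x
```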


Question 3: Modeling the Gambler’s Ruin
Consider the following RNN model architecture:

[Figure: an RNN cell with input x_t, hidden state h_t, and output y_t]

where we define:

\begin{align*}
h_t &= f(w_0 x_t + w_1 h_{t-1} + b_0) \\
y_t &= g(w_2 h_t + b_1)
\end{align*}

Your friend convinces you to try your luck and play a game. She will flip a coin and if the
coin lands heads up, you get $100. But if the coin lands tails up, you lose $100. We want
the above model to output your current amount of money given a sequence of coin flips.

(a) In this model, what do we define our input xt to be? How can we denote the values
that xt takes on?
(b) Assume that you start with $200. Find weights, biases and activation functions such
that at time step t the model outputs your current amount of money. Note: You cannot
have a negative amount of money. In other words, it is your friend’s responsibility to
stop flipping the coin once you lose all of your money! For example, if the coin flips
are [t, h, h, t, t, t, t] your network should output [100, 200, 300, 200, 100, 0, 0]. Note that
the solution to this problem depends on part a, which has multiple solutions.

9
(a) We can think of the coin flips as our inputs x_t. One possible solution is that if at time t
the coin lands heads up, we set x_t = 1; if it lands tails up, we set x_t = 0.

(b) The hidden state h_{t-1} is the amount of money before the coin flip and the output
is y_t. We start with h_0 = $200. Since our outputs cannot be negative, we set the
activation function f to be ReLU. To encode winning or losing money, we can set the
bias b_0 = −100 and the weight w_0 = 200. Lastly, we set w_1 = 1, w_2 = 1, b_1 = 0, and
g to be the identity function g(x) = x.
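A minimal sketch of this network, checked on the example sequence from the question (the heads/tails encoding follows part (a); the weights are the ones above):

```python
def relu(z):
    return max(0.0, z)

def play(flips, h0=200.0):
    w0, w1, b0 = 200.0, 1.0, -100.0   # hidden-state weights and bias
    w2, b1 = 1.0, 0.0                 # output weight and bias, g = identity
    h, outputs = h0, []
    for flip in flips:
        x = 1.0 if flip == "h" else 0.0   # heads -> 1, tails -> 0
        h = relu(w0 * x + w1 * h + b0)
        outputs.append(w2 * h + b1)
    return outputs

print(play("thhtttt"))   # [100.0, 200.0, 300.0, 200.0, 100.0, 0.0, 0.0]
```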

Question 4: Recurrent Neural Networks*
(a) What does the term “exploding gradients” refer to? What circumstances lead to
“exploding gradients?”
(b) Why are “vanishing gradients” a more common problem in basic RNNs compared to
feed forward networks?
(c) Name two methods for alleviating the vanishing gradient problem and explain why
they work.
(d) Consider a recurrent neural network with a single hidden unit and a sigmoid activation,
h_m = σ(θ h_{m−1} + x_m). Prove that if |θ| < 1, then the gradient ∂h_{m+k}/∂h_m goes to zero as
k → ∞.

(a) “Exploding gradients” refers to a large increase in the norm of the gradient during
training. Backpropagation computes gradients with the chain rule, and accumulating
gradients through many layers means multiplying a large number of per-layer factors.
When many of these factors are greater than 1, the product can become very large. This
in turn results in large updates to the network weights and may even lead to numerical
overflow, hence the term “exploding gradients”. The problem gets worse as the length
of the sequence increases.

(b) In feed-forward networks, the gradients with respect to earlier layers depend on the
weight matrices of the layers after them. Since these weight matrices are different,
larger and smaller matrices may balance each other out, so the gradients need not
vanish. RNNs, on the other hand, share weights across time steps, so the gradient with
respect to an earlier time step depends on the same matrix multiplied with itself many
times. If that matrix repeatedly increases the magnitude of the gradient, we get the
“exploding gradient” problem; if it repeatedly decreases the magnitude, we get the
“vanishing gradient” problem.

(c) Some of the common ways to alleviate the “vanishing gradient problem” are:

(i) Use LSTM or GRU units (their gated, additive cell-state updates give gradients a path across many time steps without repeated shrinking)
(ii) Use residual connections (the identity skip path contributes a gradient term that is not attenuated by the weight matrices)
(iii) Use ReLU instead of the sigmoid activation function (its derivative is 1 for positive inputs, rather than at most 0.25)
(iv) Use fewer layers (fewer factors are multiplied in the chain rule)
(v) Use batch normalization (it keeps pre-activations in a range where the activation derivatives are not vanishingly small)
(vi) Truncate backpropagation through time (it limits how many factors are multiplied together)

(d) We have

\[
h_m = \sigma(\theta h_{m-1} + x_m)
\]

\[
\frac{\partial h_m}{\partial h_{m-1}} = \theta\, \sigma'(\theta h_{m-1} + x_m)
\]

It can easily be shown that the derivative of the sigmoid function is

\[
\sigma'(x) = \sigma(x)(1 - \sigma(x))
\]

Since the sigmoid is bounded with outputs in (0, 1), the maximum value of σ'(x) is 0.25,
attained at x = 0. This can be seen from the graph or by taking the second derivative of σ.
Now let's see what happens when k → ∞.

\begin{align*}
\left|\frac{\partial h_{m+k}}{\partial h_m}\right| &= \left|\prod_{i=m+1}^{m+k} \frac{\partial h_i}{\partial h_{i-1}}\right| \\
&= \prod_{i=1}^{k} \left|\theta\, \sigma'(\theta h_{m+i-1} + x_{m+i})\right| \\
&\le \prod_{i=1}^{k} 0.25\,|\theta| \\
&= 0.25^{k}\, |\theta|^{k} \\
&\to 0 \text{ as } k \to \infty \quad (\text{since } |\theta| < 1)
\end{align*}
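A numeric illustration of this bound, as a sketch with an arbitrary θ and input sequence:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, xs, h0=0.5):
    """Compute d h_{m+k} / d h_m for h_i = sigmoid(theta * h_{i-1} + x_i)."""
    h, grad = h0, 1.0
    for x in xs:
        h = sigmoid(theta * h + x)
        grad *= theta * h * (1 - h)   # theta * sigma'(pre-activation), at most 0.25*|theta|
    return grad

for k in (5, 20, 50):
    print(k, gradient(theta=0.9, xs=[0.1] * k))   # shrinks rapidly toward zero
```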

This can easily be extended to higher dimensions. Pascanu et al.¹ showed that if
the largest eigenvalue of the recurrent weight matrix is less than 1, then the gradient shrinks
exponentially fast.

¹ Razvan Pascanu, Tomas Mikolov, Yoshua Bengio. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1310–1318, 2013.

Problems adapted from Jacob Eisenstein’s Introduction to Natural Language Processing textbook.
