Natural Language Processing

Fall 2021
Andreas Opedal, aopedal@ethz.ch

Week 5 Exercises
Limited data is one of the biggest problems in estimating language models—this problem
also affects n-gram models. In practice, a corpus over which we estimate a language model
often does not contain instances of all possible n-grams (even for low-order n). The model
then assigns zero probability to any text containing such an n-gram. To avoid this, we
introduce Laplace smoothing (or Lidstone smoothing), which adds imaginary “pseudo”
counts to every n-gram. In the bigram case,
this corresponds to estimating bigram probabilities over a vocabulary V as:

\[
\hat{p}_\lambda(w_m \mid w_{m-1}) = \frac{\mathrm{count}(w_{m-1}, w_m) + \lambda}{\sum_{w' \in V} \mathrm{count}(w_{m-1}, w') + |V|\,\lambda}
\]
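To make this estimator concrete, here is a minimal Python sketch (the function name, argument layout, and toy counts below are our own illustration):

```python
from collections import Counter

def laplace_bigram_prob(prev_word, word, bigram_counts, vocab, lam=1.0):
    """Add-lambda estimate of p(word | prev_word), following the formula above."""
    numerator = bigram_counts[(prev_word, word)] + lam
    denominator = sum(bigram_counts[(prev_word, v)] for v in vocab) + len(vocab) * lam
    return numerator / denominator

# Tiny usage example with made-up counts.
counts = Counter({("the", "cat"): 2, ("the", "dog"): 1})
vocab = {"the", "cat", "dog", "EOS"}
print(laplace_bigram_prob("the", "cat", counts, vocab))  # (2 + 1) / (3 + 4) ≈ 0.43
```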

Question 1: n-gram Language Modeling


Consider the following vocabulary:

{BOS, EOS, here, David, are, you, the}

where BOS and EOS are defined as in the lecture, i.e. BOS is the dummy token indicating
the beginning of a sentence and EOS indicates the end of a sentence. Note that we never
need to compute the (conditional) probability of BOS, so we do not include a unigram count
for BOS, nor bigram counts in which BOS is the predicted (second) token.
Consider the following training data:

BOS here you are EOS
BOS here you are David EOS
BOS are you here EOS
BOS you are here EOS
BOS you are here EOS
BOS David you are here EOS
BOS you are EOS

(a) Compute all n-gram counts up to n = 2.


(b) Calculate the following probabilities:
(i) p(you)
(ii) p(you|are)
(c) Using unigram and bigram language models, compute the probabilities of the following
sentences:
(i) BOS here you are EOS

(ii) BOS are you EOS
What do you observe?
(d) Apply Laplace smoothing with λ = 1 to the bigram model and compute the probabilities of the sentences from part (c) again.

(a) (i) Unigrams:
here 6
David 2
the 0
you 7
are 7
EOS 7

(ii) Bigrams:
BOS here 2
BOS are 1
BOS you 3
BOS David 1
here you 2
you are 6
are EOS 2
are David 1
David EOS 1
are you 1
you here 1
here EOS 4
are here 3
David you 1

The rest are zero.
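These tables can also be reproduced programmatically; the following is a small Python sketch using the training data above (the variable names are our own):

```python
from collections import Counter

corpus = [
    "BOS here you are EOS",
    "BOS here you are David EOS",
    "BOS are you here EOS",
    "BOS you are here EOS",
    "BOS you are here EOS",
    "BOS David you are here EOS",
    "BOS you are EOS",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    # Unigram counts exclude BOS, as stated in the exercise.
    unigram_counts.update(t for t in tokens if t != "BOS")
    # Bigrams keep BOS as a left context only, e.g. ("BOS", "here").
    bigram_counts.update(zip(tokens, tokens[1:]))

print(unigram_counts["you"])            # 7
print(bigram_counts[("you", "are")])    # 6
print(bigram_counts[("you", "EOS")])    # 0
```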

(b) (i) The total number of unigrams is 6 + 2 + 0 + 7 + 7 + 7 = 29 and you occurs 7 times. Therefore p(you) = 7/29.
(ii) The total number of bigrams where are is the first word is 2 + 1 + 1 + 3 = 7 and 1 of them ends with you, so p(you|are) = 1/7.

(c) Unigram model:

\begin{align*}
p(\text{here you are EOS}) &= p(\text{here})\, p(\text{you} \mid \text{here})\, p(\text{are} \mid \text{you, here})\, p(\text{EOS} \mid \text{are, you, here}) \\
&= p(\text{here})\, p(\text{you})\, p(\text{are})\, p(\text{EOS}) \\
&= \frac{6 \cdot 7 \cdot 7 \cdot 7}{29^4} \approx 0.29\%
\end{align*}

\begin{align*}
p(\text{are you EOS}) &= p(\text{are})\, p(\text{you} \mid \text{are})\, p(\text{EOS} \mid \text{you, are}) \\
&= p(\text{are})\, p(\text{you})\, p(\text{EOS}) \\
&= \frac{7 \cdot 7 \cdot 7}{29^3} \approx 1.4\%
\end{align*}
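A quick numeric check of these two values, as a small Python sketch that hard-codes the unigram counts from part (a):

```python
# Unigram counts from part (a); BOS is excluded, so the total is 29.
counts = {"here": 6, "David": 2, "the": 0, "you": 7, "are": 7, "EOS": 7}
total = sum(counts.values())

def unigram_prob(sentence):
    """Probability of a sentence under the (unsmoothed) unigram model."""
    p = 1.0
    for w in sentence.split():
        p *= counts[w] / total
    return p

print(unigram_prob("here you are EOS"))  # ~0.0029
print(unigram_prob("are you EOS"))       # ~0.0141
```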

Bigram model:

\begin{align*}
p(\text{BOS here you are EOS}) &= p(\text{here} \mid \text{BOS})\, p(\text{you} \mid \text{here, BOS})\, p(\text{are} \mid \text{you, here, BOS})\, p(\text{EOS} \mid \text{are, you, here, BOS}) \\
&= p(\text{here} \mid \text{BOS})\, p(\text{you} \mid \text{here})\, p(\text{are} \mid \text{you})\, p(\text{EOS} \mid \text{are}) \\
&= \frac{\text{count(BOS here)}}{\text{count(BOS)}} \cdot \frac{\text{count(here you)}}{\text{count(here)}} \cdot \frac{\text{count(you are)}}{\text{count(you)}} \cdot \frac{\text{count(are EOS)}}{\text{count(are)}} \\
&= \frac{2}{7} \cdot \frac{2}{6} \cdot \frac{6}{7} \cdot \frac{2}{7} = \frac{48}{2058} \approx 2.33\%
\end{align*}

\begin{align*}
p(\text{BOS are you EOS}) &= p(\text{are} \mid \text{BOS})\, p(\text{you} \mid \text{are, BOS})\, p(\text{EOS} \mid \text{you, are, BOS}) \\
&= p(\text{are} \mid \text{BOS})\, p(\text{you} \mid \text{are})\, p(\text{EOS} \mid \text{you}) \\
&= \frac{\text{count(BOS are)}}{\text{count(BOS)}} \cdot \frac{\text{count(are you)}}{\text{count(are)}} \cdot \frac{\text{count(you EOS)}}{\text{count(you)}} \\
&= \frac{1}{7} \cdot \frac{1}{7} \cdot \frac{0}{7} = 0\%
\end{align*}

Observe that the probability of the second sentence is zero under the bigram model, since
the bigram “you EOS” never occurs in the training set, i.e. “you” never ends a sentence there.
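The same observation can be verified numerically; the sketch below hard-codes the relevant counts from part (a):

```python
# Bigram and unigram counts from part (a); count(BOS) = 7 sentences.
bigram = {("BOS", "here"): 2, ("here", "you"): 2, ("you", "are"): 6, ("are", "EOS"): 2,
          ("BOS", "are"): 1, ("are", "you"): 1, ("you", "EOS"): 0}
context = {"BOS": 7, "here": 6, "you": 7, "are": 7}

def bigram_prob(sentence):
    """Probability of a sentence under the (unsmoothed) bigram model."""
    tokens = sentence.split()
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram[(prev, cur)] / context[prev]
    return p

print(bigram_prob("BOS here you are EOS"))  # ~0.0233
print(bigram_prob("BOS are you EOS"))       # 0.0
```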

(d) The new probabilities for the sentences from part (c) are:

\begin{align*}
p_{\text{sm}}(\text{BOS here you are EOS}) &= p_{\text{sm}}(\text{here} \mid \text{BOS})\, p_{\text{sm}}(\text{you} \mid \text{here})\, p_{\text{sm}}(\text{are} \mid \text{you})\, p_{\text{sm}}(\text{EOS} \mid \text{are}) \\
&= \frac{\text{count(BOS here)} + 1}{\text{count(BOS)} + |V|} \cdot \frac{\text{count(here you)} + 1}{\text{count(here)} + |V|} \cdot \frac{\text{count(you are)} + 1}{\text{count(you)} + |V|} \cdot \frac{\text{count(are EOS)} + 1}{\text{count(are)} + |V|} \\
&= \frac{3}{14} \cdot \frac{3}{13} \cdot \frac{7}{14} \cdot \frac{3}{14} \approx 0.53\%
\end{align*}

\begin{align*}
p_{\text{sm}}(\text{BOS are you EOS}) &= p_{\text{sm}}(\text{are} \mid \text{BOS})\, p_{\text{sm}}(\text{you} \mid \text{are})\, p_{\text{sm}}(\text{EOS} \mid \text{you}) \\
&= \frac{\text{count(BOS are)} + 1}{\text{count(BOS)} + |V|} \cdot \frac{\text{count(are you)} + 1}{\text{count(are)} + |V|} \cdot \frac{\text{count(you EOS)} + 1}{\text{count(you)} + |V|} \\
&= \frac{2}{14} \cdot \frac{2}{14} \cdot \frac{1}{14} \approx 0.15\%
\end{align*}
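These smoothed values can be verified with a short sketch, using λ = 1, |V| = 7, and the counts from part (a):

```python
V, lam = 7, 1.0   # |V| = |{BOS, EOS, here, David, are, you, the}|, lambda = 1

# (count(prev cur), count(prev)) for each bigram of the two sentences.
first  = [(2, 7), (2, 6), (6, 7), (2, 7)]   # BOS here, here you, you are, are EOS
second = [(1, 7), (1, 7), (0, 7)]           # BOS are, are you, you EOS

def smoothed_prob(pairs):
    p = 1.0
    for c_bigram, c_prev in pairs:
        p *= (c_bigram + lam) / (c_prev + V * lam)
    return p

print(smoothed_prob(first))   # ~0.0053
print(smoothed_prob(second))  # ~0.0015
```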

Question 2: Unigram Language Models*
In this problem, we consider a language model over a vocabulary V estimated from a
corpus consisting of M tokens.

(a) Suppose that the word wi ∈ V appears mi times. Estimate the unigram probability
of wi with Laplace smoothing. For what values of mi is the smoothed probability of
word wi greater than the unsmoothed probability?
(b) Consider a simple language in which strings are constructed by drawing each token
from the vocabulary V with probability 1/|V|, independent of previous tokens. For
simplicity, assume that neighboring bigrams are independent and exclude bigrams
containing BOS and EOS from your computations (i.e. exclude them from the vocabulary).
(i) What is the expected value of the fraction of bigrams that can be constructed
from V with zero count?
(ii) Determine the value of M such that the expected fraction of bigrams with zero
count is at most ε ∈ (0, 1). (Hint: Use that log(x + 1) ≈ x for |x| ≪ 1.)

(a) With Laplace smoothing, the unigram estimate is (m_i + λ)/(M + |V|λ). It is greater than the unsmoothed estimate m_i/M exactly when

\begin{align*}
\frac{m_i}{M} &< \frac{m_i + \lambda}{M + |V|\lambda} \\
m_i M + m_i |V| \lambda &< M m_i + M \lambda \\
m_i &< \frac{M}{|V|}
\end{align*}

Note that this threshold is independent of λ.
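A quick numeric check of this threshold, as a sketch with arbitrarily chosen M and |V|:

```python
M, V, lam = 1000, 50, 1.0   # arbitrary corpus size, vocabulary size, lambda
threshold = M / V           # smoothing raises the estimate exactly when m_i < 20

for m_i in (5, 20, 40):
    unsmoothed = m_i / M
    smoothed = (m_i + lam) / (M + V * lam)
    print(m_i, smoothed > unsmoothed)   # True, False (equal at the threshold), False
```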

(b) (i) We have M − 1 bigrams in our corpus. With the independence assumption, each
bigram is sampled uniformly at random with probability 1/|V|². Letting N_ij be a
random variable denoting the count of bigram (w_i, w_j), we have that N_ij follows
a binomial distribution:

\[
p(N_{ij} = k) = \binom{M-1}{k} \left(\frac{1}{|V|^2}\right)^{k} \left(1 - \frac{1}{|V|^2}\right)^{M-1-k}
\]

We are interested in the case k = 0:

\[
p(N_{ij} = 0) = \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]

The expected number of bigrams with zero count is then

\[
\mathbb{E}\Big[\sum_{w_i, w_j \in V} \mathbb{1}\{N_{ij} = 0\}\Big] = \sum_{w_i, w_j \in V} \mathbb{E}\big[\mathbb{1}\{N_{ij} = 0\}\big] = \sum_{w_i, w_j \in V} p(N_{ij} = 0) = |V|^2 \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]

Thus, the expected fraction of bigrams with zero count is

\[
\frac{|V|^2 \left(1 - \frac{1}{|V|^2}\right)^{M-1}}{|V|^2} = \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]
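The closed form can be compared against a small simulation under the same independence assumption; the sketch below uses arbitrarily chosen |V| and M:

```python
import random

V, M, trials = 10, 200, 2000
closed_form = (1 - 1 / V**2) ** (M - 1)

total = 0.0
for _ in range(trials):
    # Draw M - 1 bigrams independently and uniformly from the V^2 possibilities,
    # mirroring the exercise's independence assumption.
    seen = {(random.randrange(V), random.randrange(V)) for _ in range(M - 1)}
    total += (V**2 - len(seen)) / V**2

print(round(closed_form, 4))     # ~0.1353
print(round(total / trials, 4))  # should be close to the closed form
```
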
(ii) We seek the value of M for which the expected fraction from part (i) is at most ε.
Fix ε ∈ (0, 1). Requiring

\[
\epsilon \ge \left(1 - \frac{1}{|V|^2}\right)^{M-1}
\]

and taking logarithms gives

\[
\log \epsilon \ge (M - 1) \log\left(1 - \frac{1}{|V|^2}\right) \approx (M - 1)\left(-\frac{1}{|V|^2}\right)
\]

so that

\[
M - 1 \gtrsim -|V|^2 \log \epsilon = |V|^2 \log \frac{1}{\epsilon}
\]

\[
M \gtrsim |V|^2 \log \frac{1}{\epsilon} + 1
\]
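Plugging this value of M back into the exact expression from part (i) confirms the bound; the sketch below uses arbitrarily chosen |V| and ε:

```python
import math

V, eps = 10, 0.01
M = math.ceil(V**2 * math.log(1 / eps) + 1)    # ~462 tokens

fraction = (1 - 1 / V**2) ** (M - 1)
print(M, round(fraction, 5))   # slightly below eps, since |log(1 - x)| > x
```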


Question 3: Modeling the Gambler’s Ruin
Consider the following RNN model architecture:

[Figure: an RNN cell with input x_t, hidden state h_t, and output y_t]

where we define:

\begin{align*}
h_t &= f(w_0 x_t + w_1 h_{t-1} + b_0) \\
y_t &= g(w_2 h_t + b_1)
\end{align*}

Your friend convinces you to try your luck and play a game. She will flip a coin and if the
coin lands heads up, you get $100. But if the coin lands tails up, you lose $100. We want
the above model to output your current amount of money given a sequence of coin flips.

(a) In this model, what do we define our input xt to be? How can we denote the values
that xt takes on?
(b) Assume that you start with $200. Find weights, biases and activation functions such
that at time step t the model outputs your current amount of money. Note: You cannot
have a negative amount of money. In other words, it is your friend’s responsibility to
stop flipping the coin once you lose all of your money! For example, if the coin flips
are [t, h, h, t, t, t, t] your network should output [100, 200, 300, 200, 100, 0, 0]. Note that
the solution to this problem depends on part a, which has multiple solutions.

9
(a) We can think of the coin flips as our inputs x_t. One possible solution is that if at time t
the coin lands heads up, we set x_t = 1; if it lands tails up, we set x_t = 0.

(b) The hidden state h_{t-1} is the amount of money before the coin flip and the output
is y_t. We start with h_0 = $200. Since our outputs cannot be negative, we set the
activation function f to be ReLU. To encode winning or losing money, we can set the
bias b_0 = −100 and the weight w_0 = 200. Lastly, we set w_1 = 1, w_2 = 1, b_1 = 0, and
g to be the identity function g(x) = x.
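A minimal sketch of this network, checked on the example sequence from the question (the heads/tails encoding follows part (a); the weights are the ones above):

```python
def relu(z):
    return max(0.0, z)

def play(flips, h0=200.0):
    w0, w1, b0 = 200.0, 1.0, -100.0   # hidden-state weights and bias
    w2, b1 = 1.0, 0.0                 # output weight and bias, g = identity
    h, outputs = h0, []
    for flip in flips:
        x = 1.0 if flip == "h" else 0.0   # heads -> 1, tails -> 0
        h = relu(w0 * x + w1 * h + b0)
        outputs.append(w2 * h + b1)
    return outputs

print(play("thhtttt"))   # [100.0, 200.0, 300.0, 200.0, 100.0, 0.0, 0.0]
```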

Question 4: Recurrent Neural Networks*
(a) What does the term “exploding gradients” refer to? What circumstances lead to
“exploding gradients?”
(b) Why are “vanishing gradients” a more common problem in basic RNNs compared to
feed forward networks?
(c) Name two methods for alleviating the vanishing gradient problem and explain why
they work.
(d) Consider a recurrent neural network with a single hidden unit and a sigmoid activation,
h_m = σ(θ h_{m−1} + x_m). Prove that if |θ| < 1, then the gradient ∂h_{m+k}/∂h_m goes to zero as
k → ∞.

(a) “Exploding gradients” refers to a large increase in the norm of the gradient during
training. Backpropagation computes gradients with the chain rule, and accumulating
gradients through many layers means multiplying a large number of per-layer factors.
When many of these factors are greater than 1, the product can become very large. This
in turn results in large updates to the network weights and may even lead to numerical
overflow, hence the term “exploding gradients”. The problem gets worse as the length
of the sequence increases.

(b) In feed-forward networks, the gradients with respect to earlier layers depend on the
weight matrices of the layers after them. Since these weight matrices are different,
larger and smaller matrices may balance each other out, so the gradients need not
vanish. RNNs, on the other hand, share weights across time steps, so the gradient with
respect to an earlier time step depends on the same matrix multiplied with itself many
times. If that matrix repeatedly increases the magnitude of the gradient, we get the
“exploding gradient” problem; if it repeatedly decreases the magnitude, we get the
“vanishing gradient” problem.

(c) Some of the common ways to alleviate the “vanishing gradient problem” are:

(i) Use LSTM or GRU units (their gated, additive cell-state updates give gradients a path across many time steps without repeated shrinking)
(ii) Use residual connections (the identity skip path contributes a gradient term that is not attenuated by the weight matrices)
(iii) Use ReLU instead of the sigmoid activation function (its derivative is 1 for positive inputs, rather than at most 0.25)
(iv) Use fewer layers (fewer factors are multiplied in the chain rule)
(v) Use batch normalization (it keeps pre-activations in a range where the activation derivatives are not vanishingly small)
(vi) Truncate backpropagation through time (it limits how many factors are multiplied together)

(d) We have

\[
h_m = \sigma(\theta h_{m-1} + x_m)
\]

\[
\frac{\partial h_m}{\partial h_{m-1}} = \theta\, \sigma'(\theta h_{m-1} + x_m)
\]

It can easily be shown that the derivative of the sigmoid function is

\[
\sigma'(x) = \sigma(x)(1 - \sigma(x))
\]

Since the sigmoid is bounded with outputs in (0, 1), the maximum value of σ'(x) is 0.25,
attained at x = 0. This can be seen from the graph or by taking the second derivative of σ.
Now let's see what happens when k → ∞.

\begin{align*}
\left|\frac{\partial h_{m+k}}{\partial h_m}\right| &= \left|\prod_{i=m+1}^{m+k} \frac{\partial h_i}{\partial h_{i-1}}\right| \\
&= \prod_{i=1}^{k} \left|\theta\, \sigma'(\theta h_{m+i-1} + x_{m+i})\right| \\
&\le \prod_{i=1}^{k} 0.25\,|\theta| \\
&= 0.25^{k}\, |\theta|^{k} \\
&\to 0 \text{ as } k \to \infty \quad (\text{since } |\theta| < 1)
\end{align*}
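A numeric illustration of this bound, as a sketch with an arbitrary θ and input sequence:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, xs, h0=0.5):
    """Compute d h_{m+k} / d h_m for h_i = sigmoid(theta * h_{i-1} + x_i)."""
    h, grad = h0, 1.0
    for x in xs:
        h = sigmoid(theta * h + x)
        grad *= theta * h * (1 - h)   # theta * sigma'(pre-activation), at most 0.25*|theta|
    return grad

for k in (5, 20, 50):
    print(k, gradient(theta=0.9, xs=[0.1] * k))   # shrinks rapidly toward zero
```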

This can easily be extended to higher dimensions. Pascanu et al.¹ showed that if
the largest eigenvalue of the recurrent weight matrix is less than 1, then the gradient shrinks
exponentially fast.

¹ Razvan Pascanu, Tomas Mikolov, Yoshua Bengio. On the difficulty of training recurrent neural networks. Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1310–1318, 2013.

Problems adapted from Jacob Eisenstein’s Introduction to Natural Language Processing textbook.
