CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part I
Word Vectors I: Introduction, SVD and Word2Vec
Winter 2019

Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Michael Fang, Guillaume Genthial, Rohit Mundra, Richard Socher

Keyphrases: Natural Language Processing. Word Vectors. Singular Value Decomposition. Skip-gram. Continuous Bag of Words (CBOW). Negative Sampling. Hierarchical Softmax. Word2Vec.

This set of notes begins by introducing the concept of Natural Language Processing (NLP) and the problems NLP faces today. We then move forward to discuss the concept of representing words as numeric vectors. Lastly, we discuss popular approaches to designing word vectors.

1 Introduction to Natural Language Processing

We begin with a general discussion of what NLP is.

1.1 What is so special about NLP?


What's so special about human (natural) language? Human language is a system specifically constructed to convey meaning, and is not produced by a physical manifestation of any kind. In that way, it is very different from vision or any other machine learning task. (Side note: natural language is a discrete/symbolic/categorical system.)

Most words are just symbols for an extra-linguistic entity: the word is a signifier that maps to a signified (an idea or thing). For instance, the word "rocket" refers to the concept of a rocket, and by extension can designate an instance of a rocket. There are some exceptions, when we use words and letters for expressive signaling, like in "Whooompaa". On top of this, the symbols of language can be encoded in several modalities (voice, gesture, writing, etc.) that are transmitted via continuous signals to the brain, which itself appears to encode things in a continuous manner. (A lot of work in philosophy of language and linguistics has been done to conceptualize human language and distinguish words from their references, meanings, etc. Among others, see works by Wittgenstein, Frege, Russell, and Mill.)

1.2 Examples of tasks


There are different levels of tasks in NLP, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to be able to design algorithms that allow computers to "understand" natural language in order to perform some task. Example tasks come in varying levels of difficulty:

Easy

• Spell Checking

• Keyword Search

• Finding Synonyms

Medium

• Parsing information from websites, documents, etc.

Hard

• Machine Translation (e.g. Translate Chinese text to English)

• Semantic Analysis (What is the meaning of query statement?)

• Coreference (e.g. What does "he" or "it" refer to given a document?)

• Question Answering (e.g. Answering Jeopardy questions).

1.3 How to represent words?


The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, Cosine, Euclidean, etc.).
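As a small illustration (not part of the original notes), the sketch below computes cosine similarity and Euclidean distance between two hypothetical dense word vectors using NumPy; the vectors and their values are made up for the example.

    import numpy as np

    # Hypothetical 4-dimensional word vectors (values invented for illustration).
    v_hotel = np.array([0.8, 0.1, 0.3, 0.4])
    v_motel = np.array([0.7, 0.2, 0.2, 0.5])

    def cosine_similarity(a, b):
        """Cosine of the angle between two vectors; 1.0 means identical direction."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def euclidean_distance(a, b):
        """Straight-line distance between the two points."""
        return float(np.linalg.norm(a - b))

    print(cosine_similarity(v_hotel, v_motel))   # close to 1 for similar words
    print(euclidean_distance(v_hotel, v_motel))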

2 Word Vectors

There are an estimated 13 million tokens for the English language, but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode each word token into some vector that represents a point in some sort of "word" space. This is paramount for a number of reasons, but the most intuitive reason is that perhaps there actually exists some N-dimensional space (such that N ≪ 13 million) that is sufficient to encode all the semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).

So let's dive into our first word vector, and arguably the simplest: the one-hot vector. Represent every word as an R^{|V|×1} vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, |V| is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:
$$w^{aardvark} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad w^{a} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad w^{at} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\quad \cdots,\quad w^{zebra} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

Fun fact: the term "one-hot" comes from digital circuit design, meaning "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)".

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. For instance,

$$(w^{hotel})^T w^{motel} = (w^{hotel})^T w^{cat} = 0$$

Side note: denotational semantics is the concept of representing an idea as a symbol (a word or a one-hot vector). It is sparse and cannot capture similarity. This is a "localist" representation.

So maybe we can try to reduce the size of this space from R^{|V|} to something smaller and thus find a subspace that encodes the relationships between words.
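Before moving on, here is a minimal sketch (not from the original notes) of the one-hot representation just described, using a made-up toy vocabulary; it shows that any two distinct one-hot vectors have a dot product of zero and therefore encode no similarity.

    import numpy as np

    # Tiny toy vocabulary, sorted as the notes describe (purely illustrative).
    vocab = ["a", "aardvark", "at", "cat", "hotel", "motel", "zebra"]
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        """Return the R^{|V| x 1}-style one-hot vector for `word`."""
        v = np.zeros(len(vocab))
        v[word_to_index[word]] = 1.0
        return v

    # Distinct one-hot vectors are orthogonal, so they carry no similarity signal.
    print(one_hot("hotel") @ one_hot("motel"))  # 0.0
    print(one_hot("hotel") @ one_hot("cat"))    # 0.0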

3 SVD Based Methods

For this class of methods to find word embeddings (otherwise known as word vectors), we first loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix X, and then perform Singular Value Decomposition on X to get a USV^T decomposition. We then use the rows of U as the word embeddings for all words in our dictionary. Let us discuss a few choices of X.

3.1 Word-Document Matrix


Side note: distributional semantics is the concept of representing the meaning of a word based on the context in which it usually appears. Such representations are dense and can better capture similarity.

As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. For instance, "banks", "bonds", "stocks", "money", etc. are probably likely to appear together. But "banks", "octopus", "banana", and "hockey" would probably not consistently appear together. We use this fact to build a word-document matrix X in the following manner: loop over billions of documents and, for each time word i appears in document j, add one to entry X_ij. This is obviously a very large matrix (R^{|V|×M}) and it scales with the number of documents (M). So perhaps we can try something better.
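A minimal sketch (not part of the original notes) of building such a word-document count matrix on a made-up toy corpus, assuming whitespace tokenization:

    import numpy as np

    # Toy corpus: each string stands in for a "document" (purely illustrative).
    documents = ["banks bonds stocks money", "banks money loans", "octopus banana hockey"]
    vocab = sorted({w for doc in documents for w in doc.split()})
    word_to_index = {w: i for i, w in enumerate(vocab)}

    # X[i, j] counts how many times word i appears in document j.
    X_wd = np.zeros((len(vocab), len(documents)), dtype=np.int64)
    for j, doc in enumerate(documents):
        for w in doc.split():
            X_wd[word_to_index[w], j] += 1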

3.2 Window based Co-occurrence Matrix


The same kind of logic applies here; however, the matrix X stores co-occurrences of words, thereby becoming an affinity matrix. In this method we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus. We display an example below. Let our corpus contain just three sentences and the window size be 1:

1. I enjoy flying.
2. I like NLP.
3. I like deep learning.

Side note (using a word-word co-occurrence matrix):
• Generate the |V| × |V| co-occurrence matrix X.
• Apply SVD on X to get X = USV^T.
• Select the first k columns of U to get k-dimensional word vectors.
• (∑_{i=1}^{k} σ_i) / (∑_{i=1}^{|V|} σ_i) indicates the amount of variance captured by the first k dimensions.

The resulting counts matrix will then be:

$$X = \begin{array}{c|cccccccc}
 & \text{I} & \text{like} & \text{enjoy} & \text{deep} & \text{learning} & \text{NLP} & \text{flying} & . \\ \hline
\text{I}        & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
\text{like}     & 2 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
\text{enjoy}    & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\text{deep}     & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
\text{learning} & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
\text{NLP}      & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
\text{flying}   & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
.               & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0
\end{array}$$
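A minimal sketch (not part of the original notes) of computing this window co-occurrence matrix for the three toy sentences; with window size 1 it reproduces the counts shown above, up to the ordering of the vocabulary.

    import numpy as np

    # The three toy sentences from the notes, with "." kept as its own token.
    sentences = [["I", "enjoy", "flying", "."],
                 ["I", "like", "NLP", "."],
                 ["I", "like", "deep", "learning", "."]]
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}

    window = 1
    X = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sent in sentences:
        for pos, word in enumerate(sent):
            # Count every neighbor within `window` positions of the center word.
            for ctx in sent[max(0, pos - window):pos] + sent[pos + 1:pos + 1 + window]:
                X[idx[word], idx[ctx]] += 1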

3.3 Applying SVD to the co-occurrence matrix


We now perform SVD on X, observe the singular values (the diagonal entries in the resulting S matrix), and cut them off at some index k based on the desired percentage of variance captured:

$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$

We then take the submatrix U_{1:|V|, 1:k} to be our word embedding matrix. This would thus give us a k-dimensional representation of every word in the vocabulary.

Applying SVD to X:

$$\underset{|V| \times |V|}{X} = \underset{|V| \times |V|}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \; \underset{|V| \times |V|}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \; \underset{|V| \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$

Reducing dimensionality by selecting the first k singular vectors:

$$\underset{|V| \times |V|}{\hat{X}} = \underset{|V| \times k}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \; \underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \; \underset{k \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
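Continuing the toy example above, a small sketch (not from the original notes) of performing the truncated SVD with NumPy and reading off k-dimensional word vectors from the first k columns of U:

    import numpy as np

    # X is the |V| x |V| window co-occurrence matrix built earlier; k is the target dimension.
    k = 2
    U, S, Vt = np.linalg.svd(X.astype(float), full_matrices=False)

    # Fraction of variance captured by the first k singular values.
    captured = S[:k].sum() / S.sum()

    # Rows of U[:, :k] are the k-dimensional word embeddings.
    word_vectors = U[:, :k]
    print(captured, word_vectors.shape)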

Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information, but they are associated with many other problems:

• The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).

• The matrix is extremely sparse since most words do not co-occur.

• The matrix is very high dimensional in general (≈ 10^6 × 10^6).

• Quadratic cost to train (i.e. to perform SVD).

• Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.

Side note: SVD-based methods do not scale well for big matrices, and it is hard to incorporate new words or documents. The computational cost for an m × n matrix is O(mn^2). However, count-based methods make efficient use of the statistics.

Some solutions exist to resolve some of the issues discussed above:

• Ignore function words such as "the", "he", "has", etc.

• Apply a ramp window, i.e. weight the co-occurrence count based on the distance between the words in the document.

• Use Pearson correlation and set negative counts to 0 instead of using just the raw count.

As we see in the next section, iteration-based methods solve many of these issues in a far more elegant manner.

4 Iteration Based Methods - Word2vec


Side note: for an overview of Word2vec, a note map can be found here: https://myndbook.com/view/4900. A detailed summary of word2vec models can also be found in [Rong, 2014].

Let us step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context.

Side note: iteration-based methods capture co-occurrence of words one at a time, instead of capturing all co-occurrence counts directly as in SVD-based methods.

The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. This idea is a very old one dating back to 1986. We call this method "backpropagating" the errors (see [Rumelhart et al., 1988]). The simpler the model and the task, the faster it will be to train it.

Several approaches have been tested. [Collobert et al., 2011] design models for NLP whose first step is to transform each word into a vector. For each special task (Named Entity Recognition, Part-of-Speech tagging, etc.) they train not only the model's parameters but also the vectors, and achieve great performance, while computing good word vectors! Other interesting reading would be [Bengio et al., 2003].

Side note: the context of a word is the set of m surrounding words. For instance, the m = 2 context of the word "fox" in the sentence "The quick brown fox jumped over the lazy dog" is {"quick", "brown", "jumped", "over"}.

In this class, we will present a simpler, more recent, probabilistic method by [Mikolov et al., 2013]: word2vec. This model relies on a very important hypothesis in linguistics, distributional similarity, the idea that similar words have similar contexts. Word2vec is a software package that actually includes:

- 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.

- 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.

4.1 Language Models (Unigrams, Bigrams, etc.)


First, we need to create a model that will assign a probability to a sequence of tokens. Let us start with an example:

"The cat jumped over the puddle."

A good language model will give this sentence a high probability because this is a completely valid sentence, syntactically and semantically. Similarly, the sentence "stock boil fish is toy" should have a very low probability because it makes no sense. Mathematically, we can call this probability on any given sequence of n words:

$$P(w_1, w_2, \cdots, w_n)$$

We can take the unary language model approach and break apart this probability by assuming the word occurrences are completely independent:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} P(w_i)$$

However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words. And the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

Again, this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that with the word-word matrix with a context of size 1, we can basically learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset.

Now that we understand how we can think about a sequence of tokens having a probability, let us observe some example models that could learn these probabilities.
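As a concrete illustration (not from the original notes), the sketch below estimates bigram probabilities from raw counts on a made-up toy corpus; a real language model would of course use far more data and smoothing.

    from collections import Counter

    # Toy corpus for illustration only.
    corpus = "the cat jumped over the puddle . the cat sat on the mat .".split()

    unigram_counts = Counter(corpus)
    bigram_counts = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(prev, word):
        """Maximum-likelihood estimate of P(word | prev) from raw counts."""
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    print(bigram_prob("the", "cat"))   # 2/4 = 0.5 in this toy corpus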

4.2 Continuous Bag of Words Model (CBOW)


One approach is to treat {"The", "cat", "over", "the", "puddle"} as a context and, from these words, be able to predict or generate the center word "jumped". This type of model we call a Continuous Bag of Words (CBOW) model.

Side note (CBOW model): predicting a center word from the surrounding context. For each word, we want to learn 2 vectors: v (input vector), used when the word is in the context, and u (output vector), used when the word is the center word.

Let's discuss the CBOW model in greater detail. First, we set up our known parameters. Let the known parameters in our model be the sentence represented by one-hot word vectors. The input one-hot vectors, or context, we will represent with x^(c), and the output as y^(c). In the CBOW model, since we only have one output, we just call this y: the one-hot vector of the known center word. Now let's define the unknowns in our model.

We create two matrices, V ∈ R^{n×|V|} and U ∈ R^{|V|×n}, where n is an arbitrary size which defines the size of our embedding space. V is the input word matrix such that the i-th column of V is the n-dimensional embedded vector for word w_i when it is an input to this model. We denote this n × 1 vector as v_i. Similarly, U is the output word matrix. The j-th row of U is an n-dimensional embedded vector for word w_j when it is an output of the model. We denote this row of U as u_j. Note that we do in fact learn two vectors for every word w_i (i.e. input word vector v_i and output word vector u_i).

Notation for the CBOW model:
• w_i: word i from vocabulary V
• V ∈ R^{n×|V|}: input word matrix
• v_i: i-th column of V, the input vector representation of word w_i
• U ∈ R^{|V|×n}: output word matrix
• u_i: i-th row of U, the output vector representation of word w_i

We break down the way this model works in these steps:

1. We generate the one-hot word vectors for the input context of size m: (x^(c−m), . . . , x^(c−1), x^(c+1), . . . , x^(c+m) ∈ R^|V|).

2. We get our embedded word vectors for the context: (v_{c−m} = V x^(c−m), v_{c−m+1} = V x^(c−m+1), . . . , v_{c+m} = V x^(c+m) ∈ R^n).

3. Average these vectors to get v̂ = (v_{c−m} + v_{c−m+1} + . . . + v_{c+m}) / 2m ∈ R^n.

4. Generate a score vector z = U v̂ ∈ R^|V|. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.

5. Turn the scores into probabilities ŷ = softmax(z) ∈ R^|V|.

6. We desire the generated probabilities, ŷ ∈ R^|V|, to match the true probabilities, y ∈ R^|V|, which also happens to be the one-hot vector of the actual center word.

Side note: the softmax is an operator that we'll use very frequently. It transforms a vector into a vector whose i-th component is e^{ŷ_i} / Σ_{k=1}^{|V|} e^{ŷ_k}: exponentiating makes each component positive, and dividing by the sum normalizes the vector (Σ_k ŷ_k = 1) to give a probability distribution.

So now that we have an understanding of how our model would work if we had V and U, how would we learn these two matrices?


Well, we need to create an objective function. Very often when we
are trying to learn a probability from some true probability, we look
to information theory to give us a measure of the distance between
two distributions. Here, we use a popular choice of distance/loss
measure, cross entropy H (ŷ, y).
The intuition for the use of cross-entropy in the discrete case can be derived from the formulation of the loss function:

$$H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$$

Let us concern ourselves with the case at hand, which is that y is a one-hot vector. Thus we know that the above loss simplifies to simply:

$$H(\hat{y}, y) = -y_c \log(\hat{y}_c)$$

In this formulation, c is the index where the correct word's one-hot vector is 1. We can now consider the case where our prediction was perfect and thus ŷ_c = 1. We can then calculate H(ŷ, y) = −1 log(1) = 0. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus ŷ_c = 0.01. As before, we can calculate our loss to be H(ŷ, y) = −1 log(0.01) ≈ 4.605. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance.

(Figure 1 in the original notes illustrates how CBOW works and how we must learn the transfer matrices.)

Side note: ŷ ↦ H(ŷ, y) is minimized when ŷ = y. Then, if we find a ŷ such that H(ŷ, y) is close to the minimum, we have ŷ ≈ y. This means that our model is very good at predicting the center word! To learn the vectors (the matrices U and V), CBOW defines a cost that measures how good it is at predicting the center word; we then optimize this cost by updating the matrices U and V via stochastic gradient descent.

We thus formulate our optimization objective as:

$$\begin{aligned}
\text{minimize } J &= -\log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) \\
&= -\log P(u_c \mid \hat{v}) \\
&= -\log \frac{\exp(u_c^T \hat{v})}{\sum_{j=1}^{|V|} \exp(u_j^T \hat{v})} \\
&= -u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})
\end{aligned}$$

We use stochastic gradient descent to update all relevant word vectors u_c and v_j.

Side note: stochastic gradient descent (SGD) computes gradients for a window and updates the parameters:
U_new ← U_old − α ∇_U J
V_new ← V_old − α ∇_V J
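To make the above concrete, here is a minimal NumPy sketch (not from the original notes) of the CBOW forward pass and loss for a single training example; the vocabulary size, embedding dimension, and random matrices are toy placeholders, and the gradient/SGD updates are omitted.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, n = 10, 4                      # toy |V| and embedding dimension n
    V = rng.normal(size=(n, vocab_size))       # input word matrix (columns are input vectors)
    U = rng.normal(size=(vocab_size, n))       # output word matrix (rows are output vectors)

    def cbow_loss(context_ids, center_id):
        """Cross-entropy loss of CBOW for one (context, center) training example."""
        v_hat = V[:, context_ids].mean(axis=1)   # average the context input vectors
        z = U @ v_hat                            # one score per vocabulary word
        y_hat = np.exp(z - z.max())
        y_hat /= y_hat.sum()                     # softmax
        return -np.log(y_hat[center_id])

    print(cbow_loss(context_ids=[1, 2, 4, 5], center_id=3))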
4.3 Skip-Gram Model
Another approach is to create a model such that, given the center word "jumped", the model will be able to predict or generate the surrounding words "The", "cat", "over", "the", "puddle". Here we call the word "jumped" the context. We call this type of model a Skip-Gram model.

Side note (Skip-Gram model): predicting surrounding context words given a center word.

Let's discuss the Skip-Gram model. The setup is largely the same as CBOW, but we essentially swap our x and y, i.e. x in CBOW is now y and vice-versa. The input one-hot vector (the center word) we will represent with an x (since there is only one), and the output vectors with y^(j). We define V and U the same as in CBOW.

Notation for the Skip-Gram model:
• w_i: word i from vocabulary V
• V ∈ R^{n×|V|}: input word matrix
• v_i: i-th column of V, the input vector representation of word w_i
• U ∈ R^{|V|×n}: output word matrix
• u_i: i-th row of U, the output vector representation of word w_i
We break down the way this model works in these steps:

1. We generate the one-hot input vector x ∈ R^|V| of the center word.

2. We get our embedded word vector for the center word: v_c = V x ∈ R^n.

3. Generate a score vector z = U v_c.

4. Turn the score vector into probabilities, ŷ = softmax(z). Note that ŷ_{c−m}, . . . , ŷ_{c−1}, ŷ_{c+1}, . . . , ŷ_{c+m} are the probabilities of observing each context word.

5. We desire the generated probability vector to match the true probabilities, which are y^(c−m), . . . , y^(c−1), y^(c+1), . . . , y^(c+m), the one-hot vectors of the actual output.

(Figure 2 in the original notes illustrates how Skip-Gram works and how we must learn the transfer matrices.)
As in CBOW, we need to generate an objective function for us to evaluate the model. A key difference here is that we invoke a Naive Bayes assumption to break out the probabilities. If you have not seen this before, then simply put, it is a strong (naive) conditional independence assumption. In other words, given the center word, all output words are completely independent.

$$\begin{aligned}
\text{minimize } J &= -\log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_c) \\
&= -\log \prod_{j=0, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) \\
&= -\log \prod_{j=0, j \neq m}^{2m} P(u_{c-m+j} \mid v_c) \\
&= -\log \prod_{j=0, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^T v_c)}{\sum_{k=1}^{|V|} \exp(u_k^T v_c)} \\
&= -\sum_{j=0, j \neq m}^{2m} u_{c-m+j}^T v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)
\end{aligned}$$

With this objective function, we can compute the gradients with respect to the unknown parameters and at each iteration update them via stochastic gradient descent.

Side note: only one probability vector ŷ is computed. Skip-gram treats each context word equally: the model computes the probability of each word appearing in the context independently of its distance to the center word.

Note that

$$\begin{aligned}
J &= -\sum_{j=0, j \neq m}^{2m} \log P(u_{c-m+j} \mid v_c) \\
&= \sum_{j=0, j \neq m}^{2m} H(\hat{y}, y_{c-m+j})
\end{aligned}$$

where H(ŷ, y_{c−m+j}) is the cross-entropy between the probability vector ŷ and the one-hot vector y_{c−m+j}.
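Analogously, a small sketch (again not from the notes) of the skip-gram loss for one center word and its context, reusing the toy matrices V and U from the CBOW sketch above:

    import numpy as np

    def skipgram_loss(center_id, context_ids, V, U):
        """Sum of cross-entropy losses over the context words, as in the objective above."""
        v_c = V[:, center_id]                  # input vector of the center word
        z = U @ v_c                            # one score per vocabulary word
        y_hat = np.exp(z - z.max())
        y_hat /= y_hat.sum()                   # softmax over the whole vocabulary
        return -sum(np.log(y_hat[j]) for j in context_ids)

    print(skipgram_loss(center_id=3, context_ids=[1, 2, 4, 5], V=V, U=U))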

4.4 Negative Sampling


Side note: the loss functions J for CBOW and Skip-Gram are expensive to compute because of the softmax normalization, where we sum over all |V| scores!

Let's take a second to look at the objective function. Note that the summation over |V| is computationally huge! Any update we do or evaluation of the objective function would take O(|V|) time which, if we recall, is in the millions. A simple idea is that we could instead just approximate it.

For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We "sample" from a noise distribution (P_n(w)) whose probabilities match the ordering of the frequency of the vocabulary. To augment our formulation of the problem to incorporate negative sampling, all we need to do is update the:

• objective function

• gradients

• update rules

Mikolov et al. present negative sampling in "Distributed Representations of Words and Phrases and their Compositionality". While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. Consider a pair (w, c) of word and context. Did this pair come from the training data? Let's denote by P(D = 1|w, c) the probability that (w, c) came from the corpus data. Correspondingly, P(D = 0|w, c) will be the probability that (w, c) did not come from the corpus data. First, let's model P(D = 1|w, c) with the sigmoid function:

$$P(D = 1 \mid w, c, \theta) = \sigma(v_c^T v_w) = \frac{1}{1 + e^{-v_c^T v_w}}$$

Side note: the sigmoid function σ(x) = 1/(1 + e^{−x}) is the 1D version of the softmax and can be used to model a probability. (Figure 3 in the original notes plots the sigmoid function.)

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take θ to be the parameters of the model, and in our case it is V and U.)

$$\begin{aligned}
\theta &= \operatorname*{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0 \mid w, c, \theta) \\
&= \operatorname*{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} \left(1 - P(D = 1 \mid w, c, \theta)\right) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log P(D = 1 \mid w, c, \theta) + \sum_{(w,c) \in \tilde{D}} \log\left(1 - P(D = 1 \mid w, c, \theta)\right) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log\left(1 - \frac{1}{1 + \exp(-u_w^T v_c)}\right) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}
\end{aligned}$$

Note that maximizing the likelihood is the same as minimizing the negative log likelihood:

$$J = -\sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} - \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}$$

Note that D̃ is a "false" or "negative" corpus, where we would have unnatural sentences like "stock boil fish is toy" that should get a low probability of ever occurring. We can generate D̃ on the fly by randomly sampling these negatives from the word bank.

For skip-gram, our new objective function for observing the context word c − m + j given the center word c would be

$$-\log \sigma(u_{c-m+j}^T \cdot v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \cdot v_c)$$

(Compare with the regular softmax loss for skip-gram: −u_{c−m+j}^T v_c + log Σ_{k=1}^{|V|} exp(u_k^T v_c).)

For CBOW, our new objective function for observing the center word u_c given the context vector v̂ = (v_{c−m} + v_{c−m+1} + . . . + v_{c+m}) / 2m would be

$$-\log \sigma(u_c^T \cdot \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \cdot \hat{v})$$

(Compare with the regular softmax loss for CBOW: −u_c^T v̂ + log Σ_{j=1}^{|V|} exp(u_j^T v̂).)

In the above formulation, {ũ_k | k = 1 . . . K} are sampled from P_n(w). Let's discuss what P_n(w) should be. While there is much discussion of what makes the best approximation, what seems to work best is the unigram model raised to the power of 3/4. Why 3/4? Here's an example that might help gain some intuition:

is: 0.9^{3/4} = 0.92
Constitution: 0.09^{3/4} = 0.16
bombastic: 0.01^{3/4} = 0.032

"Bombastic" is now 3x more likely to be sampled while "is" only went up marginally.
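A minimal sketch (not from the original notes) of sampling K negatives from the 3/4-power unigram distribution and evaluating the skip-gram negative-sampling loss above; the toy counts and vectors are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_loss(u_pos, v_c, U_out, unigram_counts, K=5):
        """Loss for one (center, context) pair: u_pos is the observed context word's
        output vector, v_c the center input vector, U_out the output word matrix."""
        p_n = unigram_counts ** 0.75           # noise distribution P_n(w): unigram^(3/4)
        p_n /= p_n.sum()
        neg_ids = rng.choice(len(p_n), size=K, p=p_n)   # sample K negative words
        loss = -np.log(sigmoid(u_pos @ v_c))
        loss -= np.log(sigmoid(-U_out[neg_ids] @ v_c)).sum()
        return loss

    counts = np.array([50.0, 5.0, 1.0, 20.0, 10.0])     # toy unigram counts
    U_toy = rng.normal(size=(5, 4)); v_toy = rng.normal(size=4)
    print(neg_sampling_loss(U_toy[2], v_toy, U_toy, counts))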

4.5 Hierarchical Softmax


Mikolov et al. also present hierarchical softmax as a much more efficient alternative to the normal softmax. In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.

Side note: hierarchical softmax uses a binary tree whose leaves are the words. The probability of a word being the output word is defined as the probability of a random walk from the root to that word's leaf. The computational cost becomes O(log(|V|)) instead of O(|V|).

Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from root to leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.

In this model, the probability of a word w given a vector w_i, P(w|w_i), is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to w. The main advantage of computing the probability this way is that the cost is only O(log(|V|)), corresponding to the length of the path.

Let's introduce some notation. Let L(w) be the number of nodes in the path from the root to the leaf w. For instance, L(w_2) in Figure 4 (a binary tree for hierarchical softmax in the original notes) is 3. Let's write n(w, i) as the i-th node on this path with associated vector v_{n(w,i)}. So n(w, 1) is the root, while n(w, L(w)) is the father of w. Now for each inner node n, we arbitrarily choose one of its children and call it ch(n) (e.g. always the left node). Then, we can compute the probability as

$$P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\left([n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^T v_{w_i}\right)$$

where

$$[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{cases}$$
and σ (·) is the sigmoid function.
This formula is fairly dense, so let's examine it more closely.

First, we are computing a product of terms based on the shape of the path from the root (n(w, 1)) to the leaf (w). If we assume ch(n) is always the left node of n, then the term [n(w, j + 1) = ch(n(w, j))] returns 1 when the path goes left, and −1 if it goes right.

Furthermore, the term [n(w, j + 1) = ch(n(w, j))] provides normalization. At a node n, if we sum the probabilities for going to the left and right node, you can check that for any value of v_n^T v_{w_i},

$$\sigma(v_n^T v_{w_i}) + \sigma(-v_n^T v_{w_i}) = 1$$

The normalization also ensures that Σ_{w=1}^{|V|} P(w|w_i) = 1, just as in the original softmax.

Finally, we compare the similarity of our input vector v_{w_i} to each inner node vector v_{n(w,j)} using a dot product. Let's run through an example. Taking w_2 in Figure 4, we must take two left edges and then a right edge to reach w_2 from the root, so

$$\begin{aligned}
P(w_2 \mid w_i) &= p(n(w_2, 1), \text{left}) \cdot p(n(w_2, 2), \text{left}) \cdot p(n(w_2, 3), \text{right}) \\
&= \sigma(v_{n(w_2,1)}^T v_{w_i}) \cdot \sigma(v_{n(w_2,2)}^T v_{w_i}) \cdot \sigma(-v_{n(w_2,3)}^T v_{w_i})
\end{aligned}$$
To train the model, our goal is still to minimize the negative log
likelihood − log P(w|wi ). But instead of updating output vectors per
word, we update the vectors of the nodes in the binary tree that are
in the path from root to leaf node.
The speed of this method is determined by the way in which the
binary tree is constructed and words are assigned to leaf nodes.
Mikolov et al. use a binary Huffman tree, which assigns frequent
words shorter paths in the tree.
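A minimal sketch (not from the original notes) of computing P(w | w_i) as a product of sigmoids along a root-to-leaf path; the node vectors and path directions are toy placeholders mirroring the two-left-one-right example above.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hierarchical_prob(path_vectors, directions, v_wi):
        """P(w | w_i) along the root-to-leaf path. `directions` holds +1 for a left
        edge and -1 for a right edge (assuming ch(n) is the left child)."""
        prob = 1.0
        for v_node, sign in zip(path_vectors, directions):
            prob *= sigmoid(sign * (v_node @ v_wi))
        return prob

    rng = np.random.default_rng(0)
    nodes = [rng.normal(size=4) for _ in range(3)]   # inner-node vectors v_{n(w,j)}
    print(hierarchical_prob(nodes, directions=[+1, +1, -1], v_wi=rng.normal(size=4)))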

References

[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. P. (2011). Natural language processing (almost) from scratch. CoRR, abs/1103.0398.

[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Rong, 2014] Rong, X. (2014). word2vec parameter learning explained. CoRR, abs/1411.2738.

[Rumelhart et al., 1988] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Neurocomputing: Foundations of research. Chapter: Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part II
Word Vectors II: GloVe, Evaluation and Training
Winter 2019

Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Emma Peng, Richard Socher, Ajay Sohmshetty, Amita Kamath

Keyphrases: Global Vectors for Word Representation (GloVe). Intrinsic and extrinsic evaluations. Effect of hyperparameters on analogy evaluation tasks. Correlation of human judgment with word vector distances. Dealing with ambiguity in words using contexts. Window classification.

This set of notes first introduces the GloVe model for training word vectors. Then it extends our discussion of word vectors (interchangeably called word embeddings) by seeing how they can be evaluated intrinsically and extrinsically. As we proceed, we discuss the example of word analogies as an intrinsic evaluation technique and how it can be used to tune word embedding techniques. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly, we motivate artificial neural networks as a class of models for natural language processing tasks.

1 Global Vectors for Word Representation (GloVe)

Side note: this section is based on the GloVe paper by Pennington et al.: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

1.1 Comparison with Previous Methods

So far, we have looked at two main classes of methods to find word embeddings. The first set are count-based and rely on matrix factorization (e.g. LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure. The other set of methods are shallow, window-based (e.g. the skip-gram and CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of the global co-occurrence statistics.

Side note (GloVe): using global statistics to predict the probability of word j appearing in the context of word i with a least squares objective.

In comparison, GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful sub-structure. It shows state-of-the-art performance on the word analogy task, and outperforms other current methods on several word similarity tasks.

1.2 Co-occurrence Matrix

Let X denote the word-word co-occurrence matrix, where X_ij indicates the number of times word j occurs in the context of word i. Let X_i = Σ_k X_ik be the number of times any word k appears in the context of word i. Finally, let P_ij = P(w_j | w_i) = X_ij / X_i be the probability of j appearing in the context of word i.

Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
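As a small illustration (not from the original notes), the conditional probabilities P_ij can be read off a co-occurrence matrix by normalizing each row; the sketch below reuses the toy window co-occurrence matrix X built earlier.

    import numpy as np

    X = X.astype(float)
    X_i = X.sum(axis=1, keepdims=True)          # X_i = sum_k X_ik for every word i
    P = X / np.where(X_i == 0, 1.0, X_i)        # P_ij = X_ij / X_i (empty rows stay zero)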

1.3 Least Squares Objective

Recall that for the skip-gram model, we use softmax to compute the probability that word j appears in the context of word i:

$$Q_{ij} = \frac{\exp(\vec{u}_j^T \vec{v}_i)}{\sum_{w=1}^{W} \exp(\vec{u}_w^T \vec{v}_i)}$$

Training proceeds in an on-line, stochastic fashion, but the implied global cross-entropy loss can be calculated as:

$$J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$$

As the same words i and j can appear multiple times in the corpus, it is more efficient to first group together the same values for i and j:

$$J = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$

where the value of the co-occurring frequency is given by the co-occurrence matrix X. One significant drawback of the cross-entropy loss is that it requires the distribution Q to be properly normalized, which involves the expensive summation over the entire vocabulary. Instead, we use a least squares objective in which the normalization factors in P and Q are discarded:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i (\hat{P}_{ij} - \hat{Q}_{ij})^2$$

where P̂_ij = X_ij and Q̂_ij = exp(u_j^T v_i) are the unnormalized distributions. This formulation introduces a new problem: X_ij often takes on very large values and makes the optimization difficult. An effective change is to minimize the squared error of the logarithms of P̂ and Q̂:
P̂ and Q̂:

$$\begin{aligned}
\hat{J} &= \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\log \hat{P}_{ij} - \log \hat{Q}_{ij}\right)^2 \\
&= \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\vec{u}_j^T \vec{v}_i - \log X_{ij}\right)^2
\end{aligned}$$

Another observation is that the weighting factor X_i is not guaranteed to be optimal. Instead, we introduce a more general weighting function, which we are free to take to depend on the context word as well:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} f(X_{ij}) \left(\vec{u}_j^T \vec{v}_i - \log X_{ij}\right)^2$$
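A minimal NumPy sketch (not from the original notes) of evaluating this weighted least squares objective over the nonzero co-occurrence counts; it omits the bias terms of the full GloVe model, and the specific f used here, the clipped power weighting with x_max = 100 and alpha = 0.75, is the choice from the GloVe paper rather than something specified in these notes.

    import numpy as np

    def glove_loss(U_out, V_in, X, x_max=100.0, alpha=0.75):
        """Weighted least squares GloVe-style objective over nonzero counts of X."""
        i_idx, j_idx = np.nonzero(X)
        counts = X[i_idx, j_idx].astype(float)
        f = np.minimum((counts / x_max) ** alpha, 1.0)        # weighting function f(X_ij)
        dots = np.einsum("ij,ij->i", U_out[j_idx], V_in[i_idx])  # u_j^T v_i per pair
        return float(np.sum(f * (dots - np.log(counts)) ** 2))

    rng = np.random.default_rng(0)
    X_toy = rng.integers(0, 5, size=(8, 8))                   # toy co-occurrence counts
    print(glove_loss(rng.normal(size=(8, 3)), rng.normal(size=(8, 3)), X_toy))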

1.4 Conclusion

In conclusion, the GloVe model efficiently leverages global statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, and produces a vector space with meaningful sub-structure. It consistently outperforms word2vec on the word analogy task, given the same corpus, vocabulary, window size, and training time. It achieves better results faster, and also obtains the best results irrespective of speed.

2 Evaluation of Word Vectors

So far, we have discussed methods such as Word2Vec and GloVe to train and discover latent vector representations of natural language words in a semantic space. In this section, we discuss how we can quantitatively evaluate the quality of word vectors produced by such techniques.

2.1 Intrinsic Evaluation


Intrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by an embedding technique (such as Word2Vec or GloVe) on specific intermediate subtasks (such as analogy completion). These subtasks are typically simple and fast to compute, and thereby allow us to help understand the system used to generate the word vectors. An intrinsic evaluation should typically return to us a number that indicates the performance of those word vectors on the evaluation subtask.

(Figure 1 in the original notes: the left subsystem (red), being expensive to train, is modified by substituting a simpler subsystem (green) for intrinsic evaluation.)

Motivation: let us consider an example where our final goal is to create a question answering system which uses word vectors as inputs. One approach of doing so would be to train a machine learning system that:

1. Takes words as inputs

2. Converts them to word vectors

3. Uses word vectors as inputs for an elaborate machine learning system

4. Maps the output word vectors produced by this system back to natural language words

5. Produces words as answers


Side note (intrinsic evaluation): evaluation on a specific, intermediate task; fast to compute performance; helps understand the subsystem; needs a positive correlation with the real task to determine usefulness.

Of course, in the process of making such a state-of-the-art question-answering system, we will need to create optimal word-vector representations since they are used in downstream subsystems (such as deep neural networks). To do this in practice, we will need to tune many hyperparameters in the Word2Vec subsystem (such as the dimension of the word vector representation). While the idealistic approach is to retrain the entire system after any parametric changes in the Word2Vec subsystem, this is impractical from an engineering standpoint because the machine learning system (in step 3) is typically a deep neural network with millions of parameters that takes very long to train. In such a situation, we would want to come up with a simple intrinsic evaluation technique which can provide a measure of "goodness" of the word-to-word-vector subsystem. Obviously, a requirement is that the intrinsic evaluation has a positive correlation with the final task performance.

2.2 Extrinsic Evaluation


Side note (extrinsic evaluation): evaluation on a real task; can be slow to compute performance; unclear whether the subsystem is the problem, another subsystem, or internal interactions; if replacing the subsystem improves performance, the change is likely good.

Extrinsic evaluation of word vectors is the evaluation of a set of word vectors generated by an embedding technique on the real task at hand. These tasks are typically elaborate and slow to compute. Using our example from above, the system which allows for the evaluation of answers from questions is the extrinsic evaluation system. Typically, optimizing over an underperforming extrinsic evaluation system does not allow us to determine which specific subsystem is at fault, and this motivates the need for intrinsic evaluation.

2.3 Intrinsic Evaluation Example: Word Vector Analogies


A popular choice for intrinsic evaluation of word vectors is their performance in completing word vector analogies. In a word vector analogy, we are given an incomplete analogy of the form:

a : b : : c : ?

The intrinsic evaluation system then identifies the word vector which maximizes the cosine similarity:

$$d = \operatorname*{argmax}_{i} \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}$$

This metric has an intuitive interpretation. Ideally, we want x_b − x_a = x_d − x_c (for instance, queen − king = actress − actor). This implies that we want x_b − x_a + x_c = x_d. Thus we identify the vector x_d which maximizes the normalized dot-product between the two word vectors (i.e. cosine similarity).

Side note: we see that word vectors incorporate meaning in cosine distance between words. They also incorporate meaning in different dimensions: for example, "nokia" might be close to "samsung" in some dimension because they are both electronics companies, but might be close to "finland" in another dimension for a different reason, that Nokia is a Finnish company.

Side note: interesting results can be seen when word vectors are cast down to 2 dimensions (using PCA) and graphed: similar words cluster together. However, it is important to remember that a significant portion of spatial information is lost during dimensionality reduction; thus complex relations between words, as described in the Nokia example above, may not emerge.

Using intrinsic evaluation techniques such as word-vector analogies should be handled with care (keeping in mind various aspects of the corpus used for pre-training). For instance, consider analogies of the form:

City 1 : State containing City 1 : : City 2 : State containing City 2
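As a concrete illustration of the analogy metric (not from the original notes), a small sketch that returns the word maximizing cosine similarity with x_b − x_a + x_c; `embeddings` is a hypothetical dict mapping words to NumPy vectors, and the query words are excluded from the candidates, which is common practice.

    import numpy as np

    def solve_analogy(word_a, word_b, word_c, embeddings):
        """Return the word d whose vector maximizes cosine similarity with (x_b - x_a + x_c)."""
        target = embeddings[word_b] - embeddings[word_a] + embeddings[word_c]
        target /= np.linalg.norm(target)
        best_word, best_score = None, -np.inf
        for word, vec in embeddings.items():
            if word in (word_a, word_b, word_c):
                continue  # exclude the query words themselves
            score = (target @ vec) / np.linalg.norm(vec)
            if score > best_score:
                best_word, best_score = word, score
        return best_word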

Table 1: Semantic word vector analogies (intrinsic evaluation) that may suffer from different cities having the same name.

Input                                  Result Produced
Chicago : Illinois : : Houston         Texas
Chicago : Illinois : : Philadelphia    Pennsylvania
Chicago : Illinois : : Phoenix         Arizona
Chicago : Illinois : : Dallas          Texas
Chicago : Illinois : : Jacksonville    Florida
Chicago : Illinois : : Indianapolis    Indiana
Chicago : Illinois : : Austin          Texas
Chicago : Illinois : : Detroit         Michigan
Chicago : Illinois : : Memphis         Tennessee
Chicago : Illinois : : Boston          Massachusetts
In many cases above (Table 1), there are multiple cities/towns/villages with the same name across the US. Thus, many states would qualify as the right answer. For instance, there are at least 10 places in the US called Phoenix, and thus Arizona need not be the only correct response. Let us now consider analogies of the form:

Capital City 1 : Country 1 : : Capital City 2 : Country 2

In many of these cases (Table 2), the city produced by this task has only been the capital in the recent past. For instance, prior to 1997 the capital of Kazakhstan was Almaty. Thus, we can anticipate other issues if our corpus is dated.

The previous two examples demonstrated semantic testing using word vectors. We can also test syntax using word vector analogies. The intrinsic evaluation in Table 3 tests the word vectors' ability to capture the notion of superlative adjectives, and the one in Table 4 tests the ability to capture the notion of past tense.

Table 2: Semantic word vector analogies (intrinsic evaluation) that may suffer from countries having had different capitals at different points in time.

Input                               Result Produced
Abuja : Nigeria : : Accra           Ghana
Abuja : Nigeria : : Algiers         Algeria
Abuja : Nigeria : : Amman           Jordan
Abuja : Nigeria : : Ankara          Turkey
Abuja : Nigeria : : Antananarivo    Madagascar
Abuja : Nigeria : : Apia            Samoa
Abuja : Nigeria : : Ashgabat        Turkmenistan
Abuja : Nigeria : : Asmara          Eritrea
Abuja : Nigeria : : Astana          Kazakhstan

Table 3: Syntactic word vector analogies (intrinsic evaluation) that test the notion of superlative adjectives.

Input                               Result Produced
bad : worst : : big                 biggest
bad : worst : : bright              brightest
bad : worst : : cold                coldest
bad : worst : : cool                coolest
bad : worst : : dark                darkest
bad : worst : : easy                easiest
bad : worst : : fast                fastest
bad : worst : : good                best
bad : worst : : great               greatest

2.4 Intrinsic Evaluation Tuning Example: Analogy Evaluations


Side note: some parameters we might consider tuning for a word embedding technique on intrinsic evaluation tasks are the dimension of the word vectors, the corpus size, the corpus source/type, the context window size, and context symmetry. Can you think of other hyperparameters tunable at this stage?

We now explore some of the hyperparameters in word vector embedding techniques (such as Word2Vec and GloVe) that can be tuned using an intrinsic evaluation system (such as an analogy completion system). Let us first see how different methods for creating word-vector embeddings have performed (in recent research work) under the same hyperparameters on an analogy evaluation task (Table 5):

Table 4: Syntactic word vector analogies (intrinsic evaluation) that test the notion of past tense.

Input                                   Result Produced
dancing : danced : : decreasing         decreased
dancing : danced : : describing         described
dancing : danced : : enhancing          enhanced
dancing : danced : : falling            fell
dancing : danced : : feeding            fed
dancing : danced : : flying             flew
dancing : danced : : generating         generated
dancing : danced : : going              went
dancing : danced : : hiding             hid
dancing : danced : : hitting            hit

Table 5: Comparison of the performance of different models under different hyperparameters and datasets.

Model    Dimension   Size    Semantics   Syntax   Total
ivLBL    100         1.5B    55.9        50.1     53.2
HPCA     100         1.6B    4.2         16.4     10.8
GloVe    100         1.6B    67.5        54.3     60.3
SG       300         1B      61          61       61
CBOW     300         1.6B    16.1        52.6     36.1
vLBL     300         1.5B    54.2        64.8     60.0
ivLBL    300         1.5B    65.2        63.0     64.0
GloVe    300         1.6B    80.8        61.5     70.3
SVD      300         6B      6.3         8.1      7.3
SVD-S    300         6B      36.7        46.6     42.1
SVD-L    300         6B      56.6        63.0     60.1
CBOW     300         6B      63.6        67.4     65.7
SG       300         6B      73.0        66.0     69.1
GloVe    300         6B      77.4        67.0     71.7
CBOW     1000        6B      57.3        68.9     63.7
SG       1000        6B      66.1        65.1     65.6
SVD-L    300         42B     38.4        58.2     49.2
GloVe    300         42B     81.9        69.3     75.0

Inspecting the above table, we can make 3 primary observations:

• Performance is heavily dependent on the model used for word embedding: this is an expected result, since different methods try embedding words into vectors using fundamentally different properties (such as co-occurrence counts, singular vectors, etc.).

• Performance increases with larger corpus sizes: this happens because of the experience an embedding technique gains with the more examples it sees. For instance, an analogy completion example will produce incorrect results if it has not encountered the test words previously.

• Performance is lower for extremely low dimensional word vectors: lower dimensional word vectors are not able to capture the different meanings of the different words in the corpus. This can be viewed as a high bias problem where our model complexity is too low. For instance, let us consider the words "king", "queen", "man", "woman". Intuitively, we would need to use two dimensions such as "gender" and "leadership" to encode these into 2-bit word vectors. Any lower would fail to capture semantic differences between the four words.

Implementation tip: a window size of 8 around each center word typically works well for GloVe embeddings.

Side note (extremely high dimensional vectors): intuitively, it would seem that these vectors would capture noise in the corpus that doesn't allow generalization, i.e. resulting in high variance, but Yin et al. show in "On the Dimensionality of Word Embedding" that skip-gram and GloVe are robust to this over-fitting.

(Figure 2 in the original notes shows how longer training time improves performance and helps squeeze out the last bit of accuracy.)

Figure 3 demonstrates how accuracy has been shown to improve with a larger corpus.

(Figure 3 in the original notes shows how performance improves with data size.)

Figure 4 demonstrates how other hyperparameters have been shown to affect the accuracies using GloVe.

(Figure 4 in the original notes shows how accuracies vary with vector dimension and context window size for GloVe.)

2.5 Intrinsic Evaluation Example: Correlation Evaluation

Another simple way to evaluate the quality of word vectors is by asking humans to assess the similarity between two words on a fixed scale (say 0-10) and then comparing this with the cosine similarity between the corresponding word vectors. This has been done on various datasets that contain human judgement survey data; Table 6 shows the correlations obtained in this way for several embedding methods.
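Correlation evaluations of this kind are commonly reported as Spearman rank correlation, though the notes above do not fix a particular correlation measure; below is a small SciPy sketch (not from the original notes) where both score lists are made up for illustration.

    import numpy as np
    from scipy.stats import spearmanr

    # Hypothetical human similarity ratings (0-10) and model cosine similarities
    # for the same word pairs.
    human_scores = np.array([9.2, 7.5, 3.1, 0.8, 6.4])
    cosine_scores = np.array([0.81, 0.66, 0.30, 0.05, 0.58])

    rho, _ = spearmanr(human_scores, cosine_scores)
    print(rho)   # rank correlation between human judgment and the word vectors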

Table 6: Correlations between word vector similarities and human judgment datasets for different embedding techniques.

Model    Size    WS353   MC     RG     SCWS   RW
SVD      6B      35.3    35.1   42.5   38.3   25.6
SVD-S    6B      56.5    71.5   71.0   53.6   34.7
SVD-L    6B      65.7    72.7   75.1   56.5   37.0
CBOW     6B      57.2    65.6   68.2   57.0   32.5
SG       6B      62.8    65.2   69.7   58.1   37.2
GloVe    6B      65.8    72.7   77.8   53.9   38.1
SVD-L    42B     74.0    76.4   74.1   58.3   39.9
GloVe    42B     75.9    83.6   82.9   59.6   47.8
CBOW     100B    68.4    79.6   75.4   59.4   45.5

2.6 Further Reading: Dealing With Ambiguity


One might wonder how we handle the situation where we want to capture the same word with different vectors for its different uses in natural language. For instance, "run" is both a noun and a verb and is used and interpreted differently based on the context. "Improving Word Representations Via Global Context And Multiple Word Prototypes" (Huang et al., 2012) describes how such cases can also be handled in NLP. The essence of the method is the following (a small clustering sketch follows the list):

1. Gather fixed-size context windows of all occurrences of the word (for instance, 5 before and 5 after).

2. Represent each context by a weighted average of the context words' vectors (using idf-weighting).

3. Apply spherical k-means to cluster these context representations.

4. Finally, re-label each word occurrence with its associated cluster and use it to train the word representation for that cluster.
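The sketch below is not Huang et al.'s implementation; it only approximates the clustering step (step 3) by normalizing the hypothetical context vectors and running ordinary k-means with scikit-learn, which is a common stand-in for spherical k-means.

    import numpy as np
    from sklearn.cluster import KMeans

    # context_vectors: one row per occurrence of the target word, each row a weighted
    # average of its context words' vectors (random placeholders here).
    rng = np.random.default_rng(0)
    context_vectors = rng.normal(size=(200, 50))

    normalized = context_vectors / np.linalg.norm(context_vectors, axis=1, keepdims=True)
    cluster_ids = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(normalized)
    # Each occurrence can now be re-labeled with its cluster id (e.g. "run_3").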

For a more rigorous treatment of this topic, one should refer to the original paper.

3 Training for Extrinsic Tasks

We have so far focused on intrinsic tasks and emphasized their importance in developing a good word embedding technique. Of course, the end goal of most real-world problems is to use the resulting word vectors for some other extrinsic task. Here we discuss the general approach for handling extrinsic tasks.

3.1 Problem Formulation


Most NLP extrinsic tasks can be formulated as classification tasks. For instance, given a sentence, we can classify the sentence to have positive, negative, or neutral sentiment. Similarly, in named-entity recognition (NER), given a context and a central word, we want to classify the central word to be one of many classes. For the input "Jim bought 300 shares of Acme Corp. in 2006", we would like a classified output "[Jim]_Person bought 300 shares of [Acme Corp.]_Organization in [2006]_Time."

For such problems, we typically begin with a training set of the form:

{x^(i), y^(i)}_1^N

where x^(i) is a d-dimensional word vector generated by some word embedding technique and y^(i) is a C-dimensional one-hot vector which indicates the labels we wish to eventually predict (sentiments, other words, named entities, buy/sell decisions, etc.).

(Figure 5 in the original notes: we can classify word vectors using simple linear decision boundaries, such as the one shown there for 2-D word vectors, using techniques such as logistic regression and SVMs.)

In typical machine learning tasks, we usually hold the input data and target labels fixed and train weights using optimization techniques (such as gradient descent, L-BFGS, Newton's method, etc.). In NLP applications, however, we introduce the idea of retraining the input word vectors when we train for extrinsic tasks. Let us discuss when and why we should consider doing this.

Implementation tip: word vector retraining should be considered for large training datasets. For small datasets, retraining word vectors will likely worsen performance.

3.2 Retraining Word Vectors
As we have discussed so far, the word vectors we use for extrinsic
tasks are initialized by optimizing them over a simpler intrinsic task.
In many cases, these pretrained word vectors are a good proxy for
cs224n: natural language processing with deep learning lecture notes: part ii
word vectors ii: glove, evaluation and training 11

optimal word vectors for the extrinsic task and they perform well
at the extrinsic task. However, it is also possible that the pretrained
word vectors could be trained further (i.e. retrained) using the extrin-
sic task this time to perform better. However, retraining word vectors
can be risky.
If we retrain word vectors using the extrinsic task, we need to en-
sure that the training set is large enough to cover most words from
the vocabulary. This is because Word2Vec or GloVe produce semanti-
cally related words to be located in the same part of the word space.
When we retrain these words over a small set of the vocabulary, these
words are shifted in the word space and, as a result, the performance over the final task could actually decrease. Let us explore this idea further using an example. Consider the pretrained vectors to be in a two dimensional space as shown in Figure 6. Here, we see that the word vectors are classified correctly on some extrinsic classification task. Now, if we retrain only two of those vectors because of a limited training set size, then we see in Figure 7 that one of the words gets misclassified because the boundary shifts as a result of word vector updates.

Figure 6: Here, we see that the words "Telly", "TV", and "Television" are classified correctly before retraining. "Telly" and "TV" are present in the extrinsic task training set while "Television" is only present in the test set.
Thus, word vectors should not be retrained if the training data set
is small. If the training set is large, retraining may improve perfor-
mance.

3.3 Softmax Classification and Regularization


Let us consider using the Softmax classification function, which has the form:

$$p(y_j = 1 \mid x) = \frac{\exp(W_{j\cdot} x)}{\sum_{c=1}^{C} \exp(W_{c\cdot} x)}$$

Figure 7: Here, we see that the words "Telly" and "TV" are classified correctly after training, but "Television" is not since it was not present in the training set.
Here, we calculate the probability of word vector x being in class j.
Using the Cross-entropy loss function, we calculate the loss of such a
training example as:
$$-\sum_{j=1}^{C} y_j \log(p(y_j = 1 \mid x)) = -\sum_{j=1}^{C} y_j \log\left(\frac{\exp(W_{j\cdot} x)}{\sum_{c=1}^{C} \exp(W_{c\cdot} x)}\right)$$

Of course, the above summation will be a sum over (C − 1) zero values since $y_j$ is 1 only at a single index (at least for now), implying that x belongs to only 1 correct class. Thus, let us define k to be the index of the correct class. We can now simplify our loss to be:

$$-\log\left(\frac{\exp(W_{k\cdot} x)}{\sum_{c=1}^{C} \exp(W_{c\cdot} x)}\right)$$
We can then extend the above loss to a dataset of N points:

$$-\sum_{i=1}^{N} \log\left(\frac{\exp(W_{k(i)\cdot} x^{(i)})}{\sum_{c=1}^{C} \exp(W_{c\cdot} x^{(i)})}\right)$$

The only difference above is that $k(i)$ is now a function that returns the correct class index for example $x^{(i)}$.
Let us now try to estimate the number of parameters that would
be updated if we consider training both, model weights (W), as well
word vectors (x). We know that a simple linear decision boundary
would require a model that takes in at least one d-dimensional input
word vector and produces a distribution over C classes. Thus, to
update the model weights, we would be updating C · d parameters.
If we update the word vectors for every word in the vocabulary V
as well, then we would be updating as many as |V | word vectors,
each of which is d-dimensional. Thus, the total number of parameters
would be as many as C · d + |V | · d for a simple linear classifier:

$$\nabla_{\theta} J(\theta) = \begin{bmatrix} \nabla_{W_{\cdot 1}} \\ \vdots \\ \nabla_{W_{\cdot d}} \\ \nabla_{x_{aardvark}} \\ \vdots \\ \nabla_{x_{zebra}} \end{bmatrix}$$
This is an extremely large number of parameters considering how
simple the model’s decision boundary is - such a large number of
parameters is highly prone to overfitting.
To reduce overfitting risk, we introduce a regularization term
which poses the Bayesian belief that the parameters (θ) should be
small in magnitude (i.e. close to zero):

$$-\sum_{i=1}^{N} \log\left(\frac{\exp(W_{k(i)\cdot} x^{(i)})}{\sum_{c=1}^{C} \exp(W_{c\cdot} x^{(i)})}\right) + \lambda \sum_{k=1}^{C \cdot d + |V| \cdot d} \theta_k^2$$

Minimizing the above cost function reduces the likelihood of the


parameters taking on extremely large values just to fit the training set
well and may improve generalization if the relative objective weight
λ is tuned well. The idea of regularization becomes even more of a
requirement once we explore more complex models (such as Neural
Networks) which have far more parameters.
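To make the regularized objective concrete, here is a minimal NumPy sketch of the loss above, treating both the classifier weights W and the input word vectors as parameters to be penalized. The function and variable names are illustrative assumptions, not the course's reference implementation.

import numpy as np

def regularized_softmax_loss(W, X, y, lam):
    # W: (C, d) classifier weights; X: (N, d) word vectors (also parameters);
    # y: (N,) integer labels, y[i] = k(i); lam: regularization weight lambda.
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)           # softmax probabilities
    data_loss = -np.log(probs[np.arange(len(y)), y]).sum()
    reg_loss = lam * (np.sum(W ** 2) + np.sum(X ** 2))   # penalty over all C*d + |V|*d parameters
    return data_loss + reg_loss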

3.4 Window Classification

Figure 8: Here, we see a central word with a symmetric window of length 2. Such context may help disambiguate between the place Paris and the name Paris.

So far we have primarily explored the idea of predicting in extrinsic tasks using a single word vector x. In reality, this is hardly done because of the nature of natural languages. Natural languages tend to use the same word for very different meanings and we typically need to know the context of the word usage to discriminate between meanings. For instance, if you were asked to explain to someone what

"to sanction" meant, you would immediately realize that depending


on the context "to sanction" could mean "to permit" or "to punish".
In most situations, we tend to use a sequence of words as input to
the model. A sequence is a central word vector preceded and suc-
ceeded by context word vectors. The number of words in the context
is also known as the context window size and varies depending on
the problem being solved. Generally, narrower window sizes lead to
better performance in syntactic tests while wider windows lead to
better performance in semantic tests.

In order to modify the previously discussed Softmax model to use windows of words for classification, we would simply substitute $x^{(i)}$ with $x^{(i)}_{window}$ in the following manner:

$$x^{(i)}_{window} = \begin{bmatrix} x^{(i-2)} \\ x^{(i-1)} \\ x^{(i)} \\ x^{(i+1)} \\ x^{(i+2)} \end{bmatrix}$$
As a result, when we evaluate the gradient of the loss with respect
to the words, we will receive gradients for the word vectors:

$$\delta_{window} = \begin{bmatrix} \nabla_{x^{(i-2)}} \\ \nabla_{x^{(i-1)}} \\ \nabla_{x^{(i)}} \\ \nabla_{x^{(i+1)}} \\ \nabla_{x^{(i+2)}} \end{bmatrix}$$
The gradient will of course need to be distributed to update the
corresponding word vectors in implementation.
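A small sketch of how a window input could be assembled and how its gradient could be redistributed onto the individual word vectors; the window size and helper names are illustrative assumptions.

import numpy as np

def build_window(word_vectors, sentence_ids, i, m=2):
    # Concatenate vectors of the words at positions i-m ... i+m (zero-padded at sentence edges).
    d = word_vectors.shape[1]
    parts = [word_vectors[sentence_ids[j]] if 0 <= j < len(sentence_ids) else np.zeros(d)
             for j in range(i - m, i + m + 1)]
    return np.concatenate(parts)                      # shape ((2m + 1) * d,)

def distribute_window_gradient(grad_window, word_vector_grads, sentence_ids, i, m=2):
    # Split the window gradient into (2m + 1) chunks and add each to its word vector.
    d = word_vector_grads.shape[1]
    for k, j in enumerate(range(i - m, i + m + 1)):
        if 0 <= j < len(sentence_ids):
            word_vector_grads[sentence_ids[j]] += grad_window[k * d:(k + 1) * d]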

3.5 Non-linear Classifiers


We now introduce the need for non-linear classification models such as neural networks. We see in Figure 9 that a linear classifier misclassifies many datapoints. Using a non-linear decision boundary as shown in Figure 10, we manage to classify all training points accurately. Although oversimplified, this is a classic case demonstrating the need for non-linear decision boundaries. In the next set of notes, we study neural networks as a class of non-linear models that have performed particularly well in deep learning applications.

Figure 9: Here, we see that many examples are wrongly classified even though the best linear decision boundary is chosen. This is because linear decision boundaries have limited model capacity for this dataset.

Figure 10: Here, we see that the non-linear decision boundary allows for much better classification of datapoints.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part III
Neural Networks, Backpropagation
Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Amani Peddada, Richard Socher, Qiaojing Yan
Winter 2019

Keyphrases: Neural networks. Forward computation. Backward


propagation. Neuron Units. Max-margin Loss. Gradient checks.
Xavier parameter initialization. Learning rates. Adagrad.
This set of notes introduces single and multilayer neural networks,
and how they can be used for classification purposes. We then dis-
cuss how they can be trained using a distributed gradient descent
technique known as backpropagation. We will see how the chain
rule can be used to make parameter updates sequentially. After a
rigorous mathematical discussion of neural networks, we will discuss
some practical tips and tricks in training neural networks involving:
neuron units (non-linearities), gradient checks, Xavier parameter ini-
tialization, learning rates, Adagrad, etc. Lastly, we will motivate the
use of recurrent neural networks as a language model.

1 Neural Networks: Foundations

We established in our previous discussions the need for non-linear


classifiers since most data are not linearly separable and thus, our
classification performance on them is limited. Neural networks are
a family of classifiers with non-linear decision boundary as seen in
Figure 1. Now that we know the sort of decision boundaries neural networks create, let us see how they manage doing so.

Figure 1: We see here how a non-linear decision boundary separates the data very well. This is the prowess of neural networks.
Fun Fact: Neural networks are biologically inspired classifiers, which is why they are often called "artificial neural networks" to distinguish them from the organic kind. However, in reality human neural networks are so much more capable and complex than artificial neural networks that it is usually better to not draw too many parallels between the two.

1.1 A Neuron

A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters (also referred to as their weights). One of the most popular choices for neurons is the "sigmoid" or "binary logistic regression" unit. This unit takes an n-dimensional input vector x and produces the scalar activation (output) a. This neuron is also associated with an n-dimensional weight vector, w, and a bias scalar, b. The output of this neuron is then:

$$a = \frac{1}{1 + \exp(-(w^T x + b))}$$

We can also combine the weights and bias term above to equivalently formulate:

$$a = \frac{1}{1 + \exp(-[w^T \;\; b] \cdot [x \;\; 1])}$$

Neuron: A neuron is the fundamental building block of neural networks. We will see that a neuron can be one of many functions that allows for non-linearities to accrue in the network.

This formulation can be visualized in the manner shown in Fig-


ure 2.
Figure 2: This image captures how in a sigmoid neuron, the input vector x is first scaled, summed, added to a bias unit, and then passed to the squashing sigmoid function.

1.2 A Single Layer of Neurons

We extend the idea above to multiple neurons by considering the case where the input x is fed as an input to multiple such neurons as shown in Figure 3.
If we refer to the different neurons’ weights as {w(1) , · · · , w(m) }
and the biases as {b1 , · · · , bm }, we can say the respective activations
are $\{a_1, \cdots, a_m\}$:

$$a_1 = \frac{1}{1 + \exp(-(w^{(1)T} x + b_1))}, \quad \ldots, \quad a_m = \frac{1}{1 + \exp(-(w^{(m)T} x + b_m))}$$

Let us define the following abstractions to keep the notation simple and useful for more complex networks:

$$\sigma(z) = \begin{bmatrix} \frac{1}{1+\exp(-z_1)} \\ \vdots \\ \frac{1}{1+\exp(-z_m)} \end{bmatrix} \qquad b = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \in \mathbb{R}^m \qquad W = \begin{bmatrix} -\; w^{(1)T} \;- \\ \cdots \\ -\; w^{(m)T} \;- \end{bmatrix} \in \mathbb{R}^{m \times n}$$

We can now write the output of scaling and biases as:

z = Wx + b
Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.

The activations of the sigmoid function can then be written as:

$$\begin{bmatrix} a^{(1)} \\ \vdots \\ a^{(m)} \end{bmatrix} = \sigma(z) = \sigma(Wx + b)$$
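A minimal NumPy sketch of this single layer of sigmoid neurons, following the shapes defined above; the function names are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, x):
    # W: (m, n) with rows w^(i); b: (m,); x: (n,). Returns a = sigma(Wx + b), shape (m,).
    z = W @ x + b
    return sigmoid(z)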

So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.

1.3 Feed-forward Computation


So far we have seen how an input vector x ∈ Rn can be fed to a
layer of sigmoid units to create activations a ∈ Rm . But what is the
intuition behind doing so? Let us consider the following named-
entity recognition (NER) problem in NLP as an example:

"Museums in Paris are amazing"


Here, we want to classify whether or not the center word "Paris" is a named-entity. In such cases, it is very likely that we would not just want to capture the presence of words in the window of word vectors but some other interactions between the words in order to make the classification. For instance, maybe it should matter that "Museums" is the first word only if "in" is the second word. Such non-linear decisions can often not be captured by inputs fed directly to a Softmax function but instead require the scoring of the intermediate layer discussed in Section 1.2. We can thus use another matrix $U \in \mathbb{R}^{m \times 1}$ to generate an unnormalized score for a classification task from the activations:

$$s = U^T a = U^T f(Wx + b)$$

where f is the activation function.

Analysis of Dimensions: If we represent each word using a 4-dimensional word vector and we use a 5-word window as input (as in the above example), then the input $x \in \mathbb{R}^{20}$. If we use 8 sigmoid units in the hidden layer and generate 1 score output from the activations, then $W \in \mathbb{R}^{8 \times 20}$, $b \in \mathbb{R}^8$, $U \in \mathbb{R}^{8 \times 1}$, $s \in \mathbb{R}$. The stage-wise feed-forward computation is then: $z = Wx + b$, $a = \sigma(z)$, $s = U^T a$.
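Using the dimensions from this example, the feed-forward scoring computation might look like the following short sketch; random values stand in for learned parameters, and the names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(20)          # 5-word window of 4-d word vectors, x in R^20
W = rng.standard_normal((8, 20))     # hidden layer weights, W in R^{8x20}
b = rng.standard_normal(8)           # hidden layer biases, b in R^8
U = rng.standard_normal((8, 1))      # scoring vector, U in R^{8x1}

z = W @ x + b                        # pre-activations
a = 1.0 / (1.0 + np.exp(-z))         # sigmoid activations, a in R^8
s = float(U.T @ a)                   # unnormalized score, s in R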

Figure 4: This image captures how a simple feed-forward network might compute its output.

1.4 Maximum Margin Objective Function

Like most machine learning models, neural networks also need an
optimization objective, a measure of error or goodness which we
want to minimize or maximize respectively. Here, we will discuss a
popular error metric known as the maximum margin objective. The
idea behind using this objective is to ensure that the score computed
for "true" labeled data points is higher than the score computed for
"false" labeled data points.
Using the previous example, if we call the score computed for the
"true" labeled window "Museums in Paris are amazing" as s and the

score computed for the "false" labeled window "Not all museums in
Paris" as sc (subscripted as c to signify that the window is "corrupt"), then our objective function would be to maximize (s − sc) or to
minimize (sc − s). However, we modify our objective to ensure that
error is only computed if sc > s ⇒ (sc − s) > 0. The intuition
behind doing this is that we only care that the "true" data point has a higher score than the "false" data point and that the rest does not matter. Thus, we want our error to be (sc − s) if sc > s and 0 otherwise. Thus,
our optimization objective is now:

minimize J = max(sc − s, 0)

However, the above optimization objective is risky in the sense that


it does not attempt to create a margin of safety. We would want the
"true" labeled data point to score higher than the "false" labeled data
point by some positive margin ∆. In other words, we would want
error to be calculated if (s − sc < ∆) and not just when (s − sc < 0).
Thus, we modify the optimization objective:

minimize J = max(∆ + sc − s, 0)

We can scale this margin such that it is ∆ = 1 and let the other
parameters in the optimization problem adapt to this without any
change in performance. For more information on this, read about
functional and geometric margins - a topic often covered in the study
of Support Vector Machines. Finally, we define the following opti- The max-margin objective function
mization objective which we optimize over all training windows: is most commonly associated with
Support Vector Machines (SVMs)

minimize J = max(1 + sc − s, 0)

In the above formulation sc = U T f (Wxc + b) and s = U T f (Wx + b).
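A brief sketch of this objective for a single true/corrupt window pair, reusing the feed-forward scorer from Section 1.3 with a sigmoid non-linearity; the names are illustrative and the margin is fixed to 1 as in the notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    # s = U^T f(Wx + b)
    return float(U.T @ sigmoid(W @ x + b))

def max_margin_loss(x_true, x_corrupt, W, b, U, delta=1.0):
    # J = max(delta + s_c - s, 0)
    s = score(x_true, W, b, U)
    s_c = score(x_corrupt, W, b, U)
    return max(delta + s_c - s, 0.0)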

1.5 Training with Backpropagation – Elemental


In this section we discuss how we train the different parameters in
the model when the cost J discussed in Section 1.4 is positive. No
parameter updates are necessary if the cost is 0. Since we typically
update parameters using gradient descent (or a variant such as SGD),
we typically need the gradient information for any parameter as
required in the update equation:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_{\theta^{(t)}} J$$

Backpropagation is a technique that allows us to use the chain rule


of differentiation to calculate loss gradients for any parameter used
in the feed-forward computation on the model. To understand this
further, let us understand the toy network shown in Figure 5 for
which we will perform backpropagation.

Figure 5: This is a 4-2-1 neural network where neuron j on layer k receives input $z_j^{(k)}$ and produces activation output $a_j^{(k)}$.

Here, we use a neural network with a single hidden layer and a


single unit output. Let us establish some notation that will make it
easier to generalize this model later:

• xi is an input to the neural network.

• s is the output of the neural network.

• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input $z_j^{(k)}$ and produces the scalar activation output $a_j^{(k)}$.

• We will call the backpropagated error calculated at $z_j^{(k)}$ as $\delta_j^{(k)}$.

• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, $x_j = z_j^{(1)} = a_j^{(1)}$.

• $W^{(k)}$ is the transfer matrix that maps the output from the k-th layer to the input to the (k+1)-th layer. Thus, $W^{(1)} = W$ and $W^{(2)} = U^T$ to put this new generalized notation in perspective of Section 1.3.
Let us begin: Suppose the cost J = (1 + sc − s) is positive and we want to perform the update of parameter $W_{14}^{(1)}$ (in Figure 5 and Figure 6). We must realize that $W_{14}^{(1)}$ only contributes to $z_1^{(2)}$ and thus $a_1^{(2)}$. This fact is crucial to understanding backpropagation – backpropagated gradients are only affected by values they contribute to. $a_1^{(2)}$ is consequently used in the forward computation of the score by multiplication with $W_1^{(2)}$. We can see from the max-margin loss that:

$$\frac{\partial J}{\partial s} = -\frac{\partial J}{\partial s_c} = -1$$

Therefore we will work with $\frac{\partial s}{\partial W_{ij}^{(1)}}$ here for simplicity. Thus,

$$\frac{\partial s}{\partial W_{ij}^{(1)}} = \frac{\partial W^{(2)} a^{(2)}}{\partial W_{ij}^{(1)}} = \frac{\partial W_i^{(2)} a_i^{(2)}}{\partial W_{ij}^{(1)}} = W_i^{(2)} \frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}}$$

$$\Rightarrow W_i^{(2)} \frac{\partial a_i^{(2)}}{\partial W_{ij}^{(1)}} = W_i^{(2)} \frac{\partial a_i^{(2)}}{\partial z_i^{(2)}} \frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} = W_i^{(2)} \frac{\partial f(z_i^{(2)})}{\partial z_i^{(2)}} \frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}} = W_i^{(2)} f'(z_i^{(2)}) \frac{\partial z_i^{(2)}}{\partial W_{ij}^{(1)}}$$

$$= W_i^{(2)} f'(z_i^{(2)}) \frac{\partial}{\partial W_{ij}^{(1)}} \left( b_i^{(1)} + a_1^{(1)} W_{i1}^{(1)} + a_2^{(1)} W_{i2}^{(1)} + a_3^{(1)} W_{i3}^{(1)} + a_4^{(1)} W_{i4}^{(1)} \right)$$

$$= W_i^{(2)} f'(z_i^{(2)}) \frac{\partial}{\partial W_{ij}^{(1)}} \left( b_i^{(1)} + \sum_k a_k^{(1)} W_{ik}^{(1)} \right) = W_i^{(2)} f'(z_i^{(2)}) \, a_j^{(1)} = \delta_i^{(2)} \cdot a_j^{(1)}$$

We see above that the gradient reduces to the product $\delta_i^{(2)} \cdot a_j^{(1)}$, where $\delta_i^{(2)}$ is essentially the error propagating backwards from the i-th neuron in layer 2, and $a_j^{(1)}$ is the input fed to the i-th neuron in layer 2 when scaled by $W_{ij}^{(1)}$.
Let us discuss the "error sharing/distribution" interpretation of
backpropagation better using Figure 6 as an example. Say we were to
update $W_{14}^{(1)}$:

Figure 6: This subnetwork shows the relevant parts of the network required to update $W_{ij}^{(1)}$.

1. We start with an error signal of 1 propagating backwards from $a_1^{(3)}$.

2. We then multiply this error by the local gradient of the neuron which maps $z_1^{(3)}$ to $a_1^{(3)}$. This happens to be 1 in this case and thus, the error is still 1. This is now known as $\delta_1^{(3)} = 1$.

3. At this point, the error signal of 1 has reached $z_1^{(3)}$. We now need to distribute the error signal so that the "fair share" of the error reaches $a_1^{(2)}$.

4. This amount is the (error signal at $z_1^{(3)}$ = $\delta_1^{(3)}$) × $W_1^{(2)}$ = $W_1^{(2)}$. Thus, the error at $a_1^{(2)}$ = $W_1^{(2)}$.

5. As we did in step 2, we need to move the error across the neuron which maps $z_1^{(2)}$ to $a_1^{(2)}$. We do this by multiplying the error signal at $a_1^{(2)}$ by the local gradient of the neuron, which happens to be $f'(z_1^{(2)})$.

6. Thus, the error signal at $z_1^{(2)}$ is $f'(z_1^{(2)}) W_1^{(2)}$. This is known as $\delta_1^{(2)}$.

7. Finally, we need to distribute the "fair share" of the error to $W_{14}^{(1)}$ by simply multiplying it by the input it was responsible for forwarding, which happens to be $a_4^{(1)}$.

8. Thus, the gradient of the loss with respect to $W_{14}^{(1)}$ is calculated to be $a_4^{(1)} f'(z_1^{(2)}) W_1^{(2)}$.

Notice that the result we arrive at using this approach is exactly


the same as that we arrived at using explicit differentiation earlier.
Thus, we can calculate error gradients with respect to a parameter
in the network using either the chain rule of differentiation or using
an error sharing and distributed flow approach – both of these ap-
proaches happen to do the exact same thing but it might be helpful
to think about them one way or another.

Bias Updates: Bias terms (such as $b_1^{(1)}$) are mathematically equivalent to other weights contributing to the neuron input ($z_1^{(2)}$) as long as the input being forwarded is 1. As such, the bias gradient for neuron i on layer k is simply $\delta_i^{(k)}$. For instance, if we were updating $b_1^{(1)}$ instead of $W_{14}^{(1)}$ above, the gradient would simply be $f'(z_1^{(2)}) W_1^{(2)}$.

Generalized steps to propagate $\delta^{(k)}$ to $\delta^{(k-1)}$:

1. We have error $\delta_i^{(k)}$ propagating backwards from $z_i^{(k)}$, i.e. neuron i at layer k. See Figure 7.

2. We propagate this error backwards to $a_j^{(k-1)}$ by multiplying $\delta_i^{(k)}$ by the path weight $W_{ij}^{(k-1)}$.

3. Thus, the error received at $a_j^{(k-1)}$ is $\delta_i^{(k)} W_{ij}^{(k-1)}$.

4. However, $a_j^{(k-1)}$ may have been forwarded to multiple nodes in the next layer as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.

5. Thus, the error received at $a_j^{(k-1)}$ is $\delta_i^{(k)} W_{ij}^{(k-1)} + \delta_m^{(k)} W_{mj}^{(k-1)}$.

6. In fact, we can generalize this to be $\sum_i \delta_i^{(k)} W_{ij}^{(k-1)}$.

7. Now that we have the correct error at $a_j^{(k-1)}$, we move it across neuron j at layer k − 1 by multiplying with the local gradient $f'(z_j^{(k-1)})$.

8. Thus, the error that reaches $z_j^{(k-1)}$, called $\delta_j^{(k-1)}$, is $f'(z_j^{(k-1)}) \sum_i \delta_i^{(k)} W_{ij}^{(k-1)}$.

Figure 7: Propagating error from $\delta^{(k)}$ to $\delta^{(k-1)}$.

Figure 8: Propagating error from $\delta^{(k)}$ to $\delta^{(k-1)}$.

1.6 Training with Backpropagation – Vectorized


So far, we discussed how to calculate gradients for a given parameter in the model. Here we will generalize the approach above so that we update weight matrices and bias vectors all at once. Note that these are simply extensions of the above model that will help build intuition for the way error propagation can be done at a matrix-vector level.

For a given parameter $W_{ij}^{(k)}$, we identified that the error gradient is simply $\delta_i^{(k+1)} \cdot a_j^{(k)}$. As a reminder, $W^{(k)}$ is the matrix that maps $a^{(k)}$ to $z^{(k+1)}$. We can thus establish that the error gradient for the entire matrix $W^{(k)}$ is:

$$\nabla_{W^{(k)}} = \begin{bmatrix} \delta_1^{(k+1)} a_1^{(k)} & \delta_1^{(k+1)} a_2^{(k)} & \cdots \\ \delta_2^{(k+1)} a_1^{(k)} & \delta_2^{(k+1)} a_2^{(k)} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} = \delta^{(k+1)} a^{(k)T}$$

Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector $\delta^{(k)}$. We established earlier using Figure 8 that $\delta_j^{(k)} = f'(z_j^{(k)}) \sum_i \delta_i^{(k+1)} W_{ij}^{(k)}$. This can easily generalize to matrices such that:

$$\delta^{(k)} = f'(z^{(k)}) \circ (W^{(k)T} \delta^{(k+1)})$$

Error propagates from layer (k+1) to (k) in the manner above. Of course, this assumes that in the forward propagation the signal $z^{(k)}$ first goes through activation neurons f to generate activations $a^{(k)}$, and these are then linearly combined to yield $z^{(k+1)}$ via transfer matrix $W^{(k)}$.

In the above formulation, the ◦ operator corresponds to an element-wise product between elements of vectors ($\circ : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}^N$).

Computational efficiency: Having explored element-wise updates


as well as vector-wise updates, we must realize that the vectorized
implementations run substantially faster in scientific computing
environments such as MATLAB or Python (using NumPy/SciPy
packages). Thus, we should use vectorized implementation in prac-
tice. Furthermore, we should also reduce redundant calculations
in backpropagation - for instance, notice that δ(k) depends directly
on δ(k+1) . Thus, we should ensure that when we update W (k) using
δ(k+1) , we save δ(k+1) to later derive δ(k) – and we then repeat this for
(k − 1) . . . (1). Such a recursive procedure is what makes backpropa-
gation a computationally affordable procedure.
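To make these vectorized updates concrete, here is a minimal sketch of one backward pass for the single-hidden-layer, max-margin network from Section 1, assuming the loss is positive and treating U as a vector for simplicity. The variable names are illustrative, not the course's reference code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def max_margin_backward(x, x_c, W, b, U):
    # Gradients of J = max(1 + s_c - s, 0) with respect to W, b and U.
    grads = {"W": np.zeros_like(W), "b": np.zeros_like(b), "U": np.zeros_like(U)}
    z, z_c = W @ x + b, W @ x_c + b
    a, a_c = sigmoid(z), sigmoid(z_c)
    s, s_c = U @ a, U @ a_c
    if 1 + s_c - s <= 0:
        return grads                              # margin satisfied: no update
    # The true window enters with dJ/ds = -1, the corrupt window with dJ/ds_c = +1.
    for sign, xi, ai in [(-1.0, x, a), (+1.0, x_c, a_c)]:
        delta = sign * U * ai * (1.0 - ai)        # delta = f'(z) o (W^(2)T * dJ/ds)
        grads["U"] += sign * ai                   # dJ/dU = +/- a
        grads["W"] += np.outer(delta, xi)         # outer product of error and forwarded input
        grads["b"] += delta
    return grads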

2 Neural Networks: Tips and Tricks

Having discussed the mathematical foundations of neural networks,


we will now dive into some tips and tricks commonly employed

when using neural networks in practice.

2.1 Gradient Check


In the last section, we discussed in detail how to calculate error
gradients/updates for parameters in a neural network model via
calculus-based (analytic) methods. Here we now introduce a tech-
nique of numerically approximating these gradients – though too
computationally inefficient to be used directly for training the net-
works, this method will allow us to very precisely estimate the
derivative with respect to any parameter; it can thus serve as a useful
sanity check on the correctness of our analytic derivatives. Given a
model with parameter vector θ and loss function J, the numerical
gradient around $\theta_i$ is simply given by the centered difference formula:

$$f'(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2e}$$
where e is a small number (usually around 1e−5 ). The term J (θ (i+) )
is simply the error calculated on a forward pass for a given input
when we perturb the parameter θ’s ith element by +e. Similarly, the
term J (θ (i−) ) is the error calculated on a forward pass for the same
input when we perturb the parameter θ’s ith element by −e. Thus,
using two forward passes, we can approximate the gradient with
respect to any given parameter element in the model. We note that
this definition of the numerical gradient follows very naturally from
the definition of the derivative, where, in the scalar case,

$$f'(x) \approx \frac{f(x + e) - f(x)}{e}$$
Of course, there is a slight difference – the definition above only perturbs x in the positive direction to compute the gradient. While it would have been perfectly acceptable to define the numerical gradient in this way, in practice it is often more precise and stable to use the centered difference formula, where we perturb a parameter in both directions. The intuition is that to get a better approximation of the derivative/slope around a point, we need to examine the function f's behavior both to the left and right of that point. It can also be shown using Taylor's theorem that the centered difference formula has an error proportional to $e^2$, which is quite small, whereas the derivative definition is more error-prone.

Gradient checks are a great way to compare analytical and numerical gradients. The two should be close; the numerical gradient can be calculated using the centered difference formula above, where $J(\theta^{(i+)})$ and $J(\theta^{(i-)})$ can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.
Now, a natural question you might ask is, if this method is so pre-
cise, why do we not use it to compute all of our network gradients
instead of applying back-propagation? The simple answer, as hinted
earlier, is inefficiency – recall that every time we want to compute the
gradient with respect to an element, we need to make two forward

passes through the network, which will be computationally expen-


sive. Furthermore, many large-scale neural networks can contain
millions of parameters, and computing two passes per parameter is
clearly not optimal. And, since in optimization techniques such as
SGD, we must compute the gradients once per iteration for several
thousands of iterations, it is obvious that this method quickly grows
intractable. This inefficiency is why we only use gradient check to
verify the correctness of our analytic gradients, which are much
quicker to compute. A standard implementation of gradient check is
shown below:

Snippet 2.1
import numpy as np

def eval_numerical_gradient(f, x):
    """
    A naive implementation of the numerical gradient of f at x.
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate the function at x+h and x-h for this index
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h        # increment by h
        fxh_plus = f(x)              # evaluate f(x + h)
        x[ix] = old_value - h        # decrement by h
        fxh_minus = f(x)             # evaluate f(x - h)
        x[ix] = old_value            # restore to previous value (very important!)

        # compute the partial derivative with the centered difference formula
        grad[ix] = (fxh_plus - fxh_minus) / (2 * h)   # the slope
        it.iternext()                # step to next dimension
    return grad

2.2 Regularization
As with many machine learning models, neural networks are highly
prone to overfitting, where a model is able to obtain near perfect per-
formance on the training dataset, but loses the ability to generalize
to unseen data. A common technique used to address overfitting (an
issue also known as the “high-variance problem”) is the incorpora-
tion of an L2 regularization penalty. The idea is that we will simply
append an extra term to our loss function J, so that the overall cost is
now calculated as:

$$J_R = J + \lambda \sum_{i=1}^{L} \left\| W^{(i)} \right\|_F$$

The Frobenius norm of a matrix U is defined as follows: $\|U\|_F = \sqrt{\sum_i \sum_j U_{ij}^2}$.

In the above formulation, $\left\| W^{(i)} \right\|_F$ is the Frobenius norm of the matrix $W^{(i)}$ (the i-th weight matrix in the network) and λ is the hyper-parameter controlling how much weight the regularization
term has relative to the original cost function. Since we are trying
to minimize JR , what regularization is essentially doing is penaliz-
ing weights for being too large while optimizing over the original
cost function. Due to the quadratic nature of the Frobenius norm
(which computes the sum of the squared elements of a matrix), L2 -
regularization effectively reduces the flexibility of the model and
thereby reduces the overfitting phenomenon. Imposing such a con-
straint can also be interpreted as the prior Bayesian belief that the
optimal weights are close to zero – how close depends on the value
of λ. Choosing the right value of λ is critical, and must be chosen
via hyperparameter-tuning. Too high a value of λ causes most of
the weights to be set too close to 0, and the model does not learn
anything meaningful from the training data, often obtaining poor ac-
curacy on training, validation, and testing sets. Too low a value, and
we fall into the domain of overfitting once again. It must be noted
that the bias terms are not regularized and do not contribute to the
cost term above – try thinking about why this is the case!
There are indeed other types of regularization that are sometimes
used, such as L1 regularization, which sums over the absolute values
(rather than squares) of parameter elements – however, this is less
commonly applied in practice since it leads to sparsity of parameter
weights. In the next section, we discuss dropout, which effectively acts
as another form of regularization by randomly dropping (i.e. setting
to zero) neurons in the forward pass.

2.3 Dropout
Dropout is a powerful technique for regularization, first introduced
by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Net-
works from Overfitting. The idea is simple yet effective – during train-
ing, we will randomly “drop” with some probability (1 − p) a subset
of neurons during each forward/backward pass (or equivalently,
we will keep alive each neuron with a probability p). Then, during
testing, we will use the full network to compute our predictions. The
result is that the network typically learns more meaningful informa-
tion from the data, is less likely to overfit, and usually obtains higher
performance overall on the task at hand. One intuitive reason why
this technique should be so effective is that what dropout is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.
In practice, the way we introduce dropout is that we take the out-
put h of each layer of neurons, and keep each neuron with prob-
ability p, and else set it to 0. Then, during back-propagation, we
only pass gradients through neurons that were kept alive during
the forward pass. Finally, during testing, we compute the forward
pass using all of the neurons in the network. However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training – else the magnitude of the outputs could be radically different, and the behavior of the network is no longer well-defined. Thus, we must typically divide the outputs of each neuron during testing by a certain value – it is left as an exercise to the reader to determine what this value should be in order for the expected outputs during training and testing to be equivalent.

Figure: Dropout applied to an artificial neural network. Image credits to Srivastava et al.
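A minimal sketch of the training-time masking described above; this is an illustrative helper, and the test-time rescaling value is deliberately left out, since determining it is the exercise posed in the text.

import numpy as np

def dropout_forward(h, p, train=True):
    # h: layer outputs; p: keep probability.
    if not train:
        return h                              # full network at test time (rescaling omitted; see exercise)
    mask = (np.random.rand(*h.shape) < p)     # keep each neuron with probability p
    return h * mask                           # dropped neurons output 0 and receive no gradient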

2.4 Neuron Units


So far we have discussed neural networks that contain sigmoidal
neurons to introduce nonlinearities; however in many applications
better networks can be designed using other activation functions.
Some common choices are listed here with their function and gra-
dient definitions and these can be substituted with the sigmoidal
functions discussed above.

Sigmoid: This is the default choice we have discussed; the activation


function σ is given by:

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

where $\sigma(z) \in (0, 1)$.

Figure 9: The response of a sigmoid nonlinearity.

The gradient of $\sigma(z)$ is:

$$\sigma'(z) = \frac{\exp(-z)}{(1 + \exp(-z))^2} = \sigma(z)(1 - \sigma(z))$$

Tanh: The tanh function is an alternative to the sigmoid function


that is often found to converge faster in practice. The primary differ-
ence between tanh and sigmoid is that tanh output ranges from −1 to
1 while the sigmoid ranges from 0 to 1.

$$\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = 2\sigma(2z) - 1$$

where $\tanh(z) \in (-1, 1)$.

Figure 10: The response of a tanh nonlinearity.

The gradient of $\tanh(z)$ is:

$$\tanh'(z) = 1 - \left(\frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}\right)^2 = 1 - \tanh^2(z)$$

Hard tanh: The hard tanh function is sometimes preferred over the
tanh function since it is computationally cheaper. It does however
saturate for magnitudes of z greater than 1. The activation of the
hard tanh is:

$$\text{hardtanh}(z) = \begin{cases} -1 & : z < -1 \\ z & : -1 \le z \le 1 \\ 1 & : z > 1 \end{cases}$$

Figure 11: The response of a hard tanh nonlinearity.

The derivative can also be expressed in a piecewise functional form:

$$\text{hardtanh}'(z) = \begin{cases} 1 & : -1 \le z \le 1 \\ 0 & : \text{otherwise} \end{cases}$$

Soft sign: The soft sign function is another nonlinearity which can
be considered an alternative to tanh since it too does not saturate as
easily as hard clipped functions:

$$\text{softsign}(z) = \frac{z}{1 + |z|}$$

Figure 12: The response of a soft sign nonlinearity.

The derivative is then expressed as:

$$\text{softsign}'(z) = \frac{\text{sgn}(z)}{(1 + |z|)^2}$$

where sgn is the signum function which returns ±1 depending on the sign of z.

ReLU: The ReLU (Rectified Linear Unit) function is a popular choice


of activation since it does not saturate even for larger values of z and
has found much success in computer vision applications:

$$\text{rect}(z) = \max(z, 0)$$

Figure 13: The response of a ReLU nonlinearity.

The derivative is then the piecewise function:

$$\text{rect}'(z) = \begin{cases} 1 & : z > 0 \\ 0 & : \text{otherwise} \end{cases}$$

Leaky ReLU: Traditional ReLU units by design do not propagate


any error for non-positive z – the leaky ReLU modifies this such
that a small error is allowed to propagate backwards even when z is
negative:

$$\text{leaky}(z) = \max(z, k \cdot z)$$

where 0 < k < 1.

Figure 14: The response of a leaky ReLU nonlinearity.

This way, the derivative is representable as:

$$\text{leaky}'(z) = \begin{cases} 1 & : z > 0 \\ k & : \text{otherwise} \end{cases}$$
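For reference, a small NumPy sketch of a few of these activations and their gradients; this is a simplified collection for illustration only.

import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def sigmoid_grad(z):   return sigmoid(z) * (1.0 - sigmoid(z))
def tanh_grad(z):      return 1.0 - np.tanh(z) ** 2
def hardtanh(z):       return np.clip(z, -1.0, 1.0)
def relu(z):           return np.maximum(z, 0.0)
def relu_grad(z):      return (z > 0).astype(float)
def leaky_relu(z, k=0.01):      return np.maximum(z, k * z)
def leaky_relu_grad(z, k=0.01): return np.where(z > 0, 1.0, k)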

2.5 Data Preprocessing


As is the case with machine learning models generally, a key step
to ensuring that your model obtains reasonable performance on the
task at hand is to perform basic preprocessing on your data. Some
common techniques are outlined below.

Mean Subtraction
Given a set of input data X, it is customary to zero-center the data by
subtracting the mean feature vector of X from X. An important point
is that in practice, the mean is calculated only across the training set,
and this mean is subtracted from the training, validation, and testing
sets.

Normalization
Another frequently used technique (though perhaps less so than
mean subtraction) is to scale every input feature dimension to have
similar ranges of magnitudes. This is useful since input features are
often measured in different “units”, but we often want to initially
consider all features as equally important. The way we accomplish
this is by simply dividing the features by their respective standard
deviation calculated across the training set.

Whitening
Not as commonly used as mean-subtraction + normalization, whiten-
ing essentially converts the data to have an identity covariance
matrix – that is, features become uncorrelated and have a variance
of 1. This is done by first mean-subtracting the data, as usual, to get
X 0 . We can then take the Singular Value Decomposition (SVD) of X 0
to get matrices U, S, V. We then compute UX 0 to project X 0 into the
basis defined by the columns of U. We finally divide each dimension
of the result by the corresponding singular value in S to scale our
data appropriately (if a singular value is zero, we can just divide by a
small number instead).
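A compact sketch of these preprocessing steps for row-major data X of shape (N, d), with all statistics computed on the training set only. This is an assumed helper: for row-major data, projecting onto the right singular vectors plays the role of the U X' step described above.

import numpy as np

def preprocess(X_train, X_test, eps=1e-8):
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps
    Xtr = (X_train - mean) / std                 # mean subtraction + normalization
    Xte = (X_test - mean) / std
    # Whitening: decorrelate features using the SVD of the preprocessed training data.
    U, S, Vt = np.linalg.svd(Xtr, full_matrices=False)
    Xtr_white = (Xtr @ Vt.T) / (S + eps)         # divide each dimension by its singular value
    Xte_white = (Xte @ Vt.T) / (S + eps)
    return Xtr_white, Xte_white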

2.6 Parameter Initialization


A key step towards achieving superlative performance with a neu-
ral network is initializing the parameters in a reasonable way. A
good starting strategy is to initialize the weights to small random
numbers normally distributed around 0 – and in practice, this often
works acceptably well. However, in Understanding the difficulty of training deep feedforward neural networks (2010), Glorot et al. study the effect of different weight and bias
initialization schemes on training dynamics. The empirical findings
suggest that for sigmoid and tanh activation units, faster convergence
and lower error rates are achieved when the weights of a matrix $W \in \mathbb{R}^{n^{(l+1)} \times n^{(l)}}$ are initialized randomly with a uniform distribution as follows:

$$W \sim U\left[ -\sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}}, \; \sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}} \right]$$

Where $n^{(l)}$ is the number of input units to W (fan-in) and $n^{(l+1)}$
is the number of output units from W (fan-out). In this parameter
initialization scheme, bias units are initialized to 0. This approach
attempts to maintain activation variances as well as backpropagated
gradient variances across layers. Without such initialization, the
gradient variances (which are a proxy for information) generally
decrease with backpropagation across layers.
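A small sketch of this initialization scheme; the helper name is an assumption, not the course's code.

import numpy as np

def xavier_uniform(fan_in, fan_out):
    # W has shape (fan_out, fan_in); biases are initialized to 0 as described above.
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    W = np.random.uniform(-bound, bound, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b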

2.7 Learning Strategies


The rate/magnitude of model parameter updates during training can
be controlled using the learning rate. In the following naive Gradient
Descent formulation, α is the learning rate:

θ new = θ old − α∇θ Jt (θ )



You might think that for fast convergence rates, we should set α
to larger values – however faster convergence is not guaranteed with
larger learning rates. In fact, with very large learning rates, we
might experience that the loss function actually diverges because the
parameters update causes the model to overshoot the convex minima
as shown in Figure 15. In non-convex models (most of those we work
with), the outcome of a large learning rate is unpredictable, but the
chances of diverging loss functions are very high.
The simple solution to avoiding a diverging loss is to use a very
small learning rate so that we carefully scan the parameter space –
of course, if we use too small a learning rate, we might not converge in a reasonable amount of time, or might get caught in local minima. Thus, as with any other hyperparameter, the learning rate must be tuned effectively.

Figure 15: Here we see that updating parameter w2 with a large learning rate can lead to divergence of the error.
Since training is the most expensive phase in a deep learning
system, some research has attempted to improve this naive approach
to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight $W_{ij}$ (where $W \in \mathbb{R}^{n^{(l+1)} \times n^{(l)}}$) by the inverse square root of the fan-in of the neuron ($n^{(l)}$).

There are several other techniques that have proven to be effec-


tive as well – one such method is annealing, where, after several
iterations, the learning rate is reduced in some way – this method
ensures that we start off with a high learning rate and approach a
minimum quickly; as we get closer to the minimum, we start lower-
ing our learning rate so that we can find the optimum under a more
fine-grained scope. A common way to perform annealing is to reduce
the learning rate α by a factor x after every n iterations of learning.
Exponential decay is also common, where, the learning rate α at iter-
ation t is given by α(t) = α0 e−kt , where α0 is the initial learning rate,
and k is a hyperparameter. Another approach is to allow the learning
rate to decrease over time such that:

α0 τ
α(t) =
max(t, τ )
In the above scheme, α0 is a tunable parameter and represents the
starting learning rate. τ is also a tunable parameter and represents
the time at which the learning rate should start reducing. In practice,
this method has been found to work quite well. In the next section
we discuss another method for adaptive gradient descent which does
not require hand-set learning rates.
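For concreteness, the two decay schedules mentioned above might be written as follows; the function names are illustrative.

import numpy as np

def exponential_decay(alpha0, k, t):
    # alpha(t) = alpha0 * exp(-k t)
    return alpha0 * np.exp(-k * t)

def inverse_time_decay(alpha0, tau, t):
    # alpha(t) = alpha0 * tau / max(t, tau): constant until step tau, then decreasing
    return alpha0 * tau / max(t, tau)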

2.8 Momentum Updates


Momentum methods, a variant of gradient descent inspired by the
study of dynamics and motion in physics, attempt to use the “veloc-

ity” of updates as a more effective update scheme. Pseudocode for


momentum updates is shown below:

Snippet 2.2
# Computes a standard momentum update
# on parameters x
v = mu*v - alpha*grad_x
x += v

2.9 Adaptive Optimization Methods


AdaGrad is an implementation of standard stochastic gradient de-
scent (SGD) with one key difference: the learning rate can vary for
each parameter. The learning rate for each parameter depends on
the history of gradient updates of that parameter in a way such that
parameters with a scarce history of updates are updated faster using
a larger learning rate. In other words, parameters that have not been
updated much in the past are likelier to have higher learning rates
now. Formally:

$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i} \qquad \text{where} \qquad g_{t,i} = \frac{\partial}{\partial \theta_i^t} J_t(\theta)$$

In this technique, we see that if the RMS of the history of gradients


is extremely low, the learning rate is very high. A simple implemen-
tation of this technique is:

Snippet 2.3
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)

Other common adaptive methods are RMSProp and Adam, whose


update rules are shown below (courtesy of Andrej Karpathy):

Snippet 2.4
# Update rule for RMS prop
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Snippet 2.5
# Update rule for Adam

m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

RMSProp is a variant of AdaGrad that utilizes a moving average


of squared gradients – in particular, unlike AdaGrad, its updates
do not become monotonically smaller. The Adam update rule is in
turn a variant of RMSProp, but with the addition of momentum-
like updates. We refer the reader to the respective sources of these
methods for more detailed analyses of their behavior.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part IV
Dependency Parsing
Course Instructors: Christopher Manning, Richard Socher
Authors: Lisa Wang, Juhi Naik, and Shayne Longpre
Winter 2019

Keyphrases: Dependency Parsing.

1 Dependency Grammar and Dependency Structure

Parse trees in NLP, analogous to those in compilers, are used to ana-


lyze the syntactic structure of sentences. There are two main types of
structures used - constituency structures and dependency structures.
Constituency Grammar uses phrase structure grammar to organize
words into nested constituents. This will be covered in more detail in
following chapters. We now focus on Dependency Parsing.
Dependency structure of sentences shows which words depend
on (modify or are arguments of) which other words. These binary
asymmetric relations between the words are called dependencies and
are depicted as arrows going from the head (or governor, superior,
regent) to the dependent (or modifier, inferior, subordinate). Usually
these dependencies form a tree structure. They are often typed with
the name of grammatical relations (subject, prepositional object,
apposition, etc.). An example of such a dependency tree is shown in
Figure 1. Sometimes a fake ROOT node is added as the head to the
whole tree so that every word is a dependent of exactly one node.

Figure 1: Dependency tree for the sentence "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas".

1.1 Dependency Parsing

Dependency parsing is the task of analyzing the syntactic dependency structure of a given input sentence S. The output of a depen-
dency parser is a dependency tree where the words of the input sen-
tence are connected by typed dependency relations. Formally, the de-
pendency parsing problem asks to create a mapping from the input
sentence with words S = w0 w1 ...wn (where w0 is the ROOT) to its
dependency tree graph G. Many different variations of dependency-
based methods have been developed in recent years, including neural
network-based methods, which we will describe later.
To be precise, there are two subproblems in dependency parsing
(adapted from Kuebler et al., chapter 1.2):

1. Learning: Given a training set D of sentences annotated with de-


pendency graphs, induce a parsing model M that can be used to
parse new sentences.

2. Parsing: Given a parsing model M and a sentence S, derive the


optimal dependency graph D for S according to M.

1.2 Transition-Based Dependency Parsing


Transition-based dependency parsing relies on a state machine which
defines the possible transitions to create the mapping from the input
sentence to the dependency tree. The learning problem is to induce
a model which can predict the next transition in the state machine
based on the transition history. The parsing problem is to construct
the optimal sequence of transitions for the input sentence, given the
previously induced model. Most transition-based systems do not
make use of a formal grammar.

1.3 Greedy Deterministic Transition-Based Parsing


This system was introduced by Nivre in 2003 and was radically dif-
ferent from other methods in use at that time.
This transition system is a state machine, which consists of states
and transitions between those states. The model induces a sequence of
transitions from some initial state to one of several terminal states.
States:
For any sentence S = w0 w1 ...wn , a state can be described with a
triple c = (σ, β, A):

1. a stack σ of words wi from S,

2. a buffer β of words wi from S,

3. a set of dependency arcs A of the form (wi , r, w j ), where wi , w j are


from S, and r describes a dependency relation.

It follows that for any sentence S = w0 w1 ...wn ,

1. an initial state c0 is of the form ([w0 ]σ , [w1 , ..., wn ] β , ∅) (only the


ROOT is on the stack σ, all other words are in the buffer β and no
actions have been chosen yet),

2. a terminal state has the form (σ, [] β , A).

Transitions:
Figure 2: Transitions for Dependency Parsing.

There are three types of transitions between states:
1. Shift: Remove the first word in the buffer and push it on top of
the stack. (Pre-condition: buffer has to be non-empty.)

2. Left-Arcr : Add a dependency arc (w j , r, wi ) to the arc set A,


where wi is the word second to the top of the stack and w j is the

word at the top of the stack. Remove wi from the stack. (Pre-
condition: the stack needs to contain at least two items and wi
cannot be the ROOT.)

3. Right-Arcr : Add a dependency arc (wi , r, w j ) to the arc set A,


where wi is the word second to the top of the stack and w j is
the word at the top of the stack. Remove w j from the stack. (Pre-
condition: The stack needs to contain at least two items.)

A more formal definition of these three transitions is presented in


Figure 2.
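A minimal sketch of this transition system, using plain Python tuples for the configuration; this is a simplified version for illustration, and the names are not from the notes.

def initial_state(words):
    # c = (stack, buffer, arcs) with only ROOT on the stack and all words in the buffer.
    return (["ROOT"], list(words), [])

def shift(state):
    stack, buffer, arcs = state
    assert buffer, "Shift requires a non-empty buffer"
    return (stack + [buffer[0]], buffer[1:], arcs)

def left_arc(state, r):
    stack, buffer, arcs = state
    assert len(stack) >= 2 and stack[-2] != "ROOT"
    # Add (w_j, r, w_i): the top of the stack becomes the head of the second-to-top word.
    return (stack[:-2] + [stack[-1]], buffer, arcs + [(stack[-1], r, stack[-2])])

def right_arc(state, r):
    stack, buffer, arcs = state
    assert len(stack) >= 2
    # Add (w_i, r, w_j): the second-to-top word becomes the head of the top word.
    return (stack[:-1], buffer, arcs + [(stack[-2], r, stack[-1])])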

1.4 Neural Dependency Parsing


While there are many deep models for dependency parsing, this
section focuses specifically on greedy, transition-based neural de-
pendency parsers. This class of model has demonstrated compara-
ble performance and significantly better efficiency than traditional
feature-based discriminative dependency parsers. The primary dis-
tinction from previous models is the reliance on dense rather than
sparse feature representations.
The model we will describe employs the arc-standard system
for transitions, as presented in section 1.3. Ultimately, the aim of
the model is to predict a transition sequence from some initial con-
figuration c to a terminal configuration, in which the dependency
parse tree is encoded. As the model is greedy, it attempts to correctly
predict one transition T ∈ { shift, Left-Arcr , Right-Arcr } at
a time, based on features extracted from the current configuration
c = (σ, β, A). Recall, σ is the stack, β the buffer, and A the set of
dependency arcs for a given sentence.

Feature Selection:

Depending on the desired complexity of the model, there is flexi-


bility in defining the input to the neural network. The features for a
given sentence S generally include some subset of:

1. Sword : Vector representations for some of the words in S (and their


dependents) at the top of the stack σ and buffer β.

2. Stag : Part-of-Speech (POS) tags for some of the words in S. POS


tags comprise a small, discrete set: P = { NN, NNP, NNS, DT, J J, ...}

3. Slabel : The arc-labels for some of the words in S. The arc-labels


comprise a small, discrete set, describing the dependency relation:
L = { amod, tmod, nsubj, csubj, dobj, ...}

For each feature type, we will have a corresponding embedding ma-


trix, mapping from the feature’s one hot encoding, to a d-dimensional
dense vector representation. The full embedding matrix for Sword is
Ew ∈ Rd× Nw where Nw is the dictionary/vocabulary size. Corre-
spondingly, the POS and label embedding matrices are Et ∈ Rd× Nt
and El ∈ Rd× Nl where Nt and Nl are the number of distinct POS tags
and arc labels.
Lastly, let the number of chosen elements from each set of features
be denoted as nword , ntag , and nlabel respectively.

Feature Selection Example:

As an example, consider the following choices for Sword , Stag , and


Slabel .

1. Sword : The top 3 words on the stack and buffer: s1 , s2 , s3 , b1 , b2 , b3 .


The first and second leftmost / rightmost children of the top two
words on the stack: lc1 (si ), rc1 (si ), lc2 (si ), rc2 (si ), i = 1, 2. The
leftmost of leftmost / rightmost of rightmost children of the top
two words on the stack: lc1 (lc1 (si )), rc1 (rc1 (si )), i = 1, 2. In total
Sword contains nw = 18 elements.

2. Stag : The corresponding POS tags for Stag (nt = 18).

3. Slabel : The corresponding arc labels of words, excluding those 6


words on the stack/buffer (nl = 12).

Note that we use a special Null token for non-existent elements:


when the stack and buffer are empty or dependents have not been
assigned yet. For a given sentence example, we select the words,
POS tags and arc labels given the schematic defined above, extract
their corresponding dense feature representations produced from the
embedding matrices Ew , Et , and El , and concatenate these vectors
into our inputs [ x w , x t , x l ]. At training time we backpropagate into
the dense vector representations, as well as the parameters at later
layers.

Feedforward Neural Network Model:

The network contains an input layer [ x w , x t , x l ], a hidden layer,


and a final softmax layer with a cross-entropy loss function. We can
either define a single weight matrix in the hidden layer, to operate
on a concatenation of [ x w , x t , x l ], or we can use three weight matrices
[W1w , W1t , W1l ], one for each input type, as shown in Figure 3. We then
apply a non-linear function and use one more affine layer [W2 ] so
that there are an equivalent number of softmax probabilities to the
number of possible transitions (the output dimension).

Figure 3: The neural network architecture for greedy, transition-based dependency parsing. The input layer is $[x^w, x^t, x^l]$ (word, POS tag, and arc label features), the hidden layer computes $h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3$, and the softmax layer computes $p = \text{softmax}(W_2 h)$. The figure also shows an example parser configuration (stack, buffer, and dependency arcs) for the sentence "He has good control."

Note that in Figure 3, $f(x) = x^3$ is the non-linear function used.
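A compact sketch of this network's forward pass with the cube non-linearity; the shapes and names are illustrative assumptions.

import numpy as np

def parser_forward(x_w, x_t, x_l, W1w, W1t, W1l, b1, W2):
    # h = (W1^w x^w + W1^t x^t + W1^l x^l + b1)^3 ;  p = softmax(W2 h)
    h = (W1w @ x_w + W1t @ x_t + W1l @ x_l + b1) ** 3
    scores = W2 @ h
    scores -= scores.max()                  # numerical stability
    p = np.exp(scores)
    return p / p.sum()                      # probabilities over possible transitions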

For a more complete explanation of a greedy transition-based


neural dependency parser, refer to "A Fast and Accurate Dependency
Parser using Neural Networks" under Further Reading.

Further reading:
Danqi Chen, and Christopher D. Manning. "A Fast and Accurate
Dependency Parser using Neural Networks." EMNLP. 2014.
Kuebler, Sandra, Ryan McDonald, and Joakim Nivre. “Depen-
dency parsing.” Synthesis Lectures on Human Language Technolo-
gies 1.1 (2009): 1-127.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part V
Language Models, RNN, GRU and LSTM
Course Instructors: Christopher Manning, Richard Socher
Authors: Milad Mohammadi, Rohit Mundra, Richard Socher, Lisa Wang, Amita Kamath
Winter 2019

Keyphrases: Language Models. RNN. Bi-directional RNN. Deep


RNN. GRU. LSTM.

1 Language Models

1.1 Introduction
Language models compute the probability of occurrence of a number
of words in a particular sequence. The probability of a sequence of
m words {w1 , ..., wm } is denoted as P(w1 , ..., wm ). Since the number
of words coming before a word, wi , varies depending on its location
in the input document, P(w1 , ..., wm ) is usually conditioned on a
window of n previous words rather than all previous words:

$$P(w_1, ..., w_m) = \prod_{i=1}^{i=m} P(w_i \mid w_1, ..., w_{i-1}) \approx \prod_{i=1}^{i=m} P(w_i \mid w_{i-n}, ..., w_{i-1}) \tag{1}$$

Equation 1 is especially useful for speech and translation systems


when determining whether a word sequence is an accurate transla-
tion of an input sentence. In existing language translation systems,
for each phrase / sentence translation, the software generates a num-
ber of alternative word sequences (e.g. {I have, I had, I has, me have, me
had}) and scores them to identify the most likely translation sequence.
In machine translation, the model chooses the best word ordering
for an input phrase by assigning a goodness score to each output
word sequence alternative. To do so, the model may choose between
different word ordering or word choice alternatives. It would achieve
this objective by running all word sequence candidates through a
probability function that assigns each a score. The sequence with
the highest score is the output of the translation. For example, the
machine would give a higher score to "the cat is small" compared
to "small the is cat", and a higher score to "walking home after school"
compared to "walking house after school".

1.2 n-gram Language Models


To compute the probabilities mentioned above, the count of each n-
gram could be compared against the frequency of each word. This is

called an n-gram Language Model. For instance, if the model takes


bi-grams, the frequency of each bi-gram, calculated via combining a
word with its previous word, would be divided by the frequency of
the corresponding uni-gram. Equations 2 and 3 show this relation-
ship for bigram and trigram models.

p(w_2 | w_1) = count(w_1, w_2) / count(w_1)    (2)

p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)    (3)
The relationship in Equation 3 focuses on making predictions
based on a fixed window of context (i.e. the n previous words) used
to predict the next word. But how long should the context be? In
some cases, the window of past consecutive n words may not be suf-
ficient to capture the context. For instance, consider the sentence "As
the proctor started the clock, the students opened their ". If the
window only conditions on the previous three words "the students
opened their", the probabilities calculated based on the corpus may
suggest that the next word be "books" - however, if n had been large
enough to include the "proctor" context, the probability might have
suggested "exam".
This leads us to two main issues with n-gram Language Models:
Sparsity and Storage.

1. Sparsity problems with n-gram Language models


Sparsity problems with these models arise due to two issues.
Firstly, note the numerator of Equation 3. If w1 , w2 and w3 never
appear together in the corpus, the probability of w3 is 0. To solve
this, a small δ could be added to the count for each word in the
vocabulary. This is called smoothing.
Secondly, consider the denominator of Equation 3. If w1 and w2
never occurred together in the corpus, then no probability can be
calculated for w3 . To solve this, we could condition on w2 alone.
This is called backoff.
Increasing n makes sparsity problems worse. Typically, n ≤ 5.

2. Storage problems with n-gram Language models


We know that we need to store the count for all n-grams we saw in
the corpus. As n increases (or the corpus size increases), the model
size increases as well.
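To make the count-based estimates, smoothing, and backoff ideas above concrete, here is a tiny pure-Python bigram model; the toy corpus, the value of δ, and the simple unigram backoff are made up for illustration.

from collections import Counter

corpus = "the students opened their books the students opened their exam".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
V = len(unigram)          # vocabulary size
delta = 0.1               # smoothing constant (made-up value)

def p_bigram(w2, w1):
    """p(w2 | w1) with add-delta smoothing; back off to the unigram
    distribution if w1 was never seen (a simple analogue of the backoff above)."""
    if unigram[w1] == 0:
        return (unigram[w2] + delta) / (len(corpus) + delta * V)   # backoff
    return (bigram[(w1, w2)] + delta) / (unigram[w1] + delta * V)  # Eq. 2 + smoothing

print(p_bigram("their", "opened"))   # frequent bigram -> relatively high
print(p_bigram("exam", "opened"))    # unseen bigram -> small but non-zero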

1.3 Window-based Neural Language Model


The "curse of dimensionality" above was first tackled by Bengio et
al. in A Neural Probabilistic Language Model, which introduced the

first large-scale deep learning for natural language processing model.


This model learns a distributed representation of words, along with the
probability function for word sequences expressed in terms of these
representations. Figure 1 shows the corresponding neural network
architecture. The input word vectors are used by both the hidden
layer and the output layer.
Equation 4 represents Figure 1 and shows the parameters of the
softmax() function, consisting of the standard tanh() function (i.e.
the hidden layer) as well as the linear function, W (3) x + b(3) , that
captures all the previous n input word vectors.

ŷ = softmax(W^(2) tanh(W^(1) x + b^(1)) + W^(3) x + b^(3))    (4)


Note that the weight matrix W^(1) is applied to the word vectors (solid green arrows in Figure 1), W^(2) is applied to the hidden layer (also a solid green arrow), and W^(3) is applied to the word vectors (dashed green arrows).
Figure 1: The first deep neural network architecture model for NLP presented by Bengio et al.
A simplified version of this model can be seen in Figure 2, where
the blue layer signifies concatenated word embeddings for the input
words: e = [e(1) ; e(2) ; e(3) ; e(4) ], the red layer signifies the hidden layer:
h = f (We + b1 ), and the green output distribution is a softmax over
the vocabulary: ŷ = softmax(Uh + b2 ).
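The simplified model in Figure 2 can be sketched directly in NumPy as follows; the vocabulary size, dimensions, and context word indices are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_embed, d_hidden, window = 1000, 64, 128, 4

E = rng.standard_normal((vocab_size, d_embed)) * 0.01   # embedding matrix
W = rng.standard_normal((d_hidden, window * d_embed)) * 0.01
b1 = np.zeros(d_hidden)
U = rng.standard_normal((vocab_size, d_hidden)) * 0.01
b2 = np.zeros(vocab_size)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Previous 4 word indices (made-up), used to predict the 5th word.
context = [12, 87, 3, 501]
e = np.concatenate([E[w] for w in context])   # e = [e(1); e(2); e(3); e(4)]
h = np.tanh(W @ e + b1)                       # h = f(We + b1)
y_hat = softmax(U @ h + b2)                   # distribution over the vocabulary
print(y_hat.argmax())                         # index of the most likely next word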

2 Recurrent Neural Networks (RNN)

Unlike the conventional translation models, where only a finite win-


dow of previous words would be considered for conditioning the
language model, Recurrent Neural Networks (RNN) are capable of
conditioning the model on all previous words in the corpus.
Figure 3 introduces the RNN architecture, where each vertical rectangular box is a hidden layer at a time-step, t. Each such layer holds a number of neurons, each of which performs a linear matrix operation on its inputs followed by a non-linear operation (e.g. tanh()). At each time-step, there are two inputs to the hidden layer: the output of the previous layer h_{t−1}, and the input at that timestep x_t. The former input is multiplied by a weight matrix W^(hh) and the latter by a weight matrix W^(hx) to produce output features h_t, which are multiplied with a weight matrix W^(S) and run through a softmax over the vocabulary to obtain a prediction output ŷ of the next word (Equations 5 and 6). The inputs and outputs of each single neuron are illustrated in Figure 4.
Figure 2: A simplified representation of Figure 1.
Figure 3: A Recurrent Neural Network (RNN). Three time-steps are shown.

h_t = σ(W^(hh) h_{t−1} + W^(hx) x_t)    (5)

ŷ_t = softmax(W^(S) h_t)    (6)



What is interesting here is that the same weights W (hh) and W (hx) are
applied repeatedly at each timestep. Thus, the number of parameters
the model has to learn is less, and most importantly, is independent
of the length of the input sequence - thus defeating the curse of di-
mensionality!
Below are the details associated with each parameter in the net-
work:

• x_1, ..., x_{t−1}, x_t, x_{t+1}, ..., x_T: the word vectors corresponding to a corpus with T words.
Figure 4: The inputs and outputs to a neuron of a RNN.

• h_t = σ(W^(hh) h_{t−1} + W^(hx) x_t): the relationship to compute the hidden layer output features at each time-step t

  – x_t ∈ R^d: input word vector at time t.

  – W^(hx) ∈ R^{D_h × d}: weights matrix used to condition the input word vector, x_t

  – W^(hh) ∈ R^{D_h × D_h}: weights matrix used to condition the output of the previous time-step, h_{t−1}

  – h_{t−1} ∈ R^{D_h}: output of the non-linear function at the previous time-step, t − 1. h_0 ∈ R^{D_h} is an initialization vector for the hidden layer at time-step t = 0.

  – σ(): the non-linearity function (sigmoid here)

• ŷ_t = softmax(W^(S) h_t): the output probability distribution over the vocabulary at each time-step t. Essentially, ŷ_t is the next predicted word given the document context score so far (i.e. h_{t−1}) and the last observed word vector x^(t). Here, W^(S) ∈ R^{|V| × D_h} and ŷ ∈ R^{|V|}, where |V| is the vocabulary size.
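A minimal NumPy sketch of Equations 5 and 6 unrolled over a short sequence is shown below; the dimensions and random inputs are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, Dh, V = 16, 32, 100           # input dim, hidden dim, vocabulary size

W_hh = rng.standard_normal((Dh, Dh)) * 0.01
W_hx = rng.standard_normal((Dh, d)) * 0.01
W_S = rng.standard_normal((V, Dh)) * 0.01

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = rng.standard_normal((5, d))  # word vectors x_1 ... x_5
h = np.zeros(Dh)                  # h_0, the initialization vector

for x_t in xs:
    h = sigmoid(W_hh @ h + W_hx @ x_t)   # Equation 5
    y_hat = softmax(W_S @ h)             # Equation 6
# The same W_hh and W_hx are reused at every time-step.
print(y_hat.shape)   # (V,) -- a distribution over the vocabulary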

An example of an RNN language model is shown in Figure 5. The notation in this image is slightly different: here, the equivalent of W^(hh) is W_h, W^(hx) is W_e, and W^(S) is U. E converts word inputs x^(t) to word embeddings e^(t). The final softmax over the vocabulary shows us the probability of various options for token x^(5), conditioned on all previous tokens. The input could be much longer than 4-5 tokens.
Figure 5: An RNN Language Model.

2.1 RNN Loss and Perplexity


The loss function used in RNNs is often the cross entropy error in-
troduced in earlier notes. Equation 7 shows this function as the sum
over the entire vocabulary at time-step t.
J^(t)(θ) = − ∑_{j=1}^{|V|} y_{t,j} × log(ŷ_{t,j})    (7)

The cross entropy error over a corpus of size T is:

J = (1/T) ∑_{t=1}^{T} J^(t)(θ) = − (1/T) ∑_{t=1}^{T} ∑_{j=1}^{|V|} y_{t,j} × log(ŷ_{t,j})    (8)

Equation 9 defines perplexity: it is 2 raised to the power of the cross-entropy loss J of Equation 8 (i.e. the average negative log probability the model assigns to the data). Perplexity is a measure of confusion, where lower values imply more confidence in predicting the next word in the sequence (compared to the ground truth outcome).

Perplexity = 2^J    (9)
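As a quick sketch of Equations 8 and 9 (assuming base-2 logarithms so that perplexity is exactly 2^J), the loss and perplexity can be computed from the probabilities the model assigns to the correct next words; the numbers below are made up for illustration.

import numpy as np

# Predicted probability assigned to the correct word at each of T time-steps.
p_correct = np.array([0.2, 0.5, 0.1, 0.4])

J = -np.mean(np.log2(p_correct))   # cross-entropy (Equation 8 with one-hot targets)
perplexity = 2.0 ** J              # Equation 9
print(J, perplexity)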

2.2 Advantages, Disadvantages and Applications of RNNs


RNNs have several advantages:

1. They can process input sequences of any length

2. The model size does not increase for longer input sequence
lengths

3. Computation for step t can (in theory) use information from many
steps back.

4. The same weights are applied to every timestep of the input, so


there is symmetry in how inputs are processed

However, they also have some disadvantages:

1. Computation is slow - because it is sequential, it cannot be paral-


lelized

2. In practice, it is difficult to access information from many steps


back due to problems like vanishing and exploding gradients,
which we discuss in the following subsection

The amount of memory required to run a layer of RNN is pro-


portional to the number of words in the corpus. We can consider a
sentence as a minibatch, and a sentence with k words would have k
word vectors to be stored in memory. Also, the RNN must maintain
two pairs of W, b matrices. As aforementioned, while the size of W
could be very large, it does not scale with the size of the corpus (un-
like the traditional language models). For a RNN with 1000 recurrent
layers, the matrix would be 1000 × 1000 regardless of the corpus size.
RNNs can be used for many tasks, such as tagging (e.g. part-of-
speech, named entity recognition), sentence classification (e.g. senti-
ment classification), and encoder modules (e.g. question answering,

machine translation, and many other tasks). In the latter two applica-
tions, we want a representation for the sentence, which we can obtain
by taking the element-wise max or mean of all hidden states of the
timesteps in that sentence.
Note: Figure 6 is an alternative representation of RNNs used in
some publications. It represents the RNN hidden layer as a loop.

2.3 Vanishing Gradient & Gradient Explosion Problems

Figure 6: The illustration of a RNN as a loop over time-steps.

Recurrent neural networks propagate weight matrices from one time-step to the next. Recall the goal of a RNN implementation is to enable propagating context information through faraway time-steps.
For example, consider the following two sentences:

Sentence 1
"Jane walked into the room. John walked in too. Jane said hi to ___"

Sentence 2
"Jane walked into the room. John walked in too. It was late in the
day, and everyone was walking home after a long day at work. Jane
said hi to ___"

In both sentences, given their context, one can tell the answer to both
blank spots is most likely "John". It is important that the RNN predicts
the next word as "John", the second person who has appeared several
time-steps back in both contexts. Ideally, this should be possible given
what we know about RNNs so far. In practice, however, it turns out
RNNs are more likely to correctly predict the blank spot in Sentence
1 than in Sentence 2. This is because during the back-propagation
phase, the contribution of gradient values gradually vanishes as they
propagate to earlier timesteps, as we will show below. Thus, for long
sentences, the probability that "John" would be recognized as the next
word reduces with the size of the context. Below, we discuss the math-
ematical reasoning behind the vanishing gradient problem.
Consider Equations 5 and 6 at a time-step t; to compute the RNN
error, dE/dW, we sum the error at each time-step. That is, dEt /dW for
every time-step, t, is computed and accumulated.

∂E/∂W = ∑_{t=1}^{T} ∂E_t/∂W    (10)

The error for each time-step is computed through applying the


chain rule differentiation to Equations 6 and 5; Equation 11 shows
the corresponding differentiation. Notice dht /dhk refers to the partial

derivative of ht with respect to all previous k time-steps.


∂E_t/∂W = ∑_{k=1}^{t} (∂E_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_k)(∂h_k/∂W)    (11)

Equation 12 shows the relationship to compute each dht /dhk ; this


is simply a chain rule differentiation over all hidden layers within the
[k, t] time interval.
∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1} = ∏_{j=k+1}^{t} W^T × diag[f′(h_{j−1})]    (12)

Because h ∈ R^{D_n}, each ∂h_j/∂h_{j−1} is the Jacobian matrix for h:

∂h_j/∂h_{j−1} = [ ∂h_j/∂h_{j−1,1} · · · ∂h_j/∂h_{j−1,D_n} ]    (13)

i.e. the D_n × D_n matrix whose (a, b) entry is ∂h_{j,a}/∂h_{j−1,b}.

Putting Equations 10, 11, 12 together, we have the following rela-


tionship.
∂E/∂W = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂E_t/∂y_t)(∂y_t/∂h_t) ( ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1} ) (∂h_k/∂W)    (14)

Equation 15 shows the norm of the Jacobian matrix relationship


in Equation 13. Here, βW and β h represent the upper bound values
for the two matrix norms. The norm of the partial gradient at each
time-step, t, is therefore, calculated through the relationship shown in
Equation 15.

‖∂h_j/∂h_{j−1}‖ ≤ ‖W^T‖ ‖diag[f′(h_{j−1})]‖ ≤ β_W β_h    (15)

The norm of both matrices is calculated through taking their L2-norm. The norm of f′(h_{j−1}) can only be as large as 1 given the sigmoid non-linearity function.

‖∂h_t/∂h_k‖ = ‖ ∏_{j=k+1}^{t} ∂h_j/∂h_{j−1} ‖ ≤ (β_W β_h)^{t−k}    (16)

The exponential term ( βW β h )t−k can easily become a very small or


large number when βW β h is much smaller or larger than 1 and t − k
is sufficiently large. Recall a large t − k evaluates the cross entropy
error due to faraway words. The contribution of faraway words to
predicting the next word at time-step t diminishes when the gradient
vanishes early on.

During experimentation, once the gradient value grows extremely


large, it causes an overflow (i.e. NaN) which is easily detectable at
runtime; this issue is called the Gradient Explosion Problem. When the
gradient value goes to zero, however, it can go undetected while dras-
tically reducing the learning quality of the model for far-away words
in the corpus; this issue is called the Vanishing Gradient Problem. Due to
vanishing gradients, we don’t know whether there is no dependency
between steps t and t + n in the data, or we just cannot capture the
true dependency due to this issue.
To gain practical intuition about the vanishing gradient problem,
you may visit the following example website.

2.4 Solution to the Exploding & Vanishing Gradients


Now that we gained intuition about the nature of the vanishing gradi-
ents problem and how it manifests itself in deep neural networks, let
us focus on a simple and practical heuristic to solve these problems.
To solve the problem of exploding gradients, Thomas Mikolov first
introduced a simple heuristic solution that clips gradients to a small
number whenever they explode. That is, whenever they reach a cer-
tain threshold, they are set back to a small number as shown in Algo-
rithm 1.
Algorithm 1: Pseudo-code for norm clipping of the gradients whenever they explode

ĝ ← ∂E/∂W
if ‖ĝ‖ ≥ threshold then
    ĝ ← (threshold / ‖ĝ‖) ĝ
end if
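A minimal NumPy version of Algorithm 1 might look as follows; the threshold value is an arbitrary illustration.

import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g to have norm `threshold` whenever its norm exceeds it."""
    norm = np.linalg.norm(g)
    if norm >= threshold:
        g = (threshold / norm) * g
    return g

g = np.array([30.0, -40.0])   # ||g|| = 50, so it gets rescaled
print(clip_gradient(g))       # [ 3. -4.]  -- norm is now 5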

Figure 7 visualizes the effect of gradient clipping. It shows the decision surface of a small recurrent neural network with respect to its W matrix and its bias terms, b. The model consists of a single unit of recurrent neural network running through a small number of time-steps; the solid arrows illustrate the training progress on each gradient descent step. When the gradient descent model hits the high error wall in the objective function, the gradient is pushed off to a far-away location on the decision surface. The clipping model produces the dashed line where it instead pulls back the error gradient to somewhere close to the original gradient landscape.
Figure 7: Gradient explosion clipping visualization (from "On the difficulty of training Recurrent Neural Networks"): the error surface of a single-hidden-unit recurrent network with high-curvature walls; solid lines depict standard gradient descent trajectories, and dashed arrows show the gradient being rescaled to a fixed size when its norm exceeds a threshold.
To solve the problem of vanishing gradients, we introduce two techniques. The first technique is that instead of initializing W^(hh) randomly, start off from an identity matrix initialization.
The second technique is to use the Rectified Linear Units (ReLU) instead of the sigmoid function. The derivative for the ReLU is either 0 or 1. This way, gradients would flow through the neurons whose derivative is 1 without getting attenuated while propagating back through time-steps.
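A small NumPy sketch of these two heuristics (identity initialization of W^(hh) and a ReLU non-linearity), with arbitrary dimensions:

import numpy as np

Dh, d = 32, 16
W_hh = np.identity(Dh)   # identity initialization instead of random
W_hx = np.random.default_rng(0).standard_normal((Dh, d)) * 0.01

def relu(z):
    return np.maximum(0.0, z)                # derivative is 0 or 1

def rnn_step(h_prev, x_t):
    return relu(W_hh @ h_prev + W_hx @ x_t)  # ReLU instead of sigmoid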

2.5 Deep Bidirectional RNNs


So far, we have focused on RNNs that condition on past words to
predict the next word in the sequence. It is possible to make predic-
tions based on future words by having the RNN model read through
the corpus backwards. Irsoy et al. shows a bi-directional deep neu-
ral network; at each time-step, t, this network maintains two hidden
layers, one for the left-to-right propagation and another for the right-
to-left propagation. To maintain two hidden layers at any time, this
network consumes twice as much memory space for its weight and
bias parameters. The final classification result, ŷ_t, is generated through combining the score results produced by both RNN hidden layers. Figure 8 shows the bi-directional network architecture, and Equations 17 and 18 show the mathematical formulation behind setting up the bi-directional RNN hidden layer. The only difference between these two relationships is in the direction of recursing through the corpus. Equation 19 shows the classification relationship used for predicting the next word via summarizing past and future word representations.
Figure 8: A bi-directional RNN model around a single token; h_t = [→h_t; ←h_t] now represents (summarizes) the past and the future.

→h_t = f(→W x_t + →V →h_{t−1} + →b)    (17)

←h_t = f(←W x_t + ←V ←h_{t+1} + ←b)    (18)

ŷ_t = g(U h_t + c) = g(U [→h_t; ←h_t] + c)    (19)
RNNs can also be multi-layered. Figure 9 shows a multi-layer bi-
directional RNN where each lower layer feeds the next layer. As shown
in this figure, in this network architecture, at time-step t each interme-
diate neuron receives one set of parameters from the previous time-
step (in the same RNN layer), and two sets of parameters from the
previous RNN hidden layer; one input comes from the left-to-right RNN and the other from the right-to-left RNN.
To construct a Deep RNN with L layers, the above relationships are modified to the relationships in Equations 20 and 21, where the input to each intermediate neuron at level i is the output of the RNN at layer i − 1 at the same time-step, t. The output, ŷ, at each time-step is the result of propagating input parameters through all hidden layers (Equation 22).
Figure 9: A deep bi-directional RNN with three RNN layers. Each memory layer passes an intermediate sequential representation to the next.

→h_t^(i) = f(→W^(i) h_t^(i−1) + →V^(i) →h_{t−1}^(i) + →b^(i))    (20)

←h_t^(i) = f(←W^(i) h_t^(i−1) + ←V^(i) ←h_{t+1}^(i) + ←b^(i))    (21)

ŷ_t = g(U h_t + c) = g(U [→h_t^(L); ←h_t^(L)] + c)    (22)

2.6 Application: RNN Translation Model


Traditional translation models are quite complex; they consist of nu-
merous machine learning algorithms applied to different stages of the
language translation pipeline. In this section, we discuss the potential
for adopting RNNs as a replacement to traditional translation mod-
ules. Consider the RNN example model shown in Figure 10; here,
the German phrase Echt dicke Kiste is translated to Awesome sauce. The first three hidden layer time-steps encode the German language words into some language word features (h_3). The last two time-steps decode h_3 into English word outputs. Equation 23 shows the relationship for the Encoder stage and Equations 24 and 25 show the equations for the Decoder stage.
Figure 10: A RNN-based translation model. The first three RNN hidden layers belong to the source language model encoder, and the last two belong to the destination language model decoder.

h_t = φ(h_{t−1}, x_t) = f(W^(hh) h_{t−1} + W^(hx) x_t)    (23)

h_t = φ(h_{t−1}) = f(W^(hh) h_{t−1})    (24)

y_t = softmax(W^(S) h_t)    (25)


One may naively assume this RNN model along with the cross-
entropy function shown in Equation 26 can produce high-accuracy
translation results. In practice, however, several extensions are to be
added to the model to improve its translation accuracy performance.

max_θ (1/N) ∑_{n=1}^{N} log p_θ(y^(n) | x^(n))    (26)

Extension I: train different RNN weights for encoding and decoding.


This decouples the two units and allows for more accuracy prediction
of each of the two RNN modules. This means the φ() functions in
Equations 23 and 24 would have different W (hh) matrices.

Extension II: compute every hidden state in the decoder using three
different inputs:

• The previous hidden state (standard)

• Last hidden layer of the encoder (c = h_T in Figure 11)

• Previous predicted output word, ŷ_{t−1}

Figure 11: Language model with three inputs to each decoder neuron: (h_{t−1}, c, y_{t−1}).

Combining the above three inputs transforms the φ function in the


decoder function of Equation 24 to the one in Equation 27. Figure 11
illustrates this model.

ht = φ(ht−1 , c, yt−1 ) (27)

Extension III: train deep recurrent neural networks using multiple


RNN layers as discussed earlier in this chapter. Deeper layers often
improve prediction accuracy due to their higher learning capacity. Of
course, this implies a large training corpus must be used to train the
model.

Extension IV: train bi-directional encoders to improve accuracy simi-


lar to what was discussed earlier in this chapter.

Extension V: given a word sequence A B C in German whose transla-


tion is X Y in English, instead of training the RNN using A B C → X
Y, train it using C B A → X Y. The intuition behind this technique is
that A is more likely to be translated to X. Thus, given the vanishing
gradient problem discussed earlier, reversing the order of the input
words can help reduce the error rate in generating the output phrase.

3 Gated Recurrent Units

Beyond the extensions discussed so far, RNNs have been found to per-
form better with the use of more complex units for activation. So far,
we have discussed methods that transition from hidden state ht−1 to
ht using an affine transformation and a point-wise nonlinearity. Here,
we discuss the use of a gated activation function thereby modifying
the RNN architecture. What motivates this? Well, although RNNs
can theoretically capture long-term dependencies, they are very hard
to actually train to do this. Gated recurrent units are designed in a
manner to have more persistent memory thereby making it easier for
RNNs to capture long-term dependencies. Let us see mathematically
how a GRU uses ht−1 and xt to generate the next hidden state ht . We
will then dive into the intuition of this architecture.

z_t = σ(W^(z) x_t + U^(z) h_{t−1})    (Update gate)
r_t = σ(W^(r) x_t + U^(r) h_{t−1})    (Reset gate)
h̃_t = tanh(r_t ◦ U h_{t−1} + W x_t)    (New memory)
h_t = (1 − z_t) ◦ h̃_t + z_t ◦ h_{t−1}    (Hidden state)

The above equations can be thought of as a GRU’s four fundamental op-


erational stages and they have intuitive interpretations that make this
model much more intellectually satisfying (see Figure 12):
1. New memory generation: A new memory h̃t is the consolidation of
a new input word xt with the past hidden state ht−1 . Anthropomor-
phically, this stage is the one who knows the recipe of combining a
newly observed word with the past hidden state ht−1 to summarize
this new word in light of the contextual past as the vector h̃t .

2. Reset Gate: The reset signal rt is responsible for determining how


important ht−1 is to the summarization h̃t . The reset gate has the
ability to completely diminish past hidden state if it finds that ht−1
is irrelevant to the computation of the new memory.

3. Update Gate: The update signal zt is responsible for determining


how much of ht−1 should be carried forward to the next state. For
instance, if zt ≈ 1, then ht−1 is almost entirely copied out to ht .
Conversely, if zt ≈ 0, then mostly the new memory h̃t is forwarded
to the next hidden state.

4. Hidden state: The hidden state ht is finally generated using the


past hidden input ht−1 and the new memory generated h̃t with the
advice of the update gate.

Figure 12: The detailed internals of a GRU.

It is important to note that to train a GRU, we need to learn all the different parameters: W, U, W^(r), U^(r), W^(z), U^(z). These follow the same backpropagation procedure we have seen in the past.
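The four GRU equations translate directly into the following NumPy sketch; the shapes are illustrative, and in practice all of these parameter matrices would be learned by backpropagation rather than drawn at random.

import numpy as np

rng = np.random.default_rng(0)
d, Dh = 16, 32
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One (W, U) pair per equation.
W_z, U_z = rng.standard_normal((Dh, d)) * 0.1, rng.standard_normal((Dh, Dh)) * 0.1
W_r, U_r = rng.standard_normal((Dh, d)) * 0.1, rng.standard_normal((Dh, Dh)) * 0.1
W,   U   = rng.standard_normal((Dh, d)) * 0.1, rng.standard_normal((Dh, Dh)) * 0.1

def gru_step(h_prev, x_t):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_tilde = np.tanh(r * (U @ h_prev) + W @ x_t)  # new memory
    return (1 - z) * h_tilde + z * h_prev          # hidden state

h = gru_step(np.zeros(Dh), rng.standard_normal(d))
print(h.shape)   # (32,)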

4 Long-Short-Term-Memories

Long-Short-Term-Memories are another type of complex activation unit


that differ a little from GRUs. The motivation for using these is similar
to those for GRUs however the architecture of such units does differ.
Let us first take a look at the mathematical formulation of LSTM units
before diving into the intuition behind this design:

i_t = σ(W^(i) x_t + U^(i) h_{t−1})    (Input gate)
f_t = σ(W^(f) x_t + U^(f) h_{t−1})    (Forget gate)
o_t = σ(W^(o) x_t + U^(o) h_{t−1})    (Output/Exposure gate)
c̃_t = tanh(W^(c) x_t + U^(c) h_{t−1})    (New memory cell)
c_t = f_t ◦ c_{t−1} + i_t ◦ c̃_t    (Final memory cell)
h_t = o_t ◦ tanh(c_t)

Figure 13: The detailed internals of a LSTM.

We can gain intuition of the structure of an LSTM by thinking of its


architecture as the following stages:

1. New memory generation: This stage is analogous to the new mem-


ory generation stage we saw in GRUs. We essentially use the input
word xt and the past hidden state ht−1 to generate a new memory
c̃t which includes aspects of the new word x (t) .

2. Input Gate: We see that the new memory generation stage doesn’t
check if the new word is even important before generating the new
memory – this is exactly the input gate’s function. The input gate
uses the input word and the past hidden state to determine whether
or not the input is worth preserving and thus is used to gate the new
memory. It thus produces it as an indicator of this information.

3. Forget Gate: This gate is similar to the input gate except that it
does not make a determination of usefulness of the input word –
instead it makes an assessment on whether the past memory cell is
useful for the computation of the current memory cell. Thus, the
forget gate looks at the input word and the past hidden state and
produces f t .

4. Final memory generation: This stage first takes the advice of the
forget gate f t and accordingly forgets the past memory ct−1 . Sim-
ilarly, it takes the advice of the input gate it and accordingly gates
the new memory c̃t . It then sums these two results to produce the
final memory ct .

5. Output/Exposure Gate: This is a gate that does not explicitly exist in GRUs. Its purpose is to separate the final memory from the hidden state. The final memory c_t contains a lot of information that is not necessarily required to be saved in the hidden state. Hidden states are used in every single gate of an LSTM, and thus this gate makes the assessment regarding which parts of the memory c_t need to be exposed/present in the hidden state h_t. The signal it produces to indicate this is o_t, and this is used to gate the point-wise tanh of the memory.
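Similarly, the LSTM equations from the beginning of this section can be transcribed into a short NumPy sketch; the shapes are illustrative, and all W and U matrices would in practice be learned parameters.

import numpy as np

rng = np.random.default_rng(0)
d, Dh = 16, 32
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def init():
    return rng.standard_normal((Dh, d)) * 0.1, rng.standard_normal((Dh, Dh)) * 0.1

(W_i, U_i), (W_f, U_f), (W_o, U_o), (W_c, U_c) = init(), init(), init(), init()

def lstm_step(h_prev, c_prev, x_t):
    i = sigmoid(W_i @ x_t + U_i @ h_prev)        # input gate
    f = sigmoid(W_f @ x_t + U_f @ h_prev)        # forget gate
    o = sigmoid(W_o @ x_t + U_o @ h_prev)        # output/exposure gate
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev)  # new memory cell
    c = f * c_prev + i * c_tilde                 # final memory cell
    h = o * np.tanh(c)                           # hidden state
    return h, c

h, c = lstm_step(np.zeros(Dh), np.zeros(Dh), rng.standard_normal(d))
print(h.shape, c.shape)   # (32,) (32,)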
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part VI
Neural Machine Translation, Seq2seq and Attention
Course Instructors: Christopher Manning, Richard Socher
Authors: Guillaume Genthial, Lucas Liu, Barak Oshri, Kushal Ranjan
Winter 2019

Keyphrases: Seq2Seq and Attention Mechanisms, Neural Machine


Translation, Speech Processing

1 Neural Machine Translation with Seq2Seq

So far in this class, we’ve dealt with problems of predicting a single


output: an NER label for a word, the single most likely next word in
a sentence given the past few, and so on. However, there’s a whole
class of NLP tasks that rely on sequential output, or outputs that are
sequences of potentially varying length. For example,

• Translation: taking a sentence in one language as input and out-


putting the same sentence in another language.

• Conversation: taking a statement or question as input and re-


sponding to it.

• Summarization: taking a large body of text as input and out-


putting a summary of it.

In these notes, we’ll look at sequence-to-sequence models, a deep


learning-based framework for handling these types of problems. This
framework proved to be very effective, and has, in fewer than 3 years,
become the standard for machine translation.

1.1 Brief Note on Historical Approaches


In the past, translation systems were based on probabilistic models
constructed from:

• a translation model, telling us what a sentence/phrase in a source


language most likely translates into

• a language model, telling us how likely a given sentence/phrase is


overall.

These components were used to build translation systems based on


words or phrases. As you might expect, a naive word-based system
would completely fail to capture differences in ordering between
languages (e.g. where negation words go, location of subject vs. verb
in a sentence, etc).

Phrase-based systems were most common prior to Seq2Seq. A


phrase-based translation system can consider inputs and outputs in
terms of sequences of phrases and can handle more complex syntaxes
than word-based systems. However, long-term dependencies are still
difficult to capture in phrase-based systems.
The advantage that Seq2Seq brought to the table, especially with
its use of LSTMs, is that modern translation systems can generate
arbitrary output sequences after seeing the entire input. They can
even focus in on specific parts of the input automatically to help
generate a useful translation.

1.2 Sequence-to-sequence Basics


Sequence-to-sequence, or "Seq2Seq", is a relatively new paradigm,
with its first published usage in 2014 for English-French translation
(Sutskever et al. 2014, "Sequence to Sequence Learning with Neural Networks"). At a high level, a sequence-to-sequence model is an end-to-end model made up of two recurrent neural networks:

• an encoder, which takes the model’s input sequence as input and


encodes it into a fixed-size "context vector", and

• a decoder, which uses the context vector from above as a "seed"


from which to generate an output sequence.

For this reason, Seq2Seq models are often referred to as "encoder-


decoder models." We’ll look at the details of these two networks
separately.

1.3 Seq2Seq architecture - encoder


The encoder network’s job is to read the input sequence to our
Seq2Seq model and generate a fixed-dimensional context vector C
for the sequence. To do so, the encoder will use a recurrent neural
network cell – usually an LSTM – to read the input tokens one at a
time. The final hidden state of the cell will then become C. However,
because it’s so difficult to compress an arbitrary-length sequence into
a single fixed-size vector (especially for difficult tasks like transla-
tion), the encoder will usually consist of stacked LSTMs: a series of
LSTM "layers" where each layer’s outputs are the input sequence to
the next layer. The final layer’s LSTM hidden state will be used as C.
Seq2Seq encoders will often do something strange: they will pro-
cess the input sequence in reverse. This is actually done on purpose.
The idea is that, by doing this, the last thing that the encoder sees will (roughly) correspond to the first thing that the model outputs; this makes it easier for the decoder to "get started" on the output, which then gives the decoder an easier time generating a

proper output sentence. In the context of translation, we’re allowing


the network to translate the first few words of the input as soon as
it sees them; once it has the first few words translated correctly, it’s
much easier to go on to construct a correct sentence than it is to do
so from scratch. See Fig. 1 for an example of what such an encoder
network might look like.

1.4 Seq2Seq architecture - decoder


The decoder is also an LSTM network, but its usage is a little more
complex than the encoder network. Essentially, we’d like to use it
as a language model that’s "aware" of the words that it’s generated
so far and of the input. To that end, we’ll keep the "stacked" LSTM
architecture from the encoder, but we’ll initialize the hidden state of
our first layer with the context vector from above; the decoder will
literally use the context of the input to generate an output.
Once the decoder is set up with its context, we'll pass in a special token to signify the start of output generation; in the literature, this is usually an <EOS> token appended to the end of the input (there's also one at the end of the output). Then, we'll run all three layers of LSTM, one after the other, following up with a softmax on the final layer's output to generate the first output word. Then, we pass that word into the first layer, and repeat the generation. This is how we get the LSTMs to act like a language model. See Fig. 2 for an example of a decoder network.
Figure 1: Example of a Seq2Seq encoder network. This model may be used to translate the English sentence "what is your name?" Note that the input tokens are read in reverse. Note that the network is unrolled; each column is a timestep and each row is a single layer, so that horizontal arrows correspond to hidden states and vertical arrows are LSTM inputs/outputs.
Once we have the output sequence, we use the same learning strat-
egy as usual. We define a loss, the cross entropy on the prediction
sequence, and we minimize it with a gradient descent algorithm and
back-propagation. Both the encoder and decoder are trained at the
same time, so that they both learn the same context vector represen-
tation.

1.5 Recap & Basic NMT Example


Note that there is no connection between the lengths of the input
and output; any length input can be passed in and any length output
can be generated. However, Seq2Seq models are known to lose effec-
tiveness on very long inputs, a consequence of the practical limits of
LSTMs.
To recap, let’s think about what a Seq2Seq model does in order to
translate the English "what is your name" into the French "comment
t’appelles tu". First, we start with 4 one-hot vectors for the input.
These inputs may or may not (for translation, they usually are) em-
bedded into a dense vector representation. Then, a stacked LSTM
network reads the sequence in reverse and encodes it into a context

vector. This context vector is a vector space representation of the no-


tion of asking someone for their name. It’s used to initialize the first
layer of another stacked LSTM. We run one step of each layer of this
network, perform softmax on the last layer’s output, and use that to
select our first output word. This word is fed back into the network
as input, and the rest of the sentence "comment t’appelles tu" is de-
coded in this fashion. During backpropagation, the encoder’s LSTM
weights are updated so that it learns a better vector space representa-
tion for sentences, while the decoder’s LSTM weights are trained to
allow it to generate grammatically correct sentences that are relevant
to the context vector.

1.6 Bidirectional RNNs


Recall from earlier in this class that dependencies in sentences don’t
just work in one direction; a word can have a dependency on another
word before or after it. The formulation of Seq2Seq that we've talked about so far doesn't account for that; at any timestep, we're only considering information (via the LSTM hidden state) from words before the current word. For NMT, we need to be able to effectively encode any input, regardless of dependency directions within that input, so this won't cut it.
Bidirectional RNNs fix this problem by traversing a sequence in both directions and concatenating the resulting outputs (both cell outputs and final hidden states). For every RNN cell, we simply add another cell but feed inputs to it in the opposite direction; the output o_t corresponding to the t'th word is the concatenated vector [o_t^(f); o_t^(b)], where o_t^(f) is the output of the forward-direction RNN on word t and o_t^(b) is the corresponding output from the reverse-direction RNN. Similarly, the final hidden state is h = [h^(f); h^(b)], where h^(f) is the final hidden state of the forward RNN and h^(b) is the final hidden state of the reverse RNN. See Fig. 6 for an example of a bidirectional LSTM encoder.
Figure 2: Example of a Seq2Seq decoder network. This decoder is decoding the context vector for "what is your name" (see Fig. 1) into its French translation, "comment t'appelles tu?" Note the special "GO" token used at the start of generation, and that generation is in the forward direction as opposed to the input which is read in reverse. Note also that the input and output do not need to be the same length.

2 Attention Mechanism

2.1 Motivation

Figure 3: Example of a single-layer bidirectional LSTM encoder network. Note that the input is fed into two different LSTM layers, but in different directions, and the hidden states are concatenated to get the final context vector.

When you hear the sentence "the ball is on the field," you don't assign the same importance to all 6 words. You primarily take note of the words "ball," "on," and "field," since those are the words that are most "important" to you. Similarly, Bahdanau et al. noticed the flaw

in using the final RNN hidden state as the single "context vector" for
sequence-to-sequence models: often, different parts of an input have

different levels of significance. Moreover, different parts of the output


may even consider different parts of the input "important." For exam-
ple, in translation, the first word of output is usually based on the first
few words of the input, but the last word is likely based on the last
few words of input.
Attention mechanisms make use of this observation by providing
the decoder network with a look at the entire input sequence at every
decoding step; the decoder can then decide what input words are
important at any point in time. There are many types of attention mechanisms, but we'll examine the one introduced by Bahdanau et al. 2014, "Neural Machine Translation by Jointly Learning to Align and Translate".
2.2 Bahdanau et al. NMT model
Remember that our seq2seq model is made of two parts, an encoder
that encodes the input sentence, and a decoder that leverages the
information extracted by the encoder to produce the translated sen-
tence. Basically, our input is a sequence of words x1 , . . . , xn that we
want to translate, and our target sentence is a sequence of words
y1 , . . . , y m .

1. Encoder
Let (h1 , . . . , hn ) be the hidden vectors representing the input sen-
tence. These vectors are the output of a bi-LSTM for instance, and
capture contextual representation of each word in the sentence.

2. Decoder
We want to compute the hidden states si of the decoder using a
recursive formula of the form

s i = f ( s i −1 , y i −1 , c i )

where si−1 is the previous hidden vector, yi−1 is the generated


word at the previous step, and c_i is a context vector that captures
the context from the original sentence that is relevant to the time
step i of the decoder.
The context vector ci captures relevant information for the i-th
decoding time step (unlike the standard Seq2Seq in which there’s
only one context vector). For each hidden vector from the original
sentence h j , compute a score

ei,j = a(si−1 , h j )

where a is any function with values in R, for instance a single


layer fully-connected neural network. Then, we end up with a

sequence of scalar values e_{i,1}, . . . , e_{i,n}. Normalize these scores into a vector α_i = (α_{i,1}, . . . , α_{i,n}) using a softmax layer. (The vector α_i is called the attention vector.)

α_{i,j} = exp(e_{i,j}) / ∑_{k=1}^{n} exp(e_{i,k})

Then, compute the context vector c_i as the weighted average of the hidden vectors from the original sentence. (The context vector is extracted thanks to the attention vector and captures the relevant context.)

c_i = ∑_{j=1}^{n} α_{i,j} h_j

Intuitively, this vector captures the relevant contextual information


from the original sentence for the i-th step of the decoder.
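Here is a small NumPy sketch of this attention computation for a single decoder step, using a single-layer fully-connected network as one possible choice of the score function a; the encoder states, decoder state, and all sizes are random stand-ins for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 6, 32, 32, 16   # sentence length and sizes (assumptions)

H = rng.standard_normal((n, d_enc))      # encoder hidden vectors h_1..h_n
s_prev = rng.standard_normal(d_dec)      # decoder state s_{i-1}

# One possible score function a: a single-layer fully-connected network.
W_a = rng.standard_normal((d_att, d_enc + d_dec)) * 0.1
v_a = rng.standard_normal(d_att) * 0.1

e = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h_j])) for h_j in H])

alpha = np.exp(e - e.max())
alpha /= alpha.sum()                     # attention vector alpha_i (softmax)

c = alpha @ H                            # context vector c_i = sum_j alpha_{i,j} h_j
print(alpha.round(2), c.shape)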

2.3 Connection with translation alignment


The attention-based model learns to assign significance to differ-
ent parts of the input for each step of the output. In the context of translation, attention can be thought of as "alignment." Bahdanau et al. argue that the attention scores α_{i,j} at decoding step i signify the words in the source sentence that align with word i in the target. Noting this, we can use attention scores to build an alignment table – a table mapping words in the source to corresponding words in the target sentence – based on the learned encoder and decoder from our Seq2Seq NMT system.
Figure 4: Example of an alignment table.

2.4 Performance on long sentences


The major advantage of attention-based models is their ability to
efficiently translate long sentences. As the size of the input grows,
models that do not use attention will miss information and precision if they only use the final representation. Attention is a clever way to fix this issue, and experiments indeed confirm the intuition.
Figure 5: Performance on long sentences of different NMT models. Image taken from Luong et al.

3 Other Models

3.1 Luong et al. NMT model


We present a variant of this first model, with two different mechanisms of attention, from Luong et al. ("Effective Approaches to Attention-based Neural Machine Translation" by Minh-Thang Luong, Hieu Pham and Christopher D. Manning).

• Global attention We run our vanilla Seq2Seq NMT. We call the hidden states given by the encoder h_1, . . . , h_n, and the hidden states of the decoder h̄_1, . . . , h̄_n. Now, for each h̄_i, we compute an

attention vector over the encoder hidden states. We can use one of the
following scoring functions:


score(h_i, h̄_j) = { h_i^T h̄_j,   h_i^T W h̄_j,   W [h_i, h̄_j] }  ∈ R

Now that we have a vector of scores, we can compute a context


vector in the same way as Bahdanau et al. First, we normalize the
scores via a softmax layer to obtain a vector α_i = (α_{i,1}, . . . , α_{i,n}), where

α_{i,j} = exp(score(h_j, h̄_i)) / ∑_{k=1}^{n} exp(score(h_k, h̄_i))

c_i = ∑_{j=1}^{n} α_{i,j} h_j

and we can use the context vector and the hidden state to compute
a new vector for the i-th time step of the decoder

h̃i = f ([h̄i , ci ])

The final step is to use the h̃i to make the final prediction of the
decoder. To address the issue of coverage, Luong et al. also use
an input-feeding approach. The attentional vectors h̃i are fed as
input to the decoder, instead of the final prediction. This is similar
to Bahdanau et al., who use the context vectors to compute the
hidden vectors of the decoder.

• Local attention: the model predicts an aligned position in the input sequence. Then, it computes a context vector using a window
centered on this position. The computational cost of this atten-
tion step is constant and does not explode with the length of the
sentence.

The main takeaway of this discussion is to show that there are lots of ways of doing attention.

3.2 Google’s new NMT


As a brief aside, Google recently made a major breakthrough for
NMT via their own translation system (Johnson et al. 2016, "Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation"). Rather than maintain a full Seq2Seq model for every pair of languages that they support – each of which would have to be trained individually, which is a tremendous feat in terms of both data and compute time required – they built a single system that can translate between any two languages. This is a Seq2Seq model that accepts as input a sequence of words and a token

specifying what language to translate into. The model uses shared


parameters to translate into any target language.
Figure 6: Example of Google's system.
The new multilingual model not only improved their translation performance, it also enabled "zero-shot translation," in which we can
translate between two languages for which we have no translation train-
ing data. For instance, if we only had examples of Japanese-English
translations and Korean-English translations, Google’s team found
that the multilingual NMT system trained on this data could actu-
ally generate reasonable Japanese-Korean translations. The powerful
implication of this finding is that part of the decoding process is not
language-specific, and the model is in fact maintaining an internal
representation of the input/output sentences independent of the
actual languages involved.

3.3 More advanced papers using attention


• Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention by Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun
Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel
and Yoshua Bengio. This paper learns words/image alignment.

• Modeling Coverage for Neural Machine Translation by Zhaopeng Tu,


Zhengdong Lu, Yang Liu, Xiaohua Liu and Hang Li. Their model
uses a coverage vector that takes into account the attention history
to help future attention.

• Incorporating Structural Alignment Biases into an Attentional Neural


Translation Model by Cohn, Hoang, Vymolova, Yao, Dyer, Haffari.
This paper improves the attention by incorporating other tradi-
tional linguistic ideas.

4 Sequence model decoders

Another approach to machine translation comes from statistical ma-


chine translation. Consider a model that computes the probability
P(s̄|s) of a translation s̄ given the original sentence s. We want to
pick the translation s̄∗ that has the best probability. In other words,
we want

s̄∗ = argmaxs̄ (P(s̄|s))

As the search space can be huge, we need to shrink its size. Here
is a list of sequence model decoders (both good ones and bad ones).

• Exhaustive search : this is the simplest idea. We compute the


probability of every possible sequence, and we choose the sequence

with the highest probability. However, this technique does not


scale at all to large outputs as the search space is exponential in
the size of the input. Decoding in this case is an NP-complete
problem.

• Ancestral sampling : at time step t, we sample xt based on the


conditional probability of the word at step t given the past. In
other words,

x_t ∼ P(x_t | x_1, . . . , x_{t−1})

Theoretically, this technique is efficient and asymptotically ex-


act. However, in practice, it can have low performance and high
variance.

• Greedy Search : At each time step, we pick the most probable


token. In other words

x_t = argmax_{x̃_t} P(x̃_t | x_1, . . . , x_{t−1})

This technique is efficient and natural, however it explores a small


part of the search space and if we make a mistake at one time step,
the rest of the sentence could be heavily impacted.

• Beam search : the idea is to maintain K candidates at each time step,

H_t = {(x_1^1, . . . , x_t^1), . . . , (x_1^K, . . . , x_t^K)}

and compute H_{t+1} by expanding H_t and keeping the best K candidates. In other words, we pick the best K sequences in the following set

H̃_{t+1} = ∪_{k=1}^{K} H̃_{t+1}^k

where

H̃_{t+1}^k = {(x_1^k, . . . , x_t^k, v_1), . . . , (x_1^k, . . . , x_t^k, v_{|V|})}

As we increase K, we gain precision and we are asymptotically


exact. However, the improvement is not monotonic and we can
set a K that combines reasonable performance and computational
efficiency. For this reason, beam search is the most commonly used
technique in NMT.
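To make the beam-search procedure concrete, here is a small pure-Python sketch over a toy next-token distribution; the next_token_log_probs function is a made-up stand-in for a trained decoder, not part of any real system.

import math

def next_token_log_probs(prefix):
    """Made-up stand-in for the decoder's softmax at the next step."""
    probs = {"a": 0.5, "b": 0.3, "</s>": 0.2} if len(prefix) < 3 else \
            {"a": 0.1, "b": 0.1, "</s>": 0.8}
    return {w: math.log(p) for w, p in probs.items()}

def beam_search(K=2, max_len=5):
    beams = [((), 0.0)]                            # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "</s>":          # keep finished hypotheses as-is
                candidates.append((seq, score))
                continue
            for w, lp in next_token_log_probs(seq).items():
                candidates.append((seq + (w,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]  # best K
    return beams

for seq, score in beam_search():
    print(seq, round(score, 3))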

5 Evaluation of Machine Translation Systems

Now that we know the basics about machine translation systems, we


discuss some ways that these models are evaluated. Evaluating the
quality of translations is a notoriously tricky and subjective task. In
real-life, if you give a paragraph of text to ten different translators,
you will get back ten different translations. Translations are imperfect
and noisy in practice. They attend to different information and em-
phasize different meanings. One translation can preserve metaphors
and the integrity of long-ranging ideas, while the other can achieve a
more faithful reconstruction of syntax and style, attempting a word-
to-word translation. Note that this flexibility is not a burden; it is a
testament to the complexity of language and our abilities to decode
and interpret meaning, and is a wonderful aspect of our communica-
tive faculty.
At this point, you should note that there is a difference between
the objective loss function of your model and the evaluation methods
we are going to discuss. Since loss functions are in essence an evalua-
tion of your model prediction, it can be easy to confuse the two ideas.
The evaluation metrics ahead offer a final, summative assessment of
your model against some measurement criterion, and no one mea-
surement is superior to all others, though some have clear advantages
and majority preference.
Evaluating the quality of machine learning translations has be-
come its own entire research area, with many proposals like TER,
METEOR, MaxSim, SEPIA, and RTE-MT. We will focus in these notes
on two baseline evaluation methods and BLEU.

5.1 Human Evaluation


The first and maybe least surprising method is to have people manu-
ally evaluate the correctness, adequacy, and fluency of your system.
Like the Turing Test, if you can fool a human into not being able to
distinguish a human-made translation from your system's translation,
your model passes the test for looking like a real-life sentence! The
obvious problem with this method is that it is costly and inefficient,
though it remains the gold standard for machine translation.

5.2 Evaluation against another task


A common way of evaluating machine learning models that output
a useful representation of some data (a representation being a trans-
lation or summary) is that if your predictions are useful for solving
some challenging task, then the model must be encoding relevant
information in your predictions. For example, you might think of

training your translation predictions on a question-answering task in


the translated language. That is, you use the outputs of your system
as inputs to a model for some other task (the question-answering).
If your second task can perform as well on your predictions as it
can on well-formed data in the translated language, it means that
your inputs have the relevant information or patterns for meeting the
demands of the task.
The issue with this method is that the second task may not be af-
fected by many of the finer points of translation. For example, if you
measured the quality of translation on a query-retrieval task (like
pulling up the right webpage for a search query), you would find
that a translation that preserves the main topic words of the docu-
ments but ignores syntax and grammar might still fit the task well.
But this itself doesn’t mean that the quality of your translations is
accurate or faithful. Therefore, determining the quality of the transla-
tion model is just shifted to determining the quality of the task itself,
which may or may not be a good standard.

5.3 Bilingual Evaluation Understudy (BLEU)


In 2002, IBM researchers developed the Bilingual Evaluation Under-
study (BLEU) that remains, with its many variants to this day, one of
the most respected and reliable methods for machine translation.
The BLEU algorithm evaluates the precision score of a candidate
machine translation against a reference human translation. The ref-
erence human translation is assumed to be a model example of a
translation, and we use n-gram matches as our metric for how similar
a candidate translation is to it. Consider a reference sentence A and
candidate translation B:

A there are many ways to evaluate the quality of a translation, like


comparing the number of n-grams between a candidate transla-
tion and reference.

B the quality of a translation is evaluate of n-grams in a reference


and with translation.

The BLEU score looks for whether n-grams in the machine trans-
lation also appear in the reference translation. Color-coded below are
some examples of different size n-grams that are shared between the
reference and candidate translation.

A: there are many ways to evaluate the quality of a translation, like comparing the number of n-grams between a candidate translation and reference.
B: the quality of a translation is evaluate of n-grams in a reference and with translation.

The BLEU algorithm identifies all such matches of n-grams above,


including the unigram matches, and evaluates the strength of the
match with the precision score. The precision score is the fraction of
n-grams in the translation that also appear in the reference.
The algorithm also imposes two other constraints. For each n-gram
size, a gram in the reference translation cannot be matched more than
once. For example, the unigram "a" appears twice in B but only once
in A; this counts for only one match between the sentences. Addi-
tionally, we impose a brevity penalty so that very small sentences
that would achieve a 1.0 precision (a "perfect" matching) are not
considered good translations. For example, the single word "there"
would achieve a 1.0 precision match, but it is obviously not a good
match.
Let us see how to actually compute the BLEU score. First, let k be
the maximum n-gram length that we want to evaluate our score on. That
is, if k = 4, the BLEU score only counts n-grams with
length less than or equal to 4, and ignores larger n-grams. Let

pn = # matched n-grams / # n-grams in candidate translation

be the precision score for the grams of length n. Finally, let $w_n = 1/2^n$
be a geometric weighting for the precision of the n-th gram. Our
brevity penalty is defined as

$\beta = e^{\min\left(0,\; 1 - \frac{\mathrm{len}_{\mathrm{ref}}}{\mathrm{len}_{\mathrm{MT}}}\right)}$

where $\mathrm{len}_{\mathrm{ref}}$ is the length of the reference translation and $\mathrm{len}_{\mathrm{MT}}$ is the length of the machine translation.
The BLEU score is then defined as

$\mathrm{BLEU} = \beta \prod_{n=1}^{k} p_n^{w_n}$
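To make the computation concrete, here is a small Python sketch of this single-reference BLEU, using the notes' weighting w_n = 1/2^n, run on shortened versions of sentences A and B above. It is only an illustration under those assumptions, not a replacement for standard BLEU implementations.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, k=4):
    # Single-reference BLEU as defined in these notes: clipped n-gram precision,
    # geometric weights w_n = 1/2^n, and the brevity penalty beta.
    cand, ref = candidate.split(), reference.split()
    score = 1.0
    for n in range(1, k + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # each n-gram in the reference can be matched at most once (clipping)
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        p_n = matches / max(sum(cand_counts.values()), 1)
        score *= p_n ** (1.0 / 2 ** n)            # weight w_n = 1/2^n
    beta = math.exp(min(0.0, 1.0 - len(ref) / len(cand)))   # brevity penalty
    return beta * score

reference = "there are many ways to evaluate the quality of a translation"
candidate = "the quality of a translation is evaluate of n-grams in a reference"
print(bleu(candidate, reference, k=2))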

The BLEU score has been reported to correlate well with human
judgment of good translations, and so remains a benchmark for all
evaluation metrics following it. However, it does have many limi-
tations. It only works well on the corpus level because any zeros in
precision scores will zero the entire BLEU score. Additionally, the
BLEU score as presented suffers from comparing a candidate
translation against only a single reference, which is surely a noisy
representation of the relevant n-grams that need to be matched. Variants
of BLEU have modified the algorithm to compare the candidate with
multiple reference examples. Additionally, BLEU scores may only be
a necessary but not sufficient benchmark to pass for a good machine


translation system. Many researchers have optimized BLEU scores
until they have begun to approach the same BLEU scores between
reference translations, but the true quality remains far below human
translations.

6 Dealing with the large output vocabulary

Despite the success of modern NMT systems, they have a hard time
dealing with large vocabularies. Specifically, these Seq2Seq models
predict the next word in the sequence by computing a target probability
distribution over the entire vocabulary using the softmax. The softmax
is expensive to compute for a large vocabulary, since its cost scales
proportionally to the vocabulary size. We will now examine a number of
approaches to address this issue.

6.1 Scaling softmax


A very natural idea is to ask "can we find more efficient ways to com-
pute the target probability distribution?" The answer is yes! In fact,
we've already learned two methods that can reduce the complexity of
the softmax, which we review at a high level below (see details
in lecture note 1).
1. Noise Contrastive Estimation
The idea of NCE is to approximate the softmax by randomly sampling
K words as negative examples. As a result, we reduce the computational
complexity by a factor of |V|/K, where |V| is the vocabulary size. This
method has been proven successful in word2vec. A recent work by
Zoph et al.7 applied this technique to learning LSTM language models,
and they also introduced a trick of using the same samples per
mini-batch to make the training GPU-efficient.
[7: Zoph et al. 2016, Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies]

2. Hierarchical Softmax
Morin et al.8 introduced a binary tree structure to compute the softmax
more efficiently. Each probability in the target distribution is
calculated by taking a path down the tree, which only takes O(log|V|)
steps. Notably, even though Hierarchical Softmax saves computation,
it cannot be easily parallelized to run efficiently on a GPU. This
method is used by Kim et al.9 to train character-based language models,
which will be covered in lecture 13.
[8: Morin et al. 2005, Hierarchical Probabilistic Neural Network Language Model]
[9: Kim et al. 2015, Character-Aware Neural Language Models]

One limitation of both methods is that they only save computation
during the training step (when the target word is known). At test time,
one still has to compute the probability of all words in the vocabulary
in order to make predictions.
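As a rough illustration of this training-versus-test asymmetry, here is a small numpy sketch of negative sampling (the simplified variant of NCE used in word2vec), with made-up dimensions: the training loss touches only K + 1 output vectors, while prediction still requires scoring all |V| words.

import numpy as np

rng = np.random.default_rng(0)
V, d, K = 10_000, 128, 10
W_out = rng.standard_normal((V, d)) * 0.01        # output word vectors (toy values)
h = rng.standard_normal(d)                        # decoder hidden state at one time step

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target = 123
negatives = rng.integers(0, V, size=K)            # in practice drawn from a unigram-based noise distribution

# Training loss uses K + 1 dot products instead of |V|:
loss = -np.log(sigmoid(W_out[target] @ h)) - np.sum(np.log(sigmoid(-W_out[negatives] @ h)))

# At test time we still need the full softmax to pick the most likely word:
scores = W_out @ h                                # |V| dot products
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(loss, probs.argmax())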

6.2 Reducing vocabulary


Instead of optimizing "softmax", one can also try to reduce the ef-
fective vocabulary size which will speed up both training and test
steps. A naive way of doing this is to simply limit the vocabulary size
to a small number and replace words outside the vocabulary with
a tag <UNK>. Now, both training and test time can be significantly
reduced but this is obviously not ideal because we may generate
outputs with lots of <UNK>.
Jean et al.10 proposed a method to maintain a constant vocabulary
size |V'| by partitioning the training data into subsets with τ
unique target words, where τ = |V'|. One subset can be found by
sequentially scanning the original data set until τ unique target words
are detected (Figure 7). This process is iterated over the entire
data set to produce all mini-batch subsets. In practice, we can achieve
about a 10x saving with |V| = 500K and |V'| = 30K, 50K.
[10: Jean et al. 2015, On Using Very Large Target Vocabulary for Neural Machine Translation]
This concept is very similar to NCE in that, for any given word,
the output vocabulary contains the target word and |V'| − 1 negative
(noise) samples. (Figure 7: Training data partition.) However, the main
difference is that these negative samples are sampled from a biased
distribution Q for each subset V', where

$Q(y_t) = \begin{cases} \frac{1}{|V'|} & \text{if } y_t \in V' \\ 0 & \text{otherwise} \end{cases}$

At test time, one can similarly predict the target word out of a selected
subset of the entire vocabulary, called the candidate list. The challenge
is that the correct target word is unknown, and we have to "guess"
what the target word might be. In the paper, the authors proposed
to construct a candidate list for each source sentence using the K most
frequent words (based on unigram probability) and the K' likely target
words for each source word in the sentence. In Figure 8 (Candidate list),
an example is shown with K' = 3, and the candidate list consists of all
the words in purple boxes. In practice, one can choose the following values:
K = 15k, 30k, 50k and K' = 10, 20.
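The following toy sketch illustrates the partitioning idea described above: scan the target sentences in order and start a new subset once the running target vocabulary would exceed τ. The helper name and the exact bookkeeping are illustrative assumptions; the paper's implementation may differ in detail.

def partition_by_target_vocab(target_sentences, tau):
    # Greedily group consecutive sentences so that each partition contains
    # at most tau unique target words.
    partitions, current, vocab = [], [], set()
    for sent in target_sentences:
        words = set(sent.split())
        if current and len(vocab | words) > tau:
            partitions.append((current, vocab))
            current, vocab = [], set()
        current.append(sent)
        vocab |= words
    if current:
        partitions.append((current, vocab))
    return partitions

data = ["we eat fish", "we eat rice", "they drink tea", "they drink water", "fish swim"]
for sents, vocab in partition_by_target_vocab(data, tau=5):
    print(len(vocab), sents)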

6.3 Handling unknown words

When NMT systems use the techniques mentioned above to reduce the
effective vocabulary size, certain words will inevitably get mapped to
<UNK>. For instance, this can happen when the predicted word,
usually a rare word, is outside the candidate list, or when we encounter
unseen words at test time. We need new mechanisms to address the
rare and unknown word problems.
One idea introduced by Gulcehre et al.11 to deal with these problems
is to learn to "copy" from the source text. The model (Figure 9: Pointer
network architecture) applies an attention distribution l_t to decide
where to point in the source text, and uses the decoder hidden state S_t
to predict a binary variable Z_t which decides when to copy from the
source text. The final prediction is either the word y_t^w chosen by the
softmax over the candidate list, as in previous methods, or y_t^l copied
from the source text, depending on the value of Z_t.
[11: Gulcehre et al. 2016, Pointing the Unknown Words]
They showed that this method improves performance in tasks
like machine translation and text summarization.
As one can imagine, there are of course limitations to this method.
It is important to point out a comment on this method from Google's
NMT paper12: "this approach is both unreliable at scale — the attention
mechanism is unstable when the network is deep — and copying may not
always be the best strategy for rare words — sometimes transliteration is
more appropriate".
[12: Wu et al. 2016, Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation]

7 Word and character-based models

As discussed in section 6, "copy" mechanisms are still not sufficient


in dealing with rare or unknown words. Another direction to address
these problems is to operate at sub-word levels. One trend is to use
the same seq2seq architecture but operate on a smaller unit — word
segmentation, character-based models. Another trend is to embrace
hybrid architectures for words and characters.

7.1 Word segmentation


Sennrich et al.13 proposed a method to enable open-vocabulary
translation by representing rare and unknown words as a sequence of
subword units.
[13: Sennrich et al. 2016, Neural Machine Translation of Rare Words with Subword Units]
This is achieved by adapting a compression algorithm called Byte
Pair Encoding. The essential idea is to start with a vocabulary of
characters and keep extending the vocabulary with the most frequent
n-gram pairs in the data set (Figure 10: Byte Pair Encoding). For
instance, in Figure 10, our data set contains 4 words with their
frequencies on the left, i.e. "low" appears 5 times. Denote (p, q, f)
as an n-gram pair p, q with frequency f. In this figure, we have already
selected the most frequent n-gram pair (e, s, 9), and now we are adding
the current most frequent n-gram pair (es, t, 9). This process is
repeated until all n-gram pairs are selected or the vocabulary size
reaches some threshold.
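Here is a compact sketch of the BPE merge loop in the spirit of the reference implementation, run on a toy vocabulary similar to Figure 10 (the </w> end-of-word marker is a common convention, assumed here). On this data the first two merges should be (e, s) and then (es, t), matching the figure.

import re
from collections import Counter

def get_pair_stats(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair (as whole symbols) with the merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(4):
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(best)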
One can choose to either build separate vocabularies for the training
and test sets or build one vocabulary jointly. After the vocabulary
is built, an NMT system with some seq2seq architecture (the paper
used Bahdanau et al.14) can be trained directly on these word segments.
Notably, this method won top places in WMT 2016.
[14: Bahdanau et al. 2014, "Neural Machine Translation by Jointly Learning to Align and Translate"]

7.2 Character-based model


Ling et al.15 proposed a character-based model to enable open-
vocabulary word representation.
[15: Ling et al. 2015, "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation"]
For each word w with m characters, instead of storing a word embedding,
this model iterates over all characters c_1, c_2, ..., c_m to look up
the character embeddings e_1, e_2, ..., e_m. These character embeddings
are then fed into a biLSTM to obtain the final hidden states h_f and h_b
for the forward and backward directions, respectively. The final word
embedding is computed by an affine transformation of the two hidden states:

$e_w = W_f h_f + W_b h_b + b$

There is also a family of CNN character-based models, which will
be covered in lecture 13.
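As a rough PyTorch sketch with made-up sizes, the character-based word representation above can be written as follows; the parameters are untrained and only illustrate the shapes involved.

import torch
import torch.nn as nn

num_chars, char_dim, hidden_dim, word_dim = 100, 16, 32, 50
char_emb = nn.Embedding(num_chars, char_dim)
bilstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)
W_f = nn.Linear(hidden_dim, word_dim, bias=False)
W_b = nn.Linear(hidden_dim, word_dim, bias=True)      # its bias plays the role of b

def word_embedding(char_ids):
    e = char_emb(char_ids.unsqueeze(0))                # (1, m, char_dim)
    _, (h_n, _) = bilstm(e)                            # h_n: (2, 1, hidden_dim), final fwd/bwd states
    h_f, h_b = h_n[0, 0], h_n[1, 0]
    return W_f(h_f) + W_b(h_b)                         # e_w = W_f h_f + W_b h_b + b

chars = torch.tensor([5, 17, 42, 8])                   # character ids of a hypothetical rare word
print(word_embedding(chars).shape)                     # torch.Size([50])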

7.3 Hybrid NMT


Luong et al.16 proposed a Hybrid Word-Character model to deal
with unknown words and achieve open-vocabulary NMT.
[16: Luong et al. 2016, Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models]
The system
translates mostly at word-level and consults the character compo-
nents for rare words. On a high level, the character-level recurrent
neural networks compute source word representations and recover
unknown target words when needed. The twofold advantage of such
a hybrid approach is that it is much faster and easier to train than
character-based ones; at the same time, it never produces unknown
words as in the case of word-based models.
Word-based Translation as a Backbone The core of the hybrid
NMT is a deep LSTM encoder-decoder that translates at the word
level. We maintain a vocabulary of size |V | per language and use
<unk> to represent out of vocabulary words.
Source Character-based Representation In regular word-based
NMT, a universal embedding for <unk> is used to represent all out-
of-vocabulary words. This is problematic because it discards valuable
information about the source words. Instead, we learn a deep LSTM
model over characters of rare words, and use the final hidden state of
the LSTM as the representation for the rare word (Figure 11: Hybrid NMT).
Target Character-level Generation General word-based NMT
allows generation of <unk> in the target output. Instead, the goal
here is to create a coherent framework that handles an unlimited
output vocabulary. The solution is to have a separate deep LSTM that
"translates" at the character level given the current word-level state.


Note that the current word context is used to initialize the character-
level encoder. The system is trained such that whenever the word-
level NMT produces an <unk>, the character-level decoder is asked to
recover the correct surface form of the unknown target word.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part VII: Question Answering
Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Richard Socher
Winter 2019

1 Dynamic Memory Networks for Question Answering over Text


and Images

The idea of a QA system is to extract information (sometimes passages,
or spans of words) directly from documents, conversations, online
searches, etc., that will meet the user's information needs. Rather than
making the user read through an entire document, a QA system prefers to
give a short and concise answer. Nowadays, a QA system can be combined
very easily with other NLP systems like chatbots, and some QA systems
even go beyond the search of text documents and can extract information
from a collection of pictures.
There are many types of questions, and the simplest of them is
factoid question answering. It contains questions that look like “The
symbol for mercuric oxide is?” “Which NFL team represented the
AFC at Super Bowl 50?”. There are of course other types such as
mathematical questions (“2+3=?”), logical questions that require
extensive reasoning (and no background information). However, we
can argue that the information-seeking factoid questions are the most
common questions in people’s daily life.
In fact, most NLP problems can be considered a question-answering
problem; the paradigm is simple: we issue a query, and the machine
provides a response. By reading through a document, or a set of
instructions, an intelligent system should be able to answer a wide
variety of questions. We can ask for the POS tags of a sentence, or we
can ask the system to respond in a different language. So naturally,
we would like to design a model that can be used for general QA.
In order to achieve this goal, we face two major obstacles. The first is
that many NLP tasks use different architectures, such as TreeLSTM (Tai
et al., 2015) for sentiment analysis, Memory Networks (Weston et al.,
2015) for question answering, and Bi-directional LSTM-CRF (Huang et al.,
2015) for part-of-speech tagging. The second is that full multi-task
learning tends to be very difficult, and transfer learning remains a
major obstacle for current neural network architectures across
artificial intelligence domains (computer vision, reinforcement
learning, etc.).
We can tackle the first problem with a shared architecture for NLP:
the Dynamic Memory Network (DMN), an architecture designed for
general QA tasks. QA is difficult, partially because reading a long
paragraph is difficult; even humans are not able to store a long
document in working memory.

1.1 Input Module


Figure 1: A graphical illustration of the
Dynamic Memory Network.
The Dynamic Memory Network is divided into modules. First we look
at the input module. The input module takes as input a sequence of
T_I words and outputs a sequence of T_C fact representations. If the
output is a list of words, we have T_C = T_I, and if the output is a
list of sentences, we have T_C as the number of sentences and T_I as the
number of words in the sentences. We use a simple GRU to read the
sentences in, i.e. the hidden state $h_t = \mathrm{GRU}(x_t, h_{t-1})$,
where $x_t = L[w_t]$, L is the embedding matrix, and w_t is the word at
time t. We further improve it by using a Bi-GRU, as shown in Figure 2.

1.2 Question Module


We also use a standard GRU to read in the question (using embedding
matrix L: $q_t = \mathrm{GRU}(L[w_t^Q], q_{t-1})$), but the output of
the question module is an encoded representation of the question.
(Figure 2: A graphical illustration of the Dynamic Memory Network.)

1.3 Episodic Memory Module


One of the distinctive features of the dynamic memory network is
the episodic memory module which runs over the input sequence
multiple times, each time paying attention to a different subset of
facts from the input.
It accomplishes this using a Bi-GRU that takes input of the sentence-
level representation passed in from the input module, and produces
an episodic memory representation.
We denote the episodic memory representation as m^i and the episode
representation (output by the attention mechanism) as e^i. The episodic
memory representation is initialized using m^0 = q and proceeds using
the GRU: $m^i = \mathrm{GRU}(e^i, m^{i-1})$. The episode representation
is updated using the hidden state outputs from the input module as
follows, where g is the attention mechanism:

$h_t^i = g_t^i\, \mathrm{GRU}(c_t, h_{t-1}^i) + (1 - g_t^i)\, h_{t-1}^i$

$e^i = h_{T_C}^i$

The attention vector g may be computed in a number of ways,


but in the original DMN paper (Kumar et al. 2016), the following
formulation was found to work best:

$g_t^i = G(c_t, m^{i-1}, q)$

$G(c, m, q) = \sigma\left(W^{(2)} \tanh\left(W^{(1)} z(c, m, q) + b^{(1)}\right) + b^{(2)}\right)$

$z(c, m, q) = \left[c,\ m,\ q,\ c \circ q,\ c \circ m,\ |c - q|,\ |c - m|,\ c^{T} W^{(b)} q,\ c^{T} W^{(b)} m\right]$

In this way, gates in this module are activated if the sentence is
relevant to the question or memory. In the i-th pass, if the summary is
not sufficient to answer the question, we can repeat the sweep over the
input in the (i + 1)-th pass. For example, consider the question "Where
is the football?" and input sequences “John kicked the football" and
“John was in the field." In this example, John and football could be
linked in one pass and then John and field could be linked in the sec-
ond pass, allowing for the network to perform a transitive inference
based on the two pieces of information.
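Here is a minimal numpy sketch of one episodic pass, with made-up dimensions and toy scalar gates standing in for the full attention network G; it only illustrates how the gate interpolates between updating the hidden state and copying the previous one.

import numpy as np

d = 8
rng = np.random.default_rng(0)
Wz, Uz, Wr, Ur, Wh, Uh = (rng.standard_normal((d, d)) * 0.1 for _ in range(6))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)                  # update gate
    r = sigmoid(Wr @ x + Ur @ h)                  # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def episode(facts, gates):
    # One pass over the fact representations c_1..c_TC with attention gates g_1..g_TC.
    h = np.zeros(d)
    for c, g in zip(facts, gates):
        h = g * gru_step(c, h) + (1 - g) * h      # skip the update when the gate is near 0
    return h                                       # e^i = h_{T_C}

facts = [rng.standard_normal(d) for _ in range(5)]
gates = [0.05, 0.9, 0.1, 0.8, 0.02]               # toy attention values standing in for G(c_t, m, q)
print(episode(facts, gates).shape)                 # (8,)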

1.4 Answer Module


The answer module is a simple GRU decoder that takes in the outputs
of the question module and the episodic memory module, and outputs a
word (or, in general, a computational result). It works as follows:

$y_t = \mathrm{softmax}(W^{(a)} a_t)$
$a_t = \mathrm{GRU}([y_{t-1}, q], a_{t-1})$

1.5 Experiments
Through the experiments we can see that the DMN is able to outperform
MemNN on the bAbI question answering tasks, and it can outperform
other architectures for sentiment analysis and part-of-speech tagging.
How many episodes are needed in the episodic memory? The answer
is that the harder the task is, the more passes are required. Multiple
passes also allow the network to truly comprehend the sentence by
paying attention only to the parts relevant for the final task, instead
of reacting to just the information from the word embeddings.
The key idea is to modularize the system: you can allow different
types of input by changing the input module. For example, if we
replace the input module with a convolutional neural network-based
module, then this architecture can handle a task called visual question
answering (VQA). It is also able to outperform other models on
this task.

1.6 Summary
The zeal to search for a general architecture that would solve all
problems has slightly faded since 2015, but the desire to train on one
domain and generalize to other domains has increased. To compre-
hend more advanced modules for question answering, readers can
refer to the dynamic coattention network (DCN).
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part VIII: Convolutional Neural Networks
Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Richard Socher
Winter 2019

1 CNNs (Convolutional Neural Networks)

1.1 Why CNNs?


Convolutional Neural Networks take in a sentence of word vectors
and first create phrase vectors for all subphrases, not just grammat-
ically correct phrases (as with the Recursive Neural Network, addressed
in the next set of notes). CNNs then combine them for the task
at hand.

1.2 What is Convolution?


Let’s start with the 1D case. Consider two 1D vectors, f and g with
f being our primary vector and g corresponding to the filter. The
convolution between f and g, evaluated at entry n, is written $(f * g)[n]$ and is equal to $\sum_{m=-M}^{M} f[n-m]\, g[m]$.
Figure 1 shows the 2D convolution case. The 9 × 9 green matrix
represents the primary matrix of concern, f . The 3 × 3 matrix of red
numbers represents the filter g and the convolution currently being
evaluated is at position [2, 2]. Figure 1 shows the value of the convo-
lution at position [2, 2] as 4 in the second table. Can you complete the
second table?
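A small numpy sketch of the 1D formula above: the manual loop implements $(f*g)[n] = \sum_m f[n-m]\,g[m]$ with zero padding at the boundaries, and should agree with numpy's built-in convolution (up to the same boundary convention).

import numpy as np

def conv1d(f, g):
    # g is a filter of odd length 2M+1; the array stores g[-M], ..., g[M] in order.
    M = len(g) // 2
    out = np.zeros(len(f))
    for n in range(len(f)):
        for m in range(-M, M + 1):
            if 0 <= n - m < len(f):               # zero padding outside the signal
                out[n] += f[n - m] * g[m + M]
    return out

f = np.array([1., 2., 3., 4., 5.])
g = np.array([1., 0., -1.])                        # a simple difference filter (M = 1)
print(conv1d(f, g))
print(np.convolve(f, g, mode='same'))              # same values with centered boundary handling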

1.3 A Single-Layer CNN


Consider word-vectors x_i ∈ R^k and the concatenated word-vectors of
an n-word sentence, x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n. (Figure 1:
Convolution in the 2D case.) Finally, consider a convolutional filter
w ∈ R^{hk}, i.e. over h words. For k = 2, n = 5 and h = 3, Figure 2
shows the single-layer convolutional layer for NLP. We will get a
single value for each possible window of three consecutive words in the
sentence, "the country of my birth". Note that the filter w is itself a
vector, and we will have $c_i = f(w^T x_{i:i+h-1} + b)$ to give
$c = [c_1, c_2, \dots, c_{n-h+1}] \in \mathbb{R}^{n-h+1}$. For the last two time-steps, i.e.
starting with the words "my" or "birth", we don’t have enough word-
vectors to multiply with the filter (since h = 3). If we necessarily
need the convolutions associated with the last two word-vectors, a
common trick is to pad the sentence with h − 1 zero-vectors at its
right-hand-side as in Figure 3.
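To make the single-layer case concrete, here is a toy numpy sketch with k = 2, n = 5, h = 3, where random values stand in for the word vectors and the filter.

import numpy as np

np.random.seed(0)
k, n, h = 2, 5, 3                                  # embedding dim, sentence length, filter width
X = np.random.randn(n, k)                          # toy word vectors for "the country of my birth"
w = np.random.randn(h * k)                         # one convolutional filter over h words
b = 0.1

# c_i = f(w^T x_{i:i+h-1} + b), here with f = ReLU
c = np.array([max(0.0, w @ X[i:i + h].reshape(-1) + b) for i in range(n - h + 1)])
print(c.shape)                                     # (n - h + 1,) = (3,)

c_hat = c.max()                                    # max-pooling (next section) gives a single number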

1.4 Pooling
Assuming that we don't use zero-padding, we will get a final
convolutional output c which has n − h + 1 numbers. (Figure 2:
Single-Layer Convolution: one step. Figure 3: Single-Layer Convolution:
all steps.) Typically, we want to take the output of the CNN and feed it
as input to further layers like a feedforward neural network or a RecNN.
But all of those need a fixed-length input, while our CNN output has a
length that depends on the length of the sentence, n. One clever way to
fix this problem is to use max-pooling. The output of the CNN,
c ∈ R^{n−h+1}, is the input to the max-pooling layer. The output of the
max-pooling layer is ĉ = max{c}, thus ĉ ∈ R.
We could also have used min-pooling because typically we use
ReLU as our non-linear activation function and ReLU is bounded on
the low side to 0. Hence a min-pool layer might get smothered by
ReLU, so we nearly always use max-pooling over min-pooling.

1.5 Multiple-Filters
In the example above related to Figure 2, we had h = 3, meaning each
filter looked at tri-grams with a single specific combination method,
i.e. one filter. We can use multiple filters of the same width because
each filter will learn to recognize a different kind of tri-gram. Even
more generally, we are not restricted to a single filter width: we can
also have filters over bi-grams, quad-grams and even longer spans. Each
filter has an associated max-pool layer. Thus, our final output from the
CNN layers will be a vector having length equal to the number of filters.

1.6 Multiple-Channels
If we allow gradients to flow into the word-vectors being used here,
then the word-vectors might change significantly over training. This
is desirable, as it specializes the word-vectors to the specific task at
hand (away from say GloVe initialization). But, what about words
that appear only in the test set but not in the train set? While other
semantically related word vectors which appear in the train set will
have moved significantly from their starting point, such words will
still be at their initialization point. The neural network will be spe-
cialized for inputs which have been updated. Hence, we will get low
performance on sentences with such words (words that are in test but
not in train).
One work-around is to maintain two sets of word-vectors, one
’static’ (no gradient flow into them) and one ’dynamic’, which are
updated via SGD. Both are initially the same (GloVe or other initial-
ization). Both sets are simultaneously used as input to the neural


network. Thus, the initialized word-vectors will always play a role in
the training of the neural network, giving unseen words present in
the test set a better chance of being interpreted correctly.
There are several ways of handling these two channels; the most
common is to simply average them before using them in the CNN. The
other method is to double the length of the CNN filters.

1.7 CNN Options


1. Narrow vs Wide
Refer to Figure 4. Another way to ask this is should we not (nar-
row) or should we (wide) zero-pad? If we use Narrow Convolu-
tion, we compute a convolution only in those positions where all
the components of a filter have a matching input component. This
will clearly not be the case at the start and end boundaries of the
input, as in the left side network in Figure 4. If we use Wide Con-
volution, we have an output component corresponding to each
alignment of the convolution filter. For this, we will have to pad
the input at the start and the end with h − 1 zeros.
In the Narrow Convolution case, the output length will be n − h +
1 and in the Wide Convolution case, the length will be n + h − 1.
(Figure 4: Narrow and Wide Convolution, from Kalchbrenner et al. (2014))

2. k-max pooling
This is a generalization of the max pooling layer. Instead of pick-
ing out only the biggest (max) value from its input, the k-max
pooling layer picks out the k biggest values. Setting k = 1 gives the
max pooling layer we saw earlier.
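A small sketch of k-max pooling over a toy output vector c; keeping the selected values in their original order is a common convention, and k = 1 recovers ordinary max-pooling.

import numpy as np

def k_max_pool(c, k):
    # Return the k largest values of c, kept in their original sequence order.
    idx = np.sort(np.argsort(c)[-k:])
    return c[idx]

c = np.array([0.2, 1.5, -0.3, 0.9, 0.7])
print(k_max_pool(c, 1))   # [1.5]            -> ordinary max-pooling
print(k_max_pool(c, 3))   # [1.5, 0.9, 0.7]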
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part IX: Recursive Neural Networks and Constituency Parsing
Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Richard Socher
Winter 2019

Note: "RNN" in this set of notes refers to Recursive Neural Networks,


not Recurrent Neural Networks. The former is a superset of the latter.

1 Recursive Neural Networks

In these notes, we introduce and discuss a new type of model that is a
superset of the previously discussed Recurrent Neural Network.
(Figure 1: A standard Recursive Neural Network.)
Recursive Neural Networks (RNNs) are perfect for settings that have
nested hierarchy and an intrinsic recursive structure. Well if we think
about a sentence, doesn’t this have such a structure? Take the sen-
tence ”A small crowd quietly enters the historical church”. First,
we break apart the sentence into its respective Noun Phrase, Verb
Phrase, ”A small crowd” and ”quietly enters the historical church”,
respectively. But within that verb phrase there is another verb phrase
and noun phrase: "quietly enters" and "historical church", and so on.
This does seem pretty recursive.
The syntactic rules of language are highly recursive. So we take
advantage of that recursive structure with a model that respects it!
Another added benefit of modeling sentences with RNNs is that we can
now handle input sentences of arbitrary length. This was previously a
major headache for neural networks in NLP, requiring very clever tricks
to make the input representation of a sentence have a fixed size even
though sentence lengths differ (see Bengio et al., 2003; Henderson,
2003; Collobert & Weston, 2008).
Let's imagine our task is to take a sentence and represent it as a
vector in the same semantic space as the words themselves (Figure 2:
Parse tree of a sentence), so that phrases like "I went to the mall
yesterday", "We went shopping last week", and "They went to the store"
would all be pretty close in distance to each other. Well, we have seen
ways to train unigram word vectors; should we do the same for bigrams,
trigrams, etc.? This very well may work, but there are two major issues
with this thinking. 1) There is literally an infinite number of possible
combinations of words; storing and training an infinite number of
vectors would just be absurd. 2) Some combinations of words, while they
might be completely reasonable to hear in language, may never be
represented in our training/dev corpus, so we would never learn them.
We need a way to take a sentence and its respective word vectors, and
derive what the embedded vector should be. Now let's first ask a very
debated question: is it naive to believe that the vector space we use to
represent all words is sufficiently expressive to also represent all
sentences of any length? While this may be unintuitive, the performance
of these models suggests that this is actually a reasonable thing to do.
Let’s first discuss the difference between semantic and grammat-
ical understanding of a sentence. Semantic analysis is an under-
standing of the meaning of a sentence, being able to represent the
phrase as a vector in a structured semantic space, where similar sen-
tences are very nearby, and unrelated sentences are very far away.
The grammatical understanding is one where we have identified the
underlying grammatical structure of the sentence, which part of the
sentence depends on which other part, what words are modifying
what other words, etc. The output of such an understanding is usu-
ally represented as a parse tree as displayed in Figure 2.
Now for the million dollar question. If we want to know the se-
mantic representation, is it an advantage, nay, required, to have
a grammatical understanding? Well some might disagree but for
now we will treat this semantic composition task the following way.
First, we need to understand words. Then, we need to know the way
words are put together. Finally, we can get to the meaning of a
phrase or sentence by leveraging these two previous concepts.
So let's begin with our first model built on this principle. Imagine we
were given a sentence and we knew the parse tree for that sentence, such
as the one displayed in Figure 2. Could we figure out an encoding for
the sentence, and perhaps also a sentiment score, just from the word
vectors in the sentence? We observe how a simple RNN can perform this
task.

1.1 A simple single layer RNN


Let's walk through the model displayed in Figure 3 (an example standard
RNN applied to the parsed sentence "I love this assignment"). We first
take a sentence parse tree and the sentence word vectors and begin to
walk up the tree. The lowest node in the graph is Node 3, so we
concatenate L_{29} and L_{430} to form a vector ∈ R^{2d} and feed it
into our network to compute:

$h^{(1)} = \tanh\left(W^{(1)} \begin{bmatrix} L_{29} \\ L_{430} \end{bmatrix} + b^{(1)}\right) \quad (1)$

Since W^{(1)} ∈ R^{d×2d} and b^{(1)} ∈ R^d, we have h^{(1)} ∈ R^d. We
can now think of h^{(1)} as a point in the same word vector space for
the bigram "this assignment", for which we did not need to learn a
representation separately, but rather derived it from its constituent
word vectors.
We now take h(1) and put it through a softmax layer to get a score
over a set of sentiment classes, a discrete set of known classes that
represent some meaning. In the case of positive/negative sentiment
analysis, we would have 5 classes, class 0 implies strongly negative,
class 1 implies negative, class 2 is neutral, class 3 is positive, and
finally class 4 is strongly positive.
Now we do the same thing with the ”I” and ”love” to produce
the vector h(1) for the phrase ”I love”. Again, we compute a score
over the semantic classes again for that phrase. Finally, for the most
interesting step, we need to merge the two phrases ”I love” and ”this
assignment”. Here we are concatenating word phrases, rather than
word vectors! We do this in the same manner, concatenating the two
h^{(1)} vectors and compute

$h^{(1)} = \tanh\left(W^{(1)} \begin{bmatrix} h^{(1)}_{Left} \\ h^{(1)}_{Right} \end{bmatrix} + b^{(1)}\right) \quad (2)$

Now we have a vector in the word vector space that represents


the full sentence ”I love this assignment”. Furthermore, we can put
this h(1) through the same softmax layer as before, and compute
sentiment probabilities for the full sentence. Of course the model will
only do this reliably once trained.
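Here is a minimal numpy sketch of this bottom-up composition (equations 1 and 2) on the "I love this assignment" tree, with toy dimensions and random, untrained parameters standing in for the learned W, b, and word vectors.

import numpy as np

d = 4
rng = np.random.default_rng(0)
W, b = rng.standard_normal((d, 2 * d)), rng.standard_normal(d)
L = {w: rng.standard_normal(d) for w in ["I", "love", "this", "assignment"]}

def compose(left, right):
    # h = tanh(W [left; right] + b): merge two children into a parent vector.
    return np.tanh(W @ np.concatenate([left, right]) + b)

h_right = compose(L["this"], L["assignment"])   # "this assignment"
h_left  = compose(L["I"], L["love"])            # "I love"
h_root  = compose(h_left, h_right)              # "I love this assignment"
print(h_root.shape)                              # (4,), same space as the word vectors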
Now let's take a step back. Is it naive to think we can use the same
matrix W to combine all words together and get a very expressive
h^{(1)}, and yet again use that same matrix W to combine all phrase
vectors to get even deeper phrases? These criticisms are valid, and we
can address them in the following twist on the simple RNN.

1.2 Syntactically Untied SU-RNN


As we discussed in the criticisms of the previous section, using the
same W to bring together a Noun Phrase and a Verb Phrase and to bring
together a Prepositional Phrase and another word vector seems
intuitively wrong. Maybe we are bluntly merging all of these
functionalities into too weak a model. (Figure 4: Using different W's
for different categories of inputs is more natural than having just one
W for all categories.)
What we can do to remedy this shortcoming is to ”syntactically
untie” the weights of these different tasks. By this we mean, there
is no reason to expect the optimal W for one category of inputs to
be at all related to the optimal W for another category of inputs. So
we let these W's be different and relax this constraint. While this
certainly increases the number of weight matrices we have to learn, the
performance boost we gain is non-trivial.
As in Figure 4 above, we notice our model is now conditioned upon what
the syntactic categories of the inputs are. Note, we determine what the
categories are via a very simple Probabilistic Context Free Grammar
(PCFG) which is more or less learned by computing summary statistics
over the Penn Treebank to learn rules such as "The" is always a DT, etc.
No deeper understanding of this part is really necessary; just know it's
really simple.
The only major other difference in this model is that we initialize
the W’s to the identity. This way the default thing to do is to average
the two word vectors coming in. Slowly but surely, the model learns
which vector is more important and also any rotation or scaling of
the vectors that improve performance. We observe in Figure 5 that
the trained weight matrices learn actual meaning! For example, the
DT-NP rule or Determiner followed by a Noun Phrase such as ”The
cat” or ”A man”, puts more emphasis on the Noun Phrase than on
the Determiner. (This is obvious because the right diagonals are
red, meaning higher weights). This is the notion of soft head words,
something linguists have long observed to be true; however, the model
learned this on its own just by looking at data. Pretty cool!
(Figure 5: The learned W weights for DT-NP composition match linguists'
theory.) The SU-RNN does indeed outperform previously discussed models,
but perhaps it is still not expressive enough. If we think of modifying
words, such as adverbs like "very", any interpolation between this word
vector and the following one is definitely not what the understood
nature of "very" is.
As an adverb, its literal definition is "used for emphasis". How can we
have a vector that emphasizes any other vector that is to follow when we
are solely performing a linear interpolation? How can we construct a
vector that will "scale" any other vector this way? The truth is we
cannot. We need some form of multiplication of one word on another word.
We uncover two such compositions below that enable this. The first
utilizes word matrices, and the other utilizes a quadratic equation over
the typical affine transformation.

1.3 MV-RNN’s (Matrix-Vector Recursive Neural Networks)


We now augment our word representation to include not only a word
vector, but also a word matrix! (Figure 6: An example MV-RNN.) So the
word "very" will have a word vector v_very ∈ R^d but also a word matrix
V_very ∈ R^{d×d}. This gives us the expressive ability not only to embed
what a word means, but also to learn the way that words "modify" other
words; the word matrix enables the latter. To feed two words, a and b,
into the RNN, we take their word matrices A and B and form our input
vector x as the concatenation of the vectors Ab and Ba. For our example
of "very", V_very could just be the identity times any scalar above one,
which would scale any neighboring word vector by that number! This is
the type of expres-
sive ability we desired. While the new word representation explodes


our feature space, we can express much better the way words modify
each other.
By observing the errors the model makes, we see even the MV-
RNN still can not express certain relations. We observe three major
classes of mistakes.
First, Negated Positives (Figure 7). When we say something positive but
one word turns it negative, the model cannot weigh that one word
strongly enough to flip the sentiment of the entire sentence. Figure 7
shows such an example where swapping the word "most" for "least" should
flip the entire sentiment of the sentence, but the MV-RNN does not
capture this successfully.
The second class of mistakes is the Negated Negative case (Figure 8),
where we say something is not bad, or not dull. The MV-RNN cannot
recognize that the word "not" lessens the sentiment from negative to
neutral.
The final class of errors we observe is the ”X but Y conjunction”
displayed in Figure 9. Here the X might be negative BUT if the Y is
positive then the model’s sentiment output for the sentence should be
positive! MV-RNNs struggle with this.
Thus, we must look for an even more expressive composition algorithm
that will be able to fully capture these types of high-level
compositions. (Figure 9: A Recursive Neural Net can correctly classify
the sentiment of the contrastive conjunction "X but Y", but the MV-RNN
cannot.)

1.4 RNTNs (Recursive Neural Tensor Network)


The final RNN we will cover here is by far the most successful on
the three types of errors we left off with. The Recursive Neural
Tensor Network does away with the notion of a word matrix, and
furthermore, does away with the traditional affine transformation
pre-tanh/sigmoid concept. To compose two word vectors or phrase
vectors, we again concatenate them to form a vector ∈ R2d but in-
stead of putting it through an affine function then a nonlinear, we put
it through a quadratic first, then a nonlinear, such as:

$h^{(1)} = \tanh(x^T V x + W x) \quad (3)$

Note that V is a 3rd-order tensor in $\mathbb{R}^{2d \times 2d \times d}$.
We compute $x^T V[i]\, x$ for every slice $i \in [1, 2, \dots, d]$ of the
tensor, which outputs a vector in $\mathbb{R}^d$. (Figure 10: One slice
of an RNTN; note there would be d of these slices.) We then add Wx and
put the result through a nonlinear function. The quadratic term shows
that we can indeed allow for the multiplicative type of interaction
between the word vectors without needing to maintain and learn word
matrices!
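A minimal numpy sketch of the composition in equation (3), with made-up dimensions and random parameters; the einsum computes the d slice-wise quadratic forms x^T V[i] x.

import numpy as np

d = 4
rng = np.random.default_rng(1)
V = rng.standard_normal((2 * d, 2 * d, d))         # 3rd-order tensor, slices indexed by the output dim
W = rng.standard_normal((d, 2 * d))

left, right = rng.standard_normal(d), rng.standard_normal(d)
x = np.concatenate([left, right])                  # x in R^{2d}

quadratic = np.einsum('i,ijk,j->k', x, V, x)       # k-th entry is x^T V[:, :, k] x
h = np.tanh(quadratic + W @ x)                     # h in R^d
print(h.shape)                                     # (4,)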
As we see in Figure 11, the RNTN is the only model that is capable of
succeeding on these very hard datasets. (Figure 11: Comparing
performance on the Negated Positive and Negated Negative data sets.)
The Convolutional Neural Network (CNN) discussed in the previ-


ous set of notes outperforms the RNTN in some aspects and does not
require an input parse tree!

2 Constituency Parsing

Natural Language Understanding requires being able to extract meaning
from large text units by understanding their smaller parts. This
requires being able to understand how the smaller parts are put
together. There are two main techniques to analyze the
syntactic structure of sentences: constituency parsing and depen-
dency parsing. Dependency parsing was covered in the previous
lectures (see lecture notes 4). By building binary asymmetric relations
between a word and its dependents, the structure shows which word
depends on which other words. Now we focus on constituency parsing,
which organizes words into nested constituents.
Constituency parsing is a way to break a piece of text (e.g. one
sentence) into sub-phrases. One of the goals of constituency parsing
(also known as "phrase structure parsing") is to identify the
constituents in the text, which is useful when extracting information
from text. By knowing the constituents after parsing the sentence, it is
possible to generate similar sentences that are syntactically correct.

2.1 Constituent
In syntactic analysis, a constituent can be a single word or a phrase
that acts as a single unit within a hierarchical structure. A phrase is a se-
quence of two or more words built around a head lexical item and
working as a unit within a sentence. To be a phrase, a group of words
should come together to play a specific role in the sentence. In ad-
dition, the group of words can be moved together or replaced as a
whole, and the sentence should remain fluent and grammatical.
For instance, the following sentence contains the noun phrase
"wonderful CS224N". (We interpret large text units by semantic
composition of smaller elements; these smaller elements can be changed
while keeping a similar meaning, as in the following examples.)

• I want to be enrolled in the wonderful CS224N!
We can rewrite the sentence by moving that whole phrase to the
front as below.

• The wonderful CS224N I want to be enrolled in!

Or the phrase could be replaced with a constituent of similar
function and meaning, like "great CS course in Stanford about NLP
and Deep Learning".
• I want to be enrolled in the great CS course in Stanford about NLP


and Deep Learning!

For constituency parsing, the basic clause structure is understood as a
binary division of the clause into subject (noun phrase NP) and
predicate (verb phrase VP), expressed as the following rule. The binary
division of the clause results in a one-to-one-or-more correspondence.
For each element in a sentence, there are one or more nodes in the
tree structure.

• S -> NP VP

In fact, the process of parsing illustrates certain similar rules. We


deduce the rules by beginning with the sentence symbol S, and ap-
plying the phrase structure rules successively, finally applying re-
placement rules to substitute actual words for the abstract symbols.
Based on the extracted rules, it is possible to generate similar sen-
tences. If the rules are correct, then any sentence produced in this
way should be syntactically correct. However, the generated sen-
tences might be syntactically correct but semantically nonsensical,
such as the following well-known example:

• Colorless green ideas sleep furiously
(Figure 12: Constituency parse tree for 'Colorless green ideas sleep furiously')

2.2 Constituency Parse Tree


Interestingly, in natural language, the constituents are likely to be
nested inside one another. Thus a natural representation of these
phrases is a tree. Usually we use a constituency parse tree to display
the parsing process. The constituency-based parse trees of constituency
grammars distinguish between terminal and non-terminal nodes.
Non-terminals in the tree are labeled as types of phrases (e.g. Noun
Phrase); the terminals are the exact words in the sentence. Taking 'John
hits the ball' as an example, the syntactic structure of the English
sentence is shown below (Figure 13: Constituency parse tree for 'John
hits the ball'). We have a parse tree starting from the root S, which
represents the whole sentence, and ending in each leaf node, which
represents a word in the sentence. We use the following abbreviations:

• S stands for sentence, the top-level structure.

• NP stands for noun phrase including the subject of the sentence


and the object of the sentence.

• VP stands for verb phrase, which serves as the predicate.

• V stands for verb.


• D stands for determiner, such as the definite article "the"

• N stands for noun

Note: Back to sentence representation, what if you did NOT know


the parse tree? The RNNs we have observed in this set of lecture
notes depend on such an initial parsing. What if we are using Recur-
rent Networks and the context of the phrase is on its right side? If we
only applied Softmax on the last time-step, then the last few words
would have a disproportionately large impact on the output, e.g. Sen-
timent Classification. Convolutional Neural Networks address this
issue, as covered in the previous set of notes.
Practical Tips for Final Projects Notes

February 2019

Contents

1 Introduction
2 Choosing a Project Topic
   2.1 Expectations
   2.2 Finding existing research
   2.3 Finding datasets and tasks
   2.4 Obtaining access to datasets via Stanford
   2.5 Collecting your own data
3 Project Advice
   3.1 Define your goals
   3.2 Processing data
   3.3 Data hygiene
   3.4 Build strong baselines
   3.5 Training and debugging neural models
   3.6 Evaluation

1 Introduction
These notes complement the information provided in Lecture 9 of CS224n 2019,
Practical Tips for Final Projects. In particular this document contains lots of
links to useful resources. This document does not contain a detailed specifica-
tion for the project deliverables (proposal, milestone, poster and report) – those
specifications will be released separately.

Section 2 contains information on how to choose and formulate a project topic.


It is mostly of use to Custom Project teams, though some Default Project teams
may find parts of it useful too.

Section 3 contains information useful for all teams.

This document may be updated a few times during Week 5.

2 Choosing a Project Topic
Project Suitability You can choose any topic related to Deep Learning for
NLP. That means your project should make substantive use of deep learning
and substantive use of human language data.

Project types Here is a (non-exhaustive!) list of possible project types:


1. Applying an existing neural model to a new task
2. Implementing a complex neural architecture

3. Proposing a new neural model (or a new variation of an existing model)


4. Proposing a new training, optimization, or evaluation scheme
5. Experimental and/or theoretical analysis of neural models

Project ideas from Stanford researchers We have collected a list of


project ideas from members of the Stanford AI Lab.1 These are a great op-
portunity to work on an interesting research problem with an external mentor.
If you want to do these, get started early!

2.1 Expectations
Your project should aim to provide some kind of scientific knowledge gain,
similar to typical NLP research papers (see Section 2.2 on where to find them).
A typical case is that your project will show that your proposed method provides
good results on an NLP task you’re dealing with.
Given that you only have a few weeks to work on your project, it is not
necessary that your method beats the state-of-the-art performance, or works
better than previous methods. But it should at least show performance broadly
expected of the kinds of methods you’re using.
In any case, your paper should try to provide reasoning explaining the be-
haviour of your model. You will need to provide some qualitative analysis,
which will be useful for supporting or testing your explanations. This will be
particularly important when your method is not working as well as expected.
Ultimately, your project will be graded holistically, taking into account many
criteria: originality, performance of your methods, complexity of the techniques
you used, thoroughness of your evaluation, amount of work put into the project,
analysis quality, writeup quality, etc.
1 https://docs.google.com/document/d/1Ytncuq6tpiSGHsJBkdzskMf0nw4_x2AJ1rZ7RvpOv5E/edit?usp=sharing

2.2 Finding existing research
Generally, it’s much easier to define your project (see Section 3.1) if there is
existing published research using the same or similar task, dataset, approaches,
and/or evaluation metrics. Identifying existing relevant research (and even ex-
isting code) will ultimately save you time, as it will provide a blueprint of how
you might sensibly approach the project. There are many ways to find relevant
research papers:
• You could browse recent publications at any of the top venues where
NLP and/or Deep Learning research is published: ACL, EMNLP, TACL,
NAACL, EACL, NIPS, ICLR, ICML (Not exhaustive!)
• In particular, publications at many NLP venues are indexed at
http://www.aclweb.org/anthology/
• Try a keyword search at:
– http://arxiv.org/
– http://scholar.google.com
– http://dl.acm.org/
– http://aclasb.dfki.de/
• Look at publications from the Stanford NLP group
https://nlp.stanford.edu/pubs/

2.3 Finding datasets and tasks


There are lots of publicly-available datasets on the web. Here are some useful
resources to find datasets:
• A repository to track progress in NLP, including listings for the datasets
and the current state-of-the-art performance for the most common NLP
tasks.
https://nlpprogress.com/
• A small list of well-known standard datasets for common NLP tasks:
https://machinelearningmastery.com/datasets-natural-language-processing/
• An alphabetical list of free or public domain text datasets:
https://github.com/niderhoff/nlp-datasets
• Wikipedia has a list of machine learning text datasets, tabulated with
useful information such as dataset size.
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_
research#Text_data
• Kaggle has many datasets, though some of them are too small for Deep
Learning. Try searching for ‘nlp’.
https://www.kaggle.com/datasets

• Datahub has lots of datasets, though not all of it is Machine Learning
focused.
https://datahub.io/collections
• Microsoft Research has a collection of datasets (look under the ‘Dataset
directory’ tab):
http://research.microsoft.com/en-US/projects/data-science-initiative/
datasets.aspx
• A script to search arXiv papers for a keyword, and extract important
information such as performance metrics on a task.
https://huyenchip.com/2018/10/04/sotawhat.html
• A collection of links to more collections of links to datasets!
http://kevinchai.net/datasets
• A collection of papers with code on many NLP tasks.
https://paperswithcode.com/sota
• Datasets for machine translation.
http://statmt.org
• Syntactic corpora for many languages.
https://universaldependencies.org

2.4 Obtaining access to datasets via Stanford


Stanford has purchased access to a lot of natural language data:
• The Linguistic Data Consortium (LDC) is a very large repository of data
from many languages, including lots of annotated data. Stanford has
purchased many of the datasets available there.
http://www.ldc.upenn.edu/
• List of datasets that Stanford has access to (not exhaustive):
https://linguistics.stanford.edu/resources/corpora/corpus-inventory
To obtain access to data, please follow the instructions in this link (make sure
to CC the CS224n staff when you email):
https://linguistics.stanford.edu/resources/corpora/accessing-corpora.

2.5 Collecting your own data


It is possible to collect your own data for your project. However, data collec-
tion is often a time-consuming and messy process that is more difficult than it
appears. Given the limited timeframe, we generally don’t recommend collecting
your own data. If you really do want to collect your own data, make sure to
budget the data collection time into your project. Remember, your project must
have a substantial Deep Learning component, so if you spend all your time on
data collection and none on building neural models, we can’t give you a good
grade.

3 Project Advice
3.1 Define your goals
At the very beginning of your project, it’s important to clearly define your
goals in your mind and make sure everyone in your team understands them. In
particular:
• Clearly define the task. What’s the input and what’s the output? Can
you give an example? If the task can’t be framed as input and output,
what exactly are you trying to achieve?

• What dataset(s) will you use? Is that dataset already organized into the
input and output sections described above? If not, what’s your plan to
obtain the data in the format that you need?
• What is your evaluation metric (or metrics)? This needs to be a well-
defined, numerical metric (e.g. ROUGE score), not a vague idea (e.g.
‘summary quality’). See section 3.6 for more detail on how to evaluate
your methods.
• What does success look like for your project? For your chosen evaluation
metrics, what numbers represent expected performance, based on previous
research? If you’re doing an analysis or theoretical project, define your
hypothesis and figure out how your experiments will confirm or negate
your hypothesis.

3.2 Processing data


You may need to do (additional) processing of your data (e.g. tokenization,
tagging, or parsing). Here are some tools that may be useful:
• StanfordNLP (new!): a Python library providing tokenization, tagging,
parsing, and other capabilities. Covers 53 languages.
https://stanfordnlp.github.io/stanfordnlp/

• Other software from the Stanford NLP group:


http://nlp.stanford.edu/software/index.shtml
• NLTK, a lightweight Natural Language Toolkit package in Python:
http://nltk.org/

• spaCy, another Python package that can do preprocessing, but also in-
cludes neural models (e.g. Language Models):
https://spacy.io/

3.3 Data hygiene
At the beginning of your project, split your data set into training data (most of
your data), development data (also known as validation data) and test data.
A typical train/dev/test split might be 90/5/5 percent (assigned randomly).
Many NLP datasets come with predefined splits, and if you want to compare
against existing work on the same dataset, you should use the same split as used
in that work. Here is how you should use these data splits in your project:
1. Training data: Use this (and only this data!) to optimize the parameters
of your neural model.

2. Development data: This has two main uses. The first is to compare the
performance of your different models (or versions of the same model) by
computing the evaluation metric on the development data. This enables
you to choose the best hyperparameters and/or architectural choices that
should be evaluated on the test data. The second important usage of
development data is to decide when to stop training your model. Two
simple and common methods for deciding when to stop training are:
(a) Every epoch (or every N training iterations, where N is predefined),
record performance of the current model on the development set and
store the current model as a checkpoint. If development performance
is worse than at the previous evaluation (alternatively, if it fails
to beat best performance M times in a row, where M is predefined),
stop training and keep the best checkpoint.
(b) Train for E epochs (where E is some predefined number) and, after
each epoch, record performance of the current model on the develop-
ment set and store the current model as a checkpoint. Once the E
epochs are finished, stop training and keep the best checkpoint.
3. Test data: At the end of your project, evaluate your best trained model(s)
on the test data to compute your final performance metric. To be scien-
tifically honest, you should only use the training and development data to
select which models to evaluate on the test set.
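
As a concrete illustration of the random 90/5/5 split described above, here is a
minimal Python sketch (the examples list is a stand-in for however your data is
actually stored; if your dataset ships with a predefined split, use that instead):

import random

examples = list(range(1000))      # stand-in for your real list of examples
random.seed(224)                  # fix the seed so the split is reproducible
random.shuffle(examples)

n = len(examples)
n_train = int(0.90 * n)
n_dev = int(0.05 * n)

train = examples[:n_train]
dev = examples[n_train:n_train + n_dev]
test = examples[n_train + n_dev:]
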

The reason we use data splits is to avoid overfitting. If you simply selected
the model that does best on your training set, then you wouldn’t know how
well your model would perform on new samples of data – you’d be overfitting
to the training set. In NLP, powerful neural models are particularly prone to
overfitting to their training data, so this is especially important.
Similarly, if you look at the test set before you’ve chosen your final archi-
tecture and hyperparameters, that might impact your decisions and lead your
project to overfit to the test data. Thus, in the interest of science, it is extremely
important that you don’t touch the test set until the very end of your project.
This will ensure that the quantitative performance that you report will be an
honest unbiased estimate of how your method will do on new samples of data.

It’s even possible to overfit to the development set. If we train many differ-
ent model variants, and only keep the hyperparameters and architectures that
perform best on the development set, then we may be overfitting to the devel-
opment set. To fix this, you can even use two separate development sets (one of
them called the tuning set, the other one the development set). The tuning
set is used for optimizing hyperparameters, the development set for measuring
overall progress. If you optimize your hyperparameters a lot and do many it-
erations, you may even want to create multiple distinct development sets (dev,
dev2, ...) to avoid overfitting.
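
To make method (a) above concrete, here is a minimal Python sketch of the
patience-based stopping rule. The helpers train_one_epoch, evaluate and
save_checkpoint are hypothetical placeholders for whatever your own training
code provides:

def train_with_early_stopping(model, dev_data, train_one_epoch, evaluate,
                              save_checkpoint, max_epochs=50, patience=3):
    # Stop once dev performance fails to beat the best score `patience` times in a row.
    best_dev, bad_epochs = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        dev_score = evaluate(model, dev_data)
        if dev_score > best_dev:
            best_dev, bad_epochs = dev_score, 0
            save_checkpoint(model)        # keep the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_dev
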

3.4 Build strong baselines


A baseline is a simpler method to compare your more complex neural system
against. Baselines are important so that we can understand the performance of
our systems in context.
For example, suppose you’re building a multilayer LSTM-based network with
attention to do binary sentiment analysis (classification of sentences as positive
or negative). The simplest baseline is the guessing baseline, which would achieve
50% accuracy (assuming the dataset is 50% positive and 50% negative). A more
complex baseline would be a simple non-neural Machine Learning algorithm,
such as a Naive Bayes classifier. You could also have simple neural baselines
– for example, encoding the sentence using an average of word embeddings.
Lastly, you should compare against simpler versions of your full model, such as
a vanilla RNN version, a single-layer version, or a version without attention.
These last few options are also called ablation experiments. An ablation ex-
periment removes some part of the full model and measures the performance –
this is useful to quantify how much different parts of the network help perfor-
mance. Ablation experiments are an excellent way to analyze your model.
Building strong baselines is very important. Too often, researchers and
practitioners fall into the trap of making baselines that are too weak, or failing
to define any baselines at all. In this case, we cannot tell whether our complex
neural systems are adding any value at all. Sometimes, strong baselines perform
much better than you expected, and this is important to know.
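
As a concrete example of the average-of-word-embeddings baseline mentioned above,
here is a minimal sketch. It assumes you already have an embeddings dict mapping
words to numpy vectors and lists of (tokens, label) pairs; the scikit-learn
classifier is just one convenient choice, not a prescribed one:

import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vector(tokens, embeddings, dim):
    # Average the embeddings of the words we have vectors for.
    vecs = [embeddings[w] for w in tokens if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def avg_embedding_baseline(train_pairs, dev_pairs, embeddings, dim=100):
    X_train = np.stack([sentence_vector(t, embeddings, dim) for t, _ in train_pairs])
    y_train = [label for _, label in train_pairs]
    X_dev = np.stack([sentence_vector(t, embeddings, dim) for t, _ in dev_pairs])
    y_dev = [label for _, label in dev_pairs]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_dev, y_dev)    # dev accuracy of the baseline
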

3.5 Training and debugging neural models


Unfortunately, neural networks are notoriously difficult to debug. However, here
are some tips:
• To debug a neural model, train on a small toy dataset (e.g., a small fraction
of training data, or a hand-created toy dataset) to sanity-check and diag-
nose bugs. For example, if your model is unable to overfit (e.g. achieve
near-zero training loss) on this toy dataset, then you probably have a bug
in your implementation (see the sketch at the end of this section).
• Hyperparameters (e.g. learning rate, number of layers, dropout rate, etc.)
often impact results significantly. Use performance on the development set

to tune these parameters (see Section 3.3). Though you probably won’t
have time for a very exhaustive hyperparameter search, try to identify the
most sensitive/important hyperparameters and tune those.
• Due to their power, neural networks overfit easily. Regularization (dropout,
weight decay) and stopping criteria based on the development set (e.g.
early stopping, see Section 3.3) are extremely important for ensuring your
model will do well on new unseen data samples.
• A more complicated neural model can ‘fail silently’: It may get decent
performance by relying on the simple parts, when the really interesting
components (e.g. a cool attention mechanism) fail due to a bug. Use
ablation experiments (see Section 3.4) to diagnose whether different parts
of the model are adding value.
• During training, randomize the order of training samples, or at least re-
move obvious ordering (e.g., don’t train your Seq2Seq system on data by
going through the corpus in the order of sentence length – instead, make
sure the lengths of the sentences in subsequent minibatches are uncorre-
lated). SGD relies on the assumption that the samples come in random
order.
• There are many online resources containing practical advice for building
neural network models. Note, most of this type of advice is based more
on personal experience than rigorous theory, and these ideas evolve over
time – so take it with a grain of salt! After all, your project could open
new perspectives by disproving some commonly-held belief.
– A Twitter thread on most common neural net mistakes (June 2018):
https://twitter.com/karpathy/status/1013244313327681536
– Deep Learning for NLP Best Practices blog post (July 2017):
http://ruder.io/deep-learning-nlp-best-practices/
– Practical Advice for Building Deep Neural Networks (October 2017):
https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/
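
Here is the sanity-check sketch referred to in the first tip above. The helpers
compute_loss and optimizer_step are hypothetical placeholders for your own
training code; only the shape of the check matters:

def overfit_sanity_check(model, tiny_batch, compute_loss, optimizer_step, n_steps=500):
    # Repeatedly fit the same tiny batch; the training loss should approach zero.
    for step in range(n_steps):
        loss = compute_loss(model, tiny_batch)
        optimizer_step(model, loss)
        if step % 100 == 0:
            print("step", step, "loss", float(loss))
    # If the final loss is far from zero, suspect a bug in the model,
    # the loss computation, or the optimization step.
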

3.6 Evaluation
In your project, carrying out meaningful evaluation is as important as designing
and building your neural models. Meaningful evaluation means that you should
carefully compare the performance of your methods using appropriate evaluation
metrics.

Choosing evaluation metrics You must have at least one evaluation metric
(which should be a numerical metric that can be automatically computed) to
measure the performance of your methods. If there is existing published work
on the same dataset and/or task, you should use the same metric(s) as that
work (though you can evaluate on additional metrics if you think it’s useful).
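
For instance, if your task is machine translation and previous work reports
corpus-level BLEU, a minimal sketch using NLTK looks like the following (NLTK is
just one of several libraries that implement BLEU, and the token lists here are
made-up examples):

from nltk.translate.bleu_score import corpus_bleu

# one list of (tokenized) reference translations per hypothesis
references = [[["the", "cat", "sat", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "sat", "on", "a", "mat"]]
print(corpus_bleu(references, hypotheses))
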

Human evaluation Human evaluation is often necessary in research areas
that lack good, comprehensive automatic evaluation metrics (e.g. some natural
language generation tasks). If you want to use human judgment as an evaluation
metric, you are welcome to do so (though you may find it difficult to find the time
and/or funding to collect many human evaluations). Collecting a small number
of human judgments could be a valuable addition to your project, but you must
have at least one automatic evaluation metric – even if it is an imperfect metric.

What to compare You should use your evaluation metric(s) to (a) com-
pare your model against previous work, (b) compare your model against your
baselines (see Section 3.4), and (c) compare different versions of your model.
When comparing against previous work, make sure to get the details right – for
example, did the previous work compute the BLEU metric in a case-sensitive
or case-insensitive way? If you calculate your evaluation metric differently to
previous work, the numbers are not comparable!

Qualitative evaluation So far this section has discussed quantitative evalu-
ation – numerical performance measures. Qualitative evaluation, or analysis,
seeks to understand your system (how it works, when it succeeds and when it
fails) by measuring or inspecting key characteristics or outputs of your model.
You will be expected to include some qualitative evaluation in your final report.
Here are some types of qualitative evaluation:

• A simple kind of qualitative evaluation is to include some examples (e.g.
input and model output) in your report. However, don’t just provide ran-
dom examples without comment – find interesting examples that support
your paper’s overall arguments, and comment on them.2

• Error analysis (as seen in some of the assignments) is another important
type of qualitative evaluation. Try to identify categories of errors.
• Break down the performance metric by some criteria. For example, if you
think a translation model is especially bad at translating long sentences,
show that by plotting the BLEU score as a function of source sentence
length.
• Compare the performance of two systems beyond the single evaluation
metric number. For example, what examples does your model get right
that the baseline gets wrong, and vice versa? Can these examples be
characterized by some quality? If so, substantiate that claim by measuring
or plotting the quality.
2 It can be useful to provide a true random sample of model outputs, especially for e.g.
natural language generation papers. This gives readers a true, non-cherry-picked qualitative
overview of your model’s output. If you wish to do this, make it clear that the selection is
random, and comment on it.

• If your model uses attention, you can create a plot or visualization of the
attention distribution to see what the model attended to on particular
examples.
If your method is successful, qualitative evaluation is important to understand
the reason behind the numbers, and identify areas for improvement. If your
method is unsuccessful, qualitative evaluation is even more important to under-
stand what went wrong.

Review of differential calculus theory
Author: Guillaume Genthial

Winter 2017

Keywords: Differential, Gradients, partial derivatives, Jacobian, chain-rule

This note is optional and is aimed at students who wish to have
a deeper understanding of differential calculus. It defines and explains
the links between derivatives, gradients, jacobians, etc. First,
we go through definitions and examples for f : R^n → R. Then we
introduce the Jacobian and generalize to higher dimension. Finally,
we introduce the chain-rule.

1 Introduction

We use derivatives all the time, but we forget what they mean. In
general, we have in mind that for a function f : R → R, we have
something like

    f(x + h) − f(x) ≈ f′(x)h

Some people use different notation, especially when dealing with
higher dimensions, and there usually is a lot of confusion between
the following notations:

    f′(x)
    df/dx
    ∂f/∂x
    ∇_x f

However, these notations refer to different mathematical objects,
and the confusion can lead to mistakes. This paper recalls some
notions about these objects.

Side note on scalar-product and dot-product: given two vectors a and b,
the scalar-product is ⟨a|b⟩ = ∑_{i=1}^n a_i b_i and the dot-product is
a^T · b = ⟨a|b⟩ = ∑_{i=1}^n a_i b_i.

2 Theory for f : R^n → R

2.1 Differential

Formal definition
Let's consider a function f : R^n → R defined on R^n with the scalar
product ⟨·|·⟩. We suppose that this function is differentiable, which
means that for x ∈ R^n (fixed) and a small variation h (which can change)
we can write

    f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)    (1)

and d_x f : R^n → R is a linear form, which means that ∀x, y ∈ R^n
we have d_x f(x + y) = d_x f(x) + d_x f(y). The linear form d_x f is called
the differential of f in x; it is the best linear approximation of the
function f around x. Here o_{h→0}(h) (Landau notation) is equivalent to
the existence of a function ε(h) such that lim_{h→0} ε(h) = 0.

Example
Let f : R^2 → R such that f(x_1, x_2) = 3x_1 + x_2^2. Let's pick
(a, b) ∈ R^2 and h = (h_1, h_2) ∈ R^2. We have

    f(a + h_1, b + h_2) = 3(a + h_1) + (b + h_2)^2
                        = 3a + 3h_1 + b^2 + 2bh_2 + h_2^2
                        = f(a, b) + 3h_1 + 2bh_2 + o(h)

(since h_2^2 = h_2 · h_2 = o_{h→0}(h)). Then d_{(a,b)} f(h_1, h_2) = 3h_1 + 2bh_2.

2.2 Link with the gradients

Formal definition
It can be shown that for all linear forms a : R^n → R, there exists a
vector u_a ∈ R^n such that ∀h ∈ R^n

    a(h) = ⟨u_a | h⟩

(this is the Riesz representation theorem: the dual of a vector space E
is isomorphic to E). In particular, for the differential d_x f, we can find
a vector u ∈ R^n such that

    d_x f(h) = ⟨u | h⟩

We can thus define the gradient of f in x as

    ∇_x f := u

For x ∈ R^n, the gradient is usually written ∇_x f ∈ R^n; it has the same
shape as x. Note that the gradient and the differential of a function are
conceptually very different: the gradient is a vector, while the differential
is a function. Then, as a conclusion, we can rewrite equation (1)

    f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)         (2)
             = f(x) + ⟨∇_x f | h⟩ + o_{h→0}(h)       (3)

Example
Same example as before, f : R^2 → R such that f(x_1, x_2) = 3x_1 + x_2^2.
We showed that

    d_{(a,b)} f(h_1, h_2) = 3h_1 + 2bh_2

We can rewrite this as

    d_{(a,b)} f(h_1, h_2) = ⟨(3, 2b)^T | (h_1, h_2)^T⟩

and thus our gradient is

    ∇_{(a,b)} f = (3, 2b)^T

2.3 Partial derivatives

Formal definition
Now, let's consider an orthonormal basis (e_1, . . . , e_n) of R^n. Let's
define the partial derivative

    ∂f/∂x_i (x) := lim_{h→0} [ f(x_1, . . . , x_{i−1}, x_i + h, x_{i+1}, . . . , x_n) − f(x_1, . . . , x_n) ] / h

Note that the partial derivative ∂f/∂x_i (x) ∈ R and that it is defined
with respect to the i-th component and evaluated in x. Partial derivatives
are usually written ∂f/∂x, but you may also see ∂_x f or f′_x. Here
∂f/∂x_i is a function R^n → R, while
∂f/∂x (x) = (∂f/∂x_1 (x), . . . , ∂f/∂x_n (x))^T ∈ R^n, so ∂f/∂x is a
function R^n → R^n. Depending on the context, most people omit writing the
(x) evaluation and just write ∂f/∂x ∈ R^n instead of ∂f/∂x (x).

Example
Same example as before, f : R^2 → R such that f(x_1, x_2) = 3x_1 + x_2^2.
Let's write

    ∂f/∂x_1 (a, b) = lim_{h→0} [ f(a + h, b) − f(a, b) ] / h
                   = lim_{h→0} [ 3(a + h) + b^2 − (3a + b^2) ] / h
                   = lim_{h→0} 3h / h
                   = 3

In a similar way, we find that

    ∂f/∂x_2 (a, b) = 2b

2.4 Link with the partial derivatives

Formal definition
It can be shown that

    ∇_x f = ∑_{i=1}^n ∂f/∂x_i (x) e_i
          = (∂f/∂x_1 (x), . . . , ∂f/∂x_n (x))^T

where ∂f/∂x_i (x) denotes the partial derivative of f with respect to the
i-th component, evaluated in x, and the e_i form an orthonormal basis
(for instance, in the canonical basis, e_i = (0, . . . , 1, . . . , 0) with 1
at index i). That's why we usually write

    ∇_x f = ∂f/∂x (x)    (same shape as x)

Example
We showed that

    ∂f/∂x_1 (a, b) = 3
    ∂f/∂x_2 (a, b) = 2b

and that

    ∇_{(a,b)} f = (3, 2b)^T

and then we verify that

    ∇_{(a,b)} f = (∂f/∂x_1 (a, b), ∂f/∂x_2 (a, b))^T

3 Summary

Formal definition
For a function f : R^n → R, we have defined the following objects,
which can be summarized in the following equation (recall that
a^T · b = ⟨a|b⟩ = ∑_{i=1}^n a_i b_i):

    f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)                                    (differential)
             = f(x) + ⟨∇_x f | h⟩ + o_{h→0}(h)                                  (gradient)
             = f(x) + ⟨∂f/∂x (x) | h⟩ + o_{h→0}(h)
             = f(x) + ⟨(∂f/∂x_1 (x), . . . , ∂f/∂x_n (x))^T | h⟩ + o_{h→0}(h)    (partial derivatives)

Remark
Let's consider x : R → R such that x(u) = u for all u. Then we can
easily check that d_u x(h) = h. As this differential does not depend on
u, we may simply write dx; the dx that we use refers to the differential
of u → u, the identity mapping. That's why the following expression has
some meaning,

    d_x f(·) = ∂f/∂x (x) dx(·)

because

    d_x f(h) = ∂f/∂x (x) dx(h)
             = ∂f/∂x (x) h

In higher dimension, we write

    d_x f = ∑_{i=1}^n ∂f/∂x_i (x) dx_i

4 Jacobian: Generalization to f : R^n → R^m

For a function

    f : (x_1, . . . , x_n) → (f_1(x_1, . . . , x_n), . . . , f_m(x_1, . . . , x_n))

we can apply the previous section to each f_i(x):

    f_i(x + h) = f_i(x) + d_x f_i(h) + o_{h→0}(h)
               = f_i(x) + ⟨∇_x f_i | h⟩ + o_{h→0}(h)
               = f_i(x) + ⟨∂f_i/∂x (x) | h⟩ + o_{h→0}(h)
               = f_i(x) + ⟨(∂f_i/∂x_1 (x), . . . , ∂f_i/∂x_n (x))^T | h⟩ + o_{h→0}(h)

Putting all this in the same vector yields

    f(x + h) = f(x) + (∂f_1/∂x (x)^T · h, . . . , ∂f_m/∂x (x)^T · h)^T + o(h)

Now, let's define the Jacobian matrix as

    J(x) := [ ∂f_1/∂x_1 (x)  . . .  ∂f_1/∂x_n (x) ]
            [      . . .                          ]
            [ ∂f_m/∂x_1 (x)  . . .  ∂f_m/∂x_n (x) ]

i.e. the i-th row of J(x) is ∂f_i/∂x (x)^T. The Jacobian matrix has
dimensions m × n and is a generalization of the gradient. Then, we have that

    f(x + h) = f(x) + J(x) · h + o(h)

Example 1: m = 1
Let's take our first function f : R^2 → R such that f(x_1, x_2) = 3x_1 + x_2^2.
In the case where m = 1, the Jacobian is a row vector
(∂f_1/∂x_1 (x) . . . ∂f_1/∂x_n (x)). Remember that our gradient was defined
as a column vector with the same elements; we thus have that J(x) = ∇_x f^T.
Then, the Jacobian of f is

    J(x) = (∂f/∂x_1 (x)  ∂f/∂x_2 (x))
         = (3  2x_2)
         = ∇f(x)^T

Example 2: g : R^3 → R^2
Let's define

    g(y_1, y_2, y_3) = (y_1 + 2y_2 + 3y_3,  y_1 y_2 y_3)

Then, the Jacobian of g is

    J_g(y) = [ ∂(y_1 + 2y_2 + 3y_3)/∂y (y)^T ]   [ 1        2        3       ]
             [ ∂(y_1 y_2 y_3)/∂y (y)^T       ] = [ y_2 y_3  y_1 y_3  y_1 y_2 ]

5 Generalization to f : R^{n×p} → R

If a function takes as input a matrix A ∈ R^{n×p}, we can transform this
matrix into a vector a ∈ R^{np}, such that

    A[i, j] = a[i + nj]

Then, we end up with a function f˜ : R^{np} → R. We can apply
the results from section 3 and we obtain, for x, h ∈ R^{np} corresponding
to X, H ∈ R^{n×p},

    f˜(x + h) = f(x) + ⟨∇_x f | h⟩ + o(h)

where ∇_x f = (∂f/∂x_1 (x), . . . , ∂f/∂x_{np} (x))^T.
Now, we would like to give some meaning to the following equation

    f(X + H) = f(X) + ⟨∇_X f | H⟩ + o(H)

The gradient of f with respect to a matrix X is a matrix of the same
shape as X, defined by (∇_X f)_{ij} = ∂f/∂X_{ij} (X). Now, you can check
that if you define

    (∇_X f)_{ij} = ∂f/∂X_{ij} (X)

then these two terms are equivalent:

    ⟨∇_x f | h⟩ = ⟨∇_X f | H⟩
    ∑_{i=1}^{np} ∂f/∂x_i (x) h_i = ∑_{i,j} ∂f/∂X_{ij} (X) H_{ij}

6 Generalization to f : R^{n×p} → R^m

Let's generalize the generalization of the previous section. Applying the
same idea as before, we can write

    f(x + h) = f(x) + J(x) · h + o(h)

where J has dimension m × n × p and is defined as

    J_{ijk}(x) = ∂f_i/∂X_{jk} (x)

Writing the 2d-dot product δ = J(x) · h ∈ R^m means that the i-th
component of δ is

    δ_i = ∑_{j=1}^n ∑_{k=1}^p ∂f_i/∂X_{jk} (x) h_{jk}

You can apply the same idea to any dimensions!

7 Chain-rule

Formal definition
Now let's consider f : R^n → R^m and g : R^p → R^n. We want
to compute the differential of the composition h = f ◦ g such that
h : x → u = g(x) → f(g(x)) = f(u), or

    d_x(f ◦ g)

It can be shown that the differential is the composition of the differentials

    d_x(f ◦ g) = d_{g(x)} f ◦ d_x g

where ◦ is the composition operator. Here, d_{g(x)} f and d_x g are linear
transformations (see section 4). Then, the resulting differential is
also a linear transformation and the jacobian is just the dot product
between the jacobians. In other words, the chain-rule is just writing the
resulting jacobian as a dot product of jacobians (the order of the dot
product is very important!):

    J_h(x) = J_f(g(x)) · J_g(x)

where · is the dot-product. This dot-product between two matrices
can also be written component-wise:

    J_h(x)_{ij} = ∑_{k=1}^n J_f(g(x))_{ik} · J_g(x)_{kj}

Example
Let's keep our example function f : (x_1, x_2) → 3x_1 + x_2^2 and our
function g : (y_1, y_2, y_3) → (y_1 + 2y_2 + 3y_3, y_1 y_2 y_3).
The composition of f and g is h = f ◦ g : R^3 → R

    h(y_1, y_2, y_3) = f(y_1 + 2y_2 + 3y_3, y_1 y_2 y_3)
                     = 3(y_1 + 2y_2 + 3y_3) + (y_1 y_2 y_3)^2

We can compute the three components of the gradient of h with
the partial derivatives

    ∂h/∂y_1 (y) = 3 + 2y_1 y_2^2 y_3^2
    ∂h/∂y_2 (y) = 6 + 2y_2 y_1^2 y_3^2
    ∂h/∂y_3 (y) = 9 + 2y_3 y_1^2 y_2^2

And then our gradient is

    ∇_y h = (3 + 2y_1 y_2^2 y_3^2,  6 + 2y_2 y_1^2 y_3^2,  9 + 2y_3 y_1^2 y_2^2)^T

In this process, we did not use our previous calculation, and that's
a shame. Let's use the chain-rule to make use of it. With examples 2.2
and 4, we had (recall that for a function f : R^n → R, the Jacobian is
the transpose of the gradient, ∇_x f^T = J_f(x))

    J_f(x) = ∇_x f^T = (3  2x_2)

We also need the jacobian of g, which we computed in 4:

    J_g(y) = [ 1        2        3       ]
             [ y_2 y_3  y_1 y_3  y_1 y_2 ]

Applying the chain rule, we obtain that the jacobian of h is the
product J_f · J_g (in this order). Recall that for a function R^n → R, the
jacobian is formally the transpose of the gradient. Then,

    J_h(y) = J_f(g(y)) · J_g(y)
           = ∇_{g(y)} f^T · J_g(y)
           = (3  2y_1 y_2 y_3) · [ 1        2        3       ]
                                 [ y_2 y_3  y_1 y_3  y_1 y_2 ]
           = (3 + 2y_1 y_2^2 y_3^2   6 + 2y_2 y_1^2 y_3^2   9 + 2y_3 y_1^2 y_2^2)

and taking the transpose we find the same gradient that we computed before!
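
The following numpy sketch (not part of the original note) double-checks this
chain-rule computation against finite differences; f, g and the two Jacobians
are transcribed from the example above:

import numpy as np

f = lambda u: 3 * u[0] + u[1] ** 2                      # f : R^2 -> R
g = lambda y: np.array([y[0] + 2 * y[1] + 3 * y[2],     # g : R^3 -> R^2
                        y[0] * y[1] * y[2]])
h = lambda y: f(g(y))

def grad_h(y):
    Jf = np.array([3.0, 2.0 * g(y)[1]])                 # J_f(g(y)), a 1 x 2 row
    Jg = np.array([[1.0, 2.0, 3.0],
                   [y[1] * y[2], y[0] * y[2], y[0] * y[1]]])
    return Jf @ Jg                                      # J_h(y), i.e. the gradient transposed

y = np.array([1.0, 2.0, 3.0])
eps = 1e-6
numeric = np.array([(h(y + eps * e) - h(y - eps * e)) / (2 * eps) for e in np.eye(3)])
print(grad_h(y))    # analytic gradient (as a row)
print(numeric)      # numerical estimate, should match closely
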
Important remark

• The gradient is only defined for functions with values in R.

• Note that the chain rule gives us a way to compute the Jacobian
  and not the gradient. However, we showed that in the case of a
  function f : R^n → R, the jacobian and the gradient are directly
  identifiable, because ∇_x f^T = J(x). Thus, if we want to compute
  the gradient of a function by using the chain-rule, the best way to
  do it is to compute the Jacobian.

• As the gradient must have the same shape as the variable against
  which we derive, and

  – we know that the Jacobian is the transpose of the gradient
  – and the Jacobian is the dot product of Jacobians

  an efficient way of computing the gradient is to find the ordering
  of jacobians (or the transposes of the jacobians) that yields correct
  shapes!

• The notation ∂·/∂· is often ambiguous and can refer to either the
  gradient or the Jacobian.
Computing Neural Network Gradients
Kevin Clark

1 Introduction
The purpose of these notes is to demonstrate how to quickly compute neural
network gradients in a completely vectorized way. It is complementary to the
last part of lecture 3 in CS224n 2019, which goes over the same material.

2 Vectorized Gradients
While it is a good exercise to compute the gradient of a neural network with re-
spect to a single parameter (e.g., a single element in a weight matrix), in practice
this tends to be quite slow. Instead, it is more efficient to keep everything in ma-
trix/vector form. The basic building block of vectorized gradients is the Jacobian
Matrix. Suppose we have a function f : Rn → Rm that maps a vector of length n
to a vector of length m: f (x) = [f1 (x1 , ..., xn ), f2 (x1 , ..., xn ), ..., fm (x1 , ..., xn )].
Then its Jacobian is the following m × n matrix:

    ∂f/∂x = [ ∂f1/∂x1  ...  ∂f1/∂xn ]
            [   ...    ...    ...   ]
            [ ∂fm/∂x1  ...  ∂fm/∂xn ]

That is, (∂f/∂x)_ij = ∂f_i/∂x_j (which is just a standard non-vector derivative). The
Jacobian matrix will be useful for us because we can apply the chain rule to a
vector-valued function just by multiplying Jacobians.

As a little illustration of this, suppose we have a function f (x) = [f1 (x), f2 (x)]
taking a scalar to a vector of size 2 and a function g(y) = [g1 (y1 , y2 ), g2 (y1 , y2 )]
taking a vector of size two to a vector of size two. Now let’s compose them to
get g(x) = [g1 (f1 (x), f2 (x)), g2 (f1 (x), f2 (x))]. Using the regular chain rule, we
can compute the derivative of g as the Jacobian

    ∂g/∂x = ∂/∂x [ g1(f1(x), f2(x)) ]  =  [ ∂g1/∂f1 ∂f1/∂x + ∂g1/∂f2 ∂f2/∂x ]
                 [ g2(f1(x), f2(x)) ]     [ ∂g2/∂f1 ∂f1/∂x + ∂g2/∂f2 ∂f2/∂x ]

And we see this is the same as multiplying the two Jacobians:

    ∂g/∂x = (∂g/∂f)(∂f/∂x) = [ ∂g1/∂f1  ∂g1/∂f2 ] [ ∂f1/∂x ]
                             [ ∂g2/∂f1  ∂g2/∂f2 ] [ ∂f2/∂x ]

3 Useful Identities
This section will now go over how to compute the Jacobian for several simple
functions. It will provide some useful identities you can apply when taking neu-
ral network gradients.

(1) Matrix times column vector with respect to the column vector
    (z = Wx, what is ∂z/∂x?)

    Suppose W ∈ R^{n×m}. Then we can think of z as a function of x taking an
    m-dimensional vector to an n-dimensional vector. So its Jacobian will be
    n × m. Note that

        z_i = ∑_{k=1}^{m} W_{ik} x_k

    So an entry (∂z/∂x)_{ij} of the Jacobian will be

        (∂z/∂x)_{ij} = ∂z_i/∂x_j = ∂/∂x_j ∑_{k=1}^{m} W_{ik} x_k = ∑_{k=1}^{m} W_{ik} (∂/∂x_j) x_k = W_{ij}

    because (∂/∂x_j) x_k = 1 if k = j and 0 otherwise. So we see that ∂z/∂x = W.

(2) Row vector times matrix with respect to the row vector
    (z = xW, what is ∂z/∂x?)

    A computation similar to (1) shows that ∂z/∂x = W^T.

(3) A vector with itself
    (z = x, what is ∂z/∂x?)

    We have z_i = x_i. So

        (∂z/∂x)_{ij} = ∂z_i/∂x_j = ∂/∂x_j x_i = { 1 if i = j; 0 otherwise }

    So we see that the Jacobian ∂z/∂x is a diagonal matrix where the entry at (i, i)
    is 1. This is just the identity matrix: ∂z/∂x = I. When applying the chain
    rule, this term will disappear because a matrix or vector multiplied by the
    identity matrix does not change.

(4) An elementwise function applied to a vector
    (z = f(x), what is ∂z/∂x?)

    Since f is being applied elementwise, we have z_i = f(x_i). So

        (∂z/∂x)_{ij} = ∂z_i/∂x_j = ∂/∂x_j f(x_i) = { f′(x_i) if i = j; 0 otherwise }

    So we see that the Jacobian ∂z/∂x is a diagonal matrix where the entry at (i, i)
    is the derivative of f applied to x_i. We can write this as ∂z/∂x = diag(f′(x)).
    Since multiplication by a diagonal matrix is the same as doing elementwise
    multiplication by the diagonal, we could also write ◦ f′(x) when applying
    the chain rule.

(5) Matrix times column vector with respect to the matrix
    (z = Wx, δ = ∂J/∂z, what is ∂J/∂W = (∂J/∂z)(∂z/∂W) = δ ∂z/∂W?)

    This is a bit more complicated than the other identities. The reason for in-
    cluding ∂J/∂z in the above problem formulation will become clear in a moment.
    First suppose we have a loss function J (a scalar) and are computing its
    gradient with respect to a matrix W ∈ R^{n×m}. Then we could think of J as
    a function of W taking nm inputs (the entries of W) to a single output (J).
    This means the Jacobian ∂J/∂W would be a 1 × nm vector. But in practice
    this is not a very useful way of arranging the gradient. It would be much
    nicer if the derivatives were in an n × m matrix like this:

        ∂J/∂W = [ ∂J/∂W_{11}  ...  ∂J/∂W_{1m} ]
                [    ...      ...     ...     ]
                [ ∂J/∂W_{n1}  ...  ∂J/∂W_{nm} ]

    Since this matrix has the same shape as W, we could just subtract it (times
    the learning rate) from W when doing gradient descent. So (in a slight abuse
    of notation) let's find this matrix as ∂J/∂W instead.

    This way of arranging the gradients becomes complicated when computing
    ∂z/∂W. Unlike J, z is a vector. So if we are trying to rearrange the gradients
    like with ∂J/∂W, ∂z/∂W would be an n × m × n tensor! Luckily, we can avoid
    the issue by taking the gradient with respect to a single weight W_{ij} instead.
    ∂z/∂W_{ij} is just a vector, which is much easier to deal with. We have

        z_k = ∑_{l=1}^{m} W_{kl} x_l
        ∂z_k/∂W_{ij} = ∑_{l=1}^{m} x_l (∂/∂W_{ij}) W_{kl}

    Note that (∂/∂W_{ij}) W_{kl} = 1 if i = k and j = l and 0 otherwise. So if k ≠ i
    everything in the sum is zero and the gradient is zero. Otherwise, the only
    nonzero element of the sum is when l = j, so we just get x_j. Thus we find
    ∂z_k/∂W_{ij} = x_j if k = i and 0 otherwise. Another way of writing this is

        ∂z/∂W_{ij} = (0, ..., 0, x_j, 0, ..., 0)^T      (x_j in the i-th position)

    Now let's compute ∂J/∂W_{ij}:

        ∂J/∂W_{ij} = (∂J/∂z)(∂z/∂W_{ij}) = δ ∂z/∂W_{ij} = ∑_k δ_k ∂z_k/∂W_{ij} = δ_i x_j

    (the only nonzero term in the sum is δ_i ∂z_i/∂W_{ij}). To get ∂J/∂W we want a ma-
    trix where entry (i, j) is δ_i x_j. This matrix is equal to the outer product

        ∂J/∂W = δ^T x^T

(6) Row vector times matrix with respect to the matrix
    (z = xW, δ = ∂J/∂z, what is ∂J/∂W = δ ∂z/∂W?)

    A similar computation to (5) shows that ∂J/∂W = x^T δ.
(7) Cross-entropy loss with respect to logits
    (ŷ = softmax(θ), J = CE(y, ŷ), what is ∂J/∂θ?)

    The gradient is ∂J/∂θ = ŷ − y
    (or (ŷ − y)^T if y is a column vector).

These identities will be enough to let you quickly compute the gradients for many
neural networks. However, it’s important to know how to compute Jacobians
for other functions as well in case they show up. Some examples if you want
practice: dot product of two vectors, elementwise product of two vectors, 2-norm
of a vector. Feel free to use these identities in the assignments. One option is
just to memorize them. Another option is to figure them out by looking at the
dimensions. For example, only one ordering/orientation of δ and x will produce
the correct shape for ∂J/∂W (assuming W is not square).
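
As a quick sanity check (not part of the original notes), the following numpy
sketch verifies identity (7) by comparing the analytic gradient ŷ − y with a
finite-difference estimate:

import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

theta = np.random.randn(5)
y = np.zeros(5)
y[2] = 1.0                                    # one-hot label

analytic = softmax(theta) - y                 # identity (7)

eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    numeric[i] = (cross_entropy(y, softmax(tp)) - cross_entropy(y, softmax(tm))) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))     # should be tiny (around 1e-9)
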

4 Gradient Layout
Jacobian formulation is great for applying the chain rule: you just have to mul-
tiply the Jacobians. However, when doing SGD it’s more convenient to follow
the convention “the shape of the gradient equals the shape of the parameter”
(as we did when computing ∂J/∂W). That way subtracting the gradient times the
learning rate from the parameters is easy. We expect answers to homework
questions to follow this convention. Therefore if you compute the gradient
of a column vector using Jacobian formulation, you should take the transpose
when reporting your final answer so the gradient is a column vector. Another
option is to always follow the convention. In this case the identities may not
work, but you can still figure out the answer by making sure the dimensions of
your derivatives match up. Up to you which of these options you choose!

5 Example: 1-Layer Neural Network


This section provides an example of computing the gradients of a full neural
network. In particular we are going to compute the gradients of a one-layer
neural network trained with cross-entropy loss. The forward pass of the model
is as follows:
x = input
z = W x + b1
h = ReLU(z)
θ = U h + b2
ŷ = softmax(θ)
J = CE(y, ŷ)
It helps to break up the model into the simplest parts possible, so note that
we defined z and θ to split up the activation functions from the linear trans-
formations in the network’s layers. The dimensions of the model’s parameters
are
x ∈ R^{Dx×1}, b1 ∈ R^{Dh×1}, W ∈ R^{Dh×Dx}, b2 ∈ R^{Nc×1}, U ∈ R^{Nc×Dh}
where Dx is the size of our input, Dh is the size of our hidden layer, and Nc is
the number of classes.


In this example, we will compute all of the network’s gradients:

    ∂J/∂U, ∂J/∂b2, ∂J/∂W, ∂J/∂b1, ∂J/∂x

To start with, recall that ReLU(x) = max(x, 0). This means

    ReLU′(x) = { 1 if x > 0; 0 otherwise } = sgn(ReLU(x))

where sgn is the signum function. Note that we are able to write the derivative
of the activation in terms of the activation itself.

Now let's write out the chain rule for ∂J/∂U and ∂J/∂b2:

    ∂J/∂U  = (∂J/∂ŷ)(∂ŷ/∂θ)(∂θ/∂U)
    ∂J/∂b2 = (∂J/∂ŷ)(∂ŷ/∂θ)(∂θ/∂b2)

Notice that (∂J/∂ŷ)(∂ŷ/∂θ) = ∂J/∂θ is present in both gradients. This makes the math a bit
cumbersome. Even worse, if we're implementing the model without automatic
differentiation, computing ∂J/∂θ twice will be inefficient. So it will help us to define
some variables to represent the intermediate derivatives:

    δ1 = ∂J/∂θ        δ2 = ∂J/∂z

These can be thought of as the error signals passed down to θ and z when doing
backpropagation. We can compute them as follows:

    δ1 = ∂J/∂θ = (ŷ − y)^T                        this is just identity (7)
    δ2 = ∂J/∂z = (∂J/∂θ)(∂θ/∂h)(∂h/∂z)            using the chain rule
               = δ1 (∂θ/∂h)(∂h/∂z)                substituting in δ1
               = δ1 U (∂h/∂z)                     using identity (1)
               = δ1 U ◦ ReLU′(z)                  using identity (4)
               = δ1 U ◦ sgn(h)                    we computed this earlier

A good way of checking our work is by looking at the dimensions of the Jacobians:

    ∂J/∂z   =    δ1        U      ◦  sgn(h)
   (1 × Dh)   (1 × Nc)  (Nc × Dh)    (Dh)


We see that the dimensions of all the terms in the gradient match up (i.e., the
number of columns in a term equals the number of rows in the next term). This
will always be the case if we computed our gradients correctly.

Now we can use the error terms to compute our gradients. Note that we transpose
our answers when computing the gradients for column vector terms to
follow the shape convention.

    ∂J/∂U  = (∂J/∂θ)(∂θ/∂U)  = δ1 ∂θ/∂U  = δ1^T h^T     using identity (5)
    ∂J/∂b2 = (∂J/∂θ)(∂θ/∂b2) = δ1 ∂θ/∂b2 = δ1^T         using identity (3) and transposing
    ∂J/∂W  = (∂J/∂z)(∂z/∂W)  = δ2 ∂z/∂W  = δ2^T x^T     using identity (5)
    ∂J/∂b1 = (∂J/∂z)(∂z/∂b1) = δ2 ∂z/∂b1 = δ2^T         using identity (3) and transposing
    ∂J/∂x  = (∂J/∂z)(∂z/∂x)  = (δ2 W)^T                 using identity (1) and transposing
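
To see these formulas in action, here is a small numpy sketch (not part of the
original notes) that implements the forward pass above, computes the gradients
exactly as derived, and checks one entry against finite differences:

import numpy as np
np.random.seed(0)

Dx, Dh, Nc = 4, 5, 3
x = np.random.randn(Dx, 1)
W, b1 = np.random.randn(Dh, Dx), np.random.randn(Dh, 1)
U, b2 = np.random.randn(Nc, Dh), np.random.randn(Nc, 1)
y = np.zeros((Nc, 1)); y[1] = 1.0                  # one-hot label

def forward(W, b1, U, b2):
    z = W @ x + b1
    h = np.maximum(z, 0)                           # ReLU
    theta = U @ h + b2
    y_hat = np.exp(theta - theta.max()); y_hat /= y_hat.sum()
    return z, h, theta, y_hat, float(-np.sum(y * np.log(y_hat)))

z, h, theta, y_hat, J = forward(W, b1, U, b2)

delta1 = (y_hat - y).T                             # 1 x Nc, identity (7)
delta2 = (delta1 @ U) * np.sign(h).T               # 1 x Dh, delta1 U ◦ sgn(h)

dU, db2 = delta1.T @ h.T, delta1.T                 # Nc x Dh and Nc x 1
dW, db1 = delta2.T @ x.T, delta2.T                 # Dh x Dx and Dh x 1
dx = (delta2 @ W).T                                # Dx x 1

# Finite-difference check of one entry of dJ/dW
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
print(dW[0, 0], (forward(Wp, b1, U, b2)[-1] - forward(Wm, b1, U, b2)[-1]) / (2 * eps))
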

CS224n: Natural Language Processing with Deep Learning
Course Instructors: Christopher Manning, Richard Socher
Lecture Notes: TensorFlow
Authors: Zhedi Liu, Jon Gauthier, Bharath Ramsundar, Chip Huyen
Winter 2017

Keyphrases: TensorFlow
Code Demo: https://github.com/nishithbsk/tensorflow_tutorials

1 Introduction

TensorFlow is an open source software library for numerical com-
putation using data flow graphs. It was originally developed by
researchers and engineers working on the Google Brain Team within
Google’s Machine Intelligence research organization for the purposes
of conducting machine learning and deep neural networks research.
(Check the official tutorial: https://www.tensorflow.org/get_started/)
Nodes in TensorFlow’s data flow graph represent mathematical
operations, while the edges represent the multidimensional data
arrays (tensors) communicated between them. The advantage of the
flexible architecture is that it allows users to build complex models
step by step and makes gradient calculations simple. TensorFlow
programs use a tensor data structure to represent all data – only
tensors are passed between operations in the computation graph. You
can think of a TensorFlow tensor as an n-dimensional array or list. A
tensor has a static type, a rank, and a shape.

2 Concepts

2.1 Variables, Placeholders, Mathematical Operations


Let’s use
h = ReLU (Wx + b)
where ReLU (Rectified Linear Unit) is defined as f ( x ) = max (0, x )
as an example to take a closer look at TensorFlow’s data flow graph,
shown in Figure 1. There are three types of nodes in a flow graph:
variables, placeholders and mathematical operations.
Variables are stateful nodes that maintain state across executions
of the graph. By stateful, we mean that variables retain their current
values over multiple executions, and it’s easy to restore those saved
values. Variables can be saved to disk during and after training. Typ-
ically, variables are parameters in a neural network. In our example,
weights W and bias b are variables.
Placeholders are nodes whose values are fed in at execution time.
[Figure 1: An Illustration of a TensorFlow Flow Graph]
The rationale behind having placeholders is that we want to be able

to build flow graphs without having to load external data, as we
only want to pass them in at run time. Placeholders, unlike vari-
ables, require initialization. In order to initialize a placeholder, type
and shape of data have to be passed in as arguments. Input data
and labels are some examples that need to be initialized as place-
holders. In our example, placeholder is x. See the code snippet be-
low for initializing an input placeholder that has type tf.float32 and
shape (batch_size, n_features), and a labels placeholder that has type
tf.int32 and shape (batch_size, n_classes).

## Example code snippet

input_placeholder = tf.placeholder(tf.float32,
                                   shape=(batch_size, n_features))
labels_placeholder = tf.placeholder(tf.int32,
                                    shape=(batch_size, n_classes))

Mathematical operations, as the name suggests, represent mathe-
matical operations in a flow graph. In our example, MatMul (multiply
two matrix values), Add (add element-wise with broadcasting) and
ReLU (activate with element-wise rectified linear function) are mathe-
matical operations.
Now we are ready to see our flow graph in code. Let’s assume our
input x has shape (N, Dx), W has shape (Dx, N) and type tf.float32,
b has shape (N, 1) and we will initialize W ∼ Uniform(−1, 1) and
b = 0. Then the code snippet below shows us how to build our flow
graph for h = ReLU (Wx + b).

## Example code snippet


import tensorflow as tf

b = tf.Variable(tf.zeros((N,)))
W = tf.Variable(tf.random_uniform((Dx, N), -1, 1))
x = tf.placeholder(tf.float32, (N, Dx))
h = tf.nn.relu(tf.matmul(x, W) + b)

The key thing to remember about a symbolic programming language
is that, up to what we have written here, no data is actually being
computed. x is just a placeholder for our input data. A flow graph
merely defines a function. We cannot do print(h) and get its value,
as it only represents a node in the graph.

2.2 Fetch, Feed


Now that we’ve defined a graph, the next steps are to deploy this
graph with a session and run the session to get our outputs. A ses-
sion is an environment that supports the execution of all operations
to a particular execution context (e.g. CPU, GPU). A session can be
easily built by doing sess = tf.Session(). In order for a session to

run, two arguments have to be fed: fetches and feeds. We use feeds
and fetches to get data into and out of arbitrary operations.
Fetches represent a list of graph nodes and return the outputs of
these nodes. We could fetch a single node or multiple tensors. See
the code snippet below for an example of fetching two tensors: mul
and intermed.

## Example code snippet


import tensorflow as tf

input1 = tf.constant([3.0])
input2 = tf.constant([2.0])
input3 = tf.constant([5.0])
intermed = tf.add(input2, input3)
mul = tf.mul(input1, intermed)

with tf.Session() as sess:
    result = sess.run([mul, intermed])
    print(result)

# output:
# [array([ 21.], dtype=float32), array([ 7.], dtype=float32)]

A feed, supplied as an argument to a run() call, temporarily re-
places the output of an operation with a tensor value. The feed is
only used for the run call to which it is passed. Essentially, feeds are
dictionaries mapping placeholders to their values. Nodes that de-
pend on placeholders cannot run unless their values are fed. See the
code snippet below for an example of feeding a feed_dict.

## Example code snippet


import tensorflow as tf

input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.mul(input1, input2)

with tf.Session() as sess:
    print(sess.run([output], feed_dict={input1: [7.], input2: [2.]}))

# output:
# [array([ 14.], dtype=float32)]

Before moving on to how to train a model, let’s see a slightly
more complicated example combining fetch and feed. In this ex-
ample, we have a placeholder x that requires initialization. We have
two variables W and b. It should be noted that when we launch a
graph, all variables have to be explicitly initialized before one can
run Ops that use their value. A variable can be initialized by run-
ning its initializer op, restoring the variable from a save file, or sim-
ply running an assign Op that assigns a value to the variable. In
fact, the variable initializer op is just an assign Op that assigns the

variable’s initial value to the variable itself. An example usage is


sess.run(w.initializer) where w is a variable in the graph. The
more common initialization pattern is to use the convenience func-
tion tf.initialize_all_variables() to add an Op to the graph that
initializes all the variables, as illustrated in the code snippet below.

## Example code snippet


import numpy as np
import tensorflow as tf

b = tf.Variable(tf.zeros((100,)))
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))

x = tf.placeholder(tf.float32, (100, 784))

h = tf.nn.relu(tf.matmul(x, W) + b)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
# {x: np.random.random((100, 784))} is a feed
# that assigns a random (100, 784) array to placeholder x
sess.run(h, {x: np.random.random((100, 784))})

2.3 How to Train a Model in TensorFlow


1. Define a Loss
The first thing to do in order to train a model is to build a loss
node. See the code snippet below for an example of defining a cross-
entropy loss. We build the loss node using labels and prediction.
Note that we use tf.reduce_sum to compute the sum of elements
across dimensions of a tensor. For our example, axis=1 is used to
perform a row-wise sum.

## Example code snippet


import tensorflow as tf

prediction = tf.nn.softmax(...) #Output of neural network


label = tf.placeholder(tf.float32, [100, 10])

cross_entropy = -tf.reduce_sum(label * tf.log(prediction), axis=1)

# More examples of using tf.reduce_sum


# 'x' is [[1, 1, 1]
# [1, 1, 1]]
# tf.reduce_sum(x) ==> 6
# tf.reduce_sum(x, 0) ==> [2, 2, 2]
# tf.reduce_sum(x, 1) ==> [3, 3]
# tf.reduce_sum(x, 1, keep_dims=True) ==> [[3], [3]]
# tf.reduce_sum(x, [0, 1]) ==> 6

2. Compute Gradients
The next thing we have to do is to compute gradients. TensorFlow
nodes have attached operations; therefore gradients with respect

to parameters are automatically computed with backpropagation.


All we need to do is create an optimizer object and call the
minimize function on the previously defined loss. See the code snippet be-
low for an example of using a GradientDescentOptimizer optimizer
where cross_entropy is the same as we introduced in the previous
code snippet. Evaluating the minimization operation, train_step
at runtime will automatically compute and apply gradients to all
variables in the graph.

## Example code snippet


import tensorflow as tf

lr = 0.5 # learning rate


optimizer = tf.train.GradientDescentOptimizer(lr)
train_step = optimizer.minimize(cross_entropy)

3. Train Model
Now we are ready to train a model. This can simply be done by
creating an iterative training schedule that feeds in data and labels and
applies gradients to the variables, as shown in the code snippet be-
low.

## Example code snippet


import tensorflow as tf

sess = tf.Session()
sess.run(tf.initialize_all_variables())

for i in range(1000):
    batch_x, batch_label = data.next_batch()
    sess.run(train_step, feed_dict={x: batch_x, label: batch_label})

2.4 Variable Sharing


One last important concept is variable sharing. When building com-
plex models, we often need to share large sets of variables and might
want to initialize all of them in one place. This can be done by using
tf.variable_scope() and tf.get_variable().
Imagine we are building a neural net with two layers. If we use
tf.Variable, we would have two sets of weights and two sets of biases.
Let's assume that these variables are initialized in define_variables().
The problem arises when we want to use this model for two tasks
that share the same parameters. We would have to call define_variables(inputs)
twice, resulting in two sets of variables, 4 variables in each one, for
a total of 8 variables. A common attempt at sharing variables is to create
them in a separate piece of code and pass them to functions that use
them, say by using a dictionary, i.e. define_variables now takes
two arguments, inputs and variables_dict. While convenient, cre-

ating a variables_dict, outside of the code, breaks encapsulation:


1) the code that builds the graph must document the names, types,
and shapes of variables to create, and 2) When the code changes,
the callers may have to create more, or less, or different variables.
One way to address the problem is to use classes to create a model,
where the classes take care of managing the variables they need. For
a lighter solution, not involving classes, TensorFlow provides a Vari-
able Scope mechanism that allows you to easily share named variables
while constructing a graph.
Variable Scope mechanism in TensorFlow consists of two main
functions: tf.get_variable(<name>, <shape>, <initializer>) cre-
ates or returns a variable with a given name instead of a direct call
to tf.Variable; tf.variable_scope(<scope_name>) manages names-
paces for names passed to tf.get_variable(). tf.get_variable
does one of two things depending on the scope it is called in. Let’s
set v = tf.get_variable(name, shape, dtype, initializer).
Case 1: the scope is set for creating new variables, i.e. tf.variable_scope(<scope_name>,
reuse=False). In this case, v will be a newly created tf.Variable
with the provided shape and data type. The full name of the created
variable will be set to the current variable scope name + the provided
name and a check will be performed to ensure that no variable with
this full name exists yet. If a variable with this full name already ex-
ists, the function will raise a ValueError. If a new variable is created,
it will be initialized to the value initializer(shape). For example,

## Example code snippet


import tensorflow as tf

with tf.variable_scope("foo"):
    v = tf.get_variable("v", [1])
assert v.name == "foo/v:0"

Case 2: the scope is set for reusing variables, i.e. tf.variable_scope(<scope_name>,
reuse=True). In this case, the call will search for an already exist-
ing variable with name equal to the current variable scope name
+ the provided name. If no such variable exists, a ValueError will
be raised. If the variable is found, it will be returned. If a variable
already exists but reuse=False, program will crash. For example:

## Example code snippet


import tensorflow as tf

with tf.variable_scope("foo"):
    v = tf.get_variable("v", [1])
with tf.variable_scope("foo", reuse=True):
    v1 = tf.get_variable("v", [1])
with tf.variable_scope("foo", reuse=False):
    v1 = tf.get_variable("v")  # CRASH: foo/v:0 already exists!
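
To make the two cases concrete, here is a sketch (not from the original notes)
of sharing one layer's variables across two inputs. linear_layer is a made-up
helper, but every TensorFlow call follows the variable-scope pattern discussed
above:

## Example code snippet (sketch)

import tensorflow as tf

def linear_layer(x, in_dim, out_dim):
    W = tf.get_variable("W", [in_dim, out_dim],
                        initializer=tf.random_uniform_initializer(-1, 1))
    b = tf.get_variable("b", [out_dim],
                        initializer=tf.constant_initializer(0.0))
    return tf.matmul(x, W) + b

x1 = tf.placeholder(tf.float32, (100, 784))
x2 = tf.placeholder(tf.float32, (100, 784))

with tf.variable_scope("shared_layer"):
    h1 = linear_layer(x1, 784, 100)
with tf.variable_scope("shared_layer", reuse=True):
    h2 = linear_layer(x2, 784, 100)    # reuses shared_layer/W and shared_layer/b
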
