CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part I
Word Vectors I: Introduction, SVD and Word2Vec
Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Michael Fang, Guillaume Genthial, Rohit Mundra, Richard Socher
Winter 2019
Easy
• Spell Checking
• Keyword Search
• Finding Synonyms

Medium
• Parsing information from websites, documents, etc.

Hard
• Machine Translation
• Semantic Analysis
• Coreference
• Question Answering
2 Word Vectors

...indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).

So let's dive into our first word vector, and arguably the most simple: the one-hot vector. Represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, $|V|$ is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:
$$w^{aardvark} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad w^{a} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad w^{at} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots, \quad w^{zebra} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

Fun fact: The term "one-hot" comes from digital circuit design, meaning "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)".

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. For instance,

$$(w^{hotel})^T w^{motel} = (w^{hotel})^T w^{cat} = 0$$
Truncating to the first $k$ singular values, the amount of variance captured is

$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$
Applying SVD to X:

$$\underset{|V| \times |V|}{X} = \underset{|V| \times |V|}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \underset{|V| \times |V|}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \underset{|V| \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
Reducing dimensionality by selecting the first $k$ singular vectors:

$$\underset{|V| \times |V|}{\hat{X}} = \underset{|V| \times k}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \underset{k \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
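To make this concrete, here is a minimal NumPy sketch of the truncated-SVD pipeline on a made-up 3-word co-occurrence matrix (values are illustrative, not from a real corpus):

import numpy as np

# Toy co-occurrence matrix for a 3-word vocabulary; in practice X is |V| x |V|.
X = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                      # number of dimensions to keep
word_vectors = U[:, :k]    # each row is a k-dimensional word vector
print(word_vectors)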
Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:

• The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).

• The matrix is extremely sparse since most words do not co-occur.

• The matrix is very high dimensional in general ($\approx 10^6 \times 10^6$).

SVD-based methods do not scale well for big matrices and it is hard to incorporate new words or documents. Computational cost for an $m \times n$ matrix is $O(mn^2)$.
The idea is to design a model whose parameters are the word vectors. Then, we train the model on a certain objective: at every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. This idea is a very old one.

Iteration-based methods capture co-occurrence of words one at a time instead of capturing all co-occurrence counts directly like in SVD methods.
$$P(w_1, w_2, \cdots, w_n)$$

We can take the unigram language model approach and break apart this probability by assuming the word occurrences are completely independent:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} P(w_i)$$

However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words, and the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that in the Word-Word Matrix with a context of size 1, we basically can learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset.
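As a small illustration of learning these pairwise probabilities from counts, here is a minimal sketch over a made-up toy corpus (the corpus and names are illustrative):

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("cat", "the"))  # 2/3: "the" is followed by "cat" 2 out of 3 times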
Now that we understand how we can think about a sequence of
tokens having a probability, let us observe some example models that
could learn these probabilities.
1. We generate our one-hot word vectors for the input context of size $m$: $(x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)} \in \mathbb{R}^{|V|})$.
5. Turn the scores into probabilities $\hat{y} = \text{softmax}(z) \in \mathbb{R}^{|V|}$.

6. We desire our probabilities generated, $\hat{y} \in \mathbb{R}^{|V|}$, to match the true probabilities, $y \in \mathbb{R}^{|V|}$, which also happens to be the one-hot vector of the actual word.

The softmax is an operator that we'll use very frequently. It transforms a vector into a vector whose $i$-th component is $\frac{e^{\hat{y}_i}}{\sum_{k=1}^{|V|} e^{\hat{y}_k}}$.

$$H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$$
Because $y$ is a one-hot vector, this loss reduces to $H(\hat{y}, y) = -y_c \log(\hat{y}_c)$, where $c$ is the index where the correct word's one-hot vector is 1. We can now consider the case where our prediction was perfect and thus $\hat{y}_c = 1$. We can then calculate $H(\hat{y}, y) = -1 \log(1) = 0$. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus $\hat{y}_c = 0.01$. As before, we can calculate our loss to be $H(\hat{y}, y) = -1 \log(0.01) \approx 4.605$. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. We thus formulate our optimization objective as minimizing this cross-entropy loss.

$\hat{y} \mapsto H(\hat{y}, y)$ is minimum when $\hat{y} = y$. Then, if we found a $\hat{y}$ such that $H(\hat{y}, y)$ is close to the minimum, we have $\hat{y} \approx y$. This means that our model is very good at predicting the center word!

To learn the vectors (the matrices $U$ and $V$) CBOW defines a cost that measures how good it is at predicting the center word. Then, we optimize this cost by updating the matrices $U$ and $V$ via stochastic gradient descent.
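A minimal sketch of steps 5-6 and the cross-entropy loss in NumPy (the vocabulary size and scores below are made up):

import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilize before exponentiating
    return np.exp(z) / np.exp(z).sum()

V = 5                                    # toy vocabulary size
z = np.random.randn(V)                   # scores from the model
y_hat = softmax(z)                       # predicted distribution over words
y = np.zeros(V); y[2] = 1.0              # one-hot truth: center word has index 2

loss = -np.sum(y * np.log(y_hat))        # H(y_hat, y) = -sum_j y_j log(y_hat_j)
print(loss)                              # 0 only if y_hat puts all mass on index 2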
the word "jumped" the context. We call this type of model a Skip-
Gram model. Notation for Skip-Gram Model:
Let’s discuss the Skip-Gram model above. The setup is largely the • wi : Word i from vocabulary V
same but we essentially swap our x and y i.e. x in the CBOW are • V ∈ Rn×|V | : Input word matrix
now y and vice-versa. The input one hot vector (center word) we will • vi : i-th column of V , the input vector
represent with an x (since there is only one). And the output vectors representation of word wi
1. We generate our one hot input vector x ∈ R|V | of the center word.
• objective function
cs224n: natural language processing with deep learning lecture notes: part i word
vectors i: introduction, svd and word2vec 11
• gradients
• update rules
$$J = -\sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} - \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}$$
For skip-gram, our new objective function for observing the context word $c-m+j$ given the center word $c$ would be

$$-\log \sigma(u_{c-m+j}^T \cdot v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \cdot v_c)$$

To compare, the regular softmax loss for skip-gram is

$$-u_{c-m+j}^T v_c + \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)$$

For CBOW, our new objective function for observing the center word $u_c$ given the context vector $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m}$ would be

$$-\log \sigma(u_c^T \cdot \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \cdot \hat{v})$$

To compare, the regular softmax loss for CBOW is

$$-u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})$$

In the above formulation, $\{\tilde{u}_k \mid k = 1 \ldots K\}$ are sampled from $P_n(w)$. Let's discuss what $P_n(w)$ should be. While there is much discussion of what makes the best approximation, what seems to work best is the unigram model raised to the power of 3/4. Why 3/4? Here's an example that might help gain some intuition:
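The following sketch shows the effect of the 3/4 power on a made-up unigram distribution: raising to 3/4 and renormalizing shifts probability mass toward rarer words (the counts here are illustrative):

import numpy as np

counts = np.array([900, 90, 9], dtype=float)   # toy unigram counts

unigram = counts / counts.sum()
p_n = unigram ** 0.75
p_n /= p_n.sum()

print(unigram)  # approx [0.901 0.090 0.009]
print(p_n)      # rare words get relatively more probability mass

# Drawing K negative samples from P_n(w):
K = 5
negatives = np.random.choice(len(counts), size=K, p=p_n)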
...vector $v_{n(w,i)}$. So $n(w, 1)$ is the root, while $n(w, L(w))$ is the father of $w$. Now for each inner node $n$, we arbitrarily choose one of its children and call it $ch(n)$ (e.g. always the left node). Then, we can compute the probability as

$$P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\left([n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^T v_{w_i}\right)$$

where

$$[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{cases}$$

and $\sigma(\cdot)$ is the sigmoid function.
This formula is fairly dense, so let's examine it more closely.

First, we are computing a product of terms based on the shape of the path from the root ($n(w, 1)$) to the leaf ($w$). If we assume $ch(n)$ is always the left node of $n$, then the term $[n(w, j+1) = ch(n(w, j))]$ returns 1 when the path goes left, and -1 if right.

Furthermore, the term $[n(w, j+1) = ch(n(w, j))]$ provides normalization. At a node $n$, if we sum the probabilities for going to the left and right node, you can check that for any value of $v_n^T v_{w_i}$,

$$\sigma(v_n^T v_{w_i}) + \sigma(-v_n^T v_{w_i}) = 1$$
References

[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137-1155.

[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. P. (2011). Natural language processing (almost) from scratch. CoRR, abs/1103.0398.

[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

[Rong, 2014] Rong, X. (2014). word2vec parameter learning explained. CoRR, abs/1411.2738.

[Rumelhart et al., 1988] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Neurocomputing: Foundations of research. Chapter: Learning Representations by Back-propagating Errors, pages 696-699. MIT Press, Cambridge, MA, USA.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part II
Word Vectors II: GloVe, Evaluation and Training
Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Emma Peng, Richard Socher, Ajay Sohmshetty, Amita Kamath
Winter 2019
$$Q_{ij} = \frac{\exp(\vec{u}_j^T \vec{v}_i)}{\sum_{w=1}^{W} \exp(\vec{u}_w^T \vec{v}_i)}$$

Training proceeds in an on-line, stochastic fashion, but the implied global cross-entropy loss can be calculated as below. As the same words $i$ and $j$ can appear multiple times in the corpus, it is more efficient to first group together the same values for $i$ and $j$:

$$J = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$

Computing the normalized $Q_{ij}$ requires an expensive summation over the entire vocabulary, so instead we can use a least squares objective in which the normalization factors are discarded:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i (\hat{P}_{ij} - \hat{Q}_{ij})^2$$

where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(\vec{u}_j^T \vec{v}_i)$ are the unnormalized distributions. This formulation introduces a new problem – $X_{ij}$ often takes on very large values and makes the optimization difficult. An effective change is to minimize the squared error of the logarithms of $\hat{P}$ and $\hat{Q}$:
$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i (\log \hat{P}_{ij} - \log \hat{Q}_{ij})^2 = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i (\vec{u}_j^T \vec{v}_i - \log X_{ij})^2$$

Another observation is that the weighting factor $X_i$ is not guaranteed to be optimal. Instead, we introduce a more general weighting function $f$, which we are free to take to depend on the context word as well:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} f(X_{ij}) (\vec{u}_j^T \vec{v}_i - \log X_{ij})^2$$
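A minimal sketch of evaluating this final weighted least-squares objective in NumPy; the specific weighting function (with x_max = 100 and exponent 3/4) is an assumption borrowed from the common GloVe formulation, since the section leaves f unspecified, and all data below is made up:

import numpy as np

W, d = 10, 5
X = np.random.randint(0, 50, size=(W, W)).astype(float) + 1.0
U = 0.1 * np.random.randn(W, d)   # output ("context") vectors u_j
Vm = 0.1 * np.random.randn(W, d)  # input ("center") vectors v_i

def f(x, x_max=100.0):
    return np.minimum((x / x_max) ** 0.75, 1.0)

residual = Vm @ U.T - np.log(X)   # (u_j . v_i) - log X_ij for all i, j
loss = np.sum(f(X) * residual ** 2)
print(loss)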
1.4 Conclusion
In conclusion, the GloVe model efficiently leverages global statistical
information by training only on the nonzero elements in a word-
word co-occurrence matrix, and produces a vector space with mean-
ingful sub-structure. It consistently outperforms word2vec on the
word analogy task, given the same corpus, vocabulary, window size,
and training time. It achieves better results faster, and also obtains
the best results irrespective of speed.
...sentations since they are used in downstream subsystems (such as deep neural networks). To do this in practice, we will need to tune many hyperparameters in the Word2Vec subsystem (such as the dimension of the word vector representation). While the idealistic approach is to retrain the entire system after any parametric changes in the Word2Vec subsystem, this is impractical from an engineering standpoint because the machine learning system (in step 3) is typically a deep neural network with millions of parameters that takes very long to train. In such a situation, we would want to come up with a simple intrinsic evaluation technique which can provide a measure of "goodness" of the word to word vector subsystem. Obviously, a requirement is that the intrinsic evaluation has a positive correlation with the final task performance.

Intrinsic evaluation:
• Fast to compute performance
• Helps understand subsystem
• Needs positive correlation with real task to determine usefulness
...optimal word vectors for the extrinsic task and they perform well at the extrinsic task. However, it is also possible that the pretrained word vectors could be trained further (i.e. retrained) using the extrinsic task this time to perform better. However, retraining word vectors can be risky.

If we retrain word vectors using the extrinsic task, we need to ensure that the training set is large enough to cover most words from the vocabulary. This is because Word2Vec or GloVe produce semantically related words to be located in the same part of the word space. When we retrain these words over a small set of the vocabulary, these words are shifted in the word space and as a result, the performance over the final task could actually reduce. Let us explore this idea further using an example. Consider the pretrained vectors to be in a two dimensional space as shown in Figure 6. Here, we see that the word vectors are classified correctly on some extrinsic classification task. Now, if we retrain only two of those vectors because of a limited training set size, then we see in Figure 7 that one of the words gets misclassified because the boundary shifts as a result of word vector updates.

Figure 6: Here, we see that the words "Telly", "TV", and "Television" are classified correctly before retraining. "Telly" and "TV" are present in the extrinsic task training set while "Television" is only present in the test set.

Thus, word vectors should not be retrained if the training data set is small. If the training set is large, retraining may improve performance.
The only difference above is that $k(i)$ is now a function that returns the correct class index for example $x^{(i)}$.

Let us now try to estimate the number of parameters that would be updated if we consider training both model weights ($W$) and word vectors ($x$). We know that a simple linear decision boundary would require a model that takes in at least one $d$-dimensional input word vector and produces a distribution over $C$ classes. Thus, to update the model weights, we would be updating $C \cdot d$ parameters. If we update the word vectors for every word in the vocabulary $V$ as well, then we would be updating as many as $|V|$ word vectors, each of which is $d$-dimensional. Thus, the total number of parameters would be as many as $C \cdot d + |V| \cdot d$ for a simple linear classifier:

$$\nabla_\theta J(\theta) = \begin{bmatrix} \nabla_{W_{\cdot 1}} \\ \vdots \\ \nabla_{W_{\cdot d}} \\ \nabla_{x_{aardvark}} \\ \vdots \\ \nabla_{x_{zebra}} \end{bmatrix}$$
This is an extremely large number of parameters considering how simple the model's decision boundary is - such a large number of parameters is highly prone to overfitting.

To reduce overfitting risk, we introduce a regularization term which poses the Bayesian belief that the parameters ($\theta$) should be small in magnitude (i.e. close to zero):

$$-\sum_{i=1}^{N} \log \frac{\exp(W_{k(i) \cdot} \, x^{(i)})}{\sum_{c=1}^{C} \exp(W_{c \cdot} \, x^{(i)})} + \lambda \sum_{k=1}^{C \cdot d + |V| \cdot d} \theta_k^2$$
$$x_{window}^{(i)} = \begin{bmatrix} x^{(i-2)} \\ x^{(i-1)} \\ x^{(i)} \\ x^{(i+1)} \\ x^{(i+2)} \end{bmatrix}$$

As a result, when we evaluate the gradient of the loss with respect to the words, we will receive gradients for the word vectors:

$$\delta_{window} = \begin{bmatrix} \nabla_{x^{(i-2)}} \\ \nabla_{x^{(i-1)}} \\ \nabla_{x^{(i)}} \\ \nabla_{x^{(i+1)}} \\ \nabla_{x^{(i+2)}} \end{bmatrix}$$

The gradient will of course need to be distributed to update the corresponding word vectors in implementation.
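A minimal sketch of that distribution step, assuming hypothetical names (embeddings, window_ids) and a window of size 2 on each side:

import numpy as np

d, vocab = 4, 10
embeddings = 0.1 * np.random.randn(vocab, d)
window_ids = [3, 7, 1, 7, 5]                   # words at positions i-2 .. i+2

delta_window = np.random.randn(5 * d)          # gradient of the loss w.r.t. x_window

# Split the stacked gradient and accumulate it into each word's vector;
# note the same word (id 7 here) can appear twice and receives both pieces.
lr = 0.1
for pos, word_id in enumerate(window_ids):
    embeddings[word_id] -= lr * delta_window[pos * d:(pos + 1) * d]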
$$a = \frac{1}{1 + \exp(-(w^T x + b))}$$

We can also combine the weights and bias term above to equivalently formulate:

$$a = \frac{1}{1 + \exp(-[w^T \;\; b] \cdot [x \;\; 1])}$$

Neuron: A neuron is the fundamental building block of neural networks. We will see that a neuron can be one of many functions that allows for non-linearities to accrue in the network.
$$a_1 = \frac{1}{1 + \exp(-(w^{(1)T} x + b_1))}, \quad \ldots, \quad a_m = \frac{1}{1 + \exp(-(w^{(m)T} x + b_m))}$$

$$z = Wx + b$$

The activations of the sigmoid function can then be written as:

$$\begin{bmatrix} a^{(1)} \\ \vdots \\ a^{(m)} \end{bmatrix} = \sigma(z) = \sigma(Wx + b)$$

Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.
So what do these activations really tell us? Well, one can think
of these activations as indicators of the presence of some weighted
combination of features. We can then use a combination of these
activations to perform classification tasks.
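A minimal sketch of such a layer of sigmoid neurons in NumPy (shapes are illustrative):

import numpy as np

m, n = 3, 4
W = 0.1 * np.random.randn(m, n)
b = np.zeros(m)
x = np.random.randn(n)

z = W @ x + b                       # z = Wx + b
a = 1.0 / (1.0 + np.exp(-z))        # sigma(z), one activation per neuron
print(a)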
1.4 Maximum Margin Objective Function

Like most machine learning models, neural networks also need an optimization objective, a measure of error or goodness which we want to minimize or maximize respectively. Here, we will discuss a popular error metric known as the maximum margin objective. The idea behind using this objective is to ensure that the score computed for "true" labeled data points is higher than the score computed for "false" labeled data points.

Figure 4: This image captures how a simple feed-forward network might compute its output.

Using the previous example, call the score computed for the "true" labeled window "Museums in Paris are amazing" $s$, and the score computed for the "false" labeled window "Not all museums in Paris" $s_c$ (subscripted as $c$ to signify that the window is "corrupt"). Then, our objective function would be to maximize $(s - s_c)$ or to minimize $(s_c - s)$. However, we modify our objective to ensure that error is only computed if $s_c > s \Rightarrow (s_c - s) > 0$. The intuition behind doing this is that we only care that the "true" data point has a higher score than the "false" data point and that the rest does not matter. Thus, we want our error to be $(s_c - s)$ if $s_c > s$ and 0 otherwise. Thus, our optimization objective is now:
$$\text{minimize } J = \max(s_c - s, 0)$$

However, this objective does not create a margin of safety: we would want the "true" data point to score higher than the "false" one by some positive margin $\Delta$:

$$\text{minimize } J = \max(\Delta + s_c - s, 0)$$

We can scale this margin such that $\Delta = 1$ and let the other parameters in the optimization problem adapt to this without any change in performance. For more information on this, read about functional and geometric margins - a topic often covered in the study of Support Vector Machines. Finally, we define the following optimization objective which we optimize over all training windows:

$$\text{minimize } J = \max(1 + s_c - s, 0)$$

The max-margin objective function is most commonly associated with Support Vector Machines (SVMs).
$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_{\theta^{(t)}} J$$
• Each layer (including the input and output layers) has neurons which receive an input and produce an output. The $j$-th neuron of layer $k$ receives the scalar input $z_j^{(k)}$ and produces the scalar activation output $a_j^{(k)}$.

• We will call the backpropagated error calculated at $z_j^{(k)}$ as $\delta_j^{(k)}$.

• Layer 1 refers to the input layer and not the first hidden layer. For the input layer, $x_j = z_j^{(1)} = a_j^{(1)}$.

• $W^{(k)}$ is the transfer matrix that maps the output from the $k$-th layer to the input to the $(k+1)$-th. Thus, $W^{(1)} = W$ and $W^{(2)} = U^T$ to put this new generalized notation in perspective of Section 1.3.
Backpropagation Notation:
• $x_i$ is an input to the neural network.
• $s$ is the output of the neural network.

Let us begin: Suppose the cost $J = (1 + s_c - s)$ is positive and we want to perform the update of parameter $W_{14}^{(1)}$ (in Figure 5 and Figure 6). We must realize that $W_{14}^{(1)}$ only contributes to $z_1^{(2)}$ and thus $a_1^{(2)}$. This fact is crucial to understanding backpropagation – backpropagated gradients are only affected by values they contribute to. $a_1^{(2)}$ is consequently used in the forward computation of the score by multiplication with $W_1^{(2)}$. We can see from the max-margin loss that:

$$\frac{\partial J}{\partial s} = -\frac{\partial J}{\partial s_c} = -1$$

Therefore we will work with $\frac{\partial s}{\partial W_{ij}^{(1)}}$ here for simplicity. Thus,
$$\frac{\partial s}{\partial W_{ij}^{(1)}} = W_i^{(2)} f'(z_i^{(2)}) \frac{\partial}{\partial W_{ij}^{(1)}} \left( b_i^{(1)} + a_1 W_{i1}^{(1)} + a_2 W_{i2}^{(1)} + a_3 W_{i3}^{(1)} + a_4 W_{i4}^{(1)} \right)$$

$$= W_i^{(2)} f'(z_i^{(2)}) \frac{\partial}{\partial W_{ij}^{(1)}} \left( b_i^{(1)} + \sum_k a_k W_{ik}^{(1)} \right)$$

$$= W_i^{(2)} f'(z_i^{(2)}) \, a_j$$

$$= \delta_i^{(2)} \cdot a_j^{(1)}$$

We see above that the gradient reduces to the product $\delta_i^{(2)} \cdot a_j^{(1)}$, where $\delta_i^{(2)}$ is essentially the error propagating backwards from the $i$-th neuron in layer 2, and $a_j^{(1)}$ is the input fed to the $i$-th neuron in layer 2, scaled by $W_{ij}$ in the forward pass.
Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update $W_{14}^{(1)}$:

Figure 6: This subnetwork shows the relevant parts of the network required to update $W_{ij}^{(1)}$.

1. We start with an error signal of 1 propagating backwards from $a_1^{(3)}$.
Bias Updates: Bias terms (such as $b_1^{(1)}$) are mathematically equivalent to other weights contributing to the neuron input ($z_1^{(2)}$) as long as the input being forwarded is 1. As such, the bias gradient for neuron $i$ on layer $k$ is simply $\delta_i^{(k)}$. For instance, if we were updating $b_1^{(1)}$ instead of $W_{14}^{(1)}$ above, the gradient would simply be $f'(z_1^{(2)}) W_1^{(2)}$.
4. However, $a_j^{(k-1)}$ may have been forwarded to multiple nodes in the next layer, as shown in Figure 8. It should receive responsibility for errors propagating backward from node $m$ in layer $k$ too, using the exact same mechanism.

5. Thus, the error received at $a_j^{(k-1)}$ is $\delta_i^{(k)} W_{ij}^{(k-1)} + \delta_m^{(k)} W_{mj}^{(k-1)}$.

6. In fact, we can generalize this to be $\sum_i \delta_i^{(k)} W_{ij}^{(k-1)}$.

7. Now that we have the correct error at $a_j^{(k-1)}$, we move it across neuron $j$ at layer $k-1$ by multiplying with the local gradient $f'(z_j^{(k-1)})$.

8. Thus, the error that reaches $z_j^{(k-1)}$, called $\delta_j^{(k-1)}$, is $f'(z_j^{(k-1)}) \sum_i \delta_i^{(k)} W_{ij}^{(k-1)}$.
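A minimal NumPy sketch of this delta recursion for one layer boundary, with sigmoid f (shapes and values are illustrative):

import numpy as np

def fprime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)

n_prev, n_next = 4, 3
W = 0.1 * np.random.randn(n_next, n_prev)    # maps layer k-1 outputs to layer k inputs
z_prev = np.random.randn(n_prev)             # inputs z^(k-1)
delta_next = np.random.randn(n_next)         # errors delta^(k) at layer k

# delta_j^(k-1) = f'(z_j^(k-1)) * sum_i delta_i^(k) W_ij
delta_prev = fprime(z_prev) * (W.T @ delta_next)
print(delta_prev)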
$$f'(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}$$

where $\epsilon$ is a small number (usually around $10^{-5}$). The term $J(\theta^{(i+)})$ is simply the error calculated on a forward pass for a given input when we perturb the parameter $\theta$'s $i$-th element by $+\epsilon$. Similarly, the term $J(\theta^{(i-)})$ is the error calculated on a forward pass for the same input when we perturb the parameter $\theta$'s $i$-th element by $-\epsilon$. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter element in the model. We note that this definition of the numerical gradient follows very naturally from the definition of the derivative, where, in the scalar case,

$$f'(x) \approx \frac{f(x + \epsilon) - f(x)}{\epsilon}$$
Gradient checks are a great way to compare analytical and numerical gradients. Analytical gradients should be close to numerical gradients, which can be calculated using:

$$f'(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}$$

$J(\theta^{(i+)})$ and $J(\theta^{(i-)})$ can be evaluated using two forward passes. An implementation of this can be seen in Snippet 2.1.

Of course, there is a slight difference – the definition above only perturbs $x$ in the positive direction to compute the gradient. While it would have been perfectly acceptable to define the numerical gradient in this way, in practice it is often more precise and stable to use the centered difference formula, where we perturb a parameter in both directions. The intuition is that to get a better approximation of the derivative/slope around a point, we need to examine the function $f$'s behavior both to the left and right of that point. It can also be shown using Taylor's theorem that the centered difference formula has an error proportional to $\epsilon^2$, which is quite small, whereas the derivative definition is more error-prone.
Now, a natural question you might ask is: if this method is so precise, why do we not use it to compute all of our network gradients instead of applying back-propagation? The simple answer, as hinted earlier, is inefficiency – recall that every time we want to compute the gradient with respect to an element, we need to make two forward passes through the network.
Snippet 2.1

import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    grad, h = np.zeros_like(x, dtype=float), 1e-5
    for ix in np.ndindex(x.shape):
        old = x[ix]
        x[ix] = old + h; fp = f(x)      # J(theta(i+))
        x[ix] = old - h; fm = f(x)      # J(theta(i-))
        x[ix] = old                     # restore the parameter
        grad[ix] = (fp - fm) / (2 * h)  # centered difference formula
    return grad
2.2 Regularization

As with many machine learning models, neural networks are highly prone to overfitting, where a model is able to obtain near perfect performance on the training dataset, but loses the ability to generalize to unseen data. A common technique used to address overfitting (an issue also known as the "high-variance problem") is the incorporation of an L2 regularization penalty. The idea is that we will simply append an extra term to our loss function J, so that the overall cost is now calculated as:

$$J_R = J + \lambda \sum_{i=1}^{L} \left\| W^{(i)} \right\|_F$$

The Frobenius norm of a matrix $U$ is defined as follows:

$$\|U\|_F = \sqrt{\sum_i \sum_j U_{ij}^2}$$

In the above formulation, $\| W^{(i)} \|_F$ is the Frobenius norm of the matrix $W^{(i)}$ (the $i$-th weight matrix in the network) and $\lambda$ is the
hyper-parameter controlling how much weight the regularization
term has relative to the original cost function. Since we are trying
to minimize JR , what regularization is essentially doing is penaliz-
ing weights for being too large while optimizing over the original
cost function. Due to the quadratic nature of the Frobenius norm
(which computes the sum of the squared elements of a matrix), L2 -
regularization effectively reduces the flexibility of the model and
thereby reduces the overfitting phenomenon. Imposing such a con-
straint can also be interpreted as the prior Bayesian belief that the
optimal weights are close to zero – how close depends on the value
of λ. Choosing the right value of λ is critical, and must be chosen
via hyperparameter-tuning. Too high a value of λ causes most of
the weights to be set too close to 0, and the model does not learn
anything meaningful from the training data, often obtaining poor ac-
curacy on training, validation, and testing sets. Too low a value, and
we fall into the domain of overfitting once again. It must be noted
that the bias terms are not regularized and do not contribute to the
cost term above – try thinking about why this is the case!
There are indeed other types of regularization that are sometimes
used, such as L1 regularization, which sums over the absolute values
(rather than squares) of parameter elements – however, this is less
commonly applied in practice since it leads to sparsity of parameter
weights. In the next section, we discuss dropout, which effectively acts
as another form of regularization by randomly dropping (i.e. setting
to zero) neurons in the forward pass.
2.3 Dropout

Dropout is a powerful technique for regularization, first introduced by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The idea is simple yet effective – during training, we will randomly "drop" with some probability $(1 - p)$ a subset of neurons during each forward/backward pass (or equivalently, we will keep alive each neuron with a probability $p$). Then, during testing, we will use the full network to compute our predictions. The result is that the network typically learns more meaningful information from the data, is less likely to overfit, and usually obtains higher performance overall on the task at hand. One intuitive reason why this technique should be so effective is that dropout is essentially training exponentially many smaller networks at once and averaging over their predictions.

In practice, the way we introduce dropout is that we take the output h of each layer of neurons, and keep each neuron with probability p, and else set it to 0. Then, during back-propagation, we only pass gradients through neurons that were kept alive during the forward pass. Finally, during testing, we compute the forward pass using all of the neurons in the network. However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training.

Dropout applied to an artificial neural network. Image credits to Srivastava et al.
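One standard way to keep that test-time expectation matched is to divide by p during training ("inverted dropout"); the sketch below is illustrative rather than the notes' exact recipe:

import numpy as np

def dropout_forward(h, p, train=True):
    # Keep each neuron alive with probability p during training; dividing by p
    # keeps the expected output the same at test time ("inverted dropout").
    if not train:
        return h                      # full network at test time
    mask = (np.random.rand(*h.shape) < p) / p
    return h * mask                   # dropped neurons output 0

h = np.random.randn(5)
print(dropout_forward(h, p=0.8))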
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

Figure 9: The response of a sigmoid nonlinearity.

$$\sigma'(z) = \frac{\exp(-z)}{(1 + \exp(-z))^2} = \sigma(z)(1 - \sigma(z))$$

$$\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = 2\sigma(2z) - 1$$

Figure 10: The response of a tanh nonlinearity.

where $\tanh(z) \in (-1, 1)$
Hard tanh: The hard tanh function is sometimes preferred over the tanh function since it is computationally cheaper. It does however saturate for magnitudes of z greater than 1. The activation of the hard tanh is:

$$\text{hardtanh}(z) = \begin{cases} -1 & : z < -1 \\ z & : -1 \le z \le 1 \\ 1 & : z > 1 \end{cases}$$

Figure 11: The response of a hard tanh nonlinearity.

The derivative can also be expressed in a piecewise functional form:

$$\text{hardtanh}'(z) = \begin{cases} 1 & : -1 \le z \le 1 \\ 0 & : \text{otherwise} \end{cases}$$
Soft sign: The soft sign function is another nonlinearity which can be considered an alternative to tanh since it too does not saturate as easily as hard clipped functions:

$$\text{softsign}(z) = \frac{z}{1 + |z|}$$

Figure 12: The response of a soft sign nonlinearity.

The derivative is then:

$$\text{softsign}'(z) = \frac{1}{(1 + |z|)^2}$$
$$\text{rect}(z) = \max(z, 0)$$

Figure 13: The response of a ReLU nonlinearity.

The derivative is then the piecewise function:

$$\text{rect}'(z) = \begin{cases} 1 & : z > 0 \\ 0 & : \text{otherwise} \end{cases}$$

$$\text{leaky}(z) = \max(z, k \cdot z) \quad \text{where } 0 < k < 1$$

Figure 14: The response of a leaky ReLU nonlinearity.

This way, the derivative is representable as:

$$\text{leaky}'(z) = \begin{cases} 1 & : z > 0 \\ k & : \text{otherwise} \end{cases}$$
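For reference, a minimal NumPy sketch of all of the nonlinearities above (the leak coefficient value is illustrative):

import numpy as np

z = np.linspace(-2, 2, 9)

sigmoid  = 1.0 / (1.0 + np.exp(-z))
tanh     = np.tanh(z)
hardtanh = np.clip(z, -1.0, 1.0)
softsign = z / (1.0 + np.abs(z))
relu     = np.maximum(z, 0.0)
k = 0.01                             # leak coefficient, 0 < k < 1
leaky    = np.maximum(z, k * z)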
Mean Subtraction
Given a set of input data X, it is customary to zero-center the data by
subtracting the mean feature vector of X from X. An important point
is that in practice, the mean is calculated only across the training set,
and this mean is subtracted from the training, validation, and testing
sets.
Normalization
Another frequently used technique (though perhaps less so than
mean subtraction) is to scale every input feature dimension to have
similar ranges of magnitudes. This is useful since input features are
often measured in different “units”, but we often want to initially
consider all features as equally important. The way we accomplish
this is by simply dividing the features by their respective standard
deviation calculated across the training set.
Whitening

Not as commonly used as mean-subtraction + normalization, whitening essentially converts the data to have an identity covariance matrix – that is, features become uncorrelated and have a variance of 1. This is done by first mean-subtracting the data, as usual, to get X'. We can then take the Singular Value Decomposition (SVD) of X' to get matrices U, S, V. We then compute $U^T X'$ to project X' into the basis defined by the columns of U. We finally divide each dimension of the result by the corresponding singular value in S to scale our data appropriately (if a singular value is zero, we can just divide by a small number instead).
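A minimal sketch of this whitening recipe, assuming one data point per column of X (shapes and data are made up):

import numpy as np

X = np.random.randn(5, 200)
Xc = X - X.mean(axis=1, keepdims=True)       # mean-subtract across examples
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Xrot = U.T @ Xc                              # project into the basis of U's columns
Xwhite = Xrot / (S[:, None] + 1e-8)          # divide each dimension by its singular value

# Rows of Xwhite are now uncorrelated (covariance proportional to the identity).
print(np.cov(Xwhite))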
where $n^{(l)}$ is the number of input units to W (fan-in) and $n^{(l+1)}$ is the number of output units from W (fan-out). In this parameter initialization scheme, bias units are initialized to 0. This approach attempts to maintain activation variances as well as backpropagated gradient variances across layers. Without such initialization, the gradient variances (which are a proxy for information) generally decrease with backpropagation across layers.
You might think that for fast convergence rates, we should set $\alpha$ to larger values – however, faster convergence is not guaranteed with larger learning rates. In fact, with very large learning rates, we might experience that the loss function actually diverges because the parameter update causes the model to overshoot the convex minima, as shown in Figure 15. In non-convex models (most of those we work with), the outcome of a large learning rate is unpredictable, but the chances of diverging loss functions are very high.

Figure 15: Here we see that updating parameter $w_2$ with a large learning rate can lead to divergence of the error.

The simple solution to avoiding a diverging loss is to use a very small learning rate so that we carefully scan the parameter space – of course, if we use too small a learning rate, we might not converge in a reasonable amount of time, or might get caught in local minima. Thus, as with any other hyperparameter, the learning rate must be tuned effectively.
Since training is the most expensive phase in a deep learning system, some research has attempted to improve this naive approach to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight $W_{ij}$ (where $W \in \mathbb{R}^{n^{(l+1)} \times n^{(l)}}$) by the inverse square root of the fan-in of the neuron ($n^{(l)}$).

$$\alpha(t) = \frac{\alpha_0 \tau}{\max(t, \tau)}$$

In the above scheme, $\alpha_0$ is a tunable parameter and represents the starting learning rate. $\tau$ is also a tunable parameter and represents the time at which the learning rate should start reducing. In practice, this method has been found to work quite well. In the next section we discuss another method for adaptive gradient descent which does not require hand-set learning rates.
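A minimal sketch of this annealing schedule (the alpha0 and tau values are illustrative):

def alpha(t, alpha0=0.1, tau=1000):
    return alpha0 * tau / max(t, tau)

print(alpha(1))      # 0.1: constant before step tau
print(alpha(2000))   # 0.05: decays as alpha0 * tau / t afterwards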
Snippet 2.2

# Computes a standard momentum update on parameters x,
# with velocity v (initialized to zeros) and momentum mu (e.g. 0.9)
v = mu*v - alpha*grad_x
x += v
$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i} \quad \text{where} \quad g_{t,i} = \frac{\partial}{\partial \theta_i^t} J_t(\theta)$$
Snippet 2.3

# Assume the gradient dx and parameter vector x,
# with cache initialized to np.zeros_like(x)
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)
Snippet 2.4
# Update rule for RMS prop
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
Snippet 2.5
# Update rule for Adam
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
Transitions:

There are three types of transitions between states:

Figure 2: Transitions for Dependency Parsing.

1. Shift: Remove the first word in the buffer and push it on top of the stack. (Pre-condition: buffer has to be non-empty.)

2. Left-Arc: Add a dependency arc $w_j \rightarrow w_i$ to the arc set, where $w_i$ is the word second to the top of the stack and $w_j$ is the word at the top of the stack. Remove $w_i$ from the stack. (Pre-condition: the stack needs to contain at least two items and $w_i$ cannot be the ROOT.)

3. Right-Arc: Add a dependency arc $w_i \rightarrow w_j$ to the arc set, where $w_i$ is the word second to the top of the stack and $w_j$ is the word at the top of the stack. Remove $w_j$ from the stack. (Pre-condition: the stack needs to contain at least two items.)
Feature Selection:

Softmax layer:

$$p = \text{softmax}(W_2 h)$$

Hidden layer:

$$h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3$$
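A minimal sketch of this two-layer scoring network with the cube activation; the dimensions and the inputs x^w, x^t, x^l (concatenated word, POS-tag, and arc-label embeddings) are illustrative assumptions:

import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

dw, dt, dl, dh, n_trans = 900, 900, 600, 200, 3
xw, xt, xl = np.random.randn(dw), np.random.randn(dt), np.random.randn(dl)

W1w, W1t, W1l = (0.01 * np.random.randn(dh, d) for d in (dw, dt, dl))
b1 = np.zeros(dh)
W2 = 0.01 * np.random.randn(n_trans, dh)

h = (W1w @ xw + W1t @ xt + W1l @ xl + b1) ** 3   # cube activation
p = softmax(W2 @ h)                               # distribution over transitions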
Further reading:
Danqi Chen, and Christopher D. Manning. "A Fast and Accurate
Dependency Parser using Neural Networks." EMNLP. 2014.
Kuebler, Sandra, Ryan McDonald, and Joakim Nivre. “Depen-
dency parsing.” Synthesis Lectures on Human Language Technolo-
gies 1.1 (2009): 1-127.
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part V
Language Models, RNN, GRU and LSTM
Course Instructors: Christopher Manning, Richard Socher
Authors: Milad Mohammadi, Rohit Mundra, Richard Socher, Lisa Wang, Amita Kamath
Winter 2019
1 Language Models
1.1 Introduction
Language models compute the probability of occurrence of a number
of words in a particular sequence. The probability of a sequence of
m words {w1 , ..., wm } is denoted as P(w1 , ..., wm ). Since the number
of words coming before a word, wi , varies depending on its location
in the input document, P(w1 , ..., wm ) is usually conditioned on a
window of n previous words rather than all previous words:
$$P(w_1, ..., w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, ..., w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n}, ..., w_{i-1}) \quad (1)$$

$$p(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2)}{\text{count}(w_1)} \quad (2)$$

$$p(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1, w_2, w_3)}{\text{count}(w_1, w_2)} \quad (3)$$
The relationship in Equation 3 focuses on making predictions
based on a fixed window of context (i.e. the n previous words) used
to predict the next word. But how long should the context be? In
some cases, the window of past consecutive n words may not be suf-
ficient to capture the context. For instance, consider the sentence "As
the proctor started the clock, the students opened their ". If the
window only conditions on the previous three words "the students
opened their", the probabilities calculated based on the corpus may
suggest that the next word be "books" - however, if n had been large
enough to include the "proctor" context, the probability might have
suggested "exam".
This leads us to two main issues with n-gram Language Models:
Sparsity and Storage.
What is interesting here is that the same weights $W^{(hh)}$ and $W^{(hx)}$ are applied repeatedly at each timestep. Thus, the number of parameters the model has to learn is smaller, and most importantly, is independent of the length of the input sequence - thus defeating the curse of dimensionality!

Below are the details associated with each parameter in the network:

Figure 4: The inputs and outputs to a neuron of an RNN.

• $x_1, ..., x_{t-1}, x_t, x_{t+1}, ... x_T$: the word vectors corresponding to a corpus with T words.
$$J = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \times \log(\hat{y}_{t,j}) \quad (8)$$

$$\text{Perplexity} = 2^J \quad (9)$$
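A minimal sketch of Equations 8-9 in NumPy; base-2 logs are used in the loss so that the exponent base in Equation 9 matches (predictions and targets are made up):

import numpy as np

T, V = 4, 6
y_hat = np.random.rand(T, V); y_hat /= y_hat.sum(axis=1, keepdims=True)
targets = np.random.randint(0, V, size=T)      # index of the true word at each step

J = -np.mean(np.log2(y_hat[np.arange(T), targets]))
perplexity = 2.0 ** J
print(J, perplexity)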
2. The model size does not increase for longer input sequence
lengths
3. Computation for step t can (in theory) use information from many
steps back.
machine translation, and many other tasks). In the latter two applica-
tions, we want a representation for the sentence, which we can obtain
by taking the element-wise max or mean of all hidden states of the
timesteps in that sentence.
Note: Figure 6 is an alternative representation of RNNs used in
some publications. It represents the RNN hidden layer as a loop.
Sentence 1
"Jane walked into the room. John walked in too. Jane said hi to ___"
Sentence 2
"Jane walked into the room. John walked in too. It was late in the
day, and everyone was walking home after a long day at work. Jane
said hi to ___"
In both sentences, given their context, one can tell the answer to both
blank spots is most likely "John". It is important that the RNN predicts
the next word as "John", the second person who has appeared several
time-steps back in both contexts. Ideally, this should be possible given
what we know about RNNs so far. In practice, however, it turns out
RNNs are more likely to correctly predict the blank spot in Sentence
1 than in Sentence 2. This is because during the back-propagation
phase, the contribution of gradient values gradually vanishes as they
propagate to earlier timesteps, as we will show below. Thus, for long
sentences, the probability that "John" would be recognized as the next
word reduces with the size of the context. Below, we discuss the math-
ematical reasoning behind the vanishing gradient problem.
Consider Equations 5 and 6 at a time-step t; to compute the RNN error, $dE/dW$, we sum the error at each time-step. That is, $dE_t/dW$ for every time-step, $t$, is computed and accumulated:

$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial W} \quad (10)$$
$$\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le \left\| W^T \right\| \left\| \text{diag}[f'(h_{j-1})] \right\| \le \beta_W \beta_h \quad (15)$$

The norm of both matrices is calculated through taking their L2-norm. The norm of $f'(h_{j-1})$ can only be as large as 1 given the sigmoid non-linearity function.

$$\left\| \frac{\partial h_t}{\partial h_k} \right\| = \left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \le (\beta_W \beta_h)^{t-k} \quad (16)$$
...descent step. When the gradient descent model hits the high error wall in the objective function, the gradient is pushed off to a far-away location on the decision surface. The clipping model produces the dashed line where it instead pulls back the error gradient to somewhere close to the original gradient landscape.

Figure 7: Gradient explosion clipping visualization. The plot (from Pascanu et al.) shows the error surface of a single-hidden-unit recurrent network, highlighting the existence of high-curvature walls; solid lines depict standard trajectories that gradient descent might follow, while the dashed arrows show what happens if the gradient is rescaled to a fixed size when its norm is above a threshold.

To solve the problem of vanishing gradients, we introduce two techniques. The first technique is that instead of initializing $W^{(hh)}$ randomly, we start off from an identity matrix initialization.

The second technique is to use the Rectified Linear Unit (ReLU) instead of the sigmoid function. The derivative for the ReLU is either 0 or 1. This way, gradients would flow through the neurons whose derivative is 1 without getting attenuated while propagating back through time-steps.
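For completeness, a minimal sketch of the gradient-norm clipping heuristic described above (the threshold value is illustrative):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    # Rescale the gradient to a fixed size whenever its norm exceeds the threshold.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.random.randn(100) * 10
print(np.linalg.norm(clip_gradient(g)))   # at most 5.0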
$$\hat{y}_t = g(U h_t + c) = g(U [\overrightarrow{h}_t; \overleftarrow{h}_t] + c) \quad (19)$$

RNNs can also be multi-layered. Figure 9 shows a multi-layer bi-directional RNN where each lower layer feeds the next layer. As shown in this figure, in this network architecture, at time-step t each intermediate neuron receives one set of parameters from the previous time-step (in the same RNN layer), and two sets of parameters from the previous RNN hidden layer: one input comes from the left-to-right RNN and the other from the right-to-left RNN.

To construct a deep RNN with L layers, the above relationships are modified to the relationships in Equations 20 and 21, where the input to each intermediate neuron at level i is the output of the RNN at layer i-1 at the same time-step, t. The output, $\hat{y}$, at each time-step is the result of propagating input parameters through all hidden layers (Equation 22).

Figure 9: A deep bi-directional RNN with three RNN layers. Each memory layer passes an intermediate sequential representation to the next.

$$\overrightarrow{h}_t^{(i)} = f(\overrightarrow{W}^{(i)} h_t^{(i-1)} + \overrightarrow{V}^{(i)} \overrightarrow{h}_{t-1}^{(i)} + \overrightarrow{b}^{(i)}) \quad (20)$$

$$\overleftarrow{h}_t^{(i)} = f(\overleftarrow{W}^{(i)} h_t^{(i-1)} + \overleftarrow{V}^{(i)} \overleftarrow{h}_{t+1}^{(i)} + \overleftarrow{b}^{(i)}) \quad (21)$$

$$\hat{y}_t = g(U h_t + c) = g(U [\overrightarrow{h}_t^{(L)}; \overleftarrow{h}_t^{(L)}] + c) \quad (22)$$
$$\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y^{(n)} \mid x^{(n)}) \quad (26)$$

Extension II: compute every hidden state in the decoder using three different inputs:

• The previous hidden state (standard)

• Last hidden layer of the encoder ($c = h_T$ in Figure 11)

• The previous predicted output word, $y_{t-1}$

Figure 11: Language model with three inputs to each decoder neuron: $(h_{t-1}, c, y_{t-1})$.
Beyond the extensions discussed so far, RNNs have been found to per-
form better with the use of more complex units for activation. So far,
we have discussed methods that transition from hidden state ht−1 to
ht using an affine transformation and a point-wise nonlinearity. Here,
we discuss the use of a gated activation function thereby modifying
the RNN architecture. What motivates this? Well, although RNNs
can theoretically capture long-term dependencies, they are very hard
to actually train to do this. Gated recurrent units are designed in a
manner to have more persistent memory thereby making it easier for
RNNs to capture long-term dependencies. Let us see mathematically
how a GRU uses ht−1 and xt to generate the next hidden state ht . We
will then dive into the intuition of this architecture.
$$z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1}) \quad \text{(Update gate)}$$

$$r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1}) \quad \text{(Reset gate)}$$

$$\tilde{h}_t = \tanh(r_t \circ U h_{t-1} + W x_t) \quad \text{(New memory)}$$

$$h_t = (1 - z_t) \circ \tilde{h}_t + z_t \circ h_{t-1} \quad \text{(Hidden state)}$$
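A minimal NumPy sketch of one GRU step following these four equations (the parameter names in the dict are hypothetical):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    # One GRU update; p holds the weight matrices W, U and the gate-specific
    # Wz, Uz, Wr, Ur from the equations above.
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)              # reset gate
    h_tilde = np.tanh(r * (p["U"] @ h_prev) + p["W"] @ x_t)    # new memory
    return (1 - z) * h_tilde + z * h_prev                      # hidden state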
...the different parameters: $W$, $U$, $W^{(r)}$, $U^{(r)}$, $W^{(z)}$, $U^{(z)}$. These follow the same backpropagation procedure we have seen in the past.
4 Long Short-Term Memories

$$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1}) \quad \text{(Input gate)}$$

$$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1}) \quad \text{(Forget gate)}$$

$$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1}) \quad \text{(Output/Exposure gate)}$$

$$\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1}) \quad \text{(New memory cell)}$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \quad \text{(Final memory cell)}$$

$$h_t = o_t \circ \tanh(c_t)$$
2. Input Gate: We see that the new memory generation stage doesn’t
check if the new word is even important before generating the new
memory – this is exactly the input gate’s function. The input gate
uses the input word and the past hidden state to determine whether
or not the input is worth preserving and thus is used to gate the new
memory. It thus produces it as an indicator of this information.
3. Forget Gate: This gate is similar to the input gate except that it
does not make a determination of usefulness of the input word –
instead it makes an assessment on whether the past memory cell is
useful for the computation of the current memory cell. Thus, the
forget gate looks at the input word and the past hidden state and
produces f t .
4. Final memory generation: This stage first takes the advice of the
forget gate f t and accordingly forgets the past memory ct−1 . Sim-
ilarly, it takes the advice of the input gate it and accordingly gates
the new memory c̃t . It then sums these two results to produce the
final memory ct .
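A minimal NumPy sketch of one LSTM step following the equations above (the parameter names in the dict are hypothetical):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # One LSTM update; p holds a W and U matrix for each gate.
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)        # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)        # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)        # output/exposure gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)  # new memory cell
    c = f * c_prev + i * c_tilde                          # final memory cell
    h = o * np.tanh(c)                                    # hidden state
    return h, c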
2 Attention Mechanism

...in using the final RNN hidden state as the single "context vector" for sequence-to-sequence models: often, different parts of an input have different levels of significance.
1. Encoder
Let (h1 , . . . , hn ) be the hidden vectors representing the input sen-
tence. These vectors are the output of a bi-LSTM for instance, and
capture contextual representation of each word in the sentence.
2. Decoder
We want to compute the hidden states $s_i$ of the decoder using a recursive formula of the form

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

$$e_{i,j} = a(s_{i-1}, h_j)$$

$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{n} \exp(e_{i,k})}$$
3 Other Models

...attention vector over the encoder hidden states. We can use one of the following scoring functions:

$$\text{score}(h_i, \bar{h}_j) = \begin{cases} h_i^T \bar{h}_j \\ h_i^T W \bar{h}_j \\ W[h_i; \bar{h}_j] \end{cases} \in \mathbb{R}$$

$$c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j$$
and we can use the context vector and the hidden state to compute a new vector for the i-th time step of the decoder:

$$\tilde{h}_i = f([\bar{h}_i, c_i])$$

The final step is to use the $\tilde{h}_i$ to make the final prediction of the decoder. To address the issue of coverage, Luong et al. also use an input-feeding approach. The attentional vectors $\tilde{h}_i$ are fed as input to the decoder, instead of the final prediction. This is similar to Bahdanau et al., who use the context vectors to compute the hidden vectors of the decoder.

The main takeaway of this discussion is to show that there are lots of ways of doing attention.
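As one concrete instance, here is a minimal sketch of the simplest (dot-product) scoring choice, from encoder states to attention weights to the context vector (shapes are illustrative):

import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

n, d = 6, 8
h = np.random.randn(n, d)          # encoder hidden vectors h_1 .. h_n
s = np.random.randn(d)             # current decoder hidden state

e = h @ s                          # scores e_j = s . h_j
alpha = softmax(e)                 # attention weights, sum to 1
c = alpha @ h                      # context vector c = sum_j alpha_j h_j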
As the search space can be huge, we need to shrink its size. Here is a list of sequence model decoders (both good ones and bad ones).

Ancestral sampling: at each time step, sample the next token from the model's distribution:

$$x_t \sim P(x_t \mid x_1, \ldots, x_n)$$

Greedy search: at each time step, pick the most probable token:

$$x_t = \text{argmax}_{\tilde{x}_t} P(\tilde{x}_t \mid x_1, \ldots, x_n)$$

Beam search: maintain K candidates at each time step, expand each candidate by one token, and keep the K most likely of the resulting hypotheses:

$$\tilde{H}_{t+1} = \bigcup_{k=1}^{K} \tilde{H}_{t+1}^{k}$$

where $\tilde{H}_{t+1}^{k}$ is the set of one-token extensions of the $k$-th candidate.
The BLEU score looks for whether n-grams in the machine translation also appear in the reference translation. Color-coded below are some examples of different size n-grams that are shared between the reference and candidate translation.

Let $p_n$ denote the precision score for the grams of length n, and let $w_n = 1/2^n$ be a geometric weighting for the precision of the n-th gram. Our brevity penalty is defined as

$$\beta = e^{\min\left(0, \; 1 - \frac{\text{len}_{\text{ref}}}{\text{len}_{\text{MT}}}\right)}$$

$$\text{BLEU} = \beta \prod_{n=1}^{k} p_n^{w_n}$$
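A minimal single-sentence sketch of this computation (real BLEU implementations work at the corpus level and add smoothing; this toy version is only illustrative):

from collections import Counter
import math

def bleu(candidate, reference, k=4):
    weights = [1.0 / 2 ** n for n in range(1, k + 1)]   # geometric weights w_n = 1/2^n
    log_score = 0.0
    for n in range(1, k + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        p_n = overlap / max(1, sum(cand.values()))      # clipped n-gram precision
        if p_n == 0:
            return 0.0                                   # any zero precision zeroes BLEU
        log_score += weights[n - 1] * math.log(p_n)
    brevity = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))
    return brevity * math.exp(log_score)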
The BLEU score has been reported to correlate well with human judgment of good translations, and so remains a benchmark for all evaluation metrics following it. However, it does have many limitations. It only works well on the corpus level because any zeros in precision scores will zero the entire BLEU score. Additionally, the BLEU score as presented suffers from only comparing a candidate translation against a single reference, which is surely a noisy representation of the relevant n-grams that need to be matched. Variants of BLEU have modified the algorithm to compare the candidate with multiple reference examples.
Despite the success of modern NMT systems, they have a hard time dealing with large vocabulary sizes. Specifically, these Seq2Seq models predict the next word in the sequence by computing a target probability distribution over the entire vocabulary using softmax. It turns out that softmax can be quite expensive to compute with a large vocabulary and its complexity also scales proportionally to the vocabulary size. We will now examine a number of approaches to address this issue.

2. Hierarchical Softmax

Morin et al. (2005, Hierarchical Probabilistic Neural Network Language Model) introduced a binary tree structure to more efficiently compute "softmax". Each probability in the target distribution is computed along a path from the root of the tree to the corresponding leaf.
One limitation for both methods is that they only save computation during the training step (when the target word is known). At test time, one still has to compute the probability of all words in the vocabulary in order to make predictions.
At test time, one can similarly predict the target word out of a selected subset of the entire vocabulary, called a candidate list. The challenge is that the correct target word is unknown and we have to "guess" what the target word might be. In the paper, the authors proposed to construct a candidate list for each source sentence using the K most frequent words (based on unigram probability) and the K' likely target words for each source word in the sentence. In Figure 8, an example is shown with K' = 3 and the candidate list consists of all the words in purple boxes. In practice, one can choose the following values: K = 15k, 30k, 50k and K' = 10, 20.
$$e_w = W^f H^f + W^b H^b + b$$
...with unknown words and achieve open-vocabulary NMT (Luong et al., Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models). The system translates mostly at the word level and consults the character components for rare words. On a high level, the character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The twofold advantage of such a hybrid approach is that it is much faster and easier to train than character-based ones; at the same time, it never produces unknown words as in the case of word-based models.

Word-based Translation as a Backbone: The core of the hybrid NMT is a deep LSTM encoder-decoder that translates at the word level. We maintain a vocabulary of size |V| per language and use <unk> to represent out-of-vocabulary words.

Source Character-based Representation: In regular word-based NMT, a universal embedding for <unk> is used to represent all out-of-vocabulary words. This is problematic because it discards valuable information about the source words. Instead, we learn a deep LSTM model over the characters of rare words, and use the final hidden state of the LSTM as the representation for the rare word (Figure 11).

Figure 11: Hybrid NMT.

Target Character-level Generation: General word-based NMT allows generation of <unk> in the target output. Instead, the goal here is to create a coherent framework that handles an unlimited output vocabulary. The solution is to have a separate deep LSTM that "translates" at the character level when needed.
$$y_t = \text{softmax}(W^{(a)} a_t)$$

$$a_t = \text{GRU}([y_{t-1}, q], a_{t-1})$$

1.5 Experiments

Through the experiments we can see that the DMN is able to outperform MemNN on bAbI question answering tasks, and it can outperform other architectures for sentiment analysis and part-of-speech tagging. How many episodes are needed in the episodic memory? The answer is that the harder the task is, the more passes are required. Multiple passes also allow the network to truly comprehend the sentence by paying attention to only relevant parts for the final task, instead of reacting to just the information from the word embedding.
The key idea is to modularize the system; you can allow different types of input by changing the input module. For example, if we replace the input module with a convolutional neural network-based module, then this architecture can handle a task called visual question answering (VQA). It is also able to outperform other models in this task.
1.6 Summary
The zeal to search for a general architecture that would solve all
problems has slightly faded since 2015, but the desire to train on one
domain and generalize to other domains has increased. To compre-
hend more advanced modules for question answering, readers can
refer to the dynamic coattention network (DCN).
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part VIII
Convolutional Neural Networks
Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Richard Socher
Winter 2019
...right-hand-side as in Figure 3.

1.4 Pooling

Assuming that we don't use zero-padding, we will get a final convolutional output, c, which has n - h + 1 numbers. Typically, we want to take the outputs of the CNN and feed them as input to further layers like a Feedforward Neural Network or a RecNN. But all of those need a fixed-length input, while our CNN output has a length dependent on the length of the sentence, n. One clever way to fix this problem is to use max-pooling. The output of the CNN, $c \in \mathbb{R}^{n-h+1}$, is the input to the max-pooling layer. The output of the max-pooling layer is $\hat{c} = \max\{c\}$, thus $\hat{c} \in \mathbb{R}$.

Figure 2: Single-Layer Convolution: one-step. Figure 3: Single-Layer Convolution: all-steps.

We could also have used min-pooling, but typically we use ReLU as our non-linear activation function and ReLU is bounded on the low side to 0. Hence a min-pool layer might get smothered by ReLU, so we nearly always use max-pooling over min-pooling.
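A minimal sketch of a single bigram filter plus max-pooling over a made-up sentence (applying ReLU in the convolution step is an assumption, matching the activation discussed above):

import numpy as np

d, h, n = 4, 2, 7
words = np.random.randn(n, d)          # hypothetical sentence of n word vectors
w = np.random.randn(h * d)             # one filter over h-grams (here bigrams)
b = 0.0
c = np.array([np.maximum(w @ words[i:i+h].reshape(-1) + b, 0)   # conv output
              for i in range(n - h + 1)])                        # c has n-h+1 entries
c_hat = c.max()                         # max-pooling gives a fixed-size output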
1.5 Multiple-Filters

In the example above related to Figure 2, we had h = 2, meaning we looked only at bi-grams with a single specific combination method, i.e. filter. We can use multiple bi-gram filters because each filter will learn to recognize a different kind of bi-gram. Even more generally, we are not restricted to using just bi-grams: we can also have filters using tri-grams, quad-grams and even higher lengths. Each filter has an associated max-pool layer. Thus, our final output from the CNN layers will be a vector having length equal to the number of filters.
1.6 Multiple-Channels
If we allow gradients to flow into the word-vectors being used here,
then the word-vectors might change significantly over training. This
is desirable, as it specializes the word-vectors to the specific task at
hand (away from say GloVe initialization). But, what about words
that appear only in the test set but not in the train set? While other
semantically related word vectors which appear in the train set will
have moved significantly from their starting point, such words will
still be at their initialization point. The neural network will be spe-
cialized for inputs which have been updated. Hence, we will get low
performance on sentences with such words (words that are in test but
not in train).
One work-around is to maintain two sets of word-vectors, one 'static' (no gradient flow into them) and one 'dynamic', which are updated via SGD. Both are initially the same (GloVe or other initialization).
...and derive what the embedded vector should be. Now let's first ask a very debated question: is it naive to believe that the vector space we use to represent all words is sufficiently expressive to also be able to represent all sentences of any length? While this may be unintuitive, the performance of these models suggests that this is actually a reasonable thing to do.
Let's first discuss the difference between semantic and grammatical understanding of a sentence. Semantic analysis is an understanding of the meaning of a sentence: being able to represent the phrase as a vector in a structured semantic space, where similar sentences are very nearby and unrelated sentences are very far away. Grammatical understanding means we have identified the underlying grammatical structure of the sentence: which part of the sentence depends on which other part, which words are modifying which other words, and so on. The output of such an understanding is usually represented as a parse tree, as displayed in Figure 2.
Now for the million-dollar question: if we want a semantic representation, is it an advantage, nay, a requirement, to have a grammatical understanding? Some might disagree, but for now we will treat this semantic composition task the following way. First, we need to understand words. Then, we need to know how words are put together. Finally, we can get to the meaning of a phrase or sentence by leveraging these two previous concepts.
So let's begin with our first model built on this principle. Imagine we were given a sentence, and we knew the parse tree for that sentence, such as the one displayed in Figure 2. Could we figure out an encoding for the sentence, and perhaps also a sentiment score, just from the word vectors in the sentence? We observe how a Simple RNN (a recursive neural network) can perform this task.
We now take h^(1) and put it through a softmax layer to get a score over a set of sentiment classes, a discrete set of known classes that represent some meaning. In the case of positive/negative sentiment analysis, we would have 5 classes: class 0 implies strongly negative, class 1 implies negative, class 2 is neutral, class 3 is positive, and finally class 4 is strongly positive.
Now we do the same thing with "I" and "love" to produce the vector h^(1) for the phrase "I love". Again, we compute a score over the sentiment classes for that phrase. Finally, for the most interesting step, we need to merge the two phrases "I love" and "this assignment". Here we are concatenating phrase vectors, rather than word vectors! We do this in the same manner, concatenating the two h^(1) vectors and computing

h^(1) = tanh( W^(1) [ h^(1)_Left ; h^(1)_Right ] + b^(1) )    (2)

where [ · ; · ] denotes vertical concatenation.
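A minimal NumPy sketch of this composition step; the dimension d and the random child vectors are assumptions for the example:

import numpy as np

d = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d))  # composition matrix W^(1)
b = np.zeros(d)                  # bias b^(1)

def compose(h_left, h_right):
    # merge two phrase vectors into one parent phrase vector
    return np.tanh(W @ np.concatenate([h_left, h_right]) + b)

h_i_love = compose(rng.normal(size=d), rng.normal(size=d))  # e.g. "I" + "love"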
The categories are determined via a very simple Probabilistic Context-Free Grammar (PCFG), which is more or less learned by computing summary statistics over the Penn Treebank to learn rules such as "The" is always a DT, etc. No deeper understanding of this part is really necessary; just know that it's really simple.
The only other major difference in this model is that we initialize the W's to the identity. This way, the default behavior is to average the two word vectors coming in. Slowly but surely, the model learns which vector is more important, as well as any rotation or scaling of the vectors that improves performance. We observe in Figure 5 that the trained weight matrices learn actual meaning! For example, the DT-NP rule, a Determiner followed by a Noun Phrase such as "The cat" or "A man", puts more emphasis on the Noun Phrase than on the Determiner. (This is apparent because the right diagonals are red, meaning higher weights.) This is the notion of soft head words, something linguists have long observed to be true, yet here the model learned it on its own just by looking at data. Pretty cool!
The SU-RNN does indeed outperform previously discussed models, but perhaps it is still not expressive enough. (Figure 5: The learnt W weights for DT-NP composition match linguists' theory.) If we think of modifying words, such as adverbs like "very", any interpolation of this word vector with the following one is definitely not what the understood nature of "very" is.
As an adverb, its literal definition is "used for emphasis". How can we have a vector that emphasizes any other vector that follows when we are solely performing a linear interpolation? How can we construct a vector that will "scale" any other vector this way? The truth is that we cannot. We need some form of multiplication of one word on another word. We uncover two such compositions below that enable this: the first utilizes word matrices, and the other utilizes a quadratic equation over the typical affine transformation.
2 Constituency Parsing
2.1 Constituent
In syntactic analysis, a constituent can be a single word or a phrase acting as a single unit within a hierarchical structure. A phrase is a sequence of two or more words built around a head lexical item, working as a unit within a sentence. To be a phrase, a group of words should come together to play a specific role in the sentence. In addition, the group of words can be moved together or replaced as a whole, and the sentence should remain fluent and grammatical.
For instance, the following sentence contains the noun phrase "wonderful CS224N". (We interpret large text units by semantic composition of smaller elements; these smaller elements can be changed while keeping a similar meaning, as in the following examples.)
• I want to be enrolled in the wonderful CS224N!
We can rewrite the sentence by moving that whole phrase to the front, as below.
• S -> NP VP
• Colorless green ideas sleep furiously
(Figure 12: Constituency Parse Tree for 'Colorless green ideas sleep furiously')
February 2019

Contents
1 Introduction
2 Choosing a Project Topic
3 Project Advice
  3.1 Define your goals
  3.2 Processing data
  3.3 Data hygiene
  3.4 Build strong baselines
  3.5 Training and debugging neural models
  3.6 Evaluation
1 Introduction
These notes complement the information provided in Lecture 9 of CS224n 2019, Practical Tips for Final Projects. In particular, this document contains lots of links to useful resources. This document does not contain a detailed specification for the project deliverables (proposal, milestone, poster and report) – those specifications will be released separately.
2 Choosing a Project Topic
Project Suitability You can choose any topic related to Deep Learning for
NLP. That means your project should make substantive use of deep learning
and substantive use of human language data.
2.1 Expectations
Your project should aim to provide some kind of scientific knowledge gain,
similar to typical NLP research papers (see Section 2.2 on where to find them).
A typical case is that your project will show that your proposed method provides
good results on an NLP task you’re dealing with.
Given that you only have a few weeks to work on your project, it is not
necessary that your method beats the state-of-the-art performance, or works
better than previous methods. But it should at least show performance broadly
expected of the kinds of methods you’re using.
In any case, your paper should try to provide reasoning explaining the behaviour of your model. You will need to provide some qualitative analysis, which will be useful for supporting or testing your explanations. This will be particularly important when your method is not working as well as expected.
Ultimately, your project will be graded holistically, taking into account many criteria: originality, performance of your methods, complexity of the techniques you used, thoroughness of your evaluation, amount of work put into the project, analysis quality, writeup quality, etc.
1 https://docs.google.com/document/d/1Ytncuq6tpiSGHsJBkdzskMf0nw4_
x2AJ1rZ7RvpOv5E/edit?usp=sharing
2.2 Finding existing research
Generally, it's much easier to define your project (see Section 3.1) if there is existing published research using the same or similar task, dataset, approaches, and/or evaluation metrics. Identifying existing relevant research (and even existing code) will ultimately save you time, as it will provide a blueprint of how you might sensibly approach the project. There are many ways to find relevant research papers:
• You could browse recent publications at any of the top venues where
NLP and/or Deep Learning research is published: ACL, EMNLP, TACL,
NAACL, EACL, NIPS, ICLR, ICML (Not exhaustive!)
• In particular, publications at many NLP venues are indexed at
http://www.aclweb.org/anthology/
• Try a keyword search at:
– http://arxiv.org/
– http://scholar.google.com
– http://dl.acm.org/
– http://aclasb.dfki.de/
• Look at publications from the Stanford NLP group
https://nlp.stanford.edu/pubs/
• Datahub has lots of datasets, though not all of them are Machine Learning focused.
https://datahub.io/collections
• Microsoft Research has a collection of datasets (look under the ‘Dataset
directory’ tab):
http://research.microsoft.com/en-US/projects/data-science-initiative/
datasets.aspx
• A script to search arXiv papers for a keyword, and extract important
information such as performance metrics on a task.
https://huyenchip.com/2018/10/04/sotawhat.html
• A collection of links to more collections of links to datasets!
http://kevinchai.net/datasets
• A collection of papers with code on many NLP tasks.
https://paperswithcode.com/sota
• Datasets for machine translation.
http://statmt.org
• Syntactic corpora for many languages.
https://universaldependencies.org
3 Project Advice
3.1 Define your goals
At the very beginning of your project, it’s important to clearly define your
goals in your mind and make sure everyone in your team understands them. In
particular:
• Clearly define the task. What’s the input and what’s the output? Can
you give an example? If the task can’t be framed as input and output,
what exactly are you trying to achieve?
• What dataset(s) will you use? Is that dataset already organized into the
input and output sections described above? If not, what’s your plan to
obtain the data in the format that you need?
• What is your evaluation metric (or metrics)? This needs to be a well-
defined, numerical metric (e.g. ROUGE score), not a vague idea (e.g.
‘summary quality’). See section 3.6 for more detail on how to evaluate
your methods.
• What does success look like for your project? For your chosen evaluation
metrics, what numbers represent expected performance, based on previous
research? If you’re doing an analysis or theoretical project, define your
hypothesis and figure out how your experiments will confirm or negate
your hypothesis.
3.2 Processing data
• spaCy, another Python package that can do preprocessing, but also includes neural models (e.g. Language Models):
https://spacy.io/
3.3 Data hygiene
At the beginning of your project, split your data set into training data (most of
your data), development data (also known as validation data) and test data.
A typical train/dev/test split might be 90/5/5 percent (assigned randomly).
Many NLP datasets come with predefined splits, and if you want to compare
against existing work on the same dataset, you should use the same split as used
in that work. Here is how you should use these data splits in your project:
1. Training data: Use this (and only this data!) to optimize the parameters
of your neural model.
2. Development data: This has two main uses. The first is to compare the
performance of your different models (or versions of the same model) by
computing the evaluation metric on the development data. This enables
you to choose the best hyperparameters and/or architectural choices that
should be evaluated on the test data. The second important usage of
development data is to decide when to stop training your model. Two simple and common methods for deciding when to stop training are (method (a) is sketched in code after this list):
(a) Every epoch (or every N training iterations, where N is predefined), record the performance of the current model on the development set and store the current model as a checkpoint. If development performance is worse than on the previous evaluation (alternatively, if it fails to beat the best performance M times in a row, where M is predefined), stop training and keep the best checkpoint.
(b) Train for E epochs (where E is some predefined number) and, after each epoch, record the performance of the current model on the development set and store the current model as a checkpoint. Once the E epochs are finished, stop training and keep the best checkpoint.
3. Test data: At the end of your project, evaluate your best trained model(s) on the test data to compute your final performance metric. To be scientifically honest, you should only use the training and development data to select which models to evaluate on the test set.
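Here is a minimal sketch of method (a), patience-based early stopping. The helpers train_one_epoch, evaluate_on_dev and save_checkpoint are hypothetical stand-ins for your own training, dev-set evaluation and checkpointing routines:

def train_with_early_stopping(model, M=3, max_epochs=100):
    # Method (a): stop once dev performance fails to improve M times in a row.
    best_score, best_checkpoint, strikes = float("-inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)           # hypothetical: one pass over the training data
        score = evaluate_on_dev(model)   # hypothetical: evaluation metric on the dev set
        if score > best_score:
            best_score, strikes = score, 0
            best_checkpoint = save_checkpoint(model)  # hypothetical checkpoint helper
        else:
            strikes += 1
            if strikes >= M:
                break
    return best_checkpoint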
The reason we use data splits is to avoid overfitting. If you simply selected the model that does best on your training set, then you wouldn't know how well your model would perform on new samples of data – you'd be overfitting to the training set. In NLP, powerful neural models are particularly prone to overfitting to their training data, so this is especially important.
Similarly, if you look at the test set before you've chosen your final architecture and hyperparameters, that might impact your decisions and lead your project to overfit to the test data. Thus, in the interest of science, it is extremely important that you don't touch the test set until the very end of your project. This will ensure that the quantitative performance that you report will be an honest, unbiased estimate of how your method will do on new samples of data.
It's even possible to overfit to the development set. If we train many different model variants, and only keep the hyperparameters and architectures that perform best on the development set, then we may be overfitting to the development set. To fix this, you can even use two separate development sets (one of them called the tuning set, the other one the development set). The tuning set is used for optimizing hyperparameters, the development set for measuring overall progress. If you optimize your hyperparameters a lot and do many iterations, you may even want to create multiple distinct development sets (dev, dev2, ...) to avoid overfitting.
• Use the development set to tune these parameters (see Section 3.3). Though you probably won't have time for a very exhaustive hyperparameter search, try to identify the most sensitive/important hyperparameters and tune those.
• Due to their power, neural networks overfit easily. Regularization (dropout,
weight decay) and stopping criteria based on the development set (e.g.
early stopping, see Section 3.3) are extremely important for ensuring your
model will do well on new unseen data samples.
• A more complicated neural model can ‘fail silently’: It may get decent
performance by relying on the simple parts, when the really interesting
components (e.g. a cool attention mechanism) fail due to a bug. Use
ablation experiments (see Section 3.4) to diagnose whether different parts
of the model are adding value.
• During training, randomize the order of training samples, or at least remove obvious ordering (e.g., don't train your Seq2Seq system on data by going through the corpus in the order of sentence length – instead, make sure the lengths of the sentences in subsequent minibatches are uncorrelated). SGD relies on the assumption that the samples come in random order.
• There are many online resources containing practical advice for building
neural network models. Note, most of this type of advice is based more
on personal experience than rigorous theory, and these ideas evolve over
time – so take it with a grain of salt! After all, your project could open
new perspectives by disproving some commonly-held belief.
– A Twitter thread on most common neural net mistakes (June 2018):
https://twitter.com/karpathy/status/1013244313327681536
– Deep Learning for NLP Best Practices blog post (July 2017):
http://ruder.io/deep-learning-nlp-best-practices/
– Practical Advice for Building Deep Neural Networks (October 2017):
https://pcc.cs.byu.edu/2017/10/02/practical-advice-for-building-deep-neural-networks/
3.6 Evaluation
In your project, carrying out meaningful evaluation is as important as designing
and building your neural models. Meaningful evaluation means that you should
carefully compare the performance of your methods using appropriate evaluation
metrics.
Choosing evaluation metrics You must have at least one evaluation metric
(which should be a numerical metric that can be automatically computed) to
measure the performance of your methods. If there is existing published work
on the same dataset and/or task, you should use the same metric(s) as that
work (though you can evaluate on additional metrics if you think it’s useful).
Human evaluation Human evaluation is often necessary in research areas
that lack good, comprehensive automatic evaluation metrics (e.g. some natural
language generation tasks). If you want to use human judgment as an evaluation
metric, you are welcome to do so (though you may find it difficult to find the time
and/or funding to collect many human evaluations). Collecting a small number
of human judgments could be a valuable addition to your project, but you must
have at least one automatic evaluation metric – even if it is an imperfect metric.
What to compare You should use your evaluation metric(s) to (a) compare your model against previous work, (b) compare your model against your baselines (see Section 3.4), and (c) compare different versions of your model. When comparing against previous work, make sure to get the details right – for example, did the previous work compute the BLEU metric in a case-sensitive or case-insensitive way? If you calculate your evaluation metric differently to previous work, the numbers are not comparable!
• If your model uses attention, you can create a plot or visualization of the attention distribution to see what the model attended to on particular examples.
If your method is successful, qualitative evaluation is important to understand the reason behind the numbers and identify areas for improvement. If your method is unsuccessful, qualitative evaluation is even more important to understand what went wrong.
Review of differential calculus theory
Author: Guillaume Genthial
Winter 2017
1 Introduction
We use derivatives all the time, but we forget what they mean. In general, we have in mind that for a function f : R → R, we have something like

f(x + h) − f(x) ≈ f′(x)h

Several notations coexist: f′(x), df/dx, ∂f/∂x, ∇_x f. However, these notations refer to different mathematical objects, and the confusion can lead to mistakes. This paper recalls some notions about these objects. (Side note on scalar-product and dot-product: given two vectors a and b, the scalar-product is ⟨a|b⟩ = Σ_{i=1..n} a_i b_i, and the dot-product is aᵀ · b = ⟨a|b⟩ = Σ_{i=1..n} a_i b_i.)
2 Theory for f : R^n → R
2.1 Differential
Formal definition
Let's consider a function f : R^n → R defined on R^n with the scalar product ⟨·|·⟩. We suppose that this function is differentiable, which means that for x ∈ R^n (fixed) and a small variation h (which can change) we can write:

f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)    (1)

d_x f is called the differential of f in x. It is a linear form R^n → R: the best linear approximation of the function f around x.
Example
Let f : R² → R such that f((x1, x2)) = 3x1 + x2². Let's pick (a, b)ᵀ ∈ R² and h = (h1, h2)ᵀ ∈ R². We have

f((a + h1, b + h2)) = 3(a + h1) + (b + h2)²
                    = 3a + 3h1 + b² + 2bh2 + h2²
                    = 3a + b² + 3h1 + 2bh2 + h2²
                    = f((a, b)) + 3h1 + 2bh2 + o(h)

since h2² = h2 · h2 = o_{h→0}(h). Then d_{(a,b)} f((h1, h2)) = 3h1 + 2bh2.
As f is differentiable, it can be shown that there exists a vector u ∈ R^n such that d_x f(h) = ⟨u|h⟩ for every h; this vector is called the gradient of f in x, written ∇_x f := u. Then, as a conclusion, we can rewrite equation (1) as

f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)      (2)
         = f(x) + ⟨∇_x f | h⟩ + o_{h→0}(h)   (3)

(Gradients and differentials of a function are conceptually very different: the gradient is a vector, while the differential is a function.)
Example
Same example as before: f : R² → R such that f((x1, x2)) = 3x1 + x2². We showed that

d_{(a,b)} f((h1, h2)) = 3h1 + 2bh2

We can rewrite this as

d_{(a,b)} f((h1, h2)) = ⟨(3, 2b)ᵀ | (h1, h2)ᵀ⟩

and thus our gradient is

∇_{(a,b)} f = (3, 2b)ᵀ
∂f/∂x_i (x) denotes the partial derivative of f with respect to the i-th component, evaluated in x. (Depending on the context, most people omit the (x) evaluation and just write ∂f/∂x ∈ R instead of ∂f/∂x (x).)

Example
Same example as before: f : R² → R such that f(x1, x2) = 3x1 + x2². Let's write

∂f/∂x1 ((a, b)) = lim_{h→0} [ f((a + h, b)) − f((a, b)) ] / h
                = lim_{h→0} [ 3(a + h) + b² − (3a + b²) ] / h
                = lim_{h→0} 3h / h
                = 3

Similarly, ∂f/∂x2 ((a, b)) = 2b.
Example
We showed that

∂f/∂x1 ((a, b)) = 3    and    ∂f/∂x2 ((a, b)) = 2b

and that

∇_{(a,b)} f = (3, 2b)ᵀ

and then we verify that

∇_{(a,b)} f = ( ∂f/∂x1 ((a, b)), ∂f/∂x2 ((a, b)) )ᵀ
3 Summary
Formal definition
For a function f : R^n → R, we have defined the following objects, which can be summarized in the following equation (recall that aᵀ · b = ⟨a|b⟩ = Σ_{i=1..n} a_i b_i):

f(x + h) = f(x) + d_x f(h) + o_{h→0}(h)                                 (differential)
         = f(x) + ⟨∇_x f | h⟩ + o_{h→0}(h)                              (gradient)
         = f(x) + ⟨∂f/∂x (x) | h⟩ + o_{h→0}(h)
         = f(x) + ⟨( ∂f/∂x1 (x), …, ∂f/∂xn (x) )ᵀ | h⟩ + o_{h→0}(h)     (partial derivatives)
Remark
Let's consider x : R → R such that x(u) = u for all u. Then we can easily check that d_u x(h) = h. As this differential does not depend on u, we may simply write dx. That's why the following expression has some meaning (the dx that we use refers to the differential of u ↦ u, the identity mapping!):

d_x f(·) = ∂f/∂x (x) dx(·)

because

d_x f(h) = ∂f/∂x (x) dx(h) = ∂f/∂x (x) h

In higher dimension, we write

d_x f = Σ_{i=1..n} ∂f/∂x_i (x) dx_i
4 Jacobian: Generalization to f : R^n → R^m
For a function

f : (x1, …, xn) ↦ ( f1(x1, …, xn), …, fm(x1, …, xn) )

we can apply the previous section to each f_i(x):

f_i(x + h) = f_i(x) + d_x f_i(h) + o_{h→0}(h)
           = f_i(x) + ⟨∇_x f_i | h⟩ + o_{h→0}(h)
           = f_i(x) + ⟨∂f_i/∂x (x) | h⟩ + o_{h→0}(h)
           = f_i(x) + ⟨( ∂f_i/∂x1 (x), …, ∂f_i/∂xn (x) )ᵀ | h⟩ + o_{h→0}(h)

Stacking these equations over i = 1, …, m yields

f(x + h) = f(x) + J(x) · h + o(h)

where J(x) is the m × n Jacobian matrix whose (i, j) entry is ∂f_i/∂x_j (x).
Example
Let g : R³ → R² with

g((y1, y2, y3)) = ( y1 + 2y2 + 3y3,  y1 y2 y3 )ᵀ

Then the Jacobian of g is

J_g(y) = [ ∂(y1 + 2y2 + 3y3)/∂y1 (y)   ∂(y1 + 2y2 + 3y3)/∂y2 (y)   ∂(y1 + 2y2 + 3y3)/∂y3 (y)
           ∂(y1 y2 y3)/∂y1 (y)         ∂(y1 y2 y3)/∂y2 (y)         ∂(y1 y2 y3)/∂y3 (y)    ]

       = [ 1       2       3
           y2 y3   y1 y3   y1 y2 ]
5 Generalization to f : R^{n×p} → R
If we view the matrix X ∈ R^{n×p} as a long vector x ∈ R^{np}, the previous sections give us a gradient ∇_x f = ( ∂f/∂x1 (x), …, ∂f/∂x_{np} (x) )ᵀ. Now, we would like to give some meaning to the following equation:

f(X + H) = f(X) + ⟨∇_X f | H⟩ + o(H)

The gradient of f with respect to a matrix X is a matrix of the same shape as X, defined by

(∇_X f)_{ij} = ∂f/∂X_{ij} (X)

With this definition, the two pairings are equivalent:

⟨∇_x f | h⟩ = ⟨∇_X f | H⟩,  i.e.  Σ_{i=1..np} ∂f/∂x_i (x) h_i = Σ_{i,j} ∂f/∂X_{ij} (X) H_{ij}
6 Generalization to f : R^{n×p} → R^m
Let's generalize the generalization of the previous section. Applying the same idea as before, we can write

f(x + h) = f(x) + J(x) · h + o(h),  with  J_{ijk}(x) = ∂f_i/∂X_{jk} (x)

Writing the 2d-dot product δ = J(x) · h ∈ R^m means that the i-th component of δ is

δ_i = Σ_{j=1..n} Σ_{k=1..p} ∂f_i/∂X_{jk} (x) h_{jk}

(You can apply the same idea to any dimensions!)
7 Chain-rule
Formal definition
Now let's consider f : R^n → R^m and g : R^p → R^n. We want to compute the differential of the composition h = f ∘ g such that h : x ↦ u = g(x) ↦ f(g(x)) = f(u), i.e. d_x(f ∘ g). It can be shown that the differential is the composition of the differentials:

d_x(f ∘ g) = d_{g(x)} f ∘ d_x g

Since composition of linear maps corresponds to matrix multiplication of their Jacobians,

J_h(x)_{ij} = Σ_{k=1..n} J_f(g(x))_{ik} · J_g(x)_{kj}
Example
Let's keep our example function f : (x1, x2) ↦ 3x1 + x2² and our function g : (y1, y2, y3) ↦ ( y1 + 2y2 + 3y3, y1 y2 y3 )ᵀ. The composition of f and g is h = f ∘ g : R³ → R with

h((y1, y2, y3)) = f(( y1 + 2y2 + 3y3, y1 y2 y3 )) = 3(y1 + 2y2 + 3y3) + (y1 y2 y3)²

Differentiating directly,

∂h/∂y1 (y) = 3 + 2 y1 y2² y3²
∂h/∂y2 (y) = 6 + 2 y2 y1² y3²
∂h/∂y3 (y) = 9 + 2 y3 y1² y2²

Now let's apply the chain rule instead. Since ∇_x fᵀ = J_f(x), we have

J_f(x) = ∇_x fᵀ = (3, 2x2)

and

J_g(y) = [ 1       2       3
           y2 y3   y1 y3   y1 y2 ]

so that

J_h(y) = J_f(g(y)) · J_g(y) = (3, 2 y1 y2 y3) · J_g(y)
       = ( 3 + 2 y1 y2² y3²,  6 + 2 y2 y1² y3²,  9 + 2 y3 y1² y2² )

and taking the transpose we find the same gradient that we computed before!
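As a sanity check, this gradient can also be verified numerically with finite differences; a small Python sketch (the test point is arbitrary):

import numpy as np

def h(y):
    y1, y2, y3 = y
    return 3 * (y1 + 2 * y2 + 3 * y3) + (y1 * y2 * y3) ** 2

def grad_h(y):
    y1, y2, y3 = y
    return np.array([3 + 2 * y1 * y2**2 * y3**2,
                     6 + 2 * y2 * y1**2 * y3**2,
                     9 + 2 * y3 * y1**2 * y2**2])

y = np.array([0.5, -1.0, 2.0])
eps = 1e-6
# central differences along each coordinate direction
numeric = np.array([(h(y + eps * e) - h(y - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(numeric, grad_h(y), atol=1e-4)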
Important remark
• Note that the chain rule gives us a way to compute the Jacobian and not the gradient. However, we showed that in the case of a function f : R^n → R, the Jacobian and the gradient are directly identifiable, because ∇_x fᵀ = J_f(x). Thus, if we want to compute the gradient of a function by using the chain-rule, the best way to do it is to compute the Jacobian and take its transpose.
• Also keep in mind that the gradient must have the same shape as the variable with respect to which we derive, and that the notation ∂·/∂· is often ambiguous: it can refer to either the gradient or the Jacobian.
Computing Neural Network Gradients
Kevin Clark
1 Introduction
The purpose of these notes is to demonstrate how to quickly compute neural
network gradients in a completely vectorized way. It is complementary to the
last part of lecture 3 in CS224n 2019, which goes over the same material.
2 Vectorized Gradients
While it is a good exercise to compute the gradient of a neural network with respect to a single parameter (e.g., a single element in a weight matrix), in practice this tends to be quite slow. Instead, it is more efficient to keep everything in matrix/vector form. The basic building block of vectorized gradients is the Jacobian matrix. Suppose we have a function f : R^n → R^m that maps a vector of length n to a vector of length m: f(x) = [f1(x1, ..., xn), f2(x1, ..., xn), ..., fm(x1, ..., xn)]. Then its Jacobian is the following m × n matrix:

∂f/∂x = [ ∂f1/∂x1  ...  ∂f1/∂xn
          ...
          ∂fm/∂x1  ...  ∂fm/∂xn ]

That is, (∂f/∂x)_{ij} = ∂f_i/∂x_j.
As a little illustration of this, suppose we have a function f(x) = [f1(x), f2(x)] taking a scalar to a vector of size 2 and a function g(y) = [g1(y1, y2), g2(y1, y2)] taking a vector of size two to a vector of size two. Now let's compose them to get g(x) = [g1(f1(x), f2(x)), g2(f1(x), f2(x))]. Using the regular chain rule, we can compute the derivative of g as the Jacobian

∂g/∂x = [ ∂g1/∂f1 · ∂f1/∂x + ∂g1/∂f2 · ∂f2/∂x
          ∂g2/∂f1 · ∂f1/∂x + ∂g2/∂f2 · ∂f2/∂x ]

And we see this is the same as multiplying the two Jacobians:

∂g/∂x = (∂g/∂f)(∂f/∂x) = [ ∂g1/∂f1  ∂g1/∂f2     [ ∂f1/∂x
                           ∂g2/∂f1  ∂g2/∂f2 ] ·   ∂f2/∂x ]
3 Useful Identities
This section will now go over how to compute the Jacobian for several simple functions. It will provide some useful identities you can apply when taking neural network gradients.
(1) Matrix times column vector with respect to the column vector
(z = W x, what is ∂z/∂x?)
Writing z_i = Σ_{k=1..m} W_{ik} x_k, an entry (∂z/∂x)_{ij} of the Jacobian will be

(∂z/∂x)_{ij} = ∂z_i/∂x_j = ∂/∂x_j Σ_{k=1..m} W_{ik} x_k = Σ_{k=1..m} W_{ik} (∂/∂x_j) x_k = W_{ij}

because ∂x_k/∂x_j = 1 if k = j and 0 otherwise. So we see that ∂z/∂x = W.
(2) Row vector times matrix with respect to the row vector
(z = x W, what is ∂z/∂x?)
A computation similar to (1) shows that ∂z/∂x = Wᵀ.
(3) A vector with itself
(z = x, what is ∂z/∂x?)
We have z_i = x_i. So

(∂z/∂x)_{ij} = ∂z_i/∂x_j = ∂x_i/∂x_j = { 1 if i = j
                                       { 0 otherwise

So we see that the Jacobian ∂z/∂x is a diagonal matrix where the entry at (i, i) is 1. This is just the identity matrix: ∂z/∂x = I. When applying the chain rule, this term will disappear because a matrix or vector multiplied by the identity matrix does not change.
(4) An elementwise function applied to a vector
(z = f(x), what is ∂z/∂x?)
Since f is being applied elementwise, we have z_i = f(x_i). So

(∂z/∂x)_{ij} = ∂z_i/∂x_j = ∂/∂x_j f(x_i) = { f′(x_i) if i = j
                                           { 0 otherwise

So we see that the Jacobian ∂z/∂x is a diagonal matrix where the entry at (i, i) is the derivative of f applied to x_i. We can write this as ∂z/∂x = diag(f′(x)). Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write ∘ f′(x) when applying the chain rule.
(5) Matrix times column vector with respect to the matrix
(z = W x, δ = ∂J/∂z, what is ∂J/∂W = δ ∂z/∂W?)
This is a bit more complicated than the other identities. The reason for including ∂J/∂z in the above problem formulation will become clear in a moment.
First suppose we have a loss function J (a scalar) and are computing its gradient with respect to a matrix W ∈ R^{n×m}. Then we could think of J as a function of W taking nm inputs (the entries of W) to a single output (J). This means the Jacobian ∂J/∂W would be a 1 × nm vector. But in practice this is not a very useful way of arranging the gradient. It would be much nicer if the derivatives were in an n × m matrix like this:

∂J/∂W = [ ∂J/∂W_{11}  ...  ∂J/∂W_{1m}
          ...
          ∂J/∂W_{n1}  ...  ∂J/∂W_{nm} ]

Since this matrix has the same shape as W, we could just subtract it (times the learning rate) from W when doing gradient descent. So (in a slight abuse of notation) let's find this matrix as ∂J/∂W instead.
This way of arranging the gradients becomes complicated when computing ∂z/∂W. Unlike J, z is a vector. So if we try to arrange the gradients as we did for ∂J/∂W, ∂z/∂W would be an n × m × n tensor! Luckily, we can avoid the issue by taking the gradient with respect to a single weight W_{ij} instead.
∂z/∂W_{ij} is just a vector, which is much easier to deal with. We have

z_k = Σ_{l=1..m} W_{kl} x_l

∂z_k/∂W_{ij} = Σ_{l=1..m} x_l (∂/∂W_{ij}) W_{kl}

Note that (∂/∂W_{ij}) W_{kl} = 1 if i = k and j = l, and 0 otherwise. So if k ≠ i everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when l = j, so we just get x_j. Thus we find ∂z_k/∂W_{ij} = x_j if k = i and 0 otherwise. Another way of writing this is

∂z/∂W_{ij} = ( 0, …, 0, x_j, 0, …, 0 )ᵀ   with x_j as the i-th element
Now let's compute ∂J/∂W_{ij}:

∂J/∂W_{ij} = (∂J/∂z)(∂z/∂W_{ij}) = δ ∂z/∂W_{ij} = Σ_{k=1..m} δ_k ∂z_k/∂W_{ij} = δ_i x_j

(the only nonzero term in the sum is δ_i ∂z_i/∂W_{ij}). To get ∂J/∂W we want a matrix where entry (i, j) is δ_i x_j. This matrix is equal to the outer product

∂J/∂W = δᵀ xᵀ
(6) Row vector times matrix with respect to the matrix
(z = x W, δ = ∂J/∂z, what is ∂J/∂W = δ ∂z/∂W?)
A computation similar to (5) shows that ∂J/∂W = xᵀ δ.
(7) Cross-entropy loss with respect to logits
(ŷ = softmax(θ), J = CE(y, ŷ), what is ∂J/∂θ?)
The gradient is ∂J/∂θ = ŷ − y (or (ŷ − y)ᵀ if y is a column vector).
These identities will be enough to let you quickly compute the gradients for many neural networks. However, it's important to know how to compute Jacobians for other functions as well in case they show up. Some examples if you want practice: dot product of two vectors, elementwise product of two vectors, 2-norm of a vector. Feel free to use these identities in the assignments. One option is just to memorize them. Another option is to figure them out by looking at the dimensions. For example, only one ordering/orientation of δ and x will produce the correct shape for ∂J/∂W (assuming W is not square).
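For instance, identity (7) can be checked numerically with finite differences; a small Python sketch (the test values are arbitrary):

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def J(theta, y):
    return -np.sum(y * np.log(softmax(theta)))  # cross-entropy loss CE(y, softmax(theta))

theta = np.array([1.0, -0.5, 0.3])
y = np.array([0.0, 1.0, 0.0])  # one-hot label
eps = 1e-6
numeric = np.array([(J(theta + eps * e, y) - J(theta - eps * e, y)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(numeric, softmax(theta) - y, atol=1e-6)  # matches y_hat - y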
4 Gradient Layout
Jacobian formulation is great for applying the chain rule: you just have to multiply the Jacobians. However, when doing SGD it's more convenient to follow the convention "the shape of the gradient equals the shape of the parameter" (as we did when computing ∂J/∂W). That way subtracting the gradient times the learning rate from the parameters is easy. We expect answers to homework questions to follow this convention. Therefore if you compute the gradient of a column vector using Jacobian formulation, you should take the transpose when reporting your final answer so the gradient is a column vector. Another option is to always follow the convention. In this case the identities may not work, but you can still figure out the answer by making sure the dimensions of your derivatives match up. It's up to you which of these options you choose!
In this example, we will compute all of the network's gradients:

∂J/∂U,  ∂J/∂b2,  ∂J/∂W,  ∂J/∂b1,  ∂J/∂x

(Here the network is a one-layer classifier: z = W x + b1, h = ReLU(z), θ = U h + b2, ŷ = softmax(θ), with loss J = CE(y, ŷ).)
To start with, recall that ReLU(x) = max(x, 0). This means

ReLU′(x) = { 1 if x > 0
           { 0 otherwise      = sgn(ReLU(x))

where sgn is the signum function. Note that we are able to write the derivative of the activation in terms of the activation itself.
Now let's write out the chain rule for ∂J/∂U and ∂J/∂b2:

∂J/∂U  = (∂J/∂ŷ)(∂ŷ/∂θ)(∂θ/∂U)
∂J/∂b2 = (∂J/∂ŷ)(∂ŷ/∂θ)(∂θ/∂b2)

Notice that (∂J/∂ŷ)(∂ŷ/∂θ) = ∂J/∂θ is present in both gradients. This makes the math a bit cumbersome. Even worse, if we're implementing the model without automatic differentiation, computing ∂J/∂θ twice will be inefficient. So it will help us to define some variables to represent the intermediate derivatives:

δ1 = ∂J/∂θ        δ2 = ∂J/∂z

These can be thought of as the error signals passed down to θ and z when doing backpropagation. We can compute them as follows:
δ1 = ∂J/∂θ = (ŷ − y)ᵀ                        (this is just identity (7))
δ2 = ∂J/∂z = (∂J/∂θ)(∂θ/∂h)(∂h/∂z)           (using the chain rule)
           = δ1 (∂θ/∂h)(∂h/∂z)               (substituting in δ1)
           = δ1 U (∂h/∂z)                    (using identity (1))
           = δ1 U ∘ ReLU′(z)                 (using identity (4))
           = δ1 U ∘ sgn(h)                   (we computed this earlier)

A good way of checking our work is by looking at the dimensions of the Jacobians:

∂J/∂z    =   δ1         U           ∘ sgn(h)
(1 × Dh)    (1 × Nc)    (Nc × Dh)     (Dh)
We see that the dimensions of all the terms in the gradient match up (i.e., the number of columns in a term equals the number of rows in the next term). This will always be the case if we computed our gradients correctly.
Now we can use the error terms to compute our gradients. Note that we transpose our answers when computing the gradients for column vector terms, to follow the shape convention.
∂J/∂U  = (∂J/∂θ)(∂θ/∂U)  = δ1 (∂θ/∂U)  = δ1ᵀ hᵀ     (using identity (5))
∂J/∂b2 = (∂J/∂θ)(∂θ/∂b2) = δ1 (∂θ/∂b2) = δ1ᵀ        (using identity (3) and transposing)
∂J/∂W  = (∂J/∂z)(∂z/∂W)  = δ2 (∂z/∂W)  = δ2ᵀ xᵀ     (using identity (5))
∂J/∂b1 = (∂J/∂z)(∂z/∂b1) = δ2 (∂z/∂b1) = δ2ᵀ        (using identity (3) and transposing)
∂J/∂x  = (∂J/∂z)(∂z/∂x)  = (δ2 W)ᵀ                  (using identity (1) and transposing)
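To tie everything together, here is a compact NumPy sketch of this forward and backward pass, following the shape convention; the dimensions, the random inputs, and the one-hot label are assumptions for the example:

import numpy as np

rng = np.random.default_rng(0)
Dx, Dh, Nc = 5, 4, 3
x = rng.normal(size=(Dx, 1))              # input (column vector)
y = np.zeros((Nc, 1)); y[1] = 1.0         # one-hot label
W, b1 = rng.normal(size=(Dh, Dx)), np.zeros((Dh, 1))
U, b2 = rng.normal(size=(Nc, Dh)), np.zeros((Nc, 1))

# forward pass: z = Wx + b1, h = ReLU(z), theta = Uh + b2, y_hat = softmax(theta)
z = W @ x + b1
h = np.maximum(z, 0.0)
theta = U @ h + b2
y_hat = np.exp(theta - theta.max()); y_hat /= y_hat.sum()

# backward pass, with the error signals as row vectors
delta1 = (y_hat - y).T                  # dJ/dtheta, identity (7)
delta2 = (delta1 @ U) * np.sign(h).T    # dJ/dz = delta1 U ∘ sgn(h)
dU, db2 = delta1.T @ h.T, delta1.T      # dJ/dU, dJ/db2 (transposed to match shapes)
dW, db1 = delta2.T @ x.T, delta2.T      # dJ/dW, dJ/db1
dx = (delta2 @ W).T                     # dJ/dx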
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: TensorFlow
Course Instructors: Christopher Manning, Richard Socher
Authors: Zhedi Liu, Jon Gauthier, Bharath Ramsundar, Chip Huyen
Winter 2017
Keyphrases: TensorFlow
Code Demo: https://github.com/nishithbsk/tensorflow_tutorials
1 Introduction
2 Concepts

b = tf.Variable(tf.zeros((N,)))                     # bias vector, initialized to zeros
W = tf.Variable(tf.random_uniform((Dx, N), -1, 1))  # weight matrix, uniform in [-1, 1]
x = tf.placeholder(tf.float32, (N, Dx))             # placeholder for a batch of N inputs of dimension Dx
h = tf.nn.relu(tf.matmul(x, W) + b)                 # hidden layer
When we call sess.run, two arguments have to be provided: fetches and feeds. We use feeds and fetches to get data into and out of arbitrary operations. Fetches represent a list of graph nodes and return the outputs of these nodes. We can fetch a single node or multiple tensors. See the code snippet below for an example of fetching two tensors: mul and intermed.
input1 = tf.constant([3.0])
input2 = tf.constant([2.0])
input3 = tf.constant([5.0])
intermed = tf.add(input2, input3)
mul = tf.mul(input1, intermed)

sess = tf.Session()
print(sess.run([mul, intermed]))
# output:
# [array([ 21.], dtype=float32), array([ 7.], dtype=float32)]
input1 = tf.placeholder(tf.float32)
input2 = tf.placeholder(tf.float32)
output = tf.mul(input1, input2)

# feeding [7.] and [2.] into the placeholders produces the output below
print(sess.run([output], feed_dict={input1: [7.], input2: [2.]}))
# output:
# [array([ 14.], dtype=float32)]
b = tf.Variable(tf.zeros((100,)))
W = tf.Variable(tf.random_uniform((784, 100), -1, 1))
x = tf.placeholder(tf.float32, (100, 784))
h = tf.nn.relu(tf.matmul(x, W) + b)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
# {x: np.random.random((100, 784))} is a feed
# that assigns a random (100, 784) array to placeholder x
sess.run(h, {x: np.random.random((100, 784))})
2. Compute Gradients
The next thing we have to do is to compute gradients. TensorFlow nodes have attached operations; gradients with respect to variables can therefore be computed automatically via backpropagation.
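For instance, here is a minimal sketch of how a loss and a train_step op could be wired up on top of the hidden layer h defined earlier; the label placeholder, the output layer U, and the learning rate are assumptions for the example:

label = tf.placeholder(tf.float32, (100, 10))
U = tf.Variable(tf.random_uniform((100, 10), -1, 1))
logits = tf.matmul(h, U)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=label))
# minimize() adds gradient ops to the graph and returns an update op
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)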
3. Train Model
Now we are ready to train a model. This can simply be done by creating an iterative training schedule that feeds in data and labels, and applies gradients to the variables, as shown in the code snippet below.

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for i in range(1000):
    batch_x, batch_label = data.next_batch()
    sess.run(train_step, feed_dict={x: batch_x, label: batch_label})
with tf.variable_scope("foo"):
v = tf.get_variable("v", [1])
assert v.name == "foo/v:0"
with tf.variable_scope("foo"):
v = tf.get_variable("v", [1])
with tf.variable_scope("foo", reuse=True):
v1 = tf.get_variable("v", [1])
with tf.variable_scope("foo", reuse=False):
v1 = tf.get_variable("v") # CRASH foo/v:0 already exists!