Natural Language Processing with Deep Learning
CS224N/Ling284
Christopher Manning and Richard Socher
Lecture 4: Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm)
1. Introduction
Assignment 2 is all about making sure you really understand the
math of neural networks … then we’ll let the software do it!
We’ll go through it quickly today, but also look at the readings!
This will be a tough week for some! →
Make sure to get help if you need it
Visit office hours Friday/Tuesday
Note: Monday is MLK Day – No office hours, sorry!
But we will be on Piazza
Read tutorial materials given in the syllabus
NER: Binary classification for center word being location
• We do supervised training and want high score if it’s a location
J_t(θ) = σ(s) = 1 / (1 + e^(−s))
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
Remember: Stochastic Gradient Descent
Update equation (see the sketch below):
𝜃_new = 𝜃_old − 𝛼 ∇_𝜃 J(𝜃)
𝛼 = step size or learning rate
How can we compute ∇_𝜃 J(𝜃)?
1. By hand
2. Algorithmically: the backpropagation algorithm
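A minimal sketch of the update step in NumPy (not from the slides; `grad_J` is a placeholder for whatever computes ∇_𝜃 J(𝜃)):

```python
import numpy as np

def sgd_step(theta, grad_J, alpha=0.01):
    """One SGD update: theta_new = theta_old - alpha * grad_J(theta)."""
    return theta - alpha * grad_J(theta)

# Toy usage: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    theta = sgd_step(theta, lambda t: 2 * t, alpha=0.1)
# theta is now numerically close to the zero vector, the minimizer
```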
Lecture Plan
Lecture 4: Gradients by hand and algorithmically
1. Introduction (5 mins)
2. Matrix calculus (40 mins)
3. Backpropagation (35 mins)
Computing Gradients by Hand
• Matrix calculus: Fully vectorized gradients
• “multivariable calculus is just like single-variable calculus if
you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for
intuition; watch last week’s lecture for an example
• Lecture notes and matrix calculus notes cover this
material in more detail
• You might also review Math 51, which has a new online
textbook:
http://web.stanford.edu/class/math51/textbook.html
Gradients
• Given a function with 1 output and 1 input
f(x) = x³
• Its gradient (slope) is its derivative:
df/dx = 3x²
“How much will the output change if we change the input a bit?”
Gradients
• Given a function with 1 output and n inputs
• Its gradient is a vector of partial derivatives with
respect to each input
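In symbols (this is just the standard definition the bullet describes):

$$ \nabla_{x} f = \left[ \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \dots,\ \frac{\partial f}{\partial x_n} \right] $$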
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs
• Its Jacobian is an m x n matrix of partial derivatives
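In symbols, for f: ℝⁿ → ℝᵐ with outputs f₁, …, fₘ (standard definition):

$$ \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}, \qquad \left(\frac{\partial f}{\partial x}\right)_{ij} = \frac{\partial f_i}{\partial x_j} $$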
Chain Rule
• For one-variable functions: multiply derivatives
• For multiple variables at once: multiply Jacobians
Example Jacobian: Elementwise activation Function
Function has n outputs and n inputs → n by n Jacobian
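The result these build slides work toward (it agrees with the lecture notes): for h = f(z) applied elementwise, hᵢ depends only on zᵢ, so the off-diagonal partials vanish:

$$ \left(\frac{\partial h}{\partial z}\right)_{ij} = \frac{\partial h_i}{\partial z_j} = \begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad\Rightarrow\qquad \frac{\partial h}{\partial z} = \mathrm{diag}\big(f'(z)\big) $$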
Other Jacobians
• Compute these at home for practice!
• Check your answers with the lecture notes
• Fine print: h^⊤ (a row vector) is the correct Jacobian for ∂(u^⊤h)/∂u. Later we discuss the “shape convention”; using it the answer would be h.
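For reference, the Jacobians these practice slides are asking for (standard results; verify against the lecture notes); the last one is the one the fine print above refers to:

$$ \frac{\partial}{\partial x}(Wx + b) = W, \qquad \frac{\partial}{\partial b}(Wx + b) = I, \qquad \frac{\partial}{\partial u}(u^\top h) = h^\top $$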
Back to our Neural Net!
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
Back to our Neural Net!
• Let’s find ∂s/∂b
• Really, we care about the gradient of the loss, but we
will compute the gradient of the score for simplicity
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
1. Break up equations into simple pieces
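The pieces, written out (reconstructed from how they are used in the rest of the lecture):

$$ s = u^\top h, \qquad h = f(z), \qquad z = Wx + b, \qquad x \ \text{(the input window)} $$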
2. Apply the chain rule
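Assuming, as in the rest of this walkthrough, that the target is ∂s/∂b:

$$ \frac{\partial s}{\partial b} = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial b} $$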
3. Write out the Jacobians
Useful Jacobians from previous slide
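Substituting the Jacobians derived above (a reconstruction consistent with the lecture notes):

$$ \frac{\partial s}{\partial h} = u^\top, \quad \frac{\partial h}{\partial z} = \mathrm{diag}\big(f'(z)\big), \quad \frac{\partial z}{\partial b} = I \qquad\Rightarrow\qquad \frac{\partial s}{\partial b} = u^\top\,\mathrm{diag}\big(f'(z)\big)\,I = u^\top\,\mathrm{diag}\big(f'(z)\big) $$

This is a row vector whose j-th entry is u_j f′(z_j).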
Re-using Computation
• Suppose we now want to compute ∂s/∂W
• Using the chain rule again: ∂s/∂W = (∂s/∂h)(∂h/∂z)(∂z/∂W)
• The first two factors are the same as for ∂s/∂b! Let’s avoid duplicated computation…
• Define 𝛿 = (∂s/∂h)(∂h/∂z); 𝛿 is the local error signal
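Spelling 𝛿 out with the Jacobians above (again a reconstruction consistent with the lecture notes):

$$ \delta = \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z} = u^\top\,\mathrm{diag}\big(f'(z)\big), \qquad \delta_j = u_j\,f'(z_j) $$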
Derivative with respect to Matrix: Output shape
• What does ∂s/∂W look like?
• 1 output, nm inputs: 1 by nm Jacobian?
• Inconvenient to do
• Instead we use the shape convention: the shape of the gradient is the shape of the parameters
• So ∂s/∂W is n by m
Derivative with respect to Matrix
• Remember: ∂s/∂W = 𝛿 (∂z/∂W)
• 𝛿 is going to be in our answer
• The other term should be x because z = Wx + b
• Answer is: ∂s/∂W = 𝛿^⊤ x^⊤
𝛿 is local error signal at 𝑧
𝑥 is local input signal
Why the Transposes?
• Hacky answer: this makes the dimensions work out!
• Useful trick for checking your work!
• Full explanation in the lecture notes; intuition next
• Each input goes to each output – you get outer product
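A quick dimension check of this outer-product answer (𝛿 is a 1×n row vector, x an m×1 column vector):

$$ \frac{\partial s}{\partial W} = \delta^\top x^\top, \qquad [n \times m] = [n \times 1]\,[1 \times m], \qquad \left(\frac{\partial s}{\partial W}\right)_{ij} = \delta_i\,x_j $$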
Deriving local input gradient in backprop
• For this function:
∂s/∂W = 𝛿 (∂z/∂W) = 𝛿 ∂(Wx + b)/∂W
• Let’s consider the derivative of a single weight W_ij
• W_ij only contributes to z_i
• For example: W_23 is only used to compute z_2, not z_1
[Figure: one layer of the net — inputs x_1, x_2, x_3, +1; weight W_23 and bias b_2 feeding z_2; h_1 = f(z_1), h_2 = f(z_2); score s]
∂z_i/∂W_ij = ∂(W_i· x + b_i)/∂W_ij = ∂/∂W_ij Σ_{k=1}^d W_ik x_k = x_j
What shape should derivatives be?
• ∂s/∂b is a row vector
• But convention says our gradient should be a column vector because b is a column vector…
• Disagreement between Jacobian form (which makes
the chain rule easy) and the shape convention (which
makes implementing SGD easy)
• We expect answers to follow the shape convention
• But Jacobian form is useful for computing the answers
What shape should derivatives be?
Two options:
1. Use Jacobian form as much as possible, reshape to
follow the convention at the end:
• What we just did. But at the end transpose to make the derivative a column vector, resulting in 𝛿^⊤
2. Always follow the convention
• Look at dimensions to figure out when to transpose and/or
reorder terms
Deriving gradients: Tips
• Tip 1: Carefully define your variables and keep track of their
dimensionality!
• Tip 2: Chain rule! If y = f(u) and u = g(x), i.e., y = f(g(x)), then:
∂y/∂x = (∂y/∂u)(∂u/∂x)
Keep straight what variables feed into what computations
• Tip 3: For the top softmax part of a model: first consider the derivative wrt f_c when c = y (the correct class), then consider the derivative wrt f_c when c ≠ y (all the incorrect classes)
• Tip 4: Work out element-wise partial derivatives if you’re getting
confused by matrix calculus!
• Tip 5: Use Shape Convention. Note: The error message 𝜹 that arrives at a hidden layer has the same dimensionality as that hidden layer
3. Backpropagation
We’ve almost shown you backpropagation
It’s taking derivatives and using the (generalized,
multivariate, or matrix) chain rule
Other trick:
We re-use derivatives computed for higher layers in
computing derivatives for lower layers to minimize
computation
Computation Graphs and Backpropagation
• We represent our neural net equations as a graph
• Source nodes: inputs
• Interior nodes: operations
• Edges pass along the result of the operation
• Computing values along the edges from the inputs to the output is “Forward Propagation”
[Figure: computation graph that builds up the score s from the inputs x, W, b, u through *, +, and f nodes]
Backpropagation
• Go backwards along edges
• Pass along gradients
Backpropagation: Single Node
• Node receives an “upstream gradient”
• Goal is to pass on the correct
“downstream gradient”
[Figure: a single node, with the downstream gradient on its input side and the upstream gradient on its output side]
Backpropagation: Single Node
• Each node has a local gradient
• The gradient of its output with respect to its input
• Chain rule! [downstream gradient] = [upstream gradient] x [local gradient]
Backpropagation: Single Node
• What about nodes with multiple inputs?
• Multiple inputs → multiple local gradients, and one downstream gradient per input
An Example
Forward prop steps and local gradients for a small graph (reconstructed from the slide figures): f(x, y, z) = (x + y) · max(y, z), with inputs x = 1, y = 2, z = 0.
• Forward: a = x + y = 3, b = max(y, z) = 2, f = a · b = 6
• Local gradients: ∂f/∂a = b = 2, ∂f/∂b = a = 3; ∂a/∂x = ∂a/∂y = 1; ∂b/∂y = 1 and ∂b/∂z = 0 (since y > z)
• Backward (upstream * local = downstream), starting from ∂f/∂f = 1: the “*” node sends 2 to the “+” branch and 3 to the “max” branch; “+” passes 2 to x and 2 to y; “max” passes 3 to y and 0 to z
Gradients sum at outward branches
• y feeds into both “+” and “max”, so its downstream gradients are summed: ∂f/∂y = 2 + 3 = 5
Node Intuitions
• + “distributes” the upstream gradient to each summand (here, 2 to x and 2 to y)
• max “routes” the upstream gradient: all of it goes to the argument that achieved the max (3 to y), and 0 to the rest (z)
• * “switches” the upstream gradient: each input’s gradient is the upstream gradient times the other input’s value
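A short Python sketch of this worked example (my reconstruction of the graph in the figures: f(x, y, z) = (x + y) · max(y, z) with x = 1, y = 2, z = 0), computing the same forward values and gradients:

```python
# Forward pass: f(x, y, z) = (x + y) * max(y, z)
x, y, z = 1.0, 2.0, 0.0
a = x + y              # a = 3
b = max(y, z)          # b = 2
f = a * b              # f = 6

# Backward pass: downstream = upstream * local, starting from df/df = 1
df = 1.0
da = df * b                                    # "*" switches: local grad wrt a is b -> 2
db = df * a                                    # local grad wrt b is a -> 3
dx = da * 1.0                                  # "+" distributes: local grads are 1 -> 2
dy = da * 1.0 + db * (1.0 if y > z else 0.0)   # gradients sum at branches -> 2 + 3 = 5
dz = db * (1.0 if z > y else 0.0)              # "max" routes: non-max input gets 0 -> 0

print(f, dx, dy, dz)   # 6.0 2.0 5.0 0.0
```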
Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
• First compute ∂s/∂b
• Then independently compute ∂s/∂W
• Duplicated computation!
• Correct way:
• Compute all the gradients at once
• Analogous to using 𝜹 when we computed gradients by hand
Back-Prop in General Computation Graph
Assume a single scalar output z.
1. Fprop: visit nodes in topological sort order
- Compute value of node given predecessors
2. Bprop:
- initialize output gradient = 1
- visit nodes in reverse order:
- Compute gradient wrt each node using gradient wrt successors:
∂z/∂x = Σ_i (∂z/∂y_i)(∂y_i/∂x), where {y_1, …, y_n} = successors of x
Done correctly, the big O() complexity of fprop and bprop is the same.
In general our nets have a regular layer-structure, and so we can use matrices and Jacobians…
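A compact Python sketch of this procedure (an illustration, not the course code): nodes are stored in topological order, fprop fills in values, and bprop walks the list in reverse, accumulating each predecessor's gradient from its successors:

```python
class Node:
    """One operation in the graph. Inputs are other Nodes (leaves have op 'leaf')."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs
        self.value, self.grad = None, 0.0

    def forward(self):
        vals = [n.value for n in self.inputs]
        if self.op == "add": self.value = vals[0] + vals[1]
        elif self.op == "mul": self.value = vals[0] * vals[1]
        elif self.op == "max": self.value = max(vals)
        # "leaf": value is already set
        return self.value

    def local_grads(self):
        """Gradient of this node's output wrt each of its inputs."""
        vals = [n.value for n in self.inputs]
        if self.op == "add": return [1.0, 1.0]
        if self.op == "mul": return [vals[1], vals[0]]
        if self.op == "max": return [1.0 if v == self.value else 0.0 for v in vals]
        return []

def fprop(topo_order):
    for node in topo_order:              # visit nodes in topological order
        node.forward()

def bprop(topo_order):
    topo_order[-1].grad = 1.0            # initialize output gradient = 1
    for node in reversed(topo_order):    # visit nodes in reverse order
        for inp, local in zip(node.inputs, node.local_grads()):
            inp.grad += node.grad * local    # gradients sum over successors

# The example from the previous slides: f = (x + y) * max(y, z)
x, y, z = Node("leaf"), Node("leaf"), Node("leaf")
x.value, y.value, z.value = 1.0, 2.0, 0.0
a, b = Node("add", x, y), Node("max", y, z)
f = Node("mul", a, b)
topo = [x, y, z, a, b, f]                # a valid topological order
fprop(topo); bprop(topo)
print(f.value, x.grad, y.grad, z.grad)   # 6.0 2.0 5.0 0.0
```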
Automatic Differentiation
• The gradient computation can be
automatically inferred from the
symbolic expression of the fprop
• Each node type needs to know how
to compute its output and how to
compute the gradient wrt its inputs
given the gradient wrt its output
• Modern DL frameworks (Tensorflow,
PyTorch, etc.) do backpropagation
for you, but mainly leave it to the layer/node writer to hand-calculate the local derivative
Backprop Implementations
Implementation: forward/backward API
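The slides here show code screenshots; as a hypothetical sketch of the forward/backward API they illustrate, each gate computes its output in forward() (caching what it needs) and turns the upstream gradient into downstream gradients in backward():

```python
import math

class MultiplyGate:
    """z = x * y. Caches the inputs so backward() can 'switch' them."""
    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # downstream = upstream * local gradient
        return dz * self.y, dz * self.x

class SigmoidGate:
    """h = sigmoid(z). Caches the output; local gradient is h * (1 - h)."""
    def forward(self, z):
        self.h = 1.0 / (1.0 + math.exp(-z))
        return self.h

    def backward(self, dh):
        return dh * self.h * (1.0 - self.h)

# Usage: forward pass computes sigmoid(x * y); backward pass returns d/dx and d/dy
mul, sig = MultiplyGate(), SigmoidGate()
out = sig.forward(mul.forward(2.0, -1.0))
dx, dy = mul.backward(sig.backward(1.0))
```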
Manual Gradient checking: Numeric Gradient
• For small h (≈ 1e-4), use the two-sided estimate f′(x) ≈ ( f(x + h) − f(x − h) ) / (2h); see the sketch after this list
• Easy to implement correctly
• But approximate and very slow:
• Have to recompute f for every parameter of our model
• Useful for checking your implementation
• In the old days when we hand-wrote everything, it was key
to do this everywhere.
• Now much less needed, when throwing together layers
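A sketch of such a check using central differences (the function names and tolerance here are my own choices):

```python
import numpy as np

def numeric_grad(f, x, h=1e-4):
    """Approximate the gradient of a scalar function f at x:
    df/dx_i ≈ (f(x + h*e_i) - f(x - h*e_i)) / (2h).
    Slow: two evaluations of f per parameter."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                      # restore the parameter
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Example check: f(x) = ||x||^2 has analytic gradient 2x
x = np.random.randn(5)
assert np.allclose(numeric_grad(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-5)
```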
Summary
We’ve mastered the core technology of neural nets! 🎉
• Backpropagation: recursively (and hence efficiently)
apply the chain rule along computation graph
• [downstream gradient] = [upstream gradient] x [local gradient]
• Forward pass: compute results of operations and save
intermediate values
• Backward pass: apply chain rule to compute gradients
Why learn all these details about gradients?
• Modern deep learning frameworks compute gradients for you!
• But why take a class on compilers or systems when they are
implemented for you?
• Understanding what is going on under the hood is useful!
• Backpropagation doesn’t always work perfectly
• Understanding why is crucial for debugging and improving
models
• See Karpathy article (in syllabus):
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
• Example in future lecture: exploding and vanishing gradients