Natural Language Processing
with Deep Learning
CS224N/Ling284
Christopher Manning
Lecture 3: Neural net learning: Gradients by hand (matrix calculus)
and algorithmically (the backpropagation algorithm)
1. Introduction
Assignment 2 is all about making sure you really understand the math of neural networks
… then we’ll let the software do it!
We’ll go through it all quickly today, but this is the one week of the quarter where it’s most important to work through the readings!
This will be a tough week for some! → Make sure to get help if you need it:
Visit office hours! Read tutorial materials on the syllabus!
Thursday will be mainly linguistics! Some people find that tough too. 😉
PyTorch tutorial: 3:30pm Friday in Gates B01
A great chance to get an intro to PyTorch, a key deep learning package, before Assignment 3!
NER: Binary classification for center word being location
• We do supervised training and want a high score if it’s a location
J_t(θ) = σ(s) = 1 / (1 + e^(−s))   (the predicted model probability of the class)
f = some element-wise non-linear function, e.g., logistic, tanh, ReLU
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ] ∈ ℝ^(5d)   (concatenation of the embeddings of the 1-hot words)
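To make the slide concrete, here is a minimal NumPy sketch of this window classifier, assuming the decomposition used later in the lecture (z = Wx + b, h = f(z), s = uᵀh, probability σ(s)); the dimensions and random initialization are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_window, n_hidden = 4, 5, 8             # toy sizes: d-dim word vectors, 5-word window

# Concatenated embeddings of the window "museums in Paris are amazing": x ∈ R^(5d)
x = rng.standard_normal(n_window * d)

W = rng.standard_normal((n_hidden, n_window * d))
b = rng.standard_normal(n_hidden)
u = rng.standard_normal(n_hidden)

f = np.tanh                                  # element-wise non-linearity (logistic/ReLU also possible)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

z = W @ x + b
h = f(z)
s = u @ h                                    # score for "center word is a location"
prob = sigmoid(s)                            # predicted probability, as in J_t(θ) = σ(s)
print(prob)
```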
Neural computation
The original McCulloch & Pitts (1943) threshold unit:
𝟏(Wx > θ) = 𝟏(Wx − θ > 0)
This function has no slope, so there is no gradient-based learning.
Non-linearities, old and new
logistic (“sigmoid”)   tanh   hard tanh   ReLU (Rectified Linear Unit)   Leaky ReLU / Parametric ReLU
ReLU(z) = max(z, 0)
tanh is just a rescaled and shifted sigmoid (2× as steep, range [−1, 1]):
tanh(z) = 2·logistic(2z) − 1
GELU (arXiv:1606.08415):  GELU(x) = x · P(X ≤ x), where X ~ N(0, 1)  ≈  x · logistic(1.702x)
Swish (arXiv:1710.05941):  swish(x) = x · logistic(x)
Logistic and tanh are still used (e.g., logistic to get a probability)
However, now, for deep networks, the first thing to try is ReLU: it
trains quickly and performs well due to good gradient backflow.
ReLU has a negative “dead zone” that recent proposals mitigate
GELU is frequently used with Transformers (BERT, RoBERTa, etc.)
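For reference, a small NumPy sketch of these non-linearities (the GELU line uses the logistic approximation quoted above):

```python
import numpy as np

def logistic(z):                      # "sigmoid"
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                          # rescaled, shifted sigmoid
    return 2.0 * logistic(2.0 * z) - 1.0

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):        # small slope in the negative "dead zone"
    return np.where(z > 0, z, alpha * z)

def swish(z):
    return z * logistic(z)

def gelu(z):                          # logistic approximation of x * P(X <= x), X ~ N(0, 1)
    return z * logistic(1.702 * z)
```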
Non-linearities (i.e., “f ” on previous slide): Why they’re needed
• Neural networks do function approximation,
e.g., regression or classification
• Without non-linearities, deep neural networks
can’t do anything more than a linear transform
• Extra layers could just be compiled down into a
single linear transform: W₁W₂x = Wx (see the sketch after this list)
• But, with more layers that include non-linearities,
they can approximate any complex function!
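A quick numerical illustration of that point, using small random matrices (a sketch, not part of the original slides): two stacked linear maps are exactly one linear map, while inserting a ReLU between them changes the function.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 5)), rng.standard_normal((3, 4))
x = rng.standard_normal(5)

# Two linear layers collapse into a single linear transform W = W2 @ W1
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# With a non-linearity in between, the composition is no longer linear
relu = lambda z: np.maximum(z, 0.0)
print(W2 @ relu(W1 @ x))   # generally differs from (W2 @ W1) @ x
```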
Training with “cross entropy loss” – you use this in PyTorch!
• Until now, our objective was stated as maximizing the probability of the correct class y,
or equivalently, minimizing the negative log probability of that class
• Now restated in terms of cross entropy, a concept from information theory
• Let the true probability distribution be p; let our computed model probability be q
• The cross entropy is:
H(p, q) = − Σ_c p(c) log q(c)
• Assuming a ground truth (or true or gold or target) probability distribution that is 1 at
the right class and 0 everywhere else, p = [0, …, 0, 1, 0, …, 0], then:
• Because of the one-hot p, the only term left is the negative log probability of the true
class y_i:  − log p(y_i | x_i)
Cross entropy can be used in other ways with a more interesting p,
but for now just know that you’ll want to use it as the loss in PyTorch
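In PyTorch this is torch.nn.CrossEntropyLoss, which expects unnormalized scores (logits) and gold class indices, and combines log-softmax with the negative log likelihood. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

logits = torch.randn(8, 5, requires_grad=True)      # batch of 8 examples, 5 classes (unnormalized scores)
targets = torch.tensor([0, 2, 1, 4, 3, 0, 2, 2])    # gold class index y_i for each example

criterion = nn.CrossEntropyLoss()                   # log-softmax + negative log likelihood
loss = criterion(logits, targets)                   # mean of −log q(y_i | x_i) over the batch
loss.backward()                                     # gradients end up in logits.grad
```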
Remember: Stochastic Gradient Descent
Update equation:
θ_new = θ_old − α ∇_θ J(θ)
α = step size or learning rate
i.e., for each parameter:
θ_j^new = θ_j^old − α · ∂J(θ) / ∂θ_j^old
In deep learning, θ includes the data representation (e.g., word vectors) too!
How can we compute ∇_θ J(θ)?
1. By hand
2. Algorithmically: the backpropagation algorithm
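Once the gradient is in hand (by either route), the update itself is one line. A NumPy sketch with placeholder names (theta, grad, lr):

```python
import numpy as np

theta = np.zeros(10)                                   # parameters (would include word vectors too)
grad = np.random.default_rng(0).standard_normal(10)    # stand-in for ∇_θ J(θ) on a minibatch
lr = 0.1                                               # α, the step size / learning rate

theta -= lr * grad                                     # θ_new = θ_old − α ∇_θ J(θ)
```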
Lecture Plan
Lecture 3: Gradients by hand and algorithmically
1. Introduction (10 mins)
2. Matrix calculus (35 mins)
3. Backpropagation (35 mins)
Computing Gradients by Hand
• Matrix calculus: Fully vectorized gradients
• “Multivariable calculus is just like single-variable calculus if you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for intuition; recall the first
lecture for an example
• Lecture notes and matrix calculus notes cover this material in more detail
• You might also review Math 51, which has an online textbook:
http://web.stanford.edu/class/math51/textbook.html
Gradients
• Given a function with 1 output and 1 input
f(x) = x³
• Its gradient (slope) is its derivative
df/dx = 3x²
“How much will the output change if we change the input a bit?”
At x = 1 it changes about 3 times as much: 1.01³ = 1.03
At x = 4 it changes about 48 times as much: 4.01³ = 64.48
Gradients
• Given a function with 1 output and n inputs:  f(x) = f(x₁, x₂, …, xₙ)
• Its gradient is a vector of partial derivatives with
respect to each input:  ∂f/∂x = [ ∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ ]
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs:  f(x) = [ f₁(x₁, …, xₙ), …, f_m(x₁, …, xₙ) ]
• Its Jacobian is an m × n matrix of partial derivatives:  (∂f/∂x)_ij = ∂f_i/∂x_j
Chain Rule
• For the composition of one-variable functions: multiply derivatives
dz/dx = (dz/dy)(dy/dx)
• For functions of multiple variables: multiply Jacobians
∂z/∂x = (∂z/∂y)(∂y/∂x)
Example Jacobian: Elementwise activation function
h = f(z), with f applied element-wise, so h and z are both n-vectors
Function has n outputs and n inputs → n by n Jacobian
(∂h/∂z)_ij = ∂h_i/∂z_j = ∂f(z_i)/∂z_j = f′(z_i) if i = j, and 0 otherwise
i.e., the Jacobian is diagonal:  ∂h/∂z = diag(f′(z))
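A quick numerical confirmation that this Jacobian is diagonal, sketched with tanh as f (sizes and values are illustrative):

```python
import numpy as np

z = np.array([0.5, -1.0, 2.0])
f = np.tanh
fprime = lambda z: 1.0 - np.tanh(z) ** 2

# Build the n x n Jacobian of h = f(z) column by column with central differences
eps = 1e-6
J = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3); e[j] = eps
    J[:, j] = (f(z + e) - f(z - e)) / (2 * eps)

print(np.round(J, 4))                      # off-diagonal entries are 0
print(np.round(np.diag(fprime(z)), 4))     # matches diag(f'(z))
```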
Other Jacobians
• ∂/∂x (Wx + b) = W
• ∂/∂b (Wx + b) = I   (the identity matrix)
• ∂/∂u (uᵀh) = hᵀ
Fine print: hᵀ is the correct Jacobian.
Later we discuss the “shape convention”;
using it the answer would be h.
• Compute these at home for practice!
• Check your answers with the lecture notes
Back to our Neural Net!
x = [ x_museums  x_in  x_Paris  x_are  x_amazing ]
• Let’s find ∂s/∂b
• Really, we care about the gradient of the loss J_t, but we
will compute the gradient of the score s for simplicity
1. Break up equations into simple pieces
s = uᵀh
h = f(z)
z = Wx + b
(x is the input vector of concatenated word embeddings)
Carefully define your variables and keep track of their dimensionality!
2. Apply the chain rule
∂s/∂b = (∂s/∂h)(∂h/∂z)(∂z/∂b)
3. Write out the Jacobians
∂s/∂b = (∂s/∂h)(∂h/∂z)(∂z/∂b)
Useful Jacobians from the previous slides:
∂s/∂h = ∂(uᵀh)/∂h = uᵀ    ∂h/∂z = diag(f′(z))    ∂z/∂b = I
So:
∂s/∂b = uᵀ diag(f′(z)) I = uᵀ ⊙ f′(z)ᵀ
⊙ = Hadamard product = element-wise multiplication
of 2 vectors to give a vector
Re-using Computation
• Suppose we now want to compute ∂s/∂W
• Using the chain rule again:
∂s/∂W = (∂s/∂h)(∂h/∂z)(∂z/∂W)
• The first two factors are the same as for ∂s/∂b! Let’s avoid duplicated computation …
• Define  𝜹 = (∂s/∂h)(∂h/∂z) = uᵀ diag(f′(z)) = uᵀ ⊙ f′(z)ᵀ,  so that  ∂s/∂W = 𝜹 (∂z/∂W)
𝜹 is the upstream gradient (“error signal”)
Derivative with respect to Matrix: Output shape
• What does ∂s/∂W look like?
• 1 output, nm inputs: a 1 by nm Jacobian?
• Inconvenient to then do the update θ_new = θ_old − α ∇_θ J(θ)
• Instead, we leave pure math and use the shape convention:
the shape of the gradient is the shape of the parameters!
• So ∂s/∂W is n by m, the same shape as W, with entries (∂s/∂W)_ij = ∂s/∂W_ij
Derivative with respect to Matrix
• What is ∂s/∂W?
• 𝜹 is going to be in our answer
• The other term should be x because z = Wx + b
• Answer is:
∂s/∂W = 𝜹ᵀ xᵀ    ([n × m] = [n × 1][1 × m], an outer product)
𝜹 is the upstream gradient (“error signal”) at z
x is the local input signal
Why the Transposes?
• Hacky answer: this makes the dimensions work out!
• Useful trick for checking your work!
• Full explanation in the lecture notes
• Each input goes to each output – you want to get outer product
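A small NumPy sanity check of the shapes and the outer-product formula, using the score s = uᵀ f(Wx + b) from this lecture (the sizes n = 3, m = 5 and the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5
W = rng.standard_normal((n, m))
b, u, x = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(m)

f = np.tanh                                  # elementwise non-linearity
fprime = lambda z: 1.0 - np.tanh(z) ** 2

def score(W):
    return u @ f(W @ x + b)                  # s = u^T f(Wx + b)

z = W @ x + b
delta = u * fprime(z)                        # δ = u^T ⊙ f'(z)^T, stored as a length-n vector
dW = np.outer(delta, x)                      # ∂s/∂W = δ^T x^T, an n × m matrix (shape convention)

# Numeric check of one entry, ∂s/∂W[1, 2]
eps = 1e-6
W_pert = W.copy(); W_pert[1, 2] += eps
print(dW[1, 2], (score(W_pert) - score(W)) / eps)   # the two numbers should agree closely
```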
Deriving local input gradient in backprop
• For ∂z/∂W in our equation:
∂s/∂W = 𝜹 ∂z/∂W = 𝜹 ∂/∂W (Wx + b)
• Let’s consider the derivative of a single weight W_ij
• W_ij only contributes to z_i
• For example: W₂₃ is only used to compute z₂, not z₁
[Figure: a small network with inputs x₁, x₂, x₃, +1, hidden units h₁ = f(z₁) and h₂ = f(z₂) with bias b₂, the weight W₂₃, and output s reached via u₂]
∂z_i/∂W_ij = ∂/∂W_ij (W_i· x + b_i)
= ∂/∂W_ij Σ_{k=1}^{d} W_ik x_k = x_j
What shape should derivatives be?
• Similarly, ∂s/∂b is a row vector
• But shape convention says our gradient should be a column vector because b is
a column vector …
• Disagreement between Jacobian form (which makes the chain rule
easy) and the shape convention (which makes implementing SGD easy)
• We expect answers in the assignment to follow the shape convention
• But Jacobian form is useful for computing the answers
What shape should derivatives be?
Two options for working through specific problems:
1. Use Jacobian form as much as possible, reshape to
follow the shape convention at the end:
• What we just did. But at the end transpose ∂s/∂b to make the
derivative a column vector, resulting in 𝜹ᵀ
2. Always follow the shape convention
• Look at dimensions to figure out when to transpose and/or
reorder terms
• The error message 𝜹 that arrives at a hidden layer has the
same dimensionality as that hidden layer
3. Backpropagation
We’ve almost shown you backpropagation
It’s taking derivatives and using the (generalized, multivariate, or matrix)
chain rule
Other trick:
We re-use derivatives computed for higher layers in computing
derivatives for lower layers to minimize computation
Computation Graphs and Backpropagation
• Software represents our neural net equations as a graph
• Source nodes: inputs
• Interior nodes: operations
• Edges pass along the result of the operation
• “Forward Propagation”: compute values along the graph from the inputs to the output
(here, x, W, b → z = Wx + b → h = f(z) → s = uᵀh)
Backpropagation
• Then go backwards along edges
• Pass along gradients
Backpropagation: Single Node
• A node receives an “upstream gradient”
• The goal is to pass on the correct “downstream gradient”
• Each node has a local gradient
• The gradient of its output with respect to its input
• Chain rule!  [downstream gradient] = [upstream gradient] x [local gradient]
• What about nodes with multiple inputs?
• Multiple inputs → multiple local gradients (one downstream gradient per input)
An Example
f(x, y, z) = (x + y) · max(y, z),  evaluated at x = 1, y = 2, z = 0
Forward prop steps:
a = x + y = 3
b = max(y, z) = 2
f = a · b = 6
Local gradients:
∂a/∂x = 1,  ∂a/∂y = 1
∂b/∂y = 𝟏(y > z) = 1,  ∂b/∂z = 𝟏(z > y) = 0
∂f/∂a = b = 2,  ∂f/∂b = a = 3
Backward pass (upstream x local = downstream):
∂f/∂f = 1
∂f/∂a = 1 x 2 = 2   (into the + node)
∂f/∂b = 1 x 3 = 3   (into the max node)
∂f/∂x = 2 x 1 = 2
∂f/∂y: receives 2 x 1 = 2 via the + node and 3 x 1 = 3 via the max node
∂f/∂z = 3 x 0 = 0
Gradients sum at outward branches
y feeds into both the + node and the max node, so the gradients flowing back
along the two branches add:  ∂f/∂y = 2 + 3 = 5
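The same worked example in plain Python, computing the forward values and the hand-derived gradients (no framework needed):

```python
# f(x, y, z) = (x + y) * max(y, z), evaluated at x = 1, y = 2, z = 0
x, y, z = 1.0, 2.0, 0.0

# Forward pass
a = x + y                 # 3
b = max(y, z)             # 2
f = a * b                 # 6

# Backward pass: downstream = upstream * local
df_df = 1.0
df_da = df_df * b         # gradient wrt a is the other factor, b = 2
df_db = df_df * a         # gradient wrt b is the other factor, a = 3
df_dx = df_da * 1.0                                        # + passes the gradient through: 2
df_dy = df_da * 1.0 + df_db * (1.0 if y > z else 0.0)      # branches sum: 2 + 3 = 5
df_dz = df_db * (1.0 if z > y else 0.0)                    # max sends gradient only to the larger input: 0

print(f, df_dx, df_dy, df_dz)   # 6.0 2.0 5.0 0.0
```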
Node Intuitions
• + “distributes” the upstream gradient to each summand
• max “routes” the upstream gradient: the larger input gets it all, the other gets 0
• * “switches” the upstream gradient: each input’s gradient is the upstream gradient times the other input’s value
Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
• First compute the gradient with respect to one parameter
• Then independently compute the gradient with respect to another,
redoing the shared upstream work
• Duplicated computation!
• Correct way:
• Compute all the gradients at once
• Analogous to using 𝜹 when we computed gradients by hand
Back-Prop in General Computation Graph
(single scalar output z at the top of the graph; inputs at the bottom)
1. Fprop: visit nodes in topological sort order
- Compute the value of each node given its predecessors
2. Bprop:
- initialize output gradient = 1
- visit nodes in reverse order:
compute the gradient wrt each node using the gradients wrt its successors:
∂z/∂x = Σ_{y ∈ successors(x)} (∂z/∂y)(∂y/∂x)
Done correctly, the big O() complexity of fprop and bprop is the same
In general, our nets have a regular layer structure, and so we can use matrices and Jacobians…
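A minimal, self-contained sketch of this procedure in Python. The Node class, its op names, and the id-keyed gradient dictionary are all illustrative choices, not a real framework’s API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A toy graph node: an op over input nodes, with hand-written local gradients."""
    op: str                                   # "in", "add", or "mul"
    inputs: list = field(default_factory=list)
    value: float = 0.0

    def forward(self):
        vals = [n.value for n in self.inputs]
        self.value = sum(vals) if self.op == "add" else vals[0] * vals[1]

    def local_grads(self):                    # d(output)/d(each input)
        a, b = (n.value for n in self.inputs)
        return [1.0, 1.0] if self.op == "add" else [b, a]

def fprop(topo):                              # topo: nodes in topological order, inputs first
    for n in topo:
        if n.inputs:
            n.forward()
    return topo[-1].value

def bprop(topo):
    grads = {id(topo[-1]): 1.0}               # initialize output gradient = 1
    for n in reversed(topo):                  # visit nodes in reverse topological order
        if not n.inputs:
            continue
        upstream = grads.get(id(n), 0.0)      # gradient wrt this node, summed over successors
        for inp, local in zip(n.inputs, n.local_grads()):
            grads[id(inp)] = grads.get(id(inp), 0.0) + upstream * local
    return grads

# Tiny demo: f = (x + y) * y at x = 1, y = 2  ->  f = 6, df/dx = 2, df/dy = 5
x, y = Node("in", value=1.0), Node("in", value=2.0)
a = Node("add", [x, y])
out = Node("mul", [a, y])
topo = [x, y, a, out]
print(fprop(topo))            # 6.0
g = bprop(topo)
print(g[id(x)], g[id(y)])     # 2.0 5.0
```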
Automatic Differentiation
• The gradient computation can be
automatically inferred from the symbolic
expression of the fprop
• Each node type needs to know how to
compute its output and how to compute
the gradient wrt its inputs given the
gradient wrt its output
• Modern DL frameworks (TensorFlow, PyTorch, etc.) do backpropagation for you,
but mainly leave it to the layer/node writer to hand-calculate the local derivative
Backprop Implementations
Implementation: forward/backward API
Each node/gate implements a forward() pass (compute its output from its inputs) and a
backward() pass (compute the gradient wrt its inputs, given the gradient wrt its output).
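A minimal sketch of what such an API can look like; the MultiplyGate class and its caching of inputs are illustrative, not the exact code shown on the original slides:

```python
class MultiplyGate:
    """z = x * y, exposing the forward/backward API."""

    def forward(self, x, y):
        self.x, self.y = x, y          # cache local inputs for use in backward
        return x * y

    def backward(self, dz):            # dz = upstream gradient dL/dz
        dx = dz * self.y               # local gradient dz/dx = y
        dy = dz * self.x               # local gradient dz/dy = x
        return dx, dy                  # downstream gradients, one per input

gate = MultiplyGate()
z = gate.forward(3.0, 2.0)             # forward pass: 6.0
dx, dy = gate.backward(1.0)            # backward pass with upstream gradient 1: (2.0, 3.0)
```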
Manual Gradient checking: Numeric Gradient
• For small h (≈ 1e-4),  f′(x) ≈ [ f(x + h) − f(x − h) ] / 2h
• Easy to implement correctly
• But approximate and very slow:
• You have to recompute f for every parameter of our model
• Useful for checking your implementation
• In the old days, when we hand-wrote everything, doing this check everywhere was the key test
• Now much less needed; you can use it to check layers are correctly implemented
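A sketch of such a two-sided numeric gradient check in NumPy (the quadratic test function is just an illustration):

```python
import numpy as np

def numeric_gradient(f, theta, h=1e-4):
    """Two-sided estimate: slow (2 evaluations of f per parameter) but easy to get right."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = h
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

# Check the analytic gradient of f(θ) = Σ θ², which is 2θ
theta = np.array([1.0, -2.0, 0.5])
print(numeric_gradient(lambda t: np.sum(t ** 2), theta))   # ≈ [ 2. -4.  1.]
```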
Summary
We’ve mastered the core technology of neural nets! 🎉 🎉 🎉
• Backpropagation: recursively (and hence efficiently) apply the chain rule
along computation graph
• [downstream gradient] = [upstream gradient] x [local gradient]
• Forward pass: compute results of operations and save intermediate
values
• Backward pass: apply chain rule to compute gradients
Why learn all these details about gradients?
• Modern deep learning frameworks compute gradients for you!
• Come to the PyTorch introduction this Friday!
• But why take a class on compilers or systems when they are implemented for you?
• Understanding what is going on under the hood is useful!
• Backpropagation doesn’t always work perfectly out of the box
• Understanding why is crucial for debugging and improving models
• See Karpathy article (in syllabus):
• https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b
• Example in future lecture: exploding and vanishing gradients