grokking
Deep Learning
Andrew W. Trask
MANNING
Shelter Island
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in
quantity. For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road, PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com
©2019 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by means electronic, mechanical, photocopying, or
otherwise, without prior written permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Where those designations appear in the book,
and Manning Publications was aware of a trademark claim, the designations have been
printed in initial caps or all caps.
∞
Recognizing the importance of preserving what has been written, it is Manning’s
policy to have the books we publish printed on acid-free paper, and we exert our best
efforts to that end. Recognizing also our responsibility to conserve the resources of
our planet, Manning books are printed on paper that is at least 15 percent recycled
and processed without the use of elemental chlorine.
Manning Publications Co.
20 Baldwin Road
Shelter Island, NY 11964
Development editor: Christina Taylor
Review editor: Aleksandar Dragosavljevic
Production editor: Lori Weidert
Copyeditor: Tiffany Taylor
Proofreader: Sharon Wilkey
Technical proofreader: David Fombella Pomball
Typesetter: Dennis Dalinnik
Cover designer: Leslie Haimes
ISBN: 9781617293702
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – SP – 23 22 21 20 19 18
To Mom. You sacrificed so much time in your life to bless Tara and me with education.
I hope you see your work behind this book.
And to Dad. Thank you for loving us so much and for taking the time to teach me
programming and technology at such a young age. I wouldn’t be doing this without you.
It is a great honor to be your son.
contents
preface
acknowledgments
about this book
about the author
1 Introducing deep learning: why you should learn it
Welcome to Grokking Deep Learning
Why you should learn deep learning
Will this be difficult to learn?
Why you should read this book
What you need to get started
You’ll probably need some Python knowledge
Summary
2 Fundamental concepts: how do machines learn?
What is deep learning?
What is machine learning?
Supervised machine learning
Unsupervised machine learning
Parametric vs. nonparametric learning
Supervised parametric learning
Unsupervised parametric learning
Nonparametric learning
Summary
3 Introduction to neural prediction: forward propagation
Step 1: Predict
A simple neural network making a prediction
What is a neural network?
What does this neural network do?
Making a prediction with multiple inputs
Multiple inputs: What does this neural network do?
Multiple inputs: Complete runnable code
Making a prediction with multiple outputs
Predicting with multiple inputs and outputs
Multiple inputs and outputs: How does it work?
Predicting on predictions
A quick primer on NumPy
Summary
4 Introduction to neural learning: gradient descent
Predict, compare, and learn
Compare
Learn
Compare: Does your network make good predictions?
Why measure error?
What’s the simplest form of neural learning?
Hot and cold learning
Characteristics of hot and cold learning
Calculating both direction and amount from error
One iteration of gradient descent
Learning is just reducing error
Let’s watch several steps of learning
Why does this work? What is weight_delta, really?
Tunnel vision on one concept
A box with rods poking out of it
Derivatives: Take two
What you really need to know
What you don’t really need to know
How to use a derivative to learn
Look familiar?
Breaking gradient descent
Visualizing the overcorrections
Divergence
Introducing alpha
Alpha in code
Memorizing
5 Learning multiple weights at a time:
generalizing gradient descent
Gradient descent learning with multiple inputs
Gradient descent with multiple inputs explained
Let’s watch several steps of learning
Freezing one weight: What does it do?
Gradient descent learning with multiple outputs
Gradient descent with multiple inputs and outputs
What do these weights learn?
Visualizing weight values
Visualizing dot products (weighted sums)
Summary
6 Building your first deep neural network:
introduction to backpropagation
The streetlight problem
Preparing the data
Matrices and the matrix relationship
Creating a matrix or two in Python
Building a neural network
Learning the whole dataset
Full, batch, and stochastic gradient descent
Neural networks learn correlation
Up and down pressure
Edge case: Overfitting
Edge case: Conflicting pressure
Learning indirect correlation
Creating correlation
Stacking neural networks: A review
Backpropagation: Long-distance error attribution
Backpropagation: Why does this work?
Linear vs. nonlinear
Why the neural network still doesn’t work
The secret to sometimes correlation
A quick break
Your first deep neural network
Backpropagation in code
One iteration of backpropagation
Putting it all together
Why do deep networks matter?
7 How to picture neural networks: in your head and on paper
It’s time to simplify
Correlation summarization
The previously overcomplicated visualization
The simplified visualization
Simplifying even further
Let’s see this network predict
Visualizing using letters instead of pictures
Linking the variables
Everything side by side
The importance of visualization tools
8 Learning signal and ignoring noise:
introduction to regularization and batching
Three-layer network on MNIST
Well, that was easy
Memorization vs. generalization
Overfitting in neural networks
Where overfitting comes from
The simplest regularization: Early stopping
Industry standard regularization: Dropout
Why dropout works: Ensembling works
Dropout in code
Dropout evaluated on MNIST
Batch gradient descent
Summary
9 Modeling probabilities and nonlinearities: activation functions
What is an activation function?
Standard hidden-layer activation functions
Standard output layer activation functions
The core issue: Inputs have similarity
softmax computation
Activation installation instructions
Multiplying delta by the slope
Converting output to slope (derivative)
Upgrading the MNIST network
10 Neural learning about edges and corners:
intro to convolutional neural networks
Reusing weights in multiple places
The convolutional layer
A simple implementation in NumPy
Summary
11 Neural networks that understand language:
king – man + woman == ?
What does it mean to understand language?
Natural language processing (NLP)
Supervised NLP
IMDB movie reviews dataset
Capturing word correlation in input data
Predicting movie reviews
Intro to an embedding layer
Interpreting the output
Neural architecture
Comparing word embeddings
What is the meaning of a neuron?
Filling in the blank
Meaning is derived from loss
King – Man + Woman ~= Queen
Word analogies
Summary
12 Neural networks that write like Shakespeare:
recurrent layers for variable-length data
The challenge of arbitrary length
Do comparisons really matter?
The surprising power of averaged word vectors
How is information stored in these embeddings?
How does a neural network use embeddings?
The limitations of bag-of-words vectors
Using identity vectors to sum word embeddings
Matrices that change absolutely nothing
Learning the transition matrices
Learning to create useful sentence vectors
Forward propagation in Python
How do you backpropagate into this?
Let’s train it!
Setting things up
Forward propagation with arbitrary length
Backpropagation with arbitrary length
Weight update with arbitrary length
Execution and output analysis
Summary
13 Introducing automatic optimization:
let’s build a deep learning framework
What is a deep learning framework?
Introduction to tensors
Introduction to automatic gradient computation (autograd)
A quick checkpoint
Tensors that are used multiple times
Upgrading autograd to support multiuse tensors
How does addition backpropagation work?
Adding support for negation
Adding support for additional functions
Using autograd to train a neural network
Adding automatic optimization
Adding support for layer types
Layers that contain layers
Loss-function layers
How to learn a framework
Nonlinearity layers
The embedding layer
Adding indexing to autograd
The embedding layer (revisited)
The cross-entropy layer
The recurrent neural network layer
Summary
14 Learning to write like Shakespeare:
long short-term memory
Character language modeling
The need for truncated backpropagation
Truncated backpropagation
A sample of the output
Vanishing and exploding gradients
A toy example of RNN backpropagation
Long short-term memory (LSTM) cells
Some intuition about LSTM gates
The long short-term memory layer
Upgrading the character language model
Training the LSTM character language model
Tuning the LSTM character language model
Summary
15 Deep learning on unseen data:
introducing federated learning
The problem of privacy in deep learning
Federated learning
Learning to detect spam
Let’s make it federated
Hacking into federated learning
Secure aggregation
Homomorphic encryption
Homomorphically encrypted federated learning
Summary
16 Where to go from here: a brief guide
Congratulations!
Step 1: Start learning PyTorch
Step 2: Start another deep learning course
Step 3: Grab a mathy deep learning textbook
Step 4: Start a blog, and teach deep learning
Step 5: Twitter
Step 6: Implement academic papers
Step 7: Acquire access to a GPU (or many)
Step 8: Get paid to practice
Step 9: Join an open source project
Step 10: Develop your local community
index
preface
Grokking Deep Learning is the product of a monumental three years of effort. To get to
the book you hold in your hand, I wrote at least twice the number of pages you see here.
Half-a-dozen chapters were rewritten from scratch three or four times before they were
ready to publish, and along the way important chapters were added that weren’t part of
the original plan.
More significantly, I arrived at two decisions early on that make Grokking Deep Learning
uniquely valuable: this book requires no math background beyond basic arithmetic, and
it doesn’t rely on a high-level library that might hide what is going on. In other words,
anyone can read this book and understand how deep learning really works. To accomplish
this, I had to invent new ways to describe and teach the core ideas and techniques without
falling back on advanced mathematics or sophisticated code that someone else wrote.
My goal in writing Grokking Deep Learning was to create the lowest possible barrier to
entry to the practice of deep learning. You don’t just read the theory; you’ll discover it
yourself. To help you get there, I wrote a lot of code and did my best
to explain it in the right order so that the code snippets required for the working demos
all made sense.
This knowledge, combined with all the theory, code, and examples you’ll explore in this
book, will make you much faster at iterating through experiments. You’ll have quick
successes and better job opportunities, and you’ll even learn about more-advanced deep
learning concepts more rapidly.
In the last three years, I not only authored this book, but also entered a PhD program at
Oxford, joined the team at Google, and helped spearhead OpenMined, a decentralized
artificial intelligence platform. This book is the culmination of years of thinking, learning,
and teaching.
There are many other resources for learning deep learning. I’m glad that you came to
this one.
acknowledgments
I’m exceedingly grateful for everyone who has contributed to the production of Grokking Deep
Learning. First and foremost, I’d like to thank the amazing team at Manning: Bert Bates, who
taught me how to write; Christina Taylor, who patiently kept me going for three years; Michael
Stephens, whose creativity has allowed the book to have great success even before publication;
and Marjan Bace, whose encouragement in the midst of delays made all the difference.
Grokking Deep Learning wouldn’t be what it is without the immense contributions of early
readers through email, Twitter, and GitHub. I feel greatly indebted to Jascha Swisher, Varun
Sudhakar, Francois Chollet, Frederico Vitorino, Cody Hammond, Mauricio Maroto Arrieta,
Aleksandar Dragosavljevic, Alan Carter, Frank Hinek, Nicolas Benjamin Hocker, Hank
Meisse, Wouter Hibma, Joerg Rosenkranz, Alex Vieira, and Charlie Harrington for all your
help refining the text and the online code repository.
I’d like to thank the reviewers who took time to read the manuscript at various stages in
development: Alexander A. Myltsev, Amit Lamba, Anand Saha, Andrew Hamor, Cristian
Barrientos Montoya, Eremey Valetov, Gerald Mack, Ian Stirk, Kalyan Reddy, Kamal Raj,
Kelvin D. Meeks, Marco Paulo dos Santos Nogueira, Martin Beer, Massimo Ilario, Nancy
W. Grady, Peter Hampton, Sebastian Maldonado, Shashank Gupta, Tymoteusz Wołodźko,
Kumar Unnikrishnan, Vipul Gupta, Will Fuger, and William Wheeler.
I’m also grateful to Mat and Niko at Udacity, who included the book in Udacity’s Deep
Learning Nanodegree, which greatly aided in early awareness of the book among young
deep learning practitioners.
I must thank Dr. William Hooper, who let me wander into his office and bug him about
computer science, who made an exception to let me into his (already full) Programming 1
class, and who inspired me to pursue a career in deep learning. I am exceedingly thankful
for all the patience you had with me starting out. You have blessed me immensely.
Finally, I’d like to thank my wife for being so patient with me during all the nights and
weekends spent working on the book, for copyediting the entire text several times herself,
and for creating and debugging the online GitHub code repository.
about this book
Grokking Deep Learning was written to help give you a foundation in deep learning so that
you can master a major deep learning framework. It begins by focusing on the basics of
neural networks and then switches its focus to provide an in-depth look at advanced layers
and architectures.
Who should read this book
I’ve intentionally written this book with what I believe is the lowest barrier to entry possible.
No knowledge of linear algebra, calculus, convex optimization, or even machine learning
is assumed. Everything from those subjects that’s necessary to understand deep learning
will be explained as we go. If you’ve passed high school mathematics and hacked around in
Python, you’re ready for this book.
Roadmap
This book has 16 chapters:
• Chapter 1 focuses on why you should learn deep learning, and what you’ll need to get
started.
• Chapter 2 starts to dig deep in fundamental concepts, such as machine learning,
parametric and nonparametric models, and supervised and unsupervised learning. It
also introduces the “predict, compare, learn” paradigm that will continue through the
following chapters.
• Chapter 3 will walk you through using simple networks to make a prediction, as well as
provide your first look at a neural network.
• Chapter 4 will teach you how to evaluate the predictions made in chapter 3 and identify
errors to help train models in the next step.
• Chapter 5 focuses on the learn part of the “predict, compare, learn” paradigm. Using an
in-depth example, this chapter walks through the learning process.
• In chapter 6, you’ll build your first “deep” neural network, code and all.
• Chapter 7 focuses on the 10,000-foot view of neural networks and works to simplify your
mental picture.
• Chapter 8 introduces overfitting, dropout, and batch gradient descent, and teaches you
how to classify your dataset within the new network you just built.
• Chapter 9 teaches activation functions and how to use them when modeling probabilities.
• Chapter 10 introduces convolutional neural networks, highlighting the usability of
structure to counter overfitting.
• Chapter 11 dives into natural language processing (NLP) and provides foundational
vocabulary and concepts in the deep learning field.
• Chapter 12 discusses recurrent neural networks, a state-of-the-art approach in nearly
every sequence-modeling field, and one of the most popular tools used in the industry.
• Chapter 13 will fast-track you on how to build a deep learning framework from scratch by
becoming a power user of deep learning frameworks.
• Chapter 14 uses your recurrent neural network to tackle a more challenging task: language
modeling.
• Chapter 15 focuses on privacy in data, introducing basic privacy concepts such as
federated learning, homomorphic encryption, and concepts related to differential privacy
and secure multiparty computation.
• Chapter 16 will give you the tools and resources you need to continue your deep learning
journey.
About the code: conventions and downloads
All code in the book is presented in a fixed-width font like this to separate it from
ordinary text. Code annotations accompany some of the listings, highlighting important
concepts.
You can download the code for the examples in the book from the publisher’s website at
www.manning.com/books/grokking-deep-learning, or from https://github.com/iamtrask/
grokking-deep-learning.
Book forum
Purchase of Grokking Deep Learning includes free access to a private web forum run by
Manning Publications, where you can make comments about the book, ask technical
questions, and receive help from the author and from other users. To access the forum, go to
https://forums.manning.com/forums/grokking-deep-learning. You can also learn more about
Manning’s forums and the rules of conduct at https://forums.manning.com/forums/about.
Manning’s commitment to our readers is to provide a venue where a meaningful dialogue
between individual readers and between readers and the author can take place. It isn’t
a commitment to any specific amount of participation on the part of the author, whose
contribution to the forum remains voluntary (and unpaid). We suggest you try asking the
author some challenging questions lest his interest stray! The forum and the archives of
previous discussions will be accessible from the publisher’s website as long as the book is
in print.
about the author
Andrew Trask is the founding member of Digital Reasoning’s machine learning lab, where
deep learning approaches to natural language processing, image recognition, and audio
transcription are being researched. Within several months, Andrew and his research partner
exceeded best published results in sentiment classification and part-of-speech tagging.
He trained the world’s largest artificial neural network with over 160 billion parameters,
the results of which he presented with his coauthor at The International Conference on
Machine Learning. Those results were published in the Journal of Machine Learning. He is
currently the product manager of text and audio analytics at Digital Reasoning, responsible
for driving the analytics roadmap for the Synthesys cognitive computing platform, for which
deep learning is a core competency.
1 introducing deep learning: why you should learn it
In this chapter
• Why you should learn deep learning
• Why you should read this book
• What you need to get started
Do not worry about your difficulties in Mathematics.
I can assure you mine are still greater.
—Albert Einstein
Welcome to Grokking Deep Learning
You’re about to learn some of the most valuable skills
of the century!
I’m very excited that you’re here! You should be, too! Deep learning represents an
exciting intersection of machine learning and artificial intelligence, and a very significant
disruption to society and industry. The methods discussed in this book are changing the
world all around you. From optimizing the engine of your car to deciding which content
you view on social media, it’s everywhere, it’s powerful, and, fortunately, it’s fun!
Why you should learn deep learning
It’s a powerful tool for the incremental automation of intelligence.
From the beginning of time, humans have been building better and better tools to
understand and control the environment around us. Deep learning is today’s chapter in this
story of innovation.
Perhaps what makes this chapter so compelling is that this field is more of a mental
innovation than a mechanical one. Much like its sister fields in machine learning, deep
learning seeks to automate intelligence bit by bit. In the past few years, it has achieved
enormous success and progress in this endeavor, exceeding previous records in computer
vision, speech recognition, machine translation, and many other tasks.
This is particularly extraordinary given that deep learning seems to use largely the same
brain-inspired algorithm (neural networks) for achieving these accomplishments across a
vast number of fields. Even though deep learning is still an actively developing field with
many challenges, recent developments have led to tremendous excitement: perhaps we’ve
discovered not just a great tool, but a window into our own minds.
Deep learning has the potential for significant automation
of skilled labor.
There’s a substantial amount of hype around the potential impacts of deep learning if
the current trend of progress is extrapolated at varying speeds. Although many of these
predictions are overzealous, I believe one merits your consideration: job displacement. I
think this claim stands out from the rest because even if deep learning’s innovations stopped
today, there would already be an incredible impact on skilled labor around the globe. Call-center operators, taxi drivers, and low-level business analysts are compelling examples
where deep learning can provide a low-cost alternative.
Fortunately, the economy doesn’t turn on a dime; but in many ways we’re already past
the point of concern, given the current power of the technology. It’s my hope that you
(and people you know) will be enabled by this book to transition from perhaps one of the
industries facing disruption into an industry ripe with growth and prosperity: deep learning.
It’s fun and creative. You’ll discover much about what it is to be
human by trying to simulate intelligence and creativity.
Personally, I got into deep learning because it’s fascinating. It’s an amazing intersection
between human and machine. Unpacking exactly what it means to think, to reason, and to
create is enlightening, engaging, and, for me, inspiring. Consider having a dataset filled with
every painting ever painted, and then using that to teach a machine how to paint like Monet.
Insanely, it’s possible, and it’s mind-bogglingly cool to see how it works.
Will this be difficult to learn?
How hard will you have to work before there’s a “fun” payoff?
This is my favorite question. My definition of a “fun” payoff is the experience of witnessing
something that I built learning. There’s something amazing about seeing a creation of your
hands do something like that. If you also feel this way, then the answer is simple. A few
pages into chapter 3, you’ll create your first neural network. The only work involved until
then is reading the pages between here and there.
After chapter 3, you may be interested to know that the next fun payoff occurs after you’ve
memorized a small snippet of code and proceeded to read to the midway of chapter 4. Each
chapter will work this way: memorize a small code segment from the previous chapter, read
the next chapter, and then experience the payoff of a new learning neural network.
Why you should read this book
It has a uniquely low barrier to entry.
The reason you should read this book is the same reason I’m writing it. I don’t know
of another resource (book, course, large blog series) that teaches deep learning without
assuming advanced knowledge of mathematics (a college degree in a mathy field).
Don’t get me wrong: there are really good reasons for teaching it using math. Math
is, after all, a language. It’s certainly more efficient to teach deep learning using this
language, but I don’t think it’s absolutely necessary to assume advanced knowledge
of math in order to become a skilled, knowledgeable practitioner who has a firm
understanding of the “how” behind deep learning.
So, why should you learn deep learning using this book? Because I’m going to assume
you have a high school–level background in math (and that it’s rusty) and explain
everything else you need to know as we go along. Remember multiplication? Remember
x-y graphs (the squares with lines on them)? Awesome! You’ll be fine.
It will help you understand what’s inside a framework
(Torch, TensorFlow, and so on).
There are two major groups of deep learning educational material (such as books and
courses). One group is focused around how to use popular frameworks and code libraries
like Torch, TensorFlow, Keras, and others. The other group is focused around teaching deep
learning itself, otherwise known as the science under the hood of these major frameworks.
Ultimately, learning about both is important. It’s like if you want to be a NASCAR driver:
you need to learn both about the particular model of car you’re driving (the framework) and
about driving (the science/skill). But just learning about a framework is like learning about
the pros and cons of a Generation 6 Chevrolet SS before you know what a stick shift is. This
book is about teaching you what deep learning is so you can then be prepared to learn a
framework.
All math-related material will be backed by intuitive analogies.
Whenever I encounter a math formula in the wild, I take a two-step approach. The first
is to translate its methods into an intuitive analogy to the real world. I almost never take
a formula at face value: I break it into parts, each with a story of its own. That will be
the approach of this book, as well. Anytime we encounter a math concept, I’ll offer an
alternative analogy for what the formula is actually doing.
Everything should be made as simple as possible, but not simpler.
—Attributed to Albert Einstein
Everything after the introduction chapters is “project” based.
If there’s one thing I hate when learning something new, it’s having to question whether
what I’m learning is useful or relevant. If someone is teaching me everything there is to
know about a hammer without actually taking my hand and helping me drive in a nail, then
they’re not really teaching me how to use a hammer. I know there will be dots that aren’t
connected, and if I’m thrown out into the real world with a hammer, a box of nails, and a
bunch of two-by-fours, I’ll have to do some guesswork.
This book is about giving you the wood, nails, and hammer before telling you what they
do. Each lesson is about picking up the tools and building stuff with them, explaining
how things work as we go. This way, you won’t leave with a list of facts about the various
deep learning tools you’ll work with; you’ll have the ability to use them to solve problems.
Furthermore, you’ll understand the most important part: when and why each tool is
appropriate for each problem you want to solve. It is with this knowledge that you’ll be
empowered to pursue a career in research and/or industry.
What you need to get started
Install Jupyter Notebook and the NumPy Python library.
My absolute favorite place to work is in Jupyter Notebook. One of the most important parts
of learning deep learning (for me) is the ability to stop a network while it’s training and tear
apart absolutely every piece to see what it looks like. This is something Jupyter Notebook is
incredibly useful for.
As for NumPy, perhaps the most compelling case for why this book leaves nothing out is
that we’ll be using only a single matrix library. In this way, you’ll understand how everything
works, not just how to call a framework. This book teaches deep learning from absolute
scratch, soup to nuts.
Installation instructions for these two tools can be found at http://jupyter.org for Jupyter
and http://numpy.org for NumPy. I’ll build the examples in Python 2.7, but I’ve tested them
for Python 3 as well. For easy installation, I also recommend the Anaconda framework:
https://docs.continuum.io/anaconda/install.
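Once both tools are installed, a few lines of Python will confirm that everything works. The numbers below are arbitrary placeholders chosen for illustration; the operation they demonstrate, a weighted sum (dot product), is the single matrix operation on which every network in this book is built.

```python
import numpy as np

# A tiny weighted sum: multiply each input by its weight and add
# the results. This dot product is the core operation behind
# every neural network built in this book.
weights = np.array([0.1, 0.2, 0.0])   # placeholder weights
inputs = np.array([8.5, 0.65, 1.2])   # placeholder inputs

prediction = inputs.dot(weights)
print(prediction)
```

If this prints a number close to 0.98 without errors, your Python environment and NumPy are ready for the chapters ahead.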
Pass high school mathematics.
Some mathematical assumptions are out of depth for this book, but my goal is to teach deep
learning assuming that you understand only basic algebra.
Find a personal problem you’re interested in.
This might seem like an optional “need” to get started. I guess it could be, but seriously, I
highly, highly recommend finding one. Everyone I know who has become successful at this
stuff had some sort of problem they were trying to solve. Learning deep learning was just a
“dependency” to solving some other interesting task.
For me, it was using Twitter to predict the stock market. It’s just something I thought was
really fascinating. It’s what drove me to sit down and read the next chapter and build the
next prototype.
And as it turns out, this field is so new, and is changing so fast, that if you spend the next
couple of years chasing one project with these tools, you’ll find yourself becoming one of
the leading experts in that particular problem faster than you might think. For me, chasing
this idea took me from barely knowing anything about programming to a research grant
at a hedge fund applying what I learned, in around 18 months! For deep learning, having a
problem you’re fascinated with that involves using one dataset to predict another is the key
catalyst! Go find one!
You’ll probably need some Python knowledge
Python is my teaching library of choice, but I’ll provide a few
others online.
Python is an amazingly intuitive language. I think it just might be the most widely adopted
and intuitively readable language yet constructed. Furthermore, the Python community
has a passion for simplicity that can’t be beat. For these reasons, I want to stick with Python
for all the examples (Python 2.7 is what I’m working in). In the book’s downloadable
source code, available at www.manning.com/books/grokking-deep-learning and also at
https://github.com/iamtrask/Grokking-Deep-Learning, I provide all the examples in a
variety of other languages online.
How much coding experience should you have?
Scan through the Python Codecademy course (www.codecademy.com/learn/python). If you
can read the table of contents and feel comfortable with the terms mentioned, you’re all set!
If not, then take the course and come back when you’re done. It’s designed to be a beginner
course, and it’s very well crafted.
Summary
If you’ve got a Jupyter notebook in hand and feel comfortable with the basics of Python,
you’re ready for the next chapter! As a heads-up, chapter 2 is the last chapter that will be
mostly dialogue based (without building something). It’s designed to give you an awareness
of the high-level vocabulary, concepts, and fields in artificial intelligence, machine learning,
and, most important, deep learning.
fundamental concepts:
how do machines learn?

In this chapter
• What are deep learning, machine learning, and artificial intelligence?
• What are parametric models and nonparametric models?
• What are supervised learning and unsupervised learning?
• How can machines learn?

Machine learning will cause every successful IPO win in five years.
—Eric Schmidt, Google executive chairman, keynote speech, Cloud Computing Platform conference, 2016
Chapter 2 | Fundamental concepts
What is deep learning?
Deep learning is a subset of methods for machine learning.
Deep learning is a subset of machine learning, which is a field dedicated to the study and
development of machines that can learn (sometimes with the goal of eventually attaining
general artificial intelligence).
In industry, deep learning is used to solve practical tasks in a variety of fields such as
computer vision (image), natural language processing (text), and automatic speech
recognition (audio). In short, deep learning is a subset of methods in the machine learning
toolbox, primarily using artificial neural networks, which are a class of algorithm loosely
inspired by the human brain.
[Figure: Venn diagram: deep learning sits inside machine learning, which sits inside artificial intelligence.]
Notice in this figure that not all of deep learning is focused around pursuing generalized
artificial intelligence (sentient machines as in the movies). Many applications of this
technology are used to solve a wide variety of problems in industry. This book seeks to
focus on teaching the fundamentals of deep learning behind both cutting-edge research and
industry, helping to prepare you for either.
What is machine learning?
A field of study that gives computers the ability to learn without being
explicitly programmed.
—Attributed to Arthur Samuel
Given that deep learning is a subset of machine learning, what is machine learning? Most
generally, it is what its name implies. Machine learning is a subfield of computer science
wherein machines learn to perform tasks for which they were not explicitly programmed.
In short, machines observe a pattern and attempt to imitate it in some way that can be either
direct or indirect.
[Figure: machine learning ~= monkey see, monkey do.]
I mention direct and indirect imitation as a parallel to the two main types of machine
learning: supervised and unsupervised. Supervised machine learning is the direct imitation
of a pattern between two datasets. It’s always attempting to take an input dataset and
transform it into an output dataset. This can be an incredibly powerful and useful capability.
Consider the following examples (input datasets in bold and output datasets in italic):
• Using the pixels of an image to detect the presence or absence of a cat
• Using the movies you’ve liked to predict more movies you may like
• Using someone’s words to predict whether they’re happy or sad
• Using weather sensor data to predict the probability of rain
• Using car engine sensors to predict the optimal tuning settings
• Using news data to predict tomorrow’s stock price
• Using an input number to predict a number double its size
• Using a raw audio file to predict a transcript of the audio
These are all supervised machine learning tasks. In all cases, the machine learning algorithm
is attempting to imitate the pattern between the two datasets in such a way that it can use
one dataset to predict the other. For any of these examples, imagine if you had the power to
predict the output dataset given only the input dataset. Such an ability would be profound.
Supervised machine learning
Supervised learning transforms datasets.
Supervised learning is a method for transforming one dataset into another. For example, if
you had a dataset called Monday Stock Prices that recorded the price of every stock on every
Monday for the past 10 years, and a second dataset called Tuesday Stock Prices recorded
over the same time period, a supervised learning algorithm might try to use one to predict
the other.
[Figure: Monday stock prices → supervised learning → Tuesday stock prices.]
If you successfully trained the supervised machine learning algorithm on 10 years of
Mondays and Tuesdays, then you could predict the stock price on any Tuesday in the future
given the stock price on the immediately preceding Monday. I encourage you to stop and
consider this for a moment.
Supervised machine learning is the bread and butter of applied artificial intelligence (also
known as narrow AI). It’s useful for taking what you know as input and quickly transforming
it into what you want to know. This allows supervised machine learning algorithms to
extend human intelligence and capabilities in a seemingly endless number of ways.
The majority of work using machine learning results in the training of a supervised classifier
of some kind. Even unsupervised machine learning (which you’ll learn more about in a
moment) is typically done to aid in the development of an accurate supervised machine
learning algorithm.
[Figure: what you know → supervised learning → what you want to know.]
For the rest of this book, you’ll be creating algorithms that can take input data that is
observable, recordable, and, by extension, knowable and transform it into valuable output
data that requires logical analysis. This is the power of supervised machine learning.
Unsupervised machine learning
Unsupervised learning groups your data.
Unsupervised learning shares a property in common with supervised learning: it transforms
one dataset into another. But the dataset that it transforms into is not previously known or
understood. Unlike supervised learning, there is no “right answer” that you’re trying to get
the model to duplicate. You just tell an unsupervised algorithm to “find patterns in this data
and tell me about them.”
For example, clustering a dataset into groups is a type of unsupervised learning. Clustering
transforms a sequence of datapoints into a sequence of cluster labels. If it learns 10 clusters,
it’s common for these labels to be the numbers 1–10. Each datapoint will be assigned to a
number based on which cluster it’s in. Thus, the dataset turns from a bunch of datapoints
into a bunch of labels. Why are the labels numbers? The algorithm doesn’t tell you what the
clusters are. How could it know? It just says, “Hey scientist! I found some structure. It looks
like there are groups in your data. Here they are!”
[Figure: list of datapoints → unsupervised learning → list of cluster labels.]
I have good news! This idea of clustering is something you can reliably hold onto in your
mind as the definition of unsupervised learning. Even though there are many forms
of unsupervised learning, all forms of unsupervised learning can be viewed as a form of
clustering. You’ll discover more on this later in the book.
[Figure: unsupervised learning maps puppies → 1, pizza → 2, kittens → 1, hot dog → 2, burger → 2.]
Check out this example. Even though the algorithm didn't tell you what the clusters are named,
can you figure out how it clustered the words? (Answer: 1 == cute and 2 == delicious.) Later,
we’ll unpack how other forms of unsupervised learning are also just a form of clustering and
why these clusters are useful for supervised learning.
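To make the "datapoints in, cluster labels out" idea concrete, here's a minimal sketch (not from the book's code) that clusters one-dimensional numbers by repeatedly assigning each point to its nearest group average. The data values are made up for illustration:

```python
# A toy 1-D clustering sketch: datapoints go in, cluster labels come out.
# Real projects would typically reach for a library such as scikit-learn.

def cluster_1d(points, k=2, iters=10):
    centers = list(points[:k])  # naive initialization: the first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = [min(range(k), key=lambda c: abs(p - centers[c]))
                  for p in points]
        # move each center to the average of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

fan_counts = [100, 99, 50, 10, 11]      # hypothetical datapoints
print(cluster_1d(fan_counts))           # → [0, 0, 1, 1, 1]
```

Notice that the output is just a list of group numbers, exactly as described above: the algorithm found the groups but can't tell you what they mean.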
Parametric vs. nonparametric learning
Oversimplified: Trial-and-error learning vs. counting
and probability
The last two pages divided all machine learning algorithms into two groups: supervised and
unsupervised. Now, we’re going to discuss another way to divide the same machine learning
algorithms into two groups: parametric and nonparametric. So, if we think about our little
machine learning cloud, it has two settings:
[Figure: the machine learning cloud has two settings: supervised vs. unsupervised, and parametric vs. nonparametric.]
As you can see, there are really four different types of algorithms to choose from. An
algorithm is either unsupervised or supervised, and either parametric or nonparametric.
Whereas the previous section on supervision is about the type of pattern being learned,
parametricism is about the way the learning is stored and often, by extension, the
method for learning. First, let’s look at the formal definitions of parametricism versus
nonparametricism. For the record, there’s still some debate around the exact difference.
A parametric model is characterized by having a fixed number of parameters, whereas
a nonparametric model’s number of parameters is infinite (determined by data).
As an example, let’s say the problem is to fit a square peg into the correct (square)
hole. Some humans (such as babies) just jam it into all the holes until it fits somewhere
(parametric). A teenager, however, may count the number of sides (four) and then search
for the hole with an equal number (nonparametric). Parametric models tend to use trial and
error, whereas nonparametric models tend to count. Let’s look closer.
Supervised parametric learning
Oversimplified: Trial-and-error learning using knobs
Supervised parametric learning machines are machines with a fixed number of knobs (that’s
the parametric part), wherein learning occurs by turning the knobs. Input data comes in, is
processed based on the angle of the knobs, and is transformed into a prediction.
[Figure: data (a stream of bits) → machine with knobs → prediction: 98%.]
Learning is accomplished by turning the knobs to different angles. If you’re trying to predict
the probability that the Red Sox will win the World Series, then this model would first take
data (such as sports stats like win/loss record or average number of toes per player) and
make a prediction (such as 98% chance). Next, the model would observe whether or not
the Red Sox actually won. After it knew whether they won, the learning algorithm would
update the knobs to make a more accurate prediction the next time it sees the same
or similar input data.
Perhaps it would “turn up” the “win/loss record” knob if the team’s win/loss record was a
good predictor. Inversely, it might “turn down” the “average number of toes” knob if that
datapoint wasn’t a good predictor. This is how parametric models learn!
Note that the entirety of what the model has learned can be captured in the positions
of the knobs at any given time. You can also think of this type of learning model as a
search algorithm. You’re “searching” for the appropriate knob configuration by trying
configurations, adjusting them, and retrying.
Note further that the notion of trial and error isn't the formal definition, but it's a common
(with exceptions) property of parametric models. When there is an arbitrary (but
fixed) number of knobs to turn, some level of searching is required to find the optimal
configuration. This is in contrast to nonparametric learning, which is often count based
and (more or less) adds new knobs when it finds something new to count. Let’s break down
supervised parametric learning into its three steps.
Step 1: Predict
To illustrate supervised parametric learning, let’s continue with the sports analogy of trying
to predict whether the Red Sox will win the World Series. The first step, as mentioned, is to
gather sports statistics, send them through the machine, and make a prediction about the
probability that the Red Sox will win.
[Figure: data (location: away; opponent: Yankees; # toes: 250; # players: 25; # fans: 25,000) → machine → prediction: 98%.]
Step 2: Compare to the truth pattern
The second step is to compare the prediction (98%) with the pattern you care about
(whether the Red Sox won). Sadly, they lost, so the comparison is
Pred: 98% > Truth: 0%
This step recognizes that if the model had predicted 0%, it would have perfectly predicted the
upcoming loss of the team. You want the machine to be accurate, which leads to step 3.
Step 3: Learn the pattern
This step adjusts the knobs by studying both how much
the model missed by (98%) and what the input data was
(sports stats) at the time of prediction. This step then turns
the knobs to make a more accurate prediction given the
input data.
In theory, the next time this step saw the same sports stats,
the prediction would be lower than 98%. Note that each
knob represents the prediction’s sensitivity to different types
of input data. That’s what you’re changing when you “learn.”
[Figure: adjusting sensitivity by turning knobs labeled win/loss, home/away, # toes, and # fans.]
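The three steps can be sketched in a few lines of code. Everything here is a toy: one knob, one made-up input statistic, and a crude update rule that nudges the knob against the error (the book develops the real update rule in later chapters):

```python
# A toy predict / compare / learn loop with a single knob (weight).
input_stat = 8.5    # hypothetical sports statistic
truth = 1.0         # the team actually won
knob = 0.5          # initial knob position

for _ in range(20):
    pred = input_stat * knob              # step 1: predict
    delta = pred - truth                  # step 2: compare to the truth
    knob -= 0.01 * delta * input_stat     # step 3: turn the knob

print(round(input_stat * knob, 4))        # the prediction is now close to 1.0
```

Note that everything the model "knows" lives in the single number `knob`: capture it, and you've captured the model.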
Unsupervised parametric learning
Unsupervised parametric learning uses a very similar approach. Let's walk through the
steps at a high level. Remember that unsupervised learning is all about grouping data.
Unsupervised parametric learning uses knobs to group data. But in this case, it usually
has several knobs for each group, each of which maps the input data's affinity to that
particular group (with exceptions and nuance—this is a high-level description). Let's look
at an example that assumes you want to divide the data into three groups.

In the dataset, I've identified three clusters in the data that you might want the
parametric model to find. They're indicated via formatting as group 1, group 2, and
group 3. Let's propagate the first datapoint through a trained unsupervised model, as
shown next. Notice that it maps most strongly to group 1.

    Home or away:  home   home   away   away   away
    # fans:        100k   99k    50k    10k    11k

[Figure: the first datapoint (home, 100k fans) passes through the trained model and
receives a group-membership probability of 94% for group 1, 1% for group 2, and 5%
for group 3.]

Each group's machine attempts to transform the input data to a number between 0 and 1,
telling us the probability that the input data is a member of that group. There is a great deal
of variety in how these models train and their resulting properties, but at a high level they
adjust parameters to transform the input data into its subscribing group(s).
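As a high-level sketch only: one common way to turn per-group affinities into probabilities that sum to 1 is a softmax over each group's weighted sum. The weights and datapoint below are invented for illustration and are not the book's model:

```python
import math

def group_probabilities(datapoint, group_weights):
    # one weighted sum ("affinity") per group ...
    scores = [sum(x * w for x, w in zip(datapoint, ws)) for ws in group_weights]
    # ... squashed into probabilities that sum to 1 (softmax)
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# datapoint: [home (1.0) or away (0.0), fans in units of 100k]
datapoint = [1.0, 1.0]                  # home game, 100k fans
group_weights = [[2.0, 2.0],            # group 1: strong affinity for home, big crowds
                 [-1.0, 0.0],           # group 2 (made-up weights)
                 [0.0, -1.0]]           # group 3 (made-up weights)
probs = group_probabilities(datapoint, group_weights)
print([round(p, 2) for p in probs])     # group 1 dominates
```

The point isn't the particular squashing function; it's that each group owns its own set of knobs, and the datapoint's probabilities come from turning the input through all of them.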
Nonparametric learning
Oversimplified: Counting-based methods
Nonparametric learning is a class of algorithm wherein the number of parameters is based
on data (instead of predefined). This lends itself to methods that generally count in one way
or another, thus increasing the number of parameters based on the number of items being
counted within the data. In the supervised setting, for example, a nonparametric model
might count the number of times a particular color of streetlight causes cars to “go.” After
counting only a few examples, this model would then be able to predict that middle lights
always (100%) cause cars to go, and right lights only sometimes (50%) cause cars to go.
[Figure: six example observations of a three-light streetlight: Stop, Go, Go, Go, Stop, Stop.]
Notice that this model would have three parameters: three counts indicating the number
of times each colored light turned on and cars would go (perhaps divided by the number
of total observations). If there were five lights, there would be five counts (five parameters).
What makes this simple model nonparametric is this trait wherein the number of parameters
changes based on the data (in this case, the number of lights). This is in contrast to
parametric models, which start with a set number of parameters and, more important, can
have more or fewer parameters purely at the discretion of the scientist training the model
(regardless of data).
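The streetlight example above can be written as a few counts. The observation log below is invented to match the percentages in the text (middle light always "go", right light half the time), and the light names are assumptions:

```python
# A nonparametric, counting-based model: one count per light, so the number
# of parameters grows with the data (five lights would mean five counts).
observations = [("left", "stop"), ("middle", "go"), ("right", "go"),
                ("middle", "go"), ("right", "stop"), ("left", "stop")]

go_counts, totals = {}, {}
for light, action in observations:
    totals[light] = totals.get(light, 0) + 1
    go_counts[light] = go_counts.get(light, 0) + (action == "go")

prob_go = {light: go_counts[light] / totals[light] for light in totals}
print(prob_go)   # middle → 1.0 (always go), right → 0.5, left → 0.0
```

If the data contained a new light the model had never seen, it would simply grow a new count for it; that's the nonparametric trait in action.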
A close eye might question this idea. The parametric model from before seemed to have
a knob for each input datapoint. Most parametric models still have to have some sort of
input based on the number of classes in the data. Thus you can see that there is a gray
area between parametric and nonparametric algorithms. Even parametric algorithms are
somewhat influenced by the number of classes in the data, even if they aren’t explicitly
counting patterns.
This also illuminates that parameters is a generic term, referring only to the set of numbers
used to model a pattern (without any limitation on how those numbers are used). Counts
are parameters. Weights are parameters. Normalized variants of counts or weights are
parameters. Correlation coefficients can be parameters. The term refers to the set of
numbers used to model a pattern. As it happens, deep learning is a class of parametric
models. We won’t discuss nonparametric models further in this book, but they’re an
interesting and powerful class of algorithm.
Summary
In this chapter, we’ve gone a level deeper into the various flavors of machine learning.
You learned that a machine learning algorithm is either supervised or unsupervised
and either parametric or nonparametric. Furthermore, we explored exactly what makes
these four different groups of algorithms distinct. You learned that supervised machine
learning is a class of algorithm where you learn to predict one dataset given another and
that unsupervised learning generally groups a single dataset into various kinds of clusters.
You learned that parametric algorithms have a fixed number of parameters and that
nonparametric algorithms adjust their number of parameters based on the dataset.
Deep learning uses neural networks to perform both supervised and unsupervised
prediction. Until now, we’ve stayed at a conceptual level as you got your bearings in the field
as a whole and your place in it. In the next chapter, you’ll build your first neural network,
and all subsequent chapters will be project based. So, pull out your Jupyter notebook, and
let’s jump in!
introduction to neural prediction:
forward propagation

In this chapter
• A simple network making a prediction
• What is a neural network, and what does it do?
• Making a prediction with multiple inputs
• Making a prediction with multiple outputs
• Making a prediction with multiple inputs and outputs
• Predicting on predictions

I try not to get involved in the business of prediction. It's a quick way to look like an idiot.
—Warren Ellis, comic-book writer, novelist, and screenwriter
Chapter 3 | Introduction to neural prediction
Step 1: Predict
This chapter is about prediction.
In the previous chapter, you learned about the paradigm predict, compare, learn. In this
chapter, we’ll dive deep into the first step: predict. You may remember that the predict step
looks a lot like this:
[Figure: data (location: away; opponent: Yankees; # of toes: 250; # of players: 25; # of fans: 25,000) → machine → prediction: 98%.]
In this chapter, you’ll learn more about what these three different parts of a neural network
prediction look like under the hood. Let’s start with the first one: the data. In your first
neural network, you’re going to predict one datapoint at a time, like so:
[Figure: a single datapoint (# toes: 8.5) → machine → prediction: 98%.]
Later, you’ll find that the number of datapoints you process at a time has a significant
impact on what a network looks like. You might be wondering, “How do I choose how
many datapoints to propagate at a time?” The answer is based on whether you think the
neural network can be accurate with the data you give it.
For example, if I’m trying to predict whether there’s a cat in a photo, I definitely need to
show my network all the pixels of an image at once. Why? Well, if I sent you only one
pixel of an image, could you classify whether the image contained a cat? Me neither!
(That’s a general rule of thumb, by the way: always present enough information to the
network, where “enough information” is defined loosely as how much a human might
need to make the same prediction.)
Let’s skip over the network for now. As it turns out, you can create a network only after
you understand the shape of the input and output datasets (for now, shape means “number
of columns” or “number of datapoints you’re processing at once”). Let’s stick with a single
prediction of the likelihood that the baseball team will win:
[Figure: # toes (8.5) → machine → win probability: 98%.]
Now that you know you want to take one input datapoint and output one prediction, you
can create a neural network. Because you have only one input datapoint and one output
datapoint, you’re going to build a network with a single knob mapping from the input point
to the output. (Abstractly, these “knobs” are actually called weights, and I’ll refer to them
as such from here on out.) So, without further ado, here’s your first neural network, with a
single weight mapping from the input “# toes” to the output “win?”:
[Figure: an empty network. Input data (# toes) enters on the left, passes through a single
weight (.1), and the prediction (win?) comes out on the right.]
As you can see, with one weight, this network takes in one datapoint at a time (average
number of toes per player on the baseball team) and outputs a single prediction (whether it
thinks the team will win).
A simple neural network making a prediction
Let’s start with the simplest neural network possible.
b) An empty network (input data enters here; predictions come out here):

    weight = 0.1

    def neural_network(input, weight):
        prediction = input * weight
        return prediction

c) Inserting one input datapoint (# toes):

    number_of_toes = [8.5, 9.5, 10, 9]
    input = number_of_toes[0]
    pred = neural_network(input, weight)
    print(pred)

d) Multiplying input by weight (8.5 * 0.1 = 0.85):

    def neural_network(input, weight):
        prediction = input * weight
        return prediction

e) Depositing the prediction: 0.85
What is a neural network?
Here is your first neural network.
To start a neural network, open a Jupyter notebook and run this code:
    weight = 0.1                            # the network

    def neural_network(input, weight):
        prediction = input * weight
        return prediction

Now, run the following:

    number_of_toes = [8.5, 9.5, 10, 9]      # how you use the network
    input = number_of_toes[0]               # to predict something
    pred = neural_network(input, weight)
    print(pred)
You just made your first neural network and used it to predict! Congratulations! The last line
prints the prediction (pred). It should be 0.85. So what is a neural network? For now, it’s one or
more weights that you can multiply by the input data to make a prediction.
What is input data?
It’s a number that you recorded in the real world somewhere. It’s usually something
that is easily knowable, like today’s temperature, a baseball player’s batting average, or
yesterday’s stock price.
What is a prediction?
A prediction is what the neural network tells you, given the input data, such as “given the
temperature, it is 0% likely that people will wear sweatsuits today” or “given a baseball player’s
batting average, he is 30% likely to hit a home run” or “given yesterday’s stock price, today’s
stock price will be 101.52.”
Is this prediction always right?
No. Sometimes a neural network will make mistakes, but it can learn from them. For example,
if it predicts too high, it will adjust its weight to predict lower next time, and vice versa.
How does the network learn?
Trial and error! First, it tries to make a prediction. Then, it sees whether the prediction was too
high or too low. Finally, it changes the weight (up or down) to predict more accurately the
next time it sees the same input.
What does this neural network do?
It multiplies the input by a weight. It “scales” the input by a
certain amount.
In the previous section, you made your first prediction with a neural network. A neural network,
in its simplest form, uses the power of multiplication. It takes an input datapoint (in this case,
8.5) and multiplies it by the weight. If the weight is 2, then the neural network will double the
input. If the weight is 0.01, then the network will divide the input by 100. As you can see, some
weight values make the input bigger, and other values make it smaller.
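A quick check of the scaling claim, using the network defined earlier with hypothetical weight values:

```python
def neural_network(input, weight):
    prediction = input * weight
    return prediction

print(neural_network(8.5, 2))      # a weight of 2 doubles the input → 17.0
print(neural_network(8.5, 0.01))   # a weight of 0.01 shrinks it to roughly 0.085
```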
b) An empty network (input data enters here; predictions come out here):

    weight = 0.1

    def neural_network(input, weight):
        prediction = input * weight
        return prediction

[Figure: # toes passes through the weight (.1) to produce win?]
The interface for a neural network is simple. It accepts an input variable as information and a
weight variable as knowledge and outputs a prediction. Every neural network you’ll ever see
works this way. It uses the knowledge in the weights to interpret the information in the input
data. Later neural networks will accept larger, more complicated input and weight values, but
this same underlying premise will always ring true.
c) Inserting one input datapoint (# toes = 8.5):

    number_of_toes = [8.5, 9.5, 10, 9]
    input = number_of_toes[0]
    pred = neural_network(input, weight)
In this case, the information is the average number of toes on a baseball team before a game.
Notice several things. First, the neural network does not have access to any information
except one instance. If, after this prediction, you were to feed in number_of_toes[1], the
network wouldn’t remember the prediction it made in the last timestep. A neural network
knows only what you feed it as input. It forgets everything else. Later, you’ll learn how to
give a neural network a “short-term memory” by feeding in multiple inputs at once.
d) Multiplying input by weight (8.5 * 0.1 = 0.85). The weight acts as a volume knob:

    def neural_network(input, weight):
        prediction = input * weight
        return prediction
Another way to think about a neural network’s weight value is as a measure of sensitivity
between the input of the network and its prediction. If the weight is very high, then even the
tiniest input can create a really large prediction! If the weight is very small, then even large
inputs will make small predictions. This sensitivity is akin to volume. “Turning up the weight”
amplifies the prediction relative to the input: weight is a volume knob!
e) Depositing the prediction (0.85):

    number_of_toes = [8.5, 9.5, 10, 9]
    input = number_of_toes[0]
    pred = neural_network(input, weight)
In this case, what the neural network is really doing is applying a volume knob to the
number_of_toes variable. In theory, this volume knob can tell you the likelihood that the team
will win, based on the average number of toes per player on the team. This may or may not work.
Truthfully, if the team members had an average of 0 toes, they would probably play terribly. But
baseball is much more complex than this. In the next section, you’ll present multiple pieces of
information at the same time so the neural network can make more-informed decisions.
Note that neural networks don’t predict just positive numbers—they can also predict negative
numbers and even take negative numbers as input. Perhaps you want to predict the probability
that people will wear coats today. If the temperature is –10 degrees Celsius, then a negative
weight will predict a high probability that people will wear their coats.
[Figure: a temperature input of –10 passes through a weight to produce the prediction –8.9.]
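A tiny sketch of the negative-number case described above (the temperature and weight values here are made up):

```python
def neural_network(input, weight):
    return input * weight

temperature = -10                   # degrees Celsius
coat_weight = -0.08                 # a negative weight (hypothetical value)

# negative input * negative weight → a positive prediction,
# i.e. a high probability that people wear coats
print(neural_network(temperature, coat_weight))
```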
Making a prediction with multiple inputs
Neural networks can combine intelligence from
multiple datapoints.
The previous neural network was able to take one datapoint as input and make one prediction
based on that datapoint. Perhaps you’ve been wondering, “Is the average number of toes really
a good predictor, all by itself?” If so, you’re onto something. What if you could give the network
more information (at one time) than just the average number of toes per player? In that case,
the network should, in theory, be able to make more-accurate predictions. Well, as it turns out, a
network can accept multiple input datapoints at a time. Take a look at the next prediction:
b) An empty network with multiple inputs (input data enters here, three at a time;
predictions come out here):

    weights = [0.1, 0.2, 0]

    def neural_network(input, weights):
        pred = w_sum(input, weights)
        return pred

[Figure: # toes (weight .1), win/loss (weight .2), and # fans (weight .0) all connect to
the single "win?" prediction.]

c) Inserting one input datapoint (one row of data: the first game). This dataset is the
current status at the beginning of each game for the first four games in a season:

    toes = current average number of toes per player
    wlrec = current games won (percent)
    nfans = fan count (in millions)

    toes  = [8.5, 9.5, 9.9, 9.0]
    wlrec = [0.65, 0.8, 0.8, 0.9]
    nfans = [1.2, 1.3, 0.5, 1.0]

    input = [toes[0],wlrec[0],nfans[0]]     # input corresponds to every entry
    pred = neural_network(input,weights)    # for the first game of the season
d) Performing a weighted sum of inputs:

    def w_sum(a, b):
        assert(len(a) == len(b))
        output = 0
        for i in range(len(a)):
            output += (a[i] * b[i])
        return output

    def neural_network(input, weights):
        pred = w_sum(input, weights)
        return pred

    Inputs     Weights    Local predictions
    (8.50  *   0.1)   =   0.85   = toes prediction
    (0.65  *   0.2)   =   0.13   = wlrec prediction
    (1.20  *   0.0)   =   0.00   = fans prediction

    toes prediction + wlrec prediction + fans prediction = final prediction
    0.85 + 0.13 + 0.00 = 0.98

e) Depositing the prediction (input corresponds to every entry for the first game of
the season):

    toes  = [8.5, 9.5, 9.9, 9.0]
    wlrec = [0.65, 0.8, 0.8, 0.9]
    nfans = [1.2, 1.3, 0.5, 1.0]

    input = [toes[0],wlrec[0],nfans[0]]
    pred = neural_network(input,weights)
    print(pred)     # prints (approximately) 0.98
Multiple inputs: What does this neural network do?
It multiplies three inputs by three knob weights and sums them.
This is a weighted sum.
At the end of the previous section, you came to realize the limiting factor of your simple
neural network: it was only a volume knob on one datapoint. In the example, that datapoint
was a baseball team’s average number of toes per player. You learned that in order to make
accurate predictions, you need to build neural networks that can combine multiple inputs at
the same time. Fortunately, neural networks are perfectly capable of doing so.
b) An empty network with multiple inputs:

    weights = [0.1, 0.2, 0]

    def neural_network(input, weights):
        pred = w_sum(input, weights)
        return pred

[Figure: # toes, win/loss, and # fans enter three at a time; the prediction (win?) comes
out the other side.]
This new neural network can accept multiple inputs at a time per prediction. This allows the
network to combine various forms of information to make better-informed decisions. But
the fundamental mechanism for using weights hasn’t changed. You still take each input and
run it through its own volume knob. In other words, you multiply each input by its own
weight.
The new property here is that, because you have multiple inputs, you have to sum their
respective predictions. Thus, you multiply each input by its respective weight and then sum
all the local predictions together. This is called a weighted sum of the input, or a weighted sum
for short. Some also refer to the weighted sum as a dot product, as you’ll see.
A relevant reminder
The interface for the neural network is simple: it accepts an input variable as information and
a weights variable as knowledge, and it outputs a prediction.
c) Inserting one input datapoint (one row of data: the first game). This dataset is the
current status at the beginning of each game for the first four games in a season:

    toes = current number of toes
    wlrec = current games won (percent)
    nfans = fan count (in millions)

    toes  = [8.5, 9.5, 9.9, 9.0]
    wlrec = [0.65, 0.8, 0.8, 0.9]
    nfans = [1.2, 1.3, 0.5, 1.0]

    input = [toes[0],wlrec[0],nfans[0]]     # input corresponds to every entry
    pred = neural_network(input,weights)    # for the first game of the season
This new need to process multiple inputs at a time justifies the use of a new tool. It’s called a vector,
and if you’ve been following along in your Jupyter notebook, you’ve already been using it. A vector
is nothing other than a list of numbers. In the example, input is a vector and weights is a vector.
Can you spot any more vectors in the previous code? (There are three more.)
As it turns out, vectors are incredibly useful whenever you want to perform operations
involving groups of numbers. In this case, you’re performing a weighted sum between two
vectors (a dot product). You’re taking two vectors of equal length (input and weights),
multiplying each number based on its position (the first position in input is multiplied by
the first position in weights, and so on), and then summing the resulting output.
Anytime you perform a mathematical operation between two vectors of equal length where
you pair up values according to their position in the vector (again: position 0 with 0, 1 with 1,
and so on), it’s called an elementwise operation. Thus elementwise addition sums two vectors,
and elementwise multiplication multiplies two vectors.
Challenge: Vector math

Being able to manipulate vectors is a cornerstone technique for deep learning. See if you can write functions that perform the following operations:

• def elementwise_multiplication(vec_a, vec_b)
• def elementwise_addition(vec_a, vec_b)
• def vector_sum(vec_a)
• def vector_average(vec_a)

Then, see if you can use two of these methods to perform a dot product!
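If you want to check your work, here's one possible set of solutions (a sketch: the four signatures come from the challenge above, and dot_product is a name I've chosen for the bonus step):

```python
def elementwise_multiplication(vec_a, vec_b):
    assert(len(vec_a) == len(vec_b))
    # pair values by position and multiply
    return [vec_a[i] * vec_b[i] for i in range(len(vec_a))]

def elementwise_addition(vec_a, vec_b):
    assert(len(vec_a) == len(vec_b))
    return [vec_a[i] + vec_b[i] for i in range(len(vec_a))]

def vector_sum(vec_a):
    total = 0
    for value in vec_a:
        total += value
    return total

def vector_average(vec_a):
    return vector_sum(vec_a) / len(vec_a)

# A dot product is elementwise multiplication followed by a sum:
def dot_product(vec_a, vec_b):
    return vector_sum(elementwise_multiplication(vec_a, vec_b))

print(dot_product([8.5, 0.65, 1.2], [0.1, 0.2, 0.0]))  # approximately 0.98
```

Composed this way, dot_product gives the same result as the w_sum function used throughout this chapter.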
Chapter 3: Introduction to neural prediction

Performing a weighted sum of inputs
def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def neural_network(input, weights):
    pred = w_sum(input,weights)
    return pred

Inputs         Weights   Local predictions
(8.50 * 0.1)  =  0.85  =  toes prediction
(0.65 * 0.2)  =  0.13  =  wlrec prediction
(1.20 * 0.0)  =  0.00  =  fans prediction

toes prediction + wlrec prediction + fans prediction = final prediction
      0.85      +       0.13       +       0.00      =      0.98
The intuition behind how and why a dot product (weighted sum) works is easily one of the most
important parts of truly understanding how neural networks make predictions. Loosely stated, a
dot product gives you a notion of similarity between two vectors. Consider these examples:
a = [ 0, 1, 0, 1]
b = [ 1, 0, 1, 0]
c = [ 0, 1, 1, 0]
d = [.5, 0,.5, 0]
e = [ 0, 1,-1, 0]

w_sum(a,b) = 0
w_sum(b,c) = 1
w_sum(b,d) = 1
w_sum(c,c) = 2
w_sum(d,d) = .5
w_sum(c,e) = 0
The highest weighted sum (w_sum(c,c)) is between vectors that are exactly identical. In
contrast, because a and b have no overlapping weight, their dot product is zero. Perhaps
the most interesting weighted sum is between c and e, because e has a negative weight.
This negative weight canceled out the positive similarity between them. But a dot product
between e and itself would yield the number 2, despite the negative weight (double
negative turns positive). Let’s become familiar with the various properties of the dot
product operation.
Sometimes you can equate the properties of the dot product to a logical AND. Consider a and b:
a = [ 0, 1, 0, 1]
b = [ 1, 0, 1, 0]
If you ask whether both a[0] AND b[0] have value, the answer is no. If you ask whether both
a[1] AND b[1] have value, the answer is again no. Because this is always true for all four
values, the final score equals 0. Each value fails the logical AND.
b = [ 1, 0, 1, 0]
c = [ 0, 1, 1, 0]
b and c, however, have one column that shares value. It passes the logical AND because b[2]
and c[2] have weight. This column (and only this column) causes the score to rise to 1.
c = [ 0, 1, 1, 0]
d = [.5, 0,.5, 0]
Fortunately, neural networks are also able to model partial ANDing. In this case, c and d share
the same column as b and c, but because d has only 0.5 weight there, the final score is only 0.5.
We exploit this property when modeling probabilities in neural networks.
d = [.5, 0,.5, 0]
e = [-1, 1, 0, 0]
In this analogy, negative weights tend to imply a logical NOT operator, given that any positive
weight paired with a negative weight will cause the score to go down. Furthermore, if both
vectors have negative weights (such as w_sum(e,e)), then the neural network will perform
a double negative and add weight instead. Additionally, some might say it’s an OR after the
AND, because if any of the rows show weight, the score is affected. Thus, for w_sum(a,b), if
(a[0] AND b[0]) OR (a[1] AND b[1]), and so on, then w_sum(a,b) returns a positive score.
Furthermore, if one value is negative, then that column gets a NOT.
Amusingly, this gives us a kind of crude language for reading weights. Let’s read a few
examples, shall we? These assume you’re performing w_sum(input,weights) and the “then”
to these if statements is an abstract “then give high score”:
weights = [ 1, 0, 1] => if input[0] OR input[2]
weights = [ 0, 0, 1] => if input[2]
weights = [ 1, 0, -1] => if input[0] OR NOT input[2]
weights = [ -1, 0, -1] => if NOT input[0] OR NOT input[2]
weights = [ 0.5, 0, 1] => if BIG input[0] OR input[2]
Notice in the last row that weights[0] = 0.5 means the corresponding input[0] would
have to be larger to compensate for the smaller weighting. And as I mentioned, this is a very
crude approximate language. But I find it immensely useful when trying to picture in my
head what’s going on under the hood. This will help you significantly in the future, especially
when putting networks together in increasingly complex ways.
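To make this crude language concrete, here's a small sketch that checks the if-statement readings against w_sum directly, using 0/1 inputs (the vectors here are illustrative, not from the running example):

```python
def w_sum(a, b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

# weights = [1, 0, 1] reads as "if input[0] OR input[2]":
# either of those inputs being "on" pushes the score up.
print(w_sum([1, 0, 0], [1, 0, 1]))   # input[0] on -> 1
print(w_sum([0, 0, 1], [1, 0, 1]))   # input[2] on -> 1
print(w_sum([0, 1, 0], [1, 0, 1]))   # neither on -> 0

# weights = [1, 0, -1] reads as "if input[0] OR NOT input[2]":
# a negative weight makes an "on" input pull the score down.
print(w_sum([0, 0, 1], [1, 0, -1]))  # input[2] on -> -1
```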
Given these intuitions, what does this mean when a neural network makes a prediction?
Roughly speaking, it means the network scores the input based on how similar it is to
the weights. Notice in the following example that nfans is completely ignored in the
prediction because the weight associated with it is 0. The most sensitive predictor is
wlrec because its weight is 0.2. But the dominant force in the high score is the number
of toes, not because the weight is the highest, but because the input combined with the
weight is by far the highest.
Depositing the prediction

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)
print(pred)

Prediction: 0.98
Here are a few more points to note for further reference. You can’t shuffle weights: they have
specific positions they need to be in. Furthermore, both the value of the weight and the value of
the input determine the overall impact on the final score. Finally, a negative weight will cause
some inputs to reduce the final prediction (and vice versa).
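As a quick sketch of the "you can't shuffle weights" point, swapping two weights pairs them with the wrong inputs and changes the prediction:

```python
def w_sum(a, b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

input = [8.5, 0.65, 1.2]              # toes, wlrec, nfans for game 1

correct = w_sum(input, [0.1, 0.2, 0]) # correct pairing, approximately 0.98
shuffled = w_sum(input, [0.2, 0.1, 0]) # toes now gets wlrec's weight,
                                       # approximately 1.765
print(correct)
print(shuffled)
```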
Multiple inputs: Complete runnable code
The code snippets from this example come together in the following code, which creates and
executes a neural network. For clarity, I’ve written everything out using basic properties of
Python (lists and numbers). But a better way exists that we’ll begin using in the future.
Previous code

def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

weights = [0.1, 0.2, 0]

def neural_network(input, weights):
    pred = w_sum(input,weights)
    return pred

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)
print(pred)
There’s a Python library called NumPy, which stands for “numerical Python.” It has
very efficient code for creating vectors and performing common functions (such as dot
products). Without further ado, here’s the same code in NumPy.
NumPy code

import numpy as np

weights = np.array([0.1, 0.2, 0])

def neural_network(input, weights):
    pred = input.dot(weights)
    return pred

toes  = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65, 0.8, 0.8, 0.9])
nfans = np.array([1.2, 1.3, 0.5, 1.0])

# input corresponds to every entry for the first game of the season
input = np.array([toes[0],wlrec[0],nfans[0]])
pred = neural_network(input,weights)
print(pred)
Both networks should print out 0.98. Notice that in the NumPy code, you don’t have to
create a w_sum function. Instead, NumPy has a dot function (short for “dot product”) you
can call. Many functions you’ll use in the future have NumPy parallels.
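As a sanity check (my own sketch, not part of the book's running code), you can confirm that NumPy's dot agrees with the hand-rolled w_sum on the same data:

```python
import numpy as np

def w_sum(a, b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

input = [8.5, 0.65, 1.2]
weights = [0.1, 0.2, 0]

manual = w_sum(input, weights)
with_numpy = np.array(input).dot(np.array(weights))

print(manual, with_numpy)  # both approximately 0.98
```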
Making a prediction with multiple outputs
Neural networks can also make multiple predictions using only a
single input.
Perhaps a simpler augmentation than multiple inputs is multiple outputs. Prediction occurs
the same as if there were three disconnected single-weight neural networks.
An empty network with multiple outputs

Instead of predicting just whether the team won or lost, you're also predicting whether the players are happy or sad and the percentage of team members who are hurt. You make this prediction using only the current win/loss record. The single input (win/loss) connects to three outputs (hurt?, win?, sad?) through the weights 0.3, 0.2, and 0.9.

weights = [0.3, 0.2, 0.9]

def neural_network(input, weights):
    pred = ele_mul(input,weights)
    return pred
The most important thing to notice is that the three predictions are completely separate.
Unlike neural networks with multiple inputs and a single output, where the prediction is
undeniably connected, this network truly behaves as three independent components, each
receiving the same input data. This makes the network simple to implement.
Inserting one input datapoint

wlrec = [0.65, 0.8, 0.8, 0.9]

input = wlrec[0]
pred = neural_network(input,weights)
Performing elementwise multiplication

def ele_mul(number,vector):
    output = [0,0,0]
    assert(len(output) == len(vector))
    for i in range(len(vector)):
        output[i] = number * vector[i]
    return output

def neural_network(input, weights):
    pred = ele_mul(input,weights)
    return pred

Inputs         Weights   Final predictions
(0.65 * 0.3)  =  0.195  =  hurt prediction
(0.65 * 0.2)  =  0.13   =  win prediction
(0.65 * 0.9)  =  0.585  =  sad prediction
Depositing predictions

wlrec = [0.65, 0.8, 0.8, 0.9]

input = wlrec[0]
pred = neural_network(input,weights)
print(pred)

The predictions come out as a vector of numbers: [0.195, 0.13, 0.585].
Predicting with multiple inputs and outputs
Neural networks can predict multiple outputs given
multiple inputs.
Finally, the way you build a network with multiple inputs or outputs can be combined to build
a network that has both multiple inputs and multiple outputs. As before, a weight connects each
input node to each output node, and prediction occurs in the usual way.
An empty network with multiple inputs and outputs

           # toes % win  # fans
weights = [ [0.1,  0.1,  -0.3], # hurt?
            [0.1,  0.2,   0.0], # win?
            [0.0,  1.3,   0.1] ] # sad?

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred

Each of the three inputs (# toes, win/loss, # fans) connects to each of the three outputs (hurt?, win?, sad?), so there are nine weights in total.
Inserting one input datapoint

This dataset is the current status at the beginning of each game for the first four games in a season:

toes  = current average number of toes per player
wlrec = current games won (percent)
nfans = fan count (in millions)

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)
For each output, performing a weighted sum of inputs

def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect,matrix):
    assert(len(vect) == len(matrix))
    output = [0,0,0]
    for i in range(len(vect)):
        output[i] = w_sum(vect,matrix[i])
    return output

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred

  # toes        % win          # fans
(8.5 * 0.1) + (0.65 * 0.1) + (1.2 * -0.3) = 0.555 = hurt prediction
(8.5 * 0.1) + (0.65 * 0.2) + (1.2 * 0.0)  = 0.98  = win prediction
(8.5 * 0.0) + (0.65 * 1.3) + (1.2 * 0.1)  = 0.965 = sad prediction
Depositing predictions
toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)

The three predictions are deposited as a vector: [0.555, 0.98, 0.965].
Multiple inputs and outputs: How does it work?
It performs three independent weighted sums of the input
to make three predictions.
You can take two perspectives on this architecture: think of it as either three weights coming
out of each input node, or three weights going into each output node. For now, I find the
latter to be much more beneficial. Think about this neural network as three independent dot
products: three independent weighted sums of the input. Each output node takes its own
weighted sum of the input and makes a prediction.
An empty network with multiple inputs and outputs

           # toes % win  # fans
weights = [ [0.1,  0.1,  -0.3], # hurt?
            [0.1,  0.2,   0.0], # win?
            [0.0,  1.3,   0.1] ] # sad?

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred

Inserting one input datapoint

This dataset is the current status at the beginning of each game for the first four games in a season:

toes  = current average number of toes per player
wlrec = current games won (percent)
nfans = fan count (in millions)

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)
For each output, performing a weighted sum of inputs

def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect,matrix):
    assert(len(vect) == len(matrix))
    output = [0,0,0]
    for i in range(len(vect)):
        output[i] = w_sum(vect,matrix[i])
    return output

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred

  # toes        % win          # fans
(8.5 * 0.1) + (0.65 * 0.1) + (1.2 * –0.3) = 0.555 = hurt prediction
(8.5 * 0.1) + (0.65 * 0.2) + (1.2 * 0.0) = 0.98 = win prediction
(8.5 * 0.0) + (0.65 * 1.3) + (1.2 * 0.1) = 0.965 = sad prediction
As mentioned earlier, we’re choosing to think about this network as a series of weighted
sums. Thus, the previous code creates a new function called vect_mat_mul. This function
iterates through each row of weights (each row is a vector) and makes a prediction using
the w_sum function. It’s literally performing three consecutive weighted sums and then
storing their predictions in a vector called output. A lot more weights are flying around in
this one, but it isn’t that much more advanced than other networks you’ve seen.
I want to use this list of vectors and series of weighted sums logic to introduce two new
concepts. See the weights variable in step 1? It’s a list of vectors. A list of vectors is called a
matrix. It’s as simple as it sounds. Commonly used functions use matrices. One of these is
called vector-matrix multiplication. The series of weighted sums is exactly that: you take a
vector and perform a dot product with every row in a matrix.* As you’ll find out in the next
section, NumPy has special functions to help.
* If you’re experienced with linear algebra, the more formal definition stores/processes weights as column vectors instead of row
vectors. This will be rectified shortly.
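To connect the two views, here's a sketch comparing vect_mat_mul against NumPy. Because this chapter stores each output's weights as a row, the NumPy equivalent dots the input with the transpose of the weights matrix:

```python
import numpy as np

def w_sum(a, b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect, matrix):
    assert(len(vect) == len(matrix))
    output = [0, 0, 0]
    for i in range(len(vect)):
        output[i] = w_sum(vect, matrix[i])
    return output

weights = [[0.1, 0.1, -0.3],  # hurt?
           [0.1, 0.2,  0.0],  # win?
           [0.0, 1.3,  0.1]]  # sad?
input = [8.5, 0.65, 1.2]      # toes, wlrec, nfans for game 1

manual = vect_mat_mul(input, weights)
# np.dot pairs the input with the matrix's columns, so transpose
# the row-per-output weights before dotting:
with_numpy = np.array(input).dot(np.array(weights).T)

print(manual)      # approximately [0.555, 0.98, 0.965]
print(with_numpy)
```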
Predicting on predictions
Neural networks can be stacked!
As the following figures make clear, you can also take the output of one network and feed it
as input to another network. This results in two consecutive vector-matrix multiplications.
It may not yet be clear why you’d predict this way; but some datasets (such as image
classification) contain patterns that are too complex for a single weight matrix. Later, we’ll
discuss the nature of these patterns. For now, it’s sufficient to know this is possible.
An empty network with multiple inputs and outputs

          # toes % win  # fans
ih_wgt = [ [0.1,  0.2,  -0.1], # hid[0]
           [-0.1, 0.1,   0.9], # hid[1]
           [0.1,  0.4,   0.1] ] # hid[2]

          # hid[0] hid[1] hid[2]
hp_wgt = [ [0.3,   1.1,  -0.3], # hurt?
           [0.1,   0.2,   0.0], # win?
           [0.0,   1.3,   0.1] ] # sad?

weights = [ih_wgt, hp_wgt]

def neural_network(input, weights):
    hid = vect_mat_mul(input,weights[0])
    pred = vect_mat_mul(hid,weights[1])
    return pred

The inputs (# toes, win/loss, # fans) feed a hidden layer (hid[0], hid[1], hid[2]), and the hidden layer feeds the predictions (hurt?, win?, sad?).
Predicting the hidden layer

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)

def neural_network(input, weights):
    hid = vect_mat_mul(input,weights[0])
    pred = vect_mat_mul(hid,weights[1])
    return pred

For the first game, the hidden values come out as hid = [0.86, 0.295, 1.23].
Predicting the output layer (and depositing the prediction)

def neural_network(input, weights):
    hid = vect_mat_mul(input,weights[0])
    pred = vect_mat_mul(hid,weights[1])
    return pred

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

# input corresponds to every entry for the first game of the season
input = [toes[0],wlrec[0],nfans[0]]
pred = neural_network(input,weights)
print(pred)

The hidden values hid = [0.86, 0.295, 1.23] produce the final predictions, approximately [0.214, 0.145, 0.507].
The following listing shows how you can do the same operations coded in the previous
section using a convenient Python library called NumPy. Using libraries like NumPy makes
your code faster and easier to read and write.
NumPy version
import numpy as np

          # toes % win  # fans
ih_wgt = np.array([
           [0.1,  0.2,  -0.1],   # hid[0]
           [-0.1, 0.1,   0.9],   # hid[1]
           [0.1,  0.4,   0.1]]).T # hid[2]

          # hid[0] hid[1] hid[2]
hp_wgt = np.array([
           [0.3,   1.1,  -0.3],   # hurt?
           [0.1,   0.2,   0.0],   # win?
           [0.0,   1.3,   0.1]]).T # sad?

weights = [ih_wgt, hp_wgt]

def neural_network(input, weights):
    hid = input.dot(weights[0])
    pred = hid.dot(weights[1])
    return pred

toes  = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65, 0.8, 0.8, 0.9])
nfans = np.array([1.2, 1.3, 0.5, 1.0])

# input corresponds to every entry for the first game of the season
input = np.array([toes[0],wlrec[0],nfans[0]])
pred = neural_network(input,weights)
print(pred)
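As a check (my own sketch, not from the book's listings), the stacked prediction can be reproduced with plain Python lists and compared against the values shown in the figures:

```python
def w_sum(a, b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

def vect_mat_mul(vect, matrix):
    output = [0] * len(matrix)
    for i in range(len(matrix)):
        output[i] = w_sum(vect, matrix[i])
    return output

ih_wgt = [[0.1, 0.2, -0.1],   # hid[0]
          [-0.1, 0.1, 0.9],   # hid[1]
          [0.1, 0.4, 0.1]]    # hid[2]
hp_wgt = [[0.3, 1.1, -0.3],   # hurt?
          [0.1, 0.2, 0.0],    # win?
          [0.0, 1.3, 0.1]]    # sad?

input = [8.5, 0.65, 1.2]      # toes, wlrec, nfans for game 1
hid = vect_mat_mul(input, ih_wgt)
pred = vect_mat_mul(hid, hp_wgt)

print(hid)   # approximately [0.86, 0.295, 1.23]
print(pred)  # approximately [0.2135, 0.145, 0.5065]
```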
A quick primer on NumPy
NumPy does a few things for you. Let’s reveal the magic.
So far in this chapter, we’ve discussed two new types of mathematical tools: vectors and matrices.
You’ve also learned about different operations that occur on vectors and matrices, including dot
products, elementwise multiplication and addition, and vector-matrix multiplication. For these
operations, you’ve written Python functions that can operate on simple Python list objects.
In the short term, you’ll keep writing and
using these functions to be sure you fully
understand what’s going on inside them.
But now that I’ve mentioned NumPy and
several of the big operations, I’d like to
give you a quick rundown of basic NumPy
use so you’ll be ready for the transition to
NumPy-only chapters. Let’s start with the
basics again: vectors and matrices.
import numpy as np

a = np.array([0,1,2,3])   # a vector
b = np.array([4,5,6,7])   # another vector
c = np.array([[0,1,2,3],
              [4,5,6,7]]) # a matrix

d = np.zeros((2,4))       # 2 × 4 matrix of zeros
e = np.random.rand(2,5)   # random 2 × 5 matrix of numbers between 0 and 1

print(a)
print(b)
print(c)
print(d)
print(e)

Output

[0 1 2 3]
[4 5 6 7]
[[0 1 2 3]
 [4 5 6 7]]
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
[[ 0.22717119  0.39712632  0.0627734   0.08431724  0.53469141]
 [ 0.09675954  0.99012254  0.45922775  0.3273326   0.28617742]]

You can create vectors and matrices in multiple ways in NumPy. Most of the common techniques for neural networks are listed in the previous code. Note that the processes for creating a vector and a matrix are identical. If you create a matrix with only one row, you're creating a vector. And, as in mathematics in general, you create a matrix by listing (rows,columns). I say that only so you can remember the order: rows come first, columns come second. Let's see some operations you can perform on these vectors and matrices:

print(a * 0.1)     # multiplies every number in vector a by 0.1
print(c * 0.2)     # multiplies every number in matrix c by 0.2
print(a * b)       # multiplies elementwise between a and b (columns paired)
print(a * b * 0.2) # multiplies elementwise, then multiplies by 0.2
print(a * c)       # performs elementwise multiplication on every row of
                   # matrix c, because c has the same number of columns as a
print(a * e)       # because a and e don't have the same number of columns,
                   # this throws "ValueError: operands could not be
                   # broadcast together with ..."
Go ahead and run all of the previous code. The first bit of “at first confusing but eventually
heavenly” magic should be visible. When you multiply two variables with the * function,
NumPy automatically detects what kinds of variables you’re working with and tries to figure
out the operation you’re talking about. This can be mega-convenient but sometimes makes
NumPy code a bit hard to read. Make sure you keep track of each variable type as you go along.
The general rule of thumb for anything elementwise (+, –, *, /) is that either the two
variables must have the same number of columns, or one of the variables must have only
one column. For example, print(a * 0.1) multiplies a vector by a single number (a
scalar). NumPy says, “Oh, I bet I’m supposed to do vector-scalar multiplication here,” and
then multiplies the scalar (0.1) by every value in the vector. This looks exactly the same as
print(c * 0.2), except NumPy knows that c is a matrix. Thus, it performs scalar-matrix
multiplication, multiplying every element in c by 0.2. Because the scalar has only one
column, you can multiply it by anything (or divide, add, or subtract).
Next up: print(a * b). NumPy first identifies that they’re both vectors. Because neither
vector has only one column, NumPy checks whether they have an identical number of
columns. They do, so NumPy knows to multiply each element by each element, based on
their positions in the vectors. The same is true with addition, subtraction, and division.
print(a * c) is perhaps the most elusive. a is a vector with four columns, and c is a
(2 × 4) matrix. Neither has only one column, so NumPy checks whether they have the
same number of columns. They do, so NumPy multiplies the vector a by each row of c
(as if it were doing elementwise vector multiplication on each row).
Again, the most confusing part is that all of these operations look the same if you don’t
know which variables are scalars, vectors, or matrices. When you “read NumPy,” you’re really
doing two things: reading the operations and keeping track of the shape (number of rows and
columns) of each operation. It will take some practice, but eventually it becomes second nature.
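One habit that helps while practicing (my suggestion, not a NumPy requirement): print .shape after each operation so the dimensions stay visible:

```python
import numpy as np

a = np.array([0, 1, 2, 3])     # vector: shape (4,)
c = np.array([[0, 1, 2, 3],
              [4, 5, 6, 7]])   # matrix: shape (2, 4)

print(a.shape)          # (4,)
print(c.shape)          # (2, 4)
print((a * 0.1).shape)  # (4,): scalar operations preserve shape
print((a * c).shape)    # (2, 4): a is applied to every row of c
```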
Let’s look at a few examples of matrix multiplication in NumPy, noting the input and output shapes
of each matrix.
a = np.zeros((1,4))  # vector of length 4
b = np.zeros((4,3))  # matrix with 4 rows and 3 columns

c = a.dot(b)
print(c.shape)       # outputs (1,3)
There’s one golden rule when using the dot function: if you put the (rows,cols) description
of the two variables you’re “dotting” next to each other, neighboring numbers should always be
the same. In this case, you’re dot-producting (1,4) with (4,3). It works fine and outputs (1,3).
In terms of variable shape, you can think of it as follows, regardless of whether you’re dotting
vectors or matrices: their shape (number of rows and columns) must line up. The columns of the
left matrix must equal the rows on the right, such that (a,b).dot(b,c) = (a,c).
a = np.zeros((2,4))    # matrix with 2 rows and 4 columns
b = np.zeros((4,3))    # matrix with 4 rows and 3 columns

c = a.dot(b)
print(c.shape)         # outputs (2,3)

e = np.zeros((2,1))    # matrix with 2 rows and 1 column
f = np.zeros((1,3))    # matrix with 1 row and 3 columns

g = e.dot(f)
print(g.shape)         # outputs (2,3)

h = np.zeros((5,4)).T  # matrix with 4 rows and 5 columns;
                       # .T flips the rows and columns of a matrix
i = np.zeros((5,6))    # matrix with 5 rows and 6 columns

j = h.dot(i)
print(j.shape)         # outputs (4,6)

h = np.zeros((5,4))    # matrix with 5 rows and 4 columns
i = np.zeros((5,6))    # matrix with 5 rows and 6 columns
j = h.dot(i)           # throws an error
print(j.shape)
Summary
To predict, neural networks perform repeated weighted sums
of the input.
You’ve seen an increasingly complex variety of neural networks in this chapter. I hope it’s clear
that a relatively small number of simple rules are used repeatedly to create larger, more advanced
neural networks. The network’s intelligence depends on the weight values you give it.
Everything we’ve done in this chapter is a form of what’s called forward propagation, wherein
a neural network takes input data and makes a prediction. It’s called this because you’re
propagating activations forward through the network. In these examples, activations are all the
numbers that are not weights and are unique for every prediction.
In the next chapter, you’ll learn how to set weights so your neural networks make accurate
predictions. Just as prediction is based on several simple techniques that are repeated/stacked on
top of each other, weight learning is also a series of simple techniques that are combined many
times across an architecture. See you there!
Chapter 4
introduction to neural learning: gradient descent

In this chapter

• Do neural networks make accurate predictions?
• Why measure error?
• Hot and cold learning
• Calculating both direction and amount from error
• Gradient descent
• Learning is just reducing error
• Derivatives and how to use them to learn
• Divergence and alpha

The only relevant test of the validity of a hypothesis is comparison of its predictions with experience.
—Milton Friedman, Essays in Positive Economics (University of Chicago Press, 1953)
Predict, compare, and learn
In chapter 3, you learned about the paradigm “predict, compare, learn,” and we dove
deep into the first step: predict. In the process, you learned a myriad of things, including
the major parts of neural networks (nodes and weights), how datasets fit into networks
(matching the number of datapoints coming in at one time), and how to use a neural
network to make a prediction.
Perhaps this process raised the question, “How do we set weight values so the network
predicts accurately?” Answering this question is the main focus of this chapter, as we
cover the next two steps of the paradigm: compare and learn.
Compare
Comparing gives a measurement of how much a prediction
“missed” by.
Once you’ve made a prediction, the next step is to evaluate how well you did. This may
seem like a simple concept, but you’ll find that coming up with a good way to measure
error is one of the most important and complicated subjects of deep learning.
There are many properties of measuring error that you’ve likely been doing your whole
life without realizing it. Perhaps you (or someone you know) amplify bigger errors while
ignoring very small ones. In this chapter, you’ll learn how to mathematically teach a network
to do this. You’ll also learn that error is always positive! We’ll consider the analogy of an
archer hitting a target: whether the shot is too low by an inch or too high by an inch, the
error is still just 1 inch. In the neural network compare step, you need to consider these
kinds of properties when measuring error.
As a heads-up, in this chapter we evaluate only one simple way of measuring error: mean
squared error. It’s but one of many ways to evaluate the accuracy of a neural network.
This step will give you a sense for how much you missed, but that isn’t enough to be able to
learn. The output of the compare logic is a “hot or cold” type signal. Given some prediction,
you’ll calculate an error measure that says either “a lot” or “a little.” It won’t tell you why you
missed, what direction you missed, or what you should do to fix the error. It more or less
says “big miss,” “little miss,” or “perfect prediction.” What to do about the error is captured
in the next step, learn.
Learn
Learning tells each weight how it can change to reduce the error.
Learning is all about error attribution, or the art of figuring out how each weight played
its part in creating error. It’s the blame game of deep learning. In this chapter, we’ll
spend many pages looking at the most popular version of the deep learning blame game:
gradient descent.
At the end of the day, learning results in computing a number for each weight. That
number represents how much higher or lower that weight should be in order to reduce the
error. Then you’ll move the weight according to that number, and you’ll be finished.
Compare: Does your network make good predictions?

Let’s measure the error and find out!

Execute the following code in your Jupyter notebook. It should print 0.3025:

knob_weight = 0.5
input = 0.5
goal_pred = 0.8

pred = input * knob_weight

# The raw error (pred - goal_pred) is multiplied by itself, which
# forces it to be positive. Negative error wouldn't make sense.
error = (pred - goal_pred) ** 2
print(error)

The error is a way to measure how much you missed. There are multiple ways to calculate error, as you’ll learn later. This one is mean squared error.
What is the goal_pred variable?
Much like input, goal_pred is a number you recorded in the real world somewhere.
But it’s usually something hard to observe, like “the percentage of people who did wear
sweatsuits,” given the temperature; or “whether the batter did hit a home run,” given his
batting average.
Why is the error squared?
Think about an archer hitting a target. When the shot hits 2 inches too high, how much
did the archer miss by? When the shot hits 2 inches too low, how much did the archer
miss by? Both times, the archer missed by only 2 inches. The primary reason to square
“how much you missed” is that it forces the output to be positive. (pred - goal_pred)
could be negative in some situations, unlike actual error.
Doesn’t squaring make big errors (>1) bigger and small errors (<1) smaller?
Yeah … It’s kind of a weird way of measuring error, but it turns out that amplifying big
errors and reducing small errors is OK. Later, you’ll use this error to help the network learn,
and you’d rather it pay attention to the big errors and not worry so much about the small
ones. Good parents are like this, too: they practically ignore errors if they’re small enough
(breaking the lead on your pencil) but may go nuclear for big errors (crashing the car). See
why squaring is valuable?
Why measure error?
Measuring error simplifies the problem.
The goal of training a neural network is to make correct predictions. That’s what you want.
And in the most pragmatic world (as mentioned in the preceding chapter), you want the
network to take input that you can easily calculate (today’s stock price) and predict things that
are hard to calculate (tomorrow’s stock price). That’s what makes a neural network useful.
It turns out that changing knob_weight to make the network correctly predict
goal_prediction is slightly more complicated than changing knob_weight to make
error == 0. There’s something more concise about looking at the problem this way.
Ultimately, both statements say the same thing, but trying to get the error to 0 seems
more straightforward.
Different ways of measuring error prioritize error differently.
If this is a bit of a stretch right now, that’s OK, but think back to what I said earlier: by
squaring the error, numbers that are less than 1 get smaller, whereas numbers that are greater
than 1 get bigger. You’re going to change what I call pure error (pred - goal_pred) so that
bigger errors become very big and smaller errors quickly become irrelevant.
By measuring error this way, you can prioritize big errors over smaller ones. When you have
somewhat large pure errors (say, 10), you’ll tell yourself that you have very large error (10**2 ==
100); and in contrast, when you have small pure errors (say, 0.01), you’ll tell yourself that you
have very small error (0.01**2 == 0.0001). See what I mean about prioritizing? It’s just modifying
what you consider to be error so that you amplify big ones and largely ignore small ones.
In contrast, if you took the absolute value instead of squaring the error, you wouldn’t have this
type of prioritization. The error would just be the positive version of the pure error—which
would be fine, but different. More on this later.
Why do you want only positive error?
Eventually, you’ll be working with millions of input -> goal_prediction pairs, and you’ll
still want to make accurate predictions. So, you’ll try to take the average error down to 0.
This presents a problem if the error can be positive and negative. Imagine if you were
trying to get the neural network to correctly predict two datapoints—two input ->
goal_prediction pairs. If the first had an error of 1,000 and the second had an error of
–1,000, then the average error would be zero! You’d fool yourself into thinking you predicted
perfectly, when you missed by 1,000 each time! That would be really bad. Thus, you want the
error of each prediction to always be positive so they don’t accidentally cancel each other out
when you average them.
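A tiny script (not in the book) shows the cancellation problem and how squaring fixes it:

```python
# Two hypothetical predictions that each miss by 1,000, in opposite directions.
errors = [1000, -1000]

mean_raw = sum(errors) / len(errors)                       # looks perfect, but isn't
mean_squared = sum(e ** 2 for e in errors) / len(errors)   # reveals the misses

print(mean_raw)       # 0.0
print(mean_squared)   # 1000000.0
```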
Chapter 4: Introduction to neural learning
What’s the simplest form of neural learning?
Learning using the hot and cold method.
At the end of the day, learning is really about one thing: adjusting knob_weight either up
or down so the error is reduced. If you keep doing this and the error goes to 0, you’re done
learning! How do you know whether to turn the knob up or down? Well, you try both up and
down and see which one reduces the error! Whichever one reduces the error is used to update
knob_weight. It’s simple but effective. After you do this over and over again, eventually
error == 0, which means the neural network is predicting with perfect accuracy.
Hot and cold learning
Hot and cold learning means wiggling the weights to see which direction reduces the error
the most, moving the weights in that direction, and repeating until the error gets to 0.
b. An empty network

    weight = 0.1
    lr = 0.01

    def neural_network(input, weight):
        prediction = input * weight
        return prediction

Input data (# toes) enters here; predictions (win?) come out here.

c. PREDICT: Making a prediction and evaluating error

    number_of_toes = [8.5]
    win_or_lose_binary = [1] # (won!!!)

    input = number_of_toes[0]
    true = win_or_lose_binary[0]

    pred = neural_network(input, weight)

    error = (pred - true) ** 2
    print(error)

(Here the prediction is 0.85 and the error is .023.) The error is a way to measure how
much you missed. There are multiple ways to calculate error, as you’ll learn later. This one
is mean squared error. Squaring forces the raw error (pred - true) to be positive by
multiplying it by itself; negative error wouldn’t make sense.
d. COMPARE: Making a prediction with a higher weight and evaluating error

We want to move the weight so the error goes downward. Let’s try moving the weight up and down using
weight+lr and weight-lr, to see which one has the lowest error.

    p_up = neural_network(input, weight + lr)

    e_up = (p_up - true) ** 2
    print(e_up)

(With weight + lr == .11, e_up is about .004.)

e. COMPARE: Making a prediction with a lower weight and evaluating error

    p_dn = neural_network(input, weight - lr)

    e_dn = (p_dn - true) ** 2
    print(e_dn)

(With weight - lr == .09, e_dn is about .055.)

f. COMPARE + LEARN: Comparing the errors and setting the new weight

Errors: down = .055, same = .023, up = .004. Up is best!

    if(error > e_dn or error > e_up):
        if(e_dn < e_up):
            weight -= lr
        if(e_up < e_dn):
            weight += lr
These last five steps are one iteration of hot and cold learning. Fortunately, this iteration got
us pretty close to the correct answer all by itself (the new error is only 0.004). But under
normal circumstances, we’d have to repeat this process many times to find the correct
weights. Some people have to train their networks for weeks or months before they find a
good enough weight configuration.
This reveals what learning in neural networks really is: a search problem. You’re searching
for the best possible configuration of weights so the network’s error falls to 0 (and predicts
perfectly). As with all other forms of search, you might not find exactly what you’re looking
for, and even if you do, it may take some time. Next, we’ll use hot and cold learning for a
slightly more difficult prediction so you can see this searching in action!
Hot and cold learning
This is perhaps the simplest form of learning.
Execute the following code in your Jupyter notebook. (New neural network modifications
are in bold.) This code attempts to correctly predict 0.8:
    weight = 0.5
    input = 0.5
    goal_prediction = 0.8

    step_amount = 0.001                # How much to move the weights each iteration

    for iteration in range(1101):      # Repeat learning many times so the error can keep getting smaller.
        prediction = input * weight
        error = (prediction - goal_prediction) ** 2
        print("Error:" + str(error) + " Prediction:" + str(prediction))

        up_prediction = input * (weight + step_amount)       # Try up!
        up_error = (goal_prediction - up_prediction) ** 2

        down_prediction = input * (weight - step_amount)     # Try down!
        down_error = (goal_prediction - down_prediction) ** 2

        if(down_error < up_error):                           # If down is better, go down!
            weight = weight - step_amount
        if(down_error > up_error):                           # If up is better, go up!
            weight = weight + step_amount
When I run this code, I see the following output:
Error:0.3025 Prediction:0.25
Error:0.30195025 Prediction:0.2505
....
Error:2.50000000033e-07 Prediction:0.7995
Error:1.07995057925e-27 Prediction:0.8
The last step correctly
predicts 0.8!
Characteristics of hot and cold learning
It’s simple.
Hot and cold learning is simple. After making a prediction, you predict two more times, once with a
slightly higher weight and again with a slightly lower weight. You then move weight depending on
which direction gave a smaller error. Repeating this enough times eventually reduces error to 0.
Why did I iterate exactly 1,101 times?
The neural network in the example reaches 0.8 after exactly that many iterations. If you
go past that, it wiggles back and forth between 0.8 and just above or below 0.8, making
for a less pretty error log printed at the bottom of the left page. Feel free to try it.
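If you’d like to verify the 1,101 figure yourself, here’s my back-of-the-envelope arithmetic (the target weight of 1.6 follows from input * weight == 0.8 with input fixed at 0.5):

```python
input, goal_prediction = 0.5, 0.8
goal_weight = goal_prediction / input    # the weight that makes error == 0: 1.6

distance = goal_weight - 0.5             # the starting weight is 0.5
steps = distance / 0.001                 # step_amount is 0.001

# 1100: the number of updates needed; the loop prints before updating, so one
# more iteration shows the finished prediction, hence range(1101).
print(round(steps))
```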
Problem 1: It’s inefficient.
You have to predict multiple times to make a single knob_weight update. This seems very
inefficient.
Problem 2: Sometimes it’s impossible to predict the exact
goal prediction.
With a set step_amount, unless the perfect weight is exactly n*step_amount away, the network
will eventually overshoot by some number less than step_amount. When it does, it will then
start alternating back and forth between each side of goal_prediction. Set step_amount to 0.2
to see this in action. If you set step_amount to 10, you’ll really break it. When I try this, I see the
following output. It never remotely comes close to 0.8!
Error:0.3025 Prediction:0.25
Error:19.8025 Prediction:5.25
Error:0.3025 Prediction:0.25
Error:19.8025 Prediction:5.25
Error:0.3025 Prediction:0.25
....
.... repeating infinitely...
The real problem is that even though you know the correct direction to move weight, you don’t know
the correct amount. Instead, you pick a fixed one at random (step_amount). Furthermore, this amount
has nothing to do with error. Whether error is big or tiny, step_amount is the same. So, hot and cold
learning is kind of a bummer. It’s inefficient because you predict three times for each weight update, and
step_amount is arbitrary, which can prevent you from learning the correct weight value.
What if you had a way to compute both direction and amount for each weight without having to
repeatedly make predictions?
Calculating both direction and amount from error
Let’s measure the error and find the direction and amount!
Execute this code in your Jupyter notebook:
    weight = 0.5
    goal_pred = 0.8
    input = 0.5

    for iteration in range(20):
        pred = input * weight
        error = (pred - goal_pred) ** 2
        direction_and_amount = (pred - goal_pred) * input    # b: pure error; c: scaling, negative reversal, and stopping
        weight = weight - direction_and_amount

        print("Error:" + str(error) + " Prediction:" + str(pred))
What you see here is a superior form of learning known as gradient descent. This method allows
you to (in a single line of code, shown here in bold) calculate both the direction and the amount
you should change weight to reduce error.
What is direction_and_amount?
direction_and_amount represents how you want to change weight. The first part b
is what I call pure error, which equals (pred - goal_pred). (More about this shortly.) The
second part c is the multiplication by the input that performs scaling, negative reversal,
and stopping, modifying the pure error so it’s ready to update weight.
What is the pure error?
The pure error is (pred - goal_pred), which indicates the raw direction and amount you
missed. If this is a positive number, you predicted too high, and vice versa. If this is a big
number, you missed by a big amount, and so on.
What are scaling, negative reversal, and stopping?
These three attributes have the combined effect of translating the pure error into the absolute
amount you want to change weight. They do so by addressing three major edge cases
where the pure error isn’t sufficient to make a good modification to weight.
What is stopping?
Stopping is the first (and simplest) effect on the pure error caused by multiplying it by
input. Imagine plugging a CD player into your stereo. If you turned the volume all the
way up but the CD player was off, the volume change wouldn’t matter. Stopping addresses
this in a neural network. If input is 0, then it will force direction_and_amount to also
be 0. You don’t learn (change the volume) when input is 0, because there’s nothing
to learn. Every weight value has the same error, and moving it makes no difference
because pred is always 0.
What is negative reversal?
This is probably the most difficult and important effect. Normally (when input is positive),
moving weight upward makes the prediction move upward. But if input is negative,
then all of a sudden weight changes directions! When input is negative, moving
weight up makes the prediction go down. It’s reversed! How do you address this? Well,
multiplying the pure error by input will reverse the sign of direction_and_amount in
the event that input is negative. This is negative reversal, ensuring that weight moves in
the correct direction even if input is negative.
What is scaling?
Scaling is the third effect on the pure error caused by multiplying it by input. Logically, if
input is big, your weight update should also be big. This is more of a side effect, because
it often goes out of control. Later, you’ll use alpha to address when that happens.
When you run the previous code, you should see the following output:
Error:0.3025 Prediction:0.25
Error:0.17015625 Prediction:0.3875
Error:0.095712890625 Prediction:0.490625
...
Error:1.7092608064e-05 Prediction:0.79586567925
Error:9.61459203602e-06 Prediction:0.796899259437
Error:5.40820802026e-06 Prediction:0.797674444578
The last steps correctly approach 0.8!
In this example, you saw gradient descent in action in a bit of an oversimplified environment.
Next, you’ll see it in its more native environment. Some terminology will be different, but I’ll
code it in a way that makes it more obviously applicable to other kinds of networks (such as
those with multiple inputs and outputs).
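If you want to see the three effects in isolation, here’s a small demo of mine (it reuses the same direction_and_amount formula with goal_pred = 0.8 and weight = 0.5, and tries a few hypothetical inputs):

```python
goal_pred = 0.8
weight = 0.5

for input in [0.0, -0.5, 0.5, 5.0]:
    pred = input * weight
    direction_and_amount = (pred - goal_pred) * input
    print(input, round(direction_and_amount, 3))

# input  0.0 -> -0.0:   stopping (nothing to learn when input is 0)
# input -0.5 -> 0.525:  negative reversal (the positive update moves weight
#                       down, the correct direction when input is negative)
# input  0.5 -> -0.275: the normal case
# input  5.0 -> 8.5:    scaling (a big input produces a big update)
```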
One iteration of gradient descent
This performs a weight update on a single training example: an (input -> true) pair.

b. An empty network

    weight = 0.1
    alpha = 0.01

    def neural_network(input, weight):
        prediction = input * weight
        return prediction

Input data (# toes) enters here; predictions (win?) come out here.

c. PREDICT: Making a prediction and evaluating error

    number_of_toes = [8.5]
    win_or_lose_binary = [1] # (won!!!)

    input = number_of_toes[0]
    goal_pred = win_or_lose_binary[0]

    pred = neural_network(input, weight)

    error = (pred - goal_pred) ** 2

The error is a way to measure how much you missed. There are multiple ways to
calculate error, as you’ll learn later. This one is mean squared error: squaring forces
the raw error to be positive, because negative error wouldn’t make sense.

d. COMPARE: Calculating the node delta and putting it on the output node

    number_of_toes = [8.5]
    win_or_lose_binary = [1] # (won!!!)

    input = number_of_toes[0]
    goal_pred = win_or_lose_binary[0]

    pred = neural_network(input, weight)
    error = (pred - goal_pred) ** 2

    delta = pred - goal_pred

delta is a measurement of how much this node missed. The true prediction is 1.0, and the
network’s prediction was 0.85, so the network was too low by 0.15. Thus, delta is negative 0.15.
The primary difference between gradient descent and this implementation is the new variable
delta. It’s the raw amount that the node was too high or too low. Instead of computing
direction_and_amount directly, you first calculate how much you want the output node to be
different. Only then do you compute direction_and_amount to change weight (in step e, now
renamed weight_delta):
e. LEARN: Calculating the weight delta and putting it on the weight

    number_of_toes = [8.5]
    win_or_lose_binary = [1] # (won!!!)

    input = number_of_toes[0]
    goal_pred = win_or_lose_binary[0]

    pred = neural_network(input, weight)
    error = (pred - goal_pred) ** 2
    delta = pred - goal_pred

    weight_delta = input * delta
weight_delta is a measure of how much a weight caused the network to miss. You calculate
it by multiplying the weight’s output node delta by the weight’s input. Thus, you create
each weight_delta by scaling its output node delta by the weight’s input. This accounts
for the three aforementioned properties of direction_and_amount: scaling, negative
reversal, and stopping.
f. LEARN: Updating the weight

    number_of_toes = [8.5]
    win_or_lose_binary = [1] # (won!!!)

    input = number_of_toes[0]
    goal_pred = win_or_lose_binary[0]

    pred = neural_network(input, weight)
    error = (pred - goal_pred) ** 2
    delta = pred - goal_pred
    weight_delta = input * delta

    alpha = 0.01                       # Fixed before training
    weight -= weight_delta * alpha
You multiply weight_delta by a small number alpha before using it to update weight. This
lets you control how fast the network learns. If it learns too fast, it can update weights too
aggressively and overshoot. (More on this later.) Note that the weight update made the same
change (small increase) as hot and cold learning.
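You can check the figure’s numbers by hand; this is my arithmetic, using the delta of -0.15 from the previous step:

```python
input = 8.5
delta = -0.15                     # pred 0.85 vs. goal_pred 1.0

weight_delta = input * delta
print(round(weight_delta, 3))     # -1.275 (the figure rounds this to -1.25)

alpha = 0.01
weight = 0.1
weight = weight - weight_delta * alpha
print(round(weight, 5))           # 0.11275 (the figure shows .1125) -- a small increase
```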
Learning is just reducing error
You can modify weight to reduce error.
Putting together the code from the previous pages, we now have the following:
    weight, goal_pred, input = (0.0, 0.8, 0.5)

    for iteration in range(4):
        pred = input * weight                # These lines
        error = (pred - goal_pred) ** 2      # have a secret.
        delta = pred - goal_pred
        weight_delta = delta * input
        weight = weight - weight_delta
        print("Error:" + str(error) + " Prediction:" + str(pred))
The golden method for learning
This approach adjusts each weight in the correct direction and by the correct amount so
that error reduces to 0.
All you’re trying to do is figure out the right direction and amount to modify weight so that
error goes down. The secret lies in the pred and error calculations. Notice that you use pred
inside the error calculation. Let’s replace the pred variable with the code used to generate it:
error = ((input * weight) - goal_pred) ** 2
This doesn’t change the value of error at all! It just combines the two lines of code and
computes error directly. Remember that input and goal_pred are fixed at 0.5 and
0.8, respectively (you set them before the network starts training). So, if you replace the
variable names with their values, the secret becomes clear:
error = ((0.5 * weight) - 0.8) ** 2
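If you doubt that combining the lines changes nothing, here’s a quick check (mine, with an arbitrary weight of 1.0):

```python
input, goal_pred, weight = 0.5, 0.8, 1.0

# Two-line version, exactly as in the training loop.
pred = input * weight
two_line_error = (pred - goal_pred) ** 2

# One-line version, with pred substituted in.
one_line_error = ((input * weight) - goal_pred) ** 2

print(two_line_error == one_line_error)   # True
```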
The secret
For any input and goal_pred, an exact relationship is defined between error and weight,
found by combining the prediction and error formulas. In this case:
error = ((0.5 * weight) - 0.8) ** 2

Let’s say you increased weight by 0.5. If there’s an exact relationship between error and weight,
you should be able to calculate how much this also moves error. What if you wanted to move
error in a specific direction? Could it be done?

[Figure: the error-versus-weight curve, with the slope marked at the current weight.]

This graph represents every value of error for every weight according to the relationship in the
previous formula. Notice it makes a nice bowl shape. The black dot is at the point of both the
current weight and error. The dotted circle is where you want to be (error == 0).
Key takeaway
The slope points to the bottom of the bowl (lowest error) no matter where you are in the
bowl. You can use this slope to help the neural network reduce the error.
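You can verify the takeaway numerically. This sketch of mine uses the calculus derivative of the bowl formula, d(error)/d(weight) = 2 * ((0.5 * weight) - 0.8) * 0.5 (a result you’d find in a derivative table; it isn’t derived in the book):

```python
# Slope of error = ((0.5 * weight) - 0.8) ** 2 at a few weights.
def slope(weight):
    return 2 * ((0.5 * weight) - 0.8) * 0.5

for w in [0.0, 1.0, 1.6, 2.2, 3.2]:
    print(w, slope(w))

# Left of the goal weight (1.6) the slope is negative, at 1.6 it's exactly zero
# (the bottom of the bowl), and to the right it's positive. It also gets steeper
# the farther you move from 1.6, so following the slope downhill always walks
# you toward error == 0.
```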
Let’s watch several steps of learning
Will we eventually find the bottom of the bowl?
    weight, goal_pred, input = (0.0, 0.8, 1.1)

    for iteration in range(4):
        print("-----\nWeight:" + str(weight))
        pred = input * weight
        error = (pred - goal_pred) ** 2
        delta = pred - goal_pred
        weight_delta = delta * input
        weight = weight - weight_delta
        print("Error:" + str(error) + " Prediction:" + str(pred))
        print("Delta:" + str(delta) + " Weight Delta:" + str(weight_delta))
b. A big weight increase

    weight = 0.0, error = 0.64
    delta (raw error) = -0.8
    weight_delta = -0.88 (the raw error modified for scaling, negative reversal,
    and stopping per this weight and input)

c. Overshot a bit; let’s go back the other way.

    weight = 0.88, error = 0.03

d. Overshot again! Let’s go back, but only a little.

    weight = 0.69, error = 0.002

e. OK, we’re pretty much there.

    weight = 0.73, error = 0.000009

f. Code output

    -----
    Weight:0.0
    Error:0.64 Prediction:0.0
    Delta:-0.8 Weight Delta:-0.88
    -----
    Weight:0.88
    Error:0.028224 Prediction:0.968
    Delta:0.168 Weight Delta:0.1848
    -----
    Weight:0.6952
    Error:0.0012446784 Prediction:0.76472
    Delta:-0.03528 Weight Delta:-0.038808
    -----
    Weight:0.734008
    Error:5.489031744e-05 Prediction:0.8074088
    Delta:0.0074088 Weight Delta:0.00814968
Why does this work? What is weight_delta, really?
Let’s back up and talk about functions. What is a function?
How do you understand one?
Consider this function:
    def my_function(x):
        return x * 2
A function takes some numbers as input and gives you another number as output. As you can
imagine, this means the function defines some sort of relationship between the input number(s)
and the output number(s). Perhaps you can also see why the ability to learn a function is
so powerful: it lets you take some numbers (say, image pixels) and convert them into other
numbers (say, the probability that the image contains a cat).
Every function has what you might call moving parts: pieces you can tweak or change to make
the output the function generates different. Consider my_function in the previous example. Ask
yourself, “What’s controlling the relationship between the input and the output of this function?”
The answer is, the 2. Ask the same question about the following function:
error = ((input * weight) - goal_pred) ** 2
What’s controlling the relationship between input and the output (error)? Plenty of things
are—this function is a bit more complicated! goal_pred, input, **2, weight, and all the
parentheses and algebraic operations (addition, subtraction, and so on) play a part in calculating
the error. Tweaking any one of them would change the error. This is important to consider.
As a thought exercise, consider changing goal_pred to reduce the error. This is silly, but totally
doable. In life, you might call this (setting goals to be whatever your capability is) “giving up.”
You’re denying that you missed! That wouldn’t do.
What if you changed input until error went to 0? Well, that’s akin to seeing the world as you
want to see it instead of as it actually is. You’re changing the input data until you’re predicting
what you want to predict (this is loosely how inceptionism works).
Now consider changing the 2, or the additions, subtractions, or multiplications. This is just
changing how you calculate error in the first place. The error calculation is meaningless if
it doesn’t actually give a good measure of how much you missed (with the right properties
mentioned a few pages ago). This won’t do, either.
What’s left? The only variable remaining is weight. Adjusting it doesn’t change your perception
of the world, doesn’t change your goal, and doesn’t destroy your error measure. Changing
weight means the function conforms to the patterns in the data. By forcing the rest of the
function to be unchanging, you force the function to correctly model some pattern in the data.
It’s only allowed to modify how the network predicts.
To sum up: you modify specific parts of an error function until the error value goes to 0. This error
function is calculated using a combination of variables, some of which you can change (weights) and
some of which you can’t (input data, output data, and the error logic):
    weight = 0.5
    goal_pred = 0.8
    input = 0.5

    for iteration in range(20):
        pred = input * weight
        error = (pred - goal_pred) ** 2
        direction_and_amount = (pred - goal_pred) * input
        weight = weight - direction_and_amount

        print("Error:" + str(error) + " Prediction:" + str(pred))
Key takeaway
You can modify anything in the pred calculation except input.
We’ll spend the rest of this book (and many deep learning researchers will spend the rest of
their lives) trying everything you can imagine on that pred calculation so that it can make good
predictions. Learning is all about automatically changing the prediction function so that it
makes good predictions—aka, so that the subsequent error goes down to 0.
Now that you know what you’re allowed to change, how do you go about doing the changing?
That’s the good stuff. That’s the machine learning, right? In the next section, we’re going to talk
about exactly that.
Tunnel vision on one concept
Concept: Learning is adjusting the weight to reduce the error to 0.
So far in this chapter, we’ve been hammering on the idea that learning is really just about
adjusting weight to reduce error to 0. This is the secret sauce. Truth be told, knowing how to
do this is all about understanding the relationship between weight and error. If you understand
this relationship, you can know how to adjust weight to reduce error.
What do I mean by “understand the relationship”? Well, to understand the relationship between
two variables is to understand how changing one variable changes the other. In this case, what
you’re really after is the sensitivity between these two variables. Sensitivity is another name for
direction and amount. You want to know how sensitive error is to weight. You want to know
the direction and the amount that error changes when you change weight. This is the goal. So
far, you’ve seen two different methods that attempt to help you understand this relationship.
When you were wiggling weight (hot and cold learning) and studying its effect on error, you
were experimentally studying the relationship between these two variables. It’s like walking
into a room with 15 different unlabeled light switches. You start flipping them on and off to
learn about their relationship to various lights in the room. You did the same thing to study the
relationship between weight and error: you wiggled weight up and down and watched for how
it changed error. Once you knew the relationship, you could move weight in the right direction
using two simple if statements:
    if(down_error < up_error):
        weight = weight - step_amount
    if(down_error > up_error):
        weight = weight + step_amount
Now, let’s go back to the earlier formula that combined the pred and error logic. As
mentioned, they quietly define an exact relationship between error and weight:
error = ((input * weight) - goal_pred) ** 2
This line of code, ladies and gentlemen, is the secret. This is a formula. This is the relationship
between error and weight. This relationship is exact. It’s computable. It’s universal. It is and will
always be.
Now, how can you use this formula to know how to change weight so that error moves in a
particular direction? That is the right question. Stop. I beg you. Stop and appreciate this moment.
This formula is the exact relationship between these two variables, and now you’re going to
figure out how to change one variable to move the other variable in a particular direction.
As it turns out, there’s a method for doing this for any formula. You’ll use it to reduce error.
A box with rods poking out of it
Picture yourself sitting in front of a cardboard box that has two circular rods sticking
through two little holes. The blue rod is sticking out of the box by 2 inches, and the red rod
is sticking out of the box by 4 inches. Imagine that I tell you these rods were connected, but I
won’t tell you in what way. You have to experiment to figure it out.
So, you take the blue rod and push it in 1 inch, and watch as, while you’re pushing, the red
rod also moves into the box by 2 inches. Then, you pull the blue rod back out 1 inch, and the
red rod follows again, pulling out by 2 inches. What did you learn? Well, there seems to be
a relationship between the red and blue rods. However much you move the blue rod, the red
rod will move by twice as much. You might say the following is true:
red_length = blue_length * 2
As it turns out, there’s a formal definition for “When I tug on this part, how much does this
other part move?” It’s called a derivative, and all it really means is “How much does rod X
move when I tug on rod Y?”
In the case of the red and blue rods, the derivative for “How much does red move when
I tug on blue?” is 2. Just 2. Why is it 2? That’s the multiplicative relationship determined by
the formula:
    red_length = blue_length * 2     # the 2 is the derivative
Notice that you always have the derivative between two variables. You’re always looking to
know how one variable moves when you change another one. If the derivative is positive,
then when you change one variable, the other will move in the same direction. If the
derivative is negative, then when you change one variable, the other will move in the
opposite direction.
Consider a few examples. Because the derivative of red_length compared to blue_length
is 2, both numbers move in the same direction. More specifically, red will move twice as
much as blue in the same direction. If the derivative had been –1, red would move in the
opposite direction by the same amount. Thus, given a function, the derivative represents the
direction and the amount that one variable changes if you change the other variable. This is
exactly what we were looking for.
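The rod experiment translates directly into code; this is my sketch of the book’s analogy, nudging the blue rod and measuring how far red follows:

```python
# The hidden mechanism inside the box.
def red_length(blue_length):
    return blue_length * 2

nudge = 1.0   # push the blue rod by 1 inch
derivative = (red_length(3.0 + nudge) - red_length(3.0)) / nudge
print(derivative)   # 2.0 -- however much blue moves, red moves twice as much
```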
Derivatives: Take two
Still a little unsure about them? Let’s take another perspective.
I’ve heard people explain derivatives two ways. One way is all about understanding how one
variable in a function changes when you move another variable. The other way says that a
derivative is the slope at a point on a line or curve. As it turns out, if you take a function and
plot it (draw it), the slope of the line you plot is the same thing as “how much one variable
changes when you change the other.” Let me show you by plotting our favorite function:
error = ((input * weight) - goal_pred) ** 2
Remember, goal_pred and input are fixed, so you can rewrite this function:
error = ((0.5 * weight) - 0.8) ** 2
Because there are only two variables left that change (all the rest of them are fixed), you can take
every weight and compute the error that goes with it. Let’s plot them.
As you can see, the plot looks like a big U-shaped curve. Notice that there’s also a point in
the middle where error == 0. Also notice that to the right of that point, the slope of the line
is positive, and to the left of that point, the slope of the line is negative. Perhaps even more
interesting, the farther away from the goal weight you move, the steeper the slope gets.
These are useful properties. The slope’s sign gives you direction, and the slope’s steepness gives
you amount. You can use both of these to help find the goal weight.
Even now, when I look at that curve, it’s easy for me to lose track of what it represents. It’s similar
to the hot and cold method for learning. If you tried every possible value for weight and plotted
it out, you’d get this curve.

[Figure: the error curve, with the starting weight at weight = 0.5 (error = 0.3025,
direction_and_amount = -0.3025), the goal weight at weight = 1.6 (error = 0.0,
direction_and_amount = 0.0), and the slope drawn at the starting point.]

And what’s remarkable about derivatives is that they can see past the big formula for computing
error (at the beginning of this section) and see this curve. You can compute the slope (derivative)
of the line for any value of weight. You can then use this slope (derivative) to figure out which
direction reduces the error. Even better, based on the steepness, you can get at least some idea
of how far away you are from the optimal point where the slope is zero (although not an exact
answer, as you’ll learn more about later).
What you really need to know
With derivatives, you can pick any two variables in any formula,
and know how they interact.
Take a look at this big whopper of a function:
y = (((beta * gamma) ** 2) + (epsilon + 22 - x)) ** (1/2)
Here’s what you need to know about derivatives. For any function (even this whopper), you can
pick any two variables and understand their relationship with each other. For any function, you
can pick two variables and plot them on an x-y graph as we did earlier. For any function, you can
pick two variables and compute how much one changes when you change the other. Thus, for
any function, you can learn how to change one variable so that you can move another variable in
a direction. Sorry to harp on this point, but it’s important that you know this in your bones.
Bottom line: in this book, you’re going to build neural networks. A neural network is really just
one thing: a bunch of weights you use to compute an error function. And for any error function
(no matter how complicated), you can compute the relationship between any weight and the
final error of the network. With this information, you can change each weight in the neural
network to reduce error down to 0—and that’s exactly what you’re going to do.
What you don’t really need to know
Calculus
So, it turns out that learning all the methods for taking any two variables in any function and
computing their relationship takes about three semesters of college. Truth be told, if you went
through all three semesters so that you could learn how to do deep learning, you’d use only
a very small subset of what you learned. And really, calculus is just about memorizing and
practicing every possible derivative rule for every possible function.
In this book, I’m going to do what I typically do in real life (cuz I’m lazy—I mean, efficient):
look up the derivative in a reference table. All you need to know is what the derivative
represents. It’s the relationship between two variables in a function so you can know how
much one changes when you change the other. It’s just the sensitivity between two variables.
I know that was a lot of information to say, “It’s the sensitivity between two variables,” but
it is. Note that this can include positive sensitivity (when variables move together), negative
sensitivity (when they move in opposite directions), and zero sensitivity (when one stays fixed
regardless of what you do to the other). For example, y = 0 * x. Move x, and y is always 0.
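If you'd like to see "sensitivity" in code, here's a tiny sketch (my own, not from a reference table): nudge one variable and measure how much the other one moves.

```python
def sensitivity(f, x, nudge=0.0001):
    # How much does f's output move per unit change in x?
    # Nudge x a little and measure the response.
    return (f(x + nudge) - f(x)) / nudge

# Positive sensitivity: the variables move together.
print(round(sensitivity(lambda x: 3 * x, 2.0), 3))   # 3.0
# Negative sensitivity: they move in opposite directions.
print(round(sensitivity(lambda x: -2 * x, 2.0), 3))  # -2.0
# Zero sensitivity: y = 0 * x ignores x entirely.
print(round(sensitivity(lambda x: 0 * x, 2.0), 3))   # 0.0
```

That's all a derivative is, minus the three semesters: the value this nudge approaches as it shrinks to zero.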
Enough about derivatives. Let’s get back to gradient descent.
Chapter 4 | Introduction to neural learning
How to use a derivative to learn
weight_delta is your derivative.
(Figure: the error curve. Starting weight: weight = 0.5, error = 0.3025, weight_delta = -0.3025.)
What’s the difference between error and
the derivative of error and weight? error
is a measure of how much you missed. The
derivative defines the relationship between each
weight and how much you missed. In other
words, it tells how much changing a weight
contributed to the error. So, now that you know
this, how do you use it to move the error in a
particular direction?
(Figure: the error curve again. Goal weight: weight = 1.6, error = 0.0, weight_delta = 0.0. A dotted slope line is attached to the black dot at the starting weight.)

You've learned the relationship between two variables in a function, but how do you exploit that relationship? As it turns out, this is incredibly visual and intuitive. Check out the error curve again. The black dot is where weight starts out: (0.5). The dotted circle is where you want it to go: the goal weight. Do you see the dotted line attached to the black dot? That's the slope, otherwise known as the derivative. It tells you at that point in the curve how much error changes when you change weight. Notice that it's pointed downward: it's a negative slope.
The slope of a line or curve always points in the opposite direction of the lowest point of the line or
curve. So, if you have a negative slope, you increase weight to find the minimum of error. Check it out.
So, how do you use the derivative to find the error minimum (lowest point in the error graph)?
You move the opposite direction of the slope—the opposite direction of the derivative. You can
take each weight value, calculate its derivative with respect to error (so you’re comparing two
variables: weight and error), and then change weight in the opposite direction of that slope.
That will move you to the minimum.
Remember back to the goal again: you’re trying to figure out the direction and the amount to
change the weight so the error goes down. A derivative gives you the relationship between any
two variables in a function. You use the derivative to determine the relationship between any
weight and error. You then move the weight in the opposite direction of the derivative to find the
lowest weight. Voilà! The neural network learns.
This method for learning (finding error minimums) is called gradient descent. This name should
seem intuitive. You move the weight value opposite the gradient value, which reduces error to
0. By opposite, I mean you increase the weight when you have a negative gradient, and vice versa.
It’s like gravity.
Look familiar?
weight = 0.0
goal_pred = 0.8
input = 1.1

for iteration in range(4):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    delta = pred - goal_pred
    weight_delta = delta * input    # Derivative (how fast the error changes, given changes in the weight)
    weight = weight - weight_delta
    print("Error:" + str(error) + " Prediction:" + str(pred))
b   A big weight increase

(Diagram: weight = 0.0, input = 1.1, pred = 0.0, delta (raw error) = -.8, error = 0.64; weight_delta = -0.88, the raw error modified for scaling, negative reversal, and stopping per this weight and input.)

c   Overshot a bit; let's go back the other way.

(Diagram: weight = 0.88, pred = .97, delta = 0.17, error = 0.03; weight_delta = .187.)
Breaking gradient descent
Just give me the code!
weight = 0.5
goal_pred = 0.8
input = 0.5

for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    delta = pred - goal_pred
    weight_delta = input * delta
    weight = weight - weight_delta
    print("Error:" + str(error) + " Prediction:" + str(pred))
When I run this code, I see the following output:
Error:0.3025 Prediction:0.25
Error:0.17015625 Prediction:0.3875
Error:0.095712890625 Prediction:0.490625
...
Error:1.7092608064e-05 Prediction:0.79586567925
Error:9.61459203602e-06 Prediction:0.796899259437
Error:5.40820802026e-06 Prediction:0.797674444578
Now that it works, let’s break it. Play around with the starting weight, goal_pred, and
input numbers. You can set them all to just about anything, and the neural network will
figure out how to predict the output given the input using the weight. See if you can find
some combinations the neural network can’t predict. I find that trying to break something
is a great way to learn about it.
Let’s try setting input equal to 2, but still try to get the algorithm to predict 0.8. What
happens? Take a look at the output:
Error:0.04 Prediction:1.0
Error:0.36 Prediction:0.2
Error:3.24 Prediction:2.6
...
Error:6.67087267987e+14 Prediction:-25828031.8
Error:6.00378541188e+15 Prediction:77484098.6
Error:5.40340687069e+16 Prediction:-232452292.6
Whoa! That’s not what you want. The predictions exploded! They alternate from negative to
positive and negative to positive, getting farther away from the true answer at every step. In
other words, every update to the weight overcorrects. In the next section, you’ll learn more
about how to combat this phenomenon.
Visualizing the overcorrections
b   A big weight increase

(Diagram: weight = 0.5, input = 2.0, pred = 1.0, delta (raw error) = 0.2, error = 0.04; weight_delta = 0.4, the raw error modified for scaling, negative reversal, and stopping per this weight and input.)

c   Overshot a bit; let's go back the other way.

(Diagram: weight = 0.1, pred = 0.2, delta = -.6, error = 0.36; weight_delta = -1.2.)

d   Overshot again! Let's go back, but only a little.

(Diagram: weight = 1.3, pred = 2.6, delta = 1.8, error = 3.24; weight_delta = 3.6.)
Divergence
Sometimes neural networks explode in value. Oops?
(Figure: a very tight error curve over the weight values. From Start, each step's derivative value flings the weight past the Goal: Step 1, Step 2, Step 3, each farther away than the last.)
What really happened? The explosion in the error was caused by the fact that you made the
input larger. Consider how you’re updating the weight:
weight = weight - (input * (pred - goal_pred))
If the input is sufficiently large, this can make the weight update large even when the error is
small. What happens when you have a large weight update and a small error? The network
overcorrects. If the new error is even bigger, the network overcorrects even more. This
causes the phenomenon you saw earlier, called divergence.
If you have a big input, the prediction is very sensitive to changes in the weight (because
pred = input * weight). This can cause the network to overcorrect. In other words, even
though the weight is still starting at 0.5, the derivative at that point is very steep. See how
tight the U-shaped error curve is in the graph?
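Here's a quick sketch (my own) comparing the slope at the same weight for a small input and a large one:

```python
weight, goal_pred = 0.5, 0.8

for input in (0.5, 2.0):
    pred = input * weight
    delta = pred - goal_pred
    derivative = input * delta  # slope of error with respect to weight
    print("input:", input, "delta:", round(delta, 3), "slope:", round(derivative, 3))

# With input = 2.0 the miss (delta) is smaller than with input = 0.5,
# yet the slope is steeper: the input scales the derivative.
```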
This is really intuitive. How do you predict? By multiplying the input by the weight. So, if the
input is huge, small changes in the weight will cause changes in the prediction. The error is
very sensitive to the weight. In other words, the derivative is really big. How do you make
it smaller?
Introducing alpha
It’s the simplest way to prevent overcorrecting weight updates.
What’s the problem you’re trying to solve? That if the input is too big, then the weight
update can overcorrect. What’s the symptom? That when you overcorrect, the new
derivative is even larger in magnitude than when you started (although the sign will be
the opposite).
Stop and consider this for a second. Look again at the graph in the previous section to
understand the symptom. Step 2 is even farther away from the goal, which means the
derivative is even greater in magnitude. This causes step 3 to be even farther from the
goal than step 2, and the neural network continues like this, demonstrating divergence.
The symptom is this overshooting. The solution is to multiply the weight update by a
fraction to make it smaller. In most cases, this involves multiplying the weight update
by a single real-valued number between 0 and 1, known as alpha. Note that alpha does nothing
about the core issue (the oversized input); it simply shrinks every weight update, including
those for inputs that aren't too large.
Finding the appropriate alpha, even for state-of-the-art neural networks, is often done
by guessing. You watch the error over time. If it starts diverging (going up), then the
alpha is too high, and you decrease it. If learning is happening too slowly, then the alpha
is too low, and you increase it. There are methods beyond simple gradient descent that
attempt to compensate for this automatically, but gradient descent is still very popular.
Alpha in code
Where does our “alpha” parameter come into play?
You just learned that alpha reduces the weight update so it doesn’t overshoot. How does this
affect the code? Well, you were updating the weights according to the following formula:
weight = weight - derivative
Accounting for alpha is a rather small change, as shown next. Notice that if alpha is
small (say, 0.01), it will reduce the weight update considerably, thus preventing it from
overshooting:
weight = weight - (alpha * derivative)
That was easy. Let’s install alpha into the tiny implementation from the beginning of this
chapter and run it where input = 2 (which previously didn’t work):
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1

for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    derivative = input * (pred - goal_pred)
    weight = weight - (alpha * derivative)
    print("Error:" + str(error) + " Prediction:" + str(pred))

What happens when you make alpha crazy small or big? What about making it negative?
Error:0.04 Prediction:1.0
Error:0.0144 Prediction:0.92
Error:0.005184 Prediction:0.872
...
Error:1.14604719983e-09 Prediction:0.800033853319
Error:4.12576991939e-10 Prediction:0.800020311991
Error:1.48527717099e-10 Prediction:0.800012187195
Voilà! The tiniest neural network can now make good predictions again. How did I
know to set alpha to 0.1? To be honest, I tried it, and it worked. And despite all the crazy
advancements of deep learning in the past few years, most people just try several orders of
magnitude of alpha (10, 1, 0.1, 0.01, 0.001, 0.0001) and then tweak it from there to see what
works best. It's more art than science. There are more advanced methods, which we'll get to
later; for now, try various alphas until you find one that seems to work pretty well. Play with it.
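Here's one way to sketch that try-several-orders-of-magnitude habit (my own loop, not a prescribed recipe): run the tiny network under each candidate alpha and compare the error you end up with.

```python
def final_error(alpha, weight=0.5, goal_pred=0.8, input=2.0, n=20):
    # Run the tiny network n times with the given alpha and
    # report the error you end up with.
    for _ in range(n):
        pred = input * weight
        derivative = input * (pred - goal_pred)
        weight = weight - (alpha * derivative)
    return (input * weight - goal_pred) ** 2

# Try several orders of magnitude, as suggested above.
for alpha in (10, 1.0, 0.1, 0.01, 0.001):
    print(alpha, final_error(alpha))
```

With these settings, the large alphas diverge, the tiny ones crawl, and 0.1 lands near zero error, which is exactly the pattern you tune by hand.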
Memorizing
It’s time to really learn this stuff.
This may sound a bit intense, but I can’t stress enough the value I’ve found from this
exercise: see if you can build the code from the previous section in a Jupyter notebook (or a
.py file, if you must) from memory. I know that might seem like overkill, but I (personally)
didn’t have my “click” moment with neural networks until I was able to perform this task.
Why does this work? Well, for starters, the only way to know you’ve gleaned all the
information necessary from this chapter is to try to produce it from your head. Neural
networks have lots of small moving parts, and it’s easy to miss one.
Why is this important for the rest of the book? In the following chapters, I’ll be referring to
the concepts discussed in this chapter at a faster pace so that I can spend plenty of time on
the newer material. It’s vitally important that when I say something like “Add your alpha
parameterization to the weight update,” you immediately recognize which concepts from
this chapter I’m referring to.
All that is to say, memorizing small bits of neural network code has been hugely beneficial
for me personally, as well as for many individuals who have taken my advice on this subject
in the past.
5
learning multiple weights at a time:
generalizing gradient descent

In this chapter
•   Gradient descent learning with multiple inputs
•   Freezing one weight: what does it do?
•   Gradient descent learning with multiple outputs
•   Gradient descent learning with multiple inputs and outputs
•   Visualizing weight values
•   Visualizing dot products

You don't learn to walk by following rules. You learn by doing and by falling over.
—Richard Branson, http://mng.bz/oVgd
Gradient descent learning with multiple inputs
Gradient descent also works with multiple inputs.
In the preceding chapter, you learned how to use gradient descent to update a weight. In
this chapter, we’ll more or less reveal how the same techniques can be used to update a
network that contains multiple weights. Let’s start by jumping in the deep end, shall we? The
following diagram shows how a network with multiple inputs can learn.
b   An empty network with multiple inputs

(Diagram: input data enters here, three values at a time, through the #toes, win/loss, and #fans nodes; weights .1, .2, and -.1 connect them to the "win?" node, where predictions come out.)

def w_sum(a,b):
    assert(len(a) == len(b))
    output = 0
    for i in range(len(a)):
        output += (a[i] * b[i])
    return output

weights = [0.1, 0.2, -.1]

def neural_network(input, weights):
    pred = w_sum(input,weights)
    return pred
c   PREDICT + COMPARE: Making a prediction, and calculating error and delta

(Diagram: inputs 8.5, 65%, and 1.2, through weights .1, .2, and -.1, produce the prediction 0.86; error = .020, delta = -.14. The input corresponds to every entry for the first game of the season.)

toes  = [8.5 , 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2 , 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]
true = win_or_lose_binary[0]

input = [toes[0],wlrec[0],nfans[0]]

pred = neural_network(input,weights)
error = (pred - true) ** 2
delta = pred - true
d   LEARN: Calculating each weight_delta and putting it on each weight

(Diagram: the shared delta, -.14, multiplies each input (8.5, 0.65, 1.2) to produce weight_deltas -1.2, -.09, and -.17.)

def ele_mul(number,vector):
    output = [0,0,0]
    assert(len(output) == len(vector))
    for i in range(len(vector)):
        output[i] = number * vector[i]
    return output

input = [toes[0],wlrec[0],nfans[0]]

pred = neural_network(input,weights)
error = (pred - true) ** 2
delta = pred - true

weight_deltas = ele_mul(delta,input)

8.5  * -0.14 = -1.19  = weight_deltas[0]
0.65 * -0.14 = -0.091 = weight_deltas[1]
1.2  * -0.14 = -0.168 = weight_deltas[2]
There’s nothing new in this diagram. Each weight_delta is calculated by taking its output
delta and multiplying it by its input. In this case, because the three weights share the same
output node, they also share that node’s delta. But the weights have different weight deltas
owing to their different input values. Notice further that you can reuse the ele_mul function
from before, because you’re multiplying each value in weights by the same value delta.
e   LEARN: Updating the weights

(Diagram: the network after the update, with weights 0.1119, .201, and -.098 on # toes, win/loss, and # fans.)

input = [toes[0],wlrec[0],nfans[0]]

pred = neural_network(input,weights)
error = (pred - true) ** 2
delta = pred - true

weight_deltas = ele_mul(delta,input)

alpha = 0.01
for i in range(len(weights)):
    weights[i] -= alpha * weight_deltas[i]

print("Weights:" + str(weights))
print("Weight Deltas:" + str(weight_deltas))

 0.1 - (-1.19 * 0.01) = 0.1119 = weights[0]
 0.2 - (-.091 * 0.01) = 0.2009 = weights[1]
-0.1 - (-.168 * 0.01) = -0.098 = weights[2]
Gradient descent with multiple inputs explained
Simple to execute, and fascinating to understand.
When put side by side with the single-weight neural network, gradient descent with
multiple inputs seems rather obvious in practice. But the properties involved are fascinating
and worthy of discussion. First, let’s take a look at them side by side.
b   Single input: Making a prediction and calculating error and delta

(Diagram: input 8.5 through weight .1; error = .023, delta = -.15.)

number_of_toes = [8.5]
win_or_lose_binary = [1] # (won!!!)

input = number_of_toes[0]
true = win_or_lose_binary[0]

pred = neural_network(input,weight)
error = (pred - true) ** 2
delta = pred - true
c   Multi-input: Making a prediction and calculating error and delta

(Diagram: inputs 8.5, 65%, and 1.2 through weights .1, .2, and -.1 give the prediction 0.86; error = .020, delta = -.14. The input corresponds to every entry for the first game of the season.)

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]
true = win_or_lose_binary[0]

input = [toes[0],wlrec[0],nfans[0]]

pred = neural_network(input,weights)
error = (pred - true) ** 2
delta = pred - true
Up until the generation of delta on the output node, single input and multi-input gradient
descent are identical (other than the prediction differences we studied in chapter 3). You make
a prediction and calculate error and delta in identical ways. But the following problem
remains: when you had only one weight, you had only one input (one weight_delta to
generate). Now you have three. How do you generate three weight_deltas?
How do you turn a single delta (on the node)
into three weight_delta values?
Remember the definition and purpose of delta versus weight_delta. delta is a measure
of how much you want a node’s value to be different. In this case, you compute it by a direct
subtraction between the node’s value and what you wanted the node’s value to be (pred - true).
Positive delta indicates the node’s value was too high, and negative that it was too low.
delta
A measure of how much higher or lower you want a node’s value to be, to predict perfectly
given the current training example.
weight_delta, on the other hand, is an estimate of the direction and amount to move the
weights to reduce node_delta, inferred by the derivative. How do you transform delta into
a weight_delta? You multiply delta by a weight’s input.
weight_delta
A derivative-based estimate of the direction and amount you should move a weight to reduce
node_delta, accounting for scaling, negative reversal, and stopping.
Consider this from the perspective of a single weight, highlighted at right:
delta: Hey, inputs—yeah, you three. Next time, predict a
little higher.
Single weight: Hmm: if my input was 0, then my weight
wouldn’t have mattered, and I wouldn’t change a thing
(stopping). If my input was negative, then I’d want
to decrease my weight instead of increase it (negative
reversal). But my input is positive and quite large, so
I’m guessing that my personal prediction mattered a
lot to the aggregated output. I’m going to move my
weight up a lot to compensate (scaling).
(Diagram: the network with inputs 8.5, 65%, and 1.2, weights .1, .2, and -.1, prediction 0.86, error .020, and delta -.14; the weight on the first input is highlighted.)
The single weight increases its value.
What did those three properties/statements really say? They all (stopping, negative reversal, and
scaling) made an observation of how the weight’s role in delta was affected by its input. Thus,
each weight_delta is a sort of input-modified version of delta.
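A quick sketch of those three properties, using nothing but weight_delta = input * delta (the inputs here are made up for illustration):

```python
delta = 0.14  # the node's prediction came out a little too high

# Stopping: a zero input means this weight played no role, so it shouldn't move.
print(0.0 * delta)              # 0.0
# Negative reversal: a negative input flips which direction helps.
print(round(-1.0 * delta, 2))   # -0.14
# Scaling: a big input means this weight mattered a lot, so it moves a lot.
print(round(8.5 * delta, 2))    # 1.19
```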
This brings us back to the original question: how do you turn one (node) delta into three
weight_delta values? Well, because each weight has a unique input and a shared delta, you
use each respective weight’s input multiplied by delta to create each respective weight_delta.
Let’s see this process in action.
In the next two figures, you can see the generation of weight_delta variables for the
previous single-input architecture and for the new multi-input architecture. Perhaps the
easiest way to see how similar they are is to read the pseudocode at the bottom of each
figure. Notice that the multi-weight version multiplies delta (0.14) by every input to create
the various weight_deltas. It’s a simple process.
d   Single input: Calculating weight_delta and putting it on the weight

(Diagram: input 8.5, weight .1, error .023, delta -.15; weight_delta = -1.25.)

number_of_toes = [8.5]
win_or_lose_binary = [1] # (won!!!)

input = number_of_toes[0]
true = win_or_lose_binary[0]

pred = neural_network(input,weight)
error = (pred - true) ** 2
delta = pred - true
weight_delta = input * delta

8.5 * -0.15 = -1.25 => weight_delta
e   Multi-input: Calculating each weight_delta and putting it on each weight

(Diagram: the shared delta, -.14, times inputs 8.5, 0.65, and 1.2 gives weight_deltas -1.2, -.09, and -.17.)

def ele_mul(number,vector):
    output = [0,0,0]
    assert(len(output) == len(vector))
    for i in range(len(vector)):
        output[i] = number * vector[i]
    return output

input = [toes[0],wlrec[0],nfans[0]]

pred = neural_network(input,weights)
error = (pred - true) ** 2
delta = pred - true
weight_deltas = ele_mul(delta,input)

8.5  * -0.14 = -1.2 => weight_deltas[0]
0.65 * -0.14 = -.09 => weight_deltas[1]
1.2  * -0.14 = -.17 => weight_deltas[2]
f   Updating the weight
(Diagram: the new weight on the single input is .1125.)

number_of_toes = [8.5]
win_or_lose_binary = [1] # (won!!!)

input = number_of_toes[0]
true = win_or_lose_binary[0]

pred = neural_network(input,weight)
error = (pred - true) ** 2
delta = pred - true
weight_delta = input * delta

alpha = 0.01     # fixed before training
weight -= weight_delta * alpha

You multiply weight_delta by a small number, alpha, before using it to update the weight. This allows you to control how quickly the network learns. If it learns too quickly, it can update weights too aggressively and overshoot. Note that the weight update made the same change (small increase) as hot and cold learning.
g   Updating the weights

(Diagram: the network after the update, with weights .1119, .201, and -.098 on #toes, win/loss, and #fans.)

input = [toes[0],wlrec[0],nfans[0]]

pred = neural_network(input,weights)
error = (pred - true) ** 2
delta = pred - true

weight_deltas = ele_mul(delta,input)

alpha = 0.01
for i in range(len(weights)):
    weights[i] -= alpha * weight_deltas[i]

 0.1 - (-1.19 * 0.01) = 0.1119 = weights[0]
 0.2 - (-.091 * 0.01) = 0.2009 = weights[1]
-0.1 - (-.168 * 0.01) = -0.098 = weights[2]
The last step is also nearly identical to the single-input network. Once you have the
weight_delta values, you multiply them by alpha and subtract them from the weights. It’s
literally the same process as before, repeated across multiple weights instead of a single one.
Let’s watch several steps of learning
def neural_network(input, weights):
    out = 0
    for i in range(len(input)):
        out += (input[i] * weights[i])
    return out

def ele_mul(scalar, vector):
    out = [0,0,0]
    for i in range(len(out)):
        out[i] = vector[i] * scalar
    return out

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]
true = win_or_lose_binary[0]

alpha = 0.01
weights = [0.1, 0.2, -.1]
input = [toes[0],wlrec[0],nfans[0]]

for iter in range(3):
    pred = neural_network(input,weights)
    error = (pred - true) ** 2
    delta = pred - true
    weight_deltas = ele_mul(delta,input)

    print("Iteration:" + str(iter+1))
    print("Pred:" + str(pred))
    print("Error:" + str(error))
    print("Delta:" + str(delta))
    print("Weights:" + str(weights))
    print("Weight_Deltas:")
    print(str(weight_deltas))
    print()

    for i in range(len(weights)):
        weights[i] -= alpha * weight_deltas[i]
b   (Figure: iteration 1. Weights .1, .2, -.1; inputs 8.5, 65%, 1.2; pred 0.86; delta -.14; error .020; weight_deltas -1.2, -.09, -.17. Three error/weight curves, one per weight, with the dotted slope for a the steepest.)
We can make three individual error/weight curves, one for each weight. As before, the slopes
of these curves (the dotted lines) are reflected by the weight_delta values. Notice that a is
steeper than the others. Why is weight_delta steeper for a than the others if they share the
same output delta and error measure? Because a has an input value that’s significantly
higher than the others and thus, a higher derivative.
c   (Figure: iteration 2. Weights .112, .201, -.098; pred .964; delta -.04; error .001; weight_deltas -.31, -.02, -.04.)

d   (Figure: iteration 3. Weights .115, .201, -.098; pred .991; delta -.01; error .000; weight_deltas -.08, -.01, -.01.)
Here are a few additional takeaways. Most of the learning (weight changing) was performed
on the weight with the largest input a , because the input changes the slope significantly.
This isn’t necessarily advantageous in all settings. A subfield called normalization helps
encourage learning across all weights despite dataset characteristics such as this. This
significant difference in slope forced me to set alpha lower than I wanted (0.01 instead of
0.1). Try setting alpha to 0.1: do you see how a causes it to diverge?
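You can watch that divergence directly. Here's a small sketch (my own) of the same three-weight update with alpha = 0.1 instead of 0.01:

```python
def w_sum(a, b):
    return sum(x * y for x, y in zip(a, b))

input = [8.5, 0.65, 1.2]
true = 1
weights = [0.1, 0.2, -0.1]
alpha = 0.1  # ten times larger than before

errors = []
for _ in range(5):
    pred = w_sum(input, weights)
    delta = pred - true
    errors.append(delta ** 2)
    for i in range(len(weights)):
        weights[i] -= alpha * (input[i] * delta)

# The 8.5 input makes every update overshoot, so error grows each step.
print(errors)
```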
Freezing one weight: What does it do?
This experiment is a bit advanced in terms of theory, but I think it’s a great exercise to
understand how the weights affect each other. You’re going to train again, except weight a
won’t ever be adjusted. You’ll try to learn the training example using only weights b and c
(weights[1] and weights[2]).
def neural_network(input, weights):
    out = 0
    for i in range(len(input)):
        out += (input[i] * weights[i])
    return out

def ele_mul(scalar, vector):
    out = [0,0,0]
    for i in range(len(out)):
        out[i] = vector[i] * scalar
    return out

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

win_or_lose_binary = [1, 1, 0, 1]
true = win_or_lose_binary[0]

alpha = 0.3
weights = [0.1, 0.2, -.1]
input = [toes[0],wlrec[0],nfans[0]]

for iter in range(3):
    pred = neural_network(input,weights)
    error = (pred - true) ** 2
    delta = pred - true
    weight_deltas = ele_mul(delta,input)
    weight_deltas[0] = 0

    print("Iteration:" + str(iter+1))
    print("Pred:" + str(pred))
    print("Error:" + str(error))
    print("Delta:" + str(delta))
    print("Weights:" + str(weights))
    print("Weight_Deltas:")
    print(str(weight_deltas))
    print()

    for i in range(len(weights)):
        weights[i] -= alpha * weight_deltas[i]

b   (Figure: iteration 1 of the frozen-weight run. Weights .1, .2, -.1; pred 0.86; delta -.14; error .020; three error/weight curves, with weight a frozen.)
(Figures: iterations 2 and 3 of the frozen-weight run. The error/weight curves for b and c slide down to the bottoms of their bowls as those weights update; the curve for a shifts left while its black dot stays fixed, and error goes to 0.)

This tells you what the graphs really are. In truth, these are 2D slices of a four-dimensional shape. Three of the dimensions are the weight values, and the fourth dimension is the error. This shape is called the error plane, and, believe it or not, its curvature is determined by the training data. Why is that the case?

It's because the value of error, given any particular weight configuration, is 100% determined by the training data. Any network can have any weight value, but the error for those weights comes entirely from the data. You've already seen (on several occasions) how the steepness of the U shape is affected by the input data. What you're really trying to do with the neural network is find the lowest point on this big error plane, where the lowest point refers to the lowest error. Interesting, eh? We'll come back to this idea later, so file it away for now.

Perhaps you're surprised to see that a still finds the bottom of the bowl. Why is this? Well, the curves are a measure of each individual weight relative to the global error. Thus, because error is shared, when one weight finds the bottom of the bowl, all the weights find the bottom of the bowl.

Also notice how a finds the bottom of the bowl: instead of the black dot moving, the curve seems to move to the left. What does this mean? The black dot can move horizontally only if the weight is updated. Because the weight for a is frozen for this experiment, the dot must stay fixed. But error clearly goes to 0.

This is an extremely important lesson. First, if you converged (reached error = 0) with the b and c weights and then tried to train a, a wouldn't move. Why? error = 0, which means weight_delta is 0. This reveals a potentially damaging property of neural networks: a may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, then it will never learn to incorporate a into its prediction.
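Here's a sketch (my own) of that damaging property in code: converge using only b and c, then check the weight_delta that a would receive if you unfroze it.

```python
def w_sum(a, b):
    return sum(x * y for x, y in zip(a, b))

input = [8.5, 0.65, 1.2]
true = 1
weights = [0.1, 0.2, -0.1]
alpha = 0.3

# Train with weight a (weights[0]) frozen, as in the experiment above.
for _ in range(50):
    delta = w_sum(input, weights) - true
    weight_deltas = [delta * x for x in input]
    weight_deltas[0] = 0  # freeze a
    for i in range(len(weights)):
        weights[i] -= alpha * weight_deltas[i]

delta = w_sum(input, weights) - true
print(delta ** 2)        # error is (essentially) 0 ...
print(input[0] * delta)  # ... so a's weight_delta is (essentially) 0 too
```

Once b and c have driven error to 0, a's weight_delta vanishes with it, so a stays put no matter how predictive its input might be.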
Gradient descent learning with multiple outputs
Neural networks can also make multiple predictions using only
a single input.
Perhaps this will seem a bit obvious. You calculate each delta the same way and then
multiply them all by the same, single input. This becomes each weight’s weight_delta.
At this point, I hope it’s clear that a simple mechanism (stochastic gradient descent) is
consistently used to perform learning across a wide variety of architectures.
b   An empty network with multiple outputs

(Diagram: the single input, the team's win/loss record, enters here and connects through weights .3, .2, and .9 to three prediction nodes: hurt?, win?, and sad?. Predictions come out there.)

Instead of predicting just whether the team won or lost, now you're also predicting whether they're happy or sad and the percentage of the team members who are hurt. You're making this prediction using only the current win/loss record.

weights = [0.3, 0.2, 0.9]

def neural_network(input, weights):
    pred = ele_mul(input,weights)
    return pred
c   PREDICT: Making a prediction and calculating error and delta

(Diagram: the input, 65%, times weights .3, .2, and .9 gives predictions .195, .13, and .585; errors .009, .757, and .235; deltas .095, -.87, and .485.)

wlrec = [0.65, 1.0, 1.0, 0.9]

hurt = [0.1, 0.0, 0.0, 0.1]
win  = [  1,   1,   0,   1]
sad  = [0.1, 0.0, 0.1, 0.2]

input = wlrec[0]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input,weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]
d   COMPARE: Calculating each weight_delta and putting it on each weight
(Diagram: the shared input node value, .65, times each output delta (.095, -.87, .485) gives weight_deltas .062, -.57, and .315.)

def scalar_ele_mul(number,vector):
    output = [0,0,0]
    assert(len(output) == len(vector))
    for i in range(len(vector)):
        output[i] = number * vector[i]
    return output

wlrec = [0.65, 1.0, 1.0, 0.9]

hurt = [0.1, 0.0, 0.0, 0.1]
win  = [  1,   1,   0,   1]
sad  = [0.1, 0.0, 0.1, 0.2]

input = wlrec[0]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input,weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]

weight_deltas = scalar_ele_mul(input,delta)

As before, weight_deltas are computed by multiplying the input node value with the output node delta for each weight. In this case, the weight_deltas share the same input node and have unique output nodes (deltas). Note also that you can reuse the ele_mul function.
e   LEARN: Updating the weights

(Diagram: the network after the update, with weights .29, .26, and .87 on hurt?, win?, and sad?.)

input = wlrec[0]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input,weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]

weight_deltas = scalar_ele_mul(input,delta)

alpha = 0.1
for i in range(len(weights)):
    weights[i] -= (weight_deltas[i] * alpha)

print("Weights:" + str(weights))
print("Weight Deltas:" + str(weight_deltas))
Gradient descent with multiple inputs and outputs
Gradient descent generalizes to arbitrarily large networks.
b   An empty network with multiple inputs and outputs

(Diagram: inputs # toes, %win, and # fans enter here; a 3 x 3 grid of weights connects them to the predictions hurt?, win?, and sad?, which come out the other side.)

weights = [ [0.1, 0.1, -0.3], # hurt?
            [0.1, 0.2,  0.0], # win?
            [0.0, 1.3,  0.1] ]# sad?

def vect_mat_mul(vect,matrix):
    assert(len(vect) == len(matrix))
    output = [0,0,0]
    for i in range(len(vect)):
        output[i] = w_sum(vect,matrix[i])   # w_sum as defined earlier
    return output

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred
c   PREDICT: Making a prediction and calculating error and delta

(Diagram: inputs 8.5, 65%, and 1.2 produce predictions .555, .98, and .965 for hurt?, win?, and sad?; deltas .455, -.02, and .865; errors .207, .000, and .748.)

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

hurt = [0.1, 0.0, 0.0, 0.1]
win  = [  1,   1,   0,   1]
sad  = [0.1, 0.0, 0.1, 0.2]

alpha = 0.01

input = [toes[0],wlrec[0],nfans[0]]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input,weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]
d. COMPARE: Calculating each weight_delta and putting it on each weight

def outer_prod(vec_a, vec_b):
    out = zeros_matrix(len(vec_a),len(vec_b))
    for i in range(len(vec_a)):
        for j in range(len(vec_b)):
            out[i][j] = vec_a[i]*vec_b[j]
    return out

input = [toes[0],wlrec[0],nfans[0]]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input,weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]

weight_deltas = outer_prod(input,delta)

(Figure: weight_deltas are shown for only one input — % win, 0.65 — to save space: [0.296, –0.013, 0.562].)
e. LEARN: Updating the weights

input = [toes[0],wlrec[0],nfans[0]]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input,weights)

error = [0, 0, 0]
delta = [0, 0, 0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]

weight_deltas = outer_prod(input,delta)

for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * \
                         weight_deltas[i][j]
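These steps also assemble into one runnable sketch. Two things to flag as choices of this sketch rather than the book's listing: the w_sum and zeros_matrix helpers from earlier chapters are redefined here so the snippet is self-contained, and because weights[i][j] indexes output i and input j, the outer product is taken as outer_prod(delta, input) so that weight_deltas[i][j] lines up with weights[i][j]:

```python
# Runnable sketch of the multiple-inputs, multiple-outputs network.
# Helpers redefined for self-containment; outer product taken as
# delta x input so rows of weight_deltas follow the outputs.

def w_sum(a, b):
    assert(len(a) == len(b))
    return sum(a[i] * b[i] for i in range(len(a)))

def zeros_matrix(rows, cols):
    return [[0] * cols for _ in range(rows)]

def outer_prod(vec_a, vec_b):
    out = zeros_matrix(len(vec_a), len(vec_b))
    for i in range(len(vec_a)):
        for j in range(len(vec_b)):
            out[i][j] = vec_a[i] * vec_b[j]
    return out

def neural_network(input, weights):
    return [w_sum(input, weights[i]) for i in range(len(weights))]

           # toes %win #fans
weights = [[0.1, 0.1, -0.3],   # hurt?
           [0.1, 0.2,  0.0],   # win?
           [0.0, 1.3,  0.1]]   # sad?

toes  = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]
hurt  = [0.1, 0.0, 0.0, 0.1]
win   = [1, 1, 0, 1]
sad   = [0.1, 0.0, 0.1, 0.2]

alpha = 0.01
input = [toes[0], wlrec[0], nfans[0]]
true  = [hurt[0], win[0], sad[0]]

for iteration in range(100):
    pred = neural_network(input, weights)
    delta = [pred[i] - true[i] for i in range(len(true))]
    weight_deltas = outer_prod(delta, input)   # [output][input], like weights
    for i in range(len(weights)):
        for j in range(len(weights[0])):
            weights[i][j] -= alpha * weight_deltas[i][j]

error = sum((p - t) ** 2 for p, t in zip(neural_network(input, weights), true))
print("Error:", error)
```

Run on the first training example, the first prediction is [0.555, 0.98, 0.965], matching the figures above, and the error falls toward zero over the 100 iterations.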
What do these weights learn?
Each weight tries to reduce the error, but what do they learn
in aggregate?
Congratulations! This is the part of the book where we move on to the first real-world
dataset. As luck would have it, it’s one with historical significance.
It’s called the Modified National Institute of Standards and Technology (MNIST) dataset,
and it consists of digits that high school students and employees of the US Census Bureau
handwrote some years ago. The interesting bit is that these handwritten digits are black-and-white images of people’s handwriting. Accompanying each digit image is the actual number
they were writing (0–9). For the last few decades, people
have been using this dataset to train neural networks to read
human handwriting, and today, you’re going to do the same.
Each image is only 784 pixels (28 × 28). Given that you have
784 pixels as input and 10 possible labels as output, you
can imagine the shape of the neural network: each training
example contains 784 values (one for each pixel), so the
neural network must have 784 input values. Pretty simple,
eh? You adjust the number of input nodes to reflect how
many datapoints are in each training example. You want to
predict 10 probabilities: one for each digit. Given an input drawing, the neural network
will produce these 10 probabilities, telling you which digit is most likely to be what
was drawn.
How do you configure the neural network to produce 10 probabilities? In the previous
section, you saw a diagram for a neural network that could take multiple inputs at a time
and make multiple predictions based on that input. You should be able to modify this
network to have the correct number of inputs and outputs for the new MNIST task. You’ll
tweak it to have 784 inputs and 10 outputs.
In the MNISTPreprocessor notebook is a script to preprocess the MNIST dataset and
load the first 1,000 images and labels into two NumPy matrices called images and
labels. You may be wondering, “Images are two-dimensional. How do I load the
(28 × 28) pixels into a flat neural network?” For now, the answer is simple: flatten the
images into a vector of 1 × 784. You’ll take the first row of pixels and concatenate them
with the second row, and the third row, and so on, until you have one list of pixels per
image (784 pixels long).
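The flattening step is a one-liner with NumPy. The sketch below uses random stand-in arrays; the real images and labels come from the MNISTPreprocessor notebook:

```python
import numpy as np

# Flattening a 28 x 28 image into a 1 x 784 row: row 0, then row 1, and so on.
# Random stand-ins here; the real data comes from the MNISTPreprocessor notebook.

image = np.random.rand(28, 28)         # one stand-in "image"
flat = image.reshape(1, 28 * 28)       # concatenated rows -> 1 x 784

images = np.random.rand(1000, 28, 28)  # 1,000 stand-in images
inputs = images.reshape(len(images), 28 * 28)

print(flat.shape)     # (1, 784)
print(inputs.shape)   # (1000, 784)
```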
(Figure: the MNIST network — inputs pix[0] through pix[783], predictions 0? through 9?)
This diagram represents the new MNIST classification neural
network. It most closely resembles the network you trained
with multiple inputs and outputs earlier. The only difference
is the number of inputs and outputs, which has increased
substantially. This network has 784 inputs (one for each pixel
in a 28 × 28 image) and 10 outputs (one for each possible digit
in the image).
If this network could predict perfectly, it would take in an
image’s pixels (say, a 2, like the one in the next figure) and
predict a 1.0 in the correct output position (the third one) and
a 0 everywhere else. If it were able to do this correctly for all
the images in the dataset, it would have no error.
(Figure: the network's predictions for an image of a 2. The highest prediction, 0.98, sits at the 2? output — the network thinks this image is a 2. The other outputs show small errors; for example, the 9? output suggests the network thinks the image kind of looks like a 9, but only a bit.)
Over the course of training, the network will adjust the weights between the input and
prediction nodes so that error falls toward 0 in training. But what does this do? What does it
mean to modify a bunch of weights to learn a pattern in aggregate?
Visualizing weight values

An interesting and intuitive practice in neural network research (particularly for image classifiers) is to visualize the weights as if they were an image. If you look at the accompanying diagram, you'll see why.

Each output node has a weight coming from every pixel. For example, the 2? node has 784 input weights, each mapping the relationship between a pixel and the number 2.

What is this relationship? Well, if the weight is high, it means the model believes there's a high degree of correlation between that pixel and the number 2. If the weight is very low (negative), then the network believes there is a very low correlation (perhaps even negative correlation) between that pixel and the number 2.
(Figure: two copies of the MNIST network diagram — inputs pix[0] through pix[783], predictions 0? through 9? — with the weights feeding the 2? and 1? nodes rendered as 28 × 28 images.)
If you take the weights and print
them out into an image that’s the
same shape as the input dataset
images, you can see which pixels
have the highest correlation with
a particular output node. In our example, a very vague 2 and 1 appear in the two images,
which were created using the weights for 2 and 1, respectively. The bright areas are high
weights, and the dark areas are negative weights. The neutral color (red, if you’re reading this
in the eBook) represents 0 in the weight matrix. This illustrates that the network generally
knows the shape of a 2 and of a 1.
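A minimal sketch of the reshaping behind this trick, using random stand-in weights and assuming a (784, 10) weight matrix with one column per output node (the actual image display, for example with matplotlib's imshow, is omitted to keep the sketch self-contained):

```python
import numpy as np

# Reshape one output node's 784 weights back into a 28 x 28 grid.
# The weight matrix is a random stand-in for trained weights, and the
# (784, 10) column-per-output layout is an assumption of this sketch.

weights = np.random.randn(784, 10)

two_weights = weights[:, 2]            # every weight feeding the "2?" node
weight_image = two_weights.reshape(28, 28)

print(weight_image.shape)              # (28, 28)
```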
Why does it turn out this way? This takes us back to the lesson on dot products. Let’s have a
quick review.
Visualizing dot products (weighted sums)
Recall how dot products work. They take two vectors, multiply them together (elementwise),
and then sum over the output. Consider this example:
a = [ 0, 1, 0, 1]
b = [ 1, 0, 1, 0]
    [ 0, 0, 0, 0] -> 0 (score)
First you multiply each element in a and b by each other, in this case creating a vector of 0s.
The sum of this vector is also 0. Why? Because the vectors have nothing in common.
c = [ 0, 1, 1, 0]
d = [.5, 0,.5, 0]
    [ 0, 0,.5, 0] -> .5 (score)

b = [ 1, 0, 1, 0]
c = [ 0, 1, 1, 0]
    [ 0, 0, 1, 0] -> 1 (score)
But the dot products between c and d return higher scores, because there’s overlap in the
columns that have positive values. Performing dot products between two identical vectors
tends to result in higher scores, as well. The takeaway? A dot product is a loose measurement
of similarity between two vectors.
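You can verify these similarity scores directly, reusing the w_sum function from chapter 3 (redefined here so the snippet runs on its own):

```python
# The similarity intuition in code: overlapping vectors score higher.

def w_sum(a, b):
    assert(len(a) == len(b))
    return sum(a[i] * b[i] for i in range(len(a)))

a = [0, 1, 0, 1]
b = [1, 0, 1, 0]
c = [0, 1, 1, 0]
d = [.5, 0, .5, 0]

print(w_sum(a, b))   # 0   -> nothing in common
print(w_sum(b, c))   # 1   -> one overlapping column
print(w_sum(c, d))   # 0.5 -> partial overlap
print(w_sum(c, c))   # 2   -> identical vectors score highest here
```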
What does this mean for the weights and inputs? Well, if the weight vector is similar to
the input vector for 2, then it will output a high score because the two vectors are similar.
Inversely, if the weight vector is not similar to the input vector for 2, it will output a low
score. You can see this in action in the following figure. Why is the top score (0.98) higher
than the lower one (0.01)?
(Figure: weights (dot) inputs (equals) predictions — 0.98 in the top row, 0.01 in the bottom.)
Summary
Gradient descent is a general learning algorithm.
Perhaps the most important subtext of this chapter is that gradient descent is a very flexible
learning algorithm. If you combine weights in a way that allows you to calculate an error
function and a delta, gradient descent can show you how to move the weights to reduce the
error. We’ll spend the rest of this book exploring different types of weight combinations and
error functions for which gradient descent is useful. The next chapter is no exception.
building your first deep neural network:
introduction to backpropagation
In this chapter

• The streetlight problem
• Matrices and the matrix relationship
• Full, batch, and stochastic gradient descent
• Neural networks learn correlation
• Overfitting
• Creating your own correlation
• Backpropagation: long-distance error attribution
• Linear versus nonlinear
• The secret to sometimes correlation
• Your first deep network
• Backpropagation in code: bringing it all together
“O Deep Thought computer,” he said, “the task we have designed you to perform is this. We want you to tell us…” he paused, “The Answer.”

—Douglas Adams, The Hitchhiker’s Guide to the Galaxy
Chapter 6 | Building your first deep neural network
The streetlight problem
This toy problem considers how a network learns entire datasets.
Consider yourself approaching a street corner in a foreign country. As you approach, you
look up and realize that the streetlight is unfamiliar. How can you know when it’s safe to
cross the street?
You can know when it’s safe to cross the street by interpreting the streetlight. But in this
case, you don’t know how to interpret it. Which light combinations indicate when it’s time
to walk? Which indicate when it’s time to stop? To solve this problem, you might sit at the
street corner for a few minutes observing the correlation between each light combination
and whether people around you choose to walk or stop. You take a seat and record the
following pattern:
STOP
OK, nobody walked at the first light. At this point you’re thinking, “Wow, this pattern could
be anything. The left light or the right light could be correlated with stopping, or the central
light could be correlated with walking.” There’s no way to know. Let’s take another datapoint:
WALK
People walked, so something about this light changed the signal. The only thing you know
for sure is that the far-right light doesn’t seem to indicate one way or another. Perhaps it’s
irrelevant. Let’s collect another datapoint:
STOP
Now you’re getting somewhere. Only the middle light changed this time, and you got the
opposite pattern. The working hypothesis is that the middle light indicates when people feel
safe to walk. Over the next few minutes, you record the following six light patterns, noting
when people walk or stop. Do you notice a pattern overall?
STOP
WALK
STOP
WALK
WALK
STOP
As hypothesized, there is a perfect correlation between the middle (crisscross) light and
whether it’s safe to walk. You learned this pattern by observing all the individual datapoints
and searching for correlation. This is what you’re going to train a neural network to do.
Preparing the data
Neural networks don’t read streetlights.
In the previous chapters, you learned about supervised algorithms. You learned that they
can take one dataset and turn it into another. More important, they can take a dataset of
what you know and turn it into a dataset of what you want to know.
How do you train a supervised neural network? You present it with two datasets and ask it
to learn how to transform one into the other. Think back to the streetlight problem. Can you
identify two datasets? Which one do you always know? Which one do you want to know?
You do indeed have two datasets. On the one hand, you have six streetlight states. On the
other hand, you have six observations of whether people walked. These are the two datasets.
You can train the neural network to convert from the dataset you know to the dataset that
you want to know. In this particular real-world example, you know the state of the streetlight
at any given time, and you want to know whether it’s safe to cross the street.
What you know
What you want
to know
STOP
WALK
STOP
WALK
WALK
STOP
To prepare this data for the neural network, you need to first split it into these two groups
(what you know and what you want to know). Note that you could attempt to go backward
if you swapped which dataset was in which group. For some problems, this works.
Matrices and the matrix relationship
Translate the streetlight into math.
Math doesn’t understand streetlights. As mentioned in the previous section, you want to
teach a neural network to translate a streetlight pattern into the correct stop/walk pattern.
The operative word here is pattern. What you really want to do is mimic the pattern of the
streetlight in the form of numbers. Let me show you what I mean.
Streetlights (images)    Streetlight pattern

                         1  0  1
                         0  1  1
                         0  0  1
                         1  1  1
                         0  1  1
                         1  0  1
Notice that the pattern of numbers shown here mimics the pattern from the streetlights in
the form of 1s and 0s. Each light gets a column (three columns total, because there are three
lights). Notice also that there are six rows representing the six different observed streetlights.
This structure of 1s and 0s is called a matrix. This relationship between the rows and
columns is common in matrices, especially matrices of data (like the streetlights).
In data matrices, it’s convention to give each recorded example a single row. It’s also
convention to give each thing being recorded a single column. This makes the matrix easy
to read.
So, a column contains every state in which a thing was recorded. In this case, a column
contains every on/off state recorded for a particular light. Each row contains the
simultaneous state of every light at a particular moment in time. Again, this is common.
Good data matrices perfectly mimic the outside world.
The data matrix doesn’t have to be all 1s and 0s. What if the streetlights were on dimmers
and turned on and off at varying degrees of intensity? Perhaps the streetlight matrix would
look more like this:
Streetlights (images)    Streetlight matrix A

                         .9  .0  1
                         .2  .8  1
                         .1  .0  1
                         .8  .9  1
                         .1  .7  1
                         .9  .1  0
Matrix A is perfectly valid. It’s mimicking the patterns that exist in the real world
(streetlight), so you can ask the computer to interpret them. Would the following matrix
still be valid?
Streetlights (images)    Streetlight matrix B

                          9   0  10
                          2   8  10
                          1   0  10
                          8   9  10
                          1   7  10
                          9   1   0
Matrix B is valid, too. It adequately captures the relationships between the various training examples (rows) and lights (columns). Note that Matrix A * 10 == Matrix B: these matrices are scalar multiples of each other.
Matrices A and B both contain the same underlying pattern.
The important takeaway is that an infinite number of matrices exist that perfectly reflect the
streetlight patterns in the dataset. Even the one shown next is perfect.
Streetlights (images)    Streetlight matrix C

                         18   0  20
                          4  16  20
                          2   0  20
                         16  18  20
                          2  14  20
                         18   2   0
It’s important to recognize that the underlying pattern isn’t the same as the matrix. It’s a
property of the matrix. In fact, it’s a property of all three of these matrices (A, B, and C).
The pattern is what each of these matrices is expressing. The pattern also existed in the
streetlights.
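You can confirm the scalar-multiple relationship between the three matrices with NumPy:

```python
import numpy as np

# Matrices A, B, and C from above differ only by a scalar factor:
# the same underlying pattern at three different scales.

A = np.array([[.9, .0, 1], [.2, .8, 1], [.1, .0, 1],
              [.8, .9, 1], [.1, .7, 1], [.9, .1, 0]])
B = np.array([[9, 0, 10], [2, 8, 10], [1, 0, 10],
              [8, 9, 10], [1, 7, 10], [9, 1, 0]])
C = np.array([[18, 0, 20], [4, 16, 20], [2, 0, 20],
              [16, 18, 20], [2, 14, 20], [18, 2, 0]])

print(np.allclose(A * 10, B))   # True
print(np.allclose(A * 20, C))   # True
```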
This input data pattern is what you want the neural network
to learn to transform into the output data pattern. But in order
to learn the output data pattern, you also need to capture the
pattern in the form of a matrix, as shown here.
Note that you could reverse the 1s and 0s, and the output matrix
would still capture the underlying STOP/WALK pattern that’s
present in the data. You know this because regardless of whether
you assign a 1 to WALK or to STOP, you can still decode the 1s
and 0s into the underlying STOP/WALK pattern.
The resulting matrix is called a lossless representation because
you can perfectly convert back and forth between your stop/
walk notes and the matrix.
STOP  →  0
WALK  →  1
STOP  →  0
WALK  →  1
WALK  →  1
STOP  →  0
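The round trip is easy to see in code. The encode/decode mapping below is one hypothetical choice; as noted above, flipping the 1s and 0s would be just as lossless:

```python
# Encoding the stop/walk notes and decoding them back: nothing is lost,
# which is what makes this a lossless representation.

notes = ["STOP", "WALK", "STOP", "WALK", "WALK", "STOP"]

encode = {"STOP": 0, "WALK": 1}
decode = {0: "STOP", 1: "WALK"}

walk_vs_stop = [encode[n] for n in notes]
recovered = [decode[v] for v in walk_vs_stop]

print(walk_vs_stop)         # [0, 1, 0, 1, 1, 0]
print(recovered == notes)   # True
```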
Creating a matrix or two in Python
Import the matrices into Python.
You’ve converted the streetlight pattern into a matrix (one with just 1s and 0s). Now let’s
create that matrix (and, more important, its underlying pattern) in Python so the neural
network can read it. Python’s NumPy library (introduced in chapter 3) was built just for
handling matrices. Let’s see it in action:
import numpy as np
streetlights = np.array( [ [ 1, 0, 1 ],
[ 0, 1, 1 ],
[ 0, 0, 1 ],
[ 1, 1, 1 ],
[ 0, 1, 1 ],
[ 1, 0, 1 ] ] )
If you’re a regular Python user, something should be striking in this code. A matrix is just
a list of lists. It’s an array of arrays. What is NumPy? NumPy is really just a fancy wrapper
for an array of arrays that provides special, matrix-oriented functions. Let’s create a NumPy
matrix for the output data, too:
walk_vs_stop = np.array( [ [ 0 ],
                           [ 1 ],
                           [ 0 ],
                           [ 1 ],
                           [ 1 ],
                           [ 0 ] ] )
What do you want the neural network to do? Take the streetlights matrix and learn to
transform it into the walk_vs_stop matrix. More important, you want the neural network
to take any matrix containing the same underlying pattern as streetlights and transform it
into a matrix that contains the underlying pattern of walk_vs_stop. More on that later. Let’s
start by trying to transform streetlights into walk_vs_stop using a neural network.
(Figure: streetlights → neural network → walk_vs_stop)
Building a neural network
107
Building a neural network
You’ve been learning about neural networks for several chapters now. You have a new
dataset, and you’re going to create a neural network to solve it. Following is some example
code to learn the first streetlight pattern. This should look familiar:
import numpy as np

weights = np.array([0.5,0.48,-0.7])
alpha = 0.1

streetlights = np.array( [ [ 1, 0, 1 ],
                           [ 0, 1, 1 ],
                           [ 0, 0, 1 ],
                           [ 1, 1, 1 ],
                           [ 0, 1, 1 ],
                           [ 1, 0, 1 ] ] )

walk_vs_stop = np.array( [ 0, 1, 0, 1, 1, 0 ] )

input = streetlights[0]             # [1,0,1]
goal_prediction = walk_vs_stop[0]   # equals 0 (stop)

for iteration in range(20):
    prediction = input.dot(weights)
    error = (goal_prediction - prediction) ** 2
    delta = prediction - goal_prediction
    weights = weights - (alpha * (input * delta))

    print("Error:" + str(error) + " Prediction:" + str(prediction))
This code example may bring back several nuances you learned in chapter 3. First, the
use of the dot function was a way to perform a dot product (weighted sum) between
two vectors. But not included in chapter 3 was the way NumPy matrices can perform
elementwise addition and multiplication:
import numpy as np

a = np.array([0,1,2,1])
b = np.array([2,2,2,3])

print(a * b)     # elementwise multiplication
print(a + b)     # elementwise addition
print(a * 0.5)   # vector-scalar multiplication
print(a + 0.5)   # vector-scalar addition
NumPy makes these operations easy. When you put a + between two vectors, it does what
you expect: it adds the two vectors together. Other than these nice NumPy operators and
the new dataset, the neural network shown here is the same as the ones built previously.
Learning the whole dataset
The neural network has been learning only one streetlight. Don’t
we want it to learn them all?
So far in this book, you’ve trained neural networks that learned how to model a single
training example (input -> goal_pred pair). But now you’re trying to build a neural
network that tells you whether it’s safe to cross the street. You need it to know more than one
streetlight. How do you do this? You train it on all the streetlights at once:
import numpy as np

weights = np.array([0.5,0.48,-0.7])
alpha = 0.1

streetlights = np.array( [ [ 1, 0, 1 ],
                           [ 0, 1, 1 ],
                           [ 0, 0, 1 ],
                           [ 1, 1, 1 ],
                           [ 0, 1, 1 ],
                           [ 1, 0, 1 ] ] )

walk_vs_stop = np.array( [ 0, 1, 0, 1, 1, 0 ] )

input = streetlights[0]             # [1,0,1]
goal_prediction = walk_vs_stop[0]   # equals 0 (stop)

for iteration in range(40):
    error_for_all_lights = 0
    for row_index in range(len(walk_vs_stop)):
        input = streetlights[row_index]
        goal_prediction = walk_vs_stop[row_index]

        prediction = input.dot(weights)

        error = (goal_prediction - prediction) ** 2
        error_for_all_lights += error

        delta = prediction - goal_prediction
        weights = weights - (alpha * (input * delta))
        print("Prediction:" + str(prediction))
    print("Error:" + str(error_for_all_lights) + "\n")
Error:2.6561231104
Error:0.962870177672
...
Error:0.000614343567483
Error:0.000533736773285
Full, batch, and stochastic gradient descent
Stochastic gradient descent updates weights one example
at a time.
As it turns out, this idea of learning one example at a time is a variant on gradient descent
called stochastic gradient descent, and it’s one of the handful of methods that can be used to
learn an entire dataset.
How does stochastic gradient descent work? As you saw in the previous example, it
performs a prediction and weight update for each training example separately. In other
words, it takes the first streetlight, tries to predict it, calculates the weight_delta, and
updates the weights. Then it moves on to the second streetlight, and so on. It iterates
through the entire dataset many times until it can find a weight configuration that works
well for all the training examples.
(Full) gradient descent updates weights one dataset at a time.
As introduced in chapter 4, another method for learning an entire dataset is gradient
descent (or average/full gradient descent). Instead of updating the weights once for each
training example, the network calculates the average weight_delta over the entire dataset,
changing the weights only each time it computes a full average.
Batch gradient descent updates weights after n examples.
This will be covered in more detail later, but there’s also a third configuration that sort
of splits the difference between stochastic gradient descent and full gradient descent.
Instead of updating the weights after just one example or after the entire dataset of
examples, you choose a batch size (typically between 8 and 256) of examples, after
which the weights are updated.
We’ll discuss this more later in the book, but for now, recognize that the previous
example created a neural network that can learn the entire streetlights dataset by
training on each example, one at a time.
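As a preview, here's a minimal sketch of batch gradient descent on the streetlight data. The batch size of 3 is purely illustrative (the typical 8–256 mentioned above doesn't fit a six-example dataset), and averaging the weight_deltas over the batch before making a single update is the key difference from the stochastic version:

```python
import numpy as np

# Batch gradient descent sketch: accumulate the weight_deltas over a
# small batch of examples, average them, then update the weights once.
# Batch size 3 is illustrative only; real batches are usually larger.

streetlights = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1],
                         [1, 1, 1], [0, 1, 1], [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1
batch_size = 3

for iteration in range(1000):
    for start in range(0, len(streetlights), batch_size):
        batch_in = streetlights[start:start + batch_size]
        batch_goal = walk_vs_stop[start:start + batch_size]

        deltas = batch_in.dot(weights) - batch_goal
        # average weight_delta over the batch, then a single update
        weight_deltas = batch_in.T.dot(deltas) / len(batch_in)
        weights = weights - alpha * weight_deltas

error = np.sum((streetlights.dot(weights) - walk_vs_stop) ** 2)
print("Error:", error)
```

The final weights land near [0, 1, 0], the same solution the stochastic version finds: all the predictive power goes to the middle light.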
Neural networks learn correlation
What did the last neural network learn?
You just got done training a single-layer neural network to take a streetlight pattern and
identify whether it was safe to cross the street. Let’s take on the neural network’s perspective
for a moment. The neural network doesn’t know that it was processing streetlight data. All it
was trying to do was identify which input (of the three possible) correlated with the output.
It correctly identified the middle light by analyzing the final weight positions of the network.
(Figure: the trained network — the weights are .01, 1.0, and –.0, feeding the walk/stop output node.)
Notice that the middle weight is very near 1, whereas the far-left and far-right weights are
very near 0. At a high level, all the iterative, complex processes for learning accomplished
something rather simple: the network identified correlation between the middle input and
output. The correlation is located wherever the weights were set to high numbers. Inversely,
randomness with respect to the output was found at the far-left and far-right weights (where
the weight values are very near 0).
How did the network identify correlation? Well, in the process of gradient descent, each training example exerts either up pressure or down pressure on the weights. On average, there was more up pressure for the middle weight and more down pressure for the other weights. Where does the pressure come from? Why is it different for different weights?
Up and down pressure
It comes from the data.
Each node is individually trying to correctly predict the output given the input. For the
most part, each node ignores all the other nodes when attempting to do so. The only cross
communication occurs in that all three weights must share the same error measure. The
weight update is nothing more than taking this shared error measure and multiplying it by
each respective input.
Why do you do this? A key part of why neural networks learn is error attribution, which
means given a shared error, the network needs to figure out which weights contributed (so
they can be adjusted) and which weights did not contribute (so they can be left alone).
Training data         Weight pressure

1  0  1  →  0         –  0  –
0  1  1  →  1         0  +  +
0  0  1  →  0         0  0  –
1  1  1  →  1         +  +  +
0  1  1  →  1         0  +  +
1  0  1  →  0         –  0  –
Consider the first training example. Because the middle input is 0, the middle weight
is completely irrelevant for this prediction. No matter what the weight is, it’s going to be
multiplied by 0 (the input). Thus, any error at that training example (regardless of whether it's too high or too low) can be attributed to only the far-left and far-right weights.
Consider the pressure of this first training example. If the network should predict 0, and two
inputs are 1s, then this will cause error, which drives the weight values toward 0.
The Weight Pressure table helps describe the effect of each training example on each
respective weight. + indicates that it has pressure toward 1, and – indicates that it has
pressure toward 0. Zeros (0) indicate that there is no pressure because the input datapoint
is 0, so that weight won’t be changed. Notice that the far-left weight has two negatives and
one positive, so on average the weight will move toward 0. The middle weight has three
positives, so on average the weight will move toward 1.
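The Weight Pressure table can be generated mechanically: an input of 0 exerts no pressure on its weight, and a nonzero input pushes its weight toward the example's goal:

```python
# Generating the Weight Pressure table: "0" where the input is 0,
# "+" (pressure toward 1) when the goal is 1, "-" (pressure toward 0)
# when the goal is 0.

streetlights = [[1, 0, 1], [0, 1, 1], [0, 0, 1],
                [1, 1, 1], [0, 1, 1], [1, 0, 1]]
walk_vs_stop = [0, 1, 0, 1, 1, 0]

pressures = []
for lights, goal in zip(streetlights, walk_vs_stop):
    row = ["0" if x == 0 else ("+" if goal == 1 else "-") for x in lights]
    pressures.append(row)
    print(lights, "->", goal, "  pressure:", row)
```

The printed rows reproduce the table above: two negatives and one positive for the far-left weight, three positives for the middle weight.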
Training data         Weight pressure

1  0  1  →  0         –  0  –
0  1  1  →  1         0  +  +
0  0  1  →  0         0  0  –
1  1  1  →  1         +  +  +
0  1  1  →  1         0  +  +
1  0  1  →  0         –  0  –
Each individual weight is attempting to compensate for error. In the first training example, there's decorrelation between the far-right and far-left inputs and the desired output. This causes those weights to experience down pressure.
This same phenomenon occurs throughout all six training examples, rewarding correlation
with pressure toward 1 and penalizing decorrelation with pressure toward 0. On average,
this causes the network to find the correlation present between the middle weight and the
output to be the dominant predictive force (heaviest weight in the weighted average of the
input), making the network quite accurate.
Bottom line

The prediction is a weighted sum of the inputs. The learning algorithm rewards inputs that correlate with the output with upward pressure (toward 1) on their weight, while penalizing inputs that decorrelate with downward pressure. The weighted sum of the inputs finds perfect correlation between the input and the output by weighting decorrelated inputs to 0.
The mathematician in you may be cringing a little. Upward pressure and downward pressure
are hardly precise mathematical expressions, and they have plenty of edge cases where
this logic doesn’t hold (which we’ll address in a second). But you’ll later find that this is an
extremely valuable approximation, allowing you to temporarily overlook all the complexity
of gradient descent and just remember that learning rewards correlation with larger weights
(or more generally, learning finds correlation between the two datasets).
Edge case: Overfitting
Sometimes correlation happens accidentally.
Consider again the first example in the training data. What if the far-left weight was 0.5 and
the far-right weight was –0.5? Their prediction would equal 0. The network would predict
perfectly. But it hasn’t remotely learned how to safely predict streetlights (those weights
would fail in the real world). This phenomenon is known as overfitting.
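Here's that accidental-correlation example in code, checking both the first example's error and the total error over all six streetlights:

```python
# Weights [0.5, w, -0.5] predict the first example [1,0,1] -> 0 perfectly
# (the middle weight is irrelevant there, since its input is 0), yet fail
# badly on the rest of the dataset.

def w_sum(a, b):
    assert(len(a) == len(b))
    return sum(a[i] * b[i] for i in range(len(a)))

weights = [0.5, 0.0, -0.5]

streetlights = [[1, 0, 1], [0, 1, 1], [0, 0, 1],
                [1, 1, 1], [0, 1, 1], [1, 0, 1]]
walk_vs_stop = [0, 1, 0, 1, 1, 0]

first_error = (w_sum(streetlights[0], weights) - walk_vs_stop[0]) ** 2
total_error = sum((w_sum(l, weights) - g) ** 2
                  for l, g in zip(streetlights, walk_vs_stop))

print(first_error)   # 0.0 -> "perfect" on the first example
print(total_error)   # 5.75 -> badly wrong over the whole dataset
```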
Deep learning’s greatest weakness: Overfitting
Error is shared among all the weights. If a particular configuration of weights accidentally
creates perfect correlation between the prediction and the output dataset (such that
error == 0) without giving the heaviest weight to the best inputs, the neural network
will stop learning.
If it wasn’t for the other training examples, this fatal flaw would cripple the neural network.
What do the other training examples do? Well, let’s look at the second training example. It
bumps the far-right weight upward while not changing the far-left weight. This throws off
the equilibrium that stopped the learning in the first example. As long as you don’t train
exclusively on the first example, the rest of the training examples will help the network avoid
getting stuck in these edge-case configurations that exist for any one training example.
This is very important. Neural networks are so flexible that they can find many, many
different weight configurations that will correctly predict for a subset of training data. If
you trained this neural network on the first two training examples, it would likely stop
learning at a point where it did not work well for the other training examples. In essence, it
memorized the two training examples instead of finding the correlation that will generalize
to any possible streetlight configuration.
If you train on only two streetlights and the network finds just these edge-case
configurations, it could fail to tell you whether it’s safe to cross the street when it sees a
streetlight that wasn’t in the training data.
Key takeaway
The greatest challenge you’ll face with deep learning is convincing your neural network to
generalize instead of just memorize. You’ll see this again.
Edge case: Conflicting pressure
Sometimes correlation fights itself.
Consider the far-right column in the following Weight Pressure table. What do you see?
This column seems to have an equal number of upward and downward pressure moments.
But the network correctly pushes this (far-right) weight down to 0, which means the
downward pressure moments must be larger than the upward ones. How does this work?
Training data         Weight pressure

1  0  1  →  0         –  0  –
0  1  1  →  1         0  +  +
0  0  1  →  0         0  0  –
1  1  1  →  1         +  +  +
0  1  1  →  1         0  +  +
1  0  1  →  0         –  0  –
The left and middle weights have enough signal to converge on their own. The left weight
falls to 0, and the middle weight moves toward 1. As the middle weight moves higher and
higher, the error for positive examples continues to decrease. But as they approach their
optimal positions, the decorrelation on the far-right weight becomes more apparent.
Let’s consider the extreme example, where the left and middle weights are perfectly set to
0 and 1, respectively. What happens to the network? If the right weight is above 0, then the
network predicts too high; and if the right weight is beneath 0, the network predicts too low.
As other nodes learn, they absorb some of the error; they absorb part of the correlation. They cause the network to predict with moderate correlative power, which reduces the error. The remaining weights then adjust only to correctly predict what's left.

In this case, because the middle weight has consistent signal to absorb all the correlation (owing to the 1:1 relationship between the middle input and the output), the error when the network needs to predict 1 becomes very small, but the error when it needs to predict 0 becomes large, pushing the far-right weight downward.
It doesn’t always work out like this.
In some ways, you kind of got lucky. If the middle node hadn’t been so perfectly correlated,
the network might have struggled to silence the far-right weight. Later you’ll learn about
regularization, which forces weights with conflicting pressure to move toward 0.
As a preview, regularization is advantageous because if a weight has equal pressure upward
and downward, it isn't good for anything. It's not helping either direction. In essence,
regularization aims to say that only weights with really strong correlation can stay on;
everything else should be silenced because it's contributing noise. It's sort of like natural
selection. As a side effect, regularization would also cause the neural network to train faster
(in fewer iterations), because a weight caught between positive and negative pressure, like
the far-right one, would be silenced right away instead of wobbling.
In this case, because the far-right node isn't definitively correlative, a regularized network
would immediately start driving its weight toward 0. Without regularization (as you trained it
before), the network doesn't learn that the far-right input is useless until after the left and
middle weights start to figure out their patterns. More on this later.
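As a rough preview (the book introduces regularization properly later), one common form is L2 weight decay: shrink every weight a little on each update. In the hypothetical sketch below, the pressure sequence and the decay strength lam are made-up numbers; the point is that a weight with perfectly balanced pressure survives plain updates but is driven toward 0 once decay is added:

```python
# Hypothetical, perfectly balanced gradient pressures on one weight
# (three pushes up, three pushes down, like the far-right column).
pressures = [-0.3, 0.3, -0.3, 0.3, 0.3, -0.3]

w_plain, w_decay = 0.8, 0.8      # same arbitrary starting value
alpha, lam = 0.1, 0.5            # lam is a made-up decay strength

for _ in range(50):
    for g in pressures:
        w_plain -= alpha * g                    # balanced pushes cancel out
        w_decay -= alpha * (g + lam * w_decay)  # the decay term keeps shrinking w

print(round(w_plain, 3), round(w_decay, 3))  # w_plain stays at 0.8; w_decay ends near 0
```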
If networks look for correlation between an input column of data and the output column,
what would the neural network do with the following dataset?
Training data            Weight pressure
1  0  1  ->  1           +  0  +  ->  1
0  1  1  ->  1           0  +  +  ->  1
0  0  1  ->  0           0  0  -  ->  0
1  1  1  ->  0           -  -  -  ->  0
There is no correlation between any input column and the output column. Every weight has
an equal amount of upward pressure and downward pressure. This dataset is a real problem
for the neural network.
Previously, you could solve for input datapoints that had both upward and downward
pressure because other nodes would start solving for either the positive or negative
predictions, drawing the balanced node to favor up or down. But in this case, all the inputs
are equally balanced between positive and negative pressure. What do you do?
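You can watch a two-layer network get stuck on this dataset. The sketch below (randomly initialized; not one of the book's listings) runs plain gradient descent on the four rows above. The squared error stops falling well short of 0, because no single weight setting fits all four rows:

```python
import numpy as np

np.random.seed(1)

# The new dataset: no input column correlates with the output.
streetlights = np.array([[1.0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1]])
walk_vs_stop = np.array([[1.0, 1, 0, 0]]).T

weights = 2 * np.random.random((3, 1)) - 1
alpha = 0.1

for iteration in range(200):
    error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        delta = layer_0.dot(weights) - walk_vs_stop[i:i+1]
        error += np.sum(delta ** 2)
        weights -= alpha * layer_0.T.dot(delta)

print(round(error, 3))  # stalls near 1.0 instead of approaching 0
```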
Learning indirect correlation
If your data doesn’t have correlation, create intermediate data
that does!
Previously, I described a neural network as an instrument that searches for correlation
between input and output datasets. I want to refine this just a touch. In reality, neural
networks search for correlation between their input and output layers.
You set the values of the input layer to be individual rows of the input data, and you try
to train the network so that the output layer equals the output dataset. Oddly enough, the
neural network doesn’t know about data. It just searches for correlation between the input
and output layers.
(Figure: an input layer with example values feeding directly into the walk/stop output node.)
Unfortunately, this is a new streetlights dataset that has no correlation between the input
and output. The solution is simple: use two of these networks. The first one will create an
intermediate dataset that has limited correlation with the output, and the second will use
that limited correlation to correctly predict the output.
Because the input dataset doesn’t correlate with the output dataset, you’ll use the input
dataset to create an intermediate dataset that does have correlation with the output. It’s kind
of like cheating.
Creating correlation
Here’s a picture of the new neural network. You basically stack two neural networks on top
of each other. The middle layer of nodes (layer_1) represents the intermediate dataset. The
goal is to train this network so that even though there’s no correlation between the input
dataset and output dataset (layer_0 and layer_2), the layer_1 dataset that you create
using layer_0 will have correlation with layer_2.
(Figure: layer_0 connects through weights_0_1 to layer_1, the intermediate data, which connects through weights_1_2 to the walk/stop output, layer_2.)
Note: this network is still just a function. It has a bunch of weights that are collected together
in a particular way. Furthermore, gradient descent still works because you can calculate how
much each weight contributes to the error and adjust it to reduce the error to 0. And that’s
exactly what you’re going to do.
Stacking neural networks: A review
Chapter 3 briefly mentioned stacked neural networks.
Let’s review.
When you look at the following architecture, the prediction occurs exactly as you might
expect when I say, “Stack neural networks.” The output of the first lower network (layer_0
to layer_1) is the input to the second upper neural network (layer_1 to layer_2). The
prediction for each of these networks is identical to what you saw before.
(Figure: the stacked network; layer_0 connects through weights_0_1 to layer_1, which connects through weights_1_2 to layer_2, the walk/stop prediction.)
As you start to think about how this neural network learns, you already know a great deal.
If you ignore the lower weights and consider their output to be the training set, the top
half of the neural network (layer_1 to layer_2) is just like the networks trained in the
preceding chapter. You can use all the same learning logic to help them learn.
The part that you don’t yet understand is how to update the weights between layer_0
and layer_1. What do they use as their error measure? As you may remember from
chapter 5, the cached/normalized error measure is called delta. In this case, you want
to figure out how to know the delta values at layer_1 so they can help layer_2 make
accurate predictions.
Backpropagation: Long-distance error attribution
The weighted average error
What’s the prediction from layer_1 to layer_2? It’s a weighted average of the values at
layer_1. If layer_2 is too high by x amount, how do you know which values at layer_1
contributed to the error? The ones with higher weights (weights_1_2) contributed more.
The ones with lower weights from layer_1 to layer_2 contributed less.
Consider the extreme. Let’s say the far-left weight from layer_1 to layer_2 was zero. How
much did that node at layer_1 cause the network’s error? Zero.
It’s so simple it’s almost hilarious. The weights from layer_1 to layer_2 exactly describe how
much each layer_1 node contributes to the layer_2 prediction. This means those weights
also exactly describe how much each layer_1 node contributes to the layer_2 error.
How do you use the delta at layer_2 to figure out the delta at layer_1? You multiply it
by each of the respective weights for layer_1. It’s like the prediction logic in reverse. This
process of moving delta signal around is called backpropagation.
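With the made-up weight values from the figure, this reverse step is a single matrix multiplication (a sketch, not one of the book's listings):

```python
import numpy as np

layer_2_delta = np.array([[0.25]])  # goal_prediction - prediction, from the figure

# The made-up weights_1_2 values from the figure (four hidden nodes -> one output).
weights_1_2 = np.array([[0.0], [-1.0], [0.5], [1.0]])

# Each layer_1 node's delta is the layer_2 delta scaled by its connecting weight.
layer_1_delta = layer_2_delta.dot(weights_1_2.T)

print(layer_1_delta)  # 0.0, -0.25, 0.125, 0.25 -- weighted versions of the layer_2 delta
```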
(Figure: the layer_2 delta of +0.25 (goal_prediction - prediction) passes backward through made-up weights_1_2 values of 0.0, -1.0, 0.5, and 1.0, producing layer_1 deltas of 0.0, -0.25, 0.125, and 0.25 — weighted versions of the layer_2 delta.)
Backpropagation: Why does this work?
The weighted average delta
In the neural network from chapter 5, the delta variable told you the direction and
amount the value of this node should change next time. All backpropagation lets you
do is say, “Hey, if you want this node to be x amount higher, then each of these previous
four nodes needs to be x*weights_1_2 amount higher/lower, because these weights were
amplifying the prediction by weights_1_2 times.”
When used in reverse, the weights_1_2 matrix amplifies the error by the appropriate
amount. It amplifies the error so you know how much each layer_1 node should move
up or down.
Once you know this, you can update each weight matrix as you did before. For each
weight, multiply its output delta by its input value, and adjust the weight by that much
(or you can scale it with alpha).
Linear vs. nonlinear
This is probably the hardest concept in the book.
Let’s take it slowly.
I’m going to show you a phenomenon. As it turns out, you need one more piece to make this
neural network train. Let’s take it from two perspectives. The first will show why the neural
network can’t train without it. In other words, first I’ll show you why the neural network
is currently broken. Then, once you add this piece, I’ll show you what it does to fix this
problem. For now, check out this simple algebra:
1 * 10 * 2 = 100
5 * 20 = 100
1 * 0.25 * 0.9 = 0.225
1 * 0.225 = 0.225
Here’s the takeaway: for any two multiplications, I can accomplish the same thing using a
single multiplication. As it turns out, this is bad. Check out the following:
(Two graphs, omitted: each plots training examples for inputs 1.0 and -1.0. The network with stacked weights 0.25 and 0.9 produces exactly the same outputs, 0.225 and -0.225, as the network with the single combined weight 0.225.)
These two graphs show two training examples
each, one where the input is 1.0 and another
where the input is –1.0. The bottom line: for
any three-layer network you create, there’s a
two-layer network that has identical behavior.
Stacking two neural nets (as you know them at
the moment) doesn’t give you any more power.
Two consecutive weighted sums is just a more
expensive version of one weighted sum.
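You can check this claim directly in NumPy (a sketch with random weights, not the book's code): without relu in between, the three-layer forward pass collapses into a single matrix multiplication.

```python
import numpy as np

np.random.seed(1)

layer_0 = np.array([[1.0, 0, 1]])
weights_0_1 = 2 * np.random.random((3, 4)) - 1
weights_1_2 = 2 * np.random.random((4, 1)) - 1

# Three-layer prediction with no nonlinearity in between...
three_layer = layer_0.dot(weights_0_1).dot(weights_1_2)

# ...equals a two-layer prediction through the product of the matrices.
collapsed = weights_0_1.dot(weights_1_2)  # a single (3, 1) weight matrix
two_layer = layer_0.dot(collapsed)

print(np.allclose(three_layer, two_layer))  # True
```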
Why the neural network still doesn’t work
If you trained the three-layer network as it is now,
it wouldn’t converge.
Problem: For any two consecutive weighted sums of the input, there exists a single
weighted sum with exactly identical behavior. Anything that the three-layer network can
do, the two-layer network can also do.
Let’s talk about the middle layer (layer_1) before it’s fixed. Right now, each node (out of the
four) has a weight coming to it from each of the inputs. Let’s think about this from a correlation
standpoint. Each node in the middle layer subscribes to a certain amount of correlation with
each input node. If the weight from an input to the middle layer is 1.0, then it subscribes to
exactly 100% of that node’s movement. If that node goes up by 0.3, the middle node will follow.
If the weight connecting two nodes is 0.5, each node in the middle layer subscribes to
exactly 50% of that node's movement.
The only way the middle node can escape the correlation of one particular input node
is if it subscribes to additional correlation from another input node. Nothing new is
being contributed to this neural network. Each hidden node subscribes to a little
correlation from the input nodes.
The middle nodes don't get to add anything to the conversation; they don't get to have
correlation of their own. They're more or less correlated to various input nodes.
(Figure: layer_0 connects through weights_0_1 to layer_1, which connects through weights_1_2 to layer_2.)
But you know that in the new dataset there is no correlation between any of the inputs and
the output, so how can the middle layer help? It mixes up a bunch of correlation that's
already useless. What you really need is for the middle layer to be able to selectively
correlate with the input.
You want the middle layer to sometimes correlate with an input, and sometimes not correlate.
That gives it correlation of its own. This gives the middle layer the opportunity to not just
always be x% correlated to one input and y% correlated to another input. Instead, it can be
x% correlated to one input only when it wants to be, but other times not be correlated at all.
This is called conditional correlation or sometimes correlation.
The secret to sometimes correlation
Turn off the node when the value would be below 0.
This might seem too simple to work, but consider this: if a node’s value dropped below 0,
normally the node would still have the same correlation to the input as always. It would just
happen to be negative in value. But if you turn off the node (setting it to 0) when it would be
negative, then it has zero correlation to any inputs whenever it’s negative.
What does this mean? The node can now selectively pick and choose when it wants to be
correlated to something. This allows it to say something like, “Make me perfectly correlated
to the left input, but only when the right input is turned off.” How can it do this? Well, if
the weight from the left input is 1.0 and the weight from the right input is a huge negative
number, then turning on both the left and right inputs will cause the node to be 0 all the
time. But if only the left input is on, the node will take on the value of the left input.
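Here's that scenario in code, using a weight of 1.0 from the left input and an arbitrarily large negative weight (chosen here as -100.0) from the right:

```python
import numpy as np

def relu(x):
    return (x > 0) * x

weights = np.array([1.0, -100.0])  # left weight, right weight

print(relu(np.dot([1, 0], weights)))  # left input only: the node copies it (1.0)
print(relu(np.dot([1, 1], weights)))  # both inputs on: relu forces the node to 0
print(relu(np.dot([0, 1], weights)))  # right input only: also forced to 0
```

The node is perfectly correlated with the left input, but only when the right input is off: exactly the conditional behavior described above.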
This wasn’t possible before. Earlier, the middle node was either always correlated to an input
or always not correlated. Now it can be conditional. Now it can speak for itself.
Solution: By turning off any middle node whenever it would be negative, you allow the
network to sometimes subscribe to correlation from various inputs. This is impossible for
two-layer neural networks, thus adding power to three-layer nets.
The fancy term for this “if the node would be negative, set it to 0” logic is nonlinearity.
Without this tweak, the neural network is linear. Without this technique, the output layer
only gets to pick from the same correlation it had in the two-layer network. It’s subscribing
to pieces of the input layer, which means it can’t solve the new streetlights dataset.
There are many kinds of nonlinearities. But the one discussed here is, in many cases, the best
one to use. It’s also the simplest. (It’s called relu.)
For what it’s worth, most other books and courses say that consecutive matrix multiplication
is a linear transformation. I find this unintuitive. It also makes it harder to understand
what nonlinearities contribute and why you choose one over the other (which we’ll get to
later). It says, “Without the nonlinearity, two matrix multiplications might as well be one.” My
explanation, although not the most concise answer, is an intuitive explanation of why you
need nonlinearities.
A quick break
That last part probably felt a little abstract, and that’s totally OK.
Here’s the deal. Previous chapters worked with simple algebra, so everything was ultimately
grounded in fundamentally simple tools. This chapter started building on the premises you
learned earlier. Previously, you learned lessons like this:
You can compute the relationship between the error and any one of the weights so that you
know how changing the weight changes the error. You can then use this to reduce the error
to 0.
That was a massive lesson. But now we’re moving past it. Because we already worked through
why that works, you can take the statement at face value. The next big lesson came at the
beginning of this chapter:
Adjusting the weights to reduce the error over a series of training examples ultimately
searches for correlation between the input and the output layers. If no correlation exists, then
the error will never reach 0.
This is an even bigger lesson. It largely means you can put the previous lesson out of
your mind for now. You don’t need it. Now you’re focused on correlation. The takeaway
is that you can’t constantly think about everything all at once. Take each lesson and let
yourself trust it. When it’s a more concise summarization (a higher abstraction) of more
granular lessons, you can set aside the granular and focus on understanding the higher
summarizations.
This is akin to a professional swimmer, biker, or similar athlete who requires a combined
fluid knowledge of a bunch of small lessons. A baseball player who swings a bat learned
thousands of little lessons to ultimately culminate in a great bat swing. But the player doesn’t
think of all of them when he goes to the plate. His actions are fluid—even subconscious. It’s
the same for studying these math concepts.
Neural networks look for correlation between input and output, and you no longer have to
worry about how that happens. You just know it does. Now we’re building on that idea. Let
yourself relax and trust the things you’ve already learned.
Your first deep neural network
Here’s how to make the prediction.
The following code initializes the weights and makes a forward propagation.

import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x        # this function sets all negative numbers to 0

alpha = 0.2
hidden_size = 4

streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ] ] )

walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1   # two sets of weights now connect
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1   # the three layers (randomly initialized)

layer_0 = streetlights[0]
layer_1 = relu(np.dot(layer_0,weights_0_1))   # the output of layer_1 is sent through relu,
layer_2 = np.dot(layer_1,weights_1_2)         # where negative values become 0; this is the
                                              # input for the next layer, layer_2
For each piece of the code, follow along with
the figure. Input data comes into layer_0.
Via the dot function, the signal travels
up the weights from layer_0 to layer_1
(performing a weighted sum at each of the
four layer_1 nodes). These weighted sums
at layer_1 are then passed through the
relu function, which converts all negative
numbers to 0. Then a final weighted sum is
performed into the final node, layer_2.
(Figure: layer_0 connects through weights_0_1 to layer_1, whose nodes are relu nodes, and layer_1 connects through weights_1_2 to layer_2, the walk/stop prediction.)
Backpropagation in code
You can learn the amount that each weight contributes
to the final error.
At the end of the previous chapter, I made an assertion that it would be important to
memorize the two-layer neural network code so you could quickly and easily recall it when I
reference more-advanced concepts. This is when that memorization matters.
The following listing is the new learning code, and it’s essential that you recognize and
understand the parts addressed in the previous chapters. If you get lost, go to chapter 5,
memorize the code, and then come back. It will make a big difference someday.
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x        # returns x if x > 0; returns 0 otherwise

def relu2deriv(output):
    return output > 0         # returns 1 for input > 0; returns 0 otherwise

alpha = 0.2
hidden_size = 4

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)

        layer_2_delta = (walk_vs_stop[i:i+1] - layer_2)
        # This line computes the delta at layer_1 given the delta at layer_2,
        # by taking the layer_2_delta and multiplying it by its connecting weights_1_2.
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(iteration % 10 == 9):
        print("Error:" + str(layer_2_error))
Believe it or not, the only truly new code is the relu and relu2deriv functions and the
layer_1_delta line. Everything else is fundamentally the same as in previous pages. The
relu2deriv function returns 1 when output > 0; otherwise, it returns 0. This is the slope
(the derivative) of the relu function. It serves an important purpose, as you'll see in
a moment.
Remember, the goal is error attribution. It’s about figuring out how much each weight
contributed to the final error. In the first (two-layer) neural network, you calculated a delta
variable, which told you how much higher or lower you wanted the output prediction to
be. Look at the code here. You compute the layer_2_delta in the same way. Nothing new.
(Again, go back to chapter 5 if you’ve forgotten how that part works.)
Now that you know how much the final prediction should move up or down (delta), you
need to figure out how much each middle (layer_1) node should move up or down. These
are effectively intermediate predictions. Once you have the delta at layer_1, you can use
the same processes as before for calculating a weight update (for each weight, multiply its
input value by its output delta and increase the weight value by that much).
How do you calculate the deltas for layer_1? First, do the obvious: multiply the output
delta by each weight attached to it. This gives a weighting of how much each weight
contributed to that error. There’s one more thing to factor in. If relu set the output to a
layer_1 node to be 0, then it didn’t contribute to the error. When this is true, you should
also set the delta of that node to 0. Multiplying each layer_1 node by the relu2deriv
function accomplishes this. relu2deriv is either 1 or 0, depending on whether the layer_1
value is greater than 0.
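Combining the two steps with made-up numbers (the figure's weights_1_2 plus a hypothetical layer_1 activation in which relu clipped the first and third nodes to 0):

```python
import numpy as np

def relu2deriv(output):
    return output > 0  # 1 where the node was active, 0 where relu clipped it

layer_1 = np.array([[0.0, 0.13, 0.0, 0.9]])  # hypothetical relu output; two nodes are off
layer_2_delta = np.array([[0.25]])
weights_1_2 = np.array([[0.0], [-1.0], [0.5], [1.0]])  # made-up values from the figure

layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
print(layer_1_delta)  # 0.0, -0.25, 0.0, 0.25 -- clipped nodes get a delta of 0
```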
One iteration of backpropagation
b  Initializing the network's weights and data

import numpy as np
np.random.seed(1)

def relu(x):
    return (x > 0) * x

def relu2deriv(output):
    return output > 0

lights = np.array( [[ 1, 0, 1 ],
                    [ 0, 1, 1 ],
                    [ 0, 0, 1 ],
                    [ 1, 1, 1 ] ] )

walk_stop = np.array([[ 1, 1, 0, 0]]).T

alpha = 0.2
hidden_size = 3
weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1

c  PREDICT + COMPARE: Making a prediction, and calculating the output error and delta

layer_0 = lights[0:1]
layer_1 = np.dot(layer_0,weights_0_1)
layer_1 = relu(layer_1)
layer_2 = np.dot(layer_1,weights_1_2)

error = (layer_2-walk_stop[0:1])**2
layer_2_delta = (layer_2-walk_stop[0:1])

d  LEARN: Backpropagating from layer_2 to layer_1

(The prediction and delta code from step c runs again, followed by:)

layer_1_delta = layer_2_delta.dot(weights_1_2.T)
layer_1_delta *= relu2deriv(layer_1)

e  LEARN: Generating weight_deltas, and updating weights

(After the code from steps c and d:)

weight_delta_1_2 = layer_1.T.dot(layer_2_delta)
weight_delta_0_1 = layer_0.T.dot(layer_1_delta)

weights_1_2 -= alpha * weight_delta_1_2
weights_0_1 -= alpha * weight_delta_0_1

(In the book, each step is paired with a diagram of the Inputs, Hiddens, and Prediction layers showing the current node values.)
As you can see, backpropagation is about calculating deltas for intermediate layers so you
can perform gradient descent. To do so, you take the weighted average delta on layer_2
for layer_1 (weighted by the weights in between them). You then turn off (set to 0) nodes
that weren’t participating in the forward prediction, because they couldn’t have contributed
to the error.
Putting it all together
Here’s the self-sufficient program you should be able to run
(runtime output follows).
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x        # returns x if x > 0; returns 0 otherwise

def relu2deriv(output):
    return output > 0         # returns 1 for input > 0; returns 0 otherwise

streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ] ] )

walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

alpha = 0.2
hidden_size = 4

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1

for iteration in range(60):
    layer_2_error = 0
    for i in range(len(streetlights)):
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)

        layer_2_delta = (layer_2 - walk_vs_stop[i:i+1])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)*relu2deriv(layer_1)

        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta)

    if(iteration % 10 == 9):
        print("Error:" + str(layer_2_error))
Error:0.634231159844
Error:0.358384076763
Error:0.0830183113303
Error:0.0064670549571
Error:0.000329266900075
Error:1.50556226651e-05
Why do deep networks matter?
What’s the point of creating “intermediate datasets” that
have correlation?
Consider the cat picture shown here. Consider further that I had a dataset of images with
cats and without cats (and I labeled them as such). If I wanted to train a neural network to
take the pixel values and predict whether there’s a cat in the picture, the two-layer network
might have a problem.
Just as in the last streetlight dataset, no individual pixel correlates with whether there’s a cat
in the picture. Only different configurations of pixels correlate with whether there’s a cat.
This is the essence of deep learning. Deep learning is all about creating intermediate layers
(datasets) wherein each node in an intermediate layer represents the presence or absence of
a different configuration of inputs.
This way, for the cat images dataset, no individual pixel has to correlate with whether there’s
a cat in the photo. Instead, the middle layer will attempt to identify different configurations
of pixels that may or may not correlate with a cat (such as an ear, or cat eyes, or cat hair).
The presence of many cat-like configurations will then give the final layer the information
(correlation) it needs to correctly predict the presence or absence of a cat.
Believe it or not, you can take the three-layer network and continue to stack more and
more layers. Some neural networks have hundreds of layers, each node playing its part in
detecting different configurations of input data. The rest of this book will be dedicated to
studying different phenomena within these layers in an effort to explore the full power of
deep neural networks.
Toward that end, I must issue the same challenge I did in chapter 5: memorize the previous
code. You’ll need to be very familiar with each of the operations in the code in order for the
following chapters to be readable. Don't progress past this point until you can build a three-layer neural network from memory!
Chapter 7
How to picture neural networks: in your head and on paper

In this chapter
• Correlation summarization
• Simplified visualization
• Seeing the network predict
• Visualizing using letters instead of pictures
• Linking variables
• The importance of visualization tools

Numbers have an important story to tell. They rely on you to give them a clear and convincing voice.
—Stephen Few, IT innovator, teacher, and consultant
It’s time to simplify
It’s impractical to think about everything all the time.
Mental tools can help.
Chapter 6 finished with a code example that was quite impressive. Just the neural network
contained 35 lines of incredibly dense code. Reading through it, it’s clear there’s a lot going
on; and that code includes over 100 pages of concepts that, when combined, can predict
whether it’s safe to cross the street.
I hope you’re continuing to rebuild these examples from memory in each chapter. As
the examples get larger, this exercise becomes less about remembering specific letters
of code and more about remembering concepts and then rebuilding the code based on
those concepts.
In this chapter, this construction of efficient concepts in your mind is exactly what I want
to talk about. Even though it’s not an architecture or experiment, it’s perhaps the most
important value I can give you. In this case, I want to show how I summarize all the little
lessons in an efficient way in my mind so that I can do things like build new architectures,
debug experiments, and use an architecture on new problems and new datasets.
Let’s start by reviewing the concepts you’ve learned so far.
This book began with small lessons and then built layers of abstraction on top of them.
We began by talking about the ideas behind machine learning in general. Then we
progressed to how individual linear nodes (or neurons) learned, followed by horizontal
groups of neurons (layers) and then vertical groups (stacks of layers). Along the way,
we discussed how learning is actually just reducing error to 0, and we used calculus
to discover how to change each weight in the network to help move the error in the
direction of 0.
Next, we discussed how neural networks search for (and sometimes create) correlation
between the input and output datasets. This last idea allowed us to overlook the
previous lessons on how individual neurons behaved because it concisely summarizes
the previous lessons. The sum total of the neurons, gradients, stacks of layers, and so on
lead to a single idea: neural networks find and create correlation.
Holding onto this idea of correlation instead of the previous smaller ideas is important
to learning deep learning. Otherwise, it would be easy to become overwhelmed with the
complexity of neural networks. Let’s create a name for this idea: the correlation summarization.
Correlation summarization
This is the key to sanely moving forward to more advanced
neural networks.
Correlation summarization
Neural networks seek to find direct and indirect correlation between an input layer and
an output layer, which are determined by the input and output datasets, respectively.
At the 10,000-foot level, this is what all neural networks do. Given that a neural network is
really just a series of matrices connected by layers, let’s zoom in slightly and consider what
any particular weight matrix is doing.
Local correlation summarization
Any given set of weights optimizes to learn how to correlate its input layer with what the
output layer says it should be.
When you have only two layers (input and output), the weight matrix knows what the
output layer says it should be based on the output dataset. It looks for correlation between
the input and output datasets because they’re captured in the input and output layers. But
this becomes more nuanced when you have multiple layers, remember?
Global correlation summarization
What an earlier layer says it should be can be determined by taking what a later layer says it
should be and multiplying it by the weights in between them. This way, later layers can tell
earlier layers what kind of signal they need, to ultimately find correlation with the output. This
cross-communication is called backpropagation.
When global correlation teaches each layer what it should be, local correlation can optimize
weights locally. When a neuron in the final layer says, “I need to be a little higher,” it then
proceeds to tell all the neurons in the layer immediately preceding it, “Hey, previous layer,
send me higher signal.” They then tell the neurons preceding them, “Hey. Send us higher
signal.” It’s like a giant game of telephone—at the end of the game, every layer knows which
of its neurons need to be higher and lower, and the local correlation summarization takes
over, updating the weights accordingly.
The previously overcomplicated visualization
While simplifying the mental picture, let’s simplify the
visualization as well.
At this point, I expect the visualization of neural networks in your head is something like
the picture shown here (because that’s the one we used). The input dataset is in layer_0,
connected by a weight matrix (a bunch of lines) to layer_1, and so on. This was a useful tool
to learn the basics of how collections of weights and layers come together to learn a function.
But moving forward, this picture has too much detail. Given the correlation summarization,
you already know you no longer need to worry about how individual weights are updated.
Later layers already know how to communicate to earlier layers and tell them, “Hey, I
need higher signal” or “Hey, I need lower signal.” Truth be told, you don’t really care about
the weight values anymore, only that they’re behaving as they should, properly capturing
correlation in a way that generalizes.
To reflect this change, let’s update the visualization on paper. We’ll also do a few other
things that will make sense later. As you know, the neural network is a series of weight
matrices. When you’re using the network, you also end up creating vectors corresponding
to each layer.
In the figure, the weight
matrices are the lines going
from node to node, and the
vectors are the strips of nodes.
For example, weights_1_2
is a matrix, weights_0_1 is
a matrix, and layer_1 is a
vector.
In later chapters, we’ll arrange
vectors and matrices in
increasingly creative ways,
so instead of all this detail
showing each node connected
by each weight (which gets
hard to read if we have, say,
500 nodes in layer_1), let’s
instead think in general
terms. Let’s think of them
as vectors and matrices of
arbitrary size.
(Figure: layer_0 → weights_0_1 → layer_1, where the relu nodes are → weights_1_2 → layer_2.)
The simplified visualization
Neural networks are like LEGO bricks, and each brick
is a vector or matrix.
Moving forward, we’ll build new neural network architectures in the same way people build
new structures with LEGO pieces. The great thing about the correlation summarization is
that all the bits and pieces that lead to it (backpropagation, gradient descent, alpha, dropout,
mini-batching, and so on) don’t depend on a particular configuration of the LEGOs. No
matter how you piece together the series of matrices, gluing them together with layers, the
neural network will try to learn the pattern in the data by modifying the weights between
wherever you put the input layer and the output layer.
To reflect this, we’ll build all the neural networks
with the pieces shown at right. The strip is a vector,
the box is a matrix, and the circles are individual
weights. Note that the box can be viewed as a
“vector of vectors,” horizontally or vertically.
(Figure: individual weights are numbers; layer_0 is a (1 × 3) vector, weights_0_1 a (3 × 4)
matrix, layer_1 a (1 × 4) vector, weights_1_2 a (4 × 1) matrix, and layer_2 a (1 × 1) vector.)
The big takeaway
The picture at left still gives you all
the information you need to build
a neural network. You know the
shapes and sizes of all the layers
and matrices. The detail from before
isn’t necessary when you know
the correlation summarization and
everything that went into it. But
we aren’t finished: we can simplify
even further.
Simplifying even further
The dimensionality of the matrices is determined by the layers.
In the previous section, you may have noticed a pattern. Each matrix's dimensionality
(number of rows and columns) has a direct relationship to the dimensionality of the layers
before and after it. Thus, we can simplify the visualization even further.
Consider the visualization shown at right. We still
have all the information needed to build a neural
network. We can infer that weights_0_1 is a (3 ×
4) matrix because the previous layer (layer_0) has
three dimensions and the next layer (layer_1) has
four dimensions. Thus, in order for the matrix to be
big enough to have a single weight connecting each
node in layer_0 to each node in layer_1, it must
be a (3 × 4) matrix.
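In NumPy terms, the layer sizes alone determine every matrix shape. Here's a small sketch of that inference, using the three/four/one layer sizes from the figure:

```python
import numpy as np

layer_sizes = [3, 4, 1]    # layer_0, layer_1, layer_2

# Each weight matrix's shape comes from the layers around it:
# rows = size of the layer before, columns = size of the layer after.
weights = [np.random.random((rows, cols)) - 0.5
           for rows, cols in zip(layer_sizes[:-1], layer_sizes[1:])]

print(weights[0].shape)    # (3, 4) -> weights_0_1
print(weights[1].shape)    # (4, 1) -> weights_1_2
```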
(Figure: the simplified stack, layer_0 → weights_0_1 → layer_1 → weights_1_2 → layer_2.)

This allows us to start thinking about neural networks using the correlation
summarization. All this neural network will do is adjust the weights to find correlation
between layer_0 and layer_2. It will do this using all the methods mentioned so far in
this book. But the different configurations of weights and layers between the input and
output layers have a strong impact on whether the network is successful in finding
correlation (and/or how fast it finds correlation).
The particular configuration of layers and weights in a neural network is called its
architecture, and we’ll spend the majority of the rest of this book discussing the pros and
cons of various architectures. As the correlation summarization reminds us, the neural
network adjusts weights to find correlation between the input and output layers, sometimes
even inventing correlation in the hidden layers. Different architectures channel signal to
make correlation easier to discover.
Good neural architectures channel signal so that correlation is easy to discover. Great
architectures also filter noise to help prevent overfitting.
Much of the research into neural networks is about finding new architectures that can find
correlation faster and generalize better to unseen data. We’ll spend the vast majority of the
rest of this book discussing new architectures.
Let’s see this network predict
Let’s picture data from the streetlight example
flowing through the system.
In figure 1, a single datapoint from the streetlight dataset is selected. layer_0 is set to
the correct values.

(Figure 1: the datapoint [1, 0, 1] is loaded into layer_0.)

In figure 2, four different weighted sums of layer_0 are performed. The four weighted sums
are performed by weights_0_1. As a reminder, this process is called vector-matrix
multiplication. These four values are deposited into the four positions of layer_1 and
passed through the relu function (setting negative values to 0). To be clear, the third
value from the left in layer_1 would have been negative, but the relu function sets it to 0.

(Figure 2: layer_0 = [1, 0, 1] multiplied by weights_0_1 yields layer_1 = [0.5, 0.2, 0, 0.9]
after relu.)

As shown in figure 3, the final step performs a weighted average of layer_1, again using the
vector-matrix multiplication process.
Review: Vector-matrix multiplication
Vector-matrix multiplication performs multiple weighted
sums of a vector. The matrix must have the same number
of rows as the vector has values, so that each column
in the matrix performs a unique weighted sum. Thus, if
the matrix has four columns, four weighted sums will be
generated. The weighting of each sum is determined by
the values in the corresponding column of the matrix.
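The review box translates directly into NumPy. Here's a standalone check with arbitrary values (a three-value vector and a four-column matrix, so four weighted sums):

```python
import numpy as np

vector = np.array([1, 0, 1])                  # 3 values, like layer_0

matrix = np.array([[ 0.1,  0.2, -0.1, 0.5],   # 3 rows (one per vector value),
                   [ 0.3, -0.2,  0.0, 0.1],   # 4 columns (one per weighted sum)
                   [ 0.4,  0.0, -0.3, 0.4]])

result = vector.dot(matrix)                   # four weighted sums at once

# First weighted sum by hand: 1*0.1 + 0*0.3 + 1*0.4 = 0.5
print(result)   # four values: 0.5, 0.2, -0.4, 0.9
```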
This yields the number 0.9, which is the network’s final prediction.

(Figure 3: layer_1 = [0.5, 0.2, 0, 0.9] multiplied by weights_1_2 yields the final
prediction, 0.9.)
Visualizing using letters instead of pictures
All these pictures and detailed explanations are actually
a simple piece of algebra.
Just as we defined simpler pictures for the matrix
and vector, we can perform the same visualization
in the form of letters.
How do you visualize a matrix using math? Pick
a capital letter. I try to pick one that’s easy to
remember, such as W for “weights.” The little 0
means it’s probably one of several Ws. In this case,
the network has two. Perhaps surprisingly, I could
have picked any capital letter. The little 0 is an extra
that lets me call all my weight matrices W so I can
tell them apart. It’s your visualization; make it easy
to remember.
How do you visualize a vector using math? Pick a
lowercase letter. Why did I choose the letter l? Well,
because I have a bunch of vectors that are layers,
I thought l would be easy to remember. Why did
I choose to call it l-zero? Because I have multiple
layers, it seems nice to make them all ls and number
them instead of having to think of new letters for
every layer. There’s no wrong answer here.
If that’s how to visualize matrices and vectors
in math, what do all the pieces in the network
look like? At right, you can see a nice selection of
variables pointing to their respective sections of the
neural network. But defining them doesn’t show
how they relate. Let’s combine the variables via
vector-matrix multiplication.
(Figure: the capital letters W0 and W1 label the weight matrices weights_0_1 and
weights_1_2; the lowercase letters l0, l1, and l2 label the layer vectors.)
Linking the variables
The letters can be combined to indicate functions
and operations.
Vector-matrix multiplication is simple. To visualize that two letters are being multiplied by
each other, put them next to each other. For example:
Algebra    Translation

l0W0       “Take the layer 0 vector and perform vector-matrix multiplication with the
           weight matrix 0.”

l1W1       “Take the layer 1 vector and perform vector-matrix multiplication with the
           weight matrix 1.”
You can even throw in arbitrary functions like relu using notation that looks almost exactly
like the Python code. This is crazy-intuitive stuff.
l1 = relu(l0W0)
“To create the layer 1 vector, take the layer 0 vector
and perform vector-matrix multiplication with the
weight matrix 0; then perform the relu function on
the output (setting all negative numbers to 0).”
l2 = l1W1
“To create the layer 2 vector, take the layer 1 vector
and perform vector-matrix multiplication with the
weight matrix 1.”
If you notice, the layer 2 algebra contains layer 1 as an input variable. This means you
can represent the entire neural network in one expression by chaining them together.
Thus, all the logic in the forward propagation step can be contained in this one formula:

l2 = relu(l0W0)W1

Note: baked into this formula is the assumption that the vectors and matrices have the
right dimensions.
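Because the formula is just chained NumPy operations, you can check it directly. Here's a small sketch using randomly generated vectors and matrices with the (1 × 3), (3 × 4), and (4 × 1) shapes from the earlier figure (the values are arbitrary):

```python
import numpy as np

np.random.seed(1)
relu = lambda x: (x >= 0) * x          # sets negative values to 0

l0 = np.random.random((1, 3))          # layer 0 vector
W0 = np.random.random((3, 4)) - 0.5    # weight matrix 0
W1 = np.random.random((4, 1)) - 0.5    # weight matrix 1

# The whole forward propagation step in one expression: l2 = relu(l0W0)W1
l2 = relu(l0.dot(W0)).dot(W1)

# The same thing, one layer at a time
l1 = relu(l0.dot(W0))
l2_stepwise = l1.dot(W1)

assert np.allclose(l2, l2_stepwise)    # identical results
print(l2.shape)                        # (1, 1): a single prediction
```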
Everything side by side
Let’s see the visualization, algebra formula,
and Python code in one place.
I don’t think much dialogue is necessary on this page. Take a minute and look at each piece
of forward propagation through these four different ways of seeing it. It’s my hope that you’ll
truly grok forward propagation and understand the architecture by seeing it from different
perspectives, all in one place.
layer_2 = relu(layer_0.dot(weights_0_1)).dot(weights_1_2)
l2 = relu(l0W0)W1
(Figure: layer_0 (Inputs) → weights_0_1 → layer_1 (Hiddens) → weights_1_2 → layer_2
(Prediction).)
The importance of visualization tools
We’re going to be studying new architectures.
In the following chapters, we’ll be taking these vectors and matrices and combining them
in some creative ways. My ability to describe each architecture for you is entirely dependent
on our having a mutually agreed-on language for describing them. Thus, please don’t move
beyond this chapter until you can clearly see how forward propagation manipulates these
vectors and matrices, and how these various forms of describing them are articulated.
Key takeaway
Good neural architectures channel signal so that correlation is easy to discover. Great
architectures also filter noise to help prevent overfitting.
As mentioned previously, a neural architecture controls how signal flows through a network.
How you create these architectures will affect the ways in which the network can detect
correlation. You’ll find that you want to create architectures that maximize the network’s
ability to focus on the areas where meaningful correlation exists, and minimize the
network’s ability to focus on the areas that contain noise.
But different datasets and domains have different characteristics. For example, image data
has different kinds of signal and noise than text data. Even though neural networks can be
used in many situations, different architectures will be better suited to different problems
because of their ability to locate certain types of correlations. So, for the next few chapters,
we’ll explore how to modify neural networks to specifically find the correlation you’re
looking for. See you there!
8 Learning signal and ignoring noise: introduction to regularization and batching

In this chapter
• Overfitting
• Dropout
• Batch gradient descent

With four parameters I can fit an elephant, and with
five I can make him wiggle his trunk.
—John von Neumann, mathematician, physicist,
computer scientist, and polymath
Three-layer network on MNIST
Let’s return to the MNIST dataset and attempt to classify it with
the new network.
In the last several chapters, you’ve learned that neural networks model correlation. The
hidden layer (the middle one in the three-layer network) can even create intermediate
correlation to help solve a task (seemingly out of midair). How do you know the network is creating
good correlation?
When we discussed stochastic gradient descent with multiple inputs, we ran an experiment
where we froze one weight and then asked the network to continue training. As it was
training, the dots found the bottom of the bowls, as it were. You saw the weights become
adjusted to minimize the error.
When we froze the weight, the frozen weight still found the bottom of the bowl. For some
reason, the bowl moved so that the frozen weight value became optimal. Furthermore, if we
unfroze the weight to do some more training, it wouldn’t learn. Why? Well, the error had
already fallen to 0. As far as the network was concerned, there was nothing more to learn.
This raises the question: what if the input to the frozen weight was important to predicting
baseball victory in the real world? What if the network had figured out a way to accurately
predict the games in the training dataset (because that’s what networks do: they minimize
error), but it somehow forgot to include a valuable input?
Unfortunately, this phenomenon—overfitting—is extremely common in neural networks.
We could say it’s the archnemesis of neural networks; and the more powerful the neural
network’s expressive power (more layers and weights), the more prone the network is to
overfit. An everlasting battle is going on in research, where people continually find tasks
that need more powerful layers but then have to do lots of problem-solving to make sure the
network doesn’t overfit.
In this chapter, we’re going to study the basics of regularization, which is key to combatting
overfitting in neural networks. To do this, we’ll start with the most powerful neural network
(three-layer network with relu hidden layer) on the most challenging task (MNIST digit
classification).
To begin, go ahead and train the network, as shown next. You should see the same results
as those listed. Alas, the network learned to perfectly predict the training data. Should
we celebrate?
import sys, numpy as np
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000,28*28) / 255,
                  y_train[0:1000])

one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

np.random.seed(1)
relu = lambda x:(x>=0) * x       # Returns x if x > 0; returns 0 otherwise
relu2deriv = lambda x: x>=0      # Returns 1 for input > 0; returns 0 otherwise

alpha, iterations, hidden_size, pixels_per_image, num_labels = \
    (0.005, 350, 40, 784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == \
                           np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) \
                        * relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    sys.stdout.write("\r"+ \
                     " I:"+str(j)+ \
                     " Error:" + str(error/float(len(images)))[0:5] +\
                     " Correct:" + str(correct_cnt/float(len(images))))
....
I:349 Error:0.108 Correct:1.0
Well, that was easy
The neural network perfectly learned to predict all 1,000 images.
In some ways, this is a real victory. The neural network was able to take a dataset of 1,000
images and learn to correlate each input image with the correct label.
How did it do this? It iterated through each image, made a prediction, and then updated
each weight ever so slightly so the prediction was better next time. Doing this long enough
on all the images eventually reached a state where the network could correctly predict on all
the images.
Here’s a non-obvious question: how well will the neural network do on an image it hasn’t
seen before? In other words, how well will it do on an image that wasn’t part of the 1,000
images it was trained on? The MNIST dataset has many more images than just the 1,000 you
trained on; let’s try it.
In the notebook from the previous code are two variables: test_images and test_labels.
If you execute the following code, it will run the neural network on these images and
evaluate how well the network classifies them:
if(j % 10 == 0 or j == iterations-1):
    error, correct_cnt = (0.0, 0)
    for i in range(len(test_images)):
        layer_0 = test_images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)
        error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == \
                           np.argmax(test_labels[i:i+1]))
    sys.stdout.write(" Test-Err:" + str(error/float(len(test_images)))[0:5] +\
                     " Test-Acc:" + str(correct_cnt/float(len(test_images))))
    print()
Error:0.653 Correct:0.7073
The network did horribly! It predicted with an accuracy of only 70.7%. Why does it do so
terribly on these new testing images when it learned to predict with 100% accuracy on the
training data? How strange.
This 70.7% number is called the test accuracy. It’s the accuracy of the neural network on
data the network was not trained on. This number is important because it simulates how
well the neural network will perform if you try to use it in the real world (which gives the
network only images it hasn’t seen before). This is the score that matters.
Memorization vs. generalization
Memorizing 1,000 images is easier than generalizing to all images.
Let’s consider again how a neural network learns. It adjusts each weight in each matrix so
the network is better able to take specific inputs and make a specific prediction. Perhaps a
better question might be, “If we train it on 1,000 images, which it learns to predict perfectly,
why does it work on other images at all?”
As you might expect, when the fully trained neural network is applied to a new image,
it’s guaranteed to work well only if the new image is nearly identical to an image from the
training data. Why? Because the neural network learned to transform input data to output
data for only very specific input configurations. If you give it something that doesn’t look
familiar, it will predict randomly.
This makes neural networks kind of pointless. What’s the point of a neural network working only
on the data you trained it on? You already know the correct classifications for those datapoints.
Neural networks are useful only if they work on data you don’t already know the answer to.
As it turns out, there’s a way to combat this. Here I’ve printed out both the training and
testing accuracy of the neural network as it was training (every 10 iterations). Notice
anything interesting? You should see a clue to better networks:
I:0 Train-Err:0.722 Train-Acc:0.537 Test-Err:0.601 Test-Acc:0.6488
I:10 Train-Err:0.312 Train-Acc:0.901 Test-Err:0.420 Test-Acc:0.8114
I:20 Train-Err:0.260 Train-Acc:0.93 Test-Err:0.414 Test-Acc:0.8111
I:30 Train-Err:0.232 Train-Acc:0.946 Test-Err:0.417 Test-Acc:0.8066
I:40 Train-Err:0.215 Train-Acc:0.956 Test-Err:0.426 Test-Acc:0.8019
I:50 Train-Err:0.204 Train-Acc:0.966 Test-Err:0.437 Test-Acc:0.7982
I:60 Train-Err:0.194 Train-Acc:0.967 Test-Err:0.448 Test-Acc:0.7921
I:70 Train-Err:0.186 Train-Acc:0.975 Test-Err:0.458 Test-Acc:0.7864
I:80 Train-Err:0.179 Train-Acc:0.979 Test-Err:0.466 Test-Acc:0.7817
I:90 Train-Err:0.172 Train-Acc:0.981 Test-Err:0.474 Test-Acc:0.7758
I:100 Train-Err:0.166 Train-Acc:0.984 Test-Err:0.482 Test-Acc:0.7706
I:110 Train-Err:0.161 Train-Acc:0.984 Test-Err:0.489 Test-Acc:0.7686
I:120 Train-Err:0.157 Train-Acc:0.986 Test-Err:0.496 Test-Acc:0.766
I:130 Train-Err:0.153 Train-Acc:0.99 Test-Err:0.502 Test-Acc:0.7622
I:140 Train-Err:0.149 Train-Acc:0.991 Test-Err:0.508 Test-Acc:0.758
....
I:210 Train-Err:0.127 Train-Acc:0.998 Test-Err:0.544 Test-Acc:0.7446
I:220 Train-Err:0.125 Train-Acc:0.998 Test-Err:0.552 Test-Acc:0.7416
I:230 Train-Err:0.123 Train-Acc:0.998 Test-Err:0.560 Test-Acc:0.7372
I:240 Train-Err:0.121 Train-Acc:0.998 Test-Err:0.569 Test-Acc:0.7344
I:250 Train-Err:0.120 Train-Acc:0.999 Test-Err:0.577 Test-Acc:0.7316
I:260 Train-Err:0.118 Train-Acc:0.999 Test-Err:0.585 Test-Acc:0.729
I:270 Train-Err:0.117 Train-Acc:0.999 Test-Err:0.593 Test-Acc:0.7259
I:280 Train-Err:0.115 Train-Acc:0.999 Test-Err:0.600 Test-Acc:0.723
I:290 Train-Err:0.114 Train-Acc:0.999 Test-Err:0.607 Test-Acc:0.7196
I:300 Train-Err:0.113 Train-Acc:0.999 Test-Err:0.614 Test-Acc:0.7183
I:310 Train-Err:0.112 Train-Acc:0.999 Test-Err:0.622 Test-Acc:0.7165
I:320 Train-Err:0.111 Train-Acc:0.999 Test-Err:0.629 Test-Acc:0.7133
I:330 Train-Err:0.110 Train-Acc:0.999 Test-Err:0.637 Test-Acc:0.7125
I:340 Train-Err:0.109 Train-Acc:1.0 Test-Err:0.645 Test-Acc:0.71
I:349 Train-Err:0.108 Train-Acc:1.0 Test-Err:0.653 Test-Acc:0.7073
Overfitting in neural networks
Neural networks can get worse if you train them too much!
For some reason, the test accuracy went up for the first 20 iterations and then slowly
decreased as the network trained more and more (during which time the training accuracy
was still improving). This is common in neural networks. Let me explain the phenomenon
via an analogy.
Imagine you’re creating a mold for a common dinner fork, but instead of using it to create
other forks, you want to use it to identify whether a particular utensil is a fork. If an object
fits in the mold, you’ll conclude that the object is a fork, and if it doesn’t, you’ll conclude that
it’s not a fork.
Let’s say you set out to make this mold, and you start with a wet piece of clay and a big
bucket of three-pronged forks, spoons, and knives. You then press each of the forks into
the same place in the mold to create an outline, which sort of looks like a mushy fork. You
repeatedly place all the forks in the clay over and over, hundreds of times. When you let the
clay dry, you then find that none of the spoons or knives fit into this mold, but all the forks
do. Awesome! You did it. You correctly made a mold that can fit only the shape of a fork.
But what happens if someone hands you a four-pronged fork? You look at your mold and
notice that there’s a specific outline for three thin prongs in the clay. The four-pronged fork
doesn’t fit. Why not? It’s still a fork.
It’s because the clay wasn’t molded on any four-pronged forks. It was molded only on the
three-pronged variety. In this way, the clay has overfit to recognize only the types of forks it
was “trained” to shape.
This is exactly the same phenomenon you just witnessed in the neural network. It’s an even
closer parallel than you might think. One way to view the weights of a neural network is
as a high-dimensional shape. As you train, this shape molds around the shape of the data,
learning to distinguish one pattern from another. Unfortunately, the images in the testing
dataset were slightly different from the patterns in the training dataset. This caused the
network to fail on many of the testing examples.
A more official definition of a neural network that overfits is a neural network that has
learned the noise in the dataset instead of making decisions based only on the true signal.
Where overfitting comes from
What causes neural networks to overfit?
Let’s alter this scenario a bit. Picture the fresh clay again (unmolded). What if you pushed
only a single fork into it? Assuming the clay was very thick, it wouldn’t have as much detail
as the previous mold (which was imprinted many times). Thus, it would be only a very
general shape of a fork. This shape might be compatible with both the three- and
four-pronged varieties of fork, because it’s still a fuzzy imprint.

By the same logic, the mold got worse at the testing dataset as you imprinted more
forks, because it learned more-detailed information about the training dataset it was being
molded to. This caused it to reject images that were even the slightest bit off from what it
had repeatedly seen in the training data.
What is this detailed information in the images that’s incompatible with the test data? In the
fork analogy, it’s the number of prongs on the fork. In images, it’s generally referred to as
noise. In reality, it’s a bit more nuanced. Consider these two dog pictures.
Everything that makes these pictures unique beyond what captures the essence of “dog”
is included in the term noise. In the picture on the left, the pillow and the background are
both noise. In the picture on the right, the empty, middle blackness of the dog is a form of
noise as well. It’s really the edges that tell you it’s a dog; the middle blackness doesn’t tell you
anything. In the picture on the left, the middle of the dog has the furry texture and color of a
dog, which could help the classifier correctly identify it.
How do you get neural networks to train only on the signal (the essence of a dog) and ignore
the noise (other stuff irrelevant to the classification)? One way is early stopping. It turns out
a large amount of noise comes in the fine-grained detail of an image, and most of the signal
(for objects) is found in the general shape and perhaps color of the image.
The simplest regularization: Early stopping
Stop training the network when it starts getting worse.
How do you get a neural network to ignore the fine-grained detail and capture only the
general information present in the data (such as the general shape of a dog or of an MNIST
digit)? You don’t let the network train long enough to learn it.
In the fork-mold example, it takes many forks imprinted many times to create the perfect
outline of a three-pronged fork. The first few imprints generally capture only the shallow
outline of a fork. The same can be said for neural networks. As a result, early stopping is the
cheapest form of regularization, and if you’re in a pinch, it can be quite effective.
This brings us to the subject this chapter is all about: regularization. Regularization is a
subfield of methods for getting a model to generalize to new datapoints (instead of just
memorizing the training data). It’s a subset of methods that help the neural network learn
the signal and ignore the noise. In this case, it’s a toolset at your disposal to create neural
networks that have these properties.
Regularization
Regularization is a subset of methods used to encourage generalization in learned
models, often by increasing the difficulty for a model to learn the fine-grained details of
training data.
The next question might be, how do you know when to stop? The only real way to know
is to run the model on data that isn’t in the training dataset. This is typically done using a
second test dataset called a validation set. In some circumstances, if you used the test set for
knowing when to stop, you could overfit to the test set. As a general rule, you don’t use it to
control training. You use a validation set instead.
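A sketch of what stopping on a validation set might look like (the helper names and the toy error curve here are hypothetical, not from the book's code; a real loop would run training and measure error on held-out validation data):

```python
# Hypothetical stand-ins for a real training loop. Each "epoch" would update
# the weights; here, validation error follows a toy U-shaped curve that
# bottoms out at epoch 20 and then rises (overfitting).
def train_one_epoch(epoch):
    pass  # in a real loop: forward pass, backprop, weight updates

def validation_error(epoch):
    return (epoch - 20) ** 2 / 400 + 0.3

best_error, best_epoch = float("inf"), -1
patience, bad_epochs = 5, 0

for epoch in range(350):
    train_one_epoch(epoch)
    err = validation_error(epoch)
    if err < best_error:
        best_error, best_epoch = err, epoch   # checkpoint the weights here
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # validation error stopped improving:
            break                    # stop training, keep the checkpoint

print(best_epoch)   # 20: the epoch you'd roll back to
```

Waiting a few "bad" epochs (the patience counter) before stopping avoids quitting on a single noisy validation reading.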
Industry standard regularization: Dropout
The method: Randomly turn off neurons (set them to 0)
during training.
This regularization technique is as simple as it sounds. During training, you randomly
set neurons in the network to 0 (and usually the deltas on the same nodes during
backpropagation, but you technically don’t have to). This causes the neural network to train
exclusively using random subsections of the neural network.
Believe it or not, this regularization technique is generally accepted as the go-to,
state-of-the-art regularization technique for the vast majority of networks. Its
methodology is simple and inexpensive, although the intuitions behind why it works are a
bit more complex.
Why dropout works (perhaps oversimplified)
Dropout makes a big network act like a little one by randomly training little subsections
of the network at a time, and little networks don’t overfit.
It turns out that the smaller a neural network is, the less it’s able to overfit. Why? Well, small
neural networks don’t have much expressive power. They can’t latch on to the more granular
details (noise) that tend to be the source of overfitting. They have room to capture only the
big, obvious, high-level features.
This notion of room or capacity is really important to keep in your mind. Think of it like
this. Remember the clay analogy? Imagine if the clay was made of sticky rocks the size of
dimes. Would that clay be able to make a good imprint of a fork? Of course not. Those
stones are much like the weights. They form around the data, capturing the patterns you’re
interested in. If you have only a few, larger stones, they can’t capture nuanced detail. Each
stone instead is pushed on by large parts of the fork, more or less averaging the shape
(ignoring fine creases and corners).
Now, imagine clay made of very fine-grained sand. It’s made up of millions and millions of
small stones that can fit into every nook and cranny of a fork. This is what gives big neural
networks the expressive power they often use to overfit to a dataset.
How do you get the power of a large neural network with the resistance to overfitting
of the small neural network? Take the big neural network and turn off nodes randomly.
What happens when you take a big neural network and use only a small part of it? It
behaves like a small neural network. But when you do this randomly over potentially
millions of different subnetworks, the sum total of the entire network still maintains its
expressive power. Neat, eh?
Why dropout works: Ensembling works
Dropout is a form of training a bunch of networks and
averaging them.
Something to keep in mind: neural networks always start out randomly. Why does this
matter? Well, because neural networks learn by trial and error, this ultimately means
every neural network learns a little differently. It may learn equally effectively, but no two
neural networks are ever exactly the same (unless they start out exactly the same for some
random or intentional reason).
This has an interesting property. When you overfit two neural networks, no two neural
networks overfit in exactly the same way. Overfitting occurs only until every training
image can be predicted perfectly, at which point the error == 0 and the network stops
learning (even if you keep iterating). But because each neural network starts by predicting
randomly and then adjusting its weights to make better predictions, each network
inevitably makes different mistakes, resulting in different updates. This culminates in a
core concept:
Although it’s likely that large, unregularized neural networks will overfit to noise, it’s unlikely
they will overfit to the same noise.
Why don’t they overfit to the same noise? Because they start randomly, and they stop
training once they’ve learned enough noise to disambiguate between all the images in the
training set. The MNIST network needs to find only a handful of random pixels that happen
to correlate with the output labels, to overfit. But this is contrasted with, perhaps, an even
more important concept:
Neural networks, even though they’re randomly generated, still start by learning the biggest,
most broadly sweeping features before learning much about the noise.
The takeaway is this: if you train 100 neural networks (all initialized randomly), they will
each tend to latch onto different noise but similar broad signal. Thus, when they make
mistakes, they will often make differing mistakes. If you allowed them to vote equally, their
noise would tend to cancel out, revealing only what they all learned in common: the signal.
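You can see this cancellation with a toy simulation (the signal and noise values here are made up for illustration): give 100 "networks" the same underlying signal plus their own private noise, and let them vote.

```python
import numpy as np

np.random.seed(0)
signal = 0.8                              # what every network learns in common
noise = np.random.randn(100) * 0.3        # each network's private mistakes
predictions = signal + noise              # 100 noisy "networks"

# individual predictions scatter, but the average hugs the signal
average_vote = predictions.mean()
```

The individual predictions wander widely, but their mean lands very close to the shared signal, which is exactly the effect an ensemble (or dropout) exploits.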
Dropout in code
Here’s how to use dropout in the real world.
In the MNIST classification model, let’s add dropout to the hidden layer, such that 50% of
the nodes are turned off (randomly) during training. You may be surprised that this is only
a three-line change in the code. Following is a familiar snippet from the previous neural
network logic, with the dropout mask added:
i = 0
layer_0 = images[i:i+1]
layer_1 = relu(np.dot(layer_0, weights_0_1))
dropout_mask = np.random.randint(2, size=layer_1.shape)
layer_1 *= dropout_mask * 2
layer_2 = np.dot(layer_1, weights_1_2)

error += np.sum((labels[i:i+1] - layer_2) ** 2)
correct_cnt += int(np.argmax(layer_2) == \
                   np.argmax(labels[i:i+1]))

layer_2_delta = (labels[i:i+1] - layer_2)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) \
                * relu2deriv(layer_1)
layer_1_delta *= dropout_mask

weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
To implement dropout on a layer (in this case, layer_1), multiply the layer_1 values by a
random matrix of 1s and 0s. This has the effect of randomly turning off nodes in layer_1
by setting them to equal 0. Note that dropout_mask uses what’s called a 50% Bernoulli
distribution such that 50% of the time, each value in dropout_mask is 1, and (1 – 50% =
50%) of the time, it’s 0.
This is followed by something that may seem a bit peculiar. You multiply layer_1 by 2.
Why do you do this? Remember that layer_2 will perform a weighted sum of layer_1.
Even though it’s weighted, it’s still a sum over the values of layer_1. If you turn off half the
nodes in layer_1, that sum will be cut in half. Thus, layer_2 would increase its sensitivity
to layer_1, kind of like a person leaning closer to a radio when the volume is too low to
better hear it. But at test time, when you no longer use dropout, the volume would be back
up to normal. This throws off layer_2’s ability to listen to layer_1. You need to counter this
by multiplying layer_1 by (1 / the percentage of turned on nodes). In this case, that’s 1/0.5,
which equals 2. This way, the volume of layer_1 is the same between training and testing,
despite dropout.
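Here's the mask-and-rescale step in isolation, using made-up activation values so you can watch what happens to a single layer:

```python
import numpy as np

np.random.seed(1)

# hypothetical hidden-layer activations, for illustration only
layer_1 = np.array([[0.5, 0.9, 0.1, 0.7]])

# each node is kept (1) or dropped (0) with 50% probability
dropout_mask = np.random.randint(2, size=layer_1.shape)

# multiplying by 2 (that is, by 1 / 0.5) keeps the expected sum
# into layer_2 the same as at test time, when nothing is dropped
layer_1_dropped = layer_1 * dropout_mask * 2
```

Every surviving node comes out at exactly twice its original value; every dropped node comes out at 0.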
import numpy as np, sys
np.random.seed(1)

def relu(x):
    return (x >= 0) * x        # returns x if x > 0; returns 0 otherwise

def relu2deriv(output):
    return output >= 0         # returns 1 for input > 0

alpha, iterations, hidden_size = (0.005, 300, 100)
pixels_per_image, num_labels = (784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == \
                           np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j % 10 == 0):
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == \
                                    np.argmax(test_labels[i:i+1]))

        sys.stdout.write("\n" + \
            "I:" + str(j) + \
            " Test-Err:" + str(test_error / float(len(test_images)))[0:5] + \
            " Test-Acc:" + str(test_correct_cnt / float(len(test_images))) + \
            " Train-Err:" + str(error / float(len(images)))[0:5] + \
            " Train-Acc:" + str(correct_cnt / float(len(images))))
Dropout evaluated on MNIST
If you remember from before, the neural network (without dropout) previously reached a
test accuracy of 81.14% before falling down to finish training at 70.73% accuracy. When you
add dropout, the neural network instead behaves this way:
I:0 Test-Err:0.641 Test-Acc:0.6333 Train-Err:0.891 Train-Acc:0.413
I:10 Test-Err:0.458 Test-Acc:0.787 Train-Err:0.472 Train-Acc:0.764
I:20 Test-Err:0.415 Test-Acc:0.8133 Train-Err:0.430 Train-Acc:0.809
I:30 Test-Err:0.421 Test-Acc:0.8114 Train-Err:0.415 Train-Acc:0.811
I:40 Test-Err:0.419 Test-Acc:0.8112 Train-Err:0.413 Train-Acc:0.827
I:50 Test-Err:0.409 Test-Acc:0.8133 Train-Err:0.392 Train-Acc:0.836
I:60 Test-Err:0.412 Test-Acc:0.8236 Train-Err:0.402 Train-Acc:0.836
I:70 Test-Err:0.412 Test-Acc:0.8033 Train-Err:0.383 Train-Acc:0.857
I:80 Test-Err:0.410 Test-Acc:0.8054 Train-Err:0.386 Train-Acc:0.854
I:90 Test-Err:0.411 Test-Acc:0.8144 Train-Err:0.376 Train-Acc:0.868
I:100 Test-Err:0.411 Test-Acc:0.7903 Train-Err:0.369 Train-Acc:0.864
I:110 Test-Err:0.411 Test-Acc:0.8003 Train-Err:0.371 Train-Acc:0.868
I:120 Test-Err:0.402 Test-Acc:0.8046 Train-Err:0.353 Train-Acc:0.857
I:130 Test-Err:0.408 Test-Acc:0.8091 Train-Err:0.352 Train-Acc:0.867
I:140 Test-Err:0.405 Test-Acc:0.8083 Train-Err:0.355 Train-Acc:0.885
I:150 Test-Err:0.404 Test-Acc:0.8107 Train-Err:0.342 Train-Acc:0.883
I:160 Test-Err:0.399 Test-Acc:0.8146 Train-Err:0.361 Train-Acc:0.876
I:170 Test-Err:0.404 Test-Acc:0.8074 Train-Err:0.344 Train-Acc:0.889
I:180 Test-Err:0.399 Test-Acc:0.807 Train-Err:0.333 Train-Acc:0.892
I:190 Test-Err:0.407 Test-Acc:0.8066 Train-Err:0.335 Train-Acc:0.898
I:200 Test-Err:0.405 Test-Acc:0.8036 Train-Err:0.347 Train-Acc:0.893
I:210 Test-Err:0.405 Test-Acc:0.8034 Train-Err:0.336 Train-Acc:0.894
I:220 Test-Err:0.402 Test-Acc:0.8067 Train-Err:0.325 Train-Acc:0.896
I:230 Test-Err:0.404 Test-Acc:0.8091 Train-Err:0.321 Train-Acc:0.894
I:240 Test-Err:0.415 Test-Acc:0.8091 Train-Err:0.332 Train-Acc:0.898
I:250 Test-Err:0.395 Test-Acc:0.8182 Train-Err:0.320 Train-Acc:0.899
I:260 Test-Err:0.390 Test-Acc:0.8204 Train-Err:0.321 Train-Acc:0.899
I:270 Test-Err:0.382 Test-Acc:0.8194 Train-Err:0.312 Train-Acc:0.906
I:280 Test-Err:0.396 Test-Acc:0.8208 Train-Err:0.317 Train-Acc:0.9
I:290 Test-Err:0.399 Test-Acc:0.8181 Train-Err:0.301 Train-Acc:0.908
Not only does the network peak at a score of 82.36%, it also doesn't overfit nearly
as badly, finishing training with a test accuracy of 81.81%. Notice that dropout also
slows down training accuracy, which previously went straight to 100% and stayed there.
This should point to what dropout really is: it’s noise. It makes it more difficult for the
network to train on the training data. It’s like running a marathon with weights on your
legs. It’s harder to train, but when you take off the weights for the big race, you end up
running quite a bit faster because you trained for something that was much more difficult.
Batch gradient descent
Here’s a method for increasing the speed of training and the rate
of convergence.
In the context of this chapter, I’d like to briefly apply a concept introduced several chapters
ago: mini-batched stochastic gradient descent. I won’t go into too much detail, because it’s
something that’s largely taken for granted in neural network training. Furthermore, it’s a
simple concept that doesn’t get more advanced even with the most state-of-the-art neural
networks.
Previously we trained one training example at a time, updating the weights after each
example. Now, let’s train 100 training examples at a time, averaging the weight updates
among all 100 examples. The training/testing output is shown next, followed by the code for
the training logic.
I:0 Test-Err:0.815 Test-Acc:0.3832 Train-Err:1.284 Train-Acc:0.165
I:10 Test-Err:0.568 Test-Acc:0.7173 Train-Err:0.591 Train-Acc:0.672
I:20 Test-Err:0.510 Test-Acc:0.7571 Train-Err:0.532 Train-Acc:0.729
I:30 Test-Err:0.485 Test-Acc:0.7793 Train-Err:0.498 Train-Acc:0.754
I:40 Test-Err:0.468 Test-Acc:0.7877 Train-Err:0.489 Train-Acc:0.749
I:50 Test-Err:0.458 Test-Acc:0.793 Train-Err:0.468 Train-Acc:0.775
I:60 Test-Err:0.452 Test-Acc:0.7995 Train-Err:0.452 Train-Acc:0.799
I:70 Test-Err:0.446 Test-Acc:0.803 Train-Err:0.453 Train-Acc:0.792
I:80 Test-Err:0.451 Test-Acc:0.7968 Train-Err:0.457 Train-Acc:0.786
I:90 Test-Err:0.447 Test-Acc:0.795 Train-Err:0.454 Train-Acc:0.799
I:100 Test-Err:0.448 Test-Acc:0.793 Train-Err:0.447 Train-Acc:0.796
I:110 Test-Err:0.441 Test-Acc:0.7943 Train-Err:0.426 Train-Acc:0.816
I:120 Test-Err:0.442 Test-Acc:0.7966 Train-Err:0.431 Train-Acc:0.813
I:130 Test-Err:0.441 Test-Acc:0.7906 Train-Err:0.434 Train-Acc:0.816
I:140 Test-Err:0.447 Test-Acc:0.7874 Train-Err:0.437 Train-Acc:0.822
I:150 Test-Err:0.443 Test-Acc:0.7899 Train-Err:0.414 Train-Acc:0.823
I:160 Test-Err:0.438 Test-Acc:0.797 Train-Err:0.427 Train-Acc:0.811
I:170 Test-Err:0.440 Test-Acc:0.7884 Train-Err:0.418 Train-Acc:0.828
I:180 Test-Err:0.436 Test-Acc:0.7935 Train-Err:0.407 Train-Acc:0.834
I:190 Test-Err:0.434 Test-Acc:0.7935 Train-Err:0.410 Train-Acc:0.831
I:200 Test-Err:0.435 Test-Acc:0.7972 Train-Err:0.416 Train-Acc:0.829
I:210 Test-Err:0.434 Test-Acc:0.7923 Train-Err:0.409 Train-Acc:0.83
I:220 Test-Err:0.433 Test-Acc:0.8032 Train-Err:0.396 Train-Acc:0.832
I:230 Test-Err:0.431 Test-Acc:0.8036 Train-Err:0.393 Train-Acc:0.853
I:240 Test-Err:0.430 Test-Acc:0.8047 Train-Err:0.397 Train-Acc:0.844
I:250 Test-Err:0.429 Test-Acc:0.8028 Train-Err:0.386 Train-Acc:0.843
I:260 Test-Err:0.431 Test-Acc:0.8038 Train-Err:0.394 Train-Acc:0.843
I:270 Test-Err:0.428 Test-Acc:0.8014 Train-Err:0.384 Train-Acc:0.845
I:280 Test-Err:0.430 Test-Acc:0.8067 Train-Err:0.401 Train-Acc:0.846
I:290 Test-Err:0.428 Test-Acc:0.7975 Train-Err:0.383 Train-Acc:0.851
Notice that the training accuracy has a smoother trend than it did before. Taking an average
weight update consistently creates this kind of phenomenon during training. As it turns out,
individual training examples are very noisy in terms of the weight updates they generate.
Thus, averaging them makes for a smoother learning process.
import numpy as np, sys
np.random.seed(1)

def relu(x):
    return (x >= 0) * x        # returns x if x > 0; returns 0 otherwise

def relu2deriv(output):
    return output >= 0         # returns 1 for input > 0

batch_size = 100
alpha, iterations = (0.001, 300)
pixels_per_image, num_labels, hidden_size = (784, 10, 100)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = ((i * batch_size), ((i+1) * batch_size))

        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == \
                np.argmax(labels[batch_start+k:batch_start+k+1]))

        layer_2_delta = (labels[batch_start:batch_end] - layer_2) \
                        / batch_size
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * \
                        relu2deriv(layer_1)
        layer_1_delta *= dropout_mask

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if(j % 10 == 0):
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == \
                np.argmax(test_labels[i:i+1]))

        sys.stdout.write("\n" + \
            "I:" + str(j) + \
            " Test-Err:" + str(test_error / float(len(test_images)))[0:5] + \
            " Test-Acc:" + str(test_correct_cnt / float(len(test_images))) + \
            " Train-Err:" + str(error / float(len(images)))[0:5] + \
            " Train-Acc:" + str(correct_cnt / float(len(images))))
The first thing you’ll notice when running this code is that it runs much faster. This is
because each np.dot function is now performing 100 vector dot products at a time. CPU
architectures are much faster at performing dot products batched this way.
There’s more going on here, however. Notice that alpha is 20 times larger than before. You
can increase it for a fascinating reason. Imagine you were trying to find a city using a very
wobbly compass. If you looked down, got a heading, and then ran 2 miles, you’d likely
be way off course. But if you looked down, took 100 headings, and then averaged them,
running 2 miles would probably take you in the general right direction.
Because the example takes an average of a noisy signal (the average weight change over
100 training examples), it can take bigger steps. You’ll generally see batching ranging from
size 8 to as high as 256. Generally, researchers pick numbers randomly until they find a
batch_size/alpha pair that seems to work well.
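You can simulate the compass with NumPy (the gradient values here are invented for illustration): compare the spread of single-example "headings" to the spread of 100-example averages.

```python
import numpy as np

np.random.seed(0)
true_gradient = 1.0

# per-example gradient estimates: right on average, noisy individually
per_example = true_gradient + np.random.randn(1000)

# group them into batches of 100 and average within each batch
batch_estimates = per_example.reshape(10, 100).mean(axis=1)

# the batch averages scatter far less than the individual estimates,
# which is why a much larger alpha is safe
```

The batch averages cluster tightly around the true gradient, so each step can afford to be bigger.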
Summary
This chapter addressed two of the most widely used methods for increasing the accuracy
and training speed of almost any neural architecture. In the following chapters, we’ll pivot
from sets of tools that are universally applicable to nearly all neural networks, to
special-purpose architectures that are advantageous for modeling specific types of
phenomena in data.
9
modeling probabilities and nonlinearities:
activation functions

In this chapter

• What is an activation function?
• Standard hidden activation functions
  —Sigmoid
  —Tanh
• Standard output activation functions
  —Softmax
• Activation function installation instructions

I know that 2 and 2 make 4—& should be glad to prove
it too if I could—though I must say if by any sort of
process I could convert 2 & 2 into five it would give me
much greater pleasure.

—George Gordon Byron, letter to Annabella Milbanke,
November 10, 1813
What is an activation function?
It’s a function applied to the neurons
in a layer during prediction.
An activation function is a function applied to the
neurons in a layer during prediction. This should
seem very familiar, because you’ve been using an
activation function called relu (shown here in the
three-layer neural network). The relu function had
the effect of turning all negative numbers to 0.
Oversimplified, an activation function is any function
that can take one number and return another
number. But there are an infinite number of functions
in the universe, and not all of them are useful as activation functions.
(Figure: the three-layer network: layer_0 → weights_0_1 → relu → layer_1 → weights_1_2 → layer_2.)
There are several constraints on what makes a
function an activation function. Using functions
outside of these constraints is usually a bad idea, as
you’ll see.
Constraint 1: The function must be continuous
and infinite in domain.
The first constraint on what makes a proper activation function is that it must have an
output number for any input. In other words, you shouldn’t be able to put in a number that
doesn’t have an output for some reason.
A bit overkill, but see how the function on the left (four distinct lines) doesn’t have y values
for every x value? It’s defined in only four spots. This would make for a horrible activation
function. The function on the right, however, is continuous and infinite in domain. There is
no input (x) for which you can’t compute an output (y).
(Figures: left, a function defined in only four spots; right, a continuous curve such as y = x * x, with an output for every input.)
Constraint 2: Good activation functions are monotonic,
never changing direction.
The second constraint is that the function is 1:1. It must never change direction. In other
words, it must either be always increasing or always decreasing.
As an example, look at the following two functions. These shapes answer the question,
“Given x as input, what value of y does the function describe?” The function on the left
(y = x * x) isn’t an ideal activation function because it isn’t either always increasing or
always decreasing.
How can you tell? Well, notice that there are many cases in which two values of x have a
single value of y (this is true for every value except 0). The function on the right, however, is
always increasing! There is no point at which two values of x have the same value of y:
(Figures: left, y = x * x, which changes direction at 0; right, y = x, which is always increasing.)
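You can check this property numerically. Here's a small sketch (the helper name is_monotonic is my own, not from the text) that samples a function on a grid and asks whether it ever changes direction:

```python
import numpy as np

def is_monotonic(y):
    """True if the sampled function never changes direction."""
    steps = np.diff(y)
    return bool(np.all(steps >= 0) or np.all(steps <= 0))

x = np.linspace(-3.0, 3.0, 101)
# y = x always increases; y = x * x turns around at 0
```

Running is_monotonic(x) returns True, while is_monotonic(x ** 2) returns False, matching the two figures.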
This particular constraint isn’t technically a requirement. Unlike functions that have missing
values (noncontinuous), you can optimize functions that aren’t monotonic. But consider the
implication of having multiple input values map to the same output value.
When you’re learning in neural networks, you’re searching for the right weight
configurations to give a specific output. This problem can get a lot harder if there are
multiple right answers. If there are multiple ways to get the same output, then the network
has multiple possible perfect configurations.
An optimist might say, “Hey, this is great! You’re more likely to find the right answer if it
can be found in multiple places!” A pessimist would say, “This is terrible! Now you don’t
have a correct direction to go to reduce the error, because you can go in either direction and
theoretically make progress.”
Unfortunately, the phenomenon the pessimist identified is more important. For an advanced
study of this subject, look more into convex versus non-convex optimization; many
universities (and online classes) have entire courses dedicated to these kinds of questions.
Constraint 3: Good activation functions are nonlinear
(they squiggle or turn).
The third constraint requires a bit of recollection back to chapter 6. Remember sometimes
correlation? In order to create it, you had to allow the neurons to selectively correlate to
input neurons such that a very negative signal from one input into a neuron could reduce
how much it correlated to any input (by forcing the neuron to drop to 0, in the case of relu).
As it turns out, this phenomenon is facilitated by any function that curves. Functions that
look like straight lines, on the other hand, scale the weighted average coming in. Scaling
something (multiplying it by a constant like 2) doesn’t affect how correlated a neuron is to
its various inputs. It makes the collective correlation that’s represented louder or softer. But
the activation doesn’t allow one weight to affect how correlated the neuron is to the other
weights. What you really want is selective correlation. Given a neuron with an activation
function, you want one incoming signal to be able to increase or decrease how correlated the
neuron is to all the other incoming signals. All curved lines do this (to varying degrees, as
you’ll see).
Thus, the function shown here on the left is considered a linear function, whereas the one
on the right is considered nonlinear and will usually make for a better activation function
(there are exceptions, which we’ll discuss later).
(Figures: left, the linear function y = (2 * x) + 5; right, the nonlinear y = relu(x).)
Constraint 4: Good activation functions (and their derivatives)
should be efficiently computable.
This one is pretty simple. You’ll be calling this function a lot (sometimes billions of times),
so you don’t want it to be too slow to compute. Many recent activation functions have
become popular because they’re so easy to compute at the expense of their expressiveness
(relu is a great example of this).
Standard hidden-layer activation functions
Of the infinite possible functions, which ones are most
commonly used?
Even with these constraints, it should be clear that an infinite (possibly transfinite?) number
of functions could be used as activation functions. The last few years have seen a lot of
progress in state-of-the-art activations. But there’s still a relatively small list of activations
that account for the vast majority of activation needs, and improvements on them have been
minute in most cases.
sigmoid is the bread-and-butter
activation.
sigmoid is great because it smoothly squishes
the infinite amount of input to an output
between 0 and 1. In many circumstances, this
lets you interpret the output of any individual
neuron as a probability. Thus, people use this
nonlinearity both in hidden layers and output
layers.
(Figure: the sigmoid curve. Image: Wikipedia)
tanh is better than sigmoid for
hidden layers.
Here’s the cool thing about tanh. Remember
modeling selective correlation? Well, sigmoid
gives varying degrees of positive correlation.
That’s nice. tanh is the same as sigmoid except
it’s between –1 and 1!
This means it can also throw in some negative
correlation. Although it isn’t that useful for
output layers (unless the data you’re predicting
goes between –1 and 1), this aspect of negative
correlation is powerful for hidden layers;
on many problems, tanh will outperform
sigmoid in hidden layers.
(Figure: the tanh curve. Image: Wolfram Alpha)
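Both curves are one-liners in NumPy; here's a minimal sketch of the two functions just described:

```python
import numpy as np

def sigmoid(x):
    # smoothly squashes any input into (0, 1)
    return 1 / (1 + np.exp(-x))

def tanh(x):
    # like sigmoid, but squashes into (-1, 1),
    # so hidden nodes can also express negative correlation
    return np.tanh(x)
```

Note that both are centered at 0: sigmoid(0) is 0.5, while tanh(0) is 0.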
Standard output layer activation functions
Choosing the best one depends on what you’re trying to predict.
It turns out that what’s best for hidden-layer activation functions can be quite different from
what’s best for output-layer activation functions, especially when it comes to classification.
Broadly speaking, there are three major types of output layer.
Configuration 1: Predicting raw data values
(no activation function)
This is perhaps the most straightforward but least common type of output layer. In some
cases, people want to train a neural network to transform one matrix of numbers into
another matrix of numbers, where the range of the output (difference between lowest and
highest values) is something other than a probability. One example might be predicting the
average temperature in Colorado given the temperature in the surrounding states.
The main thing to focus on here is ensuring that the output nonlinearity can predict the
right answers. In this case, a sigmoid or tanh would be inappropriate because it forces every
prediction to be between 0 and 1 (you want to predict any temperature, not just between 0
and 1). If I were training a network to do this prediction, I’d very likely train the network
without an activation function on the output.
Configuration 2: Predicting unrelated yes/no
probabilities (sigmoid)
You’ll often want to make multiple binary probabilities in one neural network. We did this
in the “Gradient descent with multiple inputs and outputs” section of chapter 5, predicting
whether the team would win, whether there would be injuries, and the morale of the team
(happy or sad) based on the input data.
As an aside, when a neural network has hidden layers, predicting multiple things at once can
be beneficial. Often the network will learn something when predicting one label that will be
useful to one of the other labels. For example, if the network got really good at predicting
whether the team would win ballgames, the same hidden layer would likely be very useful
for predicting whether the team would be happy or sad. But the network might have a
harder time predicting happiness or sadness without this extra signal. This tends to vary
greatly from problem to problem, but it’s good to keep in mind.
In these instances, it’s best to use the sigmoid activation function, because it models
individual probabilities separately for each output node.
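As a sketch of this configuration (the raw scores below are invented), sigmoid simply squashes each output node independently:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# hypothetical raw output-layer sums for: win?, injuries?, happy?
raw_scores = np.array([2.0, -1.0, 0.5])

# each entry becomes its own independent yes/no probability
probabilities = sigmoid(raw_scores)
```

Unlike softmax (coming up next), these probabilities don't need to sum to 1; each question gets its own answer.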
Configuration 3: Predicting which-one probabilities (softmax)
By far the most common use case in neural networks is predicting a single label out of
many. For example, in the MNIST digit classifier, you want to predict which number is in
the image. You know ahead of time that the image can’t be more than one number. You can
train this network with a sigmoid activation function and declare that the highest output
probability is the most likely. This will work reasonably well. But it’s far better to have an
activation function that models the idea that “The more likely it’s one label, the less likely it’s
any of the other labels.”
Why do we like this phenomenon? Consider how weight updates are performed. Let’s say the
MNIST digit classifier should predict that the image is a 9. Also say that the raw weighted sums
going into the final layer (before applying an activation function) are the following values:
Raw dot product values:

      0     1     2     3     4     5     6     7     8     9
    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   100
The network’s raw input to the last layer predicts a 0 for every node but 9, where it predicts
100. You might call this perfect. Let’s see what happens when these numbers are run through
a sigmoid activation function:
sigmoid:

    .50   .50   .50   .50   .50   .50   .50   .50   .50   .99
Strangely, the network seems less sure now: 9 is still the highest, but the network seems to
think there’s a 50% chance that it could be any of the other numbers. Weird! softmax, on the
other hand, interprets the input very differently:
softmax:

    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0
This looks great. Not only is 9 the highest, but the network doesn’t even suspect it’s any of
the other possible MNIST digits. This might seem like a theoretical flaw of sigmoid, but it
can have serious consequences when you backpropagate. Consider how the mean squared
error is calculated on the sigmoid output. In theory, the network is predicting nearly
perfectly, right? Surely it won’t backprop much error. Not so for sigmoid:
sigmoid MSE:

    .25   .25   .25   .25   .25   .25   .25   .25   .25   .00
Look at all the error! These weights are in for a massive weight update even though the
network predicted perfectly. Why? For sigmoid to reach 0 error, it doesn’t just have to
predict the highest positive number for the true output; it also has to predict a 0 everywhere
else. Where softmax asks, “Which digit seems like the best fit for this input?” sigmoid says,
“You better believe that it’s only digit 9 and doesn’t have anything in common with the other
MNIST digits.”
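You can reproduce those error numbers directly. Here's a small sketch using the hypothetical raw values from earlier:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

raw = np.array([0.0] * 9 + [100.0])      # "perfect" raw predictions for a 9
target = np.array([0.0] * 9 + [1.0])     # one-hot label for the digit 9

output = sigmoid(raw)                    # [.50, .50, ..., .50, ~1.0]
squared_error = (target - output) ** 2   # 0.25 on every wrong label, ~0 on 9
```

Even though the raw prediction is as confident as it could be, sigmoid plus mean squared error still reports 0.25 of error on each of the nine wrong labels.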
The core issue: Inputs have similarity
Different numbers share characteristics. It’s good to let the
network believe that.
MNIST digits aren’t all completely different: they have
overlapping pixel values. The average 2 shares quite a
bit in common with the average 3.
(Figure: the average 2 and the average 3. Similar strokes!)
Why is this important? Well, as a general rule, similar
inputs create similar outputs. When you take some
numbers and multiply them by a matrix, if the starting
numbers are pretty similar, the ending numbers will be
pretty similar.
Consider the 2 and 3 shown here. If we forward propagate the 2 and a small amount of
probability accidentally goes to the label 3, what does it mean for the network to consider
this a big mistake and respond with a big weight update? It will penalize the network for
recognizing a 2 by anything other than features that are exclusively related to 2s. It penalizes
the network for recognizing a 2 based on, say, the top curve. Why? Because 2 and 3 share
the same curve at the top of the image. Training with sigmoid would penalize the network
for trying to predict a 2 based on this input, because by doing so it would be looking for the
same input it does for 3s. Thus, when a 3 came along, the 2 label would get some probability
(because part of the image looks 2ish).
What’s the side effect? Most images share lots of pixels in the
middle of images, so the network will start trying to focus on the
edges. Consider the 2-detector node weights shown at right.
See how muddy the middle of the image is? The heaviest weights
are the end points of the 2 toward the edge of the image. On one
hand, these are probably the best individual indicators of a 2, but
the best overall is a network that sees the entire shape for what it
is. These individual indicators can be accidentally triggered by a 3
that’s slightly off-center or tilted the wrong way. The network isn’t
learning the true essence of a 2 because it needs to learn 2 and not
1, not 3, not 4, and so on.
We want an output activation that won’t penalize labels that are similar. Instead, we want
it to pay attention to all the information that can be indicative of any potential input. It’s
also nice that a softmax’s probabilities always sum to 1. You can interpret any individual
prediction as a global probability that the prediction is a particular label. softmax works
better in both theory and practice.
softmax computation
softmax raises each input value exponentially and then
divides by the layer’s sum.
Let’s see a softmax computation on the neural network’s hypothetical output values from
earlier. I’ll show them here again so you can see the input to softmax:
Raw dot product values:

      0     1     2     3     4     5     6     7     8     9
    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   100
To compute a softmax on the whole layer, first raise each
value exponentially. For each value x, compute e to the
power of x (e is a special number ~2.71828…). The value
of e^x is shown on the right.
Notice that it turns every prediction into a positive
number, where negative numbers turn into very small
positive numbers, and big numbers turn into very big
numbers. (If you’ve heard of exponential growth, it was
likely talking about this function or one very similar to it.)
e^x:

      0     1     2     3     4     5     6     7     8     9
    1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   2.688 * 10^43
In short, all the 0s turn to 1s (because 1 is the y intercept of e^x), and the 100 turns into a
massive number (2 followed by 43 zeros). If there were any negative numbers, they turned
into something between 0 and 1. The next step is to sum all the nodes in the layer and divide
each value in the layer by that sum. This effectively makes every number 0 except the value
for label 9.
softmax:

    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0
The nice thing about softmax is that the higher the network predicts one value, the lower it
predicts all the others. It increases what is called the sharpness of attenuation. It encourages
the network to predict one output with very high probability.
To adjust how aggressively it does this, use numbers slightly higher or lower than e when
exponentiating. Lower numbers will result in lower attenuation, and higher numbers will
result in higher attenuation. But most people just stick with e.
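The whole computation is a few lines of NumPy. One note: subtracting the maximum before exponentiating isn't mentioned in the text, but it's a standard trick to keep np.exp from overflowing on big inputs like 100; it doesn't change the result.

```python
import numpy as np

def softmax(x):
    # subtract the max so np.exp never overflows (same answer either way)
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

raw = np.array([0.0] * 9 + [100.0])
probs = softmax(raw)   # ~0.0 everywhere except ~1.0 at label 9
```

The output always sums to 1, so each entry reads directly as a probability.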
Activation installation instructions
How do you add your favorite activation function to any layer?
Now that we’ve covered a wide variety of activation functions and explained their usefulness in
hidden and output layers of neural networks, let’s talk about the proper way to install one into
a neural network. Fortunately, you’ve already seen an example of how to use a nonlinearity
in your first deep neural network: you added a relu activation function to the hidden layer.
Adding this to forward propagation was relatively straightforward. You took what layer_1
would have been (without an activation) and applied the relu function to each value:
layer_0 = images[i:i+1]
layer_1 = relu(np.dot(layer_0,weights_0_1))
layer_2 = np.dot(layer_1,weights_1_2)
There’s a bit of lingo here to remember. The input to a layer refers to the value before the
nonlinearity. In this case, the input to layer_1 is np.dot(layer_0,weights_0_1). This
isn’t to be confused with the previous layer, layer_0.
Adding an activation function to a layer in forward propagation is relatively
straightforward. But properly compensating for the activation function in
backpropagation is a bit more nuanced.
In chapter 6, we performed an interesting operation to create the layer_1_delta variable.
Wherever relu had forced a layer_1 value to be 0, we also multiplied the delta by 0.
The reasoning at the time was, “Because a layer_1 value of 0 had no effect on the output
prediction, it shouldn’t have any impact on the weight update either. It wasn’t responsible
for the error.” This is the extreme form of a more nuanced property. Consider the shape of
the relu function.
Because the purpose of delta at this point is to tell earlier layers "make my input higher or lower next time," this delta is very useful. It modifies the delta backpropagated from the following layer to take into account whether this node contributed to the error.

The slope of relu for positive numbers is exactly 1. The slope of relu for negative numbers is exactly 0. Modifying the input to this function (by a tiny amount) will have a 1:1 effect if it was predicting positively, and will have a 0:1 effect (none) if it was predicting negatively. This slope is a measure of how much the output of relu will change given a change in its input.

[Figure: y = relu(x), plotted with x (input) on the horizontal axis and y (output) on the vertical axis.]
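You can verify this 1:1-or-nothing behavior numerically. Here's a quick sketch (using the same relu as before) that nudges the input by a tiny amount and measures the resulting change in the output:

```python
def relu(x):
    return (x >= 0) * x

# Finite-difference estimate of relu's slope at a point:
# a tiny nudge to a positive input has a 1:1 effect on the output,
# while the same nudge to a negative input has no effect at all.
eps = 1e-6
slope_pos = (relu(2.0 + eps) - relu(2.0)) / eps    # ~1.0
slope_neg = (relu(-2.0 + eps) - relu(-2.0)) / eps  # 0.0
print(slope_pos, slope_neg)
```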
Thus, when you backpropagate, in order to generate layer_1_delta, multiply the
backpropagated delta from layer_2 (layer_2_delta.dot(weights_1_2.T)) by the
slope of relu at the point predicted in forward propagation. For some deltas the slope is 1
(positive numbers), and for others it’s 0 (negative numbers):
error += np.sum((labels[i:i+1] - layer_2) ** 2)
correct_cnt += int(np.argmax(layer_2) == \
                   np.argmax(labels[i:i+1]))

layer_2_delta = (labels[i:i+1] - layer_2)
layer_1_delta = layer_2_delta.dot(weights_1_2.T) \
                * relu2deriv(layer_1)

weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)
def relu(x):
    return (x >= 0) * x     # returns x if x > 0; returns 0 otherwise

def relu2deriv(output):
    return output >= 0      # returns 1 for input > 0; returns 0 otherwise
relu2deriv is a special function that can take the output of relu and calculate the slope of relu at that point (it does this for all the values in the output vector). This raises the question: how do you make similar adjustments for the other nonlinearities that aren't relu?
Consider relu and sigmoid:

[Figure: y = relu(x) and y = sigmoid(x) plotted on the same axes, x (input) versus y (output).]
The important thing in these figures is that the slope is an indicator of how much a tiny change to the input affects the output. You want to modify the incoming delta (from the following layer) to take into account whether a weight update before this node would have any effect. Remember, the end goal is to adjust weights to reduce error. This step encourages the network to leave weights alone if adjusting them will have little to no effect, and it does so by multiplying the incoming delta by the slope. It's no different for sigmoid.
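Here's a quick sketch of that sensitivity. sigmoid2deriv computes the slope directly from the forward-pass output (more on that shortly); the slope peaks at an input of 0 and shrinks toward 0 for very positive or very negative inputs:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid2deriv(output):
    # slope of sigmoid, computed from its forward-pass output
    return output * (1 - output)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
slopes = sigmoid2deriv(sigmoid(x))
print(slopes)  # peaks at 0.25 for x = 0; nearly 0 at the extremes
```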
Multiplying delta by the slope
To compute layer_delta, multiply the backpropagated delta by
the layer’s slope.
layer_1_delta[0] represents how much higher or lower the first hidden node of layer 1 should be in order to reduce the error of the network (for a particular training example). When there's no nonlinearity, this is the weighted average delta of layer_2.

[Figure: a network with input, hidden, and prediction layers; layer_1_delta[0] informs the weights feeding the first hidden node.]
But the end goal of delta on a neuron is to inform the weights whether they should move. If moving them would have no effect, they (as a group) should be left alone. This is obvious for relu, which is either on or off. sigmoid is, perhaps, more nuanced.

[Figure: y = relu(x) and y = sigmoid(x), x (input) versus y (output).]
Consider a single sigmoid neuron. sigmoid’s sensitivity to change in the input slowly
increases as the input approaches 0 from either direction. But very positive and very
negative inputs approach a slope of very near 0. Thus, as the input becomes very positive or
very negative, small changes to the incoming weights become less relevant to the neuron’s
error at this training example. In broader terms, many hidden nodes are irrelevant to the
accurate prediction of a 2 (perhaps they’re used only for 8s). You shouldn’t mess with their
weights too much, because you could corrupt their usefulness elsewhere.
Conversely, this also creates a notion of stickiness. Weights that have previously been updated a lot in one direction (for similar training examples) confidently predict a high value or a low value. These nonlinearities help make it harder for occasional erroneous training examples to corrupt intelligence that has been reinforced many times.
Converting output to slope (derivative)
Most great activations can convert their output to their slope.
(Efficiency win!)
Now that you know that adding an activation to a layer changes how to compute delta for
that layer, let’s discuss how the industry does this efficiently. The new operation necessary is
the computation of the derivative of whatever nonlinearity was used.
Most nonlinearities (all the popular ones) use a method of computing a derivative that will
seem surprising to those of you who are familiar with calculus. Instead of computing the
derivative at a certain point on its curve the normal way, most great activation functions
have a means by which the output of the layer (at forward propagation) can be used to
compute the derivative. This has become the standard practice for computing derivatives in
neural networks, and it’s quite handy.
Following is a small table of the functions you've seen so far, paired with their derivatives. input is a NumPy vector (corresponding to the input to a layer). output is the prediction of the layer. deriv is the vector of activation derivatives, corresponding to the slope of the activation at each node. true is the vector of true values (typically 1 at the correct label position, 0 everywhere else).
Function   Forward prop                         Backprop delta

relu       ones_and_zeros = (input > 0)         mask = output > 0
           output = input * ones_and_zeros      deriv = mask

sigmoid    output = 1/(1 + np.exp(-input))      deriv = output * (1 - output)

tanh       output = np.tanh(input)              deriv = 1 - (output ** 2)

softmax    temp = np.exp(input)                 temp = (output - true)
           output = temp / np.sum(temp)         deriv = temp / len(true)
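These output-to-slope shortcuts are easy to sanity check against a plain numerical derivative. Here's a quick check (not part of the network code; softmax is left out because its table entry folds in the loss):

```python
import numpy as np

def check(forward, output_to_deriv, x, eps=1e-6):
    # Compare the output-based slope with a centered finite difference.
    analytic = output_to_deriv(forward(x))
    numeric = (forward(x + eps) - forward(x - eps)) / (2 * eps)
    return np.allclose(analytic, numeric, atol=1e-4)

x = np.array([-2.0, -0.5, 0.5, 2.0])
ok_sigmoid = check(lambda v: 1 / (1 + np.exp(-v)), lambda o: o * (1 - o), x)
ok_tanh = check(np.tanh, lambda o: 1 - o**2, x)
print(ok_sigmoid, ok_tanh)  # True True
```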
Note that the delta computation for softmax is special because it’s used only for the last
layer. There’s a bit more going on (theoretically) than we have time to discuss here. For now,
let’s install some better activation functions in the MNIST classification network.
Upgrading the MNIST network
Let’s upgrade the MNIST network to reflect what you’ve learned.
Theoretically, the tanh function should make for a better hidden-layer activation, and
softmax should make for a better output-layer activation function. When we test them, they
do in fact reach a higher score. But things aren’t always as simple as they seem.
I had to make a couple of adjustments in order to tune the network properly with these new activations. For tanh, I had to reduce the standard deviation of the incoming weights. Remember that you initialize the weights randomly. np.random.random creates a random matrix with numbers spread uniformly between 0 and 1. By multiplying by 0.2 and subtracting 0.1, you rescale this random range to be between –0.1 and 0.1. This worked great for relu but is less optimal for tanh. tanh likes a narrower random initialization, so I adjusted it to be between –0.01 and 0.01.
I also removed the error calculation, because we’re not ready for that yet. Technically,
softmax is best used with an error function called cross entropy. This network properly
computes layer_2_delta for this error measure, but because we haven’t analyzed why this
error function is advantageous, I removed the lines to compute it.
Finally, as with almost all changes made to a neural network, I had to revisit the alpha
tuning. I found that a much higher alpha was required to reach a good score within 300
iterations. And voilà! As expected, the network reached a higher testing accuracy of 87%.
import numpy as np, sys
np.random.seed(1)

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255,
                  y_train[0:1000])

one_hot_labels = np.zeros((len(labels), 10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
alpha, iterations, hidden_size = (2, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100

weights_0_1 = 0.02*np.random.random((pixels_per_image, hidden_size)) - 0.01
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = ((i * batch_size), ((i+1) * batch_size))
        layer_0 = images[batch_start:batch_end]
        layer_1 = tanh(np.dot(layer_0, weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.dot(layer_1, weights_1_2))

        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == \
                np.argmax(labels[batch_start+k:batch_start+k+1]))

        layer_2_delta = (labels[batch_start:batch_end] - layer_2) \
                        / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) \
                        * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    test_correct_cnt = 0
    for i in range(len(test_images)):
        layer_0 = test_images[i:i+1]
        layer_1 = tanh(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        test_correct_cnt += int(np.argmax(layer_2) == \
                                np.argmax(test_labels[i:i+1]))
    if(j % 10 == 0):
        sys.stdout.write("\n" + "I:" + str(j) + \
            " Test-Acc:" + str(test_correct_cnt/float(len(test_images))) + \
            " Train-Acc:" + str(correct_cnt/float(len(images))))
I:0 Test-Acc:0.394 Train-Acc:0.156
I:10 Test-Acc:0.6867 Train-Acc:0.723
I:20 Test-Acc:0.7025 Train-Acc:0.732
I:30 Test-Acc:0.734 Train-Acc:0.763
I:40 Test-Acc:0.7663 Train-Acc:0.794
I:50 Test-Acc:0.7913 Train-Acc:0.819
I:60 Test-Acc:0.8102 Train-Acc:0.849
I:70 Test-Acc:0.8228 Train-Acc:0.864
I:80 Test-Acc:0.831 Train-Acc:0.867
I:90 Test-Acc:0.8364 Train-Acc:0.885
I:100 Test-Acc:0.8407 Train-Acc:0.88
I:110 Test-Acc:0.845 Train-Acc:0.891
I:120 Test-Acc:0.8481 Train-Acc:0.90
I:130 Test-Acc:0.8505 Train-Acc:0.90
I:140 Test-Acc:0.8526 Train-Acc:0.90
I:150 Test-Acc:0.8555 Train-Acc:0.914
I:160 Test-Acc:0.8577 Train-Acc:0.925
I:170 Test-Acc:0.8596 Train-Acc:0.918
I:180 Test-Acc:0.8619 Train-Acc:0.933
I:190 Test-Acc:0.863 Train-Acc:0.933
I:200 Test-Acc:0.8642 Train-Acc:0.926
I:210 Test-Acc:0.8653 Train-Acc:0.931
I:220 Test-Acc:0.8668 Train-Acc:0.93
I:230 Test-Acc:0.8672 Train-Acc:0.937
I:240 Test-Acc:0.8681 Train-Acc:0.938
I:250 Test-Acc:0.8687 Train-Acc:0.937
I:260 Test-Acc:0.8684 Train-Acc:0.945
I:270 Test-Acc:0.8703 Train-Acc:0.951
I:280 Test-Acc:0.8699 Train-Acc:0.949
I:290 Test-Acc:0.8701 Train-Acc:0.94
Chapter 10
Neural learning about edges and corners: intro to convolutional neural networks

In this chapter
• Reusing weights in multiple places
• The convolutional layer

The pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster.
—Geoffrey Hinton, from "Ask Me Anything" on Reddit
Reusing weights in multiple places
If you need to detect the same feature in multiple places,
use the same weights!
The greatest challenge in neural networks is that of overfitting, when a neural network memorizes a dataset instead of learning useful abstractions that generalize to unseen data. In other words, the neural network learns to predict based on noise in the dataset as opposed to relying on the fundamental signal (remember the analogy about a fork embedded in clay?).

[Figure caption: Similar strokes!]

Overfitting is often caused by having more parameters than necessary to learn a specific dataset. In this case, the network has so many parameters that it can memorize every fine-grained detail in the training dataset (neural network: "Ah. I see we have image number 363 again. This was the number 2.") instead of learning high-level abstractions (neural network: "Hmm, it's got a swooping top, a swirl at the bottom left, and a tail on the right; it must be a 2."). When neural networks have lots of parameters but not very many training examples, overfitting is difficult to avoid.
We covered this topic extensively in chapter 8, when we looked at regularization as a means
of countering overfitting. But regularization isn’t the only technique (or even the most ideal
technique) to prevent overfitting.
As I mentioned, overfitting is concerned with the ratio between the number of weights in the model and the number of datapoints it has to learn those weights from. Thus, there's a better method to counter overfitting. When possible, it's preferable to use something loosely defined as structure.

Structure is when you selectively reuse weights for multiple purposes in a neural network because you believe the same pattern needs to be detected in multiple places. As you'll see, this can significantly reduce overfitting and lead to much more accurate models, because it reduces the weight-to-data ratio.
But whereas normally removing parameters makes the model less expressive (less able
to learn patterns), if you’re clever in where you reuse weights, the model can be equally
expressive but more robust to overfitting. Perhaps surprisingly, this technique also tends
to make the model smaller (because there are fewer actual parameters to store). The most
famous and widely used structure in neural networks is called a convolution, and when used
as a layer it’s called a convolutional layer.
The convolutional layer
Lots of very small linear layers are reused in every position,
instead of a single big one.
The core idea behind a convolutional layer is that instead of having a large, dense linear
layer with a connection from every input to every output, you instead have lots of very
small linear layers, usually with fewer than 25 inputs and a single output, which you
use in every input position. Each mini-layer is called a convolutional kernel, but it’s
really nothing more than a baby linear layer with a small number of inputs and a
single output.
[Figure: a single 3 × 3 convolutional kernel's predictions over a small image; it predicts 0 in every position except one, where it predicts 1.]

Shown here is a single 3 × 3 convolutional kernel. It will predict in its current location, move one pixel to the right, then predict again, move another pixel to the right, and so on. Once it has scanned all the way across the image, it will move down a single pixel and scan back across to the left, repeating until it has made a prediction in every possible position within the image. The result will be a smaller square of kernel predictions, which is used as input to the next layer. Convolutional layers usually have many kernels.
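Here's the arithmetic behind that smaller square: sliding a k × k kernel one pixel at a time across an n × n image with no padding yields n − k + 1 positions per axis, which is where the 6 × 6 grid for an 8 × 8 image comes from. (The NumPy listing later in this chapter loops over range(n − k), which visits one fewer position per axis; a small simplification.)

```python
# Positions visited by a k x k kernel sliding one pixel at a time
# across an n x n image with no padding: (n - k + 1) per axis.
def output_size(n, k):
    return n - k + 1

print(output_size(8, 3))   # 6  (the 6 x 6 grid from the figures)
print(output_size(28, 3))  # 26 (for a full MNIST image)
```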
The figures on this page show four different convolutional kernels processing the same 8 × 8 image of a 2. Each kernel results in a 6 × 6 prediction matrix. The result of the convolutional layer with four 3 × 3 kernels is four 6 × 6 prediction matrices. You can either sum these matrices elementwise (sum pooling), take the mean elementwise (mean pooling), or compute the elementwise maximum value (max pooling).

The last version turns out to be the most popular: for each position, look into each of the four kernels' outputs, find the max, and copy it into a final 6 × 6 matrix. This final matrix (and only this matrix) is then forward propagated into the next layers.

[Figure: the max value of each kernel's output forms a meaningful representation and is passed to the next layer.]
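Here's a sketch of those three pooling options in NumPy. The four 2 × 2 matrices are made-up kernel outputs, stacked along axis 0 so each elementwise reduction runs across the kernels:

```python
import numpy as np

# Four kernels' outputs over the same image, stacked along axis 0.
# (Hypothetical 2 x 2 outputs, just to illustrate the three options.)
kernel_outputs = np.array([
    [[0.1, 0.9], [0.0, 0.2]],
    [[0.4, 0.1], [0.8, 0.3]],
    [[0.2, 0.5], [0.1, 0.7]],
    [[0.6, 0.0], [0.3, 0.1]],
])

sum_pool  = kernel_outputs.sum(axis=0)   # elementwise sum across kernels
mean_pool = kernel_outputs.mean(axis=0)  # elementwise mean
max_pool  = kernel_outputs.max(axis=0)   # elementwise max (most popular)
print(max_pool)
```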
There are a few things to notice in these figures. First, one kernel forward propagates a 1 only when it's focused on a horizontal line segment. Another forward propagates a 1 only when it's focused on a diagonal line pointing upward and to the right. And one of the kernels didn't identify any patterns it was trained to predict.

[Figure: outputs from each of the four kernels in each position; four convolutional kernels predicting over the same 2.]

It's important to realize that this technique allows each kernel to learn a particular pattern and then search for the existence of that pattern somewhere in the image. A single, small set of weights can train over a much larger set of training examples, because even though the dataset hasn't changed, each mini-kernel is forward propagated multiple times on multiple segments of data, thus changing the ratio of weights to datapoints on which those weights are being trained. This has a powerful impact on the network, drastically reducing its ability to overfit to training data and increasing its ability to generalize.
A simple implementation in NumPy
Just think mini-linear layers, and you already know what you
need to know.
Let’s start with forward propagation. This method shows how to select a subregion in a batch
of images in NumPy. Note that it selects the same subregion for the entire batch:
def get_image_section(layer, row_from, row_to, col_from, col_to):
    section = layer[:, row_from:row_to, col_from:col_to]
    return section.reshape(-1, 1, row_to-row_from, col_to-col_from)
Now, let’s see how this method is used. Because it selects a subsection of a batch of input
images, you need to call it multiple times (on every location within the image). Such a for
loop looks something like this:
layer_0 = images[batch_start:batch_end]
layer_0 = layer_0.reshape(layer_0.shape[0], 28, 28)

sects = list()
for row_start in range(layer_0.shape[1] - kernel_rows):
    for col_start in range(layer_0.shape[2] - kernel_cols):
        sect = get_image_section(layer_0,
                                 row_start,
                                 row_start+kernel_rows,
                                 col_start,
                                 col_start+kernel_cols)
        sects.append(sect)

expanded_input = np.concatenate(sects, axis=1)
es = expanded_input.shape
flattened_input = expanded_input.reshape(es[0]*es[1], -1)
In this code, layer_0 is a batch of images 28 × 28 in shape. The for loop iterates through
every (kernel_rows × kernel_cols) subregion in the images and puts them into a list
called sects. This list of sections is then concatenated and reshaped in a peculiar way.
Pretend (for now) that each individual subregion is its own image. Thus, if you had a batch
size of 8 images, and 100 subregions per image, you’d pretend it was a batch size of 800
smaller images. Forward propagating them through a linear layer with one output neuron is
the same as predicting that linear layer over every subregion in every batch (pause and make
sure you get this).
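Here's the shape arithmetic behind that "pretend" step, with toy dimensions (not the chapter's actual batch size):

```python
import numpy as np

# With a batch of 8 images and 100 subregions per image, the reshape
# treats the data as a batch of 800 smaller images, each flattened
# to kernel_rows * kernel_cols = 9 values.
batch, regions, kernel_rows, kernel_cols = 8, 100, 3, 3
expanded_input = np.random.random((batch, regions, kernel_rows, kernel_cols))
es = expanded_input.shape
flattened_input = expanded_input.reshape(es[0] * es[1], -1)
print(flattened_input.shape)  # (800, 9)
```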
If you instead forward propagate using a linear layer with n output neurons, it will generate the
outputs that are the same as predicting n linear layers (kernels) in every input position of the
image. You do it this way because it makes the code both simpler and faster:
kernels = np.random.random((kernel_rows*kernel_cols,num_kernels))
...
kernel_output = flattened_input.dot(kernels)
The following listing shows the entire NumPy implementation:
import numpy as np, sys
np.random.seed(1)

from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

images, labels = (x_train[0:1000].reshape(1000, 28*28) / 255,
                  y_train[0:1000])

one_hot_labels = np.zeros((len(labels), 10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1

def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

alpha, iterations = (2, 300)
pixels_per_image, num_labels = (784, 10)
batch_size = 128

input_rows = 28
input_cols = 28

kernel_rows = 3
kernel_cols = 3
num_kernels = 16

hidden_size = ((input_rows - kernel_rows) *
               (input_cols - kernel_cols)) * num_kernels

kernels = 0.02*np.random.random((kernel_rows*kernel_cols,
                                 num_kernels)) - 0.01

weights_1_2 = 0.2*np.random.random((hidden_size,
                                    num_labels)) - 0.1

def get_image_section(layer, row_from, row_to, col_from, col_to):
    section = layer[:, row_from:row_to, col_from:col_to]
    return section.reshape(-1, 1, row_to-row_from, col_to-col_from)
for j in range(iterations):
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = ((i * batch_size), ((i+1) * batch_size))
        layer_0 = images[batch_start:batch_end]
        layer_0 = layer_0.reshape(layer_0.shape[0], 28, 28)

        sects = list()
        for row_start in range(layer_0.shape[1] - kernel_rows):
            for col_start in range(layer_0.shape[2] - kernel_cols):
                sect = get_image_section(layer_0,
                                         row_start,
                                         row_start+kernel_rows,
                                         col_start,
                                         col_start+kernel_cols)
                sects.append(sect)

        expanded_input = np.concatenate(sects, axis=1)
        es = expanded_input.shape
        flattened_input = expanded_input.reshape(es[0]*es[1], -1)

        kernel_output = flattened_input.dot(kernels)
        layer_1 = tanh(kernel_output.reshape(es[0], -1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.dot(layer_1, weights_1_2))

        for k in range(batch_size):
            labelset = labels[batch_start+k:batch_start+k+1]
            _inc = int(np.argmax(layer_2[k:k+1]) ==
                       np.argmax(labelset))
            correct_cnt += _inc

        layer_2_delta = (labels[batch_start:batch_end] - layer_2) \
                        / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * \
                        tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        l1d_reshape = layer_1_delta.reshape(kernel_output.shape)
        k_update = flattened_input.T.dot(l1d_reshape)
        kernels -= alpha * k_update
    test_correct_cnt = 0
    for i in range(len(test_images)):
        layer_0 = test_images[i:i+1]
        layer_0 = layer_0.reshape(layer_0.shape[0], 28, 28)

        sects = list()
        for row_start in range(layer_0.shape[1] - kernel_rows):
            for col_start in range(layer_0.shape[2] - kernel_cols):
                sect = get_image_section(layer_0,
                                         row_start,
                                         row_start+kernel_rows,
                                         col_start,
                                         col_start+kernel_cols)
                sects.append(sect)

        expanded_input = np.concatenate(sects, axis=1)
        es = expanded_input.shape
        flattened_input = expanded_input.reshape(es[0]*es[1], -1)

        kernel_output = flattened_input.dot(kernels)
        layer_1 = tanh(kernel_output.reshape(es[0], -1))
        layer_2 = np.dot(layer_1, weights_1_2)

        test_correct_cnt += int(np.argmax(layer_2) ==
                                np.argmax(test_labels[i:i+1]))
    if(j % 1 == 0):
        sys.stdout.write("\n" + \
            "I:" + str(j) + \
            " Test-Acc:" + str(test_correct_cnt/float(len(test_images))) + \
            " Train-Acc:" + str(correct_cnt/float(len(images))))
I:0 Test-Acc:0.0288 Train-Acc:0.055
I:1 Test-Acc:0.0273 Train-Acc:0.037
I:2 Test-Acc:0.028 Train-Acc:0.037
I:3 Test-Acc:0.0292 Train-Acc:0.04
I:4 Test-Acc:0.0339 Train-Acc:0.046
I:5 Test-Acc:0.0478 Train-Acc:0.068
I:6 Test-Acc:0.076 Train-Acc:0.083
I:7 Test-Acc:0.1316 Train-Acc:0.096
I:8 Test-Acc:0.2137 Train-Acc:0.127
....
I:297 Test-Acc:0.8774 Train-Acc:0.816
I:298 Test-Acc:0.8774 Train-Acc:0.804
I:299 Test-Acc:0.8774 Train-Acc:0.814
As you can see, swapping out the first layer from the network in chapter 9 with a
convolutional layer gives another few percentage points in error reduction. The output of the
convolutional layer (kernel_output) is itself also a series of two-dimensional images (the
output of each kernel in each input position).
Most uses of convolutional layers stack multiple layers on top of each other, such that each
convolutional layer treats the previous as an input image. (Feel free to do this as a personal
project; it will increase accuracy further.)
Stacked convolutional layers are one of the main developments that allowed for very deep
neural networks (and, by extension, the popularization of the phrase deep learning). It can’t
be overstressed that this invention was a landmark moment for the field; without it, we
might still be in the previous AI winter even at the time of writing.
Summary
Reusing weights is one of the most important innovations in
deep learning.
Convolutional neural networks are a more general development than you might realize. The
notion of reusing weights to increase accuracy is hugely important and has an intuitive basis.
Consider what you need to understand in order to detect that a cat is in an image. You first
need to understand colors, then lines and edges, corners and small shapes, and eventually
the combination of such lower-level features that correspond to a cat. Presumably, neural
networks also need to learn about these lower-level features (like lines and edges), and the
intelligence for detecting lines and edges is learned in the weights.
But if you use different weights to analyze different parts of an image, each section of
weights has to independently learn what a line is. Why? Well, if one set of weights looking
at one part of an image learns what a line is, there’s no reason to think that another section
of weights would somehow have the ability to use that information: it’s in a different part
of the network.
Convolutions are about taking advantage of a property of learning. Occasionally, you need
to use the same idea or piece of intelligence in multiple places; and if that’s the case, you
should attempt to use the same weights in those locations. This brings us to one of the most
important ideas in this book. If you don’t learn anything else, learn this:
The structure trick
When a neural network needs to use the same idea in multiple places, endeavor to use the same weights in all those places. This will make those weights more intelligent by giving them more samples to learn from, increasing generalization.
Many of the biggest developments in deep learning over the past five years (some before) are
iterations of this idea. Convolutions, recurrent neural networks (RNNs), word embeddings,
and the recently published capsule networks can all be viewed through this lens. When you
know a network will need the same idea in multiple places, force it to use the same weights
in those places. I fully expect that more deep learning discoveries will continue to be based
on this idea, because it’s challenging to discover new, higher-level abstract ideas that neural
networks could use repeatedly throughout their architecture.
Chapter 11
Neural networks that understand language: king – man + woman == ?

In this chapter
• Natural language processing (NLP)
• Supervised NLP
• Capturing word correlation in input data
• Intro to an embedding layer
• Neural architecture
• Comparing word embeddings
• Filling in the blank
• Meaning is derived from loss
• Word analogies

Man is a slow, sloppy, and brilliant thinker; computers are fast, accurate, and stupid.
—John Pfeiffer, in Fortune, 1961
What does it mean to understand language?
What kinds of predictions do people make about language?
Up until now, we’ve been using neural networks to model image data. But neural
networks can be used to understand a much wider variety of datasets. Exploring new
datasets also teaches us a lot about neural networks in general, because different
datasets often justify different styles of neural network training according to the
challenges hidden in the data.
[Figure: a Venn diagram relating artificial intelligence, machine learning, deep learning, image recognition, and natural language processing.]
We’ll begin this chapter by exploring a much older field that overlaps deep learning:
natural language processing (NLP). This field is dedicated exclusively to the automated
understanding of human language (previously not using deep learning). We’ll discuss the
basics of deep learning’s approach to this field.
Natural language processing (NLP)
NLP is divided into a collection of tasks or challenges.
Perhaps the best way to quickly get to know NLP is to consider a few of the many challenges
the NLP community seeks to solve. Here are a few types of classification problem that are
common to NLP:
• Using the characters of a document to predict where words start and end.
• Using the words of a document to predict where sentences start and end.
• Using the words in a sentence to predict the part of speech for each word.
• Using words in a sentence to predict where phrases start and end.
• Using words in a sentence to predict where named entity (person, place, thing) references
start and end.
• Using sentences in a document to predict which pronouns refer to the same person /
place / thing.
• Using words in a sentence to predict the sentiment of a sentence.
Generally speaking, NLP tasks seek to do one of three things: label a region of text (such as
part-of-speech tagging, sentiment classification, or named-entity recognition); link two or
more regions of text (such as coreference, which tries to answer whether two mentions of
a real-world thing are in fact referencing the same real-world thing, where the real-world
thing is generally a person, place, or some other named entity); or try to fill in missing
information (missing words) based on context.
Perhaps it’s also apparent how machine learning and NLP are deeply intertwined. Until
recently, most state-of-the-art NLP algorithms were advanced, probabilistic, non-parametric
models (not deep learning). But the recent development and popularization of two major
neural algorithms have swept the field of NLP: neural word embeddings and recurrent
neural networks (RNNs).
In this chapter, we’ll build a word-embedding algorithm and demonstrate why it increases
the accuracy of NLP algorithms. In the next chapter, we’ll create a recurrent neural network
and demonstrate why it’s so effective at predicting across sequences.
It’s also worth mentioning the key role that NLP (perhaps using deep learning) plays in the
advancement of artificial intelligence. AI seeks to create machines that can think and engage
with the world as humans do (and beyond). NLP plays a very special role in this endeavor,
because language is the bedrock of conscious logic and communication in humans. As
such, methods by which machines can use and understand language form the foundation of
human-like logic in machines: the foundation of thought.
Supervised NLP
Words go in, and predictions come out.
Perhaps you’ll remember the following figure from chapter 2. Supervised learning is all
about taking “what you know” and transforming it into “what you want to know.” Up until
now, “what you know” has always consisted of numbers in one way or another. But NLP
uses text as input. How do you process it?
[Figure: "what you know" flows through supervised learning to produce "what you want to know."]
Because neural networks only map input numbers to output numbers, the first step is to
convert the text into numerical form. Much as we converted the streetlight dataset, we
need to convert the real-world data (in this case, text) into a matrix the neural network can
consume. As it turns out, how we do this is extremely important!
[Figure: raw text is converted to a matrix of numbers, which flows through supervised learning to produce what you want to know.]
How should we convert text to numbers? Answering that question requires some thought
regarding the problem. Remember, neural networks look for correlation between their
input and output layers. Thus, we want to convert text into numbers in such a way that the
correlation between input and output is most obvious to the network. This will make for
faster training and better generalization.
In order to know what input format makes input/output correlation the most obvious to the
network, we need to know what the input/output dataset looks like. To explore this topic,
let’s take on the challenge of topic classification.
IMDB movie reviews dataset
You can predict whether people post positive or
negative reviews.
The IMDB movie reviews dataset is a collection of review -> rating pairs that often look like
the following (this is an imitation, not pulled from IMDB):
"This movie was terrible! The plot was dry, the acting unconvincing, and I spilled popcorn on my shirt."
Rating: 1 (stars)
The entire dataset consists of around 50,000 of these pairs, where the input reviews are
usually a few sentences and the output ratings are between 1 and 5 stars. People consider
it a sentiment dataset because the stars are indicative of the overall sentiment of the movie
review. But it should be obvious that this sentiment dataset might be very different from
other sentiment datasets, such as product reviews or hospital patient reviews.
You want to train a neural network that can use the input text to make accurate
predictions of the output score. To accomplish this, you must first decide how to turn
the input and output datasets into matrices. Interestingly, the output dataset is a number,
which perhaps makes it an easier place to start. You'll adjust the range of stars to be
between 0 and 1 instead of 1 and 5, so that you can use binary softmax (in this network,
a single sigmoid output unit). That's all you need to do to the output. I'll show an
example shortly.
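Concretely, the output transform is just a linear shift and divide. Here's a minimal sketch (the list of star ratings is made up for illustration):

```python
# Rescale star ratings from the 1-5 range into the 0-1 range
# so they can serve as binary-style prediction targets.
stars = [1, 5, 3, 4]
targets = [(s - 1) / 4.0 for s in stars]
print(targets)  # [0.0, 1.0, 0.5, 0.75]
```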
The input data, however, is a bit trickier. To begin, let’s consider the raw data. It’s a list
of characters. This presents a few problems: not only is the input data text instead of
numbers, but it's variable-length text. So far, our neural networks have always taken
inputs of a fixed size. You'll need to overcome this.
So, the raw input won’t work. The next question to ask is, “What about this data will have
correlation with the output?” Representing that property might work well. For starters, I
wouldn’t expect any characters (in the list of characters) to have any correlation with the
sentiment. You need to think about it differently.
What about the words? Several words in this dataset would have a bit of correlation. I’d
bet that terrible and unconvincing have significant negative correlation with the rating. By
negative, I mean that as they increase in frequency in any input datapoint (any review),
the rating tends to decrease.
Perhaps this property is more general! Perhaps words by themselves (even out of context)
would have significant correlation with sentiment. Let’s explore this further.
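You can check this intuition on a toy example. The following sketch uses a handful of made-up reviews (not the IMDB data) and counts, for each word, how many positive and how many negative reviews contain it; a large gap between the two counts is exactly the kind of correlation the network can exploit:

```python
from collections import defaultdict

# Tiny made-up reviews: (text, label) where 1 = positive, 0 = negative
reviews = [("this movie was terrible and boring", 0),
           ("what a wonderful and moving film", 1),
           ("terrible acting and a boring plot", 0),
           ("a wonderful cast and a moving story", 1)]

pos_count = defaultdict(int)
neg_count = defaultdict(int)
for text, label in reviews:
    for word in set(text.split()):   # presence/absence, not frequency
        if label == 1:
            pos_count[word] += 1
        else:
            neg_count[word] += 1

# Words with a large gap between the two counts carry sentiment signal.
print(pos_count["wonderful"], neg_count["wonderful"])  # 2 0
print(pos_count["terrible"], neg_count["terrible"])    # 0 2
```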
Capturing word correlation in input data
Bag of words: Given a review’s vocabulary, predict the sentiment.
If you observe correlation between the vocabulary of an IMDB review and its rating, then
you can proceed to the next step: creating an input matrix that represents the vocabulary of
a movie review.
What’s commonly done in this case is to create a matrix where each row (vector)
corresponds to each movie review, and each column represents whether a review contains
a particular word in the vocabulary. To create the vector for a review, you calculate the
vocabulary of the review and then put 1 in each corresponding column for that review
and 0s everywhere else. How big are these vectors? Well, if there are 2,000 words, and you
need a place in each vector for each word, each vector will have 2,000 dimensions.
This form of storage, called one-hot encoding, is the most common format for encoding
binary data (the binary presence or absence of an input datapoint among a vocabulary of
possible input datapoints). If the vocabulary were only four words, the one-hot encoding
might look like this:
import numpy as np

onehots = {}
onehots['cat'] = np.array([1,0,0,0])
onehots['the'] = np.array([0,1,0,0])
onehots['dog'] = np.array([0,0,1,0])
onehots['sat'] = np.array([0,0,0,1])

sentence = ['the','cat','sat']
x = onehots[sentence[0]] + \
    onehots[sentence[1]] + \
    onehots[sentence[2]]

print("Sent Encoding:" + str(x))
As you can see, you create a vector for each term in the vocabulary. This allows you to use
simple vector addition to create a vector representing a subset of the total vocabulary (such
as a subset corresponding to the words in a sentence).
"the cat sat"
Output:
Sent Encoding:[1 1 0 1]
1
1
0
1
Note that when you create an embedding for several terms (such as “the cat sat”), you have
multiple options if words occur multiple times. If the phrase were "cat cat cat," you could
either sum the vector for "cat" three times (resulting in [3,0,0,0]) or just take the unique
"cat" a single time (resulting in [1,0,0,0]). The latter typically works better for language.
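Here's a quick sketch of both options, reusing the four-word one-hot vocabulary from the previous listing:

```python
import numpy as np

onehots = {'cat': np.array([1,0,0,0]),
           'the': np.array([0,1,0,0]),
           'dog': np.array([0,0,1,0]),
           'sat': np.array([0,0,0,1])}

phrase = ['cat', 'cat', 'cat']

# Option 1: sum the vector once per occurrence
summed = sum(onehots[w] for w in phrase)       # [3 0 0 0]

# Option 2: count each unique word only once
unique = sum(onehots[w] for w in set(phrase))  # [1 0 0 0]

print(summed, unique)
```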
Predicting movie reviews
With the encoding strategy and the previous network,
you can predict sentiment.
Using the strategy we just identified, you can build a vector for each word in the sentiment
dataset and use the previous two-layer network to predict sentiment. I’ll show you the code,
but I strongly recommend attempting this from memory. Open a new Jupyter notebook,
load in the dataset, build your one-hot vectors, and then build a neural network to predict
the rating of each movie review (positive or negative).
Here’s how I would do the preprocessing step:
import sys

f = open('reviews.txt')
raw_reviews = f.readlines()
f.close()

f = open('labels.txt')
raw_labels = f.readlines()
f.close()

tokens = list(map(lambda x:set(x.split(" ")),raw_reviews))

vocab = set()
for sent in tokens:
    for word in sent:
        if(len(word)>0):
            vocab.add(word)
vocab = list(vocab)

word2index = {}
for i,word in enumerate(vocab):
    word2index[word]=i

input_dataset = list()
for sent in tokens:
    sent_indices = list()
    for word in sent:
        try:
            sent_indices.append(word2index[word])
        except:
            ""
    input_dataset.append(list(set(sent_indices)))

target_dataset = list()
for label in raw_labels:
    if label == 'positive\n':
        target_dataset.append(1)
    else:
        target_dataset.append(0)
Intro to an embedding layer
Here’s one more trick to make the
network faster.
Below is the diagram from the previous neural network, which you'll now use to predict
sentiment. But before that, I want to describe the layer names. The first layer is
the dataset (layer_0). This is followed by what's called a linear layer (weights_0_1).
This is followed by a relu layer (layer_1), another linear layer (weights_1_2), and
then the output, which is the prediction layer. As it turns out, you can take a bit of a
shortcut to layer_1 by replacing the first linear layer (weights_0_1) with an
embedding layer.

[Figure: layer stack for "the cat sat": layer_0 → weights_0_1 → layer_1 → weights_1_2 → layer_2]

Multiplying a matrix by a vector of 1s and 0s is mathematically equivalent to summing
several rows of that matrix. Thus, it's much more efficient to select the relevant rows
of weights_0_1 and sum them as opposed to doing a big vector-matrix multiplication.
Because the sentiment vocabulary is on the order of 70,000 words, most of the
vector-matrix multiplication is spent multiplying 0s in the input vector by different
rows of the matrix before summing them. Selecting the rows corresponding to each word in
a matrix and summing them is much more efficient.

[Figure: one-hot vector-matrix multiplication (layer_0 times weights_0_1) versus a matrix
row sum (the rows for "the," "cat," and "sat" summed to produce layer_1)]

Using this process of selecting rows and performing a sum (or average) means treating
the first linear layer (weights_0_1) as an embedding layer. Structurally, they're
identical (layer_1 is exactly the same using either method for forward propagation).
The only difference is that summing a small number of rows is much faster.
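You can verify this equivalence directly. The following sketch uses a tiny made-up weight matrix; multiplying the one-hot-sum vector by it gives exactly the same layer_1 input as selecting and summing the corresponding rows:

```python
import numpy as np

np.random.seed(1)
weights_0_1 = np.random.rand(4, 3)   # 4-word vocabulary, 3 hidden neurons

x = np.array([1, 1, 0, 1])           # one-hot sum for "the cat sat"
word_indices = [0, 1, 3]             # the same three words, as row indices

via_matmul = x.dot(weights_0_1)                     # big vector-matrix multiply
via_rowsum = weights_0_1[word_indices].sum(axis=0)  # select rows and sum

print(np.allclose(via_matmul, via_rowsum))  # True
```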
After running the previous code, run this code.
import numpy as np
np.random.seed(1)

def sigmoid(x):
    return 1/(1 + np.exp(-x))

alpha, iterations = (0.01, 2)
hidden_size = 100

weights_0_1 = 0.2*np.random.random((len(vocab),hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,1)) - 0.1

correct,total = (0,0)
for iter in range(iterations):

    # Trains on the first 24,000 reviews
    for i in range(len(input_dataset)-1000):

        x,y = (input_dataset[i],target_dataset[i])
        layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0))  # embed + sigmoid
        layer_2 = sigmoid(np.dot(layer_1,weights_1_2))    # linear + sigmoid

        layer_2_delta = layer_2 - y  # Compares the prediction with the truth
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)  # Backpropagation

        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha

        if(np.abs(layer_2_delta) < 0.5):
            correct += 1
        total += 1
        if(i % 10 == 9):
            progress = str(i/float(len(input_dataset)))
            sys.stdout.write('\rIter:'+str(iter)\
                             +' Progress:'+progress[2:4]\
                             +'.'+progress[4:6]\
                             +'% Training Accuracy:'\
                             + str(correct/float(total)) + '%')
    print()

correct,total = (0,0)
for i in range(len(input_dataset)-1000,len(input_dataset)):

    x = input_dataset[i]
    y = target_dataset[i]

    layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0))
    layer_2 = sigmoid(np.dot(layer_1,weights_1_2))

    if(np.abs(layer_2 - y) < 0.5):
        correct += 1
    total += 1
print("Test Accuracy:" + str(correct / float(total)))
Interpreting the output
What did the neural network learn along the way?
Here’s the output of the movie reviews neural network. From one perspective, this is the
same correlation summarization we’ve already discussed:
Iter:0 Progress:95.99% Training Accuracy:0.832%
Iter:1 Progress:95.99% Training Accuracy:0.8663333333333333%
Test Accuracy:0.849
The neural network was looking for correlation between the input
datapoints and the output datapoints. But those datapoints have
characteristics we’re familiar with (notably those of language).
Furthermore, it’s extremely beneficial to consider what patterns
of language would be detected by the correlation summarization,
and more importantly, which ones wouldn’t. After all, just because
the network is able to find correlation between the input and
output datasets doesn’t mean it understands every useful pattern
of language.
[Figure: Review vocab → (Neural network) → Pos/Neg label]
Furthermore, understanding the difference between what the
network (in its current configuration) is capable of learning relative to what it needs to
know to properly understand language is an incredibly fruitful line of thinking. This is what
researchers on the front lines of state-of-the-art research consider, and it’s what we’re going
to consider here.
What about language did the movie reviews network learn?
Let’s start by considering what was presented to the network. As
displayed in the diagram at top right, you presented each review’s
vocabulary as input and asked the network to predict one of
two labels (positive or negative). Given that the correlation
summarization says the network will look for correlation between
the input and output datasets, at a minimum, you’d expect the
network to identify words that have either a positive or negative
correlation (by themselves).
This follows naturally from the correlation summarization.
You present the presence or absence of a word. As such, the
correlation summarization will find direct correlation between
this presence/absence and each of the two labels. But this isn’t the
whole story.
Neural architecture
How did the choice of architecture affect
what the network learned?
We just discussed the first, most trivial type of information the neural network learned:
direct correlation between the input and target datasets. This observation is largely the clean
slate of neural intelligence. (If a network can’t discover direct correlation between input
and output data, something is probably broken.) The development of more-sophisticated
architectures is based on the need to find more-complex patterns than direct correlation,
and this network is no exception.
The minimal architecture needed to identify direct correlation is a two-layer network, where
the network has a single weight matrix that connects directly from the input layer to the
output layer. But we used a network that has a hidden layer. This raises the question: what
does this hidden layer do?
Fundamentally, hidden layers are about grouping datapoints from a previous layer into n
groups (where n is the number of neurons in the hidden layer). Each hidden neuron takes in
a datapoint and answers the question, “Is this datapoint in my group?” As the hidden layer
learns, it searches for useful groupings of its input. What are useful groupings?
An input datapoint grouping is useful if it does two things. First, the grouping must be useful
to the prediction of an output label. If it’s not useful to the output prediction, the correlation
summarization will never lead the network to find the group. This is a hugely valuable
realization. Much of neural network research is about finding training data (or some other
manufactured signal for the network to artificially predict) so it finds groupings that are useful
for a task (such as predicting movie review stars). We’ll discuss this more in a moment.
Second, a grouping is useful if it’s an actual phenomenon in the data that you care about.
Bad groupings just memorize the data. Good groupings pick up on phenomena that are
useful linguistically.
For example, when predicting whether a movie review is positive or negative, understanding
the difference between “terrible” and “not terrible” is a powerful grouping. It would be great
to have a neuron that turned off when it saw “awful” and turned on when it saw “not awful.”
This would be a powerful grouping for the next layer to use to make the final prediction.
But because the input to the neural network is the vocabulary of a review, “it was great,
not terrible” creates exactly the same layer_1 value as “it was terrible, not great.” For this
reason, the network is very unlikely to create a hidden neuron that understands negation.
Testing whether a layer is the same or different based on a certain language pattern is a
great first step for knowing whether an architecture is likely to find that pattern using the
correlation summarization. If you can construct two examples with an identical hidden
layer, one with the pattern you find interesting and one without, the network is unlikely to
find that pattern.
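You can run this test yourself. Because the input is the review's vocabulary (a set of words), the two phrases below produce identical inputs, and therefore identical layer_1 activations, regardless of what the weights are:

```python
phrase_a = "it was great , not terrible"
phrase_b = "it was terrible , not great"

vocab_a = set(phrase_a.split())
vocab_b = set(phrase_b.split())

# Identical vocabularies -> identical one-hot input -> identical layer_1,
# so no hidden neuron can distinguish the two phrases.
print(vocab_a == vocab_b)  # True
```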
As you just learned, a hidden layer fundamentally groups the previous layer’s data. At a
granular level, each neuron classifies a datapoint as either subscribing or not subscribing to
its group. At a higher level, two datapoints (movie reviews) are similar if they subscribe to
many of the same groups. Finally, two inputs (words) are similar if the weights linking them
to various hidden neurons (a measure of each word’s group affinity) are similar. Given this
knowledge, in the previous neural network, what should you observe in the weights going
into the hidden neurons from the words?
What should you see in the weights connecting words and
hidden neurons?
Here’s a hint: words that have a similar predictive power
should subscribe to similar groups (hidden neuron
configurations). What does this mean for the weights
connecting each word to each hidden neuron?
Here’s the answer. Words that correlate with similar labels
(positive or negative) will have similar weights connecting
them to various hidden neurons. This is because the neural
network learns to bucket them into similar hidden neurons
so that the final layer (weights_1_2) can make the correct
positive or negative predictions.
You can see this phenomenon by taking a particularly
positive or negative word and searching for the other
words with the most similar weight values. In other words,
you can take each word and see which other words have
the most similar weight values connecting them to each
hidden neuron (to each group).
[Figure: weights connecting "bad," "good," and "film" to the hidden neurons. The three
bold weights for "good" form the embedding for "good." They reflect how much the term
"good" is a member of each group (hidden neuron). Words with similar predictive power
have similar word embeddings (weight values).]
Words that subscribe to similar groups will have similar
predictive power for positive or negative labels. As such,
words that subscribe to similar groups, having similar
weight values, will also have similar meaning. Abstractly, in
terms of neural networks, a neuron has similar meaning to other neurons in the same layer
if and only if it has similar weights connecting it to the next and/or previous layers.
Comparing word embeddings
How can you visualize weight similarity?
For each input word, you can select the list of weights proceeding out of it to the various
hidden neurons by selecting the corresponding row of weights_0_1. Each entry in the row
represents each weight proceeding from that row’s word to each hidden neuron. Thus, to
figure out which words are most similar to a target term, you compare each word’s vector
(row of the matrix) to that of the target term. The comparison of choice is called Euclidean
distance, as shown in the following code:
from collections import Counter
import math

def similar(target='beautiful'):
    target_index = word2index[target]

    scores = Counter()
    for word,index in word2index.items():
        raw_difference = weights_0_1[index] - (weights_0_1[target_index])
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))

    return scores.most_common(10)
This allows you to easily query for the most similar word (neuron) according to the network:
print(similar('beautiful'))
print(similar('terrible'))
[('beautiful', -0.0),
('atmosphere', -0.70542101298),
('heart', -0.7339429768542354),
('tight', -0.7470388145765346),
('fascinating', -0.7549291974),
('expecting', -0.759886970744),
('beautifully', -0.7603669338),
('awesome', -0.76647368382398),
('masterpiece', -0.7708280057),
('outstanding', -0.7740642167)]
[('terrible', -0.0),
('dull', -0.760788602671491),
('lacks', -0.76706470275372),
('boring', -0.7682894961694),
('disappointing', -0.768657),
('annoying', -0.78786389931),
('poor', -0.825784172378292),
('horrible', -0.83154121717),
('laughable', -0.8340279599),
('badly', -0.84165373783678)]
As you might expect, the most similar term to every word is itself, followed by words with
similar usefulness as the target term. Again, as you might expect, because the network has only
two labels (positive and negative), the input terms are grouped according to which label
they tend to predict.
This is a standard phenomenon of the correlation summarization. It seeks to create similar
representations (layer_1 values) within the network based on the label being predicted, so
that it can predict the right label. In this case, the side effect is that the weights feeding into
layer_1 get grouped according to output label.
The key takeaway is a gut instinct about this phenomenon of the correlation summarization. It
consistently attempts to convince the hidden layers to be similar based on which label should
be predicted.
What is the meaning of a neuron?
Meaning is entirely based on the target labels being predicted.
Note that the meanings of different words didn’t totally reflect how you might group them.
The term most similar to “beautiful” is “atmosphere.” This is a valuable lesson. For the
purposes of predicting whether a movie review is positive or negative, these words have
nearly identical meaning. But in the real world, their meaning is quite different (one is an
adjective and another a noun, for example).
print(similar('beautiful'))
print(similar('terrible'))
[('beautiful', -0.0),
('atmosphere', -0.70542101298),
('heart', -0.7339429768542354),
('tight', -0.7470388145765346),
('fascinating', -0.7549291974),
('expecting', -0.759886970744),
('beautifully', -0.7603669338),
('awesome', -0.76647368382398),
('masterpiece', -0.7708280057),
('outstanding', -0.7740642167)]
[('terrible', -0.0),
('dull', -0.760788602671491),
('lacks', -0.76706470275372),
('boring', -0.7682894961694),
('disappointing', -0.768657),
('annoying', -0.78786389931),
('poor', -0.825784172378292),
('horrible', -0.83154121717),
('laughable', -0.8340279599),
('badly', -0.84165373783678)]
This realization is incredibly important. The meaning (of a neuron) in the network is
defined based on the target labels. Everything in the neural network is contextualized based
on the correlation summarization trying to correctly make predictions. Thus, even though
you and I know a great deal about these words, the neural network is entirely ignorant of all
information outside the task at hand.
How can you convince the network to learn more-nuanced information about neurons
(in this case, word neurons)? Well, if you give it input and target data that requires a
more nuanced understanding of language, it will have reason to learn more-nuanced
interpretations of various terms.
What should you use the neural network to predict so that it learns more-interesting
weight values for the word neurons? The task you’ll use to learn more-interesting weight
values for the word neurons is a glorified fill-in-the-blank task. Why use this? First, there's
nearly infinite training data (the internet), which means nearly infinite signal for the neural
network to use to learn more-nuanced information about words. Furthermore, being able to
accurately fill in the blank requires at least some notion of context about the real world.
For instance, in the following example, is it more likely that the blank is correctly filled by
the word “anvil” or “wool”? Let’s see if the neural network can figure it out.
Mary had a little lamb whose __________ was white as snow.
Filling in the blank
Learn richer meanings for words by having
a richer signal to learn.
This example uses almost exactly the same neural network as the previous one, with only a few
modifications. First, instead of predicting a single label given a movie review, you’ll take each
(five-word) phrase, remove one word (a focus term), and attempt to train a network to figure
out the identity of the word you removed using the rest of the phrase. Second, you’ll use a trick
called negative sampling to make the network train a bit faster.
Consider that in order to predict which term is missing, you need one label for each possible
word. This would require several thousand labels, which would cause the network to train
slowly. To overcome this, let’s randomly ignore most of the labels for each forward propagation
step (as in, pretend they don’t exist). Although this may seem like a crude approximation, it’s a
technique that works well in practice. Here’s the preprocessing code for this example:
import sys,random,math
from collections import Counter
import numpy as np

np.random.seed(1)
random.seed(1)

f = open('reviews.txt')
raw_reviews = f.readlines()
f.close()

tokens = list(map(lambda x:(x.split(" ")),raw_reviews))

wordcnt = Counter()
for sent in tokens:
    for word in sent:
        wordcnt[word] -= 1
vocab = list(set(map(lambda x:x[0],wordcnt.most_common())))

word2index = {}
for i,word in enumerate(vocab):
    word2index[word]=i

concatenated = list()
input_dataset = list()
for sent in tokens:
    sent_indices = list()
    for word in sent:
        try:
            sent_indices.append(word2index[word])
            concatenated.append(word2index[word])
        except:
            ""
    input_dataset.append(sent_indices)
concatenated = np.array(concatenated)
random.shuffle(input_dataset)
alpha, iterations = (0.05, 2)
hidden_size,window,negative = (50,2,5)

weights_0_1 = (np.random.rand(len(vocab),hidden_size) - 0.5) * 0.2
weights_1_2 = np.random.rand(len(vocab),hidden_size)*0

layer_2_target = np.zeros(negative+1)
layer_2_target[0] = 1

def similar(target='beautiful'):
    target_index = word2index[target]

    scores = Counter()
    for word,index in word2index.items():
        raw_difference = weights_0_1[index] - (weights_0_1[target_index])
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)

def sigmoid(x):
    return 1/(1 + np.exp(-x))

for rev_i,review in enumerate(input_dataset * iterations):
    for target_i in range(len(review)):

        # Predicts only a random subset, because it's really
        # expensive to predict every vocabulary word
        target_samples = [review[target_i]]+list(concatenated\
        [(np.random.rand(negative)*len(concatenated)).astype('int').tolist()])

        left_context = review[max(0,target_i-window):target_i]
        right_context = review[target_i+1:min(len(review),target_i+window)]

        layer_1 = np.mean(weights_0_1[left_context+right_context],axis=0)
        layer_2 = sigmoid(layer_1.dot(weights_1_2[target_samples].T))
        layer_2_delta = layer_2 - layer_2_target
        layer_1_delta = layer_2_delta.dot(weights_1_2[target_samples])

        weights_0_1[left_context+right_context] -= layer_1_delta * alpha
        weights_1_2[target_samples] -= np.outer(layer_2_delta,layer_1)*alpha

    if(rev_i % 250 == 0):
        sys.stdout.write('\rProgress:'+str(rev_i/float(len(input_dataset)
            *iterations)) + '   ' + str(similar('terrible')))
    sys.stdout.write('\rProgress:'+str(rev_i/float(len(input_dataset)
        *iterations)))
print(similar('terrible'))
Progress:0.99998 [('terrible', -0.0), ('horrible', -2.846300248788519),
('brilliant', -3.039932544396419), ('pathetic', -3.4868595532695967),
('superb', -3.6092947961276645), ('phenomenal', -3.660172529098085),
('masterful', -3.6856112636664564), ('marvelous', -3.9306620801551664),
Meaning is derived from loss
With this new neural network, you can subjectively see that the word embeddings cluster
in a rather different way. Where before words were clustered according to their likelihood
to predict a positive or negative label, now they’re clustered based on their likelihood to
occur within the same phrase (sometimes regardless of sentiment).
Predicting POS/NEG:

print(similar('terrible'))

[('terrible', -0.0),
 ('dull', -0.760788602671491),
 ('lacks', -0.76706470275372),
 ('boring', -0.7682894961694),
 ('disappointing', -0.768657),
 ('annoying', -0.78786389931),
 ('poor', -0.825784172378292),
 ('horrible', -0.83154121717),
 ('laughable', -0.8340279599),
 ('badly', -0.84165373783678)]

Fill in the blank:

print(similar('terrible'))

[('terrible', -0.0),
 ('horrible', -2.79600898781),
 ('brilliant', -3.3336178881),
 ('pathetic', -3.49393193646),
 ('phenomenal', -3.773268963),
 ('masterful', -3.8376122586),
 ('superb', -3.9043150978490),
 ('bad', -3.9141673639585237),
 ('marvelous', -4.0470804427),
 ('dire', -4.178749691835959)]
Predicting POS/NEG:

print(similar('beautiful'))

[('beautiful', -0.0),
 ('atmosphere', -0.70542101298),
 ('heart', -0.7339429768542354),
 ('tight', -0.7470388145765346),
 ('fascinating', -0.7549291974),
 ('expecting', -0.759886970744),
 ('beautifully', -0.7603669338),
 ('awesome', -0.76647368382398),
 ('masterpiece', -0.7708280057),
 ('outstanding', -0.7740642167)]

Fill in the blank:

print(similar('beautiful'))

[('beautiful', -0.0),
 ('lovely', -3.0145597243116),
 ('creepy', -3.1975363066322),
 ('fantastic', -3.2551041418),
 ('glamorous', -3.3050812101),
 ('spooky', -3.4881261617587),
 ('cute', -3.592955888181448),
 ('nightmarish', -3.60063813),
 ('heartwarming', -3.6348147),
 ('phenomenal', -3.645669007)]
The key takeaway is that, even though the network trained over the same dataset with a very
similar architecture (three layers, cross entropy, sigmoid nonlinearity), you can influence what
the network learns within its weights by changing what you tell the network to predict. Even
though it’s looking at the same statistical information, you can target what it learns based
on what you select as the input and target values. For the moment, let’s call this process of
choosing what you want the network to learn intelligence targeting.
Controlling the input/target values isn’t the only way to perform intelligence targeting. You
can also adjust how the network measures error, the size and types of layers it has, and the
types of regularization to apply. In deep learning research, all of these techniques fall under
the umbrella of constructing what’s called a loss function.
Neural networks don’t really learn data; they minimize
the loss function.
In chapter 4, you learned that learning is about adjusting each weight in the neural network
to bring the error down to 0. In this section, I'll explain the same phenomenon from a
different perspective: choosing the error so the neural network learns the patterns we're
interested in. Remember these lessons?
The golden method for learning
Adjust each weight in the correct direction and by the correct amount so error
reduces to 0.

The secret
For any input and goal_pred, an exact relationship is defined between error
and weight, found by combining the prediction and error formulas.
error = ((0.5 * weight) - 0.8) ** 2
Perhaps you remember this formula from the one-weight neural network. In that network,
you could evaluate the error by first forward propagating (0.5 * weight) and then
comparing to the target (0.8). I encourage you not to think about this from the perspective
of two different steps (forward propagation, then error evaluation), but instead to consider
the entire formula (including forward prop) to be the evaluation of an error value. This
context will reveal the true cause of the different word-embedding clusterings. Even though
the network and datasets were similar, the error function was fundamentally different,
leading to different word clusterings within each network.
Predicting POS/NEG:

print(similar('terrible'))

[('terrible', -0.0),
 ('dull', -0.760788602671491),
 ('lacks', -0.76706470275372),
 ('boring', -0.7682894961694),
 ('disappointing', -0.768657),
 ('annoying', -0.78786389931),
 ('poor', -0.825784172378292),
 ('horrible', -0.83154121717),
 ('laughable', -0.8340279599),
 ('badly', -0.84165373783678)]

Fill in the blank:

print(similar('terrible'))

[('terrible', -0.0),
 ('horrible', -2.79600898781),
 ('brilliant', -3.3336178881),
 ('pathetic', -3.49393193646),
 ('phenomenal', -3.773268963),
 ('masterful', -3.8376122586),
 ('superb', -3.9043150978490),
 ('bad', -3.9141673639585237),
 ('marvelous', -4.0470804427),
 ('dire', -4.178749691835959)]
The choice of loss function determines the
neural network’s knowledge.
The more formal term for an error function is a loss function or objective function (all
three phrases are interchangeable). Considering learning to be all about minimizing a loss
function (which includes forward propagation) gives a far broader perspective on how
neural networks learn. Two neural networks can have identical starting weights, be trained
over identical datasets, and ultimately learn very different patterns because you choose
a different loss function. In the case of the two movie review neural networks, the loss
function was different because you chose two different target values (positive or negative
versus fill in the blank).
Different kinds of architectures, layers, regularization techniques, datasets, and
nonlinearities aren't really that different. These are the ways you can choose to construct
a loss function. If the network isn't learning properly, the solution can often come from
any of these possible categories.
For example, if a network is overfitting, you can augment the loss function by choosing
simpler nonlinearities, smaller layer sizes, shallower architectures, larger datasets, or
more-aggressive regularization techniques. All of these choices will have a fundamentally
similar effect on the loss function and a similar consequence on the behavior of the network.
They all interplay together, and over time you’ll learn how changing one can affect the
performance of another; but for now, the important takeaway is that learning is about
constructing a loss function and then minimizing it.
Whenever you want a neural network to learn a pattern, everything you need to know to do
so will be contained in the loss function. When you had only a single weight, this allowed
the loss function to be simple, as you’ll recall:
error = ((0.5 * weight) - 0.8) ** 2
But as you chain large numbers of complex layers together, the loss function will become
more complicated (and that’s OK). Just remember, if something is going wrong, the solution
is in the loss function, which includes both the forward prediction and the raw error
evaluation (such as mean squared error or cross entropy).
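To make this concrete, here's a minimal sketch (mine, not the book's code) that minimizes that single-weight loss with plain gradient descent. The input 0.5 and goal 0.8 come from the formula above; the learning rate and iteration count are arbitrary illustrative choices.

```python
# Minimizing error = ((0.5 * weight) - 0.8) ** 2 with gradient descent.
# The input (0.5) and goal (0.8) come from the text; alpha and the
# iteration count are arbitrary choices for this sketch.
x, goal = 0.5, 0.8
weight, alpha = 0.0, 0.1

for _ in range(200):
    pred = x * weight
    delta = pred - goal
    weight -= alpha * (2 * x * delta)   # derivative of the loss w.r.t. weight

error = ((x * weight) - goal) ** 2
print(weight, error)   # weight approaches 0.8 / 0.5 = 1.6; error approaches 0
```

Every deep network is minimizing a (much larger) version of exactly this loop.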
Chapter 11: Neural networks that understand language
King – Man + Woman ~= Queen
Word analogies are an interesting consequence
of the previously built network.
Before closing out this chapter, let’s discuss what is, at the time of writing, still one of
the most famous properties of neural word embeddings (word vectors like those we
just created). The task of filling in the blank creates word embeddings with interesting
phenomena known as word analogies, wherein you can take the vectors for different words
and perform basic algebraic operations on them.
For example, if you train the previous network on a large enough corpus, you'll be able to
take the vector for king, subtract from it the vector for man, add in the vector for woman, and
then search for the most similar vector (other than those in the query). As it turns out, the
most similar vector is often the one for the word "queen." Similar phenomena even show up in
the fill-in-the-blank network trained over movie reviews.
import math
import numpy as np
from collections import Counter

def analogy(positive=['terrible','good'], negative=['bad']):
    # Normalize the embeddings (weights_0_1 and word2index come from
    # the network trained earlier in this chapter)
    norms = np.sum(weights_0_1 * weights_0_1, axis=1)
    norms.resize(norms.shape[0], 1)
    normed_weights = weights_0_1 * norms

    # Build the query: add the positive words, subtract the negative ones
    query_vect = np.zeros(len(weights_0_1[0]))
    for word in positive:
        query_vect += normed_weights[word2index[word]]
    for word in negative:
        query_vect -= normed_weights[word2index[word]]

    # Score every word by (negative) Euclidean distance to the query
    scores = Counter()
    for word, index in word2index.items():
        raw_difference = weights_0_1[index] - query_vect
        squared_difference = raw_difference * raw_difference
        scores[word] = -math.sqrt(sum(squared_difference))
    return scores.most_common(10)[1:]
terrible – bad + good ~=

analogy(['terrible','good'],['bad'])

[('superb', -223.3926217861),
 ('terrific', -223.690648739),
 ('decent', -223.7045545791),
 ('fine', -223.9233021831882),
 ('worth', -224.03031703075),
 ('perfect', -224.125194533),
 ('brilliant', -224.2138041),
 ('nice', -224.244182032763),
 ('great', -224.29115420564)]

elizabeth – she + he ~=

analogy(['elizabeth','he'],['she'])

[('christopher', -192.7003),
 ('it', -193.3250398279812),
 ('him', -193.459063887477),
 ('this', -193.59240614759),
 ('william', -193.63049856),
 ('mr', -193.6426152274126),
 ('bruce', -193.6689279548),
 ('fred', -193.69940566948),
 ('there', -193.7189421836)]
Word analogies
Linear compression of an existing property in the data
When this property was first discovered, it created a flurry of excitement as people
extrapolated many possible applications of such a technology. It’s an amazing property in its
own right, and it did create a veritable cottage industry around generating word embeddings
of one variety or another. But the word analogy property in and of itself hasn’t grown that
much since then, and most of the current work in language focuses instead on recurrent
architectures (which we’ll get to in chapter 12).
That being said, getting a good intuition for what’s going on with word embeddings as a
result of a chosen loss function is extremely valuable. You’ve already learned that the choice
of loss function can affect how words are grouped, but this word analogy phenomenon is
something different. What about the new loss function causes it to happen?
If you consider a word embedding having two dimensions, it's perhaps easier to envision
exactly what it means for these word analogies to work:

king  = [0.6, 0.1]
man   = [0.5, 0.0]
woman = [0.0, 0.8]
queen = [0.1, 1.0]

king - man = [0.1, 0.1]
queen - woman = [0.1, 0.2]
king - man + woman == [0.1, 0.9], which is closest to queen

(figure: the four embeddings for "man," "woman," "king," and "queen" plotted in two dimensions)
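The analogy arithmetic can be checked directly. Here's a quick NumPy sketch (mine, not the book's) using the same toy two-dimensional vectors:

```python
import numpy as np

# The toy 2-D embeddings from the text; the numbers are only illustrative
vecs = {
    'king':  np.array([0.6, 0.1]),
    'man':   np.array([0.5, 0.0]),
    'woman': np.array([0.0, 0.8]),
    'queen': np.array([0.1, 1.0]),
}

query = vecs['king'] - vecs['man'] + vecs['woman']   # [0.1, 0.9]

# Find the word whose embedding is closest to the query point
closest = min(vecs, key=lambda w: np.linalg.norm(vecs[w] - query))
print(closest)   # queen
```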
The relative usefulness to the final prediction between “king”/“man” and “queen”/“woman” is
similar. Why? The difference between “king” and “man” leaves a vector of royalty. There are
a bunch of male- and female-related words in one grouping, and then there’s another grouping
in the royal direction.
This can be traced back to the chosen loss. When the word “king” shows up in a phrase, it
changes the probability of other words showing up in a certain way. It increases the probability
of words related to “man” and the probability of words related to royalty. The word “queen”
appearing in a phrase increases the probability of words related to “woman” and the probability
of words related to royalty (as a group). Thus, because the words have this sort of Venn diagram
impact on the output probability, they end up subscribing to similar combinations of groupings.
Oversimplified, “king” subscribes to the male and the royal dimensions of the hidden layer,
whereas “queen” subscribes to the female and royal dimensions of the hidden layer. Taking
the vector for “king” and subtracting out some approximation of the male dimensions
and adding in the female ones yields something close to “queen.” The most important
takeaway is that this is more about the properties of language than deep learning. Any linear
compression of these co-occurrence statistics will behave similarly.
Summary
You’ve learned a lot about neural word embeddings and the
impact of loss on learning.
In this chapter, we’ve unpacked the fundamental principles of using neural networks to
study language. We started with an overview of the primary problems in natural language
processing and then explored how neural networks model language at the word level using
word embeddings. You also learned how the choice of loss function can change the kinds of
properties that are captured by word embeddings. We finished with a discussion of perhaps
the most magical of neural phenomena in this space: word analogies.
As with the other chapters, I encourage you to build the examples in this chapter from
scratch. Although it may seem as though this chapter stands on its own, the lessons in
loss-function creation and tuning are invaluable and will be extremely important as you
tackle increasingly complicated strategies in future chapters. Good luck!
neural networks that write like Shakespeare:
recurrent layers for variable-length data
In this chapter

•	The challenge of arbitrary length
•	The surprising power of averaged word vectors
•	The limitations of bag-of-words vectors
•	Using identity vectors to sum word embeddings
•	Learning the transition matrices
•	Learning to create useful sentence vectors
•	Forward propagation in Python
•	Forward propagation and backpropagation with arbitrary length
•	Weight update with arbitrary length
There's something magical about Recurrent Neural Networks.
—Andrej Karpathy, "The Unreasonable Effectiveness of Recurrent Neural Networks," http://mng.bz/VPW
Chapter 12: Neural networks that write like Shakespeare
The challenge of arbitrary length
Let’s model arbitrarily long sequences of data
with neural networks!
This chapter and chapter 11 are intertwined, and I encourage you to ensure that you’ve
mastered the concepts and techniques from chapter 11 before you dive into this one. In
chapter 11, you learned about natural language processing (NLP). This included how to
modify a loss function to learn a specific pattern of information within the weights of a
neural network. You also developed an intuition for what a word embedding is and how it
can represent shades of similarity with other word embeddings. In this chapter, we’ll expand
on this intuition of an embedding conveying the meaning of a single word by creating
embeddings that convey the meaning of variable-length phrases and sentences.
Let’s first consider this challenge. If you wanted to create a vector that held an entire
sequence of symbols within its contents in the same way a word embedding stores
information about a word, how would you accomplish this? We’ll start with the simplest
option. In theory, if you concatenated or stacked the word embeddings, you’d have a vector
of sorts that held an entire sequence of symbols.
(figure: the embeddings for "the," "cat," and "sat" concatenated into one long vector)
But this approach leaves something to be desired, because different sentences will have
different-length vectors. This makes comparing two vectors together tricky, because one
vector will stick out the side. Consider the following second sentence:
(figure: the embeddings for "the," "cat," "sat," and "still" concatenated into one longer vector)
In theory, these two sentences should be very similar, and comparing their vectors should
indicate a high degree of similarity. But because “the cat sat” is a shorter vector, you have to
choose which part of “the cat sat still” vector to compare to. If you align left, the vectors will
appear to be identical (ignoring the fact that “the cat sat still” is, in fact, a different sentence).
But if you align right, then the vectors will appear to be extraordinarily different, despite the
fact that three-quarters of the words are the same, in the same order. Although this naive
approach shows some promise, it’s far from ideal in terms of representing the meaning of a
sentence in a useful way (a way that can be compared with other vectors).
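The length mismatch is easy to see in code. In this sketch (made-up three-dimensional embeddings; only the shapes matter), concatenating the two sentences produces vectors of different sizes, so there's no natural element-wise comparison:

```python
import numpy as np

# Hypothetical 3-D embeddings; the values don't matter, only the shapes
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(3) for w in ['the', 'cat', 'sat', 'still']}

sent_a = np.concatenate([emb[w] for w in ['the', 'cat', 'sat']])
sent_b = np.concatenate([emb[w] for w in ['the', 'cat', 'sat', 'still']])

# (9,) versus (12,): no element-wise comparison lines up
print(sent_a.shape, sent_b.shape)
```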
Do comparisons really matter?
Why should you care about whether you can compare two
sentence vectors?
The act of comparing two vectors is useful because it gives an approximation of what the
neural network sees. Even though you can’t read two vectors, you can tell when they’re
similar or different (using the function from chapter 11). If the method for generating
sentence vectors doesn’t reflect the similarity you observe between two sentences, then the
network will also have difficulty recognizing when two sentences are similar. All it has to
work with are the vectors!
As we continue to iterate and evaluate various methods for computing sentence vectors,
I want you to remember why we’re doing this. We’re trying to take the perspective of a
neural network. We’re asking, “Will the correlation summarization find correlation between
sentence vectors similar to this one and a desirable label, or will two nearly identical
sentences instead generate wildly different vectors such that there is very little correlation
between sentence vectors and the corresponding labels you’re trying to predict?” We want to
create sentence vectors that are useful for predicting things about the sentence, which, at a
minimum, means similar sentences need to create similar vectors.
The previous way of creating the sentence
vectors (concatenation) had issues because
of the rather arbitrary way of aligning
them, so let’s explore the next-simplest
approach. What if you take the vector
for each word in a sentence, and average
them? Well, right off the bat, you don’t have
to worry about alignment because each
sentence vector is of the same length!
(figure: matrix row average: the word vectors for "the," "cat," and "sat" are added together and averaged into a single sentence vector)
Furthermore, the sentences “the cat sat”
and “the cat sat still” will have similar
sentence vectors because the words going into them are similar. Even better, it’s likely that
“a dog walked” will be similar to “the cat sat,” even though no words overlap, because the
words used are also similar.
As it turns out, averaging word embeddings is a surprisingly effective way to create sentence
embeddings. It's not perfect (as you'll see), but it does a strong job of capturing what you
might perceive to be complex relationships between words. Before moving on, I think it will
be extremely beneficial to take the word embeddings from chapter 11 and play around with
the averaging strategy.
The surprising power of averaged word vectors
It’s the amazingly powerful go-to tool for neural prediction.
In the previous section, I proposed the second method for creating vectors that convey the
meaning of a sequence of words. This method takes the average of the vectors corresponding
to the words in a sentence, and intuitively we expect these new average sentence vectors to
behave in several desirable ways.
In this section, let’s play with sentence vectors generated using the embeddings from the
previous chapter. Break out the code from chapter 11, train the embeddings on the IMDB
corpus as you did before, and let’s experiment with average sentence embeddings.
Below is the same normalization performed when comparing word embeddings before, but this
time all the word embeddings are prenormalized into a matrix called normed_weights. Then a
function called make_sent_vect converts each review (a list of words) into an embedding
using the average approach. The results are stored in the matrix reviews2vectors.

import numpy as np
from collections import Counter

norms = np.sum(weights_0_1 * weights_0_1, axis=1)
norms.resize(norms.shape[0], 1)
normed_weights = weights_0_1 * norms

def make_sent_vect(words):
    indices = list(map(lambda x: word2index[x],
                       filter(lambda x: x in word2index, words)))
    return np.mean(normed_weights[indices], axis=0)

reviews2vectors = list()
for review in tokens:                # tokenized reviews
    reviews2vectors.append(make_sent_vect(review))
reviews2vectors = np.array(reviews2vectors)

After this, you create a function that queries for the most similar reviews given an input
review, by performing a dot product between the input review's vector and the vector of
every other review in the corpus:

def most_similar_reviews(review):
    v = make_sent_vect(review)
    scores = Counter()
    for i, val in enumerate(reviews2vectors.dot(v)):
        scores[i] = val
    most_similar = list()
    for idx, score in scores.most_common(3):
        most_similar.append(raw_reviews[idx][0:40])
    return most_similar

most_similar_reviews(['boring','awful'])

['I am amazed at how boring this film',
 'This is truly one of the worst dep',
 'It just seemed to go on and on and']

This dot-product similarity metric is the same one we briefly discussed in chapter 4 when
you were learning to predict with multiple inputs. Perhaps surprisingly, when you query for
the most similar reviews to the average vector of the two words "boring" and "awful," you
receive back three very negative reviews. There appears to be interesting statistical
information within these vectors, such that negative and positive embeddings cluster together.
How is information stored in these embeddings?
When you average word embeddings, average shapes remain.
Considering what’s going on here requires a little abstract thought. I recommend digesting
this kind of information over a period of time, because it’s probably a different kind of lesson
than you’re used to. For a moment, I’d like you to consider that a word vector can be visually
observed as a squiggly line like this one:
(figure: a vector with entries such as –.1, –.5, .1, .5, .6, .9, 1.0, and –1 drawn as a single squiggly line with high and low points)
Instead of thinking of a vector as a list of numbers, think about it as a line with high and low
points corresponding to high and low values at different places in the vector. If you selected
several words from the corpus, they might look like this figure:
(figure: the squiggly-line curves for "terrible," "wonderful," "boring," and "beautiful")

Consider the similarities between the various words. Notice that each vector's corresponding
shape is unique. But "terrible" and "boring" have a certain similarity in their shape.
"beautiful" and "wonderful" also have a similarity to their shape, but it's different from
that of the other words. If we were to cluster these little squiggles, words with similar
meaning would cluster together. More important, parts of these squiggles have true meaning
in and of themselves.
For example, for the negative words, there’s a downward and then upward spike about 40%
from the left. If I were to continue drawing lines corresponding to words, this spike would
continue to be distinctive. There’s nothing magical about that spike that means “negativity,”
and if I retrained the network, it would likely show up somewhere else. The spike indicates
negativity only because all the negative words have it!
Thus, during the course of training, these shapes are molded such that different curves in
different positions convey meaning (as discussed in chapter 11). When you take an average
curve over the words in a sentence, the most dominant meanings of the sentence hold true,
and the noise created by any particular word gets averaged away.
How does a neural network use embeddings?
Neural networks detect the curves that have correlation
with a target label.
You’ve learned about a new way to view word embeddings as a squiggly line with distinctive
properties (curves). You’ve also learned that these curves are developed throughout the
course of training to accomplish the target objective. Words with similar meaning in
one way or another will often share a distinctive bend in the curve: a combination of
high-low pattern among the weights. In this section, we’ll consider how the correlation
summarization processes these curves as input. What does it mean for a layer to consume
these curves as input?
Truth be told, a neural network consumes embeddings just as it consumed the streetlight
dataset in the book’s early chapters. It looks for correlation between the various bumps and
curves in the hidden layer and the target label it’s trying to predict. This is why words with
one particular aspect of similarity share similar bumps and curves. At some point during
training, a neural network starts developing unique characteristics between the shapes of
different words so it can tell them apart, and grouping them (giving them similar bumps/
curves) to help make accurate predictions. But this is another way of summarizing the
lessons from the end of chapter 11. We want to press further.
In this chapter, we’ll consider what it means to sum these embeddings into a sentence
embedding. What kinds of classifications would this summed vector be useful for? We’ve
identified that taking an average across the word embeddings of a sentence results in a
vector with an average of the characteristics of the words therein. If there are many positive
words, the final embedding will look somewhat positive (with other noise from the words
generally cancelling out). But note that this approach is a bit mushy: given enough words,
these different wavy lines should all average together to generally be a straight line.
This brings us to the first weakness of this approach: when attempting to store arbitrarily
long sequences (a sentence) of information into a fixed-length vector, if you try to store too
much, eventually the sentence vector (being an average of a multitude of word vectors) will
average out to a straight line (a vector of near-0s).
In short, this process of storing the information of a sentence doesn’t decay nicely. If you try
to store too many words into a single vector, you end up storing almost nothing. That being
said, a sentence is often not that many words; and if a sentence has repeating patterns, these
sentence vectors can be useful, because the sentence vector will retain the most dominant
patterns present across the word vectors being summed (such as the negative spike in the
previous section).
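The washing-out effect is easy to see numerically. In this sketch (hypothetical random 50-dimensional vectors, not trained embeddings), the norm of the average shrinks as more vectors are mixed in:

```python
import numpy as np

# 10,000 hypothetical 50-dimensional word vectors
rng = np.random.default_rng(0)
vectors = rng.standard_normal((10000, 50))

short_norm = np.linalg.norm(vectors[:5].mean(axis=0))   # "sentence" of 5 words
long_norm = np.linalg.norm(vectors.mean(axis=0))        # "sentence" of 10,000 words

print(short_norm > long_norm)   # True: the long average flattens toward zero
```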
The limitations of bag-of-words vectors
Order becomes irrelevant when you average word embeddings.
The biggest issue with average embeddings is that they have no concept of order. For
example, consider the two sentences “Yankees defeat Red Sox” and “Red Sox defeat
Yankees.” Generating sentence vectors for these two sentences using the average approach
will yield identical vectors, but the sentences are conveying the exact opposite information!
Furthermore, this approach ignores grammar and syntax, so “Sox Red Yankees defeat” will
also yield an identical sentence embedding.
This approach of summing or averaging word embeddings to form the embedding for a
phrase or sentence is classically known as a bag-of-words approach because, much like
throwing a bunch of words into a bag, order isn’t preserved. The key limitation is that you
can take any sentence, scramble all the words around, and generate a sentence vector, and
no matter how you scramble the words, the vector will be the same (because addition is
commutative: a + b == b + a).
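A short sketch (with made-up embeddings) makes the point: any reordering of the words yields exactly the same averaged vector:

```python
import numpy as np

# Made-up 4-D embeddings; any values give the same result
rng = np.random.default_rng(1)
emb = {w: rng.standard_normal(4) for w in ['yankees', 'defeat', 'red', 'sox']}

def bow_vector(words):
    # Bag-of-words: average the word vectors, ignoring order
    return np.mean([emb[w] for w in words], axis=0)

a = bow_vector(['yankees', 'defeat', 'red', 'sox'])
b = bow_vector(['red', 'sox', 'defeat', 'yankees'])

print(np.allclose(a, b))   # True: opposite sentences, identical vectors
```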
The real topic of this chapter is generating sentence vectors in a way where order does
matter. We want to create vectors such that scrambling them around changes the resulting
vector. More important, the way in which order matters (otherwise known as the way
in which order changes the vector) should be learned. In this way, the neural network’s
representation of order can be based around trying to solve a task in language and, by
extension, hopefully capture the essence of order in language. I’m using language as an
example here, but you can generalize these statements to any sequence. Language is just a
particularly challenging, yet universally known, domain.
One of the most famous and successful ways of generating vectors for
sequences (such as a sentence) is a recurrent neural network (RNN). In
order to show you how it works, we’ll start by coming up with a new,
and seemingly wasteful, way of doing the average word embeddings
using something called an identity matrix. An identity matrix is just an
arbitrarily large, square matrix (num rows == num columns) of 0s with
1s stretching from the top-left corner to the bottom-right corner as in the
examples shown here.
All three of these matrices are identity matrices, and they have one
purpose: performing vector-matrix multiplication with any vector will
return the original vector. If I multiply the vector [3,5] by the top
identity matrix, the result will be [3,5].
[1,0]
[0,1]

[1,0,0]
[0,1,0]
[0,0,1]

[1,0,0,0]
[0,1,0,0]
[0,0,1,0]
[0,0,0,1]
Using identity vectors to sum word embeddings
Let’s implement the same logic using a different approach.
You may think identity matrices are useless. What’s the purpose of a matrix that takes a
vector and outputs that same vector? In this case, we’ll use it as a teaching tool to show how
to set up a more complicated way of summing the word embeddings so the neural network
can take order into account when generating the final sentence embedding. Let’s explore
another way of summing embeddings.
"Red Sox defeat Yankees"
"Red Sox defeat Yankees"
Yankees
+
defeat
+
Sox
Red
+
Yankees
Identity
matrix
+
x
+
defeat
This is the standard technique for summing
multiple word embeddings together to form
a sentence embedding (dividing by the
number of words gives the average sentence
embedding). The example on the right adds
a step between each sum: vector-matrix
multiplication by an identity matrix.
+
Identity
matrix
x
+
Sox
Identity
matrix
x
The vector for “Red” is multiplied by an
identity matrix, and then the output is
Red
+
summed with the vector for “Sox,” which is
then vector-matrix multiplied by the identity
matrix and added to the vector for “defeat,”
and so on throughout the sentence. Note that because the vector-matrix multiplication by
the identity matrix returns the same vector that goes into it, the process on the right yields
exactly the same sentence embedding as the process at top left.
Yes, this is wasteful computation, but that’s about to change. The main thing to consider
here is that if the matrices used were any matrix other than the identity matrix, changing the
order of the words would change the resulting embedding. Let’s see this in Python.
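Here's a small sketch of that claim (hypothetical four-dimensional embeddings, and a random matrix standing in for a learned one): with the identity matrix, word order doesn't change the result, but with any other transition matrix it does:

```python
import numpy as np

rng = np.random.default_rng(2)
red, sox, defeat = rng.standard_normal((3, 4))   # hypothetical word embeddings

def embed(words, transition):
    # Multiply the running sum by the transition matrix before each addition
    vect = np.zeros(4)
    for w in words:
        vect = vect.dot(transition) + w
    return vect

identity = np.eye(4)
other = rng.standard_normal((4, 4))              # stand-in for a learned matrix

same = np.allclose(embed([red, sox, defeat], identity),
                   embed([defeat, sox, red], identity))
different = not np.allclose(embed([red, sox, defeat], other),
                            embed([defeat, sox, red], other))
print(same, different)   # True True
```

With the identity matrix, the result is just the order-blind sum; with any other matrix, earlier words pass through the matrix more times than later ones, so order changes the embedding.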
Matrices that change absolutely nothing
Let’s create sentence embeddings using identity matrices in Python.
In this section, we’ll demonstrate how to play with identity matrices in Python and
ultimately implement the new sentence vector technique from the previous section (proving
that it produces identical sentence embeddings).
First, we initialize four vectors (a, b, c, and d) of length 3, as well as an identity
matrix with three rows and three columns (identity matrices are always square):

import numpy as np

a = np.array([1,2,3])
b = np.array([0.1,0.2,0.3])
c = np.array([-1,-0.5,0])
d = np.array([0,0,0])

identity = np.eye(3)
print(identity)

[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]

Notice that the identity matrix has the characteristic set of 1s running diagonally from
top-left to bottom-right (which, by the way, is called the diagonal in linear algebra). Any
square matrix with 1s along the diagonal and 0s everywhere else is an identity matrix.

We then proceed to perform vector-matrix multiplication with each of the vectors and the
identity matrix (using NumPy's dot function). As you can see, the output of this process is
a new vector identical to the input vector:

print(a.dot(identity))   # [ 1.   2.   3. ]
print(b.dot(identity))   # [ 0.1  0.2  0.3]
print(c.dot(identity))   # [-1.  -0.5  0. ]
print(d.dot(identity))   # [ 0.   0.   0. ]

Because vector-matrix multiplication by an identity matrix returns the same vector we
started with, incorporating this process into the sentence embedding should seem trivial,
and it is:

this = np.array([2,4,6])
movie = np.array([10,10,10])
rocks = np.array([1,1,1])

print(this + movie + rocks)                                 # [13 15 17]
print((this.dot(identity) + movie).dot(identity) + rocks)   # [ 13.  15.  17.]

Both ways of creating sentence embeddings generate the same vector. This is only because
the identity matrix is a very special kind of matrix. But what would happen if we didn't use
the identity matrix? What if, instead, we used a different matrix? In fact, the identity
matrix is the only matrix guaranteed to return the same vector that it's vector-matrix
multiplied with. No other matrix has this guarantee.
Learning the transition matrices
What if you allowed the identity matrices to change
to minimize the loss?
Before we begin, let’s remember the goal: generating sentence embeddings that cluster
according to the meaning of the sentence, such that given a sentence, we can use the vector
to find sentences with a similar meaning. More specifically, these sentence embeddings
should care about the order of words.
Previously, we tried summing word embeddings. But this meant “Red Sox defeat Yankees”
had an identical vector to the sentence “Yankees defeat Red Sox,” despite the fact that these
two sentences have opposite meanings. Instead, we want to form sentence embeddings
where these two sentences generate different embeddings (yet still cluster in a meaningful
way). The theory is that if we use the identity-matrix way of creating sentence embeddings,
but used any other matrix other than the identity matrix, the sentence embeddings would be
different depending on the order.
Now the obvious question: what matrix should you use instead of the identity matrix? There
are an infinite number of choices. But in deep learning, the standard answer to this kind
of question is, "You'll learn the matrix just like you learn any other matrix in a neural
network!" OK, so you'll just learn this matrix. How?
Whenever you want to train a neural network to learn something, you always need a task
for it to learn. In this case, that task should require it to generate interesting sentence
embeddings by learning both useful word vectors and useful modifications to the identity
matrices. What task should you use?
(figure: supervised learning transforms "what you know" into "what you want to know")
The goal was similar when you wanted to generate useful word embeddings (fill in the
blank). Let’s try to accomplish a very similar task: training a neural network to take a list of
words and attempt to predict the next word.
["This", "movie", "was"]
Neural
network
["great"]
Learning to create useful sentence vectors
Create the sentence vector, make a prediction, and modify the
sentence vector via its parts.
In this next experiment, I don’t want you to think about the network like previous neural
networks. Instead, think about creating a sentence embedding, using it to predict the next
word, and then modifying the respective parts that formed the sentence embedding to
make this prediction more accurate. Because you’re predicting the next word, the sentence
embedding will be made from the parts of the sentence you’ve seen so far. The neural network
will look something like the figure.
"Yankees"
It’s composed of two steps: create
the sentence embedding, and then
use that embedding to predict which
word comes next. The input to this
network is the text “Red Sox defeat,”
and the word to be predicted is
“Yankees.”
I’ve written Identity matrix in the
boxes between the word vectors.
This matrix will only start as an
identity matrix. During training,
you’ll backpropagate gradients into
these matrices and update them
to help the network make better
predictions (just as for the rest of the
weights in the network).
Predicts
the overall
vocabulary
via softmax
0.0
0.0
0.0
0.0
0.0
1.0
+
defeat
Identity
matrix
Creates a
sentence
embedding
x
+
Sox
Identity
matrix
Red
x
+
This way, the network will learn how
to incorporate more information than just a sum of word embeddings. By allowing the (initially,
identity) matrices to change (and become not identity matrices), you let the neural network
learn how to create embeddings where the order in which the words are presented changes the
sentence embedding. But this change isn’t arbitrary. The network will learn how to incorporate
the order of words in a way that’s useful for the task of predicting the next word.
You’ll also constrain the transition matrices (the matrices that are originally identity
matrices) to all be the same matrix. In other words, the matrix from “Red” -> “Sox” will
be reused to transition from “Sox” -> “defeat.” Whatever logic the network learns in one
transition will be reused in the next, and only logic that’s useful at every predictive step will
be allowed to be learned in the network.
Forward propagation in Python
Let’s take this idea and see how to perform a simple
forward propagation.
Now that you have the conceptual idea of what you’re trying to build, let’s check out a toy
version in Python. First, let’s set up the weights (I’m using a limited vocab of nine words):
import numpy as np

def softmax(x_):
    x = np.atleast_2d(x_)
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)

word_vects = {}                                 # word embeddings
word_vects['yankees'] = np.array([[0.,0.,0.]])
word_vects['bears']   = np.array([[0.,0.,0.]])
word_vects['braves']  = np.array([[0.,0.,0.]])
word_vects['red']     = np.array([[0.,0.,0.]])
word_vects['sox']     = np.array([[0.,0.,0.]])
word_vects['lose']    = np.array([[0.,0.,0.]])
word_vects['defeat']  = np.array([[0.,0.,0.]])
word_vects['beat']    = np.array([[0.,0.,0.]])
word_vects['tie']     = np.array([[0.,0.,0.]])

sent2output = np.random.rand(3,len(word_vects)) # sentence embedding to output
                                                # classification weights
identity = np.eye(3)                            # transition weights
This code creates three sets of weights. It creates a Python dictionary of word embeddings,
the identity matrix (transition matrix), and a classification layer. This classification layer
sent2output is a weight matrix to predict the next word given a sentence vector of length 3.
With these tools, forward propagation is trivial. Here’s how forward propagation works with
the sentence “red sox defeat” -> “yankees”:
layer_0 = word_vects['red']
layer_1 = layer_0.dot(identity) + word_vects['sox']     # creates a sentence
layer_2 = layer_1.dot(identity) + word_vects['defeat']  # embedding
pred = softmax(layer_2.dot(sent2output))                # predicts over all
print(pred)                                             # vocabulary

[[ 0.11111111  0.11111111  0.11111111  0.11111111  0.11111111
   0.11111111  0.11111111  0.11111111  0.11111111]]
How do you backpropagate into this?
It might seem trickier, but they’re the same steps you
already learned.
You just saw how to perform forward prediction for this network. At first, it might not be
clear how backpropagation can be performed. But it’s simple. Perhaps this is what you see:
layer_0 = word_vects['red']                             # normal neural network
                                                        # (chapters 1-5)
layer_1 = layer_0.dot(identity) + word_vects['sox']     # some sort of strange
layer_2 = layer_1.dot(identity) + word_vects['defeat']  # additional piece
pred = softmax(layer_2.dot(sent2output))                # normal neural network
print(pred)                                             # again (chapter 9 stuff)
Based on previous chapters, you should feel comfortable with computing a loss and
backpropagating until you get to the gradients at layer_2, called layer_2_delta. At this
point, you might be wondering, “Which direction do I backprop in?” Gradients could go
back to layer_1 by going backward through the identity matrix multiplication, or they
could go into word_vects['defeat'].
When you add two vectors together during forward propagation, you backpropagate the
same gradient into both sides of the addition. When you generate layer_2_delta, you’ll
backpropagate it twice: once across the identity matrix to create layer_1_delta, and again
to word_vects['defeat']:
y = np.array([1,0,0,0,0,0,0,0,0])      # Targets the one-hot vector for "yankees"
pred_delta = pred - y
layer_2_delta = pred_delta.dot(sent2output.T)
defeat_delta = layer_2_delta * 1       # Can ignore the "1," as in chapter 11
layer_1_delta = layer_2_delta.dot(identity.T)
sox_delta = layer_1_delta * 1          # Again, can ignore the "1"
layer_0_delta = layer_1_delta.dot(identity.T)
alpha = 0.01
word_vects['red'] -= layer_0_delta * alpha
word_vects['sox'] -= sox_delta * alpha
word_vects['defeat'] -= defeat_delta * alpha
identity -= np.outer(layer_0,layer_1_delta) * alpha
identity -= np.outer(layer_1,layer_2_delta) * alpha
sent2output -= np.outer(layer_2,pred_delta) * alpha
Chapter 12 | Neural networks that write like Shakespeare
Let’s train it!
You have all the tools; let’s train the network on a toy corpus.
So that you can get an intuition for what's going on, let's first train the new network on a toy task using the bAbI dataset. This dataset is a synthetically generated question-answering corpus used to teach machines how to answer simple questions about an environment. You aren't using it for QA (yet), but the simplicity of the task will help you better see the impact made by learning the recurrent (formerly identity) matrix. First, download the bAbI dataset. Here are the bash commands:
wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-1.tar.gz
tar -xvf tasks_1-20_v1-1.tar.gz
With some simple Python, you can open and clean a small dataset to train the network:
import sys,random,math
from collections import Counter
import numpy as np
f = open('tasksv11/en/qa1_single-supporting-fact_train.txt','r')
raw = f.readlines()
f.close()
tokens = list()
for line in raw[0:1000]:
    tokens.append(line.lower().replace("\n","").split(" ")[1:])
print(tokens[0:3])
[['mary', 'moved', 'to', 'the', 'bathroom'],
 ['john', 'went', 'to', 'the', 'hallway'],
 ['where', 'is', 'mary', 'bathroom']]
As you can see, this dataset contains a variety of simple statements and questions (with
punctuation removed). Each question is followed by the correct answer. When used in the
context of QA, a neural network reads the statements in order and answers questions (either
correctly or incorrectly) based on information in the recently read statements.
For now, you’ll train the network to attempt to finish each sentence when given one or more
starting words. Along the way, you’ll see the importance of allowing the recurrent matrix
(previously the identity matrix) to learn.
Setting things up
Before you can create matrices, you need to learn how many
parameters you have.
As with the word embedding neural network, you first need to create a few useful counts,
lists, and utility functions to use during the predict, compare, learn process. These utility
functions and objects are shown here and should look familiar:
vocab = set()
for sent in tokens:
    for word in sent:
        vocab.add(word)

vocab = list(vocab)

word2index = {}
for i,word in enumerate(vocab):
    word2index[word] = i

def words2indices(sentence):
    idx = list()
    for word in sentence:
        idx.append(word2index[word])
    return idx

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
First, you create a simple list of the vocabulary words, as well as a lookup dictionary that lets you go back and forth between a word's text and its index. You'll use a word's index in the vocabulary list to pick which row and column of the embedding and prediction matrices correspond to that word. Next comes a utility function for converting a list of words to a list of indices, followed by the function for softmax, which you'll use to predict the next word.
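To make the lookups concrete, here's a round trip through word2index and words2indices using a made-up five-word vocabulary (the real vocab list is built from the dataset, so its ordering will differ):

```python
# A made-up five-word vocabulary to illustrate the lookups
vocab = ['mary', 'moved', 'to', 'the', 'bathroom']

word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

def words2indices(sentence):
    idx = list()
    for word in sentence:
        idx.append(word2index[word])
    return idx

idx = words2indices(['mary', 'moved', 'to', 'the', 'bathroom'])
print(idx)                        # [0, 1, 2, 3, 4]
print([vocab[i] for i in idx])    # back to the original words
```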
The following code initializes the random seed (to get consistent results) and then sets the embedding size to 10. You create a matrix of word embeddings, a recurrent (embedding-to-embedding) weight matrix, and an initial start embedding. This is the embedding that models an empty phrase, which is key to how the network models the way sentences tend to start. Finally, there's a decoder weight matrix (just like the one for predicting from word embeddings) and a one_hot utility matrix:
np.random.seed(1)
embed_size = 10

# Word embeddings
embed = (np.random.rand(len(vocab),embed_size) - 0.5) * 0.1

# Embedding -> embedding (initially the identity matrix)
recurrent = np.eye(embed_size)

# Sentence embedding for an empty sentence
start = np.zeros(embed_size)

# Embedding -> output weights
decoder = (np.random.rand(embed_size, len(vocab)) - 0.5) * 0.1

# One-hot lookups (for the loss function)
one_hot = np.eye(len(vocab))
Forward propagation with arbitrary length
You’ll forward propagate using the same logic described earlier.
The following code contains the logic to forward propagate and predict the next word. Note that although the construction might feel unfamiliar, it follows the same procedure as before: summing embeddings while multiplying by the transition matrix. Here, the identity matrix is replaced with a matrix called recurrent, which is initialized to the identity matrix (and will be learned through training).
Furthermore, instead of predicting only at the last word, you make a prediction
(layer['pred']) at every timestep, based on the embedding generated by the previous
words. This is more efficient than doing a new forward propagation from the beginning of
the phrase each time you want to predict a new term.
def predict(sent):
    layers = list()
    layer = {}
    layer['hidden'] = start
    layers.append(layer)

    loss = 0
    preds = list()
    for target_i in range(len(sent)):              # Forward propagates
        layer = {}

        # Tries to predict the next term
        layer['pred'] = softmax(layers[-1]['hidden'].dot(decoder))
        loss += -np.log(layer['pred'][sent[target_i]])

        # Generates the next hidden state
        layer['hidden'] = layers[-1]['hidden'].dot(recurrent) + \
                          embed[sent[target_i]]
        layers.append(layer)

    return layers, loss
There’s nothing particularly new about this bit of code relative to what you’ve learned in the
past, but there’s a particular piece I want to make sure you’re familiar with before we move
forward. The list called layers is a new way to forward propagate.
Notice that you end up doing more forward propagations if the length of sent is larger. As
a result, you can’t use static layer variables as before. This time, you need to keep appending
new layers to the list based on the required number. Be sure you’re comfortable with what’s
going on in each part of this list, because if it’s unfamiliar to you in the forward propagation
pass, it will be very difficult to know what’s going on during the backpropagation and weight
update steps.
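To see the layers list grow with sentence length, here is the same predict logic run against a tiny made-up setup (a hypothetical 5-word vocabulary, embedding size 4; the real code builds these from the dataset):

```python
import numpy as np

np.random.seed(1)
vocab_size, embed_size = 5, 4      # made-up sizes for illustration

embed = (np.random.rand(vocab_size, embed_size) - 0.5) * 0.1
recurrent = np.eye(embed_size)
start = np.zeros(embed_size)
decoder = (np.random.rand(embed_size, vocab_size) - 0.5) * 0.1

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

def predict(sent):
    layers = [{'hidden': start}]   # one entry per timestep, plus the start
    loss = 0
    for target_i in range(len(sent)):
        layer = {}
        layer['pred'] = softmax(layers[-1]['hidden'].dot(decoder))
        loss += -np.log(layer['pred'][sent[target_i]])
        layer['hidden'] = layers[-1]['hidden'].dot(recurrent) + \
                          embed[sent[target_i]]
        layers.append(layer)
    return layers, loss

layers, loss = predict([0, 3, 1])  # a three-word sentence of word indices
print(len(layers))                 # 4: the empty-phrase layer plus one per word
```

A four-word sentence would produce five layers, and so on; the list grows with the input rather than being fixed in advance.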
Backpropagation with arbitrary length
You’ll backpropagate using the same logic described earlier.
As described with the “Red Sox defeat Yankees” example, let’s implement backpropagation over arbitrary-length sequences, assuming you have access to the forward propagation objects returned from the function in the previous section. The most important object is the layers list, where each entry holds two vectors (layer['hidden'] and layer['pred']). In order to backpropagate, you’ll take the output gradient and add a new object to each layer called layer['hidden_delta'], which will represent the gradient at that layer. This corresponds to variables like sox_delta, layer_0_delta, and defeat_delta from the “Red Sox defeat Yankees” example. You’re building the same logic in a form that can consume the variable-length sequences from the forward propagation logic.
for iter in range(30000):
    alpha = 0.001
    sent = words2indices(tokens[iter%len(tokens)][1:])

    layers,loss = predict(sent)                    # Forward

    for layer_idx in reversed(range(len(layers))): # Backpropagates
        layer = layers[layer_idx]
        target = sent[layer_idx-1]

        if(layer_idx > 0):                         # If not the first layer
            layer['output_delta'] = layer['pred'] - one_hot[target]
            new_hidden_delta = layer['output_delta']\
                                .dot(decoder.transpose())

            # If the last layer, don't pull from a later one,
            # because it doesn't exist
            if(layer_idx == len(layers)-1):
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta + \
                    layers[layer_idx+1]['hidden_delta']\
                        .dot(recurrent.transpose())
        else: # if the first layer
            layer['hidden_delta'] = layers[layer_idx+1]['hidden_delta']\
                                        .dot(recurrent.transpose())
Before moving on to the next section, be sure you can read this code and explain it to a
friend (or at least to yourself). There are no new concepts in this code, but its construction
can make it seem a bit foreign at first. Spend some time linking what’s written in this code
back to each line of the “Red Sox defeat Yankees” example, and you should be ready for the
next section and updating the weights using the gradients you backpropagated.
Weight update with arbitrary length
You’ll update weights using the same logic described earlier.
As with the forward and backprop logic, this weight update logic isn’t new. But I’m
presenting it after having explained it so you can focus on the engineering complexity,
having (hopefully) already grokked (ha!) the theory complexity.
for iter in range(30000):
    alpha = 0.001
    sent = words2indices(tokens[iter%len(tokens)][1:])

    layers,loss = predict(sent)                    # Forward

    for layer_idx in reversed(range(len(layers))): # Backpropagates
        layer = layers[layer_idx]
        target = sent[layer_idx-1]

        if(layer_idx > 0):
            layer['output_delta'] = layer['pred'] - one_hot[target]
            new_hidden_delta = layer['output_delta']\
                                .dot(decoder.transpose())

            # If the last layer, don't pull from a later one,
            # because it doesn't exist
            if(layer_idx == len(layers)-1):
                layer['hidden_delta'] = new_hidden_delta
            else:
                layer['hidden_delta'] = new_hidden_delta + \
                    layers[layer_idx+1]['hidden_delta']\
                        .dot(recurrent.transpose())
        else:
            layer['hidden_delta'] = layers[layer_idx+1]['hidden_delta']\
                                        .dot(recurrent.transpose())

    # Updates weights
    start -= layers[0]['hidden_delta'] * alpha / float(len(sent))
    for layer_idx,layer in enumerate(layers[1:]):

        decoder -= np.outer(layers[layer_idx]['hidden'],
                            layer['output_delta']) * alpha / float(len(sent))

        embed_idx = sent[layer_idx]
        embed[embed_idx] -= layers[layer_idx]['hidden_delta'] * \
                            alpha / float(len(sent))
        recurrent -= np.outer(layers[layer_idx]['hidden'],
                              layer['hidden_delta']) * alpha / float(len(sent))

    if(iter % 1000 == 0):
        print("Perplexity:" + str(np.exp(loss/len(sent))))
Execution and output analysis
Let’s run the code and look at what it learns.
Now, the moment of truth: what happens when you run it? When I run this code, I get a relatively steady downtrend in a metric called perplexity. Technically, perplexity takes the probability the model assigns to each correct label (word), passes it through a log function, negates it, averages it over the sequence, and exponentiates the result (e^x).
But what it represents theoretically is the difference between two probability distributions.
In this case, the perfect probability distribution would be 100% probability allocated to the
correct term and 0% everywhere else.
Perplexity is high when two probability distributions don’t match, and it’s low (approaching
1) when they do match. Thus, a decreasing perplexity, like all loss functions used with
stochastic gradient descent, is a good thing! It means the network is learning to predict
probabilities that match the data.
Perplexity:82.09227500075585
Perplexity:81.87615610433569
Perplexity:81.53705034457951
....
Perplexity:4.132556753967558
Perplexity:4.071667181580819
Perplexity:4.0167814473718435
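The starting value of roughly 82 matches the size of the vocabulary, which is what you'd expect: a model that guesses uniformly over V words has perplexity V. A sketch of the arithmetic, using a made-up 9-word vocabulary so the numbers stay readable:

```python
import numpy as np

# Probability assigned to the correct word at each of four timesteps,
# for a model that guesses uniformly over a made-up 9-word vocabulary
correct_probs = np.array([1/9, 1/9, 1/9, 1/9])

loss = -np.log(correct_probs).sum()            # summed negative log likelihood
perplexity = np.exp(loss / len(correct_probs)) # same formula as the training loop
print(perplexity)                              # ~9: uniform guessing over 9 words
```

As the model concentrates probability on the correct words, the per-word probabilities rise toward 1 and perplexity falls toward 1.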
But this hardly tells you what’s going on in the weights. Perplexity has faced some criticism
over the years (particularly in the language-modeling community) for being overused as a
metric. Let’s look a little more closely at the predictions:
sent_index = 4

l,_ = predict(words2indices(tokens[sent_index]))

print(tokens[sent_index])

for i,each_layer in enumerate(l[1:-1]):
    input = tokens[sent_index][i]
    true = tokens[sent_index][i+1]
    pred = vocab[each_layer['pred'].argmax()]
    print("Prev Input:" + input + (' ' * (12 - len(input))) +\
          "True:" + true + (" " * (15 - len(true))) + "Pred:" + pred)
This code takes a sentence and predicts the word the model thinks is most likely. This is
useful because it gives a sense for the kinds of characteristics the model takes on. What
kinds of things does it get right? What kinds of mistakes does it make? You’ll see in the
next section.
Looking at predictions can help you understand what’s going on.
You can look at the output predictions of the neural network as it trains to learn not only
what kinds of patterns it picks up, but also the order in which it learns them. After 100
training steps, the output looks like this:
['sandra', 'moved', 'to', 'the', 'garden.']
Prev Input:sandra      True:moved          Pred:is
Prev Input:moved       True:to             Pred:kitchen
Prev Input:to          True:the            Pred:bedroom
Prev Input:the         True:garden.        Pred:office
Neural networks tend to start off random. In this case, the neural network is likely only
biased toward whatever words it started with in its first random state. Let’s keep training:
['sandra', 'moved', 'to', 'the', 'garden.']
Prev Input:sandra      True:moved          Pred:the
Prev Input:moved       True:to             Pred:the
Prev Input:to          True:the            Pred:the
Prev Input:the         True:garden.        Pred:the
After 10,000 training steps, the neural network picks out the most common word (“the”) and
predicts it at every timestep. This is an extremely common error in recurrent neural networks.
It takes lots of training to learn finer-grained detail in a highly skewed dataset.
['sandra', 'moved', 'to', 'the', 'garden.']
Prev Input:sandra      True:moved          Pred:is
Prev Input:moved       True:to             Pred:to
Prev Input:to          True:the            Pred:the
Prev Input:the         True:garden.        Pred:bedroom.
These mistakes are really interesting. After seeing only the word “sandra,” the network
predicts “is,” which, although not exactly the same as “moved,” isn’t a bad guess. It picked the
wrong verb. Next, notice that the words “to” and “the” were correct, which isn’t as surprising
because these are some of the more common words in the dataset, and presumably the
network has been trained to predict the phrase “to the” after the verb “moved” many times.
The final mistake is also compelling, mistaking “bedroom” for the word “garden.”
It’s important to note that there’s almost no way this neural network could learn this task perfectly. After all, if I gave you the words “sandra moved to the,” could you tell me the correct next word? More context is needed to solve the task, but the fact that it’s unsolvable makes the ways in which it fails, in my opinion, educational to analyze.
Summary
Recurrent neural networks predict over arbitrary-length
sequences.
In this chapter, you learned how to create vector representations for arbitrary-length sequences. The last exercise trained a linear recurrent neural network to predict the next term given a preceding phrase of terms. To do this, it needed to learn how to create embeddings that accurately compress variable-length strings of terms into a fixed-size vector.
This last sentence should drive home a question: how does a neural network fit a variable
amount of information into a fixed-size box? The truth is, sentence vectors don’t encode
everything in the sentence. The name of the game in recurrent neural networks is not just
what these vectors remember, but also what they forget. In the case of predicting the next
word, most RNNs learn that only the last couple of words are really necessary,* and they
learn to forget (aka, not make unique patterns in their vectors for) words further back in
the history.
But note that there are no nonlinearities in the generation of these representations. What
kinds of limitations do you think that could create? In the next chapter, we’ll explore this
question and more using nonlinearities and gates to form a neural network called a long
short-term memory network (LSTM). But first, make sure you can sit down and (from
memory) code a working linear RNN that converges. The dynamics and control flow of
these networks can be a bit daunting, and the complexity is about to jump by quite a bit.
Before moving on, become comfortable with what you’ve learned in this chapter.
And with that, let’s dive into LSTMs!
* See, for example, “Frustratingly Short Attention Spans in Neural Language Modeling” by Michał Daniluk et al. (paper presented at
ICLR 2017), https://arxiv.org/abs/1702.04521.
13  introducing automatic optimization:
let's build a deep learning framework

In this chapter

• What is a deep learning framework?
• Introduction to tensors
• Introduction to autograd
• How does addition backpropagation work?
• How to learn a framework
• Nonlinearity layers
• The embedding layer
• The cross-entropy layer
• The recurrent layer

Whether we are based on carbon or on silicon makes
no fundamental difference; we should each be treated
with appropriate respect.
—Arthur C. Clarke, 2010: Odyssey Two (1982)
Chapter 13 | Introducing automatic optimization
What is a deep learning framework?
Good tools reduce errors, speed development, and increase
runtime performance.
If you’ve been reading about deep learning for long, you’ve probably come across one of the major frameworks such as PyTorch, TensorFlow, Theano (recently deprecated), Keras, Lasagne, or DyNet. Framework development has been extremely rapid over the past few years, and, despite all frameworks being free, open source software, there’s a light spirit of competition and camaraderie around each framework.

Thus far, I’ve avoided the topic of frameworks because, first and foremost, it’s extremely important for you to know what’s going on under the hood of these frameworks by implementing algorithms yourself (from scratch in NumPy). But now we’re going to transition into using a framework, because the networks you’ll be training next (long short-term memory networks, or LSTMs) are very complex, and NumPy code describing their implementation is difficult to read, use, or debug (gradients are flying everywhere).
It’s exactly this code complexity that deep learning frameworks were created to mitigate.
Especially if you wish to train a neural network on a GPU (giving 10–100× faster training),
a deep learning framework can significantly reduce code complexity (reducing errors and
increasing development speed) while also increasing runtime performance. For these
reasons, their use is nearly universal within the research community, and a thorough
understanding of a deep learning framework will be essential on your journey toward
becoming a user or researcher of deep learning.
But we won’t jump into any deep learning frameworks you’ve heard of, because that would
stifle your ability to learn about what complex models (such as LSTMs) are doing under the
hood. Instead, you’ll build a light deep learning framework according to the latest trends in
framework development. This way, you’ll have no doubt about what frameworks do when
using them for complex architectures. Furthermore, building a small framework yourself
should provide a smooth transition to using actual deep learning frameworks, because you’ll
already be familiar with the API and the functionality underneath it. I found this exercise
beneficial myself, and the lessons learned in building my own framework are especially
useful when attempting to debug a troublesome model.
How does a framework simplify your code? Abstractly, it eliminates the need to write code
that you’d repeat multiple times. Concretely, the most beneficial pieces of a deep learning
framework are its support for automatic backpropagation and automatic optimization. These
features let you specify only the forward propagation code of a model, with the framework
taking care of backpropagation and weight updates automatically. Most frameworks even
make the forward propagation code easier by providing high-level interfaces to common
layers and loss functions.
Introduction to tensors
Tensors are an abstract form of vectors and matrices.
Up to this point, we’ve been working exclusively with vectors and matrices as the basic data
structures for deep learning. Recall that a matrix is a list of vectors, and a vector is a list
of scalars (single numbers). A tensor is the abstract version of this form of nested lists of
numbers. A vector is a one-dimensional tensor. A matrix is a two-dimensional tensor, and
higher dimensions are referred to as n-dimensional tensors. Thus, the beginning of a new
deep learning framework is the construction of this basic type, which we’ll call Tensor:
import numpy as np

class Tensor (object):

    def __init__(self, data):
        self.data = np.array(data)

    def __add__(self, other):
        return Tensor(self.data + other.data)

    def __repr__(self):
        return str(self.data.__repr__())

    def __str__(self):
        return str(self.data.__str__())

x = Tensor([1,2,3,4,5])
print(x)

[1 2 3 4 5]

y = x + x
print(y)

[ 2  4  6  8 10]
This is the first version of this basic data structure. Note that it stores all the numerical
information in a NumPy array (self.data), and it supports one tensor operation
(addition). Adding more operations is relatively simple: create more functions on the tensor
class with the appropriate functionality.
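For instance, elementwise multiplication could be added the same way. This is a sketch of my own (not the operation the framework adds next), just to show the pattern:

```python
import numpy as np

class Tensor(object):

    def __init__(self, data):
        self.data = np.array(data)

    def __add__(self, other):
        return Tensor(self.data + other.data)

    def __mul__(self, other):                  # new: elementwise multiplication
        return Tensor(self.data * other.data)

    def __repr__(self):
        return str(self.data.__repr__())

    def __str__(self):
        return str(self.data.__str__())

x = Tensor([1, 2, 3, 4, 5])
y = x * x
print(y)      # [ 1  4  9 16 25]
```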
Introduction to automatic gradient computation
(autograd)
Previously, you performed backpropagation by hand.
Let’s make it automatic!
In chapter 4, you learned about derivatives. Since then, you’ve been computing derivatives
by hand for each neural network you train. Recall that this is done by moving backward
through the neural network: first compute the gradient at the output of the network, then
use that result to compute the derivative at the next-to-last component, and so on until all
weights in the architecture have correct gradients. This logic for computing gradients can
also be added to the tensor object. Let me show you what I mean. New code is in bold:
import numpy as np

class Tensor (object):

    def __init__(self, data, creators=None, creation_op=None):
        self.data = np.array(data)
        self.creation_op = creation_op
        self.creators = creators
        self.grad = None

    def backward(self, grad):
        self.grad = grad
        if(self.creation_op == "add"):
            self.creators[0].backward(grad)
            self.creators[1].backward(grad)

    def __add__(self, other):
        return Tensor(self.data + other.data,
                      creators=[self,other],
                      creation_op="add")

    def __repr__(self):
        return str(self.data.__repr__())

    def __str__(self):
        return str(self.data.__str__())

x = Tensor([1,2,3,4,5])
y = Tensor([2,2,2,2,2])

z = x + y
z.backward(Tensor(np.array([1,1,1,1,1])))
This method introduces two new concepts. First, each tensor gets two new attributes.
creators is a list containing any tensors used in the creation of the current tensor (which
defaults to None). Thus, when the two tensors x and y are added together, z has two
creators, x and y. creation_op is a related attribute that stores the operation the creators performed in the creation process. Thus, performing z = x + y creates a computation graph with three nodes (x, y, and z) and two edges (z -> x and z -> y). Each edge is labeled by the creation_op add. This graph allows you to recursively backpropagate gradients.
  x       y
   \     /
    add
     |
     z
The first new concept in this implementation is the automatic creation of this graph
whenever you perform math operations. If you took z and performed further operations,
the graph would continue with whatever resulting new variables pointed back to z.
The second new concept introduced in this version of Tensor is the ability to use this graph
to compute gradients. When you call z.backward(), it sends the correct gradient for x
and y given the function that was applied to create z (add). Looking at the graph, you place
a vector of gradients (np.array([1,1,1,1,1])) on z, and then they’re applied to their
parents. As you learned in chapter 4, backpropagating through addition means also applying
addition when backpropagating. In this case, because there’s only one gradient to add into x
or y, you copy the gradient from z onto x and y:
print(x.grad)
print(y.grad)
print(z.creators)
print(z.creation_op)
[1 1 1 1 1]
[1 1 1 1 1]
[array([1, 2, 3, 4, 5]), array([2, 2, 2, 2, 2])]
add
Perhaps the most elegant part of this form of autograd is that it works recursively as well,
because each vector calls .backward() on all of its self.creators:
a = Tensor([1,2,3,4,5])
b = Tensor([2,2,2,2,2])
c = Tensor([5,4,3,2,1])
d = Tensor([-1,-2,-3,-4,-5])
e = a + b
f = c + d
g = e + f
g.backward(Tensor(np.array([1,1,1,1,1])))
print(a.grad)

[1 1 1 1 1]
A quick checkpoint
Everything in Tensor is another form of lessons already learned.
Before moving on, I want to first acknowledge that even if it feels like a bit of a stretch or a heavy lift to think about gradients flowing over a graphical structure, this is nothing new compared to what you’ve already been working with. In the previous chapter on RNNs, you forward propagated in one direction and then backpropagated across a (virtual) graph of activations.
You just didn’t explicitly encode the nodes and edges in a graphical data structure. Instead,
you had a list of layers (dictionaries) and hand-coded the correct order of forward and
backpropagation operations. Now you’re building a nice interface so you don’t have to
write as much code. This interface lets you backpropagate recursively instead of having to
handwrite complicated backprop code.
This chapter is only somewhat theoretical. It’s mostly about commonly used engineering
practices for learning deep neural networks. In particular, this notion of a graph that gets
built during forward propagation is called a dynamic computation graph because it’s built
on the fly during forward prop. This is the type of autograd present in newer deep learning
frameworks such as DyNet and PyTorch. Older frameworks such as Theano and TensorFlow
have what’s called a static computation graph, which is specified before forward propagation
even begins.
In general, dynamic computation graphs are easier to write/experiment with, and static
computation graphs have faster runtimes because of some fancy logic under the hood.
But note that dynamic and static frameworks have lately been moving toward the middle,
allowing dynamic graphs to compile to static ones (for faster runtimes) or allowing static
graphs to be built dynamically (for easier experimentation). In the long run, you’re likely
to end up with both. The primary difference is whether forward propagation is happening
during graph construction or after the graph is already defined. In this book, we’ll stick
with dynamic.
The main point of this chapter is to help prepare you for deep learning in the real world,
where 10% (or less) of your time will be spent thinking up a new idea and 90% of your time
will be spent figuring out how to get a deep learning framework to play nicely. Debugging
these frameworks can be extremely difficult at times, because most bugs don’t raise an error
and print out a stack trace. Most bugs lie hidden within the code, keeping the network from
training as it should (even if it appears to be training somewhat).
All that is to say, really dive into this chapter. You’ll be glad you did when it’s 2:00 a.m. and you’re chasing down an optimization bug that’s keeping you from getting that juicy state-of-the-art score.
Tensors that are used multiple times
The basic autograd has a rather pesky bug. Let’s squish it!
The current version of Tensor supports backpropagating into a variable only once. But
sometimes, during forward propagation, you’ll use the same tensor multiple times (the
weights of a neural network), and thus multiple parts of the graph will backpropagate
gradients into the same tensor. But the code will currently compute the incorrect gradient
when backpropagating into a variable that was used multiple times (is the parent of multiple
children). Here’s what I mean:
a = Tensor([1,2,3,4,5])
b = Tensor([2,2,2,2,2])
c = Tensor([5,4,3,2,1])
d = a + b
e = b + c
f = d + e
f.backward(Tensor(np.array([1,1,1,1,1])))
print(b.grad.data == np.array([2,2,2,2,2]))
array([False, False, False, False, False])
In this example, the b variable is used twice in the process of creating f. Thus, its gradient should be the sum of two derivatives: [2,2,2,2,2]. Shown here is the resulting graph created by this chain of operations. Notice that two edges now point into b, so its gradient should be the sum of the gradients coming from both d and e.
  a       b       c
   \     / \     /
    add     add
     |       |
     d       e
      \     /
       add
        |
        f
But the current implementation of Tensor merely overwrites each gradient with the most recent one. First, d applies its gradient, and then it gets overwritten with the gradient from e. We need to change the way gradients are written.
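The fix, implemented in the next section, is to accumulate instead of overwrite. In isolation, the accumulation rule looks like this, with two made-up child gradients standing in for d and e:

```python
import numpy as np

grad = None                                    # gradient stored on the tensor b
child_grads = [np.array([1, 1, 1, 1, 1]),      # gradient arriving from d
               np.array([1, 1, 1, 1, 1])]      # gradient arriving from e

for g in child_grads:
    if grad is None:
        grad = g          # first child: just store the gradient
    else:
        grad = grad + g   # later children: add instead of overwrite

print(grad)               # [2 2 2 2 2]
```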
Upgrading autograd to support multiuse tensors
Add one new function, and update three old ones.
This update to the Tensor object adds two new features. First, gradients can be accumulated so
that when a variable is used more than once, it receives gradients from all children:
import numpy as np

class Tensor (object):

    def __init__(self, data,
                 autograd=False,
                 creators=None,
                 creation_op=None,
                 id=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op
        self.grad = None
        self.autograd = autograd
        self.children = {}
        if(id is None):
            id = np.random.randint(0,100000)
        self.id = id

        # Keeps track of how many children a tensor has
        if(creators is not None):
            for c in creators:
                if(self.id not in c.children):
                    c.children[self.id] = 1
                else:
                    c.children[self.id] += 1

    # Checks whether a tensor has received the correct
    # number of gradients from each child
    def all_children_grads_accounted_for(self):
        for id,cnt in self.children.items():
            if(cnt != 0):
                return False
        return True

    def backward(self, grad=None, grad_origin=None):
        if(self.autograd):

            # Checks to make sure you can backpropagate or whether
            # you're waiting for a gradient, in which case
            # decrement the counter
            if(grad_origin is not None):
                if(self.children[grad_origin.id] == 0):
                    raise Exception("cannot backprop more than once")
                else:
                    self.children[grad_origin.id] -= 1

            # Accumulates gradients from several children
            if(self.grad is None):
                self.grad = grad
            else:
                self.grad += grad

            # Begins actual backpropagation
            if(self.creators is not None and
               (self.all_children_grads_accounted_for() or
                grad_origin is None)):

                if(self.creation_op == "add"):
                    self.creators[0].backward(self.grad, self)
                    self.creators[1].backward(self.grad, self)

    def __add__(self, other):
        if(self.autograd and other.autograd):
            return Tensor(self.data + other.data,
                          autograd=True,
                          creators=[self,other],
                          creation_op="add")
        return Tensor(self.data + other.data)

    def __repr__(self):
        return str(self.data.__repr__())

    def __str__(self):
        return str(self.data.__str__())

a = Tensor([1,2,3,4,5], autograd=True)
b = Tensor([2,2,2,2,2], autograd=True)
c = Tensor([5,4,3,2,1], autograd=True)

d = a + b
e = b + c
f = d + e

f.backward(Tensor(np.array([1,1,1,1,1])))

print(b.grad.data == np.array([2,2,2,2,2]))

[ True  True  True  True  True]
Additionally, you create a self.children counter that counts the number of gradients
received from each child during backpropagation. This way, you also prevent a variable from
accidentally backpropagating from the same child twice (which throws an exception).
The second added feature is a new function with the rather verbose name all_children_
grads_accounted_for(). The purpose of this function is to compute whether a tensor has
received gradients from all of its children in the graph. Normally, whenever .backward() is
called on an intermediate variable in a graph, it immediately calls .backward() on its parents.
But because some variables receive their gradient value from multiple parents, each variable
needs to wait to call .backward() on its parents until it has the final gradient locally.
As mentioned previously, none of these concepts are new from a deep learning theory
perspective; these are the kinds of engineering challenges that deep learning frameworks
seek to face. More important, they’re the kinds of challenges you’ll face when debugging
neural networks in a standard framework. Before moving on, take a moment to play around
and get familiar with this code. Try deleting different parts and seeing how it breaks in
various ways. Try calling .backward() twice.
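If it helps to see the bookkeeping in isolation, here's a stripped-down, plain-Python sketch of the child-counting idea (the dictionary and the two helper functions are hypothetical stand-ins for the Tensor machinery, not part of the framework):

```python
# Suppose b was used to create two children, d and e: b must receive
# exactly one gradient from each before it may backpropagate further.
children = {"d": 1, "e": 1}

def receive_grad(child_id):
    # Mirrors the check in .backward(): a child may deliver its
    # gradient only as many times as it was registered.
    if children[child_id] == 0:
        raise Exception("cannot backprop more than once")
    children[child_id] -= 1

def all_children_grads_accounted_for():
    return all(cnt == 0 for cnt in children.values())

receive_grad("d")
print(all_children_grads_accounted_for())   # False: still waiting on e
receive_grad("e")
print(all_children_grads_accounted_for())   # True: safe to recurse
```

A second receive_grad("d") at this point raises the same "cannot backprop more than once" exception the Tensor class throws.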
Chapter 13 | Introducing automatic optimization
How does addition backpropagation work?
Let’s study the abstraction to learn how to add support
for more functions.
At this point, the framework has reached an exciting place! You can now add support for
arbitrary operations by adding the function to the Tensor class and adding its derivative to
the .backward() method. For addition, there’s the following method:
def __add__(self, other):
    if(self.autograd and other.autograd):
        return Tensor(self.data + other.data,
                      autograd=True,
                      creators=[self,other],
                      creation_op="add")
    return Tensor(self.data + other.data)
And for backpropagation through the addition function, there’s the following gradient
propagation within the .backward() method:
if(self.creation_op == "add"):
    self.creators[0].backward(self.grad, self)
    self.creators[1].backward(self.grad, self)
Notice that addition isn’t handled anywhere else in the class. The generic backpropagation
logic is abstracted away so everything necessary for addition is defined in these two places.
Note further that backpropagation logic calls .backward() two times, once for each variable
that participated in the addition. Thus, the default setting in the backpropagation logic is to
always backpropagate into every variable in the graph. But occasionally, backpropagation
is skipped if the variable has autograd turned off (self.autograd == False). This check is
performed in the .backward() method:
def backward(self,grad=None, grad_origin=None):
    if(self.autograd):
        if(grad_origin is not None):
            if(self.children[grad_origin.id] == 0):
                raise Exception("cannot backprop more than once")
    ...
Even though the backpropagation logic for addition backpropagates the gradient into all the
variables that contributed to it, the backpropagation won’t run unless .autograd is set to True
for that variable (for self.creators[0] or self.creators[1], respectively). Also notice in the
first line of __add__() that the tensor created (which is later the tensor running .backward())
has self.autograd == True only if self.autograd == other.autograd == True.
Adding support for negation
Let’s modify the support for addition to support negation.
Now that addition is working, you should be able to copy and paste the addition code, make
a few modifications, and add autograd support for negation. Let's try it. Modifications from
the __add__ function are in bold:
def __neg__(self):
    if(self.autograd):
        return Tensor(self.data * -1,
                      autograd=True,
                      creators=[self],
                      creation_op="neg")
    return Tensor(self.data * -1)
Nearly everything is identical. You don't accept any parameters, so the parameter "other" has
been removed in several places. Let’s take a look at the backprop logic you should add to
.backward(). Modifications from the __add__ function backpropagation logic are in bold:
if(self.creation_op == "neg"):
    self.creators[0].backward(self.grad.__neg__())
Because the __neg__ function has only one creator, you end up calling .backward() only
once. (If you’re wondering how you know the correct gradients to backpropagate, revisit
chapters 4, 5, and 6.) You can now test out the new code:
a = Tensor([1,2,3,4,5], autograd=True)
b = Tensor([2,2,2,2,2], autograd=True)
c = Tensor([5,4,3,2,1], autograd=True)

d = a + (-b)
e = (-b) + c
f = d + e

f.backward(Tensor(np.array([1,1,1,1,1])))
print(b.grad.data == np.array([-2,-2,-2,-2,-2]))

[ True  True  True  True  True]
When you forward propagate using -b instead of b, the gradients that are backpropagated
have a flipped sign as well. Furthermore, you don’t have to change anything about the
general backpropagation system to make this work. You can create new functions as you
need them. Let’s add some more!
Adding support for additional functions
Subtraction, multiplication, sum, expand, transpose,
and matrix multiplication
Using the same ideas you learned for addition and negation, let’s add the forward and
backpropagation logic for several additional functions:
def __sub__(self, other):
    if(self.autograd and other.autograd):
        return Tensor(self.data - other.data,
                      autograd=True,
                      creators=[self,other],
                      creation_op="sub")
    return Tensor(self.data - other.data)

def __mul__(self, other):
    if(self.autograd and other.autograd):
        return Tensor(self.data * other.data,
                      autograd=True,
                      creators=[self,other],
                      creation_op="mul")
    return Tensor(self.data * other.data)

def sum(self, dim):
    if(self.autograd):
        return Tensor(self.data.sum(dim),
                      autograd=True,
                      creators=[self],
                      creation_op="sum_"+str(dim))
    return Tensor(self.data.sum(dim))

def expand(self, dim, copies):
    trans_cmd = list(range(0,len(self.data.shape)))
    trans_cmd.insert(dim,len(self.data.shape))
    new_shape = list(self.data.shape) + [copies]
    new_data = self.data.repeat(copies).reshape(new_shape)
    new_data = new_data.transpose(trans_cmd)
    if(self.autograd):
        return Tensor(new_data,
                      autograd=True,
                      creators=[self],
                      creation_op="expand_"+str(dim))
    return Tensor(new_data)

def transpose(self):
    if(self.autograd):
        return Tensor(self.data.transpose(),
                      autograd=True,
                      creators=[self],
                      creation_op="transpose")
    return Tensor(self.data.transpose())
def mm(self, x):
    if(self.autograd):
        return Tensor(self.data.dot(x.data),
                      autograd=True,
                      creators=[self,x],
                      creation_op="mm")
    return Tensor(self.data.dot(x.data))
We’ve previously discussed the derivatives for all these functions, although sum and
expand might seem foreign because they have new names. sum performs addition across a
dimension of the tensor; in other words, say you have a 2 × 3 matrix called x:
x = Tensor(np.array([[1,2,3],
                     [4,5,6]]))
The .sum(dim) function sums across a dimension. x.sum(0) will result in a 1 × 3 matrix (a
length 3 vector), whereas x.sum(1) will result in a 2 × 1 matrix (a length 2 vector):
x.sum(0)
array([5, 7, 9])
x.sum(1)
array([ 6, 15])
You use expand to backpropagate through a .sum(). It's a function that copies data along a
dimension. Given the same matrix x, copying it four times along the first dimension gives
four copies of the tensor:
x.expand(dim=0, copies=4)

array([[[1, 2, 3],
        [4, 5, 6]],

       [[1, 2, 3],
        [4, 5, 6]],

       [[1, 2, 3],
        [4, 5, 6]],

       [[1, 2, 3],
        [4, 5, 6]]])
To be clear, whereas .sum() removes a dimension (2 × 3 -> just 2 or 3), expand adds
a dimension. The 2 × 3 matrix becomes 4 × 2 × 3. You can think of this as a list of four
tensors, each of which is 2 × 3. But if you expand to the last dimension, it copies along the
last dimension, so each entry in the original tensor becomes a list of entries instead:
x.expand(dim=2, copies=4)

array([[[1, 1, 1, 1],
        [2, 2, 2, 2],
        [3, 3, 3, 3]],

       [[4, 4, 4, 4],
        [5, 5, 5, 5],
        [6, 6, 6, 6]]])
Thus, when you perform .sum(dim=1) on a tensor with four entries in that dimension, you
need to perform .expand(dim=1, copies=4) to the gradient when you backpropagate it.
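The sum/expand pairing can be sketched in plain NumPy, outside the Tensor class (the variable names here are local to the sketch):

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

s = x.sum(1)                  # shape (2,): [ 6 15]
grad_s = np.ones_like(s)      # gradient arriving at the sum's output

# The backward pass of sum(dim=1) is expand(dim=1, copies=3):
# copy the gradient once for every entry that was summed away.
grad_x = np.repeat(grad_s[:, None], x.shape[1], axis=1)

print(grad_x.shape)           # (2, 3): matches x again
```

Each entry of x contributed once to its row's sum, so each entry receives an unmodified copy of that row's gradient.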
You can now add the corresponding backpropagation logic to the .backward() method:
if(self.creation_op == "sub"):
    new = Tensor(self.grad.data)
    self.creators[0].backward(new, self)
    new = Tensor(self.grad.__neg__().data)
    self.creators[1].backward(new, self)

if(self.creation_op == "mul"):
    new = self.grad * self.creators[1]
    self.creators[0].backward(new, self)
    new = self.grad * self.creators[0]
    self.creators[1].backward(new, self)

if(self.creation_op == "mm"):
    act = self.creators[0]        # Usually an activation
    weights = self.creators[1]    # Usually a weight matrix
    new = self.grad.mm(weights.transpose())
    act.backward(new)
    new = self.grad.transpose().mm(act).transpose()
    weights.backward(new)

if(self.creation_op == "transpose"):
    self.creators[0].backward(self.grad.transpose())

if("sum" in self.creation_op):
    dim = int(self.creation_op.split("_")[1])
    ds = self.creators[0].data.shape[dim]
    self.creators[0].backward(self.grad.expand(dim,ds))

if("expand" in self.creation_op):
    dim = int(self.creation_op.split("_")[1])
    self.creators[0].backward(self.grad.sum(dim))
If you’re unsure about this functionality, the best thing to do is to look back at how you
were doing backpropagation in chapter 6. That chapter has figures showing each step of
backpropagation, part of which I’ve shown again here.
The gradients start at the end of the network. You then move the error signal backward
through the network by calling functions that correspond to the functions used to move
activations forward through the network. If the last operation was a matrix multiplication
(and it was), you backpropagate by performing matrix multiplication (dot) on the
transposed matrix.
In the following image, this happens at the line layer_1_delta=layer_2_delta.dot
(weights_1_2.T). In the previous code, it happens in if(self.creation_op == "mm")
(highlighted in bold). You’re doing the exact same operations as before (in reverse order of
forward propagation), but the code is better organized.
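If you want extra assurance that the two mm gradients are correct, here's a quick NumPy-only sanity check against finite differences (a sketch with made-up shapes, independent of the framework):

```python
import numpy as np

np.random.seed(1)
act = np.random.rand(4, 3)       # stands in for self.creators[0].data
weights = np.random.rand(3, 2)   # stands in for self.creators[1].data
grad = np.ones((4, 2))           # gradient of loss = sum(act.dot(weights))

# The two gradients computed in the "mm" branch:
grad_act = grad.dot(weights.T)
grad_weights = grad.T.dot(act).T   # same values as act.T.dot(grad)

# Numerically nudge one entry of act and compare:
eps = 1e-6
bump = np.zeros_like(act)
bump[1, 2] = eps
numeric = ((act + bump).dot(weights).sum() -
           (act - bump).dot(weights).sum()) / (2 * eps)

print(np.allclose(grad_act[1, 2], numeric))   # True
```

The same nudge-and-compare trick applied to an entry of weights agrees with grad_weights as well.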
[Figure, repeated from chapter 6: "LEARN: backpropagating from layer_2 to layer_1." The diagram shows the network (layer_0 inputs, layer_1 hiddens, layer_2 prediction) annotated with the following code:]

layer_0 = lights[0:1]
layer_1 = np.dot(layer_0,weights_0_1)
layer_1 = relu(layer_1)
layer_2 = np.dot(layer_1,weights_1_2)
error = (layer_2-walk_stop[0:1])**2
layer_2_delta=(layer_2-walk_stop[0:1])

layer_1_delta=layer_2_delta.dot(weights_1_2.T)
layer_1_delta *= relu2deriv(layer_1)

[Figure: "LEARN: Generating weight_deltas and updating weights," which continues:]

weight_delta_1_2 = layer_1.T.dot(layer_2_delta)
weight_delta_0_1 = layer_0.T.dot(layer_1_delta)

weights_1_2 -= alpha * weight_delta_1_2
weights_0_1 -= alpha * weight_delta_0_1
Using autograd to train a neural network
You no longer have to write backpropagation logic!
This may have seemed like quite a bit of engineering effort, but it’s about to pay off. Now,
when you train a neural network, you don’t have to write any backpropagation logic! As a
toy example, here’s a neural network to backprop by hand:
import numpy as np

np.random.seed(0)

data = np.array([[0,0],[0,1],[1,0],[1,1]])
target = np.array([[0],[1],[0],[1]])

weights_0_1 = np.random.rand(2,3)
weights_1_2 = np.random.rand(3,1)

for i in range(10):
    layer_1 = data.dot(weights_0_1)      # Predict
    layer_2 = layer_1.dot(weights_1_2)

    diff = (layer_2 - target)            # Compare
    sqdiff = (diff * diff)
    loss = sqdiff.sum(0)                 # Mean squared error loss

    # Learn; this is the backpropagation piece.
    layer_1_grad = diff.dot(weights_1_2.transpose())
    weight_1_2_update = layer_1.transpose().dot(diff)
    weight_0_1_update = data.transpose().dot(layer_1_grad)

    weights_1_2 -= weight_1_2_update * 0.1
    weights_0_1 -= weight_0_1_update * 0.1
    print(loss[0])
0.4520108746468352
0.33267400101121475
0.25307308516725036
0.1969566997160743
0.15559900212801492
0.12410658864910949
0.09958132129923322
0.08019781265417164
0.06473333002675746
0.05232281719234398
You have to forward propagate in such a way that layer_1, layer_2, and diff exist as
variables, because you need them later. You then have to backpropagate each gradient to its
appropriate weight matrix and perform the weight update appropriately.
import numpy as np

np.random.seed(0)

data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

w = list()
w.append(Tensor(np.random.rand(2,3), autograd=True))
w.append(Tensor(np.random.rand(3,1), autograd=True))

for i in range(10):
    pred = data.mm(w[0]).mm(w[1])                      # Predict

    loss = ((pred - target)*(pred - target)).sum(0)    # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    for w_ in w:
        w_.data -= w_.grad.data * 0.1
        w_.grad.data *= 0

    print(loss)
But with the fancy new autograd system, the code is much simpler. You don’t have to keep
around any temporary variables (because the dynamic graph keeps track of them), and you
don’t have to implement any backpropagation logic (because the .backward() method
handles that). Not only is this more convenient, but you’re less likely to make silly mistakes
in the backpropagation code, reducing the likelihood of bugs!
[0.58128304]
[0.48988149]
[0.41375111]
[0.34489412]
[0.28210124]
[0.2254484]
[0.17538853]
[0.1324231]
[0.09682769]
[0.06849361]
Before moving on, I’d like to point out one stylistic thing in this new implementation. Notice
that I put all the parameters in a list, which I could iterate through when performing the
weight update. This is a bit of foreshadowing for the next piece of functionality. When you
have an autograd system, stochastic gradient descent becomes trivial to implement (it’s just
that for loop at the end). Let’s try making this its own class as well.
Adding automatic optimization
Let’s make a stochastic gradient descent optimizer.
At face value, creating something called a stochastic gradient descent optimizer may sound
difficult, but it’s just copying and pasting from the previous example with a bit of good, old-fashioned object-oriented programming:
class SGD(object):

    def __init__(self, parameters, alpha=0.1):
        self.parameters = parameters
        self.alpha = alpha

    def zero(self):
        for p in self.parameters:
            p.grad.data *= 0

    def step(self, zero=True):
        for p in self.parameters:
            p.data -= p.grad.data * self.alpha
            if(zero):
                p.grad.data *= 0
The previous neural network is further simplified as follows, with exactly the same results as
before:
import numpy as np

np.random.seed(0)

data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

w = list()
w.append(Tensor(np.random.rand(2,3), autograd=True))
w.append(Tensor(np.random.rand(3,1), autograd=True))

optim = SGD(parameters=w, alpha=0.1)

for i in range(10):
    pred = data.mm(w[0]).mm(w[1])                      # Predict

    loss = ((pred - target)*(pred - target)).sum(0)    # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    optim.step()
Adding support for layer types
You may be familiar with layer types in Keras or PyTorch.
At this point, you’ve done the most complicated pieces of the new deep learning framework.
Further work is mostly about adding new functions to the tensor and creating convenient
higher-order classes and functions. Probably the most common abstraction among nearly all
frameworks is the layer abstraction. It’s a collection of commonly used forward propagation
techniques packaged into a simple API with some kind of .forward() method to call
them. Here’s an example of a simple linear layer:
class Layer(object):

    def __init__(self):
        self.parameters = list()

    def get_parameters(self):
        return self.parameters

class Linear(Layer):

    def __init__(self, n_inputs, n_outputs):
        super().__init__()
        W = np.random.randn(n_inputs, n_outputs) * np.sqrt(2.0/(n_inputs))
        self.weight = Tensor(W, autograd=True)
        self.bias = Tensor(np.zeros(n_outputs), autograd=True)

        self.parameters.append(self.weight)
        self.parameters.append(self.bias)

    def forward(self, input):
        return input.mm(self.weight)+self.bias.expand(0,len(input.data))
Nothing here is particularly new. The weights are organized into a class (and I added bias
weights because this is a true linear layer). You can initialize the layer all together, such
that both the weights and bias are initialized with the correct sizes, and the correct forward
propagation logic is always employed.
Also notice that I created an abstract class Layer, which has a single getter. This allows for
more-complicated layer types (such as layers containing other layers). All you need to do is
override get_parameters() to control what tensors are later passed to the optimizer (such
as the SGD class created in the previous section).
Layers that contain layers
Layers can also contain other layers.
The most popular layer is a sequential layer that forward propagates a list of layers, where
each layer feeds its outputs into the inputs of the next layer:
class Sequential(Layer):

    def __init__(self, layers=list()):
        super().__init__()
        self.layers = layers

    def add(self, layer):
        self.layers.append(layer)

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input

    def get_parameters(self):
        params = list()
        for l in self.layers:
            params += l.get_parameters()
        return params

data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

model = Sequential([Linear(2,3), Linear(3,1)])

optim = SGD(parameters=model.get_parameters(), alpha=0.05)

for i in range(10):
    pred = model.forward(data)                         # Predict

    loss = ((pred - target)*(pred - target)).sum(0)    # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    optim.step()
    print(loss)
Loss-function layers
Some layers have no weights.
You can also create layers that are functions on the input. The most popular version of this
kind of layer is probably the loss-function layer, such as mean squared error:
class MSELoss(Layer):

    def __init__(self):
        super().__init__()

    def forward(self, pred, target):
        return ((pred - target)*(pred - target)).sum(0)

import numpy as np

np.random.seed(0)

data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

model = Sequential([Linear(2,3), Linear(3,1)])
criterion = MSELoss()

optim = SGD(parameters=model.get_parameters(), alpha=0.05)

for i in range(10):
    pred = model.forward(data)                         # Predict

    loss = criterion.forward(pred, target)             # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    optim.step()
    print(loss)
[2.33428272]
[0.06743796]
...
[0.01153118]
[0.00889602]
If you’ll forgive the repetition, again, nothing here is particularly new. Under the hood, the
last several code examples all do the exact same computation. It’s just that autograd is doing
all the backpropagation, and the forward propagation steps are packaged in nice classes to
ensure that the functionality executes in the correct order.
How to learn a framework
Oversimplified, frameworks are autograd + a list of prebuilt
layers and optimizers.
You’ve been able to write (rather quickly) a variety of new layer types using the underlying
autograd system, which makes it quite easy to piece together arbitrary layers of functionality.
Truth be told, this is the main feature of modern frameworks, eliminating the need to
handwrite each and every math operation for forward and backward propagation. Using
frameworks greatly increases the speed with which you can go from idea to experiment and
will reduce the number of bugs in your code.
Viewing a framework as merely an autograd system coupled with a big list of layers and
optimizers will help you learn them. I expect you’ll be able to pivot from this chapter into
almost any framework fairly quickly, although the framework that’s most similar to the API
built here is PyTorch. Either way, for your reference, take a moment to peruse the lists of
layers and optimizers in several of the big frameworks:
• https://pytorch.org/docs/stable/nn.html
• https://keras.io/layers/about-keras-layers
• https://www.tensorflow.org/api_docs/python/tf/layers
The general workflow for learning a new framework is to find the simplest possible code
example, tweak it and get to know the autograd system’s API, and then modify the code
example piece by piece until you get to whatever experiment you care about.
One more thing before we move on. I’m adding a nice convenience function to
Tensor.backward() that makes it so you don’t have to pass in a gradient of 1s the first time
you call .backward(). It’s not, strictly speaking, necessary—but it’s handy:

def backward(self,grad=None, grad_origin=None):
    if(self.autograd):
        if(grad is None):
            grad = Tensor(np.ones_like(self.data))
Nonlinearity layers
Let’s add nonlinear functions to Tensor and then create some
layer types.
For the next chapter, you’ll need .sigmoid() and .tanh(). Let’s add them to the Tensor
class. You learned about the derivative for both quite some time ago, so this should be easy:
def sigmoid(self):
    if(self.autograd):
        return Tensor(1 / (1 + np.exp(-self.data)),
                      autograd=True,
                      creators=[self],
                      creation_op="sigmoid")
    return Tensor(1 / (1 + np.exp(-self.data)))

def tanh(self):
    if(self.autograd):
        return Tensor(np.tanh(self.data),
                      autograd=True,
                      creators=[self],
                      creation_op="tanh")
    return Tensor(np.tanh(self.data))
The following code shows the backprop logic added to the Tensor.backward() method:
if(self.creation_op == "sigmoid"):
    ones = Tensor(np.ones_like(self.grad.data))
    self.creators[0].backward(self.grad * (self * (ones - self)))

if(self.creation_op == "tanh"):
    ones = Tensor(np.ones_like(self.grad.data))
    self.creators[0].backward(self.grad * (ones - (self * self)))
Hopefully, this feels fairly routine. See if you can make a few more nonlinearities as well: try
HardTanh or relu.
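If you try the relu exercise, the forward and backward math can be sketched in plain NumPy first (one possible approach; wiring it into Tensor and .backward() follows the same creation_op pattern as sigmoid and tanh):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# relu forward: zero out the negative entries
out = np.maximum(x, 0)

# relu backward: the gradient flows only where the input was positive
grad_out = np.ones_like(out)
grad_x = grad_out * (x > 0)

print(out)
print(grad_x)
```

Here out is [0, 0, 0, 1.5, 3.0] and grad_x is [0, 0, 0, 1, 1]: entries that were clipped to zero also block the gradient.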
class Tanh(Layer):

    def __init__(self):
        super().__init__()

    def forward(self, input):
        return input.tanh()

class Sigmoid(Layer):

    def __init__(self):
        super().__init__()

    def forward(self, input):
        return input.sigmoid()
Let’s try out the new nonlinearities. New additions are in bold:
import numpy as np

np.random.seed(0)

data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

model = Sequential([Linear(2,3), Tanh(), Linear(3,1), Sigmoid()])
criterion = MSELoss()

optim = SGD(parameters=model.get_parameters(), alpha=1)

for i in range(10):
    pred = model.forward(data)                         # Predict

    loss = criterion.forward(pred, target)             # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    optim.step()
    print(loss)
[1.06372865]
[0.75148144]
[0.57384259]
[0.39574294]
[0.2482279]
[0.15515294]
[0.10423398]
[0.07571169]
[0.05837623]
[0.04700013]
As you can see, you can drop the new Tanh() and Sigmoid() layers into the input
parameters to Sequential(), and the neural network knows exactly how to use them. Easy!
In the previous chapter, you learned about recurrent neural networks. In particular, you
trained a model to predict the next word, given the previous several words. Before we finish
this chapter, I’d like for you to translate that code into the new framework. To do this, you’ll
need three new layer types: an embedding layer that learns word embeddings, an RNN
layer that can learn to model sequences of inputs, and a softmax layer that can predict a
probability distribution over labels.
The embedding layer
An embedding layer translates indices into activations.
In chapter 11, you learned about word embeddings, which are vectors mapped to words
that you can forward propagate into a neural network. Thus, if you have a vocabulary of
200 words, you’ll also have 200 embeddings. This gives the initial spec for creating an
embedding layer. First, initialize a list (of the right length) of word embeddings (of the
right size):
class Embedding(Layer):

    def __init__(self, vocab_size, dim):
        super().__init__()

        self.vocab_size = vocab_size
        self.dim = dim

        # This initialization style is a convention from word2vec.
        weight = (np.random.rand(vocab_size, dim) - 0.5) / dim
So far, so good. The matrix has a row (vector) for each word in the vocabulary. Now, how
will you forward propagate? Well, forward propagation always starts with the question,
“How will the inputs be encoded?” In the case of word embeddings, you obviously can’t pass
in the words themselves, because the words don’t tell you which rows in self.weight to
forward propagate with. Instead, as you hopefully remember from chapter 11, you forward
propagate indices. Fortunately, NumPy supports this operation:
identity = np.eye(5)
print(identity)
print(identity[np.array([[1,2,3,4],
                         [2,3,4,0]])])
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]

[[[0. 1. 0. 0. 0.]
  [0. 0. 1. 0. 0.]
  [0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 1.]]

 [[0. 0. 1. 0. 0.]
  [0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 1.]
  [1. 0. 0. 0. 0.]]]
Notice how, when you pass a matrix of integers into a NumPy matrix, it returns the same
matrix, but with each integer replaced with the row the integer specified. Thus a 2D matrix
of indices turns into a 3D matrix of embeddings (rows). This is perfect!
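Here's the same trick as a tiny, self-contained sketch, using a made-up four-word vocabulary with 3-dimensional embeddings instead of an identity matrix:

```python
import numpy as np

np.random.seed(0)
weight = np.random.rand(4, 3)    # one 3-d embedding row per "word"

# A 2x2 batch of word indices...
indices = np.array([[0, 2],
                    [1, 3]])

# ...selects a 2x2x3 block of embedding rows.
out = weight[indices]

print(out.shape)                        # (2, 2, 3)
print((out[0, 1] == weight[2]).all())   # True: row 2 was selected
```

Every index in the batch is replaced by the corresponding row of weight, which is exactly the forward pass the embedding layer needs.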
Adding indexing to autograd
Before you can build the embedding layer, autograd needs to
support indexing.
In order to support the new embedding strategy (which assumes words are forward
propagated as matrices of indices), the indexing you played around with in the previous
section must be supported by autograd. This is a pretty simple idea. You need to make sure
that during backpropagation, the gradients are placed in the same rows as were indexed into
for forward propagation. This requires that you keep around whatever indices were passed
in, so you can place each gradient in the appropriate location during backpropagation with a
simple for loop:
def index_select(self, indices):
    if(self.autograd):
        new = Tensor(self.data[indices.data],
                     autograd=True,
                     creators=[self],
                     creation_op="index_select")
        new.index_select_indices = indices
        return new
    return Tensor(self.data[indices.data])
First, use the NumPy trick you learned in the previous section to select the correct rows:
if(self.creation_op == "index_select"):
    new_grad = np.zeros_like(self.creators[0].data)
    indices_ = self.index_select_indices.data.flatten()
    grad_ = grad.data.reshape(len(indices_), -1)
    for i in range(len(indices_)):
        new_grad[indices_[i]] += grad_[i]
    self.creators[0].backward(Tensor(new_grad))
Then, during backprop(), initialize a new gradient of the correct size (the size of the
original matrix that was being indexed into). Second, flatten the indices so you can iterate
through them. Third, collapse grad_ to a simple list of rows. (The subtle part is that the list
of indices in indices_ and the list of vectors in grad_ will be in the corresponding order.)
Then, iterate through each index, add it into the correct row of the new gradient you’re
creating, and backpropagate it into self.creators[0]. As you can see, grad_[i] correctly
updates each row (adds a vector of 1s, in this case) in accordance with the number of times
the index is used. Indices 2 and 3 update twice (in bold):
x = Tensor(np.eye(5), autograd=True)
x.index_select(Tensor([[1,2,3],
                       [2,3,4]])).backward()
print(x.grad)
[[0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1.]
 [2. 2. 2. 2. 2.]
 [2. 2. 2. 2. 2.]
 [1. 1. 1. 1. 1.]]
The embedding layer (revisited)
Now you can finish forward propagation using the new
.index_select() method.
For forward prop, call .index_select(), and autograd will handle the rest:
class Embedding(Layer):

    def __init__(self, vocab_size, dim):
        super().__init__()

        self.vocab_size = vocab_size
        self.dim = dim

        # This initialization style is a convention from word2vec.
        weight = (np.random.rand(vocab_size, dim) - 0.5) / dim
        self.weight = Tensor(weight, autograd=True)

        self.parameters.append(self.weight)

    def forward(self, input):
        return self.weight.index_select(input)
data = Tensor(np.array([1,2,1,2]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

embed = Embedding(5,3)
model = Sequential([embed, Tanh(), Linear(3,1), Sigmoid()])
criterion = MSELoss()

optim = SGD(parameters=model.get_parameters(), alpha=0.5)

for i in range(10):
    pred = model.forward(data)                         # Predict

    loss = criterion.forward(pred, target)             # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    optim.step()
    print(loss)

[0.98874126]
[0.6658868]
[0.45639889]
...
[0.08731868]
[0.07387834]
In this neural network, you learn to correlate input indices 1 and 2 with
the prediction 0 and 1. In theory, indices 1 and 2 could correspond to
words (or some other input object), and in the final example, they will.
This example was to show the embedding working.
The cross-entropy layer
Let’s add cross entropy to the autograd and create a layer.
Hopefully, at this point you’re starting to feel comfortable with how to create new layer
types. Cross entropy is a pretty standard one that you’ve seen many times throughout this
book. Because we’ve already walked through how to create several new layer types, I’ll leave
the code here for your reference. Attempt to do it yourself before copying this code.
def cross_entropy(self, target_indices):

    temp = np.exp(self.data)
    softmax_output = temp / np.sum(temp,
                                   axis=len(self.data.shape)-1,
                                   keepdims=True)

    t = target_indices.data.flatten()
    p = softmax_output.reshape(len(t),-1)
    target_dist = np.eye(p.shape[1])[t]
    loss = -(np.log(p) * (target_dist)).sum(1).mean()

    if(self.autograd):
        out = Tensor(loss,
                     autograd=True,
                     creators=[self],
                     creation_op="cross_entropy")
        out.softmax_output = softmax_output
        out.target_dist = target_dist
        return out
    return Tensor(loss)

if(self.creation_op == "cross_entropy"):
    dx = self.softmax_output - self.target_dist
    self.creators[0].backward(Tensor(dx))

class CrossEntropyLoss(object):

    def __init__(self):
        super().__init__()

    def forward(self, input, target):
        return input.cross_entropy(target)
import numpy as np

np.random.seed(0)

# data indices
data = Tensor(np.array([1,2,1,2]), autograd=True)

# target indices
target = Tensor(np.array([0,1,0,1]), autograd=True)

model = Sequential([Embedding(3,3), Tanh(), Linear(3,4)])
criterion = CrossEntropyLoss()

optim = SGD(parameters=model.get_parameters(), alpha=0.1)

for i in range(10):
    pred = model.forward(data)                         # Predict

    loss = criterion.forward(pred, target)             # Compare

    loss.backward(Tensor(np.ones_like(loss.data)))     # Learn
    optim.step()
    print(loss)
1.3885032434928422
0.9558181509266037
0.6823083585795604
0.5095259967493119
0.39574491472895856
0.31752527285348264
0.2617222861964216
0.22061283923954234
0.18946427334830068
0.16527389263866668
Using the same cross-entropy logic employed in several previous neural networks, you
now have a new loss function. One noticeable thing about this loss is different from the others:
both the final softmax and the computation of the loss are within the loss class. This is an
extremely common convention in deep neural networks. Nearly every framework will work
this way. When you want to finish a network and train with cross entropy, you can leave
off the softmax from the forward propagation step and call a cross-entropy class that will
automatically perform the softmax as a part of the loss function.
The reason these are combined so consistently is performance. It’s much faster to calculate
the gradient of softmax and negative log likelihood together in a cross-entropy function
than to forward propagate and backpropagate them separately in two different modules.
This has to do with a shortcut in the gradient math.
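You can check this shortcut numerically in a few lines of NumPy. This is an illustrative sketch (the softmax helper and the variable names are stand-ins, not part of the framework): the analytic gradient softmax_output - target_dist matches a finite-difference estimate of the negative log likelihood.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1]])
target = np.array([0])
p = softmax(logits)
target_dist = np.eye(logits.shape[1])[target]

# shortcut gradient of cross entropy with respect to the logits
shortcut = p - target_dist

# finite-difference gradient of -log(softmax(logits)[target])
eps = 1e-6
numeric = np.zeros_like(logits)
loss = -np.log(p[0, target[0]])
for j in range(logits.shape[1]):
    bumped = logits.copy()
    bumped[0, j] += eps
    bumped_loss = -np.log(softmax(bumped)[0, target[0]])
    numeric[0, j] = (bumped_loss - loss) / eps

assert np.allclose(shortcut, numeric, atol=1e-4)
```

That one subtraction is the entire backward pass, which is why frameworks fuse the softmax into the loss instead of backpropagating through two separate modules.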
Chapter 13: Introducing automatic optimization
The recurrent neural network layer
By combining several layers, you can learn over time series.
As the last exercise of this chapter, let’s create one more layer that’s the composition of
multiple smaller layer types. The point of this layer will be to learn the task you finished at
the end of the previous chapter. This layer is the recurrent layer. You’ll construct it using
three linear layers, and the .forward() method will take both the output from the previous
hidden state and the input from the current training data:
class RNNCell(Layer):

    def __init__(self, n_inputs, n_hidden, n_output, activation='sigmoid'):
        super().__init__()

        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_output = n_output

        if(activation == 'sigmoid'):
            self.activation = Sigmoid()
        elif(activation == 'tanh'):
            self.activation = Tanh()
        else:
            raise Exception("Non-linearity not found")

        self.w_ih = Linear(n_inputs, n_hidden)
        self.w_hh = Linear(n_hidden, n_hidden)
        self.w_ho = Linear(n_hidden, n_output)

        self.parameters += self.w_ih.get_parameters()
        self.parameters += self.w_hh.get_parameters()
        self.parameters += self.w_ho.get_parameters()

    def forward(self, input, hidden):
        from_prev_hidden = self.w_hh.forward(hidden)
        combined = self.w_ih.forward(input) + from_prev_hidden
        new_hidden = self.activation.forward(combined)
        output = self.w_ho.forward(new_hidden)
        return output, new_hidden

    def init_hidden(self, batch_size=1):
        return Tensor(np.zeros((batch_size, self.n_hidden)), autograd=True)
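Stripped of the autograd machinery, one timestep of this recurrence is just two matrix multiplies, an add, and a squashing function. Here's a standalone NumPy sketch (the weight names mirror w_ih, w_hh, and w_ho, but nothing here is a framework class):

```python
import numpy as np
np.random.seed(0)

n_inputs, n_hidden, n_output = 4, 5, 3
W_ih = np.random.randn(n_inputs, n_hidden) * 0.1   # input to hidden
W_hh = np.random.randn(n_hidden, n_hidden) * 0.1   # hidden to hidden
W_ho = np.random.randn(n_hidden, n_output) * 0.1   # hidden to output

def rnn_step(x, h):
    # same recipe as RNNCell.forward: combine the input with the previous
    # hidden state, squash, then project to the output
    new_h = np.tanh(x.dot(W_ih) + h.dot(W_hh))
    return new_h.dot(W_ho), new_h

h = np.zeros((1, n_hidden))          # like init_hidden
for t in range(6):                   # unroll over a 6-step sequence
    x = np.random.randn(1, n_inputs)
    out, h = rnn_step(x, h)

assert out.shape == (1, n_output)
assert h.shape == (1, n_hidden)
```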
It’s out of scope for this chapter to reintroduce RNNs, but it’s worth pointing out the pieces
that should be familiar already. RNNs have a state vector that passes from timestep to
timestep. In this case, it’s the variable hidden, which is both an input parameter and output
variable to the forward function. RNNs also have several different weight matrices: one
that maps input vectors to hidden vectors (processing input data), one that maps from
hidden to hidden (which updates each hidden vector based on the previous), and optionally
a hidden-to-output layer that learns to make predictions based on the hidden vector. This
RNNCell implementation includes all three. The self.w_ih layer is the input-to-hidden layer,
self.w_hh is the hidden-to-hidden layer, and self.w_ho is the hidden-to-output layer. Note
the dimensionality of each. The output size of self.w_ho is the size of the vocabulary
(one prediction per word), while the input size of self.w_ih matches the dimensionality
of the vectors fed into the cell (here, the embedding dimension). All other dimensions are
configurable via the n_hidden parameter.
Finally, an activation input parameter defines which nonlinearity is applied to hidden
vectors at each timestep. I’ve added two possibilities (Sigmoid and Tanh), but there are
many options to choose from. Let’s train a network:
import sys, random, math
from collections import Counter
import numpy as np

f = open('tasksv11/en/qa1_single-supporting-fact_train.txt', 'r')
raw = f.readlines()
f.close()

tokens = list()
for line in raw[0:1000]:
    tokens.append(line.lower().replace("\n", "").split(" ")[1:])

# pad every sentence to length 6 with a '-' token
new_tokens = list()
for line in tokens:
    new_tokens.append(['-'] * (6 - len(line)) + line)
tokens = new_tokens

vocab = set()
for sent in tokens:
    for word in sent:
        vocab.add(word)
vocab = list(vocab)

word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

def words2indices(sentence):
    idx = list()
    for word in sentence:
        idx.append(word2index[word])
    return idx

indices = list()
for line in tokens:
    idx = list()
    for w in line:
        idx.append(word2index[w])
    indices.append(idx)

data = np.array(indices)
You can learn to fit the task you previously accomplished
in the preceding chapter.
Now you can initialize the recurrent layer with an embedding input and train a network
to solve the same task as in the previous chapter. Note that this network is slightly more
complex (it has one extra layer) despite the code being much simpler, thanks to the little
framework.
embed = Embedding(vocab_size=len(vocab),dim=16)
model = RNNCell(n_inputs=16, n_hidden=16, n_output=len(vocab))
criterion = CrossEntropyLoss()
params = model.get_parameters() + embed.get_parameters()
optim = SGD(parameters=params, alpha=0.05)
First, define the input embeddings and then the recurrent cell. (Note that cell is a
conventional name given to recurrent layers when they’re implementing only a single
recurrence. If you created another layer that provided the ability to configure arbitrary
numbers of cells together, it would be called an RNN, and n_layers would be an input
parameter.)
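To make that naming distinction concrete, here's a hypothetical sketch of what such a multi-layer wrapper might look like. TinyCell and TinyRNN are illustrative stand-ins written in plain NumPy, not classes from the framework: an RNN holds n_layers cells, and each layer's hidden output feeds the next layer's input.

```python
import numpy as np

class TinyCell:
    def __init__(self, n_in, n_hidden, seed):
        rng = np.random.default_rng(seed)
        self.W_ih = rng.normal(0, 0.1, (n_in, n_hidden))
        self.W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))

    def forward(self, x, h):
        return np.tanh(x @ self.W_ih + h @ self.W_hh)

class TinyRNN:
    def __init__(self, n_in, n_hidden, n_layers):
        sizes = [n_in] + [n_hidden] * n_layers
        self.cells = [TinyCell(sizes[i], n_hidden, seed=i)
                      for i in range(n_layers)]

    def forward(self, x, hiddens):
        new_hiddens = []
        for cell, h in zip(self.cells, hiddens):
            x = cell.forward(x, h)     # one layer's output feeds the next
            new_hiddens.append(x)
        return x, new_hiddens

rnn = TinyRNN(n_in=4, n_hidden=5, n_layers=2)
hiddens = [np.zeros((1, 5)) for _ in range(2)]
out, hiddens = rnn.forward(np.ones((1, 4)), hiddens)

assert out.shape == (1, 5)
assert len(hiddens) == 2
```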
for iter in range(1000):
    batch_size = 100
    total_loss = 0

    hidden = model.init_hidden(batch_size=batch_size)

    for t in range(5):
        input = Tensor(data[0:batch_size, t], autograd=True)
        rnn_input = embed.forward(input=input)
        output, hidden = model.forward(input=rnn_input, hidden=hidden)

    target = Tensor(data[0:batch_size, t+1], autograd=True)
    loss = criterion.forward(output, target)
    loss.backward()
    optim.step()
    total_loss += loss.data

    if(iter % 200 == 0):
        p_correct = (target.data == np.argmax(output.data, axis=1)).mean()
        print_loss = total_loss / (len(data) / batch_size)
        print("Loss:", print_loss, "% Correct:", p_correct)
Loss: 0.47631100976371393 % Correct: 0.01
Loss: 0.17189538896184856 % Correct: 0.28
Loss: 0.1460940222788725 % Correct: 0.37
Loss: 0.13845863915406884 % Correct: 0.37
Loss: 0.135574472565278 % Correct: 0.37
batch_size = 1
hidden = model.init_hidden(batch_size=batch_size)
for t in range(5):
    input = Tensor(data[0:batch_size, t], autograd=True)
    rnn_input = embed.forward(input=input)
    output, hidden = model.forward(input=rnn_input, hidden=hidden)

target = Tensor(data[0:batch_size, t+1], autograd=True)
loss = criterion.forward(output, target)

ctx = ""
for idx in data[0:batch_size][0][0:-1]:
    ctx += vocab[idx] + " "
print("Context:", ctx)
print("Pred:", vocab[output.data.argmax()])
Context: - mary moved to the
Pred: office.
As you can see, the neural network learns to predict the first 100 examples of the training
dataset with an accuracy of around 37% (near perfect, for this toy task). It predicts a
plausible location for Mary to be moving toward, much like at the end of chapter 12.
Summary
Frameworks are efficient, convenient abstractions of forward
and backward logic.
I hope this chapter’s exercise has given you an appreciation for how convenient frameworks
can be. They can make your code more readable, faster to write, faster to execute (through
built-in optimizations), and much less buggy. More important, this chapter will prepare
you for using and extending industry standard frameworks like PyTorch and TensorFlow.
Whether debugging existing layer types or prototyping your own, the skills you’ve learned
here will be some of the most important you acquire in this book, because they bridge the
abstract knowledge of deep learning from previous chapters with the design of real-world
tools you’ll use to implement models in the future.
The framework that’s most similar to the one built here is PyTorch, and I highly
recommend diving into it when you complete this book. It will likely be the framework
that feels most familiar.
Chapter 14
Learning to write like Shakespeare: long short-term memory

In this chapter
• Character language modeling
• Truncated backpropagation
• Vanishing and exploding gradients
• A toy example of RNN backpropagation
• Long short-term memory (LSTM) cells

Lord, what fools these mortals be!
—William Shakespeare, A Midsummer Night’s Dream
Character language modeling
Let’s tackle a more challenging task with the RNN.
At the end of chapters 12 and 13, you trained vanilla recurrent neural networks (RNNs)
that learned a simple series prediction problem. But you were training over a toy dataset
of phrases that were synthetically generated using rules.
In this chapter, you’ll attempt language modeling over a much more challenging dataset:
the works of Shakespeare. And instead of learning to predict the next word given the
previous words (as in the preceding chapter), the model will train on characters. It needs
to learn to predict the next character given the previous characters observed. Here’s what
I mean:
import sys, random, math
from collections import Counter
import numpy as np
np.random.seed(0)

# dataset from http://karpathy.github.io/2015/05/21/rnn-effectiveness/
f = open('shakespear.txt', 'r')
raw = f.read()
f.close()
vocab = list(set(raw))
word2index = {}
for i,word in enumerate(vocab):
word2index[word]=i
indices = np.array(list(map(lambda x:word2index[x], raw)))
Whereas in chapters 12 and 13 the vocabulary was made up of the words from the dataset,
now the vocabulary is made up of the characters in the dataset. As such, the dataset is
transformed into a list of indices corresponding to characters instead of words: the
indices NumPy array above. Next comes the model setup:
embed = Embedding(vocab_size=len(vocab),dim=512)
model = RNNCell(n_inputs=512, n_hidden=512, n_output=len(vocab))
criterion = CrossEntropyLoss()
optim = SGD(parameters=model.get_parameters() + embed.get_parameters(),
alpha=0.05)
This code should all look familiar. It initializes the embeddings to be of dimensionality 512
and the RNN hidden state to be of size 512. The output weights are initialized as 0s (not
a rule, but I found it worked a bit better). Finally, you initialize the cross-entropy loss and
stochastic gradient descent optimizer.
The need for truncated backpropagation
Backpropagating through 100,000 characters is intractable.
One of the more challenging aspects of reading code for RNNs is the mini-batching logic for
feeding in data. The previous (simpler) neural network had an inner for loop like this one:
for iter in range(1000):
    batch_size = 100
    total_loss = 0

    hidden = model.init_hidden(batch_size=batch_size)

    for t in range(5):
        input = Tensor(data[0:batch_size, t], autograd=True)
        rnn_input = embed.forward(input=input)
        output, hidden = model.forward(input=rnn_input, hidden=hidden)

    target = Tensor(data[0:batch_size, t+1], autograd=True)
    loss = criterion.forward(output, target)
    loss.backward()
    optim.step()
    total_loss += loss.data

    if(iter % 200 == 0):
        p_correct = (target.data == np.argmax(output.data, axis=1)).mean()
        print_loss = total_loss / (len(data) / batch_size)
        print("Loss:", print_loss, "% Correct:", p_correct)
You might ask, “Why iterate to 5?” As it turns out, the previous dataset didn’t have any
example longer than six words. It read in five words and then attempted to predict the sixth.
Even more important is the backpropagation step. Consider when you did a simple
feedforward network classifying MNIST digits: the gradients always backpropagated all the
way through the network, right? They kept backpropagating until they reached the input
data. This allowed the network to adjust every weight to try to learn how to correctly predict
given the entire input example.
The recurrent example here is no different. You forward propagate through five input
examples and then, when you later call loss.backward(), it backpropagates gradients all
the way back through the network to the input datapoints. You can do this because you
aren’t feeding in that many input datapoints at a time. But the Shakespeare dataset has
100,000 characters! This is way too many to backpropagate through for every prediction.
What do you do?
You don’t! You backpropagate for a fixed number of steps into the past and then stop. This
is called truncated backpropagation, and it’s the industry standard. The length you backprop
becomes another tunable parameter (like batch size or alpha).
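You can watch truncation do its job with a toy scalar RNN and a hand-rolled forward-mode derivative. Everything below (including the run helper) is an illustrative stand-in, not framework code: once an input falls outside the truncation window, its gradient at the end of the sequence is exactly zero.

```python
import numpy as np

w = 0.5
xs = [0.1, -0.2, 0.3, 0.05, -0.1, 0.2]   # six timesteps of input
bptt = 3

def run(xs, detach_every=None):
    # toy scalar RNN h_t = tanh(w*h_{t-1} + x_t), tracking d(h)/d(x_0)
    h, dh_dx0 = 0.0, 0.0
    for t, x in enumerate(xs):
        if detach_every is not None and t % detach_every == 0:
            dh_dx0 = 0.0               # cut the tape: treat h as a constant
        dpre = w * dh_dx0 + (1.0 if t == 0 else 0.0)
        h = np.tanh(w * h + x)
        dh_dx0 = (1 - h**2) * dpre
    return dh_dx0

full = run(xs)                         # gradient flows all the way back
trunc = run(xs, detach_every=bptt)     # x_0 lies outside the last window

assert abs(full) > 0
assert trunc == 0.0
```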
Truncated backpropagation
Technically, it weakens the theoretical maximum
of the neural network.
The downside of using truncated backpropagation is that it shortens the distance a neural
network can learn to remember things. Cutting off gradients after, say, five timesteps
means the neural network can't learn to remember events more than five timesteps in
the past.
Strictly speaking, it’s more nuanced than this. There can accidentally be residual information
in an RNN’s hidden layer from more than five timesteps in the past, but the neural network
can’t use gradients to specifically request that the model keep information around from six
timesteps in the past to help with the current prediction. Thus, in practice, neural networks
won’t learn to make predictions based on input signal from more than five timesteps in the
past (if truncation is set at five timesteps). In practice, for language modeling, the truncation
variable is called bptt, and it’s usually set somewhere between 16 and 64:
batch_size = 32
bptt = 16
n_batches = int((indices.shape[0] / (batch_size)))
The other downside of truncated backpropagation is that it makes the mini-batching logic a
bit more complex. To use truncated backpropagation, you pretend that instead of having one
big dataset, you have a bunch of small datasets of size bptt. You need to group the datasets
accordingly:
trimmed_indices = indices[:n_batches*batch_size]
batched_indices = trimmed_indices.reshape(batch_size, n_batches)
batched_indices = batched_indices.transpose()
input_batched_indices = batched_indices[0:-1]
target_batched_indices = batched_indices[1:]
n_bptt = int(((n_batches-1) / bptt))
input_batches = input_batched_indices[:n_bptt*bptt]
input_batches = input_batches.reshape(n_bptt,bptt,batch_size)
target_batches = target_batched_indices[:n_bptt*bptt]
target_batches = target_batches.reshape(n_bptt, bptt, batch_size)
There’s a lot going on here. The top line trims the dataset so its length is an even multiple
of batch_size. This is so that, when you group it into tensors, the result is rectangular
(alternatively, you could pad the dataset with 0s). The second and third lines reshape the
dataset so each column is a section of the initial indices array. I’ll show you that part, as
if batch_size were set to 8 (for readability):
print(raw[0:5])
print(indices[0:5])

'That,'
array([ 9, 14,  2, 10, 57])
Those are the first five characters in the Shakespeare dataset. They spell out the string
“That,”. Following are the first five rows of the output of the transformation contained within
batched_indices:
print(batched_indices[0:5])

array([[ 9, 43, 21, 10, 10, 23, 57, 46],
       [14, 44, 39, 21, 43, 14,  1, 10],
       [ 2, 41, 39, 54, 37, 21, 26, 57],
       [10, 39, 57, 48, 21, 54, 38, 43],
       [57, 39, 43,  1, 10, 21, 21, 33]])
Look at the first column: see how the indices for the phrase “That,” run down the
leftmost column? This is a standard construction. The reason there are eight columns
is that the batch_size is 8. This tensor is then used to construct a list of smaller datasets,
each of length bptt.
You can see here how the input and target are constructed. Notice that the target indices are
the input indices offset by one row (so the network predicts the next character). Note again
that batch_size is 8 in this printout so it’s easier to read, but you’re really setting it to 32.
print(input_batches[0][0:5])
print(target_batches[0][0:5])

array([[ 9, 43, 21, 10, 10, 23, 57, 46],
       [14, 44, 39, 21, 43, 14,  1, 10],
       [ 2, 41, 39, 54, 37, 21, 26, 57],
       [10, 39, 57, 48, 21, 54, 38, 43],
       [57, 39, 43,  1, 10, 21, 21, 33]])

array([[14, 44, 39, 21, 43, 14,  1, 10],
       [ 2, 41, 39, 54, 37, 21, 26, 57],
       [10, 39, 57, 48, 21, 54, 38, 43],
       [57, 39, 43,  1, 10, 21, 21, 33],
       [43, 43, 41, 60, 52, 12, 54,  1]])
Don’t worry if this doesn’t make sense to you yet. It doesn’t have much to do with deep learning
theory; it’s just a particularly complex part of setting up RNNs that you’ll run into from time to
time. I thought I’d spend a couple of pages explaining it.
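The same reshape-and-transpose recipe is easy to verify on a tiny stand-in array. The variable names below mirror the real ones, but the data is just np.arange, so you can see by eye that each column reads the text contiguously and that the targets are the inputs shifted by one:

```python
import numpy as np

indices = np.arange(12)                  # stand-in for the character indices
batch_size, bptt = 3, 2

n_batches = indices.shape[0] // batch_size            # 4
batched = indices[:n_batches * batch_size].reshape(batch_size, n_batches).T
# columns now read [0,1,2,3], [4,5,6,7], [8,9,10,11] from top to bottom

inp, tgt = batched[:-1], batched[1:]      # target = input shifted one row
n_bptt = (n_batches - 1) // bptt                      # 1
inp_b = inp[:n_bptt * bptt].reshape(n_bptt, bptt, batch_size)
tgt_b = tgt[:n_bptt * bptt].reshape(n_bptt, bptt, batch_size)

assert np.array_equal(inp_b[0][0], np.array([0, 4, 8]))
assert np.array_equal(tgt_b[0][0], inp_b[0][0] + 1)
```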
Let’s see how to iterate using truncated backpropagation.
The following code shows truncated backpropagation in practice. Notice that it looks very
similar to the iteration logic from chapter 13. The only real difference is that you generate
a batch_loss at each step; and after every bptt steps, you backpropagate and perform a
weight update. Then you keep reading through the dataset like nothing happened (even
using the same hidden state from before, which only gets reset with each epoch):
def train(iterations=100):
    for iter in range(iterations):
        total_loss = 0
        n_loss = 0

        hidden = model.init_hidden(batch_size=batch_size)
        for batch_i in range(len(input_batches)):

            hidden = Tensor(hidden.data, autograd=True)
            loss = None
            losses = list()

            for t in range(bptt):
                input = Tensor(input_batches[batch_i][t], autograd=True)
                rnn_input = embed.forward(input=input)
                output, hidden = model.forward(input=rnn_input,
                                               hidden=hidden)
                target = Tensor(target_batches[batch_i][t], autograd=True)
                batch_loss = criterion.forward(output, target)
                losses.append(batch_loss)
                if(t == 0):
                    loss = batch_loss
                else:
                    loss = loss + batch_loss

            loss.backward()
            optim.step()
            total_loss += loss.data

            log = "\r Iter:" + str(iter)
            log += " - Batch "+str(batch_i+1)+"/"+str(len(input_batches))
            log += " - Loss:" + str(np.exp(total_loss / (batch_i+1)))
            if(batch_i == 0):
                log += " - " + generate_sample(70,'\n').replace("\n"," ")
            if(batch_i % 10 == 0 or batch_i == len(input_batches)-1):
                sys.stdout.write(log)
        optim.alpha *= 0.99
        print()

train()
Iter:0 - Batch 191/195 - Loss:148.00388828554404
Iter:1 - Batch 191/195 - Loss:20.588816924127116 mhnethet tttttt t t t
....
Iter:99 - Batch 61/195 - Loss:1.0533843281265225 I af the mands your
A sample of the output
By sampling from the predictions of the model,
you can write Shakespeare!
The following code uses a subset of the training logic to make predictions using the model.
You store the predictions in a string and return the string version as output to the function.
The sample that’s generated looks quite Shakespearian and even includes characters talking:
def generate_sample(n=30, init_char=' '):
    s = ""
    hidden = model.init_hidden(batch_size=1)
    input = Tensor(np.array([word2index[init_char]]))
    for i in range(n):
        rnn_input = embed.forward(input)
        output, hidden = model.forward(input=rnn_input, hidden=hidden)
        output.data *= 10                   # temperature for sampling:
        temp_dist = output.softmax()        # higher = greedier
        temp_dist /= temp_dist.sum()

        m = (temp_dist > np.random.rand()).argmax()   # samples from pred
        c = vocab[m]
        input = Tensor(np.array([m]))
        s += c
    return s

print(generate_sample(n=2000, init_char='\n'))
I war ded abdons would.
CHENRO:
Why, speed no virth to her,
Plirt, goth Plish love,
Befion
hath if be fe woulds is feally your hir, the confectife to the nightion
As rent Ron my hath iom
the worse, my goth Plish love,
Befion
Ass untrucerty of my fernight this we namn?
ANG, makes:
That's bond confect fe comes not commonour would be forch the conflill
As
poing from your jus eep of m look o perves, the worse, my goth
Thould be good lorges ever word
DESS:
Where exbinder: if not conflill, the confectife to the nightion
As co move, sir, this we namn?
ANG VINE PAET:
There was courter hower how, my goth Plish lo res
Toures
ever wo formall, have abon, with a good lorges ever word.
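A note on the sampling line: m = (temp_dist > np.random.rand()).argmax() picks the first index whose probability exceeds a single uniform draw, which works as a quick hack when the distribution is sharp. A more conventional categorical draw, sketched here with made-up logits rather than the model's real output, uses np.random.choice; multiplying the scores (as output.data *= 10 does) sharpens the distribution, which is the "higher = greedier" behavior:

```python
import numpy as np
np.random.seed(0)

logits = np.array([1.0, 2.0, 0.5])   # made-up scores, not the model's output
sharpen = 10                          # plays the role of output.data *= 10
scaled = logits * sharpen

# softmax of the sharpened scores: the biggest logit dominates
dist = np.exp(scaled - scaled.max())
dist = dist / dist.sum()

# a standard categorical draw from the distribution
m = np.random.choice(len(dist), p=dist)

assert 0 <= m < len(dist)
```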
Vanishing and exploding gradients
Vanilla RNNs suffer from vanishing and exploding gradients.
You may recall this diagram from when you first put together an RNN. The idea was to be able
to combine the word embeddings in a way that order mattered. You did this by learning a
matrix that transformed each embedding to the next timestep. Forward propagation then
became a two-step process: start with the first word embedding (the embedding for “Red” in
the following example), multiply by the weight matrix, and add the next embedding (“Sox”).
You then take the resulting vector, multiply it by the same weight matrix, and then add in
the next word, repeating until you’ve read in the entire series of words.
"Red Sox defeat Yankees"
"Red Sox defeat Yankees"
Yankees
+
Sox
Red
+
Yankees
+
defeat
Weight
matrix
+
x
+
defeat
But as you know, an additional nonlinearity was added to the hidden state-generation
process. Thus, forward propagation becomes a three-step process: matrix multiply the
previous hidden state by a weight matrix, add in the next word's embedding, and apply
a nonlinearity.

(Figure: the same "Red Sox defeat Yankees" sequence, now with the weight-matrix
multiply followed by a nonlinearity at each step.)

Note that this nonlinearity plays an important role in the stability of the
network. No matter how long the sequence of words is, the hidden states (which could
in theory grow larger and larger over time) are forced to stay between the values of the
nonlinearity (between 0 and 1, in the case of a sigmoid). But backpropagation happens in
a slightly different way than forward propagation, which doesn't have this nice property.
Backpropagation tends to lead to either extremely large or extremely small values. Large
values can cause divergence (lots of not-a-numbers [NaNs]), whereas extremely small values
keep the network from learning. Let's take a closer look at RNN backpropagation.
A toy example of RNN backpropagation
To see vanishing/exploding gradients firsthand,
let’s synthesize an example.
The following code shows a recurrent backpropagation loop for sigmoid and relu
activations. Notice how the gradients become very small/large for sigmoid/relu,
respectively. During backprop, they become large as the result of the matrix multiplication,
and small as a result of the sigmoid activation having a very flat derivative at its tails
(common for many nonlinearities).
(sigmoid, relu) = (lambda x: 1/(1+np.exp(-x)), lambda x: (x > 0).astype(float)*x)

weights = np.array([[1,4],[4,1]])
activation = sigmoid(np.array([1,0.01]))

print("Sigmoid Activations")
activations = list()
for iter in range(10):
    activation = sigmoid(activation.dot(weights))
    activations.append(activation)
    print(activation)

print("\nSigmoid Gradients")
gradient = np.ones_like(activation)
for activation in reversed(activations):
    # the derivative of sigmoid causes very small gradients when
    # activation is very near 0 or 1 (the tails)
    gradient = (activation * (1 - activation) * gradient)
    gradient = gradient.dot(weights.transpose())
    print(gradient)

print("Relu Activations")
activations = list()
for iter in range(10):
    activation = relu(activation.dot(weights))
    activations.append(activation)
    print(activation)

print("\nRelu Gradients")
gradient = np.ones_like(activation)
for activation in reversed(activations):
    # the matrix multiplication causes exploding gradients that don't
    # get squished by a nonlinearity (as in sigmoid)
    gradient = ((activation > 0) * gradient).dot(weights.transpose())
    print(gradient)
Sigmoid Activations
[0.93940638 0.96852968]
[0.9919462 0.99121735]
[0.99301385 0.99302901]
...
[0.99307291 0.99307291]
Relu Activations
[23.71814585 23.98025559]
[119.63916823 118.852839 ]
[595.05052421 597.40951192]
...
[46583049.71437107 46577890.60826711]
Sigmoid Gradients
[0.03439552 0.03439552]
[0.00118305 0.00118305]
[4.06916726e-05 4.06916726e-05]
...
[1.45938177e-14 2.16938983e-14]
Relu Gradients
[5. 5.]
[25. 25.]
[125. 125.]
...
[9765625. 9765625.]
Long short-term memory (LSTM) cells
LSTMs are the industry standard model to counter
vanishing/exploding gradients.
The previous section explained how vanishing/exploding gradients result from the
way hidden states are updated in an RNN. The problem is the combination of matrix
multiplication and nonlinearity being used to form the next hidden state. The solution that
LSTMs provide is surprisingly simple.
The gated copy trick
LSTMs create the next hidden state by copying the previous hidden state and then
adding or removing information as necessary. The mechanisms the LSTM uses for adding
and removing information are called gates.
def forward(self, input, hidden):
    from_prev_hidden = self.w_hh.forward(hidden)
    combined = self.w_ih.forward(input) + from_prev_hidden
    new_hidden = self.activation.forward(combined)
    output = self.w_ho.forward(new_hidden)
    return output, new_hidden
The previous code is the forward propagation logic for the RNN cell. Following is the new
forward propagation logic for the LSTM cell. The LSTM has two hidden state vectors: h (for
hidden) and cell.
The one you care about is cell. Notice how it’s updated. Each new cell is the previous cell
plus u, weighted by i and f. f is the “forget” gate. If it takes a value of 0, the new cell will
erase what it saw previously. If i is 1, it will fully add in the value of u to create the new cell.
o is an output gate that controls how much of the cell’s state the output prediction is allowed
to see. For example, if o is all zeros, then the self.w_ho.forward(h) line will make a
prediction ignoring the cell state entirely.
def forward(self, input, hidden):
    prev_hidden, prev_cell = (hidden[0], hidden[1])

    f = (self.xf.forward(input) + self.hf.forward(prev_hidden)).sigmoid()
    i = (self.xi.forward(input) + self.hi.forward(prev_hidden)).sigmoid()
    o = (self.xo.forward(input) + self.ho.forward(prev_hidden)).sigmoid()
    u = (self.xc.forward(input) + self.hc.forward(prev_hidden)).tanh()

    cell = (f * prev_cell) + (i * u)
    h = o * cell.tanh()
    output = self.w_ho.forward(h)
    return output, (h, cell)
Some intuition about LSTM gates
LSTM gates are semantically similar to reading/writing
from memory.
So there you have it! There are three gates—f, i, o—and a cell-update vector u; think of
these as forget, input, output, and update, respectively. They work together to ensure that
any information to be stored or manipulated in c can be so without requiring each update
of c to have any matrix multiplications or nonlinearities applied to it. In other words, you’re
avoiding ever calling nonlinearity(c) or c.dot(weights).
This is what allows the LSTM to store information across a time series without worrying
about vanishing or exploding gradients. Each step is a copy (assuming f is nonzero) plus
an update (assuming i is nonzero). The hidden value h is then a masked version of the cell
that’s used for prediction.
Notice further that each of the three gates is formed the same way. They have their own
weight matrices, but each of them conditions on the input and the previous hidden state,
passed through a sigmoid. It’s this sigmoid nonlinearity that makes them so useful as gates,
because it saturates at 0 and 1:
f = (self.xf.forward(input) + self.hf.forward(prev_hidden)).sigmoid()
i = (self.xi.forward(input) + self.hi.forward(prev_hidden)).sigmoid()
o = (self.xo.forward(input) + self.ho.forward(prev_hidden)).sigmoid()
One last possible critique is about h. Clearly it’s still prone to vanishing and exploding
gradients, because it’s basically used the same way as the vanilla RNN’s hidden state. First, because the
h vector is always created using a combination of vectors that are squished with tanh and
sigmoid, exploding gradients aren’t really a problem—only vanishing gradients. But this
ends up being OK because h is conditioned on c, which can carry long-range information:
the kind of information vanishing gradients can’t learn to carry. Thus, all long-range
information is transported using c, and h is only a localized interpretation of c, useful for
making an output prediction and constructing gate activations at the following timestep. In
short, c can learn to transport information over long distances, so it doesn’t matter if h can’t.
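The "gated copy" claim can be sanity-checked with plain scalar arithmetic. This toy sketch isn't the LSTMCell code; it just tracks how a gradient scales along the cell path cell = (f * prev_cell) + (i * u) versus an RNN-style hidden path that multiplies by a weight and a squashing derivative at every step:

```python
T = 50   # number of timesteps to backpropagate through

# Along the cell path, the gradient of the final cell with respect to the
# first is just the product of the forget gates: no matrix multiplies or
# tanh derivatives pile up.
grad_open = 1.0 ** T      # forget gate fully open: a perfect copy
grad_half = 0.5 ** T      # a half-open gate forgets on purpose

# An RNN-style hidden path multiplies by (weight * nonlinearity slope)
# each step; even a generous slope of 1.0 with a weight of 0.9 decays.
grad_rnn = (0.9 * 1.0) ** T

assert grad_open == 1.0
assert grad_half < 1e-10
assert grad_rnn < 0.01
```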
The long short-term memory layer
You can use the autograd system to implement an LSTM.
class LSTMCell(Layer):

    def __init__(self, n_inputs, n_hidden, n_output):
        super().__init__()

        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_output = n_output

        self.xf = Linear(n_inputs, n_hidden)
        self.xi = Linear(n_inputs, n_hidden)
        self.xo = Linear(n_inputs, n_hidden)
        self.xc = Linear(n_inputs, n_hidden)
        self.hf = Linear(n_hidden, n_hidden, bias=False)
        self.hi = Linear(n_hidden, n_hidden, bias=False)
        self.ho = Linear(n_hidden, n_hidden, bias=False)
        self.hc = Linear(n_hidden, n_hidden, bias=False)

        self.w_ho = Linear(n_hidden, n_output, bias=False)

        self.parameters += self.xf.get_parameters()
        self.parameters += self.xi.get_parameters()
        self.parameters += self.xo.get_parameters()
        self.parameters += self.xc.get_parameters()
        self.parameters += self.hf.get_parameters()
        self.parameters += self.hi.get_parameters()
        self.parameters += self.ho.get_parameters()
        self.parameters += self.hc.get_parameters()
        self.parameters += self.w_ho.get_parameters()

    def forward(self, input, hidden):
        prev_hidden = hidden[0]
        prev_cell = hidden[1]

        f = (self.xf.forward(input) + self.hf.forward(prev_hidden)).sigmoid()
        i = (self.xi.forward(input) + self.hi.forward(prev_hidden)).sigmoid()
        o = (self.xo.forward(input) + self.ho.forward(prev_hidden)).sigmoid()
        g = (self.xc.forward(input) + self.hc.forward(prev_hidden)).tanh()

        c = (f * prev_cell) + (i * g)
        h = o * c.tanh()
        output = self.w_ho.forward(h)
        return output, (h, c)

    def init_hidden(self, batch_size=1):
        h = Tensor(np.zeros((batch_size, self.n_hidden)), autograd=True)
        c = Tensor(np.zeros((batch_size, self.n_hidden)), autograd=True)
        h.data[:, 0] += 1
        c.data[:, 0] += 1
        return (h, c)
Upgrading the character language model
Let’s swap out the vanilla RNN with the new LSTM cell.
Earlier in this chapter, you trained a character language model to predict Shakespeare.
Now let's train an LSTM-based model to do the same. Fortunately, the framework from the
preceding chapter makes this easy to do (the complete code is available on the book's website,
www.manning.com/books/grokking-deep-learning, and on GitHub at https://github.com/iamtrask/
grokking-deep-learning). Here's the new setup code. The only edits from the vanilla RNN
code are the LSTMCell swap, the output-weight initialization, and the batching constants;
notice that hardly anything has changed about how you set up the neural network:
import sys, random, math
from collections import Counter
import numpy as np
np.random.seed(0)

f = open('shakespear.txt', 'r')
raw = f.read()
f.close()

vocab = list(set(raw))
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i
indices = np.array(list(map(lambda x: word2index[x], raw)))

embed = Embedding(vocab_size=len(vocab), dim=512)
model = LSTMCell(n_inputs=512, n_hidden=512, n_output=len(vocab))
model.w_ho.weight.data *= 0      # this seemed to help training

criterion = CrossEntropyLoss()
optim = SGD(parameters=model.get_parameters() + embed.get_parameters(),
            alpha=0.05)

batch_size = 16
bptt = 25
n_batches = int((indices.shape[0] / (batch_size)))

trimmed_indices = indices[:n_batches*batch_size]
batched_indices = trimmed_indices.reshape(batch_size, n_batches)
batched_indices = batched_indices.transpose()

input_batched_indices = batched_indices[0:-1]
target_batched_indices = batched_indices[1:]

n_bptt = int(((n_batches-1) / bptt))
input_batches = input_batched_indices[:n_bptt*bptt]
input_batches = input_batches.reshape(n_bptt, bptt, batch_size)
target_batches = target_batched_indices[:n_bptt*bptt]
target_batches = target_batches.reshape(n_bptt, bptt, batch_size)
min_loss = 1000
Training the LSTM character language model
The training logic also hasn’t changed much.
The only real change you have to make from the vanilla RNN logic is the truncated
backpropagation logic, because there are two hidden vectors per timestep instead of one.
But this is a relatively minor fix: detach both vectors instead of one. I've also added a few
bells and whistles that make training easier (alpha slowly decreases over time, and there's
more logging):
for iter in range(iterations):
    total_loss, n_loss = (0, 0)

    hidden = model.init_hidden(batch_size=batch_size)
    batches_to_train = len(input_batches)

    for batch_i in range(batches_to_train):

        hidden = (Tensor(hidden[0].data, autograd=True),
                  Tensor(hidden[1].data, autograd=True))

        losses = list()
        for t in range(bptt):
            input = Tensor(input_batches[batch_i][t], autograd=True)
            rnn_input = embed.forward(input=input)
            output, hidden = model.forward(input=rnn_input, hidden=hidden)
            target = Tensor(target_batches[batch_i][t], autograd=True)
            batch_loss = criterion.forward(output, target)

            if(t == 0):
                losses.append(batch_loss)
            else:
                losses.append(batch_loss + losses[-1])

        loss = losses[-1]
        loss.backward()
        optim.step()
        total_loss += loss.data / bptt

        epoch_loss = np.exp(total_loss / (batch_i+1))
        if(epoch_loss < min_loss):
            min_loss = epoch_loss
            print()

        log = "\r Iter:" + str(iter)
        log += " - Alpha:" + str(optim.alpha)[0:5]
        log += " - Batch "+str(batch_i+1)+"/"+str(len(input_batches))
        log += " - Min Loss:" + str(min_loss)[0:5]
        log += " - Loss:" + str(epoch_loss)
        if(batch_i == 0):
            s = generate_sample(n=70, init_char='T').replace("\n"," ")
            log += " - " + s
        sys.stdout.write(log)
    optim.alpha *= 0.99
Tuning the LSTM character language model
I spent about two days tuning this model, and
it trained overnight.
Here’s some of the training output for this model. Note that it took a very long time to
train (there are a lot of parameters). I also had to train it many times in order to find a
good tuning (learning rate, batch size, and so on) for this task, and the final model trained
overnight (8 hours). In general, the longer you train, the better your results will be.
I:0 - Alpha:0.05 - Batch 1/249 - Min Loss:62.00 - Loss:62.00 - eeeeeeeeee
...
I:7 - Alpha:0.04 - Batch 140/249 - Min Loss:10.5 - Loss:10.7 - heres, and
...
I:91 - Alpha:0.016 - Batch 176/249 - Min Loss:9.900 - Loss:11.9757225699
def generate_sample(n=30, init_char=' '):
    s = ""
    hidden = model.init_hidden(batch_size=1)
    input = Tensor(np.array([word2index[init_char]]))
    for i in range(n):
        rnn_input = embed.forward(input)
        output, hidden = model.forward(input=rnn_input, hidden=hidden)
        output.data *= 15
        temp_dist = output.softmax()
        temp_dist /= temp_dist.sum()

        m = output.data.argmax()    # Takes the max prediction
        c = vocab[m]
        input = Tensor(np.array([m]))
        s += c
    return s
print(generate_sample(n=500, init_char='\n'))
Intestay thee.
SIR:
It thou my thar the sentastar the see the see:
Imentary take the subloud I
Stall my thentaring fook the senternight pead me, the gakentlenternot
they day them.
KENNOR:
I stay the see talk :
Non the seady!
Sustar thou shour in the suble the see the senternow the antently the see
the seaventlace peake,
I sentlentony my thent:
I the sentastar thamy this not thame.
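Note that generate_sample scales the logits (output.data *= 15) and builds temp_dist, but then takes the argmax anyway. A common variant samples from the temperature-scaled distribution instead, trading determinism for diversity. Here's a plain-NumPy sketch of that idea (the logits and temperature values are illustrative, not from the trained model):

```python
import numpy as np
np.random.seed(0)

def sample_with_temperature(logits, temperature=1.0):
    # Low temperature sharpens the distribution (close to argmax);
    # high temperature flattens it (more diverse, riskier samples).
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]   # made-up scores for three "characters"
cold = [sample_with_temperature(logits, 0.1) for _ in range(100)]
warm = [sample_with_temperature(logits, 5.0) for _ in range(100)]

# Cold sampling almost always picks index 0; warm sampling spreads out.
print(sum(c == 0 for c in cold), sum(w == 0 for w in warm))
```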
Summary
LSTMs are incredibly powerful models.
The distribution of Shakespearian language that the LSTM learned to generate isn’t to be
taken lightly. Language is an incredibly complex statistical distribution to learn, and the
fact that LSTMs can do so well (at the time of writing, they’re the state-of-the-art approach
by a wide margin) still baffles me (and others as well). Small variants on this model either
are or have recently been the state of the art in a wide variety of tasks and, alongside word
embeddings and convolutional layers, will undoubtedly be one of our go-to tools for a long
time to come.
15 deep learning on unseen data:
introducing federated learning

In this chapter
• The problem of privacy in deep learning
• Federated learning
• Learning to detect spam
• Hacking into federated learning
• Secure aggregation
• Homomorphic encryption
• Homomorphically encrypted federated learning

Friends don’t spy; true friendship is about privacy, too.
—Stephen King, Hearts in Atlantis (1999)
The problem of privacy in deep learning
Deep learning (and tools for it) often means you have access to
your training data.
As you’re keenly aware by now, deep learning, being a subfield of machine learning, is all
about learning from data. But often, the data being learned from is incredibly personal.
The most meaningful models interact with the most personal information about human
lives and tell us things about ourselves that might have been difficult to know otherwise.
To paraphrase, a deep learning model can study thousands of lives to help you better
understand your own.
The primary natural resource for deep learning is training data (either synthetic or natural).
Without it, deep learning can’t learn; and because the most valuable use cases often interact
with the most personal datasets, deep learning is often a reason behind companies seeking to
aggregate data. They need it in order to solve a particular use case.
But in 2017, Google published a very exciting paper and blog post that made a significant
dent in this conversation. Google proposed that we don’t need to centralize a dataset in
order to train a model over it. The company proposed this question: what if instead of
bringing all the data to one place, we could bring the model to the data? This is a new,
exciting subfield of machine learning called federated learning, and it’s what this chapter
is about.
What if instead of bringing the corpus of training data to one place to train a model, you
could bring the model to the data wherever it’s generated?
This simple reversal is extremely important. First, it means in order to participate in the
deep learning supply chain, people don’t technically have to send their data to anyone.
Valuable models in healthcare, personal management, and other sensitive areas can be
trained without requiring anyone to disclose information about themselves. In theory,
people could retain control over the only copy of their personal data (at least as far as deep
learning is concerned).
This technique will also have a huge impact on the competitive landscape of deep learning
in corporate competition and entrepreneurship. Large enterprises that previously wouldn’t
(or couldn’t, for legal reasons) share data about their customers can potentially still earn
revenue from that data. There are some problem domains where the sensitivity and
regulatory constraints surrounding the data have been a headwind to progress. Healthcare
is one example where datasets are often locked up tight, making research challenging.
Federated learning
You don’t have to have access to a dataset in order
to learn from it.
The premise of federated learning is that many datasets contain information that’s useful for
solving problems (for example, identifying cancer in an MRI), but it’s hard to access these
relevant datasets in large enough quantities to train a suitably strong deep learning model.
The main concern is that, even though the dataset has information sufficient to train a deep
learning model, it also has information that (presumably) has nothing to do with learning the
task but could potentially harm someone if it were revealed.
Federated learning is about a model going into a secure environment and learning how to
solve a problem without needing the data to move anywhere. Let’s jump into an example.
import numpy as np
from collections import Counter
import random
import sys
import codecs

np.random.seed(12345)

# Dataset from http://www2.aueb.gr/users/ion/data/enron-spam/
with codecs.open('spam.txt', "r", encoding='utf-8', errors='ignore') as f:
    raw = f.readlines()

vocab, spam, ham = (set(["<unk>"]), list(), list())

for row in raw:
    spam.append(set(row[:-2].split(" ")))
    for word in spam[-1]:
        vocab.add(word)

with codecs.open('ham.txt', "r", encoding='utf-8', errors='ignore') as f:
    raw = f.readlines()

for row in raw:
    ham.append(set(row[:-2].split(" ")))
    for word in ham[-1]:
        vocab.add(word)

vocab, w2i = (list(vocab), {})
for i, w in enumerate(vocab):
    w2i[w] = i

def to_indices(input, l=500):
    indices = list()
    for line in input:
        # Trim long emails and pad short ones so every row is exactly l tokens.
        line = list(line)[:l]
        line += ["<unk>"] * (l - len(line))
        idxs = list()
        for word in line:
            idxs.append(w2i[word])
        indices.append(idxs)
    return indices
Learning to detect spam
Let’s say you want to train a model across people’s emails
to detect spam.
The use case we’ll talk about is email classification. The first model will be trained on a publicly
available dataset called the Enron dataset, which is a large corpus of emails released from the
famous Enron lawsuit (now an industry standard email analytics corpus). Fun fact: I used to
know someone who read/annotated this dataset professionally, and people emailed all sorts of
crazy stuff to each other (much of it very personal). But because it was all released to the public
in the court case, it’s free to use now.
The code in the previous section and this section is just the preprocessing. The input data
files (ham.txt and spam.txt) are available on the book’s website, www.manning.com/books/
grokking-deep-learning; and on GitHub at https://github.com/iamtrask/Grokking-Deep-Learning.
You preprocess it to get it ready to forward propagate into the embedding class
created in chapter 13 when you created a deep learning framework. As before, all the words
in this corpus are turned into lists of indices. You also make all the emails exactly 500 words
long by either trimming the email or padding it with <unk> tokens. Doing so makes the final
dataset square.
spam_idx = to_indices(spam)
ham_idx = to_indices(ham)

train_spam_idx = spam_idx[0:-1000]
train_ham_idx = ham_idx[0:-1000]

test_spam_idx = spam_idx[-1000:]
test_ham_idx = ham_idx[-1000:]

train_data = list()
train_target = list()

test_data = list()
test_target = list()

for i in range(max(len(train_spam_idx), len(train_ham_idx))):
    train_data.append(train_spam_idx[i%len(train_spam_idx)])
    train_target.append([1])

    train_data.append(train_ham_idx[i%len(train_ham_idx)])
    train_target.append([0])

for i in range(max(len(test_spam_idx), len(test_ham_idx))):
    test_data.append(test_spam_idx[i%len(test_spam_idx)])
    test_target.append([1])

    test_data.append(test_ham_idx[i%len(test_ham_idx)])
    test_target.append([0])
def train(model, input_data, target_data, batch_size=500, iterations=5):
    n_batches = int(len(input_data) / batch_size)
    for iter in range(iterations):
        iter_loss = 0
        for b_i in range(n_batches):

            # Padding token should stay at 0
            model.weight.data[w2i['<unk>']] *= 0

            input = Tensor(input_data[b_i*batch_size:(b_i+1)*batch_size],
                           autograd=True)
            target = Tensor(target_data[b_i*batch_size:(b_i+1)*batch_size],
                            autograd=True)

            pred = model.forward(input).sum(1).sigmoid()
            loss = criterion.forward(pred, target)
            loss.backward()
            optim.step()

            iter_loss += loss.data[0] / batch_size

            sys.stdout.write("\r\tLoss:" + str(iter_loss / (b_i+1)))
        print()
    return model

def test(model, test_input, test_output):
    model.weight.data[w2i['<unk>']] *= 0

    input = Tensor(test_input, autograd=True)
    target = Tensor(test_output, autograd=True)

    pred = model.forward(input).sum(1).sigmoid()
    return ((pred.data > 0.5) == target.data).mean()
With these nice train() and test() functions, you can initialize a neural network and
train it using the following few lines. After only three iterations, the network can already
classify on the test dataset with 99.45% accuracy (the test dataset is balanced, so this is
quite good):
model = Embedding(vocab_size=len(vocab), dim=1)
model.weight.data *= 0
criterion = MSELoss()
optim = SGD(parameters=model.get_parameters(), alpha=0.01)
for i in range(3):
    model = train(model, train_data, train_target, iterations=1)
    print("% Correct on Test Set: " + \
          str(test(model, test_data, test_target)*100))
Loss:0.037140416860871446
% Correct on Test Set: 98.65
Loss:0.011258669226059114
% Correct on Test Set: 99.15
Loss:0.008068268387986223
% Correct on Test Set: 99.45
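Why does such a tiny model work? With dim=1, each word's embedding is a single scalar, so model.forward(input).sum(1).sigmoid() reduces to logistic regression over which words appear in the email. Here's that reduction as a minimal sketch, using hypothetical per-word weights (not the actual trained values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical learned scalars, one per vocabulary word (dim=1 embeddings)
weights = {"free": 1.8, "viagra": 2.5, "meeting": -1.2, "<unk>": 0.0}

def spam_score(email_words):
    # model.forward(input).sum(1).sigmoid() with dim=1 reduces to this:
    return sigmoid(sum(weights.get(w, 0.0) for w in email_words))

print(spam_score(["free", "viagra", "<unk>"]))   # high: spammy words present
print(spam_score(["meeting", "<unk>"]))          # low: "meeting" pushes toward ham
```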
Let’s make it federated
The previous example was plain vanilla deep learning. Let’s
protect privacy.
In the previous section, you trained the email example with all the emails in one place.
This is the old-school way of doing things (which is still far too common in the world). Let’s
start by simulating a federated learning environment that has multiple different collections
of emails:
bob = (train_data[0:1000], train_target[0:1000])
alice = (train_data[1000:2000], train_target[1000:2000])
sue = (train_data[2000:], train_target[2000:])
Easy enough. Now you can do the same training as before, but across each person’s email
database all at the same time. After each iteration, you’ll average the values of the models
from Bob, Alice, and Sue and evaluate. Note that some methods of federated learning
aggregate after each batch (or collection of batches); I’m keeping it simple:
import copy

for i in range(3):
    print("Starting Training Round...")
    print("\tStep 1: send the model to Bob")
    bob_model = train(copy.deepcopy(model), bob[0], bob[1], iterations=1)

    print("\n\tStep 2: send the model to Alice")
    alice_model = train(copy.deepcopy(model),
                        alice[0], alice[1], iterations=1)

    print("\n\tStep 3: Send the model to Sue")
    sue_model = train(copy.deepcopy(model), sue[0], sue[1], iterations=1)

    print("\n\tAverage Everyone's New Models")
    model.weight.data = (bob_model.weight.data + \
                         alice_model.weight.data + \
                         sue_model.weight.data)/3

    print("\t% Correct on Test Set: " + \
          str(test(model, test_data, test_target)*100))

    print("\nRepeat!!\n")
The next section shows the results. The model
learns to nearly the same performance as
before, and in theory you didn’t have access to
the training data—or did you? After all, each
person is changing the model somehow, right?
Can you really not discover anything about
their dataset?
Starting Training Round...
Step 1: send the model to Bob
Loss:0.21908166249699718
......
Step 3: Send the model to Sue
Loss:0.015368461608470256
Average Everyone's New Models
% Correct on Test Set: 98.8
Hacking into federated learning
Let’s use a toy example to see how to still learn
the training dataset.
Federated learning has two big challenges, both of which are at their worst when each
person in the training dataset has only a small handful of training examples. These
challenges are performance and privacy. As it turns out, if someone has only a few training
examples (or the model improvement they send you uses only a few examples: a training
batch), you can still learn quite a bit about the data. Given 10,000 people (each with a little
data), you’ll spend most of your time sending the model back and forth and not much time
training (especially if the model is really big).
But we’re getting ahead of ourselves. Let’s see what you can learn when a user performs a
weight update over a single batch:
import copy

bobs_email = ["my", "computer", "password", "is", "pizza"]

bob_input = np.array([[w2i[x] for x in bobs_email]])
bob_target = np.array([[0]])

model = Embedding(vocab_size=len(vocab), dim=1)
model.weight.data *= 0

bobs_model = train(copy.deepcopy(model),
                   bob_input, bob_target, iterations=1, batch_size=1)
Bob is going to create an update to the model using an email in his inbox. But Bob saved
his password in an email to himself that says, “My computer password is pizza.” Silly Bob.
By looking at which weights changed, you can figure out the vocabulary (and infer the
meaning) of Bob’s email:
for i, v in enumerate(bobs_model.weight.data - model.weight.data):
    if(v != 0):
        print(vocab[i])
is
pizza
computer
password
my
And just like that, you learned Bob’s super-secret password (and
probably his favorite food, too). What’s to be done? How can you use
federated learning if it’s so easy to tell what the training data was from
the weight update?
Secure aggregation
Let’s average weight updates from zillions of people before
anyone can see them.
The solution is to never let Bob put a gradient out in the open like that. How can Bob
contribute his gradient if people shouldn’t see it? The social sciences use an interesting
technique called randomized response.
It goes like this. Let’s say you’re conducting a survey, and you want to ask 100 people
whether they’ve committed a heinous crime. Of course, all would answer “No” even if you
promised them you wouldn’t tell. Instead, you have them flip a coin twice (somewhere you
can’t see), and tell them that if the first coin flip is heads, they should answer honestly; and
if it’s tails, they should answer “Yes” or “No” according to the second coin flip.
Given this scenario, you never actually ask people to tell you whether they committed
crimes. The true answers are hidden in the random noise of the first and second coin flips.
If 60% of people say “Yes,” you can determine (using simple math) that about 70% of the
people you surveyed committed heinous crimes (give or take a few percentage points). The
idea is that the random noise makes it plausible that any information you learn about the
person came from the noise instead of from them.
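The arithmetic above can be checked with a quick simulation (a sketch; the 70% true rate and the fair coin flips are the values from the text):

```python
import random
random.seed(0)

def randomized_response(truth):
    # First flip: heads -> answer honestly.
    if random.random() < 0.5:
        return truth
    # Tails -> answer "Yes"/"No" according to a second flip.
    return random.random() < 0.5

true_rate = 0.70                     # fraction who really committed the crime
n = 100_000
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]

observed = sum(answers) / n          # expected: 0.5*true_rate + 0.25 = 0.60
estimate = (observed - 0.25) / 0.5   # invert the noise to recover ~0.70
print(observed, estimate)
```

No individual answer reveals anything with certainty, yet the aggregate recovers the true rate to within a fraction of a percentage point.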
Privacy via plausible deniability
The level of chance that a particular answer came from random noise instead of an
individual protects their privacy by giving them plausible deniability. This forms the basis
for secure aggregation and, more generally, much of differential privacy.
You’re looking only at aggregate statistics overall. (You never see anyone’s answer directly;
you see only pairs of answers or perhaps larger groupings.) Thus, the more people you can
aggregate before adding noise, the less noise you have to add to hide them (and the more
accurate the findings are).
In the context of federated learning, you could (if you wanted) add a ton of noise, but this
would hurt training. Instead, first sum all the gradients from all the participants in such a
way that no one can see anyone’s gradient but their own. The class of problems for doing
this is called secure aggregation, and in order to do it, you’ll need one more (very cool) tool:
homomorphic encryption.
Homomorphic encryption
You can perform arithmetic on encrypted values.
One of the most exciting frontiers of research is the intersection of artificial intelligence
(including deep learning) and cryptography. Front and center in this exciting intersection
is a very cool technology called homomorphic encryption. Loosely stated, homomorphic
encryption lets you perform computation on encrypted values without decrypting them.
In particular, we’re interested in performing addition over these values. Explaining exactly
how it works would take an entire book on its own, but I’ll show you how it works with a
few definitions. First, a public key lets you encrypt numbers. A private key lets you decrypt
encrypted numbers. An encrypted value is called a ciphertext, and an unencrypted value
is called a plaintext.
Let’s see an example of homomorphic encryption using the phe library. (To install the
library, run pip install phe or download it from GitHub at https://github.com/
n1analytics/python-paillier):
import phe

public_key, private_key = phe.generate_paillier_keypair(n_length=1024)

x = public_key.encrypt(5)        # Encrypts the number 5
y = public_key.encrypt(3)        # Encrypts the number 3

z = x + y                        # Adds the two encrypted values

z_ = private_key.decrypt(z)      # Decrypts the result
print("The Answer: " + str(z_))

The Answer: 8
This code encrypts two numbers (5 and 3) and adds them together while they’re still
encrypted. Pretty neat, eh? There’s another technique that’s a sort-of cousin to homomorphic
encryption: secure multi-party computation. You can learn about it at the “Cryptography and
Machine Learning” blog (https://mortendahl.github.io).
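The core idea behind that cousin technique can be sketched directly: additive secret sharing splits each number into random shares that individually reveal nothing, yet the shares can be summed. This toy version (with hypothetical integer-encoded gradients, not real model updates) shows how gradients no one can see are still summable:

```python
import random
random.seed(1)

Q = 2**31 - 1          # a public modulus; all arithmetic happens mod Q

def share(secret, n_shares=3):
    # Split an integer into additive shares; any subset short of all
    # n_shares is indistinguishable from random noise.
    shares = [random.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Hypothetical integer-encoded gradients from three participants
bob, alice, sue = 5, 3, 2
all_shares = [share(g) for g in (bob, alice, sue)]

# Each party sums the shares it holds from everyone; only the total
# of these sums is ever reconstructed -- never an individual gradient.
summed = [sum(col) % Q for col in zip(*all_shares)]
print(reconstruct(summed))    # 10 == 5 + 3 + 2
```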
Now, let’s return to the problem of secure aggregation. Given your new knowledge that
you can add together numbers you can’t see, the answer becomes plain. The person who
initializes the model sends a public_key to Bob, Alice, and Sue so they can each encrypt
their weight updates. Then, Bob, Alice, and Sue (who don’t have the private key) talk directly
to each other and accumulate all their gradients into a single, final update that’s sent back to
the model owner, who decrypts it with the private_key.
Homomorphically encrypted federated learning
Let’s use homomorphic encryption to protect the gradients
being aggregated.
model = Embedding(vocab_size=len(vocab), dim=1)
model.weight.data *= 0

# Note that in production the n_length should be at least 1024
public_key, private_key = phe.generate_paillier_keypair(n_length=128)

def train_and_encrypt(model, input, target, pubkey):
    new_model = train(copy.deepcopy(model), input, target, iterations=1)

    encrypted_weights = list()
    for val in new_model.weight.data[:,0]:
        encrypted_weights.append(pubkey.encrypt(val))
    ew = np.array(encrypted_weights).reshape(new_model.weight.data.shape)
    return ew

for i in range(3):
    print("\nStarting Training Round...")
    print("\tStep 1: send the model to Bob")
    bob_encrypted_model = train_and_encrypt(copy.deepcopy(model),
                                            bob[0], bob[1], public_key)

    print("\n\tStep 2: send the model to Alice")
    alice_encrypted_model = train_and_encrypt(copy.deepcopy(model),
                                              alice[0], alice[1], public_key)

    print("\n\tStep 3: Send the model to Sue")
    sue_encrypted_model = train_and_encrypt(copy.deepcopy(model),
                                            sue[0], sue[1], public_key)

    print("\n\tStep 4: Bob, Alice, and Sue send their")
    print("\tencrypted models to each other.")
    aggregated_model = bob_encrypted_model + \
                       alice_encrypted_model + \
                       sue_encrypted_model

    print("\n\tStep 5: only the aggregated model")
    print("\tis sent back to the model owner who")
    print("\tcan decrypt it.")
    raw_values = list()
    for val in aggregated_model.flatten():
        raw_values.append(private_key.decrypt(val))
    new = np.array(raw_values).reshape(model.weight.data.shape)/3
    model.weight.data = new

    print("\t% Correct on Test Set: " + \
          str(test(model, test_data, test_target)*100))
Now you can run the new training scheme, which has an added step. Alice, Bob, and Sue
add up their homomorphically encrypted models before sending them back to you, so
you never see which updates came from which person (a form of plausible deniability).
In production, you’d also add some additional random noise sufficient to meet a certain
privacy threshold required by Bob, Alice, and Sue (according to their personal preferences).
More on that in future work.
Starting Training Round...
Step 1: send the model to Bob
Loss:0.21908166249699718
Step 2: send the model to Alice
Loss:0.2937106899184867
...
...
...
% Correct on Test Set: 99.15
Summary
Federated learning is one of the most exciting breakthroughs
in deep learning.
I firmly believe that federated learning will change the landscape of deep learning in the
coming years. It will unlock new datasets that were previously too sensitive to work with,
creating great social good alongside newly available entrepreneurial opportunities.
This is part of a broader convergence between encryption and artificial intelligence research
that, in my opinion, is the most exciting convergence of the decade.
The main thing holding back these techniques from practical use is their lack of
availability in modern deep learning toolkits. The tipping point will be when anyone can
run pip install... and then have access to deep learning frameworks where privacy
and security are first-class citizens, and where techniques such as federated learning,
homomorphic encryption, differential privacy, and secure multi-party computation are all
built in (and you don’t have to be an expert to use them).
Out of this belief, I’ve been working with a team of open source volunteers as a part of
the OpenMined project for the past year, extending major deep learning frameworks with
these primitives. If you believe in the importance of these tools to the future of privacy
and security, come check us out at http://openmined.org or at the GitHub repository
(https://github.com/OpenMined). Show your support, even if it’s only starring a few
repos; and do join if you can (slack.openmined.org is the chat room).
16 where to go from here:
a brief guide

In this chapter
• Step 1: Start learning PyTorch
• Step 2: Start another deep learning course
• Step 3: Grab a mathy deep learning textbook
• Step 4: Start a blog, and teach deep learning
• Step 5: Twitter
• Step 6: Implement academic papers
• Step 7: Acquire access to a GPU
• Step 8: Get paid to practice
• Step 9: Join an open source project
• Step 10: Develop your local community

Whether you believe you can do a thing or not, you are right.
—Henry Ford, automobile manufacturer
Congratulations!
If you’re reading this, you’ve made it through nearly 300 pages of
deep learning.
You did it! This was a lot of material. I’m proud of you, and you should be proud of yourself.
Today should be a cause for celebration. At this point, you understand the basic concepts
behind artificial intelligence, and should feel quite confident in your abilities to speak about
them as well as your abilities to learn advanced concepts.
This last chapter includes a few short sections discussing appropriate next steps for you,
especially if this is your first resource in the field of deep learning. My general assumption is
that you’re interested in pursuing a career in the field or at least continuing to dabble on the
side, and I hope my general comments will help guide you in the right direction (although
they’re only very general guidelines that may or may not directly apply to you).
Step 1: Start learning PyTorch
The deep learning framework you made most closely
resembles PyTorch.
You’ve been learning deep learning using NumPy, which is a basic matrix library. You then
built your own deep learning toolkit, and you’ve used that quite a bit as well. But from this
point forward, except when learning about a new architecture, you should use an actual
framework for your experiments. It will be less buggy. It will run (way) faster, and you’ll be
able to inherit/study other people’s code.
Why should you choose PyTorch? There are many good options, but if you’re coming from
a NumPy background, PyTorch will feel the most familiar. Furthermore, the framework you
built in chapter 13 closely resembles the API of PyTorch. I did it this way specifically with
the intent of preparing you for an actual framework. If you choose PyTorch, you’ll feel right
at home. That said, choosing a deep learning framework is sort of like joining a house at
Hogwarts: they’re all great (but PyTorch is definitely Gryffindor).
Now the next question: how should you learn PyTorch? The best way is to take a deep learning
course that teaches you deep learning using the framework. This will jog your memory about
the concepts you’re already familiar with while showing you where each piece lives in PyTorch.
(You’ll review stochastic gradient descent while also learning about where it’s located in
PyTorch’s API.) The best place to do this at the time of writing is either Udacity’s deep learning
Nanodegree (although I’m biased: I helped teach it) or fast.ai. In addition, https://pytorch.org/
tutorials and https://github.com/pytorch/examples are golden resources.
Step 2: Start another deep learning course
I learned deep learning by relearning the same concepts over
and over.
Although it would be nice to think that one book or course is sufficient for your entire deep
learning education, it’s not. Even if every concept was covered in this book (they aren’t),
hearing the same concepts from multiple perspectives is essential for you to really grok
them (see what I did there?). I’ve taken probably a half-dozen different courses (or YouTube
series) in my growth as a developer in addition to watching tons of YouTube videos and
reading lots of blog posts describing basic concepts.
Look for online courses on YouTube from the big deep learning universities or AI labs
(Stanford, MIT, Oxford, Montreal, NYU, and so on). Watch all the videos. Do all the
exercises. Do fast.ai, and Udacity if you can. Relearn the same concepts over and over.
Practice them. Become familiar with them. You want the fundamentals to be second nature
in your head.
Step 3: Grab a mathy deep learning textbook
You can reverse engineer the math from your deep learning
knowledge.
My undergraduate degree at university was in applied discrete mathematics, but I learned
way more about algebra, calculus, and statistics from spending time in deep learning than
I ever did in the classroom. Furthermore, and this might sound surprising, I learned by
hacking together NumPy code and then going back to the math problems it implements to
figure out how they worked. This is how I really learned the deep learning–related math at a
deeper level. It’s a nice trick I hope you’ll take to heart.
If you’re not sure which mathy book to go for, probably the best on the market at the time
of writing is Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville (MIT
Press, 2016). It’s not insane on the math side, but it’s the next step up from this book (and
the math notation guide in the front of the book is golden).
Step 4: Start a blog, and teach deep learning
Nothing I’ve ever done has helped my knowledge or career more.
I probably should have put this as step 1, but here goes. Nothing has boosted my knowledge
of deep learning (and my career in deep learning) more than teaching deep learning on my
blog. Teaching forces you to explain everything as simply as possible, and the fear of public
shaming will ensure that you do a good job.
Funny story: one of my first blog posts made it onto Hacker News, but it was horribly
written, and a major researcher at a top AI lab totally destroyed me in the comments. It hurt
my feelings and my confidence, but it also tightened up my writing. It made me realize that
most of the time, when I read something and it’s hard to understand, it’s not my fault; the
person who was writing it didn’t take enough time to explain all the little pieces I needed
to know to understand the full concepts. They didn’t provide relatable analogies to help my
understanding.
All that is to say, start a blog. Try to get on the Hacker News or ML Reddit front page. Start
by teaching the basic concepts. Try to do it better than anyone else. Don’t worry if the topic
has already been covered. To this day, my most popular blog post is “A Neural Network
in 11 Lines of Python,” which teaches the most over-taught thing in deep learning: a basic
feedforward neural network. But I was able to explain it in a new way, which helped some
folks. The main reason it did was that I wrote the post in a way that helped me understand it.
That’s the ticket. Teach things the way you want to learn them.
And don’t just do summaries of deep learning concepts! Summaries are boring, and no
one wants to read them. Write tutorials. Every blog post you write should include a neural
network that learns to do something—something the reader can download and run. Your
blog should give a line-by-line account of what each piece does so that even a five-year-old
could understand. That’s the standard. You may want to give up when you’ve been working
on a two-page blog post for three days, but that’s not the time to turn back: that’s the time to
press on and make it amazing! One great blog post can change your life. Trust me.
If you want to apply to a job, masters, or PhD program to do AI, pick a researcher you want
to work with in that program, and write tutorials about their work. Every time I’ve done
that, it has led to later meeting that researcher. Doing this shows that you understand the
concepts they’re working with, which is a prerequisite to them wanting to work with you.
This is much better than a cold email, because, assuming it gets on Reddit, Hacker News, or
some other venue, someone else will send it to them first. Sometimes they’ll even reach out
to you.
Step 5: Twitter
A lot of AI conversation happens on Twitter.
I’ve met more researchers from around the world on Twitter than almost any other way, and
I’ve learned about nearly every paper I read because I was following someone who tweeted
about it. You want to be up-to-date on the latest changes; and, more important, you want
to become part of the conversation. I started by finding some AI researchers I looked up to,
following them, and then following the people they follow. That got my feed started, and it has
helped me greatly. (Just don’t let it become an addiction!)
Step 6: Implement academic papers
Twitter + your blog = tutorials on academic papers.
Watch your Twitter feed until you come across a paper that both sounds interesting and
doesn’t need an insane number of GPUs. Write a tutorial on it. You’ll have to read the paper,
decipher the math, and go through the motions of tuning that the original researchers also
had to go through. There’s no better exercise if you’re interested in doing abstract research.
My first published paper at the International Conference on Machine Learning (ICML)
came out of me reading the paper for and subsequently reverse-engineering the code in
word2vec. Eventually, you’ll be reading along and go, “Wait! I think I can make this better!”
And voila: you’re a researcher.
Step 7: Acquire access to a GPU (or many)
The faster you can experiment, the faster you can learn.
It’s no secret that GPUs give 10 to 100× faster training times, but the implication is that you
can iterate through your own (good and bad) ideas 100× faster. This is unbelievably valuable
for learning deep learning. One of the mistakes I made in my career was waiting too long to
start working with GPUs. Don't be like me: buy one from NVIDIA, or use the free K80s
you can access in Google Colab notebooks. NVIDIA also occasionally lets students use its
GPUs for free for certain AI competitions, but you have to keep an eye out for those opportunities.
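Before assuming a machine (your own, or a Colab runtime) can use a GPU, it's worth checking whether the NVIDIA driver tooling is even present. Here's a small, hedged sketch — `nvidia_gpu_report` is a hypothetical helper name, and it only probes for the standard `nvidia-smi` utility:

```python
import shutil
import subprocess

def nvidia_gpu_report():
    """Return nvidia-smi's output if NVIDIA driver tooling is installed,
    or None on a machine without it (e.g. a CPU-only laptop)."""
    if shutil.which("nvidia-smi") is None:
        return None
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout

report = nvidia_gpu_report()
print("GPU tooling found" if report else "No NVIDIA GPU tooling on this machine")
```

In a Colab notebook, running `nvidia-smi` in a cell tells you which GPU (if any) your runtime was assigned.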
Step 8: Get paid to practice
The more time you have to do deep learning,
the faster you’ll learn.
Another pivot point in my career was when I got a job that let me explore deep learning
tools and research. Become a data scientist, data engineer, or research engineer, or freelance
as a consultant doing statistics. The point is, you want to find a way to get paid to keep
learning during work hours. These jobs exist; it just takes some effort to find them.
Your blog is essential to getting a job like this. Whatever job you want to get, write at least
two blog posts showing that you can do whatever it is they’re looking to hire someone for.
That’s the perfect resume (better than a degree in math). The perfect candidate is someone
who has already shown they can do the job.
Step 9: Join an open source project
The best way to network and career-build in AI is to become
a core developer in an open source project.
Find a deep learning framework you like, and start implementing things. Before you know
it, you’ll be interacting with researchers at the top labs (who will be reading/approving your
pull requests). I know of plenty of folks who have landed awesome jobs (seemingly from
nowhere) using this approach.
That being said, you have to put in the time. No one is going to hold your hand. Read the
code. Make friends. Start by adding unit tests and documentation explaining the code, then
work on bugs, and eventually start in on bigger projects. It takes time, but it’s an investment
in your future. If you’re not sure, go with a major deep learning framework like PyTorch,
TensorFlow, or Keras, or you can come work with me at OpenMined (which I think is the
coolest open source project around). We’re very newbie friendly.
Step 10: Develop your local community
I really learned deep learning because I enjoyed hanging out with
friends who were learning it too.
I learned deep learning at Bongo Java, sitting next to my best friends who were also
interested in it. A big part of me sticking with it when the bugs were hard to fix (it took me
two days to find a single period once) or the concepts were hard to master was that I was
spending time around the people I loved being with. Don’t underestimate this. If you’re
in a place you like to be, with people you like to be with, you’re going to work longer and
advance faster. It’s not rocket science, but you have to be intentional. Who knows? You
might even have a little fun while you’re at it!
index
* function, 45
A
absolute value, 51
academic papers, 297
accuracy, 149–150
activation functions, 161–175
adding to layer, 170–171
computable, 164
continuous and infinite in
domain, 162
defined, 162
hidden-layer, 165
installation instructions, 170–171
monotonic, 163
nonlinear, 164
output layer, 166–167
similar inputs, 168
slope, 172–173
softmax computation, 169
upgrading MNIST network,
174–175
activation input parameter, 261
actual error, 50
__add__ function, 241
addition backpropagation, 240
additional functions, adding
support for, 242–245
algorithms, 18
alpha, 75–76
Anaconda framework, 7
AND operator, 33
arbitrary length
backpropagation with, 225
challenge of, 210
forward propagation with, 224
weight update with, 226
architecture of neural networks
importance of visualization
tools, 143
language, 197–198
overview of, 138
artificial neural networks, 10
attenuation, 169
autograd (automatic gradient
computation)
adding cross entropy to, 258–259
adding indexing to, 256
general discussion, 234–235
implementing LSTM with, 276
to train neural network, 246–247
upgrading to support multiuse
tensors, 238–239
used multiple times, 237
automatic optimization, 231–263.
See also autograd (automatic
gradient computation)
adding, 248
adding support
for additional functions,
242–245
for negation, 241
addition backpropagation, 240
deep learning framework, 232, 252
dynamic computation graph, 236
layers
adding support for layer
types, 249
containing layers, 250
cross-entropy layer, 258–259
embedding, 255, 257
in Keras or PyTorch, 249
loss-function layers, 251
nonlinearity layers, 253–254
recurrent neural network (RNN)
layer, 260–263
tensors
defined, 233
that are used multiple times,
237
averaged word vectors, RNN, 212
B
Babi dataset, 222
backpropagation
addition, 240
in code, 126–127
iteration of, 128–129
recurrent neural network (RNN),
221, 225
truncated, 267–270
disadvantages of, 268–269
iterating with, 270
weighted average delta, 120
weighted average error, 119
.backward( ) method, 239–240,
241, 244, 247, 252
bash commands, 222
batch gradient descent, 109,
158–160
batch_loss, 270
batch_size, 268, 269
batch_size/alpha pair, 160
Bernoulli distribution, 155
blogging, to teach deep
learning, 296
Bongo Java, 299
bptt variable, 268, 269, 270
C
calculus, 69
cell-update vector, 275
character language modeling, 266.
See also LSTM (long short-term memory)
ciphertext, 289
cluster labels, 13
comparing
mean squared error, 48, 50
measuring error, 51
computable activation
functions, 164
computation graphs, 235, 236
concatenation, 211
conditional (sometimes) correlation, 122, 123
continuous functions, 162
convolutional neural networks,
177–185
convolutional layer, 179–180
implementation in NumPy,
181–185
reusing weights in multiple
places, 178
corners. See convolutional neural
networks
correlation
coefficients, 18
creating, 117
indirect, 116
input/output, 190
learning, 110
negative, 165
searching for, 100–101
selective, 164
correlation summarization, 135
counting-based learning. See nonparametric learning
creation_op add, 235
creators attribute, 234
cross communication, 111, 135
cross-entropy class, 174, 205, 259,
266
cross-entropy layer, 258–259
cryptography, 289
curves, 213
D
data, grouping, 13
data patterns, 105
datapoints, 13, 196
datasets
Babi, 222
clustering into groups, 13
IMDB movie reviews, 191
learning whole, 108
MNIST, 94–95, 146–147, 174–175
preparing data, 102
streetlight problem, 100–101
transforming, 12
debugging frameworks, 236
decoder weight matrix, 223
deep learning
adapted for beginning learners, 5
analogies and, 6
defined, 10
difficulty level for learning, 5
ongoing education, 295
overview, 3
project-based teaching method, 6
reasons for, 4
incremental automation of
intelligence, 4
potential for automation of
skilled labor, 4
stimulates intelligence and
creativity, 4
requirements for, 7–8
high school–level mathematics, 7
Jupyter Notebook, 7
NumPy Python library, 7
personal challenge to solve, 7
Python knowledge, 8
subfield of machine learning, 10
teaching, 296
textbook for, 295
to understand frameworks and
code libraries, 5–6
deep learning framework, 232,
252. See also automatic optimization
deep neural network
backpropagation
in code, 126–127
iteration of, 128–129
weighted average delta, 120
weighted average error, 119
batch gradient descent, 109
building, 107
correlation
creating, 117
indirect, 116
learning, 110
full gradient descent, 109
importance of, 131
learning whole dataset, 108
linear versus nonlinear
networks, 121
making predictions, 125
matrices and matrix relationship,
103–105
importing matrices into
Python, 106
patterns, 104–105
streetlight problem, 103–105
overfitting, 113
preparing data, 102
running program, 130
sometimes correlation, 122, 123
stacking, 118
stochastic gradient descent, 109
streetlight problem, 100–101
weight pressure
conflicting pressure, 114–115
weight update, 111–112
defeat_delta variable, 225
delta, multiplying by slope, 172
delta variable, 58, 59, 81, 120, 127
deniability, 288, 291
derivatives
calculating, 68
defined, 67
example, 71
relationship between weight and
error and, 66, 69
using to learn, 70
weight_delta, 70
diagonal, 217
diff variable, 246
direct imitation, 11
direction_and_amount variable, 56
discorrelation, 112
divergence, 74, 75
dot products, 35, 45, 107, 217
neural prediction, 30–34
visualizing, 97
double negative, 33
down pressure, 110
dropout technique, regulation
in code, 155–156
evaluated on MNIST, 157
general discussion, 153–154
dynamic computation graph, 236
DyNet framework, 232
E
early stopping, 151, 152
edges. See convolutional neural
networks
ele_mul function, 81, 91
elementwise addition, 31, 107
elementwise multiplication, 31,
107
elementwise operation, 31
embedding layers, 194–195,
255, 257
encryption, 288
error curve, 70
errors
error attribution, 49
mean squared error, 48, 50
measuring, 51
positive, 51
reducing, 60–61
Euclidian distance, 199
execution and output analysis,
RNN, 227–228
expand function, adding support
for, 242–245
exploding gradient problem
countering with LSTM, 274
general discussion, 272
toy example, 273
F
federated learning
general discussion, 283
hacking into, 287
homomorphically encrypted,
290–291
overview of, 282, 286
fill-in-the-blank task, neural
networks, 201–202, 206
for loop, 181, 247, 256, 267
forward( ) method, 249, 260
forward propagation, 21–46
finishing with .index_select( )
method, 257
importance of visualization
tools, 143
linking variables, 141
with multiple inputs, 28–29
how it works, 40–41
runnable code, 35
weighted sum, 30–34
with multiple outputs
how it works, 40–41
predicting with multiple inputs,
38–39
using single input, 36–37
neural networks
defined, 25
purpose of, 26–27
simple, 24
stacking, 42–43
NumPy Python library, 44–46
overview of, 171, 173, 181,
201, 244
predicting on predictions,
42–43
prediction, 22–23
recurrent neural network,
220, 224
side-by-side visualization, 142
frameworks, debugging, 236
full gradient descent, 109
functions, 64–65
G
gates, 274
generalization, regulation, 149
get_parameters( ) method, 249
global correlation summarization,
135
goal_pred variable, 50
goal_prediction variable, 55
GPUs, 297
gradient descent, 79–98. See also
neural learning
breaking, 72
general discussion, 56–57
iteration of, 58–59
with multiple inputs
freezing one weight, 88–89
general discussion, 80–81
single-weight neural network
versus, 82–87
turning single delta into
three weight_delta values,
83–85
with multiple inputs and outputs
gradient descent generalizes
to arbitrarily large networks,
92–93
neural networks making
multiple predictions using
single input, 90–91
with multiple outputs, 90–91
visualizing dot products
(weighted sums), 97
weights, 94–96
graphs, 235, 236
grouping data, 13, 197, 214
Gryffindor, 294
H
hidden variable, 260
hidden-layer activation functions,
165
hidden-to-hidden layer, 261
hidden-to-output layer, 261
high-low pattern, 214
homomorphic encryption, 288,
289–291
hot and cold learning
characteristics of, 55
example, 54
general discussion, 52–53
I
ICML (International Conference
on Machine Learning), 297
identity matrices, 215, 217, 219
identity vectors, 216
image classification, 42
images matrix, 94
IMDB movie reviews dataset, 191
imitation, 11
inceptionism, 64
.index_select( ) method, 257
indices array, 266, 268
indirect correlation, 116
indirect imitation, 11
infinite functions, 162
infinite parameters, 14
input -> goal_prediction pairs,
51, 108
input data pattern, 105
input datapoints, 196
input datasets, 11, 23
input layers, 116
input node, 38
input values, 26, 81
input variable, 26, 30
input vector, 31, 97, 173
input/output correlation, 190
inputs
gradient descent with
freezing one weight, 88–89
general discussion, 80–81
generalizes to arbitrarily large
networks, 92–93
neural networks making
multiple predictions using
only single input, 90–91
single-weight neural network vs.,
82–87
turning single delta (on node)
into three weight_delta values,
83–85
how it works, 40–41
overview of, 28–29
runnable code, 35
weighted sums, 30–34
input-to-hidden layer, 261
installation instructions, activation
functions, 170–171
intelligence targeting, 203
intermediate dataset, 117
intermediate predictions, 127
International Conference on
Machine Learning (ICML),
297
J
Jupyter Notebook, 7
K
Keras framework, 232, 249, 298
kernel_output, 184
kernels, 179–180
knob_weight variable, 52, 55
L
labels, 94, 168, 196, 199, 203
language, neural networks
understanding, 187–208
embedding layer, 194–195
fill-in-the-blank task, 201–202
general discussion, 188
IMDB movie reviews
dataset, 191
interpreting output, 196
loss function, 203–205
meaning of neuron, 200
natural language processing,
189
neural architecture, 197–198
predicting movie reviews, 193
supervised natural language
processing, 190
word analogies, 206–207
word correlation, 192
word embeddings, 199
Lasagne framework, 232
Layer class, 249
layer_0_delta variable, 225
layer_1_delta variable, 170,
221
layer_2_delta variable, 221
layers
adding activation functions to,
170–171
adding support for layer
types, 249
containing layers, 250
cross-entropy layer, 258–259
dimensionality of matrices
and, 138
embedding, 255, 257
embedding layer translates indices into activations, 255
in Keras or PyTorch, 249
loss-function layers, 251
nonlinearity layers, 253–254
linear neural networks, 121
linear nodes, 134
list objects, 44
local correlation summarization,
135
log function, 227
logical analysis, 12
logical AND, 33
long short-term memory. See
LSTM (long short-term
memory)
loss function, 203–205, 207,
223, 259
loss.backward( ) function, 267
loss-function layers, 251
lossless representation, 105
lower weights, 119
LSTM (long short-term memory)
character language model
training, 278
tuning, 279–280
upgrading, 277
countering vanishing and exploding gradients with, 274
gates, 275
using autograd system to implement, 276
M
machine learning, 11–13
make_sent_vect function, 212
matrices and matrix relationship,
103–105
importing matrices into
Python, 106
layers and, 143
patterns, 104–105
streetlight problem, 103–105
matrix multiplication, adding
support for, 242–245
max pooling, 180
mean pooling, 180
mean squared error, 48, 50, 52, 58,
205, 246, 251
measuring error, 48
memorization, regulation, 149
memorizing neural network
code, 77
mini-layers, 179
missing values, 163
MNIST (Modified National
Institute of Standards and
Technology) dataset
overview of, 94–95
three-layer network on, 146–147
upgrading, 174–175
MNIST digit classifier, 167
MNISTPreprocessor notebook,
94
monotonic activation functions,
163
multi-input gradient descent, 82
multiple inputs, 28–29
gradient descent with
freezing one weight, 88–89
general discussion, 80–81
generalizes to arbitrarily large
networks, 92–93
neural networks making
multiple predictions using
only single input, 90–91
single-weight neural network
versus, 82–87
turning single delta (on node)
into three weight_delta values,
83–85
how it works, 40–41
runnable code, 35
weighted sums, 30–34
multiple outputs
gradient descent with
generalizes to arbitrarily large
networks, 92–93
neural networks making
multiple predictions using
only single input, 90–91
how it works, 40–41
predicting with multiple inputs,
38–39
using single input, 36–37
multiplication function, adding
support for, 242–245
N
n linear layers, 181
n output neurons, 181
n_batches, 268
n_hidden parameter, 261
n_layers input parameter, 262
Nanodegree, 294
NaNs (not-a-numbers), 272
natural language processing.
See NLP
n-dimensional tensors, 233
__neg__ function, 241
negation, adding support for, 241
negative correlation, 165
negative derivatives, 67
negative labels, 196, 199, 203
negative numbers, 27
negative reversal attribute, 56, 57, 83
negative sampling, 201
negative sensitivity, 69
negative weight, 34
neural architecture. See architecture of neural networks
neural learning, 47–77
alpha, 75–76
calculus and, 69
comparing
mean squared error, 48, 50
measuring error, 51
derivatives
calculating, 68
defined, 67
example, 71
relationship between weight and
error and, 66, 69
using to learn, 70
weight_delta, 70
divergence, 74
error attribution, 49
functions, 64–65
gradient descent
breaking, 72
general discussion, 56–57
iteration of, 58–59
hot and cold learning
characteristics of, 55
example, 54
general discussion, 52–53
memorizing, 77
overcorrections
alpha, 75
visualizing, 73
reducing error, 60–61
steps of, 62–63
neural networks. See also deep
neural network; neural
learning
backpropagation
in code, 126–127
iteration of, 128–129
weighted average delta, 120
weighted average error, 119
batch gradient descent, 109
building, 107
correlation
creating, 117
indirect, 116
learning, 110
defined, 25
full gradient descent, 109
importance of, 131
learning whole dataset, 108
linear versus nonlinear
networks, 121
making multiple predictions
using single input, 90–91
making predictions, 125
matrices and matrix relationship
importing matrices into
Python, 106
patterns, 104–105
streetlight problem, 103–105
overfitting, 113
preparing data, 102
purpose of, 26–27
running program, 130
simple, 24
sometimes correlation, 122, 123
stacking, 42–43, 118
stochastic gradient descent, 109
streetlight problem, 100–101
visualizing, 133–143
architecture, 138
correlation summarization, 135
importance of visualization
tools, 143
side by side, 142
simplifying, 134, 136–138
vector-matrix multiplication,
139–141
weight pressure
conflicting pressure, 114–115
weight update, 111–112
neural prediction, 21–46
with multiple inputs, 28–29
how it works, 40–41
runnable code, 35
weighted sum, 30–34
with multiple outputs
how it works, 40–41
predicting with multiple inputs,
38–39
using single input, 36–37
neural networks
defined, 25
purpose of, 26–27
simple, 24
stacking, 42–43
NumPy Python library, 44–46
predicting on predictions, 42–43
prediction, 22–23
neurons, 134, 181, 200
NLP (natural language processing),
189, 190, 210
nodes, 38
noise, 150–151, 154, 213. See also
regulation
nonlinear activation functions,
164
nonlinear neural networks, 121
nonlinearities. See also activation
functions
nonlinearity layers, 123, 253–254
nonparametric learning
counting-based methods, 18
parametric learning versus, 14
normalization, 87
normalized variants, 18
normed_weights matrix, 212
NOT operator, 33
not-a-numbers (NaNs), 272
np.dot function, 160
NumPy Python library, 7, 44–46,
181–185
O
objective function, 205
one_hot utility matrix, 223
one-dimensional tensors, 233
one-hot encoding, 192
on-the-job training, 298
open source project, 298
OpenMined, 291, 298
OR operator, 33
output data pattern, 105
output datapoints, 196
output datasets, 11, 23
output layer activation functions
choosing, 166
configurations
no activation function, 166
sigmoid, 166
softmax, 167
outputs
converting to slope, 173
gradient descent with
generalizes to arbitrarily large
networks, 92–93
neural networks making
multiple predictions using
only single input, 90–91
how it works, 40–41
neural networks, 196
predicting with multiple inputs,
38–39
using single input, 36–37
overcorrections
alpha, 75
visualizing, 73
overfitting
causes of, 151
general discussion, 150
overview of, 113
overshooting, 74
P
parameters, 14–18, 178, 279
parametric learning
nonparametric learning
versus, 14
supervised, 15
unsupervised, 17
patterns, 104–105
peer support, for deep learning, 99
perplexity metric, 227
pip install phe command, 289
pixels, 96, 131, 154, 179
plaintext, 289
plausible deniability, 288, 291
pooling, 180
positive errors, 51
positive labels, 196, 199, 203
positive sensitivity, 69
practice, importance of, 298
pred variable, 60
predictions
deep neural networks, 125
with multiple inputs, 28–29
how it works, 40–41
runnable code, 35
weighted sum, 30–34
with multiple outputs
how it works, 40–41
predicting with multiple inputs,
38–39
using single input, 36–37
neural networks
defined, 25
purpose of, 26–27
simple, 24
stacking, 42–43
NumPy Python library, 44–46
predicting images, 148
predicting movie reviews, 193
predicting on predictions, 42–43
prediction, 22–23
privacy, 281–291. See also security
and privacy
federated learning, 286
general discussion, 283
hacking into, 287
homomorphically encrypted,
290–291
homomorphic encryption, 289
privacy, 282
secure aggregation, 288
spam, 284–285
private key, 289
probabilities. See activation
functions
project-based teaching method, 6
propagation, 21–46
finishing with .index_select( )
method, 257
importance of visualization
tools, 143
linking variables, 141
with multiple inputs, 28–29
how it works, 40–41
runnable code, 35
weighted sum, 30–34
with multiple outputs
how it works, 40–41
predicting with multiple inputs,
38–39
using single input, 36–37
neural networks
defined, 25
purpose of, 26–27
simple, 24
stacking, 42–43
NumPy Python library, 44–46
overview of, 141, 171, 173, 181,
201, 244
predicting on predictions, 42–43
prediction, 22–23
recurrent neural network, 220,
224
side-by-side visualization, 142
public key, 289
pure error, 51, 56
Python
creating sentence embeddings
using identity matrices
in, 217
forward propagation in, 220
learning, 8
NumPy Python library, 7, 44–46,
181–185
Python Codecademy course, 8
PyTorch, 232, 249, 294
R
random subsections, 153
randomized response, 288
randomness, 110
raw error, 52, 71
recurrent embeddings, 223
recurrent matrix, 224
recurrent neural network. See
RNN
reducing error, 60–61
regularization, 115, 146, 152, 178
regulation, 145–160
batch gradient descent, 158–160
dropout technique
in code, 155–156
evaluated on MNIST, 157
general discussion, 153–154
early stopping, 152
generalization, 149
memorization, 149
overfitting
causes of, 151
general discussion, 150
predicting images, 148
three-layer network on MNIST,
146–147
relu function, 123, 139, 141, 146,
162, 164, 170, 173, 273
relu2deriv function, 126, 127, 171
reviews2vectors matrix, 212
RNN (recurrent neural network),
209–229
arbitrary length of data
backpropagation with, 225
challenge of, 210
forward propagation with, 224
weight update with, 226
averaged word vectors, 212
Babi dataset, 222
backpropagation, 221
character language modeling, 266
comparing sentence vectors, 211
execution and output analysis,
227–228
forward propagation in
Python, 220
overview of, 260–263, 272
sentence embeddings, 217–219
setting up, 223
vanishing and exploding
gradients, 272
word embeddings, 213–216
how information is stored
in, 213
limitations of, 215
neural networks use of, 214
summing, 216
RNNCell class, 261
runnable code, neural prediction, 35
running .backward( ) method, 240
S
sampling output, 271
scalar multiples, 104
scalar-matrix multiplication, 45
scalars, 233
scaling attribute, 56, 57, 83
security and privacy, 281–291
federated learning, 286
general discussion, 283
hacking into, 287
homomorphically encrypted,
290–291
homomorphic encryption, 289
privacy, 282
secure aggregation, 288
spam, 284–285
selective correlation, 164
self.children counter, 239
self.data array, 233
self.w_hh layer, 261
self.w_ho layer, 261
self.w_ho.forward(h), 274
self.w_ih layer, 261
self.weight, 255
sensitivity, 27, 69, 172
sent2output layer, 220
sentence embeddings, RNN, 217
sentence vectors, 211, 219
transition matrices, 218
sentiment dataset, 191
Sequential( ) method, 254
SGD class, 249
shape, 23, 45, 150
sharpness of attenuation, 169
short-term memory, 26
sigmoid( ) function, 165, 166, 173,
253, 254, 273, 275
signal, 151, 154. See also regulation
similarity, 32
simple neural networks, 24
simplifying visualization, 134,
136–138
single-input gradient descent, 82
single-weight neural network, 82–87
slope
converting output to, 173
multiplying delta by, 172
overview of, 126
softmax computation
activation functions, 169
output layer activation
functions, 167
softmax function, 173, 174, 191,
223, 259
sometimes (conditional) correlation, 122, 123
sox_delta variable, 225
spam, 284–285
square matrix, 215
squiggly line, 213
stacked convolutional layers, 184
stacking neural networks, 42–43,
118
stacks of layers, 134
static computation graph, 236
step_amount variable, 55
stickiness, 172
stochastic gradient descent optimizer, 109, 227, 248, 266
stopping attribute, 56, 57, 83
streetlight problem
datasets, 100–101
matrices and matrix relationship,
103–105
subtraction function, adding
support for, 242–245
sum function, adding support for,
242–245
sum pooling, 180
.sum(dim) function, 243
summing embeddings, 216
supervised machine learning, 12
supervised natural language
processing (NLP), 190
supervised parametric learning, 16
T
tanh( ) function, 165, 173, 174,
253, 254
target labels, meaning of neuron
based on, 200
tasks, NLP, 189
Tensor class, 240, 253
Tensor.backward( ) function,
252
TensorFlow, 232, 298
tensors
adding nonlinear functions to,
253–254
automatic gradient computation
(autograd), 238–239
defined, 233
that are used multiple times,
238–239
test( ) function, 285
test_images variable, 148
test_labels variable, 148
testing accuracy, 149
Theano, 232
three-layer networks, 121,
146–147
topic classification, 190
train( ) function, 285
training accuracy, 149–150
Training-Acc, 157
transition weights, 220
transpose function, adding support
for, 242–245
trial and error, 15, 25. See also
parametric learning
true signal, 150
truncated backpropagation
disadvantages of, 268–269
iterating with, 270
overview of, 267
Twitter, 297
two-dimensional tensors, 233
two-layer networks, 121, 197
U
<unk> tokens, 284
unsupervised machine learning,
11, 13
up pressure, 110
utility functions, 223
V
validation set, 152
vanishing gradient problem
countering with LSTM, 274
general discussion, 272
toy example, 273
variable-length text, 191
variables
linking, 141
multiplying, 45
variants, 280
vect_mat_mul function, 41
vectors, 31, 137, 140, 192, 215
vector-scalar addition and
multiplication, 107
virtual graph, 236
visualizing neural networks,
133–143
architecture, 138
correlation summarization, 135
importance of visualization
tools, 143
side by side, 142
simplifying, 134, 136–138
vector-matrix multiplication
defined, 139
letters can be combined to
indicate functions and
operations, 141
linking variables, 141
using letters instead of
pictures, 140
volume, 27, 30
W
w_sum function, 35, 41
weight pressure
conflicting pressure, 114–115
weight update, 111–112
Weight Pressure table, 111, 114
weight values, 26, 110
weight variable, 26
weight vector, 97
weight_delta variable, 81, 83, 84,
91, 109
weighted average delta, 120
weighted average error, 119
weighted sums, 107, 139
neural prediction, 30–34
visualizing, 97
weights. See also multiple inputs
batch gradient descent, 109
convolutional neural
networks, 178
freezing one weight, 88–89
full gradient descent, 109
MNIST dataset, 94–95
stochastic gradient descent, 109
turning single delta into three
weight_delta values, 83–85
visualizing weight values, 96
weight update with arbitrary
length, 226
weights variable, 18, 23, 30, 41, 84
weights vector, 31
wlrec predictor, 34
word analogies, 206–207
word correlation, capturing in
input data, 192
word embeddings
comparing, 199
recurrent neural network, 213–216
how information is stored
in, 213
limitations of, 215
neural networks use of, 214
summing, 216
word vectors, 206, 214
Y
y.creation_op, 235
Z
z.backward( ) function, 235