Deep Learning Tutorial Complete (v3)
李宏毅
Hung-yi Lee
Deep learning attracts lots of attention.
• Google Trends
Handwriting Digit Recognition
An image of a handwritten digit goes into the machine, and the machine outputs the digit, e.g. "2".
Input: the 16 x 16 = 256 pixels of the image as a vector x1, x2, ⋯, x256 (ink → 1, no ink → 0).
Output: a vector y1, y2, ⋯, y10; each dimension represents the confidence of a digit, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ⋯, y10 = 0.2 ("is 0"), so the image is "2".
Example Application
• Handwriting Digit Recognition
The machine is a function f: R^256 → R^10 that maps the 256-dimensional image vector x1, ⋯, x256 to the 10-dimensional confidence vector y1, ⋯, y10 ("the image is '2'").
In deep learning, the function f is represented by a neural network.
Element of Neural Network
Neuron: f: R^K → R
z = a1 w1 + a2 w2 + ⋯ + aK wK + b
a = σ(z)
a1, ⋯, aK: inputs; w1, ⋯, wK: weights; b: bias; σ: activation function; a: output
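As a concrete illustration, here is a minimal NumPy sketch of a single neuron with a sigmoid activation; the input, weight, and bias values are made up for the example, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    # Activation function: squashes any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # f: R^K -> R with z = a1*w1 + ... + aK*wK + b and output sigma(z).
    z = np.dot(w, a) + b
    return sigmoid(z)

# Made-up example values (K = 3 inputs):
a = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.1, -0.3])
b = 0.2
print(neuron(a, w, b))
```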
Neural Network
A neural network connects many neurons: Input (x1, ⋯, xN) → Layer 1 → Layer 2 → ⋯ → Layer L → Output (y1, ⋯, yM).
Input Layer, Hidden Layers, Output Layer.
A common activation function is the sigmoid: σ(z) = 1 / (1 + e^(−z)).
Example of Neural Network
With the weights and biases shown on the slide, the network computes a function f: R^2 → R^2:
f(1, −1) = (0.62, 0.83)
f(0, 0) = (0.51, 0.85)
Different parameters define different functions.
Matrix Operation
The first layer of the example network can be written as a matrix operation:
σ( [[1, −2], [−1, 1]] [1, −1]ᵀ + [1, 0]ᵀ ) = σ( [4, −2]ᵀ ) = [0.98, 0.12]ᵀ
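The layer computation above can be checked numerically; this is a minimal NumPy sketch reproducing σ(Wx + b) ≈ (0.98, 0.12) from the values on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])
x = np.array([1.0, -1.0])
b = np.array([1.0, 0.0])

# z = W x + b = [4, -2];  sigma(z) ~= [0.98, 0.12]
print(sigmoid(W @ x + b))
```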
Neural Network
With weight matrices W1, W2, ⋯, WL and bias vectors b1, b2, ⋯, bL, each layer is a matrix operation:
a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
⋯
y = σ(WL aL−1 + bL)
Neural Network
The whole network is therefore a composition of these operations:
y = f(x) = σ( WL ⋯ σ( W2 σ( W1 x + b1 ) + b2 ) ⋯ + bL )
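A minimal sketch of this full forward pass in NumPy; the layer sizes (256 → 500 → 500 → 10) and the random parameters are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Repeatedly apply a = sigma(W a + b), layer by layer.
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Assumed layer sizes: 256 -> 500 -> 500 -> 10, randomly initialized.
rng = np.random.default_rng(0)
sizes = [256, 500, 500, 10]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.random(256)          # a fake 16 x 16 image flattened to 256 values
y = forward(x, weights, biases)
print(y.shape)               # (10,)
```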
Softmax
• Softmax layer as the output layer
Ordinary Layer: yi = σ(zi). In general, the output of the network can be any value.
Softmax Layer: yi = e^(zi) / Σ_{j=1}^{3} e^(zj)
Example: (z1, z2, z3) = (3, 1, −3) → (e^(z1), e^(z2), e^(z3)) ≈ (20, 2.7, 0.05) → (y1, y2, y3) ≈ (0.88, 0.12, ≈0)
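A minimal sketch of the softmax layer; it reproduces the slide's example z = (3, 1, −3) → approximately (0.88, 0.12, 0).

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) avoids overflow and does not change the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~[0.88, 0.12, 0.00]
```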
How to set network parameters
θ = {W1, b1, W2, b2, ⋯, WL, bL}
For handwriting digit recognition (input: 16 x 16 = 256 pixels, ink → 1, no ink → 0; output layer: Softmax over y1, ⋯, y10), set the network parameters θ such that:
Input the image of "1": y1 has the maximum value.
Input the image of "2": y2 has the maximum value.
How to let the neural network achieve this?
Training Data
• Preparing training data: images and their labels.
For an image labeled "1", the target output is y1 = 1 and all other dimensions 0; the network output (e.g. 0.2, 0.3, ⋯, 0.5) is compared with this target, giving a cost L(θ).
Cost can be Euclidean distance or cross entropy of the network output and target.
Total Cost
For all training data, the total cost is
C(θ) = Σ_{r=1}^{R} L^r(θ)
which measures how bad the network parameters θ are on this task.
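A minimal sketch of the total cost C(θ) = Σ_r L^r(θ) using cross entropy as the per-example cost; the example outputs and targets are placeholder values, not data from the slides.

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Per-example cost L^r(theta) between network output y and target y_hat.
    return -np.sum(y_hat * np.log(y + 1e-12))

def total_cost(outputs, targets):
    # C(theta) = sum over all R training examples of L^r(theta).
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets))

# Placeholder data: 2 examples, 3 classes.
outputs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
targets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(total_cost(outputs, targets))
```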
Gradient descent can get stuck in parameter space wherever ∇C(θ) ≈ 0 or ∇C(θ) = 0 (plateaus, saddle points, local minima).
In the physical world, a rolling ball has momentum that can carry it past points where the gradient is 0.
• Momentum
Mini-batch
Randomly initialize θ^0.
Pick the 1st batch (e.g. x^1, x^31, ⋯): C = L^1 + L^31 + ⋯, θ^1 ← θ^0 − η∇C(θ^0)
Pick the 2nd batch (e.g. x^2, x^16, ⋯): C = L^2 + L^16 + ⋯, θ^2 ← θ^1 − η∇C(θ^1)
⋯
C is different each time we update the parameters!
Mini-batch
Original Gradient Descent follows the gradient of the full cost; with mini-batch the updates are unstable, since each batch gives a different C:
C = C^1 + C^31 + ⋯, θ^1 ← θ^0 − η∇C(θ^0)
C = C^2 + C^16 + ⋯, θ^2 ← θ^1 − η∇C(θ^1)
⋯
Repeat until all mini-batches have been picked: one epoch.
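A minimal sketch of one epoch of mini-batch gradient descent as described above; `grad_C` is a hypothetical function returning ∇C(θ) on a batch, and the toy cost in the usage example is an assumption for illustration.

```python
import numpy as np

def minibatch_sgd(theta, data, grad_C, eta=0.1, batch_size=2):
    """One epoch: update theta once per mini-batch until all batches are used."""
    indices = np.random.permutation(len(data))       # shuffle the training data
    for start in range(0, len(data), batch_size):
        batch = [data[i] for i in indices[start:start + batch_size]]
        theta = theta - eta * grad_C(theta, batch)   # theta^t <- theta^{t-1} - eta * grad C
    return theta

# Toy usage: minimize the mean squared distance of theta to the data points.
data = [np.array([1.0]), np.array([2.0]), np.array([3.0]), np.array([4.0])]
grad = lambda theta, batch: np.mean([2 * (theta - x) for x in batch], axis=0)
print(minibatch_sgd(np.array([0.0]), data, grad))
```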
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
Part II:
Why Deep?
Deeper is Better?
Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2
7 X 2k         17.1
Not surprising: more parameters, better performance.
Layer X Size   Word Error Rate (%)
1 X 3772       22.5
1 X 4634       22.6
1 X 16k        22.1
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Fat + Short v.s. Thin + Tall
(Figure: a shallow "fat + short" network and a deep "thin + tall" network, both taking the same inputs x1, x2, ⋯, xN.)
Layer X Size   Word Error Rate (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2
7 X 2k         17.1
Layer X Size   Word Error Rate (%)
1 X 3772       22.5
1 X 4634       22.6
1 X 16k        22.1
With a comparable number of parameters, the deep networks (left) give lower word error rates than the shallow ones (right).
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why Deep?
• Deep → Modularization
Suppose we train four image classifiers directly:
Classifier 1: girls with long hair (plenty of examples)
Classifier 2: boys with long hair (weak, because there are only a few examples)
Classifier 3: girls with short hair (plenty of examples)
Classifier 4: boys with short hair (plenty of examples)
Why Deep? Each basic classifier can have sufficient training examples.
• Deep → Modularization
Instead, first train basic classifiers for the attributes:
"Boy or Girl?": girls (long or short hair) v.s. boys (long or short hair)
"Long or short hair?": long hair (girls or boys) v.s. short hair (girls or boys)
Both attributes have plenty of training examples.
Why Deep? The classifiers in the second stage can be trained with little data.
• Deep → Modularization
The basic classifiers ("Boy or Girl?", "Long or short hair?") are shared by the following classifiers as modules:
Classifier 1: girls with long hair
Classifier 2: boys with long hair (fine even with little data, because its modules are well trained)
Classifier 3: girls with short hair
Classifier 4: boys with short hair
Deep Learning also works on small data sets like TIMIT.
In a deep network, the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build classifiers, the third layer uses the second layer as modules, and so on.
SVM: a hand-crafted kernel function followed by a simple classifier.
(Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf)
Deep Learning: a learnable kernel. The hidden layers transform the input x into a feature representation φ(x), and the output layer applies a simple classifier to φ(x).
Hard to get the power of Deep …
Recipe for Learning
http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
If the results on the training data are not good: Modify the Network
• New activation functions, for example, ReLU or Maxout
If the results on the training data are good but the test results are not: Prevent Overfitting
• Dropout: only use this approach when you have already obtained good results on the training data.
Part III:
Tips for Training DNN
New Activation Function
ReLU
• Rectified Linear Unit (ReLU): a = z when z > 0, a = 0 otherwise.
Reasons:
1. Fast to compute
2. Biological reason
3. Infinite sigmoids with different biases
4. Vanishing gradient problem
[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]
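A minimal sketch of ReLU and its gradient (the gradient at z = 0 is taken to be 0 here by convention).

```python
import numpy as np

def relu(z):
    # a = z when z > 0, a = 0 otherwise.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Gradient is 1 for z > 0 and 0 for z <= 0 (no shrinking factor, unlike sigmoid).
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_grad(z))
```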
Vanishing Gradient Problem
In 2006, people used RBM pre-training. In 2015, people use ReLU.
In a deep sigmoid network, the gradients at the layers near the input are much smaller than at the layers near the output, so the earlier layers learn very slowly.
Intuition: a change +Δw to a weight near the input has to pass through many sigmoids, each of which maps a large input change to a small output change, so it produces only a small change +ΔC in the cost. The gradient for weights near the input is therefore small.
ReLU
For a given input, the neurons whose ReLU output is 0 can be removed; what remains is a thinner, linear network in which every active neuron has a = z.
A linear network does not have smaller and smaller gradients towards the input.
Maxout: ReLU is a special case of Maxout
The pre-activations in a layer are grouped, and each group outputs the maximum of its elements.
Example: the first layer's pre-activations (5, 7, −1, 1), grouped in pairs, give max(5, 7) = 7 and max(−1, 1) = 1; the second layer's pre-activations (1, 2, 4, 3) give max(1, 2) = 2 and max(4, 3) = 4.
Learning Rate
Gradient descent moves θ^0 in the direction −∇C(θ^0); set the learning rate η carefully.
Can we give different parameters different learning rates?
Adagrad
Original Gradient Descent: θ^t ← θ^(t−1) − η∇C(θ^(t−1))
Adagrad gives each parameter w its own learning rate:
η_w = η / √( Σ_{i=0}^{t} (g^i)² )
where η is a constant and the denominator is the square root of the summation of the squares of the previous derivatives of w.
Adagrad example:
w1: derivatives g^0 = 0.1, g^1 = 0.2, ⋯        w2: derivatives g^0 = 20.0, g^1 = 10.0, ⋯
Learning rate for w1: η/√(0.1²) = η/0.1, then η/√(0.1² + 0.2²) = η/0.22
Learning rate for w2: η/√(20²) = η/20, then η/√(20² + 10²) = η/22
Observations: 1. The learning rate becomes smaller and smaller for all parameters. 2. Smaller derivatives give a larger learning rate, and vice versa. Why?
Intuition: a parameter with larger derivatives (a steeper direction on the error surface) gets a smaller learning rate, while a parameter with smaller derivatives gets a larger one, so the update steps stay balanced across parameters.
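A minimal sketch of the Adagrad update for a single parameter w, following the formula η_w = η / √(Σ_i (g^i)²) above; the toy cost function and learning rate are assumptions for illustration.

```python
import numpy as np

def adagrad(w, grad, eta=0.5, steps=20):
    sum_sq = 0.0
    for _ in range(steps):
        g = grad(w)
        sum_sq += g ** 2                     # accumulate (g^i)^2 over all past steps
        w -= eta / np.sqrt(sum_sq) * g       # w <- w - eta / sqrt(sum) * g
    return w

# Toy example: minimize C(w) = (w - 3)^2, so grad(w) = 2 (w - 3).
print(adagrad(0.0, lambda w: 2.0 * (w - 3.0)))
```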
Dropout
Training: each time before updating the parameters, some neurons are randomly dropped out, so the network trained at each update is thinner.
Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1−p)%.
Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
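A minimal sketch of dropout as described above: drop activations with probability p during training, and scale the weights by (1 − p) at testing time. (Many libraries instead use "inverted dropout", which scales at training time; that variant is not what the slide describes.)

```python
import numpy as np

def dropout_train(a, p=0.5, rng=np.random.default_rng()):
    # Each activation is dropped (set to 0) with probability p during training.
    mask = rng.random(a.shape) >= p
    return a * mask

def weights_for_test(W, p=0.5):
    # At testing time there is no dropout; weights are scaled by (1 - p).
    return W * (1.0 - p)

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(a))                   # some entries zeroed at random
print(weights_for_test(np.ones((2, 2))))  # e.g. w = 1 becomes 0.5 when p = 0.5
```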
Dropout - Intuitive Reason
"My partner may slack off, so I have to do my best." Each neuron is trained to work well even when its partners are dropped; at testing time nothing is dropped, so the whole team performs even better.
Dropout is a kind of ensemble.
Training of Dropout: each mini-batch (1, 2, 3, 4, ⋯) trains a different thinned network; with M neurons there are 2^M possible networks.
Testing: ideally we would average the outputs y1, y2, y3, ⋯ of all these networks. Multiplying all the weights by (1−p)% approximates this: average ≈ y.
More about dropout
• More references for dropout [Nitish Srivastava, JMLR’14] [Pierre Baldi, NIPS’13] [Geoffrey E. Hinton, arXiv’12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML’13]
• Dropconnect [Li Wan, ICML’13]
  • Dropout deletes neurons
  • Dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT’14]
  • Dropout rate decreases over epochs
• Standout [J. Ba, NIPS’13]
  • Each neuron has a different dropout rate
Part IV:
Neural Network
with Memory
Neural Network needs Memory
• Named Entity Recognition
  • Detecting named entities like names of people, locations, organizations, etc. in a sentence.
The word "apple" is fed to a DNN as a 1-of-N vector (1, 0, 0, ⋯, 0); the DNN outputs, e.g., 0.1 people, 0.1 location, 0.5 organization, 0.3 none.
Neural Network needs Memory
• Named Entity Recognition
  • Detecting named entities like names of people, locations, organizations, etc. in a sentence.
In "the president of apple eats an apple" (x1 ⋯ x7), the target for the first "apple" is ORG and for the second is NONE, yet a DNN that looks at one word at a time gives both the same output. The DNN needs memory!
Recurrent Neural Network (RNN)
The outputs of the hidden neurons (a1, a2, ⋯) are stored and copied back as input to the next time step: at time t the network reads xt through weights Wi and the stored state at−1 through weights Wh, computes the new state at, and produces the output yt through weights Wo. The same Wi, Wh, Wo are used at every time step.
The recurrent network can also be made deep by stacking several such hidden layers over the inputs xt, xt+1, xt+2, ⋯
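A minimal sketch of one time step of the RNN described above, with weight matrices Wi (input), Wh (recurrent), and Wo (output); the dimensions and the random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, a_prev, Wi, Wh, Wo):
    # New hidden state uses the current input and the stored previous state.
    a_t = sigmoid(Wi @ x_t + Wh @ a_prev)
    y_t = Wo @ a_t                      # output at this time step (logits)
    return a_t, y_t

# Assumed sizes: 4-dim input, 3-dim hidden state, 2-dim output.
rng = np.random.default_rng(0)
Wi = rng.standard_normal((3, 4))
Wh = rng.standard_normal((3, 3))
Wo = rng.standard_normal((2, 3))

a = np.zeros(3)
for x_t in rng.random((5, 4)):          # process a length-5 input sequence
    a, y = rnn_step(x_t, a, Wi, Wh, Wo)
    print(y)
```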
Bidirectional RNN
Two RNNs process the input sequence xt, xt+1, xt+2, ⋯, one forward and one backward, and the output yt is produced from the hidden states of both directions.
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. Speech Recognition: with a "null" symbol φ, the frame-level output "好 φ φ 棒 φ φ φ φ" is decoded as "好棒", while "好 φ φ 棒 φ 棒 φ φ" is decoded as "好棒棒".
Many to Many (No Limitation)
• Both input and output are sequences with different lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
The RNN first reads the input "machine learning"; its final hidden state contains all the information about the input sequence and is used to generate the output.
Many to Many (No Limitation)
• Both input and output are sequences with different lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
Without a way to stop, the decoder keeps generating characters: 機 器 學 習 慣 性 ⋯⋯ (like a never-ending chain of comments on PTT, e.g. "推 tlkagk: =========斷==========").
Ref: http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 (鄉民百科)
Many to Many (No Limitation)
• Both input and output are sequences with different lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
Add a stop symbol "===" to the output vocabulary; the decoder then generates 機 器 學 習 === and stops.
Unfortunately ……
• RNN-based networks are not always easy to learn.
Real experiments on language modeling show that training sometimes only works when you are lucky: the error surface is rough, either very flat or very steep.
A common remedy is clipping: when the gradient becomes too large, clip it to a fixed threshold.
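A minimal sketch of gradient clipping by norm, one common way to implement the "clipping" mentioned above; the threshold value is an assumption.

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    # If the gradient norm exceeds the threshold, rescale it to the threshold.
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

print(clip_gradient(np.array([30.0, 40.0])))   # norm 50 -> rescaled to norm 5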
Long Short-term Memory (LSTM)
A memory cell c is controlled by gates. The gate activation function f is usually a sigmoid (between 0 and 1), mimicking an open/close gate:
c′ = g(z) f(zi) + c f(zf)
a = h(c′) f(zo)
where z is the input, zi drives the input gate, zf the forget gate, and zo the output gate.
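A minimal sketch of a single LSTM cell implementing c′ = g(z)f(zi) + c f(zf) and a = h(c′)f(zo); taking g and h to be tanh is a common but assumed choice, and the input values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(z, z_i, z_f, z_o, c):
    # Gate activations are sigmoids between 0 and 1 (open/close gates).
    input_gate, forget_gate, output_gate = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)
    c_new = np.tanh(z) * input_gate + c * forget_gate   # c' = g(z) f(z_i) + c f(z_f)
    a = np.tanh(c_new) * output_gate                    # a  = h(c') f(z_o)
    return a, c_new

a, c = lstm_cell(z=0.5, z_i=2.0, z_f=-1.0, z_o=1.0, c=0.3)
print(a, c)
```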
Original Network: the neurons compute z1, z2 from the inputs x1, x2 and output a1, a2.
Simply replace the neurons with LSTM: each LSTM unit now takes four inputs (zf, zi, z, zo) computed from x1, x2, and carries its cell value across time steps (ct−1 → ct → ct+1).
Related work: [Cho, EMNLP’14] [Tomas Mikolov, ICLR’15]
(Figure: a network augmented with an external memory accessed through a Reading Head and a Writing Head.)
σ( W x + b ) = a, e.g. σ( [[1, −2], [−1, 1]] [1, −1]ᵀ + [1, 0]ᵀ ) = [0.98, 0.12]ᵀ
Why Deep? – Logic Circuits
• Two levels of basic logic gates can represent any Boolean function.
• However, no one uses two levels of logic gates to build computers.
• Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed).
Boosting combines weak classifiers into a strong one.
Deep Learning is analogous: the first layer acts as weak classifiers, the second layer combines them into boosted weak classifiers, and later layers boost them again.
Maxout: ReLU is a special case of Maxout
ReLU: z = wx + b, a = max(z, 0).
Maxout with two elements: z1 = wx + b and z2 = 0 (i.e. w′ = 0, b′ = 0), and a = max(z1, z2), which is exactly ReLU.
Maxout: ReLU is a special case of Maxout
In general the second element has its own parameters: z1 = wx + b, z2 = w′x + b′, and a = max(z1, z2). This gives a learnable activation function, determined by w, b, w′, b′.
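A minimal sketch of a two-piece maxout unit, showing that setting w′ = 0 and b′ = 0 recovers ReLU; the weight and bias values are made up for illustration.

```python
import numpy as np

def maxout(x, w, b, w2, b2):
    # Output is the maximum over the group of linear pieces.
    return np.maximum(w * x + b, w2 * x + b2)

x = np.linspace(-2, 2, 5)
w, b = 1.5, 0.5
print(maxout(x, w, b, 0.0, 0.0))            # identical to ReLU(w*x + b)
print(np.maximum(w * x + b, 0.0))           # ReLU for comparison
```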