
Deep Learning Tutorial

李宏毅
Hung-yi Lee
Deep learning
attracts lots of attention.
• Google Trends

Deep learning obtains many exciting results.


Among the talks this afternoon, this one will focus on the technical part.



Outline

Part I: Introduction of Deep Learning

Part II: Why Deep?

Part III: Tips for Training Deep Neural Network

Part IV: Neural Network with Memory


Part I:
Introduction of
Deep Learning

What people already knew in the 1980s


Example Application
• Handwriting Digit Recognition

Machine: image → "2"

Input: 16 x 16 = 256 pixels x1, x2, …, x256 (ink → 1, no ink → 0)
Output: y1, y2, …, y10, where each dimension represents the confidence of a digit,
e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), …, y10 = 0.2 ("is 0") → the image is "2"
Example Application
• Handwriting Digit Recognition

Machine: x = (x1, x2, …, x256) → y = (y1, y2, …, y10) → "2"

f: R^256 → R^10
In deep learning, the function f is represented by a neural network.
Element of Neural Network
Neuron: f: R^K → R

z = a1·w1 + a2·w2 + ⋯ + aK·wK + b
a = σ(z)

a1, …, aK: inputs   w1, …, wK: weights   b: bias   σ: activation function   a: output
Neural Network
Neurons are organized into layers:

x1, x2, …, xN → Layer 1 → Layer 2 → ⋯ → Layer L → y1, y2, …, yM
Input Layer        Hidden Layers                    Output Layer

Deep means many hidden layers.


Example of Neural Network
Input (1, −1): the first neuron has weights (1, −2) and bias 1, so z = 4 and σ(4) = 0.98;
the second neuron has weights (−1, 1) and bias 0, so z = −2 and σ(−2) = 0.12.

Sigmoid Function:  σ(z) = 1 / (1 + e^(−z))
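As a quick check of the numbers above, here is a minimal NumPy sketch of a single sigmoid neuron; the weights and biases are the ones from the first layer of this example.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a1*w1 + ... + aK*wK + b, output = sigmoid(z)
    return sigmoid(np.dot(a, w) + b)

# First layer of the running example, input (1, -1):
x = np.array([1.0, -1.0])
print(neuron(x, np.array([1.0, -2.0]), 1.0))   # ~0.98  (z = 4)
print(neuron(x, np.array([-1.0, 1.0]), 0.0))   # ~0.12  (z = -2)
```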
Example of Neural Network
Passing input (1, −1) through all three layers: (0.98, 0.12) → (0.86, 0.11) → output (0.62, 0.83).
Passing input (0, 0) through the same network: (0.73, 0.5) → (0.72, 0.12) → output (0.51, 0.85).

f: R^2 → R^2    f(1, −1) = (0.62, 0.83)    f(0, 0) = (0.51, 0.85)
Different parameters define different functions.
Matrix Operation
The first layer above is one matrix operation:

σ( [[1, −2], [−1, 1]] · [1, −1]ᵀ + [1, 0]ᵀ ) = σ( [4, −2]ᵀ ) = [0.98, 0.12]ᵀ
Neural Network
x → (W1, b1) → a1 → (W2, b2) → a2 → ⋯ → (WL, bL) → y

a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
⋯
y = σ(WL aL−1 + bL)
Neural Network
y = f(x) = σ( WL ⋯ σ( W2 σ( W1 x + b1 ) + b2 ) ⋯ + bL )

The whole network is a chain of matrix operations, so parallel computing techniques can be used to speed them up.
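A minimal sketch of this chain of matrix operations in NumPy; the 2-3-2 layer sizes and the random parameters are placeholders for illustration, not the network from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer = one matrix operation
    return a

# Hypothetical 2-3-2 network (random parameters, just to show the shapes).
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
bs = [np.zeros(3), np.zeros(2)]
print(forward(np.array([1.0, -1.0]), Ws, bs))
```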
Softmax
• Softmax layer as the output layer

Ordinary Layer:  y1 = σ(z1),  y2 = σ(z2),  y3 = σ(z3)
In general, the output of the network can be any value.
→ May not be easy to interpret
Softmax
• Softmax layer as the output layer

Probability-like outputs:  1 > y_i > 0,  Σ_i y_i = 1

Softmax Layer:  y_i = e^(z_i) / Σ_{j=1}^{3} e^(z_j)

Example:  z1 = 3  → e^3  ≈ 20    → y1 = 0.88
          z2 = 1  → e^1  ≈ 2.7   → y2 = 0.12
          z3 = −3 → e^−3 ≈ 0.05  → y3 ≈ 0
How to set network parameters
θ = {W1, b1, W2, b2, ⋯, WL, bL}

16 x 16 = 256 pixels (ink → 1, no ink → 0) → x1, …, x256 → network (with softmax output) → y1, …, y10

Set the network parameters θ such that ……
  Input an image of "1" → y1 should have the maximum value
  Input an image of "2" → y2 should have the maximum value
How do we let the neural network achieve this?
Training Data
• Preparing training data: images and their labels

“5” “0” “4” “1”

“9” “2” “1” “3”

Using the training data to find the network parameters.
Cost
Given a set of network parameters θ, each example has a cost value.

For an input image of "1", the network outputs y = (0.2, 0.3, …, 0.5) while the target is (1, 0, …, 0); the cost L(θ) measures the difference between them.
Cost can be the Euclidean distance or the cross entropy of the network output and the target.
Total Cost
For all training data x1, …, xR with targets ŷ1, …, ŷR and per-example costs L1(θ), …, LR(θ):

Total Cost:  C(θ) = Σ_{r=1}^{R} Lr(θ)

This measures how bad the network parameters θ are on this task.
Find the network parameters θ* that minimize this value.
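A small sketch of the total cost C(θ) = Σ_r Lr(θ) with cross entropy as the per-example cost; the two toy outputs and targets are made up for illustration.

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Cost of one example: -sum_i y_hat_i * log(y_i); y_hat is the 1-hot target.
    return -np.sum(y_hat * np.log(y + 1e-12))

def total_cost(outputs, targets):
    # C(theta) = sum_r L_r(theta) over all R training examples.
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(outputs, targets))

# Toy numbers: two examples with 3 classes each.
outputs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
targets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(total_cost(outputs, targets))
```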
Gradient Descent
Assume there are only two parameters w1 and w2 in the network: θ = {w1, w2}.
Error Surface: the colors represent the value of C.

 Randomly pick a starting point θ0
 Compute the negative gradient −∇C(θ0) at θ0, where ∇C(θ0) = ( ∂C(θ0)/∂w1, ∂C(θ0)/∂w2 )
 Times the learning rate η: move by −η∇C(θ0)
Gradient Descent
Repeat the same update from the new point: compute −∇C(θ1), move by −η∇C(θ1) to get θ2, and so on.
Eventually, we reach a minimum …
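A minimal sketch of the update θ^{t+1} = θ^t − η∇C(θ^t); the quadratic error surface C(w1, w2) = w1² + 2·w2² is a toy stand-in for the real cost.

```python
import numpy as np

def gradient_descent(grad_C, theta0, eta=0.1, steps=100):
    # theta^{t+1} = theta^t - eta * grad C(theta^t)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= eta * grad_C(theta)
    return theta

# Toy error surface C(w1, w2) = w1^2 + 2*w2^2, whose gradient is (2*w1, 4*w2).
grad_C = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])
print(gradient_descent(grad_C, theta0=[3.0, -2.0]))   # approaches (0, 0)
```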
Local Minima
• Gradient descent never guarantees reaching the global minimum.
Different initial points θ0 reach different minima, so different results.

"Who is Afraid of Non-Convex Loss Functions?"
http://videolectures.net/eml07_lecun_wia/
Besides local minima ……
 Very slow at the plateau: ∇C(θ) ≈ 0
 Stuck at a saddle point: ∇C(θ) = 0
 Stuck at a local minimum: ∇C(θ) = 0
In the physical world ……
• Momentum
A ball rolling down the cost surface does not stop where the slope is zero; it keeps moving because of its momentum. How about putting this phenomenon into gradient descent?

Momentum
Movement = Negative of Gradient + Momentum
Even where the gradient = 0, the accumulated momentum keeps the parameters moving.
Still no guarantee of reaching the global minimum, but it gives some hope ……
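A sketch of gradient descent with momentum on the same toy surface as before; the velocity v accumulates past negative gradients, and the coefficient beta = 0.9 is an assumed, typical value.

```python
import numpy as np

def gd_with_momentum(grad_C, theta0, eta=0.1, beta=0.9, steps=200):
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)                 # accumulated "velocity"
    for _ in range(steps):
        v = beta * v - eta * grad_C(theta)   # movement = momentum + negative gradient
        theta += v                           # keeps moving even where the gradient ~ 0
    return theta

grad_C = lambda th: np.array([2.0 * th[0], 4.0 * th[1]])  # same toy surface as before
print(gd_with_momentum(grad_C, theta0=[3.0, -2.0]))        # approaches (0, 0)
```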
Mini-batch
 Randomly initialize θ0
 Pick the 1st mini-batch (e.g. x1, x31, …):  C = L1 + L31 + ⋯,  θ1 ← θ0 − η∇C(θ0)
 Pick the 2nd mini-batch (e.g. x2, x16, …):  C = L2 + L16 + ⋯,  θ2 ← θ1 − η∇C(θ1)
 ……
C is different each time we update the parameters!
Mini-batch
Original Gradient Descent vs. With Mini-batch: the mini-batch path is unstable, but training is faster and often better!
(The colors represent the total C on all training data.)
Mini-batch
 Randomly initialize θ0
 Pick the 1st mini-batch:  C = C1 + C31 + ⋯,  θ1 ← θ0 − η∇C(θ0)
 Pick the 2nd mini-batch:  C = C2 + C16 + ⋯,  θ2 ← θ1 − η∇C(θ1)
 ……
 Until all mini-batches have been picked: one epoch

Repeat the above process.
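A sketch of the epoch loop described above, on a made-up one-parameter least-squares problem; the batch size, learning rate, and data are all placeholders.

```python
import random

def run_epochs(data, theta, grad_L, eta=0.01, batch_size=2, epochs=10):
    # One epoch = every mini-batch has been picked once; then repeat.
    data = list(data)
    for _ in range(epochs):
        random.shuffle(data)                        # mini-batches change every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # C for this update is the cost on this mini-batch only.
            g = sum(grad_L(theta, example) for example in batch)
            theta = theta - eta * g
    return theta

# Toy 1-D least squares: L = (w*x - y)^2, so dL/dw = 2*(w*x - y)*x.
grad_L = lambda w, ex: 2.0 * (w * ex[0] - ex[1]) * ex[0]
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]   # data from y = 2x
random.seed(0)
print(run_epochs(examples, theta=0.0, grad_L=grad_L))          # approaches 2.0
```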


Backpropagation
• A network can have millions of parameters.
• Backpropagation is the way to compute the gradients efficiently (not covered today).
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
• Many toolkits can compute the gradients automatically.
  Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
Part II:
Why Deep?
Deeper is Better?

Layer X Size | Word Error Rate (%)
1 X 2k  | 24.2
2 X 2k  | 20.4
3 X 2k  | 18.4
4 X 2k  | 17.8
5 X 2k  | 17.2
7 X 2k  | 17.1

Layer X Size | Word Error Rate (%)
1 X 3772 | 22.5
1 X 4634 | 22.6
1 X 16k  | 22.1

Not surprising: more parameters, better performance.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
Any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons).
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html

Then why a "deep" neural network and not a "fat" neural network?


Fat + Short v.s. Thin + Tall
Shallow (one wide hidden layer over x1, x2, …, xN) vs. Deep (many narrow hidden layers over the same inputs), with the same number of parameters: which one is better?

Layer X Size | Word Error Rate (%)
1 X 2k  | 24.2
2 X 2k  | 20.4
3 X 2k  | 18.4
4 X 2k  | 17.8
5 X 2k  | 17.2
7 X 2k  | 17.1

Layer X Size | Word Error Rate (%)
1 X 3772 | 22.5
1 X 4634 | 22.6
1 X 16k  | 22.1

With roughly the same number of parameters, the thin + tall (deep) networks achieve much lower error rates than the fat + short (shallow) ones.

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Why Deep?
• Deep → Modularization
Train four classifiers directly on images:
 Classifier 1: girls with long hair
 Classifier 2: boys with long hair (weak — only a few examples)
 Classifier 3: girls with short hair
 Classifier 4: boys with short hair
Why Deep?
• Deep → Modularization
Instead, first train basic classifiers for the attributes:
 Boy or Girl?
 Long or short hair?
Each basic classifier can have sufficient training examples.
Why Deep?
• Deep → Modularization
The basic classifiers (Boy or Girl? Long or short hair?) are shared by the following classifiers as modules:
 Classifier 1: girls with long hair
 Classifier 2: boys with long hair — can now be trained with little data
 Classifier 3: girls with short hair
 Classifier 4: boys with short hair
Why Deep?
• Deep → Modularization → Less training data?
Deep learning also works on small data sets like TIMIT.
The modularization is automatically learned from data: the 1st layer learns the most basic classifiers, the 2nd layer uses the 1st layer as modules to build classifiers, the 3rd layer uses the 2nd layer as modules, and so on.
SVM: hand-crafted kernel function → apply a simple classifier.
Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf

Deep Learning: a learnable kernel φ(x) (the hidden layers map x = (x1, …, xN) to φ(x)) → a simple classifier (the output layer producing y1, …, yM).

Hard to get the power of Deep …
Before 2006, deeper usually did not imply better.


Part III:
Tips for Training DNN
Recipe for Learning

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-
explained-in-a-single-powerpoint-slide/
Recipe for Learning
 Modify the network / use a better optimization strategy → for good results on the training data.
 Prevent overfitting → don't forget overfitting!

http://www.gizmodo.com.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/
Recipe for Learning
 Modify the network: new activation functions, for example, ReLU or Maxout
 Better optimization strategy: adaptive learning rates
 Prevent overfitting: dropout — only use this approach when you have already obtained good results on the training data.
Part III:
Tips for Training DNN
New Activation Function

ReLU
• Rectified Linear Unit (ReLU)
  a = z  if z > 0
  a = 0  if z ≤ 0
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids with different biases
4. Deals with the vanishing gradient problem
[Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
Vanishing Gradient Problem
In a deep network with sigmoid activations, the layers near the input have smaller gradients and learn very slowly, so they are still almost random while the layers near the output (with larger gradients) learn very fast and have already converged — converged based on nearly random features!?

In 2006, people used RBM pre-training. In 2015, people use ReLU.
Vanishing Gradient Problem
Intuitive way to compute the gradient:  ∂C/∂w ≈ ΔC/Δw
Perturb a weight w near the input by +Δw: each sigmoid squashes the change (a large input change gives only a small output change), so after many layers the effect on the output, and hence +ΔC, is tiny → smaller gradients for the earlier layers.
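A toy illustration of why the early layers see smaller gradients: in a one-neuron-per-layer chain (an assumption made purely to keep the arithmetic visible), the gradient is a product of one activation derivative per layer, and the sigmoid derivative is at most 0.25.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_through_layers(n_layers, activation_grad, w=1.0, z=1.0):
    # Gradient reaching the first layer of a 1-neuron-per-layer chain:
    # each layer multiplies it by w * activation'(z).
    g = 1.0
    for _ in range(n_layers):
        g *= w * activation_grad(z)
    return g

sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # at most 0.25
relu_grad    = lambda z: 1.0 if z > 0 else 0.0

for L in (2, 5, 10):
    print(L, grad_through_layers(L, sigmoid_grad), grad_through_layers(L, relu_grad))
# The sigmoid chain shrinks roughly like 0.2^L; the ReLU chain stays at 1.
```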
ReLU
a = z for z > 0, a = 0 for z ≤ 0
Neurons whose ReLU output is 0 can be removed from the network; the remaining active neurons form a thinner, linear network.
Because the active part is linear (slope 1), it does not have smaller gradients in the earlier layers.
Maxout — ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
The linear pre-activations are grouped, and each group outputs its maximum, e.g. in the first layer max(5, 7) = 7 and max(−1, 1) = 1; in the second layer max(1, 2) = 2 and max(3, 4) = 4.
You can have more than 2 elements in a group.


Maxout — ReLU is a special case of Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function of a maxout network can be any piecewise linear convex function.
• The number of pieces depends on how many elements are in a group (2 elements in a group → 2 pieces; 3 elements in a group → 3 pieces).
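A minimal sketch of a maxout layer: compute the linear units, then take the max within each group. The weights here are made-up placeholders, since the slide does not give the full parameter values.

```python
import numpy as np

def maxout(x, W, b, group_size=2):
    # z = W x + b, then take the max within each group of `group_size` elements.
    z = W @ x + b
    return z.reshape(-1, group_size).max(axis=1)

# Hypothetical layer: 4 linear units grouped in pairs -> 2 outputs.
W = np.array([[1.0, 2.0], [-3.0, 1.0], [0.5, 0.5], [2.0, -1.0]])
b = np.zeros(4)
print(maxout(np.array([1.0, -1.0]), W, b))   # [max(-1, -4), max(0, 3)] = [-1, 3]
```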


Part III:
Tips for Training DNN
Adaptive Learning Rate
Learning Rate — set the learning rate η carefully
 If the learning rate is too large, the cost may not decrease after each update.
 If the learning rate is too small, training will be too slow.
Can we give different parameters different learning rates?
Adagrad
Original Gradient Descent:  θ^t ← θ^(t−1) − η ∇C(θ^(t−1))
Each parameter w is considered separately:

  w^(t+1) ← w^t − η_w · g^t,   where  g^t = ∂C(θ^t)/∂w

Parameter-dependent learning rate:

  η_w = η / sqrt( Σ_{i=0}^{t} (g^i)^2 )

η is a constant; the denominator is the summation of the squares of the previous derivatives.

Adagrad — example
w1: g^0 = 0.1, g^1 = 0.2, … → learning rates  η / sqrt(0.1^2) = η / 0.1,  then  η / sqrt(0.1^2 + 0.2^2) = η / 0.22, …
w2: g^0 = 20.0, g^1 = 10.0, … → learning rates  η / sqrt(20^2) = η / 20,  then  η / sqrt(20^2 + 10^2) = η / 22, …

Observations:
1. The learning rate gets smaller and smaller for all parameters.
2. Smaller derivatives → larger learning rate, and vice versa.  Why?
Why?
Along a steep direction (larger derivatives) we want a smaller learning rate; along a flat direction (smaller derivatives) we want a larger learning rate. Dividing by the accumulated squared derivatives gives exactly this behaviour: larger derivatives → smaller learning rate, smaller derivatives → larger learning rate.
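A minimal Adagrad sketch following the update above; the toy cost C = 0.01·w1² + 100·w2² is chosen (as an assumption) so the two parameters have derivatives of very different scales.

```python
import numpy as np

def adagrad(grad_C, theta0, eta=1.0, steps=100, eps=1e-8):
    theta = np.array(theta0, dtype=float)
    sum_sq = np.zeros_like(theta)              # sum of squared past derivatives, per parameter
    for _ in range(steps):
        g = grad_C(theta)
        sum_sq += g ** 2
        theta -= eta / (np.sqrt(sum_sq) + eps) * g   # parameter-dependent learning rate
    return theta

# Toy surface with very different scales per parameter: C = 0.01*w1^2 + 100*w2^2.
grad_C = lambda th: np.array([0.02 * th[0], 200.0 * th[1]])
print(adagrad(grad_C, theta0=[10.0, 10.0]))
# Both parameters head toward 0 at a similar pace despite very different gradient scales.
```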
Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop
• https://www.youtube.com/watch?v=O3sxAc4hxZU

• Adadelta [Matthew D. Zeiler, arXiv’12]


• Adam [Diederik P. Kingma, ICLR’15]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
Part III:
Tips for Training DNN
Dropout
Dropout
Training: pick a mini-batch, then update  θ^t ← θ^(t−1) − η∇C(θ^(t−1))
 Each time before computing the gradients:
   Each neuron has p% chance to drop out → the structure of the network is changed (thinner!)
   Use the new, thinner network for training
For each mini-batch, we resample the dropout neurons.
Dropout
Testing:
 No dropout
 If the dropout rate at training is p%, multiply all the weights by (1 − p)%
 E.g. assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
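A sketch of both halves of this recipe: random dropout masks during training, and weight scaling by (1 − p) at test time. The dropout rate p = 0.5 matches the example above; everything else is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p=0.5):
    # Training: each neuron is dropped with probability p (resampled per mini-batch).
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test_weights(W, p=0.5):
    # Testing: no dropout, but all weights are multiplied by (1 - p),
    # so the expected pre-activation roughly matches training (z' ~= z).
    return W * (1.0 - p)

a = np.ones(8)
print(dropout_train(a))                        # roughly half the activations are zeroed
print(dropout_test_weights(np.ones((2, 8))))   # every weight scaled to 0.5
```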
Dropout - Intuitive Reason
"My partner will slack off, so I have to work hard myself."

 When working in a team, if everyone expects their partner to do the work, nothing gets done in the end.
 However, if you know your partner might drop out, you will do better.
 When testing, no one actually drops out, so the results are good.
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1 − p)% (p% = dropout rate) when testing?
Training of Dropout (dropout rate 50%): z is computed from w1, …, w4 with about half of the inputs dropped.
Testing of Dropout (no dropout): using the trained weights directly gives z′ ≈ 2z.
Multiplying the weights by (1 − p)% (0.5 × w1, …, 0.5 × w4) restores z′ ≈ z.
Dropout is a kind of ensemble.
Ensemble: split the training set into several sets (Set 1–4), and train a bunch of networks with different structures (Network 1–4).


Dropout is a kind of ensemble.
Ensemble testing: feed the testing data x to Network 1–4, obtain y1, y2, y3, y4, and average them.
Dropout is a kind of ensemble.
Training of Dropout: with M neurons there are 2^M possible thinned networks. Each mini-batch (mini-batch 1, 2, 3, 4, …) trains one of these networks, and some parameters in the networks are shared.
Dropout is a kind of ensemble.
Testing of Dropout: ideally we would feed the testing data x to all the thinned networks, obtain y1, y2, y3, …, and average them. Multiplying all the weights by (1 − p)% approximates this average: the single network's output ≈ y.
More about dropout
• More references for dropout: [Nitish Srivastava, JMLR'14] [Pierre Baldi, NIPS'13] [Geoffrey E. Hinton, arXiv'12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
• Dropconnect [Li Wan, ICML'13]
  • Dropout deletes neurons
  • Dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT'14]
  • The dropout rate decreases over epochs
• Standout [J. Ba, NIPS'13]
  • Each neuron has a different dropout rate
Part IV:
Neural Network
with Memory
Neural Network needs Memory
• Named Entity Recognition
• Detecting named entities like names of people, locations, organizations, etc. in a sentence.

Input: the word "apple" encoded as a vector → DNN → output: people 0.1, location 0.1, organization 0.5, none 0.3
Neural Network needs Memory
• Named Entity Recognition
In "the president of apple eats an apple", the first "apple" should be tagged ORG and the second NONE, but a DNN applied word by word (x1 … x7 → y1 … y7) sees exactly the same input for both occurrences.
The DNN needs memory!
Recurrent Neural Network (RNN)
The outputs of the hidden layer (a1, a2) are stored in the memory.
The memory is copied back and can be considered as another input at the next time step.
RNN
At each time step, the input x^t goes through the input weights Wi, the memory a^(t−1) comes back through the recurrent weights Wh, the new hidden output a^t is copied into the memory, and the output y^t is produced through the output weights Wo.
The same network (the same Wi, Wh, Wo) is used again and again.
Output y_i depends on x1, x2, ……, x_i.
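A minimal sketch of this loop in NumPy; tanh as the hidden activation, and all the sizes and random parameters, are assumptions for illustration rather than anything fixed by the slides.

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, Wo, a0):
    # The same Wi, Wh, Wo are used at every time step;
    # the hidden output a_t is stored and fed back as memory at t+1.
    a, ys = a0, []
    for x in xs:
        a = np.tanh(Wi @ x + Wh @ a)   # assumption: tanh as the hidden activation
        ys.append(Wo @ a)              # y_t depends on x_1, ..., x_t through a
    return ys

# Hypothetical sizes: 3-dim input, 4-dim memory, 2-dim output.
rng = np.random.default_rng(0)
Wi, Wh, Wo = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
xs = [rng.normal(size=3) for _ in range(5)]
ys = rnn_forward(xs, Wi, Wh, Wo, a0=np.zeros(4))
print(len(ys), ys[0].shape)
```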


RNN — How to train?
Each output y^1, y^2, y^3, … is compared with its target ŷ^1, ŷ^2, ŷ^3, …, giving per-step costs L1, L2, L3, …
Find the network parameters to minimize the total cost Σ_t L_t:
Backpropagation through time (BPTT)
Of course it can be deep …
Several recurrent hidden layers can be stacked between x^t and y^t, each with its own memory across time steps (t, t+1, t+2, …).
Bidirectional RNN
One RNN reads the input sequence x^t, x^(t+1), x^(t+2), … forward and another reads it backward; the output y^t is produced from the hidden states of both directions.
Many to Many (Output is shorter)
• Both the input and output are sequences, but the output is shorter.
• E.g. Speech Recognition
Input: a sequence of acoustic feature vectors; Output: "好棒" (character sequence)
Trimming repeated outputs (好好好棒棒棒棒棒 → 好棒) has a problem: why can't the answer be "好棒棒"?
Many to Many (Output is shorter)
• Both the input and output are sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06] [Alex Graves, ICML'14] [Haşim Sak, Interspeech'15] [Jie Li, Interspeech'15] [Andrew Senior, ASRU'15]
Add an extra symbol "φ" representing "null":
  好 φ φ 棒 φ φ φ φ  →  "好棒"
  好 φ φ 棒 φ 棒 φ φ  →  "好棒棒"
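A small sketch of the CTC output rule described above: merge repeated symbols, then remove the null symbol φ.

```python
def ctc_collapse(path, blank="φ"):
    # Merge repeated symbols, then drop the blank "φ".
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse(list("好φφ棒φφφφ")))   # -> 好棒
print(ctc_collapse(list("好φφ棒φ棒φφ")))  # -> 好棒棒
```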
Many to Many (No Limitation)
• Both the input and output are sequences with different lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
An RNN reads "machine", then "learning"; its final hidden state contains all the information about the input sequence.
Many to Many (No Limitation)
• Both the input and output are sequences with different lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
A decoder can then generate the output one character at a time: 機 器 學 習 慣 性 …… — but it doesn't know when to stop.


Many to Many (No Limitation)

推 tlkagk: =========斷==========
Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%
E6%8E%A8%E6%96%87 (鄉民百科)
Many to Many (No Limitation)
• Both the input and output are sequences with different lengths. → Sequence to sequence learning
• E.g. Machine Translation (machine learning → 機器學習)
Add a symbol "===" (斷, "stop") to the output vocabulary so the decoder learns when to stop: machine learning → 機 器 學 習 ===
[Ilya Sutskever, NIPS'14] [Dzmitry Bahdanau, arXiv'15]
(Thanks to classmate 曾柏翔 for providing the experimental results.)

Unfortunately ……
• RNN-based networks are not always easy to learn.
Real experiments on language modeling: the cost sometimes jumps around wildly during training; sometimes we are just lucky.
The error surface is rough: it is either very flat or very steep. Clipping the gradient when it explodes helps. [Razvan Pascanu, ICML'13]


Why?
Toy example: a single recurrent weight w, input 1 at the first time step and 0 afterwards, so y^1000 = w^999.
  w = 1    → y^1000 = 1
  w = 1.01 → y^1000 ≈ 20000   large gradient → small learning rate?
  w = 0.99 → y^1000 ≈ 0
  w = 0.01 → y^1000 ≈ 0       small gradient → large learning rate?
The same parameter w is multiplied in at every time step, so the gradient either explodes or vanishes.
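A sketch of the gradient clipping mentioned above, applied to the toy example's gradient dy^1000/dw = 999·w^998; the threshold of 5 is an arbitrary assumed value.

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    # If the gradient norm explodes (the steep "cliff"), rescale it to the threshold.
    norm = np.linalg.norm(g)
    return g * (threshold / norm) if norm > threshold else g

# Toy example from the slide: y_1000 = w^999, so dy/dw = 999 * w^998.
for w in (0.99, 1.0, 1.01):
    g = np.array([999.0 * w ** 998])
    print(w, g[0], clip_gradient(g)[0])
# w = 0.99 gives a small gradient (vanishing); w = 1.01 gives a huge one (exploding, clipped).
```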
Helpful Techniques
• Nesterov's Accelerated Gradient (NAG)
  • An advanced momentum method
• RMSProp
  • An advanced approach to giving each parameter a different learning rate
  • Considering the change of the second derivatives
• Long Short-term Memory (LSTM)
  • Can deal with gradient vanishing (not gradient explosion)
Long Short-term Memory (LSTM)
A special neuron with 4 inputs and 1 output:
 Input Gate: a signal (from other parts of the network) controls whether the input is written into the memory cell
 Forget Gate: a signal controls whether the memory cell keeps or forgets its stored value
 Output Gate: a signal controls whether the stored value is output to the other parts of the network
LSTM cell
  c′ = g(z)·f(z_i) + c·f(z_f)
  a  = h(c′)·f(z_o)
z: cell input; z_i, z_f, z_o: inputs to the input, forget, and output gates; c and c′: memory cell value before and after the update.
The gate activation function f is usually a sigmoid: its value between 0 and 1 mimics an open or closed gate.
Original Network: simply replace the neurons with LSTM
In the original network each neuron computes one z from the inputs x1, x2; with LSTM, each unit needs four such linear combinations (z, z_i, z_f, z_o), one for the cell input and one for each gate → 4 times the number of parameters.


Extension: "peephole"
LSTM unrolled over time: at each step, the four inputs z_f, z_i, z, z_o are computed from the current input x^t, the previous output h^(t−1), and (with the "peephole" connections) the previous cell value c^(t−1); the cell is updated (× forget, + write, × output) to c^t and produces y^t.

Other Simpler Alternatives
• Gated Recurrent Unit (GRU) [Cho, EMNLP'14]
• Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]
• Vanilla RNN initialized with the identity matrix + ReLU activation function [Quoc V. Le, arXiv'15]
   Outperforms or is comparable with LSTM on 4 different tasks
What is the next wave?
• Attention-based Model
A controller (DNN/LSTM) maps input x to output y while accessing an internal memory (or information from the output) through reading heads and writing heads.
Already applied to speech recognition, caption generation, QA, visual QA.
What is the next wave?
• Attention-based Model
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus.
arXiv Pre-Print, 2015.
• Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv Pre-Print,
2014
• Ask Me Anything: Dynamic Memory Networks for Natural Language Processing.
Kumar et al. arXiv Pre-Print, 2015
• Neural Machine Translation by Jointly Learning to Align and Translate. D.
Bahdanau, K. Cho, Y. Bengio; International Conference on Representation
Learning 2015.
• Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu et al. arXiv Pre-Print, 2015.
• Attention-Based Models for Speech Recognition. Jan Chorowski, Dzmitry
Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, Yoshua Bengio. arXiv Pre-Print,
2015.
• Recurrent Models of Visual Attention. V. Mnih, N. Heess, A. Graves and K. Kavukcuoglu. In NIPS, 2014.
• A Neural Attention Model for Abstractive Sentence Summarization. A. M. Rush,
S. Chopra and J. Weston. EMNLP 2015.
Concluding Remarks
• Introduction of deep learning
• Discussing some reasons for using deep learning
• New techniques for deep learning
  • ReLU, Maxout
  • Giving all the parameters different learning rates
  • Dropout
• Networks with memory
  • Recurrent neural network
  • Long short-term memory (LSTM)
Reading Materials
• “Neural Networks and Deep Learning”
• written by Michael Nielsen
• http://neuralnetworksanddeeplearning.com/
• “Deep Learning” (not finished yet)
• Written by Yoshua Bengio, Ian J. Goodfellow and
Aaron Courville
• http://www.iro.umontreal.ca/~bengioy/dlbook/
Thank you
for your attention!
Acknowledgement
• Thanks to Ryan Sun for writing in to point out the typos on the slides.
Appendix
Matrix Operation
σ( W x + b ) = a

σ( [[1, −2], [−1, 1]] · [1, −1]ᵀ + [1, 0]ᵀ ) = σ( [4, −2]ᵀ ) = [0.98, 0.12]ᵀ
Why Deep? – Logic Circuits
• Two levels of basic logic gates can represent any Boolean function.
• However, no one uses two levels of logic gates to build computers.
• Using multiple layers of logic gates to build some functions is much simpler (fewer gates needed).
Boosting
Input x → weak classifier 1, weak classifier 2, …… → combine

Deep Learning
x1, x2, …, xN → weak classifiers → boosted weak classifiers → boosted boosted weak classifiers → ……
Maxout — ReLU is a special case of Maxout
ReLU:    z = wx + b,  a = max(z, 0)
Maxout:  z1 = wx + b,  z2 = 0 (the second element's weight and bias fixed to 0),  a = max(z1, z2)
With this choice, the maxout unit computes exactly the ReLU activation.
Maxout — ReLU is a special case of Maxout
If the second element is also learnable, z1 = wx + b and z2 = w′x + b′, then a = max(z1, z2) becomes a learnable (piecewise linear) activation function.
