
Deep Learning I

Supervised Learning

Russ Salakhutdinov

Machine Learning Department


Carnegie Mellon University
Canadian Institute for Advanced Research
Mining for Structure
Massive increase in both computational power and the amount of
data available from the web, video cameras, and laboratory measurements.

[Figure: example domains — Images & Video, Text & Language, Speech & Audio,
Gene Expression, Relational Data / Social Networks, Product Recommendation,
fMRI, Tumor regions]

• Develop statistical models that can discover underlying structure, cause, or
statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.
Mining for Structure
Massive increase in both computational power and the amount of
data available from the web, video cameras, and laboratory measurements.

Deep Learning: models that support inferences and discover
structure at multiple levels.

• Develop statistical models that can discover underlying structure, cause, or
statistical correlation from data in an unsupervised or semi-supervised way.
• Multiple application domains.
Impact of Deep Learning
• Speech Recognition

• Computer Vision

• Recommender Systems

• Language Understanding

• Drug Discovery and Medical Image Analysis
Deep Generative Model
Model P(document). Reuters dataset: 804,414 newswire stories, modeled from
bag-of-words input in an unsupervised way.

[Figure: 2-D embedding of documents; clusters correspond to topics such as
European Community Monetary/Economic, Interbank Markets, Energy Markets,
Disasters and Accidents, Legal/Judicial, Leading Economic Indicators,
Government Borrowings, Accounts/Earnings]

(Hinton & Salakhutdinov, Science 2006)


Example: Understanding Images
TAGS:
strangers, coworkers, conventioneers, attendants, patrons

Nearest Neighbor Sentence:
people taking pictures of a crazy person

Model Samples
• a group of people in a crowded area.
• a group of people are walking and talking.
• a group of people, standing around and talking.

Caption Generation
Talk Roadmap
Part 1: Supervised Learning: Deep Networks
• Definition of Neural Networks
• Training Neural Networks
• Recent Optimization / Regularization Techniques

Part 2: Unsupervised Learning: Learning Deep Generative Models

Part 3: Open Research Questions


Learning Feature Representations
[Figure: raw pixels (pixel 1, pixel 2) fed directly into a learning algorithm;
Segway vs. Non-Segway examples are hard to separate in the input space.]

Learning Feature Representations
[Figure: a feature representation (Handle, Wheel) is extracted first and fed to
the learning algorithm; in the feature space, Segway vs. Non-Segway examples
become separable.]
Traditional Approaches
[Diagram: Data → Feature extraction → Learning algorithm.
Image → vision features → object detection / recognition.
Audio → audio features → audio classification / speaker identification.]
Computer Vision Features
[Examples of hand-designed features: SIFT, HoG, Textons, RIFT, GIST]

Audio Features
[Examples of hand-designed features: Spectrogram, MFCC, Flux, ZCR, Rolloff]

Representation Learning:
Can we automatically learn these representations?
Neural Networks Online Course
• Disclaimer: Some of the material and slides for this lecture were
borrowed from Hugo Larochelle’s class on Neural Networks:
https://sites.google.com/site/deeplearningsummerschool2016/

• Hugo’s class covers many other topics: convolutional networks,
neural language model, Boltzmann machines, autoencoders, sparse
coding, etc.

• We will use his material for some of the other lectures.
Feedforward Neural Networks
‣ Definition of Neural Networks
- Forward propagation
- Types of units
- Capacity of neural networks

‣ How to train neural nets:
- Loss function
- Backpropagation with gradient descent

‣ More recent techniques:
- Dropout
- Batch normalization
- Unsupervised Pre-training
Artificial Neuron
• Neuron pre-activation (or input activation):

  a(x) = b + Σ_i w_i x_i = b + w^T x

• Neuron (output) activation:

  h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

• w are the connection weights
• b is the neuron bias
• g(·) is called the activation function
Artificial Neuron
• Output activation of the neuron:

  h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

• The range of h(x) is determined by g(·)
• The bias b only changes the position (shift) of the activation

(from Pascal Vincent’s slides)
Activation Function
• Sigmoid activation function:

  g(a) = sigm(a) = 1 / (1 + exp(-a))

Ø Squashes the neuron’s output between 0 and 1
Ø Always positive
Ø Bounded
Ø Strictly increasing

• Hyperbolic tangent (tanh) activation function:

  g(a) = tanh(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)) = (exp(2a) - 1) / (exp(2a) + 1)
Activation Function
• Rectified linear (ReLU) activation function:

  g(a) = reclin(a) = max(0, a)

Ø Bounded below by 0 (always non-negative)
Ø Not upper bounded
Ø Strictly increasing
Ø Tends to produce units with sparse activities
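The following NumPy sketch (not part of the original slides; the function names are my own) shows what these activation functions compute:

import numpy as np

def sigmoid(a):
    # g(a) = 1 / (1 + exp(-a)): squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def tanh_act(a):
    # g(a) = (exp(a) - exp(-a)) / (exp(a) + exp(-a)): squashes values into (-1, 1)
    return np.tanh(a)

def relu(a):
    # g(a) = max(0, a): non-negative, unbounded above, gives sparse activations
    return np.maximum(0.0, a)

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(a), tanh_act(a), relu(a))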
Single Hidden Layer Neural Net
• Hidden layer pre-activation:

  a(x) = b^(1) + W^(1) x        ( a(x)_i = b^(1)_i + Σ_j W^(1)_{i,j} x_j )

• Hidden layer activation:

  h(x) = g(a(x))

• Output layer activation:

  f(x) = o( b^(2) + (w^(2))^T h^(1)(x) )

  where o(·) is the output activation function.
Multilayer Neural Net
• For multi-class classification, the output layer uses the softmax activation
function, which gives an estimate of p(y = c | x):

  o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c), ..., exp(a_C) / Σ_c exp(a_c) ]^T

• Consider a network with L hidden layers.

- layer pre-activation for k > 0 (with h^(0)(x) = x):

  a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)

- hidden layer activation (k from 1 to L):

  h^(k)(x) = g(a^(k)(x))

- output layer activation (k = L+1):

  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
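A minimal NumPy sketch of this forward-propagation recursion with a softmax output layer; the layer sizes, helper names, and random initialization below are illustrative assumptions, not taken from the slides:

import numpy as np

def softmax(a):
    # o(a)_c = exp(a_c) / sum_c' exp(a_c'); subtract the max for numerical stability
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh):
    # weights/biases hold W^(1..L+1), b^(1..L+1); h^(0)(x) = x
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h            # pre-activation a^(k)(x)
        h = g(a)                 # hidden activation h^(k)(x)
    return softmax(biases[-1] + weights[-1] @ h)   # f(x) = o(a^(L+1)(x))

rng = np.random.default_rng(0)
sizes = [4, 5, 5, 3]             # input, two hidden layers, 3 output classes
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.normal(size=4), weights, biases))   # a probability vector (sums to 1)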
Capacity of Neural Nets
• Consider a single layer neural network

[Figure (from Pascal Vincent’s slides): a small network with inputs x1, x2, two
hidden units y1, y2 (plus bias units), and one output z, together with the
decision surface it computes; the output is z = +1 in region R1 and z = -1 in
region R2.]
Capacity of Neural Nets
• Consider a single layer neural network

(from Pascal Vincent’s slides)


Universal Approximation
• Universal Approximation Theorem (Hornik, 1991):

- “a single hidden layer neural network with a linear output
unit can approximate any continuous function arbitrarily well,
given enough hidden units’’

• This applies for sigmoid, tanh, and many other activation functions.

• However, this does not mean that there is a learning algorithm that
can find the necessary parameter values.
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks

‣ How to train neural nets:
- Loss function
- Backpropagation with gradient descent

‣ More recent techniques:
- Dropout
- Batch normalization
- Unsupervised Pre-training
Training
• Training set: D^train = {(x^(t), y^(t))}
• Model: f(x; θ); validation set D^valid, test set D^test

• Empirical Risk Minimization:

  arg min_θ  (1/T) Σ_t l(f(x^(t); θ), y^(t)) + λ Ω(θ)

  where l(·,·) is the loss function and Ω(θ) is the regularizer.

• Learning is cast as optimization.

Ø For classification problems, we would like to minimize classification error.

Ø The loss function can sometimes be viewed as a surrogate for what we actually
want to optimize (e.g. an upper bound).
Stochastic Gradient Descent
• Perform updates after seeing each example:

- Initialize: θ ≡ {W^(1), b^(1), ..., W^(L+1), b^(L+1)}
- For t = 1 : T
  - for each training example (x^(t), y^(t)):

    ∆ = -∇_θ l(f(x^(t); θ), y^(t)) - λ ∇_θ Ω(θ)
    θ ← θ + α ∆

• Training epoch = iteration over all examples.

• To train a neural net, we need:

Ø A loss function: l(f(x^(t); θ), y^(t))
  - for classification, with f(x)_c = p(y = c | x), the cross-entropy loss:
    l(f(x), y) = -Σ_c 1_(y=c) log f(x)_c = -log f(x)_y

Ø A procedure to compute gradients: ∇_θ l(f(x^(t); θ), y^(t))

Ø The regularizer and its gradient: Ω(θ), ∇_θ Ω(θ)
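To make the recipe concrete, here is a small NumPy sketch (my own illustration, not from the slides) of SGD with backpropagation for a single-hidden-layer network with a tanh hidden layer, softmax output, cross-entropy loss, and an L2 regularizer; all names and hyper-parameter values are arbitrary choices:

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sgd_train(X, y, n_hidden=10, n_classes=2, alpha=0.1, lam=1e-4, n_epochs=20, seed=0):
    # Single-hidden-layer net: tanh hidden units, softmax output,
    # cross-entropy loss l = -log f(x)_y, L2 regularizer on the weights.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0, 0.1, (n_hidden, d)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_classes, n_hidden)); b2 = np.zeros(n_classes)
    for epoch in range(n_epochs):              # one epoch = one pass over the data
        for x, t in zip(X, y):
            # forward propagation
            h1 = np.tanh(b1 + W1 @ x)
            f = softmax(b2 + W2 @ h1)          # f(x)_c = p(y = c | x)
            # backpropagation of the loss -log f(x)_t
            grad_a2 = f.copy(); grad_a2[t] -= 1.0          # gradient w.r.t. output pre-activation
            grad_W2 = np.outer(grad_a2, h1) + lam * W2
            grad_a1 = (W2.T @ grad_a2) * (1.0 - h1 ** 2)   # tanh'(a) = 1 - tanh(a)^2
            grad_W1 = np.outer(grad_a1, x) + lam * W1
            # SGD update: theta <- theta - alpha * gradient
            W2 -= alpha * grad_W2; b2 -= alpha * grad_a2
            W1 -= alpha * grad_W1; b1 -= alpha * grad_a1
    return W1, b1, W2, b2

# usage on a toy dataset (XOR-like labels)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])
W1, b1, W2, b2 = sgd_train(X, y, n_hidden=8, alpha=0.5, n_epochs=2000)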
Computational Flow Graph
• Forward propagation can be represented as an acyclic flow graph

• Forward propagation can be implemented in a modular way:

Ø Each box can be an object with an fprop method, that computes
the value of the box given its children

Ø Calling the fprop method of each box in the right order yields
forward propagation

Computational Flow Graph
• Each object also has a bprop method

- it computes the gradient of the loss with respect to each child box.

• By calling bprop in the reverse order, we obtain backpropagation
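A minimal sketch of this modular idea (the class names Linear and Sigmoid are hypothetical, not from the slides): each box caches its input during fprop and returns the gradient with respect to its child in bprop:

import numpy as np

class Linear:
    # box computing a = W x + b
    def __init__(self, W, b):
        self.W, self.b = W, b
    def fprop(self, x):
        self.x = x                        # cache the child value for bprop
        return self.W @ x + self.b
    def bprop(self, grad_out):
        # gradients w.r.t. the parameters and w.r.t. the child x
        self.grad_W = np.outer(grad_out, self.x)
        self.grad_b = grad_out
        return self.W.T @ grad_out

class Sigmoid:
    def fprop(self, a):
        self.h = 1.0 / (1.0 + np.exp(-a))
        return self.h
    def bprop(self, grad_out):
        return grad_out * self.h * (1.0 - self.h)   # sigm'(a) = sigm(a)(1 - sigm(a))

# forward: call fprop in order; backward: call bprop in reverse order
rng = np.random.default_rng(0)
boxes = [Linear(rng.normal(size=(3, 4)), np.zeros(3)), Sigmoid()]
out = np.array([1.0, 2.0, 3.0, 4.0])
for box in boxes:
    out = box.fprop(out)
grad = np.ones_like(out)                  # pretend the loss gradient w.r.t. the output is 1
for box in reversed(boxes):
    grad = box.bprop(grad)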
Model Selection
• Training Protocol:

- Train your model on the Training Set D^train = {(x^(t), y^(t))}

- For model selection, use the Validation Set D^valid

Ø Hyper-parameter search: hidden layer size, learning rate,
number of iterations/epochs, etc.

- Estimate generalization performance using the Test Set D^test

• Generalization is the behavior of the model on unseen examples.
Early Stopping
• To select the number of epochs, stop training when validation set
error increases (with some look ahead).
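A sketch of this rule using a "patience" (look-ahead) counter; train_one_epoch and valid_error stand for user-supplied routines and are assumptions, not part of the slides:

def train_with_early_stopping(train_one_epoch, valid_error, max_epochs=200, patience=10):
    # Stop once validation error has not improved for `patience` epochs,
    # and keep the parameters from the best epoch seen so far.
    best_err, best_params, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        params = train_one_epoch()             # one pass of SGD over the training set
        err = valid_error(params)              # error on the validation set
        if err < best_err:
            best_err, best_params, epochs_since_best = err, params, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # validation error kept increasing
                break
    return best_params, best_err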


Mini-batch, Momentum
• Make updates based on a mini-batch of examples (instead of a single example):

Ø the gradient is the average regularized loss for that mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient

• Use a decaying learning rate α_t (e.g. α_t = α / (1 + δt)); convergence
requires Σ_t α_t = ∞ and Σ_t α_t² < ∞.

• Momentum: can use an exponential average of previous gradients:

  ∇̄_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇̄_θ^(t-1)

Ø can get past plateaus more quickly, by ‘‘gaining momentum’’
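A sketch of the mini-batch update with momentum; grad_fn is an assumed user-supplied function that returns the average regularized gradient over a mini-batch, and the hyper-parameter values are illustrative:

import numpy as np

def minibatch_sgd_momentum(theta, grad_fn, data, alpha=0.01, beta=0.9,
                           batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    velocity = np.zeros_like(theta)             # exponential average of past gradients
    n = len(data)
    for epoch in range(n_epochs):
        order = rng.permutation(n)              # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            grad = grad_fn(theta, batch)        # average regularized gradient on the mini-batch
            velocity = grad + beta * velocity   # momentum accumulates previous gradients
            theta = theta - alpha * velocity
    return theta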
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks

‣ How to train neural nets:
- Loss function
- Backpropagation with gradient descent

‣ More recent techniques:
- Dropout
- Batch normalization
- Unsupervised Pre-training
Learning Distributed Representations
• Deep learning is research on learning models with multilayer representations

Ø multilayer (feed-forward) neural networks
Ø multilayer graphical models (deep belief network, deep Boltzmann machine)

• Each layer learns a ‘‘distributed representation’’

Ø Units in a layer are not mutually exclusive
• each unit is a separate feature of the input
• two units can be ‘‘active’’ at the same time
Ø Units do not correspond to a partitioning (clustering) of the inputs
• in clustering, an input can only belong to a single cluster
Local vs. Distributed Representations
• Local methods: Clustering, Nearest Neighbors, RBF SVM, local density estimators

Ø Parameters for each local region.
Ø # of regions is linear with # of parameters.

• Distributed methods: RBMs, Factor models, PCA, Sparse Coding, Deep models

Ø Each parameter affects many regions, not just local.
Ø # of regions grows (roughly) exponentially in # of parameters.

[Figure: learned prototypes split the input space into local regions C1, C2, C3;
a distributed representation instead uses binary features C1, C2, C3 whose
combinations (e.g. C1=1, C2=1, C3=0) define exponentially many regions.]

(Bengio, 2009, Foundations and Trends in Machine Learning)
Inspiration from Visual Cortex
Why Training is Hard
• First hypothesis: Hard optimization problem (underfitting)

Ø vanishing gradient problem
Ø saturated units block gradient propagation

• This is a well-known problem in recurrent neural networks
Why Training is Hard
• Second hypothesis: Overfitting

Ø we are exploring a space of complex functions


Ø deep nets usually have lots of parameters

• Might be in a high variance / low bias situation


Why Training is Hard
• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second-order
methods such as KFAC)
Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training
Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Unsupervised Pre-training
• Initialize hidden layers using unsupervised learning

Ø Force network to represent latent structure of input distribution

Ø Encourage hidden layers to encode that structure


Unsupervised Pre-training
• Initialize hidden layers using unsupervised learning

Ø This is a harder task than supervised learning (classification)

Ø Hence we expect less overfitting


Autoencoders: Preview
• Feed-forward neural network trained to reproduce its input at the output layer

• Encoder:

  h(x) = g(a(x)) = sigm(b + W x)

• Decoder:

  x̂ = o(â(x)) = sigm(c + W* h(x))

Autoencoders: Preview
• Loss function for binary inputs:

  l(f(x)) = - Σ_k ( x_k log(x̂_k) + (1 - x_k) log(1 - x̂_k) )

Ø cross-entropy error function

• Loss function for real-valued inputs:

  l(f(x)) = (1/2) Σ_k (x̂_k - x_k)²

Ø sum of squared differences
Ø we use a linear activation function at the output
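A minimal NumPy sketch of such an autoencoder for binary inputs, with a sigmoid encoder and decoder and the cross-entropy loss; tying the decoder weights to the encoder (W* = W^T) is an extra assumption made here for brevity:

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_step(x, W, b, c, alpha=0.1):
    # Encoder: h(x) = sigm(b + W x); decoder with tied weights: x_hat = sigm(c + W.T h)
    h = sigm(b + W @ x)
    x_hat = sigm(c + W.T @ h)
    loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
    # Gradients of the cross-entropy reconstruction loss
    grad_a_hat = x_hat - x                          # w.r.t. decoder pre-activation
    grad_a = (W @ grad_a_hat) * h * (1 - h)         # w.r.t. encoder pre-activation
    grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)   # two paths through the tied W
    W -= alpha * grad_W                             # in-place SGD update
    b -= alpha * grad_a
    c -= alpha * grad_a_hat
    return loss

# usage: x is a binary vector; W has shape (n_hidden, n_visible)
rng = np.random.default_rng(0)
n_vis, n_hid = 8, 4
W = rng.normal(0, 0.1, (n_hid, n_vis))
b, c = np.zeros(n_hid), np.zeros(n_vis)
x = (rng.random(n_vis) < 0.5).astype(float)
for _ in range(100):
    loss = autoencoder_step(x, W, b, c)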
Pre-training
• We will use a greedy, layer-wise procedure

Ø Train one layer at a time with unsupervised criterion


Ø Fix the parameters of previous hidden layers
Ø Previous layers can be viewed as feature extraction
Fine-tuning
• Once all layers are pre-trained
Ø add output layer
Ø train the whole network using
supervised learning

• We call this last phase fine-tuning


Ø all parameters are ‘‘tuned’’ for the
supervised task at hand
Ø representation is adjusted to be more
discriminative
Why Training is Hard
• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second-order
methods such as KFAC)
Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training
Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Dropout
• Key idea: Cripple neural network by removing hidden units stochastically

Ø each hidden unit is set to 0 with probability 0.5

Ø hidden units cannot co-adapt to other units

Ø hidden units must be more generally useful

• Could use a different dropout probability, but 0.5 usually works well
Dropout
• Use random binary masks m^(k)

Ø hidden layer pre-activation for k > 0 (with h^(0)(x) = x):

  a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)

Ø hidden layer activation (k from 1 to L) is multiplied by the mask:

  h^(k)(x) = g(a^(k)(x)) ⊙ m^(k)

Ø output layer activation (k = L+1):

  h^(L+1)(x) = o(a^(L+1)(x)) = f(x)


Dropout at Test Time
• At test time, we replace the masks by their expectation

Ø This is simply the constant vector 0.5 if the dropout probability is 0.5

Ø For a single hidden layer: equivalent to taking the geometric average of all
neural networks, with all possible binary masks

• Can be combined with unsupervised pre-training

• Beats regular backpropagation on many datasets

• Ensemble: Can be viewed as a geometric average of an exponential number
of networks.
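A sketch of dropout applied to one hidden layer, at training time (random binary mask) and at test time (mask replaced by its expectation); this follows the expectation-scaling scheme described above rather than the "inverted dropout" variant:

import numpy as np

def dropout_hidden(h, p_keep=0.5, train=True, rng=None):
    # h: hidden layer activations h^(k)(x)
    rng = np.random.default_rng() if rng is None else rng
    if train:
        m = (rng.random(h.shape) < p_keep).astype(h.dtype)  # random binary mask m^(k)
        return h * m                 # each unit is zeroed with probability 1 - p_keep
    else:
        return h * p_keep            # test time: multiply by the expectation of the mask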
Why Training is Hard
• First hypothesis (underfitting): better optimize

Ø Use better optimization tools (e.g. batch-normalization, second-order
methods such as KFAC)
Ø Use GPUs, distributed computing.

• Second hypothesis (overfitting): use better regularization

Ø Unsupervised pre-training
Ø Stochastic drop-out training

• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Batch Normalization
• Normalizing the inputs will speed up training (LeCun et al., 1998)

Ø could normalization be useful at the level of the hidden layers?

• Batch normalization is an attempt to do that (Ioffe and Szegedy, 2015)

Ø each unit’s pre-activation is normalized (mean subtraction, stddev division)
Ø during training, the mean and stddev are computed for each minibatch
Ø backpropagation takes the normalization into account
Ø at test time, the global mean / stddev is used
Batch Normalization
• Each pre-activation is normalized using mini-batch statistics, followed by a
learned linear transformation (γ and β are trained) that lets the network adapt
to the non-linear activation function:

  μ_B ← (1/m) Σ_{i=1..m} x_i                // mini-batch mean
  σ_B² ← (1/m) Σ_{i=1..m} (x_i - μ_B)²      // mini-batch variance
  x̂_i ← (x_i - μ_B) / sqrt(σ_B² + ε)        // normalize
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)           // scale and shift

(Algorithm 1 of Ioffe & Szegedy: Batch Normalizing Transform, applied to
activation x over a mini-batch.)

• Why normalize the pre-activation?

Ø can help keep the pre-activation in a non-saturating regime
(though the learned linear transform could cancel this effect)

• Use the global mean and stddev at test time.

Ø removes the stochasticity of the mean and stddev
Ø requires a final phase where, from the first to the last hidden layer:
  • propagate all training data to that layer
  • compute and store the global mean and stddev of each unit
Ø for early stopping, could use a running average
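A NumPy sketch of the batch-normalizing transform for a mini-batch of pre-activations (rows are examples, columns are units); the running-average handling of the global statistics is one common choice, written out here as an illustration rather than the paper's exact procedure:

import numpy as np

def batch_norm(A, gamma, beta, running_mean, running_var,
               train=True, eps=1e-5, momentum=0.9):
    # A: (batch_size, n_units) matrix of pre-activations for one layer
    if train:
        mu = A.mean(axis=0)                      # mini-batch mean
        var = A.var(axis=0)                      # mini-batch variance
        # keep running estimates of the global statistics for test time
        running_mean[:] = momentum * running_mean + (1 - momentum) * mu
        running_var[:] = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var      # global mean/stddev at test time
    A_hat = (A - mu) / np.sqrt(var + eps)        # normalize
    return gamma * A_hat + beta                  # learned scale and shift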


Optimization Tricks
• SGD with momentum, batch-normalization, and dropout usually works very well

• Pick the learning rate by running on a subset of the data

Ø Start with a large learning rate & divide by 2 until the loss does not diverge
Ø Decay the learning rate by a factor of ~100 or more by the end of training

• Use the ReLU nonlinearity

• Initialize parameters so that each feature across layers has similar
variance. Avoid units in saturation.

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
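One common way to follow the last point is to scale each layer's initial weights by its fan-in and fan-out (Glorot & Bengio-style initialization); the sketch below illustrates that idea, and is not a rule prescribed by this slide:

import numpy as np

def init_layers(sizes, seed=0):
    # sizes = [n_input, n_hidden_1, ..., n_output]
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = np.sqrt(6.0 / (n_in + n_out))          # keeps activation variance roughly constant
        weights.append(rng.uniform(-bound, bound, size=(n_out, n_in)))
        biases.append(np.zeros(n_out))                 # zero biases: start away from saturation
    return weights, biases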


Visualization
• Check gradients numerically by finite differences:

  ∂f(x)/∂x ≈ (f(x + ε) - f(x - ε)) / (2ε)

• Visualize features (features need to be uncorrelated) and have high variance

• Good training: hidden units are sparse across samples

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
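A sketch of the finite-difference check: compare the analytic gradient with (l(θ + ε e_i) - l(θ - ε e_i)) / (2ε) for each parameter; loss_fn and grad_fn are assumed user-supplied functions:

import numpy as np

def check_gradient(loss_fn, grad_fn, theta, eps=1e-5):
    # Compare analytic gradients with central finite differences.
    analytic = grad_fn(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        numeric[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
    rel_err = np.abs(analytic - numeric) / np.maximum(1e-8, np.abs(analytic) + np.abs(numeric))
    return rel_err.max()        # should be small (e.g. well below 1e-4 in float64)

# example on a simple quadratic loss
theta0 = np.array([1.0, -2.0, 0.5])
print(check_gradient(lambda t: 0.5 * np.sum(t ** 2), lambda t: t, theta0))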


Visualization
• Check gradients numerically by finite differences

• Visualize features (features need to be uncorrelated) and have high variance

• Visualize parameters: learned features should exhibit structure and should
be uncorrelated
Visualization
• Check gradients numerically by finite differences

• Visualize features (features need to be uncorrelated) and have high variance

• Bad training: many hidden units ignore the input and/or exhibit strong
correlations
Computer Vision
• Design algorithms that can process visual data to accomplish a given task:

Ø For example, object recognition: Given an input image, identify which object
it contains
Deep Convolutional Nets
[Figure: a very deep network maps the input, through a high-level feature
space, to a prediction.]

• Convolution
• Pooling
• Normalization
• Densely connected

Deep Convolutional Nets
[Figure: a convolution stage followed by a pooling stage.]
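A minimal, loop-based NumPy sketch of the two basic operations ("valid" 2-D cross-correlation and non-overlapping max pooling), purely to show what they compute; real implementations are heavily optimized:

import numpy as np

def conv2d(image, kernel):
    # "valid" cross-correlation of a 2-D image with a 2-D kernel
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # non-overlapping max pooling over size x size windows
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    pooled = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return pooled.max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[-1.0, 1.0]])               # a simple horizontal edge filter
print(max_pool(np.maximum(0, conv2d(img, edge))))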
ConvNets: Examples
• Optical Character Recognition, House Number and Traffic Sign
classification
ConvNets: Examples
• Pedestrian detection

(Sermanet et al., Pedestrian detection with unsupervised multi-stage, CVPR 2013)


ConvNets: Examples
• Object Detection

Sermanet et al., OverFeat: Integrated recognition, localization, 2013


Girshick et al., Rich feature hierarchies for accurate object detection, 2013
Szegedy et al., DNN for object detection, NIPS 2013
ImageNet Dataset
• 1.2 million images, 1000 classes

Examples of Hammer

(Deng et al., Imagenet: a large scale hierarchical image database, CVPR 2009)
Important Breakthrough
• Deep Convolutional Nets for Vision (Supervised)

Krizhevsky, A., Sutskever, I. and Hinton, G. E., ImageNet Classification with Deep
Convolutional Neural Networks, NIPS, 2012.

1.2 million training images
1000 classes
Architecture
• How can we select the right architecture:

Ø Manual tuning of features is now replaced with the manual tuning of
architectures

• Depth
• Width
• Parameter count
How to Choose Architecture
• Many hyper-parameters:
Ø Number of layers, number of feature maps

• Cross Validation

• Grid Search (need lots of GPUs)

• Smarter Strategies

Ø Random search
Ø Bayesian Optimization
AlexNet
• 8 layers total

• Trained on the ImageNet dataset [Deng et al. CVPR’09]

• 18.2% top-5 error

[Architecture, bottom to top: Input Image → Layer 1: Conv + Pool → Layer 2:
Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool →
Layer 6: Full → Layer 7: Full → Softmax Output]

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet
• Remove the top fully connected layer 7

• Drop ~16 million parameters

• Only 1.1% drop in performance!

[Remaining architecture: Input Image → Layer 1: Conv + Pool → Layer 2: Conv +
Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full →
Softmax Output]

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet
• Let us remove the upper feature extractor layers and fully connected layers:

Ø Layers 3, 4, 6 and 7

• Drop ~50 million parameters

• 33.5% drop in performance!

• Depth of the network is the key.

[Figure: the reduced architecture, keeping only the lower Conv + Pool layers
plus the softmax output]

[From Rob Fergus’ CIFAR 2016 tutorial]


GoogLeNet
• 24-layer model that uses the so-called inception module.

[Figure: the full network; layers are colored by type — Convolution, Pooling,
Softmax, Other.]

(Szegedy et al., Going Deeper with Convolutions, 2014)


GoogLeNet
• GoogLeNet inception module:

Ø Multiple filter scales at each layer

Ø Dimensionality reduction to keep computational requirements down

[Figure 2 of Szegedy et al.: the inception module, naïve version (a) and with
dimension reductions (b); parallel 1x1, 3x3, and 5x5 convolutions (plus pooling)
whose outputs are concatenated.]

(Szegedy et al., Going Deeper with Convolutions, 2014)

GoogLeNet
• Width of inception modules ranges from 256 filters (in early modules) to
1024 in top inception modules.
• Can remove the fully connected layers on top completely
• Number of parameters is reduced to 5 million
• 6.7% top-5 validation error on ImageNet

(Szegedy et al., Going Deeper with Convolutions, 2014)


Residual Networks
(He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition, CVPR 2016)

• Really, really deep convnets do not train well.

Ø E.g. on CIFAR-10, a 56-layer “plain” network has higher training error (and
thus higher test error) than a 20-layer one.

• Key idea: introduce a “pass-through” (identity shortcut) into each building
block, so that only the residual F(x) needs to be learned; the block outputs
F(x) + x.

[Figures: training/test error curves of 20-layer vs. 56-layer plain networks on
CIFAR-10; a residual building block (two 3x3 conv layers plus an identity
shortcut) and the “bottleneck” block (1x1, 3x3, 1x1 convolutions) used for
ResNet-50/101/152; example ImageNet architectures: VGG-19, a 34-layer plain
network, and a 34-layer residual network.]

• Parameter-free identity shortcuts help with training, and deeper residual
networks keep improving:

  model        top-1 err.   top-5 err.
  VGG-16       28.07        9.33
  GoogLeNet    -            9.15
  PReLU-net    24.27        7.38
  plain-34     28.54        10.02
  ResNet-34 B  24.52        7.46
  ResNet-50    22.85        6.71
  ResNet-101   21.75        6.05
  ResNet-152   21.43        5.71
  (Error rates (%), 10-crop testing, ImageNet validation)

• With ensembling: 3.57% top-5 error on the ImageNet test set, winning 1st
place in the ILSVRC 2015 classification competition.
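To make the pass-through idea concrete, here is a NumPy sketch of a residual block in its fully-connected form (the paper uses convolutional layers); the sizes and initialization are illustrative:

import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def residual_block(x, W1, b1, W2, b2):
    # F(x): two weight layers with a ReLU in between (fully-connected analogue
    # of the paper's two 3x3 conv layers)
    Fx = W2 @ relu(W1 @ x + b1) + b2
    return relu(Fx + x)            # identity shortcut: only the residual has to be learned

rng = np.random.default_rng(0)
d = 8
W1, W2 = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
b1, b2 = np.zeros(d), np.zeros(d)
h = rng.normal(size=d)
for _ in range(20):                # stack many residual blocks
    h = residual_block(h, W1, b1, W2, b2)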
Choosing the Architecture
• Task dependent

• Cross-validation

• [Convolution → pooling]* + fully connected layer

• The more data: the more layers and the more kernels
Ø Look at the number of parameters at each layer
Ø Look at the number of flops at each layer

• Computational resources

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]


End of Part 1
