Deep Learning
Supervised Learning
Russ Salakhutdinov
[Figure: example data modalities — product recommendation, relational data / social networks, gene expression, fMRI, tumor regions in medical images]
Deep Learning: models that support inferences and discover structure at multiple levels.
• Computer Vision
• Recommender Systems
• Language Understanding
• Drug Discovery and Medical
Image Analysis
Deep Generative Model
• Model P(document). Reuters dataset: 804,414 newswire stories (unsupervised).
[Figure: 2-D embedding of documents; clusters correspond to topics such as European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, and Legal/Judicial]
Caption Generation
Model samples:
• a group of people in a crowded area .
• a group of people are walking and talking .
• a group of people, standing around and talking .
Talk Roadmap
Part 1: Supervised Learning: Deep Networks
• Definition of Neural Networks
• Training Neural Networks
• Recent Optimization / Regularization Techniques
Learning Feature Representations
[Figure: in the input (pixel) space, with axes pixel 1 and pixel 2, Segway and non-Segway examples are not separable; mapping to a feature space with axes Handle and Wheel makes them separable. Pipeline: input → feature representation → learning algorithm.]
Traditional Approaches
Data → Feature extraction → Learning algorithm
• Object detection: image → vision features → detection
• Audio classification: audio → audio features → speaker identification
Computer Vision Features: SIFT, HoG, Textons, RIFT, GIST
Audio Features: Spectrogram, MFCC
Representation Learning: can we automatically learn these representations?
Artificial Neuron (from Pascal Vincent's slides)
• Neuron pre-activation (input activation): a(x) = b + Σ_j w_j x_j = b + w^T x
• Neuron output activation: h(x) = g(a(x)) = g(b + Σ_j w_j x_j)
• w are the connection weights, b is the neuron bias, g(·) is called the activation function
• The range of h(x) is determined by g(·); the bias b only changes the position of the riff
Activation Function
• Linear activation function: g(a) = a
• Sigmoid activation function: g(a) = sigm(a) = 1 / (1 + exp(−a))
Ø Squashes the neuron's output between 0 and 1
Ø Always positive
Ø Bounded
Ø Strictly increasing
• Hyperbolic tangent activation function: g(a) = tanh(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a)) = (exp(2a) − 1) / (exp(2a) + 1)
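To make the neuron definition concrete, here is a minimal NumPy sketch (the input, weight, and bias values are made up for illustration) of computing the pre-activation a(x) = b + w^T x and the output h(x) = g(a(x)) with a sigmoid or tanh activation:

```python
import numpy as np

def sigm(a):
    # Sigmoid activation: squashes the pre-activation into (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b, g=sigm):
    # Pre-activation a(x) = b + w^T x, then output activation h(x) = g(a(x)).
    a = b + np.dot(w, x)
    return g(a)

# Example: 3-dimensional input with arbitrary (made-up) weights and bias.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
print(neuron(x, w, b=0.2))               # sigmoid output in (0, 1)
print(neuron(x, w, b=0.2, g=np.tanh))    # tanh output in (-1, 1)
```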
Single Hidden Layer Neural Network
• Hidden layer pre-activation: a(x) = b^(1) + W^(1) x, i.e. a(x)_i = b^(1)_i + Σ_j W^(1)_{i,j} x_j
• Hidden layer activation: h(x) = g(a(x))
• Output layer activation: f(x) = o(b^(2) + w^(2)^T h(x))

Multilayer Neural Network
• Could have L hidden layers, with parameters W^(1), W^(2), W^(3), ..., b^(1), b^(2), b^(3), ... and layer activations h^(1)(x), h^(2)(x), ...
• Layer pre-activation for k > 0: a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)   (with h^(0)(x) = x)
• Hidden layer activation (k from 1 to L): h^(k)(x) = g(a^(k)(x))
• Output layer activation (k = L+1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
• For multi-class classification, the output activation o(·) is the softmax:
  o(a) = softmax(a) = [ exp(a_1) / Σ_c exp(a_c), ..., exp(a_C) / Σ_c exp(a_c) ]^T
  so that f(x)_c estimates p(y = c | x).
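A minimal NumPy sketch of the forward propagation just described (the layer sizes and random parameters are made up for illustration); `forward` applies the recursion a^(k) = b^(k) + W^(k) h^(k−1), h^(k) = g(a^(k)), and finishes with a softmax output:

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax: o(a)_c = exp(a_c) / sum_c' exp(a_c').
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x, weights, biases, g=np.tanh):
    # Forward propagation through L hidden layers plus a softmax output layer.
    # weights = [W1, ..., W(L+1)], biases = [b1, ..., b(L+1)].
    h = x                                   # h^(0)(x) = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = b + W @ h                       # a^(k)(x) = b^(k) + W^(k) h^(k-1)(x)
        h = g(a)                            # h^(k)(x) = g(a^(k)(x))
    return softmax(biases[-1] + weights[-1] @ h)   # f(x) = o(a^(L+1)(x))

# Tiny example: 4 inputs -> 5 hidden units -> 3 classes, random parameters.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(3, 5))]
bs = [np.zeros(5), np.zeros(3)]
print(forward(rng.normal(size=4), Ws, bs))  # class probabilities summing to 1
```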
[Figure: a small two-layer network solving an XOR-style problem (from Pascal Vincent's slides) — inputs x1, x2; hidden units y1, y2 with weights w_ji and a bias unit; output z with weights w_kj. The network outputs z = +1 in region R1 and z = −1 in region R2 of the input space.]
Capacity of Neural Nets
• Consider a single-hidden-layer neural network: by the universal approximation theorem (Hornik, 1991), with enough hidden units it can approximate any continuous function arbitrarily well.
• However, this does not mean that there is a learning algorithm that can find the necessary parameter values.
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks
Empirical Risk Minimization
arg min_θ  (1/T) Σ_t l(f(x^(t); θ), y^(t)) + Ω(θ)
where l(·, ·) is the loss function and Ω(θ) is the regularizer.
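For concreteness, a small sketch of evaluating this objective, assuming a hypothetical per-example `loss` helper and using an L2 penalty as one possible choice of regularizer Ω(θ):

```python
import numpy as np

def objective(theta, data, loss, lam=0.01):
    # Regularized average loss: (1/T) * sum_t l(f(x_t; theta), y_t) + Omega(theta).
    # `loss(theta, x, y)` is a hypothetical helper returning the per-example loss;
    # here Omega(theta) is an L2 penalty weighted by lam.
    T = len(data)
    avg_loss = sum(loss(theta, x, y) for x, y in data) / T
    omega = lam * sum(np.sum(w ** 2) for w in theta)
    return avg_loss + omega
```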
Model Selection
• Supervised learning example: (x, y), with input x and target y.
• Training set: D^train = {(x^(t), y^(t))}.
• Training Protocol: train on the training set; use a separate validation set for model (hyper-parameter) selection and a test set to estimate generalization performance.
• Gradient checking: compare the implemented gradient with a finite-difference approximation, ∂f(x)/∂x ≈ (f(x + ε) − f(x − ε)) / (2ε).

Mini-batch, Momentum
• Make updates based on a mini-batch of examples (instead of a single example):
Ø the gradient is the average regularized loss for that mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient
• The learning rate α_t should satisfy Σ_{t=1}^∞ α_t = ∞ and Σ_{t=1}^∞ α_t^2 < ∞; in practice, decrease it over time, e.g. α_t = α / (1 + δ t).
• Momentum: can use an exponential average of previous gradients:
  ∇_θ^(t) = ∇_θ l(f(x^(t)), y^(t)) + β ∇_θ^(t−1),  with 0 ≤ β < 1
Ø can get past plateaus more quickly, by "gaining momentum"
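A minimal sketch of mini-batch SGD with momentum and a simple learning-rate decay, assuming a hypothetical `grad_fn(params, batch)` that returns the gradient of the average regularized loss over the mini-batch (same shapes as `params`):

```python
import numpy as np

def sgd_momentum(params, grad_fn, data, lr=0.1, beta=0.9, batch_size=32, epochs=10):
    # Mini-batch stochastic gradient descent with momentum.
    velocity = [np.zeros_like(p) for p in params]
    for epoch in range(epochs):
        np.random.shuffle(data)                    # visit examples in random order
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            grads = grad_fn(params, batch)         # average gradient over the mini-batch
            for p, v, g in zip(params, velocity, grads):
                v[:] = beta * v + g                # exponential average of past gradients
                p -= lr * v                        # parameter update
        lr = lr / (1.0 + 0.01 * epoch)             # simple decay, e.g. alpha_t = alpha / (1 + delta*t)
    return params
```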
Feedforward Neural Networks
‣ How neural networks predict f(x) given an input x:
- Forward propagation
- Types of units
- Capacity of neural networks
Local vs. Distributed Representations
• Local methods: clustering, nearest neighbors, RBF SVM, local density estimators
Ø Parameters for each local region.
Ø # of regions is linear in the # of parameters.
• Distributed methods: RBMs, factor models, PCA, sparse coding, deep models
Ø Each parameter affects many regions, not just a local one.
Ø # of regions grows (roughly) exponentially in the # of parameters.
[Figure: learned prototypes partition the input space into local regions, vs. a distributed representation where regions are indexed by binary codes C1, C2, C3 (Bengio, 2009, Foundations and Trends in Machine Learning)]
Inspiration from Visual Cortex
Why Training is Hard
• First hypothesis: hard optimization problem (underfitting)
• Second hypothesis: overfitting
• For many large-scale practical problems, you will need to use both: better optimization and better regularization!
Unsupervised Pre-training
• Initialize hidden layers using unsupervised learning
Autoencoders: Preview
• Encoder: h(x) = g(a(x)) = sigm(b + W x)
• Decoder: x̂ = f(x) ≡ o(â(x)) = sigm(c + W* h(x)), where W* is the decoder weight matrix (often tied to the encoder: W* = W^T)
• Loss function for binary inputs (cross-entropy error function):
  l(f(x)) = − Σ_k ( x_k log(x̂_k) + (1 − x_k) log(1 − x̂_k) )
• Loss function for real-valued inputs (sum of squared differences; we use a linear activation function at the output):
  l(f(x)) = ½ Σ_k (x̂_k − x_k)^2
• Gradient with respect to the output pre-activation: ∇_{â(x^(t))} l(f(x^(t))) = x̂^(t) − x^(t)
• Forward propagation: a(x^(t)) = b + W x^(t),  h(x^(t)) = sigm(a(x^(t)))
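A minimal NumPy sketch of this autoencoder with tied weights and the cross-entropy reconstruction loss (the sizes and random parameters are made up for illustration):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_loss(x, W, b, c):
    # Encoder: h(x) = sigm(b + W x); decoder with tied weights W* = W^T.
    h = sigm(b + W @ x)
    x_hat = sigm(c + W.T @ h)              # reconstruction x_hat = sigm(c + W* h(x))
    # Cross-entropy reconstruction loss for binary inputs.
    eps = 1e-12                            # avoid log(0)
    return -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))

# Example: 6-dimensional binary input, 3 hidden units, random parameters.
rng = np.random.default_rng(0)
x = (rng.random(6) > 0.5).astype(float)
W, b, c = rng.normal(size=(3, 6)), np.zeros(3), np.zeros(6)
print(autoencoder_loss(x, W, b, c))
```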
Pre-training
• We will use a greedy, layer-wise procedure
• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
Dropout
• Key idea: Cripple neural network by removing hidden units
stochastically
• Each hidden unit is independently set to 0 with probability 0.5 during training; forward propagation is otherwise unchanged:
Ø hidden layer pre-activation (k from 1 to L): a^(k)(x) = b^(k) + W^(k) h^(k−1)(x)   (with h^(0)(x) = x)
Ø hidden layer activation: h^(k)(x) = g(a^(k)(x))
Ø output layer activation (k = L+1): h^(L+1)(x) = o(a^(L+1)(x)) = f(x)
• For many large-scale practical problems, you will need to use both:
better optimization and better regularization!
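A minimal sketch of applying dropout to one layer's activations. This uses the common "inverted dropout" variant (rescaling the surviving units during training so nothing needs to change at test time), which is one way to implement the idea described above:

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, rng=np.random.default_rng()):
    # Inverted dropout on a layer's activations h: during training, zero each unit
    # with probability p_drop and rescale the survivors so the expected activation
    # is unchanged; at test time, pass the activations through untouched.
    if not train:
        return h
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
    return h * mask / (1.0 - p_drop)

# Example: apply dropout to one hidden layer's activations.
h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout_forward(h))                 # training: some units zeroed, rest scaled up
print(dropout_forward(h, train=False))    # test time: unchanged
```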
Batch Normalization
• Normalizing the inputs will speed up training (LeCun et al. 1998)
• Batch normalization applies the same idea to each hidden layer: normalize its pre-activations using the mean and variance of the current mini-batch, then scale and shift with learned parameters γ and β (Ioffe & Szegedy, 2015).
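A minimal NumPy sketch of the batch-normalization transform for one layer's pre-activations over a mini-batch (per-unit mean/variance normalization followed by the learned scale γ and shift β):

```python
import numpy as np

def batch_norm(A, gamma, beta, eps=1e-5):
    # Batch normalization for a mini-batch of pre-activations A with shape
    # (batch_size, num_units): normalize each unit over the batch, then scale/shift.
    mu = A.mean(axis=0)                    # per-unit mini-batch mean
    var = A.var(axis=0)                    # per-unit mini-batch variance
    A_hat = (A - mu) / np.sqrt(var + eps)  # normalized pre-activations
    return gamma * A_hat + beta            # learned scale (gamma) and shift (beta)

# Example: batch of 4 examples, 3 units.
rng = np.random.default_rng(0)
A = rng.normal(loc=5.0, scale=2.0, size=(4, 3))
out = batch_norm(A, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std per unit
```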
[Figure: a very deep convolutional network — a stack of convolution and pooling layers, ..., followed by a prediction layer]
ConvNets: Examples
• Optical Character Recognition, House Number and Traffic Sign
classification
ConvNets: Examples
• Pedestrian detection
Examples of the ImageNet class "hammer"
(Deng et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009)
Important Breakthrough
• Deep Convolutional Nets for Vision (Supervised)
Krizhevsky, A., Sutskever, I. and Hinton, G. E., ImageNet Classification with Deep
Convolutional Neural Networks, NIPS, 2012.
• Depth
• Width
• Parameter count
How to Choose Architecture
• Many hyper-parameters:
Ø Number of layers, number of feature maps
• Cross Validation
• Smarter Strategies
Ø Random search
Ø Bayesian Optimization
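A minimal sketch of the random-search strategy over a hypothetical hyper-parameter space, assuming a hypothetical `train_and_validate(config)` helper that trains a model and returns its validation error:

```python
import random

# Hypothetical search space covering the hyper-parameters named above.
search_space = {
    "num_layers": [2, 3, 4, 5],
    "num_feature_maps": [32, 64, 128, 256],
    "learning_rate": [1e-1, 1e-2, 1e-3],
}

def random_search(train_and_validate, n_trials=20, seed=0):
    # Sample random configurations and keep the one with the lowest validation error.
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in search_space.items()}
        err = train_and_validate(config)      # validation error for this configuration
        if best is None or err < best[0]:
            best = (err, config)
    return best
```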
AlexNet
• 8 layers total
• Trained on the ImageNet dataset [Deng et al. CVPR'09]
[Figure: AlexNet layer diagram — Softmax Output; Layer 7: Full; Layer 6: Full; ...; Layer 4: Conv; Layer 3: Conv; ...]
GoogLeNet
[Figure: (a) Inception module, naïve version; (b) Inception module with dimension reductions — parallel 1x1, 3x3, and 5x5 filters (and pooling) whose outputs are concatenated]
• Width of inception modules ranges from 256 filters (in early modules) to 1024 in top inception modules.
• Can remove the fully connected layers on top completely.
• Number of parameters is reduced to 5 million.
• 6.7% top-5 validation error on ImageNet.
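A minimal sketch of the naïve Inception module, assuming PyTorch is available; the version with dimension reductions additionally inserts 1x1 convolutions before the 3x3/5x5 branches and after the pooling branch, which this sketch omits:

```python
import torch
import torch.nn as nn

class InceptionNaive(nn.Module):
    # Naive Inception module: parallel 1x1, 3x3, and 5x5 convolutions plus 3x3
    # max-pooling, with all branch outputs concatenated along the channel dimension.
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

# Example: a 192-channel feature map through one module (arbitrary branch widths).
x = torch.randn(1, 192, 28, 28)
print(InceptionNaive(192, 64, 128, 32)(x).shape)  # -> [1, 64+128+32+192, 28, 28]
```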
Residual Networks
(Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun — Microsoft Research)
• Really, really deep convnets do not train well. E.g. on CIFAR-10, a 56-layer "plain" network has higher training error, and thus test error, than a 20-layer one: with increasing depth, accuracy gets saturated and then degrades.
• Vanishing/exploding signals can hamper convergence from the beginning; this problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with back-propagation.
• Key idea: introduce identity "shortcut" connections that pass the signal straight through into each layer, so a block only needs to learn a residual function F(x) and outputs F(x) + x.
• The parameter-free identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures; projection shortcuts are used only for increasing dimensions.
• Deeper bottleneck design: each residual function F uses a stack of 3 layers instead of 2 — 1x1, 3x3, and 1x1 convolutions, where the 1x1 layers are responsible for reducing and then increasing (restoring) dimensions.
• Single-model results on the ImageNet validation set (top-1 / top-5 err.; † reported on the test set):
Ø VGG [41] (ILSVRC'14): – / 8.43†
Ø GoogLeNet [44]: – / 9.15
Ø VGG [41] (v5): 24.4 / 7.1
Ø PReLU-net [13]: 24.27 / 7.38
• With ensembling (top-5 err. on the test set):
Ø VGG [41] (ILSVRC'14): 7.32
Ø PReLU-net [13]: 4.94
Ø BN-Inception [16]: 4.82
Ø ResNet (ILSVRC'15): 3.57
• ResNet reduces the top-1 error by 3.5% compared with its plain counterpart (Table 2), and the learned residual representations also generalize well to other recognition tasks (detection and segmentation), winning the ILSVRC & COCO 2015 competitions.
[Figure: example network architectures for ImageNet (He et al., Fig. 3) — a plain VGG-style network vs. a residual network with shortcut connections: a 7x7 conv, 64, /2 stem followed by stages of 3x3 convolutions with 64, 128, 256, and 512 filters and /2 downsampling.]
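A minimal sketch of a basic (two-layer, non-bottleneck) residual block, assuming PyTorch; the point is only to show the F(x) + x shortcut structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # Basic residual block: two 3x3 convolutions with batch normalization,
    # plus an identity shortcut, so the block computes F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first 3x3 conv of F(x)
        out = self.bn2(self.conv2(out))         # second 3x3 conv of F(x)
        return F.relu(out + x)                  # identity shortcut: F(x) + x, then ReLU

# Example: the block preserves the input shape, so it can be stacked many times.
x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # -> [1, 64, 56, 56]
```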
Choosing the Architecture
• Task dependent
• Cross-validation
• The more data you have, the more layers and the more kernels you can afford
Ø Look at the number of parameters at each layer
Ø Look at the number of flops at each layer
• Computational resources
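As a small illustration of the per-layer checks above, a sketch (with hypothetical layer shapes) that tallies parameters and approximate multiply-add FLOPs for each convolutional layer:

```python
# Hypothetical convolutional layers: (name, in_channels, out_channels, kernel, out_h, out_w).
layers = [
    ("conv1", 3, 64, 7, 112, 112),
    ("conv2", 64, 128, 3, 56, 56),
    ("conv3", 128, 256, 3, 28, 28),
]

for name, cin, cout, k, h, w in layers:
    params = cout * (cin * k * k + 1)          # weights + biases per layer
    flops = cout * cin * k * k * h * w         # multiply-adds over all output positions
    print(f"{name}: {params:,} params, {flops:,} FLOPs")
```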