ResNet
Some slides were adapted/taken from various sources, including Andrew Ng's Coursera lectures, Stanford's CS231n: Convolutional Neural Networks for Visual Recognition lectures, CS lectures from Waterloo, Canada, Aykut Erdem et al.'s tutorial on Deep Learning in Computer Vision, Ismini Lourentzou's lecture slides on "Introduction to Deep Learning", Ramprasaath's lecture slides, and many more. We thankfully acknowledge them. Students are requested to use this material for their study only and NOT to distribute it.
In this Lecture
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Deep vs Shallow Networks
What happens when we continue stacking deeper layers on a “plain” convolutional
neural network?
[Figure: training error (left) and test error (right) vs. iterations for plain networks — the 56-layer network shows higher error than the 20-layer network on both plots.]
Challenges
ResNet
Plain Network
Residual Blocks
Residual Blocks
[Diagram: a "Big NN" maps X to a[l]; two additional weight layers map a[l] to a[l+2], with a skip connection carrying a[l] past them and adding it in before the final nonlinearity.]

a[l+2] = g(z[l+2] + a[l])
       = g(W[l+2] a[l+1] + b[l+2] + a[l])
       = g(a[l])    if W[l+2] = 0 and b[l+2] = 0

The identity function is therefore easy for a residual block to learn: if the added layers' weights shrink to zero, the block simply passes a[l] through unchanged.
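To make the skip connection concrete, here is a minimal residual-block sketch in PyTorch-style Python (an illustration assuming that framework, not code from the slides; the channel width is a placeholder):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers with an identity skip connection: out = relu(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first weight layer + nonlinearity
        out = self.bn2(self.conv2(out))         # second weight layer (no relu yet)
        return F.relu(out + x)                  # add the skip connection, then relu

# If the second conv's weights collapse to zero (and BN leaves the branch at zero),
# the block outputs relu(x): the identity mapping is trivially learned.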
Skip Connections “shortcuts”
ResNet
He et al., 2015
[Figure: the full ResNet stack — Input → 7x7 conv, 64, /2 → Pool → repeated residual blocks of two 3x3 conv, 64 layers → Pool → FC 1000 → Softmax, with one residual block highlighted.]
ResNet Architecture
Full ResNet architecture:
- Stack residual blocks; every residual block has two 3x3 conv layers and computes F(x) + x followed by a relu.
- An additional conv layer at the beginning (7x7 conv, 64, /2).
- No FC layers besides the FC 1000 that maps to the output classes.
- A global average pooling layer after the last conv layer.
[Figure: the full stack from Input and the beginning 7x7 conv, 64, /2 through residual blocks whose filter counts grow from 64 through 128 up to 512, with periodic spatial downsampling via stride-2 convs (e.g., 3x3 conv, 512, /2), ending in global average pooling, FC 1000, and Softmax.]
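To make the stacking concrete, here is a hedged sketch of how such blocks could be assembled (assuming PyTorch and reusing the BasicResidualBlock sketched earlier in these notes; the block counts and the plain strided convs between stages are simplifications — real ResNets downsample inside strided residual blocks with projection shortcuts):

import torch
import torch.nn as nn

def make_stage(block, channels, num_blocks):
    """Stack several same-width residual blocks (a simplified 'stage')."""
    return nn.Sequential(*[block(channels) for _ in range(num_blocks)])

class TinyResNet(nn.Module):
    """Illustrative ResNet-style stack: beginning conv, residual stages,
    global average pooling, and a single FC layer to the output classes."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(                      # beginning conv layer
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.stage1 = make_stage(BasicResidualBlock, 64, 2)
        self.down1  = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
        self.stage2 = make_stage(BasicResidualBlock, 128, 2)
        self.down2  = nn.Conv2d(128, 512, kernel_size=3, stride=2, padding=1)
        self.stage3 = make_stage(BasicResidualBlock, 512, 2)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                    # global average pooling
            nn.Flatten(),
            nn.Linear(512, num_classes),                # the only FC layer: FC 1000
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.down1(x)
        x = self.stage2(x)
        x = self.down2(x)
        x = self.stage3(x)
        return self.head(x)

logits = TinyResNet()(torch.randn(1, 3, 224, 224))      # -> shape [1, 1000]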
ResNet Architecture
For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet).
[Figure: a bottleneck residual block with 28x28x256 input and 28x28x256 output — a 1x1 conv first reduces the input to 64 feature maps, a 3x3 conv, 64 then operates over only those 64 feature maps, and a 1x1 conv, 256 projects back to 256 feature maps.]
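A minimal sketch of such a bottleneck block (assuming PyTorch; the 256/64 widths follow the figure, the rest is illustrative rather than the authors' exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, with an identity skip connection."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)   # 256 -> 64
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)  # 3x3 over only 64 maps
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)    # project back to 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)               # skip connection: input and output are both 28x28x256

x = torch.randn(1, 256, 28, 28)
print(BottleneckBlock()(x).shape)            # torch.Size([1, 256, 28, 28])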
Residual Blocks (skip connections)
Deeper Bottleneck Architecture
Deeper Bottleneck Architecture (Cont.)
• Addresses the high training time of very deep networks.
• Keeps the time complexity about the same as the two-layer 3x3 convolution block (a rough parameter count follows below).
• Allows us to increase the number of layers.
• Allows the model to converge much faster.
• The 152-layer ResNet has 11.3 billion FLOPs, while the VGG-16/19 nets have 15.3/19.6 billion FLOPs.
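As a rough back-of-the-envelope check of the complexity claim (using weight counts as a proxy; these numbers are an added illustration, not from the slides): a bottleneck block on 256-d features costs about 1·1·256·64 + 3·3·64·64 + 1·1·64·256 ≈ 69K weights, close to the ≈ 74K weights (2·3·3·64·64) of a plain block of two 3x3 conv, 64 layers, whereas two 3x3 conv layers applied directly to 256-d features would cost about 2·3·3·256·256 ≈ 1.18M weights.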
Why Do ResNets Work Well?
Why Do ResNets Work Well? (Cont.)
• In theory a ResNet can represent the same functions as a plain network, but in practice, for the reasons above, convergence is much faster (the gradient identity sketched below makes this concrete).
• No additional training parameters are introduced.
• No additional complexity is introduced.
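One standard way to make the faster-convergence point concrete (an added note, not on the original slide): writing a residual block as y = x + F(x), backpropagation gives

∂L/∂x = ∂L/∂y · (I + ∂F/∂x)

so the gradient reaching earlier layers always contains the direct term ∂L/∂y carried by the skip connection, even when ∂F/∂x is very small; a plain network has no such guaranteed path.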
Training ResNet in practice
• Batch Normalization after every CONV layer.
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9)
• Learning rate: 0.1, divided by 10 when validation error
plateaus.
• Mini-batch size 256.
• Weight decay of 1e-5.
• No dropout used. (A sketch of this training configuration follows below.)
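The recipe above could look roughly like the following (a minimal sketch assuming PyTorch; the dummy data and tiny model are placeholders standing in for ImageNet and the full ResNet):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Tiny stand-in model (conv + BN, as in the recipe: BN after every conv, no dropout).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1000))
criterion = nn.CrossEntropyLoss()                                    # softmax + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,              # learning rate 0.1
                            momentum=0.9,                            # SGD + momentum (0.9)
                            weight_decay=1e-5)                       # weight decay from the slide
# Divide the learning rate by 10 when the monitored error plateaus:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)

# Dummy loader; the slide's mini-batch size on ImageNet is 256.
loader = DataLoader(TensorDataset(torch.randn(32, 3, 64, 64),
                                  torch.randint(0, 1000, (32,))), batch_size=8)

for epoch in range(2):                                               # epoch count is a placeholder
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step(loss.item())                                      # placeholder for validation error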
Loss Function
• For measuring the loss of the model, a combination of softmax and cross-entropy is used.
• The network's output scores are first normalized into a probability distribution with the softmax function; the cross-entropy between this distribution and the true label gives the loss.
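Concretely (standard definitions, added here as a worked example): given logits z and true class y, softmax gives p_i = e^{z_i} / Σ_j e^{z_j}, and the loss is L = -log p_y. For example, with logits z = (2, 0, 0) and the first class correct, p_1 = e^2 / (e^2 + 1 + 1) ≈ 0.79, so L = -log 0.79 ≈ 0.24.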
Results
Experimental Results
- Able to train very deep networks without degradation (152 layers on ImageNet, 1202 on CIFAR-10).
- Deeper networks now achieve lower training error, as expected.
- Swept 1st place in all ILSVRC and COCO 2015 competitions.
- ILSVRC 2015 classification winner (3.6% top-5 error) -- better than "human performance" (Russakovsky 2014).
Comparing Plain to ResNet (18/34 Layers)
Comparing Plain to Deeper ResNet
[Figure: test error and training error curves comparing plain networks to deeper ResNets.]
ResNet on More than 1000 Layers
• To further improve learning in extremely deep ResNets, "Identity Mappings in Deep Residual Networks" (Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2016) suggests passing the input directly to the final residual layer, allowing the network to easily learn to pass the input through as an identity mapping in both the forward and backward passes.
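That paper's "pre-activation" block is a natural way to illustrate the idea; the sketch below assumes PyTorch and a fixed width, and is not the authors' code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block in the spirit of He et al., 2016:
    BN and ReLU come before each conv, and nothing is applied after the
    addition, so the skip path stays a pure identity from block to block."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))    # BN -> ReLU -> conv
        out = self.conv2(F.relu(self.bn2(out)))  # BN -> ReLU -> conv
        return out + x                           # identity shortcut, no post-activation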
Identity Mappings in Deep Residual Networks
Identity Mappings in Deep Residual Networks
Improvement on CIFAR-10
Reduce Learning Time with Random Layer Drops
• Drop layers randomly during training and use the full network at test time.
• Residual blocks are used as the network's building blocks.
• During training, the input flows through both the shortcut and the weight layers.
• Training: each residual block has a "survival probability" and is randomly dropped.
• Testing: all blocks are kept active, and each block's output is re-calibrated according to its survival probability from training (see the sketch below).
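A hedged sketch of such a stochastic-depth block (assuming PyTorch; the survival probability value and layer widths are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticDepthBlock(nn.Module):
    """Residual block with a random layer drop, sketching the idea above."""
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def residual(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        return self.bn2(self.conv2(out))

    def forward(self, x):
        if self.training:
            # Training: drop the whole residual branch with probability 1 - p,
            # so the block reduces to the identity shortcut for this pass.
            if torch.rand(1).item() < self.survival_prob:
                return F.relu(x + self.residual(x))
            return x
        # Testing: keep every block active, but re-calibrate (scale) the
        # residual branch by its survival probability.
        return F.relu(x + self.survival_prob * self.residual(x))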