Deep Learning on GPUs
March 2016
AGENDA
What is Deep Learning?
GPUs and DL
DL in practice
Scaling up DL
What is Deep Learning?
DEEP LEARNING EVERYWHERE
INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Traffic Sign Recognition
Traditional machine perception
Hand-crafted feature extractors

Raw data → Feature extraction → Classifier/detector → Result

Typical classifier/detector choices per task:
SVM, shallow neural net, HMM → Speaker ID, speech transcription
Clustering, HMM, LDA, LSA → Topic classification, machine translation, sentiment analysis
Deep learning approach
Train: labeled examples (dog, cat, raccoon, honey badger) are fed through the MODEL; prediction errors are propagated back to adjust the model's weights.
Deploy: the trained MODEL maps a new raw input directly to a result (e.g. "Dog").
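To make the train/deploy split concrete, here is a minimal sketch in Python/NumPy. The data, sizes, and learning rate are invented for illustration, and a toy softmax classifier stands in for the MODEL above; it is not the deck's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 classes and 64-dimensional "images" (invented sizes).
labels = ["dog", "cat", "raccoon", "honey badger"]
X = rng.normal(size=(200, 64))
y = rng.integers(0, 4, size=200)

W = np.zeros((64, 4))

def forward(X, W):
    scores = X @ W
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # softmax probabilities

# Train: push errors (predicted minus true) back through the model.
for step in range(100):
    errors = forward(X, W)
    errors[np.arange(len(y)), y] -= 1.0       # gradient of cross-entropy
    W -= 0.1 * X.T @ errors / len(y)          # gradient-descent update

# Deploy: a single forward pass maps a new input to a label.
test_image = rng.normal(size=(1, 64))
print(labels[int(forward(test_image, W).argmax())])
```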
Artificial neural network
A collection of simple, trainable mathematical units that collectively learn complex functions.

[Diagram: input layer → hidden layers → output layer]

Given sufficient training data, an artificial neural network can approximate very complex functions mapping raw data to output decisions.
Artificial neurons
[Figure: biological neuron vs. artificial neuron, from Stanford CS231n lecture notes]

An artificial neuron with inputs x1, x2, x3 and weights w1, w2, w3 computes y = F(w1*x1 + w2*x2 + w3*x3), where F is a nonlinearity such as the rectified linear unit F(x) = max(0, x).
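The formula translates directly to code. A minimal sketch (the input and weight values are arbitrary examples):

```python
import numpy as np

def relu(x):
    # F(x) = max(0, x), the rectified linear activation
    return np.maximum(0.0, x)

def neuron(x, w):
    # y = F(w1*x1 + w2*x2 + w3*x3): weighted sum, then nonlinearity
    return relu(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # inputs x1..x3 (example values)
w = np.array([0.8, 0.2, -0.4])   # weights w1..w3 (example values)
print(neuron(x, w))              # 0.4 - 0.2 - 0.8 = -0.6, ReLU -> 0.0
```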
Deep neural network (DNN)
Input → Raw data → Low-level features → Mid-level features → High-level features → Result

Application components:
Task objective, e.g. identify faces
Training data: 10-100M images
Network architecture: ~10 layers, 1B parameters
Learning algorithm: ~30 exaflops, ~30 GPU-days
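Stacking neurons into successive layers is what produces the low-, mid-, and high-level feature hierarchy above. A minimal forward pass in Python/NumPy (the layer widths are invented; a real network at the scale above has ~10 layers and up to a billion parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    # Each layer re-represents the previous layer's features:
    # raw data -> low-level -> mid-level -> high-level -> result
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [256, 128, 64, 32, 10]    # toy layer widths
layers = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes, sizes[1:])]
print(forward(rng.normal(size=256), layers).shape)   # (10,)
```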
Deep learning benefits

Robust: No need to design the features ahead of time; features are automatically learned to be optimal for the task at hand. Robustness to natural variations in the data is automatically learned.

Generalizable: The same neural net approach can be used for many different applications and data types.

Scalable: Performance improves with more data, and the method is massively parallelizable.
Baidu Deep Speech 2
End-to-end deep learning for English and Mandarin speech recognition.
The transition from English to Mandarin was made simpler by end-to-end DL: no feature engineering or Mandarin-specific components required.
More accurate than humans: 3.7% error rate vs. 4% for humans on test sets.
http://svail.github.io/mandarin/
http://arxiv.org/abs/1512.02595
AlphaGo
First computer program to beat a human Go professional.
Training DNNs: 3 weeks, 340 million training steps on 50 GPUs.
Play: asynchronous multi-threaded search, with simulations on CPUs and the policy and value DNNs in parallel on GPUs.
Single machine: 40 search threads, 48 CPUs, and 8 GPUs.
Distributed version: 40 search threads, 1202 CPUs, and 176 GPUs.
Outcome: beat both the European and World Go champions in best-of-5 matches.
http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
http://deepmind.com/alpha-go.html
Deep Learning for Autonomous vehicles
13
Deep Learning Synthesis
Texture synthesis and transfer using CNNs. Timo Aila et al., NVIDIA Research
THE AI RACE IS ON

[Chart: ImageNet classification accuracy rate, 2009-2016; traditional CV plateaus while deep learning climbs past 90%]

Milestones along the curve: IBM Watson achieves breakthrough in natural language processing; Facebook launches Big Sur; Baidu Deep Speech 2 beats humans; Google launches TensorFlow; Toyota invests $1B in AI labs; Microsoft & U. Science & Tech, China beat humans on IQ test questions.
The Big Bang in Machine Learning
DNN + BIG DATA + GPU

"Google's AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs... And it depends on these chips more than the larger tech universe realizes."
GPUs and DL
USE MORE PROCESSORS TO GO FASTER
Deep learning development cycle
Three Kinds of Networks
DNN: all fully connected layers
CNN: some convolutional layers
RNN: recurrent neural network, e.g. LSTM
DNN
The key operation is a dense matrix-vector multiply (M x V).
Backpropagation uses dense matrix-matrix multiplies, starting from the softmax scores.
DNN batching
Batching is used for training and for latency-insensitive inference.
Batching turns M x V into an M x M operation, giving re-use of the weights; without batching, each element of the weight matrix would be fetched from memory for just one multiply.
Modern compute architectures want 10-50 arithmetic operations per memory fetch, as sketched below.
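A minimal sketch of why batching helps (the sizes are invented): stacking B inputs turns the matrix-vector product into a matrix-matrix product, so every weight fetched from memory is reused B times, raising arithmetic intensity toward the 10-50 ops per fetch that the hardware wants.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 2048))   # weight matrix (example size)

# Unbatched: one M x V product; each weight is used once per input.
x = rng.normal(size=2048)
y = W @ x

# Batched: stack B inputs into a matrix; one M x M-style product
# reuses every fetched weight B times.
B = 32
X = rng.normal(size=(2048, B))
Y = W @ X                           # same math as 32 separate M x V calls
assert np.allclose(Y[:, 0], W @ X[:, 0])
```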
CNN
Requires convolution as well as M x V.
Filters are conserved (shared) across the image plane; see the sketch below.
Multiply-limited even without batching.
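A minimal 2D convolution sketch in Python/NumPy (sizes invented) showing what "filters conserved through the plane" means: one small filter is applied at every position, so its weights are reused constantly and the work is dominated by multiplies even for a single input.

```python
import numpy as np

def conv2d(image, filt):
    # Slide one shared filter over every position of the plane:
    # the same weights are reused at each (i, j), unlike a dense layer.
    H, W = image.shape
    k = filt.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * filt)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))   # one input plane (example size)
filt = rng.normal(size=(3, 3))      # one 3x3 filter, shared everywhere
print(conv2d(image, filt).shape)    # (30, 30)
```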
Other Operations
Further operations are needed to finish building a DNN (two representative examples are sketched below). These are not limiting factors with appropriate GPU use.
Complex networks have hundreds of millions of weights.
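The slide does not enumerate these operations, so the two below are my choice of representative examples: max-pooling and softmax, both cheap relative to the dense matrix products.

```python
import numpy as np

def max_pool(x, s=2):
    # Downsample by taking the max over non-overlapping s x s windows.
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

def softmax(scores):
    # Turn raw scores into probabilities (numerically stable form).
    e = np.exp(scores - scores.max())
    return e / e.sum()

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))                       # 2x2 grid of window maxima
print(softmax(np.array([1.0, 2.0, 3.0])))
```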
Lots of Parallelism Available in a DNN
13x Faster Training with Caffe
Tesla M40: the world's fastest accelerator for deep learning training. A GPU server with 4x Tesla M40 reduces training time from 13 days (dual-CPU server) to just 1 day.

Tesla M40 specifications:
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W (28 GFLOPS/W)

Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12-core 2.5 GHz CPUs, 128 GB system memory, Ubuntu 14.04.
Comparing server-class CPU and GPU
Xeon E5-2698 vs. Tesla M40
See the NVIDIA whitepaper "GPU-Based Deep Learning Inference: A Performance and Power Analysis."
DL in practice
The Engine of Modern AI
Education, laboratories, and start-ups (e.g. Vitruvian, Schults) build on the NVIDIA GPU platform, alongside Big Sur, TensorFlow, Watson, CNTK, Torch, Caffe, Theano, MatConvNet, Mocha.jl, Purine, Chainer, DL4J, Keras, OpenDeep, Minerva, and MXNet*.
* MXNet: U. Washington, CMU, Stanford, TuSimple, NYU, Microsoft, U. Alberta, MIT, NYU Shanghai
CUDA for Deep Learning Development
DEEP LEARNING SDK: DIGITS, cuDNN, cuSPARSE, cuBLAS, NCCL
Hardware and systems: TITAN X, DEVBOX, GPU CLOUD
cuDNN: Deep Learning Primitives
GPU-accelerated deep learning subroutines for high-performance neural network training, accelerating artificial intelligence.
Accelerates major deep learning frameworks: Caffe, Theano, Torch, TensorFlow.
Tiled FFT up to 2x faster than FFT.
Up to 3.5x faster AlexNet training in Caffe than the baseline GPU.
[Chart: millions of images trained per day, rising steadily from cuDNN 1 through cuDNN 4]
developer.nvidia.com/cudnn
Caffe Performance
CUDA boosts deep learning performance 5x in 2 years.
[Chart: Caffe AlexNet training performance, 11/2013 to 12/2015: K40, K40 + cuDNN 1, M40 + cuDNN 3, M40 + cuDNN 4]
Note: AlexNet training throughput based on 20 iterations; CPU: 1x E5-2680v3 12-core 2.5 GHz, 128 GB system memory, Ubuntu 14.04.
NVIDIA DIGITS
Interactive Deep Learning GPU Training System
Workflow: Process Data → Configure DNN → Monitor Progress → Visualize Layers → Test Image
developer.nvidia.com/digits
ONE ARCHITECTURE, END-TO-END AI
Tesla for cloud; Titan X for PC (gaming); DRIVE PX for auto; Jetson for embedded.
Scaling DL
Scaling Neural Networks: Data Parallelism
[Diagram: Image 1 on Machine 1, Image 2 on Machine 2; the weights W are synchronized between machines]

Notes:
Need to sync the model across machines.
The largest models do not fit on one GPU.
Requires a P-fold larger batch size for P workers.
Works across many nodes; the parameter-server approach gives linear speedup. A sketch of the pattern follows below.

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
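A single-process sketch of the data-parallel pattern in Python/NumPy. All names and sizes are invented, and a least-squares gradient stands in for backprop; a real system would average gradients via a parameter server or all-reduce across machines.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 4                                      # number of workers (machines/GPUs)
W = rng.normal(scale=0.1, size=(10, 64))   # model, replicated on every worker

def gradient(W, X, y):
    # Least-squares gradient on one worker's shard (stand-in for backprop).
    return 2 * (W @ X.T - y) @ X / len(X)

# A P-fold larger global batch, split into one shard per worker.
X = rng.normal(size=(P * 32, 64))
y = rng.normal(size=(10, P * 32))
shards = zip(np.split(X, P), np.split(y, P, axis=1))

# Each worker computes its local gradient; syncing = averaging the
# gradients before a shared weight update.
grads = [gradient(W, Xs, ys) for Xs, ys in shards]
W -= 0.01 * np.mean(grads, axis=0)
```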
Multiple GPUs
Near-linear scaling with data parallelism.
Ren Wu et al. (Baidu), "Deep Image: Scaling up Image Recognition," arXiv 2015.
Scaling Neural Networks: Model Parallelism
[Diagram: a single image processed jointly by Machine 1 and Machine 2, with the weights W split between them]

Notes:
Allows for larger models than fit on one GPU.
Requires much more frequent communication between GPUs.
Most commonly used within a node, via GPU peer-to-peer (P2P).
Effective for the fully connected layers; a sketch follows below.

Adam Coates, Brody Huval, Tao Wang, David J. Wu, Andrew Ng and Bryan Catanzaro
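A single-process sketch of model parallelism for one fully connected layer (sizes invented; the two halves stand in for two GPUs that must exchange results over P2P after every layer, which is why communication is so frequent).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2048)                     # layer input (example size)
W = rng.normal(scale=0.1, size=(4096, 2048))  # weight matrix, too big for
                                              # one device in this scenario

# Split the weight matrix by output rows across two devices.
W0, W1 = np.split(W, 2, axis=0)

# Each device computes its half of the output from the full input...
y0 = W0 @ x          # on "GPU 0"
y1 = W1 @ x          # on "GPU 1"

# ...then the halves are exchanged (P2P) so the next layer sees all of y.
y = np.concatenate([y0, y1])
assert np.allclose(y, W @ x)
```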
Scaling Neural Networks: Hyper-Parameter Parallelism
Try many alternative neural networks in parallel on different CPUs / GPUs / machines; a sketch follows below.
Probably the most obvious and effective way to scale!
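A sketch of hyper-parameter parallelism using only the Python standard library. The search space and the scoring function are invented stand-ins for real training runs; the point is that trials are independent, so no gradient synchronization is needed and they can run on separate CPUs, GPUs, or machines.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def train_and_score(params):
    # Stand-in for training one network with these hyper-parameters
    # and returning its validation accuracy.
    lr, layers = params
    return 1.0 / (1.0 + abs(lr - 0.01)) + 0.01 * layers

# Sample candidate settings (learning rate, depth) at random.
rng = random.Random(0)
trials = [(10 ** rng.uniform(-4, -1), rng.randint(2, 10)) for _ in range(16)]

if __name__ == "__main__":
    # Run every trial in a separate process; the best setting wins.
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_and_score, trials))
    best = max(zip(scores, trials))
    print("best score %.3f with lr=%.4g, layers=%d" % (best[0], *best[1]))
```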
Deep Learning Everywhere
NVIDIA DRIVE PX
NVIDIA Tesla
NVIDIA Jetson
NVIDIA Titan X
Contact: jbarker@nvidia.com