Building Smart SoCs: Holger Keding
Using Virtual Prototyping for the Design and SoC Integration of Deep
Learning Accelerators
Holger Keding
Solutions Architect
• Deep Learning Accelerator optimizations
  – Schedule workload on parallel hardware engines
  – Optimize and reduce data transfers to and from memory
[Figure: SoC block diagram with multi-core CPU, AI hardware accelerators, interconnect, SRAM, HBM, and IO]
AI SoC Design Challenges
Brute-force Processing of Huge Data Sets
• Choosing the right algorithm and architecture: CPU, GPU, FPGA, vector DSP, ASIP
– CNN graphs evolving fast, need short time to market, cannot optimize for one single graph
– Joint design of algorithm, compiler, and target architecture
– Joint optimization of power, performance, accuracy, and cost
• Highly parallel compute drives memory requirements
– High on-chip and chip-to-chip bandwidth at low latency
– High memory bandwidth requirements for parameters and layer-to-layer communication
• Performance analysis requires realistic workloads to consider dynamic effects
– Scheduling of AI operators on parallel processing elements
– Unpredictable interconnect and memory access latencies
Large Design Space drives Differentiation by
AI Algorithm & Architecture
Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
• Import network algorithms as prototxt and generate an analytical model spreadsheet
• Find suitable configuration and scaling parameters in the analytical model
• Validate first results and explore the architecture for dynamic and power aspects using Virtual Platforms
• Summary
How to design a DLA?
[Figure: refinement flow from Analytical Models to a High-Level Architecture model to RTL Simulation, with "refine" arrows forward and "validate" / "back-annotate" arrows backward]
• Analytical Models: + good first order; + results within minutes; – omits dynamic effects
• High-Level Architecture (~ varying accuracy): + good for hardware exploration; + simulations in minutes/hours
• RTL Simulation: + perfect accuracy; – high computational needs; – high turn-around costs
Analytical Performance Models
Simple Example: Amdahl’s Law [1]
[1] Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities (1967)
Analytical Models – Roofline Models (1)
Operational intensity example: 2 operations / 8 bytes fetched = 0.25 ops/byte
Roofline: An Insightful Visual Performance Model for Multicore Architectures (Williams, Waterman, Patterson, 2009)
Analytical Models – Roofline Models (2)
[Figure: roofline plot with a flat roof at the theoretical maximum compute power and a slanted roof whose slope is the maximum memory bandwidth, i.e. attainable performance = op_intensity · mem_bandwidth_peak; example point at 2 operations / 8 bytes fetched = 0.25 ops/byte]
Analytical Models – Roofline Models (3)
[Figure: roofline plot with the memory-bound region under the slanted roof (left) and the compute-bound region under the flat roof (right)]
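The roofline bound can be sketched in a few lines of Python. The peak compute and bandwidth numbers below are assumed for illustration; only the 0.25 ops/byte example comes from the slides.

```python
# Roofline model: attainable performance is capped either by peak compute
# or by operational intensity times peak memory bandwidth.
def roofline_gops(op_intensity, peak_gops, peak_gb_per_s):
    """Attainable performance (GOPS) for a kernel with the given ops/byte."""
    return min(peak_gops, op_intensity * peak_gb_per_s)

# Example from the slides: 2 operations per 8 bytes fetched = 0.25 ops/byte
op_intensity = 2 / 8

# Assumed machine: 2000 GOPS peak compute, 100 GB/s peak memory bandwidth
print(roofline_gops(op_intensity, 2000.0, 100.0))  # 25.0 GOPS, memory bound

# Ridge point of this assumed machine: kernels above 20 ops/byte are compute bound
print(2000.0 / 100.0)  # 20.0
```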
Example: Analytical Model for CNN Convolutional Layer (1)
Conv1 of AlexNet
nMAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200
[Figure: tiling of the convolution along the width + height, channel, and kernel dimensions]
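The MAC count above is a plain product of the layer dimensions; a minimal Python check, with dimension names as on the slide:

```python
def conv_macs(oh, ow, oc, kh, kw, ic):
    """Multiply-accumulate count of one convolution layer
    (oh/ow are already the output height/width, so no stride terms appear)."""
    return oh * ow * oc * kh * kw * ic

# Conv1 of AlexNet: 55x55x96 output, 11x11 kernel, 3 input channels
n_mac = conv_macs(oh=55, ow=55, oc=96, kh=11, kw=11, ic=3)
print(n_mac)  # 105415200
```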
Example: Analytical Model for CNN Convolutional Layer (6)
Conv1 with tiling
Shift Left Architecture Analysis and Optimization
[Figure: the Neural Network is translated into an NN Workload Model and mapped onto a Deep Learning Accelerator SoC model (multi-core CPU, AI accelerator, interconnect, SRAM, DDR/HBM, IO) to explore power/performance results, mirroring the production flow of Neural Network, model compiler, and Deep Learning Accelerator SoC]
Platform Architect Ultra
Providing a Comprehensive Library of Generic and Vendor Specific Models
– Workload activity
– Utilization of resources
– Interconnect metrics
  • Latency, throughput, contention
  • Outstanding transactions
  • …
[Figure: virtual prototype (interconnect, memory subsystem, DMA) recording these metrics]
System Level Power Modeling
• Workload Model
  – Task-level parallelism and dependencies
  – Tasks characterized with processing cycles (e.g. "proc conv") and memory accesses (e.g. "read image", "read kernel")
• SoC Platform Model
  – Accurate SystemC transaction-level models of processing elements, interconnect, and memory
• System-level Power Overlay Model
  – Define a power state machine per component (e.g. sleep, idle, active, page miss/hit)
  – Bind IP power models to the Virtual Prototype
  – Measure power and performance based on real activity and utilization, with energy/power recording
[Figure: task graph (Tasks A to D) mapped onto the virtual prototype (ACC, DMA, interconnect, memory subsystem) with per-IP power state machines and energy/power recording]
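A power overlay of this kind reduces to a state machine whose energy is each state's power times the time spent in it. The state names and mW values below are assumed for illustration, not vendor IP power data.

```python
# Assumed per-state power (mW) for one component's power state machine.
POWER_MW = {"sleep": 1.0, "idle": 20.0, "active": 150.0}

def energy_uj(residency):
    """Energy in microjoules from (state, duration_ms) records, since mW * ms = uJ."""
    return sum(POWER_MW[state] * ms for state, ms in residency)

# State residency as it might be recorded during a virtual prototype run:
trace = [("sleep", 10.0), ("idle", 5.0), ("active", 2.0)]
print(energy_uj(trace))  # 410.0 uJ
```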
Platform Architect Ultra AI Exploration Pack (XP)
Exploration & optimization of AI designs
• Automated generation of workloads from AI frameworks
  – AI Operator Library for Neural Network modeling, e.g. Convolution, MatMul, MaxPool, BatchNorm, etc.
  – Example workload model of the ResNet-50 Neural Network
  – Utility to convert a prototxt description to a workload model using the AI Operator Library
• AI-centric HW architecture model library
  – VPUs configured to represent AI compute and DMA engines
  – Interconnect and memory subsystem models
  – Example performance model of the NVIDIA Deep Learning Accelerator (NVDLA)
• AI-centric analysis views: memory + processing utilization
[Figure: CNN converted via the AI Operator Library into a workload model; NVDLA performance model example]
Workload Model of One Convolution Layer
[Figure: task graph with "read input" and "read coefficients" feeding "calculate convolutions", which feeds "write output feature maps"; annotated with AI algorithm params, mapping params, and workload params]
Scaling parameters reflect the DLA architecture and can be taken from the analytical model.
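A workload model of this shape can be sketched as a small task graph. The task names mirror the slide; the byte counts are derived from AlexNet Conv1 dimensions, and the cycle count divides nMAC by an assumed SIMD width of 64 (the scaling parameter taken from the analytical model).

```python
SIMD_WIDTH = 64  # scaling parameter from the analytical model (assumed here)

# Each task: processing cycles, bytes moved, and dependencies.
tasks = {
    "read_input":        {"cycles": 0, "bytes": 227 * 227 * 3, "deps": []},
    "read_coefficients": {"cycles": 0, "bytes": 11 * 11 * 3 * 96, "deps": []},
    "calc_convolutions": {"cycles": 105_415_200 // SIMD_WIDTH, "bytes": 0,
                          "deps": ["read_input", "read_coefficients"]},
    "write_output":      {"cycles": 0, "bytes": 55 * 55 * 96,
                          "deps": ["calc_convolutions"]},
}

def topo_order(tasks):
    """Dependency-respecting execution order of the task graph."""
    order, done = [], set()
    while len(order) < len(tasks):
        for name, t in tasks.items():
            if name not in done and set(t["deps"]) <= done:
                order.append(name)
                done.add(name)
    return order

print(topo_order(tasks))
```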
Example: ResNet-18 (Inference) with NVDLA
[Figure: ResNet-18 Neural Network, imported as prototxt into a ResNet-18 task graph (workload model generated with AI-XP), mapped onto the NVDLA platform]
Goals: 100 ms latency, minimize power, minimize energy
Optimize hardware configuration:
– SIMD width
– Burst size, outstanding transactions
– Speed of DDR memory and of the data path
Example: Brief Overview of NVDLA
Convolution Engine (CONV_CORE)
• Works on two sets of data: offline-trained kernels (weights) and input
features (images)
• Configurable MAC units and convolutional buffer (RAM)
• Executes operations such as tf.nn.conv2d
Single Data Point Processor (SDP)
• Applies linear and non-linear (activation) functions onto individual data points.
• Executes e.g. tf.nn.batch_normalization, tf.nn.bias_add, tf.nn.elu, tf.nn.relu,
tf.sigmoid, tf.tanh, and more.
Planar Data Processor (PDP)
• Applies common CNN spatial operations such as min/max/avg pooling
• Executes e.g. tf.nn.avg_pool, tf.nn.max_pool, tf.nn.pool.
Cross-channel Data Processor (CDP)
• Processes data from different channels/features, e.g. local response normalization
(LRN) function
• Executes e.g. tf.nn.local_response_normalization
Data Reshape Engine (RUBIK)
• Performs data format transformations (splitting, slicing, merging, …)
• Executes e.g. tf.nn.conv2d_transpose, tf.concat, tf.slice, etc.
VP Simulation Results of Initial Configuration
[Analysis views: task trace, transaction trace, DDR utilization, resource utilization, throughput, outstanding transactions]
AlexNet (Norm1):
– Expected: 580,800 bytes
– Measured: 654,720 bytes
– Inflation by ~12.7%: "dark bandwidth"
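The inflation figure is just the ratio of measured to expected DRAM traffic:

```python
expected = 580_800  # bytes predicted by the analytical model for AlexNet Norm1
measured = 654_720  # bytes observed in the virtual prototype simulation
inflation = measured / expected - 1
print(f"{inflation:.1%}")  # 12.7% of "dark bandwidth"
```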
Simulation Reveals Implementation Effects… (2)
Differences between calculated and measured execution time
[Figure: performance vs. SIMD width (8, 16, 32, 64, 128) with CONV DMA load and CONV PE load curves: processing bound at small SIMD widths, memory bandwidth bound at large widths, diminishing performance gains in between]
DDR Memory Bandwidth and Power Improvement
[Figure: resource utilization (DMA, Conv PE) and power consumption (Conv PE power, DDR power) for SIMD-64 vs. SIMD-128; callouts of "25% faster", "10% lower", and "20% lower" comparing the two configurations on speed, total energy, and average power]
ResNet-18 Example Sweep
Goal: 100 ms latency, minimize power & energy
Sweep parameters:
– Burst size: 16, 32
– Outstanding transactions: 4, 8
– DDR memory speed: DDR4-1866, DDR4-2400
– Clock frequency of data path: 1, 1.33, 2 GHz
– SIMD width: 64, 128 operations per cycle
[Sensitivity and root-cause analysis views]
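A sweep like this is a Cartesian product over the parameter lists. The `simulate()` cost model below is a hypothetical stand-in for a Platform Architect simulation run, included only so the selection logic is runnable; the parameter values are the ones from the slide.

```python
from itertools import product

SWEEP = {
    "burst_size": [16, 32],
    "outstanding_tx": [4, 8],
    "ddr": ["DDR4-1866", "DDR4-2400"],
    "datapath_ghz": [1.0, 1.33, 2.0],
    "simd_width": [64, 128],
}

def simulate(cfg):
    """Hypothetical cost model returning (latency_ms, avg_power_mw)."""
    compute_ms = 5000.0 / (cfg["simd_width"] * cfg["datapath_ghz"])
    mem_ms = (80.0 if cfg["ddr"] == "DDR4-2400" else 100.0) / cfg["outstanding_tx"]
    latency = max(compute_ms, mem_ms) + 64.0 / cfg["burst_size"]
    power = 50.0 + 0.3 * cfg["simd_width"] * cfg["datapath_ghz"]
    return latency, power

# Enumerate all configurations, keep those meeting the 100 ms latency goal,
# then pick the one with minimum average power.
points = [dict(zip(SWEEP, vals)) for vals in product(*SWEEP.values())]
feasible = [c for c in points if simulate(c)[0] <= 100.0]
best = min(feasible, key=lambda c: simulate(c)[1])
print(len(points), best)
```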
Sweep Over Hardware Parameters, Latency
[Figure: latency per sweep point, broken down by outstanding transactions, data-path GHz, SIMD width, DDR4 speed, and burst size]
Power/Performance/Energy Trade-off Analysis
[Figure: power/performance/energy scatter with the optimal solution highlighted; points annotated with outstanding transactions, data-path GHz, SIMD width, burst size, and DDR speed]
Example: ResNet-18 with NVDLA
[Figure: ResNet-18 Neural Network, generated into a ResNet-18 task graph, mapped onto the Virtual HW Platform, analyzed for power/performance and sensitivity]
Goal:
– 100 ms latency, minimize energy
Thank You
Questions