
Building Smart SoCs

Using Virtual Prototyping for the Design and SoC Integration of Deep
Learning Accelerators

Holger Keding
Solutions Architect

© Accellera Systems Initiative


Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
  – Import network algorithms as prototxt and generate an analytical model spreadsheet
  – Find suitable configuration and scaling parameters in the analytical model
  – Validate first results, and explore the architecture for dynamic and power aspects using Virtual Platforms
• Summary
Increasing number of AI Accelerators

Source: Qualcomm AI Day Speaker Presentation 2019


Deep Learning Technology Trends
• New Neural Network algorithms
  – Higher accuracy, lower size, and less processing
  – But: less data re-use, fewer cycles per byte
• Neural Network Compiler optimizations
  – Loop tiling, unrolling, and parallelization
  – Splitting and fusing of Neural Network layers
  – Memory layout optimization across layers
  – Optimized code generation to utilize available hardware accelerators
• Deep Learning Accelerator optimizations
  – Schedule workload on parallel hardware engines
  – Optimize and reduce data transfers to and from memory

[Figure: flow from Neural Network through Neural Network Compiler onto an AI SoC containing a Deep Learning Accelerator, multi-core CPU, SRAM, IO, interconnect, and DDR/HBM]
AI SoC Design Challenges
Brute-force Processing of Huge Data Sets

• Choosing the right algorithm and architecture: CPU, GPU, FPGA, vector DSP, ASIP
  – CNN graphs evolve fast; with short time to market, one cannot optimize for a single graph
  – Joint design of algorithm, compiler, and target architecture
  – Joint optimization of power, performance, accuracy, and cost
• Highly parallel compute drives memory requirements
  – High on-chip and chip-to-chip bandwidth at low latency
  – High memory bandwidth requirements for parameters and layer-to-layer communication
• Performance analysis requires realistic workloads to consider dynamic effects
  – Scheduling of AI operators on parallel processing elements
  – Unpredictable interconnect and memory access latencies

The large design space drives differentiation by AI algorithm & architecture.
Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
  – Import network algorithms as prototxt and generate an analytical model spreadsheet
  – Find suitable configuration and scaling parameters in the analytical model
  – Validate first results, and explore the architecture for dynamic and power aspects using Virtual Platforms
• Summary
How to design a DLA?
Models at different abstraction levels, connected by refine steps downwards and validate/back-annotate steps upwards:

• Analytical Models
  + Good first order
  + Results within minutes
  - Omits dynamic effects
• High-Level Architecture Model
  + Good for hardware exploration
  + Simulations in minutes/hours
  ~ Varying accuracy
• Functional LT Model (VDK)
  + Good for SW development
  + Simulations in minutes/hours
  + Trace ops, memory accesses
  - Low timing accuracy
• RTL Simulation
  + Perfect accuracy
  - High computational needs
  - High turn-around costs
Analytical Performance Models
Simple Example: Amdahl's Law [1]

[1] G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," 1967.

• A simple, insightful formula, though with restricted applicability.
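In standard form, with parallelizable fraction p of the work and N processors, the speedup is

S(N) = 1 / ((1 − p) + p/N), which saturates at 1 / (1 − p) for N → ∞.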


• “All models are wrong but some are useful” (George Box, 1978)
Analytical Models – Roofline Models (1)

[Figure: observed performance plotted against the theoretical maximum compute power pp(freq_clk, #resources), with ceilings for thread-level parallelism only and for added ILP or SIMD]

Example of operational intensity: 2 operations per 8 bytes fetched = 0.25 ops/byte.

Roofline: An Insightful Visual Performance Model for Multicore Architectures (Williams, Waterman, Patterson, 2009)
Analytical Models – Roofline Models (2)

Attainable performance = min( pp(freq_clk, #resources), op_intensity · mem_bandwidth_peak )

In the plot, the slanted roof has slope equal to the maximum memory bandwidth; e.g., at 0.25 ops/byte (2 operations per 8 bytes fetched), performance is capped at 0.25 · mem_bandwidth_peak.
Analytical Models – Roofline Models (3)

[Figure: the roofline divides the plot into a memory-bound region (low operational intensity, under the bandwidth slope) and a compute-bound region (high operational intensity, under the flat compute roof)]
Analytical Models – Roofline Models

[Figure: roofline model]
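As a sketch (not from the slides), the roofline bound translates directly into Python; the peak compute and bandwidth numbers below are assumed placeholder values, not data from the presentation:

def roofline_attainable_perf(op_intensity, peak_ops, peak_bw):
    # op_intensity: operations per byte moved to/from memory
    # peak_ops:     theoretical maximum compute power (ops/s)
    # peak_bw:      maximum memory bandwidth (bytes/s)
    return min(peak_ops, op_intensity * peak_bw)

# Example: 0.25 ops/byte (2 operations per 8 bytes fetched) on assumed
# hardware with 2e12 ops/s peak compute and 25.6e9 B/s memory bandwidth.
print(roofline_attainable_perf(0.25, 2e12, 25.6e9))  # 6.4e9 -> memory bound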
Example: Analytical Model for CNN Convolutional Layer (1)
Conv1 of AlexNet

Maths textbook convolution algorithm:

for (row = 0; row < oh; row++) {
  for (col = 0; col < ow; col++) {
    for (k = 0; k < oc; k++) {
      for (ti = 0; ti < ic; ti++) {
        for (i = 0; i < kh; i++) {
          for (j = 0; j < kw; j++) {
L:          outputfm[k][row][col] +=
              kernels[k][ti][i][j] *
              inputfm[ti][sw*row + i][sh*col + j];
}}}}}}

Number of multiply-accumulate operations:
n_MAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200
Example: Analytical Model for CNN Convolutional Layer (2)
Conv1 of AlexNet – but here we assume an unlimited amount of local memory

n_MAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200

d_MAC = d_ifmap + d_kernel = (iw · ih · ic + kw · kh · ic · oc) · B_i ≈ 0.38 MB

⇒ Operational Intensity I = n_MAC / d_MAC ≈ 278 ops/B
Example: Analytical Model for CNN Convolutional Layer (3)
Conv1 of AlexNet – opposite extreme: we assume no local memory

n_MAC = oh · ow · oc · kw · kh · ic = 55 · 55 · 96 · 11 · 11 · 3 = 105,415,200

d_MAC = 2 · n_MAC · B_i ≈ 420 MB

⇒ Operational Intensity I = n_MAC / d_MAC = 0.25 ops/B
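The two extremes can be reproduced in a few lines of Python; B_i = 2 bytes per element and the 227×227 input size are assumptions that match the slide's numbers:

# AlexNet conv1 dimensions, as used on the slides
oh, ow, oc = 55, 55, 96        # output feature map height/width/channels
kh, kw, ic = 11, 11, 3         # kernel height/width, input channels
ih, iw     = 227, 227          # input feature map size (assumed)
B_i        = 2                 # assumed bytes per element (16-bit data)

n_mac = oh * ow * oc * kw * kh * ic                        # 105,415,200 MACs

# Extreme 1: unlimited local memory -> fetch ifmap and all kernels once
d_unlimited = (iw * ih * ic + kw * kh * ic * oc) * B_i     # ~0.38 MB
# Extreme 2: no local memory -> every MAC fetches 2 operands from DRAM
d_none = 2 * n_mac * B_i                                   # ~420 MB

print(n_mac / d_unlimited)     # ~278 ops/byte
print(n_mac / d_none)          # 0.25 ops/byte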
Example: Analytical Model for CNN Convolutional Layer (4)
Conv1 of AlexNet – practical setup: a limited amount of local memory

The maths textbook convolution algorithm again, now to be restructured for limited on-chip buffers:

for (row = 0; row < oh; row++) {
  for (col = 0; col < ow; col++) {
    for (k = 0; k < oc; k++) {
      for (ti = 0; ti < ic; ti++) {
        for (i = 0; i < kh; i++) {
          for (j = 0; j < kw; j++) {
L:          outputfm[k][row][col] +=
              kernels[k][ti][i][j] * inputfm[ti][sw*row + i][sh*col + j];
}}}}}}
Example: Analytical Model for CNN Convolutional Layer (5)
Conv1 of AlexNet – with very simple tiling across the width + height + channel + kernel dimensions, for a practical setup with a limited amount of local memory
Example: Analytical Model for CNN Convolutional Layer (6)
Conv1 with tiling

Source: Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, Chen Zhang et al., 2015
Example: Analytical Model for CNN Convolutional Layer (6, cont.)
Conv1 with tiling

Now it gets more tricky: we must take into account the non-integer relations between tiling parameters and channel dimensions. Tiling also brings the operational intensity closer to the optimum HW utilization point.
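A sketch of how the non-integer relations enter such a traffic estimate: the ceil() terms round the number of tile iterations up. This follows the spirit of the Zhang et al. model; the exact formulation in the paper differs, and the function signature is illustrative:

from math import ceil

def tiled_dram_traffic(oh, ow, oc, ic, kh, kw, sh, sw, Tr, Tc, Tm, Tn, B_i=2):
    # Times the input/weight buffers must be (re)loaded; ceil() captures
    # non-integer relations of tiling parameters and layer dimensions.
    trips = ceil(oh / Tr) * ceil(ow / Tc) * ceil(oc / Tm) * ceil(ic / Tn)
    in_tile   = ((Tr - 1) * sh + kh) * ((Tc - 1) * sw + kw) * Tn  # input tile
    w_tile    = kh * kw * Tn * Tm                                 # weight tile
    out_tile  = Tr * Tc * Tm                                      # output tile
    out_trips = ceil(oh / Tr) * ceil(ow / Tc) * ceil(oc / Tm)
    return (trips * (in_tile + w_tile) + out_trips * out_tile) * B_i

Dividing n_MAC by this traffic estimate gives the operational intensity for a candidate tiling, which can then be swept against the roofline.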
Example: Analytical Model, Mapping Conv to HW Resources

The number of MAC cells can be configured to scale peak performance up or down; the MAC cell number and depth should match the tiling parameters.
Roofline model

[Roofline plots: attainable performance vs. operational intensity (operations/byte)]
Analytical Model as Python-Generated Spreadsheet
Expressions represent both the algorithm and the HW -> calculate attainable performance, exploring different numbers of MAC cells and their depth.
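A minimal sketch of what such a generator might look like, writing CSV instead of a full spreadsheet; the peak bandwidth, clock, and swept MAC-array sizes are assumed placeholder values, not the presentation's data:

import csv

PEAK_BW = 25.6e9   # assumed DRAM bandwidth in bytes/s
FREQ    = 1.0e9    # assumed MAC array clock in Hz
OP_INT  = 278      # ops/byte, from the analytical layer model above

with open("dla_sweep.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["mac_cells", "mac_depth", "peak_macs_per_s", "attainable"])
    for cells in (16, 32, 64, 128):
        for depth in (8, 16, 32):
            peak = cells * depth * FREQ            # MACs per second
            w.writerow([cells, depth, peak, min(peak, OP_INT * PEAK_BW)])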
Analytical Model Summary
What is achieved and what comes next?

What we have seen:
+ Good first-order analysis of static effects
+ Results within minutes
~ Requires deep understanding of both algorithm and architecture

What is not covered:
- Implementation overhead is hard to predict and not 'priced in' in the first round
- Omits dynamic effects
- Joint performance and power analysis is difficult
How to design a DLA?
Models at different abstraction levels, connected by refine steps downwards and validate/back-annotate steps upwards:

• Analytical Models
  + Good first order
  + Results within minutes
  - Omits dynamic effects
• High-Level Architecture Model
  + Good for hardware exploration
  + Simulations in minutes/hours
  ~ Varying accuracy
• Functional LT Model (VDK)
  + Good for SW development
  + Simulations in minutes/hours
  + Trace ops, memory accesses
  - Low timing accuracy
• RTL Simulation
  + Perfect accuracy
  - High computational needs
  - High turn-around costs
Shift Left Architecture Analysis and Optimization

[Figure: the Neural Network is translated (via the Neural Network Compiler) into an NN Workload Model, which is mapped onto a model of the AI SoC platform (Deep Learning Accelerator, multi-core CPU, SRAM, IO, interconnect, DDR/HBM). Power and performance are explored on the model, and the results guide further exploration.]
Platform Architect Ultra
Providing a Comprehensive Library of Generic and Vendor-Specific Models

Flow: Capture Workload Model -> Capture Architecture Model -> Analyze Power & Performance

• Interconnect Models
  – Generic: SBL-TLM2-FT (AXI), SBL-GCCI (ACE, CHI)
  – IP specific: Arteris FlexNoC & Ncore, Arm AHB/APB, Arm PL300, Arm SBL-301, Arm SBL-400, Synopsys DW AXI
• Memory Subsystems
  – Generic multiport memory controller (GMPMC)
  – DesignWare uMCTL2 memory controller
  – DesignWare LPDDR5 memory controller
  – Co-simulate with RTL
• Traffic, Processors, RTL
  – Task-based and trace-based workload models
  – User traffic scenarios
  – Cycle-accurate processor models for ARM, ARC, Tensilica, and CEVA
  – RTL co-simulation/emulation
Workload Modeling and Mapping
• Workload Model
  – Task-level parallelism and dependencies
  – Characterized with processing cycles and memory accesses (e.g., cycles: 2000, rd_bytes: 0x200, wr_bytes: 0)
• SoC Platform Model
  – Accurate SystemC transaction-level models of processing elements, interconnect, and memory
• Map workload to platform
• Analyze performance metrics
  – End-to-end constraints
  – Workload activity
  – Utilization of resources
  – Interconnect metrics: latency, throughput, contention, outstanding transactions, …

[Figure: task graph (Task A -> Task B "read image" and Task C "read kernel" -> Task D "proc conv") mapped onto a virtual prototype with ACC, DMA, interconnect, and memory subsystem; activity is recorded during simulation]
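A minimal sketch of how such a task graph could be captured as data, assuming a simple dictionary format; the cycle/byte values are illustrative, and this is not the tool's actual input format:

# "deps" encodes task-level parallelism: B and C may run concurrently
# after A; D waits for both. Each task carries its characterization.
workload = {
    "A":             {"cycles": 0,    "rd_bytes": 0x200, "wr_bytes": 0, "deps": []},
    "B_read_image":  {"cycles": 2000, "rd_bytes": 0,     "wr_bytes": 0, "deps": ["A"]},
    "C_read_kernel": {"cycles": 2000, "rd_bytes": 0,     "wr_bytes": 0, "deps": ["A"]},
    "D_proc_conv":   {"cycles": 2000, "rd_bytes": 0,     "wr_bytes": 0,
                      "deps": ["B_read_image", "C_read_kernel"]},
}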
System Level Power Modeling
• Workload Model
  – Task-level parallelism and dependencies
  – Characterized with processing cycles and memory accesses
• SoC Platform Model
  – Accurate SystemC transaction-level models of processing elements, interconnect, and memory
• System-level Power Overlay Model
  – Define a power state machine per component (e.g., sleep, idle, active, page miss, page hit)
  – Bind IP power models to the Virtual Prototype
  – Measure power and performance based on real activity and utilization; the Virtual Prototype records energy/power over time

[Figure: the task graph mapped onto the virtual prototype (ACC, DMA, interconnect, memory subsystem), with per-component power state machines annotated]
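A minimal sketch of the power-overlay idea: one power number per state, with energy integrated over the recorded state trace. The states follow the slide; the power values are assumptions, not Synopsys model data:

STATE_POWER_MW = {"sleep": 0.1, "idle": 5.0, "active": 120.0,
                  "page_miss": 180.0, "page_hit": 90.0}

def energy_mj(trace):
    # trace: list of (state, duration_in_seconds) records, as the
    # virtual prototype would log them during simulation
    return sum(STATE_POWER_MW[state] * dt for state, dt in trace)

print(energy_mj([("idle", 0.2), ("active", 0.5), ("sleep", 0.3)]))  # 61.03 mJ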
Platform Architect Ultra AI Exploration Pack (XP)
Exploration & optimization of AI designs
• Automated generation of workloads from AI frameworks
  – AI Operator Library for Neural Network modeling (e.g., Convolution, Matmul, MaxPool, BatchNorm, etc.)
  – Example workload model of the ResNet50 Neural Network
  – Utility to convert a prototxt description to a CNN workload model using the AI operator library
• AI-centric HW architecture model library
  – VPUs configured to represent AI compute and DMA engines
  – Interconnect and memory subsystem models
  – Example performance model of the NVIDIA Deep Learning Accelerator (NVDLA)
• AI-centric analysis views: memory + processing utilization
Workload Model of One Convolution Layer

[Figure: task chain (read input, read coefficients, calculate convolutions, write output feature maps), parameterized by AI algorithm params, mapping params, and workload params]

Scaling parameters reflect the DLA architecture; they can be taken from the analytical model.
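A sketch of how this layer workload could be parameterized, with the SIMD width as the scaling parameter taken from the analytical model; the task format and names are illustrative, not the tool's file format:

def conv_layer_tasks(ih, iw, ic, oh, ow, oc, kh, kw, B_i=2, simd_width=64):
    # simd_width (MACs per cycle) is the mapping/scaling parameter that
    # reflects the DLA architecture
    n_mac = oh * ow * oc * kh * kw * ic
    return [
        {"task": "read_input",        "rd_bytes": ih * iw * ic * B_i},
        {"task": "read_coefficients", "rd_bytes": kh * kw * ic * oc * B_i},
        {"task": "calc_convolutions", "cycles":   n_mac // simd_width},
        {"task": "write_output_fmap", "wr_bytes": oh * ow * oc * B_i},
    ]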
Agenda
• Deep Learning Market and Technology Trends
• How to Design a Deep Learning Accelerator (DLA)
• Analytical Performance Modeling
• Shift Left Architecture Analysis and Optimization with Virtual Prototyping
• Example
  – Import network algorithms as prototxt and generate an analytical model spreadsheet
  – Find suitable configuration and scaling parameters in the analytical model
  – Validate first results, and explore the architecture for dynamic and power aspects using Virtual Platforms
• Summary
Example: Resnet-18 (Inference) with NVDLA

Import the Resnet18 prototxt to generate the Resnet18 task graph, and map it onto the NVDLA platform. The ResNet-18 workload model is generated with AI-XP.

Goals: 100 ms latency, minimize power, minimize energy

Optimize hardware configuration:
– SIMD width
– Burst size, outstanding transactions
– Speed of DDR memory and of the data path
Example: Brief Overview of NVDLA

Convolution Engine (CONV_CORE)
• Works on two sets of data: offline-trained kernels (weights) and input features (images)
• Configurable MAC units and convolutional buffer (RAM)
• Executes operations such as tf.nn.conv2d

Single Data Point Processor (SDP)
• Applies linear and non-linear (activation) functions onto individual data points
• Executes e.g. tf.nn.batch_normalization, tf.nn.bias_add, tf.nn.elu, tf.nn.relu, tf.sigmoid, tf.tanh, and more

Planar Data Processor (PDP)
• Applies common CNN spatial operations such as min/max/avg pooling
• Executes e.g. tf.nn.avg_pool, tf.nn.max_pool, tf.nn.pool

Cross-channel Data Processor (CDP)
• Processes data from different channels/features, e.g. the local response normalization (LRN) function
• Executes e.g. tf.nn.local_response_normalization

Data Reshape Engine (RUBIK)
• Performs data format transformations (splitting, slicing, merging, …)
• Executes e.g. tf.nn.conv2d_transpose, tf.concat, tf.slice, etc.
VP Simulation Results of Initial Configuration

Analysis views: task trace, transaction trace, DDR utilization, resource utilization, throughput, outstanding transactions.

Finding: performance is limited by processing – use a wider SIMD data path.
Simulation Reveals Implementation Effects… (1)
Differences between calculated and measured data read/write amount

AlexNet (Norm1):
Expected: 580,800 bytes
Measured: 654,720 bytes

Inflation by ~12.72% -> "Dark Bandwidth"
Simulation Reveals Implementation Effects… (2)
Differences between calculated and measured execution time

Convolutional Layers 1&2 of LeNet on NVDLA


Back-Annotate Simulation Findings to the Analytical Model

[Figure: flow from the Caffe .prototxt into both the Platform Architect simulation model and the spreadsheet/analytical model, with simulation findings back-annotated into the spreadsheet]
Impact of SIMD Width on Performance
Resource utilization of the CONV datapath (yellow), CONV DMA (red), and other components

[Figure: CONV PE load vs. CONV DMA load for SIMD widths 8, 16, 32, 64, and 128. At small SIMD widths the design is processing bound; performance gains diminish as the SIMD width grows; at SIMD 128 the design becomes memory-bandwidth bound.]
DDR Memory Bandwidth and Power Improvement

[Figure: resource utilization (DMA, Conv PE) and power consumption (Conv PE power, DDR power) over time for SIMD-64 and SIMD-128]

SIMD-128 vs. SIMD-64: 25% faster and 10% lower total energy; SIMD-64 shows 20% lower average power.
Resnet 18 Example Sweep
Goal: 100 ms latency, minimize power & energy
Sweep parameters:
– Burst size: 16, 32
– Outstanding transactions: 4, 8
– DDR memory speed: DDR4-1866, DDR4-2400
– Clock frequency of data path: 1, 1.33, 2 GHz
– SIMD width: 64, 128 operations per cycle

The sweep results feed sensitivity and root-cause analysis, as sketched below.
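The sweep itself is the cross product of the listed values (48 configurations); a sketch, with a stub standing in for the actual simulation launcher:

from itertools import product

bursts      = (16, 32)
outstanding = (4, 8)
ddr_speeds  = ("DDR4-1866", "DDR4-2400")
clocks_ghz  = (1.0, 1.33, 2.0)
simd_widths = (64, 128)

def run_one_simulation(burst, outst, ddr, ghz, simd):
    # Placeholder: in practice this would configure and launch one
    # virtual-platform simulation run for the given sweep point.
    print(f"burst={burst} outst={outst} ddr={ddr} clk={ghz}GHz simd={simd}")

for cfg in product(bursts, outstanding, ddr_speeds, clocks_ghz, simd_widths):
    run_one_simulation(*cfg)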
Sweep Over Hardware Parameters, Latency

[Figure: latency per configuration across the sweep axes: outstanding transactions, data path GHz, SIMD width, DDR4 speed, burst size]
Power/Performance/Energy Trade-off Analysis

[Figure: power/performance/energy per configuration (outstanding Tx, datapath GHz, SIMD, burst size, DDR), with the optimal solution highlighted]
Example: Resnet-18 with NVDLA

Generate the Resnet18 task graph from the Neural Network and map it onto the NVDLA platform.

Goal:
– 100 ms latency, minimize energy

Optimized hardware configuration:
– SIMD width: 128 operations per cycle
– Burst size: 32 bytes
– Outstanding transactions: 8
– Speed of DDR memory: DDR4-1866
– Speed of data path: 1 GHz
Summary
• Be fast and get it right!
• Shift Left with Virtual Prototyping
• Joint Optimization of Algorithm, Architecture, and Compiler

[Figure: closed loop – generate the task graph from the Neural Network, explore and refine it in the analytical model, map it onto the virtual HW platform, and analyze sensitivity and power/performance]
Thank You

Questions

© Accellera Systems Initiative
