FOSDEM SDR ORNL
Jeffrey S. Vetter, Seyong Lee, Mehmet Belviranli, Jungwon Kim, Richard Glassbrook, Abdel-Kareem Moadi, Seth Hitefield
FOSDEM, Brussels
2 Feb 2020
• Architectural specialization
• Performance portability of applications and software
• The DSSoC ORNL project investigates performance portability of SDR
  – Understand applications and target architectures
  – Use open programming models (e.g., OpenMP, OpenACC, OpenCL)
  – Develop intelligent runtime systems
• Goal: scale applications from the Qualcomm Snapdragon to the DOE Summit supercomputer with minimal programmer effort
Sixth Wave of Computing
[Figure: waves of computing, with a transition period preceding the 6th wave]
http://www.kurzweilai.net/exponential-growth-of-computing
Predictions for Transition Period
Complex architectures yield complex programming models:
• System level: MPI, Legion, HPX, Charm++, etc.
• Node level (low overhead): OpenMP, Pthreads, U-threads, etc.
During this Sixth Wave transition, complexity is our major challenge!
DARPA Domain-Specific System on a Chip (DSSoC) Program
Getting the best out of specialization when we need programmability
DARPA ERI DSSoC Program: Dr. Tom Rondeau

DSSoC's Full-Stack Integration
• Full stack: application, development environment and programming languages, libraries, operating system, heterogeneous architecture (e.g., CPUs)
• Three optimization areas: 1) design time, 2) run time, 3) compile time
• Decoupled software development
• Hardware-software co-design, addressed via five program areas (1. intelligent scheduling/routing, …)
Looking at how hardware/software co-design is an enabler for efficient use of processing power.
Development Lifecycle
Precise configuration and benchmark data for static analysis, mapping, partitioning, code generation, etc.
• Applications: create scalable applications manually, with static or dynamic analysis, or using historical information
• Ontologies: based on Aspen models, using statistical and machine learning techniques
• Programming systems: built to support ontologies; query Aspen models and the PFU for automatic code generation, optimization, etc.
• Intelligent runtime: scheduling uses models and the PFU to inform dynamic decisions; dynamic resource discovery and monitoring
• DSSoC design: quantitatively derived from application Aspen models; early design space exploration with Aspen
• PFU API: as a feature of the DSSoC, provides the dynamic performance response of the deployed DSSoC to the intelligent runtime and programming system
Architectures
https://excl.ornl.gov/
Qualcomm Snapdragon 855 (7 nm TSMC)
© Qualcomm Inc.

Kryo 485 CPU (prime core)
Adreno 640 GPU
• Vulkan, OpenCL, OpenGL ES 3.1
• Apps: HDR10+, HEVC, Dolby, etc.
• Enables 8K 360° VR video playback
• 20% faster compared to Adreno 630
Hexagon 690 (DSP + AI)
• Quad-threaded scalar core
• DSP + 4 Hexagon Vector Xccelerators
• New Tensor Xccelerator for AI
• Apps: AI, voice assistance, AV codecs
Spectra 360 ISP
• New dedicated Image Signal Processor (ISP)
• Dual 14-bit CV-ISPs; 48 MP @ 30 fps single camera
• Hardware CV for object detection, tracking, stereo depth processing
• 6DoF XR body tracking, H.265, 4K60 HDR video capture, etc.
Connectivity
• Snapdragon X24 LTE modem (built into the 855), LTE Category 20
• Snapdragon X50 5G modem (external, for 5G devices)
• Qualcomm Wi-Fi 6-ready mobile platform (802.11ax-ready, 802.11ac Wave 2)
• Qualcomm 60 GHz Wi-Fi mobile platform (802.11ay, 802.11ad)
• Bluetooth 5.0, 2 Mbps
• High-accuracy location with dual-frequency GNSS

Getting started on ExCL
• Qualcomm board connected to an HP Z820 through USB
• Development environment: Android SDK/NDK
• Log in to the mcmurdo machine:
  $ ssh -Y mcmurdo
• Set up Android platform tools and the development environment:
  $ source /home/nqx/setup_android.source
• Run hello-world on the ARM cores:
  $ git clone https://code.ornl.gov/nqx/helloworld-android
  $ make compile push run
• Run the OpenCL example on the GPU:
  $ git clone https://code.ornl.gov/nqx/opencl-img-processing
• Run Sobel edge detection:
  $ make compile push run fetch
• Log in to the Qualcomm development board shell:
  $ adb shell
  $ cd /data/local/tmp

For more information or to apply for an account, visit https://excl.ornl.gov/
Created by Narasinga Rao Miniskar, Steve Moulton
Applications
End-to-End System: GNU Radio Wi-Fi on two NVIDIA Xavier SoCs
[Diagram: video/image files streamed over UDP through Wi-Fi transceiver flowgraphs and antennas]
• GR-Tools
  • First tools are released
  • Block-level ontologies [ontologyAnalysis]
    • The following properties are extracted from a batch of block definition files: descriptions and IDs, source and sink ports (whether input/output is scalar, vector, or multi-port), allowed data types, and additional algorithm-specific parameters
  • Flowgraph characterization [workflowAnalysis]
    • Characterization of GR workloads at the flowgraph level
    • Scripts automatically run a flowgraph for 30 seconds and report a breakdown of high-level library module calls (libgnuradio CPU-time breakdown)
  • Design-space exploration [designSpaceCL]
    • Script to run the 13 blocks included in gr-clenabled, both on a GPU and on a single CPU core, using input sizes varying between 2^4 and 2^27 elements
  • Two prototype tools have been added recently
    • cgran-scraper
    • GRC-analyzer
https://github.com/cosmic-sdr
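The block-level ontology extraction above can be sketched in Python. This is a simplified, hypothetical block-definition format, not the actual files the ontologyAnalysis tool parses; `extract_ontology` and the field names are illustrative only.

```python
# Hypothetical, simplified block definition (the real tool reads GNU Radio
# block definition files; this dict only mimics their general shape).
block_def = {
    "id": "blocks_multiply_const_vxx",
    "label": "Multiply Const",
    "parameters": [{"id": "const", "dtype": "complex"}],
    "inputs": [{"domain": "stream", "dtype": "complex", "vlen": 1}],
    "outputs": [{"domain": "stream", "dtype": "complex", "vlen": 1}],
}

def extract_ontology(block):
    """Pull out ID/description, port shapes, data types, and parameters."""
    def port_kind(port):
        # A port with vector length > 1 carries vectors, else scalar samples.
        return "vector" if port.get("vlen", 1) > 1 else "scalar"
    return {
        "id": block["id"],
        "description": block["label"],
        "sources": [port_kind(p) for p in block["outputs"]],
        "sinks": [port_kind(p) for p in block["inputs"]],
        "dtypes": sorted({p["dtype"] for p in block["inputs"] + block["outputs"]}),
        "params": [p["id"] for p in block["parameters"]],
    }

print(extract_ontology(block_def))
```

A batch run of the real tool would apply the same extraction over many block files and aggregate the results into the ontology.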
Applications Profiling
• Will require significant consideration when run on an SoC
• Cannot be executed in parallel
• Hardware-assisted scheduling is essential
[Pie chart: libgnuradio CPU-time breakdown across modules (libgnuradio-analog, -blocks, -channels, -digital, -dtv, -fec, -fft, …)]
GRC statistics: Block Proximity Analysis
Programming Systems

Programming Solution for DSSoC
• The OpenCL and OpenMP programming models are used as input to the OpenARC compiler
New OpenACC GR Block Mapping Strategy for Heterogeneous Architectures
• CUDA is used for the NVIDIA GPU mapping on Xavier; OpenMP is used for the ARM CPU porting.

Constructor
• The OpenACC GR block class inherits the GRACCBase class as a base class.
• The GRACCBase constructor assigns a unique thread ID per OpenACC GR block instantiation, which is used internally for thread safety.
• The OpenACC backend runtime is also initialized.

OpenACC Implementation
• Contains the OpenACC version of the reference CPU implementation.
• Performs the following tasks:
  1) Copy input data to device memory.
  2) Execute the OpenACC kernel.
  3) Copy output data back to host memory.
• OpenARC translates the OpenACC kernel to multiple different output programming models (e.g., CUDA, OpenCL, OpenMP, HIP).
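The block structure above can be sketched in Python, purely for illustration: the real blocks are C++ classes, and `GRACCBase`, `MultiplyConstACC`, and the list-based "device memory" here are simplified stand-ins, not the actual gr-openacc API.

```python
import itertools

class GRACCBase:
    """Illustrative stand-in for the real C++ GRACCBase base class."""
    _next_id = itertools.count()

    def __init__(self):
        # Each OpenACC GR block instance gets a unique thread ID,
        # used internally for thread safety.
        self.thread_id = next(GRACCBase._next_id)
        # The real constructor also initializes the OpenACC backend runtime.
        self.backend_ready = True

class MultiplyConstACC(GRACCBase):
    """Sketch of an OpenACC-enabled block: copy in, run kernel, copy out."""
    def __init__(self, k):
        super().__init__()
        self.k = k

    def work(self, host_input):
        dev_in = list(host_input)                # 1) copy input to device memory
        dev_out = [self.k * x for x in dev_in]   # 2) execute the (OpenACC) kernel
        return list(dev_out)                     # 3) copy output back to host memory

blk_a, blk_b = MultiplyConstACC(2), MultiplyConstACC(3)
print(blk_a.thread_id, blk_b.thread_id)  # unique ID per instantiation
print(blk_a.work([1, 2, 3]))             # [2, 4, 6]
```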
Basic Memory Management for OpenACC-Enabled GR Workflow
[Diagram: host circular buffers and two device kernels (kernel1, kernel2); each block repeats steps 1–3 below]
• In the basic memory management scheme, each invocation of an OpenACC GR block performs
the following three tasks:
1) Copy input data to device memory.
2) Run a kernel on device.
3) Copy output data back to host memory.
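The three-step pattern can be simulated in plain Python to count transfers. This is illustrative only; `run_block_basic` and the transfer log are hypothetical stand-ins for the OpenACC backend runtime.

```python
def run_block_basic(dev, kernel, host_in, transfers):
    """One invocation of an OpenACC GR block under the basic scheme."""
    transfers.append(("h2d", dev))   # 1) copy input data to device memory
    dev_out = kernel(host_in)        # 2) run the kernel on the device
    transfers.append(("d2h", dev))   # 3) copy output back to host memory
    return dev_out

transfers = []
x = run_block_basic("gpu0", lambda v: [2 * e for e in v], [1, 2], transfers)
y = run_block_basic("gpu0", lambda v: [e + 1 for e in v], x, transfers)
print(y, len(transfers))  # [3, 5] 4 -- every invocation pays both transfers
```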
Optimized Memory Management for OpenACC-Enabled GR Workflow
[Diagram: kernel1's output stays in device memory and feeds kernel2 directly; only the initial copy-in (1) and final copy-out (3) cross the host-device boundary]
• In the optimized memory management scheme, some blocks can bypass unnecessary memory transfers between host and device and communicate directly with each other using device memory, if both producer and consumer blocks are running on the same device.
• Note that the device buffer needs extra padding to handle the overwriting feature of the host circular buffer.
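The bypass described above can be simulated in plain Python for a two-block chain on one device. Again illustrative: `run_block_opt`, the location tag, and the transfer log are hypothetical stand-ins for the OpenACC backend runtime.

```python
def run_block_opt(dev, kernel, data, data_loc, transfers):
    """Optimized scheme: skip transfers when producer and consumer share a device."""
    if data_loc != dev:              # copy in only if data is not already on this device
        transfers.append(("h2d", dev))
    out = kernel(data)               # run the kernel; output stays in device memory
    return out, dev

transfers = []
x, loc = run_block_opt("gpu0", lambda v: [2 * e for e in v], [1, 2], "host", transfers)
y, loc = run_block_opt("gpu0", lambda v: [e + 1 for e in v], x, loc, transfers)
transfers.append(("d2h", loc))       # one final copy back to the host circular buffer
print(y, len(transfers))  # [3, 5] 2 -- two transfers instead of four
```

Compared with the basic scheme, the intermediate buffer between the two kernels never crosses the host-device boundary.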
Sample Output of the Example SDR Workflow
SDR Workflow Profiling Using a Built-in GR Performance Monitoring Tool
• CPU versions of OpenACC blocks are algorithmically equivalent to those in the original GR blocks.
• Some OpenACC blocks (B, D) use a simple register-caching optimization, which causes them to perform better than the original GR blocks.
[Chart: per-block CPU time for blocks A, B, C, D1, D2, OpenACC vs. original]
SDR Workflow Profiling Results When OpenACC Blocks Offloaded to CPU
• OpenACC blocks are automatically translated to OpenMP3 versions and run on Xavier CPU.
• Some of the original GR blocks (A, C) were already vectorized with the Volk library.
• Some of the original GR blocks (B, C) performed better than the corresponding OpenACC blocks.
[Chart: per-block CPU time for blocks A, B, C, D1, D2, OpenACC vs. original]
SDR Workflow Profiling Results When OpenACC Blocks Offloaded to GPU
• OpenACC blocks are automatically translated to CUDA versions and run on Xavier GPU.
• Each invocation of an OpenACC block executes three tasks: 1) copy input data to device memory, 2) run a
kernel on device, and 3) copy output data back to host memory
[Chart: OpenACC blocks on the Xavier GPU vs. original GR blocks on the Xavier CPU]
• Due to extra memory transfer overheads, most OpenACC blocks perform worse than the original GR blocks, except for the OpenACC blocks D1 and D2.
SDR Workflow Profiling Results When Opt. OpenACC Blocks Offloaded to GPU
• OpenACC blocks are automatically translated to CUDA versions and run on Xavier GPU.
• Optimized OpenACC blocks bypass memory transfers between host and device and communicate directly with each other using device memory if both producer and consumer blocks are running on the same device.
[Chart: optimized OpenACC blocks on the Xavier GPU vs. original GR blocks on the Xavier CPU]
• Most of the OpenACC blocks perform better than the original GR blocks, except for block A; the original GR block A is vectorized with the Volk library, which performs better than the OpenACC block A.
More Complex SDR Workflow Example
OpenACC-enabled workflow
using gr-openacc blocks
Profiling Results When Opt. OpenACC Blocks Offloaded to GPU
• OpenACC blocks are automatically translated to CUDA versions and run on Xavier GPU.
• Optimized OpenACC blocks bypass memory transfers between host and device and communicate directly with each other using device memory if both producer and consumer blocks are running on the same device.
[Chart: optimized OpenACC blocks on the Xavier GPU vs. original GR blocks on the Xavier CPU]
• This example shows similar performance behavior to the previous example.
• Updated the programming system to use our new heterogeneous runtime system, called IRIS, as the common backend runtime.
  • IRIS allows intermixing of multiple different output programming models (e.g., OpenMP3, OpenMP4, OpenACC, CUDA, HIP) and runs them on heterogeneous devices concurrently.
• Developed a host-device memory transfer optimization scheme, which allows OpenACC GR blocks to bypass memory transfers between host and device and communicate directly with each other if both producer and consumer blocks are running on the same device.
• Performed a preliminary evaluation of the new programming system by creating a synthetic SDR workflow using the OpenACC GR blocks.
• Next steps
  • Port more complex GR blocks to OpenACC and evaluate more complex SDR workflows.
  • Continue to improve and fix bugs in the programming system.
Runtime systems for intelligent scheduling
IRIS: An Intelligent Runtime System for Extremely Heterogeneous
Architectures
https://github.com/swiftcurrent2018
The IRIS Architecture
• Platform Model
– A single-node system equipped with host CPUs
and multiple compute devices (GPUs, FPGAs,
Xeon Phis, and multicore CPUs)
• Memory Model
– Host memory + shared device memory
– All compute devices share the device memory
• Execution Model
– DAG-style task parallel execution across all
available compute devices
• Programming Model
– High-level: OpenACC, OpenMP4, SYCL (planned)
– Low-level: C/Fortran/Python IRIS host-side runtime API + OpenCL/CUDA/HIP/OpenMP kernels (without compiler support)
Architectures and Programming Systems Supported by IRIS
IRIS Booting on Various Platforms
Task Scheduling in IRIS
• A task
– A scheduling unit
– Contains multiple in-order commands
• Kernel launch command
• Memory copy command (device-to-host, host-to-device)
– May have DAG-style dependencies with other tasks
– Enqueued to the application task queue with a device
selection policy
• Available device selection policies
– Specific Device (compute device #)
– Device Type (CPU, GPU, FPGA, XeonPhi)
– Profile-based
– Locality-aware
– Ontology-based
– Performance models (Aspen)
– Any, All, Random, 3rd-party users’ custom policies
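A toy resolver for a few of these policies can illustrate how a policy maps to a set of candidate devices. This is not the actual IRIS API; `select_device`, the device dicts, and the policy names are illustrative stand-ins.

```python
import random

# Hypothetical device table for illustration.
DEVICES = [
    {"id": 0, "type": "cpu"},
    {"id": 1, "type": "gpu"},
    {"id": 2, "type": "gpu"},
]

def select_device(policy, devices):
    """Resolve a device selection policy to a list of candidate devices."""
    if isinstance(policy, int):              # specific device (compute device #)
        return [d for d in devices if d["id"] == policy]
    if policy in ("cpu", "gpu", "fpga"):     # device type
        return [d for d in devices if d["type"] == policy]
    if policy == "any":                      # scheduler is free to pick one
        return [random.choice(devices)]
    if policy == "all":                      # broadcast to every device
        return list(devices)
    raise ValueError(f"unknown policy: {policy}")

print(select_device(1, DEVICES))            # specific device #1
print(len(select_device("gpu", DEVICES)))   # both GPUs
print(len(select_device("all", DEVICES)))   # all three devices
```

Profile-based, locality-aware, ontology-based, and Aspen-model policies would plug into the same resolution point with richer cost information.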
• Computation
– S[] = A * X[] + Y[]
• Two tasks
– S[] = A * X[] on NVIDIA GPU (CUDA)
– S[] += Y[] on ARM CPU (OpenMP)
• S[] is shared between two tasks
• Read-after-write (RAW), true dependency
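The two-task split above can be mimicked in plain Python without IRIS; `task1_gpu` and `task2_cpu` here only label where each task would run, and the runtime's job is to enforce the read-after-write ordering on the shared S[].

```python
# Illustrative simulation of SAXPY split into two dependent tasks
# (plain Python; the real version submits these through the IRIS runtime).
A = 2.0
X = [1.0, 2.0, 3.0]
Y = [10.0, 20.0, 30.0]

def task1_gpu(a, x):
    """Task 1 (e.g., CUDA on the NVIDIA GPU): S[] = A * X[]."""
    return [a * xi for xi in x]

def task2_cpu(s, y):
    """Task 2 (e.g., OpenMP on the ARM CPU): S[] += Y[]."""
    return [si + yi for si, yi in zip(s, y)]

# S is shared: task2 reads what task1 wrote (read-after-write),
# so the scheduler must order task1 before task2 in the DAG.
S = task1_gpu(A, X)
S = task2_cpu(S, Y)
print(S)  # [12.0, 24.0, 36.0]
```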
SAXPY: Python host code & CUDA kernel code

import iris

iris.init()                      # initialize the IRIS runtime
NTASKS = 1000000
t0 = iris.timer_now()
for i in range(NTASKS):
    task = iris.task()
    # asynchronous task submission; the random policy picks CPU or GPU
    task.submit(iris.iris_random, False)
iris.synchronize()               # wait for all submitted tasks
iris.finalize()
Closing

Team: Jeffrey Vetter (PI), Seyong Lee, Mehmet Belviranli, Jungwon Kim, Richard Glassbrook, Steve Moulton, Abdel-Kareem Moadi, Seth Hitefield, Blaise Tine, Austin Latham, Mohammad Monil. Roles span programming systems, runtime systems, applications, modeling, ontologies, SDR, SYCL, project management, and systems engineering.
Summary
• Architectural specialization
• Performance portability of applications and software
• The DSSoC ORNL project investigates performance portability of SDR
  – Understand applications and target architectures
  – Use open programming models: OpenACC, OpenCL, OpenMP
  – Develop intelligent runtime systems: IRIS
• Goal: scale applications from the Qualcomm Snapdragon to the DOE Summit supercomputer with minimal programmer effort
• Work continues…

Acknowledgements
• Thanks to staff and students for the work!
• Thanks to DARPA and DOE for funding our work!
• This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. This document is approved for public release: distribution unlimited.