Fosdem SDR Ornl

Striving for SDR Performance Portability
in the Era of Heterogeneous SoCs
Jeffrey S. Vetter
Seyong Lee
Mehmet Belviranli
Jungwon Kim
Richard Glassbrook
FOSDEM
Abdel-Kareem Moadi Brussels
Seth Hitefield 2 Feb 2020
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
ORNL is managed by UT-Battelle

for the US Department of Energy http://ft.ornl.gov vetter@computer.org
Highlights
• Architectural specialization
• Performance portability of applications and software
• DSSoC ORNL project investigating on performance portability of
SDR
– Understand applications and target architectures
– Use open programming models (e.g., OpenMP, OpenACC, OpenCL)
– Develop intelligent runtime systems
• Goal: scale applications from Qualcomm Snapdragon to DoE
Summit Supercomputer with minimal programmer effort
27
Sixth Wave of Computing
6th wave
Transition
Period
http://www.kurzweilai.net/exponential-growth-of-computing
37
Predictions for Transition Period
Optimize Software and Architectural

Expose New Specialization and Emerging Technologies
Hierarchical Parallelism Integration
• Redesign software to • Use CMOS more • Investigate new
boost performance effectively for specific computational
on upcoming workloads paradigms
architectures • Integrate components • Quantum
• Exploit new levels of to boost performance • Neuromorphic
parallelism and and eliminate • Advanced Digital
efficient data inefficiencies
• Emerging Memory
movement • Workload specific Devices
memory+storage
system design
46
Complex architectures yield…
Complex
Programming
System: MPI, Legion, HPX, Charm++, etc Models
Low overhead Node: OpenMP, Pthreads, U-threads, etc
Resource contention SIMD Cores: OpenACC, CUDA, OpenCL, OpenMP4, …

Memory use,
Data orchestration Fine grained parallelism Hardware features
NUMA, HBM coalescing
Locality
119
During this Sixth Wave transition, Complexity is our major challenge!
Design: How do we design

Programmability: How do we
future systems so that they
design applications with
are better than current
some level of performance
systems on mission
portability?
applications?
•Entirely possible that the •Software lasts much longer
new system will be slower than transient hardware
than the old system! platforms
•Expect ‘disaster’ •Adapt or die
procurements
120
DARPA Domain-Specific System on a Chip (DSSoC) Program
Getting the best out of specialization when we need programmability
DARPA ERI DSSoC Program: Dr. Tom Rondeau
DSSoC’s Full-Stack Integration
Three Optimization Areas
Decoupled Software
Development Environment and Programming
development
Languages
1. Design time
2. Run time Application
3. Compile time
Libraries
Operating System
Hardware-Software Co-design
Heterogeneous architecture
Addressed via five program areas
Integrated performance analysis

composed of Processor Elements:
Intelligent scheduling/routing
• CPUs
1. Intelligent scheduling
Compiler, linker, assembler

• Graphics processing units
Medium Access Control

2. Domain representations • Tensor product units
• Neuromorphic units
3. Software • Accelerators (e.g., FFT)
• DSPs
4. Medium access control (MAC) • Programmable logic
• Math accelerators
5. Hardware integration
Looking at how Hardware/Software co-design is an enabler for efficient use of processing power
Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)

DSSoC ORNL Project Overview
130
Development Lifecycle
Precise configuration and benchmark
data for static analysis, mapping,
partitioning, code generation, etc
Dynamic Performance Feedback

including profiling and
configuration response
Programming Runtime and Performance

Applications Ontologies DSSoC Chip
systems Scheduling Functional API
•Create scalable •Ontologies based on •Programming systems built •Intelligent runtime •DSSoC design •As feature of DSSoC, PFU
application Aspen models Aspen models using to support ontologies scheduling uses models quantitatively derived API provides dynamic
manually, with static or statistical and machine •Query Aspen models and and PFU to inform from application Aspen performance response of
dynamic analysis, or using learning techniques PFU for automatic code dynamic decisions models deployed DSSoC to
historical information generation, optimization, •Dynamic resource •Early design space intelligent runtime and
etc. discovery and monitoring exploration with Aspen programming system.
DSSoC Design
•DSSoC design
quantitatively derived
from application Aspen
models
131 •Early design space
exploration with Aspen
Architectures
132
134
https://excl.ornl.gov/
Intel Stratix 10 FPGA

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group
• Intel Stratix 10 FPGA and four banks of DDR4 external

memory
– Board configuration: Nallatech 520 Network Acceleration Card
• Up to 10 TFLOPS of peak single precision performance
• 25MBytes of L1 cache @ up to 94 TBytes/s peak

bandwidth
• 2X Core performance gains over Arria® 10
• Quartus and OpenCL software (Intel SDK v18.1) for

using FPGA
• Provide researcher access to advanced FPGA/SOC
environment
136 For more information or to apply for an account, visit https://excl.ornl.gov/

Mar 2019
NVIDIA Jetson AGX Xavier SoC

• NVIDIA Jetson AGX Xavier:

• High-performance system on a chip for autonomous
machines
• Heterogeneous SoC contains:
– Eight-core 64-bit ARMv8.2 CPU cluster (Carmel)
– 1.4 CUDA TFLOPS (FP32) GPU with additional
inference optimizations (Volta)
– 11.4 DL TOPS (INT8) Deep learning accelerator
(NVDLA)
– 1.7 CV TOPS (INT8) 7-slot VLIW dual-processor
Vision accelerator (PVA)
– A set of multimedia accelerators (stereo, LDC,
optical flow)
• Provides researchers access to advanced high-
performance SOC environment
137 For more information or to apply for an account, visit https://excl.ornl.gov/

Mar 2019
Qualcomm 855 SoC (SM8510P) Snapdragon™

Qualcomm Development Board connected to (mcmurdo) HPZ820
7nm TSMC
Adreno 640
Hexagon 690 5G
Spectra 360
Kyro 485
© Qualcomm Inc.
© Qualcomm Inc.
Kyro 485 (8-ARM Prime+BigLittle Cores) Connectivity (5G)
Prime • Snapdragon X24 LTE (855 built-in) modem LTE Category 20 • Connected Qualcomm board to HPZ820 through USB
Core
• Snapdragon X50 5G (external) modem (for 5G devices) • Development Environment: Android SDK/NDK
• Qualcomm Wi-Fi 6-ready mobile platform: (802.11ax-ready, • Login to mcmurdo machine
802.11ac Wave 2, 802.11ay, 802.11ad) $ ssh –Y mcmurdo
• Qualcomm 60 GHz Wi-Fi mobile platform: (802.11ay, • Setup Android platform tools and development environment
Hexagon 690 (DSP + AI) $ source /home/nqx/setup_android.source
• Quad threaded Scalar Core 802.11ad)
• Run Hello-world on ARM cores
• DSP + 4 Hexagon Vector Xccelerators • Bluetooth Version: 5.0
$ git clone https://code.ornl.gov/nqx/helloworld-android
• New Tensor Xccelerator for AI • Bluetooth Speed: 2 Mbps
$ make compile push run
• Apps: AI, Voice Assistance, AV codecs • High accuracy location with dual-frequency GNSS.
• Run OpenCL example on GPU
$ git clone https://code.ornl.gov/nqx/opencl-img-processing
Adreno 640 Spectra 360 ISP
• Run Sobel edge detection
• Vulkan, OpenCL, OpenGL ES 3.1 • New dedicated Image Signal Processor (ISP)
$ make compile push run fetch
• Apps: HDR10+, HEVC, Dolby, etc • Dual 14-bit CV-ISPs; 48MP @ 30fps single camera
• Login to Qualcomm development board shell
• Enables 8k-360o VR video playback • Hardware CV for object detection, tracking, streo depth process $ adb shell
• 20% faster compared to Adreno 630 • 6DoF XR Body tracking, H265, 4K60 HDR video capture, etc. $ cd /data/local/tmp
138 For more information or to apply for an account, visit https://excl.ornl.gov/ Created by Narasinga Rao Miniskar, Steve Moulton
Applications
140
End-to-End System: Gnu Radio for Wifi on two NVIDIA Xavier SoCs
Xavier SoC #1 Xavier SoC #2
UDP
Video/Image
Files Antenna
GR IEEE-802.11 Transmit (TX) IEEE-802.11 Receive (RX)

UDP
• Signal processing: An open-
source implementation of
IEEE-802.11 WIFI a/b/g with
GR OOT modules.
• Input / Output file support via
Socket PDU (UDP server)
blocks
• Image/Video transcoding with
OpenCL/OpenCV
141
GR-Tools
• GR-Tools
• First tools are released
• Block-level Ontologies [ontologyAnalysis]
• Following properties are extracted from a batch
of block definition files: Descriptions and IDs,
source and sink ports (whether input/output is
scalar, vector or multi-port), allowed data types,
and additional algorithm-specific parameters
• Flowgraph Characterization [workflowAnalysis]
• Characterization of GR workloads at the
flowgraph level.
• Scripts automatically run for for 30 seconds and
reports a breakdown of high-level library module
calls libgnuradio CPU-time Breakdown
4%
• Design-space Exploration [designSpaceCL] libgnuradio-
analog
• Script to run 13 blocks included in gr-clenabled 13% libgnuradio-
- Both on a GPU and on a single CPU core 6% blocks
3%
- By using input sizes varying between 24 and 28% libgnuradio-
4% channels
227 elements.
1% libgnuradio-
• Two prototype tools have been added recently 10% digital
libgnuradio-dtv
1%
• cgran-scraper 22% 8%
0%
• GRC-analyzer
149 https://github.com/cosmic-sdr
Applications Profiling
• Preliminary SDR Application Profiling: Library Percentage
• Created fully automated GRC profiling toolkit [kernel.kallsyms] 27.8547

libpython 18.6281
• Ran each of the 89 flowgraph for 30 seconds libgnuradio 11.7548
• Profiled with performance counters libc 7.7503

ld 3.8839
• Major overheads: libvolk 3.7963
libperl 3.7837
• Python glue code (libpython), O/S threading & profiling [unknown] 3.6465
(kernel.kallsysms, libpthread), libc, ld, Qt libQt5 2.9866
• Runtime overhead: libpthread 2.1449
• Will require significant consideration when run on SoC libgnuradio CPU-time Breakdown
libgnuradio-analog
• Cannot be executed in parallel 4%
13% libgnuradio-blocks
• Hardware assisted scheduling is essential libgnuradio-
28% 6%
3% channels
libgnuradio-digital
4%
libgnuradio-dtv
1%
10% libgnuradio-fec
1% libgnuradio-fft
22% 8%
0%
152
GRC statistics: Block Proximity Analysis
Block proximity analysis

• Creates a graph:
• Nodes: Unique block types
• Edges: Blocks used in the same GRC file.
• Every co-occurrence increases edge
weight by 1.
• This example was run
• With --mode proximityGraph
• On randomly selected sub-set of GRC
files
153
Programming Systems
155
Programming Solution for DSSoC
Main input Main input/ Optional input Main output/

programming main output programming optional input
models programming models programming
models models
HIP
MPI CUDA
Used as input
OpenCL programming
model to the
OpenMP
OpenARC compiler
LLVM Used as both

OpenACC input and
Compilers Verilog output
programming
model to the
CASH
compiler
IRIS
158
New OpenACC GR Block Mapping Strategy for Heterogeneous
Architectures
CUDA NVIDIA GPU Mapping
used for
Xavier
OpenMP ARM CPU porting
IRIS Common Runtime API

OpenMP CPU/Xeon Phi
OpenACC
GR Block Intel
Intel FPGA
OpenCL
Support more General
OpenCL
programming Accelerators
models.
HIP AMD GPU
SYCL IRIS offers a common API for diverse

GR Block heterogeneous devices and also allows
intermixing of multiple programming models
(mix CUDA, OpenMP, OpenCL, etc.).
161
OpenACC GR Block Code Structure
Constructor
• OpenACC GR block class inherits GRACCBase class as a base class.
• GRACCBase constructor assigns a unique thread ID per OpenACC
GR block instantiation, which is internally used for thread safety.
• OpenACC backend runtime is also initialized.
Reference CPU Implementation

• Contains the same code as that in the original GR block, which may
have already been vectorized using Volk library.
OpenACC Implementation
• Contains the OpenACC version of the reference CPU implementation.
• Performs the following tasks:
• Copy input data to device memory.
• Execute the OpenACC kernel.
• Copy output data back to host memory.
• OpenARC will translate the OpenACC kernel to multiple different
output programming models (e.g., CUDA, OpenCL, OpenMP, HIP, etc.)
Main Entry Function

• Main entry function executed whenever GR scheduler invokes the
OpenACC GR block.
• The GR block argument, contextType decides which to execute
between the reference CPU version and OpenACC version.
• OpenACC backend runtime may choose CPU as an offloading
target (e.g., offloading OpenMP3 kernel to CPU).
162
Example Translation of GR accLog Module
Input OpenACC code
Output host code
Output CUDA kernel

163 code
Port an Example SDR Workflow to Xavier
OpenACC-enabled workflow using gr-openacc blocks
Reference CPU workflow using original gr-blocks
164
Basic Memory Management for OpenACC-Enabled GR Workflow
Host
Source OpenAC OpenAC

Sink Block
block C block1 C block2
1 3 1 3
2 2
Device
Device Device
kernel1 kernel2
• In the basic memory management scheme, each invocation of an OpenACC GR block performs
the following three tasks:
1) Copy input data to device memory.
2) Run a kernel on device.
3) Copy output data back to host memory.
165
Optimized Memory Management for OpenACC-Enabled GR Workflow
Host
Source OpenAC OpenAC

Sink Block
block C block1 C block2
1 3
2 2
Device
Device Device
kernel1 kernel2
• In the optimized memory management scheme, some blocks can bypass unnecessary memory
transfers between host and device and directly communicate each other using device memory if
both producer and consumer blocks are running on the same device.
• Notice that device buffer needs extra padding to handle the overwriting feature in the host circular
buffer.
166
Sample Output of the Example SDR Workflow
168
SDR Workflow Profiling Using a Built-in GR Performance Monitoring Tool
• CPU versions of OpenACC blocks are algorithmically equivalent to those in the original GR blocks.
OpenACC Blocks on Xavier CPU Original GR Blocks on Xavier CPU
Some OpenACC
blocks (B, D) use
a simple register
caching
optimization,
which causes
them to perform
better than the
original GR
blocks.
A A B B C C D1 D1 D2 D2
169
SDR Workflow Profiling Results When OpenACC Blocks Offloaded to CPU
• OpenACC blocks are automatically translated to OpenMP3 versions and run on Xavier CPU.
OpenACC Blocks on Xavier CPU

via OpenMP Original GR Blocks on Xavier CPU
Some of original
GR blocks (A, C)
were already
vectorized with
Volk library.
Some of original
GR blocks (B, C)
performed better
than OpenACC
blocks (B, C).
A A B C D1 D2 D1 B C D2
170
SDR Workflow Profiling Results When OpenACC Blocks Offloaded to GPU
• OpenACC blocks are automatically translated to CUDA versions and run on Xavier GPU.
• Each invocation of an OpenACC block executes three tasks: 1) copy input data to device memory, 2) run a
kernel on device, and 3) copy output data back to host memory
OpenACC Blocks on Xavier GPU Original GR Blocks on Xavier CPU
Due to extra
memory transfer
overheads, most
OpenACC blocks
perform worse
than original GR
blocks, except for
the OpenACC
block D1 and D2.
171
A B C A B D2 D1 D1 C D2
SDR Workflow Profiling Results When Opt. OpenACC Blocks Offloaded to GPU
• Optimized OpenACC blocks bypass memory transfers between host and device and directly communicate
each other using device memory if both producer and consumer blocks are running on the same device.
Opt. OpenACC Blocks on Xavier Original GR Blocks on Xavier CPU
GPU
Most of the
OpenACC blocks
perform better
than original GR
blocks, except for
the block A; the
original GR block
A is vectorized
with Volk library,
which performs
better than the
OpenACC block
A.
172
A D2 A B C B D1 C D1 D2
More Complex SDR Workflow Example
This example offloads more OpenACC blocks to

Xavier GPU than the previous example.
OpenACC-enabled workflow
using gr-openacc blocks
Reference CPU workflow

using original gr-blocks
173
Profiling Results When Opt. OpenACC Blocks Offloaded to GPU
• Optimized OpenACC blocks bypass memory transfers between host and device and directly communicate
each other using device memory if both producer and consumer blocks are running on the same device.
Opt. OpenACC Blocks on Xavier Original GR Blocks on Xavier CPU
GPU
This example
shows similar
performance
behaviors as the
previous example.
A0 A1A1 B1 D1D3D2 C0B0 A0 C1 B0 B1 C0D0C1D0 D1 D2D3

174
Programming Systems Update Summary and Next Steps
• Updated the programming system to use our new heterogeneous runtime system, called IRIS, as the
common backend runtime.
• IRIS allows intermixing of multiple different output programming models (e.g., OpenMP3, OpenMP4, OpenACC, CUDA, HIP,
etc.) and runs them on heterogeneous devices concurrently.
• Developed a host-device memory transfer optimization scheme, which allows OpenACC GR blocks to
bypass memory transfers between host and device and directly communicate each other if both
producer and consumer blocks are running on the same device.
• Performed preliminary evaluation of the new programming system by creating synthetic SDR workflow
using the OpenACC GR blocks.
• Next Steps
• Port more complex GR blocks to OpenACC and evaluate more complex SDR workflow.
• Continue to improve and fix bugs in the programming system.
175
Runtime systems for intelligent scheduling
176
IRIS: An Intelligent Runtime System for Extremely Heterogeneous
Architectures
• Provide programmers a unified programming

environment to write portable code across
heterogeneous architectures (and preferred
programming systems)
• Orchestrate diverse programming systems
(OpenCL, CUDA, HIP, OpenMP for CPU) in a single
application
– OpenCL
• NVIDIA GPU, AMD GPU, ARM GPU, Qualcomm GPU, Intel
CPU, Intel Xeon Phi, Intel FPGA, Xilinx FPGA
– CUDA
• NVIDIA GPU
– HIP
• AMD GPU
– OpenMP for CPU
• Intel CPU, AMD CPU, PowerPC CPU, ARM CPU,
Qualcomm CPU
177 https://github.com/swiftcurrent2018
The IRIS Architecture
• Platform Model
– A single-node system equipped with host CPUs
and multiple compute devices (GPUs, FPGAs,
Xeon Phis, and multicore CPUs)
• Memory Model
– Host memory + shared device memory
– All compute devices share the device memory
• Execution Model
– DAG-style task parallel execution across all
available compute devices
• Programming Model
– High-level OpenACC, OpenMP4, SYCL* (*
planned)
– Low-level C/Fortran/Python IRIS host-side
runtime API + OpenCL/CUDA/HIP/OpenMP
kernels (w/o compiler support)
178
Supported Architectures and Programming Systems by IRIS
ExCL* Systems Oswald Summit-node Radeon Xavier Snapdragon

Qualcomm
CPU Intel Xeon IBM Power9 Intel Xeon ARMv8
Kryo
Programming • Intel OpenMP • IBM XL OpenMP • Intel OpenMP • GNU GOMP • Android NDK
Systems • Intel OpenCL • Intel OpenCL OpenMP
AMD Radeon Qualcomm
GPU NVIDIA P100 NVIDIA V100 NVIDIA Volta
VII Adreno 640
Programming • NVIDIA CUDA • NVIDIA CUDA • AMD HIP • NVIDIA CUDA • Qualcomm
Systems • NVIDIA • AMD OpenCL OpenCL
OpenCL
Intel/Altera
FPGA
Stratix 10
Programming
* ORNL • Intel OpenCL
Experimental Computing Laboratory (ExCL) https://excl.ornl.gov/
Systems
179
IRIS Booting on Various Platforms
180
Task Scheduling in IRIS
• A task
– A scheduling unit
– Contains multiple in-order commands
• Kernel launch command
• Memory copy command (device-to-host, host-to-device)
– May have DAG-style dependencies with other tasks
– Enqueued to the application task queue with a device
selection policy
• Available device selection policies
– Specific Device (compute device #)
– Device Type (CPU, GPU, FPGA, XeonPhi)
– Profile-based
– Locality-aware
– Ontology-base
– Performance models (Aspen)
– Any, All, Random, 3rd-party users’ custom policies
• The task scheduler dispatches the tasks in the

application task queue to available compute devices
– Select the optimal target compute device according to
task’s device selection policy
181
SAXPY Example on Xavier
• Computation
– S[] = A * X[] + Y[]
• Two tasks
– S[] = A * X[] on NVIDIA GPU (CUDA)
– S[] += Y[] on ARM CPU (OpenMP)
• S[] is shared between two tasks
• Read-after-write (RAW), true dependency
• Low-level Python IRIS host code +

CUDA/OpenMP kernels
– saxpy.py
– kernel.cu
– kernel.openmp.h
182
SAXPY: Python host code & CUDA kernel code
saxpy.py (1/2) saxpy.py (2/2) kernel.cu (CUDA)

#!/usr/bin/env python kernel0 = iris.kernel("saxpy0") extern "C" __global__ void saxpy0(float*
kernel0.setmem(0, mem_s, iris.iris_w) S, float A, float* X) {
kernel0.setint(1, A)
import iris int id = blockIdx.x * blockDim.x +
kernel0.setmem(2, mem_x, iris.iris_r)
import numpy as np threadIdx.x;
import sys off = [ 0 ] S[id] = A * X[id];
ndr = [ SIZE ] }
iris.init()
task0 = iris.task() extern "C" __global__ void saxpy1(float*
task0.h2d_full(mem_x, x)
SIZE = 1024 S, float* Y) {
task0.kernel(kernel0, 1, off, ndr)
A = 10.0 task0.submit(iris.iris_gpu) int id = blockIdx.x * blockDim.x +
threadIdx.x;
x = np.arange(SIZE, kernel1 = iris.kernel("saxpy1") S[id] += Y[id];
dtype=np.float32) kernel1.setmem(0, mem_s, iris.iris_rw) }
y = np.arange(SIZE, kernel1.setmem(1, mem_y, iris.iris_r)
dtype=np.float32)
task1 = iris.task()
s = np.arange(SIZE, task1.h2d_full(mem_y, y)
dtype=np.float32) task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
print 'X', x task1.submit(iris.iris_cpu)
print 'Y', y
print 'S =', A, '* X + Y', s
mem_x = iris.mem(x.nbytes) iris.finalize()
mem_y = iris.mem(y.nbytes)
183 mem_s = iris.mem(s.nbytes)
SAXPY: Python host code & OpenMP kernel code
saxpy.py (1/2) saxpy.py (2/2) kernel.openmp.h (OpenMP)

#!/usr/bin/env python kernel0 = iris.kernel("saxpy0") #include <iris/iris_openmp.h>
kernel0.setmem(0, mem_s, iris.iris_w)
import iris static void saxpy0(float* S, float A, float*
import numpy as np X, IRIS_OPENMP_KERNEL_ARGS) {
import sys off = [ 0 ] int id;
ndr = [ SIZE ] #pragma omp parallel for shared(S, A, X)
iris.init() private(id)
task0 = iris.task() IRIS_OPENMP_KERNEL_BEGIN
SIZE = 1024 S[id] = A * X[id];
A = 10.0 task0.submit(iris.iris_gpu) IRIS_OPENMP_KERNEL_END
}
x = np.arange(SIZE, kernel1 = iris.kernel("saxpy1")
dtype=np.float32) kernel1.setmem(0, mem_s, iris.iris_rw) static void saxpy1(float* S, float* Y,
y = np.arange(SIZE, kernel1.setmem(1, mem_y, iris.iris_r) IRIS_OPENMP_KERNEL_ARGS) {
dtype=np.float32) int id;
task1 = iris.task()
s = np.arange(SIZE, task1.h2d_full(mem_y, y) #pragma omp parallel for shared(S, Y)
dtype=np.float32) task1.kernel(kernel1, 1, off, ndr) private(id)
task1.d2h_full(mem_s, s) IRIS_OPENMP_KERNEL_BEGIN
print 'X', x task1.submit(iris.iris_cpu) S[id] += Y[id];
print 'Y', y IRIS_OPENMP_KERNEL_END
print 'S =', A, '* X + Y', s
}
Memory Consistency Management
saxpy.py (1/2) saxpy.py (2/2)

#!/usr/bin/env python kernel0 = iris.kernel("saxpy0")
import iris
import numpy as np
import sys off = [ 0 ]
ndr = [ SIZE ]
iris.init()
task0 = iris.task() mem_s is
task0.h2d_full(mem_x, x) shared
SIZE = 1024
task0.kernel(kernel0, 1, off, ndr) between GPU
A = 10.0 task0.submit(iris.iris_gpu) and CPU

dtype=np.float32) kernel1.setmem(0, mem_s, iris.iris_rw)
dtype=np.float32)
task1 = iris.task()
s = np.arange(SIZE, task1.h2d_full(mem_y, y)
dtype=np.float32) task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
print 'X', x task1.submit(iris.iris_cpu)
print 'Y', y
print 'S =', A, '* X + Y', s
Locality-aware Device Selection Policy
saxpy.py (1/2) saxpy.py (2/2)

#!/usr/bin/env python kernel0 = iris.kernel("saxpy0")
import iris
import numpy as np
import sys off = [ 0 ]
ndr = [ SIZE ]
iris.init()
task0 = iris.task()
SIZE = 1024
A = 10.0 task0.submit(iris.iris_gpu)

dtype=np.float32) kernel1.setmem(0, mem_s, iris.iris_rw)
dtype=np.float32)
task1 = iris.task() iris_data
s = np.arange(SIZE, task1.h2d_full(mem_y, y) selects the
dtype=np.float32) task1.kernel(kernel1, 1, off, ndr) device that
task1.d2h_full(mem_s, s) requires
print 'X', x task1.submit(iris.iris_data) minimum
print 'Y', y data transfer
print 'S =', A, '* X + Y', s to execute
mem_x = iris.mem(x.nbytes) the task
iris.finalize()
IRIS: Task Scheduling Overhead – Running One Million (Empty)
Tasks
user@xavier:~/work$ ./ntasks.py
ntasks.py Time: 11.46s
#!/usr/bin/env python
Throughput Latency
import iris
87,268 tasks/sec 11.4 μs/task
iris.init()
NTASKS = 1000000
t0 = iris.timer_now()
for i in range(NTASKS):
task = iris.task()
task.submit(iris.iris_random, False)
CPU or GPU
iris.synchronize() randomly asynchronous
task submission
t1 = iris.timer_now() concurrent tasks

print 'Time:', t1 - t0 execution on
multiple devices
iris.finalize()
187
Closing
Abdel-Kareem
Seyong Lee, Mehmet Belviranli, Seth Hitfield Austin Latham
Jungwon Kim, Richard Glassbrook, Steve Moulton, Moadi, Blaise Tine, Intern, Mohammad Monil,
Jeffrey Vetter, PI Programming Apps, Modeling,
Runtime Systems Project Manager Systems Engineer Software/Hardware SDR Georgia Tech Intern, Oregon SYCL
Systems Ontologies
Engineer
Summary Acknowledgements
• Architectural specialization • Thanks to staff and students for the work!
• Performance portability of applications and • Thanks to DARPA, DOE for funding our work!
software
• This research was developed, in part, with funding from the Defense
Advanced Research Projects Agency (DARPA). The views, opinions
• DSSoC ORNL project investigating on and/or findings expressed are those of the authors and should not be
performance portability of SDR interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government. This document is
– Understand applications and target architectures approved for public release: distribution unlimited.
– Use open programming models: OpenACC, OpenCL,
OpenMP
– Developing intelligent runtime systems: IRIS
• Goal: scale applications from Qualcomm
Snapdragon to DoE Summit Supercomputer with
minimal programmer effort
• Work continues…
188

Fosdem SDR Ornl

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

Fosdem SDR Ornl

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Fosdem SDR Ornl

Uploaded by

Copyright:

Available Formats

Striving for SDR Performance Portability

in the Era of Heterogeneous SoCs

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

ORNL is managed by UT-Battelle

Optimize Software and Architectural

Resource contention SIMD Cores: OpenACC, CUDA, OpenCL, OpenMP4, …

Design: How do we design

Integrated performance analysis

Compiler, linker, assembler

Medium Access Control

Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)

Dynamic Performance Feedback

Programming Runtime and Performance

Intel Stratix 10 FPGA

• Intel Stratix 10 FPGA and four banks of DDR4 external

• Up to 10 TFLOPS of peak single precision performance

• 25MBytes of L1 cache @ up to 94 TBytes/s peak

• Quartus and OpenCL software (Intel SDK v18.1) for

136 For more information or to apply for an account, visit https://excl.ornl.gov/

NVIDIA Jetson AGX Xavier SoC

• NVIDIA Jetson AGX Xavier:

137 For more information or to apply for an account, visit https://excl.ornl.gov/

Qualcomm 855 SoC (SM8510P) Snapdragon™

Kyro 485 (8-ARM Prime+BigLittle Cores) Connectivity (5G)

Xavier SoC #1 Xavier SoC #2

GR IEEE-802.11 Transmit (TX) IEEE-802.11 Receive (RX)

• Preliminary SDR Application Profiling: Library Percentage

• Created fully automated GRC profiling toolkit [kernel.kallsyms] 27.8547

• Profiled with performance counters libc 7.7503

• Runtime overhead: libpthread 2.1449

Block proximity analysis

Main input Main input/ Optional input Main output/

LLVM Used as both

IRIS Common Runtime API

SYCL IRIS offers a common API for diverse

Reference CPU Implementation

Main Entry Function

Input OpenACC code

Output host code

Output CUDA kernel

OpenACC-enabled workflow using gr-openacc blocks

Reference CPU workflow using original gr-blocks

Source OpenAC OpenAC

Source OpenAC OpenAC

OpenACC Blocks on Xavier CPU Original GR Blocks on Xavier CPU

OpenACC Blocks on Xavier CPU

This example offloads more OpenACC blocks to

Reference CPU workflow

A0 A1A1 B1 D1D3D2 C0B0 A0 C1 B0 B1 C0D0C1D0 D1 D2D3

• Provide programmers a unified programming

ExCL* Systems Oswald Summit-node Radeon Xavier Snapdragon

• The task scheduler dispatches the tasks in the

• Low-level Python IRIS host code +

saxpy.py (1/2) saxpy.py (2/2) kernel.cu (CUDA)

saxpy.py (1/2) saxpy.py (2/2) kernel.openmp.h (OpenMP)

saxpy.py (1/2) saxpy.py (2/2)

x = np.arange(SIZE, kernel1 = iris.kernel("saxpy1")