
Striving for SDR Performance Portability in the Era of Heterogeneous SoCs

Jeffrey S. Vetter, Seyong Lee, Mehmet Belviranli, Jungwon Kim,
Richard Glassbrook, Abdel-Kareem Moadi, Seth Hitefield

FOSDEM, Brussels, 2 Feb 2020

ORNL is managed by UT-Battelle, LLC for the US Department of Energy
http://ft.ornl.gov vetter@computer.org
Highlights

• Architectural specialization
• Performance portability of applications and software
• DSSoC ORNL project investigating the performance portability of SDR
  – Understand applications and target architectures
  – Use open programming models (e.g., OpenMP, OpenACC, OpenCL)
  – Develop intelligent runtime systems
• Goal: scale applications from the Qualcomm Snapdragon to the DOE Summit supercomputer with minimal programmer effort
Sixth Wave of Computing

Figure: exponential growth of computing, with the 6th wave following a transition period
(source: http://www.kurzweilai.net/exponential-growth-of-computing)
Predictions for Transition Period

Expose New Hierarchical Parallelism
• Redesign software to boost performance on upcoming architectures
• Exploit new levels of parallelism and efficient data movement

Optimize Software and Architectural Specialization and Integration
• Use CMOS more effectively for specific workloads
• Integrate components to boost performance and eliminate inefficiencies
• Workload-specific memory+storage system design

Emerging Technologies
• Investigate new computational paradigms
  – Quantum
  – Neuromorphic
  – Advanced digital
  – Emerging memory devices
Complex architectures yield…

…complex programming models:
• System level: MPI, Legion, HPX, Charm++, etc.
• Node level: OpenMP, Pthreads, U-threads, etc.
• SIMD/core level: OpenACC, CUDA, OpenCL, OpenMP4, …
• Cross-cutting concerns: low overhead, resource contention, memory use (NUMA, HBM, coalescing), data orchestration, fine-grained parallelism, locality, and hardware features
During this Sixth Wave transition, Complexity is our major challenge!

Design: How do we design future systems so that they are better than current systems on mission applications?
• Entirely possible that the new system will be slower than the old system!
• Expect 'disaster' procurements

Programmability: How do we design applications with some level of performance portability?
• Software lasts much longer than transient hardware platforms
• Adapt or die
DARPA Domain-Specific System on a Chip (DSSoC) Program
Getting the best out of specialization when we need programmability
DARPA ERI DSSoC Program: Dr. Tom Rondeau
DSSoC’s Full-Stack Integration
Three Optimization Areas

1. Design time
2. Run time
3. Compile time

Full-stack integration: decoupled software development spanning the development environment and programming languages, applications, libraries, and the operating system, on top of a heterogeneous architecture composed of processor elements:
• CPUs
• Graphics processing units
• Tensor product units
• Neuromorphic units
• Accelerators (e.g., FFT)
• DSPs
• Programmable logic
• Math accelerators

Addressed via five program areas:
1. Intelligent scheduling
2. Domain representations
3. Software
4. Medium access control (MAC)
5. Hardware integration

Related elements shown in the stack: hardware-software co-design, integrated performance analysis, intelligent scheduling/routing, the compiler/linker/assembler, and medium access control.

Looking at how Hardware/Software co-design is an enabler for efficient use of processing power

Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)


DSSoC ORNL Project Overview

Development Lifecycle

Precise configuration and benchmark data feed static analysis, mapping, partitioning, code generation, etc.; dynamic performance feedback includes profiling and the configuration response.

• Applications: create scalable application Aspen models manually, with static or dynamic analysis, or using historical information.
• Ontologies: ontologies based on Aspen models using statistical and machine learning techniques.
• Programming systems: programming systems built to support ontologies; query Aspen models and the PFU for automatic code generation, optimization, etc.
• Runtime and scheduling: intelligent runtime scheduling uses models and the PFU to inform dynamic decisions; dynamic resource discovery and monitoring.
• DSSoC design: DSSoC design quantitatively derived from application Aspen models; early design-space exploration with Aspen.
• DSSoC chip functional API: as a feature of the DSSoC, the PFU API provides the dynamic performance response of the deployed DSSoC to the intelligent runtime and programming system.
Architectures

Intel Stratix 10 FPGA

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group

• Intel Stratix 10 FPGA and four banks of DDR4 external memory
  – Board configuration: Nallatech 520 Network Acceleration Card
• Up to 10 TFLOPS of peak single-precision performance
• 25 MB of L1 cache at up to 94 TB/s peak bandwidth
• 2x core performance gains over Arria 10
• Quartus and OpenCL software (Intel SDK v18.1) for using the FPGA
• Provides researchers access to an advanced FPGA/SoC environment

For more information or to apply for an account, visit https://excl.ornl.gov/ (Mar 2019)
NVIDIA Jetson AGX Xavier SoC

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group

• NVIDIA Jetson AGX Xavier: high-performance system on a chip for autonomous machines
• Heterogeneous SoC contains:
  – Eight-core 64-bit ARMv8.2 CPU cluster (Carmel)
  – 1.4 CUDA TFLOPS (FP32) GPU with additional inference optimizations (Volta)
  – 11.4 DL TOPS (INT8) deep learning accelerator (NVDLA)
  – 1.7 CV TOPS (INT8) 7-slot VLIW dual-processor vision accelerator (PVA)
  – A set of multimedia accelerators (stereo, LDC, optical flow)
• Provides researchers access to an advanced high-performance SoC environment

For more information or to apply for an account, visit https://excl.ornl.gov/ (Mar 2019)
Qualcomm 855 SoC (SM8510P) Snapdragon™

Experimental Computing Lab (ExCL) managed by the ORNL Future Technologies Group
Qualcomm development board connected to the HP Z820 host (mcmurdo)

7 nm TSMC SoC with Kryo 485 CPU, Adreno 640 GPU, Hexagon 690 DSP, Spectra 360 ISP, and 5G connectivity (images © Qualcomm Inc.)

Kryo 485 (8 ARM cores, Prime + big.LITTLE)

Hexagon 690 (DSP + AI)
• Quad-threaded scalar core
• DSP + 4 Hexagon Vector Xccelerators
• New Tensor Xccelerator for AI
• Apps: AI, voice assistance, AV codecs

Adreno 640 (GPU)
• Vulkan, OpenCL, OpenGL ES 3.1
• Apps: HDR10+, HEVC, Dolby, etc.
• Enables 8K 360° VR video playback
• 20% faster compared to Adreno 630

Spectra 360 ISP
• New dedicated image signal processor (ISP)
• Dual 14-bit CV-ISPs; 48 MP @ 30 fps single camera
• Hardware CV for object detection, tracking, stereo depth processing
• 6DoF XR body tracking, H.265, 4K60 HDR video capture, etc.

Connectivity (5G)
• Snapdragon X24 LTE (855 built-in) modem, LTE Category 20
• Snapdragon X50 5G (external) modem (for 5G devices)
• Qualcomm Wi-Fi 6-ready mobile platform (802.11ax-ready, 802.11ac Wave 2, 802.11ay, 802.11ad)
• Qualcomm 60 GHz Wi-Fi mobile platform (802.11ay, 802.11ad)
• Bluetooth version 5.0, speed 2 Mbps
• High-accuracy location with dual-frequency GNSS

Development workflow
• Qualcomm board connected to the HP Z820 (mcmurdo) through USB
• Development environment: Android SDK/NDK
• Login to the mcmurdo machine
  $ ssh -Y mcmurdo
• Set up Android platform tools and development environment
  $ source /home/nqx/setup_android.source
• Run Hello-world on the ARM cores
  $ git clone https://code.ornl.gov/nqx/helloworld-android
  $ make compile push run
• Run an OpenCL example (Sobel edge detection) on the GPU
  $ git clone https://code.ornl.gov/nqx/opencl-img-processing
  $ make compile push run fetch
• Login to the Qualcomm development board shell
  $ adb shell
  $ cd /data/local/tmp

For more information or to apply for an account, visit https://excl.ornl.gov/ (created by Narasinga Rao Miniskar, Steve Moulton)
Applications

End-to-End System: GNU Radio Wi-Fi on Two NVIDIA Xavier SoCs

Diagram: video/image files enter over UDP into the GR IEEE 802.11 transmit (TX) flowgraph on Xavier SoC #1, travel over the antenna to the IEEE 802.11 receive (RX) flowgraph on Xavier SoC #2, and exit over UDP.

• Signal processing: an open-source implementation of IEEE 802.11 Wi-Fi a/b/g with GR OOT modules
• Input/output file support via Socket PDU (UDP server) blocks
• Image/video transcoding with OpenCL/OpenCV
GR-Tools

• First tools have been released
• Block-level ontologies [ontologyAnalysis]
  – The following properties are extracted from a batch of block definition files: descriptions and IDs, source and sink ports (whether input/output is scalar, vector, or multi-port), allowed data types, and additional algorithm-specific parameters (see the sketch after this list)
• Flowgraph characterization [workflowAnalysis]
  – Characterization of GR workloads at the flowgraph level
  – Scripts automatically run a flowgraph for 30 seconds and report a breakdown of high-level library module calls (see the libgnuradio CPU-time breakdown pie chart)
• Design-space exploration [designSpaceCL]
  – Script to run the 13 blocks included in gr-clenabled, both on a GPU and on a single CPU core, using input sizes varying between 2^4 and 2^27 elements
• Two prototype tools have been added recently
  – cgran-scraper
  – GRC-analyzer

https://github.com/cosmic-sdr
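To make the ontologyAnalysis step concrete, here is a minimal, hypothetical C++ sketch (not the actual ORNL tool) that pulls a block's name, ID, and sink/source port types out of a GRC 3.7-style XML block definition using regular expressions. The embedded sample definition and the extracted field set are illustrative assumptions.

// Hypothetical sketch: extract a few ontology properties (name, id, port types)
// from a GRC 3.7-style XML block definition. Not the actual ontologyAnalysis tool.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Illustrative sample block definition (trimmed); a real tool would read *.xml files.
static const std::string sample = R"(
<block>
  <name>Multiply Const</name>
  <key>blocks_multiply_const_vxx</key>
  <sink><name>in</name><type>complex</type></sink>
  <source><name>out</name><type>complex</type></source>
</block>)";

static std::vector<std::string> find_all(const std::string& text, const std::regex& re) {
    std::vector<std::string> out;
    for (auto it = std::sregex_iterator(text.begin(), text.end(), re);
         it != std::sregex_iterator(); ++it)
        out.push_back((*it)[1].str());
    return out;
}

int main() {
    std::regex name_re("<name>([^<]+)</name>");
    std::regex key_re("<key>([^<]+)</key>");
    std::regex sink_re("<sink>.*?<type>([^<]+)</type>.*?</sink>");
    std::regex src_re("<source>.*?<type>([^<]+)</type>.*?</source>");

    std::cout << "block name : " << find_all(sample, name_re).front() << "\n";
    std::cout << "block id   : " << find_all(sample, key_re).front() << "\n";
    for (const auto& t : find_all(sample, sink_re)) std::cout << "sink type  : " << t << "\n";
    for (const auto& t : find_all(sample, src_re))  std::cout << "source type: " << t << "\n";
    return 0;
}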
Applications Profiling

• Preliminary SDR application profiling:
  – Created a fully automated GRC profiling toolkit
  – Ran each of the 89 flowgraphs for 30 seconds
  – Profiled with performance counters
• Major overheads: Python glue code (libpython), O/S threading & profiling (kernel.kallsyms, libpthread), libc, ld, Qt
• Runtime overhead:
  – Will require significant consideration when run on an SoC
  – Cannot be executed in parallel
  – Hardware-assisted scheduling is essential

CPU time by library (percentage):
  [kernel.kallsyms]  27.8547
  libpython          18.6281
  libgnuradio        11.7548
  libc                7.7503
  ld                  3.8839
  libvolk             3.7963
  libperl             3.7837
  [unknown]           3.6465
  libQt5              2.9866
  libpthread          2.1449

Pie chart: libgnuradio CPU-time breakdown across sub-libraries (libgnuradio-analog, -blocks, -channels, -digital, -dtv, -fec, -fft, …); individual shares range from under 1% to 28%.
GRC statistics: Block Proximity Analysis

Block proximity analysis creates a graph:
• Nodes: unique block types
• Edges: blocks used in the same GRC file; every co-occurrence increases the edge weight by 1
• This example was run with --mode proximityGraph on a randomly selected subset of GRC files

A minimal sketch of the co-occurrence counting is shown below.
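The edge-weighting rule above is easy to state in code. Below is a minimal, self-contained C++ sketch (not the actual ORNL analysis tool) that counts pairwise co-occurrences of block types across flowgraphs; the hard-coded block lists stand in for block types parsed from real GRC files. Compile with -std=c++17.

// Minimal sketch of block-proximity counting: nodes are unique block types,
// and each co-occurrence of two blocks in the same flowgraph adds 1 to the
// weight of the edge between them. The sample flowgraphs are illustrative.
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // Each inner vector is the set of block types found in one GRC file.
    std::vector<std::vector<std::string>> flowgraphs = {
        {"osmosdr_source", "low_pass_filter", "wbfm_receive", "audio_sink"},
        {"osmosdr_source", "low_pass_filter", "qtgui_freq_sink"},
        {"signal_source", "low_pass_filter", "audio_sink"},
    };

    // Edge weights keyed by an ordered pair of block names.
    std::map<std::pair<std::string, std::string>, int> edge_weight;

    for (const auto& fg : flowgraphs) {
        // De-duplicate block types within a single flowgraph.
        std::set<std::string> blocks(fg.begin(), fg.end());
        for (auto a = blocks.begin(); a != blocks.end(); ++a)
            for (auto b = std::next(a); b != blocks.end(); ++b)
                ++edge_weight[{*a, *b}];   // one co-occurrence -> +1 weight
    }

    for (const auto& [edge, w] : edge_weight)
        std::cout << edge.first << " -- " << edge.second << " : " << w << "\n";
    return 0;
}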
Programming Systems

Programming Solution for DSSoC

Diagram: programming models are grouped into four roles (main input, main input/main output, optional input, and main output/optional input programming models) around the OpenARC and CASH/LLVM compilers. The models shown include MPI, OpenMP, OpenACC, OpenCL, CUDA, HIP, and Verilog, with the IRIS runtime underneath. Annotations in the diagram mark models "used as input programming model to the OpenARC compiler" and "used as both input and output programming model to the CASH compiler."
New OpenACC GR Block Mapping Strategy for Heterogeneous Architectures

Diagram: an OpenACC GR block (with a SYCL GR block planned, to support more programming models) is compiled to multiple backends that all run through the IRIS common runtime API:
• CUDA → NVIDIA GPU (mapping used for Xavier porting)
• OpenMP → ARM CPU (mapping used for Xavier porting)
• OpenMP → CPU / Xeon Phi
• Intel OpenCL → Intel FPGA
• OpenCL → general accelerators
• HIP → AMD GPU

IRIS offers a common API for diverse heterogeneous devices and also allows intermixing of multiple programming models (mix CUDA, OpenMP, OpenCL, etc.).
OpenACC GR Block Code Structure

Constructor
• The OpenACC GR block class inherits from the GRACCBase class as a base class.
• The GRACCBase constructor assigns a unique thread ID per OpenACC GR block instantiation, which is used internally for thread safety.
• The OpenACC backend runtime is also initialized.

Reference CPU Implementation
• Contains the same code as the original GR block, which may have already been vectorized using the Volk library.

OpenACC Implementation
• Contains the OpenACC version of the reference CPU implementation.
• Performs the following tasks:
  1) Copy input data to device memory.
  2) Execute the OpenACC kernel.
  3) Copy output data back to host memory.
• OpenARC will translate the OpenACC kernel to multiple different output programming models (e.g., CUDA, OpenCL, OpenMP, HIP, etc.).

Main Entry Function
• Main entry function executed whenever the GR scheduler invokes the OpenACC GR block.
• The GR block argument contextType decides whether to execute the reference CPU version or the OpenACC version.
• The OpenACC backend runtime may choose the CPU as an offloading target (e.g., offloading an OpenMP3 kernel to the CPU).

A simplified code sketch of this structure is shown below.
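As a rough illustration of the structure above, here is a simplified, self-contained C++ sketch. It is not the actual gr-openacc source: the class and member names (acc_multiply_const, GRACCBase, contextType) follow the naming described on this slide but are otherwise assumptions, the block is stripped of all GNU Radio scheduler plumbing, and the OpenACC directive shows only the basic copy-in/compute/copy-out pattern.

// Simplified sketch of an OpenACC-enabled GR block (illustrative only).
// Build with an OpenACC compiler, e.g. `nvc++ -acc -std=c++17 block_sketch.cpp`;
// without OpenACC support the pragma is ignored and the loop runs on the CPU.
#include <atomic>
#include <cstdio>
#include <vector>

enum class contextType { CPU, ACC };   // selects reference CPU path or OpenACC path

// Stand-in for the GRACCBase described on the slide: hands out a unique
// thread ID per block instantiation (used for thread safety) and would
// initialize the OpenACC backend runtime in its constructor.
class GRACCBase {
public:
    GRACCBase() : thread_id_(next_id_++) { /* OpenACC runtime init would go here */ }
    int thread_id() const { return thread_id_; }
private:
    static std::atomic<int> next_id_;
    int thread_id_;
};
std::atomic<int> GRACCBase::next_id_{0};

class acc_multiply_const : public GRACCBase {
public:
    acc_multiply_const(float k, contextType ctx) : k_(k), ctx_(ctx) {}

    // Main entry function: the GR scheduler would call this; contextType
    // decides between the reference CPU version and the OpenACC version.
    void work(const std::vector<float>& in, std::vector<float>& out) {
        if (ctx_ == contextType::CPU) work_cpu(in, out);
        else                          work_acc(in, out);
    }

private:
    // Reference CPU implementation (the original block might use Volk here).
    void work_cpu(const std::vector<float>& in, std::vector<float>& out) {
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = k_ * in[i];
    }

    // OpenACC implementation: copy input to device, run the kernel, copy output back.
    void work_acc(const std::vector<float>& in, std::vector<float>& out) {
        const float* x = in.data();
        float* y = out.data();
        const int n = static_cast<int>(in.size());
        const float k = k_;   // local copy so the device region needs no `this`
        #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
        for (int i = 0; i < n; ++i) y[i] = k * x[i];
    }

    float k_;
    contextType ctx_;
};

int main() {
    std::vector<float> in(1024, 2.0f), out(1024);
    acc_multiply_const blk(3.0f, contextType::ACC);
    blk.work(in, out);
    std::printf("out[0] = %f\n", out[0]);   // expect 6.0
    return 0;
}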
Example Translation of GR accLog Module

The slide shows three code listings (not reproduced here): the input OpenACC code for the accLog block, the OpenARC-generated host code, and the generated CUDA kernel code.
Port an Example SDR Workflow to Xavier

Two flowgraph screenshots (not reproduced here): an OpenACC-enabled workflow using gr-openacc blocks, and the reference CPU workflow using the original gr-blocks.
Basic Memory Management for OpenACC-Enabled GR Workflow

Diagram: a Source block, OpenACC block 1, OpenACC block 2, and a Sink block run on the host; each OpenACC block launches its own device kernel (kernel1, kernel2) and performs host-device transfers (steps 1-3) around it.

• In the basic memory management scheme, each invocation of an OpenACC GR block performs the following three tasks:
  1) Copy input data to device memory.
  2) Run a kernel on the device.
  3) Copy output data back to host memory.
Optimized Memory Management for OpenACC-Enabled GR Workflow

Diagram: the same pipeline as before, but the two OpenACC blocks exchange data directly through device memory, skipping the intermediate copies back to the host.

• In the optimized memory management scheme, some blocks can bypass unnecessary memory transfers between host and device and communicate directly with each other using device memory, if both the producer and consumer blocks are running on the same device.
• Note that the device buffer needs extra padding to handle the overwriting feature of the host circular buffer.

A sketch contrasting the basic and optimized schemes is shown below.
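To make the difference concrete, here is a minimal C++/OpenACC sketch, an illustration under assumptions rather than the gr-openacc implementation. The basic path copies the intermediate buffer back to the host between the two kernels; the optimized path keeps it in a device buffer allocated with the standard OpenACC acc_malloc call and hands it to the second kernel via the deviceptr clause. Circular-buffer padding and multi-device placement are omitted.

// Basic vs. optimized memory management between two OpenACC "blocks".
// Build with an OpenACC compiler, e.g. `nvc++ -acc mem_sketch.cpp`.
#include <cstdio>
#include <vector>
#include <openacc.h>

static const int N = 1 << 20;

// Basic scheme: each block copies its input in and its output back out.
void run_basic(const float* in, float* out) {
    std::vector<float> tmp(N);                      // intermediate result on the host
    float* t = tmp.data();
    #pragma acc parallel loop copyin(in[0:N]) copyout(t[0:N])
    for (int i = 0; i < N; ++i) t[i] = in[i] * 2.0f;           // block 1 kernel
    #pragma acc parallel loop copyin(t[0:N]) copyout(out[0:N])
    for (int i = 0; i < N; ++i) out[i] = t[i] + 1.0f;          // block 2 kernel
}

// Optimized scheme: producer and consumer share a device buffer, so the
// intermediate result never travels back to the host.
void run_optimized(const float* in, float* out) {
    float* d_tmp = static_cast<float*>(acc_malloc(N * sizeof(float)));
    #pragma acc parallel loop copyin(in[0:N]) deviceptr(d_tmp)
    for (int i = 0; i < N; ++i) d_tmp[i] = in[i] * 2.0f;       // block 1 kernel
    #pragma acc parallel loop copyout(out[0:N]) deviceptr(d_tmp)
    for (int i = 0; i < N; ++i) out[i] = d_tmp[i] + 1.0f;      // block 2 kernel
    acc_free(d_tmp);
}

int main() {
    std::vector<float> in(N, 1.0f), out(N);
    run_basic(in.data(), out.data());
    std::printf("basic     : out[0] = %f\n", out[0]);   // expect 3.0
    run_optimized(in.data(), out.data());
    std::printf("optimized : out[0] = %f\n", out[0]);   // expect 3.0
    return 0;
}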
Sample Output of the Example SDR Workflow

(Screenshot of the workflow's sample output; not reproduced here.)
SDR Workflow Profiling Using a Built-in GR Performance Monitoring Tool

• CPU versions of OpenACC blocks are algorithmically equivalent to those in the original GR blocks.

Charts: per-block execution times for the OpenACC blocks on the Xavier CPU vs. the original GR blocks on the Xavier CPU (blocks A, B, C, D1, D2).

Some OpenACC blocks (B, D) use a simple register-caching optimization, which causes them to perform better than the original GR blocks.
SDR Workflow Profiling Results When OpenACC Blocks Offloaded to CPU

• OpenACC blocks are automatically translated to OpenMP3 versions and run on Xavier CPU.

Charts: OpenACC blocks on the Xavier CPU via OpenMP vs. the original GR blocks on the Xavier CPU (blocks A, B, C, D1, D2).

Some of the original GR blocks (A, C) were already vectorized with the Volk library, and some of the original GR blocks (B, C) performed better than the corresponding OpenACC blocks.
SDR Workflow Profiling Results When OpenACC Blocks Offloaded to GPU

• OpenACC blocks are automatically translated to CUDA versions and run on the Xavier GPU.
• Each invocation of an OpenACC block executes three tasks: 1) copy input data to device memory, 2) run a kernel on the device, and 3) copy output data back to host memory.

Charts: OpenACC blocks on the Xavier GPU vs. the original GR blocks on the Xavier CPU.

Due to the extra memory transfer overheads, most OpenACC blocks perform worse than the original GR blocks, except for the OpenACC blocks D1 and D2.
SDR Workflow Profiling Results When Opt. OpenACC Blocks Offloaded to GPU

• OpenACC blocks are automatically translated to CUDA versions and run on the Xavier GPU.
• Optimized OpenACC blocks bypass memory transfers between host and device and communicate directly with each other using device memory if both the producer and consumer blocks are running on the same device.

Charts: optimized OpenACC blocks on the Xavier GPU vs. the original GR blocks on the Xavier CPU.

Most of the OpenACC blocks now perform better than the original GR blocks, except for block A; the original GR block A is vectorized with the Volk library, which performs better than the OpenACC block A.
More Complex SDR Workflow Example

This example offloads more OpenACC blocks to the Xavier GPU than the previous example.

Two flowgraph screenshots (not reproduced here): the OpenACC-enabled workflow using gr-openacc blocks, and the reference CPU workflow using the original gr-blocks.
Profiling Results When Opt. OpenACC Blocks Offloaded to GPU

• OpenACC blocks are automatically translated to CUDA versions and run on the Xavier GPU.
• Optimized OpenACC blocks bypass memory transfers between host and device and communicate directly with each other using device memory if both the producer and consumer blocks are running on the same device.

Charts: optimized OpenACC blocks on the Xavier GPU vs. the original GR blocks on the Xavier CPU (blocks A0, A1, B0, B1, C0, C1, D0-D3).

This example shows similar performance behavior to the previous example.
Programming Systems Update Summary and Next Steps

• Updated the programming system to use our new heterogeneous runtime system, called IRIS, as the common backend runtime.
  – IRIS allows intermixing of multiple different output programming models (e.g., OpenMP3, OpenMP4, OpenACC, CUDA, HIP, etc.) and runs them on heterogeneous devices concurrently.
• Developed a host-device memory transfer optimization scheme, which allows OpenACC GR blocks to bypass memory transfers between host and device and communicate directly with each other if both the producer and consumer blocks are running on the same device.
• Performed a preliminary evaluation of the new programming system by creating a synthetic SDR workflow using the OpenACC GR blocks.
• Next steps:
  – Port more complex GR blocks to OpenACC and evaluate more complex SDR workflows.
  – Continue to improve and fix bugs in the programming system.
Runtime systems for intelligent scheduling

IRIS: An Intelligent Runtime System for Extremely Heterogeneous Architectures

• Provide programmers a unified programming environment to write portable code across heterogeneous architectures (and preferred programming systems)
• Orchestrate diverse programming systems (OpenCL, CUDA, HIP, OpenMP for CPU) in a single application
  – OpenCL: NVIDIA GPU, AMD GPU, ARM GPU, Qualcomm GPU, Intel CPU, Intel Xeon Phi, Intel FPGA, Xilinx FPGA
  – CUDA: NVIDIA GPU
  – HIP: AMD GPU
  – OpenMP for CPU: Intel CPU, AMD CPU, PowerPC CPU, ARM CPU, Qualcomm CPU

https://github.com/swiftcurrent2018
The IRIS Architecture

• Platform Model
– A single-node system equipped with host CPUs
and multiple compute devices (GPUs, FPGAs,
Xeon Phis, and multicore CPUs)
• Memory Model
– Host memory + shared device memory
– All compute devices share the device memory
• Execution Model
– DAG-style task parallel execution across all
available compute devices
• Programming Model
– High-level OpenACC, OpenMP4, SYCL* (*
planned)
– Low-level C/Fortran/Python IRIS host-side
runtime API + OpenCL/CUDA/HIP/OpenMP
kernels (w/o compiler support)

Supported Architectures and Programming Systems by IRIS

ExCL* systems and their supported devices and programming systems:

Oswald
• CPU: Intel Xeon (Intel OpenMP, Intel OpenCL)
• GPU: NVIDIA P100 (NVIDIA CUDA, NVIDIA OpenCL)
• FPGA: Intel/Altera Stratix 10 (Intel OpenCL)

Summit-node
• CPU: IBM Power9 (IBM XL OpenMP)
• GPU: NVIDIA V100 (NVIDIA CUDA)

Radeon
• CPU: Intel Xeon (Intel OpenMP, Intel OpenCL)
• GPU: AMD Radeon VII (AMD HIP, AMD OpenCL)

Xavier
• CPU: ARMv8 (GNU GOMP)
• GPU: NVIDIA Volta (NVIDIA CUDA)

Snapdragon
• CPU: Qualcomm Kryo (Android NDK OpenMP)
• GPU: Qualcomm Adreno 640 (Qualcomm OpenCL)

* ORNL Experimental Computing Laboratory (ExCL), https://excl.ornl.gov/
IRIS Booting on Various Platforms

(Slide images showing IRIS running on the various platforms; not reproduced here.)
Task Scheduling in IRIS

• A task
  – A scheduling unit
  – Contains multiple in-order commands
    • Kernel launch command
    • Memory copy command (device-to-host, host-to-device)
  – May have DAG-style dependencies with other tasks
  – Enqueued to the application task queue with a device selection policy
• Available device selection policies
  – Specific device (compute device #)
  – Device type (CPU, GPU, FPGA, Xeon Phi)
  – Profile-based
  – Locality-aware
  – Ontology-based
  – Performance models (Aspen)
  – Any, All, Random, and 3rd-party users' custom policies
• The task scheduler dispatches the tasks in the application task queue to available compute devices
  – Selects the optimal target compute device according to the task's device selection policy
SAXPY Example on Xavier

• Computation
  – S[] = A * X[] + Y[]
• Two tasks
  – S[] = A * X[] on the NVIDIA GPU (CUDA)
  – S[] += Y[] on the ARM CPU (OpenMP)
    • S[] is shared between the two tasks
    • Read-after-write (RAW), true dependency
• Low-level Python IRIS host code + CUDA/OpenMP kernels
  – saxpy.py
  – kernel.cu
  – kernel.openmp.h
SAXPY: Python host code & CUDA kernel code

saxpy.py (1/2):

#!/usr/bin/env python

import iris
import numpy as np
import sys

iris.init()

SIZE = 1024
A = 10.0

x = np.arange(SIZE, dtype=np.float32)
y = np.arange(SIZE, dtype=np.float32)
s = np.arange(SIZE, dtype=np.float32)

print 'X', x
print 'Y', y
print 'S =', A, '* X + Y', s

mem_x = iris.mem(x.nbytes)
mem_y = iris.mem(y.nbytes)
mem_s = iris.mem(s.nbytes)

saxpy.py (2/2):

kernel0 = iris.kernel("saxpy0")
kernel0.setmem(0, mem_s, iris.iris_w)
kernel0.setint(1, A)
kernel0.setmem(2, mem_x, iris.iris_r)

off = [ 0 ]
ndr = [ SIZE ]

task0 = iris.task()
task0.h2d_full(mem_x, x)
task0.kernel(kernel0, 1, off, ndr)
task0.submit(iris.iris_gpu)

kernel1 = iris.kernel("saxpy1")
kernel1.setmem(0, mem_s, iris.iris_rw)
kernel1.setmem(1, mem_y, iris.iris_r)

task1 = iris.task()
task1.h2d_full(mem_y, y)
task1.kernel(kernel1, 1, off, ndr)
task1.d2h_full(mem_s, s)
task1.submit(iris.iris_cpu)

iris.finalize()

kernel.cu (CUDA):

extern "C" __global__ void saxpy0(float* S, float A, float* X) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  S[id] = A * X[id];
}

extern "C" __global__ void saxpy1(float* S, float* Y) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  S[id] += Y[id];
}
SAXPY: Python host code & OpenMP kernel code

saxpy.py (1/2) and (2/2): same host code as on the previous slide.

kernel.openmp.h (OpenMP):

#include <iris/iris_openmp.h>

static void saxpy0(float* S, float A, float* X, IRIS_OPENMP_KERNEL_ARGS) {
  int id;
  #pragma omp parallel for shared(S, A, X) private(id)
  IRIS_OPENMP_KERNEL_BEGIN
  S[id] = A * X[id];
  IRIS_OPENMP_KERNEL_END
}

static void saxpy1(float* S, float* Y, IRIS_OPENMP_KERNEL_ARGS) {
  int id;
  #pragma omp parallel for shared(S, Y) private(id)
  IRIS_OPENMP_KERNEL_BEGIN
  S[id] += Y[id];
  IRIS_OPENMP_KERNEL_END
}
Memory Consistency Management

saxpy.py: same host code as above. The key point is that mem_s is shared between the GPU task (task0, submitted with iris.iris_gpu) and the CPU task (task1, submitted with iris.iris_cpu), so IRIS manages the consistency of the shared memory object across the two devices.
Locality-aware Device Selection Policy

saxpy.py: same host code as above, except that task1 is submitted with the locality-aware policy, task1.submit(iris.iris_data), instead of a specific device type. The iris_data policy selects the device that requires the minimum data transfer to execute the task.
IRIS: Task Scheduling Overhead – Running One Million (Empty) Tasks

ntasks.py:

#!/usr/bin/env python

import iris

iris.init()

NTASKS = 1000000

t0 = iris.timer_now()

for i in range(NTASKS):
    task = iris.task()
    task.submit(iris.iris_random, False)   # asynchronous submission; CPU or GPU chosen randomly

iris.synchronize()   # concurrent task execution on multiple devices

t1 = iris.timer_now()

print 'Time:', t1 - t0

iris.finalize()

Result on Xavier:

user@xavier:~/work$ ./ntasks.py
Time: 11.46s

Throughput: 87,268 tasks/sec; latency: 11.4 μs/task
Closing
Team: Jeffrey Vetter (PI); Seyong Lee; Mehmet Belviranli; Jungwon Kim; Richard Glassbrook; Steve Moulton; Abdel-Kareem Moadi; Seth Hitefield; Blaise Tine (intern, Georgia Tech); Austin Latham; Mohammad Monil (intern, Oregon). Roles span programming systems, runtime systems, project management, systems engineering, software/hardware engineering, SDR, applications, modeling, ontologies, and SYCL.

Summary
• Architectural specialization
• Performance portability of applications and software
• DSSoC ORNL project investigating the performance portability of SDR
  – Understand applications and target architectures
  – Use open programming models: OpenACC, OpenCL, OpenMP
  – Developing intelligent runtime systems: IRIS
• Goal: scale applications from the Qualcomm Snapdragon to the DOE Summit supercomputer with minimal programmer effort
• Work continues…

Acknowledgements
• Thanks to staff and students for the work!
• Thanks to DARPA and DOE for funding our work!
• This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. This document is approved for public release: distribution unlimited.
