OpenACC for Programmers: Concepts and Strategies
Edited by Sunita Chandrasekaran and Guido Juckeland
2018
Foreword
Preface
Acknowledgments
1.1.1 Directives
1.1.2 Clauses
1.2.1 Kernels
1.2.2 Parallel
1.2.3 Loop
1.2.4 Routine
1.5 Exercises
2.4 Summary
2.5 Exercises
3.5 Summary
3.6 Exercises
4.5 Summary
4.6 Exercises
5.4 Summary
5.5 Exercises
6.5 Summary
6.6 Exercises
7.5 Summary
8.4 Summary
8.5 Exercises
9.4 Summary
9.5 Exercises
10.3 Summary
10.4 Exercises
Chapter 12: Innovative Research Ideas Using OpenACC, Part II
12.1 A Framework for Directive-Based High-Performance Reconfigurable Computing
Index
In the previous century, most computers used for scientific and technical program-
ming consisted of one or more general-purpose processors, often called CPUs,
each capable of carrying out a diversity of tasks from payroll processing through
engineering and scientific calculations. These processors were able to perform
arithmetic operations, move data in memory, and branch from one operation
to another, all with high efficiency. They served as the computational motor for
desktop and personal computers, as well as laptops. Their ability to handle a wide
variety of workloads made them equally suitable for word processing, computing
an approximation of the value of pi, searching and accessing documents on the
web, playing back an audio file, and maintaining many different kinds of data. The
evolution of computer processors is a tale of the need for speed: In a drive to build
systems that are able to perform more operations on data in any given time, the
computer hardware manufacturers have designed increasingly complex proces-
sors. The components of a typical CPU include the arithmetic logic unit (ALU), which
performs simple arithmetic and logical operations, the control unit (CU), which man-
ages the various components of the computer and gives instructions to the ALU, and
cache, the high-speed memory that is used to store a program’s instructions and
data on which it operates. Most computers today have several levels of cache, from
a small amount of very fast memory to larger amounts of slower memory.
Thus multicore processing systems were born. In such a system, the actual
compute logic, or core, of a processor is replicated. Each core will typically have
its own ALU and CU but may share one or more levels of cache and other memory
with other cores. The cores may be connected in a variety of different ways and will
typically share some hardware resources, especially memory. Virtually all of our
laptops, desktops, and clusters today are built from multicore processors.
Many other computers are embedded in telephone systems, toys, printers, and
other electronic appliances, and increasingly in household objects from washing
machines to refrigerators. These are typically special-purpose computing chips
that are designed to carry out a certain function or set of functions and have pre-
cisely the hardware needed for the job. Oftentimes, those tasks are all that they are
able to perform. As the demands for more complex actions grow, some of these
appliances today are also based on specialized multicore processors, something
that increases the available compute power and the range of applications for which
they are well suited.
Although the concept of computer gaming has been around since sometime in the
1960s, game consoles for home use were first introduced a decade later and didn’t
take off until the 1980s. Special-purpose chips were designed specifically for them,
too. There was, and is, a very large market for gaming devices, and considerable
effort has therefore been expended on the creation of processors that are very
The developers and operators of HPC systems have been at the forefront of hard-
ware innovation for many decades, and advances made in this area form the back-
drop and motivation for the topic of this book. IBM’s Roadrunner (installed at the
Department of Energy’s Los Alamos National Laboratory [LANL] in 2008) was the
first computing system to achieve 1 petaflop/s (1,000 trillion floating-point calcu-
lations per second) sustained performance on a benchmark (the Linpack TOP500)
that is widely used to assess a system’s speed on scientific application code. Its
nodes were an example of what is often called a hybrid architecture: They not only
introduced dual-core processors into the node but also attached Cell processors to
the multicores. The idea was that the Cell processor could execute certain portions
of the code much faster than the multicore. However, the code for execution on the
Cell had to be specifically crafted for it; data had to be transferred from the multi-
core’s memory to Cell memory and the results then returned. This proved to be dif-
ficult to accomplish as a result of the tiny amount of memory available on the Cell.
People at large data centers in industry as well as at public institutions had become
concerned about the rising cost of providing computing services, especially the cost
of the computers’ electricity consumption. Specialized cores such as the Cell were
expected to offer higher computational efficiency on suitable application code at a
very reasonable operating cost. Cores with these characteristics were increasingly
referred to as accelerators. At LANL they encountered major challenges with
respect to the deployment of accelerators in hybrid nodes. The application code had
to be nontrivially modified in order to exploit the Cell technology. Additionally, the
cost of transferring data and code had to be amortized by the code speedup.
Today, we are in an era of innovation with respect to the design of nodes for HPC
systems. Many of the fastest machines on the planet have adopted the ideas pio-
neered by Titan, and hence GPUs are the most common hardware accelerators.
Systems are emerging that will employ multiple GPUs in each node, sometimes
with very fast data transfer capabilities between them. In other developments,
technology has been under development to enable multicore CPUs to share memory—
and hence data—directly with GPUs without data transfers. Although there will still
be many challenges related to the efficient use of memory, this advancement will
alleviate some of the greatest programming difficulties. Perhaps more importantly,
many smaller HPC systems, as well as desktop and laptop systems, now come
equipped with GPUs, and their users are successfully exploiting them for scien-
tific and technical computing. GPUs were, of course, designed to serve the gaming
industry, and this successful adaptation would have been unthinkable without the
success stories that resulted from the Titan installation. They, in turn, would not
have been possible without an approachable programming model that meets the
needs of the scientific application development community.
Other kinds of node architecture have recently been designed that similarly prom-
ise performance, programmability, and power efficiency. In particular, the idea of
manycore processing has gained significant traction. A manycore processor is one
that is inherently designed for parallel computing. In other words, and in contrast
to multicore platforms, it is not designed to support general-purpose, sequential
computing needs. As a result, each core may not provide particularly high levels
of performance: The overall computational power that they offer is the result of
aggregating a large number of the cores and deploying them collaboratively to
solve a problem. To accomplish this, some of the architectural complexities of
multicore hardware are jettisoned; this frees up space that can be used to add
more, simpler cores. By this definition, the GPU actually has a manycore design,
although it is usually characterized by its original purpose. Other hardware devel-
opers are taking the essential idea behind its design—a large number of cores that
are intended to work together and are not expected to support the entire generality
of application programs—and using it to create other kinds of manycore hardware,
based on a different kind of core and potentially employing different mechanisms
to aggregate the many cores. Many such systems have emerged in HPC, and inno-
vations in this area continue.
The biggest problem facing the users of Titan, its successor platforms, and other
manycore systems is related to the memory. GPUs, and other manycores, have
relatively small amounts of memory per core, and, in most existing platforms,
data and code that are stored on the multicore host platform must be copied to
the GPU via a relatively slow communications network. Worse, data movement
expends high levels of electricity, so it needs to be kept to the minimum neces-
sary. As mentioned, recent innovations take on this problem in order to reduce the
complexity of creating code that is efficient in terms of execution speed as well as
power consumption. Current trends toward ever more powerful compute nodes
in HPC, and thus potentially more powerful parallel desktops and laptops, involve
even greater amounts of heterogeneity in the kinds of cores configured, new
kinds of memory and memory organization, and new strategies for integrating
the components. Although these advances will not lead to greater transparency in
the hardware, they are expected to reduce the difficulty of creating efficient code
employing accelerators. They will also increase the range of systems for which
OpenACC is suitable.
• Chapters 4–7 take you through your first real-world OpenACC programs and
reveal the magic behind compiling OpenACC programs, thereby introducing
additional concepts.
• Chapters 11 and 12 serve as a look into the diverse field of research on OpenACC
implementations and potential new language features.
Most chapters contain a few exercises at the end to review the chapter contents.
The solutions as well as the code examples used in the chapters are available
online at https://github.com/OpenACCUserGroup/openacc_concept_strategies_book.
This URL also presents a slide deck for each chapter to help teachers kick-start
their classes.
This book would not have been possible without a multitude of people who are not
listed as contributors. The idea of the book was originated by Duncan Poole, the
longtime OpenACC president. He wanted to offer not only online material but also
really old-fashioned printed material so that students and interested readers can
use this book to uncover the magic of parallel programming with OpenACC. When
Duncan could not pursue this idea any further, he passed the torch to Sunita and
Guido, and the result is now finally in all our hands.
• Pat Brooks and Julia Levites from NVIDIA, for bringing us in contact with pub-
lishers and answering questions that require inside knowledge
• Laura Lewin and Sheri Replin—our editors—and production manager Rachel Paul
and copy editor Betsy Hardinger for guiding us safely through the maze of actu-
ally generating a book
• Our chapter reviewers: Mat Colgrove, Rob Faber, Kyle Friedline, Roberto
Gomperts, Mark Govett, Andreas Herten, Maxime Hugues, Max Katz,
John Larson, Junjie Li, Sanhu Li, Meifeng Lin, Georgios Markomanolis, James
Norris, Sergio Pino, Ludwig Schneider, Thomas Schwinge, Anne Severt, and
Peter Steinbach
Some chapters would not have been possible without assistants to the contrib-
utors. Many thanks to Lingda Li, Masahiro Nakao, Hitoshi Murai, Mitsuhisa Sato,
Akihiro Tabuchi, and Taisuke Boku!
Have we already thanked our contributors who went with us on this crazy journey,
never let us down, and kept delivering content on time?
Jiri Kraus has more than eight years’ experience in HPC and scientific com-
puting. As a senior developer technology engineer with NVIDIA, he works as a
performance expert for GPU HPC applications. At the NVIDIA Jülich Applica-
tions Lab and the Power Acceleration and Design Center (PADC), Jiri collabo-
rates with developers and scientists from the Jülich Supercomputing Centre,
the Forschungszentrum Jülich, and other institutions in Europe. A primary
focus of his work is multi-GPU programming models. Before joining NVIDIA,
Jiri worked on the parallelization and optimization of scientific and technical
applications for clusters of multicore CPUs and GPUs at Fraunhofer SCAI in
St. Augustin. He holds a Diploma in mathematics (minor in computer science)
from the University of Cologne, Germany.
Jinpil Lee received his master's and PhD degrees in computer science from the
University of Tsukuba in 2013, under the supervision of Prof. Mitsuhisa Sato.
From 2013 to 2015, he was working at KISTI, the national supercomputing
center in Korea, as a member of the user support department. Since 2015, he
has worked at Riken AICS in Japan as a member of the programming environ-
ment research team. He has been doing research on parallel programming
models and compilers for modern cluster architectures such as manycore
clusters. Currently he is working on developing a programming environment
for the next flagship Japanese supercomputer.
Xiaonan (Daniel) Tian is a GPU compiler engineer at the PGI Compilers and
Tools group at NVIDIA, where he specializes in designing and implementing
languages, programming models, and compilers for high-performance com-
puting. Prior to joining NVIDIA, Daniel worked with Dr. Barbara Chapman in
her compiler research group at the University of Houston, where he received
a PhD degree in computer science. Prior to his work at the University of
Houston, Daniel worked on GNU tool-chain porting for a semiconductor com-
pany. His research includes computer architectures, directive-based parallel
programming models including OpenACC and OpenMP, compiler optimization,
and application parallelization and optimization.
1. http://www.openacc.org/.
This chapter introduces the OpenACC programming model. We explain some of its
fundamental features irrespective of the target platform:
• How to move data between traditional cores and accelerators using data envi-
ronment clauses
2. http://www.pgroup.com/about/news.htm#55.
3. https://www.hpcwire.com/2015/10/29/pgi-accelerator-compilers-add-openacc-support-for-x86-multicore-cpus-2/.
4. https://exascaleproject.org/wp-content/uploads/2017/04/Messina-ECP-Presentation-HPC-User-Forum-2017-04-18.pdf.
OpenACC directives are always set up in the following fashion (for C/C++):
#pragma acc <directive> [clause [[,] clause] . . .] new-line
Here it is in Fortran:
!$acc <directive> [clause [[,] clause]. . .]
The keyword for a compiler directive (which is #pragma for C/C++, or !$ for Fortran)
is followed by the directive type, which is acc for OpenACC directives. Next comes
the actual directive to tell the compiler what to do, followed by one or more optional
clauses to provide further information. A line with directives can be continued on
the following line by putting a backslash (\) at the end of the preceding line. Such
an OpenACC directive is applied to the immediately following statement, loop, or
structured code block. Together they form a so-called OpenACC construct.
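For illustration, here is a minimal sketch in C of a complete construct (the function, array, and size here are placeholders): a directive with one clause, applied to the loop that immediately follows it.

void init(double *a, int n) {
    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * i;   /* the directive plus this loop form one OpenACC construct */
    }
}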
1.1.1 Directives
The first word following the acc label is called the directive. A directive is an
“instruction” to the compiler to do something about the code block that follows it.
OpenACC distinguishes three directive types.
5. https://www.openacc.org/sites/default/files/inline-files/OpenACC_2pt5.pdf.
1. Compute directives. They mark a block of code that you can accelerate by
exploiting the inherent data parallelism and distributing work to multiple
threads. The OpenACC compute directives are parallel, kernels,
routine, and loop.
Note
Enter data and exit data are currently the only OpenACC directives that consist
of two words.
1.1.2 Clauses
Each OpenACC directive can be augmented by one or more clauses. A clause adds
additional information to what the compiler needs to do with an OpenACC con-
struct. Not all clauses can be combined with all directives, as is explained in detail
in the remainder of this book. Most clauses accept additional arguments in paren-
theses. In general, clauses fall into three categories.
1. Data handling. These clauses override the compiler analysis for the speci-
fied variables by assigning a certain behavior for them. Example clauses are
default, private, firstprivate, copy, copyin, copyout, create,
delete, and deviceptr.
2. Work distribution. These clauses override the compiler-selected values for the
work distribution among generated threads. The clauses are seq, auto, gang,
worker, vector, tile, num_gangs, num_workers, and vector_length.
3. Control flow. These clauses allow for steering of the parallel execution during
program runtime. Execution can be marked as conditional (if or if_present),
dependencies overridden (independent and reduction), and task parallel-
ism specified (async and wait).
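As a sketch of how clauses from all three categories can be attached to a single directive (the function, array names, and values here are illustrative, not a recommendation):

void scale(const double *b, double *a, int n) {
    /* data handling: copyin/copyout; work distribution: vector_length;
       control flow: async, paired with a wait directive */
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n]) vector_length(128) async(1)
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * b[i];
    }
    #pragma acc wait(1)
}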
In C/C++, you gain access to the OpenACC runtime API by including the header openacc.h. Here it is in Fortran:
use openacc
Note
Using the OpenACC API routines leads to a dependency on an OpenACC runtime
environment. If the program is compiled with a non-OpenACC-capable compiler,
compilation and linking will fail because that compiler does not provide the
required runtime library.
Now the program can use a number of functions directly from the OpenACC run-
time. The routines are used for the following.
• Device management: Querying and setting the used compute device types and
numbers. This can also be set by the environment variables ACC_DEVICE_
TYPE and ACC_DEVICE_NUM.
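A sketch of device management through the runtime API in C (acc_get_num_devices, acc_set_device_type, and the acc_device_nvidia constant are part of the OpenACC API; error handling is omitted):

#include <openacc.h>
#include <stdio.h>

int main(void) {
    int ndev = acc_get_num_devices(acc_device_nvidia);  /* how many NVIDIA devices? */
    printf("NVIDIA devices available: %d\n", ndev);
    if (ndev > 0) {
        acc_set_device_type(acc_device_nvidia);  /* use NVIDIA devices for what follows */
    }
    return 0;
}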
1.2.1 Kernels
The kernels directive is the first of two compute constructs provided by
OpenACC. Both constructs are used to offload code to the compute device; however,
the philosophy of the two directives is different. The kernels construct merely acts
as a sentinel to tell the compiler that this region of code should be placed on the
accelerator if possible. However, when loops are present inside of a kernels con-
struct, things become interesting.
int a[n][m], b[n][m], c[n][m];
init(a,b,n,m);
#pragma acc kernels
for(int j = 0; j < n; ++j) {
    for(int k = 0; k < m; ++k) {
        c[j][k] = a[j][k];
        a[j][k] = c[j][k] + b[j][k];
    }
}
Because the kernels construct tells the compiler to do the heavy lifting of paral-
lelizing the code, the compiler is expected either to run this code sequentially or
to parallelize the loop nest to exploit the parallelism. No matter what the compiler
chooses to do, it must ensure that correct, sequentially consistent results are
computed.
You can use several clauses to modify the kernels construct, but because it is
intended to be a compiler-driven mechanism, we discuss these clauses in the next
section. There is, however, one situation that merits examining: multiple loop nests
inside a kernels construct.
int a[n][m], b[n][m], c[n][m];
init(a,b,n,m);
#pragma acc kernels
{
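    /* Two loop nests inside this kernels region (reconstructed here as an
       assumed illustration, to match the fused and split versions shown below): */
    for(int j = 0; j < n; ++j) {
        for(int k = 0; k < m; ++k) {
            c[j][k] = a[j][k];
            a[j][k] = c[j][k] + b[j][k];
        }
    }
    for(int j = 0; j < n; ++j) {
        for(int k = 0; k < m; ++k) {
            d[j][k] = a[j][k] - 5;
        }
    }
}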
In this code, the compiler is free to choose to do at least two things. First, it can
decide to fuse the two loop nests into one loop nest:
int a[n][m], b[n][m], c[n][m];
init(a,b,n,m);
#pragma acc kernels
for(int j = 0; j < n; ++j) {
    for(int k = 0; k < m; ++k) {
        c[j][k] = a[j][k];
        a[j][k] = c[j][k] + b[j][k];
        d[j][k] = a[j][k] - 5;
    }
}
This fusion will likely allow for more efficient usage of the available resources,
because it will not incur the overhead of launching two kernels.
Or, second, the compiler can choose to generate two kernels that could also be
coded as two parallel regions.
int a[n][m], b[n][m], c[n][m];
init(a,b,n,m);
#pragma acc parallel loop
for(int j = 0; j < n; ++j) {
    for(int k = 0; k < m; ++k) {
        c[j][k] = a[j][k];
        a[j][k] = c[j][k] + b[j][k];
    }
}
#pragma acc parallel loop
for(int j = 0; j < n; ++j) {
    for(int k = 0; k < m; ++k) {
        d[j][k] = a[j][k] - 5;
    }
}
1.2.2 Parallel
The parallel directive is the second compute construct provided by OpenACC.
Whereas the kernels directive places the onus on the compiler to generate
correct code, the parallel directive puts the onus on the user to spell out what to
parallelize and to ensure that correct code is generated. The parallel direc-
tive takes the position that users are much smarter about the parallel nature of the
code that is being written and thus puts you on the spot to correctly express all the
parallelism in the region.
Now look at the example from the preceding section, naively changed to use only
the parallel construct.
int a[n][m], b[n][m], c[n][m];
init(a,b,n,m);
#pragma acc parallel
for(int j = 0; j < n; ++j) {
    for(int k = 0; k < m; ++k) {
        c[j][k] = a[j][k];
        a[j][k] = c[j][k] + b[j][k];
    }
}
What is wrong with this code? It is running the loop nest with some parallelism, but
redundantly. We will fix the redundant execution later, but it is important to note
that the compiler is free to choose the number of threads to execute this loop with.
This means that the code may run slowly and give correct results, or it might not
give correct results.
Also note that the compiler will detect that the objects a and b are used inside the
parallel region and generate any data movements that may be needed to move the
objects to the device. Because the compiler is free to run this code with any number
of parallel threads, it is not portable, because some implementations will choose
one thread and give the correct answer, whereas others may choose something
else and give the wrong answer.
1.2.3 Loop
The loop construct can be used inside both the parallel construct and the kernels
construct. Inside the parallel construct, the loop construct tells the compiler that
the iterations of this loop are independent. No iteration of the loop modifies data
that is used by another iteration, and hence all iterations can be executed in a
parallel manner. Inside the kernels construct, the loop construct can be used to tell
the compiler that a loop is independent when the compiler cannot determine this
at compile time. Inside both compute constructs, you can use the seq clause to tell
the compiler to execute the loop in a sequential manner.
void foo( int *a, int *b, int n ) {
    #pragma acc parallel
    #pragma acc loop
    for(int j = 0; j < n; ++j) {
        a[j] += b[j];
    }
}

void bar( int *a, int *b, int n ) {
    #pragma acc parallel
    #pragma acc loop gang worker vector
    for(int j = 0; j < n; ++j) {
        a[j] += b[j];
    }
}
The code in foo and bar shows two different ways of saying similar things in OpenACC.
The reason is that the compiler is always free to use more levels of parallelism on a
loop if it sees no better uses elsewhere. The code in foo() allows the compiler to
choose the levels of parallelism it wishes to exploit. The code in bar() tells the com-
piler it should exploit all thread levels of parallelism as discussed in Chapter 2.
1.2.4 Routine
Throughout this section, with one exception, functions and subroutines have
appeared only outside compute constructs. This is deliberate, because making calls
inside compute constructs is more complex than just calling a simple function.
Because the function can be called from within a compute construct having any
arbitrary parallelism, it must be compiled for such a situation as well. Hence we
use the routine directive to mark the function as potentially parallel.
The routine directive is used in two places: in procedure definitions and in proce-
dure declarations.
#pragma acc routine
extern void foo(int *a, int *b, int n);
This code example shows the three ways that the function foo can be modified
with the routine directive. The first approach places the directive on the dec-
laration of the function and tells the compiler that any calls to this function inside
compute constructs will exploit parallelism. The second uses a named form of the
directive to declare a function that has already been prototyped to have a device
version of the code, as with the first example. The third use of the routine direc-
tive tells the compiler to generate a device version of the function so that compute
constructs can call the procedure.
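The following sketch lays out the three forms just described as separate alternatives (only the first appears in the listing above; the second and third are written here as assumed illustrations and would not normally all appear together for the same function):

/* 1. On the declaration: calls to foo inside compute constructs may exploit parallelism. */
#pragma acc routine
extern void foo(int *a, int *b, int n);

/* 2. Named form, referring to a function that has already been prototyped. */
#pragma acc routine(foo)

/* 3. On the definition: generate a device version of the function body. */
#pragma acc routine
void foo(int *a, int *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}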
Note
There are more intricacies about combining the routine directive with the three levels
of parallelism (gang, worker, vector), which are introduced in Chapter 2. For simplicity
it can be said that no level of parallelism can be requested more than once, especially
for nested routines and loops.
The routine directive takes three other clauses that are covered here.
#pragma acc routine vector device_type(nvidia) gang bind(foo_nvidia)
void foo(int *a, int *b, int n);
The first directive in this code tells the compiler that whenever it sees a call to
foo() inside a compute construct while compiling for NVIDIA targets, it should
replace the call name with foo_nvidia.6 The directive also tells the compiler that
for all targets other than NVIDIA, the function foo() will exploit only vector par-
allelism; however, on NVIDIA targets the function foo_nvidia will use both gang
and vector parallelism, leaving no parallelism available outside the function.
The second directive tells the compiler to generate code for the device but not for
the host. This is useful for at least two reasons. First, it reduces the text size of the
generated code. Second, it ensures that code that could not be compiled for the
host—that is, a device-specific intrinsic—is not given to the host compiler.
6. NVIDIA, founded in 1993, is a vendor of GPUs, notably for gaming platforms. See
http://www.nvidia.com/object/about-nvidia.html.
While this looks extremely complicated, rest assured that this is about the maximum
level of complexity you will encounter even in very advanced OpenACC programs, and
in many cases it is not necessary.
Scalar variables used in parallel regions but not listed in data clauses are prede-
termined as firstprivate (a private variable that is initialized with the value
from before the construct) or private, depending on whether the object is read or
written first in the region.
The specification defines two important concepts related to data objects: data
regions and data lifetimes. A data region is the dynamic scope of a structured block
associated with an implicit or explicit data construct. A data lifetime starts when
an object is first made available to a device, and ends when the object is no longer
available on the device. Four types of data regions are defined: the data construct,
the compute construct, a procedure, and the whole program.
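In C, a structured data region is written as in the following sketch (the array a and its use are placeholders):

#pragma acc data copy(a)
{
    /* <use a> */
}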
The data lifetime for a begins at the opening curly brace (or curly bracket) and ends
at the closing curly brace. Here is the Fortran equivalent.
!$acc data copy(a)
<use a>
!$acc end data
An unstructured data region is delimited with a begin-end pair that need not be in
the same lexical scope.
void foo(int *array, int n) {
#pragma acc enter data copyin(array[0:n])
}
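The matching end of the data lifetime can then appear in a different scope; for example (a sketch, with the function name chosen for illustration):

void bar(int *array, int n) {
    #pragma acc exit data copyout(array[0:n])   /* copy results back and end the lifetime */
}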
The proper placement of data directives can greatly increase the performance of
a program. However, placing data directives should be considered an optimization,
because you can handle all data motion at parallel and kernels constructs by add-
ing the same data clauses that are available to the structured data construct.
The foundation of all the clauses that allocate data is the create clause. This
clause creates a data lifetime for the objects listed in the clause on the device.
This means that on a nonshared memory device, memory for the object is to be
available during the associated construct. “Available” means that if the object has
already had memory created for it on the device, then the runtime need only update
the reference count it uses to track when an object’s memory is no longer needed
on the device. If the object does not already have memory assigned to it, then the
runtime will allocate memory for the object and update the reference count accordingly.
So we know that data on the device is tracked by the runtime, but why even expose
this to users? Enter the present clause. It is the first part of every data clause, but
it is more of a bookkeeping clause than a data clause. Rather than ensure an object
is made available on the device, it ensures that the object is available on the device.
Why is this useful? A library writer may want to ensure that the objects are already
on the device so that a computation—one that would normally not be beneficial if
data motion were required—can be sent to the device.
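A sketch of that library scenario (the function and variable names are placeholders): the caller establishes the data region, and the library routine asserts presence instead of moving data.

void scale_on_device(double *a, int n) {       /* hypothetical library routine */
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
}

void caller(double *a, int n) {
    #pragma acc data copy(a[0:n])
    {
        scale_on_device(a, n);                 /* no further data movement needed */
    }
}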
The next three clauses are related, both in name and in function. We start with the
most complex clause: copy. This clause starts with a present clause. Then if
the data is not present, the create clause is performed. Once the object is on the
device, the copy clause copies the data from the host to the device at the beginning
of the region. When the region ends, if an object’s reference count—the number of
data constructs associated with the object—is about to go to zero, then the data is
copied back to the host and the memory is freed.
The copyin clause does everything the copy clause does on the way into a data
region, but at the end of the region the data is not copied back to the host. Data
storage is released only if the reference count goes to zero.
The copyout clause does everything the copy clause does, except copyout does
not copy the data to the device on entry to the data region.
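A sketch of a common pattern (names are placeholders): a data region keeps the arrays resident across many kernel launches, with copyin for read-only input and copy for data needed before and after the region.

void iterate(double *a, const double *b, int n, int steps) {
    #pragma acc data copyin(b[0:n]) copy(a[0:n])
    {
        for (int s = 0; s < steps; ++s) {      /* host loop; no transfers per step */
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] += b[i];
        }
    }
}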
The delete clause is powerful. However, as with many other powerful things, it
is also dangerous. The delete clause does two things. First it determines whether
the object is present, and if not it does nothing; second, it removes the object from the
device’s data environment by forcing the reference count to zero and releasing the
memory.
The deviceptr clause was added to improve support for native libraries and
calls that use the device. It tells the OpenACC system that the pointer in the clause
contains an address that is resident on the device. This allows you to take control of
data placement outside OpenACC but still use the data inside via the API.
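A sketch of this pattern using the runtime API (acc_malloc and acc_free are part of the OpenACC API; the names and sizes are placeholders):

#include <openacc.h>

void example(int n) {
    double *d_a = (double *) acc_malloc(n * sizeof(double)); /* device-side allocation */
    #pragma acc parallel loop deviceptr(d_a)
    for (int i = 0; i < n; ++i)
        d_a[i] = i;          /* d_a is already a device address; no data clause copies it */
    acc_free(d_a);
}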
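The cache directive appears at the top of a loop body and names the data to keep in the fastest memory available. The following sketch illustrates the kind of loop under discussion (the variable names and loop body are placeholders, and whether the hint is honored depends on the target):

void smooth(double *a, const double *b, int n) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        #pragma acc cache(b[i:1])              /* keep the element used here in fast memory */
        a[i] = b[i] * b[i] + 2.0 * b[i];       /* b[i] is reused within the iteration */
    }
}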
In this example, cache tells the compiler that each iteration of the for loop uses
only one element of the array b. This is sufficient information for the compiler to
determine that if the loop is going to be run by n threads, the compiler should move
n elements of the array into faster memory and work from this faster memory
copy. This movement does not come without a cost, but if the body of the loop con-
tains sufficient reuses of the objects that are moved to fast memory, the overhead
can be well worth the cost. If the object that is in the cache directive is read-only,
the cost is greatly reduced. However, the compiler must do the data analysis to
determine whether the object is live on entry or exit from the loop.
OpenACC offers the capability for partial data transfers on all data constructs
and clauses, which is often termed array shaping. The syntax of this feature is
for C/C++:
#pragma acc data copy(a[0:n])
Here it is in Fortran:
!$acc data copy(a(1:n))
The array boundary notation differs for C and Fortran. In C you specify the start
index and after the colon the number of elements, whereas in Fortran you specify
the range by listing the starting and ending index.
A typical use case, like the example provided, is when you use dynamically allo-
cated arrays of n elements. The provided code snippet will copy the whole array
a to the device memory at the start of the data construct and transfer its contents
back to the host at the end. Additionally, this notation is very helpful when dealing
with arrays that are larger than the available device memory or when only part
of the array is needed on the device. With array shaping one can build a software
pipeline to stage processing of array parts, as discussed later in Chapter 10.
It should also be noted that when you transfer or update only parts of your array, you
as the programmer are responsible for ensuring that your compute regions access only
the transferred parts. Otherwise you risk unspecified behavior of your application and
potentially a very time-consuming search for the error.
1.4 Summary
OpenACC programmers can target various types of processors and program to an
abstract machine model. You also have available a range of fundamental directives
for developing a functioning program. You can run code on an accelerator, and you
can use various mechanisms to reduce the amount of data traffic that is required
for machines with separate host and device memories. With the knowledge of these tools, you are now
ready to begin writing parallel programs using OpenACC.
1.5 Exercises
1. Turn one of the following “hello world” codes into an actual program, and run it
to see what happens.
Simple hello world in C
#pragma acc parallel
printf("hello world!");
a[i] = 0;
}
#pragma acc parallel loop
for ( int i = 0; i < n; i++ ) {
a[i] = i;
}
for ( int i = 0; i < n; i++ ) {
if ( a[i] == 0 ) {
a[i] = i;
}
else {
a[i] = (a[i] + i) * i;
}
}
myPrint(a, i);
3. Write a program that puts two separate loop nests inside a kernels construct.
If the compiler does not already generate two kernels, modify the code to force
the compiler to generate two kernels.
Loops are at the heart of the key computational kernels of many applications, espe-
cially in computational simulations. Because of the nature of the data that applica-
tions or algorithms operate on, data is stored in arrays, often multidimensional in
scope. This means that the majority of work in the application is in iterating over
those arrays and updating them using the algorithm being employed.
When you are looking for a way to use the large range of computational resources
that are available in parallel computing hardware, it is therefore natural to alight on
the idea of distributing the iterations of these nests of loops to different process-
ing elements in a computational resource; in this way, you distribute work across
available hardware and reduce the overall runtime of the application. Indeed, this is
a core feature of parallel programming constructs, such as OpenMP1 and OpenCL.2
However, recognizing that accelerators or manycore (more than 12–18 cores) hard-
ware may have several levels of parallelism—that is, cores, threads, and vector
units—OpenACC allows you to specify a more detailed description of the way loops
are mapped to the hardware being exploited by a given execution of an application.
This chapter explores in detail how to parallelize loops using OpenACC, discussing
the various levels of parallelism provided and the clauses that can be added to the
parallelization directives to ensure that the resulting parallelism still produces
correct results, or to improve performance.
1. http://www.openmp.org.
2. https://www.khronos.org/opencl/.
For instance, if you were to use the kernels directive to parallelize the loops out-
lined in this chapter, it could look like this, for C:
#pragma acc kernels
{
    for(int i=0; i<N; i++){
        for(int j=0; j<M; j++){
            for(int k=0; k<P; k++){
                particles[i][j][k] = particles[i][j][k] * 2;
            }
        }
    }
}
Here it is in Fortran:
!$acc kernels
do i=1,N
  do j=1,M
    do k=1,P
      particles(k,j,i) = particles(k,j,i) * 2
    end do
  end do
end do
!$acc end kernels
If you were to use the parallel directive to parallelize the same code, it would
look like this, for C:
#pragma acc parallel
{
    for(int i=0; i<N; i++){
        for(int j=0; j<M; j++){
            for(int k=0; k<P; k++){
                particles[i][j][k] = particles[i][j][k] * 2;
            }
        }
    }
}
Here it is in Fortran:
!$acc parallel
do i=1,N
  do j=1,M
    do k=1,P
      particles(k,j,i) = particles(k,j,i) * 2
    end do
  end do
end do
!$acc end parallel
To achieve correct and efficient parallelization when using the parallel directive,
you must combine it with one, or more, loop directives to specify where to apply work
sharing. Therefore, the example should have been parallelized as follows, in C:
#pragma acc parallel loop
for(int i=0; i<N; i++){
    for(int j=0; j<M; j++){
        for(int k=0; k<P; k++){
            particles[i][j][k] = particles[i][j][k] * 2;
        }
    }
}
Both kernels and parallel allow you to include multiple loops inside them;
however, the way they treat those loops differs. If there are multiple loops inside a
kernel region, each loop (that the compiler decides it can parallelize) will gener-
ate a separate kernel to be executed, and these will run sequentially, one after the
other. In contrast, if there are multiple loops inside a single parallel region, then
these may run at the same time, or at least there is no guarantee that the first loop
will have finished before the second one starts. Consider this example, in Fortran:
!$acc parallel
!$acc loop
do i=1,1000
  a(i) = b(i)
end do
!$acc loop
do i=1,1000
  b(i) = b(i)*2
  c(i) = b(i)+a(i)
end do
!$acc end parallel
Here it is in C:
#pragma acc parallel
{
    #pragma acc loop
    for(int i=0; i<1000; i++){
        a[i] = b[i];
    }
    #pragma acc loop
    for(int i=0; i<1000; i++){
        b[i] = b[i]*2;
        c[i] = b[i]+a[i];
    }
}
This code is not guaranteed to be correct, because the second loop could start
executing before the first loop has finished, depending on how the loops have been
mapped to the hardware by the compiler. If kernels had been used instead, this
would have generated code that was guaranteed to be correct.
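For comparison, a sketch of the same two loops placed inside a kernels construct, where each loop becomes its own kernel and the kernels run one after the other:

#pragma acc kernels
{
    for(int i=0; i<1000; i++){
        a[i] = b[i];
    }
    for(int i=0; i<1000; i++){
        b[i] = b[i]*2;
        c[i] = b[i]+a[i];
    }
}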
These clauses, which can be applied to the loop directive, affect how loops and
loop iterations are distributed to hardware components on the targeted processors
or accelerator. This may affect both the performance and the functionality of an
application, because the types of synchronization that are available between loop
iterations are dependent on the hardware where the iterations are executed.
We discuss these clauses in this section, but note that what they actually translate
to depends on the hardware your application is running and the compiler you are
using. Most hardware that OpenACC programs run on supports at least two levels
of parallelism (across cores and across vector units), but some hardware may
support more or fewer levels. However, you can make some general assumptions:
gang is the most coarse-grained, mapping to chunks of work assigned to various
gangs of workers to perform independently; worker defines how work is distributed
inside a gang; and vector maps to the hardware’s instruction-level parallelism (i.e.,
vector operations).
The use of gang, worker, and vector allows the complexity of different hardware
to be abstracted by programmers, providing a generalized level of mapping between
the program’s functionality and the hardware the program is running on. However, it is
possible to specify in more detail exactly how to map loops to hardware—for instance
specifying the vector length along with specifying the number of workers, or gangs,
to use. This construct may provide optimized performance for specific hardware,
albeit at the expense of how well the application’s performance stands up when the
application is ported, often termed as performance portability to other hardware.
Finally, note that OpenACC does not require programmers to specify this mapping
between program functionality and parallel hardware. If you choose not to specify
gang, worker, or vector clauses on your loop directive, then the compiler will
attempt a mapping based on what it knows about the compiled code and hardware
being used.
Workers are threads within a gang, and they allow users or developers to spec-
ify how threads can be mapped to the hardware being used. As such, they are an
intermediate level between the group of threads (the gang) and the low-level paral-
lelism implemented in vector (which specifies how instructions are mapped to the
hardware available to an individual thread).
Vector parallelism has the finest granularity, with an individual instruction oper-
ating on multiple pieces of data (much like single-instruction, multiple-data (or
SIMD) parallelism on a modern CPU, or warps on NVIDIA GPUs). These operations
are performed with reference to a vector length, which specifies how many units of
data will be operated on for a given instruction.
These three levels of parallelism layout enable a mapping to all current hardware
platforms, as shown in Table 2.1.
Table 2.1 Potential mapping of gang, worker, and vector to various hardware platforms

Platform                          Gang                                    Worker      Vector
Multicore CPU                     Whole CPU (NUMA domain)                 Core        SIMD vector
Manycore CPU (e.g., Xeon Phi)     NUMA domain (whole chip or quadrant)    Core        SIMD vector
NVIDIA GPU (CUDA)                 Thread block                            Warp        Thread
AMD GPU (OpenCL)                  Workgroup                               Wavefront   Thread
Note: The mapping actually used is up to the compiler and its runtime environment, and this table is not guaranteed for an
actual OpenACC implementation. The programmer can guide the compiler only by specifying a number of each element to use.
At this point it is crucial to understand both the hardware you are targeting and
your algorithm or application. If you know the sizes of your loops (or at least
roughly the range of sizes they may be) and the synchronization you need between
loop iterations, then you should be able to make sensible choices about what hard-
ware mapping directives you can use on which loops.
For NVIDIA GPU hardware, it is likely that gangs map to thread blocks, workers to
warps, and vector to threads. Alternatively, you could simply use a gang to map
to thread blocks and workers to map to threads in those thread blocks. Likewise,
different loops in a parallel or kernel region could map to different thread blocks.
3. NUMA stands for non-uniform memory access. If you have multiple processors in a
node it is likely all of them can access the memory of the node, but it’s quicker to access
memory directly attached to a processor than it is to access memory attached to another
processor in the node. Each processor would be a NUMA region: a region where local
memory accesses are faster than remote memory accesses.
NUMA characteristics of the node being ignored when mapping OpenACC directives
to hardware.
These clauses take an integer value that specifies, respectively, the number of
gangs to be used for each kernel or gang loop in the region, the number of work-
ers per gang, and the vector length to use in vector operations. Note that these
are optional clauses, and you can use any combination of them required. Here’s an
example:
!$acc parallel loop num_gangs(32) vector_length(128)
do i=1,N
  do j=1,M
    do k=1,P
      particles(k,j,i) = particles(k,j,i) * 2
    end do
  end do
end do
!$acc end parallel loop
If the value you specify is not supported or implementable on the hardware being
used, the actual implementation may use a lower value than what you specify in
these clauses.
When you manually specify how loop iterations should be mapped to the available
parallel hardware, the loops you are parallelizing must have enough iterations to
map to the various hardware levels. You must also specify loop clauses on all loops
inside a parallel region, something that can be tedious and can impact the readabil-
ity of the code.
As an alternative, you can treat a whole nest of loops as a single loop space to be
distributed to the parallel hardware, specify the parallelism at a single loop level,
and allow the compiler to sort out the mapping of those loops to specific hardware.
This approach also can have the benefit of providing a larger loop iteration space
to be distributed across hardware threads or vector hardware, and this is useful
if some loops in the loop nest have small iteration counts, as demonstrated in the
following code:
for(int i=0; i<8; i++){
    for(int j=0; j<8; j++){
        for(int k=0; k<8; k++){
            . . .
        }
    }
}
To tell the compiler to treat a whole loop nest as a single loop, you can use the
collapse keyword for the loop directive. This directive takes a single positive
number as an argument and uses this number to determine how many loops within
the loop nest to collapse into a single loop. Here’s an example:
!$acc parallel
!$acc loop collapse(3)
do i=1,8
  do j=1,8
    do k=1,8
      …
    end do
  end do
end do
!$acc end parallel
For collapse to work, the loops need to be tightly nested—that is, there is no
code between loops—and the size of the loops must be computable and must not
change during the loop.
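The same collapsed loop nest written in C might look like this (a sketch):

#pragma acc parallel loop collapse(3)
for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
        for (int k = 0; k < 8; k++) {
            /* ... */
        }
    }
}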
Therefore, compilers often err on the side of caution when deciding which loops
can be parallelized, and this may mean performance is lost in code because the
functionality that could be run across many threads is being run sequentially.
Here’s an example of such code:
!$acc kernels
!$acc loop
do i = 1, 512
  do j = 1, 1024
    local_index = ((i-1)*1024)+j
    halo_data(local_index) = particle_data(j,i)
  enddo
enddo
!$acc end kernels
The code in the innermost loop depends on the outer loop counter (to calculate
local_index), and therefore the compiler may decide this is a loop dependency
(i.e., one iteration of the loop depends on the result of the previous iteration); as a
result, the compiler produces a sequential kernel rather than parallelize the loops.
You can address this problem by adding the independent clause to the loop
directive to tell the compiler you guarantee that the loop iterations are independent
and therefore it can safely parallelize the loops.
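In C, for instance, a sketch of such a loop with the clause applied (the arrays mirror the Fortran example above and are assumed to be declared elsewhere):

#pragma acc kernels
#pragma acc loop independent
for (int i = 0; i < 512; i++) {
    #pragma acc loop independent
    for (int j = 0; j < 1024; j++) {
        int local_index = i * 1024 + j;                  /* derived from both loop counters */
        halo_data[local_index] = particle_data[i][j];    /* no two iterations write the same element */
    }
}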
Note that this issue is not restricted to OpenACC or accelerator applications; com-
pilers have similar issues trying to vectorize code, and many compilers or parallel
programming approaches will give you methods for giving the compiler a bit more
information about loops to help it in distributing iterations to parallel hardware.
You can usually be guided by compiler reports on the parallelism that it has imple-
mented, or by profiling data on your application, regarding loops that don’t seem
to be parallelizing. Then you can decide whether they truly are independent and
therefore whether you can apply this clause.
For instance, the PGI compiler, when compiling code using the -Minfo=accel
flag, may produce output like this:
. . . Parallelization would require privatization of
array 'halo_data(:)'
Accelerator kernel generated
!$acc loop seq
Such output indicates that the associated code loop needs further investigation.
Furthermore, the parallel directive instructs the compiler to parallelize all the
code contained within the parallel region, whether or not it is correct to do so.
Also, loop nests to be parallelized with OpenACC directives may have more loops
available than the various types of parallelism you are targeting (i.e., more than
three loops, which you could map to gang, worker, or vector). In this scenario
it is often possible to collapse loops to reduce the loops to be distributed to hard-
ware, but in some situations loop collapsing is not possible (e.g., loops that are not
perfectly nested).
Therefore, OpenACC provides a clause, seq, that you can add to a loop to stop the
compiler from parallelizing or vectorizing that loop, or to control where parallelism
happens in the loop nest. You can add seq to any loop directive, but it is mutually
exclusive with the gang, worker, and vector clauses (because they specify parallel-
ism, and seq disables parallelism). Here’s an example:
#pragma acc parallel
#pragma acc loop gang
for (f=0; f<A; f++){
    #pragma acc loop worker
    for (g=0; g<B; g++){
        #pragma acc loop seq
        for (h=0; h<X; h++){
            #pragma acc loop vector
            for (i=0; i<Q; i++){
                …
            }
        }
    }
}
You can use auto, on the other hand, to instruct the compiler to define the various
levels of parallelism that loops should be mapped to. This clause is automatically
applied to loops inside a kernel region that have not had a parallelism clause
(worker, vector, gang, or seq) applied to them.
However, note that auto does not instruct the compiler that loops are independent.
Therefore, auto may need to be combined with the independent clause for
loops the compiler is struggling to automatically parallelize.
You can also use auto to enable compiler checking on loops inside a parallel
region. As previously discussed, these loops are assumed to be independent and
parallelizable by default.
Here it is in Fortran:
total = 0
do i=1,100
total = total + data(i)
end do
Because the variable total at iteration i depends on iteration i-1, the compiler
will not be able to automatically parallelize the loop. However, you can fix this issue
by using a reduction clause.
The reduction clause enables you to maintain your original loop in your source
code and automatically convert it into a set of parallel and sequential loops when
compilation happens. In this way, the compiler parallelizes as much as possible
while producing the correct result.
When using a reduction clause, you do not need to initialize your variable; it will
be automatically initialized correctly for you.
Here it is for C:
total = 0;
#pragma acc parallel loop reduction(+:total)
for(i=0; i<100; i++){
total = total + data[i];
}
With parallel regions, you can use the reduction clause either for the whole par-
allel region (as a clause on the parallel directive itself) or on a loop directive.
If you are using the kernels directive, the reduction clause can be applied only
at the loop directive level.
Within a parallel region, if you apply reduction on a loop by using the vector or
worker clause (and no gang clause) and if the reduction variable is listed as a
private variable, then the value of the private copy of the scalar will be updated
at the exit of the loop. If the reduction variable is not specified as private, or if
you apply the reduction to a loop by using the gang clause, the value of the variable
will not be updated until execution reaches the end of the parallel region.
2.4 Summary
Loop parallelization functionality is at the heart of the ability of OpenACC to exploit
multicore, manycore, and GPU hardware. Efficiently and correctly parallelizing
loops across hardware is key to ensuring good performance.
The parallel and kernels directives differ; kernels gives the compiler more
responsibility for identifying parallelism, and parallel mandates that the com-
piler parallelize the code within that region.
OpenACC provides extra functionality, additional clauses that can be added to par-
allelization directives, to enable you to help the compiler efficiently parallelize your
code and then to sensibly map it to the hardware you are using.
Hopefully, with this information and some experimenting, you will be able to build
correct OpenACC code and make it fly. Just remember that correctness should
come before performance, so having well-tested and validated OpenACC code
should be your first target before you start optimizing the loops and exploiting the
hardware you are using to its full potential.
2.5 Exercises
1. If you run the following fragment of OpenACC on a GPU using all 1,536 threads
that the GPU has, how many times as fast would you expect it to run, compared
with using a single thread on the GPU?
#pragma acc parallel
for(i=0;i<2359296;i++){
answer[i] = part1[i] + part2[i] * 4;
}
2. The following fragment of code produces an incorrect answer. Can you fix it?
passing_count = 0
!$acc parallel loop
do i=1,10000
moving(i) = (left(i) - right(i)) * 2
left(i) = left(i) + 0.6
right(i) = right(i) - 3
passing_count = passing_count + 1
end do
!$acc end parallel loop
3. The following code does not efficiently utilize the hardware available on a GPU.
Can you suggest a way of optimizing it to improve the utilization of the GPU?
#pragma acc parallel loop
for(i=0;i<8;i++){
    for(j=0;j<320;j++){
        for(k=0;k<320;k++){
            new_data[i][j][k] = (old_data[i][j][k] * 2) + 4.3;
        }
    }
}
3.1 Common Characteristics of Architectures
The OpenACC programming model has been developed for manycore devices,
such as GPUs. It targets host-directed execution using an attached or integrated
accelerator. This means that the host is executing the main program and controls
the activity of the accelerator. To benefit from OpenACC directives, the executing
devices should provide parallel computing capabilities that can be addressed with
at least one of the three levels of parallelism: gang, worker, or vector.
OpenACC uses the concept of computation offloading. This allows you to execute
computationally intensive work, or workloads that simply fit the architecture of
an accelerator, on an offloading device. This means that OpenACC devices are most
often host-controlled and not self-hosted. Nevertheless, some OpenACC compilers
(e.g., PGI’s multicore back end) generate code for multicore CPUs from OpenACC
directives.
Compilers with support for the OpenACC API typically provide a switch that
enables the interpretation of OpenACC directives (for example, -acc for the PGI
compiler, and -fopenacc for the GNU compiler). This usually includes the header
openacc.h automatically, as well as a link to the compiler’s OpenACC runtime
library. Because OpenACC has been developed for performance portable pro-
gramming, there is typically a compiler option to specify the target architecture.
The default target might fit a variety of devices, but most often it does not provide
the best performance. Table 3.1 shows an overview of OpenACC compiler flags for
several known OpenACC compilers.
An example of a complete compile command and the compiler output are given in
Listing 3.1 for the PGI C compiler. It generates an executable prog for the OpenACC
program prog.c. The switch -ta=tesla:kepler optimizes the device code
for NVIDIA's Kepler architecture. The flag -Minfo=accel prints additional infor-
mation on the implementation of OpenACC directives and clauses.
Listing 3.1 Exemplary PGI compiler output with -Minfo=accel for a simple OpenACC reduction
pgcc -acc -ta=nvidia:kepler -Minfo=accel prog.c -o prog
reduceAddOpenACC_kernel:
40, Generating implicit copyin(data[1:size-1])
41, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
41, #pragma acc loop gang, vector(128)
/* blockIdx.x threadIdx.x */
42, Generating implicit reduction(+:sum)
The number of compilers with OpenACC support1 has grown over time. PGI and Cray
are commercial compilers. PGI also provides a community edition without license
costs. Among the open source compilers are the GCC2 and several OpenACC research
compilers, such as Omni from the University of Tsukuba,3 OpenARC from Oak Ridge
National Laboratory,4 and OpenUH from the University of Houston.5
1. http://www.openacc.org/content/tools.
2. https://gcc.gnu.org/wiki/OpenACC.
3. http://omni-compiler.org.
4. http://ft.ornl.gov/research/openarc.
5. https://you.stonybrook.edu/exascallab/downloads/openuh-compiler/.
Performance analysis is obviously useful for tuning programs and should be a part of
a reasonable application development cycle. It starts with the measurement prepara-
tion, followed by the measurement run and the analysis of the performance data, and
concludes with code optimization. In the following, we focus on the first three steps.
During measurement preparation you select and configure data sources, which will provide information on program execution. This does not always require changes to the application code (see the following section). The measurement itself is an
execution run while performance data are recorded. The measurement data are then
analyzed, often with the help of performance tools. Based on the performance analy-
sis results you try to optimize the code, and then you repeat the whole process.
6. Sameer Shende and Allen D. Malony, “The TAU Parallel Performance System,” Interna-
tional Journal of High Performance Computing Applications, 20(2) (2006): 287–311. http://
tau.uoregon.edu.
7. Andreas Knüpfer et al., “The Vampir Performance Analysis Tool-Set,” Tools for High
Performance Computing (2008): 139–155.
[Figure: performance data can be acquired by instrumentation or event-based sampling, and recorded either by summarization into profiles or by logging into traces]
8. Thomas Ilsche, Joseph Schuchart, Robert Schöne, and Daniel Hackenberg, “Combining
Instrumentation and Sampling for Trace-Based Application Performance Analysis,” Tools
for High Performance Computing (2014): 123–136.
Since OpenACC 2.5, the specification describes a profiling interface that enables
you to record OpenACC events during the execution of an OpenACC program. It is
described in Section 3.3.4. Similar interfaces for data acquisition exist for many other programming models, such as CUPTI for CUDA,9 OMPT for OpenMP,10 PMPI for MPI,11 and the MPI tool information interface (MPI_T) for MPI 3.0.12
9. docs.nvidia.com/cuda/cupti/.
10. www.openmp.org/wp-content/uploads/ompt-tr2.pdf.
11. http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
12. https://computation.llnl.gov/projects/mpi_t.
13. Dieter an Mey et al., “Score-P: A Unified Performance Measurement System for
Petascale Applications,” Competence in High Performance Computing 2010 (2012): 85–97.
http://www.score-p.org.
14. https://www.cs.uoregon.edu/research/tau/home.php.
Table 3.2 Runtime events specified in the OpenACC 2.5 profiling interface for use in
instrumentation-based performance tools
Event acc_ev_ Occurrence
Kernel launch events
enqueue_launch_[start|end] Before or after a kernel launch operation
Data events
enqueue_upload_[start|end] Before or after a data transfer to the device
enqueue_download_[start|end] Before or after a data transfer from the device
create/delete When the OpenACC runtime associates or disassociates device memory
with host memory
alloc/free When the OpenACC runtime allocates or frees memory from the device
memory pool
Other events
device_init_[start|end] Before or after the initialization of an OpenACC device
device_shutdown_[start|end] Before or after the finalization of an OpenACC device
runtime_shutdown When the OpenACC runtime finalizes
wait_[start|end] Before or after an explicit OpenACC wait operation
compute_construct_[start|end] Before or after the execution of a compute construct
update_construct_[start|end] Before or after the execution of an update construct
enter_data_[start|end] Before or after the execution of an enter data directive, before or after
entering a data region
exit_data_[start|end] Before or after the execution of an exit data directive, before or after
leaving a data region
Events are categorized into three groups: kernel launch, data, and other events.
Independent of the group, all events provide information on the event type and the
parent construct, as well as whether the event is triggered by an implicit or explicit
directive or runtime API call. For example, implicit wait events are triggered for
waiting at synchronous data and compute constructs. Event-specific information
for kernel launch events is the kernel name as well as gang, worker, and vector
size. Data events provide additional information on the name of the variable and the
number of bytes, as well as a host and a device pointer to the corresponding data.
For a complete list of all information that is provided in OpenACC event callbacks,
consult the OpenACC specification.
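To give a flavor of how a tool uses this interface, the following C sketch registers a callback for kernel launch events. The type and field names follow the OpenACC specification (acc_prof.h, acc_register_library, acc_ev_enqueue_launch_start); check your implementation's header for the exact declarations, because details may vary between runtimes.

#include <stdio.h>
#include <acc_prof.h>

// Called by the OpenACC runtime before every kernel launch.
static void on_launch(acc_prof_info *pi, acc_event_info *ei, acc_api_info *ai)
{
    (void)ei; (void)ai;   // event- and API-specific details not used here
    printf("kernel launch at %s:%d\n", pi->src_file, pi->line_no);
}

// The runtime calls this once at startup so the tool can register callbacks.
void acc_register_library(acc_prof_reg reg, acc_prof_reg unreg,
                          acc_prof_lookup_func lookup)
{
    (void)unreg; (void)lookup;
    reg(acc_ev_enqueue_launch_start, on_launch, acc_reg);
}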
Event callbacks also provide an interface to low-level APIs such as CUDA, OpenCL,
and Intel’s COI (Coprocessor Offload Infrastructure).15 This interface allows access
to additional information that is not available with the OpenACC profiling interface
or access to hooks into the execution of the low-level programming model. Tools
15. https://software.intel.com/en-us/articles/offload-runtime-for-the-intelr-xeon-phitm-coprocessor.
Portable collection of device data and sampling are not part of the OpenACC 2.5
profiling interface. However, the runtime of device tasks generated from OpenACC
directives can be estimated based on enqueue and wait events.
This section introduces three powerful performance tools that support the com-
bined analysis of OpenACC host and device activities. First, the NVIDIA profiler,
as part of the CUDA toolkit, focuses specifically on performance measurement
and analysis of OpenACC offloading to CUDA target devices. Second, the Score-P
environment supports many paradigms, including message passing (e.g., with MPI) and multithreading, as well as offloading with CUDA, OpenCL, and OpenACC. This makes it one of the most capable tool sets when additional paradigms and programming models are used alongside OpenACC. Last, the TAU Performance
System provides comprehensive performance measurement (instrumentation and
sampling) and analysis of hybrid parallel programs, including OpenACC.
CUDA kernels, such as their runtime, call parameters, and hardware counters.
The profiler can also capture calls into the CUDA runtime and CUDA driver API. A
dependency analysis exposes execution dependencies and enables highlighting of
the critical path between host and device. nvprof (and nvvp) also support CPU
sampling for host-side activities.
Because the GUI (nvvp) is not always available, nvprof can export the analysis results to a file that can be imported into nvvp, or it can directly print text output to stdout
or a specified file. Listing 3.2 shows an example analysis for a C program with
OpenACC directives, which solves the 2D Laplace equation with the Jacobi iterative
method.17 It shows the most runtime-consuming GPU tasks (kernels and memory
copies), CUDA API calls, and OpenACC activities. Optimization efforts should focus
on these hot spots. Note that nvprof is simply prepended to the executable pro-
gram. When no options are used, the output for an OpenACC program with a CUDA
target has three sections: global profile, CUDA API calls, and OpenACC activities.
The list in each section is sorted by exclusive execution time.
Figure 3.2 shows the nvvp visualization for the same example as mentioned
earlier. The main display shows a timeline that visualizes the execution of OpenACC
activities, CUDA driver API calls, and CUDA kernels that have been generated by
the PGI 16.10 (community edition)18 compiler for a Tesla K80 GPU. The properties
of the selected region are shown on the upper right of the figure. The region is an
OpenACC enter data region with a duration of about 33 ms, followed by a compute
region. The respective CUDA API calls and CUDA kernels are shown in the under-
lying timeline bars. This illustrates the implementation of OpenACC constructs in
the CUDA programming model in a timeline view. The nvvp profiler also provides
summary profile tables, chart views, and a guided analysis (see the bottom half
of Figure 3.2), information that supports programmers in identifying performance
issues and optimization opportunities. You can use nvvp directly to analyze an
application binary. It also allows loading a performance report that has been gen-
erated with nvprof or nvvp.
17. https://devblogs.nvidia.com/parallelforall/getting-started-openacc/.
18. As of July 2017, PGI 17.4 is the most recent community edition; see https://www.pgroup
.com/products/community.htm.
Listing 3.2 Exemplary nvprof usage and output for an iterative Jacobi solver
nvprof ./laplace2d_oacc
. . .
==50330== Profiling result:
Time(%) Time Calls Avg Min Max Name
58.25% 1.30766s 1000 1.3077ms 1.3022ms 1.3212ms main_92_gpu
40.12% 900.64ms 1000 900.64us 887.69us 915.98us main_102_gpu
0.55% 12.291ms 1000 12.290us 12.128us 12.640us main_92_gpu_red
0.54% 12.109ms 1004 12.060us 864ns 2.7994ms [CUDA memcpy HtoD]
. . .
==50330== API calls:
Time(%) Time Calls Avg Min Max Name
45.24% 1.31922s 1005 1.3127ms 9.2480us 1.3473ms cuMemcpyDtoHAsync
31.28% 912.13ms 3002 303.84us 2.7820us 2.4866ms cuStreamSynchronize
9.86% 287.63ms 1 287.63ms 287.63ms 287.63ms cuDeviceCtxRetain
9.55% 278.42ms 1 278.42ms 278.42ms 278.42ms cuDeviceCtxRelease
1.55% 45.202ms 3000 15.067us 12.214us 566.93us cuLaunchKernel
. . .
==50330== OpenACC (excl):
Time(%) Time Calls Avg Min Max Name
54.73% 1.32130s 1000 1.3213ms 1.2882ms 1.3498ms acc_enqueue_download@laplace2d_oacc.c:92
37.49% 905.04ms 1000 905.04us 347.17us 1.0006ms acc_wait@laplace2d_oacc.c:102
2.52% 60.780ms 1 60.780ms 60.780ms 60.780ms acc_enter_data@laplace2d_oacc.c:86
1.05% 25.246ms 1 25.246ms 25.246ms 25.246ms acc_exit_data@laplace2d_oacc.c:86
. . .
In comparison to the NVIDIA profiling tools, Score-P focuses more on global optimization of the program's parallelization across all parallel layers. Hence, the Score-P tools scale to
hundreds of thousands of processes, threads, and accelerator streams. The VI-HPS
workshop material23 and the Score-P documentation24 provide detailed information
on the Score-P analysis workflow, which can be summarized by the following steps.
1. Instrument the application (prepare the executable).
2. Generate a profile.
3. Generate a trace.
Before the measurement starts, the executable must be prepared. For OpenACC
programs, the simplest way is to prepend scorep --cuda --openacc before
19. Pavel Saviankou, Michael Knobloch, Anke Visser, and Bernd Mohr, “Cube v4: From Per-
formance Report Explorer to Performance Analysis Tool,” Procedia Computer Science 51
(2015): 1343–1352.
20. http://www.openshmem.org/site/.
21. docs.cray.com/books/004-2178-002/06chap3.pdf.
22. Dominic Eschweiler et al., “Open Trace Format 2: The Next Generation of Scalable Trace
Formats and Support Libraries,” Applications, Tools and Techniques on the Road to Exascale
Computing 22 (2012): 481–490.
23. http://www.vi-hps.org/training/material/.
24. https://silc.zih.tu-dresden.de/scorep-current/html/.
the compiler command and recompile the application. The options cuda and
openacc link the respective measurement components for CUDA and OpenACC
into the binary. Typing scorep --help prints a summary of all available options.
As defined in the OpenACC 2.5 specification, you must statically link the profiling
library into the application, or you must specify the path to the shared library by
setting ACC_PROFLIB=/path/to/shared/library or LD_PRELOAD=/path/
to/shared/library. Score-P’s OpenACC profiling library is located in Score-P’s
lib directory and is called libscorep_adapter_openacc_events.so.
The amount of device memory to store device activities can be set with the environ-
ment variable SCOREP_CUDA_BUFFER (value in bytes), if needed.
You should carefully select the events to record to avoid unnecessary measure-
ment overhead. For example, OpenACC enqueue operations are typically very
short regions, and the CUDA driver feature includes similar information by
tracking API functions such as cuLaunchKernel. There is an option to track the
device memory allocations and deallocations in both the OpenACC and the CUDA
features. A recommended low-overhead setup would use regions and wait
from the OpenACC features, and kernel and memcpy from the CUDA features, an
approach that supplies enough information to correlate OpenACC directives with
CUDA device activities.
Figure 3.3 shows the profile visualization with the CUBE GUI for a hybrid MPI/
OpenMP/OpenACC program (molecular dynamics code from Indiana University25).
With the exclusive runtime selected in the Metric tree box on the left of the figure, the Flat view in the center shows that the CUDA kernel accel_296_gpu dominates the runtime on the GPU (73.63 sec). Immediately after the kernel launch,
the host starts waiting for its completion (code location is int_ion_mix_acc.f
in line 296). The System tree on the right shows the distribution of the execution
time over processes, threads, and GPU streams.
25. http://hgpu.org/?p=11116.
Figure 3.4 shows the Vampir visualization of an OTF2 trace that has been gen-
erated from the SPEC ACCEL26 363.swim benchmark. The selected time interval
illustrates the level of detail that can be investigated using tracing. The timeline at
the top of the figure shows a host stream (captioned Master thread) that triggers
activities on the CUDA device stream (captioned CUDA[0:13]). Data transfers are
visualized as black lines between two execution streams. Details on the selected
data transfer are shown on the lower right in the Context View box. Each region in
the timelines can be selected to get additional information. The Function Summary
box on the upper right provides profile information of the selected time interval in
the program execution. The runtime-dominating regions are an OpenACC launch
in line 92 of file swim.f, and an OpenACC wait operation in line 124 of the same
file. The CUDA stream shows some CUDA kernels (in orange if you’re viewing this
electronically), but most of the time the GPU is idle. The lower timeline highlights
regions on the critical path, which is the longest path in the program execution that
does not contain any wait states. Optimizing regions that are not on the critical path
has no effect on the total program run time.
Figure 3.4 Exemplary Vampir trace visualization of the SPEC ACCEL 363.swim benchmark
Note: The GUI provides several timeline views and statistical displays. All CUDA kernels are on the critical path (lower
timeline), making them reasonable optimization targets.
26. Guido Juckeland et al., “SPEC ACCEL: A Standard Application Suite for Measuring Hard-
ware Accelerator Performance,” 5th International Workshop on Performance Modeling,
Benchmarking and Simulation of High Performance Computer Systems (2014): 46–67.
OTF2 traces can provide additional OpenACC runtime information that is not avail-
able in CUBE profiles. If the feature kernel_properties has been set, kernel
launch parameters such as gang, worker, and vector size are added to the trace.
The feature variable_names includes the name of variables for OpenACC data
allocation and OpenACC data movement. Furthermore, implicit OpenACC opera-
tions are annotated. Vampir visualizes this information in the Context View box or
by generating metrics from these attributes as counters in the Counter Data Time-
line or Performance Radar box. All Vampir views can be opened using the chart
menu or the icons that are shown on the bar at the top of Figure 3.4.
TAU is a versatile performance analysis tool suite that supports all parallel pro-
gramming paradigms and parallel execution platforms. It enables performance
measurement (profiling and tracing) of hybrid parallel applications and can capture
the inner workings of OpenACC programs to identify performance problems for
both host and device. The integrated TAU Performance System is open source and
includes tools for source analysis (PDT), parallel profile analysis (ParaProf), profile
data mining (PerfExplorer), and performance data management (TAUdb).
TAU also supports event-based sampling (EBS) to capture performance data when
the program’s execution is interrupted, either periodically by an interval timer or
after a hardware counter overflow occurs. With sampling, fine-grained perfor-
mance measurement at the statement level is possible, as is aggregation of all
statement-level performance within a routine. TAU is one of very few systems that
support both probe-based instrumentation and event-based sampling. It is the only
tool that integrates the performance data at run time into a hybrid profile.27
This command will collect performance data from the OpenACC profiling interface
(-openacc) using instrumentation, as well as data from event-based sampling
(-ebs). A separate profile will be generated for every MPI process in this case.
Note that if OpenACC is offloaded to a CPU “device” that uses multithreading, every
thread of execution will have its own profile.
TAU’s parallel profile browser, ParaProf, analyzes the profile generated when
the application is launched by tau_exec. Figure 3.5 shows a view of ParaProf’s
thread statistics window that gives the breakdown of performance data for the
Cactus BenchADM benchmark28 from the PGI tutorial on OpenACC. Because event-
based sampling was enabled for this performance experiment, ParaProf presents
a hybrid performance view, where [SAMPLE] highlights an EBS sample measure-
ment and [SUMMARY] shows the total contribution of events from that routine;
“[CONTEXT]” highlights the “context” for EBS sample aggregation of the parent
node. All other identifiers are instrumentation-based measurements.
In Figure 3.5, you see low-level routines that are called by the openacc_enter_
data event for the bench_staggeredleapfrog2 routine (in green on elec-
tronic versions), which can be found in the StaggeredLeapfrog2_acc2.F file.
27. Alan Morris, Allen D. Malony, Sameer Shende, and Kevin Huck, “Design and Imple-
mentation of a Hybrid Parallel Performance Measurement System,” Proc. International
Conference on Parallel Processing (ICPP ’10) (2010).
28. PGI, “Building Cactus BenchADM with PGI Accelerator Compilers,” PGI Insider, 2017,
http://www.pgroup.com/lit/articles/insider/v1n2a4.htm.
Note that the routine calls the openacc_enqueue_upload event at line 369 in
this file (just below the enter data event). This operation takes 5.725 seconds, as
shown in the columns for exclusive and inclusive time. You also see samples gen-
erated in a routine __c_mcopy8 (also in green), a low-level routine from libpgc.so
that presumably copies data. This routine executes for 6.03 seconds, as reported
by the EBS measurement.29
Figure 3.5 Exemplary TAU ParaProf thread statistics window showing the time spent in OpenACC
and host code segments in a tree display
The color coding (or shading, if you’re reading the paper edition) indicates perfor-
mance intensity. For instance, the bench_staggeredleapfrog2_ routine
at sample line 344 takes as much as 17.817 seconds of the total inclusive time of
35.853 seconds for the top level (identified as the TAU Application routine). As you
see in Figure 3.5, by right-clicking and selecting a line of code, you can select the
Show Source Code option to launch the source browser. The result is shown in
Figure 3.6. It appears that this is a highly intensive computational loop that can be
moved to the GPU using OpenACC constructs. You could easily turn on additional
measurement, such as hardware counters, to investigate the performance behav-
ior further.
29. Note that TAU can generate call-path profiles showing the nesting of events up to a
certain call-path depth. A pair of environment variables is used to enable call-path pro-
filing (TAU_CALLPATH) and to set the depth (TAU_CALLPATH_DEPTH). In this example,
these variables were set to 1 and 1000, respectively.
Figure 3.6 Exemplary TAU ParaProf source browser window showing the innermost loop on line
344, where the application spends 17.817 seconds
to be changed. The GNU OpenACC runtime prints debug output if you set GOMP_
DEBUG=1. The PGI compiler enables debug output with ACC_NOTIFY=1. Unfortu-
nately, such debugging output can become lengthy, or the relevant information is
not provided.
Thus, the following steps should help you systematically identify bugs in OpenACC
programs.
1. Compile and execute the program without interpreting the OpenACC direc-
tives (without the -acc flag) to ensure that the bug arises in the context of the
OpenACC implementation.
4. Isolate the bug by, for example, moving code from OpenACC compute regions
back to the host.
5. Use debuggers such as TotalView30 and Allinea DDT,31 which can be used for
debugging at runtime in both host and device code.
Parallel debuggers, such as DDT and TotalView, allow debugging of host and device
code. This includes viewing variables and intrinsic values such as the threadIdx
in CUDA kernels. Similarly, it is possible to set breakpoints in host and device code.
Debugging of the device code is restricted to CUDA targets in the current version of
both debuggers. Figure 3.7 shows exemplary debugging output of DDT for a CUDA
program with a memory access error.32 Here, CUDA thread 254 tries to access an
“illegal lane address” and causes a signal to occur. At this point, the respective
pointer can be investigated, and the issue can be traced.
30. http://www.roguewave.com/products-services/totalview.
31. https://www.allinea.com/products/ddt.
32. https://www.allinea.com/blog/debugging-cuda-dynamic-parallelism.
Figure 3.7 Parallel debugging with Allinea DDT supporting OpenACC with CUDA target devices
Image courtesy of Allinea.
3.5 Summary
Several tools are available to help developers analyze their OpenACC code to
get correct and high-performing OpenACC applications. Programming tools for
OpenACC must cope with the concept of computation offloading, and this means
that performance analysis and debugging must be performed on the host and
the accelerator.
Furthermore, they let you investigate programs that additionally use interprocess
communication and multithreading.
Because the OpenACC compiler generates the device code, you would think you
would not need to debug OpenACC compute kernels. But it is still possible to have
logic and coding errors. Also, there are several other sources of errors that can be
detected based on compiler or runtime debugging output. Tools such as TotalView
and DDT support the debugging of OpenACC programs on host and CUDA devices.
Debugging and performance analysis are not one-time events. Because the
compiler has some leeway in implementing the specification and because there
are many different accelerators, optimizations can differ for each combination of
compiler and accelerator. The same thing applies to bugs, which may occur only for
a specific combination of compiler and accelerator. Software tools simplify the pro-
cess of debugging and performance analysis. Because OpenACC is about perfor-
mance, profile-driven program development is recommended. With the available
tools, you can avoid do-it-yourself solutions.
3.6 Exercises
For a better understanding of compiling, analyzing, and debugging of OpenACC
programs, we accelerate the Game of Life (GoL) with OpenACC. The C code appears
in Listing 3.3.
The Game of Life is a cellular automaton devised by the mathematician John Horton Conway. The system is based on an infinite two-dimensional
orthogonal grid. Each cell has eight neighbors and can have one of two possible
states: dead or alive. Based on an initial pattern, further generations are deter-
mined with the following rules.
• Any dead cell with exactly three living neighbors becomes a living cell.
• Any living cell with two or three living neighbors lives on to the next generation.
• Any living cell with more than three or fewer than two living neighbors dies.
The program uses a two-dimensional stencil that sums up the eight closest
neighbors and applies the game rules. The initial state can be randomly generated.
Ghost cells are used to implement periodic boundary conditions. Cells are not
updated until the end of the iteration.
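To make the structure concrete before you look at the code, one game step might be written as in the following sketch. This is only an illustration, not the gol() routine from Listing 3.3 itself; it follows the grid/newGrid/dim naming used there and leaves out the ghost-cell update for the periodic boundaries.

void gol(int *grid, int *newGrid, int dim)
{
    int i, j;
    for (i = 1; i <= dim; i++) {
        for (j = 1; j <= dim; j++) {
            int id = i * (dim + 2) + j;
            // Sum the eight closest neighbors.
            int neighbors = grid[id - 1] + grid[id + 1]
                          + grid[id - (dim + 2)] + grid[id + (dim + 2)]
                          + grid[id - (dim + 2) - 1] + grid[id - (dim + 2) + 1]
                          + grid[id + (dim + 2) - 1] + grid[id + (dim + 2) + 1];
            // Apply the game rules to produce the next generation.
            if (grid[id] == 1 && (neighbors < 2 || neighbors > 3))
                newGrid[id] = 0;
            else if (grid[id] == 1 || neighbors == 3)
                newGrid[id] = 1;
            else
                newGrid[id] = 0;
        }
    }
    // Cells are not updated until the end of the iteration.
    for (i = 1; i <= dim; i++)
        for (j = 1; j <= dim; j++)
            grid[i * (dim + 2) + j] = newGrid[i * (dim + 2) + j];
}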
// current grid
int *grid = (int*)malloc(arraySize * sizeof(int)); // result grid
int *newGrid = (int*) malloc(arraySize * sizeof(int));
// assign initial population randomly
srand(1985); // pseudo random with a fixed seed
for(i = 1; i <= dim; i++) {
for(j = 1; j <= dim; j++) {
int id = i*(dim+2) + j;
grid[id] = rand() % 2;
}
}
for (it = 0; it < itEnd; it++){
gol( grid, newGrid, dim );
}
// sum up alive cells
total = 0; // total number of alive cells
for (i = 1; i <= dim; i++) {
for (j = 1; j <= dim; j++) {
int id = i*(dim+2) + j;
total += grid[id];
}
}
printf("Total Alive: %d\n", total);
return 0;
}
The following tasks demonstrate the tools presented in this chapter. Use a profile-
driven approach to accelerate GoL with OpenACC.
1. Measure the runtime of the sequential CPU code that implements the game of
life per Listing 3.3. Use 1024 as grid dimension and 2048 steps. What is dom-
inating the runtime of the CPU code? The runtime can be easily measured by
prepending time before the executable.
2. Add OpenACC compute directives to execute one game step (iteration) on the
device.
d. Make sure that the grid and newGrid arrays are available on the device.
Add a respective clause to the compute construct, if necessary.
3. Use nvprof to profile the program. What is dominating the device runtime or is
preventing the device from computing?
4. Use the data construct (or data enter or exit constructs) to avoid unnecessary
host-device data movement between game iterations. Verify the result using
nvprof or nvvp.
5. Count all living cells in parallel on the device using the OpenACC parallel
construct.
b. Set all required environment variables to capture OpenACC events and even-
tually CUDA or OpenCL device activity.
d. What is dominating the runtime on the host? Which kernels are dominating
the device runtime?
e. Measure the effect of gang, worker, and vector clauses on loop constructs.
Try different values. The performance might depend on the grid size.
iii. Select the OpenACC kernel launch regions to get details on the launch
properties.
In this chapter, you’ll parallelize real code. You will start with code that does some-
thing useful. Then you’ll consider how you might use OpenACC to speed it up. You
will see that reducing data movement is key to achieving significant speedup, and
that OpenACC gives you the tools to do so. By the end of the chapter you will be able
to call yourself an OpenACC programmer—a fledgling one, perhaps, but on your
way. Let’s jump right into it.
4.1 Case Study
You are reading a book about OpenACC programming, so it’s a safe bet the authors
are fans of this approach to parallel programming. Although that’s a perfectly sen-
sible thing, it has its dangers. It is tempting for enthusiasts to cherry-pick examples
that make it seem as if their favored technology is perfect for everything. Anyone
with experience in parallel programming has seen this before. We are determined
not to do that here.
Our example is so generically useful that it has many applications, and it is often
used to demonstrate programming with other parallel techniques as well, such
as the somewhat related OpenMP and the very different MPI. So, rest assured, we
haven’t rigged the game.
Another reason we prefer this example is that both the “science” and the numerical
method are intuitive. Although we will solve the Laplace equation for steady-state
temperature distribution using Jacobi iteration, we don’t expect that you immedi-
ately know what that means.
Let’s look at the physical problem. You have a square metal plate. It is initially at
zero degrees. This is termed, unsurprisingly, the initial conditions. You will heat
two of the edges in an interesting pattern where you heat the lower-right corner (as
pictured in Figure 4.1A) to 100 degrees. You control the two heating elements that
lead from this corner such that they go steadily to zero degrees at their farthest
edge. The other two edges you will hold at zero degrees. These four edges consti-
tute the boundary conditions.
For the metal plate, you would probably guess the ultimate solution should look
something like Figure 4.1B.
Figure 4.1 A heated metal plate
You have a very hot corner, a very cold corner, and some kind of gradient in
between. This is what the ultimate, numerically solved solution should look like.
∇²T = 0
values. This makes sense for temperature; if you have a pebble and you put a cold
stone on one side and a hot stone on the other, you’d probably guess that the peb-
ble would come to the average of the two. And in general, you would be right.
The simulation starting point—the set of initial conditions—is far from this. You
have zero everywhere except some big jumps along the edges where the heat-
ing elements are. You want to end up with something that resembles the desired
solution.
There are many ways you can find this solution, but let’s pick a particularly
straightforward one: Jacobi iteration. This method simply says that if you go over
your grid and set each element equal to the average of the neighbors, and keep
doing this, you will eventually converge on the correct answer. You will know when
you have reached the right answer because when you make your averaging pass,
the values will already be averaged (the Laplace condition) and so nothing will
happen. Of course, these are floating-point numbers, so you will pick some small
error, which defines “nothing happening.” In this case, we will say that when no ele-
ment changes more than one-hundredth of a degree, we are done. If that isn’t good
enough for you, you can easily change it and continue to a smaller error.
Here it is in Fortran:
do j=1,width
do i=1,height
temperature(i,j) =0.25*(temperature_previous(i+1,j)&
+ temperature_previous(i-1,j)&
+ temperature_previous(i,j+1)&
+ temperature_previous(i,j-1))
enddo
enddo
Note that the C and Fortran code snippets are virtually identical in construction.
This will remain true for the entire program.
This nested loop is the guts of the method and in some sense contains all the sci-
ence of the simulation. You are iterating over your metal plate in both dimensions
and setting every interior point equal to the average of the neighbors (i.e., adding
together and dividing by 4). You don’t change the very outside elements; those are
the heating elements (or boundary conditions). There are a few other items in the
main iteration loop as it repeats until convergence. Listing 4.1 shows the C code,
and Listing 4.2 shows the Fortran code.
worst_dt = 0.0;
if((iteration % 100) == 0) {
track_progress(iteration);
}
iteration++;
}
do j=1,width
do i=1,height
temperature(i,j) =0.25*(temperature_previous(i+1,j)&
+ temperature_previous(i-1,j)&
+ temperature_previous(i,j+1)&
+ temperature_previous(i,j-1))
enddo
enddo
worst_dt=0.0
do j=1,width
do i=1,height
worst_dt = max( abs(temperature(i,j) - &
temperature_previous(i,j)),&
worst_dt )
temperature_previous(i,j) = temperature(i,j)
enddo
enddo
iteration = iteration+1
enddo
The important addition is that you have a second array that keeps the temperature
data from the last iteration. If you tried to use one array, you would find yourself
using some updated neighboring elements and some old neighboring elements
from the previous iteration as you were updating points in the grid. You need to
make sure you use only elements from the last iteration.
While you are doing this nested loop copy to your backup array (and moving all
this data around in memory), it’s a good time to look for the worst (most changing)
element in the simulation. When the worst element changes only by 0.01 degree,
you know you are finished.
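In C, that second nested loop looks something like the following sketch (consistent with the Fortran version above and with the fragments of Listing 4.3 below; fmax and fabs from math.h are one way to express the maximum and absolute value):

worst_dt = 0.0;
for(i = 1; i <= HEIGHT; i++) {
    for(j = 1; j <= WIDTH; j++) {
        // Track the largest change and copy the new values into the backup array.
        worst_dt = fmax( fabs(Temperature[i][j] - Temperature_previous[i][j]),
                         worst_dt );
        Temperature_previous[i][j] = Temperature[i][j];
    }
}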
It might also be nice to track your progress as you go; it’s much better than star-
ing at a blank screen for the duration. So, every 100 iterations, let’s call a modest
output routine.
That is all there is to it for your serial Laplace Solver. Even with the initialization
and output code, the full program clocks in at fewer than 100 lines. (See Listing 4.3
for the C code, and Listing 4.4 for Fortran.)
double Temperature[HEIGHT+2][WIDTH+2];
double Temperature_previous[HEIGHT+2][WIDTH+2];
void initialize();
void track_progress(int iter);
int i, j;
int iteration=1;
double worst_dt=100;
struct timeval start_time, stop_time, elapsed_time;
gettimeofday(&start_time,NULL);
initialize();
worst_dt = 0.0;
worst_dt);
Temperature_previous[i][j] = Temperature[i][j];
}
}
if((iteration % 100) == 0) {
track_progress(iteration);
}
iteration++;
}
gettimeofday(&stop_time,NULL);
timersub(&stop_time, &start_time, &elapsed_time);
void initialize(){
int i,j;
int i;
integer :: i, j, iteration=1
double precision :: worst_dt=100.0
real :: start_time, stop_time
call cpu_time(start_time)
call initialize(temperature_previous)
do j=1,width
do i=1,height
temperature(i,j) = 0.25* (temperature_previous(i+1,j)&
+ temperature_previous(i-1,j)&
+ temperature_previous(i,j+1)&
+ temperature_previous(i,j-1))
enddo
enddo
worst_dt=0.0
do j=1,width
do i=1,height
worst_dt = max( abs(temperature(i,j) - &
temperature_previous(i,j)),&
worst_dt )
temperature_previous(i,j) = temperature(i,j)
enddo
enddo
iteration = iteration+1
enddo
call cpu_time(stop_time)
print*, 'Max error at iteration ', iteration-1, ' was ', &
worst_dt
print*, 'Total time was ',stop_time-start_time, ' seconds.'
end program serial
temperature_previous = 0.0
do i=0,height+1
temperature_previous(i,0) = 0.0
temperature_previous(i,width+1) = (100.0/height) * i
enddo
do j=0,width+1
temperature_previous(0,j) = 0.0
temperature_previous(height+1,j) = ((100.0)/width) * j
enddo
end subroutine initialize
We use PGI for performance consistency in this chapter. Any other standard com-
piler would work the same. If you run the resulting executable, you will see some-
thing like this:
. . .
. . .
---------- Iteration number: 3200 ------------
. . . [998,998]: 99.18 [999,999]: 99.56 [1000,1000]: 99.86
---------- Iteration number: 3300 ------------
. . . [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
The output shows that the simulation looped 3,372 times before all the elements
stabilized (to within our 0.01 degree tolerance). If you examine the full output, you
can see the elements converge from their zero-degree starting point.
The times for the C and the Fortran versions will be very close here, and they will remain so as you progress through the optimization. Of course, the time will vary depending on the
CPU you are using. In this case, we are using an Intel Broadwell running at 3.0 GHz.
At the time of this writing, it is a very good processor, so our eventual speedups
won’t be compared against a poor serial baseline.
This is the last time you will look at any code outside the main loop. You will hence-
forth exploit the wonderful ability of OpenACC to allow you to focus on a small
portion of your code—be it a single routine, or even a single loop—and ignore the
rest. You will return to this point when you are finished.
rank these spots. Often, as is the case here, it is obvious where to start. A large loop is
a big flag, and you have two of them within the main loop. This is where we focus.
worst_dt = 0.0;
if((iteration % 100) == 0) {
track_progress(iteration);
}
iteration++;
}
Listing 4.6 Fortran Laplace code main loop with kernels directives
do while ( worst_dt > temp_tolerance )
!$acc kernels
do j=1,width
do i=1,height
temperature(i,j) =0.25*(temperature_previous(i+1,j)&
+ temperature_previous(i-1,j)&
+ temperature_previous(i,j+1)&
+ temperature_previous(i,j-1))
enddo
enddo
!$acc end kernels
worst_dt=0.0
!$acc kernels
do j=1,width
do i=1,height
worst_dt = max( abs(temperature(i,j) - &
temperature_previous(i,j)),&
worst_dt )
temperature_previous(i,j) = temperature(i,j)
enddo
enddo
!$acc end kernels
iteration = iteration+1
enddo
The compilation is also straightforward. All you do is activate the directives using,
for example, the PGI compiler, for the C version:
pgcc -acc laplace.c
If you do this, the executable pops right out and you can be on your way. However, you
probably want to verify that your directives actually did something. OpenACC’s defense
against compiling a loop with dependencies or other issues is to simply ignore the
directives and deliver a “correct,” if unaccelerated, executable. With the PGI compiler,
you can request feedback on the C OpenACC compilation by using this:
pgcc -acc -Minfo=acc laplace.c
Similar options are available for other compilers. Among the informative output,
you see the “Accelerator kernel generated” message for both of your kernels-
enabled loops. You may also notice that a reduction was automatically generated
for worst_dt. It was nice of the compiler to catch that and generate the reduction
automatically. So far so good.
If you run this executable, you will get something like this:
. . .
. . .
---------- Iteration number: 3200 ------------
. . .[998,998]: 99.18 [999,999]: 99.56 [1000,1000]: 99.86
---------- Iteration number: 3300 ------------
. . .[998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
This was executed on an NVIDIA K80, the fastest GPU available at the time of this
writing. For our efforts thus far, we have managed to slow down the code by about
70 percent, which is not impressive at all.
You could choose to use a sophisticated performance analysis tool, but in this case,
the problem is so egregious you can probably find enlightenment with something
as simple as the PGI environment profiling option:
export PGI_ACC_TIME=1
If you run the executable again with this option enabled, you will get additional
output, including this:
Accelerator Kernel Timing data
main NVIDIA devicenum=0
time(us): 11,460,015
The problem is not subtle. The line numbers 31 and 41 correspond to your two
kernels directives. Each resulted in a lot of data transfers, which ended up
using most of the time. Of the total sampled time of 11.4 seconds (everything is in
microseconds here), well over 10s was spent in the data transfers, and very little
time in the compute region. That is no surprise given that we can see multiple data
transfers for every time a kernels construct was actually launched. How did this
happen?
Recall that the kernels directive does the safe thing: When in doubt, copy any
data used within the kernel to the device at the beginning of the kernels region,
and off at the end. This paranoid approach guarantees correct results, but it can be
expensive. Let’s see how that worked in Figure 4.2.
What OpenACC has done is to make sure that each time you enter a device kernels region, any involved data is copied to the device, and at the end of the kernels region, it is
all copied back. This is safe but results in two large arrays getting copied back and
forth twice for each iteration of the main loop. These are two 1,000 × 1,000 double-
precision arrays, so this is (2 arrays) × (1,000 × 1,000 grid points/array) ×
(8 bytes/grid point) = 16MB of memory copies every iteration.
[Figure 4.2: each pass through the while (worst_dt > TEMP_TOLERANCE) loop copies Temperature and Temperature_previous between the host and the device for every kernels region]
Note that we ignore worst_dt. In general, the cost of copying an 8-byte scalar
(non-array) variable is negligible.
Pause here and see whether you can come up with a strategy to minimize data
movement. What directives does that strategy translate to? Feel free to experiment
with the code on your own before reading the answer, which is provided later.
But let’s start with that objective in mind. If you load your data onto the device at
the beginning of the main loop, when do you next need it on the host? Think the first
iteration through as a start: there is no reason for the two big arrays to return to
the host between the two kernels. They can stay on the device.
What about worst_dt? It is insignificant in size, so you don’t care what it does as
long as it is available when needed, as per the default kernels behavior. Once
you start to use data regions, you uncouple the execution from the data regions
and could prevent unnecessary data movement. Because there is no real perfor-
mance gain, you won’t override the default by including it in any data directives.
It will continue to be set to 0 on the host, get to a maximum in the second nested
loop (actually a reduction from all of the “local maximums” found by each process-
ing element (PE) on the device), and get copied back to the host so that it can be
checked as the condition to continue the while loop every iteration. Again, this is
all default kernels behavior, so we don’t worry about the details.
After that, you run into the output routine. It isn’t an issue for the first 100 iterations,
so let’s ignore it for a moment and continue around the loop for the second itera-
tion. At the start of the second iteration, you would like both big arrays to be on the
device. That is just where you left them! So it looks as if you can just keep the data
on the device between iterations of the while loop. The obvious data directives
would be data copy clauses applied to the while loop.
// C
#pragma acc data copy(Temperature_previous, Temperature)
while ( worst_dt > TEMP_TOLERANCE ) {
. . .
! Fortran
!$acc data copy(temperature_previous, temperature)
do while ( worst_dt > temp_tolerance )
. . .
This is indeed the key. It will significantly speed up the code, and you will get the
right answer at the end.
However, the track_progress() output routine still needs up-to-date values on the host every 100 iterations. An update directive takes care of that without breaking up the data region:
! Fortran
. . .
if( mod(iteration,100).eq.0 ) then
!$acc update host(temperature)
call track_progress(temperature, iteration)
endif
. . .
It is important to realize that all the tools for convenient data management are
already in OpenACC. Once you decide how you want to manage the data concep-
tually, some combination of data copy, declare, enter/exit, and update
clauses should allow you to accomplish that as you wish. If you find yourself fight-
ing the scope or blocking of your code to make the directives match your wishes,
take a breath and ask yourself whether the other clauses will allow you to accom-
plish this more naturally.
Continuing with that line of thought, you don’t need both arrays to exit the while loop with the final data; one will suffice. Once again, temperature_previous has the correct values, so you can abandon temperature on the device. This means that
temperature is really just a temporary array used on the device, and there is no need
to copy it in or out. That is exactly what the data create clause is for.
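Put together, the data region for the C version might look like the following sketch: Temperature_previous is copied in and out, while Temperature only needs to exist on the device.

#pragma acc data copy(Temperature_previous) create(Temperature)
while ( worst_dt > TEMP_TOLERANCE ) {
    . . .
}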
Note that this last optimization is really not very important. The big win was recog-
nizing that you were copying the large arrays needlessly every iteration. You were
copying two large arrays into and out of each of the two kernels each loop:
(2 arrays) × (in and out) × (2 pairs of loops) × (3,372 iterations) = 26,976 copies
Getting rid of all those transfers with a data copy was the big win. Using data
create instead of copy for the Temperature array saved one copy in at the
beginning of the entire run, and one copy out at the end. It wasn’t significant. So
don’t feel bad if you didn’t spot that opportunity.
Likewise, using an update for the track progress routine caused 33 transfers
over the course of the run. It was a quick fix for the problem. In comparison to the
original 26,976 copies, having 33 remaining is nothing. However, now that you are
down to one copy in and one copy out for the whole run, it does have an impact on
the order of 5 percent of the new and significantly reduced total run time. Given the
huge performance improvement you have achieved, you may not care, but for those
of you seeking perfection, see Exercise 1 at the end of the chapter.
worst_dt = 0.0;
if((iteration % 100) == 0) {
#pragma acc update host(Temperature)
track_progress(iteration);
}
iteration++;
}
!$acc kernels
do j=1,width
do i=1,height
temperature(i,j) =0.25*(temperature_previous(i+1,j)&
+ temperature_previous(i-1,j)&
+ temperature_previous(i,j+1)&
+ temperature_previous(i,j-1))
enddo
enddo
!$acc end kernels
worst_dt=0.0
!$acc kernels
do j=1,width
do i=1,height
worst_dt = max( abs(temperature(i,j) - &
temperature_previous(i,j)),&
worst_dt )
temperature_previous(i,j) = temperature(i,j)
enddo
enddo
!$acc end kernels
iteration = iteration+1
enddo
!$acc end data
You compile exactly as before. If you again use the compiler verbose information
option (-Minfo=acc for PGI), you see that the generated copies are now outside
the while loop, as intended. Here is the result.
. . .
. . .
---------- Iteration number: 3200 ------------
. . .[998,998]: 99.18 [999,999]: 99.56 [1000,1000]: 99.86
---------- Iteration number: 3300 ------------
. . .[998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
This is much better. Table 4.1 sums it up. With only a handful of directives, you
have managed to speed up the serial code more than 20 times. But you had to
think about your data migration in order to get there. This is typical of accelerator
development.
To review, you looked for the large loops and placed kernels directives there.
Then (prompted by terrible performance) you thought about how the data should
really flow between the host and the device. Then you used the appropriate data
directives to make that happen. Further performance improvements are possible
(see the exercises), but you have achieved the lion’s share of what can be done.
4.5 Summary
Here are all the OpenACC advantages you have used in this chapter.
• Incremental optimization. You focused on only the loop of interest here. You have
not had to deal with whatever is going on in track_progress() or any other
section of the code. We have not misled you with this approach. It will usually
remain true for an 80,000-line program with 1,200 subroutines. You
may be able to focus on a single computationally intense section of the code to
great effect. That might be 120 lines of code instead of our 20, but it sure beats
the need to understand the dusty corners of large chunks of legacy code.
• Single source. This code is still entirely valid serial code. If your colleagues down
the hall are oblivious to OpenACC, they can still understand the program results
by simply ignoring the funny-looking comments (your OpenACC directives)—as
can an OpenACC-ignorant compiler. Or a compute platform without accelerators.
This isn’t guaranteed to be true; you can utilize the OpenACC API instead of direc-
tives, or rearrange your code to make better use of parallel regions; and these
types of changes will likely break the pure serial version. But it can be true for
many nontrivial cases.
• High level. We have managed to avoid any discussion of the hardware specifics
of our accelerator. Beyond the acknowledgment that the host-device connection
is much slower than the local memory connection on either device, we have not
concerned ourselves with the fascinating topic of GPU architecture at all.
• Portable. This code should run efficiently on any accelerated device. You haven’t
had to embed any platform-specific information. This won’t always be true for all
algorithms, and you will read more about this later in Chapter 7, “OpenACC and
Performance Portability.”
With these advantages in mind, we hope your enthusiasm for OpenACC is growing.
At least you can see how easy it is to take a stab at accelerating a code. The low
risk should encourage you to attempt this with your applications.
4.6 Exercises
1. We noted that the track_progress routine introduces a penalty for the peri-
odic array copies that it initiates. However, the output itself is only a small por-
tion of the full array. Can you utilize the data directive’s array-shaping options
to minimize this superfluous copy (see Section 1.3.4)?
2. The sample problem is small by most measures. But it lends itself easily to
scaling. How large a square plate problem can you run on your accelerator? Do
so, and compare the speedup relative to the serial code for that case.
3. This code can also be scaled into a 3D version. What is the largest 3D cubic case
you can accommodate on your accelerator?
4. We have focused only on the main loop. Could you also use OpenACC directives
on the initialize and output routines? What kinds of gains would you expect?
5. If you know OpenMP, you may see an opportunity here to speed up the host
(CPU) version of the code and improve the serial performance. Do so, and com-
pare to the speedup achieved with OpenACC.
Getting eight independent oxen to work in unison is not simple. However, it is child’s
play compared with coordinating a thousand chickens. In either case, the problem
is to harness independently operating beasts so that their efforts are coordinated
toward achieving a single goal.
This chapter overviews the challenges that you must solve to effectively utilize
massive parallelism and presents the theoretical foundations underlying solutions
to those challenges. Then it closes with details of how compilers combine that
theory with OpenACC directives to effect your requests and get effective speedup
on parallel architectures.
The following section overviews that topic. Subsequent sections detail the process
by which compilers map programs to parallelism, highlighting the type of infor-
mation that they need to successfully effect a mapping. The following sections
detail how OpenACC directives provide that information and how compilers use it to
generate parallel code. OpenACC is not free, and the concluding sections detail the
challenges introduced by OpenACC directives.
The simplest way of coordinating functional units is to link their execution so that
all the units are executing the same instruction at the same time. This method is
more or less required with arithmetic units, because arithmetic units rarely contain
program counters. For processors, this method of coordination is often effected by
having the processors share a single program counter. Sharing a single program
counter is the approach taken in NVIDIA GPUs. Obviously, having processors redun-
dantly execute the same instruction is productive only if each operates on different
data. Following this approach, each chicken is essentially marching in lockstep.
the pioneer analogy, each ox or chicken would be proceeding forward at its own
pace, oblivious to the progress of the other animals.
Examples of functional units that concurrently execute the same instruction are
vector units and lockstep processors. Generally, these are classified as single-
instruction multiple-data (SIMD)1 or single-instruction multiple-thread (SIMT).2
The defining point semantically is simultaneous execution.
Examples of functional units that execute independently abound, and they are
illustrated by virtually any processors connected by either an internal bus or an
external network. The processors may share memory; memory may be distrib-
uted among them; or memory may be split between those two options. In any case,
because the functional units are loosely connected, the relative order in which
operations execute on two processors is indeterminate. This execution is multiple-
instruction multiple-data (MIMD).3
parallel hardware, by the adjective parallel, are not. Given that, it is not immediately
obvious that loops can be directly mapped to parallel hardware. In fact, not all
loops can be.4
This is not true of all loops. Here we change the fragment slightly:
for(i=1; i<n; i++) {
a[i] = a[i-1] + 1;
}
This fragment sets each element of a to the value of its ordinal position (assuming
elements are initially 0). Executed sequentially, a loop iteration picks up the value of
a stored by the preceding iteration, increments it, and stores the result for the next
iteration. Simultaneous execution changes the result. Executed simultaneously, all
the initial values of a are fetched, then incremented, and finally stored. No updated
values are used. Stated differently, every element of a is set to 1.
The reason this loop produces different results when executed simultaneously is
that the loop body depends on values produced by previous iterations. When the
loop executes simultaneously, the newly computed values are not available until
the computation is complete. That is too late for the dependent iteration to use.
Because MIMD processors may execute loop iterations in any order, and because
simultaneous execution is one possible order, not all sequential loops can be
executed correctly on MIMD hardware. Sequential loops in which no statement
depends on itself, directly or indirectly, may be executed simultaneously. The
condition that determines whether sequential loops will execute correctly with
4. J. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures (San Francisco:
Morgan Kaufmann, 2001).
Rather than use the result directly, the expression value is first stored into a
temporary. Using temporaries in this fashion is a common coding style for users
wary of compiler optimization. Excluding this temporary, the semantics of the loop
are identical to those of the original.
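The loop in question looks something like this sketch (reconstructed to match the expanded version shown a little later):

for(i = 0; i < n; i++) {
    t = a[i] + 1;   // scalar temporary shared by all iterations
    a[i] = t;
}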
Although the semantics are identical, the loop iterations as written cannot be cor-
rectly executed either simultaneously or indeterminately. The reason is the scalar
temporary t. On a nonparallel machine, the temporary provides a location to hold
the computation to avoid recomputing it. Executing the iterations in either simul-
taneous or indeterminate order, it is possible (and indeed probable) that multiple
loop iterations will be stored into the temporary at roughly the same time. If so, the
value fetched on an iteration may not be the proper one.
In the specific example, the obvious solution is to propagate the scalar forward.
More generally, however, there may be many uses of t in the loop, in which case
saving the results of the computation may be the most efficient approach. The
more general solution is to expand the scalar to create a location for each loop
iteration.
for(i=0; i<n; i++) {
t[i] = a[i] + 1;
a[i] = t[i];
}
A literal expansion illustrates the basic concept, but there are more efficient ways
to achieve the end result.6 The requirement is that each loop iteration (or thread)
have a dedicated location for the value. On simultaneous architectures, the scalar
is usually expanded into a vector register or equivalent in a transformation known
as scalar expansion.7 MIMD architectures usually provide individual processors
with dedicated memory—memory private to that processor. Allocating the scalar to
this storage (a transformation known as privatization) removes the bottleneck.8
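In OpenACC, you request privatization explicitly with the private clause; a minimal sketch:

#pragma acc parallel loop private(t)
for(i = 0; i < n; i++) {
    t = a[i] + 1;   // each executing thread now has its own copy of t
    a[i] = t;
}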
5.1.4 Reductions
Simultaneous and indeterminate execution cover common forms of fully parallel
loops. However, there are computations that are only partially parallel but that
are important enough to merit special hardware attention. The most prevalent
of these are reductions. Reductions are operations that reduce a vector or an
array to a scalar. Dot product (a special version of a more general sum reduc-
tion) is one example:
t = 0;
for(i=0; i<n; i++) {
t = t + a[i] * b[i];
}
This code may accumulate the sums in a different order from the original, because
each individual private variable accumulates part of the overall sum. Although
mathematical addition is associative, computer floating-point addition is not. This
means that accumulating the sum in a different order may cause the results to vary
slightly from the sequential result, and from run to run. Despite the possible vari-
ances, reductions consume enough computation in enough applications to justify
this partial parallelism.
Given that the semantics of sequential loops differ from those of parallel hardware, it
is not hard to see the role of OpenACC directives in exploiting parallelism in sequential
programs. Programmers armed with OpenACC directives can use them to indicate
parallel loops to the compiler, which can then take care of scheduling them.
The OpenACC gang and worker loop directives state that a loop will execute
correctly regardless of the order in which the iterations are executed. Accordingly,
such loops can be correctly executed on MIMD hardware. In fact, given that any
iteration order is valid, gang and worker loops can also be executed simultane-
ously. This precipitates one of the decisions faced by a compiler (discussed later):
Does it execute loops exactly as prescribed by directives (i.e., as gang and worker),
or does it use the information to schedule loops as it sees fit (i.e., schedule a gang
loop as gang, worker, and vector)? From a loop semantics viewpoint, gang and
worker both imply indeterminate semantics. The difference between the two is
that gang implies that there is more beneficial parallelism underneath.
The OpenACC vector loop directive states that a loop will execute correctly if the
iterations are executed simultaneously. Accordingly, the loop can be executed on
vector hardware or on the vector lanes of a GPU.
Reductions are indicated on a parallel loop by using an operator (e.g., + for a sum) that names the kind of reduction, together with the scalar that receives the result. With this
information, the compiler knows to create a private copy of the reduction variable
inside the parallel loop, initialize it according to the reduction kind, and create
appropriate locking and updating of the associated global variable.
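For reference, a minimal sketch of the dot product above written with an OpenACC
reduction clause (the variable names follow the earlier fragment):

    t = 0;
    #pragma acc parallel loop reduction(+:t)
    for (i = 0; i < n; i++) {
        t += a[i] * b[i];   /* each gang/worker/vector lane accumulates a   */
    }                       /* private partial sum; the runtime combines    */
                            /* them into t when the loop completes          */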
5.2 Restructuring Compilers
Previous sections present challenges that you must solve to effectively utilize par-
allelism: recognizing loops that can be executed as gang, worker, and vector loops;
recognizing reductions; and placing variables appropriately in the memory hierar-
chy. These challenges are not new; parallel architectures have existed for decades,
and every parallel architecture faces the challenges presented here. Compiler
writers tend to be smart. Compiler research over those decades has resulted in a
theoretical foundation to meet many of those challenges. To best apply OpenACC
directives, you need to know where compilers can best be helped.
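As an illustration, assume the pair of statements under discussion has the same
shape as the scalar-expansion loop shown earlier:

    for (i = 0; i < n; i++) {
        t[i] = a[i] + 1;   /* first statement: writes t[i]           */
        a[i] = t[i];       /* second statement: reads that same t[i] */
    }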
The first statement computes a value of t[i] that is used by the second. As a
result, their relative execution order must be preserved or else the second state-
ment will use an incorrect value of t[i]. (Similarly, changing the relative execution
order will also cause a[i] in the first statement to receive an incorrect value.)
With dependency defined in this way, a compiler is free to reorder a program
at will, as long as it does not violate any dependencies, and the results that are
computed will remain unchanged. Dependency theory also includes the ability to
attribute dependencies to specific loops, enabling more precise reordering.
general recognition is not simple for computers. Although it is not perfect, data-
flow analysis does provide information necessary for recognizing most reductions.
Aliasing, which occurs when the same memory location may have two or more
references to it, is more subtle.12 The following fragment illustrates aliasing and
associated problems.
void copy(double a[], double b[], int n) {
    int i;
    for(i=0; i<n; i++)
        a[i] = b[i];
}
. . .
double x[100];
copy(&x[1], &x[0], 10);
Looking at the procedure copy in isolation, the loop appears parallelizable. But
in fact, it’s not. The reason is the way in which copy is called. The fact that a and
b are different names for the same memory means that the statement depends
on itself. Resolving aliases in a program is an NP-complete problem, where NP
stands for nondeterministic polynomial time. Compilers must adopt a conservative
approach (that is, compilers assume that the worst possible case is happening
unless they can prove otherwise), and this means that, in most cases of possible
aliasing, compilers must assume that a dependency exists.
Another symbolic area that creates problems for compilers is symbolic loop
bounds and steps. Doing the best job of allocating loops to parallel hardware
requires knowing roughly how many times the loops iterate. For instance, there
is little point in scheduling a loop with only three iterations as a vector loop; the
overhead of scheduling the loop will overwhelm the benefit. Given two potential
candidates for vector execution, one that iterates 3 times and the other that iterates
128 times, the larger iteration loop is absolutely the better choice. However, the
compiler can’t make that choice without knowledge of the loop sizes, and if those
sizes are symbolic, the compiler can only guess.
Compilers must work with the semantics of a computation as written and, in partic-
ular, with the semantics as expressed in a sequential language for OpenACC. Those
semantics sometimes distort the mathematics underlying the physical phenome-
non. Consider the following fragment:
for(i=1; i<m-1; i++) {
    for(j=1; j<n-1; j++) {
        a[i][j] = (a[i+1][j] + a[i-1][j] + a[i][j+1]
                 + a[i][j-1]) / 4.0;
    }
}
This fragment is typical code used to solve differential equations, which have the
property that at steady state, the value at a point is equal to the average value
across a surrounding sphere. A compiler examining the fragment would correctly
conclude that the statement depends upon itself in every loop and that it cannot be
executed simultaneously or indeterminately. In fact, however, it can. The reason
is that the fragment is converging toward a fixed point: when the values no longer
change between iterations, a holds the solution. As a result, any errors that are
encountered along the way by parallel execution are insignificant to the final solu-
tion. A human can readily recognize this; a compiler cannot.
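Where you, the programmer, know that the relaxed ordering is acceptable, OpenACC
lets you assert it. A minimal sketch, reusing the loop bounds and array from the
fragment above, is:

    /* The independent clause tells the compiler to parallelize the loops
       despite the dependency its analysis finds; whether the resulting
       reordering is acceptable is the programmer's responsibility. */
    #pragma acc kernels loop independent
    for (i = 1; i < m-1; i++) {
        #pragma acc loop independent
        for (j = 1; j < n-1; j++) {
            a[i][j] = (a[i+1][j] + a[i-1][j] + a[i][j+1]
                     + a[i][j-1]) / 4.0;
        }
    }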
5.3 Compiling OpenACC
OpenACC directives obviously simplify parts of the compilation process for GPU
programs. Gang, worker, and vector directives tell the compiler whether a loop
executes correctly in an indeterminate order, thereby eliminating the need for
sophisticated dependency analysis and detailed loop analysis. Moreover, private
clauses dictate precisely where variables need to be located in the memory hierar-
chy. Reduction recognition is tricky and detailed; the reduction clause eliminates
the need for lots of checks normally employed by compilers to ensure correctness.
Explicit directive control of memory transfers between host and device relieves the
compiler of having to do detailed interprocedural analysis when it tries to optimize
memory transfers.
Although it’s easy to think that directives solve all the compiler’s problems, that is
not the case, and, in some cases, they complicate the compiler’s work. The phrase
effective parallelism has two parts: parallelism, which means getting an application
to run in parallel, and effective, which means getting the application to run faster.
Experienced programmers are aware that the two are not always the same.
The ease of applying directives raises user expectations, making the compiler’s
job harder.
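The two forms being compared do not appear on this page; a representative pair
(array name and stride chosen purely for illustration) is:

    /* First form: an auxiliary induction variable k is advanced by addition
       each iteration; cheap, but k carries a dependency from one iteration
       to the next. */
    void clear_stride4_v1(double *a, int n) {
        int k = 0;
        for (int i = 0; i < n; i++) {
            a[k] = 0.0;
            k = k + 4;
        }
    }

    /* Second form: the index is computed from the loop variable by a
       multiplication, exposing the stride and removing the scalar
       dependency. */
    void clear_stride4_v2(double *a, int n) {
        for (int i = 0; i < n; i++) {
            a[4*i] = 0.0;
        }
    }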
The programmer wrote the first form for efficiency. Additions are much cheaper
than multiplications, so in terms of the integer support arithmetic, the first is faster
(although a reasonable compiler will eventually generate that form regardless of
what is written). The transformation that reduces the second form to the first form
is known as strength reduction (replacing an operator with a cheaper operation).
For vector or parallel execution, however, the second form is necessary, both
to uncover the memory access strides through the vectors and to eliminate the
scalar dependencies. This is true whether the parallelism is implicitly detected or
explicitly stated by a directive. This transformation (known as auxiliary induction
variable substitution)13 as well as other preliminary transformations must be per-
formed, either by the compiler or by the user.
5.3.2 Scheduling
Scheduling parallel and vector code involves balancing two conflicting forces. In
one direction, parallelism is effective only when all processors are being produc-
tive. The best chance of keeping all processors busy is to make each instruction
packet executed in parallel as small as possible, so that there are as many
packets as possible. This approach means that as much code as possible will be
ready to execute at any time. However, parallel execution does not come free; there
is always some overhead involved in executing an instruction packet in parallel.
Maximizing the number of packets maximizes the overhead. Effective scheduling
therefore requires that you keep instruction packets small enough to balance the
load across the processors, but large enough to minimize the overhead.
Most effective scheduling techniques are dynamic: they schedule work while the
application is running. The alternative is static scheduling, wherein the compiler
fixes the schedule for the application during the compilation process. Static sched-
uling introduces less overhead, because the compiler schedules the code directly
with no overhead, whereas dynamic scheduling generally achieves better load
balance, because it can adjust for execution changes based on input data.
GPUs start up in fully parallel mode, with all processors—blocks, warps, and
vectors—executing the kernel. Their "natural" state is to be executing in parallel, and
there is no one instruction that says, “Go serial.” As a result, it is relatively simple
to execute parallel code; accordingly, the more difficult task is to execute serial
code. Assuming that workers are dynamically scheduled using a bakery counter
(all loop iterations to be performed are placed in a queue; as a worker becomes
free, the queue gives the worker an iteration to perform), serial code at the worker
level is not difficult. When workers are working on a serial section, all processors
except one (a chief processor) are held in a pen. When the chief processor encoun-
ters a worker loop, it adds the iterations to the queue and opens up the pen.
Serializing the vector lanes down to a single execution is more challenging. Making
only one lane of a warp execute (making it effectively the “chief” lane) requires
neutering the remaining lanes so that they do not execute. The pragmatic way of
effecting this is to place a jump around the serial section of code, predicated on the
thread id. For the zeroth thread (the chief lane), the jump is not taken; for all
others it is. The result is similar to this:
if (thread_id == 0) {
    /* serial code */
}
/* back to full vector code */
The transformation is simple when performed on a single basic block (i.e., a group
of instructions without control flow change; each instruction in the block is exe-
cuted if and only if every other instruction in the block is executed). But the process for
handling multiple basic blocks with control flow change is more difficult.
When optimizing a program, compilers analyze the program to uncover all the
information they need to create more efficient forms of code. Compilers must be
conservative; they do not effect an optimization unless they can absolutely prove
that the change is safe. One would think that directives, including those in OpenACC,
would save compilers the bother of implementing the analysis, given that the direc-
tives are providing the needed information. That approach works if the information
provided by the user is correct. Unfortunately, users are not always perfect, and
the coordination between OpenACC and user is not always perfect. Two critical
areas compilers need to check are control flow and efficiency.
Control flow checks are useful because of the way in which GPUs implement con-
trol flow for simultaneous operations. When control flow splits (i.e., a conditional
branch is encountered) the GPU continues; vector lanes that do not take the branch
are enabled, and vector lanes that do take the branch are disabled. Eventually, the
different paths should converge so that all vector lanes are again enabled. If that
convergence does not occur, the result is a condition known as divergence, which most
likely means that an otherwise valid program will hang at execution. Divergence normally
does not happen with optimizing compilers acting on their own, because
compilers tackle only structured control flow, and divergence does not happen with
structured programs. Users, however, are not limited to structured programming
or to placing directives around structured loops. Branches that terminate
loops prematurely, particularly by branching to a point before the loop, complicate
convergence conditions.
3. If there is no other parallel loop, the outer loop should be both the gang and the
worker loop.
4. Vector loops are most efficient when all the vectors access contiguous memory.
In C, this means when the vector loop runs down the row with unit stride. In
Fortran, this means when the vector loop runs down the column with unit stride.
5. Gang, worker, and vector loops all have a minimal size at which they become
more efficient than scalar execution. Contrary to common conceptions, this size
is usually more in the range of 5 to 10 than it is 2.
Note that compilers are proactive in trying to achieve these conditions. They will
interchange loops to get a more profitable parallel loop to the outermost position or
to get a better vector loop to an innermost position.
Although these principles provide guidance, they are not absolute on every architecture.
Directives in a program are usually immutable; as a result, they are not guaranteed
to be optimally placed for every architecture. Compilers do have accurate models
of architectures, so where they have full knowledge (e.g., loop lengths), they are
better at loop selection than users.
One of the questions faced by compilers, given this, is whether the compiler should
view OpenACC directives as prescriptive or descriptive. Prescriptive says that the
user has prescribed exactly the desired mapping to hardware, and the compiler
should follow it whether or not it is the most efficient. Descriptive says that the
user has provided information to the compiler, describing the desired loop behav-
ior, but the compiler should make the final decision. There are compilers support-
ing either strategy.
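A sketch of the difference, using an illustrative copy loop (array names and bounds
assumed):

    /* Descriptive use: the directive states only that the loops are
       parallel; the compiler chooses the gang/worker/vector schedule. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++)
            b[i][j] = a[i][j];
    }

    /* Prescriptive use: the programmer dictates the mapping; a compiler in
       prescriptive mode follows it even when its cost model would choose
       differently. */
    #pragma acc parallel loop gang
    for (int i = 0; i < n; i++) {
        #pragma acc loop vector
        for (int j = 0; j < m; j++)
            b[i][j] = a[i][j];
    }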
15. We ignore, for the moment, memory concerns, something that definitely cannot be
ignored in practice.
16. Allen and Kennedy, Optimizing Compilers for Modern Architectures.
download time to the device. If not, it makes more sense to execute the code on the
host processor.
5.4 Summary
Exploiting parallelism with any number of processors is challenging, but it is
particularly so when you are trying to effectively use hundreds or thousands of
processors. The theme of this chapter is that effectively exploiting parallelism—
particularly massive parallelism—is a job that requires cooperation of both the
programmer and the compiler, and that OpenACC directives provide a firm basis
for that cooperation. In doing so, you will be able to not only succeed in shepherding
your one thousand chickens, but also see performance gains for your eight oxen.
The key challenges include recognizing loops that observe simultaneous and
indeterminate semantics, allocating variables to the proper level in the memory
hierarchy, and recognizing and scheduling reductions. OpenACC directives help
meet these challenges and allow programmers to help overcome the compil-
er’s biggest weaknesses: aliasing, symbolic expressions, and hidden semantics.
Although the use of OpenACC directives combined with a compiler is not a perfect
solution, it greatly simplifies parallel programming and enables faster exploration
of algorithms and loop parallelism to get the best possible performance.
5.5 Exercises
1. The OpenACC standard says that a vector loop cannot contain a gang or
worker loop. Why?
2. Dr. Grignard, a longtime professor of parallel programming, has said, “If you
can run a loop backwards (i.e., negating the step, starting with the upper bound,
and ending at the lower) and you get the same result, you can run it correctly in
vector.” Is he right?
3. Dr. Grignard has also said that a loop that passes the test of running backward
can also be run correctly as a gang loop. Is that correct?
5. Given the definition of daxpy as a procedure that adds two vectors, it seems
natural to assume that the loop can be run in vector and in parallel.
6. The text states that if a statement in a loop does not depend upon itself, then it
can be executed simultaneously. Consider the following loop:
for(i=1; i<n; i++) {
    d[i] = a[i-1] + 1;
    a[i] = b[i] + c[i];
}
There is a dependency from the second statement to the first, but neither
statement depends upon itself. Can the statements be executed simultaneously?
What should a compiler do if the user places a vector directive around the
loop?
7. The text states that if there is a dependency in a loop caused by the loop, then
the loop cannot be executed in gang or worker mode. Consider again the follow-
ing loop:
for(i=1; i<n; i++) {
    d[i] = a[i-1] + 1;
    a[i] = b[i] + c[i];
}
This loop carries a dependency from the second statement to the first. Can it be
executed in gang or worker mode? If not, can you think of ways of changing it so
that it can?
What OpenACC directives would you place, and where, to get the best execution
speed? Are there other changes you would make to the loops? Suppose the loop
were instead written this way:
for(j=0; j<n; j++) {
    for(i=0; i<m; i++) {
        t = 0;
        for(k=0; k<p; k++) {
            t += a[i][k] * b[k][j];
        }
        c[i][j] = t;
    }
}
What would you do differently? Which version would you expect to execute
faster, and why?
Previous chapters have shown the details of how OpenACC can be used to identify
and express parallelism in an algorithm, allowing portable heterogeneous execu-
tion. However, most real-world programming problems are far more complex than
the provided code snippets and examples. Accelerating an existing application is
often a bewildering and complex tradeoff of application performance, maintain-
ability, and portability. Furthermore, in a professional programming environment
developer time is a finite resource that must be taken into consideration.
The concepts are demonstrated with a representative example: the parallel accel-
eration of a thermodynamic property table lookup and interpolation class, common
in computational fluid dynamics solvers.
Basic flat profiles and call graphs are extremely useful when you are preparing
a heterogeneous acceleration strategy. A basic flat profile usually contains the
amount of time spent in each routine, both exclusive to the routine and in child
routines, and the total number of times a routine is called. A call graph contains
additional information about where in the call tree a routine is called. From this
limited information, we can already identify promising application hot spots that
are potentially good locations to introduce acceleration. You should look for routines
that have these characteristics:
Very lucky programmers may find a single hot spot that is responsible for the
majority of the run time, in which case the next step is clear: Focus efforts on
exposing the maximum amount of parallelism in the single hot spot. However, with
most large research or industrial code, it is far more likely that the work will be
spread across multiple loops and routines. In this situation achieving meaningful
acceleration will require careful strategy.
If an application’s profile shows many small hot spots that sum to nearly 100 per-
cent of the application’s run time, the implication is clear: To achieve the best possi-
ble performance in the minimum amount of time, you should follow an acceleration
strategy that offloads as many hot spots as possible to the accelerator device,
rather than fine-tune any individual hot spot.
Minimizing the transfers between the main system memory and the device mem-
ory will improve performance on most hardware. These transfers are something
OpenACC programmers can directly control, and effectively managing both the
number and the size of these transfers is critical for effective accelerator use.
The most natural strategy for accelerating a large application is one of incremental
acceleration and verification. With this approach, application hot spots are targeted
one by one, starting with the most time-consuming routines or loops, which offer
the greatest potential for acceleration. Each hot spot is first targeted independently,
with all necessary data transferred to and from the device at the start and end of
the routine or loop. This approach results in poor data locality and less than ideal
performance, but it lends itself to easy verification of the targeted accelerated
routines and is a useful first step.
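A minimal sketch of this first, per-routine step (function and array names are
illustrative, not from any particular application):

    /* Each targeted routine copies everything it needs onto the device at
       entry and back at exit. Data locality is poor, but the results are
       easy to verify against the original CPU version. */
    void update_field(int n, double *restrict f, const double *restrict g)
    {
        #pragma acc parallel loop copy(f[0:n]) copyin(g[0:n])
        for (int i = 0; i < n; i++)
            f[i] += g[i];
    }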
Tip
Many current generation GPU accelerators perform best with single-precision math.
However, some algorithms are susceptible to round-off error when run in single
precision, which may yield small differences in CPU and the hybrid CPU-accelerator
results. When verifying the results of an accelerated routine, make sure to compile in
full double-precision mode to minimize the impact of these errors.
An example use of the acc atomic directive is shown in Listing 6.1. This example
shows a flux summation loop, commonly used in unstructured finite-volume meth-
ods. In this algorithm, flux contributions from a finite volume’s faces are summed
and stored. The parallel loop is over all mesh faces, and incorrect results will occur
if multiple threads attempt to update the same volume concurrently. This algo-
rithm may be made thread safe by the simple addition of #pragma acc atomic
directly before each offending summation statement.
2. G. Rokos, G. Gorman, and P. H. Kelly, "A Fast and Scalable Graph Coloring Algorithm for
Multi-core and Many-core Architectures," Lecture Notes in Computer Science Euro-Par
2015, Parallel Processing (2015): 414–425.
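Listing 6.1 itself is not reproduced here; a minimal sketch of the pattern it
describes, with assumed array and index names, looks like this:

    /* Loop over faces; each face updates the residuals of its two
       neighboring volumes, so those summations must be made atomic. */
    #pragma acc parallel loop
    for (int f = 0; f < nFaces; f++) {
        double flux = 0.5 * (q[faceLeft[f]] - q[faceRight[f]]);
        #pragma acc atomic update
        residual[faceLeft[f]]  += flux;
        #pragma acc atomic update
        residual[faceRight[f]] -= flux;
    }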
The kernels directive, which can be applied to large blocks of code, instructs
the compiler to identify possible parallelism and automatically implement data
transfer and loop parallelism. The major benefit of this approach is that it allows
parallelization of the code with a single construct, making it a low-risk approach
that requires very little programmer time. The compiler will attempt to parallelize
only loops that it can determine to be thread safe, and if multiple loops are present
within a single kernel the compiler may be able to improve data locality automati-
cally. However, the compiler may not be able to fully parallelize the code because of
overly conservative analysis. The automatically generated data transfers may also
be less than optimal for the same reasons. In practice, the kernels construct can
often be used for a first implementation, allowing you to quickly see which loops
the compiler can safely parallelize and which loops may require additional treat-
ment. The alternative, the parallel construct, requires you to assert that the marked
loops are safe to execute in parallel. This resolves the problem of overly conservative
analysis but allows for unsafe execution and potentially incorrect results if the
assertion is wrong.
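A hedged sketch of the two constructs applied to the same (assumed) loop:

    /* kernels: the compiler analyzes the region and parallelizes only the
       loops it can prove safe; data transfers are generated automatically. */
    #pragma acc kernels
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* parallel loop: the programmer asserts that the loop is safe to run
       in parallel; the compiler trusts the assertion. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];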
Which option fits best depends on the application profile, the programming model,
and your intended development time. It can be expected that the parallel construct,
paired with carefully managed data regions, will lead to nearly optimal perfor-
mance, but it will require a much more extensive development effort. This asser-
tion is most apparent with complicated object-oriented programming models or
code that contains extensive pointer indirection, which make it difficult for the com-
piler to determine which loops are eligible for safe, parallel execution. Finally, it is
worth noting that the kernels construct may be interpreted differently by different
compilers, resulting in different optimization, such as more or less effective data
reuse. Whether this level of uncertainty is acceptable depends on the target end
users and deployment strategy for your application.
Whether a given loop will actually benefit from offloading depends on several
competing factors, among them:
• The number of parallel loop iterations, depending on the size of the dataset
• The speed and latency of the transfer bus (PCIe2, PCIe3, NVLink)
With so many competing factors, it becomes probable that some loops that benefit
from accelerator execution with some datasets or hardware may actually result in
an application slowdown in other scenarios. Targeting only loops that are guaran-
teed to result in a large speedup will result in low on-device computation and less-
than-ideal performance. On the other hand, blindly accelerating all thread-safe
loops without considering performance could be disastrous.
A runtime testing and tuning approach is one work-around for this problem.
Although the details of the implementation vary from application to application,
the basic concept is the same: At run time or during a trial execution, each parallel
region should be launched and timed on both the CPU and the accelerator, includ-
ing all necessary data transfers. From this comparative information, each region
can be assigned to the CPU or the device, and the related data regions can be
configured accordingly.
OpenACC has a simple if clause in place, for both kernels and parallel constructs,
that facilitates conditional execution. An example of this approach follows in Listing 6.2,
with a placeholder accTimer class used to gather and store comparative CPU and
device execution times.
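Listing 6.2 is not reproduced here; a minimal sketch of the idea, with the flag and
cutoff treated as assumed tuning values rather than part of any real API, is:

    /* useGPU would normally be set from comparative timings (for example,
       by the accTimer class mentioned above); here it is a plain flag. */
    int useGPU = (n > smallDatasetCutoff);

    #pragma acc parallel loop if(useGPU) copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = 2.0 * x[i];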
The time required to create and update the 2D and 1D data structures has been
measured with increasing dataset sizes. These tests were performed on a Linux
workstation with a PCIe3 transfer bus between the main memory and the device.
As shown in Figure 6.1, the device offload time for the smaller datasets is domi-
nated by the overhead of each distinct transfer. The 1D variant is guaranteed to be
stored in a single contiguous block of memory and may be offloaded to the device
with a single transfer, whereas the 2D variant is guaranteed to be only nbVar
contiguous blocks of memory and must be offloaded accordingly. The impact is sig-
nificant, leading to more than a 2× difference in offload time. With larger datasets
the transfer time becomes linearly dependent on the total transfer size, and the
performance difference of the two methods is diminished. From these results we
can establish a few general guidelines for minimizing data transfer time.
Figure 6.1 Solution array device offload demonstration
The OpenACC present clause is used to indicate that data utilized in a parallelized
loop already resides in device memory. This simple clause is often a positive
indicator of well-optimized data locality. If the parallel regions in an application list
most or all their associated data in a present clause, you can usually assume that
data is being efficiently reused. The present clause works for many different data
types, including arrays, structs, and even pointers to C++ objects.
Listing 6.4 demonstrates the use of the present clause for reuse of a data array
in multiple parallel loops. In this example the unstructured enter data and exit
data directives are used. Structured data regions can also be used to the same effect.
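Listing 6.4 is not reproduced here; a minimal sketch of the same pattern, with an
assumed array q of length n, is:

    #pragma acc enter data copyin(q[0:n])      /* move data once           */

    #pragma acc parallel loop present(q[0:n])  /* data already on device   */
    for (int i = 0; i < n; i++)
        q[i] *= 2.0;

    #pragma acc parallel loop present(q[0:n])  /* reused, no new transfer  */
    for (int i = 0; i < n; i++)
        q[i] += 1.0;

    #pragma acc exit data copyout(q[0:n])      /* bring results back       */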
a subarray, allowing you to efficiently update the minimum amount of data nec-
essary. Array shaping also applies for multidimensional arrays, allowing you to
selectively transfer or update portions of a large, multidimensional array.
Tip
A common pitfall comes from syntax differences between the Fortran and C/C++ array
shape specification. Many legacy applications contain both Fortran and C modules with
shared data structures, so to avoid data transfer bugs, you need to take extra care with
array shaping. In Fortran the shape specification is data(startIndex:endIndex),
and in C/C++ the specification is data[startIndex:count].
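For example (array name and sizes assumed), the same n-element update is written
differently in the two languages:

    /* C/C++: [start:count] */
    #pragma acc update device(data[0:n])
    /* Fortran: (start:end), e.g.
       !$acc update device(data(1:n))
       Note that the Fortran form names the last index, not a count. */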
3. https://www.pgroup.com/products/community.htm.
4. R. Pecnik, E. Rinaldi, and P. Colonna, “Computational Fluid Dynamics of a Radial Com-
pressor Operating with Supercritical CO2,” Journal of Engineering for Gas Turbines and
Power 134(12) (2012): 122301-122301-8.
5. C. Lettieri, D. Yang, and Z. Spakovszky, “An Investigation of Condensation Effects in
Supercritical Carbon Dioxide Compressors,” Journal of Engineering for Gas Turbines and
Power 137(8) (2015): 082602-082602-8.
6.4.3 Profiling
Before attempting to accelerate the baseline CPU code, you must determine how
the application is consuming resources, in particular the execution time. For the
profiling run we use a dataset size that is representative of a small CFD simulation:
1e5 (105) points and 1e3 (103) iterations. The provided source file ThermoTables_
CPU.C can be compiled and profiled with the following commands:
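The commands themselves are not shown here; a typical gprof-based sequence (the
-pg instrumentation flag, executable name, and output redirection are assumptions)
would look something like this:

    [user@hostname]$ pgc++ --c++11 -pg ThermoTables_CPU.C -o out_CPU
    [user@hostname]$ ./out_CPU 100000 1000
    [user@hostname]$ gprof out_CPU gmon.out > profile.txt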
The first lines of the output profile.txt file show a flat profile. The "percent
time” and “self seconds” columns indicate that most of the program execution time
comes from determining table subintervals in the bisection function and per-
forming the weighted average in the LookupTable2D::interpolate method.
At first glance this indicates that this program has multiple execution hot spots, but
upon closer inspection we determine that the bisection method is called from
within interpolate, and in fact this is a single hot spot. Careful reading of the
call-graph output confirms this. The interpolate method is called millions of
times, so finding parallelizable loops within the method is unlikely. However, the
main function, which calls interpolate, may be a feasible point in the call tree
to expose parallelism.
Flat profile:
  %      cumulative   self                  self      total
  time   seconds      seconds   calls       ns/call   ns/call   name
  52.33      8.32       8.32    100000000     83.20    156.70   interpolate(. . .)
  46.23     15.67       7.35    200000000     36.75     36.75   bisection(. . .)
   0.82     15.80       0.13                                    main
  . . .

Call graph:
  index   % time   self   children   called                  name
  [1]       99.4   0.13      15.67                            main [1]
                   8.32       7.35   100000000/100000000          interpolate(. . .) [2]
  ...
  -----------------------------------------------
                   8.32       7.35   100000000/100000000          main [1]
  [2]       98.6   8.32       7.35   100000000               interpolate(. . .) [2]
                   7.35       0.00   200000000/200000000          bisection(. . .) [3]
It is obvious that the temp, rho, and pres arrays must be offloaded to the GPU.
Less obviously, presFromRhoTemp, a pointer to a LookupTable2D object
instantiated on the heap, must also be offloaded. To handle this you need to provide
the class with methods that create and later delete a copy of the object and its table
data in device memory; in this example those are createDevice() and deleteDevice().
It is then necessary within the main function to call the createDevice() method
before the parallel loop, and call deleteDevice() afterward. For the solution
arrays you can append the copy clause to the loop, as shown in Listing 6.6.
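Listings 6.5 and 6.6 are not reproduced here; a sketch of the resulting pattern in
main, with the method and array names taken from the text but all signatures and
bounds assumed, is:

    presFromRhoTemp->createDevice();   /* assumed to place the table object
                                          and its data in device memory    */

    #pragma acc parallel loop present(presFromRhoTemp[0:1]) \
                copyin(rho[0:n], temp[0:n]) copyout(pres[0:n])
    for (int i = 0; i < n; i++)
        pres[i] = presFromRhoTemp->interpolate(rho[i], temp[i]);

    presFromRhoTemp->deleteDevice();   /* release the device copy          */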
Before compiling with OpenACC you must tell the compiler how to handle method
and function calls within the parallel region. The same directive can be used for
both the class method interpolate and the stand-alone bisection function.
Placing the #pragma acc routine seq directive on the line before each proto-
type instructs the compiler that the routine will be launched on the device sequen-
tially by each thread. Now you can compile and compare the CPU-only and CPU+GPU
performance.
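For example (prototypes assumed; only the directive placement matters):

    #pragma acc routine seq
    int bisection(const double *breaks, int n, double value);

    class LookupTable2D {
    public:
        #pragma acc routine seq
        double interpolate(double rho, double temp);
        /* . . . */
    };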
At first glance, these results look great. The OpenACC implementation has resulted
in an approximately 6× global speedup. However, this first test is run on a single
CPU core, and the Xeon E5-2620 has six cores, making the comparison incomplete.
You can use the OpenACC multicore target option to generate threaded code for the
CPU, similar to widely used OpenMP parallelism.
[user@hostname]$ pgc++ --c++11 -acc -ta=multicore ThermoTables_OpenACC.C -o out_CPU_Multicore
[user@hostname]$ time ./out_CPU_Multicore 100000 1000
real    0m2.376s
user    0m14.143s
sys     0m0.004s
From the multicore run, you see much less favorable results. The first OpenACC
implementation with the table interpolation on the GPU yields performance much
better than a single CPU core, but nearly identical to the six CPU cores. Additional
optimizations are needed to obtain a worthwhile speedup.
• The table class is unchanged over the length of the computation and should be
copied to the device before the main iteration loop and deleted afterward.
• Similarly, the three quantity arrays can be permanently created on the device
outside the main iteration loop.
• Inside the iteration loop, the quantities can be selectively updated: density and
temperature on the device, and the pressure on the host.
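A sketch of the three changes just listed (names follow the text; bounds and
signatures are assumed):

    presFromRhoTemp->createDevice();               /* table copied once    */
    #pragma acc enter data copyin(rho[0:n], temp[0:n]) create(pres[0:n])

    for (int iter = 0; iter < nIters; iter++) {
        #pragma acc update device(rho[0:n], temp[0:n])   /* inputs only    */

        #pragma acc parallel loop present(presFromRhoTemp[0:1]) \
                    present(rho[0:n], temp[0:n], pres[0:n])
        for (int i = 0; i < n; i++)
            pres[i] = presFromRhoTemp->interpolate(rho[i], temp[i]);

        #pragma acc update self(pres[0:n])               /* pressure only  */
    }

    #pragma acc exit data delete(rho[0:n], temp[0:n], pres[0:n])
    presFromRhoTemp->deleteDevice();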
The results of the improved data locality are significant, reducing the execution
time of this test from 2.38 seconds to 0.75 seconds. This represents a greater than
3× reduction in execution time compared with the multithreaded six-core CPU
results. In a full CFD solver many more hot spots would exist in the iteration loop,
with some eligible for parallel execution and others not eligible. The same selective
quantity update shown here could be extended to the additional routines, and, with
a sufficient number of hot spots eligible for acceleration, it will be possible to skip
certain updates, further improving data locality.
Figure 6.2 Table interpolation performance study
Tip
Depending on your hardware and operating system, there may be a significant over-
head incurred when the accelerator is initialized. The PGI pgcudainit utility can be
used to reduce this overhead and obtain more accurate results. Also, the command
line time utility may not be able to obtain accurate timings with small datasets. The
C++11 std::chrono::high_resolution_clock utility can be used to
obtain very high-resolution timing data for targeted regions.
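A minimal, self-contained example of that timing approach (the timed region is
just a placeholder):

    #include <chrono>
    #include <cstdio>

    int main() {
        auto t0 = std::chrono::high_resolution_clock::now();
        /* . . . region of interest, e.g. the accelerated loop . . . */
        auto t1 = std::chrono::high_resolution_clock::now();
        std::printf("elapsed: %f s\n",
                    std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }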
6.5 Summary
Best programming practices for effective heterogeneous development with
OpenACC are meant to yield meaningful performance improvements without major
sacrifices to the maintainability and portability of the code, and with a minimum
amount of programmer time. These practices provide a conceptual foundation that
an OpenACC programmer can use to accelerate any large scientific application.
6.6 Exercises
1. In the table interpolation example, a random dataset of density and tempera-
ture values is generated. Using a small set of hard-coded input density and
temperature values, verify that the CPU, multicore CPU, and CPU+GPU versions
of the code yield identical output pressure values.
2. Offloading the table interpolation to the GPU has been shown to be inefficient
with datasets smaller than 1e3 points. Use the if clause to conditionally skip
data transfer and GPU execution for small datasets.
3. The dataset size cutoff for effective GPU acceleration varies, depending on your
system hardware. Implement a runtime tuning class to conditionally skip GPU
execution based on comparative CPU and GPU execution times gathered over a
specified number of solver iterations.
7.1 Challenges
Performance portability has been identified by the high-performance computing
(HPC) community as a high-priority design constraint for next-generation systems
such as those currently being deployed in the Top500 (a list that ranks systems by
using the Linpack benchmark)1 as well as the exascale systems upcoming in the
next decade. This prioritization has been emphasized because software devel-
opment and maintenance costs are as large as or larger than the cost of the system
itself, and ensuring performance portability helps protect this investment—for
example, by ensuring the application’s usability if one architecture goes away or a
new one becomes available.
Looking forward, we are seeing two main node-architecture types for HPC: one
with heterogeneous accelerators (e.g., IBM Power-based systems with multiple
NVIDIA Volta GPUs; Sunway accelerator architecture), and the other consisting of
homogeneous architectures (e.g., third-generation Intel Xeon Phi-based nodes,
Post-K ARM, etc.). At present it is a nontrivial task to write a performance portable
application that targets these divergent hardware architectures and that makes
efficient use of all the available computational resources (both at the node level and
across nodes). It is clear that applications need to be written so that the parallelism
can be easily decomposed and mapped with at least three levels of parallelism:
across nodes, within the nodes (thread-level parallelism), and vector-level parallel-
ism (fine-grained or SIMD-level parallelism).
In this chapter, we show how OpenACC can be used as a single programming model
to program host multicore, homogeneous, and heterogeneous accelerators, and
we discuss the potential performance or productivity tradeoffs. We highlight how
OpenACC can be used to program a micro-kernel called the Hardware Accelerated
Cosmology Code (HACCmk), which is sensitive to vector-level parallelism (e.g.,
vectorization, warps, SMTs/SIMD, etc.).
We use the PGI and Cray compilers (see Section 7.3.4, “Data Layout for Perfor-
mance Portability,” later in this chapter for versions) to target OpenACC both on
CPUs and on NVIDIA GPU hardware platforms. We compare the performance of the
resulting OpenACC code with that of the original OpenMP shared-memory implementation
on both platforms.
1. https://www.top500.org/project/.
We summarize these experiences to reflect the current state of the art for achiev-
ing performance portability using OpenACC.
The default PGI behavior for the –acc flag is to create a unified “fat” binary that
includes both host serial CPU and multiple targets of varying compute capabilities
(cc20, cc35, cc50, etc.). This same behavior can be obtained by using, for instance,
the flag -ta=tesla,host. At run time, the default OpenACC device_type will
be NVIDIA unless no NVIDIA targets are available, in which case the device_
type will be HOST.
Users can modify the default by either compiling to a specific target (such as
-ta=tesla:cc60 for the Pascal architecture, or -ta=multicore for a parallel
CPU version); calling acc_set_device_type() in the program; or setting the ACC_
DEVICE_TYPE environment variable to have an effect on the default device type.
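For example, to force host execution without recompiling the compute regions (a
minimal sketch; the surrounding program is omitted):

    #include <openacc.h>

    int main(void) {
        /* Equivalent effect from the shell: export ACC_DEVICE_TYPE=host */
        acc_set_device_type(acc_device_host);
        /* . . . OpenACC compute regions now run on the host . . . */
        return 0;
    }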
In the near future, OpenACC parallelization for KNL will be supported by the PGI
compiler, furthering its capabilities toward performance portability.
OpenACC data regions can be synchronized with the host memory. All data move-
ment between host memory and device memory is performed by the host through
runtime library calls that explicitly move data between the memories (or are ignored
on shared memory systems), typically by using direct memory access (DMA) transfers.
• Discrete memories. These systems have completely separate host and device
memory spaces that are connected via, for example, PCIe or NVLINK. This mem-
ory architecture maps well to the OpenACC data regions’ copy-in and copy-out
memory model.
• Shared memory. In these systems, all of the cores can access all of the memory
available in the system. For example, traditional shared multicore systems (Intel
KNL self-hosted, etc.) fit into this category. This model maps well to OpenACC,
because the data region directives can be ignored or used as hints to the com-
piler to do prefetching.
The OpenACC compilers also accept gang, worker, or vector clauses to pick the
correct level of parallelism. If you specify only acc loop, the compiler decides
how to map the iteration space to the target architecture based on its cost models
and picks the right type of scheduling across gangs, workers, or vectors. This is an
important feature of OpenACC, because it gives the compiler the freedom to pick
how to map the loop iterations to different loop schedules, and that helps generate
performance portable code.
However, in some cases the compiler cannot do the best job in generating the cor-
rect loop schedules. For these cases, you can improve the loop scheduling by add-
ing clauses such as gang, worker, or vector to the OpenACC loops. In cases of
perfectly nested parallel loops, OpenACC also supports the use of the tile clause
to schedule nested loops to a given level of parallelism (e.g., gang or vector).
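A sketch of both options on an assumed pair of nested loops:

    /* Explicit schedule hints; the bare "acc loop" form would leave these
       choices to the compiler. */
    #pragma acc parallel loop gang vector_length(128)
    for (int i = 0; i < n; i++) {
        #pragma acc loop vector
        for (int j = 0; j < m; j++)
            c[i][j] = a[i][j] + b[i][j];
    }

    /* For perfectly nested loops, tile schedules both levels at once. */
    #pragma acc parallel loop tile(32, 32)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            c[i][j] = a[i][j] + b[i][j];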
It’s important to decide how the layout of data structures affects performance
portability. Structures of arrays are in general more suitable for GPUs as long as
the data access to the arrays is contiguous (memory coalescing). This is a good
layout optimization for throughput-driven architectures. On the other hand, arrays
of structures are also good for caching structure elements to cache lines, a layout
that is important for latency-driven architectures common to host CPUs. Interest-
ingly, we have noted that in some cases, improving data layouts on GPUs can also
benefit the multicore case, but not as commonly the other way around.
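A small sketch of the two layouts (type and field names are illustrative):

    #define N 1024

    /* Array of structures (AoS): the fields of one element share a cache
       line, which suits latency-oriented CPU access. */
    struct CellAoS { float x, y, z, w; };
    struct CellAoS cellsAoS[N];

    /* Structure of arrays (SoA): each field is contiguous, so neighboring
       GPU threads reading element i of the same field access coalesced
       memory. */
    struct CellsSoA { float x[N], y[N], z[N], w[N]; };
    struct CellsSoA cellsSoA;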
We note that high-level frameworks for data abstractions (e.g. Kokkos,2 SYCL,3
etc.) can be useful to explore these types of issues related to data structure lay-
outs, but they come at the cost of making the compilation process and compiler
analysis more complex.
2. http://www.sciencedirect.com/science/article/pii/S0743731514001257.
3. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0236r0.pdf.
7.4.1 HACCmk
HACC is a framework that uses N-body techniques to simulate fluids during the
evolution of the early universe. The HACCmk4 microkernel is derived from the HACC
application and is part of the CORAL benchmark suite. It consists of a short-force
evaluation routine that uses an O(n²) algorithm and mostly single-precision
floating-point operations.
The HACCmk kernel as found in the CORAL benchmark suite has shared memory
OpenMP 3.1 implemented for CPU multicore parallelization, but here we convert
it to OpenACC 2.5. The kernel has one parallel loop over particles that contain a
function call to the bulk of the computational kernel. This function contains another
parallel loop over particles, resulting in two nested loops over the number of par-
ticles and the O(n²) algorithm, as described by the benchmark. A good optimizing
compiler should be able to generate performance portable code by decomposing
the parallelism of the two nested loops across threads and vector lanes (fine-
grained parallelism). For vector (or SIMD-based) architectures, the compiler should
aim at generating code that exploits vector instructions having long widths. For
multithreaded or SMT architectures, it should exploit thread-level parallelism and
smaller vector lanes.
4. https://asc.llnl.gov/CORAL-benchmarks/Summaries/HACCmk_Summary_v1.0.pdf.
When you compile HACCmk with the PGI compiler and target an AMD Bulldozer
architecture, you get the following compiler output:
pgcc -acc -Minfo -O3 -ta=multicore -c main.c -o main.o
main:
    188, Loop not vectorized/parallelized: contains a parallel region
    203, Loop not vectorized: data dependency
         Loop unrolled 4 times
         FMA (fused multiply-add) instruction(s) generated
    211, Generated an alternate version of the loop
         Generated vector simd code for the loop
    222, Memory set idiom, loop replaced by call to __c_mset1
    223, Loop unrolled 8 times
    239, Generating Multicore code
         242, #pragma acc loop gang
    257, Loop is parallelizable
         Generated vector simd code for the loop containing reductions and conditionals
         Generated 4 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated
    301, Generated vector simd code for the loop containing reductions
         Generated 3 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated
Notice that for these two architectures, PGI translates the outer OpenACC loop to
gang-level parallelism (loop 239) and translates the inner loop to vector-level par-
allelism (loop 257). However, one of the main differences is the vector_length
used. For the GPU version, the vector length is 128, whereas the CPU version is
based on the size of the AVX vector register (256 bits). In this case the vector_length
is 8 (eight single-precision floats per vector instruction). Also notice that to efficiently
vectorize the inner loop for multicore, the compiler must generate vector predicates
and intrinsics (e.g., for powf()).
We specified different vector lengths using OpenACC, but in both cases (GPU and
multicore) the PGI compiler always picked 128 for the GPU and 4 for the multicore
architecture. The compiler's internal cost models picked the right length for each
architecture and, where it saw fit, ignored the clauses provided by the user, as
reported in the output from -Minfo shown earlier.
To control the number of threads spawned on the multicore platforms, you use the
ACC_MULTICORE environment variable. However, this variable is a PGI exten-
sion and not part of the OpenACC standard. You should set this variable if you want to
control the number of threads to be used on the CPU. If it is not set, the
CPU will use the maximum number of threads available on the target multicore
architecture.
Figure 7.1 shows the HACCmk speedup of the OpenACC version when running on an
NVIDIA K20x GPU, as compared with the OpenMP shared-memory version running
on an AMD Bulldozer processor using 8 host CPU threads (because each floating-point
unit is shared between 2 of the 16 physical cores). The OpenACC version always
outperformed the shared-memory version running on the CPU. This is what we
would expect given the K20x compute capabilities. When we compare the results
using different compilers, we observed less OpenACC speedup when using the PGI
16.5 compiler. These results highlight the fact that performance portability of code
also depends on the quality of the compiler optimizations, because more hints may
be needed to generate performance portable code, depending on the compiler.
Figure 7.1 Speedup of OpenACC running on NVIDIA K20x GPUs when compared to OpenMP shared
memory running on Bulldozer AMD CPU using 8 threads
Figure 7.2 Speedup of OpenACC running on a Bulldozer AMD CPU when compared to OpenMP
shared memory running on a Bulldozer AMD CPU using 8 threads (speedup over 8 OpenMP
threads, plotted against problem size in number of particles; OpenACC multicore build
with the PGI compiler using 8 threads)
threads on the CPU. The OpenACC version outperformed the OpenMP 3.1 version.
One of the reasons is that our OpenACC version inlines the inner loop (compared
to the OpenMP 3.1 version) and provides more information to the vectorization
phase of the compiler, including information about reductions. We also notice that
when the problem size increases, the OpenACC improvement in terms of speedup
becomes less profitable (from 1.49 to 1.06) as we saturate the memory band-
width. The Cray 8.8.5 compiler did not support targeting OpenACC to multicore.
7.5 Summary
It is possible to achieve improved performance portability across architectures
when you use OpenACC. Performance of OpenACC depends on the ability of the
compiler to generate good code on the target platform. For example, we observed
a significant performance variation when we compiled OpenACC with Cray 8.5.0
compared with the PGI 16.5. Further investigation showed a significant perfor-
mance variation when we tried PGI 16.7. This tells us that compilers play a sig-
nificant role in the level of performance portability of a programming model. To
write performance portable code, it is important to specify where the parallelism
in the code is (e.g., via #pragma acc loop) without specifying clauses that affect
how the parallelism is mapped to the architecture. For example, specifying hints
such as the OpenACC vector_length can help achieve good performance on an
architecture, but, at the same time, it can possibly hinder performance on another
one. However, sometimes these hints are necessary when the compiler cannot
efficiently map the parallelism to the target platform.
Not only is the #pragma acc loop directive extremely useful for performance
portability (to specify parallelism for offloading to accelerators), but also, depend-
ing on the implementation, it can be critical if you are to achieve good levels of mul-
tithreading and vectorization. It can help compilers identify the parallelism in the
code when they cannot figure it out automatically, and this is important for optimi-
zations such as automatic vectorization. Compilers cannot yet consistently identify
these opportunities in all cases, so hints must be used to ensure that vectorization
is used where appropriate. Although GPUs do not have vector units, the #pragma
acc loop directive can be used to help identify parallelism that can be mapped to
grids, thread blocks, and potential very fine-grained parallelism that can be exe-
cuted by SMT threads (e.g., GPU warps).
7.6 Exercises
1. How many levels of parallelism can be specified using OpenACC?
3. Which of the following memory models does OpenACC support? Describe any
limitations that apply to those models that are supported.
a. Shared memory
b. Discrete memory
4. Which of the following clauses can be used to tune for a specific architecture?
a. Specifying a vector_length()
c. Both
d. Neither
8.1 Programming Models
This section offers a short description of each of the most relevant programming
models. For this purpose, the models are also distinguished with respect to their
implementation strategy (language features, directives, language extensions, or
C++ abstraction), and their supported parallelization models (thread parallelism,
task parallelism, and data parallelism).
Task parallelism–based systems are also flexible. They handle both functional and
data parallelism. Furthermore, they can naturally expose parallelism in situations
with complex data and synchronization dependencies.
Data parallelism is the most constrained model but is simpler to use than thread
parallelism, and generally it has less runtime overhead than task parallelism–
based systems. Data parallelism models describe work mainly via an index space,
which can be iterated over in parallel. This does not necessarily mean that the
iteration scheme is directly associated with a corresponding data structure. The
data could be calculated during the execution of the program based on the iteration
index.
Some programming models provide capabilities from all three areas. For exam-
ple, OpenMP allows for thread parallelism and data parallelism as well as task
parallelism.
Language features are the capabilities within the language itself that allow expres-
sions of parallelism. Examples of those are the DO CONCURRENT concept in
Fortran 2008, std::thread in C++11, and for_each in C++17. The advantage of
using programming models that are implemented as language features is that they
are part of the official programming language standard. Usually this means that
they work well together with all other capabilities of a language. Additionally, they
don’t require any other third-party tools in addition to the compiler itself. However,
native parallelism in language standards is usually many years behind program-
ming models implemented through other means with respect to its scope.
For example, Kokkos, from the Sandia National Laboratories, provides back ends
for CUDA, OpenMP, C++ threads, and serial execution, only one of which needs to be
available on a given platform to compile and execute an application. The particular
features of these back ends are abstracted, allowing the programmer to write code
as if it were native C++. Other examples of this approach are RAJA and TBB.
Table 8.1 lists the various types of programming models, supported languages, and
architectures these models can target.
8.1.1 OpenACC
Although OpenACC is extensively described in previous chapters, it is worthwhile
to identify its properties relative to other programming models discussed in this
chapter. OpenACC is a data parallel programming model designed to provide
performance portability across various hardware architectures. As such it has pro-
vided data management directives from its inception. The semantics of OpenACC
directives are generally less restrictive than OpenMP semantics, allowing for more
flexible mapping of algorithmic parallelism to the available hardware execution
resources. A large number of concepts (though not syntax) introduced by OpenACC
are now part of OpenMP 4. More information about OpenACC can be found at http://
www.openacc.org.
8.1.2 OpenMP
OpenMP is arguably the most mature of the shared-memory parallel programming
models. As with OpenACC, it is a directive-based model that is available in Fortran,
C, and C++. The OpenMP standard is defined by a consortium (OpenMP Architecture
Review Board (ARB)) that has many members, including major hardware and soft-
ware vendors such as Intel, IBM, AMD, NVIDIA, ARM, Oracle, and Fujitsu. As such,
OpenMP is widely supported across the vast majority of all hardware platforms
as well as on all major operating systems. All major compilers relevant for HPC
support OpenMP directives, including GCC, Clang, Intel, IBM, Cray, and PGI.
In 1997, OpenMP was targeted for simple data parallel multiprocessing on CPUs,
and thus its concepts and semantics are geared toward the existence of heavy-
weight threads. In fact, OpenMP has supported direct thread parallel algorithms
from the start. In version 4.0, released in 2013, concepts were added to enable the
support of heterogeneous architectures, including designs that contain acceler-
ators. A significant number of new directives were introduced that support mas-
sively parallel algorithms on lightweight threads.
In its current iteration, version 4.5, OpenMP provides a rich set of parallel execu-
tion capabilities as well as data management capabilities for two-level memory
systems. It also supports task parallelism in addition to data and thread parallel
constructs.
8.1.3 CUDA
CUDA is a language extension for C and C++ to enable support for using NVIDIA GPUs. It
was initially released in 2007 to exploit the increased flexibility of NVIDIA GPUs, allow-
ing them to be used for general-purpose programming instead of only graphic
computations. CUDA is designed to exploit massive parallelism by using inherent
hierarchies. It expresses parallelism through a grid of thread blocks, where only
threads within the same block can synchronize with each other. Furthermore,
threads within a block are organized in so-called warps, which execute instructions
in lockstep.
Therefore, CUDA can be considered a hybrid data and thread parallel program-
ming model at its core. Interthread block parallelism corresponds more closely to
pure data parallelism: You specify how many blocks need to be executed (the index
space), but CUDA does not promise anything with respect to the order of execution
nor the number of concurrently executing thread blocks. Intrathread block par-
allelism, on the other hand, is thread parallelism: You specify how many threads
are concurrently executing, and those threads can interact via memory as well as
synchronization constructs.
You make data management explicit by using special allocators and deep copy
function calls. There are no built-in data constructs (other than for special access
operations such as noncoherent constant loads), so higher-level structures must
be built from raw pointers.
The CUDA ecosystem provides a comprehensive set of scientific libraries that help
with implementing complex applications. These include BLAS and solver libraries
as well as comprehensive data analytics and artificial intelligence libraries.
Note that the CUDA programming model was also adapted to Fortran in the PGI
compiler suite. More resources for CUDA are provided at http://www.nvidia.com/
cuda.
8.1.4 OpenCL
With its introduction in 2008, OpenCL tried to fill the need for a programming model
that allows applications to run across a wide range of hardware architectures. As
an open standard released by the Khronos Group, it was soon supported by all
major hardware companies, including Intel, NVIDIA, AMD, and IBM. Conceptually
OpenCL has a lot of similarities to CUDA, a natural consequence of its desire to
support GPU-type architectures. OpenCL provides a hybrid data/thread parallel
model equivalent to CUDA. Work is scheduled in work-groups, with the work items
inside a work-group able to synchronize with one another, much as threads within a
CUDA thread block do.
Within the area of HPC, OpenCL is significantly less common than CUDA or
OpenACC programs. Current trends indicate that the availability of OpenMP 4.5
will further reduce the relevance of OpenCL in HPC, because OpenMP 4.5 will be
supported across all major hardware platforms. OpenCL’s home page can be found
at http://www.khronos.org/opencl.
8.1.6 Kokkos
Developed at Sandia National Laboratories, Kokkos is a programming model
intended for writing performance portable HPC applications. It is implemented as
a C++ abstraction layer and provides pure data parallel concepts, task parallelism,
and hybrid data/thread parallel and task/thread parallel capabilities. One differen-
tiator of Kokkos is its comprehensive data abstraction layer. The model can handle
deep multilevel memory hierarchies as well as hybrid architectures having more
than two types of execution resources.
The programming model Kokkos has been under development since 2011 and has
been considered production ready since 2015. It is now the default model for Sand-
ia’s extensive application portfolio for implementing shared-memory parallelism.
Data abstractions are handled mostly via the view abstraction, which is a multidi-
mensional array abstraction with view semantics. A particular feature of views is
their memory layout abstraction, which allows you to specify the index mapping to
memory location. Views also implement the concepts of memory spaces as well
as access traits such as atomic operations. Kokkos itself and libraries related to
it (such as profiling tools and math libraries) are available at https://github.com/
kokkos.
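A small sketch of the view abstraction (dimensions and names are illustrative):

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {
            /* The layout template parameter controls the index-to-memory
               mapping, so the same loop body can be built with a CPU- or
               GPU-friendly layout. */
            Kokkos::View<double**, Kokkos::LayoutLeft> a("a", 1000, 16);
            Kokkos::parallel_for("fill", 1000, KOKKOS_LAMBDA(const int i) {
                for (int j = 0; j < 16; j++) a(i, j) = i + j;
            });
        }
        Kokkos::finalize();
        return 0;
    }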
8.1.7 RAJA
Like Kokkos, RAJA is a programming model providing performance portability
for applications through C++ abstractions. RAJA was developed at the Lawrence
Livermore National Laboratory and is used to port HPC applications to run on new
supercomputer designs. RAJA focuses on parallel execution and provides mech-
anisms to customize iteration schemes for specific applications. Although RAJA
itself does not handle data management, libraries built on top of RAJA (such as
CHAI) are being developed with more comprehensive data abstractions. RAJA is
available at https://github.com/llnl/raja.
8.1.9 C++17
With the C++11 standard, the C++ standards committee has begun to address the
need for shared-memory parallelism on modern computing systems by intro-
ducing std::thread. Although these capabilities allow for expressing thread
parallelism similar to using POSIX threads, they do not allow for the use of hybrid
compute architectures such as systems with CPUs and GPUs. This starts to
change with C++17, which introduces fundamental data parallel constructs. Such
constructs allow significantly more freedom to map parallel work onto various
execution resources.
In C++17 only the very basics have been introduced. It provides a for_each
function that takes an execution policy as the first argument. This execution policy
states the allowable parallelization semantics for the provided index range. Tech-
nically these semantics allow for executing on a GPU, given a coherent memory
subsystem, but at this point no compilers are available to exploit that possibility.
The C++ committee is working on extending the scope of the available paralleliza-
tion capabilities. The proposals are strongly influenced by programming models
implemented as C++ abstraction layers. Indeed, many developers of those abstrac-
tion layers explicitly state that they consider their models as prototypes for a future
C++ standard.
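A minimal sketch of this facility, assuming a standard library that implements the
parallel execution policies:

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> v(1000);
  std::iota(v.begin(), v.end(), 0.0);
  // The execution policy is the first argument; par_unseq permits
  // parallel and vectorized execution of the element-wise operation.
  std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                [](double& x) { x = 2.0 * x; });
  return 0;
}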
for many models. Although you can implement them through a combination of
parallel loops, temporary data buffers, and serial loops, they are not an actual con-
cept in many models and hence are shown as (X). For example, CUDA and OpenCL
do not provide reductions. Also, most of the syntax examples are shown in C/C++
because only OpenMP, OpenACC, and Fortran 2008 have support for Fortran.
Model     | Parallel | Parallel  | Tightly      | Non-tightly  | Task        | Data        | Data
          | loops    | reduction | nested loops | nested loops | parallelism | allocations | transfers
OpenACC   | X        | X         | X            | X            | —           | X           | X
OpenMP    | X        | X         | X            | X            | X           | X           | X
In most models, parallel loops are a first-class concept, meaning that there is a
specific syntax explicitly designed for parallelizing a loop over an index range. The
exceptions are CUDA and OpenCL, which, as previously discussed, are hybrid data/
thread parallel models. In both cases, part of the mapping from loop indexes to
execution resources must be carried out explicitly by the application.
A simple example of parallel loops is a vector addition, which as a plain C/C++ code
can be written this way:
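A minimal serial version (function and argument names are assumptions) is:

void launch(int N, double* z, const double* x, const double* y) {
  // Plain serial vector addition: z[i] = x[i] + y[i]
  for (int i = 0; i < N; i++)
    z[i] = x[i] + y[i];
}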
Table 8.3 Mapping of a simple loop to the presented parallel programming models
Model Syntax
OpenACC void launch (int N, [ARGS]) {
#pragma acc parallel loop
for(int i=0; i<N; i++) {/*. . .*/}
}
OpenMP (CPU only) void launch (int N, [ARGS]) {
#pragma omp parallel for
for(int i=0; i<N; i++) {/*. . .*/}
}
OpenMP void launch (int N, [ARGS]) {
#pragma omp target teams distribute parallel for
for(int i = 0; i<N; i++) {/*. . .*/}
}
CUDA __global__ void foo( int N, [ARGS] ) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if(i<N) {/*. . .*/}
}
clfork(stdgpu,0,krn,&ndr,CL_EVENT_NOWAIT);
}
C++17 void launch (int N, [ARGS]) {
for_each(par_unseq,begin(range),end(range),[&] (int i)
{/*. . .*/});
}
Fortran DO CONCURRENT (i = 1:N)
. . .
END DO
C++ AMP void launch (int N, [ARGS]) {
extent<1> range(N);
parallel_for_each(range, [=] (index<1> idx) restrict(amp)
{ int i = idx[0]; {/*. . .*/} });
}
Kokkos void launch (int N, [ARGS]) {
parallel_for(N, KOKKOS_LAMBDA (int i) {/*. . .*/});
}
RAJA void launch (int N, [ARGS]) {
forall<cuda_exec<256>>(0, N, [=] RAJA_DEVICE (int i)
{/*. . .*/});
}
TBB void launch (int N, [ARGS]) {
parallel_for(0, N, [=] (int i) {/*. . .*/});
}
Simply parallelizing this loop would result in race conditions for the update of
result and thus would produce the wrong answer.
Even though parallel reduction is commonplace, only about half of the program-
ming models listed in this section provide a first-class concept of a reduction.
In particular, CUDA and OpenCL, as well as C++17 and Fortran 2008, do not have
native support for reductions. Implementing high-performing reductions in these
models is a nontrivial endeavor. It requires a combination of multiple parallel loops,
use of atomic operations, and thread parallel programming techniques such as
tree reductions, which heavily utilize pairwise barriers between threads.
Another concern is which operations a model supports for reductions and on which
data types these operations can work. Most models provide built-ins for numeric
scalar types (such as integer or floating-point scalars) and a set of common oper-
ations (sum, max, min). Furthermore, some models such as OpenMP, Kokkos, and
RAJA allow user-defined reduction operations and reduction types. Table 8.4, with
syntax examples, demonstrates a max reduction.
Table 8.4 Syntax for max reduction using various programming models
Model Syntax
OpenACC double launch (int N, [ARGS]) {
double result;
#pragma acc parallel loop reduction(max: result)
for(int i = 0; i<N; i++) {/*. . .*/}
}
OpenMP double launch (int N, [ARGS]) {
double result;
#pragma omp target teams distribute parallel for \
reduction(max: result) map(tofrom: result)
for(int i = 0; i<N; i++) {/*. . .*/}
}
Kokkos double launch (int N, [ARGS]) {
double result; Max<double> max(result);
parallel_reduce(N, KOKKOS_LAMBDA (int i, double& lmax)
{/*. . .*/},
max);
return result;
}
RAJA double launch (int N, [ARGS]) {
ReduceMax<cuda_reduce<256>,double> max_val(0.0);
forall<cuda_exec<256>>(0, N, [=] RAJA_DEVICE (int i) {
/*. . .*/
max_val(value_i);
});
return max_val.get();
}
As with simple loops, each iteration is completely independent and can be executed
in any order. Programming models that support both reductions and tightly nested
loops usually also support their combination. Table 8.5 shows the syntax for tightly
nested loops using various programming models.
Table 8.5 Syntax for tightly nested loops using various programming models
Model Syntax
OpenACC void launch (int N0, int N1, [ARGS]) {
#pragma acc parallel loop collapse(2)
for(int i0=0; i0<N0; i0++)
for(int i1=0; i1<N1; i1++) {/*. . .*/}
}
OpenMP void launch (int N0, int N1, [ARGS]) {
#pragma omp target teams distribute parallel for collapse(2)
for(int i0=0; i0<N0; i0++)
for(int i1=0; i1<N1; i1++) {/*. . .*/}
}
You could express this as a tightly nested loop for serial execution by simply adding
onto y[i0] in the inner loop, but this would assume that all elements of y are zero.
For parallel execution this approach would result in write conflicts, because all
the iterations of the inner loop would update the same value. Although true tightly
nested loops allow for reductions, it would need to be a reduction over the entire
index space. In the case of the matrix vector multiplication, you are confronted with
a set of independent reductions, and hence it falls into the category of non-tightly
nested loops.
In Kokkos, on the other hand, hierarchical parallelism is provided through the team
interface. A work item is given to a team of threads, with all threads active for the
lifetime of the work item. Nested parallel constructs thus are required only to com-
pute their loop bounds and can go right on computing. In fact, a nested parallel_
for can even be nonblocking. On the other hand, this makes it necessary to use
execution restriction capabilities in the outer scope—for example, if a contribution
to some global object should be made only once per work item. Table 8.6 shows the
syntax for non-tightly nested loops using various programming models.
Table 8.6 Syntax for non-tightly nested loops using various programming models
Model Syntax
OpenACC void launch (int N0, int N1, [ARGS]) {
#pragma acc parallel loop gang
for(int i0=0; i0<N0; i0++) {
/*. . .*/
#pragma acc loop vector reduction(+:y_i0)
for(int i1=0; i1<N1; i1++) {/*...*/}
/*. . .*/
}
}
OpenMP void launch (int N0, int N1, [ARGS]) {
#pragma omp target teams distribute
for(int i0=0; i0<N0; i0++) {
/*. . .*/
#pragma omp parallel for
for(int i1=0; i1<N1; i1++) {/*...*/}
/*. . .*/
}
}
CUDA __global__ void foo( int N1, [ARGS] ) {
int i0 = blockIdx.x;
/*. . .*/
for(int i1=threadIdx.x; i1<N1; i1+= blockDim.x) {/*. . .*/}
/*. . .*/
}
A third benefit of task parallelism is that it naturally allows for functional parallel-
ism. This capability can help expose more parallelism and help with goals such as
overlapping communication with computation.
The main mechanisms of task parallelism are spawning tasks, futures to wait on,
dependencies, and potentially a mechanism to preempt an active task (i.e., pause
its execution to wait for a child task). Arguably TBB tasking is the most compre-
hensive of the four models mentioned here, followed by Kokkos, which provides
mechanisms to generate dynamic task-directed acyclic graphs (DAG). OpenMP
tasking is significantly more limited in that it does not provide the future concept.
The only kinds of dependencies allowed are a parent task waiting on its children or
dependencies on data regions. Furthermore, OpenMP tasks must be launched inside
an OpenMP parallel region, and upon exiting the scope, a barrier ensures that all
tasks are finished. That means that it is not possible to write unstructured code
(as is often the case in C++) that can generate task trees. C++17 tasks do provide
the future concept but are otherwise basic. For example, there is no mechanism to
provide a future as a dependency to a new task for the purpose of scheduling order.
Because of the significant differences among the models, often it is not possible to
express task-based code implemented by one model in another model.
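As a point of reference, a minimal sketch of OpenMP tasking with data dependencies
(values and variable names are chosen only for this illustration):

#include <stdio.h>

int main() {
  int a = 0, b = 0, c = 0;
  #pragma omp parallel
  #pragma omp single
  {
    // Two independent tasks produce a and b; the third task consumes them.
    #pragma omp task depend(out: a)
    a = 1;
    #pragma omp task depend(out: b)
    b = 2;
    #pragma omp task depend(in: a, b) depend(out: c)
    c = a + b;
    // The implicit barrier at the end of the single/parallel region
    // guarantees that all tasks have finished before the region exits.
  }
  printf("%d\n", c);
  return 0;
}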
Table 8.7 Syntax for data allocation using various programming models
Model Syntax
OpenACC void foo(..) {
double* device_data = (double*) acc_malloc( N*sizeof(double) );
//. . .
acc_free(device_data);
}
OpenMP void foo(..) {
double* device_data = (double*) omp_target_alloc(
N*sizeof(double ), omp_get_default_device() );
//. . .
omp_target_free(device_data, omp_get_default_device());
}
CUDA void foo(..) {
double* a; cudaMalloc((void**)&a, N*sizeof(double));
//. . .
cudaFree(a);
}
OpenCL void foo(..) {
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE,
N*sizeof(double), NULL, NULL);
//. . .
clReleaseMemObject(a);
}
C++ AMP void foo(..) {
array<double,1> a(N);
//Deallocates at end of scope
}
Kokkos void foo (..) {
View<double*, MemSpace> a("A",N);
View<double*, MemSpace>::HostMirror h_a =
create_mirror_view(a);
//Deallocates on its own after falling out of scope
}
TBB void foo(..) {
double* a =
(double*) scalable_aligned_malloc(N*sizeof(double),64);
//. . .
scalable_aligned_free(a);
}
device happen when execution enters the scope, and copies from the device occur
upon exiting the scope. It also means that the device allocations are valid only for
that scope. Table 8.8 shows the syntax for data transfer using various program-
ming models.
Table 8.8 Syntax for data transfers using various programming models
Model Syntax
OpenACC (directive, structured)
void foo(int N, double* a) {
  #pragma acc data copy(a[0:N])
  {
    //. . .
  }
}
OpenACC (directive, unstructured)
void foo(int N, double* a) {
  #pragma acc enter data copyin(a[0:N])
  //. . .
  #pragma acc update self(a[0:N])
  //. . .
  #pragma acc exit data copyout(a)
}
OpenACC (API)
void foo(int N, double* host_data, double* device_data) {
  acc_memcpy_to_device(device_data, host_data, N*sizeof(double));
  //. . .
  acc_memcpy_from_device(host_data, device_data, N*sizeof(double));
}
OpenMP (directive)
void foo(int N, double* a) {
  #pragma omp target data map(tofrom: a[0:N])
  {
    //. . .
  }
}
OpenMP (API)
void foo(int N, double* host_data, double* device_data) {
  omp_target_memcpy(device_data, host_data, N*sizeof(double),
                    0, 0, omp_get_default_device(), omp_get_initial_device());
  //. . .
  omp_target_memcpy(host_data, device_data, N*sizeof(double),
                    0, 0, omp_get_initial_device(), omp_get_default_device());
}
CUDA
void foo(int N, double* host_data, double* device_data) {
  cudaMemcpy(device_data, host_data,
             N*sizeof(double), cudaMemcpyDefault);
  //. . .
  cudaMemcpy(host_data, device_data,
             N*sizeof(double), cudaMemcpyDefault);
}
For this case study, MiniFE was stripped of its MPI component, and the CG solve
was modified to call functions that take simple pointers as input instead of the
default vector and matrix classes otherwise used in MiniFE. Note that the solver
has a certain amount of extra code for collecting timing information. But no allo-
cated data are accessed outside the three math functions other than for data
allocation and transfer. In the following, the implementations of the vector addition,
the dot product, and the sparse matrix multiplication are given for several models.
The vector addition is a simple elementwise operation, which can be expressed as
a straightforward parallel loop. The dot product requires a reduction over the iter-
ation space. The sparse matrix vector multiplication has non-tightly nested loops,
with the inner loop being a reduction.
The code is available at the book’s repository (see the Preface), and the file contain-
ing all the programming model-specific code is src/cg_solve.h. The rest of the
application is largely unchanged from the original MiniFE version and is the same
code for all variants.
double* x = &x_in.coefs[0];
const double* b = &b_in.coefs[0];
The vector addition (AXPBY) in serial is a simple loop. There are six input param-
eters: the length of the vectors, the output vector z, the input vectors x and y, and
their prefactors alpha and beta. Both x and y are read only and thus come in as
const data. The name AXPBY comes from the math operation performed: z = a*x
+ b*y.
void axpby(int n, double* z, double alpha, const double* x,
double beta, const double* y) {
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
The dot product has only three input parameters: the length of the vectors, and two
const pointers for the vectors:
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
}
The sparse matrix vector multiplication (SPMV) function is slightly more complex,
featuring a nested reduction. In addition to the aforementioned three arrays rep-
resenting the matrix, the SPMV function also takes the number of rows, the output
vector y, and the right-hand side vector x as input.
For the vector addition, the simple parallel loop construct is employed. Because the
data were already transferred, the present clause is used to declare the arrays
available.
void axpby(int n, double* z, double alpha, const double* x, double beta,
const double* y) {
#pragma acc parallel loop present(x[0:n],y[0:n])
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
The dot product is similar to the AXPBY function, with the addition of a reduction
parameter.
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
#pragma acc parallel loop reduction(+: sum) \
present(x[0:n],y[0:n])
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
}
For the SPMV operation, a hierarchical operation is implemented wherein the outer
loop is parallelized with gangs and workers, and the inner loop is parallelized with
vector parallelism. To get good performance on GPUs, the number of workers
and the vector length on the parallel region are explicitly defined. The vector_
length of 8 is chosen to optimize performance and is mainly dependent on the
number of nonzeros of a typical row of the matrix. In MiniFE that number is 27.
void spmv(int nrows, int nnz, const int* A_row_offsets,
const int* A_cols,const double* A_vals, double* y,
const double* x) {
#pragma acc parallel num_workers(64) vector_length(8) \
present(y[0:nrows], x[0:nrows], A_row_offsets[0:nrows+1], \
A_cols[0:nnz], A_vals[0:nnz])
{
#pragma acc loop gang worker
for(int row=0; row<nrows; ++row) {
const int row_start=A_row_offsets[row];
const int row_end=A_row_offsets[row+1];
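      // (Sketch of the remainder: the inner vector loop with a reduction is
      //  assumed to follow the non-tightly nested pattern of Table 8.6.)
      double y_row = 0.0;
      #pragma acc loop vector reduction(+: y_row)
      for(int i=row_start; i<row_end; i++)
        y_row += A_vals[i]*x[A_cols[i]];
      y[row] = y_row;
    }
  }
}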
The AXPBY function is a simple parallel for usage. Here, the data clauses are
added again, because they appear inside a function. The runtime should make sure
that the data are not copied if the AXPBY function is called within a target data
scope where the data is already present on the device.
void axpby(int n, double* z, double alpha, const double* x, double beta,
const double* y) {
#pragma omp target teams distribute parallel for map(from: z[0:n]) \
map(to: x[0:n], y[0:n])
for(int i=0; i<n; i++)
z[i] = alpha*x[i] + beta*y[i];
}
The dot product function using OpenMP 4.5 directives also requires an explicit
tofrom mapping of the reduction variable sum:
double dot(int n, const double* x, const double* y) {
double sum = 0.0;
#pragma omp target teams distribute parallel for \
reduction(+: sum) map(tofrom: sum) \
map(to: x[0:n], y[0:n])
for(int i=0; i<n; i++)
sum += x[i]*y[i];
return sum;
}
As in the OpenACC variant, vector parallelism is used for the inner reduction in the
SPMV algorithm, whereas the loop over the number of rows is spread over threads.
It is worthwhile, though, to observe that removing the vector reduction clause on
the inner loop improves performance on KNL.
void SPMV(int nrows, int nnz, const int* A_row_offsets,
const int* A_cols, const double* A_vals, double* y,
const double* x) {
#pragma omp target teams distribute parallel for \
map(from: y[0:nrows]) \
map(to: x[0:nrows], A_row_offsets[0:nrows+1], \
A_cols[0:nnz], A_vals[0:nnz])
for(int row=0; row<nrows; ++row) {
const int row_start=A_row_offsets[row];
const int row_end=A_row_offsets[row+1];
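    // (Sketch of the remainder: an inner SIMD loop with a reduction, as
    //  described in the surrounding text; details are assumed.)
    double sum = 0.0;
    #pragma omp simd reduction(+: sum)
    for(int i=row_start; i<row_end; i++)
      sum += A_vals[i]*x[A_cols[i]];
    y[row] = sum;
  }
}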
double* r; cudaMalloc(&r,nrows*sizeof(double));
double* p; cudaMalloc(&p,nrows*sizeof(double));
double* Ap; cudaMalloc(&Ap,nrows*sizeof(double));
double* x; cudaMalloc(&x,nrows*sizeof(double));
double* b; cudaMalloc(&b,nrows*sizeof(double));
double* A_vals; cudaMalloc(&A_vals,A_size*sizeof(double));
int* A_cols; cudaMalloc(&A_cols,A_size*sizeof(int));
int* A_rows; cudaMalloc(&A_rows,(nrows+1)*sizeof(int));
// Copy to Device
cudaMemcpy(x,h_x,nrows*sizeof(double),cudaMemcpyDefault);
cudaMemcpy(b,h_b,nrows*sizeof(double),cudaMemcpyDefault);
cudaMemcpy(A_vals,h_A_vals,A_size*sizeof(double),cudaMemcpyDefault);
cudaMemcpy(A_cols,h_A_cols,A_size*sizeof(int),cudaMemcpyDefault);
cudaMemcpy(A_rows,h_A_rows,(nrows+1)*sizeof(int),cudaMemcpyDefault);
The dot product is significantly more complex. Because CUDA does not provide a
native reduction capability, a two-step algorithm is implemented. First, within each
block so-called shared memory is used to implement a tree reduction. Shared
memory is local scratch memory that is private to the block of threads. Each block
then writes out a single value to a buffer. The buffer is allocated in memory that is
accessible from both device and host. Although the access is relatively slow from
the GPU, its sparse use does not constitute significant overhead. In this case the
synchronization is necessary, to ensure that the kernel has written all its results
back to the buffer. Then the final reduction is performed serially on the host.
Note that the reduction method used here is intentionally simplistic and may not
perform well. More complex reduction schemes using three or even four levels are
deployed in practice (such as in the Kokkos model). In those schemes a hierarchical
reduction is used that is aware of the fact that blocks consist of subgroups called
warps, and the final reduction is performed by the last block to write back its
partial result. Another noteworthy particularity of this algorithm is the artificial
limitation of the number of blocks to 4,096. This allows for a fixed-length buf-
fer, as well as limits the amount of data that needs to be transferred back to the
host. Its drawback is that it limits the flexibility of the GPU’s hardware scheduler,
because fewer independent blocks of work exist. This scheme is often referred to
as block recycling.
__global__ void dot_kernel(int n, const double* x,
                           const double* y, double* buffer) {
  __shared__ double shared_sum[256]; // assumes blockDim.x == 256
  int i_start = blockIdx.x * blockDim.x + threadIdx.x;
  double sum = 0;
  for(int i=i_start; i<n; i+=gridDim.x*blockDim.x)
    sum += x[i]*y[i];
  shared_sum[threadIdx.x] = sum;     // tree reduction within the block
  __syncthreads();
  for(int delta=blockDim.x/2; delta>0; delta/=2) {
    if(threadIdx.x < delta)
      shared_sum[threadIdx.x] += shared_sum[threadIdx.x+delta];
    __syncthreads();
  }
  if(threadIdx.x == 0) buffer[blockIdx.x] = shared_sum[0];
}
The SPMV function is comparatively simpler than the dot product because no global
reduction is necessary. As with the AXPBY function, the outer loop is split over
both the number of blocks and the threads within the block; in this case the second
(y) dimension of the thread block is used as part of that index range. In contrast
to the AXPBY function, the block is two-dimensional, and its first (x) dimension is
used for parallelism in the inner loop. The reason to use the x dimension for the
inner loop is that threads with consecutive x indexes are part of the same warp.
This means that they are synchronized and act like vector lanes on a CPU, which
allows the use of shuffle operations to implement the nested reduction. These
shuffle operations allow for direct register exchange between threads within the
same warp. Therefore, the number of threads in the x dimension must be a power of
2 smaller than or equal to 32. In this case 8 is chosen because it is empirically
shown to be the best value for this operation, given the matrix structure of MiniFE
with its typical 27 entries per row.
__global__ void spmv_kernel(int nrows, const int* A_row_offsets,
const int* A_cols,
const double* A_vals,
double* y, const double* x) {
int row = blockIdx.x * blockDim.y + threadIdx.y;
if(row>=nrows) return;
int row_start=A_row_offsets[row];
int row_end=A_row_offsets[row+1];
double sum = 0.0;
for(int i=row_start + threadIdx.x; i<row_end; i+=blockDim.x)
sum += A_vals[i]*x[A_cols[i]];
// Reduce over blockDim.x
int delta = 1;
while(delta < blockDim.x) {
sum += __shfl_down(sum,delta,blockDim.x);
delta*=2;
}
if(threadIdx.x == 0)
y[row] = sum;
}
View<double*> r("r",nrows);
View<double*> p("p",nrows);
View<double*> Ap("Ap",nrows);
deep_copy(x,h_x);
deep_copy(b,h_b);
deep_copy(A_vals,h_A_vals);
deep_copy(A_cols,h_A_cols);
deep_copy(A_rows,h_A_rows);
The vector addition is again a simple parallel loop. The KOKKOS_LAMBDA macro
adds the necessary function qualifiers when you are compiling with back ends
such as CUDA, something that requires you to mark functions that need to be com-
piled for the GPU. The string is an optional, not necessarily unique, identifier used
in Kokkos-aware profiling and debugging tools. The fence is added for timing
purposes, because the parallel_for call is potentially asynchronous.
void axpby(int n, View<double*> z, double alpha,
View<const double*> x, double beta,
View<const double*> y) {
parallel_for("AXpBY", n, KOKKOS_LAMBDA ( const int& i) {
z(i) = alpha*x(i) + beta*y(i);
});
fence();
}
For the SPMV operation two implementations are provided. The first uses simple
flat parallelism on the outer loop, a technique that represents a straightforward
port from the serial implementation and is enough to get good performance on
CPUs.
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
parallel_for("SPMV:Flat",nrows,
KOKKOS_LAMBDA (const int& row) {
const int row_start=A_row_offsets[row];
const int row_end=A_row_offsets[row+1];
double y_row = 0.0;
for(int i=row_start; i<row_end; i++)
y_row += A_vals(i)*x(A_cols(i));
y(row) = y_row;
} );
fence();
}
Although this implementation is OK for CPUs, it is not good for GPUs because (a)
it doesn’t expose all available parallelism, and (b) it has uncoalesced data access
for most arrays. To remedy this, a more flexible kernel is provided using hierarchi-
cal parallelism. But that exposes another issue: The inner loop is too small for an
actual team of threads. Hence the approach taken subdivides rows into row-blocks,
with each team handling one row-block. Kokkos’s third level of hierarchical paral-
lelism is then used to perform the reduction for each row. By choosing appropriate
row-block and team-size parameters, we now allow this kernel to perform well
across all architectures.
void SPMV(int nrows, View<const int*> A_row_offsets,
View<const int*> A_cols, View<const double*> A_vals,
View<double*> y,
View<const double*, MemoryTraits< RandomAccess>> x) {
#ifdef KOKKOS_ENABLE_CUDA
int rows_per_team = 64;
int team_size = 64;
#else
int rows_per_team = 512;
int team_size = 1;
#endif
parallel_for("SPMV:Hierarchy",
TeamPolicy< Schedule< Static > >
((nrows+rows_per_team-1)/rows_per_team,team_size,8),
KOKKOS_LAMBDA (const TeamPolicy<>::member_type& team) {
const int first_row = team.league_rank()*rows_per_team;
const int last_row = first_row+rows_per_team<nrows?
first_row+rows_per_team:nrows;
parallel_for(TeamThreadRange(team,first_row,last_row),
[&] (const int row) {
const int row_start=A_row_offsets[row];
const int row_length=A_row_offsets[row+1]-row_start;
double y_row;
parallel_reduce(ThreadVectorRange(team,row_length),
[=] (const int i,double& sum) {
sum += A_vals(i+row_start)*
x(A_cols(i+row_start));
} , y_row);
y(row) = y_row;
});
});
fence();
}
The dot product is somewhat more involved. TBB does not provide its simple range
interface for parallel_reduce; instead, you use the blocked_range policy.
Furthermore, when compared to Kokkos, TBB does not assume a sum reduction as
the default, and thus the reduction operation must be provided explicitly.
double dot(int n, const double* x, const double* y) {
return tbb::parallel_reduce(tbb::blocked_range<int>(0,n),
0.0,[&] (const tbb::blocked_range<int>& r,
double lsum)->double {
for(int i = r.begin(); i<r.end(); i++)
lsum += x[i]*y[i];
return lsum;
}, std::plus<double>()
);
}
For the SPMV again, two implementations are provided. The flat parallelism one
can use the simple parallel_for for the outer loop:
void SPMV(int nrows, const int* A_row_offsets, const int* A_cols,
const double* A_vals, double* y, const double* x) {
tbb::parallel_for(0,nrows,[&] (const int& row) {
double sum = 0.0;
int row_start=A_row_offsets[row];
int row_end=A_row_offsets[row+1];
for(int i=row_start; i<row_end; ++i) {
sum += A_vals[i]*x[A_cols[i]];
}
y[row] = sum;
});
}
y[row] =
tbb::parallel_reduce(tbb::blocked_range<int>
(row_start,row_end), 0.0,
[&] (const tbb::blocked_range<int>& r,
double lsum)->double {
for(int i = r.begin(); i<r.end(); i++)
lsum += A_vals[i]*x[A_cols[i]];
return lsum;
}, std::plus<double>()
);
});
}
For the Power8 benchmark, the test runs had to be limited to a single socket,
because the OpenMP and OpenACC code does not take care of proper first touch of
the data. For Kokkos that is not an issue, because all the data are explicitly copied
into Kokkos Views, which are first-touch initialized in parallel at construction time.
To provide a fairer comparison of the runs, however, all models were restricted
to a single socket. The best performance was generally achieved with two or four
hyperthreads per core. Note that all code variants were compiled with the PGI and
GNU compilers, respectively, to provide a fairer programming model comparison.
The IBM XL compiler was not able to generate working OpenMP 4.5 code for the
Power8 architecture. For the Serial and the Kokkos versions, it did work, though,
and provided significant performance improvements (not shown in the plots).
Two semantic issues of the programming model hamper TBB on Intel KNL. The
first significant issue is that TBB by default does not provide any thread-binding
mechanisms. On an architecture such as Intel’s KNL chip, that leads to severe
performance degradation. The other issue is that TBB’s internal implementation
is task based, with the corresponding overhead. That makes nested parallelism
significantly more expensive than, for example, in the Kokkos model. Thus, it is not
suited for something like the inner loop of the SPMV function, which has only a few
operations per iteration. The data shown here come from runs in which threads
were properly pinned using additional code provided in the online repository (see
the Preface).
Note
For the KNL data, the OpenACC code was compiled with PGI 17.3, with Intel Haswell
CPUs as the target architecture. At the time of this writing, the KNL architecture is not
yet explicitly supported by PGI. As with the OpenMP back end on GPUs, performance
is expected to improve with future compiler versions.
As mentioned previously, the OpenMP SPMV operation could have been improved
by removing the SIMD reduction on the inner loop. This was not done for this
comparison.
The data shown in Figures 8.1, 8.2, and 8.3 were collected using Intel 17.1, PGI 17.3,
IBM XL 13.1.5, GCC 6.3, and CUDA 8.0.44. For detailed run settings,
consult the Results folder in the online code repository (see the Preface).
[Figure 8.1: Performance [Gflop/s] for AXPBY, DOT, and SPMV (problem sizes 100 and 200), comparing OpenACC, CUDA, Kokkos, and OpenMP.]
[Figure 8.2: Performance [Gflop/s] for AXPBY, DOT, and SPMV (problem sizes 100 and 200), comparing Serial, OpenACC, Kokkos, and OpenMP.]
[Figure 8.3: Performance [Gflop/s] for AXPBY, DOT, and SPMV (problem sizes 100 and 200), comparing OpenACC, Kokkos, OpenMP, TBB (Flat), and TBB (Hierarchical).]
8.4 Summary
Parallel programming concepts carry over to most programming models. Although
syntax can be different, it should generally be recognizable to someone familiar
with one or two of those models, and translating from one model to another should
be straightforward. But as with many other things, the details matter. Design points
for these models differ, and thus a given use case might fit better or worse with any of these
models. The main considerations for choosing a model are the maturity of the
available software stack, the level of support by different vendors, and the likeli-
hood of continued support in the future.
8.5 Exercises
1. Which of the following programming models do not support CPUs, manycore
architectures, and GPUs?
a. OpenACC
b. OpenMP
c. CUDA
d. OpenCL
e. Kokkos
f. RAJA
g. TBB
a. OpenACC
b. OpenMP
c. CUDA
d. OpenCL
e. Kokkos
f. RAJA
g. TBB
a. Matrix-Vector multiplication
b. Matrix-Matrix addition
c. Matrix-Matrix multiplication
One of the major strengths of OpenACC is that it is interoperable with native or low-
level APIs to program the accelerator. This enables two important use cases:
• Calling native device code from OpenACC. This allows you to easily call into
high-performance libraries, such as BLAS or FFT, or custom kernels using the
native device API from an OpenACC program without introducing any unneces-
sary data staging.
• Calling OpenACC code from native device code. This also avoids unnecessary
data staging.
This chapter explains how OpenACC interoperability works for both cases, covering
the following:
• How OpenACC can work with memory managed by a native device API
This chapter primarily uses the NVIDIA CUDA platform targeting C in examples, but
the presented concepts are easily transferable to other accelerator targets and
languages supported by OpenACC.
The combination of these two filters is an edge detector. Example input and output
images are shown in Figures 9.1 and 9.2.
1. C. Van Loan, Computational Frameworks for the Fast Fourier Transform (Philadelphia: Society
for Industrial and Applied Mathematics, 1992).
allowing you to reuse one filter for all image chunks. Working with small chunks
also lowers the cost of preparing the filters in frequency space, because the filter
size need only match the size of a single chunk and not of the whole image.
The processing flow of the example application is visualized in Figure 9.3. For the
preprocessing, first the FFT plans for the real-to-complex (R2C) and complex-to-
real (C2R) DFTs are created. These plans are used for transformations from real
to frequency space and frequency space to real space, respectively. Independent
of planning the DFTs, the filters need to be initialized in real space. After that, the
filters are transformed and combined in frequency space with a pointwise complex
multiplication.
You start processing the image by transforming the input image from real to
frequency space using a DFT R2C transform. The combined filter is then applied
to the image in frequency space by a pointwise complex multiplication. Then the
filtered image is transformed back from frequency space to real space with a DFT
C2R transform, and finally the output needs to be normalized. The normalization is
necessary because cuFFT computes unnormalized DFTs.
In Figure 9.3, the steps that are executed by a library are shown in white boxes, and
the steps that are executed by kernels written in OpenACC are in shaded boxes.
The data dependencies are depicted with arrows and highlight the fact that the
input of each library call is produced by an OpenACC kernel and the output of each
library call is consumed by an OpenACC kernel. Thus, for efficient execution the
FFT library and OpenACC should share data. In the provided example code, data
are managed by OpenACC and the called FFT library operates on the data. How that
works is explained next.
OpenACC supports systems with accelerators that have distinct memory from the host as
well as systems with accelerators that share memory with the host. In the former case,
called a non-shared memory device, the system has separate host memory and device
memory. In the latter case, called a shared memory device because the accelerator shares
memory with the host thread, the system has one shared memory.
The concept of separate accelerator and host memory is discussed earlier in this
book. Because native or low-level accelerator models usually operate on acceler-
ator memory, it is important to understand this concept and know how to obtain
references to the different memories when you use nonshared memory devices.
OpenACC programs operate by default on the host version of all variables. Some direc-
tives are used to manage device instances of some host variables; other directives are
also used to mark the code regions that will be executed on the device using the device
instances of the host variables referenced in those code regions.
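A minimal sketch of the kind of code being described (array type and size are
assumptions; the line numbers match the discussion that follows):

1 : float foo[128] = {0};
2 : #pragma acc data create(foo)
3 : {
4 :   foo[0] = 1.0f;               // host code: the host instance of foo
5 :   #pragma acc parallel loop
6 :   for (int i = 0; i < 128; i++) {
7 :     foo[i] = 2.0f * i;         // device code: the device instance of foo
8 :   }
9 : }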
The host variable foo is referenced three times in this example. Its initialization at
line 1 is done outside all OpenACC context and so only concerns the host. The data
directive in line 2 ensures that an instance of foo exists on the device for the dura-
tion of the block of code ranging from line 3 to line 9. However, the data directive
is only about data management, and the associated code remains executed by the
host; this means that foo in line 4 still refers to the host variable. Finally, the block
of code from lines 6 to 8 is subject to the parallel directive in line 5, and this
implies that this code is executed on the device. Consequently, line 7 refers to the
device version of foo.
When you want to call into a library that is written using a low-level or native pro-
gramming model and you expect references to accelerator memory, it is necessary
to pass a reference to accelerator memory to the library instead of the reference
to host memory. Using a parallel or kernels construct to perform the address
translation is not possible, because the library call must be executed on the host.
To solve that problem, OpenACC provides host_data, a construct whose body
is executed on the host using the device addresses of all variables specified in its
use_device clauses.
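For instance, a minimal sketch of this construct wrapping a cuFFT call (the plan
handle name is an assumption) looks like this:

// pic_real and pic_freq are present on the device via an enclosing data region
#pragma acc host_data use_device(pic_real, pic_freq)
{
  // The host executes this call, but the pointers passed to cuFFT
  // refer to the device copies of the arrays.
  cufftExecR2C(plan_r2c, (cufftReal*)pic_real, (cufftComplex*)pic_freq);
}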
An example of this is given in Listing 9.2, where the pointers to device memory of
pic_real and pic_freq are passed into the NVIDIA CUDA FFT Library cuFFT.
The host_data construct with the use_device clause for pic_real and
pic_freq in lines 8 and 9 tells the compiler to use pointers to the accelerator
memory in the structured block, spanning lines 10–14. These pointers are then
passed into the cuFFT call, spanning lines 11–13. On a shared-memory acceler-
ator, the references within the host_data construct may be the same as outside
the host_data construct, because the host and the device instances of a vari-
able might be identical. However, there is no guarantee of that, because even on a
shared-memory accelerator an OpenACC implementation might choose to create
separate copies for performance reasons.
Alignment
The example program uses the C11 float complex type in OpenACC kernels
to allow the use of the provided arithmetic operators for that type. In line 9 of
Listing 9.2, it is therefore required to cast the pointer to pic_freq from C11
float complex to cufftComplex, both of which have the same size and data
layout. In the scope of the shown code snippet this cast is safe, because sufficient
alignment is provided by the OpenACC allocator for device memory. However,
C11 float complex might not have the same alignment as cufftComplex.
For example, with the PGI compiler on x86, the alignment requirement for float
complex is 4 bytes, whereas the alignment requirement for cufftComplex is
8 bytes, because it is a typedef of the CUDA built-in type float2 (see section
"B.3. Built-in Vector Types" of the CUDA C Programming Guide).
Low-level or native APIs typically allocate data with coarse alignment, meaning
that such an alignment is typically not a problem. However, it can become an issue
with a custom suballocator that does not guarantee sufficient alignment, or with
user-defined types such as this:
struct foo {
float weight;
float complex data[512];
};
It is therefore recommended that you check for sufficient alignment to avoid this
hard-to-debug issue:
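A simple check, written here as a sketch (the helper name is an invention; it would
be called as check_cufft_alignment(pic_freq) before the cast):

#include <assert.h>
#include <stdint.h>
#include <cufft.h>

/* Verify that the buffer is aligned strictly enough to be reinterpreted
   as cufftComplex before performing the cast. */
static void check_cufft_alignment(const void *p)
{
    assert(((uintptr_t)p % _Alignof(cufftComplex)) == 0);
}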
The OpenACC runtime library routine acc_deviceptr offers the same func-
tionality as the host_data directive with the use_device clause. It returns the
device address for a given host address. Similarly, acc_hostptr returns the host
address for a given device address.
Calls to the NVIDIA FFT library cuFFT are asynchronous with respect to the host—
that is, they return immediately before the work has finished. To manage multiple
independent streams of work, the CUDA platform offers the concept of streams.
The cufftSetStream function provided by cuFFT needs to be called to control
which CUDA stream will be used by the library. As explained in more detail in Chap-
ter 10, OpenACC offers a similar concept with the async clause and work queues.
In both concepts, work submitted to the same stream or work queue is guaranteed
to be processed in order, and work submitted to different streams or work queues
can be executed concurrently.
For the provided example code, this has the consequence that an OpenACC kernel
launched after a DFT transform does not wait for that transform to finish and thus
very likely operates on stale data.
To add the missing synchronization between the OpenACC runtime and libraries
written in a native API, either you can map a native API stream or work queue to
an OpenACC work queue identifier, or you can query the OpenACC runtime for the
native API stream or work queue it uses internally for a given identifier.
In the case of the CUDA platform the OpenACC API calls are as follows.
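A sketch of the second option, querying the CUDA stream that backs an OpenACC work
queue and handing it to cuFFT (the queue number and plan variable are chosen for
this example):

#include <openacc.h>
#include <cuda_runtime.h>
#include <cufft.h>

void attach_fft_to_queue(cufftHandle plan_r2c)
{
    // Ask the OpenACC runtime for the CUDA stream behind async queue 1
    cudaStream_t stream = (cudaStream_t)acc_get_cuda_stream(1);
    // cuFFT now submits its work to that stream, so OpenACC work
    // launched with async(1) is ordered with respect to the transforms
    cufftSetStream(plan_r2c, stream);
}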
There are similar API routines for other target platforms, such as OpenCL or Intel
Coprocessor Offload Infrastructure.3
accelerator APIs and used by OpenACC. Similar to the host_data directive with
the use_device clause, OpenACC has the deviceptr clause. An example of this
is shown in Listing 9.4.
The deviceptr clause is allowed on structured data and compute constructs and
declare directives. The deviceptr clause tells the compiler that no data management
needs to be done for the listed pointers—in other words, they can be used directly
in a kernel generated by the OpenACC compiler.
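A minimal sketch of this usage (assuming x_d and y_d were obtained from cudaMalloc
or acc_malloc):

void scale_on_device(int n, double a, double* x_d, double* y_d)
{
    // x_d and y_d already point to device memory; the deviceptr clause
    // tells the compiler not to perform any data management for them.
    #pragma acc parallel loop deviceptr(x_d, y_d)
    for (int i = 0; i < n; i++)
        y_d[i] += a * x_d[i];
}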
9.3.1 acc_map_data
In the preceding section, the deviceptr clause was introduced as a way to
use data managed by native accelerator APIs in OpenACC. Using the OpenACC API
calls acc_map_data and acc_unmap_data, you can achieve the same thing by
making the OpenACC runtime aware of accelerator data that were allocated with
the native accelerator API (or with acc_malloc). This is done by associating host
and accelerator data allocations, and it allows the following:
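A minimal sketch of the pattern described next (array length n and double elements
are assumptions; the line numbers match the discussion):

1 : double *x = (double*) malloc(n*sizeof(double));
2 : double *y = (double*) malloc(n*sizeof(double));
3 :
4 : double *x_d = (double*) acc_malloc(n*sizeof(double));
5 : double *y_d = (double*) acc_malloc(n*sizeof(double));
6 :
7 : acc_map_data(x, x_d, n*sizeof(double));
8 : acc_map_data(y, y_d, n*sizeof(double));
9 :
10: #pragma acc kernels present(x[0:n], y[0:n])
11: for (int i = 0; i < n; i++)
12: {
13:   y[i] += x[i];
14: }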
In Listing 9.5, the application allocates host and accelerator data using malloc
and acc_malloc, respectively, in lines 1–5. To use the host and accelerator data
allocated that way in OpenACC directives, it is necessary to map the accelerator
data (x_d and y_d) to the corresponding host data (x and y), something that is
done with the calls to acc_map_data in lines 7 and 8. Listing 9.5 is equivalent to
the code shown in Listing 9.6. This demonstrates that a call to acc_map_data is
essentially equivalent to an enter data directive combined with the create
clause, with the difference that no accelerator memory is allocated; instead, the
pointer passed in as the second argument is used. So, after the calls to
acc_map_data, x and y can be used in OpenACC directives as if they were part of an
enclosing data region. For example, in line 10 of Listing 9.5 the present clause
can be used, and the body of the loop spanning lines 11–14 operates on the memory
allocated in lines 4 and 5.
Listing 9.6 acc_map_data clause example, equivalent code to Listing 9.5 (continued)
4 : #pragma acc enter data create(x[0:n],y[0:n])
5 :
6 : #pragma acc kernels present(x[0:n],y[0:n])
7 : for (int i = 0; i < n; i++)
8 : {
9 : y[i] += x[i];
10: }
11:
12: #pragma acc exit data delete(x,y)
First, the CUDA C device function needs to be prepared. As shown in Listing 9.7, you
need the extern "C" linkage specifier in order to avoid C++ name mangling and
make the later mapping and linking steps work.
The CUDA C device function needs to be compiled using the CUDA C/C++ compiler
nvcc to an object using the -relocatable-device-code true command-line
option. In this way, code is generated that later can be used by the device linker.
On the OpenACC side, it is necessary to put a routine directive in front of the func-
tion declaration as shown in Listing 9.8. The example tells the OpenACC compiler that
at link time there will be a device-callable sequential routine with the name saxpy_
dev. It is important that only the function declaration, and not the definition, appear in
OpenACC code; otherwise there would be multiple definitions of the device function.
This routine can then be used in OpenACC compute constructs, as shown in Listing
9.9.
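A sketch of the pieces involved (the saxpy_dev signature here is an assumption):

// CUDA C side, compiled with
//   nvcc -relocatable-device-code true -c saxpy_dev.cu
extern "C" __device__ void saxpy_dev(float a, float* x, float* y)
{
    *y += a * (*x);
}

// OpenACC side: declaration only, marked as a device-callable
// sequential routine, then used inside a compute construct
#pragma acc routine seq
extern "C" void saxpy_dev(float a, float* x, float* y);

void saxpy(int n, float a, float* x, float* y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        saxpy_dev(a, &x[i], &y[i]);
}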
The linking needs to be done with the PGI compiler using the rdc suboption to
-ta=tesla and presenting the object file generated with nvcc containing the
device code of saxpy_dev and the object file with the OpenACC code.
A fully working example of this is given in Chapter 10, and the example is available
at https://github.com/jefflarkin/openacc-interoperability in openacc_cuda_
device.cpp.
9.4 Summary
A powerful feature of OpenACC is its ability to be interoperable with native or low-
level accelerator programming APIs. This capability is particularly important because
it not only allows the use of highly tuned standard libraries (such as BLAS and FFT)
from OpenACC code but also allows exploitation of libraries (such as MPI) that are
aware of accelerators—for example, CUDA-aware MPI. In addition, interoperability
with native accelerator programming APIs allows you to use OpenACC for produc-
tivity and performance portability without compromising performance. In cases
where OpenACC cannot match the performance of native accelerator programming
APIs, it is possible to use the native programming APIs.
9.5 Exercises
1. What happens if you remove the acc host_data use_device directives in
the accompanying example code?
2. Change the code to call a host FFT library instead of cuFFT. What changes are
necessary, and how do the runtime and the timeline tools (such as the NVIDIA
Visual Profiler or pgprof) show the change?
With the basics of OpenACC programming well in hand, this chapter discusses two
advanced OpenACC features for maximizing application performance. The first fea-
ture is asynchronous operations, which allow multiple things to happen at the same
time to better utilize the available system resources, such as a GPU, a CPU, and the
PCIe connection in between. The second feature is support for multiple accelera-
tor devices. The chapter discusses two ways that an application can utilize two or
more accelerator devices to increase performance: one using purely OpenACC, and
the other combining OpenACC with the Message Passing Interface (MPI).
10.1 Asynchronous Operations
Programming is often taught by developing a series of steps that, when completed,
achieve a specific result. If I add E = A + B and then F = C + D, then I can add G = E +
F and get the sum of A, B, C, and D, as shown in Figure 10.1.
Our brains often like to think in an ordered list of steps, and that influences the way
we write our programs, but in fact we’re used to carrying out multiple tasks at the
same time. Take, for instance, cooking a spaghetti dinner. It would be silly to cook
the pasta, and then the sauce, and finally the loaf of bread. Instead, you will prob-
ably let the sauce simmer while you bake the bread and, when the bread is almost
finished, boil the pasta so that all the parts of the meal will be ready when they are
needed. By thinking about when you need each part of the meal and cooking the
parts so that they all complete only when they are needed, you can cook the meal
in less time than if you’d prepared each part in a separate complete step. You also
would better utilize the kitchen by taking advantage of the fact that you can have
two pots and the oven working simultaneously.
Let us look at the spaghetti dinner example a little more closely. You can treat cook-
ing the pasta, preparing the sauce, and baking the bread as three separate opera-
tions in cooking the dinner. Cooking the pasta takes roughly 30 minutes, because it
involves bringing the water to a boil, actually cooking the pasta, and then draining
it when it is done. Making the sauce takes quite a bit longer, roughly two hours
from putting the sauce in a pot, adding seasoning, and then simmering so that the
seasoning takes effect. Baking the bread is another 30-minute operation, involving
heating the oven, baking the bread, and then allowing the bread to cool. Finally,
when these operations are complete, you can plate and eat your dinner.
If I did each of these operations in order (in any order), then the entire process
would take three hours, and at least some part of my meal would be cold. To under-
stand the dependencies between the steps, let’s look at the process in reverse.
1. To plate the meal, the pasta, sauce, and bread must all be complete.
2. For the pasta to be complete, it must have been drained, which requires that it
has been cooked, which requires that the water has been brought to a boil.
3. For the sauce to be ready, it must have simmered, which requires that the sea-
soning has been added, and the sauce has been put into a pot.
4. For the bread to be done, it must have cooled, before which it must have been
baked, which requires that the oven has been heated.
From this, you can see that the three parts of the meal do not need to be prepared
in any particular order, but each has its own steps that do have a required order.
To put it in parallel programming terms, each step within the larger tasks has a
dependency on the preceding step, but the three major tasks are independent of
each other. Figure 10.2 illustrates this point. The circles indicate the steps in cook-
ing the dinner, and the rounded rectangles indicate the steps that have dependen-
cies. Notice that when each task reaches the plating step, it becomes dependent on
the completion of the other tasks. Therefore, where the rounded rectangles overlap
is where the cook needs to synchronize the tasks, because the results of all three
tasks are required before you can move on; but where the boxes do not cross, the
tasks can operate asynchronously from each other.
This is rarely the most efficient way to run the code, however, because system
resources go unused while the host is waiting for the accelerator to complete. For
instance, on a system having one CPU and one GPU, which are connected via PCIe,
then at any given time either the CPU is computing, the GPU is computing, or the
PCIe bus is copying data, but no two of those resources are ever used concurrently.
If we were to expose the dependencies in our code and then launch our operations
asynchronously from the host thread, then the host thread would be free to send
more data and instructions to the device or even participate in the calculation.
OpenACC uses the async clause and wait directive to achieve exactly this.
the programmer must wait (or synchronize) on the work queue (or queues). Refer-
ring to Figure 10.2, the rounded rectangles could each represent a separate work
queue, and the steps should be put into that queue in the order in which they must
be completed.
Perhaps rather than insert a whole new directive into the code to force this syn-
chronization, you would rather force an already existing directive to wait on a
particular queue. This can be achieved with the wait clause, which is available on
many directives. Listing 10.3 demonstrates this in both C and C++.
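A minimal C sketch of the pattern (data clauses and array setup are assumptions;
the queue usage matches the description that follows):

void vecsum_async(int n, float *restrict A, float *restrict B,
                  float *restrict C)
{
  #pragma acc data create(A[0:n], B[0:n]) copyout(C[0:n])
  {
    #pragma acc parallel loop async(1)        // initialize A in queue 1
    for (int i = 0; i < n; i++) A[i] = i;
    #pragma acc parallel loop async(2)        // initialize B in queue 2
    for (int i = 0; i < n; i++) B[i] = 2 * i;
    // Queue 1 waits for queue 2; the host thread does not block here
    #pragma acc wait(2) async(1)
    #pragma acc parallel loop async(1)        // summation in queue 1
    for (int i = 0; i < n; i++) C[i] = A[i] + B[i];
    #pragma acc wait                          // host waits for all queues
  }
}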
The code performs a simple vector summation but has separate loops for initial-
izing the A array, initializing the B array, and finally summing them both into the
C array. Note that this is a highly inefficient way to perform a vector sum, but it is
written only to demonstrate the concept discussed in this section. The OpenACC
directives place the work of initializing A and B into queues 1 and 2, respectively.
The summation loop should not be run until both queues 1 and 2 have completed,
so an asynchronous wait directive is used to force queue 1 to wait until the work
in queue 2 has completed before moving forward.
On first reading, the asynchronous wait directive may feel backward, so let’s
break down what it is saying. First, the directive says to wait on queue 2, but then it
also says to place this directive asynchronously into work queue 1 so that the host
CPU is free to move on to enqueue the summation loop into queue 1 and then wait
for it to complete. Effectively, queues 1 and 2 are joined together at the asynchro-
nous wait directive and then queue 1 continues with the summation loop. By writ-
ing the code in this way you enable the OpenACC runtime to resolve the dependency
between queues 1 and 2 on the device, for devices that support this, rather than
block the host CPU. This results in better utilization of system resources, because
the host CPU no longer must idly wait for both queues to complete.
These examples work only on NVIDIA GPUs with CUDA, but OpenACC compilers
using OpenCL have a similar API for interoperating with OpenCL work queues.
We start by analyzing the current application performance using the PGI compiler,
an NVIDIA GPU, and the PGProf profiling tool. Figure 10.3 shows a screen shot of
PGProf, which shows the GPU timeline for the filter application. For reference, the
data presented is for a Tesla P100 accelerator and an 8,000×4,051 pixel image.
Notice that copying the image data to and from the device takes nearly as much
time as actually performing the filtering of the image. The program requires
8.8ms to copy the image to the device, 8.2ms copying the data from the device, and
20.5ms performing the filtering operation.
The most significant performance limiter in this application is the time spent copy-
ing data between the CPU and the GPU over PCIe. Unfortunately, you cannot remove
that time, because you do need to perform those copies for the program to work, so
you need to look for ways to overlap the data copies with the GPU work. If you could
completely overlap both directions of data transfers with the filtering kernel, then
the total run time would drop to only 20.5ms, which is a speedup of 1.8×, so this is
the maximum speedup you can obtain by optimizing to hide the data movement.
How can you overlap the data transfers, though, if you need to get the data to the
device before you start and cannot remove it until you have computed the results? The
trick is to recognize that you do not need all of the data on the device to proceed; you
need only enough to kick things off. If you could break the work into discrete parts, just
as you did with cooking dinner earlier, then you can operate on each smaller part inde-
pendently and potentially overlap steps that do not require the same system resource
at the same time. Image manipulation is a perfect case study for this sort of optimi-
zation, because each pixel of an image is independent of all the others, so the image
itself can be broken into smaller parts that can be independently filtered.
33 ch + 1];
34 red += filter[fy][fx] * (float)imgData[iy * step + ix *
35 ch + 2];
36 }
37 }
38 out[y * step + x * ch] = 255 - (scale * blue);
39 out[y * step + x * ch + 1 ] = 255 - (scale * green);
40 out[y * step + x * ch + 2 ] = 255 - (scale * red);
41 }
42 }
43 }
44 }
There are a few important things to note from this code example. First, we have
blocked only the computation, and not the data transfers. This is because we want
to ensure that we obtain correct results by making only small changes to the code
with each step. Second, notice that within each block we calculate the starting and
ending rows of the block and adjust the y loop accordingly (lines 11–13). Failing to
make this adjustment would produce correct results but would increase the run
time proportionally to the number of blocks. With this change in place, you will see
later in Figure 10.4 that the performance is slightly worse than the original. This
is because we have introduced a little more overhead to the operation to manage
the multiple blocks of work. However, this additional overhead will be more than
compensated for in the final code.
Figure 10.4 PGProf timeline for multidevice, software pipelined image filter
Our image filter has one subtlety that needs to be addressed during this step. The
filter operation for element [i][j] of the image requires the values from several
surrounding pixels, specifically two to the left and right and two above and below. This
means that for the filter to produce the same results, when you copy a block’s por-
tion of imgData to the accelerator you also need to copy the two rows before that
block and the two after, if they exist. These additional rows are frequently referred
to as a halo or ghost zone, because they surround the core part of the data. In
Listing 10.9, notice that, to accommodate the surrounding rows, the starting and
ending rows for the update directive are slightly different from the starting and
ending rows of the compute loops. Not all computations require this overlapping,
but because of the nature of the image filter, this application does. The full code for
this step is found in Listing 10.9.
Listing 10.9 Blocked OpenACC image-filtering code with separate update directives
1 void blur5_update(unsigned char *restrict imgData,
2                   unsigned char *restrict out, long w, long h,
3 long ch, long step)
4 {
5 long x, y;
6 const int nblocks = 8;
7
8 long blocksize = h/ nblocks;
9 #pragma acc data create(imgData[w*h*ch],out[w*h*ch]) \
10 copyin(filter)
11 {
12 for ( long blocky = 0; blocky < nblocks; blocky++)
13 {
14 // For data copies we include ghost zones for the filter
15 long starty = MAX(0,blocky * blocksize - filtersize/2);
16 long endy = MIN(h,starty + blocksize + filtersize/2);
17 #pragma acc update \
18 device(imgData[starty*step:(endy-starty)*step])
19 starty = blocky * blocksize;
20 endy = starty + blocksize;
21 #pragma acc parallel loop collapse(2) gang vector
Again, you need to run the code and check the results before moving to the next
step. You see in Figure 10.4 that the code is again slightly slower than the previous
version because of increased overhead, but that will soon go away.
Making It Asynchronous
The last step in implementing software pipelining using OpenACC is to make the
whole operation asynchronous. Again, the key requirement for making an opera-
tion work effectively asynchronously is to expose the dependencies or data flow
in the code so that independent operations can be run concurrently with sufficient
resources. You will use OpenACC’s work queues to expose the flow of data through
each block of the image. For each block, the code must update the device, apply
the filter, and then update the host, and it must be done in that order. This means
that for each block these three operations should be put in the same work queue.
These operations for different blocks may be placed in different queues, so the
block number is a convenient way to refer to the work queues. There is overhead
involved in creating new work queues, so it is frequently a best practice to reuse
some smaller number of queues throughout the computation.
Listing 10.10 shows the code from this step, which adds the async clause to both
update directives and the parallel loop, using the formula just discussed to
determine the work queue number. However, one critical and easy-to-miss step
remains before we can run the code: adding synchronization to the end of the com-
putation. If we did not synchronize before writing the image to a file, we’d almost
certainly produce an image that’s only partially filtered. The wait directive in line
52 of Listing 10.10 prevents the host thread from moving forward until after all
asynchronous work on the accelerator has completed. The final code for this step
appears in Listing 10.10.
Listing 10.10 Final, pipelined OpenACC filtering code with asynchronous directives
1 void blur5_pipelined(unsigned char *restrict imgData,
2 unsigned char *restrict out, long w, long h,
3 long ch, long step)
4 {
5 long x, y;
6 const int nblocks = 8;
7
8 long blocksize = h/ nblocks;
9 #pragma acc data create(imgData[w*h*ch],out[w*h*ch])
10 {
11 for ( long blocky = 0; blocky < nblocks; blocky++)
12 {
13 // For data copies we include ghost zones for the filter
14 long starty = MAX(0,blocky * blocksize - filtersize/2);
15 long endy = MIN(h,starty + blocksize + filtersize/2);
16 #pragma acc update \
17 device(imgData[starty*step:(endy-starty)*step]) \
18 async(blocky%3+1)
19 starty = blocky * blocksize;
20 endy = starty + blocksize;
21 #pragma acc parallel loop collapse(2) gang vector \
22 async(blocky%3+1)
23 for ( y = starty; y < endy; y++ )
24 {
What is the net result of these changes? First, notice in Table 10.1 that the run
time of the whole operation has now improved by quite a bit. Compared with the
original version, you see an improvement of 1.8×, which is roughly what we pre-
dicted earlier for fully overlapped code. Figure 10.5 is another timeline from PGProf, which shows that we have achieved exactly what we set out to do. Notice
that nearly all of the data movement is overlapped with computation on the GPU,
making it essentially free. The little bit of data transfer at the beginning and end of
the computation is offset by a small amount of overlap between subsequent filter
kernels as each kernel begins to run out of work and the next one begins to run.
You can experiment with the number of blocks and find the optimal value, which
may vary depending on the machine on which you are running. In Figure 10.5, we
are obviously operating on eight blocks.
10.2 Multidevice Programming
Many systems are now built with not only one, but several accelerator devices.
This section discusses ways to extend existing code to divide work across multiple
devices.
Given that each block in the pipelined code already moves and manages its data correctly, do those blocks need to be run on the same device? The answer, of course, is no, but how do you send blocks to different devices using OpenACC?
The OpenACC runtime library provides routines for querying and setting both the
device type (e.g., NVIDIA GPU, AMD GPU, Intel Xeon Phi, etc.) and the device number.
With these API routines you can direct both your data and compute kernels to dif-
ferent devices as needed. Let us first look at the routines for querying and setting
the device type.
Listing 10.11 Example usage of setting and getting OpenACC device type
1 #include <stdio.h>
2 #include <openacc.h>
3 int main()
4 {
5 int error = 0;
6 acc_device_t myDev = -1;
7
8 acc_set_device_type(acc_device_nvidia);
9 myDev = acc_get_device_type();
10 if ( myDev != acc_device_nvidia )
11 {
12 fprintf(stderr, "Wrong device type returned! %d != %d\n",
13 (int)myDev, (int)acc_device_nvidia);
14 error = -1;
15 }
16 return error;
17 }
Listing 10.12 Example usage of getting the number of devices, setting the current device, and
getting the current device number
1 #include <stdio.h>
2 #include <openacc.h>
3 int main()
4 {
5 int numDevices = -1, myNum = -1;
6 numDevices = acc_get_num_devices(acc_device_nvidia);
7 acc_set_device_num(0, acc_device_nvidia);
8 myNum = acc_get_device_num(acc_device_nvidia);
9
10 printf("Using device %d of %d.\n", myNum, numDevices);
11
12 return 0;
13 }
Let’s extend the pipelined example from the preceding section to send the chunks
of work to all available devices. Although I am running on an NVIDIA Tesla P100, I
don’t want to assume that other users will have a machine that matches my own,
so I will specify acc_device_default as my device type. This is a special
device type value that tells the compiler that I don’t care what specific device type
it chooses. On my testing machine, I could also have used acc_device_nvidia,
because that best describes the accelerators in my machine. I have chosen to
use OpenMP to assign a CPU thread to each of my accelerator devices. This is
not strictly necessary—a single thread could send work to all devices—but in my
opinion this simplifies the code because I can call acc_set_device_num once in each thread and not worry about it again.
12 {
13 int myid = omp_get_thread_num();
14 acc_set_device_num(myid,acc_device_default);
15 int queue = 1;
16 #pragma acc data create(imgData[w*h*ch],out[w*h*ch])
17 {
18 #pragma omp for schedule(static)
19 for ( long blocky = 0; blocky < nblocks; blocky++)
20 {
21 // For data copies we include ghost zones for the filter
22 long starty = MAX(0,blocky * blocksize - filtersize/2);
23 long endy = MIN(h,starty + blocksize + filtersize/2);
24 #pragma acc update device \
25 (imgData[starty*step:(endy-starty)*step]) \
26 async(queue)
27 starty = blocky * blocksize;
28 endy = starty + blocksize;
29 #pragma acc parallel loop collapse(2) gang vector async(queue)
30 for ( long y = starty; y < endy; y++ )
31 {
32 for ( long x = 0; x < w; x++ )
33 {
34 float blue = 0.0, green = 0.0, red = 0.0;
35 for ( int fy = 0; fy < filtersize; fy++ )
36 {
37 long iy = y - (filtersize/2) + fy;
38 for ( int fx = 0; fx < filtersize; fx++ )
39 {
40 long ix = x - (filtersize/2) + fx;
41 if ( (iy<0) || (ix<0) ||
42 (iy>=h) || (ix>=w) ) continue;
43 blue += filter[fy][fx] *
44 (float)imgData[iy * step + ix * ch];
45 green += filter[fy][fx] *
46 (float)imgData[iy * step + ix * ch + 1];
47 red += filter[fy][fx] *
48 (float)imgData[iy * step + ix * ch + 2];
49 }
50 }
51 out[y * step + x * ch] = 255 - (scale * blue);
52 out[y * step + x * ch + 1 ] = 255 - (scale * green);
53 out[y * step + x * ch + 2 ] = 255 - (scale * red);
54 }
55 }
56 #pragma acc update self(out[starty*step:blocksize*step]) \
57 async(queue)
58 queue = (queue%3)+1;
59 }
60 #pragma acc wait
61 }
62 }
63 }
In line 10 of Listing 10.13, you start by querying the number of available devices
and then use that number to spawn an OpenMP parallel region with that number of
threads. In line 14, which is run redundantly by all threads in the OpenMP region,
you set the device number that will be used by each individual thread, using the
thread number to select the device. Notice also that you change from using a sim-
ple modulo operator to set the work queue number to having a per-thread counter.
This change prevents the modulo operation from mapping all of a given device's chunks onto only a few of its queues when certain device counts are used. The remainder
of the code is unchanged, because each device will have its own data region (line
16) and will have its own work queues.
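The caption and opening lines (1-11) of Listing 10.13 fall on a page outside this excerpt. Based on the paragraph above, they query the device count and open the OpenMP parallel region; the following is a hedged sketch only, with the function name and setup details assumed rather than taken from the book.

#include <openacc.h>
#include <omp.h>

/* Hedged sketch of the missing setup of Listing 10.13 (names are assumed). */
void blur5_pipelined_multidevice(unsigned char *restrict imgData,
                                 unsigned char *restrict out,
                                 long w, long h, long ch, long step)
{
    const int nblocks = 8;            /* as in the earlier listings           */
    long blocksize = h / nblocks;
    int numDevices =
        acc_get_num_devices(acc_device_default);  /* the query described for line 10 */

    #pragma omp parallel num_threads(numDevices)  /* one host thread per device */
    {
        /* ...continues as in the listing fragment below (from its line 12). */
    }
}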
Table 10.2 shows the results of running the code on one, two, and four GPUs (three
is skipped because it would not divide the work evenly). Notice that you see a per-
formance improvement with each additional device, but it is not a strict doubling as
the number of devices doubles. It is typical to see some performance degradation
as more devices are added because of additional overhead brought into the calcu-
lation by each device.
Figure 10.4, presented earlier, shows another screenshot from PGProf, in which
you can clearly see overlapping data movement and computation across multiple
devices. Note that in the example code it was necessary to run the operation twice
so that the first time would absorb the start-up cost of the multiple devices and the
second would provide fair timing. In short-running benchmarks like this one, such
a step is needed to prevent the start-up time from dominating the performance. In
real, long-running applications, this is not necessary because the start-up cost will
be hidden by the long runtime.
This approach has several advantages and disadvantages compared with the
approach used in the preceding section. For one thing, application developers are
forced to consider how to divide the problem domain into discrete chunks, which
are assigned to different processors, thereby effectively isolating each GPU from
the others to avoid possible race conditions on the same data. Also, because the
domain is divided among multiple processes, it is possible that these processes
may span multiple nodes in a cluster, and this increases both the available memory
and the number of accelerators available. For applications already using MPI to run
in parallel, using MPI to manage multiple accelerators often requires fewer code
changes compared with managing multiple devices using OpenACC alone. Using
existing MPI also has the advantage that non-accelerated parts of the code can still
use MPI parallelism to avoid a serialization slowdown due to Amdahl’s law. When
you’re dealing with GPUs and other discrete accelerators, it is common for a sys-
tem to have more CPU cores than it does discrete accelerators. In these situations,
it is common to either share a single accelerator among multiple MPI ranks or to
thread within each MPI rank to better utilize all system resources.
When you are using MPI to utilize multiple discrete accelerators, to obtain the best
possible performance it is often necessary to understand how the accelerators and
host CPUs are connected. For instance, if a system has two multicore CPU sockets
and four GPU cards, as shown in Figure 10.6, then it’s likely that each CPU socket
will be directly connected to two GPUs but will have to communicate with the other
two GPUs by way of the other CPU, and this reduces the available bandwidth and
increases the latency between the CPU and the GPU.
Furthermore, in some GPU configurations each pair of GPUs may be able to com-
municate directly with each other, but only indirectly with GPUs on the opposite
socket. Best performance is achieved when each rank is assigned to the nearest
accelerator, something known as good accelerator affinity. Although some MPI job
launchers will handle assigning a GPU to each MPI rank, you should not count on
this behavior; instead you should use the acc_set_device_num function to man-
ually assign a particular accelerator to each MPI rank. By understanding how MPI
assigns ranks to CPU cores and how the CPUs and accelerators are configured, you
can assign the accelerator affinity so that each rank is communicating with the near-
est available accelerator, thereby maximizing host-device bandwidth and minimizing
latency. For instance, in Figure 10.6, you would want to assign GPU0 and GPU1 to ranks physically assigned to CPU0, and GPU2 and GPU3 to ranks physically assigned to CPU1. This is true whether ranks are assigned to the CPUs in a round-robin
manner, packed to CPU0 and then CPU1, or in some other order.
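As a concrete illustration, here is a minimal, hedged sketch of one common way to bind each MPI rank to a nearby device. The use of MPI_Comm_split_type to derive a node-local rank is an assumption of this sketch, not something prescribed by the chapter; many sites instead read a launcher-provided local-rank environment variable.

#include <mpi.h>
#include <openacc.h>

/* Hedged sketch: assign each MPI rank to a nearby accelerator. */
void assign_device_by_local_rank(void)
{
    MPI_Comm nodecomm;
    int localRank, numDevices;

    /* Ranks that share a node end up in the same communicator. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &localRank);

    numDevices = acc_get_num_devices(acc_device_default);
    acc_set_device_num(localRank % numDevices, acc_device_default);

    MPI_Comm_free(&nodecomm);
}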
Figure 10.6 System with two multicore CPU sockets and four GPU cards
Many MPI libraries can now communicate directly with accelerator memory, partic-
ularly on clusters utilizing NVIDIA GPUs, which support a feature known as GPUDi-
rect RDMA (remote direct memory access). When you are using one of these MPI
libraries it’s possible to use the host_data directive to pass accelerator memory
to the library, as shown in Chapter 9 and discussed next.
Listing 10.14 Common patterns for MPI communication with OpenACC updates
// When using MPI_Sendrecv or any blocking collective, it is
// necessary to have an update before and after.
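The body of Listing 10.14 is cut off at the page break. A minimal sketch of the pattern the comments describe follows; the buffer names, counts, and neighbor rank are illustrative only, not taken from the book.

#include <mpi.h>

/* Hedged sketch: with a non-accelerator-aware MPI library, data must be
   staged through host memory by update directives before and after the
   (blocking) MPI call. */
void exchange_host_staged(double *restrict sendbuf, double *restrict recvbuf,
                          int n, int neighbor)
{
    #pragma acc update self(sendbuf[0:n])    /* device -> host before the call */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, neighbor, 0,
                 recvbuf, n, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    #pragma acc update device(recvbuf[0:n])  /* host -> device after the call  */
}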
The biggest drawback to this approach is that all interactions between OpenACC
and MPI require a copy through host memory. This means that there can never be
effective overlapping between MPI and OpenACC and that it may be necessary to
communicate the data over the PCIe bus twice as often. If possible, this approach to
using MPI with OpenACC should be avoided.
Note, however, that if the same code is used with a non-accelerator-aware MPI library, a program crash will occur. Listing 10.15 demonstrates some common patterns for
using host_data to pass device arrays into the MPI library. This method has far
fewer caveats to highlight than the preceding one.
Listing 10.15 Example using OpenACC host_data directive with accelerator-enabled MPI
// By using an accelerator-aware MPI library, host_data may be
// used to pass device memory directly into the library.
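As with the previous listing, the body of Listing 10.15 is not included in this excerpt. A hedged sketch of the host_data pattern it describes follows; buffer names and message parameters are illustrative.

#include <mpi.h>

/* Hedged sketch: with an accelerator-aware MPI library, host_data use_device
   passes the device addresses of the buffers directly to MPI, so no update
   directives are needed around the call. */
void exchange_device_direct(double *restrict sendbuf, double *restrict recvbuf,
                            int n, int neighbor)
{
    #pragma acc host_data use_device(sendbuf, recvbuf)
    {
        MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, neighbor, 0,
                     recvbuf, n, MPI_DOUBLE, neighbor, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}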
One advantage of this approach over the other one comes when you’re using asyn-
chronous MPI calls. In the earlier case the update of the device data could not occur
until after MPI_Wait has completed, because the update may copy stale data to
the device. Because the MPI library is interacting with accelerator memory directly,
MPI_Wait will guarantee that data have been placed in the accelerator buffer,
removing the overhead of updating the device after the receive has completed.
Additionally, the MPI library can transparently optimize the way messages are transferred, such as by pipelining smaller message blocks, a feature
that may be more efficient than directly copying the memory or manually staging
buffers through host memory, for all the same reasons pipelining was beneficial in
the earlier example.
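For the asynchronous case just described, a hedged sketch of what a nonblocking receive might look like with an accelerator-aware MPI library (names are illustrative):

#include <mpi.h>

/* Hedged sketch: the nonblocking receive targets device memory directly, so
   once MPI_Wait returns the data are already in the accelerator buffer and no
   update device directive is required. */
void post_receive_device(double *restrict recvbuf, int n, int source,
                         MPI_Request *req)
{
    #pragma acc host_data use_device(recvbuf)
    {
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, source, 0, MPI_COMM_WORLD, req);
    }
    /* Later: MPI_Wait(req, MPI_STATUS_IGNORE); the data are on the device. */
}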
This section has presented two methods for utilizing multiple accelerator devices
with OpenACC. The first method uses OpenACC’s built-in acc_set_device_num
function call to direct data and work to specific devices. In particular, OpenMP
threads are used to assign a different host thread to each device, although this is
not strictly necessary. The second method uses MPI to partition work across multi-
ple processes, each of which utilizes a different accelerator. This approach is often
simpler to implement in scientific code, because such code frequently uses MPI
partitioning, meaning that non-accelerated code may still run across multiple MPI
ranks. Both approaches can take advantage of multiple accelerators in the same
node, but only the MPI approach can be used to manage multiple accelerators on
different nodes. Which approach will work best depends on the specific needs of
the application, and both techniques can be used together in the same application.
The source code in GitHub accompanying this chapter (check the Preface for a link
to GitHub) includes the image filter pipeline code used in this chapter and a simple
OpenACC + MPI application that can be used as examples.
10.3 Summary
You can exploit advanced techniques using OpenACC asynchronous execution and
management of multiple devices. These techniques are designed to maximize
application performance by improving the way the application uses all available
resources. Asynchronous execution decouples work on the host from work on the
accelerator, including data transfers between the two, allowing the application
to use the host, accelerator, and any connecting bus concurrently, but only if all
data dependencies are properly handled. Multidevice support extends this idea by
allowing a program to utilize all available accelerators for maximum performance.
10.4 Exercises
1. The source code for the image-filtering example used in this chapter has been
provided at the GitHub site shown in the Preface. You should attempt to build
and understand this sample application.
2. You should experiment to determine the best block size for your machine. Does
changing the image size affect the result? Does changing the accelerator type
change the result?
Then in Section 11.2, author Daniel Tian, a compiler developer from NVIDIA, dis-
cusses an open source OpenACC compiler in an industrial framework (OpenUH as
a branch of Open64) and explains the various loop-scheduling and reduction strategies for general-purpose GPUs, with the intention of allowing researchers to explore
advanced compiler techniques.
11.1 Sunway OpenACC
Lin Gan, Tsinghua University/ National Supercomputing Center in Wuxi
1. H. Fu, J. Liao, J. Yang, L. Wang, Z. Song, X. Huang, C. Yang, W. Xue, F. Liu, F. Qiao, and
W. Zhao, “The Sunway TaihuLight Supercomputer: System and Applications,” Science
China Information Sciences 59(7) (2016): 072001: 1–16.
and assembled in China using Chinese technology. Since June 2016, the Sunway TaihuLight has been ranked the top supercomputer in the world on the Top500 list. It contains
a heterogeneous computing architecture, a network system, a peripheral system,
and a maintenance and diagnostic system, as well as power and cooling. The Tai-
huLight software ecosystem includes basic software, a parallel operating system,
and high-performance storage management; in addition, a parallel programming
language and compilation environment supports mainstream parallel program-
ming language interfaces such as MPI, OpenMP, OpenACC, and others. This chapter
focuses on OpenACC and specifically on the OpenACC version used on the Sunway
TaihuLight.
There are differences between the structures of the SW26010 manycore CPU and
current heterogeneous computing platforms consisting of GPUs and MICs. This
meant that the Sunway TaihuLight implementation of OpenACC had to provide new
functions and some extensions to the OpenACC syntax.2
The SW26010 contains four core groups, and each group contains one management
processing element (MPE) and 64 computing processing elements (CPEs). This
means that each SW26010 contains 260 CPU cores. The MPE and CPEs share mem-
ory within each core group. Each CPE has an independent 64KB scratch pad memory (SPM), which can be used to manage data at a finer granularity and is especially important for performance. Data can be transferred between the SPM and main memory by direct memory access (DMA) initiated by the CPEs. The MPE, acting as a general-purpose CPU core, can perform operations such as communication, I/O, and computation. CPEs in the same core group are used to accelerate computations to
boost computation-intensive sections of code.
2. In this chapter, we use OpenACC* to distinguish the Sunway TaihuLight OpenACC imple-
mentation from the OpenACC version 2.0 standard.
Succinctly, each MPE in the SW26010 can be regarded as a core in an x86 platform, and its 64 CPEs can be regarded as a GPU-like accelerator that speeds up the computation through parallelism.
1. The main thread data space, located in the main memory, is directly accessible
to accelerator threads created by the main thread as well as the main thread. All
the variables declared outside the accelerator region are within this space.
2. The private space in the accelerator thread, located in the main memory, is
independently owned by each accelerator thread and contains all the variables
declared with private and firstprivate.
3. The local space in the accelerator thread, located in the device memory, is
independently owned by each accelerator thread and provides the highest per-
formance compared with the other data spaces. Variables declared with copy,
local, and so on are totally or partially stored in local space controlled by the
compiler system. Data transfer between the main memory and the local space
is controlled by the accelerator thread.
The cache directive in OpenACC* is designed to store data at the memory level
where the speed of data access is the highest, whereas the cache directive in the
OpenACC standard is designed to utilize the accelerator’s hardware- or software-
managed caches. Because the copy, copyin, and copyout directives in
OpenACC* can meet the requirements of moving data between SPM and main
memory in most circumstances, the cache directive in OpenACC* is used mostly
when the copy clauses do not work, such as when you are copying arrays whose
indexes are not fixed.
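For reference, a minimal sketch of the standard OpenACC cache directive (not the Sunway OpenACC* variant discussed above) applied to a small read window; the array names and the smoothing computation are illustrative only.

/* Hedged sketch: ask for a small, loop-varying window of a to be kept in the
   fastest available memory for the body of the loop. */
void smooth(const float *restrict a, float *restrict b, int n)
{
    #pragma acc parallel loop copyin(a[0:n]) copy(b[0:n])
    for (int i = 1; i < n - 1; i++) {
        #pragma acc cache(a[i-1:3])
        b[i] = 0.25f * (a[i-1] + 2.0f * a[i] + a[i+1]);
    }
}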
The program starts on the MPE. It is a serial section of code executed as a main
thread or concurrently executed as multiple main threads through the use of pro-
gramming interfaces such as OpenMP or MPI. The computation-intensive part is
loaded onto the CPEs as kernel jobs controlled by the main thread.
• Transferring the needed data from main memory to device memory, initiated by the device, and waiting until the transfer finishes
• The device conducting the computation and returning the results to the main
memory
In most cases, each MPE loads a series of kernels and executes them one by one
on the accelerator device.
Figure 11.4 Code example: Using data to implement data exchange. (a) OpenACC; (b) OpenACC*

Data Copy

In OpenACC*, the copy annotation in parallel components has been extended to describe data transfers between main memory and multiple SPMs. In particular, the size of the copied data is controlled by using loop components and tile annotations collectively, as in the Fortran code in Figure 11.5. This is important so that the programmer can make use of SPM space.
The SW26010 execution method in Figure 11.5 is different from that of OpenACC
on a standard GPU platform, because the compiler can automatically analyze the
mapping relation between the data-access method and the loop division method
and determine the data division method based on the loop division method. In
particular, the i loop is divided and runs in parallel with a chunk size of 1, whereas
the j loop is serially divided, with a chunk size of 2 according to the tile annotation.
With this information, the division method for arrays A, B, and C can be determined.
In array A, for instance, each computation iteration needs (256, 2, 1) of data. Thus
the compiler will apply (256, 2, 1) caches for A, B, and C, respectively, and will auto-
matically generate the control sequence for data transfer.
Figure 11.6 shows how the swap series annotation can be used to transpose the arrays and then perform the data transfer.
The swap annotation is generally used when accessing discontinuous array data, because discontinuous data generally lead to poor data transfer performance. The swap annotation gets around this problem by first transposing the arrays and then transferring the data, improving transfer efficiency.
Pack series annotation is used to package multiple scattered variables and trans-
fer the packaged data as a sequential whole, as shown in Figure 11.7. Specifically,
the compiler packages A, B, and C into a single array for transfer and uses this
packaged memory directly in the computation. Pack annotation is used mostly for
packaging multiple arrays or scalar data.
11.1.5 Summary
OpenACC has been used to program applications in more than 20 essential domains, such as
climate modeling,3 material simulation, deep learning, geophysics exploration, life
science, and more, on the Sunway TaihuLight supercomputer, the fastest super-
computer in the world as of the November 2016 Top500 list. More than 100 appli-
cations have delivered inspiring performance results on the Sunway TaihuLight
hardware.
Work is proceeding to scale six important applications so that they can run effi-
ciently on the entire system, which contains more than 10 million computing cores
and is capable of delivering tens of petaflops’ worth of sustainable performance.
Other notable projects include a fully implicit solver for atmospheric dynamics,
which won the 2016 ACM Gordon Bell Prize.4 Two other 2016 ACM Gordon Bell
Finalists are focused on climate modeling5 and material simulation.6
OpenACC* is one of the major parallel programming tools used for application
development on the Sunway system. All told, OpenACC has greatly facilitated the
programming efforts of users and has played an important role in supporting
research covering various applications. In November 2016, the National Super-
computing Center in Wuxi officially became a formal member of the OpenACC
Alliance, an international organization that supports cooperation to improve the
performance and widespread use of OpenACC. In summary, Sunway OpenACC will
continue to be improved based on feedback as well as the international standard.
3. H. Fu, J. Liao, W. Xue, L. Wang, D. Chen, L. Gu, J. Xu, N. Ding, X. Wang, C. He, and S. Xu,
“Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway
TaihuLight Supercomputer,” in Proceedings of the International Conference for High Perfor-
mance Computing, Networking, Storage and Analysis (November 2016): 83.
4. C. Yang, W. Xue, H. Fu, H. You, X. Wang, Y. Ao, F. Liu, L. Gan, P. Xu, L. Wang, and G. Yang,
“10M-core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics,” in
Proceedings of the International Conference for High Performance Computing, Networking,
Storage and Analysis (November 2016): 6.
5. F. Qiao, W. Zhao, X. Yin, X. Huang, X. Liu, Q. Shu, G. Wang, Z. Song, X. Li, H. Liu, and G. Yang, “A
Highly Effective Global Surface Wave Numerical Simulation with Ultra-High Resolution,”
in Proceedings of the International Conference for High Performance Computing, Networking,
Storage and Analysis (November 2016): 5.
6. J. Zhang, C. Zhou, Y. Wang, L. Ju, Q. Du, X. Chi, D. Xu, D. Chen, Y. Liu, and Z. Liu, “Extreme-
Scale Phase Field Simulations of Coarsening Dynamics on the Sunway TaihuLight Super-
computer,” in Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis (November 2016): 4.
OpenACC lets programmers make use of directives via pragmas to offload computation-
intensive nested loops onto massively parallel accelerator devices. Directives are
easy to use but are high level and need smart translation to map iterations of loops
to the underlying hardware. This is the role of the compiler, which helps users
translate annotated loops into kernels to be run on massively parallel architec-
tures. Therefore, the loop-scheduling transformation serves as a basic and manda-
tory feature in an OpenACC compiler infrastructure.
unit (GPGPU) thread hierarchy. We discuss our solutions in more detail in the next
subsection.
(a) OpenACC code:

#pragma acc data copyin(a[0:n], b[0:n]), copyout(c[0:n])
{
  #pragma acc kernels loop independent
  for(i=0; i<n; i++){
    c[i] = a[i] + b[i];
  }
}

(b) Translated CUDA kernel:

__global__ void __accrg_vectoradd(
    double * c, double * b, double * a, int n)
{
  int i;
  int istep;
  i = threadIdx.x + (blockIdx.x * blockDim.x);
  istep = blockDim.x * gridDim.x;
  for(i = i; i < n; i = i + istep) {
    *(c + i) = *(a + i) + *(b + i);
  }
}

(c) Translated CPU (host) code:

//data transfer from host to device
__accr_malloc_device_mem(&_d_c, 0, n*8);
__accr_malloc_device_mem(&_d_b, 0, n*8);
__accr_malloc_device_mem(&_d_a, 0, n*8);
__accr_mem_h2d(&b, _d_b, 0, n*8);
__accr_mem_h2d(&a, _d_a, 0, n*8);
__accr_set_default_gang_vector();
__accr_push_kernel_param(&_d_c);
__accr_push_kernel_param(&_d_b);
__accr_push_kernel_param(&_d_a);
__accr_push_kernel_param(&n);
__accr_launch_kernel("__accrg_vectoradd", "vectoradd.ptx");
//data transfer from device to host
__accr_mem_d2h(&_d_c, c, 0, n*8);
__accr_free_dmem(_d_c);
__accr_free_dmem(_d_b);
__accr_free_dmem(_d_a);

Figure 11.9 OpenACC vector addition example
and vectors can help create multidimensional thread-block and grid topology, a
technique that might help generate enough parallelism and improve performance.
We discuss these loop-scheduling methods and our proposed methodologies in the
following two subsections.
Note
In the following subsections, p is parallel, k is kernel, g is gang, w is worker, v is vec-
tor, and x, y, and z denote the three dimensions.
Table 11.1 shows the CUDA terminology that we use in our OpenACC implementa-
tion. In OpenUH, a gang maps to a thread block; a worker maps to the y dimen-
sion of a thread block; and a vector maps to the x dimension of a thread block.
Because contiguous iterations tend to access contiguous data, this strategy allows
for coalescing accesses to the global memory on the GPU architecture (both AMD
and NVIDIA). Based on these definitions, the implementation for the single-loop
nest is shown in Figure 11.11. Figure 11.11(a) demonstrates an example in OpenACC,
and Figure 11.11(b) demonstrates an example of the loop-scheduling transforma-
tion in CUDA. In this example, all the iterations are distributed evenly across gangs,
workers, and vector lanes.
OpenACC kernels region. In our rules, gang and vector can each appear at most once at the same loop level. Alternatively, users can leave the loop-scheduling clauses empty and let the compiler generate suitable loop scheduling for the nested loops.
Table 11.2 lists our six proposed OpenACC clauses and shows how they map to cor-
responding CUDA descriptions for the kernels directive. These new clauses can
guide the compiler to create multiple-dimensional thread topologies.10
Table 11.2 OpenACC and CUDA terminology mapping in the kernels construct
OpenACC clause CUDA description
gangx Block in x dimension of grid
gangy Block in y dimension of grid
gangz Block in z dimension of grid
vectorx Threads in x dimension of block
vectory Threads in y dimension of block
vectorz Threads in z dimension of block
The example in Figure 11.12(a) shows the loop scheduling gangy vectory for the outer loop and gangx vectorx for the inner loop. The translated CUDA version (Figure 11.12(b)) follows Equations 11.1 and 11.2. In this loop scheduling, gang and vector span two dimensions, something the parallel loop scheduling cannot express.
After the mapping, the outer loop thread stride is gridDim.y * blockDim.y,
and the inner loop thread stride is gridDim.x * blockDim.x.
init:
$$
i_{\mathrm{init}} =
\begin{cases}
\mathtt{blockIdx.}\{x|y|z\} + c, & \mathrm{gang}\{x|y|z\} \\
\mathtt{threadIdx.}\{x|y|z\} + c, & \mathrm{vector}\{x|y|z\} \\
\mathtt{blockIdx.}\{x|y|z\} \times \mathtt{blockDim.}\{x|y|z\} + \mathtt{threadIdx.}\{x|y|z\} + c, & \mathrm{gang}\{x|y|z\}~\mathrm{vector}\{x|y|z\}
\end{cases}
$$
Equation 11.1

incre:
$$
i_{\mathrm{step}} =
\begin{cases}
\mathtt{gridDim.}\{x|y|z\}, & \mathrm{gang}\{x|y|z\} \\
\mathtt{blockDim.}\{x|y|z\}, & \mathrm{vector}\{x|y|z\} \\
\mathtt{gridDim.}\{x|y|z\} \times \mathtt{blockDim.}\{x|y|z\}, & \mathrm{gang}\{x|y|z\}~\mathrm{vector}\{x|y|z\}
\end{cases}
$$
Equation 11.2
10. For a detailed description of Table 11.2 and the mapping strategies, see J. Zhang et al.,
“Extreme-Scale Phase Field Simulations of Coarsening Dynamics on the Sunway Taihu-
Light Supercomputer,” cited earlier.
Figure 11.12 An example of double-nested loop transformation with new loop-scheduling clauses
Note: (a) Original loop scheduling k-gyvy-gxvx with clauses; (b) Generated CUDA version of loop.
We have chosen to introduce these six clauses within the kernels construct
because this technique allows the compiler to analyze and automatically parallelize
loops as well as select the best loop-scheduling strategy. So far, OpenUH requires
the user to explicitly specify these loop schedules, but in the future we plan to
automate this work in the compiler so that the programmer does not need to worry
about the choice of the loop-scheduling strategy. Another issue lies with the barrier
operation in kernels computation regions. For example, if a reduction is used in the inner loop of the gv-gv scheduling, it would require synchronization across thread blocks, and such synchronization is not supported on AMD or NVIDIA GPUs. In this situation, the workaround is to use the parallel loop scheduling.
-fopenacc -nvcc,-arch=sm_35
The flags following -nvcc are passed to the CUDA 8.0 nvcc compiler. We use
sm_35 for the Kepler architecture. When targeting Pascal P100, sm_60 is used.
Matrix multiplication (Figure 11.13) and the stencil-like Laplacian benchmark (Figure 11.14) are used to demonstrate the performance differences among various loop-scheduling methods. The two benchmarks are simple and easy to understand,
with only a single kernel within each benchmark. The matrix size is 8,192×8,192
(see Table 11.3), and its kernel is executed only one time. The Laplacian data size is
256×256×256, and its kernel is iterated more than 20,000 times.
Figure 11.13 Matrix multiplication kernels with double nested loop parallelism
Note: Only the two outer loops are parallelized; the innermost loop is executed sequentially. In this code, the compiler creates a
two-dimensional grid as well as a two-dimensional thread block; each thread block has 256 threads in total. Loop-scheduling
clauses can be replaced with other combinations.
Figure 11.14 Stencil-like Laplacian kernels with triple nested loop parallelism
Note: Loop scheduling is gy-gx-vx; A is a macro that defines the array offset calculation; the alpha and beta variables are
constant in this kernel. The i loop is always mapped on x-dimension threads in order to keep memory access coalesced.
Figure 11.13 shows simple kernel code of matrix multiplication (MM). MM has
double nested parallelized loops. The B and C arrays accessed from the inner loop
j are contiguous memory access, whereas they are noncontiguous when accessed
from the outer loop. The inner loop j uses vector parallelism, and this means that
the threads executing the inner loop are consecutive with maximal memory access
coalesced.
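Figure 11.13 itself is not reproduced in this excerpt. The following is a hedged sketch of a matrix-multiplication kernel with the k-gyvy-gxvx scheduling described above; the gangy/vectory/gangx/vectorx spellings follow Table 11.2, while the matrix layout and data clauses are assumptions of this sketch.

/* Hedged sketch: two outer loops parallelized, innermost loop sequential. */
void matmul(const float *restrict A, const float *restrict B,
            float *restrict C, int n)
{
    #pragma acc kernels loop gangy vectory \
                copyin(A[0:n*n], B[0:n*n]) copyout(C[0:n*n])
    for (int i = 0; i < n; i++) {
        #pragma acc loop gangx vectorx
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)   /* innermost loop runs sequentially */
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}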
Figure 11.14 gives a Laplacian kernel example, which is a typical stencil applica-
tion. The Laplacian example shows a triple nested loop, which can support more
loop-scheduling combinations. In the Laplacian example, the data accessed in the
innermost loop i are consecutive in memory. To achieve better performance, mem-
ory access must be coalesced. So the innermost loop i is mapped to vectorx.
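Figure 11.14 is likewise not reproduced here. A hedged sketch of a Laplacian-style stencil with the gy-gx-vx scheduling follows; the offset macro, stencil form, and coefficients are assumptions, with the i loop mapped to vectorx as the text requires.

#define A(x, i, j, k) ((x)[(k) * n * n + (j) * n + (i)])  /* offset macro */

/* Hedged sketch of a triple-nested stencil kernel with gy-gx-vx scheduling. */
void laplacian(const float *restrict u, float *restrict unew,
               int n, float alpha, float beta)
{
    #pragma acc kernels loop gangy copyin(u[0:n*n*n]) copyout(unew[0:n*n*n])
    for (int k = 1; k < n - 1; k++) {
        #pragma acc loop gangx
        for (int j = 1; j < n - 1; j++) {
            /* i maps to vectorx so consecutive threads touch consecutive
               memory locations (coalesced access). */
            #pragma acc loop vectorx
            for (int i = 1; i < n - 1; i++) {
                A(unew, i, j, k) = alpha * A(u, i, j, k)
                    + beta * (A(u, i - 1, j, k) + A(u, i + 1, j, k)
                            + A(u, i, j - 1, k) + A(u, i, j + 1, k)
                            + A(u, i, j, k - 1) + A(u, i, j, k + 1));
            }
        }
    }
}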
In terms of the performance comparison, Figures 11.15 and 11.16 show the follow-
ing for the NVIDIA K80 architecture.
[Bar chart: execution time (s) on K80 and P100 for the schedulings p-g-v, k-gyvy-gxvx, p-gw-v, and k-gy-gxvx]
Figure 11.15 Matrix multiplication performance with various loop-scheduling methods
Note: k means OpenACC kernels region; p means OpenACC parallel region; gx, gy, and gz are short for gangx, gangy, and
gangz. Also note that vx, vy, and vz are short for vectorx, vectory, and vectorz; and g, w, and v represent the gang,
worker, and vector clauses in the OpenACC parallel region.
[Bar chart: execution time (s) on K80 and P100 for the schedulings k-gy-gx-vx, k-gy-gxvy-vx, p-g-w-v, k-gyvy-gxvx, and p-g-v]
Figure 11.16 Laplacian performance with various loop-scheduling methods
Note: k means OpenACC kernels region; p represents OpenACC parallel region; and k-gx-vy-vx is the same as p-g-w-v.
Also, k-gx-vx, which is the same as p-g-v, parallelizes only the loop j and loop i, with loop i running sequentially.
• The extended loop scheduling can help improve performance in our matrix mul-
tiplication example.
However, for NVIDIA’s P100 target,11 the various loop schedules show little to no
difference in the MM example.
Improving performance portability across various GPU generations will require further research. There are many factors affecting performance, with loop scheduling
directly impacting how data are accessed and how loop parallelism is exposed to the
11. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture
-whitepaper.pdf.
hardware. The compiler can build a heuristic cost model in the loop nest optimization
phase to help identify the optimal loop scheduling for a particular target.
13. Rengan Xu, Xiaonan Tian, Yonghong Yan, Sunita Chandrasekaran, and Barbara Chap-
man, “Reduction Operations in Parallel Loops for GPGPUs,” in 2014 International Work-
shop on Programming Models and Applications for Multicores and Manycores (PMAM 2014),
Orlando, FL (February 2014): 10:10–10:20.
14. Xiaonan Tian, Rengan Xu, Yonghong Yan, Zhifeng Yun, Sunita Chandrasekaran, and
Barbara Chapman, “Compiling a High-Level Directive-based Programming Model for
Accelerators,” in 26th International Workshop on Languages and Compilers for High Per-
formance Computing (LCPC 2013), Santa Clara, CA (September 2013): 105–120.
15. Xiaonan Tian, Rengan Xu, Yonghong Yan, Sunita Chandrasekaran, Deepak Eachempati,
and Barbara Chapman, “Compiler Transformation of Nested Loops for GPGPUs,” in Con-
currency and Computation: Practice and Experience, Special Issue on Programming Models
and Applications for Multicores and Manycore (2015): 537–556.
16. Xiaonan Tian, Dounia Khaldi, Deepak Eachempati, Rengan Xu, and Barbara Chapman,
“Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations,”
in Proceedings of 2016 45th International Conference on Parallel Processing (ICPP 2016),
Philadelphia, PA (August 2016): 572–581.
17. Rengan Xu, Sunita Chandrasekaran, Xiaonan Tian, and Barbara Chapman, “An Analyti-
cal Model-based Auto-tuning Framework for Locality-aware Loop Scheduling,” in Pro-
ceedings of 2016 International Supercomputing Conference (ISC 2016), Frankfurt, Germany
(June 2016): 3–20.
In Section 12.2, Jinpil Lee, a scientist from Riken, Japan, describes an OpenACC
programming interface in the Omni compiler infrastructure to develop applications
on accelerator cluster systems.
1. This material is a modified version of the author’s previously published paper: Seyong
Lee, Jungwon Kim, and Jeffrey S. Vetter, “OpenACC to FPGA: A Framework for Directive-
based High-Performance Reconfigurable Computing,” 30th IEEE International Parallel &
Distributed Processing Symposium (IPDPS) (2016).
C-language program as input and generates a hardware configuration file for exe-
cution on FPGAs.
The prototype system is built on top of Open Accelerator Research Compiler (Open-
ARC), an open source OpenACC compiler; it performs source-to-source translation
and optimization of the input OpenACC program into OpenCL code, which is further
compiled into an FPGA program by the back-end Altera Offline OpenCL compiler.2
Hopefully, the empirical evidence we present will help you understand the potential of a directive-based, high-level programming strategy for performance portability
across heterogeneous HPC architectures.
12.1.1 Introduction
Future exascale supercomputers will need to satisfy numerous criteria: high
performance on mission applications, wide flexibility to serve a diverse application
workload, efficient power usage, effective reliability, and reasonable cost. Hetero-
geneous computing with GPUs and Xeon Phis has recently become a popular solu-
tion to this challenge. However, the preceding exascale architectural criteria must
be balanced against the user-driven goals of portable, productive development of
software and applications. Many applications, such as climate model simulations,
are being actively developed (e.g., improved) and will execute on dozens of differ-
ent architectures over their lifetimes. Consequently, many scientists have opted to
postpone porting their applications to these heterogeneous architectures until a
single portable programming solution exists.
2. https://www.altera.com/en_US/pdfs/literature/rn/rn_aocl.pdf.
Reconfigurable computers, such as FPGAs, have had a long history of offering both
performance and energy efficiency advantages for specific workloads compared to
other heterogeneous systems.
Aware of this challenge, FPGA vendors have recently introduced the OpenCL pro-
gramming system for their FPGA platforms. This new level of abstraction hides
some complexity while providing a new level of widespread portability not offered
by earlier approaches. Still, many users consider the OpenCL programming system
as too low a level of abstraction for their large, complex applications. In this regard,
this section introduces a preliminary implementation of an OpenACC compiler that
compiles and builds applications for the Altera Stratix V FPGA using the Altera
OpenCL programming system.
1. Configuration constructs
2. Compute constructs
3. Memory constructs
4. Synchronization constructs
A host CPU will execute these HeteroIR constructs to control a target device. (Note
that accelerator kernels are not a part of the HeteroIR.) Listing 12.2 is example
HeteroIR code that is translated from the OpenACC code in Listing 12.1. (Listing 12.2
represents only the host code.)
Listing 12.2 HeteroIR code for MM program in Listing 12.1 (host code only)
1: float *C, *B, *A, *devC, *devB, *devA;
2: HI_set_device(HI_device_type_default, 0);
3: HI_init();
4: . . .
Figure 12.1 shows the overall OpenARC system architecture; the OpenARC com-
piler translates an input OpenACC program into host code that contains HeteroIR
constructs and device-specific kernel code. With the help of complementary run-
time systems (e.g., CUDA driver for NVIDIA GPUs, OpenCL runtime for AMD GPUs,
or Intel Xeon Phis), the host code and related kernel code can be executed on any
supported architecture.
OpenACC-to-FPGA Translation
OpenARC supports both CUDA and OpenCL as output programming models. To
port OpenACC to FPGAs, we use OpenCL as the output model, and the Altera
Offline compiler (AOC) as its back-end compiler. We use the same HeteroIR runtime
system of the existing OpenCL back ends, and most of the compiler passes for the
kernel generation are also reused, thanks to their architecture-agnostic behaviors.
OpenCL is architecture independent, and thus the same OpenCL code generated for
GPUs can be executed on FPGAs, being functionally portable. However, the unique
features of an FPGA architecture necessitate customizing the OpenCL code to
achieve performance portability. Figure 12.2 shows the FPGA OpenCL architecture.
Each OpenCL kernel function is transformed into dedicated and deeply pipelined
hardware circuits. Generated pipeline logics are connected to external memory via
a global memory interconnect and to on-chip memory blocks through a specialized
interconnect structure.
The key benefit of using FPGAs is that they support heterogeneous and deeply
pipelined parallelism customized for an input program. In OpenCL-based,
When OpenARC translates an OpenACC compute region into a kernel, each iteration
of the worker loop in the compute region is mapped to each work item. If the total
number of workers in the region does not match the number of corresponding loop
iterations, the kernel body should be conditionally executed so that only the work
items with valid mapping can execute the kernel. Listing 12.3 shows such an exam-
ple of kernel code, which is translated from the OpenACC code in Listing 12.1, with
no optimizations enabled. The if statement in line 6 enforces the requirement that
only the work items whose global ID is less than the upper bound of the outermost
loop in Listing 12.1 (line 3 in Listing 12.1) execute the kernel. This approach pre-
vents possible array-index-out-of-bounds errors.
Listing 12.3 Device kernel code for MM program of Listing 12.1 (unoptimized version)
1: __attribute__((reqd_work_group_size(workers,1,1)))
2: __kernel void kernel_0(__global float * C, __global float * A,
3: __global float * B,
4: int M, int N, int P) {
5: int i, j, k; i = get_global_id(0);
6: if (i < M) {
7: for (j = 0; j < N; j++) {
8: float sum = 0.0f;
9: for (k = 0; k < P; k++)
10: sum += A[i*P+k] * B[k*N+j];
11: C[i*N+j] = sum;
12: }
13: }
14: return;
15: }
Generally, divergent control paths among work items are less of an issue in FPGA programming than in GPU programming, because the back-end AOC can eliminate diverging control paths of work items by collapsing simple conditional statements into single bits that indicate when an FPGA functional unit becomes active, resulting in a flat control structure. However, AOC cannot handle complex conditional structures or work-item-ID-dependent backward branching, which involves looping structures. Kernel code that contains a nonflat control structure can incur a significant performance penalty by disabling various compiler
optimizations, including the kernel vectorization (explained in the next section).
In the earlier example translation (Listing 12.3), the OpenARC compiler had to add
the if statement, because the number of iterations is a runtime variable. In this
case, only the programmer (or the runtime) can resolve this issue. For this, we
added a new directive and command-line option for programmers to inform the
compiler whether it is safe to remove the conditional statements for a specific ker-
nel or whole program. If guaranteed by programmers, OpenARC generates kernel
code without the guarding statements, allowing opportunities for the underlying
OpenCL compiler to perform more aggressive optimizations.
Directive Extension for Loop Unrolling, Kernel Vectorization, and Compute Unit Rep-
lication. In FPGA programming, deep pipelining is a primary method to increase
throughput, but there exist three other important optimization techniques to maxi-
mize the performance of the OpenCL kernel: loop unrolling, kernel vectorization, and
compute unit replication.
Loop unrolling decreases the number of iterations that the Altera back-end com-
piler executes, at the expense of increased consumption of hardware resources.
The compiler expands the pipeline by the number of unroll factors specified, result-
ing in a deeper pipeline. In addition, the compiler further optimizes the kernel by
coalescing the memory operations. To increase the number of parallel operations,
the memory bandwidth, and the number of operations per clock cycle, we add a
new loop unroll directive for users.
Kernel vectorization allows multiple work items in the same workgroup to execute
in a single-instruction multiple-data (SIMD) fashion. The underlying Altera Offline
compiler implements this vectorization by replicating the kernel datapaths; this
allows SIMD operations to share control logic across each SIMD vector lane, and it
might coalesce memory accesses.
Kernel vectorization is usually beneficial, but finding the optimal SIMD width may
require experimentation, because of its resource contention with other optimiza-
tions. The OpenACC vector clause provides similar effects, but its behavior is not
the same as the kernel vectorization.
In the OpenACC execution model, vector lanes are executed in an SIMD manner only
when the kernel is in the vector-partitioned mode. However, in kernel vectorization,
the whole kernel is executed in an SIMD mode, and SIMD vector lanes exercise a
strict vectorization (vector lanes execute in a lockstep manner), although not all
vector lanes in the OpenACC model may execute synchronously. (The vector width
in the OpenACC vector clause provides a hint for optimal vectorization; the vector
lanes may be executed as a set of SIMD operations with a narrower width, if neces-
sary.) Therefore, to enable high-level control over this optimization, we add a new
directive to allow users to specify the SIMD width of a specific kernel.
The compute unit replication optimization generates multiple compute units for
each kernel, and the hardware scheduler in the FPGA dispatches workgroups to
additional available compute units. Increasing the number of compute units can
achieve higher throughput. However, it will also increase the bandwidth requests
to the global memory because of memory-access contention among the compute
units. Like kernel vectorization, this optimization also increases hardware resource
utilization, and thus you need to balance between these optimizations to achieve
optimal performance. For this, we also add a new directive and command-line
option to control this optimization.
Kernel-Pipelining Transformation
In the OpenACC model, device kernels can communicate with each other only
through the device global memory, and synchronizations between kernels are at
the granularity of a kernel execution; no mechanism exists to synchronize actively
running kernels in a fine-grained manner. Therefore, for a kernel to communicate
with other kernels, the involved kernels should be split into multiple subkernels
at each communication point, and the data transfer should be done via the global
memory in a sequential manner, as shown in Figure 12.3(a). However, launching
or synchronizing kernels requires host involvement, incurring non-negligible
overhead, and the limited bandwidth and long latency of the global memory may
degrade overall performance significantly.
To address this issue, the underlying AOCL provides a mechanism, called channels,
for passing data to kernels and synchronizing kernels with high efficiency and low
latency. The channels extension allows kernels to directly communicate with each
other via a FIFO buffer (Figure 12.3(b)), decoupling the kernel execution from the host.
As shown earlier, pipelining kernels using channel variables as FIFO buffers allows
you to bypass global memory accesses and fine-grained synchronizations between
kernels. To enable this fancy mechanism in OpenACC while maintaining portabil-
ity, we propose a set of new data clauses: pipe, pipein, and pipeout. A pipe
clause is used in an OpenACC data construct, which replaces a create clause
and indicates that the variables in the pipe clause are used as temporary device
buffers for kernel communications, convertible to AOCL channel variables. Pipein
and pipeout clauses are used in OpenACC parallel or kernels constructs, which replace present clauses and indicate that the variables in the clause are
read-only (pipein) or write-only (pipeout), and they are declared as channel
variables in a pipe clause of an enclosing data region.
Basically, if two or more kernels execute in sequential order and communicate with
each other using temporary device buffers, these kernels can be pipelined using
the channel variables, as shown in Listing 12.4. More precisely, for the pipelining
transformation to preserve the original execution semantics, the following condi-
tions should be met.
1. Kernels have sequential dependencies only in one direction; there are no depen-
dency cycles among kernels.
2. Temporary device buffers used for kernel communications are accessed only by
the involved kernels.
3. A kernel can either read or write (but not both) the same temporary buffer only
once per loop iteration.
4. A consumer kernel should read a temporary buffer in the same order that the
producer kernel writes the buffer.
OpenARC can detect kernels that satisfy these conditions to automate the pipelining
transformation, but ensuring the same access order of a temporary buffer can be
difficult if a kernel contains complex control flows. In this case, the proposed pipe
clauses are used for users to inform the compiler which kernels can be pipelined.
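Listing 12.4 is not included in this excerpt. A hedged sketch of how the proposed pipe, pipein, and pipeout clauses might be written, following the description above, appears below; the producer/consumer kernels and array names are invented for illustration.

/* Hedged sketch: the producer kernel writes tmp exactly once per iteration,
   and the consumer reads it in the same order, matching the conditions above. */
void two_stage(const float *restrict in, float *restrict out,
               float *restrict tmp, int n)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) pipe(tmp[0:n])
    {
        #pragma acc kernels loop pipeout(tmp[0:n])   /* producer */
        for (int i = 0; i < n; i++)
            tmp[i] = 2.0f * in[i];

        #pragma acc kernels loop pipein(tmp[0:n])    /* consumer */
        for (int i = 0; i < n; i++)
            out[i] = tmp[i] + 1.0f;
    }
}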
12.1.4 Evaluation
This section presents the results and analysis of OpenACC translation to FPGAs.
Methodology
Target Device. We evaluated the proposed framework on an Altera Stratix V GS
D5 FPGA, which consisted of 457K logic elements, 172,600 ALMs, 690K registers,
2,014 M20K memory blocks, and an 8GB DDR3 device memory. We used the Altera
SDK for OpenCL 15.0 as its back-end compiler and runtime. For comparative tests,
we used the Intel Xeon Phi coprocessor 5110P, the NVIDIA Tesla K40c GPU, and the
AMD Radeon R9 290X GPU.
Results
Worker Size. Figure 12.4 presents the performance effect of different worker
sizes (workgroup sizes in OpenCL) on each application. We measured the applica-
tion performance with worker sizes of 32, 128, and 512 on the Altera Stratix V GS
D5 FPGA, the Intel Xeon Phi coprocessor 5110P, and the NVIDIA Tesla K40c GPU.
For the AMD Radeon R9 290X GPU, we used a worker size of 256 instead of 512
because of the limit of the underlying AMD OpenCL framework. The figure indicates
that the NVIDIA GPU was the most sensitive to the worker size. This is because the
worker size in the GPU architecture affects hardware thread scheduling, which
implicitly determines the occupancy, latency hiding, register spilling, and resultant
multiprocessor utilization. These are the key factors affecting the overall perfor-
mance of GPU applications.
On the other hand, Figure 12.4(a) shows that FPGA is the least sensitive to the
worker size. Unlike GPUs and the Xeon Phi, FPGA transforms each kernel to a cus-
tom, deeply pipelined hardware circuit, in which a work item is clocked in on each
clock cycle. If the pipeline depth is long enough, multiple workgroups can be present
in the pipeline at the same time. In this case, the worker size will not affect overall
performance, if no workgroup-related operations such as workgroup barriers exist.
Compute Unit Replication and Kernel Vectorization. Figure 12.5 shows the perfor-
mance effect of replicating the compute units (CUs) and changing the SIMD width in
the kernel vectorization on the tested FPGA.
The speedups in the figure are normalized to the performance of each application
with 1 CU and 1 SIMD width. The worker sizes of all kernels are set to 128. In the
FPGA architecture, CU replication and kernel vectorization achieve higher through-
put by, respectively, executing multiple workgroups concurrently on the available
CUs and allowing multiple work items to execute in an SIMD fashion. The figure
shows that Jacobi and MatMul perform better with increases in CUs and SIMD,
although their best configurations differ. They have no work-item-dependent
control-flow divergence, and they perform regular memory accesses. These char-
acteristics allow the Altera Offline compiler to vectorize their kernels and repli-
cate CUs, resulting in higher performance than the base version (1 CU and 1 SIMD
width).
However, SpMul and SRAD perform more poorly with multiple CUs. This is mainly
due to their irregular memory accesses, which incur suboptimal bandwidth utili-
zation and become worse by replicating CUs. On the other hand, HotSpot and NW
show performance enhancement with multiple CUs. However, they perform worse
with kernel vectorization. HotSpot and NW have many work-item-dependent control-flow
divergences and strided memory accesses. These make kernel vectorization and
memory coalescing inefficient and complex, incurring lower performance and
higher hardware resource consumption.
Loop Unrolling. Figure 12.6(a) demonstrates the performance impact of the loop
unrolling on Jacobi and MatMul. We set the unroll factors to 4 and 8, and we
compared them with the original, no-unroll version. Both Jacobi and MatMul show
better performance as the unroll factor increases. This is because the loop unroll-
ing expands the pipeline, thereby allowing more work items to run at the same
time. Moreover, it enables the coalescing memory operations from global memory,
resulting in higher bandwidth.
Kernel Pipelining. Figure 12.6(b) shows the speedup of FFT-2D when the kernel
pipelining is applied. The baseline version (Global Memory in the figure) uses global
memory to transfer data between kernels. The optimized version (Pipelining in the
figure) exploits our kernel pipelining extension. The kernel pipelining increases
data transfer efficiency between kernels by communicating directly with FIFO
buffers. Moreover, it enables the concurrent multiple kernel execution on the FPGA
with fine-grained synchronizations between kernels, bringing higher throughput.
The figure shows that the pipelining version achieves about a 60 percent perfor-
mance improvement over the baseline version.
Overall Performance. Figure 12.7 compares the overall performance of the tested
OpenACC benchmarks on each target accelerator.
Figure 12.7 Speedup over CPU sequential (CPU: Intel Xeon E5520 octacore)
These results indicate that the performance behavior of FPGAs differs from that of
GPUs and Xeon Phi: specifically, GPUs prefer applications with massive parallelism
(e.g., Jacobi, MatMul, SpMul, HotSpot, and SRAD), but FPGAs prefer applications
with deep execution pipelines (e.g., FFT-1D and FFT-2D). For example, FFT-1D is a
sequential program with deeply pipelined instructions. Being sequential, FFT-1D
is not suitable for traditional accelerators with massively parallel computational
units. However, FPGAs can build deeply pipelined hardware customized to the input
program, and thus FFT-1D (and FFT-2D, which is parallel) can perform much better
on FPGAs than other accelerators and CPUs.
12.1.5 Summary
A directive-based, high-level programming system for high-performance recon-
figurable computing is built on top of an existing OpenACC compiler called Ope-
nARC. OpenARC’s high-level abstraction allowed us to easily port the OpenACC
programming model to heterogeneous reconfigurable architectures such as
FPGAs, with minimal extensions to the existing OpenACC framework. Porting eight
OpenACC benchmarks onto four representative accelerator architectures (FPGAs,
NVIDIA GPUs, AMD GPUs, and Xeon Phis) demonstrates the functional portability of
OpenACC.
3. Masahiro Nakao, Hitoshi Murai, Takenori Shimosaka, Akihiro Tabuchi, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, and Mitsuhisa Sato, “XcalableACC: Extension of XcalableMP PGAS Language Using OpenACC for Accelerator Clusters,” Workshop on Accelerator Programming Using Directives (WACCPD) (New Orleans, LA, November 2014).
4. Omni XcalableMP Web site, http://www.xcalablemp.org/.
5. Jinpil Lee and Mitsuhisa Sato, “Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems,” 39th International Conference on Parallel Processing Workshops (ICPPW10) (San Diego, CA, September 2010): 413–420.
The primary solution is to use OpenACC together with XMP. However, the current XMP specification (version 1.2) does not define the behavior when XMP and OpenACC directives are mixed in the same code. We have therefore been designing a new programming model, XACC, as a hybrid model that combines XMP and OpenACC directives. In addition, XACC extends the XMP directives to improve productivity and performance by integrating internode and intranode parallelism in the same programming model. Although this section provides sample code written in C, the XACC model also supports Fortran.
/* The declarations of node set p, template t, and the distributed
 * array A appear earlier in the listing and are not shown here. */
int main(void) {
    int sum = 0;
#pragma xmp loop (i, j) on t(j, i) reduction(+:sum)
    for (int i = 0; i < YMAX; i++)
        for (int j = 0; j < XMAX; j++)
            sum += calc(A[i][j]);
    return 0;
}
Each node executes the same program but modifies a different part of the data, according to the data parallelism described by the user. Data distribution in XMP is done through a template, a virtual array representing a global index space shared among the nodes. The template directive declares this virtual array. Because it is used to distribute array A, template t has the same shape (dimensionality and size). Template t is duplicated among the nodes, which means that every node in p owns every index of t.
XMP requires loop work mapping to make a loop statement run in parallel. The loop construct in Listing 12.5 specifies that the following loop is executed by the owners of template t. Because t is distributed among the nodes in p, the iteration space is distributed and run in parallel. One of the most important features of XMP is its explicit parallelism: the compiler does not generate parallel loops or internode communication without an explicit user description, and every memory access refers to node-local memory. This is why the same template is used both for data distribution and for loop work mapping: matching the two ensures that each node touches only the array indices it owns, which keeps all accesses local. Figure 12.10 shows the matching process.
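The declarations that set up this matching are not reproduced in the listing above, but a minimal sketch of what they typically look like is shown below. The node shape (2×2), the block distribution, and the index order are assumptions chosen only to be consistent with the on t(j, i) clause in Listing 12.5.

    /* Illustrative XMP setup (assumed shapes): the template ties the data
     * distribution of A to the loop work mapping used in Listing 12.5. */
    #pragma xmp nodes p(2, 2)                      /* 2x2 node set               */
    #pragma xmp template t(0:XMAX-1, 0:YMAX-1)     /* virtual global index space */
    #pragma xmp distribute t(block, block) onto p  /* block-distribute t over p  */

    double A[YMAX][XMAX];
    #pragma xmp align A[i][j] with t(j, i)         /* A[i][j] follows t(j, i)    */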
Internode Communication
As mentioned, every memory access in the code refers to local memory. To access data on a remote node, special language constructs must be used. Typical internode communications are provided in directive form; for example, a reduction can be performed in the loop construct by adding the reduction clause (Listing 12.5). Other collective communications, such as broadcast and allgather, are also supported in the global view model.
If a statement refers to a remote array element, the user can add a gmove directive to generate the internode communication. Figure 12.8 shows an example: all of the elements of array B are copied to array C. Because array C is duplicated among the nodes, this requires internode communication (an allgather). The compiler generates the allgather by using the information in B's distribution descriptor. Without the gmove directive, the program would fail because it refers to elements of B that are not available locally.
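A minimal sketch of such a copy is shown below, assuming the array-section assignment form of gmove; the shapes and distributions of B and C are hypothetical.

    /* Hypothetical gmove example: copy the distributed array B into the
     * duplicated array C; the compiler turns this into an allgather. */
    #pragma xmp gmove
    C[:] = B[:];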
The major difference is that, in XACC, the OpenACC data directives replicate a distributed chunk of the PGAS memory space on the accelerator. When a data directive specifies an array that is distributed by XMP directives, the distributed chunk—including its communication buffer (e.g., the XMP shadow region, illustrated in Figure 12.10 as white boxes)—is replicated in accelerator memory. As in XMP, internode communication between host memories is performed by XMP directives or coarray statements; XACC additionally supports direct communication between accelerators by extending the XMP communication features.
Before the outermost loop, a data construct replicates the host memory image in accelerator memory. Because the arrays are distributed among the nodes, the 2D block-distributed chunks (including shadow elements) are allocated in accelerator memory, and the array data in u and uu are then copied from the host to the accelerator. At the end of the data region, the data in u are copied back from the accelerator to the host; this copy includes the shadow elements.
OpenACC work-sharing directives can be combined with the XMP loop directive. In Listing 12.6, an OpenACC parallel loop is applied to a loop that is already distributed by the XMP loop construct. In the XACC compiler, the two directives work together so that the loop iterations mapped to a node are parallelized on that node's accelerator. Without the XMP loop construct, the XACC compiler would try to schedule the entire iteration space of the parallel loop construct on the accelerator, which causes a runtime error on a distributed memory system.
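Listing 12.6 itself is not reproduced here, but the sketch below gives the general shape under stated assumptions: the array names u and uu, the template t, the loop bounds, and the directive ordering are illustrative rather than the book's exact code.

    /* Minimal XACC sketch: the data construct replicates each node's
     * distributed chunks (with shadow elements) on its accelerator; the
     * XMP loop directive distributes iterations across nodes; the OpenACC
     * directive parallelizes each node's share on its device. */
    #pragma acc data copy(u) copyin(uu)
    for (int iter = 0; iter < NITER; iter++) {
    #pragma xmp loop (i, j) on t(j, i)
    #pragma acc parallel loop collapse(2)
        for (int i = 1; i < YMAX - 1; i++)
            for (int j = 1; j < XMAX - 1; j++)
                uu[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                 + u[i][j-1] + u[i][j+1]);
        /* shadow (halo) exchange and the u/uu update are omitted */
    }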
If the target system provides special hardware support for direct communication between accelerators (e.g., GPUDirect), the communication is performed on the replicated memory regions. If not, the communication is routed through host memory, but you should still express direct communication at the syntax level to improve productivity; the user does not need to be aware of the underlying hardware feature. All communication directives (such as bcast, reduction, and reflect) accept an acc clause, and coarray statements are also extended in XACC so that the user can describe one-sided communication within accelerator memory.
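For illustration, the directives below show the assumed form of the acc clause on two of these communication directives; the array names are hypothetical.

    /* Assumed XACC usage: perform the communication on the accelerator
     * copies of the arrays rather than on the host copies. */
    #pragma xmp bcast (coef) acc   /* broadcast coef among device memories */
    #pragma xmp reflect (u) acc    /* exchange u's shadow (halo) elements  */
                                   /* directly between accelerators        */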
When an XMP loop directive and an OpenACC loop construct are given for the same loop statement, the compiler first translates the loop statement as specified by the XMP directive and then places the OpenACC directive in front of the rewritten loop. Figure 12.11 shows the code translation. Note that _XMP_init_i, _XMP_cond_i, and _XMP_step_i are inserted by the compiler for parallel execution among the nodes; their values are calculated at run time. Under the parallel loop directive, each node then executes its allocated portion of the iteration space on its accelerator.
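The rewritten loop has roughly the shape sketched below. The form of the loop-bound comparison and the loop body are assumptions; only the names _XMP_init_i, _XMP_cond_i, and _XMP_step_i come from the text.

    /* Sketch of the translated loop (cf. Figure 12.11): the XMP runtime has
     * already computed this node's bounds, and the OpenACC directive is
     * re-emitted in front of the rewritten loop. */
    #pragma acc parallel loop
    for (i = _XMP_init_i; i < _XMP_cond_i; i += _XMP_step_i) {
        /* original loop body, with indices mapped to this node's chunk */
    }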
Runtime Implementation
The Omni XMP/XACC runtime library uses MPI for internode communication. CUDA
is used to program NVIDIA GPU devices. For better portability, we implement our
OpenACC runtime in OpenCL to replace vendor-dependent manycore libraries and
languages.6
When data is transferred between accelerators on a distributed system, it normally must be staged through host memory over the PCIe connection (in the case of GPUs). To reduce this overhead, some accelerators provide special hardware features.
NVIDIA GPUs support GPUDirect. The University of Tsukuba has developed PEACH2 (see note 7), based on the PCIe interface, to enable direct communication between GPU devices. PEACH2 is used in the HA-PACS/TCA (tightly coupled accelerators) system. Figure 12.12 shows a picture of the HA-PACS system and the PEACH2 interface board.
The Omni XACC runtime is optimized for the HA-PACS/TCA system, which provides both GPUDirect over InfiniBand and PEACH2 hardware, and it uses both to improve inter-GPU communication. Figure 12.13 shows the performance of GPUDirect over InfiniBand and of PEACH2 on HA-PACS/TCA (PEACH2 provides two communication modes: host and internal). In general, PEACH2 offers lower latency for small messages, so the XACC runtime selects the interconnect according to the data size in order to minimize latency.
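Conceptually, this selection amounts to a size check like the one sketched below; the threshold value and the two send functions are hypothetical, not the Omni XACC runtime's actual API.

    #include <stddef.h>

    /* Hypothetical low-level senders, for illustration only. */
    void xacc_send_peach2(int dst_rank, const void *buf, size_t nbytes);
    void xacc_send_gdr_ib(int dst_rank, const void *buf, size_t nbytes);

    #define PEACH2_LATENCY_THRESHOLD (256 * 1024)  /* bytes; assumed cutoff */

    /* Pick the interconnect by message size: PEACH2 for small messages
     * (lower latency), GPUDirect RDMA over InfiniBand for large ones. */
    static void xacc_halo_send(int dst_rank, const void *buf, size_t nbytes)
    {
        if (nbytes <= PEACH2_LATENCY_THRESHOLD)
            xacc_send_peach2(dst_rank, buf, nbytes);
        else
            xacc_send_gdr_ib(dst_rank, buf, nbytes);
    }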
6. Akihiro Tabuchi, Yasuyuki Kimura, Sunao Torii, Hideo Matsufuru, Tadashi Ishikawa,
Taisuke Boku, and Mitsuhisa Sato, “Design and Preliminary Evaluation of Omni OpenACC
Compiler for Massive MIMD Processor PEZY-SC,” OpenMP: Memory, Devices, and Tasks:
12th International Workshop on OpenMP, IWOMP (October 2016): 293–305.
7. https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2013/06/01-Taisuke-Boku.pdf.
Himeno Benchmark
The Himeno benchmark solves a 3D Poisson equation using the Jacobi iteration method and is a typical example of stencil computation. We parallelized the serial version of the Himeno kernel with XACC and with OpenACC+MPI for our GPU cluster system (see Listing 12.7). Because the data arrays are distributed among the nodes, the XACC runtime updates the distributed memory chunks in the data region at each iteration. Scalar variables such as gosa are duplicated on each node. The boundary elements of the arrays are exchanged directly among the GPU accelerators at the end of each iteration; to this end, we added the acc clause to the reflect directive. The runtime switches between GPUDirect over InfiniBand and PEACH2 to minimize communication overhead.
Figure 12.14 shows the performance of the Himeno kernel. For problem sizes S and M on multiple nodes, XACC using PEACH2 is up to 2.7 times faster than OpenACC+MPI using GPUDirect RDMA over InfiniBand. It is faster because the transfer time for boundary elements in the XACC version is smaller than in the OpenACC+MPI version, thanks to the lower latency of PEACH2. For problem size L, the performance of XACC is almost the same as that of OpenACC+MPI, because computation time dominates the total execution time. If only the first dimension is divided in problem size L, the boundary elements amount to about 520KB, and at that size the communication latencies of PEACH2 and of GPUDirect RDMA over InfiniBand are almost the same, as you see in Figure 12.14. In the (4×1×1), (8×1×1), and (16×1×1) cases with problem size L, the OpenACC+MPI version performs better. The reason is that the current implementation of the reflect directive requires an MPI barrier after the boundary elements are exchanged (because of a hardware resource limitation of the PEACH2 device), so the boundary transfer time of XACC is greater than that of OpenACC+MPI. On a single node, for all problem sizes, XACC performs slightly worse than OpenACC+MPI, because the Omni XACC compiler translates the global memory addresses used among nodes into local memory addresses on each node, and this extra address calculation affects the performance of array references.
Listing 12.9 shows the communication code used for the NPB-CG kernel. We express the communication in two ways. The first uses directives in the global view model: the vectors are distributed by XMP/XACC directives, and the interdevice communication is specified by the reduction and gmove directives, so all of the elements are collected on each node. Note that these directives carry the acc clause, so the compiler generates inter-GPU communication for the data exchange.
The second way uses coarrays in the local view model. In this case the vectors are declared as coarrays, and the communication is written as references to the co-dimension. First, the coarrays are declared in host memory; their memory images are then duplicated on the device by the acc declare directive. Because the use_device clause is given for the host_data region, coarrays w and q are manipulated in GPU memory.
The performance of the XACC local view version (XACC-L) is more than 97 percent of that of the OpenACC+MPI send/recv version (MPI+ACC (s/r)) for class C, and 99 percent for class D. Although performance degrades at higher process counts for class C, it remains almost the same as that of the OpenACC+MPI get version (MPI+ACC (get)), which uses a similar communication mechanism.
// Local View
double w[na/num_proc_rows+2]:[*];
double q[na/num_proc_rows+2]:[*];
#pragma acc declare create(w, q)

#pragma acc host_data use_device(w, q)
{
    /* array reduction */
    for (i = l2npcols; i >= 1; i--) {
        int image  = reduce_exch_proc[i-1] + 1;
        int start  = reduce_recv_starts[i-1] - 1;
        int length = reduce_recv_lengths[i-1];
        xmp_sync_image(image, NULL);
        q[start:length] = w[start:length]:[image];
        xmp_sync_image(image, NULL);
#pragma acc parallel loop
        for (j = send_start-1; j < send_start+lengths-1; j++)
            w[j] = w[j] + q[j];
    }
    /* column-distributed array <- row-distributed array */
    if (l2npcols != 0) {
        xmp_sync_image(exch_proc+1, NULL);
        q[0:send_len] = w[send_start-1:send_len]:[exch_proc+1];
        xmp_sync_image(exch_proc+1, NULL);
    } else {
#pragma acc parallel loop
        for (j = 0; j < exch_recv_length; j++)
            q[j] = w[j];
    }
}
On the other hand, the XACC global view version (XACC-G) shows lower performance than XACC-L in all cases. One reason is the increased time taken by the array reductions that collect the vector elements. The array reduction in XACC uses the MPI_Allreduce() function. Because MVAPICH2 supports GPUDirect, some MPI functions used in XACC-L and MPI+ACC (s/r), such as MPI_Isend/Recv() and MPI_Get(), operate directly on GPU memory. However, the current MVAPICH2 implementation uses host memory buffers to compute the array reduction in MPI_Allreduce(); the data transfers between the GPU device and the host, along with the reduction calculation on the host, lower the performance. The second reason is the larger number of elements exchanged in the communication. Whereas XACC-G collects all of the vector elements, the XACC-L and OpenACC+MPI versions collect only the minimum number of elements required for the matrix-vector multiplication. This makes the communication sizes differ for the 2×4 and 4×8 node divisions.
[Figure: NPB-CG kernel performance for classes C and D versus the number of processes (Row × Column), comparing MPI+ACC (s/r), MPI+ACC (get), XACC-G, and XACC-L.]
12.2.5 Summary
XMP provides an easy and efficient way to exploit parallelism on distributed memory systems, thanks to its directive-based programming model. XACC extends XMP so that OpenACC and XMP directives can be mixed, and to optimize performance you can also use coarrays on accelerators. The performance evaluation shows that XACC achieves performance equivalent to the OpenACC+MPI version with only small changes to the serial version.