AMD OpenCL Programming User Guide
August 2015
rev1.0
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo,
AMD Accelerated Parallel Processing, the AMD Accelerated Parallel Processing logo, ATI,
the ATI logo, Radeon, FireStream, FirePro, Catalyst, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Visual Studio, Windows, and Windows
Vista are registered trademarks of Microsoft Corporation in the U.S. and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their
respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by
permission by Khronos.
The contents of this document are provided in connection with Advanced Micro Devices,
Inc. (AMD) products. AMD makes no representations or warranties with respect to the
accuracy or completeness of the contents of this publication and reserves the right to
make changes to specifications and product descriptions at any time without notice. The
information contained herein may be of a preliminary or advance nature and is subject to
change without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth
in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever,
and disclaims any express or implied warranty, relating to its products including, but not
limited to, the implied warranty of merchantability, fitness for a particular purpose, or
infringement of any intellectual property right.
AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications
intended to support or sustain life, or in any other application in which the failure of AMD's
product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to
its products at any time without notice.
URL:
developer.amd.com/appsdk
Developing:
developer.amd.com/
Preface
Audience
This document is intended for programmers. It assumes prior experience in
writing code for CPUs and a basic understanding of threads (work-items). While
a basic understanding of GPU architectures is useful, this document does not
assume prior graphics knowledge. It further assumes an understanding of
chapters 1, 2, and 3 of the OpenCL Specification (for the latest version, see
http://www.khronos.org/registry/cl/ ).
Organization
This AMD APP SDK document begins, in Chapter 1, with an overview of: the
AMD APP SDK programming models, OpenCL, and the AMD Compute
Abstraction Layer (CAL). Chapter 2 discusses the AMD implementation of
OpenCL. Chapter 3 discusses the compiling and running of OpenCL programs.
Chapter 4 describes using the AMD CodeXL GPU Debugger and the GNU
debugger (GDB) to debug OpenCL programs. Chapter 5 provides information
about the extension that defines the OpenCL Static C++ kernel language, which
is a form of the ISO/IEC Programming languages C++ specification. Chapter 6
provides information about the features introduced in OpenCL 2.0. Appendix A
describes the supported optional OpenCL extensions. Appendix B details the
installable client driver (ICD) for OpenCL. Appendix C details the compute kernel
and contrasts it with a pixel shader. Appendix C describes the OpenCL binary
image format (BIF). Appendix D provides a hardware overview of pre-GCN
devices. Appendix E describes the interoperability between OpenCL and
OpenGL. Appendix F describes the new and deprecated functions in OpenCL
2.0. Appendix G provides information about the SPIR format. The last section of
this book is an index.
Copyright 2015 Advanced Micro Devices, Inc. All rights reserved.
Conventions
The following conventions are used in this document.
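mono-spaced font
    A filename, file path, or code.
[1,2)
    A range that includes the left-most value (in this case, 1) but excludes the
    right-most value (in this case, 2).
[1,2]
    A range that includes both the left-most and right-most values (in this case,
    1 and 2).
{x | y}
    One of the options listed (in this case, x or y).
0.0f
    A single-precision (32-bit) floating-point value.
0.0
    A double-precision (64-bit) floating-point value.
1011b
    A binary value (in this case, a 4-bit value).
7:4
    A bit range, from bit 7 to 4, inclusive. The high-order bit is shown first.
italicized word or phrase
    The first use of a term or concept basic to the understanding of stream
    computing.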
Related Documents
Kernighan, Brian W., and Ritchie, Dennis M., The C Programming Language,
Prentice-Hall, Inc., Upper Saddle River, NJ, 1978.
Buck, Ian; Foley, Tim; Horn, Daniel; Sugerman, Jeremy; Hanrahan, Pat;
Houston, Mike; Fatahalian, Kayvon. BrookGPU
http://graphics.stanford.edu/projects/brookgpu/
Contact Information
AMD APP SDK: developer.amd.com/appsdk
AMD Developer Central: developer.amd.com/
Contents
Preface
Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing Technology
Chapter 2 AMD Implementation
    Key differences between Southern Islands, Sea Islands, and Volcanic Islands families
    A note on hardware queues
Chapter 3 Building and Running OpenCL Programs
Chapter 4 Debugging and Profiling OpenCL
    Notes
Chapter 5 OpenCL Static C++ Programming Language
    5.3 Additions and Changes to Section 6 - The OpenCL 1.2 C Programming Language
        5.3.1 Building C++ Kernels
        5.3.3 Namespaces
        5.3.4 Overloading
        5.3.5 Templates
        5.3.6 Exceptions
        5.3.7 Libraries
    5.4 Examples
        5.4.1 Passing a Class from the Host to the Device and Back
        5.4.3 Kernel Template
Chapter 6 OpenCL 2.0
    Usage
    Coarse-grained memory
    Usage
    Generic example
    AMD APP SDK example
    Usage
    Iterate until convergence
    Data-dependent refinement
    Binary search using device-side enqueue
    Usage
    Pipes
        6.6.1 Overview
        6.6.3 Usage
    sRGB
    Depth images
Appendix A OpenCL Optional Extensions
    cl_ext Extensions
    cl_amd_vec3
    A.8.3 cl_amd_device_persistent_memory
    A.8.4 cl_amd_device_attribute_query
        cl_device_profiling_timer_offset_amd
        cl_amd_device_topology
        cl_amd_device_board_name
    A.8.6 cl_amd_offline_devices
    A.8.7 cl_amd_event_callback
    A.8.8 cl_amd_popcnt
    A.8.9 cl_amd_media_ops
    A.8.10 cl_amd_printf
    A.8.11 cl_amd_predefined_macros
    A.8.12 cl_amd_bus_addressable_memory
Appendix B
    Overview
Appendix C OpenCL Binary Image Format (BIF) v2.0
    C.1 Overview
        C.1.1 Executable and Linkable Format (ELF) Header
        Bitness
    BIF Options
Appendix D Hardware overview of pre-GCN devices
Appendix E OpenCL-OpenGL Interoperability
Appendix F New and deprecated functions in OpenCL 2.0
    Sub-groups
Appendix G Standard Portable Intermediate Representation (SPIR)
Index
Chapter 1
OpenCL Architecture and AMD Accelerated Parallel Processing Technology
This chapter provides a general software and hardware overview of the AMD
APP SDK implementation of the OpenCL standard. It explains the memory
structure and gives simple programming examples.
1.1 Terminology
Term
Description
wavefronts and work-groups
    Wavefronts and work-groups are two concepts relating to compute kernels
    that provide data-parallel granularity. On most AMD GPUs, a wavefront has
    64 work-items. A wavefront is the lowest level that flow control can
    affect. This means that if two work-items inside a wavefront take
    divergent paths of flow control, all work-items in the wavefront execute
    both paths.
    Grouping is a higher-level granularity of data parallelism that is
    enforced in software, not hardware. Synchronization points in a kernel
    guarantee that all work-items in a work-group reach that point (barrier)
    in the code before the next statement is executed.
    Work-groups are composed of wavefronts. Best performance is attained when
    the group size is an integer multiple of the wavefront size.
local data store (LDS)
    The LDS is a high-speed, low-latency memory private to each compute unit.
    It is a full gather/scatter model: a work-group can write anywhere in its
    allocated space. This model is unchanged for the AMD Radeon HD 7XXX
    series. The constraints of the current LDS model are:
    - The LDS size is allocated per work-group. Each work-group specifies how
      much of the LDS it requires. The hardware scheduler uses this
      information to determine which work-groups can share a compute unit.
    - Data can be shared only among work-items within a work-group.
    - Memory accesses outside of the work-group result in undefined behavior.
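The relationship between work-group size and wavefront size described in the table can be illustrated with a small host-side calculation. The following is only a sketch: it assumes the 64-work-item wavefront quoted above for most AMD GPUs, and the helper names (wavefronts_per_group, idle_lanes) are hypothetical, not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64 work-items per wavefront, as stated for most AMD GPUs. */
#define WAVEFRONT_SIZE 64u

/* Number of wavefronts needed to hold one work-group (rounded up). */
static size_t wavefronts_per_group(size_t group_size)
{
    assert(group_size > 0);
    return (group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Lanes left unused in the last wavefront when the group size is not an
 * integer multiple of the wavefront size. */
static size_t idle_lanes(size_t group_size)
{
    return wavefronts_per_group(group_size) * WAVEFRONT_SIZE - group_size;
}
```

Under these assumptions, a work-group of 100 work-items occupies two wavefronts and leaves 28 lanes idle, whereas a work-group of 256 fills four wavefronts exactly.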
OpenCL's API also supports the concept of a task dispatch. This is equivalent to
executing a kernel on a compute device with a work-group and NDRange
containing a single work-item. Parallelism is expressed using vector data types
implemented by the device, enqueuing multiple tasks, and/or enqueuing native
kernels developed using a programming model orthogonal to OpenCL.
[Figure 1.1: diagram; recoverable labels: Context, Queue, __kernel foo(...) {}, Local Memory, Global/Constant Memory]
The devices are capable of running data- and task-parallel work. A kernel can be
executed as a function of multi-dimensional domains of indices. Each element is
called a work-item; the total number of indices is defined as the global work-size.
The global work-size can be divided into sub-domains, called work-groups, and
individual work-items within a group can communicate through global or locally
shared memory. Work-items are synchronized through barrier or fence
operations. Figure 1.1 is a representation of the host/device architecture with a
single platform, consisting of a GPU and a CPU.
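How a global work-size is divided into work-groups can be sketched on the host for the one-dimensional case. This is an illustrative model, not the runtime's implementation: it assumes a zero global offset and a global work-size that is an exact multiple of the work-group size, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work-item inside the index space (1D case). */
typedef struct {
    size_t group_id; /* which work-group the work-item belongs to */
    size_t local_id; /* index of the work-item inside that work-group */
} work_item_pos;

/* Split a global index into (work-group, local index). */
static work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    work_item_pos p;
    assert(local_size > 0);
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Number of work-groups covering the global work-size. */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with a work-group size of 64, the work-item with global index 130 is local index 2 of work-group 2, and a global work-size of 512 yields 8 work-groups.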
An OpenCL application is built by first querying the runtime to determine which
platforms are present. There can be any number of different OpenCL
implementations installed on a single system. The desired OpenCL platform can
be selected by matching the platform vendor string to the desired vendor name,
such as Advanced Micro Devices, Inc. The next step is to create a context. As
shown in Figure 1.1, an OpenCL context has associated with it a number of
compute devices (for example, CPU or GPU devices). Within a context, OpenCL
guarantees a relaxed consistency between these devices. This means that
memory objects, such as buffers or images, are allocated per context; but
changes made by one device are only guaranteed to be visible to another device
at well-defined synchronization points. For this, OpenCL provides events, with the
ability to synchronize on a given event to enforce the correct order of execution.
Many operations are performed with respect to a given context; there also are
many operations that are specific to a device. For example, program compilation
and kernel execution are done on a per-device basis. Performing work with a
device, such as executing kernels or moving data to and from the device's local
memory, is done using a corresponding command queue. A command queue is
associated with a single device and a given context; all work for a specific device
is done through this interface. Note that while a single command queue can be
associated with only a single device, there is no limit to the number of command
queues that can point to the same device. For example, it is possible to have
one command queue for executing kernels and a command queue for managing
data transfers between the host and the device.
Most OpenCL programs follow the same pattern. Given a specific platform, select
a device or devices to create a context, allocate memory, create device-specific
command queues, and perform data transfers and computations. Generally, the
platform is the gateway to accessing specific devices; given these devices and
a corresponding context, the application is independent of the platform. Given a
context, the application can:
Submit the kernel (with appropriate arguments) to the command queue for
execution.
1.4 Synchronization
The two domains of synchronization in OpenCL are work-items in a single
work-group and command-queue(s) in a single context. Work-group barriers enable
synchronization of work-items in a work-group. Each work-item in a work-group
must first execute the barrier before executing any instruction beyond this barrier.
Either all of, or none of, the work-items in a work-group must encounter the
barrier. A barrier or mem_fence operation does not have global scope; it is
relevant only to the local work-group on which it operates.
There are two types of synchronization between commands in a command-queue:
Memory type    Description
private
local
global
constant       Read-only region for host-allocated and -initialized objects that
               are not changed during kernel execution.
host (CPU)
PCIe           Part of host (CPU) memory accessible from, and modifiable by, the
               host program and the GPU compute device. Modifying this memory
               requires synchronization between the GPU compute device and the
               CPU.
[Figure 1.2: diagram; recoverable labels: Compute Device, Compute Unit 1 ... Compute Unit n, Proc. Elem. (ALU), Private Memory (Reg Files), Local Mem. (LDS), L1, Global Share Mem. (GDS), GLOBAL MEMORY, CONSTANT MEMORY, Compute Device Memory (VRAM), Host, DMA, PCIe]
Figure 1.3 illustrates the standard dataflow between host (CPU) and GPU.
[Figure 1.3: dataflow diagram; recoverable labels: HOST, PCIe, GLOBAL, LOCAL, PRIVATE]
There are two ways to copy data from the host to the GPU compute device
memory:
1.5.1
[Figure 1.4: diagram; recoverable labels: Compute Device, Work-Group, Work-Item, Private Memory, LDS, Frame Buffer, Global/Constant Memory, Host, Host Memory]
Physically located on-chip, directly next to the ALUs, the LDS is approximately
one order of magnitude faster than global memory (assuming no bank conflicts).
In pre-GCN devices, there are 32 kB memory per compute unit, segmented into
32 or 16 banks (depending on the GPU type) of 1 k dwords (for 32 banks) or 2 k
dwords (for 16 banks). Each bank is a 256x32 two-port RAM (1R/1W per clock
cycle). Dwords are placed in the banks serially, but all banks can execute a store
or load simultaneously. One work-group can request up to 32 kB memory. Reads
across wavefront are dispatched over four cycles in waterfall.
GCN devices contain 64 kB memory per compute unit and allow up to a
maximum of 32 kB per workgroup.
The high bandwidth of the LDS memory is achieved not only through its proximity
to the ALUs, but also through simultaneous access to its memory banks. Thus,
it is possible to concurrently execute 32 write or read instructions, each nominally
32 bits; extended instructions, read2/write2, can be 64 bits each. If, however,
more than one access attempt is made to the same bank at the same time, a
bank conflict occurs. In this case, for indexed and atomic operations, hardware
prevents the attempted concurrent accesses to the same bank by turning them
into serial accesses. This decreases the effective bandwidth of the LDS. For
maximum throughput (optimal efficiency), therefore, it is important to avoid bank
conflicts. A knowledge of request scheduling and address mapping is key to
achieving this.
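The effect of address mapping on bank conflicts can be modeled with a short host-side sketch. It rests on assumptions that go beyond the text above: consecutive dwords are taken to map to consecutive banks (bank = dword index modulo 32), and every access that lands on an already-used bank is counted as a conflict, ignoring any broadcast of identical addresses. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32 banks, each one dword (4 bytes) wide. */
#define LDS_BANKS 32u

/* Bank selected by a byte address under the assumed serial dword mapping. */
static unsigned lds_bank(uint32_t byte_addr)
{
    return (byte_addr / 4u) % LDS_BANKS;
}

/* Largest number of accesses that land on a single bank. A result of 1
 * means no conflicts; a result of k means the worst bank is serialized
 * k ways in this simplified model. */
static unsigned max_bank_load(const uint32_t *byte_addrs, size_t n)
{
    unsigned hits[LDS_BANKS] = {0};
    unsigned worst = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        unsigned b = lds_bank(byte_addrs[i]);
        if (++hits[b] > worst)
            worst = hits[b];
    }
    return worst;
}

/* Byte addresses touched by n work-items that each access one dword,
 * spaced stride_dwords apart. */
static void strided_addrs(uint32_t *addrs, size_t n, uint32_t stride_dwords)
{
    size_t i;
    assert(addrs != NULL);
    for (i = 0; i < n; ++i)
        addrs[i] = (uint32_t)i * stride_dwords * 4u;
}
```

In this model, 32 work-items reading with a stride of one dword touch 32 distinct banks, a stride of two dwords doubles up on every even bank, and a stride of 32 dwords sends every access to bank 0.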
1.5.2
[Figure 1.5: dataflow diagram; recoverable labels: Work-Item, Private Memory, LDS, Global Memory, (Images), Texture L1 (per Compute Unit), Texture L2 (Global), VRAM]
To load data into LDS from global memory, it is read from global memory and
placed into the work-item's registers; then, a store is performed to LDS. Similarly,
to store data into global memory, data is read from LDS and placed into the
work-item's registers, then placed into global memory. To make effective use of the
LDS, an algorithm must perform many operations on what is transferred between
global memory and LDS. It also is possible to load data from a memory buffer
directly into LDS, bypassing VGPRs.
LDS atomics are performed in the LDS hardware. (Thus, although ALUs are not
directly used for these operations, latency is incurred by the LDS executing this
function.) If the algorithm does not require write-to-read reuse (the data is read
only), it usually is better to use the image dataflow (see right side of Figure 1.5)
because of the cache hierarchy.
Actually, buffer reads may use L1 and L2. When caching is not used for a buffer,
reads from that buffer bypass L2. After a buffer read, the line is invalidated; then,
on the next read, it is read again (from the same wavefront or from a different
clause). After a buffer write, the changed parts of the cache line are written to
memory.
Buffers and images are written through the texture L2 cache, but this is flushed
immediately after an image write.
In GCN devices, both reads and writes happen through L1 and L2.
The data in private memory is first placed in registers. If more private memory is
used than can be placed in registers, or dynamic indexing is used on private
arrays, the overflow data is placed (spilled) into scratch memory. Scratch memory
1.5.3 Memory Access
Using local memory (known as local data store, or LDS, as shown in Figure 1.2)
typically is an order of magnitude faster than accessing host memory through
global memory (VRAM), which is one order of magnitude faster again than PCIe.
However, stream cores do not directly access memory; instead, they issue
memory requests through dedicated hardware units. When a work-item tries to
access memory, the work-item is transferred to the appropriate fetch unit. The
work-item then is deactivated until the access unit finishes accessing memory.
Meanwhile, other work-items can be active within the compute unit, contributing
to better performance. The data fetch units handle three basic types of memory
operations: loads, stores, and streaming stores. GPU compute devices can store
writes to random memory locations using global buffers.
1.5.4 Global Memory
The global memory lets applications read from, and write to, arbitrary locations
in memory. When using global memory, such read and write operations from the
stream kernel are done using regular GPU compute device instructions with the
global memory used as the source or destination for the instruction. The
programming interface is similar to load/store operations used with CPU
programs, where the relative address in the read/write buffer is specified.
When using a global memory, each work-item can write to an arbitrary location
within it. Global memory uses a linear layout. If consecutive addresses are written,
the compute unit issues a burst write for more efficient memory access. Only
read-only buffers, such as constants, are cached.
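Whether a set of writes targets consecutive addresses, the case described above as eligible for a burst write, can be checked with a simple host-side helper. This is a sketch for reasoning about access patterns only; the function name is hypothetical, and the exact conditions under which the hardware issues a burst write are not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when each address is exactly elem_size bytes after the
 * previous one (a linear, consecutive pattern), and 0 otherwise. */
static int is_consecutive(const uint64_t *byte_addrs, size_t n,
                          size_t elem_size)
{
    size_t i;
    assert(elem_size > 0);
    for (i = 1; i < n; ++i) {
        if (byte_addrs[i] != byte_addrs[i - 1] + elem_size)
            return 0;
    }
    return 1;
}
```

A kernel in which work-item i writes dst[i] produces a consecutive pattern in this sense; one in which work-item i writes dst[2 * i] does not.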
1.5.5 Image Read/Write
Image reads are done by addressing the desired location in the input memory
using the fetch unit. The fetch units can process either 1D or 2D addresses.
These addresses can be normalized or un-normalized. Normalized coordinates
are between 0.0 and 1.0 (inclusive). For the fetch units to handle 2D addresses
1.6.1
Example Code 1
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
// A minimalist OpenCL program.
#include <CL/cl.h>
#include <stdio.h>
#define NWITEMS 512
// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )         \n"
"{                                                  \n"
"    dst[get_global_id(0)] = get_global_id(0);      \n"
"}                                                  \n";
    // 6. Launch the kernel. Let OpenCL pick the local work size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg(kernel, 0, sizeof(buffer), (void*) &buffer);
    clEnqueueNDRangeKernel( queue,
                            kernel,
                            1,
                            NULL,
                            &global_work_size,
                            NULL, 0, NULL, NULL);
    clFinish( queue );
    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue,
                                          buffer,
                                          CL_TRUE,
                                          CL_MAP_READ,
                                          0,
                                          NWITEMS * sizeof(cl_uint),
                                          0, NULL, NULL, NULL );
    int i;
    for(i=0; i < NWITEMS; i++)
        printf("%d %u\n", i, ptr[i]);   // ptr[i] is an unsigned cl_uint
    return 0;
}
1.6.2
This removes the need to error check after each OpenCL call. If there is an
error, the C++ bindings code throws an exception that is caught at the end of
the try block, where we can clean up the host memory allocations. In this
example, the C++ objects representing OpenCL resources (cl::Context,
cl::CommandQueue, etc.) are declared as automatic variables, so they do not
7. Create two buffers, corresponding to the X and Y vectors. Ensure the host-side buffers, pX and pY, are allocated and initialized. The
CL_MEM_COPY_HOST_PTR flag instructs the runtime to copy over the
contents of the host pointer pX in order to initialize the buffer bufX. The bufX
buffer uses the CL_MEM_READ_ONLY flag, while bufY requires the
CL_MEM_READ_WRITE flag.
bufX = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * length, pX);
8. Create a program object from the kernel source string, build the program for
our devices, and create a kernel object corresponding to the SAXPY kernel.
(At this point, it is possible to create multiple kernel objects if the program
contains more than one kernel.)
cl::Program::Sources sources(1, std::make_pair(kernelStr.c_str(),
kernelStr.length()));
program = cl::Program(context, sources);
program.build(devices);
kernel = cl::Kernel(program, "saxpy");
9. Enqueue the kernel for execution on the device (GPU in our example).
Set each argument individually in separate kernel.setArg() calls. The
arguments do not need to be set again for subsequent kernel enqueue calls.
Reset only those arguments that are to pass a new value to the kernel. Then,
enqueue the kernel to the command queue with the appropriate global and
local work sizes.
kernel.setArg(0, bufX);
kernel.setArg(1, bufY);
kernel.setArg(2, a);
queue.enqueueNDRangeKernel(kernel, cl::NDRange(),
cl::NDRange(length), cl::NDRange(64));
10. Read back the results from bufY to the host pointer pY. We will make this a
blocking call (using the CL_TRUE argument) since we do not want to proceed
before the kernel has finished execution and we have our results back.
queue.enqueueReadBuffer(bufY, CL_TRUE, 0, length * sizeof(cl_float),
pY);
11. Clean up the host resources (pX and pY). OpenCL resources are cleaned up
by the C++ bindings support code.
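Once the results have been read back into pY, they can be checked against a plain CPU computation of the same SAXPY operation. The following is a minimal sketch of such a check, not part of the example program: the function names are hypothetical, and the caller is assumed to keep a copy of the original Y values, because the kernel overwrites Y in place.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the SAXPY kernel: y[i] = a * x[i] + y[i]. */
static void saxpy_reference(size_t n, float a, const float *x, float *y)
{
    size_t i;
    assert(x != NULL && y != NULL);
    for (i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Index of the first element where the two arrays differ by more than
 * tol, or n when all elements agree. */
static size_t first_mismatch(size_t n, const float *expected,
                             const float *actual, float tol)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        float diff = expected[i] - actual[i];
        if (diff < 0.0f)
            diff = -diff;
        if (diff > tol)
            return i;
    }
    return n;
}
```

A host program could run saxpy_reference on the saved copy of Y and then call first_mismatch with the array read back from the device; a return value equal to the vector length indicates agreement within the chosen tolerance.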
The catch(cl::Error err) block handles exceptions thrown by the C++
bindings code. If there is an OpenCL call error, it prints out the name of the call
and the error code (codes are defined in CL/cl.h). If there is a kernel compilation
error, the error code is CL_BUILD_PROGRAM_FAILURE, in which case it is
necessary to print out the build log.
Example Code 2
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <cstdlib>    // malloc, free
#include <cstring>    // strcmp
#include <iostream>
#include <string>
#include <vector>
using std::cout;
using std::cerr;
using std::endl;
using std::string;
/////////////////////////////////////////////////////////////////
// Helper function to print vector elements
/////////////////////////////////////////////////////////////////
void printVector(const std::string arrayName,
const cl_float * arrayData,
const unsigned int length)
{
int numElementsToPrint = (256 < length) ? 256 : length;
cout << endl << arrayName << ":" << endl;
for(int i = 0; i < numElementsToPrint; ++i)
cout << arrayData[i] << " ";
cout << endl;
}
/////////////////////////////////////////////////////////////////
// Globals
/////////////////////////////////////////////////////////////////
int length = 256;
cl_float * pX = NULL;
cl_float * pY = NULL;
cl_float a = 2.f;
std::vector<cl::Platform> platforms;
cl::Context context;
std::vector<cl::Device> devices;
cl::CommandQueue queue;
cl::Program program;
cl::Kernel kernel;
cl::Buffer bufX;
cl::Buffer bufY;
/////////////////////////////////////////////////////////////////
// The saxpy kernel
/////////////////////////////////////////////////////////////////
string kernelStr =
    "__kernel void saxpy(const __global float * x,\n"
    "                    __global float * y,\n"
    "                    const float a)\n"
    "{\n"
    "    uint gid = get_global_id(0);\n"
    "    y[gid] = a* x[gid] + y[gid];\n"
    "}\n";
/////////////////////////////////////////////////////////////////
// Allocate and initialize memory on the host
/////////////////////////////////////////////////////////////////
void initHost()
{
size_t sizeInBytes = length * sizeof(cl_float);
pX = (cl_float *) malloc(sizeInBytes);
if(pX == NULL)
throw(string("Error: Failed to allocate input memory on host\n"));
pY = (cl_float *) malloc(sizeInBytes);
if(pY == NULL)
throw(string("Error: Failed to allocate input memory on host\n"));
for(int i = 0; i < length; i++)
{
pX[i] = cl_float(i);
pY[i] = cl_float(length-1-i);
}
printVector("X", pX, length);
printVector("Y", pY, length);
}
/////////////////////////////////////////////////////////////////
// Release host memory
/////////////////////////////////////////////////////////////////
void cleanupHost()
{
if(pX)
{
free(pX);
pX = NULL;
}
if(pY != NULL)
{
free(pY);
pY = NULL;
}
}
int
main(int argc, char * argv[])
{
try
{
/////////////////////////////////////////////////////////////////
// Allocate and initialize memory on the host
/////////////////////////////////////////////////////////////////
initHost();
/////////////////////////////////////////////////////////////////
// Find the platform
/////////////////////////////////////////////////////////////////
cl::Platform::get(&platforms);
std::vector<cl::Platform>::iterator iter;
for(iter = platforms.begin(); iter != platforms.end(); ++iter)
{
if(!strcmp((*iter).getInfo<CL_PLATFORM_VENDOR>().c_str(),
"Advanced Micro Devices, Inc."))
{
break;
}
}
/////////////////////////////////////////////////////////////////
// Create an OpenCL context
/////////////////////////////////////////////////////////////////
cl_context_properties cps[3] = { CL_CONTEXT_PLATFORM,
(cl_context_properties)(*iter)(), 0 };
context = cl::Context(CL_DEVICE_TYPE_GPU, cps);
/////////////////////////////////////////////////////////////////
// Detect OpenCL devices
/////////////////////////////////////////////////////////////////
devices = context.getInfo<CL_CONTEXT_DEVICES>();
/////////////////////////////////////////////////////////////////
// Create an OpenCL command queue
/////////////////////////////////////////////////////////////////
queue = cl::CommandQueue(context, devices[0]);
/////////////////////////////////////////////////////////////////
// Create OpenCL memory buffers
/////////////////////////////////////////////////////////////////
bufX = cl::Buffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * length,
pX);
bufY = cl::Buffer(context,
CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * length,
pY);
/////////////////////////////////////////////////////////////////
// Load CL file, build CL program object, create CL kernel object
/////////////////////////////////////////////////////////////////
cl::Program::Sources sources(1, std::make_pair(kernelStr.c_str(),
kernelStr.length()));
program = cl::Program(context, sources);
program.build(devices);
kernel = cl::Kernel(program, "saxpy");
/////////////////////////////////////////////////////////////////
// Set the arguments that will be used for kernel execution
/////////////////////////////////////////////////////////////////
kernel.setArg(0, bufX);
kernel.setArg(1, bufY);
kernel.setArg(2, a);
/////////////////////////////////////////////////////////////////
// Enqueue the kernel to the queue
// with appropriate global and local work sizes
/////////////////////////////////////////////////////////////////
queue.enqueueNDRangeKernel(kernel, cl::NDRange(),
cl::NDRange(length), cl::NDRange(64));
/////////////////////////////////////////////////////////////////
// Enqueue blocking call to read back buffer Y
/////////////////////////////////////////////////////////////////
1.6.3
4. The global work size is computed for each device. A simple heuristic is used
to ensure an optimal number of threads on each device. For the CPU, a
given CL implementation can translate one work-item per CL compute unit
into one thread per CPU core.
On the GPU, an initial multiple of the wavefront size is used, which is
adjusted to ensure even divisibility of the input data over all threads. The
value of 7 is a minimum value to keep all independent hardware units of the
compute units busy, and to provide a minimum amount of memory latency
hiding for a kernel with little ALU activity.
5. After the kernels are built, the code prints errors that occurred during kernel
compilation and linking.
6. The main loop is set up so that the measured timing reflects the actual kernel
performance. If a sufficiently large NLOOPS is chosen, effects from kernel
launch time and delayed buffer copies to the device by the CL runtime are
minimized. Note that while only a single clFinish() is executed at the end
of the timing run, the two kernels are always linked using an event to ensure
serial execution.
The bandwidth is expressed as number of input bytes processed. For high-end graphics cards, the bandwidth of this algorithm is about an order of
magnitude higher than that of the CPU, due to the parallelized memory
subsystem of the graphics card.
7. The results then are checked against the comparison value. This also
establishes that the result is the same on both CPU and GPU, which can
serve as the first verification test for newly written kernel code.
8. Note the use of the debug buffer to obtain some runtime variables. Debug
buffers also can be used to create short execution traces for each thread,
assuming the device has enough memory.
9. You can use the Timer.cpp and Timer.h files from the TransferOverlap
sample, which is in the SDK samples.
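The bandwidth figure mentioned in step 6 can be sketched as follows. This is a minimal host-side illustration, assuming the bandwidth is the number of input bytes processed over all timed iterations divided by the elapsed time; the function name bandwidthGBps and its parameters are hypothetical and do not come from the sample.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s: total input bytes processed over all timed
// iterations, divided by the elapsed wall-clock time in seconds.
// (Hypothetical helper; the name and signature are assumptions.)
double bandwidthGBps(std::size_t numItems, std::size_t bytesPerItem,
                     unsigned loops, double elapsedSeconds)
{
    assert(elapsedSeconds > 0.0);
    const double totalBytes = static_cast<double>(numItems) *
                              static_cast<double>(bytesPerItem) * loops;
    return totalBytes / elapsedSeconds / 1e9;
}
```

With the sizes used in the example (4096*4096 items of 4 bytes each, 500 loops), an elapsed time of one second would correspond to roughly 33.6 GB/s.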
Kernel Code
10. The code uses four-component vectors (uint4) so the compiler can identify
concurrent execution paths as often as possible. On the GPU, this can be
used to further optimize memory accesses and distribution across ALUs. On
the CPU, it can be used to enable SSE-like execution.
11. The kernel sets up a memory access pattern based on the device. For the
CPU, the source buffer is chopped into continuous buffers: one per thread.
Each CPU thread serially walks through its buffer portion, which results in
good cache and prefetch behavior for each core.
On the GPU, each thread walks the source buffer using a stride of the total
number of threads. As many threads are executed in parallel, the result is a
maximally coalesced memory pattern requested from the memory back-end.
For example, if each compute unit has 16 physical processors, 16 uint4
requests are produced in parallel, per clock, for a total of 256 bytes per clock.
12. The kernel code uses a reduction consisting of three stages: __global to
__private, __private to __local, which is flushed to __global, and finally
__global to __global. In the first loop, each thread walks __global
memory, and reduces all values into a min value in __private memory
(typically, a register). This is the bulk of the work, and is mainly bound by
__global memory bandwidth. The subsequent reduction stages are brief in
comparison.
13. Next, all per-thread minimum values inside the work-group are reduced to a
__local value, using an atomic operation. Access to the __local value is
serialized; however, the number of these operations is very small compared
to the work of the previous reduction stage. The threads within a work-group
are synchronized through a local barrier(). The reduced min value is
stored in __global memory.
14. After all work-groups are finished, a second kernel reduces all work-group
values into a single value in __global memory, using an atomic operation.
This is a minor contributor to the overall runtime.
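The access patterns of step 11 and the staged reduction of steps 12 to 14 can be emulated on the host as a rough sketch. It makes several simplifying assumptions: scalar elements stand in for the uint4 vectors, a sequential loop stands in for the parallel work-items, and plain min operations stand in for the atomic operations and barriers. The names visitedIndex and stagedMin are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Index of the n-th element visited by work-item gid.
// contiguous == true : CPU-style, each work-item owns one contiguous chunk.
// contiguous == false: GPU-style, work-items interleave with a stride equal
//                      to the total number of work-items.
std::size_t visitedIndex(bool contiguous, std::size_t gid, std::size_t n,
                         std::size_t count, std::size_t globalSize)
{
    return contiguous ? gid * count + n : gid + n * globalSize;
}

// Sequential emulation of the three reduction stages.
std::uint32_t stagedMin(const std::vector<std::uint32_t>& src,
                        std::size_t globalSize, std::size_t localSize,
                        bool contiguous)
{
    assert(globalSize > 0 && localSize > 0 && globalSize % localSize == 0);
    assert(!src.empty() && src.size() % globalSize == 0);
    const std::size_t count = src.size() / globalSize;
    const std::uint32_t kMax = std::numeric_limits<std::uint32_t>::max();

    // One slot per work-group, playing the role of the __local value that
    // is flushed to __global.
    std::vector<std::uint32_t> groupMin(globalSize / localSize, kMax);

    for (std::size_t gid = 0; gid < globalSize; ++gid) {
        // Stage 1: __global to __private (the bulk of the work).
        std::uint32_t pmin = kMax;
        for (std::size_t n = 0; n < count; ++n)
            pmin = std::min(pmin,
                            src[visitedIndex(contiguous, gid, n, count, globalSize)]);

        // Stage 2: __private to the per-group value (atomic in the kernel).
        std::uint32_t& lmin = groupMin[gid / localSize];
        lmin = std::min(lmin, pmin);
    }

    // Stage 3: reduce the per-group values to a single result.
    return *std::min_element(groupMin.begin(), groupMin.end());
}
```

Both access patterns visit every element exactly once, so the two variants return the same minimum; only the order in which memory is touched differs.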
Example Code 3
//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "Timer.h"
#define NDEVS
"   for( int n=0; n < count; n++, idx += stride )               \n"
"   {                                                           \n"
"      pmin = min( pmin, src[idx].x );                          \n"
"      pmin = min( pmin, src[idx].y );                          \n"
"      pmin = min( pmin, src[idx].z );                          \n"
"      pmin = min( pmin, src[idx].w );                          \n"
"   }                                                           \n"
"                                                               \n"
"   // 12. Reduce min values inside work-group.                 \n"
"                                                               \n"
"   if( get_local_id(0) == 0 )                                  \n"
"      lmin[0] = (uint) -1;                                     \n"
"                                                               \n"
"   barrier( CLK_LOCAL_MEM_FENCE );                             \n"
"                                                               \n"
"   (void) atom_min( lmin, pmin );                              \n"
"                                                               \n"
"   barrier( CLK_LOCAL_MEM_FENCE );                             \n"
"                                                               \n"
"   // Write out to __global.                                   \n"
"                                                               \n"
"   if( get_local_id(0) == 0 )                                  \n"
"      gmin[ get_group_id(0) ] = lmin[0];                       \n"
"                                                               \n"
"   // Dump some debug information.                             \n"
"                                                               \n"
"   if( get_global_id(0) == 0 )                                 \n"
"   {                                                           \n"
"      dbg[0] = get_num_groups(0);                              \n"
"      dbg[1] = get_global_size(0);                             \n"
"      dbg[2] = count;                                          \n"
"      dbg[3] = stride;                                         \n"
"   }                                                           \n"
"}                                                              \n"
"                                                               \n"
"// 13. Reduce work-group min values from __global to __global. \n"
"                                                               \n"
"__kernel void reduce( __global uint4 *src,                     \n"
"                      __global uint *gmin )                    \n"
"{                                                              \n"
"   (void) atom_min( gmin, gmin[get_global_id(0)] ) ;           \n"
"}                                                              \n";
int dev, nw;
cl_device_type devs[NDEVS] = { CL_DEVICE_TYPE_CPU,
CL_DEVICE_TYPE_GPU };
cl_uint *src_ptr;
unsigned int num_src_items = 4096*4096;
cl_uint a = (cl_uint) ltime,
        b = (cl_uint) ltime;
cl_uint min = (cl_uint) -1;
// Do serial computation of min() for result verification.
for( int i=0; i < num_src_items; i++ )
{
src_ptr[i] = (cl_uint) (b = ( a * ( b & 65535 )) + ( b >> 16 ));
min = src_ptr[i] < min ? src_ptr[i] : min;
}
// Get a platform.
clGetPlatformIDs( 1, &platform, NULL );
// 3. Iterate over devices.
for(dev=0; dev < NDEVS; dev++)
{
cl_device_id device;
cl_context context;
cl_command_queue queue;
cl_program program;
cl_kernel minp;
cl_kernel reduce;
cl_mem src_buf;
cl_mem dst_buf;
cl_mem dbg_buf;
cl_uint *dst_ptr,
        *dbg_ptr;
cl_uint compute_units;
size_t global_work_size;
size_t local_work_size;
size_t num_groups;
clGetDeviceInfo( device,
CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(cl_uint),
&compute_units,
NULL);
if( devs[dev] == CL_DEVICE_TYPE_CPU )
{
global_work_size = compute_units * 1;
local_work_size = 1;
}
else
{
cl_uint ws = 64;
global_work_size = compute_units * 7 * ws; // 7 wavefronts per SIMD
while( (num_src_items / 4) % global_work_size != 0 )
global_work_size += ws;
local_work_size = ws;
}
num_groups = global_work_size / local_work_size;
// Create a context and command queue on that device.
context = clCreateContext( NULL,
1,
&device,
NULL, NULL, NULL);
queue = clCreateCommandQueue(context,
device,
0, NULL);
// Minimal error check.
if( queue == NULL )
{
printf("Compute device setup failed\n");
return(-1);
}
// Perform runtime source compilation, and obtain kernel entry point.
program = clCreateProgramWithSource( context,
1,
&kernel_source,
NULL, NULL );
//Tell compiler to dump intermediate .il and .isa GPU files.
ret = clBuildProgram( program,
1,
&device,
"-save-temps",
NULL, NULL );
// 5. Print compiler error messages
if(ret != CL_SUCCESS)
{
printf("clBuildProgram failed: %d\n", ret);
char buf[0x10000];
clGetProgramBuildInfo( program,
device,
CL_PROGRAM_BUILD_LOG,
0x10000,
buf,
NULL);
printf("\n%s\n", buf);
return(-1);
}
minp = clCreateKernel( program, "minp", NULL );
reduce = clCreateKernel( program, "reduce", NULL );
// Create input, output and debug buffers.
src_buf = clCreateBuffer( context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
num_src_items * sizeof(cl_uint),
src_ptr,
NULL );
dst_buf = clCreateBuffer( context,
CL_MEM_READ_WRITE,
num_groups * sizeof(cl_uint),
NULL, NULL );
dbg_buf = clCreateBuffer( context,
CL_MEM_WRITE_ONLY,
global_work_size * sizeof(cl_uint),
NULL, NULL );
clSetKernelArg(minp, 0, sizeof(void *), (void*) &src_buf);
clSetKernelArg(minp, 1, sizeof(void *), (void*) &dst_buf);
clSetKernelArg(minp, 2, 1*sizeof(cl_uint), (void*) NULL);
clSetKernelArg(minp, 3, sizeof(void *), (void*) &dbg_buf);
clSetKernelArg(minp, 4, sizeof(num_src_items), (void*) &num_src_items);
clSetKernelArg(minp, 5, sizeof(dev), (void*) &dev);
clSetKernelArg(reduce, 0, sizeof(void *), (void*) &src_buf);
clSetKernelArg(reduce, 1, sizeof(void *), (void*) &dst_buf);
CPerfCounter t;
t.Reset();
t.Start();
// 6. Main timing loop.
#define NLOOPS 500
cl_event ev;
int nloops = NLOOPS;
while(nloops--)
{
clEnqueueNDRangeKernel( queue,
minp,
1,
NULL,
&global_work_size,
&local_work_size,
0, NULL, &ev);
clEnqueueNDRangeKernel( queue,
reduce,
1,
NULL,
&num_groups,
NULL, 1, &ev, NULL);
}
clFinish( queue );
t.Stop();
Chapter 2
AMD Implementation
Figure 2.1 (block diagram; labels: Compute Applications, Libraries, Third-Party Tools, OpenCL Runtime, Multicore CPUs, AMD GPUs)
The latest generations of AMD GPUs use unified shader architectures capable
of running different kernel types interleaved on the same hardware.
Programmable GPU compute devices execute various user-developed programs,
Figure 2.2 (diagram; labels: GPU Device, CUs, Compute Unit 0, Processing Elements, Work-Items, Work-Groups, ND-Range)
Note that in OpenCL 2.0, the work-groups are not required to divide evenly into
the NDRange.
OpenCL maps the total number of work-items to be launched onto an n-dimensional grid (ND-Range). The developer can specify how to divide these
items into work-groups. AMD GPUs execute on wavefronts (groups of work-items
executed in lock-step in a compute unit); there is an integer number of
wavefronts in each work-group. Thus, as shown in Figure 2.3, hardware that
schedules work-items for execution in the AMD Accelerated Parallel Processing
Technology environment includes the intermediate step of specifying wavefronts
within a work-group. This permits achieving maximum performance on AMD
GPUs. For a more detailed discussion of wavefronts, see Section 1.1,
Terminology, page 1-1.
Figure 2.3 (diagram; labels: Range, Work-Group, Work-Item, Wavefront (HW-Specific Size), with axes Dimension X, Dim Y, and a third Dimension axis)
2.1.1 Work-Item Processing
All processing elements within a vector unit execute the same instruction in each
cycle. For a typical instruction, 16 processing elements execute one instruction
for 64 work items over 4 cycles. The block of work-items that are executed
together is called a wavefront. For example, on the AMD Radeon HD 290X
compute device, the 16 processing elements within each vector unit execute the
same instruction for four cycles, which effectively appears as a 64-wide compute
unit in execution width.
The size of wavefronts can differ on different GPU compute devices. For
example, some of the low-end and older GPUs, such as the AMD Radeon HD
54XX series graphics cards, have a wavefront size of 32 work-items. Higher-end
and newer AMD GPUs have a wavefront size of 64 work-items.
Compute units operate independently of each other, so it is possible for different
compute units to execute different instructions. It is also possible for different
vector units within a compute unit to execute different instructions.
Before discussing flow control, it is necessary to clarify the relationship of a
wavefront to a work-group. If a user defines a work-group, it consists of one or
more wavefronts. A wavefront is a hardware thread with its own program counter;
it is capable of following control flow independently of other wavefronts. A
wavefront consists of 64 or fewer work-items. The mapping is based on a linear
work-item order. On a device with a wavefront size of 64, work-items 0-63 map
to wavefront 0, work items 64-127 map to wavefront 1, etc. For optimum
hardware usage, an integer multiple of 64 work-items is recommended.
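The linear mapping of work-items to wavefronts described above amounts to integer division. The following is a minimal sketch; the function names are hypothetical, and the wavefront size is passed in as a parameter because it differs between devices.

```cpp
#include <cassert>
#include <cstddef>

// Wavefront that a work-item belongs to, given its linear index within the
// work-group (work-items 0-63 map to wavefront 0 when the size is 64).
std::size_t wavefrontOf(std::size_t localLinearId, std::size_t wavefrontSize)
{
    assert(wavefrontSize > 0);
    return localLinearId / wavefrontSize;
}

// Number of wavefronts needed to hold one work-group (rounded up).
std::size_t wavefrontsPerGroup(std::size_t workGroupSize, std::size_t wavefrontSize)
{
    assert(wavefrontSize > 0);
    return (workGroupSize + wavefrontSize - 1) / wavefrontSize;
}

// Cycles a vector unit needs to run one instruction for a whole wavefront,
// for example 64 work-items on 16 processing elements.
std::size_t cyclesPerInstruction(std::size_t wavefrontSize, std::size_t processingElements)
{
    assert(processingElements > 0 && wavefrontSize % processingElements == 0);
    return wavefrontSize / processingElements;
}
```

For a wavefront size of 64 and 16 processing elements, cyclesPerInstruction yields the four cycles mentioned in the text.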
2.1.2 Work-Item Creation
For each work-group, the GPU compute device spawns the required number of
wavefronts on a single compute unit. If there are non-active work-items within a
wavefront, the processing elements that would have been mapped to those work-items are idle. An example is a work-group whose size is not a multiple of the
wavefront size.
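As a small worked example of the idle processing elements described above, the sketch below counts the inactive lanes in the last wavefront of a work-group and the resulting lane utilization. The function names idleLanes and laneUtilization are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Work-item slots left inactive in the last wavefront of a work-group.
std::size_t idleLanes(std::size_t workGroupSize, std::size_t wavefrontSize)
{
    assert(wavefrontSize > 0);
    return (wavefrontSize - workGroupSize % wavefrontSize) % wavefrontSize;
}

// Fraction of wavefront lanes doing useful work for this work-group size.
double laneUtilization(std::size_t workGroupSize, std::size_t wavefrontSize)
{
    assert(workGroupSize > 0);
    const std::size_t lanes = workGroupSize + idleLanes(workGroupSize, wavefrontSize);
    return static_cast<double>(workGroupSize) / static_cast<double>(lanes);
}
```

A work-group of 100 work-items on a device with 64-wide wavefronts occupies two wavefronts and leaves 28 lanes idle; a work-group of 128 leaves none.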
2.1.3 Flow Control
Flow control, such as branching, is achieved by combining all necessary paths
as a wavefront. If work-items within a wavefront diverge, all paths are executed
serially. For example, if a work-item contains a branch with two paths, the
wavefront first executes one path, then the second path. The total time to
execute the branch is the sum of each path time. An important point is that even
if only one work-item in a wavefront diverges, the rest of the work-items in the
wavefront execute the branch. The number of work-items that must be executed
during a branch is called the branch granularity. On AMD hardware, the branch
granularity is the same as the number of work-items in a wavefront.
Masking of wavefronts is effected by constructs such as:
if(x)
{
. // items within these braces = A
.
.
}
else
{
. // items within these braces = B
.
.
}
The wavefront mask is set true for lanes (elements/items) in which x is true, and
A is executed. The mask then is inverted, and B is executed.
Example 1: If two branches, A and B, take the same amount of time t to execute
over a wavefront, the total time of execution, if any work-item diverges, is 2t.
Loops execute in a similar fashion, where the wavefront occupies a compute unit
as long as there is at least one work-item in the wavefront still being processed.
Thus, the total execution time for the wavefront is determined by the work-item
with the longest execution time.
Example 2: If t is the time it takes to execute a single iteration of a loop; and
within a wavefront all work-items execute the loop one time, except for a single
work-item that executes the loop 100 times, the time it takes to execute that
entire wavefront is 100t.
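Examples 1 and 2 can be reproduced with a simple cost model. This is only a sketch of the reasoning in the text, assuming that a branch path costs its full time whenever at least one work-item takes it and that a loop costs as many iterations as the slowest work-item needs; branchTime and loopTime are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Time for one wavefront to get through an if/else. takesA[i] is true when
// work-item i takes path A. Each path that is taken by at least one
// work-item is executed once for the whole wavefront.
double branchTime(const std::vector<bool>& takesA, double timeA, double timeB)
{
    assert(!takesA.empty());
    const bool anyA = std::find(takesA.begin(), takesA.end(), true) != takesA.end();
    const bool anyB = std::find(takesA.begin(), takesA.end(), false) != takesA.end();
    return (anyA ? timeA : 0.0) + (anyB ? timeB : 0.0);
}

// Time for one wavefront to finish a loop: the wavefront stays resident
// until the work-item with the most iterations is done.
double loopTime(const std::vector<int>& iterationsPerItem, double timePerIteration)
{
    assert(!iterationsPerItem.empty());
    const int longest =
        *std::max_element(iterationsPerItem.begin(), iterationsPerItem.end());
    return longest * timePerIteration;
}
```

With 64 work-items that all take path A, branchTime returns t; as soon as one work-item takes path B it returns 2t. With 63 work-items looping once and one looping 100 times, loopTime returns 100t.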
Figure 2.4 (diagram; labels: GPU Compute Device, Compute Unit, 4 Vector Units)
In GCN devices, each CU includes one Scalar Unit and four Vector (SIMD) units,
each of which contains an array of 16 processing elements (PEs). Each PE
contains one ALU. Each SIMD unit simultaneously executes a single operation
across 16 work items, but each can be working on a separate wavefront.
For example, for the AMD Radeon HD 79XX devices each of the 32 CUs has
one Scalar Unit and four Vector Units. Figure 2.5 shows only two compute
engines/command processors of the array that comprises the compute device of
the AMD Radeon HD 79XX family.
Figure 2.5 (block diagram; labels: 4 Vector Unit and 1 Scalar Unit per compute unit, LDS, L1, SC cache, I cache, Level 2 cache, GDDR5 Memory System)
In Figure 2.5, there are two command processors, which can process two
command queues concurrently. The Scalar Unit, Vector Unit, Level 1 data cache
(L1), and Local Data Share (LDS) are the components of one compute unit, of
which there are 32. The scalar (SC) cache is the scalar unit data cache, and the
Level 2 cache consists of instructions and data.
On GCN devices, the instruction stream contains both scalar and vector
instructions. On each cycle, it selects a scalar instruction and a vector instruction
(as well as a memory operation and a branch operation, if available); it issues
one to the scalar unit, the other to the vector unit; this takes four cycles to issue
over the four vector cores (the same four cycles over which the 16 units execute
64 work-items).
The Asynchronous Compute Engines (ACEs) manage the CUs; a graphics
command processor handles graphics shaders and fixed-function hardware.
2.2.1
2.2.2 Key differences between Southern Islands, Sea Islands, and Volcanic Islands families
The number of Asynchronous Compute Engines (ACEs) and CUs in an AMD
GCN family GPU, and the way they are structured, vary with the GCN device
family, as well as with the device designations within the family.
The ACEs are responsible for managing the CUs and for scheduling and
resource allocation of the compute tasks (but not of the graphics shader tasks).
The ACEs operate independently; the greater the number of ACEs, the greater
is the performance. Each ACE fetches commands from cache or memory, and
2-7
read and write system memory directly from the compute unit through kernel
instructions over the PCIe bus.
2.3.1
2.3.2
Kernels
Constants
DMA Transfers
Certain memory transfer calls use the DMA engine. To properly leverage the
DMA engine, make the associated OpenCL data transfer calls. See the AMD
OpenCL Optimization Reference Guide for more information.
Direct Memory Access (DMA) memory transfers can be executed separately from
the command queue using the DMA engine on the GPU compute device. DMA
calls are executed immediately; and the order of DMA calls and command queue
flushes is guaranteed.
DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there
are no dependencies. However, data is not guaranteed to be ready until the DMA
engine signals that the event or transfer is completed. The application can use
OpenCL to query the hardware for DMA event completion. If used carefully, DMA
transfers are another source of parallelization.
All GCN devices have two DMA engines that can perform bidirectional transfers
over the PCIe bus with multiple queues created in consecutive order, since each
DMA engine is assigned to an odd or an even queue correspondingly.
2.3.3
graphics operations and the other three (in a four-GPU system) for Compute. To
do that, set the GPU_DEVICE_ORDINAL environment parameter, which is a comma-separated list variable:
Another example is a system with eight GPUs, where two distinct OpenCL
applications are running at the same time. The administrator might want to set
GPU_DEVICE_ORDINAL to 0,1,2,3 for the first application, and 4,5,6,7 for the
second application; thus, partitioning the available GPUs so that both
applications can run at the same time.
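To make the comma-separated format concrete, the sketch below splits a value such as 0,1,2,3 into individual device ordinals. It only illustrates the format of the list; it is not how the OpenCL runtime reads the variable, and the helper name parseDeviceOrdinals is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated ordinal list such as "0,1,2,3" into integers.
// (Illustrative only; the runtime performs its own parsing.)
std::vector<int> parseDeviceOrdinals(const std::string& value)
{
    std::vector<int> ordinals;
    std::istringstream in(value);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (item.empty())
            continue;                    // skip empty fields such as "1,,2"
        const int ordinal = std::stoi(item);
        assert(ordinal >= 0);            // device ordinals are not negative
        ordinals.push_back(ordinal);
    }
    return ordinals;
}
```

For the eight-GPU example, the two applications would see the lists {0, 1, 2, 3} and {4, 5, 6, 7} respectively.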
Figure 2.6 (timeline of wavefronts W0 through W3 over cycles 20 to 80; each wavefront is shown as executing, READY, or STALL)
At runtime, wavefront T0 executes until cycle 20; at this time, a stall occurs due
to a memory fetch request. The scheduler then begins execution of the next
wavefront, T1. Wavefront T1 executes until it stalls or completes. New wavefronts
execute, and the process continues until the available number of active
wavefronts is reached. The scheduler then returns to the first wavefront, T0.
If the data wavefront T0 is waiting for has returned from memory, T0 continues
execution. In the example in Figure 2.6, the data is ready, so T0 continues. Since
there were enough wavefronts and processing element operations to cover the
long memory latencies, the compute unit does not idle. This method of memory
latency hiding helps the GPU compute device achieve maximum performance.
If none of T0 through T3 are runnable, the compute unit waits (stalls) until one of T0
through T3 is ready to execute. In the example shown in Figure 2.7, T0 is the first to
continue execution.
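The latency-hiding argument above can be turned into a back-of-the-envelope model. The sketch assumes a strongly simplified scheduler: every wavefront executes for a fixed number of cycles, then stalls for a fixed memory latency, and wavefronts are served round-robin. Real hardware is more complex; the function names are hypothetical.

```cpp
#include <cassert>

// Smallest number of resident wavefronts for which the compute unit never
// idles: while one wavefront waits for memory, the remaining ones must
// supply at least stallCycles cycles of work.
unsigned wavefrontsToHideLatency(unsigned execCycles, unsigned stallCycles)
{
    assert(execCycles > 0);
    const unsigned others = (stallCycles + execCycles - 1) / execCycles;
    return 1 + others;
}

// Cycles per round in which the compute unit has nothing to run.
unsigned idleCyclesPerRound(unsigned wavefronts, unsigned execCycles,
                            unsigned stallCycles)
{
    assert(wavefronts > 0);
    const unsigned covered = (wavefronts - 1) * execCycles;
    return stallCycles > covered ? stallCycles - covered : 0;
}
```

With 20 cycles of execution per wavefront, four wavefronts cover a stall of up to 60 cycles, which corresponds to the situation in Figure 2.6; a longer stall leaves the compute unit idle, as in Figure 2.7.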
Figure 2.7 (timeline of wavefronts W0 through W3 over cycles 20 to 80; all four wavefronts reach STALL before any can resume)
Chapter 3
Building and Running OpenCL Programs
An OpenCL application consists of a host program (C/C++) and an optional
kernel program (.cl). To compile an OpenCL application, the host program must
be compiled; this can be done using an off-the-shelf compiler such as g++ or
MSVC++. The application kernels are compiled into device-specific binaries
using the OpenCL compiler.
3.1.1 Compiling on Windows
To compile OpenCL applications on Windows, Visual Studio 2008 Professional
Edition (or later) or the Intel C (C++) compiler must be installed. All C++ files
must be added to the project, which must have the following settings.
3.1.2 Compiling on Linux
To compile OpenCL applications on Linux, gcc or the Intel C compiler must be
installed. There are two major steps: compiling and linking.
1. Compile all the C++ files (Template.cpp), and get the object files.
For 32-bit object files on a 32-bit system, or 64-bit object files on 64-bit
system:
g++ -o Template.o -c Template.cpp -I$AMDAPPSDKROOT/include
2. Link all the object files generated in the previous step to the OpenCL library
and create an executable.
For linking to a 64-bit library:
g++ -o Template Template.o -lOpenCL -L$AMDAPPSDKROOT/lib/x86_64
3.2.1
specification) as a text buffer to create the program object. If the source code
is in an external file, then it must be read and placed in a text buffer before
passing the buffer to the API.
Note: Most of the examples in this chapter are shown using runtime C APIs. In
order to use the C++ wrapper APIs, one must map (a trivial step) the C APIs to
corresponding C++ wrapper APIs. For brevity, error checking is not shown.
Example creation of program objects from an inline text string
const char *source =
"__kernel void myKernel( __global uint *src, __global uint *dst)\n"
"{ \n"
" uint gid = get_global_id(0); \n"
" dst[gid] = src[gid] * 10; \n"
"} \n";
cl_program program = clCreateProgramWithSource( context, 1,
&source, NULL, NULL );
Example creation of program objects from an external file
std::ifstream f("my_kernel.cl");
std::stringstream st;
st << f.rdbuf();
std::string ss = st.str();
const char* source = ss.c_str();
const size_t length = ss.length();
cl_program program = clCreateProgramWithSource(context, 1, &source,
&length, NULL);
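Because error checking is omitted in the examples above, the following sketch shows one way the file-reading step could report a missing kernel file before the source is handed to clCreateProgramWithSource. The helper name readTextFile is hypothetical and not part of any OpenCL API.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Read a whole text file (for example a .cl kernel file) into a string.
// Returns false if the file cannot be opened; out is left untouched then.
bool readTextFile(const std::string& path, std::string& out)
{
    assert(!path.empty());
    std::ifstream f(path.c_str());
    if (!f.is_open())
        return false;
    std::stringstream st;
    st << f.rdbuf();
    out = st.str();
    return true;
}
```

On success, out.c_str() and out.length() play the roles of source and length in the example above.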
3.2.1.2 Creating program objects from a pre-built binary
OpenCL allows the creation of program objects from binaries previously built for
one or more specific device(s) or from intermediate device-agnostic binaries
(using, for example, the Standard Portable Intermediate Representation (SPIR)
format). Such binaries serve two useful purposes:
The consumer of the OpenCL library can create new program objects using
those binaries for use with their own applications.
In this method, the OpenCL binary is passed to the binaries argument of the
clCreateProgramWithBinary runtime API (for more details, see the OpenCL
specification). If the binary program code is in a file, the binary must be loaded
from the file, the content of the file must be placed in a character buffer, and the
resulting buffer must be passed to the clCreateProgramWithBinary API.
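Loading the binary from a file into a character buffer can be sketched as below. The file must be opened in binary mode because a program binary may contain zero bytes and other non-text data. The helper name loadBinaryFile is hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Load a file byte-for-byte into a buffer. ok is set to false if the file
// cannot be opened, in which case an empty buffer is returned.
std::vector<unsigned char> loadBinaryFile(const std::string& path, bool& ok)
{
    assert(!path.empty());
    std::ifstream f(path.c_str(), std::ios::binary);
    ok = f.is_open();
    if (!ok)
        return std::vector<unsigned char>();
    return std::vector<unsigned char>((std::istreambuf_iterator<char>(f)),
                                      std::istreambuf_iterator<char>());
}
```

The buffer's data() pointer and size() would then be supplied through the binaries and lengths arguments of clCreateProgramWithBinary.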
For information about how to generate device-specific binaries, see Section 3.5
of the OpenCL specification.
For more information about the SPIR format and about how to consume SPIR
binaries, see Appendix G.
3.2.2
Example(s):
Suppose a program object has been created as follows:
cl_program program = clCreateProgramWithSource(context, 1, &source,
&length, NULL);
Next, the program object can be built for all the devices in the context or for a
list of selected devices.
To build the program for all the devices, NULL must be passed as
the target device list argument, as shown below:
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
Build Options:
A list of options can be passed during program build to control each stage of the
building process. The full list includes various categories of options, such as
preprocessor, compiler, optimization, linker, and debugger. Some of them are
standard (specified by Khronos); others are vendor-specific. For details about the
standard options, see the clBuildProgram API description in the OpenCL
specification.
For information about the frequently used standard build options, see 3.3,
Supported Standard OpenCL Compiler Options.
For information about AMD-developed supplemental options and environment
variables, see 3.4, AMD-Developed Supplemental Compiler Options.
Special note for building OpenCL 2.0 programs:
In order to build the program with OpenCL 2.0 support, the -cl-std=CL2.0
option must be specified; otherwise, the highest OpenCL C 1.x language version
supported by each device is used when compiling the program for each device.
OpenCL 2.0 is backwards-compatible with OpenCL 1.x. Applications written for
OpenCL 1.x should run on OpenCL 2.0 without requiring any changes to the
application.
Special note for debugging:
OpenCL provides a way to check and query the compilation/linking errors that
occur during program build. Various build parameters for each device in the
program object can be queried by using the clGetProgramBuildInfo API.
Retrieving the build, compile or link log by using the CL_PROGRAM_BUILD_LOG
input parameter is a useful and frequently-used technique. For details, see the
OpenCL specification.
Example:
cl_int err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
if (err != CL_SUCCESS)
{
printf("clBuildProgram failed: %d\n", err);
char log[0x10000];
clGetProgramBuildInfo( program, device, CL_PROGRAM_BUILD_LOG,
0x10000, log, NULL);
printf("\n%s\n", log);
return -1;
}
-I dir Add the directory dir to the list of directories to be searched for
header files. When parsing #include directives, the OpenCL compiler
resolves relative paths using the current working directory of the application.
-D name Predefine name as a macro, with definition = 1. For -D name=definition, the contents of definition are tokenized and processed
as if they appeared during the translation phase three in a #define directive.
In particular, the definition is truncated by embedded newline characters.
-D options are processed in the order they are given in the options argument
to clBuildProgram.
-g This is an experimental feature that lets you use the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or
To avoid source changes, there are two environment variables that can be used
to change CL options during the runtime.
Now, save these device specific binaries into the files for future use.
Description
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
clEnqueueReadBuffer(),
clEnqueueWriteBuffer()
clEnqueueWaitForEvents()
As illustrated in Figure 3.1, the application can create multiple command queues
(some in libraries, for different components of the application, etc.). These
queues are muxed into one queue per device type. The figure shows command
queues 1 and 3 merged into one CPU device queue (blue arrows); command
queue 2 (and possibly others) are merged into the GPU device queue (red
arrow). The device queue then schedules work onto the multiple compute
resources present in the device. Here, K = kernel commands, M = memory
commands, and E = event commands.
Figure 3.1 (diagram; labels: Programming Layer Command Queues with commands M1 K1 E1 K2 M2 K3 M3, For CPU queue, Device Command Queue for GPU and CPU, K111, K112, Scheduler, CPU Core 1, CPU Core 2, GPU Core 1, GPU Core 2)
}
To create a kernel object for the above kernel, you must pass the program object
corresponding to the kernel to the clCreateKernel function. Assuming that the
program object containing the above kernel function has been created and built
as program, a kernel object for the above kernel would be created as follows:
cl_kernel kernel = clCreateKernel(program, "sample_kernel",
NULL);
Suppose a buffer object and an SVM array have been created as follows:
cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
length * sizeof(cl_uchar), NULL, NULL);
cl_uchar *svmPtr = clSVMAlloc(context, CL_MEM_READ_WRITE,
length * sizeof(cl_uchar), 0);
Now, to set the kernel arguments for the kernel object, the buffer (or SVM array
in OpenCL 2.0) and the corresponding index must be passed to the kernel as
first and second argument, respectively:
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&buffer);
clSetKernelArgSVMPointer(kernel, 1,
3.7.2
3.7.3
Chapter 4
Debugging and Profiling OpenCL
This chapter discusses how to debug and profile OpenCL programs running on
AMD GPU and CPU compute devices. The preferred method is to debug with
AMD CodeXL, as described in Section 4.1, AMD CodeXL GPU Debugger. The
second method, described in Section 4.2, Debugging CPU Kernels with GDB,
is to use experimental features provided by AMD APP SDK (GNU project
debugger, GDB) to debug kernels on x86 CPUs running Linux or cygwin/minGW
under Windows.
4.1.1
Figure 4.1
The CodeXL home page also includes a video illustrating the different features
of CodeXL.
4.1.2
4.1.3 Analyze Mode
Figure 4.2
Figure 4.3
Figure 4.4
Figure 4.5
Figure 4.6
Highlight keywords
The CodeXL editor highlights keywords for easier editing
Figure 4.7
Fix OpenCL compiler errors and warnings in which the kernel file is
the only input
View OpenCL compilation errors and fix immediately.
Figure 4.8
Figure 4.9
Statistics view: AMD Compiler gathers statistics for the use of GPU
resources
A better understanding of this data helps you tune your kernel for better
performance even before running it on a real GPU
The Statistics tab helps detect where bottlenecks are even before
running your application
4.2.4
Notes
4. To make a breakpoint in a working thread with some particular ID in
dimension N, one technique is to set a conditional breakpoint when the
get_global_id(N) == ID. To do this, use:
b [ N | function | kernel_name ] if (get_global_id(N)==ID)
where N can be 0, 1, or 2.
5. For complete GDB documentation, see
http://www.gnu.org/software/gdb/documentation/ .
6. For debugging OpenCL kernels in Windows, a developer can use GDB
running in Cygwin or MinGW. This is done in the same way as described in
sections 3.1 and 3.2.
Chapter 5
OpenCL Static C++ Programming Language
5.1 Overview
This extension defines the OpenCL Static C++ kernel language, which is a form
of the ISO/IEC Programming languages C++ specification [1]. This language
supports overloading and templates that can be resolved at compile time (hence
static), while restricting the use of language features that require dynamic/runtime
resolving. The language is also extended to support most of the features
described in Section 6 of the OpenCL 1.2 specification: new data types (vectors,
images, samplers, etc.), OpenCL 1.2 built-in functions, and more.
5.1.1
Supported Features
The following list contains the major static C++ features supported by this
extension.
Inheritance:
Strict inheritance.
Friend classes.
Multiple inheritance.
Templates:
Kernel templates.
Member templates.
Namespaces.
References.
this operator.
Note that support for templates and overloading greatly improves the efficiency of
writing code: it allows developers to avoid unnecessary replication of code.
Using kernel template and kernel overloading requires support from the runtime
API as well. AMD provides a simple extension to clCreateKernel, which
enables the user to specify the desired kernel.
5.1.2
Unsupported Features
Static C++ features not supported by this extension are:
The :: operator.
5.2.2
-x clc++ is required if the input language is static C++. -x clc++ cannot
be combined with -cl-std=CL2.0; if a -cl-std=CLX.Y option is given, it must be
-cl-std=CL1.2.
5.3.2
5.3 Additions and Changes to Section 6 - The OpenCL 1.2 C Programming Language
5.3.3
Namespaces
Namespaces are supported without change as per [1].
5.3.4
Overloading
As defined in the static C++ language specification, when two or more different
declarations are specified for a single name in the same scope, that name is said
to be overloaded. By extension, two declarations in the same scope that declare
the same name but with different types are called overloaded declarations. Only
kernel and function declarations can be overloaded, not object and type
declarations.
As per the static C++ language specification, a number of restrictions limit how
functions can be overloaded; these restrictions are defined formally in Section 13
of the static C++ language specification. Note that kernels and functions cannot
be overloaded by return type.
Also, the rules for well-formed programs as defined by Section 13 of the static
C++ language specification are lifted to apply to both kernel and function
declarations.
The overloading resolution is per Section 13.1 of the static C++ language
specification, but extended to account for vector types. The algorithm for best
viable function, Section 13.3.3 of the static C++ language specification, is
extended for vector types by inducing a partial-ordering as a function of the
partial-ordering of its elements. Following the existing rules for vector types in the
OpenCL 1.2 specification, explicit conversion between vectors is not allowed.
(This reduces the number of possible overloaded functions with respect to
vectors, but this is not expected to be a particular burden to developers because
explicit conversion can always be applied at the point of function invocation.)
For overloaded kernels, the following syntax is used as part of the kernel name:
foo(type1,...,typen)
where type1,...,typen must be either an OpenCL scalar or vector type, or can
be a user-defined type that is allocated in the same source file as the kernel foo.
To allow overloaded kernels, use the following syntax:
__attribute__((mangled_name(myMangledName)))
The kernel mangled_name is used as a parameter to pass to the
clCreateKernel() API. This mechanism is needed to allow overloaded kernels
without changing the existing OpenCL kernel creation API.
5.3.5
Templates
OpenCL C++ provides unrestricted support for C++ templates, as defined in
Section 14 of the static C++ language specification. The arguments to templates
are extended to allow for all OpenCL base types, including vectors and pointers
qualified with OpenCL C address spaces (i.e. __global, __local, __private,
and __constant).
OpenCL C++ kernels (defined with __kernel) can be templated and can be
called from within an OpenCL C (C++) program or as an external entry point
(from the host).
For kernel templates, the following syntax is used as part of the kernel name
(assuming a kernel called foo):
foo<type1,...,typen>
where type1,...,typen must be either an OpenCL scalar or vector type, or can be
a user-defined type that is allocated in the same source file as the kernel foo. If
a kernel is both overloaded and templated, the following syntax is used:
foo<type1,...,typen>(typen+1,...,typem)
Note that here overloading resolution is done by first matching non-templated
arguments in order of appearance in the definition, then substituting template
5.3.6
Exceptions
Exceptions, as per Section 15 of the static C++ language specification, are not
supported. The keywords try, catch, and throw are reserved, and the OpenCL
C++ compiler must produce a static compile time error if they are used in the
input program.
5.3.7
Libraries
Support for the general utilities library, as defined in Sections 20-21 of the static
C++ language specification, is not provided. The standard static C++ libraries
and STL library are not supported.
5.3.8
Dynamic Operation
Features related to dynamic operation are not supported:
5.3.9
5.4 Examples
5.4.1
void MyFunc()
{
    Test *tempClass = new Test;
    ... // Some OpenCL startup code: create context, queue, etc.
    cl_int err;
    cl_mem classObj = clCreateBuffer(context,
                                     CL_MEM_USE_HOST_PTR, sizeof(Test),
                                     tempClass, &err);
    clEnqueueMapBuffer(..., classObj, ...);
    tempClass->setX(10);
    clEnqueueUnmapMemObject(..., classObj, ...); // class is passed to the device
    clEnqueueNDRangeKernel(..., fooKernel, ...);
    clEnqueueMapBuffer(..., classObj, ...); // class is passed back to the host
}
5.4.2
Kernel Overloading
This example shows how to define and use mangled_name for kernel overloading,
and how to choose the right kernel from the host code. Assume the following
kernels are defined:
__attribute__((mangled_name(testAddFloat4))) kernel void
testAdd(global float4 * src1, global float4 * src2, global float4 * dst)
{
int tid = get_global_id(0);
dst[tid] = src1[tid] + src2[tid];
}
5.4.3
Kernel Template
This example defines a kernel template, testAdd. It also defines two explicit
instantiations of the kernel template, testAddFloat4 and testAddInt8. The names
testAddFloat4 and testAddInt8 are the external names for the two kernel
template instantiations and must be used as parameters when calling the
clCreateKernel API.
template <class T>
kernel void testAdd(global T * src1, global T * src2, global T * dst)
{
int tid = get_global_id(0);
dst[tid] = src1[tid] + src2[tid];
}
template __attribute__((mangled_name(testAddFloat4))) kernel void
testAdd(global float4 * src1, global float4 * src2, global float4 *
dst);
template __attribute__((mangled_name(testAddInt8))) kernel void
testAdd(global int8 * src1, global int8 * src2, global int8 * dst);
Chapter 6
OpenCL 2.0
6.1 Introduction
The OpenCL 2.0 specification is a significant evolution of OpenCL. It introduces
features that allow closer collaboration between the host and OpenCL devices,
such as Shared Virtual Memory (SVM) and device-side enqueue. Other features,
such as pipes and new image-related additions provide effective ways of
expressing heterogeneous programming constructs.
The following sections highlight the salient features of OpenCL 2.0 and provide
usage guidelines.
Pipes
Image Enhancements
Overview
In OpenCL 1.2, the host and OpenCL devices do not share the same virtual
address space. Consequently, the host memory, the device memory, and
communication between the host and the OpenCL devices, need to be explicitly
specified and managed. Buffers may need to be copied over to the OpenCL
device memory for processing and copied back after processing. To access
locations within a buffer (or regions within an image), the appropriate offsets must
be passed to and from the OpenCL devices; a host memory pointer cannot be
used on the OpenCL device.
In OpenCL 2.0, the host and OpenCL devices may share the same virtual
address space. Buffers need not be copied over between devices. When the host
and the OpenCL devices share the address space, communication between the
host and the devices can occur via shared memory (pointers). This simplifies
programming in heterogeneous contexts.
Support for SVM does not imply or require that the host and the OpenCL devices
in an OpenCL 2.0 compliant architecture share actual physical memory. The
OpenCL runtime manages the transfer of data between the host and the OpenCL
devices; the process is transparent to the programmer, who sees a unified
address space.
A caveat, however, concerns situations in which the host and the OpenCL
devices access the same region of memory at the same time. It would be highly
inefficient for the host and the OpenCL devices to have a consistent view of the
memory for each load/store from any device/host. In general, the memory model
of the language or architecture implementation determines how or when a
memory location written by one thread or agent is visible to another. The memory
model also determines to what extent the programmer can control the scope of
such accesses.
OpenCL 2.0 adopts the memory model defined in C++11 with some extensions.
The memory orders taken from C++11 are: "relaxed", "acquire", "release",
"acquire-release", and "sequentially consistent".
OpenCL 2.0 introduces a new (C++11-based) set of atomic operations with
specific memory-model based semantics. Atomic operations are indivisible: a
thread or agent cannot see partial results. The atomic operations supported are:
atomic_load/store
atomic_init
atomic_work_item_fence
atomic_exchange
atomic_compare_exchange
OpenCL 2.0 introduces the concept of "memory scope", which limits the extent
to which atomic operations are visible. For example:
"workgroup" scope means that the updates are to be visible only within the
work group
"device" scope means that the updates are to be visible only within the
device (across workgroups within the device)
For coarse-grained SVM, the synchronization points are: the mapping or unmapping of the SVM memory and kernel launch or completion. This means
that any updates are visible only at the end of the kernel or at the point of
un-mapping the region of memory.
Coarse-grained buffer memory has a fixed virtual address for all the devices
it is allocated on. In the AMD implementation, the physical memory is
allocated on Device Memory.
For fine-grained SVM, the synchronization points include those defined for
coarse-grained SVM as well as atomic operations. This means that updates
are visible at the level of atomic operations on the SVM buffer (for fine-grained
buffer SVM, allocated with the CL_MEM_SVM_ATOMICS flag) or the
SVM system, i.e. anywhere in the SVM (for fine-grained system SVM).
Fine-grained buffer memory has the same virtual address for all devices it is
allocated on. In the AMD implementation, the physical memory is allocated
on the Device-Visible Host Memory. If the fine-grained buffer is allocated with
the CL_MEM_SVM_ATOMICS flag, the memory will be GPU-CPU coherent.
The OpenCL 2.0 specification mandates coarse-grained SVM but not fine-grained SVM.
For details, see Section 3.3 of the OpenCL 2.0 specification.
6.2.2
Usage
In OpenCL 2.0, SVM buffers shared between the host and OpenCL devices are
created by calling clSVMAlloc (or malloc/new in the case of fine-grain system
support). The contents of such buffers may include pointers (into SVM buffers).
Pointer-based data structures are especially useful in heterogeneous
programming scenarios. A typical scenario is as follows:
1. Host creates SVM buffer(s) with clSVMAlloc
2. Host maps the SVM buffer(s) with the blocking call clEnqueueSVMMap
3. Host fills/updates the SVM buffer(s) with data structures, including pointers
4. Host unmaps the SVM buffer(s) by using clEnqueueSVMUnmap
5. Host enqueues processing kernels, passing SVM buffers to the kernels with
calls to clSetKernelArgSVMPointer and/or clSetKernelExecInfo
6.2 Shared Virtual Memory (SVM)
6. The OpenCL 2.0 device processes the structures in SVM buffer(s) including
following/updating pointers.
7. Repeat step 2 through 6 as necessary.
Note that the map and unmap operations in Steps 2 and 4 may be eliminated if
the SVM buffers are created by using the CL_MEM_SVM_FINE_GRAIN_BUFFER
flag, which may not be supported on all devices.
6.2.2.1 Coarse-grained memory
Some applications do not require fine-grained atomics to ensure that the SVM is
consistent across devices after each read/write access. After the initial
map/creation of the buffer, the GPU or any other devices typically read from
memory. Even if the GPU or other devices write to memory, they may not require
a consistent view of the memory.
For example, while searching in parallel on a binary search tree, coarse-grain
buffers are usually sufficient. In general, coarse-grain buffers provide faster
access compared to fine grain buffers as the memory is not required to be
consistent across devices.
for (i = 0; i < keys_per_wi; i++) {
    key = search_keys[init_id + i];
    tmp_node = root;
    while (1) {
        if (!tmp_node || (tmp_node->value == key))
            break;
        tmp_node = (key < tmp_node->value) ? tmp_node->left : tmp_node->right;
    }
    found_nodes[init_id + i] = tmp_node;
}
In the above example, the binary search tree root is created using coarse-grain SVM on the host:
svmTreeBuf = clSVMAlloc(context, CL_MEM_READ_WRITE,
numNodes*sizeof(node), 0);
svmSearchBuf = clSVMAlloc(context, CL_MEM_READ_WRITE,
numKeys*sizeof(searchKey), 0);
The host creates two buffers, svmTreeBuf and svmSearchBuf, to hold the given
tree and the search keys, respectively. After populating the given tree, these two
buffers are passed to the kernel as parameters.
The next task is to create the tree and populate the svmTreeBuf using
clEnqueueSVMMap and clEnqueueSVMUnmap. The host-code method,
cpuCreateBinaryTree, illustrates this mechanism; note the calls to these
map/unmap APIs.
The host then creates the keys to be searched in svmSearchBuf, as the
cpuInitSearchKeys method illustrates. Next, it enqueues the kernel to search
the binary tree for the given keys in the svmSearchBuf, and it sets the parameters
to the kernel using clSetKernelArgSVMPointer:
int status = clSetKernelArgSVMPointer(sample_kernel, 0, (void *)(svmTreeBuf));
status = clSetKernelArgSVMPointer(sample_kernel, 1, (void *)(svmSearchBuf));
Note that the routine passes both svmTreeBuf and svmSearchBuf to the kernel
as parameters. The following node structure demonstrates how to create the tree
on the host using pointers to the left and right children:
typedef struct nodeStruct
{
int value;
struct nodeStruct* left;
struct nodeStruct* right;
} node;
At this point, the advantage of using SVM becomes clear. Because the structure
and its nodes are SVM memory, all the pointer values in these nodes are valid
on the GPUs as well.
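As a minimal host-only sketch of this idea, the following plain C code builds a pointer-linked tree inside one contiguous allocation, roughly the way the host would fill svmTreeBuf between the map and unmap calls. It makes several assumptions: malloc merely stands in for clSVMAlloc, and insert_node, build_tree, and find_node are hypothetical helper names that do not come from the SDK sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct nodeStruct {
    int value;
    struct nodeStruct *left;
    struct nodeStruct *right;
} node;

/* Link node n into the tree rooted at root and return the resulting root. */
static node *insert_node(node *root, node *n)
{
    node **link = &root;
    while (*link != NULL)
        link = (n->value < (*link)->value) ? &(*link)->left : &(*link)->right;
    *link = n;
    return root;
}

/* Build a tree over `count` values, carving every node out of a single
 * allocation. malloc is a stand-in for clSVMAlloc (assumption). */
static node *build_tree(const int *values, size_t count, node **pool_out)
{
    node *pool;
    node *root = NULL;
    assert(count > 0);
    pool = malloc(count * sizeof(node));
    assert(pool != NULL);
    for (size_t i = 0; i < count; i++) {
        pool[i].value = values[i];
        pool[i].left = NULL;
        pool[i].right = NULL;
        root = insert_node(root, &pool[i]);
    }
    *pool_out = pool;
    return root;
}

/* Same traversal idea as the kernel: follow child pointers until the key
 * is found or a NULL child is reached. */
static node *find_node(node *root, int key)
{
    node *cur = root;
    while (cur != NULL && cur->value != key)
        cur = (key < cur->value) ? cur->left : cur->right;
    return cur;
}
```

Because every node lives in the one allocation and the links are ordinary pointers, the structure needs no translation step as long as the consumer sees the same virtual addresses, which is the property SVM provides.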
The kernel running on the OpenCL 2.0 device can directly search the tree as
follows:
while(NULL != searchNode)
{
    if(currKey->key == searchNode->value)
    {
        /* rejoice on finding key */
        currKey->oclNode = searchNode;
        searchNode = NULL;
    }
    else if(currKey->key < searchNode->value)
    {
        /* move left */
        searchNode = searchNode->left;
    }
    else
    {
        /* move right */
        searchNode = searchNode->right;
    }
}
Each work item searches one element in svmSearchKeys in parallel and sets
oclNode in the searchKey structure for that node.
Updates to the tree occur on the host (CPU) or on the GPU, but not on both
simultaneously.
Because the tree is created on the host, and because OpenCL 1.2 does not support
SVM, implementing these steps is difficult in OpenCL 1.2. In OpenCL 1.2, you
must store the tree as arrays, copy the arrays to the GPU memory (specifying
the appropriate offsets), and then copy the arrays back to the host.
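To make the contrast concrete, here is a hedged host-side sketch of the array-and-offset encoding that the OpenCL 1.2 approach implies: child links are stored as integer indices into the node array rather than as pointers, so the data remains meaningful after it is copied into a different address space. The names indexNode, NO_CHILD, build_index_tree, and find_index are illustrative only and are not taken from the SDK sample.

```c
#include <assert.h>
#include <stddef.h>

#define NO_CHILD (-1)

typedef struct {
    int value;
    int left;   /* index of the left child in the array, or NO_CHILD */
    int right;  /* index of the right child in the array, or NO_CHILD */
} indexNode;

/* Fill nodes[0..count-1] from values, linking children by index.
 * nodes[0] is always the root. */
static void build_index_tree(indexNode *nodes, const int *values, int count)
{
    assert(count > 0);
    for (int i = 0; i < count; i++) {
        nodes[i].value = values[i];
        nodes[i].left = NO_CHILD;
        nodes[i].right = NO_CHILD;
        if (i == 0)
            continue;
        int cur = 0;
        for (;;) {
            int *link = (values[i] < nodes[cur].value) ? &nodes[cur].left
                                                       : &nodes[cur].right;
            if (*link == NO_CHILD) {
                *link = i;
                break;
            }
            cur = *link;
        }
    }
}

/* Return the index of the node holding key, or NO_CHILD if it is absent.
 * Assumes the array holds at least one node. */
static int find_index(const indexNode *nodes, int key)
{
    int cur = 0;
    while (cur != NO_CHILD && nodes[cur].value != key)
        cur = (key < nodes[cur].value) ? nodes[cur].left : nodes[cur].right;
    return cur;
}
```

The extra bookkeeping (index links, a sentinel for "no child", and results reported as indices rather than node pointers) is the kind of transformation cost that the pointer-based SVM version avoids.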
The data is the tree created by the host as a coarse-grain buffer and is passed
to the kernel as an input pointer.
Tree size (M)   CPU time (ms)   OpenCL 2.0 GPU time (ms)   OpenCL 1.2 GPU time (ms)
(missing)       23.46           5.17                       (missing)
(missing)       86.11           24.87                      (missing)
10              180.73          51.58                      N/A
25              381.77          129.58                     N/A
Figure 6.1
Note: All numbers were obtained on a Kaveri APU with 32 GB RAM running
Windows 8.1. All numbers are in milli-seconds (ms).
The above table compares the performance of the 2.0 implementation with that of the 1.2
implementation. The GPU times listed under the OpenCL
1.2 column include the GPU run time, the time to transfer the buffers from the host
to the device, the time required to transform the buffers into arrays and offsets,
and the time required to transfer the buffers from the device back to the host.
Finally, more than 5M nodes could not be allocated in 1.2, as the allowable
memory allocation was limited by the amount of memory that could be used on
the device. Overall, the 2.0 version exceeds the 1.2 version in both performance
and usability.
Overview
In OpenCL 1.2, all pointer parameters in a function definition must have address
spaces associated with them. (The default address space is the private address
space.) This necessitates creating an explicit version of the function for each
desired address space.
OpenCL 2.0 introduces a new address space called the generic address space.
Data cannot be stored in the generic address space, but a pointer to this space
can reference data located in the private, local, or global address spaces. A
function with generic pointer arguments may be called with pointers to any
address space except the constant address space. Pointers that are declared
without a named address space point to the generic address space.
However, such pointers must be associated with a named address space before
they can be used. Functions may be written with arguments and return values
that point to the generic address space, improving readability and
programmability.
6.3.2
Usage
// generic address
space pointer
} ;
Note: The OpenCL 2.0 spec itself shows most built-in functions that accept
pointer arguments as accepting generic pointer arguments.
6.3.2.2 AMD APP SDK example
In the AMD APP SDK sample, addMul2D is a generic function that uses generic
address spaces for its operands. The function computes the convolution sum of
two vectors. Two kernels compute the convolution: one uses data in the global
address space (convolution2DUsingGlobal); the other uses the local
address space (sepiaToning2DUsingLocal). The use of a single function
improves the readability of the source.
float4 addMul2D (uchar4 *src, float *filter, int2 filterDim, int width)
{
    int i, j;
    float4 sum = (float4)(0);
    for(i = 0; i < (filterDim.y); i++)
    {
        for(j = 0; j < (filterDim.x); j++)
        {
            sum += (convert_float4(src[(i*width)+j])) * ((float4)(filter[(i*filterDim.x)+j]));
        }
    }
    return sum;
}
Note: The compiler will try to resolve the address space at compile time.
Otherwise, the runtime will decide whether the pointer references the local or the
global address space. For optimum performance, the code must make it easy for
the compiler to detect the pointer reference by avoiding data-dependent address
space selection, so that run-time resolution -- which is costly -- is not required.
Device-side enqueue
In OpenCL 1.2, a kernel cannot be enqueued from a currently running kernel.
Enqueuing a kernel requires returning control to the host -- potentially
undermining performance.
OpenCL 2.0 allows kernels to enqueue other kernels. It provides a new construct,
"clang blocks," and new built-in functions that allow a parent kernel to queue child
6.4.2
Workgroup/subgroup-level functions
OpenCL 2.0 introduces new built-in functions that operate at the workgroup or
subgroup level. (A workgroup comprises one or more subgroups; the vendor
handles the exact subgroup implementation.) For example, on AMD platforms, a
subgroup maps to a wavefront. (For details, see the AMD OpenCL User Guide.)
Basically, a wavefront is an execution unit on the GPU. The OpenCL specification
requires that all work items in a workgroup/subgroup executing the kernel handle
these new functions; otherwise, their results may be undefined.
OpenCL 2.0 defines the following new built-in functions. Note that it also defines
similar functions for subgroups under the cl_khr_subgroups extensions in
CL_DEVICE_EXTENSIONS.
1. work_group_all and work_group_any: These functions test a given
predicate on all work items in the workgroup. The all version effectively
performs an AND operation on all predicates and returns the result to all work
items; similarly, the any operation performs an OR operation. Thus, using
the all function returns true if the predicate is true for all work items; any
returns true if it is true for at least one work item.
2. work_group_broadcast: This function broadcasts the value held by one
designated work item to all the work items in the workgroup.
3. work_group_reduce: Given an operation, work_group_reduce
performs the reduction operation on all work items and returns the result. The
operation can be min, max or add. For example, when called for an array
using the add operation, the function returns the sum of the array elements.
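The semantics of these functions can be illustrated with a small host-side emulation in plain C, in which each array element plays the role of the value contributed by one work item. This is only a sketch of the results the built-ins produce: the emu_ prefixed names are hypothetical, the loops are serial, and on a device the real built-ins are called collectively by every work item in the workgroup.

```c
#include <assert.h>
#include <stddef.h>

/* AND of all predicates: 1 only if every work item's predicate is non-zero. */
static int emu_work_group_all(const int *pred, size_t n)
{
    int result = 1;
    for (size_t i = 0; i < n; i++)
        if (!pred[i])
            result = 0;
    return result;
}

/* OR of all predicates: 1 if at least one work item's predicate is non-zero. */
static int emu_work_group_any(const int *pred, size_t n)
{
    int result = 0;
    for (size_t i = 0; i < n; i++)
        if (pred[i])
            result = 1;
    return result;
}

/* Reduction with the add operation: the sum of all contributed values. */
static int emu_work_group_reduce_add(const int *x, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Broadcast: every work item receives the value of the work item local_id. */
static int emu_work_group_broadcast(const int *x, size_t n, size_t local_id)
{
    assert(local_id < n);
    return x[local_id];
}
```

In each case the returned value is what every work item in the workgroup would observe after the collective call completes.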
6.4.3
Usage
3. If a sub-region is interesting:
1. Refine the sub-region
2. Apply a process to the refined sub-region
With OpenCL 1.2, this process would require a complex interaction between the
host and the OpenCL device. The device-side kernel would need to somehow
mark the sub-regions requiring further work, and the host side code would need
to scan all of the sub-regions looking for the marked ones and then enqueue a
kernel for each marked sub-region. This process is made more difficult by the
lack of globally visible atomic operations in OpenCL 1.2.
However, with OpenCL 2.0, rather than just marking each interesting sub-region,
the kernel can instead launch a new sub-kernel to process each marked sub-region. This significantly simplifies the code and improves efficiency due to the
elimination of the interactions with, and dependence on, the host.
6.4.3.3 Binary search using device-side enqueue
The power of device enqueue is aptly illustrated in the example of binary search.
To make the problem interesting, multiple keys in a sorted array will be searched
for. The versions written for OpenCL 1.2 and 2.0 will also be compared with
respect to programmability and performance.
A binary search looks for a given key in a sorted sequence by dividing the
sequence in two equal parts and then recursively checking the part that contains
the key. Because a typical GPU processes more than two work items, we divide
the sequence into several parts (globalThreads), and each work item
searches its part for the key. Furthermore, to make things more interesting, a
large number of keys are searched. At every recursion stage, the amount of work
varies with the chunk size. Thus, the algorithm is a good candidate for device-side enqueue.
The OpenCL 1.2 version of the code that performs binary search is as follows:
kernel void binarySearch_mulkeys(global int *keys, global uint *input,
                                 const unsigned int numKeys, global int *output)
{
    int gid = get_global_id(0);
    int lBound = gid * 256;
    int uBound = lBound + 255;
    for(int i = 0; i < numKeys; i++)
    {
        if(keys[i] >= input[lBound] && keys[i] <= input[uBound])
            output[i] = lBound;
    }
}
The search for multiple keys is done sequentially, while the sorted array is
divided into 256 sized chunks. The NDRange is the size of the array divided by
the chunk size. Each work item checks whether the key is present in the range
and if the key is present, updates the output array.
The issue with the above approach is that if the input array is very large, the
number of work items (NDRange) would be very large. The array is not divided
into smaller, more-manageable chunks.
In OpenCL 2.0, the device enqueue feature offers clear advantages in binary
search performance.
The kernel is rewritten in OpenCL 2.0 to enqueue itself. (For full details, see the
complete sample in the AMD APP SDK.) Each work item in the
binarySearch_device_enqueue_multiKeys_child kernel searches its
portion of the sequence for the keys; if it finds one, it updates the array bounds
for that key and also sets a variable, , to declare that another enqueue is
necessary. If all work items report failure, the search stops and reports that the
sequence contains no keys.
Finally, the kernel launches itself again using device enqueue, but with new
bounds:
void (^binarySearch_device_enqueue_wrapper_blk)(void) =
    ^{binarySearch_device_enqueue_multiKeys_child(outputArray,
                                                  sortedArray,
                                                  subdivSize,
                                                  globalLowerIndex,
                                                  keys,
                                                  nKeys,
                                                  parentGlobalids,
                                                  globalThreads);};
int err_ret = enqueue_kernel(defQ, CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange1,
                             binarySearch_device_enqueue_wrapper_blk);
It also checks for missing keys; absent any such keys, the search stops by
forgoing further enqueues:
/**** Search continues only if at least one key is found in
previous search ****/
int Flag = atomic_load_explicit(&,memory_order_seq_cst);
if(Flag == 0)
return;
The advantage is that when the input array is large, the OpenCL 2.0 version
divides the input array into 1024-sized chunks. The chunk in which the given key
falls is found and another kernel is enqueued which further divides it into 1024-sized chunks, and so on. In OpenCL 1.2, as the whole array is taken as the
NDRange, a huge number of work groups require processing.
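The staged narrowing can be sketched on the host in plain C for a single key. In this hypothetical emulation, each pass of the outer loop corresponds to one kernel launch (the parent, then each child it would enqueue), and each pass of the inner loop corresponds to one work item checking its chunk. The function name staged_search and the parts parameter are assumptions made for illustration; the SDK sample itself handles multiple keys and uses enqueue_kernel on the device.

```c
#include <assert.h>
#include <stddef.h>

/* Search a sorted array by repeatedly splitting the current range into
 * `parts` chunks and keeping only the chunk whose bounds enclose the key.
 * Returns the index of the key, or -1 if it is not present. */
static long staged_search(const int *sorted, size_t n, int key, size_t parts)
{
    size_t lo = 0;
    size_t len = n;
    assert(parts >= 2);
    while (len > 1) {                       /* one iteration = one launch */
        size_t chunk = (len + parts - 1) / parts;
        size_t found_lo = lo;
        size_t found_len = 0;
        /* On a device, these chunk checks would run as parallel work items. */
        for (size_t c = lo; c < lo + len; c += chunk) {
            size_t clen = (c + chunk <= lo + len) ? chunk : (lo + len - c);
            if (key >= sorted[c] && key <= sorted[c + clen - 1]) {
                found_lo = c;
                found_len = clen;
                break;
            }
        }
        if (found_len == 0)
            return -1;      /* no chunk encloses the key: no further launch */
        lo = found_lo;      /* the next launch works on the narrower range */
        len = found_len;
    }
    return (n > 0 && sorted[lo] == key) ? (long)lo : -1;
}
```

With parts set to 1024, a range of N elements shrinks by roughly a factor of 1024 per stage, so only a handful of small launches are needed even for very large arrays.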
The following figure shows how the OpenCL 2.0 version compares to the
OpenCL 1.2 as the array increases beyond a certain size.
Figure 6.2 (plotted series: OpenCL 1.2 and OpenCL 2.0)
Note: These numbers are for an A10-7850K (3.7GHz) processor with 4GB of
RAM running Windows 8.1.
The above figure shows the performance benefit of using OpenCL 2.0 over the
same sample using OpenCL 1.2. In OpenCL 2.0, the reduced number of kernel
launches from the host allows superior performance. The kernel enqueues are
much more efficient when done from the device.
Device enqueue is a powerful feature, as the examples above help show. It can
be especially useful when repeatedly applying a set of kernels to a data structure
in accordance with a condition. For applications with dynamic data parallelism at
run time (such as when searching a large space for which the amount of
parallelism or the problem size is statically unknown from the outset), device
enqueue offers many benefits.
The above examples also exemplify the new workgroup and subgroup functions
that OpenCL 2.0 introduces. These functions can efficiently perform computation
at the workgroup level because they can map directly to hardware instructions at
the workgroup/subgroup level.
Overview
In OpenCL 1.2, only work-items in the same workgroup can synchronize.
OpenCL 2.0 introduces a new and detailed memory model which allows
developers to reason about the effects of their code on memory, and in particular
understand whether atomic operations and fences used for synchronization
ensure the visibility of variables being used to communicate between threads. In
conjunction with the new memory model, OpenCL 2.0 adds a new set of atomic
built-in functions and fences derived from C++11 (although the set of types is
restricted), and also deprecates the 1.2 atomic built in functions and fences.
These additions allow synchronization between work-items in different workgroups, as well as fine-grained synchronization with the host using atomic
operations on memory in fine-grained SVM buffers (allocated with the
CL_MEM_SVM_ATOMICS flag) or fine-grained SVM system memory.
6.5.2
Usage
The following examples, which illustrate the use of atomics, are part of the AMD APP
SDK.
{
    int i;
    while (atomic_load_explicit((global atomic_int *)&atomicBuffer[0],
                                memory_order_acquire) != 99);
    i = get_global_id(0);
    buffer[i] += i;
    atomic_store_explicit((global atomic_int *)&atomicBuffer[i],
                          (100+i), memory_order_release);
}
The kernel next stores (100+i), where i is the ID of the work-item, into
atomicBuffer[i]. The order used is memory_order_release, which
ensures that the updated copy reaches the CPU, which is waiting for it in order to report
PASS for the test.
After the atomic operation, the updates on fine-grain variables (such as buffer)
will also be available at the host. The CPU checks for the following to ensure that
the results are OK:
for (i = 0; i < N; i++)
    while (std::atomic_load_explicit((std::atomic<int> *)&atomicBuffer[i],
                                     std::memory_order_acquire) != (100+i));
/* check the results now; report failure only if some element is wrong */
int passed = 1;
for (i = 0; i < N; i++)
    if (buffer[i] != (64+i))
        passed = 0;
printf("%s", passed ? " Test Passed! \n" : " Test Failed \n");
6.5.2.2 Atomic Compare and Exchange (CAS)
This sample illustrates the use of the atomic CAS operation typically used for
"lock-free" programming, in which a critical section can be created without having
to use waiting mutexes/semaphores. The following kernel simultaneously inserts
the IDs of various work items into the "list" array by using atomic CAS operation.
The same loop also runs on the host and inserts the other half (N) work items.
In this way, 2*N numbers are inserted into this "list".
kernel void linkKernel(__global int *list) {
    int head, i;
    i = get_global_id(0) + 1;
    head = list[0];
    if (i != get_global_size(0)) {
        do {
            list[i] = head;
        } while (!atomic_compare_exchange_strong_explicit(
                     (global atomic_int *)&list[0], &head, i,
                     memory_order_release, memory_order_acquire,
                     memory_scope_system));
    }
}
Note how there is no wait to enter the critical section, but list[0] and head are
updated atomically. On the CPU too, a similar loop runs. Again note that the
variables "list" and "head" must be in fine-grain SVM buffers.
memory_order_release and memory_scope_system are used to ensure
that the CPU gets the updates -- hence the name "platform atomics."
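A host-side counterpart of this loop can be sketched with C11 atomics. This is a minimal illustration rather than the SDK's host code: the function name link_insert is hypothetical, and the memory orders chosen here (acquire-release on success, acquire on failure) are an assumption made for this sketch and differ from the orders used in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>

/* list[0] holds the current head of the list. list[i] records the element
 * that was the head at the moment element i was linked in. */
static void link_insert(atomic_int *list, int i)
{
    int head = atomic_load_explicit(&list[0], memory_order_acquire);
    assert(i > 0);
    do {
        /* Tentatively point the new element at the head we last observed. */
        atomic_store_explicit(&list[i], head, memory_order_relaxed);
        /* Publish i as the new head only if the head is still unchanged;
         * on failure, head is refreshed with the current value and the
         * loop retries. */
    } while (!atomic_compare_exchange_strong_explicit(
                 &list[0], &head, i,
                 memory_order_acq_rel, memory_order_acquire));
}
```

No thread ever blocks waiting for a lock: a thread that loses the race simply observes the updated head and tries again.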
6.5.2.3 Atomic Fetch
This sample illustrates the use of the atomic fetch operation. The fetch operation
is an RMW (Read-Modify-Write) operation. The following kernel computes the
maximum of the N numbers in array "A". The results of the pairwise
comparisons are placed in a Boolean matrix "B". After
matrix "B" is computed, each row i is reduced into C[i] with an atomic AND. The row that contains all 1s
corresponds to the maximum (its C[i] remains 1).
kernel void atomicMax(volatile global int *A, global int *B, global int *C, global int *P)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int N = *P, k;
    if (A[i] >= A[j]) B[i*N+j] = 1;
    else B[i*N+j] = 0;
    if (j == 0) {
        C[i] = 1;
        for (k=0;k<N;k++)
            atomic_fetch_and_explicit((global atomic_int *)&C[i],
                                      B[i*N+k], memory_order_release, memory_scope_device);
    }
}
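The logic of this kernel can be checked with a serial host-side emulation in C11. The sketch below is an assumption-laden illustration, not SDK code: max_by_and is a hypothetical name, the comparison results are computed on the fly instead of being stored in a separate matrix B, and the loops run sequentially rather than as a two-dimensional NDRange.

```c
#include <assert.h>
#include <stdatomic.h>

/* For each i, C[i] starts at 1 and is ANDed with (A[i] >= A[j]) for all j,
 * so it stays 1 only if A[i] is greater than or equal to every element.
 * Returns the first index whose C entry is 1, or -1 if n is 0. */
static int max_by_and(const int *A, int n, atomic_int *C)
{
    int max_index = -1;
    for (int i = 0; i < n; i++) {
        atomic_store(&C[i], 1);
        for (int j = 0; j < n; j++) {
            int b = (A[i] >= A[j]) ? 1 : 0;   /* the entry B[i*N+j] */
            atomic_fetch_and_explicit(&C[i], b, memory_order_release);
        }
        if (max_index < 0 && atomic_load(&C[i]) == 1)
            max_index = i;
    }
    return max_index;
}
```

The fetch-and-AND is a read-modify-write: each call reads the current value of C[i], combines it with the operand, and writes the result back as one indivisible step.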
Similarly, another sample includes the following kernel that increments 2*N times,
N times in the kernel and another N times on the host:
6.6 Pipes
6.6.1
Overview
OpenCL 2.0 introduces a new mechanism, pipes, for passing data between
kernels. A pipe is essentially a structured buffer containing some space for a set
of "packets"--kernel-specified type objects, and for bookkeeping information. As
the name suggests, these packets of data are ordered in the pipe (as a FIFO).
Pipes are accessed via special read_pipe and write_pipe built-in functions.
A given kernel may either read from or write to a pipe, but not both. Pipes are
only "coherent" at the standard synchronization points; the result of concurrent
accesses to the same pipe by multiple kernels (even if permitted by hardware)
is undefined. A pipe cannot be accessed from the host side; it can only be
accessed by using the kernel built-in functions.
Pipes are created on the host with a call to clCreatePipe, and may be passed
between kernels. Pipes may be particularly useful when combined with device-side
enqueue for dynamically constructing computational data flow graphs.
There are two types of pipes: a read pipe, from which a number of packets can
be read; and a write pipe, to which a number of packets can be written.
Note: A pipe specified as read-only cannot be written into and a pipe specified
as write-only cannot be read from. A pipe cannot be read from and written into
at the same time.
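The FIFO behavior described above can be modeled on the host with a small ring buffer. This is a conceptual sketch only: pipe_model, PIPE_MAX_PACKETS, and the pipe_write/pipe_read helpers are hypothetical names, not OpenCL API, and real pipes are accessible only from kernels. The return convention (0 on success, a negative value otherwise) follows the one used by the read_pipe and write_pipe built-ins.

```c
#define PIPE_MAX_PACKETS 4

/* Hypothetical model of a pipe: fixed packet storage plus bookkeeping. */
typedef struct {
    int packets[PIPE_MAX_PACKETS]; /* packet storage (packet type: int) */
    unsigned head;                 /* index of the oldest packet */
    unsigned count;                /* number of packets currently stored */
} pipe_model;

static void pipe_init(pipe_model *p) {
    p->head = 0;
    p->count = 0;
}

/* Append one packet; returns 0 on success, -1 if the pipe is full. */
static int pipe_write(pipe_model *p, const int *packet) {
    if (p->count == PIPE_MAX_PACKETS)
        return -1;
    p->packets[(p->head + p->count) % PIPE_MAX_PACKETS] = *packet;
    p->count++;
    return 0;
}

/* Remove the oldest packet; returns 0 on success, -1 if the pipe is empty. */
static int pipe_read(pipe_model *p, int *packet) {
    if (p->count == 0)
        return -1;
    *packet = p->packets[p->head];
    p->head = (p->head + 1) % PIPE_MAX_PACKETS;
    p->count--;
    return 0;
}
```

Packets come out in the order they went in, and both operations fail cleanly instead of blocking when the pipe is full or empty.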
6.6.2
The memory allocated in the above function can be passed to kernels as read-only
or write-only pipes. Pipe objects can only be passed as arguments to kernels or
to functions called from kernels; they cannot be declared inside a kernel or as
program-scope objects.
Also, a set of built-in functions have been added to operate on the pipes. The
important ones are:
read_pipe (pipe p, gentype *ptr): for reading a packet from pipe p into ptr.
write_pipe (pipe p, gentype *ptr): for writing the packet pointed to by ptr
to pipe p.
To ensure that the pipe has enough space for reading or writing before you
actually do it, you can use built-in functions to reserve space. For example,
you can reserve room by calling reserve_read_pipe or reserve_write_pipe.
These functions return a reservation ID, which can be used when the actual
operations are performed. Similarly, the standard has built-in functions for
workgroup-level reservations, such as work_group_reserve_read_pipe and
work_group_reserve_write_pipe. These workgroup built-in functions operate at
the workgroup level and follow the workgroup order (in the program).
Ordering across workgroups is undefined. Calls to commit_read_pipe and
commit_write_pipe, as the names suggest, commit the actual operations
(read/write).
6.6.3 Usage
The following example code illustrates a typical use of pipes. The code
contains two kernels: producer_kernel, which writes to the
pipe, and consumer_kernel, which reads from the same pipe. In the example,
the producer writes a sequence of random numbers; the consumer reads them
and creates a histogram.
The host creates the pipe, which both kernels will use, as follows:
rngPipe = clCreatePipe(context,
CL_MEM_READ_WRITE,
szPipePkt,
szPipe,
NULL,
&status);
This code makes a pipe that the program kernels can access (read/write). The
host creates two kernels, producer_kernel and consumer_kernel. The
producer kernel first reserves enough space for the write pipe:
//reserve space in pipe for writing random numbers.
reserve_id_t rid = work_group_reserve_write_pipe(rng_pipe,
szgr);
Next, the kernel writes and commits to the pipe by invoking the following
functions:
write_pipe(rng_pipe,rid,lid, &gfrn);
work_group_commit_write_pipe(rng_pipe, rid);
Similarly, the consumer kernel reads from the pipe:
//reserve pipe for reading
reserve_id_t rid = work_group_reserve_read_pipe(rng_pipe,
szgr);
if(is_valid_reserve_id(rid)) {
//read random number from the pipe.
read_pipe(rng_pipe,rid,lid, &rn);
work_group_commit_read_pipe(rng_pipe, rid);
}
The consumer_kernel then uses this set of random numbers to construct the
histogram. The CPU creates the same histogram and verifies whether the
histogram created by the kernel is correct. Here, lid is the local ID of the
work-item, obtained by get_local_id(0).
The example code demonstrates how you can use a pipe as a convenient data
structure that allows two kernels to communicate.
In OpenCL 1.2, this kind of communication typically involves the host although
kernels can communicate without returning control to the host. Pipes, however,
ease programming by reducing the amount of code that some applications
require.
Overview
OpenCL 1.2 permits the declaration of only constant address space variables at
program scope.
OpenCL 2.0 permits the declaration of variables in the global address space at
program (i.e. outside function) scope. These variables have the lifetime of the
program in which they appear, and may be initialized. The host cannot directly
access program-scope variables; a kernel must be used to read/write their
contents from/to a buffer created on the host.
Program-scope global variables can save data across kernel executions. Using
program-scope variables can potentially eliminate the need to create buffers on
the host and pass them into each kernel for processing. However, there is a limit
to the size of such variables. The developer must ensure that the total size does
not exceed the value returned by the device info query:
CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE.
Overview
OpenCL 2.0 introduces significant enhancements for processing images.
A read_write access qualifier for images has been added. The qualifier allows
reading from and writing to certain types of images (verified against
clGetSupportedImageFormats by using the
CL_MEM_KERNEL_READ_AND_WRITE flag) in the same kernel, but reads must
be sampler-less. An atomic_work_item_fence with the
CLK_IMAGE_MEM_FENCE flag and the memory_scope_work_item memory
scope is required between reads and writes to the same image to ensure that
the writes are visible to subsequent reads. If multiple work-items are writing to
and reading from multiple locations in an image, a call to
work_group_barrier with the CLK_IMAGE_MEM_FENCE flag is required.
OpenCL 2.0 also allows 2D images to be created from a buffer or another 2D
image and makes the ability to write to 3D images a core feature. This extends
the power of image operations to more situations.
The function clGetSupportedImageFormats returns a list of the image
formats supported by the OpenCL platform. The Image format has two
parameters, channel order and data type. The following lists some image formats
OpenCL supports:
Channel orders: CL_A, CL_RG, CL_RGB, CL_RGBA.
Channel data type: CL_UNORM_INT8, CL_FLOAT.
OpenCL 2.0 provides improved image support, especially support for sRGB
images and depth images.
6.8.2 sRGB
sRGB is a standard RGB color space that is used widely on monitors, printers,
digital cameras, and the Internet. Because the linear RGB value is used in most
image processing algorithms, processing the images often requires converting
sRGB to linear RGB.
OpenCL 2.0 provides a new feature for handling this conversion directly. Note
that only the combination of data type CL_UNORM_INT8 and channel order
CL_sRGBA is mandatory in OpenCL 2.0. The AMD implementations support this
combination. The remaining combinations are optional in OpenCL 2.0.
When not using the mandatory combination (CL_sRGBA, CL_UNORM_INT8), the
clGetSupportedImageFormats function must be used to get a list of
supported image formats and data types before using the sRGB image.
Creating sRGB image objects is similar to creating an image object with any
other supported channel order in OpenCL 2.0. The following snippet shows how to
create CL_sRGBA image objects by using the clCreateImage call.
cl_image_format imageFormat;
imageFormat.image_channel_data_type = CL_UNORM_INT8;
imageFormat.image_channel_order = CL_sRGBA;
cl_mem imageObj = clCreateImage(
    context,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    &imageFormat,
    &desc, //cl_image_desc
    pSrcImage,
    &retErr);
A new sRGB image can also be created based on an existing RGB image object,
so that the kernel can implicitly convert the sRGB image data to RGB. This is
useful when the viewing pixels are sRGB but share the same data as the existing
RGB image.
After an sRGB image object has been created, the read_imagef call can be
used in the kernel to read it transparently. read_imagef implicitly converts
sRGB values into linear RGB. Converting sRGB into RGB in the kernel explicitly
is not necessary if the device supports OpenCL 2.0. Note that only
read_imagef can be used for reading sRGB image data because only the
CL_UNORM_INT8 data type is supported with OpenCL 2.0.
The following is a kernel sample that illustrates how to read an sRGB image
object.
);
}
OpenCL 2.0 does not include writing sRGB images directly, but provides the
cl_khr_srgb_image_writes extension. The AMD implementations do not
support this extension as of this writing.
In order to write sRGB pixels in a kernel, explicit conversion from linear RGB to
sRGB must be implemented in the kernel.
clFillImage is an exception: it can write to an sRGB image directly. The AMD
OpenCL platform supports clFillImage for filling an sRGB image directly with a
linear RGB fill color.
6.8.3 Depth images
As with other image formats, clCreateImage is used for creating depth image
objects. However, the channel order must be set to CL_DEPTH, as illustrated
below. For the data type of depth image, OpenCL 2.0 supports only CL_FLOAT
and CL_UNORM_INT16.
cl_image_format imageFormat;
imageFormat.image_channel_data_type = CL_UNORM_INT16;
imageFormat.image_channel_order = CL_DEPTH;
cl_mem imageObj = clCreateImage(
context,
// A
{
int tidX = get_global_id(0), tidY = get_global_id(1);
int offset = tidY*get_image_width(input) + tidX;
int2 coords = (int2)( xOffsets[offset], yOffsets[offset]);
results[offset] = read_imagef( input, imageSampler, coords
);
}
The AMD OpenCL 2.0 platform fully supports the cl_khr_depth_images
extension but not the cl_khr_gl_depth_images extension. Consequently, the
AMD OpenCL platform does not support creating a CL depth image from a GL
depth or depth-stencil texture.
Overview
Prior to OpenCL 2.0, each work-group size needed to divide evenly into the
corresponding global size. This requirement is relaxed in OpenCL 2.0; the last
work-group in each dimension is allowed to be smaller than all of the other
work-groups in the "uniform" part of the NDRange. This can reduce the effort
required to map problems onto NDRanges.
A consequence is that kernels may no longer assume that calls to
get_local_size return the same value in all work-groups. However, a
new call (get_enqueued_local_size) has been added to obtain the size in
the uniform part, which is specified using the local_work_size argument to
clEnqueueNDRangeKernel.
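The arithmetic for one dimension can be sketched in plain C. The helper names below are hypothetical; they only model how many work-groups result and how large each one is when the local size does not divide the global size.

```c
#include <stddef.h>

/* Number of work-groups along one dimension, counting a final partial one. */
static size_t num_work_groups(size_t global_size, size_t local_size) {
    return (global_size + local_size - 1) / local_size;
}

/* Size of work-group "group_id" along one dimension. Every group in the
   uniform part has local_size work-items; the last group gets whatever
   remains. Assumes group_id < num_work_groups(global_size, local_size). */
static size_t work_group_size(size_t global_size, size_t local_size,
                              size_t group_id) {
    size_t start = group_id * local_size;
    size_t remaining = global_size - start;
    return remaining < local_size ? remaining : local_size;
}
```

For example, a global size of 1000 with a local size of 64 gives 16 work-groups: 15 of size 64 and a last one of size 40. In a kernel, the enqueued local size would be 64 in every group, while the actual local size in the last group would be 40.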
A new compile time option (-cl-uniform-work-group-size) has been
added to optimize the computation for cases in which the work-group size is
known to, or required to, divide evenly into the global size.
compiles a kernel without the -cl-std=CL2.0 option, then the program should
run on OpenCL 2.0 platforms.
6.10.2
Appendix A
OpenCL Optional Extensions
The OpenCL extensions are associated with the devices and can be queried for
a specific device. Extensions can be queried for platforms also, but that means
that all devices in the platform support those extensions.
Table A.1, on page A-14, lists the supported extensions.
The OpenCL Specification states that all API functions of the extension must
have names in the form of cl<FunctionName>KHR, cl<FunctionName>EXT, or
cl<FunctionName><VendorName>. All enumerated values must be in the form of
CL_<enum_name>_KHR, CL_<enum_name>_EXT, or
CL_<enum_name>_<VendorName>.
After the device list is retrieved, the extensions supported by each device can be
queried with function call clGetDeviceInfo() with parameter param_name being
set to enumerated value CL_DEVICE_EXTENSIONS.
The extensions are returned in a char string, with extension names separated by
a space. To see if an extension is present, search the string for a specified
substring.
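A plain substring search can report false positives when one extension name is a prefix of another (for example, cl_amd_media_ops and cl_amd_media_ops2). The following C sketch shows one way to match whole space-separated names; has_extension is a hypothetical helper, and the string it receives is assumed to be the value returned for CL_DEVICE_EXTENSIONS.

```c
#include <string.h>

/* Returns 1 if "name" appears as a complete space-separated token in
   "extensions", and 0 otherwise. */
static int has_extension(const char *extensions, const char *name) {
    size_t len = strlen(name);
    const char *p = extensions;

    if (len == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at the beginning or right after a space ... */
        int starts_token = (p == extensions) || (p[-1] == ' ');
        /* ... and must end at a space or at the end of the string. */
        int ends_token = (p[len] == ' ') || (p[len] == '\0');
        if (starts_token && ends_token)
            return 1;
        p += 1; /* keep searching after this partial match */
    }
    return 0;
}
```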
all - only core functionality of OpenCL is used and supported, all extensions
are ignored. If the specified extension is not supported then a warning is
issued by the compiler.
This means that the extensions must be explicitly enabled to be used in kernel
programs.
Each extension that affects kernel code compilation must add a defined macro
with the name of the extension. This allows the kernel code to be compiled
differently, depending on whether the extension is supported and enabled, or not.
For example, for extension cl_khr_fp64 there should be a #define directive for
macro cl_khr_fp64, so that the following code can be preprocessed:
#ifdef cl_khr_fp64
// some code
#else
// some code
#endif
This returns the address of the extension function specified by the FunctionName
string. The returned value must be appropriately cast to a function pointer type,
specified in the extension spec and header file.
A return value of NULL means that the specified function does not exist in the
CL implementation. A non-NULL return value does not guarantee that the
extension function actually exists; the queries described in sec. 2 or 3 must
be performed to ensure that the extension is supported.
The clGetExtensionFunctionAddress() function cannot be used to get core
API function addresses.
cl_khr_icd: the OpenCL Installable Client Driver (ICD) that lets developers
select from multiple OpenCL runtimes which may be installed on a system.
cl_dx9_media_sharing
cl_khr_fp16
cl_khr_gl_event
cl_khr_int64_base_atomics
cl_khr_int64_extended_atomics
cl_khr_fp16
cl_khr_gl_sharing
cl_khr_gl_event
cl_khr_d3d10_sharing
cl_dx9_media_sharing
cl_khr_d3d11_sharing
cl_khr_gl_depth_images
cl_khr_gl_msaa_sharing
cl_khr_initialize_memory
cl_khr_terminate_context
cl_khr_spir
cl_khr_icd
cl_khr_subgroups
cl_khr_mipmap_image
cl_khr_mipmap_image_writes
cl_khr_egl_image
cl_khr_egl_event
cl_khr_device_enqueue_local_arg_types
A.8.1 cl_amd_fp64
Before using double data types, double-precision floating point operators, and/or
double-precision floating point routines in OpenCL C kernels, include the
#pragma OPENCL EXTENSION cl_amd_fp64 : enable directive. See Table A.1
for a list of supported routines.
A.8.2 cl_amd_vec3
This extension adds support for vectors with three elements: float3, short3,
char3, etc. This data type was added to OpenCL 1.1 as a core feature. For more
details, see section 6.1.2 in the OpenCL 1.1 or OpenCL 1.2 spec.
A.8.3 cl_amd_device_persistent_memory
This extension adds support for the new buffer and image creation flag
CL_MEM_USE_PERSISTENT_MEM_AMD. Buffers and images allocated with this flag
reside in host-visible device memory. This flag is mutually exclusive with the flags
CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR.
A.8.4 cl_amd_device_attribute_query
This extension provides a means to query AMD-specific device attributes. To
enable this extension, include the #pragma OPENCL EXTENSION
cl_amd_device_attribute_query : enable directive. Once the extension is
enabled, and the clGetDeviceInfo parameter <param_name> is set to
CL_DEVICE_PROFILING_TIMER_OFFSET_AMD, the offset in nanoseconds between
an event timestamp and the Epoch is returned.
A.8.4.1 cl_device_profiling_timer_offset_amd
This query enables the developer to get the offset between event timestamps in
nanoseconds. To use it, compile the kernels with the #pragma OPENCL
EXTENSION cl_amd_device_attribute_query : enable directive. For
cl_amd_device_topology
This query enables the developer to get a description of the topology used to
connect the device to the host. Currently, this query works only in Linux. Calling
clGetDeviceInfo with <param_name> set to CL_DEVICE_TOPOLOGY_AMD returns
the following 24-byte union of structures.
typedef union
{
    struct { cl_uint type; cl_uint data[5]; } raw;
    struct { cl_uint type; cl_char unused[17]; cl_char bus;
             cl_char device; cl_char function; } pcie;
} cl_device_topology_amd;
The type of the structure returned can be queried by reading the first unsigned
int of the returned data. The developer can use this type to cast the returned
union into the right structure type.
Currently, the only supported type in the structure above is PCIe (type value =
1). The information returned contains the PCI Bus/Device/Function of the device,
and is similar to the result of the lspci command in Linux. It enables the
developer to match between the OpenCL device ID and the physical PCI
connection of the card.
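Decoding the returned union can be sketched as follows. To keep the example self-contained, cl_uint and cl_char are declared here as stand-in typedefs rather than taken from the OpenCL headers, and format_pcie_location and TOPOLOGY_TYPE_PCIE are hypothetical names; the type value 1 is the PCIe value given above. With these stand-in typedefs, sizeof reports 24 bytes for the union.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the OpenCL scalar typedefs (assumption: 32-bit unsigned
   integer and 8-bit signed character, as in the OpenCL headers). */
typedef uint32_t cl_uint;
typedef int8_t cl_char;

typedef union {
    struct { cl_uint type; cl_uint data[5]; } raw;
    struct { cl_uint type; cl_char unused[17]; cl_char bus;
             cl_char device; cl_char function; } pcie;
} cl_device_topology_amd;

#define TOPOLOGY_TYPE_PCIE 1u /* "type value = 1" from the text */

/* Writes "bus:device.function" in the hexadecimal style used by lspci.
   Returns 0 on success, -1 if the type is not PCIe or the buffer is small. */
static int format_pcie_location(const cl_device_topology_amd *t,
                                char *out, size_t out_size) {
    int n;
    /* Read the leading unsigned int first to learn which member is valid. */
    if (t->raw.type != TOPOLOGY_TYPE_PCIE)
        return -1;
    n = snprintf(out, out_size, "%02x:%02x.%x",
                 (unsigned)(uint8_t)t->pcie.bus,
                 (unsigned)(uint8_t)t->pcie.device,
                 (unsigned)(uint8_t)t->pcie.function);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

In an application, the union would be filled by clGetDeviceInfo with CL_DEVICE_TOPOLOGY_AMD, and the resulting string can then be compared with the lspci output.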
A.8.4.3 cl_amd_device_board_name
This query enables the developer to get the name of the GPU board and model
of the specific device. Currently, this is only for GPU devices.
Calling clGetDeviceInfo with <param_name> set to
CL_DEVICE_BOARD_NAME_AMD returns a 128-character value.
A.8.5 cl_amd_compile_options
This extension adds the following options, which are not part of the OpenCL
specification.
-g: This is an experimental feature that lets you use the GNU project
debugger, GDB, to debug kernels on x86 CPUs running Linux or
Cygwin/MinGW under Windows. For more details, see Chapter 4, Debugging
and Profiling OpenCL. This option does not affect the default optimization of
the OpenCL code.
To avoid source changes, there are two environment variables that can be used
to change CL options during the runtime.
A.8.6 cl_amd_offline_devices
To generate binary images offline, it is necessary to access the compiler for every
device that the runtime supports, even if the device is currently not installed on
the system. When, during context creation, CL_CONTEXT_OFFLINE_DEVICES_AMD
is passed in the context properties, all supported devices, whether online or
offline, are reported and can be used to create OpenCL binary images.
A.8.7 cl_amd_event_callback
This extension provides the ability to register event callbacks for states other than
CL_COMPLETE. The full set of event states is allowed: CL_QUEUED,
CL_SUBMITTED, and CL_RUNNING. This extension is enabled automatically and
does not need to be explicitly enabled through #pragma when using the AMD
APP SDK.
A.8.8 cl_amd_popcnt
This extension introduces a population count function called popcnt. This
extension was taken into core OpenCL 1.2, and the function was renamed
popcount. The core 1.2 popcount function (documented in section 6.12.3 of the
OpenCL Specification) is identical to the AMD extension popcnt function.
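For reference, a population count returns the number of bits set to 1 in its argument. The following portable C sketch shows the computation for a 32-bit value; popcount32 is a hypothetical host-side helper, not the OpenCL built-in.

```c
#include <stdint.h>

/* Counts the bits set to 1 in x. Each iteration clears the lowest set bit,
   so the loop runs once per set bit. */
static unsigned popcount32(uint32_t x) {
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1; /* clear the lowest set bit */
        ++count;
    }
    return count;
}
```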
A.8.9 cl_amd_media_ops
This extension adds the following built-in functions to the OpenCL language.
Note: For OpenCL scalar types, n = 1; for vector types, it is {2, 4, 8, or 16}.
Return value
((((uint)src[0]) & 0xFF) << 0) +
((((uint)src[1]) & 0xFF) << 8) +
((((uint)src[2]) & 0xFF) << 16) +
((((uint)src[3]) & 0xFF) << 24)
[Table of the remaining built-in functions (return type uintn): only fragments of the per-byte expressions survive, of the form ((src1[i] >> 0, 8, 16, 24) & 0xFF), some combined with + and some shifted left by 16.]
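The packing expression listed under "Return value" above places the low byte of each of four values into one 32-bit word. The following C sketch reproduces that arithmetic on the host; pack_low_bytes and unpack_byte are hypothetical names, and the inputs are taken as unsigned integers for simplicity.

```c
#include <stdint.h>

/* Packs the low 8 bits of src[0..3] into one 32-bit word: src[0] lands in
   bits 0-7, src[1] in bits 8-15, src[2] in bits 16-23, src[3] in bits 24-31. */
static uint32_t pack_low_bytes(const uint32_t src[4]) {
    return ((src[0] & 0xFFu) << 0) +
           ((src[1] & 0xFFu) << 8) +
           ((src[2] & 0xFFu) << 16) +
           ((src[3] & 0xFFu) << 24);
}

/* Extracts byte k (0 = least significant) from a packed word, using the
   shift-and-mask pattern (word >> 8k) & 0xFF. */
static uint32_t unpack_byte(uint32_t word, unsigned k) {
    return (word >> (8u * k)) & 0xFFu;
}
```

Because each value is masked to 8 bits before shifting, the four fields never overlap, so adding them is equivalent to OR-ing them.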
A.8.10 cl_amd_printf
The OpenCL Specification 1.1 and 1.2 support the optional AMD extension
cl_amd_printf, which provides printf capabilities to OpenCL C programs. To use
this extension, an application first must include
#pragma OPENCL EXTENSION cl_amd_printf : enable.
Built-in function:
printf(__constant char * restrict format, ...);
This function writes output to the stdout stream associated with the
host application. The format string is a character sequence that:
ordinary characters (i.e. not %), which are copied directly to the output
stream unchanged, and
The OpenCL C printf closely matches the definition found as part of the
C99 standard. Note that conversions introduced in the format string with
% are supported with the following guidelines:
A 32-bit floating point argument is not converted to a 64-bit double,
unless the extension cl_khr_fp64 is supported and enabled, as
defined in section 9.3 of the OpenCL Specification 1.1. This includes
the double variants if cl_khr_fp64 is supported and defined in the
corresponding compilation unit.
64-bit integer types can be printed using %ld / %lx / %lu .
%lld / %llx / %llu are not supported and reserved for 128-bit integer
types (long long).
A.8.11 cl_amd_predefined_macros
The following macros are predefined when compiling OpenCL C kernels.
These macros are defined automatically based on the device for which the code
is being compiled.
GPU devices:
__Barts__
__BeaverCreek__
__Bheem__
__Bonaire__
__Caicos__
__Capeverde__
__Carrizo__
__Cayman__
__Cedar__
__Cypress__
__Devastator__
__Hainan__
____
__Iceland__
__Juniper__
__Kalindi__
__Kauai__
__Lombok__
__Loveland__
__Mullins__
__Oland__
__Pitcairn__
__RV710__
__RV730__
__RV740__
__RV770__
__RV790__
__Redwood__
__Scrapper__
__Spectre__
__Spooky__
__Tahiti__
__Tonga__
__Turks__
__WinterPark__
__GPU__
CPU devices:
__CPU__
__X86__
__X86_64__
Note that __GPU__ or __CPU__ are predefined whenever a GPU or CPU device
is the compilation target.
An example kernel is provided below.
#pragma OPENCL EXTENSION cl_amd_printf : enable
const char* getDeviceName() {
#ifdef __Cayman__
return "Cayman";
#elif defined(__Barts__)
return "Barts";
#elif defined(__Cypress__)
return "Cypress";
#elif defined(__Juniper__)
return "Juniper";
#elif defined(__Redwood__)
return "Redwood";
#elif defined(__Cedar__)
return "Cedar";
#elif defined(__ATI_RV770__)
return "RV770";
#elif defined(__ATI_RV730__)
return "RV730";
#elif defined(__ATI_RV710__)
return "RV710";
#elif defined(__Loveland__)
return "Loveland";
#elif defined(__GPU__)
return "GenericGPU";
#elif defined(__X86__)
return "X86CPU";
#elif defined(__X86_64__)
return "X86-64CPU";
#elif defined(__CPU__)
return "GenericCPU";
#else
return "UnknownDevice";
#endif
}
kernel void test_pf(global int* a)
{
printf("Device Name: %s\n", getDeviceName());
}
A.8.12 cl_amd_bus_addressable_memory
This extension defines an API for peer-to-peer transfers between AMD GPUs
and other PCIe devices, such as third-party SDI I/O devices. Peer-to-peer
transfers have extremely low latencies because they do not use the host's main
memory or the CPU (see Figure A.1). This extension allows memory
allocated by the graphics driver to be shared with other devices on the PCIe bus
(peer-to-peer transfers) by exposing a write-only bus address. It also allows
memory allocated on other PCIe devices (non-AMD GPU) to be directly
accessed by AMD GPUs. One possible use of this is for a video capture device
to directly write into the GPU memory using its DMA. This extension is supported
only on AMD FirePro professional graphics cards.
Figure A.1
Extension (eight device columns; surviving header text: AMD Radeon HD, Tahiti(1), Pitcairn(2), Brazos, Llano)
cl_khr_*_atomics (32-bit)       Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_ext_atomic_counters_32       Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_khr_gl_sharing               Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_khr_byte_addressable_store   Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_ext_device_fission           CPU only  CPU only  CPU only  No   No   No   No   No
cl_amd_device_attribute_query   Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_khr_fp64                     CPU only  CPU only  CPU only  Yes  Yes  Yes  No   Yes
cl_amd_fp64                     CPU only  CPU only  CPU only  Yes  Yes  Yes  No   Yes
cl_amd_vec3                     Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
Images                          Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_khr_d3d10_sharing            Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_amd_media_ops                Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_amd_printf                   Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_amd_popcnt                   Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_khr_3d_image_writes          Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_khr_icd                      Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_amd_event_callback           Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
cl_amd_offline_devices          Yes       Yes       Yes       Yes  Yes  Yes  Yes  Yes
Platform Extensions
1.
2.
3.
4.
5.
6.
7.
AMD
AMD
AMD
AMD
AMD
AMD
ATI Radeon
guaranteed. The access to the counter is done only through add/dec built-in
functions; thus, no two work-items have the same value returned in the case that
a given kernel only increments or decrements the counter. (Also see
http://www.khronos.org/registry/cl/extensions/ext/cl_ext_atomic_counters_32.txt.)
Table A.2
Extension                       (header missing)  Redwood(2)  Cedar(3)  x86 CPU with SSE2 or later
cl_khr_*_atomics                Yes               Yes         Yes       Yes
cl_ext_atomic_counters_32       Yes               Yes         Yes       No
cl_khr_gl_sharing               Yes               Yes         Yes       Yes
cl_khr_byte_addressable_store   Yes               Yes         Yes       Yes
cl_ext_device_fission           No                No          No        Yes
cl_amd_device_attribute_query   Yes               Yes         Yes       Yes
cl_khr_fp64                     No                No          No        Yes
cl_amd_fp64                     No                No          No        Yes
cl_amd_vec3                     Yes               Yes         Yes       Yes
Images                          Yes               Yes         Yes       Yes(5)
cl_khr_d3d10_sharing            Yes               Yes         Yes       Yes
cl_amd_media_ops                Yes               Yes         Yes       Yes
cl_amd_media_ops2               Yes               Yes         Yes       Yes
cl_amd_printf                   Yes               Yes         Yes       Yes
cl_amd_popcnt                   Yes               Yes         Yes       Yes
cl_khr_3d_image_writes          Yes               Yes         Yes       No
cl_khr_icd                      Yes               Yes         Yes       Yes
cl_amd_event_callback           Yes               Yes         Yes       Yes
cl_amd_offline_devices          Yes               Yes         Yes       Yes
Platform Extensions
1. ATI Radeon HD 5700 series, AMD Mobility Radeon HD 5800 series, AMD FirePro V5800 series, AMD
Mobility FirePro M7820.
2. ATI Radeon HD 5600 Series, ATI Radeon HD 5600 Series, ATI Radeon HD 5500 Series, AMD
Mobility Radeon HD 5700 Series, AMD Mobility Radeon HD 5600 Series, AMD FirePro V4800
Series, AMD FirePro V3800 Series, AMD Mobility FirePro M5800
3. ATI Radeon HD 5400 Series, AMD Mobility Radeon HD 5400 Series
4. Available on all devices that have double-precision, including all Southern Island devices.
5. Environment variable CPU_IMAGE_SUPPORT must be set.
Appendix B
The OpenCL Installable Client Driver (ICD)
The OpenCL Installable Client Driver (ICD) is installed as part of the AMD
Graphics driver software stack as well as the AMD APP SDK.
B.1 Overview
The ICD allows multiple OpenCL implementations to co-exist; also, it allows
applications to select between these implementations at runtime.
Use the clGetPlatformIDs() and clGetPlatformInfo() functions to see the
list of available OpenCL implementations, and select the one that is best for your
requirements. It is recommended that developers offer their users a choice on
first run of the program or whenever the list of available platforms changes.
A properly implemented ICD and OpenCL library is transparent to the end-user.
{
    cl_platform_id* platforms = new cl_platform_id[numPlatforms];
    status = clGetPlatformIDs(numPlatforms, platforms, NULL);
    if(!sampleCommon->checkVal(status,
                               CL_SUCCESS,
                               "clGetPlatformIDs failed."))
    {
        return SDK_FAILURE;
    }
    for (unsigned i = 0; i < numPlatforms; ++i)
    {
        char pbuf[100];
        status = clGetPlatformInfo(platforms[i],
                                   CL_PLATFORM_VENDOR,
                                   sizeof(pbuf),
                                   pbuf,
                                   NULL);
        if(!sampleCommon->checkVal(status,
                                   CL_SUCCESS,
                                   "clGetPlatformInfo failed."))
        {
            return SDK_FAILURE;
        }
        platform = platforms[i];
        if (!strcmp(pbuf, "Advanced Micro Devices, Inc."))
        {
            break;
        }
    }
    delete[] platforms;
}
/*
 * If we could find our platform, use it. Otherwise pass a NULL and
 * get whatever the implementation thinks we should be using.
 */
cl_context_properties cps[3] =
{
    CL_CONTEXT_PLATFORM,
    (cl_context_properties)platform,
    0
};
/* Use NULL for backward compatibility */
cl_context_properties* cprops = (NULL == platform) ? NULL : cps;
context = clCreateContextFromType(
    cprops,
    dType,
    NULL,
    NULL,
    &status);
Appendix C
OpenCL Binary Image Format (BIF) v2.0
C.1 Overview
OpenCL Binary Image Format (BIF) 2.0 is in the ELF format. BIF2.0 allows the
OpenCL binary to contain the OpenCL source program, the LLVM IR, and the
executable. The BIF defines the following special sections:
.comment: for storing the OpenCL version and the driver version that created
the binary.
The BIF can have other special sections for debugging, etc. It also contains
several ELF special sections, such as:
other ELF special sections required for forming an ELF (for example:
.strtab, .symtab, .shstrtab).
By default, OpenCL generates a binary that has LLVM IR, and the executable for
the GPU (.llvmir, .amdil, and .text sections), as well as LLVM IR and the
executable for the CPU (.llvmir and .text sections). The BIF binary always
contains a .comment section, which is a readable C string. The default behavior
can be changed with the BIF options described in Section C.2, BIF Options,
page C-3.
The LLVM IR enables recompilation from LLVM IR to the target. When a binary
is used to run on a device for which the original program was not generated and
the original device is feature-compatible with the current device, OpenCL
recompiles the LLVM IR to generate a new code for the device. Note that the
LLVM IR is only universal within devices that are feature-compatible in the same
device type, not across different device types. This means that the LLVM IR for
the CPU is not compatible with the LLVM IR for the GPU. The LLVM IR for a
GPU works only for GPU devices that have equivalent feature sets.
BIF2.0 is supported since Stream SDK 2.2.
C.1.1
Field                     Value                     Description
e_ident[EI_CLASS]         ELFCLASS32, ELFCLASS64
e_ident[EI_DATA]          ELFDATA2LSB
e_ident[EI_OSABI]         ELFOSABI_NONE             Not used.
e_ident[EI_ABIVERSION]    0                         Not used.
e_type                    ET_NONE                   Not used.
e_machine                 oclElfTargets Enum
e_version                 EV_CURRENT                Must be EV_CURRENT.
e_entry                                             Not used.
e_phoff                                             Not used.
e_flags                                             Not used.
e_phentsize                                         Not used.
e_phnum                                             Not used.
The fields not shown in Table C.1 are given values according to the ELF
Specification. The e_machine value is defined as one of the oclElfTargets
enumerants; the values for these are:
e_machine =
C.1.2 Bitness
The BIF can be either 32-bit ELF format or a 64-bit ELF format. For the GPU,
OpenCL generates a 32-bit BIF binary; it can read either 32-bit BIF or 64-bit BIF
binary. For the CPU, OpenCL generates and reads only 32-bit BIF binaries if the
host application is 32-bit (on either 32-bit OS or 64-bit OS). It generates and
reads only 64-bit BIF binary if the host application is 64-bit (on 64-bit OS).
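Because a BIF binary is an ELF file, its bitness can be read from the e_ident bytes at the start of the image. The following C sketch assumes the binary has already been loaded into memory; bif_bitness and the BIF_-prefixed constants are hypothetical names, while the index and class values are the standard ELF ones (EI_CLASS = 4, EI_DATA = 5, ELFCLASS32 = 1, ELFCLASS64 = 2, ELFDATA2LSB = 1).

```c
#include <stddef.h>

/* Standard ELF identification constants, redeclared locally so that the
   sketch does not depend on <elf.h>. */
enum { BIF_EI_CLASS = 4, BIF_EI_DATA = 5, BIF_EI_NIDENT = 16 };
enum { BIF_ELFCLASS32 = 1, BIF_ELFCLASS64 = 2, BIF_ELFDATA2LSB = 1 };

/* Returns 32 or 64 for a little-endian ELF image, or 0 if the buffer does
   not look like one. */
static int bif_bitness(const unsigned char *image, size_t size) {
    if (image == NULL || size < (size_t)BIF_EI_NIDENT)
        return 0;
    /* Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'. */
    if (image[0] != 0x7f || image[1] != 'E' ||
        image[2] != 'L' || image[3] != 'F')
        return 0;
    /* The header table for BIF specifies ELFDATA2LSB (little-endian). */
    if (image[BIF_EI_DATA] != BIF_ELFDATA2LSB)
        return 0;
    switch (image[BIF_EI_CLASS]) {
    case BIF_ELFCLASS32: return 32;
    case BIF_ELFCLASS64: return 64;
    default:             return 0;
    }
}
```

A host application could use such a check to confirm that a CPU binary matches its own bitness before trying to load it.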
This binary can recompile for all the other devices of the same device type.
Appendix D
Hardware overview of pre-GCN devices
This chapter provides a hardware overview of pre-GCN devices. Pre-GCN
devices include the Evergreen and Northern Islands families that are based on
VLIW.
A general OpenCL device comprises compute units, each of which can have
multiple processing elements. A work-item (or SPMD kernel instance) executes
on a single processing element. The processing elements within a compute unit
can execute in lock-step using SIMD execution. Compute units, however,
execute independently (see Figure D.1).
AMD GPUs consist of multiple compute units. The number of them and the way
they are structured vary with the device family, as well as device designations
within a family. Each of these processing elements possesses ALUs. For devices
in the Northern Islands and Southern Islands families, these ALUs are arranged
in four (in the Evergreen family, there are five) processing elements with arrays
of 16 ALUs. Each of these arrays executes a single instruction across each lane
for each of a block of 16 work-items. That instruction is repeated over four cycles
to make the 64-element vector called a wavefront. On Northern Islands and
Evergreen family devices, the PE arrays execute instructions from one wavefront,
so that each work-item issues four (for Northern Islands) or five (for Evergreen)
instructions at once in a very-long-instruction-word (VLIW) packet.
Figure D.1 shows a simplified block diagram of a generalized AMD GPU compute
device.
Figure D.1 (diagram labels: GPU Compute Device, Compute Unit, Processing Elements, ALUs)
Figure D.2 (diagram labels: Compute Unit, Processing Element, Instruction and Control Flow, Branch Execution Unit, ALUs, General-Purpose Registers)
GPU compute devices comprise groups of compute units. Each compute unit
contains numerous processing elements, which are responsible for executing
kernels, each operating on an independent data stream. Processing elements, in
turn, contain numerous ALUs, which are the fundamental, programmable units
that perform integer, single-precision floating-point, double-precision
floating-point, and transcendental operations. All processing elements
within a compute unit execute the same instruction sequence in lock-step for
Evergreen and Northern Islands devices; different compute units can execute
different instructions.
1. Much of this is transparent to the programmer.
A processing element is arranged as a five-way or four-way (depending on the
GPU type) very long instruction word (VLIW) processor (see bottom of
Figure D.2). Up to five scalar operations (or four, depending on the GPU type)
can be co-issued in a VLIW instruction, each of which is executed on one of
the corresponding five ALUs. ALUs can execute single-precision floating-point or
integer operations. One of the five ALUs also can perform transcendental
operations (sine, cosine, logarithm, etc.). Double-precision floating-point
operations are processed (where supported) by connecting two or four of the
ALUs (excluding the transcendental core) to perform a single double-precision
operation. The processing element also contains one branch execution unit to
handle branch instructions.
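To make the co-issue limits concrete, the following C sketch computes a lower bound on the number of VLIW packets needed for a set of scalar operations on the five-way design. It assumes, purely for illustration, that all operations are mutually independent and that the only constraints are the ones stated above (five slots per packet, one of which can execute a transcendental operation). A real compiler must also honor data dependencies, so this is not a description of the actual AMD scheduler; the function name is hypothetical.

```c
#include <assert.h>

enum { EX_VLIW_WIDTH = 5 };  /* five-way design described above */

/* Minimum number of VLIW packets for independent scalar operations:
   at most EX_VLIW_WIDTH operations per packet, and at most one
   transcendental operation per packet. */
static int ex_min_vliw_packets(int simple_ops, int transcendental_ops)
{
    assert(simple_ops >= 0 && transcendental_ops >= 0);

    int total    = simple_ops + transcendental_ops;
    int by_width = (total + EX_VLIW_WIDTH - 1) / EX_VLIW_WIDTH;

    /* Each transcendental operation needs its own packet slot on the
       single transcendental-capable ALU. */
    return by_width > transcendental_ops ? by_width : transcendental_ops;
}
```

Under these assumptions, four simple operations plus one transcendental operation fit in a single packet, while three simple operations plus two transcendental operations need two packets.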
Different GPU compute devices have different numbers of processing elements.
For example, the ATI Radeon HD 5870 GPU has 20 compute units, each with
16 processing elements, and each processing element contains five ALUs; this
yields 1600 physical ALUs.
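The ALU count is the product of the three figures quoted above. A minimal C sketch of that arithmetic follows; the function name is hypothetical.

```c
#include <assert.h>

/* Physical ALUs = compute units x processing elements per compute unit
   x ALUs per processing element. */
static int ex_physical_alus(int compute_units,
                            int processing_elements_per_unit,
                            int alus_per_processing_element)
{
    assert(compute_units > 0);
    assert(processing_elements_per_unit > 0);
    assert(alus_per_processing_element > 0);
    return compute_units * processing_elements_per_unit
                         * alus_per_processing_element;
}
```

Calling ex_physical_alus(20, 16, 5) reproduces the 1600 ALUs given for the ATI Radeon HD 5870.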
Appendix E
OpenCL-OpenGL Interoperability
E.1.1
2. Use GetDC to get a handle to the device context for the client area of a
specific window, or for the entire screen. Alternatively, use the CreateDC
function to create a device context (HDC) for the specified device.
3. Use ChoosePixelFormat to match an appropriate pixel format supported by
a device context to a given pixel format specification.
4. Use SetPixelFormat to set the pixel format of the specified device context
to the format specified.
5. Use wglCreateContext to create a new OpenGL rendering context from the
device context (HDC).
6. Use wglMakeCurrent to bind the GL context created in the above step as
the current rendering context.
7. Use the clGetGLContextInfoKHR function (see Section 9.7 of the OpenCL
Specification 1.1) with the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter
to get the device ID of the CL device associated with the OpenGL context.
8. Use the clCreateContext function (see Section 4.3 of the OpenCL Specification
1.1) to create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
WIN32 API for windowing. (Users also can refer to the SimpleGL sample in the
AMD APP SDK samples.)
int pfmt;
PIXELFORMATDESCRIPTOR pfd;
// Clear the descriptor first; clearing it after the assignments below
// would discard every field that was just set.
ZeroMemory(&pfd, sizeof(PIXELFORMATDESCRIPTOR));
pfd.nSize           = sizeof(PIXELFORMATDESCRIPTOR);
pfd.nVersion        = 1;
pfd.dwFlags         = PFD_DRAW_TO_WINDOW |
                      PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
pfd.iPixelType      = PFD_TYPE_RGBA;
pfd.cColorBits      = 24;
pfd.cRedBits        = 8;
pfd.cRedShift       = 0;
pfd.cGreenBits      = 8;
pfd.cGreenShift     = 0;
pfd.cBlueBits       = 8;
pfd.cBlueShift      = 0;
pfd.cAlphaBits      = 8;
pfd.cAlphaShift     = 0;
pfd.cAccumBits      = 0;
pfd.cAccumRedBits   = 0;
pfd.cAccumGreenBits = 0;
pfd.cAccumBlueBits  = 0;
pfd.cAccumAlphaBits = 0;
pfd.cDepthBits      = 24;
pfd.cStencilBits    = 8;
pfd.cAuxBuffers     = 0;
pfd.iLayerType      = PFD_MAIN_PLANE;
pfd.bReserved       = 0;
pfd.dwLayerMask     = 0;
pfd.dwVisibleMask   = 0;
pfd.dwDamageMask    = 0;
WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
E.1 Under Windows
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(reinterpret_cast<LPCSTR>("SimpleGL"),
reinterpret_cast<LPCSTR>("SimpleGL"),
WS_CAPTION | WS_POPUPWINDOW | WS_VISIBLE,
0,
0,
screenWidth,
screenHeight,
NULL,
NULL,
windowclass.hInstance,
NULL);
hDC = GetDC(gHwnd);
pfmt = ChoosePixelFormat(hDC, &pfd);
ret = SetPixelFormat(hDC, pfmt, &pfd);
hRC = wglCreateContext(hDC);
ret = wglMakeCurrent(hDC, hRC);
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
status = clGetGLContextInfoKHR(properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
// Create OpenCL context from device's id
context = clCreateContext(properties,
1,
&interopDevice,
0,
0,
&status);
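The properties array passed to clGetGLContextInfoKHR and clCreateContext above is a list of name/value pairs terminated by a single 0. The following self-contained C sketch shows how such a list can be walked. It does not use the OpenCL headers: ex_property stands in for cl_context_properties (which is an intptr_t), and the EX_KEY_* names and their values are placeholders invented for this example, not the real CL_* constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t ex_property;  /* stand-in for cl_context_properties */

/* Placeholder keys for illustration only; NOT the real CL_* values. */
enum { EX_KEY_PLATFORM = 1, EX_KEY_GL_CONTEXT = 2, EX_KEY_WGL_HDC = 3 };

/* Same shape as the properties[] array above: key, value, ..., 0. */
static const ex_property ex_sample_properties[] = {
    EX_KEY_PLATFORM,   111,
    EX_KEY_GL_CONTEXT, 222,
    EX_KEY_WGL_HDC,    333,
    0
};

/* Return the value stored for key, or fallback if the key is absent. */
static ex_property ex_find_property(const ex_property *list,
                                    ex_property key,
                                    ex_property fallback)
{
    assert(list != NULL);
    for (size_t i = 0; list[i] != 0; i += 2) {
        if (list[i] == key) {
            return list[i + 1];
        }
    }
    return fallback;
}
```

Omitting the terminating 0 is a common mistake with lists of this shape, because the consumer has no other way to find the end of the array.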
E.1.2 Multi-GPU Environment
2. To query all display devices in the current session, call this function in a loop,
starting with DevNum set to 0, and incrementing DevNum until the function fails.
To select all display devices in the desktop, use only the display devices that
have the DISPLAY_DEVICE_ATTACHED_TO_DESKTOP flag in the
DISPLAY_DEVICE structure.
3. To get information on the display adapter, call EnumDisplayDevices with
lpDevice set to NULL. For example, DISPLAY_DEVICE.DeviceString
contains the adapter name.
4. Use EnumDisplaySettings to get DEVMODE. dmPosition.x and
dmPosition.y are used to get the x coordinate and y coordinate of the
current display.
5. Try to find the first OpenCL device (winner) associated with the OpenGL
rendering context by using the loop technique of step 2, above.
6. Inside the loop:
a. Create a window on a specific display by using the CreateWindow
function. This function returns the window handle (HWND).
b. Use GetDC to get a handle to the device context for the client area of a
specific window, or for the entire screen. Alternatively, use the CreateDC
function to create a device context (HDC) for the specified device.
c.
The following code demonstrates how to use the WIN32 windowing API for CL-GL
interoperability in a multi-GPU environment.
int xCoordinate = 0;
int yCoordinate = 0;
for (deviceNum = 0; EnumDisplayDevices(NULL,
deviceNum,
&dispDevice,
0); deviceNum++)
{
if (dispDevice.StateFlags &
DISPLAY_DEVICE_MIRRORING_DRIVER)
{
continue;
}
DEVMODE deviceMode;
EnumDisplaySettings(dispDevice.DeviceName,
ENUM_CURRENT_SETTINGS,
&deviceMode);
xCoordinate = deviceMode.dmPosition.x;
yCoordinate = deviceMode.dmPosition.y;
WNDCLASS windowclass;
windowclass.style = CS_OWNDC;
windowclass.lpfnWndProc = WndProc;
windowclass.cbClsExtra = 0;
windowclass.cbWndExtra = 0;
windowclass.hInstance = NULL;
windowclass.hIcon = LoadIcon(NULL, IDI_APPLICATION);
windowclass.hCursor = LoadCursor(NULL, IDC_ARROW);
windowclass.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
windowclass.lpszMenuName = NULL;
windowclass.lpszClassName = reinterpret_cast<LPCSTR>("SimpleGL");
RegisterClass(&windowclass);
gHwnd = CreateWindow(
reinterpret_cast<LPCSTR>("SimpleGL"),
reinterpret_cast<LPCSTR>(
"OpenGL Texture Renderer"),
WS_CAPTION | WS_POPUPWINDOW,
xCoordinate,
yCoordinate,
screenWidth,
screenHeight,
NULL,
NULL,
windowclass.hInstance,
NULL);
hDC = GetDC(gHwnd);
pfmt = ChoosePixelFormat(hDC, &pfd);
ret = SetPixelFormat(hDC, pfmt, &pfd);
hRC = wglCreateContext(hDC);
ret = wglMakeCurrent(hDC, hRC);
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
if (!clGetGLContextInfoKHR)
{
clGetGLContextInfoKHR = (clGetGLContextInfoKHR_fn)
clGetExtensionFunctionAddress(
"clGetGLContextInfoKHR");
}
size_t deviceSize = 0;
status = clGetGLContextInfoKHR(properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
0,
NULL,
&deviceSize);
if (deviceSize == 0)
{
// no interopable CL device found, cleanup
wglMakeCurrent(NULL, NULL);
wglDeleteContext(hRC);
ReleaseDC(gHwnd, hDC); // a DC obtained with GetDC is released, not deleted
hDC = NULL;
hRC = NULL;
DestroyWindow(gHwnd);
// try the next display
continue;
}
ShowWindow(gHwnd, SW_SHOW);
//Found a winner
break;
}
cl_context_properties properties[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties) platform,
CL_GL_CONTEXT_KHR,
(cl_context_properties) hRC,
CL_WGL_HDC_KHR,
(cl_context_properties) hDC,
0
};
status = clGetGLContextInfoKHR( properties,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
// Create OpenCL context from device's id
context = clCreateContext(properties,
1,
&interopDevice,
0,
0,
&status);
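The control flow of the search loop above (enumerate displays by index until the enumeration fails, skip unsuitable displays, stop at the first one with an interoperable device) can be exercised without Windows by using a mock. In the C sketch below, ex_enum_display is a hypothetical stand-in for EnumDisplayDevices, the flag values are placeholders rather than the real Windows constants, and has_interop_device stands in for the result of the clGetGLContextInfoKHR size query.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag values for this sketch; the real flags are
   DISPLAY_DEVICE_ATTACHED_TO_DESKTOP and DISPLAY_DEVICE_MIRRORING_DRIVER. */
#define EX_ATTACHED_TO_DESKTOP 0x1u
#define EX_MIRRORING_DRIVER    0x2u

typedef struct {
    const char *name;
    unsigned    state_flags;
    int         has_interop_device;  /* mock of "deviceSize != 0" */
} ex_display;

static const ex_display ex_sample_displays[] = {
    { "mirror", EX_ATTACHED_TO_DESKTOP | EX_MIRRORING_DRIVER, 1 },
    { "no-cl",  EX_ATTACHED_TO_DESKTOP,                       0 },
    { "winner", EX_ATTACHED_TO_DESKTOP,                       1 },
};

/* Mock enumerator: returns 1 and fills *out while index is valid,
   and returns 0 once the index runs past the last display. */
static int ex_enum_display(const ex_display *table, size_t count,
                           size_t index, ex_display *out)
{
    assert(out != NULL);
    if (index >= count) {
        return 0;
    }
    *out = table[index];
    return 1;
}

/* Index of the first usable display, or -1 if none qualifies. */
static int ex_find_winner(const ex_display *table, size_t count)
{
    ex_display d;
    for (size_t i = 0; ex_enum_display(table, count, i, &d); ++i) {
        if (d.state_flags & EX_MIRRORING_DRIVER) continue;
        if (!(d.state_flags & EX_ATTACHED_TO_DESKTOP)) continue;
        if (!d.has_interop_device) continue;  /* try the next display */
        return (int)i;                        /* found a winner */
    }
    return -1;
}
```

With the sample table, the first entry is skipped as a mirroring driver, the second has no interoperable device, and the third is selected.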
E.1.3 Limitations
1. Use glutInit to initialize the GLUT library and to negotiate a session with
the windowing system. This function also processes the command-line
options depending on the windowing system.
2. Use glXGetCurrentContext to get the current rendering context
(GLXContext).
3. Use glXGetCurrentDisplay to get the display (Display *) that is associated
with the current OpenGL rendering context of the calling thread.
4. Use clGetGLContextInfoKHR (see Section 9.7 of the OpenCL Specification
1.1) and the CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR parameter to get the
device ID of the CL device associated with the OpenGL context.
5. Use clCreateContext (see Section 4.3 of the OpenCL Specification 1.1) to
create the CL context (of type cl_context).
The following code snippet shows how to create an interoperability context using
GLUT in Linux.
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
glutInitWindowSize(WINDOW_WIDTH, WINDOW_HEIGHT);
glutCreateWindow("OpenCL SimpleGL");
GLXContext glCtx = glXGetCurrentContext();
cl_context_properties cpsGL[] =
{
CL_CONTEXT_PLATFORM,
(cl_context_properties)platform,
CL_GLX_DISPLAY_KHR,
(intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR,
(intptr_t) glCtx,
0
};
status = clGetGLContextInfoKHR(cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDevice,
NULL);
// Create OpenCL context from device's id
context = clCreateContext(cpsGL,
1,
&interopDevice,
0,
0,
&status);
4. Use XCreateColormap to create a color map of the specified visual type for
the screen on which the specified window resides. The function returns the
colormap ID associated with it. Note that the specified window is only used to
determine the screen.
5. Use XCreateWindow to create an unmapped sub-window for a specified
parent window. The function returns the window ID of the created window and
causes the X server to generate a CreateNotify event. The created window is
placed on top in the stacking order with respect to siblings.
6. Use XMapWindow to map the window and all of its sub-windows that have had
map requests. Mapping a window that has an unmapped ancestor does not
display the window, but marks it as eligible for display when the ancestor
becomes mapped. Such a window is called unviewable. When all its
ancestors are mapped, the window becomes viewable and is visible on the
screen if it is not obscured by another window.
7. Use glXCreateContextAttribsARB to initialize the context to the initial state
defined by the OpenGL specification and return a handle to it. This handle
can be used to render to any GLX surface.
8. Use glXMakeCurrent to make argument 3 (GLXContext) the current GLX
rendering context of the calling thread, replacing the previously current
context if there was one, and to attach argument 3 (GLXContext) to a GLX
drawable, either a window or a GLX pixmap.
9. Use clGetGLContextInfoKHR to get the OpenCL-OpenGL interoperability
device corresponding to the window created in step 5.
10. Use clCreateContext to create the context on the interoperable device
obtained in step 9.
The following code snippet shows how to create a CL-GL interoperability context
using the X Window system in Linux.
Display *displayName = XOpenDisplay(0);
int nelements;
GLXFBConfig *fbc = glXChooseFBConfig(displayName,
DefaultScreen(displayName), 0, &nelements);
static int attributeList[] = { GLX_RGBA,
GLX_DOUBLEBUFFER,
GLX_RED_SIZE,
1,
GLX_GREEN_SIZE,
1,
GLX_BLUE_SIZE,
1,
None
};
XVisualInfo *vi = glXChooseVisual(displayName,
DefaultScreen(displayName),
attributeList);
XSetWindowAttributes swa;
swa.colormap = XCreateColormap(displayName,
RootWindow(displayName, vi->screen),
vi->visual,
AllocNone);
swa.border_pixel = 0;
swa.event_mask = StructureNotifyMask;
E.2.2 Multi-GPU Configuration
g. Use XMapWindow to map the window and all of its sub-windows that have
had map requests. Mapping a window that has an unmapped ancestor
does not display the window but marks it as eligible for display when the
ancestor becomes mapped. Such a window is called unviewable. When
all its ancestors are mapped, the window becomes viewable and is
visible on the screen, if it is not obscured by another window.
h. Use glXCreateContextAttribsARB function to initialize the context to
the initial state defined by the OpenGL specification and return a handle
to it. This handle can be used to render to any GLX surface.
i.
j.
k.
int attribs[] = {
GLX_CONTEXT_MAJOR_VERSION_ARB, 3,
GLX_CONTEXT_MINOR_VERSION_ARB, 0,
0
};
GLXContext ctx = glXCreateContextAttribsARB(displayName,
*fbc,
0,
true,
attribs);
glXMakeCurrent (displayName,
win,
ctx);
gGlCtx = glXGetCurrentContext();
cl_context_properties cpsGL[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
CL_GLX_DISPLAY_KHR, (intptr_t) glXGetCurrentDisplay(),
CL_GL_CONTEXT_KHR, (intptr_t) gGlCtx, 0
};
size_t deviceSize = 0;
status = clGetGLContextInfoKHR(cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
0,
NULL,
&deviceSize);
int numDevices = (deviceSize / sizeof(cl_device_id));
if(numDevices == 0)
{
glXDestroyContext(glXGetCurrentDisplay(), gGlCtx);
continue;
}
else
{
//Interoperable device found
std::cout<<"Interoperable device found "<<std::endl;
break;
}
}
status = clGetGLContextInfoKHR( cpsGL,
CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
sizeof(cl_device_id),
&interopDeviceId,
NULL);
// Create OpenCL context from device's id
context = clCreateContext(cpsGL,
1,
&interopDeviceId,
0,
0,
&status);
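The snippet above calls clGetGLContextInfoKHR twice: first with a zero-sized buffer to learn how many bytes are available, then with a real buffer to fetch the device. The C sketch below mocks that size-query convention so the pattern can be run without an OpenCL installation. ex_query is a hypothetical stand-in, not the real entry point, and ex_device_id stands in for cl_device_id.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int ex_device_id;  /* stand-in for cl_device_id */

static const ex_device_id ex_one_device[] = { 42 };

/* Mock info query: reports the required size through size_ret and, if a
   buffer is supplied, copies the data into it. Returns 0 on success. */
static int ex_query(const ex_device_id *src, size_t src_count,
                    size_t value_size, void *value, size_t *size_ret)
{
    size_t bytes = src_count * sizeof(ex_device_id);
    if (size_ret != NULL) {
        *size_ret = bytes;
    }
    if (value != NULL && bytes > 0) {
        if (value_size < bytes) {
            return -1;  /* buffer too small */
        }
        memcpy(value, src, bytes);
    }
    return 0;
}

/* Two-step query: size first, then data. Returns fallback when no
   device is reported. */
static ex_device_id ex_first_device_or(const ex_device_id *src,
                                       size_t src_count,
                                       ex_device_id fallback)
{
    size_t bytes = 0;
    if (ex_query(src, src_count, 0, NULL, &bytes) != 0) {
        return fallback;
    }
    if (bytes / sizeof(ex_device_id) == 0) {
        return fallback;  /* no interoperable device */
    }
    ex_device_id device = fallback;
    if (ex_query(src, 1, sizeof device, &device, NULL) != 0) {
        return fallback;
    }
    return device;
}
```

Checking the reported size before the second call is what lets the loop above discard a GL context that has no interoperable CL device.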
Table E.1 AMD-Supported GL Formats
GL internal format    CL image format
GL_ALPHA8             CL_A, CL_UNORM_INT8
GL_R8                 CL_R, CL_UNORM_INT8
GL_R8UI               CL_R, CL_UNSIGNED_INT8
GL_R8I                CL_R, CL_SIGNED_INT8
GL_RG8                CL_RG, CL_UNORM_INT8
GL_RG8UI              CL_RG, CL_UNSIGNED_INT8
GL_RG8I               CL_RG, CL_SIGNED_INT8
GL_RGB8               CL_RGB, CL_UNORM_INT8
GL_RGB8UI             CL_RGB, CL_UNSIGNED_INT8
GL_RGB8I              CL_RGB, CL_SIGNED_INT8
GL_R16                CL_R, CL_UNORM_INT16
GL_R16UI              CL_R, CL_UNSIGNED_INT16
GL_R16I               CL_R, CL_SIGNED_INT16
GL_RG16               CL_RG, CL_UNORM_INT16
GL_RG16UI             CL_RG, CL_UNSIGNED_INT16
GL_RG16I              CL_RG, CL_SIGNED_INT16
GL_RGB16              CL_RGB, CL_UNORM_INT16
GL_RGB16UI            CL_RGB, CL_UNSIGNED_INT16
GL_RGB16I             CL_RGB, CL_SIGNED_INT16
GL_R32I               CL_R, CL_SIGNED_INT32
GL_R32UI              CL_R, CL_UNSIGNED_INT32
GL_R32F               CL_R, CL_FLOAT
GL_RG32I              CL_RG, CL_SIGNED_INT32
GL_RG32UI             CL_RG, CL_UNSIGNED_INT32
GL_RG32F              CL_RG, CL_FLOAT
GL_RGB32I             CL_RGB, CL_SIGNED_INT32
GL_RGB32UI            CL_RGB, CL_UNSIGNED_INT32
GL_RGB32F             CL_RGB, CL_FLOAT
Appendix F
New and deprecated functions in OpenCL 2.0
F.1 New built-in functions in OpenCL 2.0
F.1.1
F.1.2
get_global_linear_id
get_local_linear_id
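As a host-side illustration of what these two built-ins compute, the C sketch below linearizes a work-item ID in the way the OpenCL 2.0 C specification defines it: dimension 0 varies fastest, and the global variant subtracts the global offset first. The ex_ functions are hypothetical emulations written for this example; in a kernel, the built-ins take no arguments. For fewer than three dimensions, pass 0 for the unused IDs and offsets and 1 for the unused sizes.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates get_local_linear_id():
   (lz * size_y * size_x) + (ly * size_x) + lx. */
static size_t ex_local_linear_id(size_t lx, size_t ly, size_t lz,
                                 size_t size_x, size_t size_y)
{
    assert(size_x > 0 && size_y > 0);
    return (lz * size_y + ly) * size_x + lx;
}

/* Emulates get_global_linear_id(): the same layout, applied to the
   global IDs after the global offset has been subtracted. */
static size_t ex_global_linear_id(size_t gx, size_t gy, size_t gz,
                                  size_t off_x, size_t off_y, size_t off_z,
                                  size_t size_x, size_t size_y)
{
    assert(gx >= off_x && gy >= off_y && gz >= off_z);
    return ex_local_linear_id(gx - off_x, gy - off_y, gz - off_z,
                              size_x, size_y);
}
```

For a two-dimensional range that is 8 work-items wide with an offset of (1, 1), the work-item with global ID (3, 2) has linear ID (2 - 1) * 8 + (3 - 1) = 10.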
Integer functions
ctz
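The ctz built-in returns the number of trailing zero bits of its argument; for an argument of 0 it returns the size of the type in bits. A portable C emulation for 32-bit values, given here only as a sketch with a hypothetical name, looks like this:

```c
#include <assert.h>
#include <stdint.h>

/* Emulates ctz() for a 32-bit unsigned value. */
static unsigned ex_ctz32(uint32_t x)
{
    if (x == 0) {
        return 32;  /* size of the type in bits */
    }
    unsigned n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    assert(n < 32);
    return n;
}
```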
F.1.3 Synchronization Functions
work_group_barrier
F.1.4
F.1.5
to_local
to_private
get_fence
Atomic functions
atomic_init
atomic_work_item_fence
memory fence
atomic_store[_explicit]
atomic store
atomic_load[_explicit]
atomic load
atomic_exchange[_explicit]
atomic exchange
F.1.6
F.1.7
atomic_fetch_add[_explicit]
atomic fetch+add
atomic_fetch_sub[_explicit]
atomic fetch+sub
atomic_fetch_or[_explicit]
atomic fetch+or
atomic_fetch_xor[_explicit]
atomic fetch+xor
atomic_fetch_and[_explicit]
atomic fetch+and
atomic_fetch_max[_explicit]
atomic fetch+max
atomic_fetch_min[_explicit]
atomic fetch+min
atomic_flag_test_and_set[_explicit]
atomic_flag_clear[_explicit]
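The OpenCL 2.0 atomic functions are modeled on a subset of the C11 atomics, so their behavior can be tried on the host with <stdatomic.h>. The C11 sketch below is an analogy, not OpenCL kernel code: the OpenCL _explicit variants can also take a memory_scope argument, and atomic_fetch_min and atomic_fetch_max have no C11 counterpart. The ex_ function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* fetch+add returns the value held BEFORE the addition. */
static int ex_fetch_add_demo(void)
{
    atomic_int v;
    atomic_init(&v, 10);

    int old = atomic_fetch_add_explicit(&v, 5, memory_order_relaxed);
    assert(old == 10);

    return atomic_load_explicit(&v, memory_order_relaxed);
}

/* exchange stores a new value and returns the previous one. */
static int ex_exchange_demo(void)
{
    atomic_int v;
    atomic_init(&v, 7);
    return atomic_exchange_explicit(&v, 100, memory_order_acq_rel);
}

/* test_and_set returns the previous state of the flag. The result packs
   the three observations into bits 2, 1, and 0. */
static int ex_flag_demo(void)
{
    atomic_flag f = ATOMIC_FLAG_INIT;

    int first  = atomic_flag_test_and_set(&f) ? 1 : 0;  /* was clear */
    int second = atomic_flag_test_and_set(&f) ? 1 : 0;  /* was set   */
    atomic_flag_clear(&f);
    int third  = atomic_flag_test_and_set(&f) ? 1 : 0;  /* was clear */

    return first * 4 + second * 2 + third;
}
```

The flag example returns 2: only the second test_and_set observes a flag that was already set.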
write_imagef
work_group_any
work_group_broadcast
work_group_reduce_add
work_group_reduce_max
work_group_reduce_min
work_group_scan_exclusive_add
work_group_scan_exclusive_max
work_group_scan_exclusive_min
F.1.8
F.1.9
work_group_scan_inclusive_add
work_group_scan_inclusive_max
work_group_scan_inclusive_min
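The difference between the reduce, exclusive-scan, and inclusive-scan functions is easiest to see on a small example. The C sketch below emulates the add variants serially on the host: each function returns the value that the work-item with the given local ID would receive, with work-items ordered by linear local ID. The ex_ names and the sample data are hypothetical; the real built-ins are collective operations called by every work-item in the work-group.

```c
#include <assert.h>
#include <stddef.h>

/* One value contributed by each of four work-items. */
static const int ex_values[4] = { 3, 1, 4, 1 };

/* Emulates work_group_reduce_add: every work-item gets the total. */
static int ex_reduce_add(const int *values, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += values[i];
    }
    return sum;
}

/* Emulates work_group_scan_exclusive_add: sum of the values of all
   work-items BEFORE local_id (0 for the first work-item). */
static int ex_scan_exclusive_add(const int *values, size_t n, size_t local_id)
{
    assert(local_id < n);
    int sum = 0;
    for (size_t i = 0; i < local_id; ++i) {
        sum += values[i];
    }
    return sum;
}

/* Emulates work_group_scan_inclusive_add: sum of the values of all
   work-items up to AND including local_id. */
static int ex_scan_inclusive_add(const int *values, size_t n, size_t local_id)
{
    assert(local_id < n);
    return ex_scan_exclusive_add(values, n, local_id) + values[local_id];
}
```

For the sample data, the exclusive scan yields 0, 3, 4, 8 and the inclusive scan yields 3, 4, 8, 9; the last inclusive value equals the reduction.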
Pipe functions
read_pipe
write_pipe
Write to pipe
reserve_read_pipe
reserve_write_pipe
commit_read_pipe
commit_write_pipe
is_valid_reserve_id
work_group_reserve_read_pipe
work_group_reserve_write_pipe
work_group_commit_read_pipe
work_group_commit_write_pipe
get_pipe_num_packets
get_pipe_max_packets
Enqueueing Kernels
enqueue_kernel
get_kernel_work_group_size
Enqueue a marker
retain_event
release_event
create_user_event
is_valid_event
set_user_event_status
capture_event_profiling_info
get_default_queue
ndrange_1D
Create 1D NDRange
ndrange_2D
Create 2D NDRange
ndrange_3D
Create 3D NDRange
F.1.10 Sub-groups
get_sub_group_size
get_max_sub_group_size
get_num_sub_groups
get_enqueued_num_sub_groups
get_sub_group_id
get_sub_group_local_id
sub_group_barrier
sub_group_all
sub_group_any
sub_group_broadcast
sub_group_reduce_add
sub_group_reduce_max
sub_group_reduce_min
sub_group_scan_exclusive_add
sub_group_scan_exclusive_max
sub_group_scan_exclusive_min
sub_group_scan_inclusive_add
sub_group_scan_inclusive_max
sub_group_scan_inclusive_min
sub_group_reserve_read_pipe
sub_group_reserve_write_pipe
sub_group_commit_read_pipe
sub_group_commit_write_pipe
write_mem_fence
atomic_add
atomic_sub
atomic_xchg
atomic_inc
atomic_dec
atomic_cmpxchg
atomic_min
atomic_max
atomic_and
atomic_or
atomic_xor
F.3.2
New Types
cl_device_svm_capabilities   Returned by clGetDeviceInfo(...CL_DEVICE_SVM_CAPABILITIES...)
cl_queue_properties          See clCreateCommandQueueWithProperties
cl_svm_mem_flags             See clSVMAlloc
cl_pipe_properties           See clCreatePipe
cl_pipe_info                 See clGetPipeInfo
cl_sampler_properties        See clCreateSamplerWithProperties
cl_kernel_exec_info          See clSetKernelExecInfo
cl_image_desc
cl_kernel_sub_group_info     See clGetKernelSubGroupInfoKHR
New Macros
CL_INVALID_PIPE_SIZE
CL_INVALID_DEVICE_QUEUE
CL_VERSION_2_0
CL_DEVICE_QUEUE_ON_HOST_PROPERTIES
CL_DEVICE_MAX_READ_WRITE_IMAGE_ARGS
CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE
CL_DEVICE_QUEUE_ON_DEVICE_PROPERTIES
CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE
CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE
CL_DEVICE_MAX_ON_DEVICE_QUEUES
CL_DEVICE_MAX_ON_DEVICE_EVENTS
CL_DEVICE_SVM_CAPABILITIES
CL_DEVICE_GLOBAL_VARIABLE_PREFERRED_TOTAL_SIZE
CL_DEVICE_MAX_PIPE_ARGS
CL_DEVICE_PIPE_MAX_ACTIVE_RESERVATIONS
CL_DEVICE_PIPE_MAX_PACKET_SIZE
CL_DEVICE_PREFERRED_PLATFORM_ATOMIC_ALIGNMENT
CL_DEVICE_PREFERRED_GLOBAL_ATOMIC_ALIGNMENT
CL_DEVICE_PREFERRED_LOCAL_ATOMIC_ALIGNMENT
CL_QUEUE_ON_DEVICE
CL_QUEUE_ON_DEVICE_DEFAULT
CL_DEVICE_SVM_COARSE_GRAIN_BUFFER
CL_DEVICE_SVM_FINE_GRAIN_BUFFER
CL_DEVICE_SVM_FINE_GRAIN_SYSTEM
CL_DEVICE_SVM_ATOMICS
CL_QUEUE_SIZE
CL_MEM_SVM_FINE_GRAIN_BUFFER
CL_MEM_SVM_ATOMICS
CL_sRGB
CL_sRGBx
CL_sRGBA
CL_sBGRA
CL_ABGR
CL_MEM_OBJECT_PIPE
CL_MEM_USES_SVM_POINTER
CL_PIPE_PACKET_SIZE
CL_PIPE_MAX_PACKETS
CL_SAMPLER_MIP_FILTER_MODE
CL_SAMPLER_LOD_MIN
CL_SAMPLER_LOD_MAX
CL_PROGRAM_BUILD_GLOBAL_VARIABLE_TOTAL_SIZE
CL_KERNEL_ARG_TYPE_PIPE
CL_KERNEL_EXEC_INFO_SVM_PTRS
CL_KERNEL_EXEC_INFO_SVM_FINE_GRAIN_SYSTEM
CL_COMMAND_SVM_FREE
CL_COMMAND_SVM_MEMCPY
CL_COMMAND_SVM_MEMFILL
CL_COMMAND_SVM_MAP
CL_COMMAND_SVM_UNMAP
CL_PROFILING_COMMAND_COMPLETE
F.3.3
clCreatePipe
clGetPipeInfo
clSVMAlloc
clSVMFree
clEnqueueSVMFree
clEnqueueSVMMemcpy
clEnqueueSVMMemFill
clEnqueueSVMMap
clEnqueueSVMUnmap
clCreateSamplerWithProperties
clSetKernelArgSVMPointer
clSetKernelExecInfo
clGetKernelSubGroupInfoKHR
Appendix G
Standard Portable Intermediate Representation (SPIR)
This chapter provides an overview of the Standard Portable Intermediate
Representation (SPIR) format. Application developers can use SPIR to avoid
shipping kernel source and to manage the proliferation of devices and drivers
from multiple vendors.
SPIR is a portable encoding of device programs. For example, SPIR 1.2 is an
encoding of OpenCL C (version 1.2) device programs in LLVM IR; SPIR 1.2
defines how any OpenCL C (version 1.2) device program can be encoded in
LLVM (version 3.2). SPIR 2.0 has yet to be published. For details, see the SPIR
specification.
Open-source tools such as the Clang compiler can be used to generate SPIR
output for any OpenCL kernel program. For information about some open-source
generators that are used to generate SPIR, see
https://github.com/KhronosGroup/SPIR.
Index
Symbols
_global atomics. . . . . . . . . . . . . . . . . . . . . . . 19
_local atomics . . . . . . . . . . . . . . . . . . . . . . . . 19
.amdil
generating. . . . . . . . . . . . . . . . . . . . . . . . . . 3
.comment
BIF binary . . . . . . . . . . . . . . . . . . . . . . . . . . 1
storing OpenCL and driver versions that created the binary . . . . . . . . . . . . . . . . . . . . 1
.llvmir
generating. . . . . . . . . . . . . . . . . . . . . . . . . . 3
storing OpenCL immediate representation
(LLVM IR). . . . . . . . . . . . . . . . . . . . . . . . 1
.rodata
storing OpenCL runtime control data. . . . . 1
.shstrtab
forming an ELF. . . . . . . . . . . . . . . . . . . . . . 1
.source
storing OpenCL source program . . . . . . . . 1
.strtab
forming an ELF. . . . . . . . . . . . . . . . . . . . . . 1
.symtab
forming an ELF. . . . . . . . . . . . . . . . . . . . . . 1
.text
generating. . . . . . . . . . . . . . . . . . . . . . . . . . 3
storing the executable . . . . . . . . . . . . . . . . 1
Numerics
1D address . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2D
address . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2D addresses
reading and writing. . . . . . . . . . . . . . . . . . 10
A
access
memory. . . . . . . . . . . . . . . . . . . . . . . . . 5,
accumulation operations
NDRange . . . . . . . . . . . . . . . . . . . . . . . . . .
address
1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1
10
10
normalized . . . . . . . . . . . . . . . . . . . . . . . . . 10
un-normalized. . . . . . . . . . . . . . . . . . . . . . . 10
allocating
images
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 4
memory
selecting a device . . . . . . . . . . . . . . . . . . 4
memory buffer
OpenCL program model . . . . . . . . . . . . . 4
ALUs
arrangement of . . . . . . . . . . . . . . . . . . . . . . . 7
AMD Accelerated Parallel Processing
implementation of OpenCL . . . . . . . . . . . . . 1
open platform strategy . . . . . . . . . . . . . . . . . 1
programming model . . . . . . . . . . . . . . . . . . . 2
relationship of components . . . . . . . . . . . . . 1
software stack . . . . . . . . . . . . . . . . . . . . . . . 1
AMD APP KernelAnalyzer . . . . . . . . . . . . . . . . 1
AMD GPU
number of compute units . . . . . . . . . . . . . . . 7
AMD Radeon HD 68XX . . . . . . . . . . . . . . . . . 14
AMD Radeon HD 69XX . . . . . . . . . . . . . . . . . 14
AMD Radeon HD 75XX . . . . . . . . . . . . . . . . . 14
AMD Radeon HD 77XX . . . . . . . . . . . . . . . . . 14
AMD Radeon HD 78XX . . . . . . . . . . . . . . . . . 14
AMD Radeon HD 79XX series. . . . . . . . . . . . 14
AMD Radeon HD 7XXX . . . . . . . . . . . . . . . . . . 2
AMD Radeon R9 290X . . . . . . . . . . . . . . . 3, 8
AMD supplemental compiler . . . . . . . . . . . . . . 6
-g option . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
AMD supplemental compiler option
-f[n-]bin-source . . . . . . . . . . . . . . . . . . . . . . . 6
-f[no-]bin-amdil . . . . . . . . . . . . . . . . . . . . . . . 7
-f[no-]bin-exe . . . . . . . . . . . . . . . . . . . . . . . . 7
-f[no-]bin-llvmir . . . . . . . . . . . . . . . . . . . . . . . 7
AMD vendor-specific extensions . . . . . . . . . . . 5
amd_bitalign
built-in function . . . . . . . . . . . . . . . . . . . . . . . 8
amd_bytealign
built-in function . . . . . . . . . . . . . . . . . . . . . . . 8
amd_lerp
built-in function . . . . . . . . . . . . . . . . . . . . . . . 9
AMD_OCL_BUILD_OPTIONS
Index-1
environment variables. . . . . . . . . . . . . . . . . 9
AMD_OCL_BUILD_OPTIONS_APPEND
environment variable. . . . . . . . . . . . . . . . . . 9
amd_pack
built-in function . . . . . . . . . . . . . . . . . . . . . . 8
amd_sad
buillt-in function . . . . . . . . . . . . . . . . . . . . . . 9
amd_sad4
built-in function . . . . . . . . . . . . . . . . . . . . . . 9
amd_sadhi
built-in function . . . . . . . . . . . . . . . . . . . . . . 9
amd_unpack0
built-in function . . . . . . . . . . . . . . . . . . . . . . 8
amd_unpack1
built-in function . . . . . . . . . . . . . . . . . . . . . . 8
amd_unpack2
built-in function . . . . . . . . . . . . . . . . . . . . . . 8
amd_unpack3
built-in function . . . . . . . . . . . . . . . . . . . . . . 8
API
C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
naming extension functions . . . . . . . . . . . . 1
platform
querying . . . . . . . . . . . . . . . . . . . . . . . . . 1
processing calls . . . . . . . . . . . . . . . . . . . . . 9
API commands
three categories . . . . . . . . . . . . . . . . . . . . 10
application code
developing Visual Studio . . . . . . . . . . . . . 12
application kernels
device-specific binaries. . . . . . . . . . . . . . . . 1
arrangement of ALUs . . . . . . . . . . . . . . . . . . . 7
atomics
_global. . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
_local . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
B
barrier
command-queue . . . . . . . . . . . . . . . . . . . . .
barriers
execution order . . . . . . . . . . . . . . . . . . . . . .
work-group . . . . . . . . . . . . . . . . . . . . . . . . .
work-items
encountering . . . . . . . . . . . . . . . . . . . . . .
BIF
.comment
storing OpenCL and driver versions that
created the binary . . . . . . . . . . . . . . .
.llvmir
storing immediate representation (LLVM
IR). . . . . . . . . . . . . . . . . . . . . . . . . . . .
.source
storing OpenCL source program . . . . . .
5
4
4
4
1
1
1
binary
.comment section . . . . . . . . . . . . . . . . . . 1
bitness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
changing default behavior . . . . . . . . . . . . . 1
ELF special sections. . . . . . . . . . . . . . . . . . 1
options to control what is contained in the
binary . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
binary
application kernels . . . . . . . . . . . . . . . . . . . 1
controlling
BIF options . . . . . . . . . . . . . . . . . . . . . . . 3
CPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
generating
in OpenCL . . . . . . . . . . . . . . . . . . . . . . . 1
LLVM AS . . . . . . . . . . . . . . . . . . . . . . . 14
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Binary Image Format (BIF)
See BIF
bitness
BIF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
branch
granularity
work-item . . . . . . . . . . . . . . . . . . . . . . . . 4
instructions . . . . . . . . . . . . . . . . . . . . . . . . . 4
branching
flow control . . . . . . . . . . . . . . . . . . . . . . . . . 4
breakpoint
CL kernel function. . . . . . . . . . . . . . . . . . . 11
host code . . . . . . . . . . . . . . . . . . . . . . . . . 11
no breakpoint is set . . . . . . . . . . . . . . . . . 11
setting . . . . . . . . . . . . . . . . . . . . . . . . . 11, 12
sample GDB debugging session . . . . . 11
setting a . . . . . . . . . . . . . . . . . . . . . . . . . . 11
buffer
command queue . . . . . . . . . . . . . . . . . . . . . 9
global. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
source or destination for instruction. . . 10
storing writes to random memory locations
10
relationship
sample code . . . . . . . . . . . . . . . . . . . . . . 4
build log
printing out . . . . . . . . . . . . . . . . . . . . . . . . 16
built-in function
amd_bitalign . . . . . . . . . . . . . . . . . . . . . . . . 8
amd_bytealign. . . . . . . . . . . . . . . . . . . . . . . 8
amd_lerp . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
amd_pack . . . . . . . . . . . . . . . . . . . . . . . . . . 8
amd_sad . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
amd_sad4 . . . . . . . . . . . . . . . . . . . . . . . . . . 9
amd_sadhi. . . . . . . . . . . . . . . . . . . . . . . . . . 9
amd_unpack0 . . . . . . . . . . . . . . . . . . . . . . . 8
Index-2
Copyright 2015 Advanced Micro Devices, Inc. All rights reserved.
amd_unpack1 . . . . . . . . . . . . . . . . . . . . . . .
amd_unpack2 . . . . . . . . . . . . . . . . . . . . . . .
amd_unpack3 . . . . . . . . . . . . . . . . . . . . . . .
built-in functions
for OpenCL language
cl_amd_media_ops . . . . . . . . . . . . . . . .
OpenCL C programs
cl_amd_printf . . . . . . . . . . . . . . . . . . . .
variadic arguments . . . . . . . . . . . . . . . . . .
writing output to the stdout stream . . . . .
burst write . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
8
8
7
10
10
10
10
C
C kernels
predefined macros . . . . . . . . . . . . . . . . . . 11
C program sample
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 11
C programming
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
C++ API . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
C++ bindings
OpenCL programming . . . . . . . . . . . . . . . 14
C++ files
compiling. . . . . . . . . . . . . . . . . . . . . . . . . . . 2
C++ kermel language . . . . . . . . . . . . . . . . . iii, 1
C++ kernels
building . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
C++ templates . . . . . . . . . . . . . . . . . . . . . . . . 5
cache
L1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
L2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
texture system . . . . . . . . . . . . . . . . . . . . . 11
call error
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 16
character extensions. . . . . . . . . . . . . . . . . . . . 1
searching for substrings . . . . . . . . . . . . . . . 2
character sequence
format string . . . . . . . . . . . . . . . . . . . . . . . 10
CL context
associate with GL context . . . . . . . . . . . . . 1
CL kernel function
breakpoint . . . . . . . . . . . . . . . . . . . . . . . . . 11
CL options
change during runtime . . . . . . . . . . . . . . . . 9
cl_amd_device_attribute_query extension
querying AMD-specific device attributes . . 5
cl_amd_event_callback extension
registering event callbacks for states . . . . 7
cl_amd_fp64 extension. . . . . . . . . . . . . . . . . . 5
cl_amd_media_ops extension
adding built-in functions to OpenCL language
7
cl_amd_printf extension . . . . . . . . . . . . . . . . 10
cl_ext extensions . . . . . . . . . . . . . . . . . . . . . . 5
cl_khr_fp64
supported function . . . . . . . . . . . . . . . . . . 14
classes
passing between host and device . . . . . . . 3
clBuildProgram
debugging OpenCL program . . . . . . . . . . 10
clCreateKernel
C++extension . . . . . . . . . . . . . . . . . . . . . . . 2
clEnqueue commands . . . . . . . . . . . . . . . . . 10
clEnqueueNDRangeKernel
setting breakpoint in the host code . . . . . 11
clGetPlatformIDs() function
available OpenCL implementations . . . . . . 1
clGetPlatformInfo() function
available OpenCL implementations . . . . . . 1
querying supported extensions for OpenCL
platform . . . . . . . . . . . . . . . . . . . . . . . . . 1
C-like language
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
code
basic programming steps . . . . . . . . . . . . . 11
ICD-compliant version . . . . . . . . . . . . . . 1, 2
parallel min() function. . . . . . . . . . . . . . . . 19
pre-ICD snippet . . . . . . . . . . . . . . . . . . . 1, 2
runtime steps . . . . . . . . . . . . . . . . . . . . . . 19
CodeXL GPU Debugger . . . . . . . . . . . . . . . . . 1
command processor
transfer from system to GPU . . . . . . . . . . . 8
command processors
concurrent processing of command queues 6
command queue . . . . . . . . . . . . . . . . . . . . . . . 4
associated with single device . . . . . . . . . . 4
barrier
enforce ordering within a single queue . 5
creating device-specific . . . . . . . . . . . . . . . 4
elements
constants . . . . . . . . . . . . . . . . . . . . . . . . 9
kernel execution calls . . . . . . . . . . . . . . 9
kernels . . . . . . . . . . . . . . . . . . . . . . . . . . 9
transfers between device and host . . . . 9
executing kernels . . . . . . . . . . . . . . . . . . . . 4
execution . . . . . . . . . . . . . . . . . . . . . . . . . . 9
moving data . . . . . . . . . . . . . . . . . . . . . . . . 4
no limit of the number pointing to the same
device . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
command queues . . . . . . . . . . . . . . . . . . . . . . 6
multiple . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
command-queue barrier . . . . . . . . . . . . . . . . . 5
commands
API
three categories . . . . . . . . . . . . . . . . . . 10
buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
clEnqueue . . . . . . . . . . . . . . . . . . . . . . . . . 10
driver layer issuing . . . . . . . . . . . . . . . . . . . 9
driver layer translating . . . . . . . . . . . . . . . . 9
event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
OpenCL API functions . . . . . . . . . . . . . . . 10
queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
communication and data transfers between system and GPU
PCIe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
communication between the host (CPU) and the
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
compilation
error
kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 16
compile time
resolving format string . . . . . . . . . . . . . . . 10
compiler
set to ignore all extensions . . . . . . . . . . . . 2
toolchain . . . . . . . . . . . . . . . . . . . . . . . . . . 14
back-end . . . . . . . . . . . . . . . . . . . . . . . . 14
sharing front-end . . . . . . . . . . . . . . . . . 14
sharing high-level transformations . . . . 14
transformations . . . . . . . . . . . . . . . . . . . . . 14
compiler option
-f[no-]bin-amdil . . . . . . . . . . . . . . . . . . . . . . 8
-f[no-]bin-exe . . . . . . . . . . . . . . . . . . . . . . . . 8
-f[no-]bin-llvmir . . . . . . . . . . . . . . . . . . . . . . 8
-f[no-]bin-source . . . . . . . . . . . . . . . . . . . . . 8
-g . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
-O0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
-save-temps . . . . . . . . . . . . . . . . . . . . . . . . 8
compiling
an OpenCL application . . . . . . . . . . . . . . . . 1
C++ files . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
on Linux
building 32-bit object files on a 64-bit system . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
linking to a 32-bit library. . . . . . . . . . . . . 2
linking to a 64-bit library. . . . . . . . . . . . . 2
OpenCL on Linux . . . . . . . . . . . . . . . . . . . . 2
OpenCL on Windows . . . . . . . . . . . . . . . . . 1
Intel C (C++) compiler . . . . . . . . . . . . . . 1
setting project properties . . . . . . . . . . . . 1
Visual Studio 2008 Professional Edition 1
the host program . . . . . . . . . . . . . . . . . . . . 1
computation
data-parallel model . . . . . . . . . . . . . . . . . . . 2
compute device structure
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5, 2
compute kernel
data-parallel granularity . . . . . . . . . . . . . . . 2
definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
strengths
computationally intensive applications . . 1
wavefronts. . . . . . . . . . . . . . . . . . . . . . . . . . 2
workgroups . . . . . . . . . . . . . . . . . . . . . . . . . 2
compute unit
mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
stream cores . . . . . . . . . . . . . . . . . . . . . . . . 3
compute units
290X devices . . . . . . . . . . . . . . . . . . . . . . . 6
independent operation . . . . . . . . . . . . . . . . 4
number in AMD GPUs . . . . . . . . . . . . . . . . 7
structured in AMD GPUs . . . . . . . . . . . . . . 7
constants
caching . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
command queue elements . . . . . . . . . . . . . 9
constraints
of the current LDS model . . . . . . . . . . . . . . 2
context
relationship
sample code . . . . . . . . . . . . . . . . . . . . . . 4
contexts
associating CL and GL . . . . . . . . . . . . . . . . 1
copying data
implicit and explicit . . . . . . . . . . . . . . . . . . . 6
copying processes . . . . . . . . . . . . . . . . . . . . . 7
CPU
binaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
code
parallel min() function . . . . . . . . . . . . . . 19
communication between host and GPU. . . 8
predefined macros . . . . . . . . . . . . . . . . . . 11
processing
OpenCL runtime . . . . . . . . . . . . . . . . . . 14
skip copying between host memory and PCIe
memory . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Creating CL context
from a GL Context . . . . . . . . . . . . . . . . . . 10
cygwin
GDB running . . . . . . . . . . . . . . . . . . . . . . . 12
D
-D name
OpenCL supported options. . . . . . . . . . . . . 7
data
computations
select a device . . . . . . . . . . . . . . . . . . . . 4
fetch units . . . . . . . . . . . . . . . . . . . . . . . . . 10
Copyright 2015 Advanced Micro Devices, Inc. All rights reserved.
domains
of synchronization . . . . . . . . . . . . . . . . . . . 4
command-queue . . . . . . . . . . . . . . . . . . 4
work-items . . . . . . . . . . . . . . . . . . . . . . . 4
double copying
memory bandwidth . . . . . . . . . . . . . . . . . . . 7
double-precision floating-point
performing operations . . . . . . . . . . . . . . 3, 4
driver layer
issuing commands . . . . . . . . . . . . . . . . . . . 9
translating commands . . . . . . . . . . . . . . . . 9
E
element
work-item . . . . . . . . . . . . . . . . . . . . . . . . . .
ELF
.rodata
storing OpenCL runtime control data . . 1
.shstrtab
forming an ELF . . . . . . . . . . . . . . . . . . . 1
.strtab
forming an ELF . . . . . . . . . . . . . . . . . . . 1
.symtab
forming an ELF . . . . . . . . . . . . . . . . . . . 1
.text
storing the executable . . . . . . . . . . . . . . 1
format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
forming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
header fields. . . . . . . . . . . . . . . . . . . . . . . . 2
special sections
BIF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
enforce ordering
between or within queues
events. . . . . . . . . . . . . . . . . . . . . . . . . . . 5
synchronizing a given event . . . . . . . . . . . 3
within a single queue
command-queue barrier. . . . . . . . . . . . . 5
engine
DMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
enqueuing
commands in OpenCL . . . . . . . . . . . . . . . . 5
multiple tasks
parallelism . . . . . . . . . . . . . . . . . . . . . . . 2
native kernels
parallelism . . . . . . . . . . . . . . . . . . . . . . . 2
environment variable
AMD_OCL_BUILD_OPTIONS . . . . . . . . . . 9
AMD_OCL_BUILD_OPTIONS_APPEND. . 9
setting to avoid source changes . . . . . . . 10
event
commands . . . . . . . . . . . . . . . . . . . . . . . . 10
enforces ordering
between queues . . . . . . . . . . . . . . . . . . . 5
within queues . . . . . . . . . . . . . . . . . . . . . 5
synchronizing . . . . . . . . . . . . . . . . . . . . . . . 3
event commands . . . . . . . . . . . . . . . . . . . . . . 10
events
forced ordering between . . . . . . . . . . . . . . . 5
exceptions
C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
executing
branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . 2, 3
using corresponding command queue . . 4
kernels for specific devices
OpenCL programming model . . . . . . . . . 3
loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
non-graphic function
data-parallel programming model. . . . . . 2
execution
command queue . . . . . . . . . . . . . . . . . . . . . 9
of a single instruction over all work-items . 2
order
barriers . . . . . . . . . . . . . . . . . . . . . . . . . . 4
single stream core . . . . . . . . . . . . . . . . . . 10
explicit copying of data . . . . . . . . . . . . . . . . . . 6
extension
cl_amd_popcnt . . . . . . . . . . . . . . . . . . . . . . 7
clCreateKernel . . . . . . . . . . . . . . . . . . . . . . 2
extension function pointers . . . . . . . . . . . . . . . 3
extension functions
NULL and non-Null return values. . . . . . . . 3
extension support by device
for devices 1 . . . . . . . . . . . . . . . . . . . . . . . 14
for devices 2 and CPUs . . . . . . . . . . . . . . 15
extensions
all. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
AMD vendor-specific. . . . . . . . . . . . . . . . . . 5
approved by Khronos Group . . . . . . . . . . . 1
character strings . . . . . . . . . . . . . . . . . . . . . 1
cl_amd_device_attribute_query . . . . . . . . . 5
cl_amd_event_callback
registering event callbacks for states. . . 7
cl_amd_fp64 . . . . . . . . . . . . . . . . . . . . . . . . 5
cl_amd_media_ops . . . . . . . . . . . . . . . . . . . 7
cl_amd_printf. . . . . . . . . . . . . . . . . . . . . . . 10
cl_ext . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
compiler set to ignore . . . . . . . . . . . . . . . . . 2
device fission . . . . . . . . . . . . . . . . . . . . . . . 5
disabling . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
enabling. . . . . . . . . . . . . . . . . . . . . . . . . . 2, 3
FunctionName string. . . . . . . . . . . . . . . . . . 3
kernel code compilation
adding defined macro. . . . . . . . . . . . . . . 3
naming conventions . . . . . . . . . . . . . . . . . . 1
optional . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
provided by a specific vendor . . . . . . . . . . 1
provided collectively by multiple vendors . . 1
querying for a platform . . . . . . . . . . . . . . . . 1
querying in OpenCL . . . . . . . . . . . . . . . . . . 1
same name overrides . . . . . . . . . . . . . . . . . 2
use in kernel programs. . . . . . . . . . . . . . . . 2
F
-f[no-]bin-source
AMD supplemental compiler option . . . . . . 6
-f[no-]bin-amdil
AMD supplemental compiler option . . . . . . 7
compiler option . . . . . . . . . . . . . . . . . . . . . . 8
-f[no-]bin-exe
AMD supplemental compiler option . . . . . . 7
compiler option . . . . . . . . . . . . . . . . . . . . . . 8
-f[no-]bin-llvmir
AMD supplemental compiler option . . . . . . 7
compiler option . . . . . . . . . . . . . . . . . . . . . . 8
-f[no-]bin-source
compiler option . . . . . . . . . . . . . . . . . . . . . . 8
fetch unit
loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
processing. . . . . . . . . . . . . . . . . . . . . . . . . 10
stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
streaming stores . . . . . . . . . . . . . . . . . . . . 10
transferring the work-item. . . . . . . . . . . . . 10
fetches
memory
stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
floating point operations
double-precision . . . . . . . . . . . . . . . . . . . . . 4
single-precision . . . . . . . . . . . . . . . . . . . . . . 4
flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
branching . . . . . . . . . . . . . . . . . . . . . . . . . . 4
execution of a single instruction over all work-items . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
forced ordering of events . . . . . . . . . . . . . . . . 5
format string . . . . . . . . . . . . . . . . . . . . . . . . . 10
conversion guidelines . . . . . . . . . . . . . . . . 10
resolving compile time . . . . . . . . . . . . . . . 10
function call
querying . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
FunctionName string
address of extension . . . . . . . . . . . . . . . . . 3
G
-g
compiler option . . . . . . . . . . . . . . . . . . . . . . 7
experimental feature . . . . . . . . . . . . . . . . . . 6
-g option
H
hardware
overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Hawaii
see R9 290X series devices or AMD Radeon
R9 290X . . . . . . . . . . . . . . . . . . . . . . . . . 8
header fields
in ELF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
hello world sample kernel. . . . . . . . . . . . . . . 10
hierarchical subdivision
OpenCL data-parallel programming model 2
host
communication between host and GPU . . 8
copying data from host to GPU . . . . . . . . . 6
dataflow between host and GPU . . . . . . . . 6
program
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . 1
program compiling . . . . . . . . . . . . . . . . . . . 1
host code
breakpoint . . . . . . . . . . . . . . . . . . . . . . . . . 11
platform vendor string . . . . . . . . . . . . . . . . 3
setting breakpoint . . . . . . . . . . . . . . . . . . . 11
clEnqueueNDRangeKernel . . . . . . . . . 11
host/device architecture single platform
consisting of a GPU and CPU. . . . . . . . . . 3
I
-I dir
OpenCL supported options . . . . . . . . . . . . 7
idle stream cores . . . . . . . . . . . . . . . . . . . . . . 4
image
reads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
implicit copying of data. . . . . . . . . . . . . . . . . . 6
index space
n-dimensional . . . . . . . . . . . . . . . . . . . . . . . 2
inheritance
strict and multiple . . . . . . . . . . . . . . . . . . . . 1
input stream
NDRange . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Installable Client Driver (ICD)
compliant version of code . . . . . . . . . . . 1, 2
overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
pre-ICD code snippet . . . . . . . . . . . . . . . 1, 2
using . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
instruction
branch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
global buffer . . . . . . . . . . . . . . . . . . . . . . . 10
kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
sequence
stream cores . . . . . . . . . . . . . . . . . . . . . 3
instructions
scalar and vector . . . . . . . . . . . . . . . . . . . . 6
integer
performing operations . . . . . . . . . . . . . . 3, 4
Intel C (C++) compiler
compiling OpenCL on Windows . . . . . . . . . 1
interoperability context
code for creating . . . . . . . . . . . . . . . . . . . . 3
interrelationship of memory domains . . . . . . . 6
kernel_name
construction. . . . . . . . . . . . . . . . . . . . . . . . 11
kernels
debugging . . . . . . . . . . . . . . . . . . . . . . . . . 10
kernels and shaders . . . . . . . . . . . . . . . . . . . . 2
Khronos
website . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
kernel
code
parallel min() function . . . . . . . . . . . . . . 20
code compilation
adding a defined macro with the name of
the extension . . . . . . . . . . . . . . . . . . . 3
code example . . . . . . . . . . . . . . . . . . . . . . 11
command queue elements . . . . . . . . . . . . . 9
commands. . . . . . . . . . . . . . . . . . . . . . . . . 10
compilation error . . . . . . . . . . . . . . . . . . . . 16
compiling . . . . . . . . . . . . . . . . . . . . . . . . . . 14
compute
definition . . . . . . . . . . . . . . . . . . . . . . . . . 1
strengths . . . . . . . . . . . . . . . . . . . . . . . . . 1
creating within programs . . . . . . . . . . . . . . 4
definition of . . . . . . . . . . . . . . . . . . . . . . . . . 1
device-specific binaries. . . . . . . . . . . . . . . . 1
distributing in OpenCL . . . . . . . . . . . . . . . 14
executed as a function of multi-dimensional
domains of indices . . . . . . . . . . . . . . . . . 3
executing . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
using corresponding command queue . . 4
execution
device-specific operations . . . . . . . . . . . 4
execution calls
command queue elements . . . . . . . . . . . 9
hello world sample . . . . . . . . . . . . . . . . . . 10
instructions over PCIe bus . . . . . . . . . . . . . 9
keyword . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
loading. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
no breakpoint set . . . . . . . . . . . . . . . . . . . 11
overloading . . . . . . . . . . . . . . . . . . . . . . . . . 2
program
OpenCL. . . . . . . . . . . . . . . . . . . . . . . . . . 1
programming
enabling extensions . . . . . . . . . . . . . . . . 3
programs using extensions. . . . . . . . . . . . . 2
running on compute unit. . . . . . . . . . . . . . . 2
stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
submitting for execution . . . . . . . . . . . . . . . 4
synchronization points . . . . . . . . . . . . . . . . 2
work-item. . . . . . . . . . . . . . . . . . . . . . . . . . . 2
kernel and function overloading . . . . . . . . . . . 1
kernel commands . . . . . . . . . . . . . . . . . . . . . 10
L1 cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
L2 cache . . . . . . . . . . . . . . . . . . . . . . . . . . 11, 6
latency
hiding in memory . . . . . . . . . . . . . . . . . . . 10
latency hiding . . . . . . . . . . . . . . . . . . . . . . . . 11
launching
threads . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
LDS
description. . . . . . . . . . . . . . . . . . . . . . . . . . 2
gather/scatter model . . . . . . . . . . . . . . . . . . 2
size allocated to work-group. . . . . . . . . . . . 2
using local memory. . . . . . . . . . . . . . . . . . 10
LDS model constraints . . . . . . . . . . . . . . . . . . 2
data sharing . . . . . . . . . . . . . . . . . . . . . . . . 2
memory accesses outside the work-group 2
size is allocated per work-group . . . . . . . . 2
linking
creating an executable . . . . . . . . . . . . . . . . 2
object files . . . . . . . . . . . . . . . . . . . . . . . . . . 2
OpenCL on Linux . . . . . . . . . . . . . . . . . . . . 2
to a 32-bit library compiling on Linux. . . . . 2
to a 64-bit library compiling on Linux. . . . . 2
Linux
building 32-bit object files on a 64-bit system . . 2
compiling OpenCL . . . . . . . . . . . . . . . . . . . 2
linking . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
linking
to a 32-bit library . . . . . . . . . . . . . . . . . . 2
to a 64-bit library . . . . . . . . . . . . . . . . . . 2
LLVM AS
generating binaries . . . . . . . . . . . . . . . . . . 14
LLVM IR
BIF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
compatibility . . . . . . . . . . . . . . . . . . . . . . . . 1
enabling recompilation to the target. . . . . . 1
generating a new code . . . . . . . . . . . . . . . . 1
Local Data Store (LDS)
See LDS
loop
execute . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
M
macros
predefined
CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
OpenCL C kernels . . . . . . . . . . . . . . . . 11
mapping
executions onto compute units . . . . . . . . . 2
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
work-items onto n-dimensional grid (NDRange) . . . . . . . . . . . . . . . . . . . . . . . . . . 3
work-items to stream cores . . . . . . . . . . . . 2
masking GPUs . . . . . . . . . . . . . . . . . . . . . . . . 9
mem_fence operation . . . . . . . . . . . . . . . . . . . 4
memories
interrelationship of . . . . . . . . . . . . . . . . . . . 5
memory
access. . . . . . . . . . . . . . . . . . . . . . . . . . 5, 10
allocation
select a device . . . . . . . . . . . . . . . . . . . . 4
architecture . . . . . . . . . . . . . . . . . . . . . . . . . 5
bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . 7
double copying . . . . . . . . . . . . . . . . . . . . 7
commands . . . . . . . . . . . . . . . . . . . . . . . . 10
domains
interrelationship . . . . . . . . . . . . . . . . . . . 6
fence
barriers . . . . . . . . . . . . . . . . . . . . . . . . . . 3
operations. . . . . . . . . . . . . . . . . . . . . . . . 3
global (VRAM) . . . . . . . . . . . . . . . . . . . . . 10
hiding latency . . . . . . . . . . . . . . . . . . . 10, 11
loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
object allocation
OpenCL context . . . . . . . . . . . . . . . . . . . 3
OpenCL domains . . . . . . . . . . . . . . . . . . . . 5
read operations . . . . . . . . . . . . . . . . . . . . 10
request . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
streaming . . . . . . . . . . . . . . . . . . . . . . . 10
system pinned . . . . . . . . . . . . . . . . . . . . . . 7
transfer management . . . . . . . . . . . . . . . . . 7
write operations . . . . . . . . . . . . . . . . . . . . 10
memory access
stream cores. . . . . . . . . . . . . . . . . . . . . . . 10
memory commands . . . . . . . . . . . . . . . . . . . 10
minGW
GDB running. . . . . . . . . . . . . . . . . . . . . . . 12
multi-GPU environment
use of GLUT. . . . . . . . . . . . . . . . . . . . . . . . 7
N
namespaces
C++ support for . . . . . . . . . . . . . . . . . . . . . 4
supported feature in C++ . . . . . . . . . . . . . . 1
naming conventions
O
-O0
compiler option . . . . . . . . . . . . . . . . . . . . . . 8
object files
linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
open platform strategy
AMD Accelerated Parallel Processing . . . . 1
OpenCL
Accelerated Parallel Processing
implementation . . . . . . . . . . . . . . . . . . . . 1
adding built-in functions to the language
cl_amd_media_ops extension . . . . . . . . 7
allocating images . . . . . . . . . . . . . . . . . . . . 4
Binary Image Format (BIF)
overview . . . . . . . . . . . . . . . . . . . . . . . . . 1
building
create a context . . . . . . . . . . . . . . . . . . . 3
programs . . . . . . . . . . . . . . . . . . . . . . . . 1
querying the runtime . . . . . . . . . . . . . . . 3
the application . . . . . . . . . . . . . . . . . . . . 3
C printf . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
C programming. . . . . . . . . . . . . . . . . . . . . . 3
call error . . . . . . . . . . . . . . . . . . . . . . . . . . 16
checking for known symbols . . . . . . . . . . 11
C-like language with extensions
for parallel programming . . . . . . . . . . . . 3
compiler and runtime components. . . . . . . 1
compiler options
-D name . . . . . . . . . . . . . . . . . . . . . . . . . 7
-I dir . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
compiling
on Linux . . . . . . . . . . . . . . . . . . . . . . . . . 2
linking . . . . . . . . . . . . . . . . . . . . . . . . . 2
on Windows . . . . . . . . . . . . . . . . . . . . . . 1
the program . . . . . . . . . . . . . . . . . . . . . . 1
context
memory object allocation . . . . . . . . . . . . 3
conversion guidelines
format string . . . . . . . . . . . . . . . . . . . . . 10
CPU processing . . . . . . . . . . . . . . . . . . . . 14
create kernels within programs . . . . . . . . . 4
create one or more command queues. . . . 4
create programs to run on one or more
devices . . . . . . . . . . . . . . . . . . . . . . . . . . 4
creating a context
selecting a device. . . . . . . . . . . . . . . . . . 4
data-parallel model
hierarchical subdivision . . . . . . . . . . . . . 2
debugging . . . . . . . . . . . . . . . . . . . . . . . . . . 1
clBuildProgram . . . . . . . . . . . . . . . . . . . 10
desired platform . . . . . . . . . . . . . . . . . . . . . 3
selection . . . . . . . . . . . . . . . . . . . . . . . . . 3
directives to enable or disable extensions . 2
distributing the kernel . . . . . . . . . . . . . . . . 14
enqueued commands . . . . . . . . . . . . . . . . . 5
extensions
enabling or disabling . . . . . . . . . . . . . . . 2
following same pattern . . . . . . . . . . . . . . . . 4
generating
.amdil . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
.llvmir. . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
.text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
a binary . . . . . . . . . . . . . . . . . . . . . . . . . . 1
GPU processing . . . . . . . . . . . . . . . . . . . . 14
host program. . . . . . . . . . . . . . . . . . . . . . . . 1
implementations
use clGetPlatformIDs() function . . . . . . . 1
use clGetPlatformInfo() function. . . . . . . 1
introductory sample
C++ bindings . . . . . . . . . . . . . . . . . . . . 14
kernel compiling . . . . . . . . . . . . . . . . . . . . 14
kernel symbols
not visible in debugger . . . . . . . . . . . . . 11
list
of available implementations . . . . . . . . . 1
of commands . . . . . . . . . . . . . . . . . . . . 10
mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
memory domains . . . . . . . . . . . . . . . . . . . . 5
minimalist C program sample. . . . . . . . . . 11
optional
extensions. . . . . . . . . . . . . . . . . . . . . . . . 1
kernel program . . . . . . . . . . . . . . . . . . . . 1
performance
libraries components . . . . . . . . . . . . . . . 1
profiling components . . . . . . . . . . . . . . . 1
printf capabilities . . . . . . . . . . . . . . . . . . . . 10
programmer's introductory sample . . . . . . 14
programming model . . . . . . . . . . . . . . . . . . 3
allocating memory buffers . . . . . . . . . . . 4
executing kernels for specific devices . . 3
queues of commands . . . . . . . . . . . . . . . 3
reading/writing data . . . . . . . . . . . . . . . . 3
providing an event . . . . . . . . . . . . . . . . . . . 3
querying
extensions. . . . . . . . . . . . . . . . . . . . . . . . 1
supported extensions using clGetPlatformInfo() . . . . . . . . . . . . . . . . . . . . . . 1
read data back to the host from device . . . 4
recompiling LLVM IR to generate a new code . . 1
running
data-parallel work . . . . . . . . . . . . . . . . . . 3
programs. . . . . . . . . . . . . . . . . . . . . . . . . 1
task-parallel work . . . . . . . . . . . . . . . . . . 3
runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
changing options . . . . . . . . . . . . . . . . . . 7
using LLVM AS. . . . . . . . . . . . . . . . . . . 14
setting breakpoint . . . . . . . . . . . . . . . . . . . 11
settings for compiling on Windows. . . . . . . 1
storing intermediate representation (LLVM IR)
.llvmir. . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
storing OpenCL and driver versions
.comment . . . . . . . . . . . . . . . . . . . . . . . . 1
storing source program
.source . . . . . . . . . . . . . . . . . . . . . . . . . . 1
submit the kernel for execution . . . . . . . . . 4
supported standard
compiler options . . . . . . . . . . . . . . . . . . . 7
synchronizing a given event . . . . . . . . . . . . 3
write data to device . . . . . . . . . . . . . . . . . . 4
OpenCL device
general overview. . . . . . . . . . . . . . . . . . . . . 5
OpenCL programs
debugging . . . . . . . . . . . . . . . . . . . . . . . . . . 1
operation
mem_fence . . . . . . . . . . . . . . . . . . . . . . . . . 4
operations
device-specific
kernel execution . . . . . . . . . . . . . . . . . . . 4
program compilation . . . . . . . . . . . . . . . . 4
double-precision floating point . . . . . . . . . . 4
integer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
memory-read. . . . . . . . . . . . . . . . . . . . . . . 10
memory-write . . . . . . . . . . . . . . . . . . . . . . 10
single-precision floating point . . . . . . . . . . . 4
optional extensions
for OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . 1
overloading
in C++ language. . . . . . . . . . . . . . . . . . . . . 4
kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
kernel and function. . . . . . . . . . . . . . . . . . . 1
overview
software and hardware. . . . . . . . . . . . . . . . 1
P
parallel min() function
code sample . . . . . . . . . . . . . . . . . . . . . . . 21
example programs . . . . . . . . . . . . . . . . . . 19
kernel code . . . . . . . . . . . . . . . . . . . . . . . . 20
programming techniques . . . . . . . . . . . . . 19
runtime code. . . . . . . . . . . . . . . . . . . . . . . 19
steps . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
parallel programming
memory fence
barriers . . . . . . . . . . . . . . . . . . . . . . . . . . 3
operations. . . . . . . . . . . . . . . . . . . . . . . . 3
parallelism
enqueuing
multiple tasks . . . . . . . . . . . . . . . . . . . . . 2
native kernels . . . . . . . . . . . . . . . . . . . . . 2
using vector data types . . . . . . . . . . . . . . . 2
parallelization
DMA transfers. . . . . . . . . . . . . . . . . . . . . . . 9
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
passing a class between the host and the device . 6
PCIe
communication between system and GPU 8
data transfers between system and GPU . 8
kernel instructions. . . . . . . . . . . . . . . . . . . . 9
overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
skip copying between host memory and PCIe
memory . . . . . . . . . . . . . . . . . . . . . . . . . 7
throughput. . . . . . . . . . . . . . . . . . . . . . . . . . 8
performance
work-groups . . . . . . . . . . . . . . . . . . . . . . . . 2
pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
platform vendor string
remains constant for a particular vendor's
implementation . . . . . . . . . . . . . . . . . . . . 3
searching for desired OpenCL platform. . . 3
vs platform name string . . . . . . . . . . . . . . . 3
point (barrier) in the code. . . . . . . . . . . . . . . . 2
population count
extension . . . . . . . . . . . . . . . . . . . . . . . . . . 7
pre-ICD code snippet . . . . . . . . . . . . . . . . . 1, 2
processing by command processors . . . . . . . 6
processing elements
SIMD arrays . . . . . . . . . . . . . . . . . . . . . . . . 7
program
examples . . . . . . . . . . . . . . . . . . . . . . . . . 11
simple buffer write . . . . . . . . . . . . . . . . 11
programming
basic steps with minimum code. . . . . . . . 11
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
techniques
simple tests
parallel min() function . . . . . . . . . . . 19
programming model
AMD Accelerated Parallel Processing . . . . 2
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
executing kernels for specific devices. . 3
queues of commands . . . . . . . . . . . . . . 3
reading/writing data . . . . . . . . . . . . . . . . 3
project property settings
compiling on Windows . . . . . . . . . . . . . . . . 1
Q
querying
AMD-specific device attributes. . . . . . . . . . 5
extensions
for a list of devices . . . . . . . . . . . . . . . . 2
for a platform . . . . . . . . . . . . . . . . . . . . . 1
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . 1
for a specific device . . . . . . . . . . . . . . . . . . 2
for available platforms . . . . . . . . . . . . . . . . 2
the platform API . . . . . . . . . . . . . . . . . . . . . 1
the runtime
OpenCL building . . . . . . . . . . . . . . . . . . 3
queue
command . . . . . . . . . . . . . . . . . . . . . . . . 4, 9
R
R9 290X series devices . . . . . . . . . . . . . . . . . 8
random memory location
GPU storage of writes . . . . . . . . . . . . . . . 10
random-access functionality
NDRange . . . . . . . . . . . . . . . . . . . . . . . . . . 1
read imaging . . . . . . . . . . . . . . . . . . . . . . . . . 10
read-only buffers
constants . . . . . . . . . . . . . . . . . . . . . . . . . 10
runtime
change CL options . . . . . . . . . . . . . . . . . . . 9
code
parallel min() function . . . . . . . . . . . . . 19
OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . 14
changing options . . . . . . . . . . . . . . . . . . 7
system functions. . . . . . . . . . . . . . . . . . . . 10
S
sample code
relationship between
buffer(s). . . . . . . . . . . . . . . . . . . . . . . . . . 4
command queue(s). . . . . . . . . . . . . . . . . 4
context(s) . . . . . . . . . . . . . . . . . . . . . . . . 4
device(s) . . . . . . . . . . . . . . . . . . . . . . . . . 4
kernel(s) . . . . . . . . . . . . . . . . . . . . . . . . . 4
relationship between context(s) . . . . . . . . . 4
-save-temps
compiler option . . . . . . . . . . . . . . . . . . . . . . 8
SAXPY function
code sample . . . . . . . . . . . . . . . . . . . . . . . 16
SC cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
scalar instructions . . . . . . . . . . . . . . . . . . . . . . 6
scalar unit data cache
SC cache . . . . . . . . . . . . . . . . . . . . . . . . . . 6
scalar instructions . . . . . . . . . . . . . . . . . . . . . . 6
scheduling
GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
work-items
for execution . . . . . . . . . . . . . . . . . . . . . . 3
range. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
scope
global. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
set
a breakpoint . . . . . . . . . . . . . . . . . . . . . . . 11
shader architecture
unified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
shaders and kernels . . . . . . . . . . . . . . . . . . . . 2
SIMD arrays
processing elements . . . . . . . . . . . . . . . . . . 7
simple buffer write
code sample . . . . . . . . . . . . . . . . . . . . . . . 12
example programs . . . . . . . . . . . . . . . . . . 11
simple testing
programming techniques
parallel min function . . . . . . . . . . . . . . . 19
single device associated with command queue . . . 4
single stream core execution . . . . . . . . . . . . 10
single-precision floating-point
performing operations . . . . . . . . . . . . . . . 3, 4
software
overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
spawn order
of work-item . . . . . . . . . . . . . . . . . . . . . . . . 1
sequential . . . . . . . . . . . . . . . . . . . . . . . . . . 1
stalls
memory fetch request . . . . . . . . . . . . . . . . 11
static C++ kernel language . . . . . . . . . . . . . iii, 1
stdout stream
writing output associated with the host application . . . . . . . . . . . . . . . . . . . . . . . . . . 10
stream core
compute units . . . . . . . . . . . . . . . . . . . . . . . 3
executing kernels . . . . . . . . . . . . . . . . . . . . 3
idle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
instruction sequence . . . . . . . . . . . . . . . . . . 3
processing elements . . . . . . . . . . . . . . . . . . 3
stall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
due to data dependency . . . . . . . . . . . 12
stream kernel . . . . . . . . . . . . . . . . . . . . . . . . 10
supplemental compiler options . . . . . . . . . . . . 6
synchronization
command-queue barrier . . . . . . . . . . . . . . . 4
domains. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
command-queue . . . . . . . . . . . . . . . . . . . 4
work-items. . . . . . . . . . . . . . . . . . . . . . . . 4
events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
points
in a kernel. . . . . . . . . . . . . . . . . . . . . . . . 2
synchronizing
a given event . . . . . . . . . . . . . . . . . . . . . . . 3
event
enforce the correct order of execution. . 3
through barrier operations
work-items. . . . . . . . . . . . . . . . . . . . . . . . 3
through fence operations
work-items. . . . . . . . . . . . . . . . . . . . . . . . 3
syntax
GCC option . . . . . . . . . . . . . . . . . . . . . . . . . 3
system
pinned memory . . . . . . . . . . . . . . . . . . . . . . 7
T
templates
C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
kernel, member, default argument, limited
class, partial . . . . . . . . . . . . . . . . . . . . . . 1
terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
texture system
caching . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
thread
launching. . . . . . . . . . . . . . . . . . . . . . . . . . 19
threading
device-optimal access pattern . . . . . . . . . 19
throughput
PCIe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
timing
of simplified execution of work-items
single stream core . . . . . . . . . . . . . . . . 10
toolchain
compiler. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
transcendental
core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
performing operations . . . . . . . . . . . . . . . . . 3
transfer
U
unified shader architecture . . . . . . . . . . . . . . . 1
un-normalized addresses . . . . . . . . . . . . . . . 10
V
variable output counts
NDRange . . . . . . . . . . . . . . . . . . . . . . . . . . 1
variadic arguments
use of in the built-in printf . . . . . . . . . . . . 10
vector data types
parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . 2
vector instructions. . . . . . . . . . . . . . . . . . . . . . 6
vendor
platform vendor string . . . . . . . . . . . . . . . . 3
vendor name
matching platform vendor string. . . . . . . . . 3
vendor-specific extensions
AMD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Very Long Instruction Word (VLIW)
instruction . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Visual Studio 2008 Professional Edition . . . 12
compiling OpenCL on Windows. . . . . . . . . 1
developing application code. . . . . . . . . . . 12
VRAM
global memory . . . . . . . . . . . . . . . . . . . . . 10
W
wavefront
block of work-items . . . . . . . . . . . . . . . . . . 3
combining paths . . . . . . . . . . . . . . . . . . . . . 4
concept relating to compute kernels . . . . . 2
definition . . . . . . . . . . . . . . . . . . . . . . . . . 4, 7
mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
masking . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
relationship to work-group . . . . . . . . . . . . . 4
relationship with work-groups. . . . . . . . . . . 4
required number spawned by GPU . . . . . . 4
size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
size for optimum hardware usage . . . . . . . 4
size on AMD GPUs . . . . . . . . . . . . . . . . . . 4
work-items
divergence in wavefront . . . . . . . . . . . . . . . 4
X
X Window system
using for CL-GL interoperability . . . . . . . . . 8