DS1822 - Parallel Computing - Unit 5


UNIT V

OTHER GPU PROGRAMMING PLATFORMS


Introduction to OpenCL – OpenACC – C++AMP – Thrust – Programming Heterogeneous Clusters – CUDA
and MPI.

1. Introduction to OpenCL
OpenCL (Open Computing Language) is a framework designed for writing programs that execute across
heterogeneous platforms. This means it can work on CPUs, GPUs, and other processors within a single
device or across multiple devices. Here’s a brief introduction:
Key Features of OpenCL:
• Cross-Platform: Write code that runs on various types of processors.
• Scalability: Effective for parallel computing, it can scale from embedded processors to high-
performance GPUs.
• Portability: OpenCL is designed to be portable, allowing code to run on different hardware without
modification.
• Parallel Computing: OpenCL allows the execution of multiple tasks simultaneously, enhancing
performance for computational-heavy applications.
Anatomy of OpenCL:
The OpenCL development framework is made up of three main parts:
• Language specification
• Platform layer API
• Runtime API
Language Specification:
• The language specification describes the syntax and programming interface for writing kernel
programs that run on the supported accelerator (GPU, multi-core CPU, or DSP).
• Kernels can be precompiled or the developer can allow the OpenCL runtime to compile the kernel
program at runtime.
Platform API:
• The platform-layer API gives the developer access to software application routines that can query
the system for the existence of OpenCL-supported devices.
• This layer also lets the developer use the concepts of device context and work-queues to select and
initialize OpenCL devices, submit work to the devices, and enable data transfer to and from the
devices.
Runtime API:
• The OpenCL framework uses contexts to manage one or more OpenCL devices.
• The runtime API uses contexts for managing objects such as command queues, memory objects,
and kernel objects, as well as for executing kernels on one or more devices specified in the context.
OpenCL Architecture:
The Platform Model:
• The OpenCL platform model is defined as a host connected to one or more OpenCL devices.
• The platform model comprises one host plus one or more compute devices, each containing multiple
compute units, each of which has multiple processing elements.
• A host is any computer with a CPU running a standard operating system. OpenCL devices can be a
GPU, DSP, or a multi-core CPU. An OpenCL device consists of a collection of one or more compute
units (cores).
• A compute unit is further composed of one or more processing elements.
• Processing elements execute instructions as SIMD (Single Instruction, Multiple Data) or SPMD
(Single Program, Multiple Data).
• SPMD instructions are typically executed on general purpose devices such as CPUs, while SIMD
instructions require a vector processor such as a GPU or vector units in a CPU.

The Execution Model:


• The OpenCL execution model comprises two components: kernels and host programs.
• Kernels are the basic unit of executable code that runs on one or more OpenCL devices.
• Kernels are similar to a C function that can be data- or task-parallel.
• The host program executes on the host system, defines the device context, and queues kernel execution
instances using command queues.
• Kernels are queued in-order, but can be executed in-order or out-of-order.
Kernels:
• The OpenCL execution model supports two categories of kernels: OpenCL kernels and native
kernels.
• OpenCL kernels are written in the OpenCL C language and compiled with the OpenCL
compiler.
• All devices that are OpenCL-compliant support execution of OpenCL kernels.
• Native kernels are extension kernels that could be special functions defined in application code
or exported from a library designed for a particular accelerator.
• The OpenCL API includes functions to query capabilities of devices to determine if native
kernels are supported.
• If native kernels are used, developers should be aware that the code may not work on other
OpenCL devices.
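For illustration, a minimal OpenCL C kernel for element-wise vector addition might look like the following sketch (the kernel name and arguments are illustrative, not taken from the text above):

// Sketch of an OpenCL C kernel: element-wise vector addition.
// Each work-item computes one output element, identified by its global ID.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t gid = get_global_id(0);   // unique index of this work-item
    if (gid < n)                     // guard against surplus work-items
        c[gid] = a[gid] + b[gid];
}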
Host Program:
• The host program is responsible for setting up and managing the execution of kernels on the
OpenCL device through the use of context.
• Using the OpenCL API, the host can create and manipulate the context by including the following
resources:
• Devices — A set of OpenCL devices used by the host to execute kernels.
• Program Objects — The program source or program object that implements a kernel or
collection of kernels.
• Kernels — The specific OpenCL functions that execute on the OpenCL device.
• Memory Objects — A set of memory buffers or memory maps common to the host and
OpenCL devices.
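As a hedged sketch of this host-side setup (error checking omitted; the function name is illustrative), the platform-layer and runtime APIs are typically used as follows:

#include <CL/cl.h>

/* Minimal sketch: query a platform and device, then create a context and command queue. */
int setup_opencl(cl_context *ctx, cl_command_queue *queue)
{
    cl_platform_id platform;
    cl_device_id device;

    clGetPlatformIDs(1, &platform, NULL);                            /* platform-layer API: find a platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  /* query for an OpenCL-capable GPU     */

    *ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);      /* runtime API: context for the device */
    *queue = clCreateCommandQueue(*ctx, device, 0, NULL);            /* command queue for kernel submission */

    /* Program and kernel objects would be created next, e.g. with
       clCreateProgramWithSource(), clBuildProgram(), and clCreateKernel(). */
    return 0;
}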
The Memory Model:
• Since the host and the OpenCL devices do not share a common memory address space, the OpenCL
memory model defines four regions of memory accessible to work-items when executing a kernel.
• The regions, and how the host and the compute device may access them, are as follows:
• Global memory is a memory region in which all work-items and work-groups have read and write
access on both the compute device and the host.
• This region of memory can be allocated only by the host during runtime.
• Constant memory is a region of global memory that stays constant throughout the execution of the
kernel.
• Work-items have only read access to this region. The host is permitted both read and write access.
• Local memory is a region of memory used for data-sharing by work-items in a work group.
• All work-items in the same work-group have both read and write access.
• Private memory is a region that is accessible to only one work-item.
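The address-space qualifiers of OpenCL C map directly onto these regions. The following is a small, illustrative kernel (the names are hypothetical) that touches all four regions:

__constant float scale = 2.0f;                      // constant memory: read-only for work-items

__kernel void block_sum(__global const float *in,   // global memory: visible to all work-items and the host
                        __global float *out,
                        __local float *scratch)     // local memory: shared within one work-group
{
    float x = in[get_global_id(0)] * scale;         // x lives in the private memory of this work-item
    scratch[get_local_id(0)] = x;
    barrier(CLK_LOCAL_MEM_FENCE);                   // synchronize the work-group before reading shared data

    if (get_local_id(0) == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;                 // one partial sum per work-group
    }
}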

2. OpenACC
OpenACC is another powerful platform for GPU programming, similar to OpenCL but with a focus on
simplifying parallel programming for scientists and researchers. Here's a brief overview:

Key Features of OpenACC:

• Directive-Based: OpenACC uses compiler directives to specify which parts of the code should be
accelerated, making it easier to parallelize existing code without significant modifications (a sketch follows this list).

• Cross-Platform: Like OpenCL, OpenACC supports multiple architectures, including CPUs and
GPUs from different vendors like NVIDIA, AMD, and Intel.
• Portability: Code written with OpenACC can run on various hardware platforms without needing
significant changes.

• Ease of Use: Designed to reduce the complexity of parallel programming, making it more
accessible for domain scientists and researchers.
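As a minimal sketch of this directive-based style (the SAXPY routine below is illustrative), a single pragma is enough to ask the compiler to offload a loop:

// Hedged OpenACC sketch: the directive requests parallel execution of the loop
// on an accelerator; the data clauses describe how x and y are moved.
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

If the directive is ignored by a compiler without OpenACC support, the code still runs correctly as ordinary sequential C, which is a large part of OpenACC's portability story.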

3. C++ AMP

• C++ Accelerated Massive Parallelism (C++ AMP) is a native programming model that contains
elements that span the C++ programming language and its runtime library. It provides an easy way
to write programs that compile and execute on data-parallel hardware, such as graphics
cards (GPUs).
• The C++ AMP programming model gives the developer explicit control over how the program interacts
with the accelerator.
• The developer may explicitly manage all communication between the CPU and the accelerator, and
this communication can be either synchronous or asynchronous.
• The data parallel computations performed on the accelerator are expressed using high-level
abstractions, such as multi-dimensional arrays, high level array manipulation functions, and multi-
dimensional indexing operations, all based on a large subset of the C++ programming language.
• The programming model contains multiple layers, allowing developers to trade off ease-of-use with
maximum performance.
• C++ AMP is composed of three broad categories of functionality:

1. C++ language and compiler

a. Kernel functions are compiled into code that is specific to the accelerator.

2. Runtime

a. The runtime contains a C++ AMP abstraction of lower-level accelerator APIs, as well as
support for multiple host threads and processors, and multiple accelerators.

b. Asynchronous execution is supported through an eventing model.

3. Programming model

a. A set of classes describing the shape and extent of data.

b. A set of classes that contain or refer to data used in computations

c. A set of functions for copying data to and from accelerators

d. A math library

e. An atomic library


f. A set of miscellaneous intrinsic functions

Definitions:

This section introduces terms used in the C++ AMP specification.

• Accelerator:

A hardware device or capability that enables accelerated computation on data-parallel workloads.


Examples include:

o Graphics Processing Unit (GPU) or other coprocessor, accessible through the PCIe bus.

o Graphics Processing Unit, or GPU, or other coprocessor that is integrated with a CPU on the
same die.

o SIMD units of the host node exposed through software emulation of a hardware accelerator.

• Array: A dense N-dimensional data container.

• Array View: A view into a contiguous piece of memory that adds array-like dimensionality.

• Compressed texture format: A format that divides a texture into blocks that allow the texture to be reduced
in size by a fixed ratio; typically 4:1 or 6:1.

Compressed textures are useful when perfect image/texel fidelity is not necessary but where minimizing
memory storage and bandwidth is critical to application performance.

• Extent: A vector of integers that describes the lengths of N-dimensional array-like objects.

• Global memory: On a GPU, global memory is the main off-chip memory store.

Programming Model:
C++ Language Extensions for Accelerated Computing:
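The central language extension is the restrict(amp) specifier, which marks functions (including lambdas) that can be compiled for the accelerator. The following is a minimal, hedged sketch of a vector addition using these extensions (the function and variable names are illustrative):

#include <amp.h>
using namespace concurrency;

// Hedged sketch: element-wise vector addition with C++ AMP.
// array_view wraps host data, parallel_for_each runs the lambda on the
// accelerator, and restrict(amp) marks code compilable for the accelerator.
void add_arrays(int n, const float *a, const float *b, float *sum)
{
    array_view<const float, 1> av(n, a);
    array_view<const float, 1> bv(n, b);
    array_view<float, 1> sv(n, sum);
    sv.discard_data();                                // no need to copy stale results to the accelerator

    parallel_for_each(sv.extent, [=](index<1> idx) restrict(amp)
    {
        sv[idx] = av[idx] + bv[idx];                  // executed by one accelerator thread per element
    });

    sv.synchronize();                                 // copy results back to host memory
}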
4. Thrust: A High-Level Library for CUDA

• Thrust is a productivity-oriented library for CUDA, built entirely on top of CUDA C/C++.
• It provides a high-level abstraction of common parallel programming patterns, inspired by the C++
Standard Template Library (STL), making it easier to write efficient and portable GPU-accelerated
code without needing to manage low-level CUDA details directly.

Thrust and CUDA Relationship

• Abstraction Layer: Thrust serves as an abstraction layer over CUDA C/C++. It simplifies tasks
like kernel launches, memory management, and algorithm selection, allowing developers to focus
on the high-level structure of their parallel computations.

• Interoperability: Thrust is designed for seamless interoperability with CUDA C/C++. Developers
can easily integrate CUDA C code into Thrust applications and vice versa, leveraging the strengths
of both. For example, using raw pointers allows passing data between Thrust containers and CUDA
kernels, ensuring a flexible approach to parallelization.

• Performance: While abstracting low-level details, Thrust maintains a focus on performance. It
employs optimizations like kernel fusion and structure of arrays to minimize memory transfers and
maximize GPU utilization, ultimately aiming to achieve performance comparable to carefully hand-
tuned CUDA C code.
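A hedged sketch of this STL-like style (compiled with nvcc as a .cu file; the sizes and values are illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main()
{
    thrust::host_vector<int> h(1 << 20);            // data on the host
    thrust::sequence(h.begin(), h.end());           // fill with 0, 1, 2, ...

    thrust::device_vector<int> d = h;               // implicit host-to-device copy
    thrust::sort(d.begin(), d.end());               // sorting runs on the GPU
    int sum = thrust::reduce(d.begin(), d.end());   // parallel reduction on the GPU

    // Interoperability: extract a raw pointer for use by a hand-written CUDA kernel.
    int *raw = thrust::raw_pointer_cast(d.data());

    (void)raw;
    (void)sum;
    return 0;
}

No kernel launch configuration or explicit cudaMalloc/cudaFree call appears anywhere; Thrust chooses the launch parameters and manages device memory through its containers.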

Benefits of using Thrust over CUDA C:

• Increased Programmer Productivity: Thrust automates parallel programming tasks such as kernel
launch configurations and memory management, allowing developers to focus on the algorithm
logic instead of low-level implementation details.

• Enhanced Robustness: By handling low-level details, Thrust improves the robustness of CUDA
applications. It automatically addresses potential issues like limits on grid dimensions and data type
sizes, ensuring consistent behavior across various CUDA-capable devices.

• Improved Code Readability: Thrust’s STL-like syntax promotes concise and readable code,
making it easier to understand and maintain GPU-accelerated applications.

Examples of Thrust features simplifying CUDA development:

• Kernel Launch Abstraction: In CUDA, developers need to explicitly define the grid and block
dimensions for kernel launches. Thrust simplifies this by automatically determining an efficient
launch configuration based on factors like available resources and desired occupancy.

• Memory Management: Thrust’s container classes (e.g., host_vector, device_vector) automate
memory allocation and deallocation on both host and device, streamlining data management
compared to manual calls to functions like cudaMalloc and cudaFree.

• Thrust empowers developers to harness the power of CUDA GPUs with higher-level abstractions,
facilitating faster development, improved code maintainability, and efficient parallel program
execution.
• However, it’s important to remember that for specialized algorithms or maximum control over
hardware, direct CUDA C programming might still be preferred.

Performance Advantages of Thrust

• Thrust offers several performance benefits over CUDA C, particularly in terms of programmer
productivity and code robustness, which can indirectly lead to performance gains:

Programmer Productivity:

• Automatic Launch Configuration: Thrust simplifies parallel programming by handling the
configuration of CUDA launch parameters, such as grid and block dimensions, based on factors like
maximizing GPU occupancy. This relieves programmers from manually tuning these parameters,
potentially leading to faster development and more efficient code execution.
• Rich Set of Algorithms: Thrust’s ready-to-use algorithms for common parallel patterns, such as
map-reduce, contribute to reduced development time and potentially better performance compared
to manually implementing these patterns in CUDA C.

Robustness:

• Handling Hardware Limits: Thrust automatically manages constraints imposed by CUDA devices,
such as limits on grid dimensions or the size of function arguments, leading to more robust
applications and less time spent on debugging hardware-specific issues.

• Optimized Algorithm Selection: Thrust automatically chooses optimized algorithms based on the
data types and operations involved. For example, it uses a faster radix sort for primitive types and a
more general merge sort for other data types, potentially leading to performance gains without
explicit programmer intervention.

Real-World Performance:

• Kernel Fusion: Thrust enables the fusion of multiple operations, like transformations and
reductions, into a single kernel, minimizing memory transactions and improving bandwidth
utilization. This is particularly beneficial for algorithms with low computational intensity, resulting
in significant performance improvements (see the sketch after this list).

• Structure of Arrays Optimization: Thrust promotes using a Structure of Arrays (SoA) data layout,
which aligns data for optimal memory coalescing and faster memory access, potentially leading to
significant performance improvements compared to less efficient data layouts.
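As an illustration of kernel fusion, the following hedged sketch computes a vector norm with thrust::transform_reduce, which fuses the squaring step and the summation into a single kernel instead of materializing an intermediate vector (the functor and function names are illustrative):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cmath>

// Unary functor applied to each element before the reduction.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

float vector_norm(const thrust::device_vector<float> &v)
{
    // transform_reduce squares and sums in one pass over the data,
    // avoiding a temporary vector and the extra memory traffic it would cost.
    float sum_sq = thrust::transform_reduce(v.begin(), v.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());
    return std::sqrt(sum_sq);
}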

While Thrust offers these advantages, it’s important to note that CUDA C provides a lower level of control
that might be necessary for highly specialized optimizations. However, for a wide range of applications,
Thrust simplifies parallel programming and can contribute to achieving high performance on CUDA GPUs.

Unified Memory Simplifies CPU-to-CUDA Porting

Unified Memory simplifies porting CPU code to CUDA by creating a pool of managed memory shared
between the CPU and GPU. This shared memory pool is accessible using a single pointer, effectively
bridging the gap between CPU and GPU memory spaces.

Prior to Unified Memory, porting CPU code to CUDA involved explicitly allocating device memory and
managing data transfers between the host and device. With Unified Memory, the CUDA runtime
transparently handles data migration and coherence, making the managed memory appear as CPU memory
to code running on the CPU and GPU memory to code running on the GPU.

The benefits of Unified Memory for porting CPU code to CUDA are:
• Reduced Code Changes: Porting CPU code becomes as simple as replacing standard memory
allocation functions like malloc() and free() with their CUDA
counterparts, cudaMallocManaged() and cudaFree(). The computation is then offloaded to
the GPU by launching a kernel (a sketch follows this list).

• Simplified Data Management: Unified Memory removes the need for explicit data transfers between
the host and device, as the CUDA runtime automatically migrates data as needed. This eliminates
the complexity of managing separate host and device memory spaces, leading to cleaner and easier-
to-maintain code.
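A hedged sketch of this porting pattern (the kernel and array names are illustrative): the only memory-management calls are cudaMallocManaged() and cudaFree(), and the same pointer is used by host code and by the kernel.

#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;                            // GPU reads and writes the managed pointer
}

int main()
{
    const int n = 1 << 20;
    float *data = NULL;
    cudaMallocManaged(&data, n * sizeof(float));      // replaces malloc(): one pointer for CPU and GPU

    for (int i = 0; i < n; ++i)
        data[i] = 1.0f;                               // CPU initializes directly, no cudaMemcpy needed

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);   // work is offloaded by launching a kernel
    cudaDeviceSynchronize();                          // wait before the CPU touches the data again

    cudaFree(data);                                   // replaces free()
    return 0;
}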

Unified Memory, especially with the page-fault handling capabilities of the Pascal and later architectures,
allows even more complex data structures, such as linked lists, to be seamlessly shared between the CPU
and GPU.

It is important to note that while Unified Memory offers significant convenience and simplifies the porting
process, achieving optimal performance might still require understanding data access patterns and potential
optimization techniques.

5. Programming Heterogeneous Clusters – CUDA and MPI

• One of the biggest advantages of distributed systems over standalone computers is the ability to
share the workload between computers, processors, and cores.
• Clusters (networks of computers configured to work as one computer), grids, and cloud computing
are among the most active branches of parallel computing and data processing today, and have been
identified as important technologies for solving complex scientific and engineering problems as well
as for many projects in commerce and industry.
• A broad spectrum of parallel computing activities and scientific projects are carried out on such systems.
• A newer model of parallel computing, which uses both CPUs and GPUs to solve general-purpose
scientific and engineering problems, has revolutionized data computation over the last few years.
• Tasks that can be divided into large numbers of independent parts are good candidates for this model.
• GPU-enabled calculations are very promising in data analysis, optimization, simulation, and similar workloads.
• Using CUDA or OpenCL and graphics processing units, many real-world applications can be
implemented easily and run significantly faster than on multi-processor or multi-core CPU systems.
• A heterogeneous computing system refers to a system that contains different types of
computational units, such as multicore CPUs, GPUs, DSPs, FPGAs, and ASICs.
• The computational units in a heterogeneous system typically include a general-purpose processor
that runs an operating system.
• In High-Performance Computing (HPC), various applications require the aggregate computing
power of a cluster of computing nodes.
• Many of the HPC clusters today have one or more hosts and one or more devices in each node.
• Since their early days, these clusters have been programmed predominantly with the Message Passing
Interface (MPI).
• MPI helps heterogeneous applications scale to multiple nodes in a cluster environment.
• Domain partitioning, point-to-point communication, and collective communication allow a kernel to
scale across nodes, improving efficiency.

MPI (Message Passing Interface)


MPI Fundamentals

• The most widely used programming interface for computing clusters today is MPI [Gropp 1999],
a set of API functions for communication between processes running in a computing cluster.
• MPI assumes a distributed memory model where processes exchange information by sending
messages to each other.
• When an application uses API communication functions, it does not need to deal with the details of
the interconnect network.
• The MPI implementation allows the processes to address each other using logical numbers, much
the same way as using phone numbers in a telephone system - telephone users can dial each other
using phone numbers without knowing exactly where the called person is and how the call is routed.
• MPI, the Message Passing Interface, is a standard API for communicating data via messages between
distributed processes; it is commonly used in HPC to build applications that can scale to multi-node
computer clusters.
• As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single
computer or node. There are many reasons for wanting to combine the two parallel programming
approaches of MPI and CUDA.
• A common reason is to enable solving problems with a data size too large to fit into the memory of
a single GPU, or that would require an unreasonably long compute time on a single node.
• In a typical MPI application, data and work are partitioned among processes. Each node can contain
one or more processes. As these processes progress, they may need data from each other.
• This need is satisfied by sending and receiving messages.
MPI Working

1. Similar to CUDA, MPI programs are based on the SPMD (Single Program, Multiple Data) parallel
execution model. All MPI processes execute the same program. The MPI system provides a set of API
functions to establish communication systems that allow the processes to communicate with each other.

2. An MPI process is usually called a “rank”. The processes involved in an MPI program have private address
spaces, which allows an MPI program to run on a system with a distributed memory space, such as a cluster.
The MPI standard defines a message-passing API that covers point-to-point messages as well as
collective operations such as reductions.

3. Below are five essential API functions that set up and shut down communication systems for an MPI
application.

1. int MPI_Init(int *argc, char ***argv) - Initializes MPI.

2. int MPI_Comm_rank(MPI_Comm comm, int *rank) - Returns the rank of the calling process in the group of comm.

3. int MPI_Comm_size(MPI_Comm comm, int *size) - Returns the number of processes in the group of comm.

4. int MPI_Abort(MPI_Comm comm, int errorcode) - Terminates MPI communication, aborting all processes with an error code.

5. int MPI_Finalize() - Ends an MPI application and releases all resources.

MPI Programming Example:
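A minimal, hedged sketch of such a program, using the five API functions listed above (the printed messages are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                       /* initialize the MPI runtime                  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* rank of this process in MPI_COMM_WORLD      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of processes                   */

    if (size < 2) {                               /* this example expects at least two processes */
        if (rank == 0)
            printf("Need at least 2 processes.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);             /* terminate all processes with an error code  */
    }

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                               /* release all MPI resources                   */
    return 0;
}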


The program above is a simple MPI program that uses these API functions. A user supplies the executable
file of the program to the mpirun or mpiexec command to launch it on a cluster.

Each process starts by initializing the MPI runtime with an MPI_Init() call. This initializes the
communication system for all the processes running the application.

Once the MPI runtime is initialized, each process calls two functions to prepare for communication.

The first function, MPI_Comm_rank(), returns a unique number for each calling process, called an MPI
rank or process ID. The numbers received by the processes range from 0 to the number of processes
minus 1.

The MPI rank of a process is analogous to the expression blockIdx.x * blockDim.x + threadIdx.x for a
CUDA thread: it uniquely identifies the process in a communication, much like a phone number in a
telephone system.

MPI_Comm_rank() has two parameters. The first is of the MPI built-in type MPI_Comm and specifies the
scope of the request; values of type MPI_Comm are commonly referred to as communicators. The second
is a pointer to the integer variable into which the rank is written.

MPI_Comm and other MPI built-in types are defined in the mpi.h header file, which should be included in
all C program files that use MPI. This is similar to the cuda.h header file for CUDA programs.

An MPI application can create one or more intracommunicators. The members of each intracommunicator
are MPI processes.
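For example, a hedged sketch of splitting MPI_COMM_WORLD into two smaller intracommunicators (the color rule and messages are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split MPI_COMM_WORLD: even ranks form one intracommunicator, odd ranks another. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);            /* rank within the new, smaller group */

    printf("world rank %d has rank %d in its sub-communicator\n", rank, sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}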
