DS1822 - Parallel Computing - Unit 5


UNIT V

OTHER GPU PROGRAMMING PLATFORMS


Introduction to OpenCL – OpenACC – C++AMP – Thrust – Programming Heterogeneous Clusters – CUDA
and MPI.

1. Introduction to OpenCL
OpenCL (Open Computing Language) is a framework designed for writing programs that execute across
heterogeneous platforms. This means it can work on CPUs, GPUs, and other processors within a single
device or across multiple devices. Here’s a brief introduction:
Key Features of OpenCL:
• Cross-Platform: Write code that runs on various types of processors.
• Scalability: Effective for parallel computing, it can scale from embedded processors to high-
performance GPUs.
• Portability: OpenCL is designed to be portable, allowing code to run on different hardware without
modification.
• Parallel Computing: OpenCL allows the execution of multiple tasks simultaneously, enhancing
performance for computational-heavy applications.
Anatomy of OpenCL:
The OpenCL development framework is made up of three main parts:
• Language specification
• Platform layer API
• Runtime API
Language Specification:
• The language specification describes the syntax and programming interface for writing kernel
programs that run on the supported accelerator (GPU, multi-core CPU, or DSP).
• Kernels can be precompiled or the developer can allow the OpenCL runtime to compile the kernel
program at runtime.
Platform API:
• The platform-layer API gives the developer access to software application routines that can query
the system for the existence of OpenCL-supported devices.
• This layer also lets the developer use the concepts of device context and work-queues to select and
initialize OpenCL devices, submit work to the devices, and enable data transfer to and from the
devices.
Runtime API:
• The OpenCL framework uses contexts to manage one or more OpenCL devices.
• The runtime API uses contexts for managing objects such as command queues, memory objects,
and kernel objects, as well as for executing kernels on one or more devices specified in the context.
OpenCL Architecture:
The Platform Model:
• The OpenCL platform model is defined as a host connected to one or more OpenCL devices.
• The platform model comprises one host plus one or more compute devices, each containing multiple
compute units, each of which has multiple processing elements.
• A host is any computer with a CPU running a standard operating system. OpenCL devices can be a
GPU, DSP, or a multi-core CPU. An OpenCL device consists of a collection of one or more compute
units (cores).
• A compute unit is further composed of one or more processing elements.
• Processing elements execute instructions as SIMD (Single Instruction, Multiple Data) or SPMD
(Single Program, Multiple Data).
• SPMD instructions are typically executed on general purpose devices such as CPUs, while SIMD
instructions require a vector processor such as a GPU or vector units in a CPU.

The Execution Model:


• The OpenCL execution model comprises two components: kernels and host programs.
• Kernels are the basic unit of executable code that runs on one or more OpenCL devices.
• Kernels are similar to a C function that can be data- or task-parallel.
• The host program executes on the host system, defines the device context, and queues kernel execution
instances using command queues.
• Kernels are queued in-order, but can be executed in-order or out-of-order.
Kernels:
• The OpenCL execution model supports two categories of kernels: OpenCL kernels and native
kernels.
• OpenCL kernels are written in the OpenCL C language and compiled with the OpenCL
compiler.
• All devices that are OpenCL-compliant support execution of OpenCL kernels.
• Native kernels are extension kernels that could be special functions defined in application code
or exported from a library designed for a particular accelerator.
• The OpenCL API includes functions to query capabilities of devices to determine if native
kernels are supported.
• If native kernels are used, developers should be aware that the code may not work on other
OpenCL devices.
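For illustration, a minimal OpenCL C kernel for element-wise vector addition might look like the following sketch (the kernel name and arguments are illustrative, not taken from the text above):

// Sketch of an OpenCL C kernel: element-wise vector addition.
// Each work-item computes one output element, identified by its global ID.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t gid = get_global_id(0);   // unique index of this work-item
    if (gid < n)                     // guard against surplus work-items
        c[gid] = a[gid] + b[gid];
}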
Host Program:
• The host program is responsible for setting up and managing the execution of kernels on the
OpenCL device through the use of context.
• Using the OpenCL API, the host can create and manipulate the context by including the following
resources:
• Devices — A set of OpenCL devices used by the host to execute kernels.
• Program Objects — The program source or program object that implements a kernel or
collection of kernels.
• Kernels — The specific OpenCL functions that execute on the OpenCL device.
• Memory Objects — A set of memory buffers or memory maps common to the host and
OpenCL devices.
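As a hedged sketch of this host-side setup (error checking omitted; the function name is illustrative), the platform-layer and runtime APIs are typically used as follows:

#include <CL/cl.h>

/* Minimal sketch: query a platform and device, then create a context and command queue. */
int setup_opencl(cl_context *ctx, cl_command_queue *queue)
{
    cl_platform_id platform;
    cl_device_id device;

    clGetPlatformIDs(1, &platform, NULL);                            /* platform-layer API: find a platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  /* query for an OpenCL-capable GPU     */

    *ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);      /* runtime API: context for the device */
    *queue = clCreateCommandQueue(*ctx, device, 0, NULL);            /* command queue for kernel submission */

    /* Program and kernel objects would be created next, e.g. with
       clCreateProgramWithSource(), clBuildProgram(), and clCreateKernel(). */
    return 0;
}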
The Memory Model:
• Since the host and the OpenCL devices do not share a common memory address space, the OpenCL
memory model defines four regions of memory accessible to work-items when executing a kernel.
• The regions, and how the host and the compute device may access them, are as follows:
• Global memory is a memory region in which all work-items and work-groups have read and write
access on both the compute device and the host.
• This region of memory can be allocated only by the host during runtime.
• Constant memory is a region of global memory that stays constant throughout the execution of the
kernel.
• Work-items have only read access to this region. The host is permitted both read and write access.
• Local memory is a region of memory used for data-sharing by work-items in a work group.
• All work-items in the same work-group have both read and write access.
• Private memory is a region that is accessible to only one work-item.
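The address-space qualifiers of OpenCL C map directly onto these regions. The following is a small, illustrative kernel (the names are hypothetical) that touches all four regions:

__constant float scale = 2.0f;                      // constant memory: read-only for work-items

__kernel void block_sum(__global const float *in,   // global memory: visible to all work-items and the host
                        __global float *out,
                        __local float *scratch)     // local memory: shared within one work-group
{
    float x = in[get_global_id(0)] * scale;         // x lives in the private memory of this work-item
    scratch[get_local_id(0)] = x;
    barrier(CLK_LOCAL_MEM_FENCE);                   // synchronize the work-group before reading shared data

    if (get_local_id(0) == 0) {
        float sum = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            sum += scratch[i];
        out[get_group_id(0)] = sum;                 // one partial sum per work-group
    }
}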

2. OpenACC
OpenACC is another powerful platform for GPU programming, similar to OpenCL but with a focus on
simplifying parallel programming for scientists and researchers. Here's a brief overview:

Key Features of OpenACC:

• Directive-Based: OpenACC uses compiler directives to specify which parts of the code should be
accelerated, making it easier to parallelize existing code without significant modifications (a sketch follows this list).

• Cross-Platform: Like OpenCL, OpenACC supports multiple architectures, including CPUs and
GPUs from different vendors like NVIDIA, AMD, and Intel.
• Portability: Code written with OpenACC can run on various hardware platforms without needing
significant changes.

• Ease of Use: Designed to reduce the complexity of parallel programming, making it more
accessible for domain scientists and researchers.
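As a minimal sketch of this directive-based style (the SAXPY routine below is illustrative), a single pragma is enough to ask the compiler to offload a loop:

// Hedged OpenACC sketch: the directive requests parallel execution of the loop
// on an accelerator; the data clauses describe how x and y are moved.
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

If the directive is ignored by a compiler without OpenACC support, the code still runs correctly as ordinary sequential C, which is a large part of OpenACC's portability story.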

3. C++ AMP

• C++ Accelerated Massive Parallelism (C++ AMP) is a native programming model that contains
elements that span the C++ programming language and its runtime library. It provides an easy way
to write programs that compile and execute on data-parallel hardware, such as graphics
cards (GPUs).
• The C++ AMP programming model gives the developer explicit control over how the program interacts
with the accelerator.
• The developer may explicitly manage all communication between the CPU and the accelerator, and
this communication can be either synchronous or asynchronous.
• The data parallel computations performed on the accelerator are expressed using high-level
abstractions, such as multi-dimensional arrays, high level array manipulation functions, and multi-
dimensional indexing operations, all based on a large subset of the C++ programming language.
• The programming model contains multiple layers, allowing developers to trade off ease-of-use with
maximum performance.
• C++ AMP is composed of three broad categories of functionality:

1. C++ language and compiler

a. Kernel functions are compiled into code that is specific to the accelerator.

2. Runtime

a. The runtime contains a C++ AMP abstraction of lower-level accelerator APIs, as well as
support for multiple host threads and processors, and multiple accelerators.

b. Asynchronous execution is supported through an eventing model.

3. Programming model

a. A set of classes describing the shape and extent of data.

b. A set of classes that contain or refer to data used in computations

c. A set of functions for copying data to and from accelerators

d. A math library

e. An atomic library


f. A set of miscellaneous intrinsic functions

Definitions:

This section introduces terms used in the C++ AMP specification.

• Accelerator:

A hardware device or capability that enables accelerated computation on data-parallel workloads.


Examples include:

o Graphics Processing Unit (GPU) or other coprocessor, accessible through the PCIe bus.

o Graphics Processing Unit, or GPU, or other coprocessor that is integrated with a CPU on the
same die.

o SIMD units of the host node exposed through software emulation of a hardware accelerator.

• Array: A dense N-dimensional data container.

• Array View: A view into a contiguous piece of memory that adds array-like dimensionality.

• Compressed texture format: A format that divides a texture into blocks that allow the texture to be reduced
in size by a fixed ratio; typically 4:1 or 6:1.

Compressed textures are useful when perfect image/texel fidelity is not necessary but where minimizing
memory storage and bandwidth is critical to application performance.

• Extent: A vector of integers that describes the lengths of N-dimensional array-like objects.

• Global memory: On a GPU, global memory is the main off-chip memory store.

Programming Model:
C++ Language Extensions for Accelerated Computing:
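The central language extension is the restrict(amp) specifier, which marks functions (including lambdas) that can be compiled for the accelerator. The following is a minimal, hedged sketch of a vector addition using these extensions (the function and variable names are illustrative):

#include <amp.h>
using namespace concurrency;

// Hedged sketch: element-wise vector addition with C++ AMP.
// array_view wraps host data, parallel_for_each runs the lambda on the
// accelerator, and restrict(amp) marks code compilable for the accelerator.
void add_arrays(int n, const float *a, const float *b, float *sum)
{
    array_view<const float, 1> av(n, a);
    array_view<const float, 1> bv(n, b);
    array_view<float, 1> sv(n, sum);
    sv.discard_data();                                // no need to copy stale results to the accelerator

    parallel_for_each(sv.extent, [=](index<1> idx) restrict(amp)
    {
        sv[idx] = av[idx] + bv[idx];                  // executed by one accelerator thread per element
    });

    sv.synchronize();                                 // copy results back to host memory
}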
4. Thrust: A High-Level Library for CUDA

• Thrust is a productivity-oriented library for CUDA, built entirely on top of CUDA C/C++.
• It provides a high-level abstraction of common parallel programming patterns, inspired by the C++
Standard Template Library (STL), making it easier to write efficient and portable GPU-accelerated
code without needing to manage low-level CUDA details directly.

Thrust and CUDA Relationship

• Abstraction Layer: Thrust serves as an abstraction layer over CUDA C/C++. It simplifies tasks
like kernel launches, memory management, and algorithm selection, allowing developers to focus
on the high-level structure of their parallel computations.

• Interoperability: Thrust is designed for seamless interoperability with CUDA C/C++. Developers
can easily integrate CUDA C code into Thrust applications and vice versa, leveraging the strengths
of both. For example, using raw pointers allows passing data between Thrust containers and CUDA
kernels, ensuring a flexible approach to parallelization.

• Performance: While abstracting low-level details, Thrust maintains a focus on performance. It
employs optimizations like kernel fusion and structure of arrays to minimize memory transfers and
maximize GPU utilization, ultimately aiming to achieve performance comparable to carefully hand-
tuned CUDA C code.
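A hedged sketch of this STL-like style (compiled with nvcc as a .cu file; the sizes and values are illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main()
{
    thrust::host_vector<int> h(1 << 20);            // data on the host
    thrust::sequence(h.begin(), h.end());           // fill with 0, 1, 2, ...

    thrust::device_vector<int> d = h;               // implicit host-to-device copy
    thrust::sort(d.begin(), d.end());               // sorting runs on the GPU
    int sum = thrust::reduce(d.begin(), d.end());   // parallel reduction on the GPU

    // Interoperability: extract a raw pointer for use by a hand-written CUDA kernel.
    int *raw = thrust::raw_pointer_cast(d.data());

    (void)raw;
    (void)sum;
    return 0;
}

No kernel launch configuration or explicit cudaMalloc/cudaFree call appears anywhere; Thrust chooses the launch parameters and manages device memory through its containers.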

Benefits of using Thrust over CUDA C:

• Increased Programmer Productivity: Thrust automates parallel programming tasks such as kernel
launch configurations and memory management, allowing developers to focus on the algorithm
logic instead of low-level implementation details.

• Enhanced Robustness: By handling low-level details, Thrust improves the robustness of CUDA
applications. It automatically addresses potential issues like limits on grid dimensions and data type
sizes, ensuring consistent behavior across various CUDA-capable devices.

• Improved Code Readability: Thrust’s STL-like syntax promotes concise and readable code,
making it easier to understand and maintain GPU-accelerated applications.

Examples of Thrust features simplifying CUDA development:

• Kernel Launch Abstraction: In CUDA, developers need to explicitly define the grid and block
dimensions for kernel launches. Thrust simplifies this by automatically determining an efficient
launch configuration based on factors like available resources and desired occupancy.

• Memory Management: Thrust’s container classes (e.g., host_vector, device_vector) automate
memory allocation and deallocation on both host and device, streamlining data management
compared to manual calls to functions like cudaMalloc and cudaFree.

• Thrust empowers developers to harness the power of CUDA GPUs with higher-level abstractions,
facilitating faster development, improved code maintainability, and efficient parallel program
execution.
• However, it’s important to remember that for specialized algorithms or maximum control over
hardware, direct CUDA C programming might still be preferred.

Performance Advantages of Thrust

• Thrust offers several performance benefits over CUDA C, particularly in terms of programmer
productivity and code robustness, which can indirectly lead to performance gains:

Programmer Productivity:

• Automatic Launch Configuration: Thrust simplifies parallel programming by handling the
configuration of CUDA launch parameters, such as grid and block dimensions, based on factors like
maximizing GPU occupancy. This relieves programmers from manually tuning these parameters,
potentially leading to faster development and more efficient code execution.
• Rich Set of Algorithms: Thrust’s ready-to-use algorithms for common parallel patterns, such as
map-reduce, contribute to reduced development time and potentially better performance compared
to manually implementing these patterns in CUDA C.

Robustness:

• Handling Hardware Limits: Thrust automatically manages constraints imposed by CUDA devices,
such as limits on grid dimensions or the size of function arguments, leading to more robust
applications and less time spent on debugging hardware-specific issues.

• Optimized Algorithm Selection: Thrust automatically chooses optimized algorithms based on the
data types and operations involved. For example, it uses a faster radix sort for primitive types and a
more general merge sort for other data types, potentially leading to performance gains without
explicit programmer intervention.

Real-World Performance:

• Kernel Fusion: Thrust enables the fusion of multiple operations, like transformations and
reductions, into a single kernel, minimizing memory transactions and improving bandwidth
utilization. This is particularly beneficial for algorithms with low computational intensity, resulting
in significant performance improvements (see the sketch after this list).

• Structure of Arrays Optimization: Thrust promotes using a Structure of Arrays (SoA) data layout,
which aligns data for optimal memory coalescing and faster memory access, potentially leading to
significant performance improvements compared to less efficient data layouts.
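As an illustration of kernel fusion, the following hedged sketch computes a vector norm with thrust::transform_reduce, which fuses the squaring step and the summation into a single kernel instead of materializing an intermediate vector (the functor and function names are illustrative):

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cmath>

// Unary functor applied to each element before the reduction.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

float vector_norm(const thrust::device_vector<float> &v)
{
    // transform_reduce squares and sums in one pass over the data,
    // avoiding a temporary vector and the extra memory traffic it would cost.
    float sum_sq = thrust::transform_reduce(v.begin(), v.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());
    return std::sqrt(sum_sq);
}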

While Thrust offers these advantages, it’s important to note that CUDA C provides a lower level of control
that might be necessary for highly specialized optimizations. However, for a wide range of applications,
Thrust simplifies parallel programming and can contribute to achieving high performance on CUDA GPUs.

Unified Memory Simplifies CPU-to-CUDA Porting

Unified Memory simplifies porting CPU code to CUDA by creating a pool of managed memory shared
between the CPU and GPU. This shared memory pool is accessible using a single pointer, effectively
bridging the gap between CPU and GPU memory spaces.

Prior to Unified Memory, porting CPU code to CUDA involved explicitly allocating device memory and
managing data transfers between the host and device. With Unified Memory, the CUDA runtime
transparently handles data migration and coherence, making the managed memory appear as CPU memory
to code running on the CPU and GPU memory to code running on the GPU.

The benefits of Unified Memory for porting CPU code to CUDA are:
• Reduced Code Changes: Porting CPU code becomes as simple as replacing standard memory
allocation functions like malloc() and free() with their CUDA
counterparts, cudaMallocManaged() and cudaFree(). The computation is then offloaded to
the GPU by launching a kernel (a sketch follows this list).

• Simplified Data Management: Unified Memory removes the need for explicit data transfers between
the host and device, as the CUDA runtime automatically migrates data as needed. This eliminates
the complexity of managing separate host and device memory spaces, leading to cleaner and easier-
to-maintain code.
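A hedged sketch of this porting pattern (the kernel and array names are illustrative): the only memory-management calls are cudaMallocManaged() and cudaFree(), and the same pointer is used by host code and by the kernel.

#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;                            // GPU reads and writes the managed pointer
}

int main()
{
    const int n = 1 << 20;
    float *data = NULL;
    cudaMallocManaged(&data, n * sizeof(float));      // replaces malloc(): one pointer for CPU and GPU

    for (int i = 0; i < n; ++i)
        data[i] = 1.0f;                               // CPU initializes directly, no cudaMemcpy needed

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);   // work is offloaded by launching a kernel
    cudaDeviceSynchronize();                          // wait before the CPU touches the data again

    cudaFree(data);                                   // replaces free()
    return 0;
}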

Unified Memory, especially with the page-fault handling capabilities of the Pascal and later architectures,
allows even more complex data structures, such as linked lists, to be seamlessly shared between the CPU
and GPU.

It is important to note that while Unified Memory offers significant convenience and simplifies the porting
process, achieving optimal performance might still require understanding data access patterns and potential
optimization techniques.

5. Programming Heterogeneous Clusters – CUDA and MPI

• One of the biggest advantages of distributed systems over standalone computers is the ability to
share the workload between computers, processors, and cores.
• Clusters (networks of computers configured to work as one computer), grids, and cloud computing
are among the most active branches of parallel computing and data processing today, and have been
identified as important technologies for solving complex scientific and engineering problems as well
as for many projects in commerce and industry.
• A broad spectrum of parallel computing activities and scientific projects are carried out on such systems.
• A newer model of parallel computing, which uses both CPUs and GPUs to solve general-purpose
scientific and engineering problems, has revolutionized data computation over the last few years.
• Tasks that can be divided into large numbers of independent parts are good candidates for this model.
• GPU-enabled calculations are very promising in data analysis, optimization, simulation, and similar workloads.
• Using CUDA or OpenCL and graphics processing units, many real-world applications can be
implemented easily and run significantly faster than on multi-processor or multi-core CPU systems.
• A heterogeneous computing system refers to a system that contains different types of
computational units, such as multicore CPUs, GPUs, DSPs, FPGAs, and ASICs.
• The computational units in a heterogeneous system typically include a general-purpose processor
that runs an operating system.
• In High-Performance Computing (HPC), various applications require the aggregate computing
power of a cluster of computing nodes.
• Many of the HPC clusters today have one or more hosts and one or more devices in each node.
• Since their early days, these clusters have been programmed predominantly with the Message Passing
Interface (MPI).
• MPI helps heterogeneous applications scale to multiple nodes in a cluster environment.
• Domain partitioning, point-to-point communication, and collective communication allow a kernel to
scale across nodes, improving efficiency.

MPI (Message Passing Interface)


MPI Fundamentals

• The most widely used programming interface for computing clusters today is MPI [Gropp 1999],
a set of API functions for communication between processes running in a computing cluster.
• MPI assumes a distributed memory model where processes exchange information by sending
messages to each other.
• When an application uses API communication functions, it does not need to deal with the details of
the interconnect network.
• The MPI implementation allows the processes to address each other using logical numbers, much
the same way as using phone numbers in a telephone system - telephone users can dial each other
using phone numbers without knowing exactly where the called person is and how the call is routed.
• MPI, the Message Passing Interface, is a standard API for communicating data via messages between
distributed processes; it is commonly used in HPC to build applications that can scale to multi-node
computer clusters.
• As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single
computer or node. There are many reasons for wanting to combine the two parallel programming
approaches of MPI and CUDA.
• A common reason is to enable solving problems with a data size too large to fit into the memory of
a single GPU, or that would require an unreasonably long compute time on a single node.
• In a typical MPI application, data and work are partitioned among processes. Each node can contain
one or more processes. As these processes progress, they may need data from each other.
• This need is satisfied by sending and receiving messages.
MPI Working

1. Similar to CUDA, MPI programs are based on the SPMD (Single Program, Multiple Data) parallel
execution model. All MPI processes execute the same program. The MPI system provides a set of API
functions to establish communication systems that allow the processes to communicate with each other.

2. An MPI process is usually called a “rank”. The processes involved in an MPI program have private address
spaces, which allows an MPI program to run on a system with a distributed memory space, such as a cluster.
The MPI standard defines a message-passing API that covers point-to-point messages as well as
collective operations such as reductions.

3. Below are five essential API functions that set up and shut down communication systems for an MPI
application.

1. int MPI_Init(int *argc, char ***argv) - Initializes MPI.

2. int MPI_Comm_rank(MPI_Comm comm, int *rank) - Returns the rank of the calling process in the group of comm.

3. int MPI_Comm_size(MPI_Comm comm, int *size) - Returns the number of processes in the group of comm.

4. int MPI_Abort(MPI_Comm comm, int errorcode) - Terminates MPI communication, aborting all processes with an error code.

5. int MPI_Finalize() - Ends an MPI application and releases all resources.

MPI Programming Example:
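A minimal, hedged sketch of such a program, using the five API functions listed above (the printed messages are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                       /* initialize the MPI runtime                  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* rank of this process in MPI_COMM_WORLD      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* total number of processes                   */

    if (size < 2) {                               /* this example expects at least two processes */
        if (rank == 0)
            printf("Need at least 2 processes.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);             /* terminate all processes with an error code  */
    }

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                               /* release all MPI resources                   */
    return 0;
}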


The program above is a simple MPI program that uses these API functions. A user supplies the executable
file of the program to the mpirun or mpiexec command to launch it on a cluster.

Each process starts by initializing the MPI runtime with an MPI_Init() call. This initializes the
communication system for all the processes running the application.

Once the MPI runtime is initialized, each process calls two functions to prepare for communication.

The first function, MPI_Comm_rank(), returns a unique number for each calling process, called an MPI
rank or process ID. The numbers received by the processes range from 0 to the number of processes
minus 1.

The MPI rank of a process is analogous to the expression blockIdx.x * blockDim.x + threadIdx.x for a
CUDA thread: it uniquely identifies the process in a communication, much like a phone number in a
telephone system.

MPI_Comm_rank() has two parameters. The first is of the MPI built-in type MPI_Comm and specifies the
scope of the request; values of type MPI_Comm are commonly referred to as communicators. The second
is a pointer to the integer variable into which the rank is written.

MPI_Comm and other MPI built-in types are defined in the mpi.h header file, which should be included in
all C program files that use MPI. This is similar to the cuda.h header file for CUDA programs.

An MPI application can create one or more intracommunicators. The members of each intracommunicator
are MPI processes.
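For example, a hedged sketch of splitting MPI_COMM_WORLD into two smaller intracommunicators (the color rule and messages are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split MPI_COMM_WORLD: even ranks form one intracommunicator, odd ranks another. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &sub_comm);
    MPI_Comm_rank(sub_comm, &sub_rank);            /* rank within the new, smaller group */

    printf("world rank %d has rank %d in its sub-communicator\n", rank, sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}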
