Lesson 1.
Introduction to CUDA
- Memory Allocation and Data Movement API Functions
Objective
To learn the basic API functions in CUDA host code:
- Device Memory Allocation
- Host-Device Data Transfer
Data Parallelism - Vector Addition Example
[Figure: vectors A, B, and C laid out element by element; each output element C[i] is computed from the pair A[i] + B[i], for i = 0 .. N-1]
Vector Addition Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements
    vecAdd(h_A, h_B, h_C, N);
}
Heterogeneous Computing vecAdd
CUDA Host Code

[Figure: Part 1 copies the input vectors from host memory to device memory (CPU to GPU); Part 2 performs the vector addition on the GPU; Part 3 copies the result from device memory back to host memory]
#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory

    // 2. Kernel launch code: the device performs the
    //    actual vector addition

    // 3. Copy C from the device memory; free device
    //    vectors
}
Partial Overview of CUDA Memories
Device code can:
- R/W per-thread registers
- R/W all-shared global memory

Host code can:
- Transfer data to/from per-grid global memory

[Figure: a (Device) Grid containing Block (0, 0) and Block (0, 1); within each block, Thread (0, 0) and Thread (0, 1) each have their own Registers; the Host connects to the device Global Memory]
We will cover more memory types later.
CUDA Device Memory Management
API Functions

cudaMalloc()
- Allocates an object in the device global memory
- Two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object in bytes

cudaFree()
- Frees an object from device global memory
- One parameter: pointer to the freed object
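As a minimal sketch, the two calls pair up like this (d_A and size are assumed to be declared as in the earlier vecAdd outline):

```cuda
float *d_A;                      // device pointer, set by cudaMalloc
int size = n * sizeof(float);

// cudaMalloc takes the *address* of the pointer (cast to void **)
// and the allocation size in bytes
cudaMalloc((void **) &d_A, size);

// ... use d_A in cudaMemcpy calls and kernel launches ...

// cudaFree takes the device pointer itself
cudaFree(d_A);
```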
Host-Device Data Transfer
API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to the device is asynchronous
Vector Addition Host Code
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Kernel invocation code to be shown later

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
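The kernel invocation is covered later in the course; as a preview, a minimal vector-addition kernel and launch might look like this (the kernel name and the 256-thread block size are chosen here only for illustration):

```cuda
// Each thread computes one element of C
__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // guard against extra threads
}

// Launch enough 256-thread blocks to cover all n elements
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
```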
In Practice, Check for API Errors in Host Code
cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
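In real code this check is usually wrapped in a macro so every API call can be tested without repeating the boilerplate. A common sketch (the macro name CUDA_CHECK is our own convention, not part of the CUDA API):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Wrap any CUDA runtime call: print the error and exit on failure
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("%s in %s at line %d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **) &d_A, size));
```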
To Learn More, Read
Chapter 3. Thank you!