PARALLEL PROGRAMMING
MANY-CORE COMPUTING:
ADVANCED CUDA (4/5)
Rob van Nieuwpoort
rob@cs.vu.nl
Schedule
2
1. Introduction, performance metrics & analysis
2. Many-core hardware, low-level optimizations
3. GPU hardware and Cuda class 1: basics
4. Cuda class 2: advanced; OpenCL
5. Case study: LOFAR telescope with many-cores
Grids, Thread Blocks and Threads
[Figure: a grid of 2 x 3 thread blocks, Thread Block (0,0) through Thread Block (1,2); each thread block contains a 3 x 4 array of threads indexed (row, column).]
Hardware Memory Spaces in CUDA
4
[Figure: each thread has its own registers; each thread block has its own shared memory; all blocks in the grid share the device memory and the constant memory, both of which are accessible from the host.]
Vector addition GPU code
5
// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C) {   // GPU code
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main() {                        // Host code (can be in the same file)
    // initialization code here ...
    // launch N/256 blocks of 256 threads each
    vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);
    // cleanup code here ...
}
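For completeness, a minimal sketch of the host-side setup that the comments above elide (assuming hostA, hostB, hostC are float arrays of length N on the host, and N is a multiple of 256):

// allocate device memory
float *deviceA, *deviceB, *deviceC;
cudaMalloc((void**)&deviceA, N * sizeof(float));
cudaMalloc((void**)&deviceB, N * sizeof(float));
cudaMalloc((void**)&deviceC, N * sizeof(float));
// copy the input vectors to the device
cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);
// launch the kernel as above
vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);
// copy the result back, then release device memory
cudaMemcpy(hostC, deviceC, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);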
6
CUDA: Scheduling, Synchronization
and Atomics
Thread Scheduling
7
Order in which thread blocks are scheduled is
undefined!
any possible interleaving of blocks should be valid
presumed to run to completion without preemption
can run in any order
can run concurrently OR sequentially
Order of threads within a block is also undefined!
Global synchronization
11
Q: How do we do global synchronization with these
scheduling semantics?
A1: Not possible!
A2: Finish a grid, and start a new one!
step1<<<grid1,blk1>>>(...);
// CUDA ensures that all writes from step1 are complete.
step2<<<grid2,blk2>>>(...);
We don't have to copy the data back and forth!
Atomics
12
Guarantee that only a single thread has access to a
piece of memory during an operation
No dropped data, but ordering is still arbitrary
Different types of atomic instructions
Atomic Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And,
Or, Xor
Can be done on device memory and shared memory
Much more expensive than load + operation + store
Example: Histogram
13
// Determine frequency of colors in a picture.
// Colors have already been converted into integers
// between 0 and 255.
// Each thread looks at one pixel,
// and increments a counter
__global__ void histogram(int* colors, int* buckets)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int c = colors[i];
buckets[c] += 1; // RACE: several threads may increment the same bucket at once, so updates can be lost
}
Example: Histogram
15
// Determine frequency of colors in a picture.
// Colors have already been converted into integers
// between 0 and 255.
// Each thread looks at one pixel,
// and increments a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int c = colors[i];
atomicAdd(&buckets[c], 1);
}
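As the Atomics slide notes, atomics also work on shared memory. A common refinement (a sketch, not from the slides; the kernel name and structure are mine) is to build a per-block histogram in shared memory, where the atomic traffic stays on-chip, and merge it into the global buckets once per block:

__global__ void histogram_smem(int* colors, int* buckets)
{
    __shared__ int s_buckets[256];           // one counter per color value
    // cooperatively clear the per-block histogram
    for (int c = threadIdx.x; c < 256; c += blockDim.x)
        s_buckets[c] = 0;
    __syncthreads();

    int i = threadIdx.x + blockDim.x * blockIdx.x;
    atomicAdd(&s_buckets[colors[i]], 1);     // atomic on shared memory
    __syncthreads();

    // merge this block's histogram into the global buckets
    for (int c = threadIdx.x; c < 256; c += blockDim.x)
        atomicAdd(&buckets[c], s_buckets[c]);
}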
Example: Work queue
16
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense to
// continuously grab work from a queue.
__global__
void workq(int* work_q, unsigned int* q_counter,
           unsigned int queue_max, int* output)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // atomicInc returns the old counter value and wraps it back
    // to 0 once it reaches queue_max
    unsigned int q_index = atomicInc(q_counter, queue_max);
    int result = do_work(work_q[q_index]);   // do_work: defined elsewhere
    output[i] = result;
}
17 CUDA: optimizing your application
Coalescing
Coalescing
18
Consider the stride of your accesses
19
__global__ void foo(int* input, float3* input2) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
// Stride 1, OK!
int a = input[i];
// Stride 2, half the bandwidth is wasted
int b = input[2*i];
// Stride 3, 2/3 of the bandwidth wasted
float c = input2[i].x;
}
Example: Array of Structures (AoS)
20
struct record {
int key;
int value;
int flag;
};
record *d_records;
cudaMalloc((void**)&d_records, ...);
Example: Structure of Arrays (SoA)
21
struct SoA {
int* keys;
int* values;
int* flags;
};
SoA d_SoA_data;
cudaMalloc((void**)&d_SoA_data.keys, ...);
cudaMalloc((void**)&d_SoA_data.values, ...);
cudaMalloc((void**)&d_SoA_data.flags, ...);
Example: SoA vs AoS
22
__global__ void bar(record* AoS_data,
SoA SoA_data) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
// AoS wastes bandwidth
int key1 = AoS_data[i].key;
// SoA efficient use of bandwidth
int key2 = SoA_data.keys[i];
}
Memory Coalescing
23
Structure of arrays is often better than array of
structures
Very clear win on regular, stride 1 access patterns
Unpredictable or irregular access patterns are
case-by-case
Can lose a factor of 10 – 30!
24 CUDA: optimizing your application
Shared Memory
Using shared memory
27
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input) {
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i > 0) {
        // each thread loads two elements from device memory;
        // note that the next thread also reads input[i]
        int x_i = input[i];
        int x_i_minus_one = input[i-1];
        result[i] = x_i - x_i_minus_one;
    }
}
Using shared memory: coalescing
29
__global__ void adj_diff(int *result, int *input) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int s_data[BLOCK_SIZE];  // shared, 1 elt / thread
    // each thread reads 1 device memory elt, stores it in s_data
    s_data[threadIdx.x] = input[i];     // COALESCED ACCESS!
    // avoid race condition: ensure all loads are complete
    __syncthreads();
    if(threadIdx.x > 0) {
        result[i] = s_data[threadIdx.x] - s_data[threadIdx.x-1];
    } else if(i > 0) {
        // I am thread 0 in this block: handle thread block boundary
        result[i] = s_data[threadIdx.x] - input[i-1];
    }
}
A Common Programming Strategy
30
Partition data into subsets that fit into shared
memory
A Common Programming Strategy
31
Handle each data subset with one thread block
A Common Programming Strategy
32
Load the subset from device memory to shared
memory, using multiple threads to exploit memory-
level parallelism
A Common Programming Strategy
33
Perform the computation on the subset from shared
memory
A Common Programming Strategy
34
Copy the result from shared memory back to device
memory
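The adj_diff kernel above already follows this pattern. In generic form it looks like the sketch below (my sketch, assuming a 1D problem, BLOCK_SIZE threads per block, and a hypothetical __device__ function f that only needs data from its own tile):

__global__ void process(float* output, const float* input)
{
    __shared__ float tile[BLOCK_SIZE];
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    // load this block's subset into shared memory, one element per
    // thread, so the device-memory reads are coalesced
    tile[threadIdx.x] = input[i];
    __syncthreads();                // wait until the whole tile is loaded

    // compute on the subset from shared memory
    float result = f(tile, threadIdx.x);

    // copy the result back to device memory
    output[i] = result;
}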
35 CUDA: optimizing your application
Optimizing Occupancy
Thread Scheduling
36
SM implements zero-overhead warp scheduling
A warp is a group of 32 threads that runs concurrently on an SM
At any time, only one of the warps is executed by an SM
Warps whose next instruction has its inputs ready for consumption
are eligible for execution
Eligible warps are selected for execution based on a prioritized
scheduling policy
All threads in a warp execute the same instruction when selected
[Figure: execution timeline. Whenever the current warp stalls (TB1 W1, then TB2 W1, then TB3 W2), the scheduler immediately issues instructions from another eligible warp, so warps from thread blocks TB1, TB2 and TB3 are interleaved over time. TB = Thread Block, W = Warp]
Stalling warps
37
What happens if all warps are stalled?
No instruction issued → performance lost
Most common reason for stalling?
Waiting on global memory
If your code reads global memory every couple of
instructions
You should try to maximize occupancy
Occupancy
38
What determines occupancy?
Limited resources!
Register usage per thread
Shared memory per thread block
Resource Limits (1)
39
[Figure: the register files and shared memories of two SMs being filled by thread blocks TB 0, TB 1, TB 2; whichever resource is exhausted first determines how many blocks fit on an SM.]
Pool of registers and shared memory per SM
Each thread block grabs registers & shared memory
If one or the other is fully utilized, no more thread blocks can be scheduled on that SM
Resource Limits (2)
40
Can only have 8 thread blocks per SM
If they're too small, they can't fill up the SM
Need 128 threads / block on GT200 (4 cycles/instruction)
Need 192 threads / block on Fermi (6 cycles/instruction)
Higher occupancy has diminishing returns for hiding
latency
Hiding Latency with more threads
41
How do you know what you’re using?
42
Use “nvcc -Xptxas -v” to get register and
shared memory usage
Plug those numbers into CUDA Occupancy
Calculator
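Roughly, the calculation the Occupancy Calculator performs with those numbers looks like this (a sketch with made-up per-SM limits; the real limits depend on the GPU generation):

// hypothetical per-SM budget (example values only):
// 16384 registers, 16 KB shared memory, at most 8 resident blocks
int threadsPerBlock = 256;
int regsPerThread   = 16;     // reported by "nvcc -Xptxas -v"
int shmemPerBlock   = 4096;   // bytes, reported by "nvcc -Xptxas -v"

int blocksByRegs  = 16384 / (regsPerThread * threadsPerBlock);  // = 4
int blocksByShmem = (16 * 1024) / shmemPerBlock;                // = 4
int blocksBySlots = 8;                                          // hardware limit

// the tightest limit determines how many blocks are resident per SM
int residentBlocks = blocksByRegs;
if (blocksByShmem < residentBlocks) residentBlocks = blocksByShmem;
if (blocksBySlots < residentBlocks) residentBlocks = blocksBySlots;

int residentThreads = residentBlocks * threadsPerBlock;  // = 1024 threads per SM
// occupancy = residentThreads divided by the SM's maximum number of resident threads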
44 CUDA: optimizing your application
Shared memory bank conflicts
Shared Memory Banks
45
Shared memory is banked
Only matters for threads within a warp
Full performance with some restrictions
Threads can each access different banks
Or can all access the same value
Consecutive words are in different banks
If two or more threads access different words in the same bank,
we get bank conflicts
Bank Addressing Examples: OK
46
No Bank Conflicts (both cases)
[Figure: two access patterns in which every thread of the (half-)warp maps to a distinct bank, e.g. thread 0 to bank 0, thread 1 to bank 1, ..., thread 15 to bank 15.]
Bank Addressing Examples: BAD
47
2-way Bank Conflicts (left) and 8-way Bank Conflicts (right)
[Figure: left, pairs of threads map to the same bank; right, groups of eight threads (x8) map to the same bank, so their accesses are serialized.]
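In code, the patterns in the two figures look roughly as follows (a sketch; the kernel name, BLOCK_SIZE and the stride-2 example are mine, and the exact number of banks depends on the architecture):

__global__ void bank_examples(float* out)
{
    __shared__ float s_data[2 * BLOCK_SIZE];
    s_data[threadIdx.x] = threadIdx.x;               // fill the array (stride 1)
    s_data[threadIdx.x + BLOCK_SIZE] = 0.0f;
    __syncthreads();

    float a = s_data[threadIdx.x];     // stride 1: no bank conflicts
    float b = s_data[0];               // same word for all threads: broadcast, no conflicts
    float c = s_data[2 * threadIdx.x]; // stride 2: 2-way bank conflicts, accesses serialized
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c;
}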
Trick to Assess Performance Impact
48
Change all shared memory reads to the same value
All broadcasts = no conflicts
Will show how much performance could be
improved by eliminating bank conflicts
The same doesn’t work for shared memory writes
So, replace shared memory array indices with threadIdx.x
(Could also be done for the reads)
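Applied to some hypothetical kernel whose shared memory accesses go through an index array (index[], the kernel name and structure are mine), the trick looks like this; the modified kernel computes a wrong answer, but the timing gap versus the original shows how much the bank conflicts cost:

__global__ void conflict_probe(float* out, const int* index)
{
    __shared__ float s_data[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = (float)i;
    __syncthreads();

    // original read -- may have bank conflicts, depending on index[i]:
    //     float x = s_data[index[i]];
    // diagnostic read -- every thread loads the same word (a broadcast),
    // so no read conflicts are possible:
    float x = s_data[0];

    // original write -- may have bank conflicts:
    //     s_data[index[i]] = x;
    // diagnostic write -- each thread writes to its own bank:
    s_data[threadIdx.x] = x;
    __syncthreads();

    out[i] = s_data[threadIdx.x];
}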
49 Generic programming models
OpenCL
Portability
50
Inter-family vs inter-vendor
NVIDIA Cuda runs on all NVIDIA GPU families
OpenCL runs on all GPUs, Cell, CPUs
Parallelism portability
Different architectures require different granularity
Task vs data parallel
Performance portability
Can we express platform-specific optimizations?
The Khronos group
51
OpenCL: Open Computing Language
52
Architecture independent
Explicit support for many-cores
Low-level host API
Uses C library, no language extensions
Separate high-level kernel language
Explicit support for vectorization
Run-time compilation
Architecture-dependent optimizations are still needed, but possible (thanks to run-time compilation)
Cuda vs OpenCL Terminology
53
CUDA → OpenCL
Thread → Work item
Thread block → Work group
Device memory → Global memory
Constant memory → Constant memory
Shared memory → Local memory
Local memory → Private memory
Cuda vs OpenCL Qualifiers
54
Functions
CUDA → OpenCL
__global__ → __kernel
__device__ → (no qualifier needed)
Variables
CUDA → OpenCL
__constant__ → __constant
__device__ → __global
__shared__ → __local
Cuda vs OpenCL Indexing
55
CUDA → OpenCL
gridDim → get_num_groups()
blockDim → get_local_size()
blockIdx → get_group_id()
threadIdx → get_local_id()
Calculate manually → get_global_id()
Calculate manually → get_global_size()
__syncthreads() → barrier()
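For the two "calculate manually" rows, the CUDA equivalents (x dimension) are:

// CUDA equivalent of get_global_id(0):
int global_id = blockIdx.x * blockDim.x + threadIdx.x;
// CUDA equivalent of get_global_size(0):
int global_size = gridDim.x * blockDim.x;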
Vector add: Cuda vs OpenCL kernel
56
// CUDA
__global__ void
vectorAdd(float* a, float* b, float* c) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    c[index] = a[index] + b[index];
}

// OpenCL
__kernel void
vectorAdd(__global float* a, __global float* b,
          __global float* c) {
    int index = get_global_id(0);
    c[index] = a[index] + b[index];
}
OpenCL VectorAdd host code (1)
57
const size_t workGroupSize = 256;
const size_t nrWorkGroups = 3;
const size_t totalSize = nrWorkGroups * workGroupSize;
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
// create properties list of key/values, 0-terminated.
cl_context_properties props[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
0
};
cl_context context = clCreateContextFromType(props,
CL_DEVICE_TYPE_GPU, 0, 0, 0);
OpenCL VectorAdd host code (2)
58
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1,
&device, NULL);
// create command queue on 1st device the context reported
cl_command_queue commandQueue =
clCreateCommandQueue(context, device, 0, 0);
// create & compile program
cl_program program = clCreateProgramWithSource(context, 1,
&programSource, 0, 0);
clBuildProgram(program, 0, 0, 0, 0, 0);
// create kernel
cl_kernel kernel = clCreateKernel(program, "vectorAdd",0);
OpenCL VectorAdd host code (3)
59
float *A = new float[totalSize], *B = new float[totalSize],
      *C = new float[totalSize];             // alloc host vecs
// initialize host memory here...
// allocate device memory
cl_mem deviceA = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
totalSize * sizeof(cl_float), A, 0);
cl_mem deviceB = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
totalSize * sizeof(cl_float), B, 0);
cl_mem deviceC = clCreateBuffer(context,
CL_MEM_WRITE_ONLY, totalSize * sizeof(cl_float), 0, 0);
OpenCL VectorAdd host code (4)
60
// setup parameter values
clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &deviceB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &deviceC);
clEnqueueNDRangeKernel(commandQueue, kernel, 1, 0,
&totalSize, &workGroupSize, 0,0,0); // execute kernel
// copy results from device back to host, blocking
clEnqueueReadBuffer(commandQueue, deviceC, CL_TRUE, 0,
totalSize * sizeof(cl_float), C, 0, 0, 0);
delete[] A; delete[] B; delete[] C; // cleanup
clReleaseMemObject(deviceA); clReleaseMemObject(deviceB);
clReleaseMemObject(deviceC);
61 Summary and Conclusions
Summary and conclusions
62
Higher performance cannot be reached by
increasing clock frequencies anymore
Solution: introduction of large-scale parallelism
Multiple cores on a chip
Today:
Up to 48 CPU cores in a node
Up to 3200 cores on a single GPU
Host system can contain multiple GPUs: 10,000+ cores
We can build clusters of these nodes!
Future: 100,000s – millions of cores?
Summary and conclusions
63
Many different types of many-core hardware
Very different properties
Performance
Programmability
Portability
It's all about the memory
Choose the right platform for your application
Arithmetic intensity / operational intensity
Roofline model
Summary and conclusions
64
Many different many-core programming models
Most models are hardware-induced, low-level
DMA, double buffering
Vectorization
Coalescing
Explicit cache (LS on Cell, shared memory on GPU)
Future
Cuda? OpenCL?
high-level models on top of OpenCL?
Many-cores are here to stay