PARALLEL PROGRAMMING
MANY-CORE COMPUTING:
ADVANCED CUDA (4/5)
Rob van Nieuwpoort
rob@cs.vu.nl
Schedule
2
1. Introduction, performance metrics & analysis
2. Many-core hardware, low-level optimizations
3. GPU hardware and Cuda class 1: basics
4. Cuda class 2: advanced; OpenCL
5. Case study: LOFAR telescope with many-cores
Grids, Thread Blocks and Threads
[Figure: a grid of 2 x 3 thread blocks, Thread Block (0,0) through Thread Block (1,2); each thread block contains a 3 x 4 array of threads indexed (row, column).]
Hardware Memory Spaces in CUDA
4
[Figure: each thread has its own registers; each thread block has its own shared memory; all blocks in the grid share the device memory and the constant memory, both of which are accessible from the host.]
Vector addition GPU code
5
// compute vector sum c = a + b
// each thread performs one pair-wise addition
__global__ void vector_add(float* A, float* B, float* C) {   // GPU code
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main() {                        // Host code (can be in the same file)
    // initialization code here ...
    // launch N/256 blocks of 256 threads each
    vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);
    // cleanup code here ...
}
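For completeness, a minimal sketch of the host-side setup that the comments above elide (assuming hostA, hostB, hostC are float arrays of length N on the host, and N is a multiple of 256):

// allocate device memory
float *deviceA, *deviceB, *deviceC;
cudaMalloc((void**)&deviceA, N * sizeof(float));
cudaMalloc((void**)&deviceB, N * sizeof(float));
cudaMalloc((void**)&deviceC, N * sizeof(float));
// copy the input vectors to the device
cudaMemcpy(deviceA, hostA, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(deviceB, hostB, N * sizeof(float), cudaMemcpyHostToDevice);
// launch the kernel as above
vector_add<<< N/256, 256 >>>(deviceA, deviceB, deviceC);
// copy the result back, then release device memory
cudaMemcpy(hostC, deviceC, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(deviceA); cudaFree(deviceB); cudaFree(deviceC);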
6
CUDA: Scheduling, Synchronization
and Atomics
Thread Scheduling
7
Order in which thread blocks are scheduled is
undefined!
any possible interleaving of blocks should be valid
presumed to run to completion without preemption
can run in any order
can run concurrently OR sequentially
Order of threads within a block is also undefined!
Global synchronization
11
Q: How do we do global synchronization with these
scheduling semantics?
A1: Not possible!
A2: Finish a grid, and start a new one!
step1<<<grid1,blk1>>>(...);
// CUDA ensures that all writes from step1 are complete.
step2<<<grid2,blk2>>>(...);
We don't have to copy the data back and forth!
Atomics
12
Guarantee that only a single thread has access to a
piece of memory during an operation
No dropped data, but ordering is still arbitrary
Different types of atomic instructions
Atomic Add, Sub, Exch, Min, Max, Inc, Dec, CAS, And,
Or, Xor
Can be done on device memory and shared memory
Much more expensive than load + operation + store
Example: Histogram
13
// Determine frequency of colors in a picture.
// Colors have already been converted into integers
// between 0 and 255.
// Each thread looks at one pixel,
// and increments a counter
__global__ void histogram(int* colors, int* buckets)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int c = colors[i];
buckets[c] += 1; // RACE: several threads may increment the same bucket at once, so updates can be lost
}
Example: Histogram
15
// Determine frequency of colors in a picture.
// Colors have already been converted into integers
// between 0 and 255.
// Each thread looks at one pixel,
// and increments a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int c = colors[i];
atomicAdd(&buckets[c], 1);
}
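As the Atomics slide notes, atomics also work on shared memory. A common refinement (a sketch, not from the slides; the kernel name and structure are mine) is to build a per-block histogram in shared memory, where the atomic traffic stays on-chip, and merge it into the global buckets once per block:

__global__ void histogram_smem(int* colors, int* buckets)
{
    __shared__ int s_buckets[256];           // one counter per color value
    // cooperatively clear the per-block histogram
    for (int c = threadIdx.x; c < 256; c += blockDim.x)
        s_buckets[c] = 0;
    __syncthreads();

    int i = threadIdx.x + blockDim.x * blockIdx.x;
    atomicAdd(&s_buckets[colors[i]], 1);     // atomic on shared memory
    __syncthreads();

    // merge this block's histogram into the global buckets
    for (int c = threadIdx.x; c < 256; c += blockDim.x)
        atomicAdd(&buckets[c], s_buckets[c]);
}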
Example: Work queue
16
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense to
// continuously grab work from a queue.
__global__
void workq(int* work_q, unsigned int* q_counter,
           unsigned int queue_max, int* output)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    // atomicInc returns the old counter value and wraps it back
    // to 0 once it reaches queue_max
    unsigned int q_index = atomicInc(q_counter, queue_max);
    int result = do_work(work_q[q_index]);   // do_work: defined elsewhere
    output[i] = result;
}
17 CUDA: optimizing your application
Coalescing
Coalescing
18
Consider the stride of your accesses
19
__global__ void foo(int* input, float3* input2) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
// Stride 1, OK!
int a = input[i];
// Stride 2, half the bandwidth is wasted
int b = input[2*i];
// Stride 3, 2/3 of the bandwidth wasted
float c = input2[i].x;
}
Example: Array of Structures (AoS)
20
struct record {
int key;
int value;
int flag;
};
record *d_records;
cudaMalloc((void**)&d_records, ...);
Example: Structure of Arrays (SoA)
21
struct SoA {
int* keys;
int* values;
int* flags;
};
SoA d_SoA_data;
cudaMalloc((void**)&d_SoA_data.keys, ...);
cudaMalloc((void**)&d_SoA_data.values, ...);
cudaMalloc((void**)&d_SoA_data.flags, ...);
Example: SoA vs AoS
22
__global__ void bar(record* AoS_data,
SoA SoA_data) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
// AoS wastes bandwidth
int key1 = AoS_data[i].key;
// SoA efficient use of bandwidth
int key2 = SoA_data.keys[i];
}
Memory Coalescing
23
Structure of arrays is often better than array of
structures
Very clear win on regular, stride 1 access patterns
Unpredictable or irregular access patterns are
case-by-case
Can lose a factor of 10 – 30!
24 CUDA: optimizing your application
Shared Memory
Using shared memory
27
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input) {
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i > 0) {
        // each thread loads two elements from device memory;
        // note that the next thread also reads input[i]
        int x_i = input[i];
        int x_i_minus_one = input[i-1];
        result[i] = x_i - x_i_minus_one;
    }
}
Using shared memory: coalescing
29
__global__ void adj_diff(int *result, int *input) {
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ int s_data[BLOCK_SIZE];  // shared, 1 elt / thread
    // each thread reads 1 device memory elt, stores it in s_data
    s_data[threadIdx.x] = input[i];     // COALESCED ACCESS!
    // avoid race condition: ensure all loads are complete
    __syncthreads();
    if(threadIdx.x > 0) {
        result[i] = s_data[threadIdx.x] - s_data[threadIdx.x-1];
    } else if(i > 0) {
        // I am thread 0 in this block: handle thread block boundary
        result[i] = s_data[threadIdx.x] - input[i-1];
    }
}
A Common Programming Strategy
30
Partition data into subsets that fit into shared
memory
A Common Programming Strategy
31
Handle each data subset with one thread block
A Common Programming Strategy
32
Load the subset from device memory to shared
memory, using multiple threads to exploit memory-
level parallelism
A Common Programming Strategy
33
Perform the computation on the subset from shared
memory
A Common Programming Strategy
34
Copy the result from shared memory back to device
memory
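The adj_diff kernel above already follows this pattern. In generic form it looks like the sketch below (my sketch, assuming a 1D problem, BLOCK_SIZE threads per block, and a hypothetical __device__ function f that only needs data from its own tile):

__global__ void process(float* output, const float* input)
{
    __shared__ float tile[BLOCK_SIZE];
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    // load this block's subset into shared memory, one element per
    // thread, so the device-memory reads are coalesced
    tile[threadIdx.x] = input[i];
    __syncthreads();                // wait until the whole tile is loaded

    // compute on the subset from shared memory
    float result = f(tile, threadIdx.x);

    // copy the result back to device memory
    output[i] = result;
}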
35 CUDA: optimizing your application
Optimizing Occupancy
Thread Scheduling
36
SM implements zero-overhead warp scheduling
A warp is a group of 32 threads that runs concurrently on an SM
At any time, only one of the warps is executed by an SM
Warps whose next instruction has its inputs ready for consumption
are eligible for execution
Eligible warps are selected for execution based on a prioritized
scheduling policy
All threads in a warp execute the same instruction when selected
[Figure: execution timeline. Whenever the current warp stalls (TB1 W1, then TB2 W1, then TB3 W2), the scheduler immediately issues instructions from another eligible warp, so warps from thread blocks TB1, TB2 and TB3 are interleaved over time. TB = Thread Block, W = Warp]
Stalling warps
37
What happens if all warps are stalled?
No instruction issued → performance lost
Most common reason for stalling?
Waiting on global memory
If your code reads global memory every couple of
instructions
You should try to maximize occupancy
Occupancy
38
What determines occupancy?
Limited resources!
Register usage per thread
Shared memory per thread block
Resource Limits (1)
39
[Figure: the register files and shared memories of two SMs being filled by thread blocks TB 0, TB 1, TB 2; whichever resource is exhausted first determines how many blocks fit on an SM.]
Pool of registers and shared memory per SM
Each thread block grabs registers & shared memory
If one or the other is fully utilized, no more thread blocks can be scheduled on that SM
Resource Limits (2)
40
Can only have 8 thread blocks per SM
If they're too small, they can't fill up the SM
Need 128 threads / block on GT200 (4 cycles/instruction)
Need 192 threads / block on Fermi (6 cycles/instruction)
Higher occupancy has diminishing returns for hiding
latency
Hiding Latency with more threads
41
How do you know what you’re using?
42
Use “nvcc -Xptxas -v” to get register and
shared memory usage
Plug those numbers into CUDA Occupancy
Calculator
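Roughly, the calculation the Occupancy Calculator performs with those numbers looks like this (a sketch with made-up per-SM limits; the real limits depend on the GPU generation):

// hypothetical per-SM budget (example values only):
// 16384 registers, 16 KB shared memory, at most 8 resident blocks
int threadsPerBlock = 256;
int regsPerThread   = 16;     // reported by "nvcc -Xptxas -v"
int shmemPerBlock   = 4096;   // bytes, reported by "nvcc -Xptxas -v"

int blocksByRegs  = 16384 / (regsPerThread * threadsPerBlock);  // = 4
int blocksByShmem = (16 * 1024) / shmemPerBlock;                // = 4
int blocksBySlots = 8;                                          // hardware limit

// the tightest limit determines how many blocks are resident per SM
int residentBlocks = blocksByRegs;
if (blocksByShmem < residentBlocks) residentBlocks = blocksByShmem;
if (blocksBySlots < residentBlocks) residentBlocks = blocksBySlots;

int residentThreads = residentBlocks * threadsPerBlock;  // = 1024 threads per SM
// occupancy = residentThreads divided by the SM's maximum number of resident threads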
44 CUDA: optimizing your application
Shared memory bank conflicts
Shared Memory Banks
45
Shared memory is banked
Only matters for threads within a warp
Full performance with some restrictions
Threads can each access different banks
Or can all access the same value
Consecutive words are in different banks
If two or more threads access different words in the same bank,
we get bank conflicts
Bank Addressing Examples: OK
46
No Bank Conflicts (both cases)
[Figure: two access patterns in which every thread of the (half-)warp maps to a distinct bank, e.g. thread 0 to bank 0, thread 1 to bank 1, ..., thread 15 to bank 15.]
Bank Addressing Examples: BAD
47
2-way Bank Conflicts (left) and 8-way Bank Conflicts (right)
[Figure: left, pairs of threads map to the same bank; right, groups of eight threads (x8) map to the same bank, so their accesses are serialized.]
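In code, the patterns in the two figures look roughly as follows (a sketch; the kernel name, BLOCK_SIZE and the stride-2 example are mine, and the exact number of banks depends on the architecture):

__global__ void bank_examples(float* out)
{
    __shared__ float s_data[2 * BLOCK_SIZE];
    s_data[threadIdx.x] = threadIdx.x;               // fill the array (stride 1)
    s_data[threadIdx.x + BLOCK_SIZE] = 0.0f;
    __syncthreads();

    float a = s_data[threadIdx.x];     // stride 1: no bank conflicts
    float b = s_data[0];               // same word for all threads: broadcast, no conflicts
    float c = s_data[2 * threadIdx.x]; // stride 2: 2-way bank conflicts, accesses serialized
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c;
}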
Trick to Assess Performance Impact
48
Change all shared memory reads to the same value
All broadcasts = no conflicts
Will show how much performance could be
improved by eliminating bank conflicts
The same doesn’t work for shared memory writes
So, replace shared memory array indices with threadIdx.x
(Could also be done for the reads)
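Applied to some hypothetical kernel whose shared memory accesses go through an index array (index[], the kernel name and structure are mine), the trick looks like this; the modified kernel computes a wrong answer, but the timing gap versus the original shows how much the bank conflicts cost:

__global__ void conflict_probe(float* out, const int* index)
{
    __shared__ float s_data[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = (float)i;
    __syncthreads();

    // original read -- may have bank conflicts, depending on index[i]:
    //     float x = s_data[index[i]];
    // diagnostic read -- every thread loads the same word (a broadcast),
    // so no read conflicts are possible:
    float x = s_data[0];

    // original write -- may have bank conflicts:
    //     s_data[index[i]] = x;
    // diagnostic write -- each thread writes to its own bank:
    s_data[threadIdx.x] = x;
    __syncthreads();

    out[i] = s_data[threadIdx.x];
}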
49 Generic programming models
OpenCL
Portability
50
Inter-family vs inter-vendor
NVIDIA Cuda runs on all NVIDIA GPU families
OpenCL runs on all GPUs, Cell, CPUs
Parallelism portability
Different architectures require different granularity
Task vs data parallel
Performance portability
Can we express platform-specific optimizations?
The Khronos group
51
OpenCL: Open Computing Language
52
Architecture independent
Explicit support for many-cores
Low-level host API
Uses C library, no language extensions
Separate high-level kernel language
Explicit support for vectorization
Run-time compilation
Architecture-dependent optimizations are still needed, but possible (thanks to run-time compilation)
Cuda vs OpenCL Terminology
53
CUDA → OpenCL
Thread → Work item
Thread block → Work group
Device memory → Global memory
Constant memory → Constant memory
Shared memory → Local memory
Local memory → Private memory
Cuda vs OpenCL Qualifiers
54
Functions
CUDA → OpenCL
__global__ → __kernel
__device__ → (no qualifier needed)
Variables
CUDA → OpenCL
__constant__ → __constant
__device__ → __global
__shared__ → __local
Cuda vs OpenCL Indexing
55
CUDA → OpenCL
gridDim → get_num_groups()
blockDim → get_local_size()
blockIdx → get_group_id()
threadIdx → get_local_id()
Calculate manually → get_global_id()
Calculate manually → get_global_size()
__syncthreads() → barrier()
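For the two "calculate manually" rows, the CUDA equivalents (x dimension) are:

// CUDA equivalent of get_global_id(0):
int global_id = blockIdx.x * blockDim.x + threadIdx.x;
// CUDA equivalent of get_global_size(0):
int global_size = gridDim.x * blockDim.x;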
Vector add: Cuda vs OpenCL kernel
56
// CUDA
__global__ void
vectorAdd(float* a, float* b, float* c) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    c[index] = a[index] + b[index];
}

// OpenCL
__kernel void
vectorAdd(__global float* a, __global float* b,
          __global float* c) {
    int index = get_global_id(0);
    c[index] = a[index] + b[index];
}
OpenCL VectorAdd host code (1)
57
const size_t workGroupSize = 256;
const size_t nrWorkGroups = 3;
const size_t totalSize = nrWorkGroups * workGroupSize;
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
// create properties list of key/values, 0-terminated.
cl_context_properties props[] = {
CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
0
};
cl_context context = clCreateContextFromType(props,
CL_DEVICE_TYPE_GPU, 0, 0, 0);
OpenCL VectorAdd host code (2)
58
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1,
&device, NULL);
// create command queue on 1st device the context reported
cl_command_queue commandQueue =
clCreateCommandQueue(context, device, 0, 0);
// create & compile program
cl_program program = clCreateProgramWithSource(context, 1,
&programSource, 0, 0);
clBuildProgram(program, 0, 0, 0, 0, 0);
// create kernel
cl_kernel kernel = clCreateKernel(program, "vectorAdd",0);
OpenCL VectorAdd host code (3)
59
float *A = new float[totalSize], *B = new float[totalSize],
      *C = new float[totalSize];             // alloc host vecs
// initialize host memory here...
// allocate device memory
cl_mem deviceA = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
totalSize * sizeof(cl_float), A, 0);
cl_mem deviceB = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
totalSize * sizeof(cl_float), B, 0);
cl_mem deviceC = clCreateBuffer(context,
CL_MEM_WRITE_ONLY, totalSize * sizeof(cl_float), 0, 0);
OpenCL VectorAdd host code (4)
60
// setup parameter values
clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &deviceB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &deviceC);
clEnqueueNDRangeKernel(commandQueue, kernel, 1, 0,
&totalSize, &workGroupSize, 0,0,0); // execute kernel
// copy results from device back to host, blocking
clEnqueueReadBuffer(commandQueue, deviceC, CL_TRUE, 0,
totalSize * sizeof(cl_float), C, 0, 0, 0);
delete[] A; delete[] B; delete[] C; // cleanup
clReleaseMemObject(deviceA); clReleaseMemObject(deviceB);
clReleaseMemObject(deviceC);
61 Summary and Conclusions
Summary and conclusions
62
Higher performance cannot be reached by
increasing clock frequencies anymore
Solution: introduction of large-scale parallelism
Multiple cores on a chip
Today:
Up to 48 CPU cores in a node
Up to 3200 cores on a single GPU
Host system can contain multiple GPUs: 10,000+ cores
We can build clusters of these nodes!
Future: 100,000s – millions of cores?
Summary and conclusions
63
Many different types of many-core hardware
Very different properties
Performance
Programmability
Portability
It's all about the memory
Choose the right platform for your application
Arithmetic intensity / operational intensity
Roofline model
Summary and conclusions
64
Many different many-core programming models
Most models are hardware-induced, low-level
DMA, double buffering
Vectorization
Coalescing
Explicit cache (LS on Cell, shared memory on GPU)
Future
Cuda? OpenCL?
high-level models on top of OpenCL?
Many-cores are here to stay