GTC S62191
Optimization
Athena Elafrou, Guillaume Thomas Collignon, NVIDIA DevTech Compute
GPU Technology Conference, March 18th 2024
1
Agenda
• GPU Architecture and CUDA Programming Basics
• Fundamental Performance Optimizations
• Maximizing Memory Throughput
• Atomics
• Summary
2
GPU Architecture and CUDA
Programming Basics
3
GPU Overview
NVIDIA H200 SXM
• 132 SMs with 4th Gen Tensor Cores
• 50 MB L2 cache
• 141 GB HBM3e, 4.8 TB/s
• PCIe Gen 5, 128 GB/s bidirectional
• 4th Gen NVLink, 900 GB/s bidirectional
4
Streaming Multiprocessor (SM)
Hopper architecture
• 32 LD/ST units
• 64K 32-bit registers
• 256 KiB unified L1 data cache and shared memory
• Tensor Memory Accelerator (TMA)
5
SIMT Architecture
Single-Instruction, Multiple-Thread
• Akin to a single-instruction, multiple-data (SIMD) array processor per Flynn’s taxonomy, combined with fine-grained multithreading.
• SIMT architectures expose a large set of hardware threads, which is partitioned into groups called warps (Warp 0 … Warp N, available for scheduling).
• Interleave warp execution to hide latencies.
• Execution context for each warp is kept on-chip for fast interleaving.
6
SIMT Architecture
Warp divergence
[Figure: warp divergence — execution of thread IDs 0–7 over time.]
7
SIMT Architecture
Independent Thread Scheduling
• Individual threads in a warp have their own program counter and call stack and are therefore free to execute independently.

if (thread_id < 4) {
    A;
    B;
} else {
    X;
    sync
    Y;
}
Z;

[Figure: execution timeline — the two sides of the warp interleave (X; Y; … A; B;) and each thread reaches Z independently.]
8
CUDA Programming Model
Single-Program Multiple-Data
9
Thread Hierarchy
CUDA/Software ↔ Hardware
• A CUDA kernel is launched on a grid of thread blocks, which are completely independent; the grid executes on the device.
• Individual threads execute on scalar CUDA cores.
10
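As an illustration (a minimal sketch with a hypothetical kernel name and sizes, not taken from the slides), launching a grid of thread blocks looks like this:

__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // unique index within the grid
    if (idx < n)
        data[idx] *= factor;
}

// Host side: 256 threads per block, enough blocks to cover all n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);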
Thread Block Clusters
• For Hopper GPUs, CUDA introduced an optional level in the thread hierarchy called Thread Block Clusters.
• Thread blocks in a cluster are guaranteed to be concurrently scheduled and enable efficient cooperation and data
sharing for threads across multiple SMs.
• For more information on this topic visit GTC session [S62192]: “Advanced Performance Optimization in CUDA”.
[Figure: a grid of thread blocks without clusters vs. a grid whose thread blocks are grouped into thread block clusters.]
11
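As a minimal sketch (assuming Hopper and CUDA 12.x; the kernel name and cluster shape are placeholders), a compile-time cluster can be declared with the __cluster_dims__ attribute and synchronized through Cooperative Groups:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// A cluster of 2 thread blocks in the x dimension.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *data)
{
    cg::cluster_group cluster = cg::this_cluster();
    // ... cooperate across the two blocks, e.g., via distributed shared memory ...
    cluster.sync();   // barrier across all thread blocks in the cluster
}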
Thread Hierarchy
What about warps?
• Each thread block is partitioned into warps of 32 threads, and the block’s warps are scheduled on a single SM.
12
Thread Hierarchy
Thread block sizing
13
Memory Hierarchy
Hardware ↔ CUDA/Software
• Per-thread registers: lowest possible latency.
• Per-thread local memory: private storage; resides in device memory, so it has the slowest access.
• Per-block shared memory: lives in the SM’s unified shared/L1 storage.
• Global memory (DRAM): visible by all threads in a grid; slowest access.
15
Synchronization
Barriers
• Grid boundary:
  • Kernel completion.
  • grid_group::sync() via the Cooperative Groups API; requires the kernel to be launched via the cudaLaunchCooperativeKernel() API.
  • Slow! Avoid unless necessary.
• Warp or sub-warp boundary (warp = 32 threads):
  • __syncwarp()
  • coalesced_group::sync() via the Cooperative Groups API.
  • Very fast!
16
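For illustration (a minimal sketch with a hypothetical kernel; a block size of 256 is assumed), typical block- and warp-level barriers look like this:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void barrier_example(float *out)
{
    __shared__ float tile[256];
    auto block = cg::this_thread_block();

    tile[threadIdx.x] = (float)threadIdx.x;
    block.sync();                     // block-wide barrier (same effect as __syncthreads())

    float v = tile[(threadIdx.x + 1) % blockDim.x];
    __syncwarp();                     // warp-level barrier: very cheap
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}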
Atomics
Memory spaces
• Read-modify-write operations on 16-, 32-, 64-, or 128-bit words.
• Available as CUDA primitives or as C++ atomics through the libcu++ extended API.
• Operate on data in shared memory or global memory.
17
Thread Scopes
• To account for non-uniform thread synchronization costs, CUDA has introduced the notion of thread scopes (e.g., cuda::thread_scope_thread, thread_scope_block, thread_scope_device, thread_scope_system).
18
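To make the idea concrete, here is a minimal sketch (assuming a recent libcu++; kernel and variable names are illustrative) that performs the same update at block and device scope with cuda::atomic_ref:

#include <cuda/atomic>

__global__ void scoped_counters(int *global_counter)
{
    __shared__ int block_counter;
    if (threadIdx.x == 0) block_counter = 0;
    __syncthreads();

    // Block scope: only threads of this block participate, so the cheaper scope suffices.
    cuda::atomic_ref<int, cuda::thread_scope_block> blk(block_counter);
    blk.fetch_add(1, cuda::memory_order_relaxed);

    __syncthreads();
    if (threadIdx.x == 0) {
        // Device scope: the result must be visible to every thread of the grid.
        cuda::atomic_ref<int, cuda::thread_scope_device> dev(*global_counter);
        dev.fetch_add(block_counter, cuda::memory_order_relaxed);
    }
}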
Fundamental Performance
Optimizations
19
Little’s Law
For escalators
• 20 steps tall
• Latency = 40 seconds
• Throughput = 1 step per 2 seconds = 0.5 persons/second
• People in flight to keep the escalator full = latency × throughput = 40 s × 0.5/s = 20
21
Little’s Law
For GPUs
• Concurrency needed to saturate a pipeline = latency × throughput.
• For memory: bytes in flight needed to saturate DRAM = memory latency × memory bandwidth.
22
Hiding Latencies
Increasing in-flight instructions
• Within a single thread, only independent instructions can be in flight at the same time (a chain like x = x + a; y = x + a; z = y + a must issue one after another): instruction-level parallelism (ILP).
• Across threads (Thread 0: x = x + a, x = x + b; Thread 1: y = y + a, y = y + b), instructions are independent by construction: thread-level parallelism (TLP).
23
Instruction Issue
• Assumptions:
  • LDG/STG: dependent issue rate = 1000 cycles, issue rate = 1 cycle
  • FP32 pipeline: dependent issue rate = 4 cycles, issue rate = 2 cycles
  • 1 available warp per scheduler

__global__
void kernel(const float * __restrict__ a,
            const float * __restrict__ b,
            float * __restrict__ c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx] += a[idx] * b[idx];
}

Per-thread instruction stream: load a, load b, load c (12 bytes in-flight), fma c, a, b, store c.

Cycle N: LDG, N+1: LDG, N+2: LDG, (stall), N+1002: FFMA, (stall), N+1006: STG
24
Increasing ILP
Computing 2 elements per thread – version #1
• Every thread computes 2 elements using a grid stride.

__global__
void kernel(const float * __restrict__ a,
            const float * __restrict__ b,
            float * __restrict__ c)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    #pragma unroll 2
    for (int i = 0; i < 2; i++) {
        const int idx = tid + i * stride;
        c[idx] += a[idx] * b[idx];
    }
}

Cycle N: LDG, N+1: LDG, N+2: LDG, N+3: LDG, N+4: LDG, (stall), N+1002: FFMA, (stall), N+1006: STG
25
Increasing ILP
Computing 2 elements per thread – version #2
• Every thread computes 2 elements using a constant block stride.

#define THREAD_BLOCK_DIM 128
__global__
void kernel(const float * __restrict__ a,
            const float * __restrict__ b,
            float * __restrict__ c)
{
    int off = 2 * THREAD_BLOCK_DIM * blockIdx.x + threadIdx.x;
    #pragma unroll 2
    for (int i = 0; i < 2; i++) {
        const int idx = off + i * THREAD_BLOCK_DIM;
        c[idx] += a[idx] * b[idx];
    }
}

Per-thread instruction stream: load a[i1], load b[i1], load c[i1], load a[i2], load b[i2], load c[i2] (24 bytes in-flight), fma c[i1], a[i1], b[i1], fma c[i2], a[i2], b[i2], store c[i1], store c[i2] (8 bytes in-flight).

Cycle N…N+5: LDG ×6, (stall), N+1002: FFMA, N+1004: FFMA, N+1006: STG, N+1008: STG

Total cycles = 1008
2x the amount of work in the ~same number of cycles!
26
Warp Scheduling
Hopper SM
27
Warp Scheduling
Mental model
• Warps occupying scheduler slots are considered active.
• Stalled: waiting on an instruction fetch, a memory dependency, an execution dependency, or a synchronization barrier.
• Eligible: active warp that is not stalled and is ready to issue an instruction.
• Selected: eligible warp that is selected to issue an instruction on that cycle.
• Each cycle, out of all eligible warps, the scheduler selects one to issue.
28
Warp Scheduling
Mental model
• Having more active warps would help reduce idle issue slots and hide the latencies of stalled warps.
36
How to Increase Active Warps?
Occupancy
• Occupancy = active warps per SM / maximum warps per SM supported by the hardware.
37
Occupancy Limiters
Registers
• Example:
• Kernel uses 63 registers per thread
• Registers per warp = 63 * 32 = 2016
• Registers allocated per warp = 2048
• Achievable active warps per SM = 65536 / 2048 = 32
• Occupancy = 32 / 64 * 100 = 50%
• Hopper supports up to 64 warps per SM
38
Occupancy Limiters
Shared memory
• Example:
• Kernel uses 17408 bytes of shared memory per 128-thread block.
• Blocks per SM = 233472 / (17408 + 1024) = 12.66 (1024 bytes of shared memory are reserved per block)
• Achievable active warps per SM = 12 * 128 / 32 = 48
• Occupancy = 48 / 64 * 100 = 75%
• Hopper supports up to 64 warps per SM.
39
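These figures can also be obtained programmatically; a sketch using the CUDA occupancy API (the kernel name and the block/shared-memory sizes below are placeholders, not from the slides):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data) { }   // hypothetical kernel

void report_occupancy()
{
    int device = 0, max_threads_per_sm = 0;
    cudaDeviceGetAttribute(&max_threads_per_sm, cudaDevAttrMaxThreadsPerMultiProcessor, device);
    int max_warps_per_sm = max_threads_per_sm / 32;

    const int block_size = 128;
    const size_t dyn_smem = 17408;            // dynamic shared memory per block
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel, block_size, dyn_smem);

    int active_warps = blocks_per_sm * block_size / 32;
    printf("Achievable occupancy: %.0f%%\n", 100.0 * active_warps / max_warps_per_sm);
}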
Occupancy Limiters
Thread block size
• Example: Hopper supports at most 32 resident thread blocks per SM, so 32-thread blocks allow at most 32 active warps → occupancy capped at 32 / 64 = 50%.
40
ILP vs TLP for Hiding Latencies
Computing c = c + a * b
• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Datatype = float
• Baseline thread block size = 32 (50% occupancy)
• Experiment #1: increase occupancy
• Thread block size = 64 (100% occupancy)
• Experiment #2: increase ILP by computing more elements per thread
• Elements per thread = 2, 4
Implementation | Elements per Thread | Thread Block Size | Main Memory Bandwidth Utilization (%) | SM Occupancy (%) | GPU Time (ms)
Baseline       | 1 | 32 | 25 | 50  | 5.0
Experiment #1  | 1 | 64 | 51 | 100 | 2.5
Experiment #2  | 2 | 32 | 51 | 50  | 2.5
Experiment #2  | 4 | 32 | 82 | 50  | 1.6
41
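The two-element kernels above generalize naturally; a sketch (the template parameters are illustrative knobs, not something prescribed by the slides) of an N-elements-per-thread, block-stride version:

template <int ELEMS, int BLOCK_DIM>
__global__ void kernel_ilp(const float * __restrict__ a,
                           const float * __restrict__ b,
                           float * __restrict__ c)
{
    int off = ELEMS * BLOCK_DIM * blockIdx.x + threadIdx.x;
    #pragma unroll
    for (int i = 0; i < ELEMS; i++) {
        const int idx = off + i * BLOCK_DIM;   // constant, compile-time stride
        c[idx] += a[idx] * b[idx];
    }
}

// Example launch for Experiment #2 with 4 elements per thread and 32-thread blocks:
// kernel_ilp<4, 32><<<n / (4 * 32), 32>>>(a, b, c);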
What Occupancy Do I Need?
General guidelines
• More registers per thread and more shared memory per block go hand in hand with fewer threads per SM; lighter per-thread resource usage allows (and typically requires) more threads per SM to hide latency.
42
Maximizing Memory Throughput
43
Memory Hierarchy
NVIDIA H200 SXM
44
Why Do GPUs Have Caches?
• Hundreds to thousands of threads share the L1, and hundreds of thousands of threads share the L2.
• L1, L2 capacity per thread is relatively small.
45
Memory Transactions
Cache lines and sectors
• A cache line is 128 bytes, divided into four 32-byte sectors; data moves through the memory hierarchy at sector granularity.
46
Memory Reads & Writes
Reads:
• Check if the data is in L1 (if not, check L2).
• Check if the data is in L2 (if not, fetch it from DRAM).
• Unit of data moved: a full sector.
Writes:
• L1 is write-through: update both L1 and L2.
47
Global Memory Access Patterns
Aligned and sequential
• 4-byte element access by the 32 threads of a warp to consecutive, aligned addresses → 4 sectors. COALESCED!
48
Global Memory Access Patterns
Aligned and sequential
• 8-byte element access by the 32 threads of a warp to consecutive, aligned addresses → 8 sectors. COALESCED!
49
Global Memory Access Patterns
Aligned and non-sequential
• 4-byte element access, aligned but permuted within the warp → 4 sectors. COALESCED!
50
Global Memory Access Patterns
Mis-aligned and sequential
• 4-byte element access, sequential but shifted off the sector boundary → 5 sectors.
51
Global Memory Access Patterns
Mis-aligned and sequential
• The same mis-aligned, sequential 4-byte access shown for two adjacent warps → 5 sectors per warp.
52
Global Memory Access Patterns
Same address
• 4-byte element access with all 32 threads of the warp reading the same address → 1 sector.
53
Global Memory Access Patterns
Aligned and strided
• 4-byte element access, aligned but strided so that each thread falls in a different sector → 32 sectors.
54
Impact of Data Layout
Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)
struct Coefficients
{
    float u, v, w;
    float x[8], y[8], z;
};

AoS memory layout: one 80-byte Coefficients record per element, stored contiguously.
55
Impact of Data Layout
Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)
• When loading coefficients u and y[0]:
  • Successive threads in a warp read 4 bytes at an 80-byte stride.

struct Coefficients
{
    float u, v, w;
    float x[8], y[8], z;
};

AoS layout in memory: u … y0 … z | u … y0 … z | u … y0 … z | …  (records start at byte offsets 0, 80, 160, 240, …)
56
Impact of Data Layout
Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)
• When loading coefficients u and y[0]:
  • Successive threads in a warp read 4 bytes at an 80-byte stride.
  • We are reading 7x more bytes than necessary! Remember, data is read in sectors of 32 bytes.
  • No potential reuse of the sectors loaded by the previous access.

struct Coefficients
{
    float u, v, w;
    float x[8], y[8], z;
};

__global__ void kernel(Coefficients *data)
{
    int i = cg::this_grid().thread_rank();
    data[i].u = data[i].u + 10.f;
    data[i].y[0] = data[i].y[0] + 10.f;
}

The y[0] loads of threads T0–T3 land at byte offsets 44, 124, 204, 284, … — one sector per thread.
57
Impact of Data Layout
Array-of-Structures vs Structure-of-Arrays
• Refactoring from AoS to SoA leads to coalesced memory accesses for u and y[0].

struct Coefficients
{
    float *u, *v, *w;
    float *x0, …, *x7, *y0, …, *y7, *z;
};

SoA layout in memory: u u u u … | y0 y0 y0 y0 … — a warp’s 32 consecutive 4-byte loads fall within 128 contiguous bytes.
58
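A sketch of the corresponding SoA kernel (assuming the struct members are device pointers and that y0 is the first y array, as the slide’s elided declaration suggests):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

struct CoefficientsSoA { float *u, *y0; /* ...remaining arrays... */ };

__global__ void kernel_soa(CoefficientsSoA data)   // struct of pointers, passed by value
{
    int i = cg::this_grid().thread_rank();
    data.u[i]  = data.u[i]  + 10.f;     // consecutive threads touch consecutive floats:
    data.y0[i] = data.y0[i] + 10.f;     // fully coalesced loads and stores
}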
Impact of Data Layout
Performance Analysis
• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Thread block size = 256
Implementation | Load Efficiency (%) | Store Efficiency (%) | Main Memory Bandwidth Utilization (%) | GPU Time (ms)
AoS            | 12.5 | 12.5 | 13.50 | 28.497
SoA            | 100  | 100  | 79.47 | 4.836
60
Unified L1 and Shared Memory
• Can be used as a typical hardware-managed cache (L1) and/or a user-managed memory (shared memory).
• An application can configure its preferred split at runtime using cudaFuncSetAttribute() with the attribute cudaFuncAttributePreferredSharedMemoryCarveout.
61
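A sketch of setting the carveout (the kernel name and the percentage are illustrative):

// Ask for roughly half of the unified storage to be usable as shared memory.
cudaFuncSetAttribute(my_kernel,
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     50 /* percent */);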
Shared Memory
Capacity:
• Default 48 KiB per thread block; opt in to more using cudaFuncSetAttribute() with the attribute cudaFuncAttributeMaxDynamicSharedMemorySize.
• Up to 227 KiB per thread block on Hopper.
Organization:
• Divided into 32 banks, each 4 bytes wide.
• Successive 4-byte words map to successive banks.
• Bank index calculation examples:
  • (4-byte word index) % 32
  • (byte address / 4) % 32
Performance:
• Slower than registers, but much faster than global memory.
62
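A sketch of the dynamic shared memory opt-in (kernel name, launch configuration, and size are illustrative):

// Request more than the default 48 KiB of dynamic shared memory per block.
int smem_bytes = 100 * 1024;
cudaFuncSetAttribute(my_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes);
my_kernel<<<grid_dim, block_dim, smem_bytes>>>(/* kernel arguments */);   // pass the dynamic size at launch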
Logical View of Shared Memory Banks
4-byte data
63
Processing Data Types of Different Sizes
64
Shared Memory Access Patterns
Bank conflicts
• Bank conflicts occur when threads in the same memory access phase request different 4-byte words that map to the same bank (accesses to the same word are broadcast and do not conflict).
65
Bank Conflicts
Example
• idx := threadIdx.x*32 + threadIdx.y
• 32-way bank conflicts

[Figure: thread 2’s accesses (2,0)…(2,31) and thread 3’s accesses (3,0)…(3,31) shown against banks 2 and 3.]
66
Resolving Bank Conflicts
Padding
• idx := threadIdx.x*33 + threadIdx.y
• No conflicts!
• Costs extra shared memory (one padding word per row).

[Figure: with the padded stride of 33, the same accesses now spread across different banks.]
67
Resolving Bank Conflicts
Swizzling
• idx := threadIdx.x*32 + (threadIdx.y ^ threadIdx.x)
• No conflicts!
• No shared memory wasted!

[Figure: the XOR-swizzled accesses spread across different banks.]
68
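A sketch putting both techniques side by side (assumes a 32×32 thread block; the kernel itself is illustrative):

__global__ void bank_conflict_demo(float *out)
{
    // Padding: a leading dimension of 33 shifts each row by one bank, so the
    // warp's accesses hit 32 different banks (at the cost of one extra column).
    __shared__ float padded[32][33];
    padded[threadIdx.x][threadIdx.y] = (float)threadIdx.x;
    __syncthreads();

    // Swizzling: XOR the column index with the row index; same footprint, no padding.
    __shared__ float swizzled[32][32];
    swizzled[threadIdx.x][threadIdx.y ^ threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    out[threadIdx.y * 32 + threadIdx.x] =
        padded[threadIdx.x][threadIdx.y] +
        swizzled[threadIdx.x][threadIdx.y ^ threadIdx.x];
}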
Vectorized Memory Accesses
Multi-word as well as multi-thread
• The warp still issues contiguous, aligned memory accesses, but each thread moves multiple words per instruction (e.g., float2/float4 instead of float).
69
Vectorized Memory Accesses
Multi-thread, multi-word
72
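A sketch of the float4 variant of the earlier c += a * b kernel (assumes the element count is a multiple of 4 and the pointers are 16-byte aligned):

__global__ void kernel_vec4(const float4 * __restrict__ a,
                            const float4 * __restrict__ b,
                            float4 * __restrict__ c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float4 va = a[idx], vb = b[idx], vc = c[idx];   // three 128-bit loads per thread
    vc.x += va.x * vb.x;
    vc.y += va.y * vb.y;
    vc.z += va.z * vb.z;
    vc.w += va.w * vb.w;
    c[idx] = vc;                                    // one 128-bit store per thread
}

// Launch with n/4 total threads, e.g., after reinterpret_cast<const float4*>(a), etc.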
Vectorized Memory Accesses
Performance Analysis
• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Thread block size = 256
Implementation | Main Memory Bandwidth Utilization (%) | GPU Time (ms)
float          | 60.62 | 1.033
float2         | 84.34 | 0.737
float4         | 88.82 | 0.706
73
Maximizing Memory Throughput
General guidelines
Global memory
• Strive for aligned and coalesced accesses within a warp.
• Maximize bytes in-flight to saturate memory bandwidth.
• Process several elements per thread.
• Use vectorized loads/stores.
• Launch enough threads to maximize throughput.
L1 and L2 caches
• Cache blocking difficult, but not impossible.
• Rely on caches when you don’t have a choice.
Shared memory
• Use it to reduce global memory traffic.
• Strive to avoid bank conflicts.
• Use vectorized loads/stores.
74
Atomics
75
Using Atomics Efficiently
Access Patterns
76
Using Atomics Efficiently
Example #1: find the maximum value of an array
77
Using Atomics Efficiently
Example #1: find the maximum value of an array
• Optimization #1: maintain a block-level max in shared memory.
  • Reduces the number of same-address global atomics by a factor equal to the thread block size.

__global__
void find_max(const int * __restrict__ in, int *max, int N)
{
    int grid_tid = cg::this_grid().thread_rank();
    int grid_stride = cg::this_grid().num_threads();
    auto block = cg::this_thread_block();
    int block_tid = block.thread_rank();

    __shared__ int block_max;
    …
78
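The slide’s listing is truncated; a sketch of how the full Optimization #1 kernel might look (not verbatim from the slides: a grid-stride local max, one shared-memory atomic per thread, and a single global atomic per block):

#include <climits>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void find_max_block(const int * __restrict__ in, int *max, int N)
{
    int grid_tid = cg::this_grid().thread_rank();
    int grid_stride = cg::this_grid().num_threads();
    auto block = cg::this_thread_block();

    __shared__ int block_max;
    if (block.thread_rank() == 0) block_max = INT_MIN;
    block.sync();

    int thread_max = INT_MIN;
    for (int i = grid_tid; i < N; i += grid_stride)
        thread_max = in[i] > thread_max ? in[i] : thread_max;

    atomicMax(&block_max, thread_max);   // shared-memory atomic, block scope
    block.sync();

    if (block.thread_rank() == 0)
        atomicMax(max, block_max);       // a single global atomic per block
}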
Using Atomics Efficiently
Example #1: find the maximum value of an array
• Optimization #2: use a parallel reduction to calculate the block-level max in shared memory.

// Assumes a block dimension of 256
__global__
void find_max(const int * __restrict__ in, int *max, int N) {
    …
    auto tile = cg::tiled_partition<32>(block);
    extern __shared__ int sdata[];
    // Find my local maximum as before
    // Each thread puts its local max into shared memory
    sdata[block_tid] = thread_max;
    block.sync();   // make every thread's local max visible before reducing
    // Block-level reduction
    if (block_tid < 128) {
        if (sdata[block_tid + 128] > thread_max)
            thread_max = sdata[block_tid + 128];
        sdata[block_tid] = thread_max;
    }
    block.sync();
    if (block_tid < 64) {
        if (sdata[block_tid + 64] > thread_max)
            thread_max = sdata[block_tid + 64];
        sdata[block_tid] = thread_max;
    }
    block.sync();
    // Warp-level reduction
    if (tile.meta_group_rank() == 0) {
        if (sdata[block_tid + 32] > thread_max)   // fold in the partial maxima held in sdata[32..63]
            thread_max = sdata[block_tid + 32];
        thread_max = cg::reduce(tile, thread_max,
                                cg::greater<int>());
    }
    if (block_tid == 0)
        atomicMax(max, thread_max);
}
79
Performance Analysis
• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Uniform distribution (-50, 50).
80
Using Atomics Efficiently
Example #2: vector update
• Problem description: a = a + b * c

__global__
void vector(const float * __restrict__ b,
            const float * __restrict__ c,
            float * __restrict__ a,
            int N) {
    int grid_tid = cg::this_grid().thread_rank();
    int grid_stride = cg::this_grid().num_threads();
    …
81
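The slide’s listing is truncated; a plausible completion, purely as a sketch (the slides may go on to contrast it with an atomicAdd-based variant), is the plain grid-stride update:

__global__
void vector(const float * __restrict__ b,
            const float * __restrict__ c,
            float * __restrict__ a,
            int N)
{
    int grid_tid = cg::this_grid().thread_rank();
    int grid_stride = cg::this_grid().num_threads();
    for (int i = grid_tid; i < N; i += grid_stride) {
        a[i] = a[i] + b[i] * c[i];                   // each element is written by exactly one thread
        // atomic alternative: atomicAdd(&a[i], b[i] * c[i]);
    }
}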
Using Atomics Efficiently
Example #2: vector update
82
Performance Analysis
NVIDIA H100 SXM
83
Performance Analysis
NVIDIA H100 SXM
84
Performance Analysis
NVIDIA H100 SXM
• ~5% increase in memory throughput translates into a corresponding reduction in execution time. Why?
• This kernel is DRAM bandwidth bound.
85
Summary
86
Which optimizations to focus on?
Solving the bottlenecks
• Compute bound
• Reduce instruction count.
• E.g., use vector loads/stores.
• Use tensor cores.
• Use lower precision arithmetic, fast math intrinsics.
• Bandwidth bound
• Reduce the amount of data transferred.
• Optimize memory access patterns.
• Lower precision datatypes.
• Kernel fusion.
• Latency bound
• Increase number of instructions and memory accesses in-flight.
• Increase parallelism, occupancy.
87
Resources/Further Study
88