
Introduction to CUDA Performance Optimization
Athena Elafrou, Guillaume Thomas Collignon, NVIDIA DevTech Compute
GPU Technology Conference, March 18th 2024

1
Agenda
• GPU Architecture and CUDA Programming Basics

• Fundamental Performance Optimizations

• Summary

2
GPU Architecture and CUDA Programming Basics

3
GPU Overview
NVIDIA H200 SXM

• 132 SMs with 4th Gen Tensor Cores
• 50 MB L2 cache
• 141 GB HBM3e, 4.8 TB/s
• 4th Gen NVLink, 900 GB/s bidirectional
• PCIe Gen 5, 128 GB/s bidirectional
4
Streaming Multiprocessor (SM)
Hopper architecture

• 128 FP32 cores


• 64 FP64 cores
• 64 INT32 cores
• 4 mixed-precision Tensor Cores
• 16 special function units (transcendentals)
• 4 warp schedulers

• 32 LD/ST units
• 64K 32-bit registers
• 256 KiB unified L1 data cache and shared memory
• Tensor Memory Accelerator (TMA)

5
SIMT Architecture
Single-Instruction, Multiple-Thread

• Akin to a single-instruction, multiple-data (SIMD) array processor per Flynn’s taxonomy, combined with fine-grained multithreading.
• SIMT architectures expose a large set of hardware threads, which is partitioned into groups called warps (warps 0..N available for scheduling).
• Interleave warp execution to hide latencies.
• Execution context for each warp is kept on-chip for fast interleaving.
• When scheduled, each thread of a warp executes on a given lane of a SIMD functional unit (the SIMD back-end).
• Each SM sub-partition can be thought of as a SIMT engine that creates, manages, schedules, and executes warps of 32 parallel threads.

6
SIMT Architecture
Warp divergence

• If threads in a warp diverge via a conditional branch, the warp executes every branch path.
• Full efficiency is realized when all 32 threads of a warp agree on their execution path, i.e., when they are converged.

[Diagram: 8 threads evaluate a condition (T T F T F F F F); the warp first executes the if-branch (instructions 1-3) with one subset of threads masked off, then the else-branch (instructions 4-5) with the complementary subset masked off.]

7
SIMT Architecture
Independent Thread Scheduling

• Individual threads in a warp have their own program counter and call stack and are therefore free to execute independently.

    if (thread_id < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;

[Diagram: over time, the two branch paths execute interleaved (X; Y; and A; B;), with a compiler-inserted sync before Z re-converging the warp.]

• Do not assume threads in a warp are automatically re-converged after a conditional or at any point!
• The compiler might sync to enforce re-convergence for better performance.

8
CUDA Programming Model
Single-Program Multiple-Data

Single-threaded CPU vector addition:

    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

GPU vector addition (thread IDs 0, 1, 2, …, N-1):

    int i = my_global_thread_id();
    if (i < N) c[i] = a[i] + b[i];

• SIMT instructions specify the execution of a single thread.
• A SIMT kernel is launched on many threads that execute in parallel.
• Threads use their thread index to work on disjoint data or to enable different execution paths.
• Three key software abstractions enable efficient programming through the CUDA programming model:
    • a hierarchy of thread groups,
    • memory spaces, and
    • synchronization.
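To make this concrete, here is a minimal, runnable sketch of the vector addition as a complete CUDA program. It is our own illustration rather than code from the slides: the kernel name vec_add, the block size of 256, and the use of cudaMallocManaged are assumptions; the global thread index is computed from blockIdx, blockDim and threadIdx, which is what my_global_thread_id() stands for above.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread adds one element; threads past the end do nothing.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int block = 256;                      // threads per block (a multiple of the warp size)
        int grid  = (n + block - 1) / block;  // enough blocks to cover all n elements
        vec_add<<<grid, block>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);          // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }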

9
Thread Hierarchy

CUDA/Software to hardware mapping: a grid runs on the device, thread blocks run on SMs, and individual threads run on scalar CUDA cores.

• A CUDA kernel is launched on a grid of thread blocks, which are completely independent.
• Thread blocks are executed on SMs.
    • Several concurrent thread blocks can reside on an SM.
    • Thread blocks do not migrate.
    • Each block can be scheduled on any of the available SMs, in any order, concurrently or in series.
• Individual threads execute on scalar CUDA cores.

10
Thread Block Clusters

• For Hopper GPUs, CUDA introduced an optional level in the thread hierarchy called Thread Block Clusters.

• Thread blocks in a cluster are guaranteed to be concurrently scheduled and enable efficient cooperation and data
sharing for threads across multiple SMs.

• For more information on this topic visit GTC session [S62192]: “Advanced Performance Optimization in CUDA”.

[Diagram: two grids, each made up of thread block clusters that group several thread blocks.]

11
Thread Hierarchy
What about warps?

• At runtime, a block of threads is divided into warps for SIMT execution.


• The way a block is partitioned into warps is always the same.
• Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.

• The total number of warps in a block is defined as:

    warps per block = ceil(threads per block / warp size)

[Diagram: a thread block resident on an SM is partitioned into warps of 32 threads each.]

12
Thread Hierarchy
Thread block sizing

• Let’s say we want to add two vectors of size N = 1000.


• Scenario #1: 1-D grid of 10 1-D blocks of size 100.
• Scenario #2: 1-D grid of 8 1-D blocks of size 128.
• Which option is better in terms of thread resource utilization?

13
Thread Hierarchy
Thread block sizing

• Let’s say we want to add two vectors of size N = 1000.


• Scenario #1: 1-D grid of 10 1-D blocks of size 100.
• Scenario #2: 1-D grid of 8 1-D blocks of size 128.
• Which option is better in terms of thread resource utilization?

Scenario #1: each block has 3 full warps and 1 warp with only 4 active threads (100 = 3 x 32 + 4).
Average thread utilization = 1000 / (10 blocks x 4 warps x 32 threads) = 78.125%

Scenario #2: each block has 4 full warps; only the last block has threads with no work.
Average thread utilization = 1000 / (8 blocks x 4 warps x 32 threads) = 97.656%
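A common way to pick a launch configuration in practice is a ceiling division of N by the block size. The small host-side sketch below (our own helper names, not from the slides) reproduces the utilization numbers above by counting whole allocated warps.

    #include <cstdio>

    // Ceiling division: how many blocks of size 'block' are needed to cover n elements.
    static int ceil_div(int n, int block) { return (n + block - 1) / block; }

    int main()
    {
        const int N = 1000, warp_size = 32;
        const int block_sizes[] = {100, 128};  // scenario #1 and scenario #2

        for (int block : block_sizes) {
            int blocks = ceil_div(N, block);
            int warps_per_block = ceil_div(block, warp_size);
            int allocated = blocks * warps_per_block * warp_size;  // hardware allocates whole warps
            printf("block=%3d  blocks=%2d  thread utilization=%.3f%%\n",
                   block, blocks, 100.0 * N / allocated);
        }
        return 0;
    }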

14
Memory Hierarchy

[Diagram: on the hardware side, registers and the unified shared memory/L1 reside in each SM; all SMs share the L2 cache, which is backed by DRAM.]

• Per-thread registers.
    • Lowest possible latency.
    • Private storage.
• Per-thread local memory.
    • Private storage.
    • Slowest access.
• Per-block shared memory.
    • Visible by all threads in a block.
    • Can be used to exchange data between threads in a thread block.
    • Very fast access.
• Global memory.
    • Visible by all threads in a grid.
    • Slowest access.

15
Synchronization
Barriers

• Grid boundary.
    • Kernel completion.
    • grid_group::sync() via the Cooperative Groups API.
        • Requires the kernel to be launched via the cudaLaunchCooperativeKernel() API.
    • Slow! Avoid unless necessary.
• Thread-block boundary.
    • __syncthreads()
    • thread_block::sync() via the Cooperative Groups API.
    • Fast! The most common synchronization level.
• Warp or sub-warp boundary (warp = 32 threads).
    • __syncwarp()
    • coalesced_group::sync() via the Cooperative Groups API.
    • Very fast!
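To make the block-level case concrete, here is a minimal sketch (our own example, not from the slides) that stages a tile in shared memory and synchronizes with thread_block::sync(), which has the same effect as __syncthreads(). The kernel name and the fixed 256-thread tile are assumptions.

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Reverse each 256-element tile of 'data' in place; launch with 256 threads per block.
    __global__ void reverse_tile(float *data)
    {
        __shared__ float tile[256];
        cg::thread_block block = cg::this_thread_block();

        int t = threadIdx.x;
        tile[t] = data[blockIdx.x * 256 + t];

        block.sync();  // barrier: all writes to 'tile' are visible to the whole block

        data[blockIdx.x * 256 + t] = tile[255 - t];
    }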

16
Atomics
Memory spaces

• Read-modify-write operations on 16-, 32-, 64- or 128-bit words.
• Available as CUDA primitives or as C++ atomics through the libcu++ extended API.
• Shared memory atomics: on data in a thread block's shared memory.
• Global memory atomics: on data in global memory, visible to the whole grid.
    • Facilitated by special hardware in the L2 cache.
• Unified memory atomics: on unified memory shared by the CPU and one or more GPUs.

17
Thread Scopes

• To account for non-uniform thread synchronization costs, CUDA has introduced the notion of thread scopes.
• A thread scope specifies which threads can communicate with each other using a primitive such as an atomic or a barrier.
• Thread scopes are exposed to the programmer in 3 ways:
    • PTX
    • CUDA Math API
    • CUDA C++
• Available scopes, from narrowest to widest:
    • thread_scope_thread: a single thread.
    • thread_scope_block: threads of the same thread block.
    • thread_scope_device: threads of the same grid (device).
    • thread_scope_system: all threads in the system, including CPU threads and threads on other GPUs.
• Always use the narrowest scope that ensures correctness of your application.
• More on thread scopes in the GTC session [S62192]: “Advanced Performance Optimization in CUDA”.
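As a hedged illustration of picking a scope with the libcu++ C++ atomics mentioned earlier, the sketch below keeps a per-block counter in shared memory with block scope and folds it into a global counter with device scope. The kernel and variable names are ours, and it assumes a CUDA toolkit recent enough to provide cuda::atomic_ref.

    #include <cuda/atomic>

    // Count the positive elements of 'in'; '*total' must be zeroed before launch.
    __global__ void count_positive(const float *in, int n, int *total)
    {
        __shared__ int block_count;
        if (threadIdx.x == 0) block_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0.0f) {
            // Narrow (block) scope: only threads of this block touch block_count.
            cuda::atomic_ref<int, cuda::thread_scope_block> c(block_count);
            c.fetch_add(1);
        }
        __syncthreads();

        if (threadIdx.x == 0) {
            // Wider (device) scope: every block in the grid updates the same total.
            cuda::atomic_ref<int, cuda::thread_scope_device> t(*total);
            t.fetch_add(block_count);
        }
    }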

18
Fundamental Performance Optimizations

19
Little’s Law
For escalators

Our escalator parameters:

• 1 person per step

• A step arrives every 2 seconds


• Bandwidth: 0.5 person/s

• 20 steps tall
• Latency = 40 seconds

One person in flight?

Throughput = 0.025 person/s

20
Little’s Law
For escalators

Our escalator parameters:

• 1 person per step

• A step arrives every 2 seconds


• Bandwidth: 0.5 person/s

• 20 steps tall
• Latency = 40 seconds

How many persons do we need in-flight to


saturate bandwidth?

Concurrency = Bandwidth x Latency


= 0.5 persons/s x 40 s
= 20 persons

21
Little’s Law
For GPUs

• How to maximize performance?
    1. Saturate compute units.
    2. Saturate memory bandwidth.
• Need to hide the corresponding latencies to achieve this.
    • Compute latencies.
    • Memory access latencies.
• Latencies can be hidden by having more instructions in flight.

Example: FP32 units with latency = 24 cycles and a throughput of 8 FP32 ops per cycle:

    Concurrency = Bandwidth x Latency = 8 x 24 = 192 operations in-flight

22
Hiding Latencies
Increasing in-flight instructions

• Two ways to increase in-flight instructions:

1. Improve Instruction-Level Parallelism (ILP).
    • Higher ILP -> more independent instructions per thread.

    Thread 0 (independent instruction chains):
        x = x + a    y = y + a    z = z + a
        x = x + b    y = y + b    z = z + b

2. Improve Thread-Level Parallelism (TLP).
    • Higher TLP -> more threads -> more independent instructions per kernel.

    Thread 0:       Thread 1:
        x = x + a       y = y + a
        x = x + b       y = y + b

23
Instruction Issue
    __global__
    void kernel(const float * __restrict__ a,
                const float * __restrict__ b,
                float * __restrict__ c)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        c[idx] += a[idx] * b[idx];
    }

• Assumptions:
    • LDG/STG: dependent issue rate 1000 cycles, issue rate 1 cycle.
    • FP32 pipeline: dependent issue rate 4 cycles, issue rate 2 cycles.
    • 1 available warp per scheduler.

• Per-thread instruction stream: load a, load b, load c (12 bytes in-flight); fma c, a, b; store c (4 bytes in-flight).

• Timeline: LDG at cycles N, N+1, N+2, (stall), FFMA at N+1002, (stall), STG at N+1006.
    Total cycles = 1006.

24
Increasing ILP
Computing 2 elements per thread – version #1
• Every thread computes 2 elements using a grid stride.

    __global__
    void kernel(const float * __restrict__ a,
                const float * __restrict__ b,
                float * __restrict__ c)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;

        #pragma unroll 2
        for (int i = 0; i < 2; i++) {
            const int idx = tid + i * stride;
            c[idx] += a[idx] * b[idx];
        }
    }

• Per-thread instruction stream: load a[i1], load b[i1], load c[i1], load a[i2], load b[i2] (20 bytes in-flight); fma c[i1], a[i1], b[i1]; store c[i1]; load c[i2] (8 bytes in-flight); fma c[i2], a[i2], b[i2]; store c[i2] (4 bytes in-flight).

• Timeline: LDG at cycles N to N+4, (stall), FFMA at N+1002, (stall), STG at N+1006, LDG at N+1007, (stall), FFMA at N+2007, (stall), STG at N+2011.
    Total cycles = 2011.

• 2x the amount of work in 2x more cycles!

25
Increasing ILP
Computing 2 elements per thread – version #2
• Every thread computes 2 elements using a constant block stride.

    #define THREAD_BLOCK_DIM 128

    __global__
    void kernel(const float * __restrict__ a,
                const float * __restrict__ b,
                float * __restrict__ c)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int off = 2 * THREAD_BLOCK_DIM * blockIdx.x + threadIdx.x;

        #pragma unroll 2
        for (int i = 0; i < 2; i++) {
            const int idx = off + i * THREAD_BLOCK_DIM;
            c[idx] += a[idx] * b[idx];
        }
    }

• Per-thread instruction stream: load a[i1], load b[i1], load c[i1], load a[i2], load b[i2], load c[i2] (24 bytes in-flight); fma c[i1], a[i1], b[i1]; fma c[i2], a[i2], b[i2]; store c[i1]; store c[i2] (8 bytes in-flight).

• Timeline: LDG at cycles N to N+5, (stall), FFMA at N+1002, FFMA at N+1004, (stall), STG at N+1006, STG at N+1008.
    Total cycles = 1008.

• 2x the amount of work in the ~same number of cycles!

26
Warp Scheduling
Hopper SM

• 4 warp schedulers per SM.

• Each scheduler manages a pool of warps.


• Hopper: 16 warp slots per scheduler.

• Each scheduler can issue 1 warp per cycle.

27
Warp Scheduling
Mental model

Active warp states:
• Stalled: waiting on an instruction fetch, a memory dependency, an execution dependency, or a synchronization barrier.
• Eligible: ready to issue an instruction (an active warp that is not stalled).
• Selected: an eligible warp that is selected to issue an instruction.

Warps occupying scheduler slots (slots 0 to N) are considered active. Each cycle, out of all eligible warps, the scheduler selects one to issue on that cycle.

Cycle-by-cycle example for one scheduler:
• Cycle N: out of all eligible warps, one is selected to issue (here, the warp in slot 2).
• Cycle N+1: the warp selected at cycle N is not eligible any more (e.g., it issued an instruction with a longer latency). No eligible warps, so the issue slot goes unused.
• Cycle N+2: the warp in slot 0 becomes eligible and is selected.
• Cycle N+3: no eligible warps again; the issue slot is unused.
• Cycles N+4 to N+6: issue slots remain idle while the resident warps stay stalled.

Having more active warps would help reduce idle issue slots and hide the latencies of stalled warps.

36
How to Increase Active Warps?
Occupancy

• There is a maximum number of warps which can be concurrently active on an SM.


• Device (depends on compute capability of the GPU)
• Achievable (depends on kernel implementation + compiler)
• Achieved (depends mostly on the grid size)

    Occupancy = (achievable # active warps per SM) / (device # active warps per SM)

• Occupancy of a CUDA kernel may be limited by:
    • Register usage: SM registers are partitioned among threads.
    • Shared memory usage: SM shared memory is partitioned among thread blocks.
    • Thread block size: threads are allocated at thread block granularity.

• Analyze the occupancy of CUDA kernels with NVIDIA Nsight Compute!

37
Occupancy Limiters
Registers

• Register usage: compile with --ptxas-options=-v


• Reports registers per thread
• The maximum number of registers per thread can be set manually:
• At compile time on a per-file basis using the --maxrregcount flag of nvcc
• Per-kernel using the __launch_bounds__ qualifier (a sketch follows at the end of this slide)
• Hopper has 64K (65536) registers per SM
• Allocated in fixed-size chunks of 256 registers

• Example:
• Kernel uses 63 registers per thread
• Registers per warp = 63 * 32 = 2016
• Registers allocated per warp = 2048
• Achievable active warps per SM = 65536 / 2048 = 32
• Occupancy = 32 / 64 * 100 = 50%
• Hopper supports up to 64 warps per SM
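The per-kernel register cap mentioned in the list above can be requested with __launch_bounds__; the sketch below is our own illustration (the kernel name, the 256 threads per block and the 4 blocks per SM are example values, and the file name in the compile line is assumed).

    // Tell the compiler this kernel is launched with at most 256 threads per block
    // and that at least 4 resident blocks per SM are desired; this bounds registers per thread.
    __global__ void __launch_bounds__(256, 4)
    scale_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    // Alternatively, cap registers for a whole file at compile time:
    //   nvcc --maxrregcount=64 --ptxas-options=-v scale_kernel.cu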

38
Occupancy Limiters
Shared memory

• Shared memory usage: compile with --ptxas-options=-v.


• Reports static shared memory usage per thread block.
• Hopper has 228 KiB of shared memory per SM.
    • 1 KiB per thread block is reserved for system use.
    • Using more than the default 48 KiB per block requires an opt-in and dynamic shared memory.

• Example:
• Kernel uses 17408 bytes of shared memory per 128-thread block.
• Blocks per SM = 233472 / (17408 +1024 ) = 12.66
• Achievable active warps per SM = 12 * 128 / 32 = 48
• Occupancy = 48 / 64 * 100 = 75%
• Hopper supports up to 64 warps per SM.
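The same calculation can be done programmatically with the occupancy API; a hedged sketch using the example numbers above (the kernel is our own and takes its 17408 bytes as dynamic shared memory at launch):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float *out)
    {
        extern __shared__ float smem[];  // dynamic shared memory, sized at launch
        smem[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
    }

    int main()
    {
        int max_blocks = 0;
        // How many 128-thread blocks fit per SM given 17408 B of dynamic shared memory each?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel, 128, 17408);
        printf("Resident blocks per SM: %d (%d active warps)\n",
               max_blocks, max_blocks * 128 / 32);
        return 0;
    }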

39
Occupancy Limiters
Thread block size

• Thread block size is a multiple of warp size (32).


• Even if you request fewer threads, HW rounds up.
• Each thread block can have a maximum size of 1024.
• Each SM can have up to 64 warps, 32 blocks and 2048 threads (Hopper).

    Block Size | Active Threads per SM | Active Warps per SM | Warps per Block | Active Blocks per SM | Occupancy (%)
    32         | 1024                  | 32                  | 1               | 32                   | 50
    64         | 2048                  | 64                  | 2               | 32                   | 100
    256        | 2048                  | 64                  | 8               | 8                    | 100
    512        | 2048                  | 64                  | 16              | 4                    | 100
    768        | 1536                  | 48                  | 24              | 2                    | 75
    1024       | 2048                  | 64                  | 32              | 2                    | 100

40
ILP vs TLP for Hiding Latencies
Computing c = c + a * b

• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Datatype = float
• Baseline thread block size = 32 (50% occupancy)
• Experiment #1: increase occupancy
• Thread block size = 64 (100% occupancy)
• Experiment #2: increase ILP by computing more elements per thread
• Elements per thread = 2, 4

    Implementation | Elements per Thread | Thread Block Size | Main Memory Bandwidth Utilization (%) | SM Occupancy (%) | GPU Time (ms)
    Baseline       | 1                   | 32                | 25                                    | 50               | 5.0
    Experiment #1  | 1                   | 64                | 51                                    | 100              | 2.5
    Experiment #2  | 2                   | 32                | 51                                    | 50               | 2.5
    Experiment #2  | 4                   | 32                | 82                                    | 50               | 1.6
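Experiment #2 generalizes the 2-element kernels from the previous slides to more elements per thread. A hedged sketch of what a block-strided, templated version might look like (the template parameters and kernel name are ours, and N is assumed to be divisible by ELEMS * BLOCK_DIM):

    template <int ELEMS, int BLOCK_DIM>
    __global__ void axpy_like(const float * __restrict__ a,
                              const float * __restrict__ b,
                              float * __restrict__ c)
    {
        // Each block owns a contiguous chunk of ELEMS * BLOCK_DIM elements.
        int off = ELEMS * BLOCK_DIM * blockIdx.x + threadIdx.x;

        #pragma unroll
        for (int i = 0; i < ELEMS; i++) {
            const int idx = off + i * BLOCK_DIM;
            c[idx] += a[idx] * b[idx];  // independent updates keep more loads in flight
        }
    }

    // Example launch for the 4-elements-per-thread, 32-thread-block configuration:
    //   axpy_like<4, 32><<<N / (4 * 32), 32>>>(a, b, c);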

41
What Occupancy Do I Need?
General guidelines

Rule of thumb: Try to maximize occupancy.

But some algorithms will run better at low occupancy.


More registers and shared memory can allow higher data reuse, higher ILP, higher performance.

Low occupancy:
• Fewer threads per SM; more registers per thread and more shared memory per thread.
• Needs enough instruction-level parallelism or the GPU will starve!
• Typical for complex algorithms.

High occupancy:
• More threads per SM; fewer registers per thread and less shared memory per thread.
• Relies on thread parallelism to hide latencies!
• Typical for simple algorithms.

42
Maximizing Memory Throughput

43
Memory Hierarchy
NVIDIA H200 SXM

• Per-SM register file: 64K 32-bit registers.
• Per-SM unified shared memory / L1 cache: 228 KiB, variable split.
• L2 cache: 50 MiB.
• DRAM: 141 GB HBM3e at 4.8 TB/s.

44
Why Do GPUs Have Caches?

• 100s-1000s of threads share the L1 and ~100,000s of threads share the L2.
• L1, L2 capacity per thread is relatively small.

Caches on GPUs are mostly useful for:


• “Smoothing” irregular, misaligned access patterns.
• Caching common data accessed by multiple threads.
• Faster register spills, local memory.
• Faster atomics.

What about cache blocking?


• L2 cache blocking may be feasible.
• For an example of efficient use of L2 cache blocking, see [S62192]: “Advanced Performance Optimization in CUDA”.

45
Memory Transactions
Cache lines and sectors

• Minimum memory access granularity: 32 bytes = 1 sector


• L1 to L2: 1 sector
• L2 to Global: 2 sectors (default)
• The user can set a preferred granularity with cudaDeviceSetLimit() and cudaLimitMaxL2FetchGranularity (see the sketch after this list).
• Only a hint though!

• Cache line size: 128 bytes = 4 sectors


• Cache ”management” granularity = 1 cache line
• Coalescing of requests.
• Evictions.
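The L2 fetch-granularity hint mentioned in the list above is set through the regular device-limit API; a minimal sketch (the 128-byte value is just an example, and the call is only a hint):

    #include <cuda_runtime.h>

    int main()
    {
        // Hint that up to 128-byte granularity may be used when fetching from DRAM into L2.
        cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, 128);

        size_t granularity = 0;
        cudaDeviceGetLimit(&granularity, cudaLimitMaxL2FetchGranularity);
        return 0;
    }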

46
Memory Reads & Writes

Reads:
• Check if the data is in L1; if not, check L2.
• Check if the data is in L2; if not, get it from DRAM.
• Unit of data moved: full sector.

Writes:
• L1 is write-through: update both L1 and L2.
• L2 is write-back: flush data to DRAM only when needed.
• Unit of data moved: partial sector*.

* Depends on whether ECC is enabled or disabled.

47
Global Memory Access Patterns
Aligned and sequential

• Threads 0-31 access consecutive 4-byte elements starting at a 128-byte-aligned address.
• The warp's request maps to 4 sectors: COALESCED!

48
Global Memory Access Patterns
Aligned and sequential

• Threads 0-31 access consecutive 8-byte elements starting at an aligned address.
• The warp's request maps to 8 sectors: COALESCED!

49
Global Memory Access Patterns
Aligned and non-sequential

• Threads 0-31 access 4-byte elements within one aligned 128-byte range, but not in thread-ID order (non-sequential).
• The warp's request still maps to 4 sectors: COALESCED!

50
Global Memory Access Patterns
Mis-aligned and sequential

• Threads 0-31 access consecutive 4-byte elements starting at a mis-aligned address.
• The warp's request spills over a sector boundary and maps to 5 sectors.

51
Global Memory Access Patterns
Mis-aligned and sequential

• The same mis-aligned, sequential 4-byte pattern shown for two consecutive warps: each warp's request maps to 5 sectors, and the sector at the boundary is touched by both warps.

52
Global Memory Access Patterns
Same address

• All 32 threads access the same 4-byte element.
• The warp's request maps to a single sector.

53
Global Memory Access Patterns
Aligned and strided

• Threads 0-31 access 4-byte elements with a large stride, so each element falls in a different sector.
• The warp's request maps to 32 sectors, with only 4 useful bytes per 32-byte sector.

54
Impact of Data Layout
Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)

    struct Coefficients
    {
        float u, v, w;
        float x[8], y[8], z;
    };

    __global__ void kernel(Coefficients *data)
    {
        int i = cg::this_grid().thread_rank();

        data[i].u = data[i].u + 10.f;
        data[i].y[0] = data[i].y[0] + 10.f;
    }

AoS memory layout (each element is an 80-byte struct):

    u v w x0 … x7 y0 … y7 z | u v w x0 … x7 y0 … y7 z | …
    0                         80

55
Impact of Data Layout
Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)

• When loading coefficients u and y[0] (same struct and kernel as above):
    • Successive threads in a warp read 4 bytes at an 80-byte stride (for u: offsets 0, 80, 160, 240, …).

56
Impact of Data Layout
Array-of-Structures (AoS) vs Structure-of-Arrays (SoA)

• When loading coefficients u and y[0]:
    • Successive threads in a warp read 4 bytes at an 80-byte stride (for y[0]: offsets 44, 124, 204, 284, …).
    • We are reading 7x more bytes than necessary!
        • Remember data is read in sectors of 32 bytes.
    • No potential reuse of the sectors loaded by the previous access.

57
Impact of Data Layout
Array-of-Structures vs Structure-of-Arrays

• Refactoring from AoS to SoA leads to coalesced memory accesses for u and y[0].

    struct Coefficients
    {
        float *u, *v, *w;
        float *x0, …, *x7, *y0, … *y7, *z;
    };

    __global__ void kernel(Coefficients data)
    {
        int i = cg::this_grid().thread_rank();

        data.u[i] = data.u[i] + 10.f;
        data.y0[i] = data.y0[i] + 10.f;
    }

SoA memory layout: successive threads (T0, T1, T2, …) read consecutive u values from one contiguous array, so a single 128-byte cache line serves the whole warp; the same holds for y0.

58
Impact of Data Layout
Performance Analysis

• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Thread block size = 256

    Implementation | Load Efficiency (%) | Store Efficiency (%) | Main Memory Bandwidth Utilization (%) | GPU Time (ms)
    AoS            | 12.5                | 12.5                | 13.50                                 | 28.497
    SoA            | 100                 | 100                 | 79.47                                 | 4.836

60
Unified L1 and Shared Memory

• Can be used as a typical hardware managed cache (L1) and/or a user-managed memory (Shared Memory)
• An application can configure its preferred split at runtime using cudaFuncSetAttribute() with the attribute
cudaFuncAttributePreferredSharedMemoryCarveout.

• Shared memory can be useful for:


• Storing frequently used data
• Improving global memory access patterns
• Data layout conversion
• Communication among threads of a thread block

61
Shared Memory

Capacity:
• Default 48 KiB per thread block; opt in to get more using cudaFuncSetAttribute() with the attribute cudaFuncAttributeMaxDynamicSharedMemorySize (a sketch follows below).
• Up to 227 KiB per thread block on Hopper.

Organization:
• Divided into 32 banks, each 4-byte wide.
• Successive 4-byte words map to successive banks.
• Bank index calculation examples:
• (4-byte word index) % 32
• (1-byte word index / 4) % 32

Performance:
• Slower than registers, but much faster than global memory.
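The capacity opt-in mentioned at the top of this slide is done once per kernel before launch; the sketch below is our own example (the kernel name and the 100 KiB request are illustrative values within Hopper's 227 KiB per-block limit).

    #include <cuda_runtime.h>

    __global__ void big_smem_kernel(float *out)
    {
        extern __shared__ float smem[];  // dynamic shared memory, sized by the launch
        smem[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
    }

    int main()
    {
        const int smem_bytes = 100 * 1024;  // more than the default 48 KiB

        // Opt in to the larger per-block dynamic shared memory allocation.
        cudaFuncSetAttribute(big_smem_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes);

        float *out;
        cudaMalloc(&out, 128 * sizeof(float));
        big_smem_kernel<<<1, 128, smem_bytes>>>(out);  // third launch parameter = dynamic shared bytes
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }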

62
Logical View of Shared Memory Banks
4-byte data

63
Processing Data Types of Different Sizes

• 4-byte or smaller data types:


• Process addresses of all threads in a warp in a single phase

• 8-byte data types:


• Process addresses of all threads in a warp in 2 phases
• Each phase processes addresses of half of a warp

• 16-byte data types:


• Process addresses of all threads in a warp in 4 phases
• Each phase processes addresses of a quarter of a warp

64
Shared Memory Access Patterns
Bank conflicts

• Bank conflicts occur when threads in the same phase want to access the same bank.

• Coalesced access (no bank conflicts): shmem[threadIdx.x] = data[tid]; every thread in the warp maps to a distinct bank.
• Conflict access (2-way bank conflicts): shmem[threadIdx.x * 2] = data[tid]; pairs of threads map to the same bank, so the access is serialized into 2 phases.
• Broadcast access (no bank conflicts): data = shmem[0]; all threads read the same 4-byte word, which is broadcast to the warp.

65
Bank Conflicts
Example

• 32 x 32 array of floats in shared memory.
    • 4-byte data, 1 array element per bank.
    • Row-major layout.
    • 2D thread block.
• Access pattern:
    • idx := threadIdx.x*32 + threadIdx.y
    • 32-way bank conflicts.

All threads in a warp access the same bank! (Threads of a warp share threadIdx.y and differ in threadIdx.x, so bank = (threadIdx.x*32 + threadIdx.y) % 32 = threadIdx.y is identical for all 32 of them.)

66
Resolving Bank Conflicts
Padding

• 32 x 33 array of floats in shared memory (one extra column of padding).
    • 4-byte data, 1 array element per bank.
    • Row-major layout.
    • 2D thread block.
• Access pattern:
    • idx := threadIdx.x*33 + threadIdx.y
    • No conflicts!

Each thread in a warp accesses a distinct bank! (bank = (threadIdx.x*33 + threadIdx.y) % 32 now changes with threadIdx.x, at the cost of one wasted column of shared memory.)

67
Resolving Bank Conflicts
Swizzling

• 32 x 32 array of floats in shared memory.
    • 4-byte data, 1 array element per bank.
    • Row-major layout.
    • 2D thread block.
• Access pattern:
    • idx := threadIdx.x*32 + (threadIdx.y ^ threadIdx.x)
    • No conflicts!
    • No shared memory wasted!

Each thread in a warp accesses a distinct bank! (XOR-ing the column index with threadIdx.x permutes the columns differently in every row, so bank = threadIdx.y ^ threadIdx.x takes a different value for each thread of the warp.)
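To see padding used in context, here is a minimal sketch of the classic tile transpose through shared memory, assuming a 32x32 thread block; the extra padding column is what removes the bank conflicts on the column-wise reads (the names and the TILE size are our own choices, not from the slides).

    #define TILE 32

    // Transpose a width x width matrix; width is assumed to be a multiple of TILE.
    __global__ void transpose(const float * __restrict__ in,
                              float * __restrict__ out, int width)
    {
        __shared__ float tile[TILE][TILE + 1];  // +1 column of padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced global read

        __syncthreads();

        // Write the transposed tile: swap the block coordinates and read the tile column-wise.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free thanks to padding
    }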

68
Vectorized Memory Accesses
Multi-word as well as multi-thread

• With 4-byte (int) accesses, a contiguous, aligned warp access (threads 0-31) fills 1 cache line in a single fetch.

69
Vectorized Memory Accesses
Multi-word as well as multi-thread

• With 8-byte (int2) accesses, a contiguous, aligned warp access fills 2 cache lines in a single fetch (threads 0-15 cover cache line 0, threads 16-31 cover cache line 1).

70
Vectorized Memory Accesses
Multi-word as well as multi-thread

• With 16-byte (int4) accesses, a contiguous, aligned warp access fills 4 cache lines in a single fetch (threads 0-7, 8-15, 16-23, 24-31 cover cache lines 0-3).

71
Vectorized Memory Accesses
Multi-thread, multi-word

• Vectorized global and shared memory accesses.
    • Require aligned data.
    • 64- or 128-bit width.
• Fewer executed instructions!
• More bytes in-flight!
• Approaches to enable vectorization:
    1) By using vector data types, e.g., float2, float4.
    2) Explicitly by casting to vector pointers.
        • Proper alignment required.

    // Using vector data types
    __global__
    void copy(const float2 * __restrict__ in,
              float2 * __restrict__ out,
              int N)
    {
        auto grid = cg::this_grid();
        int tid = grid.thread_rank();
        int stride = grid.size();

        for (int i = tid; i < N / 2; i += stride) {
            out[i] = in[i];
            // Same as:
            // out[i].x = in[i].x;
            // out[i].y = in[i].y;
        }
    }

72
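For approach 2), a hedged sketch of the same copy written with an explicit cast to float4 (our own example; it assumes the pointers are 16-byte aligned and N is a multiple of 4):

    __global__
    void copy_vec4(const float * __restrict__ in,
                   float * __restrict__ out,
                   int N)
    {
        // Reinterpret the float arrays as float4 arrays: 4 elements per access.
        const float4 *in4 = reinterpret_cast<const float4 *>(in);
        float4 *out4      = reinterpret_cast<float4 *>(out);

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;

        for (int i = tid; i < N / 4; i += stride) {
            out4[i] = in4[i];  // one 128-bit load and one 128-bit store
        }
    }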
Vectorized Memory Accesses
Performance Analysis

• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Thread block size = 256

    Implementation | Main Memory Bandwidth Utilization (%) | GPU Time (ms)
    float          | 60.62                                 | 1.033
    float2         | 84.34                                 | 0.737
    float4         | 88.82                                 | 0.706

73
Maximizing Memory Throughput
General guidelines

Global memory
• Strive for aligned and coalesced accesses within a warp.
• Maximize bytes in-flight to saturate memory bandwidth.
• Process several elements per thread.
• Use vectorized loads/stores.
• Launch enough threads to maximize throughput.

L1 and L2 caches
• Cache blocking difficult, but not impossible.
• Rely on caches when you don’t have a choice.

Shared memory
• Use it to reduce global memory traffic.
• Strive to avoid bank conflicts.
• Use vectorized loads/stores.

74
Atomics

75
Using Atomics Efficiently
Access Patterns

• Same address: updates are serialized! The least efficient access pattern.
• Scattered addresses: somewhere in between, depending on how many threads collide.
• Coalesced, distinct addresses: the most efficient access pattern.

76
Using Atomics Efficiently
Example #1: find the maximum value of an array

• Problem description: given an input array, find the maximum element in the array.
• Naïve implementation: every thread finds its local maximum and then atomically updates the global maximum.
    • N / elements_per_thread same-address global atomics.

    __global__
    void find_max(const int * __restrict__ in, int *max, int N)
    {
        int grid_tid = cg::this_grid().thread_rank();
        int grid_stride = cg::this_grid().num_threads();

        // Find my local maximum
        int local_max = INT_MIN;
        for (int i = grid_tid; i < N; i += grid_stride) {
            if (in[i] > local_max)
                local_max = in[i];
        }

        // Atomically update the global max
        atomicMax(max, local_max);
    }

77
Using Atomics Efficiently
Example #1: find the maximum value of an array

• Optimization #1: maintain a block-level max in shared memory.
    • Reduces the number of same-address global atomics by a factor equal to the thread block size.

    __global__
    void find_max(const int * __restrict__ in, int *max, int N)
    {
        int grid_tid = cg::this_grid().thread_rank();
        int grid_stride = cg::this_grid().num_threads();
        auto block = cg::this_thread_block();
        int block_tid = block.thread_rank();

        __shared__ int block_max;
        if (block_tid == 0)        // initialize the block-level max before use
            block_max = INT_MIN;
        block.sync();

        // Find my local maximum
        int local_max = INT_MIN;
        for (int i = grid_tid; i < N; i += grid_stride) {
            if (in[i] > local_max)
                local_max = in[i];
        }

        // Atomically update the block-level max
        atomicMax(&block_max, local_max);
        block.sync();

        // Atomically update the global max
        if (block_tid == 0)
            atomicMax(max, block_max);
    }

78
Using Atomics Efficiently
Example #1: find the maximum value of an array
• Optimization #2: use a parallel reduction to calculate the block-level max in shared memory.

    // Assumes a block dimension of 256
    __global__
    void find_max(const int * __restrict__ in, int *max, int N)
    {
        auto block = cg::this_thread_block();
        int block_tid = block.thread_rank();
        auto tile = cg::tiled_partition<32>(block);
        extern __shared__ int sdata[];

        // Find my local maximum as before (grid-stride loop over in[])
        int grid_tid = cg::this_grid().thread_rank();
        int grid_stride = cg::this_grid().num_threads();
        int thread_max = INT_MIN;
        for (int i = grid_tid; i < N; i += grid_stride) {
            if (in[i] > thread_max)
                thread_max = in[i];
        }

        // Each thread puts its local max into shared memory
        sdata[block_tid] = thread_max;
        block.sync();

        // Block-level reduction
        if (block_tid < 128) {
            if (sdata[block_tid + 128] > thread_max)
                thread_max = sdata[block_tid + 128];
            sdata[block_tid] = thread_max;
        }
        block.sync();
        if (block_tid < 64) {
            if (sdata[block_tid + 64] > thread_max)
                thread_max = sdata[block_tid + 64];
            sdata[block_tid] = thread_max;
        }
        block.sync();
        if (block_tid < 32) {      // fold the last 64 partial results into the first warp
            if (sdata[block_tid + 32] > thread_max)
                thread_max = sdata[block_tid + 32];
        }

        // Warp-level reduction
        if (tile.meta_group_rank() == 0) {
            thread_max = cg::reduce(tile, thread_max, cg::greater<int>());
        }

        if (block_tid == 0)
            atomicMax(max, thread_max);
    }

79
Performance Analysis

• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^28
• Uniform distribution (-50, 50).

    Implementation          | Thread Block Size | GPU Time (ms)
    global atomics          | 256               | 6.839
    shared memory atomics   | 256               | 1.334
    shared memory reduction | 256               | 1.066

80
Using Atomics Efficiently
Example #2: vector update

• Problem description: a = a + b * c

__global__
void vector(const float * __restrict__ b,
const float * __restrict__ c,
float * __restrict__ a,
int N) {
int grid_tid = cg::this_grid().thread_rank();
int grid_stride = cg::this_grid().num_threads();

for (int i = grid_tid; i < N; i += grid_stride) {


a[i] += b[i] * c[i];
}
}

Memory operations = 3 reads + 1 write

81 81
Using Atomics Efficiently
Example #2: vector update

• Optimization: use atomics to update each vector element even though atomicity is not required.
    • Offload some of the computation to the L2 cache.
    • Saves reading the value of a[i] in registers.
    • This reduces the latency to compute each element of the vector.
    • Can result in more bytes in-flight!

    __global__
    void vector(const float * __restrict__ b,
                const float * __restrict__ c,
                float * __restrict__ a,
                int N) {
        int grid_tid = cg::this_grid().thread_rank();
        int grid_stride = cg::this_grid().num_threads();

        for (int i = grid_tid; i < N; i += grid_stride) {
            atomicAdd(&a[i], b[i] * c[i]);
        }
    }

Memory operations = 2 reads + 1 write

82
Performance Analysis
NVIDIA H100 SXM

NCU Memory Chart (Transfer Size)


• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^27

83
Performance Analysis
NVIDIA H100 SXM

NCU Memory Chart (Throughput)


• Experimental setup:
• NVIDIA H100 SXM, 1980 MHz
• Problem size = 2^27

84
Performance Analysis
NVIDIA H100 SXM

• ~5% increase in memory throughput translates into a corresponding reduction in execution time. Why?
• This kernel is DRAM bandwidth bound.

85
Summary

86
Which optimizations to focus on?
Solving the bottlenecks

• Compute bound
• Reduce instruction count.
• E.g., use vector loads/stores.
• Use tensor cores.
• Use lower precision arithmetic, fast math intrinsics.

• Bandwidth bound
• Reduce the amount of data transferred
• Optimize memory access patterns.
• Lower precision datatypes.
• Kernel fusion.

• Latency bound
• Increase number of instructions and memory accesses in-flight.
• Increase parallelism, occupancy.

87
Resources/Further Study

• CUDA best practices guide: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/


• CUDA samples: https://github.com/NVIDIA/cuda-samples

• GTC’24 sessions: https://www.nvidia.com/gtc/sessions/performance-optimization/


• Advanced Performance Optimization in CUDA [S62192]
• Performance Optimization for Grace CPU Superchip [S62275]
• Grace Hopper Superchip Architecture and Performance Optimizations for Deep Learning Applications [S61159]
• Multi GPU Programming Models for HPC and AI [S61339]
• More Data, Faster: GPU Memory Management Best Practices in Python and C++ [S62550]
• Harnessing Grace Hopper's Capabilities to Accelerate Vector Database Search [S62339]
• From Scratch to Extreme: Boosting Service Throughput by Dozens of Times with Step-by-Step Optimization [S62410]

88