AMPE (specific part, 2024-25)
Unit 4: GPU Architectures
March 2025
Outline
• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures
Introduction
GPU basics
[Several image slides]
• GPU in HPC: [figure]
Introduction
Quick recap of GPU evolution
CPUs: a few complex cores with large caches vs. GPUs: thousands of simple cores with small caches
Introduction
The graphics side
[Image slides: the graphics pipeline, from primitives to the rendered frame]
Introduction
The parallel computing side
Introduction
Fine-grained multithreading
– CPUs typically use the SMT model (e.g., Intel Hyper-Threading enables 2 threads per core; IBM POWER allows up to 8 threads per core)
– The Sun UltraSPARC T2 used fine-grained MT, with 8 threads per core
– Today's GPUs are based on the fine-grained MT model, with thousands of threads!
Outline
• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures
Programming model
Overview
Programming model
GPUs are SIMD engines
Programming model
Thread hierarchy
Programming model
Thread hierarchy
• The number of threads per block and the number of blocks per grid (and their dimensions) are set by the programmer when the kernel is launched:
kernel<<<dimGrid, dimBlock>>>(parameters);
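For example (a hedged sketch: the kernel name, sizes, and argument are illustrative, not from the slides):

    dim3 dimBlock(16, 16);                   // 256 threads per block
    dim3 dimGrid(64, 64);                    // 64x64 blocks cover a 1024x1024 domain
    myKernel<<<dimGrid, dimBlock>>>(d_A);    // hypothetical kernel and device pointer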
Programming model
Sample GPU Program
Slide credit: Hyesoon Kim
Programming model
Example: Average Filters
Average over a 3x3 window of a 16x16 array, with shared memory:

kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
    __shared__ int smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
                … + smem[i+1][j+1] ) / 9;
}

Do you notice any problems here? Hazards?
Programming model
Example: Average Filters
Average over a 3x3 window of a 16x16 array, with shared memory & synchronization:

kernelF<<<(1,1),(16,16)>>>(A);

__global__ kernelF(A){
    __shared__ int smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
                … + smem[i+1][j+1] ) / 9;
}
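A self-contained version of this kernel, as a sketch: explicit types are added, the array is passed as a flat pointer, and an interior-only guard handles the window falling off the 16x16 edge (the slides leave boundary handling unspecified; __syncthreads() removes the read-after-write hazard on smem):

    __global__ void kernelF(int *A) {
        __shared__ int smem[16][16];
        int i = threadIdx.y, j = threadIdx.x;
        smem[i][j] = A[i * 16 + j];              // load tile into shared memory
        __syncthreads();                          // all loads complete: no RAW hazard
        if (i > 0 && i < 15 && j > 0 && j < 15)   // interior threads only
            A[i * 16 + j] = (smem[i-1][j-1] + smem[i-1][j] + smem[i-1][j+1] +
                             smem[i  ][j-1] + smem[i  ][j] + smem[i  ][j+1] +
                             smem[i+1][j-1] + smem[i+1][j] + smem[i+1][j+1]) / 9;
    }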
Programming model
Synchronization within a thread block
Programming model
Synchronization between thread blocks
• In addition to the shared memory and global memory, threads can also access data from a private local memory:
  – Used by CUDA for register spilling, stack frames, or variables that do not fit in the thread's registers
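As an illustrative sketch (the kernel is hypothetical, not from the slides): a per-thread array indexed with a runtime value is a typical case the compiler places in local memory, since it cannot live in registers:

    __global__ void histogram_private(const int *in, int *out, int n) {
        int bins[32] = {0};                  // per-thread array: the dynamic indexing
        for (int k = threadIdx.x; k < n; k += blockDim.x)
            bins[in[k] % 32]++;              //   below likely forces it to local memory
        for (int b = 0; b < 32; b++)         // (assumes nonnegative input values)
            atomicAdd(&out[b], bins[b]);     // merge private bins into the global result
    }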
Programming model
NVIDIA GPU Instruction set architecture (ISA)
Programming model
PTX instruction set
Programming model
Basic PTX instruction set
Programming model
Example PTX code
Programming model
PTX instruction set
Programming model
Example SASS code
• SAXPY code compiled with CUDA 8.0 for NVIDIA Pascal: [figure]
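For reference, the canonical SAXPY kernel that such SASS is typically compiled from (a standard formulation; the slide's exact source is not shown):

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            y[i] = a * x[i] + y[i];                     // y = a*x + y
    }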
Outline
• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures
The SIMT Core
Modern GPU architecture
• Thread blocks (composed of one or more warps) are distributed to the SIMT cores by the compute work distribution unit
• Each GPU thread has its own private registers, private per-thread memory, program counter, and thread execution state:
  – Zero-overhead thread management and scheduling (done in hardware) and fast barrier synchronization enable fine-grained parallelism
  – To efficiently manage and execute thousands of threads, threads (of the same thread block) are dynamically grouped into warps (NVIDIA) or wavefronts (AMD):
    • A set of parallel threads that execute the same instruction together (it can be seen as a SIMD operation formed by hardware): typically 32 threads per warp
    • Single-Instruction Multiple-Thread (SIMT) architecture: a processor architecture that applies one instruction to multiple independent threads in parallel
    • Individual parallel threads within a warp are of the same type and start together at the same program address, but are free to branch and execute independently (see the sketch below)
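As an illustrative sketch of that independence and its cost (not from the slides): a data-dependent branch can split a warp, and the hardware executes each side with the other threads masked off before reconverging:

    __global__ void divergent(int *A) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (A[tid] > 0)             // threads of one warp may take different sides
            A[tid] *= 2;            // one side runs while the other is masked off
        else
            A[tid] = -A[tid];       // then the other side runs
        A[tid] += 1;                // the warp has reconverged here
    }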
The SIMT Core
SPMD vs. SIMT

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

• Warp: a set of threads that execute the same instruction (i.e., at the same PC); here, each loop iteration becomes one parallel thread
• SPMD (Single Program Multiple Data) programming model vs. SIMT (Single-Instruction, Multiple-Thread) execution model
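In CUDA this loop becomes a kernel in which each thread performs one iteration (a standard sketch; the launch line assumes hypothetical device pointers):

    __global__ void vadd(const float *A, const float *B, float *C, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one loop iteration per thread
        if (i < N)
            C[i] = A[i] + B[i];
    }
    // launch: vadd<<<(N + 255) / 256, 256>>>(d_A, d_B, d_C, N);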
The SIMT Core
Conceptual view of a SIMT core

• Multithreaded processor: up to 64 threads, in up to 16 warps
• Instruction cache and multithreaded instruction unit
• 8 streaming processors (SP): scalar arithmetic units (integer & FP), pipelined (1 instruction/thread/cycle)
• 2 special function units (SFU): FP sine, cosine, log2, exp2, square root, reciprocal, …
• Register file (RF): 1024 general-purpose 32-bit registers, partitioned among the assigned threads (16-64 registers per thread)
• Interconnection network to the unified L1 data cache & shared memory, and to the memory interface
The SIMT Core
Microarchitecture of a generic SIMT core
The SIMT Core
High-level view of the overall pipeline (single scheduler)
• Example of a SIMT stack fully managed in hardware: 4 threads per warp
[Figure: control-flow graph (CFG) and the SIMT stack contents after line 6, showing the top of stack and the next instruction to execute]
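A minimal C sketch of one such stack, assuming the common reconvergence-at-the-immediate-post-dominator scheme (the entry layout and names are illustrative):

    typedef struct {
        unsigned pc;      /* next PC for this control-flow path */
        unsigned rpc;     /* reconvergence PC (immediate post-dominator of the branch) */
        unsigned mask;    /* active-thread bitmask (4 bits for the 4-thread example) */
    } simt_entry;

    simt_entry stack[32]; /* per-warp SIMT stack; depth is illustrative */
    int top = 0;

    /* On a divergent branch: overwrite the top entry with a reconvergence entry
       (pc = rpc = the post-dominator, mask = union of both paths), then push one
       entry per path outcome with the subset of threads that took it.
       On every fetch: if stack[top].pc == stack[top].rpc, pop; the paths have
       reconverged and execution continues with the entry below. */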
The SIMT Core
High-level view of the overall pipeline (single scheduler)
• Warp scheduling:
  – Each SIMT core must execute many warps: in which order should these warps be scheduled?
  – For simplicity, assume that each warp issues a single instruction when it is scheduled and is not eligible to issue another instruction until the first completes:
    • For an ideal memory system (memory requests serviced within a fixed time), we could hide this latency by supporting enough warps and applying fine-grained multithreading: scheduling warps in round-robin (RR) order, one each cycle (see the sketch after this list)
    • To enable a different warp to issue an instruction each cycle, each thread must have its own registers (to avoid saving/restoring register state to/from memory)
  – In practice, memory latency depends on the application's locality and the resulting contention among off-chip memory accesses: locality properties can either favor or discourage RR scheduling; when different threads share data, RR is preferred
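A loose C sketch of the round-robin policy just described, abstracting away the hardware details (function and array names are illustrative):

    /* Scan warps starting just after the last issuer; return the first ready one. */
    int select_warp_rr(int last_issued, int nwarps, const int *ready) {
        for (int k = 1; k <= nwarps; k++) {
            int w = (last_issued + k) % nwarps;
            if (ready[w])
                return w;          /* this warp issues in the current cycle */
        }
        return -1;                 /* no eligible warp: pipeline bubble */
    }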
The SIMT Core
Two-loop approximation
The SIMT Core
Three-loop approximation
The SIMT Core
Three-loop approximation: the operand collector
[Figure: issued instructions reading their operands from the register file banks]
It takes 6 cycles for the 3 instructions to read all their operands (most of the banks sit idle each cycle)
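A toy C model of this serialization, assuming four single-ported banks and the swizzled mapping bank = (warp + reg) mod 4 (the mapping and counts are illustrative assumptions):

    #define NBANKS 4

    /* Each cycle every bank serves at most one read; count cycles until all
       pending operand reads complete (assumes nreads <= 64). */
    int cycles_to_read(int nreads, const int *bank_of) {
        int served[64] = {0}, done = 0, cycles = 0;
        while (done < nreads) {
            int busy[NBANKS] = {0};
            for (int i = 0; i < nreads; i++)
                if (!served[i] && !busy[bank_of[i]]) {
                    busy[bank_of[i]] = 1;   /* bank taken for this cycle */
                    served[i] = 1;
                    done++;
                }
            cycles++;                       /* reads to the same bank must wait */
        }
        return cycles;
    }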
The SIMT Core
Three-loop approximation: the operand collector
• Example: [figure: issued instructions buffered in collector units while their operands are read]
  – Each collector unit can read one operand each cycle
  – Reads from several warps can be performed each cycle (increasing throughput)
The SIMT Core
Three-loop approximation: the operand collector
– We must require that instructions from the same warp leave the operand collector for the execution units in program order
The SIMT Core
Three-loop approximation: structural hazards
Outline
• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures
Memory System
Memory spaces in a GPU
Memory System
Unified L1 data cache and shared memory
– The SRAM data array (❺) can be configured partly for direct-mapped access (shared memory) and partly as a set-associative cache
– L1 data cache misses and bank conflicts are handled using a replay mechanism (a non-stalling interface with the instruction pipeline)
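On the software side this split is configurable per kernel through the CUDA runtime; cudaFuncSetCacheConfig is the real API call, while the kernel name here is illustrative:

    // Hint that myKernel benefits from a larger shared-memory partition...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // ...or, alternatively, from a larger L1 data cache:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);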
Memory System
Shared memory access operations
[Figure: shared memory access, steps ❶-❹]
• A memory access request is received: one memory address per thread in the warp, plus the operation type
• Bank conflicts are determined by the arbiter; if found, the request is split into two parts:
  – a conflict-free part
  – the rest, to be retried later (buffered in the load/store unit)
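A standard software-side illustration of such conflicts (a well-known padding trick, not taken from the slides): with 32 four-byte-wide banks, reading a column of a 32-wide shared-memory tile sends all 32 threads of a warp to the same bank, and one word of padding per row removes the conflict:

    __shared__ float tile[32][32];    // tile[t][c]: a whole column maps to one
                                      //   bank, so a column read is a 32-way conflict
    __shared__ float padded[32][33];  // the extra word shifts each row by one bank,
                                      //   making the same column read conflict-free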
Memory System
Cache read operation (LD GLOBAL)
• Memory addresses are computed and coalescing rules are applied (individual memory accesses are grouped into coalesced accesses)
• The arbiter accepts or rejects the request depending on whether enough resources are available
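For example (illustrative kernels, not from the slides), consecutive threads touching consecutive words coalesce into a few wide transactions, while a large stride defeats coalescing:

    __global__ void coalesced(float *A) {
        A[threadIdx.x] = 0.0f;        // warp touches 32 consecutive words:
    }                                 //   one (or few) wide transactions
    __global__ void strided(float *A) {
        A[threadIdx.x * 32] = 0.0f;   // words 128 bytes apart:
    }                                 //   up to 32 separate transactions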
Memory System
Cache read operation (LD GLOBAL)
[Figure: later steps of the operation]
• Data to be written is placed in the write buffer (WB)
Memory System
L1 texture cache
Memory System
Memory partition unit
Outline
• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures
Evolution of NVIDIA architectures
From Tesla to Volta
[Chart: number of stream processors and peak GFLOPS per GPU generation: GTX 285 (2009), GTX 480 (2010), GTX 780 (2013), GTX 980 (2014), P100 (2016), V100 (2017)]
Evolution of NVIDIA architectures
NVIDIA Fermi (2009)
Evolution of NVIDIA architectures
NVIDIA Kepler (2012)
Evolution of NVIDIA architectures
NVIDIA Maxwell (2014)
Evolution of NVIDIA architectures
NVIDIA Pascal GP104 (2016)
Evolution of NVIDIA architectures
NVIDIA Pascal GP100 (2016)
Evolution of NVIDIA architectures
NVIDIA Volta (2017)
Evolution of NVIDIA architectures
NVIDIA Turing (2018)
Evolution of NVIDIA architectures
NVIDIA Ampere (2020)
Evolution of NVIDIA architectures
NVIDIA Hopper (2022)
Evolution of NVIDIA architectures
Tensor cores
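As a sketch of how these units are exposed to software, the CUDA warp-level wmma API (a real API, available on Volta and later; this minimal fragment computes one 16x16x16 tile D = A x B with half-precision inputs, a float accumulator, and assumed 16-element leading dimensions):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void one_tile(const half *A, const half *B, float *D) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
        wmma::fill_fragment(acc, 0.0f);                 // start from C = 0
        wmma::load_matrix_sync(a, A, 16);               // the whole warp cooperates
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);                 // one tensor-core MMA
        wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
    }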