
ARQUITECTURAS MULTIMEDIA Y DE PROPÓSITO ESPECÍFICO (2024-25)

Unit 4
GPU Architectures

March 2025
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 2
Introduction
GPU basics

• GPU stands for Graphics Processing Unit
  – A processor optimized for 2D and 3D graphics, video, visual computing and display

• GPUs are highly parallel, highly multithreaded multiprocessors optimized for visual computing
  – Visual computing refers to a mix of graphics processing and computing that allows for real-time visual interaction with computed objects via graphics, images and video

• Modern GPUs have a unified graphics and computing architecture that serves as both a programmable graphics processor and a scalable parallel computing platform

GPU Architectures 3
Introduction
GPU basics

• PCs and game consoles combine a GPU with a CPU
  – Heterogeneous systems

• GPUs were initially introduced to enable real-time rendering with a focus on video games
  – Rendering is the process of generating an image (e.g., a photorealistic image) from a 2D or 3D model
  – Accomplished through a logical graphics pipeline that is “mapped” to the GPU hardware
  – Involves a huge amount of floating-point operations executed by the GPU

• Why not use the enormous computational power of a GPU for non-graphics applications?
GPU Architectures 4
Introduction
GPU basics

• GPU in Graphics Card

GPU Architectures 5
Introduction
GPU basics

• GPU in Mobile Processors:

Image: Nvidia Jetson TX1 (Tegra X1 SOC)

  – The Apple A8 application processor devotes more chip area to its integrated GPU than to the CPU cores

GPU Architectures 6
Introduction
GPU basics

• GPU in HPC:

GPU Architectures 7
Introduction
Quick recap of GPU evolution

• No such thing as a GPU in PCs 30 years ago
  – Graphics were performed by a Video Graphics Array (VGA) controller: a simple memory controller and display generator connected to DRAM

• By the late 1990s, advances in semiconductor technology allowed more functions to be added to VGA controllers
  – 3D acceleration functions such as hardware for triangle setup and rasterization, texture mapping and shading

• The GPU term was coined in 2000, as single-chip graphics processors fully implemented high-end graphics pipelines
• GPUs became more and more programmable…
  – NVIDIA GeForce 3 (2001) included vertex shaders and pixel shaders
GPU Architectures 8
Introduction
Quick recap of GPU evolution

• …and more precise
  – Integer and fixed-point arithmetic replaced by floating-point arithmetic

• Efforts by researchers to leverage GPUs to implement linear algebra (mapping matrix data into textures and applying shaders) inspired NVIDIA to directly support general-purpose computing on GPUs (and graphics, of course)
  – Processor instructions and memory hardware were added to support general-purpose programming languages
  – A programming environment was created to allow GPUs to be programmed using familiar languages (such as C and C++)

• But… not all applications can benefit from running on a GPU


GPU Architectures 9
Introduction
What makes the difference between a CPU and a GPU?

• CPU vs. GPU chip area

Few complex cores, large caches vs. Thousands of simple cores, small caches

GPU Architectures 10
Introduction
The graphics side

• GPUs and associated drivers implement the model of graphics processing defined by current APIs
  – OpenGL is an open standard for 3D graphics programming originally proposed and defined by Silicon Graphics Incorporated
  – Direct3D is the interface for 3D graphics defined by Microsoft (as part of DirectX, a series of multimedia programming interfaces)
• OpenGL and Direct3D have well-defined behaviors → build an efficient HW accelerator to support the execution of their functions
• OpenGL and Direct3D are similarly structured: they define a logical graphics processing pipeline that is mapped onto the GPU, and programming models and languages for the programmable pipeline stages
• Graphics applications call API functions, which use the GPU as a coprocessor: API functions send commands, programs and data to the GPU through the corresponding driver

GPU Architectures 11
Introduction
The graphics side

• The GPU is used to render triangles… billions of triangles per second!

GPU Architectures 12
Introduction
The graphics side

• The graphics pipeline
  – Sequence of vertices grouped into geometric primitives (points, lines, triangles and polygons)

GPU Architectures 13
Introduction
The graphics side

• The graphics pipeline
  (Figures on slides 14-17: the stages of the logical graphics pipeline)

GPU Architectures 14-17
Introduction
The graphics side

• Graphics pipeline on GPU
  (Figures on slides 18-21: mapping of the logical graphics pipeline onto the GPU hardware)

GPU Architectures 18-21
Introduction
The parallel computing side

• A modern GPU is composed of many cores (SIMT cores), each of them able to run on the order of thousands of threads
  – The large number of threads running on each core makes it possible to hide the latency of memory accesses (fine-grained multithreaded architecture)

GPU Architectures 22
Introduction
Fine-grained multithreading

• Idea: the hardware has multiple thread contexts (PC + registers). Each cycle, the fetch engine fetches from a different thread
  – By the time the fetched branch/instruction resolves, no instruction has been fetched from the same thread
  – Branch/instruction resolution latency is overlapped with the execution of other threads’ instructions
• Advantage: no logic needed for handling control and data dependences within a thread
• Disadvantages: single-thread performance suffers; extra logic is needed for keeping thread contexts; latency is not hidden if there are not enough threads to cover it

GPU Architectures 23
Introduction
Fine-grained multithreading

– CPUs typically use the SMT model (e.g., Intel Hyper-Threading enables 2 threads per core; IBM POWER allows up to 8 threads per core)
– The Sun UltraSPARC T2 used fine-grained MT, with 8 threads per core
– Today’s GPUs are based on the fine-grained MT model, with thousands of threads!
GPU Architectures 24
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 25
Programming model
Overview

• Programming model vs. hardware execution model
  – The programming model refers to how the programmer expresses the code
    • E.g., sequential (von Neumann), data parallel (SIMD), dataflow, multi-threaded (SPMD, MIMD), …
  – The execution model refers to how the hardware executes the code underneath
    • E.g., out-of-order execution, vector processor, array processor, dataflow processor, multiprocessor, multithreaded processor, …

• The execution model can be very different from the programming model
  – E.g., the von Neumann model implemented by an OoO processor
  – E.g., the SPMD model implemented by a SIMD processor (a GPU)

GPU Architectures 26
Programming model
GPUs are SIMD engines

• The instruction pipeline operates like a SIMD pipeline (e.g., a vector processor)
• However, programming is done using threads, NOT SIMD instructions
  – SPMD programming model
    • Each thread executes the same code but operates on a different piece of data
    • Each thread has its own context (i.e., it can be treated/restarted/executed independently)
    • All threads share a common memory space called global memory

• A GPU computing application starts execution on the CPU (see the sketch below):
  – Discrete GPUs: the CPU code allocates memory for use in computation on the GPU, initiates transfers of input data into GPU memory and launches the computational kernel on the GPU
  – Integrated GPUs: only the last step is needed
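A minimal host-side sketch of these steps for a discrete GPU (the kernel myKernel and the function wrapper are hypothetical, used only for illustration):

  #include <cstdlib>
  #include <cuda_runtime.h>

  __global__ void myKernel(float *data, int n);   // hypothetical kernel, defined elsewhere

  void runOnDiscreteGPU(int N, int nblocks, int nthreads) {
      float *h_in = (float *)malloc(N * sizeof(float));                  // input prepared on the CPU
      float *d_in;
      cudaMalloc(&d_in, N * sizeof(float));                              // 1. allocate GPU memory
      cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice); // 2. transfer the input data
      myKernel<<<nblocks, nthreads>>>(d_in, N);                          // 3. launch the computational kernel
      cudaMemcpy(h_in, d_in, N * sizeof(float), cudaMemcpyDeviceToHost); // copy the results back
      cudaFree(d_in);
      free(h_in);
  }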
GPU Architectures 27
Programming model
Thread hierarchy

• Computational kernels are made of thousands of threads
• Threads must be organized properly so that the same kernel code can run on GPUs with varying architectural parameters (e.g., number of cores per multiprocessor)
• Thread hierarchy in CUDA:

GPU Architectures 28
Programming model
Thread hierarchy

• Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and share access to a private memory space
• Grid: a set of thread blocks that are independent and thus can be executed independently

GPU Architectures 29
Programming model
Thread hierarchy

• The # of threads per block and the # of blocks per grid (and their layout) are set by the programmer when the kernel is called:
  kernel<<<dimGrid, dimBlock>>>(parameters);

• Each thread is given a unique number (threadIdx) within its thread block (0, 1, …, blockDim-1) and each thread block is given a unique number (blockIdx) within its grid:
  – CUDA supports blocks of up to 1024 threads (block sizes should be a multiple of 32, the warp size)
  – Thread blocks and grids may have 1, 2 or 3 dimensions, accessed through the .x, .y and .z index fields (positions of the thread and of the block)
  – dimGrid and dimBlock are 3-element vectors (dim3 type) or integers that specify the dimensions of the grid in blocks and of the blocks in threads
  – blockDim.x is the number of threads in the x dimension (same for y and z); gridDim.x is the number of blocks in the x dimension (same for y and z); see the sketch below
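A minimal sketch (not from the slides; the kernel and function names are hypothetical) of a 2D grid/block configuration and the per-thread indices it produces:

  // One thread per element of an H x W matrix stored in row-major order.
  __global__ void scale2D(float *M, int H, int W, float a) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
      int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
      if (row < H && col < W)                            // the grid may overshoot the matrix edges
          M[row * W + col] *= a;
  }

  void launchScale2D(float *d_M, int H, int W) {
      dim3 dimBlock(16, 16);                              // 16x16 = 256 threads per block
      dim3 dimGrid((W + 15) / 16, (H + 15) / 16);         // enough blocks to cover the whole matrix
      scale2D<<<dimGrid, dimBlock>>>(d_M, H, W, 2.0f);
  }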

GPU Architectures 30
Programming model
Sample GPU Program

CPU Code:

  void saxpy_serial(int N, float a, float *x, float *y) {
    for (int i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
  }

  int main() {
    float *x, *y;
    int N = 1<<20;
    saxpy_serial(N, 2.0, x, y);
  }

GPU Code:

  __global__ void saxpy_parallel(int N, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) y[i] = a*x[i] + y[i];
  }

  int main() {
    float *x, *y;
    int N = 1<<20;
    int nblocks = (N+255)/256;
    cudaMalloc(&x, N*sizeof(float));
    cudaMalloc(&y, N*sizeof(float));
    saxpy_parallel<<<nblocks,256>>>(N, 2.0, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
  }

Each loop iteration in the CPU code becomes an independent thread in the GPU code.

GPU Architectures 31
Slide credit: Hyesoon Kim
Programming model
Sample GPU program

• Parallelism is determined explicitly by specifying the dimensions
• Parallel execution and thread management are automatic:
  – All thread creation, scheduling and termination are handled by the underlying system
  – Threads are managed directly in hardware
• Threads within each thread block can communicate with each other efficiently via the shared memory:
  – A scratchpad memory that is close to where thread blocks are executed
  – Exposed to programmers as a different memory space (a special keyword allocates memory into it, i.e., “__shared__” in CUDA)
  – Acts as a software-controlled cache, used to keep data that is reused frequently and in a predictable manner
  – Small size, ranging from 16 to 64 KiB

GPU Architectures 32
Programming model
Example: Average Filters

Average over a 3x3 window for a 16x16 array:

  kernelF<<<(1,1),(16,16)>>>(A);

  __global__ kernelF(A){
    i = threadIdx.y;
    j = threadIdx.x;
    tmp = (A[i-1][j-1] + A[i-1][j] +
           … + A[i+1][j+1] ) / 9;
    A[i][j] = tmp;
  }

Each thread loads 9 elements from global memory. It takes hundreds of cycles.

GPU Architectures 33
Programming model
Example: Average Filters

Average over a 3x3 window for a 16x16 array, with shared memory:

  kernelF<<<(1,1),(16,16)>>>(A);

  __global__ kernelF(A){
    __shared__ int smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
                … + smem[i+1][j+1] ) / 9;
  }

Do you notice any problems here? Hazards?
GPU Architectures 34
Programming model
Example: Average Filters

Average over a 3x3 window for a 16x16 array, with shared memory & synchronization:

  kernelF<<<(1,1),(16,16)>>>(A);

  __global__ kernelF(A){
    __shared__ int smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
                … + smem[i+1][j+1] ) / 9;
  }

GPU Architectures 35
Programming model
Synchronization within a thread block

• Threads within a block can synchronize at a synchronization barrier by calling __syncthreads()
  – No thread in the thread block can pass the barrier until all threads have reached it
  – After passing a barrier, all writes to memory performed by threads of the block BEFORE the barrier are visible to all threads
  – Threads in a block can thus communicate by writing and reading per-block shared memory at a synchronization barrier
• Since threads within each thread block can share memory and synchronize via barriers, they will be executed on the same physical processor:
  – # of thread blocks >> number of processors… but thread blocks are independent of each other

GPU Architectures 36
Programming model
Synchronization between thread blocks

• Virtualization into threads and thread blocks allows intuitive problem decompositions:
  – The # of thread blocks is dictated by the size of the data rather than by the # of processors
• It also allows the same CUDA program to scale to a widely varying # of processors
• Threads in different thread blocks may coordinate their activities using atomic memory operations on the global memory (see the sketch below):
  – Atomic memory operation: a memory read-modify-write operation sequence that completes without any intervening access
  – Global memory: a per-application memory space shared by all threads
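A minimal CUDA sketch (not from the slides; the kernel and variable names are hypothetical) of cross-block coordination through an atomic operation on global memory:

  // Every thread that finds a matching element increments a single global counter.
  // atomicAdd() performs the read-modify-write without any intervening access,
  // so threads from different thread blocks can safely update the same location.
  __global__ void countMatches(const int *data, int N, int key, int *counter) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N && data[i] == key)
          atomicAdd(counter, 1);
  }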
GPU Architectures 37
Programming model
Memory spaces that threads can access to

• In addition to the shared memory and the global memory, threads can also access data from a private local memory:
  – Used by CUDA for register spilling, stack frames or variables that do not fit in the thread’s registers

GPU Architectures 38
Programming model
NVIDIA GPU Instruction set architecture (ISA)

• NVIDIA compilers translate CUDA programs (in C/C++) into a virtualized ISA called Parallel Thread Execution (PTX):
  – An abstraction of the hardware ISA that provides compatibility across generations of GPUs (the hardware ISA is hidden from the programmer)
  – Fully documented by NVIDIA with each release of CUDA
  – Similar to a RISC ISA like MIPS, but with an unlimited set of virtual registers
• PTX code is converted to the real ISA supported by the hardware (SASS, Streaming ASSembler) when sent to the GPU:
  – The translation from PTX to SASS is done either by the GPU driver or by a stand-alone program called ptxas (part of CUDA)
  – Among other things, PTX registers are mapped to the fixed number of hardware registers available on the actual device

GPU Architectures 39
Programming model
PTX instruction set

• The format of a PTX instruction is:

  opcode.type d, a, b, c;

  where d is the destination operand (a register, or a memory address for stores); a, b and c are source operands (32-bit or 64-bit registers or constants); and type is one of: untyped bits (.b8, .b16, .b32, .b64), unsigned integer (.u8, .u16, .u32, .u64), signed integer (.s8, .s16, .s32, .s64) or floating point (.f16, .f32, .f64)

• In general, PTX instructions map one-to-one to hardware instructions
GPU Architectures 40
Programming model
Basic PTX instruction set

GPU Architectures 41
Programming model
Basic PTX instruction set

GPU Architectures 42
Programming model
Example PTX code

• For the SAXPY code (the PTX listing is shown as a figure):

GPU Architectures 43
Programming model
PTX instruction set

• All PTX instructions can be predicated by 1-bit predicate registers, which can be set by the setp instruction:
  – The instruction is executed only when its predicate bit is 1
  – Placing a predicate before a branch instruction creates conditional branches
• Unlike vector architectures (and processors with vector extensions), there are no separate instructions for sequential data transfers, strided data transfers or gather-scatter data transfers:
  – All data transfers are gather-scatter
  – Data transfers by adjacent threads to sequential addresses are recognized and transformed by the hardware into a single memory request

GPU Architectures 44
Programming model
Example SASS code

• SAXPY code compiled with CUDA 8.0 for NVIDIA Fermi:

GPU Architectures 45
Programming model
Example SASS code

• SAXPY code compiled with CUDA 8.0 for NVIDIA Pascal:

GPU Architectures 46
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 47
The SIMT Core
Modern GPU architecture

• A modern GPU is composed of many SIMT cores:
  – What NVIDIA calls Streaming Multiprocessors (SMs) and AMD calls Compute Units (CUs)
  – Basic GPUs have a few SMs (2-4); high-end ones have dozens
  – SMs execute vertex, geometry and pixel fragment shader programs, as well as parallel computing programs
  (Figure: vertex, fragment and compute work distribution blocks dispatch work to the SIMT cores)

GPU Architectures 49
The SIMT Core
Modern GPU architecture
• Thread blocks of one or more kernels are dynamically distributed among the SMs for execution:
  (Figure: the compute work distribution unit dispatching thread blocks to the SMs)
GPU Architectures 50
The SIMT Core
Modern GPU architecture

• Each GPU thread has its own private registers, private per-thread memory, program counter, and thread execution state:
  – Zero-overhead thread management and scheduling (done in hardware) and fast barrier synchronization enable fine-grained parallelism
  – To efficiently manage and execute thousands of threads, threads (of the same thread block) are dynamically grouped into warps (NVIDIA) or wavefronts (AMD):
    • A warp is a set of parallel threads that execute the same instruction together (it can be seen as a SIMD operation formed by hardware): typically 32 threads per warp
    • Single-Instruction Multiple-Thread (SIMT) architecture: a processor architecture that applies one instruction to multiple independent threads in parallel
    • Individual parallel threads within a warp are of the same type and start together at the same program address, but are free to branch and execute independently

GPU Architectures 51
The SIMT Core
SPMD on SIMT Machine

  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];

(Figure: the loop iterations are mapped onto the lanes of Warp 0, which executes the load, load, add and store instructions in lockstep at PCs X, X+1, X+2 and X+3)

• Warp: a set of threads that execute the same instruction (i.e., at the same PC)
• SPMD (Single Program Multiple Data) programming model on a SIMT (Single-Instruction, Multiple-Thread) execution model
GPU Architectures 52
The SIMT Core
Conceptual view of a SIMT core

(Figure: a SIMT core containing an instruction cache, a multithreaded instruction unit that handles up to 16 warps, 8 streaming processors (SPs) with their register files (RFs), 2 special function units (SFUs), an interconnection network, a unified L1 data cache & shared memory, and a memory interface)

• SP (streaming processor): scalar arithmetic units (integer & FP), pipelined (1 instr/thread/cycle), multithreaded (up to 64 threads)
• RF (register file): 1024 general-purpose 32-bit registers, partitioned among the assigned threads (16-64 registers per thread)
• SFU (special function unit): FP sine, cosine, log2, exp2, square root, reciprocal

GPU Architectures 53
The SIMT Core
Microarchitecture of a generic SIMT core

• The pipeline of each SIMT core can be divided into a SIMT front-end and a SIMD back-end:
  – Three scheduling loops act together in a single pipeline:
    1) Instruction fetch loop: Fetch, I-Cache, Decode and I-Buffer blocks
    2) Instruction issue loop: I-Buffer, Scoreboard, Issue and SIMT Stack blocks
    3) Register access scheduling loop: Operand Collector, ALU and Memory blocks
• We present an increasingly detailed view of the SIMT core microarchitecture
GPU Architectures 54
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Remember: to increase efficiency, threads are organized into warps → the unit of scheduling is a warp:
  – In each cycle, the hardware selects a warp for scheduling
  – The warp’s PC is used to access instruction memory (I-Cache) to obtain the next instruction to execute for the warp
  – The instruction is decoded
  – Source registers are read from the register file & the SIMT execution mask values are determined
  – Each thread executes on the functional unit (FU) associated with a lane, provided its SIMT execution mask bit is set (there are several types of functional units in a GPU: special FUs, load/store units, FP FUs, integer FUs, Tensor Cores, RT cores)
    • If the warp size > the number of FU lanes, the warp is executed over several clock cycles

GPU Architectures 55
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• SIMT execution masking:
  – The GPU programming model presents the programmer with the abstraction that threads execute completely independently
    • Achieved via a combination of predication, a stack of predicate masks called the SIMT stack, and special instructions & markers
  – The SIMT stack helps efficiently handle two key issues:
    • Nested control flow (one branch is control dependent upon another)
    • Skipping computation entirely while all threads in a warp avoid a control flow path
  – Each warp uses its own SIMT stack:
    • Each entry stores a reconvergence PC, the PC of the next instruction to execute, and an active mask
  – The SIMT stack is at least partly managed by special instructions (at the SASS level):
    • Special instructions to push stack entries
    • Special instructions and instruction markers that pop a stack entry or unwind the stack to a specified entry and branch to the target instruction address with the target thread-active mask
GPU Architectures 56
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: CUDA C source code and the corresponding PTX assembly code)
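The code on the slide is shown only as a figure; a minimal CUDA sketch (hypothetical, not the slide's exact code) of the kind of nested control flow the SIMT stack has to handle:

  __global__ void divergent(const int *data, int *out) {
      int tid = threadIdx.x;
      if (data[tid] > 0) {          // A: some threads of the warp take this path...
          if (data[tid] > 10)       // B: ...and diverge again (nested branch)
              out[tid] = 1;         // C
          else
              out[tid] = 2;         // D
      } else {
          out[tid] = 3;             // F: the remaining threads take this path
      }
      out[tid] += tid;              // G: reconvergence point for the whole warp
  }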

GPU Architectures 57
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: control flow graph (CFG) of the example code, for a warp of 4 threads)

GPU Architectures 58
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: CFG of the example code, for a warp of 4 threads; the divergent paths are serialized)
• How does the GPU hardware enable threads within a warp to follow different paths, given that all threads must execute the same instruction? → Serialization of the divergent paths

GPU Architectures 59
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: CFG for a warp of 4 threads, together with the SIMT stack contents after line 6; each entry holds the next instruction to execute, the reconvergence PC and the active mask, and the top of the stack indicates the path currently being executed)
• How does the GPU hardware enable threads within a warp to follow different paths, given that all threads must execute the same instruction? → Serialization of the divergent paths, by means of the SIMT stack

GPU Architectures 60
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Warp scheduling:
  – Each SIMT core must execute many warps:
    • In which order should these warps be scheduled?
  – For simplicity, assume that each warp issues a single instruction when it is scheduled and that the warp is not eligible to issue another instruction until the first one completes:
    • For an ideal memory system (memory requests serviced within a fixed time), we could hide this latency by supporting enough warps and applying fine-grained multithreading: scheduling warps in round-robin (RR) order, one each cycle
    • To enable a different warp to issue an instruction each cycle, each thread needs its own registers (to avoid saving/restoring register state to/from memory)
  – In practice, memory latency depends upon the application’s locality and the resulting contention encountered by off-chip memory accesses
    • Locality properties can either favor or discourage RR scheduling: when different threads share data, RR is preferred

GPU Architectures 61
The SIMT Core
Two-loop approximation

• To help reduce the number of warps per core needed to hide long execution latencies, we should be able to issue a subsequent instruction from a warp while some of its earlier instructions have not yet completed:
  – We need a mechanism to know whether the next instruction to issue for each warp depends upon an earlier instruction of the warp that has not yet completed execution
  – Instruction buffer (I-Buffer): contains fetched instructions and is used to detect data and/or structural hazards
    • A separate scheduler decides which of the instructions in the I-Buffer can be issued
    • The I-Buffer has storage for one or more instructions per warp
  – How are data dependencies between instructions within the same warp detected?
    • Using a scoreboard mechanism (simplicity → consumes less energy & area)
GPU Architectures 62
The SIMT Core
Two-loop approximation: the scoreboard

• Scoreboard for a single-threaded in-order CPU:
  – The scoreboard has one bit per register:
    • Whenever an instruction that will write to a particular register issues, the corresponding bit in the scoreboard is set
  – An instruction cannot be issued until all the scoreboard bits associated with the registers it reads or writes are cleared:
    • Prevents both RAW and WAW hazards (WAR hazards are prevented by in-order instruction processing)
• GPUs use in-order scoreboards, but there are challenges (see the sketch below):
  – Large number of registers contained in a modern GPU
    • 16K 32-bit registers per warp scheduler in a SIMT core
  – A big number of warps may be repeatedly probing the scoreboard waiting for their dependencies to be cleared → several read ports are required (16 warps/scheduler & 2 instr. per warp & 4 operands per instr. → 128 read ports if all warps probe the scoreboard each cycle)
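A minimal host-side sketch (an illustration only, not a hardware model) of the in-order scoreboard idea, with one bit per register:

  #include <bitset>

  struct Scoreboard {
      std::bitset<256> pending;   // bit set => the register has a write in flight

      // An instruction may issue only if none of its source/destination
      // registers has a pending write (this prevents RAW and WAW hazards).
      bool canIssue(const int regs[], int n) const {
          for (int k = 0; k < n; k++)
              if (pending.test(regs[k])) return false;
          return true;
      }
      void onIssue(int destReg)     { pending.set(destReg);   } // mark the write as in flight
      void onWriteback(int destReg) { pending.reset(destReg); } // clear it when the write completes
  };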

GPU Architectures 63
The SIMT Core
Three-loop approximation

• In the two-loop approximation architecture:
  – The first loop selects a warp that has space in the I-Buffer, looks up its PC and performs an I-Cache access to obtain the next instruction
  – The second loop selects an instruction in the I-Buffer that has no outstanding dependencies and issues it to the execution units (once the operands have been obtained from the register file)
• As discussed, each SIMT core has a large register file that contains separate physical registers for each warp:
  – The size of the register file in modern GPU architectures is 256 KiB
  – The area of an SRAM memory is proportional to the number of ports
    • We would need 1 port per operand per instruction issued per cycle
  – The area of the register file can be reduced by implementing it with multiple banks of single-ported memories:
    • The operand collector hides this implementation detail from the ISA and forms the third scheduling loop

GPU Architectures 64
The SIMT Core
Three-loop approximation: the operand collector

• A naive microarchitecture for providing increased register file bandwidth (register read stage):

GPU Architectures 65
The SIMT Core
Three-loop approximation: the operand collector

• A naive microarchitecture for providing increased register file bandwidth (register read stage):
  (Figure: issue and execution timeline for three instructions, e.g. mad and add; it takes 6 cycles for the 3 instructions to read their operands, and most of the banks are idle)

GPU Architectures 66
The SIMT Core
Three-loop approximation: the operand collector

• The operand collector microarchitecture:
  – Idea: replace the staging registers with collector units and assign each instruction a collector unit when it enters the register read stage
    • Multiple collector units can help improve throughput in the presence of bank conflicts
    • Each collector unit contains buffering space for all the source operands used by the instruction
    • The arbiter is more likely to find bank-level parallelism allowing simultaneous access to several banks
GPU Architectures 67
The SIMT Core
Three-loop approximation: the operand collector

• The operand collector schedules bank accesses to tolerate bank conflicts when they occur:
  – Conflicts between register operands of different warps can be reduced by allocating equivalent registers from different warps in different banks
  – Example: swizzled layout (see the sketch below)
  – Goes well with round-robin scheduling (warps make relatively even progress)
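A minimal sketch (an assumption about one common swizzling scheme, not necessarily the exact layout on the slide) of how equivalent registers of different warps can be spread across banks:

  // Naive layout: register r of every warp lands in the same bank, so the same
  // instruction issued by different warps always conflicts on that bank.
  int bank_naive(int warp, int reg, int num_banks) {
      return reg % num_banks;
  }

  // Swizzled layout: offset the mapping by the warp id, so warp 0's r0 and
  // warp 1's r0 live in different banks and can be read in the same cycle.
  int bank_swizzled(int warp, int reg, int num_banks) {
      return (warp + reg) % num_banks;
  }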
GPU Architectures 68
The SIMT Core
Three-loop approximation: the operand collector

• Example (figure: issue and execution of instructions from several warps through the collector units):
  – Each collector unit can read one operand each cycle
  – Reads for several different warps can be done each cycle (increasing throughput)
GPU Architectures 69
The SIMT Core
Three-loop approximation: the operand collector

• The operand collector may allow WAR hazards:
  – It does not impose any order among different instructions when they are ready to issue
  – Two instructions from the same warp may be held in different collector units, with one of them reading a register that the other will write:
    • The second instruction could potentially write a new value to the register before the first one has read the (correct) older value (e.g., due to repeated bank conflicts suffered by the first instruction while obtaining its source registers)
  – We must therefore require that instructions from the same warp leave the operand collector for the execution units in program order

GPU Architectures 70
The SIMT Core
Three-loop approximation: structural hazards

• There are many potential causes of structural hazards in GPU pipelines, for example:
  – The register read stage may run out of collector units
  – Conflicts in the memory system
• What to do when a structural hazard is found?
  – In a single-threaded in-order CPU pipeline we could stall younger instructions until the hazard disappears
  – Stalls are not a good idea in a GPU pipeline (they are costly to implement and ruin throughput: instructions from other warps that could otherwise advance would be stalled!)
• GPUs implement a form of instruction replay:
  – Instructions are held in the I-Buffer until they have completed
GPU Architectures 71
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 72
Memory System
Memory spaces in a GPU

• GPU computing kernels interact with the memory system through load and store instructions
• CPUs typically include two separate memory spaces:
  – The register file and memory
• GPUs subdivide memory into local and global memory spaces:
  – The local memory space is private per thread and typically used for register spilling
  – Global memory is used for data structures shared among threads
• Additionally, GPUs have an on-chip scratchpad memory with shared access among the threads of a thread block:
  – Shared memory is used for data known to be needed at a given step of the computation; it yields higher performance and saves energy
GPU Architectures 73
Memory System
First-level memory structures

• Comprised of the scratchpad memory, the L1 data cache and the L1 texture cache
• In CUDA, shared memory refers to a small, low-latency memory space accessible to all threads in a thread block:
  – Implemented as SRAM → latency similar to that of the register file
  – One bank per lane, with one read port and one write port per bank
  – Each thread has access to all banks → bank conflicts arise when two or more threads in a warp need different locations in the same bank in the same cycle (see the sketch below)
• The L1 data cache keeps a subset of the global memory address space:
  – In some architectures, the L1 cache contains only addresses not modified by kernels (this avoids complications due to the lack of cache coherence on GPUs)
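A minimal CUDA sketch (hypothetical kernel; it assumes the usual layout in which consecutive 32-bit words map to 32 consecutive banks) contrasting a conflicting and a conflict-free shared memory access pattern:

  __global__ void bankConflictDemo(float *out) {
      __shared__ float a[32][32];      // column accesses: every thread hits the same bank
      __shared__ float b[32][33];      // padding by one word shifts each row to a new bank

      int tid = threadIdx.x;
      a[tid][0] = tid;                 // word offset tid*32 -> bank 0 for all 32 threads (32-way conflict)
      b[tid][0] = tid;                 // word offset tid*33 -> bank tid (no conflict)
      __syncthreads();
      out[tid] = a[tid][0] + b[tid][0];
  }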
GPU Architectures 74
Memory System
First-level memory structures

• The L1 data cache keeps a subset of the global memory address space:
  – If all threads in a warp access locations within the same L1 data cache block and that block is not present in the cache, a single request is sent to the lower cache levels (the accesses are coalesced)
  – If the threads in the warp access different cache blocks, multiple memory accesses need to be generated (uncoalesced accesses); see the sketch after this list
• Programmers try to avoid both bank conflicts and uncoalesced accesses, but the hardware allows both (this eases programming)
• The L1 data cache and the scratchpad memory have been unified since NVIDIA’s Fermi:
  – This allows programs to decide how much memory they want to use for the shared memory and how much for the L1 data cache
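A minimal CUDA sketch (hypothetical kernels) contrasting a coalesced and an uncoalesced global memory access pattern:

  // Coalesced: consecutive threads of a warp read consecutive 4-byte words,
  // so the whole warp is served by one (or a few) wide memory requests.
  __global__ void copyCoalesced(const float *in, float *out, int N) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) out[i] = in[i];
  }

  // Uncoalesced: a large stride makes each thread of the warp touch a different
  // cache block, so multiple separate memory requests must be generated.
  __global__ void copyStrided(const float *in, float *out, int N, int stride) {
      int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
      if (i < N) out[i] = in[i];
  }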

GPU Architectures 75
Memory System
Unified L1 data cache and shared memory

– The SRAM data array can be configured partly for direct-mapped access (shared memory) and partly as a set-associative cache
– L1 data cache misses and bank conflicts are handled using a replay mechanism (a non-stalling interface with the instruction pipeline)

GPU Architectures 76
Memory System
Shared memory access operations
• A memory access request is received: one memory address per thread in the warp, plus the operation type
• Bank conflicts are determined by the arbiter; if any are found, the request is split into two parts:
  – A conflict-free part, which proceeds
  – The rest, buffered in the load/store unit to be retried (replayed) later
• For shared memory accesses the tag lookup in the tag unit is bypassed; the tag unit simply determines the banks to be accessed
• The address crossbar distributes the addresses to the individual banks; each bank is 32 bits wide and has its own decoder (a different row can be accessed in each bank)
• The shared memory lookup has single-cycle latency; the replayed portion goes back to the arbiter the following cycle (if there are still bank conflicts, the process is repeated)
• Data is returned to the appropriate lanes for storage in the register file via the data crossbar (only the lanes of active threads write values)

GPU Architectures 77
Memory System
Cache read operation (LD GLOBAL)
• Memory addresses are computed and the coalescing rules are applied (individual memory accesses are grouped into coalesced accesses)
• The arbiter accepts or rejects the request depending on whether enough resources are available
• A tag lookup is done in the tag unit to check for a hit or a miss
• On a HIT, the same row of the data array is accessed in the 32 banks (128 bytes)
• Data is returned to the register file via the data crossbar (only the lanes of active threads write values)

GPU Architectures 78
Memory System
Cache read operation (LD GLOBAL)
• Memory addresses are computed and the coalescing rules are applied (individual memory accesses are grouped into coalesced accesses); the arbiter accepts or rejects the request depending on the available resources
• On a MISS:
  – The arbiter informs the load/store unit that the request must be replayed
  – In parallel, it sends the request information to the PRT (pending request table, similar to MSHRs)
  – A memory request is forwarded to the MMU for virtual-to-physical address translation (the L1 is a VIVT cache); in particular, the request is sent to the corresponding memory partition unit (which contains a bank of the L2 cache & a memory access scheduler)
  – Once the data is returned, the fill unit accesses the PRT to recover the information about the request and the load is rescheduled
GPU Architectures 79
Memory System
Cache write operation (ST GLOBAL)

• Cache coherence is not supported: only local memory accesses or read-only global memory data can be placed in the L1 cache
• The L1 data cache can support both write-through (WT) and write-back (WB) policies:
  – WT: global memory
  – WB: local memory
• Data to be written is first placed in the write buffer
• For uncoalesced accesses, or when some threads are masked off, only a portion of a cache block is written
• Write-back: if the block is present in the cache, the data is written via the data crossbar; otherwise, the block must first be read from the L2 cache or DRAM

GPU Architectures 80
Memory System
L1 texture cache

• Recent NVIDIA and AMD GPUs combine the L1 data cache and the texture cache to save area:
  – Texture mapping is used in 3D graphics to make scenes look as realistic as possible: an image (texture) is applied to a surface
  – To implement texture mapping, the coordinates of one or more samples within the texture (texels) are first determined
  – From these coordinates, the memory addresses of the texels are calculated
  – There is significant locality in texture memory accesses (adjacent pixels map to adjacent texels and it is common to average the values of nearby texels) that can be exploited by means of a cache

GPU Architectures 81
Memory System
Memory partition unit

• High-performance GPUs connect to multiple DRAM chips in parallel via memory partition units:
  – This provides the large amount of memory bandwidth required by the SIMT cores
  – Memory traffic is distributed across the memory partition units using address interleaving (at granularities of 256 bytes or 1024 bytes)
  – The SIMT cores connect to the memory partition units via an on-chip interconnection network (crossbars for NVIDIA, rings for AMD)
• Each memory partition unit contains:
  – A portion of the L2 cache: it contains both graphics and compute data
  – One or more memory access schedulers (“frame buffer”): they reorder memory read/write operations to optimize the accesses to DRAM
  – A raster operation unit (ROP): it supports the atomic operations defined in CUDA, and is mainly used for graphics (alpha blending, compression)
GPU Architectures 82
Memory System
Memory access scheduler

• To store large amounts of data, GPUs use special DRAM such as GDDR5:
  – DRAM stores individual bits in small capacitors
  – A read operation: a row of bits (called a page) is read into a small memory structure called the row buffer
    • The bitlines connecting the individual capacitors to the row buffer are precharged to a voltage halfway between 0 and the supply voltage
    • Each capacitor is connected to its bitline through an access transistor, and the voltage of the bitline is pulled either up (1) or down (0)
    • A sense amplifier amplifies this small change until a clean logic 0 or 1 is read, and the values stored in the capacitors are refreshed
    • The precharge and activation operations introduce delays during which no data can be read from or written to the DRAM array → multiple banks, each with its own row buffer, are used
    • Even with multiple DRAM banks, it is often not possible to completely hide the latency of switching between rows when accessing data → memory access schedulers are needed
GPU Architectures 83
Memory System
Memory access scheduler

• Memory access schedulers reorder DRAM memory access requests so as to reduce the number of times data must be moved between the row buffers and the DRAM cells:
  – Each slice of the L2 cache has its own memory access scheduler
  – Each memory access scheduler contains separate logic for sorting the read requests and the write requests sent from the L2 cache
  – To group together reads to the same row of a DRAM bank, two separate tables are used:
    • Read request sorter: a set-associative structure, accessed by memory address, that maps all read requests to the same row of a given bank to a single pointer
    • Read request store: stores the lists of individual read requests; each pointer in the read request sorter points to an entry in the read request store

GPU Architectures 84
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 85
Evolution of NVIDIA architectures
From Tesla to Volta

(Chart: number of stream processors and GFLOPS per generation, from the GTX 285 (2009), GTX 480 (2010), GTX 780 (2013) and GTX 980 (2014) to the P100 (2016) and V100 (2017))

GPU Architectures 86
Evolution of NVIDIA architectures
NVIDIA Fermi (2009)


GPU Architectures 87
Evolution of NVIDIA architectures
NVIDIA Kepler (2012)

GPU Architectures 88
Evolution of NVIDIA architectures
NVIDIA Maxwell (2014)

GPU Architectures 89
Evolution of NVIDIA architectures
NVIDIA PASCAL GP104 (2016)


GPU Architectures 90
Evolution of NVIDIA architectures
NVIDIA PASCAL GP100 (2016)

GPU Architectures 91
Evolution of NVIDIA architectures
NVIDIA Volta (2017)

GPU Architectures 92
Evolution of NVIDIA architectures
NVIDIA Turing (2018)

GPU Architectures 93
Evolution of NVIDIA architectures
NVIDIA Ampere (2020)

GPU Architectures 94
Evolution of NVIDIA architectures
NVIDIA Hopper (2022)

GPU Architectures 95
Evolution of NVIDIA architectures
Tensor cores

• Specialized cores for, e.g., deep learning
  – They operate on narrow data types, e.g., fp16
• Example: 4x4x4 matrix multiply-accumulate (see the sketch below)
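A minimal CUDA sketch (assuming the CUDA WMMA API with its 16x16x16 fp16 fragment shape, which the hardware internally decomposes into 4x4x4 tensor core operations) of how a warp drives the tensor cores:

  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  // One warp computes a 16x16 tile: D = A (fp16) * B (fp16), accumulated in fp32.
  __global__ void wmma_tile(const half *A, const half *B, float *D) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

      wmma::fill_fragment(c_frag, 0.0f);                 // start with C = 0
      wmma::load_matrix_sync(a_frag, A, 16);             // load the A tile (leading dimension 16)
      wmma::load_matrix_sync(b_frag, B, 16);             // load the B tile
      wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // matrix multiply-accumulate on tensor cores
      wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
  }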

GPU Architectures 97
