
ARQUITECTURAS MULTIMEDIA Y DE PROPÓSITO ESPECÍFICO (2024-25)

Unit 4
GPU Architectures

March 2025
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 2
Introduction
GPU basics

• GPU stands for Graphics Processing Unit
  – A processor optimized for 2D and 3D graphics, video, visual computing and display

• GPUs are highly parallel, highly multithreaded multiprocessors optimized for visual computing
  – Visual computing refers to a mix of graphics processing and computing that allows for real-time visual interaction with computed objects via graphics, images and video

• Modern GPUs have a unified graphics and computing architecture that serves as both a programmable graphics processor and a scalable parallel computing platform

GPU Architectures 3
Introduction
GPU basics

• PCs and game consoles combine a GPU with a CPU
  – Heterogeneous systems

• GPUs were initially introduced to enable real-time rendering with a focus on video games
  – Rendering is the process of generating an image (e.g., a photorealistic image) from a 2D or 3D model
  – Accomplished through a logical graphics pipeline that is “mapped” to the GPU hardware
  – Involves a huge amount of floating-point operations executed by the GPU

• Why not use the enormous computational power of a GPU for non-graphics applications?
GPU Architectures 4
Introduction
GPU basics

• GPU in Graphics Card

GPU Architectures 5
Introduction
GPU basics

• GPU in Mobile Processors:

Image: Nvidia Jetson TX1 (Tegra X1 SOC)

  – The Apple A8 application processor devotes more chip area to its integrated GPU than to the CPU cores

GPU Architectures 6
Introduction
GPU basics

• GPU in HPC:

GPU Architectures 7
Introduction
Quick recap of GPU evolution

• No such thing as a GPU in PCs 30 years ago
  – Graphics were performed by a Video Graphics Array (VGA) controller: a simple memory controller and display generator connected to DRAM

• By the late 1990s, advances in semiconductor technology allowed more functions to be added to VGA controllers
  – 3D acceleration functions such as hardware for triangle setup and rasterization, texture mapping and shading

• The GPU term was coined in 2000, as single-chip graphics processors fully implemented high-end graphics pipelines
• GPUs became more and more programmable…
  – NVIDIA GeForce 3 (2001) included vertex shaders and pixel shaders
GPU Architectures 8
Introduction
Quick recap of GPU evolution

• …and more precise
  – Integer and fixed-point arithmetic replaced by floating-point arithmetic

• Efforts by researchers to leverage GPUs to implement linear algebra (mapping matrix data into textures and applying shaders) inspired NVIDIA to directly support general-purpose computing on GPUs (and graphics, of course)
  – Processor instructions and memory hardware were added to support general-purpose programming languages
  – A programming environment was created to allow GPUs to be programmed using familiar languages (such as C and C++)

• But… not all applications can benefit from running on a GPU


GPU Architectures 9
Introduction
What makes the difference between a CPU and a GPU?

• CPU vs. GPU chip area

Few complex cores, large caches vs. Thousands of simple cores, small caches

GPU Architectures 10
Introduction
The graphics side

• GPUs and associated drivers implement the model of graphics processing defined by current APIs
  – OpenGL is an open standard for 3D graphics programming originally proposed and defined by Silicon Graphics Incorporated
  – Direct3D is the interface for 3D graphics defined by Microsoft (as part of DirectX, a series of multimedia programming interfaces)
• OpenGL and Direct3D have well-defined behaviors → build an efficient HW accelerator to support the execution of their functions
• OpenGL and Direct3D are similarly structured: they define a logical graphics processing pipeline that is mapped onto the GPU, and programming models and languages for the programmable pipeline stages
• Graphics applications call API functions, which use the GPU as a coprocessor: API functions send commands, programs and data to the GPU through the corresponding driver

GPU Architectures 11
Introduction
The graphics side

• The GPU is used to render triangles… billions of triangles per second!

GPU Architectures 12
Introduction
The graphics side

• The graphics pipeline
  – Sequence of vertices grouped into geometric primitives (points, lines, triangles and polygons)

GPU Architectures 13
Introduction
The graphics side

• The graphics pipeline
  (Figures on slides 14-17: the stages of the logical graphics pipeline)

GPU Architectures 14-17
Introduction
The graphics side

• Graphics pipeline on GPU
  (Figures on slides 18-21: mapping of the logical graphics pipeline onto the GPU hardware)

GPU Architectures 18-21
Introduction
The parallel computing side

• A modern GPU is composed of many cores (SIMT cores), each of them able to run on the order of thousands of threads
  – The large number of threads running on each core makes it possible to hide the latency of memory accesses (fine-grained multithreaded architecture)

GPU Architectures 22
Introduction
Fine-grained multithreading

• Idea: the hardware has multiple thread contexts (PC + registers). Each cycle, the fetch engine fetches from a different thread
  – By the time the fetched branch/instruction resolves, no instruction has been fetched from the same thread
  – Branch/instruction resolution latency is overlapped with the execution of other threads’ instructions
• Advantage: no logic needed for handling control and data dependences within a thread
• Disadvantages: single-thread performance suffers; extra logic is needed for keeping thread contexts; latency is not hidden if there are not enough threads to cover it

GPU Architectures 23
Introduction
Fine-grained multithreading

– CPUs typically use the SMT model (e.g., Intel Hyper-Threading enables 2 threads per core; IBM POWER allows up to 8 threads per core)
– The Sun UltraSPARC T2 used fine-grained MT, with 8 threads per core
– Today’s GPUs are based on the fine-grained MT model, with thousands of threads!
GPU Architectures 24
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 25
Programming model
Overview

• Programming model vs. hardware execution model
  – The programming model refers to how the programmer expresses the code
    • E.g., sequential (von Neumann), data parallel (SIMD), dataflow, multi-threaded (SPMD, MIMD), …
  – The execution model refers to how the hardware executes the code underneath
    • E.g., out-of-order execution, vector processor, array processor, dataflow processor, multiprocessor, multithreaded processor, …

• The execution model can be very different from the programming model
  – E.g., the von Neumann model implemented by an OoO processor
  – E.g., the SPMD model implemented by a SIMD processor (a GPU)

GPU Architectures 26
Programming model
GPUs are SIMD engines

• The instruction pipeline operates like a SIMD pipeline (e.g., a vector processor)
• However, programming is done using threads, NOT SIMD instructions
  – SPMD programming model
    • Each thread executes the same code but operates on a different piece of data
    • Each thread has its own context (i.e., it can be treated/restarted/executed independently)
    • All threads share a common memory space called global memory

• A GPU computing application starts execution on the CPU (see the sketch below):
  – Discrete GPUs: the CPU code allocates memory for use in computation on the GPU, initiates transfers of input data into GPU memory and launches the computational kernel on the GPU
  – Integrated GPUs: only the last step is needed
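A minimal host-side sketch of these steps for a discrete GPU (the kernel myKernel and the function wrapper are hypothetical, used only for illustration):

  #include <cstdlib>
  #include <cuda_runtime.h>

  __global__ void myKernel(float *data, int n);   // hypothetical kernel, defined elsewhere

  void runOnDiscreteGPU(int N, int nblocks, int nthreads) {
      float *h_in = (float *)malloc(N * sizeof(float));                  // input prepared on the CPU
      float *d_in;
      cudaMalloc(&d_in, N * sizeof(float));                              // 1. allocate GPU memory
      cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice); // 2. transfer the input data
      myKernel<<<nblocks, nthreads>>>(d_in, N);                          // 3. launch the computational kernel
      cudaMemcpy(h_in, d_in, N * sizeof(float), cudaMemcpyDeviceToHost); // copy the results back
      cudaFree(d_in);
      free(h_in);
  }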
GPU Architectures 27
Programming model
Thread hierarchy

• Computational kernels are made of thousands of threads
• Threads must be organized properly so that the same kernel code can run on GPUs with varying architectural parameters (e.g., number of cores per multiprocessor)
• Thread hierarchy in CUDA:

GPU Architectures 28
Programming model
Thread hierarchy

• Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and share access to a private memory space
• Grid: a set of thread blocks that are independent and thus can be executed independently

GPU Architectures 29
Programming model
Thread hierarchy

• The # of threads per block and the # of blocks per grid (and their layout) are set by the programmer when the kernel is called:
  kernel<<<dimGrid, dimBlock>>>(parameters);

• Each thread is given a unique number (threadIdx) within its thread block (0, 1, …, blockDim-1) and each thread block is given a unique number (blockIdx) within its grid:
  – CUDA supports blocks of up to 1024 threads (block sizes should be a multiple of 32, the warp size)
  – Thread blocks and grids may have 1, 2 or 3 dimensions, accessed through the .x, .y and .z index fields (positions of the thread and of the block)
  – dimGrid and dimBlock are 3-element vectors (dim3 type) or integers that specify the dimensions of the grid in blocks and of the blocks in threads
  – blockDim.x is the number of threads in the x dimension (same for y and z); gridDim.x is the number of blocks in the x dimension (same for y and z); see the sketch below
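A minimal sketch (not from the slides; the kernel and function names are hypothetical) of a 2D grid/block configuration and the per-thread indices it produces:

  // One thread per element of an H x W matrix stored in row-major order.
  __global__ void scale2D(float *M, int H, int W, float a) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row index
      int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column index
      if (row < H && col < W)                            // the grid may overshoot the matrix edges
          M[row * W + col] *= a;
  }

  void launchScale2D(float *d_M, int H, int W) {
      dim3 dimBlock(16, 16);                              // 16x16 = 256 threads per block
      dim3 dimGrid((W + 15) / 16, (H + 15) / 16);         // enough blocks to cover the whole matrix
      scale2D<<<dimGrid, dimBlock>>>(d_M, H, W, 2.0f);
  }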

GPU Architectures 30
Programming model
Sample GPU Program

CPU Code:

  void saxpy_serial(int N, float a, float *x, float *y) {
    for (int i = 0; i < N; i++)
      y[i] = a*x[i] + y[i];
  }

  int main() {
    float *x, *y;
    int N = 1<<20;
    saxpy_serial(N, 2.0, x, y);
  }

GPU Code:

  __global__ void saxpy_parallel(int N, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) y[i] = a*x[i] + y[i];
  }

  int main() {
    float *x, *y;
    int N = 1<<20;
    int nblocks = (N+255)/256;
    cudaMalloc(&x, N*sizeof(float));
    cudaMalloc(&y, N*sizeof(float));
    saxpy_parallel<<<nblocks,256>>>(N, 2.0, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
  }

Each loop iteration in the CPU code becomes an independent thread in the GPU code.

GPU Architectures 31
Slide credit: Hyesoon Kim
Programming model
Sample GPU program

• Parallelism is determined explicitly by specifying the dimensions
• Parallel execution and thread management are automatic:
  – All thread creation, scheduling and termination are handled by the underlying system
  – Threads are managed directly in hardware
• Threads within each thread block can communicate with each other efficiently via the shared memory:
  – A scratchpad memory that is close to where thread blocks are executed
  – Exposed to programmers as a different memory space (a special keyword allocates memory into it, i.e., “__shared__” in CUDA)
  – Acts as a software-controlled cache, used to keep data that is reused frequently and in a predictable manner
  – Small size, ranging from 16 to 64 KiB

GPU Architectures 32
Programming model
Example: Average Filters

Average over a 3x3 window for a 16x16 array:

  kernelF<<<(1,1),(16,16)>>>(A);

  __global__ kernelF(A){
    i = threadIdx.y;
    j = threadIdx.x;
    tmp = (A[i-1][j-1] + A[i-1][j] +
           … + A[i+1][j+1] ) / 9;
    A[i][j] = tmp;
  }

Each thread loads 9 elements from global memory. It takes hundreds of cycles.

GPU Architectures 33
Programming model
Example: Average Filters

Average over a 3x3 window for a 16x16 array, with shared memory:

  kernelF<<<(1,1),(16,16)>>>(A);

  __global__ kernelF(A){
    __shared__ int smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
                … + smem[i+1][j+1] ) / 9;
  }

Do you notice any problems here? Hazards?
GPU Architectures 34
Programming model
Example: Average Filters

Average over a 3x3 window for a 16x16 array, with shared memory & synchronization:

  kernelF<<<(1,1),(16,16)>>>(A);

  __global__ kernelF(A){
    __shared__ int smem[16][16];
    i = threadIdx.y;
    j = threadIdx.x;
    smem[i][j] = A[i][j]; // load to smem
    __syncthreads();
    A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
                … + smem[i+1][j+1] ) / 9;
  }

GPU Architectures 35
Programming model
Synchronization within a thread block

• Threads within a block can synchronize at a synchronization barrier by calling __syncthreads()
  – No thread in the thread block can pass the barrier until all threads have reached it
  – After passing a barrier, all writes to memory performed by threads of the block BEFORE the barrier are visible to all threads
  – Threads in a block can thus communicate by writing and reading per-block shared memory at a synchronization barrier
• Since threads within each thread block can share memory and synchronize via barriers, they will be executed on the same physical processor:
  – # of thread blocks >> number of processors… but thread blocks are independent of each other

GPU Architectures 36
Programming model
Synchronization between thread blocks

• Virtualization into threads and thread blocks allows intuitive problem decompositions:
  – The # of thread blocks is dictated by the size of the data rather than by the # of processors
• It also allows the same CUDA program to scale to a widely varying # of processors
• Threads in different thread blocks may coordinate their activities using atomic memory operations on the global memory (see the sketch below):
  – Atomic memory operation: a memory read-modify-write operation sequence that completes without any intervening access
  – Global memory: a per-application memory space shared by all threads
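A minimal CUDA sketch (not from the slides; the kernel and variable names are hypothetical) of cross-block coordination through an atomic operation on global memory:

  // Every thread that finds a matching element increments a single global counter.
  // atomicAdd() performs the read-modify-write without any intervening access,
  // so threads from different thread blocks can safely update the same location.
  __global__ void countMatches(const int *data, int N, int key, int *counter) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N && data[i] == key)
          atomicAdd(counter, 1);
  }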
GPU Architectures 37
Programming model
Memory spaces that threads can access to

• In addition to the shared memory and the global memory, threads can also access data from a private local memory:
  – Used by CUDA for register spilling, stack frames or variables that do not fit in the thread’s registers

GPU Architectures 38
Programming model
NVIDIA GPU Instruction set architecture (ISA)

• NVIDIA compilers translate CUDA programs (in C/C++) into a virtualized ISA called Parallel Thread Execution (PTX):
  – An abstraction of the hardware ISA that provides compatibility across generations of GPUs (the hardware ISA is hidden from the programmer)
  – Fully documented by NVIDIA with each release of CUDA
  – Similar to a RISC ISA like MIPS, but with an unlimited set of virtual registers
• PTX code is converted to the real ISA supported by the hardware (SASS, Streaming ASSembler) when sent to the GPU:
  – The translation from PTX to SASS is done either by the GPU driver or by a stand-alone program called ptxas (part of CUDA)
  – Among other things, PTX registers are mapped to the fixed number of hardware registers available on the actual device

GPU Architectures 39
Programming model
PTX instruction set

• The format of a PTX instruction is:

  opcode.type d, a, b, c;

  where d is the destination operand (a register, or a memory address for stores); a, b and c are source operands (32-bit or 64-bit registers or constants); and type is one of: untyped bits (.b8, .b16, .b32, .b64), unsigned integer (.u8, .u16, .u32, .u64), signed integer (.s8, .s16, .s32, .s64) or floating point (.f16, .f32, .f64)

• In general, PTX instructions map one-to-one to hardware instructions
GPU Architectures 40
Programming model
Basic PTX instruction set

GPU Architectures 41
Programming model
Basic PTX instruction set

GPU Architectures 42
Programming model
Example PTX code

• For the SAXPY code (the PTX listing is shown as a figure):

GPU Architectures 43
Programming model
PTX instruction set

• All PTX instructions can be predicated by 1-bit predicate registers, which can be set by the setp instruction:
  – The instruction is executed only when its predicate bit is 1
  – Placing a predicate before a branch instruction creates conditional branches
• Unlike vector architectures (and processors with vector extensions), there are no separate instructions for sequential data transfers, strided data transfers or gather-scatter data transfers:
  – All data transfers are gather-scatter
  – Data transfers by adjacent threads to sequential addresses are recognized and transformed by the hardware into a single memory request

GPU Architectures 44
Programming model
Example SASS code

• SAXPY code compiled with CUDA 8.0 for NVIDIA Fermi:

GPU Architectures 45
Programming model
Example SASS code

• SAXPY code compiled with CUDA 8.0 for NVIDIA Pascal:

GPU Architectures 46
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 47
The SIMT Core
Modern GPU architecture

• A modern GPU is composed of many SIMT cores:
  – What NVIDIA calls Streaming Multiprocessors (SMs) and AMD calls Compute Units (CUs)
  – Basic GPUs have a few SMs (2-4); high-end ones have dozens
  – SMs execute vertex, geometry and pixel fragment shader programs, as well as parallel computing programs
  (Figure: vertex, fragment and compute work distribution blocks dispatch work to the SIMT cores)

GPU Architectures 49
The SIMT Core
Modern GPU architecture
• Thread blocks of one or more kernels are dynamically distributed among the SMs for execution:
  (Figure: the compute work distribution unit dispatching thread blocks to the SMs)
GPU Architectures 50
The SIMT Core
Modern GPU architecture

• Each GPU thread has its own private registers, private per-thread memory, program counter, and thread execution state:
  – Zero-overhead thread management and scheduling (done in hardware) and fast barrier synchronization enable fine-grained parallelism
  – To efficiently manage and execute thousands of threads, threads (of the same thread block) are dynamically grouped into warps (NVIDIA) or wavefronts (AMD):
    • A warp is a set of parallel threads that execute the same instruction together (it can be seen as a SIMD operation formed by hardware): typically 32 threads per warp
    • Single-Instruction Multiple-Thread (SIMT) architecture: a processor architecture that applies one instruction to multiple independent threads in parallel
    • Individual parallel threads within a warp are of the same type and start together at the same program address, but are free to branch and execute independently

GPU Architectures 51
The SIMT Core
SPMD on SIMT Machine

  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];

(Figure: the loop iterations are mapped onto the lanes of Warp 0, which executes the load, load, add and store instructions in lockstep at PCs X, X+1, X+2 and X+3)

• Warp: a set of threads that execute the same instruction (i.e., at the same PC)
• SPMD (Single Program Multiple Data) programming model on a SIMT (Single-Instruction, Multiple-Thread) execution model
GPU Architectures 52
The SIMT Core
Conceptual view of a SIMT core

(Figure: a SIMT core containing an instruction cache, a multithreaded instruction unit that handles up to 16 warps, 8 streaming processors (SPs) with their register files (RFs), 2 special function units (SFUs), an interconnection network, a unified L1 data cache & shared memory, and a memory interface)

• SP (streaming processor): scalar arithmetic units (integer & FP), pipelined (1 instr/thread/cycle), multithreaded (up to 64 threads)
• RF (register file): 1024 general-purpose 32-bit registers, partitioned among the assigned threads (16-64 registers per thread)
• SFU (special function unit): FP sine, cosine, log2, exp2, square root, reciprocal

GPU Architectures 53
The SIMT Core
Microarchitecture of a generic SIMT core

• The pipeline of each SIMT core can be divided into a SIMT front-end and a SIMD back-end:
  – Three scheduling loops act together in a single pipeline:
    1) Instruction fetch loop: Fetch, I-Cache, Decode and I-Buffer blocks
    2) Instruction issue loop: I-Buffer, Scoreboard, Issue and SIMT Stack blocks
    3) Register access scheduling loop: Operand Collector, ALU and Memory blocks
• We present an increasingly detailed view of the SIMT core microarchitecture
GPU Architectures 54
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Remember: to increase efficiency, threads are organized into warps → the unit of scheduling is a warp:
  – In each cycle, the hardware selects a warp for scheduling
  – The warp’s PC is used to access instruction memory (I-Cache) to obtain the next instruction to execute for the warp
  – The instruction is decoded
  – Source registers are read from the register file & the SIMT execution mask values are determined
  – Each thread executes on the functional unit (FU) associated with a lane, provided its SIMT execution mask bit is set (there are several types of functional units in a GPU: special FUs, load/store units, FP FUs, integer FUs, Tensor Cores, RT cores)
    • If the warp size > the number of FU lanes, the warp is executed over several clock cycles

GPU Architectures 55
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• SIMT execution masking:
  – The GPU programming model presents the programmer with the abstraction that threads execute completely independently
    • Achieved via a combination of predication, a stack of predicate masks called the SIMT stack, and special instructions & markers
  – The SIMT stack helps efficiently handle two key issues:
    • Nested control flow (one branch is control dependent upon another)
    • Skipping computation entirely while all threads in a warp avoid a control flow path
  – Each warp uses its own SIMT stack:
    • Each entry stores a reconvergence PC, the PC of the next instruction to execute, and an active mask
  – The SIMT stack is at least partly managed by special instructions (at the SASS level):
    • Special instructions to push stack entries
    • Special instructions and instruction markers that pop a stack entry or unwind the stack to a specified entry and branch to the target instruction address with the target thread-active mask
GPU Architectures 56
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: CUDA C source code and the corresponding PTX assembly code)
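The code on the slide is shown only as a figure; a minimal CUDA sketch (hypothetical, not the slide's exact code) of the kind of nested control flow the SIMT stack has to handle:

  __global__ void divergent(const int *data, int *out) {
      int tid = threadIdx.x;
      if (data[tid] > 0) {          // A: some threads of the warp take this path...
          if (data[tid] > 10)       // B: ...and diverge again (nested branch)
              out[tid] = 1;         // C
          else
              out[tid] = 2;         // D
      } else {
          out[tid] = 3;             // F: the remaining threads take this path
      }
      out[tid] += tid;              // G: reconvergence point for the whole warp
  }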

GPU Architectures 57
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: control flow graph (CFG) of the example code, for a warp of 4 threads)

GPU Architectures 58
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: CFG of the example code, for a warp of 4 threads; the divergent paths are serialized)
• How does the GPU hardware enable threads within a warp to follow different paths, given that all threads must execute the same instruction? → Serialization of the divergent paths

GPU Architectures 59
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Example of SIMT stack fully managed in hardware:
  (Figure: CFG for a warp of 4 threads, together with the SIMT stack contents after line 6; each entry holds the next instruction to execute, the reconvergence PC and the active mask, and the top of the stack indicates the path currently being executed)
• How does the GPU hardware enable threads within a warp to follow different paths, given that all threads must execute the same instruction? → Serialization of the divergent paths, by means of the SIMT stack

GPU Architectures 60
The SIMT Core
High-level view of the overall pipeline (single scheduler)

• Warp scheduling:
  – Each SIMT core must execute many warps:
    • In which order should these warps be scheduled?
  – For simplicity, assume that each warp issues a single instruction when it is scheduled and that the warp is not eligible to issue another instruction until the first one completes:
    • For an ideal memory system (memory requests serviced within a fixed time), we could hide this latency by supporting enough warps and applying fine-grained multithreading: scheduling warps in round-robin (RR) order, one each cycle
    • To enable a different warp to issue an instruction each cycle, each thread needs its own registers (to avoid saving/restoring register state to/from memory)
  – In practice, memory latency depends upon the application’s locality and the resulting contention encountered by off-chip memory accesses
    • Locality properties can either favor or discourage RR scheduling: when different threads share data, RR is preferred

GPU Architectures 61
The SIMT Core
Two-loop approximation

• To help reduce the number of warps per core needed to hide long execution latencies, we should be able to issue a subsequent instruction from a warp while some of its earlier instructions have not yet completed:
  – We need a mechanism to know whether the next instruction to issue for each warp depends upon an earlier instruction of the warp that has not yet completed execution
  – Instruction buffer (I-Buffer): contains fetched instructions and is used to detect data and/or structural hazards
    • A separate scheduler decides which of the instructions in the I-Buffer can be issued
    • The I-Buffer has storage for one or more instructions per warp
  – How are data dependencies between instructions within the same warp detected?
    • Using a scoreboard mechanism (simplicity → consumes less energy & area)
GPU Architectures 62
The SIMT Core
Two-loop approximation: the scoreboard

• Scoreboard for a single-threaded in-order CPU:
  – The scoreboard has one bit per register:
    • Whenever an instruction that will write to a particular register issues, the corresponding bit in the scoreboard is set
  – An instruction cannot be issued until all the scoreboard bits associated with the registers it reads or writes are cleared:
    • Prevents both RAW and WAW hazards (WAR hazards are prevented by in-order instruction processing)
• GPUs use in-order scoreboards, but there are challenges (see the sketch below):
  – Large number of registers contained in a modern GPU
    • 16K 32-bit registers per warp scheduler in a SIMT core
  – A big number of warps may be repeatedly probing the scoreboard waiting for their dependencies to be cleared → several read ports are required (16 warps/scheduler & 2 instr. per warp & 4 operands per instr. → 128 read ports if all warps probe the scoreboard each cycle)
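A minimal host-side sketch (an illustration only, not a hardware model) of the in-order scoreboard idea, with one bit per register:

  #include <bitset>

  struct Scoreboard {
      std::bitset<256> pending;   // bit set => the register has a write in flight

      // An instruction may issue only if none of its source/destination
      // registers has a pending write (this prevents RAW and WAW hazards).
      bool canIssue(const int regs[], int n) const {
          for (int k = 0; k < n; k++)
              if (pending.test(regs[k])) return false;
          return true;
      }
      void onIssue(int destReg)     { pending.set(destReg);   } // mark the write as in flight
      void onWriteback(int destReg) { pending.reset(destReg); } // clear it when the write completes
  };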

GPU Architectures 63
The SIMT Core
Three-loop approximation

• In the two-loop approximation architecture:
  – The first loop selects a warp that has space in the I-Buffer, looks up its PC and performs an I-Cache access to obtain the next instruction
  – The second loop selects an instruction in the I-Buffer that has no outstanding dependencies and issues it to the execution units (once the operands have been obtained from the register file)
• As discussed, each SIMT core has a large register file that contains separate physical registers for each warp:
  – The size of the register file in modern GPU architectures is 256 KiB
  – The area of an SRAM memory is proportional to the number of ports
    • We would need 1 port per operand per instruction issued per cycle
  – The area of the register file can be reduced by implementing it with multiple banks of single-ported memories:
    • The operand collector hides this implementation detail from the ISA and forms the third scheduling loop

GPU Architectures 64
The SIMT Core
Three-loop approximation: the operand collector

• A naive microarchitecture for providing increased register file bandwidth (register read stage):

GPU Architectures 65
The SIMT Core
Three-loop approximation: the operand collector

• A naive microarchitecture for providing increased register file bandwidth (register read stage):
  (Figure: issue and execution timeline for three instructions, e.g. mad and add; it takes 6 cycles for the 3 instructions to read their operands, and most of the banks are idle)

GPU Architectures 66
The SIMT Core
Three-loop approximation: the operand collector

• The operand collector microarchitecture:
  – Idea: replace the staging registers with collector units and assign each instruction a collector unit when it enters the register read stage
    • Multiple collector units can help improve throughput in the presence of bank conflicts
    • Each collector unit contains buffering space for all the source operands used by the instruction
    • The arbiter is more likely to find bank-level parallelism allowing simultaneous access to several banks
GPU Architectures 67
The SIMT Core
Three-loop approximation: the operand collector

• The operand collector schedules bank accesses to tolerate bank conflicts when they occur:
  – Conflicts between register operands of different warps can be reduced by allocating equivalent registers from different warps in different banks
  – Example: swizzled layout (see the sketch below)
  – Goes well with round-robin scheduling (warps make relatively even progress)
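A minimal sketch (an assumption about one common swizzling scheme, not necessarily the exact layout on the slide) of how equivalent registers of different warps can be spread across banks:

  // Naive layout: register r of every warp lands in the same bank, so the same
  // instruction issued by different warps always conflicts on that bank.
  int bank_naive(int warp, int reg, int num_banks) {
      return reg % num_banks;
  }

  // Swizzled layout: offset the mapping by the warp id, so warp 0's r0 and
  // warp 1's r0 live in different banks and can be read in the same cycle.
  int bank_swizzled(int warp, int reg, int num_banks) {
      return (warp + reg) % num_banks;
  }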
GPU Architectures 68
The SIMT Core
Three-loop approximation: the operand collector

• Example (figure: issue and execution of instructions from several warps through the collector units):
  – Each collector unit can read one operand each cycle
  – Reads for several different warps can be done each cycle (increasing throughput)
GPU Architectures 69
The SIMT Core
Three-loop approximation: the operand collector

• The operand collector may allow WAR hazards:
  – It does not impose any order among different instructions when they are ready to issue
  – Two instructions from the same warp may be held in different collector units, with one of them reading a register that the other will write:
    • The second instruction could potentially write a new value to the register before the first one has read the (correct) older value (e.g., due to repeated bank conflicts suffered by the first instruction while obtaining its source registers)
  – We must therefore require that instructions from the same warp leave the operand collector for the execution units in program order

GPU Architectures 70
The SIMT Core
Three-loop approximation: structural hazards

• There are many potential causes of structural hazards in GPU pipelines, for example:
  – The register read stage may run out of collector units
  – Conflicts in the memory system
• What to do when a structural hazard is found?
  – In a single-threaded in-order CPU pipeline we could stall younger instructions until the hazard disappears
  – Stalls are not a good idea in a GPU pipeline (they are costly to implement and ruin throughput: instructions from other warps that could otherwise advance would be stalled!)
• GPUs implement a form of instruction replay:
  – Instructions are held in the I-Buffer until they have completed
GPU Architectures 71
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 72
Memory System
Memory spaces in a GPU

• GPU computing kernels interact with the memory system through load and store instructions
• CPUs typically include two separate memory spaces:
  – The register file and memory
• GPUs subdivide memory into local and global memory spaces:
  – The local memory space is private per thread and typically used for register spilling
  – Global memory is used for data structures shared among threads
• Additionally, GPUs have an on-chip scratchpad memory with shared access among the threads of a thread block:
  – Shared memory is used for data known to be needed at a given step of the computation; it yields higher performance and saves energy
GPU Architectures 73
Memory System
First-level memory structures

• Comprised of the scratchpad memory, the L1 data cache and the L1 texture cache
• In CUDA, shared memory refers to a small, low-latency memory space accessible to all threads in a thread block:
  – Implemented as SRAM → latency similar to that of the register file
  – One bank per lane, with one read port and one write port per bank
  – Each thread has access to all banks → bank conflicts arise when two or more threads in a warp need different locations in the same bank in the same cycle (see the sketch below)
• The L1 data cache keeps a subset of the global memory address space:
  – In some architectures, the L1 cache contains only addresses not modified by kernels (this avoids complications due to the lack of cache coherence on GPUs)
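A minimal CUDA sketch (hypothetical kernel; it assumes the usual layout in which consecutive 32-bit words map to 32 consecutive banks) contrasting a conflicting and a conflict-free shared memory access pattern:

  __global__ void bankConflictDemo(float *out) {
      __shared__ float a[32][32];      // column accesses: every thread hits the same bank
      __shared__ float b[32][33];      // padding by one word shifts each row to a new bank

      int tid = threadIdx.x;
      a[tid][0] = tid;                 // word offset tid*32 -> bank 0 for all 32 threads (32-way conflict)
      b[tid][0] = tid;                 // word offset tid*33 -> bank tid (no conflict)
      __syncthreads();
      out[tid] = a[tid][0] + b[tid][0];
  }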
GPU Architectures 74
Memory System
First-level memory structures

• The L1 data cache keeps a subset of the global memory address space:
  – If all threads in a warp access locations within the same L1 data cache block and that block is not present in the cache, a single request is sent to the lower cache levels (the accesses are coalesced)
  – If the threads in the warp access different cache blocks, multiple memory accesses need to be generated (uncoalesced accesses); see the sketch after this list
• Programmers try to avoid both bank conflicts and uncoalesced accesses, but the hardware allows both (this eases programming)
• The L1 data cache and the scratchpad memory have been unified since NVIDIA’s Fermi:
  – This allows programs to decide how much memory they want to use for the shared memory and how much for the L1 data cache
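A minimal CUDA sketch (hypothetical kernels) contrasting a coalesced and an uncoalesced global memory access pattern:

  // Coalesced: consecutive threads of a warp read consecutive 4-byte words,
  // so the whole warp is served by one (or a few) wide memory requests.
  __global__ void copyCoalesced(const float *in, float *out, int N) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N) out[i] = in[i];
  }

  // Uncoalesced: a large stride makes each thread of the warp touch a different
  // cache block, so multiple separate memory requests must be generated.
  __global__ void copyStrided(const float *in, float *out, int N, int stride) {
      int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
      if (i < N) out[i] = in[i];
  }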

GPU Architectures 75
Memory System
Unified L1 data cache and shared memory

– The SRAM data array can be configured partly for direct-mapped access (shared memory) and partly as a set-associative cache
– L1 data cache misses and bank conflicts are handled using a replay mechanism (a non-stalling interface with the instruction pipeline)

GPU Architectures 76
Memory System
Shared memory access operations
• A memory access request is received: one memory address per thread in the warp, plus the operation type
• Bank conflicts are determined by the arbiter; if any are found, the request is split into two parts:
  – A conflict-free part, which proceeds
  – The rest, buffered in the load/store unit to be retried (replayed) later
• For shared memory accesses the tag lookup in the tag unit is bypassed; the tag unit simply determines the banks to be accessed
• The address crossbar distributes the addresses to the individual banks; each bank is 32 bits wide and has its own decoder (a different row can be accessed in each bank)
• The shared memory lookup has single-cycle latency; the replayed portion goes back to the arbiter the following cycle (if there are still bank conflicts, the process is repeated)
• Data is returned to the appropriate lanes for storage in the register file via the data crossbar (only the lanes of active threads write values)

GPU Architectures 77
Memory System
Cache read operation (LD GLOBAL)
• Memory addresses are computed and the coalescing rules are applied (individual memory accesses are grouped into coalesced accesses)
• The arbiter accepts or rejects the request depending on whether enough resources are available
• A tag lookup is done in the tag unit to check for a hit or a miss
• On a HIT, the same row of the data array is accessed in the 32 banks (128 bytes)
• Data is returned to the register file via the data crossbar (only the lanes of active threads write values)

GPU Architectures 78
Memory System
Cache read operation (LD GLOBAL)
• Memory addresses are computed and the coalescing rules are applied (individual memory accesses are grouped into coalesced accesses); the arbiter accepts or rejects the request depending on the available resources
• On a MISS:
  – The arbiter informs the load/store unit that the request must be replayed
  – In parallel, it sends the request information to the PRT (pending request table, similar to MSHRs)
  – A memory request is forwarded to the MMU for virtual-to-physical address translation (the L1 is a VIVT cache); in particular, the request is sent to the corresponding memory partition unit (which contains a bank of the L2 cache & a memory access scheduler)
  – Once the data is returned, the fill unit accesses the PRT to recover the information about the request and the load is rescheduled
GPU Architectures 79
Memory System
Cache write operation (ST GLOBAL)

• Cache coherence is not supported: only local memory accesses or read-only global memory data can be placed in the L1 cache
• The L1 data cache can support both write-through (WT) and write-back (WB) policies:
  – WT: global memory
  – WB: local memory
• Data to be written is first placed in the write buffer
• For uncoalesced accesses, or when some threads are masked off, only a portion of a cache block is written
• Write-back: if the block is present in the cache, the data is written via the data crossbar; otherwise, the block must first be read from the L2 cache or DRAM

GPU Architectures 80
Memory System
L1 texture cache

• Recent NVIDIA and AMD GPUs combine the L1 data cache and the texture cache to save area:
  – Texture mapping is used in 3D graphics to make scenes look as realistic as possible: an image (texture) is applied to a surface
  – To implement texture mapping, the coordinates of one or more samples within the texture (texels) are first determined
  – From these coordinates, the memory addresses of the texels are calculated
  – There is significant locality in texture memory accesses (adjacent pixels map to adjacent texels and it is common to average the values of nearby texels) that can be exploited by means of a cache

GPU Architectures 81
Memory System
Memory partition unit

• High-performance GPUs connect to multiple DRAM chips in parallel via memory partition units:
  – This provides the large amount of memory bandwidth required by the SIMT cores
  – Memory traffic is distributed across the memory partition units using address interleaving (at granularities of 256 bytes or 1024 bytes)
  – The SIMT cores connect to the memory partition units via an on-chip interconnection network (crossbars for NVIDIA, rings for AMD)
• Each memory partition unit contains:
  – A portion of the L2 cache: it contains both graphics and compute data
  – One or more memory access schedulers (“frame buffer”): they reorder memory read/write operations to optimize the accesses to DRAM
  – A raster operation unit (ROP): it supports the atomic operations defined in CUDA, and is mainly used for graphics (alpha blending, compression)
GPU Architectures 82
Memory System
Memory access scheduler

• To store large amounts of data, GPUs use special DRAM such as GDDR5:
  – DRAM stores individual bits in small capacitors
  – A read operation: a row of bits (called a page) is read into a small memory structure called the row buffer
    • The bitlines connecting the individual capacitors to the row buffer are precharged to a voltage halfway between 0 and the supply voltage
    • Each capacitor is connected to its bitline through an access transistor, and the voltage of the bitline is pulled either up (1) or down (0)
    • A sense amplifier amplifies this small change until a clean logic 0 or 1 is read, and the values stored in the capacitors are refreshed
    • The precharge and activation operations introduce delays during which no data can be read from or written to the DRAM array → multiple banks, each with its own row buffer, are used
    • Even with multiple DRAM banks, it is often not possible to completely hide the latency of switching between rows when accessing data → memory access schedulers are needed
GPU Architectures 83
Memory System
Memory access scheduler

• Memory access schedulers reorder DRAM memory access requests so as to reduce the number of times data must be moved between the row buffers and the DRAM cells:
  – Each slice of the L2 cache has its own memory access scheduler
  – Each memory access scheduler contains separate logic for sorting the read requests and the write requests sent from the L2 cache
  – To group together reads to the same row of a DRAM bank, two separate tables are used:
    • Read request sorter: a set-associative structure, accessed by memory address, that maps all read requests to the same row of a given bank to a single pointer
    • Read request store: stores the lists of individual read requests; each pointer in the read request sorter points to an entry in the read request store

GPU Architectures 84
Outline

• Introduction
• Programming Model
• The SIMT Core: Instruction and Register Data Flow
• Memory System
• Evolution of NVIDIA architectures

GPU Architectures 85
Evolution of NVIDIA architectures
From Tesla to Volta

(Chart: number of stream processors and GFLOPS per generation, from the GTX 285 (2009), GTX 480 (2010), GTX 780 (2013) and GTX 980 (2014) to the P100 (2016) and V100 (2017))

GPU Architectures 86
Evolution of NVIDIA architectures
NVIDIA Fermi (2009)


GPU Architectures 87
Evolution of NVIDIA architectures
NVIDIA Kepler (2012)

GPU Architectures 88
Evolution of NVIDIA architectures
NVIDIA Maxwell (2014)

GPU Architectures 89
Evolution of NVIDIA architectures
NVIDIA PASCAL GP104 (2016)


GPU Architectures 90
Evolution of NVIDIA architectures
NVIDIA PASCAL GP100 (2016)

GPU Architectures 91
Evolution of NVIDIA architectures
NVIDIA Volta (2017)

GPU Architectures 92
Evolution of NVIDIA architectures
NVIDIA Turing (2018)

GPU Architectures 93
Evolution of NVIDIA architectures
NVIDIA Ampere (2020)

GPU Architectures 94
Evolution of NVIDIA architectures
NVIDIA Hopper (2022)

GPU Architectures 95
Evolution of NVIDIA architectures
Tensor cores

• Specialized cores for, e.g., deep learning
  – They operate on narrow data types, e.g., fp16
• Example: 4x4x4 matrix multiply-accumulate (see the sketch below)
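A minimal CUDA sketch (assuming the CUDA WMMA API with its 16x16x16 fp16 fragment shape, which the hardware internally decomposes into 4x4x4 tensor core operations) of how a warp drives the tensor cores:

  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  // One warp computes a 16x16 tile: D = A (fp16) * B (fp16), accumulated in fp32.
  __global__ void wmma_tile(const half *A, const half *B, float *D) {
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

      wmma::fill_fragment(c_frag, 0.0f);                 // start with C = 0
      wmma::load_matrix_sync(a_frag, A, 16);             // load the A tile (leading dimension 16)
      wmma::load_matrix_sync(b_frag, B, 16);             // load the B tile
      wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // matrix multiply-accumulate on tensor cores
      wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
  }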

GPU Architectures 97
