NVLink4-Generation NVSwitch Chip
3. Chip Details
Hopper-Generation SuperPODs
1. NVSwitch-Enabled Platforms
3. SuperPOD Performance
NVLINK MOTIVATIONS
Bandwidth and GPU-Synergistic Operation
[Figure: evolution of NVLink connectivity across GPU generations: P100 GPUs directly linked with 4 NVLinks; V100 and A100 GPUs fully connected through NVSwitches; H100 GPUs attached to NVSwitches that also expose NVLink Network ports. Thread/block callouts illustrate GPU-synergistic operation.]
[Figure: 8 H100 GPUs connected to 4 NVSwitch chips with 5+4+4+5 NVLinks per GPU; the NVSwitches expose 20+16+16+20 NVLink Network ports.]
NVLink Network Support
▪ PHY-electrical interfaces compatible with 400G Ethernet/InfiniBand
▪ OSFP support (4 NVLinks per cage) with custom FW for active modules
▪ Additional Forward Error Correction (FEC) modes for optical-cable performance/reliability
Doubling of Bandwidth
SHARP Collectives/Multicast Support
▪ NVSwitch-internal duplication of data avoids the need for multiple accesses from/by the source GPU
▪ Embedded ALUs allow NVSwitches to perform AllReduce (and similar) calculations on behalf of GPUs
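As a concrete reference point for the SHARP bullets above, the sketch below runs the kind of gradient AllReduce that the NVSwitch ALUs can execute in-fabric. It is a minimal, hedged example using the public PyTorch/NCCL collective API and assumes a multi-GPU node launched with torchrun; whether NCCL actually offloads the reduction to NVLink SHARP depends on the library, driver, and fabric configuration, not on anything in this script, and the buffer size and averaging step are purely illustrative.

# Minimal sketch (assumptions noted above): a data-parallel gradient
# AllReduce over NCCL, the collective that NVLink SHARP can offload.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")              # NCCL picks NVLink/NVSwitch paths
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun (assumption)
    torch.cuda.set_device(local_rank)

    # Stand-in for gradients produced by one data-parallel replica.
    grads = torch.randn(64 * 1024 * 1024, device="cuda")

    # Sum across all GPUs; with NVLink SHARP the switch ALUs do the adds
    # in-fabric and multicast the result, instead of each GPU pulling
    # every peer's buffer over its own links.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()                        # gradient averaging

    if dist.get_rank() == 0:
        print("all-reduced tensor:", grads.shape, grads.dtype)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

The point of the off-load is that each gradient buffer crosses a GPU's NVLinks once, with duplication (multicast) and reduction handled inside the switch rather than by repeated accesses to the source GPU.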
[Figure: DLRM-style recommender pipeline. A database of GBs of input data (images, sound, ...) feeds linear layers that are data parallel (replicated across GPUs, exchanging gradients), while the embedding tables (tens of GB each, e.g. 10/20/40/60 GB) are model parallel (distributed across GPU 0 ... GPU n); an All2All redistributes activations from model-parallel to data-parallel form between the two stages. Accompanying chart: projected speedup (0x to 5x) for A100 with IB, H100 with IB, and H100 with NVLink Network.]
Projected performance subject to change. Example model assumes DLRM with a mix of 300-hot and 1-hot embedding tables with total capacity of 14 TB. Different recommender models may show different performance characteristics.
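The Model-Parallel to Data-Parallel redistribution in the figure is the All2All step that NVLink Network bandwidth accelerates. The sketch below shows that exchange pattern with the public torch.distributed.all_to_all collective; the shapes, the per-rank table layout, and the helper name exchange_embeddings are illustrative assumptions, not the benchmark configuration.

# Minimal sketch of the model-parallel -> data-parallel redistribution
# (All2All) from the DLRM-style figure above.
import torch
import torch.distributed as dist

def exchange_embeddings(local_lookups: torch.Tensor) -> torch.Tensor:
    """local_lookups: [world, batch_shard, dim] -- embedding rows this rank
    produced from the tables it owns, one slice per destination rank."""
    world = dist.get_world_size()
    send_chunks = list(local_lookups.unbind(0))              # one chunk per peer
    recv_chunks = [torch.empty_like(send_chunks[0]) for _ in range(world)]
    # All2All: every GPU sends the lookups that other ranks' samples need
    # and receives the lookups its own samples need from every table owner.
    dist.all_to_all(recv_chunks, send_chunks)
    # After the exchange each rank holds, for its local batch shard, the
    # embedding vectors from all tables: ready for the data-parallel MLP.
    return torch.cat(recv_chunks, dim=-1)

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    world, batch_shard, dim = dist.get_world_size(), 1024, 128
    local = torch.randn(world, batch_shard, dim, device="cuda")
    dense_input = exchange_embeddings(local)
    print(dist.get_rank(), dense_input.shape)                 # [batch_shard, world*dim]
    dist.destroy_process_group()

Per step, each GPU sends and receives on the order of batch-shard x embedding-dim x world-size elements, which is why this phase is dominated by interconnect bandwidth rather than compute.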
NVLINK NETWORK
[Figure: a new Hopper GPU (SMs, HBM, MMU) with NVLink NIC functions embedded on-chip connects through NVL Switches to the NVLink Network; the TLB ensures each incoming request is legal and maps it into the GPU physical address space.]
Traditional network concept vs. NVLink Network equivalent:
▪ Session layer: sockets -> CUDA export of network addresses of data structures
▪ NIC: PCIe NIC (card or chip) -> NIC functions embedded in GPU and NVSwitch
▪ Security off-load: NIC security features -> GPU-internal encryption and "TLB" firewalls
▪ SHARP groups and SHARP-related exchanges
[Figure: NVLink4 NVSwitch chip block diagram: a 64 x 64 XBAR; a SHARP controller (Hopper) with SHARP ALUs and SHARP scratch SRAM; and port logic for NVLink ports 0-63, with classification & routing, packet transforms, transaction tracking, error check & statistics collection, and TL/DL/PHY layers per port.]
New NVLink Network Blocks
▪ Security Processor protects data and chip configuration from attacks
▪ Partitioning features isolate subsets of ports into separate NVLink Networks
▪ Management controller now also handles attached OSFP cables
▪ Expanded telemetry to support InfiniBand-style monitoring
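As a rough cross-check against the summary slide, the per-chip bandwidth follows from the port count if one assumes 50 GB/s of full-duplex bandwidth per NVLink4 port (25 GB/s in each direction); that per-port rate is an assumption here, but it is consistent with the DGX H100 totals later in the deck.

\[
  64~\text{ports} \times 50~\tfrac{\text{GB/s}}{\text{port}}
  = 3.2~\text{TB/s full-duplex per NVSwitch chip}
\]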
NVLink4-Generation NVSwitch Chip
1. Brief History of NVLink
3. Chip Details
Hopper-Generation SuperPODs
1. NVSwitch-Enabled Platforms
3. SuperPOD Performance
DGX H100 SERVER
8-H100 4-NVSwitch Server
▪ 32 PFLOPS of AI Performance
▪ 640 GB aggregate GPU memory
▪ 18 NVLink Network OSFPs
▪ 3.6 TB/s of full-duplex NVLink Network bandwidth (72 NVLinks)
▪ 8x 400 Gb/s ConnectX-7 InfiniBand/Ethernet ports
▪ 2 dual-port Bluefield-3 DPUs
▪ Dual Sapphire Rapids CPUs
▪ PCIe Gen5
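A hedged reconciliation of the server bullets above, assuming 80 GB of HBM per H100 (implied by the 640 GB aggregate) and the same 50 GB/s of full-duplex bandwidth per NVLink4 link as before:

\begin{align*}
  8 \times 80~\text{GB} &= 640~\text{GB aggregate GPU memory}\\
  18~\text{OSFPs} \times 4~\tfrac{\text{NVLinks}}{\text{OSFP}} &= 72~\text{NVLink Network links}\\
  72 \times 50~\text{GB/s} &= 3.6~\text{TB/s full-duplex NVLink Network bandwidth}
\end{align*}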
DGX H100: DATA-NETWORK CONFIGURATION
[Figure: DGX H100 data-network topology: two CPUs, eight ConnectX-7 HCA/NICs each behind a PCIe switch (one per H100 GPU), four NVSwitch chips, and OSFP cages for the NVLink Network; the legend distinguishes 4-NVLink and 5-NVLink bundles and 1x 400 Gb links.]
Full-BW Intra-Server NVLink
▪ All 8 GPUs can simultaneously saturate 18 NVLinks to other GPUs within the server
▪ Limited only by over-subscription from multiple other GPUs
Half-BW NVLink Network
▪ All 8 GPUs can half-subscribe 18 NVLinks to GPUs in other servers
▪ 4 GPUs can saturate 18 NVLinks to GPUs in other servers
▪ Equivalent of full BW on AllReduce with SHARP
▪ Reduction in All2All BW is a balance with server complexity and costs
Multi-Rail InfiniBand/Ethernet
▪ Each GPU communicates over its own dedicated 400 Gb/s HCA/NIC
▪ 800 GB/s of aggregate full-duplex bandwidth to non-NVLink Network devices
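The 800 GB/s aggregate figure above is a unit conversion of the eight 400 Gb/s NICs, with "full-duplex" counting both directions:

\[
  8 \times 400~\text{Gb/s} = 3.2~\text{Tb/s} = 400~\text{GB/s per direction}
  \;\Longrightarrow\; 800~\text{GB/s full-duplex}
\]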
DGX H100 SUPERPOD: NVLINK SWITCH
NVLink Switch
▪ Standard 1RU 19-inch form factor highly leveraged from InfiniBand
switch design
▪ Dual NVLink4 NVSwitch chips
▪ 128 NVLink4 ports
▪ 32 OSFP cages
▪ 6.4 TB/s full-duplex BW
▪ Managed switch with out-of-band management communication
▪ Support for passive-copper, active-copper and optical OSFP cables
(custom FW)
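The NVLink Switch figures above follow from packaging two NVLink4 NVSwitch chips, using the per-chip numbers from the summary:

\begin{align*}
  2 \times 64~\text{ports} &= 128~\text{NVLink4 ports}\\
  128~\text{ports} \div 32~\text{OSFP cages} &= 4~\text{NVLinks per cage}\\
  2 \times 3.2~\text{TB/s} &= 6.4~\text{TB/s full-duplex BW}
\end{align*}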
DGX H100 SUPERPOD: AI EXASCALE
DGX H100 SuperPOD Scalable Unit
▪ 32 DGX H100 nodes + 18 NVLink Switches
▪ 256 H100 Tensor Core GPUs
▪ 1 ExaFLOP of AI performance
▪ 20 TB of aggregate GPU memory
▪ Network optimized for AI and HPC
▪ 128 L1 NVLink4 NVSwitch chips + 36 L2 NVLink4 NVSwitch chips
▪ 57.6 TB/s of NVLink Network bisection bandwidth spanning the entire Scalable Unit
▪ 25.6 TB/s full-duplex NDR 400 Gb/s InfiniBand for connecting
multiple Scalable Units in a SuperPOD
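A hedged sanity check on the Scalable Unit bullets, assuming 80 GB of HBM per H100 and 900 GB/s of full-duplex NVLink bandwidth per GPU (18 links at an assumed 50 GB/s each), half-subscribed across the NVLink Network as described on the DGX H100 data-network slide:

\begin{align*}
  32~\text{nodes} \times 8~\text{GPUs} &= 256~\text{H100 GPUs}\\
  256 \times 80~\text{GB} &\approx 20~\text{TB aggregate GPU memory}\\
  128~\text{GPUs} \times \tfrac{900~\text{GB/s}}{2} &= 57.6~\text{TB/s NVLink Network bisection BW}
\end{align*}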
SCALE-UP WITH NVLINK NETWORK
[Figure: two Scalable Units compared: 32 nodes (256 GPUs) connected through IB HDR leaf switches (A100 generation) versus 32 nodes (256 GPUs) fully NVLink-connected with massive bisection bandwidth (H100 generation).]
[Charts: projected "Speedup over A100" for the workloads listed in the footnote below; one panel is scaled to 8x, the other to 30x.]
Projected performance subject to change. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink Network where indicated.
# GPUs: Climate Modelling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100, 60 for H100 at 1sec, 8 for A100 and 64 for H100 at 1.5 and 2sec), MRCNN 8 (batch 32),
GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 175B 16K (batch 512), MoE 8K (batch 512, one expert per GPU)
SUMMARY
Cutting-Edge Speeds and Capabilities
NVLink4-Generation NVSwitch
▪ 64 NVLink4 ports and 3.2 TB/s full-duplex BW
▪ NVLink SHARP (multi-cast and reductions off-load)
▪ Inter-Server NVLink Network support
▪ Custom FW OSFP NVLink Network cable support
▪ Basis of new NVLink Switch
Hopper-Generation SuperPOD
▪ 32 DGX H100 servers
▪ 18 NVLink Switches
▪ 1 ExaFLOP of AI performance
▪ 57.6 TB/s NVLink Network bisection BW
▪ NVLink Network can more than double performance for
communication-intensive applications
▪ Scalable to thousands of GPUs using InfiniBand to connect
multiple Scalable Units