
THE NVLINK-NETWORK SWITCH:

NVIDIA’S SWITCH CHIP FOR HIGH COMMUNICATION-BANDWIDTH SUPERPODS


ALEXANDER ISHII AND RYAN WELLS, SYSTEMS ARCHITECTS
4th-Generation NVSwitch Chip
1. Brief History of NVLink

2. 4th-Generation New Features

3. Chip Details

Hopper-Generation SuperPODs
1. NVSwitch-Enabled Platforms

2. NVLink Network SuperPODs

3. SuperPOD Performance
NVLINK MOTIVATIONS
Bandwidth and GPU-Synergistic Operation

[Diagram: GPU with Thread Blocks scheduled onto SMs (GPC0), an L2 cache, and NVLink-Port Interfaces; SM-to-SM traffic flows over NVLink]

GPU Operational Characteristics Match NVLink Spec
▪ Thread-Block execution structure efficiently feeds parallelized NVLink architecture
▪ NVLink-Port Interfaces match data-exchange semantics of L2 as closely as possible

Faster than PCIe
▪ 100Gbps-per-lane (NVLink4) vs 32Gbps-per-lane (PCIe Gen5)
▪ Multiple NVLinks can be "ganged" to realize higher aggregate lane counts

Lower Overheads than Traditional Networks
▪ Target system scales (256 Hopper GPUs) allow complex features (e.g., end-to-end retry, adaptive routing, packet reordering) to be traded off against increased port counts
▪ Simplified Application/Presentation/Session-layer functionality allows all to be embedded directly in CUDA programs/driver
NVLINK GENERATIONS
Evolution In-step with GPUs

  2016  P100-NVLink1:   4 NVLinks,  40GB/s each, x8@20Gbaud-NRZ,   160GB/s total
  2017  V100-NVLink2:   6 NVLinks,  50GB/s each, x8@25Gbaud-NRZ,   300GB/s total
  2020  A100-NVLink3:  12 NVLinks,  50GB/s each, x4@50Gbaud-NRZ,   600GB/s total
  2022  H100-NVLink4:  18 NVLinks,  50GB/s each, x2@50Gbaud-PAM4,  900GB/s total

(Each generation also attaches to an x86 host over PCIe.)

Listed bandwidths are full-duplex (total of both directions). Whitepaper: http://www.nvidia.com/object/nvlink.html
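As a worked check on the listed per-link rates, a minimal sketch (assuming raw signaling rate with no encoding overhead, and "GB/s each" counted full-duplex as the note above states):

```cpp
// Sketch: reproduce the per-link and per-GPU NVLink figures in the table above.
#include <cstdio>

// Full-duplex GB/s for one NVLink: lanes/direction x Gbaud x bits-per-symbol x 2 directions / 8 bits
constexpr double link_GBs(int lanes, double gbaud, int bits_per_symbol) {
    return lanes * gbaud * bits_per_symbol * 2.0 / 8.0;
}

int main() {
    // NVLink1 (P100): x8 @ 20Gbaud NRZ (1 bit/symbol), 4 links per GPU
    std::printf("NVLink1: %.0f GB/s/link, %.0f GB/s/GPU\n", link_GBs(8, 20, 1), 4 * link_GBs(8, 20, 1));
    // NVLink4 (H100): x2 @ 50Gbaud PAM4 (2 bits/symbol), 18 links per GPU
    std::printf("NVLink4: %.0f GB/s/link, %.0f GB/s/GPU\n", link_GBs(2, 50, 2), 18 * link_GBs(2, 50, 2));
    return 0;
}
// Prints 40 GB/s/link and 160 GB/s/GPU for NVLink1, 50 GB/s/link and 900 GB/s/GPU for NVLink4.
```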


NVLINK-ENABLED SERVER GENERATIONS
Any-to-Any Connectivity with NVSwitch

[Diagrams: DGX-1 with directly connected P100s (no NVSwitch); DGX-2 with 16 V100s attached to NVSwitches; DGX A100 with 8 A100s attached to NVSwitches; DGX H100 with 8 H100s attached to 4 NVSwitches (4 or 5 NVLinks per GPU per switch), each switch exposing 16 or 20 NVLink Network ports]

  2016  DGX-1 (P100):  140GB/s Bisection BW,   40GB/s AllReduce BW
  2018  DGX-2 (V100):  2.4TB/s Bisection BW,   75GB/s AllReduce BW
  2020  DGX A100:      2.4TB/s Bisection BW,  150GB/s AllReduce BW
  2022  DGX H100:      3.6TB/s Bisection BW,  450GB/s AllReduce BW
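
A rough model of where the bottom-row numbers come from, for the NVSwitch-based systems only (a sketch under simplifying assumptions: bisection taken as half the GPUs driving their full per-GPU NVLink bandwidth across the cut, and ring-style AllReduce bandwidth approximated as half of one GPU's per-direction bandwidth; the DGX H100 AllReduce figure additionally reflects the ~2x NVLink SHARP gain described later):

```cpp
// Sketch: rough per-server bandwidth estimates under the assumptions stated above.
#include <cstdio>

int main() {
    struct Box { const char* name; int gpus; double gpu_bw_fdx; bool sharp; };
    // gpu_bw_fdx = per-GPU NVLink bandwidth, full-duplex GB/s (from the generations table)
    Box boxes[] = {
        {"DGX-2 (V100)", 16, 300.0, false},
        {"DGX A100",      8, 600.0, false},
        {"DGX H100",      8, 900.0, true },
    };
    for (const Box& b : boxes) {
        double bisection = (b.gpus / 2) * b.gpu_bw_fdx;   // half the GPUs at full BW across the cut
        double allreduce = (b.gpu_bw_fdx / 2.0) / 2.0;    // per-direction BW, halved for ring AllReduce
        if (b.sharp) allreduce *= 2.0;                    // NVLink SHARP roughly doubles effective AllReduce BW
        std::printf("%-14s bisection ~%.0f GB/s, AllReduce ~%.0f GB/s\n", b.name, bisection, allreduce);
    }
    return 0;
}
// Gives ~2400/75, ~2400/150 and ~3600/450 GB/s, matching the table above.
```
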
NVLINK4 NVSWITCH NEW FEATURES
Expanding Server Performance

[Diagram: DGX H100 (2022), 8 H100 GPUs attached to 4 NVSwitch chips (4 or 5 NVLinks per GPU per switch), each switch exposing 16 or 20 NVLink Network ports; 3.6TB/s Bisection BW, 450GB/s AllReduce BW]

NVLink Network Support
▪ PHY-electrical interfaces compatible with 400G Ethernet/InfiniBand
▪ OSFP support (4 NVLinks per cage) with custom FW for active modules
▪ Additional Forward Error Correction (FEC) modes for optical-cable performance/reliability

Doubling of Bandwidth
▪ 100Gbps-per-diff-pair (50Gbaud PAM4)
▪ x2 NVLinks and 64 NVLinks-per-NVSwitch (1.6TB/s internal bisection BW)
▪ More BW with fewer chips

SHARP Collectives/Multicast Support
▪ NVSwitch-internal duplication of data avoids the need for multiple accesses from/by the source GPU
▪ Embedded ALUs allow NVSwitches to perform AllReduce (and similar) calculations on behalf of GPUs
▪ Roughly doubles data throughput on communication-intensive operations in AI applications
NVLINK4 NVSWITCH
Chip Characteristics

[Block diagram: four banks of 32 PHY lanes feeding PORT Logic (including SHARP accelerators) on either side of a central XBAR]

Largest NVSwitch Ever
▪ TSMC 4N process
▪ 25.1B transistors
▪ 294mm2
▪ 50mm x 50mm package (2645 balls)

Highest Bandwidth Ever
▪ 64 NVLink4 ports (x2 lanes per NVLink)
▪ 3.2TB/s full-duplex bandwidth
▪ 50Gbaud PAM4 diff-pair signaling
▪ All ports NVLink Network capable

New Capabilities
▪ 400GFLOPS of FP32 SHARP (other number formats are supported)
▪ NVLink Network management, security and telemetry engines
ALLREDUCE IN AI TRAINING
Critical Communication-Intensive Operation

BASIC TRAINING FLOW
[Diagram: a batch (e.g. 256 images) drawn from a database of GBs of input data (images, sound, ...) runs forward/backward to produce gradients, which update the parameters]

ALLREDUCE IN MULTI-GPU TRAINING
[Diagram: data parallelism splits the batch across multiple GPUs; each GPU computes local gradients; NCCL AllReduce sums gradients across GPUs so every GPU applies the same parameter update]
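
The flow above is what NCCL's AllReduce implements in practice. A minimal, hedged sketch of the per-iteration pattern (single process driving one communicator per local GPU; buffer names and sizes are illustrative, not from the slide, and error handling is omitted):

```cpp
// Sketch: data-parallel gradient AllReduce with NCCL (one process, one comm per visible GPU).
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int nGpus = 8;            // e.g., one DGX server
    const size_t nGrads = 1 << 20;  // gradient elements per GPU (illustrative)

    std::vector<ncclComm_t> comms(nGpus);
    ncclCommInitAll(comms.data(), nGpus, nullptr);  // communicators for devices 0..nGpus-1

    std::vector<float*> grads(nGpus);
    std::vector<cudaStream_t> streams(nGpus);
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&grads[i], nGrads * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // One training iteration: after each GPU computes local gradients for its slice
    // of the batch, sum them in place across all GPUs so every GPU applies the same update.
    ncclGroupStart();
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(grads[i], grads[i], nGrads, ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
    for (int i = 0; i < nGpus; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```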


TRADITIONAL ALLREDUCE CALCULATION
Data-Exchange and Parallel Calculation

[Diagram: left panel repeats the multi-GPU data-parallel training picture (NCCL AllReduce: sum gradients across GPUs); right panel breaks the AllReduce on the local gradients into three steps:]
▪ Exchange Partial Local Gradients
▪ Reduce (Sum) Partials
▪ Broadcast Reduced Partials
[After the AllReduce, every GPU holds the same summed gradients and updates its copy of the parameters.]
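The three steps above correspond to the classic decomposition of AllReduce into a reduce-scatter (exchange and sum partials) followed by an all-gather (broadcast the reduced partials). A hedged sketch of that decomposition using NCCL primitives; `comm`, `stream` and the buffer names are assumed to exist and are illustrative:

```cpp
// Sketch: AllReduce expressed as ReduceScatter + AllGather, mirroring the
// exchange / reduce / broadcast steps on the slide.
#include <nccl.h>
#include <cuda_runtime.h>

// grads holds nRanks*chunk elements; chunkBuf holds chunk elements.
void allreduce_two_step(float* grads, float* chunkBuf, size_t chunk,
                        ncclComm_t comm, cudaStream_t stream) {
    // Steps 1+2: each rank receives everyone's partials for "its" chunk and the
    // sums are formed (exchange partial local gradients, reduce partials).
    ncclReduceScatter(grads, chunkBuf, chunk, ncclFloat, ncclSum, comm, stream);

    // Step 3: every rank broadcasts its reduced chunk so all ranks end up with
    // the complete summed gradient (broadcast reduced partials).
    ncclAllGather(chunkBuf, grads, chunk, ncclFloat, comm, stream);
}
```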


NVLINK SHARP ACCELERATION

Step 1: Read and reduce
  A100:                 Send Partials: N reads;  Receive Partials: N reads
  H100 + NVLink SHARP:  Send Partials: N reads, summed in-switch;  Receive Partials: 1 reduced read

Step 2: Broadcast result
  A100:                 Send New Partial: N writes;  Receive New Partials: N writes
  H100 + NVLink SHARP:  Send New Partial: 1 write, duplicated by in-switch multicast (N duplications);  Receive New Partials: N writes

Traffic summary (at each GPU interface)
  A100:                 2N send, 2N receive
  H100 + NVLink SHARP:  N+1 send, N+1 receive

~2x effective NVLink bandwidth
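A small worked check of the traffic summary (a sketch; N is the number of peers, and each "read"/"write" above is taken to carry one gradient chunk across the GPU's NVLink interface):

```cpp
// Sketch: chunk transfers crossing one GPU's NVLink interface during AllReduce,
// per the comparison above.
#include <cstdio>

int main() {
    const int N = 7;  // peers per GPU, e.g., an 8-GPU server (illustrative)

    int traditional = 2 * N + 2 * N;      // 2N send + 2N receive
    int with_sharp  = (N + 1) + (N + 1);  // N+1 send + N+1 receive (in-switch sum + multicast)

    std::printf("traditional: %d transfers, with NVLink SHARP: %d (%.2fx less traffic)\n",
                traditional, with_sharp, (double)traditional / with_sharp);
    return 0;
}
// For N = 7 this is 28 vs 16, i.e. ~1.75x, approaching 2x as N grows.
```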


NVLINK NETWORK FOR RAW BW
4.5X More BW than Maximum InfiniBand (IB)

NEURAL RECOMMENDER ENGINE EXAMPLE
[Diagram: linear layers are data parallel, replicated across GPUs; embedding tables (10GB-60GB each) are model parallel, distributed across GPU 0 .. GPU n; an All2All redistributes activations from model-parallel to data-parallel between the two]

RECOMMENDER WITH 14 TB EMBEDDING TABLES
[Chart: relative performance of A100 (IB), H100 (IB) and H100 (NVLink Network) on a 0x-5x scale]

Projected performance subject to change. Example model assumes DLRM with a mix of 300-hot and 1-hot embedding tables with total capacity of 14TB. Different recommender models may show different performance characteristics.
NVLINK NETWORK

[Diagram: an SM on the source GPU issues a request with a virtual address; the GPU MMU and NVLink NIC translate it to a network address carried across the NVLink Network switch; new Hopper Link NIC functions at the destination ensure the request is legal and map it (via a TLB) into the destination GPU's physical address space and HBM]

                      NVLink                  NVLink Network
Address Spaces        1 (shared)              N (independent)
Request Addressing    GPU physical address    Network address
Connection Setup      During boot process     Runtime API call by software
Isolation             No                      Yes
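
The "Runtime API call by software" row is what applications see in practice: a communicator spanning GPUs in different servers is created at run time. A hedged sketch of that setup, bootstrapped over MPI (the slide does not prescribe a particular library; names are illustrative):

```cpp
// Sketch: runtime connection setup for a multi-server job, bootstrapped with MPI.
// NCCL (and the driver underneath) establishes the fabric paths when the
// communicator is created; nothing is wired up at boot time.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nRanks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    // Rank 0 creates an opaque ID; all other ranks receive it out-of-band.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(rank % 8);                    // one GPU per rank (illustrative)
    ncclComm_t comm;
    ncclCommInitRank(&comm, nRanks, id, rank);  // runtime call that sets up connections

    // ... collectives such as ncclAllReduce can now run across servers ...

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```
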
MAPPING TO TRADITIONAL NETWORKING
NVLink Network is Tightly Integrated with GPU

  Concept                Traditional Example            NVLink Network
  Physical Layer         400G electrical/optical media  Custom-FW OSFP
  Data Link Layer        Ethernet                       NVLink custom on-chip HW and FW
  Network Layer          IP                             New NVLink Network Addressing and Management Protocols
  Transport Layer        TCP                            NVLink custom on-chip HW and FW
  Session Layer          Sockets                        SHARP groups; CUDA export of Network addresses of data-structures
  Presentation Layer     TLS/SSL                        Library abstractions (e.g., NCCL, NVSHMEM)
  Application Layer      HTTP/FTP                       AI Frameworks or User Apps
  NIC                    PCIe NIC (card or chip)        Functions embedded in GPU and NVSwitch
  RDMA Off-Load          NIC Off-Load Engine            GPU-internal Copy Engine
  Collectives Off-Load   NIC/Switch Off-Load Engine     NVSwitch-internal SHARP Engines
  Security Off-Load      NIC Security Features          GPU-internal Encryption and "TLB" Firewalls
  Media Control          NIC Cable Adaptation           NVSwitch-internal OSFP-cable controllers

NVLINK4 NVSWITCH BLOCK DIAGRAM

[Block diagram: Management block (Control Processor and State/Telemetry Proxy, including associated OSFPs; Security Processor; PCIe I/O); Port Logic 0-63, each with Classification & Routing, Packet Transforms, Error Check & Statistics Collection, and Transaction Tracking & Packet Transforms, attached to NVLink 0-63 (TL, DL, PHY); a 64 x 64 XBAR; and SHARP Controller, SHARP ALU (Hopper), and SHARP Scratch SRAM]

New SHARP Blocks
▪ ALU matched to Hopper unit
▪ Wide variety of operators (logical, min/max, add) and formats (S/U integers, FP16, FP32, FP64, BF16)
▪ SHARP Controller can manage up to 128 SHARP groups in parallel
▪ XBAR BW uprated to carry additional SHARP-related exchanges (NVLink plus SHARP)

New NVLink Network Blocks
▪ Security Processor protects data and chip configuration from attacks
▪ Partitioning features isolate subsets of ports into separate NVLink Networks
▪ Management controller now also handles attached OSFP cables
▪ Expanded telemetry to support InfiniBand-style monitoring
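
For scale, the operator/format list above covers the kinds of reductions a GPU-side library can issue; whether a given call is actually executed by the switch ALUs or falls back to GPU-side reduction is decided by the library/driver at run time, not by the application. A hedged illustration of such calls (same `comm`/`stream`/buffer assumptions as the earlier sketches):

```cpp
// Sketch: reductions in several of the formats/operators listed above, issued
// through NCCL. Buffers, comm and stream are assumed to exist (illustrative).
#include <nccl.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

void issue_reductions(__half* h, __nv_bfloat16* b, float* f, int* i32, size_t n,
                      ncclComm_t comm, cudaStream_t stream) {
    ncclGroupStart();
    ncclAllReduce(h,   h,   n, ncclFloat16,  ncclSum, comm, stream);  // FP16 add
    ncclAllReduce(b,   b,   n, ncclBfloat16, ncclSum, comm, stream);  // BF16 add
    ncclAllReduce(f,   f,   n, ncclFloat32,  ncclMax, comm, stream);  // FP32 max
    ncclAllReduce(i32, i32, n, ncclInt32,    ncclMin, comm, stream);  // signed-integer min
    ncclGroupEnd();
}
```
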
NVLink4-Generation NVSwitch Chip
1. Brief History of NVLink

2. NVLink4-Generation New Features

3. Chip Details

Hopper-Generation SuperPODs
1. NVSwitch-Enabled Platforms

2. NVLink Network SuperPODs

3. SuperPOD Performance
DGX H100 SERVER
8-H100 4-NVSwitch Server
▪ 32 PFLOPS of AI Performance
▪ 640 GB aggregate GPU memory
▪ 18 NVLink Network OSFPs
▪ 3.6 TB/s of full-duplex NVLink Network bandwidth (72 NVLinks)
▪ 8x 400 Gb/s ConnectX-7 InfiniBand/Ethernet ports
▪ 2 dual-port Bluefield-3 DPUs
▪ Dual Sapphire Rapids CPUs
▪ PCIe Gen5
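
A quick consistency check on the NVLink Network numbers above (a sketch, using the 4 NVLinks per OSFP cage and 50 GB/s full-duplex per NVLink4 link given earlier):

```cpp
// Sketch: DGX H100 NVLink Network figures from the per-link numbers above.
#include <cstdio>

int main() {
    const int osfps = 18;              // NVLink Network OSFP cages per DGX H100
    const int links_per_osfp = 4;      // 4 NVLinks per OSFP cage
    const double link_fdx_gbs = 50.0;  // GB/s full-duplex per NVLink4 link

    int links = osfps * links_per_osfp;             // 72 NVLinks
    double bw_tbs = links * link_fdx_gbs / 1000.0;  // 3.6 TB/s full-duplex
    std::printf("%d NVLinks, %.1f TB/s full-duplex NVLink Network BW\n", links, bw_tbs);
    return 0;
}
```
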
DGX H100: DATA-NETWORK CONFIGURATION

[Diagram: dual CPUs; 8 ConnectX-7 HCA/NICs, each behind a PCIe switch with one 400Gb port; 8 H100 GPUs attached to 4 NVSwitch chips (4 or 5 NVLinks per GPU per switch); the NVSwitches fan out to OSFP cages for the NVLink Network]

Full-BW Intra-Server NVLink
▪ All 8 GPUs can simultaneously saturate 18 NVLinks to other GPUs within server
▪ Limited only by over-subscription from multiple other GPUs

Half-BW NVLink Network
▪ All 8 GPUs can half-subscribe 18 NVLinks to GPUs in other servers
▪ 4 GPUs can saturate 18 NVLinks to GPUs in other servers
▪ Equivalent of full-BW on AllReduce with SHARP
▪ Reduction in All2All BW is a balance with server complexity and costs

Multi-Rail InfiniBand/Ethernet
▪ All 8 GPUs can independently RDMA data over their own dedicated 400 Gb/s HCA/NIC
▪ 800 GB/s of aggregate full-duplex bandwidth to non-NVLink Network devices
DGX H100 SUPERPOD: NVLINK SWITCH
NVLink Switch
▪ Standard 1RU 19-inch form factor highly leveraged from InfiniBand
switch design
▪ Dual NVLink4 NVSwitch chips
▪ 128 NVLink4 ports
▪ 32 OSFP cages
▪ 6.4 TB/s full-duplex BW
▪ Managed switch with out-of-band management communication
▪ Support for passive-copper, active-copper and optical OSFP cables
(custom FW)
DGX H100 SUPERPOD: AI EXASCALE
DGX H100 SuperPOD Scalable Unit
▪ 32 DGX H100 nodes + 18 NVLink Switches
▪ 256 H100 Tensor Core GPUs
▪ 1 ExaFLOP of AI performance
▪ 20 TB of aggregate GPU memory
▪ Network optimized for AI and HPC
▪ 128 L1 NVLink4 NVSwitch chips + 36 L2 NVLink4 NVSwitch chips
▪ 57.6 TB/s bisection NVLink Network spanning entire Scalable Unit
▪ 25.6 TB/s full-duplex NDR 400 Gb/s InfiniBand for connecting
multiple Scalable Units in a SuperPOD
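
The Scalable-Unit totals follow from the per-server numbers (a sketch under the assumption that bisection counts half the GPUs each contributing their half-subscribed NVLink Network bandwidth of 18 x 50 / 2 = 450 GB/s full-duplex):

```cpp
// Sketch: DGX H100 SuperPOD Scalable Unit totals from the per-server numbers.
#include <cstdio>

int main() {
    const int nodes = 32, gpus_per_node = 8;
    const double gpu_nvlink_net_gbs = 450.0;  // half-subscribed 18 NVLinks, full-duplex GB/s
    const int l1_chips = nodes * 4;           // 4 NVSwitch chips per DGX H100
    const int l2_chips = 18 * 2;              // 18 NVLink Switches, 2 NVSwitch chips each
    const double ib_tbs = nodes * 8 * 400.0 * 2 / 8 / 1000.0;  // 8x 400Gb/s NICs, both directions

    int gpus = nodes * gpus_per_node;
    double bisection_tbs = (gpus / 2) * gpu_nvlink_net_gbs / 1000.0;
    std::printf("%d GPUs, %d L1 + %d L2 NVSwitch chips, %.1f TB/s NVLink bisection, %.1f TB/s IB\n",
                gpus, l1_chips, l2_chips, bisection_tbs, ib_tbs);
    return 0;
}
// 256 GPUs, 128 + 36 chips, 57.6 TB/s NVLink Network bisection, 25.6 TB/s full-duplex IB.
```
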
SCALE-UP WITH NVLINK NETWORK

[Diagrams: DGX A100 256 POD with IB HDR spine switches over IB HDR leaf switches over 32 nodes (256 GPUs); DGX H100 256 POD with NVLink Switches (NVS) over 32 nodes (256 GPUs), fully NVLink-connected with massive bisection bandwidth]

                      A100 SuperPOD                     H100 SuperPOD                     Speedup
                      Dense     Bisection  Reduce       Dense     Bisection  Reduce       Bisection  Reduce
                      PFLOP/s   [GB/s]     [GB/s]       PFLOP/s   [GB/s]     [GB/s]
  1 DGX / 8 GPUs      2.5       2,400      150          16        3,600      450          1.5x       3x
  32 DGXs / 256 GPUs  80        6,400      100          512       57,600     450          9x         4.5x


NVLINK NETWORK BENEFITS
Dependent on Communication Intensity

[Charts: speedup over A100 for A100, H100 and H100 + NVLink Network.
 HPC (scale to 8x): Climate Modelling, Lattice QCD, 3D FFT, Genomics.
 AI Inference (scale to 35x): Megatron Turing NLG 530B at three latency targets.
 AI Training (scale to 10x): Vision Models, 10TB Recommender, GPT-3 175B, Switch-XXL 395B.]

Projected performance subject to change. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink Network where indicated.
# GPUs: Climate Modelling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100, 60 for H100 at 1sec, 8 for A100 and 64 for H100 at 1.5 and 2sec), MRCNN 8 (batch 32), GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 175B 16K (batch 512), MoE 8K (batch 512, one expert per GPU)
SUMMARY
Cutting-Edge Speeds and Capabilities

NVLink4-Generation NVSwitch
▪ 64 NVLink4 ports and 3.2 TB/s full-duplex BW
▪ NVLink SHARP (multi-cast and reductions off-load)
▪ Inter-Server NVLink Network support
▪ Custom FW OSFP NVLink Network cable support
▪ Basis of new NVLink Switch
Hopper-Generation SuperPOD
▪ 32 DGX H100 servers
▪ 18 NVLink Switches
▪ 1 ExaFLOP of AI performance
▪ 57.6TB/s NVLink Network bisection BW
▪ NVLink Network can more than double performance for
communication-intensive applications
▪ Scalable to thousands of GPUs using InfiniBand to connect
multiple Scalable Units
