
THE NVLINK-NETWORK SWITCH:

NVIDIA’S SWITCH CHIP FOR HIGH COMMUNICATION-BANDWIDTH SUPERPODS


ALEXANDER ISHII AND RYAN WELLS, SYSTEMS ARCHITECTS
4th-Generation NVSwitch Chip
1. Brief History of NVLink

2. 4th-Generation New Features

3. Chip Details

Hopper-Generation SuperPODs
1. NVSwitch-Enabled Platforms

2. NVLink Network SuperPODs

3. SuperPOD Performance
NVLINK MOTIVATIONS
Bandwidth and GPU-Synergistic Operation

[Diagram: GPU with Thread Blocks scheduled onto SMs (GPC0), an L2 cache, and NVLink-Port Interfaces; SM-to-SM traffic flows over NVLink]

GPU Operational Characteristics Match NVLink Spec
▪ Thread-Block execution structure efficiently feeds parallelized NVLink architecture
▪ NVLink-Port Interfaces match data-exchange semantics of L2 as closely as possible

Faster than PCIe
▪ 100Gbps-per-lane (NVLink4) vs 32Gbps-per-lane (PCIe Gen5)
▪ Multiple NVLinks can be "ganged" to realize higher aggregate lane counts

Lower Overheads than Traditional Networks
▪ Target system scales (256 Hopper GPUs) allow complex features (e.g., end-to-end retry, adaptive routing, packet reordering) to be traded off against increased port counts
▪ Simplified Application/Presentation/Session-layer functionality allows all to be embedded directly in CUDA programs/driver
NVLINK GENERATIONS
Evolution In-step with GPUs

  2016  P100-NVLink1:   4 NVLinks,  40GB/s each, x8@20Gbaud-NRZ,   160GB/s total
  2017  V100-NVLink2:   6 NVLinks,  50GB/s each, x8@25Gbaud-NRZ,   300GB/s total
  2020  A100-NVLink3:  12 NVLinks,  50GB/s each, x4@50Gbaud-NRZ,   600GB/s total
  2022  H100-NVLink4:  18 NVLinks,  50GB/s each, x2@50Gbaud-PAM4,  900GB/s total

(Each generation also attaches to an x86 host over PCIe.)

Listed bandwidths are full-duplex (total of both directions). Whitepaper: http://www.nvidia.com/object/nvlink.html
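As a worked check on the listed per-link rates, a minimal sketch (assuming raw signaling rate with no encoding overhead, and "GB/s each" counted full-duplex as the note above states):

```cpp
// Sketch: reproduce the per-link and per-GPU NVLink figures in the table above.
#include <cstdio>

// Full-duplex GB/s for one NVLink: lanes/direction x Gbaud x bits-per-symbol x 2 directions / 8 bits
constexpr double link_GBs(int lanes, double gbaud, int bits_per_symbol) {
    return lanes * gbaud * bits_per_symbol * 2.0 / 8.0;
}

int main() {
    // NVLink1 (P100): x8 @ 20Gbaud NRZ (1 bit/symbol), 4 links per GPU
    std::printf("NVLink1: %.0f GB/s/link, %.0f GB/s/GPU\n", link_GBs(8, 20, 1), 4 * link_GBs(8, 20, 1));
    // NVLink4 (H100): x2 @ 50Gbaud PAM4 (2 bits/symbol), 18 links per GPU
    std::printf("NVLink4: %.0f GB/s/link, %.0f GB/s/GPU\n", link_GBs(2, 50, 2), 18 * link_GBs(2, 50, 2));
    return 0;
}
// Prints 40 GB/s/link and 160 GB/s/GPU for NVLink1, 50 GB/s/link and 900 GB/s/GPU for NVLink4.
```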


NVLINK-ENABLED SERVER GENERATIONS
Any-to-Any Connectivity with NVSwitch

[Diagrams: DGX-1 with directly connected P100s (no NVSwitch); DGX-2 with 16 V100s attached to NVSwitches; DGX A100 with 8 A100s attached to NVSwitches; DGX H100 with 8 H100s attached to 4 NVSwitches (4 or 5 NVLinks per GPU per switch), each switch exposing 16 or 20 NVLink Network ports]

  2016  DGX-1 (P100):  140GB/s Bisection BW,   40GB/s AllReduce BW
  2018  DGX-2 (V100):  2.4TB/s Bisection BW,   75GB/s AllReduce BW
  2020  DGX A100:      2.4TB/s Bisection BW,  150GB/s AllReduce BW
  2022  DGX H100:      3.6TB/s Bisection BW,  450GB/s AllReduce BW
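
A rough model of where the bottom-row numbers come from, for the NVSwitch-based systems only (a sketch under simplifying assumptions: bisection taken as half the GPUs driving their full per-GPU NVLink bandwidth across the cut, and ring-style AllReduce bandwidth approximated as half of one GPU's per-direction bandwidth; the DGX H100 AllReduce figure additionally reflects the ~2x NVLink SHARP gain described later):

```cpp
// Sketch: rough per-server bandwidth estimates under the assumptions stated above.
#include <cstdio>

int main() {
    struct Box { const char* name; int gpus; double gpu_bw_fdx; bool sharp; };
    // gpu_bw_fdx = per-GPU NVLink bandwidth, full-duplex GB/s (from the generations table)
    Box boxes[] = {
        {"DGX-2 (V100)", 16, 300.0, false},
        {"DGX A100",      8, 600.0, false},
        {"DGX H100",      8, 900.0, true },
    };
    for (const Box& b : boxes) {
        double bisection = (b.gpus / 2) * b.gpu_bw_fdx;   // half the GPUs at full BW across the cut
        double allreduce = (b.gpu_bw_fdx / 2.0) / 2.0;    // per-direction BW, halved for ring AllReduce
        if (b.sharp) allreduce *= 2.0;                    // NVLink SHARP roughly doubles effective AllReduce BW
        std::printf("%-14s bisection ~%.0f GB/s, AllReduce ~%.0f GB/s\n", b.name, bisection, allreduce);
    }
    return 0;
}
// Gives ~2400/75, ~2400/150 and ~3600/450 GB/s, matching the table above.
```
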
NVLINK4 NVSWITCH NEW FEATURES
Expanding Server Performance

[Diagram: DGX H100 (2022), 8 H100 GPUs attached to 4 NVSwitch chips (4 or 5 NVLinks per GPU per switch), each switch exposing 16 or 20 NVLink Network ports; 3.6TB/s Bisection BW, 450GB/s AllReduce BW]

NVLink Network Support
▪ PHY-electrical interfaces compatible with 400G Ethernet/InfiniBand
▪ OSFP support (4 NVLinks per cage) with custom FW for active modules
▪ Additional Forward Error Correction (FEC) modes for optical-cable performance/reliability

Doubling of Bandwidth
▪ 100Gbps-per-diff-pair (50Gbaud PAM4)
▪ x2 NVLinks and 64 NVLinks-per-NVSwitch (1.6TB/s internal bisection BW)
▪ More BW with fewer chips

SHARP Collectives/Multicast Support
▪ NVSwitch-internal duplication of data avoids the need for multiple accesses from/by the source GPU
▪ Embedded ALUs allow NVSwitches to perform AllReduce (and similar) calculations on behalf of GPUs
▪ Roughly doubles data throughput on communication-intensive operations in AI applications
NVLINK4 NVSWITCH
Chip Characteristics

[Block diagram: four banks of 32 PHY lanes feeding PORT Logic (including SHARP accelerators) on either side of a central XBAR]

Largest NVSwitch Ever
▪ TSMC 4N process
▪ 25.1B transistors
▪ 294mm2
▪ 50mm x 50mm package (2645 balls)

Highest Bandwidth Ever
▪ 64 NVLink4 ports (x2 lanes per NVLink)
▪ 3.2TB/s full-duplex bandwidth
▪ 50Gbaud PAM4 diff-pair signaling
▪ All ports NVLink Network capable

New Capabilities
▪ 400GFLOPS of FP32 SHARP (other number formats are supported)
▪ NVLink Network management, security and telemetry engines
ALLREDUCE IN AI TRAINING
Critical Communication-Intensive Operation

BASIC TRAINING FLOW
[Diagram: a batch (e.g. 256 images) drawn from a database of GBs of input data (images, sound, ...) runs forward/backward to produce gradients, which update the parameters]

ALLREDUCE IN MULTI-GPU TRAINING
[Diagram: data parallelism splits the batch across multiple GPUs; each GPU computes local gradients; NCCL AllReduce sums gradients across GPUs so every GPU applies the same parameter update]
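
The flow above is what NCCL's AllReduce implements in practice. A minimal, hedged sketch of the per-iteration pattern (single process driving one communicator per local GPU; buffer names and sizes are illustrative, not from the slide, and error handling is omitted):

```cpp
// Sketch: data-parallel gradient AllReduce with NCCL (one process, one comm per visible GPU).
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int nGpus = 8;            // e.g., one DGX server
    const size_t nGrads = 1 << 20;  // gradient elements per GPU (illustrative)

    std::vector<ncclComm_t> comms(nGpus);
    ncclCommInitAll(comms.data(), nGpus, nullptr);  // communicators for devices 0..nGpus-1

    std::vector<float*> grads(nGpus);
    std::vector<cudaStream_t> streams(nGpus);
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&grads[i], nGrads * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // One training iteration: after each GPU computes local gradients for its slice
    // of the batch, sum them in place across all GPUs so every GPU applies the same update.
    ncclGroupStart();
    for (int i = 0; i < nGpus; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(grads[i], grads[i], nGrads, ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < nGpus; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
    for (int i = 0; i < nGpus; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```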


TRADITIONAL ALLREDUCE CALCULATION
Data-Exchange and Parallel Calculation

[Diagram: left panel repeats the multi-GPU data-parallel training picture (NCCL AllReduce: sum gradients across GPUs); right panel breaks the AllReduce on the local gradients into three steps:]
▪ Exchange Partial Local Gradients
▪ Reduce (Sum) Partials
▪ Broadcast Reduced Partials
[After the AllReduce, every GPU holds the same summed gradients and updates its copy of the parameters.]
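The three steps above correspond to the classic decomposition of AllReduce into a reduce-scatter (exchange and sum partials) followed by an all-gather (broadcast the reduced partials). A hedged sketch of that decomposition using NCCL primitives; `comm`, `stream` and the buffer names are assumed to exist and are illustrative:

```cpp
// Sketch: AllReduce expressed as ReduceScatter + AllGather, mirroring the
// exchange / reduce / broadcast steps on the slide.
#include <nccl.h>
#include <cuda_runtime.h>

// grads holds nRanks*chunk elements; chunkBuf holds chunk elements.
void allreduce_two_step(float* grads, float* chunkBuf, size_t chunk,
                        ncclComm_t comm, cudaStream_t stream) {
    // Steps 1+2: each rank receives everyone's partials for "its" chunk and the
    // sums are formed (exchange partial local gradients, reduce partials).
    ncclReduceScatter(grads, chunkBuf, chunk, ncclFloat, ncclSum, comm, stream);

    // Step 3: every rank broadcasts its reduced chunk so all ranks end up with
    // the complete summed gradient (broadcast reduced partials).
    ncclAllGather(chunkBuf, grads, chunk, ncclFloat, comm, stream);
}
```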


NVLINK SHARP ACCELERATION

Step 1: Read and reduce
  A100:                 Send Partials: N reads;  Receive Partials: N reads
  H100 + NVLink SHARP:  Send Partials: N reads, summed in-switch;  Receive Partials: 1 reduced read

Step 2: Broadcast result
  A100:                 Send New Partial: N writes;  Receive New Partials: N writes
  H100 + NVLink SHARP:  Send New Partial: 1 write, duplicated by in-switch multicast (N duplications);  Receive New Partials: N writes

Traffic summary (at each GPU interface)
  A100:                 2N send, 2N receive
  H100 + NVLink SHARP:  N+1 send, N+1 receive

~2x effective NVLink bandwidth
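A small worked check of the traffic summary (a sketch; N is the number of peers, and each "read"/"write" above is taken to carry one gradient chunk across the GPU's NVLink interface):

```cpp
// Sketch: chunk transfers crossing one GPU's NVLink interface during AllReduce,
// per the comparison above.
#include <cstdio>

int main() {
    const int N = 7;  // peers per GPU, e.g., an 8-GPU server (illustrative)

    int traditional = 2 * N + 2 * N;      // 2N send + 2N receive
    int with_sharp  = (N + 1) + (N + 1);  // N+1 send + N+1 receive (in-switch sum + multicast)

    std::printf("traditional: %d transfers, with NVLink SHARP: %d (%.2fx less traffic)\n",
                traditional, with_sharp, (double)traditional / with_sharp);
    return 0;
}
// For N = 7 this is 28 vs 16, i.e. ~1.75x, approaching 2x as N grows.
```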


NVLINK NETWORK FOR RAW BW
4.5X More BW than Maximum InfiniBand (IB)

NEURAL RECOMMENDER ENGINE EXAMPLE
[Diagram: linear layers are data parallel, replicated across GPUs; embedding tables (10GB-60GB each) are model parallel, distributed across GPU 0 .. GPU n; an All2All redistributes activations from model-parallel to data-parallel between the two]

RECOMMENDER WITH 14 TB EMBEDDING TABLES
[Chart: relative performance of A100 (IB), H100 (IB) and H100 (NVLink Network) on a 0x-5x scale]

Projected performance subject to change. Example model assumes DLRM with a mix of 300-hot and 1-hot embedding tables with total capacity of 14TB. Different recommender models may show different performance characteristics.
NVLINK NETWORK

[Diagram: an SM on the source GPU issues a request with a virtual address; the GPU MMU and NVLink NIC translate it to a network address carried across the NVLink Network switch; new Hopper Link NIC functions at the destination ensure the request is legal and map it (via a TLB) into the destination GPU's physical address space and HBM]

                      NVLink                  NVLink Network
Address Spaces        1 (shared)              N (independent)
Request Addressing    GPU physical address    Network address
Connection Setup      During boot process     Runtime API call by software
Isolation             No                      Yes
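
The "Runtime API call by software" row is what applications see in practice: a communicator spanning GPUs in different servers is created at run time. A hedged sketch of that setup, bootstrapped over MPI (the slide does not prescribe a particular library; names are illustrative):

```cpp
// Sketch: runtime connection setup for a multi-server job, bootstrapped with MPI.
// NCCL (and the driver underneath) establishes the fabric paths when the
// communicator is created; nothing is wired up at boot time.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nRanks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);

    // Rank 0 creates an opaque ID; all other ranks receive it out-of-band.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    cudaSetDevice(rank % 8);                    // one GPU per rank (illustrative)
    ncclComm_t comm;
    ncclCommInitRank(&comm, nRanks, id, rank);  // runtime call that sets up connections

    // ... collectives such as ncclAllReduce can now run across servers ...

    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```
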
MAPPING TO TRADITIONAL NETWORKING
NVLink Network is Tightly Integrated with GPU

  Concept                Traditional Example            NVLink Network
  Physical Layer         400G electrical/optical media  Custom-FW OSFP
  Data Link Layer        Ethernet                       NVLink custom on-chip HW and FW
  Network Layer          IP                             New NVLink Network Addressing and Management Protocols
  Transport Layer        TCP                            NVLink custom on-chip HW and FW
  Session Layer          Sockets                        SHARP groups; CUDA export of Network addresses of data-structures
  Presentation Layer     TLS/SSL                        Library abstractions (e.g., NCCL, NVSHMEM)
  Application Layer      HTTP/FTP                       AI Frameworks or User Apps
  NIC                    PCIe NIC (card or chip)        Functions embedded in GPU and NVSwitch
  RDMA Off-Load          NIC Off-Load Engine            GPU-internal Copy Engine
  Collectives Off-Load   NIC/Switch Off-Load Engine     NVSwitch-internal SHARP Engines
  Security Off-Load      NIC Security Features          GPU-internal Encryption and "TLB" Firewalls
  Media Control          NIC Cable Adaptation           NVSwitch-internal OSFP-cable controllers

NVLINK4 NVSWITCH BLOCK DIAGRAM

[Block diagram: Management block (Control Processor and State/Telemetry Proxy, including associated OSFPs; Security Processor; PCIe I/O); Port Logic 0-63, each with Classification & Routing, Packet Transforms, Error Check & Statistics Collection, and Transaction Tracking & Packet Transforms, attached to NVLink 0-63 (TL, DL, PHY); a 64 x 64 XBAR; and SHARP Controller, SHARP ALU (Hopper), and SHARP Scratch SRAM]

New SHARP Blocks
▪ ALU matched to Hopper unit
▪ Wide variety of operators (logical, min/max, add) and formats (S/U integers, FP16, FP32, FP64, BF16)
▪ SHARP Controller can manage up to 128 SHARP groups in parallel
▪ XBAR BW uprated to carry additional SHARP-related exchanges (NVLink plus SHARP)

New NVLink Network Blocks
▪ Security Processor protects data and chip configuration from attacks
▪ Partitioning features isolate subsets of ports into separate NVLink Networks
▪ Management controller now also handles attached OSFP cables
▪ Expanded telemetry to support InfiniBand-style monitoring
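
For scale, the operator/format list above covers the kinds of reductions a GPU-side library can issue; whether a given call is actually executed by the switch ALUs or falls back to GPU-side reduction is decided by the library/driver at run time, not by the application. A hedged illustration of such calls (same `comm`/`stream`/buffer assumptions as the earlier sketches):

```cpp
// Sketch: reductions in several of the formats/operators listed above, issued
// through NCCL. Buffers, comm and stream are assumed to exist (illustrative).
#include <nccl.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

void issue_reductions(__half* h, __nv_bfloat16* b, float* f, int* i32, size_t n,
                      ncclComm_t comm, cudaStream_t stream) {
    ncclGroupStart();
    ncclAllReduce(h,   h,   n, ncclFloat16,  ncclSum, comm, stream);  // FP16 add
    ncclAllReduce(b,   b,   n, ncclBfloat16, ncclSum, comm, stream);  // BF16 add
    ncclAllReduce(f,   f,   n, ncclFloat32,  ncclMax, comm, stream);  // FP32 max
    ncclAllReduce(i32, i32, n, ncclInt32,    ncclMin, comm, stream);  // signed-integer min
    ncclGroupEnd();
}
```
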
NVLink4-Generation NVSwitch Chip
1. Brief History of NVLink

2. NVLink4-Generation New Features

3. Chip Details

Hopper-Generation SuperPODs
1. NVSwitch-Enabled Platforms

2. NVLink Network SuperPODs

3. SuperPOD Performance
DGX H100 SERVER
8-H100 4-NVSwitch Server
▪ 32 PFLOPS of AI Performance
▪ 640 GB aggregate GPU memory
▪ 18 NVLink Network OSFPs
▪ 3.6 TB/s of full-duplex NVLink Network bandwidth (72 NVLinks)
▪ 8x 400 Gb/s ConnectX-7 InfiniBand/Ethernet ports
▪ 2 dual-port Bluefield-3 DPUs
▪ Dual Sapphire Rapids CPUs
▪ PCIe Gen5
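
A quick consistency check on the NVLink Network numbers above (a sketch, using the 4 NVLinks per OSFP cage and 50 GB/s full-duplex per NVLink4 link given earlier):

```cpp
// Sketch: DGX H100 NVLink Network figures from the per-link numbers above.
#include <cstdio>

int main() {
    const int osfps = 18;              // NVLink Network OSFP cages per DGX H100
    const int links_per_osfp = 4;      // 4 NVLinks per OSFP cage
    const double link_fdx_gbs = 50.0;  // GB/s full-duplex per NVLink4 link

    int links = osfps * links_per_osfp;             // 72 NVLinks
    double bw_tbs = links * link_fdx_gbs / 1000.0;  // 3.6 TB/s full-duplex
    std::printf("%d NVLinks, %.1f TB/s full-duplex NVLink Network BW\n", links, bw_tbs);
    return 0;
}
```
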
DGX H100: DATA-NETWORK CONFIGURATION

[Diagram: dual CPUs; 8 ConnectX-7 HCA/NICs, each behind a PCIe switch with one 400Gb port; 8 H100 GPUs attached to 4 NVSwitch chips (4 or 5 NVLinks per GPU per switch); the NVSwitches fan out to OSFP cages for the NVLink Network]

Full-BW Intra-Server NVLink
▪ All 8 GPUs can simultaneously saturate 18 NVLinks to other GPUs within server
▪ Limited only by over-subscription from multiple other GPUs

Half-BW NVLink Network
▪ All 8 GPUs can half-subscribe 18 NVLinks to GPUs in other servers
▪ 4 GPUs can saturate 18 NVLinks to GPUs in other servers
▪ Equivalent of full-BW on AllReduce with SHARP
▪ Reduction in All2All BW is a balance with server complexity and costs

Multi-Rail InfiniBand/Ethernet
▪ All 8 GPUs can independently RDMA data over their own dedicated 400 Gb/s HCA/NIC
▪ 800 GB/s of aggregate full-duplex bandwidth to non-NVLink Network devices
DGX H100 SUPERPOD: NVLINK SWITCH
NVLink Switch
▪ Standard 1RU 19-inch form factor highly leveraged from InfiniBand
switch design
▪ Dual NVLink4 NVSwitch chips
▪ 128 NVLink4 ports
▪ 32 OSFP cages
▪ 6.4 TB/s full-duplex BW
▪ Managed switch with out-of-band management communication
▪ Support for passive-copper, active-copper and optical OSFP cables
(custom FW)
DGX H100 SUPERPOD: AI EXASCALE
DGX H100 SuperPOD Scalable Unit
▪ 32 DGX H100 nodes + 18 NVLink Switches
▪ 256 H100 Tensor Core GPUs
▪ 1 ExaFLOP of AI performance
▪ 20 TB of aggregate GPU memory
▪ Network optimized for AI and HPC
▪ 128 L1 NVLink4 NVSwitch chips + 36 L2 NVLink4 NVSwitch chips
▪ 57.6 TB/s bisection NVLink Network spanning entire Scalable Unit
▪ 25.6 TB/s full-duplex NDR 400 Gb/s InfiniBand for connecting
multiple Scalable Units in a SuperPOD
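
The Scalable-Unit totals follow from the per-server numbers (a sketch under the assumption that bisection counts half the GPUs each contributing their half-subscribed NVLink Network bandwidth of 18 x 50 / 2 = 450 GB/s full-duplex):

```cpp
// Sketch: DGX H100 SuperPOD Scalable Unit totals from the per-server numbers.
#include <cstdio>

int main() {
    const int nodes = 32, gpus_per_node = 8;
    const double gpu_nvlink_net_gbs = 450.0;  // half-subscribed 18 NVLinks, full-duplex GB/s
    const int l1_chips = nodes * 4;           // 4 NVSwitch chips per DGX H100
    const int l2_chips = 18 * 2;              // 18 NVLink Switches, 2 NVSwitch chips each
    const double ib_tbs = nodes * 8 * 400.0 * 2 / 8 / 1000.0;  // 8x 400Gb/s NICs, both directions

    int gpus = nodes * gpus_per_node;
    double bisection_tbs = (gpus / 2) * gpu_nvlink_net_gbs / 1000.0;
    std::printf("%d GPUs, %d L1 + %d L2 NVSwitch chips, %.1f TB/s NVLink bisection, %.1f TB/s IB\n",
                gpus, l1_chips, l2_chips, bisection_tbs, ib_tbs);
    return 0;
}
// 256 GPUs, 128 + 36 chips, 57.6 TB/s NVLink Network bisection, 25.6 TB/s full-duplex IB.
```
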
SCALE-UP WITH NVLINK NETWORK

[Diagrams: DGX A100 256 POD with IB HDR spine switches over IB HDR leaf switches over 32 nodes (256 GPUs); DGX H100 256 POD with NVLink Switches (NVS) over 32 nodes (256 GPUs), fully NVLink-connected with massive bisection bandwidth]

                      A100 SuperPOD                     H100 SuperPOD                     Speedup
                      Dense     Bisection  Reduce       Dense     Bisection  Reduce       Bisection  Reduce
                      PFLOP/s   [GB/s]     [GB/s]       PFLOP/s   [GB/s]     [GB/s]
  1 DGX / 8 GPUs      2.5       2,400      150          16        3,600      450          1.5x       3x
  32 DGXs / 256 GPUs  80        6,400      100          512       57,600     450          9x         4.5x


NVLINK NETWORK BENEFITS
Dependent on Communication Intensity

[Charts: speedup over A100 for A100, H100 and H100 + NVLink Network.
 HPC (scale to 8x): Climate Modelling, Lattice QCD, 3D FFT, Genomics.
 AI Inference (scale to 35x): Megatron Turing NLG 530B at three latency targets.
 AI Training (scale to 10x): Vision Models, 10TB Recommender, GPT-3 175B, Switch-XXL 395B.]

Projected performance subject to change. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink Network where indicated.
# GPUs: Climate Modelling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100, 60 for H100 at 1sec, 8 for A100 and 64 for H100 at 1.5 and 2sec), MRCNN 8 (batch 32), GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 175B 16K (batch 512), MoE 8K (batch 512, one expert per GPU)
SUMMARY
Cutting-Edge Speeds and Capabilities

NVLink4-Generation NVSwitch
▪ 64 NVLink4 ports and 3.2 TB/s full-duplex BW
▪ NVLink SHARP (multi-cast and reductions off-load)
▪ Inter-Server NVLink Network support
▪ Custom FW OSFP NVLink Network cable support
▪ Basis of new NVLink Switch
Hopper-Generation SuperPOD
▪ 32 DGX H100 servers
▪ 18 NVLink Switches
▪ 1 ExaFLOP of AI performance
▪ 57.6TB/s NVLink Network bisection BW
▪ NVLink Network can more than double performance for
communication-intensive applications
▪ Scalable to thousands of GPUs using InfiniBand to connect
multiple Scalable Units
