System within one node
• GPU: A800, GPU : NIC = 8 : 8
• IB NIC: 200 Gb/s each
• PCIe Gen4: 64 GB/s
• NVLink: 2 × 6 = 12 links, 400 GB/s
• NVSwitch: 6
Scalable Unit (SU): 1 SU = 20 nodes
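A rough back-of-the-envelope comparison of the per-GPU bandwidths implied by the figures above; the numbers are taken from this slide, but the comparison itself is my own arithmetic sketch:

```python
# Per-node and per-GPU bandwidth implied by the slide's figures.
NUM_GPUS = 8
NUM_HCAS = 8                  # GPU : NIC = 8 : 8
IB_PER_HCA_GBIT = 200         # HDR IB, 200 Gb/s per NIC
NVLINK_GBYTE = 400            # NVLink bandwidth quoted on the slide (GB/s)

node_ib_gbyte = NUM_HCAS * IB_PER_HCA_GBIT / 8   # 200 GB/s aggregate per node
gpu_ib_gbyte = node_ib_gbyte / NUM_GPUS          # 25 GB/s of fabric bandwidth per GPU

print(f"Aggregate IB per node: {node_ib_gbyte:.0f} GB/s")
print(f"IB share per GPU: {gpu_ib_gbyte:.0f} GB/s vs. NVLink {NVLINK_GBYTE} GB/s "
      f"({NVLINK_GBYTE / gpu_ib_gbyte:.0f}x)")
```

The roughly 16x gap between NVLink and the per-GPU share of the IB fabric is the context for the multi-HCA results on the later slides.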
Computing Network Architecture
• HDR IB (200 Gb/s)
• 140 nodes / 80 nodes
For more information, please refer to the SuperPOD reference architecture whitepaper.
MLPERF TRAINING 0.7
NVIDIA Selene with DGX A100 (40GB)
Tested in 2020 Q3.
• Pure data parallelism.
• Up to 43% perf gain for MLPerf BERT training with 8*HCAs vs. 1*HCA at 128-node scale.
• Up to 30% perf gain for MLPerf RN50 training with 8*HCAs vs. 1*HCA at 230-node scale (HCA configuration sketched below).
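A minimal sketch of how a PyTorch/NCCL job can be pointed at all eight HCAs. This is an assumption about the mechanism, not the exact Selene configuration, and the mlx5_* device names are hypothetical examples:

```python
import os
import torch.distributed as dist

# Expose all eight IB rails to NCCL (device names are hypothetical examples).
# With GPU:NIC = 8:8, NCCL's topology detection pairs each GPU with the HCA
# closest to it on the PCIe tree, so data-parallel all-reduce can use all rails.
os.environ.setdefault(
    "NCCL_IB_HCA",
    "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7",
)

# Rank and world size are taken from the launcher's environment (e.g. torchrun).
dist.init_process_group(backend="nccl")
```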
OPTIMIZED IMPLEMENTATION
7.5B Model, 32 Nodes
7.5B model trained with Megatron on 32 nodes
• 32 nodes: TPS=4, PPS=1, DPS=64
• Forward-compute and backward-compute include the all-reduce within each tensor model parallel group (see the sketch below)
• AllReduce time within each data parallel group is significant
• 1*HCA, 2*HCAs: extremely poor, bounded by communication performance; some GPUs have to go through the SMP interconnect to reach an HCA, giving bad GDR performance
• 8*HCAs vs. 4*HCAs: 6.6% improvement, mainly from the improved AllReduce of gradients in the data parallel group
[Chart: Elapsed Time Per Step (ms) for 1*HCA, 2*HCA, 4*HCA, and 8*HCA, broken down into Forward-compute, Backward-compute, All reduce (Data P), and Optimizer; global_batch_size=512]
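A minimal sketch (plain PyTorch, not Megatron's own code) of how the 256 ranks of this run could be split into tensor-parallel and data-parallel groups, and where the two kinds of all-reduce in the chart come from; the group layout and helper names are my own assumptions:

```python
import torch
import torch.distributed as dist

TP = 4    # tensor model parallel size (TPS=4)
DP = 64   # data parallel size (DPS=64); 32 nodes x 8 GPUs = TP * DP = 256 ranks

def build_groups(world_size: int, rank: int):
    """Create this rank's TP and DP process groups (layout is assumed)."""
    assert world_size == TP * DP
    tp_group = dp_group = None
    # Consecutive ranks [i*TP, (i+1)*TP) form a tensor-parallel group, so with
    # 8 GPUs per node each TP group stays inside a node (NVLink/NVSwitch only).
    for i in range(DP):
        ranks = list(range(i * TP, (i + 1) * TP))
        group = dist.new_group(ranks)          # collective: every rank calls this
        if rank in ranks:
            tp_group = group
    # Ranks holding the same model shard form a data-parallel group that spans
    # nodes, so its gradient all-reduce crosses the IB fabric.
    for j in range(TP):
        ranks = list(range(j, world_size, TP))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return tp_group, dp_group

def tp_allreduce(partial_output: torch.Tensor, tp_group) -> torch.Tensor:
    # Counted inside the Forward-/Backward-compute bars: partial results of
    # tensor-parallel layers are summed within the TP group over NVLink.
    dist.all_reduce(partial_output, group=tp_group)
    return partial_output

def dp_gradient_allreduce(model: torch.nn.Module, dp_group) -> None:
    # The "All reduce (Data P)" bar: gradients are averaged across the DP group
    # after backward; this traffic crosses the IB network and is what benefits
    # most from going from 1 to 8 HCAs.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, group=dp_group)
            p.grad.div_(DP)
```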