System within one node
• GPU: A800, GPU : NIC = 8 : 8
• IB NIC: 200 Gb/s each
• PCIe Gen4: 64 GB/s
• NVLink: 2 × 6 = 12 links, 400 GB/s
• NVSwitch: 6
Scalable Unit (SU): 1 SU = 20 nodes
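A rough back-of-the-envelope comparison of the per-GPU bandwidths implied by the figures above; the numbers are taken from this slide, but the comparison itself is my own arithmetic sketch:

```python
# Per-node and per-GPU bandwidth implied by the slide's figures.
NUM_GPUS = 8
NUM_HCAS = 8                  # GPU : NIC = 8 : 8
IB_PER_HCA_GBIT = 200         # HDR IB, 200 Gb/s per NIC
NVLINK_GBYTE = 400            # NVLink bandwidth quoted on the slide (GB/s)

node_ib_gbyte = NUM_HCAS * IB_PER_HCA_GBIT / 8   # 200 GB/s aggregate per node
gpu_ib_gbyte = node_ib_gbyte / NUM_GPUS          # 25 GB/s of fabric bandwidth per GPU

print(f"Aggregate IB per node: {node_ib_gbyte:.0f} GB/s")
print(f"IB share per GPU: {gpu_ib_gbyte:.0f} GB/s vs. NVLink {NVLINK_GBYTE} GB/s "
      f"({NVLINK_GBYTE / gpu_ib_gbyte:.0f}x)")
```

The roughly 16x gap between NVLink and the per-GPU share of the IB fabric is the context for the multi-HCA results on the later slides.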
Computing Network Architecture
• HDR IB (200 Gb/s)
• 140 nodes / 80 nodes
For more information, please refer to the SuperPOD reference architecture whitepaper.
MLPERF TRAINING 0.7
NVIDIA Selene with DGX A100 (40GB)
Tested in 2020 Q3.
• Pure data parallelism.
• Up to 43% perf gain for MLPerf BERT training with 8*HCAs vs. 1*HCA at 128-node scale.
• Up to 30% perf gain for MLPerf RN50 training with 8*HCAs vs. 1*HCA at 230-node scale (HCA configuration sketched below).
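A minimal sketch of how a PyTorch/NCCL job can be pointed at all eight HCAs. This is an assumption about the mechanism, not the exact Selene configuration, and the mlx5_* device names are hypothetical examples:

```python
import os
import torch.distributed as dist

# Expose all eight IB rails to NCCL (device names are hypothetical examples).
# With GPU:NIC = 8:8, NCCL's topology detection pairs each GPU with the HCA
# closest to it on the PCIe tree, so data-parallel all-reduce can use all rails.
os.environ.setdefault(
    "NCCL_IB_HCA",
    "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7",
)

# Rank and world size are taken from the launcher's environment (e.g. torchrun).
dist.init_process_group(backend="nccl")
```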
OPTIMIZED IMPLEMENTATION
7.5B Model, 32 Nodes
7.5B model trained with Megatron on 32 nodes
• 32 nodes: TPS=4, PPS=1, DPS=64
• Forward-compute and backward-compute include the all-reduce within each tensor model parallel group (see the sketch below)
• AllReduce time within each data parallel group is significant
• 1*HCA, 2*HCAs: extremely poor, bounded by communication performance; some GPUs have to go through the SMP interconnect to reach an HCA, giving bad GDR performance
• 8*HCAs vs. 4*HCAs: 6.6% improvement, mainly from the improved AllReduce of gradients in the data parallel group
[Chart: Elapsed Time Per Step (ms) for 1*HCA, 2*HCA, 4*HCA, and 8*HCA, broken down into Forward-compute, Backward-compute, All reduce (Data P), and Optimizer; global_batch_size=512]
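A minimal sketch (plain PyTorch, not Megatron's own code) of how the 256 ranks of this run could be split into tensor-parallel and data-parallel groups, and where the two kinds of all-reduce in the chart come from; the group layout and helper names are my own assumptions:

```python
import torch
import torch.distributed as dist

TP = 4    # tensor model parallel size (TPS=4)
DP = 64   # data parallel size (DPS=64); 32 nodes x 8 GPUs = TP * DP = 256 ranks

def build_groups(world_size: int, rank: int):
    """Create this rank's TP and DP process groups (layout is assumed)."""
    assert world_size == TP * DP
    tp_group = dp_group = None
    # Consecutive ranks [i*TP, (i+1)*TP) form a tensor-parallel group, so with
    # 8 GPUs per node each TP group stays inside a node (NVLink/NVSwitch only).
    for i in range(DP):
        ranks = list(range(i * TP, (i + 1) * TP))
        group = dist.new_group(ranks)          # collective: every rank calls this
        if rank in ranks:
            tp_group = group
    # Ranks holding the same model shard form a data-parallel group that spans
    # nodes, so its gradient all-reduce crosses the IB fabric.
    for j in range(TP):
        ranks = list(range(j, world_size, TP))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return tp_group, dp_group

def tp_allreduce(partial_output: torch.Tensor, tp_group) -> torch.Tensor:
    # Counted inside the Forward-/Backward-compute bars: partial results of
    # tensor-parallel layers are summed within the TP group over NVLink.
    dist.all_reduce(partial_output, group=tp_group)
    return partial_output

def dp_gradient_allreduce(model: torch.nn.Module, dp_group) -> None:
    # The "All reduce (Data P)" bar: gradients are averaged across the DP group
    # after backward; this traffic crosses the IB network and is what benefits
    # most from going from 1 to 8 HCAs.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, group=dp_group)
            p.grad.div_(DP)
```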