Aum HPC Processor Development under National Supercomputing Mission
Sanjay Wandhekar
Senior Director, HoD HPC Technologies Group, C-DAC
sanjayw@cdac.in
National Supercomputing Mission (NSM)
Brief about National Supercomputing Mission
• Supercomputing infrastructure in the country
• Indigenous Supercomputing ecosystem in phased manner: From “Assembly” to “Manufacturing” to “Design and Manufacturing” of Supercomputers
  • Servers
  • HPC network
  • Software stack
  • HPC Processor
  • Liquid cooling technologies
• Supercomputing Applications of National interest
• Human resources for applications development and HPC maintenance
Motivation for India’s own HPC Processor - AUM
• Processor architecture suitable for both HPC and general-purpose computing: extracting maximum application-level performance
• Energy efficiency: Arm architecture
• Capability building with bargaining power
• Immunity from possible future export restrictions on India
• Technological sovereignty: designed and engineered in India
• Security (back doors etc.): highest priority for strategic sectors
HPC Processor development program
• Develop a competitive HPC Processor for HPC, AI and server market
• Develop a complete ecosystem leveraging open source components
• Open source software ecosystem
• Reference Boards
• Reference server designs
• Build a Pilot HPC system with > 1 PF compute power
• Be ready with Exascale system design and subsystems based on AUM
Processor
• Industry Collaboration – Design of SoC, Server designs, Deploy and market
solutions based on AUM Processor
• Targeted for both HPC and Cloud market
• Planned to be available in 2024
Some of the Architectural decisions
• Best Efficiency
• Memory Bandwidth
• Easy to optimize (Vector Size)
• Superior application-level performance / Watt -> increase memory sub-system performance
• Need much better Bytes/Flop performance – Target > 0.5 Byte/Flop
• Better I/O for Data Access
• HBM and DDR
• Many PCIe5 Lanes
• CXL for Coherent accelerators
• No competition with specialized devices like GPUs – keep provision of GPUs for specialized applications
• Security Features Provision
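One way to read the Bytes/Flop target is through a roofline-style bound. The slides do not name this model; it is used here only as an illustration. The sketch below assumes a hypothetical machine with 1000 GFLOPS peak and 500 GB/s memory bandwidth (0.5 Byte/Flop). The function names and all numbers are illustrative assumptions, not AUM measurements.

```python
def machine_balance(peak_gflops, mem_bw_gbs):
    """Bytes of memory traffic the machine can sustain per peak FLOP."""
    return mem_bw_gbs / peak_gflops


def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute peak and the bandwidth limit."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)


# Hypothetical machine (assumed values, chosen to give 0.5 Byte/Flop).
PEAK_GFLOPS = 1000.0
MEM_BW_GBS = 500.0

balance = machine_balance(PEAK_GFLOPS, MEM_BW_GBS)
print(f"machine balance: {balance:.2f} Byte/Flop")

# Kernels with fewer FLOPs per byte than 1 / balance are bandwidth-bound.
for intensity in (0.25, 1.0, 2.0, 8.0):
    bound = attainable_gflops(PEAK_GFLOPS, MEM_BW_GBS, intensity)
    print(f"intensity {intensity:>5.2f} Flop/Byte -> at most {bound:.0f} GFLOPS")
```

Under these assumptions, a kernel needs at least 2 Flop/Byte to reach peak. Raising Bytes/Flop lowers that threshold, which is why bandwidth-bound codes such as HPCG benefit.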
High Performance Conjugate Gradients (HPCG) Benchmark
Rank | Site | Computer | HPL Rmax (Pflop/s) | Bytes/Flop | HPCG (Pflop/s) | Fraction of Peak
1 | RIKEN Center for Computational Science, Japan | Supercomputer Fugaku: A64FX 48C 2.2GHz, Tofu D | 442.01 | 0.3 | 16 | 3.00%
2 | DOE/SC/ORNL, USA | Summit: IBM POWER9 22C 3.07GHz, Dual-rail Mellanox EDR Infiniband, NVIDIA Volta GV100 | 148.6 | <0.2 | 2.926 | 1.50%
3 | DOE/SC/LBNL/NERSC, United States | Perlmutter: AMD EPYC 7763 64C 2.45GHz, Slingshot-10, NVIDIA A100 SXM4 40 GB | 64.59 | <0.2 | 1.905 | 2.10%
4 | DOE/NNSA/LLNL, USA | Sierra: IBM POWER9 22C 3.1GHz, Dual-rail Mellanox EDR Infiniband, NVIDIA Volta GV100 | 94.64 | <0.2 | 1.796 | 1.40%
K-Computer: Bytes/Flop = 0.5, HPCG – 5.2% of Peak
Superior application-level performance: better Bytes/Flop, i.e. higher memory bandwidth
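The figures in the HPCG table can be cross-checked with a short calculation. The sketch below computes two derived quantities that the slide does not state: the HPCG-to-HPL ratio, which is relative to Rmax and therefore differs from the "Fraction of Peak" column, and the theoretical peak implied by dividing HPCG by the quoted fraction. The function and variable names are illustrative.

```python
# (system, HPL Rmax in Pflop/s, HPCG in Pflop/s, fraction of peak in %)
# Values copied from the table on this slide.
ROWS = [
    ("Fugaku", 442.01, 16.0, 3.00),
    ("Summit", 148.6, 2.926, 1.50),
    ("Perlmutter", 64.59, 1.905, 2.10),
    ("Sierra", 94.64, 1.796, 1.40),
]


def hpcg_to_hpl_percent(hpl_rmax, hpcg):
    """HPCG result as a percentage of the measured HPL Rmax."""
    return 100.0 * hpcg / hpl_rmax


def implied_peak(hpcg, fraction_percent):
    """Theoretical peak (Pflop/s) implied by HPCG and its fraction of peak."""
    return hpcg / (fraction_percent / 100.0)


for name, hpl, hpcg, frac in ROWS:
    ratio = hpcg_to_hpl_percent(hpl, hpcg)
    peak = implied_peak(hpcg, frac)
    print(f"{name:<11} HPCG/HPL = {ratio:.2f}%   implied peak = {peak:.1f} Pflop/s")
```

Because the quoted fractions are rounded, the implied peaks are approximate.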
C-DAC HPC SoC (A48Z) Block Diagram (48-Cores)
• Arm Neoverse V1 (Zeus) cores (48)
• D2D subsystem (fully coherent) to the other chiplet
• C2C subsystem (fully coherent) to the other socket
• PCIe Gen5 / CXL (32)
• Cortex-M7 based MSCP
• Coherent Mesh Network
• System Cache
• Memory subsystem: HBM3, DDR5
• Security subsystem
C-DAC AUM ॐ Microprocessor – 96 Cores
• Two A48Z chiplets (Chiplet-1 and Chiplet-2) with 48 Zeus cores each, joined by a D2D chiplet interconnect
• Per chiplet: HBM3-6400 PHY with HBM3 5600 RAM on an interposer
• Per chiplet: 8 DDR5 channels (DDR5-5200)
• Per chiplet: PCIe Gen5 / CXL
AUM - HPC Processor Development
• 96 core HPC Processor
• Armv8.4 architecture
• 96 MB L2 cache, 96 MB System cache
• 8 channel 5200 MHz DDR5 memory
• 64 GB HBM3 5600 MHz memory
• PCIe5 64/128 Lanes – CXL support for coherent Accelerators/ NIC
• SMP support up to 2 sockets
• Security Features - Secure boot and Crypto support
• 5nm Technology Node, Chiplet based architecture, 2-Chiplets, 96-Cores and up to
96-GB HBM3 memory in a socket
• Dual socket Server design with up to 4 Industry standard GPU accelerators –
Both HPC and AI applications (CPU Only node ~ 10 TF/Node)
• Indigenous Software eco-system for Aum Processor leveraging open source
eco-system
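The node-level figure above gives a rough size for the pilot system. The sketch below assumes that the ~10 TF/Node CPU-only figure is a peak number and that the pilot target of > 1 PF is met with CPU-only nodes. Neither assumption is stated in the slides, and the function name is illustrative.

```python
import math


def nodes_needed(target_tflops, node_tflops):
    """Smallest whole number of nodes whose combined peak reaches the target."""
    return math.ceil(target_tflops / node_tflops)


TARGET_TFLOPS = 1000.0   # 1 PF pilot target
NODE_TFLOPS = 10.0       # ~10 TF per CPU-only dual-socket node
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 96

nodes = nodes_needed(TARGET_TFLOPS, NODE_TFLOPS)
sockets = nodes * SOCKETS_PER_NODE
cores = sockets * CORES_PER_SOCKET
print(f"nodes: {nodes}, sockets: {sockets}, cores: {cores}")
```

Under these assumptions, a pilot exceeding 1 PF needs on the order of 100 dual-socket nodes, or fewer if GPU accelerators contribute to the total.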
Specification Comparison
Specification | Fujitsu A64FX | C-DAC AUM HPC Processor
Fabrication Technology | 7nm FF | TSMC 5nm FF
Core Configuration | (48+4)-Cores, 2.2 GHz (typical) | 96-Cores, 3.0 GHz (typical), 3.5+ GHz (turbo)
DDR Configuration | No DDR | 16-Channels (32 bit) DDR5-5200, BW = 332.8 GB/s
HBM | 32-GB HBM2 (4-Controllers), BW = 1 TB/s | 64-GB HBM3 (4-Controllers), BW = 2.87 TB/s
PCIe | 16 PCIe Gen3 Lanes | 64 PCIe Gen5 Lanes
Power | Not Known | 300 W (TDP)
Performance (DP) | 2.7 TFLOPS per socket | 4.6+ TFLOPS per socket
Bytes/FLOPS | 0.38 | 0.7
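The AUM bandwidth and Bytes/FLOPS figures in this comparison can be reproduced from the listed configuration. The sketch below rests on three assumptions that the slide does not state: each of the 4 HBM3 controllers drives one stack with a 1024-bit interface at 5600 MT/s, each core delivers 16 double-precision FLOPs per cycle (two 256-bit SVE FMA pipes), and the Bytes/FLOPS ratio uses combined DDR and HBM bandwidth. Function names are illustrative.

```python
def bandwidth_gbs(channels, bits_per_channel, mega_transfers_per_s):
    """Peak bandwidth in GB/s for a set of identical memory channels."""
    return channels * (bits_per_channel / 8) * mega_transfers_per_s / 1000.0


def peak_gflops(cores, ghz, flops_per_cycle):
    """Peak double-precision GFLOPS for a socket."""
    return cores * ghz * flops_per_cycle


# 16 DDR5-5200 channels, 32 bits each (from the table).
ddr = bandwidth_gbs(16, 32, 5200)
# Assumed: 4 HBM3 stacks, 1024-bit interface each, 5600 MT/s.
hbm = bandwidth_gbs(4, 1024, 5600)
# Assumed: 16 DP FLOPs per cycle per core at the 3.0 GHz typical clock.
peak = peak_gflops(96, 3.0, 16)

bytes_per_flop = (ddr + hbm) / peak
print(f"DDR5 bandwidth: {ddr:.1f} GB/s")
print(f"HBM3 bandwidth: {hbm:.1f} GB/s")
print(f"Peak DP:        {peak:.0f} GFLOPS")
print(f"Bytes/FLOP:     {bytes_per_flop:.2f}")
```

With these inputs the results agree with the table's 332.8 GB/s, 2.87 TB/s, 4.6+ TFLOPS and roughly 0.7 Bytes/FLOPS.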
C-DAC HPC System SW & Development Tools
• System Software, Dev Tools and Utilities
• HPC Compiler (C and Fortran, multicore + accelerator, HPC & AI applications)
• IDE for HPC & AI applications on ARM systems, supporting multiple parallel paradigms
• Automatic Parallelizer to generate parallel code for multicore / accelerators
• Application Debugger & Profiler
• Optimized Math-AI Libraries
• ARM System Monitor and Utilities
• Parallel Runtime System
• Secure Access Interface
Dual Socket Compute Node: ANANTA
• Socket_0 and Socket_1, linked by C2C (CCIX), 4 x16
• Per socket: 4 x HBM3; DIMM slots in two groups of 4/8 DIMMs
• Socket_0 side: Riser Slot_0 and Riser Slot_1 (PCIe/CXL, 1 x16 each); OCP 3.0 IF / Riser Slot_2 (PCIe, 1 x16)
• Socket_1 side: Riser Slot_4 and Riser Slot_5 (PCIe/CXL, 1 x16 each); Riser Slot_6 and Riser Slot_7 (PCIe, 1 x16 each)
• Low-speed I/O on x4 and x2 links: 2x M.2 NVMe, network adapter with UTP 1G/10G ports, USB 3.0/2.0, debug port
• BMC management interface (for KVM and Redfish support)
• Clock distribution, CPLD, power distribution
Single Socket Compute Node: ANANTA
• Socket_0 with 4 x HBM3; DIMM slots in two groups of 4/8 DIMMs
• Riser Slot_0 and Riser Slot_1 (PCIe/CXL, 1 x16 each); OCP 3.0 IF / Riser Slot_2 (PCIe, 1 x16)
• Riser Slot_4, Riser Slot_5, Riser Slot_6 and Riser Slot_7 (PCIe, 1 x16 each)
• Low-speed I/O on x4 and x2 links: 2x M.2 NVMe, network adapter with UTP 1G/10G ports, USB 3.0/2.0, debug port
• BMC management interface (for KVM and Redfish support)
• Clock distribution, CPLD, power distribution
Summary Aum Processor
• Competitive HPC Processor for HPC, AI and server market
• Address Strategic requirements
• Complete ecosystem
• Open source software ecosystem
• Reference Boards
• Reference server designs – Derivatives as per market requirements
• Industry partners OEMs/ODMs/Solution providers
• Towards Indigenous Exascale system including Processors
• Targeted market HPC/AI, Cloud, storage, edge computing
• Planned to be available in 2024
Thank You
Market Comparison
Parameter | Ampere Altra | SiPearl Rhea | C-DAC ॐ
Cores | 80 Arm Neoverse N1 (Ares) Cores | 72 Arm Neoverse V1 (Zeus) Cores | 96 Arm Neoverse V1 (Zeus) Cores
Cache | L1: 64 KB L1 I / 64 KB L1 D per core; L2: 1 MB L2 cache per core; System Cache: 32 MB | L1: 64 KB I-Cache / 64 KB D-Cache per core; System Cache: 128 MB | L2: 1 MB Unified Cache per core; System Cache: 96 MB, Snoop Filter: 192 MB
Frequency | 3.0 GHz (base), 3.3 GHz (turbo) | 2.5 GHz (base), 3.0 GHz (turbo) | 3.0 GHz
HBM | No HBM | 96 GB of HBM2E | 96 GB of HBM3
DDR | Up to 4 TB per socket | 4 Channels DDR5 | 16 Channels DDR5
PCI | 128 PCIe4 Lanes | 104 Lanes PCIe5: up to 64 lanes for coherent connectivity, remaining lanes as PCIe5 | 128 Lanes PCIe5: up to 64 lanes for coherent connectivity, remaining lanes as PCIe5/CXL
TDP | 250 W | 320 W | 280 - 320 W
Node | TSMC 5nm | TSMC 6nm | TSMC 5nm
Package | - | 2.5D | 2.5D
Release Year | 2020 | 2022/23 | 2023/24