NCCL Fast Init - CPU Optimizations for NCCL Initialization Large Scale by saifhhasan · Pull Request #1789 · NVIDIA/nccl · GitHub

Conversation

@saifhhasan commented Jul 22, 2025

Problem

At large scale (32K+ GPUs) we start to see significant initialization time coming from ncclBuildRing and initTransportsRank - often several dozen seconds at 100K scale. This occurs because both functions perform nested loops of O(N*N) complexity, which at 100K scale translates to roughly 10B loop iterations, each executing multiple instructions.
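
For intuition, here is a minimal hypothetical sketch (not the actual NCCL source) of how the quadratic cost arises: every candidate rank is checked against the ranks already placed in the ring by a linear scan, so each of the N placements costs O(N).

// Hypothetical sketch of the quadratic pattern - not the actual ncclBuildRing code.
// The membership test scans the partially built ring, so each of the N
// placements costs O(N), and building one ring costs O(N^2).
#include <vector>

static bool alreadyInRing(const std::vector<int>& ring, int rank) {
  for (int r : ring)                       // O(N) linear scan
    if (r == rank) return true;
  return false;
}

static void buildRingSlow(int nranks, const std::vector<int>& candidates, std::vector<int>& ring) {
  ring.clear();
  ring.reserve(nranks);
  for (int rank : candidates) {            // N placements ...
    if (!alreadyInRing(ring, rank))        // ... each paying an O(N) scan
      ring.push_back(rank);
  }
}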

We used CPU profiling tools to spot the loops that were taking an excessive amount of time during the initialization phase.

[Two CPU profiler screenshots (2025-07-22) showing the hot loops during initialization]

Observations

These changes optimize the two functions to remove the overhead and enable NCCL to initialize quickly at 100K scale. We tested this at scale at Meta and observed the following savings:

  • At 96K (96*1024) scale the ncclBuildRing optimization saves 26s of busy CPU cycles.
  • At 48K (48*1024) scale the initTransportsRank optimization saves 11+ seconds.

Most notably, this patch reduces CPU utilization during the job startup phase, which is also crucial for other operations going on at the job level (checkpoint initialization, model loading, data fetching, etc.).

Testing

At Meta we've tested these fixes on our large-scale clusters. On top of that, I'm also adding a simple standalone benchmark to demonstrate the performance improvement of ncclBuildRings before and after. The numbers below cover rank counts from 1024 to 96K. With 16 rings we save about 13s; with 32 rings it would be about 52s.

Run on (368 X 2396.4 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x368)
  L1 Instruction 64 KiB (x368)
  L2 Unified 512 KiB (x368)
  L3 Unified 16384 KiB (x1)
Load Average: 8.19, 7.24, 9.36
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
//
// Without the Fix
//
BM_ncclBuildRings/1024              1648154 ns      1648085 ns          415
BM_ncclBuildRings/2048              6378142 ns      6377981 ns          112
BM_ncclBuildRings/4096             23255134 ns     23253167 ns           30
BM_ncclBuildRings/8192             90435922 ns     90432606 ns            8
BM_ncclBuildRings/16384           358064532 ns    358047348 ns            2
BM_ncclBuildRings/32768          1430027485 ns   1429897998 ns            1
BM_ncclBuildRings/65536          5720381498 ns   5719983433 ns            1
BM_ncclBuildRings/98304          12864594936 ns   12863500960 ns            1

//
// With the Fix in this PR
//
BM_ncclBuildRingsOptimized/1024       30666 ns        30665 ns        22790
BM_ncclBuildRingsOptimized/2048       60711 ns        60710 ns        11498
BM_ncclBuildRingsOptimized/4096      120775 ns       120771 ns         5795
BM_ncclBuildRingsOptimized/8192      243442 ns       243431 ns         2878
BM_ncclBuildRingsOptimized/16384     502392 ns       502383 ns         1398
BM_ncclBuildRingsOptimized/32768    1006667 ns      1006595 ns          696
BM_ncclBuildRingsOptimized/65536    2010650 ns      2010633 ns          345
BM_ncclBuildRingsOptimized/98304    3003351 ns      3003085 ns          233
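
As a rough sanity check on the asymptotics: going from 1024 to 98304 ranks (a 96x increase), the unoptimized time grows from about 1.65 ms to about 12.86 s, roughly 7800x, which is consistent with quadratic growth, while the optimized time grows from about 30.7 µs to about 3.0 ms, roughly 98x, which is consistent with linear growth.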

Test Binary Code

It can be compiled standalone by linking with gtest and google-benchmark:
nccl-build-ring-benchmark.cpp.txt
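
The attached file contains the full benchmark. For readers unfamiliar with google-benchmark, a minimal skeleton of this kind of harness looks roughly like the following; buildRingsUnderTest is a placeholder for the routine being measured, not code from the attachment.

// Minimal google-benchmark skeleton in the style of the attached benchmark.
// buildRingsUnderTest() is a stand-in for the real ring-building routine.
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

static void buildRingsUnderTest(std::vector<int>& ring) {
  std::iota(ring.begin(), ring.end(), 0);  // placeholder for the measured work
}

static void BM_ncclBuildRings(benchmark::State& state) {
  const int nranks = static_cast<int>(state.range(0));
  std::vector<int> ring(nranks);
  for (auto _ : state) {
    buildRingsUnderTest(ring);
    benchmark::DoNotOptimize(ring.data());
  }
}
// Sweep the rank counts used in the results above, 1024 .. 96*1024.
BENCHMARK(BM_ncclBuildRings)->Arg(1024)->Arg(2048)->Arg(4096)->Arg(8192)
                            ->Arg(16384)->Arg(32768)->Arg(65536)->Arg(98304);

BENCHMARK_MAIN();

It can then be built with something along the lines of g++ -O2 nccl-build-ring-benchmark.cpp -lbenchmark -lpthread (exact flags depend on how google-benchmark was installed).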

sprintf(prefix, "[%d] Channel %d Next : ", rank, r);
dumpLine(next+r*nranks, nranks, prefix);*/

std::vector<bool> rankBitSet(nranks, false);
@saifhhasan
Author
I'm using std::vector here but am open to other choices. My rationale for using vector:

  • NCCL is already using the C++ std library, so using vector doesn't add a new dependency.
  • Memory is automatically managed with a std container.
  • Most notably, std::vector<bool> has a space-efficient implementation that uses 1 bit per entry instead of 1 byte, making it space- as well as page-cache-efficient (see the sketch below). https://en.cppreference.com/w/cpp/container/vector_bool/reference
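
To illustrate the effect on the inner loop, here is a minimal sketch (mirroring the hypothetical example from the Problem section, not the actual NCCL code) of how the bit set replaces the linear membership scan with a constant-time lookup:

// Sketch only: std::vector<bool> turns the O(N) "already in the ring?" scan
// into an O(1) bit lookup, so building one ring becomes O(N).
#include <vector>

static void buildRingFast(int nranks, const std::vector<int>& candidates, std::vector<int>& ring) {
  std::vector<bool> rankBitSet(nranks, false);  // one bit per rank, as in the diff above
  ring.clear();
  ring.reserve(nranks);
  for (int rank : candidates) {      // N placements ...
    if (!rankBitSet[rank]) {         // ... each with an O(1) membership test
      rankBitSet[rank] = true;
      ring.push_back(rank);
    }
  }
}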

@saifhhasan saifhhasan changed the title NCCL Fast Init - Optimize Topology Processing at Large Scale NCCL Fast Init - CPU Optimizations for NCCL Initialization Large Scale Jul 22, 2025
@saifhhasan
Author

Hi @sjeaugey and @marksantesson - I've sent out a new PR optimizing the init times at large scale that surprised us at Meta. We believe it'll be of help to the broader community. I'm happy to iterate with you to improve the PR based on your feedback.

@stephenmsachs
Collaborator

Hi @saifhhasan. Thanks for your contribution. I have a question about the initTransportsRank changes in the implementation:
The original implementation checks every pair (i,j) of comm->peerInfo[i], comm->peerInfo[j] and disables nvlsRegSupport if any two ranks sharing host & pid are found. The proposed implementation only checks whether any comm->peerInfo[i] shares a process with the current rank comm->peerInfo[comm->rank]. Since comm->nvlsRegSupport is process-local, I believe this can lead to a different outcome.

@stephenmsachs
Collaborator

Hi @saifhhasan. We will have a similar change in rings.cc in the next release. This will bring down the complexity a little more. I would still appreciate clarification concerning your proposed changes in init.cc. Maybe I am just not understanding them correctly.

@saifhhasan
Author

Thank you @stephenmsachs for helping take a look at the PR.

Regarding the init.cc changes, what you describe could be an edge case. comm->nvlsRegSupport may end up with different values on different ranks with this change, but in practice, due to execution scheduling, we never ran into this case. We may be able to overcome it by using a std::unordered_set<pair<hostHash, pidHash>> to achieve O(N log N) complexity.
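
A rough sketch of that alternative follows; the PeerInfo struct below is a stand-in for illustration (hostHash/pidHash are the fields mentioned above), not NCCL's actual peer-info definition. The idea is to insert each rank's (hostHash, pidHash) into a set and flag a duplicate the first time one appears, preserving the pairwise semantics of the original check without the O(N*N) loop.

// Sketch of the suggested duplicate-detection approach; struct layout is assumed
// for illustration, not copied from NCCL.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>

struct PeerInfo { uint64_t hostHash; uint64_t pidHash; };

struct PairHash {
  std::size_t operator()(const std::pair<uint64_t, uint64_t>& p) const {
    return std::hash<uint64_t>{}(p.first) ^ (std::hash<uint64_t>{}(p.second) << 1);
  }
};

// Returns true if any two ranks share the same (host, pid), i.e. the same process.
bool anyTwoRanksShareProcess(const PeerInfo* peerInfo, int nranks) {
  std::unordered_set<std::pair<uint64_t, uint64_t>, PairHash> seen;
  seen.reserve(nranks);
  for (int r = 0; r < nranks; r++) {
    if (!seen.insert({peerInfo[r].hostHash, peerInfo[r].pidHash}).second)
      return true;  // duplicate (host, pid) found
  }
  return false;
}

If the result is true, every rank can disable nvlsRegSupport, matching the outcome of the original pairwise check on all ranks.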

marksantesson added a commit that referenced this pull request Oct 18, 2025
GPU-Initiated Networking (GIN):
 * Provides device-side API for integrating GPU-Initiated Networking
   capability into application kernels.
 * New transport layer called DOCA GPUNetIO.
 * New ncclGin construct to create, destroy and manipulate GIN contexts.
 * New ncclGinBarrierSession to provide synchronization functionality.
 * New put, signal, counter operations for data movement and signaling.
 * GIN API signatures and functionalities are subject to change.
 * GIN Support Requirements
   * CUDA 12.2 or later when compiling the GPU code
   * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
   * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
   * Requires nvidia-peermem or DMABUF support. When using DMABUF, linux
     kernel >= 6.1 is required.

New ncclCommRevoke API for fault tolerance:
 * Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
   communicator without freeing resources.
 * This answers the need for a lightweight way to cancel in-flight
   collectives and bring a communicator to a safe state before
   split/shrink/finalize/destroy.
 * Includes optional cross-rank coordination (global barrier) and
   supports blocking/non-blocking usage.

New NCCL Environment Plugin:
 * The env plugin allows users to set NCCL environment variables, for
   example, after loading them from a centralized database.
 * The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
   environment plugin.

New NCCL Examples on GitHub:
 * The NCCL examples directory provides users and developers with
   practical code samples that highlight NCCL’s core features.
 * It covers basic operations like communicator initialization,
   point-to-point communication, and collective operations, as well as
   advanced features such as user buffer registration, symmetric memory,
   and the device API.

Device API improvements:
 * Adds ncclFindWindow API.
 * Adds new ncclBarrierSession to provide hybrid synchronization
   functionality.
 * Makes multimem available with as few as two ranks.
 * Removes distance (NCCL_P2P_LEVEL) considerations from determining the
   availability of symmetric memory.

Enhanced NCCL RAS output:
 * Extends RAS subsystem with JSON format to support machine-parsable
   metrics collection.
 * Enables structured data export for monitoring tools, dashboards, and
   automated analysis systems.

Github Pull Requests resolved:
 * Fast Init - CPU Optimizations for NCCL Initialization Large Scale.
   (PR #1789)
 * Fast Init - Improve Bootstrap AllGather by 2x at large scale by
   sending bootstrap information bidirectionally. (PR #1791)
 * Fixes spurious failures when PyTorch is statically linked with
   NCCL-2.28.3 because error is not drained, but rather gets propagated
   into the next CUDA kernel invocation. (PR #1864)

Other notable improvements:
 * Fixes multicast object leaks in case of failed NVLS user buffer
   registrations, which could lead to crashes. Avoids such registration
   attempts in case of the use of incompatible memory allocators.
 * Fixes potential data corruption with built-in symmetric kernels for
   small messages with size granularity under 8 bytes or when multiple
   symmetric operations were aggregated in a group.
 * Generalizes the existing point-to-point scheduling to the case of
   uneven GPU count per node.
 * Fixes a crash when network plugin assignment fails.
 * Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
   split mask settings, where NCCL cannot find a viable ring.
 * Fixes crash when NCCL is compiled with recent CUDA versions but
   running on hosts with certain specific older CUDA drivers.
@xiaofanl-nvidia
Collaborator

This has been accepted and released in the latest 2.28.7 release. @saifhhasan
