NCCL Fast Init - CPU Optimizations for NCCL Initialization Large Scale by saifhhasan · Pull Request #1789 · NVIDIA/nccl · GitHub

Conversation

@saifhhasan commented Jul 22, 2025

Problem

At large scale (32K+ GPUs) we start to see significant initialization time coming from ncclBuildRing and initTransportsRank - often several dozen seconds at 100K scale. This occurs because both functions perform nested loops of O(N*N) complexity, which at 100K scale translates to roughly 10B loop iterations, each executing multiple instructions.
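
For intuition, here is a minimal hypothetical sketch (not the actual NCCL source) of how the quadratic cost arises: every candidate rank is checked against the ranks already placed in the ring by a linear scan, so each of the N placements costs O(N).

// Hypothetical sketch of the quadratic pattern - not the actual ncclBuildRing code.
// The membership test scans the partially built ring, so each of the N
// placements costs O(N), and building one ring costs O(N^2).
#include <vector>

static bool alreadyInRing(const std::vector<int>& ring, int rank) {
  for (int r : ring)                       // O(N) linear scan
    if (r == rank) return true;
  return false;
}

static void buildRingSlow(int nranks, const std::vector<int>& candidates, std::vector<int>& ring) {
  ring.clear();
  ring.reserve(nranks);
  for (int rank : candidates) {            // N placements ...
    if (!alreadyInRing(ring, rank))        // ... each paying an O(N) scan
      ring.push_back(rank);
  }
}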

We used CPU profiling tools to spot the loops that were taking an excessive amount of time during the initialization phase.

[Two CPU profiler screenshots (2025-07-22) showing the hot loops during initialization]

Observations

These changes optimize the two functions to remove the overhead and enable NCCL to initialize quickly at 100K scale. We tested this at scale at Meta and observed the following savings:

  • At 96K (96*1024) scale the ncclBuildRing optimization saves 26s of busy CPU cycles.
  • At 48K (48*1024) scale the initTransportsRank optimization saves 11+ seconds.

Most notably, this patch reduces CPU utilization during the job startup phase, which is also crucial for other operations going on at the job level (checkpoint initialization, model loading, data fetching, etc.).

Testing

At Meta we've tested these fixes on our large-scale clusters. On top of that, I'm also adding a simple standalone benchmark to demonstrate the performance improvement of ncclBuildRings before and after. The numbers below cover rank counts from 1024 to 96K. With 16 rings we save about 13s; with 32 rings it would be about 52s.

Run on (368 X 2396.4 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x368)
  L1 Instruction 64 KiB (x368)
  L2 Unified 512 KiB (x368)
  L3 Unified 16384 KiB (x1)
Load Average: 8.19, 7.24, 9.36
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
//
// Without the Fix
//
BM_ncclBuildRings/1024              1648154 ns      1648085 ns          415
BM_ncclBuildRings/2048              6378142 ns      6377981 ns          112
BM_ncclBuildRings/4096             23255134 ns     23253167 ns           30
BM_ncclBuildRings/8192             90435922 ns     90432606 ns            8
BM_ncclBuildRings/16384           358064532 ns    358047348 ns            2
BM_ncclBuildRings/32768          1430027485 ns   1429897998 ns            1
BM_ncclBuildRings/65536          5720381498 ns   5719983433 ns            1
BM_ncclBuildRings/98304          12864594936 ns   12863500960 ns            1

//
// With the Fix in this PR
//
BM_ncclBuildRingsOptimized/1024       30666 ns        30665 ns        22790
BM_ncclBuildRingsOptimized/2048       60711 ns        60710 ns        11498
BM_ncclBuildRingsOptimized/4096      120775 ns       120771 ns         5795
BM_ncclBuildRingsOptimized/8192      243442 ns       243431 ns         2878
BM_ncclBuildRingsOptimized/16384     502392 ns       502383 ns         1398
BM_ncclBuildRingsOptimized/32768    1006667 ns      1006595 ns          696
BM_ncclBuildRingsOptimized/65536    2010650 ns      2010633 ns          345
BM_ncclBuildRingsOptimized/98304    3003351 ns      3003085 ns          233
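
As a rough sanity check on the asymptotics: going from 1024 to 98304 ranks (a 96x increase), the unoptimized time grows from about 1.65 ms to about 12.86 s, roughly 7800x, which is consistent with quadratic growth, while the optimized time grows from about 30.7 µs to about 3.0 ms, roughly 98x, which is consistent with linear growth.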

Test Binary Code

It can be compiled standalone by linking with gtest and google-benchmark:
nccl-build-ring-benchmark.cpp.txt
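
The attached file contains the full benchmark. For readers unfamiliar with google-benchmark, a minimal skeleton of this kind of harness looks roughly like the following; buildRingsUnderTest is a placeholder for the routine being measured, not code from the attachment.

// Minimal google-benchmark skeleton in the style of the attached benchmark.
// buildRingsUnderTest() is a stand-in for the real ring-building routine.
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

static void buildRingsUnderTest(std::vector<int>& ring) {
  std::iota(ring.begin(), ring.end(), 0);  // placeholder for the measured work
}

static void BM_ncclBuildRings(benchmark::State& state) {
  const int nranks = static_cast<int>(state.range(0));
  std::vector<int> ring(nranks);
  for (auto _ : state) {
    buildRingsUnderTest(ring);
    benchmark::DoNotOptimize(ring.data());
  }
}
// Sweep the rank counts used in the results above, 1024 .. 96*1024.
BENCHMARK(BM_ncclBuildRings)->Arg(1024)->Arg(2048)->Arg(4096)->Arg(8192)
                            ->Arg(16384)->Arg(32768)->Arg(65536)->Arg(98304);

BENCHMARK_MAIN();

It can then be built with something along the lines of g++ -O2 nccl-build-ring-benchmark.cpp -lbenchmark -lpthread (exact flags depend on how google-benchmark was installed).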

sprintf(prefix, "[%d] Channel %d Next : ", rank, r);
dumpLine(next+r*nranks, nranks, prefix);*/

std::vector<bool> rankBitSet(nranks, false);
@saifhhasan
Author
I'm using std::vector here but am open to other choices. My rationale for using vector:

  • NCCL is already using the C++ std library, so using vector doesn't add a new dependency.
  • Memory is automatically managed with a std container.
  • Most notably, std::vector<bool> has a space-efficient implementation that uses 1 bit per entry instead of 1 byte, making it space- as well as page-cache-efficient (see the sketch below). https://en.cppreference.com/w/cpp/container/vector_bool/reference
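
To illustrate the effect on the inner loop, here is a minimal sketch (mirroring the hypothetical example from the Problem section, not the actual NCCL code) of how the bit set replaces the linear membership scan with a constant-time lookup:

// Sketch only: std::vector<bool> turns the O(N) "already in the ring?" scan
// into an O(1) bit lookup, so building one ring becomes O(N).
#include <vector>

static void buildRingFast(int nranks, const std::vector<int>& candidates, std::vector<int>& ring) {
  std::vector<bool> rankBitSet(nranks, false);  // one bit per rank, as in the diff above
  ring.clear();
  ring.reserve(nranks);
  for (int rank : candidates) {      // N placements ...
    if (!rankBitSet[rank]) {         // ... each with an O(1) membership test
      rankBitSet[rank] = true;
      ring.push_back(rank);
    }
  }
}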

@saifhhasan saifhhasan changed the title NCCL Fast Init - Optimize Topology Processing at Large Scale NCCL Fast Init - CPU Optimizations for NCCL Initialization Large Scale Jul 22, 2025
@saifhhasan
Author

Hi @sjeaugey and @marksantesson - I've sent out a new PR optimizing the init times at large scale that surprised us at Meta. We believe it'll be of help to the broader community. I'm happy to iterate with you to improve the PR based on your feedback.

@stephenmsachs
Collaborator

Hi @saifhhasan. Thanks for your contribution. I have a question about the initTransportsRank changes in the implementation:
The original implementation checks every pair (i,j) of comm->peerInfo[i], comm->peerInfo[j] and disables nvlsRegSupport if any two ranks sharing host & pid are found. The proposed implementation only checks whether any comm->peerInfo[i] shares a process with the current rank comm->peerInfo[comm->rank]. Since comm->nvlsRegSupport is process-local, I believe this can lead to a different outcome.

@stephenmsachs
Collaborator

Hi @saifhhasan. We will have a similar change in rings.cc in the next release. This will bring down the complexity a little more. I would still appreciate clarification concerning your proposed changes in init.cc. Maybe I am just not understanding them correctly.

@saifhhasan
Author

Thank you @stephenmsachs for helping take a look at the PR.

Regarding the init.cc changes, what you describe could be an edge case. comm->nvlsRegSupport may end up with different values on different ranks with this change, but in practice, due to execution scheduling, we never ran into this case. We may be able to overcome it by using a std::unordered_set<pair<hostHash, pidHash>> to achieve O(N log N) complexity.
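
A rough sketch of that alternative follows; the PeerInfo struct below is a stand-in for illustration (hostHash/pidHash are the fields mentioned above), not NCCL's actual peer-info definition. The idea is to insert each rank's (hostHash, pidHash) into a set and flag a duplicate the first time one appears, preserving the pairwise semantics of the original check without the O(N*N) loop.

// Sketch of the suggested duplicate-detection approach; struct layout is assumed
// for illustration, not copied from NCCL.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>

struct PeerInfo { uint64_t hostHash; uint64_t pidHash; };

struct PairHash {
  std::size_t operator()(const std::pair<uint64_t, uint64_t>& p) const {
    return std::hash<uint64_t>{}(p.first) ^ (std::hash<uint64_t>{}(p.second) << 1);
  }
};

// Returns true if any two ranks share the same (host, pid), i.e. the same process.
bool anyTwoRanksShareProcess(const PeerInfo* peerInfo, int nranks) {
  std::unordered_set<std::pair<uint64_t, uint64_t>, PairHash> seen;
  seen.reserve(nranks);
  for (int r = 0; r < nranks; r++) {
    if (!seen.insert({peerInfo[r].hostHash, peerInfo[r].pidHash}).second)
      return true;  // duplicate (host, pid) found
  }
  return false;
}

If the result is true, every rank can disable nvlsRegSupport, matching the outcome of the original pairwise check on all ranks.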

marksantesson added a commit that referenced this pull request Oct 18, 2025
GPU-Initiated Networking (GIN):
 * Provides device-side API for integrating GPU-Initiated Networking
   capability into application kernels.
 * New transport layer called DOCA GPUNetIO.
 * New ncclGin construct to create, destroy and manipulate GIN contexts.
 * New ncclGinBarrierSession to provide synchronization functionality.
 * New put, signal, counter operations for data movement and signaling.
 * GIN API signatures and functionalities are subject to change.
 * GIN Support Requirements
   * CUDA 12.2 or later when compiling the GPU code
   * NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
   * NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
   * Requires nvidia-peermem or DMABUF support. When using DMABUF, linux
     kernel >= 6.1 is required.

New ncclCommRevoke API for fault tolerance:
 * Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
   communicator without freeing resources.
 * This answers the need for a lightweight way to cancel in-flight
   collectives and bring a communicator to a safe state before
   split/shrink/finalize/destroy.
 * Includes optional cross-rank coordination (global barrier) and
   supports blocking/non-blocking usage.

New NCCL Environment Plugin:
 * The env plugin allows users to set NCCL environment variables, for
   example, after loading them from a centralized database.
 * The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
   environment plugin.

New NCCL Examples on GitHub:
 * The NCCL examples directory provides users and developers with
   practical code samples that highlight NCCL’s core features.
 * It covers basic operations like communicator initialization,
   point-to-point communication, and collective operations, as well as
   advanced features such as user buffer registration, symmetric memory,
   and the device API.

Device API improvements:
 * Adds ncclFindWindow API.
 * Adds new ncclBarrierSession to provide hybrid synchronization
   functionality.
 * Makes multimem available with as few as two ranks.
 * Removes distance (NCCL_P2P_LEVEL) considerations from determining the
   availability of symmetric memory.

Enhanced NCCL RAS output:
 * Extends RAS subsystem with JSON format to support machine-parsable
   metrics collection.
 * Enables structured data export for monitoring tools, dashboards, and
   automated analysis systems.

Github Pull Requests resolved:
 * Fast Init - CPU Optimizations for NCCL Initialization Large Scale.
   (PR #1789)
 * Fast Init - Improve Bootstrap AllGather by 2x at large scale by
   sending bootstrap information bidirectionally. (PR #1791)
 * Fixes spurious failures when PyTorch is statically linked with
   NCCL-2.28.3 because error is not drained, but rather gets propagated
   into the next CUDA kernel invocation. (PR #1864)

Other notable improvements:
 * Fixes multicast object leaks in case of failed NVLS user buffer
   registrations, which could lead to crashes. Avoids such registration
   attempts in case of the use of incompatible memory allocators.
 * Fixes potential data corruption with built-in symmetric kernels for
   small messages with size granularity under 8 bytes or when multiple
   symmetric operations were aggregated in a group.
 * Generalizes the existing point-to-point scheduling to the case of
   uneven GPU count per node.
 * Fixes a crash when network plugin assignment fails.
 * Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
   split mask settings, where NCCL cannot find a viable ring.
 * Fixes crash when NCCL is compiled with recent CUDA versions but
   running on hosts with certain specific older CUDA drivers.
@xiaofanl-nvidia
Collaborator

This has been accepted and released in the latest 2.28.7 release. @saifhhasan
