NCCL Fast Init - Improve Bootstrap AllGather by 2x at large scale #1791
Conversation
Question about the background of this PR: in NCCL 2.23, for large-scale init optimization, we implemented an IB/RoCE-based allgather which is supposed to be way faster than the socket implementation. Was there a reason why you didn't want to use that implementation and preferred to optimize the socket implementation instead?

That's a good point. We kind of did this back when NCCL 2.21 was around, but it is something we can try out now. However, we may not be able to test it upfront due to resources. The TCP solution we were able to test and validate using CPU-emulation-based test runs with mocked CUDA/IB. I believe this change can also give some improvement to the IB/RoCE-based allgather, but the absolute improvement won't matter much at 100K scale, whereas for TCP the absolute gain is quite noticeable.
GPU-Initiated Networking (GIN):
* Provides device-side API for integrating GPU-Initiated Networking
capability into application kernels.
* New transport layer called DOCA GPUNetIO.
* New ncclGin construct to create, destroy and manipulate GIN contexts.
* New ncclGinBarrierSession to provide synchronization functionality.
* New put, signal, counter operations for data movement and signaling.
* GIN API signatures and functionalities are subject to change.
* GIN Support Requirements
* CUDA 12.2 or later when compiling the GPU code
* NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3
* NVIDIA NICs: CX4 or newer. rdma-core >= 44.0
* Requires nvidia-peermem or DMABUF support. When using DMABUF, linux
kernel >= 6.1 is required.
New ncclCommRevoke API for fault tolerance:
* Introduces ncclCommRevoke to quiesce ongoing NCCL work on a
communicator without freeing resources.
* This answers the need for a lightweight way to cancel in-flight
collectives and bring a communicator to a safe state before
split/shrink/finalize/destroy.
* Includes optional cross-rank coordination (global barrier) and
supports blocking/non-blocking usage.
New NCCL Environment Plugin:
* The env plugin allows users to set NCCL environment variables, for
example, after loading them from a centralized database.
* The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external
environment plugin.
New NCCL Examples on GitHub:
* The NCCL examples directory provides users and developers with
practical code samples that highlight NCCL’s core features.
* It covers basic operations like communicator initialization,
point-to-point communication, and collective operations, as well as
advanced features such as user buffer registration, symmetric memory,
and the device API.
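As a rough illustration of the kind of basic sample described above (a sketch written for this summary, not a file taken from the examples repository), the snippet below runs a single-process, multi-GPU all-reduce with the public NCCL API (ncclCommInitAll, ncclGroupStart/ncclGroupEnd, ncclAllReduce). Buffer contents, sizes, and error handling are simplified.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
    int nDev = 0;
    CHECK_CUDA(cudaGetDeviceCount(&nDev));
    if (nDev < 1) { fprintf(stderr, "no CUDA devices found\n"); return 1; }

    const size_t count = 1 << 20;  /* elements per rank; arbitrary for the example */

    ncclComm_t   *comms   = malloc(nDev * sizeof(ncclComm_t));
    cudaStream_t *streams = malloc(nDev * sizeof(cudaStream_t));
    float **sendbuf = malloc(nDev * sizeof(float *));
    float **recvbuf = malloc(nDev * sizeof(float *));

    /* One communicator per local GPU, all created by this single process. */
    CHECK_NCCL(ncclCommInitAll(comms, nDev, NULL));

    for (int i = 0; i < nDev; i++) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaMalloc((void **)&sendbuf[i], count * sizeof(float)));
        CHECK_CUDA(cudaMalloc((void **)&recvbuf[i], count * sizeof(float)));
        CHECK_CUDA(cudaMemset(sendbuf[i], 1, count * sizeof(float)));  /* fill with a byte pattern */
        CHECK_CUDA(cudaStreamCreate(&streams[i]));
    }

    /* Group the per-device calls so NCCL schedules them as one collective. */
    CHECK_NCCL(ncclGroupStart());
    for (int i = 0; i < nDev; i++)
        CHECK_NCCL(ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                                 comms[i], streams[i]));
    CHECK_NCCL(ncclGroupEnd());

    for (int i = 0; i < nDev; i++) {
        CHECK_CUDA(cudaSetDevice(i));
        CHECK_CUDA(cudaStreamSynchronize(streams[i]));
        CHECK_CUDA(cudaFree(sendbuf[i]));
        CHECK_CUDA(cudaFree(recvbuf[i]));
        CHECK_CUDA(cudaStreamDestroy(streams[i]));
        CHECK_NCCL(ncclCommDestroy(comms[i]));
    }
    free(comms); free(streams); free(sendbuf); free(recvbuf);
    printf("AllReduce across %d local GPU(s) completed.\n", nDev);
    return 0;
}
```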
Device API improvements:
* Adds ncclFindWindow API.
* Adds new ncclBarrierSession to provide hybrid synchronization
functionality.
* Makes multimem available with as few as two ranks.
* Removes distance (NCCL_P2P_LEVEL) considerations from determining the
availability of symmetric memory.
Enhanced NCCL RAS output:
* Extends RAS subsystem with JSON format to support machine-parsable
metrics collection.
* Enables structured data export for monitoring tools, dashboards, and
automated analysis systems.
Github Pull Requests resolved:
* Fast Init - CPU Optimizations for NCCL Initialization Large Scale.
(PR #1789)
* Fast Init - Improve Bootstrap AllGather by 2x at large scale by
sending bootstrap information bidirectionally. (PR #1791)
* Fixes spurious failures when PyTorch is statically linked with
  NCCL-2.28.3, where the error is not drained but instead gets propagated
  into the next CUDA kernel invocation. (PR #1864)
Other notable improvements:
* Fixes multicast object leaks in case of failed NVLS user buffer
registrations, which could lead to crashes. Avoids such registration
attempts in case of the use of incompatible memory allocators.
* Fixes potential data corruption with built-in symmetric kernels for
small messages with size granularity under 8 bytes or when multiple
symmetric operations were aggregated in a group.
* Generalizes the existing point-to-point scheduling to the case of
  uneven GPU counts per node.
* Fixes a crash when network plugin assignment fails.
* Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain
split mask settings, where NCCL cannot find a viable ring.
* Fixes crash when NCCL is compiled with recent CUDA versions but
running on hosts with certain specific older CUDA drivers.
This has been accepted and released in the latest 2.28.7 release. @saifhhasan

I am going to close this since the bidirectional ring has been released with v2.28.7. Thanks again for your input.
Problem
The default specification of AllGather is to send to the next peer and receive from the previous one. For RDMA this is the optimal strategy. However, during bootstrap we are exchanging trivial data chunks (~100 bytes) over TCP sockets from a process running on the CPU. Saturating the NIC line rate is therefore not a concern; most of the performance bottleneck comes from the number of steps performed for the AllGather, which is N-1, where N is the number of processes.
Solution
We propose to leverage a bidirectional AllGather, where at each step we send a chunk to, and receive a chunk from, both of our ring neighbors. With this, the AllGather completes in roughly N/2 steps.
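To make the step-count argument concrete, here is a small self-contained C model (illustrative only, not code from this PR) that tracks which bootstrap chunks each rank holds after every ring step; the rank count N = 16 is an arbitrary example value. Each step effectively delivers one new chunk per direction, matching a pipelined ring exchange.

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define N 16  /* number of ranks; illustrative value, the PR targets ~100K */

/* Count ring steps until every rank holds every rank's bootstrap chunk.
 * This is a reachability model of the exchange pattern, not NCCL source:
 * at each step a rank learns whatever its sending neighbor(s) already knew. */
static int steps_to_complete(bool bidirectional) {
    bool have[N][N] = {{false}};
    for (int r = 0; r < N; r++) have[r][r] = true;  /* each rank starts with its own chunk */

    for (int step = 1; ; step++) {
        bool next[N][N];
        memcpy(next, have, sizeof(have));
        for (int r = 0; r < N; r++) {
            int prev = (r - 1 + N) % N;
            int succ = (r + 1) % N;
            for (int c = 0; c < N; c++) {
                if (have[prev][c]) next[r][c] = true;                   /* receive from previous peer */
                if (bidirectional && have[succ][c]) next[r][c] = true;  /* also receive from next peer */
            }
        }
        memcpy(have, next, sizeof(have));

        bool done = true;
        for (int r = 0; r < N && done; r++)
            for (int c = 0; c < N; c++)
                if (!have[r][c]) { done = false; break; }
        if (done) return step;
    }
}

int main(void) {
    printf("unidirectional ring AllGather: %d steps (N-1)\n", steps_to_complete(false));
    printf("bidirectional ring AllGather:  %d steps (~N/2)\n", steps_to_complete(true));
    return 0;
}
```

For N = 16 the model reports 15 steps for the unidirectional ring and 8 for the bidirectional one, matching the N-1 versus ~N/2 reasoning above.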
Testing
NOTE: These numbers are from a standalone test of collective initialization only. The actual numbers would be higher in a training workload, since the system will have more load on the CPU (based on our observations).
Before - 9.1s
After - 5.55s