Releases · NVIDIA/nccl
NCCL v2.29.2-1 Release
Device API Improvements
- Supports Device API struct versioning to maintain compatibility with future versions.
- Adds ncclCommQueryProperties to allow Device API users to check supported features before creating a DevComm.
- Adds host-accessible device pointer functions from symmetric registered ncclWindows.
- Adds improved GIN documentation to clarify the support matrix.
New One-Sided Host APIs
- Adds new host APIs (ncclPutSignal, ncclWaitSignal, etc) for both network and NVL using zero-SM.
- These one-sided operations write data from a local buffer to a remote peer's registered memory window without explicit participation from the target process.
- Uses the Copy Engine for NVL transfers and a CPU proxy for network transfers.
- Requires CUDA 12.5 or greater.
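A host-side sketch of this flow is below. The ncclPutSignal/ncclWaitSignal prototypes are not reproduced in these notes, so the argument lists shown (window handle, offset, signal index) are illustrative assumptions only; consult nccl.h in this release for the actual signatures.
```cpp
// Host-side sketch of the one-sided flow above. The ncclPutSignal/ncclWaitSignal
// argument lists here are illustrative assumptions, not the confirmed prototypes.
#include <nccl.h>
#include <cuda_runtime.h>

// Origin rank: push 'bytes' from a local buffer into the peer's registered
// window and raise a signal, without launching a kernel (zero-SM).
void originSide(ncclComm_t comm, cudaStream_t stream, void* localBuf,
                size_t bytes, ncclWindow_t peerWin, int peerRank) {
  ncclPutSignal(localBuf, bytes, peerRank, peerWin, /*offset=*/0,
                /*signal=*/0, comm, stream);   // hypothetical argument order
}

// Target rank: block the stream until the origin's signal arrives; no matching
// receive call is needed because the data lands directly in the window.
void targetSide(ncclComm_t comm, cudaStream_t stream) {
  ncclWaitSignal(/*signal=*/0, comm, stream);  // hypothetical argument order
}
```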
New Experimental Python language binding (NCCL4Py)
- Pythonic NCCL API for Python applications - native collectives, P2P and other NCCL operations.
- Interoperable with CUDA Python ecosystem: DLPack/CUDA Array Interface, and special support for PyTorch and CuPy.
- Automatic cleanup of NCCL-managed resources (GPU buffers, registered buffers/windows, custom reduction operations).
New LLVM intermediate representation (IR) support
- Exposes NCCL Device APIs through LLVM IR to enable consumption by diverse code generation systems.
- Example usages include high-level languages, Just-In-Time (JIT) compilers, and domain-specific languages (DSL).
- Build with EMIT_LLVM_IR=1 to generate LLVM IR bitcode.
- Requires CUDA 12 and Clang 21.
Built-in hybrid (LSA+GIN) symmetric kernel for AllGather
- Adds a new hierarchical kernel using MCRing (NVLS multicast + Ring) to improve performance and scalability of AllGather.
- Requires symmetric memory registration and GIN.
New ncclCommGrow API
- Adds the ability to dynamically and efficiently add ranks to an existing NCCL communicator.
- Use ncclCommGrow with ncclCommShrink to adjust membership of communicators in response to failing and recovering nodes.
- Also addresses the need for elastic applications to expand a running job by integrating new ranks.
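A rough sketch of the shrink/grow pairing is below, assuming the ncclCommShrink signature introduced in 2.27; the ncclCommGrow prototype is not given in these notes, so it appears only as a hypothetical comment.
```cpp
// Sketch only: ncclCommShrink arguments follow the 2.27 release; the
// ncclCommGrow call shape is hypothetical. Error handling is omitted.
#include <nccl.h>
#include <cstddef>

void dropFailedRank(ncclComm_t comm, int failedRank, ncclComm_t* newComm) {
  // Exclude the failed rank so the surviving ranks keep a working communicator.
  int exclude[1] = { failedRank };
  ncclCommShrink(comm, exclude, 1, newComm, /*config=*/NULL, NCCL_SHRINK_DEFAULT);

  // Later, once the node recovers, ncclCommGrow would re-admit it, e.g.
  // (hypothetical call shape, not the confirmed API):
  //   ncclCommGrow(*newComm, /*joiningRanks=*/exclude, /*count=*/1, ...);
}
```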
Multi-segment registration
- Expands buffer registration to support multiple segments of physical memory mapped to one contiguous VA space for the p2p, ib and nvls transports.
- Enables support for expandable segments in PyTorch.
Improves scalability of AllGatherV pattern
- Adds support for a scalable allgatherv pattern (group of broadcasts).
- Adds new scheduler path and new kernels to improve performance at large scale.
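For reference, the "group of broadcasts" pattern that this scheduler path accelerates can be expressed with the existing host API roughly as follows; the counts/displs bookkeeping is application-side, not part of NCCL.
```cpp
// Allgatherv expressed as a group of broadcasts, the pattern the new scheduler
// path targets. counts[]/displs[] are in elements and are application-side state.
#include <nccl.h>
#include <cuda_runtime.h>

ncclResult_t allGatherV(const float* sendbuff, float* recvbuff,
                        const size_t* counts, const size_t* displs,
                        int nranks, int myrank,
                        ncclComm_t comm, cudaStream_t stream) {
  ncclResult_t res = ncclGroupStart();
  for (int r = 0; r < nranks && res == ncclSuccess; ++r) {
    // Rank r contributes counts[r] elements; every rank receives them into
    // slot displs[r] of the output buffer.
    const float* src = (r == myrank) ? sendbuff : recvbuff + displs[r];
    res = ncclBroadcast(src, recvbuff + displs[r], counts[r], ncclFloat,
                        /*root=*/r, comm, stream);
  }
  ncclResult_t end = ncclGroupEnd();
  return (res != ncclSuccess) ? res : end;
}
```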
Debuggability & Observability Improvements
- RAS supports realtime monitoring to continuously track peer status changes.
- Inspector adds support for Prometheus format output (with NCCL_INSPECTOR_PROM_DUMP=1), in addition to the existing JSON format.
- Adds profiler support for CopyEngine(CE) based collectives.
Community Engagement
- Adds contribution guide: https://github.com/NVIDIA/nccl/blob/master/CONTRIBUTING.md
- Adds NCCL_SOCKET_POLL_TIMEOUT_MSEC which allows waiting instead of spinning during bootstrap in order to reduce CPU usage. (Github PR #1759)
- Fixes segfault in ncclGin initialization that can happen if ncclGinIbGdaki.devices() fails after init() succeeds. (Github PR #1881)
- Fixes crash that can happen when calling p2p and then collectives while using the same user buffer. (Github Issue #1859)
- Fixes bug that was lowering performance on some sm80 or earlier machines with one NIC per GPU. (Github Issue #1876)
- Clears non-fatal CUDA errors so they do not propagate. (PyTorch Issue #164402)
Other Improvements
- Improves performance of large-size AllGather operations using symmetric memory buffers on Blackwell by transparently switching to CE collectives.
- Improves the default number of channels per net peer for all-to-all, send, and recv to achieve better performance.
- Improves performance tuning of 256M-512M message sizes on Blackwell for AllReduce.
- Enables built-in symmetric kernels only on fully connected NVLink systems, as PCIe systems do not perform as well.
- Prints git branch and commit checksum at the INFO level during NCCL initialization.
- Improves support for symmetric window registrations on CUDA versions prior to 12.1.
- Relaxes symmetric buffer registration requirements for collectives so that users can leverage the symmetric kernels with only one of the buffers being registered, when possible.
- All2all, send, recv now obey NCCL_NETDEVS_POLICY. For these operations, NCCL will now by default use a subset of available network devices as dictated by the Network Device Policy.
- Fixes a hang on GB200/300 + CX8 when the user disables GDR.
- Fixes a bug that could cause AllReduce on ncclFloat8e4m3 to yield “no algorithm/protocol available”.
- ncclCommWindowRegister will now return a NULL window if the system does not support window registration.
- Prints a more prominent error when cuMulticastBind fails and NCCL_NVLS_ENABLE=2.
- Upgrades to DOCA GPUNetIO v1.1.
Known Limitations
- Since Device API was experimental in 2.28.x, applications that use the Device API in v2.28 may need modifications to work with v2.29.
- One-sided host APIs (e.g. ncclPutSignal) currently do not support graph capture. Future releases will add CUDA graph support.
- The improved AllGatherV support breaks the NCCL profiler support for ncclBroadcast operations, limiting visibility to API events. NCCL_ALLGATHERV_ENABLE=0 can be used as a workaround until it is fixed in a future release.
- NCCL4Py (experimental) has a known issue with cuda.core 0.5.0. We currently recommend using cuda.core 0.4.1 with nccl4py.
NCCL v2.28.9-1 Release
NCCL v2.28 Update 2
- Fix operation ordering between main thread and proxy thread to prevent hangs at large scale.
- Fix Issue #1893, a bug in GIN.
NCCL v2.28.7-1 Release
GPU-Initiated Networking (GIN)
- Provides device-side API for integrating GPU-Initiated Networking capability into application kernels.
- New transport layer called DOCA GPUNetIO.
- New ncclGin construct to create, destroy and manipulate GIN contexts.
- New ncclGinBarrierSession to provide synchronization functionality.
- New put, signal, counter operations for data movement and signaling.
- GIN API signatures and functionalities are subject to change.
- GIN Support Requirements:
  - CUDA 12.2 or later when compiling the GPU code.
  - NVIDIA GPUs: Volta or newer. NVIDIA GPU drivers >= 510.40.3.
  - NVIDIA NICs: CX4 or newer. rdma-core >= 44.0.
  - Requires nvidia-peermem or DMABUF support. When using DMABUF, Linux kernel >= 6.1 is required.
New ncclCommRevoke API for fault tolerance
- Introduces ncclCommRevoke to quiesce ongoing NCCL work on a communicator without freeing resources.
- This answers the need for a lightweight way to cancel in-flight collectives and bring a communicator to a safe state before split/shrink/finalize/destroy.
- Includes optional cross-rank coordination (global barrier) and supports blocking/non-blocking usage.
New NCCL Environment Plugin
- The env plugin allows users to set NCCL environment variables, for example, after loading them from a centralized database.
- The NCCL_ENV_PLUGIN variable can be used to let NCCL load an external environment plugin.
New NCCL Examples on GitHub
- The NCCL examples directory provides users and developers with practical code samples that highlight NCCL's core features.
- It covers basic operations like communicator initialization, point-to-point communication, and collective operations, as well as advanced features such as user buffer registration, symmetric memory, and the device API.
Device API improvements
- Adds ncclFindWindow API.
- Adds new ncclBarrierSession to provide hybrid synchronization functionality.
- Makes multimem available with as few as two ranks.
- Removes distance (NCCL_P2P_LEVEL) considerations from determining the availability of symmetric memory.
Enhanced NCCL RAS output
- Extends RAS subsystem with JSON format to support machine-parsable metrics collection.
- Enables structured data export for monitoring tools, dashboards, and automated analysis systems.
Github Pull Requests resolved
- Fast Init - CPU optimizations for NCCL initialization at large scale. (PR #1789)
- Fast Init - Improve bootstrap AllGather by 2x at large scale by sending bootstrap information bidirectionally. (PR #1791)
- Fixes spurious failures when PyTorch is statically linked with NCCL-2.28.3 because the error is not drained, but rather gets propagated into the next CUDA kernel invocation. (PR #1864)
Other notable improvements
- Fixes multicast object leaks in case of failed NVLS user buffer registrations, which could lead to crashes. Avoids such registration attempts when incompatible memory allocators are used.
- Fixes potential data corruption with built-in symmetric kernels for small messages with size granularity under 8 bytes or when multiple symmetric operations were aggregated in a group.
- Generalizes the existing point-to-point scheduling to the case of an uneven GPU count per node.
- Fixes a crash when network plugin assignment fails.
- Fixes a large performance issue with NCCL_CROSS_NIC=0 and certain split mask settings, where NCCL cannot find a viable ring.
- Fixes a crash when NCCL is compiled with recent CUDA versions but run on hosts with certain older CUDA drivers.
NCCL v2.28.3-1 Release
See the NCCL 2.28.3 Release Notes for more information
Device API (Experimental)
- Introduces device-side APIs to integrate NCCL communication directly into application kernels.
- Supports LSA (Load/Store Access) for CUDA P2P communication over NVLink and some PCIe platforms.
- Supports Multimem for hardware multicast using NVLink SHARP.
- Adds initial framework for GIN (GPU-Initiated Networking), currently under development.
- Introduces device communicators created using ncclDevCommCreate.
- Enables device-side communication operations with synchronization (ncclLsaBarrierSession) and memory accessors (ncclGetLsaPointer, ncclGetLsaMultimemPointer).
- Experimental APIs - signatures and functionality may evolve in future releases.
- No ABI compatibility is guaranteed — applications must be recompiled with each new NCCL release.
Symmetric memory improvements
- Support for aggregating symmetric operations using ncclGroupStart/End APIs.
- Reimplement symmetric kernels using device API.
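A minimal sketch of aggregated symmetric operations is below, using the window registration API introduced with 2.27 symmetric memory (ncclCommWindowRegister with NCCL_WIN_COLL_SYMMETRIC); buffer allocation (e.g. via ncclMemAlloc) and error handling are omitted.
```cpp
// Sketch of aggregating two symmetric-memory collectives in one group.
// Registration would normally happen once at setup rather than per call.
#include <nccl.h>
#include <cuda_runtime.h>

void aggregatedSymmetric(ncclComm_t comm, cudaStream_t stream,
                         float* bufA, size_t countA,
                         float* bufB, size_t countB) {
  ncclWindow_t winA, winB;
  ncclCommWindowRegister(comm, bufA, countA * sizeof(float), &winA, NCCL_WIN_COLL_SYMMETRIC);
  ncclCommWindowRegister(comm, bufB, countB * sizeof(float), &winB, NCCL_WIN_COLL_SYMMETRIC);

  // Issuing both operations inside one group lets NCCL aggregate the
  // symmetric kernels, per the item above.
  ncclGroupStart();
  ncclAllReduce(bufA, bufA, countA, ncclFloat, ncclSum, comm, stream);
  ncclAllReduce(bufB, bufB, countB, ncclFloat, ncclSum, comm, stream);
  ncclGroupEnd();

  // Wait for the stream work to finish before releasing the windows.
  cudaStreamSynchronize(stream);
  ncclCommWindowDeregister(comm, winA);
  ncclCommWindowDeregister(comm, winB);
}
```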
New Host APIs
- Introduce new host collective APIs: ncclAlltoAll, ncclScatter, ncclGather.
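These follow the usual NCCL collective calling convention; a quick sketch is below, but treat the exact argument lists (in particular the per-peer count semantics) as assumptions to verify against nccl.h.
```cpp
// Quick sketch of the new host collectives. Argument lists are assumptions
// that follow NCCL's usual collective convention.
#include <nccl.h>
#include <cuda_runtime.h>

void newCollectivesSketch(ncclComm_t comm, cudaStream_t stream,
                          const float* send, float* recv,
                          size_t countPerPeer, int root) {
  // Every rank exchanges countPerPeer elements with every other rank.
  ncclAlltoAll(send, recv, countPerPeer, ncclFloat, comm, stream);
  // Root distributes one countPerPeer-sized chunk to each rank.
  ncclScatter(send, recv, countPerPeer, ncclFloat, root, comm, stream);
  // Root collects one countPerPeer-sized chunk from each rank.
  ncclGather(send, recv, countPerPeer, ncclFloat, root, comm, stream);
}
```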
CE (Copy Engine) Collectives
- Reduce SM utilization for alltoall, scatter, gather, and allgather within a single (MN)NVL domain.
- Free up SM capacity for the application to do computation at the same time.
- To enable the feature for ncclAllGather, ncclAlltoAll, ncclGather, ncclScatter, register buffers into symmetric windows and use the NCCL_CTA_POLICY_ZERO flag in the communicator config_t.
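A sketch of opting in to the zero-SM path is below. NCCL_CTA_POLICY_ZERO is the flag named above; the config field name used here (ctaPolicy) is an assumption, so verify it against ncclConfig_t in this release's nccl.h.
```cpp
// Sketch of opting in to the zero-SM (CE) path described above.
#include <nccl.h>
#include <cuda_runtime.h>

void initZeroCtaComm(ncclUniqueId id, int nranks, int rank,
                     float* buf, size_t count,
                     ncclComm_t* comm, ncclWindow_t* win) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.ctaPolicy = NCCL_CTA_POLICY_ZERO;   // assumed field name, see note above
  ncclCommInitRankConfig(comm, nranks, id, rank, &config);

  // CE collectives require the user buffer to be registered in a symmetric window.
  ncclCommWindowRegister(*comm, buf, count * sizeof(float), win, NCCL_WIN_COLL_SYMMETRIC);
}
```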
NCCL Inspector Plugin
- Introduces an Inspector plugin for always-on performance monitoring.
- Produces structured JSON output with metadata, execution time, bandwidth, and optional event traces for each NCCL operation.
- Enables integration with analysis tools such as Performance Exporter to visualize NCCL performance bottlenecks.
- Lightweight; enabled via the NCCL_PROFILER_PLUGIN and NCCL_INSPECTOR_ENABLE environment variables.
CMake support (Experimental)
- Adds a CMake build system as an alternative to existing Makefiles.
- Known issues: pkg.build and Device API currently do not work with CMake.
- The known issues will be addressed in a future release.
Decreased max CTA count from 32 to 16 on Blackwell
- SM overhead is decreased by 50% with this improvement.
- This may cause some perf drop on Blackwell because of the reduced SM usage.
- If the extra SM capacity is not desired, two options are available to restore the previous behavior: 1) setting the NCCL_MIN_CTAS=32 and NCCL_MAX_CTAS=32 environment variables; 2) setting the communicator config to override the max CTA count to 32 (see the sketch after this list).
- Based on community feedback, future versions may consider different trade-offs between performance and SM overhead.
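As a sketch of option 2 above, the existing minCTAs/maxCTAs config fields can pin the CTA count back to 32; option 1 is the environment-variable equivalent.
```cpp
// Restore the pre-2.28 CTA count through the communicator config.
// Equivalent to NCCL_MIN_CTAS=32 NCCL_MAX_CTAS=32 in the environment.
#include <nccl.h>

void initWith32Ctas(ncclUniqueId id, int nranks, int rank, ncclComm_t* comm) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.minCTAs = 32;   // lower bound on CTAs per NCCL kernel
  config.maxCTAs = 32;   // upper bound; together they fix the count at 32
  ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```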
Plugins
- Network
- App-aware Network plugin. NCCL passes information about communication operations to be executed on a network endpoint. This allows for better tuning of network endpoints and their use in the plugins.
- Improve handling of physical and virtual network devices and load/unload.
- Network plugin version 11 - add explicit context and communication ID support for per communicator init/finalize.
- Add Multi-Request Net API. Using this will help NCCL to anticipate multiple send/recv requests and optimize for it. See maxMultiRequestSize field in ncclNetProperties_v11_t.
- Profiler
- Add support for API events (group, collective, and p2p) and for tracking kernel launches in the profiler plugin.
- Add Inspector Profiler Plugin (see section above).
- Add a hook to Google’s CoMMA profiler on github.
- Tuner
- Expose NCCL tuning constants at tuner initialization via ncclTunerConstants_v5_t.
- Add NVL Domain Information API.
- Support multiple plugin types from a single shared object.
New Parameterization and ncclConfig changes:
- Add new option NCCL_MNNVL_CLIQUE_ID=-2 which will use rack serial number to partition the MNNVL clique. This will limit NVLink domains to GPUs within a single rack.
- Add NCCL_NETDEVS_POLICY to control how NET devices are assigned to GPUs. The default (AUTO) is the policy used in previous versions.
- Add NCCL_SINGLE_PROC_MEM_REG_ENABLE control variable to enable NVLS UB registration in the “one process, multiple ranks” case as opt in.
- Move nChannelsPerNetPeer into ncclConfig. NCCL_NCHANNELS_PER_NET_PEER can override the value in ncclConfig.
- Enable PxN over C2C by default
- PxN over C2C will improve performance for Grace-Blackwell platforms by allowing NCCL to leverage the NIC attached to a peer GPU over NVLINK, C2C, and PCIe.
- This behavior can be overridden by setting NCCL_PXN_C2C=0.
Other Improvements:
- Allow FP8 support for non-reductive operations on pre sm90 devices. (See pytorch/pytorch#151594 (comment))
- Fix NVLS+CollNet and temporarily disable COLLNET_CHAIN for >8 GPUs.
- Only consider running interfaces for socket traffic. NCCL will not attempt to use interfaces that do not have the IFF_RUNNING bit. (#1798)
- Modernize mutex management. Convert to std::mutex and std::lock_guard.
- Remove sm35 and sm50 GENCODE targets which have long been deprecated and were causing issues with the latest NCCL release builds.
- Improved NVLS/NVLSTree tuning prediction to improve algorithm and protocol selection.
- NVLSTree Tuning Fixes. Update tuning data for H100, GB200-NV72.
- Respond better to RoCE link flaps. Instead of reporting an “unknown event” it will now report “GID table changed”.
- Move libvirt bridge interfaces to the end of the list of possible interfaces so that they are considered last. These interfaces are usually virtual bridges that relay traffic to containers running on the host; they cannot carry traffic to a remote node and are therefore unsuitable.