
[PGNCCL] Launch kernel on current stream & remove record_stream entirely #148467


Closed · kwen2501 wants to merge 7 commits

Conversation

kwen2501 (Contributor) commented Mar 4, 2025

This PR has multiple changes to ProcessGroupNCCL (which unfortunately have to be atomic):

  1. When async_op=False, we directly launch the collective on the "current" stream, instead of launching on a trampoline stream and joining back (a short usage sketch follows this list).
  2. We entirely remove record_stream and use CPU-side stashing to manage tensor lifetimes against allocator recycling.
  3. Tensor-lifetime management is skipped when async_op=False; it is only used when async_op=True.
  4. To guard against users not calling work.wait(), the watchdog unstashes tensors once it detects completion of the collectives, so we do not hold references to tensors forever. This is a safety net rather than a service guarantee; see discussion here.
  5. Profiles in async_op=False mode will look different: collective kernels now show up on the same timeline as compute kernels.
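
To make the behavioral difference concrete, here is a minimal usage sketch of the two modes from the user's side (illustrative only, not part of this PR; `dist.all_reduce` and `work.wait()` are the standard c10d APIs):

```
# Illustrative sketch only (assumes a typical torchrun launch with NCCL available).
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
x = torch.ones(1 << 20, device="cuda")

# async_op=False: with this PR the collective is launched directly on the
# current stream, so no trampoline stream, stream sync, or stashing is needed.
dist.all_reduce(x, async_op=False)

# async_op=True: the collective still runs on a separate NCCL stream and the
# input is stashed (CPU side) until wait(); call work.wait() before reusing x.
work = dist.all_reduce(x, async_op=True)
work.wait()  # re-establishes the stream dependency on the current stream

dist.destroy_process_group()
```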

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

cenzhaometa and others added 5 commits February 27, 2025 13:25
…147820)

Summary:

PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for the comm operation
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).

This diff:
- async=False [default]: use the current stream as the nccl stream and avoid the stream-sync overhead
- async=True: retain the existing logic, i.e. create a new nccl stream and have it wait on the current stream to ensure tensors are ready
- pass `async` down from c10d to the NCCL PG

This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU/GPU time from **230us to 195us (~15%)**.
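
The stream dependency being removed can be illustrated with public CUDA stream APIs. This is only a sketch of the pattern (the real logic lives in ProcessGroupNCCL's C++ code), with an elementwise multiply standing in for the NCCL kernel:

```
import torch

compute_stream = torch.cuda.current_stream()
x = torch.randn(1 << 20, device="cuda")

# Before: a dedicated nccl stream that must first wait on the current stream,
# plus record_stream bookkeeping to keep x alive until the kernel finishes.
nccl_stream = torch.cuda.Stream()
nccl_stream.wait_stream(compute_stream)   # the sync overhead this diff avoids
with torch.cuda.stream(nccl_stream):
    y = x * 2                             # stand-in for the NCCL kernel
x.record_stream(nccl_stream)

# After (async=False): launch directly on the current stream; no extra
# dependency or lifetime bookkeeping is required.
z = x * 2                                 # stand-in for the NCCL kernel
```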

Test Plan:
# AMD
> before
```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces

- c10d::allreduce_ (69us)
  - cudaStreamSync (23us)
  - nccl::all_reduce (26us)

> after
```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces


- c10d::allreduce_ (37us)
  - cudaStreamSync (gone)
  - nccl::all_reduce (20us)

# NV
> before
```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-2.Feb_25_11_11_28.3328768.pt.trace.json.gz&bucket=hpc_traces

- c10d::allreduce_ (62us)
  - cudaStreamWait (0us)
  - nccl::all_reduce (47us)

> after
```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-4.Feb_25_11_17_05.3469865.pt.trace.json.gz&bucket=hpc_traces


- c10d::allreduce_ (62us)
  - cudaStreamWait (gone)
  - nccl::all_reduce (53us)
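
For reference, the new behavior is gated in this commit by `TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1` (set only in the "after" commands above). A hypothetical minimal repro of the benchmarked all_reduce, outside the internal `param_bench` launcher, could look like:

```
# Hypothetical stand-alone repro (not the internal param_bench launcher).
# Launch with e.g.:
#   TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1 torchrun --nproc-per-node=8 repro.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
x = torch.empty(10 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # ~20 MB message

for _ in range(5):        # warmup, mirroring --w 5
    dist.all_reduce(x)
torch.cuda.synchronize()

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA]
) as prof:
    for _ in range(100):  # mirroring --n 100
        dist.all_reduce(x)
    torch.cuda.synchronize()
prof.export_chrome_trace("allreduce_trace.json")

dist.destroy_process_group()
```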

Differential Revision: D70135605
@pytorch-bot pytorch-bot bot added the `oncall: distributed` (add this issue/PR to the distributed oncall triage queue) and `release notes: distributed (c10d)` (release notes category) labels on Mar 4, 2025
pytorch-bot bot commented Mar 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148467

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit 69042db with merge base 10ffd94:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 requested review from wconstab, fduwjj, eqy and Aidyn-A on March 4, 2025 18:59
}

// Therefore, we warn and fall back to the typical recordStream logic.
// TODO(kwen2501): revisit this when we have a better solution.
A collaborator commented:

A few options here:

  1. Change the requirement so that the user has to call .wait() even on isend calls (sketched below).
  2. Add some logic to the destructor that does the synchronizeStream (or something equivalent) to block destruction on the p2p op.
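
As a sketch of what option 1 would mean for users (standard `torch.distributed` P2P API, two ranks assumed):

```
# Sketch of option 1: the user must call work.wait() even on isend, giving the
# framework a point at which it can stop guarding the tensor's lifetime.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

buf = torch.ones(1024, device="cuda")
work = dist.isend(buf, dst=1) if rank == 0 else dist.irecv(buf, src=0)

# Without this wait(), nothing re-establishes the stream dependency before buf
# goes out of scope and its memory can be recycled by the caching allocator.
work.wait()
del buf

dist.destroy_process_group()
```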

kwen2501 (Contributor, Author) commented Mar 5, 2025

Moving over to #148590, as an internal user needs ghimport, which supports ghstack PRs only.

Labels: `oncall: distributed` (add this issue/PR to the distributed oncall triage queue), `release notes: distributed (c10d)` (release notes category)
Projects: None yet
3 participants