[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely by kwen2501 · Pull Request #148590 · pytorch/pytorch · GitHub

[PGNCCL] Launch kernel on current stream & remove record_stream entirely #148590


Closed
wants to merge 11 commits

Conversation

kwen2501 (Contributor) commented Mar 5, 2025

Stack from ghstack (oldest at bottom):

This PR has multiple changes to ProcessGroupNCCL (which unfortunately are related):

  1. When async_op=False, we directly launch the collective on the "current" stream, instead of on a trampoline stream that is joined back later (a usage sketch contrasting the two modes follows this list).
  2. Entirely remove record_stream and use CPU-side stashing to manage tensor lifetime against recycling.
  3. Remove tensor life management when async_op=False; only use it when async_op=True.
  4. To guard against the user not calling work.wait(), we ask the watchdog to unstash tensors after it detects completion of the collectives, so that we do not hold references to tensors forever. This is a safety net rather than a service guarantee, see discussion here.
  5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
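
For concreteness, here is a minimal usage sketch of the two modes from the user's point of view. This is my own illustration, not code from this PR's diff, and it assumes a NCCL process group has already been initialized and a CUDA device selected:

```python
# Minimal sketch, assuming torch.distributed.init_process_group("nccl") has
# already been called and the current device is a CUDA device.
import torch
import torch.distributed as dist

x = torch.ones(1024, device="cuda")

# async_op=False: after this PR the collective kernel is enqueued on the
# *current* stream, so subsequent kernels on that stream are ordered after it
# and no trampoline-stream sync or tensor stashing is needed.
dist.all_reduce(x, op=dist.ReduceOp.SUM, async_op=False)
y = x * 2  # same stream: runs after the all_reduce kernel

# async_op=True: the collective runs on a separate NCCL stream; call
# work.wait() so the current stream waits on it before consuming x, and so
# the stashed references to x can be released promptly.
work = dist.all_reduce(x, op=dist.ReduceOp.SUM, async_op=True)
work.wait()
z = x + 1
```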

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@diff-train-skip-merge

Differential Revision: D71652868

[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).

This diff:
- async=False [default]: use the current stream as the NCCL stream and avoid the stream-sync overhead
- async=True: retain the existing logic: create a new NCCL stream and let it wait on the current stream to ensure tensors are ready
- pass asyncOp down from c10d to the NCCL PG

This shaves off 50% of the CPU overhead **(70us -> 35us)**, reducing total CPU/GPU time from **230us to 195us (a 15% reduction)**
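
For reference, here is a Python-level sketch of the stream dependency that async=False no longer needs. This is my own illustration (the real logic is C++ inside ProcessGroupNCCL; the names `nccl_stream` and `dep` are illustrative only):

```python
import torch

assert torch.cuda.is_available()
current = torch.cuda.current_stream()

# Old path: a dedicated communication stream must first wait on the current
# stream so that tensors produced by compute kernels are ready.
nccl_stream = torch.cuda.Stream()
dep = torch.cuda.Event()
dep.record(current)          # host-side event record on the compute stream
nccl_stream.wait_event(dep)  # host-side wait enqueue; this bookkeeping is
                             # part of the CPU overhead cited above
with torch.cuda.stream(nccl_stream):
    pass  # the collective kernel would be launched here

# New path (async_op=False): launch directly on `current`; stream ordering
# already guarantees the inputs are ready, so no event record/wait is needed.
```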

Differential Revision: D70135605

[PGNCCL] Make avoid-record-stream default

[c10d] Add asyncOp argument to Ops

Change python side wait

Pass asyncOp at ProcessGroup level

Watchdog unstashing tensors as a safety net

lint

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Mar 5, 2025
pytorch-bot bot commented Mar 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148590

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 2 Unrelated Failures

As of commit e933dfb with merge base 666508e:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Mar 5, 2025
[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
ghstack-source-id: 4793680
Pull Request resolved: #148590
@kwen2501 kwen2501 changed the title [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820) [PGNCCL] Launch kernel on current stream & remove record_stream entirely Mar 5, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 5, 2025
…stream` entirely"


This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately have to be atomic):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of on a trampoline stream that is joined back later.
- Resolves #147729
- Resolves #146881
- Also saves an event sync and one pybind call during the (now unnecessary) `work.wait()` issued by distributed_c10d.py.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetime against recycling (a conceptual sketch of the stashing follows this list).
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after it detects completion of the collectives, so that we do not hold references to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
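
A rough Python sketch of the idea behind items 2 and 4. The actual implementation is C++ inside ProcessGroupNCCL; the class and method names below are illustrative only:

```python
# Conceptual sketch of CPU-side stashing replacing record_stream: hold a
# reference to the tensors until the collective is known complete, so the
# caching allocator cannot recycle their memory early.
import threading
from typing import Dict, List

import torch

class TensorStash:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._shelf: Dict[int, List[torch.Tensor]] = {}

    def stash(self, work_id: int, tensors: List[torch.Tensor]) -> None:
        # Called at collective launch time when async_op=True.
        with self._lock:
            self._shelf[work_id] = list(tensors)

    def unstash(self, work_id: int) -> None:
        # Called from work.wait(), or by the watchdog as a safety net once it
        # observes the collective's completion, so that references are not
        # held forever if the user never calls wait().
        with self._lock:
            self._shelf.pop(work_id, None)
```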

Joint work with cenzhaometa who wants to remove the event sync overhead.

Cc: ngimel awgu Aidyn-A skyw wconstab leonardo0lyj

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 6, 2025
[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
ghstack-source-id: ac5295d
Pull Request resolved: #148590
linux-foundation-easycla bot commented Mar 6, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

taozhiwei (Contributor) commented:

I mentioned a similar one in #148553.

kwen2501 added a commit that referenced this pull request Mar 6, 2025
[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
ghstack-source-id: 0f222d3
Pull Request resolved: #148590
@kwen2501 kwen2501 added the keep-going Don't stop on first failure, keep running tests until the end label Mar 6, 2025
kwen2501 added a commit that referenced this pull request Mar 6, 2025
[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
ghstack-source-id: 8306cce
Pull Request resolved: #148590
kwen2501 (Contributor, Author) commented Mar 6, 2025

@albanD Would appreciate your help:

The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not

We are adding an `asyncOp` argument to the ops.
Thanks!

taozhiwei (Contributor) commented Mar 7, 2025


  1. In CUDA graph mode, watchdogHandler is disabled. When async_op=True and the user never calls work.wait(), can the input ever be released?
  2. When async_op=False, the user's code already ensures the input is not released prematurely, so is there really a need to make the current stream the NCCL stream? Using the current stream as the NCCL stream will change the profile layout users are accustomed to.

kwen2501 (Contributor, Author) commented Mar 7, 2025
  1. Users are in general expected to call work.wait() in async mode, especially when they are CUDA Graphing. Failure to do so can result in the tensors being held. That is a side effect of making avoid-record-stream the default; we are just using the watchdog to mitigate that side effect.
  2. The change to use the current stream has other motivations besides tensor lifetime management, such as reducing stream-sync overhead.

kwen2501 added a commit that referenced this pull request Mar 7, 2025
[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
ghstack-source-id: c31ca32
Pull Request resolved: #148590
taozhiwei (Contributor) commented:

Should it be possible to add a check in syncStream: if it is the same stream, skip the event record and block? (An illustrative sketch of such a check follows.)
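
For illustration only (this is not the actual C10D syncStream helper), a same-stream short-circuit of the kind being suggested might look like:

```python
import torch

def sync_stream(src: torch.cuda.Stream, dst: torch.cuda.Stream,
                event: torch.cuda.Event) -> None:
    # Hypothetical helper mirroring the suggestion: make `dst` wait on work
    # already enqueued on `src`, but skip the bookkeeping when they are the
    # same stream, since ordering is then already guaranteed.
    if src == dst:
        return
    event.record(src)
    dst.wait_event(event)
```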

kwen2501 added a commit that referenced this pull request Mar 7, 2025
[ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
ghstack-source-id: dfc758a
Pull Request resolved: #148590
@kwen2501 kwen2501 requested review from wconstab, fduwjj, eqy and Aidyn-A March 7, 2025 18:19
pytorchmergebot (Collaborator):

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!


atalman (Contributor) commented Mar 31, 2025

@pytorchbot merge -f "already landed"

pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

pytorchmergebot (Collaborator):

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!


atalman (Contributor) commented Mar 31, 2025

@pytorchbot merge -f "already landed internally"

pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

pytorchmergebot (Collaborator):

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 269f45f641011f4da5fc7d38e973036e04489b72 returned non-zero exit code 1

Auto-merging torch/_C/_distributed_c10d.pyi
Auto-merging torch/csrc/distributed/c10d/ProcessGroup.hpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
CONFLICT (content): Merge conflict in torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Auto-merging torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp
Auto-merging torch/csrc/distributed/c10d/init.cpp
Auto-merging torch/distributed/distributed_c10d.py
error: could not apply 269f45f6410... [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"

atalman added a commit to atalman/pytorch that referenced this pull request Mar 31, 2025
malfet pushed a commit that referenced this pull request Mar 31, 2025
…eam` entirely (#148590) (#150352)

Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590)"

This reverts commit ef6296e.
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of on a trampoline stream that is joined back later.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the HIP case) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after it detects completion of the collectives, so that we do not hold references to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- async=False [default]: use the current stream as the NCCL stream and avoid the stream-sync overhead
- async=True: retain the existing logic: create a new NCCL stream and let it wait on the current stream to ensure tensors are ready
- pass asyncOp down from c10d to the NCCL PG
This shaves off 50% of the CPU overhead **(70us -> 35us)**, reducing total CPU/GPU time from **230us to 195us (a 15% reduction)**
Differential Revision: D70135605

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: #149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: #150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: #150130

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Apr 1, 2025

ghstack-source-id: ce103fc
Pull Request resolved: #150398
github-merge-queue bot pushed a commit to intel/torch-xpu-ops that referenced this pull request Apr 1, 2025
Reverts #1450

Original PR (pytorch/pytorch#148590) in PyTorch
got reverted:
pytorch/pytorch@afa1eda

---------

Co-authored-by: Yutao Xu <yutao.xu@intel.com>
pytorchmergebot pushed a commit that referenced this pull request Apr 1, 2025

Pull Request resolved: #150398
Approved by: https://github.com/atalman
chuanqi129 pushed a commit to intel/torch-xpu-ops that referenced this pull request Apr 2, 2025
PyTorch introduced a new stream method in pytorch/pytorch#148590, which updates the base distributed interface. This PR aligns with the latest register interface.
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025

Pull Request resolved: pytorch#150398
Approved by: https://github.com/atalman
kwen2501 (Contributor, Author) commented May 6, 2025

Landed

@kwen2501 kwen2501 closed this May 6, 2025
Labels: ci-no-td, ciflow/trunk, keep-going, Merged, oncall: distributed, release notes: distributed (c10d), Reverted