Stash tensors for reduce_scatter_v and all_gather_v by kwen2501 · Pull Request #149753 · pytorch/pytorch · GitHub

Stash tensors for reduce_scatter_v and all_gather_v #149753


Closed
kwen2501 wants to merge 3 commits

Conversation

kwen2501
Contributor
@kwen2501 kwen2501 commented Mar 21, 2025

Stack from ghstack (oldest at bottom):

#148590 removed `record_stream`. Since the previous AVOID_RECORD flag did not cover reduce_scatter_v and all_gather_v, which are implemented via coalescing, these two ops were missed, causing TorchRec's Variable Length Embedding to fail.

This PR adds a vector to stash tensors while coalescing is in flight. At the end of coalescing, the stashed tensors are handed over to the Work object.
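
For illustration only, not part of the PR, here is a hedged, minimal sketch of the variable-size ("v") all_gather path this change covers. The script name, the 2-GPU launch command, and the per-rank sizes are assumptions; the key point, per the PR description, is that unequal per-rank sizes are what exercise the coalesced all_gather_v path in the NCCL backend.

```python
# repro_all_gather_v.py -- hypothetical sketch; run with
# `torchrun --nproc-per-node=2 repro_all_gather_v.py`
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank)

    # Each rank contributes a tensor of a different length (rank + 1 elements),
    # so the gathered outputs have unequal sizes, which is what exercises the
    # coalesced all_gather_v path this PR fixes (per the PR description).
    inp = torch.full((rank + 1,), float(rank), device="cuda")
    outs = [torch.empty(r + 1, device="cuda") for r in range(world)]
    dist.all_gather(outs, inp)

    torch.cuda.synchronize()
    if rank == 0:
        print([t.tolist() for t in outs])  # e.g. [[0.0], [1.0, 1.0]] on 2 ranks
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```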

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Mar 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149753

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit dc68c40 with merge base 8d08b49:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Mar 21, 2025
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: 8156c5a
Pull Request resolved: #149753
cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: e4cc4d7
Pull Request resolved: #149753
@kwen2501 kwen2501 requested review from fduwjj and eqy March 21, 2025 18:46
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: f391590
Pull Request resolved: #149753

@requires_nccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "NCCL test requires 2+ GPUs")
def test_all_gather_v(self):
Contributor
nice!
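
The PR equally covers reduce_scatter_v. As a companion to the test context above, here is a hedged sketch (not the actual test body; the helper name and sizes are hypothetical) of how the variable-size reduce_scatter path is typically invoked: unequal per-rank chunk sizes are what route `reduce_scatter` down the coalesced reduce_scatter_v path described in this PR.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch, assuming an already-initialized NCCL process group
# (e.g. the setup from the repro earlier in this thread). Sizes are illustrative.
def reduce_scatter_v_example() -> torch.Tensor:
    rank = dist.get_rank()
    world = dist.get_world_size()

    # Every rank contributes one chunk per destination rank; chunk j has j + 1
    # elements, so the per-rank output sizes are unequal and the variable-size
    # (reduce_scatter_v) path is taken.
    inputs = [torch.full((j + 1,), float(rank), device="cuda") for j in range(world)]
    out = torch.empty(rank + 1, device="cuda")
    dist.reduce_scatter(out, inputs)  # default op: sum across ranks

    # On rank r: a tensor of r + 1 elements, each equal to sum(range(world)).
    return out
```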

@fduwjj fduwjj left a comment

LGTM

kwen2501 added a commit that referenced this pull request Mar 21, 2025
…147820)

Summary:

PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for communication operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).

This diff:
- async=False (default): use the current stream as the NCCL stream and avoid the stream-sync overhead
- async=True: retain the existing logic, i.e. create a new NCCL stream and let it wait on the current stream to ensure tensors are ready
- pass asyncOp down from c10d to the NCCL process group

This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU+GPU time from **230us to 195us (~15%)**.

Differential Revision: D70135605
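
To make the async=False vs. async=True behavior above concrete, a hedged usage sketch follows. It uses only the standard c10d Python API; the comments about streams restate the commit message rather than adding guarantees, and an initialized NCCL process group with the correct device set is assumed.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already run and
# torch.cuda.set_device(rank) has been called for this rank.
t = torch.ones(1024, device="cuda")

# async_op=False (the default): blocking call; per this change the collective is
# launched on the current stream, so no extra stream synchronization is needed.
dist.all_reduce(t)

# async_op=True: returns a Work handle; the existing logic is kept (a dedicated
# NCCL stream that waits on the current stream), and the caller must wait
# before consuming the result.
work = dist.all_reduce(t, async_op=True)
work.wait()
```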

[PGNCCL] Make avoid-record-stream default

[c10d] Add asyncOp argument to Ops

Change python side wait

Pass asyncOp at ProcessGroup level

Watchdog unstashing tensors as a safety net

lint

ghstack-source-id: f391590
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: f391590
Pull Request resolved: #149753
kwen2501 added a commit that referenced this pull request Mar 28, 2025
…147820)


ghstack-source-id: e4b48e5
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: e4b48e5
Pull Request resolved: #149753

[c10d] Move unstashing from watchdog to main thread

ghstack-source-id: e4b48e5
Pull Request resolved: #150079

[PGNCCL][BE] Merge mutex into TensorShelf for encapsulation

ghstack-source-id: e4b48e5
Pull Request resolved: #150130
kwen2501 added a commit that referenced this pull request Mar 31, 2025
ghstack-source-id: f391590
Pull Request resolved: #149753

(cherry picked from commit b39cb92)
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which are, unfortunately, interrelated):
1. When async_op=False, we launch the collective directly on the "current" stream, instead of launching on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the HIP case) and one pybind crossing when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetime against premature recycling (see the sketch after this list).
- Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different -- collective kernels show up on the same line as compute kernels.
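
As referenced in item 2, here is a hedged illustration (the helper name and buffer size are hypothetical) of the lifetime hazard that CPU-side stashing addresses when async_op=True:

```python
import torch
import torch.distributed as dist


def launch_and_drop():
    """Hypothetical helper: launch an async all_reduce on a temporary buffer.

    The caller only keeps the Work handle; the Python reference to `buf` dies
    when this function returns. Without record_stream or CPU-side stashing,
    the caching allocator could recycle `buf`'s memory while the collective is
    still running on the NCCL stream. Stashing the tensor on the Work keeps it
    alive until wait()/completion, with the watchdog unstashing as a safety net.
    """
    buf = torch.ones(1 << 20, device="cuda")
    work = dist.all_reduce(buf, async_op=True)
    return work


# Usage (assumes an initialized NCCL process group):
# work = launch_and_drop()
# work.wait()
```

Handing the stashed tensors over to the Work at the end of coalescing extends the same protection to the coalesced reduce_scatter_v and all_gather_v paths, which is what #149753 adds.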

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: #149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: #150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: #150130

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

ghstack-source-id: ce103fc
Pull Request resolved: #150398
pytorchmergebot pushed a commit that referenced this pull request Apr 1, 2025
…irely (#150398)

Relanding #148590 due to merge conflict.

Pull Request resolved: #150398
Approved by: https://github.com/atalman
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
…irely (pytorch#150398)

Relanding pytorch#148590 due to merge conflict.

Pull Request resolved: pytorch#150398
Approved by: https://github.com/atalman
@kwen2501
Contributor Author
kwen2501 commented May 6, 2025

Squashed

@kwen2501 kwen2501 closed this May 6, 2025