Stash tensors for reduce_scatter_v and all_gather_v by kwen2501 · Pull Request #149753 · pytorch/pytorch · GitHub

Stash tensors for reduce_scatter_v and all_gather_v #149753


Closed
kwen2501 wants to merge 3 commits

Conversation

kwen2501
Contributor
@kwen2501 kwen2501 commented Mar 21, 2025

Stack from ghstack (oldest at bottom):

#148590 removed `record_stream`. Since the previous AVOID_RECORD flag did not cover reduce_scatter_v and all_gather_v, which are implemented via coalescing, these two ops were missed, causing TorchRec's Variable Length Embedding to fail.

This PR adds a vector to stash tensors while coalescing is in flight. At the end of coalescing, the stashed tensors are handed over to the Work object.
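
For illustration only, not part of the PR, here is a hedged, minimal sketch of the variable-size ("v") all_gather path this change covers. The script name, the 2-GPU launch command, and the per-rank sizes are assumptions; the key point, per the PR description, is that unequal per-rank sizes are what exercise the coalesced all_gather_v path in the NCCL backend.

```python
# repro_all_gather_v.py -- hypothetical sketch; run with
# `torchrun --nproc-per-node=2 repro_all_gather_v.py`
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank)

    # Each rank contributes a tensor of a different length (rank + 1 elements),
    # so the gathered outputs have unequal sizes, which is what exercises the
    # coalesced all_gather_v path this PR fixes (per the PR description).
    inp = torch.full((rank + 1,), float(rank), device="cuda")
    outs = [torch.empty(r + 1, device="cuda") for r in range(world)]
    dist.all_gather(outs, inp)

    torch.cuda.synchronize()
    if rank == 0:
        print([t.tolist() for t in outs])  # e.g. [[0.0], [1.0, 1.0]] on 2 ranks
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```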

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Mar 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149753

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit dc68c40 with merge base 8d08b49:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on Mar 21, 2025
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: 8156c5a
Pull Request resolved: #149753
cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: e4cc4d7
Pull Request resolved: #149753
@kwen2501 kwen2501 requested review from fduwjj and eqy March 21, 2025 18:46
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: f391590
Pull Request resolved: #149753

@requires_nccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "NCCL test requires 2+ GPUs")
def test_all_gather_v(self):
Contributor
nice!
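
The PR equally covers reduce_scatter_v. As a companion to the test context above, here is a hedged sketch (not the actual test body; the helper name and sizes are hypothetical) of how the variable-size reduce_scatter path is typically invoked: unequal per-rank chunk sizes are what route `reduce_scatter` down the coalesced reduce_scatter_v path described in this PR.

```python
import torch
import torch.distributed as dist

# Hypothetical sketch, assuming an already-initialized NCCL process group
# (e.g. the setup from the repro earlier in this thread). Sizes are illustrative.
def reduce_scatter_v_example() -> torch.Tensor:
    rank = dist.get_rank()
    world = dist.get_world_size()

    # Every rank contributes one chunk per destination rank; chunk j has j + 1
    # elements, so the per-rank output sizes are unequal and the variable-size
    # (reduce_scatter_v) path is taken.
    inputs = [torch.full((j + 1,), float(rank), device="cuda") for j in range(world)]
    out = torch.empty(rank + 1, device="cuda")
    dist.reduce_scatter(out, inputs)  # default op: sum across ranks

    # On rank r: a tensor of r + 1 elements, each equal to sum(range(world)).
    return out
```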

@fduwjj fduwjj left a comment

LGTM

kwen2501 added a commit that referenced this pull request Mar 21, 2025
…147820)

Summary:

PTD's current workflow:
- PTD creates its own dedicated `ncclStream` for communication operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).

This diff:
- async=False (default): use the current stream as the NCCL stream and avoid the stream-sync overhead
- async=True: retain the existing logic, i.e. create a new NCCL stream and let it wait on the current stream to ensure tensors are ready
- pass asyncOp down from c10d to the NCCL process group

This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU+GPU time from **230us to 195us (~15%)**.

Differential Revision: D70135605
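
To make the async=False vs. async=True behavior above concrete, a hedged usage sketch follows. It uses only the standard c10d Python API; the comments about streams restate the commit message rather than adding guarantees, and an initialized NCCL process group with the correct device set is assumed.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already run and
# torch.cuda.set_device(rank) has been called for this rank.
t = torch.ones(1024, device="cuda")

# async_op=False (the default): blocking call; per this change the collective is
# launched on the current stream, so no extra stream synchronization is needed.
dist.all_reduce(t)

# async_op=True: returns a Work handle; the existing logic is kept (a dedicated
# NCCL stream that waits on the current stream), and the caller must wait
# before consuming the result.
work = dist.all_reduce(t, async_op=True)
work.wait()
```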

[PGNCCL] Make avoid-record-stream default

[c10d] Add asyncOp argument to Ops

Change python side wait

Pass asyncOp at ProcessGroup level

Watchdog unstashing tensors as a safety net

lint

ghstack-source-id: f391590
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: f391590
Pull Request resolved: #149753
kwen2501 added a commit that referenced this pull request Mar 28, 2025
…147820)


ghstack-source-id: e4b48e5
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: e4b48e5
Pull Request resolved: #149753

[c10d] Move unstashing from watchdog to main thread

ghstack-source-id: e4b48e5
Pull Request resolved: #150079

[PGNCCL][BE] Merge mutex into TensorShelf for encapsulation

ghstack-source-id: e4b48e5
Pull Request resolved: #150130
kwen2501 added a commit that referenced this pull request Mar 31, 2025
ghstack-source-id: f391590
Pull Request resolved: #149753

(cherry picked from commit b39cb92)
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which are, unfortunately, interrelated):
1. When async_op=False, we launch the collective directly on the "current" stream, instead of launching on a trampoline stream and joining back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in the HIP case) and one pybind crossing when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetime against premature recycling (see the sketch after this list).
- Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different -- collective kernels show up on the same line as compute kernels.
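
As referenced in item 2, here is a hedged illustration (the helper name and buffer size are hypothetical) of the lifetime hazard that CPU-side stashing addresses when async_op=True:

```python
import torch
import torch.distributed as dist


def launch_and_drop():
    """Hypothetical helper: launch an async all_reduce on a temporary buffer.

    The caller only keeps the Work handle; the Python reference to `buf` dies
    when this function returns. Without record_stream or CPU-side stashing,
    the caching allocator could recycle `buf`'s memory while the collective is
    still running on the NCCL stream. Stashing the tensor on the Work keeps it
    alive until wait()/completion, with the watchdog unstashing as a safety net.
    """
    buf = torch.ones(1 << 20, device="cuda")
    work = dist.all_reduce(buf, async_op=True)
    return work


# Usage (assumes an initialized NCCL process group):
# work = launch_and_drop()
# work.wait()
```

Handing the stashed tensors over to the Work at the end of coalescing extends the same protection to the coalesced reduce_scatter_v and all_gather_v paths, which is what #149753 adds.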

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: #149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: #150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: #150130

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

ghstack-source-id: ce103fc
Pull Request resolved: #150398
pytorchmergebot pushed a commit that referenced this pull request Apr 1, 2025
…irely (#150398)

Relanding #148590 due to merge conflict.

Pull Request resolved: #150398
Approved by: https://github.com/atalman
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
…irely (pytorch#150398)

Relanding pytorch#148590 due to merge conflict.

Pull Request resolved: pytorch#150398
Approved by: https://github.com/atalman
@kwen2501
Contributor Author
kwen2501 commented May 6, 2025

Squashed

@kwen2501 kwen2501 closed this May 6, 2025