8000 Stash tensors for reduce_scatter_v and all_gather_v by kwen2501 · Pull Request #150332 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Stash tensors for reduce_scatter_v and all_gather_v #150332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kwen2501
Copy link
Contributor
@kwen2501 kwen2501 commented Mar 31, 2025

Mirror of #149753 for 2.7 release.

Fix 1 of 3 for #148590

#148590 removed record_stream. Since previous AVOID_RECORD flag does not cover reduce_scatter_v and all_gather_v which are in coalescing form, these two ops were missed. Causing TorchRec's Variable Length Embedding to fail.

This PR adds a vector to stash tensors when coalescing is in flight. And the end of coalescing, it will hand over the tensors to Work.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

ghstack-source-id: f391590
Pull Request resolved: #149753

(cherry picked from commit b39cb92)
Copy link
pytorch-bot bot commented Mar 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150332

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 4e6f5e3 with merge base 924a247 (image):

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Mar 31, 2025
@kwen2501 kwen2501 closed this May 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant
0