[async-tp] fix a race condition that can cause silent correctness issue #137199

yifuwang · 2024-10-02T18:35:07Z

Stack from ghstack (oldest at bottom):

-> [async-tp] fix a race condition that can cause silent correctness issue #137199

Details described in #137171:

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:

- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams have completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix focuses only on addressing the race condition. Some barriers are currently exposed. They can be hidden by computation, and we'll optimize them in subsequent PRs.
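To illustrate why both barriers matter, here is a minimal toy sketch in plain Python. It is not the PyTorch implementation: threads stand in for ranks, a shared list stands in for the p2p buffers, and `threading.Barrier` stands in for the channel-0 barrier. The function name `run_pipelined_all_gather` and its parameters are hypothetical and exist only for this illustration.

```python
import threading


def run_pipelined_all_gather(world_size: int, num_iters: int):
    """Toy model only: threads play ranks, list slots play p2p buffers."""
    buffers = [None] * world_size             # one "p2p buffer" slot per rank
    barrier = threading.Barrier(world_size)   # stand-in for the channel-0 barrier
    results = [[] for _ in range(world_size)]

    def rank_fn(rank: int) -> None:
        for it in range(num_iters):
            # Stage this rank's shard in its own buffer slot.
            buffers[rank] = (rank, it)
            # Barrier before any rank reads from peer buffers, so every
            # shard for this iteration is in place.
            barrier.wait()
            gathered = [buffers[peer] for peer in range(world_size)]
            # Barrier after all ranks have finished reading, so no rank
            # overwrites its slot while a slower peer is still reading it.
            barrier.wait()
            results[rank].append(gathered)

    threads = [
        threading.Thread(target=rank_fn, args=(r,)) for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this toy model, dropping the second `barrier.wait()` would let a fast rank start the next iteration and overwrite its slot before a slow peer has read the previous value, which is the same kind of silent corruption the PR guards against when two collectives run back to back.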

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-10-02T18:35:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137199

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4f2957e with merge base 0d1701f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 2b10015 Pull Request resolved: #137199

yifuwang · 2024-10-02T22:15:50Z

@pytorchbot merge

pytorchmergebot · 2024-10-02T22:17:25Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-02T22:17:32Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Lint / Test run_test.py is usable without boto3/rockset

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

…ectness issue"

Details described in #137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:

- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]

ghstack-source-id: b29091c Pull Request resolved: #137199

yifuwang · 2024-10-02T22:57:12Z

@pytorchbot merge

pytorchmergebot · 2024-10-02T22:58:57Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-10-03T01:12:47Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, lf.linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team

Raised by workflow job

ghstack-source-id: 25ae32a Pull Request resolved: #137199

yifuwang · 2024-10-03T09:13:01Z

@pytorchbot merge

pytorchmergebot · 2024-10-03T09:14:46Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

kit1980 · 2024-10-23T20:32:47Z

2.5.1 is an emergency patch release to address specific large regressions, so I'm moving this to 2.6.0.
In addition, this PR doesn't have any tests showing that the issue was actually fixed.

yifuwang · 2025-01-24T23:06:57Z

I confirm that the issue is fixed in the 2.6.0 release candidate. cc @kit1980

[async-tp] fix a race condition that can cause data corruption

6f75548

[ghstack-poisoned]

pytorch-bot bot added the oncall: distributed label Oct 2, 2024

yifuwang pushed a commit that referenced this pull request Oct 2, 2024

[async-tp] fix a race condition that can cause data corruption

1359787

ghstack-source-id: 2b10015 Pull Request resolved: #137199

yifuwang changed the title ~~[async-tp] fix a race condition that can cause data corruption~~ [async-tp] fix a race condition that can cause silent correctness issue Oct 2, 2024

yifuwang requested review from lw and Chillee October 2, 2024 22:02

yifuwang added the topic: not user facing label Oct 2, 2024

weifengpy approved these changes Oct 2, 2024

pytorch-bot bot added the ciflow/trunk label Oct 2, 2024

pytorchmergebot added the merging label Oct 2, 2024

pytorchmergebot removed the merging label Oct 2, 2024

yifuwang pushed a commit that referenced this pull request Oct 2, 2024

[async-tp] fix a race condition that can cause data corruption

9dfbb7b

ghstack-source-id: b29091c Pull Request resolved: #137199

pytorchmergebot added the merging label Oct 2, 2024

pytorchmergebot removed the merging label Oct 3, 2024

yifuwang pushed a commit that referenced this pull request Oct 3, 2024

[async-tp] fix a race condition that can cause data corruption

2909b0f

ghstack-source-id: 25ae32a Pull Request resolved: #137199

pytorchmergebot added the merging label Oct 3, 2024

pytorchmergebot added the Merged label Oct 3, 2024

pytorchmergebot closed this in 38114ec Oct 3, 2024

pytorchmergebot removed the merging label Oct 3, 2024

kit1980 added this to the 2.5.1 milestone Oct 3, 2024

awgu mentioned this pull request Oct 12, 2024

SymmetricMemory can silently corrupt data when all-to-all followed by all-gather #137171

Closed

kit1980 modified the milestones: 2.5.1, 2.6.0 Oct 23, 2024

github-actions bot deleted the gh/yifuwang/131/head branch November 23, 2024 02:05

atalman mentioned this pull request Jan 13, 2025

Release 2.6.0 validations checklist and cherry-picks #144503

Closed

73 tasks
