8000 [async-tp] fix a race condition that can cause silent correctness issue by yifuwang · Pull Request #137199 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[async-tp] fix a race condition that can cause silent correctness issue #137199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

yifuwang
Copy link
Collaborator
@yifuwang yifuwang commented Oct 2, 2024

Stack from ghstack (oldest at bottom):

Details described in #137171:

image

Fix: we introduce the following invariants in _pipelined_all_gather_and_consume and _pipelined_produce_and_all2all:

  • Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
  • After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

Copy link
pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137199

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4f2957e with merge base 0d1701f (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 2, 2024
yifuwang pushed a commit that referenced this pull request Oct 2, 2024
@yifuwang yifuwang changed the title [async-tp] fix a race condition that can cause data corruption [async-tp] fix a race condition that can cause silent correctness issue Oct 2, 2024
@yifuwang yifuwang requested review from lw and Chillee October 2, 2024 22:02
@yifuwang yifuwang added the topic: not user facing topic category label Oct 2, 2024
@yifuwang
Copy link
Collaborator Author
yifuwang commented Oct 2, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 2, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

…ectness issue"


Details described in #137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:
- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Oct 2, 2024
@yifuwang
Copy link
Collaborator Author
yifuwang commented Oct 2, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Deb 8000 ugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 3, 5, lf.linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

…ectness issue"


Details described in #137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:
- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
yifuwang pushed a commit that referenced this pull request Oct 3, 2024
@yifuwang
Copy link
Collaborator Author
yifuwang commented Oct 3, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@kit1980
Copy link
Contributor
kit1980 commented Oct 23, 2024

2.5.1 is an emergency patch release to address specific large regressions, moving this to 2.6.0
In addition, this doesn't have any tests that the issue was actually fixed.

@kit1980 kit1980 modified the milestones: 2.5.1, 2.6.0 Oct 23, 2024
@github-actions github-actions bot deleted the gh/yifuwang/131/head branch November 23, 2024 02:05
@yifuwang
Copy link
Collaborator Author

I confirm that the issue is fixed in 2.6.0 release candidate. cc @kit1980

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0