-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Make barrier blocking in UCC #86961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make barrier blocking in UCC #86961
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86961
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit c431ba0: This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc @vtlam for UCC related PR review |
b87a6c3
to
af0cab8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR: Details for Dev Infra teamRaised by workflow job |
@pytorchbot rebase |
@pytorchbot successfully started a rebase job. Check the current status here |
Successfully rebased |
af0cab8
to
c431ba0
Compare
@pytorchbot merge |
1 similar comment
@pytorchbot merge |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
The merge job was canceled. If you believe this is a mistake,then you can re trigger it through pytorch-bot. |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Hey @Sergei-Lebedev. |
Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking. cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: pytorch#86961 Approved by: https://github.com/kwen2501
Enables previously failing UCC dis 6D40 tributed_test.py tests that are now fixed due to either ProcessGroupUCC barrier blocking fix (#86961) or UCC-side timeout error handling fix: (https://github.com/openucx/ucc/pull/679/files). Bump upstream UCC version to build UCC with timeout error handling fix merged in. Pull Request resolved: #89023 Approved by: https://github.com/kwen2501, https://github.com/malfet
Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking. cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu Pull Request resolved: pytorch#86961 Approved by: https://github.com/kwen2501
Enables previously failing UCC distributed_test.py tests that are now fixed due to either ProcessGroupUCC barrier blocking fix (pytorch#86961) or UCC-side timeout error handling fix: (https://github.com/openucx/ucc/pull/679/files). Bump upstream UCC version to build UCC with timeout error handling fix merged in. Pull Request resolved: pytorch#89023 Approved by: https://github.com/kwen2501, https://github.com/malfet
Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu