8000 Make barrier blocking in UCC by Sergei-Lebedev · Pull Request #86961 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Make barrier blocking in UCC #86961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

Sergei-Lebedev
Copy link
Contributor
@Sergei-Lebedev Sergei-Lebedev commented Oct 14, 2022

Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

@pytorch-bot
Copy link
pytorch-bot bot commented Oct 14, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86961

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c431ba0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link
linux-foundation-easycla bot commented Oct 14, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Sergei-Lebedev / name: Sergey Lebedev (b87a6c3dd9b48c2d1617eb8de1c7273b0ca1f698)

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Oct 14, 2022
@H-Huang H-Huang added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 14, 2022
@H-Huang
Copy link
Member
H-Huang commented Oct 14, 2022

cc @vtlam for UCC related PR review

@H-Huang H-Huang requested a review from vtlam October 14, 2022 13:43
@zou3519 zou3519 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 14, 2022
@Sergei-Lebedev Sergei-Lebedev force-pushed the ucc_barrier_blocking_wait branch from b87a6c3 to af0cab8 Compare October 21, 2022 05:47
Copy link
Contributor
@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 28, 2022
@Fuzzkatt
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR:
@pytorchbot rebase

Details for Dev Infra team Raised by workflow job

@Fuzzkatt
Copy link
Collaborator
8000 Fuzzkatt commented Oct 28, 2022

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased ucc_barrier_blocking_wait onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout ucc_barrier_blocking_wait && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the ucc_barrier_blocking_wait branch from af0cab8 to c431ba0 Compare October 28, 2022 21:37
@Fuzzkatt
Copy link
Collaborator

@pytorchbot merge

1 similar comment
@Fuzzkatt
Copy link
Collaborator

@pytorchbot merge

@seemethere
Copy link
Member

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled. If you believe this is a mistake,then you can re trigger it through pytorch-bot.

@Fuzzkatt
Copy link
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@github-actions
Copy link
Contributor

Hey @Sergei-Lebedev.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: pytorch#86961
Approved by: https://github.com/kwen2501
pytorchmergebot pushed a commit that referenced this pull request Nov 22, 2022
Enables previously failing UCC dis
6D40
tributed_test.py tests that are now fixed due to either ProcessGroupUCC barrier blocking fix (#86961) or UCC-side timeout error handling fix:  (https://github.com/openucx/ucc/pull/679/files). Bump upstream UCC version to build UCC with timeout error handling fix merged in.

Pull Request resolved: #89023
Approved by: https://github.com/kwen2501, https://github.com/malfet
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Currently CUDA UCC barrier is nonblocking with respect to CPU and there is no flag to change it. To make UCC PG barrier behaviour consistent with NCCL PG in this PR barrier has changed to be always blocking.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
Pull Request resolved: pytorch#86961
Approved by: https://github.com/kwen2501
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Enables previously failing UCC distributed_test.py tests that are now fixed due to either ProcessGroupUCC barrier blocking fix (pytorch#86961) or UCC-side timeout error handling fix:  (https://github.com/openucx/ucc/pull/679/files). Bump upstream UCC version to build UCC with timeout error handling fix merged in.

Pull Request resolved: pytorch#89023
Approved by: https://github.com/kwen2501, https://github.com/malfet
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (c10d) release notes category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
0