-
Notifications
You must be signed in to change notification settings - Fork 24.3k
[c10d] Simplify ProcessGroupNCCL into single-device style #118674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118674
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 1 Unrelated FailureAs of commit db92422 with merge base 9afd539 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
this SGTM! a couple thoughts on landing. Should we first land a change that just starts asserting the condition we want to achieve (assert that all those loops/vectors are size 1) and ensure no breakages with minimal code change effort? Then the remaining PR would literally be dead code removal. (and could even be landed in a few smaller chunks if its easier to review) |
Thanks, I think that's a good idea. But I also just realized that a lot of our tests were written in |
is this actually a deprecation and a bc cycle kind of thing? (and are you thinking its not worth doing the normal BC process bc we don't think any users are actually using this approach?) |
We actually followed the deprecation process (2 release cycles at least), and recently removed The problem is in some of our tests, which used unsupported/documented style. |
Ok. So more work... |
This is awesome! I think we should probably try to import this and see if internal tests are happy given this is a large refactor? |
4838707
to
bd4a7f0
Compare
Breaking pytorch#118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: pytorch#119099 Approved by: https://github.com/wconstab
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab, https://github.com/XilunWu
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab, https://github.com/XilunWu
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab, https://github.com/XilunWu
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
Resolves #117756
Saves ~45 for loops.
Saves ~100 std::vector.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @ezyang @gchanan