8000 [c10d] Simplify ProcessGroupNCCL into single-device style by kwen2501 · Pull Request #118674 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[c10d] Simplify ProcessGroupNCCL into single-device style #118674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from

Conversation

kwen2501
Copy link
Contributor
@kwen2501 kwen2501 commented Jan 30, 2024

Copy link
pytorch-bot bot commented Jan 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118674

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit db92422 with merge base 9afd539 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label Jan 30, 2024
@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 30, 2024
@kwen2501 kwen2501 marked this pull request as ready for review January 30, 2024 20:56
@wconstab
Copy link
Contributor

this SGTM! a couple thoughts on landing. Should we first land a change that just starts asserting the condition we want to achieve (assert that all those loops/vectors are size 1) and ensure no breakages with minimal code change effort? Then the remaining PR would literally be dead code removal. (and could even be landed in a few smaller chunks if its easier to review)

@kwen2501
Copy link
Contributor Author

Thanks, I think that's a good idea. But I also just realized that a lot of our tests were written in pg.all_reduce([tensors]) style, instead of dist.all_reduce(tensor) (the documented way). So, even for assert == 1 to become green, there needs quite some efforts to refactor our tests first.

@wconstab
Copy link
Contributor

is this actually a deprecation and a bc cycle kind of thing? (and are you thinking its not worth doing the normal BC process bc we don't think any users are actually using this approach?)

@kwen2501
Copy link
Contributor Author

We actually followed the deprecation process (2 release cycles at least), and recently removed <coll>_multigpu APIs from the distributed package. So if the distributed package is used per documentation, the assert == 1 would be true.

The problem is in some of our tests, which used unsupported/documented style.

@wconstab
Copy link
Contributor

Ok. So more work...

10000

@wanchaol
Copy link
Collaborator

This is awesome! I think we should probably try to import this and see if internal tests are happy given this is a large refactor?

@kwen2501 kwen2501 added the module: bc-breaking Related to a BC-breaking change label Feb 2, 2024
@kwen2501 kwen2501 changed the title [WIP][c10d] Simplify ProcessGroupNCCL into single-device style [c10d] Simplify ProcessGroupNCCL into single-device style Feb 2, 2024
@kwen2501
Copy link
Contributor Author
kwen2501 commented Feb 3, 2024

@wconstab @wanchaol breaking this into smaller PRs as you suggested:
First one: #119099

smalltalkman pushed a commit to smalltalkman/pytorch that referenced this pull request Feb 6, 2024
Breaking pytorch#118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: pytorch#119099
Approved by: https://github.com/wconstab
pytorchmergebot pushed a commit that referenced this pull request Feb 7, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
pytorchmergebot pushed a commit that referenced this pull request Feb 9, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
@kwen2501
Copy link
Contributor Author
kwen2501 commented Feb 9, 2024

Closing via #119099 and #119421

@kwen2501 kwen2501 closed this Feb 9, 2024
pytorchmergebot pushed a commit that referenced this pull request Feb 12, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: #119099
Approved by: https://github.com/wconstab, https://github.com/XilunWu
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
@github-actions github-actions bot deleted the single_dev_pg branch March 11, 2024 01:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: bc-breaking Related to a BC-breaking change oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[c10d] Refactor ProcessGroupNCCL to use single device
3 participants
0