8000 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure by fduwjj · Pull Request #132931 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure #132931

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

fduwjj
Copy link
Contributor
@fduwjj fduwjj commented Aug 7, 2024

Stack from ghstack (oldest at bottom):

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o @ezyang @gchanan

Copy link
pytorch-bot bot commented Aug 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/132931

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 13359b2 with merge base 55236d0 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Aug 7, 2024
fduwjj added a commit that referenced this pull request Aug 7, 2024
@fduwjj fduwjj requested review from H-Huang, kwen2501 and wconstab August 7, 2024 21:20
…structure"


What I noticed is that we still assume PGNCCL is using option from ProcessGroup but we actually should make it use Option from backend class.


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 7, 2024
…structure"


Although it looks like documents for Options mentioned in #130958 have been all addressed. What I noticed is that we still assume PGNCCL is using option from ProcessGroup but we actually should make it use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 27, 2024
@fduwjj fduwjj added the module: bc-breaking Related to a BC-breaking change label Aug 27, 2024
@pytorch-bot pytorch-bot bot added the topic: bc breaking topic category label Aug 27, 2024
@fduwjj fduwjj changed the title [c10d] Expose backend Options to reflect the correct code structure [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure Aug 27, 2024
@fduwjj fduwjj requested a review from H-Huang August 27, 2024 21:04
@fduwjj fduwjj added the better-engineering Relatively self-contained tasks for better engineering contributors label Aug 27, 2024
…ptions to reflect the correct code structure"


We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o ezyang gchanan

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 27, 2024
@fduwjj fduwjj requested a review from wz337 August 27, 2024 21:34
…ptions to reflect the correct code structure"


We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o ezyang gchanan

[ghstack-poisoned]
@fduwjj fduwjj requested a review from albanD as a code owner August 28, 2024 17:41
fduwjj added a commit that referenced this pull request Aug 28, 2024
@fduwjj fduwjj requested a review from malfet August 28, 2024 17:42
…ptions to reflect the correct code structure"


We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"


cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o ezyang gchanan

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Aug 28, 2024
@fduwjj fduwjj removed request for malfet and albanD August 28, 2024 17:51
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@ZainRizvi
Copy link
Contributor

@pytorchbot revert -c ghfirst -m "This PR is breaking builds internally due to the removal of ProcessGroup::Options"

I'd recommend ghimporting this PR to make sure it passes internally before re-merging. See D62008954 for more details

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@fduwjj your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 30, 2024
…ons to reflect the correct code structure (#132931)"

This reverts commit 65864d0.

Reverted #132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](#132931 (comment)))
fduwjj added a commit that referenced this pull request Sep 11, 2024
…ons to reflect the correct code structure (#132931)

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 11, 2024
…ons to reflect the correct code structure (#132931)

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to mak
8000
e changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

ghstack-source-id: 241951792
Pull Request resolved: #135653
@fduwjj
Copy link
Contributor Author
fduwjj commented Sep 11, 2024

reland in #135653

@fduwjj fduwjj closed this Sep 11, 2024
fduwjj added a commit that referenced this pull request Sep 11, 2024
…ackend Options to reflect the correct code structure (#132931)"

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 11, 2024
…ons to reflect the correct code structure (#132931)

Pull Request resolved: #135653

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)
ghstack-source-id: 242054823
fduwjj added a commit that referenced this pull request Sep 11, 2024
…ons to reflect the correct code structu
6D40
re (#132931)

Pull Request resolved: #135653

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.
ghstack-source-id: 242088446

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)
fduwjj added a commit that referenced this pull request Sep 11, 2024
…oup and Expose backend Options to reflect the correct code structure (#132931)"

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 11, 2024
…ackend Options to reflect the correct code structure (#132931)"

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
…eflect the correct code structure (pytorch#132931)

We introduced the dispatchable backend for a ProcessGroup and collective in pytorch#86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

Pull Request resolved: pytorch#132931
Approved by: https://github.com/H-Huang
tolleybot pushed a commit to tolleybot/pytorch that referenced this pull request Sep 14, 2024
…ons to reflect the correct code structure (pytorch#132931)"

This reverts commit 65864d0.

Reverted pytorch#132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](pytorch#132931 (comment)))
pytorchmergebot pushed a commit that referenced this pull request Sep 16, 2024
…ons to reflect the correct code structure (#132931) (#135653)

We introduced the dispatchable backend for a ProcessGroup and collective in #86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: #135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…eflect the correct code structure (pytorch#132931)

We introduced the dispatchable backend for a ProcessGroup and collective in pytorch#86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

Pull Request resolved: pytorch#132931
Approved by: https://github.com/H-Huang
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…ons to reflect the correct code structure (pytorch#132931)"

This reverts commit 65864d0.

Reverted pytorch#132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](pytorch#132931 (comment)))
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
…ons to reflect the correct code structure (pytorch#132931) (pytorch#135653)

We introduced the dispatchable backend for a ProcessGroup and collective in pytorch#86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: pytorch#135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
@github-actions github-actions bot deleted the gh/fduwjj/112/head branch October 12, 2024 02:04
njhill added a commit to njhill/vllm that referenced this pull request Mar 15, 2025
ProcessGroup.Options was removed in torch 2.6: pytorch/pytorch#132931

Signed-off-by: Nick Hill <nhill@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
better-engineering Relatively self-contained tasks for better engineering contributors ciflow/trunk Trigger trunk jobs on your pull request Merged module: bc-breaking Related to a BC-breaking change oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category Reverted topic: bc breaking topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0