[ddp] decouple python reducer from compilation mode by xmfan · Pull Request #147123 · pytorch/pytorch · GitHub

[ddp] decouple python reducer from compilation mode #147123


Closed
wants to merge 4 commits

Conversation

xmfan
Member
@xmfan xmfan commented Feb 13, 2025

Stack from ghstack (oldest at bottom):

The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to the C++ reducer with no DDPOptimizer.
I'm changing this behavior to always use the python reducer when the config is specified.
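As a sketch of how the config is meant to be used (the config name follows torch/_dynamo/config.py as shown in the diff; the training setup itself is illustrative and assumes a process group is already initialized):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Opt in to the Python reducer. With this change, the config takes effect
# even if the DDP forward is never compiled; previously it silently fell
# back to the C++ reducer in that case.
torch._dynamo.config.optimize_ddp = "python_reducer"

model = DDP(nn.Linear(8, 8))  # assumes init_process_group() was called earlier

# Gradients are now reduced by the Python reducer, whether the forward
# runs eagerly or under torch.compile.
```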

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @StrongerXi

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added ciflow/inductor module: dynamo oncall: distributed Add this issue/PR to distributed oncall triage queue labels Feb 13, 2025
pytorch-bot bot commented Feb 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147123

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 6c3658d with merge base 63e8ad4:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

xmfan added a commit that referenced this pull request Feb 13, 2025
ghstack-source-id: 53e9053
Pull Request resolved: #147123
def _should_disable_cpp_reducer(self) -> bool:
    return self._use_python_reducer and (
        torch.compiler.is_compiling() or self._force_to_disable_cpp_reducer
    )
Member Author


The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to the C++ reducer with no DDPOptimizer.

I'm changing this behavior to always use the python reducer when the config is specified.
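The behavior change can be illustrated with a standalone sketch of the decision (hypothetical free functions mirroring `_should_disable_cpp_reducer`; `is_compiling` stands in for `torch.compiler.is_compiling()`):

```python
def should_disable_cpp_reducer_old(use_python_reducer: bool,
                                   is_compiling: bool,
                                   force_disable: bool) -> bool:
    # Old logic: the config only takes effect when compiling (or when
    # explicitly forced), so an eager DDP forward silently falls back
    # to the C++ reducer even with python_reducer configured.
    return use_python_reducer and (is_compiling or force_disable)


def should_disable_cpp_reducer_new(use_python_reducer: bool) -> bool:
    # New logic: honor the config unconditionally.
    return use_python_reducer


# Eager forward (not compiling, not forced) with python_reducer configured:
assert should_disable_cpp_reducer_old(True, False, False) is False  # silent fallback
assert should_disable_cpp_reducer_new(True) is True                 # python reducer used
```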

@xmfan xmfan marked this pull request as ready for review February 13, 2025 19:25
@xmfan xmfan requested a review from fegin February 13, 2025 19:25
Update
13eee19
[ghstack-poisoned]
xmfan added a commit that referenced this pull request Feb 13, 2025
ghstack-source-id: ec001dd
Pull Request resolved: #147123
@mikaylagawarecki mikaylagawarecki removed their request for review February 13, 2025 23:24


_ddp_optimization_mode = [
    "ddp_optimizer",
Collaborator


Nit: this should probably be a tuple of typing.Literal values, for type checking.

@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request module: ddp Issues/PRs related distributed data parallel training labels Feb 18, 2025
Contributor
@fegin fegin left a comment


LGTM. We should add one more warning explicitly asking users to use torch.compile and compiled autograd if the python reducer is used.

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Feb 18, 2025
ghstack-source-id: b68023a
Pull Request resolved: #147123
@xmfan
Member Author
xmfan commented Feb 18, 2025

IMO there is value in using python_reducer outside of torch.compile + CA, in order to prototype, debug, or capture graphs for a third-party compiler backend. A parallel is torch.compile allowing you to specify non-inductor backends; I don't think we should raise a warning.

@xmfan
Member Author
xmfan commented Feb 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 96ba4f899dc61456e296c05531a054f80b9986f9 returned non-zero exit code 1

Auto-merging torch/_dynamo/config.py
Auto-merging torch/_dynamo/convert_frame.py
CONFLICT (content): Merge conflict in torch/_dynamo/convert_frame.py
error: could not apply 96ba4f899dc... [ddp] decouple python reducer from compilation mode
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Feb 19, 2025
ghstack-source-id: 3eda73b
Pull Request resolved: #147123
@xmfan
Member Author
xmfan commented Feb 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


@xmfan
Member Author
xmfan commented Feb 19, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Feb 20, 2025
Summary:
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

X-link: pytorch/pytorch#147123
Approved by: https://github.com/fegin

Reviewed By: jeanschmidt

Differential Revision: D69890503

fbshipit-source-id: 49c9e2995548ee8140e33388cdeaf47720fcf8d8
Raymo111 pushed a commit that referenced this pull request Feb 20, 2025
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: #147123
Approved by: https://github.com/fegin
pytorch-bot bot pushed a commit that referenced this pull request Feb 24, 2025
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: #147123
Approved by: https://github.com/fegin
@github-actions github-actions bot deleted the gh/xmfan/181/head branch March 27, 2025 02:10
Labels
ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: ddp Issues/PRs related distributed data parallel training module: dynamo oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (miscellaneous)
4 participants