[ddp] decouple python reducer from compilation mode by xmfan · Pull Request #147123 · pytorch/pytorch · GitHub

[ddp] decouple python reducer from compilation mode #147123


Closed
wants to merge 4 commits

Conversation

xmfan
Member
@xmfan xmfan commented Feb 13, 2025

Stack from ghstack (oldest at bottom):

The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to the C++ reducer with no DDPOptimizer.
I'm changing this behavior to always use the python reducer when the config is specified.
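As a sketch of how the config is meant to be used (the config name follows torch/_dynamo/config.py as shown in the diff; the training setup itself is illustrative and assumes a process group is already initialized):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Opt in to the Python reducer. With this change, the config takes effect
# even if the DDP forward is never compiled; previously it silently fell
# back to the C++ reducer in that case.
torch._dynamo.config.optimize_ddp = "python_reducer"

model = DDP(nn.Linear(8, 8))  # assumes init_process_group() was called earlier

# Gradients are now reduced by the Python reducer, whether the forward
# runs eagerly or under torch.compile.
```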

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @StrongerXi

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added ciflow/inductor module: dynamo oncall: distributed Add this issue/PR to distributed oncall triage queue labels Feb 13, 2025
pytorch-bot bot commented Feb 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147123

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit 6c3658d with merge base 63e8ad4:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

xmfan added a commit that referenced this pull request Feb 13, 2025
ghstack-source-id: 53e9053
Pull Request resolved: #147123
def _should_disable_cpp_reducer(self) -> bool:
    return self._use_python_reducer and (
        torch.compiler.is_compiling() or self._force_to_disable_cpp_reducer
    )
Member Author


The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to the C++ reducer with no DDPOptimizer.

I'm changing this behavior to always use the python reducer when the config is specified.
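The behavior change can be illustrated with a standalone sketch of the decision (hypothetical free functions mirroring `_should_disable_cpp_reducer`; `is_compiling` stands in for `torch.compiler.is_compiling()`):

```python
def should_disable_cpp_reducer_old(use_python_reducer: bool,
                                   is_compiling: bool,
                                   force_disable: bool) -> bool:
    # Old logic: the config only takes effect when compiling (or when
    # explicitly forced), so an eager DDP forward silently falls back
    # to the C++ reducer even with python_reducer configured.
    return use_python_reducer and (is_compiling or force_disable)


def should_disable_cpp_reducer_new(use_python_reducer: bool) -> bool:
    # New logic: honor the config unconditionally.
    return use_python_reducer


# Eager forward (not compiling, not forced) with python_reducer configured:
assert should_disable_cpp_reducer_old(True, False, False) is False  # silent fallback
assert should_disable_cpp_reducer_new(True) is True                 # python reducer used
```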

@xmfan xmfan marked this pull request as ready for review February 13, 2025 19:25
@xmfan xmfan requested a review from fegin February 13, 2025 19:25
Update
13eee19
[ghstack-poisoned]
xmfan added a commit that referenced this pull request Feb 13, 2025
ghstack-source-id: ec001dd
Pull Request resolved: #147123
@mikaylagawarecki mikaylagawarecki removed their request for review February 13, 2025 23:24


_ddp_optimization_mode = [
    "ddp_optimizer",
Collaborator


Nit: this should probably be a tuple of typing.Literal values, for type checking.

@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request module: ddp Issues/PRs related distributed data parallel training labels Feb 18, 2025
Contributor
@fegin fegin left a comment


LGTM. We should add one more warning explicitly asking users to use torch.compile and compiled autograd if the python reducer is used.

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Feb 18, 2025
ghstack-source-id: b68023a
Pull Request resolved: #147123
@xmfan
Member Author
xmfan commented Feb 18, 2025

IMO there is value in using python_reducer outside of torch.compile + CA, in order to prototype, debug, or capture graphs for a third-party compiler backend. A parallel is torch.compile allowing you to specify non-inductor backends; I don't think we should raise a warning.

@xmfan
Member Author
xmfan commented Feb 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 96ba4f899dc61456e296c05531a054f80b9986f9 returned non-zero exit code 1

Auto-merging torch/_dynamo/config.py
Auto-merging torch/_dynamo/convert_frame.py
CONFLICT (content): Merge conflict in torch/_dynamo/convert_frame.py
error: could not apply 96ba4f899dc... [ddp] decouple python reducer from compilation mode
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"

[ghstack-poisoned]
xmfan added a commit that referenced this pull request Feb 19, 2025
ghstack-source-id: 3eda73b
Pull Request resolved: #147123
@xmfan
Member Author
xmfan commented Feb 19, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


@xmfan
Member Author
xmfan commented Feb 19, 2025

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: trunk / linux-focal-rocm6.3-py3.10 / test (default, 1, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (default, 2, 2, linux.rocm.gpu.2), trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Feb 20, 2025
Summary:
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

X-link: pytorch/pytorch#147123
Approved by: https://github.com/fegin

Reviewed By: jeanschmidt

Differential Revision: D69890503

fbshipit-source-id: 49c9e2995548ee8140e33388cdeaf47720fcf8d8
Raymo111 pushed a commit that referenced this pull request Feb 20, 2025
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: #147123
Approved by: https://github.com/fegin
pytorch-bot bot pushed a commit that referenced this pull request Feb 24, 2025
Current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fallback to C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: #147123
Approved by: https://github.com/fegin
@github-actions github-actions bot deleted the gh/xmfan/181/head branch March 27, 2025 02:10
Labels
ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request Merged module: ddp Issues/PRs related distributed data parallel training module: dynamo oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (miscellaneous)
4 participants