torch.utils.checkpoint preserves torch function mode stack during recompute #148023
base: gh/soulitzer/353/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148023
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 active SEV: if your PR is affected, please review it.
❌ 12 new failures, 1 cancelled job as of commit 9652102 with merge base e57cdb8.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
… during recompute" [ghstack-poisoned]
… during recompute" Fixes #147995 [ghstack-poisoned]
… during recompute" Fixes #147995 TorchFunctionModeTLS is part of the autograd tls, but because .backward() itself is a leaf for TorchFunctionMode, the mode is disabled before we enter into the engine. Conversely, since TorchDispatchMode traces through the .backward() python call, we don't actually need to manually stash/restore if the user keeps the same mode enabled. We should still fix TorchDispatchMode though, because even if the user doesn't keep the same mode enabled on the .backward() call, checkpoint should not fail. [ghstack-poisoned]
torch/utils/checkpoint.py
with (
    device_autocast_ctx,  # type: ignore[attr-defined]
    torch.amp.autocast("cpu", **cpu_autocast_kwargs),
    _apply_torch_function_mode_stack(torch_function_mode_stack),
Noob question: during the recompute phase, are we effectively rerunning the user's forward function (and all of their torch.* ops)? Or is the AC code here doing something different, like capturing all of the ATen ops witnessed during the forward and replaying those? If it's the latter, I'm not sure it will work as cleanly with a TorchFunctionMode.
In eager, it's the former: we're just rerunning the user's forward function.
For compile, I'm not sure what the status of TorchFunctionMode support is generally (cc @mlazos), but assuming that the TorchFunctionMode is inlined through by dynamo, I'd guess the subgraph that the HOP applies the is_recompute annotations to should already have the TorchFunctionMode logic baked in.
Yep, I think this is right.
Yes, this is correct @soulitzer
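For concreteness, here is a minimal eager sketch of the behavior being discussed. LoggingMode and fn are made up for illustration, and the comments describe the intended behavior of this PR rather than verified output:

```python
import torch
from torch.overrides import TorchFunctionMode
from torch.utils.checkpoint import checkpoint


class LoggingMode(TorchFunctionMode):
    # Illustrative mode that prints every torch.* call it intercepts.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"intercepted {func}")
        return func(*args, **kwargs)


def fn(x):
    # The user's forward; with use_reentrant=False checkpoint, this exact
    # function is rerun during backward to recompute the activations.
    return x.sin().cos()


x = torch.randn(4, requires_grad=True)
with LoggingMode():
    out = checkpoint(fn, x, use_reentrant=False)

# With this PR, the recompute triggered by backward should also run under the
# mode stack captured at forward time, so the sin/cos calls are intercepted
# again during recompute (previously they were not; see #147995).
out.sum().backward()
```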
… during recompute" Fixes #147995 TorchFunctionModeTLS is part of the autograd tls, but because .backward() itself is a leaf for TorchFunctionMode, the mode is disabled before we enter into the engine. Conversely, since TorchDispatchMode traces through the .backward() python call, we don't actually need to manually stash/restore if the user keeps the same mode enabled. We should still fix TorchDispatchMode though, because even if the user doesn't keep the same mode enabled on the .backward() call, checkpoint should not fail. [ghstack-poisoned]
… during recompute" Fixes #147995 TorchFunctionModeTLS is part of the autograd tls, but because .backward() itself is a leaf for TorchFunctionMode, the mode is disabled before we enter into the engine. Conversely, since TorchDispatchMode traces through the .backward() python call, we don't actually need to manually stash/restore if the user keeps the same mode enabled. We should still fix TorchDispatchMode though, because even if the user doesn't keep the same mode enabled on the .backward() call, checkpoint should not fail. [ghstack-poisoned]
Can you describe why this is the right thing to do?
In a world where AC's semantics are to "rerun the forward", it's possible to break the invariant that the recomputed tensors match the originally saved ones. The responsibility for preserving that invariant is split between the user and AC. On one hand, the user should not do random control flow depending on global state. On the other hand, AC will handle composing built-in features like RNG and autocast. TorchFunctionMode is an interesting case: it's a built-in feature, but it's also an extension point where the user can add custom logic, so there's no guarantee whether they want the stack re-enabled or not, depending on their use case. But two reasons for re-enabling the stack by default are:
Discussed this with @albanD, and we may want to support this more generally by just stashing/restoring the TLS state. The blast radius of this PR becomes a bit larger, but it would also allow us to support TorchDispatchMode and clean up the currently manual handling around autocast + no_grad.
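Roughly, the more general version would snapshot the ambient state at forward/pack time and re-enter it around the recompute. A hypothetical sketch of that shape, assuming the captured TorchFunctionMode objects can simply be re-entered as context managers (the helper name is made up and this is not the actual implementation):

```python
import contextlib


@contextlib.contextmanager
def _reapply_torch_function_modes(modes):
    # Hypothetical helper: re-enter TorchFunctionMode objects captured during
    # the original forward (outermost first) so the recompute runs under the
    # same mode stack as the original forward did.
    with contextlib.ExitStack() as stack:
        for mode in modes:
            stack.enter_context(mode)
        yield
```

A fully general TLS snapshot covering TorchDispatchMode, autocast, and grad mode would presumably live alongside the existing autograd TLS handling rather than in Python.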
… during recompute" Fixes #147995 TorchFunctionModeTLS is part of the autograd tls, but because .backward() itself is a leaf for TorchFunctionMode, the mode is disabled before we enter into the engine. Conversely, since TorchDispatchMode traces through the .backward() python call, we don't actually need to manually stash/restore if the user keeps the same mode enabled. [ghstack-poisoned]
Stack from ghstack (oldest at bottom):
Fixes #147995
TorchFunctionModeTLS is part of the autograd TLS, but because .backward() itself is a leaf for TorchFunctionMode, the mode is disabled before we enter the engine. Conversely, since TorchDispatchMode traces through the .backward() Python call, we don't actually need to manually stash/restore it if the user keeps the same mode enabled.
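To make the "leaf" point concrete, here is a rough sketch of where the mode stack gets dropped without this fix. RecordingMode is made up for illustration, and the comments reflect my reading of the description above rather than verified behavior:

```python
import torch
from torch.overrides import TorchFunctionMode
from torch.utils.checkpoint import checkpoint


class RecordingMode(TorchFunctionMode):
    # Illustrative mode that records every func it intercepts.
    def __init__(self):
        super().__init__()
        self.funcs = []

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        self.funcs.append(func)
        # func runs with this mode popped off the stack, so for
        # Tensor.backward the entire engine call (including checkpoint's
        # recompute) executes without the mode, unless checkpoint re-applies
        # the stack it stashed at forward time.
        return func(*args, **kwargs)


x = torch.randn(4, requires_grad=True)
with RecordingMode():
    out = checkpoint(torch.sin, x, use_reentrant=False)
    # backward() is intercepted here once, as a single "leaf" call; the
    # recompute of torch.sin happens inside the engine it launches.
    out.sum().backward()
```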