Fix fake tensor caching when output has unbacked #153034
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153034. Note: links to docs will display an error until the docs builds have completed.
✅ No failures as of commit 2aca85c with merge base 480ae2d. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Sounds like this error is flaky.
Merge started. Your change will be merged while ignoring the following 2 checks: pull / linux-focal-py3_9-clang9-xla / build, inductor / unit-test / cuda12.6-py3.10-gcc9-sm86 / test (inductor_cpp_wrapper, 1, 2, ephemeral.linux.g5.4xlarge.nvidia.gpu).
@pytorchbot revert -m "Broke pr_time_benchmarks, see https://hud.pytorch.org/hud/pytorch/pytorch/d07fbd41e3589fc9377865a95960b211ec899b90/1?per_page=50&name_filter=pr_time_be&mergeEphemeralLF=true" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 4f425a0. Reverted #153034 on behalf of https://github.com/malfet due to: Broke pr_time_benchmarks, see https://hud.pytorch.org/hud/pytorch/pytorch/d07fbd41e3589fc9377865a95960b211ec899b90/1?per_page=50&name_filter=pr_time_be&mergeEphemeralLF=true
@aorenste your PR has been successfully reverted.
Starting merge as part of PR stack under #152662
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: Command
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
We handle fake tensor caching in two ways:
1. If the inputs have no symbols (SymInt, etc.) then we cache on the FakeTensorMode.
2. If the inputs have symbols then we cache on the ShapeEnv.
This way the symbols in the inputs and outputs are associated with the guards in place at the time of the call.
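As a rough illustration of that two-level scheme, here is a minimal sketch (assuming a plain dict-backed cache; `ShapeEnvSketch`, `FakeTensorModeSketch`, and `pick_cache` are illustrative names, not PyTorch's actual internals):

```python
# Minimal sketch of the two cache homes described above. Not the real
# implementation: the FakeTensorMode/ShapeEnv stand-ins are plain classes.

class ShapeEnvSketch:
    def __init__(self):
        # Entries stored here are implicitly tied to this ShapeEnv's guards.
        self.fake_tensor_cache = {}

class FakeTensorModeSketch:
    def __init__(self, shape_env=None):
        self.shape_env = shape_env
        # Cache for ops whose inputs carry no symbols at all.
        self.cache = {}

    def pick_cache(self, inputs_have_symbols: bool) -> dict:
        if inputs_have_symbols:
            # Symbolic inputs: cache on the ShapeEnv so the cached entry is
            # associated with the guards in place at the time of the call.
            return self.shape_env.fake_tensor_cache
        return self.cache
```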
However, it's possible to have an op where there are no symbols in the inputs but there is an unbacked symbol in the output. In that case we shouldn't cache at all - what would a cached entry even mean?
So this PR changes the caching behavior: if there's a symbol in the output that isn't derived in some way from the inputs, we refuse to cache that op.
Added a test which checks for this case.
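For concreteness, here is a hedged illustration of the condition the new test exercises (a sketch, not the PR's actual test): `torch.nonzero` has a data-dependent output shape, so under a `FakeTensorMode` backed by a `ShapeEnv` its output size is an unbacked `SymInt` even though the input carries no symbols at all. Exact defaults and behavior can vary across PyTorch versions.

```python
# Hedged sketch of the triggering condition; assumes ShapeEnv's default
# settings permit dynamic-output-shape ops under fake tensors.
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.fx.experimental.symbolic_shapes import ShapeEnv

shape_env = ShapeEnv()
with FakeTensorMode(shape_env=shape_env):
    x = torch.randn(8)        # fake tensor; the input has no symbols
    out = torch.nonzero(x)    # data-dependent output shape
    print(out.shape)          # e.g. torch.Size([u0, 1]) - u0 is an unbacked SymInt
    assert isinstance(out.shape[0], torch.SymInt)
    # With this PR, the dispatch cache refuses to cache this call, because
    # u0 does not derive from anything in the inputs.
```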
While in there I also did a couple of other related changes:
1. Added negative caching - if we see that an (op, args) failed to cache previously we don't even bother trying to cache it again (a sketch of this pattern follows below).
2. Reworked the inner behavior of _cached_dispatch_impl a little to make it more clear which bits we expect to be able to throw _BypassDispatchCache, and added some comments.
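Point 1 is a standard negative-caching pattern. Below is a minimal, generic sketch of it (illustrative only; apart from `_BypassDispatchCache`, which is named in the description above, every name here is made up, and the real `_cached_dispatch_impl` is considerably more involved):

```python
# Generic sketch of negative caching around a dispatch-style cache.

class _BypassDispatchCache(Exception):
    """Raised when an (op, args) call can't be represented in the cache."""

_NEGATIVE = object()  # sentinel: "we already know this key can't be cached"

def cached_dispatch(cache: dict, key, compute, validate_output):
    hit = cache.get(key)
    if hit is _NEGATIVE:
        # Negative hit: a previous call already failed to cache this key,
        # so skip straight to recomputing instead of retrying the cache.
        return compute()
    if hit is not None:
        return hit
    output = compute()
    try:
        # May raise _BypassDispatchCache, e.g. when the output contains an
        # unbacked symbol that doesn't derive from the inputs.
        validate_output(output)
        cache[key] = output
    except _BypassDispatchCache:
        cache[key] = _NEGATIVE
    return output
```

The sentinel matters because a bypass is only discovered after the (possibly expensive) work of building and validating a cache entry, so remembering the failure avoids repeating that work on every subsequent call with the same key.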
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames