[precompile] Add BundledAOTAutogradCacheEntry #152840

jamesjwu · 2025-05-05T17:16:57Z

Stack from ghstack (oldest at bottom):

Finally, this PR adds BundledAOTAutogradCacheEntry. A BundledAOTAutogradCacheEntry is an AOTAutogradCacheEntry that saves the entire CompiledFxGraph directly in the entry.

This has some advantages:

- No more dependency on FxGraphCache at all
- Clearing FxGraphCache does not result in an AOTAutogradCache miss
- Simpler logic, as BundledAOTAutogradCacheEntry has everything you need to load a full compiled Python wrapper from a dynamo output
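
To make the second point concrete, here is a minimal toy sketch of the difference between an entry that stores only a key into a separate graph cache and an entry that bundles the compiled graph itself. All names here (`ToyCompiledGraph`, `ToyKeyedEntry`, `ToyBundledEntry`, the dict used as a cache) are hypothetical stand-ins for illustration, not the actual classes or APIs in this PR.

```python
from dataclasses import dataclass
from typing import Optional


# Toy stand-ins only; these are NOT the real PyTorch classes.
@dataclass
class ToyCompiledGraph:
    source: str


@dataclass
class ToyKeyedEntry:
    # Non-bundled style: remembers only the key of a graph stored elsewhere.
    fx_graph_key: str

    def load(self, fx_graph_cache: dict) -> Optional[ToyCompiledGraph]:
        # Returns None (a miss) if the other cache no longer has the graph.
        return fx_graph_cache.get(self.fx_graph_key)


@dataclass
class ToyBundledEntry:
    # Bundled style: stores the compiled graph directly in the entry.
    compiled_graph: ToyCompiledGraph

    def load(self, fx_graph_cache: dict) -> Optional[ToyCompiledGraph]:
        # Never consults the separate graph cache.
        return self.compiled_graph


fx_graph_cache = {"k1": ToyCompiledGraph("kernel code")}
keyed = ToyKeyedEntry("k1")
bundled = ToyBundledEntry(fx_graph_cache["k1"])

fx_graph_cache.clear()  # simulate clearing the separate graph cache

keyed_result = keyed.load(fx_graph_cache)      # miss: the key no longer resolves
bundled_result = bundled.load(fx_graph_cache)  # still loads from the entry itself
```

The same sketch also shows the cost discussed in this description: in the bundled style the graph lives in two places, once in the graph cache and once in the entry.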

We plan to use BundledAOTAutogradCacheEntry for precompile. There's also a question of whether we want to use it for regular caching. The main disadvantage is having to save the same CompiledFxGraph twice, once in the Inductor cache and once in AOTAutogradCache. With MegaCaching, this could be a regression in total cache size (as well as a minor cold start regression, since the same graph has to be saved twice). I will import this and measure the effect on mega cache size, and if it looks good I'll enable it by default for caching as well.

On warm start, if AOTAutogradCache hits, you won't have to load inductor at all, so warm start overhead should be unaffected.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

Differential Revision: D74593304

[ghstack-poisoned]

pytorch-bot · 2025-05-05T17:17:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152840

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 7b181d5 with merge base 9d00f2b:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:

inductor / unit-test / cuda12.6-py3.10-gcc9-sm86 / test (inductor_cpp_wrapper, 1, 2, ephemeral.linux.g5.4xlarge.nvidia.gpu) (gh) (#152916)
[ FAILED ] AotInductorTest.BasicTestCpu
pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.ephemeral.linux.2xlarge) (gh) (#144480)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jamesjwu · 2025-05-09T18:12:03Z

torch/_functorch/_aot_autograd/autograd_cache.py

+        if config.bundled_autograd_cache:
+            # Helper function to unwrap all the wrappers we added during aotdispatch
+            # They get reapplied on cache load
+            def unwrap_compiled_fx_graph(obj):


Normally, the compiled_fx_graph that AOTAutogradCache passes to the entry is actually wrapped a few times by wrappers like FunctionalizedRngWrapper, etc. For the non-bundled case, this didn't matter, because we only cared about the cache key from the compiled fx graph.

But now that we are storing the actual compiled fx graph, we need to save the inner CompiledFxGraph object. The object then gets rewrapped properly on post_compile (the same way as an FxGraphCacheLoadable would be).

I also attempted to refactor jit_compile_runtime_wrappers.py to save to the cache immediately before doing the post-compile work, but there are some variables, like num_symints_saved_for_bw, that get calculated only after certain other wrappers run. Disentangling all of that logic seemed both risky and not really worth it, when it costs basically nothing to just grab the inner wrapped object using this helper function.
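
For readers following along, here is a rough toy sketch of the unwrap-then-rewrap idea. It assumes each wrapper exposes what it wraps through a `__wrapped__` attribute (which `functools.wraps` sets); whether the real wrappers in aotdispatch work this way is an assumption on my part, and `ToyCompiledFxGraph`, `toy_runtime_wrapper`, and `unwrap_toy_graph` are hypothetical names, not the code in this diff.

```python
import functools


class ToyCompiledFxGraph:
    """Hypothetical stand-in for the inner compiled graph object."""

    def __call__(self, x):
        return x * 2


def toy_runtime_wrapper(fn):
    """Hypothetical stand-in for a runtime wrapper added during compilation."""

    @functools.wraps(fn)  # functools.wraps sets wrapper.__wrapped__ = fn
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


def unwrap_toy_graph(obj):
    # Assumption: every wrapper layer exposes its target via __wrapped__.
    while hasattr(obj, "__wrapped__"):
        obj = obj.__wrapped__
    assert isinstance(obj, ToyCompiledFxGraph)
    return obj


inner = ToyCompiledFxGraph()
wrapped = toy_runtime_wrapper(toy_runtime_wrapper(inner))

# What a bundled entry would store: the inner object, not the wrapper stack.
saved = unwrap_toy_graph(wrapped)

# On cache load, the wrappers are reapplied around the saved inner object.
reloaded = toy_runtime_wrapper(toy_runtime_wrapper(saved))
```

The point of the sketch is only that peeling the wrappers off at save time is cheap and leaves the existing wrapper-application order untouched.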

jamesjwu · 2025-05-12T18:02:38Z

@jamesjwu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Update

4fabebb

[ghstack-poisoned]

This was referenced May 5, 2025

[precompile] Refactor AOTAutogradCacheEntry to be generic #152836

Closed

[precompile] [easy] Refactor FxGraphCache to add cache_hit_post_compile function #152839

Closed

pytorch-bot bot added the ciflow/inductor label May 5, 2025

jamesjwu added a commit that referenced this pull request May 5, 2025

Add BundledAOTAutogradCacheEntry

a1cfa81

ghstack-source-id: 05bcce8 Pull Request resolved: #152840

jamesjwu changed the title ~~Add BundledAOTAutogradCacheEntry~~ [precompile] Add BundledAOTAutogradCacheEntry May 5, 2025

jamesjwu added the topic: not user facing label May 5, 2025

jamesjwu mentioned this pull request May 7, 2025

Keep raw cubin file around in case it gets deleted underneath us #153064

Closed

Update

40185d2

[ghstack-poisoned]

jamesjwu added a commit that referenced this pull request May 8, 2025

Add BundledAOTAutogradCacheEntry

35bf783

ghstack-source-id: 0de7afe Pull Request resolved: #152840

Update

5232ebd

[ghstack-poisoned]

jamesjwu added a commit that referenced this pull request May 8, 2025

Add BundledAOTAutogradCacheEntry

0a88a51

ghstack-source-id: 1846951 Pull Request resolved: #152840

jamesjwu added a commit that referenced this pull request May 9, 2025

Add BundledAOTAutogradCacheEntry

1daa2af

ghstack-source-id: 237f9b3 Pull Request resolved: #152840

pytorch-bot bot added the module: dynamo label May 9, 2025

Rebase

7b181d5

[ghstack-poisoned]

jamesjwu added a commit that referenced this pull request May 9, 2025

Add BundledAOTAutogradCacheEntry

ed1644f

ghstack-source-id: a9fe63d Pull Request resolved: #152840

jamesjwu added a commit that referenced this pull request May 9, 2025

Add BundledAOTAutogradCacheEntry

569f9e2

ghstack-source-id: a9fe63d Pull Request resolved: #152840

jamesjwu mentioned this pull request May 9, 2025

[nocommit] bundled autograd cache test #153269

Draft

jamesjwu added the ciflow/trunk, ciflow/mps, and ciflow/pull labels and removed the ciflow/mps label May 9, 2025

jamesjwu requested review from oulgen, zhxchen17 and bdhirsh May 9, 2025 18:08

jamesjwu marked this pull request as ready for review May 9, 2025 18:08

jamesjwu commented May 9, 2025

jamesjwu mentioned this pull request May 12, 2025

Pass inductor config for static cuda launcher to workers #153382

Closed
