[export][reland] Convert autocast to HOO #132677
Conversation
Summary: Reland of D60206382. Suggested in pytorch#128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher-order operator and creating a submodule for the block between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvements:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`.
2) Check the current autocast status (any enabled? dtype?) and do not create a submodule if the autocast args match the current autocast status.

Test Plan: CI

```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_set_grad"
```

Verified that we can now export the llama model in gh issue 128394 and the gemma model in gh issue 131829 without error.

Differential Revision: D60770038
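As a quick illustration (not code from this PR; the module, device type, and shapes are made up), this is roughly the kind of program that exercises the new pass: a module whose `forward` runs under `torch.autocast`, exported with `torch.export`. Depending on the export mode, the autocast region is expected to show up as a higher-order op calling a submodule instead of `_enter_autocast`/`_exit_autocast` call_function nodes.

```python
# Minimal sketch, assuming a recent PyTorch with this change: export a module
# that uses an autocast context manager and inspect the resulting graph.
import torch


class M(torch.nn.Module):
    def forward(self, x, y):
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            return torch.mm(x, y)


ep = torch.export.export(M(), (torch.rand(8, 8), torch.rand(8, 8)))
# With the pass in place, the autocast region should be represented by a
# higher-order op wrapping a submodule rather than by raw
# _enter_autocast/_exit_autocast nodes.
ep.graph_module.print_readable()
```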
This pull request was exported from Phabricator. Differential Revision: D60770038
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #132677
Approved by: https://github.com/angelayi
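On followup item 2 from the summary, here is a minimal sketch (purely illustrative, not part of this PR; the helper name and the per-device handling are assumptions) of how the ambient autocast state could be queried before deciding whether a submodule is actually needed:

```python
import torch


# Hypothetical helper (not from this PR): return True when the requested
# autocast arguments already match the ambient autocast state, in which case
# the pass could skip creating a submodule for the region.
def autocast_args_are_redundant(device_type: str, dtype: torch.dtype, enabled: bool) -> bool:
    if device_type == "cuda":
        return (
            enabled == torch.is_autocast_enabled()
            and dtype == torch.get_autocast_gpu_dtype()
        )
    if device_type == "cpu":
        return (
            enabled == torch.is_autocast_cpu_enabled()
            and dtype == torch.get_autocast_cpu_dtype()
        )
    # Conservatively keep the submodule for any other device type.
    return False
```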
```python
def _replace_with_hop_helper(
    node: torch.fx.Node,
    enter_block_node: torch.fx.Node,
    node_filter: Callable,
```
@yushangdi I'm wondering what the purpose of that `node_filter` is. As far as I can see it is not used at all, making it a dead argument. Was the idea to use it eventually?

If not, should it be removed? It looks like the usages would become significantly simpler if they didn't have to pass in that filter function.
Good catch, yeah I think it can be removed.
PR for removing it: #141983
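For reference, a sketch of how the signature might look once the dead argument is gone (the diff above is truncated, so parameters after `enter_block_node` are deliberately omitted here rather than guessed):

```python
import torch


# Assumed shape of the helper after #141983 drops the unused node_filter;
# the remaining parameters from the truncated diff are omitted here.
def _replace_with_hop_helper(
    node: torch.fx.Node,
    enter_block_node: torch.fx.Node,
    # ... further parameters unchanged ...
) -> None:
    ...
```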