[ONNX] Migrate onnx decomps into PyTorch by justinchuby · Pull Request #146224 · pytorch/pytorch


Closed

Conversation

justinchuby (Collaborator) commented Feb 1, 2025

Migrate the ATen op decomp library for ONNX ("torchlib") from ONNX Script, with the necessary changes to the onnx registry.

The migration

"torchlib" is what we the decomp library from aten ops to onnx in the onnxscript project. (name doesn't matter, can be changed.) It is the improved version of the "symbolic functions" in torch/onnx implemented using onnxscript, a graph builder for onnx. Since PyTorch 2.1, it has been a dependency for torch.onnx via onnxscript and has been powering the torch.onnx exporter.

torchlib was hosted in onnxscript to allow rapid evolution. However, it is time to migrate the logic into PyTorch because:

  1. The logic is developed for and belongs to torch.onnx; it is the equivalent of the onnx "symbolic functions" for FX graphs.
  2. Migrating to PyTorch decouples torch.onnx from logic in onnxscript, which is a good thing.
  3. Maintaining its compatibility across multiple PyTorch versions is becoming harder and harder. After migration we can evolve the logic together with aten operators without having to worry about backward compatibility across PyTorch versions.
  4. We can use newer opsets by default, again without having to worry about BC. The proposal is to upgrade to opset 21 (from opset 18, released two years ago) for torch 2.7. This makes it easier for developers and users to leverage new dtypes and new operators, such as the corrected GroupNormalization (see the usage sketch after this list).
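As a usage illustration of the opset point above (hedged: the dynamo and opset_version arguments reflect the torch.onnx.export API around torch 2.6; the model is a made-up example):

```python
# Sketch: pinning an explicit opset at export time. Omitting opset_version
# picks up the exporter's default, which this PR proposes to move to 21
# for torch 2.7.
import torch


class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.group_norm(x, num_groups=2)


x = torch.randn(1, 4, 8, 8)
onnx_program = torch.onnx.export(M(), (x,), dynamo=True, opset_version=18)
```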

Test and infra impact

The tests leverage OpInfo. They run in an onnx shard only. On a 2-core machine, tests typically complete within 15 minutes.

No new dependencies are introduced. Packaging and test activities should remain the same.

State of the migrated code

The migrated code is lifted from https://github.com/microsoft/onnxscript/tree/main/onnxscript/function_libs/torch_lib. It was reviewed by the same set of reviewers who own the torch.onnx component.

Fixes #139301

Next steps

The follow-up PRs will decouple the implementation from the ONNX Script type system to improve type checking, and bump the onnx opset version used.

pytorch-bot bot commented Feb 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146224

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f625b4d with merge base dd2a943:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: onnx label Feb 1, 2025
@titaiwangms titaiwangms self-assigned this Feb 1, 2025
@justinchuby justinchuby changed the title from "[ONNX] Migrate torchlib into PyTorch" to "[Do not merge ghstack][ONNX] Migrate torchlib into PyTorch" Feb 1, 2025
@justinchuby justinchuby added the module: onnx label and removed the fx label Feb 1, 2025
@justinchuby justinchuby force-pushed the justinchu/ghstack/torchlib branch from fb355b3 to d3904c9 on February 3, 2025 17:37
justinchuby (Collaborator, Author) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Feb 5, 2025
pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorchmergebot (Collaborator) commented:

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

albanD (Collaborator) left a comment:

I'm not sure why skipping the PR sanity check is acceptable here?
We only accept skipping it for codemod and other codegened changes with a provided repro command.

FYI @seemethere @malfet

@@ -1370,7 +1370,7 @@ def _export(
         # We do not want to fail because we should still allow users to create
         # custom symbolic functions for opset>17
         warnings.warn(
-            f"Exporting to ONNX opset version {opset_version} is not supported. "
+            f"Exporting to ONNX opset version {opset_version} is not officially supported. "
Collaborator:

I thought it was resolved.

Collaborator Author:

We still don't have symbolic implementations for aten ops that use Reduce* operators. Setting opset>17 will cause models that use those ops to fail.
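For illustration of why Reduce* is special (a known ONNX change, shown here with the standard onnx.helper API; this is not code from the PR): in opset 18, most Reduce* operators moved axes from a node attribute to a tensor input, so symbolic functions written against opset 17 and earlier no longer produce valid nodes.

```python
# Schematic comparison of ReduceMean node construction across opsets.
from onnx import helper

# opset <= 17: axes is an attribute baked into the node.
node_17 = helper.make_node(
    "ReduceMean", inputs=["data"], outputs=["out"], axes=[1, 2]
)

# opset >= 18: axes is a second (int64 tensor) input, so it can be dynamic.
node_18 = helper.make_node(
    "ReduceMean", inputs=["data", "axes"], outputs=["out"]
)
```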

@@ -241,10 +241,10 @@ def get_matching_overload(
     for overload in overloads:
         assigned_types: dict[str, ir.TypeProtocol] = {}
         fail_reason = ""
-        if not hasattr(overload, "signature"):
+        if overload.signature is None:
Collaborator:

What if overload is an operator? (operator.mul, ...)

Collaborator Author:

overload is of type OnnxDecompMeta. When it stores an operator, overload.signature is None.
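A minimal sketch of the shape being relied on here (field names are illustrative; the actual OnnxDecompMeta in this PR may differ):

```python
# Illustrative only: a registry entry whose `signature` is Optional, and is
# None when the stored callable is a plain Python operator such as
# operator.mul rather than an ONNX decomp function.
import dataclasses
from typing import Any, Callable, Optional


@dataclasses.dataclass
class OnnxDecompMeta:
    onnx_function: Callable[..., Any]
    signature: Optional[Any] = None  # None for e.g. operator.mul


def get_matching_overload(overloads: list[OnnxDecompMeta]):
    for overload in overloads:
        if overload.signature is None:
            # Plain operators carry no ONNX signature to match against.
            continue
        ...  # match assigned types against overload.signature
```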

justinchuby (Collaborator, Author) commented:

@albanD @malfet thanks for raising the concern. This PR brings in the onnx decomp implementations, which exceed the line limits imposed by the sanity check (I think?). This migration is part of the effort to revamp the onnx exporter and is similar to #132530

@titaiwangms titaiwangms removed their assignment Feb 5, 2025
@mikaylagawarecki mikaylagawarecki added the triaged label Feb 7, 2025
@justinchuby justinchuby changed the title from "[ONNX] Migrate torchlib into PyTorch" to "[ONNX] Migrate onnx decomps into PyTorch" Feb 13, 2025
malfet (Contributor) commented Feb 13, 2025

> @albanD @malfet thanks for raising the concern. This PR brings in the onnx decomp implementations, which exceed the line limits imposed by the sanity check (I think?). This migration is part of the effort to revamp the onnx exporter and is similar to #132530

@justinchuby, a few thoughts:

  • The PR claims to migrate code from somewhere, but there are no links, either in the PR or in the issue, that specify the repo/commit this migration is coming from. If there were, a reviewer could run a diff tool to make sure the migration does not introduce any changes to the code being migrated.
  • Neither the PR nor the issue gives any explanation of what problem the migration is trying to address, i.e. what is wrong with the status quo and why migrating this code into PyTorch will make it any better.
  • Considering that the PR moves a significant number of tests into PyTorch, it would be nice to highlight the expected increase in test time, and whether those tests will run in the default shards or only in the ONNX ones.
  • Also, please note that when [ONNX] Migrate torchlib from onnxscript #139301 was created back in October, nobody outside of the ONNX team was aware of it, so we only have an opportunity to give feedback after you've spent some effort authoring the PR.
  • 20K lines of non-generated code is a lot; [ONNX] New export logic leveraging ExportedProgram and ONNX IR #132530 was already over the limit, but it was 4 times smaller than this one.

justinchuby (Collaborator, Author) commented Feb 14, 2025

Thanks, @malfet, @albanD. I updated the PR description to address your concerns above and from Slack. In short, the migrated logic has been part of what a user runs when they call the torch.onnx dynamo export since PyTorch 2.1, and it should belong to torch. There should be minimal impact on test runtime and no impact on other activities. Please let me know if I can provide more information!

@justinchuby justinchuby requested a review from albanD February 14, 2025 04:10
albanD (Collaborator) commented Feb 14, 2025

Thanks for sharing the details!
Some thoughts on that:

  • The current PR doesn't seem to aim to actually remove the dependency on onnxscript, so the benefit is limited there?
  • Also, this seems very intertwined with the old integration, so I guess there is no realistic plan to remove that either?
  • Doing quick dev out of core sounds good, but I would expect a thorough review of the code when we are importing it, to ensure it uses the proper tools, testing framework, systems, etc. that are in core. I don't think it is possible to do such a review on a 20k-LOC PR (the GitHub UI is almost crashing just opening the files tab). On top of this, as you said, there are changes to adapt the code to be in core, which I guess are not reviewed?
  • I would expect that adding such a feature can be done by first adding the scaffolding piece by piece, basic ops showing that it works, tests for the scaffolding, then a set of PRs for the decomps + their tests.
  • btw, I 100% understand the need to reach the end state where we remove the current version and migrate to this new set of rules. The only concern here is with the process and making sure we have proper visibility into how all of this is happening.

justinchuby (Collaborator, Author) commented Feb 14, 2025

Thanks! To clarify -

  • The current PR doesn't seem to aim to actually remove the dependency on onnxscript, so the benefit is limited there?

onnxscript core is still a dependency for building the onnx graph, but since the decomp logic is no longer in onnxscript under this proposal, we can freely evolve it with PyTorch without BC constraints.
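(As a hedged illustration of that remaining dependency: the script decorator and to_model_proto below are onnxscript's public graph-building API as of around the time of this PR; details may change.)

```python
# Sketch of onnxscript as a graph builder, the part torch.onnx keeps
# depending on after this migration.
import onnxscript
from onnxscript import opset18 as op


@onnxscript.script()
def scale_and_shift(x, weight, bias):
    # Traced into an ONNX function/graph rather than executed eagerly.
    return op.Add(op.Mul(x, weight), bias)


model_proto = scale_and_shift.to_model_proto()
```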

  • Also, this seems very intertwined with the old integration, so I guess there is no realistic plan to remove that either?

If by old logic you mean the decomp component: references to the decomp component are removed completely. That's the biggest coupling we want to get rid of. Are there other parts of the integration you were referring to?

  • Doing quick dev out of core sounds good, but I would expect a thorough review of the code when we are importing it, to ensure it uses the proper tools, testing framework, systems, etc. that are in core. I don't think it is possible to do such a review on a 20k-LOC PR (the GitHub UI is almost crashing just opening the files tab). On top of this, as you said, there are changes to adapt the code to be in core, which I guess are not reviewed?

Sounds good. The changes are mostly simplifications of the current code and are reviewed by the torch.onnx team. For better visibility and reviewability I will do what you suggested: add the scaffolding first, then add the decomps in a stack.

  • I would expect that adding such a feature can be done by first adding the scaffolding piece by piece, basic ops showing that it works, tests for the scaffolding, then a set of PRs for the decomps + their tests.
  • btw, I 100% understand the need to reach the end state where we remove the current version and migrate to this new set of rules. The only concern here is with the process and making sure we have proper visibility into how all of this is happening.

Thanks for the suggestions and feedback! I will create a stack that does the above and cc you and @malfet when ready.

@justinchuby justinchuby marked this pull request as draft February 15, 2025 17:33
justinchuby (Collaborator, Author) commented:

Closed in favor of #147469

Labels: ciflow/trunk, merging, module: onnx, open source, release notes: onnx, topic: new features, triaged

Linked issue: [ONNX] Migrate torchlib from onnxscript (#139301)