AOTI regression on SAM and tts-angular #152606


Open

zou3519 opened this issue May 1, 2025 · 4 comments
Assignees
Labels
high priority module: aotinductor aot inductor oncall: export oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@zou3519 (Contributor) commented May 1, 2025

This is in aot_inductor_torchbench; see https://hud.pytorch.org/pytorch/pytorch/commit/701c0848b8695daa802c2d7ff2f9177faa6e1fe8#41477577732-box for the failing logs.

It looks like these were both previously "pass" but now "fail_to_run", so at least there isn't silent incorrectness.

I'm going to flip the statuses on these so that the inductor-periodic CI becomes green, but we should either look into this or determine that we don't care about them.

cc @ezyang @gchanan @kadeng @msaroufim @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @chenyang78 @yushangdi @benjaminglass1

@desertfire (Contributor) commented:

For sam, the regression happened between April 2 and April 3 per the OSS dashboard: https://hud.pytorch.org/benchmark/torchbench/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2002%20Apr%202025%2015%3A23%3A08%20GMT&stopTime=Thu%2C%2003%20Apr%202025%2015%3A23%3A08%20GMT&granularity=day&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=c067127d47fcf0254f38d95e9990f51092fb4fab&rBranch=main&rCommit=0da8127f77f9bf05ba204ea7659cb15ec85e88a7&model=sam

tts-angular's regression happened somewhere between March 16 and March 23; some data are missing on the dashboard between those dates.

Go ahead and flip the statuses for now while I investigate.

@desertfire (Contributor) commented:

The sam failure bisected to #149235. @pianpwk, can you take a look? Repro command:

python benchmarks/dynamo/torchbench.py --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --only sam

pytorchmergebot pushed a commit that referenced this issue May 1, 2025
Some were reporting "pass" consistently on https://hud.pytorch.org/. Those are fine to flip.

I filed a separate issue for the now-regressions for AOTI: #152606. These should be looked at.

Pull Request resolved: #152605
Approved by: https://github.com/eellison, https://github.com/huydhn
@pianpwk (Contributor) commented May 6, 2025

Sorry for the delay. I took a look with @yushangdi, and it looks like #149235 added dtype assertions that exposed some weirdness in AOTI's dtype promotion behavior; both models fail because the resulting dtype is bf16 where fp32 was expected.

The offending op is a cat between an fp32 and a bf16 tensor, which in eager mode and normal export results in an fp32 tensor, but in AOTI lowering (while producing a graph in aot_export_module) returns a bf16 tensor. Weirdly enough, the functionalization metadata analysis pass in AOTI, and both passes when called from ep.run_decompositions() instead, manage to produce the correct fp32 tensor.

cc @tugsbayasgalan @bdhirsh would you know of anything in functionalization/AOTI that could change dtype promotion behavior?

For now, we could land #152915 to remove the metadata assertions as a short-term fix, but the underlying issue would still be there.

@pianpwk pianpwk self-assigned this May 6, 2025
@zou3519 zou3519 added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module high priority and removed triage review labels May 6, 2025
@pianpwk (Contributor) commented May 7, 2025

The SAM issue turned out to be inductor's cat decomp skipping dtype promotion: #152995

pytorchmergebot pushed a commit that referenced this issue May 9, 2025
Cloning a single tensor wasn't following dtype promotion rules for the SAM model: #152606

Pull Request resolved: #152995
Approved by: https://github.com/yushangdi, https://github.com/eellison