AOTI regression on SAM and tts-angular #152606


Open

zou3519 opened this issue May 1, 2025 · 4 comments
Assignees
Labels
high priority module: aotinductor aot inductor oncall: export oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@zou3519 (Contributor) commented May 1, 2025

This is in aot_inductor_torchbench; see https://hud.pytorch.org/pytorch/pytorch/commit/701c0848b8695daa802c2d7ff2f9177faa6e1fe8#41477577732-box for the failing logs.

It looks like these were both previously "pass" but now "fail_to_run", so at least there isn't silent incorrectness.

I'm going to flip the statuses on these so that the inductor-periodic CI becomes green, but we should either look into this or determine that we don't care about them.

cc @ezyang @gchanan @kadeng @msaroufim @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @chenyang78 @yushangdi @benjaminglass1

@desertfire (Contributor) commented:

For sam, the regression happened between April 2 and April 3 per the OSS dashboard: https://hud.pytorch.org/benchmark/torchbench/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2002%20Apr%202025%2015%3A23%3A08%20GMT&stopTime=Thu%2C%2003%20Apr%202025%2015%3A23%3A08%20GMT&granularity=day&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=c067127d47fcf0254f38d95e9990f51092fb4fab&rBranch=main&rCommit=0da8127f77f9bf05ba204ea7659cb15ec85e88a7&model=sam

tts-angular's regression happened somewhere between March 16 and March 23; some data are missing on the dashboard between those dates.

Go ahead and flip the statuses for now while I investigate.

@desertfire (Contributor) commented:

The sam failure bisected to #149235. @pianpwk, can you take a look? Repro command:

python benchmarks/dynamo/torchbench.py --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda --only sam

pytorchmergebot pushed a commit that referenced this issue May 1, 2025
Some were reporting "pass" consistently on https://hud.pytorch.org/. Those are fine to flip.

I filed a separate issue for the now-regressions for AOTI: #152606. These should be looked at.

Pull Request resolved: #152605
Approved by: https://github.com/eellison, https://github.com/huydhn
@pianpwk (Contributor) commented May 6, 2025

Sorry for the delay. I took a look with @yushangdi, and it looks like #149235 added dtype assertions that exposed some weirdness in AOTI's dtype promotion behavior; both models fail because the resulting dtype is bf16 where fp32 was expected.

The offending op is a cat between an fp32 and a bf16 tensor, which in eager mode and normal export results in an fp32 tensor, but in AOTI lowering (while producing a graph in aot_export_module) returns a bf16 tensor. Weirdly enough, the functionalization metadata analysis pass in AOTI, and both passes when called from ep.run_decompositions() instead, manage to produce the correct fp32 tensor.

cc @tugsbayasgalan @bdhirsh would you know of anything in functionalization/AOTI that could change dtype promotion behavior?

For now, we could land #152915 to remove the metadata assertions as a short-term fix, but the underlying issue would still be there.

@pianpwk pianpwk self-assigned this May 6, 2025
@zou3519 zou3519 added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module high priority and removed triage review labels May 6, 2025
@pianpwk (Contributor) commented May 7, 2025

The SAM issue turned out to be inductor's cat decomp skipping dtype promotion: #152995

pytorchmergebot pushed a commit that referenced this issue May 9, 2025
Cloning a single tensor wasn't following dtype promotion rules for the SAM model: #152606

Pull Request resolved: #152995
Approved by: https://github.com/yushangdi, https://github.com/eellison