AOTI regression on SAM and tts-angular #152606
Some were reporting "pass" consistently on https://hud.pytorch.org/. Those are fine to flip. I filed a separate issue for the now-regressions for AOTI: #152606. These should be looked at. Pull Request resolved: #152605
For sam, the regression appears to have happened between April 2 and April 3 according to the OSS dashboard: https://hud.pytorch.org/benchmark/torchbench/inductor_aot_inductor?dashboard=torchinductor&startTime=Wed%2C%2002%20Apr%202025%2015%3A23%3A08%20GMT&stopTime=Thu%2C%2003%20Apr%202025%2015%3A23%3A08%20GMT&granularity=day&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=main&lCommit=c067127d47fcf0254f38d95e9990f51092fb4fab&rBranch=main&rCommit=0da8127f77f9bf05ba204ea7659cb15ec85e88a7&model=sam. tts-angular's regression happened somewhere between March 16 and March 23; some data are missing on the dashboard between those dates. Go ahead and flip the statuses for now while I investigate.
Sorry for the delay. I took a look with @yushangdi, and it looks like #149235 added dtype assertions that exposed some weirdness in AOTI's dtype promotion behavior, and both models fail on them. The offending op is a cat between an fp32 and a bf16 tensor, which in eager and normal export results in an fp32 tensor, but in AOTI lowering (while producing a graph in aot_export_module) a bf16 tensor is returned. Weirdly enough, the functionalization metadata analysis pass in AOTI, and both passes when called from [sentence truncated in source]. cc @tugsbayasgalan @bdhirsh, would you know of anything in functionalization/AOTI that could change dtype promotion behavior? For now we could land #152915 to remove the metadata assertions as a short-term fix, but the underlying issue is still there.
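For context, eager-mode cat promotes mixed floating-point inputs to the wider dtype, so fp32 + bf16 should yield fp32. Below is a minimal pure-Python sketch of that promotion rule, not PyTorch's actual implementation; the dtype names, `_RANK` lattice, and function names are illustrative assumptions.

```python
# Illustrative sketch of eager-style dtype promotion for a cat-like op.
# The lattice below is a simplification, not PyTorch's real promotion table.
_RANK = {"bf16": 1, "fp32": 2, "fp64": 3}

def promote(a: str, b: str) -> str:
    """Return the wider of two floating-point dtypes (hypothetical helper)."""
    return a if _RANK[a] >= _RANK[b] else b

def cat_result_dtype(dtypes: list[str]) -> str:
    """Eager-style cat: the output dtype is promoted across all inputs."""
    result = dtypes[0]
    for d in dtypes[1:]:
        result = promote(result, d)
    return result

print(cat_result_dtype(["fp32", "bf16"]))  # fp32, matching eager and normal export
```

The regression described above corresponds to the lowered graph returning bf16 instead of the promoted fp32 for this case.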
SAM issue turned out to be inductor's cat decomp skipping dtype promotion: #152995
Cloning a single tensor wasn't following dtype promotion rules for the SAM model: #152606. Pull Request resolved: #152995. Approved by: https://github.com/yushangdi, https://github.com/eellison
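The class of bug fixed here, a decomposition that forwards an input's dtype instead of the promoted one, can be illustrated abstractly. This is hypothetical code, not the actual inductor decomp; the `Tensor` class, dtype strings, and function names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    dtype: str  # e.g. "fp32", "bf16" (illustrative stand-in for a real tensor)

# Simplified promotion lattice, not PyTorch's real table.
_RANK = {"bf16": 1, "fp32": 2, "fp64": 3}

def promote(a: str, b: str) -> str:
    return a if _RANK[a] >= _RANK[b] else b

def buggy_cat_decomp(tensors: list[Tensor]) -> Tensor:
    # Bug pattern: take the first input's dtype, skipping promotion.
    return Tensor(dtype=tensors[0].dtype)

def fixed_cat_decomp(tensors: list[Tensor]) -> Tensor:
    # Fix pattern: compute the promoted dtype across all inputs first.
    out = tensors[0].dtype
    for t in tensors[1:]:
        out = promote(out, t.dtype)
    return Tensor(dtype=out)

inputs = [Tensor("bf16"), Tensor("fp32")]
print(buggy_cat_decomp(inputs).dtype)  # bf16, the regression symptom
print(fixed_cat_decomp(inputs).dtype)  # fp32, matching eager behavior
```

The dtype assertions added in #149235 catch exactly this kind of divergence, since the traced metadata says fp32 while the lowered result is bf16.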
In aot_inductor_torchbench. See https://hud.pytorch.org/pytorch/pytorch/commit/701c0848b8695daa802c2d7ff2f9177faa6e1fe8#41477577732-box for failing logs.
It looks like these were both previously "pass" but now "fail_to_run", so at least there isn't silent incorrectness.
I'm going to flip the statuses on these so that the inductor-periodic CI becomes green, but we should either look into this or determine that we don't care about them.
cc @ezyang @gchanan @kadeng @msaroufim @chauhang @penguinwu @avikchaudhuri @gmagogsfm @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @chenyang78 @yushangdi @benjaminglass1