Commit aeb3c94
Update base for Update on "[Inductor] Add decomposeK as an autotuning choice for mm"
Building on #149761, which added subgraphs as an Inductor choice, and #150812, which enabled FP32 output from FP16/BF16 PyTorch GEMMs, this PR enables decompose_k as an autotuning choice for Inductor, helping it generate the fastest matmuls with Triton.
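For context, decompose-K splits the reduction (K) dimension of a GEMM into chunks, computes the partial products as a batched matmul (bmm), and sums the partials. A minimal eager-mode sketch of the idea (the helper name `decompose_k_matmul` and the `k_splits` parameter are illustrative, not the Inductor API):

```python
import torch

def decompose_k_matmul(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor:
    # Hypothetical sketch of the decompose-K idea, not Inductor's implementation.
    # a: (M, K), b: (K, N); assumes K is divisible by k_splits.
    M, K = a.shape
    K2, N = b.shape
    assert K == K2 and K % k_splits == 0

    # Split K into k_splits chunks and expose the split as a batch dimension.
    a_split = a.view(M, k_splits, K // k_splits).permute(1, 0, 2)  # (k_splits, M, K/k_splits)
    b_split = b.view(k_splits, K // k_splits, N)                   # (k_splits, K/k_splits, N)

    # One batched matmul produces all partial products, then reduce over the batch.
    partial = torch.bmm(a_split, b_split)                          # (k_splits, M, N)
    return partial.sum(dim=0)                                      # (M, N)
```

The bmm exposes more parallelism for skinny Split-K shapes (small M and N, large K) than a single mm, which is why autotuning over this decomposition can beat the best plain Triton mm.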
Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable.
* Enable autotuning the bmm with Triton templates as well, without incurring significantly more compile time, via async compilation. Anecdotal evidence shows that Triton bmm usually outperforms aten bmm.
* Add decompose_k support for addmm.
Below are the results of running TritonBench on Split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. We see a >10% average speedup over aten, and for some shapes over 3x the performance of the previous best Triton mm:
<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov
Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
[ghstack-poisoned]

1 parent 8da3fcf, commit aeb3c94