Commit aeb3c94
Update base for Update on "[Inductor] Add decomposeK as an autotuning choice for mm"
Building on #149761, which added subgraphs as an Inductor choice, and #150812, which enabled FP32 output from FP16/BF16 PyTorch GEMMs, this PR enables decompose_k as an autotuning choice for Inductor, helping it generate the fastest matmuls with Triton.
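For context, decompose-K splits the reduction (K) dimension of a GEMM into chunks, computes the partial products as a batched matmul (bmm), and sums the partials. A minimal eager-mode sketch of the idea (the helper name `decompose_k_matmul` and the `k_splits` parameter are illustrative, not the Inductor API):

```python
import torch

def decompose_k_matmul(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor:
    # Hypothetical sketch of the decompose-K idea, not Inductor's implementation.
    # a: (M, K), b: (K, N); assumes K is divisible by k_splits.
    M, K = a.shape
    K2, N = b.shape
    assert K == K2 and K % k_splits == 0

    # Split K into k_splits chunks and expose the split as a batch dimension.
    a_split = a.view(M, k_splits, K // k_splits).permute(1, 0, 2)  # (k_splits, M, K/k_splits)
    b_split = b.view(k_splits, K // k_splits, N)                   # (k_splits, K/k_splits, N)

    # One batched matmul produces all partial products, then reduce over the batch.
    partial = torch.bmm(a_split, b_split)                          # (k_splits, M, N)
    return partial.sum(dim=0)                                      # (M, N)
```

The bmm exposes more parallelism for skinny Split-K shapes (small M and N, large K) than a single mm, which is why autotuning over this decomposition can beat the best plain Triton mm.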
Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable.
* Enable autotuning the bmm with Triton templates as well, without incurring significantly more compile time, via async compilation. Anecdotal evidence shows that Triton bmm usually outperforms aten bmm.
* Add decompose_k support for addmm.
Below are the results of running TritonBench on Split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. We see a >10% average speedup over aten, and for some shapes over 3x the performance of the previous best Triton mm:
<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov
Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
[ghstack-poisoned]

1 parent 8da3fcf, commit aeb3c94