[inductor] Generate synthetic offsets appropriately for autotuning _scaled_grouped_mm · pytorch/pytorch@e820b05 · GitHub

Commit e820b05

bertmaher authored and pytorchmergebot committed
[inductor] Generate synthetic offsets appropriately for autotuning _scaled_grouped_mm (#152968)
Summary: The autotuner is using zero-filled tensors to autotune _scaled_grouped_mm and that's not appropriate for the offsets tensor, since it essentially corresponds to "no input" and thus yields invalid perf results. We can't really use the actual input tensors, since we might be compiling this op in the context of an entire graph. So instead, I decided to create a synthetic offsets tensor assuming that each group is (roughly) the same size. I don't have data but I'd guess this approach is OK for MoE since we're generally hoping to load-balance the experts; I'm not sure how well it applies to other scenarios that might be more heavy-tailed. Test Plan: ``` pytest test_matmul_cuda.py -k test_scaled_grouped_gemm_ ``` Pull Request resolved: #152968 Approved by: https://github.com/ngimel
1 parent 590965f commit e820b05

File tree

1 file changed: +16 -1 lines changed

torch/_inductor/kernel/mm_scaled_grouped.py

Lines changed: 16 additions & 1 deletion
```diff
@@ -380,6 +380,16 @@ def can_use_triton_kernel(
     )
 
 
+def create_offsets(x, m1_size, m2_size, offs_size):
+    assert len(m1_size) == 2 and len(m2_size) == 3, (
+        "Autotuning _scaled_grouped_mm is only implemented for 2d-3d tensors"
+    )
+    m = V.graph.sizevars.size_hint(m1_size[0])
+    noffs = V.graph.sizevars.size_hint(offs_size[0])
+    step = m / noffs
+    return torch.linspace(step, m, noffs, dtype=x.get_dtype(), device=x.get_device())
+
+
 @register_lowering(aten._scaled_grouped_mm.default, type_promotion_kind=None)
 def tuned_scaled_grouped_mm(
     mat_a: TensorBox,
@@ -461,4 +471,9 @@ def tuned_scaled_grouped_mm(
         **config.kwargs,
     )
 
-    return autotune_select_algorithm("scaled_grouped_mm", choices, input_nodes, layout)
+    input_gen_fns = {
+        4: lambda x: create_offsets(x, m1_size, m2_size, offs.get_size()),
+    }
+    return autotune_select_algorithm(
+        "scaled_grouped_mm", choices, input_nodes, layout, input_gen_fns=input_gen_fns
+    )
```
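For context, `input_gen_fns` maps the positional index of an input node (here 4, the offsets argument in `input_nodes`) to a callable that produces the benchmark tensor for that input, overriding the autotuner's default zero-filled buffers. A rough sketch of that contract under simplified assumptions (the function name, the `.shape` accessor, and the zero-fill fallback shown here are illustrative, not inductor's actual internals):

```python
import torch

def materialize_autotune_inputs(input_nodes, input_gen_fns):
    # Illustrative only: build one benchmark tensor per input node.
    # Inputs with a registered generator (e.g. index 4 -> create_offsets)
    # get realistic synthetic data; the rest fall back to zeros, which is
    # fine for dense operands but misleading for an offsets tensor.
    tensors = []
    for i, node in enumerate(input_nodes):
        if i in input_gen_fns:
            tensors.append(input_gen_fns[i](node))
        else:
            tensors.append(torch.zeros(node.shape))  # hypothetical .shape accessor
    return tensors
```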
