basic compile support for grouped_mm by bdhirsh · Pull Request #153384 · pytorch/pytorch · GitHub

basic compile support for grouped_mm #153384


Open: wants to merge 3 commits into base branch gh/bdhirsh/661/base

Conversation

@bdhirsh (Contributor) commented May 12, 2025

grouped_mm is used in torchtitan; this PR adds just enough support in compile to allow inductor to lower it as a fallback kernel. At some point in the future it may be valuable to have inductor support templating grouped_mm, but this PR only provides basic support. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ngimel @eellison

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
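For context, here is a minimal sketch of what a grouped mm computes in eager mode. The `torch._grouped_mm` call signature and the 2D-activation / 3D-weight / offsets layout below are assumptions based on the MoE-style usage in torchtitan, not details taken from this PR, and the eager kernel is gated on newer hardware (the test added here is skipped below sm90):

```python
import torch

# Grouped matmul: rows of `x` are split into groups by the cumulative offsets
# in `offs`, and group g is multiplied by its own weight matrix w[g].
# Shapes and dtypes here are illustrative assumptions.
x = torch.randn(256, 64, device="cuda", dtype=torch.bfloat16)     # (total_tokens, k)
w = torch.randn(4, 64, 128, device="cuda", dtype=torch.bfloat16)  # (num_groups, k, n)
offs = torch.tensor([64, 128, 192, 256], device="cuda", dtype=torch.int32)

out = torch._grouped_mm(x, w, offs=offs)                           # (total_tokens, n)

# Reference computed with a plain Python loop over groups:
starts = [0] + offs.tolist()[:-1]
ref = torch.cat([x[s:e] @ w[g] for g, (s, e) in enumerate(zip(starts, offs.tolist()))])
torch.testing.assert_close(out, ref, rtol=2e-2, atol=2e-2)
```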

pytorch-bot (bot) commented May 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153384

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 4e78bbc with merge base daca611:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bdhirsh added a commit that referenced this pull request May 12, 2025
ghstack-source-id: bfb03e9
Pull Request resolved: #153384
@bdhirsh bdhirsh requested review from eellison and ngimel May 12, 2025 17:35
bdhirsh added a commit to pytorch/torchtitan that referenced this pull request May 12, 2025
This PR + pytorch/pytorch#153384 is enough to get torchtitan running for me with llama4 and compile
```
CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.compile
```
@bdhirsh bdhirsh added the release notes: python_frontend python frontend release notes category label May 12, 2025
bdhirsh added a commit that referenced this pull request May 12, 2025
ghstack-source-id: fe4107e
Pull Request resolved: #153384
bdhirsh added a commit that referenced this pull request May 12, 2025
ghstack-source-id: 3c696c5
Pull Request resolved: #153384
The review thread below is anchored on this diff context (decorators on the newly added test):

```
@@ -13741,6 +13742,50 @@ def forward(
    )
    torch._inductor.aot_compile(traced, inputs)

@skipCUDAIf(not SM90OrLater, "Requires sm90")
@requires_gpu()
@config.patch(implicit_fallbacks=True)
```
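Roughly, a test guarded by these decorators checks that the compiled op matches eager, with implicit fallbacks enabled so inductor is allowed to emit `_grouped_mm` as a fallback kernel. The sketch below is an illustration under the same assumed `torch._grouped_mm` signature as above, not the PR's actual test body:

```python
import torch
from torch._inductor import config as inductor_config

def fn(x, w, offs):
    return torch._grouped_mm(x, w, offs=offs)

x = torch.randn(256, 64, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4, 64, 128, device="cuda", dtype=torch.bfloat16)
offs = torch.tensor([64, 128, 192, 256], device="cuda", dtype=torch.int32)

eager = fn(x, w, offs)
# implicit_fallbacks lets inductor emit ops without a registered lowering as
# fallback (extern) kernels instead of erroring.
with inductor_config.patch(implicit_fallbacks=True):
    compiled_out = torch.compile(fn)(x, w, offs)
torch.testing.assert_close(eager, compiled_out)
```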
@bdhirsh (Contributor, Author) commented May 12, 2025:

I removed the lowering I added in inductor - I realized that implicit_fallbacks actually defaults to True, but was set to False in this test suite (thanks @ngimel for asking why I needed the lowering).

Contributor commented:

Do we have an explicit striding requirement? If so, we should add it - potentially as require_exact_strides, requires_contiguous, etc.

@bdhirsh (Contributor, Author) commented:

Good point. Maybe @ngimel would know? (Does grouped_mm only support contiguous inputs?)

Collaborator commented:

For grouped_mm we don't, other than that one of the dimensions should be contiguous, similar to regular mm.
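To illustrate with regular mm (a small sketch, not code from this thread): an operand can be row-major or a transposed, column-major view, as long as one of its two dimensions has stride 1, so no copy is required. Inputs where neither dimension is contiguous still work in eager but may be copied to a dense layout internally.

```python
import torch

a = torch.randn(128, 64)      # row-major: last dim has stride 1
b = torch.randn(32, 64).t()   # transposed view (64, 32): first dim has stride 1
out = a @ b                   # mm accepts either layout without forcing a copy

c = torch.randn(32, 64)[:, ::2]  # (32, 32) view where neither dim has stride 1
out2 = a[:, :32] @ c             # still valid, but may materialize a dense copy internally
```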

Contributor commented:

@bdhirsh can you add `make_fallback(aten._grouped_mm, require_dense)`?
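As a sketch, the suggested registration would live next to the other fallbacks in torch/_inductor/lowering.py, where `make_fallback`, `require_dense`, and `aten` are already in scope (this illustrates the suggestion; the exact constraint that landed may differ):

```python
# Register aten._grouped_mm as an extern/fallback kernel rather than giving it
# an inductor lowering; require_dense asks inductor to hand the kernel inputs
# with a dense layout (contiguous memory, possibly with permuted dimensions).
make_fallback(aten._grouped_mm, require_dense)
```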

Collaborator commented:

@eellison what would be the correct way of handling this for custom ops provided by third-party libraries?

Contributor commented:

For custom ops provided by third-party libraries, we are trying to match the eager strides by default. @zou3519 is going to turn that on (i forget if he did or not yet).

Otherwise we have these tags:

```yaml
- tag: needs_exact_strides
  desc: |
    This tag indicates that the operator should be passed Tensors following
    the same strides as observed in eager when compiled in inductor.
    Only one of {needs_exact_strides, needs_contiguous_strides, needs_fixed_stride_order, flexible_layout}
    can apply; if multiple are assigned then we assume the most restrictive one.
- tag: needs_contiguous_strides
  desc: |
    This tag indicates that the operator should be passed contiguous Tensors.
    Failure to do so will result in undefined behavior.
- tag: needs_fixed_stride_order
  desc: |
    This tag indicates that the operator should be passed Tensors following
    the same stride permutation as observed in eager when compiled in inductor.
    Only one of {needs_exact_strides, needs_contiguous_strides, needs_fixed_stride_order, flexible_layout}
    can apply; if multiple are assigned then we assume the most restrictive one.
- tag: flexible_layout
  desc: |
    This tag indicates that the custom operator can accept inputs with varying
    strides/storage_offset and that when compiled, Inductor is allowed to change
    the strides/storage_offset of inputs to the custom operator.
    Only one of {needs_exact_strides, needs_contiguous_strides, needs_fixed_stride_order, flexible_layout}
    can apply; if multiple are assigned then we assume the most restrictive one.
```
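For a third-party library, a tag can be attached when the custom op's schema is defined. A minimal sketch with torch.library follows; the op name, schema, and implementation are made up for illustration, and it assumes a recent PyTorch where `torch.library.register_fake` and op tags are available:

```python
import torch

# Hypothetical third-party op, tagged so inductor preserves the eager stride
# order of its inputs when the op is hit under torch.compile.
torch.library.define(
    "mylib::fused_thing",
    "(Tensor x) -> Tensor",
    tags=(torch.Tag.needs_fixed_stride_order,),
)

@torch.library.impl("mylib::fused_thing", "cpu")
def fused_thing_cpu(x):
    # Stand-in implementation; a real library would dispatch to its own kernel.
    return x * 2

@torch.library.register_fake("mylib::fused_thing")
def fused_thing_fake(x):
    # Shape/dtype propagation for tracing under torch.compile.
    return torch.empty_like(x)

out = torch.compile(lambda t: torch.ops.mylib.fused_thing(t))(torch.randn(8, 8))
```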

Collaborator commented:

Ah interesting, so there's no "require_dense" equivalent?

Contributor commented:

There's not; we could add one if someone wanted.

@albanD albanD removed their request for review May 12, 2025 18:58