[Inductor-CPU] Disable auto-tuning for templated int8 WoQ GEMM for small M to fix perf regression #148502


Conversation

sanchitintel
Collaborator
@sanchitintel sanchitintel commented Mar 4, 2025

Summary

As described in #148494, this PR fixes a regression (relative to the default Inductor-CPU behavior of not using max-autotune) for the templated int8 WoQ GEMM with BF16 activation and a small M dimension: auto-tuning is disabled for small M so that the ATen _weight_int8pack_mm kernel is used.
The case that matters in practice is next-token generation in LLMs, where M is small.

Turning off auto-tuning for small M is a workaround. Ideally, we should improve the auto-tuning infrastructure so that the templated AVX512 GEMM for int8 WoQ is not chosen when _weight_int8pack_mm would be faster end-to-end (E2E).

Details

During auto-tuning, the AVX512 GEMM micro-kernel is chosen for small M: it is faster in the auto-tuning benchmark, yet performs worse E2E. This is expected, since the benchmark calls the kernel repeatedly on the same inputs in a loop, letting it exploit cache locality for those inputs. The ATen counterpart _weight_int8pack_mm shows the opposite pattern: it performs worse during auto-tuning benchmarking but better E2E. It, too, would have benefited from cache locality for inputs had it been benchmarked for a longer period, but even then the templated GEMM's benchmark latency would still have been lower. The table below compares the latencies; a small timing sketch after the table illustrates the cache effect.

| M | N | K | Templated GEMM latency (auto-tuning benchmarking) | Templated GEMM latency (E2E) | _weight_int8pack_mm latency (auto-tuning benchmarking) | _weight_int8pack_mm latency (E2E) | E2E latency ratio: templated GEMM / _weight_int8pack_mm |
|---|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 31.2 us | 91.1 us | 108.7 us | 76.07 us | 1.19 |
| 1 | 1024 | 4096 | 16.1 us | 33.36 us | 52.9 us | 24.275 us | 1.37 |
| 1 | 14336 | 4096 | 112.8 us | 274.16 us | 335.3 us | 233.197 us | 1.17 |
| 1 | 4096 | 14336 | 128.1 us | 280.76 us | 330 us | 237.797 us | 1.18 |
| 1 | 4096 | 128256 | 1.642 ms | 2.16 ms | 2.118 ms | 2.034 ms | 1.06 |
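
To make the cache-locality argument concrete, here is a minimal, standalone timing sketch. It is illustrative only: it uses a plain BF16 torch.mm rather than the WoQ kernel, and the shapes and iteration counts are arbitrary.

```python
# Illustrative only (not Inductor's benchmarking code): timing a GEMM
# repeatedly on one cache-hot weight (as auto-tuning does) vs. rotating
# through several weights, which is closer to an E2E run over many layers.
import time
import torch

M, N, K = 1, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16)

def bench_us(weights, iters=200):
    torch.mm(x, weights[0].t())  # warm-up
    t0 = time.perf_counter()
    for i in range(iters):
        torch.mm(x, weights[i % len(weights)].t())
    return (time.perf_counter() - t0) / iters * 1e6

hot = [torch.randn(N, K, dtype=torch.bfloat16)]                     # one reused weight
cold = [torch.randn(N, K, dtype=torch.bfloat16) for _ in range(8)]  # rotating weights

print(f"single reused weight: {bench_us(hot):.1f} us/call")
print(f"rotating weights:     {bench_us(cold):.1f} us/call")
```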

UTs

python test/inductor/test_cpu_select_algorithm.py -v -k test_int8_woq_mm
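
Beyond the unit test, a hedged repro sketch along these lines exercises the same path manually (the shapes, dtypes, and the direct call to torch._weight_int8pack_mm are illustrative assumptions, not code from this PR):

```python
# Sketch: int8 WoQ GEMM with BF16 activation compiled under max-autotune.
# Small M (here M=1) corresponds to next-token generation in LLMs.
import torch

M, K, N = 1, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16)              # BF16 activation
w = torch.randint(-128, 127, (N, K), dtype=torch.int8)   # int8 weight
scales = torch.rand(N, dtype=torch.bfloat16)             # per-channel scales (assumed dtype)

def woq_mm(x, w, scales):
    return torch._weight_int8pack_mm(x, w, scales)

compiled = torch.compile(woq_mm, mode="max-autotune")
print(compiled(x, w, scales).shape)  # torch.Size([1, 4096])
```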

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot bot commented Mar 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148502

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit b08a6a3 with merge base 98458e5:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel sanchitintel added the ciflow/trunk and topic: not user facing labels Mar 4, 2025
@sanchitintel
Collaborator Author

@leslie-fang-intel, please advise whether we should disable the templated GEMM kernel for the int8 WoQ case on machines that don't support the AMX ISA. Thanks!

Collaborator
@jgong5 jgong5 left a comment


Could you please share performance comparisons for the micro-benchmarks and the end-to-end benchmarks with various small M? I'm wondering why the templated GEMM with AVX512 is slower than its AMX counterpart. A GEMM with small M is memory-bound, not compute-bound, so AMX cannot help much here...

@sanchitintel
Collaborator Author
sanchitintel commented Mar 5, 2025

> why the template GEMM with AVX512 is slower than AMX counterpart

It's faster

@jgong5, the idea behind this PR is to choose ATen op _weight_int8pack_mm for small M during auto-tuning because the ATen kernel performs better than the templated AVX512 GEMM end-to-end, although the templated AVX512 GEMM performs better during auto-tuning benchmarking. We had discussed this issue last year as well.

In this PR, I deliberately (re-)enabled AMX GEMM even for small M because it's quite slow, so the ATen kernel (_weight_int8pack_mm) would be chosen over it during auto-tuning.

I guess another approach would've been to disable the AVX512 GEMM micro-kernel altogether for int8 WoQ with BF16 activation, so that it wouldn't be chosen even on machines that lack AMX support (I haven't collected perf data on such machines, though).
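
As a conceptual picture of why this works (this is not Inductor's actual selection code; the candidate names and timings below are made up): auto-tuning benchmarks every candidate and keeps the fastest, so if the only template candidate offered for small M is a slow AMX one, the ATen fallback wins the selection.

```python
# Conceptual illustration of autotune selection; not Inductor's real code.
from typing import Callable, Dict

def pick_kernel(candidates: Dict[str, Callable[[], float]]) -> str:
    # Benchmark every candidate and keep the one with the lowest latency.
    timings = {name: bench() for name, bench in candidates.items()}
    return min(timings, key=timings.get)

# Hypothetical numbers: with only the slow AMX template in the mix for
# small M, the ATen kernel wins the selection.
print(pick_kernel({
    "aten._weight_int8pack_mm": lambda: 108.7,  # us
    "cpp_template_amx": lambda: 500.0,          # us
}))  # -> aten._weight_int8pack_mm
```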

Thanks!

@leslie-fang-intel
Collaborator

One reason for this behavior could be different cache behavior with respect to the input data (auto-tuning benchmarking vs. the E2E model run). A secondary reason could be that the ATen kernel is generic, whereas many codegened GEMMs are created; the codegened kernels are less i-cache friendly (there is only one _weight_int8pack_mm but many codegened kernels), and the i-cache misses are not amortized by the speedup the codegened kernels bring in this case.

How much of a perf gap did we see for a specific GEMM between the ATen kernel and the AVX512 micro GEMM, for both the benchmark data and the model-runtime data?

@sanchitintel
Collaborator Author
sanchitintel commented Mar 5, 2025

Hi @leslie-fang-intel, the rationale for this PR can be demonstrated simply by benchmarking an int8 WoQ LLM with max-autotune enabled and disabled, and comparing next-token generation time. With max-autotune enabled, the only change is that templated GEMMs are used instead of _weight_int8pack_mm, so they are the only differentiating factor.

The data you asked for is thus not necessary to establish the source of the regression, so can we follow the standard operating procedure of fixing a regression first, so that at least the regression relative to max-autotune being disabled (the default Inductor-CPU behavior) is fixed? Would it be okay to move the debugging discussion to the linked issue #148494?

Thanks!

@sanchitintel sanchitintel requested a review from jgong5 March 5, 2025 11:21
@sanchitintel sanchitintel requested review from mingfeima and removed request for chunyuan-w March 5, 2025 11:46
@sanchitintel sanchitintel changed the title from "[Inductor-CPU] Let AMX ISA be chosen for int8 WoQ GEMM for any input shapes" to "[Inductor-CPU] Fix perf regression for templated int8 WoQ GEMM for small M dimension" Mar 5, 2025
@sanchitintel sanchitintel force-pushed the int8_woq_gemm_use_amx_isa_if_available branch from 6753078 to 9fbd241 Compare March 5, 2025 19:30
linux-foundation-easycla bot commented Mar 5, 2025

CLA Signed


The committers listed above are authorized under a signed CLA.

@sanchitintel
Collaborator Author
sanchitintel commented Mar 5, 2025

Hi @leslie-fang-intel @jgong5, I modified this PR to disable auto-tuning for the int8 WoQ GEMM case when M < 32 (the templated GEMM is slower for M < 32, irrespective of whether the AVX512 or AMX micro-kernel is used; we may need to refine this heuristic, perhaps by also considering K and N, so I'll gather data to support it, although end-to-end performance may differ from the micro-benchmark data). For M >= 32, templated GEMM kernels based on the AMX micro-kernel would be used on machines that support the AMX ISA.
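
A minimal sketch of the kind of small-M guard being described (the function name, threshold constant, and placement are hypothetical; the actual change in torch/_inductor may look different):

```python
# Hypothetical illustration of the small-M guard; not the literal code from this PR.
_SMALL_M_THRESHOLD = 32  # below this, fall back to the ATen _weight_int8pack_mm kernel

def should_autotune_int8_woq_gemm(m: int) -> bool:
    # For M < 32 the templated GEMM (whether the AVX512 or AMX micro-kernel
    # is used) loses end-to-end, so skip auto-tuning for those shapes.
    return m >= _SMALL_M_THRESHOLD
```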

@sanchitintel sanchitintel marked this pull request as draft March 5, 2025 23:35
Collaborator
@leslie-fang-intel leslie-fang-intel left a comment


It looks like you have converted this PR to a draft and are currently collecting performance data. Please re-request a review when you believe the PR is ready.

@jgong5
Collaborator
jgong5 commented Mar 6, 2025

> > why the template GEMM with AVX512 is slower than AMX counterpart
>
> It's faster
>
> @jgong5, the idea behind this PR is to choose ATen op _weight_int8pack_mm for small M during auto-tuning because the ATen kernel performs better than the templated AVX512 GEMM end-to-end, although the templated AVX512 GEMM performs better during auto-tuning benchmarking. We had discussed this issue last year as well.
>
> In this PR, I deliberately (re-)enabled AMX GEMM even for small M because it's quite slow, so the ATen kernel (_weight_int8pack_mm) would be chosen over it during auto-tuning.
>
> I guess another approach would've been to disable the AVX512 GEMM micro-kernel altogether for int8 WoQ with BF16 activation, so that it wouldn't be chosen even on machines that lack AMX support (I haven't collected perf data on such machines, though).
>
> Thanks!

I understand the problem of differing performance that you saw between the ATen kernel and the templated AVX512 GEMM kernel during auto-tuning vs. end-to-end runs. It is definitely something we need to fix. But I am concerned about the way you address the problem: it is more of a temporary workaround, and it promotes an even worse kernel (the AMX GEMM kernel) for small M. If we force compilation to use the template GEMMs, we would see a regression with your fix. Can you share the perf numbers I requested so we understand the problem better?

@sanchitintel
Collaborator Author
sanchitintel commented Mar 6, 2025

> Can you share the perf numbers I requested so we understand the problem better?

Discovered something surprising: the ATen kernel _weight_int8pack_mm performs better E2E than it does during auto-tuning benchmarking! Its E2E performance is also better than that of the templated GEMM with the AVX512 micro-kernel.
The trend is similar for other small Ms (I've benchmarked M = 1 to 8 so far); I'll post that data soon.

| M | N | K | Templated GEMM latency (auto-tuning benchmarking) | Templated GEMM latency (E2E) | _weight_int8pack_mm latency (auto-tuning benchmarking) | _weight_int8pack_mm latency (E2E) | E2E latency ratio: templated GEMM / _weight_int8pack_mm |
|---|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 31.2 us | 91.1 us | 108.7 us | 76.07 us | 1.19 |
| 1 | 1024 | 4096 | 16.1 us | 33.36 us | 52.9 us | 24.275 us | 1.37 |
| 1 | 14336 | 4096 | 112.8 us | 274.16 us | 335.3 us | 233.197 us | 1.17 |
| 1 | 4096 | 14336 | 128.1 us | 280.76 us | 330 us | 237.797 us | 1.18 |
| 1 | 4096 | 128256 | 1.642 ms | 2.16 ms | 2.118 ms | 2.034 ms | 1.06 |

> But I am concerned about the way you address the problem: it is more of a temporary workaround, and it promotes an even worse kernel (the AMX GEMM kernel) for small M.

@jgong5, yes, it's a workaround: _weight_int8pack_mm would be used instead of the templated GEMM for small M.
Also, yesterday I changed this PR to disable auto-tuning altogether for the int8 WoQ case with small M.
Previously, this PR deliberately used the poorly performing AMX micro-kernel based templated GEMM during auto-tuning, so that the templated GEMM wouldn't be chosen after auto-tuning.

> If we force compilation to use the template GEMMs, we would see a regression with your fix.

Sorry, is there any plan to force use of templated GEMMs even if their ATen counterpart is more performant?
If not, then we wouldn't have had a regression.

Thanks!

@sanchitintel sanchitintel changed the title from "[Inductor-CPU] Fix perf regression for templated int8 WoQ GEMM for small M dimension" to "[Inductor-CPU] Disable auto-tuning for templated int8 WoQ GEMM for small M to fix perf regression" Mar 6, 2025
@leslie-fang-intel
Collaborator

> Sorry, is there any plan to force use of templated GEMMs even if their ATen counterpart is more performant?
> If not, then we wouldn't have had a regression.

Users can set max_autotune_gemm_backends="CPP" to force its use.
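
For reference, a minimal sketch of forcing the C++ template backend (set via torch._inductor.config; the surrounding usage is illustrative):

```python
# Restrict Inductor's GEMM auto-tuning to the C++ template backend only,
# so the ATen kernel no longer participates in the selection.
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "CPP"

# Afterwards, compile the model as usual, e.g.:
# compiled_model = torch.compile(model, mode="max-autotune")
```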

@sanchitintel
Collaborator Author
sanchitintel commented Mar 7, 2025

> Users can set max_autotune_gemm_backends="CPP" to force its use.

Thanks for the info, @leslie-fang-intel! The current revision wouldn't run into that problem, but I'm looking into revising the auto-tuning implementation so that this workaround isn't needed.

Collaborator
@jgong5 jgong5 left a comment


I'm fine if you disable templated codegen for this particular case and always use the ATen kernel in this PR. But feel free to improve the GEMM template in this PR too.

return create_epilogue_with_attr(
buf, "mul", other=realize_inputs(expand(scale, layout.size))
)
def _use_autotuning() -> bool:
Collaborator


We have a check function use_cpp_gemm_template that does similar things. It is called in the same function below.
