[Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions for next-token generation of LLMs #148494

Closed
sanchitintel opened this issue Mar 4, 2025 · 9 comments
Assignees: sanchitintel
Labels: oncall: cpu inductor, oncall: pt2
@sanchitintel (Collaborator) commented Mar 4, 2025

🐛 Describe the bug

Inductor-CPU templated int8 WoQ GEMMs (with BF16 activation) for next-token generation (i.e. with a small M dimension) are faster than their ATen counterparts during auto-tuning, so they are chosen at compile time, but they may cause a regression when the model is run end-to-end. (A digression: during auto-tuning, templated GEMMs are benchmarked only against their ATen counterpart, while the templated GEMM that runs E2E also has some epilogue fusions.)

The root cause of this behavior is unknown at this point.
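
For context, a minimal standalone sketch of the kind of GEMM being discussed (shapes are illustrative, with M=1 corresponding to next-token generation; the model, quantization flow, and exact lowering are simplified assumptions):

```python
import torch

# Hedged repro sketch: BF16 activation, int8 weight, per-output-channel BF16 scales,
# matching the signature of aten::_weight_int8pack_mm. M=1 is the next-token case.
M, N, K = 1, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16)
w = torch.randint(-128, 127, (N, K), dtype=torch.int8)
scales = torch.rand(N, dtype=torch.bfloat16) + 0.5

def woq_gemm(x, w, scales):
    return torch._weight_int8pack_mm(x, w, scales)

# With max-autotune, Inductor may pick a templated CPP GEMM over the ATen kernel
# for this op; with max-autotune disabled, the ATen kernel is used directly.
compiled = torch.compile(woq_gemm, mode="max-autotune")
out = compiled(x, w, scales)
```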

Solution to fix regression (compared to Inductor-CPU max-autotune disabled)

Currently, an AVX512 GEMM micro-kernel is used for small M, and an AMX micro-kernel is used for the large-M case.

We should disable the AVX512 GEMM micro-kernel when the AMX ISA is available, so that (see the rough sketch after the PR link below):

  1. For small M, _weight_int8pack_mm would be chosen during auto-tuning -> no regression in next-token latency E2E.
  2. For large M, the templated GEMM with the AMX micro-kernel would be chosen -> lower first-token latency E2E.

PR: #148502
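
For reference, a rough sketch of the intended selection policy (illustrative pseudocode only, not the actual code in #148502; the small-M cutoff is a made-up placeholder):

```python
# Illustrative only -- not the code in #148502.
# Idea: when AMX is available, don't offer the AVX512 micro-kernel as an
# auto-tuning choice, so small-M WoQ GEMMs fall back to aten::_weight_int8pack_mm
# while large-M ones use the AMX-based templated GEMM.
def pick_woq_gemm_kernel(m: int, amx_available: bool, small_m_cutoff: int = 32):
    # small_m_cutoff is a hypothetical placeholder for "M too small to benefit from AMX"
    if amx_available:
        if m < small_m_cutoff:
            return "aten::_weight_int8pack_mm"
        return "templated_gemm_amx"
    # Without AMX, the AVX512 micro-kernel remains a candidate.
    return "templated_gemm_avx512"
```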

Solution to improve end-to-end templated int8 WoQ GEMM performance over the default Inductor-CPU lowering (_weight_int8pack_mm) for small M

?

Versions

Main branch

cc @chauhang @penguinwu

@sanchitintel sanchitintel added the oncall: cpu inductor label Mar 4, 2025
@sanchitintel sanchitintel self-assigned this Mar 4, 2025
@sanchitintel sanchitintel changed the title [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions E2E [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions E2E for next-token generation of LLMs Mar 4, 2025
@sanchitintel sanchitintel changed the title [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions E2E for next-token generation of LLMs [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions for next-token generation of LLMs Mar 5, 2025
@sanchitintel (Collaborator, Author):

As per @leslie-fang-intel's comment, I will provide data on the performance gap between the templated GEMM & _weight_int8pack_mm during auto-tuning benchmarking and during E2E runs.

@sanchitintel (Collaborator, Author) commented Mar 6, 2025

@leslie-fang-intel @jgong5

Data collection methodology

The perf data for LLaMA 3.1-8B Instruct (1024-token prompt; 1024 tokens generated) was gathered on 32 physical cores of one NUMA node of a Xeon Gen 5, with out-of-template epilogue fusions for templated GEMMs turned off so that an apples-to-apples comparison would be possible. Intel OpenMP & tcmalloc were preloaded.

  1. With max-autotune mode disabled, _weight_int8pack_mm was used for the int8 WoQ GEMMs. The CPP wrapper was disabled (otherwise, the runtime of ATen ops is not captured; will create a new issue for that).
  2. With max-autotune mode enabled, templated GEMMs were codegened. The CPP wrapper was enabled.
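
(For reference, per-op GEMM runtimes like those reported below can be collected with torch.profiler; this is a minimal sketch rather than the exact harness used here, and model_step/inputs are placeholders.)

```python
from torch.profiler import profile, ProfilerActivity

# Minimal sketch: profile one decode step and inspect the aggregated CPU time of
# the WoQ GEMM kernels (aten::_weight_int8pack_mm or the templated CPP GEMM).
def profile_decode_step(model_step, inputs):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        model_step(*inputs)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```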

Summary

The data is very interesting (in the sense that it's counter-intuitive) and may be indicative of some fundamental issue (not a bug, but an overlooked aspect) in the auto-tuning benchmarking implementation:

  1. The templated GEMMs perform worse E2E than during auto-tuning benchmarking.
  2. But the ATen kernel _weight_int8pack_mm performs better E2E than during auto-tuning benchmarking!

Table 1. Comparison of next-token latency for max-autotune enabled/disabled

| max-autotune | cpp-wrapper | Next-token generation latency |
|---|---|---|
| Enabled | Enabled | 45 ms |
| Disabled | Disabled | 41 ms |

Table 2. GEMM runtime comparison for E2E vs autotuning-benchmarking runtimes

| M | N | K | Templated GEMM latency, auto-tuning benchmarking | Templated GEMM latency, E2E | _weight_int8pack_mm latency, auto-tuning benchmarking | _weight_int8pack_mm latency, E2E | E2E latency ratio (templated GEMM / _weight_int8pack_mm) |
|---|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 31.2 us | 91.1 us | 108.7 us | 76.07 us | 1.19 |
| 1 | 1024 | 4096 | 16.1 us | 33.36 us | 52.9 us | 24.275 us | 1.37 |
| 1 | 14336 | 4096 | 112.8 us | 274.16 us | 335.3 us | 233.197 us | 1.17 |
| 1 | 4096 | 14336 | 128.1 us | 280.76 us | 330 us | 237.797 us | 1.18 |
| 1 | 4096 | 128256 | 1.642 ms | 2.16 ms | 2.118 ms | 2.034 ms | 1.06 |

Table 3. GEMM runtime for generating each token (rest-tokens)

Based on the above data & knowledge of layers in LLaMA 3.1, let's project the runtime of GEMMs for generating one token with max-autotune mode enabled:

| M | N | K | _weight_int8pack_mm calls per generated token | Total GEMM runtime per token, default Inductor-CPU | Projected GEMM runtime per token, max-autotune |
|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 64 | 4.869 ms | 5.79 ms |
| 1 | 1024 | 4096 | 64 | 1.533 ms | 2.100 ms |
| 1 | 14336 | 4096 | 64 | 14.924 ms | 17.461 ms |
| 1 | 4096 | 14336 | 32 | 7.596 ms | 8.963 ms |
| 1 | 4096 | 128256 | 1 | 2.034 ms | 2.16 ms |

Sanity-check of gathered data

With the default Inductor-CPU implementation, 30.956 ms were spent on quantized GEMMs for each next/rest-token generation (on average, over 1024 generated tokens).
Based on the gathered data, the expected increase in runtime with max-autotune would be 36.474 ms (sum of the last column of Table 3) - 30.956 ms (sum of the penultimate column of Table 3) = 5.518 ms.

The actual average runtime of the templated GEMM kernels was 34.974 ms per next-token generation. Thus, the observed runtime difference in int8 WoQ GEMMs per next-token generation is ~4 ms, while the next-token latency difference between max-autotune enabled and disabled was 3 ms. Some of this gap can be attributed to truncation when printing the next-token latency.

The projected difference is off by 1.5 ms. I'm not sure whether this needs to be investigated to confirm that the collected data is sound enough to draw inferences from; since the projections are based on numbers that can vary slightly from run to run, we'll deem the data valid for now. Besides, we have a bigger question at hand.
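
(For completeness, the column sums and the projected per-token delta above can be re-derived mechanically; a trivial check using the Table 3 values:)

```python
# Trivial check of the Table 3 column sums (values copied from the table, in ms).
default_ms   = [4.869, 1.533, 14.924, 7.596, 2.034]   # default Inductor-CPU column
projected_ms = [5.79, 2.100, 17.461, 8.963, 2.16]     # projected max-autotune column
print(sum(default_ms))                      # ~30.956 ms per generated token
print(sum(projected_ms))                    # ~36.474 ms per generated token
print(sum(projected_ms) - sum(default_ms))  # ~5.518 ms projected regression per token
```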

To investigate

The biggest surprise here is the discrepancy between the latencies of a GEMM during auto-tuning benchmarking and during end-to-end model execution, observed for both _weight_int8pack_mm and the templated GEMM kernel, with the two trends running in opposite directions.

Please advise how to approach this investigation. Thanks!

@leslie-fang-intel (Collaborator):

Thanks for the detailed breakdown.

> ATen kernel _weight_int8pack_mm performs better E2E than during auto-tuning benchmarking!

Comparing Table 2's _weight_int8pack_mm latency during auto-tuning benchmarking column with its E2E latency column, the data appears counterintuitive. Intuitively, the benchmarking run should have better cache locality and, therefore, better performance than E2E.

@sanchitintel (Collaborator, Author) commented Mar 7, 2025

@leslie-fang-intel @jgong5, I increased the time auto-tuning benchmarking runs for, and _weight_int8pack_mm's profiled latency did go down, just as expected (though not as much as for the templated GEMMs).

I'm thinking of changing the auto-tuning benchmarking process to make it insensitive to the temporal & spatial cache locality carried over from previous benchmarking iterations of the same kernel. That would require separate example inputs that are copies of each other (assuming both kernels share the same inputs).

The two approaches I'm thinking of are something like:

```python
# Approach 1
for choice in choices:
    for _ in range(num_benchmarking_iterations):  # I haven't thought about how we should time it
        # benchmark the kernel for one iteration & record the time
        # then, based on the size of the L1 & L2 caches, perform an operation on a
        # tensor (not one of the example inputs) large enough to occupy L1 & L2
        # (although in E2E workloads, some inputs may already be in L1 or L2 caches)
        ...

# Approach 2
for _ in range(num_benchmarking_iterations):  # I'll use per-choice time counters instead of a fixed number of iterations
    for choice in choices:
        # benchmark each choice and record its time in a separate per-choice data structure
        ...
```

However, for small input shapes, the second approach may still be susceptible to cache locality being exploited across subsequent benchmarking runs of the same kernel.

I'll try approach 2 first. If it doesn't work well, I'll try approach 1.
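
If approach 1 ends up being needed, the cache-flushing step between timed iterations could look roughly like this (a sketch; the buffer size is an illustrative placeholder rather than a measured L1+L2 footprint):

```python
import torch

# Sketch of the "evict L1/L2 between iterations" step from approach 1.
# FLUSH_BYTES is a placeholder; ideally it would be derived from the actual
# per-core L1 + L2 sizes of the machine running auto-tuning.
FLUSH_BYTES = 8 * 1024 * 1024
_flush_buf = torch.empty(FLUSH_BYTES // 4, dtype=torch.float32)

def flush_caches():
    # A read-modify-write pass over a buffer larger than L1+L2 displaces the
    # cache lines warmed up by the previous benchmarking iteration.
    _flush_buf.add_(1.0)
```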

Thanks!

@leslie-fang-intel (Collaborator):

> Thanks for the detailed breakdown.
>
> > ATen kernel _weight_int8pack_mm performs better E2E than during auto-tuning benchmarking!
>
> Comparing Table 2's _weight_int8pack_mm latency during auto-tuning benchmarking column with its E2E latency column, the data appears counterintuitive. Intuitively, the benchmarking run should have better cache locality and, therefore, better performance than E2E.

Based on the data you showed above, for a GEMM with a specific shape, performance in the E2E run is significantly better than in the benchmark run when using the ATen kernel. Would reducing the cache impact in the benchmark run help address this discrepancy?

@sanchitintel (Collaborator, Author):

> Based on the data you showed above, for a GEMM with a specific shape, performance in the E2E run is significantly better than in the benchmark run when using the ATen kernel. Would reducing the cache impact in the benchmark run help address this discrepancy?

As mentioned above, I increased the time auto-tuning benchmarking runs for, and _weight_int8pack_mm's profiled latency did go down, just as expected (though not as much as for the templated GEMMs).

During this experimentation, _weight_int8pack_mm's auto-tuning latency also became lower than its latency in an E2E run of the model.

@leslie-fang-intel (Collaborator):

> I increased the time auto-tuning benchmarking runs for, and _weight_int8pack_mm's profiled latency did go down, just as expected (though not as much as for the templated GEMMs).
>
> During this experimentation, _weight_int8pack_mm's auto-tuning latency also became lower than its latency in an E2E run of the model.

Thanks. By how much did you increase the benchmarking time, and what is the latest GEMM time breakdown after your experiment?

@sanchitintel (Collaborator, Author) commented Mar 14, 2025

As for the performance gap: today I determined that it's caused by not explicitly prefetching the cache lines of the next tile of B, which was leading to more cache misses. The gains from prefetching are not apparent in op-level benchmarking, though; they only show up in E2E runs of models. I'll submit a PR for it.

> By how much did you increase the benchmarking time, and what is the latest GEMM time breakdown after your experiment?

I did a lot of experiments and varied the benchmarking time up to a few minutes; the absolute runtime of a kernel during auto-tuning benchmarking doesn't matter.

The only takeaways from those experiments were:

1. Increasing the time auto-tuning benchmarking runs for did lower _weight_int8pack_mm's profiled latency, just as expected (though not as much as for the templated GEMMs).
2. During this experimentation, _weight_int8pack_mm's auto-tuning latency also became lower than its latency in an E2E run of the model.

@sanchitintel (Collaborator, Author) commented Mar 21, 2025

> That would require separate example inputs that are copies of each other (assuming both kernels share the same inputs).

@CaoE, can you please link your PR that does this? I'll check whether that change alone is enough to solve the problem of E2E performance being quite different from performance during benchmarking. I think one of the two changes listed here would also be needed. Thanks!
