[Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions for next-token generation of LLMs #148494

Closed
sanchitintel opened this issue Mar 4, 2025 · 9 comments
Assignees: sanchitintel
Labels: oncall: cpu inductor, oncall: pt2
@sanchitintel (Collaborator) commented Mar 4, 2025

🐛 Describe the bug

Inductor-CPU templated int8 WoQ GEMMs (with BF16 activation) for next-token generation (i.e. with a small M dimension) are faster than their ATen counterparts during auto-tuning, so they are chosen at compile time, but they may cause a regression when the model is run end-to-end. (A digression: during auto-tuning, templated GEMMs are benchmarked only against their ATen counterpart, while the templated GEMM that runs E2E also has some epilogue fusions.)

The root cause of this behavior is unknown at this point.
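
For context, a minimal standalone sketch of the kind of GEMM being discussed (shapes are illustrative, with M=1 corresponding to next-token generation; the model, quantization flow, and exact lowering are simplified assumptions):

```python
import torch

# Hedged repro sketch: BF16 activation, int8 weight, per-output-channel BF16 scales,
# matching the signature of aten::_weight_int8pack_mm. M=1 is the next-token case.
M, N, K = 1, 4096, 4096
x = torch.randn(M, K, dtype=torch.bfloat16)
w = torch.randint(-128, 127, (N, K), dtype=torch.int8)
scales = torch.rand(N, dtype=torch.bfloat16) + 0.5

def woq_gemm(x, w, scales):
    return torch._weight_int8pack_mm(x, w, scales)

# With max-autotune, Inductor may pick a templated CPP GEMM over the ATen kernel
# for this op; with max-autotune disabled, the ATen kernel is used directly.
compiled = torch.compile(woq_gemm, mode="max-autotune")
out = compiled(x, w, scales)
```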

Solution to fix regression (compared to Inductor-CPU max-autotune disabled)

Currently, an AVX512 GEMM micro-kernel is used for small M, and an AMX micro-kernel is used for the large-M case.

We should disable the AVX512 GEMM micro-kernel when the AMX ISA is available, so that (see the rough sketch after the PR link below):

  1. For small M, _weight_int8pack_mm would be chosen during auto-tuning -> no regression in next-token latency E2E.
  2. For large M, the templated GEMM with the AMX micro-kernel would be chosen -> lower first-token latency E2E.

PR: #148502
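
For reference, a rough sketch of the intended selection policy (illustrative pseudocode only, not the actual code in #148502; the small-M cutoff is a made-up placeholder):

```python
# Illustrative only -- not the code in #148502.
# Idea: when AMX is available, don't offer the AVX512 micro-kernel as an
# auto-tuning choice, so small-M WoQ GEMMs fall back to aten::_weight_int8pack_mm
# while large-M ones use the AMX-based templated GEMM.
def pick_woq_gemm_kernel(m: int, amx_available: bool, small_m_cutoff: int = 32):
    # small_m_cutoff is a hypothetical placeholder for "M too small to benefit from AMX"
    if amx_available:
        if m < small_m_cutoff:
            return "aten::_weight_int8pack_mm"
        return "templated_gemm_amx"
    # Without AMX, the AVX512 micro-kernel remains a candidate.
    return "templated_gemm_avx512"
```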

Solution to improve end-to-end templated int8 WoQ GEMM performance over the default Inductor-CPU lowering (_weight_int8pack_mm) for small M

?

Versions

Main branch

cc @chauhang @penguinwu

@sanchitintel sanchitintel added the oncall: cpu inductor label Mar 4, 2025
@sanchitintel sanchitintel self-assigned this Mar 4, 2025
@sanchitintel sanchitintel changed the title [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions E2E [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions E2E for next-token generation of LLMs Mar 4, 2025
@sanchitintel sanchitintel changed the title [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions E2E for next-token generation of LLMs [Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions for next-token generation of LLMs Mar 5, 2025
@sanchitintel (Collaborator, Author):

As per @leslie-fang-intel's comment, I will provide data on the performance gap between the templated GEMM & _weight_int8pack_mm during auto-tuning benchmarking and during E2E runs.

@sanchitintel (Collaborator, Author) commented Mar 6, 2025

@leslie-fang-intel @jgong5

Data collection methodology

The perf data for LLaMA 3.1-8B Instruct (1024-token prompt; 1024 tokens generated) was gathered on 32 physical cores of one NUMA node of a Xeon Gen 5, with out-of-template epilogue fusions for templated GEMMs turned off so that an apples-to-apples comparison would be possible. Intel OpenMP & tcmalloc were preloaded.

  1. With max-autotune mode disabled, _weight_int8pack_mm was used for the int8 WoQ GEMMs. The CPP wrapper was disabled (otherwise, the runtime of ATen ops is not captured; will create a new issue for that).
  2. With max-autotune mode enabled, templated GEMMs were codegened. The CPP wrapper was enabled.
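
(For reference, per-op GEMM runtimes like those reported below can be collected with torch.profiler; this is a minimal sketch rather than the exact harness used here, and model_step/inputs are placeholders.)

```python
from torch.profiler import profile, ProfilerActivity

# Minimal sketch: profile one decode step and inspect the aggregated CPU time of
# the WoQ GEMM kernels (aten::_weight_int8pack_mm or the templated CPP GEMM).
def profile_decode_step(model_step, inputs):
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        model_step(*inputs)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```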

Summary

The data is very interesting (in the sense that it's counter-intuitive) and may be indicative of some fundamental issue (not a bug, but an overlooked aspect) in the auto-tuning benchmarking implementation:

  1. The templated GEMMs perform worse E2E than during auto-tuning benchmarking.
  2. But the ATen kernel _weight_int8pack_mm performs better E2E than during auto-tuning benchmarking!

Table 1. Comparison of next-token latency for max-autotune enabled/disabled

| max-autotune | cpp-wrapper | Next-token generation latency |
|---|---|---|
| Enabled | Enabled | 45 ms |
| Disabled | Disabled | 41 ms |

Table 2. GEMM runtime comparison for E2E vs autotuning-benchmarking runtimes

| M | N | K | Templated GEMM latency, auto-tuning benchmarking | Templated GEMM latency, E2E | _weight_int8pack_mm latency, auto-tuning benchmarking | _weight_int8pack_mm latency, E2E | E2E latency ratio (templated GEMM / _weight_int8pack_mm) |
|---|---|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 31.2 us | 91.1 us | 108.7 us | 76.07 us | 1.19 |
| 1 | 1024 | 4096 | 16.1 us | 33.36 us | 52.9 us | 24.275 us | 1.37 |
| 1 | 14336 | 4096 | 112.8 us | 274.16 us | 335.3 us | 233.197 us | 1.17 |
| 1 | 4096 | 14336 | 128.1 us | 280.76 us | 330 us | 237.797 us | 1.18 |
| 1 | 4096 | 128256 | 1.642 ms | 2.16 ms | 2.118 ms | 2.034 ms | 1.06 |

Table 3. GEMM runtime for generating each token (rest-tokens)

Based on the above data & knowledge of layers in LLaMA 3.1, let's project the runtime of GEMMs for generating one token with max-autotune mode enabled:

| M | N | K | _weight_int8pack_mm calls per generated token | Total GEMM runtime per token, default Inductor-CPU | Projected GEMM runtime per token, max-autotune |
|---|---|---|---|---|---|
| 1 | 4096 | 4096 | 64 | 4.869 ms | 5.79 ms |
| 1 | 1024 | 4096 | 64 | 1.533 ms | 2.100 ms |
| 1 | 14336 | 4096 | 64 | 14.924 ms | 17.461 ms |
| 1 | 4096 | 14336 | 32 | 7.596 ms | 8.963 ms |
| 1 | 4096 | 128256 | 1 | 2.034 ms | 2.16 ms |

Sanity-check of gathered data

With the default Inductor-CPU implementation, 30.956 ms were spent on quantized GEMMs for each next/rest-token generation (on average, over 1024 generated tokens).
Based on the gathered data, the expected increase in runtime with max-autotune would be 36.474 ms (sum of the last column of Table 3) - 30.956 ms (sum of the penultimate column of Table 3) = 5.518 ms.

The actual average runtime of the templated GEMM kernels was 34.974 ms per next-token generation. Thus, the observed runtime difference in int8 WoQ GEMMs per next-token generation is ~4 ms, while the next-token latency difference between max-autotune enabled and disabled was 3 ms. Some of this gap can be attributed to truncation when printing the next-token latency.

The projected difference is off by 1.5 ms. I'm not sure whether this needs to be investigated to confirm that the collected data is sound enough to draw inferences from; since the projections are based on numbers that can vary slightly from run to run, we'll deem the data valid for now. Besides, we have a bigger question at hand.
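
(For completeness, the column sums and the projected per-token delta above can be re-derived mechanically; a trivial check using the Table 3 values:)

```python
# Trivial check of the Table 3 column sums (values copied from the table, in ms).
default_ms   = [4.869, 1.533, 14.924, 7.596, 2.034]   # default Inductor-CPU column
projected_ms = [5.79, 2.100, 17.461, 8.963, 2.16]     # projected max-autotune column
print(sum(default_ms))                      # ~30.956 ms per generated token
print(sum(projected_ms))                    # ~36.474 ms per generated token
print(sum(projected_ms) - sum(default_ms))  # ~5.518 ms projected regression per token
```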

To investigate

The biggest surprise here is the discrepancy between the latencies of a GEMM during auto-tuning benchmarking and during end-to-end model execution, observed for both _weight_int8pack_mm and the templated GEMM kernel, with the two trends running in opposite directions.

Please advise how to approach this investigation. Thanks!

@leslie-fang-intel (Collaborator):

Thanks for the detailed breakdown.

> ATen kernel _weight_int8pack_mm performs better E2E than during auto-tuning benchmarking!

Comparing Table 2's _weight_int8pack_mm latency during auto-tuning benchmarking column with its E2E latency column, the data appears counterintuitive. Intuitively, the benchmarking run should have better cache locality and, therefore, better performance than E2E.

@sanchitintel (Collaborator, Author) commented Mar 7, 2025

@leslie-fang-intel @jgong5, I increased the time auto-tuning benchmarking runs for, and _weight_int8pack_mm's profiled latency did go down, just as expected (though not as much as for the templated GEMMs).

I'm thinking of changing the auto-tuning benchmarking process to make it insensitive to the temporal & spatial cache locality carried over from previous benchmarking iterations of the same kernel. That would require separate example inputs that are copies of each other (assuming both kernels share the same inputs).

The two approaches I'm thinking of are something like:

```python
# Approach 1
for choice in choices:
    for _ in range(num_benchmarking_iterations):  # I haven't thought about how we should time it
        # benchmark the kernel for one iteration & record the time
        # then, based on the size of the L1 & L2 caches, perform an operation on a
        # tensor (not one of the example inputs) large enough to occupy L1 & L2
        # (although in E2E workloads, some inputs may already be in L1 or L2 caches)
        ...

# Approach 2
for _ in range(num_benchmarking_iterations):  # I'll use per-choice time counters instead of a fixed number of iterations
    for choice in choices:
        # benchmark each choice and record its time in a separate per-choice data structure
        ...
```

However, for small input shapes, the second approach may still be susceptible to cache locality being exploited across subsequent benchmarking runs of the same kernel.

I'll try approach 2 first. If it doesn't work well, I'll try approach 1.
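
If approach 1 ends up being needed, the cache-flushing step between timed iterations could look roughly like this (a sketch; the buffer size is an illustrative placeholder rather than a measured L1+L2 footprint):

```python
import torch

# Sketch of the "evict L1/L2 between iterations" step from approach 1.
# FLUSH_BYTES is a placeholder; ideally it would be derived from the actual
# per-core L1 + L2 sizes of the machine running auto-tuning.
FLUSH_BYTES = 8 * 1024 * 1024
_flush_buf = torch.empty(FLUSH_BYTES // 4, dtype=torch.float32)

def flush_caches():
    # A read-modify-write pass over a buffer larger than L1+L2 displaces the
    # cache lines warmed up by the previous benchmarking iteration.
    _flush_buf.add_(1.0)
```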

Thanks!

@leslie-fang-intel (Collaborator):

> Thanks for the detailed breakdown.
>
> > ATen kernel _weight_int8pack_mm performs better E2E than during auto-tuning benchmarking!
>
> Comparing Table 2's _weight_int8pack_mm latency during auto-tuning benchmarking column with its E2E latency column, the data appears counterintuitive. Intuitively, the benchmarking run should have better cache locality and, therefore, better performance than E2E.

Based on the data you showed above, for a GEMM with a specific shape, performance in the E2E run is significantly better than in the benchmark run when using the ATen kernel. Would reducing the cache impact in the benchmark run help address this discrepancy?

@sanchitintel (Collaborator, Author):

> Based on the data you showed above, for a GEMM with a specific shape, performance in the E2E run is significantly better than in the benchmark run when using the ATen kernel. Would reducing the cache impact in the benchmark run help address this discrepancy?

As mentioned above, I increased the time auto-tuning benchmarking runs for, and _weight_int8pack_mm's profiled latency did go down, just as expected (though not as much as for the templated GEMMs).

During this experimentation, _weight_int8pack_mm's auto-tuning latency also became lower than its latency in an E2E run of the model.

@leslie-fang-intel (Collaborator):

> I increased the time auto-tuning benchmarking runs for, and _weight_int8pack_mm's profiled latency did go down, just as expected (though not as much as for the templated GEMMs).
>
> During this experimentation, _weight_int8pack_mm's auto-tuning latency also became lower than its latency in an E2E run of the model.

Thanks. By how much did you increase the benchmarking time, and what is the latest GEMM time breakdown after your experiment?

@sanchitintel (Collaborator, Author) commented Mar 14, 2025

As for the performance gap: today I determined that it's caused by not explicitly prefetching the cache lines of the next tile of B, which was leading to more cache misses. The gains from prefetching are not apparent in op-level benchmarking, though; they only show up in E2E runs of models. I'll submit a PR for it.

> By how much did you increase the benchmarking time, and what is the latest GEMM time breakdown after your experiment?

I did a lot of experiments and varied the benchmarking time up to a few minutes; the absolute runtime of a kernel during auto-tuning benchmarking doesn't matter.

The only takeaways from those experiments were:

1. Increasing the time auto-tuning benchmarking runs for did lower _weight_int8pack_mm's profiled latency, just as expected (though not as much as for the templated GEMMs).
2. During this experimentation, _weight_int8pack_mm's auto-tuning latency also became lower than its latency in an E2E run of the model.

@sanchitintel (Collaborator, Author) commented Mar 21, 2025

> That would require separate example inputs that are copies of each other (assuming both kernels share the same inputs).

@CaoE, can you please link your PR that does this? I'll check whether that change alone is enough to solve the problem of E2E performance being quite different from performance during benchmarking. I think one of the two changes listed here would also be needed. Thanks!
