[Inductor-CPU] Templated int8 WoQ GEMMs (with BF16 activation) may cause regressions for next-token generation of LLMs #148494
Comments
As per @leslie-fang-intel's comment, I will provide data on the performance gap between the templated GEMM & its ATen counterpart.
Data collection methodology

The perf data for LLaMA 3.1-8B instruct (a prompt of 1024 tokens; 1024 tokens were generated) was gathered on 32 physical cores of one NUMA node of a Xeon Gen 5, with out-of-template epilogue fusions for templated GEMMs turned off, so that an apples-to-apples comparison would be possible. Intel OpenMP & tcmalloc were preloaded.

Summary

The data is very interesting (in the sense that it's counter-intuitive) & may be indicative of some fundamental issue (not a bug, but some overlooked aspect) in the auto-tuning benchmarking implementation. The templated GEMMs perform worse E2E than during auto-tuning benchmarking.

Table 1. Comparison of next-token latency with max-autotune enabled/disabled
Table 2. GEMM runtime comparison: E2E vs. auto-tuning benchmarking
Table 3. GEMM runtime for generating each token (rest-tokens)

Based on the above data & knowledge of the layers in LLaMA 3.1, let's project the runtime of GEMMs for generating one token with max-autotune mode enabled:
Sanity-check of gathered data

With the default Inductor-CPU implementation, 30.956 ms was spent on quantized GEMMs for next/rest-token generation (on average, across 1024 generated tokens). The actual average runtime of the templated GEMM kernel was 34.974 ms for next-token generation. Thus, the observed runtime difference in int8 WoQ GEMMs for each next-token generation is ~4 ms, while the next-token generation latency difference between max-autotune enabled & disabled was 3 ms. Some of this difference can be attributed to truncation of the latency when printing next-token latency. The projected difference is off by 1.5 ms. I'm not sure whether this aspect needs to be investigated to ascertain that the collected data is correct before we draw any inference from it; since the projections are based on numbers that can vary slightly across runs, for now we'd deem it valid. Besides, we have a bigger question at hand.

To investigate

The biggest surprise here is the discrepancy between the latencies of a GEMM during auto-tuning benchmarking & during end-to-end model execution. Please advise how to approach this investigation. Thanks!
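(For reference, one way to cross-check this kind of per-GEMM E2E attribution, not necessarily the methodology used for the tables above, is `torch.profiler`; `compiled_model` and `inputs` below are placeholders.)

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile one end-to-end forward pass on CPU, recording input shapes so
# GEMMs with different M can be told apart.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with torch.no_grad():
        compiled_model(*inputs)

# Grouping by input shape separates small-M (next-token) GEMMs from
# large-M (prefill) GEMMs in the report.
print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cpu_time_total", row_limit=20))
```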
Thanks for the detailed breakdown.
Comparing Table 2, Column
@leslie-fang-intel @jgong5, I increased the time for which auto-tuning benchmarking is run, and I'm thinking of changing the auto-tuning benchmarking process so that it cannot exploit temporal & spatial cache locality left over from previous benchmarking iterations of the same kernel. We would have to use separate example inputs that are copies of each other (assuming both kernels share the same inputs). The two approaches I'm thinking of are something like:
However, for small input shapes, the second approach may still be susceptible to cache locality being exploited in subsequent benchmarking runs of a kernel. I'll try approach 2 first. If it doesn't work well, I'll try approach 1. Thanks!
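The two approaches themselves aren't reproduced above, but as a rough illustration of the general idea (cycling through independent copies of the inputs so repeated benchmark iterations can't hit cache lines warmed by earlier ones), here is a minimal, hypothetical sketch; it is not the actual Inductor auto-tuning code, and the helper name and parameters are made up:

```python
import copy
import time

import torch


def benchmark_without_cache_reuse(kernel, example_inputs, num_copies=8, iters=50):
    # Clone the example inputs several times so each iteration touches
    # different memory instead of re-reading cache lines warmed earlier.
    input_sets = [
        [x.clone() if isinstance(x, torch.Tensor) else copy.deepcopy(x)
         for x in example_inputs]
        for _ in range(num_copies)
    ]
    kernel(*input_sets[0])  # warm-up to exclude one-time costs
    start = time.perf_counter()
    for i in range(iters):
        kernel(*input_sets[i % num_copies])  # rotate among the copies
    return (time.perf_counter() - start) / iters
```

Usage would look like `benchmark_without_cache_reuse(torch.mm, [a, b])`; for very small inputs, even several copies may still fit in cache, which is the concern raised above.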
Based on the data you showed above, for a GEMM with a specific shape, performance in the
During this experimentation,
Thanks. By how much did we increase the benchmarking time, and what is the latest GEMM time breakdown after your experiment?
As for the performance gap: today I determined that it's due to not explicitly prefetching the cache lines of the next tile of B, which was leading to more cache misses. The gains from prefetching are not apparent in op-level benchmarking, though; they only show up in E2E runs of models. I'll submit a PR for it.
I did a lot of experiments & even varied the benchmarking time up to a few minutes. The only takeaways from those experiments were:
@CaoE, can you please link your PR that does this? I'll check if that change alone would be enough to solve the problem of E2E performance being quite different from performance during benchmarking. I think one of the 2 changes listed here would also be needed. Thanks!
🐛 Describe the bug
Inductor-CPU templated int8 WoQ (with BF16 activation) GEMMs for next-token generation (with a small `M` dimension) are faster than their ATen counterparts during auto-tuning, so they're chosen at compile time, but they may cause a regression when a model is run end-to-end. (A digression: during auto-tuning, templated GEMMs are only benchmarked against their ATen counterpart, while the templated GEMM that runs E2E also has some epilogue fusions.)

The root cause of this behavior is unknown at this point.
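For context, here is a minimal sketch of the kind of setup being discussed: an int8 weight-only-quantized linear with BF16 activations, compiled with max-autotune so Inductor's C++ GEMM template competes against the ATen `_weight_int8pack_mm` kernel during auto-tuning. The module, shapes, and quantization scheme are illustrative, not the actual benchmark behind the numbers in this issue.

```python
import torch
import torch._inductor.config as inductor_config


class WoQLinear(torch.nn.Module):
    """Toy int8 weight-only-quantized linear with per-output-channel scales."""

    def __init__(self, in_features, out_features):
        super().__init__()
        w = torch.randn(out_features, in_features)
        scales = w.abs().amax(dim=1) / 127.0
        q = torch.round(w / scales.unsqueeze(1)).clamp(-127, 127).to(torch.int8)
        self.register_buffer("weight_int8", q)
        self.register_buffer("scales", scales.to(torch.bfloat16))

    def forward(self, x):
        # BF16 activation x int8 weight: the ATen op the template is tuned against.
        return torch._weight_int8pack_mm(x, self.weight_int8, self.scales)


inductor_config.max_autotune = True  # enable GEMM auto-tuning
model = WoQLinear(4096, 4096).eval()
compiled = torch.compile(model)
x = torch.randn(1, 4096, dtype=torch.bfloat16)  # M=1, as in next-token decoding
with torch.no_grad():
    y = compiled(x)
```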
Solution to fix regression (compared to Inductor-CPU max-autotune disabled)
Currently, an AVX512 GEMM micro-kernel is used for small `M` & an AMX ISA micro-kernel is used for large `M`.

We should disable the AVX512 GEMM micro-kernel when the AMX ISA is available, so that `_weight_int8pack_mm` would be chosen during auto-tuning -> no regression for next-token latency E2E.

PR: #148502
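Independently of that PR, one way to emulate the "no templated GEMM" behavior for comparison might be to restrict Inductor's GEMM auto-tuning backends so only the ATen kernel is considered; a sketch, assuming the `max_autotune_gemm_backends` knob in `torch._inductor.config`:

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.max_autotune = True
# Exclude the CPP template so the ATen _weight_int8pack_mm kernel is always picked.
inductor_config.max_autotune_gemm_backends = "ATEN"

compiled = torch.compile(model)  # `model` as in the earlier sketch
```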
Solution to improve end-to-end templated int8 WoQ GEMM performance over Inductor-CPU for small `M`
?
Versions
Main branch
cc @chauhang @penguinwu