CUDA: skip fully masked-out KV in FA vec kernel by JohannesGaessler · Pull Request #13584 · ggml-org/llama.cpp · GitHub


CUDA: skip fully masked-out KV in FA vec kernel #13584


Merged
JohannesGaessler merged 2 commits into ggml-org:master on May 20, 2025

Conversation

JohannesGaessler
Collaborator

This PR extends the CUDA FlashAttention vector kernels to skip fully masked-out KV slices, as was done for Metal in #13493. The overhead from adding this check is ~0.5%, which I think is negligible. For CC >= 8.9 this PR should be fine to use as-is for #13194; for Turing and Ampere the vector kernels are currently almost never used.
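To make the idea concrete, here is a minimal, self-contained sketch of the technique (not the actual llama.cpp kernel: the kernel name, the flat mask layout, and the n_slices_processed counter are made up for illustration, and the real check lives inside the FA vector kernels): each lane of a 32-wide warp inspects the mask entries of the current KV slice it is responsible for, and a warp-wide __all_sync() vote decides whether the slice's K/V loads and dot products can be skipped entirely.

#include <cuda_fp16.h>

// Sketch only: skip KV slices whose mask entries are all -inf.
// Assumes blockDim.x == 32 (exactly one warp) and slice_size a multiple of 32.
__global__ void skip_masked_kv_sketch(const half * mask, const int n_kv, const int slice_size,
                                      int * n_slices_processed) {
    const int tid = threadIdx.x;

    for (int k0 = 0; k0 < n_kv; k0 += slice_size) {
        // Each lane checks a strided subset of the slice's mask entries.
        bool skip = true;
        for (int k = k0 + tid; k < k0 + slice_size && k < n_kv; k += 32) {
            skip = skip && isinf(__half2float(mask[k])); // masked-out positions hold -inf
        }

        // If every lane only saw -inf, the slice contributes nothing to the softmax,
        // so the K/V loads and dot products for it can be skipped.
        if (__all_sync(0xFFFFFFFF, skip)) {
            continue;
        }

        // ... in the real kernel: load K/V for this slice and accumulate the attention ...
        if (tid == 0) {
            atomicAdd(n_slices_processed, 1);
        }
    }
}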

While working on this I tried adding the GQA optimization that the tensor core kernel has to the vector kernels. However, after #12183 it seems that for NVIDIA GPUs it is preferable to use one CUDA block per Q column and to rely on automatic caching (since shared memory is effectively just manually allocated L1 cache). The upside is that the performance of NVIDIA GPUs for 2-8 columns can be improved by using the kernel for 1 column. The downside is that the vector kernels cannot be used as the universally fastest kernel for batch size 1 if you factor in GQA models (the difference is ~5%). To allow skipping unused KV slices for all GPUs it will be necessary to modify the kernel in fattn-mma-f16.cuh. However, that kernel pre-loads the mask and K data simultaneously, so as it stands there would still be unnecessary data loads. More generally, it would be useful if the CUDA backend were given a parameter indicating how fragmented the KV cache is; I could then tune the kernel selection logic accordingly.

Performance changes for older NVIDIA GPUs:

| GPU | Model         | Microbatch size | Test    | t/s c6a2c9e | t/s 9854370 | Speedup |
| --- | ------------- | --------------- | ------- | ----------- | ----------- | ------- |
| P40 | llama 8B Q4_0 | 2               | pp16384 | 51.94       | 64.24       | 1.24    |
| P40 | llama 8B Q4_0 | 4               | pp16384 | 68.13       | 79.50       | 1.17    |
| P40 | llama 8B Q4_0 | 8               | pp16384 | 78.50       | 90.93       | 1.16    |

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 16, 2025
@ggerganov
Member

> To allow skipping unused KV slices for all GPUs it will be necessary to modify the kernel in fattn-mma-f16.cuh. However, that kernel pre-loads the mask and K data simultaneously, so as it stands there would still be unnecessary data loads.

After merging #13194, I plan to explore a per-sequence KV cache implementation which, when used, would alleviate the need for additional changes to the kernels, and I think this mma modification would not even be necessary. In short, we will likely have two types of cache: one that is optimal for cases such as speculative decoding (i.e. reusing large prefixes across different sequences) and one that is optimal for multiple users/contexts (i.e. different sequences having different large separate prompts).

@JohannesGaessler
Collaborator Author

Speaking as someone who mostly works on one of the backends, that is the solution that would be easiest for me. One thing to keep in mind is that it would be preferable to run the FA kernel as one single batch rather than one batch per sequence. The challenge would be that the sequences are not necessarily of the same length. This is essentially the same problem as with MoE, where each expert will generally work on an uneven number of tokens. But in any case, this is an optimization that can be worked on afterwards.

@JohannesGaessler
Collaborator Author

There seems to have been an issue on AMD where __all_sync is not equivalent because it assumes a warp size of 64 instead of 32, so I disabled the check for skipping KV slices on AMD. @yeahdongcn, are there instances of MUSA GPUs with a warp size != 32?

@yeahdongcn
Collaborator

> There seems to have been an issue on AMD where __all_sync is not equivalent because it assumes a warp size of 64 instead of 32, so I disabled the check for skipping KV slices on AMD. @yeahdongcn, are there instances of MUSA GPUs with a warp size != 32?

Yes, on the MTT S80 (qy1), the warp size is 128. However, I forced it to 32 in #12445. Could you share your llama-bench arguments? I can run some tests on MTGPU next week.

@JohannesGaessler
Collaborator Author

I tested the performance in the OP with

export model_name=llama_3-8b && export quantization=q4_0
./bench --model models/opt/${model_name}-${quantization}.gguf -r 1 -fa 1 -n 0 -p 16384 -ub 2,4,8

But I was concerned about correctness rather than performance. If the warp size is != 32, __all_sync will result in unexpected behavior. The skipping logic will become relevant in terms of performance with SWA; right now it doesn't matter.
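For context, this is roughly what the guard ends up looking like (an assumption pieced together from the "#endif // GGML_USE_HIP" visible in the diff later in this thread, not a verbatim copy of fattn-vec-f32.cuh): the 0xFFFFFFFF participation mask passed to __all_sync describes exactly 32 lanes, so the vote is only well-defined where a warp is 32 threads wide, and on HIP builds the check is compiled out.

#ifndef GGML_USE_HIP // assumption: the skip check is compiled only for non-HIP builds
            if (__all_sync(0xFFFFFFFF, skip)) { // 0xFFFFFFFF covers the 32 lanes of a CUDA warp
                continue;                       // on AMD the wavefront is 64 lanes, so the vote does not behave as this 32-lane logic assumes
            }
#endif // GGML_USE_HIP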

@yeahdongcn
Collaborator

I manually tested several models (deepseek-r1_7b_q4_0.gguf, qwen3_8b_q4_k_m.gguf, and nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf), and the generated tokens look correct. I also ran /test-backend-ops, and all tests passed using the MUSA backend. As for the benchmark test, I didn't observe any explicit performance regression or improvement.

JohannesGaessler merged commit b69f164 into ggml-org:master on May 20, 2025
40 checks passed
infil00p pushed a commit to baseweight/llama.cpp that referenced this pull request May 22, 2025
* CUDA: skip fully masked-out KV in FA vec kernel
Comment on lines +193 to +195
for (int j = 0; j < ncols; ++j) {
    maskh_shared[j*D + tid] = slopeh*maskh[j*ne11 + k_VKQ_0 + tid];
}
Member

@JohannesGaessler I think this is missing an inner loop over D?

Collaborator Author

No, because for this kernel the number of threads is equal to D.
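In other words, the block is launched with D threads, so the single loop over the columns already covers every element. Below is a tiny standalone sketch of the indexing (illustrative only; it drops the ne11/k_VKQ_0 offsets of the real kernel and the kernel name is made up):

#include <cuda_fp16.h>

template <int D, int ncols>
__global__ void fill_mask_shared_sketch(const half * maskh, const half slopeh, half * out) {
    __shared__ half maskh_shared[ncols*D];
    const int tid = threadIdx.x; // blockDim.x == D, one thread per element of the head dimension

    // For each column j the D threads of the block together write the D entries
    // maskh_shared[j*D + 0 .. j*D + D-1], so no inner loop over D is needed.
    for (int j = 0; j < ncols; ++j) {
        maskh_shared[j*D + tid] = __float2half(__half2float(slopeh) * __half2float(maskh[j*D + tid]));
    }
    __syncthreads();

    for (int j = 0; j < ncols; ++j) {
        out[j*D + tid] = maskh_shared[j*D + tid];
    }
}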

Member

I'm looking into the bug report #13733 and it seems there is a problem with the change in this PR.

I can reproduce with this command:

./bin/llama-parallel -hf ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF -np 2 -ns 32 --top-k 1 --junk 131 -c 16384 -fa

This will output repetitive junk.

If I apply this patch to disable the skipping logic, it works:

diff --git a/ggml/src/ggml-cuda/fattn-vec-f32.cuh b/ggml/src/ggml-cuda/fattn-vec-f32.cuh
index 49c592ea5..b141c233c 100644
--- a/ggml/src/ggml-cuda/fattn-vec-f32.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec-f32.cuh
@@ -217,7 +217,7 @@ static __global__ void flash_attn_vec_ext_f32(
                 }
             }
             if (__all_sync(0xFFFFFFFF, skip)) {
-                continue;
+                //continue;
             }
 #endif // GGML_USE_HIP
         }

Any ideas what could be wrong?

Comment on lines +219 to +221
if (__all_sync(0xFFFFFFFF, skip)) {
    continue;
}
Member

I think we need one more sync here:

            if (__all_sync(0xFFFFFFFF, skip)) {
                __syncthreads();
                continue;
            }

This change fixes it for me.

Collaborator Author

Yeah, I think you're right. I initially only read your other comment above, but when I looked at the code again myself I came to the same conclusion.
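For readers following along, a plausible reading of why the extra barrier matters (inferred from the snippets quoted in this review, not an authoritative description of the kernel): maskh_shared is filled cooperatively by all threads of the block, but the __all_sync vote is per warp, so warps of the same block can disagree about skipping. Without a barrier on the skip path, a skipping warp can run ahead and start overwriting maskh_shared with the next KV slice while other warps are still reading the current one, and the block's __syncthreads() calls no longer pair up. A simplified sketch of the fixed control flow follows (the trailing barrier and the isinf check are assumptions; D, ncols, tid, ne11, k_VKQ_0, maskh, maskh_shared and slopeh come from the snippets above):

    for (int k_VKQ_0 = 0; k_VKQ_0 < ne11; k_VKQ_0 += D) { // the real kernel also strides over blockIdx.y
        // All threads of the block cooperatively load the mask tile for this slice.
        for (int j = 0; j < ncols; ++j) {
            maskh_shared[j*D + tid] = slopeh*maskh[j*ne11 + k_VKQ_0 + tid];
        }
        __syncthreads(); // the tile must be complete before any warp votes on it

        bool skip = true;
        for (int j = 0; j < ncols; ++j) {
            skip = skip && isinf(__half2float(maskh_shared[j*D + tid]));
        }
        if (__all_sync(0xFFFFFFFF, skip)) {
            __syncthreads(); // keep this warp's barrier count in step with the warps that do not skip
            continue;
        }

        // ... K/V loads, dot products and online softmax update for this slice ...

        __syncthreads(); // barrier the non-skipping warps reach before the next tile is loaded
    }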

Labels
ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (Issues specific to Nvidia GPUs)
3 participants