CUDA: skip fully masked-out KV in FA vec kernel #13584
Conversation
After merging #13194, I plan to explore a per-sequence KV cache implementation which, when used, would alleviate the need for additional changes to the kernels, and I think this
Speaking as someone who mostly works on one of the backends, that is the solution that would be easiest for me. One thing to keep in mind is that it would be preferable to run the FA kernel as one single batch rather than one batch per sequence. The challenge is that the sequences are not necessarily of the same length; this is essentially the same problem as with MoE, where each expert generally works on an uneven number of tokens. But in any case, this is an optimization that can be worked on afterwards.
There seems to have been an issue on AMD where
Yes, on the MTT S80 (qy1), the warp size is 128. However, I forced it to 32 in #12445. Could you share your
I tested the performance in the OP with

export model_name=llama_3-8b && export quantization=q4_0
./bench --model models/opt/${model_name}-${quantization}.gguf -r 1 -fa 1 -n 0 -p 16384 -ub 2,4,8

But I was concerned about correctness rather than performance. If the warp size is != 32
I manually tested several models
* CUDA: skip fully masked-out KV in FA vec kernel
for (int j = 0; j < ncols; ++j) {
    maskh_shared[j*D + tid] = slopeh*maskh[j*ne11 + k_VKQ_0 + tid];
}
@JohannesGaessler I think this is missing an inner loop over D?
No, because for this kernel the number of threads is equal to D.
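For reference, a minimal sketch of that setup (the launch configuration, template parameters, and names here are assumptions for illustration, not the actual kernel code): with blockDim.x == D, thread tid covers exactly one of the D mask slots per column, so the loop only needs to iterate over the ncols columns.

#include <cuda_fp16.h>

// Illustrative only: assumes the kernel is launched with blockDim.x == D.
template <int D, int ncols>
__global__ void mask_load_sketch(half * maskh_shared_out, const half * maskh,
                                 const half slopeh, const int ne11, const int k_VKQ_0) {
    __shared__ half maskh_shared[ncols*D];
    const int tid = threadIdx.x; // 0 .. D-1, one thread per element of D

#pragma unroll
    for (int j = 0; j < ncols; ++j) {
        maskh_shared[j*D + tid] = slopeh*maskh[j*ne11 + k_VKQ_0 + tid];
    }
    __syncthreads();

    // Copy the tile back out only so the sketch is self-contained.
#pragma unroll
    for (int j = 0; j < ncols; ++j) {
        maskh_shared_out[j*D + tid] = maskh_shared[j*D + tid];
    }
}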
I'm looking into the bug report #13733 and it seems there is a problem with the change in this PR.
I can reproduce with this command:
./bin/llama-parallel -hf ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF -np 2 -ns 32 --top-k 1 --junk 131 -c 16384 -fa
This will output repetitive junk.
If I apply this patch to disable the skipping logic, it works:
diff --git a/ggml/src/ggml-cuda/fattn-vec-f32.cuh b/ggml/src/ggml-cuda/fattn-vec-f32.cuh
index 49c592ea5..b141c233c 100644
--- a/ggml/src/ggml-cuda/fattn-vec-f32.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec-f32.cuh
@@ -217,7 +217,7 @@ static __global__ void flash_attn_vec_ext_f32(
}
}
if (__all_sync(0xFFFFFFFF, skip)) {
- continue;
+ //continue;
}
#endif // GGML_USE_HIP
}
Any ideas what could be wrong?
if (__all_sync(0xFFFFFFFF, skip)) {
    continue;
}
I think we need one more sync here:
if (__all_sync(0xFFFFFFFF, skip)) {
    __syncthreads();
    continue;
}
This change fixes it for me.
Yeah, I think you're right. I initially only read your other comment above but when I looked at the code again myself I came to the same conclusion.
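To spell out why the extra barrier matters, here is a hedged sketch of the control flow (the structure and names are illustrative, not the real kernel): the __all_sync() vote is warp-level, so within one block some warps can take the early continue while others fall through to the __syncthreads() at the end of the iteration. Without the added barrier the warps of a block arrive at different barriers, which is undefined behavior and can leave the shared mask tile inconsistent for the next iteration.

#include <cuda_fp16.h>

// Illustrative only: assumes blockDim.x == D, D a multiple of 32, and ne11 a multiple of D.
template <int D>
__global__ void barrier_matching_sketch(float * out, const half * maskh, const int ne11) {
    __shared__ half mask_tile[D];
    const int tid = threadIdx.x;

    float acc = 0.0f;
    for (int k0 = 0; k0 < ne11; k0 += D) {
        mask_tile[tid] = maskh[k0 + tid];
        __syncthreads(); // the tile is fully written

        const bool skip = __hisinf(mask_tile[tid]) == -1; // -INFINITY == masked out
        if (__all_sync(0xFFFFFFFF, skip)) {               // warp-level vote
            __syncthreads();                              // matches the barrier at the end of the iteration
            continue;
        }

        acc += skip ? 0.0f : __half2float(mask_tile[tid]);
        __syncthreads(); // the tile is overwritten in the next iteration
    }
    out[blockIdx.x*D + tid] = acc;
}

With the extra __syncthreads() every thread of the block passes the same number of barriers per iteration, regardless of which warps skip.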
This PR extends the CUDA FlashAttention vector kernels to skip fully masked-out KV slices, as was done for Metal in #13493. The overhead from adding this check is ~0.5%, which I think is negligible. For CC >= 8.9 this PR should be fine to use as-is for #13194; for Turing and Ampere the vector kernels are currently almost never used.
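As a rough illustration of the skipping logic described above (assumed shapes and names, not the actual kernel code): each thread inspects the mask values it owns for the current KV tile and the warp then votes, so the tile is skipped only if every lane saw nothing but -INFINITY.

#include <cuda_fp16.h>

// Illustrative only: maskh_shared holds ncols*D half values for the current KV tile,
// tid is the thread index with blockDim.x == D assumed.
template <int D, int ncols>
static __device__ bool kv_tile_fully_masked(const half * maskh_shared, const int tid) {
    bool skip = true;
#pragma unroll
    for (int j = 0; j < ncols; ++j) {
        // -INFINITY in the mask means this KV position is not attended to.
        skip = skip && (__hisinf(maskh_shared[j*D + tid]) == -1);
    }
    // Warp-wide vote: true only if all 32 lanes found their values masked out.
    return __all_sync(0xFFFFFFFF, skip);
}

A warp for which this returns true can skip the KQ and VKQ work for the tile; as discussed in the review comments above, any __syncthreads() in the remainder of the iteration then has to be matched before continuing.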
While working on this I tried adding the GQA optimization that the tensor core kernel has to the vector kernels. However, after #12183 it seems that for NVIDIA GPUs it's preferable to use one CUDA block per Q column and to rely on automatic caching (since shared memory is effectively just manually allocated L1 cache). The upside is that the performance of NVIDIA GPUs for 2-8 columns can be improved by using the kernel for 1 column. The downside is that the vector kernels cannot be used as the universally fastest kernel for batch size 1 if you factor in GQA models (the difference is ~5%). To allow skipping unused KV slices for all GPUs it will be necessary to modify the kernel in fattn-mma-f16.cuh. However, since that kernel pre-loads the mask and K data simultaneously, as it stands there would still be unnecessary data loads. More generally, it would be useful if the CUDA backend were given a parameter indicating how fragmented the KV cache is; I could then tune the kernel selection logic accordingly.

Performance changes for old NVIDIA: