
cuda : add batched cuBLAS GEMM for faster attention #3749


Merged
merged 10 commits on Oct 24, 2023
cuda : add TODO for calling cublas from kernel + using mem pool
ggerganov committed Oct 24, 2023
commit d798a17c34f2326093d0cf2c0ea90b8fded15dc6
1 change: 1 addition & 0 deletions ggml-cuda.cu
@@ -7149,6 +7149,7 @@ static void ggml_cuda_mul_mat_mat_batched_cublas(const ggml_tensor * src0, const
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP));
     } else {
         // use cublasGemmBatchedEx
+        // TODO: https://github.com/ggerganov/llama.cpp/pull/3749#discussion_r1369997000
         const int ne23 = ne12*ne13;
 
         // TODO: avoid this alloc
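The new TODO points at a review discussion about serving these temporaries from a memory pool (and, longer term, invoking cuBLAS from a kernel). For context, below is a minimal sketch, not the PR's code, of how a cublasGemmBatchedEx call is typically set up: the per-batch pointer arrays must live in device memory, so each call needs a temporary allocation for them, which is what the "TODO: avoid this alloc" refers to. All names in the sketch are illustrative; in the actual code the batch count is ne23 = ne12*ne13.

// Sketch only: batched f16 GEMM via cublasGemmBatchedEx, column-major layout.
// C_i = A_i * B_i for i in [0, batch); A_i is m x k, B_i is k x n, C_i is m x n.

#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>

static void gemm_batched_f16(cublasHandle_t handle,
                             const half * A, const half * B, half * C,
                             int m, int n, int k, int batch) {
    // build host-side arrays of per-batch matrix pointers
    std::vector<const void *> a_ptrs(batch), b_ptrs(batch);
    std::vector<void *>       c_ptrs(batch);
    for (int i = 0; i < batch; ++i) {
        a_ptrs[i] = A + (size_t) i * m * k;
        b_ptrs[i] = B + (size_t) i * k * n;
        c_ptrs[i] = C + (size_t) i * m * n;
    }

    // the pointer arrays themselves must be in device memory; this per-call
    // cudaMalloc is the kind of allocation a memory pool would amortize
    const void ** d_a = nullptr;
    const void ** d_b = nullptr;
    void       ** d_c = nullptr;
    cudaMalloc((void **) &d_a, batch * sizeof(void *));
    cudaMalloc((void **) &d_b, batch * sizeof(void *));
    cudaMalloc((void **) &d_c, batch * sizeof(void *));
    cudaMemcpy(d_a, a_ptrs.data(), batch * sizeof(void *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b_ptrs.data(), batch * sizeof(void *), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, c_ptrs.data(), batch * sizeof(void *), cudaMemcpyHostToDevice);

    // with CUBLAS_COMPUTE_16F, alpha/beta are given in half precision
    const half alpha = __float2half(1.0f);
    const half beta  = __float2half(0.0f);

    cublasGemmBatchedEx(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k,
            &alpha,
            d_a, CUDA_R_16F, m,   // lda = m (column-major)
            d_b, CUDA_R_16F, k,   // ldb = k
            &beta,
            d_c, CUDA_R_16F, m,   // ldc = m
            batch,
            CUBLAS_COMPUTE_16F,
            CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

Note the contrast with the cublasGemmStridedBatchedEx path taken in the if branch above: the strided variant takes a single base pointer plus a stride per matrix and needs no device-side pointer arrays, which is why the pointer-array allocation only shows up on this else path.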