musa: add support for muBLAS and MMA #13149
Conversation
So far, I can only get it working with … @JohannesGaessler @slaren, could you please share some tips on how to debug this kind of issue? I'd appreciate it!
ggml/src/ggml-cuda/ggml-cuda.cu (Outdated)
Why are you manually allocating and deallocating memory instead of using `ggml_cuda_pool_alloc`? Batched FP16 GEMM is used for attention without FlashAttention, so most likely this is where the bug is. I don't remember what the synchronization behavior of `cudaFree` is, but if it's done asynchronously from the kernel executions, that would explain why you get incorrect results.
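(For readers following along — a minimal sketch of the pool-allocator pattern being suggested, written from memory of `ggml-cuda.cu`; exact member names may differ between versions:)

```cpp
// RAII wrapper over the per-context memory pool: the buffer is returned to the
// pool automatically when the object goes out of scope, with no explicit
// cudaMalloc/cudaFree (or musaMalloc/musaFree) and no extra synchronization.
ggml_cuda_pool_alloc<half> src1_f16_alloc(ctx.pool()); // ctx: ggml_backend_cuda_context
src1_f16_alloc.alloc(ne_src1);                         // ne_src1: element count (assumed in scope)
half * src1_f16 = src1_f16_alloc.get();                // device pointer into the pool
```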
Sorry for the late reply — I've been working closely with the MUSA SDK team to refine this.

> Why are you manually allocating and deallocating memory instead of using `ggml_cuda_pool_alloc`?

This is because muBLAS behaves differently from cuBLAS: it requires that arrays of pointers be explicitly allocated in GPU memory and obtain valid GPU addresses. Using `ggml_cuda_pool_alloc` does not meet this requirement in this context.
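(To make the constraint concrete, a hypothetical sketch — not the code from this PR. The batch count `ne23`, the byte strides, and the base pointers are placeholder names, and the MUSA runtime calls are assumed to mirror their CUDA counterparts:)

```cpp
#include <cstdint>
#include <vector>
#include <musa_runtime.h>

// Stage batched pointer tables in explicitly allocated GPU memory before a
// mublasGemmBatchedEx call. a_d/b_d/c_d are device base pointers; sa/sb/sc are
// per-matrix byte strides; ne23 is the number of matrices in the batch.
static void stage_batched_ptrs(const char * a_d, const char * b_d, char * c_d,
                               size_t sa, size_t sb, size_t sc, int64_t ne23,
                               const void *** a_dd, const void *** b_dd, void *** c_dd) {
    std::vector<const void *> a_h(ne23), b_h(ne23);
    std::vector<void *>       c_h(ne23);
    for (int64_t i = 0; i < ne23; i++) {
        a_h[i] = a_d + i*sa;
        b_h[i] = b_d + i*sb;
        c_h[i] = c_d + i*sc;
    }
    // muBLAS requires the pointer arrays themselves to live at valid GPU
    // addresses, hence musaMalloc here instead of the ggml pool allocator.
    musaMalloc((void **) a_dd, ne23*sizeof(const void *));
    musaMalloc((void **) b_dd, ne23*sizeof(const void *));
    musaMalloc((void **) c_dd, ne23*sizeof(void *));
    musaMemcpy(*a_dd, a_h.data(), ne23*sizeof(const void *), musaMemcpyHostToDevice);
    musaMemcpy(*b_dd, b_h.data(), ne23*sizeof(const void *), musaMemcpyHostToDevice);
    musaMemcpy(*c_dd, c_h.data(), ne23*sizeof(void *),       musaMemcpyHostToDevice);
    // The caller passes *a_dd/*b_dd/*c_dd to mublasGemmBatchedEx and musaFrees
    // them only after the GEMM (and any work depending on it) has completed.
}
```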
Please take a look at `ggml_cuda_pool_leg`. Before the current VMM implementation became the default, that is how temporary memory buffers were assigned. It is still used if a GPU returns false for `CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED`. If muBLAS does not work correctly with VMM, the GPU should return false for this property, or the MUSA backend should be compiled with `GGML_CUDA_NO_VMM`.
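(For reference, a sketch of the driver-API check involved; the exact plumbing in ggml differs, but the attribute query itself is the standard one, and `device_id` is assumed in scope:)

```cpp
// Query whether a device supports virtual memory management. ggml falls back
// to the legacy pool (ggml_cuda_pool_leg) for devices that report 0 here.
CUdevice dev;
int vmm_supported = 0;
cuDeviceGet(&dev, device_id);
cuDeviceGetAttribute(&vmm_supported,
    CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, dev);
// On MUSA these calls presumably go through mu-prefixed equivalents, so a
// backend that cannot support VMM could simply report 0 for this attribute.
```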
Sure, I'll double-check this. I don't recall if I’ve tested that option. Thanks for pointing it out!
After trying `-DGGML_CUDA_NO_VMM=ON` again, I remembered why I avoided it: compiling with this option results in garbled output during inference. I also attempted to modify the legacy pool to bypass pooling and use `musaMalloc` directly, but that didn't work either.
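(For anyone reproducing this, the build in question would look roughly like the following, assuming the standard `GGML_MUSA` CMake option:)

```
cmake -B build -DGGML_MUSA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release -j
```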
Alright. For me the bottom line is this: I consider MUSA support to be comparatively low-priority. I am willing to review PRs and help with general development, but this is under the condition that it doesn't interfere with the rest of the code. I would not consider the current code in this PR acceptable in that regard. So I would ask you to either debug and fix the issues with the pool, or to create a completely separate function like `ggml_musa_mul_mat_mublas` (with the understanding that maintenance of this function will be 100% your own responsibility). Relevant context: I'm currently working on vendor-agnostic multi-GPU support; once that is available, I intend to remove `ggml_cuda_op_mul_mat` and refactor the cuBLAS code.
Thank you for the clarification — that makes perfect sense. I agree the current state of the code isn't ideal. I'll go ahead and close this PR for now, take the time to sort out the necessary changes cleanly, and revisit it later with a more self-contained approach. Appreciate your feedback and support!
So you intend to take the approach with a separate function? If yes, you should write it in a way analogous to e.g. `ggml_cuda_mul_mat_vec` (not to be confused with `ggml_cuda_op_mul_mat_vec`), where it's being called directly from `ggml_cuda_mul_mat`. As long as you implement support for non-contiguous tensors, this is going to work correctly for all use cases except `--split-mode row` on master. Soon it will work correctly for all use cases, once I've moved the logic for tensor parallelism out of the CUDA backend.
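(A rough sketch of the dispatch shape being described — hypothetical: `ggml_musa_mul_mat_mublas` is the name proposed earlier in this thread, and `ggml_musa_should_use_mublas` is an illustrative predicate, not an existing function:)

```cpp
static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx,
        const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
#ifdef GGML_USE_MUSA
    // Self-contained MUSA path, analogous to ggml_cuda_mul_mat_vec: it handles
    // non-contiguous tensors itself and never enters ggml_cuda_op_mul_mat.
    if (ggml_musa_should_use_mublas(src0, src1, dst)) {
        ggml_musa_mul_mat_mublas(ctx, src0, src1, dst);
        return;
    }
#endif // GGML_USE_MUSA
    // ... existing branches: ggml_cuda_mul_mat_vec, mul_mat_q, cuBLAS fallback, etc.
}
```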
Sorry for the late reply. I was testing with VMM disabled, and it turns out the issue I encountered was related to the kvcache (please see: #13788). Once I disabled it, everything worked as expected. So it seems I can stick with the legacy VMM implementation by default.
This PR depends on the MUSA SDK version bump: #13647
Hi @JohannesGaessler, could you please also review this one? Thanks.
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Rebased onto …
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
This PR adds support for muBLAS and MMA on Moore Threads GPUs.

Two important notes:

- On `MTT S80 (QY1)` and earlier, muBLAS does not implement GEMM due to hardware limitations.
- muBLAS behaves differently from cuBLAS — specifically, arrays of pointers used in `mublasGemmBatchedEx` must be explicitly allocated in GPU memory and obtain valid GPU addresses.

Testing Done

- `./build/bin/test-backend-ops` passed on `MTT S80` and `MTT S4000`.
- `./build/bin/llama-cli -m ~/models/qwen3_8b_q4_k_m.gguf -ngl 999` runs as expected on both `MTT S80` and `MTT S4000`, with or without the `-fa` flag.