Large performance drop when using pipeline parallelism and layer splitting on multiple GPUs #13751
Comments
Same here. build: cf0a43b (5361), default value for `GGML_SCHED_MAX_COPIES`.
Likely a Windows issue. Try disabling "Hardware-accelerated GPU scheduling" under graphics settings.
Disabling "Hardware-accelerated GPU scheduling" made no difference. The same issue exists on Linux, although the difference is smaller. Without
Are you using native Linux or WSL?
Native.
#13814 should fix this.
I can confirm that #13814 fixes the issue. I am getting expected performance in the layer + pp case on that branch. Thanks.
Problem description
The default value for `GGML_SCHED_MAX_COPIES` is 4. With that value, `-sm layer` performs significantly worse than `-sm none`. Setting `GGML_SCHED_MAX_COPIES` to 1 brings `-sm layer` performance up to the level of `-sm none` and doesn't seem to otherwise negatively impact performance in this use case.
Benchmarks
- CMake options `-DGGML_CUDA=ON`: `-sm layer` performs worse than `-sm none`.
- CMake options `-DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA=ON`: `-sm layer` performs the same as `-sm none`.
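The two build configurations above can be reproduced roughly as follows. This is a sketch: the build directory names, the model path, and the exact `llama-bench` invocation are illustrative, not taken from the report.

```shell
# Default build: GGML_SCHED_MAX_COPIES defaults to 4,
# which enables pipeline parallelism across GPUs
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Workaround build: limit the scheduler to a single copy,
# which disables pipeline parallelism
cmake -B build-1copy -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA=ON
cmake --build build-1copy --config Release

# Compare split modes in each build (model path is illustrative)
./build/bin/llama-bench -m Qwen3-32B-Q4_K_M.gguf -sm layer
./build/bin/llama-bench -m Qwen3-32B-Q4_K_M.gguf -sm none
```

Because `GGML_SCHED_MAX_COPIES` is a compile-time CMake option, switching it requires a rebuild; it cannot be changed at runtime.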
Additional information
Using `--override-tensors` also seems to have the effect of disabling pipeline parallelism, even in builds with `-DGGML_SCHED_MAX_COPIES=4`. When using `-v`, the line `llama_context: pipeline parallelism enabled (n_copies=4)` is not printed when `--override-tensors` is used.
The model used is https://huggingface.co/Qwen/Qwen3-32B-GGUF/blob/main/Qwen3-32B-Q4_K_M.gguf but other model architectures also exhibit the same behaviour. I tested qwen3, qwen3moe, llama, and gemma3.
Disabling pipeline parallelism also improves performance for models that don't fit on a single GPU in the first place. For example, https://huggingface.co/Qwen/Qwen3-235B-A22B-GGUF/tree/main/Q4_K_M goes from 25 t/s to 60 t/s.
All tests were done on Windows. Version of the CUDA Toolkit is 12.9.