llama-server not using GPU #1826
Comments
I'm seeing the same problem with Vulkan. I see all layers of the model actually getting loaded on the GPU, and nvtop shows significant memory use. llama.cpp itself works fine on the same hardware.
Wait, maybe it's actually using the GPU but just insanely bottlenecked on Python CPU performance?
I still have this issue! The GPU is being used when I use llama-cli.
Can you post logs? In my case the GPU was being detected and used, but it was bottlenecked by the CPU because the Python part is slow. I ended up building https://github.com/sanctuary-systems-com/llama_multiserver
Never mind, I used the wrong argument. Make sure that `ngl` (`--n_gpu_layers`) is not negative: `ngl 99999`, not `ngl -99999`, since only `ngl -1` in theory should offload all layers.
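For reference, a minimal sketch of how the sign of `n_gpu_layers` plays out when loading the same model through llama-cpp-python's `Llama` class. The model path is taken from the original report, and the exact wording of the offload log line may differ between versions:

```python
from llama_cpp import Llama

# n_gpu_layers > 0 offloads that many layers; -1 requests all layers.
# Other negative values are passed through as-is and may result in no offload.
llm = Llama(
    model_path="starcoderbase-3b/starcoderbase-3b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the GPU
    verbose=True,      # print the load log, including how many layers were offloaded
)

# Watch stderr for a line similar to "offloaded N/N layers to GPU";
# if it reports 0 offloaded layers, the wheel was likely built without GPU support.
print(llm("def fibonacci(n):", max_tokens=32)["choices"][0]["text"])
```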
After I install llama-cpp-python[server] with CUDA support and run `python3 -m llama_cpp.server --model starcoderbase-3b/starcoderbase-3b.Q4_K_M.gguf --n_gpu_layers 10`, the GPU is not getting used; it's running on the CPU.
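A common cause of this symptom is that the installed wheel was built without CUDA, so every layer silently falls back to the CPU. As a hedged sketch, one way to confirm what the installed binary supports before blaming the server arguments is shown below; `llama_supports_gpu_offload` is exposed by recent versions of the low-level bindings and may be absent in older releases, and the CUDA build flag has changed over time (`-DLLAMA_CUBLAS=on` in older releases, `-DGGML_CUDA=on` more recently), so check the llama-cpp-python README for your version before reinstalling:

```python
# Minimal diagnostic sketch: check whether the installed llama-cpp-python
# binary was built with GPU offload support at all. If this prints False,
# no value of --n_gpu_layers will move work onto the GPU and the package
# needs to be reinstalled with the CUDA build flags enabled.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
print("GPU offload supported:", llama_cpp.llama_supports_gpu_offload())
```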