Handle incompatible DeepSeek GGUFs #394
Conversation
python convert_hf_to_gguf.py --outfile /mydata/Downloads/DeepSeek-V3-0324-Pruned-Coder-411B-q8_0-ik.gguf --outtype q8_0 /mydata/Downloads/DeepSeek-V3-0324-Pruned-Coder-411B/
WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
Using llama.cpp's convert_hf_to_gguf.py works, but if I requantize into IQ4_K, tensor errors pop up:
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, 1
I would rather have convert_hf_to_gguf.py from the ik_llama.cpp repo work.
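(The requantization step that produces the error isn't shown above. As a rough sketch — the binary path, output filename, and thread count are my assumptions, not the reporter's actual command — requantizing the q8_0 GGUF to IQ4_K with the quantize tool would look something like this:)

```sh
# Requantize the q8_0 GGUF produced by convert_hf_to_gguf.py into IQ4_K
# (typical usage: llama-quantize <input.gguf> <output.gguf> <type> [nthreads]).
./build/bin/llama-quantize \
    /mydata/Downloads/DeepSeek-V3-0324-Pruned-Coder-411B-q8_0-ik.gguf \
    /mydata/Downloads/DeepSeek-V3-0324-Pruned-Coder-411B-IQ4_K.gguf \
    IQ4_K 16
```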
Yes, the
I'm testing now, with the latest DeepSeek V3 0324 Q2_K_XL quant, on 128 GB VRAM (5090 + 2x 4090 + A6000) and 192 GB RAM (6000 MHz, 7800X3D). But first, I just noticed this:
Is there a way to load on the GPU first and then the CPU? This explains why on ik_llama.cpp I get 5-20 t/s on PP vs 60-100 t/s on llama.cpp (on the latter it looks like this):
Okay, now regarding the model itself: I loaded it with the following (no FA, since I think FA is merged on main but not on this PR) and had to change the allocation a bit to make it work.
And it loads without issues.
Then generating also works without issues! Speeds look like this:
For reference, as I mentioned above, on llama.cpp with the same command but with CUDA0 loading first instead of the CPU, I get:
So I can confirm the latest quants with MLA work on ik_llama.cpp.
@Panchovix Thanks for testing! Why don't you simply use the same tensor overrides that you use with mainline llama.cpp? If you post your command …
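(For illustration: tensor overrides in mainline llama.cpp and ik_llama.cpp are passed as regex=buffer pairs via -ot / --override-tensor. The model filename and the specific regexes below are placeholders chosen to show the shape of such a command, not the overrides actually used in this thread.)

```sh
# Illustrative only: pin a few layers' routed experts to specific GPUs, keep the
# remaining expert tensors on the CPU, and offload everything else with -ngl.
./build/bin/llama-server -m DeepSeek-V3-0324-Q2_K_XL.gguf \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*_exps\.=CUDA0" \
    -ot "blk\.(6|7|8)\.ffn_.*_exps\.=CUDA1" \
    -ot "ffn_.*_exps\.=CPU" \
    -c 16384
```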
Had to modify it, as I use -fa on main llama.cpp and I think this PR was done before FA + MLA was possible on main. The compute buffers with FA were 3.7 GB and then 400 MB each on main llama.cpp, while here on ik_llama.cpp each buffer was 4.5 GB (which is nearly one tensor per GPU) without FA. My command on main is:
Adding -ub 1024 increases PP from 66 t/s to 100 t/s, and -ub 1536 to 126 t/s. Sometimes it tries to load on the CPU first, but I cancel and start it again until it starts to load on CUDA0. That way PP t/s performs as it should. If it loads on the CPU first it drops to 20 t/s or less, so it's the same behaviour as ik_llama.cpp, for example.
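(As a concrete illustration of the u-batch experiment described above — the model path and other flags are placeholders, not the exact command from this thread; only the -ub values and the reported speeds come from the comment:)

```sh
# Same command each time, only the physical u-batch size changes:
./build/bin/llama-server -m model.gguf -ngl 99 -fa -ub 512   # default u-batch: ~66 t/s PP reported
./build/bin/llama-server -m model.gguf -ngl 99 -fa -ub 1024  # ~100 t/s PP reported
./build/bin/llama-server -m model.gguf -ngl 99 -fa -ub 1536  # ~126 t/s PP reported
```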
I have merged this PR. If you take the current main branch …
it should give you similar TG performance to current llama.cpp. If you have the patience to wait for the longer loading time, adding … As …
Thanks! I went ahead and tested; this is the output:
I could probably add one more layer to a 4090 (5 GB left) and one to the other 4090 (4 GB left). PP is still slower than on main llama.cpp, but I think that's because of the reason I mentioned before: on ik_llama.cpp the main GPU doesn't seem to get saturated at the start the way it does on llama.cpp, and the same thing happens on main llama.cpp if it loads on the CPU first, before CUDA0. As you can see in the output,
it starts loading from the CPU buffer instead of CUDA0. This also seems to make the CPU stutter a bit while loading. I haven't tested with mmap yet. RX/TX looks like this on ik_llama.cpp:
While on main llama.cpp it looks like this (the 5090 at X8 5.0 is saturated):
I tested now on the latest commits of both llama.cpp and ik_llama.cpp, and speeds look like this.
llama.cpp (with the command I mentioned earlier, -ub 1024):
ik_llama.cpp with the command above (-ub 512):
30 t/s PP is still pretty fast considering GPU 0 isn't being saturated. For reference, this is the output from llama.cpp:
I can add more info if needed!
Thanks for the above, I now finally understand. The difference is that … But if it happens that you feel bored, try Maverick (e.g., this model) and see what happens. There is a PR in mainline llama.cpp …
@ikawrakow Ohh, I see! If it's possible to add the reverse feature, it would be great! I think ik_llama.cpp with its optimizations would be faster than llama.cpp for PP t/s if we could do the matrix multiplication on the GPU.
There is PR #405 now. You can try it with as high a u-batch size as you can go. Don't use -rtr, as this will disable the GPU offload of the experts.
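(A sketch of what such a run could look like. The binary, model path, context size, and override regex are placeholders, not taken from this thread; the only grounded parts are the large u-batch size and leaving out -rtr:)

```sh
# Illustrative invocation for testing the PR: push the u-batch size as high as
# VRAM allows and omit -rtr so the expert tensors remain offloadable to the GPU.
./build/bin/llama-server -m DeepSeek-V3-0324-Q2_K_XL.gguf \
    -ngl 99 \
    -ot "ffn_.*_exps\.=CPU" \
    -b 4096 -ub 4096 \
    -c 16384
```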
Mainline llama.cpp PR 12801, which added MLA support for DeepSeek models 2.5 months after MLA was available here, broke backwards compatibility. As a result, the new DeepSeek GGUFs that started appearing on HF are not compatible with ik_llama.cpp, resulting in issues #373 and #383.
My initial reaction was to not support the new DeepSeek GGUFs, as there was no real reason to introduce the backwards incompatibility (and have people re-download the giant DeepSeek-R1/V3 models). The two new tensors (per layer) required for MLA can be easily created on-the-fly when loading the model, as it is done here.
But after some more thought I decided to handle the incompatible GGUFs, and this functionality is added with this PR.
I have tested with DeepSeek-Lite, which uses the exact same attention architecture as DeepSeek-R1/V3. As I don't have the ability to run the large DeepSeek models, I would really appreciate it if someone confirmed that it works for them.
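(A minimal confirmation could look like the sketch below. This is only an assumption about how one might test; the model filename is a placeholder and the flags shown are generic llama.cpp-style options, not a test procedure prescribed by this PR:)

```sh
# Load a DeepSeek GGUF that was produced with mainline llama.cpp's converter
# (i.e. one of the "incompatible" files) and run a short generation to check
# that it loads and produces sensible output.
./build/bin/llama-cli \
    -m DeepSeek-V3-0324-Q2_K_XL-mainline.gguf \
    -p "Write a haiku about llamas." \
    -n 64
```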
Big caveat: Using an incompatible model will only allow the initial MLA implementation (mla = 1) in this repository, which corresponds to what is done in mainline llama.cpp. The consequences are … mla = 3. The performance degradation increases with increasing context length (number of tokens in the KV cache).
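(For illustration, assuming the MLA variant is selected at run time with a -mla option taking the values mentioned above — the flag name and its usage are my assumption, not confirmed by the text of this PR — the distinction would look roughly like this:)

```sh
# With an "incompatible" mainline-converted GGUF, only the initial MLA
# implementation described above is available:
./build/bin/llama-server -m deepseek-mainline-converted.gguf -mla 1 -c 16384

# With a GGUF produced by this repo's convert_hf_to_gguf.py, the newer
# variant can be selected instead:
./build/bin/llama-server -m deepseek-ik-converted.gguf -mla 3 -c 16384
```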