Fix imatrix calculation for MLA models #411
Conversation
I have been purposefully avoiding reuploading with MLA, without even being aware of this glaring issue :') And of course even these changes you've made, despite me knowing your exact intentions, are black magic to me, so I personally wouldn't have been able to even consider making this change upstream |
Super nice work @ikawrakow ! I had to temporarily disable quantizing the _v and _b matrices and left them in Q8_0 - your new changes are super good - nice again! |
Thank you for this! |
I don't have the hardware to play with DeepSeek-V3/R1, but I'm curious about potential performance gains one can get that way. Published quantized models tend to use high-bit quants for the attention tensors (and after the MLA changes in |
This is why I've been running pure IQ4_K, and my next mix is going to be a mix of IQ4_KS, Q4_K, and IQ4_K. |
Do you accept donations? You could feature such a page on your README explaining the goal of investing in a test bench for your experiments with this fork. You already have a 4090 iirc, so a second-hand CPU server with ~256-512 GB of RAM for ~0.5-1k € on eBay could work. I believe you've helped enough people that some would be willing to help. |
There is a company that wanted to sponsor me to get my hands on a higher end system. It even seemed to go ahead, but it looks like things got lost on their end. I guess I have to remind them. I even own a Ryzen-5975WX system that I inherited from the company I was working for when it died. It has 8 memory slots, but is currently configured with just 4 x 32 GB RAM. It used to be remote but circumstances changed and I got it home just 2-3 days ago. I guess, now I need to get organized, replace the RAM with 8 x 64 GB, and add a second larger SSD (the one currently inside is just 2 TB, and always full to 98% capacity). Oh, a second GPU would be good too so I can finally look into multi-GPU stuff. |
Well, that's amazing news, even if your sponsor doesn't get back to you. |
Haha, this is because you don't know me, and so you don't know for how long I'm going to procrastinate on this.
What is TP? |
Oh you mean procrastinate by instead submitting even more amazing PRs here lmao TP is tensor parallelism, aiming at using 100% of each GPU during inference. But I guess it would require a tremendous amount of work to get there from a codebase that is not meant for such a feature. I don't even know if there would be significant gains because of hybrid inference bottlenecks. https://github.com/turboderp-org/exllamav2/blob/master/exllamav2/exllamav2_ext/ext_tp.cpp |
Ah, OK, TP is one of the things I would look into if I had 2 or more GPUs. I wouldn't dare to do it in the CUDA code, but have some vague ideas how it could be done on the level of the compute graph. I have no idea if/how much performance one would gain. How much faster is exllamav2? |
Without speculative decoding, 2x3090@275w:
Exl3 is supposed to have even better TP performance, but it's not implemented yet. |
So, barely faster than |
Sorry for the long wait. I finally got the time to properly benchmark all the quants in this repo and multiple exl2 sizes of Llama-3.1-Nemotron-Nano-8B-v1 (maybe a bit too much: I tried to generate the exl quants based on the bpw of the equivalent gguf files, and as a result, small quants ended up a lot heavier than their gguf counterparts). I was also curious to see how fast each quant is (for custom mixes), but I didn't convert with --pure for the sake of the benchmark. I used basic standard parameters for both programs, generated 1k tokens * 10, and averaged the results. Using ExllamaV2 0.3.1 and the latest ik_llama.cpp. A single 350W RTX 3090 was used. I didn't benchmark tensor parallelism.
Commands used:
TabbyAPI: CUDA_VISIBLE_DEVICES=0 python main.py --model-dir /home/user/exl --model-name <MODEL> --max-seq-len 4096
llama-server: CUDA_VISIBLE_DEVICES=0 ./ik_llama.cpp/llama-server -m <MODEL> -ngl 99 -mg 0 -c 4096
(yes, I forgot that some types are aliases, and ended up benchmarking everything...)
Tables: EXL2 Models, GGUF Models
For completeness, another plot with PPL metrics could have been useful, but I don't know any program that can compute PPL from an API |
Thanks for the data. Did you accidentally create a second comment instead of editing the first? (I do appreciate the tables for raw data though.) Also, this repo has three types of quants: the k-quants and i-quants, which are also in mainline, and the iqk-quants (see this), which are not found in mainline. This is why some of the green dots are especially close together or show sudden changes in performance: you are putting i-quants and iqk-quants together, even though they are different types of quants. |
@ThomasBaruzier Thanks for the detailed comparison! You are not using flash attention in your llama-server command.
I did a quick comparison to your data for Llama-3.1-Nemotron-Nano-8B-v1 on my 4080 GPU. First let's look at the legacy quants. For 4+ bpw the behavior is as expected: the 4080 has less memory bandwidth, so performance is lower than on your 3090. The difference decreases with decreasing bpw; that's most likely because you did not use FA. But something goes wrong on your GPU sub-4 bpw. k-quants have a very simple unpacking algorithm, so it would be unexpected if the calculation became compute bound so that the faster 4080 pulls ahead because of that.
Things go south for i- and iqk-quants. If I put all 4080 data on the same plot, there is not much of a difference, as TG is memory bound. The only explanation for the massive performance difference below 4 bpw between the 4080 and the 3090 is that the 3090 somehow does not like lookup tables (all i- and iqk-quants use a non-linear mapping between quant index and dequantized model weight, and this requires lookup tables). |
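For reference, the llama-server invocation quoted above with flash attention turned on would look roughly like this (a sketch: it only appends the -fa/--flash-attn flag to the command already shown, with <MODEL> still a placeholder):
CUDA_VISIBLE_DEVICES=0 ./ik_llama.cpp/llama-server -m <MODEL> -ngl 99 -mg 0 -c 4096 -fa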
Thanks for all the feedback! FA helps with 4+ bpw as you predicted, but for i- and iqk-quants I'll investigate further another time; maybe a few param tweaks could help?
Tables: EXL2 Models, GGUF Models |
Is it possible to repack mainline quants somehow to be ik_llama compatible? Rather than doing it on the fly, could one just save a "normal" version of the weights as a copy? That should regain the memory lost from the workaround?
Nah. Regardless of whatever calculations, I can load 70b models in llama.cpp of all kinds. They are about as fast with pipeline parallel, but in tensor parallel it is a much larger difference, as he showed. Plus that is 0 ctx speed; as context builds, its output t/s falls much less. For multi-GPU and dual CPU socket it is a worthy endeavor, 100%. On larger models the responsiveness goes from bleh to wow. |
What do you mean? All mainline quants apart from
where |
I'm guessing he means the wk_b tensors (#259 uses the term "on the fly" as well). And as an answer to his question, a Python script using gguf-py should be able to do it, assuming you have "donor" tensors. (On my system this on-the-fly generation came at a minor but measurable cost, and if I still had any "legacy" quants that I needed to use extensively, I would take this approach.) |
I suspect because the new tensors get created as |
I think I tested that theory, and even accounting for that there was still a difference. I definitely have made quants that use |
If folks are looking for an ik_llama.cpp quantized version of DeepSeek-R1-0528, I just got one cooked up and released on huggingface here. Feel free to use the imatrix in the repo if you are making your own quants to save a step. Details on that are in the model card, and it was generated from the Q8_0.
Gonna definitely look into a smaller one now with attention tensors possibly |
Thank you for the imatrix. I was considering making a discussion thread for DeepSeek-R1-0528. The one we had for V3 was quite nice. |
I would be curious to see how much degradation in quality there is from using 6- or 5-bit quants for the attention tensors and shared experts. It would also be interesting to see how much mainline suffers when quantizing attention with less than |
In theory, if you had the compute and benchmarks, I think https://github.com/Just-Curieous/Curie would result in nice quants, but with a model this big the compute might be very expensive. |
Yes, I wanted to do this after V3-0324, but I think now is the time to try it out. I'll probably go for … I see unsloth is keeping k_b and v_b at Q8_0, but I don't see the actual imatrix data file, hrmm.. |
I thought there was still a penalty to memory, prompt processing, and speed from using MLA-containing mainline quants vs the old ones, even if they load/work. As much as IQ3/Q4 quants sound nice, anything over 250 GB is going to drop down to unusable speeds on my system. I only get about ~50 t/s PP and 10 t/s using IQ2_XXS as it is. If it gets much slower... Usability comes from cramming as much into the GPUs as possible, because the CPUs/memory speed isn't that good. |
Do we need an "AI" agent for this?
#!/bin/sh
# placeholders: model path, imatrix file, and the --custom-q spec for the experts
model=...
imatrix=...
q_exps=...
# quantize attention and shared experts with each candidate type, keeping everything else fixed,
# then measure perplexity of the result
for q in q6_K iq6_k q5_K iq5_k iq5_ks; do
    ./bin/llama-quantize --imatrix $imatrix --custom-q "attn=$q,shexps=$q" --custom-q $q_exps $model tmp.gguf iq3_k
    ./bin/llama-perplexity -m tmp.gguf >>log.out 2>&1
done
grep Final log.out |
There shouldn't be after #409. Just |
If I had the b/w to download the full model and use the script, I'd be golden. But sadly I have to go with what people upload. Losing several GB of GPU memory is another couple of tensors I could otherwise throw on there. Just trying to get a gauge of whether I should avoid any new mainline quants. Unsloth was going to make some kind of 140gb one for the new R1. Even if quality is a little lower, speed is going to be like Qwen.
I use those settings, so it will be mostly the same memory footprint as a native quant? Single GPU for ctx, I see how it doesn't matter but for 4x24 it really does. |
If you want to create a full almost continuous spectrum of quality to size trade-offs you kind of need to do a lot of experimenting. I know ubergarm and EAddario are working on trying to rank tensors/layers to achieve that goal as well, but I do not think a greedy algorithm is optimal, and doing anything more would require more than just using a ranking. |
While I don't have a Ph.D., I didn't have to vibe code this bash script to brute force check these 7 test cases varying attn and shexp while holding all else constant at q4_0. It's gonna take a long while to finish, and then to test perplexity on, though. Will report back later this weekend, hopefully. 👈 Test Case Bash Script
#!/usr/bin/env bash
model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf
imatrix=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat
outdir=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF
basename=DeepSeek-R1-0528
base_q=q4_0
# iterate over list of tuples as attn_k_b shape requires qN_0 types
for q in q8_0,q8_0 q6_0,q6_K q6_0,iq6_k q5_0,q5_K q5_0,iq5_k q5_0,iq5_ks q4_0,q4_0
do
    # unpack tuples into $1,$2
    IFS=","
    set -- $q
    # quantize using $1 for attn_k_b and $2 for rest of attn and base_q for all else
    numactl --interleave=all \
    ./build/bin/llama-quantize \
        --imatrix $imatrix \
        --custom-q attn_k_b=$1 \
        --custom-q attn=$2 \
        --custom-q shexp=$2 \
        --custom-q exps=$base_q \
        $model \
        $outdir/$basename-$base_q-attn-shexp-$2.gguf \
        $base_q \
        2>&1 | tee -a logs/quantize-$basename-$base_q-attn-shexp-$2.gguf
done
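For the perplexity step that follows, something along these lines should work (a sketch, not from the original comment: the evaluation-text path wiki.test.raw and the log naming are assumptions, while the quant file names match the loop above):
#!/usr/bin/env bash
outdir=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF
basename=DeepSeek-R1-0528
base_q=q4_0
# run llama-perplexity on each quant produced above; -f points at the evaluation text (assumed path)
for q in q8_0 q6_K iq6_k q5_K iq5_k iq5_ks q4_0
do
    ./build/bin/llama-perplexity \
        -m $outdir/$basename-$base_q-attn-shexp-$q.gguf \
        -f /mnt/raid/data/wiki.test.raw \
        2>&1 | tee -a logs/ppl-$basename-attn-shexp-$q.log
done
# collect the final PPL values
grep -H Final logs/ppl-*.log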
I haven't tried making an MLA imatrix on mainline, but possibly there are still some issues with the 3D tensor shapes, right? I'll not fuss with this for now; maybe someone else can figure this one out. I'm gonna release a quant today with |
So far I have smoothie qwen, 2 quants of regular qwen and the older V3 (3/24). Those all work. I wanted to get chimera but not sure there is a small enough one out there. The mini R1 from this week, I'm willing to gamble with the smallest quant, if it ever makes an appearance. For the future though, who knows. Might be worth it. |
Good idea, I created one and will link it in my huggingface repo card to try to keep traffic directed there as any questions and discussion arise: #477 |
Mainline llama.cpp implemented MLA for DeepSeek models in this PR, 2.5 months after MLA was available here. The PR broke backwards compatibility with existing DeepSeek GGUFs. The incompatibility was handled in PR #394, and the reduced prompt processing performance with llama.cpp-style MLA GGUFs was recovered in #409.
This PR fixes imatrix calculation for llama.cpp-style MLA GGUFs. The mainline MLA implementation splits the original attn_kv_b 2D tensor into attn_k_b and attn_v_b, which are 3D and have the shape 128 x n_lora x n_head (attn_k_b) and n_lora x 128 x n_head (attn_v_b). When the imatrix tool was written there were only 2D tensors in the models, so it does not really work for the new 3D MLA tensors. There are two issues:
1. A crash in the imatrix tool. The crash was fixed in mainline llama.cpp in PR 13286, and is fixed here with this PR.
2. If the imatrix is computed with a llama.cpp version after PR 13286 was merged, one will not be able to use this imatrix to quantize a model. This PR handles the situation the way it should be handled: the imatrix for the 3D tensors needs to have 128*n_head (attn_k_b) or 512*n_head (attn_v_b) entries.
It is now almost a month since the llama.cpp MLA PR was merged, so I'm wondering what "quant cookers" (as @ubergarm likes to call them) have been doing for MLA models. Hence, pinging @bartowski1182 and @danielhanchen.
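As a rough sanity check of those sizes (a worked example assuming the DeepSeek-V3/R1 dimensions n_head = 128 and n_lora = 512, which are not spelled out above): the attn_k_b imatrix then needs 128*128 = 16384 entries and the attn_v_b imatrix needs 512*128 = 65536 entries, i.e. one entry per element of the tensor's first (activation) dimension for each of the n_head heads.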