llama : initial Mamba-2 support #9126
Conversation
* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
(force-pushed from e9b0d19 to aff9692)
Hey @compilade, thanks for implementing this! I tried converting https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1 using
Nevertheless, I successfully converted and ran a Mamba-Codestral model (remember to select the correct chat template, since the model does not come with one):
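For reference, picking a chat template explicitly at run time can look roughly like this. This is only a sketch: the GGUF file name and the `mistral-v1` template name are placeholders/assumptions, not taken from the comment above, and the template name may need to be adjusted to one available in the build.

```sh
# Sketch only: adjust the model path; the built-in template name is an assumption.
./build/bin/llama-cli -m ./codestral-mamba-7B-Q8_0.gguf \
    -cnv --chat-template mistral-v1 \
    -p "You are a helpful assistant."
```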
The result looks promising, but I have no idea why there are
Link to download GGUF: https://huggingface.co/ngxson/codestral-mamba-llamacpp-test/tree/main
The steps I took to convert Mamba-Codestral-7B-v0.1 are the following:
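As a rough sketch of the usual flow on this branch (the paths, file names, and the `q8_0` type below are placeholders, not the exact commands used in this comment):

```sh
# Placeholder paths; at the time, the upstream repo also used non-standard file
# names, so some renaming of the tokenizer/weight files may have been needed first.
python convert_hf_to_gguf.py /path/to/Mamba-Codestral-7B-v0.1 \
    --outfile ./Mamba-Codestral-7B-v0.1-F16.gguf --outtype f16
./build/bin/llama-quantize ./Mamba-Codestral-7B-v0.1-F16.gguf \
    ./Mamba-Codestral-7B-v0.1-Q8_0.gguf q8_0
```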
I did not have tokenization problems in my tests, maybe because I was using the original SentencePiece tokenizer instead of a BPE tokenizer. There are probably still problems with the SentencePiece tokenizer too, but I think it should be preferred for this model; it should be easier to handle without workarounds. I should change that in
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
Thanks for the guide! I've successfully converted the original repository to GGUF by following your steps. For the […] I'm wondering if […] (Also cc @Vaibhavs10 since he's the maintainer of gguf-my-repo.)
Hey @compilade / @ngxson - JFYI - the transformers weights are now merged in the main repo: https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1
If you face any issues with the conversion, could you open an issue on the repo for us to track! 🤗
Any updates on when Codestral Mamba should be supported?
Nice work! Just a note on the ssm_scan kernel performance: a better fused implementation by the flash-linear-attention project provides functionality equivalent to Mamba2's original kernel (fla-org/flash-linear-attention#49) and runs 2x faster (fla-org/flash-linear-attention#50).
Hi @compilade! I worked on repo conversion for the transformers-compatible mamba2 version, let us know if you need anything from us to move forward with this PR :)
It sounds like having a simple fallback of expected filenames would be a reasonable thing to include here? I don't know that we want to maintain a ton of different ones, but adding a second layer of fallbacks for alternate filenames doesn't feel arduous.
That's not really a problem anymore (at least for Mamba-Codestral) since the official repo was updated in https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1/commit/88085f9cdfa832c3aca8a0315a4520cf7558c947 to use more standard names. What is currently blocking this is that the Metal and CUDA kernels for the SSM scan still need to be adapted to Mamba-2.
Any updates on this?
@compilade So, it means we can do
Seems like the 2nd one could be a better choice for us.
Yeah, I would definitely like to have a try. I am excited about it. Will post again if there is any issue. Thanks a lot! BTW, what do you mean by 'TQ1_0 and TQ2_0' not being good for this model? Do you mean the perplexity will be bad, or the speed and memory? I tried Bi-Mamba with both tq1_0 and tq2_0; the perplexity is fine, as expected. I guess you are referring mostly to memory usage and speed.
@Tangshengku But yes, speed and memory can also be better with a dedicated binary type.
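For comparison, trying the existing ternary types on an F16 GGUF is one `llama-quantize` call per type (the file names here are placeholders):

```sh
# TQ1_0 / TQ2_0 are the existing ternary quantization types in llama.cpp.
./build/bin/llama-quantize ./mamba2-2.7B-F16.gguf ./mamba2-2.7B-TQ1_0.gguf tq1_0
./build/bin/llama-quantize ./mamba2-2.7B-F16.gguf ./mamba2-2.7B-TQ2_0.gguf tq2_0
```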
@compilade Hello, I am back with the Bi-Mamba data type implementation. Some of the code was written with the help of ChatGPT. The implementation is here: https://github.com/Tangshengku/llama.cpp/tree/compilade/mamba2 Current status:
I have tested this implementation with the Bi-Mamba 2.7B model; the perplexity is fine and unchanged. The speed is improved compared with directly using q4_0 or tq2_0:
The speed was tested on an M4 Pro CPU. Further optimization:
How to reproduce the results:
python convert_hf_to_gguf.py xxx/bimamba/2.7B --model-name mamba2-2.7B \
    --outfile ./ckpt/mamba2-2.7B.gguf --outtype f16
./build/bin/llama-quantize ./ckpt/mamba2-2.7B.gguf bi_0
./build/bin/llama-bench -m ./ckpt/ggml-model-BI_0.gguf --n-gpu-layers 0
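To double-check the perplexity claim, something along these lines should work (the dataset path is a placeholder; any long raw text file gives a usable comparison):

```sh
# Placeholder dataset path; compare against the F16 and TQ2_0 models the same way.
./build/bin/llama-perplexity -m ./ckpt/ggml-model-BI_0.gguf \
    -f ./wikitext-2-raw/wiki.test.raw --n-gpu-layers 0
```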
Cross-posting progress on sync'ing this branch with
There is a problem with multi-user (and/or parallel sequence) inference for recurrent models (also on […]). I'll try to figure out what the problem is. Like I said in #7531 (comment), there's a problematic early
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
I found the problem! It was introduced in #12181 (see lines 287 to 291 in 791998b).
The problem is not actually the […]. But in 94c3d53 I've also removed the […]. Now the only thing left is to adapt the CUDA kernel for the SSM scan (added in #10558) to Mamba2.
Thanks for tracking this down. I'll see if I can merge #12799 today, so we can start building on top of it.
Hi @compilade @ggerganov, quick check on the plan for this branch. I'm continuing to push towards Granite 4 support, and I think I'm close to an initial functional version of Bamba with the hybrid cache, but it's dependent on this branch, so I'd love to understand if there's a plan for getting this branch over the line.
I've been attempting to adapt the CUDA implementation of the SSM scan. This is pretty much the last step, I think, unless the proposed changes need more fundamental modifications (e.g. splitting the […]). I'll try to let you know how this progresses. Right now in my local changes the Mamba(1) part of the CUDA operator works with the new structure, but not yet for Mamba2.
Thanks for the update, and much appreciated on the hard work! |
@compilade Hi, I am wondering if I can merge my Bi-Mamba implementation into this branch? On CPU, it works well on my side. Or should I open another merge request after your GPU implementation?
Hi @compilade, I just wanted to check in and see how things are looking for
We should first merge #13746 since it significantly reworks the KV cache logic and interface. There are some comments there about the recurrent cache implementation that I think would be nice to be addressed first (I think they might have been already addressed in this PR, but they can be upstreamed separately from the Mamba implementation). After that is done, we should be able to merge the rest of the code from this PR.
@ggerganov Thanks for the update! Is #13746 the end of the chain for KV-cache refactors before we want to address hybrid caching (#13276) directly? I'm trying to keep the Granite 4 pieces as up-to-date as possible with the inbound changes, so just trying to get a handle on what else to expect.
@ggerganov I assume you likely mean making the kv-cells fully read-only when setting the inputs and maybe also the removal of
@gabe-l-hart I hope it is very near to the end. I have at least one more PR queued after that, related to the KV cache, with some more minor changes. @compilade Let's target #13746
Follow-up from #8519 (comment). This should fix #7727 and fix #8519.
I've implemented the fully recurrent mode of Mamba-2, because it's very similar to Mamba-1, and also because it seems like the most appropriate mode for text generation.
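For context, the per-head recurrence that the fully recurrent mode evaluates is the following (restated in the Mamba-2 paper's notation for orientation, not a new definition introduced by this PR): A and D are scalars per head, x_t is the head's input of size d_head, B_t and C_t are vectors of size d_state shared within a group, and h_t is the per-head state of size d_head × d_state.

```latex
% Mamba-2 recurrent mode, per head and per time step t
\begin{aligned}
h_t &= \exp(\Delta_t A)\, h_{t-1} \;+\; \Delta_t\, x_t B_t^{\top} \\
y_t &= h_t\, C_t \;+\; D\, x_t
\end{aligned}
```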
This does not implement the sequentially semistructured matrix mode, because I'm not yet sure how the block decomposition would fit within the `batch` and `ubatch` framework of `llama.cpp`, and how the chunk size should be chosen. If the recurrent mode is faster at single-user auto-regressive text generation, then I'm not sure how to keep the graph node structure constant when using the most appropriate technique for the batch size. If the sequentially semistructured matrix mode is eventually implemented, it should help with prompt processing speed for large prompts.
What to expect
(mostly taken from #8519 (comment))
The state in Mamba-2 is bigger than I thought; Mamba-Codestral-7B-v0.1 takes 263.5 MiB (in `F32`) per sequence (e.g. with `-np 1`), compared to 38 MiB (also in `F32`) for Falcon-Mamba-7B (which is based on Mamba-1). But that remains constant whatever the context size. Mamba-2 is easier to implement efficiently, so the bigger state does not really impede inference speed.

However, a big downside right now with recurrent models in `llama.cpp` is the lack of state rollback (which is implemented through state checkpoints in #7531, but needs to be re-adapted to #8526), so the prompt will be reprocessed a lot if using `llama-server`. I think using `llama-cli` in conversation mode does not have this problem, however (or maybe only the bare interactive mode with `--in-prefix` and `--in-suffix`, not sure).

This initial implementation is CPU-only, but uses SIMD for the SSM scan, so even though the state is bigger than for Mamba-1 models, in my tests the speed of `Mamba2-130M` is similar or better than `Mamba-130M` (but still not that fast compared to transformer-based models with an empty context), when both are run on CPU. The speed of Mamba-2 models seems comparable to Transformer-based models when the latter have 2k to 4k tokens in their context.
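As a rough sanity check on the 263.5 MiB figure, using Mamba-Codestral-7B-v0.1's published hyper-parameters (64 layers, d_model 4096, expand 2, d_state 128, n_groups 8, d_conv 4), and assuming the per-sequence recurrent state is the SSM state plus the convolution state in F32:

```latex
% Assumed layout: SSM state + conv state, per sequence, in F32
\begin{aligned}
d_{\text{inner}}  &= 2 \cdot 4096 = 8192 \\
\text{SSM state}  &= 64 \cdot 8192 \cdot 128 \cdot 4\,\text{B} = 256\ \text{MiB} \\
\text{conv state} &= 64 \cdot (4 - 1) \cdot (8192 + 2 \cdot 8 \cdot 128) \cdot 4\,\text{B} = 7.5\ \text{MiB} \\
\text{total}      &= 256 + 7.5 = 263.5\ \text{MiB}
\end{aligned}
```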
Summary of changes
- Support for `Mamba2ForCausalLM` (including the official Mamba-2 models, and Mamba-Codestral-7B-v0.1)
  - `config.json` needs to contain `"architectures": ["Mamba2ForCausalLM"],` for the convert script to properly detect the architecture.
- Mamba-1 is treated as having `d_inner` (aka `2 * n_embd`) heads of size 1.
- Mamba-2 support in `ggml_ssm_scan` (in `ggml`):
  - `ssm_a` is broadcast
  - `ssm_d` is fused into `ggml_ssm_scan`
  - the scan uses `GGML_SIMD`; there is no per-element `expf` in the state update, unlike with Mamba-1
  - […] `ggml_ssm_scan` […] `perf` […]
.Other
Here's my favorite quote from Section 3.3 of https://arxiv.org/abs/2405.21060:
TODO
- Rebase onto `master` after merging llama : simplify Mamba with advanced batch splits #8526.
- […] `ggml_ssm_scan` […]
- Maybe remove the `GGML_MUL` fast broadcast path because it's not used anymore to mask the states.
- Maybe use a new metadata key instead of `{arch}.ssm.time_step_rank` for the number of heads of Mamba-2, because it's not really the rank of the time step (well, maybe kind of).
- […] `ssm_d` in `ggml_ssm_scan`?
- Maybe split `ggml_ssm_scan` to separate the implementations for Mamba-1 and Mamba-2, although they do have a lot in common.