SQuat cache implementation by phymhan · Pull Request #38055 · huggingface/transformers · GitHub

SQuat cache implementation #38055


Draft · wants to merge 3 commits into main

Conversation

@phymhan (Contributor) commented May 9, 2025

What does this PR do?

This PR implements our recent work SQuat (arXiv:2503.24358) for KV cache quantization in Transformers. SQuat introduces a method for orthogonally projecting keys to a query subspace, improving the accuracy of quantized attention computation.
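
Loosely, the projection idea can be pictured as follows. This is only a hypothetical illustration of the high-level idea, not the actual SQuat algorithm; the function names, ranks, and shapes are made up:

```python
import torch

def query_subspace_basis(query_states: torch.Tensor, rank: int) -> torch.Tensor:
    # query_states: (num_tokens, head_dim). The top right-singular vectors give an
    # orthonormal basis of shape (head_dim, rank) for the subspace spanned by the queries.
    _, _, vh = torch.linalg.svd(query_states, full_matrices=False)
    return vh[:rank].T

def project_onto_query_subspace(key_states: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # Orthogonal projection of each key onto span(basis): K @ B @ B^T.
    return key_states @ basis @ basis.T
```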

Key changes:

  • Added a new cache_implementation="squat" option for the KV cache, alongside the existing "quantized" option, with support for both the quanto and HQQ backends (see the usage sketch after this list).
  • Introduced offline or prefill-time computation of a query subspace, which requires query states. To enable this, I modified the LLaMA model class (as an example) to pass query_states and attention_mask through cache_kwargs.
  • Fixed a bug in the prefill stage with per-channel quantization, triggered when the sequence length is shorter than, or not a multiple of, residual_length.
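
A minimal usage sketch (the exact cache_config keys accepted by the SQuat backend are an assumption here, mirroring the existing "quantized" option; the checkpoint is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any model whose attention forwards query_states through
# cache_kwargs (LLaMA in this PR) should work.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="squat",                    # new option added in this PR
    cache_config={"backend": "quanto", "nbits": 4},  # assumed keys, mirroring "quantized"
)
print(tok.decode(out[0], skip_special_tokens=True))
```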

Evaluation Results

We evaluated test perplexity under different KV cache implementations, using a script adapted from the Hugging Face blog post on KV cache quantization.

[Figure: Testing Perplexity Comparison]
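
Roughly, such a comparison can be run with something like the sketch below (the checkpoint, evaluation text, and quantized-cache setup are placeholders/assumptions; the actual script follows the blog post). Decoding token by token ensures the quantized cache path is actually exercised:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantizedCacheConfig, QuantoQuantizedCache

# Sketch of a perplexity measurement with a quantized KV cache.
# Requires optimum-quanto for the quanto backend; the checkpoint is a placeholder.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

text = "..."  # evaluation text, e.g. a WikiText-2 passage
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

cache = QuantoQuantizedCache(QuantizedCacheConfig(nbits=4))  # or the SQuat cache from this PR

nlls = []
with torch.no_grad():
    for t in range(ids.shape[1] - 1):
        # Feed one token at a time so keys/values go through the (quantized) cache.
        out = model(ids[:, t : t + 1], past_key_values=cache, use_cache=True)
        nlls.append(torch.nn.functional.cross_entropy(out.logits[:, -1], ids[:, t + 1]))
print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```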

Before submitting

  • This PR implements a new feature based on published research.
  • I have read the contributor guideline.
  • I have updated the documentation as needed.
  • I have written tests for the added functionality.

Who can review?

@github-actions github-actions bot marked this pull request as draft May 9, 2025 22:13
github-actions bot commented May 9, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@zucchini-nlp (Member) commented:

cc @gante as well

@gante (Member) commented May 22, 2025

Hi @phymhan 👋 Thank you for opening the PR with the cool cache technique 🔥

We're experimenting with a new workflow to add generation-related features to transformers: instead of us acting as gate-keepers, deciding what makes it into transformers or not, we've added support for custom generation code loaded from the Hub (docs), which anyone can publish without our intervention.

If the feature becomes popular in terms of usage, then we add it to transformers 🤗 That way, we can ensure the few features we add here are well maintained and documented (as opposed to having many poorly maintained features, which is happening at the moment).

Except for the place where the code is added, nothing changes: I can provide feedback to your Hub repo if you'd like, and promote your cache technique on social media!

Let me know how you'd like to move forward 🙌

@phymhan (Contributor, Author) commented May 22, 2025

Hi @gante, thanks for the thoughtful feedback and for pointing us to the new workflow! We’re definitely interested and will explore how to integrate our method into the custom generation setup.

We had two quick questions:

  1. Our current PR doesn’t modify the generation strategy itself but introduces new cache classes under cache_utils.py. Would this still fit into the custom generation framework?
  2. Our approach requires modifying the model class (we used LLaMA as an example) to pass query_states and attention_mask into cache_kwargs. Is this kind of model-level change supported in the current design?

Appreciate your guidance!

@gante (Member) commented May 23, 2025

@phymhan

Our current PR doesn’t modify the generation strategy itself but introduces new cache classes under cache_utils.py. Would this still fit into the custom generation framework?

Yes! See e.g. this repo, which holds SinkCache (currently being moved from transformers into a custom generation method). The repo is still a WIP, but it's already working, so you can see the code structure of a custom generation method that introduces a new cache type!

Our approach requires modifying the model class (we used LLaMA as an example) to pass query_states and attention_mask into cache_kwargs. Is this kind of model-level change supported in the current design?

This change is more complex, but I think we can make transformers-side changes across models to facilitate this use case. Let me try something; I'll ping you back here with updates.

Meanwhile, to avoid being blocked by me, my suggestion would be to dynamically replace the attention layer in the model instance with an attention layer containing your modifications, inside your custom generation method (see the sketch below). If we are able to design a good solution on the transformers side, it would then require minimal changes to your custom generation method to make it work across models :)
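
For instance, something along these lines could be done inside the custom generate method (a rough sketch; SQuatAttentionWrapper and the LLaMA-style layer layout are assumptions for illustration):

```python
import torch.nn as nn

class SQuatAttentionWrapper(nn.Module):
    """Hypothetical wrapper: delegates to the original attention module, but is the
    place where query states / attention mask would be routed to the SQuat cache."""

    def __init__(self, original_attn, squat_cache):
        super().__init__()
        self.original_attn = original_attn
        self.squat_cache = squat_cache

    def forward(self, *args, **kwargs):
        # A real implementation would compute query_states here and pass them to
        # self.squat_cache (e.g. via cache_kwargs) before running attention.
        return self.original_attn(*args, **kwargs)

def patch_attention(model, squat_cache):
    # Swap each decoder layer's self-attention for the wrapper (LLaMA-style layout assumed).
    for layer in model.model.layers:
        layer.self_attn = SQuatAttentionWrapper(layer.self_attn, squat_cache)
    return model
```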

@phymhan (Contributor, Author) commented May 23, 2025

Hi @gante, thanks for the swift response and for sharing the example repo; this is very helpful! We also really appreciate your willingness to experiment with changes on the transformers side. We'll take a closer look and start exploring integration through the custom generation path. Looking forward to your updates!

@gante (Member) commented May 26, 2025

@phymhan we discussed internally, and we don't want to expand the input surface to the cache classes for now -- it adds significant maintenance overhead on our side, and increases the coupling between the cache classes and the model attention layers.


I do have a potential model-agnostic solution for you to add in your custom code that will likely work, if you want your cache to be compatible with most models. This is pretty much the same as what we do in our vLLM integration 🤗

  1. In the custom generate method, pass your cache instance under a different variable name (i.e. != past_key_values). It should be forwarded all the way to the attention_interface in the attention layer, inside **kwargs.
  2. Be sure to pass use_cache=False, to avoid instantiating the default cache class when past_key_values is None.
  3. Implement a custom attention forward pass, in which you update your cache (available in **kwargs) with QKV and the attention mask. See eager_attention_forward for an example.
  4. Register this custom attention forward pass in ALL_ATTENTION_FUNCTIONS. See our custom attention docs.
  5. Make sure the model uses this custom attention function :) (a rough sketch of these steps follows below)
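
A rough, hypothetical sketch of steps 1-5, assuming the AttentionInterface registration API from the custom attention docs; names like squat_cache and squat_attention_forward are illustrative, and the cache's update() signature is assumed to follow cache_utils conventions:

```python
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.models.llama.modeling_llama import eager_attention_forward

def squat_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Steps 1 and 3: the cache arrives in **kwargs under a non-standard name
    # (forwarded from generate through the model's **kwargs), so the default
    # past_key_values path is never used (generate is called with use_cache=False, step 2).
    squat_cache = kwargs.pop("squat_cache", None)
    if squat_cache is not None:
        # Assumed update() signature, following cache_utils conventions:
        # it receives the extra inputs via cache_kwargs.
        key, value = squat_cache.update(
            key, value, module.layer_idx,
            cache_kwargs={"query_states": query, "attention_mask": attention_mask},
        )
    # Delegate the actual attention computation to the stock eager implementation.
    return eager_attention_forward(
        module, query, key, value, attention_mask, scaling=scaling, dropout=dropout, **kwargs
    )

# Step 4: register it (this populates ALL_ATTENTION_FUNCTIONS); step 5: select it on the model.
AttentionInterface.register("squat_attention", squat_attention_forward)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", attn_implementation="squat_attention"  # placeholder checkpoint
)
```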

@phymhan (Contributor, Author) commented May 26, 2025

Hi @gante, thanks so much for the update and for the guidance on a workaround; that's very helpful! We'll follow the suggested path and keep you posted on our progress!

@gante (Member) commented May 28, 2025

(FYI: the Sink Cache Hub repo is now complete; you can use it as a template for your custom cache :) )

@phymhan (Contributor, Author) commented May 28, 2025

Thanks, that’s great, we’ll take a look! :)

@phymhan (Contributor, Author) commented Jul 9, 2025

Hi @gante — we’ve reviewed the Sink Cache repo, and the issue we’re facing is more similar to SepCache, in that we need to pass additional arguments to .update(). Based on your comment, would you recommend the following plan?

  1. Integrate SQuat into the custom generation method as a new cache class
  2. Apply a monkey patch to .update() to support the extra argument
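
A hypothetical sketch of what step 2 could look like if the extra inputs ride inside the existing cache_kwargs dict (shown as a subclass rather than a monkey patch; the class and key names are illustrative, not taken from this PR):

```python
from transformers.cache_utils import DynamicCache

class SQuatLikeCache(DynamicCache):
    """Hypothetical cache whose update() consumes the extra inputs from cache_kwargs,
    so the update() signature itself stays unchanged."""

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        cache_kwargs = cache_kwargs or {}
        query_states = cache_kwargs.get("query_states")      # extra input SQuat needs
        attention_mask = cache_kwargs.get("attention_mask")
        # ... build / refresh the query subspace from query_states here ...
        return super().update(key_states, value_states, layer_idx, cache_kwargs)
```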

Thank you very much!

@phymhan (Contributor, Author) commented Jul 10, 2025

Hi @gante, we’ve just completed converting this into a custom_generate implementation! You can check it out here. Could you advise on whether we should open a PR and, if so, what the best way to proceed would be? Thanks in advance!
