SQuat cache implementation by phymhan · Pull Request #38055 · huggingface/transformers · GitHub

SQuat cache implementation #38055


Draft · wants to merge 3 commits into main

Conversation

@phymhan (Contributor) commented May 9, 2025

What does this PR do?

This PR implements our recent work SQuat (arXiv:2503.24358) for KV cache quantization in Transformers. SQuat introduces a method for orthogonally projecting keys to a query subspace, improving the accuracy of quantized attention computation.
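
Loosely, the projection idea can be pictured as follows. This is only a hypothetical illustration of the high-level idea, not the actual SQuat algorithm; the function names, ranks, and shapes are made up:

```python
import torch

def query_subspace_basis(query_states: torch.Tensor, rank: int) -> torch.Tensor:
    # query_states: (num_tokens, head_dim). The top right-singular vectors give an
    # orthonormal basis of shape (head_dim, rank) for the subspace spanned by the queries.
    _, _, vh = torch.linalg.svd(query_states, full_matrices=False)
    return vh[:rank].T

def project_onto_query_subspace(key_states: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # Orthogonal projection of each key onto span(basis): K @ B @ B^T.
    return key_states @ basis @ basis.T
```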

Key changes:

  • Added a new cache_implementation="squat" option for the KV cache, alongside the existing "quantized" option, with support for both the quanto and HQQ backends (see the usage sketch after this list).
  • Introduced offline or prefill-time computation of a query subspace, which requires query states. To enable this, I modified the LLaMA model class (as an example) to pass query_states and attention_mask through cache_kwargs.
  • Fixed a bug in the prefill stage with per-channel quantization, triggered when the sequence length is shorter than, or not a multiple of, residual_length.
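
A minimal usage sketch (the exact cache_config keys accepted by the SQuat backend are an assumption here, mirroring the existing "quantized" option; the checkpoint is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any model whose attention forwards query_states through
# cache_kwargs (LLaMA in this PR) should work.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="squat",                    # new option added in this PR
    cache_config={"backend": "quanto", "nbits": 4},  # assumed keys, mirroring "quantized"
)
print(tok.decode(out[0], skip_special_tokens=True))
```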

Evaluation Results

We evaluated test perplexity under different KV cache implementations, using a script adapted from the Hugging Face blog post on KV cache quantization.

[Figure: Testing Perplexity Comparison]
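
Roughly, such a comparison can be run with something like the sketch below (the checkpoint, evaluation text, and quantized-cache setup are placeholders/assumptions; the actual script follows the blog post). Decoding token by token ensures the quantized cache path is actually exercised:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantizedCacheConfig, QuantoQuantizedCache

# Sketch of a perplexity measurement with a quantized KV cache.
# Requires optimum-quanto for the quanto backend; the checkpoint is a placeholder.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

text = "..."  # evaluation text, e.g. a WikiText-2 passage
ids = tok(text, return_tensors="pt").input_ids.to(model.device)

cache = QuantoQuantizedCache(QuantizedCacheConfig(nbits=4))  # or the SQuat cache from this PR

nlls = []
with torch.no_grad():
    for t in range(ids.shape[1] - 1):
        # Feed one token at a time so keys/values go through the (quantized) cache.
        out = model(ids[:, t : t + 1], past_key_values=cache, use_cache=True)
        nlls.append(torch.nn.functional.cross_entropy(out.logits[:, -1], ids[:, t + 1]))
print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```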

Before submitting

  • This PR implements a new feature based on published research.
  • I have read the contributor guideline.
  • I have updated the documentation as needed.
  • I have written tests for the added functionality.

Who can review?

@github-actions github-actions bot marked this pull request as draft May 9, 2025 22:13
github-actions bot commented May 9, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@zucchini-nlp (Member) commented:

cc @gante as well

@gante (Member) commented May 22, 2025

Hi @phymhan 👋 Thank you for opening the PR with the cool cache technique 🔥

We're experimenting with a new workflow to add generation-related features to transformers: instead of us acting as gate-keepers, deciding what makes it into transformers or not, we've added support for custom generation code loaded from the Hub (docs), which anyone can publish without our intervention.

If the feature becomes popular in terms of usage, then we add it to transformers 🤗 That way, we can ensure the few features we add here are well maintained and documented (as opposed to having many poorly maintained features, which is happening at the moment).

Except for the place where the code is added, nothing changes: I can provide feedback to your Hub repo if you'd like, and promote your cache technique on social media!

Let me know how you'd like to move forward 🙌

@phymhan (Contributor, Author) commented May 22, 2025

Hi @gante, thanks for the thoughtful feedback and for pointing us to the new workflow! We’re definitely interested and will explore how to integrate our method into the custom generation setup.

We had two quick questions:

  1. Our current PR doesn’t modify the generation strategy itself but introduces new cache classes under cache_utils.py. Would this still fit into the custom generation framework?
  2. Our approach requires modifying the model class (we used LLaMA as an example) to pass query_states and attention_mask into cache_kwargs. Is this kind of model-level change supported in the current design?

Appreciate your guidance!

@gante (Member) commented May 23, 2025

@phymhan

Our current PR doesn’t modify the generation strategy itself but introduces new cache classes under cache_utils.py. Would this still fit into the custom generation framework?

Yes! See e.g. this repo, which holds SinkCache (currently being moved from transformers into a custom generation method). The repo is still a WIP, but it's already working, so you can see the code structure of a custom generation method that introduces a new cache type!

Our approach requires modifying the model class (we used LLaMA as an example) to pass query_states and attention_mask into cache_kwargs. Is this kind of model-level change supported in the current design?

This change is more complex, but I think we can make transformers-side changes across models to facilitate this use case. Let me try something; I'll ping you back here with updates.

Meanwhile, to avoid being blocked by me, my suggestion would be to dynamically replace the attention layer in the model instance with an attention layer containing your modifications, inside your custom generation method (see the sketch below). If we are able to design a good solution on the transformers side, it would then require minimal changes to your custom generation method to make it work across models :)
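
For instance, something along these lines could be done inside the custom generate method (a rough sketch; SQuatAttentionWrapper and the LLaMA-style layer layout are assumptions for illustration):

```python
import torch.nn as nn

class SQuatAttentionWrapper(nn.Module):
    """Hypothetical wrapper: delegates to the original attention module, but is the
    place where query states / attention mask would be routed to the SQuat cache."""

    def __init__(self, original_attn, squat_cache):
        super().__init__()
        self.original_attn = original_attn
        self.squat_cache = squat_cache

    def forward(self, *args, **kwargs):
        # A real implementation would compute query_states here and pass them to
        # self.squat_cache (e.g. via cache_kwargs) before running attention.
        return self.original_attn(*args, **kwargs)

def patch_attention(model, squat_cache):
    # Swap each decoder layer's self-attention for the wrapper (LLaMA-style layout assumed).
    for layer in model.model.layers:
        layer.self_attn = SQuatAttentionWrapper(layer.self_attn, squat_cache)
    return model
```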

@phymhan (Contributor, Author) commented May 23, 2025

Hi @gante, thanks for the swift response and for sharing the example repo; this is very helpful! We also really appreciate your willingness to experiment with changes on the transformers side. We'll take a closer look and start exploring integration through the custom generation path. Looking forward to your updates!

@gante (Member) commented May 26, 2025

@phymhan we discussed internally, and we don't want to expand the input surface to the cache classes for now -- it adds significant maintenance overhead on our side, and increases the coupling between the cache classes and the model attention layers.


I do have a potential model-agnostic solution for you to add in your custom code that will likely work, if you want your cache to be compatible with most models. This is pretty much the same as what we do in our vLLM integration 🤗

  1. In the custom generate method, pass your cache instance under a different variable name (i.e. != past_key_values). It should be forwarded all the way to the attention_interface in the attention layer, inside **kwargs.
  2. Be sure to pass use_cache=False, to avoid instantiating the default cache class when past_key_values is None.
  3. Implement a custom attention forward pass, in which you update your cache (available in **kwargs) with QKV and the attention mask. See eager_attention_forward for an example.
  4. Register this custom attention forward pass in ALL_ATTENTION_FUNCTIONS. See our custom attention docs.
  5. Make sure the model uses this custom attention function :) (a rough sketch of these steps follows below)
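
A rough, hypothetical sketch of steps 1-5, assuming the AttentionInterface registration API from the custom attention docs; names like squat_cache and squat_attention_forward are illustrative, and the cache's update() signature is assumed to follow cache_utils conventions:

```python
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.models.llama.modeling_llama import eager_attention_forward

def squat_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Steps 1 and 3: the cache arrives in **kwargs under a non-standard name
    # (forwarded from generate through the model's **kwargs), so the default
    # past_key_values path is never used (generate is called with use_cache=False, step 2).
    squat_cache = kwargs.pop("squat_cache", None)
    if squat_cache is not None:
        # Assumed update() signature, following cache_utils conventions:
        # it receives the extra inputs via cache_kwargs.
        key, value = squat_cache.update(
            key, value, module.layer_idx,
            cache_kwargs={"query_states": query, "attention_mask": attention_mask},
        )
    # Delegate the actual attention computation to the stock eager implementation.
    return eager_attention_forward(
        module, query, key, value, attention_mask, scaling=scaling, dropout=dropout, **kwargs
    )

# Step 4: register it (this populates ALL_ATTENTION_FUNCTIONS); step 5: select it on the model.
AttentionInterface.register("squat_attention", squat_attention_forward)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", attn_implementation="squat_attention"  # placeholder checkpoint
)
```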

@phymhan (Contributor, Author) commented May 26, 2025

Hi @gante, thanks so much for the update and for the guidance on a workaround; that's very helpful! We'll follow the suggested path and keep you posted on our progress!

@gante (Member) commented May 28, 2025

(FYI: the Sink Cache Hub repo is now complete; you can use it as a template for your custom cache :) )

@phymhan (Contributor, Author) commented May 28, 2025

Thanks, that’s great, we’ll take a look! :)

@phymhan (Contributor, Author) commented Jul 9, 2025

Hi @gante — we’ve reviewed the Sink Cache repo, and the issue we’re facing is more similar to SepCache, in that we need to pass additional arguments to .update(). Based on your comment, would you recommend the following plan?

  1. Integrate SQuat into the custom generation method as a new cache class
  2. Apply a monkey patch to .update() to support the extra argument
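
A hypothetical sketch of what step 2 could look like if the extra inputs ride inside the existing cache_kwargs dict (shown as a subclass rather than a monkey patch; the class and key names are illustrative, not taken from this PR):

```python
from transformers.cache_utils import DynamicCache

class SQuatLikeCache(DynamicCache):
    """Hypothetical cache whose update() consumes the extra inputs from cache_kwargs,
    so the update() signature itself stays unchanged."""

    def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
        cache_kwargs = cache_kwargs or {}
        query_states = cache_kwargs.get("query_states")      # extra input SQuat needs
        attention_mask = cache_kwargs.get("attention_mask")
        # ... build / refresh the query subspace from query_states here ...
        return super().update(key_states, value_states, layer_idx, cache_kwargs)
```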

Thank you very much!

@phymhan (Contributor, Author) commented Jul 10, 2025

Hi @gante, we’ve just completed converting this into a custom_generate implementation! You can check it out here. Could you advise on whether we should open a PR and, if so, what the best way to proceed would be? Thanks in advance!
