SQuat cache implementation #38055
base: main
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
cc @gante as well
Hi @phymhan 👋 Thank you for opening the PR with the cool cache technique 🔥

We're experimenting with a new workflow to add generation-related features to `transformers`: the new code starts out as a custom repository on the Hub. If the feature becomes popular in terms of usage, then we add it to `transformers` itself. Except for the place where the code is added, nothing changes: I can provide feedback to your Hub repo if you'd like, and promote your cache technique on social media!

Let me know how you'd like to move forward 🙌
Hi @gante, thanks for the thoughtful feedback and for pointing us to the new workflow! We're definitely interested and will explore how to integrate our method into the custom generation setup. We had two quick questions:

1. Is there an existing example of a custom cache hosted on the Hub under this workflow that we could use as a reference?
2. Our cache needs extra inputs from the attention layer (e.g. the query states), which the current cache interface does not receive. Is there a recommended way to get those to the cache?

Appreciate your guidance!
Yes! See e.g. this repo, which holds a custom cache implementation you can use as a reference.
This change is more complex, but I think we can make it work. Meanwhile, to avoid being blocked by me, my suggestion would be to dynamically replace the attention layer in the model from your custom generation code, so that the extra inputs your cache needs can be passed along.
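To make the "dynamically replace/augment the attention layer" idea concrete, here is a minimal sketch of one way to hand extra tensors to a cache without modifying `transformers` itself, using PyTorch forward pre-hooks on the attention modules. It assumes a LLaMA-style model whose attention modules expose `q_proj` and receive `hidden_states` as a keyword argument; the model id, the `pending_query_states` side channel, and the hook mechanism are illustrative assumptions, not the exact approach proposed in this thread.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

# Any LLaMA-style checkpoint works for the sketch; the id is a placeholder.
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

cache = DynamicCache()
cache.pending_query_states = {}  # hypothetical side channel a custom cache could read


def make_hook(layer_idx):
    def hook(module, args, kwargs):
        # LLaMA-style decoder layers call self_attn(hidden_states=..., ...).
        hidden_states = kwargs.get("hidden_states", args[0] if args else None)
        if hidden_states is not None:
            # Pre-RoPE query projection; a real implementation would also apply
            # the rotary embeddings before using these.
            cache.pending_query_states[layer_idx] = module.q_proj(hidden_states)
    return hook


handles = [
    layer.self_attn.register_forward_pre_hook(make_hook(i), with_kwargs=True)
    for i, layer in enumerate(model.model.layers)
]

inputs = tokenizer("KV cache quantization", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8, past_key_values=cache)

for h in handles:
    h.remove()
```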
Hi @gante, thanks for the swift response and for sharing the example repo. This is very helpful! We also really appreciate your willingness to experiment with changes on the `transformers` side.
@phymhan we discussed internally, and we don't want to expand the input surface of the cache classes for now -- it adds significant maintenance overhead on our side, and increases the coupling between the cache classes and the model attention layers.

I do have a potential model-agnostic solution for you to add in your custom code that will likely work, if you want your cache to be compatible with most models. This is pretty much the same as what we do in our vLLM integration 🤗
Hi @gante, thanks so much for the update and for providing guidance on a workaround; that's very helpful! We'll work on it following the suggested path and will keep you posted on our progress!
(FYI: the Sink Cache Hub repo is now complete; you can use it as a template for your custom cache :) )
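For reference, a hedged sketch of how a Hub-hosted recipe like the Sink Cache repo is typically consumed. It assumes a recent `transformers` version that supports loading custom generation code from the Hub via the `custom_generate` argument, that the example lives under `transformers-community/sink_cache`, and uses a small placeholder checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello!", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=16,
    # Pulls the generation loop (and its custom cache) from the Hub repo (assumed repo id).
    custom_generate="transformers-community/sink_cache",
    trust_remote_code=True,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```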
Thanks, that’s great, we’ll take a look! :)
Hi @gante, we've reviewed the Sink Cache repo, and the issue we're facing is more similar to SepCache, in that we need to pass additional arguments (e.g. the query states) to the cache's `update()` method through `cache_kwargs` (sketch below).

Thank you very much!
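As a concrete illustration of what we mean, here is a minimal sketch (not the SQuat implementation) of a custom cache whose `update()` consumes extra tensors from `cache_kwargs`; the keys `"query_states"` and `"attention_mask"` are the inputs we would like the attention layer to pass in.

```python
from typing import Any, Dict, Optional, Tuple

import torch
from transformers.cache_utils import DynamicCache


class ExtraKwargsCache(DynamicCache):
    """Toy cache: stores keys/values like DynamicCache, but also inspects extra
    tensors handed over by the attention layer through `cache_kwargs`."""

    def update(
        self,
        key_states: torch.Tensor,
        value_states: torch.Tensor,
        layer_idx: int,
        cache_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        cache_kwargs = cache_kwargs or {}
        query_states = cache_kwargs.get("query_states")      # extra input we need
        attention_mask = cache_kwargs.get("attention_mask")  # extra input we need
        if query_states is not None:
            # Placeholder: the real cache would use the queries here, e.g. to build
            # the projection applied to the keys before quantization.
            pass
        return super().update(key_states, value_states, layer_idx, cache_kwargs)
```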
What does this PR do?
This PR implements our recent work SQuat (arXiv:2503.24358) for KV cache quantization in Transformers. SQuat introduces a method for orthogonally projecting keys onto a query subspace, improving the accuracy of attention computed from quantized KV caches.
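One way to see why the query subspace matters for quantization accuracy (a sketch in our own notation, not equations taken from the paper): attention logits are inner products between queries and keys, so any key-quantization error that is orthogonal to the span of the queries leaves the logits unchanged.

```latex
% q: a query, k: an exact key, \hat{k}: its quantized version, e = \hat{k} - k.
% If e is orthogonal to the subspace spanned by the queries, the attention
% logit computed with the quantized key matches the exact one:
\[
  q^\top \hat{k} \;=\; q^\top k + q^\top e \;\approx\; q^\top k
  \qquad \text{whenever } e \perp \operatorname{span}\{q_1, \dots, q_m\}.
\]
```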
Key changes:

- A new `cache_implementation="squat"` option for the KV cache, in parallel to the existing `"quantized"` option, with support for both the `quanto` and `HQQ` backends (usage sketch after this list).
- `query_states` and `attention_mask` are passed to the cache through `cache_kwargs`.
- Support for `residual_length` (the number of most recent tokens kept in original precision), as in the existing quantized cache.
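A hedged usage sketch of the new option, mirroring how the existing quantized cache is enabled in `generate`; the exact `cache_config` keys accepted for `"squat"` (and their defaults) are assumptions here, not taken verbatim from this PR.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("KV cache quantization with SQuat:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=32,
    cache_implementation="squat",  # new option added by this PR
    # Assumed config keys, following the existing quantized-cache pattern.
    cache_config={"backend": "quanto", "nbits": 4, "residual_length": 128},
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```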
Evaluation Results
We evaluated test perplexity under different KV cache implementations, using a script adapted from the Hugging Face blog post on KV cache quantization.
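For context, a hedged sketch of the kind of token-by-token perplexity loop used for such comparisons; the dataset, model id, and sequence length are placeholders, and the actual script follows the Hugging Face blog more closely.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids[:, :1024]


@torch.no_grad()
def perplexity(model, input_ids, cache):
    """Feed the sequence one token at a time through the given cache, accumulating NLL."""
    nlls = []
    for i in range(input_ids.shape[1] - 1):
        out = model(input_ids[:, i : i + 1], past_key_values=cache, use_cache=True)
        cache = out.past_key_values
        nlls.append(torch.nn.functional.cross_entropy(out.logits[:, -1, :], input_ids[:, i + 1]))
    return torch.exp(torch.stack(nlls).mean())


# Swap in the cache under test (e.g. a quantized cache, or the SQuat cache from this PR).
print(perplexity(model, input_ids, DynamicCache()))
```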
Before submitting
Who can review?