Bart: new cache format by zucchini-nlp · Pull Request #35314 · huggingface/transformers

Merged: 31 commits merged into huggingface:main on May 16, 2025

Conversation

@zucchini-nlp (Member) commented on Dec 18, 2024:

What does this PR do?

As per the title, this enables the new cache format in Bart and several models copied from Bart. Since there are too many models copying attention from Bart, I decided not to touch the audio ones and changed their "Copied from" statements instead.

TODO:

  • Run all tests for models with new cache + slow ✅
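For readers unfamiliar with the cache classes, here is a minimal sketch (not part of the PR itself) of what the new format looks like from the user side: an `EncoderDecoderCache` instance is passed instead of the legacy tuple of tensors. The checkpoint name and inputs below are placeholders.

```python
from transformers import AutoTokenizer, BartForConditionalGeneration
from transformers import DynamicCache, EncoderDecoderCache

# Placeholder checkpoint and input, purely for illustration.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
inputs = tokenizer("The new cache format", return_tensors="pt")

# Encoder-decoder models keep two caches: one for the decoder's
# self-attention and one for cross-attention over the encoder states.
past_key_values = EncoderDecoderCache(DynamicCache(), DynamicCache())

out = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=10)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```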

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp changed the title from "[WIP] Bart: new cache format" to "Bart: new cache format" on Dec 18, 2024.
@zucchini-nlp (Member, Author) commented:

cc @BenjaminBossan: I am running the slow tests on the transformers side, and the current state of the PR should be almost ready for review, so we might need to run the PEFT tests now.

@BenjaminBossan (Member) commented:

> I am running the slow tests on the transformers side, and the current state of the PR should be almost ready for review, so we might need to run the PEFT tests now.

Thanks for the ping. I skimmed the PR and, if I'm not mistaken, of all the models that were changed, Bart is the only one covered by the PEFT test suite. Therefore, running the tests with `-k bart` should be sufficient. Let me know if you run them; otherwise I can get to it later.

@zucchini-nlp requested a review from eustlb as a code owner on January 13, 2025 at 13:42.
@zucchini-nlp (Member, Author) commented:

Cool, the code owners tagged all relevant people. Ready for review!

Slow tests for the text models that now support the cache class are passing on my end.

@gante (Member) left a comment:


In general LGTM. A few minor nits, hence the approval.

I'm assuming slow tests were run for all touched models and that there are no regressions with respect to main 🔍

(I've reviewed /generation, /models/bart, /tests/generation, and /tests/models/bart. I'm assuming other models follow the same pattern)

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
@ducviet00 (Contributor) commented:

@zucchini-nlp
I am curious about the status of this PR. Could you update it?

@zucchini-nlp (Member, Author) commented:

@ducviet00 sorry, but the PR is blocked by another one. These got a bit stale since we had some higher-priority releases recently. I will be back on Bart next week and get it merged, thanks!

@ducviet00 (Contributor) commented:

Hi @zucchini-nlp,
Can we have Bart with the new cache format? It would boost performance a lot when used with torch.compile.
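As a rough illustration of that point (this is not code from the PR, and whether `cache_implementation="static"` is already supported for BART depends on the installed transformers version), the idea is that a static, fixed-shape cache lets torch.compile reuse one compiled graph across decoding steps:

```python
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

# Placeholder checkpoint and input, purely for illustration.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
inputs = tokenizer("Compile-friendly generation", return_tensors="pt")

# Compile the forward pass; a static cache keeps tensor shapes fixed across
# decoding steps, so the compiled graph can be reused instead of recompiled.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

out = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```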

@zucchini-nlp (Member, Author) commented on May 14, 2025:

It is currently blocked by another PR (#35786). cc @ArthurZucker, can you review it again, please?

@ducviet00 (Contributor) commented:

awesome @zucchini-nlp thank you so much

@ArthurZucker (Collaborator) left a comment:


LGTM, but let's refactor the EncoderDecoderCache to hide the complicated legacy logic!

Comment on lines 1289 to 1291
position_ids = cache_position.unsqueeze(0)
position_ids = self.embed_positions(input, past_key_values_length, position_ids=position_ids)
position_ids = position_ids.to(inputs_embeds.device)
Collaborator:


The unsqueeze can be done inside embed_positions, no?

Member Author:


Oh yeah, in the case of Bart we could. I wanted the module to expect proper 2D position ids, to account for padding, but Bart apparently never used padded positions.
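A hypothetical sketch of the suggestion above; the class below is a simplified stand-in for BART's learned positional embedding, not the PR's actual module. The point is that the embedding layer can accept 1D cache positions and add the batch dimension itself:

```python
import torch
from torch import nn


class LearnedPositionalEmbeddingSketch(nn.Embedding):
    """Simplified stand-in for BART's learned positional embedding."""

    def __init__(self, num_embeddings: int, embedding_dim: int, offset: int = 2):
        # BART reserves a couple of extra positions via an offset.
        super().__init__(num_embeddings + offset, embedding_dim)
        self.offset = offset

    def forward(self, position_ids: torch.Tensor) -> torch.Tensor:
        # Accept 1D cache positions and add the batch dimension here,
        # so callers don't have to unsqueeze before the call.
        if position_ids.dim() == 1:
            position_ids = position_ids.unsqueeze(0)
        return super().forward(position_ids + self.offset)


# Usage: 1D cache positions work directly.
emb = LearnedPositionalEmbeddingSketch(1024, 16)
print(emb(torch.arange(4)).shape)  # torch.Size([1, 4, 16])
```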

next_cache = past_key_values.self_attention_cache
if return_legacy_cache:
    next_cache = past_key_values.to_legacy_cache()

if not return_dict:
Collaborator:


Using `can_return_tuple` would be welcome here as well, but no worries.

@ArthurZucker (Collaborator) left a comment:


Very nice, let's go!

Comment on lines +1245 to 1250
return_legacy_cache = True
logger.warning_once(
    "Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. "
    "You should pass an instance of `EncoderDecoderCache` instead, e.g. "
    "`past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`."
)
Collaborator:


Let's put this warning inside `from_legacy_cache` directly!

Member Author:


I don't think we can put a specific version inside the cache class. Each model is deprecated until a different release, because we update them slowly, at a lower priority than other tasks.

Also, `from_legacy_cache` per se isn't deprecated and will still be available; the warning applies only to the model's forward pass.
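For context, here is a minimal sketch (not from the PR) of the conversion this thread is about. `EncoderDecoderCache.from_legacy_cache` and `to_legacy_cache` are the existing transformers APIs; the tensors and shapes are dummies chosen for illustration.

```python
import torch
from transformers import EncoderDecoderCache

# Legacy format: one tuple per layer of
# (self_attn_key, self_attn_value, cross_attn_key, cross_attn_value).
dummy = torch.zeros(1, 8, 4, 64)  # (batch, num_heads, seq_len, head_dim)
legacy = ((dummy, dummy, dummy, dummy),)

# Wrap the legacy tuples in the new cache class...
cache = EncoderDecoderCache.from_legacy_cache(legacy)

# ...and convert back for callers that still expect the old format.
roundtrip = cache.to_legacy_cache()
assert roundtrip[0][0].shape == legacy[0][0].shape
```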

@zucchini-nlp merged commit 01ad9f4 into huggingface:main on May 16, 2025 (20 checks passed).
@zucchini-nlp (Member, Author) commented:

cc @BenjaminBossan, totally forgot to ping you. Bart is basically the same as T5; hope it won't cause red CI on PEFT 😅
