Add support for Florence-2 #34160
Conversation
033979d to af0fd69
The skipped tests were failing from this line: The model_type is included in the config so I'm not sure why. The rest + the slow tests are passing.
Thanks! @Rocketknight1 would you be down to do a first review?
Sure! Taking it today.
Hi! I just finished a partial review of this PR. However, I've realized that probably over 50% of the modelling code is copied from BART. We actually have a brand new method for handling cases like this, to both simplify the library and the PR! That method is Modular transformers.
If you want to try this in modular transformers style, you can replace the modeling_florence2.py file with modular_florence2.py. In this file, you can import modules from other classes in transformers, like BartEncoderLayer, and then simply define your layers like this, without any Copied from statements:
class Florence2EncoderLayer(BartEncoderLayer):
pass
Thus, the only layers the modular_florence2.py file needs to contain are the ones that are unique, and everything else can be imported. You can see an example of this in action in this PR. If you do it this way, then the modeling_florence2.py will be autogenerated for you, and we don't need to review it in detail. You can leave the tokenization/processing files as-is for now.
Some other comments below as well, but you can ignore the Copied from comments if you refactor to modular style - except as a guide for which classes to inherit from in your modular file!
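For illustration, a modular file along these lines would be enough for the converter to regenerate the full modeling file. This is a rough sketch of the idea, not the actual contents of this PR; the BART imports are real library classes, while Florence2VisionBackbone is a hypothetical name used only for illustration:
import torch.nn as nn
from transformers.models.bart.modeling_bart import BartDecoderLayer, BartEncoder

# Layers that are identical to BART become plain subclasses; the converter
# expands their code into the generated modeling_florence2.py
class Florence2Encoder(BartEncoder):
    pass

class Florence2DecoderLayer(BartDecoderLayer):
    pass

# Only genuinely new parts (e.g. the vision tower) are written out by hand
class Florence2VisionBackbone(nn.Module):  # hypothetical name, for illustration
    ...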
patch_size=[7, 3, 3, 3],
patch_stride=[4, 2, 2, 2],
patch_padding=[3, 1, 1, 1],
patch_prenorm=[False, True, True, True],
dim_embed=[256, 512, 1024, 2048],
num_heads=[8, 16, 32, 64],
num_groups=[8, 16, 32, 64],
depths=[1, 1, 9, 1],
Lists as default arguments are highly dangerous, because they're mutable, and the same mutable list will be shared by the init method and any objects of this class. For example:
config = Florence2VisionConfig()
config.patch_size[3] = 4 # This actually mutates the same list object held by the init method!
new_config = Florence2VisionConfig() # The new config will inherit the mutated patch size!
new_config.patch_size[3] = 5 # This will affect both the init and the first config object!
I recommend either:
- Replace default list args with default tuples, which are immutable, and then convert them to list() in the body of the method. This will create a new list each time, so there will be no shared list to be mutated.
- Replace default list args with None, and then in the body of the method, check for a None value and create a list with the default value if it's present. Again, this ensures there's no shared list to be mutated (see the sketch just after this list).
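A minimal sketch of the second option, reusing the argument names from the diff above (the rest of the config is assumed):
from transformers import PretrainedConfig

class Florence2VisionConfig(PretrainedConfig):
    def __init__(self, patch_size=None, dim_embed=None, **kwargs):
        super().__init__(**kwargs)
        # A fresh list is built on every call, so no mutable default is shared between instances
        self.patch_size = list(patch_size) if patch_size is not None else [7, 3, 3, 3]
        self.dim_embed = list(dim_embed) if dim_embed is not None else [256, 512, 1024, 2048]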
projection_dim=1024,
visual_temporal_embedding=None,
image_pos_embed=None,
image_feature_source=["spatial_avg_pool", "temporal_avg_pool"],
This is also a list as a default argument - it's less likely to be mutated, but still kind of dangerous.
if vision_config is not None:
    vision_config = PretrainedConfig(**vision_config)
self.vision_config = vision_config
self.vocab_size = self.vocab_size
self.vocab_size = self.vocab_size
Redundant line
    [1, T, D] or [T, D].
    """
    shape_len = len(seq_embeds.shape)
    assert 2 <= shape_len <= 3
We generally prefer to avoid asserts in favour of errors, like if ...: raise ValueError()
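For example, the first assert above could become something along these lines (a sketch reusing the variable names from the diff):
shape_len = len(seq_embeds.shape)
if not 2 <= shape_len <= 3:
    raise ValueError(f"seq_embeds must have 2 or 3 dimensions, but got {shape_len}")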
shape_len = len(seq_embeds.shape)
assert 2 <= shape_len <= 3
len_seq = seq_embeds.size(-2)
assert len_seq <= self.max_seq_len
More asserts
return super().forward(input_ids) * self.embed_scale


class Florence2Attention(nn.Module):
Copied from BART
return attn_output, attn_weights_reshaped, past_key_value


class Florence2FlashAttention2(Florence2Attention):
Copied from BART
return attn_output, attn_weights, past_key_value


class Florence2SdpaAttention(Florence2Attention):
Copied from BART
}


class Florence2EncoderLayer(nn.Module):
Copied from BART
return outputs


class Florence2DecoderLayer(nn.Module):
Copied from BART
Thanks for the review, I'll rework this using modular.
Seems to be some issue with modular conversion.
@hlky we think we know what's happening there. There is a fix open for it, give us a little bit!
@hlky modular conversions should be working now - I pulled in changes from another branch! Please let me know if you encounter any other problems. To use the updated converter, pull the latest changes and then run
class Florence2Decoder(
    Florence2LanguagePreTrainedModel,
Modular converter generates Florence2PreTrainedModel here instead of Florence2LanguagePreTrainedModel. I'll look into a fix, but given the similarity to Bart I'm thinking we can just use Bart directly in Florence2ForConditionalGeneration.
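One way to read that suggestion, as an illustration of the idea only (not the PR's actual code), is to let the modular file reuse the BART seq2seq stack wholesale for the language side, so only the vision tower and the multimodal wrapper need bespoke code:
from transformers.models.bart.modeling_bart import BartForConditionalGeneration

# Hypothetical illustration: the language model is just BART under a Florence-2 name
class Florence2LanguageForConditionalGeneration(BartForConditionalGeneration):
    pass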
Yep otherwise I can checkout the PR and push the fix! 🤗
return x, size


class DaViT(nn.Module):
Should we implement DaViT separately to follow the Transformers philosophy of "one model per file"?
I would normally lean towards yes, but I'm not sure how that interacts with modular transformers - cc @ArthurZucker
It's kinda up to you: if you think DaViT will be used alone, maybe, but otherwise we can have FlorenceVit!
Hi @hlky, could you provide an update on the status of this PR?
Hi @ducviet00. That would be great, thank you! 🤗
Hi, how are you getting on with this? No problem if there's no progress, just thought I'd check if you could use any help as I'm using Florence-2 for a project and support would be very helpful for me!
In the meantime, I will review the PR to make sure it's ready! 🚀
Hi @hlky, why was this PR closed? I noticed that the
What does this PR do?
This PR adds support for Florence-2.
Compared to the existing remote code, the main differences are the removal of MySequential, the removal of einops.rearrange, and that trunc_normal_ and DropPath are copied from timm.
Fixes #34155
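As a point of reference, removing einops.rearrange usually just means spelling the reshapes out with native torch ops. A generic, self-contained illustration of such an equivalence (not the PR's actual code; the shapes are made up):
import torch
from einops import rearrange

x = torch.randn(2, 196, 64)  # (batch, height*width, channels)
h = w = 14

# einops version
y1 = rearrange(x, "b (h w) c -> b c h w", h=h, w=w)

# equivalent native torch version
y2 = x.reshape(x.shape[0], h, w, x.shape[-1]).permute(0, 3, 1, 2)

print(torch.equal(y1, y2))  # True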
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.