Add support for Florence-2 #34160
Conversation
033979d to af0fd69
The skipped tests were failing from this line: The model_type is included in the config so I'm not sure why. The rest + the slow tests are passing.
Thanks! @Rocketknight1 would you be down to do a first review?
Sure! Taking it today.
Hi! I just finished a partial review of this PR. However, I've realized that probably over 50% of the modelling code is copied from BART. We actually have a brand new method for handling cases like this, to both simplify the library and the PR! That method is Modular transformers.
If you want to try this in modular transformers style, you can replace the modeling_florence2.py file with modular_florence2.py. In this file, you can import modules from other classes in transformers, like BartEncoderLayer, and then simply define your layers like this, without any Copied from statements:
class Florence2EncoderLayer(BartEncoderLayer):
pass
Thus, the only layers the modular_florence2.py file needs to contain are the ones that are unique, and everything else can be imported. You can see an example of this in action in this PR. If you do it this way, then the modeling_florence2.py will be autogenerated for you, and we don't need to review it in detail. You can leave the tokenization/processing files as-is for now.
Some other comments below as well, but you can ignore the Copied from comments if you refactor to modular style - except as a guide for which classes to inherit from in your modular file!
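For illustration, a modular file along these lines would be enough for the converter to regenerate the full modeling file. This is a rough sketch of the idea, not the actual contents of this PR; the BART imports are real library classes, while Florence2VisionBackbone is a hypothetical name used only for illustration:
import torch.nn as nn
from transformers.models.bart.modeling_bart import BartDecoderLayer, BartEncoder

# Layers that are identical to BART become plain subclasses; the converter
# expands their code into the generated modeling_florence2.py
class Florence2Encoder(BartEncoder):
    pass

class Florence2DecoderLayer(BartDecoderLayer):
    pass

# Only genuinely new parts (e.g. the vision tower) are written out by hand
class Florence2VisionBackbone(nn.Module):  # hypothetical name, for illustration
    ...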
patch_size=[7, 3, 3, 3],
patch_stride=[4, 2, 2, 2],
patch_padding=[3, 1, 1, 1],
patch_prenorm=[False, True, True, True],
dim_embed=[256, 512, 1024, 2048],
num_heads=[8, 16, 32, 64],
num_groups=[8, 16, 32, 64],
depths=[1, 1, 9, 1],
Lists as default arguments are highly dangerous, because they're mutable, and the same mutable list will be shared by the init method and any objects of this class. For example:
config = Florence2VisionConfig()
config.patch_size[3] = 4 # This actually mutates the same list object held by the init method!
new_config = Florence2VisionConfig() # The new config will inherit the mutated patch size!
new_config.patch_size[3] = 5 # This will affect both the init and the first config object!
I recommend either:
- Replace default list args with default tuples, which are immutable, and then convert them to list() in the body of the method. This will create a new list each time, so there will be no shared list to be mutated.
- Replace default list args with None, and then in the body of the method, check for a None value and create a list with the default value if it's present. Again, this ensures there's no shared list to be mutated (see the sketch just after this list).
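A minimal sketch of the second option, reusing the argument names from the diff above (the rest of the config is assumed):
from transformers import PretrainedConfig

class Florence2VisionConfig(PretrainedConfig):
    def __init__(self, patch_size=None, dim_embed=None, **kwargs):
        super().__init__(**kwargs)
        # A fresh list is built on every call, so no mutable default is shared between instances
        self.patch_size = list(patch_size) if patch_size is not None else [7, 3, 3, 3]
        self.dim_embed = list(dim_embed) if dim_embed is not None else [256, 512, 1024, 2048]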
projection_dim=1024,
visual_temporal_embedding=None,
image_pos_embed=None,
image_feature_source=["spatial_avg_pool", "temporal_avg_pool"],
This is also a list as a default argument - it's less likely to be mutated, but still kind of dangerous.
if vision_config is not None:
    vision_config = PretrainedConfig(**vision_config)
self.vision_config = vision_config
self.vocab_size = self.vocab_size
self.vocab_size = self.vocab_size
Redundant line
    [1, T, D] or [T, D].
    """
    shape_len = len(seq_embeds.shape)
    assert 2 <= shape_len <= 3
We generally prefer to avoid asserts in favour of errors, like if ...: raise ValueError()
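For example, the first assert above could become something along these lines (a sketch reusing the variable names from the diff):
shape_len = len(seq_embeds.shape)
if not 2 <= shape_len <= 3:
    raise ValueError(f"seq_embeds must have 2 or 3 dimensions, but got {shape_len}")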
shape_len = len(seq_embeds.shape)
assert 2 <= shape_len <= 3
len_seq = seq_embeds.size(-2)
assert len_seq <= self.max_seq_len
More asserts
return super().forward(input_ids) * self.embed_scale


class Florence2Attention(nn.Module):
Copied from BART
return attn_output, attn_weights_reshaped, past_key_value


class Florence2FlashAttention2(Florence2Attention):
Copied from BART
return attn_output, attn_weights, past_key_value


class Florence2SdpaAttention(Florence2Attention):
Copied from BART
}


class Florence2EncoderLayer(nn.Module):
Copied from BART
return outputs


class Florence2DecoderLayer(nn.Module):
Copied from BART
Thanks for the review, I'll rework this using modular.
Seems to be some issue with modular conversion.
@hlky we think we know what's happening there. There is a fix open for it, give us a little bit!
@hlky modular conversions should be working now - I pulled in changes from another branch! Please let me know if you encounter any other problems. To use the updated converter, pull the latest changes and then run
class Florence2Decoder(
    Florence2LanguagePreTrainedModel,
Modular converter generates Florence2PreTrainedModel here instead of Florence2LanguagePreTrainedModel. I'll look into a fix, but given the similarity to Bart I'm thinking we can just use Bart directly in Florence2ForConditionalGeneration.
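One way to read that suggestion, as an illustration of the idea only (not the PR's actual code), is to let the modular file reuse the BART seq2seq stack wholesale for the language side, so only the vision tower and the multimodal wrapper need bespoke code:
from transformers.models.bart.modeling_bart import BartForConditionalGeneration

# Hypothetical illustration: the language model is just BART under a Florence-2 name
class Florence2LanguageForConditionalGeneration(BartForConditionalGeneration):
    pass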
Yep otherwise I can checkout the PR and push the fix! 🤗
return x, size


class DaViT(nn.Module):
Should we implement DaViT separately to follow the Transformers philosophy of "one model per file"?
I would normally lean towards yes, but I'm not sure how that interacts with modular transformers - cc @ArthurZucker
It's kinda up to you: if you think DaViT will be used alone, maybe, but otherwise we can have FlorenceVit!
Hi @hlky, could you provide an update on the status of this PR?
Hi @ducviet00. That would be great, thank you! 🤗
Hi, how are you getting on with this? No problem if there's no progress, just thought I'd check if you could use any help as I'm using Florence-2 for a project and support would be very helpful for me!
In the meantime, I will review the PR to make sure it's ready! 🚀
Hi @hlky, why was this PR closed? I noticed that the
What does this PR do?
This PR adds support for Florence-2.
Compared to the existing remote code, the main differences are the removal of MySequential, the removal of einops.rearrange, and that trunc_normal_ and DropPath are copied from timm.
Fixes #34155
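As a point of reference, removing einops.rearrange usually just means spelling the reshapes out with native torch ops. A generic, self-contained illustration of such an equivalence (not the PR's actual code; the shapes are made up):
import torch
from einops import rearrange

x = torch.randn(2, 196, 64)  # (batch, height*width, channels)
h = w = 14

# einops version
y1 = rearrange(x, "b (h w) c -> b c h w", h=h, w=w)

# equivalent native torch version
y2 = x.reshape(x.shape[0], h, w, x.shape[-1]).permute(0, 3, 1, 2)

print(torch.equal(y1, y2))  # True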
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.