[WiP] Add EoMT Model by yaswanth19 · Pull Request #37610 · huggingface/transformers · GitHub

[WiP] Add EoMT Model #37610

Draft · wants to merge 26 commits into main

Conversation

@yaswanth19 (Contributor) commented Apr 18, 2025

What does this PR do?

Fixes #37171 and continuation of #37392

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

What's Left:

  • Pre-processing class
    • Targets preprocessing
    • Refactor to HF standards
  • Overall cleanup - docstrings, modular, refactoring
  • Modelling Tests
  • Processor Tests


github-actions bot marked this pull request as draft April 18, 2025 11:13

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@Rocketknight1 (Member) commented

Image segmentation model so cc @qubvel @NielsRogge!

@yaswanth19 (Contributor, Author) commented

@qubvel A rough draft is ready for inference 🤗. Now I am adding support for training; the original implementation uses mask annealing to determine the probability for the attention masks. Is that required in this HF implementation as well (I don't see it for any other model), or are we fine with having a fixed probability for the attention mask?

@qubvel (Member) commented Apr 28, 2025

Hey @yaswanth19, do you mean the mask probability changes during the model training? It would be nice to have, but I'm not sure it can be easily implemented tbh. Maybe we can just add a callback for the Trainer to change it, similar to a learning rate callback? (I mean not adding it to Transformers itself, but to the docs/fine-tuning guide.)

@yaswanth19 (Contributor, Author) commented

do you mean the mask probability changes during the model training?

Yup, exactly, and adding a Trainer callback seems like a good idea. I will check the feasibility of the implementation; if it's simple enough we can implement it in the model itself, otherwise pivot to a Trainer callback.
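
For the callback route, a rough sketch of what I have in mind (the `attn_mask_prob` attribute and the linear schedule are placeholders for however the model and the annealing schedule end up being exposed):

```python
from transformers import TrainerCallback


class MaskAnnealingCallback(TrainerCallback):
    """Rough sketch: linearly anneal the attention-mask probability over training.
    `attn_mask_prob` is a placeholder attribute name for however the EoMT model
    ends up exposing this value; the real schedule may also differ per block."""

    def __init__(self, start_prob: float = 1.0, end_prob: float = 0.0):
        self.start_prob = start_prob
        self.end_prob = end_prob

    def on_step_begin(self, args, state, control, model=None, **kwargs):
        if model is None or state.max_steps == 0:
            return
        progress = state.global_step / state.max_steps
        # Linear interpolation from start_prob to end_prob over the training run.
        model.config.attn_mask_prob = self.start_prob + (self.end_prob - self.start_prob) * progress
```

Users would then pass it via `Trainer(..., callbacks=[MaskAnnealingCallback()])`, which keeps the annealing logic out of the modeling code.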

@tommiekerssies commented

@yaswanth19 Thanks for your great work!

A few thoughts:
• Mask annealing: Setting it to a fixed probability of 1 requires masked attention during inference, which we want to avoid. A fixed probability of 0 removes masked attention but does hurt performance. If mask annealing isn’t feasible right now, I’d suggest setting it to 0 for now. If you do implement it, note that intermediate mask predictions will be required, so the _predict method might need to move into EoMTEncoder. Just make sure intermediate masks are skipped at inference to avoid slowing things down.
• Testing: It might be good to add an integration test that verifies both the mask and class outputs for semantic, instance, and panoptic segmentation (see the sketch at the end of this comment).
• Weight initialization: The current approach likely misinitializes parameters like query embeddings (e.g. std=0.02). All parameters outside the ViT should follow PyTorch defaults (e.g. nn.Embedding uses std=1.0).
• ViT backbone: Would it be simpler to use timm for the ViT? This allows loading pre-trained weights when training EoMT from scratch, avoids unnecessary code, and avoids unsupported use cases like training from a randomly initialized ViT.
• Loss function: Could we reuse the Mask2Former loss? Unless it conflicts with Transformers guidelines, this might reduce redundancy.

Let me know your thoughts.
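
For the integration test point above, I was picturing something like this shape-level sketch for the semantic case (the checkpoint id, the `AutoModelForUniversalSegmentation` mapping, the output field names, and the `post_process_semantic_segmentation` helper are all assumptions borrowed from how Mask2Former is wired up and tested):

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForUniversalSegmentation

# Placeholder checkpoint id; the real one will come with this PR.
checkpoint = "org/eomt-placeholder"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForUniversalSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Query-level outputs: (batch, num_queries, ...) shapes, mirroring the Mask2Former tests.
assert outputs.class_queries_logits.shape[1] == model.config.num_queries
assert outputs.masks_queries_logits.shape[1] == model.config.num_queries

# Post-processing (method name assumed to follow the Mask2Former image-processor API).
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
assert semantic_map.shape == image.size[::-1]
```

A real test would also pin down expected logit slices per task once reference values are available.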

@yaswanth19 (Contributor, Author) commented

Thanks @tommiekerssies for your initial thoughts.

Mask annealing: Setting it to a fixed probability of 1 requires masked attention during inference, which we want to avoid.

  • I am not sure about the mask annealing implementation being natively compatible with Transformers. As I said above, in the worst case we can set it to 0 or use a Trainer callback if that's feasible.

Testing: It might be good to add an integration test

Yup, will add the complete test suite once I have an implementation ready.

Weight initialization: The current approach likely misinitializes parameters like query embeddings

Thanks for bringing this to my attention; I can correct the initialization later on once we have the end-to-end code ready. IMO, most users don't initialize from scratch and will either fine-tune the model or just run inference. Having said that, I will look at the timm implementation and initialize in the same way.

ViT backbone: Would it be simpler to use timm for the ViT?

Ideally yes 😅 But that's not the library's coding standard (we don't want to introduce a hard dependency on timm). Also, using a timm backbone directly would not be compatible with all the other features the HF ecosystem provides, IMO. I am actually referring to both the HF ViT and the timm implementation to get the best of both worlds and to avoid introducing any bugs.

Loss function: Could we reuse the Mask2Former loss?

Transformers has a one-model-one-file philosophy, and because of that I have copied the Mask2Former loss completely here. It's a subjective call with the modular file, in the sense that we could expose the Mask2Former loss and import it here for EoMT (which would require additional changes in mask2former), but that can be discussed during review with the core maintainers.
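
If we do go the modular route, the sketch below is roughly what I mean (assuming the loss and matcher stay importable from modeling_mask2former; the modular converter would then materialize the copied classes inside modeling_eomt.py, keeping the one-model-one-file rule intact):

```python
# modular_eomt.py (sketch): reuse the Mask2Former loss instead of copying it verbatim.
from transformers.models.mask2former.modeling_mask2former import (
    Mask2FormerHungarianMatcher,
    Mask2FormerLoss,
)


class EoMTHungarianMatcher(Mask2FormerHungarianMatcher):
    pass


class EoMTLoss(Mask2FormerLoss):
    pass
```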

@tommiekerssies commented

Thanks for the clarifications!

Regarding mask annealing, I agree that 0 for now is fine. That means effectively disabling masked attention and mask annealing, which is what the current code already does, so no changes needed on that front.

For weight initialization and the ViT backbone, I understand the constraints around using timm. In that case, I’d just make sure that the non-ViT parameters (query embeddings, mask MLP, upscale blocks, class head) aren’t using any custom initializations and instead follow PyTorch defaults. Should be a quick fix.
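
Roughly, the principle I mean is the following (just a sketch; the `is_vit_module` flag is a stand-in for however the EoMT code ends up distinguishing the ViT backbone submodules from the rest):

```python
import torch.nn as nn


def init_weights(module: nn.Module, is_vit_module: bool, std: float = 0.02):
    """Sketch of the principle: custom truncated-normal init only for the ViT backbone;
    everything else (query embeddings, mask MLP, upscale blocks, class head) keeps the
    PyTorch constructor defaults, e.g. nn.Embedding is already N(0, 1) and
    nn.Linear/nn.Conv2d use Kaiming-uniform."""
    if not is_vit_module:
        return  # leave the non-ViT parameters at their PyTorch defaults
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)
```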

Let me know if you’d like me to look at any part in more detail.

@yaswanth19 (Contributor, Author) commented May 1, 2025

Hi @qubvel, I'm working on refactoring the training logic for EoMT, and I'm running into a design challenge:

In the original single‐class implementation, they call _predict (which uses the class predictor head) on intermediate layer outputs to build attention masks. Because everything lives in one class, this is straightforward.

Refer: https://github.com/tue-mps/eomt/blob/c311b377d3189c976163e4ceb2156d90bb7db88f/models/eomt.py#L130

In our modular HF version, the encoder (EoMTEncoder) only runs the transformer blocks, and _predict (with mask_head, upscale_block, and class_predictor) lives in EoMTModel or EoMTForUniversalSegmentation. That separation means the encoder loop can't access _predict, so we can't reconstruct the original training flow.

I have two solutions in mind; LMK your thoughts on the approaches below, or suggest a better alternative:

1.) Fold everything from EoMTEncoder into EoMTForUniversalSegmentation; this way we can do all the processing in a single forward method (see the toy sketch at the end of this comment).

2.) Move _predict (which uses mask_head and upscale_block) into EoMTEncoder and somehow pass in the class_head, i.e. hand these modules to the encoder class so that inside its forward loop it can call _predict, build the attention mask, and feed it into the next block. IMO this is a bit dirty and the flow is tangled 😅.

Here the _predict function is the same code used in Mask2Former to get the mask_logits and class_logits from the model output.
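
To make option 1 concrete, here is a self-contained toy sketch of what the single-forward layout could look like (module names, shapes, and the `mask_logits < 0` rule are all illustrative, not the actual EoMT code):

```python
import torch
import torch.nn as nn


class ToyEoMTForSegmentation(nn.Module):
    """Toy sketch of option 1: the block loop, the query insertion, and the intermediate
    _predict call all live in one class, so the prediction after block i can become the
    attention mask of block i+1."""

    def __init__(self, hidden=64, heads=4, num_layers=4, query_blocks=2, queries=8, classes=10):
        super().__init__()
        self.heads = heads
        self.query_blocks = query_blocks  # how many final blocks see the query tokens
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, heads, batch_first=True) for _ in range(num_layers)
        )
        self.query = nn.Embedding(queries, hidden)
        self.mask_head = nn.Linear(hidden, hidden)    # stand-in for mask MLP + upscale block
        self.class_head = nn.Linear(hidden, classes)  # stand-in class predictor

    def _predict(self, hidden_states):
        q = self.query.num_embeddings
        query_tokens, patch_tokens = hidden_states[:, :q], hidden_states[:, q:]
        class_logits = self.class_head(query_tokens)
        mask_logits = torch.einsum("bqc,bpc->bqp", self.mask_head(query_tokens), patch_tokens)
        return mask_logits, class_logits

    def forward(self, patch_embeddings):  # (batch, num_patches, hidden)
        hidden_states = patch_embeddings
        q = self.query.num_embeddings
        start = len(self.blocks) - self.query_blocks
        attn_mask = None
        for idx, block in enumerate(self.blocks):
            if idx == start:
                # learned queries join the token sequence for the last `query_blocks` blocks
                queries = self.query.weight[None].expand(hidden_states.shape[0], -1, -1)
                hidden_states = torch.cat([queries, hidden_states], dim=1)
            if self.training and idx > start:
                # intermediate prediction drives this block's masked attention
                mask_logits, _ = self._predict(hidden_states)
                length = hidden_states.shape[1]
                attn_mask = torch.zeros(
                    hidden_states.shape[0], length, length,
                    dtype=torch.bool, device=hidden_states.device,
                )
                # queries may only attend to the patches their predicted mask covers
                attn_mask[:, :q, q:] = mask_logits < 0
                attn_mask = attn_mask.repeat_interleave(self.heads, dim=0)
            hidden_states = block(hidden_states, src_mask=attn_mask)
        return self._predict(hidden_states)
```

In this layout EoMTEncoder disappears as a separate module, which is the trade-off of option 1.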

@yaswanth19 (Contributor, Author) commented

@qubvel Pinging you for comments on the above challenge so that I can refactor it this weekend 😅

@NielsRogge (Contributor) commented

@yaswanth19 can you send me your email address so we can set up a Slack channel to discuss the integration, rather than having to discuss everything on the GitHub thread?

@yaswanth19 (Contributor, Author) commented

Yup, sure @NielsRogge. Here is my email: yaswanthgali8@gmail.com

@tommiekerssies commented

Nice work @yaswanth19! For fixing the initialisation, maybe you could have a look here, as the Mask2Former implementation in Hugging Face suffers from the same bug: #35877 (comment)

Successfully merging this pull request may close these issues: Add EoMT

5 participants