Computer Science > Computer Vision and Pattern Recognition

arXiv:2210.15871 (cs)

[Submitted on 28 Oct 2022]

Title: VLT: Vision-Language Transformer and Query Generation for Referring Segmentation

Authors: Henghui Ding, Chang Liu, Suchen Wang, Xudong Jiang

Abstract: We propose a Vision-Language Transformer (VLT) framework for referring segmentation that facilitates deep interactions among multi-modal information and enhances the holistic understanding of vision-language features. The emphasis of a language expression can be understood in different ways, especially when it interacts with the image. However, the learned queries in existing transformer works are fixed after training and cannot cope with the randomness and huge diversity of language expressions. To address this issue, we propose a Query Generation Module, which dynamically produces multiple sets of input-specific queries to represent diverse comprehensions of the language expression. To find the best among these diverse comprehensions and thereby generate a better mask, we propose a Query Balance Module that selectively fuses the corresponding responses of the set of queries. Furthermore, to enhance the model's ability to deal with diverse language expressions, we consider inter-sample learning to explicitly endow the model with the knowledge of understanding different language expressions that refer to the same object. We introduce masked contrastive learning to narrow down the features of different expressions for the same target object while distinguishing the features of different objects. The proposed approach is lightweight and achieves new state-of-the-art referring segmentation results consistently on five datasets.
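
The contrastive objective mentioned in the abstract (pulling together features of different expressions for the same object, pushing apart features of different objects) can be illustrated with a generic supervised-contrastive loss. This is only a minimal sketch under stated assumptions: the abstract does not give the loss formula or describe the masking step, so the function names `cosine` and `contrastive_loss`, the `temperature` value, and the toy 2-D features are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(features, object_ids, temperature=0.1):
    """Generic supervised-contrastive loss (an assumption, not the paper's loss).

    features:   one feature vector per language expression
    object_ids: the target object each expression refers to
    Expressions that share an object id are treated as positives.
    """
    total, count = 0.0, 0
    n = len(features)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if object_ids[j] == object_ids[i]]
        if not positives:
            continue  # no other expression refers to this object
        logits = {j: cosine(features[i], features[j]) / temperature for j in others}
        # Numerically stable log-sum-exp over all other samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits.values())
        )
        for j in positives:
            total += log_denom - logits[j]  # -log softmax probability of the positive
            count += 1
    return total / count if count else 0.0


# Toy features: the first two vectors are nearly parallel, as are the last two.
toy_features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
loss_matched = contrastive_loss(toy_features, [0, 0, 1, 1])
loss_mismatched = contrastive_loss(toy_features, [0, 1, 0, 1])
```

In this toy setup the loss is lower when expressions labelled with the same object already have similar features (`loss_matched`) than when the labels cut across the feature clusters (`loss_mismatched`), which is the behaviour such an objective is meant to encourage.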

Comments:	TPAMI
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2210.15871 [cs.CV]
	(or arXiv:2210.15871v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.15871
Related DOI:	https://doi.org/10.1109/TPAMI.2022.3217852

Submission history

From: Henghui Ding
[v1] Fri, 28 Oct 2022 03:36:07 UTC (8,657 KB)
