Computer Science > Computer Vision and Pattern Recognition

arXiv:2409.19425 (cs)

[Submitted on 28 Sep 2024]

Title: From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

Authors: Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Noel E. O'Connor

Abstract: Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, and their aligned latent space has made them the standard image backbones for vision-language applications. However, this practice has left powerful unimodal encoders for both vision and language underutilized in multimodal applications, which raises a key question: Is there a plausible way to connect unimodal backbones for zero-shot vision-language tasks? To this end, we propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders. Our method exploits the high semantic similarity between the embedding spaces of well-trained vision and language models. It involves selecting encoders that are semantically similar in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, which uses DINOv2 and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements. The proposed framework makes model development more accessible and enables flexible adaptation across diverse scenarios, offering an efficient approach to building multimodal models from existing unimodal architectures. Code and datasets will be released soon.
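The core idea in the abstract, small MLP projectors trained on top of frozen unimodal encoders, can be illustrated with a toy forward pass. The sketch below is not the authors' implementation: the abstract does not state the training objective, so a CLIP-style symmetric contrastive loss is assumed here, and the dimensions, the two-layer ReLU MLP, the temperature of 0.07, and all function names are hypothetical choices for illustration. Random vectors stand in for the frozen encoder outputs (for example DINOv2 image features and All-Roberta-Large caption features), and only the projector weights would be trained.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def init_linear(in_dim, out_dim, rng):
    """Random weights (out_dim rows of in_dim values) and a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def mlp_project(x, layers):
    """Apply linear layers with ReLU in between, then L2-normalize the output."""
    for i, (weight, bias) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weight, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return l2_normalize(x)


def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Assumed objective: average of image-to-text and text-to-image
    cross-entropy, where the i-th image matches the i-th caption."""
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt_emb]
              for i in img_emb]

    def cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            total += m + math.log(sum(math.exp(v - m) for v in row)) - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


# Hypothetical sizes; real encoder features are far larger.
IMG_DIM, TXT_DIM, HIDDEN_DIM, SHARED_DIM, BATCH = 8, 6, 16, 4, 3
rng = random.Random(0)

# One projector per modality; the encoders themselves stay frozen.
image_projector = [init_linear(IMG_DIM, HIDDEN_DIM, rng),
                   init_linear(HIDDEN_DIM, SHARED_DIM, rng)]
text_projector = [init_linear(TXT_DIM, HIDDEN_DIM, rng),
                  init_linear(HIDDEN_DIM, SHARED_DIM, rng)]

# Random stand-ins for frozen encoder outputs on paired images and captions.
image_features = [[rng.gauss(0, 1) for _ in range(IMG_DIM)] for _ in range(BATCH)]
text_features = [[rng.gauss(0, 1) for _ in range(TXT_DIM)] for _ in range(BATCH)]

image_emb = [mlp_project(f, image_projector) for f in image_features]
text_emb = [mlp_project(f, text_projector) for f in text_features]

loss = symmetric_contrastive_loss(image_emb, text_emb)
print(f"contrastive loss with untrained projectors: {loss:.3f}")
```

In an actual training loop, this loss would be minimized with respect to the projector weights only, which is what keeps the data and compute cost low relative to training both encoders end to end. Zero-shot classification would then compare a projected image embedding against projected embeddings of class-name prompts and pick the most similar one.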

Comments:	Preprint, 10 pages; First two authors contributed equally
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2409.19425 [cs.CV]
	(or arXiv:2409.19425v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2409.19425

Submission history

From: Mayug Maniparambil
[v1] Sat, 28 Sep 2024 17:57:32 UTC (3,683 KB)
