Apr 22, 2022 · We propose a method to leverage unimodal vision and text encoders for VL tasks that augments existing VL approaches while conserving computational complexity.
Feb 1, 2023 · Leveraging Pretrained Unimodal Encoders for Vision-Language Tasks via Adaptive Knowledge Distillation.
Cross-modal encoders for vision-language (VL) tasks are often pretrained with carefully curated vision-language datasets. While these datasets reach an ...
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, ...
Prior work demonstrates the importance of leveraging knowledge in pretrained encoders for Vision-Language (VL) tasks. Therefore, several studies focus on ...
Figure caption: Examples of modified samples in the SM validation set (from the preprint "Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks").
We propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures.
Specifically, we propose Multimodal Adaptive Distillation (MAD), which adaptively distills useful knowledge from pretrained encoders to cross-modal VL encoders.
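To make the distillation idea concrete, below is a minimal sketch of sample-wise adaptive feature distillation in PyTorch. It assumes a generic setup in which a frozen pretrained teacher encoder supervises a cross-modal student; the confidence-based weighting heuristic, the function name, and the tensor shapes are illustrative assumptions, not the exact formulation from the MAD paper.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student_feats, teacher_feats, teacher_logits, labels):
    """Sample-wise weighted feature distillation (illustrative sketch).

    student_feats / teacher_feats: (B, D) projected representations.
    teacher_logits: (B, C) teacher task predictions, used to weight samples.
    labels: (B,) ground-truth class indices.
    The weighting scheme (teacher confidence on the true class) is an
    assumed heuristic, not the criterion defined in the MAD paper.
    """
    # Weight each sample by the teacher's confidence on the correct label,
    # so unreliable teacher signals contribute less to the distillation term.
    with torch.no_grad():
        probs = teacher_logits.softmax(dim=-1)
        weights = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # (B,)

    # Per-sample feature-matching loss (cosine distance), then weighted mean.
    per_sample = 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1)
    return (weights * per_sample).mean()

# Example usage with random tensors (batch of 4, 256-dim features, 3 classes).
s = torch.randn(4, 256, requires_grad=True)
t = torch.randn(4, 256)
logits = torch.randn(4, 3)
y = torch.tensor([0, 2, 1, 0])
loss = adaptive_distillation_loss(s, t, logits, y)
loss.backward()
```

In practice this term would be combined with the task loss, so the student learns from ground truth while selectively absorbing the teacher's representation where the teacher is reliable.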
Description:"MAD: Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks" (https://arxiv.org/abs/2204.10496) with ...