Abstract
Traditional image-to-text retrieval models learn joint representations by aligning multimodal features, typically exploiting weak correlations between image and text data, which can introduce noise during modality alignment. To address this problem, we propose a Multimodal Causal CLIP (MMC-CLIP) network that integrates causal semantic relationships into CLIP for the image-to-text retrieval task. First, we employ the Multimodal Causal Discovery (MCD) method, which models the causal relationships among causal variables in both image and text data to construct a multimodal causal graph. Subsequently, we seamlessly integrate the causal nodes extracted from the multimodal causal graph as learnable prompts within the CLIP model, giving rise to the novel framework of Multimodal Causal CLIP. By integrating causal semantics into CLIP, MMC-CLIP effectively strengthens the correlation between causal variables in the image and text modalities, improving alignment of multimodal image-text data. We demonstrate the superiority and generalization of the proposed method by outperforming all strong baselines on the image-to-text retrieval task on the Flickr30K and MSCOCO datasets.
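The pipeline the abstract describes (extract causal variables, build a multimodal causal graph, then prepend the graph's nodes as extra prompt tokens to CLIP's input) can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the causal graph, variable names, and embedding dimension below are all invented for the example, and random vectors stand in for CLIP's learned token embeddings.

```python
import numpy as np

# Toy multimodal causal graph: directed edges (cause -> effects) among
# variables extracted from image regions and caption words.
# All variable names are illustrative, not taken from the paper.
causal_graph = {
    "dog": ["running"],
    "frisbee": ["running"],
    "running": ["grass bent"],
}

def causal_nodes(graph):
    """Collect every variable appearing in the causal graph."""
    nodes = set(graph)
    for effects in graph.values():
        nodes.update(effects)
    return sorted(nodes)

def build_prompted_tokens(caption, graph):
    """Prepend causal nodes as prompt tokens before the caption tokens,
    mimicking how causal semantics could be injected into CLIP's input."""
    return causal_nodes(graph) + caption.split()

tokens = build_prompted_tokens("a dog chases a frisbee", causal_graph)

# Stand-in for learnable prompt/token embeddings: one random d-dim
# vector per token, stacked into the sequence fed to the text encoder.
rng = np.random.default_rng(0)
d = 8
embedding = {t: rng.normal(size=d) for t in set(tokens)}
sequence = np.stack([embedding[t] for t in tokens])  # (num_tokens, d)
```

In the actual MMC-CLIP framework the prompt vectors would be trainable parameters optimized jointly with the retrieval objective; here the stacked array only illustrates that the causal nodes extend the input sequence ahead of the caption tokens.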
D. Cao is the corresponding author.
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 62076210, No. 81973752), the Natural Science Foundation of Xiamen city (No. 3502Z20227188) and the Open Project Program of The Key Laboratory of Cognitive Computing and Intelligent Information Processing of Fujian Education Institutions, Wuyi University (No. KLCCIIP2020203).
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Feng, W., Lin, D., Cao, D. (2024). Multimodal Causal Relations Enhanced CLIP for Image-to-Text Retrieval. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9