Abstract
We introduce DepictQA, a Depicted image Quality Assessment method that overcomes the constraints of traditional score-based methods. By leveraging Multi-modal Large Language Models (MLLMs), DepictQA enables detailed, language-based, human-like evaluation of image quality. Unlike conventional Image Quality Assessment (IQA) methods that rely on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with the human reasoning process. To build the DepictQA model, we establish a hierarchical task framework and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. With these designs, DepictQA outperforms score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA generates more accurate reasoning descriptions. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods.
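As a rough illustration of the "specialized image tags" idea for multi-image processing mentioned in the abstract, the short Python sketch below binds each image in a quality-comparison query to a distinct textual tag before the query is handed to an MLLM. The tag names, prompt wording, placeholder token, and helper function are illustrative assumptions, not the paper's released implementation.

# Minimal sketch of the image-tag idea; all names here are hypothetical.
from typing import List, Tuple

def build_comparison_prompt(image_tags: List[str], question: str) -> Tuple[str, List[str]]:
    """Bind each image slot to a distinct textual tag, then append the question."""
    lines = []
    for tag in image_tags:
        # "<ImageHere>" marks where the vision encoder's tokens would be spliced in;
        # the preceding tag tells the language model which image those tokens belong to.
        lines.append(f"{tag}: <ImageHere>")
    return "\n".join(lines) + "\n" + question, image_tags

prompt, tags = build_comparison_prompt(
    ["<Reference Image>", "<Image A>", "<Image B>"],
    "Which of Image A and Image B better preserves the reference content, and why?",
)
print(prompt)

In such a scheme, the tags give the language model an unambiguous way to refer back to individual images when producing comparative, descriptive judgments.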
Z. You, Z. Li, and J. Gu contributed equally.
Notes
1. The results were tested on the preview web version released in Nov. 2023. A comprehensive evaluation of the stable API version, gpt-4-turbo, is given in [70].
References
Agrawal, H., et al.: Nocaps: novel object captioning at scale. In: ICCV (2019)
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)
Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: CVPR (2015)
Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP (2017)
Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP (2018)
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Cao, Y., Wan, Z., Ren, D., Yan, Z., Zuo, W.: Incorporating semi-supervised and positive-unlabeled learning for boosting full reference image quality assessment. In: CVPR (2022)
Chen, X., et al.: A comparative study of image restoration networks for general backbone network design. arXiv preprint arXiv:2310.11881 (2023)
Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
Chen, Z., et al.: Deep linear array pushbroom image restoration: a degradation pipeline and jitter-aware restoration network. In: AAAI (2024)
Chen, Z., He, Z., Lu, Z., Lu, Z.M.: Prompt-based test-time real image dehazing: a novel pipeline. arXiv preprint arXiv:2309.17389 (2023)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023). https://vicuna.lmsys.org
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
Ding, K., Liu, Y., Zou, X., Wang, S., Ma, K.: Locally adaptive structure and texture similarity for image quality assessment. In: ACM MM (2021)
Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. IEEE TPAMI (2020)
Ghildyal, A., Liu, F.: Shift-tolerant perceptual similarity metric. In: ECCV (2022)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR (2017)
Gu, J., Cai, H., Chen, H., Ye, X., Ren, J., Dong, C.: Image quality assessment for perceptual image restoration: a new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002 (2020)
Gu, J., Cai, H., Chen, H., Ye, X., Ren, J., Dong, C.: PIPAL: a large-scale image quality assessment dataset for perceptual image restoration. arXiv preprint arXiv:2007.12142 (2020)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: ICLR (2021)
Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)
Jayaraman, D., Mittal, A., Moorthy, A.K., Bovik, A.C.: Objective quality assessment of multiply distorted images. In: Asilomar Conference on Signals, Systems and Computers (ASILOMAR) (2012)
Kang, L., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for no-reference image quality assessment. In: CVPR (2014)
Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: ICCV (2021)
Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: EMNLP (2018)
Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging (2010)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: International Conference on Quality of Multimedia Experience (QoMEX) (2019)
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
Liu, X., van de Weijer, J., Bagdanov, A.D.: RankIQA: learning from rankings for no-reference image quality assessment. In: ICCV (2017)
Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
Lu, P., et al.: Learn to explain: multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)
Ma, C., Yang, C.Y., Yang, X., Yang, M.H.: Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding (2017)
Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In: ACL Findings (2022)
Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: WACV (2021)
Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE TIP (2012)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Sign. Process. Lett. (2013)
Moorthy, A.K., Bovik, A.C.: A two-step framework for constructing blind image quality indices. IEEE Sign. Process. Lett. (2010)
Moorthy, A.K., Bovik, A.C.: Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE TIP (2011)
OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card
Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)
Pan, D., Shi, P., Hou, M., Ying, Z., Fu, S., Zhang, Y.: Blind predicting similar quality map for image quality assessment. In: CVPR (2018)
Ponomarenko, N., et al.: Image database TID2013: peculiarities, results and perspectives. Signal Processing: Image Communication (2015)
Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: TID2008-a database for evaluation of full-reference visual quality assessment metrics. In: Advances of Modern Radioelectronics (2009)
Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: PieAPP: perceptual image-error assessment through pairwise preference. In: CVPR (2018)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Saad, M.A., Bovik, A.C., Charrier, C.: Blind image quality assessment: a natural scene statistics approach in the DCT domain. IEEE TIP (2012)
Sheikh, H.: Live image quality assessment database release 2 (2005). http://live.ece.utexas.edu/research/quality
Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE TIP (2006)
Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)
Su, S., et al.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: CVPR (2020)
Sun, S., Yu, T., Xu, J., Zhou, W., Chen, Z.: GraphIQA: learning distortion graph representations for blind image quality assessment. IEEE TMM (2022)
Sun, W., Zhou, F., Liao, Q.: MDID: a multiply distorted image database for image quality assessment. PR (2017)
Tang, H., Joshi, N., Kapoor, A.: Learning a blind measure of perceptual image quality. In: CVPR (2011)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: AAAI (2023)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP (2004)
Wu, H., et al.: Q-bench: a benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181 (2023)
Wu, H., et al.: Q-instruct: improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)
Wu, H., et al.: Q-align: teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
Wu, H., et al.: Towards open-ended visual quality comparison. arXiv preprint arXiv:2402.16641 (2024)
Wu, T., Ma, K., Liang, J., Yang, Y., Zhang, L.: A comprehensive study of multimodal large language models for image quality assessment. arXiv preprint arXiv:2403.10854 (2024)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
Yin, G., et al.: Content-variant reference image quality assessment via knowledge distillation. In: AAAI (2022)
Yin, Z., et al.: LAMM: language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. In: NeurIPS (2023)
You, Z., et al.: Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842 (2024)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. (2014)
Zhang, L., Zhang, L., Mou, X., Zhang, D.: FSIM: a feature similarity index for image quality assessment. IEEE TIP (2011)
Zhang, P., et al.: InternLM-XComposer: a vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)
Zhang, R., et al.: LLaMA-Adapter: efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
Zhang, W., Li, D., Ma, C., Zhai, G., Yang, X., Ma, K.: Continual learning for blind image quality assessment. IEEE TPAMI (2022)
Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: a multitask learning perspective. In: CVPR (2023)
Zhang, Z., Wu, H., Zhang, E., Zhai, G., Lin, W.: A benchmark for multi-modal foundation models on low-level vision: from single images to pairs. arXiv preprint arXiv:2402.07116 (2024)
Zhao, H., et al.: MMICL: empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915 (2023)
Zheng, H., Fu, J., Zeng, Y., Zha, Z.J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: ICCV (2021)
Zhou, W., Wang, Z.: Quality assessment of image super-resolution: balancing deterministic and statistical fidelity. In: ACM MM (2022)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: deep meta-learning for no-reference image quality assessment. In: CVPR (2020)
Acknowledgement
This work was sponsored by CUHK Direct Grants (RCFUS) No. 4055189, National Natural Science Foundation of China (Grant No. 62276251), and the Joint Lab of CAS-HK.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
You, Z., Li, Z., Gu, J., Yin, Z., Xue, T., Dong, C. (2025). Depicting Beyond Scores: Advancing Image Quality Assessment Through Multi-modal Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_15
DOI: https://doi.org/10.1007/978-3-031-72970-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6
eBook Packages: Computer Science, Computer Science (R0)