Abstract
We introduce DepictQA, a Depicted image Quality Assessment method that overcomes the constraints of traditional score-based methods. By leveraging Multi-modal Large Language Models (MLLMs), DepictQA enables detailed, language-based, human-like evaluation of image quality. Unlike conventional Image Quality Assessment (IQA) methods that rely on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with the human reasoning process. To build the DepictQA model, we establish a hierarchical task framework and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. With these designs, DepictQA outperforms score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA generates more accurate reasoning descriptions. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods.
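As a rough illustration of the "specialized image tags" idea for multi-image processing mentioned in the abstract, the short Python sketch below binds each image in a quality-comparison query to a distinct textual tag before the query is handed to an MLLM. The tag names, prompt wording, placeholder token, and helper function are illustrative assumptions, not the paper's released implementation.

# Minimal sketch of the image-tag idea; all names here are hypothetical.
from typing import List, Tuple

def build_comparison_prompt(image_tags: List[str], question: str) -> Tuple[str, List[str]]:
    """Bind each image slot to a distinct textual tag, then append the question."""
    lines = []
    for tag in image_tags:
        # "<ImageHere>" marks where the vision encoder's tokens would be spliced in;
        # the preceding tag tells the language model which image those tokens belong to.
        lines.append(f"{tag}: <ImageHere>")
    return "\n".join(lines) + "\n" + question, image_tags

prompt, tags = build_comparison_prompt(
    ["<Reference Image>", "<Image A>", "<Image B>"],
    "Which of Image A and Image B better preserves the reference content, and why?",
)
print(prompt)

In such a scheme, the tags give the language model an unambiguous way to refer back to individual images when producing comparative, descriptive judgments.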
Z. You, Z. Li, and J. Gu contributed equally.
Notes
1. The results were tested on the preview web version released in Nov. 2023. A comprehensive evaluation of the stable API version, gpt-4-turbo, is given in [70].
References
Agrawal, H., et al.: Nocaps: novel object captioning at scale. In: ICCV (2019)
Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)
Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: CVPR (2015)
Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP (2017)
Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP (2018)
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Cao, Y., Wan, Z., Ren, D., Yan, Z., Zuo, W.: Incorporating semi-supervised and positive-unlabeled learning for boosting full reference image quality assessment. In: CVPR (2022)
Chen, X., et al.: A comparative study of image restoration networks for general backbone network design. arXiv preprint arXiv:2310.11881 (2023)
Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
Chen, Z., et al.: Deep linear array pushbroom image restoration: a degradation pipeline and jitter-aware restoration network. In: AAAI (2024)
Chen, Z., He, Z., Lu, Z., Lu, Z.M.: Prompt-based test-time real image dehazing: a novel pipeline. arXiv preprint arXiv:2309.17389 (2023)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023). https://vicuna.lmsys.org
Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
Ding, K., Liu, Y., Zou, X., Wang, S., Ma, K.: Locally adaptive structure and texture similarity for image quality assessment. In: ACM MM (2021)
Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. IEEE TPAMI (2020)
Ghildyal, A., Liu, F.: Shift-tolerant perceptual similarity metric. In: ECCV (2022)
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR (2017)
Gu, J., Cai, H., Chen, H., Ye, X., Ren, J., Dong, C.: Image quality assessment for perceptual image restoration: a new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002 (2020)
Gu, J., Cai, H., Chen, H., Ye, X., Ren, J., Dong, C.: PIPAL: a large-scale image quality assessment dataset for perceptual image restoration. arXiv preprint arXiv:2007.12142 (2020)
Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: ICLR (2021)
Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)
Jayaraman, D., Mittal, A., Moorthy, A.K., Bovik, A.C.: Objective quality assessment of multiply distorted images. In: Asilomar Conference on Signals, Systems and Computers (ASILOMAR) (2012)
Kang, L., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for no-reference image quality assessment. In: CVPR (2014)
Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: ICCV (2021)
Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: EMNLP (2018)
Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging (2010)
Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: International Conference on Quality of Multimedia Experience (QoMEX) (2019)
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
Liu, X., van de Weijer, J., Bagdanov, A.D.: RankIQA: learning from rankings for no-reference image quality assessment. In: ICCV (2017)
Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
Lu, P., et al.: Learn to explain: multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)
Ma, C., Yang, C.Y., Yang, X., Yang, M.H.: Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding (2017)
Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In: ACL Findings (2022)
Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: WACV (2021)
Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE TIP (2012)
Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Sign. Process. Lett. (2013)
Moorthy, A.K., Bovik, A.C.: A two-step framework for constructing blind image quality indices. IEEE Sign. Process. Lett. (2010)
Moorthy, A.K., Bovik, A.C.: Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE TIP (2011)
OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card
Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)
Pan, D., Shi, P., Hou, M., Ying, Z., Fu, S., Zhang, Y.: Blind predicting similar quality map for image quality assessment. In: CVPR (2018)
Ponomarenko, N., et al.: Image database TID2013: peculiarities, results and perspectives. Signal Processing: Image Communication (2015)
Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: TID2008-a database for evaluation of full-reference visual quality assessment metrics. In: Advances of Modern Radioelectronics (2009)
Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: PieAPP: perceptual image-error assessment through pairwise preference. In: CVPR (2018)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)
Saad, M.A., Bovik, A.C., Charrier, C.: Blind image quality assessment: a natural scene statistics approach in the DCT domain. IEEE TIP (2012)
Sheikh, H.: Live image quality assessment database release 2 (2005). http://live.ece.utexas.edu/research/quality
Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE TIP (2006)
Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)
Su, S., et al.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: CVPR (2020)
Sun, S., Yu, T., Xu, J., Zhou, W., Chen, Z.: GraphIQA: learning distortion graph representations for blind image quality assessment. IEEE TMM (2022)
Sun, W., Zhou, F., Liao, Q.: MDID: a multiply distorted image database for image quality assessment. PR (2017)
Tang, H., Joshi, N., Kapoor, A.: Learning a blind measure of perceptual image quality. In: CVPR (2011)
Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: AAAI (2023)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP (2004)
Wu, H., et al.: Q-bench: a benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181 (2023)
Wu, H., et al.: Q-instruct: improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)
Wu, H., et al.: Q-align: teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
Wu, H., et al.: Towards open-ended visual quality comparison. arXiv preprint arXiv:2402.16641 (2024)
Wu, T., Ma, K., Liang, J., Yang, Y., Zhang, L.: A comprehensive study of multimodal large language models for image quality assessment. arXiv preprint arXiv:2403.10854 (2024)
Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
Yin, G., et al.: Content-variant reference image quality assessment via knowledge distillation. In: AAAI (2022)
Yin, Z., et al.: LAMM: language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. In: NeurIPS (2023)
You, Z., et al.: Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842 (2024)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. (2014)
Zhang, L., Zhang, L., Mou, X., Zhang, D.: FSIM: a feature similarity index for image quality assessment. IEEE TIP (2011)
Zhang, P., et al.: InternLM-XComposer: a vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)
Zhang, R., et al.: LLaMA-Adapter: efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023)
Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
Zhang, W., Li, D., Ma, C., Zhai, G., Yang, X., Ma, K.: Continual learning for blind image quality assessment. IEEE TPAMI (2022)
Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: a multitask learning perspective. In: CVPR (2023)
Zhang, Z., Wu, H., Zhang, E., Zhai, G., Lin, W.: A benchmark for multi-modal foundation models on low-level vision: from single images to pairs. arXiv preprint arXiv:2402.07116 (2024)
Zhao, H., et al.: MMICL: empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915 (2023)
Zheng, H., Fu, J., Zeng, Y., Zha, Z.J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: ICCV (2021)
Zhou, W., Wang, Z.: Quality assessment of image super-resolution: balancing deterministic and statistical fidelity. In: ACM MM (2022)
Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: deep meta-learning for no-reference image quality assessment. In: CVPR (2020)
Acknowledgement
This work was sponsored by CUHK Direct Grants (RCFUS) No. 4055189, National Natural Science Foundation of China (Grant No. 62276251), and the Joint Lab of CAS-HK.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
You, Z., Li, Z., Gu, J., Yin, Z., Xue, T., Dong, C. (2025). Depicting Beyond Scores: Advancing Image Quality Assessment Through Multi-modal Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_15
DOI: https://doi.org/10.1007/978-3-031-72970-6_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72969-0
Online ISBN: 978-3-031-72970-6
eBook Packages: Computer Science, Computer Science (R0)