
Depicting Beyond Scores: Advancing Image Quality Assessment Through Multi-modal Language Models

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA leverages Multi-modal Large Language Models (MLLMs) to provide detailed, language-based, human-like evaluation of image quality. Unlike conventional Image Quality Assessment (IQA) methods that rely on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans’ reasoning process. To build the DepictQA model, we establish a hierarchical task framework and collect a multi-modal IQA training dataset. To tackle the challenges of limited training data and multi-image processing, we propose to use multi-source training data and specialized image tags. These designs enable DepictQA to outperform score-based approaches on multiple benchmarks. Moreover, compared with general MLLMs, DepictQA generates more accurate descriptive reasoning about quality. We also demonstrate that our full-reference dataset can be extended to non-reference applications. These results showcase the research potential of multi-modal IQA methods.
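To make the multi-image design above concrete, the sketch below shows how a reference image and two distorted candidates could be interleaved with specialized image tags in a comparison query for an MLLM. This is a minimal illustration under stated assumptions: the tag names (<img_ref>, <img_a>, <img_b>), the IQASample structure, and the build_prompt helper are hypothetical, not the authors' released interface.

    # Hypothetical sketch of tag-based multi-image prompt assembly for a
    # full-reference comparison query. Tag names and the helper interface are
    # illustrative assumptions, not the authors' released code.
    from dataclasses import dataclass

    @dataclass
    class IQASample:
        reference: str  # path to the pristine reference image
        image_a: str    # path to the first distorted candidate
        image_b: str    # path to the second distorted candidate

    def build_prompt(sample: IQASample, question: str) -> str:
        """Wrap each image in a distinct tag so the model can tell the
        reference and the two candidates apart, then append the question."""
        return (
            f"Reference Image: <img_ref>{sample.reference}</img_ref>\n"
            f"Image A: <img_a>{sample.image_a}</img_a>\n"
            f"Image B: <img_b>{sample.image_b}</img_b>\n"
            f"Question: {question}\n"
            "Describe the distortions in each image, compare them against "
            "the reference, then state which image has higher quality and why."
        )

    sample = IQASample("ref.png", "a_blur.png", "b_noise.png")
    print(build_prompt(sample, "Which image better preserves the reference?"))

The point of the distinct tags is that a plain concatenation of several images is ambiguous to the model; naming each image's role in the prompt lets the generated description refer back to "Image A" or "the reference" unambiguously.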

Z. You, Z. Li, and J. Gu contributed equally.


Notes

  1. The results were tested on the preview web version released in Nov. 2023. A comprehensive evaluation of the stable API version, gpt-4-turbo, is given in [70].

References

  1. Agrawal, H., et al.: Nocaps: novel object captioning at scale. In: ICCV (2019)

  2. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)

  3. Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material recognition in the wild with the materials in context database. In: CVPR (2015)

  4. Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP (2017)

  5. Bosse, S., Maniry, D., Müller, K.R., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE TIP (2018)

  6. Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)

  7. Cao, Y., Wan, Z., Ren, D., Yan, Z., Zuo, W.: Incorporating semi-supervised and positive-unlabeled learning for boosting full reference image quality assessment. In: CVPR (2022)

  8. Chen, X., et al.: A comparative study of image restoration networks for general backbone network design. arXiv preprint arXiv:2310.11881 (2023)

  9. Chen, X., et al.: Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)

  10. Chen, Z., et al.: Deep linear array pushbroom image restoration: a degradation pipeline and jitter-aware restoration network. In: AAAI (2024)

  11. Chen, Z., He, Z., Lu, Z., Lu, Z.M.: Prompt-based test-time real image dehazing: a novel pipeline. arXiv preprint arXiv:2309.17389 (2023)

  12. Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality (2023). https://vicuna.lmsys.org

  13. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)

  14. Ding, K., Liu, Y., Zou, X., Wang, S., Ma, K.: Locally adaptive structure and texture similarity for image quality assessment. In: ACM MM (2021)

  15. Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. IEEE TPAMI (2020)

  16. Ghildyal, A., Liu, F.: Shift-tolerant perceptual similarity metric. In: ECCV (2022)

  17. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR (2017)

  18. Gu, J., Cai, H., Chen, H., Ye, X., Ren, J., Dong, C.: Image quality assessment for perceptual image restoration: a new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002 (2020)

  19. Gu, J., Cai, H., Chen, H., Ye, X., Ren, J., Dong, C.: PIPAL: a large-scale image quality assessment dataset for perceptual image restoration. arXiv preprint arXiv:2007.12142 (2020)

  20. Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: ICLR (2021)

  21. Huang, S., et al.: Language is not all you need: aligning perception with language models. arXiv preprint arXiv:2302.14045 (2023)

  22. Jayaraman, D., Mittal, A., Moorthy, A.K., Bovik, A.C.: Objective quality assessment of multiply distorted images. In: Asilomar Conference on Signals, Systems and Computers (ASILOMAR) (2012)

  23. Kang, L., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for no-reference image quality assessment. In: CVPR (2014)

  24. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: multi-scale image quality transformer. In: CVPR (2021)

  25. Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: EMNLP (2018)

  26. Larson, E.C., Chandler, D.M.: Most apparent distortion: full-reference image quality assessment and the role of strategy. J. Electron. Imaging (2010)

  27. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)

  28. Lin, H., Hosu, V., Saupe, D.: KADID-10k: a large-scale artificially distorted IQA database. In: International Conference on Quality of Multimedia Experience (QoMEX) (2019)

  29. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)

  30. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)

  31. Liu, X., van de Weijer, J., Bagdanov, A.D.: RankIQA: learning from rankings for no-reference image quality assessment. In: ICCV (2017)

  32. Liu, Y., et al.: MMBench: is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)

  33. Lu, P., et al.: Learn to explain: multimodal reasoning via thought chains for science question answering. In: NeurIPS (2022)

  34. Ma, C., Yang, C.Y., Yang, X., Yang, M.H.: Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding (2017)

  35. Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: a benchmark for question answering about charts with visual and logical reasoning. In: ACL Findings (2022)

  36. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: WACV (2021)

  37. Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE TIP (2012)

  38. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Sign. Process. Lett. (2013)

  39. Moorthy, A.K., Bovik, A.C.: A two-step framework for constructing blind image quality indices. IEEE Sign. Process. Lett. (2010)

  40. Moorthy, A.K., Bovik, A.C.: Blind image quality assessment: from natural scene statistics to perceptual quality. IEEE TIP (2011)

  41. OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  42. OpenAI: GPT-4V(ision) system card (2023). https://openai.com/research/gpt-4v-system-card

  43. Ouyang, L., et al.: Training language models to follow instructions with human feedback. In: NeurIPS (2022)

  44. Pan, D., Shi, P., Hou, M., Ying, Z., Fu, S., Zhang, Y.: Blind predicting similar quality map for image quality assessment. In: CVPR (2018)

  45. Ponomarenko, N., et al.: Image database TID2013: peculiarities, results and perspectives. Signal Processing: Image Communication (2015)

  46. Ponomarenko, N., Lukin, V., Zelensky, A., Egiazarian, K., Carli, M., Battisti, F.: TID2008-a database for evaluation of full-reference visual quality assessment metrics. In: Advances of Modern Radioelectronics (2009)

  47. Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: PieAPP: perceptual image-error assessment through pairwise preference. In: CVPR (2018)

  48. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)

  49. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022)

  50. Saad, M.A., Bovik, A.C., Charrier, C.: Blind image quality assessment: a natural scene statistics approach in the DCT domain. IEEE TIP (2012)

  51. Sheikh, H.: LIVE image quality assessment database release 2 (2005). http://live.ece.utexas.edu/research/quality

  52. Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE TIP (2006)

  53. Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)

  54. Su, S., et al.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: CVPR (2020)

  55. Sun, S., Yu, T., Xu, J., Zhou, W., Chen, Z.: GraphIQA: learning distortion graph representations for blind image quality assessment. IEEE TMM (2022)

  56. Sun, W., Zhou, F., Liao, Q.: MDID: a multiply distorted image database for image quality assessment. PR (2017)

  57. Tang, H., Joshi, N., Kapoor, A.: Learning a blind measure of perceptual image quality. In: CVPR (2011)

  58. Touvron, H., et al.: LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  59. Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  60. Wang, J., Chan, K.C., Loy, C.C.: Exploring CLIP for assessing the look and feel of images. In: AAAI (2023)

  61. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE TIP (2004)

  62. Wu, H., et al.: Q-Bench: a benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181 (2023)

  63. Wu, H., et al.: Q-Instruct: improving low-level visual abilities for multi-modality foundation models. arXiv preprint arXiv:2311.06783 (2023)

  64. Wu, H., et al.: Q-Align: teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)

  65. Wu, H., et al.: Towards open-ended visual quality comparison. arXiv preprint arXiv:2402.16641 (2024)

  66. Wu, T., Ma, K., Liang, J., Yang, Y., Zhang, L.: A comprehensive study of multimodal large language models for image quality assessment. arXiv preprint arXiv:2403.10854 (2024)

  67. Ye, Q., et al.: mPLUG-Owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)

  68. Yin, G., et al.: Content-variant reference image quality assessment via knowledge distillation. In: AAAI (2022)

  69. Yin, Z., et al.: LAMM: language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. In: NeurIPS (2023)

  70. You, Z., et al.: Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842 (2024)

  71. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. (2014)

  72. Zhang, L., Zhang, L., Mou, X., Zhang, D.: FSIM: a feature similarity index for image quality assessment. IEEE TIP (2011)

  73. Zhang, P., et al.: InternLM-XComposer: a vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112 (2023)

  74. Zhang, R., et al.: LLaMA-Adapter: efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023)

  75. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)

  76. Zhang, W., Li, D., Ma, C., Zhai, G., Yang, X., Ma, K.: Continual learning for blind image quality assessment. IEEE TPAMI (2022)

  77. Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: a multitask learning perspective. In: CVPR (2023)

  78. Zhang, Z., Wu, H., Zhang, E., Zhai, G., Lin, W.: A benchmark for multi-modal foundation models on low-level vision: from single images to pairs. arXiv preprint arXiv:2402.07116 (2024)

  79. Zhao, H., et al.: MMICL: empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915 (2023)

  80. Zheng, H., Fu, J., Zeng, Y., Zha, Z.J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: ICCV (2021)

  81. Zhou, W., Wang, Z.: Quality assessment of image super-resolution: balancing deterministic and statistical fidelity. In: ACM MM (2022)

  82. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  83. Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: deep meta-learning for no-reference image quality assessment. In: CVPR (2020)


Acknowledgement

This work was sponsored by CUHK Direct Grants (RCFUS) No. 4055189, National Natural Science Foundation of China (Grant No. 62276251), and the Joint Lab of CAS-HK.

Author information

Corresponding authors

Correspondence to Tianfan Xue or Chao Dong.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4842 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

You, Z., Li, Z., Gu, J., Yin, Z., Xue, T., Dong, C. (2025). Depicting Beyond Scores: Advancing Image Quality Assessment Through Multi-modal Language Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15105. Springer, Cham. https://doi.org/10.1007/978-3-031-72970-6_15

  • DOI: https://doi.org/10.1007/978-3-031-72970-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72969-0

  • Online ISBN: 978-3-031-72970-6

  • eBook Packages: Computer Science, Computer Science (R0)
