A Unified Image Compression Method for Human Perception and Multiple Vision Tasks

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15129)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Recent advancements in end-to-end image compression demonstrate the potential to surpass traditional codecs regarding rate-distortion performance. However, current methods either prioritize human perceptual quality or solely optimize for one or a few predetermined downstream tasks, neglecting a more common scenario that involves a variety of unforeseen machine vision tasks. In this paper, we propose a Diffusion-based Multiple-Task Unified Image Compression framework that aims to expand the boundary of traditional image compression by incorporating human perception and multiple vision tasks in open-set scenarios. Our proposed method comprises a Multi-Task Collaborative Embedding module and a Diffusion-based Invariant Knowledge Learning module. The former module facilitates collaborative embedding for multiple tasks, while the latter module boosts generalization toward unforeseen tasks by distilling the invariant knowledge from seen vision tasks. Experiments show that the proposed method extracts compact and versatile embeddings for human and machine vision collaborative compression, resulting in superior performance. Specifically, our method outperforms the state-of-the-art by 52.25%/51.68%/48.87%/48.07%/6.29% BD-rate reduction in terms of mAP/mAP/aAcc/PQ-all/accuracy on the MS-COCO for object detection/instance segmentation/semantic segmentation/panoptic segmentation and video question answering tasks, respectively.

S. Guo and L. Sui—Equal contribution.
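
The gains quoted in the abstract are BD-rate (Bjontegaard delta rate) reductions, i.e. the average bitrate saving at equal task quality. The following is a minimal sketch of how such a figure is commonly computed, not the authors' evaluation code: it fits log-rate as a cubic polynomial of the quality metric for each codec and averages the gap over the shared quality range. The function name `bd_rate` and all sample numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate change (in percent) of a test codec relative to an
    anchor codec at equal quality. Negative values mean bitrate savings."""
    qa = np.asarray(quality_anchor, dtype=float)
    qt = np.asarray(quality_test, dtype=float)
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))

    # Fit log(rate) as a cubic polynomial of the quality metric
    # (e.g. PSNR, mAP, accuracy) for each rate-quality curve.
    poly_a = np.polyfit(qa, log_ra, 3)
    poly_t = np.polyfit(qt, log_rt, 3)

    # Compare the curves only where their quality ranges overlap.
    lo = max(qa.min(), qt.min())
    hi = min(qa.max(), qt.max())

    # Average each fitted curve over [lo, hi] via its antiderivative.
    int_a = np.polyint(poly_a)
    int_t = np.polyint(poly_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)

    # Convert the mean log-rate difference back to a percentage.
    return (np.exp(avg_t - avg_a) - 1.0) * 100.0


# Hypothetical rate-quality points (bits per pixel vs. a task metric).
anchor_rate = [0.1, 0.2, 0.4, 0.8]
quality = [30.0, 33.0, 36.0, 39.0]
test_rate = [r / 2 for r in anchor_rate]  # same quality at half the rate

# Close to -50, since the test curve uses half the rate at every point.
print(f"BD-rate: {bd_rate(anchor_rate, quality, test_rate, quality):.2f}%")
```

Under this convention, a value such as the 52.25% reduction reported for object detection would mean the method needs roughly half the bits of the compared codec to reach the same mAP, averaged over the overlapping quality range.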

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant 62088102, in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation, and in part by the AI Joint Lab of Future Urban Infrastructure sponsored by Fuzhou Chengtou New Infrastructure Group and Boyun Vision Co. Ltd.

Author information

Authors and Affiliations

School of Computer Science, Peking University, Beijing, China
Sha Guo & Lingyu Duan
Peng Cheng Laboratory, Shenzhen, China
Zhuo Chen, Wenhan Yang & Lingyu Duan
4Paradigm Inc, Beijing, China
Lin Sui & Chenlin Zhang

Corresponding author

Correspondence to Lingyu Duan.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

About this paper

Cite this paper

Guo, S., Sui, L., Zhang, C., Chen, Z., Yang, W., Duan, L. (2025). A Unified Image Compression Method for Human Perception and Multiple Vision Tasks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_20

DOI: https://doi.org/10.1007/978-3-031-73209-6_20
Published: 01 November 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73208-9
Online ISBN: 978-3-031-73209-6
eBook Packages: Computer Science, Computer Science (R0)
