Abstract
Recent advancements in end-to-end image compression demonstrate the potential to surpass traditional codecs regarding rate-distortion performance. However, current methods either prioritize human perceptual quality or solely optimize for one or a few predetermined downstream tasks, neglecting a more common scenario that involves a variety of unforeseen machine vision tasks. In this paper, we propose a Diffusion-based Multiple-Task Unified Image Compression framework that aims to expand the boundary of traditional image compression by incorporating human perception and multiple vision tasks in open-set scenarios. Our proposed method comprises a Multi-Task Collaborative Embedding module and a Diffusion-based Invariant Knowledge Learning module. The former module facilitates collaborative embedding for multiple tasks, while the latter module boosts generalization toward unforeseen tasks by distilling the invariant knowledge from seen vision tasks. Experiments show that the proposed method extracts compact and versatile embeddings for human and machine vision collaborative compression, resulting in superior performance. Specifically, our method outperforms the state-of-the-art by 52.25%/51.68%/48.87%/48.07%/6.29% BD-rate reduction in terms of mAP/mAP/aAcc/PQ-all/accuracy on the MS-COCO for object detection/instance segmentation/semantic segmentation/panoptic segmentation and video question answering tasks, respectively.
S. Guo and L. Sui—Equal contribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Bai, Y., et al.: Towards end-to-end image compression and analysis with transformers. In: AAAI, pp. 104–112 (2022)
Ballé, J., Laparra, V., Simoncelli, E.P.: End-to-end optimized image compression. In: ICLR (2017)
Ballé, J., Minnen, D., Singh, S., Hwang, S.J., Johnston, N.: Variational image compression with a scale hyperprior. In: ICLR (2018)
Bjontegaard, G.: Calculation of average PSNR differences between RD-curves. ITU SG16 Doc. VCEG-M33 (2001)
Blau, Y., Michaeli, T.: Rethinking lossy compression: the rate-distortion-perception tradeoff. In: ICML, pp. 675–685. PMLR (2019)
Bross, B., et al.: Overview of the versatile video coding (VVC) standard and its applications. IEEE TCSVT 31(10), 3736–3764 (2021)
Chamain, L.D., Racapé, F., Bégaint, J., Pushparaja, A., Feltman, S.: End-to-end optimized image compression for machines, a study. In: DCC, pp. 163–172. IEEE (2021)
Chen, T., Liu, H., Ma, Z., Shen, Q., Cao, X., Wang, Y.: End-to-end learnt image compression via non-local attention optimization and improved context modeling. IEEE TIP 30, 3179–3191 (2021)
Chen, Z., Fan, K., Wang, S., Duan, L., Lin, W., Kot, A.C.: Toward intelligent sensing: intermediate deep feature compression. IEEE TIP 29, 2230–2243 (2019)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR, pp. 1290–1299 (2022)
Cheng, Z., Sun, H., Takeuchi, M., Katto, J.: Learned image compression with discretized gaussian mixture likelihoods and attention modules. In: CVPR, pp. 7939–7948 (2020)
Choi, J., Han, B.: Task-aware quantization network for JPEG image compression. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12365, pp. 309–324. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58565-5_19
Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: NeurIPS, vol. 34, pp. 8780–8794 (2021)
Duan, L., Liu, J., Yang, W., Huang, T., Gao, W.: Video coding for machines: a paradigm of collaborative compression and intelligent analytics. IEEE TIP 29, 8680–8695 (2020)
Duan, Z., Lu, M., Ma, Z., Zhu, F.: Lossy image compression with quantized hierarchical VAEs. In: WACV, pp. 198–207 (2023)
Feng, Y., Ji, S., Liu, Y.S., Du, S., Dai, Q., Gao, Y.: Hypergraph-based multi-modal representation for open-set 3d object retrieval. IEEE TPAMI (2023)
Ge, X., et al.: Task-aware encoder control for deep video compression. In: CVPR, pp. 26036–26045 (2024)
Guo, S., Chen, Z., Zhao, Y., Zhang, N., Li, X., Duan, L.: Toward scalable image feature compression: a content-adaptive and diffusion-based approach. In: ACM MM, pp. 1431–1442 (2023)
He, D., Yang, Z., Peng, W., Ma, R., Qin, H., Wang, Y.: ELIC: efficient learned image compression with unevenly grouped space-channel contextual adaptive coding. In: CVPR, pp. 5718–5727 (2022)
He, D., Zheng, Y., Sun, B., Wang, Y., Qin, H.: Checkerboard context model for efficient learned image compression. In: CVPR, pp. 14771–14780 (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV, pp. 2961–2969 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Hu, Y., Yang, S., Yang, W., Duan, L.Y., Liu, J.: Towards coding for human and machine vision: a scalable image coding approach. In: ICME, pp. 1–6. IEEE (2020)
Huang, Z., Jia, C., Wang, S., Ma, S.: Visual analysis motivated rate-distortion model for image coding. In: ICME, pp. 1–6. IEEE (2021)
Kim, Y., et al.: End-to-end learnable multi-scale feature compression for VCM. IEEE TCSVT (2023)
Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: CVPR, pp. 9404–9413 (2019)
Le, N., Zhang, H., Cricri, F., Ghaznavi-Youvalari, R., Rahtu, E.: Image coding for machines: an end-to-end learned approach. In: ICASSP, pp. 1590–1594. IEEE (2021)
Le, N., Zhang, H., Cricri, F., Ghaznavi-Youvalari, R., Tavakoli, H.R., Rahtu, E.: Learned image coding for machines: a content-adaptive approach. In: ICME, pp. 1–6. IEEE (2021)
Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML, pp. 12888–12900. PMLR (2022)
Li, M., Gao, S., Feng, Y., Shi, Y., Wang, J.: Content-oriented learned image compression. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13679, pp. 632–647. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19800-7_37
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR, pp. 2117–2125 (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Lin, X., et al.: DiffBIR: towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070 (2023)
Liu, K., Liu, D., Li, L., Yan, N., Li, H.: Semantics-to-signal scalable image compression with learned revertible representations. IJCV 129(9), 2605–2621 (2021)
Liu, L., Hu, Z., Chen, Z., Xu, D.: ICMH-net: neural image compression towards both machine vision and human vision. In: ACM MM, pp. 8047–8056 (2023)
Lu, M., Guo, P., Shi, H., Cao, C., Ma, Z.: Transformer-based image compression. In: DCC, p. 469. IEEE (2022)
Lu, W., Chen, J., Xue, F.: Using computer vision to recognize composition of construction waste mixtures: a semantic segmentation approach. Resour. Conserv. Recycl. 178, 106022 (2022)
Van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(11) (2008)
Mentzer, F., Toderici, G.D., Tschannen, M., Agustsson, E.: High-fidelity generative image compression. In: NeurIPS, vol. 33, pp. 11913–11924 (2020)
Minnen, D., Ballé, J., Toderici, G.D.: Joint autoregressive and hierarchical priors for learned image compression. In: NeurIPS, vol. 31 (2018)
Minnen, D., Singh, S.: Channel-wise autoregressive entropy models for learned image compression. In: ICIP, pp. 3339–3343. IEEE (2020)
Müller-Franzes, G., et al.: Diffusion probabilistic models beat GANs on medical images. arXiv preprint arXiv:2212.07501 (2022)
Pennebaker, W.B., Mitchell, J.L.: JPEG: Still Image Data Compression Standard. Springer, Heidelberg (1992)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS, vol. 28 (2015)
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)
Si, Z., Shen, K.: Research on the WebP image format. In: Xu, M., Yang, L., Ouyang, Y., Ouyang, Y. (eds.) Advanced Graphic Communications, Packaging Technology and Materials. LNEE, vol. 369, pp. 271–277. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0072-0_35
Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: transformer for semantic segmentation. In: ICCV, pp. 7262–7272 (2021)
Suzuki, S., Takagi, M., Hayase, K., Onishi, T., Shimizu, A.: Image pre-transformation for recognition-aware image compression. In: ICIP, pp. 2686–2690. IEEE (2019)
Xiang, J., Tian, K., Zhang, J.: MIMT: masked image modeling transformer for video compression. In: ICLR (2022)
Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: DenseASPP for semantic segmentation in street scenes. In: CVPR, pp. 3684–3692 (2018)
Yang, W., Huang, H., Hu, Y., Duan, L.Y., Liu, J.: Video coding for machines: compact visual representation compression for intelligent collaborative analytics. IEEE TPAMI (2024)
Yoon, C., et al.: MEDO: minimizing effective distortions only for machine-oriented visual feature compression. In: VCIP, pp. 1–5. IEEE (2023)
Zamir, A.R., Sax, A., Shen, W., Guibas, L.J., Malik, J., Savarese, S.: Taskonomy: Disentangling task transfer learning. In: CVPR, pp. 3712–3722 (2018)
Zeng, H., Peng, S., Li, D.: DeepLabv3+ semantic segmentation model based on feature cross attention mechanism. In: JPCS, p. 012106. IOP Publishing (2020)
Zhu, X., Song, J., Gao, L., Zheng, F., Shen, H.T.: Unified multivariate gaussian mixture for efficient neural image compression. In: CVPR, pp. 17612–17621 (2022)
Zhu, Y., Yang, Y., Cohen, T.: Transformer-based transform coding. In: ICLR (2021)
Zou, R., Song, C., Zhang, Z.: The devil is in the details: window-based attention for image compression. In: CVPR, pp. 17492–17501 (2022)
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Grant 62088102, in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation and in part by AI Joint Lab of Future Urban Infrastructure sponsored by Fuzhou Chengtou New Infrastructure Group and Boyun Vision Co. Ltd.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, S., Sui, L., Zhang, C., Chen, Z., Yang, W., Duan, L. (2025). A Unified Image Compression Method for Human Perception and Multiple Vision Tasks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_20
Download citation
DOI: https://doi.org/10.1007/978-3-031-73209-6_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73208-9
Online ISBN: 978-3-031-73209-6
eBook Packages: Computer ScienceComputer Science (R0)