Masked Feature Compression for Object Detection
Figure 1. Variational image compression with a hyperprior. The image encoder, image decoder, hyperprior encoder, and hyperprior decoder are denoted by $g_e$, $g_d$, $h_e$, and $h_d$, respectively. All encoders and decoders are neural networks. AE denotes the arithmetic encoder and AD the arithmetic decoder.
Figure 2. Two kinds of compression pipelines. The feature extractor block is typically a neural network backbone that extracts feature information from the input image. The image encoder and image decoder are neural networks that perform image encoding and decoding, respectively; likewise, the feature encoder and feature decoder are neural networks for feature encoding and decoding. The "perform detection" block can be implemented with common detection models and preliminarily determines the approximate regions where objects are located.
Figure 3. Visualization of the extracted features.
Figure 4. The workflow of the proposed method. "*" denotes the convolution operation.
Figure 5. The structure of the mask generator. The C3 block is based on the structure in CSPNet [41]; $\times n$ denotes the number of times it is repeated. "Up" denotes nearest-neighbor interpolation with a scaling factor of 2. In the mask, "1" is shown in white and "0" in black.
Figure 6. The neighborhood convolution operation, where the yellow region marks objects and the orange region is the final neighborhood. "*" denotes convolution; the kernel size here is 3.
Figure 7. The forward propagation process during training.
Figure 8. Examples of masks generated by the mask generator.
Figure 9. Rate–accuracy curves on the VisDrone and COCO datasets [4,5,11,13,14,15,16,17,26].
Figure 10. Example patches comparing the detection outcomes of different methods [11,13,15]. The first column shows the detector applied to the original image (baseline). The first and third rows show the input to the detection model under different compression methods; for conventional compression methods, this is the decompressed image. Since our approach compresses features, we visualize the channel carrying the most information. The second and fourth rows present the detection results. Our method outputs detections directly from the features; for visualization, we draw the bounding boxes on the corresponding original image.
Figure 11. Rate–accuracy curves before and after applying our method on VisDrone [14].
Figure 12. Rate–accuracy curves before and after using the mask generator on VisDrone.
Figure 13. Neighborhood convolution with different kernel sizes on VisDrone.
Figure 14. Rate–accuracy curves on VisDrone when using different compression models as the base model [11,14,26].
Abstract
1. Introduction
- We explore the feasibility of applying generated masks to low-level features and reduce the model's time complexity by compressing the features directly. The resulting encoding speed surpasses that of current DNN-based image compression models.
- We design a lightweight mask generator that produces an object mask in a single forward pass, and we compress the masked features to save bits while preserving the accuracy of downstream object detection.
- The proposed framework can easily be integrated with existing neural network compression frameworks, enhancing their compression performance for object detection tasks.
2. Related Works
2.1. Variational Image Compression
2.2. Region Proposal Methods
2.3. One-Stage Object Detection Models
3. Feasibility Analysis of Masked Feature Compression
4. Proposed Method
4.1. The Feature Extractor, Mask Generator, and Powerful Detector
4.2. Neighborhood Convolution
- Use the pretrained mask generator (training details in Section 5) to obtain a 0–1 mask in which object areas are marked as 1 and background areas as 0. Then resize the mask with nearest-neighbor interpolation to match the spatial size of the features.
- Apply neighborhood convolutions of different kernel sizes to expand the object regions in the mask, then multiply the mask with the features to obtain the masked features (see the sketch after this list).
- Run YOLOv5l, trained on the VisDrone dataset, on the masked features. The results are shown in Table 2.
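A minimal PyTorch sketch of these masking steps follows. The function names are ours, not the authors'; the default kernel size of 9 is one of the values ablated in Table 2.

```python
import torch
import torch.nn.functional as F

def neighborhood_convolution(mask: torch.Tensor, kernel_size: int) -> torch.Tensor:
    """Expand the object regions of a 0-1 mask (shape N x 1 x H x W) by
    convolving with an all-ones kernel and re-binarizing, so that positions
    near an object also survive the masking."""
    pad = kernel_size // 2
    kernel = torch.ones(1, 1, kernel_size, kernel_size, device=mask.device)
    expanded = F.conv2d(mask, kernel, padding=pad)
    return (expanded > 0).float()

def mask_features(features: torch.Tensor, mask: torch.Tensor,
                  kernel_size: int = 9) -> torch.Tensor:
    # Resize the mask to the spatial size of the features (nearest neighbor),
    # expand its object regions, and zero out background feature positions.
    mask = F.interpolate(mask, size=features.shape[-2:], mode="nearest")
    mask = neighborhood_convolution(mask, kernel_size)
    return features * mask  # broadcasts over the channel dimension
```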
4.3. Feature Compression Model
4.4. Computational Complexity Analysis
5. Experiments
5.1. Dataset
5.2. Training Settings
5.3. Experimental Results
5.4. Ablation Study
Algorithm 1: Generate Mask by Bounding Boxes
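The body of Algorithm 1 did not survive extraction. Below is a minimal sketch of what rasterizing bounding boxes into a 0–1 mask could look like; the function name and the corner-coordinate convention are our assumptions. Masks produced this way can serve as supervision targets for the mask generator.

```python
import torch

def generate_mask(boxes, height: int, width: int) -> torch.Tensor:
    """Rasterize bounding boxes into a binary mask: 1 inside any box
    (object), 0 elsewhere (background).

    boxes: iterable of (x1, y1, x2, y2) corners in pixel coordinates.
    Returns a tensor of shape (1, 1, height, width).
    """
    mask = torch.zeros(1, 1, height, width)
    for x1, y1, x2, y2 in boxes:
        # Clamp box corners to the image bounds before writing.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        mask[..., y1:y2, x1:x2] = 1.0
    return mask
```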
6. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
2. Wallace, G.K. The JPEG still picture compression standard. Commun. ACM 1991, 34, 30–44.
3. Taubman, D.S.; Marcellin, M.W.; Rabbani, M. JPEG2000: Image compression fundamentals, standards and practice. J. Electron. Imaging 2002, 11, 286–287.
4. Bellard, F. BPG Image Format. 2015. Available online: https://bellard.org/bpg/ (accessed on 1 May 2024).
5. Google. WebP: Compression Techniques; Google: Mountain View, CA, USA, 2017.
6. Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, 100, 90–93.
7. Daubechies, I. Ten Lectures on Wavelets; SIAM: Philadelphia, PA, USA, 1992.
8. Ma, S.; Zhang, X.; Jia, C.; Zhao, Z.; Wang, S.; Wang, S. Image and Video Compression With Neural Networks: A Review. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1683–1698.
9. Sadeeq, H.T.; Hameed, T.H.; Abdi, A.S.; Abdulfatah, A.N. Image compression using neural networks: A review. Int. J. Online Biomed. Eng. (iJOE) 2021, 17, 135–153.
10. Theis, L.; Shi, W.; Cunningham, A.; Huszár, F. Lossy Image Compression with Compressive Autoencoders. In Proceedings of the International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
11. Cheng, Z.; Sun, H.; Takeuchi, M.; Katto, J. Learned image compression with discretized gaussian mixture likelihoods and attention modules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7939–7948.
12. He, D.; Zheng, Y.; Sun, B.; Wang, Y.; Qin, H. Checkerboard context model for efficient learned image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14771–14780.
13. He, D.; Yang, Z.; Peng, W.; Ma, R.; Qin, H.; Wang, Y. ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5718–5727.
14. Minnen, D.; Ballé, J.; Toderici, G.D. Joint Autoregressive and Hierarchical Priors for Learned Image Compression. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; Volume 31.
15. Zou, R.; Song, C.; Zhang, Z. The devil is in the details: Window-based attention for image compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17492–17501.
16. Liu, J.; Sun, H.; Katto, J. Learned image compression with mixed transformer-cnn architectures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14388–14397.
17. He, Z.; Huang, M.; Luo, L.; Yang, X.; Zhu, C. Towards real-time practical image compression with lightweight attention. Expert Syst. Appl. 2024, 252, 124142.
18. Versatile Video Coding Reference Software Version 12.1 (VTM-12.1). 2021. Available online: https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM/-/tree/VTM-12.1 (accessed on 1 May 2024).
19. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; IEEE: Piscataway, NJ, USA, 2003; Volume 2, pp. 1398–1402.
20. Lyu, Z.; Yu, T.; Pan, F.; Zhang, Y.; Luo, J.; Zhang, D.; Chen, Y.; Zhang, B.; Li, G. A survey of model compression strategies for object detection. Multimed. Tools Appl. 2023, 83, 48165–48236.
21. Wan, S.; Wu, T.Y.; Hsu, H.W.; Wong, W.H.; Lee, C.Y. Feature Consistency Training With JPEG Compressed Images. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4769–4780.
22. Chan, P.H.; Huggett, A.; Souvalioti, G.; Jennings, P.; Donzella, V. Influence of AVC and HEVC Compression on Detection of Vehicles Through Faster R-CNN. IEEE Trans. Intell. Transp. Syst. 2024, 25, 203–213.
23. Duan, Z.; Ma, Z.; Zhu, F. Unified Architecture Adaptation for Compressed Domain Semantic Inference. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4108–4121.
24. Huang, Z.; Jia, C.; Wang, S.; Ma, S. HMFVC: A Human-Machine Friendly Video Compression Scheme. IEEE Trans. Circuits Syst. Video Technol. 2022, in press.
25. Ballé, J.; Laparra, V.; Simoncelli, E.P. End-to-end optimized image compression. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
26. Ballé, J.; Minnen, D.; Singh, S.; Hwang, S.J.; Johnston, N. Variational image compression with a scale hyperprior. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
27. Marpe, D.; Schwarz, H.; Wiegand, T. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 620–636.
28. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28, pp. 91–99.
30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
31. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
32. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
33. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 1055–1059.
34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
35. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
36. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
37. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; Xie, T.; Fang, J.; Imyhxy; et al. ultralytics/yolov5: v7.0—YOLOv5 SOTA Realtime Instance Segmentation; GitHub: San Francisco, CA, USA, 2022.
38. Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
39. Li, H.; Chen, J.; Lu, H.; Chi, Z. CNN for saliency detection with low-level feature integration. Neurocomputing 2017, 226, 212–220.
40. Li, Z.; Lang, C.; Liang, L.; Zhao, J.; Feng, S.; Hou, Q.; Feng, J. Dense attentive feature enhancement for salient object detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 8128–8141.
41. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
42. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. In Proceedings of the NIPS 2017 Autodiff Workshop, Long Beach, CA, USA, 9 December 2017.
43. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 1 May 2024).
44. Ballé, J.; Laparra, V.; Simoncelli, E.P. Density modeling of images using a generalized normalization transformation. arXiv 2015, arXiv:1511.06281.
45. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399.
46. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Table 1. Comparison of YOLOv5 models.

| Model | Input Size (Pixels) | mAP (%) | Params (M) | FLOPs (B) |
|---|---|---|---|---|
| YOLOv5n | 640 | 45.7 | 1.9 | 4.5 |
| YOLOv5s | 640 | 56.8 | 7.2 | 16.5 |
| YOLOv5m | 640 | 64.1 | 21.2 | 49.0 |
| YOLOv5l | 640 | 67.3 | 46.5 | 109.1 |
| YOLOv5x | 640 | 68.9 | 86.7 | 205.7 |
Table 2. Detection results of YOLOv5l on masked VisDrone features with different neighborhood-convolution kernel sizes ("None": masking without neighborhood expansion; "Unmasked": original features).

| Kernel Size | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| None | 34.6 | 20.4 | 52.2 | 34.3 |
| 5 | 36.7 | 20.3 | 53.1 | 36.0 |
| 7 | 38.0 | 21.3 | 51.1 | 38.4 |
| 9 | 38.8 | 21.9 | 55.9 | 37.3 |
| 11 | 41.0 | 24.0 | 55.8 | 39.9 |
| Unmasked | 41.6 | 24.7 | 55.1 | 40.6 |
Table 3. Layer settings of the image codec and the corresponding feature codec ("→" marks how each image-codec layer is adapted for the feature codec). "Conv, a, b, Kk, Ss" denotes a convolution with a input channels, b output channels, kernel size k, and stride s; "Deconv" is a transposed convolution; GDN/IGDN are (inverse) generalized divisive normalization layers [44].

| Image Encoder | → Feature Encoder | Image Decoder | → Feature Decoder |
|---|---|---|---|
| Conv, 3, 128, K5, S2 | Conv, 256, 224, K5, S2 | Deconv, 192, 128, K5, S2 | Deconv, 192, 224, K5, S2 |
| GDN | GDN | IGDN | IGDN |
| Conv, 128, 128, K5, S2 | Conv, 224, 224, K5, S1 | Deconv, 128, 128, K5, S2 | Deconv, 224, 224, K5, S1 |
| GDN | GDN | IGDN | IGDN |
| Conv, 128, 128, K5, S2 | Conv, 224, 224, K5, S1 | Deconv, 128, 128, K5, S2 | Deconv, 224, 224, K5, S1 |
| GDN | GDN | IGDN | IGDN |
| Conv, 128, 192, K5, S2 | Conv, 224, 192, K5, S1 | Deconv, 128, 3, K5, S2 | Deconv, 224, 256, K5, S1 |
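A PyTorch sketch of the feature encoder and decoder columns above, using the GDN/IGDN layers of [44] via CompressAI. The padding values are our assumptions, chosen so that the listed strides give exact 2x down/upsampling; the table does not specify them.

```python
import torch.nn as nn
from compressai.layers import GDN  # GDN/IGDN implementation from CompressAI

def feature_encoder() -> nn.Sequential:
    # Feature encoder column: "Conv, in, out, K5, S2" etc.
    return nn.Sequential(
        nn.Conv2d(256, 224, kernel_size=5, stride=2, padding=2), GDN(224),
        nn.Conv2d(224, 224, kernel_size=5, stride=1, padding=2), GDN(224),
        nn.Conv2d(224, 224, kernel_size=5, stride=1, padding=2), GDN(224),
        nn.Conv2d(224, 192, kernel_size=5, stride=1, padding=2),
    )

def feature_decoder() -> nn.Sequential:
    # Feature decoder column; output_padding=1 makes stride 2 an exact 2x upsample.
    return nn.Sequential(
        nn.ConvTranspose2d(192, 224, kernel_size=5, stride=2,
                           padding=2, output_padding=1), GDN(224, inverse=True),
        nn.ConvTranspose2d(224, 224, kernel_size=5, stride=1, padding=2),
        GDN(224, inverse=True),
        nn.ConvTranspose2d(224, 224, kernel_size=5, stride=1, padding=2),
        GDN(224, inverse=True),
        nn.ConvTranspose2d(224, 256, kernel_size=5, stride=1, padding=2),
    )
```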
Table 4. Rate–accuracy comparison on the VisDrone and COCO datasets.

| Dataset | Method | BPP | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| VisDrone | Ours | 0.118 | 0.388 | 0.228 |
| VisDrone | TCM2023 [16] | 0.158 | 0.363 | 0.213 |
| VisDrone | STF2022 [15] | 0.126 | 0.348 | 0.203 |
| VisDrone | Cheng2020 [11] | 0.120 | 0.339 | 0.197 |
| VisDrone | Minnen2018 [14] | 0.130 | 0.330 | 0.191 |
| VisDrone | BPG [4] | 0.150 | 0.349 | 0.202 |
| VisDrone | WebP [5] | 0.157 | 0.331 | 0.190 |
| COCO | Ours | 0.203 | 0.605 | 0.430 |
| COCO | TCM2023 [16] | 0.274 | 0.602 | 0.420 |
| COCO | STF2022 [15] | 0.254 | 0.602 | 0.421 |
| COCO | Cheng2020 [11] | 0.220 | 0.597 | 0.408 |
| COCO | Minnen2018 [14] | 0.231 | 0.591 | 0.413 |
| COCO | BPG [4] | 0.306 | 0.580 | 0.398 |
| COCO | WebP [5] | 0.342 | 0.570 | 0.395 |
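For reference, BPP here is bits per pixel: the size of the compressed bitstream divided by the pixel count of the original image. A minimal helper (our naming, not the authors' evaluation code) might look like:

```python
def bits_per_pixel(byte_streams, image_height: int, image_width: int) -> float:
    """Rate as total compressed bits over the pixel count of the original
    image, matching the BPP column above. byte_streams: the bytes objects
    produced by the entropy coder for one image."""
    total_bits = 8 * sum(len(s) for s in byte_streams)
    return total_bits / (image_height * image_width)
```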
Table 5. Computational complexity comparison of compression models. (The ">" decode entries were truncated in the source.)

| Model | Encode (ms) | Decode (ms) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|
| Our Method | 17.5 | 5.2 | 24.5 | 44.8 |
| TCM2023 [16] | 333.7 | 433.8 | 44.9 | 221.0 |
| STF2022 (CNN) [15] | 195.5 | 329.7 | 75.2 | 304.9 |
| ELIC2022 [13] | 35.2 | 34.1 | 31.6 | 340.4 |
| Cheng2020 [11] | 34.2 | > | 12.3 | 187.3 |
| Minnen2018-MeanScale [14] | 12.9 | 6.6 | 6.9 | 79.9 |
| Minnen2018 [14] | 20.9 | > | 12.0 | 168.9 |
| Ballé18 [26] | 12.5 | 6.3 | 4.9 | 77.1 |
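Latency figures of this kind are sensitive to GPU asynchrony. A sketch of how such encode timings can be measured, assuming a CompressAI-style model exposing a compress() method (an assumption about the harness, not the authors' actual benchmark code):

```python
import time
import torch

@torch.no_grad()
def time_encode_ms(model, x: torch.Tensor, warmup: int = 10, runs: int = 100) -> float:
    """Average encoding latency in milliseconds. CUDA kernels launch
    asynchronously, so synchronize before reading the clock."""
    for _ in range(warmup):
        model.compress(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model.compress(x)
    torch.cuda.synchronize()
    return 1000.0 * (time.perf_counter() - start) / runs
```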
Table 6. Encoding time breakdown when generating the mask with YOLOv5n versus our mask generator.

| Model | Pre-Process (ms) | Inference (ms) | Mask + Compression (ms) | Total (ms) |
|---|---|---|---|---|
| Directly using YOLOv5n | 0.2 | 7.3 | 30.8 | 38.3 |
| Our Method | 0.2 | 7.5 | 9.8 | 17.5 |