Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion
"> Figure 1
<p>Network structure.</p> "> Figure 2
<p>Detection head enhancement module.</p> "> Figure 3
<p>Channel cascade module.</p> "> Figure 4
<p>Schematic diagram of the sensitivity analysis of IoU for small and large objects, where each grid represents a pixel, the left diagram shows the small-object schematic, the right diagram shows the large-object schematic, A indicates the label box, and B and C indicate the prediction box with different degrees of program offset, respectively.</p> "> Figure 5
<p>Sample VisDrone2021 dataset image.</p> "> Figure 6
<p>Sample picture of the homemade dataset.</p> "> Figure 7
<p>Pictures of the detection effect of each comparison model. The green and blue lines indicate the box and category of the label, respectively. The red and light blue lines indicate the prediction boxes for the pedestrian category and the car category, respectively.</p> "> Figure 8
<p>Visualization of the heat map for each comparison model.</p> "> Figure 9
<p>Pictures of detection results of YOLOv5s and our model. The green and blue lines indicate the box and category of the label, respectively. The red and light blue boxes indicate the forecast boxes for the aircraft category and the car category, respectively. The yellow box indicates the objects missed by the model.</p> "> Figure 9 Cont.
<p>Pictures of detection results of YOLOv5s and our model. The green and blue lines indicate the box and category of the label, respectively. The red and light blue boxes indicate the forecast boxes for the aircraft category and the car category, respectively. The yellow box indicates the objects missed by the model.</p> ">
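The IoU-sensitivity schematic above (Figure 4) can be reproduced numerically: for the same pixel offset, the IoU between a prediction and its label falls far faster for a small object than for a large one. A minimal sketch in plain Python, with boxes as `(x1, y1, x2, y2)` corner tuples; the 6-px/36-px object sizes and the 3-px offset are illustrative choices, not the paper's exact figures:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A 6x6-pixel "small" object and a 36x36-pixel "large" one,
# each with a prediction shifted right by the same 3 pixels.
small_gt, small_pred = (0, 0, 6, 6), (3, 0, 9, 6)
large_gt, large_pred = (0, 0, 36, 36), (3, 0, 39, 36)
print(iou(small_gt, small_pred))  # drops sharply (about 0.33)
print(iou(large_gt, large_pred))  # barely changes (about 0.85)
```

This is why a fixed-quality localization error costs a small object far more IoU than a large one, motivating a size-insensitive metric such as NWD.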
Abstract
1. Introduction
- We propose a detection head enhancement module (DHEM) that combines a multi-scale feature fusion module with an attention mechanism module to enhance feature characterization, achieving more accurate small-object detection at the cost of a slight increase in model parameters.
- We design a channel cascade module based on an attention mechanism (AMCC) that removes redundant information from the feature layers and highlights small-object feature information, helping the model learn small-object features more efficiently.
- We introduce the NWD loss function and combine it with GIoU as the location regression loss, which increases the optimization weight the model gives to small objects and improves the accuracy of the regression boxes. Additionally, an object detection layer is added to improve feature extraction at different scales.
- AMMFN is compared with YOLOv5s and other advanced models on a homemade remote sensing dataset and the publicly available VisDrone2021 dataset, showing significant improvements in detection accuracy.
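The combined regression loss in the third contribution can be sketched as follows. This is not the paper's implementation: the box parameterization, the NWD normalizing constant `c`, and the function names are assumptions. The NWD form (each box modeled as a 2-D Gaussian, with the squared Wasserstein distance between the Gaussians normalized through an exponential) follows Wang et al.'s formulation, and the 1.2/0.8 weighting is the best split reported in the scale-factor ablation.

```python
import math

def giou_loss(a, b):
    """1 - GIoU for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box C penalizes non-overlapping predictions.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    giou = inter / union - (cw * ch - union) / (cw * ch)
    return 1.0 - giou

def nwd_loss(a, b, c=12.8):
    """Normalized Wasserstein distance loss. Each box is modeled as a
    2-D Gaussian N(center, diag(w/2, h/2)^2); `c` is a dataset-dependent
    normalizing constant (the value here is a placeholder)."""
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    w2 = math.sqrt((cxa - cxb) ** 2 + (cya - cyb) ** 2
                   + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return 1.0 - math.exp(-w2 / c)

def box_loss(a, b, alpha=1.2, beta=0.8):
    # Weighted sum; 1.2/0.8 is the best split in the scale-factor ablation.
    return alpha * giou_loss(a, b) + beta * nwd_loss(a, b)
```

Because NWD depends on absolute center and size differences rather than overlap ratio, it degrades smoothly for tiny boxes where IoU-based losses collapse to zero gradient.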
2. Related Works
3. Our Work
3.1. Detection Head Enhancement Module
3.2. Channel Cascade Module Based on Attention Mechanism
3.3. Optimization of the Loss Function
3.4. Optimization of the Prediction Feature Layer
4. Experiments
4.1. Dataset
4.2. Experimental Environment Configuration and Parameter Setting
4.3. Experimental Evaluation Metrics
4.4. Results of Ablation Experiments
4.4.1. Proposed Modules
4.4.2. Finding the Appropriate Scale Factor in the Loss Function
4.5. Comparison Experiments
4.6. Comparison on the Homemade Dataset
5. Discussion
5.1. Discussion of Comparison with Other Advanced Models
5.2. Discussion of the Proposed Innovation Module
5.3. Speed of Inference
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Kellenberger, B.; Marcos, D.; Tuia, D. Detecting Mammals in UAV Images: Best Practices to Address a Substantially Imbalanced Dataset with Deep Learning. Remote Sens. Environ. 2018, 216, 139–153.
- Kellenberger, B.; Volpi, M.; Tuia, D. Fast Animal Detection in UAV Images Using Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
- Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
- Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
- Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IoU Loss for Accurate Bounding Box Regression. Neurocomputing 2022, 506, 146–157.
- Jocher, G.; Stoken, A.; Borovec, J.; Chaurasia, A.; Changyu, L.; Hogan, A.; Hajek, J.; Diaconu, L.; Kwon, Y.; Defretin, Y.; et al. Ultralytics/YOLOv5: v5.0–YOLOv5-P6 1280 Models, AWS, Supervise.ly and YouTube Integrations. Zenodo, 2021.
- Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2021, arXiv:2110.13389.
- Yan, J.; Jiao, H.; Pu, W.; Shi, C.; Dai, J.; Liu, H. Radar Sensor Network Resource Allocation for Fused Target Tracking: A Brief Review. Inf. Fusion 2022, 86–87, 104–115.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-Sign Detection and Classification in the Wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Qu, J.; Su, C.; Zhang, Z.; Razi, A. Dilated Convolution and Feature Fusion SSD Network for Small Object Detection in Remote Sensing Images. IEEE Access 2020, 8, 82832–82843.
- Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for Small Object Detection. IEEE Trans. Multimed. 2021, 24, 1968–1979.
- Deng, T.; Liu, X.; Mao, G. Improved YOLOv5 Based on Hybrid Domain Attention for Small Object Detection in Optical Remote Sensing Images. Electronics 2022, 11, 2657.
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Montreal, QC, Canada, 10–17 October 2021; pp. 2778–2788.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
- Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-Enhanced CenterNet for Small Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 5488.
- Zhao, P.; Xie, L.; Peng, L. Deep-Level Small Target Detection Algorithm Based on Attention Mechanism. J. Comput. Sci. Explor. 2022, 16, 927–937.
- Zhang, F.; Jiao, L.; Li, L.; Liu, F.; Liu, X. MultiResolution Attention Extractor for Small Object Detection. arXiv 2020, arXiv:2006.05941.
- Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional Feature Fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020.
- Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J.; et al. VisDrone-DET2021: The Vision Meets Drone Object Detection Challenge Results. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2847–2854.
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
| Type | VisDrone2021 | Homemade Dataset |
|---|---|---|
| Tiny object | 94,424 | 294 |
| Small object | 93,655 | 4,901 |
| Medium object | 94,273 | 6,507 |
| Large object | 13,840 | 442 |
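Bucketing objects by pixel area, as in the table above, reduces to a simple threshold chain. The paper's exact area bounds did not survive extraction, so the cutoffs below (16², 32², 96²) are assumed COCO-style values, not the paper's:

```python
# Placeholder area thresholds: a COCO-style small/medium split plus a
# 16^2 "tiny" cutoff. The paper's actual pixel bounds may differ.
TINY, SMALL, MEDIUM = 16 ** 2, 32 ** 2, 96 ** 2

def size_category(w, h):
    """Classify a bounding box of width w and height h (in pixels)."""
    area = w * h
    if area <= TINY:
        return "tiny"
    if area <= SMALL:
        return "small"
    if area <= MEDIUM:
        return "medium"
    return "large"

print(size_category(10, 10))   # 100 px^2  -> "tiny"
print(size_category(50, 50))   # 2500 px^2 -> "medium"
```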
| Type | Environment |
|---|---|
| Operating system | Ubuntu 18.04 |
| GPU (video memory, system memory) | NVIDIA GeForce RTX 4090 (24 GB, 128 GB) |
| PyTorch version | 1.8.0 |
| CUDA | 12.1 |
| DHEM | AMCC | NWD | Head | AP | AP50 | AP75 | APS | APM | APL | Params |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | | 0.221 | 0.440 | 0.204 | 0.146 | 0.395 | 0.556 | 7.025 M |
| √ | | | | 0.224 | 0.442 | 0.204 | 0.153 | 0.408 | 0.547 | 7.405 M |
| | √ | | | 0.225 | 0.442 | 0.206 | 0.151 | 0.397 | 0.543 | 7.088 M |
| | | √ | | 0.232 | 0.453 | 0.213 | 0.156 | 0.412 | 0.554 | 7.025 M |
| | | | √ | 0.232 | 0.458 | 0.213 | 0.158 | 0.408 | 0.568 | 7.169 M |
| √ | | | √ | 0.236 | 0.466 | 0.215 | 0.162 | 0.422 | 0.579 | 7.651 M |
| | √ | | √ | 0.235 | 0.464 | 0.216 | 0.160 | 0.415 | 0.572 | 7.236 M |
| | | √ | √ | 0.239 | 0.469 | 0.218 | 0.163 | 0.423 | 0.576 | 7.169 M |
| | √ | √ | √ | 0.241 | 0.474 | 0.224 | 0.165 | 0.427 | 0.591 | 7.236 M |
| √ | | √ | √ | 0.243 | 0.478 | 0.223 | 0.167 | 0.433 | 0.594 | 7.651 M |
| √ | √ | | √ | 0.239 | 0.475 | 0.217 | 0.163 | 0.429 | 0.588 | 7.717 M |
| √ | √ | √ | √ | 0.247 | 0.481 | 0.229 | 0.170 | 0.436 | 0.601 | 7.717 M |
| GIoU | NWD | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.234 | 0.468 | 0.210 | 0.161 | 0.401 | 0.556 |
| 1 | 0 | 0.232 | 0.458 | 0.213 | 0.157 | 0.408 | 0.558 |
| 1 | 1 | 0.237 | 0.468 | 0.214 | 0.161 | 0.414 | 0.568 |
| 0.8 | 1.2 | 0.236 | 0.469 | 0.212 | 0.160 | 0.415 | 0.569 |
| 1.2 | 0.8 | 0.239 | 0.469 | 0.218 | 0.163 | 0.423 | 0.576 |
| 1.6 | 0.4 | 0.237 | 0.466 | 0.216 | 0.150 | 0.420 | 0.581 |
| 0.4 | 1.6 | 0.237 | 0.470 | 0.214 | 0.164 | 0.413 | 0.575 |
| Model | AP | AP50 | AP75 | APS | APM | APL | Params | GFLOPS | FPS |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s | 0.221 | 0.440 | 0.204 | 0.146 | 0.395 | 0.556 | 7.025 M | 15.954 G | 137.91 |
| RetinaNet-50 | 0.102 | 0.168 | 0.114 | 0.008 | 0.258 | 0.505 | 36.351 M | 145.652 G | 82.54 |
| EfficientDet-d2 | 0.126 | 0.207 | 0.140 | 0.017 | 0.325 | 0.526 | 8.007 M | 14.281 G | 39.44 |
| EfficientNet-YOLOv3 | 0.120 | 0.297 | 0.076 | 0.072 | 0.189 | 0.347 | 6.999 M | 9.039 G | 102.47 |
| YOLOv4-tiny | 0.129 | 0.306 | 0.087 | 0.075 | 0.234 | 0.415 | 5.876 M | 6.836 G | 286.61 |
| MobileNetv2-YOLOv4 | 0.108 | 0.268 | 0.069 | 0.059 | 0.184 | 0.315 | 10.381 M | 18.270 G | 99.78 |
| YOLOv7-tiny | 0.220 | 0.432 | 0.212 | 0.143 | 0.417 | 0.613 | 6.017 M | 13.190 G | 277.13 |
| YOLOXs | 0.246 | 0.478 | 0.227 | 0.157 | 0.445 | 0.646 | 8.938 M | 26.759 G | 85.12 |
| YOLOv8s | 0.243 | 0.473 | 0.224 | 0.155 | 0.452 | 0.642 | 11.136 M | 28.649 G | 228.83 |
| FCOS | 0.231 | 0.434 | 0.221 | 0.138 | 0.444 | 0.610 | 32.113 M | 161.174 G | 81.95 |
| AMMFN | 0.247 | 0.481 | 0.229 | 0.170 | 0.436 | 0.601 | 7.717 M | 25.782 G | 84.28 |
| Categories | YOLOv5s | YOLOv7-Tiny | FCOS | YOLOv8s | YOLOXs | AMMFN | Label |
|---|---|---|---|---|---|---|---|
| Car | 4 | 4 | 4 | 4 | 4 | 4 | 12 |
| Pedestrian | 33 | 27 | 30 | 34 | 35 | 44 | 91 |
| Log-Average Miss Rate | YOLOv5s | YOLOXs | YOLOv8s | AMMFN |
|---|---|---|---|---|
| Pedestrian | 0.91 | 0.88 | 0.89 | 0.86 |
| Car | 0.78 | 0.75 | 0.76 | 0.75 |
| Input Size | Car | Pedestrian | Sum |
|---|---|---|---|
| | 2 | 6 | 8 |
| | 3 | 18 | 21 |
| | 4 | 25 | 29 |
| | 4 | 32 | 36 |
| | 4 | 44 | 48 |
| | 3 | 46 | 49 |
| NWD | DHEM | AMCC | AP | AP50 | AP75 | APS | APM | APL | Params | GFLOPS |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | 0.441 | 0.865 | 0.389 | 0.302 | 0.491 | 0.649 | 7.025 M | 15.954 G |
| √ | | | 0.478 | 0.877 | 0.465 | 0.314 | 0.541 | 0.661 | 7.025 M | 15.954 G |
| | √ | | 0.466 | 0.874 | 0.440 | 0.311 | 0.527 | 0.679 | 7.405 M | 17.803 G |
| | | √ | 0.450 | 0.871 | 0.425 | 0.313 | 0.518 | 0.651 | 7.088 M | 16.030 G |
| √ | √ | | 0.485 | 0.877 | 0.479 | 0.330 | 0.555 | 0.710 | 7.405 M | 17.803 G |
| | √ | √ | 0.472 | 0.879 | 0.451 | 0.320 | 0.537 | 0.674 | 7.468 M | 17.878 G |
| √ | | √ | 0.487 | 0.883 | 0.485 | 0.329 | 0.554 | 0.669 | 7.088 M | 16.030 G |
| √ | √ | √ | 0.492 | 0.896 | 0.489 | 0.334 | 0.557 | 0.703 | 7.468 M | 17.878 G |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote Sensing Small Object Detection Network Based on Attention Mechanism and Multi-Scale Feature Fusion. Remote Sens. 2023, 15, 2728. https://doi.org/10.3390/rs15112728