EAAnet: Efficient Attention and Aggregation Network for Crowd Person Detection
Figure 1. EAAnet achieves the highest precision performance on crowd person detection with low computational cost.
Figure 2. Typical pyramid structure. (a) FPN and PANet. (b) ASFF.
Figure 3. Typical scenes in COP dataset. (a) Indoor crowd person. (b) Outdoor crowd person.
Figure 4. Detailed information on COP dataset. (a) The number of instances for each category. (b) The distribution of instances being aligned. (c) The distribution of instances’ center points after being normalized. (d) The proportion of width and height.
Figure 5. The CBAM attention module.
Figure 6. BiFPN structure.
Figure 7. EAAnet model structure. (a) Overall structure. (b) CBAM structure inserted in the backbone.
Figure 8. Performance comparison. (a) P curve. (b) mAP50.
Figure 9. Detection effects. (a) GT. (b) Prediction of YOLOv5. (c) Prediction of EAAnet.
Figure 10. Heat map of some representative stages. (a) Ordinary environment. (b) Indoor crowd. (c) Lower feature map in dense environment.
Figure 11. Performance difference for each class. (a) P curve. (b) R curve. (c) PR curve. (d) F1 curve.
Figure 12. The confusion matrix.
Figure 13. Box_loss comparison.
Figure 14. Cls_loss comparison.
Figure 15. Obj_loss comparison.
Abstract
1. Introduction
- (1) A supervised learning dataset, COP (Crowd Occlusion Person), was constructed by selecting 9000 labeled images from the WiderPerson [18] dataset. These images were split into training and validation sets at a ratio of 9:1.
- (2) The backbone was optimized by integrating CBAM [19], enhancing the focus on crucial fine-grained feature information while suppressing unimportant background details.
- (3) BiFPN [20] was introduced into the neck of the original YOLOv5. The bidirectional connections in BiFPN preserve the integrity of features at each layer, enabling a multi-scale feature fusion network that effectively handles the transmission and fusion of feature information across scales and further improves detection precision.
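Contribution (2) rests on CBAM's sequential channel-then-spatial attention [19]. As a rough illustration only (not the authors' implementation), the two-step reweighting can be sketched in NumPy as below; the shared-MLP weights `w1`/`w2` are hypothetical placeholders, and the 7×7 convolution of the spatial branch is replaced by a plain sum of the average- and max-pooled maps for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x, w1, w2):
    """Simplified CBAM: channel attention followed by spatial attention.
    x: feature map of shape (C, H, W)
    w1: (C//r, C) and w2: (C, C//r) -- shared-MLP weights (hypothetical)."""
    # Channel attention: shared MLP over avg- and max-pooled channel descriptors
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # ReLU hidden layer
    mc = sigmoid(mlp(avg) + mlp(mx))                # channel weights in (0, 1)
    x = x * mc[:, None, None]
    # Spatial attention: avg/max maps across channels (7x7 conv omitted)
    avg_s = x.mean(axis=0)                          # (H, W)
    max_s = x.max(axis=0)                           # (H, W)
    ms = sigmoid(avg_s + max_s)                     # spatial weights in (0, 1)
    return x * ms[None, :, :]
```

Because both attention maps lie in (0, 1), the module can only rescale activations, emphasizing informative channels and locations while suppressing background, which is the behavior contribution (2) exploits.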
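The weighted bidirectional fusion behind contribution (3) follows BiFPN's "fast normalized fusion" [20], in which each input feature map carries a learned weight, clipped to be non-negative by a ReLU, and the weighted sum is normalized. A minimal sketch under those assumptions:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style fusion: O = sum_i(w_i * I_i) / (eps + sum_j w_j),
    where each learned weight w_i is kept non-negative via ReLU.
    features: list of same-shape arrays; weights: list of scalars."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU clip
    num = sum(wi * f for wi, f in zip(w, features))
    return num / (eps + w.sum())
```

In a full BiFPN layer this fusion is applied at every node of the top-down and bottom-up paths (after resizing the inputs to a common resolution), which is how feature integrity is maintained across scales.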
1.1. Algorithms for Crowd Person Detection
1.2. Attention Mechanism
1.3. Feature Pyramid
2. Model and Method
2.1. Datasets
2.2. Model Description
2.2.1. CBAM Attention Mechanism
2.2.2. BiFPN Mechanism
2.2.3. Loss Function
2.2.4. EAAnet Model
3. Experiment Results
3.1. The Experimental Environment
3.2. Performance Indicators
3.3. Comparison with SOTA Method
3.4. Ablation Study
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sun, J.; Wang, Z. Vehicle and Pedestrian Detection Algorithm Based on Improved YOLOv5. IAENG Int. J. Comput. Sci. 2023, 50, 28.
- Lin, X.; Song, A. Research on improving pedestrian detection algorithm based on YOLOv5. In Proceedings of the International Conference on Electronic Information Engineering and Data Processing (EIEDP 2023), Nanchang, China, 26 May 2023; pp. 506–511.
- Jin, Y.; Lu, Z.; Wang, R.; Liang, C. Research on lightweight pedestrian detection based on improved YOLOv5. Math. Model. Eng. 2023, 9, 178–187.
- Hürlimann, M.; Coviello, V.; Bel, C.; Guo, X.; Berti, M.; Graf, C.; Hübl, J.; Miyata, S.; Smith, J.B.; Yin, H.-Y. Debris-flow monitoring and warning: Review and examples. Earth-Sci. Rev. 2019, 199, 102981.
- Bendali-Braham, M.; Weber, J.; Forestier, G.; Idoumghar, L.; Muller, P.-A. Recent trends in crowd analysis: A review. Mach. Learn. Appl. 2021, 4, 100023.
- Hung, G.L.; Bin Sahimi, M.S.; Samma, H.; Almohamad, T.A.; Lahasan, B. Faster R-CNN deep learning model for pedestrian detection from drone images. SN Comput. Sci. 2020, 1, 116.
- Mittal, U.; Chawla, P.; Tiwari, R. EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models. Neural Comput. Appl. 2023, 35, 4755–4774.
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
- Liu, S.; Chi, J.; Wu, C. FCOS-Lite: An Efficient Anchor-free Network for Real-time Object Detection. In Proceedings of the 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24 May 2021; pp. 1519–1524.
- Qiu, R.; Cai, Z.; Chang, Z.; Liu, S.; Tu, G. A two-stage image process for water level recognition via dual-attention CornerNet and CTransformer. Vis. Comput. 2023, 39, 2933–2952.
- Qi, Q.; Huo, Q.; Wang, J.; Sun, H.; Cao, Y.; Liao, J. Personalized Sketch-Based Image Retrieval by Convolutional Neural Network and Deep Transfer Learning. IEEE Access 2019, 7, 16537–16549.
- Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 260–275.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Zhang, Q.; Liu, Y.; Zhang, Y.; Zong, M.; Zhu, J. Improved YOLOv3 integrating SENet and optimized GIoU loss for occluded pedestrian detection. Sensors 2023, 23, 9089.
- Tang, F.; Yang, F.; Tian, X. Long-Distance Person Detection Based on YOLOv7. Electronics 2023, 12, 1502.
- Dai, K.; Sui, X.; Wang, L.; Wu, Q.; Chen, Q.; Gu, G. Research on multi-target detection method based on deep learning. In Proceedings of the Seventh Symposium on Novel Photoelectronic Detection Technology and Application, Kunming, China, 5–7 November 2020; p. 117637U.
- Yang, K.; Song, Z. Deep Learning-Based Object Detection Improvement for Fine-Grained Birds. IEEE Access 2021, 9, 67901–67915.
- Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild. IEEE Trans. Multimedia 2019, 22, 380–393.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Taleb, N.O.; Ben Maati, M.L.; Nanne, M.F.; Aboubekrine, A.M.; Chergui, A. Study of Haar-AdaBoost (VJ) and HOG-AdaBoost (PoseInv) Detectors for People Detection. Int. J. Adv. Comput. Sci. Appl. 2021, 12.
- Papageorgiou, C.; Poggio, T. A Trainable System for Object Detection. Int. J. Comput. Vis. 2000, 38, 15–33.
- Wu, B.; Nevatia, R. Tracking of Multiple, Partially Occluded Humans based on Static Body Part Detection. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 951–958.
- Maolin, L.; Shen, J. Fast Object Detection Method Based on Deformable Part Model (DPM). Patent EP3183691A1, 19 December 2017.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Felzenszwalb, P.; Girshick, R.; McAllester, D.; Ramanan, D. Visual object detection with deformable part models. Commun. ACM 2013, 56, 97–105.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
- Zhou, C.; Yuan, J. Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 135–151.
- Pei, D.; Jing, M.; Liu, H.; Sun, F.; Jiang, L. A fast RetinaNet fusion framework for multi-spectral pedestrian detection. Infrared Phys. Technol. 2020, 105, 103178.
- Peng, Q.; Luo, W.; Hong, G.; Feng, M.; Xia, Y.; Yu, L.; Hao, X.; Wang, X.; Li, M. Pedestrian detection for transformer substation based on Gaussian mixture model and YOLO. In Proceedings of the 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 27–28 August 2016; pp. 562–565.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Desai, A.P.; Razeghin, M.; Meruvia-Pastor, O.; Peña-Castillo, L. GeNET: A web application to explore and share Gene Co-expression Network Analysis data. PeerJ 2017, 5, e3678.
- Wang, C.; Zhong, C. Adaptive Feature Pyramid Networks for Object Detection. IEEE Access 2021, 9, 107024–107032.
- Qing, Y.; Liu, W.; Feng, L.; Gao, W. Improved YOLO Network for Free-Angle Remote Sensing Target Detection. Remote Sens. 2021, 13, 2171.
- Wang, H.; Guo, E.; Chen, F.; Chen, P. Depth Completion in Autonomous Driving: Adaptive Spatial Feature Fusion and Semi-Quantitative Visualization. Appl. Sci. 2023, 13, 9804.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
| Rank | Model | Box mAP | Params (M) |
|---|---|---|---|
| 1 | Co-DETR | 66.0 | 348 |
| 2 | InternImage-H | 65.4 | 2180 |
| 3 | Focal-Stable-DINO (Focal-Huge, no TTA) | 64.8 | 689 |
| 4 | Co-DETR (Swin-L) | 64.8 | 218 |
| 5 | InternImage-XL | 64.3 | 602 |
| 6 | Relation-DETR (Focal-L) | 63.5 | 214 |
| 7 | SwinV2-G (HTC++) | 63.1 | 3000 |
| 8 | YOLOX-X (Modified CSP v5) | 51.2 | 99.1 |
| 9 | LeYOLO-Large | 41.0 | 2.4 |
| 11 | YOLOX-Tiny (416 × 416, single-scale) | 32.8 | 5.06 |
| Model | P (%) | mAP50 (%) | mAP50:95 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| YOLOv7 | 61.5 | 52.1 | 26.7 | 9.1 | 26.0 |
| Dynamic_RCNN | 62.5 | 30.6 | 15.5 | 41.4 | 67.4 |
| RetinaNet | 68.0 | 28.0 | 13.5 | 36.4 | 57.1 |
| ATSS | 71.0 | 40.6 | 20.5 | 32.1 | 55.9 |
| Faster_RCNN | 74.5 | 47.3 | 23.5 | 41.1 | 67.4 |
| YOLOv5s | 76.8 | 69.8 | 37.6 | 7.0 | 15.8 |
| EAAnet | 78.6 | 68.5 | 36.9 | 7.1 | 16.0 |
| Model | All | Pedestrians | Riders | Partially Visible Persons | Crowd |
|---|---|---|---|---|---|
| YOLOv5s | 76.8 | 86.1 | 74.0 | 73.9 | 76.1 |
| EAAnet (ours) | 78.6 | 85.8 | 78.6 | 74.7 | 75.6 |
| 3 Layer | 5 Layer | 7 Layer | P (%) | mAP50 (%) |
|---|---|---|---|---|
|  |  |  | 76.8 | 69.8 |
| √ |  |  | 76.4 | 69.9 |
|  | √ |  | 77.0 | 70.0 |
|  |  | √ | 75.7 | 70.0 |
| √ | √ |  | 76.2 | 69.9 |
| √ |  | √ | 77.3 | 70.0 |
| √ | √ | √ | 76.5 | 69.2 |
|  | √ | √ | 78.2 | 69.1 |
| CBAM | BiFPN | P (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
|  |  | 76.8 | 7.0 | 15.8 |
| √ |  | 78.2 | 7.0 | 15.8 |
|  | √ | 73.5 | 7.1 | 16.0 |
| √ | √ | 78.6 | 7.1 | 16.0 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, W.; Wu, W.; Dai, W.; Huang, F. EAAnet: Efficient Attention and Aggregation Network for Crowd Person Detection. Appl. Sci. 2024, 14, 8692. https://doi.org/10.3390/app14198692