EAAnet: Efficient Attention and Aggregation Network for Crowd Person Detection
Figure 1. EAAnet achieves the highest precision performance on crowd person detection with low computational cost.
Figure 2. Typical pyramid structure. (a) FPN and PANet. (b) ASFF.
Figure 3. Typical scenes in COP dataset. (a) Indoor crowd person. (b) Outdoor crowd person.
Figure 4. Detailed information on COP dataset. (a) The number of instances for each category. (b) The distribution of instances being aligned. (c) The distribution of instances’ center points after being normalized. (d) The proportion of width and height.
Figure 5. The CBAM attention module.
Figure 6. BiFPN structure.
Figure 7. EAAnet model structure. (a) Overall structure. (b) CBAM structure inserted in the backbone.
Figure 8. Performance comparison. (a) P curve. (b) mAP50.
Figure 9. Detection effects. (a) GT. (b) Prediction of YOLOv5. (c) Prediction of EAAnet.
Figure 10. Heat map of some representative stages. (a) Ordinary environment. (b) Indoor crowd. (c) Lower feature map in dense environment.
Figure 11. Performance difference for each class. (a) P curve. (b) R curve. (c) PR curve. (d) F1 curve.
Figure 12. The confusion matrix.
Figure 13. Box_loss comparison.
Figure 14. Cls_loss comparison.
Figure 15. Obj_loss comparison.
Abstract
1. Introduction
- (1) A supervised learning dataset, COP (Crowd Occlusion Person), was constructed by selecting 9000 labeled images from the WiderPerson [18] dataset. These images were split into training and validation sets at a ratio of 9:1.
- (2) The backbone was optimized by integrating CBAM [19], enhancing the focus on crucial fine-grained feature information while suppressing unimportant background details.
- (3) BiFPN [20] was introduced into the neck of the original YOLOv5. The bidirectional connections in BiFPN preserve the integrity of features at each layer, enabling a multi-scale feature fusion network that effectively handles the transmission and fusion of feature information across scales and further improves detection precision.
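Contribution (2) rests on CBAM's sequential channel-then-spatial attention [19]. As a rough illustration only (not the authors' implementation), the two-step reweighting can be sketched in NumPy as below; the shared-MLP weights `w1`/`w2` are hypothetical placeholders, and the 7×7 convolution of the spatial branch is replaced by a plain sum of the average- and max-pooled maps for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(x, w1, w2):
    """Simplified CBAM: channel attention followed by spatial attention.
    x: feature map of shape (C, H, W)
    w1: (C//r, C) and w2: (C, C//r) -- shared-MLP weights (hypothetical)."""
    # Channel attention: shared MLP over avg- and max-pooled channel descriptors
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # ReLU hidden layer
    mc = sigmoid(mlp(avg) + mlp(mx))                # channel weights in (0, 1)
    x = x * mc[:, None, None]
    # Spatial attention: avg/max maps across channels (7x7 conv omitted)
    avg_s = x.mean(axis=0)                          # (H, W)
    max_s = x.max(axis=0)                           # (H, W)
    ms = sigmoid(avg_s + max_s)                     # spatial weights in (0, 1)
    return x * ms[None, :, :]
```

Because both attention maps lie in (0, 1), the module can only rescale activations, emphasizing informative channels and locations while suppressing background, which is the behavior contribution (2) exploits.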
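The weighted bidirectional fusion behind contribution (3) follows BiFPN's "fast normalized fusion" [20], in which each input feature map carries a learned weight, clipped to be non-negative by a ReLU, and the weighted sum is normalized. A minimal sketch under those assumptions:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style fusion: O = sum_i(w_i * I_i) / (eps + sum_j w_j),
    where each learned weight w_i is kept non-negative via ReLU.
    features: list of same-shape arrays; weights: list of scalars."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU clip
    num = sum(wi * f for wi, f in zip(w, features))
    return num / (eps + w.sum())
```

In a full BiFPN layer this fusion is applied at every node of the top-down and bottom-up paths (after resizing the inputs to a common resolution), which is how feature integrity is maintained across scales.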
1.1. Algorithms for Crowd Person Detection
1.2. Attention Mechanism
1.3. Feature Pyramid
2. Model and Method
2.1. Datasets
2.2. Model Description
2.2.1. CBAM Attention Mechanism
2.2.2. BiFPN Mechanism
2.2.3. Loss Function
2.2.4. EAAnet Model
3. Experiment Results
3.1. The Experimental Environment
3.2. Performance Indicators
3.3. Comparison with SOTA Method
3.4. Ablation Study
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Sun, J.; Wang, Z. Vehicle and Pedestrian Detection Algorithm Based on Improved YOLOv5. IAENG Int. J. Comput. Sci. 2023, 50, 28.
- Lin, X.; Song, A. Research on improving pedestrian detection algorithm based on YOLOv5. In Proceedings of the International Conference on Electronic Information Engineering and Data Processing (EIEDP 2023), Nanchang, China, 26 May 2023; pp. 506–511.
- Jin, Y.; Lu, Z.; Wang, R.; Liang, C. Research on lightweight pedestrian detection based on improved YOLOv5. Math. Model. Eng. 2023, 9, 178–187.
- Hürlimann, M.; Coviello, V.; Bel, C.; Guo, X.; Berti, M.; Graf, C.; Hübl, J.; Miyata, S.; Smith, J.B.; Yin, H.-Y. Debris-flow monitoring and warning: Review and examples. Earth-Sci. Rev. 2019, 199, 102981.
- Bendali-Braham, M.; Weber, J.; Forestier, G.; Idoumghar, L.; Muller, P.-A. Recent trends in crowd analysis: A review. Mach. Learn. Appl. 2021, 4, 100023.
- Hung, G.L.; Bin Sahimi, M.S.; Samma, H.; Almohamad, T.A.; Lahasan, B. Faster R-CNN deep learning model for pedestrian detection from drone images. SN Comput. Sci. 2020, 1, 116.
- Mittal, U.; Chawla, P.; Tiwari, R. EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models. Neural Comput. Appl. 2023, 35, 4755–4774.
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
- Liu, S.; Chi, J.; Wu, C. FCOS-Lite: An Efficient Anchor-free Network for Real-time Object Detection. In Proceedings of the 33rd Chinese Control and Decision Conference (CCDC), Kunming, China, 22–24 May 2021; pp. 1519–1524.
- Qiu, R.; Cai, Z.; Chang, Z.; Liu, S.; Tu, G. A two-stage image process for water level recognition via dual-attention CornerNet and CTransformer. Vis. Comput. 2023, 39, 2933–2952.
- Qi, Q.; Huo, Q.; Wang, J.; Sun, H.; Cao, Y.; Liao, J. Personalized Sketch-Based Image Retrieval by Convolutional Neural Network and Deep Transfer Learning. IEEE Access 2019, 7, 16537–16549.
- Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 260–275.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Zhang, Q.; Liu, Y.; Zhang, Y.; Zong, M.; Zhu, J. Improved YOLOv3 integrating SENet and optimized GIoU loss for occluded pedestrian detection. Sensors 2023, 23, 9089.
- Tang, F.; Yang, F.; Tian, X. Long-Distance Person Detection Based on YOLOv7. Electronics 2023, 12, 1502.
- Dai, K.; Sui, X.; Wang, L.; Wu, Q.; Chen, Q.; Gu, G. Research on multi-target detection method based on deep learning. In Proceedings of the Seventh Symposium on Novel Photoelectronic Detection Technology and Application, Kunming, China, 5–7 November 2020; p. 117637U.
- Yang, K.; Song, Z. Deep Learning-Based Object Detection Improvement for Fine-Grained Birds. IEEE Access 2021, 9, 67901–67915.
- Zhang, S.; Xie, Y.; Wan, J.; Xia, H.; Li, S.Z.; Guo, G. WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild. IEEE Trans. Multimedia 2019, 22, 380–393.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Taleb, N.O.; Ben Maati, M.L.; Nanne, M.F.; Aboubekrine, A.M.; Chergui, A. Study of Haar-AdaBoost (VJ) and HOG-AdaBoost (PoseInv) Detectors for People Detection. Int. J. Adv. Comput. Sci. Appl. 2021, 12.
- Papageorgiou, C.; Poggio, T. A Trainable System for Object Detection. Int. J. Comput. Vis. 2000, 38, 15–33.
- Wu, B.; Nevatia, R. Tracking of Multiple, Partially Occluded Humans based on Static Body Part Detection. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 951–958.
- Maolin, L.; Shen, J. Fast Object Detection Method Based on Deformable Part Model (DPM). Patent EP3183691A1, 19 December 2017.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Felzenszwalb, P.; Girshick, R.; McAllester, D.; Ramanan, D. Visual object detection with deformable part models. Commun. ACM 2013, 56, 97–105.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
- Zhou, C.; Yuan, J. Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 135–151.
- Pei, D.; Jing, M.; Liu, H.; Sun, F.; Jiang, L. A fast RetinaNet fusion framework for multi-spectral pedestrian detection. Infrared Phys. Technol. 2020, 105, 103178.
- Peng, Q.; Luo, W.; Hong, G.; Feng, M.; Xia, Y.; Yu, L.; Hao, X.; Wang, X.; Li, M. Pedestrian detection for transformer substation based on Gaussian mixture model and YOLO. In Proceedings of the 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 27–28 August 2016; pp. 562–565.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
- Desai, A.P.; Razeghin, M.; Meruvia-Pastor, O.; Peña-Castillo, L. GeNET: A web application to explore and share Gene Co-expression Network Analysis data. PeerJ 2017, 5, e3678.
- Wang, C.; Zhong, C. Adaptive Feature Pyramid Networks for Object Detection. IEEE Access 2021, 9, 107024–107032.
- Qing, Y.; Liu, W.; Feng, L.; Gao, W. Improved YOLO Network for Free-Angle Remote Sensing Target Detection. Remote Sens. 2021, 13, 2171.
- Wang, H.; Guo, E.; Chen, F.; Chen, P. Depth Completion in Autonomous Driving: Adaptive Spatial Feature Fusion and Semi-Quantitative Visualization. Appl. Sci. 2023, 13, 9804.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
| Rank | Model | Box mAP | Params (M) |
|---|---|---|---|
| 1 | Co-DETR | 66.0 | 348 |
| 2 | InternImage-H | 65.4 | 2180 |
| 3 | Focal-Stable-DINO (Focal-Huge, no TTA) | 64.8 | 689 |
| 4 | Co-DETR (Swin-L) | 64.8 | 218 |
| 5 | InternImage-XL | 64.3 | 602 |
| 6 | Relation-DETR (Focal-L) | 63.5 | 214 |
| 7 | SwinV2-G (HTC++) | 63.1 | 3000 |
| 8 | YOLOX-X (Modified CSP v5) | 51.2 | 99.1 |
| 9 | LeYOLO-Large | 41.0 | 2.4 |
| 11 | YOLOX-Tiny (416 × 416, single-scale) | 32.8 | 5.06 |
| Model | P (%) | mAP50 (%) | mAP50:95 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| YOLOv7 | 61.5 | 52.1 | 26.7 | 9.1 | 26.0 |
| Dynamic_RCNN | 62.5 | 30.6 | 15.5 | 41.4 | 67.4 |
| RetinaNet | 68.0 | 28.0 | 13.5 | 36.4 | 57.1 |
| ATSS | 71.0 | 40.6 | 20.5 | 32.1 | 55.9 |
| Faster_RCNN | 74.5 | 47.3 | 23.5 | 41.1 | 67.4 |
| YOLOv5s | 76.8 | 69.8 | 37.6 | 7.0 | 15.8 |
| EAAnet | 78.6 | 68.5 | 36.9 | 7.1 | 16.0 |
| Model | All | Pedestrians | Riders | Partially Visible Persons | Crowd |
|---|---|---|---|---|---|
| YOLOv5s | 76.8 | 86.1 | 74.0 | 73.9 | 76.1 |
| EAAnet (ours) | 78.6 | 85.8 | 78.6 | 74.7 | 75.6 |
| 3 Layer | 5 Layer | 7 Layer | P (%) | mAP50 (%) |
|---|---|---|---|---|
|  |  |  | 76.8 | 69.8 |
| √ |  |  | 76.4 | 69.9 |
|  | √ |  | 77.0 | 70.0 |
|  |  | √ | 75.7 | 70.0 |
| √ | √ |  | 76.2 | 69.9 |
| √ |  | √ | 77.3 | 70.0 |
| √ | √ | √ | 76.5 | 69.2 |
|  | √ | √ | 78.2 | 69.1 |
| CBAM | BiFPN | P (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
|  |  | 76.8 | 7.0 | 15.8 |
| √ |  | 78.2 | 7.0 | 15.8 |
|  | √ | 73.5 | 7.1 | 16.0 |
| √ | √ | 78.6 | 7.1 | 16.0 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, W.; Wu, W.; Dai, W.; Huang, F. EAAnet: Efficient Attention and Aggregation Network for Crowd Person Detection. Appl. Sci. 2024, 14, 8692. https://doi.org/10.3390/app14198692