Lightweight small target detection method combined with attention mechanism
Technical Field
The invention belongs to the application of a deep learning technology in the field of machine vision, and particularly relates to a lightweight small target detection method combined with an attention mechanism.
Background
Target detection finds the specific target categories and their accurate positions in a given image. Small target detection is an important research topic in this field, with important application value in scenes such as remote sensing image target recognition, infrared imaging target recognition, and agricultural pest recognition. In target detection, a target occupying 0.12% or less of the whole image, or covering fewer than 32 × 32 pixels, is generally referred to as a small target. Detecting small targets in an image is very difficult because of their low resolution and noise; the features extracted after multi-layer convolution are often insignificant.
Early small target detection mainly relied on manually designed methods to obtain characteristic information of the target. Wen et al. applied the wavelet transform to the small target detection process (see Wen Peizhi et al. Infrared small target detection method under sea background [J]. Opto-Electronic Engineering, 2004): multi-resolution analysis based on orthogonal wavelet decomposition realizes band selection and suppresses interference from noise and background, edges in different directions are fused to obtain candidate points, and interfering targets are finally eliminated according to a gray threshold. Chen et al. (see C. L. P. Chen, H. Li, Y. Wei, et al. A Local Contrast Method for Small Infrared Target Detection [J]. IEEE Transactions on Geoscience and Remote Sensing, 2014, 52(1): 574-581), motivated by biological visual mechanisms, use a proposed local contrast measure to obtain a local contrast map of the input image, which represents the difference between the current location and its neighborhood, thus achieving both target signal enhancement and background clutter suppression; the target is finally segmented by adaptive thresholds. These methods start from the bottom-layer features of the image and accomplish the detection task with basic image features. They are relatively simple to operate, but suffer from missed detections, false detections, and poor real-time performance when detecting small targets against complex backgrounds.
In recent years, with improvements in computing power and the rapid development of deep learning theory, deep learning techniques have been widely applied to target detection. Currently popular target detection models can be roughly divided into two categories: one-stage detection algorithms, which treat classification and localization as a regression task, with SSD and YOLO as representatives; and two-stage detection algorithms, which first select candidate boxes and then classify the targets, with R-CNN and Faster R-CNN as representatives. Because they treat the whole detection task as a regression operation, one-stage detection algorithms have a great advantage in real-time performance.
The main ways of improving small target detection with deep learning are multi-scale representation, context information, super-resolution, and so on. The patent with application number CN202010537199.7 discloses a detection method for small objects in pictures: six feature maps of different sizes are obtained from the picture to be detected, the pyramid bottom-layer and high-layer feature maps among them are fused by bilinear interpolation to obtain six new feature maps of different sizes, and these new feature maps participate in prediction. This method uses multi-scale feature maps to enhance target feature information, but it is easily disturbed by complex backgrounds and has a high false detection rate. The patent with application number CN202010444356.X discloses a remote sensing image small target detection method based on resolution enhancement, which performs super-resolution processing on a remote sensing image containing small targets before detection. It addresses the problems that small targets in remote sensing images carry little usable feature information and that small target regions undergo geometric deformation: super-resolution processing refines the detailed feature information of small targets, and a region-based deformable convolution network makes full use of their limited feature information, improving the detection capability for small targets in remote sensing images. Although this method achieves better accuracy, the increased image resolution reduces the real-time performance of the network and is not conducive to a lightweight network.
Disclosure of Invention
In order to solve the problems of high false detection rate, missing detection, poor real-time performance and the like of the existing target detection method for detecting the small target, the invention provides a lightweight small target detection method combined with an attention mechanism, which comprises the following steps:
(1) building improved small target detection network based on YOLOv4
The small target detection network is obtained by improving a one-stage target detection network YOLOv4, and the specific network structure improvement comprises the following three aspects:
(1-1) constructing an MSE multi-scale attention mechanism module, and inserting the MSE multi-scale attention mechanism module into a feature extraction network
The MSE multi-scale attention mechanism module constructed by the invention is obtained by improving the SE attention module, a lightweight attention mechanism module for computer vision proposed by Hu et al. in 2017. The MSE multi-scale attention mechanism module can be conveniently inserted between two layers of a feature extraction network; by learning global information, it selects and emphasizes the feature channels of interest and suppresses irrelevant interference information.
An MSE multi-scale attention mechanism module is constructed and inserted between a Concat layer and a CBM module in each CSP module of a YOLOv4 feature extraction network CSPDarknet53 to form a new MSE-CSPUnit module, and the feature extraction network of the MSE-CSPDarknet53 with attention information is obtained. The specific steps of the construction of the MSE multi-scale attention mechanism module are as follows:
(1-1-1) First, the output of the Concat layer of the CSP module is taken as the input feature map, and feature maps of multiple scales are integrated through convolution kernels of different sizes; the next feature extraction operations are performed on these multi-scale feature maps. The convolution kernel sizes are 3 × 3, 5 × 5, and 7 × 7. To avoid the parameter explosion caused by large convolution kernels, 2 stacked 3 × 3 convolution layers are used instead of the 5 × 5 kernel, and 3 stacked 3 × 3 convolution layers instead of the 7 × 7 kernel. Let the input feature map be X ∈ R^(C×H×W), where C, H, W are the number of input channels, the input height, and the input width, respectively. Feature extraction with convolution kernels of different sizes is performed on the input feature map as follows:
Xc=V3×3X+V5×5X+V7×7X
where Xc is the multi-scale feature map output, and V denotes the convolution operation with a convolution kernel of the indicated size.
(1-1-2) A squeeze operation is performed on Xc: the channels are squeezed with global average pooling and global max pooling respectively to obtain channel-level feature information, where global average pooling emphasizes the global features of the feature map and global max pooling emphasizes its local features:
Xavg = (1/(H × W)) Σi Σj Xc(i, j)
Xmax = max(Xc(i, j))
where Xc is the multi-scale feature input, Xavg is the feature obtained after global average pooling, Xmax is the feature obtained after global max pooling, i = 1, 2, …, H, j = 1, 2, …, W, and H, W are the input height and width, respectively.
(1-1-3) An excitation operation is performed on Xavg and Xmax, and the channel attention weight information Xs is generated through addition and normalization. The Mish activation function is used during excitation to preserve more of the non-linear relationships between channels. FC1 and FC2 are two different fully connected layers, where C is the number of input channels and r is the dimension-reduction ratio: FC1 reduces the dimension to cut fully connected layer parameters, and FC2 restores the original dimension. The excitation and normalization operations are as follows:
Xa = FC2(Mish(FC1(Xavg)))
Xm = FC2(Mish(FC1(Xmax)))
Xs = Softmax(Xa + Xm)
where Mish is a non-linear activation function and Softmax is a normalization function.
(1-1-4) The channel attention weights generated in (1-1-3) are applied to the multi-scale feature map generated in (1-1-1) in a weighting operation to obtain the output Xweight of the MSE multi-scale attention module, and Xweight is used as the input of the CBM module in the MSE-CSPUnit module.
Xweight=Scale(Xc,Xs)
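Steps (1-1-1) through (1-1-4) can be sketched in PyTorch as follows. The patent provides no reference code, so all class and variable names here are my own; the channel counts, padding choices, and reduction ratio r are illustrative assumptions, with the stacked 3 × 3 branches standing in for the 5 × 5 and 7 × 7 kernels as described above:

```python
import torch
import torch.nn as nn


class MSEAttention(nn.Module):
    """Sketch of the MSE multi-scale attention module (names are illustrative)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Multi-scale branches: 3x3, two stacked 3x3 (~5x5), three stacked 3x3 (~7x7);
        # padding=1 keeps the spatial size so the branch outputs can be summed.
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.branch7 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1))
        # FC1 reduces C -> C/r, FC2 restores C/r -> C, shared by both pooled paths.
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        # nn.Mish requires PyTorch >= 1.9; on the 1.6 stack the patent mentions,
        # x * tanh(softplus(x)) would be written out by hand.
        self.act = nn.Mish()

    def forward(self, x):
        xc = self.branch3(x) + self.branch5(x) + self.branch7(x)  # Xc
        b, c, _, _ = xc.shape
        x_avg = xc.mean(dim=(2, 3))   # squeeze: global average pooling -> Xavg
        x_max = xc.amax(dim=(2, 3))   # squeeze: global max pooling -> Xmax
        xa = self.fc2(self.act(self.fc1(x_avg)))
        xm = self.fc2(self.act(self.fc1(x_max)))
        xs = torch.softmax(xa + xm, dim=1)        # channel attention weights Xs
        return xc * xs.view(b, c, 1, 1)           # Xweight = Scale(Xc, Xs)
```

Because the module leaves the spatial size and channel count unchanged, it can be dropped between the Concat layer and the CBM module of a CSP block without altering the surrounding layers.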
(1-2) adding shallow feature map as prediction layer
Deep features carry stronger semantic information, while shallow features retain rich resolution information, which is more beneficial to small target detection. The 19 × 19 feature maps output by the FPN and PAN structures are deleted, keeping the original 38 × 38 and 76 × 76 output feature maps; the output of MSE-CSPUnit ×2 (two stacked MSE-CSPUnit modules) is feature-fused with the upsampled deeper feature map through the FPN and PAN structures to obtain a shallow 152 × 152 feature map. Finally, three feature maps of different sizes, 38 × 38, 76 × 76, and 152 × 152, predict targets at different scales.
(1-3) SPP Module improvements
The SPP module can enrich the expressive capability of the feature map and provide important context information. To improve small target detection performance, SPP modules are placed in front of the 38 × 38, 76 × 76, and 152 × 152 feature maps respectively, realizing effective fusion of local and global features. The SPP module performs 1 × 1, 5 × 5, 9 × 9, and 13 × 13 max pooling operations on the input feature map, then concatenates the generated feature maps of different scales along the channel dimension.
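A minimal sketch of the SPP module as described: stride-1 max pooling with same-size padding so all branches keep the input resolution, followed by channel concatenation. The class name is my own, and the 1 × 1 pooling branch is taken as the identity:

```python
import torch
import torch.nn as nn


class SPP(nn.Module):
    """Spatial pyramid pooling sketch: 1x1 (identity), 5x5, 9x9, 13x13 max pools."""

    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        # stride=1 with padding=k//2 preserves the spatial size of each branch
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        # the untouched input x plays the role of the 1x1 pooling branch
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```

With four branches the output has 4× the input channels, so a following 1 × 1 convolution (not shown) would normally compress it back.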
(2) Training and optimizing small target detection networks
Aiming at the specific application scene, a small target detection data set is constructed, and multi-mode random adjustments are applied to the image data through data enhancement: the number of small targets in the data and the image brightness, contrast, and saturation are randomly adjusted to enhance the generalization performance of the model.
Finally, anchor boxes are set to fit the targets in the data set: the anchor boxes are re-clustered on the target data set with the Kmeans++ algorithm to obtain anchor box parameters better suited to the current data set, accelerating the convergence of the network.
(3) Model lightweight for small target detection network
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The γ of the BN layer in each convolution module of YOLOv4 is used as a scaling factor, and an L1 regularization term on the BN-layer γ is added to the loss function. The network is sparsely trained for a preset number of epochs; after gradient updates, the γ values are sorted, a pruning threshold is set, and the channels whose γ falls below the threshold are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing a BN layer except the convolution layers before the upsampling layers and the SPP structure, producing a pruned model file and a model structure configuration file. For the YOLOv4 sparse training, the established target loss function is:
L = Σ(x,y) l(f(x, w), y) + λ Σγ g(γ)
where x is the input value of the model, y is the desired output value, w denotes the trainable parameters of the network, l(·) is the original detection loss, g(·) is the penalty term on the scaling factors, and λ is the balance factor.
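The sparse-training objective can be sketched as follows, assuming the standard network-slimming form in which g(·) is the L1 norm of the BN scaling factors; `sparsity_loss` and `lam` are illustrative names not taken from the patent:

```python
import torch
import torch.nn as nn


def sparsity_loss(model, base_loss, lam=1e-4):
    """Network-slimming style objective: task loss + lam * sum(|gamma|) over BN scales.

    `model.modules()` is walked for every BatchNorm2d layer; in a real pipeline
    the layers before upsampling and the SPP structure would be excluded.
    """
    l1 = sum(m.weight.abs().sum() for m in model.modules()
             if isinstance(m, nn.BatchNorm2d))
    return base_loss + lam * l1
```

Backpropagating this loss drives many γ values toward zero, which is what makes the later threshold-based channel removal cheap in accuracy.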
(3-2) knowledge distillation recovery model accuracy
After channel pruning, although each removed channel contributes only slightly to the model output, the accuracy of the pruned model drops slightly, so the model accuracy needs to be recovered.
Knowledge distillation is performed with the unpruned YOLOv4 network as the teacher network and the channel-pruned network as the student network. The knowledge distillation of YOLOv4 covers both the classification task and the regression task. For distillation of the regression results, the student does not learn directly from the teacher network when computing the regression loss, since the regression output is unbounded and the teacher's predictions may even oppose the label values. First, the L2 losses between the teacher network and the label values and between the student network and the label values are computed separately, and a range w is set; only when the student's L2 loss deviates from the teacher's L2 loss by more than w is the student's L2 loss counted in the loss. That is, once the performance of the student network comes within a certain margin of the teacher network, this loss term is no longer computed. The overall loss function is:
Lreg = (1 − v)·LsL1(Rs, yreg) + v·Lb(Rs, Rt, yreg)
where w is the preset deviation range, yreg is the true label value, Rt and Rs are the regression outputs of the teacher and student respectively, Lb is the distillation part of the loss, LsL1 is the loss between the student network and the true labels, and v is the balance factor between Lb and LsL1, set to 0.1–0.5 in the early stage of training and 0.6–0.9 in the later stage; Lreg is the total regression loss learned through distillation.
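The bounded regression distillation might be implemented roughly as below. The function and argument names are my own; smooth L1 stands in for the student-label loss LsL1, and the teacher-bounded term Lb only activates when the student's L2 error lags the teacher's by more than w, an assumption consistent with the description above:

```python
import torch
import torch.nn.functional as F


def distill_reg_loss(student, teacher, target, w=0.3, v=0.5):
    """Sketch of the bounded regression distillation loss Lreg."""
    ls = F.smooth_l1_loss(student, target)   # LsL1: student vs. label
    lt = F.mse_loss(teacher, target)         # teacher L2 error vs. label
    l2_s = F.mse_loss(student, target)       # student L2 error vs. label
    # Lb: count the student's L2 loss only when it lags the teacher by more than w;
    # once the student is within the margin, the distillation term vanishes.
    lb = l2_s if (l2_s - lt) > w else torch.zeros((), device=student.device)
    return (1 - v) * ls + v * lb
```

In training, v would be scheduled as described: small (0.1–0.5) early so the label loss dominates, larger (0.6–0.9) later.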
(4) Detection of input images using trained small target detection network models
A frame of UAV aerial imagery is input and sent into the trained and optimized small target detection network for target localization and classification. The network first feeds the image into the attention-equipped feature extraction network to extract features, and outputs 3 feature maps of different resolutions through the SPP modules. Targets at three different scales are detected on these 3 feature maps with the regression-and-classification approach, and after filtering with a confidence threshold, the classification and localization results of the targets are obtained; this is repeated until all pictures in the test set have been detected.
Compared with the prior art, the invention has the following beneficial effects:
Compared with traditional small target detection methods, the invention designs the MSE attention module based on SE (Squeeze-and-Excitation) and inserts it into the YOLOv4 feature extraction network, enhancing the network's attention to regions of interest and reducing the interference of complex backgrounds during small target detection. A shallow feature map is then added as a prediction layer, and feature maps of three sizes, 38 × 38, 76 × 76, and 152 × 152, predict targets at different scales. The SPP module is improved: SPP modules are placed in front of the 38 × 38, 76 × 76, and 152 × 152 feature maps respectively, realizing effective fusion of local and global features. Finally, the model is compressed and optimized with channel pruning and knowledge distillation strategies, achieving large-scale compression of the model parameters with little loss of precision. In addition, data enhancement randomly adjusts the number of small targets and the brightness, contrast, and saturation of images in the data set, strengthening the training effect of the model. On small target data sets, the network shows better detection performance and robustness while meeting the requirements of lightweight model deployment.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a MSE-CSPUnit module after adding an MSE multi-scale attention mechanism module;
FIG. 3 is a MSE multi-scale attention module structure of the present invention;
FIG. 4 is a small target detection network architecture designed by the present invention;
FIG. 5 is a comparison of the number of channels after compression of the model, where the dark columns are before pruning and the light columns are after pruning;
fig. 6 is a diagram of the detection effect of the small target detection network on the target picture according to the present invention, wherein (a), (c) are the detection effects before improvement, and (b) and (d) are the detection effects after improvement corresponding to (a), (c).
Detailed Description
The present invention will be described in detail below with reference to examples and drawings, but the present invention is not limited thereto. The target detection embodiment of the invention covers various small targets in a data set. The selected processing platform combines an Intel i9-9900K CPU, an NVIDIA RTX 2080 Ti GPU, and 32 GB of RAM, and the operating system is 64-bit Ubuntu 18.04 Linux. The method is implemented on the deep learning framework PyTorch 1.6.
The method for detecting the light-weight small target with the attention mechanism introduced as shown in FIG. 1 comprises four parts:
(1) building an improved small target detection network based on YOLOv 4;
(2) training and optimizing the small target detection network;
(3) carrying out model lightweight on the small target detection network;
(4) and detecting the input image by using the trained small target detection network model.
The first part of building an improved small target detection network based on YOLOv4 specifically comprises the following steps:
(1-1) designing an MSE multi-scale attention mechanism module, and embedding the MSE multi-scale attention mechanism module into a feature extraction network
An MSE multi-scale attention mechanism module is constructed and inserted between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53, forming a new MSE-CSPUnit module and yielding the attention-enhanced MSE-CSPDarknet53 feature extraction network, as shown in FIG. 2; apart from the MSE module, the remaining modules are the conventional structural modules of the YOLOv4 feature extraction network CSPDarknet53. The MSE multi-scale attention mechanism module is constructed as follows:
firstly, the output of a Concat layer of a CSP module is used as an input feature map, feature maps of various scales are integrated through convolution kernels of different sizes, and next feature extraction operation is carried out on the basis of the multi-scale feature maps, wherein the sizes of the convolution kernels are respectively 3 × 3, 5 × 5 and 7 × 7. In the case of a parameter amount explosion caused by using a large-size convolution kernel, 2 layers of 3 × 3 convolution kernels are used instead of the 5 × 5 convolution kernels, and 3 layers of 3 × 3 convolution kernels are used instead of the 7 × 7 convolution kernels. Let input characteristic diagram X ∈ RC×H×WC, H, W are input channel, input height, and input width, respectively, the process of feature extraction using convolution kernels of different sizes for the input feature map is as follows:
Xc=V3×3X+V5×5X+V7×7X
where Xc is the multi-scale fused feature output, and V denotes the convolution operation with a convolution kernel of the indicated size.
A squeeze operation is performed on Xc. Because small targets carry little feature information, global max pooling is used to emphasize the local information of the feature map, while global average pooling emphasizes its global features. The pooling operations are as follows:
Xavg = (1/(H × W)) Σi Σj Xc(i, j)
Xmax = max(Xc(i, j))
where Xavg is the feature obtained after global average pooling, Xmax is the feature obtained after global max pooling, i = 1, 2, …, H, j = 1, 2, …, W, and H, W are the input height and width, respectively.
An excitation operation is performed on Xavg and Xmax respectively; the results are added and normalized to generate the attention weight information Xs. The Mish activation function is used during excitation to preserve more of the non-linear relationships between channels. FC1 and FC2 are two different fully connected layers, where C is the number of input channels and r is the dimension-reduction ratio: FC1 reduces the dimension to cut fully connected layer parameters, and FC2 restores the original dimension. The excitation and normalization operations are as follows:
Xa = FC2(Mish(FC1(Xavg)))
Xm = FC2(Mish(FC1(Xmax)))
Xs = Softmax(Xa + Xm)
where Mish is a non-linear activation function and Softmax is a normalization function.
Xs is applied to the multi-scale feature map Xc generated in the first step in a weighting operation to obtain the output Xweight of the MSE multi-scale attention module, and Xweight is used as the input of the CBM module in the MSE-CSPUnit module.
Xweight=Scale(Xc,Xs)
(1-2) adding shallow features in the predicted layer
Deep features carry stronger semantic information, while shallow features retain rich resolution information, which is more beneficial to small target detection. The 19 × 19 feature maps output by the FPN and PAN structures are deleted, keeping the original 38 × 38 and 76 × 76 output feature maps; the output of MSE-CSPUnit ×2 is feature-fused with the upsampled deeper feature map through the FPN and PAN structures to obtain a shallow 152 × 152 feature map. Finally, three feature maps of different sizes, 38 × 38, 76 × 76, and 152 × 152, predict targets at different scales.
(1-3) SPP Module improvements
The SPP module can enrich the expressive capability of the feature map and provide important context information. To improve small target detection performance, SPP modules are placed in front of the 38 × 38, 76 × 76, and 152 × 152 feature maps respectively, realizing effective fusion of local and global features. The SPP module performs 1 × 1, 5 × 5, 9 × 9, and 13 × 13 max pooling operations on the input feature map, then concatenates the generated feature maps of different scales along the channel dimension.
The second part of training and optimizing the small target detection network specifically comprises:
(2-1) construction of data set
First, a small target data set is constructed; the experiments select the UAV aerial photography data set VisDrone2019. Because VisDrone2019 is shot from drones, it contains a large number of small and densely packed objects; illumination changes and object occlusion are further difficulties of this data set. In addition, since the drone images are captured from a near-vertical viewpoint, the detected objects contain fewer distinguishing features: for pedestrian detection, for example, a ground-level image may contain features of arms and legs, whereas a drone image may show little more than the top of the head.
(2-2) data enhancement and multimodal stochastic adjustment of picture data
During network training, an online enhancement mode is adopted on the data set to improve the training effect for small targets. Since the data set may contain few pictures of small targets, the model may be biased toward medium and large targets during training. Online data enhancement copies several small targets within a picture, manually increasing the number of times small objects appear and raising the probability that an anchor contains a small target, so that the model obtains more small target training samples. Meanwhile, pictures are randomly rotated and scaled, and brightness, contrast, and saturation are adjusted to increase the robustness of the model.
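The small-target copy-paste enhancement can be sketched as follows. This is a simplified illustration with names of my own: patches below the small-target size are duplicated at random locations, without the overlap checks a production pipeline would add:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility


def copy_paste_small_targets(image, boxes, n_copies=2, small=32):
    """Duplicate small-object patches at random positions (simplified sketch).

    `boxes` holds (x1, y1, x2, y2) pixel coordinates; targets narrower and
    shorter than `small` pixels are treated as small targets, matching the
    32 x 32 convention used in the text.
    """
    h, w = image.shape[:2]
    out_boxes = list(boxes)
    for (x1, y1, x2, y2) in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw < small and bh < small:
            patch = image[y1:y2, x1:x2].copy()
            for _ in range(n_copies):
                nx = int(rng.integers(0, w - bw))
                ny = int(rng.integers(0, h - bh))
                image[ny:ny + bh, nx:nx + bw] = patch   # paste the copy
                out_boxes.append((nx, ny, nx + bw, ny + bh))
    return image, out_boxes
```

Each pasted copy also yields a new ground-truth box, so more anchors match small targets during training.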
(2-3) custom Anchor Box for fitting targets in data sets
For target detection on objects of extreme scales, suitable anchor boxes fit the objects in the data set more accurately. For the UAV aerial photography data set, the anchor boxes are re-clustered with the Kmeans++ algorithm to obtain anchor box parameters better suited to the current data set. The anchor box parameters obtained by the Kmeans++ algorithm are (1,4), (2,8), (4,13), (4,5), (8,20), (9,9), (16,29), (16,15), (35,42).
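Anchor re-clustering might be sketched as below: plain k-means with k-means++ seeding over (width, height) pairs in Euclidean space. All names are my own, and note that YOLO implementations often cluster with an IoU-based distance instead, so this is an illustrative simplification:

```python
import numpy as np


def kmeans_pp_anchors(wh, k, iters=20, seed=0):
    """Cluster (width, height) pairs into k anchors with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    # k-means++ seeding: each new center is drawn with probability
    # proportional to its squared distance from the nearest existing center.
    centers = [wh[rng.integers(len(wh))]]
    for _ in range(k - 1):
        d2 = np.min([((wh - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(wh[rng.choice(len(wh), p=d2 / d2.sum())])
    centers = np.array(centers)
    # standard Lloyd iterations
    for _ in range(iters):
        labels = np.argmin(((wh[:, None] - centers[None]) ** 2).sum(2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(0)
    return centers[np.argsort(centers.prod(1))]  # sort by area, small to large
```

Run over the ground-truth box sizes of the training set with k = 9, the sorted centers take the place of YOLOv4's default anchors.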
The third part of small target detection network model lightweight specifically comprises:
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The γ of the BN layer in each convolution module of YOLOv4 is used as a scaling factor, and an L1 regularization term on the BN-layer γ is added to the loss function. The network is sparsely trained for a preset number of epochs, for example 300; after gradient updates, the γ values are sorted, a pruning threshold is set, and the channels whose γ falls below the threshold are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing a BN layer except the convolution layers before the upsampling layers and the SPP structure. The channel pruning ratio is chosen through repeated experiments to balance speed and precision; a ratio of 0.7 is finally selected, producing the pruned model file and model structure configuration file.
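The threshold selection step can be sketched as follows: gather all BN γ values, take the pruning ratio as a global quantile, and keep only channels above it. The function name is my own, and the mapping from the boolean masks back to a rebuilt, narrower network is omitted:

```python
import torch


def prune_mask(gammas, ratio=0.7):
    """Per-layer keep-masks from BN scaling factors.

    `gammas` is a list of BN gamma tensors (one per prunable layer); `ratio`
    is the fraction of channels to remove globally, e.g. 0.7 as in the text.
    """
    flat = torch.cat([g.abs().flatten() for g in gammas])
    thr = torch.quantile(flat, ratio)          # global pruning threshold
    return [g.abs() > thr for g in gammas]     # True = keep this channel
```

Layers feeding the upsampling path and the SPP structure would simply be left out of `gammas`, matching the exclusions described above.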
(3-2) knowledge distillation recovery model accuracy
After channel pruning, although each removed channel contributes only slightly to the model output, the accuracy of the pruned model drops slightly, so the model accuracy needs to be recovered.
Knowledge distillation is performed with the unpruned YOLOv4 network as the teacher network and the channel-pruned network as the student network. The knowledge distillation of YOLOv4 covers both the classification task and the regression task. For distillation of the regression results, the student does not learn directly from the teacher network when computing the regression loss, since the regression output is unbounded and the teacher's predictions may even oppose the true values. First, the L2 losses between the teacher network and the label values and between the student network and the label values are computed separately; through repeated experimental comparisons, the deviation range w is set to 0.3. Only when the student's L2 loss deviates from the teacher's L2 loss by more than w is the student's L2 loss added to the loss. That is, once the performance of the student network comes within a certain margin of the teacher network, this loss term is no longer computed. The overall loss function is:
Lreg=(1-v)LsL1(Rs,yreg)+vLb(Rs,Rt,yreg)
where w is the preset deviation range, yreg is the true label value, Rt and Rs are the regression outputs of the teacher and student respectively, Lb is the distillation part of the loss, LsL1 is the loss between the student network and the true labels, and v is the balance factor between Lb and LsL1, set to 0.1–0.5 in the early stage of training and 0.6–0.9 in the later stage; Lreg is the total regression loss learned through distillation.
The fourth part, detecting small targets in pictures, specifically includes:
(4-1) inputting an unmanned aerial vehicle aerial image
(4-2) After a UAV aerial image is read, it is sent into the trained and optimized small target detection network for target localization and classification. The network first feeds the image into the attention-equipped feature extraction network to extract features, and outputs 3 feature maps of different resolutions through the SPP modules. Targets at three different scales are detected with the regression-and-classification approach; the confidence threshold lies between 0.2 and 0.6 and is generally set to 0.3. After threshold filtering, the classification and localization results of the targets are obtained.
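The confidence filtering step can be sketched as below (illustrative names; non-maximum suppression, which a full pipeline would also apply after this step, is omitted):

```python
import numpy as np


def filter_detections(boxes, scores, conf_thr=0.3):
    """Keep only detections whose confidence meets the threshold.

    `boxes` is an (N, 4) array of (x1, y1, x2, y2); `scores` is an (N,)
    array of confidences; conf_thr=0.3 matches the typical value above.
    """
    keep = scores >= conf_thr
    return boxes[keep], scores[keep]
```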
And (4-3) repeating the steps (4-1) to (4-2) until the detection of the pictures in the test set is completed, wherein the detection effect of various small targets is shown in fig. 6.