Lightweight small target detection method combined with attention mechanism
Technical Field
The invention belongs to the application of a deep learning technology in the field of machine vision, and particularly relates to a lightweight small target detection method combined with an attention mechanism.
Background
Target detection finds the specific target categories and their accurate positions in a given image. Small target detection is an important research topic in this field, with important application value in scenes such as remote sensing image target recognition, infrared imaging target recognition, and agricultural pest recognition. In target detection, a target occupying 0.12% or less of the whole image, or covering fewer than 32 × 32 pixels, is generally referred to as a small target. Detecting small targets in an image is very difficult because of their low resolution and noise; the features extracted after multi-layer convolution are often insignificant.
Early small target detection mainly relied on manually designed methods to obtain characteristic information of the target. Wen et al. applied the wavelet transform to the small target detection process (see Wen Peizhi et al. Infrared small target detection method under sea background [J]. Opto-Electronic Engineering, 2004): multi-resolution analysis based on orthogonal wavelet decomposition realizes band selection and suppresses interference from noise and background, edges in different directions are fused to obtain candidate points, and interfering targets are finally eliminated according to a gray threshold. Chen et al. (see C. L. P. Chen, H. Li, Y. Wei, et al. A Local Contrast Method for Small Infrared Target Detection [J]. IEEE Transactions on Geoscience and Remote Sensing, 2014, 52(1): 574-581), motivated by biological visual mechanisms, use a proposed local contrast measure to obtain a local contrast map of the input image, which represents the difference between the current location and its neighborhood, thus achieving both target signal enhancement and background clutter suppression; the target is finally segmented by adaptive thresholds. These methods start from the bottom-layer features of the image and accomplish the detection task with basic image features. They are relatively simple to operate, but suffer from missed detections, false detections, and poor real-time performance when detecting small targets against complex backgrounds.
In recent years, with improvements in computing power and the rapid development of deep learning theory, deep learning techniques have been widely applied to target detection. Currently popular target detection models can be roughly divided into two categories: one-stage detection algorithms, which treat classification and localization as a regression task, with SSD and YOLO as representatives; and two-stage detection algorithms, which first select candidate boxes and then classify the targets, with R-CNN and Faster R-CNN as representatives. Because they treat the whole detection task as a regression operation, one-stage detection algorithms have a great advantage in real-time performance.
The main ways of improving small target detection with deep learning are multi-scale representation, context information, super-resolution, and so on. The patent with application number CN202010537199.7 discloses a detection method for small objects in pictures: six feature maps of different sizes are obtained from the picture to be detected, the pyramid bottom-layer and high-layer feature maps among them are fused by bilinear interpolation to obtain six new feature maps of different sizes, and these new feature maps participate in prediction. This method uses multi-scale feature maps to enhance target feature information, but it is easily disturbed by complex backgrounds and has a high false detection rate. The patent with application number CN202010444356.X discloses a remote sensing image small target detection method based on resolution enhancement, which performs super-resolution processing on a remote sensing image containing small targets before detection. It addresses the problems that small targets in remote sensing images carry little usable feature information and that small target regions undergo geometric deformation: super-resolution processing refines the detailed feature information of small targets, and a region-based deformable convolution network makes full use of their limited feature information, improving the detection capability for small targets in remote sensing images. Although this method achieves better accuracy, the increased image resolution reduces the real-time performance of the network and is not conducive to a lightweight network.
Disclosure of Invention
In order to solve the problems of high false detection rate, missing detection, poor real-time performance and the like of the existing target detection method for detecting the small target, the invention provides a lightweight small target detection method combined with an attention mechanism, which comprises the following steps:
(1) building improved small target detection network based on YOLOv4
The small target detection network is obtained by improving a one-stage target detection network YOLOv4, and the specific network structure improvement comprises the following three aspects:
(1-1) constructing an MSE multi-scale attention mechanism module, and inserting the MSE multi-scale attention mechanism module into a feature extraction network
The MSE multi-scale attention mechanism module constructed by the invention is obtained by improving the SE attention module, a lightweight attention mechanism module for computer vision proposed by Hu et al. in 2017. The MSE multi-scale attention mechanism module can be conveniently inserted between two layers of a feature extraction network; by learning global information, it selects and emphasizes the feature channels of interest and suppresses irrelevant interference information.
An MSE multi-scale attention mechanism module is constructed and inserted between a Concat layer and a CBM module in each CSP module of a YOLOv4 feature extraction network CSPDarknet53 to form a new MSE-CSPUnit module, and the feature extraction network of the MSE-CSPDarknet53 with attention information is obtained. The specific steps of the construction of the MSE multi-scale attention mechanism module are as follows:
(1-1-1) First, the output of the Concat layer of the CSP module is taken as the input feature map, and feature maps of multiple scales are integrated through convolution kernels of different sizes; the next feature extraction operations are performed on these multi-scale feature maps. The convolution kernel sizes are 3 × 3, 5 × 5, and 7 × 7. To avoid the parameter explosion caused by large convolution kernels, 2 stacked 3 × 3 convolution layers are used instead of the 5 × 5 kernel, and 3 stacked 3 × 3 convolution layers instead of the 7 × 7 kernel. Let the input feature map be X ∈ R^(C×H×W), where C, H, W are the number of input channels, the input height, and the input width, respectively. Feature extraction with convolution kernels of different sizes is performed on the input feature map as follows:
Xc=V3×3X+V5×5X+V7×7X
where Xc is the multi-scale feature map output, and V denotes the convolution operation with a convolution kernel of the indicated size.
(1-1-2) A squeeze operation is performed on Xc: the channels are squeezed with global average pooling and global max pooling respectively to obtain channel-level feature information, where global average pooling emphasizes the global features of the feature map and global max pooling emphasizes its local features:
Xavg = (1/(H × W)) Σi Σj Xc(i, j)
Xmax = max(Xc(i, j))
where Xc is the multi-scale feature input, Xavg is the feature obtained after global average pooling, Xmax is the feature obtained after global max pooling, i = 1, 2, …, H, j = 1, 2, …, W, and H, W are the input height and width, respectively.
(1-1-3) An excitation operation is performed on Xavg and Xmax, and the channel attention weight information Xs is generated through addition and normalization. The Mish activation function is used during excitation to preserve more of the non-linear relationships between channels. FC1 and FC2 are two different fully connected layers, where C is the number of input channels and r is the dimension-reduction ratio: FC1 reduces the dimension to cut fully connected layer parameters, and FC2 restores the original dimension. The excitation and normalization operations are as follows:
Xa = FC2(Mish(FC1(Xavg)))
Xm = FC2(Mish(FC1(Xmax)))
Xs = Softmax(Xa + Xm)
where Mish is a non-linear activation function and Softmax is a normalization function.
(1-1-4) The channel attention weights generated in (1-1-3) are applied to the multi-scale feature map generated in (1-1-1) in a weighting operation to obtain the output Xweight of the MSE multi-scale attention module, and Xweight is used as the input of the CBM module in the MSE-CSPUnit module.
Xweight=Scale(Xc,Xs)
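Steps (1-1-1) through (1-1-4) can be sketched in PyTorch as follows. The patent provides no reference code, so all class and variable names here are my own; the channel counts, padding choices, and reduction ratio r are illustrative assumptions, with the stacked 3 × 3 branches standing in for the 5 × 5 and 7 × 7 kernels as described above:

```python
import torch
import torch.nn as nn


class MSEAttention(nn.Module):
    """Sketch of the MSE multi-scale attention module (names are illustrative)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Multi-scale branches: 3x3, two stacked 3x3 (~5x5), three stacked 3x3 (~7x7);
        # padding=1 keeps the spatial size so the branch outputs can be summed.
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.branch7 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1))
        # FC1 reduces C -> C/r, FC2 restores C/r -> C, shared by both pooled paths.
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        # nn.Mish requires PyTorch >= 1.9; on the 1.6 stack the patent mentions,
        # x * tanh(softplus(x)) would be written out by hand.
        self.act = nn.Mish()

    def forward(self, x):
        xc = self.branch3(x) + self.branch5(x) + self.branch7(x)  # Xc
        b, c, _, _ = xc.shape
        x_avg = xc.mean(dim=(2, 3))   # squeeze: global average pooling -> Xavg
        x_max = xc.amax(dim=(2, 3))   # squeeze: global max pooling -> Xmax
        xa = self.fc2(self.act(self.fc1(x_avg)))
        xm = self.fc2(self.act(self.fc1(x_max)))
        xs = torch.softmax(xa + xm, dim=1)        # channel attention weights Xs
        return xc * xs.view(b, c, 1, 1)           # Xweight = Scale(Xc, Xs)
```

Because the module leaves the spatial size and channel count unchanged, it can be dropped between the Concat layer and the CBM module of a CSP block without altering the surrounding layers.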
(1-2) adding shallow feature map as prediction layer
Deep features carry stronger semantic information, while shallow features retain rich resolution information, which is more beneficial to small target detection. The 19 × 19 feature maps output by the FPN and PAN structures are deleted, keeping the original 38 × 38 and 76 × 76 output feature maps; the output of MSE-CSPUnit ×2 (two stacked MSE-CSPUnit modules) is feature-fused with the upsampled deeper feature map through the FPN and PAN structures to obtain a shallow 152 × 152 feature map. Finally, three feature maps of different sizes, 38 × 38, 76 × 76, and 152 × 152, predict targets at different scales.
(1-3) SPP Module improvements
The SPP module can enrich the expressive capability of the feature map and provide important context information. To improve small target detection performance, SPP modules are placed in front of the 38 × 38, 76 × 76, and 152 × 152 feature maps respectively, realizing effective fusion of local and global features. The SPP module performs 1 × 1, 5 × 5, 9 × 9, and 13 × 13 max pooling operations on the input feature map, then concatenates the generated feature maps of different scales along the channel dimension.
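A minimal sketch of the SPP module as described: stride-1 max pooling with same-size padding so all branches keep the input resolution, followed by channel concatenation. The class name is my own, and the 1 × 1 pooling branch is taken as the identity:

```python
import torch
import torch.nn as nn


class SPP(nn.Module):
    """Spatial pyramid pooling sketch: 1x1 (identity), 5x5, 9x9, 13x13 max pools."""

    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        # stride=1 with padding=k//2 preserves the spatial size of each branch
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        # the untouched input x plays the role of the 1x1 pooling branch
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```

With four branches the output has 4× the input channels, so a following 1 × 1 convolution (not shown) would normally compress it back.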
(2) Training and optimizing small target detection networks
Aiming at the specific application scene, a small target detection data set is constructed, and multi-mode random adjustments are applied to the image data through data enhancement: the number of small targets in the data and the image brightness, contrast, and saturation are randomly adjusted to enhance the generalization performance of the model.
Finally, anchor boxes are set to fit the targets in the data set: the anchor boxes are re-clustered on the target data set with the Kmeans++ algorithm to obtain anchor box parameters better suited to the current data set, accelerating the convergence of the network.
(3) Model lightweight for small target detection network
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The γ of the BN layer in each convolution module of YOLOv4 is used as a scaling factor, and an L1 regularization term on the BN-layer γ is added to the loss function. The network is sparsely trained for a preset number of epochs; after gradient updates, the γ values are sorted, a pruning threshold is set, and the channels whose γ falls below the threshold are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing a BN layer except the convolution layers before the upsampling layers and the SPP structure, producing a pruned model file and a model structure configuration file. For the YOLOv4 sparse training, the established target loss function is:
L = Σ(x,y) l(f(x, w), y) + λ Σγ g(γ)
where x is the input value of the model, y is the desired output value, w denotes the trainable parameters of the network, l(·) is the original detection loss, g(·) is the penalty term on the scaling factors, and λ is the balance factor.
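The sparse-training objective can be sketched as follows, assuming the standard network-slimming form in which g(·) is the L1 norm of the BN scaling factors; `sparsity_loss` and `lam` are illustrative names not taken from the patent:

```python
import torch
import torch.nn as nn


def sparsity_loss(model, base_loss, lam=1e-4):
    """Network-slimming style objective: task loss + lam * sum(|gamma|) over BN scales.

    `model.modules()` is walked for every BatchNorm2d layer; in a real pipeline
    the layers before upsampling and the SPP structure would be excluded.
    """
    l1 = sum(m.weight.abs().sum() for m in model.modules()
             if isinstance(m, nn.BatchNorm2d))
    return base_loss + lam * l1
```

Backpropagating this loss drives many γ values toward zero, which is what makes the later threshold-based channel removal cheap in accuracy.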
(3-2) knowledge distillation recovery model accuracy
After channel pruning, although each removed channel contributes only slightly to the model output, the accuracy of the pruned model drops slightly, so the model accuracy needs to be recovered.
Knowledge distillation is performed with the unpruned YOLOv4 network as the teacher network and the channel-pruned network as the student network. The knowledge distillation of YOLOv4 covers both the classification task and the regression task. For distillation of the regression results, the student does not learn directly from the teacher network when computing the regression loss, since the regression output is unbounded and the teacher's predictions may even oppose the label values. First, the L2 losses between the teacher network and the label values and between the student network and the label values are computed separately, and a range w is set; only when the student's L2 loss deviates from the teacher's L2 loss by more than w is the student's L2 loss counted in the loss. That is, once the performance of the student network comes within a certain margin of the teacher network, this loss term is no longer computed. The overall loss function is:
Lreg = (1 − v)·LsL1(Rs, yreg) + v·Lb(Rs, Rt, yreg)
where w is the preset deviation range, yreg is the true label value, Rt and Rs are the regression outputs of the teacher and student respectively, Lb is the distillation part of the loss, LsL1 is the loss between the student network and the true labels, and v is the balance factor between Lb and LsL1, set to 0.1–0.5 in the early stage of training and 0.6–0.9 in the later stage; Lreg is the total regression loss learned through distillation.
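The bounded regression distillation might be implemented roughly as below. The function and argument names are my own; smooth L1 stands in for the student-label loss LsL1, and the teacher-bounded term Lb only activates when the student's L2 error lags the teacher's by more than w, an assumption consistent with the description above:

```python
import torch
import torch.nn.functional as F


def distill_reg_loss(student, teacher, target, w=0.3, v=0.5):
    """Sketch of the bounded regression distillation loss Lreg."""
    ls = F.smooth_l1_loss(student, target)   # LsL1: student vs. label
    lt = F.mse_loss(teacher, target)         # teacher L2 error vs. label
    l2_s = F.mse_loss(student, target)       # student L2 error vs. label
    # Lb: count the student's L2 loss only when it lags the teacher by more than w;
    # once the student is within the margin, the distillation term vanishes.
    lb = l2_s if (l2_s - lt) > w else torch.zeros((), device=student.device)
    return (1 - v) * ls + v * lb
```

In training, v would be scheduled as described: small (0.1–0.5) early so the label loss dominates, larger (0.6–0.9) later.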
(4) Detection of input images using trained small target detection network models
A frame of UAV aerial imagery is input and sent into the trained and optimized small target detection network for target localization and classification. The network first feeds the image into the attention-equipped feature extraction network to extract features, and outputs 3 feature maps of different resolutions through the SPP modules. Targets at three different scales are detected on these 3 feature maps with the regression-and-classification approach, and after filtering with a confidence threshold, the classification and localization results of the targets are obtained; this is repeated until all pictures in the test set have been detected.
Compared with the prior art, the invention has the following beneficial effects:
Compared with traditional small target detection methods, the invention designs the MSE attention module based on SE (Squeeze-and-Excitation) and inserts it into the YOLOv4 feature extraction network, enhancing the network's attention to regions of interest and reducing the interference of complex backgrounds during small target detection. A shallow feature map is then added as a prediction layer, and feature maps of three sizes, 38 × 38, 76 × 76, and 152 × 152, predict targets at different scales. The SPP module is improved: SPP modules are placed in front of the 38 × 38, 76 × 76, and 152 × 152 feature maps respectively, realizing effective fusion of local and global features. Finally, the model is compressed and optimized with channel pruning and knowledge distillation strategies, achieving large-scale compression of the model parameters with little loss of precision. In addition, data enhancement randomly adjusts the number of small targets and the brightness, contrast, and saturation of images in the data set, strengthening the training effect of the model. On small target data sets, the network shows better detection performance and robustness while meeting the requirements of lightweight model deployment.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a MSE-CSPUnit module after adding an MSE multi-scale attention mechanism module;
FIG. 3 is a MSE multi-scale attention module structure of the present invention;
FIG. 4 is a small target detection network architecture designed by the present invention;
FIG. 5 is a comparison of the number of channels after compression of the model, where the dark columns are before pruning and the light columns are after pruning;
fig. 6 is a diagram of the detection effect of the small target detection network on the target picture according to the present invention, wherein (a), (c) are the detection effects before improvement, and (b) and (d) are the detection effects after improvement corresponding to (a), (c).
Detailed Description
The present invention will be described in detail below with reference to examples and drawings, but the present invention is not limited thereto. The target detection embodiment of the invention covers various small targets in a data set. The selected processing platform combines an Intel i9-9900K CPU, an NVIDIA RTX 2080 Ti GPU, and 32 GB of RAM, and the operating system is 64-bit Ubuntu 18.04 Linux. The method is implemented on the deep learning framework PyTorch 1.6.
The method for detecting the light-weight small target with the attention mechanism introduced as shown in FIG. 1 comprises four parts:
(1) building an improved small target detection network based on YOLOv 4;
(2) training and optimizing the small target detection network;
(3) carrying out model lightweight on the small target detection network;
(4) and detecting the input image by using the trained small target detection network model.
The first part of building an improved small target detection network based on YOLOv4 specifically comprises the following steps:
(1-1) designing an MSE multi-scale attention mechanism module, and embedding the MSE multi-scale attention mechanism module into a feature extraction network
An MSE multi-scale attention mechanism module is constructed and inserted between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53, forming a new MSE-CSPUnit module and yielding the attention-enhanced MSE-CSPDarknet53 feature extraction network, as shown in FIG. 2; apart from the MSE module, the remaining modules are the conventional structural modules of the YOLOv4 feature extraction network CSPDarknet53. The MSE multi-scale attention mechanism module is constructed as follows:
firstly, the output of a Concat layer of a CSP module is used as an input feature map, feature maps of various scales are integrated through convolution kernels of different sizes, and next feature extraction operation is carried out on the basis of the multi-scale feature maps, wherein the sizes of the convolution kernels are respectively 3 × 3, 5 × 5 and 7 × 7. In the case of a parameter amount explosion caused by using a large-size convolution kernel, 2 layers of 3 × 3 convolution kernels are used instead of the 5 × 5 convolution kernels, and 3 layers of 3 × 3 convolution kernels are used instead of the 7 × 7 convolution kernels. Let input characteristic diagram X ∈ RC×H×WC, H, W are input channel, input height, and input width, respectively, the process of feature extraction using convolution kernels of different sizes for the input feature map is as follows:
Xc=V3×3X+V5×5X+V7×7X
where Xc is the multi-scale fused feature output, and V denotes the convolution operation with a convolution kernel of the indicated size.
A squeeze operation is performed on Xc. Because small targets carry little feature information, global max pooling is used to emphasize the local information of the feature map, while global average pooling emphasizes its global features. The pooling operations are as follows:
Xavg = (1/(H × W)) Σi Σj Xc(i, j)
Xmax = max(Xc(i, j))
where Xavg is the feature obtained after global average pooling, Xmax is the feature obtained after global max pooling, i = 1, 2, …, H, j = 1, 2, …, W, and H, W are the input height and width, respectively.
An excitation operation is performed on Xavg and Xmax respectively; the results are added and normalized to generate the attention weight information Xs. The Mish activation function is used during excitation to preserve more of the non-linear relationships between channels. FC1 and FC2 are two different fully connected layers, where C is the number of input channels and r is the dimension-reduction ratio: FC1 reduces the dimension to cut fully connected layer parameters, and FC2 restores the original dimension. The excitation and normalization operations are as follows:
Xa = FC2(Mish(FC1(Xavg)))
Xm = FC2(Mish(FC1(Xmax)))
Xs = Softmax(Xa + Xm)
where Mish is a non-linear activation function and Softmax is a normalization function.
Xs is applied to the multi-scale feature map Xc generated in the first step in a weighting operation to obtain the output Xweight of the MSE multi-scale attention module, and Xweight is used as the input of the CBM module in the MSE-CSPUnit module.
Xweight=Scale(Xc,Xs)
(1-2) adding shallow features in the predicted layer
Deep features carry stronger semantic information, while shallow features retain rich resolution information, which is more beneficial to small target detection. The 19 × 19 feature maps output by the FPN and PAN structures are deleted, keeping the original 38 × 38 and 76 × 76 output feature maps; the output of MSE-CSPUnit ×2 is feature-fused with the upsampled deeper feature map through the FPN and PAN structures to obtain a shallow 152 × 152 feature map. Finally, three feature maps of different sizes, 38 × 38, 76 × 76, and 152 × 152, predict targets at different scales.
(1-3) SPP Module improvements
The SPP module can enrich the expressive capability of the feature map and provide important context information. To improve small target detection performance, SPP modules are placed in front of the 38 × 38, 76 × 76, and 152 × 152 feature maps respectively, realizing effective fusion of local and global features. The SPP module performs 1 × 1, 5 × 5, 9 × 9, and 13 × 13 max pooling operations on the input feature map, then concatenates the generated feature maps of different scales along the channel dimension.
The second part of training and optimizing the small target detection network specifically comprises:
(2-1) construction of data set
First, a small target data set is constructed; the experiments select the UAV aerial photography data set VisDrone2019. Because VisDrone2019 is shot from drones, it contains a large number of small and densely packed objects; illumination changes and object occlusion are further difficulties of this data set. In addition, since the drone images are captured from a near-vertical viewpoint, the detected objects contain fewer distinguishing features: for pedestrian detection, for example, a ground-level image may contain features of arms and legs, whereas a drone image may show little more than the top of the head.
(2-2) data enhancement and multimodal stochastic adjustment of picture data
During network training, an online enhancement mode is adopted on the data set to improve the training effect for small targets. Since the data set may contain few pictures of small targets, the model may be biased toward medium and large targets during training. Online data enhancement copies several small targets within a picture, manually increasing the number of times small objects appear and raising the probability that an anchor contains a small target, so that the model obtains more small target training samples. Meanwhile, pictures are randomly rotated and scaled, and brightness, contrast, and saturation are adjusted to increase the robustness of the model.
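The small-target copy-paste enhancement can be sketched as follows. This is a simplified illustration with names of my own: patches below the small-target size are duplicated at random locations, without the overlap checks a production pipeline would add:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility


def copy_paste_small_targets(image, boxes, n_copies=2, small=32):
    """Duplicate small-object patches at random positions (simplified sketch).

    `boxes` holds (x1, y1, x2, y2) pixel coordinates; targets narrower and
    shorter than `small` pixels are treated as small targets, matching the
    32 x 32 convention used in the text.
    """
    h, w = image.shape[:2]
    out_boxes = list(boxes)
    for (x1, y1, x2, y2) in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw < small and bh < small:
            patch = image[y1:y2, x1:x2].copy()
            for _ in range(n_copies):
                nx = int(rng.integers(0, w - bw))
                ny = int(rng.integers(0, h - bh))
                image[ny:ny + bh, nx:nx + bw] = patch   # paste the copy
                out_boxes.append((nx, ny, nx + bw, ny + bh))
    return image, out_boxes
```

Each pasted copy also yields a new ground-truth box, so more anchors match small targets during training.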
(2-3) custom Anchor Box for fitting targets in data sets
For target detection on objects of extreme scales, suitable anchor boxes fit the objects in the data set more accurately. For the UAV aerial photography data set, the anchor boxes are re-clustered with the Kmeans++ algorithm to obtain anchor box parameters better suited to the current data set. The anchor box parameters obtained by the Kmeans++ algorithm are (1,4), (2,8), (4,13), (4,5), (8,20), (9,9), (16,29), (16,15), (35,42).
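Anchor re-clustering might be sketched as below: plain k-means with k-means++ seeding over (width, height) pairs in Euclidean space. All names are my own, and note that YOLO implementations often cluster with an IoU-based distance instead, so this is an illustrative simplification:

```python
import numpy as np


def kmeans_pp_anchors(wh, k, iters=20, seed=0):
    """Cluster (width, height) pairs into k anchors with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    # k-means++ seeding: each new center is drawn with probability
    # proportional to its squared distance from the nearest existing center.
    centers = [wh[rng.integers(len(wh))]]
    for _ in range(k - 1):
        d2 = np.min([((wh - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(wh[rng.choice(len(wh), p=d2 / d2.sum())])
    centers = np.array(centers)
    # standard Lloyd iterations
    for _ in range(iters):
        labels = np.argmin(((wh[:, None] - centers[None]) ** 2).sum(2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(0)
    return centers[np.argsort(centers.prod(1))]  # sort by area, small to large
```

Run over the ground-truth box sizes of the training set with k = 9, the sorted centers take the place of YOLOv4's default anchors.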
The third part of small target detection network model lightweight specifically comprises:
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The γ of the BN layer in each convolution module of YOLOv4 is used as a scaling factor, and an L1 regularization term on the BN-layer γ is added to the loss function. The network is sparsely trained for a preset number of epochs, for example 300; after gradient updates, the γ values are sorted, a pruning threshold is set, and the channels whose γ falls below the threshold are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing a BN layer except the convolution layers before the upsampling layers and the SPP structure. The channel pruning ratio is chosen through repeated experiments to balance speed and precision; a ratio of 0.7 is finally selected, producing the pruned model file and model structure configuration file.
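The threshold selection step can be sketched as follows: gather all BN γ values, take the pruning ratio as a global quantile, and keep only channels above it. The function name is my own, and the mapping from the boolean masks back to a rebuilt, narrower network is omitted:

```python
import torch


def prune_mask(gammas, ratio=0.7):
    """Per-layer keep-masks from BN scaling factors.

    `gammas` is a list of BN gamma tensors (one per prunable layer); `ratio`
    is the fraction of channels to remove globally, e.g. 0.7 as in the text.
    """
    flat = torch.cat([g.abs().flatten() for g in gammas])
    thr = torch.quantile(flat, ratio)          # global pruning threshold
    return [g.abs() > thr for g in gammas]     # True = keep this channel
```

Layers feeding the upsampling path and the SPP structure would simply be left out of `gammas`, matching the exclusions described above.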
(3-2) knowledge distillation recovery model accuracy
After channel pruning, although each removed channel contributes only slightly to the model output, the accuracy of the pruned model drops slightly, so the model accuracy needs to be recovered.
Knowledge distillation is performed with the unpruned YOLOv4 network as the teacher network and the channel-pruned network as the student network. The knowledge distillation of YOLOv4 covers both the classification task and the regression task. For distillation of the regression results, the student does not learn directly from the teacher network when computing the regression loss, since the regression output is unbounded and the teacher's predictions may even oppose the true values. First, the L2 losses between the teacher network and the label values and between the student network and the label values are computed separately; through repeated experimental comparisons, the deviation range w is set to 0.3. Only when the student's L2 loss deviates from the teacher's L2 loss by more than w is the student's L2 loss added to the loss. That is, once the performance of the student network comes within a certain margin of the teacher network, this loss term is no longer computed. The overall loss function is:
Lreg=(1-v)LsL1(Rs,yreg)+vLb(Rs,Rt,yreg)
where w is the preset deviation range, yreg is the true label value, Rt and Rs are the regression outputs of the teacher and student respectively, Lb is the distillation part of the loss, LsL1 is the loss between the student network and the true labels, and v is the balance factor between Lb and LsL1, set to 0.1–0.5 in the early stage of training and 0.6–0.9 in the later stage; Lreg is the total regression loss learned through distillation.
The fourth part, detecting small targets in pictures, specifically includes:
(4-1) inputting an unmanned aerial vehicle aerial image
(4-2) After a UAV aerial image is read, it is sent into the trained and optimized small target detection network for target localization and classification. The network first feeds the image into the attention-equipped feature extraction network to extract features, and outputs 3 feature maps of different resolutions through the SPP modules. Targets at three different scales are detected with the regression-and-classification approach; the confidence threshold lies between 0.2 and 0.6 and is generally set to 0.3. After threshold filtering, the classification and localization results of the targets are obtained.
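The confidence filtering step can be sketched as below (illustrative names; non-maximum suppression, which a full pipeline would also apply after this step, is omitted):

```python
import numpy as np


def filter_detections(boxes, scores, conf_thr=0.3):
    """Keep only detections whose confidence meets the threshold.

    `boxes` is an (N, 4) array of (x1, y1, x2, y2); `scores` is an (N,)
    array of confidences; conf_thr=0.3 matches the typical value above.
    """
    keep = scores >= conf_thr
    return boxes[keep], scores[keep]
```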
And (4-3) repeating the steps (4-1) to (4-2) until the detection of the pictures in the test set is completed, wherein the detection effect of various small targets is shown in fig. 6.