CN115578364A - Weak target detection method and system based on mixed attention and harmonic factor - Google Patents
- Publication number
- CN115578364A (application CN202211318263.8A)
- Authority
- CN
- China
- Prior art keywords
- target
- detection
- feature
- weak
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
- G06V10/764—Recognition using classification, e.g. of video objects
- G06V10/766—Recognition using regression, e.g. by projecting features on hyperplanes
- G06V10/806—Fusion of extracted features
- G06V10/82—Recognition using neural networks
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06V2201/07—Target detection
Abstract
The invention belongs to the technical field of computer vision and discloses a weak target detection method based on mixed attention and harmonic factors, which comprises the following steps: collecting images to form a data set and labeling the targets to be detected; constructing a detection network comprising a feature extraction unit, a feature fusion unit and a prediction branch unit, with a mixed attention mechanism introduced; training the detection network in a data-driven manner to generate the final detection model; and applying the final detection model to real-time sensor images to detect the weak targets they contain. The invention also discloses a corresponding system. The method effectively addresses the difficulties that weak targets pose in typical task scenes (small pixel footprint, low target signal intensity, target features that are hard to extract, and low separability between target and background), thereby improving the model's ability to detect weak targets and comprehensively optimizing its detection performance.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a weak target detection method and system based on mixed attention and harmonic factors, which can effectively address the currently low detection and recognition rates for weak targets in typical scenes and accurately recognize weak targets appearing in images.
Background
In recent years, with rapid economic development and continuous scientific and technological progress, artificial intelligence has attracted great attention across industries, and scenes that previously required large amounts of human supervision can gradually be handled by machines. Research shows that more than seventy percent of the information humans acquire comes through vision, so computer vision technology is extremely important to the development of the artificial intelligence industry.
Specifically, computer vision uses electronic imaging equipment in place of human eyes to classify and identify targets. In recent years, with the spread of high-performance computing resources, deep learning techniques have been applied to the field of computer vision. In 2012, Hinton and his student Alex Krizhevsky proposed the convolutional neural network AlexNet for image classification and won first place outright in that year's ImageNet image classification competition. Since then, more and more researchers have applied deep learning to computer vision tasks such as target detection, target tracking, semantic segmentation and scene understanding.
Research shows that target detection algorithms in the prior art generally perform well on relatively salient targets in images; for weak targets, however, detectors often produce missed and false detections because such targets occupy few pixels and their features are difficult to extract.
In fact, "small" and "weak" are the two most prominent features of a weak target. "Small" means that the total number of pixels occupied by the object to be recognized in the image is small, and the current academic definition of small objects defines the imaging size of the small objects. Typically in an N × N size image, targets with pixels smaller than 0.12% of the N × N are considered small targets, while for a general target detection data set, targets with pixels smaller than 32 × 32 may be considered small targets. The term "weak" means that the signal intensity of the target in the image is weak, the contrast with the background is low, and the target is easily interfered by clutter and noise.
Accordingly, there is a need in the art for further improvements to better meet the need for high-precision, high-efficiency detection of weak targets in typical scenarios.
Disclosure of Invention
In view of the above defects or needs in the prior art, an object of the present invention is to provide a weak target detection method based on mixed attention and harmonic factors, in which the characteristics of weak targets in typical task scenes such as surveillance images and aerial images are fully considered, a suitable algorithm is selected as the baseline, and the network structure is modified, so that, compared with the prior art, the model's ability to detect weak targets is further improved and its detection performance is comprehensively optimized.
To achieve the above object, according to one aspect of the present invention, there is provided a weak target detection method based on mixed attention and a harmonic factor, the method comprising:
step one, sample calibration
Collecting, with a sensor and in real time, typical images containing the targets to be detected to form a data set, and then labeling the targets to be detected appearing in the typical images;
step two, constructing a detection network
Dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as its backbone network and extracts the position, texture, semantic and other relevant information of the target to be detected; the feature fusion unit adopts bidirectional feature fusion to aggregate, in equal measure, the deep texture information and the shallow position information of the target to be detected, and a harmonic factor is introduced to adjust the fusion ratio between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, each branch being responsible for detecting targets at a different scale;
step three, training the detection network
Training the detection network constructed in step two with the sample data calibrated in step one in a data-driven manner, thereby generating the final detection model;
step four, weak target detection
And detecting real-time images produced by the sensor with the trained final detection model to obtain the weak targets in the images and outputting the results.
Preferably, in step one, the acquired images are first screened according to criteria such as image quality, the clarity of the target to be detected, and whether the target is occluded, so as to prepare an initial data set; the annotations then preferably follow the standard PASCAL VOC data set format.
As a further preference, in step two, for the feature fusion unit, the bidirectional feature fusion preferably comprises the following process: first, high-level features carrying low-resolution, high-semantic information are aggregated top-down with bottom-level features carrying high-resolution, low-semantic information by up-sampling, so that the features at every scale contain rich target semantic information; then, a bottom-up feature aggregation path is added on top of the top-down path, transmitting bottom-level position information to the high-level features and completing the fusion of target position information.
As a further preference, in step two, the feature fusion unit is preferably further configured with a mixed attention unit comprising a channel attention subunit and a spatial attention subunit connected in series with each other, wherein:
for the channel attention subunit, the input feature vector first undergoes global average pooling, which computes the mean of all pixels in each channel's feature map, and then a one-dimensional convolution with kernel size k generates channel weights between 0 and 1; finally, the generated channel weights are multiplied element by element with the feature map, producing a refined feature map;
for the spatial attention subunit, average pooling and maximum pooling are first applied to the input feature map and the two resulting feature vectors are concatenated; a convolutional layer then compresses the channel dimension and generates spatial weights, which are multiplied element by element with the input feature vector, producing a refined feature map.
As a further preference, in step two, the harmonic factor is preferably set as a hyper-parameter that adapts to the training process: it is iteratively updated along with the loss function throughout network training.
As a further preference, in step two, the prediction branch unit preferably has a four-branch structure, wherein the first prediction branch is generated from a low-level, high-resolution feature map and is responsible for predicting tiny targets at the first scale; the second prediction branch is obtained by downsampling the first prediction branch and predicts tiny targets at the second scale; the third prediction branch is obtained by downsampling the second prediction branch and predicts tiny targets at the third scale; the fourth prediction branch is generated from a high-level, low-resolution feature map and predicts tiny targets at the fourth scale.
Further preferably, in step three, a standard Adam optimizer is preferably used for multiple rounds of training, and after the training is finished, a final detection model can be obtained.
Preferably, in step four, colored boxes are preferably used to label the weak targets present in the image, rectangular boxes of different colors represent different target classes, and the accurate position coordinates of the weak targets are output in real time.
According to another aspect of the present invention, there is also provided a corresponding weak target detection system based on mixed attention and a harmonic factor, wherein the system comprises:
the system comprises a sample calibration module, a data acquisition module and a data processing module, wherein the sample calibration module is used for acquiring a typical image composition data set containing a target to be detected in real time by using a sensor and then marking the target to be detected appearing in the typical image;
the detection network module is used for dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as a backbone network and is used for extracting relevant information such as position, texture, semantics and the like of a target to be detected; the feature fusion unit adopts a bidirectional feature fusion mode and is used for equivalently aggregating deep texture information and shallow position information of a target to be detected, and a blending factor is introduced for adjusting the fusion proportion between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, and each branch is respectively responsible for detection according to different scales of the target to be detected;
the training detection network module is used for training the constructed detection network by using the sample data calibrated in the first step in a data driving mode so as to generate a final detection model;
and the weak target detection module is applied to detecting a real-time image imaged by the sensor by utilizing the trained final detection model, obtaining a weak target existing in the image and outputting a result.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) The method fully considers the characteristics of weak targets in typical task scenes such as surveillance images and aerial images and makes targeted improvements to the key operation steps and algorithm mechanisms; the selected backbone network includes the novel Focus and CSP structures, which further improve the convolutional neural network's ability to extract target features and facilitate the extraction of weak-target detail information;
(2) The method adopts a bidirectional feature fusion network and introduces harmonic factors, so that the feature fusion relationship between adjacent network layers can be controlled adaptively; the harmonic factors change dynamically with the loss function during training, which further promotes the learning of weak-target detail features and makes the recognition of small targets more stable and accurate;
(3) The invention further adds a mixed attention mechanism to operations such as network up-sampling and feature concatenation: the channel attention subunit comprises only global average pooling and convolution operations, which keeps the module lightweight while producing accurate channel weights, and the spatial attention subunit comprises average pooling, maximum pooling and convolution operations, producing accurate spatial weight information; accordingly, the interference of complex backgrounds on the network during training is further suppressed and the overall performance of the detection model is improved;
(4) The weak target detection method based on mixed attention and harmonic factors effectively addresses the difficulties that weak targets pose in typical task scenes such as surveillance images and aerial images, namely small pixel footprint, low target signal intensity, target features that are hard to extract, and low separability between target and background; it markedly improves the detection precision for weak targets, optimizes the overall performance of the detection network, and enhances the robustness of the model.
Drawings
FIG. 1 is a flow chart of the weak target detection method based on mixed attention and harmonic factors according to the present invention;
FIG. 2 is a schematic view exemplarily showing the entire weak target detection process according to the present invention;
FIG. 3 is a general block diagram exemplarily showing the mixed attention unit according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is an overall flowchart of the weak target detection method based on mixed attention and harmonic factors according to the present invention. The invention is explained in more detail below with reference to fig. 1.
First, a sample calibration step is performed.
In this step, typical images containing the targets to be detected are collected in real time with a sensor to form a data set, and the targets to be detected appearing in the typical images are then labeled.
More specifically, field images are collected by a sensor according to the specific task scene, and the collected images are screened according to criteria such as image quality, the clarity of the target to be detected, and whether the target is occluded, so as to prepare an initial, unlabeled data set;
and then, labeling the prepared data set, for example, labeling the self-made data set by adopting a labeling format of a PASCAL VOC data set widely applied in the field of target detection, wherein a labeling tool is a Label Image, in the labeling process, the coordinates and the category of the target to be identified need to be labeled simultaneously, and an xml file is generated after each Image is labeled.
Next, a step of constructing a detection network is performed.
In this step, the detection network is divided into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as its backbone network and extracts the position, texture, semantic and other relevant information of the target to be detected; the feature fusion unit adopts bidirectional feature fusion to aggregate, in equal measure, the deep texture information and the shallow position information of the target to be detected, and a harmonic factor is introduced to adjust the fusion ratio between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, each branch being responsible for detecting targets at a different scale.
More specifically, CSPDarkNet53 is preferably used as the backbone network for feature extraction, extracting the position, texture, semantic and other relevant information of the target to be detected. Compared with the traditional DarkNet53, the new backbone includes the novel Focus and CSP structures, which further improve the convolutional neural network's ability to extract target features. In the Focus structure, before the image enters the backbone, every other pixel is sampled, so that an operation akin to adjacent downsampling yields four sub-images. Focus thus concentrates the width and height information of the image into the channel space, expanding the channel dimension to four times its original size, i.e. the concatenated result turns the original three-channel image into twelve channels. The advantage of the Focus structure is that it reduces the loss of target detail information during downsampling, which benefits the detection of tiny targets. The CSP structure splits the feature map into two parts along the channel dimension: one part undergoes feature extraction through convolutional layers and residual modules, the other is merged with the forwarded feature map, and the outputs of the two parts are then concatenated. This alleviates the problem of duplicated gradient information during network optimization and, while reducing the computational load, forms a new fusion pattern that promotes the extraction of target detail information, improves the efficiency of the backbone, and raises the comprehensive performance of the feature extraction network.
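A minimal sketch of the Focus slicing operation described above; it shows only the pixel-rearrangement step, not the convolution that usually follows it:

```python
import torch

def focus_slice(x):
    """Focus slicing: sample every other pixel to form four sub-images and
    concatenate them along the channel axis, so a (B, 3, H, W) tensor
    becomes (B, 12, H/2, W/2) with no loss of pixel information."""
    return torch.cat([x[..., ::2, ::2],    # even rows, even cols
                      x[..., 1::2, ::2],   # odd rows, even cols
                      x[..., ::2, 1::2],   # even rows, odd cols
                      x[..., 1::2, 1::2]], # odd rows, odd cols
                     dim=1)

x = torch.randn(1, 3, 640, 640)
print(focus_slice(x).shape)  # torch.Size([1, 12, 320, 320])
```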
The high-level features of a convolutional neural network contain more target semantic information, while the bottom-level features contain more target position and texture information. A weak target detection task needs the network to weigh target position and semantic information equally in its predictions so that small targets are identified stably and accurately.
Against this background, a bidirectional feature fusion network is adopted in a targeted manner. First, high-level features carrying low-resolution, high-semantic information are aggregated top-down with bottom-level features carrying high-resolution, low-semantic information by up-sampling, so that the features at every scale contain rich target semantic information; then, a bottom-up feature aggregation path is added on top of the top-down path, transmitting bottom-level position information to the high-level features and completing the fusion of target position information.
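A minimal sketch of this two-pass (PAN-style) fusion, assuming three input levels; the common 256-channel width and the 1 × 1 lateral convolutions are our simplifications, not the patent's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Top-down pass spreads semantic information downward, then a
    bottom-up pass pushes low-level position information back up."""
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in channels)
        self.down = nn.ModuleList(nn.Conv2d(256, 256, 3, stride=2, padding=1)
                                  for _ in channels[:-1])

    def forward(self, feats):              # feats ordered low -> high level
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample higher (semantic) levels into lower ones
        for i in range(len(lat) - 2, -1, -1):
            lat[i] = lat[i] + F.interpolate(lat[i + 1], scale_factor=2)
        # bottom-up: downsample lower (positional) levels into higher ones
        for i in range(1, len(lat)):
            lat[i] = lat[i] + self.down[i - 1](lat[i - 1])
        return lat
```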
In addition, because the invention concentrates on detecting weak targets, whose detail features are harder to learn than those of medium and large targets, and in order to ensure that the trained model reaches good detection precision, the invention adds harmonic factors between adjacent feature layers in the feature fusion module to control the feature fusion relationship between those layers. The harmonic factor determines the degree of coupling between adjacent levels in the feature fusion module by re-weighting the loss during gradient back-propagation. The harmonic factor is set as a hyper-parameter that adapts to the training process: during training it is iteratively updated along with the loss function, and the change in its value reflects how the learning difficulty of each level evolves during feature fusion.
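As an illustration only: one plausible parameterisation of such a factor is a learnable scalar squashed into (0, 1) that weights the fusion of two adjacent levels, so that back-propagating the detection loss updates it each iteration. The exact parameterisation below is our assumption; the patent states only that the factor adapts with the loss:

```python
import torch
import torch.nn as nn

class HarmonicFusion(nn.Module):
    """Fuse a shallow feature map with an upsampled deep one under a
    trainable harmonic factor (a sketch, not the patent's formulation)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable harmonic factor

    def forward(self, shallow, deep_upsampled):
        w = torch.sigmoid(self.alpha)              # fusion ratio in (0, 1)
        return w * shallow + (1.0 - w) * deep_upsampled
```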
According to a preferred embodiment of the present invention, the target to be detected is a weak target that usually occupies only a few pixels in the image, is poorly separated from the background, and is easily disturbed by complex backgrounds in the scene, all of which degrade the detection performance of the algorithm. In view of these problems, the present invention further provides a novel, efficient mixed attention mechanism comprising two subunits, a channel attention subunit and a spatial attention subunit, connected by a serial data flow; it is integrated into operations such as network up-sampling and feature concatenation.
According to another preferred embodiment of the present invention, the structure of the channel attention subunit can be described as follows. First, for an input feature vector (H × W × C), Global Average Pooling (GAP) computes the mean of all pixels in each channel's feature map, reducing the number of parameters and the computational load; a one-dimensional convolution with kernel size k then enables cross-channel interaction. The kernel size k of the one-dimensional convolution determines the range of channel interaction; k is an adjustable hyper-parameter, initially set to 3 in the present invention. The feature vector output by the one-dimensional convolution is mapped through a Sigmoid function to produce channel weights between 0 and 1. Finally, the generated channel weights are multiplied element by element with the feature map that entered the attention module, producing a refined feature map.
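This description matches an ECA-style channel attention; a minimal sketch follows, with k = 3 as the text suggests:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention subunit: global average pooling, a 1-D convolution
    with kernel size k, a Sigmoid producing per-channel weights in (0, 1),
    then element-wise re-weighting of the input feature map."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                    # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # cross-channel 1-D convolution
        w = torch.sigmoid(y).view(x.size(0), -1, 1, 1)
        return x * w                              # refined feature map
```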
According to another preferred embodiment of the present invention, the structure of the spatial attention subunit can be described as follows. First, average pooling and maximum pooling are applied to the input feature map and the two resulting feature vectors are concatenated; a convolutional layer then compresses the channel dimension, and finally a Sigmoid function, for example, generates spatial weights that are multiplied element by element with the input feature vector to produce a refined feature map.
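A minimal sketch of the spatial attention subunit as described; the 7 × 7 convolution kernel is our assumption, since the text does not fix a kernel size:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention subunit: average- and max-pool the input along the
    channel axis, concatenate the two maps, compress them with a convolution,
    and apply a Sigmoid to obtain per-position spatial weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2,
                              bias=False)

    def forward(self, x):                    # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)    # (B, 1, H, W) average pooling
        mx, _ = x.max(dim=1, keepdim=True)   # (B, 1, H, W) maximum pooling
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                         # refined feature map
```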
The combination of the two attention subunits can be understood with reference to fig. 3. For an input feature vector, the channel attention module generates channel weights that highlight the category information of the target to be detected; multiplying these weights element by element with the feature vector emphasizes the relative importance of its channels. After channel attention, the vector is passed to the spatial attention module to obtain spatial weight information, so that the network can concentrate on learning the target's position information, strengthening its localization ability. Introducing the attention module keeps the network focused on learning target detail features throughout training, suppresses the interference of complex backgrounds in the image, and improves the overall performance of the detection network.
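Putting the two subunits together in the serial order of fig. 3 (channel first, then spatial) is then a few lines; this sketch reuses the ChannelAttention and SpatialAttention classes from the two sketches above:

```python
import torch.nn as nn

class MixedAttention(nn.Module):
    """Channel attention followed by spatial attention, connected in series."""
    def __init__(self, k=3):
        super().__init__()
        self.channel = ChannelAttention(k)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```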
According to another preferred embodiment of the invention, after fully considering the difficulty of detecting tiny targets, the invention designs a prediction network with a four-branch structure to mitigate the negative influence of drastic target scale changes on the detection result. The four branches are numbered 1 to 4 from top to bottom. The 1st prediction branch is generated from a low-level, high-resolution feature map; since high-resolution feature maps are more sensitive to target position information, this level is responsible for predicting tiny targets. The 2nd prediction branch is obtained by downsampling the 1st branch, halving the feature map size, and is responsible for predicting ordinary small targets. The 3rd prediction branch is obtained by downsampling the 2nd branch and is responsible for predicting medium-sized targets. The 4th branch is generated from a high-level, low-resolution feature map, contains rich target semantic information, and is responsible for predicting large-scale targets. In summary, the multi-scale prediction network improves the detection precision for tiny targets while ensuring that targets at other scales are detected stably, preventing the detection performance from fluctuating as target scales change dynamically.
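A minimal sketch of how the four parallel heads could be attached to the four fused feature maps; the 256-channel width and the single 1 × 1 convolution per head are our assumptions, not the patent's specification:

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Four prediction branches: branch 1 on the low-level high-resolution
    map, branches 2-3 on successively downsampled maps, branch 4 on the
    high-level low-resolution map."""
    def __init__(self, num_outputs):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(256, num_outputs, 1)
                                   for _ in range(4))

    def forward(self, p1, p2, p3, p4):   # four fused maps, fine to coarse
        return [h(p) for h, p in zip(self.heads, (p1, p2, p3, p4))]
```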
According to another preferred embodiment of the present invention, the loss function of the weak target detection algorithm designed by the present invention may preferably consist of two parts, a bounding box regression loss and a classification loss. The bounding box regression uses GIoU as its loss function. With A the prediction box, B the ground-truth box and C the minimum convex closed box containing A and B, GIoU is calculated as shown in formula (1):

GIoU = IoU - |C \ (A ∪ B)| / |C| (1)

The loss function of the bounding box regression is shown in formula (2):

L_GIoU = 1 - GIoU (2)

In addition, the classification loss uses binary cross entropy as its loss function, and the overall loss function of the network, shown in formula (3), combines the two parts:

L = L_GIoU + L_cls (3)
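For concreteness, a minimal sketch of equations (1) and (2) in code, assuming boxes in (x1, y1, x2, y2) corner format; for axis-aligned boxes the minimum convex closed box C reduces to the smallest enclosing rectangle:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss per equations (1)-(2); pred and target are (N, 4) tensors."""
    # intersection of A (pred) and B (target)
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_b = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)
    # C: smallest axis-aligned box enclosing both A and B
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    area_c = cw * ch
    giou = iou - (area_c - union) / (area_c + eps)  # equation (1)
    return (1.0 - giou).mean()                      # equation (2)
```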
Next, the step of training the detection network is performed.
In this step, the detection network constructed above is trained with the calibrated sample data in a data-driven manner, thereby generating the final detection model.
More specifically, after the weak target detection network is built, it is trained end-to-end with the labeled data in a data-driven manner. For example, a standard Adam optimizer may train the network for 100 rounds with an initial learning rate of 1e-4; at round 60 the learning rate is reduced to 1e-5 so that the network parameters are fine-tuned with a small learning rate, and the batch size during training is 16. After training finishes, the final weak target detection model is obtained.
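A minimal training-loop sketch matching the schedule above; model, train_loader and detection_loss are placeholders for the network, data pipeline (batch size 16) and combined loss described in this document:

```python
import torch

def train(model, train_loader, detection_loss, epochs=100):
    """Adam, 100 rounds, lr 1e-4 dropped to 1e-5 at round 60."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        if epoch == 60:                          # switch to fine-tuning rate
            for group in optimizer.param_groups:
                group["lr"] = 1e-5
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = detection_loss(model(images), targets)  # GIoU + BCE, eq. (3)
            loss.backward()
            optimizer.step()
    return model
```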
Finally, the weak target detection step is performed.
In this step, the trained final detection model is used to detect real-time images produced by the sensor, obtaining the weak targets present in the images and outputting the results.
More specifically, in practical applications, when a target to be detected appears in an image, the detection network can accurately detect it; the target is marked with a colored box, rectangular boxes of different colors represent different target classes, and the accurate position coordinates of the weak target are output in real time.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. A weak target detection method based on mixed attention and harmonic factors is characterized by comprising the following steps:
step one, sample calibration
Collecting, with a sensor and in real time, typical images containing the targets to be detected to form a data set, and then labeling the targets to be detected appearing in the typical images;
step two, constructing a detection network
Dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as its backbone network and extracts the position, texture, semantic and other relevant information of the target to be detected; the feature fusion unit adopts bidirectional feature fusion to aggregate, in equal measure, the deep texture information and the shallow position information of the target to be detected, and a harmonic factor is introduced to adjust the fusion ratio between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, each branch being responsible for detecting targets at a different scale;
step three, training the detection network
Training the detection network constructed in step two with the sample data calibrated in step one in a data-driven manner, thereby generating the final detection model;
step four, weak target detection
And detecting real-time images produced by the sensor with the trained final detection model to obtain the weak targets in the images and output the results.
2. The weak target detection method according to claim 1, wherein in step one, the acquired images are first screened according to criteria such as image quality, the clarity of the target to be detected, and whether the target is occluded, so as to prepare an initial data set; the annotations then preferably follow the standard PASCAL VOC data set format.
3. The weak target detection method according to claim 1 or 2, wherein in step two, for the feature fusion unit, the bidirectional feature fusion preferably comprises the following process: first, high-level features carrying low-resolution, high-semantic information are aggregated top-down with bottom-level features carrying high-resolution, low-semantic information by up-sampling, so that the features at every scale contain rich target semantic information; then, a bottom-up feature aggregation path is added on top of the top-down path, transmitting bottom-level position information to the high-level features and completing the fusion of target position information.
4. The weak target detection method according to any one of claims 1 to 3, wherein in step two, the feature fusion unit is preferably further configured with a mixed attention unit, and the mixed attention unit comprises a channel attention subunit and a spatial attention subunit connected in series with each other, wherein:
for the channel attention subunit, the input feature vector first undergoes global average pooling, which computes the mean of all pixels in each channel's feature map, and then a one-dimensional convolution with kernel size k generates channel weights between 0 and 1; finally, the generated channel weights are multiplied element by element with the feature map, producing a refined feature map;
for the spatial attention subunit, average pooling and maximum pooling are first applied to the input feature map and the two resulting feature vectors are concatenated; a convolutional layer then compresses the channel dimension and generates spatial weights, which are multiplied element by element with the input feature vector, producing a refined feature map.
5. The weak target detection method according to any one of claims 1 to 4, wherein in step two, the harmonic factor is preferably set as a hyper-parameter that adapts to the training process and is iteratively updated along with the loss function during network training.
6. The weak target detection method according to any one of claims 1 to 5, wherein in step two, the prediction branch unit preferably has a four-branch structure, wherein the first prediction branch is generated from a low-level, high-resolution feature map and is responsible for predicting tiny targets at the first scale; the second prediction branch is obtained by downsampling the first prediction branch and predicts tiny targets at the second scale; the third prediction branch is obtained by downsampling the second prediction branch and predicts tiny targets at the third scale; the fourth prediction branch is generated from a high-level, low-resolution feature map and predicts tiny targets at the fourth scale.
7. The weak target detection method according to any one of claims 1 to 6, wherein in step three, a standard Adam optimizer is preferably used for multiple rounds of training, and the final detection model is obtained after training finishes.
8. The weak target detection method according to any one of claims 1 to 7, wherein in step four, colored boxes are preferably used to label the weak targets present in the image, rectangular boxes of different colors represent different target classes, and the accurate position coordinates of the weak targets are output in real time.
9. A weak target detection system based on mixed attention and a harmonic factor, characterized in that the system comprises:
a sample calibration module for collecting, with a sensor and in real time, typical images containing the targets to be detected to form a data set, and then labeling the targets to be detected appearing in the typical images;
the detection network module is used for dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as a backbone network and is used for extracting relevant information such as position, texture, semantics and the like of an object to be detected; the feature fusion unit adopts a bidirectional feature fusion mode and is used for equivalently aggregating deep texture information and shallow position information of a target to be detected, and introduces a harmonic factor for adjusting the fusion proportion between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, and each branch is respectively responsible for detection according to different scales of the target to be detected;
the training detection network module is used for training the constructed detection network by using the sample data calibrated in the first step in a data driving mode so as to generate a final detection model;
and the weak target detection module is applied to detecting a real-time image imaged by the sensor by utilizing the trained final detection model, obtaining a weak target existing in the image and outputting a result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211318263.8A CN115578364A (en) | 2022-10-26 | 2022-10-26 | Weak target detection method and system based on mixed attention and harmonic factor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211318263.8A CN115578364A (en) | 2022-10-26 | 2022-10-26 | Weak target detection method and system based on mixed attention and harmonic factor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115578364A (en) | 2023-01-06
Family
ID=84587910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211318263.8A Pending CN115578364A (en) | 2022-10-26 | 2022-10-26 | Weak target detection method and system based on mixed attention and harmonic factor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115578364A (en) |
- 2022
- 2022-10-26 CN CN202211318263.8A patent/CN115578364A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118015397A (en) * | 2024-01-16 | 2024-05-10 | 深圳市锐明像素科技有限公司 | Method and device for determining difficult samples for autonomous driving |
CN118135669A (en) * | 2024-05-10 | 2024-06-04 | 武汉纺织大学 | A classroom behavior recognition method and system based on lightweight network |
CN118135669B (en) * | 2024-05-10 | 2024-08-02 | 武汉纺织大学 | A classroom behavior recognition method and system based on lightweight network |
CN118865065A (en) * | 2024-09-24 | 2024-10-29 | 北京观微科技有限公司 | Small target detection method, device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |