CN115578364A - Weak target detection method and system based on mixed attention and harmonic factor - Google Patents
- Publication number
- CN115578364A (application CN202211318263.8A)
- Authority
- CN
- China
- Prior art keywords
- target
- detection
- feature
- weak
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/0002—Inspection of images, e.g. flaw detection
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
- G06V10/764—Recognition using classification, e.g. of video objects
- G06V10/766—Recognition using regression, e.g. by projecting features on hyperplanes
- G06V10/806—Fusion of extracted features
- G06V10/82—Recognition using neural networks
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06V2201/07—Target detection
Abstract
The invention belongs to the technical field of computer vision and discloses a weak target detection method based on mixed attention and harmonic factors, which comprises the following steps: collecting images to form a data set and labeling the targets to be detected; constructing a detection network comprising a feature extraction unit, a feature fusion unit and a prediction branch unit, with a mixed attention mechanism introduced; training the detection network in a data-driven manner to generate the final detection model; and applying the final detection model to real-time sensor images to detect the weak targets they contain. The invention also discloses a corresponding system. The method effectively addresses the difficulties that weak targets pose in typical task scenes (small pixel footprint, low target signal intensity, target features that are hard to extract, and low separability between target and background), thereby improving the model's ability to detect weak targets and comprehensively optimizing its detection performance.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a weak target detection method and system based on mixed attention and harmonic factors, which can effectively address the currently low detection and recognition rates for weak targets in typical scenes and accurately recognize weak targets appearing in images.
Background
In recent years, with rapid economic development and continuous scientific and technological progress, artificial intelligence has attracted great attention across industries, and scenes that previously required large amounts of human supervision can gradually be handled by machines. Research shows that more than seventy percent of the information humans acquire comes through vision, so computer vision technology is extremely important to the development of the artificial intelligence industry.
Specifically, computer vision uses electronic imaging equipment in place of human eyes to classify and identify targets. In recent years, with the spread of high-performance computing resources, deep learning techniques have been applied to the field of computer vision. In 2012, Hinton and his student Alex Krizhevsky proposed the convolutional neural network AlexNet for image classification and won first place outright in that year's ImageNet image classification competition. Since then, more and more researchers have applied deep learning to computer vision tasks such as target detection, target tracking, semantic segmentation and scene understanding.
Research shows that target detection algorithms in the prior art generally perform well on relatively salient targets in images; for weak targets, however, detectors often produce missed and false detections because such targets occupy few pixels and their features are difficult to extract.
In fact, "small" and "weak" are the two most prominent features of a weak target. "Small" means that the total number of pixels occupied by the object to be recognized in the image is small, and the current academic definition of small objects defines the imaging size of the small objects. Typically in an N × N size image, targets with pixels smaller than 0.12% of the N × N are considered small targets, while for a general target detection data set, targets with pixels smaller than 32 × 32 may be considered small targets. The term "weak" means that the signal intensity of the target in the image is weak, the contrast with the background is low, and the target is easily interfered by clutter and noise.
Accordingly, there is a need in the art for further improvements to better meet the need for high-precision, high-efficiency detection of weak targets in typical scenarios.
Disclosure of Invention
In view of the above defects or needs in the prior art, an object of the present invention is to provide a weak target detection method based on mixed attention and harmonic factors, in which the characteristics of weak targets in typical task scenes such as surveillance images and aerial images are fully considered, a suitable algorithm is selected as the baseline, and the network structure is modified, so that, compared with the prior art, the model's ability to detect weak targets is further improved and its detection performance is comprehensively optimized.
To achieve the above object, according to one aspect of the present invention, there is provided a weak target detection method based on mixed attention and a harmonic factor, the method comprising:
step one, sample calibration
Collecting, with a sensor and in real time, typical images containing the targets to be detected to form a data set, and then labeling the targets to be detected appearing in the typical images;
step two, constructing a detection network
Dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as its backbone network and extracts the position, texture, semantic and other relevant information of the target to be detected; the feature fusion unit adopts bidirectional feature fusion to aggregate, in equal measure, the deep texture information and the shallow position information of the target to be detected, and a harmonic factor is introduced to adjust the fusion ratio between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, each branch being responsible for detecting targets at a different scale;
step three, training the detection network
Training the detection network constructed in step two with the sample data calibrated in step one in a data-driven manner, thereby generating the final detection model;
step four, weak target detection
And detecting real-time images produced by the sensor with the trained final detection model to obtain the weak targets in the images and outputting the results.
Preferably, in step one, the acquired images are first screened according to criteria such as image quality, the clarity of the target to be detected, and whether the target is occluded, so as to prepare an initial data set; the annotations then preferably follow the standard PASCAL VOC data set format.
As a further preference, in step two, for the feature fusion unit, the bidirectional feature fusion preferably comprises the following process: first, high-level features carrying low-resolution, high-semantic information are aggregated top-down with bottom-level features carrying high-resolution, low-semantic information by up-sampling, so that the features at every scale contain rich target semantic information; then, a bottom-up feature aggregation path is added on top of the top-down path, transmitting bottom-level position information to the high-level features and completing the fusion of target position information.
As a further preference, in step two, the feature fusion unit is preferably further configured with a mixed attention unit comprising a channel attention subunit and a spatial attention subunit connected in series with each other, wherein:
for the channel attention subunit, the input feature vector first undergoes global average pooling, which computes the mean of all pixels in each channel's feature map, and then a one-dimensional convolution with kernel size k generates channel weights between 0 and 1; finally, the generated channel weights are multiplied element by element with the feature map, producing a refined feature map;
for the spatial attention subunit, average pooling and maximum pooling are first applied to the input feature map and the two resulting feature vectors are concatenated; a convolutional layer then compresses the channel dimension and generates spatial weights, which are multiplied element by element with the input feature vector, producing a refined feature map.
As a further preference, in step two, the harmonic factor is preferably set as a hyper-parameter that adapts to the training process: it is iteratively updated along with the loss function throughout network training.
As a further preference, in step two, the prediction branch unit preferably has a four-branch structure, wherein the first prediction branch is generated from a low-level, high-resolution feature map and is responsible for predicting tiny targets at the first scale; the second prediction branch is obtained by downsampling the first prediction branch and predicts tiny targets at the second scale; the third prediction branch is obtained by downsampling the second prediction branch and predicts tiny targets at the third scale; the fourth prediction branch is generated from a high-level, low-resolution feature map and predicts tiny targets at the fourth scale.
Further preferably, in step three, a standard Adam optimizer is preferably used for multiple rounds of training, and after the training is finished, a final detection model can be obtained.
Preferably, in step four, colored boxes are preferably used to label the weak targets present in the image, rectangular boxes of different colors represent different target classes, and the accurate position coordinates of the weak targets are output in real time.
According to another aspect of the present invention, there is also provided a corresponding weak target detection system based on mixed attention and a harmonic factor, wherein the system comprises:
the system comprises a sample calibration module, a data acquisition module and a data processing module, wherein the sample calibration module is used for acquiring a typical image composition data set containing a target to be detected in real time by using a sensor and then marking the target to be detected appearing in the typical image;
the detection network module is used for dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as a backbone network and is used for extracting relevant information such as position, texture, semantics and the like of a target to be detected; the feature fusion unit adopts a bidirectional feature fusion mode and is used for equivalently aggregating deep texture information and shallow position information of a target to be detected, and a blending factor is introduced for adjusting the fusion proportion between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, and each branch is respectively responsible for detection according to different scales of the target to be detected;
the training detection network module is used for training the constructed detection network by using the sample data calibrated in the first step in a data driving mode so as to generate a final detection model;
and the weak target detection module is applied to detecting a real-time image imaged by the sensor by utilizing the trained final detection model, obtaining a weak target existing in the image and outputting a result.
Generally, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects:
(1) The method fully considers the characteristics of weak targets in typical task scenes such as surveillance images and aerial images and makes targeted improvements to the key operation steps and algorithm mechanisms; the selected backbone network includes the novel Focus and CSP structures, which further improve the convolutional neural network's ability to extract target features and facilitate the extraction of weak-target detail information;
(2) The method adopts a bidirectional feature fusion network and introduces harmonic factors, so that the feature fusion relationship between adjacent network layers can be controlled adaptively; the harmonic factors change dynamically with the loss function during training, which further promotes the learning of weak-target detail features and makes the recognition of small targets more stable and accurate;
(3) The invention further adds a mixed attention mechanism to operations such as network up-sampling and feature concatenation: the channel attention subunit comprises only global average pooling and convolution operations, which keeps the module lightweight while producing accurate channel weights, and the spatial attention subunit comprises average pooling, maximum pooling and convolution operations, producing accurate spatial weight information; accordingly, the interference of complex backgrounds on the network during training is further suppressed and the overall performance of the detection model is improved;
(4) The weak target detection method based on mixed attention and harmonic factors effectively addresses the difficulties that weak targets pose in typical task scenes such as surveillance images and aerial images, namely small pixel footprint, low target signal intensity, target features that are hard to extract, and low separability between target and background; it markedly improves the detection precision for weak targets, optimizes the overall performance of the detection network, and enhances the robustness of the model.
Drawings
FIG. 1 is a flow chart of the weak target detection method based on mixed attention and harmonic factors according to the present invention;
FIG. 2 is a schematic view exemplarily showing the entire weak target detection process according to the present invention;
FIG. 3 is a general block diagram exemplarily showing the mixed attention unit according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is an overall flowchart of the weak target detection method based on mixed attention and harmonic factors according to the present invention. The invention is explained in more detail below with reference to fig. 1.
First, a sample calibration step is performed.
In this step, typical images containing the targets to be detected are collected in real time with a sensor to form a data set, and the targets to be detected appearing in the typical images are then labeled.
More specifically, field images are collected by a sensor according to the specific task scene, and the collected images are screened according to criteria such as image quality, the clarity of the target to be detected, and whether the target is occluded, so as to prepare an initial, unlabeled data set;
and then, labeling the prepared data set, for example, labeling the self-made data set by adopting a labeling format of a PASCAL VOC data set widely applied in the field of target detection, wherein a labeling tool is a Label Image, in the labeling process, the coordinates and the category of the target to be identified need to be labeled simultaneously, and an xml file is generated after each Image is labeled.
Next, a step of constructing a detection network is performed.
In this step, the detection network is divided into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as its backbone network and extracts the position, texture, semantic and other relevant information of the target to be detected; the feature fusion unit adopts bidirectional feature fusion to aggregate, in equal measure, the deep texture information and the shallow position information of the target to be detected, and a harmonic factor is introduced to adjust the fusion ratio between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, each branch being responsible for detecting targets at a different scale.
More specifically, CSPDarkNet53 is preferably used as the backbone network for feature extraction, extracting the position, texture, semantic and other relevant information of the target to be detected. Compared with the traditional DarkNet53, the new backbone includes the novel Focus and CSP structures, which further improve the convolutional neural network's ability to extract target features. In the Focus structure, before the image enters the backbone, every other pixel is sampled, so that an operation akin to adjacent downsampling yields four sub-images. Focus thus concentrates the width and height information of the image into the channel space, expanding the channel dimension to four times its original size, i.e. the concatenated result turns the original three-channel image into twelve channels. The advantage of the Focus structure is that it reduces the loss of target detail information during downsampling, which benefits the detection of tiny targets. The CSP structure splits the feature map into two parts along the channel dimension: one part undergoes feature extraction through convolutional layers and residual modules, the other is merged with the forwarded feature map, and the outputs of the two parts are then concatenated. This alleviates the problem of duplicated gradient information during network optimization and, while reducing the computational load, forms a new fusion pattern that promotes the extraction of target detail information, improves the efficiency of the backbone, and raises the comprehensive performance of the feature extraction network.
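A minimal sketch of the Focus slicing operation described above; it shows only the pixel-rearrangement step, not the convolution that usually follows it:

```python
import torch

def focus_slice(x):
    """Focus slicing: sample every other pixel to form four sub-images and
    concatenate them along the channel axis, so a (B, 3, H, W) tensor
    becomes (B, 12, H/2, W/2) with no loss of pixel information."""
    return torch.cat([x[..., ::2, ::2],    # even rows, even cols
                      x[..., 1::2, ::2],   # odd rows, even cols
                      x[..., ::2, 1::2],   # even rows, odd cols
                      x[..., 1::2, 1::2]], # odd rows, odd cols
                     dim=1)

x = torch.randn(1, 3, 640, 640)
print(focus_slice(x).shape)  # torch.Size([1, 12, 320, 320])
```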
The high-level features of a convolutional neural network contain more target semantic information, while the bottom-level features contain more target position and texture information. A weak target detection task needs the network to weigh target position and semantic information equally in its predictions so that small targets are identified stably and accurately.
Against this background, a bidirectional feature fusion network is adopted in a targeted manner. First, high-level features carrying low-resolution, high-semantic information are aggregated top-down with bottom-level features carrying high-resolution, low-semantic information by up-sampling, so that the features at every scale contain rich target semantic information; then, a bottom-up feature aggregation path is added on top of the top-down path, transmitting bottom-level position information to the high-level features and completing the fusion of target position information.
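A minimal sketch of this two-pass (PAN-style) fusion, assuming three input levels; the common 256-channel width and the 1 × 1 lateral convolutions are our simplifications, not the patent's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    """Top-down pass spreads semantic information downward, then a
    bottom-up pass pushes low-level position information back up."""
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in channels)
        self.down = nn.ModuleList(nn.Conv2d(256, 256, 3, stride=2, padding=1)
                                  for _ in channels[:-1])

    def forward(self, feats):              # feats ordered low -> high level
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample higher (semantic) levels into lower ones
        for i in range(len(lat) - 2, -1, -1):
            lat[i] = lat[i] + F.interpolate(lat[i + 1], scale_factor=2)
        # bottom-up: downsample lower (positional) levels into higher ones
        for i in range(1, len(lat)):
            lat[i] = lat[i] + self.down[i - 1](lat[i - 1])
        return lat
```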
In addition, because the invention concentrates on detecting weak targets, whose detail features are harder to learn than those of medium and large targets, and in order to ensure that the trained model reaches good detection precision, the invention adds harmonic factors between adjacent feature layers in the feature fusion module to control the feature fusion relationship between those layers. The harmonic factor determines the degree of coupling between adjacent levels in the feature fusion module by re-weighting the loss during gradient back-propagation. The harmonic factor is set as a hyper-parameter that adapts to the training process: during training it is iteratively updated along with the loss function, and the change in its value reflects how the learning difficulty of each level evolves during feature fusion.
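As an illustration only: one plausible parameterisation of such a factor is a learnable scalar squashed into (0, 1) that weights the fusion of two adjacent levels, so that back-propagating the detection loss updates it each iteration. The exact parameterisation below is our assumption; the patent states only that the factor adapts with the loss:

```python
import torch
import torch.nn as nn

class HarmonicFusion(nn.Module):
    """Fuse a shallow feature map with an upsampled deep one under a
    trainable harmonic factor (a sketch, not the patent's formulation)."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable harmonic factor

    def forward(self, shallow, deep_upsampled):
        w = torch.sigmoid(self.alpha)              # fusion ratio in (0, 1)
        return w * shallow + (1.0 - w) * deep_upsampled
```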
According to a preferred embodiment of the present invention, the target to be detected is a weak target that usually occupies only a few pixels in the image, is poorly separated from the background, and is easily disturbed by complex backgrounds in the scene, all of which degrade the detection performance of the algorithm. In view of these problems, the present invention further provides a novel, efficient mixed attention mechanism comprising two subunits, a channel attention subunit and a spatial attention subunit, connected by a serial data flow; it is integrated into operations such as network up-sampling and feature concatenation.
According to another preferred embodiment of the present invention, the structure of the channel attention subunit can be described as follows. First, for an input feature vector (H × W × C), Global Average Pooling (GAP) computes the mean of all pixels in each channel's feature map, reducing the number of parameters and the computational load; a one-dimensional convolution with kernel size k then enables cross-channel interaction. The kernel size k of the one-dimensional convolution determines the range of channel interaction; k is an adjustable hyper-parameter, initially set to 3 in the present invention. The feature vector output by the one-dimensional convolution is mapped through a Sigmoid function to produce channel weights between 0 and 1. Finally, the generated channel weights are multiplied element by element with the feature map that entered the attention module, producing a refined feature map.
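This description matches an ECA-style channel attention; a minimal sketch follows, with k = 3 as the text suggests:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention subunit: global average pooling, a 1-D convolution
    with kernel size k, a Sigmoid producing per-channel weights in (0, 1),
    then element-wise re-weighting of the input feature map."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                    # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # cross-channel 1-D convolution
        w = torch.sigmoid(y).view(x.size(0), -1, 1, 1)
        return x * w                              # refined feature map
```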
According to another preferred embodiment of the present invention, the structure of the spatial attention subunit can be described as follows. First, average pooling and maximum pooling are applied to the input feature map and the two resulting feature vectors are concatenated; a convolutional layer then compresses the channel dimension, and finally a Sigmoid function, for example, generates spatial weights that are multiplied element by element with the input feature vector to produce a refined feature map.
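A minimal sketch of the spatial attention subunit as described; the 7 × 7 convolution kernel is our assumption, since the text does not fix a kernel size:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention subunit: average- and max-pool the input along the
    channel axis, concatenate the two maps, compress them with a convolution,
    and apply a Sigmoid to obtain per-position spatial weights."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2,
                              bias=False)

    def forward(self, x):                    # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)    # (B, 1, H, W) average pooling
        mx, _ = x.max(dim=1, keepdim=True)   # (B, 1, H, W) maximum pooling
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                         # refined feature map
```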
The combination of the two attention subunits can be understood with reference to fig. 3. For an input feature vector, the channel attention module generates channel weights that highlight the category information of the target to be detected; multiplying these weights element by element with the feature vector emphasizes the relative importance of its channels. After channel attention, the vector is passed to the spatial attention module to obtain spatial weight information, so that the network can concentrate on learning the target's position information, strengthening its localization ability. Introducing the attention module keeps the network focused on learning target detail features throughout training, suppresses the interference of complex backgrounds in the image, and improves the overall performance of the detection network.
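Putting the two subunits together in the serial order of fig. 3 (channel first, then spatial) is then a few lines; this sketch reuses the ChannelAttention and SpatialAttention classes from the two sketches above:

```python
import torch.nn as nn

class MixedAttention(nn.Module):
    """Channel attention followed by spatial attention, connected in series."""
    def __init__(self, k=3):
        super().__init__()
        self.channel = ChannelAttention(k)
        self.spatial = SpatialAttention()

    def forward(self, x):
        return self.spatial(self.channel(x))
```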
According to another preferred embodiment of the invention, after fully considering the difficulty of detecting tiny targets, the invention designs a prediction network with a four-branch structure to mitigate the negative influence of drastic target scale changes on the detection result. The four branches are numbered 1 to 4 from top to bottom. The 1st prediction branch is generated from a low-level, high-resolution feature map; since high-resolution feature maps are more sensitive to target position information, this level is responsible for predicting tiny targets. The 2nd prediction branch is obtained by downsampling the 1st branch, halving the feature map size, and is responsible for predicting ordinary small targets. The 3rd prediction branch is obtained by downsampling the 2nd branch and is responsible for predicting medium-sized targets. The 4th branch is generated from a high-level, low-resolution feature map, contains rich target semantic information, and is responsible for predicting large-scale targets. In summary, the multi-scale prediction network improves the detection precision for tiny targets while ensuring that targets at other scales are detected stably, preventing the detection performance from fluctuating as target scales change dynamically.
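A minimal sketch of how the four parallel heads could be attached to the four fused feature maps; the 256-channel width and the single 1 × 1 convolution per head are our assumptions, not the patent's specification:

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Four prediction branches: branch 1 on the low-level high-resolution
    map, branches 2-3 on successively downsampled maps, branch 4 on the
    high-level low-resolution map."""
    def __init__(self, num_outputs):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(256, num_outputs, 1)
                                   for _ in range(4))

    def forward(self, p1, p2, p3, p4):   # four fused maps, fine to coarse
        return [h(p) for h, p in zip(self.heads, (p1, p2, p3, p4))]
```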
According to another preferred embodiment of the present invention, the loss function of the weak target detection algorithm designed by the present invention may preferably consist of two parts, a bounding box regression loss and a classification loss. The bounding box regression uses GIoU as its loss function. With A the prediction box, B the ground-truth box and C the minimum convex closed box containing A and B, GIoU is calculated as shown in formula (1):

GIoU = IoU - |C \ (A ∪ B)| / |C| (1)

The loss function of the bounding box regression is shown in formula (2):

L_GIoU = 1 - GIoU (2)

In addition, the classification loss uses binary cross entropy as its loss function, and the overall loss function of the network, shown in formula (3), combines the two parts:

L = L_GIoU + L_cls (3)
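For concreteness, a minimal sketch of equations (1) and (2) in code, assuming boxes in (x1, y1, x2, y2) corner format; for axis-aligned boxes the minimum convex closed box C reduces to the smallest enclosing rectangle:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss per equations (1)-(2); pred and target are (N, 4) tensors."""
    # intersection of A (pred) and B (target)
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_b = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)
    # C: smallest axis-aligned box enclosing both A and B
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    area_c = cw * ch
    giou = iou - (area_c - union) / (area_c + eps)  # equation (1)
    return (1.0 - giou).mean()                      # equation (2)
```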
Next, the step of training the detection network is performed.
In this step, the detection network constructed above is trained with the calibrated sample data in a data-driven manner, thereby generating the final detection model.
More specifically, after the weak target detection network is built, it is trained end-to-end with the labeled data in a data-driven manner. For example, a standard Adam optimizer may train the network for 100 rounds with an initial learning rate of 1e-4; at round 60 the learning rate is reduced to 1e-5 so that the network parameters are fine-tuned with a small learning rate, and the batch size during training is 16. After training finishes, the final weak target detection model is obtained.
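A minimal training-loop sketch matching the schedule above; model, train_loader and detection_loss are placeholders for the network, data pipeline (batch size 16) and combined loss described in this document:

```python
import torch

def train(model, train_loader, detection_loss, epochs=100):
    """Adam, 100 rounds, lr 1e-4 dropped to 1e-5 at round 60."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        if epoch == 60:                          # switch to fine-tuning rate
            for group in optimizer.param_groups:
                group["lr"] = 1e-5
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = detection_loss(model(images), targets)  # GIoU + BCE, eq. (3)
            loss.backward()
            optimizer.step()
    return model
```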
Finally, the weak target detection step is performed.
In this step, the trained final detection model is used to detect real-time images produced by the sensor, obtaining the weak targets present in the images and outputting the results.
More specifically, in practical applications, when a target to be detected appears in an image, the detection network can accurately detect it; the target is marked with a colored box, rectangular boxes of different colors represent different target classes, and the accurate position coordinates of the weak target are output in real time.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. A weak target detection method based on mixed attention and harmonic factors is characterized by comprising the following steps:
step one, sample calibration
Collecting, with a sensor and in real time, typical images containing the targets to be detected to form a data set, and then labeling the targets to be detected appearing in the typical images;
step two, constructing a detection network
Dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as its backbone network and extracts the position, texture, semantic and other relevant information of the target to be detected; the feature fusion unit adopts bidirectional feature fusion to aggregate, in equal measure, the deep texture information and the shallow position information of the target to be detected, and a harmonic factor is introduced to adjust the fusion ratio between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, each branch being responsible for detecting targets at a different scale;
step three, training the detection network
Training the detection network constructed in step two with the sample data calibrated in step one in a data-driven manner, thereby generating the final detection model;
step four, weak target detection
And detecting real-time images produced by the sensor with the trained final detection model to obtain the weak targets in the images and output the results.
2. The weak target detection method according to claim 1, wherein in step one, the acquired images are first screened according to criteria such as image quality, the clarity of the target to be detected, and whether the target is occluded, so as to prepare an initial data set; the annotations then preferably follow the standard PASCAL VOC data set format.
3. The weak target detection method according to claim 1 or 2, wherein in step two, for the feature fusion unit, the bidirectional feature fusion preferably comprises the following process: first, high-level features carrying low-resolution, high-semantic information are aggregated top-down with bottom-level features carrying high-resolution, low-semantic information by up-sampling, so that the features at every scale contain rich target semantic information; then, a bottom-up feature aggregation path is added on top of the top-down path, transmitting bottom-level position information to the high-level features and completing the fusion of target position information.
4. The weak target detection method according to any one of claims 1 to 3, wherein in step two, the feature fusion unit is preferably further configured with a mixed attention unit, and the mixed attention unit comprises a channel attention subunit and a spatial attention subunit connected in series with each other, wherein:
for the channel attention subunit, the input feature vector first undergoes global average pooling, which computes the mean of all pixels in each channel's feature map, and then a one-dimensional convolution with kernel size k generates channel weights between 0 and 1; finally, the generated channel weights are multiplied element by element with the feature map, producing a refined feature map;
for the spatial attention subunit, average pooling and maximum pooling are first applied to the input feature map and the two resulting feature vectors are concatenated; a convolutional layer then compresses the channel dimension and generates spatial weights, which are multiplied element by element with the input feature vector, producing a refined feature map.
5. The weak target detection method according to any one of claims 1 to 4, wherein in step two, the harmonic factor is preferably set as a hyper-parameter that adapts to the training process and is iteratively updated along with the loss function during network training.
6. The weak target detection method according to any one of claims 1 to 5, wherein in step two, the prediction branch unit preferably has a four-branch structure, wherein the first prediction branch is generated from a low-level, high-resolution feature map and is responsible for predicting tiny targets at the first scale; the second prediction branch is obtained by downsampling the first prediction branch and predicts tiny targets at the second scale; the third prediction branch is obtained by downsampling the second prediction branch and predicts tiny targets at the third scale; the fourth prediction branch is generated from a high-level, low-resolution feature map and predicts tiny targets at the fourth scale.
7. The weak target detection method according to any one of claims 1 to 6, wherein in step three, a standard Adam optimizer is preferably used for multiple rounds of training, and the final detection model is obtained after training finishes.
8. The weak target detection method according to any one of claims 1 to 7, wherein in step four, colored boxes are preferably used to label the weak targets present in the image, rectangular boxes of different colors represent different target classes, and the accurate position coordinates of the weak targets are output in real time.
9. A weak target detection system based on mixed attention and a harmonic factor, characterized in that the system comprises:
a sample calibration module for collecting, with a sensor and in real time, typical images containing the targets to be detected to form a data set, and then labeling the targets to be detected appearing in the typical images;
the detection network module is used for dividing the detection network into a feature extraction unit, a feature fusion unit and a prediction branch unit, wherein the feature extraction unit adopts CSPDarkNet53 as a backbone network and is used for extracting relevant information such as position, texture, semantics and the like of an object to be detected; the feature fusion unit adopts a bidirectional feature fusion mode and is used for equivalently aggregating deep texture information and shallow position information of a target to be detected, and introduces a harmonic factor for adjusting the fusion proportion between adjacent feature layers in the feature fusion unit; the prediction branch unit is provided with a plurality of branches, and each branch is respectively responsible for detection according to different scales of the target to be detected;
the training detection network module is used for training the constructed detection network by using the sample data calibrated in the first step in a data driving mode so as to generate a final detection model;
and the weak target detection module is applied to detecting a real-time image imaged by the sensor by utilizing the trained final detection model, obtaining a weak target existing in the image and outputting a result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211318263.8A CN115578364A (en) | 2022-10-26 | 2022-10-26 | Weak target detection method and system based on mixed attention and harmonic factor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211318263.8A CN115578364A (en) | 2022-10-26 | 2022-10-26 | Weak target detection method and system based on mixed attention and harmonic factor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115578364A (en) | 2023-01-06
Family
ID=84587910
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211318263.8A Pending CN115578364A (en) | 2022-10-26 | 2022-10-26 | Weak target detection method and system based on mixed attention and harmonic factor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115578364A (en) |
- 2022
- 2022-10-26 CN CN202211318263.8A patent/CN115578364A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118015397A (en) * | 2024-01-16 | 2024-05-10 | 深圳市锐明像素科技有限公司 | Method and device for determining difficult samples for autonomous driving |
CN118135669A (en) * | 2024-05-10 | 2024-06-04 | 武汉纺织大学 | A classroom behavior recognition method and system based on lightweight network |
CN118135669B (en) * | 2024-05-10 | 2024-08-02 | 武汉纺织大学 | A classroom behavior recognition method and system based on lightweight network |
CN118865065A (en) * | 2024-09-24 | 2024-10-29 | 北京观微科技有限公司 | Small target detection method, device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |