CN116758419A - Multi-scale target detection method, device and equipment for remote sensing image - Google Patents
- Publication number
- CN116758419A (application CN202310658922.0A)
- Authority
- CN
- China
- Prior art keywords
- scale
- feature
- target detection
- remote sensing
- images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/10—Scenes; Scene-specific elements; Terrestrial scenes
- G06N3/0455—Neural networks; Auto-encoder networks; Encoder-decoder networks
- G06N3/048—Neural networks; Activation functions
- G06N3/08—Neural networks; Learning methods
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
- G06V10/765—Recognition or understanding using classification, e.g. of video objects, using rules for classification or partitioning the feature space
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
- G06V10/82—Recognition or understanding using neural networks
- G06V2201/07—Target detection
Abstract
The invention provides a multi-scale target detection method, apparatus, device and storage medium for remote sensing images, in the technical field of image processing. The method comprises the following steps: acquiring an original remote sensing image and inputting it into the backbone network of a target detection model; extracting features of the original remote sensing image at different scales with the backbone network to obtain an enhanced feature map for each scale; inputting the enhanced feature maps into a feature processor of the target detection model, so that the feature processor reconstructs the enhanced feature map of each scale using a multi-head self-attention method to obtain a reconstructed feature map for each scale; and inputting the reconstructed feature map of each scale into a decoupling module of the target detection model, so that the decoupling module performs target detection on each reconstructed feature map separately to obtain the categories of the targets it contains and the probability corresponding to each category. The invention can improve target detection efficiency.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular to a multi-scale target detection method, apparatus, device and storage medium for remote sensing images.
Background
With the development of computer science and remote sensing technology, exploring ground information with remote sensing techniques has gradually become a research hotspot. Remote sensing information technology processes and analyzes data acquired by remote sensing equipment; it is an interdisciplinary field spanning computer science, remote sensing science, geographic information science and other disciplines, and is widely applied to strategic reconnaissance, construction planning, urban satellite navigation and many other areas. As an effective carrier of remote sensing information, the optical remote sensing image is characterized by high resolution and multiple spectral bands, and is one of the most important data sources in the remote sensing field. How to quickly identify target information of interest from massive high-resolution optical remote sensing images using image processing technology therefore has significant research value and practical importance.
For target detection in remote sensing images, the traditional image processing approach relies mainly on manual intervention: a template of the target is preset from human experience, the whole image is traversed with a sliding window, pixel-level approximate matching against the template is performed within each window, and the matched targets are output. This method depends on human experience and preset templates; its level of automation is low and its overall efficiency is poor. Besides this, a very typical alternative is candidate-region extraction: candidate regions are first extracted using shallow features of the image to be detected, such as texture, color and edges, and template matching is then performed only within those regions to match the target. Compared with the sliding-window method, this avoids traversing the whole image and skips blank areas of the original image, improving detection speed, but its target detection accuracy is lower; moreover, because the candidate regions are generated from image-specific characteristics, the method applies only to images with similar characteristics and cannot be used broadly.
In summary, current target detection methods suffer from long detection time and low speed, or from low detection precision, which leads to low target detection efficiency when they are applied to optical remote sensing images with complex backgrounds, many target categories and large scale differences between targets of different categories.
Disclosure of Invention
The invention provides a multi-scale target detection method, apparatus, device and storage medium for remote sensing images, to overcome the prior-art defects of long detection time, low speed and low detection precision when detecting targets in optical remote sensing images containing complex information, thereby realizing automatic remote sensing target detection and improving the level of automation and the detection efficiency for remote sensing images.
The invention provides a multi-scale target detection method for a remote sensing image, which comprises the following steps:
acquiring an original remote sensing image, and inputting the original remote sensing image into a backbone network of a target detection model;
extracting features of the original remote sensing image on different scales by using the backbone network to obtain an enhanced feature map corresponding to each scale;
inputting the enhanced feature map of each scale into a feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method to obtain a reconstructed feature map corresponding to each scale;
and inputting the reconstructed feature map of each scale into a decoupling module of the target detection model for target detection, so that the decoupling module performs target detection on the reconstructed feature map of each scale separately to obtain the categories of the targets contained in each reconstructed feature map and the probability corresponding to each category.
According to the multi-scale target detection method for the remote sensing image, the backbone network comprises a plurality of branch networks; the step of extracting features of the original remote sensing image at different scales using the backbone network to obtain an enhanced feature map corresponding to each scale comprises the following steps:
down-sampling the original remote sensing image with each branch network using a different down-sampling stride, to obtain down-sampled images of different scales;
for each down-sampled image, calculating the feature spatial mean over the plane spanned by the width and height dimensions;
for the down-sampled image of each scale, calculating the variance of the feature values along the channel direction;
for the down-sampled image of each scale, calculating its energy distribution function based on the feature spatial mean and the variance;
for the down-sampled image of each scale, calculating a channel attention factor from the feature spatial mean, and a spatial attention factor from the mean of the down-sampled image along the channel direction;
and obtaining the enhanced feature map corresponding to the down-sampled image of each scale based on its energy distribution function, channel attention factor and spatial attention factor.
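The enhancement steps above can be sketched in code. The patent text does not give the concrete energy-distribution or attention-factor formulas, so the sigmoid and exponential combinations below, along with all shapes and names, are illustrative assumptions rather than the claimed computation:

```python
import numpy as np

def enhance_feature_map(x):
    """Illustrative sketch of the enhancement steps for one down-sampled image.

    x: feature map of shape (C, H, W).  The exact energy-distribution and
    attention-factor formulas are not given in the text; the combinations
    below are assumptions made for illustration only.
    """
    spatial_mean = x.mean(axis=(1, 2), keepdims=True)   # per-channel mean over the width-height plane
    channel_var = x.var(axis=0, keepdims=True)          # variance of feature values along the channel direction
    # assumed energy distribution: larger for points far from the spatial mean
    energy = (x - spatial_mean) ** 2 / (channel_var + 1e-6)
    channel_attn = 1.0 / (1.0 + np.exp(-spatial_mean))  # channel attention factor from the spatial mean
    spatial_attn = 1.0 / (1.0 + np.exp(-x.mean(axis=0, keepdims=True)))  # spatial factor from the channel-direction mean
    return x * np.exp(-energy) * channel_attn * spatial_attn

x = np.random.rand(8, 4, 4).astype(np.float32)
enhanced = enhance_feature_map(x)
```

The enhanced map keeps the shape of its input, so the branch networks can hand it to the per-scale feature processors unchanged.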
According to the multi-scale target detection method for remote sensing images provided by the invention, the target detection model comprises a plurality of feature processors, and inputting the enhanced feature map of each scale into the feature processors of the target detection model, so that the feature processors perform feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method to obtain a reconstructed feature map corresponding to each scale, comprises the following steps:
inputting the enhanced feature map of each scale into the feature processor corresponding to that scale, and generating a plurality of groups of vector combinations for the enhanced feature map with the feature processor, wherein each group of vector combinations comprises a query vector, a key vector and a value vector;
computing the self-attention weights of the enhanced feature map over multiple heads, based on the query vector, key vector and value vector in each group of vector combinations;
and multiplying the self-attention weights by the corresponding value vectors and concatenating the results to obtain the reconstructed feature map corresponding to the enhanced feature map of each scale.
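As a rough illustration of the three steps above, a minimal multi-head self-attention pass over a flattened feature matrix might look as follows; the random projection weights and all names are assumptions for the sketch (a trained model would learn these projections):

```python
import numpy as np

def multi_head_self_attention(x, num_heads, rng):
    """Minimal sketch of multi-head self-attention (MSA).

    x: flattened feature matrix of shape (N, D), one row per feature point.
    The projections that produce the query, key and value vectors are random
    here purely for illustration.
    """
    n, d = x.shape
    dh = d // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.standard_normal((d, dh)) * 0.02 for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv                 # one query/key/value combination per head
        scores = q @ k.T / np.sqrt(dh)                   # scaled dot-product attention
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
        attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        heads.append(attn @ v)                           # weight the value vectors by the attention
    return np.concatenate(heads, axis=-1)                # concatenate the heads back together

rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 32))                     # 64 feature points, 32 channels
out = multi_head_self_attention(feat, num_heads=4, rng=rng)
```

Because the heads run independently, they can be computed in parallel, which is the efficiency gain the multi-head scheme provides over single-head self-attention.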
According to the multi-scale target detection method for remote sensing images provided by the invention, inputting the reconstructed feature map of each scale into the decoupling module of the target detection model for target detection, so that the decoupling module performs target detection on the reconstructed feature map of each scale separately to obtain the categories of the targets contained in each reconstructed feature map and the probability corresponding to each category, comprises the following steps:
inputting the reconstructed feature maps of all scales into the decoupling module of the target detection model, so that the decoupling module performs target detection on the reconstructed feature map of each scale separately using an anchor-free target detection method, to obtain the categories of the targets contained in each reconstructed feature map and the probability corresponding to each category;
the decoupling module comprises a target detection decoupling head, a prediction-box output decoupling head and a target-class confidence decoupling head;
the target detection decoupling head outputs the probability that a target in the reconstructed feature map belongs to each category;
the prediction-box output decoupling head outputs a four-dimensional offset for each category;
and the target-class confidence decoupling head outputs the probability that the position contains a target.
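The three decoupled heads can be sketched as parallel projections over the reconstructed feature map. Reducing each head to a single linear projection is a simplification for illustration; the weights and names below are assumptions, not the patent's actual head structure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decoupled_head(feat, num_classes, rng):
    """Sketch of the three decoupled anchor-free heads, each reduced to a
    single linear projection (real heads would use convolution stacks; the
    weights here are random placeholders)."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w).T                      # one row per feature-map location
    w_cls = rng.standard_normal((c, num_classes)) * 0.02
    w_box = rng.standard_normal((c, 4)) * 0.02
    w_obj = rng.standard_normal((c, 1)) * 0.02
    cls_prob = sigmoid(flat @ w_cls)                     # probability of each category at each location
    box_offset = flat @ w_box                            # four-dimensional box offset per location
    objectness = sigmoid(flat @ w_obj)                   # probability that the location contains a target
    return cls_prob, box_offset, objectness

rng = np.random.default_rng(1)
cls_prob, box_offset, objectness = decoupled_head(rng.standard_normal((16, 8, 8)), 10, rng)
```

Keeping the three outputs on separate branches is what "decoupling" refers to: classification, box regression and confidence do not share one output tensor.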
According to the multi-scale target detection method for the remote sensing image provided by the invention, before the original remote sensing image is acquired, the method further comprises the following steps:
acquiring training sample images and the label image corresponding to each training sample image;
inputting the training sample images into a target detection model to be trained, and obtaining the target detection result, prediction-box output result and target-class confidence result output by the model;
determining, based on a plurality of preset loss functions, the loss values between the label images and, respectively, the target detection result, the prediction-box output result and the target-class confidence result;
and adjusting the parameters of the target detection model to be trained until every loss value satisfies a preset training-end condition, thereby obtaining the trained target detection model.
According to the multi-scale target detection method for the remote sensing image, the preset loss functions are as follows:
[the loss-function formulas appear as images in the original publication and are not reproduced here]
wherein c_{(x,y)} denotes the probability that the model predicts a target at point (x, y) of the feature map, and \hat{c}_{(x,y)} denotes the label indicating whether a target actually exists at point (x, y); P is the predicted box and B is the ground-truth box.
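The formulas themselves are not reproduced in the translated text, but the variables listed are consistent with a cross-entropy term on the objectness probability c_{(x,y)} and an IoU-based term on the boxes P and B. The following sketch shows those two common choices as an assumption, not as the patent's exact losses:

```python
import numpy as np

def bce_loss(p, label, eps=1e-7):
    """Binary cross-entropy between a predicted probability and a 0/1 label
    (an assumed form of the objectness loss on c_{(x,y)})."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))

def iou_loss(pred, true):
    """1 - IoU between predicted box P and ground-truth box B,
    with boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], true[0]), max(pred[1], true[1])
    ix2, iy2 = min(pred[2], true[2]), min(pred[3], true[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred) + area(true) - inter
    return 1.0 - inter / union
```

For identical boxes the IoU loss is zero, and it grows toward one as the overlap between P and B shrinks.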
The invention also provides a multi-scale target detection device for the remote sensing image, which comprises:
the image acquisition module, configured to acquire an original remote sensing image and input it into the backbone network of the target detection model;
the feature enhancement module, configured to extract features of the original remote sensing image at different scales using the backbone network, to obtain an enhanced feature map corresponding to each scale;
the feature reconstruction module, configured to input the enhanced feature map of each scale into the feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method, to obtain a reconstructed feature map corresponding to each scale;
and the class output module, configured to input the reconstructed feature map of each scale into the decoupling module of the target detection model for target detection, so that the decoupling module performs target detection on each reconstructed feature map separately, to obtain the categories of the targets contained in each reconstructed feature map and the probability corresponding to each category.
The invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-scale target detection method for remote sensing images described above.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a multi-scale object detection method for a remote sensing image as described in any of the above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements a multi-scale object detection method for a remote sensing image as described in any of the above.
With the multi-scale target detection method, apparatus, device and storage medium for remote sensing images provided by the invention, the original remote sensing image is down-sampled by different factors to obtain images of different resolutions, so that foreground and background can be distinguished at different levels. The feature maps are then reconstructed with the multi-head self-attention method MSA, i.e. the image scene features are flattened and then reconstructed, so that every feature point on the feature map participates in adaptive global modeling, which enlarges the receptive field and optimizes the expressive power of the target features. Finally, the decoupling module performs target detection on the reconstructed feature map of each scale to obtain the number of target categories contained in each reconstructed feature map, the class name of each target and the probability corresponding to each category. Automatic target detection is thus achieved for multi-level remote sensing images with complex backgrounds, and recognition accuracy is improved for targets of different scales and levels. Moreover, the feature extraction process requires no manual design, as the model automatically extracts the important features, which saves a large amount of training time and improves target detection efficiency.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the invention; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic view of an application environment of a multi-scale target detection method for a remote sensing image provided by the invention;
FIG. 2 is a schematic flow chart of a multi-scale target detection method for a remote sensing image according to the present invention;
FIG. 3 is a schematic diagram of a model structure of the object detection model provided by the present invention;
FIG. 4 is a schematic diagram of a feature reconstruction network provided by the present invention;
FIG. 5 is a second flow chart of the method for detecting multi-scale targets for remote sensing images according to the present invention;
FIG. 6 is a schematic diagram of a backbone network according to the present invention;
FIG. 7 is a second schematic diagram of a backbone network according to the present invention;
FIG. 8 (a) is a schematic diagram of the architecture of a self-attention network;
FIG. 8 (b) is a schematic diagram of the multi-head self-attention network provided by the present invention;
Fig. 9 is a schematic structural diagram of a decoupling network provided by the present invention;
FIG. 10 is a schematic diagram of a model output prediction block provided by the present invention;
FIG. 11 is a schematic structural diagram of a multi-scale object detection device for remote sensing images according to the present invention;
fig. 12 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that in the description of embodiments of the present invention, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises it. Orientations or positional relationships indicated by terms such as "upper" and "lower" are based on the orientations or positional relationships shown in the drawings, are used merely for convenience and simplicity of description, and do not indicate or imply that the apparatus or element in question must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention. Unless expressly specified or limited otherwise, the terms "mounted", "connected" and "coupled" are to be construed broadly: the connection may, for example, be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect via an intermediate medium, or internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The terms "first", "second" and the like in this specification are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more than one. In addition, "and/or" indicates at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Specific embodiments of the present application are described below in conjunction with fig. 1-12.
The multi-scale target detection method for the remote sensing image, provided by the embodiment of the application, can be applied to an application environment shown in fig. 1. Wherein the terminal 101 communicates with the server 102 via a network. The data storage system may store data that the server 102 needs to process. The data storage system may be integrated on the server 102 or may be located on a cloud or other network server. The terminal 101 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 102 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, a multi-scale object detection method for a remote sensing image is provided, and the method is applied to the server 102 in fig. 1 for illustration, and includes the following steps:
step 201, an original remote sensing image is obtained, and the original remote sensing image is input into a backbone network of a target detection model;
the original remote sensing image is an image obtained by shooting through the remote sensing device, for example, may be a map, or may be a slice map (i.e., a layer or layers in the multi-resolution hierarchical model). The target detection model is a neural network model trained in advance, and is used for identifying and marking the position and the category of an object contained in an input image.
Specifically, as shown in fig. 3, fig. 3 is an overall structure diagram of the object detection model provided by the present invention, where the object detection model includes a backbone network, a feature processor, a decoupling module, and an output module. First, the original remote sensing image is input into the backbone network of the target detection model.
Step 202, extracting features of the original remote sensing image at different scales using the backbone network, to obtain an enhanced feature map X* corresponding to each scale.
Here, different scales refer to different resolutions.
Specifically, the backbone network first samples the original remote sensing image with different sampling strides to obtain sampled images at multiple scales. For example, after an original remote sensing image with a resolution of 512×512 is down-sampled with different strides, three sampled images of different scales are obtained, down-sampled 8, 16 and 32 times relative to the original image, with scales of 64×64, 32×32 and 16×16 respectively.
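The stride-to-scale arithmetic in this example can be checked directly:

```python
# A 512x512 image down-sampled by factors of 8, 16 and 32 yields
# 64x64, 32x32 and 16x16 feature maps respectively.
original_size = 512
strides = (8, 16, 32)
scales = [original_size // s for s in strides]
print(scales)  # [64, 32, 16]
```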
Furthermore, the backbone network of the present application also applies a multi-dimensional attention mechanism (Multi-dimensional Attention, MDA) for feature extraction on the sampled images of different scales, to obtain the enhanced feature map X* corresponding to each scale. Under limited computing resources, the multi-dimensional attention mechanism MDA can concentrate resources on the more important features that people most expect to focus on. During feature extraction, MDA focuses attention on the channel features and spatial features of the image, so that the model attends to the important feature information in both channel and space; this alleviates the difficulty of extracting features against a complex background and strengthens the feature extraction capability. The feature map thus obtained is referred to as an enhanced feature map.
Step 203, inputting the enhanced feature map X* of each scale into a feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method, obtaining a reconstructed feature map for each scale;
as shown in fig. 4, the feature processor is a network based on the Transformer structure. The Transformer uses a multi-head self-attention (Multi-head Self-Attention, MSA) method, which parallelizes self-attention across multiple heads to improve computational efficiency.
The input and output of multi-head self-attention MSA are two-dimensional matrices. For a detection task with three-dimensional feature maps, the feature map must be flattened into a two-dimensional feature matrix and the three-dimensional shape restored after self-attention processing; that is, the image scene features are flattened and then reconstructed. As shown in fig. 4, for example, an input enhanced feature map X* of size 512×8×8 is first flattened (flatten) along the width and height dimensions into a 512×64 two-dimensional matrix. The Transformer then performs feature reconstruction through a Layer Norm and the multi-head self-attention MSA mechanism, outputting a matrix that is still 512×64, which is finally restored (reshape) to a 512×8×8 feature map, namely the reconstructed feature map.
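The flatten-and-restore step can be sketched as follows (a minimal NumPy illustration; in the actual feature processor, Layer Norm and multi-head self-attention operate between the two reshapes):

```python
import numpy as np

C, H, W = 512, 8, 8
x = np.arange(C * H * W, dtype=np.float32).reshape(C, H, W)

flat = x.reshape(C, H * W)        # flatten along width/height: 512 x 64
# ... Layer Norm and multi-head self-attention would operate on `flat` here ...
restored = flat.reshape(C, H, W)  # restore to 512 x 8 x 8

print(flat.shape, restored.shape)  # (512, 64) (512, 8, 8)
```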
And 204, inputting the reconstructed feature images on each scale into a decoupling module of a target detection model to perform target detection, so that the decoupling module performs target detection on the reconstructed feature images on each scale respectively to obtain the categories of the targets contained in the reconstructed feature images on each scale and the probabilities corresponding to the categories.
Here, the decoupling module refers to the part of the neural network that decouples the classification and localization branches.
Specifically, as shown in fig. 3, the reconstructed feature maps at different scales are input to the decoupling module of the target detection model, so that the decoupling module performs target detection on the reconstructed feature map at each scale, obtaining the number of categories, the category names and the probability of each category for the targets contained in the reconstructed feature map at each scale.
Finally, the category name of each target, the probability of each category and the position of each target are output through the output module; for example, the position of a target is displayed by framing it with a bounding box, and a label is attached to the box to indicate the category name of the target.
In this embodiment, the original remote sensing image is downsampled by different factors to obtain images at different resolutions, so that foreground and background at different levels can be distinguished. The feature maps are reconstructed with the multi-head self-attention method MSA, i.e., the image scene features are flattened and then reconstructed, so that each feature point on the feature map can perform adaptive global modeling; this enlarges the receptive field while optimizing the expressive power of the target features. Finally, the decoupling module performs target detection on the reconstructed feature map at each scale, yielding, for each position in the reconstructed feature map at each scale, the probability that a target is present, the bounding box of the target and the probability of each category. Automatic target detection can thus be achieved for remote sensing images with multiple levels and complex backgrounds, improving recognition accuracy for targets at different scales and levels. The feature-extraction process requires no manual design: the model completes the extraction of important features automatically, saving a large amount of training time and improving target detection efficiency.
In one embodiment, the backbone network includes a plurality of branch networks, as shown in fig. 5, the step 202 includes:
step 501, downsampling the original remote sensing image with each branch network using a different downsampling step, obtaining downsampled images X at multiple scales;
specifically, the backbone network first downsamples the original remote sensing image using different sampling steps to obtain sampled maps at multiple scales. For example, after an original remote sensing image with a resolution of 512×512 is downsampled with different sampling steps, three sampled maps at different scales are obtained; their resolutions are 1/8, 1/16 and 1/32 of the original image, i.e., 64×64, 32×32 and 16×16 respectively.
Step 502, calculating, for each downsampled image X, its feature spatial mean X.mean(dim=[2,3]) on the plane formed by its width and height dimensions.
Here, the feature spatial mean X.mean(dim=[2,3]) refers to the pixel mean of the downsampled image over the plane formed by the width and height dimensions, where dim=2 denotes the width dimension of the downsampled image X and dim=3 denotes its height dimension.
Step 503, calculating, for the downsampled image X at each scale, the variance v of the feature values in the channel direction; a feature value is a pixel value of the downsampled image.
In the above formula, d is a calculated intermediate quantity, the square of the difference between the downsampled image X and its feature spatial mean; d.sum(dim=[2,3]) denotes the sum of d over the width and height dimensions for each channel, H denotes the image height, and W denotes the image width.
Step 504, calculating, for the downsampled image at each scale, its energy distribution function E based on the feature spatial mean X.mean(dim=[2,3]) and the variance v;
where E is the energy distribution function of the feature map X (i.e., the downsampled image), v is the variance, and ρ is the energy coefficient.
Step 505, for the downsampled image X at each scale, calculating the channel attention factor c from the feature spatial mean X.mean(dim=[2,3]), and calculating the spatial attention factor s from the mean of the downsampled image X in the channel direction:
c=x.mean(dim=[2,3]) (4)
s=X.mean(dim=1) (5)
where X.mean(dim=[2,3]) is the feature spatial mean, and X.mean(dim=1) is the mean of the downsampled image X in the channel direction.
Step 506, obtaining the enhanced feature map X* corresponding to the downsampled image at each scale based on the energy distribution function E, the channel attention factor c and the spatial attention factor s of the downsampled image X at each scale.
Wherein sigmoid represents a sigmoid function.
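Steps 501-506 can be sketched as follows. This is a hedged illustration only: the patent's energy function E (the equations preceding eq. (4)) and the exact combination rule of E, c and s in step 506 are not reproduced in this excerpt, so a SimAM-style energy term and a simple sigmoid-gated product are assumed purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# X: (batch, channels, H, W) downsampled image / feature map
X = np.random.default_rng(0).random((2, 16, 32, 32))
B, C, H, W = X.shape

mean = X.mean(axis=(2, 3), keepdims=True)         # step 502: X.mean(dim=[2,3])
d = (X - mean) ** 2                               # squared deviation from the mean
v = d.sum(axis=(2, 3), keepdims=True) / (H * W)   # step 503: per-channel variance v

c = X.mean(axis=(2, 3), keepdims=True)            # step 505, eq. (4): channel factor
s = X.mean(axis=1, keepdims=True)                 # step 505, eq. (5): spatial factor

rho = 1e-4                                        # energy coefficient (assumed value)
E = d / (4.0 * (v + rho)) + 0.5                   # step 504: assumed SimAM-style energy

X_star = X * sigmoid(E) * sigmoid(c) * sigmoid(s) # step 506: assumed combination
print(X_star.shape)  # (2, 16, 32, 32)
```

The enhanced map keeps the input shape, so the gating factors broadcast over batch, channel and spatial dimensions without reshaping.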
In this embodiment, the multi-dimensional attention mechanism MDA is introduced into the backbone network, adding the channel attention factor c and the spatial attention factor s to the feature-extraction process. The model therefore focuses on important feature information in both channel and space during feature extraction, which alleviates the difficulty of extracting features from the complex backgrounds of multi-level remote sensing images and helps to improve subsequent target detection accuracy against backgrounds at different levels.
Still further, as shown in figs. 6 and 7, which are schematic structural diagrams of the basic units constituting the backbone network, the basic units adopt a multi-dimensional attention mechanism (MDA) based on a residual structure. Each basic unit comprises several CBL units and one MDA (Multi-dimensional Attention) unit, and each CBL unit comprises a convolution (Conv, C for short), Batch Normalization (BN) and an activation function (Leaky ReLU). The backbone network may be a cascade of N basic units, N being a natural number.
In one embodiment, because optical remote sensing images contain a large number of targets of different classes, the scale variability across classes is significant, and similar features shared between classes can lead to classification errors. In a traditional target detection framework, the final classification and localization results are output directly at the prediction layer through a convolution kernel; coupling these results makes the inherently unrelated classification and localization tasks prone to misclassification and inaccurate localization. The present application therefore proposes a feature processor based on feature reconstruction, which performs feature reconstruction through a vision Transformer to enhance the aggregation of features of different classes and improve the detection of targets at different scales. The target detection model includes a plurality of feature processors (Transformers), and step 203 specifically includes:
inputting the enhanced feature map X* of each scale into the feature processor corresponding to that scale, and generating multiple groups of vector combinations for the enhanced feature map X* with the feature processor; wherein each vector combination (Q, K, V) includes a query vector Q, a key vector K and a value vector V;
calculating the self-attention weight head_i of the enhanced feature map X* using the multi-head self-attention method (MSA);
MultiHead(Q, K, V) = Concat(head_1, ..., head_i)W_0 (7)
where W_0 denotes a preset linear mapping matrix. Each head calculates the query vector Q, key vector K and value vector V with preset linear mapping matrices (W_Q, W_K, W_V), obtaining the self-attention weight Attention(QW_Q, KW_K, VW_V) corresponding to each vector combination (Q, K, V); here W_Q, W_K and W_V are the linear mapping matrices corresponding to the query vector Q, the key vector K and the value vector V, respectively.
The self-attention weight head_i is multiplied by the corresponding value vector V, and the results are spliced to obtain the reconstructed feature map MultiHead(Q, K, V) corresponding to the enhanced feature map at each scale.
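A minimal NumPy sketch of eq. (7). The weight matrices below are random placeholders standing in for the learned mappings W_Q, W_K, W_V and W_0, and the token layout (64 spatial positions as tokens of dimension 512) is an assumption made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """x: (tokens, dim). Returns Concat(head_1, ..., head_h) W_0 as in eq. (7)."""
    tokens, dim = x.shape
    head_dim = dim // num_heads
    heads = []
    for _ in range(num_heads):
        W_q, W_k, W_v = (rng.standard_normal((dim, head_dim)) * 0.02
                         for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        attn = softmax(Q @ K.T / np.sqrt(head_dim))  # scaled dot-product attention
        heads.append(attn @ V)                       # head_i
    W_0 = rng.standard_normal((dim, dim)) * 0.02     # output projection W_0
    return np.concatenate(heads, axis=-1) @ W_0

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 512))  # a flattened 512x8x8 map, spatial positions as tokens
out = multi_head_self_attention(x, num_heads=8, rng=rng)
print(out.shape)  # (64, 512)
```

Because each head works in dim/num_heads dimensions, the heads run independently and their concatenation restores the original feature dimension.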
The above embodiment parallelizes self-attention across multiple heads through multi-head self-attention (Multi-head Self-Attention, MSA), improving computational efficiency. The difference between the self-attention method and the multi-head self-attention method is shown in fig. 8(a) and fig. 8(b), where the MatMul function denotes the multiplication of two matrices; the Scale operation denotes scaling (in the self-attention method, the normalized weights are multiplied channel by channel with the original input feature map to generate a weighted feature map); Linear denotes a linear projection layer; Scaled Dot-Product Attention denotes scaled dot-product attention; and Concat denotes splicing.
In one embodiment, step 204 includes: inputting the reconstructed feature map at each scale into the decoupling module of the target detection model, so that the decoupling module performs target detection on the reconstructed feature map at each scale using an anchor-free target detection method, obtaining the categories of the targets contained in the reconstructed feature map at each scale and the probability corresponding to each category;
the decoupling module comprises a target detection decoupling head, a prediction frame output decoupling head and a target category confidence decoupling head, as shown in fig. 9; the target detection decoupling head is used for outputting the number of categories contained in the reconstructed feature map; the prediction frame output decoupling head is used for outputting the four-dimensional offset of each category; the target category confidence decoupling head is used for outputting the probability of each category.
Specifically, as shown in fig. 9, the prediction process is the same for feature maps at every scale; take the 32×-downsampled feature map as an example. Through the decoupling network based on feature reconstruction, it outputs three feature maps of different dimensions on the Cls, Reg and Obj branches; their dimensions are the probabilities of the predicted categories (number of classes, nums), four offsets, and a confidence of whether an object is contained, and the size of each feature map is 16×16. Taking the Reg-branch feature map as an example, each feature vector has dimension 4, and the 256 feature vectors are responsible for predicting the four-dimensional offsets at the corresponding positions; the 16×16 feature map is mapped back to the original image as shown in fig. 10, where each grid cell represents one pixel on the feature map. During training, if the target center in the label falls inside a grid cell, i.e., on the corresponding feature point, the network's confidence (Obj) output at that feature point is expected to approach 1; meanwhile, the Reg-branch feature map outputs four predicted offsets u, r, d, l, from which the final bounding box is generated using the upper-left corner coordinates of the grid cell and the four regression parameters, and the feature point at the corresponding position of the Cls-branch feature map outputs the category of the target.
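The branch layout for the 32×-downsampled level can be sketched as follows; nums = 10 classes is an assumed value for illustration (the patent leaves the class count to the dataset).

```python
import numpy as np

nums, H, W = 10, 16, 16                 # assumed class count; 16x16 prediction grid
cls_branch = np.zeros((nums, H, W))     # per-class probabilities at each feature point
reg_branch = np.zeros((4, H, W))        # offsets u, r, d, l at each feature point
obj_branch = np.zeros((1, H, W))        # objectness confidence at each feature point

# 256 feature vectors on the Reg branch: one 4-dim vector per grid cell
num_feature_vectors = H * W
print(num_feature_vectors, cls_branch.shape, reg_branch.shape, obj_branch.shape)
```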
In the above embodiment, fast target detection is achieved through the anchor-free optimization method.
In one embodiment, before step 201, the method includes: acquiring training sample images and the label image corresponding to each training sample image; inputting the training sample image into a target detection model to be trained, and obtaining the target detection result, the prediction frame output result and the target category confidence result output by the model; determining, based on a plurality of preset loss functions, loss values between the label image and each of the target detection result, the prediction frame output result and the target category confidence result; and adjusting the parameters of the target detection model to be trained until each loss value meets a preset training-end condition, obtaining the target detection model.
For the Obj, Reg and Cls branches, three loss functions L_obj, L_reg and L_cls are used for joint optimization. The Obj-branch loss is mainly used to help the model distinguish the foreground from the background, and is calculated as follows:
where c_(x,y) denotes the probability that the model predicts a target at point (x, y) of the feature map, and the corresponding label indicates whether a target actually exists at point (x, y) of the feature map: if it exists, the label is 1 and the feature point is a positive sample; if not, the label is 0 and the feature point is a negative sample. When the label is 1, i.e., a target exists, the second half of the above formula is 0 as a whole; to reduce the loss, the output c_(x,y) should approach 1, i.e., the larger the probability that the model predicts the feature point contains a target, the better. When the label is 0, i.e., no target exists, the first half of the above formula is 0 as a whole; to reduce the loss, the output c_(x,y) should approach 0, i.e., the lower that predicted probability, the better. Through this positive- and negative-sample loss optimization, the model can judge whether each feature point on the feature map at each scale contains a target.
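The behaviour described above, with the positive-sample term pushing c_(x,y) toward 1 and the negative-sample term pushing it toward 0, is that of a binary cross-entropy; the sketch below assumes that form for the Obj loss.

```python
import numpy as np

def obj_loss(c, y, eps=1e-7):
    """Binary cross-entropy at one feature point: c is the predicted
    objectness c_(x,y), y is the label (1 positive sample, 0 negative)."""
    c = np.clip(c, eps, 1.0 - eps)
    return -(y * np.log(c) + (1.0 - y) * np.log(1.0 - c))

# positive sample: loss shrinks as the predicted objectness approaches 1
print(obj_loss(0.9, 1.0) < obj_loss(0.5, 1.0))  # True
# negative sample: loss shrinks as the predicted objectness approaches 0
print(obj_loss(0.1, 0.0) < obj_loss(0.5, 0.0))  # True
```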
For the Reg-branch feature map, the dimension is 4: the 4-dimensional feature vector at each feature point gives the four offsets u, r, d, l of that feature point relative to the real frame. The regression loss is used to optimize the output of the model's Reg branch. Specifically, the coordinates of a feature point on the feature map are known, so the coordinates of the prediction frame at the feature-map scale can be obtained from the four offsets output at that point; a prediction frame is commonly represented by the coordinates of its upper-left and lower-right corners, calculated as follows:
x min =x p -l (11)
x max =x p +r (12)
y min =y p -u (13)
y max =y p +d (14)
where (x_min, y_min) denotes the coordinates of the upper-left corner of the prediction frame, (x_max, y_max) the coordinates of the lower-right corner, and (x_p, y_p) the coordinates of the feature point. The labels are processed in the same way and downsampled to the scale of the corresponding feature map before the regression loss is calculated. The regression loss is calculated with the intersection over union (IoU), the most commonly used loss function in target detection tasks, which measures the overlap between the prediction frame P and the real frame B; the loss function is calculated as follows:
the optimization process of the IOU loss is a process of gradually overlapping two detection frames, as shown in fig. 7, when the area a is gradually enlarged and the areas b and c are gradually reduced, the prediction frame is gradually overlapped with the real frame, and the IOU loss is 0 at the moment, so that the effect of frame regression optimization is achieved.
The classification loss is mainly used to optimize the output of the model's Cls branch, helping the model assign a specific class to feature points already determined to be positive samples; here, binary cross-entropy loss is used to handle the multi-class problem. Specifically, for a target detection dataset containing ten categories numbered 0-9, the nums dimension of the model's Cls branch is 10; that is, each feature point on the feature map outputs 10 values to judge the probability of each of the 10 categories, and the output probability of each category is optimized with binary cross-entropy loss, as follows:
where nums denotes the total number of categories; the label indicates whether the feature point at (x, y) of the feature map belongs to category i (0 if it does not belong to that category, 1 if it does); and c_(x,y),i denotes the probability that the model predicts the feature point at (x, y) belongs to category i.
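A sketch of the per-class binary cross-entropy for the ten-category example above; the one-hot label and the two candidate predictions are made up purely for illustration.

```python
import numpy as np

def bce(pred, label, eps=1e-7):
    """Element-wise binary cross-entropy over the nums class probabilities."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -(label * np.log(pred) + (1.0 - label) * np.log(1.0 - pred))

nums = 10                                  # total number of categories
label = np.zeros(nums)                     # ground truth: feature point is class 3
label[3] = 1.0
good = np.full(nums, 0.1); good[3] = 0.9   # confident, correct prediction
bad = np.full(nums, 0.1); bad[3] = 0.1     # prediction that misses the true class

# summing the per-class terms: the correct prediction incurs the lower loss
print(bce(good, label).sum() < bce(bad, label).sum())  # True
```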
The above embodiment decouples classification and localization through the decoupling network so that they do not interfere with each other. Different loss functions are jointly trained for the different tasks, and decoupled output forms for category and coordinates are given, so that the model can distinguish foreground from background, improving detection capability across different scales as well as the accuracy of localization.
The multi-scale target detection device for the remote sensing image provided by the invention is described below, and the multi-scale target detection device for the remote sensing image described below and the multi-scale target detection method for the remote sensing image described above can be correspondingly referred to each other.
In one embodiment, as shown in fig. 11, there is provided a multi-scale object detection apparatus for a remote sensing image, including: an image acquisition module 1101, a feature enhancement module 1102, a feature reconstruction module 1103, and a category output module 1104, wherein,
the image acquisition module 1101 is configured to acquire an original remote sensing image, and input the original remote sensing image into a backbone network of a target detection model;
The feature enhancement module 1102 is configured to perform feature extraction on the original remote sensing image on different scales by using the backbone network, so as to obtain an enhanced feature map corresponding to each scale;
the feature reconstruction module 1103 is configured to input the enhanced feature map of each scale into a feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method, to obtain a reconstructed feature map corresponding to each scale;
and the class output module 1104 is configured to input the reconstructed feature map on each scale to a decoupling module of the target detection model to perform target detection, so that the decoupling module performs target detection on the reconstructed feature map on each scale, and obtains a class of the target included in the reconstructed feature map on each scale and a probability corresponding to each class.
In one embodiment, the backbone network comprises a plurality of branched networks; the feature enhancement module 1102 is further configured to:
downsampling the original remote sensing image based on different downsampling step sizes by utilizing each branch network to obtain downsampled images with different scales;
Respectively calculating the characteristic space mean value of each downsampled image on the plane where the wide dimension and the high dimension are located;
calculating the variance of the characteristic value in the channel direction for the downsampled image of each scale;
calculating an energy distribution function of the downsampled image of each scale based on the characteristic space mean value and the variance aiming at the downsampled image of each scale;
aiming at the downsampled image of each scale, calculating according to the characteristic space mean value to obtain a channel attention factor, and calculating according to the mean value of the downsampled image in the channel direction to obtain the space attention factor;
and obtaining an enhancement feature map corresponding to the sampled image of each scale based on the energy distribution function, the channel attention factor and the spatial attention factor of the downsampled image of each scale.
In one embodiment, the object detection model includes a plurality of feature processors, and the feature reconstruction module 1103 is further configured to:
inputting the enhanced feature images of each scale into a feature processor corresponding to the scale of the enhanced feature images, and generating a plurality of groups of vector combinations for the enhanced feature images by using the feature processor; wherein each set of vector combinations includes a query vector, a key vector, and a value vector;
Calculating the self-attention weight of the enhanced feature map based on the query vector, the key vector and the value vector in each group of vector combinations;
multiplying the self-attention weight by the corresponding value vector, and splicing to obtain the reconstruction feature map corresponding to the enhancement feature map of each scale.
In one embodiment, the category output module 1104 is further configured to:
inputting the reconstructed feature map at each scale into the decoupling module of the target detection model, so that the decoupling module performs target detection on the reconstructed feature map at each scale using an anchor-free target detection method, obtaining the categories of the targets contained in the reconstructed feature map at each scale and the probability corresponding to each category;
the decoupling module comprises a target detection decoupling head, a prediction frame output decoupling head and a target class confidence degree decoupling head;
the target detection decoupling head is used for outputting the number of categories contained in the reconstructed feature map;
the prediction frame output decoupling head is used for outputting four-dimensional offset of each category;
the target category confidence decoupling head is used for outputting the probability of each category.
In one embodiment, the method further comprises a model training unit for:
Acquiring a training sample image and label images corresponding to the training sample images;
inputting the training sample image into a target detection model to be trained, and obtaining a target detection result, a prediction frame output result and a target class confidence result which are output by the target detection model;
respectively determining, based on a plurality of preset loss functions, loss values between the label image and each of the target detection result, the prediction frame output result and the target category confidence result;
and adjusting parameters of the target detection model to be trained until each loss value meets a preset training ending condition, and obtaining the target detection model.
In one embodiment, the above respective predetermined loss functions are shown in the above equations (10), (15), and (16), and are not described herein.
Fig. 12 illustrates a physical structure diagram of an electronic device, as shown in fig. 12, which may include: processor 1210, communication interface (Communications Interface), 1220, memory 1230 and communication bus 1240, wherein processor 1210, communication interface 1220 and memory 1230 communicate with each other via communication bus 1240. Processor 1210 may invoke logic instructions in memory 1230 to perform a multi-scale object detection method for a telemetry image, the method comprising: acquiring an original remote sensing image, and inputting the original remote sensing image into a backbone network of a target detection model; extracting features of the original remote sensing image on different scales by using the backbone network to obtain an enhanced feature map corresponding to each scale; inputting the enhanced feature images of each scale into a feature processor of the target detection model, so that the feature processor respectively carries out feature reconstruction on the enhanced feature images of each scale according to a multi-head self-attention method to obtain reconstructed feature images corresponding to each scale; and inputting the reconstructed feature images on each scale into a decoupling module of the target detection model to perform target detection, so that the decoupling module performs target detection on the reconstructed feature images on each scale respectively to obtain the categories of the targets contained in the reconstructed feature images on each scale and the probabilities corresponding to the categories.
In addition, the logic instructions in the memory 1230 described above may be implemented in the form of software functional units and sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the multi-scale object detection method for a remote sensing image provided by the methods above, the method comprising: acquiring an original remote sensing image, and inputting the original remote sensing image into a backbone network of a target detection model; extracting features of the original remote sensing image on different scales by using the backbone network to obtain an enhanced feature map corresponding to each scale; inputting the enhanced feature images of each scale into a feature processor of the target detection model, so that the feature processor respectively carries out feature reconstruction on the enhanced feature images of each scale according to a multi-head self-attention method to obtain reconstructed feature images corresponding to each scale; and inputting the reconstructed feature images on each scale into a decoupling module of the target detection model to perform target detection, so that the decoupling module performs target detection on the reconstructed feature images on each scale respectively to obtain the categories of the targets contained in the reconstructed feature images on each scale and the probabilities corresponding to the categories.
In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the multi-scale object detection method for a remote sensing image provided by the above methods, the method comprising: acquiring an original remote sensing image, and inputting the original remote sensing image into a backbone network of a target detection model; extracting features of the original remote sensing image on different scales by using the backbone network to obtain an enhanced feature map corresponding to each scale; inputting the enhanced feature images of each scale into a feature processor of the target detection model, so that the feature processor respectively carries out feature reconstruction on the enhanced feature images of each scale according to a multi-head self-attention method to obtain reconstructed feature images corresponding to each scale; and inputting the reconstructed feature images on each scale into a decoupling module of the target detection model to perform target detection, so that the decoupling module performs target detection on the reconstructed feature images on each scale respectively to obtain the categories of the targets contained in the reconstructed feature images on each scale and the probabilities corresponding to the categories.
The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A multi-scale target detection method for a remote sensing image, comprising:
acquiring an original remote sensing image, and inputting the original remote sensing image into a backbone network of a target detection model;
extracting features of the original remote sensing image on different scales by using the backbone network to obtain an enhanced feature map corresponding to each scale;
inputting the enhanced feature map of each scale into a feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method, to obtain a reconstructed feature map corresponding to each scale; and
inputting the reconstructed feature map of each scale into a decoupling module of the target detection model for target detection, so that the decoupling module performs target detection on the reconstructed feature map of each scale respectively, to obtain the categories of the targets contained in the reconstructed feature map of each scale and the probabilities corresponding to the categories.
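The data flow of the four steps of claim 1 can be sketched as a toy pipeline. Every component below (an average-pooling "backbone", a random per-pixel linear classifier standing in for the feature processor and decoupling head) is a placeholder for the patent's learned modules; only the multi-scale structure of the flow is meant to match the claim.

```python
import numpy as np

def detect_multiscale(image, scales=(2, 4, 8), num_classes=3, seed=0):
    """Toy sketch of the claimed pipeline: multi-scale backbone ->
    per-scale feature processing -> per-scale class probabilities."""
    rng = np.random.default_rng(seed)
    H, W = image.shape
    results = {}
    for s in scales:
        # "backbone": downsample by average pooling with stride s
        h, w = H // s, W // s
        feat = image[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))
        # "feature processor" + "decoupling head": per-pixel linear map
        # to class logits, then softmax to class probabilities
        logits = feat.reshape(-1, 1) @ rng.standard_normal((1, num_classes))
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs = e / e.sum(axis=-1, keepdims=True)
        results[s] = probs.reshape(h, w, num_classes)
    return results
```

Each entry of the returned dictionary corresponds to one scale, mirroring the claim's "categories and probabilities on each scale".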
2. The method according to claim 1, wherein the backbone network comprises a plurality of branch networks, and the extracting features of the original remote sensing image on different scales by using the backbone network to obtain an enhanced feature map corresponding to each scale comprises the following steps:
downsampling the original remote sensing image with a different downsampling step size in each branch network, to obtain downsampled images of different scales;
calculating, for each downsampled image, the spatial feature mean over the plane spanned by the width and height dimensions;
calculating, for the downsampled image of each scale, the variance of the feature values along the channel direction;
calculating, for the downsampled image of each scale, an energy distribution function based on the spatial feature mean and the variance;
calculating, for the downsampled image of each scale, a channel attention factor from the spatial feature mean, and a spatial attention factor from the mean of the downsampled image along the channel direction; and
obtaining an enhanced feature map corresponding to the downsampled image of each scale based on its energy distribution function, channel attention factor, and spatial attention factor.
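The enhancement steps of claim 2 can be sketched for a single scale as follows. The claim names the quantities (spatial mean, channel-direction variance, energy distribution function, channel and spatial attention factors) but not their exact combination, so the SimAM-like energy form and the sigmoid gating below are assumptions made only to illustrate the data flow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def enhance_feature_map(x, lam=1e-4):
    """Illustrative per-scale feature enhancement for a (C, H, W) map."""
    # spatial feature mean of each channel over the (H, W) plane
    mu = x.mean(axis=(1, 2), keepdims=True)            # (C, 1, 1)
    # variance of the feature values along the channel direction
    var = x.var(axis=0, keepdims=True)                 # (1, H, W)
    # energy-style distribution built from the mean and variance
    # (SimAM-like form, an assumed stand-in for the patent's function)
    energy = (x - mu) ** 2 / (4.0 * (var + lam)) + 0.5
    # channel attention factor from the spatial feature mean
    ch_att = sigmoid(mu)                               # (C, 1, 1)
    # spatial attention factor from the channel-direction mean
    sp_att = sigmoid(x.mean(axis=0, keepdims=True))    # (1, H, W)
    return x * sigmoid(energy) * ch_att * sp_att
```

The output has the same (C, H, W) shape as the input, so it can feed the feature processor of claim 3 unchanged.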
3. The method according to claim 2, wherein the target detection model comprises a plurality of feature processors, and the inputting the enhanced feature map of each scale into a feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method to obtain a reconstructed feature map corresponding to each scale, comprises:
inputting the enhanced feature map of each scale into the feature processor corresponding to that scale, and generating a plurality of groups of vector combinations for the enhanced feature map by using the feature processor, wherein each group of vector combinations includes a query vector, a key vector, and a value vector;
calculating the self-attention weights of the enhanced feature map based on the query vector, the key vector, and the value vector in each group of vector combinations; and
multiplying the self-attention weights by the corresponding value vectors, and splicing the results to obtain the reconstructed feature map corresponding to the enhanced feature map of each scale.
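The reconstruction of claim 3 can be sketched as standard multi-head self-attention over the flattened feature map, one token per spatial position. The random projection matrices below are stand-ins for the learned query/key/value weights of the patent's feature processor.

```python
import numpy as np

def multihead_self_attention(feat, num_heads=4, seed=0):
    """Illustrative multi-head self-attention reconstruction of a
    (C, H, W) enhanced feature map."""
    C, H, W = feat.shape
    assert C % num_heads == 0
    d = C // num_heads
    tokens = feat.reshape(C, H * W).T                  # (N, C): one token per pixel
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv    # query/key/value vectors
    heads = []
    for h in range(num_heads):
        sl = slice(h * d, (h + 1) * d)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(d)    # (N, N) self-attention scores
        scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
        attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        heads.append(attn @ v[:, sl])                  # weight * value per head
    out = np.concatenate(heads, axis=-1)               # splice the heads together
    return out.T.reshape(C, H, W)                      # reconstructed feature map
```

One such processor would be instantiated per scale, as the claim states, with independent weights at each scale.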
4. The method according to claim 3, wherein the inputting the reconstructed feature map of each scale into the decoupling module of the target detection model for target detection, so that the decoupling module performs target detection on the reconstructed feature map of each scale to obtain the categories of the targets contained in the reconstructed feature map of each scale and the probabilities corresponding to the categories, includes:
inputting the reconstructed feature map of each scale into the decoupling module of the target detection model, so that the decoupling module performs target detection on the reconstructed feature map of each scale respectively by using an anchor-free target detection method, to obtain the categories of the targets contained in the reconstructed feature map of each scale and the probabilities corresponding to the categories;
wherein the decoupling module comprises a target detection decoupling head, a prediction box output decoupling head, and a target category confidence decoupling head;
the target detection decoupling head is used for outputting the number of categories contained in the reconstructed feature map;
the prediction box output decoupling head is used for outputting a four-dimensional offset for each category; and
the target category confidence decoupling head is used for outputting the probability of each category.
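The three decoupled heads of claim 4 can be sketched as three independent per-pixel linear maps on the reconstructed feature map, in the style of an anchor-free detector's decoupled head. The random weights below are stand-ins for the learned 1x1 convolutions such a head would use.

```python
import numpy as np

def decoupled_head(feat, num_classes, seed=0):
    """Illustrative decoupled detection heads for a (C, H, W) map:
    class probabilities, 4-D box offsets, and objectness confidence."""
    C, H, W = feat.shape
    rng = np.random.default_rng(seed)
    tokens = feat.reshape(C, H * W).T                             # (N, C)
    cls_logits = tokens @ rng.standard_normal((C, num_classes))   # class branch
    box_offsets = tokens @ rng.standard_normal((C, 4))            # 4-D offset branch
    obj_logit = tokens @ rng.standard_normal((C, 1))              # confidence branch
    # class probabilities via softmax; confidence via sigmoid
    e = np.exp(cls_logits - cls_logits.max(axis=-1, keepdims=True))
    cls_prob = e / e.sum(axis=-1, keepdims=True)
    conf = 1.0 / (1.0 + np.exp(-obj_logit))
    return (cls_prob.T.reshape(num_classes, H, W),
            box_offsets.T.reshape(4, H, W),
            conf.reshape(H, W))
```

Decoupling the branches this way lets classification, box regression, and confidence be supervised by separate losses, matching the multi-loss training described in claim 5.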
5. The method according to claim 4, wherein, before the acquiring an original remote sensing image, the method further comprises:
acquiring training sample images and a label image corresponding to each training sample image;
inputting the training sample images into a target detection model to be trained, and obtaining the target detection result, the prediction box output result, and the target category confidence result output by the target detection model;
determining, based on a plurality of preset loss functions, the loss values between the label image and each of the target detection result, the prediction box output result, and the target category confidence result; and
adjusting the parameters of the target detection model to be trained until each loss value meets a preset training end condition, to obtain the target detection model.
6. The method according to claim 5, wherein the preset loss functions are respectively:
wherein c(x,y) represents the probability that the model predicts a target at point (x, y) of the feature map, and the corresponding label indicates whether a target actually exists at point (x, y); P is the prediction box and B is the ground-truth box.
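The loss formulas themselves are rendered as images in the source and are not reproduced above. A common pairing consistent with the symbols described (a per-point target probability c(x,y) against a 0/1 existence label, and a prediction box P against a ground-truth box B) is binary cross-entropy plus an IoU box loss; the sketch below is an illustration of that pairing, not a reproduction of the patent's exact functions.

```python
import numpy as np

def bce_loss(c_pred, c_true, eps=1e-7):
    """Binary cross-entropy between the predicted target probability
    c_(x,y) and the 0/1 label of whether a target exists at (x, y)."""
    c_pred = np.clip(c_pred, eps, 1.0 - eps)
    return -(c_true * np.log(c_pred) + (1 - c_true) * np.log(1 - c_pred)).mean()

def iou_loss(P, B):
    """1 - IoU between prediction box P and ground-truth box B, with
    boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(P[0], B[0]), max(P[1], B[1])
    ix2, iy2 = min(P[2], B[2]), min(P[3], B[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # intersection area
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(P) + area(B) - inter                  # union area
    return 1.0 - inter / union
```

Both losses go to zero for a perfect prediction, which is what allows the training-end condition of claim 5 to be stated as each loss value meeting a preset threshold.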
7. A multi-scale target detection device for a remote sensing image, comprising:
an image acquisition module, used for acquiring an original remote sensing image and inputting the original remote sensing image into a backbone network of a target detection model;
a feature enhancement module, used for extracting features of the original remote sensing image on different scales by using the backbone network, to obtain an enhanced feature map corresponding to each scale;
a feature reconstruction module, used for inputting the enhanced feature map of each scale into a feature processor of the target detection model, so that the feature processor performs feature reconstruction on the enhanced feature map of each scale according to a multi-head self-attention method, to obtain a reconstructed feature map corresponding to each scale; and
a category output module, used for inputting the reconstructed feature map of each scale into a decoupling module of the target detection model for target detection, so that the decoupling module performs target detection on the reconstructed feature map of each scale respectively, to obtain the categories of the targets contained in the reconstructed feature map of each scale and the probabilities corresponding to the categories.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the multi-scale target detection method for a remote sensing image according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-scale target detection method for a remote sensing image according to any one of claims 1 to 6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the multi-scale target detection method for a remote sensing image according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310658922.0A CN116758419A (en) | 2023-06-05 | 2023-06-05 | Multi-scale target detection method, device and equipment for remote sensing image |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310658922.0A CN116758419A (en) | 2023-06-05 | 2023-06-05 | Multi-scale target detection method, device and equipment for remote sensing image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116758419A true CN116758419A (en) | 2023-09-15 |
Family
ID=87947026
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310658922.0A Pending CN116758419A (en) | 2023-06-05 | 2023-06-05 | Multi-scale target detection method, device and equipment for remote sensing image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116758419A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830874A (en) * | 2024-03-05 | 2024-04-05 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117830874B (en) * | 2024-03-05 | 2024-05-07 | 成都理工大学 | A remote sensing target detection method under multi-scale fuzzy boundary conditions |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109584248B (en) | Infrared target instance segmentation method based on feature fusion and dense connection network | |
Xie et al. | Multilevel cloud detection in remote sensing images based on deep learning | |
US10984289B2 (en) | License plate recognition method, device thereof, and user equipment | |
CN109960742B (en) | Local information searching method and device | |
US20190130232A1 (en) | Font identification from imagery | |
CN111080693A (en) | Robot autonomous classification grabbing method based on YOLOv3 | |
CN111310666A (en) | High-resolution image ground feature identification and segmentation method based on texture features | |
CN105447532A (en) | Identity authentication method and device | |
US10755146B2 (en) | Network architecture for generating a labeled overhead image | |
CN114005169B (en) | Face key point detection method and device, electronic equipment and storage medium | |
CN117671480B (en) | Automatic landslide identification method, system and computer equipment based on visual large model | |
CN117671509B (en) | Remote sensing target detection method and device, electronic equipment and storage medium | |
AU2019349986A1 (en) | Apparatus and method for three-dimensional object recognition | |
CN111368733A (en) | A 3D hand pose estimation method, storage medium and terminal based on label distribution learning | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN110110618A (en) | A kind of SAR target detection method based on PCA and global contrast | |
CN108960005B (en) | Method and system for establishing and displaying object visual label in intelligent visual Internet of things | |
CN116704324A (en) | Target detection method, system, device and storage medium based on underwater images | |
Pang et al. | PTRSegNet: A Patch-to-Region Bottom–Up Pyramid Framework for the Semantic Segmentation of Large-Format Remote Sensing Images | |
CN114550014B (en) | Road segmentation method and computer device | |
Zhou et al. | A low-resolution image restoration classifier network to identify stored-grain insects from images of sticky boards | |
CN116758419A (en) | Multi-scale target detection method, device and equipment for remote sensing image | |
CN118397257B (en) | SAR image ship target detection method and device, electronic equipment and storage medium | |
CN115019201A (en) | Weak and small target detection method based on feature refined depth network | |
CN114283326A (en) | An underwater target re-identification method combining local perception and high-order feature reconstruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||