
CN114724046B - Optical remote sensing image detection method, device and storage medium - Google Patents


Info

Publication number
CN114724046B
CN114724046B (application number CN202210314026.8A)
Authority
CN
China
Prior art keywords
remote sensing
image
optical remote
feature
images
Prior art date
Legal status
Active
Application number
CN202210314026.8A
Other languages
Chinese (zh)
Other versions
CN114724046A (en)
Inventor
马雷
罗心怡
洪汉玉
陈冰川
赵凡
刘红
许启航
Current Assignee
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date
Filing date
Publication date
Application filed by Wuhan Institute of Technology
Priority to CN202210314026.8A
Publication of CN114724046A
Application granted
Publication of CN114724046B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an optical remote sensing image detection method, device and storage medium, belonging to the field of image detection. The method comprises: dividing a plurality of optical remote sensing images to obtain a plurality of optical remote sensing training images and a plurality of optical remote sensing test images; labeling each training image to obtain an optical remote sensing labeling image; constructing a training model and performing model training on it according to the plurality of labeling images to obtain an image detection model to be tested; testing the model to be tested according to the plurality of test images to obtain the image detection model; and detecting an optical remote sensing image to be detected through the image detection model to obtain a detection result. The invention distinguishes the target area from the background area, predicts the saliency map more accurately, improves the accuracy of optical remote sensing image detection, and saves a large amount of labor and time.

Description

Optical remote sensing image detection method, device and storage medium
Technical Field
The invention relates to the technical field of image detection, and in particular to an optical remote sensing image detection method, device and storage medium.
Background
Conventional salient object detection models typically rely on low-level features such as color contrast and background priors. With the great success of deep learning in the field of computer vision, models based on fully convolutional networks have become the mainstream salient object detection architecture. In recent years, research on salient object detection in natural scenes has progressed greatly, but little of it addresses optical remote sensing images. In general, the foreground region of an optical remote sensing image is similar to its surroundings and contains targets at different scales.
Currently, salient object detection methods typically use attention mechanisms to capture semantic context, or pyramid structures to handle the scale changes of salient objects. However, while addressing targets at different scales, these methods fail to introduce target boundaries to delimit the foreground region, resulting in inaccurate detection of foreground targets and making it difficult to distinguish salient objects in complex scenes. Furthermore, existing deep-learning-based saliency detection algorithms for optical remote sensing images rely on accurate pixel-by-pixel labeling, which requires a significant amount of labor and time.
Disclosure of Invention
The invention aims to solve the technical problem of providing an optical remote sensing image detection method, an optical remote sensing image detection device and a storage medium aiming at the defects of the prior art.
The technical scheme of the invention for solving the above technical problem is as follows: an optical remote sensing image detection method comprises the following steps:
Importing a plurality of optical remote sensing images, and dividing the optical remote sensing images to obtain an optical remote sensing image training set and an optical remote sensing image testing set, wherein the optical remote sensing image training set comprises a plurality of optical remote sensing training images, and the optical remote sensing image testing set comprises a plurality of optical remote sensing testing images;
Labeling each optical remote sensing training image to obtain an optical remote sensing labeling image corresponding to each optical remote sensing training image;
constructing a training model, and carrying out model training on the training model according to a plurality of optical remote sensing annotation images to obtain an image detection model to be tested;
Testing the to-be-tested image detection model according to a plurality of the optical remote sensing test images to obtain an image detection model;
And importing an optical remote sensing image to be detected, and detecting the optical remote sensing image to be detected through the image detection model to obtain a detection result.
Another technical scheme of the invention for solving the above technical problem is as follows: an optical remote sensing image detection device comprises:
The image dividing module is used for importing a plurality of optical remote sensing images and dividing the optical remote sensing images to obtain an optical remote sensing image training set and an optical remote sensing image testing set, wherein the optical remote sensing image training set comprises a plurality of optical remote sensing training images, and the optical remote sensing image testing set comprises a plurality of optical remote sensing testing images;
the image labeling module is used for labeling each optical remote sensing training image respectively to obtain optical remote sensing labeling images corresponding to each optical remote sensing training image;
The model training module is used for constructing a training model, and carrying out model training on the training model according to a plurality of optical remote sensing annotation images to obtain an image detection model to be tested;
the model test module is used for testing the image detection model to be tested according to a plurality of the optical remote sensing test images to obtain an image detection model;
the detection result obtaining module is used for importing an optical remote sensing image to be detected, and detecting the optical remote sensing image to be detected through the image detection model to obtain a detection result.
The optical remote sensing image detection device comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the optical remote sensing image detection method is realized when the processor executes the computer program.
Another technical solution of the present invention for solving the above technical problem is a computer-readable storage medium storing a computer program which, when executed by a processor, implements the optical remote sensing image detection method as described above.
The method has the advantages that a plurality of optical remote sensing training images and test images are obtained by dividing the optical remote sensing images; the optical remote sensing labeling images are obtained by labeling the training images; the image detection model to be tested is obtained by training the training model on the plurality of labeling images; the image detection model is obtained by testing the model to be tested on the plurality of test images; and the detection result is obtained by detecting the optical remote sensing image to be detected with the image detection model. The method thus distinguishes the target area from the background area, predicts the saliency map more accurately, improves the accuracy of optical remote sensing image detection, and saves a large amount of labor and time.
Drawings
Fig. 1 is a schematic flow chart of an optical remote sensing image detection method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a model analysis of an optical remote sensing image detection method according to an embodiment of the present invention;
Fig. 3 is a block diagram of an optical remote sensing image detection device according to an embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the examples are given solely to illustrate the invention and are not to be construed as limiting its scope.
Fig. 1 is a schematic flow chart of an optical remote sensing image detection method according to an embodiment of the present invention.
As shown in fig. 1, an optical remote sensing image detection method includes the following steps:
Importing a plurality of optical remote sensing images, and dividing the optical remote sensing images to obtain an optical remote sensing image training set and an optical remote sensing image testing set, wherein the optical remote sensing image training set comprises a plurality of optical remote sensing training images, and the optical remote sensing image testing set comprises a plurality of optical remote sensing testing images;
Labeling each optical remote sensing training image to obtain an optical remote sensing labeling image corresponding to each optical remote sensing training image;
constructing a training model, and carrying out model training on the training model according to a plurality of optical remote sensing annotation images to obtain an image detection model to be tested;
Testing the to-be-tested image detection model according to a plurality of the optical remote sensing test images to obtain an image detection model;
And importing an optical remote sensing image to be detected, and detecting the optical remote sensing image to be detected through the image detection model to obtain a detection result.
It should be appreciated that the plurality of optical remote sensing images may be data in the EORSSD dataset, and the plurality of optical remote sensing training images may be data in the EORSSD training set.
It should be appreciated that the image dataset (i.e., the plurality of the optical remote sensing images) is divided into a training set (i.e., the optical remote sensing image training set) and a testing set (i.e., the optical remote sensing image testing set), the images in the training set (i.e., the optical remote sensing image training set) are annotated with graffiti annotations, and the annotated images (i.e., the optical remote sensing annotation images) are used as graffiti labels.
It should be appreciated that the foreground and background regions in the original image I rgb (i.e., the optical remote sensing training image) are arbitrarily marked with the graffiti annotation; foreground, background and unknown pixels in the graffiti annotation are denoted {1, 2, 0} respectively; and the training dataset (i.e., the plurality of optical remote sensing labeling images) is defined as D = {(x i, y i)}, i = 1, …, N, where x i is the input image, y i is the corresponding label, and N is the size of the training dataset.
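The {1, 2, 0} encoding described above can be illustrated with a small numpy sketch (the function and mask names are ours, not from the patent):

```python
import numpy as np

def encode_scribble(fg_mask, bg_mask):
    """Build a scribble (graffiti) label map from boolean foreground/background strokes:
    foreground strokes -> 1, background strokes -> 2, unmarked pixels -> 0."""
    label = np.zeros(fg_mask.shape, dtype=np.uint8)  # all pixels start as unknown (0)
    label[fg_mask] = 1                               # foreground strokes
    label[bg_mask] = 2                               # background strokes
    return label

# Toy 4x4 example with one foreground stroke and one background stroke.
fg = np.zeros((4, 4), dtype=bool); fg[1, 1:3] = True
bg = np.zeros((4, 4), dtype=bool); bg[3, :2] = True
y = encode_scribble(fg, bg)
```

Only the sparse strokes are labeled; most pixels remain 0, which is what makes scribble supervision far cheaper than pixel-wise annotation.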
It will be appreciated that the trained network (i.e., the image detection model to be tested) is tested using a test sample set (i.e., a plurality of the optical remote sensing test images) and evaluated.
It should be understood that the parameters of the trained optimal model (i.e., the image detection model to be tested) are saved for subsequent testing, the test is performed, and the test results are evaluated.
It should be appreciated that the algorithm of the present invention is evaluated on two public datasets, ORSSD and EORSSD. The ORSSD dataset comprises 600 training samples and 200 test samples; it was the first publicly available dataset for the task of salient object detection in optical remote sensing images, but its size is relatively small. The EORSSD dataset is an extension of ORSSD containing 2000 images with corresponding pixel-level ground-truth annotations, of which the training set contains 1400 images and the test set contains 600 images.
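The 1400/600 split described above can be sketched in a few lines of plain Python (the file names and helper function are placeholders, not the actual EORSSD file layout):

```python
import random

def split_dataset(image_ids, n_train, seed=0):
    """Shuffle the image ids reproducibly, then cut into train/test sets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]

# EORSSD: 2000 images -> 1400 training + 600 test (placeholder names).
all_ids = [f"eorssd_{i:04d}.png" for i in range(2000)]
train_ids, test_ids = split_dataset(all_ids, n_train=1400)
```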
In the above embodiment, a plurality of optical remote sensing training images and test images are obtained by dividing the optical remote sensing images; the optical remote sensing labeling images are obtained by labeling the training images; the image detection model to be tested is obtained by training the training model on the labeling images; the image detection model is obtained by testing the model to be tested on the test images; and the detection result is obtained by detecting the optical remote sensing image to be detected through the image detection model. The distinction of the target area and the background area is thereby realized, the saliency map can be predicted more accurately, the accuracy of optical remote sensing image detection is improved, and a large amount of labor and time is saved.
Optionally, as an embodiment of the present invention, the process of labeling each of the optical remote sensing training images to obtain an optical remote sensing labeling image corresponding to each of the optical remote sensing training images includes:
And marking each optical remote sensing training image by using a Quick Selection tool to obtain an optical remote sensing marking image corresponding to each optical remote sensing training image.
It should be appreciated that the Quick Selection tool is a Quick Selection tool in Adobe Photoshop, and that the annotation is accomplished by a Quick Selection tool in Adobe Photoshop.
In the above embodiment, the Quick Selection tool is used to label each optical remote sensing training image to obtain the optical remote sensing label image, so as to provide accurate data for subsequent processing, realize the distinction of the target area and the background area, and more accurately predict the saliency map.
Optionally, as an embodiment of the present invention, the process of constructing a training model, performing model training on the training model according to a plurality of optical remote sensing annotation images, and obtaining an image detection model to be tested includes:
S1, constructing a training model, and respectively extracting the characteristics of each optical remote sensing annotation image according to the training model to obtain a first to-be-processed characteristic image corresponding to each optical remote sensing annotation image, a second to-be-processed characteristic image corresponding to each optical remote sensing annotation image and a third to-be-processed characteristic image corresponding to each optical remote sensing annotation image;
s2, analyzing target positioning feature images of the optical remote sensing annotation images and the third to-be-processed feature images corresponding to the optical remote sensing annotation images respectively to obtain target positioning feature images corresponding to the optical remote sensing annotation images;
S3, analyzing the edge feature images of each first feature image to be processed, the second feature images to be processed corresponding to each optical remote sensing labeling image and the third feature images to be processed corresponding to each optical remote sensing labeling image respectively to obtain the edge feature images corresponding to each optical remote sensing labeling image;
s4, respectively carrying out predictive analysis on each target positioning feature image and the edge feature image corresponding to each optical remote sensing labeling image to obtain a salient image corresponding to each optical remote sensing labeling image;
And S5, respectively analyzing the loss values of the optical remote sensing annotation images and the salient images corresponding to the optical remote sensing annotation images to obtain the image detection model to be tested.
It should be understood that foreground features are extracted from the top-level feature map using the graffiti annotation (i.e., the optical remote sensing labeling image); a target positioning feature map is obtained by a correlation filtering operation; an edge detection network built on the FPN feature pyramid is used to extract an edge feature map; the target positioning feature map and the edge feature map are spliced; and saliency prediction is performed on the spliced feature map.
In the above embodiment, the image detection model is obtained through model analysis of the plurality of optical remote sensing labeling images and the optical remote sensing image test set by the training model, so that the target area and the background area are distinguished and the multi-scale information and edge information of the picture are integrated into the learning process of the saliency map, which can therefore be predicted more accurately.
Optionally, as an embodiment of the present invention, the training model includes a first feature block, a second feature block, a third feature block, a fourth feature block, and a fifth feature block, and the process of step S1 includes:
Performing first feature extraction on each optical remote sensing annotation image through the first feature blocks to obtain first initial feature images corresponding to each optical remote sensing annotation image;
Performing second feature extraction on each first initial feature image through the second feature block to obtain second initial feature images corresponding to each optical remote sensing annotation image;
Performing third feature extraction on each second initial feature image through the third feature block to obtain third initial feature images corresponding to each optical remote sensing annotation image, and taking the third initial feature images as first feature images to be processed;
Performing fourth feature extraction on each first feature image to be processed through the fourth feature block to obtain second feature images to be processed corresponding to each optical remote sensing annotation image;
And respectively carrying out fifth feature extraction on each second feature image to be processed through the fifth feature block to obtain a third feature image to be processed corresponding to each optical remote sensing annotation image.
It should be appreciated that the top level feature map (i.e., the third pending feature map) in the VGG16 network is extracted.
It should be understood that the VGG16 fully connected layers are removed and the remainder is used as the backbone network (i.e., the training model), which includes five convolution blocks, each composed of a number of convolution layers, where the channels output by the five blocks are 64, 128, 256, 512 and 512, respectively.
Specifically, the VGG16 structure with the full connection layer removed (i.e., the training model) is as follows:
The size of the input picture (namely the optical remote sensing label image) is 512×512 with 3 channels (RGB). The first convolution block has two convolution layers outputting 512×512×64, followed by a max pooling layer that downsamples to 256×256×64; the second convolution block has two convolution layers outputting 256×256×128, followed by max pooling to 128×128×128; the third convolution block has three convolution layers outputting 128×128×256, followed by max pooling to 64×64×256; the fourth convolution block has three convolution layers outputting 64×64×512, followed by max pooling to 32×32×512; and the fifth convolution block has three convolution layers outputting 32×32×512.
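The layer sizes of the truncated VGG16 backbone described above can be checked with a short shape walk-through (plain Python; the helper name is ours, and it assumes 3×3 convolutions with padding 1 so that convolutions preserve spatial size):

```python
def backbone_shapes(h, w):
    """Return the (H, W, C) output shape of each of the five VGG16 conv blocks.
    A 2x2 max-pool halves the resolution after each of the first four blocks."""
    channels = [64, 128, 256, 512, 512]   # output channels of blocks 1-5
    shapes = []
    for i, c in enumerate(channels):
        shapes.append((h, w, c))          # output of conv block i+1
        if i < 4:                         # pooling after blocks 1-4
            h, w = h // 2, w // 2
    return shapes

shapes = backbone_shapes(512, 512)        # 512x512 RGB input as in the text
```

The last entry is the 32×32×512 top-level feature map used later for correlation filtering.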
It should be understood that the pre-trained VGG16 network is adopted as a backbone network (i.e., the training model), and the backbone network (i.e., the training model) includes five convolution blocks, each convolution block is composed of a plurality of convolutions, wherein the output channel of each convolution block is respectively 64, 128, 256, 512 and 512. Parameters of the network model are initialized, batchsize is set to 6, the initial learning rate is set to 0.0001, and the number of training epochs is set to 60.
It should be understood that the VGG16 fully connected layers are removed and the remainder is used as the backbone network (i.e., the training model); the color image I rgb (i.e., the optical remote sensing label image) is input into the pretrained VGG16 convolutional neural network (i.e., the training model) to obtain the feature map of the last VGG16 convolutional layer (i.e., the third feature map to be processed), defined as F ∈ R^(H×W×d), where H and W represent the height and width of the feature map and d represents the number of channels of the feature map.
In the above embodiment, the first feature map to be processed, the second feature map to be processed, and the third feature map to be processed are obtained by extracting the features of each optical remote sensing labeling image through the training model, so that the distinction between the target area and the background area is realized, and the multi-scale information and the edge information of the image are integrated into the learning process of the saliency map at the same time.
Optionally, as an embodiment of the present invention, the process of step S2 includes:
Respectively multiplying each optical remote sensing annotation image and a third feature image to be processed corresponding to each optical remote sensing annotation image pixel by pixel to obtain a foreground feature image corresponding to each optical remote sensing annotation image;
Respectively carrying out average pooling treatment on each foreground feature image to obtain a filter corresponding to each optical remote sensing annotation image;
And respectively carrying out deep convolution processing on each third feature image to be processed and the filter corresponding to each optical remote sensing annotation image to obtain a target positioning feature image corresponding to each optical remote sensing annotation image.
It should be appreciated that the filter may be a 512×1×1 filter. A "kernel" usually refers to a 2D weight matrix, while a "filter" refers to a 3D structure of multiple stacked kernels; the 1×1 kernel here has 512 channels and is therefore regarded as 3D.
It should be appreciated that the last-layer feature map of the VGG16 network (i.e., the third feature map to be processed) is multiplied by the corresponding graffiti annotation Y i (i.e., the optical remote sensing annotation image) to obtain a feature map Y f containing only foreground features (i.e., the foreground feature map); an average pooling operation is performed on Y f to obtain a 512×1×1 filter kernel (i.e., the filter); and a depth-wise correlation convolution between this 512×1×1 filter kernel and the image feature map (i.e., the third feature map to be processed) yields the target positioning feature map Y o.
It should be understood that the output of the last layer of the VGG16 network (i.e., the third feature map to be processed) is multiplied by the corresponding graffiti foreground annotation Y i (i.e., the optical remote sensing label image) to obtain a feature map Y f (i.e., the foreground feature map) containing only foreground features; an average pooling operation is performed on Y f to obtain a 512×1×1 filter; and a depth-wise correlation convolution is performed on the filter and the output of the last layer of the VGG16 network (i.e., the third feature map to be processed) to obtain the target positioning feature map Y o.
Specifically, the image I rgb is input to the network and features are extracted with VGG16; the top-level feature map (i.e., the third feature map to be processed) output by the fifth block C 5 of VGG16 is multiplied pixel by pixel with the foreground region of the marked graffiti annotation Y i (i.e., the optical remote sensing labeling image) to obtain a foreground feature map Y f; an average pooling operation on Y f yields a single 512×1×1 filter; and a depth-wise convolution of this filter with C 5 (i.e., the third feature map to be processed) realizes correlation filtering of the image features, i.e., discriminative features are mined by exploiting the correlation between the multi-scale visual features and the high-level semantic information of the image.
It should be appreciated that the extracted foreground features (i.e., the foreground feature map) are integrated into a 512×1×1 filter, and the image features (i.e., the third to-be-processed feature map) are subjected to correlation filtering by using the filter, so as to obtain the target positioning feature map Y o.
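The masking, pooling and depth-wise correlation steps above can be sketched in numpy (a simplified illustration, not the patent's implementation; with a 1×1 depth-wise kernel the correlation convolution reduces to a per-channel scaling, and the function and variable names are ours):

```python
import numpy as np

def target_localization(features, fg_mask, eps=1e-8):
    """features: (C, H, W) top-level feature map; fg_mask: (H, W) in {0, 1}.
    Returns the target positioning feature map Y_o."""
    fg = features * fg_mask[None, :, :]          # keep only foreground responses (Y_f)
    n = fg_mask.sum() + eps
    kernel = fg.sum(axis=(1, 2)) / n             # average-pool to a 512x1x1 filter
    return features * kernel[:, None, None]      # depth-wise correlation (per-channel scale)

feats = np.random.rand(512, 32, 32)              # stand-in for the C_5 feature map
mask = np.zeros((32, 32)); mask[10:20, 10:20] = 1.0   # stand-in scribble foreground
y_o = target_localization(feats, mask)
```

Channels that respond strongly inside the scribbled foreground get large filter weights and are amplified everywhere they fire, which is the intended localization effect.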
In the above embodiment, the target positioning feature map is obtained by analyzing the target positioning feature map of each optical remote sensing labeling image and the third feature map to be processed, so that the related filtering of the image features is realized, and the features with discriminant are mined by utilizing the correlation between the multi-scale visual features and the advanced semantic information of the images.
Optionally, as an embodiment of the present invention, the process of step S3 includes:
Laterally connecting all the third feature maps to be processed through a 1×1 convolution layer respectively to obtain first pyramid feature maps corresponding to all the optical remote sensing annotation images;
performing first element-by-element addition on each first pyramid feature map and a second feature map to be processed corresponding to each optical remote sensing annotation image respectively to obtain second pyramid feature maps corresponding to each optical remote sensing annotation image;
Performing second element-by-element addition on each second pyramid feature map and the first feature map to be processed corresponding to each optical remote sensing annotation image respectively to obtain a third pyramid feature map corresponding to each optical remote sensing annotation image;
The first pyramid feature images, the second pyramid feature images corresponding to the optical remote sensing labeling images and the third pyramid feature images corresponding to the optical remote sensing labeling images are respectively spliced and calculated according to a first formula to obtain fusion feature images corresponding to the optical remote sensing labeling images, wherein the first formula is as follows:
Pfusion=Cat{P3,P4,P5},
Wherein, P 5 is a first pyramid feature map, P 4 is a second pyramid feature map, P 3 is a third pyramid feature map, { Cat } is a splicing operation along the channel dimension, and P fusion is a fusion feature map;
And respectively calculating the edge map of each fusion feature map through a second formula to obtain the edge feature map corresponding to each optical remote sensing annotation image, wherein the second formula is as follows:
Se=δ(CRCAB(Pfusion)),
Wherein S e is an edge feature map, P fusion is a fusion feature map, δ is a sigmoid activation function, and C RCAB is a residual channel attention mechanism.
It should be understood that the last three layers of the FPN feature pyramid network output the feature maps P 3, P 4 and P 5 (i.e., the third, second and first pyramid feature maps); these are concatenated to obtain the multi-scale feature map output of the concatenation stage (i.e., the fusion feature map); the local receptive field of the multi-scale feature map is enlarged by a residual channel attention block, and the result is weighted by a sigmoid function to obtain the edge feature map S e.
It should be appreciated that the image multi-scale features (i.e., the fused feature map) are extracted by using the FPN feature pyramid network, and the local receptive field of the multi-scale feature map (i.e., the fused feature map) is enlarged by using the residual channel attention block, and the feature map with the larger local receptive field is weighted by using the sigmoid function, so as to obtain the edge feature map S e.
Specifically, the C 1、C2、C3、C4、C5 (i.e., the first initial feature map, the second initial feature map, the first feature map to be processed, the second feature map to be processed, the third feature map to be processed) in the VGG16 convolutional neural network are laterally connected by using a 1×1 convolution, feature maps of the same spatial dimension in the bottom-up path and the top-down path are combined by element-by-element addition, and the output features (P 3、P4、P5) (i.e., the third pyramid feature map, the second pyramid feature map, the first pyramid feature map) are finally obtained. In order to adapt to targets with different scales, each layer of pyramid feature map is generated by a top-down connection mode, and each layer of pyramid feature map is expressed as follows:
P 5=σ(C 5), P i=σ(C i)+τ(P i+1) (i=3, 4),
Where σ represents the cross-connect implemented by the learnable 1×1 convolutional layer, and τ represents bilinear interpolation upsampling with a scale factor of 2. This indicates that the resolutions of adjacent layers of the features { P 3,P4,P5 } (i.e., the third pyramid feature map, the second pyramid feature map, the first pyramid feature map) differ by a factor of 2. Finally, the feature output sizes are kept consistent to facilitate the final fusion operation, and the fused feature graph is represented as follows:
Pfusion=Cat{P3,P4,P5},
where { P 3,P4,P5 } represents the output features of the pyramid feature map of 256, 512 channels. { Cat } represents the stitching operation along the channel dimension, and P fusion is the final stitching result (i.e., the fused feature map).
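The splicing operation Cat can be illustrated with a small NumPy sketch; the channel counts (256/256/512) and spatial size used here are illustrative assumptions, not the network's actual dimensions:

```python
import numpy as np

# Illustrative shapes only: after upsampling to a common spatial size, the
# pyramid maps are concatenated along the channel axis (the Cat operation).
H, W = 32, 32
P3 = np.random.rand(256, H, W)
P4 = np.random.rand(256, H, W)
P5 = np.random.rand(512, H, W)

P_fusion = np.concatenate([P3, P4, P5], axis=0)  # Cat along the channel dim
print(P_fusion.shape)  # (1024, 32, 32)
```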
On this basis, a residual channel attention mechanism is introduced, and different attention is generated for each channel feature by utilizing the interdependence among feature channels. Then generating an edge map (namely the edge feature map) by a sigmoid activation function, wherein the specific formula is as follows:
Se=δ(CRCAB(Pfusion)),
Where S e is an edge map (i.e., the edge feature map), δ represents a sigmoid activation function, and C RCAB is a residual channel attention mechanism.
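A minimal sketch of the residual channel attention step followed by the sigmoid weighting. The weights are random stand-ins, and collapsing the channels by a simple mean substitutes for the model's learned output convolution; only the squeeze-excite-residual structure follows the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rcab(feat, reduction=4, rng=np.random.default_rng(0)):
    """Minimal residual channel attention block: per-channel weights from
    global average pooling and two 1x1 (fully connected) layers, followed
    by a residual connection. Weights are random stand-ins, not trained."""
    c = feat.shape[0]
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    gap = feat.mean(axis=(1, 2))                 # squeeze: channel descriptor
    att = sigmoid(w2 @ np.maximum(w1 @ gap, 0))  # excitation weights in (0,1)
    return feat + feat * att[:, None, None]      # residual re-weighting

P_fusion = np.random.default_rng(1).random((8, 16, 16))  # toy fused features
S_e = sigmoid(rcab(P_fusion).mean(axis=0))  # collapse channels to an edge map
print(S_e.shape)  # (16, 16)
```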
It will be appreciated that the output of the last layer of each stage (i.e. the feature map obtained when the image is input to the network) is selected as the corresponding layer in the top-down path, convolved by 1×1, and then summed pixel by pixel. In general, FPN can also be understood as a model.
It should be understood that P5 (i.e. the first pyramid feature map) is obtained by first using a transverse connection from C5 (i.e. the third feature map to be processed), then P5 (i.e. the first pyramid feature map) and C4 (i.e. the second feature map to be processed) are combined to obtain P4 (i.e. the second pyramid feature map), then P4 (i.e. the second pyramid feature map) and C3 (i.e. the first feature map to be processed) are combined to obtain P3 (i.e. the third pyramid feature map), and P3 (i.e. the third pyramid feature map) and C2 (i.e. the second initial feature map) are combined to obtain P2.
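The top-down combination described above can be sketched as follows. The lateral 1×1 convolutions use random stand-in weights, and nearest-neighbour upsampling substitutes for the bilinear τ; the spatial sizes are assumptions:

```python
import numpy as np

def lateral_1x1(c_map, out_ch, rng):
    """1x1 convolution as a channel-mixing matrix multiply (random weights)."""
    w = rng.standard_normal((out_ch, c_map.shape[0])) * 0.1
    return np.tensordot(w, c_map, axes=1)  # (out_ch, H, W)

def upsample2(p):
    """Nearest-neighbour stand-in for the bilinear x2 upsampling (tau)."""
    return p.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
C5 = rng.random((512, 8, 8))    # toy backbone features
C4 = rng.random((512, 16, 16))
C3 = rng.random((256, 32, 32))

P5 = lateral_1x1(C5, 256, rng)                  # P5 from C5 alone
P4 = lateral_1x1(C4, 256, rng) + upsample2(P5)  # combine P5 and C4
P3 = lateral_1x1(C3, 256, rng) + upsample2(P4)  # combine P4 and C3
print(P3.shape, P4.shape, P5.shape)
```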
In the above embodiment, the edge feature images are obtained by analyzing the edge feature images of the first feature image to be processed, the second feature image to be processed and the third feature image to be processed, so that the distinction between the target area and the background area is realized, the multi-scale information and the edge information of the picture are integrated into the learning process of the saliency map at the same time, and the saliency map can be predicted more accurately.
Optionally, as an embodiment of the present invention, the process of step S4 includes:
Respectively splicing the feature images of each target positioning feature image and the edge feature images corresponding to each optical remote sensing labeling image to obtain spliced feature images corresponding to each optical remote sensing labeling image;
and respectively predicting each spliced characteristic image based on the 1 multiplied by 1 convolution layer to obtain a salient image corresponding to each optical remote sensing labeling image.
It should be appreciated that the object localization feature map Y o and the edge feature map S e are combined and image saliency prediction is performed by one 1 x 1 convolution layer, resulting in a saliency map S y.
It should be appreciated that the object localization feature map Y o and the edge feature map S e are stitched and input to a1 x1 convolution layer to generate a saliency map.
In the above embodiment, each target positioning feature image and the corresponding edge feature image are spliced to obtain the spliced feature images, each spliced feature image is predicted by the 1×1 convolution layer to obtain the salient image, and the multi-scale information and the edge information of the image are simultaneously integrated into the learning process of the salient image, so that the salient image can be predicted more accurately.
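A hedged sketch of this prediction step: the channel count of Y o and the random 1×1 weights are assumptions; only the splice-then-1×1-convolve structure follows the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
Y_o = rng.random((512, 32, 32))  # target localization features (assumed 512 ch)
S_e = rng.random((1, 32, 32))    # edge feature map (single channel assumed)
stacked = np.concatenate([Y_o, S_e], axis=0)  # splice along channels

# A 1x1 convolution is a per-pixel linear map over channels; random weights.
w = rng.standard_normal((1, stacked.shape[0])) * 0.05
S_y = sigmoid(np.tensordot(w, stacked, axes=1)[0])  # saliency map in (0, 1)
print(S_y.shape)  # (32, 32)
```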
Optionally, as an embodiment of the present invention, the process of step S5 includes:
Calculating the loss values of the optical remote sensing annotation images and the saliency maps corresponding to the optical remote sensing annotation images respectively through a third formula to obtain the loss values corresponding to the optical remote sensing annotation images, wherein the third formula is as follows:
L(θ)=SlogSy+(1-S)log(1-Sy),
Wherein L (theta) is a loss value, S is an optical remote sensing label image, and S y is a saliency map;
And (3) carrying out parameter updating on the training model according to all the loss values, returning to the step (S1) until the preset iteration times are reached, and taking the training model with the updated parameters as the model to be detected for the image detection.
It should be appreciated that, with the training sample set, the loss function is constructed by an automatic differentiation technique using an algorithm based on stochastic gradient descent and back propagation, as follows:
L(θ)=SlogSy+(1-S)log(1-Sy),
Wherein S represents graffiti annotation (i.e. the optical remote sensing annotation image), S y represents a final saliency map (i.e. the saliency map), the whole network is optimized by using a loss function L (θ) (i.e. the loss value), and the network parameter θ is updated to obtain the weight of the training sample set.
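The third formula can be evaluated on a toy annotation as below. Note that, as written in this document, the formula is the (positive) log-likelihood; the standard binary cross-entropy loss is its negative:

```python
import numpy as np

def scribble_loss(S, S_y, eps=1e-7):
    """Per-pixel quantity from the patent's third formula,
    L = S*log(S_y) + (1-S)*log(1-S_y), summed over pixels.
    (Standard binary cross-entropy is the negative of its mean.)"""
    S_y = np.clip(S_y, eps, 1 - eps)  # avoid log(0)
    return np.sum(S * np.log(S_y) + (1 - S) * np.log(1 - S_y))

S = np.array([[1.0, 0.0], [1.0, 0.0]])    # toy scribble annotation
S_y = np.array([[0.9, 0.1], [0.8, 0.2]])  # toy predicted saliency
loss = scribble_loss(S, S_y)
print(round(loss, 4))  # → -0.657
```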
In the embodiment, the optical remote sensing labeling image and the loss value of the salient map are respectively analyzed to obtain the to-be-detected image detection model, so that the accuracy of the training model is improved, the distinction between the target area and the background area is realized, and the salient map can be predicted more accurately.
Optionally, as another embodiment of the present invention, as shown in fig. 2, a pre-trained VGG16 network is used as a feature extractor, and feature extraction is performed on a picture (i.e. the optical remote sensing labeling image) to obtain the convolution features C 1、C2、C3、C4、C5 (i.e. the first initial feature map, the second initial feature map, the first feature map to be processed, the second feature map to be processed, and the third feature map to be processed). The top-level feature map output by the fifth layer C 5 of the VGG16 network (i.e. the third feature map to be processed) is multiplied pixel by pixel with the foreground region of the marked graffiti annotation Y i (i.e. the optical remote sensing labeling image) to obtain a foreground feature map Y f; an average pooling operation is performed on the foreground feature map Y f to obtain a single 512×1×1 filter, and a depth-wise convolution is performed between this filter and C 5 (i.e. the third feature map to be processed), so as to implement correlation filtering on the image features, i.e. feature mining by correlating the multi-scale convolution features of the image with the high-level semantic information. Through the FPN feature pyramid network, multi-scale image features can be extracted: C 1、C2、C3、C4、C5 in the VGG16 convolutional neural network (i.e. the first initial feature map, the second initial feature map, the first feature map to be processed, the second feature map to be processed and the third feature map to be processed) are transversely connected by using a 1×1 convolution, feature maps with the same spatial dimension of the bottom-up path and the top-down path are combined by element-by-element addition, and the output features (P 3、P4、P5) (i.e. the third pyramid feature map, the second pyramid feature map and the first pyramid feature map) are finally obtained.
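The correlation-filtering step (foreground masking, average pooling into a 512×1×1 filter, depth-wise convolution) reduces, for a 1×1 kernel, to scaling each channel by its filter weight. The 14×14 spatial size and the random toy mask below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C5 = rng.random((512, 14, 14))                      # toy top-level features
Y_i = (rng.random((14, 14)) > 0.7).astype(float)    # toy scribble foreground mask

Y_f = C5 * Y_i[None, :, :]       # pixel-wise masking keeps foreground responses
kernel = Y_f.mean(axis=(1, 2))   # average pooling -> one 512x1x1 filter

# Depth-wise convolution with a 1x1 kernel is per-channel scaling, which
# correlates every channel of C5 with the pooled foreground cue.
Y_o = C5 * kernel[:, None, None]
print(Y_o.shape)  # (512, 14, 14)
```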
In order to adapt to targets with different scales, each layer of pyramid feature map is generated by a top-down connection mode, and each layer of pyramid feature map is expressed as follows:
P 5=σ(C 5), P i=σ(C i)+τ(P i+1) (i=3, 4),
Where σ represents the cross-connect implemented by the learnable 1×1 convolutional layer, and τ represents bilinear interpolation upsampling with a scale factor of 2. This indicates that the resolutions of adjacent layers of the features { P 3,P4,P5 } differ by a factor of 2. Finally, the feature output sizes are kept consistent to facilitate the final fusion operation, and the fused feature graph is represented as follows:
Pfusion=Cat{P3,P4,P5},
Where { P 3,P4,P5 } represents the output features of the pyramid feature map of 256, 512 channels. { Cat } represents the splice operation along the channel dimension, and P fusion is the final splice result.
On this basis, a residual channel attention mechanism is introduced, and different attention is generated for each channel feature by utilizing the interdependence among feature channels. An edge map is then generated from the sigmoid activation function:
Se=δ(CRCAB(Pfusion))
Where S e is an edge graph, δ represents a sigmoid activation function, and C RCAB is a residual channel attention mechanism.
The object localization feature map Y o and the edge feature map S e are combined, and image saliency prediction is performed through one 1×1 convolution layer, so as to obtain a saliency map S y.
Alternatively, as another embodiment of the present invention, after the image detection model to be tested is tested according to a plurality of the optical remote sensing test images, four common evaluation indexes may be used to evaluate the test results, namely mean absolute error (Mean Absolute Error, MAE), F metric (F-Measure), E metric (E-Measure) and S metric (S-Measure).
Wherein (1) the mean absolute error (Mean Absolute Error, MAE) is used to calculate the difference between the predicted saliency map G (i, j) and the true value S (i, j), as follows:
MAE = (1/(W×H)) Σ i Σ j |G(i,j)−S(i,j)|,
where W and H represent the width and height of the image, and i and j represent the abscissa and ordinate of the pixel points in the image.
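A direct NumPy implementation of the MAE formula on a toy 2×2 pair:

```python
import numpy as np

def mae(G, S):
    """Mean absolute error between predicted saliency map G and truth S."""
    W, H = G.shape
    return np.abs(G - S).sum() / (W * H)

G = np.array([[0.8, 0.2], [0.6, 0.4]])  # toy prediction
S = np.array([[1.0, 0.0], [1.0, 0.0]])  # toy ground truth
print(round(mae(G, S), 4))  # (0.2+0.2+0.4+0.4)/4 → 0.3
```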
(2) The F metric (F-Measure), denoted by F β, is a weighted average of accuracy and recall; the greater the value of F β, the more effective the method. The formula is as follows:
F β = ((1+β 2)×Precision×Recall)/(β 2×Precision+Recall),
where β 2 =0.3, Precision denotes accuracy, and Recall denotes recall. Accuracy and recall are calculated by comparing the predicted saliency map with the true values, and F β is derived from their weighted average.
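A sketch of F β on already-binarized maps (the thresholding step that binarizes the saliency map is omitted here as an assumption):

```python
import numpy as np

def f_measure(pred_bin, gt, beta2=0.3):
    """Weighted combination of precision and recall with beta^2 = 0.3."""
    tp = np.logical_and(pred_bin, gt).sum()      # true positives
    precision = tp / max(pred_bin.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

pred = np.array([[1, 0], [1, 1]], dtype=bool)  # toy binarized prediction
gt = np.array([[1, 0], [1, 0]], dtype=bool)    # toy ground truth
print(round(f_measure(pred, gt), 4))  # → 0.7222
```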
(3) The E metric (E-Measure) reflects the correlation between the predicted saliency map and the true value after their global averages are subtracted, as follows:
E m = (1/(W×H)) Σ i Σ j φ S(i,j),
Where W and H represent the width and height of the image, and φ S(i,j) is the enhancement alignment matrix, which reflects the correlation of the predicted saliency map and the true value minus the overall average of the errors between them. The higher the correlation, the better the effect.
(4) The S metric (S-Measure) is used to measure the structural similarity between the predicted saliency map and the true value, as follows:
S m=α×S 0+(1−α)×S r,
Where α is typically set to 0.5, and S 0 and S r represent the target-perceived structural similarity and the regional structural similarity between the predicted saliency map and the truth, respectively. The greater the S m value, the greater the agreement of the experimental results with the true values.
For the MAE among these evaluation indexes, a smaller value indicates a smaller error between the prediction and the true value; there is no fixed threshold that determines whether a value is good or bad, but the value obtained by the present invention is smaller than that of other existing methods.
Alternatively, as another embodiment of the present invention, the present invention is compared to several salient object detection methods, including DSG, SMD, SSOD, DSS and PoolNet. Notably, DSG, SMD and SSOD are unsupervised/weakly supervised salient object detection methods, while DSS and PoolNet are fully supervised salient object detection methods. For fair comparison, the invention is implemented with an NVIDIA RTX A6000 GPU and the open source machine learning library PyTorch, and shows better performance gains than the other methods. Experimental results show that the method can more effectively capture multi-scale information and distinguish differences between foreground and background. Test results show that the method has advantages compared with other methods.
Alternatively, as another embodiment of the present invention, the present invention may be written in the python language on a Linux operating system platform.
Optionally, as another embodiment of the present invention, the present invention implements the distinction between the target area and the background area through the correlation filtering, and constructs the multi-scale edge detection network through the FPN, so that the multi-scale information and the edge information of the picture are integrated into the learning process of the saliency map at the same time, and the saliency map can be predicted more accurately through the present invention.
Fig. 3 is a block diagram of an optical remote sensing image detection device according to an embodiment of the present invention.
Alternatively, as another embodiment of the present invention, as shown in fig. 3, an optical remote sensing image detection apparatus includes:
The image dividing module is used for importing a plurality of optical remote sensing images and dividing the optical remote sensing images to obtain an optical remote sensing image training set and an optical remote sensing image testing set, wherein the optical remote sensing image training set comprises a plurality of optical remote sensing training images, and the optical remote sensing image testing set comprises a plurality of optical remote sensing testing images;
the image labeling module is used for labeling each optical remote sensing training image respectively to obtain optical remote sensing labeling images corresponding to each optical remote sensing training image;
The model training module is used for constructing a training model, and carrying out model training on the training model according to a plurality of optical remote sensing annotation images to obtain an image detection model to be tested;
the model test module is used for testing the image detection model to be tested according to a plurality of the optical remote sensing test images to obtain an image detection model;
the detection result obtaining module is used for importing an optical remote sensing image to be detected, and detecting the optical remote sensing image to be detected through the image detection model to obtain a detection result.
Alternatively, another embodiment of the present invention provides an optical remote sensing image detection apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, which when executed by the processor, implements the optical remote sensing image detection method as described above. The device may be a computer or the like.
Alternatively, another embodiment of the present invention provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the optical remote sensing image detection method as described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, an optical disk, or other various media capable of storing program codes.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (6)

1. The optical remote sensing image detection method is characterized by comprising the following steps of:
Importing a plurality of optical remote sensing images, and dividing the optical remote sensing images to obtain an optical remote sensing image training set and an optical remote sensing image testing set, wherein the optical remote sensing image training set comprises a plurality of optical remote sensing training images, and the optical remote sensing image testing set comprises a plurality of optical remote sensing testing images;
Labeling each optical remote sensing training image to obtain an optical remote sensing labeling image corresponding to each optical remote sensing training image;
constructing a training model, and carrying out model training on the training model according to a plurality of optical remote sensing annotation images to obtain an image detection model to be tested;
Testing the to-be-tested image detection model according to a plurality of the optical remote sensing test images to obtain an image detection model;
leading in an optical remote sensing image to be detected, and detecting the optical remote sensing image to be detected through the image detection model to obtain a detection result;
the process of constructing a training model, carrying out model training on the training model according to a plurality of optical remote sensing annotation images, and obtaining an image detection model to be tested comprises the following steps:
S1, constructing a training model, and respectively extracting the characteristics of each optical remote sensing annotation image according to the training model to obtain a first to-be-processed characteristic image corresponding to each optical remote sensing annotation image, a second to-be-processed characteristic image corresponding to each optical remote sensing annotation image and a third to-be-processed characteristic image corresponding to each optical remote sensing annotation image;
s2, analyzing target positioning feature images of the optical remote sensing annotation images and the third to-be-processed feature images corresponding to the optical remote sensing annotation images respectively to obtain target positioning feature images corresponding to the optical remote sensing annotation images;
S3, analyzing the edge feature images of each first feature image to be processed, the second feature images to be processed corresponding to each optical remote sensing labeling image and the third feature images to be processed corresponding to each optical remote sensing labeling image respectively to obtain the edge feature images corresponding to each optical remote sensing labeling image;
s4, respectively carrying out predictive analysis on each target positioning feature image and the edge feature image corresponding to each optical remote sensing labeling image to obtain a salient image corresponding to each optical remote sensing labeling image;
s5, analyzing loss values of the optical remote sensing annotation images and the salient images corresponding to the optical remote sensing annotation images respectively to obtain the image detection model to be tested;
The training model comprises a first feature block, a second feature block, a third feature block, a fourth feature block and a fifth feature block, and the process of the step S1 comprises the following steps:
Performing first feature extraction on each optical remote sensing annotation image through the first feature blocks to obtain first initial feature images corresponding to each optical remote sensing annotation image;
Performing second feature extraction on each first initial feature image through the second feature block to obtain second initial feature images corresponding to each optical remote sensing annotation image;
Performing third feature extraction on each second initial feature image through the third feature block to obtain third initial feature images corresponding to each optical remote sensing annotation image, and taking the third initial feature images as first feature images to be processed;
Performing fourth feature extraction on each first feature image to be processed through the fourth feature block to obtain second feature images to be processed corresponding to each optical remote sensing annotation image;
carrying out fifth feature extraction on each second feature image to be processed through the fifth feature block to obtain a third feature image to be processed corresponding to each optical remote sensing annotation image;
the process of the step S2 includes:
Respectively multiplying each optical remote sensing annotation image and a third feature image to be processed corresponding to each optical remote sensing annotation image pixel by pixel to obtain a foreground feature image corresponding to each optical remote sensing annotation image;
Respectively carrying out average pooling treatment on each foreground feature image to obtain a filter corresponding to each optical remote sensing annotation image;
Performing deep convolution processing on each third feature image to be processed and the filter corresponding to each optical remote sensing annotation image respectively to obtain a target positioning feature image corresponding to each optical remote sensing annotation image;
the process of the step S3 includes:
Based on the 1×1 convolution layer, respectively and transversely connecting all the third feature images to be processed to obtain first pyramid feature images corresponding to all the optical remote sensing annotation images;
performing first element-by-element addition on each first pyramid feature map and a second feature map to be processed corresponding to each optical remote sensing annotation image respectively to obtain second pyramid feature maps corresponding to each optical remote sensing annotation image;
Performing second element-by-element addition on each second pyramid feature map and the first feature map to be processed corresponding to each optical remote sensing annotation image respectively to obtain a third pyramid feature map corresponding to each optical remote sensing annotation image;
The first pyramid feature images, the second pyramid feature images corresponding to the optical remote sensing labeling images and the third pyramid feature images corresponding to the optical remote sensing labeling images are respectively spliced and calculated according to a first formula to obtain fusion feature images corresponding to the optical remote sensing labeling images, wherein the first formula is as follows:
P fusion=Cat{P 3,P 4,P 5},
Wherein P 5 is the first pyramid feature map, P 4 is the second pyramid feature map, P 3 is the third pyramid feature map, Cat is the splicing operation along the channel dimension, and P fusion is the fusion feature map;
And respectively calculating the edge map of each fusion feature map through a second formula to obtain the edge feature map corresponding to each optical remote sensing annotation image, wherein the second formula is as follows:
S e=δ(C RCAB(P fusion)),
Wherein S e is the edge feature map, P fusion is the fusion feature map, δ is the sigmoid activation function, and C RCAB is the residual channel attention mechanism.
2. The method according to claim 1, wherein the process of labeling each of the optical remote sensing training images to obtain an optical remote sensing labeling image corresponding to each of the optical remote sensing training images comprises:
And marking each optical remote sensing training image by using a Quick Selection tool to obtain an optical remote sensing marking image corresponding to each optical remote sensing training image.
3. The method according to claim 1, wherein the step S4 includes:
Respectively splicing the feature images of each target positioning feature image and the edge feature images corresponding to each optical remote sensing labeling image to obtain spliced feature images corresponding to each optical remote sensing labeling image;
And predicting each spliced feature image based on the 1×1 convolution layer to obtain a saliency map corresponding to each optical remote sensing annotation image.
4. The method according to claim 1, wherein the step S5 includes:
Calculating the loss values of the optical remote sensing annotation images and the saliency maps corresponding to the optical remote sensing annotation images respectively through a third formula to obtain the loss values corresponding to the optical remote sensing annotation images, wherein the third formula is as follows:
L(θ)=S log S y+(1−S)log(1−S y),
Wherein L(θ) is the loss value, S is the optical remote sensing labeling image, and S y is the saliency map;
And (3) carrying out parameter updating on the training model according to all the loss values, returning to the step (S1) until the preset iteration times are reached, and taking the training model with the updated parameters as the model to be detected for the image detection.
5. An optical remote sensing image detection device, comprising:
The image dividing module is used for importing a plurality of optical remote sensing images and dividing the optical remote sensing images to obtain an optical remote sensing image training set and an optical remote sensing image testing set, wherein the optical remote sensing image training set comprises a plurality of optical remote sensing training images, and the optical remote sensing image testing set comprises a plurality of optical remote sensing testing images;
the image labeling module is used for labeling each optical remote sensing training image respectively to obtain optical remote sensing labeling images corresponding to each optical remote sensing training image;
The model training module is used for constructing a training model, and carrying out model training on the training model according to a plurality of optical remote sensing annotation images to obtain an image detection model to be tested;
the model test module is used for testing the image detection model to be tested according to a plurality of the optical remote sensing test images to obtain an image detection model;
the detection result obtaining module is used for importing an optical remote sensing image to be detected, and detecting the optical remote sensing image to be detected through the image detection model to obtain a detection result;
The model training module is specifically used for:
S1, constructing a training model, and respectively extracting the characteristics of each optical remote sensing annotation image according to the training model to obtain a first to-be-processed characteristic image corresponding to each optical remote sensing annotation image, a second to-be-processed characteristic image corresponding to each optical remote sensing annotation image and a third to-be-processed characteristic image corresponding to each optical remote sensing annotation image;
s2, analyzing target positioning feature images of the optical remote sensing annotation images and the third to-be-processed feature images corresponding to the optical remote sensing annotation images respectively to obtain target positioning feature images corresponding to the optical remote sensing annotation images;
S3, analyzing the edge feature images of each first feature image to be processed, the second feature images to be processed corresponding to each optical remote sensing labeling image and the third feature images to be processed corresponding to each optical remote sensing labeling image respectively to obtain the edge feature images corresponding to each optical remote sensing labeling image;
s4, respectively carrying out predictive analysis on each target positioning feature image and the edge feature image corresponding to each optical remote sensing labeling image to obtain a salient image corresponding to each optical remote sensing labeling image;
s5, analyzing loss values of the optical remote sensing annotation images and the salient images corresponding to the optical remote sensing annotation images respectively to obtain the image detection model to be tested;
The training model comprises a first feature block, a second feature block, a third feature block, a fourth feature block and a fifth feature block, and the process of the step S1 comprises the following steps:
Performing first feature extraction on each optical remote sensing annotation image through the first feature blocks to obtain first initial feature images corresponding to each optical remote sensing annotation image;
Performing second feature extraction on each first initial feature image through the second feature block to obtain second initial feature images corresponding to each optical remote sensing annotation image;
Performing third feature extraction on each second initial feature image through the third feature block to obtain third initial feature images corresponding to each optical remote sensing annotation image, and taking the third initial feature images as first feature images to be processed;
Performing fourth feature extraction on each first feature image to be processed through the fourth feature block to obtain second feature images to be processed corresponding to each optical remote sensing annotation image;
Performing fifth feature extraction on each second feature image to be processed through the fifth feature block to obtain the third feature image to be processed corresponding to each optical remote sensing annotation image;
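The five-block extraction of step S1 can be sketched as a sequential backbone whose last three outputs are tapped as the first, second and third feature images to be processed. The block internals (1x1 channel mixing, ReLU, 2x2 average pooling) and the channel counts are illustrative assumptions, since the patent does not fix them:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_block(x, out_channels):
    """One hypothetical feature block: 1x1 channel-mixing convolution,
    ReLU, then 2x2 average pooling for downsampling."""
    c, h, w = x.shape
    weight = rng.standard_normal((out_channels, c)) * 0.1
    x = np.tensordot(weight, x, axes=([1], [0]))  # 1x1 conv -> (out_channels, h, w)
    x = np.maximum(x, 0.0)                        # ReLU
    return x.reshape(out_channels, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

image = rng.standard_normal((3, 64, 64))  # one annotated optical remote sensing image
feats, x = [], image
for c in [16, 32, 64, 128, 256]:          # first .. fifth feature blocks
    x = feature_block(x, c)
    feats.append(x)

# Outputs of blocks 3, 4 and 5 become the three feature images to be processed.
first_to_process, second_to_process, third_to_process = feats[2], feats[3], feats[4]
```

Each block halves the spatial resolution, so the three tapped maps form a coarse-to-fine hierarchy reused by steps S2 and S3.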
the process of the step S2 includes:
Respectively multiplying each optical remote sensing annotation image and a third feature image to be processed corresponding to each optical remote sensing annotation image pixel by pixel to obtain a foreground feature image corresponding to each optical remote sensing annotation image;
Respectively carrying out average pooling treatment on each foreground feature image to obtain a filter corresponding to each optical remote sensing annotation image;
Performing depthwise convolution processing on each third feature image to be processed with the filter corresponding to each optical remote sensing annotation image, respectively, to obtain the target positioning feature image corresponding to each optical remote sensing annotation image;
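Step S2 can be sketched as follows, assuming the average pooling is a global pooling that yields a 1x1 per-channel filter, so the depthwise convolution reduces to a per-channel scaling; the mask layout and sizes are illustrative assumptions:

```python
import numpy as np

def target_localization(feat, annotation_mask):
    """Sketch of S2: (1) pixel-wise multiply the feature map with the
    annotation mask to keep foreground responses, (2) average-pool the
    foreground map into a 1x1 per-channel filter, (3) apply that filter
    as a depthwise convolution over the original feature map."""
    foreground = feat * annotation_mask[None, :, :]  # (C, H, W) foreground feature map
    filt = foreground.mean(axis=(1, 2))              # global average pooling -> (C,)
    return feat * filt[:, None, None]                # 1x1 depthwise conv = channel scaling

feat = np.ones((4, 8, 8))                            # third feature image to be processed
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                                 # hypothetical foreground region
located = target_localization(feat, mask)            # target positioning feature image
```

With a larger pooled kernel the last step would be a true k x k depthwise convolution applied channel by channel; the 1x1 case above keeps the sketch minimal.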
the process of the step S3 includes:
Laterally connecting each third feature image to be processed through a convolution layer to obtain the first pyramid feature map corresponding to each optical remote sensing annotation image;
Performing first element-by-element addition on each first pyramid feature map and the second feature image to be processed corresponding to each optical remote sensing annotation image, respectively, to obtain the second pyramid feature map corresponding to each optical remote sensing annotation image;
Performing second element-by-element addition on each second pyramid feature map and the first feature map to be processed corresponding to each optical remote sensing annotation image respectively to obtain a third pyramid feature map corresponding to each optical remote sensing annotation image;
Splicing each first pyramid feature map, the second pyramid feature map corresponding to each optical remote sensing annotation image and the third pyramid feature map corresponding to each optical remote sensing annotation image, and calculating according to a first formula, to obtain the fusion feature map corresponding to each optical remote sensing annotation image, wherein the first formula is:

F_fuse = Concat(F_p1, F_p2, F_p3)

wherein F_p1 is the first pyramid feature map, F_p2 is the second pyramid feature map, F_p3 is the third pyramid feature map, Concat(·) is the splicing operation along the channel dimension, and F_fuse is the fusion feature map;
Calculating the edge map of each fusion feature map through a second formula to obtain the edge feature map corresponding to each optical remote sensing annotation image, wherein the second formula is:

F_edge = σ(RCA(F_fuse))

wherein F_edge is the edge feature map, F_fuse is the fusion feature map, σ is the activation function, and RCA(·) is the residual channel attention mechanism.
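Step S3 can be sketched end to end as below. The nearest-neighbour upsampling used to align pyramid resolutions, the channel counts, and the internals of the residual channel attention block are illustrative assumptions; the patent only names the operations, not their exact form:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling so pyramid levels can be combined."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_channel_attention(x):
    """Toy residual channel attention: per-channel gates from global
    average pooling, applied with a residual connection."""
    gates = sigmoid(x.mean(axis=(1, 2)))             # (C,)
    return x + x * gates[:, None, None]

rng = np.random.default_rng(1)
p1 = rng.standard_normal((8, 4, 4))                  # first pyramid feature map
p2 = upsample2(p1) + rng.standard_normal((8, 8, 8))  # first element-wise addition
p3 = upsample2(p2) + rng.standard_normal((8, 16, 16))  # second element-wise addition

# First formula: splice the three pyramid maps along the channel dimension
# (upsampled here to a common resolution first).
fused = np.concatenate([upsample2(upsample2(p1)), upsample2(p2), p3], axis=0)

# Second formula: edge feature map via the attention block and activation.
edge = sigmoid(residual_channel_attention(fused))
```

The resulting edge feature map has values in (0, 1) and is what step S4 combines with the target positioning feature image.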
6. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the optical remote sensing image detection method according to any one of claims 1 to 4.
CN202210314026.8A 2022-03-28 2022-03-28 Optical remote sensing image detection method, device and storage medium Active CN114724046B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210314026.8A CN114724046B (en) 2022-03-28 2022-03-28 Optical remote sensing image detection method, device and storage medium

Publications (2)

Publication Number Publication Date
CN114724046A CN114724046A (en) 2022-07-08
CN114724046B true CN114724046B (en) 2025-01-10

Family

ID=82239050

Country Status (1)

Country Link
CN (1) CN114724046B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116206187A (en) * 2023-03-14 2023-06-02 福建福清核电有限公司 Weak supervision target detection method for remote sensing image
CN118799428A (en) * 2024-07-02 2024-10-18 北京市遥感信息研究所 A method and device for generating cloud and fog occlusion images

Citations (2)

Publication number Priority date Publication date Assignee Title
CN110136154A (en) * 2019-05-16 2019-08-16 西安电子科技大学 Semantic Segmentation Method of Remote Sensing Image Based on Fully Convolutional Network and Morphological Processing
CN110728192A (en) * 2019-09-16 2020-01-24 河海大学 High-resolution remote sensing image classification method based on novel characteristic pyramid depth network

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN112347859B (en) * 2020-10-15 2024-05-24 北京交通大学 A method for salient object detection in optical remote sensing images

Also Published As

Publication number Publication date
CN114724046A (en) 2022-07-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant