Disclosure of Invention
In view of the above, the invention provides a multi-layer feature decoupling optical remote sensing image building extraction method, which constructs powerful semantic feature expression capability through multi-layer feature decoupling, improves the edge accuracy of buildings while guaranteeing the integrity of the building main body, and thereby improves building extraction performance.
The invention relates to a multi-layer feature decoupling optical remote sensing image building extraction method, which comprises the following steps:
Step one, carrying out multi-scale feature extraction on an optical remote sensing image to obtain multi-layer feature maps with different scales;
Step two, carrying out feature decomposition on the feature maps obtained in step one, namely, taking the adjacent shallow feature map as a reference, calculating the offset of each feature in the adjacent deep feature map by using a feature flow field; correcting the deep feature map based on the offsets to obtain more stable semantic main body features in a strong semantic region representing the main body part of the building, and then obtaining uncertain semantic boundary features in a weak semantic region representing the edge part of the building from the deep feature map by a subtraction operation;
step three, respectively fusing the more stable semantic main body features and the uncertain semantic boundary features obtained in the step two layer by layer to obtain a plurality of pixel-level prediction graphs, wherein the strong semantic regions are fused from deep to shallow, and the weak semantic regions are fused from shallow to deep;
Step four, respectively performing supervised learning on the plurality of pixel-level prediction graphs obtained in step three based on a multi-task joint loss function by using a multi-task supervision method, completing the building extraction.
Preferably, in the first step, an AlexNet, VGGNet, ResNet, ResNeXt or DenseNet network is used to perform feature extraction on the optical remote sensing image.
Preferably, the second step specifically includes the following sub-steps:
S2.1, feature preprocessing, namely, denoting F and F′ as the relatively shallow and relatively deep features of two adjacent feature layers, and changing the deep feature F′ to the same size as the shallow feature F, the result being denoted F″;
S2.2, generating a feature flow field, namely cascading the deep feature F″ obtained in step S2.1 with the shallow feature F and convolving the result to obtain a flow field δ;
S2.3, generating more stable semantic main body features in the strong semantic region, namely obtaining the offset corresponding to each feature in the deep feature map F″ according to the flow field δ, and correcting the deep feature map F″ based on the offsets to obtain the more stable semantic main body features F′MainBody in the strong semantic region;
S2.4, generating uncertain semantic boundary features in the weak semantic region, namely obtaining the uncertain semantic boundary features F′UncertainBoundary from the deep feature map F″ by a subtraction operation;
S2.5, sequentially combining the multi-layer feature maps of step one, repeating steps S2.1-S2.4 to generate N-1 groups of more stable semantic bodies and uncertain semantic boundaries, where N is the total number of feature-map layers obtained in step one.
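The layer-by-layer pairing of S2.5 can be sketched as follows. This is an illustrative NumPy sketch only: the placeholder `decompose()` stands in for the learned flow-field correction of steps S2.1-S2.4, using nearest-neighbour resizing in place of the 1×1 convolution, upsampling and warping.

```python
import numpy as np

def decompose(shallow, deep):
    """Placeholder for sub-steps S2.1-S2.4: returns one (body, boundary)
    pair for a pairing of adjacent layers. Nearest-neighbour resizing of
    the deep map stands in for the learned flow-field correction."""
    factor = shallow.shape[-1] // deep.shape[-1]
    resized = deep.repeat(factor, axis=-2).repeat(factor, axis=-1)  # F''
    body = resized                     # stand-in for the warped F'MainBody
    boundary = resized - body          # F'UncertainBoundary = F'' - body
    return body, boundary

# Four feature maps F1..F4 of shape (channels, H, W), each scale halving the last
feats = [np.random.rand(8, 32 // 2**i, 32 // 2**i) for i in range(4)]

# S2.5: combine adjacent layers in turn -> N-1 = 3 (body, boundary) pairs
pairs = [decompose(feats[i], feats[i + 1]) for i in range(len(feats) - 1)]
print(len(pairs))  # 3 groups, matching N-1 with N = 4
```

Each pair lives at the spatial resolution of its shallower member, which is what allows the later layer-by-layer fusion.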
Preferably, in S2.1, the deep features are changed to the same size as the shallow features using a 1×1 convolution and upsampling operation.
Preferably, a feature flow warping operation is used to correct the deep feature map.
Preferably, in the third step, when the features of the adjacent feature layers are fused, the attention mechanism is used to give weight to the features of the adjacent feature layers.
Preferably, the Sigmoid function is used as a gate to weight the features of the adjacent feature layers.
Preferably, in the third step, the fusion is performed in the following manner:
X̂=Gx⊙X
Ŷ=Gy⊙Y
Z=cat(X̂,Ŷ)
where X̂ and Ŷ respectively represent the feature layers after the attention mechanism is applied, Z represents the feature fusion result, ⊙ denotes element-wise multiplication, cat(·) denotes the cascading operation, and Gx and Gy respectively represent the attention coefficients obtained from the selection gates:
Gx=Sigmoid(conv1×1(X))
Gy=Sigmoid(conv1×1(Y))
Where Sigmoid(·) represents the Sigmoid operation and conv1×1(·) represents a 1×1 convolution.
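A minimal NumPy sketch of this selection-gate fusion, assuming the plausible form Z = cat(Gx⊙X, Gy⊙Y) and replacing the learned 1×1 convolutions with fixed random channel-mixing matrices (illustrative values only):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def conv1x1(x, w):
    # A 1x1 convolution is per-pixel channel mixing: (C_out, C_in) applied to (C_in, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
X, Y = rng.standard_normal((C, H, W)), rng.standard_normal((C, H, W))
Wx, Wy = rng.standard_normal((C, C)) * 0.1, rng.standard_normal((C, C)) * 0.1

Gx = sigmoid(conv1x1(X, Wx))   # attention coefficients from the selection gates
Gy = sigmoid(conv1x1(Y, Wy))
Z = np.concatenate([Gx * X, Gy * Y], axis=0)   # fused result, channels doubled
print(Z.shape)  # (8, 8, 8)
```

Concatenation doubling the channel count is consistent with the cascading fusion described in the detailed description.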
Preferably, in the fourth step, each subtask is supervised and learned by using a cross entropy loss function.
The beneficial effects are that:
(1) The method first extracts multi-scale features of buildings using a multi-layer feature decoupling network and decomposes them to obtain more stable semantic main body features and uncertain semantic boundary features; it then fuses them step by step with a double-flow semantic feature description network in different manners according to their differences, deepening the semantic representation of deep features in the strong semantic region while retaining more detail information in the weak semantic region; finally, a multi-task supervision method improves the edge accuracy of buildings while guaranteeing the integrity of the building main body, realizing high-performance extraction of buildings from high-resolution optical remote sensing images. Compared with the prior art, the method significantly improves the extraction of buildings from high-resolution optical remote sensing images. Especially when facing buildings of different scales and different spatial distributions in complex environments, the method ensures both the integrity of the building main body and the accuracy of the building edges, reducing the occurrence of over-extraction and under-extraction. Based on an encoding-decoding framework, the method improves semantic description capability through multi-layer feature decoupling and the double-flow semantic feature description network, greatly improves building extraction performance, and has good practical application value.
(2) According to the invention, the offsets of the deep feature map features are obtained based on the feature flow field, and the deep feature map is then corrected by the feature flow warping operation; the deep features can thus be adaptively adjusted and aligned, improving feature positioning capability and yielding more stable semantic main body features.
(3) When fusing the features of adjacent feature layers, their complementarity is considered: an attention mechanism is used to select features of the adjacent layers and guide the fusion of complementary information, which markedly reduces the fusion of invalid feature information and makes the fusion more efficient and reasonable.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides a multi-layer characteristic decoupling optical remote sensing image building extraction method. The flow chart of the method is shown in fig. 1, and specifically comprises the following steps:
Step one, multi-scale feature extraction is carried out on the high-resolution optical remote sensing image, and multi-layer feature diagrams with different scales are obtained.
This step can adopt AlexNet, VGGNet, ResNet, ResNeXt, DenseNet or other feature extraction networks to realize multi-scale feature extraction.
In this embodiment, a ResNet backbone network is used to perform feature extraction on the input high-resolution optical remote sensing image, as shown in fig. 1 (a). Specifically, the original optical image is input into the ResNet feature extraction network, deep feature information is obtained through multiple convolution and pooling operations, and the resulting 4 layers of feature maps are denoted F1, F2, F3, F4 from shallow to deep.
Step two, decomposing the feature maps obtained in step one to obtain a more stable semantic main body representing the main body part of the building and an uncertain semantic boundary in the weak semantic region representing the edge part of the building.
The continuous convolution operations of step one, while stabilizing deep semantic features, can lose building detail information and introduce feature misalignment. Therefore, a feature flow field is introduced and, taking the adjacent shallower features as a reference, the deep feature map is corrected using the offset that the flow field assigns to each of its features, aligning the deep features and yielding a more stable semantic body in the strong semantic region.
As shown in fig. 1 (b), the method can be specifically divided into the following 5 sub-steps:
S2.1, feature preprocessing, namely, as shown in FIG. 2, denoting F and F′ as the relatively shallow and relatively deep features of two adjacent feature layers. F′ is a deep feature with a greater number of channels and a smaller spatial size. To decompose F′ using the shallow feature F as a reference, the deep feature is first changed to the same size as the shallow feature F, denoted F″, using a 1×1 convolution and an upsampling operation, as shown in the following equation:
F′′=Up(conv1×1(F′))
Where Up(·) represents the upsampling operation and conv1×1(·) represents a 1×1 convolution.
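A minimal NumPy illustration of this preprocessing step, with the learned 1×1 convolution replaced by a fixed random channel-mixing matrix and nearest-neighbour upsampling standing in for Up(·) (bilinear upsampling would also fit the text):

```python
import numpy as np

def conv1x1(x, w):
    # per-pixel channel mixing: w is (C_out, C_in), x is (C_in, H, W)
    return np.tensordot(w, x, axes=([1], [0]))

def upsample2x(x):
    # nearest-neighbour 2x upsampling along both spatial axes
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

rng = np.random.default_rng(1)
F_deep = rng.standard_normal((16, 8, 8))   # F': more channels, smaller size
W = rng.standard_normal((8, 16)) * 0.1     # compress 16 -> 8 channels
F_pp = upsample2x(conv1x1(F_deep, W))      # F'' = Up(conv1x1(F'))
print(F_pp.shape)  # (8, 16, 16), matching the shallow feature F
```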
S2.2, generating the feature flow field, which is introduced so that the network automatically learns feature misalignment information. First, the deep feature F″ obtained in sub-step S2.1 is cascaded with the shallow feature F, and a 3×3 convolution is then used to obtain the flow field δ. The flow field δ has two channels, representing the offset direction of each feature point in the flow field.
δ=conv3×3(cat(F,F′′))
Where cat(·) represents the cascading operation and conv3×3(·) represents a 3×3 convolution.
S2.3, generating more stable semantic main body features in the strong semantic region, namely obtaining the offset corresponding to each feature in the relatively deep feature map F″ according to the flow field δ, and correcting the deep feature map F″ by means of the feature flow warping operation to obtain the more stable semantic main body features F′MainBody in the strong semantic region, as in the following formula:
F′MainBody(ρ)=ψ(F″,δ)(ρ)=Σp∈N(ρ′)ωp·F″(p), ρ′=ρ+δ(ρ)
Where ψ(·) represents the feature flow warping operation, ρ is a feature point in the deep feature map F″, ρ′ is the warped feature point obtained by offsetting ρ according to the flow field, N(ρ′) represents the feature points around the warped feature point ρ′, and ωp is the interpolation weight corresponding to each such point p.
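The warping operation ψ can be illustrated with a plain NumPy bilinear sampler. This is a sketch under the assumption that the weights ω are the bilinear interpolation coefficients over the 4-neighbourhood of the warped position, not the patent's exact implementation:

```python
import numpy as np

def flow_warp(feat, flow):
    """Warp feat (C, H, W) by flow (2, H, W): each output position rho samples
    feat at rho + flow(rho), bilinearly interpolating over its 4 neighbours;
    the bilinear coefficients play the role of the weights omega."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(ys + flow[0], 0, H - 1)        # warped sampling positions
    sx = np.clip(xs + flow[1], 0, W - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = sy - y0, sx - x0                   # bilinear weights
    return (feat[:, y0, x0] * (1 - wy) * (1 - wx)
          + feat[:, y0, x1] * (1 - wy) * wx
          + feat[:, y1, x0] * wy * (1 - wx)
          + feat[:, y1, x1] * wy * wx)

F_pp = np.arange(16, dtype=float).reshape(1, 4, 4)   # toy F''
delta = np.zeros((2, 4, 4))                          # zero flow: identity warp
body = flow_warp(F_pp, delta)                        # F'MainBody
boundary = F_pp - body                               # F'UncertainBoundary (S2.4)
print(np.allclose(boundary, 0))  # True: zero flow leaves no boundary residual
```

With a non-zero flow field the sampler pulls each deep feature toward its reference position, and the subtraction of S2.4 then isolates exactly the residual that the warp could not explain.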
S2.4, generating the uncertain semantic boundary features in the weak semantic region, where the uncertain semantic boundary features F′UncertainBoundary can be obtained from the deep feature map F″ by a subtraction operation:
F′UncertainBoundary=F″−F′MainBody
S2.5, combining the adjacent feature maps of the 4 layers of feature maps obtained in step one two by two, and repeating sub-steps S2.1-S2.4 to generate three groups of more stable semantic bodies and uncertain semantic boundaries.
Step three, based on the component fusion module and using the double-flow semantic feature description network structure, respectively fuse the three groups of more stable semantic bodies and uncertain semantic boundaries obtained in step two layer by layer. Features belonging to the strong semantic region and to the weak semantic region are routed into two parallel branches: feature fusion proceeds from deep to shallow in the strong semantic region, deepening the semantic representation of the deep features, and from shallow to deep in the weak semantic region, retaining more detail information.
Specifically, upsampling, channel compression and other operations are first used so that feature layers of different depths have the same number of channels and the same spatial size. The processed adjacent feature layers are then fused together through a cascading operation, with the strong semantic region fused from deep to shallow and the weak semantic region fused from shallow to deep.
Furthermore, considering the complementarity of adjacent feature layers, the invention introduces an attention mechanism into the fusion process to select and guide the fusion of complementary information, which markedly reduces the fusion of invalid feature information and makes the fusion more efficient and reasonable.
As shown in fig. 3, the present embodiment is specifically divided into the following 6 substeps:
S3.1, before layer-by-layer feature fusion, upsampling, downsampling, channel compression and other operations are used to adjust the feature maps of different channel numbers and spatial sizes obtained in step two, so that they finally have the same number of channels and the same spatial size.
S3.2, the designed component feature fusion module is utilized to efficiently fuse the features of the strong semantic region and the weak semantic region. As shown in fig. 4, X and Y are the adjacent feature layers adjusted in sub-step S3.1 and serve as inputs of the component feature fusion module. Considering the complementarity of adjacent feature layers, the mutual fusion of complementary information is selected and guided by an attention mechanism using the Sigmoid function as a gate, a process that can be represented by the following formulas:
X̂=Gx⊙X
Ŷ=Gy⊙Y
Z=cat(X̂,Ŷ)
where X̂ and Ŷ represent the optimized feature layers, Z represents the output of the component feature fusion module, ⊙ denotes element-wise multiplication, cat(·) denotes the cascading operation, and Gx and Gy represent the attention coefficients from the selection gates, which can be obtained by the following equations:
Gx=Sigmoid(conv1×1(X))
Gy=Sigmoid(conv1×1(Y))
Where Sigmoid(·) represents the Sigmoid operation and conv1×1(·) represents a 1×1 convolution.
S3.3, repeating sub-step S3.2 to fuse the features of each layer obtained in step two in turn, yielding the fused features of the strong semantic region and of the weak semantic region, where fusion of the more stable semantic main body features in the strong semantic region follows a top-down order and fusion of the uncertain semantic boundary features in the weak semantic region proceeds bottom-up. The spatial size of the fused features is 1/4 that of the input feature map, and the number of channels is twice that of the input feature map.
S3.4, applying parallel multi-rate dilated (atrous) convolutions to the fused strong-semantic-region and weak-semantic-region features obtained in sub-step S3.3, with dilation rates of 1, 2 and 5 respectively. The fused features are then refined using a 1×1 convolution, resulting in a more stable semantic body and an uncertain semantic boundary.
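The multi-rate dilated convolution of S3.4 can be illustrated single-channel in NumPy; the averaging kernel and the input below are toy values used only to show how the dilation rate spaces the kernel taps:

```python
import numpy as np

def dilated_conv3x3(x, kernel, rate):
    """Naive single-channel 3x3 dilated ('atrous') convolution: the kernel
    taps are spaced `rate` pixels apart, enlarging the receptive field at
    no extra parameter cost. Zero padding keeps the output size equal."""
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * xp[i * rate:i * rate + H, j * rate:j * rate + W]
    return out

x = np.random.default_rng(2).standard_normal((16, 16))
k = np.full((3, 3), 1.0 / 9.0)          # toy averaging kernel
# three parallel branches, as in sub-step S3.4
branches = [dilated_conv3x3(x, k, r) for r in (1, 2, 5)]
print([b.shape for b in branches])  # all (16, 16)
```

The three branches share the same output size, so their responses can be combined and refined by the subsequent 1×1 convolution.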
S3.5, adding the more stable semantic body obtained in sub-step S3.4 and the uncertain semantic boundary through a point-by-point addition operation to obtain the complete building features.
And S3.6, respectively obtaining three groups of pixel-level prediction graphs by using the prediction structure module shown in FIG. 3. The structure can be represented by the following formula:
P=Up(ReLU(BN(conv1×1(ReLU(BN(conv3×3(F)))))))
Wherein F represents the final feature, P represents the prediction graph obtained through the prediction structure module, Up(·) represents the upsampling operation, ReLU(·) represents the ReLU activation function, conv1×1(·) represents a 1×1 convolution, conv3×3(·) represents a 3×3 convolution, and BN(·) represents the batch normalization operation.
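A shape-level NumPy sketch of the prediction structure module above. For brevity the 3×3 convolution is approximated by a 1×1 channel mixing, and BN(·) by inference-style per-channel standardisation; both substitutions are assumptions of this sketch, not the patent's implementation:

```python
import numpy as np

def bn(x):
    # batch-norm stand-in: per-channel standardisation (inference-style)
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True) + 1e-5
    return (x - mu) / sd

relu = lambda t: np.maximum(t, 0.0)
up2 = lambda t: t.repeat(2, axis=-2).repeat(2, axis=-1)   # Up(.)

def conv(x, w):
    # 1x1 channel mixing; a real 3x3 conv would add spatial taps
    return np.tensordot(w, x, axes=([1], [0]))

rng = np.random.default_rng(3)
F = rng.standard_normal((8, 8, 8))       # final feature
W3 = rng.standard_normal((8, 8)) * 0.1   # stand-in for conv3x3
W1 = rng.standard_normal((1, 8)) * 0.1   # conv1x1 down to one prediction channel
# P = Up(ReLU(BN(conv1x1(ReLU(BN(conv3x3(F)))))))
P = up2(relu(bn(conv(relu(bn(conv(F, W3))), W1))))
print(P.shape)  # (1, 16, 16)
```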
Step four, using the multi-task supervision method, respectively supervise and optimize the more stable semantic body, the uncertain semantic boundary and the three groups of building prediction graphs generated in step three based on the cross entropy loss function, as shown in fig. 1 (d). The supervision of the more stable semantic body and the uncertain semantic boundary is auxiliary supervision, which helps generate a more complete building main body while improving the accuracy of the building boundary.
Step four specifically comprises the following sub-steps:
S4.1, generating the building main body and edge label maps corresponding to the more stable semantic body and the uncertain semantic boundary of step three by using morphological image operations such as image erosion.
S4.2, performing supervised learning on each subtask by using a cross entropy loss function, the formula being:
L=−(1/N)Σi=1N[yi·log(pi)+(1−yi)·log(1−pi)]
Wherein L stands for each of LS, LB and LE, which represent the complete building segmentation loss, the more stable semantic body loss in the strong semantic region and the uncertain semantic boundary loss in the weak semantic region, respectively; N represents the number of pixels in the picture; yi∈{0,1} indicates whether pixel i belongs to a building, i being a pixel index; and pi∈[0,1] is the prediction probability for pixel i.
S4.3, optimizing the network with the multi-task joint loss function to improve building extraction performance, which can be represented by the following formula:
Ltotal=λ1·LS+λ2·LB+λ3·LE
where Ltotal is the total multi-task loss, and λ1, λ2 and λ3 are the loss weights corresponding to each task, set to 1,20 in the method.
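The cross entropy loss of S4.2 and the joint loss of S4.3 can be sketched together in NumPy; the predictions and the unit loss weights below are illustrative values only, not the method's trained outputs or its actual weight settings:

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Per-pixel binary cross entropy: -(1/N) * sum(y*log p + (1-y)*log(1-p))."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1], dtype=float)    # building / non-building labels
p_seg  = np.array([0.9, 0.1, 0.8, 0.7])    # complete-segmentation prediction
p_body = np.array([0.95, 0.05, 0.9, 0.8])  # semantic-body prediction
p_edge = np.array([0.6, 0.3, 0.5, 0.6])    # boundary prediction

L_S, L_B, L_E = bce(y, p_seg), bce(y, p_body), bce(y, p_edge)
lam1, lam2, lam3 = 1.0, 1.0, 1.0           # illustrative weights only
L_total = lam1 * L_S + lam2 * L_B + lam3 * L_E   # Ltotal = λ1·LS + λ2·LB + λ3·LE
print(L_total > 0)  # True
```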
And finally obtaining the building extraction result with high accuracy and low false alarm rate through the first step to the fourth step.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.