CN115731461B

CN115731461B - A method for extracting buildings from optical remote sensing images with multi-layer feature decoupling

Info

Publication number: CN115731461B
Application number: CN202211377963.4A
Authority: CN
Inventors: 庄胤; 李健昊; 董珊; 陈禾; 陈亮
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2025-11-14
Anticipated expiration: 2042-11-04
Also published as: CN115731461A

Abstract

This invention discloses a multi-layer feature decoupling method for extracting buildings from optical remote sensing images. First, a multi-layer feature decoupling network is used to extract and decompose multi-scale features of buildings, obtaining more stable semantic subject features and uncertain semantic boundary features. Then, based on the differences between the semantic subject features and uncertain semantic boundary features, a dual-stream semantic feature description network is used to gradually fuse them in different ways, deepening the semantic representation in deep features in strong semantic regions and preserving more detailed information in weak semantic regions. Finally, a multi-task supervision method is used to improve the accuracy of building edges while ensuring the integrity of the main body of the building, achieving high-performance extraction of buildings from high-resolution optical remote sensing images. This invention can significantly improve the extraction effect of buildings from high-resolution optical remote sensing images and can accurately extract buildings of different scales and spatial distributions in complex environments.

Description

Optical remote sensing image building extraction method with multi-layer characteristic decoupling

Technical Field

The invention relates to the technical fields of remote sensing image processing and building extraction, in particular to a multi-layer characteristic decoupling optical remote sensing image building extraction method.

Background

Building extraction plays an important role in city planning, illegal building monitoring, geographical information exploration and the like. With the rapid development of high resolution optical remote sensing images, more and more data are available for building extraction. However, when faced with massive amounts of optical remote sensing data, manually extracting buildings is a time-consuming and labor-consuming task, which requires an automatic and efficient building extraction algorithm to solve this problem. The buildings exhibit different morphologies and spatial distributions in different environments, and at the same time, limited by imaging conditions, the buildings may have a low contrast with the surrounding environment, which can present challenges for accurate extraction of the buildings. In the face of increasingly complex building extraction tasks, conventional methods relying on the features of the building itself, such as spectrum, shape, color, texture, and shading, have failed to meet the actual needs of people. With the continuous development of deep learning, convolutional neural networks with strong feature extraction and generalization learning capabilities are increasingly and widely applied to tasks of building extraction, and the performance of an automatic building extraction algorithm is remarkably improved.

To better extract buildings accurately from complex environments, many studies have improved building extraction performance by increasing semantic feature description capabilities. Such as by introducing a spatial pyramid pooling module in the encoding-decoding structure to increase the receptive field, and enhancing the feature descriptors with different convolution forms such as asymmetric convolution, dense upsampling convolution, etc. Many studies have focused on improving building edge extraction accuracy by introducing attention mechanisms in the U-Net network to better perform feature fusion or by introducing semantic edge information in feature fusion to further refine the boundaries of irregular buildings. Many studies further optimize the extraction algorithm for the problem of large differences in building volumes. The method can accurately extract building information of different scales by designing independent shape prediction branches, optimize a multi-scale feature fusion mechanism by a multi-scale attention model, or adopt various forms of image post-processing operation to improve the integrity and accuracy of building extraction. However, with the continuous development of high-resolution optical remote sensing images, detailed information contained in the images is continuously increased, differences among classes of buildings are continuously reduced, differences in the classes are continuously increased, and the method can generate more over-extraction and under-extraction results and has poor extraction precision.

Disclosure of Invention

In view of the above, the invention provides a multi-layer feature decoupling optical remote sensing image building extraction method, which constructs powerful semantic feature expression capability in a multi-layer feature decoupling mode, improves the edge accuracy of a building while guaranteeing the integrity of a building main body, and improves the building extraction performance.

The invention relates to a multi-layer characteristic decoupling optical remote sensing image building extraction method, which comprises the following steps:

Firstly, carrying out multi-scale feature extraction on an optical remote sensing image to obtain multi-layer feature images with different scales;

step two, carrying out feature decomposition on the feature map obtained in the step one, namely calculating the offset of each feature in the adjacent deep feature map by using the feature flow field by taking the adjacent shallow feature map as a reference; correcting the deep feature map based on the offset to obtain more stable semantic main body features in a strong semantic region representing the main body part of the building, and then obtaining uncertain semantic boundary features in a weak semantic region representing the edge part of the building in the deep feature map by utilizing subtraction operation;

step three, respectively fusing the more stable semantic main body features and the uncertain semantic boundary features obtained in the step two layer by layer to obtain a plurality of pixel-level prediction graphs, wherein the strong semantic regions are fused from deep to shallow, and the weak semantic regions are fused from shallow to deep;

And step four, respectively performing supervised learning on the plurality of pixel-level prediction graphs obtained in the step three based on the multi-task joint loss function by using a multi-task supervision method, and completing building extraction.

Preferably, in the first step, a AlexNet, VGGNet, resNet, resNeXt or DenseNet network is used to perform feature extraction on the optical remote sensing image.

Preferably, the second step specifically includes the following sub-steps:

S2.1, feature preprocessing, namely, marking F and F ' as relative shallow features and relative deep features of two adjacent feature layers, and marking the deep features as F ' ' by changing the deep features into the same size as the shallow features F;

S2.2, generating a characteristic flow field, namely cascading the deep layer characteristic F '' obtained in the step S2.1 with the shallow layer characteristic F and convolving the deep layer characteristic F to obtain a flow field delta;

S2.3, generating more stable semantic main body features in the strong semantic region, namely obtaining the offset corresponding to each feature in the deep feature map F ' ' according to the flow field delta, correcting the deep feature map F ' ' based on the offset, and obtaining more stable semantic main body features F ' _MainBody in the strong semantic region;

S2.4, generating uncertain semantic boundary features in the weak semantic region, namely obtaining uncertain semantic boundary features F' _{Uncerta in Boundary} from a deep feature map F″ by using a subtraction operation;

s2.5, sequentially combining the multilayer feature images in the first step, repeating the steps S2.1-S2.4 to generate N-1 groups of more stable semantic bodies and uncertain semantic boundaries, wherein N is the total layer number of the feature images obtained in the first step.

Preferably, in S2.1, the deep features are changed to the same size as the shallow features using a 1×1 convolution and upsampling operation.

Preferably, a feature flow warping operation is used to correct the deep feature map.

Preferably, in the third step, when the features of the adjacent feature layers are fused, the attention mechanism is used to give weight to the features of the adjacent feature layers.

Preferably, the Sigmoid function is used as a gate to weight the features of the adjacent feature layers.

Preferably, in the third step, the following manner is adopted for fusion:

Wherein, the AndRespectively represent the feature layers after the attention mechanism is adopted, Z represents the feature fusion result, G _X and G _Y respectively represent the attention coefficients obtained from the selection gates,

G_x＝Sigmoid(conv_1×1(X))

G_y＝Sigmoid(conv_1×1(Y))

Where Sigmoid (·) represents performing a Sigmoid operation and conv _1×1 (·) represents a 1 x1 convolution.

Preferably, in the fourth step, each subtask is supervised and learned by using a cross entropy loss function.

The beneficial effects are that:

(1) The method comprises the steps of firstly extracting multi-scale features of a building by utilizing a multi-layer feature decoupling network and decomposing the multi-scale features to obtain more stable semantic main features and uncertain semantic boundary features, then gradually fusing by adopting a double-flow semantic feature description network in different modes based on the difference of the semantic main features and the uncertain semantic boundary features, deepening semantic representation in deep features in a semantic region, retaining more detail information in a weak semantic region, and finally improving the accuracy of edges of the building while guaranteeing the integrity of the main body part of the building by utilizing a multi-task supervision method to realize high-performance extraction of a high-resolution optical remote sensing image building. Compared with the prior art, the method can obviously improve the extraction effect on the high-resolution optical remote sensing image building. Especially when facing buildings with different dimensions and different spatial distributions in complex environments, the method can ensure the integrity of the main body of the building and the edge accuracy of the building, and reduce the occurrence of over-extraction and under-extraction. The method is based on an encoding-decoding framework, utilizes multi-layer feature decoupling and double-flow semantic feature description networks to improve semantic description capability, greatly improves building extraction performance, and has good practical application value.

(2) According to the invention, the offset of the deep feature map features is obtained based on the feature flow field, then the deep feature map is corrected by utilizing the feature flow distortion operation, the deep feature can be adaptively adjusted and aligned, the feature positioning capability is improved, and the more stable semantic main feature is obtained.

(3) When the features of the adjacent feature layers are fused, the complementarity of the adjacent feature layers is considered, the attention mechanism is utilized to select the features of the adjacent feature layers and guide the complementation information fusion, the fusion of invalid feature information can be obviously reduced, and the fusion can be more efficient and reasonable.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of the characteristic structure module structure of the invention.

FIG. 3 is a dual stream semantic feature description framework of the present invention.

Fig. 4 is a schematic diagram of a component fusion module.

Detailed Description

The invention will now be described in detail by way of example with reference to the accompanying drawings.

The invention provides a multi-layer characteristic decoupling optical remote sensing image building extraction method. The flow chart of the method is shown in fig. 1, and specifically comprises the following steps:

Step one, multi-scale feature extraction is carried out on the high-resolution optical remote sensing image, and multi-layer feature diagrams with different scales are obtained.

The step can adopt AlexNet, VGGNet, resNet, resNeXt, denseNet and other feature extraction networks to realize multi-scale feature extraction.

In this embodiment, a ResNet backbone network is used to perform feature extraction on the input high-resolution optical remote sensing image, as shown in fig. 1 (a). Specifically, the original optical image is input ResNet into a feature extraction network, deep feature information is obtained through multiple rolling and pooling operations, and the obtained 4-layer feature images are respectively marked as F ¹,F²,F³,F⁴ from shallow to deep.

And step two, decomposing the feature map obtained in the step one to obtain a more stable semantic main body representing the main body part of the building and an uncertain semantic boundary in a weak semantic region representing the edge part of the building.

The continuous convolution operation in the first step can lose building detail information and have the problem of feature misalignment while stabilizing deep semantic features. Therefore, the deep feature map is corrected by introducing the feature flow field and taking the adjacent low-layer features as references to align the deep features and utilizing the offset corresponding to each feature of the deep feature map obtained in the feature flow field, so that a more stable semantic main body in a strong semantic region is obtained.

As shown in fig. 1 (b), the method can be specifically divided into the following 5 sub-steps:

s2.1, feature pretreatment, namely, as shown in FIG. 2, marking F and F' as the relative shallow features and the relative deep features of two adjacent feature layers. F' is a deep feature with a greater number of data channels and smaller spatial dimensions. To decompose F' using the shallow feature F as a reference, the deep feature is first changed to the same size as the shallow feature F, denoted as F ", using a1 x 1 convolution and upsampling operation, as shown in the following equation:

F′′=Up(conv_1×1(F′))

Where Up (-) represents the Up-sampling operation and conv _1×1 (-) represents the 1 x1 convolution.

S2.2, generating a characteristic flow field, wherein the characteristic flow field is introduced to enable the network to automatically learn characteristic dislocation information. First, the deep features f″ obtained in the sub-step (1) are cascaded with the shallow features F, and then a 3×3 convolution is used to obtain the flow field δ. The flow field delta has two dimensions, representing the direction of departure of each feature point in the flow field.

δ=conv_3×3(cat(F,F′′))

Where cat (-) represents the cascading operation and conv _3×3 (-) represents the 3 x 3 convolution.

S2.3, generating more stable semantic main body characteristics in the strong semantic region, namely obtaining the offset corresponding to each characteristic in the relative deep characteristic diagram F ' ' according to the flow field delta, and correcting the deep characteristic diagram F ' ' by means of the characteristic flow distortion operation to obtain more stable semantic main body characteristics F ' _MainBody in the strong semantic region, wherein the formula is as follows:

Where ψ (·) represents the feature flow warping operation, ρ is a feature point in the deep feature map f″, N (ρ) represents feature points around the warped feature point ρ, ω _ρ is the ρ corresponding offset.

S2.4, generating uncertain semantic boundary features in the weak semantic region, wherein the uncertain semantic boundary features F' _{Uncerta in Boundary} can be obtained from the deep feature map F″ by using a subtraction operation.

F′_{Uncerta in Boundary}＝F″-F′_MainBody

And S2.5, generating three groups of more stable semantic bodies and uncertain semantic boundaries, namely combining adjacent feature images of the 4-layer feature images obtained in the step one two by two, and repeating the sub-steps S2.1-S2.4 to generate three groups of more stable semantic bodies and uncertain semantic boundaries.

And thirdly, based on a component fusion module, describing a network structure by utilizing double-flow semantic features, and respectively fusing the three groups of more stable semantic bodies and the uncertain semantic boundaries obtained in the second step layer by layer. Features belonging to a strong semantic region and a weak semantic region are respectively integrated into two parallel branches, feature fusion is carried out from deep to shallow in the strong semantic region, semantic representation in deep features is deepened, fusion is carried out from shallow to deep in the weak semantic region, and more detail information is reserved.

The specific fusion mode is that firstly, up-sampling, channel compression and other operations are utilized, and feature layers with different depths have the same channel number and space size. And then fusing the processed adjacent feature layer features together through cascading operation, wherein the strong semantic region adopts feature fusion from deep to shallow, and the weak semantic region adopts fusion from shallow to deep.

Furthermore, the complementarity of adjacent feature layers is considered, and the invention introduces a attention mechanism to select and guide the complementarity information fusion in the fusion process, so that the fusion of invalid feature information can be obviously reduced, and the fusion can be more efficient and reasonable.

As shown in fig. 3, the present embodiment is specifically divided into the following 6 substeps:

s3.1, before layer-by-layer feature fusion, up-sampling, down-sampling, channel compression and other operations are used for adjusting the feature graphs with different channel numbers and space sizes obtained in the step two, so that the feature graphs finally have the same channel numbers and space sizes.

S3.2, the designed component feature fusion module is utilized to efficiently fuse the features of the strong semantic region and the weak semantic region. As shown in fig. 4, X and Y are adjacent feature layers adjusted in the substep S3.1, and are used as inputs of the component feature fusion module. Considering the complementarity of adjacent feature layers, the mutual fusion of complementary information is selected and guided by means of the attention mechanism using Sigmoid function as gate, and the process can be represented by the following formula:

Wherein, the AndRepresenting the optimized feature layer, Z represents the output of the component feature fusion module, and G _X and G _Y represent the attention coefficients from the select gates, respectively, which can be obtained by the following equation:

G_x＝Sigmoid(conv_1×1(X))

G_y＝Sigmoid(conv_1×1(Y))

And S3.3, repeating the substep S3.2, and sequentially fusing the features of each layer obtained in the step two to obtain fused features of the fused strong semantic region and weak semantic region, wherein the more stable semantic main feature fusion in the strong semantic region follows a fusion sequence from top to bottom, and the uncertain semantic boundary feature fusion in the weak semantic region is carried out according to the sequence from bottom to top. The size of the feature space after fusion is 1/4 of the input feature map, and the number of channels is 2 times of the input feature map.

And S3.4, carrying out depth and parallel multi-rate hole convolution on the fused strong semantic region and weak semantic region obtained in the substep S3.3, wherein the hole rate is 1,2 and 5 respectively. The feature fusion is then refined using a 1x1 convolution, resulting in a more stable semantic body and an uncertain semantic boundary.

And S3.5, adding the more stable semantic body obtained in the substep (4) and the uncertain semantic boundary through a point-by-point addition operation to obtain the complete building feature.

And S3.6, respectively obtaining three groups of pixel-level prediction graphs by using the prediction structure module shown in FIG. 3. The structure can be represented by the following formula:

P=Up(ReLU(BN(conv_1×1(ReLU(BN(conv_3×3(F)))))))

Wherein F represents the final feature, P represents the predicted graph obtained through the predicted architecture module, up (-) represents the upsampling operation, reLU (-) represents the convolution using the ReLU activation function, conv _1×1 (-) represents the convolution of 1X 1, conv _3×3 (-) represents the convolution of 3X 3, and BN (-) represents the batch normalization operation.

And step four, utilizing a multitask supervision method to respectively supervise and optimize the more stable semantic body, the uncertain semantic boundary and the three groups of predictive graphs of the building generated in the step three based on the cross entropy loss function, as shown in fig. 1 (d). The supervision of the more stable semantic subjects and the uncertain semantic boundaries is auxiliary supervision, so that the more complete building subjects can be generated, and meanwhile, the accuracy of the building boundaries is improved.

The method comprises the following steps of:

s4.1, generating a building main body and edge label graph corresponding to the more stable semantic main body and the uncertain semantic boundary in the step three by using morphological image operations such as image erosion and the like.

S4.2, performing supervised learning on each subtask by using a cross entropy loss function, wherein the formula is as follows:

Wherein L _S,L_B and L _E represent the complete building segmentation loss, the more stable semantic body loss in the strong semantic region and the uncertain semantic boundary loss in the weak semantic region, respectively, N represents the number of pixels in the picture, y _i E {0,1} represents whether the pixel i belongs to a building, i is a sequence number, and p _i E [0,1] is the prediction probability of the pixel i.

S4.3, optimizing the network by using the multi-task joint loss function, and improving the extraction performance of the building, wherein the extraction performance can be represented by the following formula:

L_total＝λ₁·L_S+λ₂·L_B+λ₃·L_E

where L _total is the total loss of the multitasking, λ ₁,λ₂ and λ ₃ are the loss weights corresponding to each task, and are set to 1,20 in the method.

And finally obtaining the building extraction result with high accuracy and low false alarm rate through the first step to the fourth step.

In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for extracting buildings from optical remote sensing images with multi-layer feature decoupling, characterized in that it includes:

Step 1: Extract multi-scale features from the optical remote sensing image to obtain multi-layer feature maps at different scales;

Step 2: Perform feature decomposition on the feature map obtained in Step 1. Specifically, using adjacent shallow feature maps as references, calculate the offset of each feature in adjacent deep feature maps using the feature flow field; correct the deep feature map based on the offset to obtain more stable semantic main features in the strong semantic region representing the main part of the building; then use subtraction to obtain uncertain semantic boundary features in the weak semantic region representing the edge part of the building in the deep feature map.

Step 3: Fuse the multiple sets of more stable semantic subject features and uncertain semantic boundary features obtained in Step 2 layer by layer to obtain multiple pixel-level prediction maps; among them, strong semantic regions are fused from deep to shallow, and weak semantic regions are fused from shallow to deep.

Step four: Using a multi-task supervision method, supervised learning is performed on the multiple pixel-level prediction maps obtained in step three based on the multi-task joint loss function to complete the building extraction.

2. The method as described in claim 1, wherein in step one, AlexNet, VGGNet, ResNet, ResNeXt or DenseNet networks are used to extract features from the optical remote sensing image.

3. The method as described in claim 1, wherein step two specifically includes the following sub-steps:

S2.1 Feature preprocessing: Let F and F′ be the relatively shallow and relatively deep features of two adjacent feature layers; transform the deep feature to the same size as the shallow feature F, denoted as F″;

S2.2, Generate the characteristic flow field: Concatenate the deep feature F″ obtained in S2.1 with the shallow feature F and perform convolution to obtain the flow field δ;

S2.3, Generate more stable semantic subject features in strong semantic regions: Based on the flow field δ, obtain the offset corresponding to each feature in the deep feature map F″; Based on the offset, correct the deep feature map F″ to obtain more stable semantic subject features F′ _MainBody that constitute the strong semantic region;

S2.4, Generate uncertain semantic boundary features in weak semantic regions: Use subtraction operation to obtain uncertain semantic boundary features F′ _{UncertainBoundary} from deep feature map F″;

S2.5, combine adjacent feature maps in pairs in the multi-layer feature maps from step one, and repeat S2.1 to S2.4 to generate N-1 sets of more stable semantic subjects and uncertain semantic boundaries; where N is the total number of layers of feature maps obtained in step one.

4. The method as described in claim 3, wherein in step S2.1, a 1×1 convolution and upsampling operation are used to transform the deep features to the same size as the shallow features.

5. The method as described in claim 1 or 3, characterized in that a feature flow distortion operation is used to correct the deep feature map.

6. The method as described in claim 1, wherein in step three, when the features of adjacent feature layers are fused, an attention mechanism is used to assign weights to the features of adjacent feature layers.

7. The method as described in claim 6, characterized in that a Sigmoid function is used as a gate to assign weights to the features of adjacent feature layers.

8. The method as described in claim 7, characterized in that, in step three, the fusion is performed in the following manner:

in, and Z represents the feature layer after applying the attention mechanism, Z represents the feature fusion result, and G<sub>_X</sub> and G<sub> _Y </sub> represent the attention coefficients obtained from the selection gate, respectively.

G _x = Sigmoid(conv _1×1 (X))

G _y = Sigmoid(conv _1×1 (Y))

Here, Sigmoid(·) represents performing a Sigmoid operation, and conv _1×1 (·) represents a 1×1 convolution.

9. The method as described in claim 1, wherein in step four, supervised learning is performed on each subtask using the cross-entropy loss function.