
CN114037891A - Method and device for building extraction from high-resolution remote sensing images based on U-shaped attention control network - Google Patents


Info

Publication number: CN114037891A
Application number: CN202110975846.7A
Authority: CN (China)
Legal status: Pending
Prior art keywords: attention, convolution, feature map, convolution block
Inventors: 于明洋, 陈肖娴, 张宣峰, 张文焯, 李景琪, 刘耀辉, 邢华桥, 孟飞
Current and original assignee: Shandong Jianzhu University
Application filed by Shandong Jianzhu University, with priority to CN202110975846.7A


Classifications

    • G Physics > G06 Computing; calculating or counting > G06N Computing arrangements based on specific computational models
    • G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods


Abstract

The invention discloses a method for extracting buildings from high-resolution remote sensing images based on a U-shaped attention control network, comprising: inputting a remote sensing image into an encoder; the encoder extracting features at different grid-dimension levels, generating encoder feature maps, and outputting them to a converter; inserting attention valves into the skip connections between the convolution blocks of the encoder and the decoder; the converter extracting an abstract feature map and outputting it to the decoder; the decoder upsampling the abstract feature map stage by stage, generating decoder feature maps, and outputting them to the segmentation unit; and the segmentation unit adjusting the channel number of the input decoder feature maps to obtain the building segmentation result of the remote sensing image. The invention further discloses a corresponding device for extracting buildings from high-resolution remote sensing images based on a U-shaped attention control network. The U-shaped attention control network of the invention effectively improves the quality of building extraction from high-resolution remote sensing images in both prediction performance and result accuracy.

Description

Method and Device for Building Extraction from High-Resolution Remote Sensing Images Based on a U-Shaped Attention Control Network

Technical Field

The invention relates to the technical field of remote sensing image extraction, and in particular to a method and device for extracting buildings from high-resolution remote sensing images based on a U-shaped attention control network.

Background

As immovable assets in urban and rural areas, buildings are of great significance for regional construction planning, regional population estimation, economic development assessment, and topographic map production and updating. With the development of satellite sensor technology, the imaging quality of remote sensing images has steadily improved, and researchers can acquire batches of high-resolution remote sensing images ever more quickly. However, as the volume of available imagery grows, automatically, accurately, and efficiently extracting building information from images has become a challenging research problem.

Early remote sensing imagery involved relatively little data, and building extraction relied mainly on manual recognition, visual interpretation, and vector feature extraction. In recent years, however, remote sensing data have become massive, multi-scale, and diverse, and batch building extraction by manual interpretation alone has become practically infeasible. First, the labor cost is high, the efficiency is low, and extraction quality cannot be measured against a unified standard; second, interpretation and extraction require many trained professionals, and prolonged visual inspection can damage interpreters' eyesight, making work efficiency unstable. Automated, intelligent extraction of buildings from remote sensing images by computer has therefore become an important topic in remote sensing image extraction research.

In the prior art, mainstream methods for extracting buildings from remote sensing images fall into two categories: manually designing the target features using mathematical and morphological knowledge, and automatically extracting the target features with deep learning. Hand-crafted-feature methods exploit the texture, geometry, spectrum, and context of buildings in remote sensing images, manually screening, designing, and constructing effective features with expert knowledge to extract the buildings. Lin and Nevatia were the first to detect building roofs, walls, and shadows with edge detection algorithms and then extract the buildings. Huang Jinku et al. first processed the image with a knowledge-rule algorithm, then applied morphological repair and edge detection, and finally regularized the building shape features with three digitization methods to obtain accurate building-outline extraction results. Fang Xin et al. built screening conditions from the spatial relationship between shadows and buildings to filter out suspected building areas, then applied a graph-cut algorithm to extract building outlines precisely. Zhang et al. exploited the shape and spectral features of buildings in images, clustering homogeneous pixels with similar shape and contour information to extract buildings.

However, the inventors found through research that although previous methods for extracting buildings from high-resolution remote sensing images have made some progress, methods based on hand-crafted features generally suffer from low extraction accuracy, complex processing, and insufficient use of image features. In addition, these methods require various rules to predefine features manually, which makes large-scale building extraction from remote sensing imagery laborious.

With improvements in computing performance in recent years, deep learning, and the convolutional neural network (CNN) in particular, has developed rapidly in both concept and theory and found wide application in fields such as natural image classification, object detection, and semantic segmentation. Common CNN models include AlexNet, VGGNet, GoogLeNet, and ResNet. Compared with traditional methods based on manually designed features, CNN models automatically extract features from the input image and have strong feature representation capabilities; they have attracted growing research attention and driven important progress in target segmentation and recognition for remote sensing imagery. For example, Lv et al. combined SEEDS-CNN with scale-validity analysis to achieve high-precision classification of small-scale objects in remote sensing images. Li et al., addressing the problem that CNNs stack square image patches of different scales while ignoring the background information of the segmentation target, which degrades classification accuracy, proposed a standard-object-based dual CNN model for image classification. Chen et al. applied multi-scale CNNs with scale parameter estimation to classify land cover with high accuracy. Zhang et al. used a fused OCNN model that combines spectral patterns, geometric features, and object-level contextual features to identify target objects accurately in high-resolution images and improve object classification precision. However, a CNN model alone cannot directly generate accurate building outlines, and the post-processing steps after image segmentation cannot be ignored; in recent years researchers have therefore begun to introduce fully convolutional network (FCN) models into building extraction from remote sensing imagery to extract building outlines "end to end". Xie Yuehui et al. combined texture features expressed by local binary patterns with scale features extracted by a Gaussian pyramid to build the model's training samples, trained with the SegNet network, and completed coarse building extraction with a SoftMax classifier. Liu Yifan et al. used the deep residual network Res-Une to extract pixel-level building features, set a threshold to filter noise from the predicted probability map, and largely preserved building integrity through post-processing.

Summary of the Invention

Based on this, to address the technical problems in the prior art, a method for extracting buildings from high-resolution remote sensing images based on a U-shaped attention control network is proposed, comprising:

A remote sensing image is input to an encoder. The encoder uses global and local context information to extract features at different grid-dimension levels and generates encoder feature maps. The encoder comprises four cascaded convolution blocks: a first, a second, a third, and a fourth convolution block.

The encoder is connected to a converter and outputs the generated encoder feature maps to it. The converter comprises a fifth convolution block and attention valves in four skip connections.

The converter is connected to a decoder. Attention valves are inserted in the skip connections between the convolution blocks of the encoder and decoder. The converter extracts an abstract feature map and outputs it to the decoder.

The decoder comprises four cascaded convolution blocks: a sixth, a seventh, an eighth, and a ninth convolution block. The first convolution block is skip-connected to the ninth, the second to the eighth, the third to the seventh, and the fourth to the sixth.

The decoder is connected to a segmentation unit. The decoder upsamples the abstract feature map from the converter stage by stage, generates a decoder feature map of the same size as the input image, and outputs it to the segmentation unit. The segmentation unit adjusts the channel number of the input decoder feature map and obtains the building segmentation result of the remote sensing image.

In one embodiment, each convolution block in the encoder comprises a convolution layer, a max pooling layer, a normalization layer, and an activation function unit. The max pooling layer constructs a new feature map from the maxima of local regions of the original feature map and, by reducing the number of parameters, helps prevent overfitting.
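The max pooling step described above can be sketched numerically. Below is a minimal NumPy illustration; the 2×2 window with stride 2 and the sample values are assumptions for illustration only, since the patent does not specify the pooling parameters:

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2: each output value is the maximum of one
    non-overlapping 2x2 region of the input feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]  # drop odd trailing row/col
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [0, 1, 3, 2],
                 [2, 0, 1, 4]], dtype=float)
pooled = max_pool_2x2(fmap)  # 4x4 -> 2x2: [[4, 5], [2, 4]]
```

Halving the spatial size this way reduces the number of activations later layers must process, which is the parameter-reduction effect the text credits with limiting overfitting.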

The fifth convolution block comprises a convolution layer, a max pooling layer, and a normalization layer. It abstracts the encoder feature maps input by the encoder to the highest level, increasing the number of channels while shrinking the spatial size of the feature maps, thereby extracting the highest-dimensional abstract feature map.

In one embodiment, the decoder upsamples the abstract feature map from the converter stage by stage, specifically: the sixth convolution block upsamples the abstract feature map input by the fifth convolution block and outputs it to the seventh convolution block; the seventh convolution block upsamples the feature map input by the sixth and outputs it to the eighth; the eighth convolution block upsamples the feature map input by the seventh and outputs it to the ninth.
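A minimal sketch of this stage-by-stage upsampling chain; nearest-neighbour upsampling and the 32×32 starting size are assumptions for illustration, as the patent does not fix the upsampling method or resolutions:

```python
import numpy as np

def upsample_2x(feature_map: np.ndarray) -> np.ndarray:
    """Nearest-neighbour upsampling: duplicate every pixel into a 2x2 block,
    doubling both spatial dimensions."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

# The abstract feature map is upsampled once per decoder stage
# (here: sixth -> seventh -> eighth convolution block).
x = np.zeros((32, 32))
for _stage in ("block6", "block7", "block8"):
    x = upsample_2x(x)
# Three doublings grow the 32x32 map back toward the input resolution.
```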

The segmentation unit comprises a tenth convolution block, which consists of a convolution layer and an activation function unit. The convolution layer of the tenth convolution block performs a 1×1 convolution, and its activation function unit is the sigmoid function. The 1×1 convolution adjusts the channel number of the decoder feature map to the number of classes, and the sigmoid function yields the building segmentation result of the remote sensing image.
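The segmentation head can be sketched as follows: a 1×1 convolution is a per-pixel linear map over the channel axis, and the sigmoid turns the resulting logits into per-pixel building probabilities. The channel count of 16, the single "building" class, and the random weights below are hypothetical:

```python
import numpy as np

def segmentation_head(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """1x1 convolution followed by a sigmoid.

    features: (C, H, W) decoder feature map; weights: (num_classes, C).
    A 1x1 convolution is a matrix product applied independently at each pixel,
    so it only adjusts the channel count (here C -> num_classes)."""
    logits = np.einsum('kc,chw->khw', weights, features)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid at every pixel

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 4, 4))   # hypothetical 16-channel decoder output
w = rng.normal(size=(1, 16))          # one output class: "building"
prob = segmentation_head(feats, w)    # (1, 4, 4) probabilities in (0, 1)
mask = (prob > 0.5).astype(np.uint8)  # binary building segmentation result
```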

In one embodiment, the attention valve comprises an attention layer, a valve control layer, a first activation function unit, a linear transformation unit, a second activation function unit, and a resampler.

In the valve control layer, a valve control signal is input to the attention valve; the signal comprises a control vector and comes from the convolution block immediately preceding the current convolution block.

The control vector is multiplied by the control coefficient to give a first product.

In the attention layer, a feature map is input to the attention valve; this feature map comes from the convolution block that is skip-connected to the current convolution block.

The pixel vectors of the feature map input to the attention valve are multiplied by the attention coefficient to give a second product.

The first and second products are summed, a bias is applied, and the sum is activated by the first activation function unit to give a first activation result. The linear transformation unit adjusts the channel number of the first activation result and applies a bias; the second activation function unit then activates it to give a second activation result, which is fed into the resampler and adjusted to obtain the attention coefficient α. Finally, the feature map input to the attention layer is multiplied by α, and the product, which has the same size as that feature map, is output.

In one embodiment, the attention valve obtains the attention coefficient through an additive attention mechanism, computed as follows:

$$q_{att}^{l}=\psi^{T}\left(\sigma_{1}\left(W_{x}^{T}x_{i}^{l}+W_{g}^{T}g_{i}+b_{g}\right)\right)+b_{\psi}$$

$$\alpha_{i}^{l}=\sigma_{2}\left(q_{att}^{l}\left(x_{i}^{l},g_{i};\Theta_{att}\right)\right)$$

where $\sigma_{2}(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function;

The parameter group $\Theta_{att}$ of the attention valve comprises $W_{x}\in\mathbb{R}^{F_{l}\times F_{int}}$, $W_{g}\in\mathbb{R}^{F_{g}\times F_{int}}$, $\psi\in\mathbb{R}^{F_{int}\times 1}$, $b_{g}\in\mathbb{R}^{F_{int}}$, and $b_{\psi}\in\mathbb{R}$, where $W_{x}$ is the control coefficient, $W_{g}$ is the attention coefficient, $\psi$ is a linear transformation, $b_{g}$ is the first bias parameter, and $b_{\psi}$ is the second bias parameter; $F_{l}$ denotes the number of feature maps in layer $l$, and $F_{int}$ denotes the input layer.

The first activation function executed by the first activation function unit of the attention valve is

$$\sigma_{1}(x_{i,c})=\max(0,x_{i,c})$$

where $i$ indexes the spatial dimension and $c$ the channel dimension;

The linear transformation performed by the linear transformation unit is a 1×1×1 channel-wise convolution that changes the number of channels of the input vector.

The second activation function $\sigma_{2}$ executed by the second activation function unit of the attention valve is the sigmoid function.

The output of the attention valve is the pixel-wise product of the input feature map and the attention coefficient $\alpha$:

$$\hat{x}_{i,c}^{l}=x_{i,c}^{l}\cdot\alpha_{i}^{l}$$

where the attention coefficient $\alpha\in[0,1]$.
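The additive attention computation above can be sketched in NumPy. The sketch follows the two formulas directly; the matrix shapes, random values, and the choice of ReLU for σ1 are assumptions consistent with the description, since the patent does not state exact layer sizes:

```python
import numpy as np

def attention_valve(x, g, Wx, Wg, bg, psi, b_psi):
    """Additive attention gate over a skip connection.

    x: (Fl, N) feature map from the skip-connected encoder block,
    g: (Fg, N) control (gating) signal from the preceding block,
    with the N pixels flattened into one axis."""
    q = Wx.T @ x + Wg.T @ g + bg[:, None]  # first product + second product + bias
    q = np.maximum(q, 0.0)                 # sigma1 (assumed ReLU): first activation
    logits = psi.T @ q + b_psi             # linear transform to one channel + bias
    alpha = 1.0 / (1.0 + np.exp(-logits))  # sigma2 (sigmoid): alpha in (0, 1)
    return x * alpha                       # pixel-wise gating of the skip features

rng = np.random.default_rng(1)
Fl, Fg, Fint, N = 8, 8, 4, 10              # hypothetical dimensions
x = rng.normal(size=(Fl, N))
g = rng.normal(size=(Fg, N))
out = attention_valve(x, g,
                      Wx=rng.normal(size=(Fl, Fint)),
                      Wg=rng.normal(size=(Fg, Fint)),
                      bg=rng.normal(size=Fint),
                      psi=rng.normal(size=(Fint, 1)),
                      b_psi=0.0)
# out has the same shape as x; magnitudes shrink where alpha is small,
# which is how irrelevant background responses are suppressed.
```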

In addition, to address the technical problems in the prior art, a device for extracting buildings from high-resolution remote sensing images based on a U-shaped attention control network is also proposed, comprising an encoder, a converter, a decoder, and a segmentation unit. The encoder is connected to the converter, the converter to the decoder, and the decoder to the segmentation unit.

A remote sensing image is input to the encoder. The encoder uses global and local context information to extract features at different grid-dimension levels and generates encoder feature maps. The encoder comprises four cascaded convolution blocks: a first, a second, a third, and a fourth convolution block. The encoder outputs the generated encoder feature maps to the converter.

The decoder comprises four cascaded convolution blocks: a sixth, a seventh, an eighth, and a ninth convolution block. The first convolution block is skip-connected to the ninth, the second to the eighth, the third to the seventh, and the fourth to the sixth.

The converter comprises a fifth convolution block and attention valves in four skip connections. The converter links the corresponding feature maps of the encoder and the decoder; attention valves are inserted in the skip connections between the convolution blocks of the encoder and decoder. The converter extracts an abstract feature map and outputs it to the decoder.

The decoder upsamples the abstract feature map from the converter stage by stage, generates a decoder feature map of the same size as the input image, and outputs it to the segmentation unit. The segmentation unit adjusts the channel number of the input decoder feature map and obtains the building segmentation result of the remote sensing image.

In one embodiment, each convolution block in the encoder comprises a convolution layer, a max pooling layer, a normalization layer, and an activation function unit. The max pooling layer constructs a new feature map from the maxima of local regions of the original feature map and, by reducing the number of parameters, helps prevent overfitting.

The fifth convolution block comprises a convolution layer, a max pooling layer, and a normalization layer. It abstracts the encoder feature maps input by the encoder to the highest level, increasing the number of channels while shrinking the spatial size of the feature maps, thereby extracting the highest-dimensional abstract feature map.

In one embodiment, the decoder upsamples the abstract feature map from the converter stage by stage, specifically: the sixth convolution block upsamples the abstract feature map input by the fifth convolution block and outputs it to the seventh convolution block; the seventh convolution block upsamples the feature map input by the sixth and outputs it to the eighth; the eighth convolution block upsamples the feature map input by the seventh and outputs it to the ninth.

The segmentation unit comprises a tenth convolution block, which consists of a convolution layer and an activation function unit. The convolution layer of the tenth convolution block performs a 1×1 convolution, and its activation function unit is the sigmoid function. The 1×1 convolution adjusts the channel number of the decoder feature map to the number of classes, and the sigmoid function yields the building segmentation result of the remote sensing image.

In one embodiment, the attention valve comprises an attention layer, a valve control layer, a first activation function unit, a linear transformation unit, a second activation function unit, and a resampler.

In the valve control layer, a valve control signal is input to the attention valve; the signal comprises a control vector and comes from the convolution block immediately preceding the current convolution block.

The control vector is multiplied by the control coefficient to give a first product.

In the attention layer, a feature map is input to the attention valve; this feature map comes from the convolution block that is skip-connected to the current convolution block.

The pixel vectors of the feature map input to the attention valve are multiplied by the attention coefficient to give a second product.

The first and second products are summed, a bias is applied, and the sum is activated by the first activation function unit to give a first activation result. The linear transformation unit adjusts the channel number of the first activation result and applies a bias; the second activation function unit then activates it to give a second activation result, which is fed into the resampler and adjusted to obtain the attention coefficient α. Finally, the feature map input to the attention layer is multiplied by α, and the product, which has the same size as that feature map, is output.

In one embodiment, the attention valve obtains the attention coefficient through an additive attention mechanism, computed as follows:

$$q_{att}^{l}=\psi^{T}\left(\sigma_{1}\left(W_{x}^{T}x_{i}^{l}+W_{g}^{T}g_{i}+b_{g}\right)\right)+b_{\psi}$$

$$\alpha_{i}^{l}=\sigma_{2}\left(q_{att}^{l}\left(x_{i}^{l},g_{i};\Theta_{att}\right)\right)$$

where $\sigma_{2}(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function;

The parameter group $\Theta_{att}$ of the attention valve comprises $W_{x}\in\mathbb{R}^{F_{l}\times F_{int}}$, $W_{g}\in\mathbb{R}^{F_{g}\times F_{int}}$, $\psi\in\mathbb{R}^{F_{int}\times 1}$, $b_{g}\in\mathbb{R}^{F_{int}}$, and $b_{\psi}\in\mathbb{R}$, where $W_{x}$ is the control coefficient, $W_{g}$ is the attention coefficient, $\psi$ is a linear transformation, $b_{g}$ is the first bias parameter, and $b_{\psi}$ is the second bias parameter; $F_{l}$ denotes the number of feature maps in layer $l$, and $F_{int}$ denotes the input layer.

The first activation function executed by the first activation function unit of the attention valve is

$$\sigma_{1}(x_{i,c})=\max(0,x_{i,c})$$

where $i$ indexes the spatial dimension and $c$ the channel dimension;

The linear transformation performed by the linear transformation unit is a 1×1×1 channel-wise convolution that changes the number of channels of the input vector.

The second activation function $\sigma_{2}$ executed by the second activation function unit of the attention valve is the sigmoid function.

The output of the attention valve is the pixel-wise product of the input feature map and the attention coefficient $\alpha$:

$$\hat{x}_{i,c}^{l}=x_{i,c}^{l}\cdot\alpha_{i}^{l}$$

where the attention coefficient $\alpha\in[0,1]$.

Implementing the embodiments of the invention has the following beneficial effects:

本发明中,建筑物目标的定位和分割通过在卷积神经网络CNN中集成注意力阀门AGs来实现,不需要训练多个模型和大量额外的模型参数;与多阶段卷积神经网络CNN中的定位模型不同,注意力阀门AGs在多个维度中抑制不相关背景区域中的特征响应,不需要在网络之间裁剪感兴趣区域。In the present invention, the localization and segmentation of building objects are realized by integrating attention valve AGs in the convolutional neural network CNN, and there is no need to train multiple models and a large number of additional model parameters; Unlike localization models, attention valve AGs suppress feature responses in irrelevant background regions in multiple dimensions, without clipping regions of interest between networks.

The AGs-Unet model is a U-shaped attention control model built on the U-Net architecture. U-Nets are widely used for image segmentation tasks owing to their good computational performance and efficient use of GPU memory, while the attention mechanism can highlight effective target features and suppress redundant, invalid information during multi-scale feature extraction; the AGs-Unet model combines the advantages of both. At low-dimensional scales it captures wide-area contextual information and extracts global, coarse-scale image features; at high-dimensional scales it extracts abstract, fine-level features, with the attention gates highlighting the positions and boundaries of buildings in the feature maps. The gridded, multi-scale feature maps are fed into the decoder through skip connections, fusing coarse-level and fine-level dense building predictions.

Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.

In the drawings:

Figure 1 is a schematic diagram of the high-resolution remote sensing image building extraction apparatus based on a U-shaped attention control network according to the present invention;

Figure 2 is a schematic diagram of the structure of the attention gate according to the present invention.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

High-resolution remote sensing images are optical images with numerous ground-object categories, complex ground backgrounds, rich data content, and high resolution; therefore, deep learning methods for processing them must also keep developing. To accomplish building extraction from high-resolution remote sensing images efficiently, quickly, and accurately, the present invention draws on the attention gates (AGs) used in machine translation and image captioning tasks in natural language processing, which explore attention maps by interpreting the gradient of the output class score with respect to the input image. The present invention discloses grid-based attention gates and integrates them into U-Net to form a U-shaped attention control network (AGs-Unet).

The present invention proposes a high-resolution remote sensing image building extraction model based on a U-shaped attention control network, and compares it experimentally on the automated building extraction task, using the WHU dataset, against the traditional models with classical symmetric structures (FCN8s, SegNet, and U-Net) and the DANet model, which combines channel attention and position attention mechanisms, in terms of prediction accuracy, parameter count, and training time.

As shown in Figure 1, the present invention discloses a method for extracting buildings from high-resolution remote sensing images based on a U-shaped attention control network, comprising:

inputting a remote sensing image into the encoder; the encoder uses global and local context information to extract features (x^l) at different grid-dimension levels and generates encoder feature maps; the encoder comprises four cascaded convolution blocks: a first, a second, a third, and a fourth convolution block;

In particular, each convolution block in the encoder (the first, second, third, and fourth convolution blocks) comprises a convolutional layer (Conv), a max pooling layer (Maxpool), a batch normalization layer (BN), and an activation function unit (Rectified Linear Unit, ReLU); the max pooling layer in each encoder convolution block builds a new feature map by extracting the maximum value of each local region of the original feature map, and prevents overfitting by reducing the number of parameters;
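For intuition, the max-pooling step described above can be sketched in NumPy as follows. This is a minimal illustration assuming non-overlapping 2×2 windows with stride 2, a typical choice in U-Net-style encoders; the patent text itself does not fix the window size.

```python
import numpy as np

def maxpool2x2(fmap):
    """Downsample an (H, W) feature map by taking the max of each 2x2 block."""
    H, W = fmap.shape
    # Split into 2x2 blocks and reduce each block to its maximum value
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 1, 2, 3],
                 [4, 5, 6, 7]], dtype=float)
pooled = maxpool2x2(fmap)
# Each output pixel is the maximum of the corresponding 2x2 region:
assert (pooled == np.array([[4., 8.], [9., 7.]])).all()
```

The output is half the size in each spatial dimension, which is how the encoder both builds a coarser feature map and reduces the parameter count downstream.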

The encoder is connected to the converter and outputs the generated encoder feature maps to the converter; the converter comprises a fifth convolution block and attention gates in four skip connections;

The converter is connected to the decoder; an attention gate is inserted into each skip connection between the convolution blocks of the encoder and the decoder; the converter links the corresponding feature maps of the encoder and the decoder; the converter extracts an abstract feature map and outputs it to the decoder;

The decoder has the same number of convolution blocks as the encoder; the decoder comprises four cascaded convolution blocks: a sixth, a seventh, an eighth, and a ninth convolution block; the first convolution block is skip-connected to the ninth, the second to the eighth, the third to the seventh, and the fourth to the sixth;

The converter comprises a fifth convolution block and attention gates in four skip connections; the converter links the corresponding feature maps of the encoder and the decoder; an attention gate is inserted into each skip connection between the convolution blocks of the encoder and the decoder;

In particular, the fifth convolution block comprises a convolutional layer (Conv), a max pooling layer (Maxpool), and a batch normalization layer (BN); the fifth convolution block abstracts the encoder feature maps to the highest level, stacking the channel number of the feature map while reducing its spatial size, so as to extract the highest-dimensional abstract feature map; the converter outputs the extracted abstract feature map to the decoder;

In one embodiment, the fifth convolution block stacks the channel number of the encoder feature map to 1248 and reduces the feature map size to 16×16;

The decoder is connected to the segmentation unit; the decoder performs stage-by-stage upsampling on the abstract feature map input by the converter, generates a decoder feature map of the same size as the input image, and outputs it to the segmentation unit; the segmentation unit adjusts the number of model channels of the input decoder feature map and obtains the building segmentation result in the remote sensing image.

The stage-by-stage upsampling that the decoder performs on the abstract feature map input by the converter specifically comprises:

the sixth convolution block upsamples the abstract feature map input by the fifth convolution block and outputs the result to the seventh convolution block; the seventh convolution block upsamples the feature map input by the sixth convolution block and outputs the result to the eighth convolution block; the eighth convolution block upsamples the feature map input by the seventh convolution block and outputs the result to the ninth convolution block;
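The stage-by-stage doubling of spatial size can be illustrated with a simple nearest-neighbor upsample. This is a sketch only: nearest-neighbor interpolation is an assumption for illustration (the patent does not specify the interpolation method), and the real decoder blocks also apply convolutions after each upsampling step.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbor upsampling: repeat every pixel 2x along both spatial axes."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16.0).reshape(4, 4)   # e.g. a small abstract feature map
for stage in range(3):              # three decoder stages: 4 -> 8 -> 16 -> 32
    x = upsample2x(x)
assert x.shape == (32, 32)          # spatial size doubles at every stage
```

Chaining such stages recovers a decoder feature map at the resolution of the input image, which the segmentation unit then maps to class probabilities.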

In particular, the segmentation unit comprises a tenth convolution block, which consists of a convolutional layer and an activation function unit; the convolutional layer of the tenth convolution block performs a 1×1 convolution, and its activation function unit is the sigmoid function; the 1×1 convolution adjusts the number of model channels of the decoder feature map to the number of classes, and the sigmoid function yields the building segmentation result in the remote sensing image;
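The segmentation head just described (a 1×1 convolution mapping the channel count to the number of classes, followed by a sigmoid) can be sketched as a per-pixel linear map; for the single building class the output is one probability map. The weights below are random values invented for this sketch, not trained parameters.

```python
import numpy as np

def conv1x1_sigmoid(fmap, w, b):
    """1x1 convolution (C channels -> 1 class) followed by a sigmoid.

    fmap : (C, H, W) decoder feature map, w : (C,), b : scalar.
    Returns an (H, W) map of building probabilities in (0, 1).
    """
    logits = np.tensordot(w, fmap, axes=([0], [0])) + b   # per-pixel dot product over channels
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(42)
C, H, W = 8, 16, 16
prob = conv1x1_sigmoid(rng.normal(size=(C, H, W)), rng.normal(size=C), 0.0)
mask = prob > 0.5          # threshold to obtain the binary building segmentation mask
assert prob.shape == (H, W)
```

A 1×1 convolution changes only the channel dimension, so the spatial layout of the decoder feature map is preserved in the final probability map.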

In one embodiment, the parameters of the U-shaped attention control network AGs-Unet are set as shown in Table 1 below, where Conv denotes convolution, Maxpool denotes max pooling, Up denotes upsampling, AGs denotes the attention gate, and Up_conv denotes upsampling followed by convolution;

[Table 1: parameter settings of the AGs-Unet network (the original table is presented as an image)]

As shown in Figure 2, the attention gate specifically comprises an attention layer, a gate control layer, a first activation function unit, a linear transformation unit, a second activation function unit, and a resampler;

In the gate control layer, a gating signal is input to the attention gate; the gating signal comprises a control vector (g_i) and comes from the convolution block one stage before the current convolution block;

Specifically, the preceding convolution block of the sixth convolution block is the fifth convolution block; of the seventh, the sixth; of the eighth, the seventh; and of the ninth, the eighth;

The control vector is multiplied by the control coefficient to obtain a first product;

The control vector acts on every pixel to determine the focus region; it contains contextual information used to remove low-level feature responses;

In the attention layer, a feature map is input to the attention gate; the feature map input to the attention gate comes from the convolution block that is skip-connected to the current convolution block;

Specifically, the first convolution block is skip-connected to the ninth, so the feature map input to the ninth convolution block comes from the first; the second is skip-connected to the eighth, so the feature map input to the eighth comes from the second; the third is skip-connected to the seventh, so the feature map input to the seventh comes from the third; and the fourth is skip-connected to the sixth, so the feature map input to the sixth comes from the fourth;

The pixel vector x_i^l ∈ R^(F_l) of the feature map input to the attention gate is multiplied by the attention coefficient to obtain a second product, where F_l denotes the number of feature maps in layer l;

The first product and the second product are added, bias is applied, and the sum is activated by the first activation function unit to obtain a first activation result; the first activation result has its channel number adjusted by the linear transformation unit and bias applied, and is then activated by the second activation function unit to obtain a second activation result; the second activation result is fed into the resampler and adjusted to obtain the attention coefficient α; the feature map input to the attention layer is multiplied by the attention coefficient α, producing an output of the same size as that input feature map;

Specifically, the output of the attention gate is the pixel-wise product of the input feature map and the attention coefficient α:

x̂_{i,c}^l = x_{i,c}^l · α_i^l

where the attention coefficient α ∈ [0, 1];

Specifically, the attention gate obtains the attention coefficient through an additive attention mechanism, computed as follows:

q_att^l = ψ^T(σ1(W_x^T·x_i^l + W_g^T·g_i + b_g)) + b_ψ

α_i^l = σ2(q_att^l(x_i^l, g_i; Θ_att))

where σ2(x_{i,c}) = 1/(1 + exp(-x_{i,c})) is the sigmoid function;

The parameter set Θ_att of the attention gate comprises: W_x ∈ R^(F_l×F_int), W_g ∈ R^(F_g×F_int), ψ ∈ R^(F_int×1), b_g ∈ R^(F_int), and b_ψ ∈ R; where W_x is the control coefficient, W_g is the attention coefficient, ψ is the linear transformation, b_g is the first bias parameter, and b_ψ is the second bias parameter; F_int denotes the number of feature maps in the intermediate layer;

The first activation function σ1 executed by the first activation function unit of the attention gate is the ReLU function:

σ1(x_{i,c}) = max(0, x_{i,c})

where i denotes the spatial dimension and c denotes the channel dimension;

The linear transformation performed by the linear transformation unit changes the number of channels of the input vector through a 1×1×1 channel-wise convolution;

The second activation function σ2 executed by the second activation function unit of the attention gate is the sigmoid function;

Extracting buildings from remote sensing images is a single-class semantic segmentation task, so a single attention coefficient α suffices; the attention gate identifies salient image regions and suppresses irrelevant feature responses, while preserving and activating, to the greatest extent, only the neurons relevant to buildings;

Implementing the attention gate involves: training the control coefficient W_g; training the attention coefficient W_x; and P_s, which connects the two parts and adjusts the number of output channels.

In addition, the present invention also discloses a high-resolution remote sensing image building extraction apparatus based on a U-shaped attention control network, comprising an encoder, a converter, a decoder, and a segmentation unit; the encoder is connected to the converter, the converter to the decoder, and the decoder to the segmentation unit;

A remote sensing image is input to the encoder; the encoder uses global and local context information to extract features (x^l) at different grid-dimension levels and generates encoder feature maps; the encoder comprises four cascaded convolution blocks: a first, a second, a third, and a fourth convolution block;

In particular, each convolution block in the encoder comprises a convolutional layer (Conv), a max pooling layer (Maxpool), a batch normalization layer (BN), and an activation function unit (ReLU); the max pooling layer in each encoder convolution block builds a new feature map by extracting the maximum value of each local region of the original feature map, and prevents overfitting by reducing the number of parameters;

The encoder outputs the generated encoder feature maps to the converter;

The decoder has the same number of convolution blocks as the encoder; the decoder comprises four cascaded convolution blocks: a sixth, a seventh, an eighth, and a ninth convolution block;

The first convolution block is skip-connected to the ninth, the second to the eighth, the third to the seventh, and the fourth to the sixth;

The converter comprises a fifth convolution block and attention gates (AGs) in four skip connections; the converter links the corresponding feature maps of the encoder and the decoder; an attention gate is inserted into each skip connection between the convolution blocks of the encoder and the decoder; the converter extracts an abstract feature map and outputs it to the decoder;

In particular, the fifth convolution block comprises a convolutional layer (Conv), a max pooling layer (Maxpool), and a batch normalization layer (BN); the fifth convolution block abstracts the encoder feature maps to the highest level, stacking the channel number of the feature map while reducing its spatial size, so as to extract the highest-dimensional abstract feature map; the converter outputs the extracted abstract feature map to the decoder;

In one embodiment, the fifth convolution block stacks the channel number of the encoder feature map to 1248 and reduces the feature map size to 16×16;

The attention gates inserted into the skip connections screen out the feature points in the low-dimensional feature maps that are conducive to extracting buildings, while filtering and suppressing irrelevant features and nodes; the four attention gates extract effective features comprehensively and multi-dimensionally across four grid-dimension levels from low to high;

The converter links the corresponding feature maps of the encoder and the decoder, addressing the problem of vanishing gradients during backpropagation;

The decoder performs stage-by-stage upsampling on the abstract feature map input by the converter, generates a decoder feature map of the same size as the input image, and outputs it to the segmentation unit;

In particular, this stage-by-stage upsampling specifically comprises:

the sixth convolution block upsamples the abstract feature map input by the fifth convolution block and outputs the result to the seventh convolution block; the seventh convolution block upsamples the feature map input by the sixth convolution block and outputs the result to the eighth convolution block; the eighth convolution block upsamples the feature map input by the seventh convolution block and outputs the result to the ninth convolution block;

The segmentation unit adjusts the number of channels of the input decoder feature map and obtains the building segmentation result in the remote sensing image;

In particular, the segmentation unit comprises a tenth convolution block, which consists of a convolutional layer and an activation function unit;

The convolutional layer of the tenth convolution block performs a 1×1 convolution, and its activation function unit is the sigmoid function; the 1×1 convolution adjusts the number of model channels of the decoder feature map to the number of classes, and the sigmoid function yields the building segmentation result in the remote sensing image;

As shown in Figure 2, the attention gate specifically comprises an attention layer, a gate control layer, a first activation function unit, a linear transformation unit, a second activation function unit, and a resampler;

In the gate control layer, a gating signal is input to the attention gate; the gating signal comprises a control vector (g_i) and comes from the convolution block one stage before the current convolution block;

Specifically, the preceding convolution block of the sixth convolution block is the fifth convolution block; of the seventh, the sixth; of the eighth, the seventh; and of the ninth, the eighth;

The control vector is multiplied by the control coefficient to obtain a first product;

The control vector acts on every pixel to determine the focus region; it contains contextual information used to remove low-level feature responses;

In the attention layer, a feature map is input to the attention gate; the feature map input to the attention gate comes from the convolution block that is skip-connected to the current convolution block;

Specifically, the first convolution block is skip-connected to the ninth, so the feature map input to the ninth convolution block comes from the first; the second is skip-connected to the eighth, so the feature map input to the eighth comes from the second; the third is skip-connected to the seventh, so the feature map input to the seventh comes from the third; and the fourth is skip-connected to the sixth, so the feature map input to the sixth comes from the fourth;

The pixel vector x_i^l ∈ R^(F_l) of the feature map input to the attention gate is multiplied by the attention coefficient to obtain a second product, where F_l denotes the number of feature maps in layer l;

The first product and the second product are added, bias is applied, and the sum is activated by the first activation function unit to obtain a first activation result; the first activation result has its channel number adjusted by the linear transformation unit and bias applied, and is then activated by the second activation function unit to obtain a second activation result; the second activation result is fed into the resampler and adjusted to obtain the attention coefficient α; the feature map input to the attention layer is multiplied by the attention coefficient α, producing an output of the same size as that input feature map;

Specifically, the output of the attention gate is the pixel-wise product of the input feature map and the attention coefficient α:

x̂_{i,c}^l = x_{i,c}^l · α_i^l

where the attention coefficient α ∈ [0, 1];

Specifically, the attention gate obtains the attention coefficient through an additive attention mechanism, computed as follows:

q_att^l = ψ^T(σ1(W_x^T·x_i^l + W_g^T·g_i + b_g)) + b_ψ

α_i^l = σ2(q_att^l(x_i^l, g_i; Θ_att))

where σ2(x_{i,c}) = 1/(1 + exp(-x_{i,c})) is the sigmoid function;

The parameter set Θ_att of the attention gate comprises: W_x ∈ R^(F_l×F_int), W_g ∈ R^(F_g×F_int), ψ ∈ R^(F_int×1), b_g ∈ R^(F_int), and b_ψ ∈ R; where W_x is the control coefficient, W_g is the attention coefficient, ψ is the linear transformation, b_g is the first bias parameter, and b_ψ is the second bias parameter; F_int denotes the number of feature maps in the intermediate layer;

The first activation function σ1 executed by the first activation function unit of the attention gate is the ReLU function:

σ1(x_{i,c}) = max(0, x_{i,c})

where i denotes the spatial dimension and c denotes the channel dimension;

The linear transformation performed by the linear transformation unit changes the number of channels of the input vector through a 1×1×1 channel-wise convolution;

The second activation function σ2 executed by the second activation function unit of the attention gate is the sigmoid function;

Extracting buildings from remote sensing images is a single-class semantic segmentation task, so a single attention coefficient α suffices; the attention gate identifies salient image regions and suppresses irrelevant feature responses, while preserving and activating, to the greatest extent, only the neurons relevant to buildings;

Implementing the attention gate involves: training the control coefficient W_g; training the attention coefficient W_x; and P_s, which connects the two parts and adjusts the number of output channels;

In one embodiment, the parameters for implementing the attention valve are shown in Table 2 below:

[Table 2 is rendered as an image in the original publication; its contents are not recoverable here]

Table 2

In image classification tasks, the softmax activation function is used to normalize attention coefficients; however, using softmax repeatedly produces sparse activations at the output, which in turn leads to the vanishing-gradient problem; therefore, the sigmoid function is used in the attention valve, giving the attention valve's parameters better training convergence; in grid-based attention, the control signal is not a single global vector over all pixels but a grid signal adapted to the spatial information of the image;
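A small numeric illustration (hypothetical scores, not from the patent) of why sigmoid suits this single-class gating: softmax couples all pixels, so one strong response suppresses the rest, while sigmoid assigns each pixel an independent coefficient in (0, 1):

```python
import numpy as np

logits = np.array([4.0, 3.5, 1.0, -2.0])   # per-pixel attention scores

# softmax: coefficients compete and must sum to 1, so even a
# strongly positive pixel is suppressed by a stronger neighbour
softmax = np.exp(logits) / np.exp(logits).sum()

# sigmoid: each pixel is gated independently, so every
# building-like pixel can keep a coefficient near 1
sig = 1.0 / (1.0 + np.exp(-logits))

print(np.round(softmax, 3))
print(np.round(sig, 3))
```

With these numbers the two strongest pixels both keep coefficients above 0.97 under sigmoid, but under softmax they must share a probability mass of 1.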

As shown in Figure 1, the valve control signal in each skip connection aggregates multi-dimensional information, which increases the grid resolution of the query signal and achieves better performance;

The invention adds attention valves to the U-Net architecture to emphasize the coarse-scale features over the multi-dimensional wide area obtained through skip connections, and to disambiguate the irrelevant information and noise responses among them. This process is performed in the concatenation operation so that only the relevant activations are merged. In addition, the attention valve selectively filters out neurons that need not be activated in both forward propagation and backpropagation. Gradients originating from background regions are progressively down-weighted during backpropagation, so the shallow-layer model parameters are updated mainly on the spatial regions relevant to the building extraction task.

In the multi-dimensional attention valves, one vector corresponds to each grid scale; in each attention valve, complementary information is extracted and fused to define the output of the skip connection. To reduce the number of trainable parameters and the computational complexity of the attention valve, the linear transformations are 1×1×1 convolutions without any spatial support, and the input feature map is downsampled to the resolution of the valve control signal; the feature map is then decoded by the corresponding linear transformation and mapped to a low-dimensional space for the attention control operation.

The invention uses deep supervision to force the intermediate feature maps to be semantically discriminative at every image scale; deep supervision ensures that the attention valve units at different scales can promptly process the wide range of input feature maps, and it also prevents dense predictions from being reconstructed directly from a subset of the skip connections.

The validity and accuracy of the model are verified and compared through experiments, using Overall Accuracy (OA), Precision, and Intersection over Union (IoU) as model performance metrics; OA is the ratio of correctly classified pixels to the total number of test pixels, Precision is the percentage of pixels correctly classified as positive among all pixels predicted positive, and IoU describes segment-level accuracy;
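The three metrics can be computed directly from the confusion counts of a binary building mask; the following NumPy sketch (an illustration, not the patent's evaluation code) uses a tiny hypothetical prediction/ground-truth pair:

```python
import numpy as np

def building_metrics(pred, truth):
    """OA, Precision and IoU for a binary building mask (1 = building)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)      # building pixels correctly predicted
    fp = np.sum(pred & ~truth)     # background predicted as building
    fn = np.sum(~pred & truth)     # building pixels missed
    tn = np.sum(~pred & ~truth)    # background correctly predicted
    oa = (tp + tn) / pred.size
    precision = tp / (tp + fp) if tp + fp else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return oa, precision, iou

pred  = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
oa, precision, iou = building_metrics(pred, truth)
print(oa, precision, iou)   # OA = 4/6, Precision = 2/3, IoU = 2/4
```

Here tp = 2, fp = 1, fn = 1 and tn = 2, giving OA ≈ 0.667, Precision ≈ 0.667 and IoU = 0.5.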

Comparative experiments use the WHU building dataset, which consists of an aerial dataset and a satellite dataset; the aerial imagery dataset is used to compare the building extraction results of SegNet, FCN8s, DANet, U-Net, and AGs-Unet. Compared with SegNet, FCN8s, and DANet, U-Net and AGs-Unet produce smoother building edges on the validation data, and the buildings extracted by AGs-Unet have more accurate edges and fewer internal holes owing to the attention valves (AGs);

Analyzing individual images in the test set and selecting the buildings in them, SegNet, the model with the highest computational resource consumption, extracts buildings more completely than FCN8s and DANet; the U-Net model suffers from hollowing in the extracted buildings; AGs-Unet extracts smooth, accurate edge lines and relatively complete internal structures, giving the best extraction results;

Further analysis of other test-set images selects two details of an entire building: a protruding convex part and a recessed concave part. The SegNet prediction fails to extract the convex part of the building and produces large holes in the concave region; FCN8s and DANet also fail to extract the convex part, although FCN8s misses less of the concave feature region and shows relatively better predictions; the U-Net model extracts the convex part but its predictions are incomplete in the concave region; the AGs-Unet model extracts both the convex and concave details fairly accurately. From the overall analysis of the validation data, the buildings extracted by SegNet and FCN8s are incomplete and their edges lack smoothness, while AGs-Unet extracts smooth, accurate edge lines and relatively complete internal structures, giving the best extraction results.

Comparing the accuracy results in terms of overall accuracy, precision, and mean intersection over union, AGs-Unet achieves the best results, by a clear margin, on all three evaluation metrics.

Implementing the embodiments of the present invention has the following beneficial effects:

The AGs-Unet proposed by the invention integrates attention valves (AGs) into U-Net, solving the problem of extracting dense building targets in semantic segmentation models for remote sensing images; the attention valves eliminate the need for a building localization module in multi-stage convolutional neural networks (CNNs), reduce the number of model parameters and the training time, and improve the computational efficiency of the model; furthermore, attention valves (AGs) can be further extended and integrated into other convolutional neural networks to effectively accomplish tasks related to dense target prediction in image segmentation.

The grid-based control gate proposed by the invention makes the attention coefficients more specific to local regions, providing an implementation of a soft attention mechanism in a feed-forward CNN model applied to building extraction from remote sensing images; the proposed attention valve replaces the hard attention mechanisms used in image classification and the target localization models used in image segmentation frameworks;

The invention proposes an extension of the standard U-Net model that improves its sensitivity to foreground pixels without requiring complex heuristics; experimental comparison shows that, on the WHU dataset, the proposed AGs-Unet further improves the accuracy of building extraction from remote sensing images relative to the standard U-Net model.

The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments, or substitute equivalents for some of their technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A high-resolution remote sensing image building extraction method based on a U-shaped attention control network is characterized by comprising the following steps:
inputting the remote sensing image to an encoder; the encoder extracts features at different grid dimension levels by using global context information and local context information and generates an encoder feature map; the encoder comprises four cascaded convolution blocks, which are respectively a first convolution block, a second convolution block, a third convolution block and a fourth convolution block;
the encoder is connected to the converter; the encoder outputs the extracted encoder feature map to the converter; the converter comprises a fifth convolution block and attention valves in four skip connections;
the converter is connected to a decoder; an attention valve is cascaded in each skip connection between convolution blocks of said encoder and decoder; the converter extracts an abstract feature map and outputs the abstract feature map to the decoder;
the decoder comprises four cascaded convolution blocks, which are a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block respectively; the first convolution block is skip-connected to the ninth convolution block, the second convolution block is skip-connected to the eighth convolution block, the third convolution block is skip-connected to the seventh convolution block, and the fourth convolution block is skip-connected to the sixth convolution block;
the decoder is connected to a segmentation unit; the decoder performs step-by-step up-sampling on the abstract feature map input by the converter to generate a decoder feature map with the same size as the input image, and outputs the decoder feature map to the segmentation unit; and the segmentation unit adjusts the number of channels of the input decoder feature map to obtain a building segmentation result in the remote sensing image.
2. The method for extracting the high-resolution remote sensing image building based on the U-shaped attention control network as claimed in claim 1,
each convolution block in the encoder comprises a convolution layer, a maximum pooling layer, a normalization layer and an activation function unit; the maximum pooling layer in a convolution block of the encoder extracts the maximum value of each local area in the original feature map to construct a new feature map, preventing overfitting by reducing the number of parameters;
the fifth convolution block comprises a convolution layer, a maximum pooling layer and a normalization layer; the fifth convolution block abstracts the encoder feature map input by the encoder to the highest level, increasing the number of channels of the feature map and reducing its size, thereby extracting the abstract feature map of the highest dimension.
3. The method for extracting the high-resolution remote sensing image building based on the U-shaped attention control network as claimed in claim 1,
wherein the decoder performs step-by-step up-sampling on the abstract feature map input by the converter, specifically comprising: the sixth convolution block performs up-sampling on the abstract feature map input by the fifth convolution block and outputs the result to the seventh convolution block; the seventh convolution block performs up-sampling on the feature map input by the sixth convolution block and outputs the result to the eighth convolution block; the eighth convolution block performs up-sampling on the feature map input by the seventh convolution block and outputs the result to the ninth convolution block;
wherein the segmentation unit comprises a tenth convolution block; the tenth convolution block comprises a convolution layer and an activation function unit; the convolution layer of the tenth convolution block performs 1 × 1 convolution, and the activation function unit of the tenth convolution block is a Sigmoid function; the 1 × 1 convolution of the tenth convolution block adjusts the number of channels of the decoder feature map to the number of categories, and the building segmentation result in the remote sensing image is obtained through the Sigmoid function.
4. The method for extracting the high-resolution remote sensing image building based on the U-shaped attention control network as claimed in claim 1,
the attention valve comprises an attention layer, a valve control layer, a first activation function unit, a linear transformation unit, a second activation function unit and a resampler;
in the valve control layer, a valve control signal is input to the attention valve, the valve control signal comprising a control vector; wherein the valve control signal comes from the previous-stage convolution block of the current convolution block;
the control vector is multiplied by the control coefficient to obtain a first product;
in the attention layer, a feature map is input to the attention valve; wherein the feature map input to the attention valve comes from the convolution block that is skip-connected to the current convolution block;
a pixel vector of the feature map input to the attention valve is multiplied by the attention coefficient to obtain a second product;
the first product and the second product are added and biased, and the sum is activated by the first activation function unit to obtain a first activation result; the first activation result undergoes channel-number adjustment and biasing by the linear transformation unit and is then activated by the second activation function unit to obtain a second activation result; the second activation result is input to the resampler and adjusted to obtain the attention coefficient α; the feature map input to the attention layer is multiplied by the attention coefficient α, and a multiplication result with the same size as that feature map is output.
5. The method for extracting the high-resolution remote sensing image building based on the U-shaped attention control network as claimed in claim 4,
the attention valve obtains the attention coefficient by using an additive attention mechanism; the additive attention mechanism is calculated as follows:

q_att = ψ^T (σ1(Wx^T xi + Wg^T gi + bg)) + bψ

α = σ2(q_att(xi, gi; Θatt))

wherein

σ2(xi,c) = 1/(1 + exp(−xi,c))

is the sigmoid function;
the parameter set Θatt of the attention valve comprises:

Wx ∈ R^(Fl×Fint), Wg ∈ R^(Fg×Fint), ψ ∈ R^(Fint×1), bg ∈ R^(Fint), bψ ∈ R;

wherein Wx is the control coefficient, Wg is the attention coefficient, ψ is the linear transformation, bg is the first bias parameter, and bψ is the second bias coefficient; Fl denotes the number of feature maps in layer l, and Fint denotes the input layer;
wherein the first activation function performed by the first activation function unit of the attention valve is as follows:

σ1(xi,c) = max(0, xi,c)

wherein i represents the spatial dimension and c represents the channel dimension;
wherein the linear transformation performed by the linear transformation unit changes the number of channels of the input vector through a 1 × 1 × 1 channel convolution;
wherein the second activation function σ2 executed by the second activation function unit of the attention valve is the sigmoid function;
the result of multiplying the input feature map by the attention coefficient α at the pixel level is the output of the attention valve, as shown in the following equation:

x̂i,c = xi,c · αi

wherein the attention coefficient α ∈ [0, 1].
6. A high-resolution remote sensing image building extraction device based on a U-shaped attention control network is characterized by comprising an encoder, a converter, a decoder and a segmentation unit; the encoder is connected to the converter, the converter is connected to the decoder, and the decoder is connected to the dividing unit;
inputting a remote sensing image to the encoder; the encoder extracts features at different grid dimension levels by using global context information and local context information and generates an encoder feature map; the encoder comprises four cascaded convolution blocks, which are respectively a first convolution block, a second convolution block, a third convolution block and a fourth convolution block; the encoder outputs the extracted encoder feature map to the converter;
the decoder comprises four cascaded convolution blocks, which are a sixth convolution block, a seventh convolution block, an eighth convolution block and a ninth convolution block respectively; the first convolution block is skip-connected to the ninth convolution block, the second convolution block is skip-connected to the eighth convolution block, the third convolution block is skip-connected to the seventh convolution block, and the fourth convolution block is skip-connected to the sixth convolution block;
the converter comprises a fifth convolution block and attention valves in four skip connections; the converter connects the corresponding feature maps of the encoder and the decoder; an attention valve is cascaded in each skip connection between convolution blocks of the encoder and the decoder; the converter extracts an abstract feature map and outputs the abstract feature map to the decoder;
the decoder performs step-by-step up-sampling on the abstract feature map input by the converter to generate a decoder feature map with the same size as the input image, and outputs the decoder feature map to the segmentation unit; and the segmentation unit adjusts the number of channels of the input decoder feature map to obtain a building segmentation result in the remote sensing image.
7. The building extraction device of high resolution remote sensing image based on U-shaped attention control network as claimed in claim 6,
each convolution block in the encoder comprises a convolution layer, a maximum pooling layer, a normalization layer and an activation function unit; the maximum pooling layer in a convolution block of the encoder extracts the maximum value of each local area in the original feature map to construct a new feature map, preventing overfitting by reducing the number of parameters;
the fifth convolution block comprises a convolution layer, a maximum pooling layer and a normalization layer; the fifth convolution block abstracts the encoder feature map input by the encoder to the highest level, increasing the number of channels of the feature map and reducing its size, thereby extracting the abstract feature map of the highest dimension.
8. The building extraction device of high resolution remote sensing image based on U-shaped attention control network as claimed in claim 6,
wherein the decoder performs step-by-step up-sampling on the abstract feature map input by the converter, specifically comprising: the sixth convolution block performs up-sampling on the abstract feature map input by the fifth convolution block and outputs the result to the seventh convolution block; the seventh convolution block performs up-sampling on the feature map input by the sixth convolution block and outputs the result to the eighth convolution block; the eighth convolution block performs up-sampling on the feature map input by the seventh convolution block and outputs the result to the ninth convolution block;
wherein the segmentation unit comprises a tenth convolution block; the tenth convolution block comprises a convolution layer and an activation function unit; the convolution layer of the tenth convolution block performs 1 × 1 convolution, and the activation function unit of the tenth convolution block is a Sigmoid function; the 1 × 1 convolution of the tenth convolution block adjusts the number of channels of the decoder feature map to the number of categories, and the building segmentation result in the remote sensing image is obtained through the Sigmoid function.
9. The building extraction device of high resolution remote sensing image based on U-shaped attention control network as claimed in claim 6,
the attention valve comprises an attention layer, a valve control layer, a first activation function unit, a linear transformation unit, a second activation function unit and a resampler;
in the valve control layer, a valve control signal is input to the attention valve, the valve control signal comprising a control vector; wherein the valve control signal comes from the previous-stage convolution block of the current convolution block;
the control vector is multiplied by the control coefficient to obtain a first product;
in the attention layer, a feature map is input to the attention valve; wherein the feature map input to the attention valve comes from the convolution block that is skip-connected to the current convolution block;
a pixel vector of the feature map input to the attention valve is multiplied by the attention coefficient to obtain a second product;
the first product and the second product are added and biased, and the sum is activated by the first activation function unit to obtain a first activation result; the first activation result undergoes channel-number adjustment and biasing by the linear transformation unit and is then activated by the second activation function unit to obtain a second activation result; the second activation result is input to the resampler and adjusted to obtain the attention coefficient α; the feature map input to the attention layer is multiplied by the attention coefficient α, and a multiplication result with the same size as that feature map is output.
10. The building extraction device of high resolution remote sensing image based on U-shaped attention control network as claimed in claim 9,
the attention valve obtains the attention coefficient by using an additive attention mechanism; the additive attention mechanism is calculated as follows:

q_att = ψ^T (σ1(Wx^T xi + Wg^T gi + bg)) + bψ

α = σ2(q_att(xi, gi; Θatt))

wherein

σ2(xi,c) = 1/(1 + exp(−xi,c))

is the sigmoid function;
the parameter set Θatt of the attention valve comprises:

Wx ∈ R^(Fl×Fint), Wg ∈ R^(Fg×Fint), ψ ∈ R^(Fint×1), bg ∈ R^(Fint), bψ ∈ R;

wherein Wx is the control coefficient, Wg is the attention coefficient, ψ is the linear transformation, bg is the first bias parameter, and bψ is the second bias coefficient; Fl denotes the number of feature maps in layer l, and Fint denotes the input layer;
wherein the first activation function performed by the first activation function unit of the attention valve is as follows:

σ1(xi,c) = max(0, xi,c)

wherein i represents the spatial dimension and c represents the channel dimension;
wherein the linear transformation performed by the linear transformation unit changes the number of channels of the input vector through a 1 × 1 × 1 channel convolution;
wherein the second activation function σ2 executed by the second activation function unit of the attention valve is the sigmoid function;
the result of multiplying the input feature map by the attention coefficient α at the pixel level is the output of the attention valve, as shown in the following equation:

x̂i,c = xi,c · αi

wherein the attention coefficient α ∈ [0, 1].
CN202110975846.7A 2021-08-24 2021-08-24 Method and device for building extraction from high-resolution remote sensing images based on U-shaped attention control network Pending CN114037891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110975846.7A CN114037891A (en) 2021-08-24 2021-08-24 Method and device for building extraction from high-resolution remote sensing images based on U-shaped attention control network


Publications (1)

Publication Number Publication Date
CN114037891A true CN114037891A (en) 2022-02-11

Family

ID=80134343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110975846.7A Pending CN114037891A (en) 2021-08-24 2021-08-24 Method and device for building extraction from high-resolution remote sensing images based on U-shaped attention control network

Country Status (1)

Country Link
CN (1) CN114037891A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033998A (en) * 2018-07-04 2018-12-18 北京航空航天大学 Remote sensing image atural object mask method based on attention mechanism convolutional neural networks
CN109885673A (en) * 2019-02-13 2019-06-14 北京航空航天大学 An automatic text summarization method based on pretrained language model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CONG WU et al.: "DA-U-Net: Densely Connected Convolutional Networks and Decoder with Attention Gate for Retinal Vessel Segmentation", IOP CONFERENCE SERIES: MATERIALS SCIENCE AND ENGINEERING, 31 December 2019 (2019-12-31), pages 2-4 *
ZHANG JX et al.: "Attention Gate ResU-Net for Automatic MRI Brain Tumor Segmentation", IEEE ACCESS, 6 March 2020 (2020-03-06) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359120A (en) * 2022-03-21 2022-04-15 深圳市华付信息技术有限公司 Remote sensing image processing method, device, equipment and storage medium
CN114359120B (en) * 2022-03-21 2022-06-21 深圳市华付信息技术有限公司 Remote sensing image processing method, device, equipment and storage medium
CN114821510A (en) * 2022-05-26 2022-07-29 重庆长安汽车股份有限公司 Lane line detection method and device based on improved U-Net network
CN116665053A (en) * 2023-05-30 2023-08-29 浙江时空智子大数据有限公司 High-resolution remote sensing image building identification method and system considering shadow information
CN116665053B (en) * 2023-05-30 2023-11-07 浙江时空智子大数据有限公司 High-resolution remote sensing image building identification method and system considering shadow information
CN118096541A (en) * 2024-04-28 2024-05-28 山东省淡水渔业研究院(山东省淡水渔业监测中心) Fishery remote sensing test image data processing method
CN118096541B (en) * 2024-04-28 2024-06-25 山东省淡水渔业研究院(山东省淡水渔业监测中心) Fishery remote sensing test image data processing method

Similar Documents

Publication Publication Date Title
CN113298818B (en) Building Segmentation Method of Remote Sensing Image Based on Attention Mechanism and Multi-scale Features
CN110263705B (en) Two phases of high-resolution remote sensing image change detection system for the field of remote sensing technology
CN109446992B (en) Remote sensing image building extraction method and system based on deep learning, storage medium and electronic equipment
Sun et al. Classification for remote sensing data with improved CNN-SVM method
CN108154192B (en) High-resolution SAR terrain classification method based on multi-scale convolution and feature fusion
CN110533631B (en) SAR Image Change Detection Method Based on Pyramid Pooling Siamese Network
CN114037891A (en) Method and device for building extraction from high-resolution remote sensing images based on U-shaped attention control network
CN111612008B (en) Image segmentation method based on convolution network
CN111723798B (en) Multi-instance natural scene text detection method based on relevance hierarchy residual errors
CN102810158B (en) High-resolution remote sensing target extraction method based on multi-scale semantic model
Yang et al. Concrete crack segmentation based on UAV-enabled edge computing
CN111291826B (en) A pixel-by-pixel classification method for multi-source remote sensing images based on correlation fusion network
CN110516539A (en) Method, system, storage medium and equipment for extracting buildings from remote sensing images based on confrontation network
CN111008644B (en) Ecological change monitoring method based on local dynamic energy function FCN-CRF model
CN110059768A (en) The semantic segmentation method and system of the merging point and provincial characteristics that understand for streetscape
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
CN112766280A (en) Remote sensing image road extraction method based on graph convolution
CN112270285B (en) SAR image change detection method based on sparse representation and capsule network
CN114972753B (en) Lightweight semantic segmentation method and system based on context information aggregation and assisted learning
CN111369442A (en) Remote sensing image super-resolution reconstruction method based on fuzzy kernel classification and attention mechanism
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN114943894A (en) ConvCRF-based high-resolution remote sensing image building extraction optimization method
Li et al. High-resolution SAR change detection based on ROI and SPP net
CN111640087B (en) An Image Change Detection Method Based on SAR Deep Fully Convolutional Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination