CN115631412A - Remote sensing image building extraction method based on coordinate attention and data correlation upsampling - Google Patents
Remote sensing image building extraction method based on coordinate attention and data correlation upsampling
- Publication number
- CN115631412A (application number CN202211270279.6A)
- Authority
- CN
- China
- Prior art keywords
- building
- image
- remote sensing
- data
- cad
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/176—Urban or other man-made structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
Description
Technical Field
The present invention relates to the technical field of image processing, and in particular to a method for extracting buildings from remote sensing images based on coordinate attention and data-dependent upsampling.
Background Art
Buildings are indispensable places of activity in people's daily lives and an important component of urban construction and development. The main task of building extraction is to identify and extract building regions from remote sensing images. Building extraction is of great significance for smart city construction, traffic management, population estimation, land-use monitoring, and other applications. With the rapid development of remote sensing technology, remote sensing images have begun to transition from low resolution to high resolution, forming a development trend characterized by high spatial, spectral, and temporal resolution. As the features and information contained in high-resolution remote sensing images keep increasing, noise and interference increase accordingly, which brings new challenges to building extraction; how to accurately extract buildings from high-resolution remote sensing imagery has become a research hotspot and difficulty.
Traditional building extraction methods are usually based on prior knowledge and handcrafted features, followed by clustering or similar algorithms, and mainly include methods based on building features and methods based on auxiliary information. Most of these methods rely on features such as building shape and texture, or on auxiliary information, to extract buildings. Although the underlying principles are relatively simple, these methods suffer from low recognition rates and frequent errors, and the process is time-consuming and labor-intensive, so they have major limitations in practical applications and their performance is severely constrained. Specifically:
First, little attention is paid to the location information of buildings. Location information is extremely important for building extraction: buildings are usually distributed regularly within a remote sensing image and are often partially occluded by shadows from trees and other objects. Focusing on location information makes it possible to locate a building accurately within the image and avoid misclassification. The prior art does not fully consider building location information, especially for buildings under shadow occlusion or with complex adhesion, so misclassification occurs easily.
Second, the extracted building boundaries are rough and blurred. Buildings are mostly rectangular and usually have regular boundaries, and boundary information is an important feature that cannot be ignored in building extraction. If edge information is neglected during extraction, problems such as rough and blurred boundaries, chaotic boundaries, and holes easily arise. The prior art performs only conventional feature extraction and fails to fully exploit the edge features of buildings, resulting in poor extraction results with rough and blurred boundaries.
Third, there is an imbalance between positive and negative samples. Building extraction is a binary classification task that mainly distinguishes buildings from background. In most cases, however, remote sensing images contain more background pixels than building pixels, which weakens the model's ability to extract buildings during training. The prior art fails to fully account for this imbalance, so the building extraction accuracy of the model is low and its generalization ability is weak.
Summary of the Invention
The purpose of the present invention is to provide a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling that effectively solves the misclassification problem caused by insufficient attention to building location information, improves the boundary extraction of buildings, alleviates the imbalance between positive and negative samples, and improves the generalization ability of the network.
To achieve the above purpose, the present invention adopts the following technical solution: a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, comprising the following steps in order:
(1) Acquire remote sensing data: download the WHU building dataset and the Massachusetts building dataset;
(2) Data preprocessing and data augmentation: the preprocessing refers to cropping the large images in the datasets and applying data augmentation to the cropped remote sensing images and label images; the augmented remote sensing images and label images are divided into a training set, a validation set, and a test set at a ratio of 8:1:1;
(3) Build the CAD-UNet network model: improve on the UNet network to build a building extraction network model comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module, namely the CAD-UNet network model;
(4) Model training and evaluation: train the CAD-UNet network model on the training set with a joint loss function combining the BCE (binary cross-entropy) loss and the Focal loss; after training, evaluate the building extraction accuracy and effect of the CAD-UNet network model on the test set;
(5) Automatic building extraction: after data preprocessing, feed a new remote sensing image to be processed into the trained CAD-UNet network model; the model outputs a predicted image, which is the building extraction result.
Step (2) specifically includes the following steps:
(2a) Crop the large remote sensing images and label images in the Massachusetts building dataset into 512×512 images by means of a sliding window, and label the building pixels as 1 and the background pixels as 0 in the label images of both the WHU and Massachusetts building datasets;
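A minimal sketch of the sliding-window cropping in step (2a) is given below; the non-overlapping stride and the NumPy-style array layout are assumptions, since the patent only specifies the 512×512 tile size.

```python
import numpy as np

def sliding_window_crop(image, tile=512, stride=512):
    """Crop a large remote sensing image (or its label mask) into tile x tile patches.
    A stride equal to the tile size (non-overlapping windows) is an assumption."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patches.append(image[y:y + tile, x:x + tile])
    return patches
```

If the tile size does not divide the image size evenly, padding or an overlapping stride would be needed to cover the whole image; the patent does not specify this choice.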
(2b) Apply data augmentation to the remote sensing images and label images of the WHU building dataset and of the cropped Massachusetts building dataset to enlarge the amount of data. The data augmentation includes:
Horizontal flipping: use the image processing library OpenCV to flip the remote sensing images and label images horizontally;
Vertical flipping: use OpenCV to flip the remote sensing images and label images vertically;
Horizontal-vertical flipping: use OpenCV to flip the remote sensing images and label images first horizontally and then vertically;
Shifting, scaling, random cropping, and noise addition: apply shifting, scaling, random cropping, and noise addition to the remote sensing images and label images respectively;
(2c) Divide the augmented remote sensing images and label images into a training set, a validation set, and a test set at a ratio of 8:1:1. The training set directly participates in the training of the CAD-UNet network model for feature extraction; the validation set is used to tune the hyperparameters of the CAD-UNet network model; the test set is used to test the accuracy and extraction effect of the CAD-UNet network model after training.
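The flips and geometric transforms of step (2b) can be sketched with OpenCV as below; the shift range and noise level are illustrative assumptions, and random cropping and scaling are omitted for brevity.

```python
import cv2
import numpy as np

def augment_pair(image, label):
    """Generate augmented copies of a remote sensing image and its label mask (step (2b)).
    Parameter values (shift range, noise sigma) are assumptions."""
    pairs = [(image, label)]
    pairs.append((cv2.flip(image, 1), cv2.flip(label, 1)))    # horizontal flip
    pairs.append((cv2.flip(image, 0), cv2.flip(label, 0)))    # vertical flip
    pairs.append((cv2.flip(image, -1), cv2.flip(label, -1)))  # horizontal + vertical flip

    # random shift applied identically to image and label (nearest interpolation keeps labels binary)
    h, w = image.shape[:2]
    m = np.float32([[1.0, 0, np.random.randint(-20, 21)],
                    [0, 1.0, np.random.randint(-20, 21)]])
    pairs.append((cv2.warpAffine(image, m, (w, h)),
                  cv2.warpAffine(label, m, (w, h), flags=cv2.INTER_NEAREST)))

    # additive Gaussian noise on the image only; the label mask is left unchanged
    noisy = np.clip(image.astype(np.float32) + np.random.normal(0, 10, image.shape), 0, 255)
    pairs.append((noisy.astype(image.dtype), label))
    return pairs
```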
Step (3) specifically includes the following steps:
(3a) Replace the UNet encoder: replace the UNet encoder with a VGG16 network module, which consists of the VGG16 network with its last pooling layer and fully connected layers removed. The VGG16 module performs downsampling through multiple convolutions and four max-pooling operations, extracts building features from the remote sensing image, and outputs feature maps at four different scales;
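A sketch of the replaced encoder in step (3a), using torchvision's VGG16 with the final max-pooling layer and the classifier removed; exactly which stage outputs serve as the four multi-scale feature maps is an assumption.

```python
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    """VGG16 backbone without its last pooling layer and fully connected layers (step (3a)).
    Pass torchvision VGG16 weights to use ImageNet pre-training, as in step (4c)."""
    def __init__(self, weights=None):
        super().__init__()
        features = vgg16(weights=weights).features
        # Split the convolutional part into five stages; drop the final max-pool (features[30]).
        self.stage1 = features[:4]    # conv1_1..conv1_2         -> 64  channels, 512x512
        self.stage2 = features[4:9]   # pool1, conv2_1..conv2_2  -> 128 channels, 256x256
        self.stage3 = features[9:16]  # pool2, conv3_1..conv3_3  -> 256 channels, 128x128
        self.stage4 = features[16:23] # pool3, conv4_1..conv4_3  -> 512 channels, 64x64
        self.stage5 = features[23:30] # pool4, conv5_1..conv5_3  -> 512 channels, 32x32

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        # f1..f4 are taken here as the four multi-scale skip features; f5 is the deepest map.
        return f1, f2, f3, f4, f5
```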
(3b) Construct the coordinate attention (CA) module: embed the CA module at the skip connections of the UNet network obtained in step (3a);
The CA module captures long-range dependencies and preserves positional information along the two spatial directions separately, encoding the feature map into two feature maps that are respectively direction-aware and position-sensitive. It takes any intermediate tensor X = [x1, x2, x3, ..., xC] ∈ R^(C×H×W) as input and outputs a tensor Y = [y1, y2, y3, ..., yC] of the same size. Specifically, pooling kernels of size (H, 1) and (1, W) are applied to the input X to encode each channel along the horizontal and vertical directions. The output of the c-th channel at height h is expressed as:
z_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i) (1)
where H is the height of the image, W is the width, C is the total number of channels, c denotes the c-th channel, x_c is the c-th channel of the image, i is the horizontal coordinate, and R denotes the space of intermediate tensors;
Similarly, the output of the c-th channel at width w is expressed as:
z_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w) (2)
Formulas (1) and (2) are two feature-aggregation transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature maps and then transforms them with a shared 1×1 convolution, as shown in formula (3):
f = δ(F1([z^H, z^W])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information in the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors, f^H ∈ R^(C/r×H) and f^W ∈ R^(C/r×W), where r is the reduction ratio that controls the number of channels. Two further 1×1 convolution transformations F^H and F^W transform f^H and f^W into two tensors g^H and g^W with the same number of feature channels:
g^H = σ(F^H(f^H)) (4)
g^W = σ(F^W(f^W)) (5)
where F^H and F^W are 1×1 convolution transformations, f^H and f^W are the two tensors obtained by splitting f, g^H and g^W are the tensors obtained after the convolution transformations and the activation function, and σ is the sigmoid activation function. During the transformation, the reduction ratio r is used to reduce the number of channels of f, and the outputs g^H and g^W are then expanded and used as attention weights. The final output of the CA module is given by formula (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
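A PyTorch sketch of the coordinate attention block described by equations (1)-(6) follows; the reduction ratio, the minimum hidden width, and the choice of BatchNorm plus ReLU for the non-linearity δ are assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention (step (3b)): pool along H and W separately, share a 1x1 conv,
    then split into two attention maps that reweight the input. Reduction ratio r is an assumption."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1): average over width, eq. (1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W): average over height, eq. (2)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared transform F1, eq. (3)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                       # non-linear activation delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F^H, eq. (4)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F^W, eq. (5)

    def forward(self, x):
        b, c, h, w = x.size()
        zh = self.pool_h(x)                           # (B, C, H, 1)
        zw = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1), so the two maps can be concatenated
        f = self.act(self.bn(self.conv1(torch.cat([zh, zw], dim=2))))  # eq. (3)
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.conv_h(fh))                       # (B, C, H, 1), eq. (4)
        gw = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))   # (B, C, 1, W), eq. (5)
        return x * gh * gw                                        # eq. (6)
```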
(3c) Construct the data-dependent upsampling (DUp) module: combine convolutional layers with data-dependent upsampling to build the DUp module, which is used to extract the boundary information of buildings at high resolution. Each of the four input feature maps of different scales first passes through a 3×3 convolutional layer to reduce its number of channels; data-dependent upsampling then restores the feature map directly to a size of 512×512; the four upsampled feature maps are fused by point-wise addition and output from the DUp module;
The CAD-UNet network model is obtained from steps (3a)-(3c).
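A sketch of the DUp branch in step (3c); the channel widths, the scale factors, and the use of a 1×1 projection followed by a pixel rearrangement to realize data-dependent upsampling are assumptions based on the common DUpsampling formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DUpsampling(nn.Module):
    """Data-dependent upsampling: a learned 1x1 projection followed by pixel shuffle,
    restoring a feature map to full resolution in one step instead of bilinear interpolation."""
    def __init__(self, in_channels, out_channels, scale):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(in_channels, out_channels * scale * scale, kernel_size=1)

    def forward(self, x):
        return F.pixel_shuffle(self.proj(x), self.scale)

class DUpModule(nn.Module):
    """DUp module (step (3c)): 3x3 conv to reduce channels, data-dependent upsampling of each of the
    four multi-scale maps to 512x512, then point-wise addition. Channel widths are assumptions."""
    def __init__(self, in_channels=(64, 128, 256, 512), mid_channels=64, out_channels=64,
                 scales=(1, 2, 4, 8)):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, mid_channels, kernel_size=3, padding=1)
                                     for c in in_channels])
        self.dup = nn.ModuleList([DUpsampling(mid_channels, out_channels, s) for s in scales])

    def forward(self, feats):
        # feats: four feature maps ordered shallow to deep, e.g. 512x512, 256x256, 128x128, 64x64
        outs = [dup(conv(f)) for f, conv, dup in zip(feats, self.reduce, self.dup)]
        return torch.stack(outs, dim=0).sum(dim=0)  # point-wise additive fusion
```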
Step (4) specifically includes the following steps:
(4a) Construct the joint loss function: construct a joint loss function combining the BCE binary cross-entropy loss and the Focal loss. The BCE loss and the Focal loss are given by:
BL(pt, target) = -ω·(target·ln(pt) + (1 - target)·ln(1 - pt)) (7)
where pt is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight value;
FL(pt) = -α(1 - pt)^γ·log(pt) (8)
where pt is the prediction of the CAD-UNet network model, α is a balancing parameter used to balance the ratio of positive to negative samples, with value range (0, 1]; γ is a focusing parameter used to down-weight the loss of easily classified samples, with value range [0, +∞);
The joint loss function is given by formula (9):
Loss = BL + FL (9)
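A sketch of the joint loss of equations (7)-(9); the definition of pt for background pixels and the mean reduction follow the standard focal-loss convention and are assumptions beyond what the patent states.

```python
import torch
import torch.nn as nn

class BCEFocalLoss(nn.Module):
    """Joint loss of equations (7)-(9): binary cross-entropy plus focal loss.
    Default omega/alpha/gamma follow the values given in step (4b)."""
    def __init__(self, omega=1.0, alpha=0.5, gamma=2.0, eps=1e-7):
        super().__init__()
        self.omega, self.alpha, self.gamma, self.eps = omega, alpha, gamma, eps

    def forward(self, pred, target):
        # pred: sigmoid probabilities in (0, 1); target: 0/1 building mask of the same shape
        pred = pred.clamp(self.eps, 1.0 - self.eps)
        bce = -self.omega * (target * torch.log(pred) + (1 - target) * torch.log(1 - pred))  # eq. (7)
        pt = torch.where(target == 1, pred, 1 - pred)
        focal = -self.alpha * (1 - pt) ** self.gamma * torch.log(pt)                          # eq. (8)
        return (bce + focal).mean()                                                           # eq. (9)
```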
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
(4c) Training strategy: use the pre-trained weights of the VGG16 network and adopt a freeze-training scheme, in which the parameters of the backbone are frozen for the first 100 epochs and trained normally for the last 100 epochs, for a total of 200 epochs per experiment;
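The freeze-then-unfreeze schedule of step (4c) can be sketched as follows; the attribute name `model.encoder` and the training-loop helper are placeholders, not names taken from the patent.

```python
def set_backbone_frozen(model, frozen=True):
    """Freeze or release the VGG16 backbone parameters (step (4c)).
    `model.encoder` is an assumed attribute name for the backbone."""
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

# usage sketch: 200 epochs in total, backbone frozen for the first 100
# for epoch in range(200):
#     set_backbone_frozen(model, frozen=(epoch < 100))
#     train_one_epoch(model, train_loader, optimizer, criterion)  # placeholder helper
```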
(4d) Model accuracy evaluation: use the evaluation metrics Precision and intersection-over-union (IoU) to evaluate accuracy. The metrics are computed as shown in formulas (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP denotes pixels whose true value is positive and that the model judges as positive; FP denotes pixels whose true value is negative but that the model judges as positive; FN denotes pixels whose true value is positive but that the model judges as negative.
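A pixel-level implementation of equations (10) and (11); the small epsilon guarding against empty masks is an added assumption.

```python
import numpy as np

def precision_and_iou(pred_mask, gt_mask):
    """Pixel-level Precision and IoU of equations (10)-(11); inputs are 0/1 mask arrays."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # predicted building, truly building
    fp = np.logical_and(pred, ~gt).sum()   # predicted building, actually background
    fn = np.logical_and(~pred, gt).sum()   # predicted background, actually building
    precision = tp / (tp + fp + 1e-7)      # eq. (10)
    iou = tp / (tp + fp + fn + 1e-7)       # eq. (11)
    return precision, iou
```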
Step (5) specifically includes the following steps:
(5a) After data preprocessing, resize the new remote sensing image to be processed to 512×512;
(5b) Feed the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result. Pixels predicted as building are given the value 255 and pixels predicted as background are given the value 0, so in the predicted image the white regions are buildings and the black regions are background.
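A sketch of the prediction step (5); the normalization, the single-channel logit output, and the 0.5 threshold are assumptions about details the patent leaves open.

```python
import cv2
import numpy as np
import torch

def extract_buildings(model, image_path, device="cuda"):
    """Inference sketch for step (5): resize the image to 512x512, run the trained CAD-UNet,
    and write buildings as 255 and background as 0."""
    image = cv2.imread(image_path)
    image = cv2.resize(image, (512, 512))
    x = torch.from_numpy(image.astype(np.float32) / 255.0).permute(2, 0, 1).unsqueeze(0).to(device)
    model.eval()
    with torch.no_grad():
        prob = torch.sigmoid(model(x))[0, 0].cpu().numpy()  # assumes a single-channel logit map
    mask = (prob > 0.5).astype(np.uint8) * 255               # white = building, black = background
    return mask
```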
From the above technical solution, the beneficial effects of the present invention are as follows. First, the building extraction accuracy is high: compared with other methods, the network designed by the present invention progressively extracts the deep features of buildings and, after feature fusion, gradually upsamples back to the input resolution, which is better suited to the building extraction task and significantly improves extraction accuracy. Second, the building boundaries are extracted better: the coordinate attention (CA) module and the data-dependent upsampling (DUp) module added and constructed by the present invention effectively capture the location and boundary information of buildings, so the extracted buildings have smoother boundaries and more complete outlines. Third, the network has few parameters and is easy to train: the coordinate attention used in the present invention is a plug-and-play lightweight attention mechanism, and the CAD-UNet network model has fewer channels than the original UNet model, which reduces network complexity; because the method of the present invention has fewer parameters, it is easy to train.
Brief Description of the Drawings
Fig. 1 is a flowchart of the method of the present invention;
Fig. 2 is a structural diagram of the CAD-UNet network model of the present invention;
Fig. 3 is a structural diagram of the coordinate attention (CA) module of the present invention;
Fig. 4 is a structural diagram of the data-dependent upsampling (DUp) module of the present invention;
Fig. 5 shows examples of the training data of the present invention;
Fig. 6 shows prediction results of the present invention.
Detailed Description of the Embodiments
As shown in Fig. 1, a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling comprises the following steps in order:
(1) Acquire remote sensing data: download the WHU building dataset and the Massachusetts building dataset; the WHU building dataset is the Wuhan University building dataset, and the Massachusetts building dataset is the Massachusetts buildings dataset;
(2) Data preprocessing and data augmentation: the preprocessing refers to cropping the large images in the datasets and applying data augmentation to the cropped remote sensing images and label images; the augmented remote sensing images and label images are divided into a training set, a validation set, and a test set at a ratio of 8:1:1;
(3) Build the CAD-UNet network model: improve on the UNet network to build a building extraction network model comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module, namely the CAD-UNet network model;
(4) Model training and evaluation: train the CAD-UNet network model on the training set with a joint loss function combining the BCE binary cross-entropy loss and the Focal loss; after training, evaluate the building extraction accuracy and effect of the CAD-UNet network model on the test set;
(5) Automatic building extraction: after data preprocessing, feed a new remote sensing image to be processed into the trained CAD-UNet network model; the model outputs a predicted image, which is the building extraction result.
Step (2) specifically includes the following steps:
(2a) Crop the large remote sensing images and label images in the Massachusetts building dataset into 512×512 images by means of a sliding window, and label the building pixels as 1 and the background pixels as 0 in the label images of both the WHU and Massachusetts building datasets;
(2b) Apply data augmentation to the remote sensing images and label images of the WHU building dataset and of the cropped Massachusetts building dataset to enlarge the amount of data. The data augmentation includes:
Horizontal flipping: use the image processing library OpenCV to flip the remote sensing images and label images horizontally;
Vertical flipping: use OpenCV to flip the remote sensing images and label images vertically;
Horizontal-vertical flipping: use OpenCV to flip the remote sensing images and label images first horizontally and then vertically;
Shifting, scaling, random cropping, and noise addition: apply shifting, scaling, random cropping, and noise addition to the remote sensing images and label images respectively;
(2c) Divide the augmented remote sensing images and label images into a training set, a validation set, and a test set at a ratio of 8:1:1. The training set directly participates in the training of the CAD-UNet network model for feature extraction; the validation set is used to tune the hyperparameters of the CAD-UNet network model; the test set is used to test the accuracy and extraction effect of the CAD-UNet network model after training.
Step (3) specifically includes the following steps:
(3a) Replace the UNet encoder: replace the UNet encoder with a VGG16 network module, which consists of the VGG16 network with its last pooling layer and fully connected layers removed. The VGG16 module performs downsampling through multiple convolutions and four max-pooling operations, extracts building features from the remote sensing image, and outputs feature maps at four different scales;
(3b) Construct the coordinate attention (CA) module: embed the CA module at the skip connections of the UNet network obtained in step (3a);
The CA module captures long-range dependencies and preserves positional information along the two spatial directions separately, encoding the feature map into two feature maps that are respectively direction-aware and position-sensitive. It takes any intermediate tensor X = [x1, x2, x3, ..., xC] ∈ R^(C×H×W) as input and outputs a tensor Y = [y1, y2, y3, ..., yC] of the same size. Specifically, pooling kernels of size (H, 1) and (1, W) are applied to the input X to encode each channel along the horizontal and vertical directions. The output of the c-th channel at height h is expressed as:
z_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i) (1)
where H is the height of the image, W is the width, C is the total number of channels, c denotes the c-th channel, x_c is the c-th channel of the image, i is the horizontal coordinate, and R denotes the space of intermediate tensors;
Similarly, the output of the c-th channel at width w is expressed as:
z_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w) (2)
Formulas (1) and (2) are two feature-aggregation transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature maps and then transforms them with a shared 1×1 convolution, as shown in formula (3):
f = δ(F1([z^H, z^W])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information in the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors, f^H ∈ R^(C/r×H) and f^W ∈ R^(C/r×W), where r is the reduction ratio that controls the number of channels. Two further 1×1 convolution transformations F^H and F^W transform f^H and f^W into two tensors g^H and g^W with the same number of feature channels:
g^H = σ(F^H(f^H)) (4)
g^W = σ(F^W(f^W)) (5)
where F^H and F^W are 1×1 convolution transformations, f^H and f^W are the two tensors obtained by splitting f, g^H and g^W are the tensors obtained after the convolution transformations and the activation function, and σ is the sigmoid activation function. During the transformation, the reduction ratio r is used to reduce the number of channels of f, and the outputs g^H and g^W are then expanded and used as attention weights. The final output of the CA module is given by formula (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
(3c) Construct the data-dependent upsampling (DUp) module: combine convolutional layers with data-dependent upsampling to build the DUp module, which is used to extract the boundary information of buildings at high resolution. Each of the four input feature maps of different scales first passes through a 3×3 convolutional layer to reduce its number of channels; data-dependent upsampling then restores the feature map directly to a size of 512×512; the four upsampled feature maps are fused by point-wise addition and output from the DUp module;
The CAD-UNet network model is obtained from steps (3a)-(3c).
Step (4) specifically includes the following steps:
(4a) Construct the joint loss function: construct a joint loss function combining the BCE binary cross-entropy loss and the Focal loss. The BCE loss and the Focal loss are given by:
BL(pt, target) = -ω·(target·ln(pt) + (1 - target)·ln(1 - pt)) (7)
where pt is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight value;
FL(pt) = -α(1 - pt)^γ·log(pt) (8)
where pt is the prediction of the CAD-UNet network model, α is a balancing parameter used to balance the ratio of positive to negative samples, with value range (0, 1]; γ is a focusing parameter used to down-weight the loss of easily classified samples, with value range [0, +∞);
The joint loss function is given by formula (9):
Loss = BL + FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
(4c) Training strategy: use the pre-trained weights of the VGG16 network and adopt a freeze-training scheme, in which the parameters of the backbone are frozen for the first 100 epochs and trained normally for the last 100 epochs, for a total of 200 epochs per experiment;
(4d) Model accuracy evaluation: use the evaluation metrics Precision and intersection-over-union (IoU) to evaluate accuracy. The metrics are computed as shown in formulas (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP denotes pixels whose true value is positive and that the model judges as positive; FP denotes pixels whose true value is negative but that the model judges as positive; FN denotes pixels whose true value is positive but that the model judges as negative.
Step (5) specifically includes the following steps:
(5a) After data preprocessing, resize the new remote sensing image to be processed to 512×512;
(5b) Feed the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result. Pixels predicted as building are given the value 255 and pixels predicted as background are given the value 0, so in the predicted image the white regions are buildings and the black regions are background.
To verify the effectiveness of the present invention, UNet was selected as a comparative example, and the results were compared on the standard building datasets in terms of precision and intersection-over-union.
Table 1: Comparison of the results of the embodiment and the comparative example on the datasets
As shown in Fig. 2, the CAD-UNet network model of the present invention also adopts an encoder-decoder structure. On the left is the encoder, used for downsampling and feature extraction; the solid arrows in the middle represent the coordinate attention (CA) modules, which attend to the location information of buildings; on the right is the decoder, responsible for feature fusion and upsampling; the dashed arrows in the lower right corner represent the data-dependent upsampling (DUp) module, which extracts the boundary information of buildings; finally, after a 1×1 convolution adjusts the number of channels, the building extraction result is output.
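For orientation, the structure of Fig. 2 can be summarized as the following forward pass. It reuses the encoder, CA, and DUp sketches given earlier; the decoder channel widths, the transposed-convolution upsampling, and the way the DUp branch is fused with the decoder output are all assumptions rather than details stated in the patent.

```python
import torch
import torch.nn as nn

class CADUNet(nn.Module):
    """Assembly sketch of Fig. 2: VGG16 encoder, coordinate attention on the skip connections,
    a UNet-style decoder, the DUp branch, and a final 1x1 convolution. Channel widths are assumptions."""
    def __init__(self):
        super().__init__()
        self.encoder = VGG16Encoder()
        self.ca = nn.ModuleList([CoordinateAttention(c) for c in (64, 128, 256, 512)])
        self.up = nn.ModuleList([nn.ConvTranspose2d(ci, co, kernel_size=2, stride=2)
                                 for ci, co in ((512, 512), (512, 256), (256, 128), (128, 64))])
        self.fuse = nn.ModuleList([nn.Conv2d(co + cs, co, kernel_size=3, padding=1)
                                   for co, cs in ((512, 512), (256, 256), (128, 128), (64, 64))])
        self.dup = DUpModule(in_channels=(64, 128, 256, 512))
        self.head = nn.Conv2d(64, 1, kernel_size=1)  # final 1x1 conv adjusting the channel number

    def forward(self, x):
        f1, f2, f3, f4, f5 = self.encoder(x)
        skips = [ca(f) for ca, f in zip(self.ca, (f1, f2, f3, f4))]  # CA on the skip connections
        d = f5
        decoder_feats = []
        for up, fuse, skip in zip(self.up, self.fuse, reversed(skips)):
            d = fuse(torch.cat([up(d), skip], dim=1))
            decoder_feats.append(d)
        # DUp branch: the four decoder maps (shallow to deep) are upsampled to 512x512 and summed
        boundary = self.dup(list(reversed(decoder_feats)))
        return self.head(d + boundary)
```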
As shown in Fig. 3, for an input feature tensor, the coordinate attention (CA) module of the present invention first encodes the channels along the horizontal and vertical directions separately; the features from the two directions are then aggregated to obtain an intermediate tensor; finally, the intermediate tensor is split along the spatial dimension and passed through a convolutional layer and a sigmoid function to obtain the final output.
As shown in Fig. 4, the data-dependent upsampling (DUp) module of the present invention first passes each of the four input feature maps of different scales through a 3×3 convolutional layer to reduce the number of channels; data-dependent upsampling then restores the four feature maps directly to a size of 512×512; finally, the four feature maps are fused by point-wise addition and output from the DUp module.
Fig. 5 shows examples from the WHU building dataset and the Massachusetts building dataset; the left column shows the remote sensing images and the right column the corresponding ground-truth labels.
Fig. 6 shows the prediction results of the CAD-UNet network model of the present invention. The first column shows the remote sensing images, the second column the corresponding ground-truth labels, the third column the predictions of the CAD-UNet network model of the present invention, and the fourth column the predictions of UNet. As can be seen from the figure, the predictions of the method of the present invention have smoother and clearer boundaries and are superior to those of UNet.
In summary, the network designed by the present invention progressively extracts the deep features of buildings and, after feature fusion, gradually upsamples back to the input resolution, which is better suited to the building extraction task and significantly improves extraction accuracy. The coordinate attention (CA) module and the data-dependent upsampling (DUp) module added and constructed by the present invention effectively capture the location and boundary information of buildings, so the extracted buildings have smoother boundaries and more complete outlines. The coordinate attention used in the present invention is a plug-and-play lightweight attention mechanism, and the CAD-UNet network model has fewer channels than the original UNet model, which reduces network complexity; because the method of the present invention has fewer parameters, it is easy to train.
Claims (5)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211270279.6A | 2022-10-18 | 2022-10-18 | Remote sensing image building extraction method based on coordinate attention and data correlation upsampling |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211270279.6A | 2022-10-18 | 2022-10-18 | Remote sensing image building extraction method based on coordinate attention and data correlation upsampling |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115631412A | 2023-01-20 |
Family

ID=84906561

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211270279.6A | Remote sensing image building extraction method based on coordinate attention and data correlation upsampling | 2022-10-18 | 2022-10-18 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN115631412A (en) |
2022-10-18: Application CN202211270279.6A filed in China; published as CN115631412A (status: Pending)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116503464A (en) * | 2023-06-25 | 2023-07-28 | 武汉理工大学三亚科教创新园 | Farmland building height prediction method based on remote sensing image |
CN116503464B (en) * | 2023-06-25 | 2023-10-03 | 武汉理工大学三亚科教创新园 | Height prediction method of farmland buildings based on remote sensing images |
CN119067991A (en) * | 2024-08-17 | 2024-12-03 | 江西师范大学 | Retinal image segmentation method based on lightweight two-way cascade network |
CN119439152A (en) * | 2024-10-29 | 2025-02-14 | 广东省水利水电科学研究院 | A three-dimensional imaging method and system for termite nests on dams based on ground penetrating radar |
CN119441940A (en) * | 2024-10-30 | 2025-02-14 | 内蒙古医科大学 | A classification method for threatened abortion based on loss function optimization of blood routine tests |
CN119784678A (en) * | 2024-11-26 | 2025-04-08 | 云南大学附属医院 | Anti-vascular endothelial growth factor efficacy evaluation and research methods based on deep learning |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |