CN110210485A

CN110210485A - The image, semantic dividing method of Fusion Features is instructed based on attention mechanism

Info

Publication number: CN110210485A
Application number: CN201910391452.XA
Authority: CN
Inventors: 龚声蓉; 周鹏程
Original assignee: Changshu Institute of Technology
Current assignee: Changshu Institute of Technology
Priority date: 2019-05-13
Filing date: 2019-05-13
Publication date: 2019-09-06

Abstract

The invention discloses an image semantic segmentation method based on attention mechanism to guide feature fusion, including the following steps: (10) Encoder basic network construction: use the improved ResNet-101 to generate a series of high-resolution low-semantic to low-resolution (20) Decoder feature fusion module construction: using a pyramid structure module based on three-layer convolution operations, extracting high-level semantics with strong consistency constraints, and then weighting and fusing low-level features layer by layer to obtain a preliminary Segmentation heat map; (30) Auxiliary loss function construction: add auxiliary supervision to each fusion output in the decoding stage, and then superimpose with the main supervision loss after sampling on the heat map, strengthen the layered training of the model, and obtain a semantic segmentation map. The image semantic segmentation method guided by feature fusion based on the attention mechanism of the present invention has high accuracy and clear boundary outline.

Description

Image Semantic Segmentation Method Based on Attention Mechanism Guided Feature Fusion

技术领域technical field

本发明属于静态图像识别技术领域，特别是一种准确性高、边界轮廓清楚的基于注意力机制指导特征融合的图像语义分割方法。The invention belongs to the technical field of static image recognition, in particular to an image semantic segmentation method based on an attention mechanism guiding feature fusion with high accuracy and clear boundaries.

背景技术Background technique

语义分割即像素级别的图像理解，是计算机视觉领域的重要基石之一，有着非常广泛的应用场景。其通过细粒度分割的方式，赋予了机器将视觉画面的不同区域以像素级别剥离开来的能力。语义分割将图像中属于同一目标的像素区域划分到一起，从而扩展其应用领域。Semantic segmentation, that is, pixel-level image understanding, is one of the important cornerstones in the field of computer vision and has a very wide range of application scenarios. Through fine-grained segmentation, it gives the machine the ability to separate different areas of the visual image at the pixel level. Semantic segmentation divides the pixel regions belonging to the same object in the image together, thereby expanding its application field.

语义分割在进行像素级预测的同时将对象分类和目标定位两个问题结合在一起解决，如何在高层抽象的对象分类和低层精确的目标定位这两个相互约束的问题之间取得平衡是当前语义分割方法所要面对的核心问题。语义分割方法可以大致分为两类。第一种，通过人工提取特征生成图像中各个对象的语义，这种方法往往需要细致的特征工程手段，再输入分类器进行像素级别的分类。第二种是基于深度学习方法，通过构建端到端系统将特征提取与分类器合在一起从而直接为每个像素分配一个语义标签。Semantic segmentation combines the two problems of object classification and target positioning while performing pixel-level prediction. How to strike a balance between the two mutually constrained problems of high-level abstract object classification and low-level precise target positioning is the current semantic The core problem that segmentation methods have to face. Semantic segmentation methods can be broadly classified into two categories. The first is to generate the semantics of each object in the image by manually extracting features. This method often requires careful feature engineering methods, and then input into a classifier for pixel-level classification. The second is based on deep learning methods, which directly assign a semantic label to each pixel by building an end-to-end system combining feature extraction and classifiers.

大多数传统的方法都是依赖于手工提取特征并与分类器相结合的机器学习方法，如 Shotton等人的Boost方法，Johnson等人的随机森林，Soatto等人的支持向量机。这些方法通过整合来自上下文和结构化预测技术的丰富信息取得了实质性的进步。然而，由于手工提取的特征表达能力受限的影响，基于传统机器学习方法的图像语义分割系统性能逐渐饱和，无法突破瓶颈，其在分割准确率性能上仍有很大的提升空间。Most of the traditional methods are machine learning methods that rely on manually extracting features and combining them with classifiers, such as the Boost method of Shotton et al., the random forest of Johnson et al., and the support vector machine of Soatto et al. These methods have made substantial progress by integrating rich information from contextual and structured prediction techniques. However, due to the limited expressive ability of manually extracted features, the performance of the image semantic segmentation system based on traditional machine learning methods is gradually saturated, unable to break through the bottleneck, and there is still a lot of room for improvement in the performance of segmentation accuracy.

近些年，深度学习革命让相关领域发生了翻天覆地的变化，包括语义分割在内的许多计算机视觉问题都开始使用深层架构来解决。基于深度卷积神经网络提出的全卷积网络方法，以卷积层代替全连接层构造全卷积网络应用到语义分割上，生成了密集的逐像素标记输出，获得了更高的分割精度。Zhao等人提出金字塔场景解析网络方法，利用金字塔池化模块，通过不同区域的上下文聚合来利用全局上下文信息，利用全局先验有效地产生了高质量的分割结果。Li等人通过先对浅层阶段区域进行分类，并将更深层次的阶段重点放在少数困难区域上，以进行自适应和针对困难样例识别的学习，最终改善了分割性能。Lin等人提出了一种通用的多路径优化网络方法，明确利用下采样过程中的所有可用信息，以实现使用远程残留连接的高分辨率像素级预测。In recent years, the deep learning revolution has revolutionized related fields, and many computer vision problems, including semantic segmentation, have begun to be solved using deep architectures. Based on the fully convolutional network method proposed by the deep convolutional neural network, the convolutional layer is used instead of the fully connected layer to construct a fully convolutional network and applied to semantic segmentation, which generates dense pixel-by-pixel labeling output and obtains higher segmentation accuracy. Zhao et al. proposed a pyramid scene parsing network method, using a pyramid pooling module to utilize global context information through context aggregation in different regions, and effectively producing high-quality segmentation results using global priors. Li et al. improved segmentation performance by first classifying shallow stage regions and focusing deeper stages on few difficult regions for adaptation and learning for difficult example recognition. Lin et al. propose a general multipath optimization network approach that explicitly exploits all available information during downsampling to achieve high-resolution pixel-wise predictions using long-range residual connections.

然而，现有技术存就语义分割效果而言，仍存在两个主要问题：However, there are still two main problems in the existing technology in terms of semantic segmentation effect:

1、在基于深度全卷积网络的图像语义分割中，使用卷积网络进行特征提取时，由于卷积、最大池化和下采样操作的重复组合引起特征分辨率逐渐降低，导致上下文信息丢失，使得导致分割结果中出现外观复杂目标的局部区域误识别以及多尺度对象中小目标识别错误等语义不一致；1. In the semantic segmentation of images based on deep fully convolutional networks, when using convolutional networks for feature extraction, the feature resolution is gradually reduced due to repeated combinations of convolution, maximum pooling, and downsampling operations, resulting in loss of context information. Semantic inconsistencies such as misrecognition of local areas of complex-looking targets in the segmentation results and misrecognition of small targets in multi-scale objects;

2、卷积网络的成功部分归因于其对图像局部变换的内在不变性，该不变性增强了网络学习分层抽象的能力，这恰恰是对象分类等高层视觉任务所需的。而语义分割在解决分类问题的同时还需要面对分割中定位对象的边界轮廓等空间细节问题，单纯的像素分类任务经常出现分割结果中对象的边界轮廓模糊不清的现象。2. The success of convolutional networks is partly due to their inherent invariance to local transformations of images, which enhances the network's ability to learn hierarchical abstractions, which is exactly what is required for high-level vision tasks such as object classification. While semantic segmentation solves the classification problem, it also needs to face the spatial details such as the boundary contour of the positioning object in the segmentation. The simple pixel classification task often has the phenomenon that the boundary contour of the object in the segmentation result is blurred.

发明内容Contents of the invention

本发明的目的在于提供一种基于注意力机制指导特征融合的图像语义分割方法，准确性高、边界轮廓清楚。The purpose of the present invention is to provide an image semantic segmentation method based on attention mechanism to guide feature fusion, which has high accuracy and clear boundary outline.

实现本发明目的的技术方案为：The technical scheme that realizes the object of the present invention is:

一种基于注意力机制指导特征融合的图像语义分割方法，包括如下步骤：An image semantic segmentation method based on an attention mechanism to guide feature fusion, comprising the following steps:

(10)编码器基础网络构建：使用改进后的ResNet-101生成一系列由高分辨率低语义到低分辨率高语义变化的特征；(10) Encoder basic network construction: use the improved ResNet-101 to generate a series of features ranging from high resolution and low semantics to low resolution and high semantics;

(20)解码器特征融合模块构建：采用基于三层卷积操作的金字塔结构模块，提取强一致性约束的高层语义，再向低层阶段特征逐层加权融合，得到初步分割热图；(20) Decoder feature fusion module construction: a pyramid structure module based on three-layer convolution operation is used to extract high-level semantics with strong consistency constraints, and then weighted and fused to low-level features layer by layer to obtain a preliminary segmentation heat map;

(30)辅助损失函数构建：向解码阶段的每个融合输出追加辅助监督，再与热图上采样后的主监督损失叠加，强化模型的分层训练，得到语义分割图。(30) Auxiliary loss function construction: add auxiliary supervision to each fusion output in the decoding stage, and then superimpose with the main supervision loss after sampling on the heat map, strengthen the layered training of the model, and obtain a semantic segmentation map.

与现有技术相比，本发明具有以下几点：Compared with the prior art, the present invention has the following points:

1、准确性高：本发明方法通过实现类似金字塔结构的末端高层语义信息提取模块来融合三个不同尺度的特征，并额外引入与输出特征连接的全局池化分支做后续处理，将上下文信息与经过简单卷积操作后的原特征相乘，在不引入太多的计算的前提下能够捕获强语义一致性特征，减少对象局部区域识别出错的几率；1. High accuracy: the method of the present invention integrates features of three different scales by implementing a high-level semantic information extraction module at the end similar to a pyramid structure, and additionally introduces a global pooling branch connected to the output features for subsequent processing, combining context information with Multiplying the original features after a simple convolution operation can capture strong semantic consistency features without introducing too much calculation, and reduce the probability of error in the recognition of local areas of objects;

2、边界轮廓清楚：本发明根据相邻特征间高层特征含有较多语义信息而低层特征含有较多空间细节信息这一特点，先连接两个层级特征生成通道注意向量，将其作为权重来选择低层特征中最具判别力的信息，利用高层特征的强语义一致性约束指导和细化其与低层特征的融合，捕获丰富的上下文，最终细化了对象的分割边界，更好地融合层级特征以恢复分割图中对象的边界细节，减少了边界轮廓模糊不清的现象。2. The boundary outline is clear: according to the feature that the high-level features contain more semantic information and the low-level features contain more spatial detail information between adjacent features, the present invention first connects the two-level features to generate a channel attention vector, and selects it as a weight The most discriminative information in low-level features, using the strong semantic consistency constraints of high-level features to guide and refine its fusion with low-level features, capture rich context, and finally refine the segmentation boundary of objects to better fuse hierarchical features In order to restore the boundary details of objects in the segmentation map, the blurring of boundary contours is reduced.

附图说明Description of drawings

图1为本发明基于注意力机制指导特征融合的图像语义分割方法的主流程图。Fig. 1 is the main flowchart of the image semantic segmentation method based on the attention mechanism guiding feature fusion in the present invention.

图2为图1中编码器基础网络构建步骤的流程图。Fig. 2 is a flowchart of the construction steps of the encoder basic network in Fig. 1 .

图3为图1中解码器特征融合模块构建步骤的流程图。Fig. 3 is a flowchart of the construction steps of the decoder feature fusion module in Fig. 1 .

图4为末端高层语义信息提取模块示例。Figure 4 is an example of the terminal high-level semantic information extraction module.

图5为注意力机制指导特征融合模块示例。Figure 5 is an example of the attention mechanism guiding the feature fusion module.

具体实施方式Detailed ways

如图1所示，本发明基于注意力机制指导特征融合的图像语义分割方法，包括如下步骤：As shown in Figure 1, the image semantic segmentation method of the present invention based on attention mechanism guiding feature fusion, comprises the following steps:

如图2所示，所述(10)编码器基础网络构建步骤包括：As shown in Figure 2, the (10) encoder basic network construction steps include:

(11)构建块的层数重部署：重新部署res-2到res-5阶段各自拥有的构建块数量，将原始ResNet-101的res-2到res-5的{3，4，23，3}构建块数量调整为{8，8，9，8}；(11) Re-deployment of the number of layers of building blocks: redeploy the number of building blocks owned by each stage from res-2 to res-5, and res-2 to res-5 of the original ResNet-101 {3, 4, 23, 3 } The number of building blocks is adjusted to {8, 8, 9, 8};

卷积网络编码器目的是生成一系列由高分辨率低语义到低分辨率高语义变化的特征。该基础网络通常使用现有的卷积神经网络模型，如LeNet、AlexNet、VGG、 GoogLeNet、ResNet等。其中ResNet-101使用了大量的残差结构，解决了层数加深的同时梯度消失的问题，其每个残差结构也为正向和反向传播提供了新的路径，因此具有极强的表达能力。本发明使用ResNet-101作为语义分割的编码器基础网络。The purpose of the convolutional network encoder is to generate a series of features ranging from high-resolution low-semantic to low-resolution high-semantic. The base network usually uses existing convolutional neural network models, such as LeNet, AlexNet, VGG, GoogLeNet, ResNet, etc. Among them, ResNet-101 uses a large number of residual structures to solve the problem of gradient disappearance while the number of layers is deepened. Each residual structure also provides a new path for forward and backward propagation, so it has a strong expression ability. The present invention uses ResNet-101 as the encoder base network for semantic segmentation.

在基础网络中，特征从编码器部分的每个阶段尾部提取而来，对于ResNet-101而言，分别为res-2、res-3、res-4和res-5四个阶段，其中每个阶段所含构建块数量分别为{3，4，23，3}，每个构建块由三个卷积层组成。可见，ResNet-101的前两个编码阶段只有少量的构建块，这样较浅的卷积层数使其不能提取深层的语义特征，低层特征的语义质量较差。而从res-4开始，经过大量的深层卷积之后，输出的特征拥有较强的语义。res-3 和res-4两个阶段提取出的两个相邻特征之间的语义质量差距极大。为了改善低层特征的语义质量，使其更接近监督，一个直接的方法是重新部署res-2到res-5阶段各自拥有的构建块数量，平衡各个阶段的卷积层数，减少res-3和res-4两个阶段输出的特征之间的语义差异性。重新部署的各阶段构建块数量中，原始ResNet-101的res-2到res-5的{3， 4，23，3}构建块数量调整为{8，8，9，8}。In the basic network, features are extracted from the end of each stage of the encoder part. For ResNet-101, there are four stages of res-2, res-3, res-4 and res-5, each of which The number of building blocks contained in the stages are {3, 4, 23, 3} respectively, and each building block consists of three convolutional layers. It can be seen that the first two encoding stages of ResNet-101 have only a small number of building blocks, so the shallow number of convolutional layers makes it impossible to extract deep semantic features, and the semantic quality of low-level features is poor. Starting from res-4, after a large number of deep convolutions, the output features have stronger semantics. The semantic quality gap between the two adjacent features extracted in the res-3 and res-4 stages is extremely large. In order to improve the semantic quality of low-level features and make them closer to supervision, a straightforward method is to redeploy the number of building blocks owned by res-2 to res-5 stages, balance the number of convolutional layers in each stage, and reduce res-3 and res-5. The semantic difference between the features output by the two stages of res-4. Among the number of building blocks in each stage of redeployment, the number of {3, 4, 23, 3} building blocks from res-2 to res-5 of the original ResNet-101 is adjusted to {8, 8, 9, 8}.

(12)扩大感受野：将ResNet-101基础网络结构中res-5阶段的传统卷积改为扩张率为2的空洞卷积。(12) Expand the receptive field: Change the traditional convolution of the res-5 stage in the basic network structure of ResNet-101 to an empty convolution with an expansion rate of 2.

语义分割的输出分辨率应与输入图像一致。基于全卷积网络的语义分割方法虽然能够接受任意分辨率的输入图像，但连续的卷积和池化操作在增大感受野的同时也减小了特征的分辨率。虽然通过上采样可以将缩小的特征图还原到图像的原始尺寸，但这个过程必然造成丢失的信息无法还原，上采样恢复的特征图将失去对图像细节的敏感性。并且，频繁的上采样操作也需要额外的内存和时间。本发明采用最初应用于信号处理领域小波变换分析中的空洞卷积方法克服这一问题。The output resolution of semantic segmentation should be consistent with the input image. Although the semantic segmentation method based on the fully convolutional network can accept input images of any resolution, continuous convolution and pooling operations increase the receptive field while reducing the resolution of the feature. Although the reduced feature map can be restored to the original size of the image through upsampling, this process will inevitably cause the lost information to be irreversible, and the feature map restored by upsampling will lose its sensitivity to image details. Also, frequent upsampling operations require additional memory and time. The present invention overcomes this problem by adopting the atrous convolution method originally applied in wavelet transform analysis in the field of signal processing.

将原始滤波器上采样2倍，并在滤波器值之间插入零值。虽然有效滤波器的尺寸有所增加，但无需考虑中间插入的零值，即空洞，因此滤波器参数的数量和每个位置的操作数量保持不变。可以通过改变扩张率参数r以自适应地修改感受野的大小，进而有效地控制卷积网络中特征的分辨率而无需学习额外的参数。Upsample the original filter by a factor of 2 and insert zero values between filter values. Although the size of the effective filter is increased, there is no need to account for intervening zeros, i.e. holes, so the number of filter parameters and the number of operations per position remain the same. The size of the receptive field can be adaptively modified by changing the expansion rate parameter r, thereby effectively controlling the resolution of features in the convolutional network without learning additional parameters.

在卷积神经网络中，经过连续3次核尺寸为3×3的标准卷积后，感受野尺寸分别为3×3，5×5和7×7。若连续卷积操作的核尺寸为(2d+1)×(2d+1)且不变，则第n层的感受野尺寸为：In the convolutional neural network, after three consecutive standard convolutions with a kernel size of 3×3, the receptive field sizes are 3×3, 5×5 and 7×7, respectively. If the kernel size of the continuous convolution operation is (2d+1)×(2d+1) and remains unchanged, the receptive field size of the nth layer is:

f_n＝2dn+1 (1)f _n =2dn+1 (1)

即标准卷积下感受野的大小呈线性增长。而图2中所示是核尺寸为3×3，扩张率分别为 1，2和4的空洞卷积，其感受野分别为3×3，7×7和15×15。假设核尺寸同样为 (2d+1)×(2d+1)不变的连续空洞卷积操作中第n层的扩张率为r_n，则感受野的尺寸为：That is, the size of the receptive field under standard convolution increases linearly. Figure 2 shows dilated convolutions with a kernel size of 3×3 and expansion rates of 1, 2 and 4, and their receptive fields are 3×3, 7×7 and 15×15, respectively. Assuming that the kernel size is also (2d+1)×(2d+1) constant, the expansion rate of the nth layer in the continuous dilated convolution operation is r _n , then the size of the receptive field is:

f_n＝f_n-1+2dr_n (2)f _n ＝f _n-1 +2dr _n (2)

其中n≥2且f₁＝2dr₁+1，递推可得：Where n≥2 and f ₁ =2dr ₁ +1, recursively:

令扩张率r_n＝2^n-1，则感受野的尺寸变为：Let the expansion rate r _n =2 ^n-1 , then the size of the receptive field becomes:

f_n＝2d(2ⁿ-1)+1 (4)f _n =2d(2 ⁿ -1)+1 (4)

由此，为空洞卷积选取适当的扩张率可以使感受野呈指数型增长。在基础网络结构中的res-5阶段开始使用扩张率2的空洞卷积，因该阶段中res5a和res5c均使用1×1的滤波核，故而实际仅res5b在快速扩大感受野，以提取密集型特征。Therefore, choosing an appropriate expansion rate for dilated convolution can make the receptive field grow exponentially. In the res-5 stage of the basic network structure, the hole convolution with an expansion rate of 2 is used. Because both res5a and res5c use a 1×1 filter kernel in this stage, only res5b is actually rapidly expanding the receptive field to extract dense feature.

使用基于解码器模块的结构用于恢复图像分辨率，其间通过整合各个层级的特征以细化最终的预测。解码器架构主要考虑如何恢复因连续的池化和下采样操作而丢失的空间信息。本发明在解码器架构中设计一个末端模块主要用于提取具有最强一致性约束的高层语义信息，并利用注意力机制指导与低层特征的融合，细化输出结果。A decoder module-based structure is used to recover image resolution, during which features from various levels are integrated to refine the final prediction. The decoder architecture mainly considers how to recover the spatial information lost due to successive pooling and downsampling operations. In the present invention, an end module is designed in the decoder architecture to extract high-level semantic information with the strongest consistency constraints, and use the attention mechanism to guide the fusion with low-level features and refine the output results.

如图3所示，所述(20)解码器特征融合模块构建步骤包括：As shown in Figure 3, the (20) decoder feature fusion module construction steps include:

(21)提取末端高层语义信息：采用基于三层卷积操作的类似金字塔的结构模块，在模块中分别使用3×3，5×5和7×7的卷积，通过融合不同尺度的上下文，得到具有最强类内语义一致性的高层语义；(21) Extract high-level semantic information at the end: use a pyramid-like structural module based on three-layer convolution operations, use 3×3, 5×5 and 7×7 convolutions in the module, and fuse contexts of different scales, Get high-level semantics with the strongest intra-class semantic consistency;

先前的模型多是在基础网络末端执行一系列尺度的空洞金字塔池化或空洞空间金字塔模块。在当前的语义分割体系中，金字塔结构可以提取不同尺度的特征信息，并在像素级别扩大感受野，但这种结构缺乏全局的上下文先验，无法按通道选择适合的要素，并且可能丢失重要的像素级信息。例如，过于频繁的空洞卷积会引起局部信息丢失而网格池化对特征图的局部一致性也是有害的。PSPNet中提出的金字塔池化模块更是时常会在不同尺度的池化操作中丢失像素位置。Most of the previous models performed a series of scale-scale empty pyramid pooling or empty space pyramid modules at the end of the basic network. In the current semantic segmentation system, the pyramid structure can extract feature information at different scales and expand the receptive field at the pixel level, but this structure lacks global context priors, cannot select suitable elements by channel, and may lose important Pixel level information. For example, too frequent dilated convolutions can cause loss of local information and grid pooling is also detrimental to the local consistency of feature maps. The pyramid pooling module proposed in PSPNet often loses pixel positions in pooling operations of different scales.

本发明使用如图4所示的高层语义信息提取模块从基础网络末端提取具有类内强语义一致性约束的高层特征。The present invention uses a high-level semantic information extraction module as shown in FIG. 4 to extract high-level features with strong intra-class semantic consistency constraints from the end of the basic network.

该模块通过实现类似金字塔这样的结构来融合三个不同尺度的特征信息。为了更好地从不同尺度中提取有用的上下文，分别在模块中使用3×3，5×5和7×7的卷积，由于高层特征分辨率较小，因此不会带来太大的计算负担。该模块通过逐步融合不同尺度的特征信息，可以较为精确地结合相邻尺度的上下文特征。来自res-5的输出特征在经过 1×1的卷积后，与融合特征按通道相乘。该模块还额外引入与输出特征连接的全局池化分支，在后续处理中可以进一步提高语义分割的性能。This module fuses feature information of three different scales by implementing a pyramid-like structure. In order to better extract useful context from different scales, 3×3, 5×5 and 7×7 convolutions are used in the modules respectively. Since the high-level feature resolution is small, it will not bring too much calculation. burden. This module can more accurately combine the context features of adjacent scales by gradually fusing feature information of different scales. The output features from res-5 are multiplied by channels with the fusion features after 1×1 convolution. This module also additionally introduces a global pooling branch connected to the output features, which can further improve the performance of semantic segmentation in subsequent processing.

受益于类似金字塔的结构，末端高层语义信息提取模块可以融合不同规模的上下文信息，同时为高层特征产生强大的语义信息。与金字塔池化模块在裁剪通道的卷积层之前连接不同尺度的特征不同，末端高层语义信息提取模块将上下文信息与经过简单1×1卷积操作后的原特征相乘，不会引入太多的计算。Benefiting from the pyramid-like structure, the terminal high-level semantic information extraction module can fuse contextual information of different scales while producing powerful semantic information for high-level features. Unlike the pyramid pooling module that connects features of different scales before the convolutional layer of the clipping channel, the terminal high-level semantic information extraction module multiplies the context information with the original features after a simple 1×1 convolution operation without introducing too much calculation.

(22)融合上下文特征：通过逐层合并相邻阶段的特征从而计算出通道注意向量，以此作为加权选择出低层阶段中判别力强的特征信息，并与相邻高阶段特征相融合，得到分割热图。(22) Fusion of contextual features: The channel attention vector is calculated by merging the features of the adjacent stages layer by layer, which is used as a weight to select the feature information with strong discriminative power in the low-level stage, and fused with the adjacent high-stage features to obtain Segmentation heatmap.

在基础网络中，ResNet-101包含五个阶段，分别生成对应尺度的特征，不同阶段拥有不同的识别能力导致多种一致性表现。在低层阶段，网络编码出精细的空间信息，其小感受野和缺乏空间上下文引导的特性使其仅含少量的语义一致性。在高层阶段，因其大感受野而拥有强大的类内语义一致性，但能预测的空间精度十分粗糙。总之，低层阶段生成更多精准的空间预测而高层阶段能给出更多精准的语义预测。由此可以结合两者各自的优势，利用高阶段的语义一致性去指导和低阶段的特征融合从而得到最佳预测。本发明使用如图5所示的注意力机制指导特征融合。In the basic network, ResNet-101 consists of five stages, which generate features of corresponding scales. Different stages have different recognition capabilities, resulting in multiple consistent performances. In the low-level stage, the network encodes fine spatial information, and its small receptive field and lack of spatial context guidance make it contain only a small amount of semantic consistency. In the high-level stage, it has strong intra-class semantic consistency due to its large receptive field, but the spatial accuracy of prediction is very rough. In summary, the lower stages generate more accurate spatial predictions while the higher stages give more accurate semantic predictions. In this way, the respective advantages of the two can be combined, and the high-level semantic consistency can be used to guide and low-level feature fusion to obtain the best prediction. The present invention uses the attention mechanism shown in Figure 5 to guide feature fusion.

这一设计通过合并相邻阶段的特征来计算一个通道注意向量作为权重。高层特征提供强大的一致性指导，而低层阶段提供的特征中具有不同判别能力的信息。通道注意向量即用于加权，选择出判别力强的特征信息。语义分割架构中，卷积操作最终输出了得分图，从而提供了每个像素所属不同类别的概率。最终得分图中的分数通过对特征图中所有通道求和而最终得分图中的分数通过对特征图中所有通道求和而来：This design computes a channel attention vector as weights by incorporating features from adjacent stages. The high-level features provide strong consistent guidance, while the low-level stages provide information with different discriminative power in the features. The channel attention vector is used for weighting to select feature information with strong discriminative power. In the semantic segmentation architecture, the convolution operation finally outputs a score map, which provides the probability that each pixel belongs to a different class. The score in the final score map is obtained by summing all the channels in the feature map and the score in the final score map is obtained by summing all the channels in the feature map:

其中x代表网络输出的特征，ω表示卷积核，D是像素位置的集合。Where x represents the features of the network output, ω represents the convolution kernel, and D is the set of pixel positions.

式(6)中，p是预测概率，N即通道数。如式(5)和式(6)中，最终的预测标签是概率值最高的那个类别。假设某个类别的预测结果是而真实标签是y，于是，引入一个参数α将最高概率值由变为y，如公式(7)所示：In formula (6), p is the predicted probability, and N is the number of channels. As in formulas (5) and (6), the final predicted label is the category with the highest probability value. Suppose the prediction result of a class is And the real label is y, so a parameter α is introduced to change the highest probability value from becomes y, as shown in formula (7):

其中，y即是新的预测输出，而α＝Sigmoid(ω,x)，即图4中的Sigmoid输出。Among them, y is the new predicted output, and α=Sigmoid(ω,x), which is the Sigmoid output in Figure 4.

基于上述分析，可以看出注意力机制的深层含义。式(5)隐含地揭示了不同通道的权重是相等的。但正如前面提到的，不同阶段的特征拥有不同程度的判别能力，从而导致不同的预测细粒度。为了获得具有精细对象边界的预测结果，应尽可能地提取具有判别能力的特征而抑制判别能力弱的特征。因此，将式(7)中的α值应用于特征图x，表示注意力机制的特征选择。有了这个模块，便可以逐层细化，输出最佳的预测结果。Based on the above analysis, we can see the deep meaning of the attention mechanism. Equation (5) implicitly reveals that the weights of different channels are equal. But as mentioned earlier, features at different stages have different degrees of discriminative ability, resulting in different prediction fine-grainedness. In order to obtain prediction results with fine object boundaries, discriminative features should be extracted as much as possible while weak discriminative features are suppressed. Therefore, applying the value of α in Equation (7) to the feature map x represents the feature selection of the attention mechanism. With this module, it can be refined layer by layer to output the best prediction results.

本发明对常用语义分割方法的损失函数进行改进，采用一种逐层标签监督策略，通过直接向解码阶段的每个融合输出的特征追加辅助监督，以提升网络模型中每一层分支的学习能力。为了在辅助分支生成语义输出，每个融合特征作为高层阶段在进入下一步之前被强制学习更多的语义，以期望对后面的融合更有帮助。需要注意的是，与编码器阶段的构建块层数重部署一样，逐层标签监督本身并不能提升卷积网络的分类功能，只是在语义分割任务中这一措施可以使卷积网络被迫提升低层阶段特征的语义质量，从而对解码阶段的输出更有帮助。The present invention improves the loss function of the commonly used semantic segmentation method, adopts a layer-by-layer label supervision strategy, and directly adds auxiliary supervision to the features of each fusion output in the decoding stage to improve the learning ability of each layer branch in the network model . In order to generate semantic output in the auxiliary branch, each fused feature as a high-level stage is forced to learn more semantics before going to the next step, hoping to be more helpful for the subsequent fusion. It should be noted that, like the redeployment of building block layers in the encoder stage, the layer-by-layer label supervision itself cannot improve the classification function of the convolutional network, but this measure can force the convolutional network to improve in the semantic segmentation task. Semantic quality of low-level stage features, which are more helpful to the output of the decoding stage.

在训练网络的时候，于res-2、res-3和res-4对应的特征融合模块尾部追加等分辨率标注图的辅助softmax损失。整个模型最终的分类损失相当于最终输出的监督以及三个辅助分支的监督之和。When training the network, the auxiliary softmax loss of the equal-resolution annotation map is added to the end of the feature fusion module corresponding to res-2, res-3 and res-4. The final classification loss of the entire model is equivalent to the sum of the supervision of the final output and the supervision of the three auxiliary branches.

给定的3个分支及最终输出共T＝4个监督中，每个监督对象输出的特征通道数即训练集中的类别个数为N。第t个分支末尾上采样后的特征F^t的空间分辨率为W_t×H_t，其对应特定坐标位置(w,h,n)的值为F^t _w,h,n。对最终输出及各个分支的特征图监督加入带权重的softmax交叉熵损失，对应权重为λ_t，其中λ₀＝1是最终输出的损失权重，其余为辅助监督的损失。将F^t输入到softmax函数中，计算图像中每个像素归属不同类别的概率softmax函数层的具体公式为：Given 3 branches and a total of T=4 supervisions in the final output, the number of feature channels output by each supervised object, that is, the number of categories in the training set is N. The spatial resolution of the up-sampled feature F ^t at the end of the t-th branch is W _t ×H _t , and its value corresponding to a specific coordinate position (w,h,n) is F ^t _w,h,n . The weighted softmax cross-entropy loss is added to the final output and the feature map supervision of each branch, and the corresponding weight is λ _t , where λ ₀ =1 is the loss weight of the final output, and the rest is the loss of auxiliary supervision. Input F ^t into the softmax function to calculate the probability that each pixel in the image belongs to a different category The specific formula of the softmax function layer is:

将预测映射到真实标签P^t _w,h,n上，最终用于训练的损失函数如式(9)所示：will predict Mapped to the real label P ^t _w,h,n , the final loss function used for training is shown in formula (9):

逐层标签监督策略使得梯度优化更加平滑，模型也更容易训练。监督下的每一个分支各自拥有强大的学习能力，能够学到各个层级丰富的语义特征。通过融合使得最终得到的分割图精度不依赖于任意单独的分支。The layer-by-layer label supervision strategy makes the gradient optimization smoother and the model is easier to train. Each branch under supervision has a strong learning ability and can learn rich semantic features at each level. Through fusion, the accuracy of the final segmentation map does not depend on any individual branch.

下面以一个具体实施例，验证本发明方法能够提高图像语义分割的准确性。A specific example is used below to verify that the method of the present invention can improve the accuracy of image semantic segmentation.

在PASCAL VOC 2012和Cityscapes两个语义分割数据集上对所提出的若干修正模块进行实验。基础网络是在ImageNet上预训练的ResNet-101。实验硬件平台是Core i7 处理器，3.6GHz主频，48G内存，GPU为NVIDIA GTX 1080，代码运行在TensorFlow 深度学习框架上。Several proposed correction modules are tested on two semantic segmentation datasets, PASCAL VOC 2012 and Cityscapes. The base network is ResNet-101 pre-trained on ImageNet. The experimental hardware platform is Core i7 processor, 3.6GHz main frequency, 48G memory, GPU is NVIDIA GTX 1080, and the code runs on the TensorFlow deep learning framework.

1、消融实验1. Ablation experiment

本节逐步分解所提出的方法，验证各个组建模块的有效性，接下来的实验中，在PASCAL VOC 2012的验证集上评估并比较所得的数据。首先，基于原始的ResNet-101 作为基础网络，并在末端直接上采样输出，如表1所示。This section decomposes the proposed method step by step to verify the effectiveness of each building block. In the next experiment, we evaluate and compare the obtained data on the validation set of PASCAL VOC 2012. First, based on the original ResNet-101 as the base network, and directly upsample the output at the end, as shown in Table 1.

表1随机缩放与翻转增强数据集后的效果Table 1 The effect of random scaling and flipping the enhanced dataset

随后，将基础网络扩展到基于FCN编码-解码架构的特征融合，特征融合策略采用裁剪和上采样后按通道简单求和。为了检验这种特征融合的有效性，选择一系列特征子集列出各阶段特征融合的效果，并与各阶段构建块层数重部署后效果做了对比，如表2 所示。Subsequently, the basic network is extended to feature fusion based on the FCN encoding-decoding architecture, and the feature fusion strategy adopts simple summation by channels after cropping and upsampling. In order to test the effectiveness of this feature fusion, a series of feature subsets are selected to list the effect of feature fusion at each stage, and compared with the effect after re-deployment of building block layers at each stage, as shown in Table 2.

从表2中第2列可以很清晰地看出，融合更多的层级特征的确逐渐提升了分割系统的输出质量，然而越往后融合更多的低层特征时，整体性能快速趋于饱和。分水岭在于ResNet-101的res-4阶段，该阶段共有23个构建块共69个卷积层，使得res-2阶段输出的低层特征与res-3阶段输出的低层特征之间存在巨大的语义鸿沟，这种隔阂使得融合res-3阶段输出的低层特征时整体性能提升几乎为零，后期继续融合的效果也并不显著。From the second column in Table 2, it can be clearly seen that fusing more hierarchical features does gradually improve the output quality of the segmentation system. However, when more low-level features are fused later, the overall performance quickly tends to saturate. The watershed lies in the res-4 stage of ResNet-101, which has a total of 23 building blocks and 69 convolutional layers, making a huge semantic gap between the low-level features output by the res-2 stage and the low-level features output by the res-3 stage , this gap makes the overall performance improvement almost zero when fusing the low-level features output by the res-3 stage, and the effect of continuing to fuse in the later stage is not significant.

表2构建块层数重部署前后融合特征的效果Table 2 Effects of fusion features before and after redeployment of building block layers

因此得出结论，差异性巨大的特征之间的融合基本无效。表中第3列显示了四个构建块层数重部署之后的特征融合效果。初始时，res-5输出的特征上采样后的分割质量比之重部署之前略有不及，但几乎可以忽略不计，证实了构建块层数重部署并没有强化卷积网络本身的分类能力。与第2列不同的是，随着往后融合更多的低层特征时，性能稳步提升，虽然提升步伐并不稳定，但也不像第2列中那样快速饱和。重部署机制使得 ResNet-101原先的res2、res-3、res-4和res-5四个阶段的构建块数由{3，4，23，3}变为 {8，8，9，8}，使得各阶段输出特征中特征之间的隔阂变化相对缩小，特征融合效果更佳，并于最后优于重部署前的性能，高出0.52个百分点。Therefore, it is concluded that the fusion between features with huge differences is basically ineffective. The third column in the table shows the effect of feature fusion after redeployment of four building block layers. Initially, the segmentation quality of res-5 output feature upsampling is slightly inferior to that before re-deployment, but it is almost negligible, which confirms that the re-deployment of the number of building block layers does not strengthen the classification ability of the convolutional network itself. The difference from column 2 is that as more low-level features are fused later, the performance improves steadily, although the improvement pace is not stable, but it is not as fast saturated as in column 2. The re-deployment mechanism makes the number of building blocks in the original four stages of res2, res-3, res-4 and res-5 of ResNet-101 change from {3, 4, 23, 3} to {8, 8, 9, 8} , so that the gap between the features in the output features of each stage is relatively narrowed, and the feature fusion effect is better, and finally it is better than the performance before redeployment, which is 0.52 percentage points higher.

表3显示了整个模型各部分组件的有效性。Table 3 shows the effectiveness of various components of the whole model.

表3 PASCAL VOC2012验证集上的消融实验对比Table 3 Comparison of ablation experiments on the PASCAL VOC2012 validation set

末端高层提取出的语义信息含有强大的语义一致性，通过强大的语义约束，逐步向低层阶段融合，得到更加细致的图像语义特征，将模型性能提升了1.1％。注意力机制是整个模型最重要的改进。与按通道简单求和的融合方法不同，该机制生成的通道注意向量选择低层特征中最具判别力的信息，从而很好地细化对象分割的边界，在前述基础上提升了模型2.06％的性能，比之其他组件模块，贡献最大。最终的逐层标签监督细化了融合后的层级特征，使每次融合后的特征进一步靠近监督，对整个模型的性能提升了 0.43％。末端模块除了生成高层语义信息之外，还有一个分支输出全局池化特征。利用全局池化特征进一步约束res-2阶段低层特征融合后的输出，强化了整个模型在处理图像时对目标所有像素的语义一致性。全局池化分支提升了模型0.96％的性能，具备重要的价值。The semantic information extracted by the high-level end contains strong semantic consistency. Through strong semantic constraints, it is gradually fused to the low-level stage to obtain more detailed image semantic features, which improves the model performance by 1.1%. The attention mechanism is the most important improvement of the whole model. Different from the fusion method of simple summation by channel, the channel attention vector generated by this mechanism selects the most discriminative information in the low-level features, so as to refine the boundary of object segmentation well, and improves the model's performance by 2.06% on the basis of the foregoing. Performance, compared to other component modules, contributes the most. The final layer-by-layer label supervision refines the fused hierarchical features, making each fused feature closer to the supervision, improving the performance of the entire model by 0.43%. In addition to generating high-level semantic information, the end module also has a branch to output global pooling features. The global pooling feature is used to further constrain the output of the low-level feature fusion in the res-2 stage, which strengthens the semantic consistency of the entire model for all pixels of the target when processing images. The global pooling branch improves the performance of the model by 0.96%, which is of great value.

2、定性分析2. Qualitative analysis

表4中展示了几种比较方法的图像语义分割可视化效果。Table 4 shows the visualization of image semantic segmentation for several comparison methods.

表4部分图像分割效果可视化Table 4 Partial image segmentation effect visualization

表中第三栏和第五栏，FCN方法显示出对象局部区域误识别现象。第三栏原图中对象是三只牛，其中两只相对而言尺度较小。FCN基础网络显现出目标对象局部区域识别错误的问题，尺度较大的牛的前两只脚与地面视觉效果相近，外观略显复杂，虽然有一定的分割效果但被却误分类为马的脚底。对于两只较小的牛更是不少像素区域分类错误，错误分类的区域也被误识别为马，推测应该是训练集中牛的尺度一般较大，模型无法很好地处理较小的同类目标。本发明显示出几乎完美的效果，很好地应对了FCN丢失图像细节导致地对象局部像素分类出错的问题。第五栏图像是白色栏杆边上的白马，视觉上马背与马腿被栏杆遮挡。由于颜色相近，FCN方法直接没有识别出栏杆以上的马背部分，栏杆遮挡下的马腿也显示出模糊的结果。本发明稍加完美，除了极少部分像素出现判断失误，基本上没有问题。In the third and fifth columns of the table, the FCN method shows the phenomenon of misidentification of object local regions. The objects in the original picture in the third column are three cows, two of which are relatively small in scale. The FCN basic network shows the problem of incorrect recognition of the local area of the target object. The front two feet of the large-scale cow are similar to the ground visual effect, and the appearance is slightly complicated. Although it has a certain segmentation effect, it is misclassified as the sole of the horse's foot. . For the two smaller cows, many pixel regions are misclassified, and the misclassified regions are also misidentified as horses. It is speculated that the scale of cattle in the training set is generally large, and the model cannot handle smaller similar targets well. . The present invention shows an almost perfect effect, and well copes with the problem that the FCN loses image details and causes errors in local pixel classification of ground objects. The image in the fifth column is a white horse beside the white railing. Visually, the horse's back and legs are blocked by the railing. Due to the similar colors, the FCN method does not directly recognize the part of the horse's back above the railing, and the horse's legs under the railing also show blurred results. The present invention is a little more perfect, and there is basically no problem except that a very small number of pixels have misjudgment.

第一、二和四栏中，FCN方法显示出分割对象边界的模糊性。第一栏输入图像是一只羊，黑白色的底片中羊身上的部分色调光泽与地面背景接近为亮度最高的白色。FCN 分割结果中部分亮色地面背景被误识别为山羊身体的一部分，且误识别区域较为散乱，本发明在注意力机制的基础上得到的分割图中很好地消除了散乱的误识别区，边界十分清晰，对分割的约束效果极佳。第二栏中的机箱加显示器和第四栏中的赛马也是如此。In the first, second and fourth columns, the FCN method shows the ambiguity of the boundaries of the segmented objects. The input image in the first column is a sheep. In the black and white negative film, the part of the sheep’s body tone gloss is close to the ground background, which is the white with the highest brightness. In the FCN segmentation results, part of the bright-colored ground background is misidentified as a part of the goat's body, and the misidentified area is relatively scattered. The segmentation map obtained on the basis of the attention mechanism of the present invention eliminates the scattered misidentified areas. It is very clear and has an excellent constraint effect on segmentation. The same goes for the case plus monitor in the second column and the racehorse in the fourth column.

3、定量评估3. Quantitative evaluation

本发明在PASCAL VOC 2012增强版和Cityscapes数据集上进行了几种方法的实验结果定量分析与比较。测试结果如表5所示。In the present invention, the quantitative analysis and comparison of the experimental results of several methods are carried out on the PASCAL VOC 2012 enhanced version and the Cityscapes data set. The test results are shown in Table 5.

表5注意力机制模型在PASCAL VOC 2012测试集上逐类别的准确率Table 5 Class-by-category accuracy of the attention mechanism model on the PASCAL VOC 2012 test set

本发明与DeepLab进行比较时，有一半左右的类别准确率高于DeepLab，且部分类别的准确率属于远远高出，最终总的准确率略高于DeepLab。在与前沿的LRR方法进行比较时，本发明在大部分类别上具有较高的准确率，其中自行车，船，瓶子，椅子，盆栽，沙发，电视等类别上比LRR高出3％，有的甚至高出15％到20％，这些类别都是分割难度较大且易混淆的类别，本发明方法由高层语义指导细致地融合了多个低层级的特征，因此在处理有较多语义细节的自行车，椅子，盆栽等类别时具有特征提取上的优势，分割出的目标具有很强的语义一致性，极少出现局部区域误识别等问题。对于奶牛，羊，狗等具有相似外观的类别目标，也能够区分出复杂的语义类别。When the present invention is compared with DeepLab, the accuracy rate of about half of the categories is higher than that of DeepLab, and the accuracy rate of some categories is far higher, and the final total accuracy rate is slightly higher than that of DeepLab. When compared with the cutting-edge LRR method, the present invention has a higher accuracy rate on most categories, among which bicycle, boat, bottle, chair, potted plant, sofa, TV and other categories are 3% higher than LRR, and some Even higher by 15% to 20%, these categories are difficult to segment and easily confused categories, the method of the present invention is guided by high-level semantics to carefully integrate multiple low-level features, so when dealing with more semantic details Bicycles, chairs, potted plants and other categories have advantages in feature extraction, the segmented objects have strong semantic consistency, and there are few problems such as misidentification of local areas. For category objects with similar appearances such as cows, sheep, dogs, etc., complex semantic categories can also be distinguished.

最后，也在Cityscapes数据集上评估本发明的方法。在训练过程中，每张图像被裁剪成800×800。通过观察发现对于高分辨率图像来说，大规模的裁剪是很有用的。模型在测试集上的性能表现如表6所示。与PASCAL VOC 2012情况类似，本发明在大部分对象的分割中取得了最好的结果，并在最终的的份上优于其他方法。Finally, the method of the present invention is also evaluated on the Cityscapes dataset. During training, each image is cropped to 800×800. It has been observed that for high-resolution images, large-scale cropping is useful. The performance of the model on the test set is shown in Table 6. Similar to the case of PASCAL VOC 2012, the present invention achieves the best results in the segmentation of most objects and outperforms other methods in the final score.

表6注意力机制模型Cityscapes测试集上逐类别的准确率Table 6 Class-by-category accuracy of the attention mechanism model on the Cityscapes test set

本发明利用卷积网络编码器将不同层级的语义信息嵌入特征图，再使用解码器整合各个特征图细化输出并生成最终的分割结果。The present invention uses a convolutional network encoder to embed semantic information of different levels into a feature map, and then uses a decoder to integrate each feature map to refine the output and generate a final segmentation result.

编码器是用于提取图像特征的预训练卷积模型，其最顶层特征具有高度的语义，但由于分辨率不足，在重建分割图的精确细节方面能力不足。而编码器底端的特征具有高分辨率细节却缺乏强大的语义信息。编码器重新部署了各阶段中的构建块数量以平衡特征之间的语义差异性变化，并于res5b块使用扩张率为2的空洞卷积。通过末端高层语义信息提取模块生成强大的语义一致性约束，在解码阶段通过注意力机制自顶向下地逐层融合低分辨率的高层特征和高分辨率的低层特征，利用高层特征强大的语义一致性指导特征融合，以生成高分辨率的语义结果。Encoders are pre-trained convolutional models used to extract image features, whose topmost features are highly semantic, but insufficient in reconstructing the precise details of segmentation maps due to insufficient resolution. While the features at the bottom of the encoder have high-resolution details but lack strong semantic information. The encoder redeploys the number of building blocks in each stage to balance the semantic discrepancy changes between features, and uses atrous convolution with a dilation rate of 2 in the res5b block. Powerful semantic consistency constraints are generated through the terminal high-level semantic information extraction module. In the decoding stage, low-resolution high-level features and high-resolution low-level features are fused layer by layer through the attention mechanism, and the strong semantic consistency of high-level features is used. Sexual guidance feature fusion to generate high-resolution semantic results.

Claims

1. an image semantic segmentation method based on attention mechanism guidance feature fusion, is characterized in that, comprises the steps:

(10) Encoder basic network construction: use the improved ResNet-101 to generate a series of features ranging from high resolution and low semantics to low resolution and high semantics;

(20) Decoder feature fusion module construction: a pyramid structure module based on three-layer convolution operation is used to extract high-level semantics with strong consistency constraints, and then weighted and fused to low-level features layer by layer to obtain a preliminary segmentation heat map;

(30) Auxiliary loss function construction: add auxiliary supervision to each fusion output in the decoding stage, and then superimpose with the main supervision loss after sampling on the heat map, strengthen the layered training of the model, and obtain a semantic segmentation map.

2. image semantic segmentation method according to claim 1, is characterized in that,

The (10) encoder basic network construction steps include:

(11) Re-deployment of the number of building block layers: redeploy the number of building blocks owned by each of res-2 to res-5, and res-2 to res-5 of the original ResNet-101 {3, 4, 23, 3} The number of building blocks is adjusted to {8, 8, 9, 8};

(12) Expand the receptive field: Change the traditional convolution of the res-5 stage in the basic network structure of ResNet-101 to an empty convolution with an expansion rate of 2.

3. image semantic segmentation method according to claim 1, is characterized in that,

The (20) decoder feature fusion module construction steps include:

(21) Extract high-level semantic information at the end: use a pyramid-like structural module based on three-layer convolution operations, use 3×3, 5×5 and 7×7 convolutions in the module, and fuse contexts of different scales, Get high-level semantics with the strongest intra-class semantic consistency;

(22) Fusion of contextual features: The channel attention vector is calculated by merging the features of the adjacent stages layer by layer, which is used as a weight to select the feature information with strong discriminative power in the low-level stage, and fused with the adjacent high-stage features to obtain Preliminary segmentation heatmap.