CN109711413B

CN109711413B - Image semantic segmentation method based on deep learning

Info

Publication number: CN109711413B
Application number: CN201811646148.7A
Authority: CN
Inventors: 郭敏; 丁晓; 马苗; 陈昱莅; 裴炤
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2018-12-30
Filing date: 2018-12-30
Publication date: 2023-04-07
Anticipated expiration: 2038-12-30
Also published as: CN109711413A

Abstract

An image semantic segmentation method based on deep learning, which consists of four parts: data set processing, construction of deep semantic segmentation network, deep semantic segmentation network training and parameter learning, and semantic segmentation of test images. The present invention takes the RGB image and the grayscale image of the input image as the input of the network model, makes full use of the edge information of the grayscale image, and effectively increases the richness of the input features; Based on the local features of the image, capture more context dependencies and global feature information; add coordinate information to the feature map through the first coordinate channel module and the second coordinate channel module, enrich the coordinate features of the model, and improve the generalization ability of the model , producing semantic segmentation results with high resolution and precise boundaries.

Description

Image Semantic Segmentation Method Based on Deep Learning

技术领域technical field

本发明属于计算机视觉与深度学习技术领域，具体涉及一种基于深度学习的图像语义分割方法。The invention belongs to the technical field of computer vision and deep learning, and in particular relates to an image semantic segmentation method based on deep learning.

背景技术Background technique

图像语义分割是从像素水平上，理解、识别图片的内容，其目的是建立每个像素和语义类别之间的一一映射关系，根据语义信息进行分割，其被广泛应用于场景理解、自动驾驶、医学影像分析、机器人视觉等领域。图像语义分割是图像理解的基石，其分割结果的好坏将直接影响对后续图像内容的处理，因此，对图像语义分割技术的研究具有非常重要的现实意义。Image semantic segmentation is to understand and recognize the content of pictures from the pixel level. Its purpose is to establish a one-to-one mapping relationship between each pixel and semantic categories, and perform segmentation based on semantic information. It is widely used in scene understanding and automatic driving. , medical image analysis, robot vision and other fields. Image semantic segmentation is the cornerstone of image understanding, and the quality of its segmentation results will directly affect the processing of subsequent image content. Therefore, the research on image semantic segmentation technology has very important practical significance.

传统的图像语义分割方法大多依赖于手工特征提取和概率图模型，如：随机森林、条件随机场(CRF)、马尔科夫随机场(MRF)等，这些方法只能学习浅层的表征信息，不能产生准确精细的分割结果。2012年以来，随着深度学习的快速发展，基于卷积神经网络的图像语义分割方法成为了研究热点。2014年，Hariharan 等人提出了协同目标检测和语义分割的SDS(simultaneous detection and segmentation)方法，该方法首先使用MCG方法抽取每幅图像中多个候选区域，然后分两路使用CNN提取边界框(bounding-box)特征和前景区域特征，并完成两路特征的信息融合，最后，利用非最大约束NMS(non-maximum suppression)方法生成语义分割结果。除了SDS方法，类似的还有R-CNN、SPP等方法，都是基于候选区域的语义分割方法，但这类方法依赖于大量的区域建议，导致内存消耗非常大，训练时间比较长，得到的语义分割结果精度低。Most of the traditional image semantic segmentation methods rely on manual feature extraction and probabilistic graphical models, such as: random forest, conditional random field (CRF), Markov random field (MRF), etc. These methods can only learn shallow representation information, Accurate and fine segmentation results cannot be produced. Since 2012, with the rapid development of deep learning, image semantic segmentation methods based on convolutional neural networks have become a research hotspot. In 2014, Hariharan et al. proposed the SDS (simultaneous detection and segmentation) method of collaborative target detection and semantic segmentation. This method first uses the MCG method to extract multiple candidate regions in each image, and then uses CNN to extract the bounding box in two ways ( bounding-box) features and foreground area features, and complete the information fusion of the two features, and finally, use the non-maximum constraint NMS (non-maximum suppression) method to generate semantic segmentation results. In addition to the SDS method, there are similar methods such as R-CNN and SPP, which are semantic segmentation methods based on candidate regions, but these methods rely on a large number of region proposals, resulting in very large memory consumption and long training time. The accuracy of semantic segmentation results is low.

为了进一步降低内存开销，提升语义分割精度。2015年，Long等人提出了全卷积网络模型FCN(fully convolutional networks)，该模型将深度卷积神经网路最后的全连接层全部转换成卷积层，形成端到端、像素到像素的全卷积网络框架，使图像语义分割进入了一个全新的时代。Kendall等人提出了一种深度卷积编码－解码架构SegNet，该模型由一个卷积编码网络和一个反卷积解码网络组成，每一个编码器层都对应一个解码器层，最终编码器的输出被送入soft-max分类器进行逐像素分类。Chen等人在FCN的基础上，提出了一个更加成熟的语义分割模型Deeplab-CRF，该模型采用优化后的DCNN(深度卷积神经网络)得到粗糙得分图并通过双线性插值上采样到原图像大小，然后使用全连接条件随机场(CRF)进行迭代优化，得到精细的语义分割结果。In order to further reduce memory overhead and improve semantic segmentation accuracy. In 2015, Long et al. proposed the fully convolutional network model FCN (fully convolutional networks), which converts the last fully connected layer of the deep convolutional neural network into a convolutional layer to form an end-to-end, pixel-to-pixel The fully convolutional network framework has brought image semantic segmentation into a new era. Kendall et al. proposed a deep convolutional encoding-decoding architecture SegNet, which consists of a convolutional encoding network and a deconvolutional decoding network. Each encoder layer corresponds to a decoder layer, and the output of the final encoder is are fed into a soft-max classifier for pixel-by-pixel classification. On the basis of FCN, Chen et al. proposed a more mature semantic segmentation model Deeplab-CRF, which uses an optimized DCNN (deep convolutional neural network) to obtain a rough score map and upsample to the original image through bilinear interpolation. The image size is then iteratively optimized using a fully connected conditional random field (CRF) to obtain fine semantic segmentation results.

上述语义分割方法的缺点：其一，模型输入一般为RGB图像，输入过于单一，可能会导致局部特征缺失；其二，这些方法都是基于卷积神经网络来进行特征提取的，没有充分利用图像的局部特征信息和全局上下文依赖关系，导致图像的分割边缘非常粗糙，分割精度非常低。Disadvantages of the above semantic segmentation methods: first, the model input is generally an RGB image, and the input is too single, which may lead to the loss of local features; second, these methods are based on convolutional neural networks for feature extraction, and do not make full use of images. The local feature information and global context dependence of the image lead to very rough segmentation edges of the image and very low segmentation accuracy.

发明内容Contents of the invention

本发明所要解决的技术问题在于克服现有方法的缺陷，提供一种分割精度高、泛化能力强的基于深度学习的图像语义分割方法。The technical problem to be solved by the present invention is to overcome the defects of the existing methods and provide an image semantic segmentation method based on deep learning with high segmentation accuracy and strong generalization ability.

解决上述技术问题所采用的技术方案包括下述步骤：The technical solution adopted to solve the above technical problems comprises the following steps:

S1、数据集处理S1. Data set processing

将图像数据集分为训练图像集和测试图像集，并对训练图像集进行数据增强操作，将训练图像的数量增加到万级单位；Divide the image data set into a training image set and a test image set, and perform data enhancement operations on the training image set to increase the number of training images to tens of thousands of units;

S2、构建深度语义分割网络S2. Build a deep semantic segmentation network

深度语义分割网络由并行深度神经网络模块、特征融合模块、Softmax分类层构成，所述的并行深度神经网络模块用于对输入图像进行特征提取，所述特征融合模块将并行深度神经网络的输出特征图进行加权融合得到新的特征图，所述Softmax分类层将像素类别标签预测分值转换成像素类别标签预测概率分布图；The deep semantic segmentation network consists of a parallel deep neural network module, a feature fusion module, and a Softmax classification layer. The parallel deep neural network module is used to extract features from the input image, and the feature fusion module combines the output features of the parallel deep neural network. Graph carries out weighted fusion and obtains new feature map, and described Softmax classification layer converts pixel category label prediction score into pixel category label prediction probability distribution map;

所述的并行深度神经网络模块由第一深度神经网络模块和第二深度神经网络模块组成，且第一深度神经网络模块和第二深度神经网络模块网络结构相同，第一深度神经网络模块的输入为输入图像的RGB图像，第二深度神经网络模块的输入为输入图像的灰度图像；Described parallel deep neural network module is made up of the first deep neural network module and the second deep neural network module, and the first deep neural network module and the second deep neural network module network structure are identical, the input of the first deep neural network module Be the RGB image of input image, the input of the second depth neural network module is the grayscale image of input image;

所述的第一深度神经网络模块由全卷积网络模块、第一坐标通道模块、第一循环层模块、第二坐标通道模块、第二循环层模块、空间金字塔池化模块构成，第一坐标通道模块与第二坐标通道模块的结构相同，第一循环层模块与第二循环层模块的结构相同，所述全卷积网络模块对输入图像进行局部特征提取，所述第一循环层模块用于捕获图像的上下文依赖关系和全局特征信息，所述第一坐标通道模块对全卷积网络模块输出的特征图连接i、j、r坐标通道构成新的特征图，以学习更多的坐标特征信息并提高模型的泛化能力，所述空间金字塔池化模块对第二循环层模块输出的特征图在多个采样率上进行卷积操作，提取不同尺度区域的特征信息；The first deep neural network module is composed of a full convolutional network module, a first coordinate channel module, a first circulation layer module, a second coordinate channel module, a second circulation layer module, and a spatial pyramid pooling module. The structure of the channel module is the same as that of the second coordinate channel module, the structure of the first recurrent layer module is the same as that of the second recurrent layer module, the full convolution network module performs local feature extraction on the input image, and the first recurrent layer module uses In order to capture the context dependencies and global feature information of the image, the first coordinate channel module connects the i, j, r coordinate channels to the feature map output by the full convolutional network module to form a new feature map to learn more coordinate features information and improve the generalization capability of the model, the spatial pyramid pooling module performs convolution operations on the feature map output by the second loop layer module on multiple sampling rates, and extracts feature information of different scale regions;

S3、深度语义分割网络训练及参数学习S3, deep semantic segmentation network training and parameter learning

S31、网络模型参数初始化：使用ResNet101在ImageNet数据集上的预训练模型对全卷积网络模块进行参数初始化，使用标准均匀分布对第一循环层模块和第二循环层模块进行参数初始化，使用标准高斯分布对空间金字塔池化模块的卷积层进行参数初始化；S31. Network model parameter initialization: use the pre-training model of ResNet101 on the ImageNet dataset to initialize the parameters of the full convolutional network module, use the standard uniform distribution to initialize the parameters of the first cycle layer module and the second cycle layer module, and use the standard The Gaussian distribution initializes the parameters of the convolution layer of the spatial pyramid pooling module;

S32、使用数据增强后的训练图像集训练深度语义分割网络，生成像素类别预测标签概率分布图，利用预测标签概率和原始标签概率计算预测损失，具体采用混合损失函数L(θ)作为目标函数，S32. Use the data-enhanced training image set to train the deep semantic segmentation network, generate a pixel category prediction label probability distribution map, and use the predicted label probability and the original label probability to calculate the prediction loss. Specifically, the mixed loss function L(θ) is used as the objective function.

L(θ)＝L₁(θ)+L₂(θ)L(θ)＝L ₁ (θ)+L ₂ (θ)

式中L₁(θ)为交叉熵损失函数，L₂(θ)为L2正则化项，θ是深度语义分割网络的参数；where L ₁ (θ) is the cross-entropy loss function, L ₂ (θ) is the L2 regularization term, and θ is the parameter of the deep semantic segmentation network;

S33、采用随机梯度下降算法优化目标函数，运用反向传播算法更新网络模型参数，直到目标函数的值不再下降时结束训练；S33, using the stochastic gradient descent algorithm to optimize the objective function, using the backpropagation algorithm to update the network model parameters, and ending the training until the value of the objective function no longer decreases;

S4、对测试图像进行语义分割S4. Perform semantic segmentation on the test image

S41、将测试图像集输入步骤S3训练好的深度语义分割网络；S41. Input the test image set into the deep semantic segmentation network trained in step S3;

S42、并行深度神经网络模块对输入的测试图像集进行特征提取S42, the parallel deep neural network module performs feature extraction on the input test image set

测试图像的RGB图像作为第一深度神经网络模块的输入，测试图像的灰度图像作为第二深度神经网络模块的输入；The RGB image of the test image is used as the input of the first deep neural network module, and the grayscale image of the test image is used as the input of the second deep neural network module;

第一深度神经网络模块特征提取过程为：全卷积网络模块通过空洞卷积、最大池化、卷积操作对测试图像的RGB图像进行局部特征提取；将全卷积网络模块输出的特征图通过第一坐标通道模块得到新的特征图送入第一循环层模块进行水平和垂直扫描，学习图像的全局特征信息；将第一循环层模块输出的特征图通过第二坐标通道模块得到新的特征图再送入第二循环层模块进行水平和垂直扫描，捕获图像的全局特征信息；将第二循环层模块输出的特征图输入空间金字塔池化模块，在多个采样率上进行卷积操作，提取不同尺度区域的特征信息；The feature extraction process of the first deep neural network module is as follows: the full convolutional network module extracts local features from the RGB image of the test image through atrous convolution, maximum pooling, and convolution operations; the feature map output by the full convolutional network module is passed through The new feature map obtained by the first coordinate channel module is sent to the first circulation layer module for horizontal and vertical scanning, and the global feature information of the image is learned; the feature map output by the first circulation layer module is passed through the second coordinate channel module to obtain new features The image is then sent to the second circulation layer module for horizontal and vertical scanning to capture the global feature information of the image; the feature map output by the second circulation layer module is input into the spatial pyramid pooling module, and convolution operations are performed at multiple sampling rates to extract Feature information of regions of different scales;

第二深度神经网络模块特征提取过程与第一深度神经网络模块特征提取过程相同；The second deep neural network module feature extraction process is the same as the first deep neural network module feature extraction process;

S43、将第一深度神经网络模块输出的特征图与第二深度神经网络模块输出的特征图进行加权融合得到新的特征图；S43. Perform weighted fusion of the feature map output by the first deep neural network module and the feature map output by the second deep neural network module to obtain a new feature map;

S44、将步骤S43的结果送入Softmax分类层进行像素类别标签预测，得到图像中每个像素所属的物体类别，并做双线性插值操作上采样到原图像尺寸，得到精细的语义分割图。S44. Send the result of step S43 to the Softmax classification layer for pixel category label prediction, obtain the object category to which each pixel in the image belongs, and perform bilinear interpolation to upsample to the original image size to obtain a fine semantic segmentation map.

作为一种优选的技术方案，所述的第一循环层模块由两个双向门限递归单元构成，双向门限递归单元的神经元个数为150。As a preferred technical solution, the first recurrent layer module is composed of two bidirectional threshold recurrent units, and the number of neurons in the bidirectional threshold recurrent unit is 150.

作为一种优选的技术方案，所述的空间金字塔池化模块由4个不同采样率的空洞卷积构成，空洞卷积的卷积核大小为3×3，扩张率分别为4、6、8、12。As a preferred technical solution, the spatial pyramid pooling module is composed of four atrous convolutions with different sampling rates, the convolution kernel size of the atrous convolution is 3×3, and the dilation rates are 4, 6, and 8 respectively. , 12.

作为一种优选的技术方案，所述的步骤S2中i、j、r坐标通道由i坐标通道、j坐标通道、r坐标通道构成，i坐标通道、j坐标通道及r坐标通道均为e×f的坐标矩阵，i坐标通道第1行～第e行的元素依次为0、1、...、e-1，j坐标通道第1列～第f列的元素依次为0、1、...、f-1，e、f取正整数，r坐标通道为

m为i坐标通道中的任意元素，n为j坐标通道中与m坐标位置相同的元素，将i坐标通道和j坐标通道中的元素线性缩放到[-1,1]范围内。As a preferred technical solution, the i, j, and r coordinate channels in the step S2 are composed of i coordinate channel, j coordinate channel, and r coordinate channel, and the i coordinate channel, j coordinate channel and r coordinate channel are all e× The coordinate matrix of f, the elements in the first row to the eth row of the i coordinate channel are 0, 1, ..., e-1 in sequence, and the elements in the first column to the f column of the j coordinate channel are 0, 1, . .., f-1, e, f take positive integers, and the r coordinate channel is

m is any element in the i coordinate channel, n is the element in the j coordinate channel at the same position as the m coordinate, and the elements in the i coordinate channel and the j coordinate channel are linearly scaled to the range [-1,1].

作为一种优选的技术方案，所述的步骤S3中参数学习的学习率按照如下公式进行衰减：As a preferred technical solution, the learning rate of the parameter learning in the step S3 is attenuated according to the following formula:

式中t为迭代次数，l₀是初始学习率，l_t是第t次迭代的学习率，power是动量为0.9。In the formula, t is the number of iterations, l ₀ is the initial learning rate, l _t is the learning rate of the t-th iteration, and power is the momentum of 0.9.

本发明的有益效果如下：The beneficial effects of the present invention are as follows:

本发明将输入图像的RGB图像和灰度图像作为网络模型的输入，充分利用灰度图像的边缘信息，有效增加输入特征的丰富程度；把卷积神经网络和双向门限递归单元相结合，在学习图像局部特征的基础上，捕获更多的上下文依赖关系和全局特征信息；通过第一坐标通道模块和第二坐标通道模块对特征图加入坐标信息，丰富模型的坐标特征，提升模型的泛化能力，产生分辨率高、边界精确的语义分割结果。The present invention uses the RGB image and the grayscale image of the input image as the input of the network model, makes full use of the edge information of the grayscale image, and effectively increases the richness of the input features; Based on the local features of the image, capture more context dependencies and global feature information; add coordinate information to the feature map through the first coordinate channel module and the second coordinate channel module, enrich the coordinate features of the model, and improve the generalization ability of the model , producing semantic segmentation results with high resolution and precise boundaries.

附图说明Description of drawings

图1是基于深度学习的图像语义分割方法流程图。Figure 1 is a flow chart of an image semantic segmentation method based on deep learning.

图2是第一深度神经网络模块。Fig. 2 is the first deep neural network module.

图3是WeizmannHorse数据集中部分测试图像的语义分割图。Figure 3 is a semantic segmentation map of some test images in the WeizmannHorse dataset.

图4是StanfordBackground数据集中部分测试图像的语义分割图。Figure 4 is a semantic segmentation diagram of some test images in the StanfordBackground dataset.

具体实施方式Detailed ways

下面结合附图和实施例对本发明进一步详细说明，但本发明不限于这些实施例。The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments, but the present invention is not limited to these embodiments.

实施例1Example 1

WeizmannHorse数据集是一个由328幅图像组成的图像分割数据集，数据集中部分图像如图3所示，网络模型的训练使用Pytorch平台，代码在python上编写完成，本实施例基于深度学习的图像语义分割方法，如图1所示，其步骤如下：The WeizmannHorse dataset is an image segmentation dataset consisting of 328 images. Some images in the dataset are shown in Figure 3. The network model is trained using the Pytorch platform, and the code is written on python. This example is based on image semantics of deep learning The segmentation method, as shown in Figure 1, has the following steps:

S1、数据集处理S1. Data set processing

从WeizmannHorse数据集中随机选取200张作为训练图像集，剩下的128张作为测试图像集，并对训练图像集进行数据增强操作，使训练图像的数量增加到11000张；Randomly select 200 images from the WeizmannHorse dataset as a training image set, and the remaining 128 images as a test image set, and perform data enhancement operations on the training image set to increase the number of training images to 11,000;

深度语义分割网络由并行深度神经网络模块、特征融合模块、Softmax分类层构成，并行深度神经网络模块用于对输入图像进行特征提取，特征融合模块将两个并行深度神经网络的输出特征图进行加权融合得到新的特征图，Softmax分类层将像素类别标签预测分值转换成像素类别标签预测概率分布图；The deep semantic segmentation network consists of a parallel deep neural network module, a feature fusion module, and a Softmax classification layer. The parallel deep neural network module is used to extract features from the input image, and the feature fusion module weights the output feature maps of the two parallel deep neural networks. A new feature map is obtained by fusion, and the Softmax classification layer converts the pixel category label prediction score into a pixel category label prediction probability distribution map;

并行深度神经网络模块由第一深度神经网络模块和第二深度神经网络模块组成，且第一深度神经网络模块和第二深度神经网络模块结构相同，第一深度神经网络模块的输入为输入图像的RGB图像，第二深度神经网络模块的输入为输入图像的灰度图像；The parallel deep neural network module is made up of the first deep neural network module and the second deep neural network module, and the first deep neural network module and the second deep neural network module have the same structure, and the input of the first deep neural network module is the input image RGB image, the input of the second depth neural network module is the gray scale image of input image;

在图2中，第一深度神经网络模块由全卷积网络模块、第一坐标通道模块、第一循环层模块、第二坐标通道模块、第二循环层模块、空间金字塔池化模块构成，第一坐标通道模块与第二坐标通道模块的结构相同，第一循环层模块与第二循环层模块的结构相同；In Figure 2, the first deep neural network module is composed of a full convolutional network module, a first coordinate channel module, a first circulation layer module, a second coordinate channel module, a second circulation layer module, and a spatial pyramid pooling module. The first coordinate channel module has the same structure as the second coordinate channel module, and the first circulation layer module has the same structure as the second circulation layer module;

全卷积网络模块对输入图像进行局部特征提取，全卷积网络模块由Deeplab_largeFOV模型中Resnet101网络的第一卷积组～第五个卷积组构成，其中，第一卷积组～第三个卷积组使用卷积操作、最大池化操作，第四卷积组和第五个卷积组使用卷积操作、空洞卷积操作；The full convolutional network module performs local feature extraction on the input image. The full convolutional network module is composed of the first convolution group to the fifth convolution group of the Resnet101 network in the Deeplab_largeFOV model. Among them, the first convolution group to the third convolution group The convolution group uses convolution operation and maximum pooling operation, and the fourth convolution group and the fifth convolution group use convolution operation and hole convolution operation;

第一循环层模块由两个双向门限递归单元构成，双向门限递归单元的神经元个数为150，用于捕获图像的上下文依赖关系和全局特征信息；首先使用1×1的分块大小把特征图X分成G×K个不重叠的区域块，其中，G、K分别等于特征图X的长、宽；然后使用一个双向门限递归单元沿特征图X的每列进行垂直扫描，一个自顶向下扫描、一个自下向上扫描，每次读取一个区域块，并将扫描得到的输出预测按坐标索引连接起来得到一个复合特征图

同样地，使用另外一个双向门限递归单元沿着复合特征图

的每行进行水平扫描，一个自左向右扫描、一个自右向左扫描，每次读取一个区域块，并将输出预测按坐标索引连接起来得到一个新的复合特征图

新的复合特征图

具有来自整个图像的上下文信息；The first loop layer module is composed of two bidirectional threshold recurrent units. The number of neurons in the bidirectional threshold recurrent unit is 150, which is used to capture the context dependence and global feature information of the image; The graph X is divided into G×K non-overlapping area blocks, where G and K are equal to the length and width of the feature map X respectively; then a bidirectional threshold recursive unit is used to scan vertically along each column of the feature map X, and a Downscan, a bottom-up scan, reads one area block at a time, and connects the output predictions obtained by scanning according to the coordinate index to obtain a composite feature map

Similarly, use another bidirectional threshold recurrent unit along the composite feature map

Each line of each row is scanned horizontally, one scans from left to right, and the other scans from right to left, each time a region block is read, and the output predictions are connected by coordinate index to obtain a new composite feature map

New Composite Feature Map

have contextual information from the entire image;

第一坐标通道模块对全卷积网络模块输出的特征图连接i、j、r坐标通道构成新的特征图，以学习更多的坐标特征信息并提高模型的泛化能力；i、j、r坐标通道由i坐标通道、j坐标通道、r坐标通道构成，i坐标通道、j坐标通道及r坐标通道均为e×f的坐标矩阵，i坐标通道第1行～第e行的元素依次为0、1、...、e-1，j坐标通道第1列～第f列的元素依次为0、1、...、f-1，e、f取正整数，r坐标通道为

m为i坐标通道中的任意元素，n为j坐标通道中与m坐标位置相同的元素，将i坐标通道和j坐标通道中的元素线性缩放到[-1,1]范围内；The first coordinate channel module connects the feature map output by the full convolutional network module to the i, j, r coordinate channels to form a new feature map to learn more coordinate feature information and improve the generalization ability of the model; i, j, r The coordinate channel is composed of i coordinate channel, j coordinate channel and r coordinate channel. The i coordinate channel, j coordinate channel and r coordinate channel are all coordinate matrices of e×f. The elements in the first row to the eth row of the i coordinate channel are 0, 1, ..., e-1, the elements in the first column to the fth column of the j coordinate channel are 0, 1, ..., f-1 in turn, e, f take positive integers, and the r coordinate channel is

m is any element in the i coordinate channel, n is the element in the j coordinate channel with the same position as the m coordinate, and the elements in the i coordinate channel and the j coordinate channel are linearly scaled to the range of [-1,1];

空间金字塔池化模块对第二循环层模块输出的特征图在多个采样率上进行卷积操作，提取不同尺度区域的特征信息，该模块由4个不同采样率的空洞卷积构成，空洞卷积的卷积核大小为3×3，扩张率分别为4、6、8、12；The spatial pyramid pooling module performs convolution operations on the feature maps output by the second loop layer module at multiple sampling rates to extract feature information of regions of different scales. This module consists of four hole convolutions with different sampling rates. The hole convolution The convolution kernel size of the product is 3×3, and the expansion rates are 4, 6, 8, and 12 respectively;

S32、将数据增强后的训练图像集中的图像尺寸裁剪为330×330，使用裁剪后的训练图像训练深度语义分割网络，生成像素类别预测标签概率分布图，利用预测标签概率和原始标签概率计算预测损失，具体采用混合损失函数L(θ)作为目标函数，S32. Crop the size of the image in the training image set after data enhancement to 330×330, use the cropped training image to train the deep semantic segmentation network, generate a pixel category prediction label probability distribution map, and use the predicted label probability and the original label probability to calculate the prediction Loss, specifically using the hybrid loss function L(θ) as the objective function,

L(θ)＝L₁(θ)+L₂(θ)L(θ)＝L ₁ (θ)+L ₂ (θ)

本实施例的交叉熵损失函数L₁(θ)为：The cross-entropy loss function L ₁ (θ) of this embodiment is:

式中y_pq是预测标签概率向量，

是原始标签概率向量，N是每张图片的像素个数，N为330×330＝108900，B是批大小，B为10，C是像素类别数，C为2，ln(.)是求自然对数；where _ypq is the predicted label probability vector,

is the original label probability vector, N is the number of pixels in each picture, N is 330×330=108900, B is the batch size, B is 10, C is the number of pixel categories, C is 2, ln(.) is seeking nature logarithm;

本实施例的L2正则化项L₂(θ)为：The L2 regularization term L ₂ (θ) of this embodiment is:

式中λ是正则化系数且为正数，N是每张图像的像素个数，N为330×330＝108900，B是批大小，B为10，S是w的参数个数且S取正整数，w是权重参数；In the formula, λ is a regularization coefficient and is a positive number, N is the number of pixels of each image, N is 330×330=108900, B is the batch size, B is 10, S is the number of parameters of w and S is positive Integer, w is the weight parameter;

S33、采用随机梯度下降算法优化目标函数，运用反向传播算法更新网络模型参数，直到目标函数的值不再下降时结束训练，为了加速模型收敛，引入参数学习的学习率，学习率按照如下公式进行衰减：S33. Use the stochastic gradient descent algorithm to optimize the objective function, use the backpropagation algorithm to update the network model parameters, and end the training until the value of the objective function no longer decreases. In order to accelerate the model convergence, introduce a learning rate for parameter learning, and the learning rate is according to the following formula For attenuation:

式中t为迭代次数且t≤35000，l₀是初始学习率，l₀为0.003，l_t是第t次迭代的学习率，梯度衰减为0.0001，power是动量为0.9；where t is the number of iterations and t≤35000, l ₀ is the initial learning rate, l ₀ is 0.003, l _t is the learning rate of the t-th iteration, the gradient decay is 0.0001, and power is the momentum of 0.9;

测试图像的RGB图像作为第一深度神经网络模块的输入，测试图像对应的灰度图像作为第二深度神经网络模块的输入；The RGB image of the test image is used as the input of the first deep neural network module, and the grayscale image corresponding to the test image is used as the input of the second deep neural network module;

采用本实施例方法对WeizmannHorse数据集中128张测试图像进行语义分割，部分测试图像的语义分割图如图3所示，其中，第一行是输入图像、第二行是输入图像对应的彩色标签图像、第三行是其对应的语义分割图。Using the method of this embodiment to semantically segment 128 test images in the WeizmannHorse dataset, the semantic segmentation diagram of some test images is shown in Figure 3, where the first row is the input image, and the second row is the color label image corresponding to the input image , the third line is its corresponding semantic segmentation map.

实施例2Example 2

StanfordBackground数据集是一个由715幅图像组成的图像分割数据集，数据集中部分图像如图4所示，网络模型的训练使用Pytorch平台，代码在python上编写完成。The StanfordBackground dataset is an image segmentation dataset consisting of 715 images. Some images in the dataset are shown in Figure 4. The network model is trained using the Pytorch platform, and the code is written in python.

本实施例基于深度学习的图像语义分割方法，在步骤S1中从StanfordBackground数据集中随机选取573张作为训练图像集，剩下的142张作为测试图像集，并对训练图像集进行数据增强操作，使训练图像的数量增加到13752张；步骤S32中将数据增强后的训练图像集中的图像尺寸裁剪为421×421，使用裁剪后的训练图像训练深度语义分割网络，生成像素类别预测标签概率分布图，利用预测标签概率和原始标签概率计算预测损失，具体采用混合损失函数L(θ)作为目标函数，This embodiment is based on the image semantic segmentation method of deep learning. In step S1, 573 images are randomly selected from the StanfordBackground data set as a training image set, and the remaining 142 images are used as a test image set, and the data enhancement operation is performed on the training image set. The number of training images is increased to 13,752; in step S32, the size of the image in the training image set after data enhancement is cut to 421×421, and the training image is used to train the deep semantic segmentation network to generate a pixel category prediction label probability distribution map, The prediction loss is calculated using the predicted label probability and the original label probability, specifically using the mixed loss function L(θ) as the objective function,

L(θ)＝L₁(θ)+L₂(θ)L(θ)＝L ₁ (θ)+L ₂ (θ)

式中y_pq是预测标签概率向量，

是原始标签概率向量，N是每张图片的像素个数，N为421×421＝177241，B是批大小，B为6，C是像素类别数，C为8，ln(.)是求自然对数；where _ypq is the predicted label probability vector,

is the original label probability vector, N is the number of pixels of each picture, N is 421×421=177241, B is the batch size, B is 6, C is the number of pixel categories, C is 8, ln(.) is seeking nature logarithm;

式中λ是正则化系数且为正数，N是每张图像的像素个数，N为421×421＝177241，B是批大小，B为6，S是w的参数个数且S取正整数，w是权重参数；步骤S33中采用随机梯度下降算法优化目标函数，运用反向传播算法更新网络模型参数，直到目标函数的值不再下降时结束训练，为了加速模型收敛，引入参数学习的学习率，学习率按照如下公式进行衰减：In the formula, λ is a regularization coefficient and is a positive number, N is the number of pixels in each image, N is 421×421=177241, B is the batch size, B is 6, S is the number of parameters of w and S is positive Integer, w is a weight parameter; in step S33, adopt the stochastic gradient descent algorithm to optimize the objective function, use the backpropagation algorithm to update the network model parameters, and end the training until the value of the objective function no longer decreases. In order to accelerate the model convergence, the parameter learning method is introduced Learning rate, the learning rate decays according to the following formula:

式中t为迭代次数且t≤35000，l₀是初始学习率，l₀为0.001，l_t是第t次迭代的学习率，梯度衰减为0.0001，power是动量为0.9；In the formula, t is the number of iterations and t≤35000, l ₀ is the initial learning rate, l ₀ is 0.001, l _t is the learning rate of the t-th iteration, the gradient decay is 0.0001, and power is the momentum of 0.9;

其它操作步骤及参数与实施例1相同。Other operating steps and parameters are the same as in Example 1.

采用本实施例方法对StanfordBackground数据集中142张测试图像进行语义分割，部分测试图像的语义分割图如图4所示，其中，第一行是输入图像、第二行是输入图像对应的彩色标签图像、第三行是其对应的语义分割图。Using the method of this embodiment to semantically segment 142 test images in the StanfordBackground dataset, the semantic segmentation diagram of some test images is shown in Figure 4, where the first row is the input image, and the second row is the color label image corresponding to the input image , the third line is its corresponding semantic segmentation map.

Claims

1. An image semantic segmentation method based on deep learning is characterized by comprising the following steps:

s1, processing data sets

Dividing an image data set into a training image set and a test image set, performing data enhancement operation on the training image set, and increasing the number of training images to ten thousand-level units;

s2, constructing a deep semantic segmentation network

The deep semantic segmentation network is composed of a parallel deep neural network module, a feature fusion module and a Softmax classification layer, wherein the parallel deep neural network module is used for carrying out feature extraction on an input image, the feature fusion module carries out weighting fusion on an output feature map of the parallel deep neural network to obtain a new feature map, and the Softmax classification layer converts a pixel class label prediction score into a pixel class label prediction probability distribution map;

the parallel deep neural network module consists of a first deep neural network module and a second deep neural network module, the network structures of the first deep neural network module and the second deep neural network module are the same, the input of the first deep neural network module is an RGB image of an input image, and the input of the second deep neural network module is a gray image of the input image;

the first deep neural network module consists of a full convolution network module, a first coordinate channel module, a first circulation layer module, a second coordinate channel module, a second circulation layer module and a space pyramid pooling module, wherein the first coordinate channel module and the second coordinate channel module have the same structure, the first circulation layer module and the second circulation layer module have the same structure, the full convolution network module is used for extracting local features of an input image, the first circulation layer module is used for capturing context dependency and global feature information of the image, the first coordinate channel module is used for connecting i, j and r coordinate channels to a feature map output by the full convolution network module to form a new feature map so as to learn more coordinate feature information and improve the generalization capability of the model, and the space pyramid pooling module is used for performing convolution operation on the feature map output by the second circulation layer module at a plurality of sampling rates to extract feature information of different scale areas;

s3, deep semantic segmentation network training and parameter learning

S31, initializing network model parameters: performing parameter initialization on a full convolution network module by using a pre-training model of ResNet101 on an ImageNet data set, performing parameter initialization on a first circulation layer module and a second circulation layer module by using standard uniform distribution, and performing parameter initialization on a convolution layer of a space pyramid pooling module by using standard Gaussian distribution;

s32, training a deep semantic segmentation network by using the training image set after data enhancement to generate a pixel class prediction label probability distribution graph, calculating prediction loss by using the prediction label probability and the original label probability, specifically adopting a mixed loss function L (theta) as a target function,

L(θ)＝L ₁ (θ)+L ₂ (θ)

in the formula L ₁ (θ) is a cross entropy loss function, L ₂ (theta) is an L2 regularization term, theta is a parameter of the deep semantic segmentation network;

the cross entropy loss function L ₁ (θ) is:

in the formula y _pq Is a predicted label probability vector that is,

the method comprises the following steps of (1) obtaining an original label probability vector, wherein N is the number of pixels of each picture, B is the batch size, C is the number of pixel categories, and ln (.) is the natural logarithm;

the L2 regularization term L ₂ (θ) is:

in the formula, lambda is a regularization coefficient and is a positive number, N is the number of pixels of each image, B is the batch size, S is the parameter number of w, S is a positive integer, and w is a weight parameter;

s33, optimizing the target function by adopting a random gradient descent algorithm, and updating network model parameters by adopting a back propagation algorithm until the value of the target function does not descend any more, so as to finish training;

s4, performing semantic segmentation on the test image

S41, inputting the test image set into the deep semantic segmentation network trained in the step S3;

s42, the parallel deep neural network module performs feature extraction on the input test image set

The RGB image of the test image is used as the input of the first deep neural network module, and the gray image of the test image is used as the input of the second deep neural network module;

the first deep neural network module feature extraction process comprises the following steps: the full convolution network module performs local feature extraction on the RGB image of the test image through hole convolution, maximum pooling and convolution operations; the feature map output by the full convolution network module is sent to a first circulation layer module through a first coordinate channel module to obtain a new feature map, the new feature map is sent to the first circulation layer module to be scanned horizontally and vertically, and the global feature information of the image is learned; obtaining a new feature map from the feature map output by the first circulating layer module through the second coordinate channel module, sending the new feature map into the second circulating layer module for horizontal and vertical scanning, and capturing global feature information of the image; inputting the feature map output by the second circulation layer module into a space pyramid pooling module, performing convolution operation on a plurality of sampling rates, and extracting feature information of different scale areas;

the second deep neural network module feature extraction process is the same as the first deep neural network module feature extraction process;

s43, carrying out weighted fusion on the feature map output by the first deep neural network module and the feature map output by the second deep neural network module to obtain a new feature map;

and S44, sending the result of the step S43 into a Softmax classification layer to perform pixel class label prediction to obtain the object class of each pixel in the image, performing bilinear interpolation operation to up-sample the size of the original image, and obtaining a fine semantic segmentation image.

2. The deep learning based image semantic segmentation method according to claim 1, characterized in that: the first circulation layer module is composed of two bidirectional threshold recursion units, and the number of neurons of each bidirectional threshold recursion unit is 150.

3. The deep learning based image semantic segmentation method according to claim 1, characterized in that: the spatial pyramid pooling module is formed by convolution of 4 holes with different sampling rates, the convolution kernel size of the hole convolution is 3 multiplied by 3, and the expansion rates are 4, 6, 8 and 12 respectively.

4. The deep learning based image semantic segmentation method according to claim 1, characterized in that: in the step S2, the i, j and r coordinate channels are composed of an i coordinate channel, a j coordinate channel and an r coordinate channel, the i coordinate channel, the j coordinate channel and the r coordinate channel are all coordinate matrixes of e multiplied by f, the elements of the 1 st row to the e th row of the i coordinate channel are sequentially 0, 1,. The elements of the 1 st column to the f th column of the i coordinate channel are sequentially 0, 1,. The.f, f-1, e and f take positive integers, and the r coordinate channel is

m is any element in the i coordinate channel, n is the element in the j coordinate channel with the same position as the m coordinate, and the elements in the i coordinate channel and the j coordinate channel are linearly scaled to [ -1,1]Within the range.

5. The deep learning based image semantic segmentation method according to claim 1, characterized in that: the learning rate of the parameter learning in the step S3 is attenuated according to the following formula:

where t is the number of iterations, l ₀ Is the initial learning rate,/ _t Is the learning rate for the t-th iteration, power is the momentum of 0.9.