CN116645514A - Tile surface defect segmentation method based on improved U2-Net - Google Patents
Tile surface defect segmentation method based on improved U2-Net
- Publication number
- CN116645514A (application CN202310754771.9A)
- Authority
- CN
- China
- Prior art keywords
- decoder
- tile surface
- surface defect
- encoder
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a tile surface defect segmentation method based on an improved U2-Net, belonging to the technical field of salient object segmentation. To overcome the deficiencies of the prior art, the invention provides a tile surface defect segmentation method that improves U2-Net, comprising: obtaining a tile surface defect detection dataset; constructing a U2-Net-based tile surface defect segmentation network model; iteratively training the segmentation network model on the training set until the network finally converges, obtaining the trained tile surface defect segmentation network model; and inputting the image to be processed into the trained model to obtain the segmented target. The invention constructs a deep learning network model based on an encoder-decoder structure and multi-scale feature fusion for tile surface defect segmentation, so as to improve the segmentation of tile surface defects.
Description
Technical Field

The invention relates to a tile surface defect segmentation method based on an improved U2-Net, and belongs to the technical field of salient object segmentation.
Background Art

Salient object segmentation, also called salient object detection, aims to distinguish the most conspicuous regions in an image; analysis then proceeds by extracting the target region. It is now widely applied to scene object segmentation, human foreground segmentation, face and human body parsing, and 3D reconstruction, in fields such as intelligent security, autonomous driving, and video surveillance.

Traditional salient object detection algorithms are mostly based on low-level visual features, including center bias, contrast priors, and background priors. Achanta et al. processed the two low-level features of brightness and color, using a difference-of-Gaussians function for frequency-domain filtering to compute the contrast between the current pixel and the pixels in surrounding neighborhoods of different sizes, thereby determining the saliency value of each image pixel. Klein et al. used the K-L divergence from information theory to measure the difference between the center of the image's feature space and its surrounding features. These algorithms consider local pixel contrast; they perform well at detecting the edges of a target but struggle to detect the target as a whole. Detection algorithms based on global contrast instead compute the contrast of a pixel region relative to all pixels in the image, which allows the specific location of the salient target to be detected. For example, Wei et al. proposed the concept of background connectivity; in their method, global contrast is used to obtain the effective range and to compute image saliency. In the method of Gao et al., a superpixel-segmented image is obtained first, then texture details and the global contrast of the image are computed in the CIELab color space; these two features are fused with multi-kernel boosting learning to obtain a saliency map, which is further refined with a filtering-based post-processing step. Perazzi et al. first decomposed the image into blocks whose mutual relationships are compact and visually smooth, then computed and evaluated the uniqueness and spatial distribution of these blocks, generating a saliency map of the image from the evaluation metrics. Other researchers have computed pixel saliency values by constructing graph models. On this basis, Yang et al. proposed a SOD algorithm based on manifold ranking: the image is first segmented into superpixels, suitable background seed points are selected, similarities between the other nodes and the seed points are computed, and ranking at different scales yields an initial saliency map; this map is then used as foreground seeds, and computing similarities and ranking again gives the final saliency map. Whether this algorithm ultimately detects the saliency map well, however, depends strongly on the choice of seed points. Yuan et al. proposed a saliency-reversion correction process that removes foreground superpixels near the boundary to prevent saliency reversion, thereby improving the accuracy and robustness of saliency prediction based on boundary priors. Shan et al. used background weight maps to seed manifold ranking and employed a third-order smoothness framework to improve its performance. Addressing the problem that earlier methods may miss target regions in images with low background contrast, Wu et al. proposed a new propagation model that incorporates a deformation smoothness constraint: the model locally regularizes an image node and its surrounding points, and then uses the deformation smoothness constraint to prevent potentially erroneous results from propagating.

In short, traditional object detection algorithms mostly extract relatively simple features, and the training process for image saliency values is not very complicated; they still perform reasonably well in many scenarios. However, across large datasets and real-world scene images, the environments and image conditions to be handled are becoming increasingly complex, and the requirements on detection performance and detection speed keep rising, which traditional methods find increasingly difficult to satisfy.
Summary of the Invention

To overcome the defects of the prior art, the present invention aims to provide a tile surface defect segmentation method based on an improved U2-Net.

The technical solution provided by the present invention to solve the above technical problem is a tile surface defect segmentation method based on an improved U2-Net, comprising the following steps:

Step S1: obtain a tile surface defect detection dataset and divide it into a training set and a test set;

Step S2: construct a U2-Net-based tile surface defect segmentation network model;

The tile surface defect segmentation network model is a six-level U-shaped structure comprising 6 encoder stages, 5 decoder stages, 2 multi-scale feature fusion modules, and a saliency map fusion module;

The first 4 encoders and the corresponding 4 decoders are built from the feature extraction structure DCRSU; the depth of each DCRSU decreases as the encoder and decoder stages go deeper, i.e., the first 4 encoders use DCRSU-7, DCRSU-6, DCRSU-5, and DCRSU-4 respectively, and likewise for the first 4 decoders;

The 5th encoder and its corresponding decoder use RSU-4F;

The 6th encoder introduces SKnet as the deepest encoder;

Two improved attention gate modules are introduced at the inputs of the 5th decoder and the 4th decoder respectively: the features output by the 6th encoder and the 5th encoder are fused and fed into the 5th decoder, and the features output by the 4th encoder and the 5th decoder are fused and fed into the 4th decoder. Then, 3×3 convolutions and sigmoid functions generate 6 output saliency probability maps from the 6th encoder and the 5th, 4th, 3rd, 2nd, and 1st decoders. The logits of the output saliency maps are upsampled to the input image size and fused by a concatenation operation; finally, a 1×1 convolutional layer and a sigmoid function generate the final saliency probability map;

Step S3: iteratively train the tile surface defect segmentation network model on the training set until the network finally converges, obtaining the trained tile surface defect segmentation network model;

Step S4: input the image to be processed into the trained tile surface defect segmentation network model to obtain the segmentation target.
A further technical solution is that, in step S1, the training set and the test set are divided in a ratio of 4:1.
A further technical solution is that the DCRSU consists of 4 parts: an input convolutional layer, an encoder, a decoder, and a residual structure;

The input convolutional layer is used to extract local features and convert channels;

In the encoding stage, the last encoder layer uses a convolution + batch normalization + ReLU structure, and the second-to-last layer uses a depthwise separable convolution + batch normalization + ReLU structure; the remaining encoder layers use a residual structure to add the features extracted by the depthwise separable convolution to the input features processed by the attention mechanism module before feeding the sum into the next feature extraction layer, so that the output features of each level can focus on channels carrying more effective feature information, strengthening the effective-feature extraction capability of each level and capturing multi-scale feature information;

The residual structure fuses the input layer and the intermediate layer, concatenating the features of two different scales;

In the decoding stage, the decoder module passes the concatenated feature maps through a 3×3 convolution, a batch normalization layer, and a ReLU activation function, progressively restoring the details and spatial dimensions of the segmented object by upsampling; the feature map output by the last decoder layer is added to and fused with the feature map from the input convolutional layer to obtain the final feature map processed by the DCRSU module.
A further technical solution is that the RSU-4F replaces downsampling and upsampling with dilated convolution: the input feature of size C×H×W first passes through 2 modules consisting of convolution + batch normalization layer + ReLU, and then through dilated convolutions with dilation rates of 1, 2, 4, and 8 in turn; the size of the feature map remains unchanged throughout the process.
A further technical solution is that SKnet serves as the deepest encoder; its multi-scale feature extraction passes the original feature map through a 3×3 grouped/depthwise convolution and a 3×3 dilated convolution respectively to generate two feature maps, which are then added to produce U;

The resulting U is passed through global average pooling to generate a 1×1×C feature map, which is mapped by a fully connected layer to a d×1 vector z; the vector z is mapped back to length C by 2 FC layers, a softmax is computed over the 2 resulting vectors along the channel dimension to obtain their respective weight vectors, the weights are multiplied with the 2 outputs of stage one to obtain new feature maps, and the two new feature maps are summed to give the final output, which is sent to the next decoder.
A further technical solution is that the two inputs of the improved attention gate module are the current encoder layer x_l and the next decoder layer g, with input features of size C×H×W. They are first added element-wise and passed through a ReLU activation function to obtain a C×H×W feature map; a 1×1 convolution then reduces the channel number to 1, a sigmoid activation function normalizes it to obtain the attention coefficients, and another 1×1 module restores the size, giving C×H×W coefficients. Finally, the obtained attention coefficients are used to multiply the two input feature maps, the results are concatenated, and the resulting feature map is sent to the next decoder module.
A further technical solution is that, when training the model in step S3, the batch size is set to 16 and the AdamW optimizer is used for optimization; initial training is performed with a learning rate of 1×10⁻³, and the model is then fine-tuned with a learning rate of 1×10⁻⁵.
A further technical solution is that the loss in step S3 is computed by the formula

\( \mathcal{L} = \sum_{m=1}^{M} w_{side}^{(m)}\, \ell_{side}^{(m)} + w_{fuse}\, \ell_{fuse} \)

where \( \ell_{fuse} \) is the loss of the final fused prediction probability map, \( \ell \) denotes the binary cross-entropy loss, and \( w \) is the weight of each loss term.
The present invention has the following beneficial effects: the U2-Net network structure of the present invention is deep and complex, and the RSUs and skip connections can extract intra-stage and inter-stage information of the image at different scales; however, the connections tend to retain invalid features such as non-defect regions and to lose information such as defect edges, so some defects are still missed during segmentation. To further improve the segmentation performance of the network, the present invention improves the network mainly in two respects: enhancing the ability to extract effective features and reducing information loss;

Aiming at the problem that, in the decoding stage of current salient object detection, skip connections cause invalid features such as non-defect regions to be retained and information such as defect edges to be lost, the present invention makes full use of the contextual information of the image, thereby reducing information loss and improving the segmentation of the salient subject;

Aiming at the problems of low contrast between target and background and low detection accuracy when salient object detection is applied to tile surface defect scenes, the present invention makes full use of the backbone network's ability to extract effective features, so that defect targets in the image are segmented better.
Brief Description of the Drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a network structure diagram of the method of the present invention;

Fig. 3 is the feature extraction DCRSU module in the encoder-decoder of the method of the present invention;

Fig. 4 is the CA attention module used in the encoder-decoder of the method of the present invention;

Fig. 5 is the feature extraction RSU-4F module in the encoder-decoder of the method of the present invention;

Fig. 6 is the encoder module SKnet of the method of the present invention;

Fig. 7 is the multi-scale feature fusion module AG+ of the method of the present invention;

Fig. 8 is a schematic diagram of the effect of the method of the present invention.
Detailed Description

The technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, a tile surface defect segmentation method based on an improved U2-Net of the present invention comprises the following steps:
Step S1: obtain the Magnetic-tile-defect-datasets dataset from the field of surface defect detection and divide it into a training set and a test set in a ratio of 4:1;

Step S2: construct a U2-Net-based tile surface defect segmentation network model;
The network is based on an encoder-decoder structure and adopts side fusion for salient target prediction. As shown in Fig. 2, the model consists of 4 parts: 6 encoder stages, 5 decoder stages, 2 multi-scale feature fusion modules, and a saliency map fusion module.

The overall network has a U-shaped structure. The first 4 encoders and the corresponding 4 decoders are built from the feature extraction structure DCRSU (shown in Fig. 3); the depth of each DCRSU decreases as the encoder and decoder stages go deeper, i.e., the 1st, 2nd, 3rd, and 4th encoders use DCRSU-7, DCRSU-6, DCRSU-5, and DCRSU-4 respectively, and likewise for the decoders.
The 5th encoder and its corresponding decoder both use RSU-4F (shown in Fig. 5), where F indicates that the size does not change, i.e., only feature extraction is performed.
The present invention introduces SKnet (shown in Fig. 6) as the deepest encoder at the 6th encoder stage.
In addition, to solve the problem that simulating the global multi-scale context with skip connections alone easily leads to the loss of spatial information, the present invention introduces 2 AG+ modules (shown in Fig. 7) at the inputs of the 5th decoder and the 4th decoder respectively: the features output by the 6th encoder and the 5th encoder are fused and fed into the 5th decoder, and the features output by the 4th encoder and the 5th decoder are fused and fed into the 4th decoder.
Then, 3×3 convolutions and sigmoid functions generate 6 output saliency probability maps from the 6th encoder and the 5th, 4th, 3rd, 2nd, and 1st decoders. The logits of the output saliency maps are upsampled to the input image size and fused by a concatenation operation; finally, a 1×1 convolutional layer and a sigmoid function generate the final saliency probability map.
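For illustration only, the side-output fusion just described can be sketched in PyTorch as follows; the module name SideFusion, the channel list argument, and the bilinear upsampling mode are assumptions made for this sketch and are not specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideFusion(nn.Module):
    """Sketch: one 3x3 conv per side feature (En6, De5, De4, De3, De2, De1),
    upsample each logit to the input size, concatenate, fuse with a 1x1 conv,
    and apply a sigmoid to every map."""
    def __init__(self, side_channels, num_sides=6):
        super().__init__()
        self.side_convs = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=3, padding=1) for c in side_channels])
        self.fuse_conv = nn.Conv2d(num_sides, 1, kernel_size=1)

    def forward(self, side_feats, out_size):
        logits = [F.interpolate(conv(f), size=out_size, mode="bilinear",
                                align_corners=False)
                  for conv, f in zip(self.side_convs, side_feats)]
        fused = self.fuse_conv(torch.cat(logits, dim=1))   # concat then 1x1 conv
        return torch.sigmoid(fused), [torch.sigmoid(l) for l in logits]
```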
DCRSU (shown in Fig. 3) consists of 4 parts: an input convolutional layer, an encoder, a decoder, and a residual structure. Assume the DCRSU is configured with three parameters: the number of input channels Cin, the number of output channels Cout, and the number of intermediate channels Cmid.

The input convolutional layer is used to extract local features and convert channels. Taking an input image of 3×512×512 as an example, it passes through a 3×3 convolution, a BN layer, and a ReLU activation function; the channel number rises and the feature becomes Cout×512×512 (denote this feature as a).

In the encoding stage, the last encoder layer uses a convolution + batch normalization + ReLU structure, and the second-to-last layer uses a depthwise separable convolution + batch normalization + ReLU structure. The remaining encoder layers use a residual structure to add the features extracted by the depthwise separable convolution to the input features processed by the CA module before feeding the sum into the next feature extraction layer, so that the output features of each level can focus on channels carrying more effective feature information, strengthening the effective-feature extraction capability of each level and capturing multi-scale feature information.

Taking a as the input feature as an example, it passes through Cout 3×3 convolutions and becomes Cin×512×512; it then passes through Cmid convolutions of size Cin×1×1, a BN layer, and a ReLU activation function, and becomes Cmid×512×512.
At the same time, a is fed as an input feature through the CA attention module and a BN layer. The CA attention module (shown in Fig. 4) average-pools the C×H×W input feature map channel by channel, using pooling kernels of (H, 1) and (1, W) to pool along the X and Y directions respectively, encoding each channel and producing feature maps of shapes C×H×1 and C×1×W. The pair of direction-aware feature maps produced in this way allows the CA attention to capture long-range dependencies within a channel while preserving precise positional information, enabling the network to localize objects more accurately. The feature maps extracted above are concatenated along the spatial dimension, where r is used to control the reduction ratio of the block and plays the same role as in SE. The concatenated feature map then passes through the F1 convolutional transform (a 1×1 convolution) and a nonlinear activation function to produce the intermediate feature map f. f is split along the spatial dimension into two tensors f_h and f_w, which are passed through the convolutional transforms F_h and F_w (1×1 convolutions) and sigmoid activation functions to obtain the coordinate attentions g_h and g_w. g_h and g_w are then multiplied with the original input to obtain an output with the same shape as the input, which passes through a 1×1 convolution and a batch normalization layer so that the shape of the output feature map matches the shape of the output of the depthwise separable convolution applied to feature a at the same stage. Finally, the two processed versions of feature a are added together and fed into the next feature extraction layer.
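Purely as an illustration of the coordinate attention computation described above, a rough PyTorch sketch is given below; the reduction-ratio default, the minimum width of 8, and the class name are assumptions, and the patent's exact output-projection details are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordAttention(nn.Module):
    """Sketch of the CA module: directional average pooling, a shared 1x1
    transform F1, then per-direction 1x1 convs and sigmoids giving g_h and g_w."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)                  # reduction ratio r, as in SE
        self.f1 = nn.Sequential(nn.Conv2d(channels, mid, 1),
                                nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.f_h = nn.Conv2d(mid, channels, 1)
        self.f_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = F.adaptive_avg_pool2d(x, (h, 1))                       # C x H x 1
        x_w = F.adaptive_avg_pool2d(x, (1, w)).permute(0, 1, 3, 2)   # C x W x 1
        y = self.f1(torch.cat([x_h, x_w], dim=2))    # concat along the spatial dim
        f_h, f_w = torch.split(y, [h, w], dim=2)     # split back into two tensors
        g_h = torch.sigmoid(self.f_h(f_h))                           # C x H x 1
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))       # C x 1 x W
        return x * g_h * g_w                         # reweight the original input
```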
The residual structure fuses the input layer and the intermediate layer, concatenating the features of two different scales.

In the decoding stage, the decoder module passes the concatenated feature maps through a 3×3 convolution, a batch normalization layer, and a ReLU activation function, progressively restoring the details and spatial dimensions of the segmented object by upsampling. The feature map output by the last decoder layer is added to and fused with the feature map from the input convolutional layer to obtain the final feature map processed by the DCRSU module.
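A hedged sketch of one DCRSU encoder level as described above (a depthwise separable convolution branch added to the CA-attended input) is shown below; it reuses the CoordAttention sketch above, and the block name, channel handling, and shape-alignment convolution are assumptions rather than the patent's exact layer configuration.

```python
import torch.nn as nn

class DSConvBlock(nn.Module):
    """Sketch of one DCRSU encoder level: depthwise separable conv + BN + ReLU,
    added to the CA-processed input (residual) before the next level."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dsconv = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),  # depthwise 3x3
            nn.Conv2d(in_ch, out_ch, 1),                          # pointwise 1x1
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.ca = CoordAttention(in_ch)              # CA module from the sketch above
        self.align = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),   # match shapes
                                   nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.dsconv(x) + self.align(self.ca(x))  # fuse the two paths
```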
In the RSU-4F module, the feature map has already been downsampled several times and is therefore very small; to avoid information loss, RSU-4F no longer downsamples (Fig. 5) and replaces downsampling and upsampling with dilated convolution. The input feature of size C×H×W first passes through 2 modules consisting of convolution + batch normalization layer + ReLU, and then through dilated convolutions with dilation rates of 1, 2, 4, and 8 in turn; the size of the feature map remains unchanged throughout the process.
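The dilation pattern described for RSU-4F (conv + BN + ReLU blocks followed by dilation rates 1, 2, 4, and 8 at a fixed resolution) can be sketched as follows; the channel count of 64 is a placeholder, and the internal skip connections of the real RSU-4F are not shown.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, dilation=1):
    """3x3 conv + BN + ReLU; padding = dilation keeps H x W unchanged."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

# two plain conv blocks, then dilation rates 1, 2, 4, 8 in turn;
# the feature map stays C x H x W throughout
rsu4f_like = nn.Sequential(
    conv_bn_relu(64, 64), conv_bn_relu(64, 64),
    conv_bn_relu(64, 64, dilation=1), conv_bn_relu(64, 64, dilation=2),
    conv_bn_relu(64, 64, dilation=4), conv_bn_relu(64, 64, dilation=8))
```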
SKnet (Fig. 6) serves as the deepest encoder. Its multi-scale feature extraction passes the original feature map through a 3×3 grouped/depthwise convolution and a 3×3 dilated convolution (receptive field 5×5) respectively, generating two feature maps (shown in yellow and green in the figure). These two feature maps are added to produce U. U is passed through global average pooling (Fgp) to generate a 1×1×C feature map (s in the figure), which is mapped by a fully connected layer (the Ffc function) to a d×1 vector (z in the figure). The vector z is mapped back to length C by 2 FC layers, a softmax is computed over the 2 resulting vectors along the channel dimension to obtain their respective weight vectors, the weights are multiplied with the 2 outputs of stage one to obtain new feature maps, and the two new feature maps are summed to give the final output, which is sent to the next decoder.
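The selective-kernel step just described can be illustrated with the hedged sketch below; the depthwise choice for the grouped branch, the bottleneck width d, and the class name are assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SKBlock(nn.Module):
    """Sketch: two 3x3 branches (depthwise vs. dilated), sum to U, squeeze with
    global average pooling and an FC to z, expand to two length-C vectors,
    softmax across the branches, then a weighted sum of the two branch outputs."""
    def __init__(self, channels, d=32):
        super().__init__()
        self.branch_a = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.branch_b = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)  # 5x5 field
        self.fc_z = nn.Linear(channels, d)           # s -> z (d x 1)
        self.fc_a = nn.Linear(d, channels)           # back to length C, branch a
        self.fc_b = nn.Linear(d, channels)           # back to length C, branch b

    def forward(self, x):
        ua, ub = self.branch_a(x), self.branch_b(x)
        u = ua + ub                                              # U
        s = F.adaptive_avg_pool2d(u, 1).flatten(1)               # 1 x 1 x C descriptor
        z = F.relu(self.fc_z(s))
        w = torch.softmax(torch.stack([self.fc_a(z), self.fc_b(z)], dim=1), dim=1)
        wa = w[:, 0].unsqueeze(-1).unsqueeze(-1)
        wb = w[:, 1].unsqueeze(-1).unsqueeze(-1)
        return ua * wa + ub * wb                     # weighted sum of the two branches
```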
The two inputs of the AG+ module (shown in Fig. 7) are the current encoder layer x_l and the next decoder layer g, with input features of size C×H×W. They are first added element-wise and passed through a ReLU activation function to obtain a C×H×W feature map; a 1×1 convolution then reduces the channel number to 1, a sigmoid activation function normalizes it to obtain the attention coefficients, and another 1×1 module restores the size, giving C×H×W coefficients. Finally, the obtained attention coefficients are used to multiply the two input feature maps, the results are concatenated, and the resulting feature map is sent to the next decoder module.
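A minimal sketch of the AG+ computation described above follows; the class name and the choice to concatenate the two reweighted maps along the channel dimension are written from the description and should be read as assumptions.

```python
import torch
import torch.nn as nn

class AGPlus(nn.Module):
    """Sketch of the improved attention gate: add x_l and g element-wise, ReLU,
    1x1 conv to one channel, sigmoid, 1x1 conv back to C channels, then reweight
    both inputs with the coefficients and concatenate."""
    def __init__(self, channels):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, 1, 1)     # C -> 1
        self.expand = nn.Conv2d(1, channels, 1)      # 1 -> C (restore the size)

    def forward(self, x_l, g):
        a = torch.relu(x_l + g)                      # element-wise addition + ReLU
        alpha = torch.sigmoid(self.squeeze(a))       # attention coefficients, 1 channel
        alpha = self.expand(alpha)                   # back to C x H x W coefficients
        return torch.cat([x_l * alpha, g * alpha], dim=1)
```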
Step S3: iteratively train the tile surface defect segmentation network model on the training set until the network finally converges, obtaining the trained tile surface defect segmentation network model;

The experiments use an NVIDIA V100, and the entire model is implemented in PyTorch.

When training the model, the batch size is set to 16 and the AdamW optimizer is used for optimization. We first perform initial training with a learning rate of 1×10⁻³ and then fine-tune the model with a learning rate of 1×10⁻⁵.
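For illustration only, the training schedule described above can be sketched roughly as follows; model, train_loader, initial_epochs, and finetune_epochs are placeholders rather than objects or values defined by the patent, and the assumption that the model returns a fused map plus side outputs is made only for this sketch.

```python
import torch
import torch.nn.functional as F

# Placeholders (not defined by the patent): `model` is the segmentation network,
# `train_loader` yields (image, mask) batches of size 16.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def run_epochs(num_epochs):
    for _ in range(num_epochs):
        for images, masks in train_loader:
            optimizer.zero_grad()
            fused, sides = model(images)             # fused map + side outputs (assumed)
            loss = F.binary_cross_entropy(fused, masks)
            loss.backward()
            optimizer.step()

run_epochs(initial_epochs)                           # initial training at lr = 1e-3
for group in optimizer.param_groups:                 # then fine-tune at lr = 1e-5
    group["lr"] = 1e-5
run_epochs(finetune_epochs)
```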
The loss is computed by the formula below, where M = 6, denoting the six outputs from the 1st, 2nd, 3rd, 4th, and 5th decoders and the 6th encoder; \( \ell_{fuse} \) is the loss of the final fused prediction probability map, \( \ell \) denotes the binary cross-entropy loss, and \( w \) is the weight of each loss term:

\( \mathcal{L} = \sum_{m=1}^{M} w_{side}^{(m)}\, \ell_{side}^{(m)} + w_{fuse}\, \ell_{fuse} \)

The binary cross-entropy loss is given below, where (r, c) are the pixel coordinates and (H, W) are the image dimensions (height and width); P_G(r, c) and P_S(r, c) denote the pixel values of the ground truth and of the predicted saliency probability map:

\( \ell = -\sum_{(r,c)}^{(H,W)} \big[ P_G(r,c)\,\log P_S(r,c) + \big(1-P_G(r,c)\big)\,\log\big(1-P_S(r,c)\big) \big] \)
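A hedged sketch of this deep-supervision loss is given below; the function name and the equal default weights are assumptions, and the side outputs are taken to be probability maps that have already passed through a sigmoid, as described above.

```python
import torch.nn.functional as F

def deep_supervision_loss(side_probs, fused_prob, target, weights=None):
    """Weighted sum of binary cross-entropy terms over the M side saliency maps
    plus the fused map, mirroring the loss formula above."""
    probs = list(side_probs) + [fused_prob]
    if weights is None:
        weights = [1.0] * len(probs)     # equal weights, a placeholder choice
    total = 0.0
    for p, w in zip(probs, weights):
        total = total + w * F.binary_cross_entropy(p, target)
    return total
```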
Step S4: input the image to be processed into the trained tile surface defect segmentation network model to obtain the segmentation target.

To verify the effectiveness of the algorithm, the experiments use the Magnetic-tile-defect-datasets dataset and compare the algorithm proposed by the present invention with the U-Net and U2-Net algorithms.

Analysis: experiments were carried out on the Magnetic-tile-defect-datasets dataset and the generated results were compared quantitatively. The two main evaluation metrics, maxFβ and MAE, are shown in the table below; compared with the other methods, our method achieves the best results on both metrics.
MAE is the mean absolute error, one of the metrics for measuring image quality. It is computed as the absolute value of the difference between the ground-truth value and the predicted value, summed and then averaged; the smaller the MAE, the better the image quality. The formula is as follows:

\( \mathrm{MAE} = \frac{1}{H \times W} \sum_{r=1}^{H} \sum_{c=1}^{W} \lvert P(r,c) - G(r,c) \rvert \)

where P and G are the predicted probability map of the salient object and the corresponding ground truth, (H, W) are the height and width, and (r, c) are the pixel coordinates.
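As an illustration, the MAE above can be computed for a predicted map and its ground truth as follows; this is a sketch that assumes both tensors hold values in [0, 1] and share the same H×W shape.

```python
import torch

def mae(pred, gt):
    """Mean absolute error |P - G| averaged over all H x W pixels; smaller is better."""
    return torch.mean(torch.abs(pred - gt)).item()
```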
Accuracy = (TP + TN) / total samples. It is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples for a given test dataset.

Precision = TP / (TP + FP). It indicates how many of the samples predicted as positive are truly positive; it is defined with respect to the prediction results.
Fβ is used to jointly evaluate precision and recall, as follows:

\( F_\beta = \frac{(1+\beta^{2}) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}} \)

β² is set to 0.3, and the maximum Fβ (maxFβ) is reported.
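A small sketch of the Fβ computation with β² = 0.3 follows; the epsilon term is an assumption added only to avoid division by zero and is not part of the patent's formula.

```python
def f_beta(precision, recall, beta2=0.3, eps=1e-8):
    """F-measure combining precision and recall with beta^2 = 0.3; maxF_beta is the
    maximum of this score over binarization thresholds (threshold sweep not shown)."""
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```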
The above description does not limit the present invention in any form. Although the present invention has been disclosed through the above embodiments, they are not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make some changes or modifications that amount to equivalent embodiments of equivalent changes; any simple modification, equivalent change, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310754771.9A CN116645514A (en) | 2023-06-25 | 2023-06-25 | Tile surface defect segmentation method based on improved U2-Net |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310754771.9A CN116645514A (en) | 2023-06-25 | 2023-06-25 | Tile surface defect segmentation method based on improved U2-Net |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116645514A true CN116645514A (en) | 2023-08-25 |
Family
ID=87640107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310754771.9A CN116645514A (en) | Tile surface defect segmentation method based on improved U2-Net | 2023-06-25 | 2023-06-25 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116645514A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117952983A (en) * | 2024-03-27 | 2024-04-30 | 中电科大数据研究院有限公司 | Intelligent manufacturing production process monitoring method and system based on artificial intelligence |
CN118096659A (en) * | 2024-01-24 | 2024-05-28 | 广东工业大学 | Machine vision-based MLCC multi-kind defect rapid detection method |
CN118229681A (en) * | 2024-05-22 | 2024-06-21 | 安徽大学 | Road defect detection method combining PVT and U-Net deep learning |
- 2023-06-25: CN CN202310754771.9A patent/CN116645514A/en active Pending
Similar Documents
Publication | Title
---|---
CN110136154B (en) | Remote sensing image semantic segmentation method based on full convolution network and morphological processing
CN116645514A (en) | Tile surface defect segmentation method based on improved U2-Net
CN115115924A (en) | Concrete image crack type rapid intelligent identification method based on IR7-EC network
CN110263705A (en) | Towards two phase of remote sensing technology field high-resolution remote sensing image change detecting method
CN108986050A (en) | A kind of image and video enhancement method based on multiple-limb convolutional neural networks
CN109919032B (en) | Video abnormal behavior detection method based on motion prediction
CN111325165A (en) | A Scene Classification Method of Urban Remote Sensing Imagery Considering Spatial Relationship Information
CN110852316A (en) | Image tampering detection and positioning method adopting convolution network with dense structure
CN111832484A (en) | Loop detection method based on convolution perception hash algorithm
CN114973011A (en) | High-resolution remote sensing image building extraction method based on deep learning
CN114943894B (en) | An Optimization Method for Building Extraction from High-resolution Remote Sensing Images Based on ConvCRF
CN116206133A (en) | RGB-D significance target detection method
CN118072001B (en) | Camouflaged target detection method based on scale feature perception and wide-range perception convolution
CN110751195A (en) | Fine-grained image classification method based on improved YOLOv3
CN113610024B (en) | A multi-strategy deep learning remote sensing image small target detection method
CN113436220B (en) | An Image Background Estimation Method Based on Depth Map Segmentation
CN115115608A (en) | Aero-engine damage detection method based on semi-supervised semantic segmentation
CN114972753A (en) | A lightweight semantic segmentation method and system based on contextual information aggregation and assisted learning
CN114037891A (en) | Method and device for building extraction from high-resolution remote sensing images based on U-shaped attention control network
CN112580661A (en) | Multi-scale edge detection method under deep supervision
CN112733756B (en) | Remote sensing image semantic segmentation method based on W divergence countermeasure network
CN112614070A (en) | DefogNet-based single image defogging method
CN113096070A (en) | Image segmentation method based on MA-Unet
CN115937693A (en) | A road recognition method and system based on remote sensing images
CN111027542A (en) | Target detection method improved based on fast RCNN algorithm
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination