CN116229063B - Semantic segmentation network model and training method based on category colorization technology
- Publication number: CN116229063B (application CN202310036249.7A)
- Authority: CN (China)
- Prior art keywords: feature, semantic, discrete, category, image
- Legal status: Active
Classifications
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
- G06V10/56 — Extraction of image or video features relating to colour
- G06V10/764 — Recognition or understanding using machine-learning classification, e.g. of video objects
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/82 — Recognition or understanding using neural networks
- G06V20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a training method for a semantic segmentation network model based on category colorization technology. The method first obtains the ADE20K dataset and divides it into a training set and a test set, and generates a category color array of dimension C×3. It then uses the training set to train the image feature extractor, the discrete feature index classifier, and the auxiliary pixel classifier, obtaining the trained weights of the image feature extractor and discrete feature index classifier as well as the weights of those two components produced at multiple intermediate iterations. Using the training set together with one of the intermediate sets of image feature extractor and discrete feature index classifier weights, it trains the learnable color patch-category mapping module. The invention solves the technical problem that existing semantic segmentation methods based on discriminative models cannot reach optimal accuracy because their knowledge and information are insufficient, which results in poor accuracy and generalization.
Description
Technical Field
The present invention belongs to the technical field of image data processing and, more specifically, relates to a semantic segmentation network model based on category colorization technology and a training method thereof.
Background Art
Semantic segmentation is now used ever more widely in computer vision, including in autonomous driving, robotics, and image matting software. Since every image contains rich semantic entities, storing richer semantic knowledge in the model has become the key to improving its detection performance.
Existing semantic segmentation methods fall mainly into discriminative models and generative models.
A discriminative semantic segmentation model first uses a feature extraction network to obtain a feature matrix for the whole image. A pixel-level classifier then converts the feature at each pixel position into category probability values, where the number of probability values equals the total number of categories. Finally, the category with the largest probability at each pixel position is taken as that pixel's category, yielding the semantic segmentation result. During offline training, the discriminative model supervises training directly with a cross-entropy loss function. Both the training and inference procedures of this approach are relatively simple, which makes it convenient for practical use.
A generative model first uses a feature extraction network to obtain a feature matrix for the whole image. The feature matrix is then fed into a classifier of discrete feature indices to obtain an index matrix. Next, the discrete feature corresponding to each index is looked up in the discrete feature codebook, and all retrieved discrete features are assembled into a discrete feature matrix according to the spatial positions of the elements. Finally, the discrete feature matrix is fed into the decoder of a VQ-VAE to obtain the semantic segmentation result. During offline training of a generative semantic segmentation model, the VQ-VAE encoder is used to map the ground-truth semantic mask to a feature matrix, and the index of each of that matrix's features in the VQ-VAE's discrete feature codebook forms a ground-truth matrix of discrete feature indices. The generative model uses this index matrix as the supervision signal and cross-entropy as the loss function. Because it introduces the idea of generative modeling, this approach can learn rich semantic information and detailed features.
However, existing discriminative and generative semantic segmentation methods have some shortcomings that cannot be ignored:
First, the pixel-level classifier of a discriminative semantic segmentation model essentially learns the features of different categories and their decision boundaries; it does not learn sufficiently rich semantic information and detailed features. As a result, existing discriminative models struggle to reach the best mean Intersection over Union (mIoU) accuracy, which in turn leads to poor final accuracy and generalization.
Second, the pixel-level classifiers of existing discriminative methods predict each pixel's category from the feature at that pixel position alone and do not make full use of local-region information. Pixel predictions within a local region are therefore too independent, small noise points easily form in the segmentation result, and the final segmentation accuracy suffers.
Third, existing generative semantic segmentation methods do not properly handle the regions without category annotation that are common in the ground-truth semantic masks of training samples. These unlabeled regions spread layer by layer as the mask passes through the VQ-VAE encoder, so the ground-truth values of many discrete feature index matrices contain information from unlabeled regions. The trained model consequently mispredicts some hard categories as the unlabeled category, which ultimately lowers the accuracy of the semantic segmentation model.
Summary of the Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a semantic segmentation network model based on category colorization technology and a training method thereof. Its purpose is to solve the technical problems that existing discriminative semantic segmentation methods cannot reach optimal accuracy because their knowledge and information are insufficient, resulting in poor accuracy and generalization; that existing discriminative methods do not make full use of local information, so pixel category predictions within a local region are too independent, noise appears in the segmentation result, and accuracy suffers; and that existing generative methods cannot properly handle the regions without category annotation in the semantic masks of training samples, so erroneous information is introduced during training and the precision of the segmentation result is degraded.
To achieve the above objects, according to one aspect of the present invention, a semantic segmentation network model based on category colorization technology is provided, comprising a sequentially connected image feature extractor, discrete feature index classifier, discrete feature codebook, semantic image decoder, learnable color patch-category mapping module, auxiliary pixel classifier, unlabeled region completion module, category-color mapping module, and semantic image encoder. The image feature extractor comprises 24 Swin Transformer modules; it receives an image of dimension bs×3×h×w and outputs four feature matrices of dimensions bs×192×(h/4)×(w/4), bs×384×(h/8)×(w/8), bs×768×(h/16)×(w/16), and bs×1536×(h/32)×(w/32), where bs is the batch size preset for offline training, and h and w are the numbers of pixels on the long and short sides of the image, respectively.
The discrete feature index classifier comprises a feature aggregation layer, a feature processing layer, and a classifier module.
The feature aggregation layer is a convolution module whose input is the four feature matrices output by the Swin Transformer network and whose output is a single aggregated feature matrix.
The feature processing layer consists of two Swin Transformer modules; its input is the aggregated feature matrix output by the feature aggregation layer, and its output is a processed feature matrix.
The input of the classifier module is the feature matrix output by the feature processing layer, and its output is a probability matrix of dimension bs×8192×(h/8)×(w/8); each element of the probability matrix holds the probabilities of the 8192 discrete features occurring at that position.
The discrete feature codebook is a collection of discrete features; each discrete feature has dimension 128 and a unique identifying index in the codebook, namely the discrete feature index.
The input of the discrete feature codebook is the probability matrix, and its output is a discrete feature matrix of dimension bs×128×(h/8)×(w/8).
The semantic image decoder is specifically the decoder of DALL-E's VQ-VAE model; its input is the discrete feature matrix output by the codebook, of dimension bs×128×(h/8)×(w/8), and its output is the predicted semantic image, of dimension bs×3×h×w.
The learnable color patch-category mapping module is a single-layer Swin Transformer module. Its input is the predicted semantic image of dimension bs×3×h×w output by the semantic image decoder, and its output is a pixel category probability matrix of dimension bs×C×h×w.
The auxiliary pixel classifier is a convolution module whose input is the aggregated feature matrix output by the feature aggregation layer of the discrete feature index classifier; its output is an auxiliary semantic mask of dimension bs×1×h×w.
The unlabeled region completion module replaces the pixel regions without category annotation in the ground-truth semantic mask with the categories predicted by the auxiliary pixel classifier. Its inputs are the ground-truth semantic mask from the dataset and the auxiliary semantic mask output by the auxiliary pixel classifier, both of dimension bs×1×h×w. Its output is the fully pixel-annotated ground-truth semantic mask, of dimension bs×1×h×w.
The category-color mapping module obtains, for each pixel, the color corresponding to its category. Its input is the fully pixel-annotated ground-truth semantic mask output by the unlabeled region completion module, of dimension bs×1×h×w, and its output is the ground-truth semantic image, of dimension bs×3×h×w.
The semantic image encoder is specifically the encoder of DALL-E's VQ-VAE model; its input is the ground-truth semantic image output by the category-color mapping module, of dimension bs×3×h×w, and its output is a ground-truth equivalent feature matrix of dimension bs×128×(h/8)×(w/8).
Preferably, the discrete feature codebook operates as follows: for each element of the probability matrix, the index of the discrete feature with the highest probability is found; the indices of all elements form an index matrix; then, for each element of the index matrix, the corresponding discrete feature is looked up in the codebook, and the discrete features of all elements constitute the discrete feature matrix.
The learnable color patch-category mapping module first traverses all pixels of the semantic image. For the square local region centered on the pixel L(i,j) in row i, column j, i.e. all pixels {L(p,q) | p∈[i-6,i+6], q∈[j-6,j+6]}, the module predicts the C category probabilities of the center pixel L(i,j), where C is the total number of categories, i∈[1,h], and j∈[1,w]. Finally, the category probabilities of all pixels L(i,j) are assembled according to their spatial positions to obtain the predicted pixel category probability matrix.
According to another aspect of the present invention, a training method for a semantic segmentation network model based on category colorization technology is provided, comprising the following steps:
(1) Obtain the ADE20K dataset; assign its 25,574 groups of images and corresponding ground-truth semantic masks to the training set, and its 2,000 groups of images and corresponding ground-truth semantic masks to the validation set.
(2) Generate a category color array of dimension C×3.
(3) Use the training set of the ADE20K dataset obtained in step (1) to train the image feature extractor, the discrete feature index classifier, and the auxiliary pixel classifier, obtaining the trained weights of the image feature extractor and discrete feature index classifier as well as the weights of those two components produced at multiple intermediate iterations.
(4) Use the training set of the ADE20K dataset obtained in step (1) and one of the sets of image feature extractor and discrete feature index classifier weights produced at the intermediate iterations of step (3) to train the learnable color patch-category mapping module, obtaining the trained learnable color patch-category mapping module.
(5) Save the trained weights of the image feature extractor and discrete feature index classifier obtained in step (3), the trained weights of the learnable color patch-category mapping module obtained in step (4), and the downloaded weights of the discrete feature codebook and semantic image decoder, to obtain the weights of the semantic segmentation network model based on category colorization technology. The codebook and decoder weights are, respectively, the discrete feature codebook weights and the decoder weights of the DALL-E VQ-VAE model downloaded from the Internet.
Preferably, step (2) comprises the following substeps:
(2-1) Generate three one-dimensional arrays A_R, A_G, and A_B, whose elements a_{k1}^R, a_{k2}^G, and a_{k3}^B form arithmetic sequences evenly spread over [0, 255], with k1∈[1, total number of elements in A_R], k2∈[1, total number of elements in A_G], and k3∈[1, total number of elements in A_B].
(2-2) Set counters k1=1, k2=1, k3=1, and initialize the RGB color array A_RGB as an empty array.
(2-3) Judge whether k1 is greater than the preset maximum loop count J (equal to the total number of elements in A_R); if so, go to step (2-13), otherwise go to step (2-4).
(2-4) Judge whether k2 is greater than the preset maximum loop count K (equal to the total number of elements in A_G); if so, go to step (2-12), otherwise go to step (2-5).
(2-5) Judge whether k3 is greater than the preset maximum loop count Q (equal to the total number of elements in A_B); if so, go to step (2-11), otherwise go to step (2-6).
(2-6) Generate a random integer r_R between -15 and 15 and update the k1-th element of A_R as a_{k1}^R ← a_{k1}^R + r_R, obtaining the updated k1-th element of A_R.
(2-7) Generate a random integer r_G between -15 and 15 and update the k2-th element of A_G as a_{k2}^G ← a_{k2}^G + r_G, obtaining the updated k2-th element of A_G.
(2-8) Generate a random integer r_B between -15 and 15 and update the k3-th element of A_B as a_{k3}^B ← a_{k3}^B + r_B, obtaining the updated k3-th element of A_B.
(2-9) Compose the updated elements a_{k1}^R, a_{k2}^G, and a_{k3}^B obtained in steps (2-6) to (2-8) into one three-dimensional element (a_{k1}^R, a_{k2}^G, a_{k3}^B) and append it to the end of the RGB color array A_RGB, obtaining the updated A_RGB.
(2-10) Set k3 = k3 + 1 and return to step (2-5).
(2-11) Set k3 = 1 and k2 = k2 + 1, and return to step (2-4).
(2-12) Set k2 = 1 and k1 = k1 + 1, and return to step (2-3).
(2-13) Take the first C three-dimensional elements of the updated RGB color array A_RGB obtained in step (2-9) as the category color array.
Preferably, step (3) comprises the following substeps:
(3-1) Initialize the weights of the image feature extractor, discrete feature index classifier, auxiliary pixel classifier, category-color mapping module, semantic image encoder, and discrete feature codebook, to obtain the initialized versions of these components.
(3-2) Set counter i=1 and initialize the hyperparameters of the training process.
(3-3) Obtain multiple images and their corresponding ground-truth semantic masks from the training set of the ADE20K dataset obtained in step (1).
(3-4) Perform data preprocessing on the images and ground-truth masks obtained in step (3-3) to obtain multiple preprocessed images and ground-truth masks.
(3-5) Use the sequentially connected image feature extractor and discrete feature index classifier to map the preprocessed images obtained in step (3-4) into multiple probability matrices of discrete feature indices, of dimension bs×8192×(h/8)×(w/8), and multiple aggregated feature matrices.
(3-6) Input the aggregated feature matrices obtained in step (3-5) into the auxiliary pixel classifier to obtain multiple auxiliary semantic masks of dimension bs×1×h×w.
(3-7) Input the preprocessed ground-truth masks obtained in step (3-4) and the auxiliary semantic masks obtained in step (3-6) into the unlabeled region completion module to obtain multiple fully pixel-annotated ground-truth semantic masks of dimension bs×1×h×w.
(3-8) Use the category-color mapping module, the semantic image encoder, and the discrete feature codebook to map the fully pixel-annotated ground-truth masks obtained in step (3-7) into multiple ground-truth discrete feature index matrices.
(3-9) Input the probability matrices of discrete feature indices obtained in step (3-5) and the ground-truth discrete feature index matrices obtained in step (3-8) into the cross-entropy loss function to obtain the semantic feature loss value.
(3-10) Backpropagate the semantic feature loss value obtained in step (3-9) to obtain the gradients of the image feature extractor and the discrete feature index classifier.
(3-11) Use the learning rate set in step (3-2), the AdamW optimizer, and the gradients obtained in step (3-10) to update the weights of the image feature extractor and discrete feature index classifier, obtaining their new weights.
(3-12) Set counter i=i+1 and reset the gradients of the image feature extractor and discrete feature index classifier to 0.
(3-13) Judge whether i is an integer multiple of the preset model saving interval t; if so, save the weights of the image feature extractor and discrete feature index classifier and name them the weights of the i-th iteration; otherwise go to step (3-14).
(3-14) Judge whether i is greater than the preset maximum number of iterations n; if so, go to step (3-15), otherwise return to step (3-3).
(3-15) Save the weights of the image feature extractor and discrete feature index classifier to obtain their trained weights.
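Substeps (3-5) to (3-12) amount to one standard supervised training iteration. The following PyTorch-style sketch condenses them; the module handles, their call signatures, and the helper nearest_code_indices (a minimal version of substep (3-8-3), given after the next list) are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def train_stage1_step(extractor, index_classifier, aux_classifier,
                      cat2color, encoder, codebook, optimizer,
                      images, gt_masks, ignore_label=255):
    feats = extractor(images)                      # four multi-scale feature maps
    logits, aggregated = index_classifier(feats)   # (bs, 8192, h/8, w/8) and aggregated features
    aux_mask = aux_classifier(aggregated).argmax(1, keepdim=True)  # (bs, 1, h, w)

    # Step (3-7): fill pixels without category annotation with the auxiliary prediction.
    full_masks = torch.where(gt_masks == ignore_label, aux_mask, gt_masks)

    with torch.no_grad():                          # mapping, encoder and codebook are frozen
        semantic_image = cat2color(full_masks)     # (bs, 3, h, w) colorized ground truth
        z = encoder(semantic_image)                # (bs, 128, h/8, w/8) equivalent features
        target_idx = nearest_code_indices(z, codebook)  # (bs, h/8, w/8) index targets

    # Step (3-9): cross-entropy between index probabilities and index targets.
    # The auxiliary pixel classifier is also trained in step (3); its own loss
    # term is not specified in the text and is omitted here.
    loss = F.cross_entropy(logits, target_idx)
    loss.backward()                                # step (3-10)
    optimizer.step()                               # step (3-11), AdamW
    optimizer.zero_grad()                          # step (3-12)
    return loss.item()
```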
Preferably, step (3-1) comprises the following substeps:
(3-1-1) Load the pre-trained weights of the image feature extractor into the image feature extractor.
(3-1-2) Initialize the weights of the discrete feature index classifier to random values, and set them as weights with gradients, i.e. the weights of the discrete feature index classifier will be optimized during offline training.
(3-1-3) Initialize the weights of the auxiliary pixel classifier to random values, and set them as weights with gradients.
(3-1-4) Load the preset category color array obtained in step (2) into the category-color mapping module, and set the module's weights as gradient-free weights.
(3-1-5) Load the pre-trained weights of the semantic image encoder into the semantic image encoder, and set them as gradient-free weights.
(3-1-6) Load the pre-trained weights of the discrete feature codebook into the discrete feature codebook, and set them as gradient-free weights.
Preferably, step (3-8) comprises the following substeps:
(3-8-1) Input the fully pixel-annotated ground-truth semantic masks obtained in step (3-7) into the category-color mapping module to obtain multiple ground-truth semantic images.
(3-8-2) Input the ground-truth semantic images obtained in step (3-8-1) into the initialized semantic image encoder obtained in step (3-1-5) to obtain multiple ground-truth equivalent feature matrices.
(3-8-3) For each ground-truth equivalent feature matrix obtained in step (3-8-2), look up in the initialized discrete feature codebook obtained in step (3-1-6) the discrete features closest to the matrix's features, together with their discrete feature indices, and assemble all retrieved indices by spatial position to obtain the ground-truth discrete feature index matrix, of dimension bs×1×(h/8)×(w/8).
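A minimal sketch of substep (3-8-3), assuming L2 distance as the similarity measure (the text says only "closest"):

```python
import torch

def nearest_code_indices(z, codebook):
    """Map each 128-d feature in z to the index of its closest codebook entry.
    z: (bs, 128, H, W); codebook: (8192, 128). Returns (bs, H, W) long tensor."""
    bs, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)   # (bs*H*W, 128)
    dists = torch.cdist(flat, codebook)           # pairwise L2 distances to all codes
    idx = dists.argmin(dim=1)                     # nearest discrete feature index
    return idx.view(bs, h, w)
```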
Preferably, step (4) comprises the following substeps:
(4-1) Initialize the weights of the image feature extractor, discrete feature index classifier, discrete feature codebook, semantic image decoder, and learnable color patch-category mapping module, to obtain the initialized versions of these components.
(4-2) Set counter i=1 and initialize the hyperparameters of the training process.
(4-3) Obtain multiple images and their corresponding ground-truth semantic masks from the training set of the ADE20K dataset obtained in step (1).
(4-4) Perform data preprocessing on the images and ground-truth masks obtained in step (4-3) to obtain multiple preprocessed images and ground-truth masks.
(4-5) Use the sequentially connected image feature extractor and discrete feature index classifier to map the preprocessed images obtained in step (4-4) into multiple probability matrices of discrete feature indices, of dimension bs×8192×(h/8)×(w/8).
(4-6) Input the probability matrices obtained in step (4-5) into the discrete feature codebook to obtain multiple discrete feature matrices of dimension bs×128×(h/8)×(w/8).
(4-7) Input the discrete feature matrices obtained in step (4-6) into the initialized semantic image decoder obtained in step (4-1) to obtain multiple predicted semantic images of dimension bs×3×h×w.
(4-8) Input the predicted semantic images obtained in step (4-7) into the learnable color patch-category mapping module to obtain multiple predicted pixel category probability matrices of dimension bs×C×h×w, where C is the total number of categories.
(4-9) Input the predicted pixel category probability matrices obtained in step (4-8) and the preprocessed ground-truth masks obtained in step (4-4) into the cross-entropy loss function to obtain the pixel classification loss value.
(4-10) Backpropagate the pixel classification loss value obtained in step (4-9) to obtain the gradient of the learnable color patch-category mapping module.
(4-11) Use the learning rate set in step (4-2), the AdamW optimizer, and the gradient obtained in step (4-10) to update the weights of the learnable color patch-category mapping module, obtaining its new weights.
(4-12) Set counter i=i+1 and reset the gradient of the learnable color patch-category mapping module to 0.
(4-13) Judge whether i is greater than the preset maximum number of iterations n; if so, go to step (4-14), otherwise return to step (4-3).
(4-14) Save the weights of the learnable color patch-category mapping module to obtain its trained weights.
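The second training stage, substeps (4-5) to (4-12), runs the frozen generative pipeline forward and trains only the mapping module. A condensed sketch, again under assumed module interfaces; ignoring label 255 in the loss is an assumption, since the ground-truth masks are not completed in this stage:

```python
import torch
import torch.nn.functional as F

def train_stage2_step(extractor, index_classifier, codebook, decoder,
                      patch2cat, optimizer, images, gt_masks):
    with torch.no_grad():                        # steps (4-5)-(4-7): frozen pipeline
        feats = extractor(images)
        logits, _ = index_classifier(feats)      # (bs, 8192, h/8, w/8)
        idx = logits.argmax(dim=1)               # most probable code per position
        z = codebook[idx].permute(0, 3, 1, 2)    # (bs, 128, h/8, w/8) code embeddings
        semantic_image = decoder(z)              # (bs, 3, h, w) predicted semantic image

    class_logits = patch2cat(semantic_image)     # (bs, C, h, w), step (4-8)
    loss = F.cross_entropy(class_logits,
                           gt_masks.squeeze(1).long(),
                           ignore_index=255)     # step (4-9)
    loss.backward()                              # step (4-10)
    optimizer.step()                             # step (4-11)
    optimizer.zero_grad()                        # step (4-12)
    return loss.item()
```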
Preferably, step (4-1) comprises the following substeps:
(4-1-1) Load the k-th-iteration weights of the image feature extractor obtained in step (3) into the image feature extractor, and set them as gradient-free weights.
(4-1-2) Load the k-th-iteration weights of the discrete feature index classifier obtained in step (3) into the discrete feature index classifier, and set them as gradient-free weights.
(4-1-3) Load the pre-trained weights of the discrete feature codebook into the discrete feature codebook, and set them as gradient-free weights.
(4-1-4) Load the pre-trained weights of the semantic image decoder into the semantic image decoder, and set them as gradient-free weights.
(4-1-5) Initialize the weights of the learnable color patch-category mapping module to random values, and set them as weights with gradients.
In general, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects:
(1) The present invention adopts a discrete feature index classifier, a discrete feature codebook, and a semantic image decoder. It colorizes categories into colors and remodels semantic segmentation from a pixel classification task into an image generation task, effectively exploiting the advantage that generative models can embody richer semantic information. The invention therefore overcomes the limited richness of the semantic information learned by existing discriminative methods, and greatly improves the accuracy and generalization of semantic segmentation.
(2) The method adopts a learnable color patch-category mapping module, which judges the category of the center pixel from the colorized category information of the local region, avoiding the practice of judging a pixel's category in isolation from the feature of that single pixel. The invention therefore solves the technical problem of existing discriminative methods that pixel category predictions within a local region are too independent, which causes noise in the segmentation result and degrades accuracy.
(3) The method adopts an auxiliary pixel classifier and an unlabeled region completion module. During offline training, image features are fed into the auxiliary pixel classifier to obtain an auxiliary semantic mask, which the unlabeled region completion module uses to complete the ground-truth semantic masks containing regions without category annotation, yielding fully pixel-annotated ground-truth masks. This prevents information from unlabeled regions from entering the VQ-VAE encoder and producing erroneous supervision signals. The invention therefore solves the technical problem that existing generative methods cannot properly handle the regions without category annotation in the semantic masks of training samples, which introduces erroneous information into training and degrades the precision of the segmentation result.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the semantic segmentation network model based on category colorization technology of the present invention.
Figure 2 is a flow chart of the training method of the semantic segmentation network model based on category colorization technology of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict.
As shown in Figure 1, the present invention provides a semantic segmentation network model based on category colorization technology, comprising a sequentially connected image feature extractor, discrete feature index classifier, discrete feature codebook, semantic image decoder, learnable color patch-category mapping module, auxiliary pixel classifier, unlabeled region completion module, category-color mapping module, and semantic image encoder, where the category-color mapping module, semantic image encoder, and auxiliary pixel classifier are used only for the offline training of the model.
The image feature extractor is a Swin Transformer network comprising 24 sequentially connected Swin Transformer modules. It receives an image of dimension bs×3×h×w and outputs four feature matrices of dimensions bs×192×(h/4)×(w/4), bs×384×(h/8)×(w/8), bs×768×(h/16)×(w/16), and bs×1536×(h/32)×(w/32), where bs is the batch size preset for offline training, and h and w are the numbers of pixels on the long and short sides of the image, respectively.
The discrete feature index classifier comprises a feature aggregation layer, a feature processing layer, and a classifier module.
The feature aggregation layer is a convolution module whose input is the four feature matrices output by the Swin Transformer network and whose output is a single aggregated feature matrix.
The feature processing layer consists of two Swin Transformer modules; its input is the aggregated feature matrix and its output is a processed feature matrix.
The input of the classifier module is the feature matrix output by the feature processing layer, and its output is a probability matrix of dimension bs×8192×(h/8)×(w/8); each element of the probability matrix holds the probabilities of the 8192 discrete features occurring at that position.
The discrete feature codebook is a collection of discrete features; each discrete feature has dimension 128 and a unique identifying index in the codebook, namely the discrete feature index.
The input of the discrete feature codebook is the probability matrix, and its output is a discrete feature matrix of dimension bs×128×(h/8)×(w/8). Specifically, the index of the discrete feature with the highest probability is first found for each element of the probability matrix; the indices of all elements form an index matrix; then, for each element of the index matrix, the corresponding discrete feature is looked up in the codebook, and the discrete features of all elements constitute the discrete feature matrix.
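This argmax-and-gather lookup can be sketched in a few lines of PyTorch, with the codebook represented as a (8192, 128) tensor (an assumed representation for illustration):

```python
import torch

def codebook_lookup(prob, codebook):
    """Turn the (bs, 8192, H, W) probability matrix into the (bs, 128, H, W)
    discrete feature matrix: take the most probable code at each position and
    gather its 128-d embedding from the (8192, 128) codebook."""
    idx = prob.argmax(dim=1)                      # (bs, H, W) index matrix
    feats = codebook[idx]                         # (bs, H, W, 128) embedding lookup
    return feats.permute(0, 3, 1, 2).contiguous() # (bs, 128, H, W)
```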
The semantic image decoder is specifically the decoder of DALL-E's VQ-VAE model; its input is the discrete feature matrix output by the codebook, of dimension bs×128×(h/8)×(w/8), and its output is the predicted semantic image, of dimension bs×3×h×w.
The weights of the semantic image decoder are the decoder weights of the DALL-E VQ-VAE model downloaded from the Internet and loaded into the semantic image decoder.
The learnable color patch-category mapping module is a single-layer Swin Transformer module. Its input is the predicted semantic image of dimension bs×3×h×w output by the semantic image decoder, and its output is a pixel category probability matrix of dimension bs×C×h×w. Specifically, the module first traverses all pixels of the semantic image. For the square local region centered on the pixel L(i,j) in row i, column j, i.e. all pixels {L(p,q) | p∈[i-6,i+6], q∈[j-6,j+6]}, the module predicts the C category probabilities of the center pixel L(i,j), where C is the total number of categories, i∈[1,h], and j∈[1,w]. Finally, the category probabilities of all pixels L(i,j) are assembled according to their spatial positions to obtain the predicted pixel category probability matrix.
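The receptive-field idea — C class logits per pixel computed from the 13×13 colorized neighborhood centered on it — can be made explicit with a stand-in module. The patent realizes it with a single-layer Swin Transformer block; the 13×13 convolution below is only an illustrative substitute with the same local region:

```python
import torch.nn as nn

# Stand-in for the learnable color patch-category mapping module: each output
# pixel's class logits depend on the 13x13 neighborhood
# {L(p,q) | p in [i-6, i+6], q in [j-6, j+6]} around it.
# C = 150 is ADE20K's category count.
patch_to_category = nn.Conv2d(in_channels=3, out_channels=150,
                              kernel_size=13, padding=6)
# Input:  (bs, 3, h, w) predicted semantic image
# Output: (bs, 150, h, w) pixel category logits (probabilities after softmax)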
The auxiliary pixel classifier is a convolution module whose input is the aggregated feature matrix output by the feature aggregation layer of the discrete feature index classifier; its output is an auxiliary semantic mask of dimension bs×1×h×w. The auxiliary semantic mask is used to fill in the pixel regions without category annotation in the ground-truth semantic masks of the training-set samples, thereby producing fully pixel-annotated ground-truth masks. The ground-truth segmentation masks come from the dataset used in offline training. Each pixel y_gt of a ground-truth semantic mask takes a value in [1,C]∪{255}, where C is the total number of categories; y_gt∈[1,C] means the pixel has a category annotation, and y_gt=255 means it has none.
The unlabeled region completion module replaces the pixel regions without category annotation in the ground-truth semantic mask with the categories predicted by the auxiliary pixel classifier. Its inputs are the ground-truth semantic mask from the dataset and the auxiliary semantic mask output by the auxiliary pixel classifier, both of dimension bs×1×h×w. Its output is the fully pixel-annotated ground-truth semantic mask, of dimension bs×1×h×w.
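The completion itself is a single masked replacement; a minimal sketch:

```python
import torch

def complete_unlabeled(gt_mask, aux_mask, ignore_label=255):
    """Replace pixels without a category annotation (value 255) in the
    ground-truth mask by the auxiliary classifier's prediction.
    Both inputs and the output have shape (bs, 1, h, w)."""
    return torch.where(gt_mask == ignore_label, aux_mask, gt_mask)
```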
The category-color mapping module obtains, for each pixel, the color corresponding to its category. Its input is the fully pixel-annotated ground-truth semantic mask output by the unlabeled region completion module, of dimension bs×1×h×w, and its output is the ground-truth semantic image, of dimension bs×3×h×w. The semantic image encoder is specifically the encoder of DALL-E's VQ-VAE model; its input is the ground-truth semantic image output by the category-color mapping module, of dimension bs×3×h×w, and its output is a ground-truth equivalent feature matrix of dimension bs×128×(h/8)×(w/8).
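The category-to-color step is a table lookup into the C×3 category color array from step (2). A sketch, assuming 0-based class ids (the patent's ids run from 1 to C):

```python
import torch

def category_to_color(full_mask, color_table):
    """full_mask: (bs, 1, h, w) with class ids in [0, C-1];
    color_table: (C, 3) category color array from step (2).
    Returns the (bs, 3, h, w) ground-truth semantic image."""
    rgb = color_table[full_mask.squeeze(1).long()]  # (bs, h, w, 3) per-pixel lookup
    return rgb.permute(0, 3, 1, 2).contiguous()
```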
The semantic segmentation network model based on category colorization technology is trained with the following steps:
(1) Obtain the ADE20K dataset; assign its 25,574 groups of images and corresponding ground-truth semantic masks to the training set, and its 2,000 groups of images and corresponding ground-truth semantic masks to the validation set.
(2) Generate a category color array of dimension C×3.
Specifically, the category color array is an array of C RGB color values preset for the C categories.
This step comprises the following substeps:
(2-1) Generate three one-dimensional arrays A_R, A_G, and A_B, whose elements a_{k1}^R, a_{k2}^G, and a_{k3}^B form arithmetic sequences evenly spread over [0, 255], with k1∈[1, total number of elements in A_R], k2∈[1, total number of elements in A_G], and k3∈[1, total number of elements in A_B].
Specifically, in this example, the total numbers of elements in A_R, A_G, and A_B are 5, 6, and 5, respectively.
The advantages of this step are as follows. First, the arrays A_R, A_G, and A_B are arithmetic sequences whose elements are evenly distributed between 0 and 255, making full use of the color space to maximize the absolute distance between category colors and reducing the possibility of pixel misclassification. Second, the initial elements and common differences of A_R, A_G, and A_B are all different, which prevents elements of different arrays from taking the same value and further reduces the possibility of pixel misclassification.
(2-2) Set counters k1=1, k2=1, k3=1, and initialize the RGB color array A_RGB as an empty array.
(2-3) Judge whether k1 is greater than the preset maximum loop count J (equal to the total number of elements in A_R); if so, go to step (2-13), otherwise go to step (2-4).
(2-4) Judge whether k2 is greater than the preset maximum loop count K (equal to the total number of elements in A_G); if so, go to step (2-12), otherwise go to step (2-5).
(2-5) Judge whether k3 is greater than the preset maximum loop count Q (equal to the total number of elements in A_B); if so, go to step (2-11), otherwise go to step (2-6).
(2-6) Generate a random integer r_R between -15 and 15 and update the k1-th element of A_R as a_{k1}^R ← a_{k1}^R + r_R, obtaining the updated k1-th element of A_R.
(2-7) Generate a random integer r_G between -15 and 15 and update the k2-th element of A_G as a_{k2}^G ← a_{k2}^G + r_G, obtaining the updated k2-th element of A_G.
(2-8) Generate a random integer r_B between -15 and 15 and update the k3-th element of A_B as a_{k3}^B ← a_{k3}^B + r_B, obtaining the updated k3-th element of A_B.
The advantage of steps (2-6) to (2-8) is that a mutually independent random integer is added to each element before it is inserted into the RGB color array A_RGB, which avoids repeated element values and reduces the possibility of pixel misclassification.
(2-9) Compose the updated elements a_{k1}^R, a_{k2}^G, and a_{k3}^B obtained in steps (2-6) to (2-8) into one three-dimensional element (a_{k1}^R, a_{k2}^G, a_{k3}^B) and append it to the end of the RGB color array A_RGB, obtaining the updated A_RGB.
(2-10) Set k3 = k3 + 1 and return to step (2-5).
(2-11) Set k3 = 1 and k2 = k2 + 1, and return to step (2-4).
(2-12) Set k2 = 1 and k1 = k1 + 1, and return to step (2-3).
(2-13) Take the first C three-dimensional elements of the updated RGB color array A_RGB obtained in step (2-9) as the category color array.
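Substeps (2-1) to (2-13) amount to a triple loop over the three channel arrays with per-channel jitter. A sketch under stated assumptions: the concrete start values and common differences of the three arithmetic sequences are not given in the text and are invented here, and the jitter is applied per color combination rather than accumulated in place as the step description reads literally.

```python
import random

def generate_category_colors(C=150, seed=0):
    """Sketch of steps (2-1)-(2-13). The patent states only that A_R, A_G, A_B
    have 5, 6, and 5 evenly spaced elements in [0, 255] with distinct starts
    and common differences (5*6*5 = 150 candidates, enough for ADE20K's C=150);
    the concrete sequences below are assumptions."""
    random.seed(seed)
    A_R = [10 + 58 * k for k in range(5)]   # assumed: 10, 68, ..., 242
    A_G = [5 + 49 * k for k in range(6)]    # assumed: 5, 54, ..., 250
    A_B = [20 + 55 * k for k in range(5)]   # assumed: 20, 75, ..., 240

    colors = []
    for r in A_R:                           # k1 loop, steps (2-3)/(2-12)
        for g in A_G:                       # k2 loop, steps (2-4)/(2-11)
            for b in A_B:                   # k3 loop, steps (2-5)/(2-10)
                # Steps (2-6)-(2-9): independent random jitter in [-15, 15]
                # per channel, clamped to valid 8-bit intensities.
                colors.append(tuple(min(255, max(0, v + random.randint(-15, 15)))
                                    for v in (r, g, b)))
    return colors[:C]                        # step (2-13): keep the first C
```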
(3)利用步骤(1)得到的ADE20K数据集的训练集,对图像特征提取器、离散特征序号分类器和辅助像素分类器进行训练,以得到训练好的图像特征提取器和离散特征序号分类器的权重和多个迭代过程中产生的图像特征提取器和离散特征序号分类器的权重。(3) Use the training set of the ADE20K data set obtained in step (1) to train the image feature extractor, discrete feature serial number classifier and auxiliary pixel classifier to obtain the trained image feature extractor and discrete feature serial number classification The weights of the image feature extractor and the discrete feature number classifier generated during multiple iterations.
本步骤包含以下子步骤:This step contains the following sub-steps:
(3-1)对图像特征提取器、离散特征序号分类器、辅助像素分类器、类别-颜色映射模块、语义图像编码器和离散特征码表的权重进行初始化,以得到初始化后的图像特征提取器、离散特征序号分类器、辅助像素分类器、类别-颜色映射模块、语义图像编码器和离散特征码表。(3-1) Initialize the weights of the image feature extractor, discrete feature number classifier, auxiliary pixel classifier, category-color mapping module, semantic image encoder and discrete feature code table to obtain the initialized image feature extraction , discrete feature number classifier, auxiliary pixel classifier, category-color mapping module, semantic image encoder and discrete feature code table.
This step includes the following sub-steps:
(3-1-1) Load the pre-trained weights of the image feature extractor into the image feature extractor.
Preferably, the backbone network structure of the large version (Swin-Large) of the Swin Transformer is used as the structure of the image feature extractor. The weights of the ImageNet-22K pre-trained Swin Transformer backbone are downloaded from the network and loaded into the image feature extractor, and the weights of the image feature extractor are set as weights with gradients (i.e., the weights of the image feature extractor will be optimized during offline training).
(3-1-2) Initialize the weights of the discrete feature index classifier to random values, and set them as weights with gradients (i.e., the weights of the discrete feature index classifier will be optimized during offline training).
(3-1-3) Initialize the weights of the auxiliary pixel classifier to random values, and set them as weights with gradients.
(3-1-4) Load the preset category color array obtained in step (2) into the category-color mapping module, and set the weights of the category-color mapping module as weights without gradients.
(3-1-5) Load the pre-trained weights of the semantic image encoder into the semantic image encoder, and set the weights of the semantic image encoder as weights without gradients.
Specifically, this step downloads the weights of the encoder of DALL-E's VQ-VAE model from the network and loads them into the semantic image encoder.
(3-1-6) Load the pre-trained weights of the discrete feature codebook into the discrete feature codebook, and set the weights of the discrete feature codebook as weights without gradients.
Specifically, this step downloads the weights of the discrete feature codebook of DALL-E's VQ-VAE model from the network and loads them into the discrete feature codebook.
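The gradient settings in sub-steps (3-1-1) to (3-1-6) translate naturally into PyTorch's requires_grad flags. The sketch below assumes the components are provided as torch.nn.Module objects and uses local checkpoint file names as stand-ins for the weights downloaded from the network; all names are illustrative, not fixed by the patent.

```python
import torch
import torch.nn as nn

def init_stage_one(feature_extractor: nn.Module,
                   index_classifier: nn.Module,
                   aux_pixel_classifier: nn.Module,
                   vqvae_encoder: nn.Module,
                   codebook: nn.Module) -> None:
    # (3-1-1)/(3-1-5)/(3-1-6): load the pre-trained weights.
    feature_extractor.load_state_dict(torch.load("swin_large_in22k.pth"))
    vqvae_encoder.load_state_dict(torch.load("dalle_vqvae_encoder.pth"))
    codebook.load_state_dict(torch.load("dalle_vqvae_codebook.pth"))

    # (3-1-1) to (3-1-3): trainable parts have weights *with* gradients,
    # so they are optimized during offline training.
    for m in (feature_extractor, index_classifier, aux_pixel_classifier):
        for p in m.parameters():
            p.requires_grad = True

    # (3-1-5)/(3-1-6): the VQ-VAE encoder and the codebook are frozen —
    # weights *without* gradients. The category-color mapping of (3-1-4)
    # is a fixed lookup table and has no trainable parameters at all.
    for m in (vqvae_encoder, codebook):
        for p in m.parameters():
            p.requires_grad = False
```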
(3-2) Set the counter i = 1 and initialize the hyperparameters of the training process to obtain the initialized hyperparameters of the training process.
The hyperparameters of the training process include the maximum number of iterations, the batch size bs, the learning rate, and the optimizer.
Specifically, the batch size bs is set to 32, the initial learning rate is set to 0.00001, and the optimizer is set to the AdamW optimizer.
(3-3) Obtain multiple images and the ground truths of their corresponding semantic masks from the training set of the ADE20K dataset obtained in step (1).
(3-4) Perform data preprocessing on the multiple images and semantic mask ground truths obtained in step (3-3) to obtain multiple preprocessed images and preprocessed semantic mask ground truths.
Specifically, the preprocessing in this step consists of, in order, interpolation to a unified scale, random horizontal flipping, random cropping, and normalization, where the dimensions of the preprocessed images are bs×3×h×w, the dimensions of the corresponding preprocessed semantic masks are bs×1×h×w, and bs is the batch size set in step (3-2).
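As one way to realize this preprocessing pipeline, the following sketch uses torchvision transforms; the crop size and the ImageNet normalization statistics are illustrative assumptions rather than values fixed by the method. Note that the semantic mask must undergo the same geometric operations (resize with nearest-neighbor interpolation, identical flip and crop) but no normalization.

```python
import torchvision.transforms as T

h, w = 512, 512  # assumed crop size; the patent leaves h and w generic
image_preprocess = T.Compose([
    T.Resize(h),                      # interpolation to a unified scale
    T.RandomHorizontalFlip(p=0.5),    # random horizontal flip
    T.RandomCrop((h, w)),             # random crop to h×w
    T.ToTensor(),                     # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),  # per-channel normalization
])
```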
(3-5) Use the sequentially connected image feature extractor and discrete feature index classifier to map the multiple preprocessed images obtained in step (3-4) into multiple probability matrices of discrete feature indices and multiple aggregated feature matrices.
This step includes the following sub-steps:
(3-5-1) Input each preprocessed image obtained in step (3-4) into the image feature extractor to obtain, for each image, image features at four different spatial scales and channel numbers.
(3-5-2) Input the image features at four different spatial scales and channel numbers corresponding to each image, obtained in step (3-5-1), into the discrete feature index classifier to obtain, for each image, a probability matrix of discrete feature indices and an aggregated feature matrix.
(3-6) Input the multiple aggregated feature matrices obtained in step (3-5) into the auxiliary pixel classifier to obtain multiple auxiliary semantic masks with dimensions bs×1×h×w.
(3-7) Input the ground truths of the multiple preprocessed semantic masks obtained in step (3-4) and the auxiliary semantic masks obtained in step (3-6) into the unlabeled-region completion module to obtain the ground truths of multiple fully annotated semantic masks with dimensions bs×1×h×w.
Specifically, this step replaces the pixels whose value is 255 in the semantic mask ground truths obtained in step (3-4) with the pixels at the corresponding positions of the auxiliary semantic masks obtained in step (3-6), so as to obtain the ground truths of multiple fully annotated semantic masks.
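A minimal sketch of this replacement rule, assuming the ground-truth and auxiliary masks are integer class maps of shape (bs, 1, h, w) in which the value 255 marks unlabeled pixels; tensor names are illustrative.

```python
import torch

def complete_unlabeled(gt_mask: torch.Tensor, aux_mask: torch.Tensor) -> torch.Tensor:
    """Replace unlabeled (255) pixels of the ground truth with the auxiliary prediction."""
    return torch.where(gt_mask == 255, aux_mask, gt_mask)
```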
The advantage of steps (3-5) to (3-7) is that the auxiliary pixel classifier uses the aggregated features to generate auxiliary semantic masks, and the unlabeled-region completion module fills in the unlabeled regions that are ubiquitous in semantic mask ground truths, yielding ground truths in which every pixel carries a category label. This prevents the subsequent semantic image encoder from encoding information without category annotation and keeps such erroneous information from interfering with the ground truths of the discrete feature indices. It therefore solves the problem that unlabeled regions in the semantic mask ground truths disturb model training and thereby degrade model accuracy.
(3-8) Use the category-color mapping module, the semantic image encoder, and the discrete feature codebook to map the ground truths of the multiple fully annotated semantic masks obtained in step (3-7) into the ground truths of multiple discrete feature index matrices.
Specifically, this step maps the ground truths of multiple preprocessed semantic masks with dimensions bs×1×h×w into the ground truths of multiple discrete feature index matrices.
This step includes the following sub-steps:
(3-8-1) Input the ground truths of the multiple fully annotated semantic masks obtained in step (3-7) into the category-color mapping module to obtain the ground truths of multiple semantic images.
(3-8-2) Input the ground truths of the multiple semantic images obtained in step (3-8-1) into the initialized semantic image encoder obtained in step (3-1-5) to obtain the ground truths of multiple equivalent feature matrices.
(3-8-3) For the ground truth of each equivalent feature matrix obtained in step (3-8-2), query the initialized discrete feature codebook obtained in step (3-1-6) for the discrete features closest to the ground truth of that equivalent feature matrix, together with their discrete feature indices, and concatenate all queried discrete feature indices according to their spatial positions to obtain the ground truth of the discrete feature index matrix.
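Sub-step (3-8-3) is a nearest-neighbor lookup in the codebook. The sketch below assumes the encoder output has shape (bs, d, H, W) and the codebook is a (K, d) tensor of discrete features; shapes and names are illustrative assumptions.

```python
import torch

def nearest_code_indices(feats: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """feats: (bs, d, H, W) encoder output; codebook: (K, d) discrete features.
    Returns (bs, H, W) indices of the nearest code at each spatial position."""
    bs, d, H, W = feats.shape
    flat = feats.permute(0, 2, 3, 1).reshape(-1, d)   # (bs*H*W, d)
    dists = torch.cdist(flat, codebook)               # L2 distance to every code
    idx = dists.argmin(dim=1)                         # nearest discrete feature index
    return idx.view(bs, H, W)                         # re-assembled by spatial position
```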
(3-9) Input the multiple probability matrices of discrete feature indices obtained in step (3-5) and the ground truths of the multiple discrete feature index matrices obtained in step (3-8) into the cross-entropy loss function to obtain the semantic feature loss value.
(3-10) Use the semantic feature loss value obtained in step (3-9) for backpropagation to obtain the gradients of the image feature extractor and the discrete feature index classifier.
(3-11) Use the learning rate set in step (3-2), the AdamW optimizer, and the gradients obtained in step (3-10) to update the weights of the image feature extractor and the discrete feature index classifier, so as to obtain new weights for the image feature extractor and the discrete feature index classifier.
(3-12) Set the counter i = i + 1, and set the gradients of the image feature extractor and the discrete feature index classifier to 0.
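Steps (3-9) to (3-12) form one optimization step. A minimal sketch, assuming `logits` holds the discrete feature index scores from step (3-5) and `target` the ground-truth index matrix from step (3-8), with an AdamW optimizer built from the step (3-2) hyperparameters; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(logits: torch.Tensor, target: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """logits: (bs, K, H, W); target: (bs, H, W) long tensor of true indices."""
    loss = F.cross_entropy(logits, target)  # (3-9): semantic feature loss
    loss.backward()                         # (3-10): gradients by backpropagation
    optimizer.step()                        # (3-11): AdamW weight update
    optimizer.zero_grad()                   # (3-12): reset gradients to 0
    return loss.item()

# Built once beforehand over the trainable parameters, e.g.:
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
```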
(3-13) Determine whether i is an integer multiple of the preset model saving interval t. If so, save the weights of the image feature extractor and the discrete feature index classifier, naming them the weights of the image feature extractor and the discrete feature index classifier at the i-th iteration; otherwise, proceed to step (3-14).
Specifically, the preset model saving interval t ranges from 1000 to 10000, preferably 16000.
(3-14) Determine whether i is greater than the preset maximum number of iterations n. If so, proceed to step (3-15); otherwise, return to step (3-3).
Specifically, the preset maximum number of iterations n ranges from 1000 to 1000000, preferably 160000.
(3-15) Save the weights of the image feature extractor and the discrete feature index classifier to obtain the weights of the trained image feature extractor and discrete feature index classifier.
(4) Use the training set of the ADE20K dataset obtained in step (1) and one of the sets of weights of the image feature extractor and discrete feature index classifier produced during the multiple iterations of step (3) to train the learnable color patch-category mapping module, so as to obtain the trained learnable color patch-category mapping module.
This step includes the following sub-steps:
(4-1) Initialize the weights of the image feature extractor, the discrete feature index classifier, the discrete feature codebook, the semantic image decoder, and the learnable color patch-category mapping module, so as to obtain the initialized image feature extractor, discrete feature index classifier, discrete feature codebook, semantic image decoder, and learnable color patch-category mapping module.
This step includes the following sub-steps:
(4-1-1) Load the weights of the image feature extractor at the k-th iteration, obtained in step (3), into the image feature extractor, and set the weights of the image feature extractor as weights without gradients.
Specifically, the preset k is smaller than the maximum number of iterations n and is an integer multiple of the model saving interval t, preferably 32000.
(4-1-2) Load the weights of the discrete feature index classifier at the k-th iteration, obtained in step (3), into the discrete feature index classifier, and set the weights of the discrete feature index classifier as weights without gradients.
The advantage of steps (4-1-1) to (4-1-2) is that using the weights of the image feature extractor and the discrete feature index classifier produced partway through training makes the semantic images they predict relatively noisy. Compared with accurate semantic images, noisy semantic images act as a form of data augmentation for the learnable color patch-category mapping module; training with them effectively improves the accuracy and robustness with which the module infers the center pixel's class from local information. This solves the problem of existing discriminative semantic segmentation methods, in which the class predictions of pixels within a local region are overly independent, producing noise in the segmentation results and hence poor accuracy.
(4-1-3) Load the pre-trained weights of the discrete feature codebook into the discrete feature codebook, and set the weights of the discrete feature codebook as weights without gradients.
Specifically, this step downloads the weights of the discrete feature codebook of DALL-E's VQ-VAE model from the network and loads them into the discrete feature codebook.
(4-1-4) Load the pre-trained weights of the semantic image decoder into the semantic image decoder, and set the weights of the semantic image decoder as weights without gradients.
Specifically, this step downloads the weights of the decoder of DALL-E's VQ-VAE model from the network and loads them into the semantic image decoder.
(4-1-5) Initialize the weights of the learnable color patch-category mapping module to random values, and set them as weights with gradients (i.e., the weights of the learnable color patch-category mapping module will be optimized during offline training).
(4-2) Set the counter i = 1 and initialize the hyperparameters of the training process to obtain the initialized hyperparameters of the training process.
The hyperparameters of the training process include the maximum number of iterations, the batch size bs, the learning rate, and the optimizer.
Specifically, the batch size bs is set to 32, the initial learning rate is set to 0.00001, and the optimizer is set to the AdamW optimizer.
(4-3) Obtain multiple images and the ground truths of their corresponding semantic masks from the training set of the ADE20K dataset obtained in step (1).
(4-4) Perform data preprocessing on the multiple images and semantic mask ground truths obtained in step (4-3) to obtain multiple preprocessed images and preprocessed semantic mask ground truths.
Specifically, the preprocessing in this step consists of, in order, interpolation to a unified scale, random horizontal flipping, random cropping, and normalization, where the dimensions of the preprocessed images are bs×3×h×w, the dimensions of the corresponding preprocessed semantic masks are bs×1×h×w, and bs is the batch size set in step (4-2).
(4-5) Use the sequentially connected image feature extractor and discrete feature index classifier to map the multiple preprocessed images obtained in step (4-4) into multiple probability matrices of discrete feature indices.
This step includes the following sub-steps:
(4-5-1) Input each preprocessed image obtained in step (4-4) into the image feature extractor to obtain, for each image, image features at four different spatial scales and channel numbers.
(4-5-2) Input the image features at four different spatial scales and channel numbers corresponding to each image, obtained in step (4-5-1), into the discrete feature index classifier to obtain the probability matrices of discrete feature indices.
(4-6) Input the multiple probability matrices of discrete feature indices obtained in step (4-5) into the discrete feature codebook to obtain multiple discrete feature matrices.
(4-7) Input the multiple discrete feature matrices obtained in step (4-6) into the initialized semantic image decoder obtained in step (4-1-4) to obtain multiple predicted semantic images with dimensions bs×3×h×w.
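Steps (4-5) to (4-7) chain index prediction, codebook lookup, and decoding. A minimal sketch, assuming the codebook is a (K, d) tensor and `vqvae_decoder` is the frozen decoder module from step (4-1-4); all names are illustrative.

```python
import torch

def predict_semantic_image(logits: torch.Tensor, codebook: torch.Tensor,
                           vqvae_decoder) -> torch.Tensor:
    """logits: (bs, K, H, W) discrete feature index scores; codebook: (K, d).
    Returns (bs, 3, h, w) predicted semantic images."""
    idx = logits.argmax(dim=1)                      # (bs, H, W) most likely indices
    feats = codebook[idx]                           # (bs, H, W, d) codebook lookup
    feats = feats.permute(0, 3, 1, 2).contiguous()  # (bs, d, H, W) for the decoder
    return vqvae_decoder(feats)                     # (bs, 3, h, w) semantic image
```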
(4-8) Input the multiple predicted semantic images obtained in step (4-7) into the learnable color patch-category mapping module to obtain multiple predicted pixel class probability matrices with dimensions bs×C×h×w, where C is the total number of classes.
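The patent does not fix the internal structure of the learnable color patch-category mapping module; one plausible minimal form, consistent with its role of inferring each pixel's class from the colors of its local neighborhood, is a small convolutional head such as the following sketch (the 3×3 receptive fields and the channel width are assumptions, not patent values).

```python
import torch.nn as nn

class ColorPatchToClass(nn.Module):
    """Map a 3-channel semantic image to per-pixel class logits from local color context."""
    def __init__(self, num_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=3, padding=1),   # local color context
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, num_classes, kernel_size=1),    # per-pixel class logits
        )

    def forward(self, semantic_image):  # (bs, 3, h, w) -> (bs, C, h, w)
        return self.net(semantic_image)
```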
(4-9) Input the multiple predicted pixel class probability matrices obtained in step (4-8) and the ground truths of the multiple preprocessed semantic masks obtained in step (4-4) into the cross-entropy loss function to obtain the pixel classification loss value.
(4-10) Use the pixel classification loss value obtained in step (4-9) for backpropagation to obtain the gradients of the learnable color patch-category mapping module.
(4-11) Use the learning rate set in step (4-2), the AdamW optimizer, and the gradients obtained in step (4-10) to update the weights of the learnable color patch-category mapping module, so as to obtain new weights for the learnable color patch-category mapping module.
(4-12) Set the counter i = i + 1, and set the gradients of the learnable color patch-category mapping module to 0.
(4-13) Determine whether i is greater than the preset maximum number of iterations n. If so, proceed to step (4-14); otherwise, return to step (4-3).
Specifically, the preset maximum number of iterations n ranges from 1000 to 1000000, preferably 40000.
(4-14) Save the weights of the learnable color patch-category mapping module to obtain the weights of the trained learnable color patch-category mapping module.
(5) Save the weights of the trained image feature extractor and discrete feature index classifier obtained in step (3), the weights of the trained learnable color patch-category mapping module obtained in step (4), and the weights of the discrete feature codebook and semantic image decoder downloaded from the network, so as to obtain the weights of the semantic segmentation network model based on category colorization technology. The weights of the discrete feature codebook and of the semantic image decoder are, respectively, the weights of the discrete feature codebook and of the decoder of DALL-E's VQ-VAE model downloaded from the network.
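A minimal sketch of this bundling step, reusing the illustrative module names from the earlier sketches; the checkpoint keys and file path are assumptions.

```python
import torch
import torch.nn as nn

def save_full_model(feature_extractor: nn.Module, index_classifier: nn.Module,
                    codebook: nn.Module, vqvae_decoder: nn.Module,
                    patch_mapper: nn.Module, path: str) -> None:
    """Bundle the four trained/downloaded components into a single checkpoint."""
    torch.save({
        "feature_extractor": feature_extractor.state_dict(),  # from step (3)
        "index_classifier": index_classifier.state_dict(),    # from step (3)
        "codebook": codebook.state_dict(),                    # DALL-E VQ-VAE
        "vqvae_decoder": vqvae_decoder.state_dict(),          # DALL-E VQ-VAE
        "patch_mapper": patch_mapper.state_dict(),            # from step (4)
    }, path)
```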
Those skilled in the art will readily understand that the above are merely preferred embodiments of the present invention and are not intended to limit it; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall all fall within the scope of protection of the present invention.
Claims (6)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310036249.7A | 2023-01-08 | 2023-01-08 | Semantic segmentation network model and its training method based on category colorization technology |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN116229063A | 2023-06-06 |
| CN116229063B | 2024-01-26 |
Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020156303A1 * | 2019-01-30 | 2020-08-06 | Guangzhou Baiguoyuan Information Technology Co., Ltd. | Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium |
| CN112308860A * | 2020-10-28 | 2021-02-02 | Northwestern Polytechnical University | Earth observation image semantic segmentation method based on self-supervised learning |
| CN114863098A * | 2022-04-15 | 2022-08-05 | Huazhong University of Science and Technology | Tiny weak defect segmentation method for industrial scenes |
| CN115170934A * | 2022-09-05 | 2022-10-11 | Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute (Futian) | Image segmentation method, system, device and storage medium |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220383120A1 * | 2021-05-28 | 2022-12-01 | Google LLC | Self-supervised contrastive learning using random feature corruption |
Non-Patent Citations (1)

| Title |
|---|
| Ze Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", IEEE, pp. 10012-10022 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |