CN114863348B - Video target segmentation method based on self-supervision - Google Patents
- Publication number
- CN114863348B (grant publication); application CN202210658263A (CN202210658263.6A)
- Authority
- CN
- China
- Prior art keywords
- target
- network model
- edge
- segmentation
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 110
- 238000000034 method Methods 0.000 title claims abstract description 29
- 238000003708 edge detection Methods 0.000 claims abstract description 90
- 238000012549 training Methods 0.000 claims abstract description 53
- 238000012937 correction Methods 0.000 claims abstract description 52
- 238000003062 neural network model Methods 0.000 claims abstract description 31
- 238000012360 testing method Methods 0.000 claims abstract description 29
- 238000007781 pre-processing Methods 0.000 claims abstract description 7
- 238000011176 pooling Methods 0.000 claims description 32
- 238000000605 extraction Methods 0.000 claims description 31
- 230000004927 fusion Effects 0.000 claims description 27
- 238000013528 artificial neural network Methods 0.000 claims description 20
- 238000004422 calculation algorithm Methods 0.000 claims description 15
- 238000011478 gradient descent method Methods 0.000 claims description 9
- 238000012795 verification Methods 0.000 claims description 9
- 230000004913 activation Effects 0.000 claims description 7
- 239000000284 extract Substances 0.000 claims description 6
- 239000000203 mixture Substances 0.000 claims description 3
- 238000012544 monitoring process Methods 0.000 abstract description 2
- 230000006870 function Effects 0.000 description 50
- 238000005516 engineering process Methods 0.000 description 7
- 238000004088 simulation Methods 0.000 description 7
- 230000008569 process Effects 0.000 description 6
- 230000004438 eyesight Effects 0.000 description 5
- 238000013135 deep learning Methods 0.000 description 4
- 238000004364 calculation method Methods 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000008447 perception Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000010200 validation analysis Methods 0.000 description 2
- 230000003945 visual behavior Effects 0.000 description 2
- 230000000007 visual effect Effects 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 230000009471 action Effects 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000010924 continuous production Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000003709 image segmentation Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000008439 repair process Effects 0.000 description 1
- 238000003860 storage Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000016776 visual perception Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field

The present invention belongs to the field of computer vision and further relates to video target segmentation technology. Specifically, it is a self-supervised video target segmentation method that can be applied in fields such as autonomous driving, intelligent surveillance, and intelligent UAV tracking.
Background Art

Computer vision aims to imitate the process by which humans build visual perception and is a key link in the development of artificial intelligence. Computer vision algorithms seek to simulate human visual behavior as accurately as possible and to provide perceptual information for downstream tasks. In the human perceptual system, the visual scene changes continuously; under current technology, the storage format closest to human perception is video. Therefore, computer vision algorithms that process video tasks are better able to simulate human visual behavior.

Video target segmentation is an important topic in video processing. Its purpose is to segment the targets of interest from the background across a video sequence. In recent years, owing to the excellent performance of deep learning in computer vision tasks (such as image recognition, target tracking, and action recognition), deep-learning-based video target segmentation algorithms have become the mainstream approach to this task. Their performance depends on the scale of the neural network used, and the network in turn depends on a large amount of training data: the larger the training set, the better the generalization and robustness of the trained network. Under the supervised learning paradigm, producing a video target segmentation training set is costly and time-consuming, since every pixel of every image must be annotated spatially and every frame of every video sequence must be annotated temporally. The performance of a video target segmentation model is also closely related to its structure; by reasonably optimizing the model's inference process, errors in the segmentation process can be effectively reduced.

The research goal of self-supervised learning is to train deep learning models without any manual annotation, so that they can extract effective visual representations from large unlabeled image or video datasets; the extracted representations are then fine-tuned for downstream tasks. Video target segmentation based on self-supervised learning is designed for the specific task of semi-supervised video target segmentation: the segmentation model is trained with a self-supervised learning method, the trained model can be used directly for the segmentation task, and no manually annotated dataset is required during the entire training process.

Research on self-supervised video target segmentation basically follows two lines: designing better pretext tasks so that the trained model has stronger representation extraction ability, and introducing additional mechanisms to reduce the impact of target occlusion and tracking drift in the semi-supervised setting. In 2018, an article titled "Tracking emerges by colorizing videos" was published at the European Conference on Computer Vision; it proposed a self-supervised video tracking model that exploits the natural temporal coherence of color to learn to colorize grayscale video, further improving self-supervised video tracking. However, because it propagates from the previous frame, it is not robust to target occlusion and tracking drift. In 2019, an article titled "Self-supervised video representation learning for correspondence flow" (CorrFlow) was published at the British Machine Vision Conference; by introducing a restricted attention mechanism, it increased the resolution of the model input without increasing the burden on the computing device and improved segmentation accuracy. However, this method does not consider the generalization of feature extraction across targets of different scales and performs poorly when the target scales differ greatly.
Summary of the Invention

The purpose of the present invention is to address the above deficiencies of the prior art by proposing a self-supervised video target segmentation method, in order to solve the technical problems of low segmentation accuracy and the large influence of target occlusion and tracking drift in the prior art.

The idea of the present invention is as follows. First, a self-supervised learning method based on a multi-pixel-scale image reconstruction task is used for target feature extraction, so that the video target segmentation model can account for the features of both large and small targets and obtain better generalization. Then, to address the error accumulation that occurs when the model performs target segmentation, image semantic edges are used to correct the target segmentation mask. Finally, a self-supervised edge fusion network is designed to obtain a more accurate target segmentation mask. Compared with traditional self-supervised video target segmentation methods, the present invention effectively improves the generalization and accuracy of video target segmentation.

To achieve the above object, the technical solution adopted by the present invention includes the following steps:
(1) Obtain the training sample set, verification sample set, and test sample set:

Obtain video sequences from a video target segmentation dataset and preprocess them to obtain a frame sequence set V; divide the frame sequences in this set to obtain the training sample set V_train, the verification sample set V_val, and the test sample set V_test.

(2) Construct and train the image reconstruction neural network model R:

(2a) Build an image reconstruction neural network model R consisting of a feature extraction network; the feature extraction network adopts a residual network comprising multiple convolutional layers, multiple pooling layers, multiple residual unit modules, and a single fully connected layer connected in sequence.
(2b) Define the loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 − α)·L_reg

where L_cls denotes the cross-entropy loss of the quantized image reconstruction task. For the training sample set V_train, E cluster centroids {μ_1, μ_2, ..., μ_E} are selected, with E ≤ 50; each training sample is assigned to the class of its nearest centroid, the number of target categories contained in the frame sequence set V is denoted C, and the centroid positions are corrected so that the same target carries the same label across frames and different targets carry different labels. In L_cls, one term denotes the class of the i-th pixel of a given frame image I_t and the other denotes the prediction of the K-means algorithm. L_reg denotes the regression loss of the RGB image reconstruction task, computed between the real target-frame pixels and the reconstructed target-frame pixels. α is a weight coefficient with 0.1 ≤ α ≤ 0.9.
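For illustration, a combined loss of this form could be written roughly as follows. This is a minimal sketch, not the patent's reference implementation: the per-pixel cluster logits, the K-means pseudo-labels, the reconstructed frame tensor, and the choice of an L1 regression term are all assumptions introduced for the example.

```python
import torch
import torch.nn.functional as F

def mixed_loss(class_logits, kmeans_labels, recon_frame, target_frame, alpha=0.6):
    """Sketch of L_mix = alpha * L_cls + (1 - alpha) * L_reg.

    class_logits : (B, E, H, W) per-pixel scores over the E colour clusters
    kmeans_labels: (B, H, W) pseudo-labels produced by K-means quantisation
    recon_frame  : (B, 3, H, W) reconstructed target frame
    target_frame : (B, 3, H, W) real target frame
    """
    # Cross-entropy between predicted cluster scores and K-means pseudo-labels.
    l_cls = F.cross_entropy(class_logits, kmeans_labels)
    # Regression term for the image reconstruction; L1 is assumed here.
    l_reg = F.l1_loss(recon_frame, target_frame)
    return alpha * l_cls + (1.0 - alpha) * l_reg
```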
(2c) Set the parameters of the feature extraction network and the maximum number of iterations N; according to the loss function of the image reconstruction neural network model R, iteratively train the model R using the target frame images in the training sample set V_train to obtain the trained image reconstruction neural network model R.

(3) Construct and train the side output edge detection network model Q:

(3a) Construct an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence. The side output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a 1×1 kernel and one output channel; the side output edge fusion layer SOFL is a convolutional layer with a 1×1 kernel and one channel.
(3b) Define the loss function of the side output edge detection network model Q:

L_edge = L_side + L_fuse

where L_side denotes the side output edge detection loss, a weighted sum of the per-branch losses in which β_i is the weight coefficient of the i-th side output edge detection branch and l_side^(i) is the loss of the i-th branch's prediction.

The per-branch loss is a class-balanced cross-entropy in which e denotes the ground-truth target edges of the input image, |e^-| denotes the number of edge pixels in the ground truth, |e^+| denotes the number of non-edge pixels in the ground truth, and ω_i denotes the parameters of the corresponding convolutional layer. L_fuse denotes the edge fusion loss.
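A class-balanced edge loss of this kind is commonly implemented as a weighted binary cross-entropy, as in HED-style edge detectors. The sketch below only illustrates that idea under assumed tensor shapes and the pixel-count weighting described above; it is not the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def side_edge_loss(edge_logits, edge_gt):
    """Class-balanced BCE between a predicted edge map and the edge ground truth.

    edge_logits: (B, 1, H, W) raw scores from one side output branch
    edge_gt    : (B, 1, H, W) binary edge map (1.0 = edge pixel)
    """
    num_edge = edge_gt.sum()
    num_non_edge = edge_gt.numel() - num_edge
    total = edge_gt.numel()
    # Weight each class by the relative frequency of the other class so that
    # the sparse edge pixels are not overwhelmed by the background.
    pos_weight = num_non_edge / total
    neg_weight = num_edge / total
    weights = torch.where(edge_gt > 0.5, pos_weight, neg_weight)
    return F.binary_cross_entropy_with_logits(edge_logits, edge_gt, weight=weights)

def edge_total_loss(side_logits_list, fused_logits, edge_gt, betas):
    """L_edge = L_side + L_fuse, with L_side a weighted sum over the side outputs."""
    l_side = sum(b * side_edge_loss(s, edge_gt) for b, s in zip(betas, side_logits_list))
    l_fuse = side_edge_loss(fused_logits, edge_gt)
    return l_side + l_fuse
```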
(3c) Set the maximum number of iterations I; according to the loss function of the side output edge detection network model Q, iteratively train the model Q using the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction neural network model R, and obtain the trained side output edge detection network model Q.

(4) Construct and train the edge correction network model Z:

(4a) Connect in sequence an atrous spatial pyramid pooling model F_γ and a softmax activation output layer, where the atrous spatial pyramid pooling model F_γ consists of multiple convolutional layers and pooling layers connected in sequence, to obtain the edge correction network model Z.
(4b) Define the loss function of the edge correction network model Z:

In L_corr, one quantity is the coarse segmentation of the target frame output by the preceding stage, another is the prediction of the atrous spatial pyramid pooling model F_γ, another denotes the image edges obtained by the Canny algorithm, M denotes the number of pixel classes in the mask, and the remaining quantity denotes the total number of pixels in the mask.

(4c) Set the maximum number of iterations H; according to the loss function of the edge correction network model Z, iteratively train the model Z using the outputs of the image reconstruction network model R and the edge detection network model Q, and obtain the trained edge correction network model Z.
(5) Combine the trained image reconstruction neural network R, the side output edge detection network Q, and the edge correction network model Z to obtain a video target segmentation model whose segmentation result is corrected by image target edges.

(6) Obtain the self-supervised video target segmentation result:

The frame images in the test set V_test are forward-propagated as the input of the video target segmentation model to obtain the predicted segmentation labels of all test frame images, and the final segmentation result maps are obtained from these predicted labels.
Compared with the prior art, the present invention has the following advantages:

First, because the present invention adopts the reconstruction of multi-pixel-scale images as the pretext task for self-supervised learning, the features extracted by the trained model generalize better to both large and small targets in the video segmentation task, and the model therefore performs better in the overall video target segmentation task.

Second, the present invention repairs the target mask using the edges of the target in the video images. A side output edge detection network integrates the feature maps extracted by the layers of the feature extraction network in the video target segmentation model and predicts candidate target edges in the target frame; a self-supervised edge fusion model then fuses the segmentation result output by the video target segmentation model with the target edges output by the side output edge detection network, so that the segmentation mask is corrected according to the target edges and a more accurate segmentation result is obtained.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the implementation of the present invention.

DETAILED DESCRIPTION

The present invention is described in further detail below in conjunction with the accompanying drawing and specific embodiments.
Embodiment 1: Referring to FIG. 1, the self-supervised video target segmentation method proposed by the present invention specifically comprises the following steps:

Step 1: Obtain the training sample set, verification sample set, and test sample set:

Obtain video sequences from a video target segmentation dataset and preprocess them to obtain a frame sequence set V; divide the frame sequences in this set into the training sample set V_train, the verification sample set V_val, and the test sample set V_test. This is implemented as follows:

(1a) Obtain S multi-category video sequences (S ≥ 3000) from the video target segmentation dataset and preprocess them to obtain the frame sequence set V, where the k-th element is the frame sequence consisting of the preprocessed image frames of the k-th video and each sequence contains at least M ≥ 30 frames.

(1b) Randomly draw more than half of the frame sequences from the frame sequence set V to form the training sample set V_train, where S/2 < N < S. For each frame sequence in the training sample set, scale every target frame image to be segmented into an image block of size p×p×h and convert the image format from RGB to Lab. From the remaining frame sequences, draw half to form the verification sample set V_val, where J ≤ S/4; the other half forms the test sample set V_test, with T ≤ S/4, and is likewise converted from RGB to Lab.
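As a rough illustration of this preprocessing step, the resize and RGB-to-Lab conversion could be done with OpenCV as sketched below; the patch size p = 256 and the file-loading details are assumptions made for the example, not prescribed by the patent.

```python
import cv2
import numpy as np

def preprocess_frame(path, p=256):
    """Load a frame, resize it to p x p x 3, and convert it to the Lab colour space."""
    bgr = cv2.imread(path)                       # OpenCV loads images as BGR
    bgr = cv2.resize(bgr, (p, p), interpolation=cv2.INTER_LINEAR)
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)   # convert to Lab
    return lab.astype(np.float32)
```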
Step 2: Construct and train the image reconstruction neural network model R:

(2a) Build an image reconstruction neural network model R consisting of a feature extraction network; the feature extraction network adopts a residual network comprising multiple convolutional layers, multiple pooling layers, multiple residual unit modules, and a single fully connected layer connected in sequence.

(2b) Define the loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 − α)·L_reg

where L_cls denotes the cross-entropy loss of the quantized image reconstruction task: for the training sample set V_train, E cluster centroids {μ_1, μ_2, ..., μ_E} are selected with E ≤ 50, each training sample is assigned to the class of its nearest centroid, the number of target categories contained in the frame sequence set V is C, and the centroid positions are corrected so that the same target carries the same label across frames and different targets carry different labels; one term of L_cls denotes the class of the i-th pixel of a given frame image I_t, and the other denotes the prediction of the K-means algorithm. L_reg denotes the regression loss of the RGB image reconstruction task, computed between the real target-frame pixels and the reconstructed target-frame pixels. α is a weight coefficient with 0.1 ≤ α ≤ 0.9.
(2c) Set the feature extraction network parameters and the maximum number of iterations N; according to the loss function of the image reconstruction neural network model R, iteratively train the model R using the target frame images in the training sample set V_train to obtain the trained model. This is implemented as follows:

(2c1) Let the hyperparameters of the feature extraction network be θ and the maximum number of iterations be N ≥ 150000, with n denoting the current iteration; set n = 1 to initialize the iteration count.

(2c2) Forward-propagate the target frame images in the training sample set V_train as inputs of the image reconstruction neural network model R:

For each target frame I_t to be segmented, select the q frames preceding it as reference frames {I′_0, I′_1, ..., I′_q}, where 2 ≤ q ≤ 5. The target frame I_t and its corresponding set of reference frames are input to the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame, yielding the target-frame feature f_t = Φ(I_t; θ) and the reference-frame features f′_0 = Φ(I′_0; θ), ..., f′_q = Φ(I′_q; θ). The target frames {I_t | 0 ≤ t ≤ N} of the training sample set are input to the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame and the real target frame I_t are input to the RGB image reconstruction task to obtain the RGB reconstruction loss value L_reg.

(2c3) Using the loss function L_mix, compute the loss value of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg; compute the network parameter gradient g(θ) by back-propagation, and then update the network parameters θ by the gradient descent method.

(2c4) Check whether n = N. If so, the trained image reconstruction neural network R is obtained; otherwise set n = n + 1 and return to step (2c2).
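The training loop in steps (2c1)–(2c4) can be pictured as the following PyTorch-style sketch. The model interface, data loader, and optimizer choice are placeholders introduced for illustration; only the overall iterate–forward–backward–update structure follows the description above.

```python
import torch

def train_reconstruction_net(model, loader, mixed_loss, max_iters=300_000, lr=1e-3):
    """Iteratively train the image reconstruction network R (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    n = 1
    data_iter = iter(loader)
    while n <= max_iters:
        try:
            target_frame, ref_frames, kmeans_labels = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            target_frame, ref_frames, kmeans_labels = next(data_iter)

        # Forward pass: extract features and reconstruct the target frame.
        class_logits, recon_frame = model(target_frame, ref_frames)
        loss = mixed_loss(class_logits, kmeans_labels, recon_frame, target_frame)

        # Backward pass and gradient descent update of theta.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        n += 1
    return model
```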
Step 3: Construct and train the side output edge detection network model Q:

(3a) Construct an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence. The side output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a 1×1 kernel and one output channel; the side output edge fusion layer SOFL is a convolutional layer with a 1×1 kernel and one channel.

(3b) Define the loss function of the side output edge detection network model Q:

L_edge = L_side + L_fuse

where L_side denotes the side output edge detection loss, a weighted sum of the per-branch losses in which β_i is the weight coefficient of the i-th side output edge detection branch and l_side^(i) is the loss of the i-th branch's prediction.

The per-branch loss is a class-balanced cross-entropy in which e denotes the ground-truth target edges of the input image, |e^-| denotes the number of edge pixels in the ground truth, |e^+| denotes the number of non-edge pixels in the ground truth, and ω_i denotes the parameters of the corresponding convolutional layer; L_fuse denotes the edge fusion loss.
(3c) Set the maximum number of iterations I; according to the loss function of the side output edge detection network model Q, iteratively train the model Q using the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction neural network model R, and obtain the trained side output edge detection network model Q. This is implemented as follows:

(3c1) Let the maximum number of iterations be I ≥ 150000 and the current iteration be i; set i = 1 to initialize the iteration count.

(3c2) Forward-propagate the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction network model as the input of the side output edge detection network:

(3c3) The side output edge detection layer extracts the coarse edges of the target from the feature map set, yielding the coarse edge corresponding to each feature map.

(3c4) The set of coarse edges output by the side output edge detection layer SODL is used as the input of the side output edge fusion layer SOFL, which performs weighted fusion of the coarse edges to obtain the final predicted edge, where the fused feature map is formed by merging the coarse edges and ω_fuse denotes the parameters of the side output edge fusion layer.

(3c5) Using the loss function L_edge, compute the loss value of the edge detection network from the side output edge detection loss L_side and the side output edge fusion loss L_fuse; compute the network parameter gradient g(ω) by back-propagation, and then update the network parameters ω by the gradient descent method.

(3c6) Check whether i = I. If so, the trained side output edge detection network model Q is obtained; otherwise set i = i + 1 and return to step (3c2).
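The following is a minimal sketch of what the SODL and SOFL layers described above could look like in PyTorch. The input channel counts, the upsampling factors, and the use of ConvTranspose2d for the deconvolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SideOutputDetection(nn.Module):
    """SODL sketch: deconvolution followed by a 1x1 convolution with one output channel."""
    def __init__(self, in_channels, upsample_factor):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, in_channels,
                                         kernel_size=upsample_factor,
                                         stride=upsample_factor)
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, feat):
        return self.score(self.deconv(feat))      # coarse edge map, 1 channel

class SideOutputFusion(nn.Module):
    """SOFL sketch: a 1x1 convolution that weights and merges the stacked coarse edges."""
    def __init__(self, num_side_outputs):
        super().__init__()
        self.fuse = nn.Conv2d(num_side_outputs, 1, kernel_size=1)

    def forward(self, coarse_edges):
        stacked = torch.cat(coarse_edges, dim=1)  # (B, num_side_outputs, H, W)
        return self.fuse(stacked)                 # fused edge prediction
```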
Step 4: Construct and train the edge correction network model Z:

(4a) Connect in sequence an atrous spatial pyramid pooling model F_γ and a softmax activation output layer, where the atrous spatial pyramid pooling model F_γ consists of multiple convolutional layers and pooling layers connected in sequence, to obtain the edge correction network model Z.

(4b) Define the loss function of the edge correction network model Z:

In L_corr, one quantity is the coarse segmentation of the target frame output by the preceding stage, another is the prediction of the atrous spatial pyramid pooling model F_γ, another denotes the image edges obtained by the Canny algorithm, M denotes the number of pixel classes in the mask, and the remaining quantity denotes the total number of pixels in the mask.
(4c) Set the maximum number of iterations H; according to the loss function of the edge correction network model Z, iteratively train the model Z using the outputs of the image reconstruction network model R and the edge detection network model Q, and obtain the trained edge correction network model Z. This is implemented as follows:

(4c1) Let the maximum number of iterations be H ≥ 150000 and the current iteration be h; set h = 1 to initialize the iteration count.

(4c2) Forward-propagate the coarse target-frame segmentation output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q as the inputs of the edge correction network model Z:

(4c2.1) The edge correction network first concatenates the coarse target-frame segmentation and the edge detection result along the channel dimension to obtain a feature map of size H×W×(K+1).

(4c2.2) The feature map is used as the input of the atrous spatial pyramid pooling model F_γ to obtain the prediction with an enlarged receptive field.

(4c2.3) The enlarged-receptive-field prediction is used as the input of the softmax activation output layer; the segmentation label of each pixel is decided according to the probability of each class at that pixel of the feature map, so that a more accurate target segmentation mask is obtained from the target-frame segmentation mask after edge fusion correction, where O_t denotes the predicted segmentation label of the target frame I_t.

(4c3) Using the loss function L_corr, compute the loss value of the edge correction network; compute the network parameter gradient g(c) by back-propagation, and then update the network parameters c by the gradient descent method.

(4c4) Check whether h = H. If so, the trained edge correction network model Z is obtained; otherwise set h = h + 1 and return to step (4c2).
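Below is a sketch of the forward pass in steps (4c2.1)–(4c2.3), with an assumed aspp_model module standing in for F_γ; the channel counts and the class count K are illustrative only.

```python
import torch
import torch.nn.functional as F

def edge_correction_forward(coarse_seg, edge_map, aspp_model):
    """Fuse a coarse segmentation with a predicted edge map and refine it (sketch).

    coarse_seg: (B, K, H, W)  per-class coarse segmentation from the reconstruction net
    edge_map  : (B, 1, H, W)  edge prediction from the side output edge detection net
    aspp_model: module mapping (B, K+1, H, W) -> (B, K, H, W) refined logits
    """
    # (4c2.1) channel-wise concatenation -> H x W x (K+1) feature map
    fused = torch.cat([coarse_seg, edge_map], dim=1)
    # (4c2.2) enlarged-receptive-field prediction via the ASPP model F_gamma
    logits = aspp_model(fused)
    # (4c2.3) softmax over classes, then take the most probable class per pixel
    probs = F.softmax(logits, dim=1)
    refined_mask = probs.argmax(dim=1)            # (B, H, W) predicted labels O_t
    return refined_mask, probs
```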
Step 5: Combine the trained image reconstruction neural network R, the side output edge detection network Q, and the edge correction network model Z to obtain the video target segmentation model whose segmentation result is corrected by image target edges. Specifically, the combination is performed as follows: the intermediate feature maps extracted by the image reconstruction neural network R are input to the side output edge detection network Q to obtain the target edge prediction map; the target segmentation mask prediction output by the image reconstruction neural network R and the target edge prediction output by the side output edge detection network Q are used as the inputs of the edge correction network model Z, yielding the trained video target segmentation model based on image-target-edge-corrected segmentation results.

Step 6: Obtain the self-supervised video target segmentation result:

The frame images in the test set V_test are forward-propagated as the input of the video target segmentation model to obtain the predicted segmentation labels of all test frame images, and the final segmentation result maps are obtained from these predicted labels.
Embodiment 2: The overall steps of this embodiment are the same as those of Embodiment 1; specific values are given for some of the parameters to further describe the implementation of the present invention.

Step 1) Obtain the training sample set, verification sample set, and test sample set:

Step 1a) Obtain S multi-category video sequences from a video target segmentation dataset and preprocess them to obtain the frame sequence set V. In this embodiment, the multi-category video sequences are obtained from the YouTube-VOS dataset, with S = 4453 and M = 50.

Step 1b) Let the number of target categories contained in the frame sequence set V be C = 94, with category set Class = {c_num, 1 ≤ num ≤ C}; targets of several categories may appear in each frame sequence, where c_num denotes the num-th category of target.

Step 1c) Randomly draw more than half of the frame sequences from the frame sequence set V to form the training sample set, where S/2 < N < S. For each frame sequence in the training set, scale every target frame image to be segmented into an image block of size p×p×h and convert the RGB image to a Lab image. From the remaining frame sequences, draw half to form the verification sample set, where J ≤ S/4; the other half forms the test sample set, with T ≤ S/4, and is likewise converted from RGB to Lab.

Set the crop box size to x×y; crop each frame image to be segmented in the training frame sequences to obtain the cropped frame images, normalize the cropped frame images, and let the normalized frame images form the preprocessed training frame sequence, the m-th of which corresponds to the m-th frame sequence of the training sample set.

In this embodiment, x = 256, y = 256, p = 256, and h = 3.
Step 2) Construct the image reconstruction neural network model R:

Step 2a) Construct the structure of the image reconstruction neural network model R:

Build an image reconstruction neural network model consisting of a feature extraction network; the feature extraction network adopts a residual network comprising multiple convolutional layers, multiple pooling layers, multiple residual unit modules, and a single fully connected layer connected in sequence.

The feature extraction network contains 17 convolutional layers and 1 fully connected layer. The 18-layer structure is divided into 5 blocks: conv_1, conv_2, conv_3, conv_4, and conv_5. conv_1 is a convolutional layer with a 7×7 kernel and 64 channels. conv_2 contains two convolutional layers with 3×3 kernels, 64 channels, and stride 1. conv_3 contains two convolutional layers with 3×3 kernels and 128 channels, where the stride of the first convolutional layer is set to 2 and the stride of the second is 1. conv_4 contains two convolutional layers with 3×3 kernels, 256 channels, and stride 1. conv_5 contains two convolutional layers with 3×3 kernels, 512 channels, and stride 1.
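A rough PyTorch sketch of a backbone with this block layout is given below. It follows the kernel sizes, channel widths, and strides listed above, but the residual connections, normalization layers, pooling placement, and the stride of conv_1 are assumptions; the description resembles a ResNet-18-style network without spelling out every detail.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, first_stride=1):
    """Two 3x3 convolutions; only the first may be strided."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=first_stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Backbone sketch following the conv_1 ... conv_5 block layout."""
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.conv_2 = conv_block(64, 64)
        self.conv_3 = conv_block(64, 128, first_stride=2)
        self.conv_4 = conv_block(128, 256)
        self.conv_5 = conv_block(256, 512)

    def forward(self, x):
        feats = []
        for block in (self.conv_1, self.conv_2, self.conv_3, self.conv_4, self.conv_5):
            x = block(x)
            feats.append(x)   # per-layer feature maps for the edge detection network
        return feats
```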
Step 2b) Define the loss function of the image reconstruction neural network model:

L_mix = α·L_cls + (1 − α)·L_reg

where L_cls denotes the cross-entropy loss of the quantized image reconstruction task: for the training sample set V_train, E cluster centroids {μ_1, μ_2, ..., μ_E} are selected with E ≤ 50, each training sample is assigned to the class of its nearest centroid, and the centroid positions are corrected so that the same target carries the same label across frames and different targets carry different labels; one term of L_cls denotes the class of the i-th pixel of a given frame image I_t, and the other denotes the prediction of the K-means algorithm. L_reg denotes the regression loss of the RGB image reconstruction task, computed between the real target-frame pixels and the reconstructed target-frame pixels. α is a weight coefficient with 0.1 ≤ α ≤ 0.9.

In this embodiment, K = 16 and α = 0.6.
Step 3) Iteratively train the image reconstruction neural network model:

Step 3a) The hyperparameters of the feature extraction network are θ, the iteration counter is n, and the maximum number of iterations is N with N ≥ 150000; set n = 1.

In this embodiment, N = 300000 is used so that the model is trained more thoroughly.

Step 3b) Forward-propagate the target frame images in the training sample set V_train as the input of the image reconstruction neural network model R:

Step 3b1) For each target frame I_t to be segmented, select the q frames preceding it as reference frames {I′_0, I′_1, ..., I′_q}, where 2 ≤ q ≤ 5. The target frame I_t and its corresponding set of reference frames are input to the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame, yielding the target-frame feature f_t = Φ(I_t; θ) and the reference-frame features f′_0 = Φ(I′_0; θ), ..., f′_q = Φ(I′_q; θ). The target frames {I_t | 0 ≤ t ≤ N} of the training sample set are input to the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame and the real target frame I_t are input to the RGB image reconstruction task to obtain the RGB reconstruction loss value L_reg.

Step 3c) Using the loss function L_mix, compute the loss value of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg; compute the network parameter gradient g(θ) by back-propagation, and then update the network parameters θ by the gradient descent method with the update formula:

θ′ = θ_n − γ · ∂L_mix^(n)/∂θ_n

where θ′ denotes the updated value of θ_n, γ denotes the learning rate with 1e-6 ≤ γ ≤ 1e-3, L_mix^(n) denotes the loss function value of the image reconstruction neural network after the n-th iteration, and ∂ denotes the partial derivative.

In this embodiment, the initial learning rate is γ = 0.001; at the 150,000-th iteration the learning rate becomes γ = 0.0005, at the 200,000-th iteration γ = 0.00025, and at the 250,000-th iteration γ = 0.000125. The Adam optimizer is used, and the learning rate is decayed after a certain number of iterations to prevent the loss function from falling into a local minimum.
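The step-wise learning-rate decay described here corresponds to a standard multi-step schedule; one possible PyTorch formulation is sketched below, with the optimizer applied to a placeholder module. This illustrates the schedule only, not the patent's exact training code.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # placeholder for the network

# Adam optimizer with the initial learning rate gamma = 0.001.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate at iterations 150k, 200k and 250k:
# 0.001 -> 0.0005 -> 0.00025 -> 0.000125.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150_000, 200_000, 250_000], gamma=0.5)

# Inside the training loop, step the scheduler once per iteration:
#   optimizer.step()
#   scheduler.step()
```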
Step 3d) Check whether n = N. If so, the trained image reconstruction neural network R is obtained; otherwise set n = n + 1 and execute step (3b).

Step 4) Construct the side output edge detection network model Q:

Step 4a) Construct the structure of the side output edge detection network model Q:

Construct an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence. The side output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a 1×1 kernel and one output channel; the side output edge fusion layer SOFL is a convolutional layer with a 1×1 kernel and one channel.

Step 4b) Define the loss function of the side output edge detection network model:

L_edge = L_side + L_fuse

where L_side denotes the side output edge detection loss, a weighted sum of the per-branch losses in which β_i is the weight coefficient of the i-th side output edge detection branch and l_side^(i) is the loss of the i-th branch's prediction.

The per-branch loss is a class-balanced cross-entropy in which e denotes the ground-truth target edges of the input image, |e^-| denotes the number of edge pixels in the ground truth, |e^+| denotes the number of non-edge pixels in the ground truth, and ω_i denotes the parameters of the corresponding convolutional layer; L_fuse denotes the edge fusion loss.
Step 5) Iteratively train the side output edge detection network model Q:

Step 5a) Initialize the iteration counter i and the maximum number of iterations I with I ≥ 150000; set i = 1.

In this embodiment, I = 300000 is used so that the model is trained more thoroughly.

Step 5b) Forward-propagate the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction network model as the input of the side output edge detection network:

Step 5b1) The side output edge detection layer extracts the coarse edges of the target from the feature map set, yielding the coarse edge corresponding to each feature map.

Step 5b2) The set of coarse edges output by the side output edge detection layer SODL is used as the input of the side output edge fusion layer SOFL, which performs weighted fusion of the coarse edges to obtain the final predicted edge, where the fused feature map is formed by merging the coarse edges and ω_fuse denotes the parameters of the side output edge fusion layer.

Step 5c) Using the loss function L_edge, compute the loss value of the edge detection network from the side output edge detection loss L_side and the side output edge fusion loss L_fuse; compute the network parameter gradient g(ω) by back-propagation, and then update the network parameters ω by the gradient descent method with the update formula:

ω′ = ω_i − β · ∂L_edge^(i)/∂ω_i

where ω′ denotes the updated value of ω_i, β denotes the learning rate with 1e-6 ≤ β ≤ 1e-3, L_edge^(i) denotes the loss function value of the side output edge detection network after the i-th iteration, and ∂ denotes the partial derivative.
In this embodiment, the initial learning rate is β = 0.001; at the 150,000-th iteration the learning rate becomes β = 0.0005, at the 200,000-th iteration β = 0.00025, and at the 250,000-th iteration β = 0.000125. The Adam optimizer is used, and the learning rate is decayed after a certain number of iterations to prevent the loss function from falling into a local minimum.

Step 5d) Check whether i = I. If so, the trained side output edge detection network model Q is obtained; otherwise set i = i + 1 and execute step (5b).

Step 6) Construct the edge correction network model Z:

Step 6a) Construct the structure of the edge correction network model Z:

Construct an edge correction network model Z comprising an atrous spatial pyramid pooling model F_γ and a softmax activation output layer connected in sequence, where the atrous spatial pyramid pooling model F_γ consists of multiple convolutional layers and pooling layers connected in sequence.

The atrous spatial pyramid pooling model F_γ comprises a convolutional layer, a pooling pyramid, and a pooling block. The convolutional layer has a 1×1 kernel; the pooling pyramid comprises three convolutional layers connected in parallel, each with a 3×3 kernel; the pooling block comprises a 1×1 pooling layer, a convolutional layer with a 1×1 kernel, and an upsampling operation. The feature maps output by the convolutional layer, the pooling pyramid, and the pooling block are concatenated and passed through another 1×1 convolutional layer to obtain the output of the atrous spatial pyramid pooling model F_γ.
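A minimal sketch of an ASPP-style module matching this description is shown below. The atrous (dilation) rates, channel counts, and the use of adaptive average pooling for the pooling block are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """ASPP sketch: 1x1 conv + three dilated 3x3 convs + an image-level pooling
    branch, concatenated and fused by a final 1x1 conv."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.pyramid = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.pool_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        branches = [self.conv1x1(x)] + [conv(x) for conv in self.pyramid]
        pooled = F.interpolate(self.pool_branch(x), size=(h, w), mode='bilinear',
                               align_corners=False)
        branches.append(pooled)                 # upsample the pooled branch back
        return self.project(torch.cat(branches, dim=1))
```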
Step 6b) Define the loss function of the edge correction network model Z:

In L_corr, one quantity is the coarse segmentation of the target frame output by the preceding stage, another is the prediction of the atrous spatial pyramid pooling model F_γ, another denotes the image edges obtained by the Canny algorithm, M denotes the number of pixel classes in the mask, and the remaining quantity denotes the total number of pixels in the mask.
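The Canny edges used here as a supervision signal can be obtained with OpenCV as sketched below; the threshold values are arbitrary placeholders chosen for the example.

```python
import cv2

def canny_edges(frame_bgr, low=100, high=200):
    """Compute a binary edge map for a frame with the Canny detector (sketch)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)    # uint8 map, 255 on edge pixels
    return (edges > 0).astype('float32')  # 1.0 for edge pixels, 0.0 elsewhere
```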
Step 7) Iteratively train the edge correction network model Z:

Step 7a) Initialize the iteration counter h and the maximum number of iterations H with H ≥ 150000; set h = 1.

In this embodiment, H = 300000 is used so that the model is trained more thoroughly.

Step 7b) Forward-propagate the coarse target-frame segmentation output by the image reconstruction network model and the edge detection result output by the edge detection network model as the inputs of the edge correction network model Z:

Step 7b1) The edge correction network first concatenates the coarse target-frame segmentation and the edge detection result along the channel dimension to obtain a feature map of size H×W×(K+1).

Step 7b2) The feature map is used as the input of the atrous spatial pyramid pooling model F_γ to obtain the prediction with an enlarged receptive field.

Step 7b3) The enlarged-receptive-field prediction is used as the input of the softmax activation output layer; the segmentation label of each pixel is decided according to the probability of each class at that pixel position of the feature map, so that a more accurate target segmentation mask is obtained from the target-frame segmentation mask after edge fusion correction, where O_t denotes the predicted segmentation label of the target frame I_t.

Step 7c) Using the loss function L_corr, compute the loss value of the edge correction network; compute the network parameter gradient g(c) by back-propagation, and then update the network parameters c by the gradient descent method with the update formula:

c′ = c_h − α · ∂L_corr^(h)/∂c_h

where c′ denotes the updated value of c_h, α denotes the learning rate with 1e-6 ≤ α ≤ 1e-3, L_corr^(h) denotes the loss function value of the edge correction network after the h-th iteration, and ∂ denotes the partial derivative.
In this embodiment, the initial learning rate is α = 0.001; at the 150,000th iteration the learning rate is reduced to α = 0.0005, at the 200,000th iteration to α = 0.00025, and at the 250,000th iteration to α = 0.000125. The Adam optimizer is used, and the learning rate is decayed after a given number of iterations to prevent the loss function from getting stuck in a local minimum;
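In PyTorch terms, this embodiment's Adam update with the step-wise learning-rate decay can be sketched roughly as follows; the dummy network and the random data generator are placeholders standing in for the edge correction network Z and the pre-computed coarse-segmentation/edge-map inputs.

```python
import torch
import torch.nn as nn

# Placeholder network and data stream; in the described method these would be the
# edge correction network Z and its pre-computed coarse-segmentation/edge inputs.
model = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 2, 1))
criterion = nn.CrossEntropyLoss()

def batches():                                   # dummy data generator (assumption)
    while True:
        yield torch.randn(2, 4, 64, 64), torch.randint(0, 2, (2, 64, 64))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)           # alpha = 0.001
scheduler = torch.optim.lr_scheduler.MultiStepLR(                   # halve the rate at
    optimizer, milestones=[150_000, 200_000, 250_000], gamma=0.5)   # 150k/200k/250k steps

data = batches()
for h in range(1, 300_000 + 1):                  # H = 300000 iterations
    inputs, labels = next(data)
    loss = criterion(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()      # back-propagation computes the parameter gradient g(c)
    optimizer.step()     # Adam update of the parameters c
    scheduler.step()     # decay alpha: 0.001 -> 0.0005 -> 0.00025 -> 0.000125
```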
Step 7d) Judge whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise, set h = h + 1 and return to step (7b);
Step 8) Obtain the self-supervised video object segmentation results:
The frame images in the test set are fed, for forward propagation, into the trained video object segmentation model based on edge-corrected image segmentation results, which is composed of the image reconstruction neural network R, the side-output edge detection network Q and the edge fusion network Z; the segmentation labels of all test frames are obtained, and the segmentation result maps are determined from these labels.
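Step 8 can be sketched as a simple inference loop over the test frames; the call signatures of R, Q and Z below are simplifying assumptions (in particular, the reference-frame handling of the reconstruction network is omitted).

```python
import torch

@torch.no_grad()
def segment_test_frames(frames, R, Q, Z):
    """Sketch of step 8: run each test frame through the trained image
    reconstruction network R (coarse mask), the side-output edge detection
    network Q (edge map) and the edge fusion network Z, keeping the per-pixel
    segmentation labels."""
    results = []
    for frame in frames:                    # frames: iterable of (1, C, H, W) tensors
        coarse = R(frame)                   # coarse target-frame segmentation
        edges = Q(frame)                    # side-output edge detection result
        _, labels = Z(coarse, edges)        # edge-fusion corrected prediction
        results.append(labels.squeeze(0))   # (H, W) segmentation label map
    return results
```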
The technical effect of the present invention is further illustrated below through simulation experiments:
1. Simulation conditions and contents:
4453 video sequences were obtained from the YouTube-VOS dataset for the simulation experiments;
The simulation experiments were run on a server with an Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz, 64 GB of RAM and an NVIDIA GeForce RTX 2080 Ti GPU; the operating system was Ubuntu 16.04, the deep learning framework was PyTorch, and the programming language was Python 3.6;
The present invention was compared in simulation with an existing video object segmentation method. To compare the video object segmentation results quantitatively, two evaluation metrics were adopted: region similarity J and contour similarity F; higher values of both metrics indicate better segmentation results. The simulation results are shown in Table 1.
Table 1
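For reference, the region similarity J reported in Table 1 is conventionally computed as the Jaccard index (intersection over union) between the predicted and ground-truth masks; a minimal sketch of that computation is shown below (the handling of the empty-mask case is an assumption).

```python
import numpy as np

def region_similarity_j(pred_mask, gt_mask):
    """Jaccard index (intersection over union) between two binary masks -- the
    usual definition of the region similarity J in video object segmentation."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                # both masks empty: treat as a perfect match
        return 1.0
    return float(np.logical_and(pred, gt).sum()) / float(union)
```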
2. Analysis of simulation results:
As can be seen from Table 1, the present invention improves both the J and F metrics markedly compared with the existing video segmentation method, indicating that the self-supervised video object segmentation technique constructed by the present invention can effectively address problems such as object occlusion and tracking drift, thereby improving video object segmentation accuracy; it is therefore of significant practical importance and value.
The above simulation analysis demonstrates the correctness and effectiveness of the method proposed by the present invention.
Parts of the present invention that are not described in detail belong to the common knowledge of those skilled in the art.
The above is only a preferred embodiment of the present invention and is not intended to limit it; it will be apparent to those skilled in the art that, after understanding the content and principles of the present invention, various modifications and changes in form and detail may be made without departing from the principles and structure of the invention, and such modifications and changes based on the idea of the present invention remain within the scope of protection of the claims of the present invention.