
CN112508959A - Video object segmentation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112508959A
CN112508959A (application CN202011480916.3A)
Authority
CN
China
Prior art keywords
target
sample
frame
feature
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011480916.3A
Other languages
Chinese (zh)
Other versions
CN112508959B (en)
Inventor
韦乔乔
张慧
黄慧娟
宋丛礼
郑文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Tsinghua University
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011480916.3A priority Critical patent/CN112508959B/en
Publication of CN112508959A publication Critical patent/CN112508959A/en
Application granted granted Critical
Publication of CN112508959B publication Critical patent/CN112508959B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present disclosure relates to a video object segmentation method and apparatus, an electronic device, and a storage medium, in the field of computer vision. The method includes: determining a first target feature of a first target frame, a first reference feature of a first reference frame, and a reference segmentation map of the first reference frame; weighting the first target feature based on a first non-local attention of the first target frame to obtain a second target feature; weighting the first reference feature based on a second non-local attention of the first reference frame to obtain a second reference feature; determining first offset information based on the second target feature and the second reference feature; performing offset processing on the second reference feature based on the first offset information to obtain a shifted second reference feature, and determining a first local attention between the shifted second reference feature and the second target feature; and inputting the reference segmentation map and the first local attention into a target segmentation model to obtain a first target segmentation map of the first target frame. This scheme improves the accuracy of video object segmentation.

Figure 202011480916


Description

Video object segmentation method, apparatus, electronic device, and storage medium

TECHNICAL FIELD

The present disclosure relates to the field of computer vision, and in particular, to a video object segmentation method, apparatus, electronic device, and storage medium.

BACKGROUND

In video processing scenarios such as video editing and video surveillance, it is sometimes necessary to perform object segmentation on a video, that is, to obtain a segmentation map of the target foreground in each frame of the video. Object segmentation is generally performed on the video through an object segmentation model.

In the related art, the process of performing object segmentation on a video with an object segmentation model is as follows: a target frame, a reference frame, and the segmentation map of the reference frame are obtained from the video to be segmented; for each first pixel in the target frame, the second pixel with the greatest similarity to the first pixel is determined from the reference frame, and the segmentation map of the target frame is obtained through the object segmentation model based on the first pixel, the second pixel, and the reference segmentation map. When determining the second pixel, in order to reduce the amount of computation, a series of pixels is generally sampled at equal intervals within a fixed-size neighborhood in the reference frame centered on the pixel position of the first pixel, and these sampled pixels are taken as candidate second pixels.

In the above method, when the time interval between the reference frame and the target frame is too large, the fixed sampling interval may not match the displacement of the first pixel over that interval. That is, the second pixel determined from the fixed sampling grid may not be the pixel with the greatest similarity to the first pixel, so the segmentation map of the target frame obtained from the first pixel, the second pixel, and the reference segmentation map is inaccurate, which lowers the accuracy of video object segmentation.
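The fixed-interval neighborhood sampling described above can be sketched as follows. This is a hypothetical NumPy illustration under our own assumptions (the function name `sample_neighborhood` and its parameters are not from the patent), not the related art's actual implementation:

```python
import numpy as np

def sample_neighborhood(center, radius, stride):
    """Sample candidate pixel positions at equal intervals (stride) inside a
    square neighborhood of the given radius around `center` (y, x).

    Because the stride is fixed, a fast-moving object whose true displacement
    differs from every sampled offset can fall outside all candidate
    positions -- the failure mode the disclosure addresses for large
    frame-to-frame time gaps.
    """
    offsets = np.arange(-radius, radius + 1, stride)
    ys, xs = np.meshgrid(offsets, offsets, indexing="ij")
    return np.stack([center[0] + ys.ravel(), center[1] + xs.ravel()], axis=1)

# Candidate positions for the pixel at (10, 10), radius 4, stride 2:
candidates = sample_neighborhood((10, 10), radius=4, stride=2)
print(candidates.shape)  # (25, 2): a 5x5 grid of candidate positions
```

With stride 2 and radius 4 the grid can only represent displacements that are multiples of 2 within 4 pixels, which illustrates why the candidate set may miss the truly most similar pixel when the actual motion is larger.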

SUMMARY OF THE INVENTION

The present disclosure provides a video object segmentation method, apparatus, electronic device, and storage medium that can improve the accuracy of video object segmentation. The technical solutions of the present disclosure are as follows:

According to a first aspect of the embodiments of the present disclosure, a video object segmentation method is provided, including:

determining a first target feature of a first target frame, a first reference feature of a first reference frame, and a reference segmentation map of the first reference frame, where the first target frame is a video frame in the video to be segmented on which object segmentation has not yet been performed, the first reference frame is a video frame in the video on which object segmentation has been performed, and the interval between the first reference frame and the first target frame is greater than a first preset duration;

weighting the first target feature based on a first non-local attention of the first target frame to obtain a second target feature, and weighting the first reference feature based on a second non-local attention of the first reference frame to obtain a second reference feature;

determining first offset information based on the second target feature and the second reference feature, where the first offset information represents the offset between a first pixel and a position-corresponding second pixel, the first pixel being a pixel in the first target frame and the second pixel being a pixel in the first reference frame;

performing offset processing on the second reference feature based on the first offset information to obtain a shifted second reference feature, and determining a first local attention between the shifted second reference feature and the second target feature;

inputting the reference segmentation map and the first local attention into a target segmentation model to obtain a first target segmentation map of the first target frame.
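The offset processing and local-attention steps above can be sketched in NumPy as follows. This is a simplified illustration under our own assumptions (integer per-pixel offsets instead of possible sub-pixel sampling, plain dot-product similarity; the names `shift_features` and `local_attention` are hypothetical), not the patent's implementation:

```python
import numpy as np

def shift_features(feat, offsets):
    """Shift each spatial position of a (C, H, W) feature map by its
    per-pixel integer offset (2, H, W), clamping at the border. A stand-in
    for the offset processing of the second reference feature."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = np.clip(ys + offsets[0], 0, H - 1).astype(int)
    sx = np.clip(xs + offsets[1], 0, W - 1).astype(int)
    return feat[:, sy, sx]

def local_attention(target, shifted_ref, radius=1):
    """Dot-product similarity between each target-frame feature vector and
    the shifted reference features inside a (2*radius+1)^2 local window."""
    C, H, W = target.shape
    k = 2 * radius + 1
    pad = np.pad(shifted_ref, ((0, 0), (radius, radius), (radius, radius)))
    att = np.empty((k * k, H, W))
    idx = 0
    for dy in range(k):
        for dx in range(k):
            att[idx] = (target * pad[:, dy:dy + H, dx:dx + W]).sum(axis=0)
            idx += 1
    return att

rng = np.random.default_rng(0)
t = rng.standard_normal((8, 6, 6))    # second target feature
r = rng.standard_normal((8, 6, 6))    # second reference feature
off = np.zeros((2, 6, 6), dtype=int)  # predicted per-pixel offsets
att = local_attention(t, shift_features(r, off), radius=1)
print(att.shape)  # (9, 6, 6): one similarity per window position per pixel
```

Because the reference feature is shifted by the predicted offsets before the local window is compared, the small window can still cover the correct correspondence even when the true motion between the frames is large.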

In some embodiments, determining the first offset information based on the second target feature and the second reference feature includes:

concatenating the second target feature and the second reference feature to obtain a first feature;

inputting the first feature into an offset prediction model to obtain the first offset information.
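A minimal sketch of this concatenate-and-predict step, assuming for illustration that the offset prediction model reduces to a per-pixel 1x1 convolution; the real offset prediction model in the disclosure is a learned network, and `w` and `b` merely stand in for its parameters:

```python
import numpy as np

def predict_offsets(target_feat, ref_feat, w, b):
    """Concatenate target and reference features along the channel axis and
    map them to a 2-channel offset field (dy, dx) with a per-pixel linear
    layer, i.e. a 1x1 convolution written as a channel-wise matrix multiply."""
    x = np.concatenate([target_feat, ref_feat], axis=0)   # (2C, H, W)
    out = np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]
    return out                                            # (2, H, W)

rng = np.random.default_rng(1)
t = rng.standard_normal((16, 8, 8))   # second target feature
r = rng.standard_normal((16, 8, 8))   # second reference feature
w = rng.standard_normal((2, 32)) * 0.1  # stand-in model weights
b = np.zeros(2)
off = predict_offsets(t, r, w, b)
print(off.shape)  # (2, 8, 8)
```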

In some embodiments, inputting the reference segmentation map and the first local attention into the target segmentation model to obtain the first target segmentation map of the first target frame includes:

performing, through the target segmentation model, sliding-window extraction on each third pixel in the reference segmentation map to obtain a plurality of first neighborhood maps corresponding to each third pixel, and splicing the plurality of first neighborhood maps to obtain a sliding-window-extracted reference segmentation map;

performing dimension transformation on the sliding-window-extracted reference segmentation map and the first local attention respectively to obtain a dimension-transformed reference segmentation map and a dimension-transformed first local attention, the dimension-transformed reference segmentation map having the same dimensions as the dimension-transformed first local attention;

determining the first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention.

In some embodiments, determining the first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention includes:

for each fourth pixel in the dimension-transformed reference segmentation map, determining a plurality of similarities from the dimension-transformed first local attention, the plurality of similarities being the similarities between the fourth pixel and other pixels;

performing a weighted summation on the pixel values of the fourth pixel based on the plurality of similarities to obtain a weighted-sum pixel value;

modifying the pixel value of the fourth pixel to the weighted-sum pixel value to obtain the first target segmentation map of the first target frame.
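The sliding-window extraction and similarity-weighted summation above can be sketched as follows, assuming a single-channel mask and softmax-normalized similarities (both our own simplifications; the name `propagate_mask` is hypothetical):

```python
import numpy as np

def propagate_mask(ref_mask, local_att, radius=1):
    """Propagate a reference segmentation map to the target frame: for each
    target pixel, softmax-normalize the local-attention similarities to the
    (2*radius+1)^2 surrounding reference pixels and compute the weighted sum
    of their mask values."""
    H, W = ref_mask.shape
    k = 2 * radius + 1
    pad = np.pad(ref_mask, radius)
    # Sliding-window extraction: stack the k*k shifted copies of the mask.
    windows = np.stack([pad[dy:dy + H, dx:dx + W]
                        for dy in range(k) for dx in range(k)])  # (k*k, H, W)
    w = np.exp(local_att - local_att.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)    # normalize over the window axis
    return (w * windows).sum(axis=0)     # (H, W) target segmentation map

rng = np.random.default_rng(2)
mask = (rng.random((6, 6)) > 0.5).astype(float)  # reference segmentation map
att = rng.standard_normal((9, 6, 6))             # first local attention
out = propagate_mask(mask, att, radius=1)
print(out.shape)  # (6, 6)
```

Since the weights sum to one at every pixel, the propagated values remain within the [0, 1] range of the binary reference mask.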

In some embodiments, weighting the first target feature based on the first non-local attention of the first target frame to obtain the second target feature includes:

normalizing the first non-local attention to obtain a first weight;

performing matrix multiplication on the first weight and the first target feature through the target segmentation model to obtain a third target feature;

superimposing the third target feature on the first target feature to obtain the second target feature.
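The normalize / matrix-multiply / superimpose sequence above can be sketched in NumPy as follows. This is an illustrative self-attention variant under our own assumptions (dot-product similarities and softmax normalization; `nonlocal_weighting` is a hypothetical name), not the patent's exact formulation:

```python
import numpy as np

def nonlocal_weighting(feat):
    """Weight a (C, H*W) feature map with its own non-local attention:
    normalize the pairwise similarity matrix (softmax), multiply it back onto
    the features, then superimpose the result on the original features."""
    sim = feat.T @ feat                            # (HW, HW) similarities
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    weight = sim / sim.sum(axis=1, keepdims=True)  # normalized first weight
    weighted = feat @ weight.T                     # third feature, (C, HW)
    return feat + weighted                         # second feature

rng = np.random.default_rng(3)
f = rng.standard_normal((4, 9))  # C=4 channels, 3x3 spatial grid flattened
out = nonlocal_weighting(f)
print(out.shape)  # (4, 9)
```

The residual superposition (`feat + weighted`) preserves the original feature while letting every position aggregate context from the whole frame, which is what makes the attention "non-local".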

In some embodiments, the method further includes:

in the case that the interval between the first reference frame and the first target frame is not greater than the first preset duration, inputting a second local attention between the first reference feature and the first target feature, together with the reference segmentation map, into the target segmentation model to obtain the first target segmentation map of the first target frame.

In some embodiments, the target segmentation model is trained as follows:

obtaining, from a sample video, a sample target frame and a first sample reference frame on which object segmentation has not been performed, and inputting the sample target frame and the first sample reference frame into an initial target segmentation model to obtain a first sample target feature of the sample target frame and a first sample reference feature of the first sample reference frame, the interval between the sample target frame and the first sample reference frame being greater than a second preset duration and less than a third preset duration;

weighting the first sample target feature based on a third non-local attention of the sample target frame to obtain a second sample target feature, and weighting the first sample reference feature based on a fourth non-local attention of the first sample reference frame to obtain a second sample reference feature;

determining second offset information based on the second sample target feature and the second sample reference feature, where the second offset information represents the offset between a first sample pixel and a position-corresponding second sample pixel, the first sample pixel being a pixel in the sample target frame and the second sample pixel being a pixel in the first sample reference frame;

performing offset processing on the second sample reference feature based on the second offset information to obtain a shifted second sample reference feature;

training the initial target segmentation model based on the first sample reference frame and a third local attention to obtain the target segmentation model, the third local attention being the local attention between the shifted second sample reference feature and the second sample target feature.

In some embodiments, training the initial target segmentation model based on the first sample reference frame and the third local attention to obtain the target segmentation model includes:

performing, through the initial target segmentation model, sliding-window extraction on each second sample pixel in the first sample reference frame to obtain a plurality of second neighborhood maps corresponding to each second sample pixel, and splicing the plurality of second neighborhood maps to obtain a second sample reference frame;

performing dimension transformation on the second sample reference frame and the third local attention respectively to obtain a dimension-transformed second sample reference frame and a dimension-transformed third local attention, the dimension-transformed second sample reference frame having the same dimensions as the dimension-transformed third local attention;

training the initial target segmentation model based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model.

In some embodiments, training the initial target segmentation model based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model includes:

for each third sample pixel in the dimension-transformed second sample reference frame, determining a plurality of similarities from the dimension-transformed third local attention, the plurality of similarities being the similarities between the third sample pixel and other pixels;

performing a weighted summation on the pixel values of the third sample pixel based on the plurality of similarities to obtain a weighted-sum pixel value;

modifying the pixel value of the third sample pixel to the weighted-sum pixel value to obtain a second target segmentation map of the sample target frame;

inputting the sample target frame into the initial target segmentation model, and iteratively updating the model parameters of the initial target segmentation model based on the loss value between the output of the initial target segmentation model and the second target segmentation map, to obtain the target segmentation model.
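The iterative parameter update above can be sketched with a deliberately toy example. The real model is a deep segmentation network trained on a loss between its output and the propagated second target segmentation map; here a single scalar weight and a mean-squared loss stand in for both (everything below is our own illustrative assumption):

```python
import numpy as np

def train_step(params, x, target_mask, lr=0.1):
    """One iterative update: compare the stand-in model's output against the
    target segmentation map with a mean-squared loss and descend its
    gradient with respect to the scalar parameter."""
    pred = params * x                           # toy "segmentation model"
    loss = np.mean((pred - target_mask) ** 2)
    grad = np.mean(2 * (pred - target_mask) * x)
    return params - lr * grad, loss

rng = np.random.default_rng(4)
x = rng.random((8, 8))        # stand-in sample target frame features
target = 0.7 * x              # stand-in second target segmentation map
p, losses = 0.0, []
for _ in range(50):
    p, loss = train_step(p, x, target)
    losses.append(loss)
print(round(p, 2))            # approaches the generating weight 0.7
```

The loop is the essence of the training step: forward pass, loss against the supervision signal constructed from the propagated mask, and a gradient update, repeated until convergence.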

According to a second aspect of the embodiments of the present disclosure, a video object segmentation apparatus is provided, including:

a first determining unit, configured to determine a first target feature of a first target frame, a first reference feature of a first reference frame, and a reference segmentation map of the first reference frame, where the first target frame is a video frame in the video to be segmented on which object segmentation has not yet been performed, the first reference frame is a video frame in the video on which object segmentation has been performed, and the interval between the first reference frame and the first target frame is greater than a first preset duration;

a first weighting unit, configured to weight the first target feature based on a first non-local attention of the first target frame to obtain a second target feature, and to weight the first reference feature based on a second non-local attention of the first reference frame to obtain a second reference feature;

a second determining unit, configured to determine first offset information based on the second target feature and the second reference feature, where the first offset information represents the offset between a first pixel and a position-corresponding second pixel, the first pixel being a pixel in the first target frame and the second pixel being a pixel in the first reference frame;

a first offset unit, configured to perform offset processing on the second reference feature based on the first offset information to obtain a shifted second reference feature, and to determine a first local attention between the shifted second reference feature and the second target feature;

a first target segmentation unit, configured to input the reference segmentation map and the first local attention into a target segmentation model to obtain a first target segmentation map of the first target frame.

In some embodiments, the second determining unit is configured to concatenate the second target feature and the second reference feature to obtain a first feature, and to input the first feature into an offset prediction model to obtain the first offset information.

In some embodiments, the first target segmentation unit includes:

a first sliding-window extraction subunit, configured to perform, through the target segmentation model, sliding-window extraction on each third pixel in the reference segmentation map to obtain a plurality of first neighborhood maps corresponding to each third pixel, and to splice the plurality of first neighborhood maps to obtain a sliding-window-extracted reference segmentation map;

a first dimension transformation subunit, configured to perform dimension transformation on the sliding-window-extracted reference segmentation map and the first local attention respectively to obtain a dimension-transformed reference segmentation map and a dimension-transformed first local attention, the dimension-transformed reference segmentation map having the same dimensions as the dimension-transformed first local attention;

a determining subunit, configured to determine the first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention.

In some embodiments, the determining subunit is configured to: for each fourth pixel in the dimension-transformed reference segmentation map, determine a plurality of similarities from the dimension-transformed first local attention, the plurality of similarities being the similarities between the fourth pixel and other pixels; perform a weighted summation on the pixel values of the fourth pixel based on the plurality of similarities to obtain a weighted-sum pixel value; and modify the pixel value of the fourth pixel to the weighted-sum pixel value to obtain the first target segmentation map of the first target frame.

In some embodiments, the first weighting unit is configured to normalize the first non-local attention to obtain a first weight; perform matrix multiplication on the first weight and the first target feature through the target segmentation model to obtain a third target feature; and superimpose the third target feature on the first target feature to obtain the second target feature.

In some embodiments, the apparatus further includes:

a second target segmentation unit, configured to, in the case that the interval between the first reference frame and the first target frame is not greater than the first preset duration, input a second local attention between the first reference feature and the first target feature, together with the reference segmentation map, into the target segmentation model to obtain the first target segmentation map of the first target frame.

In some embodiments, the apparatus further includes:

an obtaining unit, configured to obtain, from a sample video, a sample target frame and a first sample reference frame on which object segmentation has not been performed, and to input the sample target frame and the first sample reference frame into an initial target segmentation model to obtain a first sample target feature of the sample target frame and a first sample reference feature of the first sample reference frame, the interval between the sample target frame and the first sample reference frame being greater than a second preset duration and less than a third preset duration;

a second weighting unit, configured to weight the first sample target feature based on a third non-local attention of the sample target frame to obtain a second sample target feature, and to weight the first sample reference feature based on a fourth non-local attention of the first sample reference frame to obtain a second sample reference feature;

a third determining unit, configured to determine second offset information based on the second sample target feature and the second sample reference feature, where the second offset information represents the offset between a first sample pixel and a position-corresponding second sample pixel, the first sample pixel being a pixel in the sample target frame and the second sample pixel being a pixel in the first sample reference frame;

a second offset unit, configured to perform offset processing on the second sample reference feature based on the second offset information to obtain a shifted second sample reference feature;

a training unit, configured to train the initial target segmentation model based on the first sample reference frame and a third local attention to obtain the target segmentation model, the third local attention being the local attention between the shifted second sample reference feature and the second sample target feature.

In some embodiments, the training unit includes:

a second sliding-window extraction subunit, configured to perform, through the initial target segmentation model, sliding-window extraction on each second sample pixel in the first sample reference frame to obtain a plurality of second neighborhood maps corresponding to each second sample pixel, and to splice the plurality of second neighborhood maps to obtain a second sample reference frame;

a second dimension transformation subunit, configured to perform dimension transformation on the second sample reference frame and the third local attention respectively to obtain a dimension-transformed second sample reference frame and a dimension-transformed third local attention, the dimension-transformed second sample reference frame having the same dimensions as the dimension-transformed third local attention;

a training subunit, configured to train the initial target segmentation model based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model.

In some embodiments, the training subunit is configured to: for each third sample pixel in the dimension-transformed second sample reference frame, determine a plurality of similarities from the dimension-transformed third local attention, the plurality of similarities being the similarities between the third sample pixel and other pixels; perform a weighted summation on the pixel values of the third sample pixel based on the plurality of similarities to obtain a weighted-sum pixel value; modify the pixel value of the third sample pixel to the weighted-sum pixel value to obtain a second target segmentation map of the sample target frame; and input the sample target frame into the initial target segmentation model and iteratively update the model parameters of the initial target segmentation model based on the loss value between the output of the initial target segmentation model and the second target segmentation map, to obtain the target segmentation model.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor and a memory for storing instructions executable by the processor, where the processor is configured to execute the instructions to implement the video object segmentation method of the above embodiments.

According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the video object segmentation method of the above embodiments.

According to a fifth aspect of the embodiments of the present disclosure, a computer program product is provided, whose instructions, when executed by a processor of an electronic device, enable the electronic device to perform the video object segmentation method of the above embodiments.

本公开的实施例提供的技术方案至少带来以下有益效果:The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:

在本公开实施例中,由于视频帧的特征能够体现视频帧的语义信息,因此,基于第一目标帧的目标特征和第一参考帧的参考特征,确定出的第一偏移信息已经融合了两个视频帧的语义信息,这样第一偏移信息就能够准确表示位置对应的每个像素点从第一参考帧到第一目标帧的偏移情况,从而基于该第一偏移信息预测出的第一目标帧的目标分割图较为准确,进而提高了视频目标分割的精度。In the embodiment of the present disclosure, since the feature of the video frame can reflect the semantic information of the video frame, based on the target feature of the first target frame and the reference feature of the first reference frame, the determined first offset information has been fused with Semantic information of two video frames, so that the first offset information can accurately represent the offset of each pixel corresponding to the position from the first reference frame to the first target frame, so as to predict based on the first offset information The target segmentation map of the first target frame is more accurate, thereby improving the accuracy of video target segmentation.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.

Description of the Drawings

The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure, serve together with the description to explain the principles of the present disclosure, and do not unduly limit the present disclosure.

Fig. 1 is a flowchart of a video object segmentation method according to an exemplary embodiment.

Fig. 2 is a flowchart of a training method of a target segmentation model according to an exemplary embodiment.

Fig. 3 is a schematic diagram of a training method of a target segmentation model according to an exemplary embodiment.

Fig. 4 is a flowchart of a video object segmentation method according to an exemplary embodiment.

Fig. 5 is a block diagram of a video object segmentation apparatus according to an exemplary embodiment.

Fig. 6 is a block diagram of a server according to an exemplary embodiment.

Fig. 7 is a block diagram of a terminal according to an exemplary embodiment.

Detailed Description

In order to enable those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings.

It should be noted that the terms "first", "second", and the like in the description and claims of the present disclosure and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the present disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as recited in the appended claims.

The user information involved in the present disclosure may be information authorized by the user or fully authorized by all parties.

An embodiment of the present disclosure provides a video object segmentation method, which is implemented by an electronic device. The target segmentation model used in the method can be applied in scenarios such as video surveillance, transportation, and video editing.

For example, in a video editing scenario, a technician needs to edit a target such as a person or an object in a video. The video is generally dynamic, that is, the target in the video is moving. The target therefore needs to be segmented out of the video before it can be edited. For example, if the target is actor c in the video, the electronic device obtains the video collected by a video capture device, splits the video into frames to obtain a plurality of video frames, segments an image including actor c from each video frame, and then edits that image.

Fig. 1 is a flowchart of a video object segmentation method according to an exemplary embodiment. As shown in Fig. 1, the video object segmentation method is used in an electronic device and includes the following steps.

In step S11, a first target feature of a first target frame, a first reference feature of a first reference frame, and a reference segmentation map of the first reference frame are determined, where the first target frame is a video frame of the video to be segmented on which target segmentation has not yet been performed, the first reference frame is a video frame of the video on which target segmentation has been performed, and the interval between the first reference frame and the first target frame is greater than a first preset duration.

In step S12, the first target feature is weighted based on a first non-local attention of the first target frame to obtain a second target feature, and the first reference feature is weighted based on a second non-local attention of the first reference frame to obtain a second reference feature.

In step S13, first offset information is determined based on the second target feature and the second reference feature, where the first offset information represents the offsets between position-corresponding first pixels and second pixels, the first pixels being pixels in the first target frame and the second pixels being pixels in the first reference frame.

In step S14, the second reference feature is offset based on the first offset information to obtain an offset second reference feature, and a first local attention between the offset second reference feature and the second target feature is determined.

In step S15, the reference segmentation map and the first local attention are input into a target segmentation model to obtain a first target segmentation map of the first target frame.

In some embodiments, determining the first offset information based on the second target feature and the second reference feature includes:

concatenating the second target feature and the second reference feature to obtain a first feature; and

inputting the first feature into an offset prediction model to obtain the first offset information.

In some embodiments, inputting the reference segmentation map and the first local attention into the target segmentation model to obtain the first target segmentation map of the first target frame includes:

performing, by the target segmentation model, sliding-window extraction on each third pixel in the reference segmentation map to obtain a plurality of first neighborhood maps corresponding to each third pixel, and concatenating the plurality of first neighborhood maps to obtain a sliding-window-extracted reference segmentation map;

performing dimension transformation on the sliding-window-extracted reference segmentation map and on the first local attention to obtain a dimension-transformed reference segmentation map and a dimension-transformed first local attention, the dimension-transformed reference segmentation map having the same dimensions as the dimension-transformed first local attention; and

determining the first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention.

In some embodiments, determining the first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention includes:

for each fourth pixel in the dimension-transformed reference segmentation map, determining a plurality of similarities from the dimension-transformed first local attention, the plurality of similarities being the similarities between the fourth pixel and other pixels;

performing a weighted summation of the pixel value of the fourth pixel based on the plurality of similarities to obtain a weighted-summed pixel value; and

modifying the pixel value of the fourth pixel to the weighted-summed pixel value to obtain the first target segmentation map of the first target frame.
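As an illustrative sketch only (not the claimed implementation), the sliding-window extraction and similarity-weighted summation described above can be expressed as follows. NumPy stands in for the model's tensor operations, and the function names, the zero padding, and the softmax normalization of the attention weights are assumptions made for the sake of a runnable example:

```python
import numpy as np

def extract_neighborhoods(seg, k=3):
    """Sliding-window extraction: for every pixel, collect its k x k
    neighborhood from the (zero-padded) reference segmentation map.
    Returns the dimension-transformed map of shape (H*W, k*k)."""
    H, W = seg.shape
    p = k // 2
    padded = np.pad(seg, p)
    out = np.empty((H * W, k * k))
    for y in range(H):
        for x in range(W):
            out[y * W + x] = padded[y:y + k, x:x + k].ravel()
    return out

def propagate_mask(seg_ref, local_attention, k=3):
    """Weighted summation: each target-frame pixel value is the
    attention-weighted sum over the matching reference neighborhood.
    local_attention has shape (H*W, k*k), matching the transformed map."""
    neigh = extract_neighborhoods(seg_ref, k)          # (H*W, k*k)
    weights = np.exp(local_attention)
    weights /= weights.sum(axis=1, keepdims=True)      # normalize per pixel
    return (weights * neigh).sum(axis=1).reshape(seg_ref.shape)
```

With uniform (all-zero) attention logits, each output pixel is simply the mean of its reference neighborhood, which makes the propagation behavior easy to check by hand.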

In some embodiments, weighting the first target feature based on the first non-local attention of the first target frame to obtain the second target feature includes:

normalizing the first non-local attention to obtain a first weight;

performing, by the target segmentation model, matrix multiplication on the first weight and the first target feature to obtain a third target feature; and

superimposing the third target feature on the first target feature to obtain the second target feature.

In some embodiments, the method further includes:

when the interval between the first reference frame and the first target frame is not greater than the first preset duration, inputting a second local attention between the first reference feature and the first target feature, together with the reference segmentation map, into the target segmentation model to obtain the first target segmentation map of the first target frame.

In some embodiments, the target segmentation model is trained by the following method:

obtaining, from a sample video, a sample target frame on which target segmentation has not been performed and a first sample reference frame, and inputting the sample target frame and the first sample reference frame into an initial target segmentation model to obtain a first sample target feature of the sample target frame and a first sample reference feature of the first sample reference frame, the interval between the sample target frame and the first sample reference frame being greater than a second preset duration and less than a third preset duration;

weighting the first sample target feature based on a third non-local attention of the sample target frame to obtain a second sample target feature, and weighting the first sample reference feature based on a fourth non-local attention of the first sample reference frame to obtain a second sample reference feature;

determining second offset information based on the second sample target feature and the second sample reference feature, the second offset information representing the offsets between position-corresponding first sample pixels and second sample pixels, the first sample pixels being pixels in the sample target frame and the second sample pixels being pixels in the first sample reference frame;

offsetting the second sample reference feature based on the second offset information to obtain an offset second sample reference feature; and

training the initial target segmentation model based on the first sample reference frame and a third local attention to obtain the target segmentation model, the third local attention being the local attention between the offset second sample reference feature and the second sample target feature.

In some embodiments, training the initial target segmentation model based on the first sample reference frame and the third local attention to obtain the target segmentation model includes:

performing, by the initial target segmentation model, sliding-window extraction on each second sample pixel in the first sample reference frame to obtain a plurality of second neighborhood maps corresponding to each second sample pixel, and concatenating the plurality of second neighborhood maps to obtain a second sample reference frame;

performing dimension transformation on the second sample reference frame and on the third local attention to obtain a dimension-transformed second sample reference frame and a dimension-transformed third local attention, the dimension-transformed second sample reference frame having the same dimensions as the dimension-transformed third local attention; and

training the initial target segmentation model based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model.

In some embodiments, training the initial target segmentation model based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model includes:

for each third sample pixel in the dimension-transformed second sample reference frame, determining a plurality of similarities from the dimension-transformed third local attention, the plurality of similarities being the similarities between the third sample pixel and other pixels;

performing a weighted summation of the pixel value of the third sample pixel based on the plurality of similarities to obtain a weighted-summed pixel value;

modifying the pixel value of the third sample pixel to the weighted-summed pixel value to obtain a second target segmentation map of the sample target frame; and

inputting the sample target frame into the initial target segmentation model, and iteratively updating the model parameters of the initial target segmentation model based on a loss value between the output of the initial target segmentation model and the second target segmentation map to obtain the target segmentation model.
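The final iterative update can be sketched as follows; this is a deliberately minimal illustration, not the claimed training procedure. The one-parameter linear "model" and the L2 loss are hypothetical stand-ins for the real segmentation network and its loss; only the loop structure (predict on the sample target frame, compare against the second target segmentation map used as a pseudo-label, update the parameters) mirrors the text:

```python
import numpy as np

def train_step(params, target_frame, pseudo_label, lr=0.1):
    """One iterative update: the model output is compared against the
    second target segmentation map (used as a pseudo-label), and the
    parameters are moved down the gradient of the loss value."""
    pred = params * target_frame                 # toy model output
    loss = np.mean((pred - pseudo_label) ** 2)   # loss vs. pseudo-label
    grad = np.mean(2 * (pred - pseudo_label) * target_frame)
    return params - lr * grad, loss

# Iterate until the model output matches the pseudo-label.
frame = np.ones((4, 4))
label = 0.5 * np.ones((4, 4))
w = 0.0
for _ in range(200):
    w, loss = train_step(w, frame, label)
```

The update converges to the parameter value that reproduces the pseudo-label, which is the behavior the training step above relies on.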

In the embodiments of the present disclosure, since the features of a video frame can reflect its semantic information, the first offset information determined based on the target feature of the first target frame and the reference feature of the first reference frame already fuses the semantic information of the two video frames. The first offset information can therefore accurately represent how each position-corresponding pixel is offset from the first reference frame to the first target frame, so that the first target segmentation map of the first target frame predicted based on the first offset information is comparatively accurate, which improves the accuracy of video object segmentation.

Fig. 2 is a flowchart of a training method of a target segmentation model according to an exemplary embodiment. As shown in Fig. 2, the embodiments of the present disclosure are described by taking the training of the target segmentation model as an example. The training method is used in an electronic device and includes the following steps:

In step 201, a sample target frame on which target segmentation has not been performed and a first sample reference frame are obtained from a sample video, and the sample target frame and the first sample reference frame are input into an initial target segmentation model to obtain a first sample target feature of the sample target frame and a first sample reference feature of the first sample reference frame, where the interval between the sample target frame and the first sample reference frame is greater than a second preset duration and less than a third preset duration.

The sample video is a video in a training data set, and the training data set includes a plurality of sample videos. Before training the initial target segmentation model, the electronic device obtains a plurality of sample videos in advance and composes them into a training data set. In this step, the electronic device obtains the training data set and selects a sample video from it.

The sample target frame is a video frame on which target segmentation has not been performed, and the first sample reference frame is a video frame whose target segmentation map is known; in the sample video, the first sample reference frame precedes the sample target frame. The interval between the sample target frame and the first sample reference frame is greater than the second preset duration and less than the third preset duration. The second preset duration is a value less than 15, and the third preset duration is a value greater than 15; for example, if the interval is Δt, the second preset duration is 5, and the third preset duration is 20, then Δt is a value between 5 and 20.

The first sample target feature is a high-dimensional feature of the sample target frame, and the first sample reference feature is a high-dimensional feature of the first sample reference frame. The initial target segmentation model is a neural network model, namely a residual network (ResNet), for example, ResNet-18. In this step, the electronic device uses the initial target segmentation model as a feature extractor, obtains the first sample target feature from the sample target frame by the following Formula 1, and obtains the first sample reference feature from the first sample reference frame by the following Formula 2:

Formula 1: f_t = Φ(I_t; θ_Φ)

Formula 2: f_r = Φ(I_r; θ_Φ)

where f_t is the first sample target feature, I_t is the sample target frame, Φ is the initial target segmentation model, θ_Φ is the learnable parameter of Φ, f_r is the first sample reference feature, I_r is the first sample reference frame, I_t ∈ R^(C×H×W), I_r ∈ R^(C×H×W), H, W, and C are the height, width, and number of channels of a video frame, respectively, and R is the set of real numbers.
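As a hedged illustration of Formulas 1 and 2, the sketch below applies one shared feature extractor Φ with parameters θ_Φ to both the target frame and the reference frame. The single 3×3 convolution here is an assumed stand-in for the ResNet-18 backbone named above; shapes and names are illustrative only:

```python
import numpy as np

def phi(frame, theta):
    """Stand-in feature extractor Φ: a single 3x3 convolution with shared
    parameters theta (playing the role of θ_Φ), mapping a (C, H, W) frame
    to a (C', H, W) high-dimensional feature."""
    C_out, C_in, k, _ = theta.shape
    C, H, W = frame.shape
    p = k // 2
    padded = np.pad(frame, ((0, 0), (p, p), (p, p)))
    feat = np.zeros((C_out, H, W))
    for o in range(C_out):
        for y in range(H):
            for x in range(W):
                feat[o, y, x] = np.sum(padded[:, y:y + k, x:x + k] * theta[o])
    return feat

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 3, 3, 3))          # shared learnable weights
f_t = phi(rng.normal(size=(3, 6, 6)), theta)   # Formula 1: f_t = Φ(I_t; θ_Φ)
f_r = phi(rng.normal(size=(3, 6, 6)), theta)   # Formula 2: f_r = Φ(I_r; θ_Φ)
```

The key point the formulas make is that the same Φ (same θ_Φ) is applied to I_t and I_r, so the two features live in the same embedding space and can be compared pixel-to-pixel later.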

In step 202, the first sample target feature is weighted based on the third non-local attention of the sample target frame to obtain a second sample target feature.

The third non-local attention is a similarity matrix. Non-local attention can represent the mutual response between two pixels separated by any distance within an image, and this response represents the similarity of the two pixels.

This step can be implemented through the following steps (1) and (2):

(1) Determine the third non-local attention of the sample target frame.

In some embodiments, the electronic device determines, based on the first sample target feature, the similarities between the first sample pixels in the sample target frame to obtain the third non-local attention through the following steps A1-A3:

A1: Perform dimension reduction on the first sample target feature to obtain a dimension-reduced first sample target feature, which includes the feature of each first sample pixel.

The electronic device may reduce the first sample target feature to any dimension. For example, the electronic device reduces the numbers of channels of the first sample target feature and the first sample reference feature to 1/8 of the original, that is, from f_t ∈ R^(C×H×W) to f_t ∈ R^((C/8)×H×W), and from f_r ∈ R^(C×H×W) to f_r ∈ R^((C/8)×H×W).

A2: For each first sample pixel, determine the similarities between the first sample pixel and other first sample pixels based on the feature of the first sample pixel and the features of the other first sample pixels, obtaining a plurality of similarities.

In some embodiments, for each first sample pixel, the similarities between the first sample pixel and all other first sample pixels are determined based on the feature of the first sample pixel and the features of all other first sample pixels, obtaining a plurality of similarities.

In some embodiments, for each first sample pixel, the similarities between the first sample pixel and some of the other first sample pixels are determined based on the feature of the first sample pixel and the features of those other first sample pixels, obtaining a plurality of similarities. The said some other first sample pixels are the plurality of first sample pixels obtained after removing edge pixels from the sample target frame. The embodiments of the present disclosure do not specifically limit how edge pixels are selected.

It should be noted that, for one first sample pixel, a plurality of similarities are obtained and can be stored in the form of a matrix; the plurality of similarities obtained for that first sample pixel then form a first local matrix. Since the first sample target feature includes the features of a plurality of first sample pixels, a plurality of first local matrices are obtained in this step.

A3: Compose the plurality of similarities into the third non-local attention.

For example, according to the coordinates of first sample pixel a in the sample target frame, the plurality of first local matrices obtained in step A2 are composed into the third non-local attention.

Steps A2-A3 are implemented by Formula 3:

Formula 3: A_nlocal(i, j) = g(f_t(i); θ_g)^T · h(f_t(j); θ_h)

where i and j are coordinate positions, A_nlocal is the third non-local attention, A_nlocal ∈ R^(N×N), N = H×W, f_t(i) and f_t(j) are the pixel values at two different coordinates in the first sample target feature, g and h each denote a 1×1 convolutional layer, and θ_g and θ_h are the learnable parameters of g and h, respectively.
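Since a 1×1 convolution acts as a per-pixel linear map, Formula 3 reduces to plain matrix algebra once the feature map is flattened to N = H×W columns. The sketch below (shapes, names, and the reduced embedding dimension are illustrative assumptions) computes the N×N similarity matrix:

```python
import numpy as np

def non_local_attention(f, theta_g, theta_h):
    """Formula 3 as matrix algebra: g and h are 1x1 convolutions, i.e.
    per-pixel linear maps, and A_nlocal(i, j) is the dot product of the
    embedded features at positions i and j, giving an N x N similarity
    matrix with N = H * W."""
    C, H, W = f.shape
    flat = f.reshape(C, H * W)     # each column is one pixel's feature
    g = theta_g @ flat             # 1x1 conv == matrix multiply
    h = theta_h @ flat
    return g.T @ h                 # A[i, j] = g(f(i))^T . h(f(j))

rng = np.random.default_rng(0)
f_t = rng.normal(size=(8, 4, 4))
theta = rng.normal(size=(2, 8))    # reduced-dim embedding weights
A = non_local_attention(f_t, theta, theta)
```

Note that when θ_g = θ_h the resulting similarity matrix is symmetric, matching the intuition that the response between pixels i and j is mutual.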

(2) Weight the first sample target feature based on the third non-local attention to obtain the second sample target feature.

In some embodiments, the electronic device weights the first sample target feature based on the third non-local attention of the sample target frame to obtain the second sample target feature through the following steps B1-B3:

B1: Normalize the third non-local attention to obtain a second weight.

The second weight is used in the subsequent process to determine the offsets of the first sample pixels in the sample target frame, that is, to determine the movement of the target in the sample target frame.

B2: Perform, by the initial target segmentation model, matrix multiplication on the second weight and the first sample target feature to obtain a third sample target feature.

In step B2, matrix multiplication is performed on the second weight and the first sample target feature to obtain the residual of the first sample target feature, that is, the third sample target feature.

B3: Superimpose the third sample target feature on the first sample target feature to obtain the second sample target feature.

In step B3, the residual obtained in step B2 is superimposed on the first sample target feature to obtain the second sample target feature, so that the initial target segmentation model is trained based on the residual together with the new target feature, which improves the learning efficiency of the model.

Steps B1-B3 are implemented by Formula 4:

Formula 4: f_r = softmax(A_nlocal, dim=0) · m(f_r; θ_m) + f_r

where m is a convolutional layer of the initial target segmentation model, θ_m is the learnable parameter of m, and softmax is the normalization function.
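Formula 4 can be sketched as follows. The flattened matrix form, the max-subtracted softmax, and the multiplication order are implementation choices assumed here for a runnable example, not taken from the source; what the sketch preserves is the structure of the formula: softmax-normalize the non-local attention over dim=0, apply it to the 1×1-conv-transformed feature m(f; θ_m), and add the result back to f as a residual:

```python
import numpy as np

def nonlocal_residual(f, A, theta_m):
    """Formula 4: normalize the non-local attention A with a softmax over
    dim=0, weight the transformed feature m(f; theta_m) with it, and
    superimpose the result on f as a residual."""
    C, H, W = f.shape
    N = H * W
    flat = f.reshape(C, N)
    weights = np.exp(A - A.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)   # softmax over dim=0
    m_f = theta_m @ flat                            # m(f; theta_m): 1x1 conv
    return (m_f @ weights + flat).reshape(C, H, W)  # residual superposition

rng = np.random.default_rng(1)
f = rng.normal(size=(4, 3, 3))
A = rng.normal(size=(9, 9))
out = nonlocal_residual(f, A, np.eye(4))
```

Because the attention term enters as a residual, a zero-initialized m leaves the feature unchanged, which is what makes this module easy to embed into an existing network without disturbing it at the start of training.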

In step 203, the first sample reference feature is weighted based on the fourth non-local attention of the first sample reference frame to obtain a second sample reference feature.

The fourth non-local attention is a similarity matrix. This step can be implemented through the following steps (1) and (2):

(1) Determine the fourth non-local attention of the first sample reference frame.

In some embodiments, the electronic device determines, based on the first sample reference feature, the similarities between the second sample pixels of the first sample reference frame to obtain the fourth non-local attention through the following steps C1-C3:

C1: Perform dimension reduction on the first sample reference feature to obtain a dimension-reduced first sample reference feature, which includes the feature of each second sample pixel.

This step is similar to step A1 and is not repeated here.

C2: For each second sample pixel, determine the similarities between the second sample pixel and other second sample pixels based on the feature of the second sample pixel and the features of the other second sample pixels, obtaining a plurality of similarities.

This step is similar to step A2 and is not repeated here.

C3: Compose the plurality of similarities into the fourth non-local attention.

This step is similar to step A3 and is not repeated here.

In the embodiments of the present disclosure, performing dimension reduction on the first sample target feature and the first sample reference feature reduces the amount of computation when calculating the non-local attention, and calculating the non-local attention of the dimension-reduced first sample target feature and of the dimension-reduced first sample reference feature separately allows the pixels within a frame to excite one another, which promotes the flow of semantic information between pixels within a frame and speeds up the training of the initial target segmentation model.

(2) Weight the first sample reference feature based on the fourth non-local attention to obtain a second sample reference feature.

In some embodiments, this step is implemented through the following steps D1-D3:

D1: Normalize the fourth non-local attention to obtain a third weight.

This step is similar to step B1 and is not repeated here.

D2: Perform matrix multiplication on the third weight and the first sample reference feature through the initial target segmentation model to obtain a third sample reference feature.

This step is similar to step B2 and is not repeated here.

D3: The electronic device superimposes the third sample reference feature onto the first sample reference feature to obtain the second sample reference feature.

This step is similar to step B3 and is not repeated here.

In the embodiment of the present disclosure, the non-local attention of the first sample target feature is weighted into the first sample target feature in the form of a residual, and the non-local attention of the first sample reference feature is weighted into the first sample reference feature in the form of a residual, so that non-local attention can be embedded into the neural network as a module, and the learning efficiency of the neural network is improved.
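The residual weighting of steps D1-D3 can be sketched as follows, assuming softmax normalization and dot-product attention (a hypothetical NumPy illustration, not the patented implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C, N = 4, 9                      # channels and number of pixel points (illustrative)
rng = np.random.default_rng(1)
feat = rng.standard_normal((C, N))   # first (sample) reference feature
attn = feat.T @ feat                 # fourth non-local attention (N x N)

weight = softmax(attn, axis=-1)      # D1: normalization -> third weight
weighted = feat @ weight.T           # D2: matrix multiplication with the feature
out = feat + weighted                # D3: residual superposition

assert out.shape == feat.shape
```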

In step 204, second offset information is determined based on the second sample target feature and the second sample reference feature. The second offset information represents the offset between a first sample pixel point and the second sample pixel point at the corresponding position, where the first sample pixel point is a pixel point in the sample target frame and the second sample pixel point is a pixel point in the first sample reference frame.

In the sample video, part of the image content of the sample target frame is similar to that of the first sample reference frame. For example, a foreground object in the sample target frame may be the result of a non-rigid transformation of a foreground object in the first sample reference frame, and the background image in the sample target frame may be the result of a translation of the background image in the first sample reference frame. In this step, the second offset information determined by the electronic device is an offset map that represents the semantic and temporal correlation between the sample target frame and the first sample reference frame; that is, the offset map represents not only the image-level differences between the sample target frame and the first sample reference frame but also the high-level differences between them.

In some embodiments, step 204 includes the following steps A1-A2:

A1: Concatenate the second sample target feature and the second sample reference feature to obtain a second feature.

In step A1, concatenating the second sample target feature and the second sample reference feature fuses the high-level information of the sample target frame and the first sample reference frame.

A2: Input the second feature into an offset prediction model to obtain the second offset information.

The offset prediction model outputs the second offset information based on the input second feature. The offset prediction model is a three-layer neural network; the present disclosure does not specifically limit the network type of the offset prediction model.

Steps A1-A2 are implemented by formula five:

Formula five: O = Φ(concat(f_t, f_r))

where O is the second offset information, Φ is the offset prediction model, θ denotes the learnable parameters of Φ, f_t and f_r are the second sample target feature and the second sample reference feature, and concat is the concatenation function.

In the embodiment of the present disclosure, the second sample target feature and the second sample reference feature are concatenated, fusing the high-level information of the sample target frame and the first sample reference frame, and the concatenated second feature is input into the offset prediction model, which predicts the semantic and temporal correlation offset between the two. Because this offset is real-valued, the sampling position is not fixed to the pixel grid, making it easier to hit the target pixel. Moreover, because a neural network is used for localization, the amount of computation is small, which reduces the video memory occupied on the graphics processor.
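A minimal sketch of steps A1-A2, assuming the three-layer offset prediction network is a per-pixel multilayer perceptron (equivalent to 1×1 convolutions) with random weights standing in for the learnable parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 4, 5, 5
f_t = rng.standard_normal((C, H, W))     # second (sample) target feature
f_r = rng.standard_normal((C, H, W))     # second (sample) reference feature

x = np.concatenate([f_t, f_r], axis=0)   # A1: concat along channels -> (2C, H, W)
x = x.reshape(2 * C, -1)                 # a 1x1 conv treats each pixel independently

# Three layers with random weights standing in for the learnable parameters.
W1, W2, W3 = (rng.standard_normal(s) * 0.1 for s in [(16, 2 * C), (16, 16), (2, 16)])
h = np.maximum(W1 @ x, 0)                # ReLU activations between layers
h = np.maximum(W2 @ h, 0)
offset = (W3 @ h).reshape(2, H, W)       # A2: real-valued (dy, dx) offset per pixel

assert offset.shape == (2, H, W)
```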

In step 205, offset processing is performed on the second sample reference feature based on the second offset information to obtain an offset second sample reference feature.

This step is implemented by formula six:

Formula six: f̃_r(p) = f_r(p + O(p))

where f̃_r is the offset second sample reference feature and p is a pixel point in the second sample reference feature.

The offset processing in this step is a warp operation; that is, the pixel value at p in f̃_r is the pixel value obtained by bilinearly interpolating the pixel value at p + O(p) in f_r.

In the embodiment of the present disclosure, offset processing is performed on the second sample reference feature based on the second offset information to obtain the offset second sample reference feature, which is then aligned with the second sample target feature semantically and temporally, providing a basis for the subsequent reconstruction of the sample target frame.
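Formula six can be illustrated with a small bilinear warp, here with border values clamped (an assumption; the extracted text does not specify the padding behavior):

```python
import numpy as np

def warp(feat, offset):
    """Sample feat at p + offset(p) with bilinear interpolation (borders clamped)."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    sy = ys + offset[0]                  # sampling position p + O(p), per pixel
    sx = xs + offset[1]
    y0 = np.floor(sy).astype(int)
    x0 = np.floor(sx).astype(int)
    wy = sy - y0
    wx = sx - x0
    out = np.zeros_like(feat)
    for dy, dx, w in [(0, 0, (1 - wy) * (1 - wx)), (0, 1, (1 - wy) * wx),
                      (1, 0, wy * (1 - wx)), (1, 1, wy * wx)]:
        yy = np.clip(y0 + dy, 0, H - 1)
        xx = np.clip(x0 + dx, 0, W - 1)
        out += feat[:, yy, xx] * w       # accumulate the four bilinear taps
    return out

feat = np.arange(16, dtype=float).reshape(1, 4, 4)
# A constant offset of (0, 1): every pixel samples its right-hand neighbour.
offset = np.stack([np.zeros((4, 4)), np.ones((4, 4))])
warped = warp(feat, offset)
assert warped[0, 0, 0] == feat[0, 0, 1]
```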

In step 206, through the initial target segmentation model, sliding-window extraction is performed on each second sample pixel point in the first sample reference frame to obtain a plurality of second neighborhood maps corresponding to each second sample pixel point, and the plurality of second neighborhood maps are concatenated to obtain a second sample reference frame.

Referring to FIG. 3, the electronic device performs sliding-window extraction on each second sample pixel point in the first sample reference frame to obtain a plurality of neighborhood blocks of size P×P, and then concatenates the plurality of neighborhood blocks to obtain the second sample reference frame. The present disclosure does not specifically limit the value of P; for example, P may be 3 or 9.

In step 207, a third local attention is determined based on the offset second sample reference feature and the second sample target feature.

In some embodiments, this step includes: for each first sample pixel point, the electronic device determines, in the offset second sample reference feature, the second sample pixel point having the same coordinate position as the first sample pixel point, selects a first partial image centered on that second sample pixel point, and determines the similarity between the first sample pixel point and each second sample pixel point in the first partial image, obtaining the third local attention.

For example, the first partial image has a size of P×P. The electronic device determines the third local attention through formula seven:

Formula seven: A_local(j, l, i, k) = <f_t(i, k), f̃_r(i + j − M, k + l − M)>, with j, l ∈ [0, P)

where A_local is the third local attention, P = 2M + 1 is the side length of the first partial image, M is the distance from the second sample pixel point at the center to the edge of the first partial image, i and k are the coordinate values of the first sample pixel point, and j and l are the coordinate values of the second sample pixel point within the first partial image.
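Assuming dot-product similarity, formula seven can be illustrated by comparing each target pixel feature with its P×P neighborhood in the warped reference feature:

```python
import numpy as np

C, H, W, P = 2, 4, 4, 3
M = P // 2
rng = np.random.default_rng(5)
f_t = rng.standard_normal((C, H, W))         # second (sample) target feature
f_r = rng.standard_normal((C, H, W))         # offset (warped) reference feature

pad = np.pad(f_r, ((0, 0), (M, M), (M, M)))  # zero-pad the PxP border
# A_local[j, l, i, k]: similarity between target pixel (i, k) and the
# reference pixel at offset (j - M, l - M) inside its PxP neighbourhood.
A_local = np.stack([(f_t * pad[:, dy:dy + H, dx:dx + W]).sum(axis=0)
                    for dy in range(P) for dx in range(P)]).reshape(P, P, H, W)

assert A_local.shape == (P, P, H, W)
# The centre entry compares each pixel with its own location.
assert np.allclose(A_local[M, M], (f_t * f_r).sum(axis=0))
```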

In the embodiment of the present disclosure, since the similarity between each first sample pixel point in the second sample target feature and a subset of the second sample pixel points in the offset second sample reference feature is determined, the semantic and temporal responses between frames are captured, and the flow of semantic and temporal information between pixel points across frames is promoted.

In step 208, dimension transformation is performed on the second sample reference frame and on the third local attention respectively, to obtain a dimension-transformed second sample reference frame and a dimension-transformed third local attention, where the dimension-transformed second sample reference frame and the dimension-transformed third local attention have matching dimensions.

In this step, the electronic device transforms the dimensions of the third local attention to obtain the dimension-transformed third local attention. For example, the third local attention has a size of P×P×H×W, i.e. P²×H×W, while the second sample reference frame obtained by sliding-window extraction has a size of C×P×P×H×W; the dimension-transformed second sample reference frame then has a size of C×P²×H×W, and the dimension-transformed third local attention has a size of 1×P²×H×W.

In the embodiment of the present disclosure, dimension-transforming the third local attention and the second sample reference frame yields a third local attention and a second sample reference frame with matching dimensions, which reduces the amount of computation in the subsequent reconstruction of the sample target frame and improves the learning efficiency of the neural network during the training phase.

In step 209, the initial target segmentation model is trained based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model.

In some embodiments, step 209 is implemented through the following steps A1-A4:

A1: For each third sample pixel point in the dimension-transformed second sample reference frame, determine a plurality of similarities from the dimension-transformed third local attention, the plurality of similarities being the similarities between that third sample pixel point and other pixel points.

The size of the dimension-transformed third local attention is 1×P²×H×W; thus each third sample pixel point corresponds to a second partial matrix at the same coordinates in the dimension-transformed third local attention, where the second partial matrix is a matrix of size P×P containing a plurality of similarities.

A2: Based on the plurality of similarities, perform a weighted summation on the pixel values of the third sample pixel point to obtain a weighted-sum pixel value.

For each third sample pixel point in the dimension-transformed second sample reference frame, the electronic device determines the product of the pixel value of that third sample pixel point and each similarity in the second partial matrix to obtain a plurality of product values, and then sums the plurality of product values; that is, it performs a weighted summation on the pixel values of the third sample pixel point to obtain a new pixel value.

A3: Modify the pixel value of the third sample pixel point to the weighted-sum pixel value to obtain a second target segmentation map of the sample target frame.

Based on the new pixel values, the electronic device obtains the second target segmentation map through the initial target segmentation model, that is, it reconstructs the sample target frame.

Steps A1-A3 are implemented by formula eight:

Formula eight: Î_t = sum(A_r ⊙ Ĩ_r)

where Î_t is the second target segmentation map, sum is the summation function taken over the P² neighborhood dimension, A_r is the dimension-transformed third local attention, and Ĩ_r is the dimension-transformed second sample reference frame.

A4: Input the sample target frame into the initial target segmentation model, and iteratively update the model parameters of the initial target segmentation model based on the loss value between the output result of the initial target segmentation model and the second target segmentation map, to obtain the target segmentation model.

In the embodiment of the present disclosure, based on the dimension-transformed third local attention, the sample target frame is combined with the sample reference frame, capturing the semantic and temporal responses between frames; together with the sample reference frame obtained through the sliding-window extraction operation, the second target segmentation map of the sample target frame is determined with high segmentation accuracy, thereby realizing the reconstruction of the sample target frame.
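The weighted summation of steps A1-A3 (formula eight) can be sketched as a per-pixel weighted combination over the P² neighborhood axis, assuming normalized attention weights:

```python
import numpy as np

rng = np.random.default_rng(3)
P, H, W = 3, 4, 4
attn = rng.random((P * P, H, W))          # dimension-transformed local attention
attn /= attn.sum(axis=0, keepdims=True)   # assumed normalised per pixel
ref = rng.random((1, P * P, H, W))        # unfolded reference map (values in [0, 1))

# Formula eight: weighted sum over the P*P neighbourhood axis at every pixel.
recon = (attn[None] * ref).sum(axis=1)    # -> (1, H, W) reconstructed target map

assert recon.shape == (1, H, W)
# A convex combination of values in [0, 1) stays in [0, 1].
assert recon.min() >= 0 and recon.max() <= 1
```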

The electronic device determines the loss function based on the second target segmentation map of the sample target frame. This step is implemented by formula nine:

Formula nine: L = (1/N) Σ_{i=1}^{N} l_i

where N = H×W is the total number of pixel points in the second target segmentation map, L is the loss function, and l_i is an intermediate variable, namely the loss term of the i-th pixel point.

In the embodiment of the present disclosure, the electronic device performs steps 201-209 based on the sample video. In response to the completion of step 209, the electronic device obtains another video from the training data set and continues to perform steps 201-209.

Since the training data set includes a plurality of videos, each time the initial target segmentation model is trained, i.e. each time a sample video is obtained from the plurality of videos, the model updates the loss function once; then, based on the loss value between the output result of the initial target segmentation model and the second target segmentation map, back-propagation is performed on the initial target segmentation model and the parameters of the initial target segmentation model are updated.

During training, the initial target segmentation model needs to perform step A4 multiple times, that is, it is updated iteratively. In this process, back-propagation and parameter updates improve the training accuracy of the initial target segmentation model and thereby the prediction accuracy of the target segmentation model.

In the embodiment of the present disclosure, sliding-window extraction is performed on the first sample reference frame to obtain the second sample reference frame, and the initial target segmentation model is trained based on the third local attention and the second sample reference frame, so that the model is trained with high accuracy.

In the embodiment of the present disclosure, training the initial target segmentation model based on the sample videos in the training data set enables the initial target segmentation model to learn, during training, the semantic relationship between the sample target frame and the sample reference frame, thereby improving the prediction accuracy of the initial target segmentation model.

FIG. 4 is a flowchart of a video target segmentation method according to an exemplary embodiment. As shown in FIG. 4, the embodiment of the present disclosure is described by taking as an example performing, based on the target segmentation model, target segmentation on the video frames of the video to be segmented on which target segmentation has not yet been performed. The video target segmentation method is used in an electronic device and includes the following steps:

In step 401, a first target feature of a first target frame, a first reference feature of a first reference frame, and a reference segmentation map of the first reference frame are determined. The first target frame is a video frame in the video to be segmented on which target segmentation has not yet been performed, the first reference frame is a video frame in the video on which target segmentation has been performed, and the interval between the first reference frame and the first target frame is greater than a first preset duration.

For example, the electronic device performs target segmentation on the first video frame of the video to be segmented in advance to obtain a reference segmentation map, and then uses the first video frame as the first reference frame to perform target segmentation on the video frames that follow the first reference frame in time. The reference segmentation map includes at least one target segmentation map.

For example, for the current frame I_t, i.e. the first target frame, the electronic device extracts {I_0, M_0}, {I_5, M_5} (if t > 5), {I_{t-5}, M_{t-5}} (if t > 5 and t ≠ 10), {I_{t-3}, M_{t-3}} (if t > 3 and t ≠ 8), and {I_{t-1}, M_{t-1}} (if t > 1 and t ≠ 6) from the memory bank as first reference frames. The electronic device treats a first reference frame whose interval from the first target frame is greater than the first preset duration as a long-term memory reference frame, and the remaining first reference frames as short-term memory reference frames; the total number of first reference frames is k. For example, the first preset duration is Δt = 15. The memory bank stores the video frames of the video to be segmented whose time precedes the first target frame, together with the target segmentation maps of those video frames.
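The reference-frame selection rule described above can be sketched as a small helper (the function name is hypothetical; the t ≠ 10, t ≠ 8, t ≠ 6 conditions in the text amount to skipping duplicate indices):

```python
def pick_reference_frames(t):
    """Reference-frame indices sketched in the text: frame 0, frame 5 (if t > 5),
    and frames t-5, t-3, t-1, skipping negatives and duplicates."""
    candidates = [0]
    if t > 5:
        candidates.append(5)
    for d in (5, 3, 1):
        idx = t - d
        if idx > 0 and idx not in candidates:
            candidates.append(idx)
    return candidates

assert pick_reference_frames(2) == [0, 1]
assert pick_reference_frames(20) == [0, 5, 15, 17, 19]
# t = 10: I_{t-5} would duplicate frame 5, matching the t != 10 condition.
assert pick_reference_frames(10) == [0, 5, 7, 9]
```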

The first target feature is a high-dimensional feature of the first target frame, and the first reference feature is a high-dimensional feature of the first reference frame. In this step, the way the electronic device determines the first target feature of the first target frame and the first reference feature of the first reference frame is similar to the way the electronic device determines the first sample target feature of the sample target frame and the first sample reference feature of the first sample reference frame in step 201, and is not repeated here.

In step 402, the first target feature is weighted based on the first non-local attention of the first target frame to obtain a second target feature.

The first non-local attention is a similarity matrix. This step can be implemented through the following steps (1) and (2):

(1) Determine the first non-local attention of the first target frame.

This step is implemented through the following steps A1-A3:

A1: Normalize the first non-local attention to obtain a first weight.

This step is similar to step A1 in step 202 and is not repeated here.

A2: Perform matrix multiplication on the first weight and the first target feature through the target segmentation model to obtain a third target feature.

This step is similar to step A2 in step 202 and is not repeated here.

A3: Superimpose the third target feature onto the first target feature to obtain the second target feature.

This step is similar to step A3 in step 202 and is not repeated here.

(2) Weight the first target feature based on the first non-local attention to obtain the second target feature.

This step is similar to step (2) in step 202 and is not repeated here.

In step 403, the first reference feature is weighted based on the second non-local attention of the first reference frame to obtain a second reference feature.

This step is similar to step 203 and is not repeated here.

In the embodiment of the present disclosure, performing dimension reduction on the target feature and the reference feature reduces the amount of computation required when deriving non-local attention, and deriving non-local attention separately for the dimension-reduced target feature and the dimension-reduced reference feature lets the pixel points within a frame excite one another and promotes the flow of semantic information between pixels within a frame.

In step 404, first offset information is determined based on the second target feature and the second reference feature. The first offset information represents the offset between a first pixel point and the second pixel point at the corresponding position, where the first pixel point is a pixel point in the first target frame and the second pixel point is a pixel point in the first reference frame.

In some embodiments, step 404 includes the following steps A1-A2:

A1: Concatenate the second target feature and the second reference feature to obtain a first feature.

This step is similar to step A1 in step 204 and is not repeated here.

A2: Input the first feature into the offset prediction model to obtain the first offset information.

This step is similar to step A2 in step 204 and is not repeated here.

In the embodiment of the present disclosure, the second target feature and the second reference feature are concatenated, fusing the high-level information of the first target frame and the first reference frame, and the concatenated first feature is input into the offset prediction model, which predicts the semantic and temporal correlation offset between the two. Because this offset is real-valued, the sampling position is not fixed to the pixel grid, making it easier to hit the target pixel. Moreover, because a neural network is used for localization, the amount of computation is small, which reduces the video memory occupied on the graphics processor.

In step 405, offset processing is performed on the second reference feature based on the first offset information to obtain an offset second reference feature, and a first local attention between the offset second reference feature and the second target feature is determined.

The offset processing in this step is a warp operation. The way the electronic device performs offset processing on the second reference feature based on the first offset information to obtain the offset second reference feature is similar to step 205, and the way the electronic device determines the first local attention is similar to the way it determines the third local attention in step 207; these are not repeated here.

It should be noted that the electronic device performs the operations of steps 402-405 when the interval between the first reference frame and the first target frame is greater than the first preset duration. When the interval between the first reference frame and the first target frame is not greater than the first preset duration, the electronic device inputs the second local attention between the first reference feature and the first target feature, together with the reference segmentation map, into the target segmentation model to obtain a first target segmentation map of the first target frame.

In the embodiment of the present disclosure, when the first target frame is the second video frame of the video to be segmented, the first reference frame obtained by the model is the first video frame; that is, the number of first reference frames is 1, and there is only one short-term memory reference frame. Based on the target segmentation model, local attention then only needs to be determined twice to determine the target segmentation map of the first target frame, which reduces the time required for model prediction and improves the efficiency of target segmentation.

In step 406, through the target segmentation model, sliding-window extraction is performed on each third pixel point in the reference segmentation map to obtain a plurality of first neighborhood maps corresponding to each third pixel point, and the plurality of first neighborhood maps are concatenated to obtain a sliding-window-extracted reference segmentation map.

This step is similar to step 206 and is not repeated here.

In step 407, dimension transformation is performed on the sliding-window-extracted reference segmentation map and on the first local attention respectively, to obtain a dimension-transformed reference segmentation map and a dimension-transformed first local attention, where the dimension-transformed reference segmentation map and the dimension-transformed first local attention have matching dimensions.

由于第一参考帧的数量为至少一个,因此得到滑窗提取后的的参考分割图和第一局部注意力的数量为至少一个,则在对该滑窗提取后的参考分割图和该第一局部注意力分别进行维度变换之前,电子设备需要分别将至少一个滑窗提取后的的参考分割图进行拼接,将至少一个第一局部注意力进行拼接;相应的,本步骤由公式十和公式十一来实现:Since the number of the first reference frame is at least one, the number of the reference segmentation map and the first local attention after the sliding window extraction is at least one, then the reference segmentation map after the sliding window extraction and the first local attention are obtained. Before the local attention is dimensionally transformed, the electronic device needs to splicing at least one reference segmentation map extracted by the sliding window, and splicing at least one first local attention; correspondingly, this step consists of formula ten and formula ten. One to achieve:

公式十:A=concat(A1,A2,…,Ak)Formula 10: A=concat(A 1 ,A 2 ,...,A k )

公式十一:

Figure BDA0002837533120000221
Formula eleven:
Figure BDA0002837533120000221

where A ∈ R^(k×P×P×H×W), and the reshaped tensor (rendered as an image, Figure BDA0002837533120000222, in the original) is of size (k×P²)×H×W.

In this step, the manner in which the electronic device performs dimension transformation on the reference segmentation map after sliding-window extraction and on the first local attention, to obtain the dimension-transformed reference segmentation map and the dimension-transformed first local attention, is similar to step 206 and is not repeated here.

For example, the first local attention is of size k×P×P×H×W, the reference segmentation map after sliding-window extraction is of size C×(k×P²)×H×W, and the dimension-transformed first local attention is of size (k×P²)×H×W.
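The example sizes above can be checked with a short reshape sketch (a hedged illustration; the concrete `reshape` call stands in for the dimension transformation of Equations 10 and 11, and all sizes are illustrative):

```python
import numpy as np

k, p, h, w, c = 2, 3, 4, 4, 1  # illustrative sizes, not from the disclosure

# Concatenated first local attention: one P x P similarity window per
# pixel, for each of the k first reference frames.
attention = np.random.rand(k, p, p, h, w).astype(np.float32)
# Reference segmentation map after sliding-window extraction.
seg_windows = np.random.rand(c, k * p * p, h, w).astype(np.float32)

# Dimension transformation: fold the k and P x P axes into one axis of
# length k*P^2 so both tensors index neighborhoods the same way.
attention_t = attention.reshape(k * p * p, h, w)
print(attention_t.shape, seg_windows.shape)  # (18, 4, 4) (1, 18, 4, 4)
```

After this transformation both tensors index neighborhoods along an axis of length k×P², so they can be combined pixel-wise in step 408.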

In step 408, a first target segmentation map of the first target frame is determined based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention.

In some embodiments, step 408 is implemented by the following steps A1-A3:

A1: For each fourth pixel point in the dimension-transformed reference segmentation map, determine a plurality of similarities from the dimension-transformed first local attention, the plurality of similarities being the similarities between the fourth pixel point and other pixel points.

This step is similar to step A1 in step 209 and is not repeated here.

A2: Based on the plurality of similarities, perform weighted summation on the pixel values of the fourth pixel point to obtain a weighted-sum pixel value.

This step is similar to step A2 in step 209 and is not repeated here.

A3: Modify the pixel value of the fourth pixel point to the weighted-sum pixel value to obtain the first target segmentation map of the first target frame.

This step is similar to step A3 in step 209 and is not repeated here.

Steps A1-A3 are implemented by Equation 12:

Equation 12 (rendered as an image, Figure BDA0002837533120000231, in the original): it expresses steps A1-A3 in closed form, producing each output pixel as the similarity-weighted sum over the corresponding entries of the dimension-transformed reference segmentation map.

where the left-hand side of Equation 12 (rendered as an image, Figure BDA0002837533120000232, in the original) is the first target segmentation map.

In this embodiment of the present disclosure, the first target segmentation map of the first target frame is determined based on the dimension-transformed first local attention (which combines the first target frame with the first reference frame and captures the responses of inter-frame semantic and temporal information) and on the reference segmentation map obtained by the sliding-window extraction operation, so that the target segmentation achieves high accuracy.

In the embodiments of the present disclosure, since the features of a video frame reflect the semantic information of that frame, the first offset information determined based on the target feature of the first target frame and the reference feature of the first reference frame already fuses the semantic information of the two video frames. The first offset information can therefore accurately represent the offset of each position-corresponding pixel point from the first reference frame to the first target frame, so the target segmentation map of the first target frame predicted based on the first offset information is more accurate, which in turn improves the accuracy of video object segmentation.

Fig. 5 is a block diagram of a video object segmentation apparatus according to an exemplary embodiment. Referring to Fig. 5, the apparatus 50 includes: a first determining unit 501, a first weighting unit 502, a second determining unit 503, a first offset unit 504 and a first target segmentation unit 505.

The first determining unit 501 is configured to determine a first target feature of a first target frame, a first reference feature of a first reference frame, and a reference segmentation map of the first reference frame, where the first target frame is a video frame, in a video to be segmented, on which target segmentation has not been performed, the first reference frame is a video frame in the video on which target segmentation has been performed, and the interval between the first reference frame and the first target frame is greater than a first preset duration.

The first weighting unit 502 is configured to weight the first target feature based on a first non-local attention of the first target frame to obtain a second target feature, and to weight the first reference feature based on a second non-local attention of the first reference frame to obtain a second reference feature.

The second determining unit 503 is configured to determine first offset information based on the second target feature and the second reference feature, where the first offset information is used to represent the offset between position-corresponding first and second pixel points, the first pixel point being a pixel point in the first target frame and the second pixel point being a pixel point in the first reference frame.

The first offset unit 504 is configured to perform offset processing on the second reference feature based on the first offset information to obtain an offset second reference feature, and to determine a first local attention between the offset second reference feature and the second target feature.

The first target segmentation unit 505 is configured to input the reference segmentation map and the first local attention into a target segmentation model to obtain the first target segmentation map of the first target frame.
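A minimal sketch of the offset processing performed by the first offset unit 504 (nearest-neighbor sampling and border clamping are illustrative assumptions; the disclosure does not specify the interpolation scheme):

```python
import numpy as np

def warp_by_offsets(feature, offsets):
    """Move each pixel of a C x H x W feature map according to a per-pixel
    (dy, dx) offset field of shape 2 x H x W, sampling with nearest-neighbor
    interpolation and clamping at the borders."""
    c, h, w = feature.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + offsets[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + offsets[1]).astype(int), 0, w - 1)
    return feature[:, src_y, src_x]

feat = np.arange(16, dtype=np.float32).reshape(1, 4, 4)
# Shift every pixel one column to the left (each pixel reads from x + 1).
offs = np.zeros((2, 4, 4), dtype=np.float32)
offs[1] = 1.0
warped = warp_by_offsets(feat, offs)
print(warped[0, 0])  # [1. 2. 3. 3.]
```

The offset second reference feature produced this way is then compared with the second target feature to form the first local attention.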

In some embodiments, the second determining unit 503 is configured to splice the second target feature and the second reference feature to obtain a first feature, and to input the first feature into an offset prediction model to obtain the first offset information.

In some embodiments, the first target segmentation unit 505 includes:

a first sliding-window extraction subunit, configured to perform, through the target segmentation model, sliding-window extraction on each third pixel point in the reference segmentation map to obtain a plurality of first neighborhood maps corresponding to each third pixel point, and to splice the plurality of first neighborhood maps to obtain the reference segmentation map after sliding-window extraction;

a first dimension transformation subunit, configured to perform dimension transformation on the reference segmentation map after sliding-window extraction and on the first local attention, respectively, to obtain a dimension-transformed reference segmentation map and a dimension-transformed first local attention, the dimension-transformed reference segmentation map having the same dimensions as the dimension-transformed first local attention; and

a determining subunit, configured to determine the first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention.

In some embodiments, the determining subunit is configured to: for each fourth pixel point in the dimension-transformed reference segmentation map, determine a plurality of similarities from the dimension-transformed first local attention, the plurality of similarities being the similarities between the fourth pixel point and other pixel points; perform, based on the plurality of similarities, weighted summation on the pixel values of the fourth pixel point to obtain a weighted-sum pixel value; and modify the pixel value of the fourth pixel point to the weighted-sum pixel value to obtain the first target segmentation map of the first target frame.

In some embodiments, the first weighting unit 502 is configured to: normalize the first non-local attention to obtain a first weight; perform matrix multiplication on the first weight and the first target feature through the target segmentation model to obtain a third target feature; and superimpose the third target feature on the first target feature to obtain the second target feature.
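A sketch of the weighting just described (softmax is used here as the normalization, and a flattened (H×W)×(H×W) attention matrix is assumed; both are illustrative choices, since the disclosure only states "normalization" and "matrix multiplication"):

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_reweight(feature, attention):
    """Normalize the non-local attention into a first weight, matrix-multiply
    it with the flattened feature, and superimpose the result on the input
    feature (the residual step described above)."""
    c, h, w = feature.shape
    flat = feature.reshape(c, h * w)           # C x N, with N = H*W
    weights = softmax(attention, axis=-1)      # N x N, rows sum to 1
    weighted = flat @ weights.T                # third target feature, C x N
    return (weighted + flat).reshape(c, h, w)  # second target feature

feat = np.random.rand(2, 3, 3).astype(np.float32)
att = np.random.rand(9, 9).astype(np.float32)
out = nonlocal_reweight(feat, att)
print(out.shape)  # (2, 3, 3)
```

The residual addition keeps the original first target feature intact while mixing in globally aggregated context.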

In some embodiments, the apparatus further includes:

a second target segmentation unit, configured to, in the case that the interval between the first reference frame and the first target frame is not greater than the first preset duration, input a second local attention between the first reference feature and the first target feature, together with the reference segmentation map, into the target segmentation model to obtain the first target segmentation map of the first target frame.

In some embodiments, the apparatus further includes:

an acquisition unit, configured to acquire, from a sample video, a sample target frame and a first sample reference frame on which target segmentation has not been performed, and to input the sample target frame and the first sample reference frame into an initial target segmentation model to obtain a first sample target feature of the sample target frame and a first sample reference feature of the first sample reference frame, the interval between the sample target frame and the first sample reference frame being greater than a second preset duration and less than a third preset duration;

a second weighting unit, configured to weight the first sample target feature based on a third non-local attention of the sample target frame to obtain a second sample target feature, and to weight the first sample reference feature based on a fourth non-local attention of the first sample reference frame to obtain a second sample reference feature;

a third determining unit, configured to determine second offset information based on the second sample target feature and the second sample reference feature, where the second offset information is used to represent the offset between position-corresponding first and second sample pixel points, the first sample pixel point being a pixel point in the sample target frame and the second sample pixel point being a pixel point in the first sample reference frame;

a second offset unit, configured to perform offset processing on the second sample reference feature based on the second offset information to obtain an offset second sample reference feature; and

a training unit, configured to train the initial target segmentation model based on the first sample reference frame and a third local attention to obtain the target segmentation model, the third local attention being the local attention between the offset second sample reference feature and the second sample target feature.

In some embodiments, the training unit includes:

a second sliding-window extraction subunit, configured to perform, through the initial target segmentation model, sliding-window extraction on each second sample pixel point in the first sample reference frame to obtain a plurality of second neighborhood maps corresponding to each second sample pixel point, and to splice the plurality of second neighborhood maps to obtain a second sample reference frame;

a second dimension transformation subunit, configured to perform dimension transformation on the second sample reference frame and the third local attention, respectively, to obtain a dimension-transformed second sample reference frame and a dimension-transformed third local attention, the dimension-transformed second sample reference frame having the same dimensions as the dimension-transformed third local attention; and

a training subunit, configured to train the initial target segmentation model based on the dimension-transformed second sample reference frame and the dimension-transformed third local attention to obtain the target segmentation model.

In some embodiments, the training subunit is configured to: for each third sample pixel point in the dimension-transformed second sample reference frame, determine a plurality of similarities from the dimension-transformed third local attention, the plurality of similarities being the similarities between the third sample pixel point and other pixel points; perform, based on the plurality of similarities, weighted summation on the pixel values of the third sample pixel point to obtain a weighted-sum pixel value; modify the pixel value of the third sample pixel point to the weighted-sum pixel value to obtain a second target segmentation map of the sample target frame; and input the sample target frame into the initial target segmentation model, and iteratively update the model parameters of the initial target segmentation model based on the loss value between the output result of the initial target segmentation model and the second target segmentation map, to obtain the target segmentation model.
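The iterative parameter update at the end of the paragraph above can be sketched with a stand-in model (the linear model, squared-error loss, learning rate, and tensor sizes are all illustrative assumptions; the disclosure does not specify the loss function or the optimizer):

```python
import numpy as np

# Stand-in for the initial target segmentation model: a single weight
# matrix fitted so that its output approaches the second target
# segmentation map. All names and sizes here are hypothetical.
rng = np.random.default_rng(0)
features = rng.random((8, 8)).astype(np.float32)  # sample target frame features
target = rng.random((8, 1)).astype(np.float32)    # second target segmentation map
params = np.zeros((8, 1), dtype=np.float32)       # model parameters

def loss(p):
    # Squared-error loss between model output and the target map.
    return float(((features @ p - target) ** 2).mean())

initial = loss(params)
for _ in range(200):
    grad = features.T @ (features @ params - target) / len(features)
    params -= 0.1 * grad  # iterative parameter update
print(loss(params) < initial)  # True: the loss decreases during training
```

Gradient steps on the loss between the model output and the second target segmentation map play the role of the iterative model-parameter update described above.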

In the embodiments of the present disclosure, since the features of a video frame reflect the semantic information of that frame, the first offset information determined based on the target feature of the first target frame and the reference feature of the first reference frame already fuses the semantic information of the two video frames. The first offset information can therefore accurately represent the offset of each position-corresponding pixel point from the first reference frame to the first target frame, so the target segmentation map of the first target frame predicted based on the first offset information is more accurate, which in turn improves the accuracy of video object segmentation.

Regarding the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the related method, and is not elaborated here.

In an exemplary embodiment, an electronic device is provided, which includes a processor and a memory for storing instructions executable by the processor, where the processor is configured to execute the instructions to implement the training method of the target segmentation model in the above embodiments.

In some embodiments, the electronic device may be provided as a server. Fig. 6 is a block diagram of a server 600 according to an exemplary embodiment. The server 600 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 601 and one or more memories 602, where the memory 602 is used to store executable instructions and the processor 601 is configured to execute the executable instructions to implement the training method of the target segmentation model provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may further include other components for implementing device functions, which are not described here.

In some embodiments, the electronic device may be provided as a terminal. Fig. 7 is a block diagram of a terminal 700 according to an exemplary embodiment. The terminal 700 may be a smartphone, a tablet computer, a notebook computer, a desktop computer, or the like. The terminal 700 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or by other names.

Generally, the terminal 700 includes a processor 701 and a memory 702.

In some embodiments, the processor 701 includes one or more processing cores, such as a 4-core processor or an 8-core processor. In some embodiments, the processor 701 is implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). In some embodiments, the processor 701 also includes a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 701 is integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 701 further includes an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

In some embodiments, the memory 702 includes one or more computer-readable storage media, which are non-transitory. In some embodiments, the memory 702 further includes high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 702 is used to store at least one instruction, which is executed by the processor 701 to implement the video object segmentation method provided by the method embodiments of the present disclosure.

In some embodiments, the terminal 700 optionally further includes a peripheral device interface 703 and at least one peripheral device. In some embodiments, the processor 701, the memory 702 and the peripheral device interface 703 are connected through buses or signal lines. In some embodiments, each peripheral device is connected to the peripheral device interface 703 through a bus, a signal line or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 704, a display screen 705, a camera assembly 706, an audio circuit 707, a positioning assembly 708 and a power supply 709.

The peripheral device interface 703 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702 and the peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702 and the peripheral device interface 703 are implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 704 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices through electromagnetic signals. The radio frequency circuit 704 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. In some embodiments, the radio frequency circuit 704 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. In some embodiments, the radio frequency circuit 704 communicates with other terminals through at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to, the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 further includes circuits related to NFC (Near Field Communication), which is not limited in the present disclosure.

The display screen 705 is used to display a UI (User Interface). In some embodiments, the UI includes graphics, text, icons, video and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the capability to collect touch signals on or above its surface. In some embodiments, the touch signal is input to the processor 701 as a control signal for processing. At this time, the display screen 705 is also used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there is one display screen 705, arranged on the front panel of the terminal 700; in other embodiments, there are at least two display screens 705, respectively arranged on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display screen 705 is a flexible display screen, arranged on a curved or folded surface of the terminal 700. The display screen 705 may even be set as a non-rectangular irregular shape, that is, a special-shaped screen. In some embodiments, the display screen 705 is made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).

The camera assembly 706 is used to capture images or video. In some embodiments, the camera assembly 706 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the terminal and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blur function, or the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 706 further includes a flash. In some embodiments, the flash is a single-color-temperature flash; in some embodiments, the flash is a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, used for light compensation at different color temperatures.

In some embodiments, the audio circuit 707 includes a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment and convert the sound waves into electrical signals, which are input to the processor 701 for processing or input to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo collection or noise reduction, in some embodiments there are multiple microphones, respectively arranged at different parts of the terminal 700. In some embodiments, the microphone is an array microphone or an omnidirectional collection microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. In some embodiments, the speaker is a conventional thin-film speaker; in some embodiments, the speaker is a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 707 further includes a headphone jack.

The positioning assembly 708 is used to locate the current geographic position of the terminal 700 to implement navigation or LBS (Location Based Service). In some embodiments, the positioning assembly 708 is a positioning assembly based on the GPS (Global Positioning System) of the United States, the BeiDou system of China or the Galileo system.

The power supply 709 is used to supply power to the components in the terminal 700. In some embodiments, the power supply 709 is an alternating current, a direct current, a disposable battery or a rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery is a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery is also used to support fast charging technology.

在一些实施例中,终端700还包括有一个或多个传感器710。该一个或多个传感器710包括但不限于:加速度传感器711、陀螺仪传感器712、压力传感器713、指纹传感器714、光学传感器715以及接近传感器716。In some embodiments, the terminal 700 also includes one or more sensors 710 . The one or more sensors 710 include, but are not limited to, an acceleration sensor 711 , a gyro sensor 712 , a pressure sensor 713 , a fingerprint sensor 714 , an optical sensor 715 and a proximity sensor 716 .

在一些实施例中,加速度传感器711检测以终端700建立的坐标系的三个坐标轴上的加速度大小。比如,加速度传感器711用于检测重力加速度在三个坐标轴上的分量。在一些实施例中,处理器701根据加速度传感器711采集的重力加速度信号,控制显示屏705以横向视图或纵向视图进行用户界面的显示。在一些实施例中,加速度传感器711还用于游戏或者用户的运动数据的采集。In some embodiments, the acceleration sensor 711 detects the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 700 . For example, the acceleration sensor 711 is used to detect the components of the gravitational acceleration on the three coordinate axes. In some embodiments, the processor 701 controls the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711 . In some embodiments, the acceleration sensor 711 is also used for game or user movement data collection.

In some embodiments, the gyroscope sensor 712 detects the body orientation and rotation angle of the terminal 700, and the gyroscope sensor 712 cooperates with the acceleration sensor 711 to capture the user's 3D actions on the terminal 700. Based on the data collected by the gyroscope sensor 712, the processor 701 can implement the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

In some embodiments, the pressure sensor 713 is disposed on the side frame of the terminal 700 and/or the lower layer of the display screen 705. When the pressure sensor 713 is disposed on the side frame of the terminal 700, it can detect the user's grip signal on the terminal 700, and the processor 701 performs left/right-hand identification or shortcut operations according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed on the lower layer of the display screen 705, the processor 701 controls the operable controls on the UI according to the user's pressure operation on the display screen 705. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.

The fingerprint sensor 714 is used to collect the user's fingerprint; either the processor 701 identifies the user's identity from the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the user's identity from the collected fingerprint. When the user's identity is identified as trusted, the processor 701 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. In some embodiments, the fingerprint sensor 714 is disposed on the front, back, or side of the terminal 700. When the terminal 700 is provided with a physical button or a manufacturer's logo, the fingerprint sensor 714 is integrated with the physical button or the manufacturer's logo.

The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 controls the display brightness of the display screen 705 according to the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is decreased. In another embodiment, the processor 701 also dynamically adjusts the shooting parameters of the camera assembly 706 according to the ambient light intensity collected by the optical sensor 715.
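The brightness control described above can be sketched as a simple mapping from ambient light to a display level. An illustrative Python sketch — the lux thresholds and the output range are assumptions for illustration, not values from the embodiment:

```python
def adjust_brightness(lux: float, lo: float = 50.0, hi: float = 1000.0) -> float:
    """Map ambient light intensity (lux) to a display brightness in [0.2, 1.0]:
    dim at or below `lo`, full at or above `hi`, linear in between."""
    if lux <= lo:
        return 0.2
    if lux >= hi:
        return 1.0
    # linear interpolation between the two thresholds
    return 0.2 + 0.8 * (lux - lo) / (hi - lo)
```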

The proximity sensor 716, also called a distance sensor, is usually disposed on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front of the terminal 700 is gradually decreasing, the processor 701 controls the display screen 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front of the terminal 700 is gradually increasing, the processor 701 controls the display screen 705 to switch from the screen-off state to the screen-on state.

Those skilled in the art will understand that the structure shown in FIG. 7 does not constitute a limitation on the terminal 700, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.

In an exemplary embodiment, a storage medium including instructions is also provided, for example, the memory 602 or 702 including instructions, and the instructions can be executed by the processor 601 or 701 to complete the above video object segmentation method. Optionally, the storage medium may be a non-transitory computer-readable storage medium, for example, a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

In an exemplary embodiment, a computer program product is also provided. When the instructions in the computer program product are executed by the processor of an electronic device, the electronic device can execute the video object segmentation method in the above embodiments.

Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary techniques in the technical field not disclosed herein. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

It should be understood that the present disclosure is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for segmenting video objects, comprising:
determining a first target feature of a first target frame, a first reference feature of a first reference frame and a reference segmentation map of the first reference frame, wherein the first target frame is a video frame which is not subjected to target segmentation in a video to be segmented, the first reference frame is a video frame which is subjected to target segmentation in the video, and the interval between the first reference frame and the first target frame is greater than a first preset duration;
weighting the first target feature based on the first non-local attention of the first target frame to obtain a second target feature, and weighting the first reference feature based on the second non-local attention of the first reference frame to obtain a second reference feature;
determining first offset information based on the second target feature and the second reference feature, wherein the first offset information is used for representing the offset between a first pixel point and a second pixel point at corresponding positions, the first pixel point is a pixel point in the first target frame, and the second pixel point is a pixel point in the first reference frame;
performing offset processing on the second reference feature based on the first offset information to obtain an offset second reference feature, and determining a first local attention between the offset second reference feature and the second target feature;
inputting the reference segmentation map and the first local attention into a target segmentation model to obtain a first target segmentation map of the first target frame.
2. The method of claim 1, wherein determining first offset information based on the second target feature and the second reference feature comprises:
splicing the second target characteristic and the second reference characteristic to obtain a first characteristic;
and inputting the first characteristic into an offset prediction model to obtain the first offset information.
3. The method of claim 1, wherein said inputting the reference segmentation map and the first local attention into a target segmentation model to obtain a first target segmentation map for the first target frame comprises:
performing sliding window extraction on each third pixel point in the reference segmentation graph through the target segmentation model to obtain a plurality of first neighborhood graphs corresponding to each third pixel point, and splicing the plurality of first neighborhood graphs to obtain a reference segmentation graph after the sliding window extraction;
performing dimension transformation on the reference segmentation graph after the sliding window extraction and the first local attention respectively to obtain a reference segmentation graph after the dimension transformation and the first local attention after the dimension transformation, wherein the dimension of the reference segmentation graph after the dimension transformation is the same as that of the first local attention after the dimension transformation;
determining a first target segmentation map of the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention.
4. The method of claim 3, wherein determining the first target segmentation map for the first target frame based on the dimension-transformed reference segmentation map and the dimension-transformed first local attention comprises:
for each fourth pixel point in the reference segmentation map after the dimension transformation, determining a plurality of similarity degrees from the first local attention after the dimension transformation, wherein the similarity degrees are the similarity degrees between the fourth pixel point and other pixel points;
based on the similarity, carrying out weighted summation on the pixel value of the fourth pixel point to obtain a weighted-summation pixel value;
and modifying the pixel value of the fourth pixel point into the weighted and summed pixel value to obtain a first target segmentation graph of the first target frame.
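The weighted summation in claim 4 can be illustrated with a short NumPy sketch. This is not the patented implementation: the function name is hypothetical, and normalizing the similarities into weights that sum to one is an assumption made here for illustration:

```python
import numpy as np

def reweight_pixel(pixel_values: np.ndarray, similarities: np.ndarray) -> float:
    """Replace a pixel's value with the similarity-weighted sum over the
    candidate pixel values taken from the local attention map."""
    # normalize the similarities into weights summing to 1 (an assumption)
    weights = similarities / similarities.sum()
    return float(weights @ pixel_values)
```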
5. The method of claim 1, wherein weighting the first target feature based on the first non-local attention of the first target frame to obtain a second target feature comprises:
normalizing the first non-local attention to obtain a first weight;
performing matrix multiplication processing on the first weight and the first target characteristic through the target segmentation model to obtain a third target characteristic;
and superposing the third target feature and the first target feature to obtain the second target feature.
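The three steps of claim 5 — normalize the non-local attention into weights, matrix-multiply the weights with the feature, and superpose the result on the original feature — follow the familiar residual self-attention pattern. An illustrative NumPy sketch, assuming a row-wise softmax as the normalization (the claim itself only requires a normalization):

```python
import numpy as np

def non_local_weighting(attention: np.ndarray, feature: np.ndarray) -> np.ndarray:
    """attention: (N, N) non-local attention map; feature: (N, C) first
    target feature. Returns the second target feature of claim 5."""
    # step 1: normalize the attention into first weights (row-wise softmax)
    w = np.exp(attention - attention.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    # step 2: matrix-multiply weights with the feature -> third target feature
    third = w @ feature
    # step 3: superpose the third feature on the first -> second target feature
    return third + feature
```

With an all-zero attention map the weights are uniform, so each row receives the mean feature added back onto itself.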
6. The method of claim 1, further comprising:
and under the condition that the interval between the first reference frame and the first target frame is not more than the first preset duration, inputting the second local attention between the first reference feature and the first target feature and the reference segmentation map into the target segmentation model to obtain a first target segmentation map of the first target frame.
7. The method of claim 1, wherein the object segmentation model is trained by:
obtaining a sample target frame and a first sample reference frame which are not subjected to target segmentation from a sample video, inputting the sample target frame and the first sample reference frame into an initial target segmentation model, obtaining a first sample target feature of the sample target frame and a first sample reference feature of the first sample reference frame, wherein the interval between the sample target frame and the first sample reference frame is longer than a second preset duration and shorter than a third preset duration;
weighting the first sample target feature based on a third non-local attention of the sample target frame to obtain a second sample target feature, and weighting the first sample reference feature based on a fourth non-local attention of the first sample reference frame to obtain a second sample reference feature;
determining second offset information based on the second sample target feature and the second sample reference feature, wherein the second offset information is used for representing the offset between a first sample pixel point and a second sample pixel point at corresponding positions, the first sample pixel point is a pixel point in the sample target frame, and the second sample pixel point is a pixel point in the first sample reference frame;
performing offset processing on the second sample reference feature based on the second offset information to obtain an offset second sample reference feature;
training the initial target segmentation model based on the first sample reference frame and a third local attention to obtain the target segmentation model, wherein the third local attention is a local attention between the shifted second sample reference feature and the second sample target feature.
8. A video object segmentation apparatus, comprising:
a first determining unit, configured to perform determination of a first target feature of a first target frame, a first reference feature of a first reference frame and a reference segmentation map of the first reference frame, where the first target frame is a video frame of a video to be segmented, which is not subjected to target segmentation, the first reference frame is a video frame of the video, which is subjected to target segmentation, and an interval between the first reference frame and the first target frame is greater than a first preset duration;
a first weighting unit configured to perform weighting of the first target feature based on a first non-local attention of the first target frame to obtain a second target feature, and weighting of the first reference feature based on a second non-local attention of the first reference frame to obtain a second reference feature;
a second determining unit configured to perform determining first offset information based on the second target feature and the second reference feature, where the first offset information is used to represent the offset between a first pixel point and a second pixel point at corresponding positions, where the first pixel point is a pixel point in the first target frame, and the second pixel point is a pixel point in the first reference frame;
a first offset unit configured to perform offset processing on the second reference feature based on the first offset information, obtain an offset second reference feature, and determine a first local attention between the offset second reference feature and the second target feature;
a first target segmentation unit configured to perform inputting the reference segmentation map and the first local attention into a target segmentation model resulting in a first target segmentation map of the first target frame.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video object segmentation method of any one of claims 1 to 7.
10. A storage medium having instructions which, when executed by a processor of an electronic device/server, enable the electronic device to perform a video object segmentation method as claimed in any one of claims 1 to 7.
CN202011480916.3A 2020-12-15 2020-12-15 Video object segmentation method, device, electronic device and storage medium Active CN112508959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011480916.3A CN112508959B (en) 2020-12-15 2020-12-15 Video object segmentation method, device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011480916.3A CN112508959B (en) 2020-12-15 2020-12-15 Video object segmentation method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN112508959A true CN112508959A (en) 2021-03-16
CN112508959B CN112508959B (en) 2022-11-11

Family

ID=74972103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011480916.3A Active CN112508959B (en) 2020-12-15 2020-12-15 Video object segmentation method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN112508959B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361519A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113763385A (en) * 2021-05-28 2021-12-07 华南理工大学 Video object segmentation method, device, equipment and medium
CN114242111A (en) * 2021-11-03 2022-03-25 阿里巴巴(中国)有限公司 Audio processing method and device, storage medium and electronic equipment
CN114565878A (en) * 2022-03-01 2022-05-31 北京赛思信安技术股份有限公司 Video marker detection method supporting type configuration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106303225A (en) * 2016-07-29 2017-01-04 努比亚技术有限公司 A kind of image processing method and electronic equipment
CN109784164A (en) * 2018-12-12 2019-05-21 北京达佳互联信息技术有限公司 Prospect recognition methods, device, electronic equipment and storage medium
CN110163196A (en) * 2018-04-28 2019-08-23 中山大学 Notable feature detection method and device
CN110570460A (en) * 2019-09-06 2019-12-13 腾讯云计算(北京)有限责任公司 Target tracking method and device, computer equipment and computer readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106303225A (en) * 2016-07-29 2017-01-04 努比亚技术有限公司 A kind of image processing method and electronic equipment
CN110163196A (en) * 2018-04-28 2019-08-23 中山大学 Notable feature detection method and device
CN109784164A (en) * 2018-12-12 2019-05-21 北京达佳互联信息技术有限公司 Prospect recognition methods, device, electronic equipment and storage medium
CN110570460A (en) * 2019-09-06 2019-12-13 腾讯云计算(北京)有限责任公司 Target tracking method and device, computer equipment and computer readable storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361519A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113361519B (en) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113763385A (en) * 2021-05-28 2021-12-07 华南理工大学 Video object segmentation method, device, equipment and medium
CN114242111A (en) * 2021-11-03 2022-03-25 阿里巴巴(中国)有限公司 Audio processing method and device, storage medium and electronic equipment
CN114565878A (en) * 2022-03-01 2022-05-31 北京赛思信安技术股份有限公司 Video marker detection method supporting type configuration
CN114565878B (en) * 2022-03-01 2024-05-03 北京赛思信安技术股份有限公司 Video marker detection method with configurable support categories

Also Published As

Publication number Publication date
CN112508959B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
US11205282B2 (en) Relocalization method and apparatus in camera pose tracking process and storage medium
CN110149541B (en) Video recommendation method and device, computer equipment and storage medium
CN110136136B (en) Scene segmentation method and device, computer equipment and storage medium
CN108629747B (en) Image enhancement method and device, electronic equipment and storage medium
CN110544272B (en) Face tracking method, device, computer equipment and storage medium
CN111091166B (en) Image processing model training method, image processing device, and storage medium
EP3968223A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN112508959B (en) Video object segmentation method, device, electronic device and storage medium
CN112733970B (en) Image classification model processing method, image classification method and device
CN110570460A (en) Target tracking method and device, computer equipment and computer readable storage medium
CN108304506B (en) Retrieval method, device and equipment
CN108288032B (en) Action characteristic acquisition method, device and storage medium
CN114170349A (en) Image generation method, device, electronic device and storage medium
CN108776822B (en) Target area detection method, device, terminal and storage medium
CN112581358B (en) Training method of image processing model, image processing method and device
CN114283050A (en) Image processing method, device, equipment and storage medium
CN113918767A (en) Video clip positioning method, device, equipment and storage medium
CN111611414A (en) Vehicle retrieval method, device and storage medium
CN112990424A (en) Method and device for training neural network model
CN110991445A (en) Method, device, equipment and medium for identifying vertically arranged characters
CN114154520A (en) Training method of machine translation model, machine translation method, device and equipment
CN113298040A (en) Key point detection method and device, electronic equipment and computer-readable storage medium
CN109472855B (en) Volume rendering method and device and intelligent device
CN118135255A (en) Training method of image matching model, image matching method and computer equipment
CN113139614A (en) Feature extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant