CN107644423B - Real-time video data processing method, device and computing device based on scene segmentation

Info

Publication number: CN107644423B
Application number: CN201710908422.2A
Authority: CN
Inventors: 张蕊; 颜水成; 唐胜
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2021-06-15
Anticipated expiration: 2037-09-29
Also published as: CN107644423A

Abstract

The invention discloses a method, an apparatus, a computing device and a computer storage medium for real-time processing of video data based on scene segmentation. The method comprises: acquiring in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device, or acquiring in real time a current frame image containing a specific object from a currently playing video; inputting the current frame image into a scene segmentation network to obtain a scene segmentation result corresponding to the current frame image; determining contour information of the specific object according to the scene segmentation result; adding a personalized special effect according to the contour information of the specific object to obtain a frame-processed image; overlaying the frame-processed image on the current frame image to obtain processed video data; and displaying the processed video data. The technical solution obtains the scene segmentation result corresponding to a frame image accurately and in real time and, based on that result, adds personalized special effects to the frame image more precisely.

Description

Real-time video data processing method, device and computing device based on scene segmentation

Technical Field

The present invention relates to the technical field of image processing, and in particular to a method, an apparatus, a computing device and a computer storage medium for real-time processing of video data based on scene segmentation.

Background

In the prior art, image scene segmentation methods are mainly based on fully convolutional neural networks from deep learning. These methods apply the idea of transfer learning: a network pre-trained on a large-scale classification dataset is transferred to an image segmentation dataset and trained there, yielding a segmentation network for scene segmentation, which is then used to perform scene segmentation on images.

The network architecture of segmentation networks obtained in the prior art directly reuses an image classification network, in which the size of the convolution blocks in the convolutional layers is fixed, so the size of the receptive field is also fixed. Here, the receptive field is the region of the input image that corresponds to the response of a given node in the output feature map; a receptive field of fixed size is only suited to capturing targets of a fixed size and scale. In image scene segmentation, however, a scene often contains targets of different sizes, and a segmentation network with a fixed-size receptive field often runs into problems with targets that are too large or too small. For a small target, the receptive field captures too much of the background around the target, so the target is confused with the background, missed, and misclassified as background. For a large target, the receptive field captures only part of the target, so the judgment of the target's category is biased and the segmentation result is discontinuous. The prior-art image scene segmentation methods therefore suffer from low segmentation accuracy; the resulting segmentation cannot be used to add personalized special effects to the frame images of a video well and precisely, and the processed video data displays poorly.

Summary of the Invention

In view of the above problems, the present invention is proposed to provide a method, an apparatus, a computing device and a computer storage medium for real-time processing of video data based on scene segmentation that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for real-time processing of video data based on scene segmentation is provided. The method is performed based on a trained scene segmentation network and includes:

acquiring in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or acquiring in real time a current frame image containing a specific object from a currently playing video;

inputting the current frame image into the scene segmentation network, wherein, in at least one convolutional layer of the scene segmentation network, a first convolution block of that convolutional layer is scaled using a scale coefficient output by a scale regression layer to obtain a second convolution block, and the convolution operation of that convolutional layer is then performed using the second convolution block to obtain the output of that convolutional layer; the scale regression layer is an intermediate convolutional layer of the scene segmentation network;

obtaining a scene segmentation result corresponding to the current frame image;

determining contour information of the specific object according to the scene segmentation result corresponding to the current frame image;

adding a personalized special effect according to the contour information of the specific object to obtain a frame-processed image;

overlaying the frame-processed image on the current frame image to obtain processed video data;

displaying the processed video data.
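As an illustration only, the sequence of steps listed above can be sketched as a per-frame loop in Python. Every name here (`process_video`, `segment`, `find_contour`, `add_effect`, `overlay`, `display`) is hypothetical and does not come from the patent; the callables stand in for the trained scene segmentation network and the effect logic, and frames are modeled as nested lists purely for brevity.

```python
from typing import Callable, Iterable, List, Tuple

Image = List[List[int]]          # toy grayscale frame: rows of pixel values (assumption)
Mask = List[List[int]]           # 1 = specific object, 0 = everything else
Contour = List[Tuple[int, int]]  # (row, col) points on the object outline


def process_video(
    frames: Iterable[Image],
    segment: Callable[[Image], Mask],
    find_contour: Callable[[Mask], Contour],
    add_effect: Callable[[Image, Contour], Image],
    overlay: Callable[[Image, Image], Image],
    display: Callable[[Image], None],
) -> List[Image]:
    """Run the listed steps once per frame and collect the processed frames."""
    processed = []
    for frame in frames:                           # current frame image
        mask = segment(frame)                      # scene segmentation result
        contour = find_contour(mask)               # contour of the specific object
        effect_frame = add_effect(frame, contour)  # frame-processed image
        output = overlay(effect_frame, frame)      # overlay on the current frame
        display(output)                            # show the processed video data
        processed.append(output)
    return processed
```

Passing the stages in as callables keeps the sketch independent of any particular network or rendering library; a real implementation would substitute its own components for each one.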

Further, performing the convolution operation of the convolutional layer using the second convolution block to obtain the output of the convolutional layer further includes:

sampling feature vectors from the second convolution block by linear interpolation to form a third convolution block;

performing the convolution operation on the third convolution block with the convolution kernel of the convolutional layer to obtain the output of the convolutional layer.
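The two sub-steps above can be illustrated with a minimal single-channel sketch. It rests on assumptions the text does not state: the feature map holds one scalar per position rather than a feature vector, "linear interpolation" is taken to mean bilinear interpolation in two dimensions, the kernel size is odd, and positions outside the map read as zero. The names `bilinear_sample` and `conv_at` are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]


def bilinear_sample(fmap: FeatureMap, y: float, x: float) -> float:
    """Interpolate fmap at a fractional (y, x); positions outside the map read as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0

    def at(r: int, c: int) -> float:
        return fmap[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    top = (1 - dx) * at(y0, x0) + dx * at(y0, x0 + 1)
    bottom = (1 - dx) * at(y0 + 1, x0) + dx * at(y0 + 1, x0 + 1)
    return (1 - dy) * top + dy * bottom


def conv_at(fmap: FeatureMap, kernel: List[List[float]],
            center: Tuple[int, int], scale: float) -> float:
    """Convolve an odd-sized k x k kernel with a block scaled by `scale` about `center`."""
    k = len(kernel)
    half = k // 2
    cy, cx = center
    total = 0.0
    for i in range(k):
        for j in range(k):
            # Position inside the scaled (second) block; sampling it gives the
            # matching entry of the resampled (third) block.
            y = cy + (i - half) * scale
            x = cx + (j - half) * scale
            total += kernel[i][j] * bilinear_sample(fmap, y, x)
    return total
```

Because the third block always has the same k x k shape as the kernel, the kernel weights themselves never need resizing; only the sampling positions change with the scale.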

Further, the samples used to train the scene segmentation network include: a plurality of sample images stored in a sample library and the annotated scene segmentation results corresponding to the sample images.

Further, the training process of the scene segmentation network is completed through multiple iterations; in one iteration, a sample image and the annotated scene segmentation result corresponding to the sample image are extracted from the sample library, and the scene segmentation network is trained using the sample image and the annotated scene segmentation result.

Further, the training process of the scene segmentation network is completed through multiple iterations, where one iteration includes:

inputting a sample image into the scene segmentation network to obtain a sample scene segmentation result corresponding to the sample image;

obtaining a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the annotated scene segmentation result, and training the scene segmentation network using the scene segmentation network loss function.
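The text does not say which segmentation loss is used. As one common choice, and purely as an assumption, the sketch below computes the mean per-pixel cross-entropy between predicted class probabilities and annotated class indices. The name `segmentation_loss` and the data layout are hypothetical.

```python
import math
from typing import List


def segmentation_loss(pred_probs: List[List[List[float]]],
                      labels: List[List[int]]) -> float:
    """Mean per-pixel cross-entropy (an assumed loss, not specified by the source).

    pred_probs[y][x] is the list of class probabilities predicted for pixel (y, x);
    labels[y][x] is the annotated class index for that pixel.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(pred_probs, labels):
        for probs, label in zip(prob_row, label_row):
            # Clamp to avoid log(0) when the network assigns zero probability.
            total += -math.log(max(probs[label], 1e-12))
            count += 1
    return total / count
```

A prediction that puts probability 1 on every annotated class gives a loss of 0, and the loss grows as probability mass moves away from the annotated class.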

Further, the training steps of the scene segmentation network include:

extracting a sample image and the annotated scene segmentation result corresponding to the sample image from the sample library;

inputting the sample image into the scene segmentation network for training, wherein, in at least one convolutional layer of the scene segmentation network, the first convolution block of that convolutional layer is scaled using either the scale coefficient output by the scale regression layer in the previous iteration or an initial scale coefficient to obtain a second convolution block, and the convolution operation of that convolutional layer is then performed using the second convolution block to obtain the output of that convolutional layer;

obtaining a sample scene segmentation result corresponding to the sample image;

obtaining a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the annotated scene segmentation result, and updating the weight parameters of the scene segmentation network according to the scene segmentation network loss function;

iteratively performing the training steps of the scene segmentation network until a predetermined convergence condition is satisfied.

Further, the predetermined convergence condition includes: the number of iterations reaching a preset number of iterations; and/or the output value of the scene segmentation network loss function being less than a preset threshold.
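The convergence condition can be sketched as a loop guard. The sketch takes one reading of "and/or", namely that training stops as soon as either condition holds. The name `train` and the `step` callable are hypothetical; `step` stands for one training iteration and is assumed to return that iteration's loss value.

```python
from typing import Callable, Tuple


def train(step: Callable[[], float], max_iterations: int,
          loss_threshold: float) -> Tuple[int, float]:
    """Call `step` until a convergence condition holds; return (iterations, last loss)."""
    loss = float("inf")
    iterations = 0
    # Continue only while neither condition is met: the iteration budget is
    # not exhausted and the loss has not dropped below the threshold.
    while iterations < max_iterations and loss >= loss_threshold:
        loss = step()
        iterations += 1
    return iterations, loss
```

Requiring both conditions at once would instead use `or` in the loop guard; which reading applies is a design decision the text leaves open.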

Further, the scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer.

Further, the method also includes: initializing the weight parameters of the scale regression layer when training of the scene segmentation network begins.

Further, displaying the processed video data further includes: displaying the processed video data in real time;

the method also includes: uploading the processed video data to a cloud server.

Further, uploading the processed video data to a cloud server further includes:

Uploading the processed video data to a cloud video platform server, so that the cloud video platform server presents the video data on the cloud video platform.

Uploading the processed video data to a cloud live-streaming server, so that the cloud live-streaming server pushes the video data in real time to the clients of viewing users.

Uploading the processed video data to a cloud official account server, so that the cloud official account server pushes the video data to the clients following the official account.

According to another aspect of the present invention, an apparatus for real-time processing of video data based on scene segmentation is provided. The apparatus operates based on a trained scene segmentation network and includes:

an acquisition module, adapted to acquire in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or to acquire in real time a current frame image containing a specific object from a currently playing video;

a segmentation module, adapted to input the current frame image into the scene segmentation network, wherein, in at least one convolutional layer of the scene segmentation network, a first convolution block of that convolutional layer is scaled using a scale coefficient output by a scale regression layer to obtain a second convolution block, and the convolution operation of that convolutional layer is then performed using the second convolution block to obtain the output of that convolutional layer; the scale regression layer is an intermediate convolutional layer of the scene segmentation network;

a generation module, adapted to obtain a scene segmentation result corresponding to the current frame image;

a determination module, adapted to determine contour information of the specific object according to the scene segmentation result corresponding to the current frame image;

a processing module, adapted to add a personalized special effect according to the contour information of the specific object to obtain a frame-processed image;

an overlay module, adapted to overlay the frame-processed image on the current frame image to obtain processed video data;

a display module, adapted to display the processed video data.

Further, the segmentation module is further adapted to:

Further, the apparatus also includes a scene segmentation network training module; the training process of the scene segmentation network is completed through multiple iterations;

the scene segmentation network training module is adapted to: in one iteration, extract a sample image and the annotated scene segmentation result corresponding to the sample image from the sample library, and train the scene segmentation network using the sample image and the annotated scene segmentation result.

The scene segmentation network training module is adapted to: in one iteration, input a sample image into the scene segmentation network to obtain a sample scene segmentation result corresponding to the sample image;

Further, the apparatus also includes a scene segmentation network training module;

the scene segmentation network training module includes:

an extraction unit, adapted to extract a sample image and the annotated scene segmentation result corresponding to the sample image from the sample library;

a training unit, adapted to input the sample image into the scene segmentation network for training, wherein, in at least one convolutional layer of the scene segmentation network, the first convolution block of that convolutional layer is scaled using either the scale coefficient output by the scale regression layer in the previous iteration or an initial scale coefficient to obtain a second convolution block, and the convolution operation of that convolutional layer is then performed using the second convolution block to obtain the output of that convolutional layer;

an obtaining unit, adapted to obtain a sample scene segmentation result corresponding to the sample image;

an updating unit, adapted to obtain a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the annotated scene segmentation result, and to update the weight parameters of the scene segmentation network according to the scene segmentation network loss function;

the scene segmentation network training module runs iteratively until a predetermined convergence condition is satisfied.

Further, the scene segmentation network training module is further adapted to: initialize the weight parameters of the scale regression layer when training of the scene segmentation network begins.

Further, the display module is further adapted to: display the processed video data in real time;

the apparatus also includes:

an uploading module, adapted to upload the processed video data to a cloud server.

Further, the uploading module is further adapted to:

According to yet another aspect of the present invention, a computing device is provided, including a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface communicate with one another through the communication bus;

the memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the above method for real-time processing of video data based on scene segmentation.

According to still another aspect of the present invention, a computer storage medium is provided, in which at least one executable instruction is stored, and the executable instruction causes a processor to perform the operations corresponding to the above method for real-time processing of video data based on scene segmentation.

According to the technical solution provided by the present invention, a current frame image containing a specific object is acquired in real time from a video shot and/or recorded by an image acquisition device, or from a currently playing video. The current frame image is input into the scene segmentation network, wherein, in at least one convolutional layer of the scene segmentation network, a first convolution block of that convolutional layer is scaled using the scale coefficient output by the scale regression layer to obtain a second convolution block, and the convolution operation of that convolutional layer is then performed using the second convolution block to obtain the output of that convolutional layer. A scene segmentation result corresponding to the current frame image is then obtained; contour information of the specific object is determined according to that result; a personalized special effect is added according to the contour information to obtain a frame-processed image; the frame-processed image is overlaid on the current frame image to obtain processed video data; and the processed video data is displayed. The technical solution scales the convolution block according to the scale coefficient, thereby achieving adaptive scaling of the receptive field. Using the trained scene segmentation network, the scene segmentation result corresponding to a frame image in a video can be obtained accurately and in real time, which effectively improves the accuracy and processing efficiency of image scene segmentation; based on the resulting scene segmentation, personalized special effects can be added to frame images more precisely, improving the display effect of the video data.

The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present invention are more apparent, specific embodiments of the present invention are set out below.

Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:

FIG. 1 shows a schematic flowchart of a method for real-time processing of video data based on scene segmentation according to one embodiment of the present invention;

FIG. 2 shows a schematic flowchart of a method for training a scene segmentation network according to one embodiment of the present invention;

FIG. 3 shows a schematic flowchart of a method for real-time processing of video data based on scene segmentation according to another embodiment of the present invention;

FIG. 4 shows a structural block diagram of an apparatus for real-time processing of video data based on scene segmentation according to one embodiment of the present invention;

FIG. 5 shows a structural block diagram of an apparatus for real-time processing of video data based on scene segmentation according to another embodiment of the present invention;

FIG. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope is fully conveyed to those skilled in the art.

FIG. 1 shows a schematic flowchart of a method for real-time processing of video data based on scene segmentation according to one embodiment of the present invention. The method is performed based on a trained scene segmentation network. As shown in FIG. 1, the method includes the following steps:

Step S100: acquire in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or acquire in real time a current frame image containing a specific object from a currently playing video.

In this embodiment, a mobile terminal is taken as an example of the image acquisition device. The current frame image is acquired in real time while the camera of the mobile terminal is recording or shooting video. Because the present invention processes a specific object, only current frame images that contain the specific object are acquired. Besides acquiring video shot and/or recorded by the image acquisition device in real time, a current frame image containing the specific object can also be acquired in real time from a currently playing video.

Step S101: input the current frame image into the scene segmentation network.

The current frame image contains a specific object, such as a human body. To add personalized special effects to the current frame image precisely, the scene segmentation network is used to perform scene segmentation on the current frame image. The scene segmentation network has been trained; the trained network can scale the convolution blocks of its convolutional layers using the scale coefficient output by the scale regression layer in the network, and can therefore segment the input current frame image more accurately. Specifically, the samples used to train the scene segmentation network include a plurality of sample images stored in a sample library and the annotated scene segmentation results corresponding to the sample images. The annotated scene segmentation result is the segmentation result obtained by manually segmenting and annotating each scene in the sample image.

The training process of the scene segmentation network is completed through multiple iterations. Optionally, in one iteration, a sample image and the annotated scene segmentation result corresponding to the sample image are extracted from the sample library, and the scene segmentation network is trained using the sample image and the annotated scene segmentation result.

Optionally, one iteration includes: inputting a sample image into the scene segmentation network to obtain a sample scene segmentation result corresponding to the sample image; obtaining a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the annotated scene segmentation result; and training the scene segmentation network using the scene segmentation network loss function.

Step S102: in at least one convolutional layer of the scene segmentation network, scale the first convolution block of that convolutional layer using the scale coefficient output by the scale regression layer to obtain a second convolution block.

Those skilled in the art may choose, according to actual needs, which convolutional layer or layers have their convolution blocks scaled; no limitation is imposed here. For ease of distinction, the convolution block to be scaled is referred to in the present invention as the first convolution block, and the convolution block after scaling as the second convolution block. Suppose the first convolution block of a certain convolutional layer in the scene segmentation network is to be scaled; then, in that convolutional layer, the first convolution block is scaled using the scale coefficient output by the scale regression layer to obtain the second convolution block.

The scale regression layer is an intermediate convolutional layer of the scene segmentation network. An intermediate convolutional layer refers to one or more convolutional layers in the scene segmentation network; those skilled in the art may select one or more suitable convolutional layers in the scene segmentation network as the scale regression layer according to actual needs, and no limitation is imposed here. In the present invention, the feature map output by the scale regression layer is called the scale coefficient feature map, and the scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer. The present invention scales the convolution block according to the scale coefficient, thereby achieving adaptive scaling of the receptive field; the input current frame image can be segmented more accurately, which effectively improves the accuracy of image scene segmentation.
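To make the scaling concrete, the following sketch computes the sampling positions of a convolution block before and after scaling. It assumes, beyond what the text states, that the scale coefficient at a position can be treated as a single scalar, that the block is scaled about its center, and that the kernel size is odd. The names `scale_block` and `scale_map` are hypothetical.

```python
from typing import List, Tuple

Position = Tuple[float, float]


def scale_block(center: Tuple[int, int], kernel_size: int,
                scale: float) -> List[List[Position]]:
    """Sampling positions of a kernel_size x kernel_size block scaled about its center."""
    cy, cx = center
    half = kernel_size // 2
    return [
        [(cy + (i - half) * scale, cx + (j - half) * scale)
         for j in range(kernel_size)]
        for i in range(kernel_size)
    ]


# Hypothetical scale coefficient map: one scalar per output position.
scale_map = [[1.0, 1.0, 1.0],
             [1.0, 2.0, 1.0],
             [1.0, 1.0, 0.5]]

first_block = scale_block((1, 1), 3, 1.0)               # unscaled: rows and columns 0..2
second_block = scale_block((1, 1), 3, scale_map[1][1])  # scaled by 2: rows and columns -1..3
```

A coefficient above 1 spreads the same number of sampling points over a larger region, enlarging the receptive field for large targets, while a coefficient below 1 contracts it for small targets.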

Step S103: perform the convolution operation of the convolutional layer using the second convolution block to obtain the output of the convolutional layer.

Once the second convolution block has been obtained, it can be used to perform the convolution operation of the convolutional layer and obtain the output of that layer.

Step S104: obtain a scene segmentation result corresponding to the current frame image.

After the output of the convolutional layer is obtained in step S103, if further convolutional layers follow that layer in the scene segmentation network, its output is used as the input of the next convolutional layer for the subsequent convolution operations. After the convolution operations of all convolutional layers in the scene segmentation network have been performed, the scene segmentation result corresponding to the current frame image is obtained.

步骤S105，根据与当前帧图像对应的场景分割结果，确定特定对象的轮廓信息。Step S105: Determine the contour information of the specific object according to the scene segmentation result corresponding to the current frame image.

在得到了与当前帧图像对应的场景分割结果之后，就可根据与当前帧图像对应的场景分割结果，确定出特定对象的轮廓信息。假设特定对象为人体，那么就可根据场景分割结果，确定出人体的轮廓信息，从而区分出当前帧图像中哪些区域是人体，哪些区域不是人体。After the scene segmentation result corresponding to the current frame image is obtained, the contour information of the specific object can be determined according to the scene segmentation result corresponding to the current frame image. Assuming that the specific object is a human body, then the outline information of the human body can be determined according to the scene segmentation result, so as to distinguish which areas in the current frame image are human bodies and which areas are not human bodies.
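As a minimal sketch of this step, assume the scene segmentation result is a per-pixel label map stored as nested lists and that `PERSON = 1` is a hypothetical label id for the human body. The 4-neighbourhood boundary test used here is one simple way to derive contour pixels; the source does not prescribe a particular method.

```python
def object_mask(labels, target):
    """Binary mask: 1 where the per-pixel label equals the target class."""
    return [[1 if v == target else 0 for v in row] for row in labels]

def contour_pixels(mask):
    """Object pixels that touch a non-object pixel or the image border
    (4-neighbourhood), returned as (row, col) tuples."""
    h, w = len(mask), len(mask[0])
    edge = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = nr < 0 or nr >= h or nc < 0 or nc >= w
                if outside or not mask[nr][nc]:
                    edge.append((r, c))
                    break
    return edge

PERSON = 1  # hypothetical label id, not taken from the source
labels = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
mask = object_mask(labels, PERSON)   # which regions are the human body
contour = contour_pixels(mask)       # boundary of that region
```

The mask answers "which regions are the human body", and the contour list is what a later effect step could use to place stickers along the edge.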

Step S106: add a personalized special effect according to the contour information of the specific object to obtain a frame-processed image.

Once the contour information of the specific object is determined, a personalized special effect can be added according to it to obtain the frame-processed image. Those skilled in the art may configure the personalized special effect according to actual needs, which is not limited here. For example, an effect sticker can be added at the edge of the specific object according to its contour information. The effect sticker may be static or dynamic. Specifically, when the specific object is a human body, the effect sticker may be flames, bouncing musical notes, sea spray and the like; when the specific object is a human head, the effect sticker may be a hair crown, wiggling ears and the like. The effect is configured according to the implementation and is not limited here.
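The edge-effect example can be sketched as follows, assuming the frame is a nested list of RGB tuples and the contour is a list of `(row, col)` pixels. The function names, the `alpha` parameter and the flat "flame" colour are illustrative assumptions; a real effect sticker would normally be a texture rather than a single colour.

```python
def blend(pixel, effect, alpha):
    """Alpha-blend two RGB tuples; alpha=1.0 shows only the effect colour."""
    return tuple(round((1 - alpha) * p + alpha * e) for p, e in zip(pixel, effect))

def add_edge_effect(frame, contour, effect_rgb, alpha=0.8):
    """Return a copy of frame with effect_rgb blended onto the contour pixels."""
    out = [row[:] for row in frame]          # leave the input frame untouched
    for r, c in contour:
        out[r][c] = blend(frame[r][c], effect_rgb, alpha)
    return out

frame = [[(10, 10, 10)] * 3 for _ in range(3)]
contour = [(0, 1), (1, 0)]      # hypothetical contour pixels
flame = (255, 120, 0)           # hypothetical effect colour
processed = add_edge_effect(frame, contour, flame, alpha=1.0)
```

Returning a new frame rather than editing in place keeps the original frame available until the overwrite step replaces it.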

Step S107: overwrite the current frame image with the frame-processed image to obtain the processed video data.

The frame-processed image directly overwrites the original current frame image, which directly yields the processed video data. The user who is recording can also see the frame-processed image immediately.

When the frame-processed image is obtained, it directly overwrites the original current frame image. The overwrite is fast and is generally completed within 1/24 of a second. Because the overwrite takes so little time, the human eye does not perceptibly notice it; that is, the eye does not notice the original current frame image in the video data being overwritten. When the processed video data is subsequently displayed, this is equivalent to displaying the processed video data in real time while the video data is being shot and/or recorded and/or played, and the user does not perceive any overwriting of frame images in the video data.

Step S108: display the processed video data.

Once the processed video data is obtained, it can be displayed in real time, and the user can directly see how the processed video data looks.

According to the real-time video data processing method based on scene segmentation provided by this embodiment, a current frame image containing a specific object is acquired in real time from a video shot and/or recorded by an image acquisition device, or from a video currently being played, and is input into the scene segmentation network. In at least one convolutional layer of the scene segmentation network, the first convolution block of the layer is scaled by the scale coefficient output by the scale regression layer to obtain a second convolution block, and the second convolution block is used to perform the convolution operation of that layer to obtain its output result. The scene segmentation result corresponding to the current frame image is then obtained; the contour information of the specific object is determined from that result; a personalized special effect is added according to the contour information to obtain a frame-processed image; the frame-processed image overwrites the current frame image to obtain the processed video data; and the processed video data is displayed. The technical solution provided by the present invention scales the convolution block according to the scale coefficient and thereby adaptively scales the receptive field. With the trained scene segmentation network, the scene segmentation result for each frame image of the video is obtained accurately and in real time, which effectively improves both the accuracy and the processing efficiency of image scene segmentation. Based on that result, personalized special effects can be added to the frame image more precisely, which improves the display of the video data.

FIG. 2 shows a schematic flowchart of a scene segmentation network training method according to an embodiment of the present invention. As shown in FIG. 2, training the scene segmentation network includes the following steps:

Step S200: extract a sample image and the annotated scene segmentation result corresponding to the sample image from a sample library.

The sample library stores not only sample images but also the annotated scene segmentation results corresponding to them. Those skilled in the art may set the number of sample images stored in the sample library according to actual needs, which is not limited here. In step S200, a sample image is extracted from the sample library together with its corresponding annotated scene segmentation result.

Step S201: input the sample image into the scene segmentation network for training.

After the sample image is extracted, it is input into the scene segmentation network for training.

Step S202: in at least one convolutional layer of the scene segmentation network, scale the first convolution block of that layer using either the scale coefficient output by the scale regression layer in the previous iteration or the initial scale coefficient, to obtain a second convolution block.

Those skilled in the art may choose, according to actual needs, which convolutional layer or layers have their convolution blocks scaled, which is not limited here. Suppose the first convolution block of a certain convolutional layer in the scene segmentation network is to be scaled. Then, in that layer, the first convolution block is scaled using the scale coefficient output by the scale regression layer in the previous iteration, or the initial scale coefficient, to obtain the second convolution block.

Specifically, to train the scene segmentation network effectively, the weight parameters of the scale regression layer can be initialized when training begins. Those skilled in the art may set the specific initial weight parameters according to actual needs, which is not limited here. The initial scale coefficient is the feature vector in the scale coefficient feature map output by the initialized scale regression layer.

Step S203: perform the convolution operation of the convolutional layer using the second convolution block to obtain the output result of the convolutional layer.

Once the second convolution block is obtained, it is used to perform the convolution operation of the convolutional layer, which yields the output result of that layer. Because the second convolution block is obtained by scaling the first convolution block, the coordinates of the feature vectors in the second convolution block may not be integers, so a preset calculation method is used to obtain the feature vectors at these non-integer coordinates. Those skilled in the art may set the preset calculation method according to actual needs, which is not limited here. For example, the preset calculation method may be linear interpolation: feature vectors are sampled from the second convolution block by linear interpolation to form a third convolution block, and the convolution operation is then performed on the third convolution block and the convolution kernel of the layer to obtain the output result of the layer.

After the output result of the convolutional layer is obtained, if further convolutional layers follow it in the scene segmentation network, its output result is used as the input of the next convolutional layer for the subsequent convolution operations. After the convolution operations of all convolutional layers in the scene segmentation network, the scene segmentation result corresponding to the sample image is obtained.

Step S204: obtain the sample scene segmentation result corresponding to the sample image.

The sample scene segmentation result that the scene segmentation network produces for the sample image is obtained.

Step S205: obtain the scene segmentation network loss function from the segmentation loss between the sample scene segmentation result and the annotated scene segmentation result, and update the weight parameters of the scene segmentation network according to the scene segmentation network loss function.

Those skilled in the art may set the specific form of the scene segmentation network loss function according to actual needs, which is not limited here. A back propagation operation is performed according to the scene segmentation network loss function, and the weight parameters of the scene segmentation network are updated with the result.
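The source leaves the loss function open. As one illustrative possibility only, the segmentation loss between the sample result and the annotated result could be a per-pixel cross-entropy, sketched here in plain Python. The function name and the nested-list data layout are assumptions.

```python
import math

def pixel_cross_entropy(probs, labels):
    """Mean negative log-probability of the annotated class over all pixels.

    probs[r][c] is a list of class probabilities predicted for pixel (r, c);
    labels[r][c] is the annotated class id for that pixel.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, y in zip(prob_row, label_row):
            total += -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
            count += 1
    return total / count

labels = [[0, 1]]                                   # annotated result, 1 x 2 image
perfect = [[[1.0, 0.0], [0.0, 1.0]]]                # prediction matching the labels
uniform = [[[0.5, 0.5], [0.5, 0.5]]]                # uninformative prediction
loss_perfect = pixel_cross_entropy(perfect, labels)
loss_uniform = pixel_cross_entropy(uniform, labels)
```

A perfect prediction gives zero loss and an uninformative two-class prediction gives log 2; back propagation would then update the weights to reduce this value.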

Step S206: iteratively execute the training steps of the scene segmentation network until a predetermined convergence condition is satisfied.

Those skilled in the art may set the predetermined convergence condition according to actual needs, which is not limited here. For example, the predetermined convergence condition may include: the number of iterations reaches a preset number of iterations; and/or the output value of the scene segmentation network loss function falls below a preset threshold. That is, whether the condition is satisfied can be judged by checking whether the number of iterations has reached the preset number, or by checking whether the output value of the loss function is below the preset threshold. In step S206, the training steps of the scene segmentation network are executed iteratively until the predetermined convergence condition is satisfied, which yields the trained scene segmentation network.
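The two example convergence conditions can be sketched as a simple loop. Here `step_fn` is a hypothetical callable that runs one pass of steps S200 to S205 and returns the loss value; the toy step that halves the loss only stands in for real training.

```python
def train_until_converged(step_fn, max_iters, loss_threshold):
    """Repeat step_fn until the iteration count reaches max_iters
    or the returned loss drops below loss_threshold."""
    iteration, losses = 0, []
    while iteration < max_iters:
        iteration += 1
        losses.append(step_fn())
        if losses[-1] < loss_threshold:
            break
    return iteration, losses

def make_toy_step(start=1.0):
    """Stand-in for one training iteration: the loss halves each call."""
    state = {"loss": start}
    def step():
        state["loss"] *= 0.5
        return state["loss"]
    return step

iters, losses = train_until_converged(make_toy_step(), max_iters=100,
                                      loss_threshold=0.01)
```

With this toy step the loop stops on the loss threshold; with a threshold that is never reached it stops on the iteration cap instead.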

In a specific training process, suppose for example that the first convolution block of a certain convolutional layer in the scene segmentation network needs to be scaled, and call that layer convolutional layer J. The input feature map of convolutional layer J is

$$A \in \mathbb{R}^{H_A \times W_A \times C_A},$$

where $H_A$ is the height, $W_A$ the width and $C_A$ the number of channels of the input feature map. The output feature map of convolutional layer J is

$$B \in \mathbb{R}^{H_B \times W_B \times C_B},$$

where $H_B$ is the height, $W_B$ the width and $C_B$ the number of channels of the output feature map. The scale coefficient feature map output by the scale regression layer is

$$S \in \mathbb{R}^{H_S \times W_S \times 1},$$

where $H_S$ is the height and $W_S$ the width of the scale coefficient feature map, and its number of channels is 1. Specifically, $H_S = H_B$ and $W_S = W_B$.

In the scene segmentation network, an ordinary 3×3 convolutional layer can be chosen as the scale regression layer; its single-channel output feature map is the scale coefficient feature map. To train the scene segmentation network effectively and prevent it from collapsing during training, the weight parameters of the scale regression layer need to be initialized when training begins. The initialized weight parameters of the scale regression layer are

$$w_0(a) = \sigma, \qquad b_0 = 1,$$

where $w_0$ is the initialized convolution kernel of the scale regression layer, $a$ is any position in the kernel, and $b_0$ is the initialized bias term. In this initialization, the kernel entries are set to random coefficients $\sigma$ drawn from a Gaussian distribution, with very small values close to 0, and the bias term is set to 1. The initialized scale regression layer therefore outputs values that are all close to 1; that is, the initial scale coefficients are close to 1. Applying the initial scale coefficients to convolutional layer J then gives an output that differs little from a standard convolution, which provides a relatively stable training process and effectively prevents the scene segmentation network from collapsing during training.
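The effect of this initialization can be checked with a small sketch. The zero mean, the standard deviation `std=1e-3`, the channel count and the all-ones toy patch are assumptions chosen for illustration; the source only says the kernel values are small Gaussian random numbers and the bias is 1.

```python
import random

def init_scale_regression(kernel_size=3, channels=4, std=1e-3, seed=0):
    """Kernel entries drawn from N(0, std^2) with a small std; bias fixed at 1."""
    rng = random.Random(seed)
    w0 = [[[rng.gauss(0.0, std) for _ in range(channels)]
           for _ in range(kernel_size)]
          for _ in range(kernel_size)]
    b0 = 1.0
    return w0, b0

def scale_coefficient(patch, w0, b0):
    """Inner product of a patch (same shape as w0) with w0, plus the bias."""
    total = b0
    for patch_row, w_row in zip(patch, w0):
        for patch_vec, w_vec in zip(patch_row, w_row):
            total += sum(p * w for p, w in zip(patch_vec, w_vec))
    return total

w0, b0 = init_scale_regression()
patch = [[[1.0] * 4 for _ in range(3)] for _ in range(3)]  # toy input features
s0 = scale_coefficient(patch, w0, b0)  # initial scale coefficient, close to 1
```

Because the kernel contributes almost nothing at the start, the first iterations behave like an unscaled convolution and the scale is learned gradually from there.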

For convolutional layer J, suppose its convolution kernel is

$$K \in \mathbb{R}^{C_B \times (2k+1) \times (2k+1) \times C_A},$$

its bias is

$$b \in \mathbb{R}^{C_B},$$

its input feature map is $A \in \mathbb{R}^{H_A \times W_A \times C_A}$ and its output feature map is $B \in \mathbb{R}^{H_B \times W_B \times C_B}$. The first convolution block of convolutional layer J is $X^t$, and the second convolution block obtained by scaling $X^t$ is $Y^t$; in general, $k = 1$. At any position $t$ in the output feature map $B$, the corresponding feature vector is

$$B^t \in \mathbb{R}^{C_B}.$$

The feature vector $B^t$ is obtained as the inner product of the convolution kernel $K$ with the second convolution block $Y^t$ of the input feature map $A$ that corresponds to this feature vector.

The first convolution block $X^t$ is a square region of the input feature map $A$ centred at $(p^t, q^t)$ whose side length is fixed at $2kd + 1$, where $d$ is the dilation coefficient of the convolution and $p^t$ and $q^t$ are coordinates in the input feature map $A$. From the first convolution block $X^t$, $(2k+1) \times (2k+1)$ feature vectors are selected uniformly and multiplied with the convolution kernel $K$. Specifically, the coordinates of these feature vectors are

$$x_{ij} = p^t + i\,d, \qquad y_{ij} = q^t + j\,d,$$

where $i, j \in \{-k, \dots, k\}$.

Let $s^t$ be the scale coefficient in the scale coefficient feature map that corresponds to the feature vector $B^t$ at position $t$ of the output feature map $B$. The position of $s^t$ in the scale coefficient feature map is also $t$, the same as the position of $B^t$ in the output feature map $B$.

The first convolution block $X^t$ of convolutional layer J is scaled by the scale coefficient $s^t$ to obtain the second convolution block $Y^t$. The second convolution block $Y^t$ is a square region of the input feature map $A$ centred at $(p^t, q^t)$ whose side length varies with the scale coefficient $s^t$ as

$$2kd\,s^t + 1.$$

From the second convolution block $Y^t$, $(2k+1) \times (2k+1)$ feature vectors are selected uniformly and multiplied with the convolution kernel $K$. Specifically, the coordinates of these feature vectors are

$$x'_{ij} = p^t + i\,d\,s^t, \qquad y'_{ij} = q^t + j\,d\,s^t, \qquad i, j \in \{-k, \dots, k\}.$$
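To make the geometry concrete, the sketch below lists the sampling coordinates of a block centred at `(p, q)`. It assumes the uniform spacing implied by the text, `d` for the fixed block and `d * s` for the scaled block; the function names are illustrative.

```python
def sample_coords(p, q, k=1, d=1, s=1.0):
    """Coordinates of the (2k+1) x (2k+1) sampled positions centred at (p, q).

    s = 1 gives the fixed first block X^t; any other s gives the scaled
    second block Y^t, whose coordinates may be non-integer.
    """
    return [[(p + i * d * s, q + j * d * s) for j in range(-k, k + 1)]
            for i in range(-k, k + 1)]

def side_length(k=1, d=1, s=1.0):
    """Side length of the square block: 2*k*d*s + 1."""
    return 2 * k * d * s + 1

X = sample_coords(5, 5, k=1, d=2)          # fixed block, spacing d = 2
Y = sample_coords(5, 5, k=1, d=2, s=1.5)   # scaled block, spacing d * s = 3
```

Both blocks share the same centre; only the spacing between sampled positions, and therefore the receptive field, changes with `s`.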

Because the scale coefficient $s^t$ is real-valued, the coordinates $x'_{ij}$ and $y'_{ij}$ of these feature vectors may not be integers. In the present invention, the feature vectors at such non-integer coordinates are obtained by linear interpolation: feature vectors are sampled from the second convolution block $Y^t$ by linear interpolation and together form the third convolution block $Z^t$. Each feature vector $Z^t_{ij}$ of the third convolution block $Z^t$ is computed by the interpolation formula (equation image not reproduced in the source text).

If $(x'_{ij}, y'_{ij})$ lies outside the input feature map $A$, the corresponding feature vector is set to 0 as padding. Consider, for each output channel $c$, the convolution vector of the kernel $K$ that is multiplied with the corresponding feature vector. The element-wise multiplications over all channels in the convolution operation can then be expressed as a single matrix multiplication, which gives the forward propagation process (equation image not reproduced in the source text).
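The interpolation and forward-propagation equations are missing from this text. As a sketch only, assuming standard bilinear interpolation over the two spatial axes and a flattened matrix form of the kernel, they could be written as below. The symbols $\hat{K}$, $\hat{Z}^t$, $n$ and $m$ are introduced here for illustration and are not taken from the source.

```latex
% Bilinear sampling of A at the (generally non-integer) point (x'_{ij}, y'_{ij}).
% n and m run over the integer coordinates of A; at most four terms are non-zero.
Z^t_{ij} = \sum_{n}\sum_{m} A^{(n,m)}\,
           \max\bigl(0,\ 1-|x'_{ij}-n|\bigr)\,
           \max\bigl(0,\ 1-|y'_{ij}-m|\bigr)

% Forward propagation, with K reshaped to \hat{K} \in R^{C_B \times (2k+1)^2 C_A}
% and Z^t flattened to \hat{Z}^t \in R^{(2k+1)^2 C_A}:
B^t = \hat{K}\,\hat{Z}^t + b
```

When $s^t = 1$ and $d$ is an integer, every sampling point falls on an integer coordinate, the interpolation weights collapse to 0 or 1, and the expression reduces to an ordinary dilated convolution.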

During back propagation, given the gradient $g(B^t)$ passed back from $B^t$, the gradients are (equation image not reproduced in the source text) and

$$g(b) = g(B^t),$$

where $g(\cdot)$ denotes the gradient function and $(\cdot)^T$ denotes matrix transpose. Note that, when the gradients are computed, the final gradients of the convolution kernel $K$ and the bias $b$ are the sums of the gradients obtained from all positions in the output feature map $B$. For the linear interpolation process, the partial derivative with respect to the corresponding feature vector is (equation image not reproduced in the source text), and the partial derivative with respect to the corresponding coordinate is (equation image not reproduced in the source text). The partial derivative with respect to the other coordinate has a similar form and is not repeated here.

Since the coordinates are computed from the scale coefficient $s^t$, the partial derivatives of the coordinates with respect to the scale coefficient follow directly (equation image not reproduced in the source text).

Based on the above partial derivatives, the gradients of the scale coefficient feature map $S$ and of the input feature map $A$ can be obtained (equation images not reproduced in the source text).
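The gradient equations are missing from this text. The following is a sketch of what the chain rule gives under stated assumptions, not the patent's own notation: bilinear interpolation, the matrix form $B^t = \hat{K}\hat{Z}^t + b$, and sampling coordinates $x'_{ij} = p^t + i\,d\,s^t$, $y'_{ij} = q^t + j\,d\,s^t$. The hatted symbols and the indices $n$, $m$ are illustrative.

```latex
% Gradients through the matrix product (g(b) = g(B^t) is stated in the text):
g(\hat{Z}^t) = \hat{K}^{T}\, g(B^t), \qquad
g(\hat{K})   = g(B^t)\,(\hat{Z}^t)^{T}

% Bilinear interpolation, with respect to a feature vector of A:
\frac{\partial Z^t_{ij}}{\partial A^{(n,m)}}
  = \max\bigl(0,\ 1-|x'_{ij}-n|\bigr)\,\max\bigl(0,\ 1-|y'_{ij}-m|\bigr)

% Bilinear interpolation, with respect to the x coordinate (y is analogous):
\frac{\partial Z^t_{ij}}{\partial x'_{ij}}
  = \sum_{n}\sum_{m} A^{(n,m)}\,\max\bigl(0,\ 1-|y'_{ij}-m|\bigr)
    \begin{cases}
      0  & |x'_{ij}-n| \ge 1 \\
      1  & n \ge x'_{ij} \\
      -1 & n < x'_{ij}
    \end{cases}

% Coordinates with respect to the scale coefficient:
\frac{\partial x'_{ij}}{\partial s^t} = i\,d, \qquad
\frac{\partial y'_{ij}}{\partial s^t} = j\,d

% Gradient of the scale coefficient at position t and of the input feature map:
g(s^t) = \sum_{i,j} \left(
           i\,d \left(\frac{\partial Z^t_{ij}}{\partial x'_{ij}}\right)^{T}
         + j\,d \left(\frac{\partial Z^t_{ij}}{\partial y'_{ij}}\right)^{T}
         \right) g(Z^t_{ij}), \qquad
g(A^{(n,m)}) = \sum_{t}\sum_{i,j}
         \frac{\partial Z^t_{ij}}{\partial A^{(n,m)}}\; g(Z^t_{ij})
```

Every step is a sum or product of differentiable pieces (piecewise for the interpolation weights), which is what lets the scale coefficient receive a gradient from the layer that follows it.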

It can be seen that the above convolution process forms a computation that is differentiable as a whole, so the weight parameters of every convolutional layer and of the scale regression layer in the scene segmentation network can be trained end to end. In addition, the gradient of the scale coefficient can be computed from the gradient passed back by the following layer, so the scale coefficients are obtained automatically and implicitly. In a concrete implementation, both forward propagation and back propagation can run in parallel on a graphics processing unit (GPU), which gives high computational efficiency.

The scene segmentation network training method provided by this embodiment trains a scene segmentation network that scales convolution blocks according to scale coefficients, which adaptively scales the receptive field. The resulting network quickly produces the corresponding scene segmentation result, which effectively improves both the accuracy and the processing efficiency of image scene segmentation.

FIG. 3 shows a schematic flowchart of a real-time video data processing method based on scene segmentation according to another embodiment of the present invention. The method is executed on the basis of a trained scene segmentation network. As shown in FIG. 3, the method includes the following steps:

Step S300: acquire in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or acquire in real time a current frame image containing a specific object from a video currently being played.

Step S301: input the current frame image into the scene segmentation network.

The scene segmentation network has been trained. The trained network can use the scale coefficients output by its scale regression layer to scale the convolution blocks of a convolutional layer, and so segments the input current frame image more accurately.

Step S302: in at least one convolutional layer of the scene segmentation network, scale the first convolution block of that layer using the scale coefficient output by the scale regression layer to obtain a second convolution block.

Those skilled in the art may choose, according to actual needs, which convolutional layer or layers have their convolution blocks scaled, which is not limited here. The scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer. In step S302, the first convolution block of the convolutional layer is scaled by the scale coefficient to obtain the second convolution block.

Step S303: sample feature vectors from the second convolution block by linear interpolation to form a third convolution block.

Because the second convolution block is obtained by scaling the first convolution block, the coordinates of its feature vectors may not be integers, so linear interpolation is used to obtain the feature vectors at these non-integer coordinates. Feature vectors are sampled from the second convolution block by linear interpolation and then assembled into the third convolution block. Let the second convolution block be $Y^t$ and the third convolution block be $Z^t$. Each feature vector of $Z^t$ is then computed by the interpolation formula (equation image not reproduced in the source text), in which $d$ is the dilation coefficient of the convolution, $s^t$ is the scale coefficient and, in general, $k = 1$.
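Sampling one feature vector at a non-integer coordinate can be sketched as follows, assuming bilinear interpolation over the four surrounding integer positions and the zero-padding rule stated in the training description for points outside the feature map. The nested-list layout `A[x][y][channel]` and the function names are assumptions.

```python
import math

def feature_at(A, x, y):
    """Feature vector at integer (x, y); zeros if outside the feature map."""
    H, W, C = len(A), len(A[0]), len(A[0][0])
    if 0 <= x < H and 0 <= y < W:
        return A[x][y]
    return [0.0] * C

def bilinear_sample(A, x, y):
    """Feature vector of A at real-valued (x, y) by bilinear interpolation.
    Points outside the feature map yield an all-zero vector (padding)."""
    H, W, C = len(A), len(A[0]), len(A[0][0])
    if not (0 <= x <= H - 1 and 0 <= y <= W - 1):
        return [0.0] * C
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0                 # fractional offsets in [0, 1)
    out = [0.0] * C
    for xi, wx in ((x0, 1 - fx), (x0 + 1, fx)):
        for yi, wy in ((y0, 1 - fy), (y0 + 1, fy)):
            v = feature_at(A, xi, yi)
            for c in range(C):
                out[c] += wx * wy * v[c]
    return out

# 2 x 2 feature map with one channel
A = [[[0.0], [1.0]],
     [[2.0], [3.0]]]
centre = bilinear_sample(A, 0.5, 0.5)   # average of the four neighbours
corner = bilinear_sample(A, 1.0, 1.0)   # integer coordinate: exact value
outside = bilinear_sample(A, 1.5, 0.0)  # outside the map: zero padding
```

Applying `bilinear_sample` to each of the (2k+1) × (2k+1) scaled coordinates would assemble the third convolution block.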

Step S304: perform the convolution operation on the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer.

Once the third convolution block is obtained, the convolution operation is performed on it and the convolution kernel of the convolutional layer, which yields the output result of that layer.
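For a single output position, this step amounts to an inner product between the third convolution block and the kernel for each output channel. The sketch below assumes the layouts `Z[i][j][a]` for the block and `K[c][i][j][a]` for the kernel (output channel first), plus a per-channel bias `b`; these layouts are illustrative choices.

```python
def conv_at_position(Z, K, b):
    """Output feature vector at one position:
    out[c] = sum over i, j, a of K[c][i][j][a] * Z[i][j][a], plus b[c]."""
    out = []
    for c, Kc in enumerate(K):
        acc = b[c]
        for i, row in enumerate(Z):
            for j, vec in enumerate(row):
                for a, v in enumerate(vec):
                    acc += Kc[i][j][a] * v
        out.append(acc)
    return out

# Toy example: 3 x 3 block with one input channel, two output channels
Z = [[[1.0] for _ in range(3)] for _ in range(3)]
K = [
    [[[0.5] for _ in range(3)] for _ in range(3)],   # output channel 0
    [[[0.0] for _ in range(3)] for _ in range(3)],   # output channel 1
]
b = [0.5, 2.0]
B_t = conv_at_position(Z, K, b)
```

Repeating this for every output position, each with its own sampled block, produces the full output feature map of the layer.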

Step S305: obtain the scene segmentation result corresponding to the current frame image.

After the output result of the convolutional layer is obtained in step S304, if further convolutional layers follow it in the scene segmentation network, its output result is used as the input of the next convolutional layer for the subsequent convolution operations. After the convolution operations of all convolutional layers in the scene segmentation network, the scene segmentation result corresponding to the current frame image is obtained.

Step S306: determine the contour information of the specific object according to the scene segmentation result corresponding to the current frame image.

After the scene segmentation result corresponding to the current frame image is obtained in step S305, the contour information of the specific object can be determined from it. If the specific object is a human body, the contour information of the human body can be determined from the scene segmentation result, which distinguishes the regions of the current frame image that belong to the human body from those that do not.

Step S307: add a personalized special effect according to the contour information of the specific object to obtain a frame-processed image.

Once the contour information of the specific object is determined, a personalized special effect can be added according to it to obtain the frame-processed image. Those skilled in the art may configure the personalized special effect according to actual needs, which is not limited here.

For example, an effect sticker can be added at the edge of the specific object according to its contour information. The effect sticker may be static or dynamic. Specifically, when the specific object is a human body, the effect sticker may be flames, bouncing musical notes, sea spray and the like; when the specific object is a human head, the effect sticker may be a hair crown, wiggling ears and the like. The effect is configured according to the implementation and is not limited here. As another example, the specific-object region and the non-specific-object region can be determined from the contour information of the specific object, where the non-specific-object region is called the background region, and the image of the background region can then be replaced with another background image. The other background image may be a two-dimensional or a three-dimensional scene background image. Specifically, when the specific object is a human body, the image of the background region can be replaced with a three-dimensional scene background image such as a three-dimensional seabed scene or a three-dimensional volcano scene.
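The background-replacement example can be sketched as a per-pixel selection, assuming a binary object mask (1 for the specific object, 0 for background) and a replacement background of the same size as the frame. The string "pixels" below are placeholders that keep the example readable; real frames would hold colour values.

```python
def replace_background(frame, mask, background):
    """Keep frame pixels where mask is 1; take background pixels elsewhere."""
    return [
        [f if m else b for f, m, b in zip(frame_row, mask_row, bg_row)]
        for frame_row, mask_row, bg_row in zip(frame, mask, background)
    ]

frame = [["person", "wall"],
         ["person", "floor"]]
mask = [[1, 0],
        [1, 0]]                       # hypothetical object mask
seabed = [["water", "water"],
          ["sand", "sand"]]           # hypothetical replacement background
result = replace_background(frame, mask, seabed)
```

The object region passes through unchanged, so the accuracy of the mask along the contour decides how clean the composite looks.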

步骤S308，将帧处理图像覆盖当前帧图像得到处理后的视频数据。Step S308, overlaying the frame processed image over the current frame image to obtain processed video data.

步骤S309，显示处理后的视频数据。Step S309, displaying the processed video data.

步骤S310，将处理后的视频数据上传至云服务器。Step S310, upload the processed video data to the cloud server.

将处理后的视频数据可以直接上传至云服务器，具体的，可以将处理后的视频数据上传至一个或多个的云视频平台服务器，如爱奇艺、优酷、快视频等云视频平台服务器，以供云视频平台服务器在云视频平台进行展示视频数据。或者还可以将处理后的视频数据上传至云直播服务器，当有直播观看端的用户进入云直播服务器进行观看时，可以由云直播服务器将视频数据实时推送给观看用户客户端。或者还可以将处理后的视频数据上传至云公众号服务器，当有用户关注该公众号时，由云公众号服务器将视频数据推送给公众号关注客户端；进一步，云公众号服务器还可以根据关注公众号的用户的观看习惯，推送符合用户习惯的视频数据给公众号关注客户端。The processed video data can be directly uploaded to the cloud server. Specifically, the processed video data can be uploaded to one or more cloud video platform servers, such as iQiyi, Youku, Kuai Video and other cloud video platform servers. For the cloud video platform server to display video data on the cloud video platform. Alternatively, the processed video data can also be uploaded to the cloud live broadcast server. When a user of the live broadcast viewing terminal enters the cloud live broadcast server to watch, the cloud live broadcast server can push the video data to the viewing user client in real time. Alternatively, the processed video data can also be uploaded to the cloud official account server. When a user follows the official account, the cloud official account server will push the video data to the official account attention client; further, the cloud official account server can also Follow the viewing habits of users who follow the official account, and push video data that conforms to the user's habits to the official account's attention client.

According to the real-time processing method for video data based on scene segmentation provided in this embodiment, the convolution block is scaled according to the scale coefficient, which achieves adaptive scaling of the receptive field. In addition, a linear interpolation method is used to further process the scaled convolution block, which solves the problem of selecting feature vectors whose coordinates in the scaled convolution block are non-integer. The trained scene segmentation network obtains the scene segmentation result corresponding to a frame image of the video accurately and in real time, which effectively improves the accuracy and processing efficiency of image scene segmentation. Based on the obtained scene segmentation result, personalized special effects can be added to the frame image more precisely, which improves the display effect of the video data and optimizes the way the video data is processed.

FIG. 4 shows a structural block diagram of an apparatus for real-time processing of video data based on scene segmentation according to an embodiment of the present invention. The apparatus operates based on a trained scene segmentation network. As shown in FIG. 4, the apparatus includes: an acquisition module 410, a segmentation module 420, a generation module 430, a determination module 440, a processing module 450, an overlay module 460, and a display module 470.

The acquisition module 410 is adapted to: acquire in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or acquire in real time a current frame image containing a specific object from a currently played video.

The segmentation module 420 is adapted to: input the current frame image into the scene segmentation network, wherein, for at least one convolutional layer in the scene segmentation network, the first convolution block of that convolutional layer is scaled using the scale coefficient output by the scale regression layer to obtain a second convolution block, and the second convolution block is then used to perform the convolution operation of that convolutional layer to obtain its output result.

The scene segmentation network is trained. Specifically, the samples used for training the scene segmentation network include: a plurality of sample images stored in a sample library and the labeled scene segmentation results corresponding to the sample images. The scale regression layer is an intermediate convolutional layer of the scene segmentation network. Those skilled in the art may select one or more suitable convolutional layers in the scene segmentation network as the scale regression layer according to actual needs, which is not limited here. The scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer.
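
One plausible reading of "scaling the first convolution block" is that the sampling positions of the block are stretched or shrunk around the block centre by the scale coefficient. The sketch below illustrates that reading only; the function name `scaled_block_coords`, the use of a single scalar coefficient, and the 3x3 default are assumptions made for illustration and are not specified in this passage.

```python
def scaled_block_coords(cx, cy, scale, kernel_size=3):
    """Return the (x, y) sampling positions of a kernel_size x kernel_size block
    centred at (cx, cy) after its offsets are multiplied by `scale`."""
    half = kernel_size // 2
    coords = []
    for dy in range(-half, half + 1):
        row = []
        for dx in range(-half, half + 1):
            # An unscaled block samples at (cx + dx, cy + dy);
            # scaling widens or narrows the receptive field around the centre.
            row.append((cx + scale * dx, cy + scale * dy))
        coords.append(row)
    return coords


coords = scaled_block_coords(cx=4, cy=4, scale=1.5)
```

With `scale=1.5` the corner positions become non-integer (for example `(2.5, 2.5)`), which is why an interpolation step is needed before the convolution can be evaluated.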

The generation module 430 is adapted to: obtain the scene segmentation result corresponding to the current frame image.

The determination module 440 is adapted to: determine the contour information of the specific object according to the scene segmentation result corresponding to the current frame image.
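
How the contour is derived from the segmentation result is not detailed here. A minimal sketch, assuming the segmentation result has already been reduced to a binary mask of the specific object, is to treat every object pixel that touches a non-object pixel (or the image border) as a contour pixel. The function name `contour_pixels` and the 4-neighbour rule are illustrative assumptions.

```python
def contour_pixels(mask):
    """Return (x, y) positions of object pixels that have at least one
    4-neighbour outside the object or outside the image."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    contour = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                outside_image = not (0 <= ny < height and 0 <= nx < width)
                if outside_image or not mask[ny][nx]:
                    contour.append((x, y))
                    break
    return contour


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
edge = contour_pixels(mask)
```

For the 3x3 object above, every object pixel except the centre `(2, 2)` is reported as contour.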

The processing module 450 is adapted to: add personalized special effects according to the contour information of the specific object to obtain a frame-processed image.

The overlay module 460 is adapted to: overlay the current frame image with the frame-processed image to obtain the processed video data.

The overlay module 460 directly overwrites the original current frame image with the frame-processed image, which yields the processed video data directly. At the same time, the user who is recording can see the frame-processed image immediately.

The display module 470 is adapted to: display the processed video data.

After obtaining the processed video data, the display module 470 can display it in real time, so that the user can directly see the display effect of the processed video data.

According to the apparatus for real-time processing of video data based on scene segmentation provided in this embodiment, the convolution block can be scaled according to the scale coefficient, which achieves adaptive scaling of the receptive field. The trained scene segmentation network obtains the scene segmentation result corresponding to a frame image of the video accurately and in real time, which effectively improves the accuracy and processing efficiency of image scene segmentation. Based on the obtained scene segmentation result, personalized special effects can be added to the frame image more precisely, which improves the display effect of the video data.

FIG. 5 shows a structural block diagram of an apparatus for real-time processing of video data based on scene segmentation according to another embodiment of the present invention. The apparatus operates based on a trained scene segmentation network. As shown in FIG. 5, the apparatus includes: an acquisition module 510, a scene segmentation network training module 520, a segmentation module 530, a generation module 540, a determination module 550, a processing module 560, an overlay module 570, a display module 580, and an upload module 590.

The acquisition module 510 is adapted to: acquire in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or acquire in real time a current frame image containing a specific object from a currently played video.

The training process of the scene segmentation network is completed through multiple iterations. The scene segmentation network training module 520 is adapted to: in one iteration, extract a sample image and the labeled scene segmentation result corresponding to the sample image from the sample library, and train the scene segmentation network using the sample image and the labeled scene segmentation result.

Optionally, the scene segmentation network training module 520 is adapted to: in one iteration, input the sample image into the scene segmentation network to obtain the sample scene segmentation result corresponding to the sample image; obtain a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the labeled scene segmentation result; and train the scene segmentation network using the scene segmentation network loss function.

In a specific embodiment, the scene segmentation network training module 520 may include: an extraction unit 521, a training unit 522, an acquisition unit 523, and an update unit 524.

Specifically, the extraction unit 521 is adapted to: extract a sample image and the labeled scene segmentation result corresponding to the sample image from the sample library.

The training unit 522 is adapted to: input the sample image into the scene segmentation network for training, wherein, for at least one convolutional layer in the scene segmentation network, the first convolution block of that convolutional layer is scaled using either the scale coefficient output by the scale regression layer in the previous iteration or an initial scale coefficient to obtain a second convolution block, and the second convolution block is then used to perform the convolution operation of that convolutional layer to obtain its output result.

The scale regression layer is an intermediate convolutional layer of the scene segmentation network, and the scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer.

Optionally, the training unit 522 is further adapted to: use a linear interpolation method to sample feature vectors from the second convolution block to form a third convolution block; and perform a convolution operation on the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer.
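
The interpolation and convolution steps can be sketched as follows. This is an illustration under stated assumptions: the text says only "linear interpolation", and the sketch uses its two-dimensional form (bilinear interpolation); the feature map has a single channel so each "feature vector" is one number; out-of-range coordinates are clamped to the border. The names `bilinear_sample` and `convolve_at` are hypothetical.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate a single-channel feature map at a possibly non-integer (x, y)."""
    height, width = len(feature), len(feature[0])
    # Clamp to the valid range (an assumption; the text does not say how borders are handled).
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def convolve_at(feature, coords, kernel):
    """Sample the 'third convolution block' at `coords` and take its weighted sum with `kernel`."""
    total = 0.0
    for coord_row, kernel_row in zip(coords, kernel):
        for (x, y), weight in zip(coord_row, kernel_row):
            total += weight * bilinear_sample(feature, x, y)
    return total


# A linear ramp feature[y][x] = x + 10*y, for which bilinear interpolation is exact.
feature = [[float(x + 10 * y) for x in range(5)] for y in range(5)]
# Sampling positions of a 3x3 block centred at (2, 2) and scaled by 0.5.
coords = [[(2 + 0.5 * dx, 2 + 0.5 * dy) for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]
kernel = [[1 / 9] * 3 for _ in range(3)]   # averaging kernel
value = convolve_at(feature, coords, kernel)
```

Because the sampling positions are symmetric around `(2, 2)` and the feature map is linear, the averaged result is approximately the centre value 22.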

The acquisition unit 523 is adapted to: obtain the sample scene segmentation result corresponding to the sample image.

The update unit 524 is adapted to: obtain the scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the labeled scene segmentation result, and update the weight parameters of the scene segmentation network according to the scene segmentation network loss function.
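
The form of the segmentation loss is not specified here. As one common choice, and purely as an assumption, the sketch below uses the mean pixel-wise cross-entropy between predicted class probabilities and the labeled class of each pixel. The function name `pixelwise_cross_entropy` and the data layout are illustrative.

```python
import math


def pixelwise_cross_entropy(probabilities, labels):
    """Mean negative log-probability of the labeled class over all pixels.

    probabilities[y][x] is a list of per-class probabilities for one pixel;
    labels[y][x] is the index of the labeled class for that pixel.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probabilities, labels):
        for class_probs, label in zip(prob_row, label_row):
            # The small floor avoids log(0) for a confidently wrong prediction.
            total += -math.log(max(class_probs[label], 1e-12))
            count += 1
    return total / count


probabilities = [[[0.9, 0.1], [0.2, 0.8]]]   # a 1x2 image with two classes
labels = [[0, 1]]
loss = pixelwise_cross_entropy(probabilities, labels)
```

In a real network the weight update would follow from the gradient of such a loss via backpropagation, typically handled by a deep learning framework rather than written by hand.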

The scene segmentation network training module 520 runs iteratively until a predetermined convergence condition is satisfied.

Those skilled in the art may set the predetermined convergence condition according to actual needs, which is not limited here. For example, the predetermined convergence condition may include: the number of iterations reaching a preset number of iterations; and/or the output value of the scene segmentation network loss function being less than a preset threshold. Specifically, whether the predetermined convergence condition is satisfied may be determined by checking whether the number of iterations has reached the preset number of iterations, or by checking whether the output value of the scene segmentation network loss function is less than the preset threshold.
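
The two example convergence conditions can be sketched as a loop that stops as soon as either one holds. This is a minimal illustration: `train_until_converged` and `toy_step` are hypothetical names, and the toy step simply halves a stored loss instead of training a network.

```python
def train_until_converged(step, max_iterations, loss_threshold):
    """Call `step()` (one training iteration returning the loss) until the
    iteration count reaches `max_iterations` or the loss drops below `loss_threshold`."""
    loss = float("inf")
    iterations = 0
    while iterations < max_iterations and loss >= loss_threshold:
        loss = step()
        iterations += 1
    return iterations, loss


# Toy stand-in for one training iteration: the loss halves each time.
state = {"loss": 1.0}


def toy_step():
    state["loss"] *= 0.5
    return state["loss"]


iterations, final_loss = train_until_converged(
    toy_step, max_iterations=100, loss_threshold=0.1
)
```

Here the loop ends after four iterations because the loss (0.0625) falls below the threshold before the iteration cap is reached; requiring both conditions at once would be the other reading of "and/or".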

Optionally, the scene segmentation network training module 520 is further adapted to: initialize the weight parameters of the scale regression layer when training of the scene segmentation network begins.

The segmentation module 530 is adapted to: input the current frame image into the scene segmentation network, wherein, for at least one convolutional layer in the scene segmentation network, the first convolution block of that convolutional layer is scaled using the scale coefficient output by the scale regression layer to obtain a second convolution block; a linear interpolation method is then used to sample feature vectors from the second convolution block to form a third convolution block; and a convolution operation is performed on the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer.

The generation module 540 is adapted to: obtain the scene segmentation result corresponding to the current frame image.

The determination module 550 is adapted to: determine the contour information of the specific object according to the scene segmentation result corresponding to the current frame image.

The processing module 560 is adapted to: add personalized special effects according to the contour information of the specific object to obtain a frame-processed image.

The overlay module 570 is adapted to: overlay the current frame image with the frame-processed image to obtain the processed video data.

The display module 580 is adapted to: display the processed video data.

After obtaining the processed video data, the display module 580 can display it in real time, so that the user can directly see the display effect of the processed video data.

The upload module 590 is adapted to upload the processed video data to a cloud server.

The upload module 590 may upload the processed video data directly to a cloud server. Specifically, the upload module 590 may upload the processed video data to one or more cloud video platform servers, such as those of iQiyi, Youku, or Kuai Video, so that the cloud video platform server can present the video data on the cloud video platform. Alternatively, the upload module 590 may upload the processed video data to a cloud live broadcast server; when a user at a live viewing end enters the cloud live broadcast server to watch, the cloud live broadcast server pushes the video data to the viewing user's client in real time. Alternatively, the upload module 590 may upload the processed video data to a cloud official account server; when a user follows the official account, the cloud official account server pushes the video data to the clients following the official account. Further, the cloud official account server may, according to the viewing habits of the users who follow the official account, push video data that matches those habits to the following clients.

According to the apparatus for real-time processing of video data based on scene segmentation provided in this embodiment, the convolution block is scaled according to the scale coefficient, which achieves adaptive scaling of the receptive field. In addition, a linear interpolation method is used to further process the scaled convolution block, which solves the problem of selecting feature vectors whose coordinates in the scaled convolution block are non-integer. The trained scene segmentation network obtains the scene segmentation result corresponding to a frame image of the video accurately and in real time, which effectively improves the accuracy and processing efficiency of image scene segmentation. Based on the obtained scene segmentation result, personalized special effects can be added to the frame image more precisely, which improves the display effect of the video data and optimizes the way the video data is processed.

The present invention also provides a non-volatile computer storage medium. The computer storage medium stores at least one executable instruction, and the executable instruction can perform the real-time processing method for video data based on scene segmentation in any of the above method embodiments.

FIG. 6 shows a schematic structural diagram of a computing device according to an embodiment of the present invention. The specific embodiments of the present invention do not limit the specific implementation of the computing device.

As shown in FIG. 6, the computing device may include: a processor 602, a communications interface 604, a memory 606, and a communication bus 608.

Wherein:

The processor 602, the communications interface 604, and the memory 606 communicate with one another through the communication bus 608.

The communications interface 604 is used to communicate with network elements of other devices, such as clients or other servers.

The processor 602 is configured to execute a program 610, and may specifically perform the relevant steps in the above embodiments of the real-time processing method for video data based on scene segmentation.

Specifically, the program 610 may include program code, and the program code includes computer operation instructions.

The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the computing device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs together with one or more ASICs.

The memory 606 is used to store the program 610. The memory 606 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

The program 610 may specifically be used to cause the processor 602 to perform the real-time processing method for video data based on scene segmentation in any of the above method embodiments. For the specific implementation of each step in the program 610, reference may be made to the descriptions of the corresponding steps and units in the above embodiments of real-time processing of video data based on scene segmentation, which are not repeated here. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices and modules described above may be found in the corresponding process descriptions in the foregoing method embodiments, which are not repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the above description. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given in order to disclose the best mode of the invention.

Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and placed in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be interpreted as names.

Claims

1. A real-time processing method for video data based on scene segmentation, the method being performed based on a trained scene segmentation network, the method comprising:

Acquiring in real time a current frame image containing a specific object from a video shot and/or recorded by an image acquisition device; or acquiring in real time a current frame image containing a specific object from a currently played video;

Inputting the current frame image into the scene segmentation network, wherein, for at least one convolutional layer in the scene segmentation network, a first convolution block of the convolutional layer is scaled using a scale coefficient output by a scale regression layer to obtain a second convolution block, and a linear interpolation method is then used to sample feature vectors from the second convolution block to form a third convolution block;

Performing a convolution operation on the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer; the scale regression layer is an intermediate convolutional layer of the scene segmentation network; the scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer;

Obtaining the scene segmentation result corresponding to the current frame image;

Determining the contour information of the specific object according to the scene segmentation result corresponding to the current frame image;

Adding personalized special effects according to the contour information of the specific object to obtain a frame-processed image;

Overlaying the current frame image with the frame-processed image to obtain the processed video data;

Displaying the processed video data;

Wherein, the training process of the scene segmentation network is completed through multiple iterations; in one iteration, a sample image is input into the scene segmentation network for training, wherein, for at least one convolutional layer in the scene segmentation network, the first convolution block of the convolutional layer is scaled using the scale coefficient output by the scale regression layer in the previous iteration or an initial scale coefficient to obtain the second convolution block, and a linear interpolation method is then used to sample feature vectors from the second convolution block to form the third convolution block;

A convolution operation is performed on the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer;

Wherein, the displaying the processed video data further includes: displaying the processed video data in real time;

The method further includes: uploading the processed video data to a cloud server.

2. The method according to claim 1, wherein the samples used in the scene segmentation network training comprise: a plurality of sample images stored in a sample library and the labeled scene segmentation results corresponding to the sample images.

3. The method according to claim 2, wherein, in one iteration, a sample image and the labeled scene segmentation result corresponding to the sample image are extracted from the sample library, and the scene segmentation network is trained using the sample image and the labeled scene segmentation result.

4. The method of claim 3, wherein an iterative process comprises:

Input the sample image to the scene segmentation network to obtain the sample scene segmentation result corresponding to the sample image;

According to the segmentation loss between the sample scene segmentation result and the labeled scene segmentation result, a scene segmentation network loss function is obtained, and the scene segmentation network is trained by using the scene segmentation network loss function.

5. The method according to claim 4, wherein the training step of the scene segmentation network comprises:

Extracting a sample image and the labeled scene segmentation result corresponding to the sample image from the sample library;

inputting the sample image into the scene segmentation network for training;

Obtain the sample scene segmentation result corresponding to the sample image;

According to the segmentation loss between the sample scene segmentation result and the labeled scene segmentation result, a scene segmentation network loss function is obtained, and the weight parameter of the scene segmentation network is updated according to the scene segmentation network loss function;

The training steps of the scene segmentation network are iteratively performed until a predetermined convergence condition is satisfied.

6. The method according to claim 5, wherein the predetermined convergence condition comprises: the number of iterations reaches a preset number of iterations; and/or the output value of the scene segmentation network loss function is less than a preset threshold.

7. The method according to claim 6, wherein the method further comprises: at the beginning of the scene segmentation network training, initializing the weight parameters of the scale regression layer.

8. The method according to claim 7, wherein the uploading of the processed video data to the cloud server further comprises:

Upload the processed video data to the cloud video platform server for the cloud video platform server to display the video data on the cloud video platform.

9. The method according to claim 8, wherein the uploading of the processed video data to the cloud server further comprises:

Upload the processed video data to the cloud live broadcast server, so that the cloud live broadcast server can push the video data to the viewing user client in real time.

10. The method according to claim 9, wherein the uploading of the processed video data to the cloud server further comprises:

Upload the processed video data to the cloud official account server, so that the cloud official account server can push the video data to the clients that follow the official account.

11. A device for real-time processing of video data based on scene segmentation, the device operates based on a trained scene segmentation network, the device comprising:

an acquisition module, adapted to acquire in real time the current frame image containing the specific object in the video shot and/or recorded by the image acquisition device; or, to acquire in real time the current frame image containing the specific object in the currently played video;

a segmentation module, adapted to input the current frame image into the scene segmentation network, wherein at least one convolutional layer in the scene segmentation network scales the first convolution block of the convolutional layer by using the scale coefficient output by the scale regression layer to obtain a second convolution block, and then uses a linear interpolation method to sample feature vectors from the second convolution block to form a third convolution block; a convolution operation is performed according to the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer; the scale regression layer is an intermediate convolutional layer of the scene segmentation network; the scale coefficient is a feature vector in the scale coefficient feature map output by the scale regression layer;

a generation module, adapted to obtain a scene segmentation result corresponding to the current frame image;

a determination module, adapted to determine the contour information of the specific object according to the scene segmentation result corresponding to the current frame image;

a processing module, adapted to add personalized special effects according to the contour information of the specific object to obtain a frame processed image;

an overlay module, adapted to overlay the frame processed image over the current frame image to obtain the processed video data;

a display module, adapted to display the processed video data;

Wherein, the training process of the scene segmentation network is completed through multiple iterations; in one iteration, a sample image is input into the scene segmentation network for training, wherein, in at least one convolutional layer of the scene segmentation network, the first convolution block of the convolutional layer is scaled by using the scale coefficient output by the scale regression layer in the previous iteration, or an initial scale coefficient, to obtain a second convolution block; a linear interpolation method is then used to sample feature vectors from the second convolution block to form a third convolution block; a convolution operation is performed according to the third convolution block and the convolution kernel of the convolutional layer to obtain the output result of the convolutional layer;

Wherein, the display module is further adapted to: display the processed video data in real time;

The device also includes:

an uploading module, adapted to upload the processed video data to the cloud server.
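The scale-adaptive convolution recited for the segmentation module can be illustrated with the sketch below. It rests on several assumptions that the claim does not state: a "convolution block" is read as the k×k grid of sampling positions around an output location, the scale coefficient is a single scalar per location taken from a scale map, bilinear interpolation is used as the linear interpolation method, and sampling positions outside the feature map are clamped to its border. The names `bilinear_sample` and `scale_adaptive_conv` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H, W, C) at fractional position (y, x)."""
    h, w, _ = feat.shape
    y = min(max(y, 0.0), h - 1.0)  # clamp to the feature map (assumption)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom

def scale_adaptive_conv(feat, kernel, scale_map):
    """feat: (H, W, C_in); kernel: (k, k, C_in, C_out); scale_map: (H, W)."""
    h, w, _ = feat.shape
    k = kernel.shape[0]
    r = k // 2
    out = np.zeros((h, w, kernel.shape[3]))
    for y in range(h):
        for x in range(w):
            s = scale_map[y, x]  # scale coefficient for this location
            for i in range(k):
                for j in range(k):
                    # Scaling the regular offset (i - r, j - r) by s gives a
                    # position of the second block; interpolating there
                    # yields one feature vector of the third block.
                    v = bilinear_sample(feat, y + s * (i - r), x + s * (j - r))
                    out[y, x] += v @ kernel[i, j]
    return out
```

With a scale coefficient of 1 everywhere the result matches an ordinary convolution at interior locations, while larger or smaller coefficients widen or narrow the receptive field at each location.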

12. The apparatus according to claim 11, wherein the samples used for the training of the scene segmentation network comprise: a plurality of sample images stored in a sample library and the labeled scene segmentation results corresponding to the sample images.

13. The apparatus according to claim 12, wherein the apparatus further comprises: a scene segmentation network training module;

The scene segmentation network training module is adapted to: in one iteration, extract a sample image and the labeled scene segmentation result corresponding to the sample image from the sample library, and use the sample image and the labeled scene segmentation result to train the scene segmentation network.

14. The apparatus according to claim 13, wherein the apparatus further comprises: a scene segmentation network training module;

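The scene segmentation network training module is adapted to: in one iteration, input the sample image into the scene segmentation network to obtain the sample scene segmentation result corresponding to the sample image; and obtain a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the labeled scene segmentation result, and train the scene segmentation network by using the scene segmentation network loss function.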

15. The apparatus according to claim 14, wherein the apparatus further comprises: a scene segmentation network training module;

The scene segmentation network training module includes:

an extraction unit, adapted to extract a sample image and the labeled scene segmentation result corresponding to the sample image from the sample library;

a training unit, adapted to input the sample images into the scene segmentation network for training;

an obtaining unit, adapted to obtain a sample scene segmentation result corresponding to the sample image;

an update unit, adapted to obtain a scene segmentation network loss function according to the segmentation loss between the sample scene segmentation result and the labeled scene segmentation result, and update the weight parameter of the scene segmentation network according to the scene segmentation network loss function;

The scene segmentation network training module runs iteratively until a predetermined convergence condition is satisfied.

16. The apparatus according to claim 15, wherein the predetermined convergence condition comprises: the number of iterations reaches a preset number of iterations; and/or the output value of the scene segmentation network loss function is less than a preset threshold.

17. The apparatus according to claim 16, wherein the scene segmentation network training module is further adapted to: initialize the weight parameters of the scale regression layer when the scene segmentation network training starts.

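18. The apparatus of claim 17, wherein the uploading module is further adapted to: upload the processed video data to the cloud video platform server for the cloud video platform server to display the video data on the cloud video platform.

19. The apparatus of claim 18, wherein the uploading module is further adapted to: upload the processed video data to the cloud live broadcast server, so that the cloud live broadcast server can push the video data to the viewing user client in real time.

20. The apparatus of claim 19, wherein the uploading module is further adapted to: upload the processed video data to the cloud official account server, so that the cloud official account server can push the video data to the clients that follow the official account.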

21. A computing device, comprising: a processor, a memory, a communication interface and a communication bus, the processor, the memory and the communication interface communicate with each other through the communication bus;

The memory is used to store at least one executable instruction, and the executable instruction causes the processor to perform an operation corresponding to the real-time processing method for video data based on scene segmentation according to any one of claims 1-10.

22. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to perform operations corresponding to the real-time processing method for video data based on scene segmentation according to any one of claims 1-10.