CN110392264B - Alignment extrapolation frame method based on neural network - Google Patents
Alignment extrapolation frame method based on neural network
- Publication number
- CN110392264B (application CN201910790385.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- block
- alignment
- blocks
- extrapolation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/109—Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/147—Data rate or code amount at the encoder output according to rate distortion criteria
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/587—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
Abstract
Description
Technical Field
The present invention relates to the fields of digital image processing and video coding, and in particular to a neural-network-based aligned frame extrapolation method.
Background Art
As is well known, a digital video consists of a series of digital images (frames) ordered in time. Predicting future frames from past frames is called frame extrapolation, i.e., inferring the content of a future frame from the content and motion trends of past frames. With the development of convolutional neural networks (CNNs), neural-network-based frame extrapolation has made remarkable progress. A common application of frame extrapolation is video coding: an extrapolated frame gives a more accurate prediction of the future frame from past (reference) frames, and when the future frame is encoded, an accurate prediction reduces inter-frame redundancy and improves video coding efficiency.
The following works study frame extrapolation, inter-frame prediction, and the use of frame extrapolation in video coding:
M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.
J. Lin, D. Liu, H. Li, and F. Wu, "Generative adversarial network-based frame extrapolation for video coding," in VCIP, IEEE, 2018.
Disadvantages of the above methods:
1. Owing to the complexity and diversity of motion patterns in natural video, it remains difficult to infer high-quality future frames directly from past frames without first processing the past frames.
2. Because frames inferred directly from past frames are not of sufficiently high quality, and direct extrapolation struggles with the complex motion found in video, applying direct extrapolation to video coding yields only a limited improvement in coding efficiency.
Summary of the Invention
The object of the present invention is to provide a neural-network-based aligned frame extrapolation method that improves the quality of extrapolated frames by aligning past frames.
The object of the present invention is achieved through the following technical solution:
A neural-network-based aligned frame extrapolation method, comprising:
dividing a target frame into M blocks of a specified size, selecting the N past frames closest to and contiguous with the target frame, and, for each block in the target frame, finding the most similar block in each past frame for alignment, thereby obtaining N aligned blocks, wherein N and M are preset natural numbers;
for each block in the target frame, feeding the N aligned blocks into a multi-scale residual network, learning the difference between the target frame and each past frame, and predicting one extrapolated block;
stitching all extrapolated blocks together according to their positions to obtain the aligned extrapolated frame.
It can be seen from the above technical solution that processing the past frames with the alignment operation reduces the diversity of the extrapolation network's input, and learning the difference between the target frame and the past frames significantly lowers the difficulty of extrapolation; in addition, the extrapolated frame can be applied in video coding to improve coding efficiency.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is an overall framework diagram of a neural-network-based aligned frame extrapolation method provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a neural-network-based aligned frame extrapolation method, comprising:
1. Divide the target frame into M blocks of a specified size, select the N past frames closest to and contiguous with the target frame, and, for each block in the target frame, find the most similar block in each past frame for alignment, obtaining N aligned blocks. This step increases the correlation between the target frame and the past frames and, at the same time, completely removes block-level translation between the past frames, greatly reducing the diversity between the target frame and the past frames. M is a natural number whose value is determined by the resolution of the target frame (frames of different resolutions are divided into different numbers of blocks), and the target frame is the frame to be predicted.
2. For each block in the target frame, feed the N aligned blocks into a multi-scale residual network, learn the difference between the target frame and each past frame, and predict one extrapolated block. This step uses residual learning so that the network focuses on the subtle differences between the target block and the aligned blocks, including complex motion, texture changes and noise, which reduces the difficulty of extrapolation. For each block in the target frame, step 1 yields N aligned blocks from the N past frames, and these N aligned blocks are fed to the network to predict one extrapolated block.
3. Stitch all extrapolated blocks together according to their positions to obtain the aligned extrapolated frame; that is, if a target block starts at coordinates (x, y) in the target frame I_t, the corresponding extrapolated block also starts at coordinates (x, y) in the extrapolated frame.
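As an illustration of steps 1 to 3, the following sketch shows the block-wise loop and how each extrapolated block is pasted back at the same (x, y) position as its target block; the function names, the callable `extrapolate_block` standing in for the alignment and network stages described below, and the assumption that the frame dimensions are multiples of the block size are illustrative and not part of the claimed method.

```python
import numpy as np

BLOCK = 64  # block size; the embodiment uses the HEVC CTU size


def extrapolate_frame(past_frames, frame_shape, extrapolate_block):
    """Build the aligned extrapolated frame block by block.

    past_frames       : list of the N past frames (2-D arrays), most recent first
    frame_shape       : (height, width) of the target frame, assumed multiples of BLOCK
    extrapolate_block : callable (past_frames, y, x) -> BLOCK x BLOCK prediction,
                        i.e. alignment + network inference for one target block
    """
    prediction = np.zeros(frame_shape, dtype=np.float32)
    for y in range(0, frame_shape[0], BLOCK):
        for x in range(0, frame_shape[1], BLOCK):
            # The extrapolated block is pasted at the same (x, y) as its target block.
            prediction[y:y + BLOCK, x:x + BLOCK] = extrapolate_block(past_frames, y, x)
    return prediction
```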
In an embodiment of the present invention, the block partitioning may be consistent with the coding tree unit (CTU) partitioning of the HEVC coding standard, with the block size equal to the CTU size (64x64).
For ease of understanding, the present invention is further described below with reference to the framework shown in FIG. 1.
I. Alignment method.
In an embodiment of the present invention, N past frames (previous frames) are selected. The target frame is denoted I_t, and the N consecutive past frames closest to the target frame are denoted I_{t-1}, I_{t-2}, ..., I_{t-N}, where N is a preset natural number. For example, with N = 4 the past frames are I_{t-1}, I_{t-2}, I_{t-3}, I_{t-4}.
Alignment consists of two steps:
1. The first step aligns the target frame I_t with frame I_{t-1} to obtain the aligned block of frame I_{t-1}.
As shown in FIG. 1, the first step uses either of the following two schemes:
The first scheme (Scheme 1 in FIG. 1): take the original values of the target block of the target frame I_t as the template for motion estimation (ME), perform integer-pel motion estimation in frame I_{t-1}, and take the block with the smallest mean absolute error as the aligned block of frame I_{t-1}.
Those skilled in the art will understand that the target block is the block of the target frame that is currently being processed, and "original values" is a coding term denoting the original image values of that target block.
The second scheme (Scheme 2 in FIG. 1): based on the assumption that the motion between the target frame I_t and frame I_{t-1} is smaller than a preset value (i.e., the motion is small enough), take the co-located block as the aligned block of frame I_{t-1}; that is, if the target block starts at coordinates (x, y) in the target frame I_t, the aligned block is the image block starting at coordinates (x, y) in frame I_{t-1}.
2. The second step, starting from the aligned block of frame I_{t-1}, aligns the target frame I_t with the N past frames I_{t-1}, I_{t-2}, ..., I_{t-N}, obtaining the aligned blocks of frames I_{t-2} through I_{t-N}.
Whichever scheme is used in the first step, the second step uses the same alignment as the first scheme of the first step, namely alignment by motion estimation (see FIG. 1):
For the target block in the target frame I_t, the aligned block of frame I_{t-1} obtained in the first step is used as the motion estimation template, integer-pel motion estimation is performed in frame I_{t-2}, and the block with the smallest mean absolute error is taken as the aligned block of frame I_{t-2}. This procedure is repeated, the aligned block of each past frame becoming the motion estimation template for the next earlier frame, until the aligned block of frame I_{t-(N-1)} is used as the template for integer-pel motion estimation in frame I_{t-N} and the block with the smallest mean absolute error becomes the aligned block of frame I_{t-N}. In this way N aligned blocks are obtained.
The above alignment operation keeps the object in the target block at the center of the network input block (i.e., the aligned block), so the network does not have to spend a great deal of effort searching for the target object internally. In an embodiment of the present invention, using motion estimation in both the first and the second step (i.e., the first scheme) is abbreviated MEA (ME Alignment); using the co-located position in the first step and motion estimation in the second step (i.e., the second scheme) is abbreviated ColMEA (Co-located ME Alignment).
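A minimal sketch of this alignment follows; the exhaustive search within a fixed `search_range` window is an illustrative assumption (the description only requires integer-pel motion estimation with the mean absolute error criterion), and frames are assumed to be 2-D luma arrays.

```python
import numpy as np


def best_match(template, frame, cy, cx, search_range=16):
    """Integer-pel motion estimation: return the top-left (y, x) of the block in
    `frame` with the smallest mean absolute error w.r.t. `template`, searching a
    window of +/- search_range pixels around (cy, cx)."""
    bh, bw = template.shape
    h, w = frame.shape
    t = template.astype(np.float32)
    best, best_err = (cy, cx), np.inf
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0 or y + bh > h or x + bw > w:
                continue
            err = np.mean(np.abs(frame[y:y + bh, x:x + bw].astype(np.float32) - t))
            if err < best_err:
                best_err, best = err, (y, x)
    return best


def align_past_frames(target_block, by, bx, past_frames, use_me_first_step=True):
    """Return the N aligned blocks for one target block whose top-left is (by, bx).

    past_frames ordered [I_{t-1}, I_{t-2}, ..., I_{t-N}].
    use_me_first_step=True  -> MEA    (step 1 also uses motion estimation)
    use_me_first_step=False -> ColMEA (step 1 takes the co-located block in I_{t-1})
    """
    bh, bw = target_block.shape
    # Step 1: align against I_{t-1}, either by ME or co-located.
    y, x = best_match(target_block, past_frames[0], by, bx) if use_me_first_step else (by, bx)
    aligned = [past_frames[0][y:y + bh, x:x + bw]]
    # Step 2: chain backwards; each aligned block becomes the ME template
    # for the next earlier frame (I_{t-2}, ..., I_{t-N}).
    for frame in past_frames[1:]:
        y, x = best_match(aligned[-1], frame, y, x)
        aligned.append(frame[y:y + bh, x:x + bw])
    return aligned
```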
In addition, to avoid the loss of block-edge information caused by block partitioning, after the N aligned blocks are obtained, the reconstructed pixels surrounding each aligned block are padded around it, giving the final N aligned blocks.
For example, the target frame is first divided into M 64x64 blocks. For each block, 4 aligned blocks must be found in the 4 past frames (N = 4), finally giving M groups of aligned blocks, each containing 4 aligned blocks. Regarding block size: since the target blocks are 64x64, alignment is also performed at 64x64. Before being fed to the network, each of the 4 aligned blocks is padded with the surrounding reconstructed pixels, 32 pixels on each of the top, bottom, left and right sides, finally giving four 128x128 aligned blocks as the network input. The network outputs one 128x128 block, and the 64x64 block at its center is cropped out as the final extrapolated block, whose size matches that of the target block.
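The padding and cropping of this example can be sketched as follows; the aligned block's top-left corner (y, x) in its reconstructed past frame is assumed known from the alignment step, and edge replication is assumed where the 32-pixel surround would extend beyond the frame border (a boundary-handling choice not specified above).

```python
import numpy as np

PAD = 32    # reconstructed pixels added on each side
BLOCK = 64


def pad_aligned_block(recon_frame, y, x):
    """Return the 128x128 network input: the 64x64 aligned block at (y, x) in the
    reconstructed past frame plus 32 surrounding reconstructed pixels."""
    padded = np.pad(recon_frame, PAD, mode="edge")      # guards the frame border
    return padded[y:y + BLOCK + 2 * PAD, x:x + BLOCK + 2 * PAD]


def crop_center(network_output):
    """Crop the central 64x64 extrapolated block from the 128x128 network output."""
    return network_output[PAD:PAD + BLOCK, PAD:PAD + BLOCK]
```

With BLOCK = 64 and PAD = 32 this reproduces the 128x128 network input and the 64x64 output crop of the example above.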
II. Multi-scale residual network.
The multi-scale structure fully captures motion at different resolutions, and the residual network learns the difference between the input image blocks and the target image block more efficiently, which suits the aligned extrapolation task.
The multi-scale residual network contains Q scales. For ease of drawing, FIG. 1 takes Q = 4 as an example and schematically shows a network structure with four scales.
In an embodiment of the present invention, the resolutions of the scales are denoted s_1, s_2, ..., s_Q in order. The resolution of scale k is half that of scale k+1, i.e., s_k = s_{k+1}/2 for k = 1, 2, ..., Q, where s_{Q+1} (the case k = Q) is a preset resolution. Taking Q = 4 as an example, the resolutions are set to s_1 = 16, s_2 = 32, s_3 = 64, s_4 = 128.
Starting from the lowest resolution s_1, the multi-scale residual network makes a series of predictions: the prediction at size s_k serves as the starting point for the prediction at size s_{k+1}; the result is repeatedly upsampled and the learned residual is added at the next finer scale (i.e., scale k+2), until the full-resolution image is reached. In the example above, the full-resolution image is the 128x128 image.
In an embodiment of the present invention, the network and the extrapolated-block predictions it makes are defined recursively:

$$\hat{Y}_k = u_k\big(\hat{Y}_{k-1}\big) + G_k\big(X_k,\ u_k(\hat{Y}_{k-1})\big), \qquad k = 1, 2, \ldots, Q,$$

where $\hat{Y}_k$ denotes the extrapolated block predicted at the k-th scale. In particular, the network takes only the downsampled X_1 as its smallest-size input, i.e., for k = 1 the term $u_1(\hat{Y}_0)$ is taken to be X_1. X denotes the N aligned blocks {X_{t-1}, X_{t-2}, ..., X_{t-N}}, i.e., the aligned blocks of I_{t-1}, I_{t-2}, ..., I_{t-N} respectively; X_k denotes X downsampled to resolution s_k; G_k denotes the sub-network at the k-th scale, which learns the residual from X_k and $u_k(\hat{Y}_{k-1})$; and u_k denotes the upsampling operation toward resolution s_k.
The total network loss is the sum of the losses over the Q scales:

$$\mathcal{L} = \sum_{k=1}^{Q} \big\| \hat{Y}_k - Y_k \big\|_1,$$

where Y_k denotes the target block downsampled to resolution s_k, and $\|\cdot\|_1$ denotes the l1 loss, a loss function commonly used in deep learning.
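A minimal sketch of this coarse-to-fine recursion and the total loss; `subnets`, `upsample` and `target_pyramid` are placeholder names (the actual sub-networks G_k are learned convolutional networks, the resizing u_k is e.g. bicubic, and the target pyramid holds the target block downsampled to each scale), and the smallest-scale sub-network here predicts directly from X_1, matching the statement above that the network takes only the downsampled X_1 as its smallest-size input.

```python
def multiscale_extrapolate(x_pyramid, subnets, upsample):
    """Coarse-to-fine prediction over Q scales.

    x_pyramid : [X_1, ..., X_Q], the aligned blocks downsampled to s_1 < ... < s_Q
                (arrays, e.g. NumPy)
    subnets   : [G_1, ..., G_Q]; G_1(X_1) predicts at the smallest scale, while
                G_k(X_k, coarse) for k > 1 predicts the residual w.r.t. `coarse`
    upsample  : upsample(img, size) resizes img to `size`, standing in for u_k
    Returns [Y_hat_1, ..., Y_hat_Q]; Y_hat_Q is the full-resolution extrapolated block.
    """
    y_hat = subnets[0](x_pyramid[0])              # smallest scale: input is X_1 only
    predictions = [y_hat]
    for x_k, g_k in zip(x_pyramid[1:], subnets[1:]):
        coarse = upsample(y_hat, x_k.shape[:2])   # u_k(Y_hat_{k-1})
        y_hat = coarse + g_k(x_k, coarse)         # add the learned residual
        predictions.append(y_hat)
    return predictions


def multiscale_l1_loss(predictions, target_pyramid):
    """Total training loss: sum of l1 losses over the Q scales."""
    return sum(abs(p - t).mean() for p, t in zip(predictions, target_pyramid))
```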
In the training stage of the multi-scale residual network, original uncompressed videos are used. Each time, the N past frames of a target frame are selected and block-wise alignment is performed; for each block, the resulting N aligned blocks serve as the network input and the original values of the target block serve as the label. In particular, the MEA alignment is used during training, i.e., the first alignment step uses ME, so as to produce the most accurate aligned blocks. The aligned blocks are downsampled by bicubic interpolation. Only the Y component is used for training and only one model is obtained; since the network is fully convolutional, the trained model can be used for different resolutions.
Those skilled in the art will understand that YUV comprises three channels, Y, U and V. Considering each channel separately, one YUV frame can be split into three images (a Y image, a U image and a V image), and the block-wise extrapolation described above operates on a single channel. For example, block-aligned extrapolation is performed on the Y channel and the extrapolated blocks are stitched into a Y-channel extrapolated frame; extrapolation is likewise performed on the U and V components to obtain U and V extrapolated frames; and the extrapolated frames of the Y, U and V channels are merged to obtain the extrapolated image in YUV format. In other words, YUV data is three-dimensional, with channel, height and width dimensions; extrapolating a given component operates along the channel dimension, while block partitioning operates along the height and width dimensions.
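As a small illustration, assuming planar storage (each channel kept as a separate 2-D array) and a block-wise extrapolation routine for a single plane such as the one sketched earlier:

```python
def extrapolate_yuv(past_yuv_frames, extrapolate_plane):
    """past_yuv_frames  : list of N past frames, each a dict {'Y': ..., 'U': ..., 'V': ...}
    extrapolate_plane : block-wise aligned extrapolation applied to one channel
                        (list of N past planes -> one extrapolated plane)

    Each channel is extrapolated independently along the channel dimension, then the
    three extrapolated planes are recombined into one extrapolated frame in YUV format."""
    return {ch: extrapolate_plane([f[ch] for f in past_yuv_frames])
            for ch in ("Y", "U", "V")}
```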
III. Applying aligned extrapolated frames in video coding.
In an embodiment of the present invention, the obtained aligned extrapolated frame is applied in video coding in at least two schemes:
In the first, the aligned extrapolated frame is used directly as the motion-compensated prediction result in inter prediction; this scheme is abbreviated MCP.
In the second, the aligned extrapolated frame is used as a new reference frame and conventional motion compensation is performed on this new reference frame; this scheme is abbreviated REF.
For the first method, MCP, aligned frame extrapolation is treated as a new inter prediction mode, applied only to coding units (CUs) of size 32x32 and larger. This mode competes with the other inter prediction modes through rate-distortion optimization (RDO), and the optimal mode is selected for coding. In the bitstream structure, for CUs of 32x32 and larger, a flag is transmitted to indicate whether the CU uses the MCP mode.
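The mode decision can be sketched with the usual Lagrangian rate-distortion cost J = D + λ·R; the candidate list and its distortion and rate values are placeholders for the encoder's actual RDO machinery, and for CUs of 32x32 and larger the MCP candidate's rate would include the signalling flag.

```python
def choose_inter_mode(candidate_modes, lam):
    """Pick the inter prediction mode with the smallest RD cost J = D + lambda * R.

    candidate_modes : iterable of (name, distortion, rate_bits); for CUs of 32x32
                      and larger it also contains the MCP (aligned extrapolation) mode.
    lam             : the Lagrange multiplier used by the encoder.
    """
    best_mode, best_cost = None, float("inf")
    for name, distortion, rate_bits in candidate_modes:
        cost = distortion + lam * rate_bits
        if cost < best_cost:
            best_mode, best_cost = name, cost
    return best_mode
```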
For the second method, REF, the aligned frame extrapolation result is used as a new reference frame. Following the conventional coding framework, motion compensation is performed on both the existing reference frames and the extrapolated reference frame, so that the encoder can select the most accurate prediction for each partitioned prediction unit (PU); the applicable PU sizes range from 4x4 to 64x64. In particular, the extrapolated block size is set equal to the CTU size, i.e., 64x64. Because of the input padding, the network output block is twice the CTU size, i.e., 128x128, including the generated padding pixels. To build the HEVC reference, the entire network output block is used as the reference block at the corresponding CTU position, so that the boundary between the extrapolated reference CTU and the surrounding pixels remains continuous during motion estimation and motion compensation, and this reference-block strategy is repeated when compressing the next CTU. Because of the special YUV420 format, the luma and chroma channels are processed separately. The alignment process is performed only on the Y component, and the Y component is extrapolated by the network first; in the extrapolation task the Y component shares the same characteristics as the U and V components, such as the same motion trend, so the U and V aligned images corresponding to the Y component are also fed to the network model for extrapolation, giving the U and V extrapolated images respectively. The extrapolated frame is generated by the same method at the encoder and the decoder, ensuring that the encoder and decoder match.
IV. Combining the alignment methods with the two application schemes.
In an embodiment of the present invention, the two alignment methods and the two application schemes are cross-combined and integrated into the HEVC coding standard, giving the following four combinations: MCP+ColMEA, MCP+MEA, REF+ColMEA and REF+MEA.
The basic implementations of the MCP and REF schemes have been described above. For the MCP+ColMEA combination, since the prediction obtained with ColMEA requires no additional information to be transmitted and the inter prediction value can be obtained directly from frame extrapolation, the information needed to derive the prediction in conventional inter modes can be omitted, including the motion vector (MV) and reference frame index in AMVP mode and the Merge Flag and Merge Index in Merge mode.
For the MCP+MEA combination, the alignment information must be transmitted. The conventional MV coding module is reused: the integer component represents the alignment information, and the fractional component indicates sub-pixel interpolation of the obtained extrapolated block, which makes full use of the MV structure to generate a more accurate inter prediction.
For the REF+ColMEA combination, since ColMEA changes nothing in the bitstream structure, it suffices to follow the conventional coding framework with the new reference frame added.
For the REF+MEA combination, a better motion vector predictor (MVP) given by the surrounding PUs and very close to the aligned MV is used to replace the aligned MV. Transmitting the MVP is cheaper than transmitting the MV, and only the aligned MVPs of extrapolated blocks actually referenced by PUs are encoded, to further save bits.
In the above scheme of the embodiment of the present invention, past frames are processed with the alignment operation, which reduces the diversity of the extrapolation network's input; learning the difference between the target frame and the past frames significantly lowers the difficulty of extrapolation; in addition, the extrapolated frame can be applied in video coding to improve coding efficiency.
Furthermore, tests were carried out to demonstrate the performance of the present invention. Extrapolation is mainly applicable to the low-delay configurations in coding.
The test conditions include: 1) inter configuration: Low-delay B (LDB) and Low-delay P (LDP); 2) the base quantization parameters (QP) are set to {27, 32, 37, 42}, the software base is HM12.0, and the test sequences are the HEVC standard test sequences.
Those skilled in the art will understand that in the coding field performance must be tested on the prescribed standard test sequences, which are divided into several classes (Class A to F); in the tests below, the results for each class of test sequences are reported in accordance with the standard test requirements.
1) Coding performance comparison of the four combinations of alignment method and extrapolation application
Table 1 compares the performance of the four combinations under the LDP configuration. The table shows that the best scheme is REF+ColMEA, which can therefore be taken as the preferred scheme.
Table 1. Performance comparison of the four combinations under the LDP configuration
Table 2 gives the complete performance of the preferred scheme. As Table 2 shows, relative to HM12.0, the above scheme of the embodiment of the present invention achieves bit-rate savings of 5.3% and 2.8% in the LDP and LDB configurations, respectively.
Table 2. Complete performance of the preferred scheme
2) Test of the effect of aligned extrapolated frames
Table 3 compares, for the preferred scheme under the LDP configuration, the coding performance of conventional frame extrapolation without alignment and of the aligned frame extrapolation proposed by the present invention. Extrapolation without alignment achieves only a 2.2% bit-rate saving, whereas aligned extrapolation achieves a 5.3% bit-rate saving. Compared with simple frame extrapolation, the proposed alignment improves coding performance far more significantly. Moreover, it is observed that the extrapolated frames produced by the proposed alignment method are sharper and of better visual quality than those of the conventional method.
Table 3. Effect of aligned extrapolated frames
From the description of the above embodiments, those skilled in the art can clearly understand that the above embodiments may be implemented by software, or by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the above embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a removable hard disk) and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are merely preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910790385.9A CN110392264B (en) | 2019-08-26 | 2019-08-26 | Alignment extrapolation frame method based on neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910790385.9A CN110392264B (en) | 2019-08-26 | 2019-08-26 | Alignment extrapolation frame method based on neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110392264A CN110392264A (en) | 2019-10-29 |
CN110392264B true CN110392264B (en) | 2022-10-28 |
Family
ID=68289308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910790385.9A Active CN110392264B (en) | 2019-08-26 | 2019-08-26 | Alignment extrapolation frame method based on neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110392264B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111212287A (en) * | 2020-01-15 | 2020-05-29 | 济南浪潮高新科技投资发展有限公司 | Video compression method based on image interpolation method |
- 2019-08-26: CN application CN201910790385.9A granted as patent CN110392264B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102917220A (en) * | 2012-10-18 | 2013-02-06 | 北京航空航天大学 | Dynamic background video object extraction based on hexagon search and three-frame background alignment |
CN107105278A (en) * | 2017-04-21 | 2017-08-29 | 中国科学技术大学 | The coding and decoding video framework that motion vector is automatically generated |
CN107734333A (en) * | 2017-09-29 | 2018-02-23 | 杭州电子科技大学 | A kind of method for improving video error concealing effect using network is generated |
CN108289224A (en) * | 2017-12-12 | 2018-07-17 | 北京大学 | A kind of video frame prediction technique, device and neural network is compensated automatically |
JP2019128889A (en) * | 2018-01-26 | 2019-08-01 | 日本放送協会 | Image information converter and program therefor |
CN109151474A (en) * | 2018-08-23 | 2019-01-04 | 复旦大学 | A method of generating new video frame |
CN110070511A (en) * | 2019-04-30 | 2019-07-30 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
Ze-kun Cao, Wei-xing Qian, Zhen-yu Zhang, Xu-dong Liu, Yun Ma, et al. "On a Specific Force Based Transfer Alignment Method of Strapdown Navigation System in Targeting Pod." 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC), 2018. * |
Shi Dongfeng, Huang Jian, Yuan Ke'e, Wang Yingjian, Xie Chenbo, Liu Dong, Zhu Wenyue. "Spatially Encoded Multiplexing Speckle Multi-Information Fusion Correlation Imaging (Invited)." Infrared and Laser Engineering, 2018. * |
Also Published As
Publication number | Publication date |
---|---|
CN110392264A (en) | 2019-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI617185B (en) | Method and apparatus of video coding with affine motion compensation | |
US11070834B2 (en) | Low-complexity method for generating synthetic reference frames in video coding | |
EP3783897A1 (en) | Image motion compensation method and device | |
CN113315974B (en) | Video decoder and method | |
TW202005389A (en) | Weighted interweaved prediction | |
CN108495135B (en) | Quick coding method for screen content video coding | |
CN111131830B (en) | Improvement of overlapped block motion compensation | |
JP7590337B2 (en) | Method and apparatus for prediction refinement using optical flow for affine coded blocks - Patents.com | |
CN113383550A (en) | Early termination of optical flow modification | |
US11115678B2 (en) | Diversified motion using multiple global motion models | |
EP4099696A1 (en) | Video decoding method, video coding method, electronic device, and storage medium | |
JP7384939B2 (en) | A method for calculating the position of integer grid reference samples for block-level boundary sample gradient calculations in bi-prediction optical flow calculations and bi-prediction corrections. | |
CN113366831B (en) | Coordination between overlapped block motion compensation and other tools | |
CN116569552A (en) | Method and system for inter prediction compensation | |
CN114466192A (en) | Image/video super-resolution | |
Lei et al. | Deep multi-domain prediction for 3D video coding | |
CN113597769A (en) | Video inter-frame prediction based on optical flow | |
CN112601095B (en) | Method and system for creating fractional interpolation model of video brightness and chrominance | |
CN113615194A (en) | DMVR using decimated prediction blocks | |
CN110392264B (en) | Alignment extrapolation frame method based on neural network | |
CN112565767B (en) | Video decoding method, video encoding method and related equipment | |
WO2024006167A1 (en) | Inter coding using deep learning in video compression | |
CN116684622A (en) | A Feature Space Context Video Compression Method and System Based on Optical Flow Guidance | |
CN118402236A (en) | Method, apparatus and medium for video processing | |
US12184863B2 (en) | Motion compensation with a sparse optical flow representation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |