CN110392264A - An Alignment and Extrapolation Frame Method Based on Neural Network - Google Patents
An Alignment and Extrapolation Frame Method Based on Neural Network
- Publication number
- CN110392264A (application CN201910790385.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- alignment
- block
- extrapolation
- blocks
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000013213 extrapolation Methods 0.000 title claims abstract description 66
- 238000000034 method Methods 0.000 title claims abstract description 64
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 17
- 238000012360 testing method Methods 0.000 description 11
- 238000012549 training Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/109—Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/147—Data rate or code amount at the encoder output according to rate distortion criteria
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/587—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Description
Technical Field
The present invention relates to the fields of digital image processing and video coding, and in particular to a neural-network-based method for aligned frame extrapolation.
Background Art
As is well known, digital video consists of a series of digital images (frames) ordered in time. Predicting a future frame from past frames is called frame extrapolation: the content of the future frame is inferred from the content and motion trend of the past frames. With the development of convolutional neural networks (CNNs), neural-network-based frame extrapolation has made remarkable progress. A common application of frame extrapolation is video coding, because extrapolation can obtain a more accurate prediction of a future frame from past (reference) frames; when the future frame is encoded, an accurate prediction reduces inter-frame redundancy and improves coding efficiency.
The following works study frame extrapolation, inter-frame prediction, and the use of frame extrapolation in video coding:
Deep multi-scale video prediction beyond mean square error (M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.)
Generative adversarial network-based frame extrapolation for video coding (J. Lin, D. Liu, H. Li, and F. Wu, "Generative adversarial network-based frame extrapolation for video coding," in VCIP. IEEE, 2018.)
Disadvantages of the above methods:
1. Due to the complexity and diversity of motion patterns in natural video, it remains difficult to infer high-quality future frames directly from past frames without first processing the past frames.
2. Because the quality of frames inferred directly from past frames is not high enough, and direct extrapolation struggles to handle complex motion in video, applying direct extrapolation to video coding brings only a limited improvement in coding efficiency.
Summary of the Invention
The purpose of the present invention is to provide a neural-network-based aligned frame extrapolation method that improves the quality of the extrapolated frame by aligning the past frames.
The purpose of the present invention is achieved through the following technical solution:
A neural-network-based aligned frame extrapolation method, comprising:
dividing the target frame into M blocks of a specified size and selecting the N consecutive past frames closest to the target frame; for each block of the target frame, finding the most similar block in each past frame and aligning to it, thereby obtaining N aligned blocks, where N and M are preset natural numbers;
for each block of the target frame, feeding the N aligned blocks into a multi-scale residual network that learns the differences between the target frame and each past frame and predicts one extrapolated block;
splicing all extrapolated blocks according to their positions to obtain the aligned extrapolated frame.
As can be seen from the technical solution provided above, processing the past frames with an alignment operation reduces the diversity of the extrapolation network's input, and learning the difference between the target frame and the past frames significantly reduces the difficulty of extrapolation; in addition, the extrapolated frame can be applied in video coding to improve coding efficiency.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is an overall framework diagram of a neural-network-based aligned frame extrapolation method provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a neural-network-based aligned frame extrapolation method, comprising the following steps (summarized by the sketch after step 3):
1. Divide the target frame into M blocks of a specified size and select the N consecutive past frames closest to the target frame; for each block of the target frame, find the most similar block in each past frame and align to it, obtaining N aligned blocks. This step increases the correlation between the target frame and the past frames and, at the same time, completely removes block-level translation between the past frames, greatly reducing the diversity between the target frame and the past frames. M is a natural number determined by the resolution of the target frame (frames of different resolutions are divided into different numbers of blocks), and the target frame is the frame to be predicted.
2. For each block of the target frame, feed the N aligned blocks into the multi-scale residual network, which learns the differences between the target frame and each past frame and predicts one extrapolated block. This step uses residual learning so that the network focuses on the subtle differences between the target block and the aligned blocks, including complex motion, texture changes and noise, which lowers the difficulty of extrapolation. For each block of the target frame, step 1 finds N aligned blocks in the N past frames; these N aligned blocks are fed into the network to predict one extrapolated block.
3. Splice all extrapolated blocks according to their positions to obtain the aligned extrapolated frame: if a target block starts at coordinates (x, y) in the target frame It, the corresponding extrapolated block also starts at (x, y) in the extrapolated frame.
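The three steps above can be summarized by the following minimal sketch (Python/NumPy). The helpers align_blocks and extrapolate_block are assumed placeholders for the alignment step and the multi-scale residual network described below; the 64x64 block size and N = 4 follow the embodiment.

```python
import numpy as np

BLOCK = 64   # block size, consistent with the HEVC CTU size used in the embodiment
N = 4        # number of past frames

def aligned_frame_extrapolation(past_frames, align_blocks, extrapolate_block):
    """past_frames: list of N 2-D arrays ordered I_{t-1}, I_{t-2}, ..., I_{t-N}.
    align_blocks(past_frames, y, x) -> list of N aligned blocks for the target
    block at (y, x); extrapolate_block(aligned) -> one BLOCK x BLOCK prediction.
    Both are assumed helpers standing in for the steps described below."""
    h, w = past_frames[0].shape
    out = np.zeros((h, w), dtype=past_frames[0].dtype)
    for y in range(0, h, BLOCK):                  # step 1: process block by block
        for x in range(0, w, BLOCK):
            aligned = align_blocks(past_frames, y, x)                    # N aligned blocks
            out[y:y + BLOCK, x:x + BLOCK] = extrapolate_block(aligned)   # step 2
    return out                                    # step 3: blocks spliced in place
```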
In the embodiment of the present invention, the block partition may be consistent with the coding tree unit (CTU) partition of the HEVC coding standard, with the block size equal to the CTU size (64x64).
For ease of understanding, the present invention is further described below with reference to the framework shown in FIG. 1.
1. Alignment method.
In the embodiment of the present invention, N past frames (previous frames) are selected. The target frame is denoted It, and the N consecutive past frames closest to it are denoted It-1, It-2, ..., It-N, where N is a preset natural number. For example, with N = 4, the past frames are It-1, It-2, It-3, It-4.
Alignment consists of two steps:
1. The first step aligns the target frame It with frame It-1 to obtain the aligned block of frame It-1.
As shown in FIG. 1, the first step uses either of the following two schemes:
The first scheme (Scheme 1 in FIG. 1): take the original value of the target block of the target frame It as the motion estimation (ME) template, perform integer-pixel motion estimation in frame It-1, and take the block with the smallest mean absolute error as the aligned block of frame It-1. A sketch of this search is given after the next paragraph.
Those skilled in the art will understand that the target block is the block of the target frame that is currently being processed, and the original value is a coding term referring to the original image values of that target block.
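A minimal NumPy sketch of the integer-pixel motion estimation used by Scheme 1; the search range around the target block position is an assumption, since the patent does not specify one.

```python
import numpy as np

def best_match(template, ref_frame, cy, cx, search=16):
    """Return the block of ref_frame whose mean absolute error against
    `template` is smallest, searched over a (2*search+1)^2 window of
    integer-pixel offsets around position (cy, cx)."""
    bh, bw = template.shape
    h, w = ref_frame.shape
    best, best_mae = None, np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = cy + dy, cx + dx
            if y < 0 or x < 0 or y + bh > h or x + bw > w:
                continue                      # skip candidates outside the frame
            cand = ref_frame[y:y + bh, x:x + bw]
            mae = np.mean(np.abs(cand.astype(np.float64) - template))
            if mae < best_mae:
                best, best_mae = cand, mae
    return best
```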
The second scheme (Scheme 2 in FIG. 1): based on the assumption that the motion between the target frame It and frame It-1 is smaller than a set value (i.e., the motion is small enough), use a fixed, co-located position to obtain the aligned block of frame It-1; that is, if the target block starts at coordinates (x, y) in the target frame It, the aligned block is the image block starting at (x, y) in frame It-1.
2. The second step, based on the aligned block of frame It-1, aligns the target frame It with the N past frames It-1, It-2, ..., It-N to obtain the aligned blocks of frames It-2 through It-N.
Whichever scheme is used in the first step, the second step always uses the alignment of the first scheme, i.e., motion estimation (see FIG. 1):
For the target block of the target frame It, the aligned block of frame It-1 obtained in the first step is used as the motion estimation template, integer-pixel motion estimation is performed in frame It-2, and the block with the smallest mean absolute error is taken as the aligned block of frame It-2. This is repeated, the aligned block of each past frame becoming the motion estimation template for the frame before it; finally, the preliminary aligned block of frame It-(N-1) is used as the template for integer-pixel motion estimation in frame It-N, and the block with the smallest mean absolute error is taken as the aligned block of frame It-N. N aligned blocks are thus obtained.
The above alignment keeps the object in the target block at the center of the network input block (i.e., the aligned block), so that the network does not have to spend a great deal of effort searching for the target object. In the embodiment of the present invention, using motion estimation in both steps (i.e., the first scheme) is abbreviated MEA (ME Alignment); using the fixed co-located position in the first step and motion estimation in the second step (i.e., the second scheme) is abbreviated ColMEA (Co-located ME Alignment). Both variants are sketched below.
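Reusing the best_match helper sketched above, the two alignment variants could look as follows; centring every search on the target block position is a simplification assumed here.

```python
def align_past_frames(target_block, past_frames, cy, cx, scheme="MEA"):
    """Step 1: the aligned block of I_{t-1} comes either from motion estimation
    with the target block as template (MEA) or from the co-located block
    (ColMEA). Step 2: each aligned block found becomes the ME template for the
    next older frame, as described above."""
    bh, bw = target_block.shape
    if scheme == "MEA":
        first = best_match(target_block, past_frames[0], cy, cx)
    else:                                   # ColMEA: co-located block of I_{t-1}
        first = past_frames[0][cy:cy + bh, cx:cx + bw]
    aligned = [first]
    for ref in past_frames[1:]:             # I_{t-2}, ..., I_{t-N}
        aligned.append(best_match(aligned[-1], ref, cy, cx))
    return aligned
```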
In addition, to avoid the loss of block-edge information caused by partitioning, after the N aligned blocks are obtained, the reconstructed pixels surrounding each aligned block are padded around it, giving the final N aligned blocks.
For example, the target frame is first divided into M blocks of 64x64. For each block, 4 aligned blocks must be found in the 4 past frames (N = 4), finally giving M groups of aligned blocks, each group containing 4 aligned blocks. Regarding the block size: since the target blocks are 64x64, alignment is also performed at 64x64; before the blocks are fed to the network, each of the 4 aligned blocks is padded with the reconstructed pixels around it, 32 pixels on each of the top, bottom, left and right, giving four 128x128 aligned blocks as the network input. The network outputs one 128x128 block, and the 64x64 block at its center is cropped out as the final extrapolated block, whose size matches the target block.
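A sketch of the padding to 128x128 and the final centre crop described in the example; replicating edge pixels at picture borders is an assumption, since the patent does not state how borders are handled.

```python
import numpy as np

def pad_with_surrounding(frame, y, x, block=64, pad=32):
    """Return the aligned block at (y, x) together with `pad` surrounding
    reconstructed pixels on every side (128x128 for a 64x64 block)."""
    padded = np.pad(frame, pad, mode="edge")   # border handling is an assumption
    return padded[y:y + block + 2 * pad, x:x + block + 2 * pad]

def crop_center(net_output, block=64, pad=32):
    """Cut the central 64x64 region out of the 128x128 network output,
    which becomes the final extrapolated block."""
    return net_output[pad:pad + block, pad:pad + block]
```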
2. Multi-scale and residual network.
The multi-scale structure fully captures motion at different resolutions, and the residual network learns the difference between the input image blocks and the target image block more efficiently, which makes it well suited to the aligned extrapolation task.
The multi-scale residual network contains Q scales. For ease of drawing, FIG. 1 takes Q = 4 as an example and schematically shows a network structure with four scales.
In the embodiment of the present invention, the resolutions of the scales are denoted s1, s2, ..., sQ in order; the resolution of the current scale k is half that of the next scale k+1, for k = 1, 2, ..., Q, and when k = Q, sQ+1 is the set full resolution. Taking Q = 4 as an example, the resolutions are set to s1 = 16, s2 = 32, s3 = 64, s4 = 128.
The multi-scale residual network makes a series of predictions starting from the lowest resolution s1: the prediction at size sk is the starting point for the prediction at size sk+1, and the process keeps up-sampling and adding the learned residual at the next finer scale until the full-resolution image is reached (in the example above, the 128x128 image).
In the embodiment of the present invention, the network and the extrapolated-block prediction it makes are defined recursively as
Ŷk = uk(Ŷk-1) + Gk(Xk, uk(Ŷk-1)), for k = 1, 2, ..., Q,
where Ŷk is the extrapolated block predicted at the k-th scale; in particular, the network takes only the down-sampled X1 as the smallest-size input, i.e., for k = 1 the term u1(Ŷ0) is taken to be X1. X denotes the N aligned blocks {Xt-1, Xt-2, ..., Xt-N}, which are the aligned blocks of It-1, It-2, ..., It-N in turn; Xk denotes X down-sampled to resolution sk; Gk denotes the network that learns the residual from Xk and uk(Ŷk-1), that is, the sub-network at the k-th scale; and uk denotes the up-sampling operation towards resolution sk.
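A PyTorch sketch of this recursion, assuming a small convolutional stack for each Gk; the layer widths, the depth, and the use of the mean of the aligned blocks as the k = 1 starting point are illustrative assumptions rather than the patent's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleNet(nn.Module):
    """One sub-network G_k: maps the N aligned blocks at scale k, concatenated
    with the upsampled coarser prediction, to a residual (assumed layout)."""
    def __init__(self, n_past=4, feat=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_past + 1, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, 3, padding=1),
        )

    def forward(self, x_k, y_up):
        return self.body(torch.cat([x_k, y_up], dim=1))

class MultiScaleExtrapolator(nn.Module):
    """Recursive prediction: Y_k = u_k(Y_{k-1}) + G_k(X_k, u_k(Y_{k-1}))."""
    def __init__(self, n_scales=4, n_past=4):
        super().__init__()
        self.nets = nn.ModuleList(ScaleNet(n_past) for _ in range(n_scales))
        self.n_scales = n_scales

    def forward(self, x):                      # x: (B, N, 128, 128) aligned blocks
        xs = [x]
        for _ in range(self.n_scales - 1):     # bicubic pyramid X_1 ... X_Q
            xs.insert(0, F.interpolate(xs[0], scale_factor=0.5,
                                       mode="bicubic", align_corners=False))
        y = xs[0].mean(dim=1, keepdim=True)    # k = 1 starting point (assumption)
        preds = []
        for k in range(self.n_scales):
            if k > 0:                          # u_k: upsample the coarser prediction
                y = F.interpolate(y, scale_factor=2, mode="bicubic",
                                  align_corners=False)
            y = y + self.nets[k](xs[k], y)     # add the learned residual
            preds.append(y)
        return preds                           # predictions at all Q scales
```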
The total loss of the network is the sum of the Q scale losses:
L = L1 + L2 + ... + LQ, where
Lk = ||Ŷk - Yk||1,
with Yk denoting the target block down-sampled to resolution sk. Here ||*||1 represents the l1 loss, a term used in deep learning for a particular loss function.
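Continuing the sketch above, the Q-scale l1 loss could be written as follows; the target block is down-sampled by bicubic interpolation to match each scale.

```python
import torch
import torch.nn.functional as F

def multiscale_l1_loss(preds, target):
    """preds: list of Q predictions from MultiScaleExtrapolator;
    target: the full-resolution target block, shape (B, 1, 128, 128)."""
    loss = 0.0
    for y_k in preds:
        t_k = F.interpolate(target, size=y_k.shape[-2:],
                            mode="bicubic", align_corners=False)
        loss = loss + torch.mean(torch.abs(y_k - t_k))   # l1 loss at scale k
    return loss
```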
In the training phase of the multi-scale residual network, original uncompressed video is used. Each time, the N past frames of a target frame are selected for block-wise alignment; for each block, the N aligned blocks obtained serve as the network input and the original values of the target block serve as the label. In particular, the MEA alignment is used during training (i.e., the ME scheme in the first alignment step) so as to produce the most accurate aligned blocks. The aligned blocks are down-sampled by bicubic interpolation. Only the Y component is used for training, so a single model is obtained; since the network is fully convolutional, the trained model can be used for different resolutions.
Those skilled in the art will understand that YUV comprises the three channels Y, U and V. Viewed per channel, one YUV frame can be split into three images (a Y image, a U image and a V image), and the block-wise extrapolation described above operates on one channel at a time. For example, block-wise aligned extrapolation is performed on the Y channel and the results are spliced into a Y-channel extrapolated frame; the U and V components are extrapolated in the same way to obtain U and V extrapolated frames; and the Y, U and V extrapolated frames are then combined into an extrapolated image in YUV format. In other words, YUV data is three-dimensional (channel, height and width): extrapolating one component operates along the channel dimension, while the block partition operates in the height and width dimensions. A per-channel sketch follows.
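A sketch of this per-channel handling, reusing the aligned_frame_extrapolation helper from the earlier sketch; the half-size U and V planes of YUV420 are glossed over here for simplicity.

```python
def extrapolate_yuv(past_yuv_frames, align_blocks, extrapolate_block):
    """past_yuv_frames: list of N dicts {'Y': ..., 'U': ..., 'V': ...}.
    Each channel is extrapolated independently and the three extrapolated
    planes are combined into one YUV extrapolated frame."""
    out = {}
    for ch in ("Y", "U", "V"):
        past = [f[ch] for f in past_yuv_frames]
        out[ch] = aligned_frame_extrapolation(past, align_blocks, extrapolate_block)
    return out
```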
3. Applying the aligned extrapolated frame in video coding
In the embodiment of the present invention, the obtained aligned extrapolated frame is applied in video coding in at least two ways:
The first is to use the aligned extrapolated frame directly as the motion-compensated prediction result in inter prediction; this scheme is abbreviated MCP.
The second is to use the aligned extrapolated frame as a new reference frame and perform conventional motion compensation on that new reference frame; this scheme is abbreviated REF.
For the first method, MCP, aligned frame extrapolation is treated as a brand-new inter prediction mode that is applied only to coding units (CUs) of size 32x32 and larger. Rate-distortion optimization (RDO) is performed between this mode and the other inter prediction modes, and the best mode is selected for coding. In the bitstream structure, for CUs of 32x32 and larger, a flag must be transmitted to indicate whether the CU uses the MCP mode; a sketch of this mode decision follows.
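A hypothetical sketch of the rate-distortion mode decision for one CU; predict() and bits() are assumed helpers exposed by each candidate mode, one of which would be the aligned-extrapolation (MCP) prediction, and the extra flag bit is counted as described above.

```python
import numpy as np

def choose_mode(cu_pixels, candidate_modes, lam):
    """Pick the mode with the smallest rate-distortion cost J = D + lambda * R.
    cu_pixels: original CU samples as a NumPy array; candidate_modes: objects
    exposing predict(cu_pixels) and bits(cu_pixels) (assumed interface)."""
    best_mode, best_cost = None, float("inf")
    for mode in candidate_modes:
        pred = mode.predict(cu_pixels)
        dist = float(((cu_pixels.astype(np.float64) - pred) ** 2).sum())  # SSD distortion
        rate = mode.bits(cu_pixels) + 1        # +1 bit for the MCP on/off flag (assumption)
        cost = dist + lam * rate
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode
```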
For the second method, REF, the aligned extrapolated frame is used as a new reference frame. Following the conventional coding framework, motion compensation is performed on both the existing reference frames and the extrapolated reference frame, so that the encoder can select the most accurate prediction for each partitioned prediction unit (PU); the PU sizes used range from 4x4 to 64x64. In particular, the extrapolated block size is set equal to the CTU size, i.e., 64x64. Because of the input padding, the network output block is twice the CTU size, i.e., 128x128, including the generated padding pixels. To build the HEVC reference, the whole network output block is used as the reference block at the corresponding CTU position, so that boundary continuity between the extrapolated reference CTU and the surrounding pixels is maintained during motion estimation and motion compensation, and this reference-block strategy is repeated when compressing the next CTU. Because of the special YUV420 format, the luma and chroma channels are processed separately. The alignment process is carried out only on the Y component, and the Y component is extrapolated by the network first; in the extrapolation task the U and V components share the same characteristics as the Y component, such as the same motion trend, so the U and V aligned images corresponding to the Y component are also fed into the network model for extrapolation, giving the U and V extrapolated images. The extrapolated frame is generated with the same method at the encoder and the decoder so that the two sides stay matched.
4. Combining the alignment methods with the two application schemes.
In the embodiment of the present invention, the two alignment methods and the two application schemes are cross-combined and integrated into the HEVC coding standard, giving the following four combinations: MCP+ColMEA, MCP+MEA, REF+ColMEA and REF+MEA.
The basic implementations of the MCP and REF schemes have been introduced above. For the MCP+ColMEA combination, since ColMEA requires no additional information to be transmitted for its prediction and the inter prediction value can be obtained directly from frame extrapolation, the information needed to obtain the prediction in the conventional inter modes can be omitted, including the motion vector (MV) and reference frame index in AMVP mode, and the Merge Flag and Merge Index in Merge mode.
For the MCP+MEA combination, the alignment information must be transmitted. The conventional MV coding module is reused: the integer component represents the alignment information, and the fractional component indicates sub-pixel interpolation of the obtained extrapolated block, which makes full use of the MV structure to generate a more accurate inter prediction.
For the REF+ColMEA combination, ColMEA changes nothing in the bitstream structure, so it is only necessary to follow the conventional coding framework with the new reference frame added.
For the REF+MEA combination, a better motion vector predictor (MVP), given by the surrounding PUs and very close to the aligned MV, is used to replace the aligned MV. Transmitting the MVP is more economical than transmitting the MV, and the aligned MVP is encoded only for extrapolated blocks that are referenced by a PU, which further saves bits.
With the above scheme of the embodiment of the present invention, the alignment operation used to process the past frames reduces the diversity of the extrapolation network's input, and learning the difference between the target frame and the past frames significantly reduces the difficulty of extrapolation; in addition, the extrapolated frame can be applied in video coding to improve coding efficiency.
On the other hand, tests were carried out to illustrate the performance of the present invention; extrapolation is mainly suited to the low-delay configurations in coding.
The test conditions include: 1) inter configurations: low-delay B (LDB) and low-delay P (LDP); 2) base quantization parameters (QP) set to {27, 32, 37, 42}; the software base is HM 12.0 and the test sequences are the HEVC standard test sequences.
Those skilled in the art will understand that work in the coding field must report performance on the specified standard test sequences, which are divided into several classes (Class A to F); in the tests below, the results for each class of test sequences are reported according to the standard test requirements.
1) Coding performance comparison of the four combinations of alignment method and extrapolation application
Table 1 compares the performance of the four combinations under the LDP configuration. The table shows that the best combination is REF+ColMEA, which can be taken as the preferred scheme.
Table 1. Performance comparison of the four combinations under the LDP configuration
Table 2 gives the complete performance of the preferred scheme. As can be seen from Table 2, compared with HM 12.0, the above scheme of the embodiment of the present invention achieves bit-rate savings of 5.3% and 2.8% under the LDP and LDB configurations, respectively.
Table 2. Complete performance of the preferred scheme
2) Effect of the aligned extrapolated frame
Table 3 compares, for the preferred scheme under the LDP configuration, the coding performance of conventional frame extrapolation without alignment against the aligned frame extrapolation proposed by the present invention. Extrapolation without alignment obtains only a 2.2% bit-rate saving, whereas aligned extrapolation obtains a 5.3% bit-rate saving. Compared with simple frame extrapolation, the alignment proposed by the present invention improves coding performance far more significantly. In addition, observation shows that the extrapolated frames produced by the proposed alignment method are clearer and better than those of the conventional method.
Table 3. Effect of the aligned extrapolated frame
From the description of the above embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive or removable hard disk) and includes instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can readily occur to those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910790385.9A CN110392264B (en) | 2019-08-26 | 2019-08-26 | Alignment extrapolation frame method based on neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910790385.9A CN110392264B (en) | 2019-08-26 | 2019-08-26 | Alignment extrapolation frame method based on neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110392264A true CN110392264A (en) | 2019-10-29 |
CN110392264B CN110392264B (en) | 2022-10-28 |
Family
ID=68289308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910790385.9A Active CN110392264B (en) | 2019-08-26 | 2019-08-26 | Alignment extrapolation frame method based on neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110392264B (en) |
-
2019
- 2019-08-26 CN CN201910790385.9A patent/CN110392264B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102917220A (en) * | 2012-10-18 | 2013-02-06 | 北京航空航天大学 | Dynamic background video object extraction based on hexagon search and three-frame background alignment |
CN107105278A (en) * | 2017-04-21 | 2017-08-29 | 中国科学技术大学 | The coding and decoding video framework that motion vector is automatically generated |
CN107734333A (en) * | 2017-09-29 | 2018-02-23 | 杭州电子科技大学 | A kind of method for improving video error concealing effect using network is generated |
CN108289224A (en) * | 2017-12-12 | 2018-07-17 | 北京大学 | A kind of video frame prediction technique, device and neural network is compensated automatically |
JP2019128889A (en) * | 2018-01-26 | 2019-08-01 | 日本放送協会 | Image information converter and program therefor |
CN109151474A (en) * | 2018-08-23 | 2019-01-04 | 复旦大学 | A method of generating new video frame |
CN110070511A (en) * | 2019-04-30 | 2019-07-30 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
ZE-KUN CAO; WEI-XING QIAN; ZHEN-YU ZHANG; XU-DONG LIU; YUN MA; Z: "On a Specific Force Based Tansfer Alignment Method of Strapdown Navigation System in Targeting Pod", 《 2018 IEEE CSAA GUIDANCE, NAVIGATION AND CONTROL CONFERENCE (CGNCC)》 * |
- 时东锋; 黄见; 苑克娥; 王英俭; 谢晨波; 刘东; 朱文越: "Spatially encoded multiplexing speckle multi-information fusion correlation imaging (invited)", Infrared and Laser Engineering (《红外与激光工程》) * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111212287A (en) * | 2020-01-15 | 2020-05-29 | 济南浪潮高新科技投资发展有限公司 | Video compression method based on image interpolation method |
RU2836221C1 (en) * | 2024-05-20 | 2025-03-11 | Самсунг Электроникс Ко., Лтд. | Method of interpolating video frames taking into account repeating structures and device and carrier implementing said method |
Also Published As
Publication number | Publication date |
---|---|
CN110392264B (en) | 2022-10-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7331095B2 (en) | Interpolation filter training method and apparatus, video picture encoding and decoding method, and encoder and decoder | |
CN113315974B (en) | Video decoder and method | |
TW202005389A (en) | Weighted interweaved prediction | |
CN111131822B (en) | Overlapped block motion compensation with motion information derived from a neighborhood | |
KR102616714B1 (en) | Early termination for optical flow purification | |
JP7279154B2 (en) | Motion vector prediction method and apparatus based on affine motion model | |
US11902508B2 (en) | Use of extended samples during search in decoder-side motion refinement | |
JP7590337B2 (en) | Method and apparatus for prediction refinement using optical flow for affine coded blocks - Patents.com | |
TWI748522B (en) | Video encoder, video decoder, and related methods | |
JP7384939B2 (en) | A method for calculating the position of integer grid reference samples for block-level boundary sample gradient calculations in bi-prediction optical flow calculations and bi-prediction corrections. | |
CN113366831B (en) | Coordination between overlapped block motion compensation and other tools | |
CN116569552A (en) | Method and system for inter prediction compensation | |
JP7637633B2 (en) | Picture prediction method and apparatus, and computer-readable storage medium | |
US20220368888A9 (en) | Dmvr using decimated prediction block | |
CN112042197A (en) | Candidate motion vector list obtaining method and device and coder-decoder | |
CN112565767B (en) | Video decoding method, video encoding method and related equipment | |
CN110392264B (en) | Alignment extrapolation frame method based on neural network | |
CN115443650A (en) | Angle Weighted Prediction for Inter Prediction | |
RU2817030C2 (en) | Encoder, decoder and corresponding use methods for ibc combining list | |
RU2787885C2 (en) | Method and equipment for mutual prediction, bit stream and non-volatile storage carrier | |
RU2822447C2 (en) | Method and equipment for mutual prediction | |
CN116134817A (en) | Motion compensation using sparse optical flow representation | |
WO2020038357A1 (en) | Fusion candidate list construction method, device and encoding/decoding method and device | |
WO2025011104A1 (en) | Encoding method and apparatus and decoding method and apparatus | |
TW202005388A (en) | Concept of interweaved prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |