
CN107133919A - Time dimension video super-resolution method based on deep learning - Google Patents


Info

Publication number
CN107133919A
Authority
CN
China
Prior art keywords
video image
layer
neural network
image set
original video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710341864.3A
Other languages
Chinese (zh)
Inventor
董伟生
巨丹
石光明
谢雪梅
吴金建
李甫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201710341864.3A priority Critical patent/CN107133919A/en
Publication of CN107133919A publication Critical patent/CN107133919A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract


The invention discloses a time-dimension video super-resolution method based on deep learning, which mainly solves the prior-art problems of poor stability and low accuracy in frame-interpolated reconstruction of video images. Its key idea is to train a neural network to fit the nonlinear mapping between original video images and downsampled video images: 1) obtain an original video image set and a downsampled video image set as training samples for the neural network; 2) construct a neural network model and train its parameters on those samples; 3) feed any given video, as a test sample, into the trained model; the network output is the reconstructed video. The invention reduces the computational complexity of frame-interpolated video reconstruction, improves its stability and accuracy, and can be used for scene interpolation, animation production, and temporal frame interpolation of low-frame-rate video.

Description

Time-dimension video super-resolution method based on deep learning

Technical Field

The invention belongs to the field of image processing and specifically relates to a time-dimension video super-resolution method, which can be used for scene interpolation, animation production, and temporal frame interpolation of low-frame-rate video.

Background Art

A video image contains not only the spatial information of the observed target but also its motion information over time, giving it a "space-time unified" character. Because video keeps together the spatial and temporal information that reflect an object's nature, it greatly enhances the human ability to perceive the objective world, and it has proven to be of great application value in remote sensing, military affairs, agriculture, medicine, biochemistry, and other fields.

Acquiring precise video images with video imaging equipment is costly and is limited by the manufacturing processes of sensors and optical devices. To raise the resolution of imaged video, the video usually has to be compressed at the expense of its temporal resolution, which clearly cannot meet the needs of scientific research and large-scale practical applications. Reconstructing high-resolution video images from compressed video by signal processing has therefore become an important way to obtain video images.

In "Dual Motion Estimation for Frame Rate Up-Conversion", Kang S. J. et al. proposed an algorithm that performs frame-interpolated reconstruction of video images through motion estimation and motion compensation. Frame-interpolated reconstruction is an ill-posed inverse problem; their algorithm exploits the temporal information of the video together with its spatial information. However, because it does not fully exploit the strong structural similarity between adjacent frames, the stability and accuracy of the reconstructed video fall short of the requirements of scientific research and large-scale practical application.

Summary of the Invention

The purpose of the present invention is to address the deficiencies of the prior art described above and to propose a deep-learning-based time-dimension video super-resolution method that improves the stability and accuracy of reconstructed video images and meets the requirements of large-scale practical applications.

The technical scheme of the present invention is realized as follows:

The downsampled video image set and the original video image set serve, respectively, as the input and output training samples of a neural network. Training fits the nonlinear mapping between downsampled video images and original video images, and this mapping then guides the frame-interpolated reconstruction of test samples, so that the network performs temporal frame interpolation of video. The specific steps are as follows:

(1) Convert the color video image set S = {S_1, S_2, ..., S_i, ..., S_N} into a grayscale video image set, i.e., the original video image set X = {X_1, X_2, ..., X_i, ..., X_N}, and directly downsample X with a downsampling matrix F to obtain the downsampled video image set Y = {Y_1, Y_2, ..., Y_i, ..., Y_N}, where X_i ∈ R^{M×L_h} denotes the i-th original video image sample, Y_i ∈ R^{M×L_l} denotes the i-th downsampled video image sample, 1 ≤ i ≤ N, N denotes the number of image samples in the original video image set, M denotes the size of an original video image block, L_h denotes the number of image blocks in each sample of the original video image set, L_l denotes the number of image blocks in each sample of the downsampled video image set, L_h = r × L_l, and r denotes the magnification factor of the original video image set relative to the downsampled video image set;
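For clarity, the quantities defined in step (1) obey the following dimensional relations; this is only a restatement of the definitions above in standard matrix notation (the explicit shape of F is not given in the source and is left unspecified):

    X_i \in \mathbb{R}^{M \times L_h}, \qquad
    Y_i = F X_i \in \mathbb{R}^{M \times L_l}, \qquad
    L_h = r \times L_l, \qquad 1 \le i \le N.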

(2) Construct a neural network model and train its parameters with the downsampled video image set Y and the original video image set X:

(2a) Determine the number of input-layer nodes, output-layer nodes, hidden layers, and hidden-layer nodes of the neural network; randomly initialize the connection weights W^(t) and biases b^(t) of each layer; set the learning rate η; and choose the activation function f(g) = tanh(g), where g denotes the input value of a neural network node, t = 1, 2, ..., n, and n denotes the total number of layers of the neural network;

(2b) Randomly select a downsampled video image Y_i from the downsampled video image set as an input training sample, together with the corresponding original video image X_i as the output training sample, and compute the activation value of each layer of the network with the chosen activation function:

The activation value of layer 1, the input layer, is a^(1) = Y_i;

For layers t' = 2, 3, ..., n, the activation value is a^(t') = f(W^(t'-1) * a^(t'-1) + b^(t'-1)), where, in the second, third, and fourth layers of the network (t' = 2, t' = 3, t' = 4), three three-dimensional filters are designed to replace the traditional two-dimensional filters in order to fully extract the correlation between video frames; f(g) denotes the tanh(g) activation function with g = W^(t'-1) * a^(t'-1) + b^(t'-1); W^(t'-1) and b^(t'-1) denote the weights and bias of layer t'-1, respectively, and a^(t'-1) denotes the activation value of layer t'-1;

(2c) Compute the learning error of each layer of the neural network:

The error of the output layer, i.e., layer n, is δ^(n) = X_i − a^(n);

第t"=n-1,n-2,...,2层的误差为:δ(t")=((W(t”))Tδ(t”+1)).*f'(W(t”-1)*a(t”-1)+b(t”-1)),其中,W(t”)表示第t"层的权值,δ(t"+1)表示第t"+1层的误差,W(t”-1)和b(t”-1)分别表示第t"-1层的权值和偏置,a(t”-1)表示第t"-1层的激活值,f'(g')表示函数f(g')的导数,(g”)T表示转置变换,g'=W(t”-1)*a(t”-1)+b(t”-1),g”=W(t”)The error of the t"=n-1, n-2,..., layer 2 is: δ (t") =((W (t") ) T δ (t"+1) ).*f'( W (t”-1) *a (t”-1) +b (t”-1) ), where W (t”) represents the weight of the t"th layer, δ (t"+1) represents the The error of the t"+1 layer, W (t"-1) and b (t"-1) respectively represent the weight and bias of the t"-1 layer, a (t"-1) represents the t"- The activation value of layer 1, f'(g') represents the derivative of the function f(g'), (g") T represents the transpose transformation, g'=W (t"-1) *a (t"-1) +b (t"-1) , g"=W (t") ;

(2d) Update the weights and biases of each layer of the neural network by error gradient descent:

The weights are updated as W^(t) = W^(t) − η δ^(t+1) (a^(t))^T and the biases as b^(t) = b^(t) − η δ^(t+1), where δ^(t+1) denotes the error of layer t+1 and a^(t) denotes the activation value of layer t;

(2e) Repeat steps (2b)-(2d) until the output-layer error of the neural network reaches the preset accuracy requirement or the number of training iterations reaches the maximum; then end training and save the network structure and parameters to obtain the trained neural network model;

(3) Feed any given video into the trained neural network model; the output of the network is the time-dimension super-resolved video.

Compared with the prior art, the present invention has the following advantages:

1) Because the present invention uses a convolutional neural network for time-dimension video super-resolution reconstruction, it reduces computational complexity relative to the prior art and improves the stability of the reconstruction;

2) Because the three-dimensional filters designed in the present invention fully account for the correlation between adjacent video frames, they improve the accuracy of time-dimension video super-resolution reconstruction.

Brief Description of the Drawings

Fig. 1 is the implementation flowchart of the present invention;

Fig. 2 is the structure diagram of the neural network constructed by the present invention;

Fig. 3 is the original image from the bus video used in the simulation experiments of the present invention;

Fig. 4 shows the reconstruction results for the bus video obtained with the existing Kang's method and Choi's method and with the method of the present invention.

Detailed Description

Embodiments and effects of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, the deep-learning-based time-dimension video super-resolution method of the present invention is implemented in the following steps:

Step 1: obtain the color video image set S.

(1a) From a given database, select a color video image set S = {S_1, S_2, ..., S_i, ..., S_464814} containing 464814 samples and convert it into a grayscale video image set, i.e., the original video image set X = {X_1, X_2, ..., X_i, ..., X_464814}, where X_i ∈ R^{M×L_h} denotes the i-th original video image sample, 1 ≤ i ≤ 464814, M denotes the size of an original video image block, M = 576, and L_h denotes the number of image blocks in each sample of the original video image set, L_h = 6;

(1b) Directly downsample the original video image set X with the downsampling matrix F to obtain the downsampled video image set Y = FX; this amounts to downsampling each sample of X = {X_1, X_2, ..., X_i, ..., X_464814} to obtain Y = {Y_1, Y_2, ..., Y_i, ..., Y_464814}, where Y_i = FX_i denotes the i-th downsampled video image sample, 1 ≤ i ≤ 464814, M = 576 likewise denotes the size of a downsampled video image block, and L_l = 3 denotes the number of image blocks in each sample of the downsampled set.
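As a concrete illustration of step 1, the NumPy sketch below builds one (X_i, Y_i) training pair. The 24 × 24 block geometry (so that M = 576), the BT.601 grayscale weights, and the choice of F as "keep every r-th frame" are illustrative assumptions; the source only specifies direct downsampling:

    import numpy as np

    def make_training_pair(color_block, r=2):
        # color_block: RGB patch sequence of shape (L_h, 24, 24, 3), L_h = 6 frames;
        # 24 x 24 gives M = 576 pixels per frame (an assumed geometry).
        luma = color_block @ np.array([0.299, 0.587, 0.114])  # color -> grayscale
        L_h = luma.shape[0]
        X = luma.reshape(L_h, -1).T        # original sample X_i, shape (576, 6)
        Y = X[:, ::r]                      # direct temporal downsampling: keep every
        return X, Y                        # r-th frame, giving Y_i of shape (576, 3)

With r = 2 this reproduces the stated sizes L_h = 6 and L_l = 3.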

Step 2: construct the neural network model and train its parameters with the downsampled video image set Y and the original video image set X.

This step is implemented as follows:

(2a) Initialize the neural network parameters;

(2a1) Use the downsampled video images as input training samples and the original video images as output training samples;

(2a2) Determine the number of input-layer nodes from the number of video frames in an input training sample; in this embodiment, since the number of input-layer nodes equals the number L_l of image blocks in each downsampled sample, it is set to 3;

(2a3) Determine the number of output-layer nodes from the number of video frames in an output training sample; in this embodiment, since the number of output-layer nodes equals the number L_h of image blocks in each original sample, it is set to 6;

(2a4) Determine the number of hidden layers and hidden-layer nodes:

Since the number of hidden layers and hidden-layer nodes determines the scale of the neural network, the network should be kept as simple as possible while still solving the problem. In this embodiment, the number of hidden layers is set to 7, and the number of nodes in each is tuned experimentally: 64 in the first hidden layer, 32 in the second, 24 in the third, 12 in the fourth, 32 in the fifth, 32 in the sixth, and 6 in the seventh;

(2a5) Randomly initialize the connection weights W^(t) and biases b^(t) of each layer, t = 1, 2, 3, 4, 5, 6, 7, 8;

(2a6) Set the learning rate η = 0.0005;

(2a7) Choose the activation function f(g) = tanh(g), where g denotes the weighted input sum of a neural network node, including the bias;
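The initialization steps (2a1)-(2a7) can be summarized in a short NumPy sketch. The node counts and learning rate are those given above; the Gaussian weight scale (0.01) and the fully connected (matrix) form are illustrative assumptions, since the patent's layers 2-4 actually use three-dimensional filters:

    import numpy as np

    rng = np.random.default_rng(0)

    layer_sizes = [3, 64, 32, 24, 12, 32, 32, 6, 6]   # input, 7 hidden layers, output
    eta = 0.0005                                      # learning rate from (2a6)

    # Random initialization of W^(t) and b^(t), t = 1..8 (scale 0.01 is an assumption)
    W = [rng.normal(0.0, 0.01, (m, k))
         for k, m in zip(layer_sizes[:-1], layer_sizes[1:])]
    b = [np.zeros((m, 1)) for m in layer_sizes[1:]]

    def f(g):                    # activation (2a7): f(g) = tanh(g)
        return np.tanh(g)

    def f_prime(g):              # its derivative: f'(g) = 1 - tanh(g)^2
        return 1.0 - np.tanh(g) ** 2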

(2b) Randomly select an input training sample Y_i and compute the activation value of each layer of the network with the chosen activation function:

The activation value of layer 1, the input layer, is a^(1) = Y_i;

For layers t' = 2, 3, ..., 9 (the network of this embodiment has n = 9 layers in total), the activation value is a^(t') = f(W^(t'-1) * a^(t'-1) + b^(t'-1)), where, in the second, third, and fourth layers (t' = 2, t' = 3, t' = 4), three three-dimensional filters are designed to replace the traditional two-dimensional filters in order to fully extract the correlation between video frames; f(g) denotes the tanh(g) activation function with g = W^(t'-1) * a^(t'-1) + b^(t'-1); W^(t'-1) and b^(t'-1) denote the weights and bias of layer t'-1, respectively, and a^(t'-1) denotes the activation value of layer t'-1;
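Step (2b) then reduces to the following forward pass; this is the matrix form of the recurrence above, with the 3-D and 2-D convolutions abstracted into the same W·a + b shape for clarity:

    def forward(y, W, b):
        # y: input sample Y_i as a column vector of layer-1 size, e.g. (3, 1).
        a = [y]                              # a^(1) = Y_i (input layer)
        for W_t, b_t in zip(W, b):
            a.append(f(W_t @ a[-1] + b_t))   # a^(t') = f(W^(t'-1) a^(t'-1) + b^(t'-1))
        return a                             # activations of all n layers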

(2c) Input the corresponding output training sample X_i and compute the learning error of each layer of the network:

The error of the output layer, i.e., layer n = 9, is δ^(9) = X_i − a^(9);

For layers t'' = 8, 7, ..., 2, the error is δ^(t'') = ((W^(t''))^T δ^(t''+1)) .* f'(W^(t''-1) * a^(t''-1) + b^(t''-1)), where W^(t'') denotes the weights of layer t'', δ^(t''+1) denotes the error of layer t''+1, W^(t''-1) and b^(t''-1) denote the weights and bias of layer t''-1, a^(t''-1) denotes the activation value of layer t''-1, f'(g') denotes the derivative of the function f(g'), (g'')^T denotes the transpose, g' = W^(t''-1) * a^(t''-1) + b^(t''-1), and g'' = W^(t'');

(2d) Update the weights and biases of each layer of the neural network by error gradient descent:

The weights are updated as W^(t) = W^(t) − η δ^(t+1) (a^(t))^T;

The biases are updated as b^(t) = b^(t) − η δ^(t+1), where δ^(t+1) denotes the error of layer t+1 and a^(t) denotes the activation value of layer t;
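Steps (2c)-(2d) amount to one stochastic gradient step. The sketch below implements the error recursion and update rules exactly as written above; note that δ^(n) is defined as X_i − a^(n), and the sketch keeps the patent's signs verbatim:

    def backward_and_update(a, x, W, b, eta):
        # a: activations from forward(); x: target sample X_i (column vector).
        n = len(a)                                # total number of layers
        delta = [None] * n
        delta[-1] = x - a[-1]                     # output-layer error (2c)
        for t in range(n - 2, 0, -1):             # layers n-1, ..., 2
            g = W[t - 1] @ a[t - 1] + b[t - 1]
            delta[t] = (W[t].T @ delta[t + 1]) * f_prime(g)
        for t in range(n - 1):                    # gradient-descent update (2d)
            W[t] -= eta * delta[t + 1] @ a[t].T
            b[t] -= eta * delta[t + 1]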

(2e) Repeat steps (2b)-(2d) until the network's output-layer error reaches the preset accuracy requirement or the number of training iterations reaches the maximum; then end training and save the network structure and parameters to obtain the trained neural network model. In this embodiment, the maximum number of iterations is 500;
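Putting (2b)-(2e) together gives the training loop; the stopping tolerance below is illustrative, since the source only speaks of a "preset accuracy requirement":

    def train(pairs, W, b, eta, max_iters=500, tol=1e-4):
        # pairs: list of (Y_i, X_i) column-vector training samples.
        for _ in range(max_iters):                     # at most 500 iterations
            y, x = pairs[np.random.randint(len(pairs))]
            a = forward(y, W, b)
            if np.mean((x - a[-1]) ** 2) < tol:        # preset accuracy reached
                break
            backward_and_update(a, x, W, b, eta)
        return W, b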

The neural network constructed in step 2 is shown in Fig. 2. It comprises one input layer, three three-dimensional convolutional layers, three two-dimensional convolutional layers, and one output layer. The input layer has 3 nodes; the seven hidden layers have 64, 32, 24, 12, 32, 32, and 6 nodes, respectively; and the output layer has 6 nodes.

Step 3: use the trained neural network model to perform time-dimension super-resolution reconstruction of video images.

(3a) Take any given video as a test sample and stretch each video image sample Y_i in it into a column vector; each vector has size 1728 × 1;

(3b) Feed these column vectors into the trained neural network model; for each input vector, the network outputs a vector of increased dimension, of size 3456 × 1;

(3c) Reconstruct and combine these vectors: first reshape each output vector into single-frame images, then assemble the single frames into a video, yielding the time-dimension super-resolved video.
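Steps (3a)-(3c) are pure reshaping around the trained network. A minimal sketch follows; the `model` callable stands in for the trained network, and its exact interface is an assumption:

    def super_resolve_block(sample, model):
        # sample: one test block of shape (576, 3), i.e. M x L_l.
        y = sample.reshape(-1, 1, order='F')      # (3a): stack frames -> 1728 x 1
        out = model(y)                            # (3b): network output, 3456 x 1
        return out.reshape(576, 6, order='F')     # (3c): back to 6 frames of 576 pixels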

The effect of the present invention is illustrated by the following simulation experiments:

1. Simulation conditions:

1) The direct downsampling transformation matrix F used in the simulation is obtained with the function imresize;

2) The programming platforms used in the simulation are Matlab R2015a and PyCharm v2016;

3) The neural network structure built in the simulation is shown in Fig. 2;

4) The 14th frame of the bus video sequence used in the simulation is shown in Fig. 3;

5) The video images in the training set come from the Xiph database, 464814 training samples in total;

6) The peak signal-to-noise ratio (PSNR) is used to evaluate the experimental results; PSNR is defined as:

PSNR = (1/M) Σ_{j=1}^{M} 10 log10(MAX_j^2 / MSE_j), where M denotes the number of frames of the reconstructed video, MAX_j denotes the maximum pixel value of the j-th reconstructed frame, and MSE_j denotes the mean squared error between the j-th frame of the reconstructed video and the j-th frame of the original video.
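Under the definition above, the PSNR evaluation can be computed as follows (a straightforward sketch; frame arrays are assumed to be floating point):

    def psnr(reconstructed, original):
        # reconstructed, original: iterables of same-sized frame arrays.
        values = []
        for rec_j, org_j in zip(reconstructed, original):
            mse_j = np.mean((rec_j - org_j) ** 2)     # per-frame mean squared error
            max_j = rec_j.max()                       # peak value of frame j
            values.append(10.0 * np.log10(max_j ** 2 / mse_j))
        return np.mean(values)                        # average over the M frames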

2. Simulation content: the method of the present invention is applied to time-dimension video super-resolution reconstruction of the bus video shown in Fig. 3. The reconstruction results are shown in Fig. 4, where:

Fig. 4(a) shows the 14th frame reconstructed with Kang's method,

Fig. 4(b) shows the 14th frame reconstructed with Choi's method,

Fig. 4(c) shows the 14th frame reconstructed with the method of the present invention.

The reconstruction results in Fig. 4 show that the image reconstructed by the present invention is closer to the real image than those reconstructed by Kang's method and Choi's method.

3. PSNR comparison

The PSNR of video temporal super-resolution reconstruction of the bus video is computed for the existing Kang's method, Choi's method, and the method of the present invention; the results are shown in Table 1.

Table 1. PSNR values of the reconstructed video images (unit: dB)

Table 1 shows that the PSNR of the video reconstructed by the method of the present invention is 2.99 dB higher than that of the existing Kang's method and 2.38 dB higher than that of the existing Choi's method.

Claims (5)

1. A time-dimension video super-resolution method based on deep learning, comprising the following steps:
(1) converting a color video image set S = {S_1, S_2, ..., S_i, ..., S_N} into a grayscale video image set, i.e., an original video image set X = {X_1, X_2, ..., X_i, ..., X_N}, and directly downsampling the original video image set X with a downsampling matrix F to obtain a downsampled video image set Y = {Y_1, Y_2, ..., Y_i, ..., Y_N}, wherein X_i ∈ R^{M×L_h} represents the i-th original video image sample, Y_i ∈ R^{M×L_l} represents the i-th downsampled video image sample, 1 ≤ i ≤ N, N represents the number of image samples in the original video image set, M represents the size of an original video image block, L_h represents the number of image blocks in each sample of the original video image set, L_l represents the number of image blocks in each sample of the downsampled video image set, L_h = r × L_l, and r represents the magnification of the original video image set relative to the downsampled video image set;
(2) constructing a neural network model and training the neural network parameters with the downsampled video image set Y and the original video image set X:
(2a) determining the number of input-layer nodes, output-layer nodes, hidden layers, and hidden-layer nodes of the neural network, randomly initializing the connection weights W^(t) and biases b^(t) of each layer, and, given the learning rate η, choosing the activation function f(g) = tanh(g), wherein g represents the input value of a neural network node, t = 1, 2, ..., n, and n represents the total number of layers of the neural network;
(2b) randomly inputting a downsampled video image Y_i from the downsampled video image set as an input training sample, simultaneously inputting the corresponding original video image X_i as an output training sample, and computing the activation value of each layer of the neural network with the chosen activation function:
the activation value of layer 1, the input layer, is a^(1) = Y_i;
the activation value of layers t' = 2, 3, ..., n is a^(t') = f(W^(t'-1) * a^(t'-1) + b^(t'-1)), wherein, in the second, third, and fourth layers of the network (t' = 2, t' = 3, t' = 4), three three-dimensional filters are designed to replace the conventional two-dimensional filters in order to fully extract the correlation between video frames; f(g) represents the tanh(g) activation function, g = W^(t'-1) * a^(t'-1) + b^(t'-1), W^(t'-1) and b^(t'-1) represent the weights and bias of layer t'-1, respectively, and a^(t'-1) represents the activation value of layer t'-1;
(2c) computing the learning error of each layer of the neural network:
the error of the output layer, i.e., layer n, is δ^(n) = X_i − a^(n);
the error of layers t'' = n−1, n−2, ..., 2 is δ^(t'') = ((W^(t''))^T δ^(t''+1)) .* f'(W^(t''-1) * a^(t''-1) + b^(t''-1)), wherein W^(t'') represents the weights of layer t'', δ^(t''+1) represents the error of layer t''+1, W^(t''-1) and b^(t''-1) represent the weights and bias of layer t''-1, a^(t''-1) represents the activation value of layer t''-1, f'(g') represents the derivative of the function f(g'), (g'')^T represents the transpose, g' = W^(t''-1) * a^(t''-1) + b^(t''-1), and g'' = W^(t'');
(2d) updating the weights and biases of each layer of the neural network by error gradient descent:
updating the weights as W^(t) = W^(t) − η δ^(t+1) (a^(t))^T and the biases as b^(t) = b^(t) − η δ^(t+1), wherein δ^(t+1) represents the error of layer t+1 and a^(t) represents the activation value of layer t;
(2e) repeatedly executing steps (2b)-(2d) until the output-layer error of the neural network reaches a preset accuracy requirement or the number of training iterations reaches the maximum number of iterations, then ending training and saving the network structure and parameters to obtain a trained neural network model;
(3) inputting any given video into the trained neural network model, the output of the neural network being the time-dimension super-resolved video.
2. The method of claim 1, wherein converting the original video image set X into the downsampled video image set Y with the downsampling matrix F in step (1) is performed by multiplying the original video images by the downsampling matrix F: Y = FX, wherein M denotes the size of an original video image block, L_l denotes the number of image blocks in each sample of the downsampled video image set, L_h denotes the number of image blocks in each sample of the original video image set, L_h = r × L_l, and r denotes the magnification of the original video image set relative to the downsampled video image set in the time dimension.
3. The method of claim 1, wherein the number of input-layer nodes of the neural network determined in step (2a) is determined according to the number of video frames of the input training samples, i.e., the number of input-layer nodes equals the number L_l of image blocks in each sample of the downsampled video image set.
4. The method of claim 1, wherein the number of output-layer nodes of the neural network determined in step (2a) is determined according to the number of video frames of the output training samples, i.e., the number of output-layer nodes equals the number L_h of image blocks in each sample of the original video image set.
5. The method of claim 1, wherein the number of hidden-layer nodes of the neural network determined in step (2a) is determined by experimental adjustment.
CN201710341864.3A 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning Pending CN107133919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710341864.3A CN107133919A (en) 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710341864.3A CN107133919A (en) 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning

Publications (1)

Publication Number Publication Date
CN107133919A true CN107133919A (en) 2017-09-05

Family

ID=59731773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710341864.3A Pending CN107133919A (en) 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning

Country Status (1)

Country Link
CN (1) CN107133919A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784628A (en) * 2017-10-18 2018-03-09 南京大学 A kind of super-resolution implementation method based on reconstruction optimization and deep neural network
CN108111860A (en) * 2018-01-11 2018-06-01 安徽优思天成智能科技有限公司 Video sequence lost frames prediction restoration methods based on depth residual error network
CN108122197A (en) * 2017-10-27 2018-06-05 江西高创保安服务技术有限公司 A kind of image super-resolution rebuilding method based on deep learning
CN108322685A (en) * 2018-01-12 2018-07-24 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108600762A (en) * 2018-04-23 2018-09-28 中国科学技术大学 In conjunction with the progressive video frame generating method of motion compensation and neural network algorithm
CN108805808A (en) * 2018-04-04 2018-11-13 东南大学 A method of improving video resolution using convolutional neural networks
CN109191376A (en) * 2018-07-18 2019-01-11 电子科技大学 High-resolution terahertz image reconstruction method based on SRCNN improved model
CN109862299A (en) * 2017-11-30 2019-06-07 北京大学 Resolution processing method and device
CN110166779A (en) * 2019-05-23 2019-08-23 西安电子科技大学 Video-frequency compression method based on super-resolution reconstruction
CN110177282A (en) * 2019-05-10 2019-08-27 杭州电子科技大学 A kind of inter-frame prediction method based on SRCNN
CN110996171A (en) * 2019-12-12 2020-04-10 北京金山云网络技术有限公司 Training data generation method and device for video tasks and server
CN111147893A (en) * 2018-11-02 2020-05-12 华为技术有限公司 A video adaptive method, related device and storage medium
CN111383172A (en) * 2018-12-29 2020-07-07 Tcl集团股份有限公司 Training method and device of neural network model and intelligent terminal
CN111567056A (en) * 2018-01-04 2020-08-21 三星电子株式会社 Video playback device and control method thereof
CN112188236A (en) * 2019-07-01 2021-01-05 北京新唐思创教育科技有限公司 Video interpolation frame model training method, video interpolation frame generation method and related device
WO2021093393A1 (en) * 2019-11-13 2021-05-20 南京邮电大学 Video compressed sensing and reconstruction method and apparatus based on deep neural network
US11140422B2 (en) * 2019-09-25 2021-10-05 Microsoft Technology Licensing, Llc Thin-cloud system for live streaming content
US11270187B2 (en) 2017-11-07 2022-03-08 Samsung Electronics Co., Ltd Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
CN114598833A (en) * 2022-03-25 2022-06-07 西安电子科技大学 Video frame insertion method based on spatiotemporal joint attention
CN114979703A (en) * 2021-02-18 2022-08-30 阿里巴巴集团控股有限公司 Method of processing video data and method of processing image data
CN111126220B (en) * 2019-12-16 2023-10-17 北京瞭望神州科技有限公司 Real-time positioning method for video monitoring target

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217404A (en) * 2014-08-27 2014-12-17 华南农业大学 Video image sharpness processing method in fog and haze day and device thereof
CN106485688A (en) * 2016-09-23 2017-03-08 西安电子科技大学 High spectrum image reconstructing method based on neutral net

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217404A (en) * 2014-08-27 2014-12-17 华南农业大学 Video image sharpness processing method in fog and haze day and device thereof
CN106485688A (en) * 2016-09-23 2017-03-08 西安电子科技大学 High spectrum image reconstructing method based on neutral net

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Jie: "Research on Behavior Recognition Based on Convolutional Neural Networks", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784628B (en) * 2017-10-18 2021-03-19 南京大学 A Super-Resolution Implementation Method Based on Reconstruction Optimization and Deep Neural Networks
CN107784628A (en) * 2017-10-18 2018-03-09 南京大学 A kind of super-resolution implementation method based on reconstruction optimization and deep neural network
CN108122197A (en) * 2017-10-27 2018-06-05 江西高创保安服务技术有限公司 A kind of image super-resolution rebuilding method based on deep learning
CN108122197B (en) * 2017-10-27 2021-05-04 江西高创保安服务技术有限公司 Image super-resolution reconstruction method based on deep learning
US11270187B2 (en) 2017-11-07 2022-03-08 Samsung Electronics Co., Ltd Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
CN109862299B (en) * 2017-11-30 2021-08-27 北京大学 Resolution processing method and device
CN109862299A (en) * 2017-11-30 2019-06-07 北京大学 Resolution processing method and device
US11457273B2 (en) 2018-01-04 2022-09-27 Samsung Electronics Co., Ltd. Video playback device and control method thereof
CN111567056A (en) * 2018-01-04 2020-08-21 三星电子株式会社 Video playback device and control method thereof
CN108111860B (en) * 2018-01-11 2020-04-14 安徽优思天成智能科技有限公司 Lost frame prediction and recovery method of video sequence based on deep residual network
CN108111860A (en) * 2018-01-11 2018-06-01 安徽优思天成智能科技有限公司 Video sequence lost frames prediction restoration methods based on depth residual error network
CN108322685B (en) * 2018-01-12 2020-09-25 广州华多网络科技有限公司 Video frame insertion method, storage medium and terminal
WO2019137248A1 (en) * 2018-01-12 2019-07-18 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108322685A (en) * 2018-01-12 2018-07-24 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108805808A (en) * 2018-04-04 2018-11-13 东南大学 A method of improving video resolution using convolutional neural networks
CN108600762A (en) * 2018-04-23 2018-09-28 中国科学技术大学 In conjunction with the progressive video frame generating method of motion compensation and neural network algorithm
CN109191376B (en) * 2018-07-18 2022-11-25 电子科技大学 High-resolution terahertz image reconstruction method based on improved SRCNN model
CN109191376A (en) * 2018-07-18 2019-01-11 电子科技大学 High-resolution terahertz image reconstruction method based on SRCNN improved model
CN111147893B (en) * 2018-11-02 2021-10-22 华为技术有限公司 A video adaptive method, related device and storage medium
CN111147893A (en) * 2018-11-02 2020-05-12 华为技术有限公司 A video adaptive method, related device and storage medium
US11509860B2 (en) 2018-11-02 2022-11-22 Huawei Technologies Co., Ltd. Video adaptation method, related device, and storage medium
CN111383172A (en) * 2018-12-29 2020-07-07 Tcl集团股份有限公司 Training method and device of neural network model and intelligent terminal
CN110177282A (en) * 2019-05-10 2019-08-27 杭州电子科技大学 A kind of inter-frame prediction method based on SRCNN
CN110177282B (en) * 2019-05-10 2021-06-04 杭州电子科技大学 An Inter-frame Prediction Method Based on SRCNN
CN110166779B (en) * 2019-05-23 2021-06-08 西安电子科技大学 Video compression method based on super-resolution reconstruction
CN110166779A (en) * 2019-05-23 2019-08-23 西安电子科技大学 Video-frequency compression method based on super-resolution reconstruction
CN112188236A (en) * 2019-07-01 2021-01-05 北京新唐思创教育科技有限公司 Video interpolation frame model training method, video interpolation frame generation method and related device
US11140422B2 (en) * 2019-09-25 2021-10-05 Microsoft Technology Licensing, Llc Thin-cloud system for live streaming content
WO2021093393A1 (en) * 2019-11-13 2021-05-20 南京邮电大学 Video compressed sensing and reconstruction method and apparatus based on deep neural network
CN110996171B (en) * 2019-12-12 2021-11-26 北京金山云网络技术有限公司 Training data generation method and device for video tasks and server
CN110996171A (en) * 2019-12-12 2020-04-10 北京金山云网络技术有限公司 Training data generation method and device for video tasks and server
CN111126220B (en) * 2019-12-16 2023-10-17 北京瞭望神州科技有限公司 Real-time positioning method for video monitoring target
CN114979703A (en) * 2021-02-18 2022-08-30 阿里巴巴集团控股有限公司 Method of processing video data and method of processing image data
CN114598833A (en) * 2022-03-25 2022-06-07 西安电子科技大学 Video frame insertion method based on spatiotemporal joint attention

Similar Documents

Publication Publication Date Title
CN107133919A (en) Time dimension video super-resolution method based on deep learning
CN106485688B (en) High spectrum image reconstructing method neural network based
CN110675321B (en) Super-resolution image reconstruction method based on progressive depth residual error network
CN107784628B (en) A Super-Resolution Implementation Method Based on Reconstruction Optimization and Deep Neural Networks
CN111047515A (en) Cavity convolution neural network image super-resolution reconstruction method based on attention mechanism
CN113177882A (en) Single-frame image super-resolution processing method based on diffusion model
CN103020935B (en) The image super-resolution method of the online dictionary learning of a kind of self-adaptation
CN111369466B (en) Image distortion correction enhancement method of convolutional neural network based on deformable convolution
CN108805808A (en) A method of improving video resolution using convolutional neural networks
CN112801877A (en) Super-resolution reconstruction method of video frame
CN114202459B (en) Blind image super-resolution method based on depth priori
CN103093445A (en) Unified feature space image super-resolution reconstruction method based on joint sparse constraint
CN116416375A (en) A 3D reconstruction method and system based on deep learning
CN107644401A (en) Multiplicative noise minimizing technology based on deep neural network
CN113902985B (en) Training method, device and computer equipment for video frame optimization model
CN107341776A (en) Single frames super resolution ratio reconstruction method based on sparse coding and combinatorial mapping
CN113870422A (en) Pyramid Transformer-based point cloud reconstruction method, device, equipment and medium
CN109949217B (en) Video super-resolution reconstruction method based on residual learning and implicit motion compensation
CN114757830B (en) Image super-resolution reconstruction method based on channel-diffusion double-branch network
CN114299185B (en) Magnetic resonance image generation method, device, computer equipment and storage medium
CN114723787A (en) Optical flow calculation method and system
CN117593199A (en) A two-stream remote sensing image fusion method based on Gaussian prior distribution self-attention
CN107103592B (en) A multi-pose face image quality enhancement method based on dual-kernel norm regularization
CN119579410A (en) A super-resolution reconstruction method for tactile glove array signals based on diffusion model
Zeng et al. Face super-resolution via bilayer contextual representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170905)