CN112533026A - Video frame interpolation method based on convolutional neural network - Google Patents
- Publication number
- CN112533026A
- Authority
- CN
- China
- Prior art keywords
- neural network
- convolutional neural
- video frame
- video
- frame interpolation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234381—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440281—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Television Systems (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention provides a video frame interpolation method based on a convolutional neural network, comprising the following steps: first, consecutive video frames are acquired and input into a convolutional neural network for down-sampling and up-sampling, while frame-interpolation features are extracted and output; intermediate frames are then output cyclically through an LSTM convolutional layer; finally, the mean square error of the optical flow is calculated and used as the optimization objective function of the non-uniform frame interpolation calculation, optimizing the interpolated frames. The invention avoids the influence that the motion-estimation stage of existing interpolation methods has on interpolation quality, and outputs intermediate frames directly through a deep convolutional neural network.
Description
Technical Field
The invention belongs to the technical field of computer image processing, and particularly relates to a non-uniform video frame interpolation method.
Background
Video frame rate conversion is a technique that reconstructs intermediate frames by interpolation, using the correlated information between two adjacent frames of a video. It can remove redundant information in coding, lower the frame rate during video transmission, and reduce the amount of data transmitted over the video network, so it can be applied to video compression or to enhancing video continuity.
The conventional video frame interpolation method mainly comprises two steps: optical flow estimation and pixel synthesis. Its interpolation quality often depends on the quality of the optical flow estimate, and the optical flow estimation process is easily disturbed by occlusion and blur, producing obvious errors. With the development of deep learning, video frame interpolation based on deep learning has made new breakthroughs, and attempts to interpolate video frames with convolutional neural networks have achieved a degree of success.
Video frame interpolation uses the correlated information between adjacent preceding and following frames in a video to obtain intermediate frames by interpolation. Its purpose is to synthesize new intermediate frames and thereby raise the frame rate of the video.
According to the ratio between the number of newly interpolated frames and the number of input video frames, video frame interpolation can be divided into uniform and non-uniform interpolation. In uniform interpolation, the new interpolated frames and the input video frame sequence are combined into the final new video sequence at a ratio of 1:1; in non-uniform interpolation, they are combined at some other ratio.
Traditional video frame interpolation mainly seeks an explicit correspondence between the pixels of two consecutive frames, most commonly by estimating the optical flow between them. The optical flow field serves as the correspondence between the preceding and following frames and is used to synthesize the intermediate frame. In such methods, the quality of the interpolated frame depends to a great extent on the quality of the optical flow information.
Motion estimation plays an important role in video frame interpolation. Besides directly finding the motion relationship between two adjacent frames, methods that estimate a substitute for the explicit inter-frame motion have continued to appear. These are adaptations of phase-based methods, whose main idea is to encode the motion component in the phase difference between adjacent video frames.
Disclosure of Invention
In order to avoid the influence of the motion-estimation process between two adjacent frames on the quality of the interpolated frame, the invention aims to provide a video frame interpolation method based on a convolutional neural network.
The technical solution of the invention is as follows:
a video frame interpolation method based on a convolutional neural network comprises the following steps:
1) acquiring consecutive frames of a video:
selecting a pair of related preceding and following frames from the real video frames, normalizing them, and inputting the normalized frames into a convolutional neural network;
2) extracting video motion information and restoring video space:
the encoding module in the front half of the convolutional neural network down-samples the normalized frame pair and extracts the motion information between the two frames; the decoding module in the back half of the network then up-samples the down-sampled motion information, restoring the spatial dimensions of the video and compensating for detail; meanwhile, the middle of the network passes bottom-layer information to the deep layers through skip connections, and extracts and outputs the frame-interpolation features;
3) outputting a plurality of intermediate frames:
inputting the up-sampled information from step 2 together with the frame-interpolation features into the bidirectional LSTM convolutional layer, and cyclically outputting at least one intermediate video frame;
4) calculating the mean square error of the optical flow:
calculating the optical flow of the intermediate video frames from step 3 and the optical flow of the real video frames, respectively, through the pre-trained FlowNet; then calculating the mean square error between the two optical flows;
5) optimizing the interpolated frames:
taking the mean square error from step 4 as the optimization objective function of the non-uniform frame interpolation calculation, with FlowNet participating in gradient back-propagation during network optimization, thereby optimizing the interpolated video frames.
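As an illustration of steps 4 and 5, the objective is the mean squared error between the optical flow computed on the interpolated frames and the flow computed on the real frames. The sketch below is illustrative only: `flow_net` is a hypothetical placeholder standing in for the pre-trained FlowNet, not the network used by the invention.

```python
import numpy as np

def flow_mse_loss(flow_pred, flow_real):
    """Mean squared error between two dense optical-flow fields.

    flow_pred, flow_real: arrays of shape (H, W, 2) holding the
    (dx, dy) displacement at every pixel.
    """
    return float(np.mean((flow_pred - flow_real) ** 2))

def flow_net(frame_a, frame_b):
    """Toy stand-in for a pretrained flow estimator: constant rightward motion."""
    h, w = frame_a.shape[:2]
    return np.stack([np.ones((h, w)), np.zeros((h, w))], axis=-1)

frame0 = np.zeros((4, 4, 3))
frame1 = np.zeros((4, 4, 3))
# Identical flows give zero loss; the real method back-propagates this
# loss through the (frozen-architecture, pre-trained) flow network.
loss = flow_mse_loss(flow_net(frame0, frame1), flow_net(frame0, frame1))
```

In the patented method the flow network is kept inside the gradient graph during optimization, so the loss gradients reach the interpolation network through it.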
In step 2, the encoding module in the front half of the convolutional neural network preferably down-samples the normalized frame pair twice, and the decoding module in the back half correspondingly up-samples the down-sampled inter-frame motion information twice.
The convolutional neural network is preferably an encoding-decoding U-Net module, with a 3-layer convolution module as the encoder.
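A minimal sketch of what a 3-layer convolutional encoder with two down-sampling stages could look like, in plain NumPy. The kernel sizes, channel widths, and strides below are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Naive 'valid' 2-D convolution. x: (H, W, C_in); kernel: (k, k, C_in, C_out)."""
    k = kernel.shape[0]
    h_out = (x.shape[0] - k) // stride + 1
    w_out = (x.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, kernel.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

def encode(frames, layers):
    """3-layer conv encoder; the two stride-2 layers halve the spatial size twice."""
    x = frames
    for kernel, stride in layers:
        x = np.maximum(conv2d(x, kernel, stride), 0.0)  # ReLU non-linearity
    return x

rng = np.random.default_rng(0)
# Two stacked RGB frames -> 6 input channels; spatial size 18 -> 16 -> 8 -> 4.
layers = [(rng.standard_normal((3, 3, 6, 8)) * 0.1, 1),
          (rng.standard_normal((2, 2, 8, 16)) * 0.1, 2),
          (rng.standard_normal((2, 2, 16, 32)) * 0.1, 2)]
x = rng.standard_normal((18, 18, 6))
features = encode(x, layers)   # (4, 4, 32) feature map
```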
Before upsampling and downsampling, a dense connection network can also be used for convolution feature extraction respectively.
The up-sampling adopts a sub-pixel algorithm; the down-sampling described above employs the Conv-LSTM algorithm.
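The sub-pixel up-sampling mentioned above rearranges channels into space (pixel shuffle). A small NumPy sketch, using a channels-last layout; the exact channel-ordering convention here is an assumption for illustration.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel upsampling: rearrange (H, W, C*r*r) -> (H*r, W*r, C)."""
    h, w, crr = x.shape
    c = crr // (r * r)
    x = x.reshape(h, w, r, r, c)      # split the channel dim into an r x r grid
    x = x.transpose(0, 2, 1, 3, 4)    # interleave the grid with the spatial axes
    return x.reshape(h * r, w * r, c)

x = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)
y = pixel_shuffle(x, 2)   # (2, 2, 4) -> (4, 4, 1): 4x the spatial resolution
```

Because it is a pure rearrangement, every input value survives in the output, which is why sub-pixel layers recover spatial dimensions without interpolation blur.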
In the step 2, the middle part of the convolutional neural network transmits the information of the network bottom layer to the deep network through skip-connection.
Step 2 above may also include expanding the network's field of view over the video space by sampling with strided convolutions.
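The effect of strided sampling on the field of view can be checked with the standard receptive-field recurrence; the kernel and stride choices below are purely illustrative.

```python
def receptive_field(layers):
    """Receptive field of a conv stack. layers: list of (kernel_size, stride)."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s              # stride multiplies the step between output pixels
    return rf

# Three 3x3 layers at stride 1 versus the same kernels at stride 2:
plain   = receptive_field([(3, 1), (3, 1), (3, 1)])   # 7-pixel field
strided = receptive_field([(3, 2), (3, 2), (3, 2)])   # 15-pixel field
```

With the same number of layers, striding roughly doubles the receptive field per stage, which is the sense in which strided sampling "expands the network's field of view" for motion estimation.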
The invention has the beneficial effects that:
1. The video non-uniform frame interpolation algorithm based on video motion estimation interpolates frames using both the spatial geometric relationship and the temporal motion relationship; it performs motion estimation and compensated prediction through deep learning, and can be used to raise the video frame rate and increase the temporal resolution.
2. The method of the invention performs non-uniform video frame interpolation based on video motion estimation, removing redundant information in coding, lowering the frame rate during video transmission, and reducing the amount of data transmitted over the video network, so it can be applied to video compression or to enhancing video continuity.
3. The method performs non-uniform video frame interpolation based on video motion estimation, thereby improving the temporal resolution of the video, improving its smoothness, and compensating for missing motion information.
4. The method extracts video motion information through down-sampling, reducing the spatial dimension, and expands the network's field of view and receptive field over the video space by sampling with strided convolutions, which enhances the motion-estimation capability of the network.
5. The method uses a sub-pixel algorithm for convolution-feature up-sampling, restoring the spatial dimensions of the video and compensating for detail.
6. The method passes the bottom-layer information of the convolutional neural network to the deep layers through skip connections for feature extraction and for outputting the interpolated frames; this improves the feature-extraction capability of the network and helps avoid gradient explosion during network optimization.
7. The method learns motion information along the video time sequence through the Conv-LSTM algorithm, exploiting Conv-LSTM's ability to output multiple convolution results along the time dimension to realize multi-frame video interpolation based on video motion estimation.
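To illustrate point 7, the sketch below shows how an LSTM cell unrolled over time can emit several feature maps from one fused input. It is deliberately simplified: the gates use per-pixel (1x1) mixing instead of the spatial convolutions of a real Conv-LSTM, and the weights are random placeholders, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, w):
    """One LSTM step with per-pixel (1x1) gates. x, h, c: (H, W, C)."""
    z = np.concatenate([x, h], axis=-1)
    i = sigmoid(z @ w["i"])   # input gate
    f = sigmoid(z @ w["f"])   # forget gate
    o = sigmoid(z @ w["o"])   # output gate
    g = np.tanh(z @ w["g"])   # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def unroll(x, w, steps):
    """Feed the same fused feature map in and emit one output map per step."""
    h = np.zeros_like(x)
    c = np.zeros_like(x)
    outs = []
    for _ in range(steps):
        h, c = lstm_step(x, h, c, w)
        outs.append(h)
    return outs

rng = np.random.default_rng(1)
C = 4
w = {k: rng.standard_normal((2 * C, C)) * 0.1 for k in "ifog"}
frames = unroll(rng.standard_normal((8, 8, C)), w, steps=3)  # 3 maps in time
```

Replacing the 1x1 gate mixing with spatial convolutions gives the Conv-LSTM used in the method; the recurrence over time is what lets one forward pass yield several intermediate frames.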
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of a neural network used in the method of the present invention.
Figs. 3 and 4 are diagrams of the motion-estimation module of the invention: fig. 3 shows the down-sampling (motion-estimation) process in the front half of the convolutional neural network, and fig. 4 shows the up-sampling (motion-estimation) process in the back half.
Fig. 5 and 6 are two associated real video frames 1 and 2 of an embodiment of the invention.
Fig. 7 to 9 show three insertion frames, i.e., an intermediate insertion frame 1, an intermediate insertion frame 2, and an intermediate insertion frame 3, generated in the embodiment of the present invention.
Detailed Description
Referring to fig. 1, the video frame interpolation method based on the convolutional neural network of the present invention includes the following steps:
1) selecting related front and back frames from an input video, normalizing the front and back frames, and inputting the normalized front and back frames into a convolutional neural network;
2) referring to fig. 2, a densely connected network first performs convolution feature extraction on the normalized frame pair; strided convolutions then sample the features to expand the network's field of view over the video space, after which the front half of the convolutional neural network down-samples the normalized frames (see fig. 3) and extracts the motion information between them; a densely connected network again performs convolution feature extraction on the down-sampled features, and the decoding module in the back half of the network up-samples the down-sampled inter-frame motion information (see fig. 4), restoring the spatial dimensions of the video and compensating for detail; meanwhile, the middle of the network passes bottom-layer information to the deep layers through skip connections; the overall structure is an encoding-decoding U-Net module whose encoder is a 3-layer convolution module; the down-sampling preferably uses the Conv-LSTM algorithm and the up-sampling preferably uses the sub-pixel algorithm;
3) inputting the information subjected to the upsampling processing in the step 2 into the bidirectional LSTM convolutional layer, and circularly outputting a plurality of intermediate video frames;
4) respectively calculating the intermediate video frame optical flow and the real video frame optical flow in the step 3 through the pre-trained FlowNet; then calculating the mean square error between the intermediate video frame optical flow and the real video frame optical flow;
5) taking the mean square error from step 4 as the optimization objective function of the non-uniform frame interpolation calculation, with FlowNet participating in gradient back-propagation during network optimization, thereby optimizing the interpolated video frames.
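The data flow of the embodiment can be traced with placeholder operations. Everything below is structural illustration only: decimation and pixel repetition stand in for the learned down/up-sampling stages, and linear blends stand in for the learned intermediate frames, so this shows the shape pipeline (two down-sampling stages, two up-sampling stages, three outputs), not the patented network.

```python
import numpy as np

def downsample(x):
    """Stand-in for a strided-conv motion-estimation stage: halve H and W."""
    return x[::2, ::2, :]

def upsample(x):
    """Stand-in for sub-pixel upsampling: repeat each pixel into a 2x2 block."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def interpolate(frame0, frame1, n):
    """Structural sketch: fuse frames, go down twice, come back up twice,
    then emit n intermediate frames (here simple blends, not learned outputs)."""
    fused = np.concatenate([frame0, frame1], axis=-1)   # stack frame pair
    feat = downsample(downsample(fused))                # two down-sampling stages
    feat = upsample(upsample(feat))                     # two up-sampling stages
    assert feat.shape[:2] == frame0.shape[:2]           # spatial size recovered
    ts = [(k + 1) / (n + 1) for k in range(n)]
    return [(1 - t) * frame0 + t * frame1 for t in ts]

f0 = np.zeros((16, 16, 3))
f1 = np.ones((16, 16, 3))
mids = interpolate(f0, f1, 3)   # three intermediate frames, as in the example
```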
Taking three intermediate interpolated frames as an example, two consecutive frames of the video, real video frame 1 (fig. 5) and real video frame 2 (fig. 6), are input into the convolutional neural network; the convolution features of the normalized frame pair are extracted through a densely connected network, and the motion information between the two frames is extracted after two passes through the down-sampling motion-estimation module. The features are then fed into the decoding module in the back half of the network and, after two passes through the up-sampling motion-estimation module, restored to a spatial size consistent with the input video; the intermediate interpolation results, intermediate frame 1 (fig. 7), intermediate frame 2 (fig. 8), and intermediate frame 3 (fig. 9), are output through the Conv-LSTM algorithm. It can be seen that the laser knife moves gradually to the left relative to real video frame 1 and gradually approaches its position in real video frame 2, while for the relatively static tumor regions of the video frames the interpolation results remain as stable as real video frames 1 and 2.
Claims (7)
1. A video frame interpolation method based on a convolutional neural network, characterized by comprising the following steps:
1) acquiring consecutive frames of a video:
selecting a pair of related preceding and following frames from the real video frames, normalizing them, and inputting the normalized frames into a convolutional neural network;
2) extracting video motion information and restoring video space:
the encoding module in the front half of the convolutional neural network down-samples the normalized frame pair and extracts the motion information between the two frames; the decoding module in the back half of the network then up-samples the down-sampled motion information, restoring the spatial dimensions of the video and compensating for detail;
meanwhile, the middle of the network passes bottom-layer information to the deep layers through skip connections, and extracts and outputs the frame-interpolation features;
3) outputting a plurality of intermediate frames:
inputting the up-sampled information from step 2 together with the frame-interpolation features into the bidirectional LSTM convolutional layer, and cyclically outputting at least one intermediate video frame;
4) calculating the mean square error of the optical flow:
calculating the optical flow of the intermediate video frames from step 3 and the optical flow of the real video frames, respectively, through the pre-trained FlowNet; then calculating the mean square error between the two optical flows;
5) optimizing the interpolated frames:
taking the mean square error from step 4 as the optimization objective function of the non-uniform frame interpolation calculation, with FlowNet participating in gradient back-propagation during network optimization, thereby optimizing the interpolated video frames.
2. The convolutional neural network-based video frame interpolation method of claim 1, wherein: in step 2, the encoding module in the front half of the convolutional neural network down-samples the normalized frame pair twice, and the decoding module in the back half up-samples the down-sampled inter-frame motion information twice.
3. The convolutional neural network-based video frame interpolation method of claim 1, wherein: the convolutional neural network is an encoding and decoding U-Net module, and the encoding module is a 3-layer convolutional module.
4. The convolutional neural network-based video frame interpolation method as claimed in claim 1, 2 or 3, wherein: before upsampling and downsampling, a dense connection network is used for convolution feature extraction respectively.
5. The convolutional neural network-based video frame interpolation method of claim 4, wherein: the up-sampling adopts a sub-pixel algorithm; the down-sampling employs the Conv-LSTM algorithm.
6. The convolutional neural network-based video frame interpolation method of claim 5, wherein: in step 2, the middle part of the convolutional neural network transmits the information of the network bottom layer to the deep network through skip-connection.
7. The convolutional neural network-based video frame interpolation method of claim 6, wherein: step 2 further comprises expanding the network's field of view over the video space by sampling with strided convolutions.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011364443.0A CN112533026A (en) | 2020-11-27 | 2020-11-27 | Video frame interpolation method based on convolutional neural network |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011364443.0A CN112533026A (en) | 2020-11-27 | 2020-11-27 | Video frame interpolation method based on convolutional neural network |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112533026A true CN112533026A (en) | 2021-03-19 |
Family
ID=74994697
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011364443.0A Pending CN112533026A (en) | 2020-11-27 | 2020-11-27 | Video frame interpolation method based on convolutional neural network |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112533026A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113205148A (en) * | 2021-05-20 | 2021-08-03 | 山东财经大学 | Medical image frame interpolation method and terminal for iterative interlayer information fusion |
| CN114066761A (en) * | 2021-11-22 | 2022-02-18 | 青岛根尖智能科技有限公司 | Motion video frame rate enhancement method and system based on optical flow estimation and foreground detection |
| CN114418882A (en) * | 2022-01-17 | 2022-04-29 | 京东方科技集团股份有限公司 | Processing method, training method, device, electronic equipment and medium |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109068174A (en) * | 2018-09-12 | 2018-12-21 | 上海交通大学 | Video frame rate upconversion method and system based on cyclic convolution neural network |
| CN109379550A (en) * | 2018-09-12 | 2019-02-22 | 上海交通大学 | Video frame rate up-conversion method and system based on convolutional neural network |
| WO2019040134A1 (en) * | 2017-08-22 | 2019-02-28 | Google Llc | Optical flow estimation for motion compensated prediction in video coding |
| US20190289321A1 (en) * | 2016-11-14 | 2019-09-19 | Google Llc | Video Frame Synthesis with Deep Learning |
| CN110351511A (en) * | 2019-06-28 | 2019-10-18 | 上海交通大学 | Video frame rate upconversion system and method based on scene depth estimation |
| CN111105382A (en) * | 2019-12-31 | 2020-05-05 | 北京大学 | Video Repair Method |
| CN111405316A (en) * | 2020-03-12 | 2020-07-10 | 北京奇艺世纪科技有限公司 | Frame insertion method, electronic device and readable storage medium |
| WO2020150264A1 (en) * | 2019-01-15 | 2020-07-23 | Portland State University | Feature pyramid warping for video frame interpolation |
- 2020-11-27 CN CN202011364443.0A patent/CN112533026A/en active Pending
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190289321A1 (en) * | 2016-11-14 | 2019-09-19 | Google Llc | Video Frame Synthesis with Deep Learning |
| WO2019040134A1 (en) * | 2017-08-22 | 2019-02-28 | Google Llc | Optical flow estimation for motion compensated prediction in video coding |
| CN109068174A (en) * | 2018-09-12 | 2018-12-21 | 上海交通大学 | Video frame rate upconversion method and system based on cyclic convolution neural network |
| CN109379550A (en) * | 2018-09-12 | 2019-02-22 | 上海交通大学 | Video frame rate up-conversion method and system based on convolutional neural network |
| WO2020150264A1 (en) * | 2019-01-15 | 2020-07-23 | Portland State University | Feature pyramid warping for video frame interpolation |
| CN110351511A (en) * | 2019-06-28 | 2019-10-18 | 上海交通大学 | Video frame rate upconversion system and method based on scene depth estimation |
| CN111105382A (en) * | 2019-12-31 | 2020-05-05 | 北京大学 | Video Repair Method |
| CN111405316A (en) * | 2020-03-12 | 2020-07-10 | 北京奇艺世纪科技有限公司 | Frame insertion method, electronic device and readable storage medium |
Non-Patent Citations (2)
| Title |
|---|
| REN-YU TSENG et al.: "Adaptive Frame Interpolation using an End-to-End Deep Net with High Quality Flow Estimation", 2019 International Conference on Technologies and Applications of Artificial Intelligence * |
| ZHANG Zhifeng: "Research on Video Frame Interpolation Technology Based on Deep Learning", China Master's Theses Full-text Database * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113205148A (en) * | 2021-05-20 | 2021-08-03 | 山东财经大学 | Medical image frame interpolation method and terminal for iterative interlayer information fusion |
| CN114066761A (en) * | 2021-11-22 | 2022-02-18 | 青岛根尖智能科技有限公司 | Motion video frame rate enhancement method and system based on optical flow estimation and foreground detection |
| CN114418882A (en) * | 2022-01-17 | 2022-04-29 | 京东方科技集团股份有限公司 | Processing method, training method, device, electronic equipment and medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Liu et al. | Video super-resolution based on deep learning: a comprehensive survey | |
| CN112218072B (en) | Video coding method based on deconstruction compression and fusion | |
| CN117730338A (en) | Video super-resolution network and video super-resolution, encoding and decoding processing method and device | |
| CN112348766A (en) | Progressive feature stream depth fusion network for surveillance video enhancement | |
| CN111028150A (en) | A fast spatiotemporal residual attention video super-resolution reconstruction method | |
| CN112070665A (en) | Generative Adversarial Video Super-Resolution Reconstruction and Reconstruction Image Authenticity Discrimination Method | |
| CN112533026A (en) | Video frame interpolation method based on convolutional neural network | |
| CN115689917A (en) | Efficient space-time super-resolution video compression restoration method based on deep learning | |
| CN119313583A (en) | An image denoising method based on dual-branch feature fusion | |
| CN119205515A (en) | Image super-resolution generation model construction method and image super-resolution generation method and system | |
| CN120223924A (en) | Panoramic video frame insertion method based on latent diffusion model | |
| CN113850718B (en) | A video synchronization spatiotemporal super-resolution method based on inter-frame feature alignment | |
| CN120075449B (en) | A context-based binocular video compression method | |
| CN114663306B (en) | Video bit depth enhancement method and device based on pyramid multi-level information fusion | |
| CN117593185A (en) | A video super-resolution method and system based on deformable Transformer | |
| CN117640946A (en) | Systems and methods for transmitting and receiving image frames | |
| CN120088133B (en) | An Arbitrary Scale Video Super-Resolution Method Based on a Diffusion Model | |
| Gong et al. | Temporal transformer-based video super-resolution reconstruction with cross-modal attention | |
| CN119850422B (en) | A method and system for super-resolution reconstruction of synthetic aperture radar images based on Mamba's optical guidance | |
| CN120751217B (en) | A video processing method, apparatus, storage medium, and electronic device | |
| CN116033169B (en) | A video compression method and system considering long-distance temporal information | |
| CN113506219B (en) | Video super-resolution model training method and device | |
| CN120374457B (en) | Complex dramatic scene light field image enhancement and angle reconstruction method | |
| Wang et al. | Progressive local-to-global vision transformer for occluded face hallucination | |
| CN119850422A (en) | Mamba-based optical-guided synthetic aperture radar image super-resolution reconstruction method and system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210319 |