Disclosure of Invention
The invention aims to overcome the defects and shortcomings of conventional video compressed sensing reconstruction methods and provides a video compressed sensing reconstruction method based on a deep expansion network. The method designs a deep expansion network for training and reconstruction, which not only has good interpretability but also achieves high reconstruction precision while maintaining a high reconstruction speed.
The technical scheme of the invention is as follows:
a video compressed sensing reconstruction method based on a deep expansion network comprises the following steps: S1, constructing a training data set: the training data set is composed of a plurality of data pairs, and each data pair is composed of an observation frame formed by compressing multiple frames and the corresponding uncompressed multiple frames; S2, constructing a deep expansion network: unfolding the half-quadratic splitting algorithm for optimizing compressed sensing into a deep expansion network, and adding a dense feature fusion technique; S3, training the deep expansion network: based on the training data set, a loss function is given, and the parameters in the deep expansion network are continuously optimized by using back propagation and gradient descent algorithms until the loss function is stable; and S4, applying the trained deep expansion network to carry out the video compressed sensing reconstruction process: the input is the compressed observation frames and the sampling matrices, and the output is the reconstructed video multiframes.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S1, a video training data set is constructed for training the deep expansion network, where the training data set is composed of a plurality of data pairs, and each data pair includes a group of consecutive video frames and the corresponding observation frame formed by compressing the multiple frames.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S2, the deep expansion network is obtained by unfolding the half-quadratic splitting algorithm for optimizing compressed sensing, the network structure of the deep expansion network is formed by alternately stacking data modules and prior modules, and 3D convolution is introduced to improve the capability of the deep expansion network to characterize inter-frame correlation; a dense feature fusion technique is used to reduce the loss caused by information passing between different stages and to help information to be adaptively transmitted across the stages.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S3, a back propagation algorithm is used to calculate the gradients of the loss function with respect to each parameter in the deep expansion network, and a gradient descent algorithm is then used to optimize the parameters of the network layers of the deep expansion network based on the training data set until the value of the loss function is stable, so as to obtain the optimal parameters of the deep expansion network.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S4, a rough reconstruction is performed by using the acquired observation frame and the sampling matrix, and then the reconstruction result and the sampling matrix are sent to the trained deep expansion network, and the output is a high-quality reconstruction result.
According to the technical scheme of the invention, the beneficial effects are as follows:
1. the method constructs a deep expansion network for video compressed sensing reconstruction, in which 3D convolution is introduced to solve the proximal mapping, so that the correlation in both the time domain and the space domain can be better utilized;
2. the method of the invention greatly surpasses existing methods in both subjective results and numerical metrics, and achieves the best reconstruction performance to date; the dense feature fusion technique provided by the invention effectively alleviates the problem of information loss in the network and brings a gain of about 0.45 dB;
3. the method of the invention is the first to discuss how to solve the problem of information loss in the task of video compressed sensing, and provides a feasible solution. In addition, in order to improve the information fusion capability, the method of the invention provides a dense feature adaptive fusion mode, which enables the effective information in the dense features to be adaptively transmitted between different stages.
For a better understanding and appreciation of the concepts, principles of operation, and effects of the invention, reference will now be made in detail to the following examples, taken in conjunction with the accompanying drawings.
Detailed Description
In order to make the objects, technical means and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific examples. These examples are merely illustrative and not restrictive of the invention.
The invention discloses a video compressed sensing reconstruction method based on a deep expansion network, which is used for reconstructing a high-quality video sequence from observation frames, each obtained by compressing a plurality of frames, acquired by a camera based on a Digital Micromirror Device (DMD) or Liquid Crystal On Silicon (LCOS).
As shown in fig. 1, the video compressed sensing reconstruction method based on the deep expansion network of the present invention includes the following steps:
S1, constructing a training data set: the training data set is composed of a plurality of data pairs, each data pair consisting of an observation frame formed by compressing multiple frames and the corresponding uncompressed multiframe. Specifically, a video training data set is constructed for training the deep expansion network; the training data set is composed of a plurality of data pairs, and each data pair comprises a group of continuous video frames and the corresponding observation frame formed by compressing the multiple frames.
To determine the optimal parameters of the proposed deep expansion network, the invention constructs a training data set for the video compressed sensing reconstruction problem. In the experiments, training is performed on the DAVIS data set, which comprises 90 scenes at a resolution of 480p; during training, the original images are cropped to a size of 128 × 128, each data pair contains 8 frames, and 25600 data pairs are collected in total. Each data pair is thus composed of a compressed observation frame y and the uncompressed multiframe x; the observation frame and the sampling matrix serve as the input of the network, and the output is the reconstructed result x^K, where the uncompressed multiframe is the reconstruction target. Such training data pairs form the network training data set S.
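For illustration only, the following minimal sketch (in Python/NumPy) shows how one such data pair can be simulated, assuming binary random masks as the sampling matrix; the random arrays are stand-ins for the cropped DAVIS frames and for the actual DMD/LCOS masks, and the variable names are hypothetical.

```python
import numpy as np

B, H, W = 8, 128, 128                                     # frames per pair, crop size
x = np.random.rand(B, H, W).astype(np.float32)            # stand-in for 8 cropped video frames
phi = (np.random.rand(B, H, W) > 0.5).astype(np.float32)  # stand-in binary sampling masks

# One observation frame compresses the B frames: y = sum_b phi_b * x_b.
y = (phi * x).sum(axis=0)                                 # shape (H, W)

pair = {"observation": y, "masks": phi, "target": x}      # (y, x) is one training data pair
```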
S2, constructing a deep expansion network: the half-quadratic splitting algorithm for optimizing compressed sensing is unfolded into a neural network, and a dense feature fusion technique is added. The deep expansion network is formed by unfolding the half-quadratic splitting algorithm for optimizing compressed sensing; its structure is formed by alternately stacking data modules and prior modules, and 3D convolution is introduced to improve the capability of the network to characterize inter-frame correlation. The dense feature fusion technique is used to reduce the loss caused by information passing between different stages and to help information to be adaptively transmitted across the stages.
The reconstruction result of video compressed sensing can be obtained by solving the following optimization problem:

$$\min_{x}\ \frac{1}{2}\|y-\Phi x\|_2^2+\lambda\Psi(x),\qquad(1)$$

where x is a video clip of B consecutive frames (i.e. the multiple frames x in fig. 2), y is the observation frame compressed from the B frames, Φ is the corresponding sampling matrix, Ψ(x) is a regularization term that constrains certain prior properties of the video frames x, and λ is the coefficient of the regularization term.
Introducing an auxiliary variable v (an intermediate variable, as shown in fig. 2), equation (1) can be converted into a constrained optimization problem:

$$\min_{x,v}\ \frac{1}{2}\|y-\Phi v\|_2^2+\lambda\Psi(x),\quad\text{s.t. } x=v.\qquad(2)$$
the obtained objective function can be subjected to iterative optimization through a semi-quadratic splitting algorithm, and the method specifically comprises the following steps:
where k represents the number of iteration steps of the semi-quadratic split and η is another regular term coefficient.
Subproblem (3) admits the closed-form solution

$$v^{k}=x^{k-1}+\Phi^{\top}(\Phi\Phi^{\top}+\eta I)^{-1}(y-\Phi x^{k-1}),\qquad(5)$$

whose computation can be accelerated as

$$v^{k}=x^{k-1}+\Phi^{\top}(\Phi\Phi^{\top}+\eta I)^{-1}(r^{k}-\Phi x^{k-1}),\qquad(6)$$

where r is likewise an intermediate variable (as shown in fig. 2), with r^0 initialized from the observation frame y.
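For mask-based video compressed sensing, the sampling matrix has the block form Φ = [diag(φ_1), …, diag(φ_B)], so ΦΦ^⊤ is diagonal and the inverse in equations (5) and (6) reduces to an element-wise division. The following PyTorch sketch of the update (6) relies on that structure; the tensor layout and the function name are illustrative assumptions, not the patent's exact implementation.

```python
import torch

def data_module(x_prev, r, phi, eta):
    """Accelerated v-update of equation (6).
    x_prev: (B, H, W) current estimate; r: (H, W) intermediate observation;
    phi: (B, H, W) sampling masks; eta: scalar penalty coefficient."""
    phi_x = (phi * x_prev).sum(dim=0)              # Phi x^{k-1}, shape (H, W)
    diag = (phi * phi).sum(dim=0) + eta            # diagonal entries of Phi Phi^T + eta I
    correction = (r - phi_x) / diag                # (Phi Phi^T + eta I)^{-1} (r - Phi x^{k-1})
    return x_prev + phi * correction.unsqueeze(0)  # x^{k-1} + Phi^T (...)
```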
Equation (4) can be viewed as a denoising subproblem, whereby the v- and x-subproblems can be re-expressed as:

$$v^{k}=x^{k-1}+\Phi^{\top}(\Phi\Phi^{\top}+\eta I)^{-1}(r^{k}-\Phi x^{k-1}),\qquad(7)$$

$$x^{k}=\mathcal{H}^{k}([v^{k},\bar{y}]),\qquad(8)$$

where [·, ·] represents a concatenation (cascading) operation, H^k denotes the learned denoising network of the k-th stage, and ȳ represents a regularized observation frame; since ȳ carries rich measurement information, it is introduced into the prior module to provide supplementary reconstruction information. One iteration of the original optimization problem is thus converted into a data module (equation (7)) and a prior module (equation (8)) in the network.
Therefore, the invention unfolds the half-quadratic splitting algorithm into a deep expansion network, in which the reconstruction network is formed by alternately stacking the data modules and the prior modules (as shown in fig. 3). The initial value (i.e. initialization in fig. 2) x^0 = Φ^⊤y is the result of a coarse reconstruction from the observation frame and the sampling matrix; x^0, y and the sampling matrix are then sent into the network for reconstruction. The data module and the prior module solve equations (7) and (8), respectively, and the prior module contains a dense feature adaptive fusion module for reducing information loss.
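A skeleton of such an unfolded network might look as follows; this is a hedged sketch that reuses the data_module function from the sketch above, treats the prior module as an opaque callable, and omits the concatenation with ȳ and the dense feature fusion for brevity. The class name, default stage count and learnable per-stage η are assumptions.

```python
import torch
import torch.nn as nn

class UnfoldedNet(nn.Module):
    """K alternating data modules and prior modules (cf. fig. 2 and fig. 3)."""
    def __init__(self, make_prior, num_stages=10):
        super().__init__()
        self.priors = nn.ModuleList([make_prior() for _ in range(num_stages)])
        self.eta = nn.Parameter(torch.ones(num_stages))  # per-stage coefficient eta

    def forward(self, y, phi):
        x = phi * y.unsqueeze(0)            # coarse initialization x^0 = Phi^T y
        r = y.clone()                       # intermediate variable, initialized from y
        for k, prior in enumerate(self.priors):
            v = data_module(x, r, phi, self.eta[k])        # data module, equation (7)
            x = prior(v.unsqueeze(0).unsqueeze(0))[0, 0]   # prior module, equation (8)
        return x                            # reconstructed multiframe x^K
```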
To better model the inter-frame correlation, 3D convolution is used in the proposed prior module. Compared with a 2D convolution kernel, a 3D convolution kernel slides not only within the two-dimensional spatial plane but also along the time-domain dimension, which enables better utilization of inter-frame correlation in multi-frame reconstruction.
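As a concrete illustration of the 3D-convolution idea, a minimal residual block is sketched below; the channel width and depth are illustrative, not the patent's actual prior-network design.

```python
import torch.nn as nn

class Res3DBlock(nn.Module):
    """Residual block whose 3D kernels slide over (time, height, width)."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),  # 3x3x3 kernel
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):        # x: (N, C, T, H, W)
        return x + self.body(x)  # residual connection preserves the input content
```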
As described above, the information transmitted between stages in the deep expansion network has only a limited number of channels, while the backbone used in each stage of the invention is a UNet-like structure whose internal features are multi-channel; information is therefore lost during the transmission from stage to stage. To solve this problem, the invention proposes a dense feature fusion technique, whose specific process is as follows (refer to fig. 2 and fig. 3):
as shown in FIG. 3, the network structure of the k-th stage (i.e., stage k) prior module is an encoder-decoder structure and has multiple scales, here, j ∈ [1, 2, 3 ] is used]To represent different scales, where j-1 corresponds to the shallowest layer of the network, and the characteristic diagram of the j-th scale output in the k-th stage encoder is defined as
Correspondingly, the input and output of the j scale in the k stage decoder are respectively
And
the output of each scale of the network can be expressed as:
it should be noted that
Represents the input to the a priori module(s),
indicating that there is no residual connection between the encoder and decoder at the scale corresponding to the deepest layer of the network.
Since the inputs and outputs of the prior module in each stage have only a limited number of channels, information is lost when it passes between stages. The dense feature fusion technique (as shown in fig. 2) therefore additionally fuses, into the input of each decoder scale, the feature maps of the corresponding scale from the previous stage, applying nearest-neighbour upsampling where the resolutions differ; this reduces the information loss caused by the change of the number of channels and by up- and down-sampling. The features passed from the (k−1)-th stage are called dense features and are denoted F^{k−1} = {F_j^{k−1}}, j ∈ {1, 2, 3}, where F_j^{k−1} is the feature map of the j-th scale in stage k−1.
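A hedged sketch of this fusion step is given below: the decoder input of stage k absorbs the same-scale dense feature from stage k−1, with nearest-neighbour upsampling where the resolutions differ; the 1×1 convolution used to align channel counts is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFusion(nn.Module):
    """Fuse the same-scale feature map of stage k-1 into a decoder input of stage k."""
    def __init__(self, prev_channels, cur_channels):
        super().__init__()
        self.align = nn.Conv2d(prev_channels, cur_channels, kernel_size=1)  # match channels

    def forward(self, decoder_in, dense_feat):
        if dense_feat.shape[-2:] != decoder_in.shape[-2:]:
            dense_feat = F.interpolate(dense_feat, size=decoder_in.shape[-2:],
                                       mode="nearest")       # nearest-neighbour upsampling
        return decoder_in + self.align(dense_feat)            # fused decoder input
```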
Finally, what the prior module learns is a residual map of the multi-frame reconstruction result of the data module (as shown in fig. 3):

$$x^{k}=v^{k}+\mathcal{H}^{k}([v^{k},\bar{y}]),$$

i.e. the denoising network of equation (8) is applied in residual form.
Since the features of different channels contribute differently to the final result when the dense features are fused, a dense feature adaptive fusion technique is provided to ensure that information is selectively transferred between adjacent stages. Specifically, when the dense feature F^{k−1} is consistent with the regularized observation frame ȳ, it should be enhanced during fusion, and suppressed otherwise. The core idea is to compute the similarity between the dense features and the regularized observation frame. As shown in FIG. 4, for the m-th dense feature F_m^{k−1}, the similarity S(p, q, c) at a certain position (p, q) of the c-th channel is defined by an anisotropic filter, generated from the regularized observation frame, multiplied with the dense feature, where H and W denote the height and width of the feature map at the corresponding scale and n_f denotes the size of the anisotropic filter. The similarity is then passed through the sigmoid function σ, which transforms the values to the [0, 1] interval, and the adaptive dense feature map is obtained by weighting the dense features with the resulting mask:

$$\tilde{F}^{k-1}=\sigma(S)\odot F^{k-1},$$

where ⊙ denotes element-wise multiplication.
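The gating step itself reduces to a sigmoid-weighted multiplication, as the following sketch shows; the similarity map is taken as a precomputed input here, abstracting away the anisotropic-filter computation described above.

```python
import torch

def adaptive_fuse(dense_feat, similarity):
    """dense_feat and similarity: tensors of matching shape (N, C, H, W)."""
    gate = torch.sigmoid(similarity)   # sigma maps the similarity to [0, 1]
    return gate * dense_feat           # enhance consistent features, suppress the rest
```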
s3, training a deep expansion network, wherein the training process is as follows: based on a training data set, a loss function is given, and parameters in a deep expansion network are continuously optimized by using a back propagation and gradient descent algorithm until the loss function is stable. Specifically, a loss function is designed, the gradient of the loss function relative to each parameter in the deep expansion network is calculated by adopting a back propagation algorithm, then the parameters of the network layer are optimized by adopting a gradient descent algorithm based on a training data set until the value of the loss function is stable, namely, until a model converges, and the optimal parameters of the deep expansion network are obtained.
Taking S as the training data set, the mean square error is taken as the loss function of the network:

$$\mathcal{L}(\Theta)=\frac{1}{N_k N_s}\sum_{i=1}^{N_k}\left\|x_i^{K}-x_i\right\|_2^2,$$

where N_k represents the total number of training data pairs, N_s represents the total number of pixels of the images in each data pair, x_i is the uncompressed multiframe of the i-th data pair, and x_i^K is the corresponding network output. The gradients of the loss function with respect to each parameter Θ of the network are calculated by the back propagation algorithm, and the parameters of the network layers are then optimized by a gradient descent algorithm based on the training data set until the value of the loss function is stable, so as to obtain the optimal parameters of the deep expansion network.
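A single optimization step could be sketched as follows, reusing UnfoldedNet, Res3DBlock and data_module from the sketches above; the choice of Adam as the gradient-descent variant and the learning rate are illustrative assumptions (the text only specifies back propagation and gradient descent).

```python
import torch

model = UnfoldedNet(make_prior=lambda: Res3DBlock(channels=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(y, phi, x_gt):
    x_rec = model(y, phi)                    # forward pass: reconstruct x^K
    loss = torch.mean((x_rec - x_gt) ** 2)   # mean square error loss
    optimizer.zero_grad()
    loss.backward()                          # back propagation of the gradients
    optimizer.step()                         # gradient-descent parameter update
    return loss.item()
```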
S4, applying the trained deep expansion network to carry out the video compressed sensing reconstruction process: the input is the compressed observation frames and the sampling matrices, and the output is the reconstructed video multiframes.
Through the training process of step S3, the optimal parameters of the deep expansion network can be determined. Based on the trained model, when video compressed sensing reconstruction is performed, a rough reconstruction is first computed from the acquired observation frame y and the sampling matrix Φ (as shown in fig. 2), namely x^0 = Φ^⊤y; the reconstruction result and the sampling matrix are then sent into the trained deep expansion network, and the output is the high-quality reconstruction result.
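Inference then amounts to a single forward pass, as in this usage sketch (the random tensors stand in for a real captured observation frame and its sampling masks):

```python
import torch

y = torch.rand(256, 256)                        # stand-in acquired observation frame
phi = (torch.rand(8, 256, 256) > 0.5).float()   # stand-in sampling matrix (masks)

model.eval()                                    # model from the training sketch above
with torch.no_grad():
    frames = model(y, phi)                      # reconstructed frames, shape (8, 256, 256)
```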
During testing, the invention reconstructs synthetic data and real data respectively. The synthetic data comprises six scenes, Kobe, Traffic, Runner, Drop, Crash and Aerial, each of dimension 256 × 256 × 8; the real data set comprises two scenes, Water Balloon and Dominoes, of dimension 512 × 512 × 10. In order to objectively evaluate the reconstruction accuracy of the different methods, the peak signal-to-noise ratio (PSNR) is used as the comparison index. All experiments were run on NVIDIA Tesla V100 servers. The deep expansion network used in the experiments has K = 10 stages.
Table 1: comparison of results of different methods under synthesized data
As shown in table 1 above, the deep expansion network proposed by the invention is compared with ten video compressed sensing reconstruction methods on the synthetic data; the compared methods are: GAP-TV[3], E2E-CNN[7], DeSCI[6], PnP-FFDNet[16], BIRNAT[10], Tensor-ADMM[13], Tensor-FISTA[14], GAP-UNet[15], MetaSCI[11], RevSCI[12]. The deep expansion network proposed by the invention achieves the highest reconstruction accuracy on the synthetic data. FIGS. 5 and 6 show the reconstruction results of the methods on different scenes of the synthetic data, and FIG. 7 shows the reconstruction results of the methods on the real data; the reconstruction results of the methods can be compared both as a whole and in the magnified details.
The foregoing description is of the preferred embodiment of the concepts and principles of operation in accordance with the invention. The above-described embodiments should not be construed as limiting the scope of the claims, and other embodiments and combinations of implementations according to the inventive concept are within the scope of the invention.
References:
[1] Llull P, Liao X, Yuan X, et al. Coded aperture compressive temporal imaging[J]. Optics Express, 2013, 21(9): 10526-10545.
[2] Wagadarikar A A, Pitsianis N P, Sun X, et al. Video rate spectral imaging using a coded aperture snapshot spectral imager[J]. Optics Express, 2009, 17(8): 6368-6388.
[3] Yuan X. Generalized alternating projection based total variation minimization for compressive sensing[C]. 2016 IEEE International Conference on Image Processing. IEEE, 2016: 2539-2543.
[4] Yang J, Yuan X, Liao X, et al. Video compressive sensing using Gaussian mixture models[J]. IEEE Transactions on Image Processing, 2014, 23(11): 4863-4878.
[5] Reddy D, Veeraraghavan A, Chellappa R. P2C2: Programmable pixel compressive camera for high speed imaging[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2011: 329-336.
[6] Liu Y, Yuan X, Suo J, et al. Rank minimization for snapshot compressive imaging[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(12): 2990-3006.
[7] Qiao M, Meng Z, Ma J, et al. Deep learning for video compressive sensing[J]. APL Photonics, 2020, 5(3): 030801.
[8] Iliadis M, Spinoulas L, Katsaggelos A K. Deep fully-connected networks for video compressive sensing[J]. Digital Signal Processing, 2018, 72: 9-18.
[9] Yoshida M, Torii A, Okutomi M, et al. Joint optimization for compressive video sensing and reconstruction under hardware constraints[C]. Proceedings of the European Conference on Computer Vision, 2018: 634-649.
[10] Cheng Z, Lu R, Wang Z, et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging[C]. European Conference on Computer Vision. Springer, Cham, 2020: 258-275.
[11] Wang Z, Zhang H, Cheng Z, et al. MetaSCI: Scalable and adaptive reconstruction for video compressive sensing[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[12] Cheng Z, Chen B, Liu G, et al. Memory-efficient network for large-scale video compressive sensing[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[13] Ma J, Liu X Y, Shou Z, et al. Deep Tensor ADMM-Net for snapshot compressive imaging[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 10223-10232.
[14] Han X, Wu B, Shou Z, et al. Tensor FISTA-Net for real-time snapshot compressive imaging[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 10933-10940.
[15] Meng Z, Jalali S, Yuan X. GAP-net for snapshot compressive imaging[J]. arXiv preprint arXiv:2012.08364, 2020.
[16] Yuan X, Liu Y, Suo J, et al. Plug-and-play algorithms for large-scale snapshot compressive imaging[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 1447-1457.