Disclosure of Invention
The invention aims to overcome the defects and shortcomings of conventional video compressed sensing reconstruction methods and provides a video compressed sensing reconstruction method based on a deep expansion network. The method designs a deep expansion network for training and reconstruction, which not only has good interpretability but also achieves high reconstruction precision while maintaining a high reconstruction speed.
The technical scheme of the invention is as follows:
a video compressed sensing reconstruction method based on a deep expansion network comprises the following steps: S1, constructing a training data set: the training data set is composed of a plurality of data pairs, and each data pair is composed of an observation frame formed by compressing multiple frames and the corresponding uncompressed multiple frames; S2, constructing a deep expansion network: unfolding the half-quadratic splitting algorithm for optimizing compressed sensing into a deep expansion network, and adding a dense feature fusion technique; S3, training the deep expansion network: based on the training data set, a loss function is given, and the parameters in the deep expansion network are continuously optimized by using back propagation and gradient descent algorithms until the loss function is stable; and S4, applying the trained deep expansion network to carry out the video compressed sensing reconstruction process: the input is the compressed observation frames and the sampling matrices, and the output is the reconstructed video multiframes.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S1, a video training data set is constructed for training the deep expansion network, where the training data set is composed of a plurality of data pairs, and each data pair includes a group of consecutive video frames and the corresponding observation frame formed by compressing the multiple frames.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S2, the deep expansion network is obtained by unfolding the half-quadratic splitting algorithm for optimizing compressed sensing, the network structure of the deep expansion network is formed by alternately stacking data modules and prior modules, and 3D convolution is introduced to improve the capability of the deep expansion network to characterize inter-frame correlation; a dense feature fusion technique is used to reduce the loss caused by information passing between different stages and to help information to be adaptively transmitted across the stages.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S3, a back propagation algorithm is used to calculate the gradients of the loss function with respect to each parameter in the deep expansion network, and a gradient descent algorithm is then used to optimize the parameters of the network layers of the deep expansion network based on the training data set until the value of the loss function is stable, so as to obtain the optimal parameters of the deep expansion network.
Preferably, in the video compressed sensing reconstruction method based on the deep expansion network, in step S4, a rough reconstruction is performed by using the acquired observation frame and the sampling matrix, and then the reconstruction result and the sampling matrix are sent to the trained deep expansion network, and the output is a high-quality reconstruction result.
According to the technical scheme of the invention, the beneficial effects are as follows:
1. the method constructs a deep expansion network for video compressed sensing reconstruction, in which 3D convolution is introduced to solve the proximal mapping, so that the correlation in both the time domain and the space domain can be better utilized;
2. the method of the invention greatly surpasses existing methods in both subjective results and numerical metrics, and achieves the best reconstruction performance to date; the dense feature fusion technique provided by the invention effectively alleviates the problem of information loss in the network and brings a gain of about 0.45 dB;
3. the method of the invention is the first to discuss how to solve the problem of information loss in the task of video compressed sensing, and provides a feasible solution. In addition, in order to improve the information fusion capability, the method of the invention provides a dense feature adaptive fusion mode, which enables the effective information in the dense features to be adaptively transmitted between different stages.
For a better understanding and appreciation of the concepts, principles of operation, and effects of the invention, reference will now be made in detail to the following examples, taken in conjunction with the accompanying drawings.
Detailed Description
In order to make the objects, technical means and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific examples. These examples are merely illustrative and not restrictive of the invention.
The invention discloses a video compressed sensing reconstruction method based on a deep expansion network, which is used for reconstructing a high-quality video sequence from observation frames, each obtained by compressing a plurality of frames, acquired by a camera based on a Digital Micromirror Device (DMD) or Liquid Crystal On Silicon (LCOS).
As shown in fig. 1, the video compressed sensing reconstruction method based on the deep expansion network of the present invention includes the following steps:
S1, constructing a training data set: the training data set is composed of a plurality of data pairs, each data pair consisting of an observation frame formed by compressing multiple frames and the corresponding uncompressed multiframe. Specifically, a video training data set is constructed for training the deep expansion network; the training data set is composed of a plurality of data pairs, and each data pair comprises a group of continuous video frames and the corresponding observation frame formed by compressing the multiple frames.
To determine the optimal parameters of the proposed deep expansion network, the invention constructs a training data set for the video compressed sensing reconstruction problem. In the experiments, training is performed on the DAVIS data set, which comprises 90 scenes at a resolution of 480p; during training, the original images are cropped to a size of 128 × 128, each data pair contains 8 frames, and 25600 data pairs are collected in total. Each data pair is thus composed of a compressed observation frame y and the uncompressed multiframe x; the observation frame and the sampling matrix serve as the input of the network, and the output is the reconstructed result x^K, where the uncompressed multiframe is the reconstruction target. Such training data pairs form the network training data set S.
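For illustration only, the following minimal sketch (in Python/NumPy) shows how one such data pair can be simulated, assuming binary random masks as the sampling matrix; the random arrays are stand-ins for the cropped DAVIS frames and for the actual DMD/LCOS masks, and the variable names are hypothetical.

```python
import numpy as np

B, H, W = 8, 128, 128                                     # frames per pair, crop size
x = np.random.rand(B, H, W).astype(np.float32)            # stand-in for 8 cropped video frames
phi = (np.random.rand(B, H, W) > 0.5).astype(np.float32)  # stand-in binary sampling masks

# One observation frame compresses the B frames: y = sum_b phi_b * x_b.
y = (phi * x).sum(axis=0)                                 # shape (H, W)

pair = {"observation": y, "masks": phi, "target": x}      # (y, x) is one training data pair
```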
S2, constructing a deep expansion network: the half-quadratic splitting algorithm for optimizing compressed sensing is unfolded into a neural network, and a dense feature fusion technique is added. The deep expansion network is formed by unfolding the half-quadratic splitting algorithm for optimizing compressed sensing; its structure is formed by alternately stacking data modules and prior modules, and 3D convolution is introduced to improve the capability of the network to characterize inter-frame correlation. The dense feature fusion technique is used to reduce the loss caused by information passing between different stages and to help information to be adaptively transmitted across the stages.
The reconstruction result of video compressed sensing can be obtained by solving the following optimization problem:

$$\min_{x}\ \frac{1}{2}\|y-\Phi x\|_2^2+\lambda\Psi(x),\qquad(1)$$

where x is a video clip of B consecutive frames (i.e. the multiple frames x in fig. 2), y is the observation frame compressed from the B frames, Φ is the corresponding sampling matrix, Ψ(x) is a regularization term that constrains certain prior properties of the video frames x, and λ is the coefficient of the regularization term.
Introducing an auxiliary variable v (an intermediate variable, as shown in fig. 2), equation (1) can be converted into a constrained optimization problem:

$$\min_{x,v}\ \frac{1}{2}\|y-\Phi v\|_2^2+\lambda\Psi(x),\quad\text{s.t. } x=v.\qquad(2)$$
the obtained objective function can be subjected to iterative optimization through a semi-quadratic splitting algorithm, and the method specifically comprises the following steps:
where k represents the number of iteration steps of the semi-quadratic split and η is another regular term coefficient.
Subproblem (3) admits the closed-form solution

$$v^{k}=x^{k-1}+\Phi^{\top}(\Phi\Phi^{\top}+\eta I)^{-1}(y-\Phi x^{k-1}),\qquad(5)$$

whose computation can be accelerated as

$$v^{k}=x^{k-1}+\Phi^{\top}(\Phi\Phi^{\top}+\eta I)^{-1}(r^{k}-\Phi x^{k-1}),\qquad(6)$$

where r is likewise an intermediate variable (as shown in fig. 2), with r^0 initialized from the observation frame y.
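For mask-based video compressed sensing, the sampling matrix has the block form Φ = [diag(φ_1), …, diag(φ_B)], so ΦΦ^⊤ is diagonal and the inverse in equations (5) and (6) reduces to an element-wise division. The following PyTorch sketch of the update (6) relies on that structure; the tensor layout and the function name are illustrative assumptions, not the patent's exact implementation.

```python
import torch

def data_module(x_prev, r, phi, eta):
    """Accelerated v-update of equation (6).
    x_prev: (B, H, W) current estimate; r: (H, W) intermediate observation;
    phi: (B, H, W) sampling masks; eta: scalar penalty coefficient."""
    phi_x = (phi * x_prev).sum(dim=0)              # Phi x^{k-1}, shape (H, W)
    diag = (phi * phi).sum(dim=0) + eta            # diagonal entries of Phi Phi^T + eta I
    correction = (r - phi_x) / diag                # (Phi Phi^T + eta I)^{-1} (r - Phi x^{k-1})
    return x_prev + phi * correction.unsqueeze(0)  # x^{k-1} + Phi^T (...)
```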
Equation (4) can be viewed as a denoising subproblem, whereby the v- and x-subproblems can be re-expressed as:

$$v^{k}=x^{k-1}+\Phi^{\top}(\Phi\Phi^{\top}+\eta I)^{-1}(r^{k}-\Phi x^{k-1}),\qquad(7)$$

$$x^{k}=\mathcal{H}^{k}([v^{k},\bar{y}]),\qquad(8)$$

where [·, ·] represents a concatenation (cascading) operation, H^k denotes the learned denoising network of the k-th stage, and ȳ represents a regularized observation frame; since ȳ carries rich measurement information, it is introduced into the prior module to provide supplementary reconstruction information. One iteration of the original optimization problem is thus converted into a data module (equation (7)) and a prior module (equation (8)) in the network.
Therefore, the invention unfolds the half-quadratic splitting algorithm into a deep expansion network, in which the reconstruction network is formed by alternately stacking the data modules and the prior modules (as shown in fig. 3). The initial value (i.e. initialization in fig. 2) x^0 = Φ^⊤y is the result of a coarse reconstruction from the observation frame and the sampling matrix; x^0, y and the sampling matrix are then sent into the network for reconstruction. The data module and the prior module solve equations (7) and (8), respectively, and the prior module contains a dense feature adaptive fusion module for reducing information loss.
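A skeleton of such an unfolded network might look as follows; this is a hedged sketch that reuses the data_module function from the sketch above, treats the prior module as an opaque callable, and omits the concatenation with ȳ and the dense feature fusion for brevity. The class name, default stage count and learnable per-stage η are assumptions.

```python
import torch
import torch.nn as nn

class UnfoldedNet(nn.Module):
    """K alternating data modules and prior modules (cf. fig. 2 and fig. 3)."""
    def __init__(self, make_prior, num_stages=10):
        super().__init__()
        self.priors = nn.ModuleList([make_prior() for _ in range(num_stages)])
        self.eta = nn.Parameter(torch.ones(num_stages))  # per-stage coefficient eta

    def forward(self, y, phi):
        x = phi * y.unsqueeze(0)            # coarse initialization x^0 = Phi^T y
        r = y.clone()                       # intermediate variable, initialized from y
        for k, prior in enumerate(self.priors):
            v = data_module(x, r, phi, self.eta[k])        # data module, equation (7)
            x = prior(v.unsqueeze(0).unsqueeze(0))[0, 0]   # prior module, equation (8)
        return x                            # reconstructed multiframe x^K
```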
To better model the inter-frame correlation, 3D convolution is used in the proposed prior module. Compared with a 2D convolution kernel, a 3D convolution kernel slides not only within the two-dimensional spatial plane but also along the time-domain dimension, which enables better utilization of inter-frame correlation in multi-frame reconstruction.
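As a concrete illustration of the 3D-convolution idea, a minimal residual block is sketched below; the channel width and depth are illustrative, not the patent's actual prior-network design.

```python
import torch.nn as nn

class Res3DBlock(nn.Module):
    """Residual block whose 3D kernels slide over (time, height, width)."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),  # 3x3x3 kernel
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):        # x: (N, C, T, H, W)
        return x + self.body(x)  # residual connection preserves the input content
```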
As described above, the information transmitted between stages in the deep expansion network has only a limited number of channels, while the backbone used in each stage of the invention is a UNet-like structure whose internal features are multi-channel; information is therefore lost during the transmission from stage to stage. To solve this problem, the invention proposes a dense feature fusion technique, whose specific process is as follows (refer to fig. 2 and fig. 3):
as shown in FIG. 3, the network structure of the k-th stage (i.e., stage k) prior module is an encoder-decoder structure and has multiple scales, here, j ∈ [1, 2, 3 ] is used]To represent different scales, where j-1 corresponds to the shallowest layer of the network, and the characteristic diagram of the j-th scale output in the k-th stage encoder is defined as
Correspondingly, the input and output of the j scale in the k stage decoder are respectively
And
the output of each scale of the network can be expressed as:
it should be noted that
Represents the input to the a priori module(s),
indicating that there is no residual connection between the encoder and decoder at the scale corresponding to the deepest layer of the network.
Since the inputs and outputs of the prior module in each stage have only a limited number of channels, information is lost when it passes between stages. The dense feature fusion technique (as shown in fig. 2) therefore additionally fuses, into the input of each decoder scale, the feature maps of the corresponding scale from the previous stage, applying nearest-neighbour upsampling where the resolutions differ; this reduces the information loss caused by the change of the number of channels and by up- and down-sampling. The features passed from the (k−1)-th stage are called dense features and are denoted F^{k−1} = {F_j^{k−1}}, j ∈ {1, 2, 3}, where F_j^{k−1} is the feature map of the j-th scale in stage k−1.
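A hedged sketch of this fusion step is given below: the decoder input of stage k absorbs the same-scale dense feature from stage k−1, with nearest-neighbour upsampling where the resolutions differ; the 1×1 convolution used to align channel counts is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFusion(nn.Module):
    """Fuse the same-scale feature map of stage k-1 into a decoder input of stage k."""
    def __init__(self, prev_channels, cur_channels):
        super().__init__()
        self.align = nn.Conv2d(prev_channels, cur_channels, kernel_size=1)  # match channels

    def forward(self, decoder_in, dense_feat):
        if dense_feat.shape[-2:] != decoder_in.shape[-2:]:
            dense_feat = F.interpolate(dense_feat, size=decoder_in.shape[-2:],
                                       mode="nearest")       # nearest-neighbour upsampling
        return decoder_in + self.align(dense_feat)            # fused decoder input
```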
Finally, what the prior module learns is a residual map of the multi-frame reconstruction result of the data module (as shown in fig. 3):

$$x^{k}=v^{k}+\mathcal{H}^{k}([v^{k},\bar{y}]),$$

i.e. the denoising network of equation (8) is applied in residual form.
Since the features of different channels contribute differently to the final result when the dense features are fused, a dense feature adaptive fusion technique is provided to ensure that information is selectively transferred between adjacent stages. Specifically, when the dense feature F^{k−1} is consistent with the regularized observation frame ȳ, it should be enhanced during fusion, and suppressed otherwise. The core idea is to compute the similarity between the dense features and the regularized observation frame. As shown in FIG. 4, for the m-th dense feature F_m^{k−1}, the similarity S(p, q, c) at a certain position (p, q) of the c-th channel is defined by an anisotropic filter, generated from the regularized observation frame, multiplied with the dense feature, where H and W denote the height and width of the feature map at the corresponding scale and n_f denotes the size of the anisotropic filter. The similarity is then passed through the sigmoid function σ, which transforms the values to the [0, 1] interval, and the adaptive dense feature map is obtained by weighting the dense features with the resulting mask:

$$\tilde{F}^{k-1}=\sigma(S)\odot F^{k-1},$$

where ⊙ denotes element-wise multiplication.
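The gating step itself reduces to a sigmoid-weighted multiplication, as the following sketch shows; the similarity map is taken as a precomputed input here, abstracting away the anisotropic-filter computation described above.

```python
import torch

def adaptive_fuse(dense_feat, similarity):
    """dense_feat and similarity: tensors of matching shape (N, C, H, W)."""
    gate = torch.sigmoid(similarity)   # sigma maps the similarity to [0, 1]
    return gate * dense_feat           # enhance consistent features, suppress the rest
```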
s3, training a deep expansion network, wherein the training process is as follows: based on a training data set, a loss function is given, and parameters in a deep expansion network are continuously optimized by using a back propagation and gradient descent algorithm until the loss function is stable. Specifically, a loss function is designed, the gradient of the loss function relative to each parameter in the deep expansion network is calculated by adopting a back propagation algorithm, then the parameters of the network layer are optimized by adopting a gradient descent algorithm based on a training data set until the value of the loss function is stable, namely, until a model converges, and the optimal parameters of the deep expansion network are obtained.
Taking S as the training data set, the mean square error is taken as the loss function of the network:

$$\mathcal{L}(\Theta)=\frac{1}{N_k N_s}\sum_{i=1}^{N_k}\left\|x_i^{K}-x_i\right\|_2^2,$$

where N_k represents the total number of training data pairs, N_s represents the total number of pixels of the images in each data pair, x_i is the uncompressed multiframe of the i-th data pair, and x_i^K is the corresponding network output. The gradients of the loss function with respect to each parameter Θ of the network are calculated by the back propagation algorithm, and the parameters of the network layers are then optimized by a gradient descent algorithm based on the training data set until the value of the loss function is stable, so as to obtain the optimal parameters of the deep expansion network.
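A single optimization step could be sketched as follows, reusing UnfoldedNet, Res3DBlock and data_module from the sketches above; the choice of Adam as the gradient-descent variant and the learning rate are illustrative assumptions (the text only specifies back propagation and gradient descent).

```python
import torch

model = UnfoldedNet(make_prior=lambda: Res3DBlock(channels=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(y, phi, x_gt):
    x_rec = model(y, phi)                    # forward pass: reconstruct x^K
    loss = torch.mean((x_rec - x_gt) ** 2)   # mean square error loss
    optimizer.zero_grad()
    loss.backward()                          # back propagation of the gradients
    optimizer.step()                         # gradient-descent parameter update
    return loss.item()
```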
S4, applying the trained deep expansion network to carry out the video compressed sensing reconstruction process: the input is the compressed observation frames and the sampling matrices, and the output is the reconstructed video multiframes.
Through the training process of step S3, the optimal parameters of the deep expansion network can be determined. Based on the trained model, when video compressed sensing reconstruction is performed, a rough reconstruction is first computed from the acquired observation frame y and the sampling matrix Φ (as shown in fig. 2), namely x^0 = Φ^⊤y; the reconstruction result and the sampling matrix are then sent into the trained deep expansion network, and the output is the high-quality reconstruction result.
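Inference then amounts to a single forward pass, as in this usage sketch (the random tensors stand in for a real captured observation frame and its sampling masks):

```python
import torch

y = torch.rand(256, 256)                        # stand-in acquired observation frame
phi = (torch.rand(8, 256, 256) > 0.5).float()   # stand-in sampling matrix (masks)

model.eval()                                    # model from the training sketch above
with torch.no_grad():
    frames = model(y, phi)                      # reconstructed frames, shape (8, 256, 256)
```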
During testing, the invention reconstructs synthetic data and real data respectively. The synthetic data comprises six scenes, Kobe, Traffic, Runner, Drop, Crash and Aerial, each of dimension 256 × 256 × 8; the real data set comprises two scenes, Water Balloon and Dominoes, of dimension 512 × 512 × 10. In order to objectively evaluate the reconstruction accuracy of the different methods, the peak signal-to-noise ratio (PSNR) is used as the comparison index. All experiments were run on NVIDIA Tesla V100 servers. The deep expansion network used in the experiments has K = 10 stages.
Table 1: comparison of results of different methods under synthesized data
As shown in table 1 above, the deep expansion network proposed by the invention is compared with ten video compressed sensing reconstruction methods on the synthetic data; the compared methods are: GAP-TV[3], E2E-CNN[7], DeSCI[6], PnP-FFDNet[16], BIRNAT[10], Tensor-ADMM[13], Tensor-FISTA[14], GAP-UNet[15], MetaSCI[11], RevSCI[12]. The deep expansion network proposed by the invention achieves the highest reconstruction accuracy on the synthetic data. FIGS. 5 and 6 show the reconstruction results of the methods on different scenes of the synthetic data, and FIG. 7 shows the reconstruction results of the methods on the real data; the reconstruction results of the methods can be compared both as a whole and in the magnified details.
The foregoing description is of the preferred embodiment of the concepts and principles of operation in accordance with the invention. The above-described embodiments should not be construed as limiting the scope of the claims, and other embodiments and combinations of implementations according to the inventive concept are within the scope of the invention.
References:
[1] Llull P, Liao X, Yuan X, et al. Coded aperture compressive temporal imaging[J]. Optics Express, 2013, 21(9): 10526-10545.
[2] Wagadarikar A A, Pitsianis N P, Sun X, et al. Video rate spectral imaging using a coded aperture snapshot spectral imager[J]. Optics Express, 2009, 17(8): 6368-6388.
[3] Yuan X. Generalized alternating projection based total variation minimization for compressive sensing[C]. 2016 IEEE International Conference on Image Processing. IEEE, 2016: 2539-2543.
[4] Yang J, Yuan X, Liao X, et al. Video compressive sensing using Gaussian mixture models[J]. IEEE Transactions on Image Processing, 2014, 23(11): 4863-4878.
[5] Reddy D, Veeraraghavan A, Chellappa R. P2C2: Programmable pixel compressive camera for high speed imaging[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2011: 329-336.
[6] Liu Y, Yuan X, Suo J, et al. Rank minimization for snapshot compressive imaging[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(12): 2990-3006.
[7] Qiao M, Meng Z, Ma J, et al. Deep learning for video compressive sensing[J]. APL Photonics, 2020, 5(3): 030801.
[8] Iliadis M, Spinoulas L, Katsaggelos A K. Deep fully-connected networks for video compressive sensing[J]. Digital Signal Processing, 2018, 72: 9-18.
[9] Yoshida M, Torii A, Okutomi M, et al. Joint optimization for compressive video sensing and reconstruction under hardware constraints[C]. Proceedings of the European Conference on Computer Vision, 2018: 634-649.
[10] Cheng Z, Lu R, Wang Z, et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging[C]. European Conference on Computer Vision. Springer, Cham, 2020: 258-275.
[11] Wang Z, Zhang H, Cheng Z, et al. MetaSCI: Scalable and adaptive reconstruction for video compressive sensing[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[12] Cheng Z, Chen B, Liu G, et al. Memory-efficient network for large-scale video compressive sensing[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[13] Ma J, Liu X Y, Shou Z, et al. Deep Tensor ADMM-Net for snapshot compressive imaging[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 10223-10232.
[14] Han X, Wu B, Shou Z, et al. Tensor FISTA-Net for real-time snapshot compressive imaging[C]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 10933-10940.
[15] Meng Z, Jalali S, Yuan X. GAP-net for snapshot compressive imaging[J]. arXiv preprint arXiv:2012.08364, 2020.
[16] Yuan X, Liu Y, Suo J, et al. Plug-and-play algorithms for large-scale snapshot compressive imaging[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 1447-1457.