Light field image depth estimation method based on deep learning
Technical Field
The invention relates to the technical field of computer vision and digital image processing, in particular to a light field image depth estimation method based on deep learning.
Background
In recent years, with the development of light field computational imaging technology, light field cameras have entered the market as light field acquisition devices. Building on the traditional camera model, a light field camera inserts a micro-lens array between the main lens and the sensor. This special structure allows the camera to record, in a single exposure, both the position and the angle of every light ray reaching the imaging surface, enabling applications such as depth estimation and scene refocusing in subsequent processing.
At present, several light-field-based depth estimation methods have been proposed and achieve good results. They fall mainly into three types: methods based on sub-aperture image matching, methods based on polar plane images, and methods based on deep learning. For example, the depth estimation method for light field images disclosed in publication No. CN108596965A considers the occlusion problem and uses depth estimation guided by light field structure characteristics to compute a depth map for the color image of the central viewpoint; the gradient information of the depth map serves as the smoothing term of an energy function for optimizing the global depth within a Markov random field framework; and multi-scale, multi-window stereo matching computes the parallax between the central viewpoint and the other viewpoints at the same horizontal position. However, because of the huge amount of light field data, this method faces an unavoidable trade-off between computation time and accuracy. In the polar-plane-image-based method disclosed in publication No. CN107545586A, the depth of a region is obtained by computing the slope of a straight line in the polar plane image derived from the light field data. However, owing to the geometric characteristics of polar plane images, this type of method has limitations in scenes with occluded or reflective regions.
With the development of deep learning, convolutional neural networks have begun to be used to study the depth estimation problem for light field images. Most existing methods are trained and tested on synthetic light field data sets and achieve good results there. However, compared with synthetic data, actual light field images captured by a light field camera have a narrower baseline and contain a large amount of noise, so these methods perform poorly when applied to actual light field images, which greatly restricts the application of light field depth estimation in real scenes.
Disclosure of Invention
The invention provides a light field image depth estimation method based on deep learning, which combines polar plane images with an image segmentation network, designs a neural network model for light field image depth estimation, and realizes fast and accurate depth estimation for both synthetic and actual light field images.
In order to realize the purpose of the invention, the following technical scheme is adopted:
a light field image depth estimation method based on deep learning comprises the following steps:
(1) decoding a reconstructed light field source file according to the parameter information of the light field camera, and extracting a sub-aperture image array;
(2) inputting the sub-aperture image into a trained neural network for calculation to obtain a secondarily estimated depth map;
the neural network comprises:
a polar plane image portion for extracting an initially estimated depth map from the sub-aperture image;
an image segmentation portion for extracting edge information of the image from the sub-aperture image;
a cascade portion for performing convolution on the initially estimated depth map together with the edge information to obtain the secondarily estimated depth map;
(3) performing median filtering on the secondarily estimated depth map to remove part of the noise and obtain the finally estimated depth map.
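For orientation, the following is a minimal Python sketch of how steps (2) and (3) chain together given the output of step (1); the [9, 9, H, W] array layout and the 3 × 3 filter window are assumptions not fixed by the invention, and net stands in for the trained network described below.

```python
import numpy as np
from scipy.ndimage import median_filter

def estimate_depth(views: np.ndarray, net) -> np.ndarray:
    """Steps (2)-(3): `views` is the decoded [9, 9, H, W] sub-aperture array
    from step (1); `net` is a callable standing in for the trained network."""
    secondary = net(views)                   # step (2): secondarily estimated depth map
    return median_filter(secondary, size=3)  # step (3): remove part of the noise
```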
Optionally, in step (1), the parameter information of the light field camera is acquired by processing a white image captured by the camera;
the light field source file is then decoded, and the required sub-aperture image array is obtained after filtering and color correction.
Several alternatives are provided below. They are not additional limitations on the above general solution but merely further additions or preferences; each alternative may be combined individually with the general solution, or several alternatives may be combined with one another, provided no technical or logical contradiction arises.
Optionally, the shape of the sub-aperture image is adjusted to be square and then input into the neural network.
Optionally, the polar plane image portion is composed of a multi-stream network and a merging network.
Optionally, the input of the multi-stream network is the 9 × 9 sub-aperture image array centered on the central view; polar plane images in the four directions of 0°, 45°, 90° and 135° are extracted from this array and convolved separately by a defined convolution module to extract the depth features of the scene.
Optionally, the merging network is connected to the outputs of the multi-stream network and applies further convolutions to compute the relationships between the depth features of polar plane images in different directions, obtaining the initially estimated depth map.
Optionally, the polar plane image portion and the cascade portion use small 3 × 3 convolution kernels with a step size of 1; same padding is used in the convolution process, keeping the size of the output depth map consistent with that of the input sub-aperture image.
Optionally, the input of the image segmentation part is a central sub-aperture image, and a convolution layer, a pooling layer and a deconvolution layer are used to extract edge information of the image.
Optionally, the neural network uses a light field data set containing real depth maps as the training set and is trained by randomly sampling gray-scale patches, with the mean absolute error as the loss function, which is defined as follows:

L(W, b) = (1/T) · Σ_{t=1}^{T} |H(g_t; W, b) − d_t|

wherein L is the loss function, W is the weight matrix, b is the bias coefficient, T is the number of training patches, H is the forward propagation function of the network, g_t is an input 9 × 9 light field sub-aperture image, and d_t is the corresponding gray-scale patch of the real depth map;
the value of the loss function is reduced through iterative training, reducing the gray-value difference between the finally estimated depth map and the real depth map, until training is judged to have saturated; training then ends and the trained neural network parameters are saved.
Optionally, the neural network applies data augmentation to the training set before training, including rotation, flipping, gamma transformation, and the addition of random noise, so as to avoid overfitting and improve the generalization capability of the network.
The invention has the advantage that depth estimation can be performed quickly and accurately on light field images, yielding a high-precision depth map. The neural network obtains the depth features of the scene through the polar plane image portion and combines them with the image edge information obtained by the image segmentation portion, which alleviates the mismatching problem in depth estimation of actual light field images, reduces bad pixels in the estimated depth map, and improves the accuracy of depth estimation. The whole process exploits the high computing power of the GPU, runs much faster than traditional algorithms, and meets the requirements of practical applications.
Drawings
FIG. 1 is a flow chart of a light field image depth estimation method based on deep learning according to the present invention;
FIG. 2 is a schematic diagram of the structure of a neural network according to the present invention;
FIG. 3 is a schematic diagram of the structure of the convolution module of the present invention;
FIG. 4 is a schematic diagram of the structure of a pooling module of the present invention;
FIG. 5 is a schematic diagram illustrating an example comparison of the depth estimation results of the present invention with the prior art method.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and thus the present invention is not limited to the specific embodiments disclosed below.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Fig. 1 is a schematic flow chart of the light field image depth estimation method based on deep learning in this embodiment; the method includes the following steps:
step 1, decoding the reconstructed light field image according to the parameter information of the light field camera, and extracting a sub-aperture image array.
As described in step 1, the light field raw image captured by a light field camera (e.g., a Lytro camera) is typically a 12-bit Bayer-format image, which needs to be decoded into a sub-aperture image format for the subsequent depth estimation process. The decoding process can use a light field toolbox: a white image captured by the camera is processed to obtain the parameter information of the light field camera, the light field source file is then decoded, and the required sub-aperture image array is obtained after filtering and color correction.
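The decoding itself is performed by the external toolbox; as an illustration of the final rearrangement step only, the sketch below slices an already demosaiced, devignetted lenslet image into a sub-aperture array. It assumes an idealized square microlens grid of pitch n aligned with the sensor; a real Lytro decode additionally requires the white-image calibration, rotation correction, and hexagonal-grid resampling, which are omitted here.

```python
import numpy as np

def lenslet_to_subaperture(lenslet: np.ndarray, n: int = 9) -> np.ndarray:
    """Rearrange a 2-D (grayscale) lenslet image into an n x n sub-aperture
    array [v, u, y, x]: pixel (v, u) under each microlens belongs to view (v, u)."""
    h = (lenslet.shape[0] // n) * n          # crop to a whole number of microlenses
    w = (lenslet.shape[1] // n) * n
    lenslet = lenslet[:h, :w]
    return lenslet.reshape(h // n, n, w // n, n).transpose(1, 3, 0, 2)
```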
Step 2, inputting the sub-aperture image into the trained neural network for calculation to obtain the secondarily estimated depth map.
Due to the special structure of the microlenses in the light field camera, the resulting sub-aperture image is generally rectangular, with unequal length and width. As described in step 2, in order to extract polar plane images in four directions in subsequent operations, the shape of the sub-aperture image must be adjusted to a square before it is input into the neural network.
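A minimal sketch of this adjustment, resizing every view of a [9, 9, H, W] array to S × S with OpenCV; the choice of S = min(H, W) and bilinear interpolation are assumptions, as the patent fixes neither.

```python
import cv2
import numpy as np

def make_square(views: np.ndarray) -> np.ndarray:
    """Resize each sub-aperture view to S x S (dtype should be one OpenCV
    supports, e.g. uint8 or float32)."""
    s = min(views.shape[2], views.shape[3])
    out = np.empty(views.shape[:2] + (s, s), dtype=views.dtype)
    for v in range(views.shape[0]):
        for u in range(views.shape[1]):
            out[v, u] = cv2.resize(views[v, u], (s, s), interpolation=cv2.INTER_LINEAR)
    return out
```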
Step 3, performing median filtering on the secondarily estimated depth map and removing part of the noise to obtain the finally estimated depth map.
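This step is a single SciPy call; the 3 × 3 window below is an assumed size, as the patent does not specify the kernel.

```python
import numpy as np
from scipy.ndimage import median_filter

secondary_depth = np.random.rand(256, 256).astype(np.float32)  # stand-in for the step-2 output
final_depth = median_filter(secondary_depth, size=3)           # removes isolated noisy pixels
```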
Fig. 2 is a schematic structural diagram of the neural network according to the present invention, which includes a polar plane image portion 100, an image segmentation portion 200, and a cascade portion 300.
The polar plane image portion 100 is composed of a multi-stream network 110 and a merging network 120 and extracts the initially estimated depth map. The input of the multi-stream network 110 is the 9 × 9 sub-aperture image array centered on the central view; polar plane images in the four directions of 0°, 45°, 90° and 135° are extracted from this array and convolved separately by a defined convolution module to extract the depth features of the scene. The merging network 120 connects the outputs of the multi-stream network 110 and then applies 9 defined convolution modules to compute the relationships between the depth features of polar plane images in different directions, obtaining the initially estimated depth map.
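As an illustration of the multi-stream input, the NumPy sketch below slices the four directional view stacks out of a [9, 9, H, W] sub-aperture array; the array layout and the assignment of the two diagonals to 45° versus 135° are assumed conventions. An EPI is a slice of such a stack at a fixed spatial row or column.

```python
import numpy as np

def directional_stacks(lf: np.ndarray):
    """Return the 0°, 45°, 90° and 135° view stacks of a [9, 9, H, W] array."""
    n = lf.shape[0]
    c, idx = n // 2, np.arange(n)
    s0 = lf[c, :]                 # 0°:   central row of views
    s90 = lf[:, c]                # 90°:  central column of views
    s45 = lf[idx[::-1], idx]      # 45°:  anti-diagonal views (assumed convention)
    s135 = lf[idx, idx]           # 135°: main-diagonal views (assumed convention)
    return s0, s45, s90, s135
```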
The image segmentation portion 200 uses a fully convolutional network (FCN) structure. After the central sub-aperture image is input, it is downsampled by 3 defined pooling modules; convolution-deconvolution operations are then applied a different number of times to the output of each pooling module, and the outputs are combined so that high-level contour information is fused with low-level fine information, extracting the edge information of the image.
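A minimal PyTorch sketch of this branch, assuming three pooling modules of the Fig. 4 form, one deconvolution per pooled scale, and illustrative channel widths; the patent fixes none of these numbers, and the repeated convolution-deconvolution per scale it describes is simplified here to a single transposed convolution.

```python
import torch
import torch.nn as nn

def pool_block(cin, cout):
    # "conv - ReLU - conv - batch norm - ReLU - pool" (cf. Fig. 4)
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(),
        nn.MaxPool2d(2))

class EdgeNet(nn.Module):
    """FCN-style edge branch: each pooled scale is upsampled back to full
    resolution and the scales are summed, fusing contour and fine detail."""
    def __init__(self):
        super().__init__()
        self.p1, self.p2, self.p3 = pool_block(1, 16), pool_block(16, 32), pool_block(32, 64)
        self.up1 = nn.ConvTranspose2d(16, 8, 2, stride=2)   # 1/2 -> full resolution
        self.up2 = nn.ConvTranspose2d(32, 8, 4, stride=4)   # 1/4 -> full
        self.up3 = nn.ConvTranspose2d(64, 8, 8, stride=8)   # 1/8 -> full

    def forward(self, x):          # x: central view, [B, 1, S, S], S divisible by 8
        f1 = self.p1(x)
        f2 = self.p2(f1)
        f3 = self.p3(f2)
        return self.up1(f1) + self.up2(f2) + self.up3(f3)
```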
The cascade portion 300 connects the outputs of the polar plane image portion 100 and the image segmentation portion 200, combines the depth information of the image with its edge information, and applies a "convolution - ReLU - convolution" operation to obtain the secondarily estimated depth map.
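A corresponding sketch of the cascade, again with assumed channel counts; only the "convolution - ReLU - convolution" pattern and the same-padding behavior come from the description above.

```python
import torch
import torch.nn as nn

class Cascade(nn.Module):
    """Fuse the initially estimated depth map with the edge features and
    refine them via "convolution - ReLU - convolution"."""
    def __init__(self, edge_ch: int = 8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(1 + edge_ch, 32, 3, padding=1),  # same padding keeps the size
            nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, depth, edges):
        # depth: [B, 1, S, S] initial estimate; edges: [B, edge_ch, S, S]
        return self.fuse(torch.cat([depth, edges], dim=1))
```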
In another preferred embodiment, the neural network uses small 3 × 3 convolution kernels with a step size of 1 in the polar plane image portion 100 and the cascade portion 300, so as to measure more accurately the minute disparities in narrow-baseline light field images; same padding is used in the convolution process to keep the size of the output depth map consistent with that of the input sub-aperture image.
In another preferred embodiment, the neural network uses a light field data set containing real depth maps as the training set and is trained by randomly sampling gray-scale patches, with the mean absolute error as the loss function, which is defined as follows:

L(W, b) = (1/T) · Σ_{t=1}^{T} |H(g_t; W, b) − d_t|

wherein L is the loss function, W is the weight matrix, b is the bias coefficient, T is the number of training patches, H is the forward propagation function of the network, g_t is an input 9 × 9 light field sub-aperture image, and d_t is the corresponding gray-scale patch of the real depth map. Iterative training reduces the value of the loss function, and with it the gray-value difference between the finally estimated depth map and the real depth map. When the network parameters change little between updates and repeated iterations no longer improve the test results, training is judged to have saturated; it is then ended and the trained neural network parameters are saved.
In another preferred embodiment, the neural network applies data augmentation to the training set before training, including rotation, flipping, gamma transformation, and the addition of random noise, to avoid overfitting and improve the generalization capability of the network.
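A sketch of one augmentation pass over a [9, 9, H, W] view array and its depth map; the gamma range and noise level are assumed values, and note that rotating or flipping a light field must permute the angular axes together with the spatial ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(views: np.ndarray, depth: np.ndarray):
    """Random rotation, flip, gamma transform and noise for one sample."""
    k = int(rng.integers(4))
    views = np.rot90(np.rot90(views, k, axes=(2, 3)), k, axes=(0, 1))  # spatial + angular
    depth = np.rot90(depth, k)
    if rng.random() < 0.5:                       # horizontal flip: x and u axes together
        views, depth = views[:, ::-1, :, ::-1], depth[:, ::-1]
    views = np.clip(views, 0.0, 1.0) ** rng.uniform(0.4, 1.0)          # gamma transform
    views = views + rng.normal(0.0, 0.005, views.shape)                # random noise
    return views.astype(np.float32), depth.astype(np.float32)
```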
In another preferred embodiment, the randomly sampled gray-scale patches of the neural network have a size of 64 × 64, and the Adam optimizer is used for optimization.
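Combining the loss, patch size and optimizer just stated, one training step could look as follows in PyTorch; the network interface (an 81-channel stack of views in, a 1-channel depth map out) and the learning rate are assumptions.

```python
import torch
import torch.nn as nn

def train_step(net, optimizer, views, gt_depth, patch: int = 64):
    """One step on a randomly sampled 64 x 64 gray-scale patch.
    views: [B, 81, H, W] stacked sub-aperture views; gt_depth: [B, 1, H, W]."""
    h, w = gt_depth.shape[-2:]
    y = int(torch.randint(0, h - patch + 1, (1,)))
    x = int(torch.randint(0, w - patch + 1, (1,)))
    pred = net(views[..., y:y + patch, x:x + patch])
    loss = nn.functional.l1_loss(pred, gt_depth[..., y:y + patch, x:x + patch])  # mean absolute error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)  # learning rate is an assumed value
```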
Fig. 3 is a schematic structural diagram of the convolution module of the present invention. The convolution module used in the invention is defined as "convolution layer - ReLU layer - convolution layer - batch normalization layer - ReLU layer" and performs the convolution calculations of the polar plane image portion. The ReLU layer serves as an activation function that introduces nonlinearity, and the batch normalization layer speeds up convergence and suppresses overfitting.
Fig. 4 is a schematic diagram of the structure of the pooling module of the present invention. The pooling module used in the invention is defined as "convolution layer - ReLU layer - convolution layer - batch normalization layer - ReLU layer - pooling layer" and is used to downsample the sub-aperture image and extract low-level information.
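Under the two structures just described, the modules could be sketched in PyTorch as below; channel counts are left as parameters, and max pooling is an assumed choice since the patent does not name the pooling operator.

```python
import torch.nn as nn

def conv_module(cin: int, cout: int) -> nn.Sequential:
    # Fig. 3: "conv - ReLU - conv - batch norm - ReLU";
    # 3 x 3 kernels, stride 1, same padding, per the preferred embodiment above.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=1, padding=1), nn.ReLU(),
        nn.Conv2d(cout, cout, 3, stride=1, padding=1),
        nn.BatchNorm2d(cout), nn.ReLU())

def pooling_module(cin: int, cout: int) -> nn.Sequential:
    # Fig. 4: the same stack followed by a pooling layer for downsampling.
    return nn.Sequential(conv_module(cin, cout), nn.MaxPool2d(2))
```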
In this embodiment, two typical light field image depth estimation methods are compared with the present invention: the LF_OCC method, proposed by Wang et al. in 2015 and published at ICCV, and the EPINET method, proposed by Shin et al. in 2018 and published at CVPR.
This embodiment uses the Lytro Illum light field dataset provided by Daudt et al. to test the performance of the invention on real-scene light field data. The dataset comprises 36 sets of Lytro Illum camera data. Fig. 5 shows the depth estimation results for 3 typical scenes: the first column is the central sub-aperture image of each scene, and the second to fourth columns are the results of the LF_OCC method, the EPINET method, and the present invention, respectively; the top two rows show two outdoor scenes, and the bottom row an indoor scene.
Analysis of this embodiment shows clearly that the invention estimates depth information well in both indoor and outdoor noisy scenes.
Tested on the HCI light field data set, the light field image depth estimation method based on deep learning achieves the following performance: an average bad pixel rate of 8.201% (at a threshold of 0.07), an average mean square error of 3.020%, and an average computation time of 0.415 seconds, satisfying fast, high-precision depth estimation for both synthetic and actual light field images.
On the basis of the neural network, the method combines polar plane image analysis with an image segmentation method, exploiting the depth features and the edge information of the image simultaneously; this alleviates the mismatching problem in depth estimation of actual light field images, reduces bad pixels in the estimated depth map, and improves the accuracy of depth estimation. The whole process exploits the high computing power of the GPU, obtains a high-precision depth map quickly, and meets the requirements of practical applications.
The above description is only exemplary of the preferred embodiments of the present invention, and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.