CN108830890B - Method for estimating scene geometric information from single image by using generative adversarial network - Google Patents
Method for estimating scene geometric information from single image by using generative adversarial network
- Publication number
- CN108830890B CN108830890B CN201810376281.9A CN201810376281A CN108830890B CN 108830890 B CN108830890 B CN 108830890B CN 201810376281 A CN201810376281 A CN 201810376281A CN 108830890 B CN108830890 B CN 108830890B
- Authority
- CN
- China
- Prior art keywords
- image
- depth
- pseudo
- sample
- depth image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 34
- 238000013528 artificial neural network Methods 0.000 claims abstract description 75
- 230000006870 function Effects 0.000 claims description 43
- 238000012549 training Methods 0.000 claims description 15
- 238000005094 computer simulation Methods 0.000 claims description 11
- 238000013527 convolutional neural network Methods 0.000 claims description 6
- 230000009467 reduction Effects 0.000 claims description 4
- 230000009977 dual effect Effects 0.000 abstract 1
- 238000012545 processing Methods 0.000 description 13
- 238000010606 normalization Methods 0.000 description 11
- 238000005259 measurement Methods 0.000 description 9
- 238000011176 pooling Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 239000011159 matrix material Substances 0.000 description 3
- 238000005070 sampling Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 239000000654 additive Substances 0.000 description 1
- 230000004075 alteration Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000003190 augmentative effect Effects 0.000 description 1
- 230000006399 behavior Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 238000013178 mathematical model Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention provides a method for estimating scene geometric information from a single image by using a generative adversarial network, which comprises the following steps: inputting the image of the scene and the depths of a plurality of pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel refers to the distance between the point in the scene corresponding to that pixel in the image and an observer, and the depth image refers to the collection of the depths of all pixels in an image. The invention takes the image of the scene and the depths of a small number of corresponding pixels in the image as input and predicts or estimates the depth image of the scene through a generative adversarial network with dual consistency constraints; the method is simple, effective and low in cost.
Description
Technical Field
The invention belongs to the field of computer image processing, relates to a method for estimating scene geometric information from a single image, and particularly relates to a method for estimating scene geometric information from a single image by using a generative adversarial network.
Background
Depth information prediction and estimation are very important in engineering application fields such as robotics, autonomous driving, augmented reality (AR), and 3D modeling. At present, there are two main approaches to acquiring depth images: direct ranging and indirect ranging. Direct ranging refers to directly acquiring depth information using various hardware devices. For example, a TOF camera acquires the distance from an object in the target scene to the emitter by emitting continuous near-infrared pulses; a laser radar scans objects in the detected scene by emitting laser light, thereby obtaining the distance from the object surfaces to the radar; the Kinect uses an optical coding technique, projecting onto the scene with an infrared transmitter to obtain three-dimensional depth information. However, each has its own limitations: TOF cameras are typically expensive and susceptible to noise interference; the three-dimensional information captured by a laser radar is uneven and sparse in the color-image coordinate system, and the cost is high; the Kinect has a short measuring range and is easily affected by lighting, producing a large amount of noise.
Indirect ranging refers to indirectly estimating depth using one or more visible-light images of the same scene. According to the number of scene viewpoints, the methods can be divided into multi-view depth estimation, binocular-image depth estimation, and monocular-image depth estimation. Multi-view depth estimation typically employs a camera array to acquire images of the same scene and uses the redundant information between the multiple viewpoint images to compute the depth image. It can obtain a relatively accurate depth image of the scene, but the camera array is expensive, troublesome to configure, and demanding to shoot with, so it is rarely used in practice. Binocular depth estimation calculates depth information by stereo matching, using the disparity between two cameras analogous to human eyes. Monocular depth estimation uses only the video sequence and images of a single viewpoint.
Because of these limitations, depth estimation using a single camera has attracted strong interest: such cameras are small, low-cost, energy-efficient, and widely available in consumer electronics.
In recent years, with the development of deep learning, researchers have extensively studied the depth estimation problem for monocular images using convolutional neural networks (CNNs). Saxena et al. used a supervised learning approach to model the depths of individual points and the relationships between the depths of different points with a Markov random field (MRF) containing multi-scale local and global image features.
CN107578436A discloses a monocular image depth estimation method based on a fully convolutional network (FCN), which includes the steps of: acquiring training image data; inputting the training image data into the FCN and obtaining the feature images output by the pooling layers in sequence; upsampling the feature image output by the last pooling layer to the same size as the feature image output by the previous pooling layer and fusing the two; fusing the output feature images of each pooling layer in turn from back to front to obtain the final predicted depth image; during training, the parameters of the FCN are trained by stochastic gradient descent (SGD); finally, an RGB image requiring depth prediction is acquired and input into the trained FCN to obtain the corresponding depth prediction image. The method adopts a fully convolutional structure and removes the fully connected layers, which effectively reduces the number of network parameters and alleviates the low output-image resolution caused by the convolution process, but it requires extremely large training samples and a long training time.
Disclosure of Invention
To solve the above problems, the present invention provides a method for estimating scene geometric information from a single image using a generative adversarial network, the method comprising:
inputting the image of the scene and the depths of a plurality of pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel refers to the distance between the point in the scene corresponding to that pixel in the image and an observer, and the depth image refers to the collection of the depths of all pixels in an image;
the training step of the generative neural network comprises the following steps:
step A: collecting a training data set: the training data set comprises a plurality of samples, each sample being an image and a corresponding depth image;
step B: constructing a generative adversarial network architecture comprising two generative neural networks (F and G) and two discriminative neural networks (D_X and D_Y);
step C: inputting the image in the sample and the depths of a plurality of pixels of its depth image to G to obtain a corresponding pseudo-depth image; inputting the depth image in the sample to F to obtain a corresponding pseudo image; the pseudo image or pseudo-depth image refers to data generated by a computer model rather than actually shot or measured;
step D: the discriminative neural network D_X discriminates the image in the sample and the pseudo image in step C, and the discriminative neural network D_Y discriminates the depth image in the sample and the pseudo-depth image in step C;
step E: adjusting D_X and D_Y to reduce the discrimination loss in step D;
step F: calculating the difference loss between the depth image in the sample and the pseudo-depth image generated by G in step C, and calculating the difference loss between the image in the sample and the pseudo image generated by F;
step G: adjusting G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo-depth image in step D;
step H: returning to step C and iterating until a preset iteration condition is met, and saving the generative neural network G at that moment as the final generative neural network.
In an embodiment of the present invention, step C specifically comprises:
inputting the image in the sample and the depths of a plurality of pixels of its depth image to G to obtain a corresponding pseudo-depth image, and then inputting the pseudo-depth image to F to obtain a pseudo-restored image;
inputting the depth image in the sample to F to obtain a corresponding pseudo image, and then inputting the pseudo image and the depths of the plurality of pixels of the depth image in the sample to G to obtain a pseudo-restored depth image; the pseudo-restored image or pseudo-restored depth image refers to data generated by a computer model whose input data is itself data generated by another computer model.
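For illustration, the following minimal PyTorch sketch traces these two chained passes. The single-convolution stand-ins for G and F, the tensor shapes, and the encoding of the measured pixel depths as a zero-filled map are assumptions made only to keep the example runnable, not details taken from the patent.

```python
import torch
import torch.nn as nn

# Minimal stand-ins for the two generative networks; the real G and F are the
# fully convolutional networks described in the embodiments below.
G = nn.Conv2d(4, 1, kernel_size=3, padding=1)   # (RGB image + sparse depth map) -> depth image
F = nn.Conv2d(1, 3, kernel_size=3, padding=1)   # depth image -> RGB image

x = torch.rand(1, 3, 256, 256)                   # image X in the sample
y = torch.rand(1, 1, 256, 256)                   # depth image Y in the sample
mask = (torch.rand(1, 1, 256, 256) < 0.01).float()
y_s = y * mask                                   # depths of the few measured pixels (Y_s), zeros elsewhere

# First chain: (X, Y_s) -> G -> pseudo-depth image -> F -> pseudo-restored image
y_fake = G(torch.cat([x, y_s], dim=1))
x_restored = F(y_fake)

# Second chain: Y -> F -> pseudo image, then (pseudo image, Y_s) -> G -> pseudo-restored depth image
x_fake = F(y)
y_restored = G(torch.cat([x_fake, y_s], dim=1))
```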
In one embodiment of the invention, step F further comprises: calculating the difference loss between the depth image in the sample and the pseudo-restored depth image, and calculating the difference loss between the image in the sample and the pseudo-restored image.
In one embodiment of the invention, the generative adversarial network performs inference on the data based on a Bayesian probabilistic model in which X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, Ŷ_s is the depths of the corresponding pixels of the pseudo-depth image, G is a generative neural network for generating a depth image from an image, F is a generative neural network for generating an image from a depth image, D_X is a discriminative neural network for discriminating the authenticity of an image, D_Y is a discriminative neural network for discriminating the authenticity of a depth image and outputs the probability that the depth image is real, Ŷ is the pseudo-depth image generated by the generative neural network G, and X̂ is the pseudo-restored image generated by the generative neural network F.
In one embodiment of the present invention, the loss functions of the generative neural networks G and F are:
L_G = L_GAN + λ1·L_REC + λ2·L_SSC,
L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 − D_Y(G(X)))],
L_REC(X, G, F) = E_X[||X − F(G(X))||_1] + E_Y[||Y − G(F(Y))||_1],
where E is the expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, Ŷ_s is the depths of the corresponding pixels of the pseudo-depth image, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss function, L_REC is the restoration loss function, L_SSC is the loss function between the depths of several pixels of the depth image in the sample and the corresponding pixels in the pseudo-depth image generated by the generative neural network G, λ1 is the weight coefficient of L_REC, and λ2 is the weight coefficient of L_SSC. Preferably, λ1 and λ2 are between 0 and 10.
Further, the loss functions of the discriminative neural networks D_X and D_Y are L_DX and L_DY,
where E is the expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
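The exact formulas for L_SSC, L_DX and L_DY are not reproduced in this text, so the following PyTorch sketch fills them with standard choices, namely an L1 penalty restricted to the measured pixel positions and the usual cross-entropy GAN discriminator loss; these fillers are explicit assumptions, and only L_G, L_GAN and L_REC follow the expressions given above.

```python
import torch

def adversarial_loss_for_G(d_y_fake):
    # The G-dependent part of L_GAN = E_Y[log D_Y(Y)] + E_X[log(1 - D_Y(G(X)))];
    # minimizing it pushes D_Y(G(X)) toward 1, i.e. increases the discrimination loss.
    return torch.log(1.0 - d_y_fake + 1e-8).mean()

def reconstruction_loss(x, x_cycled, y, y_cycled):
    # L_REC(X, G, F) = E_X[||X - F(G(X))||_1] + E_Y[||Y - G(F(Y))||_1]
    return (x - x_cycled).abs().mean() + (y - y_cycled).abs().mean()

def sparse_supervision_loss(y_s, y_fake, mask):
    # L_SSC compares the measured pixel depths Y_s with the same pixels of the
    # pseudo-depth image; an L1 penalty on the measured positions is assumed here.
    return ((y_s - y_fake) * mask).abs().sum() / mask.sum().clamp(min=1.0)

def generator_loss(l_gan, l_rec, l_ssc, lambda1=10.0, lambda2=10.0):
    # L_G = L_GAN + λ1·L_REC + λ2·L_SSC (λ1 = λ2 = 10 in the preferred embodiment)
    return l_gan + lambda1 * l_rec + lambda2 * l_ssc

def discriminator_loss(d_real, d_fake):
    # Assumed standard form for L_DX / L_DY: the discriminator is adjusted to
    # maximize log D(real) + log(1 - D(fake)), i.e. to minimize the negative below.
    return -(torch.log(d_real + 1e-8).mean() + torch.log(1.0 - d_fake + 1e-8).mean())
```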
Further, the generative neural networks G and F are fully convolutional neural networks comprising convolutional layers, residual network layers, and deconvolution layers. Preferably, the number of residual network layers is 9 to 21.
The terms appearing in the present invention are explained as follows:
Scene: a concrete picture formed by objective objects occurring in a certain time and space, generally an indoor scene or an outdoor scene.
Observer: the set of sensor/processor devices used for scene measurement. A point (which may be located inside or outside the devices), uniquely determined by the position and attitude of the device set in the three-dimensional physical world, serves as the reference measurement point for the depth referred to herein.
Depth: the depth of a point in the scene is defined as its geometric distance from the observer position.
Actual measurement point set: several points in the scene whose geometric distances to the observer are measured with an instrument and are considered known in the present method.
Matrix: a two-dimensional data table arranged in rows and columns (J. J. Sylvester, "Additions to the articles in the September number of this journal, 'On a new class of theorems,' and on Pascal's theorem," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1850, 37: 363). By extension, image matrices (images for short) are also matrices.
Neural network: a computational and mathematical model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing.
Generative network: herein, a neural network whose output takes the form of a data matrix.
True/false data: the data may be an image or matrix; true data are sensor data obtained by measuring the scene, and false data are data of similar format output by a configured generative network.
FCN: fully convolutional network, a neural network in which all layers are convolutional layers.
The beneficial effects of the invention are:
1. The invention provides a method for estimating scene geometric information from a single image using a generative adversarial network, which takes the image of the scene and a small amount of corresponding depth information in the image as input; it requires little measurement data, is simple and effective, and reduces the cost of measuring scene depth.
2. The generative network is optimized with a residual network, so that the number of layers of the generative network can be increased and its performance and accuracy are improved.
Drawings
FIG. 1a is an example of an image, denoted X, in the present invention;
FIG. 1b is an example of the depths of several measured pixels, denoted Y_s, in the present invention;
FIG. 1c is an example of a pseudo-depth image predicted by the present invention;
FIG. 2 is a flow chart of an image depth estimation method according to the present invention;
FIG. 3 is a basic schematic diagram of image depth estimation of the present invention;
FIG. 4 is a diagram of the basic structure of the generative adversarial network of the present invention;
FIG. 5 is a diagram of the basic structure of the generative network according to an embodiment of the present invention;
FIG. 6 is a diagram of the basic structure of the discriminative network according to an embodiment of the present invention.
Detailed Description
In order to better understand the technical solution proposed by the present invention, the following further explains the present invention with reference to the drawings and specific embodiments.
Generally, to acquire the geometric information, particularly the depth information, of a scene, people use a Kinect camera to shoot a depth image of the scene; however, the measuring range of the Kinect is short and the acquired depth information is sparse, so multiple measurements are required to obtain the full depth information of the scene. We therefore want to estimate the full depth information of the scene using an RGB image taken by an ordinary camera and a sparse depth image taken by a Kinect camera. The ordinary camera needs to shoot the RGB image at the same position and angle at which the Kinect shoots the depth image.
The method comprises inputting an image of the scene and the depths of a plurality of pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel refers to the distance between the point in the scene corresponding to that pixel in the image and an observer, and the depth image refers to the collection of the depths of all pixels in an image.
The specific implementation steps of the deployment phase are as follows:
step 1: acquire an RGB image X of the scene (as in FIG. 1a) and measure in real time the depths Y_s of 300-400 pixels of the image (typically not more than 1% of all pixels of the image) (see FIG. 1b);
step 2: input the RGB image X together with the pixel depths Y_s into the trained generative neural network G of the generative adversarial network;
step 3: run the generative neural network G to perform a feed-forward computation;
step 4: take the feed-forward result of G as the pseudo-depth image Ŷ (see FIG. 1c); Ŷ is then the prediction of the depth image Y corresponding to X.
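A minimal PyTorch sketch of this deployment pass is given below; the 4-channel input encoding (RGB plus a zero-filled sparse depth map), the checkpoint file name and the loading convention are assumptions, not details from the patent.

```python
import torch

def estimate_depth(G, rgb_image, sparse_depth):
    """Deployment steps 2-4: feed the RGB image X together with the measured pixel
    depths Y_s through the trained generator G; the feed-forward output is the
    pseudo-depth image, i.e. the prediction of the depth image Y.
    rgb_image: (1, 3, H, W) tensor; sparse_depth: (1, 1, H, W) tensor holding the
    300-400 measured depths and zeros elsewhere (an assumed encoding)."""
    G.eval()
    with torch.no_grad():
        return G(torch.cat([rgb_image, sparse_depth], dim=1))

# Usage (checkpoint name and loading convention are assumptions):
# G = torch.load("generator_G.pt", map_location="cpu")
# depth_prediction = estimate_depth(G, x, y_s)
```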
Therefore, the generative adversarial network needs to be trained before the deployment phase.
As shown in fig. 2, in an embodiment of the present invention, the training step of the generative neural network includes:
step A: collecting a training data set: the training data set comprises a number of samples, each sample being a triplet (RGB image X, depth image Y, pixel depths Y_s);
step B: constructing a generative adversarial network architecture comprising two generative neural networks (F and G) and two discriminative neural networks (D_X and D_Y), where G is the network that generates the depth image in the deployment-phase steps above;
step C: input X in the sample together with Y_s into G to obtain the corresponding pseudo-depth image Ŷ (the same operation as in the deployment phase); input the depth image Y in the sample into F to obtain the corresponding pseudo image X̂;
step D: the discriminative neural network D_X discriminates between the image X in the sample and the pseudo image X̂ generated by F in step C, and the discriminative neural network D_Y discriminates between the depth image Y in the sample and the pseudo-depth image Ŷ generated by G in step C;
step E: adjust D_X and D_Y to reduce the discrimination loss in step D;
step F: calculate the difference loss between the image X in the sample and the pseudo image X̂ generated by F in step C, and the difference loss between the depth image Y in the sample and the pseudo-depth image Ŷ generated by G in step C;
step G: adjust G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo-depth image in step D;
step H: return to step C and iterate until a preset iteration condition is met; the generative neural network G at that moment is saved as the final generative neural network for the deployment phase.
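The alternating optimization of steps C-H can be sketched as below in PyTorch. The single-convolution stand-in networks, the Adam optimizer and its learning rate, the cross-entropy form of the discrimination loss, and the toy random data are all assumptions used only to keep the sketch self-contained and runnable; the real G, F, D_X and D_Y are the architectures described in the following paragraphs.

```python
import torch
import torch.nn as nn

# Stand-ins for the four networks of step B.
G = nn.Conv2d(4, 1, 3, padding=1)                                  # (X, Y_s) -> pseudo-depth image
F = nn.Conv2d(1, 3, 3, padding=1)                                  # Y -> pseudo image
D_X = nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1), nn.Sigmoid())
D_Y = nn.Sequential(nn.Conv2d(1, 1, 4, stride=2, padding=1), nn.Sigmoid())

opt_d = torch.optim.Adam(list(D_X.parameters()) + list(D_Y.parameters()), lr=2e-4)
opt_g = torch.optim.Adam(list(G.parameters()) + list(F.parameters()), lr=2e-4)
bce, l1 = nn.BCELoss(), nn.L1Loss()
lam1, lam2 = 10.0, 10.0

for step in range(100):                       # step H: iterate until a preset condition is met
    x = torch.rand(1, 3, 64, 64)              # image X in the sample
    y = torch.rand(1, 1, 64, 64)              # depth image Y in the sample
    mask = (torch.rand(1, 1, 64, 64) < 0.01).float()
    y_s = y * mask                            # measured pixel depths Y_s (zeros elsewhere)

    # Step C: pseudo-depth image from G and pseudo image from F.
    y_fake = G(torch.cat([x, y_s], dim=1))
    x_fake = F(y)

    # Steps D-E: adjust D_X and D_Y to reduce the discrimination loss.
    opt_d.zero_grad()
    real_y, fake_y = D_Y(y), D_Y(y_fake.detach())
    real_x, fake_x = D_X(x), D_X(x_fake.detach())
    d_loss = (bce(real_y, torch.ones_like(real_y)) + bce(fake_y, torch.zeros_like(fake_y))
              + bce(real_x, torch.ones_like(real_x)) + bce(fake_x, torch.zeros_like(fake_x)))
    d_loss.backward()
    opt_d.step()

    # Steps F-G: adjust G and F to reduce the difference losses while fooling D_X and D_Y.
    opt_g.zero_grad()
    adv_y, adv_x = D_Y(y_fake), D_X(x_fake)
    adv = bce(adv_y, torch.ones_like(adv_y)) + bce(adv_x, torch.ones_like(adv_x))
    rec = l1(F(y_fake), x) + l1(G(torch.cat([x_fake, y_s], dim=1)), y)   # dual-consistency term
    ssc = l1(y_fake * mask, y_s)                                         # supervision on the measured pixels
    g_loss = adv + lam1 * rec + lam2 * ssc
    g_loss.backward()
    opt_g.step()

torch.save(G, "generator_G.pt")   # step H: keep G as the final generative network (file name is an assumption)
```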
As shown in FIG. 3, Z represents the data abstraction corresponding to the actual scene (i.e. the color, depth, contrast, transparency and other information contained in the scene), X represents an image of the scene, Y represents a depth image, and Y_s represents the depths of several pixels of the depth image in the sample or in an actual measurement.
In one embodiment of the invention, the generative adversarial network performs inference on the data based on a Bayesian probabilistic model in which the random variable X represents an RGB image, the random variable Y represents a depth image, the random variable Y_s represents the depths of several pixels of the depth image in the sample, Ŷ_s represents the depths of the corresponding pseudo pixels, G is a generative neural network for generating a depth image from an image, F is a generative neural network for generating an image from a depth image, D_X is a discriminative neural network for discriminating the authenticity of an image and outputs the probability that the image is real, D_Y is a discriminative neural network for discriminating the authenticity of a depth image and outputs the probability that the depth image is real, Ŷ is the pseudo-depth image generated by the generative neural network G, and X̂ is the pseudo-restored image generated by the generative neural network F.
As shown in fig. 4, in an embodiment of the present invention, the step C specifically includes:
inputting the image X in the sample and the pixel depths Y_s to G to obtain the corresponding pseudo-depth image, and then inputting the pseudo-depth image to F to obtain a pseudo-restored image; inputting the depth image in the sample to F to obtain the corresponding pseudo image, and then inputting the pseudo image and the depths Y_s of the plurality of pixels of the depth image in the sample to G to obtain a pseudo-restored depth image; the pseudo-restored image or pseudo-restored depth image refers to data generated by a computer model whose input data is itself data generated by another computer model.
C_1 represents the loss between the pseudo-depth image generated by the generative neural network and the measured depth image;
C_2 represents the loss between the depths of the pseudo pixels generated by the generative neural network G and the depths of the real pixels; Y_s, as supervision information, constrains the generative neural network G;
C_3 represents the error between the pseudo-restored image generated by the generative neural network F and the image of the real scene. C_1, C_2 and C_3 jointly constrain the generative network G through the loss function.
In one embodiment of the present invention, the loss function of the generative neural networks G and F is: L_G = L_GAN + λ1·L_REC + λ2·L_SSC,
where E denotes expectation, i.e. E[f(X)] denotes the expectation of the function or random variable in brackets, X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, Ŷ_s is the depths of the corresponding pseudo pixels, L_G is the loss function of the generative neural networks G and F, L_GAN is the adversarial loss function, L_REC is the restoration loss function, L_SSC is the loss function between the depths of several pixels of the depth image in the sample and the corresponding pixels in the pseudo-depth image generated by the generative neural network G, λ1 is the weight coefficient of L_REC, and λ2 is the weight coefficient of L_SSC. λ1 and λ2 are between 0 and 10; preferably, λ1 = 10 and λ2 = 10.
Accordingly, the loss functions of the discriminative neural networks D_X and D_Y are L_DX and L_DY,
where E is the expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
In the above embodiment, as shown in FIG. 5, the generative neural networks G and F are fully convolutional networks comprising convolutional (down-sampling) layers, residual network layers, and deconvolution layers, specifically as follows:
The first layer is convolutional layer Conv1: 64 convolution kernels (filters) of size 4 × 4 perform a convolution with a stride of 1 pixel; edge padding is applied before the convolution, normalization is applied after the convolution, and a rectified linear unit (ReLU) is then used as the nonlinear activation function.
The second layer is convolutional layer Conv2: 128 convolution kernels of size 3 × 3 perform a convolution with a stride of 2 pixels; edge padding is applied before the convolution, normalization is applied after the convolution, and a ReLU is then used as the nonlinear activation function.
The third layer is convolutional layer Conv3: 256 convolution kernels of size 3 × 3 perform a convolution with a stride of 2 pixels; edge padding is applied before the convolution, normalization is applied after the convolution, and a ReLU is then used as the nonlinear activation function.
After the down-sampling (convolution), the processed data enter a residual network for further convolution operations. The residual network comprises several consecutive residual modules, each containing two convolutional layers:
each convolutional layer has 256 convolution kernels (filters) of size 3 × 3 performing a convolution with a stride of 1 pixel; the edges are padded with 1 pixel before the convolution, normalization is applied after the convolution, and a ReLU is then used as the nonlinear activation function; dropout with probability 0.5 is applied between the two convolutional layers; before the ReLU activation of the second convolutional layer, the data entering the residual module are added to the data after the two convolutions, and the result then enters the next residual module. Optionally, the number of residual modules in the residual network is 9 to 21; preferably, it is 9.
After the residual network, the generative network up-samples the processed data through several deconvolution layers:
the first layer is deconvolution layer Deconv1: 128 convolution kernels (filters) of size 3 × 3 perform a deconvolution with a stride of 1/2 pixel; edge padding is applied before the deconvolution, normalization is applied after the deconvolution, and a ReLU is then used as the nonlinear activation function;
the second layer is deconvolution layer Deconv2: 64 convolution kernels of size 3 × 3 perform a deconvolution with a stride of 1/2 pixel; edge padding is applied before the deconvolution, normalization is applied after the deconvolution, and a ReLU is then used as the nonlinear activation function;
the third layer is deconvolution layer Deconv3: 3 convolution kernels of size 7 × 7 perform a deconvolution with a stride of 1 pixel; edge padding is applied before the deconvolution, normalization is applied after the deconvolution, and a tangent (tanh) function is then used as the activation function.
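A PyTorch sketch of this generator architecture is given below, assuming instance normalization for the unspecified "normalization" operation, symmetric padding sizes chosen to roughly preserve spatial dimensions, and configurable input/output channel counts: G would take the RGB image concatenated with the sparse depth map (in_ch=4, out_ch=1), while F maps a depth image to an RGB image (in_ch=1, out_ch=3). These choices are assumptions layered on the layer list above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, k, stride):
    # Convolution with edge padding, normalization, then ReLU (Conv1-Conv3).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ResidualBlock(nn.Module):
    """Residual module: two 3x3/256 convolutions (stride 1, padding 1) with
    normalization and ReLU, dropout 0.5 between them, and the skip connection
    added before the second ReLU activation."""
    def __init__(self, ch=256):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.InstanceNorm2d(ch),
                                   nn.ReLU(inplace=True),
                                   nn.Dropout(0.5))
        self.conv2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.InstanceNorm2d(ch))

    def forward(self, x):
        return torch.relu(x + self.conv2(self.conv1(x)))

class Generator(nn.Module):
    """Sketch of the fully convolutional generator (G or F) following the layer
    list above; normalization type and channel counts are assumptions."""
    def __init__(self, in_ch=3, out_ch=3, n_res=9):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(in_ch, 64, 4, 1),                  # Conv1: 64 filters, 4x4, stride 1
            conv_block(64, 128, 3, 2),                    # Conv2: 128 filters, 3x3, stride 2
            conv_block(128, 256, 3, 2),                   # Conv3: 256 filters, 3x3, stride 2
            *[ResidualBlock(256) for _ in range(n_res)],  # 9 (up to 21) residual modules
            nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),  # Deconv1: stride 1/2
            nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),   # Deconv2: stride 1/2
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_ch, 7, stride=1, padding=3),                  # Deconv3: 7x7, stride 1
            nn.Tanh(),                                    # tangent activation
        )

    def forward(self, x):
        return self.net(x)

# Assumed instantiation: G maps (RGB image + sparse depth map) to a depth image,
# F maps a depth image back to an RGB image.
# G = Generator(in_ch=4, out_ch=1); F = Generator(in_ch=1, out_ch=3)
```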
As shown in FIG. 6, in an embodiment of the present invention, the structure of the discriminative neural networks D_X and D_Y is as follows:
the first layer is convolutional layer Conv1: 64 convolution kernels (filters) of size 7 × 7 perform a convolution with a stride of 2 pixels; edge padding is applied before the convolution, normalization is applied after the convolution, and a LeakyReLU is then used as the nonlinear activation function;
the second layer is convolutional layer Conv2: 128 convolution kernels of size 4 × 4 perform a convolution with a stride of 2 pixels; edge padding is applied before the convolution, normalization is applied after the convolution, and a LeakyReLU is then used as the nonlinear activation function;
the third layer is convolutional layer Conv3: 256 convolution kernels of size 4 × 4 perform a convolution with a stride of 2 pixels; edge padding is applied before the convolution, normalization is applied after the convolution, and a LeakyReLU is then used as the nonlinear activation function;
the fourth layer is convolutional layer Conv4: 512 convolution kernels of size 4 × 4 perform a convolution with a stride of 1 pixel; edge padding is applied before the convolution, normalization is applied after the convolution, and a LeakyReLU is then used as the nonlinear activation function;
the fifth layer is convolutional layer Conv5: 1 convolution kernel of size 4 × 4 performs a convolution with a stride of 1 pixel, with edge padding applied before the convolution.
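A corresponding PyTorch sketch of the discriminator is given below; the normalization type, the padding sizes, and the final sigmoid with spatial averaging (so the output can be read as the probability that the input is real) are assumptions added to the five-layer description above.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch of D_X / D_Y following the five-layer description above."""
    def __init__(self, in_ch=3):
        super().__init__()
        def block(ic, oc, k, s):
            # Convolution with edge padding, normalization, then LeakyReLU.
            return [nn.Conv2d(ic, oc, k, stride=s, padding=k // 2),
                    nn.InstanceNorm2d(oc),
                    nn.LeakyReLU(0.2, inplace=True)]
        self.net = nn.Sequential(
            *block(in_ch, 64, 7, 2),           # Conv1: 64 filters, 7x7, stride 2
            *block(64, 128, 4, 2),             # Conv2: 128 filters, 4x4, stride 2
            *block(128, 256, 4, 2),            # Conv3: 256 filters, 4x4, stride 2
            *block(256, 512, 4, 1),            # Conv4: 512 filters, 4x4, stride 1
            nn.Conv2d(512, 1, 4, padding=2),   # Conv5: 1 filter, 4x4, stride 1
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Patch-wise real/fake probabilities, averaged to one score per input.
        return self.net(x).mean(dim=(1, 2, 3))

# D_X judges RGB images (3 channels), D_Y judges depth images (1 channel):
# D_X, D_Y = Discriminator(in_ch=3), Discriminator(in_ch=1)
```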
As an executable instruction set for a processor circuit or chip, the depth estimation method can be applied to intelligent sensing devices such as next-generation laser radars and advanced cameras; it uses a small set of actually measured points to predict or restore the depth image of the corresponding scene, reducing the cost of acquiring the scene depth image compared with traditional methods.
The above are merely specific embodiments of the present invention, and the scope of the present invention is not limited thereby; any alterations and modifications that do not depart from the spirit of the invention fall within the scope of the invention.
Claims (6)
1. A method for estimating scene geometric information from a single image using a generative adversarial network, the method comprising:
inputting the image of the scene and the depths of a plurality of pixels in the image into a trained generative neural network to obtain a depth image of the scene; the depth of a pixel refers to the distance between the point in the scene corresponding to that pixel in the image and an observer, and the depth image refers to the collection of the depths of all pixels in an image;
wherein the training step of the generative neural network comprises:
step A: collecting a training data set: the training data set comprises a plurality of samples, each sample being an image and a corresponding depth image;
step B: constructing a generative adversarial network architecture comprising two generative neural networks, F and G, and two discriminative neural networks, D_X and D_Y;
step C: inputting the image in the sample and the depths of a plurality of pixels of its depth image to G to obtain a corresponding pseudo-depth image; inputting the depth image in the sample to F to obtain a corresponding pseudo image; the pseudo image or pseudo-depth image refers to data generated by a computer model rather than actually shot or measured; wherein the plurality of pixels of the depth image of the sample refers to 300-400 pixels and not more than 1% of all pixels of the image;
step D: the discriminative neural network D_X discriminates the image in the sample and/or the pseudo image in step C, and the discriminative neural network D_Y discriminates the depth image in the sample and/or the pseudo-depth image in step C;
step E: adjusting D_X and D_Y to reduce the discrimination loss in step D;
step F: calculating the difference loss between the depth image in the sample and the pseudo-depth image generated by G in step C, and calculating the difference loss between the image in the sample and the pseudo image generated by F;
step G: adjusting G and F to reduce the difference losses in step F and to increase the discrimination loss on the pseudo image and the pseudo-depth image in step D;
step H: returning to step C and iterating until a preset iteration condition is met, and saving the generative neural network G at that moment as the final generative neural network.
2. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein step C specifically comprises:
inputting the image in the sample and the depths of the plurality of pixels of its depth image to G to obtain a corresponding pseudo-depth image, and then inputting the pseudo-depth image to F to obtain a pseudo-restored image;
inputting the depth image in the sample to F to obtain a corresponding pseudo image, and then inputting the pseudo image and the depths of the plurality of pixels to G to obtain a pseudo-restored depth image; the pseudo-restored image or pseudo-restored depth image refers to data generated by a computer model whose input data is itself data generated by another computer model.
3. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the generative adversarial network performs inference on the data based on a Bayesian probabilistic model,
where X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, Ŷ_s is the depths of the corresponding pseudo pixels, G is a generative neural network for generating a depth image from an image, F is a generative neural network for generating an image from a depth image, D_Y is a discriminative neural network for discriminating the authenticity of a depth image and outputs the probability that the depth image is real, Ŷ is the pseudo-depth image generated by the generative neural network G, and X̂ is the pseudo-restored image generated by the generative neural network F.
4. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1 or claim 3, wherein the loss functions of the generative neural networks G and F are:
L_G = L_GAN + λ1·L_REC + λ2·L_SSC,
where E is the expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, Ŷ_s is the depths of the corresponding pseudo pixels, L_G is the loss function of the generative neural networks G and F, L_GAN is the loss function of the generative adversarial network, L_REC is the loss function between the images or depth images generated by the generative neural networks G and F and those in the sample, L_SSC is the loss function between the depths of several pixels of the depth image in the sample and the corresponding pixels in the pseudo-depth image generated by the generative neural network G, λ1 is the weight coefficient of L_REC, and λ2 is the weight coefficient of L_SSC.
5. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the loss functions of the discriminative neural networks D_X and D_Y are L_DX and L_DY,
where E is the expectation, X is the image in the sample, Y is the depth image in the sample, Y_s is the depths of several pixels of the depth image in the sample, L_DX is the discrimination loss function of the discriminative neural network D_X, and L_DY is the discrimination loss function of the discriminative neural network D_Y.
6. The method for estimating scene geometric information from a single image using a generative adversarial network according to claim 1, wherein the generative neural network G or F is a fully convolutional neural network comprising convolutional layers, residual network layers, and deconvolution layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810376281.9A CN108830890B (en) | 2018-04-24 | 2018-04-24 | Method for estimating scene geometric information from single image by using generative countermeasure network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810376281.9A CN108830890B (en) | 2018-04-24 | 2018-04-24 | Method for estimating scene geometric information from single image by using generative countermeasure network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108830890A CN108830890A (en) | 2018-11-16 |
CN108830890B true CN108830890B (en) | 2021-10-01 |
Family
ID=64154785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810376281.9A Active CN108830890B (en) | 2018-04-24 | 2018-04-24 | Method for estimating scene geometric information from single image by using generative countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108830890B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110782397B (en) * | 2018-12-13 | 2020-08-28 | 北京嘀嘀无限科技发展有限公司 | Image processing method, generation type countermeasure network, electronic equipment and storage medium |
CN109788270B (en) * | 2018-12-28 | 2021-04-09 | 南京美乐威电子科技有限公司 | 3D-360-degree panoramic image generation method and device |
CN111274946B (en) * | 2020-01-19 | 2023-05-05 | 杭州涂鸦信息技术有限公司 | Face recognition method, system and equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106157307A (en) * | 2016-06-27 | 2016-11-23 | 浙江工商大学 | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF |
-
2018
- 2018-04-24 CN CN201810376281.9A patent/CN108830890B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106157307A (en) * | 2016-06-27 | 2016-11-23 | 浙江工商大学 | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF |
Non-Patent Citations (3)
Title |
---|
DEPTH PREDICTION FROM A SINGLE IMAGE WITH CONDITIONAL ADVERSARIAL NETWORKS; Hyungjoo Jung et al.; 2017 IEEE International Conference on Image Processing (ICIP); IEEE; 2018-02-22; pp. 1717-1720, sections 1-3, figures 2-3 *
Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks; Zhu Junyan et al.; 2017 IEEE International Conference on Computer Vision (ICCV); IEEE; 2017-12-25; full text *
Application of deep learning in image recognition; Li Chaobo et al.; Journal of Nantong University (Natural Science Edition); 2018-03-20; vol. 17, no. 1; full text *
Also Published As
Publication number | Publication date |
---|---|
CN108830890A (en) | 2018-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Moreau et al. | Lens: Localization enhanced by nerf synthesis | |
AU2018292610B2 (en) | Method and system for performing simultaneous localization and mapping using convolutional image transformation | |
CN108596961B (en) | Point cloud registration method based on three-dimensional convolutional neural network | |
CN114049434B (en) | 3D modeling method and system based on full convolution neural network | |
CN109598754B (en) | Binocular depth estimation method based on depth convolution network | |
CN111899282A (en) | Pedestrian trajectory tracking method and device based on binocular camera calibration | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
CN107204010A (en) | A kind of monocular image depth estimation method and system | |
CN116258817B (en) | A method and system for constructing autonomous driving digital twin scenes based on multi-view three-dimensional reconstruction | |
CN113689539A (en) | Dynamic scene real-time three-dimensional reconstruction method and device based on implicit optical flow field | |
CN113256698B (en) | Monocular 3D reconstruction method with depth prediction | |
CN112330795B (en) | Human body three-dimensional reconstruction method and system based on single RGBD image | |
CN110243390B (en) | Pose determination method and device and odometer | |
CN108830890B (en) | Method for estimating scene geometric information from single image by using generative countermeasure network | |
CN110910437A (en) | A Depth Prediction Method for Complex Indoor Scenes | |
CN111612898B (en) | Image processing method, image processing device, storage medium and electronic equipment | |
CN107689060A (en) | Visual processing method, device and the equipment of view-based access control model processing of destination object | |
CN114973407A (en) | A RGB-D-based 3D Human Pose Estimation Method for Video | |
WO2024193622A1 (en) | Three-dimensional construction network training method and apparatus, and three-dimensional model generation method and apparatus | |
JP2022027464A (en) | Method and device related to depth estimation of video | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN112541972A (en) | Viewpoint image processing method and related equipment | |
CN116468769A (en) | An Image-Based Depth Information Estimation Method | |
CN118379445A (en) | A method for reconstructing deep-sea surface mineral topography based on binocular vision and deep learning | |
CN116091871B (en) | Physical countermeasure sample generation method and device for target detection model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |