Depth completion method based on learned guided deformable convolution
Technical Field
The invention belongs to the field of computer vision and relates to a depth completion method based on learned guided deformable convolution.
Background
Currently, deep learning-based methods exhibit excellent performance in depth completion tasks. These methods progressively map sparse depth maps to dense depth maps through a large number of stacked filters. Since RGB images contain rich semantic cues, image-guided methods show impressive performance in filling in unknown depths. For example, GuideNet proposes a feature fusion module based on guided dynamic convolution to better exploit the guidance provided by RGB image features; CSPN learns an affinity matrix to refine the rough depth map through a Spatial Propagation Network (SPN); dySPN further develops a dynamic SPN by assigning different attention levels to neighboring pixels at different distances; ACMNet introduces a symmetric gated fusion strategy to fuse the two modalities of RGB image features and depth features; and FCFR-Net proposes channel shuffling between the two modalities together with an energy-based fusion strategy.
Despite the significant progress made by existing methods on the depth completion task, several problems remain. Because of the fixed geometry of the convolution module, a convolution unit can only sample the input feature map at fixed locations, which may contain irrelevant mixed information. As a result, in challenging environments and with sparse depth measurements, existing methods have difficulty generating clear structural features, leading to depth mixing problems, i.e., blurred boundaries and artifacts in the depth map. Furthermore, because depth and RGB image information differ greatly, existing methods typically rely on tens of millions of learnable parameters to ensure that the model can learn robust features and adequately fuse the multimodal data. However, such large-scale networks require a large amount of computing resources, which is impractical in real applications, and simply reducing the network size significantly degrades performance.
Disclosure of Invention
It is therefore an object of the present invention to provide an image-guided module that can adaptively perceive the context structure of each pixel to assist the depth completion process and, at the same time, to reduce model complexity by providing a low-coupling, lightweight network structure for this task.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A depth completion method based on learned guided deformable convolution, comprising the steps of:
S1, giving a pair of inputs comprising a sparse depth map and an RGB image;
S2, respectively extracting depth features and image features from the sparse depth map and the RGB image, and further fusing them to obtain multi-modal features;
S3, taking the multi-modal features as input, adaptively guiding depth feature aggregation by utilizing image feature information, and finally predicting a rough depth map;
and S4, adaptively predicting spatially-varying, content-dependent kernel weights and offsets to generate a depth residual map, and adding it to the rough depth map to obtain the refined depth map.
Further, in step S2, for a pair of inputs consisting of a sparse depth map S and an RGB image I, a depth feature FS′ and an image feature FI′ are obtained by applying an initial convolution and a lightweight encoder-decoder to S and I, respectively; the two features are then fused by a pixel-level addition operation to obtain the multi-modal feature FIS, expressed as:
FIS=FI′+FS′
wherein FI′ and FS′ are produced by the encoder-decoders used to extract image features and depth features, respectively.
Further, in step S3, two encoder-decoder branches are constructed, namely an image-guided branch and a depth regression branch, each taking the multi-modal features as input. The encoder stage of the image-guided branch receives information transferred through skip connections from a decoder, and the encoder stage of the depth regression branch likewise receives skip connections from a decoder. A deformable guide module is embedded after each scale feature of the encoder of the depth regression branch to aggregate relevant semantic information from within a neighborhood, and the output of the depth regression branch is a rough depth map.
Further, the deformable guiding module comprises the following processing steps:
Given the image feature fi from the image-guided branch and the depth feature fs from the depth regression branch, first perform a pixel-level addition to fuse the image feature and the depth feature;
next, a pixel-level offset feature map is learned by a standard convolution operation; it contains the offsets in the x and y coordinate directions, representing position deviations on a regular grid, for a total of 2 × k² channels, where k is the convolution kernel size;
Then, the depth feature map is sampled on the regular grid shifted by these offsets to obtain relevant semantic information within the neighborhood;
then, a standard convolution with kernel size k is performed on the sampled features to aggregate this information and learn a depth feature residual;
Finally, the depth feature residual is added to the depth feature to obtain the final output;
The deformable guiding module is expressed as:
Offsets=Conv(fs+fi)
Output=fs+DeConv(fs,Offsets)
wherein DeConv(·) represents a deformable convolution whose kernel weights are spatially shared and obtained by random initialization.
Further, the step S4 specifically includes:
given the last-layer features FI and FS of the image-guided branch decoder and the depth regression branch decoder, respectively;
first, a pixel-level addition is performed to fuse the features of the two branches;
pixel-level convolution kernel weights and an offset feature map are then learned through two independent standard convolutions;
a sigmoid layer constrains each weight to be greater than 0 and less than 1, and the mean is subtracted so that the weights sum to 0;
based on the given offsets and kernel weights, a deformable convolution is performed on the rough depth map CD to obtain a depth residual map ΔD;
the depth residual map is added to the rough depth map to obtain the final fine depth map D;
The concrete expression is as follows:
Weights=Conv(FI+FS)
Offsets=Conv(FI+FS)
ΔD=DeConv(CD,Weights,Offsets)
D=CD+ΔD
wherein DeConv(·) represents a deformable convolution whose kernel weights are spatially varying and content dependent, learned from the image-guided features and depth features.
Further, the mean squared error (MSE) is used to calculate the loss between the true depth and the predicted depth, expressed as:
L(Dpred)=‖(Dpred−Dgt)⊙m(Dgt>0)‖²
wherein Dpred represents the predicted depth map, Dgt represents the ground-truth depth map used for supervision, ⊙ denotes element-wise multiplication, and m(Dgt>0) is the mask of pixels with valid depth values, so that only valid pixels are considered;
also, the rough depth map CD needs to be supervised, and the final loss function is:
Loss=L(D)+λL(CD)
where λ is an empirically set hyper-parameter.
The method has the following advantages. By adaptively perceiving the context structure of each pixel, relevant information is captured better, so the structural definition of the generated depth map is improved markedly; the perception range is adjusted adaptively according to the pixel position, so irrelevant information is no longer included in the depth map and its aggregation is effectively avoided. The rich semantic information of the RGB image is fully exploited to predict the sampling positions of depth-related information, avoiding the irrelevant information introduced by a fixed local neighborhood and improving the accuracy of the depth map, especially in complex and challenging scenes. The low-coupling, lightweight network structure provided by the method not only improves computational efficiency and reduces resource requirements, but also makes the depth completion method easier to deploy and popularize in various application scenarios. In addition, a dual-branch stacked-hourglass network structure is introduced, which decouples the single encoder structure of prior methods into stacked encoder-decoder structures; this balances the learning of the model, progressively yields clearer and richer context semantics, and keeps each encoder-decoder very lightweight.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail below with reference to the preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a flow chart of the depth completion method based on learned guided deformable convolution;
FIG. 2 is a flow chart of multi-modal feature extraction;
FIG. 3 is a guided depth regression flow chart;
FIG. 4 is a block diagram of the deformable guide module;
Fig. 5 is a depth refinement flow chart.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other different embodiments, and the details in this specification may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention in a schematic way, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. To better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged, or reduced and do not represent actual product dimensions; it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "front", and "rear", if used, indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the referred devices or elements must have a specific orientation or be constructed and operated in a specific orientation. Such terms describing positional relationships in the drawings are therefore merely illustrative and are not to be construed as limitations of the present invention, and their specific meanings can be understood by those skilled in the art according to the specific circumstances.
Facing challenging environments and sparse depth measurements, traditional depth completion methods, especially those relying on fixed-neighborhood convolution modules and affinity matrices, struggle to meet the needs of the depth completion task. Existing methods share a problem: during depth map generation, fixed-neighborhood operations easily aggregate irrelevant mixed information from the surroundings, so the generated feature maps have blurred structures. The present invention therefore aims to solve this problem, focusing on improving the structural definition of the depth map.
In order to achieve the above object, the present invention designs an image guidance module capable of adaptively perceiving the context structure of each pixel. The module adaptively adjusts the perception range according to the pixel position and better captures relevant information, thereby improving the accuracy and definition of the depth map; it is introduced to cope effectively with complex and challenging scenes during depth completion and to ensure that the generated depth feature maps have a stronger sense of structure. Specifically, in order to avoid the irrelevant information brought by a fixed local neighborhood, the rich semantic information of the RGB image is fully utilized to predict the sampling positions of depth-related information, and these sampling positions are obtained by learning offsets from a regular grid. To meet the demands of practical applications on model complexity, a low-coupling, lightweight network structure is designed; the lightweight design not only improves computational efficiency and reduces resource requirements, but also makes the depth completion method easier to deploy and popularize in various application scenarios.
In terms of the specific network architecture, a dual-branch stacked-hourglass network structure is adopted, which balances model learning by decoupling the single encoder structure of previous approaches into stacked encoder-decoder structures while progressively obtaining clearer and richer context semantics. Because of the decoupled structure, a large number of learnable parameters is not required to balance the learning ability of the model, so robustness is maintained while each encoder-decoder remains very lightweight.
Based on the above scheme, a depth completion method based on learned guided deformable convolution is provided. As shown in fig. 1, the method comprises three stages: multi-modal feature extraction, guided depth regression, and depth refinement.
Multi-modal feature extraction: given a pair of inputs, a sparse depth map S and an RGB image I, the aim is to extract a multi-modal representation that seamlessly blends the semantic information of the image, such as texture and edges, with the depth information, as shown in FIG. 2. A depth feature FS′ and an image feature FI′ are obtained by applying an initial convolution layer and a lightweight encoder-decoder to S and I, respectively. The two features are then fused by a pixel-level addition operation, yielding the multi-modal feature FIS. Specifically, the process can be expressed as:
FIS=FI′+FS′
wherein FI′ and FS′ are produced by the encoder-decoder modules used to extract image features and depth features, respectively.
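As an illustration only, this extraction-and-fusion step might be sketched in PyTorch as follows; the class names, channel widths, and the internal layout of the lightweight encoder-decoders are assumptions made for the sketch, not the actual design of the embodiment.

```python
import torch
import torch.nn as nn

class LightweightEncoderDecoder(nn.Module):
    """Illustrative stand-in for a lightweight encoder-decoder: one downsample/upsample
    pair with a residual connection (assumes even spatial dimensions)."""
    def __init__(self, channels):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, channels * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels * 2, channels, 4, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.decoder(self.encoder(x)) + x

class MultiModalFeatureExtraction(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # initial convolutions lifting the raw inputs to feature space
        self.init_depth = nn.Conv2d(1, channels, 3, padding=1)   # sparse depth map S
        self.init_image = nn.Conv2d(3, channels, 3, padding=1)   # RGB image I
        self.depth_ed = LightweightEncoderDecoder(channels)
        self.image_ed = LightweightEncoderDecoder(channels)

    def forward(self, sparse_depth, rgb):
        f_s = self.depth_ed(self.init_depth(sparse_depth))   # F_S'
        f_i = self.image_ed(self.init_image(rgb))            # F_I'
        return f_i + f_s                                      # F_IS = F_I' + F_S'
```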
Guided depth regression: this stage is split into two encoder-decoder branches, an image-guided branch and a depth regression branch, as shown in fig. 3. The encoder stage of the image-guided branch receives information transferred through skip connections from a decoder, and the encoder stage of the depth regression branch likewise receives information through skip connections from a decoder. A deformable guide module is embedded after each scale feature of the encoder of the depth regression branch to aggregate relevant semantic information from within the neighborhood, and the output of this branch is a coarse depth map CD. Both branches take the multi-modal feature FIS as input, ensuring that multi-source information is fully utilized for depth completion in the subsequent steps.
As shown in fig. 4, the deformable guide module is designed as follows. Given the image feature fi from the image-guided branch and the depth feature fs from the depth regression branch, a pixel-level addition is first performed to fuse the image feature and the depth feature. Next, a pixel-level offset feature map, containing the offsets in the x and y coordinate directions and representing the position deviations on a regular grid, is learned by a standard convolution operation; it has 2 × k² channels in total, where k is the convolution kernel size. The depth feature map is then sampled on the regular grid shifted by these offsets to obtain relevant semantic information within the neighborhood. A standard convolution with kernel size k is then performed on the sampled features to aggregate this information and learn a depth feature residual. Finally, the depth feature residual is added to the depth feature to obtain the final output. In particular, the module can be expressed as:
Offsets=Conv(fs+fi)
Output=fs+DeConv(fs,Offsets)
wherein DeConv(·) represents a deformable convolution whose kernel weights are spatially shared and obtained by random initialization.
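Assuming a standard deformable convolution operator (here torchvision.ops.deform_conv2d) is an acceptable realisation of DeConv(·), the deformable guide module could be sketched as below, following the description above in which the residual is added to the depth feature; the channel count, kernel size k=3, and weight initialisation scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableGuideModule(nn.Module):
    """Sketch: offsets are predicted from the fused image + depth features, then a
    deformable convolution with spatially shared, randomly initialised weights
    aggregates the depth features into a residual."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        # standard convolution predicting 2*k*k per-pixel offsets (x and y for each tap)
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        # spatially shared deformable-convolution kernel, randomly initialised
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)

    def forward(self, f_s, f_i):
        offsets = self.offset_conv(f_s + f_i)                  # Offsets = Conv(fs + fi)
        residual = deform_conv2d(f_s, offsets, self.weight,
                                 padding=self.k // 2)          # DeConv(fs, Offsets)
        return f_s + residual                                  # depth feature + residual
```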
Depth refinement: as shown in fig. 5, the last-layer features FI and FS of the image-guided branch decoder and the depth regression branch decoder, respectively, are given. First, a pixel-level addition is performed to fuse the two-branch features; then pixel-level convolution kernel weights and an offset feature map are learned through two independent standard convolutions. To make the model converge stably, a sigmoid layer constrains each weight to be greater than 0 and less than 1, and the mean is subtracted so that the weights sum to 0. Then, based on the given offsets and kernel weights, a deformable convolution is performed on the rough depth map CD to obtain a depth residual map ΔD, which is finally added to the rough depth map to obtain the final fine depth map D. In particular, the process may be expressed as:
Weights=Conv(FI+FS)
Offsets=Conv(FI+FS)
ΔD=DeConv(CD,Weights,Offsets)
D=CD+ΔD
wherein DeConv(·) represents a deformable convolution whose kernel weights are spatially varying and content dependent, learned from the image-guided features and depth features.
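Because the kernel weights here are spatially varying, a fixed-kernel deformable convolution operator does not apply directly; one possible realisation, sketched below, samples the coarse depth map at the k² offset positions with grid_sample and weights the samples per pixel. The (dy, dx) channel ordering of the offsets, the kernel size, and the module interface are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthRefinement(nn.Module):
    """Sketch: per-pixel kernel weights and offsets are predicted from the fused
    decoder features, then applied to the coarse depth map as a spatially varying
    deformable filter whose output is the depth residual."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.weight_conv = nn.Conv2d(channels, k * k, kernel_size=k, padding=k // 2)
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)

    def forward(self, coarse_depth, f_i, f_s):
        fused = f_i + f_s
        # weights in (0, 1) via sigmoid, then mean-subtracted so they sum to zero
        w = torch.sigmoid(self.weight_conv(fused))
        w = w - w.mean(dim=1, keepdim=True)
        offsets = self.offset_conv(fused)                      # (B, 2*k*k, H, W)

        b, _, h, wdt = coarse_depth.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=coarse_depth.device),
                                torch.arange(wdt, device=coarse_depth.device), indexing="ij")
        r = self.k // 2
        taps = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
        residual = torch.zeros_like(coarse_depth)
        for j, (dy, dx) in enumerate(taps):
            off = offsets[:, 2 * j:2 * j + 2]                  # assumed (dy, dx) order
            sx = xs + dx + off[:, 1]                           # absolute x sampling position
            sy = ys + dy + off[:, 0]                           # absolute y sampling position
            grid = torch.stack((2 * sx / (wdt - 1) - 1,
                                2 * sy / (h - 1) - 1), dim=-1)  # normalised for grid_sample
            sampled = F.grid_sample(coarse_depth, grid, align_corners=True)
            residual = residual + w[:, j:j + 1] * sampled
        return coarse_depth + residual                          # D = CD + ΔD
```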
A mean squared error (MSE) is used during training to calculate the loss between the true depth and the predicted depth. For real-world depth data, the ground truth is typically semi-dense, since it is difficult to acquire a true depth value for every pixel. Therefore, only valid pixels in the ground-truth depth map are considered when calculating the training loss. The loss function can thus be expressed as:
L(Dpred)=‖(Dpred−Dgt)⊙m(Dgt>0)‖²
where Dpred denotes the predicted depth map, Dgt the ground-truth depth map used for supervision, and ⊙ element-wise multiplication. Since the ground truth contains invalid pixels, only pixels with valid depth values are considered.
It is also necessary to supervise the intermediate depth prediction (coarse depth map CD), so the final loss function is:
Loss=L(D)+λL(CD)
where λ is an empirically set hyper-parameter, recommended here to be set to 0.2.
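A compact form of this supervision might look like the following sketch; averaging the squared error over the valid pixels (rather than summing) and the default lam=0.2 follow the MSE wording and the recommendation above, but are otherwise implementation choices.

```python
import torch

def masked_mse(d_pred, d_gt):
    """Squared error over pixels with valid ground truth (d_gt > 0), averaged."""
    mask = (d_gt > 0).float()
    return ((d_pred - d_gt) * mask).pow(2).sum() / mask.sum().clamp(min=1.0)

def total_loss(d_fine, d_coarse, d_gt, lam=0.2):
    # Loss = L(D) + lambda * L(CD)
    return masked_mse(d_fine, d_gt) + lam * masked_mse(d_coarse, d_gt)
```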
It will be appreciated by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disk, and, when executed, performs the steps of the method.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.