
CN117876449B - Learning to Guide Deformable Convolutions for Depth Completion - Google Patents


Info

Publication number
CN117876449B
CN117876449B (application CN202410047698.6A)
Authority
CN
China
Prior art keywords
depth
image
features
map
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410047698.6A
Other languages
Chinese (zh)
Other versions
CN117876449A (en)
Inventor
田茂
李桥生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202410047698.6A
Publication of CN117876449A
Application granted
Publication of CN117876449B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to a depth completion method that learns to guide deformable convolutions, and belongs to the field of computer vision. The method comprises the following steps: S1, a pair of inputs is given, comprising a sparse depth map and an RGB image; S2, depth features and image features are extracted from the sparse depth map and the RGB image, respectively, and further fused to obtain multi-modal features; S3, with the multi-modal features as input, image feature information is used to adaptively guide depth feature aggregation, and a coarse depth map is finally predicted; S4, a depth residual map is generated from adaptively predicted, spatially varying and content-dependent kernel weights and offsets, and the refined depth map is further obtained.

Description

Depth completion method that learns to guide deformable convolutions
Technical Field
The invention belongs to the field of computer vision and relates to a depth completion method that learns to guide deformable convolutions.
Background
Currently, deep learning-based methods exhibit excellent performance on depth completion tasks. These methods progressively map sparse depth maps to dense depth maps through a large number of stacked filters. Since RGB images contain rich semantic cues, image-guided methods perform remarkably well at filling in unknown depths. For example, GuideNet proposes a feature fusion module based on a guided dynamic convolution network to better exploit the guidance features of RGB images; CSPN learns an affinity matrix to refine the coarse depth map through a Spatial Propagation Network (SPN); DySPN further develops a dynamic SPN by assigning different attention levels to neighboring pixels at different distances. ACMNet introduces a symmetric gated fusion strategy to fuse the two modalities of RGB image features and depth features, and FCFR-Net proposes channel shuffling between the two modalities together with an energy-based fusion strategy.
Despite the significant progress made by existing methods on the depth completion task, some problems remain. Because of the fixed geometry of the convolution module, a convolution unit can only sample the input feature map at fixed locations, which may contain irrelevant mixed information. Under challenging environments and sparse depth measurements, existing methods therefore have difficulty generating clear structural features, which causes depth mixing problems, i.e., blurred boundaries and artifacts in the depth map. Furthermore, because depth and RGB image information differ greatly, existing methods typically rely on tens of millions of learnable parameters to ensure that the model can learn robust features and adequately fuse the multimodal data. Such large-scale networks require a large amount of computing resources, which is impractical in real applications, while simply shrinking the network significantly reduces performance.
Disclosure of Invention
It is therefore an object of the present invention to provide an image guidance module that adaptively perceives the context structure of each pixel to assist the depth completion process and, at the same time, to provide a low-coupling and lightweight network structure for this task in order to reduce model complexity.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A depth completion method that learns to guide deformable convolutions, comprising the following steps:
S1, giving a pair of inputs comprising a sparse depth map and an RGB image;
S2, extracting depth features and image features from the sparse depth map and the RGB image respectively, and fusing them to obtain multi-modal features;
S3, taking the multi-modal features as input, adaptively guiding depth feature aggregation with the image feature information, and finally predicting a coarse depth map;
S4, generating a depth residual map from adaptively predicted, spatially varying and content-dependent kernel weights and offsets, and further obtaining the refined depth map.
Further, in step S2, for a pair of inputs consisting of a sparse depth map S and an RGB image I, depth features F_S′ and image features F_I′ are obtained by applying an initial convolution and a lightweight encoder-decoder to S and I, respectively; the two features are then fused by a pixel-level addition to obtain the multi-modal features F_IS, expressed as:
F_IS = F_I′ + F_S′
where the two lightweight encoder-decoders extract the image features and the depth features, respectively.
Further, in step S3, two encoder-decoder branches are constructed, an image-guided branch and a depth regression branch, each taking the multi-modal features as input. The encoder stage of the image-guided branch receives information from the corresponding decoder through skip connections, and the encoder stage of the depth regression branch likewise receives information through skip connections. A deformable guide module is embedded after each scale feature of the encoder of the depth regression branch to aggregate relevant semantic information from a neighborhood range; the output of the depth regression branch is a coarse depth map.
Further, the deformable guiding module comprises the following processing steps:
Given the image features f_i from the image-guided branch and the depth features f_s from the depth regression branch, first perform a pixel-level addition to fuse the image features and the depth features;
next, a pixel-level offset feature map is learned by a standard convolution operation; it contains offsets along the x and y coordinate directions, representing position deviations on a regular grid, for a total of 2 × k² channels (e.g., 18 channels for k = 3), where k is the convolution kernel size;
Then, sampling the depth feature map on a regular grid based on the offsets to obtain related semantic information in the neighborhood;
then, standard convolution with the kernel size of k is performed on the sampled features to aggregate the information and learn to obtain depth feature residuals;
finally, adding the depth feature residual to the depth features to obtain the final output;
The deformable guiding module is expressed as:
Offsets = Conv(f_s + f_i)
Output = f_i + DeConv(f_i, Offsets)
Wherein DeConv () represents a deformable convolution whose kernel weights are spatially shared, obtained by random initialization.
Further, the step S4 specifically includes:
giving the last-layer features F_I and F_S of the image-guided and depth regression branch decoders, respectively;
first, performing a pixel-level addition to fuse the features of the two branches;
learning pixel-level convolution kernel weights and an offset feature map through two independent standard convolutions;
constraining each weight to lie between 0 and 1 with a sigmoid layer and then subtracting the mean so that the weights sum to 0;
based on the given offsets and kernel weights, performing a deformable convolution on the coarse depth map CD to obtain a depth residual map ΔD;
adding the depth residual map to the coarse depth map to obtain the final refined depth map D;
The concrete expression is as follows:
Weights = Conv(F_I + F_S)
Offsets = Conv(F_I + F_S)
ΔD = DeConv(CD, Weights, Offsets)
D = CD + ΔD
Wherein DeConv () represents a deformable convolution whose kernel weights are spatially varying and content dependent, learned from image-guided features and depth features.
Further, the mean square error MSE is used to calculate the loss between the true depth and the predicted depth, expressed as:
L(D_pred) = ‖(D_pred - D_gt) ⊙ m(D_gt > 0)‖²
wherein D_pred represents the predicted depth map, D_gt the ground-truth depth map used for supervision, ⊙ element-wise multiplication, and m(·) the mask of valid pixels; only pixels with valid depth values are considered;
the coarse depth map CD also needs to be supervised, so the final loss function is:
Loss=L(D)+λL(CD)
where λ is an empirically set hyper-parameter.
The invention has the following beneficial effects. By adaptively perceiving the context structure of each pixel, relevant information is captured more effectively, so the structural definition of the generated depth map is significantly improved: the perception range is adjusted adaptively according to the pixel position, so irrelevant information is no longer aggregated into the depth map. The rich semantic information of the RGB image is fully exploited to predict the sampling positions of depth-related information, avoiding the irrelevant information introduced by a fixed local neighborhood and improving the accuracy of the depth map, particularly in complex and challenging scenes. The proposed low-coupling, lightweight network structure not only improves computational efficiency and reduces resource requirements, but also makes the depth completion method easier to deploy and popularize in various application scenarios. Finally, the introduced dual-branch stacked-hourglass network structure decouples the single encoder of prior methods into stacked encoder-decoder structures, which balances the learning of the model, progressively yields clearer and richer context semantics, and keeps each encoder-decoder very lightweight.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for learning a guided deformable convolution;
FIG. 2 is a flow chart of multi-modal feature extraction;
FIG. 3 is a guided depth regression flow chart;
FIG. 4 is a block diagram of a deformable boot module;
Fig. 5 is a depth refinement flow chart.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other different embodiments, and the details in this specification may be modified or changed in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention, and the following embodiments and the features in the embodiments may be combined with each other without conflict.
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. To better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent actual product dimensions; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
In the description of the present invention, it should be understood that terms such as "upper", "lower", "left", "right", "front" and "rear", if present, indicate directions or positional relationships based on those shown in the drawings. They are used only for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the referred devices or elements must have a specific orientation or be constructed and operated in a specific orientation; such terms are therefore merely exemplary and are not to be construed as limiting the present invention. The specific meanings of the above terms may be understood by those skilled in the art according to the specific circumstances.
In the face of challenging environments and sparse depth measurements, traditional depth completion methods, especially those based on fixed-neighborhood convolution modules and affinity matrices, struggle to meet the needs of the depth completion task. A common problem of current methods is that, during depth map generation, the fixed-neighborhood operation easily aggregates mixed information that is irrelevant to the surroundings, so the generated feature maps have a blurred structure. The present invention therefore aims to solve this problem and focuses on improving the structural definition of the depth map.
In order to achieve the above object, the present invention designs an image guidance module that adaptively perceives the context structure of each pixel. It adaptively adjusts the perception range according to the pixel position and better captures relevant information, thereby improving the accuracy and definition of the depth map; the module is introduced to cope effectively with complex and challenging scenes during depth completion and to ensure that the generated depth feature maps have a stronger sense of structure. Specifically, to avoid the irrelevant information introduced by a fixed local neighborhood, the rich semantic information of the RGB image is fully exploited to predict the sampling positions of depth-related information, which are obtained by learning offsets from a regular grid. To meet the requirements of practical applications on model complexity, a low-coupling and lightweight network structure is designed; the lightweight design not only improves computational efficiency and reduces resource requirements, but also makes the depth completion method easier to deploy and popularize in various application scenarios.
In terms of the specific network architecture, a dual-branch stacked-hourglass structure is adopted, which balances model learning by decoupling the single encoder architecture of previous approaches into stacked encoder-decoder structures while progressively obtaining clearer and richer context semantics. Because of this decoupled structure, a large number of learnable parameters is not required to balance the learning ability of the model, so the model remains robust while each encoder-decoder stays very lightweight.
Based on the above scheme, a depth completion method that learns to guide deformable convolutions is provided. As shown in fig. 1, the method comprises three stages: multi-modal feature extraction, guided depth regression and depth refinement.
Multi-modal feature extraction: given a pair of inputs, a sparse depth map S and an RGB image I, the goal is to extract a multi-modal representation that seamlessly blends semantic information of the image, such as texture and edges, with the depth information, as shown in fig. 2. Depth features F_S′ and image features F_I′ are obtained by applying an initial convolution layer and a lightweight encoder-decoder to S and I, respectively. The two features are then fused by a pixel-level addition to obtain the multi-modal features F_IS. Specifically, the process can be expressed as:
F_IS = F_I′ + F_S′
where the two lightweight encoder-decoder modules extract the image features and the depth features, respectively.
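As a concrete illustration of this fusion stage, the following PyTorch sketch shows one possible arrangement; the class name, module layout and channel counts are assumptions made for illustration and are not taken from the patent.

import torch
import torch.nn as nn

class LightweightEncoderDecoder(nn.Module):
    """Minimal stand-in for the lightweight encoder-decoder of one branch (assumed layout)."""
    def __init__(self, in_channels: int, channels: int = 32):
        super().__init__()
        self.init_conv = nn.Conv2d(in_channels, channels, 3, padding=1)   # initial convolution
        self.encoder = nn.Sequential(                                     # single downsampling stage
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(                                     # single upsampling stage
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.decoder(self.encoder(self.init_conv(x)))

def extract_multimodal_features(depth_branch, image_branch, S, I):
    """F_IS = F_I' + F_S': pixel-level addition of the two unimodal feature maps."""
    F_S = depth_branch(S)   # depth features F_S' from the sparse depth map
    F_I = image_branch(I)   # image features F_I' from the RGB image
    return F_I + F_S        # multi-modal features F_IS

# usage example with hypothetical shapes
depth_branch = LightweightEncoderDecoder(in_channels=1)
image_branch = LightweightEncoderDecoder(in_channels=3)
S = torch.zeros(1, 1, 64, 64)   # sparse depth map
I = torch.zeros(1, 3, 64, 64)   # RGB image
F_IS = extract_multimodal_features(depth_branch, image_branch, S, I)   # (1, 32, 64, 64)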
Guided depth regression: this stage is split into two encoder-decoder branches, an image-guided branch and a depth regression branch, as shown in fig. 3. The encoder stage of the image-guided branch receives information from the corresponding decoder through skip connections, and the encoder stage of the depth regression branch likewise uses skip connections from its corresponding decoder. A deformable guide module is embedded after each scale feature of the encoder of the depth regression branch to aggregate relevant semantic information from within the neighborhood; the output of this branch is a coarse depth map CD. Both branches take the multi-modal features F_IS as input, ensuring that multi-source information is fully exploited for depth completion in the subsequent steps.
As shown in fig. 4, the deformable guide module is designed as follows. Given the image features f_i from the image-guided branch and the depth features f_s from the depth regression branch, a pixel-level addition is first performed to fuse the image features and the depth features. Next, a pixel-level offset feature map, containing offsets along the x and y coordinate directions and representing position deviations on a regular grid, is learned by a standard convolution operation; it has 2 × k² channels in total (e.g., 18 channels for k = 3), where k is the convolution kernel size. The depth feature map is then sampled on the regular grid according to these offsets to obtain relevant semantic information within the neighborhood. A standard convolution with kernel size k is then applied to the sampled features to aggregate this information and learn depth feature residuals. Finally, the depth feature residuals are added to the depth features to obtain the final output. Specifically, the module can be expressed as:
Offsets = Conv(f_s + f_i)
Output = f_i + DeConv(f_i, Offsets)
wherein DeConv() represents a deformable convolution whose kernel weights are spatially shared, obtained by random initialization.
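For illustration, a minimal PyTorch sketch of such a module is given below, using torchvision's DeformConv2d operator, whose kernel weights are spatially shared and randomly initialized as required here. The class name and channel sizes are assumptions, and the choice to apply the deformable convolution to f_i follows the formula above (the prose instead describes sampling the depth features).

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableGuideModule(nn.Module):
    """Sketch of the deformable guide module (names and channel sizes are assumptions)."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # standard convolution predicting 2*k*k per-pixel offsets from the fused features
        self.offset_conv = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        # deformable convolution with spatially shared, randomly initialized kernel weights
        self.deform_conv = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, f_i: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(f_s + f_i)        # Offsets = Conv(f_s + f_i)
        residual = self.deform_conv(f_i, offsets)    # sample and aggregate with learned offsets
        return f_i + residual                        # Output = f_i + DeConv(f_i, Offsets)

# usage example with hypothetical feature maps
module = DeformableGuideModule(channels=32)
f_i = torch.randn(1, 32, 64, 64)
f_s = torch.randn(1, 32, 64, 64)
out = module(f_i, f_s)   # (1, 32, 64, 64)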
Depth refinement: as shown in fig. 5, the last-layer features F_I and F_S of the image-guided and depth regression branch decoders are given. First, a pixel-level addition is performed to fuse the features of the two branches; then pixel-level convolution kernel weights and an offset feature map are learned through two independent standard convolutions. To make the model converge stably, a sigmoid layer constrains each weight to lie between 0 and 1, and the mean is then subtracted so that the weights sum to 0. Based on the given offsets and kernel weights, a deformable convolution is applied to the coarse depth map CD to obtain a depth residual map ΔD, and the depth residual map is finally added to the coarse depth map to obtain the final refined depth map D. Specifically, the process can be expressed as:
Weights = Conv(F_I + F_S)
Offsets = Conv(F_I + F_S)
ΔD = DeConv(CD, Weights, Offsets)
D = CD + ΔD
Wherein DeConv () represents a deformable convolution whose kernel weights are spatially varying and content dependent, learned from image-guided features and depth features.
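Because the kernel weights in this stage vary per pixel and depend on content, the standard deformable convolution operator with spatially shared weights cannot be used directly; the sketch below therefore implements the refinement step explicitly with bilinear sampling. The function name, the assumed (dy, dx) ordering of the offset channels and the channel counts in the usage example are illustrative assumptions.

import torch
import torch.nn.functional as F

def refine_depth(CD, F_I, F_S, weight_conv, offset_conv, k=3):
    """Depth refinement sketch: ΔD = DeConv(CD, Weights, Offsets), D = CD + ΔD."""
    # CD: (B, 1, H, W) coarse depth map; F_I, F_S: last-layer decoder features (B, C, H, W)
    fused = F_I + F_S
    weights = torch.sigmoid(weight_conv(fused))             # (B, k*k, H, W), each weight in (0, 1)
    weights = weights - weights.mean(dim=1, keepdim=True)   # subtract the mean so the weights sum to 0
    offsets = offset_conv(fused)                            # (B, 2*k*k, H, W), assumed (dy, dx) per tap

    B, _, H, W = CD.shape
    device = CD.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    r = torch.arange(k, device=device) - k // 2             # regular k x k grid around each pixel
    ry, rx = torch.meshgrid(r, r, indexing="ij")
    ry, rx = ry.flatten().float(), rx.flatten().float()
    offsets = offsets.view(B, k * k, 2, H, W)

    delta_D = torch.zeros_like(CD)
    for j in range(k * k):                                   # one tap of the kernel at a time
        sy = ys + ry[j] + offsets[:, j, 0]                   # sampling row: regular grid + learned offset
        sx = xs + rx[j] + offsets[:, j, 1]                   # sampling column
        grid = torch.stack((2.0 * sx / (W - 1) - 1.0,        # normalize to [-1, 1] for grid_sample
                            2.0 * sy / (H - 1) - 1.0), dim=-1)
        sampled = F.grid_sample(CD, grid, mode="bilinear",
                                padding_mode="border", align_corners=True)
        delta_D = delta_D + weights[:, j:j + 1] * sampled    # per-pixel, content-dependent weighting
    return CD + delta_D                                      # refined depth map D = CD + ΔD

# usage with hypothetical channel counts
import torch.nn as nn
weight_conv = nn.Conv2d(32, 9, 3, padding=1)    # k*k = 9 per-pixel kernel weights
offset_conv = nn.Conv2d(32, 18, 3, padding=1)   # 2*k*k = 18 per-pixel offsets
CD = torch.randn(1, 1, 64, 64)
F_I = torch.randn(1, 32, 64, 64)
F_S = torch.randn(1, 32, 64, 64)
D = refine_depth(CD, F_I, F_S, weight_conv, offset_conv)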
The mean square error (MSE) is used during training to compute the loss between the ground-truth depth and the predicted depth. For real-world depth data, the ground truth is typically semi-dense, because a true depth value is difficult to obtain for every pixel. Therefore, only valid pixels in the ground-truth depth map are considered when computing the training loss. The loss function can thus be expressed as:
L(D_pred) = ‖(D_pred - D_gt) ⊙ m(D_gt > 0)‖²
where D_pred denotes the predicted depth map, D_gt the ground-truth depth map used for supervision, ⊙ element-wise multiplication, and m(·) the mask of valid pixels. Since the ground-truth depth contains invalid pixels, only pixels with valid depth values are considered.
It is also necessary to supervise the intermediate depth prediction (coarse depth map CD), so the final loss function is:
Loss=L(D)+λL(CD)
where λ is an empirically set hyper-parameter, recommended here to be 0.2.
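A direct PyTorch sketch of this training loss is shown below; averaging over the number of valid pixels is an added assumption (a common practice), since the text only specifies the masked squared norm.

import torch

def masked_mse(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """L(D_pred) = ||(D_pred - D_gt) ⊙ m(D_gt > 0)||², averaged over valid pixels (assumed)."""
    mask = (gt > 0).float()                       # m(D_gt > 0): valid-pixel mask
    sq_err = ((pred - gt) * mask) ** 2
    return sq_err.sum() / mask.sum().clamp(min=1.0)

def total_loss(D: torch.Tensor, CD: torch.Tensor, gt: torch.Tensor, lam: float = 0.2) -> torch.Tensor:
    """Loss = L(D) + λ·L(CD), with λ = 0.2 as recommended in the text."""
    return masked_mse(D, gt) + lam * masked_mse(CD, gt)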
It will be appreciated by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and executed to implement the steps of the method.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (3)

1. A depth completion method that learns to guide deformable convolutions, characterized by comprising the following steps:
S1, giving a pair of inputs comprising a sparse depth map and an RGB image;
S2, extracting depth features and image features from the sparse depth map and the RGB image respectively, and fusing them to obtain multi-modal features;
S3, taking the multi-modal features as input, adaptively guiding depth feature aggregation with the image feature information, and finally predicting a coarse depth map, which specifically comprises: constructing two encoder-decoder branches taking the multi-modal features as input, namely an image-guided branch and a depth regression branch; the encoder stage of the image-guided branch receives information from the corresponding decoder through skip connections, and the encoder stage of the depth regression branch likewise receives information through skip connections; a deformable guide module is embedded after each scale feature of the encoder of the depth regression branch to aggregate relevant semantic information from a neighborhood range, and the output of the depth regression branch is a coarse depth map; the deformable guide module comprises the following processing steps:
given the image features f_i from the image-guided branch and the depth features f_s from the depth regression branch, first performing a pixel-level addition to fuse the image features and the depth features;
next, learning a pixel-level offset feature map by a standard convolution operation; it contains offsets along the x and y coordinate directions, representing position deviations on a regular grid, for a total of 2 × k² channels, where k is the convolution kernel size;
Then, sampling the depth feature map on a regular grid based on the offsets to obtain related semantic information in the neighborhood;
then, standard convolution with the kernel size of k is performed on the sampled features to aggregate the information and learn to obtain depth feature residuals;
finally, adding the depth feature residual to the depth features to obtain the final output;
The deformable guiding module is expressed as:
Offsets = Conv(f_s + f_i)
Output = f_i + DeConv(f_i, Offsets)
Wherein DeConv () represents a deformable convolution whose kernel weights are spatially shared, obtained by random initialization;
S4, generating a depth residual map from adaptively predicted, spatially varying and content-dependent kernel weights and offsets, and further obtaining the refined depth map, which specifically comprises the following steps:
giving the last-layer features F_I and F_S of the image-guided and depth regression branch decoders, respectively;
first, performing a pixel-level addition to fuse the features of the two branches;
learning pixel-level convolution kernel weights and an offset feature map through two independent standard convolutions;
constraining each weight to lie between 0 and 1 with a sigmoid layer and then subtracting the mean so that the weights sum to 0;
based on the given offsets and kernel weights, performing a deformable convolution on the coarse depth map CD to obtain a depth residual map ΔD;
adding the depth residual map to the coarse depth map to obtain the final refined depth map D;
The concrete expression is as follows:
Weights = Conv(F_I + F_S)
Offsets = Conv(F_I + F_S)
ΔD = DeConv(CD, Weights, Offsets)
D = CD + ΔD
Wherein DeConv () represents a deformable convolution whose kernel weights are spatially varying and content dependent, learned from image-guided features and depth features.
2. The depth completion method that learns to guide deformable convolutions of claim 1, wherein in step S2, for a pair of inputs consisting of a sparse depth map S and an RGB image I, depth features F_S′ and image features F_I′ are obtained by applying an initial convolution layer and a lightweight encoder-decoder to S and I, respectively; the two features are then fused by a pixel-level addition to obtain the multi-modal features F_IS, expressed as:
F_IS = F_I′ + F_S′
where the two lightweight encoder-decoders extract the image features and the depth features, respectively.
3. The depth completion method that learns to guide deformable convolutions of claim 1, wherein the mean square error MSE is used to calculate the loss between the true depth and the predicted depth, expressed as:
L(D_pred) = ‖(D_pred - D_gt) ⊙ m(D_gt > 0)‖²
wherein D_pred represents the predicted depth map, D_gt the ground-truth depth map used for supervision, ⊙ element-wise multiplication, and m(·) the mask of valid pixels; only pixels with valid depth values are considered;
the coarse depth map CD also needs to be supervised, so the final loss function is:
Loss=L(D)+λL(CD)
where λ is an empirically set hyper-parameter.
CN202410047698.6A 2024-01-12 2024-01-12 Learning to Guide Deformable Convolutions for Depth Completion Active CN117876449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410047698.6A CN117876449B (en) 2024-01-12 2024-01-12 Learning to Guide Deformable Convolutions for Depth Completion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410047698.6A CN117876449B (en) 2024-01-12 2024-01-12 Learning to Guide Deformable Convolutions for Depth Completion

Publications (2)

Publication Number Publication Date
CN117876449A CN117876449A (en) 2024-04-12
CN117876449B true CN117876449B (en) 2025-01-24

Family

ID=90596665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410047698.6A Active CN117876449B (en) 2024-01-12 2024-01-12 Learning to Guide Deformable Convolutions for Depth Completion

Country Status (1)

Country Link
CN (1) CN117876449B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118941616A (en) * 2024-10-12 2024-11-12 中关村科学城城市大脑股份有限公司 Sludge weighing method, device, electronic device and computer readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861729A (en) * 2021-02-08 2021-05-28 浙江大学 Real-time depth completion method based on pseudo-depth map guidance
CN113538278A (en) * 2021-07-16 2021-10-22 北京航空航天大学 Depth map completion method based on deformable convolution

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111066063B (en) * 2018-06-29 2023-07-21 百度时代网络技术(北京)有限公司 System and method for depth estimation using affinity for convolutional spatial propagation network learning
CN114004754B (en) * 2021-09-13 2022-07-26 北京航空航天大学 A system and method for scene depth completion based on deep learning
CN116245930A (en) * 2023-02-28 2023-06-09 北京科技大学顺德创新学院 A method and device for depth completion based on attention panorama perception guidance

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861729A (en) * 2021-02-08 2021-05-28 浙江大学 Real-time depth completion method based on pseudo-depth map guidance
CN113538278A (en) * 2021-07-16 2021-10-22 北京航空航天大学 Depth map completion method based on deformable convolution

Also Published As

Publication number Publication date
CN117876449A (en) 2024-04-12


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant