CN116468768B - Scene depth completion method based on conditional variational autoencoder and geometric guidance - Google Patents
Scene depth completion method based on conditional variational autoencoder and geometric guidance
- Publication number
- CN116468768B (application CN202310422520.0A)
- Authority
- CN
- China
- Prior art keywords
- depth
- map
- point cloud
- depth map
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/521—Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01B—MEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
- G01B11/00—Measuring arrangements characterised by the use of optical techniques
- G01B11/24—Measuring arrangements characterised by the use of optical techniques for measuring contours or curvatures
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/86—Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/88—Lidar systems specially adapted for specific applications
- G01S17/89—Lidar systems specially adapted for specific applications for mapping or imaging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10032—Satellite or aerial image; Remote sensing
- G06T2207/10044—Radar image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
Description
Technical field
The present invention relates to the technical field of depth map completion, and specifically to a scene depth completion method based on a conditional variational autoencoder and geometric guidance.
Background art
Human perception, understanding and experience of the surrounding environment rely on three-dimensional scene information acquired through vision. Computer vision imitates this behavior, using various sensors as visual organs to obtain scene information and thereby recognize and understand the scene; depth information plays a key role in fields such as robotics, autonomous driving and augmented reality. In autonomous driving, the vehicle must sense its distance to other vehicles, pedestrians and obstacles while driving, and full Level 5 automation requires ranging accurate to the centimeter. At present, LiDAR is the main active range sensor in autonomous driving. Compared with the two-dimensional RGB image captured by a color camera, the depth map acquired by LiDAR (the depth map and the point cloud can be converted into each other through the camera intrinsics) provides precise depth and can therefore accurately perceive the 3D positions of objects in the surrounding environment. However, a single LiDAR can only emit a limited number of laser beams in the vertical direction (16, 32 or 64 lines), so the collected point cloud is extremely sparse (pixels with valid depth values cover only about 5% of the color image), which severely affects downstream tasks such as 3D object detection and 3D environment perception.
Summary of the invention
The purpose of the present invention is to provide a scene depth completion method based on a conditional variational autoencoder and geometric guidance, so as to solve the key problem of sparse and missing data produced by existing depth imaging devices such as LiDAR.
To achieve the above purpose, the present invention provides the following technical solution: a scene depth completion method based on a conditional variational autoencoder and geometric guidance, comprising the following steps:
acquiring color images, sparse depth maps and dense depth maps of autonomous driving scenes;
designing a conditional variational autoencoder with a prior network and a posterior network, feeding the color image and the sparse depth map into the prior network to extract features, and feeding the color image, the sparse depth map and the dense depth map into the posterior network to extract features;
converting the sparse depth map into a point cloud using the camera intrinsics or focal length, extracting geometric spatial features with a point cloud upsampling model, and mapping them back onto the sparse depth map;
fusing image features and point cloud features with a dynamic graph message propagation module;
generating a preliminary depth completion map with a residual-network-based U-shaped encoder-decoder;
feeding the preliminarily predicted completion depth map into a confidence uncertainty estimation module to obtain the final optimized depth completion.
Preferably, acquiring the color images and sparse depth maps of autonomous driving scenes comprises:
capturing color images and sparse depth maps of autonomous driving scenes with a color camera and a LiDAR;
converting the sparse depth map into a dense depth map with the Sparsity Invariant CNNs algorithm, which serves as the ground-truth label to assist training.
Preferably, designing the conditional variational autoencoder with a prior network and a posterior network, feeding the color image and the sparse depth map into the prior network to extract features, and feeding the color image, the sparse depth map and the dense depth map into the posterior network to extract features comprises:
based on a ResNet-style feature extraction module, designing a prior network and a posterior network with identical structures as the conditional variational autoencoder;
feeding the color image and the sparse depth map into the prior network to extract the last-layer feature map Prior, and at the same time feeding the color image, the sparse depth map and the ground-truth label into the posterior network to extract the last-layer feature map Posterior; then computing the mean and variance of the Prior and Posterior feature maps to obtain the probability distributions D1 and D2 of the respective features, and using the Kullback-Leibler divergence loss function to supervise the discrepancy between distributions D1 and D2, so that the prior network learns the ground-truth label features captured by the posterior network.
Preferably, converting the sparse depth map into a point cloud using the camera intrinsics or focal length, extracting geometric spatial features with a point cloud upsampling model, and mapping them back onto the sparse depth map comprises:
converting the sparse depth image pixels (ui, vi) from the pixel coordinate system to the camera coordinate system according to the camera intrinsics, where (cx, cy) is the optical center of the camera, fx and fy are the focal lengths of the camera along the x and y axes and di is the depth value at (ui, vi), to obtain the point cloud coordinates (xi, yi, zi) of the three-dimensional scene and form the sparse point cloud data S; the same conversion is applied to the ground-truth depth map to generate the dense label point cloud S1;
randomly sampling the point cloud S several times to obtain point cloud subsets of different sizes; for each subset, aggregating the 16 nearest points around each point with the KNN nearest-neighbor algorithm and feeding them into a geometry-aware neural network to extract the local geometric feature of that point;
adding the sparse point cloud feature extracted for each point to the original point cloud coordinates (xi, yi, zi) to obtain the point cloud encoding feature Q, feeding Q into a four-times-upsampling multi-layer perceptron network to obtain the predicted dense point cloud S2, and using the Chamfer Distance (CD) loss function to compute the loss between the ground-truth dense point cloud S1 and the predicted dense point cloud S2, thereby supervising the training of the network; the CD loss is the sum of two terms: the first term sums, over every point x in S1, the minimum distance from x to S2, and the second term sums, over every point y in S2, the minimum distance from y to S1.
Preferably, fusing the image features and point cloud features with the dynamic graph message propagation module comprises:
designing two encoders with the same structure, each consisting of five ResNet stages; feeding the color image and the sparse depth map into the RGB-branch encoder to extract feature maps at five scales L1, L2, L3, L4 and L5, and feeding the point cloud feature Q and the sparse depth map into the point-cloud-branch encoder to extract feature maps at five scales P1, P2, P3, P4 and P5;
for L1, L2, L3, L4 and L5, using dilated convolution to reach pixels in different receptive fields, then using deformable convolution to learn a coordinate offset for each pixel, so that each pixel dynamically aggregates strongly correlated surrounding feature values, yielding features T1, T2, T3, T4 and T5;
adding T1, T2, T3, T4 and T5, which are rich in dynamic graph features, to the point cloud encoding feature maps P1, P2, P3, P4 and P5 to obtain point cloud feature maps M1, M2, M3, M4 and M5 that contain both semantic and geometric information.
Preferably, generating the preliminary depth completion map with the residual-network-based U-shaped encoder-decoder comprises:
designing a multi-scale decoder matching the encoder structure, forming an encoder-decoder network with a U-Net structure;
feeding the feature maps L1, L2, L3, L4 and L5 generated by the RGB branch into the U-Net to predict the first coarse depth completion map Depth1 and the confidence map C1;
feeding the feature maps M1, M2, M3, M4 and M5 generated by the point cloud branch into the U-Net to predict the second coarse depth completion map Depth2 and the confidence map C2.
Preferably, feeding the preliminarily predicted completion depth map into the confidence uncertainty estimation module to obtain the final optimized depth completion comprises:
adding the generated confidence maps C1 and C2 to obtain the feature map C, performing uncertainty prediction on C with the Softmax function, and predicting the uncertainty ratios F1 and F2 of each confidence map pixel by pixel;
multiplying the uncertainty maps F1 and F2 by the coarse depth completions Depth1 and Depth2, respectively, to obtain the final optimized depth completion map.
Compared with the prior art, the beneficial effects of the present invention are:
A conditional variational autoencoder learns the feature distribution of the ground-truth dense depth map and thereby guides the color image and sparse depth map to generate more valuable depth features. In addition, point cloud features in three-dimensional space capture spatial structure across modalities, strengthening the geometric perception of the network and providing auxiliary information for predicting more accurate depth values. Furthermore, the dynamic graph message propagation module fuses the features of the color image and the point cloud, achieving high-precision depth completion. This compensates for the excessive sparsity of LiDAR data, allows a low-cost LiDAR with fewer beams to obtain reasonably accurate and dense depth information, and provides a cost-effective solution for industries that require accurate and dense depth data, such as autonomous driving and robotic environment perception.
Brief description of the drawings
Figure 1 is a flow chart of a scene depth completion method based on a conditional variational autoencoder and geometric guidance provided by an embodiment of the present invention;
Figure 2 is a depth completion map of a scene depth completion method based on a conditional variational autoencoder and geometric guidance provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
The method of this embodiment is executed by a terminal, which may be a mobile phone, a tablet computer, a PDA, a laptop or a desktop computer, or of course any other device with similar functions; this embodiment places no restriction on the terminal.
Referring to Figures 1 and 2, the present invention provides a scene depth completion method based on a conditional variational autoencoder and geometric guidance, applied to depth completion of autonomous driving scenes, comprising:
Step S1: acquire color images, sparse depth maps and dense depth maps of autonomous driving scenes.
Specifically, step S1 further comprises the following steps:
S101: capture color images and sparse depth maps of autonomous driving scenes with a color camera and a LiDAR;
S102: convert the sparse depth map into a dense depth map with the Sparsity Invariant CNNs algorithm, which serves as the ground-truth label to assist training.
Self-driving vehicles are mainly equipped with a color camera and a LiDAR, which collect RGB images and depth images respectively. This method additionally generates a completed depth map as the training label, with the following specific steps:
capture color images and depth images of autonomous driving scenes with a color camera and a Velodyne HDL-64E LiDAR; convert the sparse depth map into a dense depth map with the Sparsity Invariant CNNs algorithm and use it as the ground-truth label.
Step S2: design a conditional variational autoencoder with a prior network and a posterior network, feed the color image and the sparse depth map into the prior network to extract features, and feed the color image, the sparse depth map and the dense depth map into the posterior network to extract features.
Specifically, step S2 further comprises the following steps:
S201: based on a ResNet-style feature extraction module, design a prior network and a posterior network with identical structures as the conditional variational autoencoder;
S202: feed the color image and the sparse depth map into the prior network to extract the last-layer feature map Prior, and at the same time feed the color image, the sparse depth map and the ground-truth label into the posterior network to extract the last-layer feature map Posterior; then compute the mean and variance of the Prior and Posterior feature maps to obtain the probability distributions D1 and D2 of the respective features, and use the Kullback-Leibler divergence loss function to supervise the discrepancy between distributions D1 and D2, so that the prior network learns the ground-truth label features captured by the posterior network.
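A minimal PyTorch-style sketch of this distribution-matching step is given below. The diagonal-Gaussian parameterization, the 1x1 convolution heads and all names are illustrative assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class DistributionHead(nn.Module):
    """Predicts a diagonal Gaussian (mean, log-variance) from a feature map."""
    def __init__(self, channels: int, latent_dim: int):
        super().__init__()
        self.mu = nn.Conv2d(channels, latent_dim, kernel_size=1)
        self.logvar = nn.Conv2d(channels, latent_dim, kernel_size=1)

    def forward(self, feat):
        return self.mu(feat), self.logvar(feat)

def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    """KL(D2 || D1) for diagonal Gaussians: posterior distribution D2 against
    prior distribution D1, averaged over all elements."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.mean()

# prior_feat = prior_net(rgb, sparse_depth)            # last-layer feature map "Prior"
# post_feat  = posterior_net(rgb, sparse_depth, label) # last-layer feature map "Posterior"
# mu_p, lv_p = prior_head(prior_feat)
# mu_q, lv_q = posterior_head(post_feat)
# loss_kl    = kl_divergence(mu_q, lv_q, mu_p, lv_p)
```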
Step S3: convert the sparse depth map into a point cloud using the camera intrinsics or focal length, extract geometric spatial features with a point cloud upsampling model, and map them back onto the sparse depth map.
Specifically, step S3 further comprises the following steps:
S301: convert the sparse depth image pixels (ui, vi) from the pixel coordinate system to the camera coordinate system according to the camera intrinsics/focal length to obtain the point cloud coordinates (xi, yi, zi) of the three-dimensional scene, forming the sparse point cloud data S.
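Under the standard pinhole camera model, and with the symbols defined in the next paragraph, this conversion can be written as

$$x_i = \frac{(u_i - c_x)\,d_i}{f_x}, \qquad y_i = \frac{(v_i - c_y)\,d_i}{f_y}, \qquad z_i = d_i$$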
where (cx, cy) is the optical center of the camera, fx and fy are the focal lengths of the camera along the x and y axes, and di is the depth value at (ui, vi); the same conversion is also applied to the ground-truth depth map to generate the dense label point cloud S1.
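A vectorized sketch of this back-projection, assuming the sparse depth map is stored as an H x W array in which pixels without a measurement are zero (the function and variable names are illustrative):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, zeros = no measurement) into an
    N x 3 point cloud expressed in the camera coordinate system."""
    v, u = np.nonzero(depth > 0)          # pixel coordinates with valid depth
    d = depth[v, u]
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return np.stack([x, y, d], axis=1)    # rows are (x_i, y_i, z_i)

# S  = depth_to_point_cloud(sparse_depth, fx, fy, cx, cy)  # sparse point cloud
# S1 = depth_to_point_cloud(label_depth, fx, fy, cx, cy)   # dense label point cloud
```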
S302: randomly sample the point cloud S several times to obtain point cloud subsets of different sizes; for each subset, aggregate the 16 nearest points around each point with the KNN nearest-neighbor algorithm and feed them into a geometry-aware neural network to extract the local geometric feature of that point.
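One possible form of the neighbourhood grouping, assuming the point set is held in a torch tensor and using a brute-force pairwise distance; the geometry-aware network itself is only referenced as a placeholder:

```python
import torch

def group_knn(points: torch.Tensor, k: int = 16) -> torch.Tensor:
    """points: (N, 3). Returns (N, k, 3) local neighbourhoods, expressed relative
    to each centre point so that the features are translation-invariant."""
    dist = torch.cdist(points, points)                      # (N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]    # drop the point itself
    neighbours = points[idx]                                # (N, k, 3)
    return neighbours - points.unsqueeze(1)                 # centre each neighbourhood

# local_feature = geometry_aware_mlp(group_knn(sampled_points, k=16))
```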
S303: add the sparse point cloud feature extracted for each point to the original point cloud coordinates (xi, yi, zi) to obtain the point cloud encoding feature Q, feed Q into a four-times-upsampling multi-layer perceptron network to obtain the predicted dense point cloud S2, and use the Chamfer Distance loss function to compute the loss between the ground-truth dense point cloud S1 and the predicted dense point cloud S2, thereby supervising the training of the network.
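With the standard Chamfer Distance definition and the point sets S1 and S2 defined above (written here with squared Euclidean distances, the usual choice for point cloud supervision), the CD loss takes the form

$$L_{CD}(S_1,S_2)=\sum_{x\in S_1}\min_{y\in S_2}\lVert x-y\rVert_2^2+\sum_{y\in S_2}\min_{x\in S_1}\lVert y-x\rVert_2^2$$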
where the first term is the sum, over every point x in S1, of the minimum distance from x to S2, and the second term is the sum, over every point y in S2, of the minimum distance from y to S1.
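A brute-force sketch of this loss, adequate for illustration (practical implementations usually rely on a batched or CUDA kernel):

```python
import torch

def chamfer_distance(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    """s1: (N, 3) ground-truth dense point cloud, s2: (M, 3) predicted point cloud."""
    dist = torch.cdist(s1, s2) ** 2          # (N, M) squared Euclidean distances
    term1 = dist.min(dim=1).values.sum()     # every x in S1 to its nearest point in S2
    term2 = dist.min(dim=0).values.sum()     # every y in S2 to its nearest point in S1
    return term1 + term2

# loss_cd = chamfer_distance(S1, S2_predicted)
```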
Step S4: fuse the image features and point cloud features with the dynamic graph message propagation module.
Specifically, step S4 further comprises the following steps:
S401: design two encoders with the same structure, each consisting of five ResNet stages; feed the color image and the sparse depth map into the RGB-branch encoder to extract feature maps at five scales L1, L2, L3, L4 and L5, and feed the point cloud feature Q and the sparse depth map into the point-cloud-branch encoder to extract feature maps at five scales P1, P2, P3, P4 and P5;
S402: for L1, L2, L3, L4 and L5, use dilated convolution to reach pixels in different receptive fields, then use deformable convolution to learn a coordinate offset for each pixel, so that each pixel dynamically aggregates strongly correlated surrounding feature values, yielding features T1, T2, T3, T4 and T5;
S403: add T1, T2, T3, T4 and T5, which are rich in dynamic graph features, to the point cloud encoding feature maps P1, P2, P3, P4 and P5 to obtain point cloud feature maps M1, M2, M3, M4 and M5 that contain both semantic and geometric information.
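A simplified sketch of one aggregation step at a single scale, assuming torchvision's deformable convolution and equal channel widths in the two branches; the actual module in the patent may predict offsets and fuse features differently.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DynamicGraphPropagation(nn.Module):
    """Dilated convolution enlarges the receptive field; a learned offset field then
    lets each pixel aggregate strongly correlated neighbours via deformable convolution."""
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.dilated = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)  # (dx, dy) per kernel tap
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, rgb_feat, pc_feat):
        t = self.dilated(rgb_feat)
        t = self.deform(t, self.offset(t))   # dynamic aggregation, yielding T_k
        return pc_feat + t                   # fused feature map M_k = P_k + T_k

# M_k = DynamicGraphPropagation(channels)(L_k, P_k)  # applied at each of the five scales
```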
Step S5: generate a preliminary depth completion map with the residual-network-based U-shaped encoder-decoder.
Specifically, step S5 further comprises the following steps:
S501: design a multi-scale decoder matching the encoder structure of step S4, forming an encoder-decoder network with a U-Net structure;
S502: feed the feature maps L1, L2, L3, L4 and L5 generated by the RGB branch into the U-Net to predict the first coarse depth completion map Depth1 and the confidence map C1;
S503: feed the feature maps M1, M2, M3, M4 and M5 generated by the point cloud branch into the U-Net to predict the second coarse depth completion map Depth2 and the confidence map C2.
Step S6: feed the preliminarily predicted completion depth map into the confidence uncertainty estimation module to obtain the final optimized depth completion.
Specifically, step S6 further comprises the following steps:
S601: add the confidence maps C1 and C2 generated in step S5 to obtain the feature map C, perform uncertainty prediction on C with the Softmax function, and predict the uncertainty ratios F1 and F2 of each confidence map pixel by pixel;
S602: multiply the uncertainty maps F1 and F2 by the coarse depth completions Depth1 and Depth2, respectively, to obtain the final optimized depth completion map.
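A compact sketch of this fusion. The text describes adding C1 and C2 before the Softmax; a common reading, assumed here, is a pixel-wise Softmax over the two stacked confidence maps so that F1 + F2 = 1 at every pixel.

```python
import torch

def fuse_predictions(depth1, depth2, conf1, conf2):
    """depth1/depth2 and conf1/conf2: tensors of shape (B, 1, H, W).
    Softmax over the confidence channel yields the uncertainty weights F1 and F2."""
    weights = torch.softmax(torch.cat([conf1, conf2], dim=1), dim=1)  # (B, 2, H, W)
    f1, f2 = weights[:, 0:1], weights[:, 1:2]
    return f1 * depth1 + f2 * depth2     # final optimized depth completion map
```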
In this embodiment, a conditional variational autoencoder first learns the feature distribution of the ground-truth dense depth map, guiding the color image and sparse depth map to generate more valuable depth features. Second, point cloud features in three-dimensional space capture spatial structure across modalities, strengthening the geometric perception of the network and providing auxiliary information for predicting more accurate depth values. Finally, the dynamic graph message propagation module fuses the features of the color image and the point cloud, achieving high-precision depth completion prediction.
In addition, it should be noted that the ways in which the technical features of this application may be combined are not limited to the combinations recited in the claims or described in the specific embodiments; all technical features described in this application may be freely combined in any manner, unless they contradict one another.
It should be noted that the above are only specific embodiments of the present invention. The present invention is obviously not limited to the above embodiments, and many similar variations follow from them. All variations that a person skilled in the art can derive or deduce directly from the disclosure of the present invention shall fall within the scope of protection of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310422520.0A CN116468768B (en) | 2023-04-20 | 2023-04-20 | Scene depth completion method based on conditional variational autoencoder and geometric guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116468768A CN116468768A (en) | 2023-07-21 |
CN116468768B true CN116468768B (en) | 2023-10-17 |
Family
ID=87183885
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310422520.0A Active CN116468768B (en) | 2023-04-20 | 2023-04-20 | Scene depth completion method based on conditional variational autoencoder and geometric guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116468768B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351310B (en) * | 2023-09-28 | 2024-03-12 | Shandong University | Multimodal 3D target detection method and system based on depth completion |
CN117953029B (en) * | 2024-03-27 | 2024-06-07 | University of Science and Technology Beijing | General depth map completion method and device based on depth information propagation |
CN118411396B (en) * | 2024-04-16 | 2025-03-21 | Institute of Automation, Chinese Academy of Sciences | Depth completion method, device, electronic device and storage medium |
CN119151804B (en) * | 2024-11-19 | 2025-03-18 | Zhejiang University | A method and device for generating pseudo points based on mixed light images |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112767294A (en) * | 2021-01-14 | 2021-05-07 | Guangdong OPPO Mobile Telecommunications Corp., Ltd. | Depth image enhancement method and device, electronic equipment and storage medium |
CN112861729A (en) * | 2021-02-08 | 2021-05-28 | Zhejiang University | Real-time depth completion method based on pseudo-depth map guidance |
WO2022045495A1 (en) * | 2020-08-25 | 2022-03-03 | Samsung Electronics Co., Ltd. | Methods for depth map reconstruction and electronic computing device for implementing the same |
CN114998406A (en) * | 2022-07-14 | 2022-09-02 | Wuhan Tuke Intelligent Technology Co., Ltd. | Self-supervision multi-view depth estimation method and device |
CN115423978A (en) * | 2022-08-30 | 2022-12-02 | Northwestern Polytechnical University | Deep Learning-Based Image Laser Data Fusion Method for Building Reconstruction |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11315266B2 (en) * | 2019-12-16 | 2022-04-26 | Robert Bosch Gmbh | Self-supervised depth estimation method and system |
Non-Patent Citations (5)
Title |
---|
Depth Completion Auto-Encoder; Kaiyue Lu et al.; 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW); full text *
Sparse-to-Dense Multi-Encoder Shape Completion of Unstructured Point Cloud; Yanjun Peng et al.; IEEE Access, vol. 8; full text *
Unsupervised depth estimation model for tomato plant images based on a dense autoencoder; Zhou Yuncheng, Deng Hanbing, Xu Tongyu, Miao Teng, Wu Qiong; Transactions of the Chinese Society of Agricultural Engineering, 2020, No. 11; full text *
Depth image acquisition method fusing vision and laser point clouds; Wang Dongmin, Peng Yongsheng, Li Yongle; Journal of Military Transportation University, 2017, No. 10; full text *
Research on robust and intelligent multi-source fusion SLAM; Zuo Xingxing; China Doctoral Dissertations Full-text Database (Information Science and Technology); full text *
Also Published As
Publication number | Publication date |
---|---|
CN116468768A (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116468768B (en) | 2023-10-17 | Scene depth completion method based on conditional variational autoencoder and geometric guidance | |
CN109377530B (en) | A Binocular Depth Estimation Method Based on Deep Neural Network | |
CN110992271B (en) | Image processing method, path planning method, device, equipment and storage medium | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
CN110246181B (en) | Anchor point-based attitude estimation model training method, attitude estimation method and system | |
CN111696148A (en) | End-to-end stereo matching method based on convolutional neural network | |
CN116612468A (en) | 3D Object Detection Method Based on Multimodal Fusion and Deep Attention Mechanism | |
CN116129233A (en) | Panoramic Segmentation Method for Autonomous Driving Scene Based on Multimodal Fusion Perception | |
WO2021249114A1 (en) | Target tracking method and target tracking device | |
CN117422884A (en) | Three-dimensional target detection method, system, electronic equipment and storage medium | |
Shi et al. | An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds | |
CN115330935A (en) | A 3D reconstruction method and system based on deep learning | |
CN115511759A (en) | A Point Cloud Image Depth Completion Method Based on Cascade Feature Interaction | |
CN114049362A (en) | Transform-based point cloud instance segmentation method | |
CN117745944A (en) | Pre-training model determining method, device, equipment and storage medium | |
Jia et al. | Depth measurement based on a convolutional neural network and structured light | |
CN117351310B (en) | Multimodal 3D target detection method and system based on depth completion | |
CN118429764A (en) | Collaborative sensing method based on multi-mode fusion | |
CN118537834A (en) | Vehicle perception information acquisition method, device, equipment and storage medium | |
CN113012191A (en) | Laser mileage calculation method based on point cloud multi-view projection graph | |
CN117423102A (en) | Point cloud data processing method and related equipment | |
CN115965961B (en) | Local-global multi-mode fusion method, system, equipment and storage medium | |
CN117953157A (en) | Pure vision self-supervision three-dimensional prediction model based on two-dimensional video in automatic driving field | |
CN116912645A (en) | Three-dimensional target detection method and device integrating texture and geometric features | |
CN117671625A (en) | A multi-modal perception method and system for autonomous driving based on diffusion model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |