CN118570273A

CN118570273A - A monocular depth estimation method based on surface normal vector and neural radiance field

Info

Publication number: CN118570273A
Application number: CN202411055049.7A
Authority: CN
Inventors: 徐亮; 潘洁莹; 徐文翔; 任毅龙; 于海洋
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2024-08-02
Filing date: 2024-08-02
Publication date: 2024-08-30

Abstract

The present invention discloses a monocular depth estimation method based on surface normal vectors and neural radiance fields, which addresses the problems that prior-art depth estimation methods based on deep learning suffer from ambiguity and cannot handle occlusion and low-texture areas. The method includes: 1. acquiring a depth prior map based on a surface normal vector and depth estimation model; 2. guiding neural radiance field training based on the depth prior; 3. rendering an accurate depth map based on the neural radiance field and Gaussian filtering. The surface-normal-based approach adds geometric constraints to the projected points, effectively alleviating the ambiguity problem in monocular depth estimation. The neural-radiance-field-based approach does not rely on estimated correspondences or cross-view reprojected depth for optimization, but optimizes the volume directly. Thanks to the good view synthesis quality of the neural radiance field, problems such as occlusion and highlights in monocular depth estimation can be effectively addressed.

Description

A monocular depth estimation method based on surface normal vector and neural radiance field

Technical Field

The present invention belongs to the field of computer vision, and in particular relates to a monocular depth estimation method based on surface normal vectors and neural radiance fields.

Background Art

In recent years, a growing share of the autonomous driving industry has adopted pure vision-based perception. Depth information characterizes the three-dimensional structure of a scene and is important for understanding the geometric relationships between objects in it; it is widely used in fields such as three-dimensional reconstruction, autonomous driving, and face recognition. The goal of depth estimation is to obtain the distance to objects and ultimately a depth map, which supplies depth information to a range of tasks (such as 3D reconstruction, SLAM, and decision-making). The mainstream distance measurement approaches on the market today are monocular, stereo, and RGB-D camera based.

Existing monocular depth estimation methods usually adopt a cost-volume-based architecture to solve the multi-view stereo problem. However, for lack of inference constraints, the depth maps predicted across views are usually inconsistent and often violate photometric consistency. Current mainstream monocular depth estimation networks take a learning-based prior from single-image depth estimation and optimize it to generate accurate and consistent depth maps. However, existing methods offer no solution to the shortage of features in low-texture areas, and they struggle with depth ambiguity when global information is missing. Moreover, because a monocular image is two-dimensional, a single pixel can correspond to multiple three-dimensional points under projection, which makes the monocular depth estimation task ambiguous. A monocular depth estimation method based on surface normal vectors and neural radiance fields is therefore needed to improve the accuracy of existing monocular depth estimation methods.

The surface-normal-based approach adds geometric constraints to the projected points, effectively alleviating the ambiguity problem in monocular depth estimation. The neural-radiance-field-based approach does not rely on estimated correspondences or cross-view reprojected depth for optimization, but optimizes the volume directly. Thanks to the good view synthesis quality of the neural radiance field, problems such as occlusion and highlights in monocular depth estimation can be effectively addressed.

Summary of the Invention

The purpose of the present invention is to provide a monocular depth estimation method based on surface normal vectors and neural radiance fields, so as to solve the problems of ambiguity, occlusion, and highlights that arise in prior-art monocular depth estimation.

The technical solution of the present invention is as follows:

The present invention first provides a monocular depth estimation method based on surface normal vectors and neural radiance fields, comprising the following steps:

1) Construct an initial surface normal vector and depth estimation model, and train it on an image dataset with depth labels to obtain a trained surface normal vector and depth estimation model;

2) Input the set of images to be processed, captured by a monocular camera in a multi-view scene, into the trained surface normal vector and depth estimation model to obtain a depth prior map set;

3) Input the set of images to be processed, captured by the monocular camera, into a neural radiance field, and use the depth prior map set obtained in step 2) to guide the neural radiance field in performing three-dimensional reconstruction of the image set, obtaining a rendered depth image set;

4) Filter and optimize the rendered depth image set with Gaussian filtering, and take the resulting optimized depth image set as the monocular depth estimation result.

As a preferred solution of the present invention, the surface normal vector and depth estimation model includes a depth estimation module, a surface normal vector estimation module, and a geometric consistency optimization module. The depth estimation module extracts depth features from the input image and outputs a predicted depth map. The surface normal vector estimation module extracts a multi-scale feature map from the input image and outputs surface normal vectors. The geometric consistency optimization module uses the multi-scale feature map to constrain the depth in the predicted depth map to vary along the surface normal direction, finally yielding the depth prior map.

As a preferred solution of the present invention, inputting the image set to be processed into the neural radiance field and using the depth prior map set obtained in step 2) to guide the neural radiance field in performing three-dimensional reconstruction of the image set includes the following steps:

3.1) Project the depth prior map set obtained in step 2) into three-dimensional space according to its depth information: transform and project one depth prior map of the set into the other camera views and compare it with the depth information of those views, obtaining the error map of that depth prior map at each view; repeat this for every depth prior map in the set;

3.2) Determine the sampling range of the neural radiance field based on the error maps;

3.3) Within the determined sampling range, perform three-dimensional reconstruction of the image set to be processed to obtain the rendered depth image set.

The present invention also provides a monocular depth estimation system based on surface normal vectors and neural radiance fields, comprising:

a model building module, used to construct an initial surface normal vector and depth estimation model and train it on an image dataset with depth labels to obtain a trained surface normal vector and depth estimation model;

a depth prior map acquisition module, used to input the image set to be processed into the trained surface normal vector and depth estimation model to obtain a depth prior map set;

a three-dimensional reconstruction module, used to input the image set to be processed into the neural radiance field and, based on the depth prior map set obtained by the depth prior map acquisition module, guide the neural radiance field in performing three-dimensional reconstruction of the image set to be processed, obtaining a rendered depth image set;

a Gaussian filtering module, used to filter and optimize the rendered depth image set with Gaussian filtering, the resulting optimized depth image set being the monocular depth estimation result.

The present invention further provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above monocular depth estimation method based on surface normal vectors and neural radiance fields.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention uses surface normal vectors as auxiliary information for depth estimation and thereby overcomes the inaccuracy of prior-art monocular depth estimation in texture-poor areas. Existing methods often cannot provide reliable depth on surfaces that lack distinct texture, whereas surface normal vectors supply additional geometric constraints that let the model better understand the scene structure, improving depth estimation accuracy in these areas.

The present invention uses the volume rendering technique of neural radiance fields and thereby overcomes the accuracy problem of prior-art monocular depth estimation in textureless or low-texture areas. A neural radiance field can synthesize images from different viewpoints and provide depth information even in texture-poor areas, improving the completeness and accuracy of depth estimation.

The present invention combines the scene representation of neural radiance fields with depth prior knowledge and thereby improves the robustness of depth estimation in dynamic scenes and under occlusion. By exploiting the scene representation capability of the neural radiance field, the model can better understand and predict occlusion relationships and provide reliable depth information even in complex scenes containing dynamic objects or occlusions.

The present invention uses the weights of a Gaussian filter for optimization and thereby addresses the technical problem of blurred edges in depth estimation. By adjusting the parameters of the Gaussian filter, in particular the standard deviation, unnecessary noise can be removed while edge sharpness is maintained, improving the accuracy of the depth map at object boundaries.

Brief Description of the Drawings

FIG. 1 is a diagram of the process of obtaining the depth prior based on the surface normal vector and depth estimation model;

FIG. 2 is a flowchart of neural radiance field training;

FIG. 3 is a flowchart of neural radiance field rendering;

FIG. 4 is a diagram of the monocular depth estimation method based on surface normal vectors and neural radiance fields;

FIG. 5 is a flowchart of monocular depth estimation based on surface normal vectors and neural radiance fields.

Detailed Description

The present invention is further described and illustrated below in conjunction with specific embodiments. The embodiments are merely exemplary of the present disclosure and do not limit its scope. The technical features of the embodiments of the present invention may be combined with one another provided they do not conflict.

As shown in FIG. 4 and FIG. 5, the present invention provides a monocular depth estimation method based on surface normal vectors and neural radiance fields, to solve the problems that prior-art depth estimation methods based on deep learning suffer from ambiguity and cannot handle occlusion and low-texture areas. Specifically, it includes:

1) Obtaining the depth prior based on the surface normal vector and depth estimation model. Depth maps generated by existing depth estimation methods may be discontinuous or broken, whereas surface normal vectors provide the orientation of each pixel in three-dimensional space. This geometric information is integrated into the depth estimation process as a hard constraint, which helps the model make more reasonable depth inferences in areas that lack texture or are occluded. By enforcing consistency between the depth map and the normal vector field, the model can reduce the ambiguity caused by illumination changes or similar materials and improve the geometric accuracy of the depth map. At object boundaries and corners, depth changes are particularly sharp, and surface normal vectors carry rich shape information in such regions. With this information the model can delimit object boundaries better and generate clearer, more accurate edges. A monocular depth estimation model combined with surface normal vectors not only improves the accuracy and robustness of depth estimation but also broadens its range of application across a variety of complex scenes, and is a key step in advancing three-dimensional vision technology. The set of images to be processed, captured by a monocular camera in a multi-view scene, is input into the trained surface normal vector and depth estimation model to obtain the depth prior map set.

As shown in FIG. 1, the process of obtaining the depth prior map based on the surface normal vector and depth estimation model is as follows:

1.1) Depth and surface normal vector estimation. First, the depth estimation model is used as a feature extractor. The input images to be processed, captured by a monocular camera in a multi-view scene, pass through a series of depthwise separable convolutions and pointwise convolutions that gradually reduce the spatial resolution, increase the number of channels, and extract high-level semantic features. At the end of the encoder, several bottleneck layers further compress the features and reduce the computational burden. The decoder upsamples the encoded features back to the original image size while gradually adding detail; it is implemented with upsampling, skip connections that merge in the encoder features, and convolutional layers. The skip connections help preserve local detail and improve reconstruction accuracy. Finally, the decoder outputs a depth map of the same size as the input image, in which each pixel holds the estimated depth at that location. Training uses an MSE loss and an SSIM loss; to promote smoothness and geometric consistency of the depth map, a total variation (TV) regularization smoothing term is added to the loss function. The depth estimation model is trained with supervision on an image dataset with accurate depth labels, and the total loss function is expressed as:

$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{Z}_i - Z_i\right)^2$$

$$\mathcal{L}_{SSIM} = 1 - \frac{(2\mu_{\hat{Z}}\mu_{Z} + c_1)(2\sigma_{\hat{Z}Z} + c_2)}{(\mu_{\hat{Z}}^2 + \mu_{Z}^2 + c_1)(\sigma_{\hat{Z}}^2 + \sigma_{Z}^2 + c_2)}$$

$$\mathcal{L}_{TV} = \frac{1}{N}\sum_{i=1}^{N}\left|\nabla \hat{Z}_i\right|$$

$$\mathcal{L}_{depth} = \lambda_1 \mathcal{L}_{MSE} + \lambda_2 \mathcal{L}_{SSIM} + \lambda_3 \mathcal{L}_{TV}$$

where $\hat{Z}$ is the depth map predicted by the model, $Z$ is the ground-truth depth map, $N$ is the number of samples, $\mu$ is the mean image brightness, $\sigma$ is the standard deviation of the pixel values (with $\sigma_{\hat{Z}Z}$ the covariance of the two maps), $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$, $L$ is the dynamic range of the gray levels, which depends on the image data type, $\nabla \hat{Z}_i$ is the gradient of the depth map at position $i$, and $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters that balance the weights of the loss terms.
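The combined depth loss can be sketched in a few lines of NumPy. This is a minimal illustration rather than the patented implementation: the function name `depth_loss`, the default weights, the constants `k1` and `k2`, and the use of a single global SSIM window (practical SSIM implementations usually average over local windows) are all assumptions made for the example.

```python
import numpy as np

def depth_loss(pred, gt, dynamic_range=1.0, lambdas=(1.0, 1.0, 0.1), k1=0.01, k2=0.03):
    """Weighted sum of MSE, (1 - SSIM) and total-variation terms for one depth map."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Mean squared error over all pixels.
    mse = np.mean((pred - gt) ** 2)

    # Global (single-window) SSIM between prediction and ground truth.
    c1, c2 = (k1 * dynamic_range) ** 2, (k2 * dynamic_range) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = np.mean((pred - mu_p) * (gt - mu_g))
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p**2 + mu_g**2 + c1) * (var_p + var_g + c2)
    )

    # Total variation: mean absolute forward difference along both image axes.
    tv = np.abs(np.diff(pred, axis=0)).mean() + np.abs(np.diff(pred, axis=1)).mean()

    l1, l2, l3 = lambdas  # illustrative weights for the three terms
    return l1 * mse + l2 * (1.0 - ssim) + l3 * tv
```

A perfect prediction of a constant depth map gives a loss of zero, since the MSE and TV terms vanish and SSIM equals one.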

For surface normal vector estimation, features are extracted from the input images to be processed, captured by a monocular camera in a multi-view scene, using a multi-scale feature extraction function. Let the input image be $I$ and the multi-scale feature extraction function be $F_{ms}$; the resulting multi-scale feature map can then be written as $M = F_{ms}(I)$. Based on the multi-scale feature map $M$, a convolutional neural network (CNN) estimates the surface normal vector of each pixel. Let the normal vector prediction function be $G$; then $n = G(M) = \mathrm{CNN}(M)$, where $n$ is the surface normal vector of each pixel of the multi-scale feature map and $\mathrm{CNN}$ is the convolutional neural network used by the surface normal vector estimation module.

1.2) Consistency optimization of the depth map. After the depth estimation model has produced the predicted depth map, the method constructs a geometric consistency optimization model to ensure that the predicted depth map is geometrically consistent with the corresponding surface normal vectors. It is implemented as follows.

For a pixel $p = (u, v)$, let $n(p)$ be its surface normal vector, with coordinates $(n_x, n_y, n_z)$, and $Z(p)$ its estimated depth value. Geometric consistency is expressed through the geometric model of the camera. Assuming a known pinhole camera model with intrinsic matrix $K$, rotation matrix $R$, and translation vector $t$, the three-dimensional point $P$ corresponding to the pixel is obtained by back-projecting the homogeneous pixel coordinates through $K^{-1}$, scaling by the depth $Z(p)$, and applying the rigid transform given by $R$ and $t$.
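The back-projection step can be illustrated as follows. This is a sketch under a stated assumption: the pose convention is not recoverable from the text, so the example takes $(R, t)$ to map camera coordinates to world coordinates; the function name `backproject` is hypothetical.

```python
import numpy as np

def backproject(u, v, depth, K, R, t):
    """Lift pixel (u, v) with depth Z to a 3D point.

    Assumption: (R, t) map camera coordinates to world coordinates,
    i.e. X_world = R @ X_cam + t. Invert the pose first if yours is
    world-to-camera.
    """
    pixel_h = np.array([u, v, 1.0])               # homogeneous pixel coordinates
    x_cam = depth * (np.linalg.inv(K) @ pixel_h)  # point in the camera frame
    return R @ x_cam + t
```

With an identity pose, the principal point at depth 2 maps to (0, 0, 2), which is a quick sanity check for the intrinsics.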

Further, the normal vector of this spatial point in the world coordinate system should agree in direction with the estimated normal vector $n(p)$. To optimize this consistency, the method constructs a loss function $\mathcal{L}_{geo}$, the loss of the geometric consistency optimization module, which minimizes the deviation between the depth gradient and the normal vector so that the estimated depth values agree with the normal vector information. In this loss, $\nabla Z(p)$ is the depth gradient of the predicted depth map $Z$ at pixel position $p$ and reflects the direction of depth change, $n(p)$ is the surface normal vector at the same position, and $\lambda$ is a scalar coefficient that reflects the relationship between the rate of depth change and the normal direction. Minimizing $\mathcal{L}_{geo}$ constrains the depth in the predicted depth map to vary along the surface normal direction, which finally yields the depth prior map and enhances the geometric plausibility of the depth estimate.
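One possible concrete form of such a loss is sketched below. It is an illustration, not the formula of the source: it assumes an orthographic approximation, under which a surface with camera-frame normal $(n_x, n_y, n_z)$ satisfies $\partial Z/\partial x = -n_x/n_z$ and $\partial Z/\partial y = -n_y/n_z$, so the scalar coefficient is taken as $\lambda = -1/n_z$. The function name and the `eps` guard are hypothetical.

```python
import numpy as np

def geometric_consistency_loss(depth, normals, eps=1e-6):
    """Mean squared deviation between the depth gradient and the normal-implied gradient.

    depth:   (H, W) depth map Z.
    normals: (H, W, 3) unit normals (n_x, n_y, n_z) in the camera frame.
    Assumption: orthographic approximation, so grad Z = lambda * (n_x, n_y)
    with lambda = -1 / n_z.
    """
    dz_dy, dz_dx = np.gradient(depth)            # np.gradient returns axis 0 (rows) first
    n_z = normals[..., 2]
    n_z = np.where(np.abs(n_z) < eps, eps, n_z)  # avoid division by zero at grazing angles
    lam = -1.0 / n_z
    res_x = dz_dx - lam * normals[..., 0]
    res_y = dz_dy - lam * normals[..., 1]
    return np.mean(res_x**2 + res_y**2)
```

For a planar depth map $Z = ax + by + c$ paired with its true normal, proportional to $(-a, -b, 1)$, the loss is zero; pairing the same plane with fronto-parallel normals $(0, 0, 1)$ gives $a^2 + b^2$.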

2) Guiding neural radiance field training based on the depth prior. Existing neural radiance fields use layered (hierarchical) sampling. However, when sampling to render a depth map, all sample points along a camera ray that corresponds to a poorly textured pixel predict roughly the same pixel value, so the depth map rendered by existing methods deviates considerably from reality. Once the depth prior map has been obtained, the method uses an adaptive depth prior to guide neural radiance field sampling so that the sampling range is distributed more precisely.

As shown in FIG. 2, the depth-prior-based neural radiance field sampling process is as follows:

The depth prior map set is projected into three-dimensional space according to its depth information, and each depth prior map in the set is transformed and projected into the other camera views and compared with the depth information of those views, yielding the error map of the depth map at each view. The formula is as follows:

$$Z_{i \to j}\begin{bmatrix} u_{i \to j} \\ v_{i \to j} \\ 1 \end{bmatrix} = K\left(R_{i \to j}\, Z_i\, K^{-1}\begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} + t_{i \to j}\right)$$

where $(u_{i \to j}, v_{i \to j})$ are the two-dimensional coordinates obtained by projecting into image $j$ the spatial point that corresponds to the depth value in image $i$; $Z_{i \to j}$ is the depth value at those coordinates after projection into image $j$; $R_{i \to j}$ is the rotation matrix from image $i$ to image $j$; $t_{i \to j}$ is the translation vector from image $i$ to image $j$; $Z_i$ is the corresponding depth value in image $i$ at pixel $(u_i, v_i)$; $K$ is the camera intrinsic matrix; and $e_{i \to j}$ is the depth error, that is, the discrepancy between $Z_{i \to j}$ and the depth prior of image $j$ at $(u_{i \to j}, v_{i \to j})$. The error map expresses how accurate and how trustworthy the depth prior map is.
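The cross-view consistency check can be sketched as below. Several choices here are assumptions made for illustration and are not stated in the source: both views share the intrinsics `K`, `(R_ij, t_ij)` map camera-$i$ coordinates to camera-$j$ coordinates, the depth of view $j$ is read with a nearest-pixel lookup, and the error is measured relative to the depth of view $j$.

```python
import numpy as np

def reprojection_error(depth_i, depth_j, K, R_ij, t_ij):
    """Relative depth error of view i's prior when reprojected into view j.

    Pixels that land outside view j, or behind its camera, are set to NaN.
    """
    h, w = depth_i.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(np.float64)

    pts_i = (np.linalg.inv(K) @ pix) * depth_i.reshape(1, -1)  # back-project in camera i
    pts_j = R_ij @ pts_i + t_ij.reshape(3, 1)                  # move into camera j
    z_ij = pts_j[2]                                            # reprojected depth Z_{i->j}

    proj = K @ pts_j
    valid = z_ij > 1e-8
    safe_z = np.where(valid, z_ij, 1.0)
    u_j = np.rint(proj[0] / safe_z).astype(int)
    v_j = np.rint(proj[1] / safe_z).astype(int)
    valid &= (u_j >= 0) & (u_j < w) & (v_j >= 0) & (v_j < h)

    err = np.full(h * w, np.nan)
    d_j = depth_j[v_j[valid], u_j[valid]]
    err[valid] = np.abs(z_ij[valid] - d_j) / d_j
    return err.reshape(h, w)
```

With an identity relative pose and identical depth maps the error map is zero everywhere, which is a useful first check before feeding in real poses.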

The sampling range at each camera view is determined from the error map of that view: on pixels with small error, sampling is concentrated around the depth prior, whereas on pixels with large error, sampling still follows the original neural radiance field formulation. The sampling range is given by:

$$t_n = D\,\bigl(1 - \mathrm{clamp}(e, \alpha_l, \alpha_h)\bigr), \qquad t_f = D\,\bigl(1 + \mathrm{clamp}(e, \alpha_l, \alpha_h)\bigr)$$

where $D$ is the depth prior map, $e$ is the error map, $\alpha_l$ and $\alpha_h$ are the lower and upper limits that define the range, $t_n$ is the starting sample point of the ray under the guidance of the depth prior map, $t_f$ is the ending sample point of the ray under the guidance of the depth prior map, and $\mathrm{clamp}$ is the clamping function. Because images of the same scene taken from different views may show very different content, which leaves large regions with too low a confidence, the algorithm uses only the four other views with the smallest error values when computing the average.
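A compact sketch of the two operations just described: averaging the four smallest per-pixel errors across the other views, and turning the result into near and far sampling bounds. The function names and the default values of `alpha_low` and `alpha_high` are illustrative assumptions; the source does not give numeric limits.

```python
import numpy as np

def mean_of_smallest_errors(error_maps, k=4):
    """Per-pixel mean of the k smallest errors across views; error_maps is (V, H, W)."""
    k = min(k, error_maps.shape[0])
    return np.sort(error_maps, axis=0)[:k].mean(axis=0)

def sampling_bounds(depth_prior, error, alpha_low=0.05, alpha_high=0.15):
    """Near and far bounds t_n = D(1 - clamp(e)) and t_f = D(1 + clamp(e))."""
    e = np.clip(error, alpha_low, alpha_high)
    return depth_prior * (1.0 - e), depth_prior * (1.0 + e)
```

Clamping from below keeps a minimum search window even where the prior looks perfect, and clamping from above caps the window where the prior is unreliable.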

Training a neural radiance field turns the input images into a continuous, differentiable scene representation from which new views can be rendered from arbitrary viewpoints.

As shown in FIG. 3, the specific training steps are as follows:

2.1) Ray generation. For each pixel in a training image, the ray from the camera center through that pixel is computed from the known camera intrinsics and extrinsics as $r(t) = o + t\,d$, where $o$ is the camera center, $d$ is the ray direction, and $t$ is the ray parameter.
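Ray generation for a single pixel can be sketched as follows. The assumptions are a pinhole camera looking along $+z$ (OpenCV-style axes) and a camera-to-world pose; the function name `pixel_ray` is hypothetical. The direction is deliberately left unnormalized, with unit $z$ component in the camera frame, so that the ray parameter $t$ coincides with depth along the optical axis; normalize it if $t$ should measure Euclidean distance instead.

```python
import numpy as np

def pixel_ray(u, v, K, R_c2w, t_c2w):
    """Return origin o and direction d of the ray r(t) = o + t d through pixel (u, v).

    Assumptions: pinhole camera looking along +z and a camera-to-world
    pose (R_c2w, t_c2w). The direction has unit z in the camera frame.
    """
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # direction in the camera frame
    d_world = R_c2w @ d_cam                           # rotate into the world frame
    origin = np.asarray(t_c2w, dtype=np.float64)      # camera center in the world frame
    return origin, d_world
```

Evaluating `o + t * d` at `t` equal to a pixel's depth then reproduces the back-projected 3D point for that pixel.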

2.2) Volume rendering. On the ray $r$, $N$ points $\{t_k\}_{k=1}^{N}$ are sampled within the sampling range $[t_n, t_f]$ obtained from the depth prior.

Given the position $x_k = r(t_k)$ of each sample point and the ray direction $d$, a multi-layer perceptron neural network $F_\Theta$ predicts the color $c_k$ and the volume density $\sigma_k$ of that point.

The classic volume rendering formula is then used to compute the color $\hat{C}(r)$ of the pixel on the ray and the transmittance $T_k$:

$$\hat{C}(r) = \sum_{k=1}^{N} T_k\left(1 - \exp(-\sigma_k \delta_k)\right)c_k, \qquad T_k = \exp\left(-\sum_{m=1}^{k-1}\sigma_m \delta_m\right)$$

where $\delta_k$ is the distance between adjacent sample points, $T_k$ is the cumulative transmittance from the ray origin to the $k$-th sample point, and $\exp$ is the exponential function.
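The discrete volume rendering sum for one ray can be written in NumPy as follows. This is a minimal sketch: the function names are hypothetical, and treating the last interval as very large is a common convention assumed here rather than something stated in the source.

```python
import numpy as np

def render_weights(sigma, t_vals):
    """Per-sample weights w_k = T_k (1 - exp(-sigma_k delta_k)) along one ray."""
    # Spacing between adjacent samples; the last interval is treated as very large.
    delta = np.append(np.diff(t_vals), 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)
    # T_k = exp(-sum_{m<k} sigma_m delta_m): transmittance accumulated before sample k.
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))
    return trans * alpha

def render_color(sigma, rgb, t_vals):
    """Composite per-sample colors of shape (N, 3) into one pixel color."""
    return render_weights(sigma, t_vals) @ rgb
```

If the density is zero at the first sample and very large at the second, essentially all the weight falls on the second sample and the pixel takes that sample's color.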

2.3) Loss function and optimization. The training objective of the neural radiance field is to minimize the difference between the reconstructed image and the actual image. This method uses the $L_2$ loss as the loss function, i.e.

$$\mathcal{L} = \sum_{r \in \mathcal{R}} \left\lVert \hat{C}(r) - C(r) \right\rVert_2^2$$

where $\hat{C}(r)$ is the predicted color computed with the volume rendering formula, $C(r)$ is the actually observed pixel color, and $\mathcal{R}$ is the set of sampled rays.

The Adam optimizer is used to update the network parameters so as to reduce the loss function $\mathcal{L}$.

2.4) Iterative training. Over many iterations, rays are randomly selected from the dataset for forward and backward propagation, and the network parameters are gradually adjusted until the loss converges.

By continually optimizing the network parameters, neural radiance field training makes the images synthesized from any viewpoint as close as possible to the images actually captured, thereby achieving three-dimensional reconstruction of the scene.

3) Rendering an accurate depth map based on the neural radiance field and Gaussian filtering. The neural radiance field provides a three-dimensional representation of the scene. As with color rendering, the depth map is obtained by volume rendering. Because the depth of a sample equals the value of $t$ in the ray $r(t)$, the rendered depth map $\hat{D}$ is given by:

$$\hat{D}(r) = \sum_{k=1}^{N} T_k\left(1 - \exp(-\sigma_k \delta_k)\right)t_k$$

where $\sigma_k$ is the volume density at the $k$-th sample point, $\delta_k$ is the distance between adjacent sample points, $T_k$ is the cumulative transmittance from the ray origin to the $k$-th sample point, and $\exp$ is the exponential function.
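Depth rendering reuses the volume rendering weights and replaces the per-sample color with the sample position $t_k$. The sketch below is self-contained; the function name and the large final interval are assumptions made for the example.

```python
import numpy as np

def render_depth(sigma, t_vals):
    """Expected depth along one ray: sum_k T_k (1 - exp(-sigma_k delta_k)) t_k."""
    delta = np.append(np.diff(t_vals), 1e10)  # last interval treated as very large
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance accumulated before each sample.
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))
    return float(np.sum(trans * alpha * t_vals))
```

A ray through empty space (all densities zero) renders a depth of zero, so such pixels are usually masked or handled separately downstream.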

The depth map of the three-dimensional reconstruction learned by the neural radiance field may be insufficiently continuous in depth and may contain noise points, so the method applies Gaussian filtering to further filter and optimize it. Because the reconstruction rendered by the neural radiance field should be highly consistent with the real input, any error between the two indicates a defect in the learned radiance field, and the depth information in the corresponding region is usually wrong as well. The error between the Gaussian-filtered rendering and the input can therefore be used for optimization. The filtering optimization function is:

$$\mathcal{L}_{filter} = \left\lVert G_\sigma * \hat{D} - I_{target} \right\rVert_2^2$$

where $G_\sigma$ is the Gaussian filter, $\hat{D}$ is the rendered depth map, $*$ is the convolution operation, $\lVert \cdot \rVert_2^2$ is the square of the $L_2$ norm, i.e. the squared Euclidean distance, $I_{target}$ is the target image, and $\mathcal{L}_{filter}$ is the filtering optimization function. This loss function seeks a Gaussian-filtered rendered image that is as close as possible to the target image.
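The filtering step and its objective can be sketched with plain NumPy. This is an illustration under assumptions: the filter is implemented as a separable Gaussian blur with edge replication, the kernel radius defaults to three standard deviations, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel; the radius defaults to 3 sigma."""
    if radius is None:
        radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x**2) / (2 * sigma**2))
    return k / k.sum()

def gaussian_filter_depth(depth, sigma=1.0):
    """Separable Gaussian blur with edge replication; output shape equals input shape."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(np.asarray(depth, dtype=np.float64), r, mode="edge")
    # Blur along rows, then along columns.
    rows = np.apply_along_axis(lambda m: np.convolve(m, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="valid"), 0, rows)

def filter_objective(depth, target, sigma=1.0):
    """Squared L2 distance between the filtered depth map and the target."""
    return float(np.sum((gaussian_filter_depth(depth, sigma) - target) ** 2))
```

One way to use this is to evaluate `filter_objective` for a few candidate values of `sigma` and keep the one with the smallest value, which mirrors the tuning of the standard deviation mentioned in the text.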

The monocular depth estimation method of the present invention, based on surface normal vectors and neural radiance fields, uses surface normal vectors as auxiliary information for depth estimation, which provides additional geometric constraints. It also uses the volume rendering technique of the neural radiance field (NeRF), which can synthesize images from different viewpoints and provide depth information even in texture-poor areas. It further uses the weights of a Gaussian filter for optimization: by adjusting the parameters of the Gaussian filter, in particular the standard deviation, unnecessary noise can be removed while edge sharpness is maintained. As a result, the present invention performs well on several performance metrics, including mean absolute error (MAE), mean squared error (MSE), and relative error (Rel); it not only improves accuracy markedly but also handles complex scenes, texture-poor areas, and boundary details well. In addition, compared with existing monocular depth estimation methods, the method of the present invention has clear advantages in running efficiency and robustness, which demonstrates its potential and value in practical applications.

基于同样的发明构思，在本实施例中还提供了一种基于表面法向量和神经辐射场的单目深度估计系统，包括：Based on the same inventive concept, a monocular depth estimation system based on surface normal vector and neural radiation field is also provided in this embodiment, including:

模型构建模块，用于构建初始的表面法向量和深度估计模型，并使用带有深度标签的图像数据集进行训练，得到训练好的表面法向量和深度估计模型；A model building module is used to build an initial surface normal vector and depth estimation model, and train it using an image dataset with depth labels to obtain a trained surface normal vector and depth estimation model;

对于系统实施例而言，由于其基本对应于方法实施例，所以相关之处参见方法实施例的部分说明即可，其余模块的实现方法此处不再赘述。以上所描述的系统实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本发明方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。For the system embodiment, since it basically corresponds to the method embodiment, the relevant parts can refer to the partial description of the method embodiment, and the implementation methods of the remaining modules will not be repeated here. The system embodiment described above is only schematic, wherein the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the present invention. Ordinary technicians in this field can understand and implement it without paying creative work.

The system embodiments of the present invention can be applied to any device with data processing capability, such as a computer. The system embodiments can be implemented in software, in hardware, or in a combination of software and hardware. Taking software implementation as an example, the system, as a device in the logical sense, is formed when the processor of the data processing device on which it resides reads the corresponding computer program instructions from non-volatile storage into memory and runs them.

An embodiment of the present invention also provides a terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above monocular depth estimation method based on surface normal vectors and a neural radiance field.

The embodiments described above represent only several implementations of the present invention. Although they are described specifically and in detail, they shall not be construed as limiting the patent scope of the present invention. A person of ordinary skill in the art may make modifications and improvements without departing from the concept of the present invention, and all of these fall within the protection scope of the present invention.

Claims

1. A monocular depth estimation method based on surface normal vectors and a neural radiance field, comprising the steps of:

1) constructing an initial surface normal vector and depth estimation model, and training it on an image dataset with depth labels to obtain a trained surface normal vector and depth estimation model;

2) inputting an image set to be processed, captured by a monocular camera in a multi-view scene, into the trained surface normal vector and depth estimation model to obtain a depth prior map set;

3) inputting the image set to be processed into a neural radiance field, and guiding the neural radiance field to perform three-dimensional reconstruction of the image set to be processed based on the depth prior map set obtained in step 2), to obtain a rendered depth image set;

4) filtering and optimizing the rendered depth image set using Gaussian filtering, and taking the resulting optimized depth image set as the monocular depth estimation result.

2. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 1, wherein the surface normal vector and depth estimation model comprises a depth estimation module, a surface normal vector estimation module, and a geometric consistency optimization module; the depth estimation module is configured to extract depth features from an input image and output a predicted depth map; the surface normal vector estimation module is configured to extract a multi-scale feature map from the input image and output surface normal vectors; and the geometric consistency optimization module is configured to constrain the depth in the predicted depth map, according to the multi-scale feature map, to vary along the surface normal vector direction, finally obtaining a depth prior map.

3. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 2, wherein, during training, the depth estimation module uses a weighted sum of a mean squared error loss, a structural similarity loss, and a variational regularization smoothness loss as the total loss function.

4. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 2, wherein the surface normal vector estimation module extracts a multi-scale feature map from an input image using a multi-scale feature extraction function, and estimates the surface normal vector of each pixel of the multi-scale feature map with a convolutional neural network, expressed as:

$$\mathbf{N} = f_{\mathrm{CNN}}\big(F_{\mathrm{ms}}(I)\big)$$

where $\mathbf{N}$ is the surface normal vector of each pixel of the multi-scale feature map, $f_{\mathrm{CNN}}$ is the convolutional neural network used by the surface normal vector estimation module, $F_{\mathrm{ms}}$ is the multi-scale feature extraction function, and $I$ is the input image.

5. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 2, wherein the loss function of the geometric consistency optimization module causes the depth in the predicted depth map to vary along the surface normal vector direction by minimizing the deviation between the depth gradient of the predicted depth map and the surface normal vector; the loss function is:

$$L_{\mathrm{geo}} = \sum_{p} \left\| \nabla D(p) - \lambda\, \mathbf{n}(p) \right\|^{2}$$

where $\nabla D(p)$ is the depth gradient of the predicted depth map $D$ at pixel position $p$, $\mathbf{n}(p)$ is the surface normal vector at pixel position $p$, $\lambda$ is a scalar coefficient, and $L_{\mathrm{geo}}$ is the loss function of the geometric consistency optimization module.

6. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 1, wherein guiding the neural radiance field to perform three-dimensional reconstruction of the image set to be processed based on the depth prior map set obtained in step 2) comprises the following steps:

3.1 projecting each depth prior map in the depth prior map set obtained in step 2) into three-dimensional space according to its depth information, reprojecting it into the other camera views and comparing it with the depth information of those views, so as to obtain the error maps of that depth prior map for each view, and traversing all depth prior maps in the depth prior map set;

3.2 determining the sampling range of the neural radiance field based on the error maps;

3.3 performing three-dimensional reconstruction of the image set to be processed within the determined sampling range of the neural radiance field, to obtain the rendered depth image set.

7. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 6, wherein determining the sampling range of the neural radiance field based on the error map is expressed as:

$$t_{n} = D\,\big(1 - \mathrm{clamp}(E, \alpha_{l}, \alpha_{h})\big), \qquad t_{f} = D\,\big(1 + \mathrm{clamp}(E, \alpha_{l}, \alpha_{h})\big)$$

where $D$ is the depth prior map, $E$ is the error map, $\alpha_{h}$ and $\alpha_{l}$ are the predefined upper and lower limits of the range, $t_{n}$ is the starting sampling point of the neural radiance field under the guidance of the depth prior map, $t_{f}$ is the terminating sampling point of the neural radiance field under the guidance of the depth prior map, and $\mathrm{clamp}$ is the value-limiting function.

8. The monocular depth estimation method based on surface normal vectors and a neural radiance field according to claim 1, wherein the filtering optimization of the rendered depth image set is performed by minimizing a filtering optimization function of the Gaussian filtering, the filtering optimization function being:

$$F(T) = \left\| G * D_{r} - T \right\|^{2}$$

where $G$ denotes the Gaussian filter, $D_{r}$ denotes the rendered depth map, $*$ denotes the convolution operation, $\|\cdot\|^{2}$ denotes the squared norm, $T$ denotes the target image, that is, the optimized depth image, and $F$ denotes the filtering optimization function.

9. A monocular depth estimation system based on surface normal vectors and a neural radiance field, comprising:

a model construction module, configured to construct an initial surface normal vector and depth estimation model, and to train it on an image dataset with depth labels to obtain a trained surface normal vector and depth estimation model;

a depth prior map acquisition module, configured to input the image set to be processed into the trained surface normal vector and depth estimation model to obtain a depth prior map set;

a three-dimensional reconstruction module, configured to input the image set to be processed into a neural radiance field, and to guide the neural radiance field to perform three-dimensional reconstruction of the image set to be processed based on the depth prior map set obtained by the depth prior map acquisition module, to obtain a rendered depth image set;

and a Gaussian filtering module, configured to filter and optimize the rendered depth image set using Gaussian filtering, and to take the resulting optimized depth image set as the monocular depth estimation result.

10. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the monocular depth estimation method based on surface normal vectors and a neural radiance field according to any one of claims 1 to 8.
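As an illustration of the depth-guided sampling range recited in claims 6 and 7, the following minimal sketch computes per-pixel near and far sampling bounds from a depth prior map and an error map. It rests on stated assumptions rather than on the claimed implementation: the error map is taken to hold relative errors, the bounds are taken to be the prior depth scaled by one minus or one plus the clamped error, and the function names (clamp, sampling_bounds, sample_depths) and the default limits 0.05 and 0.15 are hypothetical.

```python
def clamp(value, low, high):
    """Limit value to the closed interval [low, high]."""
    return max(low, min(value, high))

def sampling_bounds(depth_prior, error_map, alpha_low=0.05, alpha_high=0.15):
    """Return per-pixel (t_near, t_far) maps around the depth prior.

    Assumed form (not confirmed by the source):
        t_near = D * (1 - clamp(E, alpha_low, alpha_high))
        t_far  = D * (1 + clamp(E, alpha_low, alpha_high))
    A small error gives a narrow range; a large error gives a wider one.
    """
    t_near, t_far = [], []
    for d_row, e_row in zip(depth_prior, error_map):
        near_row, far_row = [], []
        for d, e in zip(d_row, e_row):
            margin = clamp(e, alpha_low, alpha_high)
            near_row.append(d * (1.0 - margin))
            far_row.append(d * (1.0 + margin))
        t_near.append(near_row)
        t_far.append(far_row)
    return t_near, t_far

def sample_depths(near, far, num_samples):
    """Evenly spaced sample depths along one ray between its near and far bounds."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    step = (far - near) / (num_samples - 1)
    return [near + i * step for i in range(num_samples)]

# Hypothetical 1 x 2 depth prior (metres) with a reliable and an unreliable pixel.
depth_prior = [[2.0, 4.0]]
error_map = [[0.01, 0.50]]  # relative errors; clamped to 0.05 and 0.15

t_near, t_far = sampling_bounds(depth_prior, error_map)
ray_samples = sample_depths(t_near[0][0], t_far[0][0], num_samples=5)
```

Under these assumptions the reliable pixel is sampled in a range of about 1.9 m to 2.1 m and the unreliable pixel in a range of about 3.4 m to 4.6 m, so samples concentrate near the prior depth wherever the prior is consistent across views.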