CN116883477A - A monocular depth estimation method - Google Patents
A monocular depth estimation method
- Publication number
- CN116883477A (application CN202310990370.3A)
- Authority
- CN
- China
- Prior art keywords
- loss
- pose
- network
- depth estimation
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 37
- 238000012549 training Methods 0.000 claims abstract description 15
- 238000012360 testing method Methods 0.000 claims abstract description 13
- 238000005070 sampling Methods 0.000 claims description 4
- 230000000694 effects Effects 0.000 abstract description 12
- 230000006870 function Effects 0.000 description 10
- 230000006872 improvement Effects 0.000 description 6
- 238000010586 diagram Methods 0.000 description 2
- 230000009466 transformation Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000019771 cognition Effects 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/529—Depth or shape recovery from texture
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Geometry (AREA)
- Computer Graphics (AREA)
- Image Analysis (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of computer vision, and in particular to a monocular depth estimation method.
Background
Monocular depth estimation is currently a research focus in computer vision, with wide applications in intelligent driving, robot motion and three-dimensional perception. Traditional algorithms such as SfM are difficult to apply in many tasks; as deep learning has become the mainstream approach, results obtained with a single monocular camera keep improving, and a monocular setup also has the advantages of low cost and small size.
Learning-based methods fall into two categories: supervised and self-supervised. Supervised learning requires large and diverse datasets paired with ground-truth depth labels; collecting such data is difficult, and the LiDAR and other sensors involved are expensive. Self-supervised learning obtains depth and pose far more easily: a monocular image sequence is taken as input and a suitable network architecture unifies depth estimation and pose estimation in a single framework, with the supervision signal coming mainly from view synthesis.
However, in terms of accuracy, self-supervised methods still fall behind supervised methods. The main reason is that when the photometric feature loss is used as the supervision signal, features cannot be extracted effectively in regions with strong illumination or weak texture.
Summary of the Invention
To solve the above problems, the present invention provides a monocular depth estimation method that can capture the texture details and edge cues in an image.
To achieve the above objects, the present invention is realized through the following technical solutions:
The present invention is a monocular depth estimation method, comprising the following steps:
using the KITTI dataset, split into a training set and a test set;
constructing a network structure for monocular depth estimation, comprising a depth network and a pose network, wherein the depth network adopts a U-Net structure with an encoder and a decoder, and the pose network comprises a pose encoder and a pose decoder;
reconstructing the target image with an image reconstruction model: the target image It(p) is passed through the depth network to output the corresponding depth map Dt; the relative pose Tt→t′ generated by the pose network for the corresponding images, together with the camera intrinsics K, is combined with the image reconstruction model to obtain the reconstructed image Ît; the photometric feature loss of Ît is computed, and a texture feature gradient is introduced to obtain the texture feature loss function;
performing monocular depth estimation training with a multi-fusion loss function, combining the texture feature loss, the photometric feature loss and pixel smoothness, as the total loss, and using the trained model to evaluate the test set data.
A further improvement of the present invention is that coordinate attention is introduced at the output of the encoder and connected to the decoder.
A further improvement of the present invention is that the encoder of the depth network adopts a ResNet-18 structure with the fully connected layer removed, in which the deepest feature map passes through five downsampling stages and the resolution of the input image is reduced to 1/32; the decoder comprises five 3×3 convolutional layers, each followed by a bilinear upsampling layer.
A further improvement of the present invention is that the pose encoder of the pose network adopts the ResNet-18 structure.
A further improvement of the present invention is that the image reconstruction model is expressed as:
Ît(p) = It′⟨p̂⟩, p̂ = proj(Dt, Tt→t′, K) (1)
where Ît is the reconstructed image, p̂ is the reconstructed pixel, proj is the projection between 2D and 3D, ⟨⟩ is the sampling operator, It is the target image frame, It′ is the adjacent source frame, Tt→t′ is the relative pose, K is the camera intrinsics, and Dt is the depth map.
A further improvement of the present invention is that the photometric feature loss of the reconstructed image Ît is expressed as:
LphRec = Σp l(It(p), Ît(p)) (2)
where LphRec is the photometric feature loss, l(·,·) measures the per-pixel photometric difference, It(p) is the target image, and p is a pixel.
Introducing the texture feature gradient into the expression of the photometric feature loss of the reconstructed image Ît gives:
LfmRec = Σp l(φt(p), φ̂t(p)) (3)
where φ̂t(p) is the reconstructed texture feature and φt(p) is the texture feature of the target image.
A further improvement of the present invention is that the multi-fusion loss function is expressed as:
Ltotal = λLsmooth + βLphRec + γLfmRec (4)
where LfmRec is given by formula (3), and:
LphRec = (α/2)(1 − SSIM(It, Ît)) + (1 − α)‖It − Ît‖1
Lsmooth = |∂x dt| e^(−|∂x It|) + |∂y dt| e^(−|∂y It|)
In the formulas, Ltotal is the multi-fusion loss, LfmRec is the texture feature loss, LphRec is the photometric feature loss and Lsmooth is the pixel smoothness; α is taken as 0.85, and λ, β and γ are the weights of the pixel smoothness, photometric feature loss and texture feature loss respectively, with λ = 1 and β and γ taken as 0.001; x and y denote pixel coordinates, and dt is the predicted disparity.
The beneficial effects of the present invention are as follows: on the basis of the photometric feature loss, the present invention introduces a texture feature loss to strengthen depth estimation in textureless or weakly textured regions. The method generalizes well across data domains: although the monocular depth estimation network is trained on the KITTI dataset, it performs well on KITTI, Cityscapes, Make3D and other datasets. The method also introduces coordinate attention, strengthening channel and positional attention and enhancing boundary features.
Description of the Drawings
Figure 1 is a schematic diagram of the monocular depth estimation network structure in an embodiment of the present invention;
Figure 2 is a flow chart of the photometric feature loss in an embodiment of the present invention;
Figure 3 is a schematic diagram of coordinate attention in an embodiment of the present invention;
Figure 4 compares the method of an embodiment of the present invention with existing depth estimation methods on the KITTI dataset;
Figure 5 compares the method of an embodiment of the present invention with existing depth estimation methods on the Cityscapes dataset;
Figure 6 compares the method of an embodiment of the present invention with existing depth estimation methods on the Make3D dataset.
Detailed Description
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be noted that the embodiments described here are only used to explain the present invention and are not intended to limit it.
A monocular depth estimation method of the present invention comprises the following steps:
using the KITTI dataset, split into a training set and a test set;
constructing a network structure for monocular depth estimation, comprising a depth network and a pose network, wherein the depth network adopts a U-Net structure with an encoder and a decoder, and the pose network comprises a pose encoder and a pose decoder;
reconstructing the target image with the image reconstruction model: the target image It(p) is passed through the depth network to output the corresponding depth map Dt, and the relative pose Tt→t′ generated by the pose network for the corresponding images is combined with the camera intrinsics K and the image reconstruction model to obtain the reconstructed image Ît;
computing the photometric feature loss of the reconstructed image Ît, and introducing the texture feature gradient to obtain the texture feature loss function;
performing monocular depth estimation training with the multi-fusion loss function combining texture feature loss, photometric feature loss and pixel smoothness as the total loss, and using the trained model to evaluate the test set data.
The encoder of the depth network (DepthNet) of the present invention adopts the ResNet-18 structure, and coordinate attention can be introduced at its output and connected to the decoder. Coordinate attention improves the expressiveness of the features learned by the network. Any intermediate feature tensor X = [x1, x2, …, xC] ∈ R^(C×H×W) can be taken as input, and the output is an augmented tensor Y = [y1, y2, …, yC] of the same size as X. The coordinate attention structure of this embodiment is shown in Figure 3: given an input X, pooling kernels with spatial extents (H, 1) and (1, W) are used to encode each channel along the horizontal and vertical coordinates respectively, and the final output Y again has size C×H×W. The output of the c-th channel at height h can be written as:
z^h_c(h) = (1/W) Σ_{0≤i<W} xc(h, i) (1)
Similarly, the output of the c-th channel at width w is:
z^w_c(w) = (1/H) Σ_{0≤j<H} xc(j, w) (2)
where C is the number of image channels, H is the image height, W is the image width, xc(h, i) and xc(j, w) are the input features along the horizontal and vertical directions, and z^h_c and z^w_c are the resulting direction-aware responses.
The above two transformations aggregate features along the two spatial directions respectively, producing a pair of direction-aware feature maps. The outputs of the X-direction and Y-direction average pooling are concatenated and passed through a 1×1 convolution, yielding an intermediate feature map that encodes the spatial information in the horizontal and vertical directions. This embodiment then splits the feature map along the spatial dimension into two separate tensors, processes each of them, and finally obtains an output of the same dimensions carrying both spatial and channel attention.
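As an illustration of the coordinate attention block described above, the following is a minimal PyTorch sketch following the published coordinate attention design: directional average pooling, concatenation, a shared 1×1 convolution, a split back into the two directions, and per-direction sigmoid gates. The bottleneck width (`reduction`) and the BatchNorm/ReLU encoding are illustrative assumptions, not values specified in the patent.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate attention block (sketch).

    Pools the C x H x W input along W and along H, encodes the two
    direction-aware descriptors with a shared 1x1 convolution, splits them
    back, and applies per-direction sigmoid gates to the input.
    """
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)            # bottleneck width (assumed)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (H, 1): pool over width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (1, W): pool over height
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        x_h = self.pool_h(x)                           # N x C x H x 1
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # N x C x W x 1
        y = torch.cat([x_h, x_w], dim=2)               # concatenate along the spatial dim
        y = self.act(self.bn1(self.conv1(y)))          # shared 1x1 conv encoding
        y_h, y_w = torch.split(y, [h, w], dim=2)       # split back into two tensors
        y_w = y_w.permute(0, 1, 3, 2)                  # N x mid x 1 x W
        a_h = torch.sigmoid(self.conv_h(y_h))          # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w))          # attention along width
        return x * a_h * a_w                           # same C x H x W size as the input
```

Because the two gates are broadcast over the input, the output keeps the same C×H×W size as X, as required above.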
The entire monocular depth estimation network of the present invention adopts a multi-scale design, extracting photometric features at different scales to address problems such as "artifacts". To keep the overall network structure compact, the pose encoder of the pose network (PoseNet) also uses ResNet-18, and the pose decoder outputs the relative poses between three consecutive image frames. Specifically, the encoder of the depth network uses a ResNet-18 structure with the fully connected layer removed; its deepest feature map passes through five downsampling stages, reducing the resolution of the input image to 1/32. The decoder contains five 3×3 convolutional layers, each followed by a bilinear upsampling layer. The multi-scale feature maps of the decoder are used to generate multi-scale reconstructed images, where the feature map at each scale is further passed through a 3×3 convolution and a sigmoid function for image reconstruction.
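The depth network described above could be sketched as follows, reusing the CoordinateAttention module from the previous sketch: a ResNet-18 encoder whose fully connected layer is simply never used (five downsampling stages, 1/32 resolution at the deepest map), coordinate attention on the encoder output, and a decoder of five 3×3 convolutions each followed by bilinear upsampling, with a 3×3 convolution and sigmoid producing a disparity map at each scale. The decoder channel widths and the exact skip-connection wiring are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class DepthNet(nn.Module):
    """Sketch of the U-Net style depth network: ResNet-18 encoder + 5-stage decoder."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)          # the FC layer is simply never used
        self.enc_stages = nn.ModuleList([
            nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu),  # 1/2
            nn.Sequential(resnet.maxpool, resnet.layer1),          # 1/4
            resnet.layer2,                                         # 1/8
            resnet.layer3,                                         # 1/16
            resnet.layer4,                                         # 1/32 (deepest map)
        ])
        enc_ch = [64, 64, 128, 256, 512]                # ResNet-18 stage widths
        dec_ch = [256, 128, 64, 32, 16]                 # assumed decoder widths
        dec_in = [512, 256 + 256, 128 + 128, 64 + 64, 32 + 64]  # previous output + skip
        self.attn = CoordinateAttention(enc_ch[-1])     # block sketched above
        self.dec_convs = nn.ModuleList(
            [nn.Conv2d(i, o, 3, padding=1) for i, o in zip(dec_in, dec_ch)])
        self.disp_heads = nn.ModuleList(
            [nn.Conv2d(c, 1, 3, padding=1) for c in dec_ch])  # multi-scale disparity heads

    def forward(self, x):
        feats = []
        for stage in self.enc_stages:                   # five downsampling stages
            x = stage(x)
            feats.append(x)
        y = self.attn(feats[-1])                        # coordinate attention on encoder output
        disps = []
        for i, conv in enumerate(self.dec_convs):
            y = F.relu(conv(y))                                    # 3x3 convolution
            disps.append(torch.sigmoid(self.disp_heads[i](y)))     # sigmoid disparity per scale
            y = F.interpolate(y, scale_factor=2, mode="bilinear",
                              align_corners=False)                 # bilinear upsampling
            if i < 4:
                y = torch.cat([y, feats[3 - i]], dim=1)            # U-Net skip connection
        return disps                                    # disparities from coarse to fine
```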
As shown in Figure 1, in the overall monocular depth estimation network the input image source Is ∈ {It−1, It, It+1} consists of three adjacent image frames, with It as the target frame. It is fed into the depth network, which outputs the corresponding depth map Dt; the image source Is is fed into the pose network, which outputs the corresponding pose data, where the poses of It−1 and It+1 estimated relative to It are denoted Tt→t′.
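A corresponding sketch of the pose branch is given below. The patent states only that the pose encoder is a ResNet-18 and that the decoder outputs the relative poses between the three consecutive frames; the 6-channel paired-frame input, the 6-DoF (rotation + translation) output and the small-motion scaling are common conventions assumed here rather than details taken from the patent. In the pipeline of Figure 1 this module would be applied to (It, It−1) and (It, It+1) to obtain the two relative poses Tt→t′.

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet(nn.Module):
    """Sketch of the pose network: ResNet-18 encoder + small convolutional decoder."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Accept a pair of RGB frames stacked along the channel axis (assumed convention).
        resnet.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 6, 1),                        # 3 rotation + 3 translation values
        )

    def forward(self, frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame_a, frame_b], dim=1)         # N x 6 x H x W
        x = self.encoder(x)                              # N x 512 x H/32 x W/32
        pose = self.decoder(x).mean(dim=(2, 3))          # global average -> N x 6
        return 0.01 * pose                               # small-motion scaling (assumed)
```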
To generate an effective depth map, the target image It(p) is passed through the depth network to output the corresponding depth map Dt; the relative pose Tt→t′ generated by the pose network for the corresponding images, together with the camera intrinsics K, is combined with the image reconstruction model to obtain the reconstructed image Ît. The image reconstruction model is expressed as:
Ît(p) = It′⟨p̂⟩, p̂ = proj(Dt, Tt→t′, K) (3)
where Ît is the reconstructed image, p̂ is the reconstructed pixel, proj is the projection between 2D and 3D, ⟨⟩ is the sampling operator, It is the target image frame, It′ is the adjacent source frame, Tt→t′ is the relative pose, K is the camera intrinsics, and Dt is the depth map.
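A sketch of formula (3) in code: pixels of the target view are back-projected with Dt, rigidly transformed by Tt→t′, projected with the intrinsics K, and used to bilinearly sample the source frame, with torch.nn.functional.grid_sample playing the role of the sampling operator ⟨⟩. The 4×4 homogeneous pose matrix and the grid normalization conventions are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def reconstruct_target(img_src, depth_t, T_t_to_s, K):
    """View synthesis of formula (3): I_hat_t(p) = I_t'<proj(D_t, T_{t->t'}, K)>.

    img_src:  N x 3 x H x W  adjacent source frame I_t'
    depth_t:  N x 1 x H x W  depth map D_t of the target frame
    T_t_to_s: N x 4 x 4      relative pose T_{t->t'} (homogeneous, assumed)
    K:        N x 3 x 3      camera intrinsics
    """
    n, _, h, w = depth_t.shape
    device = depth_t.device
    # Pixel grid of the target view in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device, dtype=torch.float32),
        torch.arange(w, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)
    # Back-project with D_t, transform with T, project with K.
    cam = (torch.inverse(K) @ pix.expand(n, -1, -1)) * depth_t.reshape(n, 1, -1)
    cam = torch.cat([cam, torch.ones(n, 1, h * w, device=device)], dim=1)
    src = K @ (T_t_to_s @ cam)[:, :3, :]
    src = src[:, :2, :] / (src[:, 2:3, :] + 1e-7)
    # Normalize to [-1, 1]: grid_sample acts as the sampling operator <>.
    gx = 2.0 * src[:, 0, :] / (w - 1) - 1.0
    gy = 2.0 * src[:, 1, :] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(n, h, w, 2)
    return F.grid_sample(img_src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```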
The photometric feature loss is commonly used in depth estimation but performs poorly in low-texture regions; on the basis of the photometric feature loss, the present invention introduces a texture feature loss to improve feature extraction in low-texture regions. The role of the photometric feature loss in the monocular depth estimation network is shown in Figure 2: the input is the target image It(p); the depth network generates a disparity map (depth map Dt), which is combined with the pose Tt→t′ output by the pose network for reconstruction; and the reconstructed image Ît is generated after sampling. The corresponding photometric feature loss is:
LphRec = Σp l(It(p), Ît(p)) (4)
where l(·,·) measures the per-pixel photometric difference and p is a pixel.
Under normal depth estimation and camera motion the photometric feature loss works well, but in low-texture or even textureless regions, if the photometric differences are similar or equal, it cannot provide effective supervision. Formulas (5) and (6) further analyze the gradients of the photometric feature loss with respect to the depth D(p) and the ego-motion M:
∂LphRec/∂D(p) = Σp (∂l/∂Ît(p̂)) (∂Ît(p̂)/∂p̂) (∂p̂/∂D(p)) (5)
∂LphRec/∂M = Σp (∂l/∂Ît(p̂)) (∂Ît(p̂)/∂p̂) (∂p̂/∂M) (6)
It can be seen from the above formulas that the depth and pose gradients depend on the image gradient ∂Ît(p̂)/∂p̂, which is zero in textureless regions, so formulas (5) and (6) become zero. This shows that relying on the photometric error alone cannot support multi-view reconstruction well, so the present invention introduces the texture feature gradient ∇φ into formula (4):
LfmRec = Σp l(φt(p), φ̂t(p)) (7)
where φ̂t(p) is the reconstructed texture feature and φt(p) is the texture feature of the target image.
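A toy illustration of the vanishing-gradient argument above: on a constant-intensity (textureless) patch the image gradient is numerically zero, so the chain-rule factors in formulas (5) and (6) vanish and the photometric loss provides no signal for depth or pose.

```python
import torch

flat = torch.full((1, 1, 8, 8), 0.5)         # constant-intensity (textureless) patch
gx = flat[:, :, :, 1:] - flat[:, :, :, :-1]  # horizontal image gradient
gy = flat[:, :, 1:, :] - flat[:, :, :-1, :]  # vertical image gradient
print(gx.abs().max(), gy.abs().max())        # both are exactly zero
```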
The multi-fusion loss function in this embodiment is as follows:
According to formula (7), the texture feature metric loss is:
LfmRec = Σp l(φt(p), φ̂t(p)) (8)
The photometric error LphRec is generated using L1 and SSIM, i.e.:
LphRec = (α/2)(1 − SSIM(It, Ît)) + (1 − α)‖It − Ît‖1 (9)
with α = 0.85. At the same time, an edge-aware disparity smoothness loss is computed on the generated depth map:
Lsmooth = |∂x dt| e^(−|∂x It|) + |∂y dt| e^(−|∂y It|) (10)
The pixel smoothness, photometric feature loss and texture feature loss are combined into the overall loss function:
Ltotal = λLsmooth + βLphRec + γLfmRec (11)
where λ, β and γ are the weights of the pixel smoothness, photometric feature loss and texture feature loss respectively.
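The loss terms of formulas (8) to (11) could be sketched as below: an SSIM/L1 photometric term with α = 0.85, an edge-aware disparity smoothness term, an L1-style discrepancy between target and reconstructed texture features, and the weighted total with the weights λ = 1 and β = γ = 0.001 stated above. The simplified 3×3 SSIM, the mean normalization of the disparity and the L1 form of the feature term are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified SSIM over 3x3 neighbourhoods."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_loss(target, recon, alpha=0.85):
    """L_phRec = alpha/2 * (1 - SSIM) + (1 - alpha) * |I_t - I_hat|   (formula (9))."""
    l1 = torch.abs(target - recon).mean(1, keepdim=True)
    s = ((1 - ssim(target, recon)) / 2).mean(1, keepdim=True)
    return (alpha * s + (1 - alpha) * l1).mean()

def smoothness_loss(disp, img):
    """L_smooth: edge-aware smoothness on the (mean-normalized, assumed) disparity (formula (10))."""
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = torch.abs(d[:, :, :, 1:] - d[:, :, :, :-1])
    dy = torch.abs(d[:, :, 1:, :] - d[:, :, :-1, :])
    ix = torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1]).mean(1, keepdim=True)
    iy = torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :]).mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

def feature_loss(feat_t, feat_recon):
    """L_fmRec: discrepancy between target and reconstructed texture features (formula (8), L1 assumed)."""
    return torch.abs(feat_t - feat_recon).mean()

def total_loss(target, recon, disp, feat_t, feat_recon,
               lam=1.0, beta=0.001, gamma=0.001):
    """L_total = lam * L_smooth + beta * L_phRec + gamma * L_fmRec   (formula (11))."""
    return (lam * smoothness_loss(disp, target)
            + beta * photometric_loss(target, recon)
            + gamma * feature_loss(feat_t, feat_recon))
```

A single abbreviated training step could then tie the sketches together; `I_t`, `I_s`, `K`, `pose_vec_to_matrix` and `feature_extractor` below are placeholders that are not defined in the patent.

```python
depth_net, pose_net = DepthNet(), PoseNet()            # sketches from above
disp = depth_net(I_t)[-1]                              # finest-scale disparity
depth = 1.0 / (disp + 1e-7)                            # assumed disparity-to-depth mapping
T = pose_vec_to_matrix(pose_net(I_t, I_s))             # hypothetical 6-DoF -> 4x4 helper
I_rec = reconstruct_target(I_s, depth, T, K)           # view synthesis (formula (3))
feat_t, feat_rec = feature_extractor(I_t), feature_extractor(I_rec)  # hypothetical texture features
loss = total_loss(I_t, I_rec, disp, feat_t, feat_rec)
loss.backward()
```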
The input image size in this embodiment is 640×192. The KITTI dataset is used to train the monocular depth estimation network, with 39,810 training frames and 4,242 test frames, and monocular (M) training is used for comparison. Comparative experiments show that Monodepth2 trained with the same parameters reaches an Abs Rel of 0.120. The results show that even with limited computing resources, the evaluation results of the method of the present invention remain comparatively good. The training results are shown in Table 1:
Table 1: Comparison of training results on the KITTI dataset
Ours (Feat) denotes the monocular depth estimation network of this embodiment with only the texture feature loss; Ours (coor+Feat) denotes the monocular depth estimation network of this embodiment that introduces both the texture feature loss and coordinate attention.
Figure 4 shows the test results on the KITTI data. The comparison covers the classic Monodepth2, recent methods such as Lite-Mono and R-MSFM, and the two variants of the present invention: Ours (Feat), which uses only texture features, and Ours (coor+Feat), which combines texture features with coordinate attention. The experimental results show that the combined texture-attention method performs better on low-texture details, specifically: in column (a) the outlines of the trees are clearer; in column (b) the thickness of the sign poles and even the reflective signs are clearly rendered; column (c) shows better handling of a strongly lit wall corner; and column (d) shows better handling of sunlit shrubs and railings. Photometric inconsistency often makes the apparent distances of objects near the camera inconsistent in the depth map, failing to reflect the accurate object-to-camera distance (see the cyclist in column (b)); the combined texture-attention method of the present invention reflects the actual distance more faithfully.
Figure 5 shows the test results on the Cityscapes dataset. The KITTI data are mostly static scenes captured by a moving camera and lack moving objects, whereas real road environments contain many vehicles and pedestrians. Unlike KITTI, the Cityscapes dataset contains a large number of pedestrians and moving vehicles, so the model trained on KITTI is used to test on Cityscapes to verify the effectiveness of the network of the present invention on a different dataset. In Figure 5, the distant lamp posts, the shape details of the trees and the recognition of pedestrians in the input become the basis for judging the different methods. In column (a) the method of the present invention clearly shows the distant lamp posts; in column (b) the surrounding lamp posts are all clearly rendered; in column (c) only the method of the present invention produces a clear image of the tree texture details; and in column (d) the method of the present invention renders both the surrounding lamp posts and the distant sky. The benefit of the method of the present invention in Figure 5 is evident.
Figure 6 shows the test results on the Make3D dataset, again testing the network trained on KITTI. As shown in Figure 6, the combined texture-attention method of the present invention performs better on the details of the depth map, especially in strongly lit areas, where the outlines and details of houses and trees are better reflected.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310990370.3A CN116883477A (en) | 2023-08-08 | 2023-08-08 | A monocular depth estimation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310990370.3A CN116883477A (en) | 2023-08-08 | 2023-08-08 | A monocular depth estimation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116883477A true CN116883477A (en) | 2023-10-13 |
Family
ID=88264585
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310990370.3A Pending CN116883477A (en) | 2023-08-08 | 2023-08-08 | A monocular depth estimation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116883477A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118397063A (en) * | 2024-04-22 | 2024-07-26 | 中国矿业大学 | Self-supervised monocular depth estimation method and system for unmanned driving of coal mine monorail crane |
-
2023
- 2023-08-08 CN CN202310990370.3A patent/CN116883477A/en active Pending
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118397063A (en) * | 2024-04-22 | 2024-07-26 | 中国矿业大学 | Self-supervised monocular depth estimation method and system for unmanned driving of coal mine monorail crane |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ma et al. | Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera | |
CN110910447B (en) | A Visual Odometry Method Based on Dynamic and Static Scene Separation | |
Vaudrey et al. | Differences between stereo and motion behaviour on synthetic and real-world stereo sequences | |
CN106327532B (en) | A kind of three-dimensional registration method of single image | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
US20200211196A1 (en) | Method and Apparatus for Performing Segmentation of an Image | |
CN110047144A (en) | A kind of complete object real-time three-dimensional method for reconstructing based on Kinectv2 | |
CN101765022B (en) | A Depth Representation Method Based on Optical Flow and Image Segmentation | |
Lopes et al. | A survey on RGB-D datasets | |
CN106803267A (en) | Indoor scene three-dimensional rebuilding method based on Kinect | |
CN101877143B (en) | Three-dimensional scene reconstruction method of two-dimensional image group | |
CN103530907B (en) | Complicated three-dimensional model drawing method based on images | |
CN108389226A (en) | A kind of unsupervised depth prediction approach based on convolutional neural networks and binocular parallax | |
CN101907459A (en) | Real-time 3D Rigid Object Pose Estimation and Ranging Method Based on Monocular Video | |
CN110969653B (en) | Image depth estimation method based on deep learning and Fourier domain analysis | |
Xu et al. | Survey of 3D modeling using depth cameras | |
CN111325782A (en) | Unsupervised monocular view depth estimation method based on multi-scale unification | |
CN111582232A (en) | A SLAM method based on pixel-level semantic information | |
US20240119671A1 (en) | Systems and methods for face asset creation and models from one or more images | |
Gu et al. | Ue4-nerf: Neural radiance field for real-time rendering of large-scale scene | |
CN117726906A (en) | A LiDAR-Free monocular 3D target detection knowledge distillation method | |
CN114996814A (en) | Furniture design system based on deep learning and three-dimensional reconstruction | |
CN115039137B (en) | Related method for rendering virtual object based on brightness estimation and related product | |
CN117990088A (en) | Dense visual SLAM method and system using three-dimensional Gaussian back end representation | |
CN116452752A (en) | Intestinal wall reconstruction method combined with monocular dense SLAM and residual network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |