CN107945265A

CN107945265A - Real-time dense monocular SLAM method and systems based on on-line study depth prediction network

Info

Publication number: CN107945265A
Application number: CN201711227295.6A
Authority: CN
Inventors: 杨欣; 罗鸿城; 高杨; 吴宇豪
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2017-11-29
Filing date: 2017-11-29
Publication date: 2018-04-20
Anticipated expiration: 2037-11-29
Also published as: CN107945265B

Abstract

The invention discloses a real-time dense monocular SLAM method based on the online learning depth prediction network: the camera pose of the key frame is obtained by minimizing the photometric error optimization of the high gradient point, and the depth of the high gradient point is predicted by the triangulation method to obtain the current The semi-dense map of the frame; select the online training picture pair, use the block-by-block stochastic gradient descent method to train and update the CNN network model online, and use the trained CNN network model to perform depth prediction on the current frame picture to obtain a dense map; according to the current frame. Performing depth scale regression on the semi-dense map and the predicted dense map to obtain the absolute scale factor of the depth information of the current frame; using the NCC score voting method to select the predicted depth values of each pixel of the current frame according to the two projection results to obtain a predicted depth map, And Gaussian fusion is performed on the predicted depth map to obtain a final depth map. The present invention also provides a corresponding real-time dense monocular SLAM system based on online learning depth prediction network.

Description

Real-time Dense Monocular SLAM Method and System Based on Online Learning Deep Prediction Network

技术领域technical field

本发明属于计算机视觉三维重建技术领域，更具体地，涉及一种基于在线学习深度预测网络的实时稠密单目SLAM方法与系统。The invention belongs to the technical field of computer vision three-dimensional reconstruction, and more specifically relates to a real-time dense monocular SLAM method and system based on an online learning depth prediction network.

背景技术Background technique

即时定位与地图重建技术(Simultaneous Localization And Mapping，SLAM)可以实时的预测传感器的位姿并且重建出周围环境的3D地图，因此在无人机避障以及增强现实等领域发挥着重要的作用。其中，仅仅依靠单个摄像头作为输入传感器的SLAM系统被称为单目SLAM系统。单目SLAM具有低功耗、硬件门槛低以及操作简单等特性，被研究人员广泛使用。但是，现有流行的单目SLAM系统，无论是基于特征方法的PTAM(Parallel TrackingAnd Mapping For Small AR Workspaces)和ORB-SLAM(Orb-slam:AVersatile AndAccurate Monocular Slam System)，还是采用直接法的LSD-SLAM(Lsd-slam:Large-scaleDirect Monocular Slam)，都存在两个主要的问题：(1)只能构建出场景的稀疏或者半稠密的地图，因为只有少数关键点或是高梯度点的深度可以计算出来；(2)具有尺度不确定性，存在尺度飘移的现象。Simultaneous Localization And Mapping (SLAM) technology can predict the pose of the sensor in real time and reconstruct a 3D map of the surrounding environment, so it plays an important role in the fields of UAV obstacle avoidance and augmented reality. Among them, a SLAM system that only relies on a single camera as an input sensor is called a monocular SLAM system. Monocular SLAM has the characteristics of low power consumption, low hardware threshold and simple operation, and is widely used by researchers. However, the existing popular monocular SLAM systems, whether they are feature-based PTAM (Parallel Tracking And Mapping For Small AR Workspaces) and ORB-SLAM (Orb-slam: AVersatile AndAccurate Monocular Slam System), or LSD- SLAM (Lsd-slam: Large-scale Direct Monocular Slam) has two main problems: (1) It can only construct a sparse or semi-dense map of the scene, because only a few key points or the depth of high gradient points can be (2) It has scale uncertainty, and there is a phenomenon of scale drift.

近几年，用于单目图片深度估计的深度卷积神经网络(Convolutional NeuralNetwork，CNN)已经取得了巨大的进步，其主要原理是在大量的训练数据之中学习物体的深度和形状、纹理、场景语义以及场景上下文等之间的内在联系，从而准确预测输入到网络之中图片的深度信息。将CNN与单目SLAM结合不但可以提高建图的完整率，还可以获得绝对的尺度信息，因此弥补了单目SLAM的缺陷与不足。目前，最成功将两者结合的系统被称为CNN-SLAM(Cnn-slam:Realtime Dense Monocular Slam With Learned Depth Prediction)，该系统把CNN深度预测的结果作为SLAM关键帧的初始深度值，然后采用像素匹配、三角测量和图优化的方法对关键帧中高梯度点的深度进行优化，从而获得稠密的3D重建结果，并使尺度信息更加接近真实尺度。虽然取得了一定的效果，但是该系统仍存在以下问题：In recent years, the deep convolutional neural network (Convolutional Neural Network, CNN) for monocular image depth estimation has made great progress. Its main principle is to learn the depth and shape, texture, and The intrinsic connection between scene semantics and scene context, etc., so as to accurately predict the depth information of the picture input into the network. The combination of CNN and monocular SLAM can not only improve the complete rate of mapping, but also obtain absolute scale information, thus making up for the defects and deficiencies of monocular SLAM. At present, the most successful system combining the two is called CNN-SLAM (Cnn-slam: Realtime Dense Monocular Slam With Learned Depth Prediction), which uses the result of CNN depth prediction as the initial depth value of the SLAM key frame, and then uses Pixel matching, triangulation, and graph optimization methods optimize the depth of high-gradient points in keyframes to obtain dense 3D reconstruction results and make scale information closer to the real scale. Although certain effects have been achieved, the system still has the following problems:

(1)只有少数高梯度像素点的深度值被优化，大部分低梯度像素点的深度值没有变化，造成重建效果不理想，特别是对于未知场景；(2)利用CNN输出之中高梯度像素点的深度信息来预测尺度信息不够准确，造成初始化不够充分，会增大SLAM系统建图和追踪的误差。(1) The depth values of only a few high-gradient pixels are optimized, and the depth values of most of the low-gradient pixels remain unchanged, resulting in unsatisfactory reconstruction effects, especially for unknown scenes; (2) Using the high-gradient pixels in the CNN output The depth information is not accurate enough to predict the scale information, resulting in insufficient initialization, which will increase the error of SLAM system mapping and tracking.

发明内容Contents of the invention

针对现有技术的以上缺陷或改进需求，本发明提供了一种将在线学习深度预测网络与单目SLAM相结合的方法与系统，其目的在于充分利用深度卷积神经网络的优势实现对于单目SLAM系统关键帧的稠密深度估计，并根据结果恢复场景真实尺度信息，由此解决传统单目SLAM缺少尺度信息和不能实现稠密建图的技术问题。In view of the above defects or improvement needs of the prior art, the present invention provides a method and system that combines online learning depth prediction network with monocular SLAM, the purpose of which is to make full use of the advantages of deep convolutional neural network to achieve The dense depth estimation of the key frames of the SLAM system, and the restoration of the real scale information of the scene based on the results, thereby solving the technical problems of the lack of scale information and the inability to achieve dense mapping in traditional monocular SLAM.

为实现上述目的，按照本发明的一个方面，提供了一种基于在线学习深度预测网络的实时稠密单目SLAM方法，包括：In order to achieve the above object, according to one aspect of the present invention, a real-time dense monocular SLAM method based on online learning depth prediction network is provided, including:

(1)从单目视觉传感器通过旋转和平移运动采集的图片序列中选择关键帧，通过最小化高梯度点的光度误差优化得到关键帧的相机姿态，并且采用三角测量法预测高梯度点的深度得到当前帧的半稠密地图；(1) Select the key frame from the image sequence collected by the monocular vision sensor through the rotation and translation motion, optimize the camera pose of the key frame by minimizing the photometric error of the high gradient point, and use the triangulation method to predict the depth of the high gradient point Get the semi-dense map of the current frame;

(2)根据所述关键帧选择在线训练图片对，采用逐块随机梯度下降法根据所述在线训练图片对来在线训练更新CNN网络模型，并利用训练后CNN网络模型对当前帧图片进行深度预测得到稠密地图；(2) Select an online training picture pair according to the key frame, use the block-by-block stochastic gradient descent method to train and update the CNN network model online according to the online training picture pair, and use the trained CNN network model to perform depth prediction on the current frame picture get a dense map;

(3)根据所述当前帧的半稠密地图和预测稠密地图进行深度尺度回归，得到当前帧深度信息的绝对尺度因子；(3) performing depth scale regression according to the semi-dense map of the current frame and the predicted dense map to obtain the absolute scale factor of the depth information of the current frame;

(4)根据相机姿态将所述预测稠密地图通过位姿变换投影到上一关键帧中，并根据所述绝对尺度因子将所述半稠密地图投影到上一关键帧中，采用NCC得分投票方法根据所述两种投影结果选择所述当前帧的各像素深度预测值得到预测深度图，并对所述预测深度图进行高斯融合得到最终深度图。(4) Project the predicted dense map to the previous key frame through pose transformation according to the camera pose, and project the semi-dense map to the previous key frame according to the absolute scale factor, using the NCC score voting method Selecting the predicted depth values of each pixel in the current frame according to the two projection results to obtain a predicted depth map, and performing Gaussian fusion on the predicted depth map to obtain a final depth map.

本发明的一个实施例中，所述根据所述关键帧选择在线训练图片，具体为：采用如下约束条件在关键帧前后帧图片中筛选图片帧与所述关键帧构成图片对：In an embodiment of the present invention, the selection of the online training picture according to the key frame is specifically: using the following constraints to select the picture frame and the key frame to form a picture pair in the frame pictures before and after the key frame:

第一，相机运动约束：两帧图片之间水平方向上的位移满足|t_x|>0.9*T，其中T代表两帧图片之间的基线距离；First, the camera motion constraint: the displacement in the horizontal direction between the two frames of pictures satisfies |t _x |>0.9*T, where T represents the baseline distance between the two frames of pictures;

第二，视差约束：对于每一对图片，采用光流法计算图片间的垂直方向的平均视差Dis_avg，只有当Dis_avg小于预设阈值δ时才会将该对图片保存为候选训练图片；Second, disparity constraints: For each pair of pictures, the optical flow method is used to calculate the average disparity Dis _avg in the vertical direction between the pictures, and only when Dis _avg is smaller than the preset threshold δ, the pair of pictures will be saved as candidate training pictures;

第三，多样性约束：同一个关键帧只能产生一对训练图片；Third, diversity constraints: the same key frame can only generate a pair of training pictures;

第四，训练池容量约束：每当训练图片对的数量到达设定阈值V时，就将训练池中的图片送入到网络，对网络进行在线训练，保存训练得到的网络模型，同时清空训练池继续进行训练数据的筛选。Fourth, training pool capacity constraints: Whenever the number of training picture pairs reaches the set threshold V, the pictures in the training pool are sent to the network, the network is trained online, the network model obtained from the training is saved, and the training time is cleared. The pool continues to filter the training data.

本发明的一个实施例中，采用逐块随机梯度下降法根据所述在线训练图片来在线训练更新CNN网络模型，具体为：In one embodiment of the present invention, a block-by-block stochastic gradient descent method is used to train and update the CNN network model online according to the online training pictures, specifically:

将ResNet-50之中的卷积层分为5个块，其中每一个块具体表示为conv1,conv2_x,conv3_x,conv4_x,conv5_x；conv1由一个单一的7X7的全卷积层组成；conv2_x由一个3X3的卷积层和3个瓶颈构建块共10层组成；conv3_x由4个瓶颈构建块共12层组成；conv4_x由6个瓶颈构建块共18层组成：conv5_x由3个瓶颈构建块共9层组成，五个部分加起来构成了ResNet-50的50层结构；The convolutional layer in ResNet-50 is divided into 5 blocks, each of which is specifically represented as conv1, conv2_x, conv3_x, conv4_x, conv5_x; conv1 consists of a single 7X7 full convolutional layer; conv2_x consists of a 3X3 The convolution layer and 3 bottleneck building blocks are composed of 10 layers; conv3_x is composed of 4 bottleneck building blocks with 12 layers; conv4_x is composed of 6 bottleneck building blocks with 18 layers; conv5_x is composed of 3 bottleneck building blocks with 9 layers , the five parts together constitute the 50-layer structure of ResNet-50;

在每一次在线学习和更新的过程之中，每一次迭代k，只更新一个部分的参数W_i(i＝1,2,3,4,5)，保持剩余4个部分网络层参数不变，而在下一次迭代中，更新第i块参数，其中i＝(i+1)％5；其他层参数保持不变，整个在线学习和更新的迭代一直在进行，直到预设停止条件被满足。In each online learning and updating process, each iteration k, only update a part of the parameters W _i (i=1,2,3,4,5), keep the remaining 4 parts of the network layer parameters unchanged, And in the next iteration, update the i-th block parameter, where i=(i+1)%5; other layer parameters remain unchanged, and the entire iteration of online learning and updating continues until the preset stop condition is met.

本发明的一个实施例中，所述在线训练更新CNN网络模型为选择性更新，具体为：In one embodiment of the present invention, the online training and updating CNN network model is a selective update, specifically:

计算每一批输入到CNN网络模型之中图片的训练损失函数，一旦一批图片的所有图片的损失函数都大于预先设定的阈值L_high，将启动在线学习和更新的进程，在线学习和更新的进程将会一直进行，直到训练图片的损失函数降到阈值L_low之下，或者是迭代的次数达到了预先设定的阈值。Calculate the training loss function of each batch of pictures input into the CNN network model. Once the loss function of all pictures in a batch of pictures is greater than the preset threshold L _high , the process of online learning and updating will be started. Online learning and updating The process will continue until the loss function of the training image drops below the threshold L _low , or the number of iterations reaches the preset threshold.

本发明的一个实施例中，所述深度尺度回归方法为：RANSAC算法或最小二乘算法。In an embodiment of the present invention, the depth scale regression method is: RANSAC algorithm or least squares algorithm.

本发明的一个实施例中，所述将所述预测稠密地图通过位姿变换投影到上一关键帧中，并根据所述绝对尺度因子将所述半稠密地图投影到上一关键帧中，采用NCC得分投票方法根据所述两种投影结果选择所述当前帧的各像素深度预测值得到预测深度图，具体为：In an embodiment of the present invention, the predicted dense map is projected into the previous key frame through pose transformation, and the semi-dense map is projected into the previous key frame according to the absolute scale factor, using The NCC score voting method selects each pixel depth prediction value of the current frame according to the two kinds of projection results to obtain a predicted depth map, specifically:

将关键帧i之中的每一个像素点p，根据CNN预测的稠密地图D_cnn(p)和位姿变换将该像素点投影到与之相距最近的关键帧i-1中，投影的结果表示为p′_cnn；For each pixel point p in the key frame i, according to the dense map D _cnn (p) predicted by CNN and the pose transformation Project the pixel point into the key frame i-1 closest to it, and the result of the projection is expressed as _p'cnn ;

将关键帧i之中的像素点p做另外一个投影，映射到关键帧i-1中记为p′_sd，投影是基于半稠密地图的结果D_sp(p)和绝对的尺度因子；Make another projection of the pixel point p in the key frame i, and map it to the key frame i-1 as p′ _sd , the projection is based on the result D _sp (p) of the semi-dense map and the absolute scale factor;

分别在关键帧i-1中投影点p′_cnn和p′_sd附近选取小的区域，并且分别计算区域R(p)与R_cnn(p′)之间的归一化互相关系数NCC_cnn和区域R(p)与R_sd(p′)之间归一化互相关系数NCC_sd，如果NCC_cnn小于NCC_sd，那么表明半稠密深度图的深度预测结果要好于CNN的结果，选择D_sp(p)作为像素点p的最终深度预测值，否则选择R_cnn(p′)，如果有些点只有CNN的预测结果，就使用R_cnn(p′)作为像素点p的最终深度。Select a small area near the projection points p′ _cnn and p′ _sd in the key frame i-1 respectively, and calculate the normalized _{cross-correlation coefficients NCC cnn} _and The normalized cross-correlation coefficient NCC _sd between the region R(p) and R _sd (p′), if NCC _cnn is smaller than NCC _sd , it indicates that the depth prediction result of the semi-dense depth map is better than the result of CNN, choose D _sp ( p) as the final depth prediction value of pixel p, otherwise select R _cnn (p′), if some points only have CNN prediction results, use R _cnn (p′) as the final depth of pixel p.

本发明的一个实施例中，对所述预测深度图进行高斯融合得到最终深度图，具体为：In one embodiment of the present invention, Gaussian fusion is performed on the predicted depth map to obtain the final depth map, specifically:

对NCC得分投票方法得到的深度图进一步处理，根据关键帧之间的上下文关系，并且结合关键帧深度图的不确定度图进行联合优化，通过联合优化得到最终的深度图。The depth map obtained by the NCC score voting method is further processed, according to the context relationship between key frames, and combined with the uncertainty map of the key frame depth map for joint optimization, and the final depth map is obtained through joint optimization.

本发明的一个实施例中，利用训练后CNN网络模型对当前帧图片进行深度预测得到稠密地图中还包括：In one embodiment of the present invention, the dense map obtained by using the trained CNN network model to predict the depth of the current frame picture also includes:

将深度图中每一个像素点的深度值乘以一个尺度系数 Multiply the depth value of each pixel in the depth map by a scale factor

其中，f_adapted为用于在线获取训练数据的单目相机焦距，B_adapted为双目训练图片的基线，f_pre-train和B_pre-train分别为用于训练原始CNN网络模型训练图片的焦距和基线。Among them, f _adapted is the focal length of the monocular camera used to obtain the training data online, B _adapted is the baseline of the binocular training picture, f _pre-train and B _pre-train are the focal length and the training picture used to train the original CNN network model, respectively. baseline.

本发明的一个实施例中，所述关键帧为：定义整个图像序列或者相机实时得到的第一张图片为关键帧，除了第一帧，后边的一部分图片帧也会被定义为关键帧，其中定义关键帧的原则是监测当前帧与之前一个最近的关键帧之间的平移和旋转是否达到了预先设定的阈值。In one embodiment of the present invention, the key frame is: define the whole image sequence or the first picture obtained by the camera in real time as a key frame, except for the first frame, a part of the following picture frames will also be defined as key frames, wherein The principle of defining a key frame is to monitor whether the translation and rotation between the current frame and the previous nearest key frame reach a preset threshold.

按照本发明的另一方面，还提供了一种基于在线学习深度预测网络的实时稠密单目SLAM系统，包括直接法单目SLAM模块、在线自适应CNN预测模块、绝对尺度回归模块以及深度图融合模块，其中：According to another aspect of the present invention, a real-time dense monocular SLAM system based on an online learning depth prediction network is also provided, including a direct method monocular SLAM module, an online adaptive CNN prediction module, an absolute scale regression module, and depth map fusion module, where:

所述直接法单目SLAM模块，用于从单目视觉传感器通过旋转和平移运动采集的图片序列中选择关键帧，通过最小化高梯度点的光度误差优化得到关键帧的相机姿态，并且采用三角测量法预测高梯度点的深度得到当前帧的半稠密地图；The direct method monocular SLAM module is used to select key frames from the image sequence collected by the monocular vision sensor through rotation and translation motion, and optimize the camera pose of the key frame by minimizing the photometric error of high gradient points, and adopts triangulation The measurement method predicts the depth of high gradient points to obtain a semi-dense map of the current frame;

所述在线自适应CNN预测模块，用于根据所述关键帧选择在线训练图片对，采用逐块随机梯度下降法根据所述在线训练图片对来在线训练更新CNN网络模型，并利用训练后CNN网络模型对当前帧图片进行深度预测得到稠密地图；The online adaptive CNN prediction module is used to select an online training picture pair according to the key frame, and use block-by-block stochastic gradient descent method to train and update the CNN network model online according to the online training picture pair, and use the trained CNN network The model performs depth prediction on the current frame picture to obtain a dense map;

所述绝对尺度回归模块，用于根据所述当前帧的半稠密地图和预测稠密地图进行深度尺度回归，得到当前帧深度信息的绝对尺度因子；The absolute scale regression module is used to perform depth scale regression according to the semi-dense map of the current frame and the predicted dense map to obtain the absolute scale factor of the depth information of the current frame;

所述深度图融合模块，用于根据相机姿态将所述预测稠密地图通过位姿变换投影到上一关键帧中，并根据所述绝对尺度因子将所述半稠密地图投影到上一关键帧中，采用NCC得分投票方法根据所述两种投影结果选择所述当前帧的各像素深度预测值得到预测深度图，并对所述预测深度图进行高斯融合得到最终深度图。The depth map fusion module is used to project the predicted dense map to the previous key frame through pose transformation according to the camera pose, and project the semi-dense map to the previous key frame according to the absolute scale factor , using the NCC score voting method to select the predicted depth values of each pixel of the current frame according to the two projection results to obtain a predicted depth map, and performing Gaussian fusion on the predicted depth map to obtain a final depth map.

总体而言，通过本发明所构思的以上技术方案与现有技术相比，具有如下有益效果：本发剪采用单目SLAM采用直接法为基础，采用优化的方式得到场景的半稠密地图和相机姿态；在线自适应CNN采用了弱监督的深度预测网络，并且根据场景信息进行在线更新，使得网络在未知场景下具有良好的效果；深度尺度回归可以得到深度值的尺度信息，用来提高3D重建的准确性；数据融合采用了区域投票和高斯融合的方式，在保证完整率的情况下，提高了结果的精度。Generally speaking, compared with the prior art, the above technical solution conceived by the present invention has the following beneficial effects: the present invention adopts monocular SLAM and adopts the direct method as the basis, and obtains the semi-dense map and camera of the scene in an optimized manner. Attitude; online adaptive CNN uses a weakly supervised depth prediction network, and performs online updates according to scene information, so that the network has a good effect in unknown scenes; depth scale regression can obtain scale information of depth values to improve 3D reconstruction accuracy; data fusion adopts regional voting and Gaussian fusion, which improves the accuracy of the results while ensuring the integrity rate.

附图说明Description of drawings

图1是本发明实施例中基于在线学习深度预测网络的实时稠密单目SLAM方法原理示意图；1 is a schematic diagram of the principle of a real-time dense monocular SLAM method based on an online learning depth prediction network in an embodiment of the present invention;

图2是本发明实施例中三角测量法模型示意图；Fig. 2 is a schematic diagram of a triangulation model in an embodiment of the present invention;

图3是本发明实施例中筛选训练图片的约束关系；其中图(a)是第一种像素对应关系的图像对，图(b)是第二种像素对应关系的图像对；Fig. 3 is the constraint relation of screening training pictures in the embodiment of the present invention; Wherein figure (a) is the image pair of the first kind of pixel correspondence, figure (b) is the image pair of the second kind of pixel correspondence;

图4是本发明实施例中尺度系数调整原理图，其中上半部分是原始的网络结构，下半部分是本发明对于网络的改进；Fig. 4 is a schematic diagram of scale coefficient adjustment in the embodiment of the present invention, wherein the upper part is the original network structure, and the lower part is the improvement of the network in the present invention;

图5是本发明实施例中逐块梯度下降法(block-wise SGD)示意图；Fig. 5 is a block-wise gradient descent method (block-wise SGD) schematic diagram in the embodiment of the present invention;

图6是本发明实施例中尺度回归及效果图；Fig. 6 is a scale regression and effect diagram in the embodiment of the present invention;

图7为本发明实施例中基于在线学习深度预测网络的实时稠密单目SLAM系统结构示意图。FIG. 7 is a schematic structural diagram of a real-time dense monocular SLAM system based on an online learning depth prediction network in an embodiment of the present invention.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。此外，下面所描述的本发明各个实施方式中所涉及到的技术特征只要彼此之间未构成冲突就可以相互组合。In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not constitute a conflict with each other.

本发明要解决的问题是实现一个实时的单目稠密建图SLAM系统，该系统采用自适应在线CNN深度预测网络和基于直接法的单目SLAM系统相结合的方式，不但可以明显提高对于未知场景深度预测的准确性和鲁放性，还能够解决单目SLAM系统尺度不确定性的问题。The problem to be solved by the present invention is to realize a real-time monocular dense mapping SLAM system. The system adopts the combination of self-adaptive online CNN depth prediction network and monocular SLAM system based on direct method, which can not only significantly improve the performance of unknown scene The accuracy and robustness of depth prediction can also solve the problem of scale uncertainty of the monocular SLAM system.

为了实现上述目的，本发明采用了将CNN与SLAM结合的方式，针对单目SLAM存在的问题，提出了准确性更好鲁放性更强的算法，该方案的主要创新点包括：In order to achieve the above purpose, the present invention adopts the method of combining CNN and SLAM, and proposes an algorithm with better accuracy and more robustness for the problems existing in monocular SLAM. The main innovations of the scheme include:

(1)采用了自适应在线CNN深度预测网络，这也是整个领域第一次将该类型网络与单目SLAM系统结合，这样做可以大大提高系统在未知场景中深度预测的准确性；(1) An adaptive online CNN depth prediction network is adopted, which is the first time in the field to combine this type of network with a monocular SLAM system, which can greatly improve the accuracy of the system's depth prediction in unknown scenes;

(2)提出了一种“逐块随机梯度下降”(block-wise SGD)的方法和选择性更新的策略，使CNN在有限的训练数据条件下可以取得更好的深度预测结果；(2) A "block-wise SGD" method and a selective update strategy are proposed, so that CNN can achieve better depth prediction results under limited training data conditions;

(3)设计了一种基于自适应网络的绝对尺度回归方法，可以极大地提高深度预测的准确性，并且使整个系统追踪和建图的精度更高。(3) An adaptive network-based absolute scale regression method is designed, which can greatly improve the accuracy of depth prediction and make the tracking and mapping of the whole system more accurate.

该系统主要由四个组成部分：直接法单目SLAM，在线自适应CNN，深度尺度回归和数据融合，方法原理框图如图1所示。单目SLAM采用直接法，并以直接法为基础，采用优化的方式得到场景的半稠密地图和相机姿态；在线自适应CNN采用了弱监督的深度预测网络，并且根据场景信息进行在线更新，使得网络在未知场景下具有良好的效果；深度尺度回归可以得到深度值的尺度信息，用来提高3D重建的准确性；数据融合采用了区域投票和高斯融合的方式，在保证完整率的情况下，提高了结果的精度。The system mainly consists of four components: direct method monocular SLAM, online adaptive CNN, depth-scale regression and data fusion. The block diagram of the method is shown in Figure 1. Monocular SLAM adopts the direct method, and based on the direct method, adopts an optimized method to obtain the semi-dense map and camera pose of the scene; the online adaptive CNN uses a weakly supervised depth prediction network, and performs online updates according to the scene information, so that The network has a good effect in unknown scenes; the depth scale regression can obtain the scale information of the depth value, which is used to improve the accuracy of 3D reconstruction; the data fusion adopts the method of regional voting and Gaussian fusion, and in the case of ensuring the integrity rate, Improved precision of results.

具体地，所述方法包括如下过程：Specifically, the method includes the following processes:

(1)直接法单目SLAM：本部分是在LSD-SLAM的基础之上进行的改造，通过最小化高梯度点的光度误差，优化得到每一帧图片的相机姿态，并且采用三角测量法来预测高梯度点的深度，从而得到半稠密的地图；(1) Direct method monocular SLAM: This part is based on the transformation of LSD-SLAM. By minimizing the photometric error of high gradient points, the camera pose of each frame is optimized, and the triangulation method is used to Predict the depth of high gradient points to obtain a semi-dense map;

图片采集：本方法以单目视觉传感器为基础，在采集图片时，要求单目相机有旋转和平移的运动，而且平移的幅度适当加大一些。这样做的原因主要有两个：一是如果只存在静止和纯旋转的情况，则有可能造成本部分初始化失败或者图片追踪失败，进而造成整个系统的不正常工作；二是适当的增加平移的幅度，有助于系统挑选合适的训练图片，从而保证在线训练和更新CNN过程的正常进行。Picture collection: This method is based on the monocular vision sensor. When collecting pictures, the monocular camera is required to have rotation and translation movements, and the translation range should be increased appropriately. There are two main reasons for this: first, if there are only static and pure rotation situations, it may cause the initialization failure of this part or the image tracking failure, which will cause the entire system to work abnormally; the second is to increase the translation appropriately. The amplitude helps the system to select suitable training pictures, so as to ensure the normal progress of the online training and updating CNN process.

关键帧定义：单目SLAM部分定义整个序列或者相机实时得到的第一张图片为keyframe(关键帧)，除了第一帧，后边的一部分图片帧也会被定义为关键帧，其中定义关键帧的原则是监测当前帧与之前一个最近的关键帧之间的平移和旋转是否达到了预先设定的阈值；基于关键帧的算法组成是直接法单目SLAM后端优化的基础也是网络部分重要的框架结构，因此需要特别介绍。Key frame definition: The monocular SLAM part defines the entire sequence or the first picture obtained by the camera in real time as a keyframe (key frame). In addition to the first frame, some of the subsequent picture frames will also be defined as key frames, where the key frame is defined The principle is to monitor whether the translation and rotation between the current frame and the previous most recent key frame have reached the preset threshold; the algorithm composition based on the key frame is the basis of the direct method monocular SLAM back-end optimization and also an important framework of the network part structure, and therefore require special introduction.

相机姿态追踪：相机在三维空间中的运动共有六个自由度，在Δt时间内的运动量可用一个六维数组表示：ξ＝[v⁽¹⁾ ν⁽²⁾ ν⁽³⁾ ψ⁽¹⁾ ψ⁽²⁾ ψ⁽³⁾]^T。其中[ν⁽¹⁾ ν⁽²⁾ ν⁽³⁾]^T表示刚体运动量沿三个坐标轴的平移分量，且[ν⁽¹⁾ ν⁽²⁾ ν⁽³⁾]^T∈R³是欧式空间中的向量；[ψ⁽¹⁾ ψ⁽²⁾ψ⁽³⁾]^T表示刚体运动沿三坐标轴的旋转分量，且[ψ⁽¹⁾ ψ⁽²⁾ ψ⁽³⁾]^T∈SO(3)是非欧式三维旋转群SO(3)中的向量。基于视觉的相机就是跟踪通过视觉信息求解ξ的过程。本发明所采用的单目SLAM采用直接法来追踪相机姿态，将图A中有深度信息的所有点投影到图B中，得到一张新的图片B'，通过优化B'与B之间所有位置的灰度值之差的总和(photometric error)来得到B相对于A的位置变化。直接法能更好的应对视角变化、光照变化、场景纹理稀疏等情况，是目前比较流行的一类方法，因此本项目采用基于直接法来实现相机位姿跟踪。Camera pose tracking: There are six degrees of freedom in the movement of the camera in three-dimensional space, and the movement amount within Δt time can be represented by a six-dimensional array: ξ=[v ⁽¹⁾ ν ⁽²⁾ ν ⁽³⁾ ψ ⁽¹⁾ ψ ⁽²⁾ ψ ⁽³⁾ ] ^T . where [ν ⁽¹⁾ ν ⁽²⁾ ν ⁽³⁾ ] ^T represents the translation component of the motion of the rigid body along the three coordinate axes, and [ν ⁽¹⁾ ν ⁽²⁾ ν ⁽³⁾ ] ^T ∈ R ³ is the Euclidean space The vector in; [ψ ⁽¹⁾ ψ ⁽²⁾ ψ ⁽³⁾ ] ^T represents the rotation component of the rigid body motion along the three coordinate axes, and [ψ ⁽¹⁾ ψ ⁽²⁾ ψ ⁽³⁾ ] ^T ∈ SO(3 ) is a vector in the non-Euclidean three-dimensional rotation group SO(3). A vision-based camera is tracking the process of solving ξ through visual information. The monocular SLAM used in the present invention uses a direct method to track the camera pose, and projects all points with depth information in picture A to picture B to obtain a new picture B', by optimizing all points between B' and B The sum of the difference of the gray value of the position (photometric error) to get the position change of B relative to A. The direct method can better deal with changes in perspective, illumination changes, and sparse scene textures. It is a popular method at present. Therefore, this project uses the direct method to achieve camera pose tracking.

具体而言，直接法用于相机姿态追踪的关键想法是在当前帧n和相距最近的关键帧k之间寻找一个最优的相机姿态使得当前帧n和关键帧k之间的光度学误差最小。均匀区域的存在可能会造成帧间的像素匹配不准确，因为不同的相机姿态可能会有相似的光度学误差。为了获得高鲁放性的追踪结果并且减少优化所用的时间开销，光度学误差r只在关键帧k中的高梯度点{p}进行计算，如下所示：Specifically, the key idea of the direct method for camera pose tracking is to find an optimal camera pose between the current frame n and the nearest key frame k Minimize the photometric error between current frame n and key frame k. The existence of uniform regions may cause inaccurate pixel matching between frames due to different camera poses There may be similar photometric errors. In order to obtain highly reproducible tracking results and reduce the time overhead for optimization, the photometric error r is only calculated at the high gradient point {p} in the key frame k, as follows:

其中D(p)代表高梯度像素点p的深度值，π是投影的模型可以将相机坐标系中的3D空间点P^c投影到2D图片平面像素点p，π是由相机内参K决定的。Among them, D(p) represents the depth value of the high-gradient pixel point p, π is the projection model that can project the 3D space point P ^c in the camera coordinate system to the 2D image plane pixel point p, and π is determined by the camera internal reference K.

同样地，π^-1是反向投影的模型，可以将2D平面的像素点投影到3D空间。一个经过优化之后的相机姿态就可以通过最小化所有高梯度像素点的光度学误差r计算出来，如下所示：Similarly, π ^-1 is a model of back projection, which can project the pixels of the 2D plane to the 3D space. An optimized camera pose It can be calculated by minimizing the photometric error r of all high gradient pixels, as follows:

其中w_p是用于提高鲁放性并且最小化异常值影响的像素点p的权值。Where w _p is the weight of the pixel p used to improve the robustness and minimize the influence of outliers.

(2)式的问题可以被一个标准的高斯牛顿优化算法解决。当然，这种通过上述方法进行相机位姿跟踪会因为误差的累积而产生漂移，但是这种漂移可以通过额外添加回环检测来消除，本项目拟采用基于词袋模型的回环检测方法解决累计误差带来的漂移问题。The problem of (2) can be solved by a standard Gauss-Newton optimization algorithm. Of course, this kind of camera pose tracking through the above method will cause drift due to the accumulation of errors, but this drift can be eliminated by adding additional loopback detection. This project intends to use the loopback detection method based on the bag of words model to solve the cumulative error zone. Come the drift problem.

半稠密深度估计：单目直接法SLAM系统中用于建图的线程通过小基线立体比较的方法来估计高梯度像素点的深度值，也就是三角测量法所采用的像素匹配的方法。具体而言特征匹配及三角测量法的模型如下图2所示，C和C'分别是关键帧和参考帧的摄像机坐标系原点，X为要计算深度的3D点，m和m'分别为点X在摄像机C和C'摄像机投影平面上的投影。Semi-dense depth estimation: The thread used for mapping in the monocular direct method SLAM system estimates the depth value of high-gradient pixel points through the method of small baseline stereo comparison, which is the pixel matching method used in the triangulation method. Specifically, the model of feature matching and triangulation is shown in Figure 2 below. C and C' are the origin of the camera coordinate system of the key frame and reference frame respectively, X is the 3D point to calculate the depth, m and m' are the points respectively Projection of X on the camera projection planes of cameras C and C'.

因为单目视觉中关键帧和参考帧来源于同一个摄像机，所以它们的投影内参是相同的，此时若已经根据视觉方法得到两个摄像机坐标系之间的旋转平移[R,t]，则有下式：Because the key frame and reference frame in monocular vision come from the same camera, their projection internal parameters are the same. If the rotation and translation [R,t] between the two camera coordinate systems have been obtained according to the visual method, then Has the following formula:

其中，f_x,f_y,c_x,c_y,s是相机的内参，R,t分别是一个3×3和3×1的矩阵，表示摄像机C'坐标系相对于摄像机C坐标系的旋转和平移，(x_c,y_c,z_c)^T，(x_c',y_c',z_c')^T分别表示点X在摄像机坐标系C和C'下的齐次坐标，(u,v)^T，(u',v')^T分别表示点X在摄像机C和C'的投影平面上的像素坐标。由于相机标定后内参矩阵为已知值，[R,t]可以由前面的定位得到，所以(m₁₁...m₃₄)均为已知数，于是上式可化简为：Among them, f _x , f _y , c _x , _cy , s are the internal parameters of the camera, R, t are a 3×3 and 3×1 matrix respectively, indicating the rotation of the camera C' coordinate system relative to the camera C coordinate system and translation, (x _c , y _c , z _c ) ^T , (x _c' , y _c' , z _c ') ^T represent the homogeneous coordinates of point X in the camera coordinate system C and C' respectively, (u, v) ^T , (u', v') ^T represent the pixel coordinates of point X on the projection planes of cameras C and C', respectively. Since the internal parameter matrix is a known value after camera calibration, [R,t] can be obtained from the previous positioning, so (m ₁₁ ...m ₃₄ ) are all known numbers, so the above formula can be simplified as:

即：A(x_c,y_c,z_c)^T＝b，该方程组含有3个未知数，4个方程，是一个超定方程组，对其求最小二乘解解得使得满足 Namely: A(x _c ,y _c ,z _c ) ^T ＝b, this system of equations contains 3 unknowns and 4 equations, it is an overdetermined system of equations, the least square solution is obtained make satisfied

一旦一个新的关键帧k被创建，它的深度图D_pri和深度值预测的不确定度U_pri首先会被初始化，通过将第k-1个关键帧的深度D_k-1和不确定度U_k-1投影到当前关键帧，具体做法如下所示：Once a new keyframe k is created, its depth map D _pri and depth value prediction uncertainty U _pri are first initialized by adding the depth D _k-1 and uncertainty of the k-1th keyframe k U _k-1 is projected to the current keyframe, the specific method is as follows:

D_pri＝D_k-1-t_z(6)D _pri =D _k-1 -t _z (6)

其中,t_z是相机沿视轴方向上的平移，σ²表示初始化噪声的标准差。初始化的深度图会根据后来的图片帧被不断的修复，修复过程首先是沿着双极线检索当前帧中每一个高梯度像素点p在关键帧k中的匹配像素点，其中，双极线上的检索区间是由像素点p的深度不确定度决定的；一旦找到了匹配像素点，p的深度值就可以通过三角测量法计算出来。本发明用一个函数F来表示像素点匹配和三角测量法的整个过程，基于F，本发明得到的深度值的观测结果D_obs可以表示为如下所示：Among them, t _z is the translation of the camera along the viewing axis, and σ ² represents the standard deviation of the initialization noise. The initialized depth map will be continuously repaired according to the subsequent picture frames. The repair process first retrieves the matching pixel points of each high-gradient pixel point p in the current frame along the bipolar line in the key frame k, where the bipolar line The search interval on is determined by the depth uncertainty of pixel p; once a matching pixel is found, the depth value of p can be calculated by triangulation. The present invention uses a function F to represent the whole process of pixel point matching and triangulation. Based on F, the observation result D _obs of the depth value obtained by the present invention can be expressed as follows:

其中I_k和I_cur分别代表关键帧k和当前的图片帧，代表关键帧k到当前图片帧之间的相机运动，K代表相机的内参矩阵。深度的观测值D_obs的不确定度U_obs由噪声产生，这些噪声存在于I_k和I_cur之间的像素匹配和相机运动的估计过程之中。经过修复之后的深度图和它相应的不确定度其实就是初始深度信息和观测深度信息的一种融合，符合如下分布：Among them, I _k and I _cur represent the key frame k and the current picture frame respectively, Represents the camera motion between the key frame k and the current picture frame, and K represents the internal parameter matrix of the camera. Observations of depth (D _obs ) with uncertainties (U _obs ) are generated by noise that exists between I _k and I _cur pixel matching and camera motion in the estimation process. The repaired depth map and its corresponding uncertainty are actually a fusion of initial depth information and observed depth information, which conforms to the following distribution:

(2)在线自适应训练CNN，并利用所述CNN对关键帧进行预测得到稠密地图(densedepth map)：(2) Online adaptive training CNN, and using the CNN to predict key frames to obtain a dense map (densedepth map):

在线自适应CNN：首先，本发明采用了一种最先进的弱监督方法作为基础来进行单幅图片深度估计。该种弱监督方法的网络架构主要由两部分组成，第一部分是基于ResNet-50的全卷积层(ConvLayers)；第二部分则是把ResNet-50中位于后边的池化层和全连接层(FCLayers)替换为一系列由反卷积层和短连接层组成的上采样区域。训练整个CNN需要成对的经过矫正的立体图片，这些立体图片的基线B_pre-train和相机焦距f_pre-train都是固定的。网络的输出为视差图，根据视差图可以生成源图片的重构图，源图片和重构图片之间的光度学误差加上平滑项构成了整个网络的损失函数。实验中，通过单目SLAM系统可以获取一系列关键帧{I₁,…I_j}图片间平移{T_1,2,…,T_i-1,i,…,T_j-1,j}和旋转变化{R_1,2,…,R_i-1,i,…,R_j-1,j}，本发明以此信息作为真值，学习能够最小化任意两张关键帧之间重投影误差的二维图片的深度图。Online Adaptive CNN: First, the present invention uses a state-of-the-art weakly supervised method as a basis to estimate the depth of a single image. The network architecture of this weakly supervised method is mainly composed of two parts. The first part is the full convolutional layer (ConvLayers) based on ResNet-50; the second part is the pooling layer and the full connection layer located behind ResNet-50. (FCLayers) are replaced by a sequence of upsampling regions consisting of deconvolutional layers and short-connection layers. Training the entire CNN requires pairs of rectified stereo images for which both the baseline B _pre-train and the camera focal length f _pre-train are fixed. The output of the network is a disparity map. According to the disparity map, the reconstruction map of the source image can be generated. The photometric error between the source image and the reconstructed image plus the smoothing term constitutes the loss function of the entire network. In the experiment, a series of key frame {I ₁ ,…I _j } image translation {T _1,2 ,…,T _i-1,i ,…,T _j-1,j } and Rotation changes {R _1,2 ,…,R _i-1,i ,…,R _j-1,j }, the present invention takes this information as the true value, and learns to minimize the reprojection error between any two key frames The depth map of the 2D image.

预训练CNN：整个在线自适应CNN部分基于本发明预训练的CNN网络模型，在预训练网络模型时，本发明按照传统的CNN训练方法，使用CMU数据集(the Wean Hall dataset)的6510对图片加上自己录制的本实验室场景35588对图片共42098对图片作为训练集，对网络进行训练。其中训练集图片的基线为0.12米，本实验室场景图片是使用ZED立体相机拍摄得到，并且采用了随机的颜色，尺度和镜像变化对训练集进行数据增强。所有的训练集图片经过处理之后输入到网络进行迭代，迭代次数40000次，学习率为0.0001，最终得到了想要的预训练模型，整个系统也会基于此模型进行在线学习和更新。Pre-training CNN: The entire online adaptive CNN part is based on the pre-trained CNN network model of the present invention. When pre-training the network model, the present invention uses 6510 pairs of pictures of the CMU dataset (the Wean Hall dataset) according to the traditional CNN training method In addition to the 35588 pairs of pictures recorded by myself in this laboratory, a total of 42098 pairs of pictures are used as the training set to train the network. The baseline of the training set pictures is 0.12 meters. The laboratory scene pictures were taken by ZED stereo camera, and random color, scale and mirror changes were used to enhance the data of the training set. All the images in the training set are processed and input to the network for iteration. The number of iterations is 40,000, and the learning rate is 0.0001. Finally, the desired pre-training model is obtained. The entire system will also learn and update online based on this model.

在单个视频场景序列下，每一次训练CNN都会保存并更新网络的模型，并用新的模型生成深度图。在线自适应CNN策略主要有以下四个：Under a single video scene sequence, each time CNN is trained, the model of the network is saved and updated, and a new model is used to generate a depth map. There are four main online adaptive CNN strategies:

1)在线学习图片筛选：深度预测网络需要成对立体相机拍摄的图片作为训练图片，这些立体图片具有固定的基线B_pre-train。为了实时训练和更新CNN网络模型，本发明在单目相机运动的同时根据双目相机的规则收集成对的单目图片来模拟立体图片。本发明采用了高标准的要求来收集可信赖的训练图片以减小噪声产生的CNN网络模型对错误样本的过拟合现象。本发明设计了四个主要的筛选条件：第一，相机运动约束。两帧图片之间水平方向上的位移满足|t_x|>0.9*T，其中T代表两帧图片之间的基线距离第二，视差约束。对于每一对图片，都会采用光流法计算图片间的垂直方向的平均视差Dis_avg，只有当Dis_avg小于阈值δ(实验时取为5)时才会将该对图片保存为候选训练图片。效果如图3所示，(a)，(b)分别是两对图片，当每一对图片之中的像素关系满足如(a)所示的关系时，这样的图像对就会被筛选为训练的候选图片，当它们的关系如(b)所示时，就会被丢弃掉；第三，多样性约束。每一对训练图片的筛选都是和关键帧图片唯一对应的，也就是说同一个关键帧最多只能产生一对训练图片；第四，训练池容量约束。每当训练图片对的数量到达阈值V(实验时取4)时，就将训练池中的图片送入到网络，对网络进行在线训练，保存训练得到的网络模型，同时清空训练池继续进行训练数据的筛选；1) Online learning image screening: the depth prediction network needs images taken by pairs of stereo cameras as training images, and these stereo images have a fixed baseline B _pre-train . In order to train and update the CNN network model in real time, the present invention collects paired monocular pictures according to the rules of binocular cameras to simulate stereoscopic pictures while the monocular camera is moving. The present invention adopts high standard requirements to collect reliable training pictures to reduce the over-fitting phenomenon of the CNN network model generated by the noise to the wrong samples. The present invention designs four main screening conditions: first, camera motion constraints. The displacement in the horizontal direction between two frames of pictures satisfies |t _x |>0.9*T, where T represents the baseline distance between two frames of pictures Second, the parallax constraint. For each pair of pictures, the optical flow method is used to calculate the average vertical disparity Dis _avg between the pictures. Only when Dis _avg is smaller than the threshold δ (taken as 5 in the experiment) will the pair of pictures be saved as candidate training pictures. The effect is shown in Figure 3. (a) and (b) are two pairs of pictures respectively. When the pixel relationship in each pair of pictures satisfies the relationship shown in (a), such image pairs will be screened as Candidate images for training are discarded when their relationship is shown in (b); third, diversity constraints. The screening of each pair of training pictures is uniquely corresponding to the key frame picture, that is to say, the same key frame can only generate a pair of training pictures at most; Fourth, the capacity of the training pool is constrained. Whenever the number of training picture pairs reaches the threshold V (4 in the experiment), the pictures in the training pool are sent to the network, the network is trained online, the network model obtained from the training is saved, and the training pool is emptied to continue training screening of data;

2)相机参数调整：用于在线获取训练数据的单目相机焦距f_adapted和双目训练图片的基线B_adapted很有可能和用于训练原始CNN网络模型训练图片的焦距f_pre-train和基线B_pre-train有很大不同。相机参数和场景深度值之间的关系已经被隐含的融入到网络结构之中，因此如果用不同焦距图片输入网络测试，得到的3D重建结果的绝对尺度可能会不准确。因此，整个网络需要被调整以适应不同相机参数的变化，但是这样做会使每一次在线学习的更新速度变慢。为了解决这个问题，提出了调整输出深度图的新思路，基本构想如图4所示，基本思想是通过将深度图中每一个像素点的深度值乘以一个尺度系数来保证深度图的准确性；2) Camera parameter adjustment: The monocular camera focal length f _adapted for online acquisition of training data and the baseline B _adapted for binocular training pictures are likely to be used to train the original CNN network model. The focal length f _pre-train and baseline B of training pictures _Pre-train is very different. The relationship between camera parameters and scene depth values has been implicitly integrated into the network structure, so if images with different focal lengths are used for network testing, the absolute scale of the 3D reconstruction results obtained may be inaccurate. Therefore, the whole network needs to be adjusted to adapt to the changes of different camera parameters, but doing so will make the update speed of each online learning slower. In order to solve this problem, a new idea of adjusting the output depth map is proposed. The basic idea is shown in Figure 4. The basic idea is to multiply the depth value of each pixel in the depth map by a scale coefficient To ensure the accuracy of the depth map;

3)逐块SGD方法：随机梯度下降(stochastic gradient descent，SGD)是现今主流的一种深度学习的最全优化算法。其主要思路是对于训练数据集，首先将其分成n个batch，每个batch包含m个样本。每次更新网络的参数都只利用一个batch的数据，而非整个训练集。3) Block-by-block SGD method: Stochastic gradient descent (SGD) is the most comprehensive optimization algorithm for deep learning that is mainstream today. The main idea is that for the training data set, it is first divided into n batches, and each batch contains m samples. Each time the parameters of the network are updated, only one batch of data is used, not the entire training set.

优点是：当训练数据很多时，使用batch可以减少机器的压力，并且可以更快地收敛；当训练集有很多冗余时(类似的样本出现多次)，batch方法收敛更快。The advantages are: when there is a lot of training data, using batch can reduce the pressure on the machine and converge faster; when the training set has a lot of redundancy (similar samples appear multiple times), the batch method converges faster.

缺点是：容易收敛到局部最优，并不是全局最优。The disadvantage is: it is easy to converge to the local optimum, not the global optimum.

我们提出的逐块梯度下降法(block-wise SGD)是在随机梯度下降法(stochasticgradient descent，SGD)之上进行的一次创新性改进。Our proposed block-wise gradient descent (block-wise SGD) is an innovative improvement on top of stochastic gradient descent (SGD).

本发明将ResNet-50用于在图片中提取不同级别的特征信息，这些特征信息随后会通过一系列下采样操作被编码到视差图之中。为了减少由于训练图片局限性造成CNN过拟合的风险，本发明提出了一种“逐块随机梯度下降”(block-wise SGD)的新方法，将ResNet-50之中的卷积层分为了5个块，如图5所示，其中每一个块具体表示为conv1,conv2_x,conv3_x,conv4_x,conv5_x。conv1由一个单一的7X7的全卷积层组成；conv2_x由一个3X3的卷积层和3个瓶颈构建块(每个瓶颈构建块为1X1 64，3X3 64，1X1 256)共10层组成；conv3_x由4个瓶颈构建块(每个瓶颈构建块为1X1 128，3X3 128，1X1 512)共12层组成；conv4_x由6个瓶颈构建块(每个瓶颈构建块为1X1 256，3X3 256，1X1 1024)共18层组成：conv5_x由3个瓶颈构建块(每个瓶颈构建块为1X1 512，3X3 512，1X1 2048)共9层组成，五个部分加起来构成了ResNet-50的50层结构。在每一次在线学习和更新的过程之中，每一次迭代k，只更新一个部分的参数W_i(i＝1,2,3,4,5)，保持剩余4个部分网络层参数不变。而在下一次迭代中，更新第i块(i＝(k+1)％5)参数,其他层参数保持不变，由此减少了每一次更新网络的复杂度。整个在线学习和更新的迭代一直在进行，直到停止条件被满足(比如迭代的次数的限制，或者是训练的损失函数达到预先设定的阈值)；The present invention uses ResNet-50 to extract different levels of feature information in pictures, and these feature information will be encoded into the disparity map through a series of down-sampling operations. In order to reduce the risk of CNN overfitting due to the limitations of training pictures, the present invention proposes a new method of "block-wise SGD" (block-wise SGD), which divides the convolutional layer in ResNet-50 into 5 blocks, as shown in Figure 5, where each block is specifically represented as conv1, conv2_x, conv3_x, conv4_x, conv5_x. Conv1 consists of a single 7X7 fully convolutional layer; conv2_x consists of a 3X3 convolutional layer and 3 bottleneck building blocks (each bottleneck building block is 1X1 64, 3X3 64, 1X1 256) a total of 10 layers; conv3_x consists of 4 bottleneck building blocks (each bottleneck building block is 1X1 128, 3X3 128, 1X1 512) a total of 12 layers; conv4_x consists of 6 bottleneck building blocks (each bottleneck building block is 1X1 256, 3X3 256, 1X1 1024) 18-layer composition: conv5_x consists of 3 bottleneck building blocks (each bottleneck building block is 1X1 512, 3X3 512, 1X1 2048) with a total of 9 layers, and the five parts add up to form the 50-layer structure of ResNet-50. During each online learning and updating process, and each iteration k, only a part of the parameters W _i (i=1, 2, 3, 4, 5) is updated, and the remaining 4 parts of the network layer parameters are kept unchanged. In the next iteration, the parameters of the i-th block (i=(k+1)%5) are updated, and the parameters of other layers remain unchanged, thereby reducing the complexity of updating the network each time. The entire iteration of online learning and updating continues until the stopping condition is met (such as the limit on the number of iterations, or the training loss function reaches a preset threshold);

4)选择性更新：每当有合适的训练数据产生时就进行在线学习和CNN网络模型的更新，这种做法容易造成不必要的计算开销。只要当前CNN网络模型对于当前场景可以提供足够准确的深度预测结果，就一直使用当前的CNN网络模型，直到迫不得已进行网络模型的调整。基于这个思路，本发明设计了一种“系统选择性更新”的工作模式，通过计算每一批输入到CNN网络模型之中图片的训练损失函数，一旦一批图片的所有图片的损失函数都大于预先设定的阈值L_high，将启动在线学习和更新的进程。在线学习和更新的进程将会一直进行，直到训练图片的损失函数降到L_low之下，或者是迭代的次数达到了预先设定的阈值。这个策略不但在很大程度上减少了计算量，而且可以满足对于网络深度预测结果精度的要求。4) Selective update: Online learning and updating of the CNN network model are carried out whenever suitable training data is generated, which is likely to cause unnecessary computational overhead. As long as the current CNN network model can provide sufficiently accurate depth prediction results for the current scene, the current CNN network model will be used until the network model has to be adjusted. Based on this idea, the present invention designs a "system selective update" working mode, by calculating the training loss function of each batch of pictures input into the CNN network model, once the loss functions of all pictures in a batch of pictures are greater than The preset threshold L _high will start the process of online learning and updating. The process of online learning and updating will continue until the loss function of the training image drops below L _low , or the number of iterations reaches the preset threshold. This strategy not only greatly reduces the amount of calculation, but also meets the requirements for the accuracy of network depth prediction results.

(3)深度尺度回归：具有准确尺度信息的相机姿态对于挑选合适的训练图片具有重要意义，直接影响到网络的输出结果。由于单目SLAM系统无法得到绝对的尺度，因此本发明提出了一种“基于自适应CNN进行准确尺度回归”的方法。我们绘制了D_sd(p)-D_gt(p)之间的关系图，如图6所示，其中(b)图中的黑线是场景中GroundTruth(真实值)的相机姿态，蓝线是单目SLAM得到的相机姿态，而红线则是采用了RANSAC算法回归得到尺度并作用到相机姿态之后的结果；发现D_sd(p)(单目SLAM得到的高梯度点p的深度)与D_gt(p)(像素p点的真实深度值)的比值代表了p点的绝对尺度信息。基于此本发明提出利用所有高梯度点的深度关系来回归出绝对尺度信息的思路，但是在实际应用时真实的深度信息是未知的，所以本发明采用CNN的预测结果来进行尺度回归。考虑到CNN预测深度存在着一些异常值的不利影响，我们分别实验了RANSAC算法和最小二乘两种算法进行尺度回归，实验结果如图6(a)中绿线和红线所示，证明采用RANSAC算法可以取得更准确的拟合效果，因此本发明实施例采用了RANSAC的方法。当我们利用这种方法计算出深度信息的绝对尺度之后，根据映射关系又可以得到姿态的尺度信息，会反过来提高单目SLAM系统的追踪精度，如图6(b)所示，本发明在TUM数据集中的两个场景进行了测试，其中蓝色部分为单目SLAM追踪的姿态，黑色部分为真实深度信息，红色部分将尺度信息增加到单目SLAM追踪之中的结果，显示该种方法可以较好的拟合追踪尺度。(3) Depth scale regression: The camera pose with accurate scale information is of great significance for selecting appropriate training pictures, which directly affects the output of the network. Since the monocular SLAM system cannot obtain an absolute scale, the present invention proposes a method of "accurate scale regression based on adaptive CNN". We draw the relationship diagram between D _sd (p)-D _gt (p), as shown in Fig. 6, where the black line in (b) is the camera pose of GroundTruth (true value) in the scene, and the blue line is The camera pose obtained by monocular SLAM, and the red line is the result of using the RANSAC algorithm to regress to obtain the scale and apply it to the camera pose; it is found that D _sd (p) (the depth of the high gradient point p obtained by monocular SLAM) and D _gt The ratio of (p) (true depth value of pixel point p) represents the absolute scale information of point p. Based on this, the present invention proposes the idea of using the depth relationship of all high gradient points to regress the absolute scale information, but the real depth information is unknown in practical application, so the present invention uses the prediction results of CNN to perform scale regression. Considering that there are some adverse effects of outliers in the predicted depth of CNN, we experimented with the RANSAC algorithm and the least squares algorithm for scale regression respectively. The experimental results are shown in the green and red lines in Figure 6(a), which proves that RANSAC The algorithm can achieve a more accurate fitting effect, so the embodiment of the present invention adopts the RANSAC method. After we use this method to calculate the absolute scale of the depth information, the scale information of the attitude can be obtained according to the mapping relationship, which will in turn improve the tracking accuracy of the monocular SLAM system, as shown in Figure 6(b). Two scenes in the TUM dataset were tested. The blue part is the attitude of monocular SLAM tracking, the black part is the real depth information, and the red part is the result of adding scale information to monocular SLAM tracking, showing this method. It can better fit the tracking scale.

(4)数据融合：对于每一个关键帧，我们可以得到两张深度图，一张是单目SLAM经过优化后的结果D_sd，另一张则是CNN的预测结果D_cnn。本发明设计了一种“NCC得分投票与高斯融合相结合”的方式，以达到最好的结合效果。该过程由两个部分组成，第一部分为NCC得分投票。NCC(Normalized Cross Correlation)是归一化互相关的缩写，用来计算两个图片区域A和B之间的相关性，计算公式为关键帧i之中的每一个像素点p，根据CNN预测的深度图D_cnn(p)和位姿变换将该像素点投影到与之相距最近的关键帧i-1中，投影的结果表示为p′_cnn；与之相似的是，将关键帧i之中的像素点p做另外一个投影，映射到关键帧i-1中记为p′_sd，但是投影是基于半稠密地图的结果D_sp(p)和绝对的尺度因子。分别在关键帧i-1中投影点p′_cnn和p′_sd附近选取小的区域，并且分别计算区域R(p)与R_cnn(p′)之间的归一化互相关系数NCC_cnn和区域R(p)与R_sd(p′)之间归一化互相关系数NCC_sd。如果NCC_cnn小于NCC_sd，那么表明半稠密深度图的深度预测结果要好于CNN的结果，选择D_sp(p)作为像素点p的最终深度预测值，否则，选择R_cnn(p′)。如果有些点只有CNN的预测结果，我们就使用R_cnn(p′)作为像素点p的最终深度。第二部分为高斯融合。对于上一步得到的深度图进一步处理，根据关键帧之间的上下文关系，并且结合关键帧深度图的不确定度图进行联合优化，这就是所谓的高斯融合。通过联合优化得到最终的深度图。实验之中我们在多个数据集的场景序列进行了测试，取得了比较好的效果。(4) Data fusion: For each key frame, we can get two depth maps, one is the optimized result D _sd of monocular SLAM, and the other is the prediction result D _cnn of CNN. The present invention designs a method of "combining NCC score voting and Gaussian fusion" to achieve the best combination effect. The process consists of two parts, the first being the NCC scoring vote. NCC (Normalized Cross Correlation) is the abbreviation of Normalized Cross Correlation, which is used to calculate the correlation between two image areas A and B. The calculation formula is For each pixel point p in the key frame i, according to the depth map D _cnn (p) predicted by CNN and the pose transformation Project the pixel point into the nearest key frame i-1, and the result of the projection is expressed as p'_cnn; similarly, another projection is made on the pixel point p in the key frame i, and it is mapped to Denote p' _sd in keyframe i-1, but the projection is based on the result D _sp (p) of the semi-dense map and the absolute scale factor. Select a small area near the projection points p′ _cnn and p′ _sd in the key frame i-1 respectively, and calculate the normalized _{cross-correlation coefficients NCC cnn} _and The normalized cross-correlation coefficient NCC _sd between the regions R(p) and R _sd (p′). If NCC _cnn is smaller than NCC _sd , it indicates that the depth prediction result of the semi-dense depth map is better than that of CNN, and D _sp (p) is selected as the final depth prediction value of pixel p, otherwise, R _cnn (p′) is selected. If some points only have CNN prediction results, we use R _cnn (p′) as the final depth of pixel p. The second part is Gaussian fusion. For the further processing of the depth map obtained in the previous step, joint optimization is performed according to the contextual relationship between key frames and the uncertainty map of the key frame depth map. This is the so-called Gaussian fusion. The final depth map is obtained through joint optimization. In the experiment, we tested the scene sequences of multiple datasets and achieved good results.

由于使用了CNN，我们的单目稠密SLAM系统需要采用GPU的加速才能够取得很好实时性效果。在TUM数据集和ICL-NUIM数据集上，我们的算法进行了测试，与当前以直接法为基础的最先新的单目SLAM系统LSD-SLAM相比，我们的姿态追踪精度的绝对轨迹误差由0.622m降低到了0.231m。关键帧深度图的完整率(深度图中误差在10％以内的点占全图的比例)由0.61％提高到了26.47％；与单纯采用弱监督深度预测网络相比，关键帧深度图的完整率由21.05％提高到了26.47％。此外，整个系统的运行速度也可以达到实时的效果。Due to the use of CNN, our monocular dense SLAM system requires GPU acceleration to achieve good real-time performance. On the TUM dataset and the ICL-NUIM dataset, our algorithm is tested, compared with the state-of-the-art monocular SLAM system LSD-SLAM based on the current direct method, the absolute trajectory error of our attitude tracking accuracy From 0.622m to 0.231m. The completeness rate of the key frame depth map (the proportion of points with an error within 10% in the depth map to the whole map) has increased from 0.61% to 26.47%; From 21.05% to 26.47%. In addition, the operating speed of the entire system can also achieve real-time effects.

进一步地，如图7所示，本发明还提供了一种基于在线学习深度预测网络的实时稠密单目SLAM系统，包括直接法单目SLAM模块1、在线自适应CNN预测模块2、绝对尺度回归模块3以及深度图融合模块4，其中：Further, as shown in Figure 7, the present invention also provides a real-time dense monocular SLAM system based on online learning depth prediction network, including direct method monocular SLAM module 1, online adaptive CNN prediction module 2, absolute scale regression Module 3 and depth map fusion module 4, in which:

所述直接法单目SLAM模块1，用于从单目视觉传感器通过旋转和平移运动采集的图片序列中选择关键帧，通过最小化高梯度点的光度误差优化得到关键帧的相机姿态，并且采用三角测量法预测高梯度点的深度得到当前帧的半稠密地图；The direct method monocular SLAM module 1 is used to select key frames from the image sequence collected by the monocular vision sensor through rotation and translation motion, and obtain the camera pose of the key frame by minimizing the photometric error optimization of high gradient points, and adopt Triangulation predicts the depth of high gradient points to obtain a semi-dense map of the current frame;

所述在线自适应CNN预测模块2，用于根据所述关键帧选择在线训练图片对，采用逐块随机梯度下降法根据所述在线训练图片对来在线训练更新CNN网络模型，并利用训练后CNN网络模型对当前帧图片进行深度预测得到稠密地图；The online adaptive CNN prediction module 2 is used to select an online training picture pair according to the key frame, and use block-by-block stochastic gradient descent method to train and update the CNN network model online according to the online training picture pair, and use the trained CNN The network model performs depth prediction on the current frame picture to obtain a dense map;

所述绝对尺度回归模块3，用于根据所述当前帧的半稠密地图和预测稠密地图进行深度尺度回归，得到当前帧深度信息的绝对尺度因子；The absolute scale regression module 3 is configured to perform depth scale regression according to the semi-dense map of the current frame and the predicted dense map to obtain an absolute scale factor of the depth information of the current frame;

所述深度图融合模块4，用于根据相机姿态将所述预测稠密地图通过位姿变换投影到上一关键帧中，并根据所述绝对尺度因子将所述半稠密地图投影到上一关键帧中，采用NCC得分投票方法根据所述两种投影结果选择所述当前帧的各像素深度预测值得到预测深度图，并对所述预测深度图进行高斯融合得到最终深度图。The depth map fusion module 4 is used to project the predicted dense map to the previous key frame through pose transformation according to the camera pose, and project the semi-dense map to the previous key frame according to the absolute scale factor In this method, the NCC score voting method is used to select the predicted depth values of each pixel of the current frame according to the two projection results to obtain a predicted depth map, and Gaussian fusion is performed on the predicted depth map to obtain a final depth map.

本领域的技术人员容易理解，以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。It is easy for those skilled in the art to understand that the above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention, All should be included within the protection scope of the present invention.

Claims

1. A real-time dense monocular SLAM method based on online learning depth prediction network, characterized in that, comprising the steps:

(1) Select the key frame from the image sequence collected by the monocular vision sensor through the rotation and translation motion, optimize the camera pose of the key frame by minimizing the photometric error of the high gradient point, and use the triangulation method to predict the depth of the high gradient point Get the semi-dense map of the current frame;

(2) Select an online training picture pair according to the key frame, use the block-by-block stochastic gradient descent method to train and update the CNN network model online according to the online training picture pair, and use the trained CNN network model to perform depth prediction on the current frame picture get a dense map;

(3) performing depth scale regression according to the semi-dense map of the current frame and the predicted dense map to obtain the absolute scale factor of the depth information of the current frame;

(4) Project the predicted dense map to the previous key frame through pose transformation according to the camera pose, and project the semi-dense map to the previous key frame according to the absolute scale factor, using the NCC score voting method Selecting the predicted depth values of each pixel in the current frame according to the two projection results to obtain a predicted depth map, and performing Gaussian fusion on the predicted depth map to obtain a final depth map.

2. The real-time dense monocular SLAM method based on online learning depth prediction network as claimed in claim 1, wherein the online training picture is selected according to the key frame, specifically: adopting the following constraints before and after the key frame In the frame picture, the selected picture frame and the key frame form a picture pair:

First, the camera motion constraint: the displacement in the horizontal direction between the two frames of pictures satisfies |t _x |>0.9*T, where T represents the baseline distance between the two frames of pictures;

Second, disparity constraints: For each pair of pictures, the optical flow method is used to calculate the average disparity Dis _avg in the vertical direction between the pictures, and only when Dis _avg is smaller than the preset threshold δ, the pair of pictures will be saved as candidate training pictures;

Third, diversity constraints: the same key frame can only generate a pair of training pictures;

Fourth, training pool capacity constraints: Whenever the number of training picture pairs reaches the set threshold V, the pictures in the training pool are sent to the network, the network is trained online, the network model obtained from the training is saved, and the training time is cleared. The pool continues to filter the training data.

3. the real-time dense monocular SLAM method based on online learning depth prediction network as claimed in claim 1 or 2, is characterized in that, adopt block-by-block stochastic gradient descent method to carry out online training update CNN network model according to described online training picture, Specifically:

The convolutional layer in ResNet-50 is divided into 5 blocks, each of which is specifically represented as conv1, conv2_x, conv3_x, conv4_x, conv5_x; conv1 consists of a single 7X7 full convolutional layer; conv2_x consists of a 3X3 The convolution layer and 3 bottleneck building blocks are composed of 10 layers; conv3_x is composed of 4 bottleneck building blocks with 12 layers; conv4_x is composed of 6 bottleneck building blocks with 18 layers; conv5_x is composed of 3 bottleneck building blocks with 9 layers , the five parts together constitute the 50-layer structure of ResNet-50;

In each online learning and updating process, each iteration k, only update a part of the parameters W _i (i=1,2,3,4,5), keep the remaining 4 parts of the network layer parameters unchanged, In the next iteration, update the i-th block parameter, where i=(k+1)%5; other layer parameters remain unchanged, and the entire iteration of online learning and updating continues until the preset stop condition is met.

4. the real-time dense monocular SLAM method based on online learning depth prediction network as claimed in claim 1 or 2, is characterized in that, described online training update CNN network model is selective update, specifically:

Calculate the training loss function of each batch of pictures input into the CNN network model. Once the loss function of all pictures in a batch of pictures is greater than the preset threshold L _high , the process of online learning and updating will be started. Online learning and updating The process will continue until the loss function of the training image drops below the threshold L _low , or the number of iterations reaches the preset threshold.

5. The real-time dense monocular SLAM method based on online learning depth prediction network according to claim 1 or 2, wherein the depth scale regression method is: RANSAC algorithm or least squares algorithm.

6. the real-time dense monocular SLAM method based on online learning depth prediction network as claimed in claim 1 or 2, is characterized in that, described prediction dense map is projected in last key frame by pose transformation, and Project the semi-dense map into the previous key frame according to the absolute scale factor, and use the NCC score voting method to select the predicted depth values of each pixel in the current frame according to the two projection results to obtain a predicted depth map, specifically: :

For each pixel point p in the key frame i, according to the dense map D _cnn (p) predicted by CNN and the pose transformation Project the pixel point into the key frame i-1 closest to it, and the result of the projection is expressed as _p'cnn ;

Make another projection of the pixel point p in the key frame i, and map it to the key frame i-1 as p′ _sd , the projection is based on the result D _sp (p) of the semi-dense map and the absolute scale factor;

Select a small area near the projection points p′ _cnn and p′ _sd in the key frame i-1 respectively, and calculate the normalized _{cross-correlation coefficients NCC cnn} _and The normalized cross-correlation coefficient NCC _sd between the region R(p) and R _sd (p′), if NCC _cnn is smaller than NCC _sd , it indicates that the depth prediction result of the semi-dense depth map is better than the result of CNN, choose D _sp ( p) as the final depth prediction value of pixel p, otherwise select R _cnn (p′), if some points only have CNN prediction results, use R _cnn (p′) as the final depth of pixel p.

7. The real-time dense monocular SLAM method based on the online learning depth prediction network as claimed in claim 1 or 2, wherein the Gaussian fusion is carried out to the predicted depth map to obtain the final depth map, specifically:

The depth map obtained by the NCC score voting method is further processed, according to the context relationship between key frames, and combined with the uncertainty map of the key frame depth map for joint optimization, and the final depth map is obtained through joint optimization.

8. the real-time dense monocular SLAM method based on online learning depth prediction network as claimed in claim 1 or 2, it is characterized in that, utilize CNN network model after training to carry out depth prediction to current frame picture and obtain dense map and also include:

Multiply the depth value of each pixel in the depth map by a scale factor

Among them, f _adapted is the focal length of the monocular camera used to obtain the training data online, B _adapted is the baseline of the binocular training picture, f _pre-train and B _pre-train are the focal length and the training picture used to train the original CNN network model, respectively. baseline.

9. the real-time dense monocular SLAM method based on online learning depth prediction network as claimed in claim 1 or 2, is characterized in that, described key frame is:

Define the entire image sequence or the first picture obtained by the camera in real time as a key frame. In addition to the first frame, a part of the following picture frames will also be defined as key frames. The principle of defining key frames is to monitor the current frame and the previous one. Whether the translation and rotation between keyframes reached a preset threshold.

10. A real-time dense monocular SLAM system based on online learning depth prediction network, characterized in that it includes a direct method monocular SLAM module, an online adaptive CNN prediction module, an absolute scale regression module and a depth map fusion module, wherein:

The direct method monocular SLAM module is used to select key frames from the image sequence collected by the monocular vision sensor through rotation and translation motion, and optimize the camera pose of the key frame by minimizing the photometric error of high gradient points, and adopts triangulation The measurement method predicts the depth of high gradient points to obtain a semi-dense map of the current frame;

The online adaptive CNN prediction module is used to select an online training picture pair according to the key frame, and use block-by-block stochastic gradient descent method to train and update the CNN network model online according to the online training picture pair, and use the trained CNN network The model performs depth prediction on the current frame picture to obtain a dense map;

The absolute scale regression module is used to perform depth scale regression according to the semi-dense map of the current frame and the predicted dense map to obtain the absolute scale factor of the depth information of the current frame;

The depth map fusion module is used to project the predicted dense map to the previous key frame through pose transformation according to the camera pose, and project the semi-dense map to the previous key frame according to the absolute scale factor , using the NCC score voting method to select the predicted depth values of each pixel of the current frame according to the two projection results to obtain a predicted depth map, and performing Gaussian fusion on the predicted depth map to obtain a final depth map.