CN111583345B - Method, device and equipment for acquiring camera parameters and storage medium
- Publication number
- CN111583345B (application CN202010387692.5A)
- Authority
- CN
- China
- Prior art keywords
- camera
- model
- depthnet
- motionnet
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/80—Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C25/00—Manufacturing, calibrating, cleaning, or repairing instruments or devices referred to in the other groups of this subclass
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Manufacturing & Machinery (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Image Analysis (AREA)
Abstract
The present application discloses a method, apparatus, device and storage medium for acquiring camera parameters, including: collecting original continuous frame images captured by a monocular camera; constructing a DepthNet model and a MotionNet model, where the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a backbone network for camera motion prediction and for producing a pixel confidence mask, together with a branch network for camera intrinsic parameter prediction; preprocessing the original continuous frame images, inputting them into the constructed models, training the models in an unsupervised manner with a joint loss function, and tuning the hyperparameters; and processing the images to be tested with the trained models to output, for each frame, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information. In this way, the camera does not need to be calibrated: a video can be fed in directly to obtain the camera intrinsic parameters, the camera motion and the depth map of every frame.
Description
Technical Field

The present invention relates to the fields of computer vision and photogrammetry, and in particular to a method, apparatus, device and storage medium for acquiring camera parameters.

Background Art

As one of the main tools of computer vision, the camera and the various algorithms built around it occupy an important position. Photogrammetry studies the imaging principles of cameras and focuses on how to recover real-world information from the pictures a camera takes. For many applications of computer vision and photogrammetry, such as industrial control, autonomous driving and robot navigation and pathfinding, the camera intrinsic parameters, the camera motion and the scene depth are all of great value, and a large number of computations involving photogrammetry and the imaging properties of cameras require these three kinds of information as input.

The intrinsic parameters of a camera include information such as its focal length. The camera's self-motion, also called ego-motion, describes the change of the camera's own pose, while the scene depth expresses the distance between every point in the camera's field of view and the camera's optical center, and is usually represented by a depth map. The process of obtaining the camera's intrinsic and extrinsic parameters is generally called camera calibration, and the process of obtaining the ego-motion is called visual odometry (VO).

For camera intrinsics, ego-motion and depth information, existing non-deep-learning methods usually rely on separate techniques to obtain each of them. Methods for obtaining the intrinsics require the camera to capture several (usually about 20) images of a calibration board from different angles; when the camera needs to be adjusted frequently, calibration has to be repeated frequently, and for application scenarios in which the camera device is not accessible, such a calibration method is unavailable. Methods for obtaining ego-motion and depth information have similar drawbacks: they only work correctly under certain assumptions (a static scene, scene consistency and Lambertian surfaces), and any condition that violates these assumptions affects their normal operation. Deep-learning-based techniques can, to varying degrees, remove the dependence on these prior assumptions and can obtain ego-motion and depth information simultaneously, which improves usability. However, they still require the camera intrinsics as input, so they cannot completely eliminate the inconvenience caused by camera calibration.

Therefore, how to overcome the limitations of existing solutions, which require camera calibration or a large amount of supervised training data, is a technical problem that those skilled in the art urgently need to solve.
Summary of the Invention

In view of this, the purpose of the present invention is to provide a method, apparatus, device and storage medium for acquiring camera parameters that can be trained without supervision and without calibrating the camera: taking continuous frames captured by a monocular camera as input, they output the depth map of every frame, the motion of the camera during shooting and the camera intrinsic parameters.

The specific scheme is as follows:

A method for acquiring camera parameters, including:

collecting original continuous frame images captured by a monocular camera;

constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a backbone network for camera motion prediction and for producing a pixel confidence mask, together with a branch network for camera intrinsic parameter prediction;

preprocessing the original continuous frame images, inputting them respectively into the constructed DepthNet model and MotionNet model, training the DepthNet model and the MotionNet model in an unsupervised manner with a joint loss function, and tuning the hyperparameters;

processing the images to be tested with the trained DepthNet model and MotionNet model, and outputting, for each frame of the images to be tested, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information.
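As a rough illustration of this inference step, the sketch below assumes two trained PyTorch modules, depth_net and motion_net, with the input and output behaviour described above; the checkpoint names, module interfaces and tensor layouts are assumptions made for the example, not the patent's actual code.

```python
import torch

# Assumed: depth_net maps a [B, 3, H, W] image to a [B, 1, H, W] depth map, and
# motion_net maps a stacked frame pair [B, 6, H, W] to camera translation, camera
# rotation, a 3x3 intrinsic matrix and a per-pixel confidence mask.
depth_net = torch.load("depth_net.pt")    # hypothetical checkpoint files
motion_net = torch.load("motion_net.pt")
depth_net.eval(); motion_net.eval()

def process_sequence(frames):
    """frames: list of [3, H, W] tensors taken from a monocular video."""
    with torch.no_grad():
        depths = [depth_net(f.unsqueeze(0)) for f in frames]           # depth map per frame
        motions = []
        for prev, cur in zip(frames[:-1], frames[1:]):
            pair = torch.cat([prev, cur], dim=0).unsqueeze(0)           # stack the two frames
            translation, rotation, intrinsics, mask = motion_net(pair)  # per-pair outputs
            motions.append((translation, rotation, intrinsics, mask))
    return depths, motions
```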
Preferably, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the DepthNet model consists of a first encoder and a first decoder;

inputting the preprocessed original continuous frame images into the DepthNet model for training specifically includes:

obtaining the preprocessed three-channel image through the first encoder and successively encoding the three-channel image into features of several granularities;

decoding with the first decoder by combining the features of different granularities;

outputting, through the first decoder, a single-channel depth map with the same size as the input three-channel image.

Preferably, in the above method for acquiring camera parameters provided by the embodiments of the present invention, successively encoding the three-channel image into features of several granularities with the first encoder and decoding with the first decoder by combining the features of different granularities specifically includes:

in the first encoder, forming the first-level feature encoding through a 2D convolution with a 7×7 kernel followed by batch normalization and a rectified linear unit;

connecting a max-pooling layer and two first residual modules to form the second-level feature encoding;

alternately connecting the second residual module and the first residual module to form the third-level, fourth-level and fifth-level feature encodings respectively;

inputting the first-level, second-level, third-level, fourth-level and fifth-level feature encodings into the first decoder;

in the first decoder, alternately using 2D transposed convolutions and 2D convolutions, combining the five levels of feature encodings stage by stage, and producing the output through a softplus activation function in the output layer.
Preferably, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the backbone network consists of a second encoder and a second decoder;

inputting the preprocessed original continuous frame images into the MotionNet model for training specifically includes:

obtaining two preprocessed adjacent frames through the second encoder;

in the second encoder, using seven cascaded 3×3 2D convolutional layers and connecting a 1×1 convolutional layer at the bottleneck to compress the number of output channels to six, with the first three channels outputting the camera translation and the last three channels outputting the camera rotation;

in the second decoder, using two parallel convolution paths with short-cut connections, combining the convolution output with the output of a bilinear interpolation to form the output of the Refine module, and outputting a pixel-level confidence mask used to determine whether each pixel participates in the computation of the joint loss function, while adding a penalty function to the pixel-level confidence mask to prevent training degeneration;

outputting the camera intrinsic parameter matrix through the branch network connected to the bottom-most encoder layer of the backbone network.

Preferably, in the above method for acquiring camera parameters provided by the embodiments of the present invention, outputting the camera intrinsic parameter matrix specifically includes:

in the branch network, multiplying the network prediction by the width and height of the image to obtain the actual focal lengths;

adding 0.5 to the network prediction and multiplying by the width and height of the image to obtain the pixel coordinates of the principal point;

diagonalizing the focal lengths into a 2×2 diagonal matrix, concatenating the column vector formed by the principal point coordinates, and appending a row vector to form the 3×3 intrinsic parameter matrix.
Preferably, in the above method for acquiring camera parameters provided by the embodiments of the present invention, preprocessing the original continuous frame images includes:

adjusting the resolution of the original continuous frame images and arranging and stitching them into multiple triple-frame images;

when each triple-frame image is input into the DepthNet model, outputting the depth map of each frame;

when each triple-frame image is input into the MotionNet model, outputting four times the camera motion between adjacent frames, the camera intrinsic parameters and the pixel confidence mask.
Preferably, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the joint loss function is calculated with the following formula:

L_total = a·L_R + b·L_smooth + c·Λ

where L_total is the joint loss function, L_R is the reprojection error function, a is the weight of the reprojection error function, L_smooth is the depth smoothing loss, b is the weight of the depth smoothing loss, Λ is the regularization penalty function of the pixel confidence mask, and c is the weight of the penalty function.
An embodiment of the present invention further provides an apparatus for acquiring camera parameters, including:

an image collection module for collecting original continuous frame images captured by a monocular camera;

a model construction module for constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a backbone network for camera motion prediction and for producing a pixel confidence mask, together with a branch network for camera intrinsic parameter prediction;

a model training module for preprocessing the original continuous frame images, inputting them respectively into the constructed DepthNet model and MotionNet model, training the DepthNet model and the MotionNet model in an unsupervised manner with a joint loss function, and tuning the hyperparameters;

a model prediction module for processing the images to be tested with the trained DepthNet model and MotionNet model and outputting, for each frame of the images to be tested, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information.

An embodiment of the present invention further provides a device for acquiring camera parameters, including a processor and a memory, wherein the processor implements the above method for acquiring camera parameters provided by the embodiments of the present invention when executing a computer program stored in the memory.

An embodiment of the present invention further provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the above method for acquiring camera parameters provided by the embodiments of the present invention.

It can be seen from the above technical solutions that the method, apparatus, device and storage medium for acquiring camera parameters provided by the present invention include: collecting original continuous frame images captured by a monocular camera; constructing a DepthNet model and a MotionNet model, where the DepthNet model comprises a network for outputting a single-channel depth map and the MotionNet model comprises a backbone network for camera motion prediction and pixel confidence masks together with a branch network for camera intrinsic parameter prediction; preprocessing the original continuous frame images and inputting them respectively into the constructed DepthNet model and MotionNet model, training both models in an unsupervised manner with a joint loss function, and tuning the hyperparameters; and processing the images to be tested with the trained DepthNet model and MotionNet model to output, for each frame, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information.

The present invention does not require camera calibration and places no additional restrictions on the usage scenario: any video captured by a monocular camera can be fed in directly to obtain the camera trajectory during shooting, the depth map of every frame and the camera intrinsic parameters, and with the camera intrinsics unknown, unsupervised learning with the joint loss function ensures that training proceeds normally. In addition, the invention provides a front-end solution with few constraints for computer vision applications that need camera intrinsics, camera motion and depth maps, and therefore has good application value.
Brief Description of the Drawings

In order to describe the technical solutions in the embodiments of the present invention or in the related art more clearly, the drawings required for the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.

Fig. 1 is a flowchart of the method for acquiring camera parameters provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the first encoder in the DepthNet model provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the first residual module in the first encoder provided by an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of the second residual module in the first encoder provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the first decoder in the DepthNet model provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the backbone network in the MotionNet model provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of the Refine module in the backbone network provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of the branch network in the MotionNet model provided by an embodiment of the present invention;

Fig. 9 is a schematic diagram of the arrangement of the training data provided by an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of the apparatus for acquiring camera parameters provided by an embodiment of the present invention.
Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

The present invention provides a method for acquiring camera parameters which, as shown in Fig. 1, includes the following steps:

S101: collect original continuous frame images captured by a monocular camera; it should be noted that the collected original continuous frame images can be extracted from the KITTI dataset;

S102: construct a DepthNet model and a MotionNet model; the DepthNet model includes a network for outputting a single-channel depth map, and the MotionNet model includes a backbone network for camera motion prediction and for producing a pixel confidence mask, together with a branch network for camera intrinsic parameter prediction; it should be noted that the intrinsic parameter prediction enables the present invention to extract accurate camera motion and depth information from arbitrary videos of unknown origin without camera calibration;

S103: preprocess the original continuous frame images, input them respectively into the constructed DepthNet model and MotionNet model, train the DepthNet model and the MotionNet model in an unsupervised manner with a joint loss function, and tune the hyperparameters; it should be noted that the joint loss function consists of the reprojection error, the depth smoothing loss and the regularization penalty function of the pixel confidence mask; it uses the relationship between consecutive frames captured by a monocular camera as the source of the supervision signal, which provides the training drive for the depth model and is the key to unsupervised learning; in addition, the model proposed by the present invention mainly contains hyperparameters such as the learning rate, the loss-function weights and the batch size, which need to be tuned to obtain the best combination;

S104: process the images to be tested with the trained DepthNet model and MotionNet model, and output, for each frame of the images to be tested, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information.

In the above method for acquiring camera parameters provided by the embodiments of the present invention, the DepthNet model and the MotionNet model are unsupervised deep-learning models: the camera does not need to be calibrated and there are no additional restrictions on the usage scenario. Any video captured by a monocular camera can be fed in directly to obtain the camera trajectory during shooting, the depth map of every frame and the camera intrinsic parameters, and with the camera intrinsics unknown, unsupervised learning with the joint loss function ensures that training proceeds normally. In addition, this provides a front-end solution with few constraints for computer vision applications that need camera intrinsics, camera motion and depth maps, and therefore has good application value.
In practical applications, the above models can be implemented with PyTorch and trained on a deep-learning workstation with two Intel Xeon E5 2678 v3 CPUs, 64 GB of main memory and four NVIDIA GeForce GTX 1080 Ti graphics cards, each with 12 GB of video memory. The present invention optimizes this machine for parallelism: the number of epochs is set to a multiple of 4, and in the data-reading stage the two CPUs each load half of the data into their corresponding main memory. Since in this machine each CPU is directly connected to two graphics cards through PCI-E lanes, letting the two CPUs load data separately makes maximum use of the bandwidth of each PCI-E lane and helps reduce data-transfer time. After the data have been transferred from main memory to video memory, the four graphics cards each start their own gradient computation; once all four have exhausted the current batch, they reach the synchronization point of the program and report their gradient information to the CPUs, which aggregate the gradients and update the model before the next cycle begins. The final effect is that while the GPUs perform gradient computations, the CPUs are already reading and preparing data, minimizing GPU idle time and improving overall running efficiency.
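The sketch below is a rough single-process approximation of this data-parallel training loop using PyTorch's DataParallel; the dummy model, random "triple-frame" tensors and hyperparameter values are placeholders standing in for the DepthNet/MotionNet pipeline and the KITTI-derived dataset, not the patent's actual training script.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.DataParallel(nn.Conv2d(3, 1, 3, padding=1)).cuda()   # replicates the module over all visible GPUs
dataset = TensorDataset(torch.randn(64, 3, 128, 1248))          # stand-in for the stitched triple frames
loader = DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(4):                        # the text sets the epoch count to a multiple of 4
    for (batch,) in loader:                   # CPU workers prepare the next batch while the GPUs compute
        out = model(batch.cuda(non_blocking=True))
        loss = out.abs().mean()               # placeholder loss standing in for the joint loss described later
        optimizer.zero_grad()
        loss.backward()                       # per-GPU gradients are gathered before the parameter update
        optimizer.step()
```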
In a specific implementation, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the DepthNet model can consist of a first encoder and a first decoder. The input of the DepthNet model is a three-channel picture captured by a monocular camera, and through the encoder-decoder structure it outputs a single-channel depth map with the same size as the input, which amounts to encoding every pixel of the input image. In addition, because of the high complexity of the model, a large number of residual building blocks are used to ensure that gradients propagate effectively and that the deep layers of the network are also trained well.

The above step S103 of inputting the preprocessed original continuous frame images into the DepthNet model for training can specifically include: first, obtaining the preprocessed three-channel image through the first encoder and successively encoding it into features of several granularities; then, decoding with the first decoder by combining the features of different granularities; and finally, outputting through the first decoder a single-channel depth map with the same size as the input three-channel image. As shown in Fig. 2, the first encoder can output features at five granularities.

Further, in a specific implementation, encoding the three-channel image into features of several granularities with the first encoder and decoding with the first decoder by combining features of different granularities can specifically include: in the first encoder, the first-level feature encoding is formed by a 2D convolution with a 7×7 kernel followed by batch normalization and a rectified linear unit; a max-pooling layer and two first residual modules (residual_block_A) are then connected to form the second-level feature encoding; finally, the second residual module (residual_block_B) and the first residual module are connected alternately to form the third-level, fourth-level and fifth-level feature encodings respectively. Next, the first-level to fifth-level feature encodings are fed into the first decoder; in the first decoder, 2D transposed convolutions and 2D convolutions are used alternately, the five levels of feature encodings are combined stage by stage, and the output layer uses a softplus activation function.

It should be noted that the first encoder contains two kinds of residual modules, residual_block_A and residual_block_B. As shown in Fig. 3, residual_block_A mainly consists of two 3×3 convolutional layers whose number of output channels equals the number of input channels of the residual module, so residual_block_A does not change the number of channels of the tensor. The part of the residual module formed by the two consecutive convolutional layers is called the main branch, and the branch path that connects the input directly to the output of the main branch is called the short-cut. The main branch of residual_block_B is similar to that of residual_block_A, but its short-cut contains some conditional logic: as shown in Fig. 4, when the numbers of input and output channels are not equal, the short-cut applies a 1×1 convolution to the input tensor, and this convolution adjusts the input and output channel counts to match; when the numbers of input and output channels are equal, the stride is further examined. When the stride is 1, the short-cut is simply the input tensor; when the stride is not 1, the dimensions of the output tensor differ from those of the input tensor, and a max-pooling layer is added on the input tensor to compensate for the difference. As shown in Fig. 2, out_channels and stride are supplied from outside the module, and the three out branches represent the outputs in the three cases.
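To illustrate these two residual modules, the PyTorch sketch below follows the description above; the class names, the exact ordering of convolution, batch normalization and activation, the padding choices and the stride handling in the 1×1 short-cut are assumptions made for the sake of a runnable example, not the patent's reference implementation.

```python
import torch
from torch import nn

class ResidualBlockA(nn.Module):
    """Two 3x3 convolutions; the output channel count equals the input channel
    count, so the short-cut is the identity (residual_block_A above)."""
    def __init__(self, channels):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.main(x) + x)

class ResidualBlockB(nn.Module):
    """Same main branch, but the short-cut contains conditional logic:
    a 1x1 convolution when the channel counts differ, a max-pooling layer when
    only the stride differs from 1, and the identity otherwise."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        if in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride=stride)
        elif stride != 1:
            self.shortcut = nn.MaxPool2d(kernel_size=1, stride=stride)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return torch.relu(self.main(x) + self.shortcut(x))
```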
In addition, it should be noted that, as shown in Fig. 5, the first decoder takes the five levels of feature encodings of the first encoder as input, alternates 2D transposed convolutions and 2D convolutions, combines the five levels of feature encodings stage by stage, uses a softplus activation function in the output layer, and finally outputs a single-channel depth map whose size equals that of the encoder input. The concat_and_pad in Fig. 5 is a composite operation: the output of the 2D transposed convolution is first concatenated with the output of the next encoder level along the third dimension, a padding step is then applied, and the result is fed into the subsequent 2D convolution. The size of the final output depth map is [B, h, w, 1], where B is the batch size, h and w are the height and width of the picture, and 1 indicates that the depth map has a single channel.

In a specific implementation, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the backbone network in the MotionNet model can consist of a second encoder and a second decoder.

The above step S103 of inputting the preprocessed original continuous frame images into the MotionNet model for training can specifically include: first, obtaining two preprocessed adjacent frames through the second encoder; then, as shown in Fig. 6, in the second encoder, using seven cascaded 2D convolutional layers with 3×3 kernels and connecting a 1×1 convolutional layer at the bottleneck of the second encoder to compress the number of output channels to six, with the first three channels outputting the camera translation and the last three channels outputting the camera rotation; next, in the second decoder, using two parallel convolution paths with short-cut connections similar to those of the residual modules, combining the convolution output with the output of a bilinear interpolation to form the output of the Refine module, and outputting a pixel-level confidence mask used to determine whether each pixel participates in the computation of the joint loss function (the excluded pixels cannot take part in the computation of the reprojection loss because of scene translation, rotation, occlusion and similar factors), while adding a penalty function to the pixel-level confidence mask to prevent training degeneration; finally, outputting the camera intrinsic parameter matrix through the branch network connected to the bottom-most encoder layer of the backbone network.
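As a sketch of this motion branch, the module below cascades seven 3×3 convolutions and a 1×1 bottleneck that compresses the output to six channels, three for translation and three for rotation; the per-layer channel widths, strides and the final spatial averaging are assumptions, since the patent text only fixes the kernel sizes and the six output channels.

```python
import torch
from torch import nn

class MotionEncoder(nn.Module):
    def __init__(self, in_channels=6, widths=(16, 32, 64, 128, 256, 512, 1024)):
        super().__init__()
        layers, prev = [], in_channels
        for w in widths:                       # seven cascaded 3x3 convolutional layers
            layers += [nn.Conv2d(prev, w, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            prev = w
        self.features = nn.Sequential(*layers)
        self.bottleneck = nn.Conv2d(prev, 6, kernel_size=1)   # compress to six output channels

    def forward(self, frame_pair):
        # frame_pair: two adjacent RGB frames stacked along the channel axis, [B, 6, H, W]
        feat = self.features(frame_pair)                      # bottleneck features, also usable by the intrinsics branch
        motion = self.bottleneck(feat).mean(dim=(2, 3))       # global average over the spatial dimensions
        translation, rotation = motion[:, :3], motion[:, 3:]  # first three / last three channels
        return translation, rotation, feat
```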
It should be understood that, in the backbone network of MotionNet, the second decoder is composed of Refine modules. As shown in Fig. 7, conv_input denotes the decoder-side input and Refine_input denotes the output of the previous Refine stage. To handle the difference in resolution, the present invention uses bilinear interpolation to resize the output of the previous Refine stage.
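A rough sketch of such a Refine stage is given below: the previous stage's output is resized with bilinear interpolation to the resolution of the decoder-side input, passed through two parallel convolution paths with a short-cut, and combined with the interpolated tensor. The channel handling and the exact way the paths are summed are assumptions; only the overall structure follows the description above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class Refine(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # two parallel convolution paths over the combined input
        self.path_a = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.path_b = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, conv_input, refine_input):
        up = F.interpolate(refine_input, size=conv_input.shape[-2:],
                           mode="bilinear", align_corners=False)  # resize the previous Refine output
        x = conv_input + up                                       # short-cut style combination
        return up + self.path_a(x) + self.path_b(x)               # convolution outputs combined with the interpolated tensor
```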
In a specific implementation, in the above method for acquiring camera parameters provided by the embodiments of the present invention, outputting the camera intrinsic parameter matrix specifically includes: in the branch network, multiplying the network prediction by the width and height of the image to obtain the actual focal lengths; adding 0.5 to the network prediction and multiplying by the width and height of the image to obtain the pixel coordinates of the principal point; and diagonalizing the focal lengths into a 2×2 diagonal matrix, concatenating the column vector formed by the principal point coordinates, and appending a row vector to form the 3×3 intrinsic parameter matrix.

Specifically, as shown in Fig. 8, the "bottleneck" on the left represents the bottom-most feature output by the encoder of the backbone network, whose size is [B, 1, 1, 1024], where B is the batch size. Two parallel 1×1 convolutions are used to predict the focal lengths fx, fy and the principal point coordinates cx, cy of the intrinsic matrix; to make the network easier to train, both convolution paths actually predict a small number. For the focal lengths, what is predicted is the ratio of the actual focal length to the image width and height, so the present invention multiplies the network prediction by the width and height of the image. For the principal point coordinates, what is predicted is the ratio of the coordinate values to the image width and height; since the principal point tends to lie near the center of the image, the present invention adds 0.5 to the network prediction and then multiplies by the width and height to obtain the pixel coordinates of the principal point. Finally, the focal lengths are diagonalized into a 2×2 diagonal matrix, the column vector formed by the principal point coordinates is concatenated, and the row vector [0, 0, 1] is appended, finally forming the 3×3 intrinsic parameter matrix.
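The assembly of the intrinsic matrix from the two raw predictions can be illustrated as follows; the function and argument names are placeholders, and only the scaling rules and the matrix layout follow the description above.

```python
import torch

def build_intrinsics(focal_pred, center_pred, width, height):
    """focal_pred, center_pred: [B, 2] raw outputs of the two 1x1 convolutions.
    Returns a [B, 3, 3] intrinsic matrix assembled as described above."""
    size = torch.tensor([width, height], dtype=focal_pred.dtype, device=focal_pred.device)
    focal = focal_pred * size                   # predicted ratios -> actual focal lengths (fx, fy)
    center = (center_pred + 0.5) * size         # offset from the image center -> principal point (cx, cy)
    K = torch.diag_embed(focal)                 # 2x2 diagonal matrix of focal lengths
    K = torch.cat([K, center.unsqueeze(-1)], dim=2)                  # append the principal point column
    last_row = torch.tensor([0.0, 0.0, 1.0], dtype=K.dtype, device=K.device)
    return torch.cat([K, last_row.expand(K.shape[0], 1, 3)], dim=1)  # append the row [0, 0, 1]
```

For example, for the 416×128 frames used later in the text, build_intrinsics(f, c, 416, 128) would produce one 3×3 matrix per element of the batch.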
In a specific implementation, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the preprocessing of the original continuous frame images in step S103 can include: adjusting the resolution of the original continuous frame images and arranging and stitching them into multiple triple-frame images; when each triple-frame image is input into the DepthNet model, outputting the depth map of each frame; and when each triple-frame image is input into the MotionNet model, outputting four times the camera motion between adjacent frames, the camera intrinsic parameters and the pixel confidence mask.

It should be understood that, according to the model and training method adopted in the present invention, each training step of the DepthNet model requires one monocular colour image (i.e. a three-channel image) and each training step of the MotionNet model requires two temporally consecutive images (i.e. two adjacent frames). To improve the efficiency of data reading, the present invention preprocesses the original continuous frame images: as shown in Fig. 9, every three consecutive images are stitched into one, where (a) represents the original continuous frame images in the dataset and (b) represents the triple-frame images stitched together after preprocessing. After such processing, every triple-frame image read yields two pairs of adjacent pictures. To reduce the computational load, the original images are scaled down proportionally while being stitched; the resolution of all images can finally be unified to 416×128, and the resolution of a single triple-frame image can be 1248×128.
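A minimal sketch of this stitching step might look like the following; it assumes the frames are already loaded as PIL images, and the helper name and file handling are placeholders rather than the patent's actual preprocessing script.

```python
from PIL import Image

FRAME_W, FRAME_H = 416, 128            # per-frame resolution stated in the text

def make_triple_frame(frames):
    """Stitch three consecutive frames into one 1248x128 triple-frame image."""
    assert len(frames) == 3
    canvas = Image.new("RGB", (3 * FRAME_W, FRAME_H))
    for i, frame in enumerate(frames):
        canvas.paste(frame.resize((FRAME_W, FRAME_H)), (i * FRAME_W, 0))
    return canvas

# Example: group a video's frames into non-overlapping triples and stitch each one.
# triples = [make_triple_frame(frames[i:i + 3]) for i in range(0, len(frames) - 2, 3)]
```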
Specifically, the training process is carried out in units of triple frames. Each time a triple frame is read out, the DepthNet model is first used to generate the depth map of each frame; the MotionNet model is then used to generate the camera motion, the pixel confidence mask and the camera intrinsics from frame 1 to frame 2, and in the same way from frame 2 to frame 3, from frame 3 to frame 2 and from frame 2 to frame 1. Four predictions of the camera intrinsics are thus obtained, and their average is taken as the intrinsics associated with this triple. The joint loss function can be applied once between each pair of adjacent frames, so it is used four times in total (1-2, 2-3, 3-2 and 2-1), and the four loss values are accumulated as the loss associated with this triple frame. During actual training, each data read yields batch-size triple frames, which are computed in parallel; back-propagation is then performed to update the models.
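One such training step can be sketched as follows, assuming depth_net, motion_net and a joint_loss callable with the interfaces suggested by the text; all names and signatures here are placeholders for illustration.

```python
import torch

def training_step(triple, depth_net, motion_net, joint_loss):
    """triple: list of three adjacent frames, each a [B, 3, H, W] tensor."""
    depths = [depth_net(frame) for frame in triple]           # one depth map per frame

    pairs = [(0, 1), (1, 2), (2, 1), (1, 0)]                  # frame pairs 1-2, 2-3, 3-2, 2-1
    total_loss, intrinsics_preds = 0.0, []
    for src, tgt in pairs:
        pair = torch.cat([triple[src], triple[tgt]], dim=1)   # stack the two frames along the channel axis
        translation, rotation, intrinsics, mask = motion_net(pair)
        intrinsics_preds.append(intrinsics)
        total_loss = total_loss + joint_loss(triple[src], triple[tgt],
                                             depths[src], depths[tgt],
                                             translation, rotation, intrinsics, mask)

    mean_intrinsics = torch.stack(intrinsics_preds).mean(dim=0)   # average of the four intrinsics predictions
    return total_loss, mean_intrinsics
```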
In a specific implementation, in the above method for acquiring camera parameters provided by the embodiments of the present invention, the joint loss function can be calculated with the following formula:

L_total = a·L_R + b·L_smooth + c·Λ (1)

where L_total is the joint loss function, L_R is the reprojection error function, a is the weight of the reprojection error function, L_smooth is the depth smoothing loss (also called the L1 norm of the depth values: the more outliers and sharp points the depth map contains, the larger the smoothing loss), b is the weight of the depth smoothing loss, Λ is the regularization penalty function of the pixel confidence mask, and c is the weight of the penalty function.
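A schematic implementation of this weighted combination is sketched below, with a simple gradient-based L1 term standing in for the patent's depth smoothing loss; the weight values are arbitrary placeholders, not the tuned hyperparameters.

```python
import torch

def depth_smoothness(depth):
    """L1 norm of the depth gradients: grows with outliers and sharp points in the depth map."""
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs().mean()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs().mean()
    return dx + dy

def total_loss(reproj_loss, depth, mask_penalty, a=1.0, b=0.1, c=0.01):
    """Weighted joint loss; a, b, c are the weights of the three terms in formula (1)."""
    return a * reproj_loss + b * depth_smoothness(depth) + c * mask_penalty
```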
In formula (1), L_R is the reprojection error, computed over pixel coordinates i, j: the pixel confidence mask gives the confidence of the pixel at (i, j), the function φ denotes the bilinear interpolation used to sample the reprojected view, and I_s(i, j) is the real view; L_R accumulates the confidence-weighted difference between the reprojected view and the real view over all pixels.
The reprojection used in formula (1) maps pixel coordinates between views using the camera intrinsic matrix K, the camera rotation matrix R and the camera translation vector t: p_t denotes a pixel coordinate before projection with associated depth value D_t(p_t), and p_s denotes the corresponding pixel coordinate after reprojection with associated depth value D_s(p_s).
Λ in formula (1) is the average of H(i, j) taken over all pixels, where H(i, j) is defined as:
H(i,j) = -Σ_{i,j} M(i,j)·log(s(i,j)) (5)
Its meaning is the cross-entropy of S(i, j); S(i, j) is in turn defined as the cross-entropy of the pixel confidence mask.
The penalty function Λ prevents the network from predicting every entry of the pixel confidence mask as "untrusted" (i.e. a mask value of 0 at every pixel position (i, j)); in that case the main part of the loss function, L_R, would simply become 0. Since deep learning tends to make the loss function as small as possible, it is easy to fall into this "everything untrusted" state if no penalty function is used, and although the loss function is then small, it carries no practical meaning. The penalty function Λ takes a larger value the more "untrusted pixels" there are in the confidence mask.
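One simple regularizer consistent with this stated behaviour (a larger penalty as more pixels are marked untrusted) is sketched below; it is only an illustrative stand-in, not the patent's exact cross-entropy formulation.

```python
import torch

def mask_penalty(mask, eps=1e-6):
    """mask: per-pixel confidence values in (0, 1); the penalty grows as the values
    approach 0, so predicting every pixel as 'untrusted' is no longer free."""
    return -torch.log(mask.clamp(min=eps)).mean()
```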
Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for acquiring camera parameters. Since the principle by which this apparatus solves the problem is similar to that of the aforementioned method for acquiring camera parameters, the implementation of the apparatus can refer to the implementation of the method for acquiring camera parameters, and repeated parts are not described again.

In a specific implementation, the apparatus for acquiring camera parameters provided by the embodiments of the present invention, as shown in Fig. 10, specifically includes:

an image collection module 11 for collecting original continuous frame images captured by a monocular camera;

a model construction module 12 for constructing a DepthNet model and a MotionNet model, wherein the DepthNet model comprises a network for outputting a single-channel depth map, and the MotionNet model comprises a backbone network for camera motion prediction and for producing a pixel confidence mask, together with a branch network for camera intrinsic parameter prediction;

a model training module 13 for preprocessing the original continuous frame images, inputting them respectively into the constructed DepthNet model and MotionNet model, training the DepthNet model and the MotionNet model in an unsupervised manner with a joint loss function, and tuning the hyperparameters;

a model prediction module 14 for processing the images to be tested with the trained DepthNet model and MotionNet model and outputting, for each frame of the images to be tested, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information.

In the above apparatus for acquiring camera parameters provided by the embodiments of the present invention, through the interaction of the above four modules, the camera does not need to be calibrated and no global information needs to be used: it suffices to record a video while the camera undergoes translational and rotational motion to obtain important data including the camera intrinsics and the depth maps. The apparatus can serve as a front-end algorithm for other computer vision applications, with few restrictions on the usage scenario and convenient application.

For more specific working processes of the above modules, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
Correspondingly, an embodiment of the present invention further discloses a device for acquiring camera parameters, including a processor and a memory, wherein the processor implements the method for acquiring camera parameters disclosed in the foregoing embodiments when executing a computer program stored in the memory.

For a more specific process of the above method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.

Further, the present invention also discloses a computer-readable storage medium for storing a computer program; when the computer program is executed by a processor, it implements the method for acquiring camera parameters disclosed above.

For a more specific process of the above method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be referred to one another. Since the apparatuses, devices and storage media disclosed in the embodiments correspond to the methods disclosed in the embodiments, their description is relatively brief, and for the relevant parts reference may be made to the description of the methods.
Those skilled in the art may further realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of this application.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

In summary, the method, apparatus, device and storage medium for acquiring camera parameters provided by the embodiments of the present invention include: collecting original continuous frame images captured by a monocular camera; constructing a DepthNet model and a MotionNet model, where the DepthNet model comprises a network for outputting a single-channel depth map and the MotionNet model comprises a backbone network for camera motion prediction and pixel confidence masks together with a branch network for camera intrinsic parameter prediction; preprocessing the original continuous frame images and inputting them respectively into the constructed DepthNet model and MotionNet model, training both models in an unsupervised manner with a joint loss function, and tuning the hyperparameters; and processing the images to be tested with the trained DepthNet model and MotionNet model to output, for each frame, a depth map, the camera motion, the camera intrinsic parameters and a pixel confidence mask containing scene motion information. The present invention does not require camera calibration and places no additional restrictions on the usage scenario: any video captured by a monocular camera can be fed in directly to obtain the camera trajectory during shooting, the depth map of every frame and the camera intrinsic parameters, and with the camera intrinsics unknown, unsupervised learning with the joint loss function ensures that training proceeds normally. In addition, it provides a front-end solution with few constraints for computer vision applications that need camera intrinsics, camera motion and depth maps, and therefore has good application value.

Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprising", "including" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.

The method, apparatus, device and storage medium for acquiring camera parameters provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help understand the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387692.5A CN111583345B (en) | 2020-05-09 | 2020-05-09 | Method, device and equipment for acquiring camera parameters and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111583345A CN111583345A (en) | 2020-08-25 |
CN111583345B true CN111583345B (en) | 2022-09-27 |
Family
ID=72117146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010387692.5A Active CN111583345B (en) | 2020-05-09 | 2020-05-09 | Method, device and equipment for acquiring camera parameters and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583345B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114531580B (en) * | 2020-11-23 | 2023-11-21 | 北京四维图新科技股份有限公司 | Image processing method and device |
CN112606000B (en) * | 2020-12-22 | 2022-11-18 | 上海有个机器人有限公司 | Method for automatically calibrating robot sensor parameters, calibration room, equipment and computer medium |
CN113792817A (en) * | 2021-09-29 | 2021-12-14 | 苏州科达科技股份有限公司 | Image processing hyper-parameter prediction method, system, device and storage medium |
CN114494461B (en) * | 2022-01-25 | 2024-12-03 | 广州极飞科技股份有限公司 | Camera intrinsic parameter calibration method, device, unmanned equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108665496A (en) * | 2018-03-21 | 2018-10-16 | 浙江大学 | A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method |
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | 华中科技大学 | The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion |
CN110009674A (en) * | 2019-04-01 | 2019-07-12 | 厦门大学 | A real-time calculation method of monocular image depth of field based on unsupervised deep learning |
CN110148179A (en) * | 2019-04-19 | 2019-08-20 | 北京地平线机器人技术研发有限公司 | A kind of training is used to estimate the neural net model method, device and medium of image parallactic figure |
CN110503680A (en) * | 2019-08-29 | 2019-11-26 | 大连海事大学 | Unsupervised convolutional neural network monocular scene depth estimation method |
CN110738697A (en) * | 2019-10-10 | 2020-01-31 | 福州大学 | Monocular depth estimation method based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3411755A4 (en) * | 2016-02-03 | 2019-10-09 | Sportlogiq Inc. | Systems and methods for automated camera calibration |
CN106157307B (en) * | 2016-06-27 | 2018-09-11 | 浙江工商大学 | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF |
Non-Patent Citations (4)
Title |
---|
A review of light-field camera imaging models and parameter calibration methods; Zhang Chunping et al.; Chinese Journal of Lasers; 2016-06-10 (No. 06); 270-281 *
Research on camera pose estimation algorithms based on unsupervised learning; Wu Yantong; China Master's Theses Full-text Database (Information Science and Technology); 2019-08-15; I138-927 *
A camera pose estimation method for dynamic scenes based on deep learning; Lu Hao et al.; High Technology Letters; 2020-01-15 (No. 01); 41-47 *
A haze scene image translation algorithm based on generative adversarial networks; Xiao Jinsheng et al.; Chinese Journal of Computers; 2019-09-11 (No. 01); 165-176 *
Also Published As
Publication number | Publication date |
---|---|
CN111583345A (en) | 2020-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111583345B (en) | Method, device and equipment for acquiring camera parameters and storage medium | |
CN112308200B (en) | Searching method and device for neural network | |
Lan et al. | MADNet: A fast and lightweight network for single-image super resolution | |
CN111798400B (en) | Reference-free low-light image enhancement method and system based on generative adversarial network | |
CN111402130B (en) | Data processing method and data processing device | |
US20240062347A1 (en) | Multi-scale fusion defogging method based on stacked hourglass network | |
CN111105352A (en) | Super-resolution image reconstruction method, system, computer device and storage medium | |
CN112862689A (en) | Image super-resolution reconstruction method and system | |
CN115082308A (en) | Video super-resolution reconstruction method and system based on multi-scale local self-attention | |
CN108765282B (en) | Real-time super-resolution method and system based on FPGA | |
CN116205962B (en) | Monocular depth estimation method and system based on complete context information | |
CN111476835A (en) | Unsupervised depth prediction method, system and device for consistency of multi-view images | |
CN114926876A (en) | Image key point detection method and device, computer equipment and storage medium | |
Yang et al. | Blind VQA on 360° video via progressively learning from pixels, frames, and video | |
CN113066018A (en) | Image enhancement method and related device | |
CN114885112B (en) | Method and device for generating high frame rate video based on data fusion | |
WO2023185284A1 (en) | Video processing method and apparatuses | |
CN114612305B (en) | An event-driven video super-resolution method based on stereogram modeling | |
CN118864562B (en) | Lightweight binocular stereo matching method based on step-by-step long-range capture and detail restoration | |
CN114820299A (en) | A kind of non-uniform motion blur super-resolution image restoration method and device | |
CN118229632B (en) | Display screen defect detection method, model training method, device, equipment and medium | |
CN111726621B (en) | Video conversion method and device | |
Jia et al. | Learning rich information for quad bayer remosaicing and denoising | |
Nottebaum et al. | Efficient feature extraction for high-resolution video frame interpolation | |
CN118485783A (en) | Multi-view 3D reconstruction method and system based on visual center and implicit attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |