
CN116342800B - Semantic three-dimensional reconstruction method and system for multi-mode pose optimization - Google Patents


Info

Publication number
CN116342800B
Authority
CN
China
Prior art keywords: pose, image, semantic, rgb, rgb image
Prior art date
Legal status: Active
Application number
CN202310181777.1A
Other languages
Chinese (zh)
Other versions
CN116342800A
Inventor
孙庆伟
晁建刚
陈炜
林万洪
许振瑛
胡福超
Current Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
China Astronaut Research and Training Center
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
China Astronaut Research and Training Center
Priority date
Filing date
Publication date
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University and China Astronaut Research and Training Center
Priority: CN202310181777.1A
Publication of CN116342800A
Application granted
Publication of CN116342800B


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This application belongs to the technical field of semantic three-dimensional reconstruction, and specifically relates to a semantic three-dimensional reconstruction method and system with multi-modal pose optimization. The method includes: collecting an RGB image and a depth image of a target area; performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image; obtaining the global pose of the RGB image by using the RGB image, the depth image, and the final mask map; and obtaining a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image. The technical solution provided by this application has low computing-power requirements and wide applicability, and improves the real-time performance and efficiency of semantic three-dimensional reconstruction.

Description

A Semantic 3D Reconstruction Method and System with Multi-modal Pose Optimization

Technical Field

The invention belongs to the technical field of semantic 3D reconstruction, and specifically relates to a semantic 3D reconstruction method and system with multi-modal pose optimization.

Background

Semantic 3D reconstruction semantically annotates a reconstructed 3D model, segmenting structures with different category attributes out of the overall model for use by downstream tasks. One of the important factors determining the quality of semantic 3D reconstruction is the computation of the camera pose, which affects both the quality of the reconstructed model and the distribution of semantic information. Localization in current algorithms relies on purely geometric features and requires a complex screening mechanism to determine matching points. Semantic 3D reconstruction can be divided into three parts: semantic segmentation, 3D reconstruction, and semantic mapping, where 3D reconstruction includes pose computation and mapping.

In existing algorithms these parts are relatively independent and communicate in a loosely coupled manner. For example, semantic 3D reconstruction mostly uses a GPU-based dense ICP algorithm to compute the camera pose, while semantic segmentation also competes for GPU resources, forcing compromises in runtime efficiency; the overall framework is too heavy, and the global optimization is large in scale. Therefore, the semantic segmentation and reconstruction methods used by current semantic 3D reconstruction algorithms have poor real-time performance, demand substantial computing power, place a heavy burden on the overall operation of the algorithm, and require a high-performance computer to run smoothly.

Summary of the Invention

To overcome, at least to some extent, the problems in the related art, this application provides a semantic 3D reconstruction method and system with multi-modal pose optimization.

According to a first aspect of the embodiments of this application, a semantic 3D reconstruction method with multi-modal pose optimization is provided. The method includes:

collecting an RGB image and a depth image of a target area;

performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image;

obtaining the global pose of the RGB image by using the RGB image, the depth image, and the final mask map;

obtaining a global semantic 3D point cloud according to the depth image and the global pose of the RGB image.

Preferably, performing semantic segmentation on the RGB image to obtain the final mask map of the RGB image includes:

using a deep-learning-based STDC network to semantically segment the RGB image and obtain the semantic category probabilities of each pixel in the RGB image;

taking, for each pixel, the category code of the most probable category among its semantic category probabilities as the pixel's mask, to obtain an initial mask map of the RGB image;

using the superpixel segmentation method gSLICr to perform edge optimization on the initial mask map to obtain the final mask map of the RGB image.

Preferably, obtaining the global pose of the RGB image by using the RGB image, the depth image, and the final mask map includes:

extracting the ORB feature points of the RGB image;

obtaining the depth values of the ORB feature points in the depth image, and using those depth values to back-project the two-dimensional coordinates of the ORB feature points, obtaining the coordinates of the ORB feature points in three-dimensional space;

obtaining the global pose of the RGB image with a sparse SLAM method, based on the 3D coordinates of the ORB feature points and the final mask map.

Preferably, obtaining the global pose of the RGB image with the sparse SLAM method, based on the 3D coordinates of the ORB feature points and the final mask map, includes:

extracting ORB feature points with identical BRIEF descriptors in two adjacent RGB frames;

using the final mask map to eliminate outliers among the ORB feature points with identical BRIEF descriptors in the two adjacent RGB frames, and taking the remaining ORB feature points with identical BRIEF descriptors as matching points;

using the PnP algorithm to obtain the first inter-frame pose of the two adjacent RGB frames, based on the 3D coordinates of the matching points in the earlier frame and the 2D coordinates of the matching points in the later frame;

obtaining the global pose of the RGB images by bundle adjustment, based on the first inter-frame poses.

Preferably, using the final mask map to eliminate outliers among the ORB feature points with identical BRIEF descriptors in two adjacent RGB frames includes:

determining, according to the final mask map, the mask of each ORB feature point in every pair of ORB feature points with identical BRIEF descriptors in the two adjacent RGB frames;

when the masks of the two ORB feature points in a pair with identical BRIEF descriptors differ, treating that pair of ORB feature points as outliers and eliminating them.

Preferably, obtaining the global pose of the RGB images by bundle adjustment, based on the first inter-frame poses, includes:

grouping all RGB images according to a preset number of groups;

optimizing the first inter-frame poses by bundle adjustment to obtain second inter-frame poses, and using the second inter-frame poses to obtain the relative pose between each pair of adjacent second inter-frame poses;

taking the second inter-frame pose of the first pair of adjacent frames in each group of RGB images as a key pose, and optimizing all key poses by bundle adjustment to obtain optimized key poses;

updating the second inter-frame poses with the optimized key poses, based on the relative poses; all updated second inter-frame poses and all optimized key poses constitute the global pose.

Preferably, obtaining the global semantic 3D point cloud according to the depth image and the global pose of the RGB image includes:

generating a point cloud of the depth image;

based on the voxblox framework, using a weighted-average algorithm to obtain a global 3D model represented by a TSDF, according to the global pose of the RGB image and the point cloud of the depth image;

using the marching cubes algorithm to obtain the 3D point clouds of the global 3D model; all the 3D point clouds of the global 3D model constitute the global semantic 3D point cloud.
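The weighted-average TSDF update named in the step above can be illustrated with a minimal per-voxel sketch. The function name, truncation distance, and weight cap below are assumptions for illustration, not voxblox's actual interface:

```python
# Illustrative weighted-average TSDF fusion: each voxel stores a truncated
# signed distance and a weight, and every new depth observation is blended
# in as a running weighted mean.
def update_tsdf_voxel(tsdf, weight, sdf_obs, w_obs, trunc=0.1, max_weight=100.0):
    """Blend one observation (sdf_obs, w_obs) into a voxel's (tsdf, weight) state."""
    sdf_obs = max(-trunc, min(trunc, sdf_obs))       # truncate the signed distance
    new_tsdf = (tsdf * weight + sdf_obs * w_obs) / (weight + w_obs)
    new_weight = min(weight + w_obs, max_weight)     # cap to keep the map adaptable
    return new_tsdf, new_weight
```

The weight cap keeps old observations from dominating, so the surface estimate can still adapt to new measurements.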

Preferably, the method further includes:

when the weighted-average algorithm is used to obtain the global 3D model represented by the TSDF, updating the semantic category probabilities of the TSDF in the global 3D model with a Bayesian method.
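The recursive Bayesian update of per-voxel semantic category probabilities can be sketched as follows; the function name and the normalization step are illustrative assumptions:

```python
def bayes_update(prior, likelihood):
    """Fuse a new per-class observation likelihood into a voxel's prior
    class probabilities: posterior ∝ prior * likelihood, renormalized."""
    posterior = [p * l for p, l in zip(prior, likelihood)]
    s = sum(posterior)
    return [p / s for p in posterior]
```

Repeating this update over successive frames lets confident, consistent observations sharpen the voxel's class distribution while contradictory ones dilute it.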

According to a second aspect of the embodiments of this application, a semantic 3D reconstruction system with multi-modal pose optimization is provided. The system includes:

an acquisition module, configured to collect RGB images and depth images of a target area;

a semantic segmentation module, configured to perform semantic segmentation on the RGB image to obtain its final mask map;

a pose module, configured to obtain the global pose of the RGB image by using the RGB image, the depth image, and the final mask map;

a semantic 3D reconstruction module, configured to obtain a global semantic 3D point cloud according to the depth image and the global pose of the RGB image.

According to a third aspect of the embodiments of this application, a computer device is provided, including: one or more processors;

the processor is configured to store one or more programs;

when the one or more programs are executed by the one or more processors, the above semantic 3D reconstruction method with multi-modal pose optimization is implemented.

According to a fourth aspect of the embodiments of this application, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed, the above semantic 3D reconstruction method with multi-modal pose optimization is implemented.

One or more of the above technical solutions of the present invention have at least one or more of the following beneficial effects:

The semantic 3D reconstruction method and system with multi-modal pose optimization provided by the present invention include: collecting RGB images and depth images of a target area; performing semantic segmentation on the RGB image to obtain its final mask map; obtaining the global pose of the RGB image by using the RGB image, the depth image, and the final mask map; and obtaining a global semantic 3D point cloud according to the depth image and the global pose of the RGB image. The technical solution provided by the present invention not only has low computing-power requirements and wide applicability, but also improves the real-time performance and efficiency of semantic 3D reconstruction.

Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a flow chart of a semantic 3D reconstruction method with multi-modal pose optimization according to an exemplary embodiment;

Figure 2 is a schematic diagram of the global pose acquisition process according to an exemplary embodiment;

Figure 3 is a schematic diagram of updating the semantic category probabilities of the TSDF in the global 3D model with the Bayesian method, according to an exemplary embodiment;

Figure 4 is a structural block diagram of a semantic 3D reconstruction system with multi-modal pose optimization according to an exemplary embodiment.

Detailed Description of Embodiments

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

As disclosed in the background art, semantic 3D reconstruction semantically annotates a reconstructed 3D model, segmenting structures with different category attributes out of the overall model for use by downstream tasks. One of the important factors determining the quality of semantic 3D reconstruction is the computation of the camera pose, which affects both the quality of the reconstructed model and the distribution of semantic information. Localization in current algorithms relies on purely geometric features, requires a complex screening mechanism to determine matching points, and does not exploit the high-level feature of semantics. Semantic 3D reconstruction can be divided into three parts: semantic segmentation, 3D reconstruction, and semantic mapping, where 3D reconstruction includes pose computation and mapping.

In existing algorithms these parts are relatively independent and communicate in a loosely coupled manner. For example, semantic 3D reconstruction mostly uses a GPU-based dense ICP algorithm to compute the camera pose, while semantic segmentation also competes for GPU resources, forcing compromises in runtime efficiency; the overall framework is too heavy, and the global optimization is large in scale. Therefore, the semantic segmentation and reconstruction methods used by current semantic 3D reconstruction algorithms have poor real-time performance, demand substantial computing power, place a heavy burden on the overall operation of the algorithm, and require a high-performance computer to run smoothly.

To alleviate the above problems, this application aims to improve the real-time performance and efficiency of semantic 3D reconstruction and to reduce its computing-power requirements.

The above solution is elaborated below.

Embodiment 1

The present invention provides a semantic 3D reconstruction method with multi-modal pose optimization. As shown in Figure 1, the method can be, but is not limited to being, used in a terminal, and includes the following steps:

Step 101: collect an RGB image and a depth image of the target area;

Step 102: perform semantic segmentation on the RGB image to obtain its final mask map;

Step 103: obtain the global pose of the RGB image by using the RGB image, the depth image, and the final mask map;

Step 104: obtain a global semantic 3D point cloud according to the depth image and the global pose of the RGB image.

In some embodiments, the RGB image may be captured with an RGB camera and the depth image with an RGB-D depth camera; or both the RGB image and the depth image may be obtained with a stereo camera; or the RGB image may be captured with an RGB camera and its depth image then computed by deep learning. These options are not limiting.

Further, step 102 includes:

Step 1021: use the deep-learning-based STDC network to semantically segment the RGB image and obtain the semantic category probabilities of each pixel in the RGB image;

Step 1022: take, for each pixel, the category code of the most probable category among its semantic category probabilities as the pixel's mask, obtaining the initial mask map of the RGB image;

For example, suppose a pixel of an RGB image has three possible semantic categories A, B, and C, with category codes 1, 2, and 3 respectively. If the probability of category A is 0.7, that of B is 0.2, and that of C is 0.1, then the pixel's mask is 1, the category code of A;
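The per-pixel argmax described in step 1022 can be sketched as follows; the function name and the array shapes are illustrative assumptions:

```python
import numpy as np

def initial_mask(probs, codes):
    """Per-pixel argmax over an H x W x C array of class probabilities,
    mapped through `codes` (class index -> integer category code)."""
    return codes[np.argmax(probs, axis=-1)]
```

For the example above, a pixel with probabilities (0.7, 0.2, 0.1) over codes (1, 2, 3) receives mask value 1.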

Step 1023: use the superpixel segmentation method gSLICr to perform edge optimization on the initial mask map, obtaining the final mask map of the RGB image.
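One common way to refine a mask with superpixels is a majority vote of labels inside each superpixel; the sketch below is a CPU approximation of that idea (gSLICr itself runs on the GPU, and the patent does not specify the exact refinement rule):

```python
import numpy as np

def refine_mask_by_superpixels(mask, superpixels):
    """Replace every pixel's label with the majority label of its superpixel.
    `superpixels` is an H x W map of superpixel ids (e.g. from SLIC/gSLICr)."""
    out = mask.copy()
    for sp in np.unique(superpixels):
        region = superpixels == sp
        labels, counts = np.unique(mask[region], return_counts=True)
        out[region] = labels[np.argmax(counts)]   # majority vote per superpixel
    return out
```

Because superpixel boundaries follow image edges, this snaps the mask contours to object boundaries and removes jagged label noise.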

It can be understood that the initial mask map obtained by segmenting the RGB image with the deep-learning-based STDC network may be incomplete or have jagged, rough edges. Using the superpixel segmentation method gSLICr to optimize the edges of the initial mask map makes the mask contours clearer. By combining the lightweight deep-learning-based STDC network with the fast superpixel segmentation method gSLICr, the segmentation results are more accurate and the contours clearer while speed is maintained, improving segmentation quality.

It should be noted that the STDC network embeds a Short-Term Dense Concatenate (STDC) module in U-Net, obtaining a large receptive field with very few parameters. To improve the accuracy of detail segmentation, STDC uses wider channels in the shallow layers and a Detail Guidance module to learn detail features, while reducing the channels of the deep layers to cut computational redundancy. The STDC network is a fast, lightweight semantic segmentation algorithm that can reach a segmentation speed of 250 frames per second on some datasets, has low computational overhead, and can classify most pixels accurately.

Superpixel segmentation clusters pixels with similar texture, brightness, color, and other characteristics into irregular pixel blocks. The pixels within a superpixel share similar properties, so the large number of pixels in the original image is expressed by a small number of blocks, greatly reducing the scale of subsequent image processing, with clear dividing lines at object edges. gSLICr is a GPU implementation of the SLIC algorithm and can reach a segmentation speed of 250 frames per second.

The present invention fuses the two segmentation methods, using superpixels to optimize the edges of the neural network's segmentation result and obtain a structured segmentation. Both the STDC-based and gSLICr-based methods are fast: experiments show that pose computation runs at tens of frames per second and 3D reconstruction at more than ten frames per second. Applying the semantic segmentation result to pose computation and semantic mapping therefore does not affect the overall speed.

It should be noted that the "deep-learning-based STDC network" and the "superpixel segmentation method gSLICr" used in the embodiments of the present invention are well known to those skilled in the art, so their specific implementations are not described in detail.

Further, step 103 includes:

Step 1031: extract the ORB feature points of the RGB image;

In some embodiments, the OpenCV library may be used, but is not required, to extract the ORB feature points of the RGB image;

It can be understood that ORB feature points are essentially pixels: they are salient pixels selected from the pixels of the RGB image, such as contour points, bright spots in darker areas, or dark spots in brighter areas;

Step 1032: obtain the depth values of the ORB feature points in the depth image, and use those depth values to back-project the 2D coordinates of the ORB feature points, obtaining the coordinates of the ORB feature points in 3D space;

In some optional embodiments, the 2D coordinates of an ORB feature point are back-projected to its coordinates in 3D space by

P = Z · K⁻¹ · (u, v, 1)ᵀ

where P = (X, Y, Z) are the coordinates of the ORB feature point in 3D space, (u, v) are its 2D pixel coordinates, Z is the depth value corresponding to the ORB feature point, and K is the camera intrinsic matrix. The semantic category of the 2D coordinates of an ORB feature point is the same as that of its 3D coordinates, because both correspond to the same pixel;
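The back-projection of step 1032 is a direct application of the pinhole camera model; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def backproject(u, v, Z, K):
    """Back-project pixel (u, v) with depth Z into camera coordinates:
    P = Z * K^-1 * [u, v, 1]^T, with K the 3x3 camera intrinsic matrix."""
    return Z * np.linalg.inv(K) @ np.array([u, v, 1.0])
```

In practice K⁻¹(u, v, 1)ᵀ reduces to ((u − cx)/fx, (v − cy)/fy, 1), so the inverse need not be computed explicitly per point.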

Step 1033: obtain the global pose of the RGB image with the sparse SLAM method, based on the 3D coordinates of the ORB feature points and the final mask map.

Further, step 1033 includes:

Step 1033a: extract the ORB feature points with identical BRIEF descriptors in two adjacent RGB frames;

Step 1033b: use the final mask map to eliminate outliers among the ORB feature points with identical BRIEF descriptors in the two adjacent RGB frames, and take the remaining ORB feature points with identical BRIEF descriptors as matching points;

Specifically, using the final mask map in step 1033b to eliminate outliers among the ORB feature points with identical BRIEF descriptors in two adjacent RGB frames includes:

determining, according to the final mask map, the mask of each ORB feature point in every pair of ORB feature points with identical BRIEF descriptors in the two adjacent RGB frames;

when the masks of the two ORB feature points in a pair with identical BRIEF descriptors differ, treating that pair of ORB feature points as outliers and eliminating them;

For example, suppose the previous RGB frame has ORB feature points D1, D2, and D3, the next frame has D4, D5, and D6, and the BRIEF descriptor of D1 is identical to those of both D4 and D5. Then there are two pairs of ORB feature points with identical BRIEF descriptors: (D1, D4) and (D1, D5). However, if the mask of D1 is 1, the mask of D4 is 2, and the mask of D5 is 1, then the pair (D1, D4) consists of outliers and D4 is eliminated; after eliminating outliers, D1 and D5, which have identical BRIEF descriptors in the two adjacent frames, are matching points;
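The semantic outlier rejection just illustrated can be sketched as a simple filter over candidate matches; the function name and the match representation are illustrative assumptions:

```python
def filter_matches(matches, mask_prev, mask_curr):
    """Drop candidate matches whose two keypoints carry different semantic
    masks. `matches` is a list of ((u1, v1), (u2, v2)) pixel pairs; the masks
    are row-major 2D arrays indexed as mask[v][u]."""
    kept = []
    for (u1, v1), (u2, v2) in matches:
        if mask_prev[v1][u1] == mask_curr[v2][u2]:   # same semantic category
            kept.append(((u1, v1), (u2, v2)))
    return kept
```

This uses the semantic label as a high-level consistency check, replacing part of the complex geometric screening the background section criticizes.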

步骤1033c:基于相邻的两帧RGB图像中前一帧RGB图像中的匹配点在三维空间中的坐标和相邻的两帧RGB图像中后一帧RGB图像的匹配点的二维坐标，利用PnP算法获取相邻的两帧RGB图像的第一帧间位姿；Step 1033c: Based on the three-dimensional coordinates of the matching points in the previous frame and the two-dimensional coordinates of the matching points in the next frame of the two adjacent RGB frames, the PnP algorithm is used to obtain the first inter-frame pose of the two adjacent RGB frames;

步骤1033d:基于第一帧间位姿，利用光束平差法获取RGB图像的全局位姿。Step 1033d: Based on the first inter-frame poses, the bundle adjustment method is used to obtain the global poses of the RGB images.

例如,假设有12帧RGB图像{a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12},a1-a12为RGB图像;For example, assume there are 12 frames of RGB images {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12}, a1-a12 are RGB images;

提取相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点，即分别提取“a1和a2、a2和a3、a3和a4、a4和a5、a5和a6、a6和a7、a7和a8、a8和a9、a9和a10、a10和a11、a11和a12”这11组相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点；Extract the ORB feature points with the same BRIEF descriptor in every two adjacent RGB frames, that is, respectively extract the ORB feature points with the same BRIEF descriptor in the 11 pairs of adjacent RGB frames "a1 and a2, a2 and a3, a3 and a4, a4 and a5, a5 and a6, a6 and a7, a7 and a8, a8 and a9, a9 and a10, a10 and a11, a11 and a12";

利用最终掩码图，剔除上述11组相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点中的外点，并令剔除外点后的每组RGB图像内相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点为匹配点；Using the final mask image, remove the outliers from the ORB feature points with the same BRIEF descriptor in the above 11 pairs of adjacent RGB frames, and take the remaining ORB feature points with the same BRIEF descriptor in the two adjacent RGB frames of each group as matching points;

基于上述11组相邻的两帧RGB图像中前一帧RGB图像中的匹配点在三维空间中的坐标和上述11组相邻的两帧RGB图像中后一帧RGB图像的匹配点的二维坐标，利用PnP算法获取上述11组相邻的两帧RGB图像的第一帧间位姿；Based on the three-dimensional coordinates of the matching points in the previous frame and the two-dimensional coordinates of the matching points in the next frame of each of the above 11 pairs of adjacent RGB frames, the PnP algorithm is used to obtain the first inter-frame poses of the above 11 pairs of adjacent RGB frames;

基于第一帧间位姿，利用光束平差法获取RGB图像的全局位姿。Based on the first inter-frame poses, the bundle adjustment method is used to obtain the global poses of the RGB images.

可以理解的是,通过剔除外点,可以减少外点对位姿计算精度的影响,进而提高了位姿的准确性和可靠性。It can be understood that by eliminating outliers, the impact of outliers on the pose calculation accuracy can be reduced, thereby improving the accuracy and reliability of the pose.

进一步的,步骤1033d,包括:Further, step 1033d includes:

根据预设组数,将所有的RGB图像进行分组;Group all RGB images according to the preset number of groups;

利用光束平差法对第一帧间位姿进行优化，得到第二帧间位姿，并利用第二帧间位姿，获取相邻的两个第二帧间位姿间的相对位姿；The bundle adjustment method is used to optimize the first inter-frame poses to obtain the second inter-frame poses, and the second inter-frame poses are used to obtain the relative pose between every two adjacent second inter-frame poses;

令每组RGB图像中第一组相邻的两帧图像对应的第二帧间位姿为关键位姿，并利用光束平差法对所有的关键位姿进行优化，得到优化后的关键位姿；The second inter-frame pose corresponding to the first pair of adjacent frames in each group of RGB images is taken as a key pose, and the bundle adjustment method is used to optimize all key poses to obtain the optimized key poses;

基于相对位姿,利用优化后的关键位姿更新第二帧间位姿,所有更新后的第二帧间位姿和所有优化后的关键位姿构成全局位姿。Based on the relative pose, the optimized key pose is used to update the second inter-frame pose, and all updated second inter-frame poses and all optimized key poses constitute the global pose.

需要说明的是,本发明对“预设组数”不做限定,可以由本领域技术人员根据实验数据或专家经验等进行设置。It should be noted that the present invention does not limit the "preset number of groups" and can be set by those skilled in the art based on experimental data or expert experience.

一些实施例中,可以但不限于根据预设的每组的帧数,将所有的RGB图像进行分组;如,预设的每组的帧数可以但不限于为10帧。In some embodiments, all RGB images may be grouped according to, but are not limited to, a preset number of frames in each group; for example, the preset number of frames in each group may be, but is not limited to, 10 frames.
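A minimal sketch of this grouping step (the group size of 10 frames is the example value mentioned above; the function name is illustrative, not from the patent):

```python
def group_frames(frames, frames_per_group=10):
    """Split the RGB frame sequence into consecutive groups of a preset
    size; the first frame of each group later serves as the key frame."""
    return [frames[i:i + frames_per_group]
            for i in range(0, len(frames), frames_per_group)]

frames = [f"a{i}" for i in range(1, 13)]          # a1..a12
groups = group_frames(frames, frames_per_group=10)
# 12 frames -> two groups: a1..a10 and a11, a12
```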

如图2所示，将RGB图像按照预设组数进行分组，在相邻的两帧RGB图像间采用PnP算法计算第一帧间位姿T1。但第一帧间位姿只考虑了相邻的两帧RGB图像的匹配关系，存在较大误差，所以作为后续优化的初始值。As shown in Figure 2, the RGB images are grouped according to the preset number of groups, and the PnP algorithm is used to calculate the first inter-frame pose T1 between every two adjacent RGB frames. However, the first inter-frame pose only considers the matching relationship between the two adjacent RGB frames and therefore contains a large error, so it is used as the initial value for the subsequent optimization.

接着对第一帧间位姿进行局部优化和全局优化。具体的，采用光束平差法(Bundle Adjustment)对第一帧间位姿T1进行优化，得到第二帧间位姿T2；每组RGB图像内的第一帧RGB图像为关键帧(Key Frame)，关键帧与其相邻的RGB图像间的第二帧间位姿即为关键位姿T3，采用光束平差法(Bundle Adjustment)对关键位姿进行优化，得到每组RGB图像的优化后的关键位姿T4。Then local optimization and global optimization are performed on the first inter-frame poses. Specifically, bundle adjustment is used to optimize the first inter-frame pose T1 to obtain the second inter-frame pose T2; the first RGB frame in each group of RGB images is the key frame, the second inter-frame pose between the key frame and its adjacent RGB frame is the key pose T3, and bundle adjustment is applied to the key poses to obtain the optimized key pose T4 of each group of RGB images.

在全局优化阶段，本发明并不需要将组内各帧的特征点投影到第一帧，而是直接利用相对位姿和每组RGB图像的优化后的关键位姿T4获取除优化后的关键位姿的其他位姿T5，即更新后的第二帧间位姿，更新后的第二帧间位姿T5和优化后的关键位姿T4构成全局位姿。这是因为两个相邻的帧间位姿间的相对位姿T保持不变，关键位姿在全局范围内进行了优化改变，因此组内各帧间位姿也进行相应变化。In the global optimization stage, the present invention does not need to project the feature points of each frame in a group onto the first frame; instead, the relative poses and the optimized key pose T4 of each group of RGB images are used directly to obtain the other poses T5 besides the optimized key pose, i.e. the updated second inter-frame poses. The updated second inter-frame poses T5 and the optimized key poses T4 constitute the global pose. This is because the relative pose T between two adjacent inter-frame poses remains unchanged while the key pose is optimized and changed globally, so the inter-frame poses within the group change accordingly.

例如,假设有4帧RGB图像{a1,a2,a3,a4},a1-a4为RGB图像;For example, assume there are 4 frames of RGB images {a1, a2, a3, a4}, and a1-a4 are RGB images;

利用PnP算法获取相邻的两帧RGB图像的第一帧间位姿，得到3个第一帧间位姿，即a1和a2、a2和a3、a3和a4的第一帧间位姿t1、t2和t3；The PnP algorithm is used to obtain the first inter-frame poses of every two adjacent RGB frames, yielding three first inter-frame poses, namely the first inter-frame poses t1, t2 and t3 of a1 and a2, a2 and a3, and a3 and a4;

采用光束平差法对第一帧间位姿t1、t2和t3进行优化，得到第二帧间位姿t4、t5和t6，即进行局部优化；并根据第二帧间位姿获取相邻的两个第二帧间位姿间的相对位姿，即t4和t5、t5和t6的相对位姿△t1和△t2；具体的，△t1=t5/t4，△t2=t6/t5；The bundle adjustment method is used to optimize the first inter-frame poses t1, t2 and t3 to obtain the second inter-frame poses t4, t5 and t6, i.e. local optimization is performed; then the relative poses between adjacent second inter-frame poses are obtained from the second inter-frame poses, i.e. the relative poses △t1 and △t2 between t4 and t5 and between t5 and t6; specifically, △t1 = t5/t4 and △t2 = t6/t5;

预设组数为1组，则存在一组RGB图像A1={a1,a2,a3,a4}，令每组RGB图像内第一组相邻的两帧图像对应的第二帧间位姿为关键位姿，即第二帧间位姿t4为关键位姿，利用光束平差法对所有的关键位姿进行优化，得到优化后的关键位姿t7；If the preset number of groups is 1, there is one group of RGB images A1={a1, a2, a3, a4}; the second inter-frame pose corresponding to the first pair of adjacent frames within each group of RGB images is taken as the key pose, i.e. the second inter-frame pose t4 is the key pose, and the bundle adjustment method is used to optimize all key poses to obtain the optimized key pose t7;

由于两个相邻的帧间位姿间的相对位姿保持不变，而关键位姿在全局范围内进行了改变，因此组内各第二帧间位姿也应该进行相应变化，得到所有图像帧全局优化的结果，即：基于相邻的两个第二帧间位姿间的相对位姿，利用优化后的关键位姿更新第二帧间位姿(更新第二帧间位姿t5和t6后，分别得到更新后的第二帧间位姿t8和t9)，所有的更新后的第二帧间位姿和所有的优化后的关键位姿构成全局位姿(即优化后的关键位姿t7、更新后的第二帧间位姿t8和t9)，实现全局优化。Since the relative pose between two adjacent inter-frame poses remains unchanged while the key pose has been changed globally, the second inter-frame poses within the group should also change accordingly, giving the globally optimized result for all image frames; that is, based on the relative poses between adjacent second inter-frame poses, the optimized key pose is used to update the second inter-frame poses (after updating the second inter-frame poses t5 and t6, the updated second inter-frame poses t8 and t9 are obtained respectively), and all updated second inter-frame poses together with all optimized key poses constitute the global pose (i.e. the optimized key pose t7 and the updated second inter-frame poses t8 and t9), realizing global optimization.
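The update step can be sketched with the scalar toy poses of this example (real poses are rigid-body transforms; scalar multiplication and division here only mirror the composition arithmetic t8 = t7·△t1, t9 = t8·△t2, and all numeric values are illustrative):

```python
def update_group_poses(key_pose_opt, relative_poses):
    """Propagate the globally optimized key pose through the unchanged
    relative poses to refresh every second inter-frame pose in the group."""
    updated, current = [], key_pose_opt
    for rel in relative_poses:
        current = current * rel  # pose composition; multiplication for scalars
        updated.append(current)
    return updated

t4, t5, t6 = 2.0, 3.0, 4.5      # second inter-frame poses (toy scalar values)
rel1, rel2 = t5 / t4, t6 / t5   # relative poses, held fixed during the update
t7 = 2.2                        # key pose after global optimization
t8, t9 = update_group_poses(t7, [rel1, rel2])
# t8 = t7 * (t5/t4), t9 = t8 * (t6/t5)
```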

可以理解的是，通过利用光束平差法，采用帧间、局部、全局三层位姿优化策略，提高了位姿计算的准确度和实时性，从而进一步实现多模态位姿优化的语义三维重建方法。It can be understood that by using the bundle adjustment method and adopting a three-layer inter-frame, local and global pose optimization strategy, the accuracy and real-time performance of pose calculation are improved, thereby further realizing the multi-modal pose optimized semantic three-dimensional reconstruction method.

需要说明的是，本发明实施例中涉及的“基于相对位姿，利用优化后的关键位姿更新第二帧间位姿”的方式，是本领域技术人员所熟知的，因此，其具体实现方式不做过多描述。It should be noted that the manner of "updating the second inter-frame poses with the optimized key poses based on the relative poses" involved in the embodiments of the present invention is well known to those skilled in the art, so its specific implementation will not be described in detail.

进一步的,步骤104,包括:Further, step 104 includes:

步骤1041:生成深度图像的点云;Step 1041: Generate a point cloud of the depth image;

一些实施例中,可以但不限于通过利用相机反投影生成深度图像的点云;In some embodiments, the point cloud of the depth image may be generated by, but is not limited to, camera back-projection;

步骤1042:基于voxblox框架,根据RGB图像的全局位姿和深度图像的点云,利用加权平均算法获取由TSDF表示的全局三维模型;Step 1042: Based on the voxblox framework, based on the global pose of the RGB image and the point cloud of the depth image, use the weighted average algorithm to obtain the global three-dimensional model represented by TSDF;

步骤1043:利用marching cube算法获取全局三维模型中的三维点云,全局三维模型中的所有三维点云构成全局语义三维点云。Step 1043: Use the marching cube algorithm to obtain the three-dimensional point cloud in the global three-dimensional model. All three-dimensional point clouds in the global three-dimensional model constitute the global semantic three-dimensional point cloud.

需要说明的是，Voxblox采用基于voxel hashing的TSDF表示方法，对处于同一空间位置的点提供了三种融合方法，即simple、merged和fast。其中simple方法针对每一个体素中的TSDF值都进行融合计算，精度较高，计算量大；fast方法将多个体素视为一组，对组内的所有TSDF值进行融合，减小了计算规模，加快了计算速度，但精度随之降低；merged为两种方法的折衷。本发明考虑到后续需融合所有点的语义信息，因此采用simple的融合方法进行重建。虽然每帧图像都参与重建会使数据更丰富、融合更充分，但相邻图像像素相似性较高，会出现较多冗余计算，增加运行成本，综合考虑算法的精度与速度，本发明只使用关键位姿进行三维重建。It should be noted that voxblox adopts a TSDF representation based on voxel hashing and provides three fusion methods for points at the same spatial location, namely simple, merged and fast. The simple method performs the fusion calculation on the TSDF value of every voxel, with high accuracy but a large amount of calculation; the fast method treats multiple voxels as one group and fuses all TSDF values in the group, which reduces the calculation scale and speeds up the calculation at the cost of accuracy; merged is a compromise between the two. Since the semantic information of all points needs to be fused later, the present invention adopts the simple fusion method for reconstruction. Although having every frame participate in the reconstruction would make the data richer and the fusion more complete, adjacent images have highly similar pixels, which would introduce considerable redundant calculation and increase the running cost; weighing the accuracy and speed of the algorithm, the present invention uses only the key poses for three-dimensional reconstruction.

本发明针对语义三维重建涉及模块多和对算力要求高的问题，使用基于稀疏SLAM的多层位姿计算方法求解相机位姿，并将语义分割结果用于外点的筛选。使用轻量级的实时语义分割网络实现像素分类，使用基于voxblox的实时三维重建框架进行场景三维建模，最终实现快速精确的轻量级语义三维重建。Aiming at the problems that semantic three-dimensional reconstruction involves many modules and demands high computing power, the present invention uses a multi-layer pose calculation method based on sparse SLAM to solve the camera pose, and uses the semantic segmentation results to filter outliers. A lightweight real-time semantic segmentation network is used for pixel classification, a voxblox-based real-time three-dimensional reconstruction framework is used for three-dimensional scene modeling, and fast, accurate and lightweight semantic three-dimensional reconstruction is finally achieved.

进一步的,该方法还包括:Further, the method also includes:

当利用加权平均算法获取由TSDF表示的全局三维模型时,利用贝叶斯方法更新全局三维模型中TSDF的语义类别概率。When the weighted average algorithm is used to obtain the global three-dimensional model represented by the TSDF, the Bayesian method is used to update the semantic category probability of the TSDF in the global three-dimensional model.

需要说明的是，多帧图像上拍摄同样物体的二维像素点通过计算，理论上可以获得同样的TSDF。但由于传感器(相机)的测量误差，实际上计算的TSDF不重合，需要使用加权平均算法进行融合。不同图像计算的TSDF(融合前的)对应一种语义类别概率，使用贝叶斯方法获得多帧TSDF融合后的最终语义类别。It should be noted that, in theory, the two-dimensional pixels observing the same object in multiple frames should yield the same TSDF through computation. However, due to the measurement error of the sensor (camera), the computed TSDFs do not actually coincide, and a weighted average algorithm is needed for fusion. The TSDFs computed from different images (before fusion) each correspond to a semantic category probability, and the Bayesian method is used to obtain the final semantic category after multi-frame TSDF fusion.
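A simplified sketch of the per-voxel weighted-average fusion described above (a generic running weighted average, not the actual voxblox implementation; all numeric values are illustrative):

```python
def fuse_tsdf(observations):
    """Fuse per-frame (tsdf, weight) observations of one voxel using the
    standard running weighted average of TSDF integration."""
    fused_d, fused_w = 0.0, 0.0
    for d, w in observations:
        fused_d = (fused_d * fused_w + d * w) / (fused_w + w)
        fused_w += w
    return fused_d, fused_w

# Three noisy measurements of the same surface distance:
fused = fuse_tsdf([(0.04, 1.0), (0.02, 1.0), (0.03, 2.0)])
# fused distance is about 0.03, with accumulated weight 4.0
```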

例如，如图3所示，对于同一个TSDF，会有不同位姿的像素与之对应，而像素在每个位姿处的语义分割结果不完全相同，如图中的P(u=ci)，i={1,2,3,4,5}。因此要将多次分割的结果进行融合更新，得到最可能的语义类别P(u=cFusion)。本发明采用与SemanticFusion相同的基于贝叶斯更新的方法进行语义信息的融合，如公式所示：For example, as shown in Figure 3, for the same TSDF there are pixels at different poses corresponding to it, and the semantic segmentation results of these pixels at each pose are not exactly the same, such as P(u=ci), i={1, 2, 3, 4, 5} in the figure. Therefore, the results of multiple segmentations need to be fused and updated to obtain the most probable semantic category P(u=cFusion). The present invention adopts the same Bayesian-update-based method as SemanticFusion to fuse the semantic information, as shown in the formula:

P(ci|I1,...,k) = (1/z) · P(u=ci|Ik) · P(ci|I1,...,k-1)

其中，P表示像素属于某一类别的概率(probability)，ci为某一类别(class)，I为某一帧图像，K为图像的总帧数。P(u=ci|Ik)表示在第k帧图像下通过语义分割获得像素u属于类别ci的概率；P(ci|I1,...,k-1)表示综合前1到k-1帧的语义分割结果，像素u属于ci的概率；P(ci|I1,...,k)表示综合前1到k帧，像素u属于ci的融合概率；z为归一化常数。通过不断更新，可以将所有时刻对同一TSDF的语义类别进行整合，得到其最终的类别属性。Here, P denotes the probability that a pixel belongs to a certain category, ci is a certain class, I is a frame of image, and K is the total number of image frames. P(u=ci|Ik) denotes the probability, obtained by semantic segmentation on the k-th frame, that pixel u belongs to category ci; P(ci|I1,...,k-1) denotes the probability that pixel u belongs to ci given the semantic segmentation results of frames 1 to k-1; P(ci|I1,...,k) denotes the fused probability that pixel u belongs to ci given frames 1 to k; z is the normalization constant. Through continuous updating, the semantic categories of the same TSDF at all times can be integrated to obtain its final category attribute.
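A minimal sketch of this recursive Bayesian update for one TSDF voxel (two hypothetical classes with illustrative probabilities; z is recomputed as the normalization constant at each step):

```python
def bayes_update(prior, likelihood):
    """Multiply the running class distribution by the per-frame semantic
    segmentation probabilities and renormalize (z is the normalizer)."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Two classes; three frames vote for class 0 with varying confidence:
dist = [0.5, 0.5]                                  # uniform prior
for frame_probs in [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]]:
    dist = bayes_update(dist, frame_probs)
# class 0 ends up clearly dominant in the fused distribution
```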

需要说明的是,使用贝叶斯方法更新融合多角度下的语义类别,最终实现全局一致性的语义三维重建,提高了语义三维重建的可靠性。It should be noted that the Bayesian method is used to update and fuse semantic categories from multiple angles, ultimately achieving globally consistent semantic three-dimensional reconstruction and improving the reliability of semantic three-dimensional reconstruction.

本发明提供的一种多模态位姿优化的语义三维重建方法，通过采集目标区域的RGB图像和深度图像，对RGB图像进行语义分割，得到RGB图像的最终掩码图，利用RGB图像、深度图像和最终掩码图，获取RGB图像的全局位姿，根据深度图像和RGB图像的全局位姿获取全局语义三维点云，不仅实现了对算力要求较低，适用性广，而且提高了语义三维重建的实时性和效率。The present invention provides a multi-modal pose optimized semantic three-dimensional reconstruction method: RGB images and depth images of a target area are collected, semantic segmentation is performed on the RGB images to obtain the final mask images of the RGB images, the global poses of the RGB images are obtained using the RGB images, the depth images and the final mask images, and a global semantic three-dimensional point cloud is obtained according to the depth images and the global poses of the RGB images. This not only achieves low computing power requirements and wide applicability, but also improves the real-time performance and efficiency of semantic three-dimensional reconstruction.

实施例二Embodiment 2

一种多模态位姿优化的语义三维重建系统,如图4所示,该系统包括:A multi-modal pose optimized semantic 3D reconstruction system, as shown in Figure 4. The system includes:

采集模块,用于采集目标区域的RGB图像和深度图像;Acquisition module, used to acquire RGB images and depth images of the target area;

语义分割模块,用于对RGB图像进行语义分割,得到RGB图像的最终掩码图;The semantic segmentation module is used to perform semantic segmentation on RGB images and obtain the final mask image of the RGB image;

位姿模块,用于利用RGB图像、深度图像和最终掩码图,获取RGB图像的全局位姿;The pose module is used to obtain the global pose of the RGB image using the RGB image, depth image and final mask map;

语义三维重建模块,用于根据深度图像和RGB图像的全局位姿获取全局语义三维点云。The semantic 3D reconstruction module is used to obtain the global semantic 3D point cloud based on the global pose of the depth image and RGB image.

进一步的,语义分割模块,包括:Further, the semantic segmentation module includes:

第一获取模块,用于利用基于深度学习的STDC网络,对RGB图像进行语义分割,得到RGB图像中各像素点的语义类别概率;The first acquisition module is used to use the STDC network based on deep learning to perform semantic segmentation on the RGB image and obtain the semantic category probability of each pixel in the RGB image;

第二获取模块,用于令各像素点的语义类别概率中概率最大的类别对应的类别编码为各像素点的掩码,得到RGB图像的初始掩码图;The second acquisition module is used to encode the category corresponding to the category with the highest probability among the semantic category probabilities of each pixel into the mask of each pixel, and obtain the initial mask map of the RGB image;

第三获取模块,用于利用超像素分割方法gSLICr对RGB图像的初始掩码图进行边缘优化,得到RGB图像的最终掩码图。The third acquisition module is used to perform edge optimization on the initial mask image of the RGB image using the superpixel segmentation method gSLICr to obtain the final mask image of the RGB image.

进一步的,位姿模块,包括:Further, the pose module includes:

提取模块可用于提取RGB图像的ORB特征点;The extraction module can be used to extract ORB feature points of RGB images;

第四获取模块，用于获取ORB特征点在深度图像中的深度值，并利用ORB特征点在深度图像中的深度值，对ORB特征点的二维坐标进行投影坐标转换，得到ORB特征点在三维空间中的坐标；A fourth acquisition module, used to obtain the depth values of the ORB feature points in the depth image, and to perform projection coordinate conversion on the two-dimensional coordinates of the ORB feature points using those depth values, so as to obtain the coordinates of the ORB feature points in three-dimensional space;

第五获取模块,用于基于ORB特征点在三维空间中的坐标和最终掩码图,利用稀疏SLAM方法获取RGB图像的全局位姿。The fifth acquisition module is used to obtain the global pose of the RGB image using the sparse SLAM method based on the coordinates of the ORB feature points in the three-dimensional space and the final mask map.

进一步的,第五获取模块,具体用于:Further, the fifth acquisition module is specifically used for:

提取相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点;Extract ORB feature points with the same BRIEF descriptor in two adjacent RGB images;

利用最终掩码图，剔除相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点中的外点，并令剔除外点后的相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点为匹配点；Using the final mask image, remove the outliers from the ORB feature points with the same BRIEF descriptor in the two adjacent RGB frames, and take the remaining ORB feature points with the same BRIEF descriptor in the two adjacent RGB frames as matching points;

基于相邻的两帧RGB图像中前一帧RGB图像中的匹配点在三维空间中的坐标和相邻的两帧RGB图像中后一帧RGB图像的匹配点的二维坐标，利用PnP算法获取相邻的两帧RGB图像的第一帧间位姿；Based on the three-dimensional coordinates of the matching points in the previous frame and the two-dimensional coordinates of the matching points in the next frame of the two adjacent RGB frames, the PnP algorithm is used to obtain the first inter-frame pose of the two adjacent RGB frames;

基于第一帧间位姿，利用光束平差法获取RGB图像的全局位姿。Based on the first inter-frame poses, the bundle adjustment method is used to obtain the global poses of the RGB images.

进一步的,利用最终掩码图,剔除相邻的两帧RGB图像中BRIEF描述子相同的ORB特征点中的外点,包括:Furthermore, the final mask image is used to eliminate outliers in ORB feature points with the same BRIEF descriptor in two adjacent frames of RGB images, including:

根据最终掩码图,确定相邻的两帧RGB图像中每组BRIEF描述子相同的ORB特征点中的各ORB特征点对应的掩码;According to the final mask map, determine the mask corresponding to each ORB feature point in each group of ORB feature points with the same BRIEF descriptor in the two adjacent frames of RGB images;

当某一组BRIEF描述子相同的ORB特征点中两个ORB特征点对应的掩码不同时,则该组BRIEF描述子相同的ORB特征点为外点,将外点剔除。When the masks corresponding to two ORB feature points in a certain group of ORB feature points with the same BRIEF descriptor are different, the ORB feature points in the group with the same BRIEF descriptor are regarded as outliers, and the outlier points are eliminated.

进一步的，基于第一帧间位姿，利用光束平差法获取RGB图像的全局位姿，包括：Further, obtaining the global poses of the RGB images with the bundle adjustment method based on the first inter-frame poses includes:

根据预设组数,将所有的RGB图像进行分组;Group all RGB images according to the preset number of groups;

利用光束平差法对第一帧间位姿进行优化，得到第二帧间位姿，并利用第二帧间位姿，获取相邻的两个第二帧间位姿间的相对位姿；The bundle adjustment method is used to optimize the first inter-frame poses to obtain the second inter-frame poses, and the second inter-frame poses are used to obtain the relative pose between every two adjacent second inter-frame poses;

令每组RGB图像中第一组相邻的两帧图像对应的第二帧间位姿为关键位姿，并利用光束平差法对所有的关键位姿进行优化，得到优化后的关键位姿；The second inter-frame pose corresponding to the first pair of adjacent frames in each group of RGB images is taken as a key pose, and the bundle adjustment method is used to optimize all key poses to obtain the optimized key poses;

基于相对位姿,利用优化后的关键位姿更新第二帧间位姿,所有更新后的第二帧间位姿和所有优化后的关键位姿构成全局位姿。Based on the relative pose, the optimized key pose is used to update the second inter-frame pose, and all updated second inter-frame poses and all optimized key poses constitute the global pose.

进一步的,语义三维重建模块,具体用于:Further, the semantic three-dimensional reconstruction module is specifically used for:

生成深度图像的点云;Generate point clouds of depth images;

基于voxblox框架,根据RGB图像的全局位姿和深度图像的点云,利用加权平均算法获取由TSDF表示的全局三维模型;Based on the voxblox framework, based on the global pose of the RGB image and the point cloud of the depth image, the weighted average algorithm is used to obtain the global three-dimensional model represented by TSDF;

利用marching cube算法获取全局三维模型中的三维点云,全局三维模型中的所有三维点云构成全局语义三维点云。The marching cube algorithm is used to obtain the three-dimensional point cloud in the global three-dimensional model. All three-dimensional point clouds in the global three-dimensional model constitute the global semantic three-dimensional point cloud.

进一步的,该系统还包括:Furthermore, the system also includes:

更新模块,用于当利用加权平均算法获取由TSDF表示的全局三维模型时,利用贝叶斯方法更新全局三维模型中TSDF的语义类别概率。The update module is used to update the semantic category probability of the TSDF in the global three-dimensional model using the Bayesian method when the weighted average algorithm is used to obtain the global three-dimensional model represented by the TSDF.

本发明提供的一种多模态位姿优化的语义三维重建系统，通过采集模块采集目标区域的RGB图像和深度图像，语义分割模块对RGB图像进行语义分割，得到RGB图像的最终掩码图，位姿模块利用RGB图像、深度图像和最终掩码图，获取RGB图像的全局位姿，语义三维重建模块根据深度图像和RGB图像的全局位姿获取全局语义三维点云，不仅实现了对算力要求较低，适用性广，而且提高了语义三维重建的实时性和效率。The present invention provides a multi-modal pose optimized semantic three-dimensional reconstruction system: the acquisition module collects the RGB images and depth images of a target area, the semantic segmentation module performs semantic segmentation on the RGB images to obtain the final mask images of the RGB images, the pose module obtains the global poses of the RGB images using the RGB images, the depth images and the final mask images, and the semantic three-dimensional reconstruction module obtains a global semantic three-dimensional point cloud according to the depth images and the global poses of the RGB images. This not only achieves low computing power requirements and wide applicability, but also improves the real-time performance and efficiency of semantic three-dimensional reconstruction.

可以理解的是,上述提供的系统实施例与上述的方法实施例对应,相应的具体内容可以相互参考,在此不再赘述。It can be understood that the above-mentioned system embodiments correspond to the above-mentioned method embodiments, and the corresponding specific contents can be referred to each other, and will not be described again here.

可以理解的是,上述各实施例中相同或相似部分可以相互参考,在一些实施例中未详细说明的内容可以参见其他实施例中相同或相似的内容。It can be understood that the same or similar parts in the above-mentioned embodiments can be referred to each other, and the content that is not described in detail in some embodiments can be referred to the same or similar content in other embodiments.

实施例三Embodiment 3

基于同一种发明构思，本发明还提供了一种计算机设备，该计算机设备包括处理器以及存储器，存储器用于存储计算机程序，计算机程序包括程序指令，处理器用于执行计算机存储介质存储的程序指令。处理器可能是中央处理单元(Central Processing Unit，CPU)，还可以是其他通用处理器、数字信号处理器(Digital Signal Processor，DSP)、专用集成电路(Application Specific Integrated Circuit，ASIC)、现成可编程门阵列(Field-Programmable Gate Array，FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等，其是终端的计算核心以及控制核心，其适于实现一条或一条以上指令，具体适于加载并执行计算机存储介质内一条或一条以上指令从而实现相应方法流程或相应功能，以实现上述实施例中一种多模态位姿优化的语义三维重建方法的步骤。Based on the same inventive concept, the present invention also provides a computer device. The computer device includes a processor and a memory; the memory is used to store a computer program, the computer program includes program instructions, and the processor is used to execute the program instructions stored in a computer storage medium. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. As the computing core and control core of the terminal, it is suitable for implementing one or more instructions, and in particular for loading and executing one or more instructions in the computer storage medium to realize the corresponding method flow or function, so as to implement the steps of the multi-modal pose optimized semantic three-dimensional reconstruction method in the above embodiment.

实施例四Embodiment 4

基于同一种发明构思,本发明还提供了一种存储介质,具体为计算机可读存储介质(Memory),计算机可读存储介质是计算机设备中的记忆设备,用于存放程序和数据。可以理解的是,此处的计算机可读存储介质既可以包括计算机设备中的内置存储介质,当然也可以包括计算机设备所支持的扩展存储介质。计算机可读存储介质提供存储空间,该存储空间存储了终端的操作系统。并且,在该存储空间中还存放了适于被处理器加载并执行的一条或一条以上的指令,这些指令可以是一个或一个以上的计算机程序(包括程序代码)。需要说明的是,此处的计算机可读存储介质可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器。可由处理器加载并执行计算机可读存储介质中存放的一条或一条以上指令,以实现上述实施例中一种多模态位姿优化的语义三维重建方法的步骤。Based on the same inventive concept, the present invention also provides a storage medium, specifically a computer-readable storage medium (Memory). The computer-readable storage medium is a memory device in a computer device and is used to store programs and data. It can be understood that the computer-readable storage medium here may include a built-in storage medium in the computer device, and of course may also include an extended storage medium supported by the computer device. The computer-readable storage medium provides storage space, and the storage space stores the operating system of the terminal. Furthermore, one or more instructions suitable for being loaded and executed by the processor are also stored in the storage space. These instructions may be one or more computer programs (including program codes). It should be noted that the computer-readable storage medium here may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. One or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor to implement the steps of a multi-modal pose optimized semantic three-dimensional reconstruction method in the above embodiment.

本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention may be provided as methods, systems, or computer program products. Thus, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operation steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

最后应当说明的是：以上实施例仅用以说明本发明的技术方案而非对其限制，尽管参照上述实施例对本发明进行了详细的说明，所属领域的普通技术人员应当理解：依然可以对本发明的具体实施方式进行修改或者等同替换，而未脱离本发明精神和范围的任何修改或者等同替换，其均应涵盖在本发明的权利要求保护范围之内。Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention rather than to limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific implementations of the present invention may still be modified or equivalently substituted, and any modification or equivalent substitution that does not depart from the spirit and scope of the present invention shall be covered by the protection scope of the claims of the present invention.

Claims (10)

1. A semantic three-dimensional reconstruction method for multi-modal pose optimization, the method comprising:
collecting RGB images and depth images of a target area;
performing semantic segmentation on an RGB image to obtain a final mask map of the RGB image;
acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map;
acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image;
the obtaining the global pose of the RGB image by using the RGB image, the depth image and the final mask map includes:
extracting ORB feature points of the RGB image;
obtaining depth values of the ORB feature points in the depth image, and performing projection coordinate conversion on the two-dimensional coordinates of the ORB feature points by using the depth values of the ORB feature points in the depth image to obtain the coordinates of the ORB feature points in three-dimensional space;
and acquiring the global pose of the RGB image by using a sparse SLAM method based on the coordinates of the ORB feature points in the three-dimensional space and the final mask map.
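The projection coordinate conversion recited in claim 1 corresponds to standard pinhole back-projection of a pixel with a known depth value. A minimal sketch, assuming pinhole intrinsics fx, fy, cx, cy (the intrinsic parameters are not specified in the claims):

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Convert 2D pixel coordinates (u, v) plus a depth value z (metres) into
    camera-frame 3D coordinates using the pinhole camera model."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# A pixel at the principal point maps straight onto the optical axis:
# backproject(320.0, 240.0, 2.0, 525.0, 525.0, 320.0, 240.0) -> (0.0, 0.0, 2.0)
```

Applying this to every ORB feature point that has a valid depth reading yields the 3D coordinates used later for PnP.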
2. The method of claim 1, wherein the semantically segmenting the RGB image to obtain a final mask map of the RGB image comprises:
performing semantic segmentation on the RGB image by using a deep-learning-based STDC network to obtain the semantic class probability of each pixel point in the RGB image;
encoding the class with the highest probability among the semantic class probabilities of each pixel point as the mask of that pixel point to obtain an initial mask map of the RGB image;
and performing edge optimization on the initial mask map of the RGB image by using a super-pixel segmentation method gSLICr to obtain a final mask map of the RGB image.
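The per-pixel encoding step of claim 2 is an argmax over the class probabilities. A minimal sketch of the initial mask construction (the superpixel edge refinement with gSLICr is a separate step and is not reproduced here):

```python
import numpy as np

def initial_mask(class_prob):
    """class_prob: (H, W, C) array of per-pixel semantic class probabilities.
    Each pixel is encoded as the index of its most probable class."""
    return np.argmax(class_prob, axis=-1).astype(np.uint8)

probs = np.array([[[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]]])   # one row, two pixels, three classes
mask = initial_mask(probs)              # pixel 0 -> class 1, pixel 1 -> class 0
```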
3. The method of claim 1, wherein the acquiring the global pose of the RGB image using the sparse SLAM method based on the coordinates of the ORB feature points in three-dimensional space and the final mask map comprises:
extracting ORB feature points with identical BRIEF descriptors in two adjacent frames of RGB images;
removing outliers from the ORB feature points with identical BRIEF descriptors in the two adjacent frames of RGB images by using the final mask map, and taking the remaining ORB feature points with identical BRIEF descriptors as matching points;
acquiring a first frame pose of two adjacent RGB images by using a PnP algorithm based on coordinates of a matching point in a previous RGB image in the two adjacent RGB images in a three-dimensional space and two-dimensional coordinates of a matching point in a next RGB image in the two adjacent RGB images;
and acquiring the global pose of the RGB image by using a bundle adjustment method based on the first frame pose.
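The correspondence search of claim 3 pairs features whose BRIEF descriptors agree. A minimal sketch using Hamming distance on integer-packed descriptors; practical ORB matchers threshold the distance rather than require exact equality, but exact matching is shown because the claim pairs identical descriptors:

```python
def hamming(d1, d2):
    """Hamming distance between two integer-packed binary descriptors."""
    return bin(d1 ^ d2).count("1")

def match_identical(desc_prev, desc_next):
    """Return index pairs (i, j) whose BRIEF descriptors are identical
    (Hamming distance 0), as recited in claim 3."""
    return [(i, j)
            for i, da in enumerate(desc_prev)
            for j, db in enumerate(desc_next)
            if hamming(da, db) == 0]

pairs = match_identical([0b1010, 0b1111], [0b1111, 0b0001])  # only 0b1111 matches
```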
4. A method according to claim 3, wherein the removing outliers in ORB feature points with identical BRIEF descriptors in two adjacent frames of RGB images using the final mask map comprises:
determining, according to the final mask map, the mask corresponding to each ORB feature point in each group of ORB feature points with identical BRIEF descriptors in the two adjacent frames of RGB images;
when the masks corresponding to the two ORB feature points in a group of ORB feature points with identical BRIEF descriptors differ, the ORB feature points of that group are outliers, and the outliers are removed.
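The mask-consistency test of claim 4 can be sketched as a filter over the matched pixel pairs. A minimal sketch, assuming masks are indexed as mask[row][col]:

```python
def remove_outliers(matches, mask_prev, mask_next):
    """Keep only matched pairs ((u1, v1), (u2, v2)) whose two endpoints carry
    the same semantic mask code; pairs with differing codes are outliers."""
    return [m for m in matches
            if mask_prev[m[0][1]][m[0][0]] == mask_next[m[1][1]][m[1][0]]]

mask_a = [[3, 3], [5, 5]]
mask_b = [[3, 5], [5, 3]]
matches = [((0, 0), (0, 0)),   # codes 3 == 3 -> inlier
           ((1, 1), (1, 1))]   # codes 5 != 3 -> outlier
inliers = remove_outliers(matches, mask_a, mask_b)
```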
5. A method according to claim 3, wherein the obtaining the global pose of the RGB image using the bundle adjustment method based on the first frame pose comprises:
grouping all RGB images according to a preset group number;
optimizing the first frame pose by using a bundle adjustment method to obtain a second frame pose, and obtaining the relative pose between two adjacent second frame poses by using the second frame poses;
taking the second frame pose corresponding to the first pair of adjacent images in each group of RGB images as a key pose, and optimizing all key poses by using a bundle adjustment method to obtain optimized key poses;
based on the relative poses, updating the second frame poses by using the optimized key poses, wherein all updated second frame poses and all optimized key poses form the global pose.
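The pose update of claim 5 can be expressed with 4×4 homogeneous transforms: once a key pose has been re-optimised, the second frame poses that follow it are re-derived by chaining the stored relative poses onto it. A minimal sketch (the matrix layout and composition order are assumptions; the claims do not fix a convention):

```python
import numpy as np

def chain_poses(key_pose_opt, relative_poses):
    """Re-derive the second frame poses that follow an optimised key pose by
    composing the stored relative transforms onto it."""
    poses = [key_pose_opt]
    for T_rel in relative_poses:
        poses.append(poses[-1] @ T_rel)
    return poses

def translation(tx, ty, tz):
    """4x4 homogeneous transform that only translates (helper for the example)."""
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

updated = chain_poses(translation(1.0, 0.0, 0.0),
                      [translation(0.5, 0.0, 0.0), translation(0.5, 0.0, 0.0)])
# The final pose accumulates the translations: x = 1.0 + 0.5 + 0.5 = 2.0
```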
6. The method of claim 1, wherein the obtaining a global semantic three-dimensional point cloud from the depth image and the global pose of the RGB image comprises:
generating a point cloud of the depth image;
based on the voxblox framework, acquiring a global three-dimensional model represented by a TSDF by using a weighted average algorithm according to the global pose of the RGB image and the point cloud of the depth image;
and acquiring three-dimensional point clouds in the global three-dimensional model by using the marching cubes algorithm, wherein all three-dimensional point clouds in the global three-dimensional model form the global semantic three-dimensional point cloud.
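The weighted average algorithm of claim 6 is the standard running-average TSDF update used by voxblox-style pipelines; a minimal per-voxel sketch:

```python
def tsdf_update(D, W, d_obs, w_obs=1.0):
    """Fuse a new signed-distance observation d_obs (weight w_obs) into a
    voxel's stored distance D and accumulated weight W by weighted averaging."""
    W_new = W + w_obs
    D_new = (W * D + w_obs * d_obs) / W_new
    return D_new, W_new

D, W = 0.0, 1.0
D, W = tsdf_update(D, W, 1.0)   # average of 0.0 and 1.0 -> D = 0.5, W = 2.0
D, W = tsdf_update(D, W, 0.5)   # new reading agrees    -> D = 0.5, W = 3.0
```

In voxblox the weight also decays with distance behind the surface; the constant weight here keeps the sketch minimal.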
7. The method of claim 6, wherein the method further comprises:
when the global three-dimensional model represented by the TSDF is obtained by using the weighted average algorithm, the semantic class probabilities stored in the TSDF of the global three-dimensional model are updated by using a Bayesian method.
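The Bayesian update of claim 7 can be sketched as multiplying the stored class probabilities by the new observation's probabilities and renormalising, a common recursive Bayes filter over per-voxel labels (this particular form is an assumption, since the claim does not spell out the update rule):

```python
def bayes_update(stored, observed):
    """Fuse a new semantic observation into a voxel's stored class
    probabilities: elementwise product of the two distributions,
    renormalised so the posterior sums to 1."""
    fused = [p * q for p, q in zip(stored, observed)]
    total = sum(fused)
    return [p / total for p in fused]

post = bayes_update([0.5, 0.5], [0.8, 0.2])  # confident observation dominates
```

Repeated agreeing observations sharpen the distribution, while conflicting ones pull it back toward uniform.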
8. A multi-modal pose optimized semantic three-dimensional reconstruction system, the system comprising:
the acquisition module is used for collecting RGB images and depth images of the target area;
the semantic segmentation module is used for performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image;
the pose module is used for acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map;
the semantic three-dimensional reconstruction module is used for acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image;
the pose module comprises:
the extraction module is used for extracting ORB feature points of the RGB image;
the fourth acquisition module is used for obtaining depth values of the ORB feature points in the depth image, and performing projection coordinate conversion on the two-dimensional coordinates of the ORB feature points by using the depth values to obtain the coordinates of the ORB feature points in three-dimensional space;
and a fifth acquisition module, configured to acquire a global pose of the RGB image by using a sparse SLAM method based on coordinates of the ORB feature points in the three-dimensional space and the final mask map.
9. A computer device, comprising: one or more processors; and
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the multi-modal pose optimized semantic three-dimensional reconstruction method according to any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that a computer program is stored thereon, which, when executed, implements the multi-modal pose optimized semantic three-dimensional reconstruction method according to any one of claims 1 to 7.
CN202310181777.1A 2023-02-21 2023-02-21 Semantic three-dimensional reconstruction method and system for multi-mode pose optimization Active CN116342800B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310181777.1A CN116342800B (en) 2023-02-21 2023-02-21 Semantic three-dimensional reconstruction method and system for multi-mode pose optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310181777.1A CN116342800B (en) 2023-02-21 2023-02-21 Semantic three-dimensional reconstruction method and system for multi-mode pose optimization

Publications (2)

Publication Number Publication Date
CN116342800A CN116342800A (en) 2023-06-27
CN116342800B true CN116342800B (en) 2023-10-24

Family

ID=86886877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310181777.1A Active CN116342800B (en) 2023-02-21 2023-02-21 Semantic three-dimensional reconstruction method and system for multi-mode pose optimization

Country Status (1)

Country Link
CN (1) CN116342800B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274349A (en) * 2023-09-22 2023-12-22 南开大学 Transparent object reconstruction method and system based on RGB-D camera consistent depth prediction
CN118298109B (en) * 2024-04-18 2024-10-18 中国标准化研究院 View processing method for multimodal electronic information system
CN119579807B (en) * 2025-02-08 2025-06-06 深圳大学 Semantic and posture coupled image reconstruction method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192293A (en) * 2019-12-27 2020-05-22 深圳市越疆科技有限公司 Moving target pose tracking method and device
CN111462135A (en) * 2020-03-31 2020-07-28 华东理工大学 Semantic Mapping Method Based on Visual SLAM and 2D Semantic Segmentation
CN111968129A (en) * 2020-07-15 2020-11-20 上海交通大学 Instant positioning and map construction system and method with semantic perception
CN112270249A (en) * 2020-10-26 2021-01-26 湖南大学 Target pose estimation method fusing RGB-D visual features
WO2021077720A1 (en) * 2019-10-25 2021-04-29 深圳奥比中光科技有限公司 Method, apparatus, and system for acquiring three-dimensional model of object, and electronic device
CN113221953A (en) * 2021-04-14 2021-08-06 上海交通大学宁波人工智能研究院 Target attitude identification system and method based on example segmentation and binocular depth estimation
CN114445451A (en) * 2021-12-17 2022-05-06 广州欧科信息技术股份有限公司 Plane image tracking method, terminal and storage medium
WO2022100843A1 (en) * 2020-11-13 2022-05-19 Huawei Technologies Co., Ltd. Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation
KR20220074782A (en) * 2020-11-27 2022-06-03 삼성전자주식회사 Method and device for simultaneous localization and mapping (slam)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210158561A1 (en) * 2019-11-26 2021-05-27 Nvidia Corporation Image volume for object pose estimation

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021077720A1 (en) * 2019-10-25 2021-04-29 深圳奥比中光科技有限公司 Method, apparatus, and system for acquiring three-dimensional model of object, and electronic device
CN111192293A (en) * 2019-12-27 2020-05-22 深圳市越疆科技有限公司 Moving target pose tracking method and device
CN111462135A (en) * 2020-03-31 2020-07-28 华东理工大学 Semantic Mapping Method Based on Visual SLAM and 2D Semantic Segmentation
CN111968129A (en) * 2020-07-15 2020-11-20 上海交通大学 Instant positioning and map construction system and method with semantic perception
CN112270249A (en) * 2020-10-26 2021-01-26 湖南大学 Target pose estimation method fusing RGB-D visual features
WO2022100843A1 (en) * 2020-11-13 2022-05-19 Huawei Technologies Co., Ltd. Device and method for improving the determining of a depth map, a relative pose, or a semantic segmentation
KR20220074782A (en) * 2020-11-27 2022-06-03 삼성전자주식회사 Method and device for simultaneous localization and mapping (slam)
CN113221953A (en) * 2021-04-14 2021-08-06 上海交通大学宁波人工智能研究院 Target attitude identification system and method based on example segmentation and binocular depth estimation
CN114445451A (en) * 2021-12-17 2022-05-06 广州欧科信息技术股份有限公司 Plane image tracking method, terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mobile robot target tracking with a block-wise multi-feature target descriptor; Wang Lijia; Jia Songmin; Li Xiuzhi; Lu Yingbin; Control and Decision; 31(02); 337-342 *
Research on a multi-view pose observation fusion method based on the number and distribution of key points; Sun Zheng et al.; Journal of Changchun University of Science and Technology (Natural Science Edition); 45(6); 84-92 *

Also Published As

Publication number Publication date
CN116342800A (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN116342800B (en) Semantic three-dimensional reconstruction method and system for multi-mode pose optimization
US11430134B2 (en) Hardware-based optical flow acceleration
CN112561978B (en) Depth estimation network training method, image depth estimation method, equipment
CN112950477B (en) A High Resolution Salient Object Detection Method Based on Dual Path Processing
US12141981B2 (en) System and method for performing semantic image segmentation
CN111340745B (en) Image generation method and device, storage medium and electronic equipment
CN111127631B (en) Three-dimensional shape and texture reconstruction method, system and storage medium based on single image
CN111832745A (en) Data augmentation method and device and electronic equipment
CN118351277B (en) A 3D scene arbitrary angle model segmentation method based on 3D Gaussian Splatting
CN110443883B (en) Plane three-dimensional reconstruction method for single color picture based on droplock
CN113449859A (en) Data processing method and device
CN115830316A (en) Nematode image segmentation method and system based on deep learning and iterative feature fusion
CN114638866A (en) Point cloud registration method and system based on local feature learning
KR102305230B1 (en) Method and device for improving accuracy of boundary information from image
CN110059793A (en) The gradually modification of production confrontation neural network
CN114511041B (en) Model training method, image processing method, apparatus, equipment and storage medium
CN116630623A (en) A Segmentation Method for Workpiece Point Cloud Instances Oriented to Industrial Scenes
US20250086896A1 (en) Synthetic image generation for supplementing neural field representations and related applications
CN114596503B (en) A road extraction method based on remote sensing satellite images
CN115620012A (en) A 3D dense reconstruction method and storage medium based on scene semantic segmentation
CN115272666A (en) Online point cloud semantic segmentation method, device, storage medium and electronic device
CN113808006B (en) Method and device for reconstructing three-dimensional grid model based on two-dimensional image
CN114998705A (en) Target detection method and system and in-memory computing chip
CN113780040B (en) Lip key point positioning method and device, storage medium, and electronic device
CN117058472B (en) 3D target detection method, device and equipment based on self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant