CN115170746B

CN115170746B - Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Info

Publication number: CN115170746B
Application number: CN202211087276.9A
Authority: CN
Inventors: 任胜兵; 彭泽文; 陈旭洋
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2022-11-22
Anticipated expiration: 2042-09-07
Also published as: CN115170746A

Abstract

The invention discloses a multi-view three-dimensional reconstruction method, system and equipment based on deep learning, which acquires multiple multi-view images, performs multi-scale semantic feature extraction on multiple multi-view images, and obtains feature maps of multiple scales; Multi-scale semantic segmentation is performed on feature maps of different scales to obtain semantic segmentation sets of multiple scales; multiple multi-view images are reconstructed through supervised 3D reconstruction methods to obtain initial depth maps; semantic segmentation sets based on multiple scales and The initial depth map obtains depth maps of multiple scales; constructs point cloud sets of multiple scales; optimizes point cloud sets of multiple scales with different radius filters to obtain optimized point cloud sets; based on the optimized point cloud set. Scale reconstruction to obtain 3D reconstruction results of different scales; splicing and fusion of 3D reconstruction results of each scale. The present invention can make full use of the semantic information of each scale, and can improve the accuracy of three-dimensional reconstruction.

Description

A method, system and device for multi-view 3D reconstruction based on deep learning

技术领域technical field

本发明涉及计算机视觉技术领域，尤其是涉及一种基于深度学习的多视图三维重建方法、系统及设备。The present invention relates to the technical field of computer vision, in particular to a deep learning-based multi-view three-dimensional reconstruction method, system and equipment.

背景技术Background technique

深度学习的三维重建方法是利用计算机搭建神经网络, 通过大量的图像数据与三维模型数据进行训练, 学习图像至三维模型的映射关系, 从而实现对新的图像目标进行三维重建。与传统的诸如3DMM方法（3D Morphable Model）和SFM方法（Structure fromMotion）相比，深度学习的三维重建方法能够将学习到的一些全局的语义信息引入图像重建，从而在一定程度上克服传统重建方法在弱光照、弱纹理区域重建不良的局限性。The 3D reconstruction method of deep learning is to use a computer to build a neural network, train through a large amount of image data and 3D model data, and learn the mapping relationship between images and 3D models, so as to achieve 3D reconstruction of new image targets. Compared with traditional methods such as 3DMM method (3D Morphable Model) and SFM method (Structure fromMotion), the 3D reconstruction method of deep learning can introduce some global semantic information learned into image reconstruction, thereby overcoming the traditional reconstruction method to a certain extent. Limitations of bad reconstruction in poorly lit, weakly textured areas.

目前的深度学习三维重建方法大多基于单一尺度，即对于图像中不同大小的物体采取同样的方式进行重建。单尺度的重建在一些场景复杂度较低、细小物体较少的环境下能保持较好的重建精度和速度。但在一些场景复杂、各种尺度的物体较多的环境下容易出现小尺度物体重建精度不足的问题。并且只利用了高层特征，图像的低层细节信息没有得到充分利用。Most of the current deep learning 3D reconstruction methods are based on a single scale, that is, objects of different sizes in the image are reconstructed in the same way. Single-scale reconstruction can maintain good reconstruction accuracy and speed in some environments with low scene complexity and few small objects. However, in some environments with complex scenes and many objects of various scales, the problem of insufficient reconstruction accuracy of small-scale objects is prone to occur. And only the high-level features are used, and the low-level detail information of the image is not fully utilized.

发明内容Contents of the invention

本发明旨在至少解决现有技术中存在的技术问题之一。为此，本发明提出一种基于深度学习的多视图三维重建方法、系统及设备，能够充分利用各个尺度的语义信息，能够提高三维重建的精确度。The present invention aims to solve at least one of the technical problems existing in the prior art. For this reason, the present invention proposes a multi-view 3D reconstruction method, system and device based on deep learning, which can make full use of semantic information of each scale and improve the accuracy of 3D reconstruction.

第一方面，本发明实施例提供了一种基于深度学习的多视图三维重建方法，所述基于深度学习的多视图三维重建方法包括：In the first aspect, an embodiment of the present invention provides a method for multi-view 3D reconstruction based on deep learning, the method for multi-view 3D reconstruction based on deep learning includes:

获取多张多视角图像，对多张所述多视角图像进行多尺度语义特征提取，获得多种尺度的特征图；Obtaining multiple multi-view images, performing multi-scale semantic feature extraction on the multiple multi-view images, and obtaining feature maps of multiple scales;

对所述多种尺度的特征图进行多尺度语义分割，获得多种尺度的语义分割集；performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;

通过有监督的三维重建方法对多张所述多视角图像进行重建，获得初始深度图；Reconstructing multiple multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;

基于所述多种尺度的语义分割集和所述初始深度图，获得多种尺度的深度图；Obtaining depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;

基于所述多种尺度的深度图，构建多种尺度的点云集；Constructing point cloud sets of multiple scales based on the depth maps of multiple scales;

根据所述点云集的尺度，对所述多种尺度的点云集采用不同的半径滤波进行优化，获得优化后的点云集；According to the scale of the point cloud set, the point cloud set of multiple scales is optimized by using different radius filters to obtain the optimized point cloud set;

基于所述优化后的点云集进行不同尺度的重建，获得不同尺度的三维重建结果；Reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;

将每种尺度的三维重建结果进行拼接融合，获得最终的三维重建结果。The 3D reconstruction results of each scale are spliced and fused to obtain the final 3D reconstruction result.

与现有技术相比，本发明第一方面具有以下有益效果：Compared with the prior art, the first aspect of the present invention has the following beneficial effects:

本方法通过对多张多视角图像进行多尺度语义特征提取，能够提取不同尺度的特征，能获得多种尺度的特征图，并对多种尺度的特征图进行多尺度语义分割，聚合各个尺度的语义信息，丰富了各个尺度的语义信息；通过利用多种尺度的语义分割集中的各个尺度的语义信息分别对初始深度图进行语义引导，从而不断修正初始深度图，获得准确的多种尺度的深度图；本方法用获得的多种尺度的深度图构建多种尺度的点云集，根据点云集的尺度采用不同的半径滤波进行优化，优化后的点云集用于不同尺度的重建，再将三维重建结果融合以获得更加精确的三维重建结果。因此，本方法能够充分利用各个尺度的语义信息，能够提高三维重建的精确度。By extracting multi-scale semantic features from multiple multi-view images, this method can extract features of different scales, obtain feature maps of multiple scales, perform multi-scale semantic segmentation on feature maps of multiple scales, and aggregate the features of each scale. Semantic information enriches the semantic information of each scale; by using the semantic information of each scale in the semantic segmentation set of multiple scales, the initial depth map is semantically guided, so as to continuously correct the initial depth map and obtain accurate depth of multiple scales Figure; this method constructs point cloud sets of multiple scales with the obtained depth maps of multiple scales, and optimizes them with different radius filters according to the scale of the point cloud sets. The optimized point cloud sets are used for reconstruction of different scales, and then the three-dimensional reconstruction The results are fused to obtain more accurate 3D reconstruction results. Therefore, this method can make full use of the semantic information of each scale, and can improve the accuracy of 3D reconstruction.

根据本发明的一些实施例，所述对多张所述多视角图像进行多尺度语义特征提取，获得多种尺度的特征图，包括：According to some embodiments of the present invention, the multi-scale semantic feature extraction is performed on multiple multi-view images to obtain feature maps of multiple scales, including:

通过ResNet网络对多张所述多视角图像进行多层特征提取，获得多种尺度的原始特征图；Performing multi-layer feature extraction on multiple multi-view images through a ResNet network to obtain original feature maps of multiple scales;

将每种尺度的所述原始特征图分别与通道注意力连接，以通过通道注意力机制对每种尺度的所述原始特征图进行重要性加权，获得多种尺度的特征图。The original feature maps of each scale are respectively connected with channel attention, so as to weight the importance of the original feature maps of each scale through a channel attention mechanism, and obtain feature maps of multiple scales.

根据本发明的一些实施例，所述通过通道注意力机制对每种尺度的所述原始特征图进行重要性加权，获得多种尺度的特征图，包括：According to some embodiments of the present invention, the importance weighting of the original feature maps of each scale is carried out through the channel attention mechanism to obtain feature maps of multiple scales, including:

将每种尺度的所述原始特征图通过压缩网络进行压缩，获得每种尺度的所述原始特征图对应的一维特征图；Compressing the original feature map of each scale through a compression network to obtain a one-dimensional feature map corresponding to the original feature map of each scale;

将所述一维特征图通过激励网络输入全连接层进行重要性预测，获得每个通道的重要性大小；The one-dimensional feature map is input into the fully connected layer through the excitation network for importance prediction, and the importance of each channel is obtained;

将所述每个通道的重要性大小通过激励函数激励到每种尺度的所述原始特征图的一维特征图上，获得多种尺度的特征图。The importance of each channel is excited on the one-dimensional feature map of the original feature map of each scale through an activation function to obtain feature maps of multiple scales.

根据本发明的一些实施例，所述对所述多种尺度的特征图进行多尺度语义分割，获得多种尺度的语义分割集，包括：According to some embodiments of the present invention, performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales includes:

将所述多种尺度的特征图通过非负矩阵分解进行聚类，获得多种尺度的语义分割集；其中，所述非负矩阵分解的表达式为：The feature maps of the multiple scales are clustered by non-negative matrix decomposition to obtain semantic segmentation sets of multiple scales; wherein, the expression of the non-negative matrix decomposition is:

其中，V表示将多种尺度的特征图映射串联并重塑为HW行C列的矩阵V，P表示HW行K列的矩阵，Q表示K行C列的矩阵，H表示系数矩阵，W表示基矩阵，K表示语义簇数的非负矩阵分解因子，C表示每个像素的维度，F表示采用非诱导范数。Among them, V means that the feature map maps of various scales are concatenated and reshaped into a matrix V with HW rows and C columns, P means a matrix with HW rows and K columns, Q means a matrix with K rows and C columns, H means a coefficient matrix, and W means The basis matrix, K represents the non-negative matrix factorization factor of the number of semantic clusters, C represents the dimension of each pixel, and F represents the use of non-induced norms.

根据本发明的一些实施例，所述基于所述多种尺度的语义分割集和所述初始深度图，获得多种尺度的深度图，包括：According to some embodiments of the present invention, the obtaining of depth maps of multiple scales based on the semantic segmentation set of multiple scales and the initial depth map includes:

选取多张所述多视角图像中的任一张作为参考图，其他作为待匹配图；Select any one of multiple multi-view images as a reference image, and others as images to be matched;

从所述参考图中选取参考点，并获取所述参考点在所述语义分割集中对应的语义类别，以及获取所述参考点在所述初始深度图上对应的深度值；Selecting a reference point from the reference image, and acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth map;

通过如下公式选取所述参考点的数目：The number of reference points is selected by the following formula:

其中，

表示第j个分割集选取的参考点数目，H表示所述多视角图像的高度，W 表示所述多视角图像的宽度，HW表示所述多视角图像的像素点数量，t表示一个常量参数，

表示第j个所述语义分割集所含的语义类别数，

表示第i个所述语义分割集所含的语义类别数； in,

Represents the number of reference points selected by the j-th segmentation set, H represents the height of the multi-view image, W represents the width of the multi-view image, HW represents the number of pixels of the multi-view image, t represents a constant parameter,

Indicates the number of semantic categories contained in the jth semantic segmentation set,

Indicates the number of semantic categories contained in the i-th semantic segmentation set;

基于每个所述参考点，通过如下公式获取每个所述参考点在所述待匹配图上的匹配点：Based on each of the reference points, the matching points of each of the reference points on the graph to be matched are obtained by the following formula:

其中，

表示第i个参考点在所述待匹配图上的匹配点，K表示相机的内参，T表示所述相机的外参，

表示所述参考图中的参考点P_i在所述初始深度图上对应的深度值； in,

Represents the matching point of the i-th reference point on the image to be matched, K represents the internal reference of the camera, and T represents the external reference of the camera,

Indicates the depth value corresponding to the reference point P _i in the reference image on the initial depth image;

获取每个所述匹配点对应的语义类别，通过最小化语义损失函数对每种尺度的所述多视角图像进行修正，获得所述多种尺度的深度图，所述语义损失函数

的计算公式如下： Obtain the semantic category corresponding to each of the matching points, correct the multi-view image of each scale by minimizing the semantic loss function, and obtain the depth map of the multiple scales, and the semantic loss function

The calculation formula is as follows:

其中，

表示第i个所述参考点的语义信息和第i个所述匹配点的语义信息的差别，M_i表示掩膜，N表示所述参考点的数目。 in,

Indicates the difference between the semantic information of the i-th reference point and the semantic information of the i-th matching point, M _i represents the mask, and N represents the number of the reference points.

根据本发明的一些实施例，所述基于所述多种尺度的深度图，构建多种尺度的点云集，包括：According to some embodiments of the present invention, the construction of point cloud sets of multiple scales based on the depth maps of multiple scales includes:

将每种尺度的深度图，通过如下表达式构建每种尺度的点云集：The depth map of each scale is used to construct the point cloud set of each scale through the following expression:

其中，

表示所述深度图的横坐标，

表示所述深度图的纵坐标，

和

表示根据相机参数所获得的相机焦距，x、y和z表示点云转化的点云坐标。 in,

represents the abscissa of the depth map,

represents the ordinate of the depth map,

and

Indicates the focal length of the camera obtained according to the camera parameters, and x, y, and z indicate the point cloud coordinates of point cloud conversion.

根据本发明的一些实施例，根据所述点云集的尺度，对所述多种尺度的点云集采用不同的半径滤波进行优化，获得优化后的点云集，包括：According to some embodiments of the present invention, according to the scale of the point cloud set, the point cloud sets of various scales are optimized using different radius filters to obtain the optimized point cloud set, including:

获取所述多种尺度的点云集，每种尺度的所述点云集中的点云都有对应的半径大小和预设的邻点数量；Obtaining the point cloud sets of multiple scales, the point clouds in the point cloud sets of each scale have a corresponding radius size and a preset number of adjacent points;

根据所述点云集的尺度采用如下公式计算出所述点云集中点云对应的半径：According to the scale of the point cloud set, the following formula is used to calculate the radius corresponding to the point cloud in the point cloud set:

其中，

表示不同尺度的点云集中点云对应的半径，

表示常量参数，t表示常量参数，

表示每个点云集的预先所设定的尺度等级； in,

Indicates the radius corresponding to the point cloud in the point cloud set of different scales,

Represents a constant parameter, t represents a constant parameter,

Represents the preset scale level of each point cloud set;

根据每个点云对应的所述半径大小和预设的邻点数量对所述多种尺度的点云集进行优化，获得优化后的点云集。The point cloud sets of various scales are optimized according to the radius size corresponding to each point cloud and the preset number of adjacent points, to obtain an optimized point cloud set.

第二方面，本发明实施例还提供了一种基于深度学习的多视图三维重建系统，所述基于深度学习的多视图三维重建系统包括：In the second aspect, the embodiment of the present invention also provides a multi-view 3D reconstruction system based on deep learning, and the multi-view 3D reconstruction system based on deep learning includes:

特征图获取单元，用于获取多视角图像，对所述多视角图像进行多尺度语义特征提取，获得多种尺度的特征图；A feature map acquisition unit, configured to acquire multi-view images, perform multi-scale semantic feature extraction on the multi-view images, and obtain feature maps of multiple scales;

语义分割集获取单元，用于对所述多种尺度的特征图进行多尺度语义分割，获得多种尺度的语义分割集；a semantic segmentation set acquisition unit, configured to perform multi-scale semantic segmentation on the feature maps of multiple scales, and obtain semantic segmentation sets of multiple scales;

初始深度图获取单元，用于通过有监督的三维重建方法对多张所述多视角图像进行重建，获得初始深度图；An initial depth map acquisition unit, configured to reconstruct multiple multi-view images through a supervised three-dimensional reconstruction method to obtain an initial depth map;

深度图获取单元，用于基于所述多种尺度的语义分割集和所述初始深度图，获得多种尺度的深度图；a depth map acquisition unit, configured to obtain depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;

点云集获取单元，用于基于所述多种尺度的深度图，构建多种尺度的点云集；A point cloud set acquisition unit, configured to construct point cloud sets of multiple scales based on the depth maps of multiple scales;

半径滤波单元，用于根据所述点云集的尺度，对所述多种尺度的点云集采用不同的半径滤波进行优化，获得优化后的点云集；The radius filtering unit is configured to optimize the point cloud sets of multiple scales by using different radius filters according to the scale of the point cloud set to obtain an optimized point cloud set;

重建结果获取单元，用于基于所述优化后的点云集进行不同尺度的重建，获得不同尺度的三维重建结果；A reconstruction result acquisition unit, configured to perform reconstruction at different scales based on the optimized point cloud set, and obtain three-dimensional reconstruction results at different scales;

重建结果融合单元，用于将每种尺度的重建结果进行拼接融合，获得最终的三维重建结果。The reconstruction result fusion unit is configured to stitch and fuse the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

与现有技术相比，本发明第二方面具有以下有益效果：Compared with the prior art, the second aspect of the present invention has the following beneficial effects:

本系统的特征图获取单元通过对多张多视角图像进行多尺度语义特征提取，能够提取深层次的特征，能获得多种尺度的特征图，并通过语义分割集获取单元对多种尺度的特征图进行多尺度语义分割，聚合各个尺度的语义信息，丰富了各个尺度的语义信息；深度图获取单元通过利用多种尺度的语义分割集中的各个尺度的语义信息分别对初始深度图进行语义引导，从而不断修正初始深度图，获得准确的多种尺度的深度图；本系统的点云集获取单元用获得的多种尺度的深度图构建多种尺度的点云集，通过半径滤波单元根据点云集的尺度采用不同的半径滤波进行优化，通过重建结果获取单元基于优化后的点云集进行不同尺度的重建，再通过重建结果融合单元将三维重建结果融合以获得更加精确的三维重建结果。因此，本系统能够充分利用各个尺度的语义信息，能够提高三维重建的精确度。The feature map acquisition unit of this system can extract deep-level features by extracting multi-scale semantic features from multiple multi-view images, and can obtain feature maps of multiple scales, and obtain features of multiple scales through the semantic segmentation set acquisition unit. Multi-scale semantic segmentation is performed on the graph, and the semantic information of each scale is aggregated to enrich the semantic information of each scale; the depth map acquisition unit uses the semantic information of each scale in the semantic segmentation set of multiple scales to guide the semantics of the initial depth map respectively. In this way, the initial depth map is continuously corrected to obtain accurate depth maps of multiple scales; the point cloud acquisition unit of this system uses the obtained depth maps of various scales to construct point cloud sets of various scales, and the radius filter unit is used according to the scale of the point cloud set Different radius filters are used for optimization, and the reconstruction result acquisition unit performs reconstruction of different scales based on the optimized point cloud set, and then the 3D reconstruction results are fused by the reconstruction result fusion unit to obtain a more accurate 3D reconstruction result. Therefore, this system can make full use of the semantic information of each scale, and can improve the accuracy of 3D reconstruction.

第三方面，本发明实施例还提供了一种基于深度学习的多视图三维重建设备，包括至少一个控制处理器和用于与所述至少一个控制处理器通信连接的存储器；所述存储器存储有可被所述至少一个控制处理器执行的指令，所述指令被所述至少一个控制处理器执行，以使所述至少一个控制处理器能够执行如上所述的一种基于深度学习的多视图三维重建方法。In the third aspect, the embodiment of the present invention also provides a multi-view 3D reconstruction device based on deep learning, including at least one control processor and a memory for communicating with the at least one control processor; the memory stores Instructions executable by the at least one control processor, the instructions are executed by the at least one control processor, so that the at least one control processor can perform the above-mentioned deep learning-based multi-view 3D rebuild method.

第四方面，本发明实施例还提供了一种计算机可读存储介质，所述计算机可读存储介质存储有计算机可执行指令，所述计算机可执行指令用于使计算机执行如上所述的一种基于深度学习的多视图三维重建方法。In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to make a computer perform one of the above-mentioned Multi-view 3D reconstruction method based on deep learning.

可以理解的是，上述第三方面至第四方面与相关技术相比存在的有益效果与上述第一方面与相关技术相比存在的有益效果相同，可以参见上述第一方面中的相关描述，在此不再赘述。It can be understood that the beneficial effects of the above-mentioned third aspect to the fourth aspect compared with the related technology are the same as those of the above-mentioned first aspect compared with the related technology. Please refer to the relevant description in the above-mentioned first aspect. This will not be repeated here.

附图说明Description of drawings

本发明的上述和/或附加的方面和优点从结合下面附图对实施例的描述中将变得明显和容易理解，其中：The above and/or additional aspects and advantages of the present invention will become apparent and comprehensible from the description of the embodiments in conjunction with the following drawings, wherein:

图1是本发明一实施例的一种基于深度学习的多视图三维重建方法的流程图；Fig. 1 is a flowchart of a multi-view 3D reconstruction method based on deep learning according to an embodiment of the present invention;

图2是本发明一实施例的深度残差网络的结构图；Fig. 2 is a structural diagram of a deep residual network according to an embodiment of the present invention;

图3是本发明一实施例的非负矩阵分解的示意图；Fig. 3 is a schematic diagram of a non-negative matrix decomposition of an embodiment of the present invention;

图4是本发明一实施例的多尺度语义分割的结构图；4 is a structural diagram of multi-scale semantic segmentation according to an embodiment of the present invention;

图5是本发明一实施例的一种基于深度学习的多视图三维重建系统的结构图。Fig. 5 is a structural diagram of a multi-view 3D reconstruction system based on deep learning according to an embodiment of the present invention.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的，仅用于解释本发明，而不能理解为对本发明的限制。Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, wherein the same or similar reference numerals designate the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the figures are exemplary only for explaining the present invention and should not be construed as limiting the present invention.

在本发明的描述中，如果有描述到第一、第二等只是用于区分技术特征为目的，而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量或者隐含指明所指示的技术特征的先后关系。In the description of the present invention, if the first, second, etc. are described only for the purpose of distinguishing technical features, it cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features or implying Indicates the sequence of the indicated technical features.

在本发明的描述中，需要理解的是，涉及到方位描述，例如上、下等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。In the description of the present invention, it should be understood that when it comes to orientation descriptions, for example, the orientation or positional relationship indicated by up, down, etc. is based on the orientation or positional relationship shown in the drawings, which is only for the convenience of describing the present invention and simplifying the description , rather than indicating or implying that the device or element referred to must have a particular orientation, be constructed and operate in a particular orientation, and thus should not be construed as limiting the invention.

本发明的描述中，需要说明的是，除非另有明确的限定，设置、安装、连接等词语应做广义理解，所属技术领域技术人员可以结合技术方案的具体内容合理确定上述词语在本发明中的具体含义。In the description of the present invention, it should be noted that, unless otherwise clearly defined, words such as setting, installation, and connection should be understood in a broad sense, and those skilled in the art can reasonably determine that the above words are included in the present invention in combination with the specific content of the technical solution. specific meaning.

为方便本领域人员理解，对本实施例中的名词进行解释：For the convenience of those skilled in the art to understand, the nouns in the present embodiment are explained:

深度学习三维重建方法：深度学习的三维重建方法是利用计算机搭建神经网络,通过大量的图像数据与三维模型数据进行训练, 学习图像至三维模型的映射关系, 从而实现对新的图像目标进行三维重建。与传统的诸如3DMM重建三维信息的方法和SFM重建三维信息的方法相比，深度学习的三维重建方法能够将学习到的一些全局的语义信息引入图像重建，从而在一定程度上克服传统重建方法在弱光照、弱纹理区域重建不良的局限性，其中，SFM算法是一种基于各种收集到的无序图片进行三维重建的离线算法；3DMM，即三维可变形人脸模型，是一个通用的三维人脸模型，用固定的点数来表示人脸。Deep learning 3D reconstruction method: The 3D reconstruction method of deep learning is to use a computer to build a neural network, train through a large amount of image data and 3D model data, and learn the mapping relationship between images and 3D models, so as to achieve 3D reconstruction of new image targets . Compared with the traditional method of reconstructing 3D information such as 3DMM and the method of reconstructing 3D information by SFM, the 3D reconstruction method of deep learning can introduce some global semantic information learned into image reconstruction, thus to a certain extent overcome the traditional reconstruction method in The limitations of poor reconstruction in weak light and weak texture areas. Among them, the SFM algorithm is an offline algorithm for 3D reconstruction based on various collected disordered pictures; 3DMM, that is, 3D deformable face model, is a general 3D Face model, which uses a fixed number of points to represent the face.

目前的深度学习的三维重建方法主要可分为有监督的三维重建方法（例如，现有技术中的NVSNet、CVP-MVSNet、PatchmatchNet等）和自监督的三维重建方法（例如现有技术中的JDACS-MS等）两类。有监督的三维重建方法需要真值来进行训练，精度较高，但在一些真值获取困难的场景下难以适用。自监督的三维重建方法无需真值进行训练，适用面较广，精度相对低。The current 3D reconstruction methods of deep learning can be mainly divided into supervised 3D reconstruction methods (for example, NVSNet, CVP-MVSNet, PatchmatchNet, etc. in the prior art) and self-supervised 3D reconstruction methods (such as JDACS in the prior art -MS, etc.) two categories. The supervised 3D reconstruction method needs the real value for training, and has high accuracy, but it is difficult to apply in some scenes where the real value is difficult to obtain. The self-supervised 3D reconstruction method does not need the ground truth for training, so it has a wide range of applications and relatively low accuracy.

语义分割：语义分割是在像素级别上的分类，属于同一类的像素都要被归为一类，因此语义分割是从像素级别来理解图像的，例如，用不同的颜色将含有不同语义的像素标记出来。属于动物的像素被分为同一类。分割的语义信息能够对图像重建进行引导，提高重建的精度。采用聚类的方式进行语义分割，将属于同类的像素聚为相同的类。Semantic segmentation: Semantic segmentation is a classification at the pixel level. Pixels belonging to the same category must be classified into one category. Therefore, semantic segmentation understands images from the pixel level. For example, pixels with different semantics are divided into different colors. Mark it out. Pixels belonging to animals are grouped into the same class. The semantic information of segmentation can guide image reconstruction and improve the accuracy of reconstruction. Clustering is used for semantic segmentation, and pixels belonging to the same class are clustered into the same class.

深度图：也叫距离影像，是指将从图像采集器到场景中各点的距离（深度）值作为像素值的图像。Depth map: also called distance image, refers to an image with the distance (depth) value from the image collector to each point in the scene as the pixel value.

点云：物体外观表面的点数据集合是点云，含有物体的三维坐标信息和颜色等信息，通过点云数据可以实现图像的重建。Point cloud: The collection of point data on the appearance surface of an object is a point cloud, which contains information such as the three-dimensional coordinate information and color of the object, and the reconstruction of the image can be realized through the point cloud data.

非负矩阵分解（NMF）：是在矩阵中所有元素均为非负数约束条件之下的矩阵分解方法。利用矩阵分解来解决实际问题的分析方法很多，如PCA(主成分分析)、ICA(独立成分分析)、SVD(奇异值分解)、VQ(矢量量化)等。在所有这些方法中，原始的大矩阵V被近似分解为低秩的V=WH形式。这些方法的共同特点是，因子W和H中的元素可为正或负，即使输入的初始矩阵元素是全正的，传统的秩削减算法也不能保证原始数据的非负性。在数学上，从计算的观点看，分解结果中存在负值是正确的，但负值元素在实际问题中往往是没有意义的。Non-negative matrix factorization (NMF): It is a matrix factorization method under the constraint that all elements in the matrix are non-negative numbers. There are many analysis methods for solving practical problems by using matrix decomposition, such as PCA (Principal Component Analysis), ICA (Independent Component Analysis), SVD (Singular Value Decomposition), VQ (Vector Quantization) and so on. In all these methods, the original large matrix V is approximately decomposed into a low-rank V=WH form. The common feature of these methods is that the elements in the factors W and H can be positive or negative, even if the input initial matrix elements are all positive, the traditional rank reduction algorithm cannot guarantee the non-negativity of the original data. Mathematically, from a computational point of view, it is correct to have negative values in the decomposition results, but negative elements are often meaningless in practical problems.

深度学习的三维重建方法是利用计算机搭建神经网络, 通过大量的图像数据与三维模型数据进行训练, 学习图像至三维模型的映射关系, 从而实现对新的图像目标进行三维重建。与传统的诸如3DMM方法和SFM方法相比，深度学习的三维重建方法能够将学习到的一些全局的语义信息引入图像重建，从而在一定程度上克服传统重建方法在弱光照、弱纹理区域重建不良的局限性。The 3D reconstruction method of deep learning is to use a computer to build a neural network, train through a large amount of image data and 3D model data, and learn the mapping relationship between images and 3D models, so as to achieve 3D reconstruction of new image targets. Compared with traditional methods such as 3DMM method and SFM method, the 3D reconstruction method of deep learning can introduce some global semantic information learned into image reconstruction, so as to overcome the poor reconstruction of traditional reconstruction methods in weakly illuminated and weakly textured areas to a certain extent. limitations.

为解决上述问题，本申请对多张多视角图像进行多尺度语义特征提取，能够提取不同尺度的特征，能获得多种尺度的特征图，并对多种尺度的特征图进行多尺度语义分割，聚合各个尺度的语义信息，丰富了各个尺度的语义信息；通过利用多种尺度的语义分割集中的各个尺度的语义信息分别对初始深度图进行语义引导，从而不断修正初始深度图，获得准确的多种尺度的深度图；本申请用获得的多种尺度的深度图构建多种尺度的点云集，根据点云集的尺度采用不同的半径滤波进行优化，优化后的点云集用于不同尺度的重建，再将三维重建结果融合以获得更加精确的三维重建结果。因此，本申请能够充分利用各个尺度的语义信息，能够提高三维重建的精确度。In order to solve the above problems, this application performs multi-scale semantic feature extraction on multiple multi-view images, which can extract features of different scales, obtain feature maps of multiple scales, and perform multi-scale semantic segmentation on feature maps of multiple scales. The semantic information of each scale is aggregated to enrich the semantic information of each scale; the initial depth map is semantically guided by using the semantic information of each scale in the semantic segmentation set of multiple scales, so as to continuously correct the initial depth map and obtain accurate multiple Depth maps of different scales; this application uses the obtained depth maps of multiple scales to construct point cloud sets of multiple scales, and uses different radius filters to optimize according to the scale of the point cloud set. The optimized point cloud set is used for reconstruction of different scales. The 3D reconstruction results are then fused to obtain a more accurate 3D reconstruction result. Therefore, the present application can make full use of the semantic information of each scale, and can improve the accuracy of three-dimensional reconstruction.

参照图1，本发明实施例提供了一种基于深度学习的多视图三维重建方法，本基于深度学习的多视图三维重建方法包括：Referring to FIG. 1 , an embodiment of the present invention provides a method for multi-view 3D reconstruction based on deep learning. The method for multi-view 3D reconstruction based on deep learning includes:

步骤S100、获取多张多视角图像，对多张多视角图像进行多尺度语义特征提取，获得多种尺度的特征图。Step S100 , acquiring multiple multi-view images, performing multi-scale semantic feature extraction on the multiple multi-view images, and obtaining feature maps of multiple scales.

具体的，获取多张多视角图像，可以通过图像采集设备，例如摄像机、图像扫描仪等，对待识别的物体进行全方位多种角度的图像采集，得到多张多视角图像。例如，当需要对对多张多视角图像进行多尺度语义特征提取时，可以采用摄像机等图像采集设备得到多张多视角图像。Specifically, to obtain multiple multi-view images, an image acquisition device, such as a video camera, an image scanner, etc., may be used to collect images from all directions and multiple angles of the object to be recognized to obtain multiple multi-view images. For example, when multi-scale semantic feature extraction needs to be performed on multiple multi-view images, multiple multi-view images can be obtained by using an image acquisition device such as a camera.

本实施例通过ResNet网络对多张多视角图像进行多层特征提取，获得多种尺度的原始特征图；In this embodiment, multi-layer feature extraction is performed on multiple multi-view images through the ResNet network to obtain original feature maps of multiple scales;

将每种尺度的原始特征图分别与通道注意力连接，以通过通道注意力机制对每种尺度的原始特征图进行重要性加权，获得多种尺度的特征图，具体的：The original feature maps of each scale are respectively connected with the channel attention to weight the importance of the original feature maps of each scale through the channel attention mechanism to obtain feature maps of multiple scales, specifically:

将每种尺度的原始特征图通过压缩网络进行压缩，获得每种尺度的原始特征图对应的一维特征图；Compress the original feature map of each scale through the compression network to obtain the one-dimensional feature map corresponding to the original feature map of each scale;

将一维特征图通过激励网络输入全连接层进行重要性预测，获得每个通道的重要性大小；Input the one-dimensional feature map into the fully connected layer through the excitation network for importance prediction, and obtain the importance of each channel;

将每个通道的重要性大小通过激励函数激励到每种尺度的原始特征图的一维特征图上，获得多种尺度的特征图。The importance of each channel is excited to the one-dimensional feature map of the original feature map of each scale through the activation function to obtain feature maps of multiple scales.

在本实施例中，采用ResNet网络提取图像特征，当深度学习网络层数越深时，理论上表达能力会更强，但是CNN网络达到一定的深度后，再加深，分类性能不会提高，而是会导致网络收敛更缓慢，准确率也随着降低；即使把数据集增大，解决过拟合的问题，分类性能和准确度也不会提高。而ResNet网络采用一个残差学习的方法，参照图2，当输入为x时其学习到的特征记为

，现在我们希望其可以学习到残差

，这样其实原始的学习特征是

。之所以这样是因为残差学习相比原始特征直接学习更容易。当残差为0时，此时堆积层仅仅做了恒等映射，至少网络性能不会下降，实际上残差不会为0，这也会使得堆积层在输入特征基础上学习到新的特征，从而拥有更好的性能。这种残差函数更容易优化，能使网络层数大大加深，从而能够提取到更加深层次的语义信息。ResNet在效率、资源消耗以及深层次语义特征提取方面的性能表现显著优于VGG等网络。 In this embodiment, the ResNet network is used to extract image features. When the depth of the deep learning network layer is deeper, the expression ability will be stronger in theory, but after the CNN network reaches a certain depth, if it is deepened, the classification performance will not be improved, and It will cause the network to converge more slowly, and the accuracy rate will also decrease; even if the data set is increased to solve the problem of over-fitting, the classification performance and accuracy will not improve. The ResNet network uses a residual learning method. Referring to Figure 2, when the input is x, the learned features are recorded as

, now we hope that it can learn the residual

, so in fact the original learning feature is

. The reason for this is that residual learning is easier than direct learning of raw features. When the residual is 0, the accumulation layer only does the identity mapping at this time, at least the network performance will not decrease, in fact the residual will not be 0, which will also allow the accumulation layer to learn new features based on the input features , resulting in better performance. This kind of residual function is easier to optimize, and can greatly deepen the number of network layers, so that deeper semantic information can be extracted. The performance of ResNet in terms of efficiency, resource consumption, and deep semantic feature extraction is significantly better than that of networks such as VGG.

在通过ResNet网络对多张多视角图像进行多层特征提取，获得多种尺度的原始特征图后，将每种尺度的原始特征图分别与通道注意力连接，以通过通道注意力机制对每种尺度的原始特征图进行重要性加权，获得多种尺度的特征图。该通道注意力机制主要由压缩网络和激励网络两部分组成，具体过程为：After multi-layer feature extraction is performed on multiple multi-view images through the ResNet network to obtain original feature maps of various scales, the original feature maps of each scale are connected to the channel attention, so as to use the channel attention mechanism for each The original feature maps of different scales are weighted by importance to obtain feature maps of multiple scales. The channel attention mechanism is mainly composed of compression network and incentive network. The specific process is as follows:

设原始特征图的维度为H*W*C，其中H是高度（Height），W是宽度（width），C是通道数（channel）。压缩网络做的事情是把H*W*C压缩为1*1*C，相当于把H*W压缩成一维特征了，通过全局平均池化实现。H*W压缩成一维后，相当于这一维参数获得了之前H*W全局的视野，感受区域更广。将压缩网络得到的一维特征传递给激励网络，激励网络将一维特征传输至一个全连接层，对每个通道的重要性进行预测，得到不同通道的重要性大小后，再通过Sigmoid激励函数将不同通道的重要性大小激励到之前的特征图所对应的通道上。通道注意力机制使得网络能够关注更有效的语义特征，并迭代提高其权重，特征提取网络会提取到丰富的语义特征，不同的语义特征对于语义分割的重要性是不同的。通道注意力机制的引入能够使得网络关注那些更加有效的特征，抑制低效的特征，提高特征提取的有效性。Let the dimension of the original feature map be H*W*C, where H is the height (Height), W is the width (width), and C is the number of channels (channel). What the compression network does is to compress H*W*C into 1*1*C, which is equivalent to compressing H*W into a one-dimensional feature, which is realized by global average pooling. After H*W is compressed into one dimension, it is equivalent to obtaining the previous H*W global field of view for this dimension parameter, and the sensory area is wider. The one-dimensional features obtained by the compressed network are passed to the incentive network, and the incentive network transmits the one-dimensional features to a fully connected layer to predict the importance of each channel, and after obtaining the importance of different channels, pass the Sigmoid activation function The importance of different channels is excited to the channel corresponding to the previous feature map. The channel attention mechanism enables the network to focus on more effective semantic features and iteratively increase their weights. The feature extraction network will extract rich semantic features. Different semantic features have different importance for semantic segmentation. The introduction of the channel attention mechanism can make the network focus on those more effective features, suppress inefficient features, and improve the effectiveness of feature extraction.

在本实施例中，由于现有技术特征提取所用的卷积神经网络，像VGG网络的特征提取受限于网络提取层数，深层次特征提取能力不足，特征有效性不高。随着卷积层数的增加会出现网络收敛缓慢，准确率降低等问题，特征提取能力不足，并且所提取的全部特征对图像重建的重要性是不同的，难以保证提取到有效性高的特征。因此，本实施例通过对多张多视角图像进行多尺度语义特征提取，能够提取深层次的特征，能获得多种尺度的特征图。并通过通道注意力机制的引入能够使得网络关注那些更加有效的特征，抑制低效的特征，提高特征提取的有效性。In this embodiment, due to the convolutional neural network used for feature extraction in the prior art, the feature extraction of the VGG network is limited by the number of network extraction layers, the deep feature extraction capability is insufficient, and the feature validity is not high. As the number of convolutional layers increases, there will be problems such as slow network convergence and reduced accuracy, and the feature extraction ability is insufficient, and all the extracted features are of different importance to image reconstruction, so it is difficult to guarantee the extraction of highly effective features. . Therefore, in this embodiment, by performing multi-scale semantic feature extraction on multiple multi-view images, deep-level features can be extracted, and feature maps of multiple scales can be obtained. And through the introduction of the channel attention mechanism, the network can focus on those more effective features, suppress inefficient features, and improve the effectiveness of feature extraction.

步骤S200、对多种尺度的特征图进行多尺度语义分割，获得多种尺度的语义分割集。Step S200 , performing multi-scale semantic segmentation on feature maps of multiple scales to obtain semantic segmentation sets of multiple scales.

具体的，将多种尺度的特征图通过非负矩阵分解进行聚类，获得多种尺度的语义分割集；其中，非负矩阵分解的表达式为：Specifically, feature maps of multiple scales are clustered through non-negative matrix decomposition to obtain semantic segmentation sets of multiple scales; where the expression of non-negative matrix decomposition is:

通常的矩阵分解会把一个大的矩阵分解为多个小的矩阵，但是这些矩阵的元素有正有负。而在现实世界中，比如图像，文本等形成的矩阵中负数的存在是没有意义的，所以如果能把一个矩阵分解成全是非负元素是很有意义的。在NMF中要求原始的矩阵

的所有元素均是非负的，那么矩阵

可以分解为两个更小的非负矩阵的乘积，这个矩阵有且仅有一个这样的分解，即满足存在性和唯一性。例如， The usual matrix decomposition decomposes a large matrix into multiple small matrices, but the elements of these matrices are positive and negative. In the real world, the existence of negative numbers in the matrix formed by images, texts, etc. is meaningless, so it is very meaningful if a matrix can be decomposed into all non-negative elements. Ask for the original matrix in NMF

All elements of are non-negative, then the matrix

It can be decomposed into the product of two smaller non-negative matrices, and this matrix has one and only one such decomposition, which satisfies existence and uniqueness. E.g,

给定矩阵

，寻找非负矩阵

和非负矩阵

，使得

。分解前后可理解为：原始矩阵

的列向量是对左矩阵中所有列向量的加权和，而权重系数就是右矩阵对应列向量的元素，故称

为基矩阵，

为系数矩阵。 given matrix

, looking for a nonnegative matrix

and a nonnegative matrix

, making

. Before and after decomposition, it can be understood as: the original matrix

The column vector of is the weighted sum of all column vectors in the left matrix, and the weight coefficient is the element of the corresponding column vector of the right matrix, so it is called

is the basis matrix,

is the coefficient matrix.

参照图3，首先将N张多种尺度的特征图映射串联并重塑为（HW，C）矩阵V。利用乘法更新规则求解NMF，即使用公式

，

求解NMF，通过图中的NMF分解（即， NMF非负矩阵分解）将V分解为（HW，K）矩阵P和（K，C）矩阵Q，其中K是表示语义簇数的NMF因子。由于NMF（QQ^T=I）的正交约束性，可以将（K，C）矩阵Q的每一行视为C维的簇心，K，C）矩阵Q 的每一行对应于视图中的若干对象。（HW，K）矩阵P的行对应于来自N张多种尺度的特征图的所有像素的位置。通常，矩阵分解强制执行P的每一行和Q的每一列之间的乘积，以更好地近似V中每个像素的C维特征。这样，通过P矩阵得到了图像中每个位置的语义类别。 Referring to Figure 3, first, N feature map maps of various scales are concatenated and reshaped into a (HW, C) matrix V. Using the multiplicative update rule to solve NMF, that is, using the formula

,

To solve NMF, decompose V into (HW, K) matrix P and (K, C) matrix Q by NMF decomposition (i.e., NMF non-negative matrix factorization) in the figure, where K is the NMF factor representing the number of semantic clusters. Due to the orthogonal constraint of NMF (QQ ^T = I), each row of the (K, C) matrix Q can be regarded as a C-dimensional cluster center, and each row of the K, C) matrix Q corresponds to several objects in the view . The rows of the (HW,K) matrix P correspond to the positions of all pixels from N feature maps of various scales. In general, matrix factorization enforces a product between each row of P and each column of Q to better approximate the C-dimensional features of each pixel in V. In this way, the semantic category of each position in the image is obtained through the P matrix.

参照图4，假设提取的特征图

通过聚类的方式（即，NMF非负矩阵分解）进行语义分割，将各特征矩阵

分解为

，由于高层特征层的感受野大，特征更抽象，更加关注全局。低层特征层感受野小，更加关注细节。因此，通过多尺度的语义分割获得的各个分割集

，由粗到细包含多个层次。图4中的分割集S1至S3含有的细节信息逐渐增多。每一个分割集S中含有输入的一组图像（参考图和待匹配图）的语义分割结果，例如，不同的颜色表示不同的语义类别，含有更多的细节信息的分割集（如分割集S3）会含有更多的语义类别。 Referring to Figure 4, it is assumed that the extracted feature map

Semantic segmentation is performed by clustering (that is, NMF non-negative matrix decomposition), and each feature matrix

Decomposed into

, due to the large receptive field of the high-level feature layer, the features are more abstract and pay more attention to the overall situation. The low-level feature layer has a small receptive field and pays more attention to details. Therefore, each segmentation set obtained by multi-scale semantic segmentation

, including multiple levels from coarse to fine. The segmentation sets S1 to S3 in FIG. 4 contain more detailed information gradually. Each segmentation set S contains the semantic segmentation results of a set of input images (reference image and image to be matched), for example, different colors represent different semantic categories, segmentation sets containing more detailed information (such as segmentation set S3) will contain more semantic categories.

在本实施例中，由于目前的深度学习三维重建方法大多基于单一尺度，即对于图像中不同大小的物体采取同样的方式进行重建。单尺度的重建在一些场景复杂度较低、细小物体较少的环境下能保持较好的重建精度和速度，但在一些场景复杂、各种尺度的物体较多的环境下容易出现小尺度物体重建精度不足的问题；并且只利用了高层特征，图像的低层细节信息没有得到充分利用。因此，本实施例通过对多种尺度的特征图进行多尺度语义分割，聚合各个尺度的语义信息，丰富了各个尺度的语义信息，并使得低层特征层的细节信息能够得到充分利用。In this embodiment, since the current deep learning 3D reconstruction methods are mostly based on a single scale, that is, objects of different sizes in the image are reconstructed in the same manner. Single-scale reconstruction can maintain good reconstruction accuracy and speed in some environments with low scene complexity and few small objects, but small-scale objects are prone to appear in some environments with complex scenes and many objects of various scales The problem of insufficient reconstruction accuracy; and only high-level features are used, and the low-level detail information of the image is not fully utilized. Therefore, this embodiment performs multi-scale semantic segmentation on feature maps of multiple scales, aggregates semantic information of each scale, enriches semantic information of each scale, and makes full use of detailed information of low-level feature layers.

步骤S300、通过有监督的三维重建方法对多张多视角图像进行重建，获得初始深度图。Step S300 , using a supervised three-dimensional reconstruction method to reconstruct multiple multi-view images to obtain an initial depth map.

具体的，本实施例通过有监督的三维重建方法对多张多视角图像进行重建，获得初始深度图。Specifically, in this embodiment, a supervised three-dimensional reconstruction method is used to reconstruct multiple multi-view images to obtain an initial depth map.

本实施例通过有监督的三维重建方法获得初始深度图，能够提高重建精度。因为有监督的三维重建方法精度较高，但需要大量的训练真值数据，在某些特定的场景下（例如，水下），训练真值获取困难，难以适用。因此，需要步骤S400对本实施例的初始深度图进行语义引导，有监督三维重建方法转变为无监督，实现自监督的三维重建，从而克服有监督的三维重建方法的固有缺陷。In this embodiment, an initial depth map is obtained through a supervised three-dimensional reconstruction method, which can improve reconstruction accuracy. Because the supervised 3D reconstruction method has high accuracy, but requires a large amount of training ground truth data, in some specific scenarios (for example, underwater), it is difficult to obtain the training ground truth value and is difficult to apply. Therefore, step S400 is required to carry out semantic guidance on the initial depth map of this embodiment, and the supervised 3D reconstruction method is transformed into unsupervised to realize self-supervised 3D reconstruction, thereby overcoming the inherent defects of the supervised 3D reconstruction method.

需要说明的是，本实施例中的有监督的三维重建方法为现有技术中的任一种有监督的三维重建方法，例如，MVSNet（MVSNet: Depth Inference for Unstructured Multi-view Stereo）、CVP-MVSNet（Cost Volume Pyramid Based Depth Inference for Multi-View Stereo）和PatchmatchNet（PatchmatchNet: Learned Multi-View PatchmatchStereo）等，本实施例不再详细描述。It should be noted that the supervised 3D reconstruction method in this embodiment is any supervised 3D reconstruction method in the prior art, for example, MVSNet (MVSNet: Depth Inference for Unstructured Multi-view Stereo), CVP- MVSNet (Cost Volume Pyramid Based Depth Inference for Multi-View Stereo) and PatchmatchNet (PatchmatchNet: Learned Multi-View PatchmatchStereo), etc., will not be described in detail in this embodiment.

步骤S400、基于多种尺度的语义分割集和初始深度图，获得多种尺度的深度图。Step S400 , based on the semantic segmentation sets of multiple scales and the initial depth map, depth maps of multiple scales are obtained.

具体的，本实施例通过语义信息作为监督信号结合一个有监督的三维重建方法，引导图像重建获得深度图，具体过程为：Specifically, this embodiment uses semantic information as a supervisory signal combined with a supervised 3D reconstruction method to guide image reconstruction to obtain a depth map. The specific process is:

通过图像采集设备获取多张多视角图像，将多张多视角图像作为输入通过有监督的三维重建方法获得初始深度图；Obtain multiple multi-view images through an image acquisition device, and use the multiple multi-view images as input to obtain an initial depth map through a supervised 3D reconstruction method;

选取多张多视角图像中的任一张作为参考图，其他作为待匹配图；Select any one of multiple multi-view images as a reference image, and the others as images to be matched;

从参考图中选取参考点，并获取参考点在语义分割集中对应的语义类别，以及获取参考点在初始深度图上对应的深度值；Select a reference point from the reference image, and obtain the semantic category corresponding to the reference point in the semantic segmentation set, and obtain the depth value corresponding to the reference point on the initial depth map;

通过如下公式选取参考点的数目：The number of reference points is selected by the following formula:

其中，

表示第j个分割集选取的参考点数目，H表示多视角图像的高度，W表示多视角图像的宽度，HW表示多视角图像的像素点数量，t表示一个常量参数，

表示第j个语义分割集所含的语义类别数，

表示第i个语义分割集所含的语义类别数； in,

Indicates the number of reference points selected by the j-th segmentation set, H indicates the height of the multi-view image, W indicates the width of the multi-view image, HW indicates the number of pixels of the multi-view image, t indicates a constant parameter,

基于每个参考点，通过如下公式获取每个参考点在待匹配图上的匹配点：Based on each reference point, the matching point of each reference point on the image to be matched is obtained by the following formula:

其中，

表示第i个参考点在待匹配图上的匹配点，K表示相机的内参，T表示相机的外参，

表示参考图中的参考点P_i在初始深度图上对应的深度值； in,

Indicates the matching point of the i-th reference point on the image to be matched, K indicates the internal reference of the camera, T indicates the external reference of the camera,

获取每个匹配点对应的语义类别，通过最小化语义损失函数对每种尺度的多视角图像进行修正，获得多种尺度的深度图，语义损失函数

的计算公式如下： Obtain the semantic category corresponding to each matching point, correct the multi-view image of each scale by minimizing the semantic loss function, and obtain depth maps of multiple scales, semantic loss function

The calculation formula is as follows:

其中，

表示第i个参考点的语义信息和第i个匹配点的语义信息的差别，M_i 表示掩膜，N表示参考点的数目。本实施例采用如下例子进行说明： in,

Indicates the difference between the semantic information of the i-th reference point and the semantic information of the i-th matching point, M _i represents the mask, and N represents the number of reference points. This embodiment uses the following example for illustration:

首先，通过图像采集设备获取同一物体不同视角下的多张多视角图像，将多张多视角图像作为输入，通过一个有监督的三维重建方法可以得到初始深度图。在输入的多张多视角图像中选取一张为参考图，其余为待匹配图，在参考图像上取一点参考点P_i，及其在分割集S上对应的语义类别S_i，以及在深度图上对应的深度值。First, multiple multi-view images under different viewing angles of the same object are acquired through the image acquisition device, and the multiple multi-view images are used as input, and the initial depth map can be obtained through a supervised 3D reconstruction method. Select one of the multiple input multi-view images as the reference image, and the rest are the images to be matched. Take a reference point P _i on the reference image, its corresponding semantic category S _i on the segmentation set S, and the depth The corresponding depth value on the map.

对于不同层次的分割集，由于语义类别数的不同，类别较多的分割集需要更精细的引导，参考点数目应更多，参考点数目选取依据公式：For segmentation sets of different levels, due to the difference in the number of semantic categories, segmentation sets with more categories need more refined guidance, and the number of reference points should be more. The number of reference points is selected according to the formula:

通过如下单应性矩阵公式求出与参考点在待匹配图上对应的匹配点

： Calculate the matching point corresponding to the reference point on the image to be matched by the following homography matrix formula

:

取匹配点

的语义类别

，参考点在深度图准确的情况下（即对应位置的深度值正确）所计算出的匹配点的语义类别应与参考点的语义类别相同，计算并最小化如下的语义损失函数： take match point

Semantic category of

, when the depth map of the reference point is accurate (that is, the depth value of the corresponding position is correct), the semantic category of the matching point calculated should be the same as the semantic category of the reference point, and the following semantic loss function is calculated and minimized:

通过最小化语义损失函数，从而不断修正初始深度图，最终获得准确的深度图。语义信息能够代替真值进行引导，将有监督三维重建方法转变为无监督，实现自监督的三维重建，从而克服有监督方法的固有缺陷。By minimizing the semantic loss function, the initial depth map is continuously corrected, and an accurate depth map is finally obtained. Semantic information can replace the truth value for guidance, transform supervised 3D reconstruction methods into unsupervised, realize self-supervised 3D reconstruction, and overcome the inherent defects of supervised methods.

在本实施例中，由于，图像的语义可分为三层，视觉层，对象层和概念层，视觉层语义包含颜色、线条、轮廓等，对象层的语义包含各种物体，概念层的语义则涉及对场景的理解。在现有技术中，部分三维重建方法也利用到了语义信息引导，但单一尺度的高层抽象语义信息（对象层）在一些大尺度物体的重建任务上拥有较好的精度，在一些小尺度的重建任务上，高层抽象的语义信息相对粗糙，重建的精度不好。In this embodiment, since the semantics of the image can be divided into three layers, the visual layer, the object layer and the conceptual layer, the visual layer semantics include colors, lines, outlines, etc., the object layer semantics include various objects, and the conceptual layer semantics It involves understanding the scene. In the existing technology, some 3D reconstruction methods also use semantic information guidance, but the single-scale high-level abstract semantic information (object layer) has good accuracy in the reconstruction tasks of some large-scale objects, and in some small-scale reconstruction tasks On the task, the high-level abstract semantic information is relatively rough, and the reconstruction accuracy is not good.

因此，本实施例将多张多视角图像作为输入，通过有监督的三维重建方法获得初始深度图；基于多种尺度的语义分割集和初始深度图，获得多种尺度的深度图；本实施例利用多种尺度的语义分割集中的各个尺度的语义信息分别对初始深度图进行语义引导，从而不断修正初始深度图，获得准确的多种尺度的深度图。Therefore, in this embodiment, multiple multi-view images are used as input, and an initial depth map is obtained through a supervised three-dimensional reconstruction method; depth maps of various scales are obtained based on semantic segmentation sets of multiple scales and initial depth maps; this embodiment The semantic information of each scale in the multi-scale semantic segmentation set is used to carry out semantic guidance on the initial depth map, so as to continuously correct the initial depth map and obtain accurate multi-scale depth maps.

步骤S500、基于多种尺度的深度图，构建多种尺度的点云集。Step S500, based on the depth maps of multiple scales, construct point cloud sets of multiple scales.

具体的，将每种尺度的深度图，通过如下表达式构建每种尺度的点云集：Specifically, the depth map of each scale is used to construct the point cloud set of each scale through the following expression:

其中，

表示深度图的横坐标，

表示深度图的纵坐标，

和

represents the abscissa of the depth map,

Indicates the ordinate of the depth map,

and

步骤S600、根据点云集的尺度，对多种尺度的点云集采用不同的半径滤波进行优化，获得优化后的点云集。Step S600 , according to the scale of the point cloud set, optimize point cloud sets of various scales by using different radius filters to obtain optimized point cloud sets.

具体的，获取多种尺度的点云集，每种尺度的点云集中的点云都有对应的半径大小和预设的邻点数量；Specifically, point cloud sets of multiple scales are obtained, and the point clouds in each scale point cloud set have a corresponding radius and a preset number of adjacent points;

根据点云集的尺度采用如下公式计算出点云集中点云对应的半径：According to the scale of the point cloud set, the following formula is used to calculate the radius corresponding to the point cloud in the point cloud set:

其中，

表示不同尺度的点云集中点云对应的半径，

表示常量参数，t表示常量参数，

表示每个点云集的预先所设定的尺度等级； in,

Represents a constant parameter, t represents a constant parameter,

Represents the preset scale level of each point cloud set;

根据每个点云对应的半径大小和预设的邻点数量对多种尺度的点云集进行优化，获得优化后的点云集。According to the radius size corresponding to each point cloud and the preset number of adjacent points, the point cloud sets of various scales are optimized to obtain the optimized point cloud set.

在本实施例中，对于不同尺度的点云集，通过深度图转化后需进行半径滤波，滤除噪点，优化点云数据。对于不同尺度的点云集，由于点云的聚集程度不同，采取不同的半径滤波。半径滤波即首先获取每个点云对应的半径大小并预设邻点数量，只有满足在该半径范围内拥有足够数量的邻点的点云才会被保留，其余都被滤去。对于本实施例多尺度的点云集，还需考虑点云的在分割集中所属的语义类别，即，半径内拥有n数量的同语义类别的邻点的点云才会被保留。In this embodiment, for point cloud sets of different scales, radius filtering is required after the depth map is converted to filter out noise and optimize point cloud data. For point cloud sets of different scales, different radius filters are adopted due to the different degree of point cloud aggregation. Radius filtering is to first obtain the radius size corresponding to each point cloud and preset the number of neighbors. Only the point cloud with a sufficient number of neighbors within the radius will be retained, and the rest will be filtered out. For the multi-scale point cloud set in this embodiment, it is also necessary to consider the semantic category of the point cloud in the segmentation set, that is, only the point cloud with n number of neighbor points of the same semantic category within the radius will be retained.

步骤S700、基于优化后的点云集进行不同尺度的重建，获得不同尺度的三维重建结果。Step S700 , perform reconstruction at different scales based on the optimized point cloud set, and obtain three-dimensional reconstruction results at different scales.

具体的，在步骤S600中对不同尺度的点云集进行优化，获得不同尺度优化后的点云集，将每种尺度优化后的点云集进行重建，获得不同尺度的三维重建结果。Specifically, in step S600, point cloud sets of different scales are optimized to obtain optimized point cloud sets of different scales, and the optimized point cloud sets of each scale are reconstructed to obtain three-dimensional reconstruction results of different scales.

步骤S800、将每种尺度的三维重建结果进行拼接融合，获得最终的三维重建结果。In step S800, the 3D reconstruction results of each scale are spliced and fused to obtain a final 3D reconstruction result.

具体的，将每种尺度的三维重建结果进行拼接融合，获得最终的三维重建结果。本实施例通过步骤S700基于优化后的点云集进行不同尺度的重建，优化后的点云集更加精确，因此，本实施例获得的最终的三维重建结果也更加精确。Specifically, the 3D reconstruction results of each scale are spliced and fused to obtain a final 3D reconstruction result. In this embodiment, reconstructions of different scales are performed based on the optimized point cloud set through step S700, and the optimized point cloud set is more accurate, therefore, the final 3D reconstruction result obtained in this embodiment is also more accurate.

本实施例通过获取多张多视角图像，对多张多视角图像进行多尺度语义特征提取，获得多种尺度的特征图；对多种尺度的特征图进行多尺度语义分割，获得多种尺度的语义分割集；本实施例通过对多张多视角图像进行多尺度语义特征提取，能够提取深层次的特征，能获得多种尺度的特征图。并对多种尺度的特征图进行多尺度语义分割，聚合各个尺度的语义信息，丰富了各个尺度的语义信息。本实施例将多张多视角图像作为输入，通过有监督的三维重建方法获得初始深度图；基于多种尺度的语义分割集和初始深度图，获得多种尺度的深度图；本实施例利用多种尺度的语义分割集中的各个尺度的语义信息分别对初始深度图进行语义引导，从而不断修正初始深度图，获得准确的多种尺度的深度图。本实施例基于多种尺度的深度图，构建多种尺度的点云集；根据点云集的尺度，对多种尺度的点云集采用不同的半径滤波进行优化，获得优化后的点云集；基于优化后的点云集进行不同尺度的重建，获得不同尺度的重建结果；将每种尺度的重建结果进行拼接融合，获得最终的重建结果。本实施例用获得的多种尺度的深度图构建多种尺度的点云集，根据点云集的尺度采用不同的半径滤波进行优化，优化后的点云集用于不同尺度的重建，再将重建结果融合以获得更加精确的重建结果。本实施例能够充分利用各个尺度的语义信息，能够提高三维重建的精确度。In this embodiment, by acquiring multiple multi-view images, multi-scale semantic feature extraction is performed on multiple multi-view images to obtain feature maps of multiple scales; multi-scale semantic segmentation is performed on feature maps of multiple scales to obtain multi-scale semantic features Semantic segmentation set: In this embodiment, by performing multi-scale semantic feature extraction on multiple multi-view images, deep-level features can be extracted, and feature maps of multiple scales can be obtained. And perform multi-scale semantic segmentation on feature maps of multiple scales, aggregate semantic information of each scale, and enrich semantic information of each scale. In this embodiment, multiple multi-view images are used as input, and an initial depth map is obtained through a supervised three-dimensional reconstruction method; depth maps of various scales are obtained based on semantic segmentation sets of multiple scales and initial depth maps; this embodiment uses multiple The semantic information of each scale in the semantic segmentation set of different scales provides semantic guidance to the initial depth map, so as to continuously correct the initial depth map and obtain accurate depth maps of multiple scales. In this embodiment, point cloud sets of various scales are constructed based on depth maps of various scales; according to the scales of point cloud sets, point cloud sets of various scales are optimized using different radius filters to obtain optimized point cloud sets; based on the optimized The point cloud sets of different scales are reconstructed to obtain reconstruction results of different scales; the reconstruction results of each scale are spliced and fused to obtain the final reconstruction result. In this embodiment, the obtained depth maps of various scales are used to construct point cloud sets of various scales, and different radius filters are used for optimization according to the scale of point cloud sets. The optimized point cloud sets are used for reconstruction of different scales, and then the reconstruction results are fused to obtain more accurate reconstruction results. This embodiment can make full use of semantic information of various scales, and can improve the accuracy of three-dimensional reconstruction.

参照图5，本发明实施例提供了一种基于深度学习的多视图三维重建系统，本基于深度学习的多视图三维重建系统包括特征图获取单元100、语义分割集获取单元200、初始深度图获取单元300、深度图获取单元400、点云集获取单元500、半径滤波单元600、重建结果获取单元700以及重建结果融合单元800，其中：Referring to FIG. 5 , an embodiment of the present invention provides a multi-view 3D reconstruction system based on deep learning. The multi-view 3D reconstruction system based on deep learning includes a feature map acquisition unit 100, a semantic segmentation set acquisition unit 200, and an initial depth map acquisition unit. Unit 300, depth map acquisition unit 400, point cloud set acquisition unit 500, radius filter unit 600, reconstruction result acquisition unit 700 and reconstruction result fusion unit 800, wherein:

特征图获取单元100，用于获取多视角图像，对多视角图像进行多尺度语义特征提取，获得多种尺度的特征图；The feature map acquisition unit 100 is configured to acquire multi-view images, perform multi-scale semantic feature extraction on multi-view images, and obtain feature maps of multiple scales;

语义分割集获取单元200，用于对多种尺度的特征图进行多尺度语义分割，获得多种尺度的语义分割集；Semantic segmentation set acquisition unit 200, configured to perform multi-scale semantic segmentation on feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;

初始深度图获取单元300，用于通过有监督的三维重建方法对多张多视角图像进行重建，获得初始深度图；An initial depth map acquisition unit 300, configured to reconstruct a plurality of multi-view images through a supervised three-dimensional reconstruction method to obtain an initial depth map;

深度图获取单元400，用于基于多种尺度的语义分割集和初始深度图，获得多种尺度的深度图；A depth map acquisition unit 400, configured to obtain depth maps of multiple scales based on semantic segmentation sets of multiple scales and initial depth maps;

点云集获取单元500，用于基于多种尺度的深度图，构建多种尺度的点云集；The point cloud set acquisition unit 500 is configured to construct point cloud sets of multiple scales based on depth maps of multiple scales;

半径滤波单元600，用于根据点云集的尺度，对多种尺度的点云集采用不同的半径滤波进行优化，获得优化后的点云集；The radius filtering unit 600 is configured to optimize point cloud sets of various scales using different radius filters according to the scale of the point cloud set, so as to obtain an optimized point cloud set;

重建结果获取单元700，用于基于优化后的点云集进行不同尺度的重建，获得不同尺度的三维重建结果；A reconstruction result acquisition unit 700, configured to perform reconstruction at different scales based on the optimized point cloud set, and obtain three-dimensional reconstruction results at different scales;

重建结果融合单元800，用于将每种尺度的重建结果进行拼接融合，获得最终的三维重建结果。The reconstruction result fusion unit 800 is configured to stitch and fuse the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

需要说明的是，由于本实施例中的一种基于深度学习的多视图三维重建系统与上述的一种基于深度学习的多视图三维重建方法基于相同的发明构思，因此，方法实施例中的相应内容同样适用于本系统实施例，此处不再详述。It should be noted that since the deep learning-based multi-view 3D reconstruction system in this embodiment is based on the same inventive concept as the above-mentioned deep learning-based multi-view 3D reconstruction method, the corresponding methods in the method embodiments The content is also applicable to this embodiment of the system, and will not be described in detail here.

本发明实施例还提供了一种基于深度学习的多视图三维重建设备，包括：至少一个控制处理器和用于与所述至少一个控制处理器通信连接的存储器。An embodiment of the present invention also provides a multi-view three-dimensional reconstruction device based on deep learning, including: at least one control processor and a memory for communicating with the at least one control processor.

存储器作为一种非暂态计算机可读存储介质，可用于存储非暂态软件程序以及非暂态性计算机可执行程序。此外，存储器可以包括高速随机存取存储器，还可以包括非暂态存储器，例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中，存储器可选包括相对于处理器远程设置的存储器，这些远程存储器可以通过网络连接至该处理器。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

实现上述实施例的一种基于深度学习的多视图三维重建方法所需的非暂态软件程序以及指令存储在存储器中，当被处理器执行时，执行上述实施例中的一种基于深度学习的多视图三维重建方法，例如，执行以上描述的图1中的方法步骤S100至步骤S800。The non-transitory software programs and instructions required to realize the multi-view 3D reconstruction method based on deep learning in the above-mentioned embodiments are stored in the memory, and when executed by the processor, a deep-learning-based 3D reconstruction method in the above-mentioned embodiments is executed. The multi-view three-dimensional reconstruction method, for example, executes the method steps S100 to S800 in FIG. 1 described above.

以上所描述的系统实施例仅仅是示意性的，其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The system embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

本发明实施例还提供了一种计算机可读存储介质，该计算机可读存储介质存储有计算机可执行指令，该计算机可执行指令被一个或多个控制处理器执行，可使得上述一个或多个控制处理器执行上述方法实施例中的一种基于深度学习的多视图三维重建方法，例如，执行以上描述的图1中的方法步骤S100至步骤S800的功能。The embodiment of the present invention also provides a computer-readable storage medium, the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are executed by one or more control processors, so that the above-mentioned one or more The control processor executes a deep learning-based multi-view 3D reconstruction method in the above method embodiment, for example, executes the functions from step S100 to step S800 of the method in FIG. 1 described above.

通过以上的实施方式的描述，本领域技术人员可以清楚地了解到各实施方式可借助软件加通用硬件平台的方式来实现。本领域技术人员可以理解实现上述实施例方法中的全部或部分流程是可以通过计算机程序来指令相关的硬件来完成，所述的程序可存储于一计算机可读取存储介质中，该程序在执行时，可包括如上述方法的实施例的流程。其中，所述的存储介质可为磁碟、光盘、只读存储记忆体（ReadOnly Memory ,ROM）或随机存储记忆体（Random Access Memory ,RAM）等。Through the above description of the implementation manners, those skilled in the art can clearly understand that each implementation manner can be implemented by means of software plus a general hardware platform. Those skilled in the art can understand that all or part of the process in the method of the above-mentioned embodiments can be completed by instructing related hardware through a computer program, and the program can be stored in a computer-readable storage medium. , it may include the flow of the embodiment of the above method. Wherein, the storage medium may be a magnetic disk, an optical disk, a read-only memory (ReadOnly Memory, ROM) or a random access memory (Random Access Memory, RAM), etc.

上面结合附图对本发明实施例作了详细说明，但本发明不限于上述实施例，在所属技术领域普通技术人员所具备的知识范围内，还可以在不脱离本发明宗旨的前提下作出各种变化。The embodiments of the present invention have been described in detail above in conjunction with the accompanying drawings, but the present invention is not limited to the above-mentioned embodiments. Within the scope of knowledge of those of ordinary skill in the art, various modifications can be made without departing from the spirit of the present invention. Variety.

Claims

1. A multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:

acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales;

performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;

reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;

obtaining the depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map, specifically:

selecting any one of the multiple multi-view images as a reference image, and taking the others as images to be matched;

selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth image;

the number of reference points is chosen by the following formula:

wherein,

representing the number of reference points selected by the jth segmentation set, H representing the height of the multi-view image, W representing the width of the multi-view image, HW representing the number of pixel points of the multi-view image, t representing a constant parameter,

representing the number of semantic categories contained in the jth said semantic partition set,

representing the number of semantic categories contained in the ith semantic segmentation set, wherein n represents the total number of the semantic segmentation sets;

based on each reference point, acquiring a matching point of each reference point on the graph to be matched through the following formula:

wherein,

representing the matching point of the ith reference point on the graph to be matched, K representing the internal parameter of the camera, T representing the external parameter of the camera,

representing a reference point P in said reference map _i Corresponding depth values on the initial depth map;

obtaining semantic categories corresponding to each matching point, correcting the multi-view images of each scale by minimizing a semantic loss function to obtain the depth maps of various scales, wherein the semantic loss function

The calculation formula of (c) is as follows:

wherein,

representing the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M _i Representing a mask, N representing the number of said reference points;

constructing a point cloud set with multiple scales based on the depth maps with multiple scales;

according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;

reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;

and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

2. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic feature extraction on the multiple multi-view images to obtain feature maps of multiple scales comprises:

performing multi-layer feature extraction on the multi-view images through a ResNet network to obtain original feature maps with various scales;

and respectively connecting the original feature map of each scale with channel attention so as to carry out importance weighting on the original feature map of each scale through a channel attention mechanism and obtain feature maps of various scales.

3. The deep learning-based multi-view three-dimensional reconstruction method according to claim 2, wherein the weighting of importance of the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales comprises:

compressing the original characteristic diagram of each scale through a compression network to obtain a one-dimensional characteristic diagram corresponding to the original characteristic diagram of each scale;

inputting the one-dimensional characteristic diagram into a full-connection layer through an excitation network to perform importance prediction, and obtaining the importance of each channel;

and exciting the importance of each channel to the one-dimensional characteristic diagram of the original characteristic diagram of each scale through an excitation function to obtain characteristic diagrams of various scales.

4. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales includes:

clustering the characteristic graphs of multiple scales through nonnegative matrix decomposition to obtain semantic segmentation sets of multiple scales; wherein the expression of the non-negative matrix factorization is:

the method comprises the following steps of mapping, connecting and remolding feature maps of various scales into a matrix V with HW rows and C columns, wherein the P represents a matrix with HW rows and K columns, the Q represents a matrix with K rows and C columns, the H represents a coefficient matrix, the W represents a base matrix, the K represents a non-negative matrix decomposition factor of a semantic cluster number, the C represents the dimension of each pixel, and the F represents the adoption of a non-inducible norm.

5. The method for multi-view three-dimensional reconstruction based on deep learning of claim 1, wherein the constructing the point cloud sets of multiple scales based on the depth maps of multiple scales comprises:

constructing a point cloud set of each scale by using the depth map of each scale according to the following expression:

wherein,

the abscissa representing the depth map is shown,

represents the ordinate of the depth map and,

and

representing the camera focal length obtained from the camera parameters, and x, y and z represent the point cloud coordinates of the point cloud transformation.

6. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the optimization of the point cloud sets of multiple scales by using different radius filters according to the scales of the point cloud sets to obtain an optimized point cloud set comprises:

acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;

calculating the corresponding radius of the point cloud in the point cloud set by adopting the following formula according to the scale of the point cloud set:

wherein,

representing the corresponding radius of the point cloud in the point cloud set with different scales,

representing a constant parameter, t representing a constant parameter,

representing a preset scale grade of each point cloud set;

and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain an optimized point cloud set.

7. A deep learning based multi-view three-dimensional reconstruction system, comprising:

the characteristic diagram acquisition unit is used for acquiring multi-view images, and performing multi-scale semantic feature extraction on the multi-view images to acquire characteristic diagrams of multiple scales;

the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to obtain a semantic segmentation set with various scales;

the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;

a depth map obtaining unit, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map, specifically:

selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;

the number of reference points is chosen by the following formula:

wherein,

representing the number of semantic categories contained in the jth of said semantic segmentation sets,

based on each reference point, obtaining the matching point of each reference point on the graph to be matched through the following formula:

wherein,

representing the matching point of the ith reference point on the graph to be matched, K representing the internal reference of the camera, T representing the external reference of the camera,

obtaining the semantic category corresponding to each matching point, correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of multiple scales, wherein the semantic loss function

The calculation formula of (c) is as follows:

wherein,

the point cloud set acquisition unit is used for constructing point cloud sets with various scales based on the depth maps with various scales;

the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;

a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;

and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.

8. A deep learning based multi-view three-dimensional reconstruction device comprising at least one control processor and a memory for communicative connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method of deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 6.

9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method for deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 6.