CN117576303A - Three-dimensional image generation method, device, equipment and storage medium
- Publication number: CN117576303A (application number CN202311113441.8A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06V10/762—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
Description
Technical Field
This application relates to the field of computer vision technology, and in particular to a three-dimensional image generation method, device, equipment and storage medium.
Background
In recent years, with the rapid development of artificial intelligence, fields such as robotic operation and virtual reality have attracted great attention. Perceiving, understanding, and interpreting detected target instances through the three-dimensional understanding technology paired with onboard sensor devices, and then performing three-dimensional scene reconstruction of those target instances, is an important technical link in these fields and the foundation of the subsequent decision-making and execution stages.
Existing three-dimensional understanding technology is dominated by single-stage schemes that directly infer the overall three-dimensional information of multiple targets. Such a scheme treats a target instance as the center point of its two-dimensional image, where each center point carries the complete three-dimensional information of that instance. Although this approach has advantages in real-time performance, behavior in occluded scenes, and three-dimensional shape accuracy, it completes the migration from the bounding-box paradigm to the bounding-box-free paradigm only in its training scheme: it lacks the hard region constraints that instance segmentation would impose, and the object-detection architecture it uses struggles to extract category-level features on small datasets. It therefore cannot handle the intra-class variation of target instances well, resulting in poor performance in object detection and spatial localization.
Summary of the Invention
The main purpose of this application is to provide a three-dimensional image generation method, device, equipment and storage medium, aiming to solve the technical problem in the related art that detecting target instances with a single-stage scheme that directly infers the overall three-dimensional information of multiple targets cannot handle the intra-class variation of target instances well, resulting in poor performance in object detection and spatial localization.
To achieve the above objective, embodiments of the present application provide a three-dimensional image generation method, the method comprising:
acquiring original RGB image data;
inputting the original RGB image data into a preset image processing model and, based on the preset image processing model, fusing the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image, wherein the camera-coordinate-system data include the depth image information of the original RGB image data and two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis;
performing three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image.
In a possible implementation of the present application, the step of fusing the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image includes:
performing vector calculation on the original RGB image data to obtain first normal vectors of multiple pixels in the original RGB image data;
clustering the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain cluster feature data;
performing feature fusion on the first normal vectors and the cluster feature data to obtain the low-resolution depth image.
In a possible implementation of the present application, the step of clustering the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain cluster feature data includes:
performing feature extraction on the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain pixel coordinate feature data and camera coordinate feature data;
performing feature aggregation on the multiple point-set feature data according to the similarity between the pixel coordinate feature data and the camera coordinate feature data, to obtain the cluster feature data.
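As a minimal sketch of this similarity-based aggregation (assuming cosine similarity against a set of cluster-center features; the exact formulation is not given here, and names such as `cluster_features` are illustrative):

```python
import torch
import torch.nn.functional as F

def cluster_features(pixel_feats: torch.Tensor,
                     camera_feats: torch.Tensor,
                     centers: torch.Tensor) -> torch.Tensor:
    """Aggregate point-set features by similarity to cluster centers.

    pixel_feats, camera_feats: (N, C) features from the two coordinate systems.
    centers: (M, C) cluster-center features. Returns (M, C) cluster features.
    """
    # Fuse the two coordinate-system feature sets into a single point set.
    feats = torch.cat([pixel_feats, camera_feats], dim=0)                        # (2N, C)
    # Cosine similarity between every point feature and every cluster center.
    sim = F.cosine_similarity(feats.unsqueeze(1), centers.unsqueeze(0), dim=-1)  # (2N, M)
    # Hard-assign each point to its most similar center.
    assign = sim.argmax(dim=1)                                                   # (2N,)
    out = torch.zeros_like(centers)
    for m in range(centers.shape[0]):
        member = assign == m
        if member.any():
            # Similarity-weighted mean over the points assigned to this center.
            w = sim[member, m].softmax(dim=0).unsqueeze(-1)
            out[m] = (w * feats[member]).sum(dim=0)
    return out
```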
In a possible implementation of the present application, the step of performing vector calculation on the original RGB image data to obtain first normal vectors of multiple pixels in the original RGB image data includes:
performing data enhancement on the original RGB image data to obtain first enhanced data;
performing two-dimensional gradient calculation on the first enhanced data to obtain gradient calculation values;
selecting the maximum of the gradient calculation values and taking this maximum as the first normal vector of each pixel in the first enhanced data.
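A hedged sketch of the two gradient steps above, assuming Sobel filters for the two-dimensional gradient and a per-pixel maximum over the gradient components (the enhancement step and the exact gradient operator are not specified here):

```python
import numpy as np
from scipy import ndimage

def first_normal_component(enhanced: np.ndarray) -> np.ndarray:
    """Per-pixel 'first normal vector' value from 2D gradients.

    enhanced: (H, W) single-channel image (the first enhanced data).
    Returns an (H, W) array holding, per pixel, the maximum of the two
    gradient calculation values.
    """
    img = enhanced.astype(np.float64)
    gx = ndimage.sobel(img, axis=1)   # horizontal gradient
    gy = ndimage.sobel(img, axis=0)   # vertical gradient
    # Take the maximum gradient value at each pixel as its normal estimate.
    return np.maximum(gx, gy)
```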
In a possible implementation of the present application, the step of performing three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image includes:
generating a target Gaussian heat-value map according to multiple target instances in the original RGB image data;
performing three-dimensional point cloud reconstruction according to the aggregated features in the low-resolution depth image and the target Gaussian heat-value map, to obtain the target three-dimensional image.
In a possible implementation of the present application, after the step of generating a target Gaussian heat-value map according to multiple target instances in the original RGB image data, the method includes:
calculating, according to the target Gaussian heat-value map, the center coordinates of multiple target instances in the target Gaussian heat-value map, and determining an overall three-dimensional information parameter map based on the multiple target instance center coordinates;
extracting the complete three-dimensional information of all target instances in the overall three-dimensional information parameter map.
In a possible implementation of the present application, after the step of performing three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image, the method includes:
calculating a center distance estimation result according to the center point data, in the two-dimensional coordinate system, of the target instances of the target three-dimensional image;
applying a consistency constraint between the center distance estimation result and the regression result of the overall three-dimensional information parameter map, so that the input target instances converge.
This application also provides a three-dimensional image generation device, the device comprising:
an acquisition module, configured to acquire original RGB image data;
a first processing module, configured to input the original RGB image data into a preset image processing model and, based on the preset image processing model, fuse the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image, wherein the camera-coordinate-system data include the depth image information of the original RGB image data and two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis;
a conversion module, configured to perform three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image.
This application also provides a three-dimensional image generation equipment; the equipment is a physical node device and includes a memory, a processor, and a program of the three-dimensional image generation method that is stored in the memory and executable on the processor; when the program of the three-dimensional image generation method is executed by the processor, the steps of the three-dimensional image generation method described above can be implemented.
To achieve the above objective, a storage medium is also provided, on which a three-dimensional image generation program is stored; when the three-dimensional image generation program is executed by a processor, the steps of any one of the three-dimensional image generation methods described above are implemented.
This application provides a three-dimensional image generation method, device, equipment and storage medium. In the related art, detecting target instances with a single-stage scheme that directly infers the overall three-dimensional information of multiple targets cannot handle the intra-class variation of target instances well, resulting in poor performance in object detection and spatial localization. By contrast, in this application, original RGB image data are acquired; the original RGB image data are input into a preset image processing model, and based on the preset image processing model, the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data are fused to obtain a low-resolution depth image, wherein the camera-coordinate-system data include the depth image information of the original RGB image data and two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis; three-dimensional conversion is then performed on the low-resolution depth image to obtain the target three-dimensional image. In this application, the original RGB image data are split into camera-coordinate-system data and pixel-coordinate-system data fed in through two input branches, and the depth image information of the original RGB image data is used for coarse-grained image synthesis rather than as raw input. This prevents the network from destroying the three-dimensional structure implicit in the image when the feature networks operate on the raw input image, improves performance in object detection and spatial localization, and in turn allows accurate three-dimensional reconstruction of the detected image data.
Brief Description of the Drawings
Figure 1 is a schematic flowchart of the first embodiment of the three-dimensional image generation method of the present application;
Figure 2 is a schematic diagram of the overall processing flow of the preset image processing model involved in the three-dimensional image generation method of the present application;
Figure 3 is a schematic structural diagram of the device in the hardware operating environment involved in the embodiments of the present application;
Figure 4 is a schematic diagram of the Gaussian heat-value map of target instances in the image data involved in the three-dimensional image generation method of the present application;
Figure 5 is a schematic diagram of the normal vectors of the pixel-coordinate-system information involved in the three-dimensional image generation method of the present application;
Figure 6 is a schematic two-dimensional coordinate diagram of the pixel-coordinate-system information involved in the three-dimensional image generation method of the present application;
Figure 7 is a schematic diagram of the three-dimensional point cloud clustering autoencoder involved in the three-dimensional image generation method of the present application;
Figure 8 is a schematic diagram of the single-stage processing flow of the three-dimensional point cloud clustering autoencoder involved in the three-dimensional image generation method of the present application;
Figure 9 is a schematic diagram of the image data feature extraction flow involved in the three-dimensional image generation method of the present application;
Figure 10 is a schematic diagram of the low-resolution synthetic depth image involved in the three-dimensional image generation method of the present application;
Figure 11 is a schematic diagram of the calculation flow of the cyclic geometric consistency constraint involved in the three-dimensional image generation method of the present application;
Figure 12 is a schematic diagram of the reconstructed three-dimensional image of target instances involved in the three-dimensional image generation method of the present application;
Figure 13 is a schematic diagram of the three-dimensional reconstruction results on the test set involved in the three-dimensional image generation method of the present application.
Detailed Description
It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
An embodiment of the present application provides a three-dimensional image generation method. In the first embodiment of the three-dimensional image generation method of the present application, referring to Figure 1, the method includes:
Step S10: acquire original RGB image data;
Step S20: input the original RGB image data into a preset image processing model and, based on the preset image processing model, fuse the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image, wherein the camera-coordinate-system data include the depth image information of the original RGB image data and two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis;
Step S30: perform three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image.
This embodiment aims to split the original RGB image data into camera-coordinate-system data and pixel-coordinate-system data fed in through two input branches, and to use the depth image information of the original RGB image data for coarse-grained image synthesis rather than as raw input, so that the network does not destroy the three-dimensional structure implicit in the image when the feature networks operate on the raw input image, improving performance in object detection and spatial localization.
The research and development background addressed by this embodiment is as follows:
This method belongs to single-view overall three-dimensional understanding technology and mainly involves target pose estimation and three-dimensional scene reconstruction. Existing overall three-dimensional understanding technology can be divided into two categories: instance-level overall three-dimensional understanding and category-level overall three-dimensional understanding.
1. Instance-level overall three-dimensional understanding: for target pose estimation, instance-level pose estimation methods based on traditional operators or deep learning are used, relying on prior matching information or accurate three-dimensional models computed from the target instance; for three-dimensional scene reconstruction, the three-dimensional model of the target instance is used directly. The main drawbacks of such methods are poor generality and poor real-time performance. Take robotic grasping of objects of a specified category as an example: when the category contains a large number of objects that differ in appearance and shape, instance-level pose estimation methods usually spend considerable time on template matching operations; if such a method is then transplanted into a new environment containing unseen target objects, its performance degrades severely for lack of instance prior information, to the point of being unusable.
2. Category-level overall three-dimensional understanding: unlike instance-level technology, category-level overall three-dimensional understanding uses only the category-level pose, scale, and shape priors learned during training and generalizes at inference time to unseen instances of the target category, which makes it more challenging. Depending on model complexity, category-level overall three-dimensional understanding can be divided into multi-stage schemes that first segment regions of interest from the image and single-stage schemes that directly infer the overall three-dimensional information of multiple targets.
In multi-stage schemes that use pre-segmentation, some works directly exploit the color-depth information of the target region and infer the six-degree-of-freedom pose through a complex architecture comprising a point cloud processing network and a pose estimation network, generating a canonical point cloud of the target as a by-product. The main drawbacks of such methods are as follows: 1. their three-dimensional reconstruction capability is treated as supplementary inference for pose estimation, derived by deformation prediction from the average shape prior of all known instances in the category, which limits reconstruction performance; 2. they are computationally expensive, depend heavily on image segmentation quality, and perform poorly in complex multi-target occlusion scenes and few-shot learning tasks.
Single-stage schemes that directly infer the overall three-dimensional information of multiple targets detect target objects in a bounding-box-free manner while simultaneously inferring their six-degree-of-freedom pose, three-dimensional shape, and true size. Such a method treats each target instance in the image data as the center point of its two-dimensional image, with each center point carrying the complete three-dimensional information of that instance. Although it has advantages in real-time performance, behavior in occluded scenes, and three-dimensional shape accuracy, it completes the migration from the bounding-box paradigm to the bounding-box-free paradigm only in its training scheme and ignores the differences between domains; that is, on the overall three-dimensional understanding task, where datasets are small, inference is difficult, and the task is strongly correlated with real-space geometric information, it reuses the inference approach and model architecture designed for very-large-scale datasets and single tasks. Such a design, which relies entirely on the feature extraction network, cannot handle the intra-class variation of target objects well and lacks the mining of real-space geometric information, yielding poor performance in object detection and spatial localization.
In this scheme, the single-stage scheme that directly infers the overall three-dimensional information of multiple targets is improved upon. By adopting coordinate-system-separated input, three-dimensional scene understanding is divided, within the hidden-layer feature space transformation, into two stages, target instance abstraction and scale-position understanding, so as to explicitly mine real-space geometric information; features of different preferences are hierarchically aggregated by similarity measurement, establishing the feature mapping between pixel-coordinate-system information and camera-coordinate-system information while making the network focus more on overall inter-cluster differences than on matching shape details; and a confidence geometric consistency constraint is constructed to eliminate the error between the generated result and the ground truth.
The specific steps are as follows:
Step S10: acquire original RGB image data;
As an example, the three-dimensional image generation method can be applied to a three-dimensional image generation device; the three-dimensional image generation device belongs to a three-dimensional image generation system, and the three-dimensional image generation system belongs to a three-dimensional image generation equipment.
As an example, the original RGB image data is a single acquired RGB-D image. In single-view overall three-dimensional understanding, the image data to be input is acquired, where the image data can be a photo, an RGB image, and so on, and the three-dimensional scene is reconstructed according to the target objects/target instances in the image data.
As an example, the original RGB image data can be a single RGB-D image input with pixel width w and pixel height h (I ∈ R^{w×h×3}, D ∈ R^{w×h}). The goal is to detect all objects of interest in the three-dimensional scene and infer their six-degree-of-freedom pose P ∈ SE(3), one-dimensional true scale factor s, and three-dimensional shape point cloud C ∈ R^{K×N×3}, where K denotes the number of objects of interest in the three-dimensional scene and N denotes the number of sampled points in the reconstructed three-dimensional point cloud. The six-degree-of-freedom pose P ∈ SE(3) is represented by a three-dimensional rotation matrix R ∈ SO(3) and a three-dimensional translation vector t ∈ R³. The six-degree-of-freedom pose P, the normalized three-dimensional shape point cloud C, and the one-dimensional true scale factor s completely define the target instance of interest in the three-dimensional scene relative to the camera coordinate system, which is the complete three-dimensional information required by the vast majority of vision tasks.
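In standard notation, these quantities compose into the reconstructed scene point cloud exactly as used in the inference stage described later (P_recon = [R|t]*s*P), restated here for reference:

```latex
P_{\mathrm{recon}} = R\,(s \cdot P) + t,
\qquad R \in SO(3),\; t \in \mathbb{R}^{3},\; s \in \mathbb{R}^{+},
```

where P is the normalized shape point cloud decoded for the instance.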
Step S20: input the original RGB image data into a preset image processing model and, based on the preset image processing model, fuse the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image, wherein the camera-coordinate-system data include the depth image information of the original RGB image data and two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis;
As an example, the preset image processing model is an image processing model based on deep reinforcement learning; the preset image processing model contains the overall three-dimensional understanding algorithm CoCFusion, through which the images are processed.
As an example, RGB image data contain both pixel-coordinate-system information and camera-coordinate-system information. The pixel-coordinate-system data are the two-dimensional pixel coordinate points in the input image data and include multiple pixel points; the camera-coordinate-system data include multiple two-dimensional and three-dimensional coordinate points and are mainly used to represent the spatial position relationships of the target instances in the RGB image data.
As an example, the low-resolution depth image is the image information obtained by fusing the pixel-coordinate-system data of the original RGB image data with the camera-coordinate-system data. The conversion process from the original RGB image to the low-resolution depth image is shown in Figure 10, where the low-resolution depth image is the coarse-grained synthetic depth image of panel (c). The process of predicting the low-resolution depth information is also a process of learning the mapping between two-dimensional pixel-coordinate-system information and three-dimensional camera-coordinate-system information, which improves the network's understanding of the distribution of shape information.
As an example, the depth image information is the scale-position relationship information of each target instance in the camera coordinate system. In contrast to relying entirely on the fitting capability of a neural network to understand the three-dimensional scene from raw depth information, the coordinate-system-separated input-end processing method takes the normal vectors DNR of the depth image and the two-dimensional coordinates (x_c, y_c) in the camera coordinate system as the camera-coordinate-system information, and synthesizes the depth information z_c at coarse granularity at the end of the shallow network, so that three-dimensional scene understanding is divided, within the hidden-layer feature space transformation, into two stages: target instance abstraction and scale-position understanding. The main advantage of this approach is that it bears a certain similarity to the multi-stage schemes based on region-of-interest segmentation and can mine real-space geometric information more explicitly.
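A minimal sketch of how the camera-coordinate inputs (x_c, y_c) could be formed from the pixel grid under the standard pinhole model, using the depth map available in the RGB-D input (the intrinsics handling is an assumption; the text only states that these coordinates, together with the depth-image normal vectors, form the camera-coordinate-system input):

```python
import numpy as np

def camera_xy_grid(w: int, h: int, fx: float, fy: float,
                   cx: float, cy: float, depth: np.ndarray) -> np.ndarray:
    """Back-project the pixel grid to per-pixel camera-frame (x_c, y_c).

    depth: (h, w) depth map giving z_c per pixel.
    Returns an (h, w, 2) array of (x_c, y_c) camera coordinates.
    """
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x_c = (u - cx) / fx * depth                      # pinhole back-projection
    y_c = (v - cy) / fy * depth
    return np.stack([x_c, y_c], axis=-1)
```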
Step S30: perform three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image.
As an example, the three-dimensional conversion process may infer, while detecting the target objects, their six-degree-of-freedom pose, three-dimensional shape, and true size, thereby extracting the complete three-dimensional information of all target instances in the input image data. To learn the three-dimensional point cloud encoding of target instances, this scheme adopts a training approach similar to that of the category-level pose method SPD and designs an autoencoder (AutoEncoder, AE) to train on all three-dimensional shapes from the ShapeNetCore-CAD dataset. This autoencoder has representation invariance and can be used for all three-dimensional shape representation work, as shown in Figure 7.
As an example, after the complete three-dimensional information has been extracted, the reconstruction of the complete three-dimensional image is carried out through PointMLP. PointMLP is a simple and effective three-dimensional point cloud analysis network that uses residual point MLP modules (Residual Point MLP block, ResP Block) to progressively extract local features; compared with the PointNet or PointNet++ three-dimensional point cloud encoders used by existing category-level pose estimation methods, it is simpler, deeper, and performs better. However, PointMLP is not good at mining local geometric details, and for structured three-dimensional models composed of complex components, the reconstructed local shapes are rather blurry. Therefore, this application adds two spaced context clustering modules to the preprocessing steps of all its stages, which cluster three-dimensional points belonging to the same component together and enhance the local structural details of the three-dimensional model.
As an example, the three-dimensional point cloud encoder PointMLP-CoCs is shown in Figure 8, where φ_pre(·) is the feature preprocessing sequence with the spaced context clustering modules added, φ_pos(·) is the feature post-processing sequence, and N and M are the numbers of modules in the sequences. Within a single stage, the point feature abstraction process of PointMLP-CoCs is given by formula (3-1): for point i, the k-nearest-neighbors method (k-Nearest Neighbors, kNN) extracts features from a local region containing K points; φ_pre(·) is designed to learn shared weights from the local region; the aggregation function A aggregates features through a max-pooling operation; and φ_pos(·) is used to extract deep aggregated features. The specific formula is as follows:
g_i = Φ_pos(A(Φ_pre(f_{i,j}) | j = 1, …, K))    (3-1)
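A condensed sketch of one such abstraction stage, assuming PyTorch, plain linear layers for Φ_pre and Φ_pos, and max-pooling as the aggregation function A (the context clustering modules inserted into Φ_pre are omitted for brevity):

```python
import torch
import torch.nn as nn

class PointStage(nn.Module):
    """One abstraction stage: kNN grouping -> phi_pre -> max-pool -> phi_pos."""

    def __init__(self, c_in: int, c_out: int, k: int = 24):
        super().__init__()
        self.k = k
        self.phi_pre = nn.Sequential(nn.Linear(c_in, c_out), nn.ReLU())
        self.phi_pos = nn.Sequential(nn.Linear(c_out, c_out), nn.ReLU())

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) point coordinates; feats: (N, C_in) point features.
        dist = torch.cdist(xyz, xyz)                    # (N, N) pairwise distances
        knn = dist.topk(self.k, largest=False).indices  # (N, K) local region indices
        grouped = feats[knn]                            # (N, K, C_in) grouped features
        pre = self.phi_pre(grouped)                     # shared weights over the region
        agg = pre.max(dim=1).values                     # A: max-pool over K neighbors
        return self.phi_pos(agg)                        # deep aggregated features g_i
```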
As an example, considering the real-time requirements of three-dimensional reconstruction, CoCFusion maps the normalized three-dimensional point cloud P ∈ R^{2048×3} down to a three-dimensional shape hidden vector z ∈ R^{128}. Accordingly, on the decoder side, only a three-layer MLP with layer sizes 128→512, 512→1024, and 1024→2048×3 is used for three-dimensional point cloud reconstruction. As for the choice of loss function, CoCFusion uses the Chamfer Distance (CD), which expresses the point cloud reconstruction error, to optimize the three-dimensional point cloud clustering autoencoder; the calculation process is given by formula (3-2).
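Formula (3-2) itself does not survive in the text; the standard symmetric Chamfer distance between two point sets P_1 and P_2, which matches this description, reads:

```latex
d_{CD}(P_1, P_2) =
\frac{1}{|P_1|} \sum_{x \in P_1} \min_{y \in P_2} \lVert x - y \rVert_2^2
+ \frac{1}{|P_2|} \sum_{y \in P_2} \min_{x \in P_1} \lVert x - y \rVert_2^2
```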
Here, i can be 1, 2, 3, and so on, indexing the individual points; x and y denote coordinates; and p denotes the three-dimensional point cloud.
As an example, after the training procedure is complete, the three-dimensional reconstruction results of the three-dimensional point cloud clustering autoencoder PointMLP-CoCs designed in the present invention on the ShapeNetCore-CAD test dataset are shown in Figures 12 and 13, where the first row of images shows the original models of a brimmed-hat instance, a headphone instance, a seat instance, and a guitar instance, and the second row shows the corresponding three-dimensional reconstruction results.
The step of performing three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image includes:
Step S301: generate a target Gaussian heat-value map according to multiple target instances in the original RGB image data;
As an example, a target instance is an object in the image data. By adopting a training approach that treats the target as a pixel point, the target instance is modeled as a single point (the center point of the target's bounding box in the two-dimensional image); this center point is then located by keypoint estimation, and the complete three-dimensional information of the target instance is regressed, where the complete three-dimensional information includes the six-degree-of-freedom pose P, the normalized three-dimensional shape point cloud C, and the one-dimensional true scale factor s. Training with targets as center points is simple, fast, and end-to-end differentiable, requires no non-maximum suppression (Non-Maximum Suppression, NMS) post-processing, and can estimate a series of additional object attributes in a single inference pass; it is an important line of research in real-time object detection and related tasks. Note, however, that because the overall three-dimensional understanding task is highly complex, CoCFusion (the algorithm adopted in this application) performs only instance center point estimation; the complete three-dimensional information is extracted at the center point position from the inference results of the other task processing networks. The center point serves as an anchor and carries no additional information itself.
As an example, the target Gaussian heat-value map is a heat-value map generated from the center points corresponding to the target instances in the ground-truth data and is the first step of the instance center point representation; a generated target Gaussian heat-value map is shown in Figure 4.
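A minimal sketch of rendering such a heat-value map from ground-truth centers, assuming one fixed isotropic Gaussian per instance (how σ is chosen is not specified here):

```python
import numpy as np

def gaussian_heatmap(w: int, h: int, centers, sigma: float = 4.0) -> np.ndarray:
    """Render Y in [0,1]^(h x w) with one Gaussian peak per instance center."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.zeros((h, w), dtype=np.float64)
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, g)   # overlapping peaks keep the larger value
    return heat
```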
Step S302: perform three-dimensional point cloud reconstruction according to the aggregated features in the low-resolution depth image and the target Gaussian heat-value map, to obtain the target three-dimensional image.
As an example, the heat-value image of each target instance in the image data can be read off the target Gaussian heat-value map; in the process of generating the target three-dimensional image, the aggregated features of the low-resolution depth image also need to be extracted.
As an example, feature extraction is performed on the above input point sets through a hierarchical feature fusion network composed of the context clustering module CoC Block and the spatial-channel attention module GAM, which hierarchically aggregates features of different preferences by similarity measurement and synthesizes the depth image at coarse granularity at the end of the shallow network; finally, different task processing networks predict the target instance heat-value map, the semantic segmentation image, the high-resolution depth image, and the target-centered overall three-dimensional information parameters. Compared with the existing single-stage overall three-dimensional understanding schemes that adopt the ResNet-FPN feature extraction architecture, the feature extraction approach based on context clustering modules makes the network focus more on overall inter-cluster differences than on detailed matching of appearance and shape, thereby capturing the characteristics of each target instance; the reconstruction of the three-dimensional scene is then completed according to the Gaussian heat-value map and the features of each target instance.
After the step of generating a target Gaussian heat-value map according to multiple target instances in the original RGB image data, the method includes:
Step A1: calculate, according to the target Gaussian heat-value map, the center coordinates of multiple target instances in the target Gaussian heat-value map, and determine an overall three-dimensional information parameter map based on the multiple target instance center coordinates;
Step A2: extract the complete three-dimensional information of all target instances in the overall three-dimensional information parameter map.
As an example, after the target Gaussian heat-value map is generated, it contains the corresponding instance center points, and the complete three-dimensional information is extracted at those center point positions from the inference results of the other task processing networks.
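A sketch of extracting instance center points as local peaks above the heat threshold, mirroring the inference flow described later (the 3×3 neighborhood is an assumption):

```python
import torch
import torch.nn.functional as F

def extract_centers(heat: torch.Tensor, delta: float = 0.35) -> torch.Tensor:
    """Return (y, x) coordinates of local maxima of heat (1, 1, H, W) above delta."""
    peak = F.max_pool2d(heat, kernel_size=3, stride=1, padding=1)
    is_peak = (heat == peak) & (heat > delta)   # local max within 3x3, above threshold
    return is_peak.squeeze().nonzero()          # (K, 2) tensor of center coordinates
```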
After the step of performing three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image, the method includes:
Step B1: calculate a center distance estimation result according to the center point data, in the two-dimensional coordinate system, of the target instances of the target three-dimensional image;
As an example, the center point data is the center point of each target instance in the two-dimensional coordinate system.
As an example, after the three-dimensional image is generated, a cyclic geometric consistency constraint needs to be constructed through the loss function, penalizing the inconsistencies among the ground-truth mask region, the predicted mask region, and the projected mask region; this cyclic geometric consistency constraint, together with the confidence-weighted combination of the orthogonal-axis loss of FS-Net, the center distance estimation of PoseCNN, and the direct regression loss, forms the confidence geometric consistency constraint.
As an example, the inference of the confidence parameters is realized by adding extra predictions with sigmoid post-processing to the corresponding task processing networks, which is very easy to do. In addition, apart from the optimization process for the auxiliary three-dimensional information, the other loss functions are all mask loss functions based on the ground-truth target instance heat-value map Y ∈ [0,1]^{w×h×1}; they are executed only when the pixel heat value h(i,j) is greater than δ, so as to prevent ambiguity in regions where no object exists. δ is the heat mask threshold, taken as 0.35 in this application.
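A hedged sketch of this heat-masked loss gating, with an L1 pixel loss standing in for the per-task losses (the actual per-task losses differ):

```python
import torch

def masked_l1_loss(pred: torch.Tensor, target: torch.Tensor,
                   heat: torch.Tensor, delta: float = 0.35) -> torch.Tensor:
    """Apply an L1 loss only where the ground-truth heat value exceeds delta.

    pred, target: (H, W, C) task outputs; heat: (H, W) ground-truth heat map Y.
    """
    mask = (heat > delta).unsqueeze(-1)          # suppress object-free regions
    diff = (pred - target).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1)  # normalize by active pixel count
```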
As an example, the auxiliary three-dimensional information loss function consists of the coarse-grained synthetic depth image loss function, the predicted high-resolution depth image loss function, and the predicted category mask image loss function.
Step B2: apply a consistency constraint between the center distance estimation result and the regression result of the overall three-dimensional information parameter map, so that the input target instances converge.
As an example, the center distance estimation result is obtained, with the camera intrinsic matrix known, by estimating the three-dimensional translation vector from the center point of the target instance on the two-dimensional image and the depth of that center point: the center distance estimation result t_cd′ is solved for through formula (3-3). The center distance estimation result t_cd′ is then combined, in a confidence-weighted manner, with the direct regression result t_3d from the overall three-dimensional information parameter map, and, taking the loss of the one-dimensional true scale factor s into account, the loss function is constructed using the L1 distance and cosine similarity so that the input target instances converge; the loss function for the three-dimensional rotation is given by formula (3-4).
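A minimal sketch of this center-distance translation recovery, assuming a known pinhole intrinsic matrix K (the inversion of the projection relation is standard; the confidence-weighted fusion with the direct regression result t_3d is shown as a simple convex blend, which is an assumption):

```python
import numpy as np

def center_distance_translation(K: np.ndarray, cx: float, cy: float,
                                z: float) -> np.ndarray:
    """Recover t_cd' = z * K^{-1} [cx, cy, 1]^T from the 2D center and its depth."""
    ray = np.linalg.solve(K, np.array([cx, cy, 1.0]))
    return z * ray

def fuse_translation(t_cd: np.ndarray, t_3d: np.ndarray, c_cd: float) -> np.ndarray:
    """Confidence-weighted blend of the two translation estimates (illustrative)."""
    return c_cd * t_cd + (1.0 - c_cd) * t_3d
```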
As an example, introducing the predicted category mask image SSI ∈ R^{w×h×(cls+1)} as auxiliary three-dimensional information enables the network to better detect target instances of interest in the two-dimensional image plane, but the real geometric information of three-dimensional space is not taken into account. Therefore, the algorithm used in this application penalizes the inconsistencies among the ground-truth mask region, the predicted mask region, and the projected mask region and constructs a cyclic geometric consistency constraint, so as to strengthen the network's learning of the mapping between the two-dimensional pixel plane and real three-dimensional space.
As an example, this goal is achieved through the following design: (1) Let the known camera intrinsic matrix be K; the predicted three-dimensional rotation matrix R ∈ SO(3) and the predicted three-dimensional translation vector t ∈ R³ then define the 3D→2D mapping π = K[R|t]^{-1}. For the real target instance three-dimensional shape p′ = π(s*p), characterized by the one-dimensional true scale factor s and the normalized three-dimensional point cloud P ∈ R^{2048×3}, a projected mask supervision map Proj′ ∈ R^{w×h×(cls+1)} can then be generated from the target instance center point coordinates (c_x, c_y) and the ground-truth category label. Clearly, reducing the difference between the projected mask supervision map Proj′ and the category mask image SSI helps the model understand the conversion process of two-dimensional pixels → three-dimensional space mapping → two-dimensional pixel reprojection. (2) If only the consistency between the ground-truth mask region and the predicted mask region and the consistency between the projected mask region and the predicted mask region are attended to, a "cyclic mask misalignment" problem can arise: the network spends a great deal of capacity learning a distorted geometric mapping, causing the model to deviate from the correct understanding of the target instance, and the deepening of this deviation in turn causes the network to abandon the geometric mapping understanding it has already learned, cycling back and forth and converging with difficulty. To solve this problem, the consistency between the projected mask region and the ground-truth mask region is added, constructing the cyclic geometric consistency constraint, as shown in Figure 11; the cyclic geometric consistency constraint loss function is given by formula (3-5).
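A sketch of producing the projected mask used by this constraint: transform the scaled canonical point cloud by [R|t], project through K (written here as the conventional forward projection), and splat the hit pixels (the exact rasterization is not specified here):

```python
import numpy as np

def project_mask(points: np.ndarray, R: np.ndarray, t: np.ndarray, s: float,
                 K: np.ndarray, w: int, h: int) -> np.ndarray:
    """Rasterize a projected instance mask from a normalized shape point cloud.

    points: (N, 3) normalized point cloud P. Returns an (h, w) binary mask.
    """
    cam = (s * points) @ R.T + t                 # canonical -> camera frame
    uvw = cam @ K.T                              # pinhole projection (homogeneous)
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    mask = np.zeros((h, w), dtype=bool)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (uvw[:, 2] > 0)
    mask[v[valid], u[valid]] = True              # nearest-pixel splat
    return mask
```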
作为一种示例,预设图像处理模型的整体处理流程图如图2所示,主要包括五个部分:实例完整三维信息的中心点表示、坐标系分离的输入端处理方法、位姿尺寸以及三维形状点云编码、基于上下文聚类的层次化特征融合网络、置信度几何一致性约束与辅助三维信息联合优化;实施方式如下:通过给定像素宽度为w,像素高度为h的单张RGB-D图像输入(I∈Rw×h×3,D∈Rw×h),在推理阶段,首先从目标高斯热值图Y∈[0,1]w×h×1中计算出满足热度阈值的局部峰值坐标作为目标实例中心点(cx,cy),其次从整体三维信息参数图中采样出目标实例的完整完整三维信息O3d(cx,cy),并利用三维点云解码器MLPs-Decoder进行推断以获得重构点云P,最后,从O3d(cx,cy)中提取出坐标轴ax、az,置信度cx和cz、ccd、一维真实尺度因子s,从HRD提取D(cx,cy),由ax、az、cx、cz计算出预测旋转矩阵R,由tcd′、D(cx,cy)、ccd计算出预测平移向量t∈R3,完成由二维图像平面推断真实场景中目标实例重构点云Precon=[R|t]*s*P的整体三维理解任务,即基于单张RGB-D图像的多目标类别级位姿估计与三维形状重建。As an example, the overall processing flow chart of the preset image processing model is shown in Figure 2, which mainly includes five parts: the center point representation of the complete three-dimensional information of the instance, the input end processing method of coordinate system separation, pose size and three-dimensional Shape point cloud coding, hierarchical feature fusion network based on context clustering, joint optimization of confidence geometric consistency constraints and auxiliary three-dimensional information; the implementation method is as follows: by given a single RGB-pixel with a pixel width of w and a pixel height of h- D image input (I∈R w×h×3 , D∈R w×h ). In the inference stage, the heat threshold is first calculated from the target Gaussian heat value map Y∈[0,1] w×h×1 The local peak coordinates are used as the target instance center point (c x , c y ), and secondly, from the overall three-dimensional information parameter map The complete and complete three-dimensional information O 3d (c x ,c y ) of the target instance is sampled, and the three-dimensional point cloud decoder MLPs-Decoder is used for inference to obtain the reconstructed point cloud P. Finally, from O 3d (c x ,c Extract the coordinate axes a x , a z , the confidence level c x and c z , c cd , and the one-dimensional true scale factor s from y ), extract D(c x , c y ) from the HRD, and use a x , a z , c x , c z calculates the predicted rotation matrix R, and calculates the predicted translation vector t∈R 3 from t cd ′, D (c x , c y ), c cd , completing the inference of the target instance in the real scene from the two-dimensional image plane The overall three-dimensional understanding task of reconstructing the point cloud P recon = [R|t]*s*P, that is, multi-target category-level pose estimation and three-dimensional shape reconstruction based on a single RGB-D image.
The present application provides a three-dimensional image generation method, device, equipment and storage medium. In the related art, single-stage schemes that detect target instances by directly inferring the overall three-dimensional information of multiple targets cannot handle intra-class differences between target instances well, resulting in poor performance in target detection and spatial localization. In contrast, in the present application, original RGB image data is acquired; the original RGB image data is input into a preset image processing model, and, based on that model, the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data are fused to obtain a low-resolution depth image, wherein the camera-coordinate-system data comprises the depth image information of the original RGB image data together with two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis; the low-resolution depth image is then converted into three dimensions to obtain a target three-dimensional image. In the present application, by acquiring the original RGB image data, splitting it into camera-coordinate-system data and pixel-coordinate-system data fed in as a dual-ended input, and using the depth image information for coarse-grained image synthesis rather than as raw input, the network is prevented from destroying the three-dimensional structure implicit in the image when the various feature networks operate on the raw input. This improves target detection and spatial localization performance and, in turn, allows the detected image data to be accurately reconstructed in three dimensions.
Further, based on the first embodiment of the present application, another embodiment of the present application is provided. In this embodiment, the step of fusing the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image comprises:

Step C1: performing vector calculation on the original RGB image data to obtain first normal vectors for a plurality of pixels in the original RGB image data;

As an example, the first normal vector is the normal vector of the depth image contained in the original RGB image data.
As an example, the three-dimensional visual projection process refers to establishing, through the geometric model of camera imaging, the correspondence between a target object's real-world three-dimensional coordinates and two-dimensional pixel points on the imaging plane. When the world coordinate system is defined as a coordinate system centered on the target instance, the process of projecting a target instance in the real three-dimensional scene onto the two-dimensional image plane is given by formula (3-3). This process describes how a point (x_w, y_w, z_w) in the target-instance coordinate system is first rotated by the rotation matrix R ∈ SO(3), then translated by the three-dimensional translation vector t ∈ R³ to the position (x_c, y_c, z_c) in the camera coordinate system, and finally projected by the camera intrinsic matrix K onto the pixel at image-plane position (u, v).
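Formula (3-3) itself is not reproduced in this excerpt; the standard pinhole projection it describes can be written as follows, with z_c the depth of the point in the camera frame:

```latex
z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K\!\left( R \begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix} + t \right)
  = K \, [\,R \mid t\,]
    \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}
```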
Step C2: clustering the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain cluster feature data;

As an example, features of different preferences in the input data are aggregated by means of a similarity metric to obtain the cluster feature data.

Step C3: performing feature fusion on the first normal vectors and the cluster feature data to obtain the low-resolution depth image.
As an example, when generating the low-resolution depth image, the input-end processing method with coordinate-system separation extracts from a single RGB-D image two point sets, one in pixel-coordinate-system space and one in camera-coordinate-system space: a pixel-coordinate-system point set P_l ∈ R^{5×n}, composed of the RGB image normal vectors CNR ∈ R^{w×h×3} and the pixel-plane two-dimensional coordinates UV ∈ R^{w×h×2}; and a camera-coordinate-system point set P_r ∈ R^{5×n}, composed of the depth image normal vectors DNR ∈ R^{w×h×3} and the camera-plane two-dimensional coordinates XY ∈ R^{w×h×2}. Each point set is then aggregated according to its distinct features so as to classify the data, such that the extracted features better reflect the overall differences between clusters rather than matches in appearance and shape, allowing the three-dimensional scene to be restored accurately.
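Assuming per-pixel normal estimation is available as a helper, the construction of the two five-channel point sets could look like the minimal NumPy sketch below; `compute_normals` is a hypothetical helper, and the back-projection uses the standard intrinsics K.

```python
# Sketch of the coordinate-system-separated inputs Pl (pixel space) and
# Pr (camera space); the names CNR / DNR / UV / XY follow the text above.
import numpy as np

def build_point_sets(rgb, depth, K):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel-plane coords UV
    x = (u - K[0, 2]) * depth / K[0, 0]              # camera-plane coord X
    y = (v - K[1, 2]) * depth / K[1, 1]              # camera-plane coord Y
    CNR = compute_normals(rgb)      # hypothetical helper, (h, w, 3)
    DNR = compute_normals(depth)    # hypothetical helper, (h, w, 3)
    n = h * w
    Pl = np.concatenate([CNR.reshape(n, 3),
                         np.stack([u, v], -1).reshape(n, 2)], axis=1).T  # (5, n)
    Pr = np.concatenate([DNR.reshape(n, 3),
                         np.stack([x, y], -1).reshape(n, 2)], axis=1).T  # (5, n)
    return Pl, Pr
```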
As an example, during fusion, a spatial-channel attention module is used for the fusion operation; this establishes feature mappings between pixel-coordinate-system information and camera-coordinate-system information while avoiding the destruction of implicit real-space geometric information early in the hierarchical processing that a clustering scheme with fixed-center-point sampling would cause.
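The patent does not spell out the internals of the spatial-channel attention module; one plausible minimal form, given only as an assumption-laden sketch, is:

```python
# Illustrative spatial-channel attention fusion of pixel- and camera-branch
# feature maps; every architectural choice here is an assumption.
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(2 * c, 2 * c, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(2 * c, 1, 7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * c, c, 1)

    def forward(self, f_pixel, f_camera):          # two (B, c, H, W) maps
        f = torch.cat([f_pixel, f_camera], dim=1)  # joint pixel/camera features
        f = f * self.channel(f)                    # channel-wise re-weighting
        f = f * self.spatial(f)                    # spatial re-weighting
        return self.proj(f)                        # fused (B, c, H, W) map
```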
The step of clustering the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain cluster feature data comprises:

Step D1: performing feature extraction on the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain pixel coordinate feature data and camera coordinate feature data;

Step D2: performing feature aggregation on the plurality of point-set feature data according to the degree of similarity between the pixel coordinate feature data and the camera coordinate feature data, to obtain the cluster feature data.

As an example, before classifying the target instances in the image data, feature extraction must be performed on the data to obtain the corresponding feature data.
As an example, the feature extraction process is shown in Figure 9. The most notable property of the output hierarchical features {F1, F2, F3, F4} is that the feature maps are generated bottom-up: as the network deepens, the generated feature maps have lower resolution and larger channel dimensions, completing a learning process in which semantic information goes from concrete to abstract and from dispersed to enriched. A feature pyramid network is constructed to transform the output hierarchical features {F1, F2, F3, F4} into semantic hierarchical features {P1, P2, P3, P4}. After the semantic hierarchical features {P1, P2, P3, P4} are obtained, CoCFusion uses separate task-processing networks (the Depth Head, Seg Head and 3D-info maps Head) to turn these hierarchical features into the pixel-level outputs expected by the different tasks.
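A hedged sketch of how a semantic pyramid feature map might be routed to the three named heads follows; the head definitions (single 3×3 convolutions) and channel counts are placeholders, not the patent's actual architecture.

```python
# Illustrative task heads consuming a semantic pyramid feature map.
import torch.nn as nn

class TaskHeads(nn.Module):
    def __init__(self, c, num_classes, o3d_dim):
        super().__init__()
        self.depth_head = nn.Conv2d(c, 1, 3, padding=1)              # Depth Head
        self.seg_head = nn.Conv2d(c, num_classes + 1, 3, padding=1)  # Seg Head
        self.o3d_head = nn.Conv2d(c, o3d_dim, 3, padding=1)          # 3D-info maps Head

    def forward(self, p1):                 # p1: a semantic feature map, e.g. P1
        return self.depth_head(p1), self.seg_head(p1), self.o3d_head(p1)
```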
As an example, the process of aggregating features via a similarity metric consists of classifying the individual feature data according to their degree of similarity, thereby obtaining a set of clusters in which the data within each cluster are maximally similar while the similarity between clusters is low.
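As an assumption-laden sketch of this similarity-driven aggregation (cosine similarity standing in for the unspecified metric), clustering a point set against a set of centres might look like:

```python
# Minimal similarity-based clustering: assign each point to its most similar
# centre, then aggregate each cluster's members by averaging.
import torch
import torch.nn.functional as F

def cluster_features(points, centers):     # points: (n, d); centers: (k, d)
    sim = F.normalize(points, dim=1) @ F.normalize(centers, dim=1).T  # (n, k)
    assign = sim.argmax(dim=1)             # hard nearest-centre assignment
    clustered = torch.stack([
        points[assign == j].mean(dim=0) if (assign == j).any() else centers[j]
        for j in range(centers.shape[0])   # empty clusters keep their centre
    ])
    return clustered, assign
```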
In this embodiment, classifying the feature data with a similarity metric makes the neural network attend more to the overall differences between clusters rather than to detailed matches in appearance and shape, and the fusion operation based on the spatial-channel attention module establishes feature mappings between pixel-coordinate-system information and camera-coordinate-system information while avoiding the destruction of implicit real-space geometric information early in the hierarchical processing that fixed-center-point clustering would cause.

Further, based on the first and second embodiments of the present application, another embodiment of the present application is provided. In this embodiment, the step of performing vector calculation on the original RGB image data to obtain the first normal vectors of a plurality of pixels in the original RGB image data comprises:

Step E1: performing data enhancement on the original RGB image data to obtain first enhanced data;

As an example, in a single-stage scheme without pre-segmentation, the task of detecting the target instances of interest cannot be accomplished well by relying solely on imprecise and partially missing depth information; accurate color features must be introduced. To strike a balance between target detection and intra-class variability, the input-end processing method with coordinate-system separation replaces the raw RGB values with enhanced image normal vectors.
As an example, the data enhancement process is as follows: the original RGB image is augmented with the color jitter technique (ColorJitter), randomly altering its brightness, contrast, saturation and hue.
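This augmentation maps directly onto torchvision's ColorJitter transform; the parameter values below are illustrative choices, not values taken from the patent.

```python
# Random brightness / contrast / saturation / hue jitter of the RGB input.
from torchvision import transforms

augment = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                 saturation=0.4, hue=0.1)
rgb_augmented = augment(rgb_image)  # rgb_image: PIL image or (C, H, W) tensor
```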
Step E2: performing two-dimensional gradient calculation on the first enhanced data to obtain gradient calculation values;

Step E3: selecting the maximum of the gradient calculation values and taking that maximum as the first normal vector of each pixel in the first enhanced data.
As an example, the values obtained by computing two-dimensional gradients of the enhanced RGB image channel by channel are the gradient calculation values, and the maximum among the gradient calculation values is taken as the final (first) normal vector of each pixel. The enhanced RGB image and the coordinate data fed in through the pixel-coordinate-system space are shown in Figures 5 and 6.
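The text does not pin down exactly how a three-component normal vector is formed from scalar gradient maxima; one plausible reading, offered purely as an assumption, selects for each pixel the channel with the largest gradient magnitude and builds a unit normal from its (gx, gy, 1):

```python
# Speculative sketch of steps E2-E3: per-channel 2D gradients of the enhanced
# RGB image, channel-wise maximum selection, unit-normal construction.
import numpy as np

def first_normal_vectors(rgb):                     # rgb: (h, w, 3) float array
    gxs, gys, mags = [], [], []
    for ch in range(3):                            # per-channel 2D gradients
        gy, gx = np.gradient(rgb[:, :, ch])
        gxs.append(gx)
        gys.append(gy)
        mags.append(np.hypot(gx, gy))              # gradient magnitude
    best = np.argmax(np.stack(mags, -1), axis=-1)  # channel with max gradient
    ii, jj = np.indices(best.shape)
    gx = np.stack(gxs, -1)[ii, jj, best]
    gy = np.stack(gys, -1)[ii, jj, best]
    n = np.stack([gx, gy, np.ones_like(gx)], -1)   # (h, w, 3) raw normals
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```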
In this embodiment, by enhancing the original RGB image data and replacing the raw RGB values with the enhanced image normal vectors, the image processing model handles intra-class variability of target objects more easily, making it easier to classify multiple target instances.

As an example, the present application comprehensively evaluates the overall three-dimensional understanding capability of the CoCFusion algorithm on the public NOCS dataset, analysing its performance in category-level pose estimation and three-dimensional scene reconstruction across four aspects: comparison with existing baseline methods, effectiveness of the innovations, inference efficiency and qualitative results. The NOCS dataset is the most widely used evaluation benchmark in the field of category-level pose estimation. The pre-training dataset CAMERA is generated from 1085 rendered instances taken from the ShapeNetCore-CAD dataset, composited under random views onto real-environment backgrounds. It should be noted that the NOCS dataset covers only six target categories of interest: bottles, bowls, cameras, cans, mugs and laptops. The CAMERA dataset contains 300K synthetically rendered images in total; the present invention uses 275K of them as the training set, while the remaining 25K images, containing 184 distinct target instances, are used for validation. The real-world dataset REAL consists of multiple real scenes containing different target instances.

Baseline methods: six baselines are compared with CoCFusion to demonstrate the effectiveness of the designed algorithm; the first five are multi-stage category-level overall three-dimensional understanding techniques and the sixth is a single-stage one. (1) NOCS: infers the projection of NOCS coordinates onto the two-dimensional image with an extended Mask R-CNN model, then solves the depth-projection similarity transform with the Umeyama and RANSAC algorithms to infer pose. (2) Synthesis: combines a gradient-based fitting procedure with a parametric neural image synthesis model to implicitly represent the appearance, shape and pose of an entire object category. (3) Metric Scale: extends the NOCS method by separately predicting the metric-scale shape and the NOCS-space projection. (4) SPD: uses a canonical target point cloud and predicts the deformation of the category's three-dimensional shape prior from the observed RGB-D information. (5) CASS: learns an implicit shape representation with a variational autoencoder and uses convolutional networks together with a three-dimensional point cloud reconstruction network to regress implicit shape and pose directly from the segmented-region and target-region point clouds. (6) CenterSnap: treats a target instance as the center point of its two-dimensional image and proposes the first single-stage scheme that detects target instances while simultaneously inferring their six-degree-of-freedom pose, three-dimensional shape and true size. The comparison results are shown in the table below:

Table 4-2 Quantitative comparison of three-dimensional object detection and six-degree-of-freedom pose estimation on the NOCS dataset
In three-dimensional object detection, CoCFusion improves on the single-stage scheme CenterSnap but still lags the multi-stage schemes, and its performance on the REAL275 dataset is weaker than on the CAMERA25 dataset. This is understandable: methods based on target-region segmentation constrain the spatial locations of instances, and the smaller the dataset, the more pronounced this effect. In six-degree-of-freedom pose estimation, CoCFusion performs excellently on the REAL275 dataset: mAP is 25.3% at 5°5cm, 28.4% at 5°10cm, 55.2% at 10°5cm and 62.2% at 10°10cm, absolute improvements of 3.5%, 4.4%, 8.7% and 8.0% over CenterSnap, with better performance when the rotation-error tolerance is larger.
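For reference, the n° m cm criteria quoted here count a prediction as correct when both the rotation geodesic error and the translation error fall within the stated bounds. A simplified sketch of that test follows; it ignores the per-category symmetry handling and the full mAP protocol of the NOCS benchmark.

```python
# Simplified 5-degree / 5 cm style pose-accuracy test.
import numpy as np

def pose_within(R_pred, t_pred, R_gt, t_gt, deg=5.0, cm=5.0):
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0       # rotation geodesic
    rot_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err = np.linalg.norm(t_pred - t_gt) * 100.0         # metres -> cm
    return rot_err <= deg and trans_err <= cm
```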
Referring to Figure 3, Figure 3 is a schematic structural diagram of the device in the hardware operating environment involved in the embodiments of the present application.

As shown in Figure 3, the three-dimensional image generation device may include a processor 1001, a memory 1005 and a communication bus 1002; the communication bus 1002 is used to realize connection and communication between the processor 1001 and the memory 1005.

Optionally, the three-dimensional image generation device may further include a user interface, a network interface, a camera, an RF (Radio Frequency) circuit, sensors, a WiFi module and the like. The user interface may include a display (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also include standard wired and wireless interfaces. The network interface may include a standard wired interface and a wireless interface (such as a WI-FI interface).

Those skilled in the art will understand that the structure of the three-dimensional image generation device shown in Figure 3 does not constitute a limitation on the device; it may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in Figure 3, the memory 1005, as a storage medium, may include an operating system, a network communication module and a three-dimensional image generation program. The operating system is a program that manages and controls the hardware and software resources of the three-dimensional image generation device and supports the running of the three-dimensional image generation program as well as other software and/or programs. The network communication module is used to realize communication between the components inside the memory 1005 and communication with other hardware and software in the three-dimensional image generation system.

In the three-dimensional image generation device shown in Figure 3, the processor 1001 is configured to execute the three-dimensional image generation program stored in the memory 1005 and implement the steps of the three-dimensional image generation method described in any one of the above.

The specific implementation of the three-dimensional image generation device of the present application is substantially the same as the embodiments of the three-dimensional image generation method described above and will not be repeated here.
The present application also provides a three-dimensional image generation apparatus, comprising:

an acquisition module, configured to acquire original RGB image data;

a first processing module, configured to input the original RGB image data into a preset image processing model and, based on the preset image processing model, fuse the pixel-coordinate-system data and the camera-coordinate-system data of the original RGB image data to obtain a low-resolution depth image, wherein the camera-coordinate-system data comprises the depth image information of the original RGB image data and two-dimensional camera coordinate data, and the depth image information is not used as raw input but for coarse-grained image synthesis;

a conversion module, configured to perform three-dimensional conversion on the low-resolution depth image to obtain a target three-dimensional image.
In a possible implementation of the present application, the first processing module comprises:

a calculation unit, configured to perform vector calculation on the original RGB image data to obtain first normal vectors of a plurality of pixels in the original RGB image data;

a processing unit, configured to cluster the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain cluster feature data;

a fusion unit, configured to perform feature fusion on the first normal vectors and the cluster feature data to obtain the low-resolution depth image.
In a possible implementation of the present application, the processing unit comprises:

an extraction subunit, configured to perform feature extraction on the coordinate-system point sets of the pixel-coordinate-system data and the camera-coordinate-system data to obtain pixel coordinate feature data and camera coordinate feature data;

an aggregation subunit, configured to perform feature aggregation on a plurality of point-set feature data according to the degree of similarity between the pixel coordinate feature data and the camera coordinate feature data, to obtain cluster feature data.
In a possible implementation of the present application, the calculation unit comprises:

a processing subunit, configured to perform data enhancement on the original RGB image data to obtain first enhanced data;

a calculation subunit, configured to perform two-dimensional gradient calculation on the first enhanced data to obtain gradient calculation values;

a selection subunit, configured to select the maximum of the gradient calculation values and take that maximum as the first normal vector of each pixel in the first enhanced data.
In a possible implementation of the present application, the conversion module comprises:

a generation unit, configured to generate a target Gaussian heat map according to a plurality of target instances in the original RGB image data;

a reconstruction unit, configured to perform three-dimensional point cloud reconstruction according to the aggregated features in the low-resolution depth image and the target Gaussian heat map, to obtain the target three-dimensional image.
In a possible implementation of the present application, the conversion module further comprises:

a determination unit, configured to calculate, from the target Gaussian heat map, the center coordinates of a plurality of target instances in that map and, based on the plurality of target-instance center coordinates, determine the overall three-dimensional information parameter map;

an extraction unit, configured to extract the complete three-dimensional information of all target instances in the overall three-dimensional information parameter map.
In a possible implementation of the present application, the apparatus further comprises:

a calculation module, configured to calculate a center-distance estimation result according to the center-point data, in the two-dimensional coordinate system, of the target instances of the target three-dimensional image;

a second processing module, configured to apply consistency-constraint processing between the center-distance estimation result and the regression result of the overall three-dimensional information parameter map, so that the input target instances converge.
It should be noted that, as used herein, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or system that includes it.

The serial numbers of the above embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the various embodiments of the present application.

The above are only preferred embodiments of the present application and do not therefore limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311113441.8A | 2023-08-29 | 2023-08-29 | Three-dimensional image generation method, device, equipment and storage medium |
Publications (1)

Publication Number | Publication Date |
---|---|
CN117576303A | 2024-02-20 |
Family ID: 89890594
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118298128A | 2024-06-04 | 2024-07-05 | 浙江凌迪数字科技有限公司 | Three-dimensional grid processing method, device, equipment and readable storage medium |
CN119205924A | 2024-11-27 | 2024-12-27 | 山东建筑大学 | Anti-occlusion human head 3D positioning method and system based on instance segmentation |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |