
CN117333627A - Reconstruction and completion method, system and storage medium for automatic driving scene - Google Patents

Reconstruction and completion method, system and storage medium for automatic driving scene

Info

Publication number
CN117333627A
Authority
CN
China
Prior art keywords
image
sub
scene
csd
rendering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311631542.4A
Other languages
Chinese (zh)
Other versions
CN117333627B (en)
Inventor
张美莹
彭维源
郝祁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southern University of Science and Technology
Original Assignee
Southern University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern University of Science and Technology filed Critical Southern University of Science and Technology
Priority to CN202311631542.4A priority Critical patent/CN117333627B/en
Publication of CN117333627A publication Critical patent/CN117333627A/en
Application granted granted Critical
Publication of CN117333627B publication Critical patent/CN117333627B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0475 - Generative networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/094 - Adversarial learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/005 - General purpose rendering architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics
    • G06T19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2210/00 - Indexing scheme for image generation or computer graphics
    • G06T2210/61 - Scene description
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Architecture (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method, a system and a storage medium for reconstructing and completing an autonomous driving scene. The method comprises the following steps: acquiring stereo camera images of the autonomous driving scene and generating a depth map and a semantic map; optimizing the depth map with the semantic map; building a local surfel model from the optimized depth map, removing conflicting information from the local surfel model, and fusing the local surfel model into the global surfel model to obtain a fused global surfel model; once the whole scene has been built, obtaining novel-view scene images from the global surfel model; because of estimation errors during surfel-model construction and the removal of dynamic objects, various holes/masks appear in the novel-view scene rendering images, so the context of a reference image is matched to obtain context-association information for the novel-view scene rendering image, the context-association information comprising a similarity map and an associated image; and completing the novel-view scene rendering image with a generative adversarial network model.

Description

A reconstruction and completion method, system and storage medium for autonomous driving scenes

Technical Field

The present invention relates to the technical field of image scene generation, and in particular to a method, system and storage medium for reconstructing and completing autonomous driving scenes.

Background

The development of autonomous-driving applications currently requires extensive verification and testing of autonomous vehicles to guarantee that they can be deployed safely. One solution is to realistically reproduce a large number of complex and diverse traffic scenes from different trajectories and viewpoints, so the realism and consistency of the constructed scene streams are crucial for testing autonomous driving systems. Such scenes can be created with game engines or high-fidelity computer graphics (CG) models: Intel's CARLA (an open-source simulator for autonomous-driving research), Microsoft's AirSim (a cross-platform simulator for drones and other autonomous mobile devices built on Unreal Engine) and Google's CarCraft (software for testing self-driving cars in a virtually reconstructed city) all achieve excellent rendering quality, but their synthetic images still lack the richness of real-world images, which degrades performance. Another option is a data-driven approach that reconstructs the traffic scene from sensory data (such as data acquired by cameras and LiDAR); the reconstructed traffic scene retains rich information about semantics, scene lighting and appearance. A data-driven scene-generation pipeline typically consists of three parts: first the sensory data are denoised and refined; then a 3D model is reconstructed with appropriate geometric proxies to generate an ordered static scene; finally image-enhancement techniques are used to improve image quality and consistency. However, to obtain an effective static scene, novel-view image synthesis has to overcome the following challenges:

(1) Building the scene geometry model. Cameras capture higher measurement resolution and more semantic information than LiDAR, but require more accurate depth estimation for model reconstruction. At the same time, the semantic information in the images can help improve the estimated depth. How to use stereo cameras effectively to obtain a high-precision, high-resolution surfel model remains a difficult problem;

(2) Completing the missing parts of the generated images. After the 3D background model is built, various irregular holes appear in the generated novel-view images because of estimation errors and the removal of dynamic objects. Once the 3D background model has been established, these occluded regions cannot be seen from any viewpoint. How to fill these non-negligible holes with plausible content and generate reasonably realistic images is a major issue;

(3) Spatio-temporal consistency between the generated novel-view images. For highly structured novel-view images with small holes, neighbouring pixels can be used to infer the missing parts, but this approach does not work well for complex and diverse traffic scenes. Moreover, most image-completion algorithms inevitably introduce temporal artefacts and jitter. How to generate a spatio-temporally consistent sequence of novel-view images is of great significance for verifying and testing the functional modules of autonomous driving systems.

A moving camera can produce dense 3D point clouds through multi-view stereo (MVS), but this requires acquiring multiple images of the same area from different viewpoints, and moving objects severely degrade performance. Compared with MVS, stereo-matching methods can be used for outdoor 3D reconstruction from high-resolution images, but a major problem is the lack of ground-truth data for training high-performance models. To improve the quality of 3D reconstruction, researchers have also developed methods that use semantic information; these usually involve high computational complexity and joint-learning architectures, and their results still contain many outliers.

Given dense depth, the surfel model (i.e., the 3D background model mentioned above) can be built with small error and its parameters can be adjusted flexibly. However, the rendering quality of novel-view images depends on the quality of the 3D point cloud, which often does not cover the entire scene, so advanced image-completion techniques are required. Recently, methods based on generative adversarial networks (GANs) have been developed that complete images using self-learned encoder-decoder models, dilated convolutions that aggregate multi-scale contextual information, and global and local discriminators. However, these methods usually cannot handle images of complex scenes with large holes. Combined with patch-based learning, they can borrow similar features from the surrounding visible regions through patch-based attention mechanisms; yet such techniques try to find similar patches within or across images and are not suited to generating content that lies outside the visible region. Moreover, they complete each image independently, without considering the spatial consistency between the novel-view images of a sequence. A conditional normalization layer (CNL), on the other hand, can use an additional image to achieve global spatial encoding and spatially varying modulation.

It is therefore necessary to improve on the prior art described above.

Summary of the Invention

The main technical problem solved by the present invention is to provide a reconstruction and completion method, system and storage medium for autonomous driving scenes, in which the optimized depth map is used to build an initial local surfel model, information in the initial local surfel model that conflicts with the initial global surfel model is removed, and the local surfel model with the conflicting information removed is fused with the initial global surfel model to obtain a fused global surfel model; after the global surfel model is obtained, novel-view scene rendering images are rendered with an adaptive method, and finally a generative adversarial network model is used to generate a high-quality, complete novel-view scene image sequence with consistent spatial structure.

According to a first aspect, an embodiment provides a reconstruction and completion method for an autonomous driving scene. The method includes:

acquiring stereo camera images of the autonomous driving scene and generating a depth map and a semantic map of the stereo camera images, where the stereo camera images are captured by a stereo camera filming the autonomous driving scene, the depth map characterizes the depth information of the stereo camera images and the semantic map characterizes their semantic information; and optimizing the depth map with the semantic map to obtain an optimized depth map;

building a local surfel model for the current frame of the stereo camera images from the optimized depth map, removing information in the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame, and fusing the local surfel model of the current frame, with the conflicting information removed, into that global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is complete;

adaptively rendering the global surfel model corresponding to all image frames of the autonomous driving scene to obtain an adaptively rendered global surfel model, and obtaining from it a novel-view scene image sequence corresponding to the autonomous driving scene, where the novel-view scene image sequence includes a plurality of novel-view scene rendering images;

obtaining a rendered-image pyramid of a novel-view scene rendering image, a reference-image pyramid of the reference image corresponding to that rendering image, and a mask-image pyramid of the mask image corresponding to that rendering image, and feeding the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into a trained generative adversarial network model, so that the network completes the regions of missing information in the novel-view scene rendering image and generates a complete novel-view scene image sequence;

where the complete novel-view scene image sequence includes a plurality of complete novel-view scene images, the reference image is a stereo camera image from a viewpoint adjacent to that of the novel-view scene rendering image, and the mask image is derived from the regions of missing information in the novel-view scene rendering image.

According to a second aspect, an embodiment provides a reconstruction and completion system for autonomous driving scenes. The reconstruction and completion system includes:

a stereo camera configured to acquire stereo camera images of the autonomous driving scene;

a semantic-depth generation module configured to generate a depth map and a semantic map of the stereo camera images;

a semantic-depth enhancement module configured to optimize the depth map with the semantic map to obtain an optimized depth map;

a surfel model construction module configured to build a local surfel model for the current frame of the stereo camera images from the optimized depth map, remove information in the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame, and fuse the local surfel model of the current frame, with the conflicting information removed, into that global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is complete; and to adaptively render the global surfel model corresponding to all image frames of the autonomous driving scene to obtain an adaptively rendered global surfel model and obtain from it a novel-view scene image sequence corresponding to the autonomous driving scene, where the novel-view scene image sequence includes a plurality of novel-view scene rendering images;

an adversarial-network completion module configured to obtain a rendered-image pyramid of a novel-view scene rendering image, a reference-image pyramid of the reference image corresponding to that rendering image and a mask-image pyramid of the mask image corresponding to that rendering image, and to feed the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into a trained generative adversarial network model so that the network completes the regions of missing information in the novel-view scene rendering image and generates a complete novel-view scene image sequence;

where the complete novel-view scene image sequence includes a plurality of complete novel-view scene images, the reference image is a stereo camera image from a viewpoint adjacent to that of the novel-view scene rendering image, and the mask image is derived from the regions of missing information in the novel-view scene rendering image.

According to a third aspect, an embodiment provides a computer-readable storage medium. The computer-readable storage medium stores a program, and the program can be executed by a processor to implement the method described in any embodiment herein.

The beneficial effects of the present application are as follows:

The reconstruction and completion method, system and storage medium of the present application acquire stereo camera images of an autonomous driving scene and generate a depth map and a semantic map of those images, where the stereo camera images are captured by a stereo camera filming the autonomous driving scene, the depth map characterizes the depth information of the images and the semantic map characterizes their semantic information. The depth map is optimized with the semantic map to obtain an optimized depth map. A local surfel model of the current frame of the stereo camera images is built from the optimized depth map; information in the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame is removed, and the local surfel model of the current frame with the conflicting information removed is fused into that global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is complete. The global surfel model corresponding to all image frames of the autonomous driving scene is adaptively rendered, and a novel-view scene image sequence corresponding to the autonomous driving scene, comprising a plurality of novel-view scene rendering images, is obtained from the adaptively rendered global surfel model. A rendered-image pyramid of each novel-view scene rendering image, a reference-image pyramid of the corresponding reference image and a mask-image pyramid of the corresponding mask image are obtained and fed into a trained generative adversarial network model, which completes the regions of missing information in the novel-view scene rendering image and generates a complete novel-view scene image sequence, where the complete novel-view scene image sequence includes a plurality of complete novel-view scene images, the reference image is a stereo camera image from a viewpoint adjacent to that of the novel-view scene rendering image, and the mask image is derived from the regions of missing information in the novel-view scene rendering image.

Brief Description of the Drawings

Figure 1 is a flow chart of the reconstruction and completion method according to an embodiment;

Figure 2 is a flow chart of using a generative adversarial network model to complete the regions of missing information in a novel-view scene rendering image and generate a complete novel-view scene image sequence, according to an embodiment;

Figure 3 is a schematic diagram of the data-processing flow inside the CSD residual module according to an embodiment;

Figure 4 is a schematic diagram of the data-processing flow of a CSD sub-module according to an embodiment;

Figure 5 is a schematic diagram of obtaining the context-association information of the image at the corresponding level of the rendered-image pyramid according to an embodiment;

Figure 6 is a schematic diagram of the similarity values between a rendering sub-block and the multiple reference mask sub-blocks in the search region corresponding to that rendering sub-block, according to an embodiment;

Figure 7 is a block diagram of the generator of the generative adversarial network model according to an embodiment;

Figure 8 is a block diagram of the reconstruction and completion system for autonomous driving scenes according to an embodiment.

Detailed Description

The present invention is further described in detail below through specific embodiments in conjunction with the accompanying drawings, in which similar elements in different embodiments use associated similar reference numerals. In the following embodiments, many details are described so that the present application can be better understood. However, those skilled in the art will readily recognize that some of these features can be omitted in certain cases, or can be replaced by other elements, materials or methods. In some cases, some operations related to the present application are not shown or described in the specification, so that the core of the present application is not buried in excessive description; for those skilled in the art, a detailed description of such operations is not necessary, because they can fully understand them from the description in the specification together with common general knowledge in the field.

In addition, the features, operations or characteristics described in the specification can be combined in any suitable manner to form various embodiments. Likewise, the steps or actions in the method descriptions can be reordered or adjusted in a manner obvious to those skilled in the art. Therefore, the various orders in the specification and drawings are only for clearly describing a particular embodiment and do not imply a required order, unless it is otherwise stated that a certain order must be followed.

The ordinal numbers assigned to components herein, such as "first" and "second", are only used to distinguish the objects described and do not carry any meaning of sequence or technical significance. Unless otherwise specified, "connection" and "coupling" in this application include both direct and indirect connections (couplings).

The technical idea of the reconstruction and completion method and system for autonomous driving scenes provided by this application is as follows. First, the stereo camera images of the autonomous driving scene acquired by a stereo camera are used to build a 3D geometric model (the global surfel model below) that represents the large-scale traffic scene; this is possible because the stereo camera images can be processed with high-accuracy learning methods (e.g., semantic segmentation, depth prediction) to provide dense depth information (the depth map below) and semantic information (the semantic map below) for building the 3D geometric model. The semantic segmentation result (the semantic map below) is then used to further refine the depth information, and the refined depth information is used to achieve a high-quality reconstruction of the 3D geometric model. Next, for the regions of missing information in the novel-view scene rendering images of the novel-view scene image sequence obtained from the reconstructed 3D geometric model, sub-blocks or regions similar to the novel-view scene rendering image are searched for in the captured reference images; novel-view scene rendering images at several different scales are matched against the reference image for contextual relevance, and the matching results are converted into modulation affine parameters for normalization trained in the subsequent feature space, so as to generate texture for the regions of missing information in the novel-view scene rendering image, thereby ensuring that the final complete novel-view scene image sequence is consistent in content and structure. For the construction of the 3D geometric model, the required input is the stereo camera images of the autonomous driving scene, from which the corresponding depth map and semantic map are generated. The semantic map is used to correct abnormal boundary points in the depth map and to remove dynamic objects; the optimized depth map is used to build the initial local surfel model, the parts of the initial local surfel model that conflict with the initial global surfel model are removed, and the global surfel model grows incrementally as the fusion strategy (i.e., the process of fusing into the initial global surfel model) proceeds. After the global surfel model is obtained, the novel-view scene rendering images are rendered with an adaptive method, and finally a generative adversarial network model is used to generate a high-quality, complete novel-view scene image sequence with consistent spatial structure.

The technical solutions of the present application are described in detail below with reference to the embodiments.

The present application provides a reconstruction and completion method for autonomous driving scenes. Referring to Figure 1, the reconstruction and completion method includes:

Step S100: acquire stereo camera images of the autonomous driving scene and generate a depth map and a semantic map of the stereo camera images, where the stereo camera images are captured by a stereo camera filming the autonomous driving scene, the depth map characterizes the depth information of the stereo camera images and the semantic map characterizes their semantic information; and optimize the depth map with the semantic map to obtain an optimized depth map;

Step S200: build a local surfel model for the current frame of the stereo camera images from the optimized depth map, remove information in the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame, and fuse the local surfel model of the current frame, with the conflicting information removed, into that global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is complete;

Step S300: adaptively render the global surfel model corresponding to all image frames of the autonomous driving scene to obtain an adaptively rendered global surfel model, and obtain from it a novel-view scene image sequence corresponding to the autonomous driving scene, where the novel-view scene image sequence includes a plurality of novel-view scene rendering images;

Step S400: obtain a rendered-image pyramid of a novel-view scene rendering image, a reference-image pyramid of the reference image corresponding to that rendering image and a mask-image pyramid of the mask image corresponding to that rendering image, and feed the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into a trained generative adversarial network model, so that the network completes the regions of missing information in the novel-view scene rendering image and generates a complete novel-view scene image sequence.

In step S400 above, the complete novel-view scene image sequence includes a plurality of complete novel-view scene images; the reference image is a stereo camera image from a viewpoint adjacent to that of the novel-view scene rendering image, and the mask image is derived from the regions of missing information in the novel-view scene rendering image.
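Step S400 feeds the network multi-scale pyramids of the rendered image, the reference image and the mask image. As a purely illustrative sketch (the number of levels and the 2x2 average-pooling used here are assumptions, not details given in this description), the three pyramids could be built as follows:

```python
import numpy as np

def build_pyramid(img, levels=3):
    """Build a multi-scale pyramid by repeated 2x2 average-pooling (illustrative)."""
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        a = pyr[-1]
        h, w = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2
        a = a[:h, :w]
        pyr.append(0.25 * (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2]))
    return pyr

# toy inputs standing in for a rendered novel view, its reference image and its mask
rendered = np.random.rand(64, 64, 3)
reference = np.random.rand(64, 64, 3)
mask = (np.random.rand(64, 64, 1) > 0.8).astype(np.float32)   # 1 marks a missing pixel
pyramids = [build_pyramid(x) for x in (rendered, reference, mask)]
print([lvl.shape for lvl in pyramids[0]])   # [(64, 64, 3), (32, 32, 3), (16, 16, 3)]
```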

The stereo camera images are captured by a stereo camera filming the autonomous driving scene. The context-association information provides context-matching information for the regions of missing information. The depth map characterizes the depth information of the stereo camera images and the semantic map characterizes their semantic information; the semantic map is used to optimize the depth map to obtain an optimized depth map.

The complete novel-view scene image sequence includes a plurality of complete novel-view scene images corresponding to the autonomous driving scene. The reference image provides scene context information for the novel-view scene rendering image.

In step S100 above, the stereo camera images are captured by a stereo camera filming the autonomous driving scene. Those skilled in the art can choose the autonomous driving scene according to actual needs, and the specific content of the scene is not limited here.

In some embodiments, in step S100 above, an existing semantic segmentation model can be used to process the stereo camera images to generate the semantic map of the autonomous driving scene. The semantic segmentation model can be, for example, a pre-trained PointRend model. PointRend (point-based rendering) is a neural network that performs point-based segmentation prediction at adaptively selected locations based on an iterative subdivision algorithm; built on top of existing state-of-the-art models, PointRend can be flexibly applied to instance and semantic segmentation tasks.

It should be noted that the specific process of processing the stereo camera images with an existing semantic segmentation model to generate the semantic map of the autonomous driving scene in step S100 belongs to the prior art in this field, and is therefore not described in detail here.

In some embodiments, in step S100 above, an existing depth prediction model can be used to process the stereo camera images to generate the depth map of the autonomous driving scene. The depth prediction model can be, for example, a pre-trained PSM-Net model. PSM-Net (pyramid stereo matching network) is a network composed of spatial pyramid pooling and 3D convolutional layers; it incorporates global context into stereo matching so that occluded, textureless or repetitive regions can be estimated reliably. Spatial pyramid pooling captures global context through multi-scale aggregation, and the network of 3D convolutional layers produces the disparity map.

It should be noted that the specific process of processing the stereo camera images with the depth prediction model to generate the depth map of the autonomous driving scene in step S100 belongs to the prior art in this field, and is therefore not described in detail here.
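As a small illustration of the depth-generation step, a stereo-matching network of this kind outputs a disparity map, which is converted to metric depth with the standard relation depth = focal_length * baseline / disparity. The sketch below shows only this conversion; the focal-length and baseline values are illustrative, and the disparity array stands in for the network's output.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) into a metric depth map: depth = f * B / d.
    Pixels with (near-)zero disparity are marked invalid (depth 0)."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# toy example: a 4x4 disparity map from a stereo pair with f = 700 px, B = 0.54 m
disparity = np.array([[35.0, 30.0, 0.0, 20.0]] * 4)
depth = disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.54)
print(depth[0])   # approx. [10.8, 12.6, 0.0, 18.9] metres; 0.0 marks an invalid pixel
```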

It should be noted that the semantic map generated with the existing semantic segmentation model in step S100 serves two purposes: first, the semantic map is used to optimize the less accurate boundary depth information of objects in the depth map, i.e., to optimize the boundary depth information of all objects in the depth map; second, the semantic map is used to remove dynamic objects from the depth map.

Because of the uncertainty of the prediction algorithm (e.g., the depth-prediction algorithm used by the depth prediction model), outliers or noise are inevitably introduced into the depth map. For example, objects of different classes generally lie at different depths (trees, houses, cars, etc.), but the depth information output by the depth prediction model has relatively large errors at object boundaries. Therefore, in step S100, "optimizing the depth map with the semantic map to obtain an optimized depth map" uses the more accurate semantic information (the semantic map) to optimize the boundary depth information of all objects in the depth map, i.e., the depth map is enhanced by refining the boundaries of the target objects, yielding the optimized depth map. Specifically, the boundaries of the target objects that need optimization are first identified, i.e., the regions of the depth map where the depth gradient exceeds a preset threshold; in general, the closer a region is to an object boundary, the larger its depth-gradient variation. The corresponding depths are then optimized towards target points that are geometrically close to the semantic boundary. According to the principle of depth locality within semantically consistent regions, the depth of each pixel can be optimized by averaging the valid parts of its N surrounding pixel blocks, while ensuring that depth values are fused only from semantically identical pixels and that depth boundaries are pushed towards the semantic boundaries. In some embodiments, points whose depth differs greatly from that of their neighbours can also be filtered out. Dynamic objects such as pedestrians and cars can be removed from the depth map by comparing sequential frames and semantic labels. In this way, abnormal boundary points in the depth map are corrected and dynamic objects are removed, yielding the optimized depth map.
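A minimal numpy sketch of this semantic-guided refinement is given below; the gradient threshold, the window size and the ids of the dynamic classes are illustrative assumptions rather than values taken from the description above.

```python
import numpy as np

def refine_depth_with_semantics(depth, semantics, grad_thresh=1.0, win=2,
                                dynamic_labels=(11, 13)):
    """Sketch of the semantic-guided depth refinement:
    - pixels whose local depth gradient exceeds grad_thresh are treated as boundary
      pixels and re-estimated as the mean depth of same-label neighbours in a
      (2*win+1)^2 window;
    - pixels of dynamic classes (dynamic_labels, e.g. pedestrian/car ids in the
      chosen label map; illustrative values) are removed."""
    h, w = depth.shape
    gy, gx = np.gradient(depth)
    boundary = np.hypot(gx, gy) > grad_thresh
    refined = depth.copy()
    for y, x in zip(*np.nonzero(boundary)):
        y0, y1 = max(0, y - win), min(h, y + win + 1)
        x0, x1 = max(0, x - win), min(w, x + win + 1)
        same = (semantics[y0:y1, x0:x1] == semantics[y, x]) & (depth[y0:y1, x0:x1] > 0)
        if same.any():
            refined[y, x] = depth[y0:y1, x0:x1][same].mean()
    refined[np.isin(semantics, dynamic_labels)] = 0.0   # drop dynamic objects
    return refined
```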

The surface of a 3D surfel model (such as the global and local surfel models above) is quantized as a set of oriented small disks, i.e., surfels. The size of a surfel is determined by the image resolution. Each surfel is efficiently defined and visualized by multiple attributes, such as position, normal, radius, colour, semantics and depth. The surfel model uses surfels to give a continuous or discontinuous geometric and feature description of the surface of a 3D object, without modelling the object's interior.

Although step S100 optimizes the depth map, the initial local surfel model still needs further de-noising before the reconstructed new local model (the initial local surfel model) is merged into the existing global model (the initial global surfel model). The specific procedure in step S200 of "building an initial local surfel model from the optimized depth map, removing information in the initial local surfel model that conflicts with the initial global surfel model, and fusing the local surfel model with the conflicting information removed into the initial global surfel model to obtain a fused global surfel model" is therefore as follows. Since depth-prediction performance degrades for distant targets, depth measurements closer to the optical centre of the camera (the stereo camera) are more accurate. Under this assumption, the depths in the initial local surfel model should be smaller than the mapped depths of the initial global surfel model, i.e., a local surfel should be closer to the camera's optical centre than the corresponding global surfel, because the camera moves towards the target and the field of view is less than 180 degrees. If, beyond a certain threshold, a local surfel is farther from the camera's optical centre than the corresponding global surfel of the global surfel model, it is assigned a low confidence; otherwise it is assigned a high confidence.

In some embodiments, "surfels with large depth conflicts" (the conflicting information above) can be found by mapping. After the conflicting information in the initial local surfel model has been removed, the local surfel model with the conflicts removed is fused into the initial global surfel model. In some embodiments, local surfels whose predicted depth has no confidence, or which are much farther from the camera's optical centre than the corresponding global surfel (e.g., beyond a preset threshold), are treated as conflict points and deleted, and the remaining surfels are then fused into the global surfel model to obtain the fused global surfel model.
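The conflict-removal rule can be illustrated with the following sketch, in which a local surfel is dropped when its depth confidence is too low or when it lies much farther from the camera's optical centre than the global surfel mapped to the same pixel; the margin and confidence thresholds are placeholders, not values from the description above.

```python
import numpy as np

def cull_conflicting_surfels(local_pts, local_conf, global_depth_at_pixel,
                             cam_center, margin=0.5, min_conf=0.2):
    """Keep a local surfel only if its confidence is usable and its distance to
    the optical centre does not exceed the mapped global depth by more than
    `margin` (illustrative thresholds)."""
    local_depth = np.linalg.norm(local_pts - cam_center, axis=1)
    keep = (local_conf >= min_conf) & (local_depth <= global_depth_at_pixel + margin)
    return local_pts[keep], local_conf[keep]

# toy example: 3 local surfels, the last one lies about 2 m behind its global counterpart
pts = np.array([[0, 0, 5.0], [1, 0, 6.0], [0, 1, 9.0]])
conf = np.array([0.9, 0.8, 0.7])
global_depth = np.array([5.1, 6.2, 7.0])
print(cull_conflicting_surfels(pts, conf, global_depth, cam_center=np.zeros(3)))
```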

It should be noted that the confidence of the depths above can be initialized directly with existing techniques. For example, the initialization can follow the method described in A. Eldesokey, M. Felsberg and F. S. Khan, "Confidence propagation through CNNs for guided sparse depth regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2423-2436, 2019. Since this initialization can directly adopt the prior art, it is not described further here.

It should be noted that the local surfel model is reconstructed from the depth information (the depth map). Therefore, after the conflicting depths (the conflicting information) have been removed from the initial local surfel model, the remaining local surfel model (which this application regards as relatively accurate) is fused into the global surfel model.

In some embodiments, when "fusing the local surfel model of the current frame, with the conflicting information removed, into the global surfel model" in step S200, in order to avoid outliers during model fusion, only those local surfels that can be matched to a global surfel with similar colour and semantic attributes and high confidence are fused into the global surfel model. When a local surfel is closer to the camera's optical centre than the global surfel, only the local surfel is kept. If the local surfel and the global surfel are both within a preset distance threshold, the global and local surfels are fused using the confidence-weighted average mentioned above. If a local surfel has no matching global surfel, the new local surfel is added to the global surfel model as an unstable surfel, with its confidence set to a relatively low value.
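The per-surfel fusion rules described above can be summarized in a sketch like the following; the surfel is represented as a small dictionary, and the field names, the distance threshold and the way confidence is accumulated are illustrative choices, not specifics of the method.

```python
def fuse_surfel(local, global_, dist_thresh=0.1):
    """Sketch of the fusion rules: keep the local surfel if clearly closer, do a
    confidence-weighted average if both are within dist_thresh, and add
    unmatched local surfels as unstable (low-confidence) surfels."""
    if global_ is None or local['label'] != global_['label']:
        return dict(local, conf=0.1)           # unmatched: add as unstable surfel
    if local['cam_dist'] + dist_thresh < global_['cam_dist']:
        return local                           # local clearly closer: keep local only
    if abs(local['cam_dist'] - global_['cam_dist']) <= dist_thresh:
        wl, wg = local['conf'], global_['conf']
        s = wl + wg
        return {
            'pos':      (wl * local['pos'] + wg * global_['pos']) / s,
            'color':    (wl * local['color'] + wg * global_['color']) / s,
            'cam_dist': (wl * local['cam_dist'] + wg * global_['cam_dist']) / s,
            'label':    global_['label'],
            'conf':     min(1.0, s),           # fused surfel gains confidence
        }
    return global_                             # otherwise keep the global surfel

a = {'pos': 1.0, 'color': 0.5, 'conf': 0.6, 'label': 3, 'cam_dist': 5.0}
b = {'pos': 1.1, 'color': 0.6, 'conf': 0.4, 'label': 3, 'cam_dist': 5.05}
print(fuse_surfel(a, b))   # confidence-weighted average of the two surfels
```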

It should be noted that an autonomous driving scene comprises multiple image frames of stereo camera images, for example a sequence of consecutive frames. In step S200, "the global surfel model corresponding to all image frames before the current frame" is built up frame by frame: once the local surfel model of the current frame has been built, the information that conflicts with the global surfel model corresponding to all image frames before the current frame is removed, and the local surfel model of the current frame is then fused into that global surfel model; as more frames are fused, the global surfel model keeps growing, until the global surfel model corresponding to all image frames of the autonomous driving scene is complete. Note that when the local surfel model of the first frame (the first current frame) is built, there is no global surfel model yet; the global surfel model starts from the second frame (the second current frame), i.e., the local surfel model corresponding to the first current frame is taken as the global surfel model corresponding to all image frames before the current frame, and so on, until the last frame of the autonomous driving scene (the last current frame) has been fused into the global surfel model corresponding to all previous image frames. At that point the global surfel model corresponding to all image frames of the autonomous driving scene is complete. All image frames before the current frame are stereo camera images as described above.
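The frame-by-frame construction described above then amounts to a simple loop in which the first frame seeds the global model and every later frame is first cleaned of conflicts and then fused in. In the sketch below the three callables are placeholders for the processing of steps S100 and S200, not APIs defined by this application:

```python
def build_global_model(frames, make_local_model, cull_conflicts, fuse_models):
    """Incremental construction of the global surfel model (structural sketch)."""
    global_model = None
    for frame in frames:
        local_model = make_local_model(frame)          # from the refined depth map
        if global_model is None:
            global_model = local_model                 # first frame: seed the global model
            continue
        local_model = cull_conflicts(local_model, global_model)
        global_model = fuse_models(global_model, local_model)
    return global_model
```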

In some embodiments, in step S300 above, the novel-view scene image sequence corresponding to the autonomous driving scene can be obtained directly from the global surfel model corresponding to all image frames of the autonomous driving scene.

In some embodiments, in step S300 above, the global surfel model corresponding to all image frames of the autonomous driving scene can be adaptively rendered to obtain an adaptively rendered global surfel model, from which the novel-view scene image sequence corresponding to the autonomous driving scene is obtained.

It should be noted that the specific process of obtaining the novel-view scene image sequence corresponding to the autonomous driving scene from the global surfel model belongs to the prior art and common knowledge in this field, and is therefore not described further here.

It should be noted that the viewpoint from which the camera captured the data is generally called the "original view"; the 3D surfel model (e.g., the global surfel model) is also reconstructed from this original view. Once the 3D surfel model has been built, images from "new views" (i.e., views other than the original one) can be sampled from it, continuously like a video, to obtain the novel-view scene image sequence.
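For illustration, a novel-view trajectory can be obtained by perturbing each original camera pose within the limits mentioned in the next paragraph (about ±15 degrees for the optical axis and roughly ±1 m for the baseline). The axis convention and the offset values in this sketch are assumptions:

```python
import numpy as np

def sample_novel_poses(orig_positions, orig_yaws, lateral_offset=1.0, yaw_offset_deg=10.0):
    """Offset each original pose sideways and rotate its optical axis slightly
    to obtain nearby novel viewpoints (illustrative offsets and convention)."""
    poses = []
    for p, yaw in zip(orig_positions, orig_yaws):
        right = np.array([np.cos(yaw), -np.sin(yaw), 0.0])   # assumed camera right axis
        poses.append({
            'position': np.asarray(p, float) + lateral_offset * right,
            'yaw': yaw + np.deg2rad(yaw_offset_deg),
        })
    return poses

# toy trajectory of three forward-facing poses spaced 2 m apart
traj = sample_novel_poses([(0, 0, 1.5), (0, 2, 1.5), (0, 4, 1.5)], [np.pi / 2] * 3)
print(traj[0]['position'])
```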

After the global surfel model has been built, a novel-view image sequence can be obtained from the (incomplete) global surfel model. The new camera viewpoints are close to the original views; in some embodiments, the optical axis is restricted to within ±15 degrees and the baseline is relatively wide (about ±1 m). In some embodiments, to better exploit the high-fidelity visualization of the global surfel model, this application adaptively renders the global surfel model corresponding to all image frames of the autonomous driving scene. This processing involves two main schemes, which adjust the normals and the radii of the surfels in the global surfel model respectively. First, the normals of surfels seen from distant viewpoints are adjusted according to the normal vector along the viewing direction; this ensures that distant surfels can still contribute effectively to rendering. Second, a fixed surfel radius in the global surfel model tends to produce blurred textures when viewed from a new viewpoint at short range. To overcome this, in some embodiments the surfel radius is adjusted using the angle between the surfel normal and the viewing direction of the new viewpoint. When a surfel covers several pixels and its normal lies within a certain range of the viewing direction (e.g., π/2), its radius r_i(u) is greatly reduced as a function of that angle in order to avoid redundant overlap with other surfels, where u is the position of the pixels covered by the surfel, α is the angle between the surfel normal and the viewing direction of the new viewpoint, and i is the index of the surfel. Therefore, in some embodiments, the normal and/or the radius of a surfel in the global surfel model (e.g., a surfel seen from a distant viewpoint) can be adjusted according to the normal vector along the viewing direction.
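The radius-adjustment half of this adaptive rendering scheme might look like the following sketch. Because the exact shrink formula is not reproduced above, the cosine scaling used here is only an assumption about how the radius could depend on the angle α between the surfel normal and the viewing direction:

```python
import numpy as np

def adapt_surfel_radius(radius, normal, view_dir, covers_multiple_pixels):
    """Shrink the surfel radius when it projects onto several pixels and its
    normal is within pi/2 of the viewing direction; the cosine factor is an
    assumed stand-in for the method's actual formula."""
    n = normal / np.linalg.norm(normal)
    v = view_dir / np.linalg.norm(view_dir)
    cos_a = float(np.dot(n, v))
    alpha = np.arccos(np.clip(cos_a, -1.0, 1.0))
    if covers_multiple_pixels and alpha < np.pi / 2:
        return radius * max(abs(cos_a), 1e-2)   # assumed shrink factor
    return radius

print(adapt_surfel_radius(0.05, np.array([0, 0, 1.0]), np.array([0, 0.5, 1.0]), True))
```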

上述步骤S400主要用于完成交通场景（即上述新视角场景渲染图像）的图像补全。上述全局面元模型是基于真实世界的传感器数据（如上述立体相机图像）而构建的。由于上述全局面元模型重建的不完整以及上述动态物体的移除，场景背景渲染图像（即上述新视角场景渲染图像）中出现各种掩膜或孔洞。本申请主要考虑到空间和内容上的一致性，采用一定的内容去填充这些掩模/空洞（即前述新视角场景渲染图像中缺失信息的区域），使用下文的对抗生成网络模型合成高质量、逼真的交通场景序列（即前述完整新视角场景图像序列）。对抗生成网络模型的输入包括不同尺度（即不同分辨率）的新视角场景渲染图像、不同尺度的掩码图像和不同尺度的参考图像。对抗生成网络模型的整体处理过程包括两个阶段：The above-mentioned step S400 is mainly used to complete the traffic scene image (that is, the above-mentioned new perspective scene rendering image). The above global surface element model is built from real-world sensor data (such as the above-mentioned stereo camera images). Because the reconstruction of the global surface element model is incomplete and the dynamic objects have been removed, various masks or holes appear in the rendered scene background image (that is, the new perspective scene rendering image). Considering spatial and content consistency, this application fills these masks/holes (that is, the areas with missing information in the new perspective scene rendering image) with suitable content, and uses the adversarial generative network model described below to synthesize a high-quality, realistic traffic scene sequence (that is, the aforementioned complete new perspective scene image sequence). The inputs of the adversarial generative network model include new perspective scene rendering images at different scales (that is, different resolutions), mask images at different scales, and reference images at different scales. The overall processing of the adversarial generative network model consists of two stages:

第一阶段，首先将对应尺度的新视角场景渲染图像以子块（patch）的方式划分为多个渲染子块，将渲染子块在参考图像上通过下文的掩膜图像表征关联机制搜索近似上下文而得到上下文关联信息；其中，为了增加上述近似上下文相关信息的匹配精度，将对应的掩膜图像的掩膜子块应用到参考图像对应的参考子块后得到参考掩膜子块，再进行渲染子块与参考掩膜子块匹配；In the first stage, the new perspective scene rendering image at the corresponding scale is first divided into multiple rendering sub-blocks (patches); for each rendering sub-block, an approximate context is searched on the reference image through the mask image representation association mechanism described below to obtain context-related information. To increase the matching accuracy of this approximate context-related information, the mask sub-block of the corresponding mask image is applied to the corresponding reference sub-block of the reference image to obtain a reference mask sub-block, and the rendering sub-block is then matched against the reference mask sub-block;

第二阶段，将多个尺度的渲染图像作为输入，通过对抗生成网络模型来对新视角场景渲染图像中缺失信息的区域（如空洞等）进行补全而生成空间结构一致的、高质量的完整新视角场景图像序列；其中，将第一阶段预测的近似上下文相关信息（即，下文的关联图像和相似度图）转移到多个CSD子模块，以保持所生成的完整新视角场景图像序列之间的一致性。通过上述两阶段的图像补全处理，进而使得所生成的交通场景图像（即完整新视角场景图像序列）更加逼真、在内容上更加连贯。In the second stage, rendered images at multiple scales are taken as input, and the adversarial generative network model completes the areas with missing information (such as holes) in the new perspective scene rendering image, generating a high-quality complete new perspective scene image sequence with a consistent spatial structure. The approximate context-related information predicted in the first stage (that is, the associated images and similarity maps described below) is transferred to multiple CSD sub-modules to maintain consistency across the generated complete new perspective scene image sequence. Through this two-stage image completion process, the generated traffic scene images (that is, the complete new perspective scene image sequence) become more realistic and more coherent in content.

需要说明的是，请参考图4，掩码图像I m 是基于新视角场景渲染图像I r 中缺失信息的区域而得到的。这是由于从自适应渲染后的全局面元模型中获取的新视角场景渲染图像I r 本身就会出现各种不规则空洞/掩码，若直接使用新视角场景渲染图像与来自邻近视角的另一个新视角场景渲染图像（即参考图像）进行相似性匹配，很可能造成相似性匹配的准确度不高（例如，需补全的新视角场景渲染图像的一个区域存在空洞，而来自邻近视角的参考图像的同一区域并不存在该空洞，若直接进行相似性匹配，可以理解的是，该区域很难匹配到相似的上下文信息），因此，可以首先将新视角场景渲染图像I r 的各种不规则空洞/掩码单独提取出来而制作成与新视角场景渲染图像I r 对应的掩码图像I m （例如，新视角场景渲染图像I r 中汽车的后视镜区域存在不规则空洞，本申请就将该汽车的后视镜区域存在不规则空洞的区域单独提取出来而生成掩码图像I m ），然后再将掩码图像I m 应用于上述参考图像而得到对应的参考掩码子块，如此便可以使得参考图像中与新视角场景渲染图像I r 的各种不规则空洞/掩码对应的区域也自动地出现相同的不规则空洞/掩码等缺失信息的区域，进而可以显著地提高上述近似上下文相关信息的匹配精度（如下述渲染子块和参考子块进行匹配的精度）；反之，若不将掩码图像I m 应用于参考图像而得到对应的参考掩码子块，而直接采用参考子块与渲染子块进行相似性匹配，则容易造成后续生成的关联图像I c' 的准确度不高，进而无法完成高质量地对新视角场景渲染图像中缺失信息的区域进行补全。It should be noted that, referring to Figure 4, the mask image I m is obtained from the areas with missing information in the new perspective scene rendering image I r . The new perspective scene rendering image I r obtained from the adaptively rendered global surface element model itself contains various irregular holes/masks; if the new perspective scene rendering image were directly matched for similarity against another image from an adjacent viewpoint (that is, the reference image), the accuracy of the similarity matching would likely be low (for example, a region of the new perspective scene rendering image to be completed contains a hole while the same region of the reference image from the adjacent viewpoint does not, so it is understandably difficult to match similar context information for that region if similarity matching is performed directly). Therefore, the various irregular holes/masks of the new perspective scene rendering image I r can first be extracted separately and made into a mask image I m corresponding to I r (for example, if the rear-view mirror region of a car in the new perspective scene rendering image I r contains irregular holes, this application extracts that region separately to generate the mask image I m ); the mask image I m is then applied to the above reference image to obtain the corresponding reference mask sub-blocks, so that the regions of the reference image corresponding to the irregular holes/masks of the new perspective scene rendering image I r automatically exhibit the same missing-information regions, which can significantly improve the matching accuracy of the above approximate context-related information (such as the accuracy of matching between the rendering sub-blocks and reference sub-blocks described below). Conversely, if the mask image I m is not applied to the reference image to obtain the corresponding reference mask sub-blocks and the reference sub-blocks are matched directly against the rendering sub-blocks, the accuracy of the subsequently generated associated image I c' tends to be low, making it impossible to complete the areas with missing information in the new perspective scene rendering image with high quality.

请参考图2，上述步骤S400中，将渲染图像金字塔、参考图像金字塔和掩码图像金字塔输入训练好的对抗生成网络模型，以利用对抗生成网络模型对新视角场景渲染图像中缺失信息的区域进行补全而生成完整新视角场景图像序列，包括：Please refer to Figure 2. In the above-mentioned step S400, the rendered image pyramid, the reference image pyramid and the mask image pyramid are input into the trained adversarial generative network model, so that the adversarial generative network model completes the areas with missing information in the new perspective scene rendering image and generates a complete new perspective scene image sequence, including:

步骤S410：渲染图像金字塔、参考图像金字塔和掩码图像金字塔中各层的图像结果按照由顶层至底层的顺序分别输入生成器内第一级的CSD残差模块至最后一级的CSD残差模块；将前一级的CSD残差模块的输出作为相邻的后一级的CSD残差模块的一种输入，将最后一级的CSD残差模块的输出作为完整新视角场景图像；其中，生成器内第一级的CSD残差模块至最后一级的CSD残差模块的网络结构均相同；其中，渲染图像金字塔、参考图像金字塔和掩码图像金字塔的层数均与CSD残差模块的数量相同。Step S410: The image results of each layer in the rendered image pyramid, the reference image pyramid and the mask image pyramid are respectively input, in order from the top layer to the bottom layer, into the first-level to the last-level CSD residual modules in the generator; the output of each CSD residual module is used as one input of the adjacent next-level CSD residual module, and the output of the last-level CSD residual module is taken as the complete new perspective scene image. The network structures of the first-level to the last-level CSD residual modules in the generator are all the same, and the number of layers of the rendered image pyramid, the reference image pyramid and the mask image pyramid is the same as the number of CSD residual modules.

一些实施例中，上述步骤S400中，可以对新视角场景渲染图像、与新视角场景渲染图像对应的参考图像和与新视角场景渲染图像对应的掩码图像分别进行下采样等现有技术而分别获得新视角场景渲染图像的渲染图像金字塔、参考图像的参考图像金字塔和掩码图像的掩码图像金字塔。In some embodiments, in the above-mentioned step S400, the new perspective scene rendering image, the reference image corresponding to it and the mask image corresponding to it can each be processed by existing techniques such as downsampling, so as to obtain the rendered image pyramid of the new perspective scene rendering image, the reference image pyramid of the reference image and the mask image pyramid of the mask image, respectively.

图像金字塔是由一系列不同分辨率的图像构成的集合,主要采用上采样和下采样两种常用的运算。下采样是指由高分辨率图像向低分辨率图像采样并进行高斯滤波平滑处理。上采样是指由低分辨率图像向高分辨率图像插值并进行高斯滤波平滑处理。对原始图像进行上采样而得到的图像的分辨率变为原始图像的分辨率的两倍。对原始图像进行下采样得到的图像的分辨率变为原始图像的分辨率的一半。The image pyramid is a collection of images with different resolutions, mainly using two common operations: upsampling and downsampling. Downsampling refers to sampling from a high-resolution image to a low-resolution image and performing Gaussian filter smoothing. Upsampling refers to interpolating from a low-resolution image to a high-resolution image and performing Gaussian filter smoothing. The resolution of the image obtained by upsampling the original image becomes twice that of the original image. The resolution of the image obtained by downsampling the original image becomes half that of the original image.

一些实施例中,向下采样而构建图像金字塔的简要过程可以为:对于给定的图像先做一次高斯滤波平滑处理,也就是对上述图像进行一个卷积操作;对上述图像进行下采样,其中,可以在上述图像的行方向取奇数列,在上述图像的列方向取偶数列;或者,可以在上述图像的行方向取偶数列,在上述图像的列方向取奇数列;对采样后的图像,重复前两步操作即可得到图像金字塔。需要说明的是,一些实施例中,对新视角场景渲染图像、参考图像和掩码图像分别进行下采样而分别得到新视角场景渲染图像的渲染图像金字塔、参考图像的参考图像金字塔和掩码图像的掩码图像金字塔的具体过程属于本领域的现有技术,故此处不再赘述。In some embodiments, the brief process of downsampling to build an image pyramid can be as follows: first perform a Gaussian filter smoothing process on a given image, that is, perform a convolution operation on the above image; perform downsampling on the above image, where , you can take odd columns in the row direction of the above image, and take even columns in the column direction of the above image; or you can take even columns in the row direction of the above image, and take odd columns in the column direction of the above image; for the sampled image , repeat the first two steps to obtain the image pyramid. It should be noted that in some embodiments, the new perspective scene rendering image, the reference image and the mask image are respectively down-sampled to obtain the rendering image pyramid of the new perspective scene rendering image, the reference image pyramid and the mask image of the reference image respectively. The specific process of masking the image pyramid belongs to the existing technology in this field, so it will not be described again here.
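Purely as an illustration of the downsampling just described (not part of the original disclosure), a Python sketch using OpenCV follows; cv2.pyrDown performs the Gaussian smoothing and then keeps every other row and column, halving the resolution at each level.

```python
import cv2

def build_pyramid(img, levels=5):
    """Illustrative sketch: repeatedly smooth and downsample to build an image pyramid.
    With levels=5 the resulting scales are 1/2, 1/4, 1/8, 1/16 and 1/32 of the original."""
    pyramid = []
    current = img
    for _ in range(levels):
        current = cv2.pyrDown(current)   # Gaussian filtering + keep every other row/column
        pyramid.append(current)
    return pyramid                       # pyramid[-1] is the coarsest (top) level
```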

一些实施例中，上述步骤S400中，上述渲染图像金字塔/参考图像金字塔和掩码图像金字塔中各层图像结果的尺度（或空间分辨率）按照由顶层至底层的顺序可以为1/32、1/16、1/8、1/4、1/2，其中，1/2表示当前层的尺度（或空间分辨率）为原始尺度（或空间分辨率）的1/2，以此类推。本领域技术人员可以根据实际需求而对上述渲染图像金字塔/参考图像金字塔和掩码图像金字塔中各层图像结果的尺度（或空间分辨率）进行调整。In some embodiments, in the above-mentioned step S400, the scales (or spatial resolutions) of the image results of each layer in the rendered image pyramid/reference image pyramid and the mask image pyramid may be, in order from the top layer to the bottom layer, 1/32, 1/16, 1/8, 1/4 and 1/2, where 1/2 means that the scale (or spatial resolution) of the current layer is 1/2 of the original scale (or spatial resolution), and so on. Those skilled in the art can adjust the scales (or spatial resolutions) of the image results of each layer in the rendered image pyramid/reference image pyramid and mask image pyramid according to actual needs.

一些实施例中，对抗生成网络模型包括一个生成器。请参考图7，生成器包括多级彼此串联连接的CSD残差模块（图7中由左至右的CSD残差模块分别为第一级的CSD残差模块、第二级的CSD残差模块至最后一级的CSD残差模块）。其中，CSD（Contextual and Spatial Denormalization）表示上下文和空间非规范化。In some embodiments, the adversarial generative network model includes a generator. Please refer to Figure 7: the generator includes multiple levels of CSD residual modules connected in series (the CSD residual modules from left to right in Figure 7 are the first-level CSD residual module, the second-level CSD residual module, and so on up to the last-level CSD residual module). Here, CSD (Contextual and Spatial Denormalization) stands for contextual and spatial denormalization.

请参考图7，上述步骤S410中，渲染图像金字塔、参考图像金字塔和掩码图像金字塔中各层的图像结果按照由顶层至底层的顺序分别输入生成器内第一级的CSD残差模块至最后一级的CSD残差模块；将前一级的CSD残差模块的输出作为相邻的后一级的CSD残差模块的一种输入，将最后一级的CSD残差模块的输出作为完整新视角场景图像，包括：Please refer to Figure 7. In the above-mentioned step S410, inputting the image results of each layer in the rendered image pyramid, the reference image pyramid and the mask image pyramid, in order from the top layer to the bottom layer, into the first-level to the last-level CSD residual modules in the generator, using the output of each CSD residual module as one input of the adjacent next-level CSD residual module, and taking the output of the last-level CSD residual module as the complete new perspective scene image, includes:

将渲染图像金字塔、参考图像金字塔和掩码图像金字塔中顶层的图像结果,以及与顶层的图像结果的尺度对应的新视角场景渲染图像输入第一级的CSD残差模块;Input the top-level image results in the rendered image pyramid, reference image pyramid and mask image pyramid, as well as the new perspective scene rendering image corresponding to the scale of the top-level image result, into the first-level CSD residual module;

将渲染图像金字塔、参考图像金字塔和掩码图像金字塔中除顶层外的其余各层的图像结果按照由顶层的下一层至底层的顺序分别输入生成器内第二级的CSD残差模块至最后一级的CSD残差模块；将最后一级的CSD残差模块的输出作为完整新视角场景图像。The image results of the remaining layers (other than the top layer) in the rendered image pyramid, the reference image pyramid and the mask image pyramid are respectively input, in order from the layer below the top layer down to the bottom layer, into the second-level to the last-level CSD residual modules in the generator; the output of the last-level CSD residual module is taken as the complete new perspective scene image.
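A minimal Python sketch of this cascade is given below for illustration only; the CSD residual blocks themselves are assumed to be provided (a block sketch appears later in this text), and the placement of the upsampling between blocks is our assumption based on the later remark that the generator consists of CSD residual modules with upsampling layers.

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidGenerator(nn.Module):
    """Minimal sketch: `blocks` holds one CSD residual block per pyramid level,
    ordered from the coarsest (top) level to the finest (bottom) level."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, render_pyr, ref_pyr, mask_pyr):
        # Each pyramid is ordered top (coarsest) -> bottom (finest) and has as
        # many levels as there are CSD residual blocks.
        x = render_pyr[0]   # the coarsest rendered image also seeds the first block
        for i, (blk, i_r, i_c, i_m) in enumerate(zip(self.blocks, render_pyr, ref_pyr, mask_pyr)):
            if i > 0:
                x = F.interpolate(x, scale_factor=2, mode='nearest')  # assumed upsampling between levels
            x = blk(x, i_r, i_c, i_m)   # previous block's output feeds the next block
        return x                        # output of the last block: the completed image
```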

一些实施例中，可以将渲染图像金字塔、参考图像金字塔和掩码图像金字塔中顶层的图像结果，以及批量的与顶层的图像结果的尺度对应的新视角场景渲染图像输入第一级的CSD残差模块。本领域技术人员可以自行确定上述每个批量（batchsize=N）的数量，如，每个批量的数量为五个。In some embodiments, the top-layer image results of the rendered image pyramid, the reference image pyramid and the mask image pyramid, together with a batch of new perspective scene rendering images corresponding to the scale of the top-layer image results, can be input into the first-level CSD residual module. Those skilled in the art can determine the size of each batch (batchsize=N) by themselves, for example, five images per batch.

上述CSD残差模块包括第一级的CSD子模块、第二级的CSD子模块和第三级的CSD子模块。其中,CSD(Contextual and Spatial Denormalization)子模块,即,上下文和空间非规范化子模块。The above-mentioned CSD residual module includes a first-level CSD sub-module, a second-level CSD sub-module and a third-level CSD sub-module. Among them, CSD (Contextual and Spatial Denormalization) sub-module, that is, contextual and spatial denormalization sub-module.

请参考图3,上述CSD残差模块内部的数据处理流程为:Please refer to Figure 3. The data processing flow inside the above CSD residual module is:

将渲染图像金字塔、参考图像金字塔和掩码图像金字塔中对应层的图像结果（请参考图4，如对应尺度的掩码图像I m 、参考图像I c 和新视角场景渲染图像I r ），以及输入CSD残差模块的特征图分别输入至第一级的CSD子模块和第三级的CSD子模块，将第一级的CSD子模块的输出作为第一ReLU激活函数的输入，对第一ReLU激活函数的输出进行卷积操作而得到第一卷积结果，将第一卷积结果输入第二级的CSD子模块，将第二级的CSD子模块的输出作为第二ReLU激活函数的输入，对第二ReLU激活函数的输出进行卷积操作而得到第二卷积结果，将第三级的CSD子模块的输出作为第三ReLU激活函数的输入，对第三ReLU激活函数的输出进行卷积操作而得到第三卷积结果，对第二卷积结果和第三卷积结果执行单位加操作而得到CSD残差模块的输出；The image results of the corresponding layer in the rendered image pyramid, the reference image pyramid and the mask image pyramid (please refer to Figure 4, e.g., the mask image I m , the reference image I c and the new perspective scene rendering image I r at the corresponding scale), together with the feature map input to the CSD residual module, are fed into the first-level CSD sub-module and the third-level CSD sub-module respectively. The output of the first-level CSD sub-module is used as the input of the first ReLU activation function, and a convolution is applied to its output to obtain the first convolution result; the first convolution result is fed into the second-level CSD sub-module, the output of the second-level CSD sub-module is used as the input of the second ReLU activation function, and a convolution is applied to its output to obtain the second convolution result; the output of the third-level CSD sub-module is used as the input of the third ReLU activation function, and a convolution is applied to its output to obtain the third convolution result; finally, an element-wise (unit) addition of the second convolution result and the third convolution result gives the output of the CSD residual module;

其中，输入第一级的CSD残差模块的特征图为与顶层的图像结果的尺度对应的新视角场景渲染图像（即需要补全的新视角场景渲染图像），输入相邻的后一级的CSD残差模块的特征图（请参考图3中的x i ）为前一级的CSD残差模块的输出。Here, the feature map input to the first-level CSD residual module is the new perspective scene rendering image corresponding to the scale of the top-layer image result (that is, the new perspective scene rendering image to be completed), and the feature map input to each adjacent next-level CSD residual module (please refer to x i in Figure 3) is the output of the previous-level CSD residual module.

例如,输入第二级的CSD残差模块的特征图为第一级的CSD残差模块的输出,输入第三级的CSD残差模块的特征图为第二级的CSD残差模块的输出,输入其他级的CSD残差模块的特征图以此类推。For example, the feature map input to the second-level CSD residual module is the output of the first-level CSD residual module, and the feature map input to the third-level CSD residual module is the output of the second-level CSD residual module. Input the feature maps of other levels of CSD residual modules and so on.
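For illustration, the residual-block dataflow described above could be sketched as follows; channel sizes and the csd_cls interface are assumptions, not taken from this text, and a matching CSD sub-module sketch appears later.

```python
import torch.nn as nn
import torch.nn.functional as F

class CSDResBlock(nn.Module):
    """Illustrative sketch of the CSD residual block dataflow described above.
    csd_cls is a callable taking a channel count and returning a CSD sub-module
    (e.g. functools.partial(CSDSubModule, mpc_match=...)); channel sizes are assumptions."""

    def __init__(self, in_ch, out_ch, csd_cls):
        super().__init__()
        mid_ch = min(in_ch, out_ch)
        self.csd1 = csd_cls(in_ch)                        # first-level CSD sub-module
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.csd2 = csd_cls(mid_ch)                       # second-level CSD sub-module
        self.conv2 = nn.Conv2d(mid_ch, out_ch, 3, padding=1)
        self.csd3 = csd_cls(in_ch)                        # third-level CSD sub-module (skip branch)
        self.conv3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x, i_r, i_c, i_m):
        # Main branch: CSD -> ReLU -> conv -> CSD -> ReLU -> conv.
        h = self.conv1(F.relu(self.csd1(x, i_r, i_c, i_m)))
        h = self.conv2(F.relu(self.csd2(h, i_r, i_c, i_m)))
        # Skip branch: CSD -> ReLU -> conv.
        s = self.conv3(F.relu(self.csd3(x, i_r, i_c, i_m)))
        return h + s                                      # element-wise (unit) addition
```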

一些实施例中,渲染图像金字塔的层数、参考图像金字塔的层数、掩码图像金字塔的层数均与生成器中CSD残差模块的数量相同。上述第一ReLU激活函数、第二ReLU激活函数和第三ReLU激活函数,即本领域现有常用的ReLU激活函数。In some embodiments, the number of layers of the rendered image pyramid, the number of layers of the reference image pyramid, and the number of layers of the mask image pyramid are the same as the number of CSD residual modules in the generator. The above-mentioned first ReLU activation function, second ReLU activation function and third ReLU activation function are currently commonly used ReLU activation functions in this field.

上述第一级的CSD子模块、第二级的CSD子模块和第三级的CSD子模块的数据处理流程均相同。请参考图4,图4中的BatchNorm表示归一化层或归一化操作,conv表示下述的卷积操作(如第一卷积操作),MPC表示下文的掩膜图像表征关联机制,element-wise表示逐元素乘法操作。The data processing procedures of the above-mentioned first-level CSD sub-module, second-level CSD sub-module and third-level CSD sub-module are all the same. Please refer to Figure 4. BatchNorm in Figure 4 represents the normalization layer or normalization operation, conv represents the following convolution operation (such as the first convolution operation), MPC represents the mask image representation association mechanism below, element -wise means element-wise multiplication operation.

请参考图4,CSD子模块的数据处理流程为:Please refer to Figure 4. The data processing flow of the CSD sub-module is:

对输入的渲染图像金字塔中对应层的图像结果(如对应尺度的新视角场景渲染图像I r )、参考图像金字塔中对应层的图像结果(如对应尺度的参考图像I c )和掩码图像金字塔中对应层的图像结果(如对应尺度的掩码图像I m )进行相似性匹配,以得到与渲染图像金字塔中对应层的图像结果的上下文关联信息;The image results of the corresponding layer in the input rendered image pyramid (such as the new perspective scene rendering image I r corresponding to the scale), the image results of the corresponding layer in the reference image pyramid (such as the reference image I c corresponding to the scale) and the mask image pyramid The image results of the corresponding layer in the image pyramid (such as the mask image I m of the corresponding scale) are subjected to similarity matching to obtain context-related information with the image results of the corresponding layer in the rendered image pyramid;

其中，上下文关联信息包括相似度图和渲染图像金字塔中对应层的图像结果（即对应尺度的I r ）的关联图像I c' ；关联图像是基于参考图像金字塔中对应层的图像结果（即对应尺度的I c ）与渲染图像金字塔中对应层的图像结果（即对应尺度的I r ）最相似的多个区域而生成的，相似度图用于表征渲染图像金字塔中对应层的图像结果（即对应尺度的I r ）与参考图像金字塔中对应层的图像结果（即对应尺度的I c ）之间的相关性；Here, the context-related information includes the similarity map and the associated image I c' of the image result of the corresponding layer in the rendered image pyramid (that is, I r at the corresponding scale). The associated image is generated from the regions of the image result of the corresponding layer in the reference image pyramid (that is, I c at the corresponding scale) that are most similar to the image result of the corresponding layer in the rendered image pyramid (that is, I r at the corresponding scale); the similarity map is used to characterize the correlation between the image result of the corresponding layer in the rendered image pyramid (I r at the corresponding scale) and the image result of the corresponding layer in the reference image pyramid (I c at the corresponding scale);

其中,上下文关联信息用于为缺失信息的区域提供上下文匹配信息;Among them, context-related information is used to provide context matching information for areas with missing information;

对关联图像I c' 和相似度图分别执行第二卷积操作而分别得到关联图像I c' 的空间特征fI c' )和相似度图的空间特征;Perform a second convolution operation on the associated image I c' and the similarity map respectively to obtain the spatial features f ( I c' ) of the associated image I c' and the spatial features of the similarity map respectively;

对渲染图像金字塔中对应层的图像结果I r 执行第二卷积操作得到对应层的新视角场景渲染图像的空间特征fI r );Perform a second convolution operation on the image result I r of the corresponding layer in the rendered image pyramid to obtain the spatial feature f ( I r ) of the new perspective scene rendering image of the corresponding layer;

对相似度图的空间特征α c,h,w和新视角场景渲染图像的空间特征fI r )执行逐元素乘法操作而得到逐元素乘法结果;Perform element-wise multiplication operations on the spatial features α c, h, w of the similarity map and the spatial features f ( I r ) of the new perspective scene rendering image to obtain the element-wise multiplication result;

将逐元素乘法结果与关联图像I c' 的空间特征fI c' )相加而得到第二仿射参数β c,h,w ;其中,第二仿射参数的表达式为:Add the element-wise multiplication result to the spatial feature f ( I c' ) of the associated image I c' to obtain the second affine parameter β c,h,w ; where the expression of the second affine parameter is:

为逐元素乘法结果,α c,h,w为相似度图的空间特征,为关联图像I c' 的空间特征;其中,所述c、hw分别为所述对应层的图像结果的特征通道数、长度和宽度,所述I r 表示所述新视角场景渲染图像,所述I c' 为所述对应层的图像结果的关联图像; ; is the element-wise multiplication result, α c, h, w are the spatial characteristics of the similarity map, is the spatial feature of the associated image I c' ; wherein, the c, h and w are the number of characteristic channels, the length and the width of the image result of the corresponding layer respectively, and the I r represents the new perspective scene rendering image, The I c' is the associated image of the image result of the corresponding layer;

对输入CSD残差模块的多个特征图进行归一化而得到归一化参数;Normalize multiple feature maps input to the CSD residual module to obtain normalized parameters;

将归一化参数与第二仿射参数β c,h,w 相加而得到的归一化结果(即下文的BN'x i ))作为CSD子模块的输出。The normalized result obtained by adding the normalized parameter and the second affine parameter β c, h, w (ie, BN ' ( xi ) below) is used as the output of the CSD sub-module.

需要说明的是,对关联图像I c' 和相似度图分别执行第二卷积操作分别是由不同的简单两层卷积网络来实现的。对渲染图像金字塔中对应层的图像结果(即Ir)执行第二卷积操作也是由简单两层卷积网络来实现的。但是,需要注意的是,与渲染图像金字塔中对应层的图像结果对应的简单两层卷积网络和上述与关联图像对应的简单两层卷积网络的操作相同且共享参数。It should be noted that performing the second convolution operation on the associated image I c' and the similarity map are respectively implemented by different simple two-layer convolution networks. Performing the second convolution operation on the image result of the corresponding layer in the rendered image pyramid (ie, I r ) is also implemented by a simple two-layer convolutional network. However, it is important to note that the simple two-layer convolutional network corresponding to the image result of the corresponding layer in the rendered image pyramid and the simple two-layer convolutional network described above corresponding to the associated image operate the same and share parameters.
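To make this dataflow concrete, a minimal PyTorch sketch follows; the hidden width, channel counts and the mpc_match interface are illustrative assumptions (a possible mpc_match is sketched further below), and the two-layer network applied to I r and to the associated image shares its parameters, as noted above.

```python
import torch
import torch.nn as nn

class CSDSubModule(nn.Module):
    """Sketch of a contextual and spatial denormalization (CSD) sub-module.
    `mpc_match(i_r, i_c, i_m)` is assumed to return (similarity_map, associated_image)
    at the same spatial size as the feature map x."""

    def __init__(self, feat_ch, mpc_match, img_ch=3, hidden=64):
        super().__init__()
        self.mpc_match = mpc_match
        self.bn = nn.BatchNorm2d(feat_ch, affine=False)            # parameter-free normalization
        self.gamma = nn.Parameter(torch.ones(1, feat_ch, 1, 1))    # first affine parameter gamma_c

        def two_layer(in_ch):                                      # "simple two-layer conv network"
            return nn.Sequential(nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(hidden, feat_ch, 3, padding=1))

        self.f_shared = two_layer(img_ch)   # applied to both I_r and I_c' (shared parameters)
        self.f_simi = two_layer(1)          # separate net for the similarity map

    def forward(self, x, i_r, i_c, i_m):
        simi_map, i_c_assoc = self.mpc_match(i_r, i_c, i_m)        # contextual association (MPC)
        alpha = self.f_simi(simi_map)                              # spatial feature of the similarity map
        beta = alpha * self.f_shared(i_r) + self.f_shared(i_c_assoc)   # second affine parameter beta
        return self.gamma * self.bn(x) + beta                      # denormalized output BN'(x)
```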

一些实施例中,请参考图5,对输入的渲染图像金字塔中对应层的图像结果、参考图像金字塔中对应层的图像结果和掩码图像金字塔中对应层的图像结果进行相似性匹配,以得到与渲染图像金字塔中对应层的图像结果的上下文关联信息,包括:In some embodiments, please refer to Figure 5 to perform similarity matching on the image results of the corresponding layer in the input rendered image pyramid, the image result of the corresponding layer in the reference image pyramid, and the image result of the corresponding layer in the mask image pyramid to obtain Contextual information related to the image results of the corresponding layer in the rendered image pyramid, including:

利用一个划分框以第一步幅在渲染图像金字塔中对应层的图像结果（即对应尺度的I r ）中移动而将渲染图像金字塔中对应层的图像结果划分为多个尺寸为k×k的渲染子块，A dividing frame is moved with the first stride over the image result of the corresponding layer in the rendered image pyramid (that is, I r at the corresponding scale) to divide that image result into multiple rendering sub-blocks of size k×k,

利用划分框以第一步幅在掩码图像金字塔中对应层的图像结果（即对应尺度的Im）中移动而将掩码图像金字塔中对应层的图像结果划分为多个尺寸为k×k的掩码子块，The dividing frame is moved with the first stride over the image result of the corresponding layer in the mask image pyramid (that is, I m at the corresponding scale) to divide that image result into multiple mask sub-blocks of size k×k,

在参考图像金字塔中对应层的图像结果（即对应尺度的I c ）中确定与渲染子块对应的搜索域，搜索域的尺寸为k’×k’，利用划分框以第二步幅在搜索域中移动而将搜索域划分为s×s个尺寸为k×k的参考子块；k、k’和s均为预设的常数；A search domain corresponding to the rendering sub-block is determined in the image result of the corresponding layer in the reference image pyramid (that is, I c at the corresponding scale); the size of the search domain is k'×k'. The dividing frame is moved with the second stride within the search domain to divide it into s×s reference sub-blocks of size k×k; k, k' and s are all preset constants;

将与渲染子块对应的掩码子块应用到与渲染子块对应的参考子块而得到与渲染子块对应的参考掩码子块;Apply the mask sub-block corresponding to the rendering sub-block to the reference sub-block corresponding to the rendering sub-block to obtain a reference mask sub-block corresponding to the rendering sub-block;

计算渲染子块和与渲染子块对应的搜索域内各参考掩码子块之间的各相似度,将各相似度中的最大值作为相似度图中与渲染子块对应的元素;Calculate each similarity between the rendering sub-block and each reference mask sub-block in the search domain corresponding to the rendering sub-block, and use the maximum value of each similarity as the element corresponding to the rendering sub-block in the similarity map;

其中,与渲染子块对应的搜索域的尺寸大于渲染子块的尺寸;Among them, the size of the search domain corresponding to the rendering sub-block is larger than the size of the rendering sub-block;

将相似度图中各元素所分别对应的参考子块作为最佳匹配子块,利用与相似度图对应的多个最佳匹配子块生成关联图像(即I c' )。The reference sub-block corresponding to each element in the similarity map is regarded as the best matching sub-block, and multiple best-matching sub-blocks corresponding to the similarity map are used to generate the associated image (ie, I c' ).

一些实施例中,当k=3,第二步幅m=1,s=3时,而搜索域的尺寸k’×k’=[k+m×(s-1)]×[k+m×(s-1)],因此上述k’=5。本领域技术人员可以根据实际需求而对上述参数k、m、s和k’进行调整。In some embodiments, when k=3, the second stride m=1, s=3, and the size of the search domain k'×k'=[k+m×(s-1)]×[k+m ×(s-1)], so the above k'=5. Those skilled in the art can adjust the above parameters k, m, s and k' according to actual needs.

一些实施例中，上述得到与渲染图像金字塔中对应层的图像结果的上下文关联信息的流程中，划分框是一个虚拟的方框。可以理解的是，可以将上述划分框看作一个卷积核，上述划分框可以按照第一步幅在渲染图像金字塔中对应层的图像结果（即对应尺度的I r ）中移动（例如从一行中的最左侧至最右侧，再从上一行至下一行，等移动方式），而将上述划分框所包围的渲染图像金字塔中对应层的图像结果的区域作为一个渲染子块，进而将渲染图像金字塔中对应层的图像结果划分为多个尺寸为k×k的渲染子块。In some embodiments, in the above process of obtaining the context-related information of the image result of the corresponding layer in the rendered image pyramid, the dividing frame is a virtual box. It can be understood that this dividing frame can be regarded as a convolution kernel: it moves with the first stride over the image result of the corresponding layer in the rendered image pyramid (that is, I r at the corresponding scale), for example from the far left of a row to the far right and then from one row to the next, and the region of that image result enclosed by the dividing frame is taken as one rendering sub-block, so that the image result of the corresponding layer in the rendered image pyramid is divided into multiple rendering sub-blocks of size k×k.

需要说明的是，本领域技术人员可以根据实际需求而对上述第一步幅n的大小进行调整，此处不对第一步幅的大小进行限制。例如，第一步幅n=1。It should be noted that those skilled in the art can adjust the size of the above first stride n according to actual needs, and the size of the first stride is not limited here. For example, the first stride n=1.

需要说明的是，上述得到与渲染图像金字塔中对应层的图像结果的上下文关联信息的流程中“将与渲染子块对应的掩码子块应用到与渲染子块对应的参考子块而得到与渲染子块对应的参考掩码子块”指的是利用对应的掩码子块去遮挡对应的参考子块进而生成带有掩码的参考子块（即上述参考掩码子块），然后再计算渲染子块与该渲染子块的搜索域内的多个参考掩码子块之间的相似度。It should be noted that, in the above process of obtaining the context-related information of the image result of the corresponding layer in the rendered image pyramid, "applying the mask sub-block corresponding to the rendering sub-block to the reference sub-block corresponding to the rendering sub-block to obtain the reference mask sub-block corresponding to the rendering sub-block" means using the corresponding mask sub-block to occlude the corresponding reference sub-block so as to generate a masked reference sub-block (that is, the above-mentioned reference mask sub-block), and then computing the similarities between the rendering sub-block and the multiple reference mask sub-blocks within its search domain.

如前文所述,从上述全局面元模型获取的新视角场景渲染图像并不完整,但在同一采集视角存在一张或多张与新视角场景渲染图像对应的原始真实图像。一些实施例中,定义上述I m ∈LH×W为二进制掩码图像(即上述掩码图像),其中H和W分别为新视角场景渲染图像对应的掩码图像的高度和宽度。请参考图5,参考图像I c ,即条件样本,是由相机(如前述立体摄像机)采集的原始图像(即前述立体相机图像)。参考图像I c 来自新视角场景渲染图像I r 的邻近视点。参考图像I c 为新视角场景渲染图像I r 的补全提供场景上下文信息。新视角场景渲染图像I r 和对应的参考图像I c 之间的相关性计算方法与卷积类似,将输入的新视角场景渲染图像I r 划分成多个尺寸为k×k个渲染子块,其掩码图像相应地划分为相同的一组掩码子块。新视角场景渲染图像I r 的每个渲染子块在条件样本中对应的搜索域中作为一个卷积核来操作。搜索域的尺寸大于渲染子块的尺寸。在进行匹配之前,所有的参考子块都要被赋予相应的二进制掩码(即掩码子块),以提高上述相关性计算的匹配精度。As mentioned above, the new perspective scene rendering image obtained from the above global element model is incomplete, but there are one or more original real images corresponding to the new perspective scene rendering image at the same acquisition perspective. In some embodiments, the above Im L H × W is defined as a binary mask image (ie, the above mask image), where H and W are respectively the height and width of the mask image corresponding to the new perspective scene rendering image. Please refer to Figure 5. The reference image I c , that is, the condition sample, is the original image (ie, the aforementioned stereo camera image) collected by the camera (such as the aforementioned stereo camera). The reference image I c comes from the neighboring viewpoint of the new perspective scene rendering image I r . The reference image I c provides scene context information for the completion of the new perspective scene rendering image I r . The correlation calculation method between the new perspective scene rendering image I r and the corresponding reference image I c is similar to convolution. The input new perspective scene rendering image I r is divided into multiple rendering sub-blocks of size k×k. Its mask image is divided accordingly into the same set of mask sub-blocks. Each rendering sub-block of the new perspective scene rendering image I r operates as a convolution kernel in the corresponding search domain in the conditional sample. The size of the search domain is larger than the size of the rendering subchunk. Before matching, all reference sub-blocks must be assigned corresponding binary masks (ie, mask sub-blocks) to improve the matching accuracy of the above correlation calculation.

需要说明的是，上述搜索域的尺寸大于渲染子块的尺寸的原因是：由于“新视角场景渲染图像I r ”与“原视角”在视角上存在偏差，因此需要在做上下文匹配的时候搜索更大的区域，才能匹配到对应的上下文信息。It should be noted that the reason why the size of the above search domain is larger than the size of the rendering sub-block is that there is a deviation in viewpoint between the "new perspective scene rendering image I r " and the "original perspective"; a larger area therefore needs to be searched during context matching in order to find the corresponding context information.

上述相似度图的表达式为:The expression of the above similarity graph is:

simi(i) = max j∈{1,…,s×s} cos( P r (i) , P c (j) ⊙ P m (i) )；

其中，simi(i)表示所述渲染图像金字塔中对应层的图像结果中第i个渲染子块P r (i)与参考图像金字塔中对应层的图像结果（即对应尺度的I c ）中与第i个渲染子块对应的搜索域内s×s个参考掩码子块P c (j)⊙P m (i)之间的各相似度的最大值，cos(·,·)为求取余弦相似度的操作，⊙表示逐元素相乘，此处的i的取值范围由新视角场景渲染图像的尺寸与第一步幅共同决定（即渲染子块的总数），其中，H和W分别为新视角场景渲染图像的高度和宽度，n为上述第一步幅；I m 表示掩码图像，I c 表示参考图像，I r 表示新视角场景渲染图像，j表示搜索域内参考子块P c (j)的索引，P m (i)为掩码子块。Here, simi(i) is the maximum of the similarities between the i-th rendering sub-block P r (i) in the image result of the corresponding layer in the rendered image pyramid and the s×s reference mask sub-blocks P c (j)⊙P m (i) in the search domain corresponding to the i-th rendering sub-block in the image result of the corresponding layer in the reference image pyramid (that is, I c at the corresponding scale); cos(·,·) is the cosine similarity operation and ⊙ denotes element-wise multiplication. The range of i is determined jointly by the size of the new perspective scene rendering image and the first stride (that is, by the total number of rendering sub-blocks), where H and W are respectively the height and width of the new perspective scene rendering image and n is the above first stride; I m denotes the mask image, I c the reference image and I r the new perspective scene rendering image, j indexes the reference sub-blocks P c (j) within the search domain, and P m (i) is the mask sub-block.
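Solely to illustrate this masked patch-correlation step (variable names and the single-channel simplification are ours, not from the text), a NumPy sketch follows; it uses the example parameters k=3, n=1, m=1 and s=3 given above.

```python
import numpy as np

def mpc_match(i_r, i_c, i_m, k=3, n=1, s=3, m=1):
    """Illustrative sketch of the masked patch correlation (MPC) described above,
    for single-channel images. k: patch size, n: first stride, s: search grid,
    m: second stride; the search window size is k' = k + m*(s-1)."""
    H, W = i_r.shape
    pad = (k + m * (s - 1)) // 2          # half of the k' x k' search window
    off = (m * (s - 1)) // 2              # centre the s x s candidates on the patch
    i_c_pad = np.pad(i_c, pad, mode='edge')

    simi, best = [], []
    for y in range(0, H - k + 1, n):
        for x in range(0, W - k + 1, n):
            p_r = i_r[y:y + k, x:x + k].ravel()
            p_m = i_m[y:y + k, x:x + k].ravel()            # binary mask sub-block
            scores, patches = [], []
            for dy in range(s):
                for dx in range(s):
                    ys, xs = y + pad - off + dy * m, x + pad - off + dx * m
                    p_c = i_c_pad[ys:ys + k, xs:xs + k].ravel()
                    p_cm = p_c * p_m                       # masked reference sub-block
                    denom = np.linalg.norm(p_r) * np.linalg.norm(p_cm) + 1e-8
                    scores.append(float(p_r @ p_cm) / denom)   # cosine similarity
                    patches.append(p_c)
            j = int(np.argmax(scores))
            simi.append(scores[j])                         # one element of the similarity map
            best.append(patches[j])                        # best-matching (unmasked) reference sub-block
    return np.array(simi), best
```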

上述归一化参数的表达式为：γ c ·(x i −μ c (x i ))/σ c (x i )，其中，x i 为输入第i级的CSD残差模块中CSD子模块的一批特征图（如N个特征图），x i ∈R N×C×H×W ，C为相应的特征通道的数量；例如，x i 可以指输入第i级CSD残差模块中第一级的CSD子模块和第三级的CSD子模块的特征图；μ c (x i )和σ c (x i )分别是输入第i级的CSD残差模块中CSD子模块的N个特征图的均值和标准差，γ c 为第一仿射参数，γ c ∈R C ；调制仿射参数包括第一仿射参数γ c 和第二仿射参数β c,h,w 。The expression of the above normalization parameter is γ c ·(x i −μ c (x i ))/σ c (x i ), where x i is a batch of feature maps (e.g., N feature maps) input to the CSD sub-modules of the i-th level CSD residual module, x i ∈R N×C×H×W , and C is the number of corresponding feature channels; for example, x i can refer to the feature maps input to the first-level and third-level CSD sub-modules of the i-th level CSD residual module; μ c (x i ) and σ c (x i ) are respectively the mean and standard deviation of the N feature maps input to the CSD sub-modules of the i-th level CSD residual module, γ c is the first affine parameter, γ c ∈R C ; the modulated affine parameters include the first affine parameter γ c and the second affine parameter β c,h,w . 其中，均值和标准差的表达式分别为：Among them, the expressions of the mean and standard deviation are respectively:

μ c (x i ) = (1/(N·H·W))·Σ n,h,w x i(n,c,h,w) ，

σ c (x i ) = sqrt( (1/(N·H·W))·Σ n,h,w ( x i(n,c,h,w) − μ c (x i ) )² + ε )；其中，此处的n∈N，c∈C，h∈H，w∈W，ε为预设的系数，该预设的系数的设置属于本领域的公知常识。Here n∈N, c∈C, h∈H, w∈W, and ε is a preset coefficient whose setting is common knowledge in this field.

一些实施例中，本领域技术人员可以根据实际需求而灵活地设定上述搜索域的尺寸（即大小），例如，搜索域的大小可以为5x5，此时搜索域为以参考图像I c 中与该渲染子块对应的参考掩码子块为中心，边长为5的区域。通常情况下，上述H和W的数值相同。In some embodiments, those skilled in the art can flexibly set the size of the above search domain according to actual needs. For example, the size of the search domain can be 5x5, in which case the search domain is the region of side length 5 in the reference image I c centered on the reference mask sub-block corresponding to the rendering sub-block. Usually, the above values of H and W are the same.

需要说明的是，在通过上述相似度图的表达式计算相似度时，为了提高相似度计算的精度，需要将上述掩码图像对应的掩码子块应用到参考图像中搜索域内的各参考子块而得到各参考子块对应的参考掩码子块（即，上述P c (j)⊙P m (i)表示将第i个掩码子块分别赋予参考图像I c 中与第i个渲染子块对应的搜索域内各参考子块而得到的对应的搜索域内各参考掩码子块），之后再计算上述各参考掩码子块与对应的渲染子块的相似度。It should be noted that, when computing similarities through the above expression of the similarity map, in order to improve the accuracy of the similarity calculation, the mask sub-blocks corresponding to the above mask image need to be applied to the reference sub-blocks within the search domain in the reference image to obtain the corresponding reference mask sub-blocks (that is, P c (j)⊙P m (i) above denotes the reference mask sub-blocks obtained by applying the i-th mask sub-block to each reference sub-block within the search domain corresponding to the i-th rendering sub-block in the reference image I c ), after which the similarity between each of these reference mask sub-blocks and the corresponding rendering sub-block is computed.

一些实施例中，可以记录最佳匹配子块的索引。最佳匹配子块是指相似度图中各元素所分别对应的参考子块，即，渲染子块对应的搜索域中渲染子块与多个参考掩码子块之间的各相似度的最大值所对应的参考子块。上述索引用于标记最佳匹配子块在上述渲染子块对应的搜索域中对应的位置。对于新视角场景渲染图像I r 中缺失信息的较大区域，由于该区域对应的参考子块完全被掩码子块填充，因而上述渲染子块与多个参考掩码子块之间的各相似度中的一个或者多个可能为零，进而导致后续不正确的关联。为了缓解上述这一技术问题，一些实施例中，在逐个确定相似图中的各元素（请参考图5中的元素S1至元素S9）的所对应的最佳匹配子块的索引时，假设一个渲染子块对应的搜索域中含有9个参考掩码子块，请参考图6，a1至a9可以分别看作该渲染子块与渲染子块对应的搜索域内的9个参考掩码子块的相似度值，本申请采用非零索引所对应的相似度值的平均值代替该零值索引的相似度值。例如，若计算出的a1至a9全部是0（其中，a1至a9所分别对应的索引为0至8），此时零值索引分别为0至8，那么该渲染子块和与渲染子块对应的搜索域内各参考掩码子块之间的各相似度的最大值就是0，此种情况下默认该渲染子块的最佳匹配子块是该搜索域内的第0个参考掩码子块所对应的参考子块，即a1，但是a1所对应的参考子块事实上并非最佳匹配子块。因此，本申请采用零值索引周围的某几个非零索引（如非零索引可以为零值索引的上方、下方、左侧和右侧的4个临近的索引）所对应的相似度值来计算出平均值，并使用该平均值来代替零值索引所对应的相似度值，之后，再根据与该渲染子块对应的搜索域内各参考掩码子块之间的各相似度的最大值的索引找出最佳匹配子块。例如，计算得到的最佳匹配子块的索引是5，而该索引值5指的是a6，即，该渲染子块的最佳匹配子块是该渲染子块对应的搜索域内a6所对应的参考掩码子块对应的参考子块。本领域技术人员可以根据实际需求而对最佳匹配子块的具体确定过程进行适应性地调整，如灵活设置零值索引周围的非零索引的具体数量等。In some embodiments, the index of the best matching sub-block can be recorded. The best matching sub-block refers to the reference sub-block corresponding to each element of the similarity map, that is, the reference sub-block corresponding to the maximum of the similarities between the rendering sub-block and the multiple reference mask sub-blocks in the search domain of that rendering sub-block. The above index marks the position of the best matching sub-block within the search domain corresponding to the rendering sub-block. For a large area with missing information in the new perspective scene rendering image I r , since the corresponding reference sub-blocks are completely covered by the mask sub-blocks, one or more of the similarities between the rendering sub-block and the reference mask sub-blocks may be zero, which leads to incorrect associations later on. To alleviate this technical problem, in some embodiments, when determining the index of the best matching sub-block corresponding to each element of the similarity map one by one (please refer to elements S1 to S9 in Figure 5), assume that the search domain corresponding to a rendering sub-block contains 9 reference mask sub-blocks; referring to Figure 6, a1 to a9 can be regarded as the similarity values between the rendering sub-block and the 9 reference mask sub-blocks in its search domain, and this application replaces the similarity value of a zero-valued index with the average of the similarity values corresponding to non-zero indices. For example, if the computed a1 to a9 are all 0 (where a1 to a9 correspond to indices 0 to 8, so the zero-valued indices are 0 to 8), then the maximum of the similarities between the rendering sub-block and the reference mask sub-blocks in its search domain is 0; in this case the best matching sub-block would by default be the reference sub-block corresponding to the 0-th reference mask sub-block in the search domain, namely a1, but the reference sub-block corresponding to a1 is in fact not the best matching sub-block. Therefore, this application takes the similarity values of several non-zero indices around the zero-valued index (for example, the 4 neighbouring indices above, below, to the left and to the right of the zero-valued index), computes their average, uses this average in place of the similarity value of the zero-valued index, and then finds the best matching sub-block according to the index of the maximum of the similarities between the rendering sub-block and the reference mask sub-blocks in its search domain. For example, if the computed index of the best matching sub-block is 5, this index value 5 refers to a6; that is, the best matching sub-block of the rendering sub-block is the reference sub-block corresponding to the reference mask sub-block a6 in the search domain of that rendering sub-block. Those skilled in the art can adapt the specific procedure for determining the best matching sub-block to actual needs, for example by flexibly setting the number of non-zero indices around the zero-valued index.

一些实施例中，上述得到与渲染图像金字塔中对应层的图像结果的上下文关联信息的流程中“将相似度图中各元素所分别对应的参考子块作为最佳匹配子块，利用与相似度图对应的多个最佳匹配子块生成关联图像”的流程可以为：第一步，通过上述索引获取上述相似度图中各元素所分别对应的最佳匹配子块，其中，这些最佳匹配子块都是相互独立的；第二步，需要把这些最佳匹配子块再拼接成一副图；其中，具体的拼接方式与反卷积的方式相同，故此处可以将最佳匹配子块称为“反卷积滤波器”；第三步，在拼回一幅图的过程中会出现部分重叠的区域，对这些部分重叠的区域中分别对应的多个像素值求平均，并将计算得到的平均值作为上述部分重叠的区域内对应像素点的数值，进而最终得到上述参考图像I c 对应的关联图像I c' 。In some embodiments, in the above process of obtaining the context-related information of the image result of the corresponding layer in the rendered image pyramid, the step of "taking the reference sub-block corresponding to each element of the similarity map as the best matching sub-block, and using the multiple best matching sub-blocks corresponding to the similarity map to generate the associated image" may proceed as follows. First, the best matching sub-block corresponding to each element of the similarity map is obtained through the above indices; these best matching sub-blocks are independent of each other. Second, the best matching sub-blocks need to be stitched back into one image; the specific stitching method is the same as deconvolution, so the best matching sub-blocks can here be called "deconvolution filters". Third, partially overlapping regions appear in the process of piecing the image back together; the multiple pixel values corresponding to each position in these partially overlapping regions are averaged, and the computed average is used as the value of the corresponding pixel in the overlapping region, so that the associated image I c' corresponding to the above reference image I c is finally obtained.

需要说明的是,采用反卷积的方式将上述最佳匹配子块拼接成一副图的具体过程属于本领域的现有技术,故此处不再对该具体过程进行赘述。It should be noted that the specific process of splicing the above-mentioned best matching sub-blocks into a picture using deconvolution belongs to the existing technology in this field, so the specific process will not be described again here.
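The zero-index repair and the overlap-averaged stitching described above could look roughly like the following sketch; the 3×3 grid and the 4-neighbour rule follow the example in the text, while everything else is an assumption.

```python
import numpy as np

def repair_zero_scores(scores, grid=3):
    """Sketch of the zero-index repair described above: when a candidate's similarity
    is zero (large missing region), replace it by the average of its non-zero
    4-neighbours in the s x s score grid before taking the argmax."""
    g = np.asarray(scores, dtype=float).reshape(grid, grid)
    repaired = g.copy()
    for y in range(grid):
        for x in range(grid):
            if g[y, x] == 0.0:
                neigh = [g[yy, xx] for yy, xx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if 0 <= yy < grid and 0 <= xx < grid and g[yy, xx] != 0.0]
                if neigh:
                    repaired[y, x] = float(np.mean(neigh))
    return repaired.ravel()

def stitch_patches(best_patches, H, W, k=3, n=1):
    """Sketch of assembling the associated image I_c' from the best-matching patches:
    overlapping pixels are averaged, as described above."""
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    idx = 0
    for y in range(0, H - k + 1, n):
        for x in range(0, W - k + 1, n):
            acc[y:y + k, x:x + k] += best_patches[idx].reshape(k, k)
            cnt[y:y + k, x:x + k] += 1
            idx += 1
    return acc / np.maximum(cnt, 1)
```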

一些实施例中,请参考图3,上述CSD残差模块包括第一级的CSD子模块、第二级的CSD子模块和第三级的CSD子模块。In some embodiments, please refer to Figure 3. The above-mentioned CSD residual module includes a first-level CSD sub-module, a second-level CSD sub-module and a third-level CSD sub-module.

在给定场景上下文信息（即，上述条件样本）的情况下，本申请通过在上述第一级的CSD子模块、第二级的CSD子模块和第三级的CSD子模块中归一化层中加入上下文关联信息（即，上述对应的关联图像和对应的相似图），以用于补全上述新视角场景渲染图像中缺失信息的区域。Given the scene context information (that is, the above-mentioned conditional sample), this application adds context-related information (that is, the corresponding associated image and the corresponding similarity map) into the normalization layers of the above first-level, second-level and third-level CSD sub-modules, so as to complete the areas with missing information in the new perspective scene rendering image.

需要说明的是,上述给定场景上下文信息和上下文关联信息能够提供上下文匹配信息,以使得对抗生成网络模型计算得出新视角场景渲染图像中缺失信息的具体区域。It should be noted that the above-mentioned given scene context information and context-related information can provide context matching information, so that the adversarial generative network model can calculate the specific areas with missing information in the new perspective scene rendering image.

一些实施例中,请参考图4,在上述归一化层中加入上下文关联信息的灵感来自于条件批归一化(BatchNormalization,BN),其激活层被归一化为零均值和单位偏差,然后通过调整来自外部数据的仿射参数进行反规范化。与尺度对应的调制仿射参数可以用来控制对抗生成网络模型的全局输出,同时保留图像内容的空间结构,自适应实例归一化(AdaIN)也证明了这一点。需要说明的是,全局输出可以包括对抗生成网络模型所最终输出图片(如完整新视角场景图像序列)的内容和风格。本申请提出的各CSD子模块(即,上下文和空间非规范化子模块)中各归一化层利用空间变化的仿射变换使其适合于图像合成。与批量归一化(BN)相似,本申请的各CSD子模块对每个单独的特征通道(n∈N,c∈C,h∈H,w∈W)的均值和标准差进行归一化。其中,上述归一化结果的表达式为:In some embodiments, please refer to Figure 4. The inspiration for adding context-related information to the above-mentioned normalization layer comes from conditional batch normalization (BatchNormalization, BN), whose activation layer is normalized to zero mean and unit deviation. Denormalization is then performed by adjusting affine parameters from external data. Modulated affine parameters corresponding to the scale can be used to control the global output of the adversarial generative network model while preserving the spatial structure of the image content, as also demonstrated by adaptive instance normalization (AdaIN). It should be noted that the global output can include the content and style of the final output image of the adversarial generative network model (such as a complete new perspective scene image sequence). Each normalization layer in each CSD sub-module (ie, context and spatial denormalization sub-module) proposed in this application utilizes spatially varying affine transformations to make it suitable for image synthesis. Similar to batch normalization (BN), each CSD sub-module of this application normalizes the mean and standard deviation of each individual feature channel (n∈N, c∈C, h∈H, w∈W) . Among them, the expression of the above normalized result is:

BN'(x i ) = γ c ·(x i −μ c (x i ))/σ c (x i ) + β c,h,w 。上述归一化结果作为各CSD子模块的最终输出。The above normalized result BN'(x i ) is used as the final output of each CSD sub-module.

需要说明的是，每个CSD残差模块中所有归一化层的调制仿射参数（即上述与不同尺度分别对应的调制仿射参数）均是从各CSD子模块学习到的。本申请将新视角场景渲染图像I r 和参考图像I c 输入到各CSD子模块的归一化层，而不是整个对抗生成网络模型的第一层网络，这是由于各CSD子模块比普通归一化层能更好地保存输入信息。事实上，各CSD子模块学习到的调制仿射参数已经编码了足够的输入图像信息，因此本申请抛弃了通常在编-解码生成对抗网络中使用的生成器的编码部分。It should be noted that the modulated affine parameters of all normalization layers in each CSD residual module (that is, the above modulated affine parameters corresponding to the different scales) are learned from the respective CSD sub-modules. This application feeds the new perspective scene rendering image I r and the reference image I c into the normalization layers of the CSD sub-modules rather than into the first network layer of the entire adversarial generative network model, because a CSD sub-module preserves the input information better than an ordinary normalization layer. In fact, the modulated affine parameters learned by the CSD sub-modules already encode sufficient input image information, so this application discards the encoding part of the generator that is usually used in encoder-decoder generative adversarial networks.

该对抗生成网络模型的生成器由几个带有上采样层的CSD残差模块(即CSD-ResBlk)组成,每个CSD残差模块的输出都是相对应维度的特征。The generator of this adversarial generative network model consists of several CSD residual modules (i.e., CSD-ResBlk) with upsampling layers. The output of each CSD residual module is the feature of the corresponding dimension.

需要说明的是,上述归一化层具体对应图4中的符号“BatchNorm”。It should be noted that the above-mentioned normalization layer specifically corresponds to the symbol "BatchNorm" in Figure 4.

本申请首先对输入的新视角场景渲染图像、参考图像和对应的掩码图像进行下采样，建立各自的图像金字塔，然后通过对每个CSD残差模块进行上下文信息相关，在不同尺度下匹配空间分辨率，通过设置不同的子块（如渲染子块和参考掩码子块）和不同的搜索域来灵活调整匹配范围。This application first downsamples the input new perspective scene rendering image, the reference image and the corresponding mask image to build their respective image pyramids, then performs contextual information correlation for each CSD residual module so as to match the spatial resolution at different scales, and flexibly adjusts the matching range by setting different sub-blocks (such as rendering sub-blocks and reference mask sub-blocks) and different search domains.

需要说明的是，上述“通过对每个CSD残差模块进行上下文信息相关，在不同尺度下匹配空间分辨率，通过设置不同的子块（如渲染子块和参考掩码子块）和不同的搜索域来灵活调整匹配范围”操作的原因是可以灵活设置上述子块和搜索域的大小。比如，当图片输入的尺度缩小一倍之后，而上述子块和搜索域的大小不变，因此，对应的CSD残差模块所匹配的感受野就增大一倍。而本申请希望可以尽可能扩大搜索域，这样可以处理视角偏差较大的场景（如转弯场景），但此种场景的计算量很大，因此本申请通过不同尺度的输入来灵活调整。上述操作的大致步骤为：第一步，将输入图片（即新视角场景渲染图像、参考图像和掩码图像）缩放至不同尺度而得到不同尺度的图片（即上述渲染图像金字塔中对应层的图像结果、参考图像金字塔中对应层的图像结果和掩码图像金字塔中对应层的图像结果）；第二步，将上述不同尺度的图片输入到对应的CSD子模块中以通过上述掩膜图像表征关联机制（MPC）进行上下文关联匹配；第三步，将获得的匹配信息通过各CSD子模块输送到对抗生成网络模型的整个网络。It should be noted that the reason for the above operation of "performing contextual information correlation for each CSD residual module, matching the spatial resolution at different scales, and flexibly adjusting the matching range by setting different sub-blocks (such as rendering sub-blocks and reference mask sub-blocks) and different search domains" is that the sizes of the sub-blocks and of the search domain can be set flexibly. For example, when the scale of the input picture is halved while the sizes of the sub-blocks and the search domain remain unchanged, the receptive field matched by the corresponding CSD residual module is doubled. This application wishes to enlarge the search domain as much as possible so that scenes with large viewpoint deviations (such as turning scenes) can be handled, but such scenes are computationally expensive, so this application adjusts flexibly through inputs at different scales. The general steps of the above operation are: first, the input pictures (that is, the new perspective scene rendering image, the reference image and the mask image) are scaled to different scales to obtain pictures at different scales (that is, the image results of the corresponding layers in the rendered image pyramid, the reference image pyramid and the mask image pyramid); second, the pictures at different scales are input into the corresponding CSD sub-modules for context-related matching through the above mask image representation association mechanism (MPC); third, the obtained matching information is delivered through the CSD sub-modules to the entire network of the adversarial generative network model.

一些实施例中，本申请的对抗生成网络模型与pix2pixHD类似，使用多尺度鉴别器/判别器训练生成器。即，本申请的对抗生成网络模型还可以包括多个判别器和VGG卷积网络子模型。其中，pix2pixHD是pix2pix的重要升级，可以实现高分辨率图像生成和图片的语义编辑。对于一个生成对抗网络（GAN），学习的关键就是理解生成器、判别器和损失函数这三部分。pix2pixHD的生成器和判别器都是多尺度的，单一尺度的生成器和判别器的结构和现有的pix2pix是一样的。pix2pix是一种基于条件式生成对抗网络（CGAN）的图像转译模型，而条件式生成对抗网络是生成对抗网络的一种扩展，它通过在生成器和判别器中引入条件信息来实现有条件的图像生成。pix2pix的生成器采用U-Net网络结构，融合底层细粒度特征和高层抽象；判别器采用patchGAN网络结构，在图块尺度提取纹理等高频信息。In some embodiments, the adversarial generative network model of this application is similar to pix2pixHD and uses multi-scale discriminators to train the generator. That is, the adversarial generative network model of this application may also include multiple discriminators and a VGG convolutional network sub-model. pix2pixHD is an important upgrade of pix2pix that enables high-resolution image generation and semantic editing of pictures. For a generative adversarial network (GAN), the key to learning is to understand its three parts: the generator, the discriminator and the loss function. The generator and discriminator of pix2pixHD are both multi-scale, and the structure of the single-scale generator and discriminator is the same as in the existing pix2pix. pix2pix is an image translation model based on the conditional generative adversarial network (CGAN), and the conditional generative adversarial network is an extension of the generative adversarial network that achieves conditional image generation by introducing conditional information into the generator and the discriminator. The generator of pix2pix adopts a U-Net structure that fuses low-level fine-grained features with high-level abstractions; the discriminator adopts a patchGAN structure that extracts high-frequency information such as texture at the patch scale.

此外,本申请在判别器中引入了相同的特征匹配损失,以改善对抗损失并稳定训练;与VGG的损失函数类似,感知损失L VGG被联合使用以进一步提高性能。因此,本申请的对抗生成网络模型的总目标损失函数为:In addition, this application introduces the same feature matching loss in the discriminator to improve the adversarial loss and stabilize training; similar to the loss function of VGG, the perceptual loss L VGG is jointly used to further improve performance. Therefore, the overall target loss function of the adversarial generative network model of this application is:

λ 1 ·L GAN + λ 2 ·L FM + λ 3 ·L VGG ；

其中，λ1=1，λ2=10，λ3=10；L GAN表示对抗损失，主要控制图片输出的真假；L FM表示特征损失，主要控制输出图片在鉴别器/判别器中的特征一致；L VGG表示感知损失，主要控制VGG卷积网络子模型中特征的一致性。Here, λ1=1, λ2=10, λ3=10; L GAN denotes the adversarial loss, which mainly controls whether the output picture looks real or fake; L FM denotes the feature matching loss, which mainly keeps the features of the output picture consistent in the discriminator; L VGG denotes the perceptual loss, which mainly controls the consistency of features in the VGG convolutional network sub-model.

需要说明的是,对抗损失L GAN、特征损失L FM和感知损失L VGG的具体获取过程属于本领域的现有技术。本领域技术人员可以参考现有的pix2pixHD模型等的具体结构和训练方式。此处不再对本申请的对抗生成网络模型的具体训练过程进行赘述。It should be noted that the specific acquisition processes of the adversarial loss LGAN , the feature loss LFM and the perceptual loss LVGG belong to the existing technology in this field. Those skilled in the art can refer to the specific structure and training methods of the existing pix2pixHD model. The specific training process of the adversarial generative network model of this application will not be described again here.

需要说明的是,特征损失L FM和感知损失L VGG这两个损失函数分别来自判别器和预训练的VGG-19模型的层特征映射:It should be noted that the two loss functions, feature loss L FM and perceptual loss L VGG , come from the layer feature mapping of the discriminator and the pre-trained VGG-19 model respectively:

L FM = Σ j=1…T (1/C j )·‖ f (j) Di (I g ) − f (j) Di (G(·)) ‖ 1 ；

L VGG = Σ j=1…T' (1/C j )·‖ f (j) VGG (I g ) − f (j) VGG (G(·)) ‖ 1 ；其中，G(·)表示生成器输出的完整新视角场景图像。Here, G(·) denotes the complete new perspective scene image output by the generator.

其中，T表示判别器D i 的网络层数，T'表示VGG卷积网络子模型的网络层数，f (j) Di 表示判别器D i 的第j层特征，f (j) VGG表示VGG卷积网络子模型的第j层特征。G是指生成器（Generator），D i 是指第i个判别器（Discriminator）。生成器是一个神经网络模型；D是采用与现有pix2pixHD中判别器类似的神经网络模型。I g 表示采集的原图像（即新视角场景渲染图像I r 所对应的立体相机图像），该原图像作为真值；C表示判别器D i 或VGG卷积网络子模型中相应的特征通道的数量，C j 为判别器D i 或VGG卷积网络子模型中第j层网络的特征通道数。Here, T is the number of network layers of the discriminator D i , T' is the number of network layers of the VGG convolutional network sub-model, f (j) Di denotes the j-th layer features of the discriminator D i , and f (j) VGG denotes the j-th layer features of the VGG convolutional network sub-model. G refers to the generator and D i to the i-th discriminator. The generator is a neural network model; D is a neural network model similar to the discriminator in the existing pix2pixHD. I g denotes the captured original image (that is, the stereo camera image corresponding to the new perspective scene rendering image I r ), which serves as the ground truth; C denotes the number of corresponding feature channels in the discriminator D i or the VGG convolutional network sub-model, and C j is the number of feature channels of the j-th layer of the discriminator D i or the VGG convolutional network sub-model.
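A rough sketch of how these loss terms might be combined is given below; the per-layer features of the discriminators and of VGG-19, as well as the generator output, are assumed to be computed elsewhere, and weighting each layer by the inverse of its channel count C j follows the reconstruction above.

```python
import torch

def total_loss(loss_gan, feats_d_real, feats_d_fake, feats_vgg_real, feats_vgg_fake,
               lambda1=1.0, lambda2=10.0, lambda3=10.0):
    """Sketch of the overall objective: adversarial loss plus feature-matching and
    perceptual terms, each an L1 distance between per-layer features of the real
    image I_g and the generated image, weighted by 1/C_j per layer."""
    l_fm = sum(torch.nn.functional.l1_loss(fr, ff) / fr.shape[1]
               for fr, ff in zip(feats_d_real, feats_d_fake))       # discriminator features
    l_vgg = sum(torch.nn.functional.l1_loss(fr, ff) / fr.shape[1]
                for fr, ff in zip(feats_vgg_real, feats_vgg_fake))  # VGG-19 features
    return lambda1 * loss_gan + lambda2 * l_fm + lambda3 * l_vgg
```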

可以看出，本申请所提出的重建与补全方法采用的是一种基于数据驱动的自动驾驶场景合成框架，主要包含三维面元模型（如上述全局面元模型）的重建和图像补全网络（即对抗生成网络模型）。本重建与补全方法利用由立体摄像机对自动驾驶场景进行拍摄而得到的立体相机图像，并获得立体相机图像的深度图和语义图，基于该语义图对该深度图进行优化和滤波而得到优化后的深度图，从而可以有效地重建上述三维面元模型。It can be seen that the reconstruction and completion method proposed in this application adopts a data-driven autonomous driving scene synthesis framework, which mainly comprises the reconstruction of a three-dimensional surface element model (such as the above global surface element model) and an image completion network (that is, the adversarial generative network model). The method uses the stereo camera images obtained by photographing the autonomous driving scene with a stereo camera, obtains the depth map and semantic map of the stereo camera images, and optimizes and filters the depth map based on the semantic map to obtain an optimized depth map, with which the above three-dimensional surface element model can be reconstructed effectively.

一些实施例中,该三维面元模型采用自适应渲染机制,能够以合理的质量呈现新视角图像。In some embodiments, the three-dimensional surface element model uses an adaptive rendering mechanism to present new perspective images with reasonable quality.

本申请所提出的图像补全网络可以利用所提出的掩膜-图像匹配技术(即上述掩膜图像表征关联机制MPC)从场景上下文中找到最近似的信息。The image completion network proposed in this application can use the proposed mask-image matching technology (ie, the above-mentioned mask image representation association mechanism MPC) to find the most approximate information from the scene context.

本申请所提出的CSD子模块可以增强合成图像序列（即上述完整新视角场景图像序列）的内容一致性。实验结果表明，本申请所提出的重建与补全方法在图像质量方面高于现有最先进的方法约5%~73%的精度，在视频质量方面高出45%~162%，在2D检测方面在指标mAP@0.5下高出约6%~24%，在定位精度方面高出38%~41%。其中，mAP@0.5表示在IOU阈值0.5下计算的mAP。mAP（mean Average Precision）在机器学习中的目标检测领域是十分重要的衡量指标，用于衡量目标检测算法的性能。The CSD sub-module proposed in this application can enhance the content consistency of the synthesized image sequence (that is, the above complete new perspective scene image sequence). Experimental results show that, compared with the existing state-of-the-art methods, the reconstruction and completion method proposed in this application is about 5% to 73% more accurate in terms of image quality, 45% to 162% higher in terms of video quality, about 6% to 24% higher in 2D detection under the mAP@0.5 metric, and 38% to 41% higher in positioning accuracy. Here, mAP@0.5 means the mAP computed at an IOU threshold of 0.5; mAP (mean Average Precision) is a very important metric in the object detection field of machine learning and is used to measure the performance of object detection algorithms.

As can be seen, the reconstruction and completion method proposed in this application can be used to reconstruct static scene sequences for autonomous driving: the semantic information in the semantic map is used to optimize the depth map to obtain an optimized depth map, the optimized depth map is used to build a global surfel model with an adaptive rendering representation, and new-view scene rendering images are obtained from this global surfel model.

As can also be seen, in order to maintain the continuity of the spatial content of the generated images (i.e. the complete new-view scene image sequence), the reconstruction and completion method proposed in this application provides a mask-image matching mechanism, which can match, from the collected scene data, the contextual approximate information that is most similar to the rendered image (i.e. the new-view scene rendering image described above).

The matched contextual approximate information is fed, through the CSD sub-module, into the normalization layer of the feature domain (i.e. each of the aforementioned CSD sub-modules). Here, "feeding the matched contextual approximate information into the normalization layer of the feature domain" refers to passing it through a convolution layer and a normalization operation; it merely describes an information flow. The CSD sub-module can learn the prior similarity information between the rendered image (i.e. the new-view scene rendering image) and the reference image, such as the modulated affine parameters described above, and propagate it through the entire network of the adversarial generative network model.
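As an illustration of this information flow, the sketch below shows a spatially modulated normalization layer in the spirit of the CSD sub-module: the incoming feature map is normalized and then rescaled and shifted with affine parameters predicted from the matched context. The kernel sizes, channel counts and use of instance normalization are assumptions of the sketch, not the patented formulation.

```python
import torch
import torch.nn as nn

class ContextModulatedNorm(nn.Module):
    def __init__(self, feat_channels, ctx_channels, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(ctx_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, x, context):
        # x: (N, C, H, W) feature map entering the sub-module
        # context: matched contextual information resized to (N, C_ctx, H, W)
        h = self.shared(context)
        gamma = self.to_gamma(h)   # scale, playing the role of one affine parameter
        beta = self.to_beta(h)     # shift, playing the role of the other
        return self.norm(x) * (1.0 + gamma) + beta
```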

The above are some explanations of a reconstruction and completion method for autonomous driving scenes. Referring to Figure 8, some embodiments of this application also disclose a reconstruction and completion system for autonomous driving scenes, comprising:

a stereo camera 100, configured to acquire stereo camera images of the autonomous driving scene;

a semantic depth generation module 200, configured to generate a depth map and a semantic map of the stereo camera image;

a semantic depth enhancement module 300, configured to optimize the depth map using the semantic map to obtain an optimized depth map;

a surfel model building module 400, configured to build a local surfel model of the current frame of the stereo camera image using the optimized depth map, to remove the conflict information of the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame, and to fuse the local surfel model of the current frame after the conflict information is removed with the global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is built; and further configured to adaptively render the global surfel model corresponding to all image frames of the autonomous driving scene to obtain the adaptively rendered global surfel model, and to obtain from the adaptively rendered global surfel model a new-view scene image sequence corresponding to the autonomous driving scene, wherein the new-view scene image sequence comprises a plurality of new-view scene rendering images;

an adversarial network completion module 500, configured to obtain a rendered-image pyramid of the new-view scene rendering image, a reference-image pyramid of the reference image corresponding to the new-view scene rendering image, and a mask-image pyramid of the mask image corresponding to the new-view scene rendering image, and to input the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into a trained adversarial generative network model, so that the regions of missing information in the new-view scene rendering image are completed by the adversarial generative network model and a complete new-view scene image sequence is generated;

wherein the complete new-view scene image sequence comprises a plurality of complete new-view scene images, the reference image is the stereo camera image from a viewpoint adjacent to that of the new-view scene rendering image, and the mask image is obtained based on the regions of missing information in the new-view scene rendering image. One plausible construction of the three image pyramids consumed by the completion module is sketched below.
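The following sketch shows one way such pyramids could be built; the 2x2 averaging filter and the level count (assumed to equal the number of CSD residual modules in the generator) are illustrative assumptions rather than the construction disclosed by this application.

```python
import numpy as np

def downsample(img: np.ndarray) -> np.ndarray:
    # simple 2x2 average pooling; any resampling filter could be used instead
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] +
                   img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(img: np.ndarray, levels: int):
    pyramid = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid[::-1]   # coarsest (top) level first, as fed to the generator

def build_inputs(render, reference, mask, levels=3):
    return (build_pyramid(render, levels),
            build_pyramid(reference, levels),
            build_pyramid(mask, levels))
```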

It should be noted that the role of the semantic depth enhancement module 300 is to optimize the depth of object boundaries in the depth map and to remove dynamic objects (for example, moving vehicles and pedestrians in the scene); that is, the semantic depth enhancement module 300 improves the depth of the boundaries between different objects in the original depth map. The specific way in which the semantic depth enhancement module 300 optimizes boundary depth according to the depth-locality principle of semantically consistent regions has been discussed above; this part is the solution proposed in this application to the problem of large depth errors in existing depth maps.

It should be noted that, for the specific execution flow and technical effects of this reconstruction and completion system for autonomous driving scenes, reference may be made to the specific execution flow and technical effects of the reconstruction and completion method for autonomous driving scenes; they are not repeated here.

The above are some explanations of a reconstruction and completion system for autonomous driving scenes. Some embodiments of this application also disclose a computer-readable storage medium, comprising a program which can be executed by a processor to implement the method of any of the embodiments herein.

This document has been described with reference to various exemplary embodiments. However, those skilled in the art will recognize that changes and modifications can be made to the exemplary embodiments without departing from the scope of this document. For example, the various operation steps, as well as the components used to perform them, may be implemented in different ways depending on the specific application, or in consideration of any number of cost functions associated with the operation of the system (for example, one or more steps may be deleted, modified, or combined into other steps).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. In addition, as understood by those skilled in the art, the principles herein may be embodied in a computer program product on a computer-readable storage medium preloaded with computer-readable program code. Any tangible, non-transitory computer-readable storage medium may be used, including magnetic storage devices (hard disks, floppy disks, etc.), optical storage devices (CD-ROM, DVD, Blu-ray discs, etc.), flash memory, and the like. These computer program instructions may be loaded onto a general-purpose computer, a special-purpose computer, or other programmable data processing device to form a machine, so that the instructions executed on the computer or other programmable data processing device create means for implementing the specified functions. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory form an article of manufacture including means for implementing the specified functions. The computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce a computer-implemented process, and the instructions executed on the computer or other programmable device provide steps for implementing the specified functions.

Although the principles herein have been shown in various embodiments, many modifications of structure, arrangement, proportion, elements, materials, and components that are particularly adapted to specific environments and operating requirements may be used without departing from the principles and scope of this disclosure. The above modifications and other changes or revisions are intended to be included within the scope of this document.

The foregoing detailed description has been given with reference to various embodiments. However, those skilled in the art will recognize that various modifications and changes can be made without departing from the scope of this disclosure. Accordingly, this disclosure is to be considered in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within its scope. Likewise, the advantages of the various embodiments, other advantages, and solutions to problems have been described above. However, the benefits, advantages, solutions to problems, and any elements that may produce them or make them more explicit should not be construed as critical, required, or essential. The term "comprising" and any other variants thereof, as used herein, denote a non-exclusive inclusion, such that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed or not inherent to such process, method, system, article, or device. In addition, the term "coupled" and any other variants thereof, as used herein, refer to a physical connection, an electrical connection, a magnetic connection, an optical connection, a communication connection, a functional connection, and/or any other connection.

Those skilled in the art will recognize that many changes may be made to the details of the above embodiments without departing from the basic principles of the invention. Therefore, the scope of the invention should be determined solely by the claims.

Claims (10)

1. A reconstruction and completion method for autonomous driving scenes, characterized by comprising:
acquiring stereo camera images of the autonomous driving scene and generating a depth map and a semantic map of the stereo camera image, wherein the stereo camera image is obtained by photographing the autonomous driving scene with a stereo camera, the depth map is used to represent the depth information of the stereo camera image, and the semantic map is used to represent the semantic information of the stereo camera image; optimizing the depth map using the semantic map to obtain an optimized depth map;
building a local surfel (surface element) model of the current frame of the stereo camera image using the optimized depth map, removing the conflict information of the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame, and fusing the local surfel model of the current frame after the conflict information is removed with the global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is built;
adaptively rendering the global surfel model corresponding to all image frames of the autonomous driving scene to obtain the adaptively rendered global surfel model, and obtaining from the adaptively rendered global surfel model a new-view scene image sequence corresponding to the autonomous driving scene, wherein the new-view scene image sequence comprises a plurality of new-view scene rendering images;
obtaining a rendered-image pyramid of the new-view scene rendering image, a reference-image pyramid of a reference image corresponding to the new-view scene rendering image, and a mask-image pyramid of a mask image corresponding to the new-view scene rendering image, and inputting the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into a trained adversarial generative network model, so that the regions of missing information in the new-view scene rendering image are completed by the adversarial generative network model and a complete new-view scene image sequence is generated;
wherein the complete new-view scene image sequence comprises a plurality of complete new-view scene images, the reference image is the stereo camera image from a viewpoint adjacent to that of the new-view scene rendering image, and the mask image is obtained based on the regions of missing information in the new-view scene rendering image.

2. The reconstruction and completion method of claim 1, characterized in that the adversarial generative network model comprises a generator, and the generator comprises a plurality of stages of CSD residual modules connected in series;
wherein inputting the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into the trained adversarial generative network model, so that the regions of missing information in the new-view scene rendering image are completed by the adversarial generative network model and a complete new-view scene image sequence is generated, comprises:
inputting the image results of the respective levels of the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid, in order from the top level to the bottom level, into the first-stage CSD residual module through the last-stage CSD residual module of the generator;
using the output of each preceding-stage CSD residual module as one input of the adjacent following-stage CSD residual module, and taking the output of the last-stage CSD residual module as the complete new-view scene image;
wherein the first-stage CSD residual module through the last-stage CSD residual module of the generator all have the same network structure, and the number of levels of each of the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid equals the number of the CSD residual modules.

3. The reconstruction and completion method of claim 2, characterized in that inputting the image results of the respective levels of the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid, in order from the top level to the bottom level, into the first-stage CSD residual module through the last-stage CSD residual module of the generator, using the output of each preceding-stage CSD residual module as one input of the adjacent following-stage CSD residual module, and taking the output of the last-stage CSD residual module as the complete new-view scene image, comprises:
inputting the top-level image results of the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid, together with the new-view scene rendering image at the scale corresponding to the top-level image results, into the first-stage CSD residual module;
inputting the image results of the remaining levels of the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid, in order from the level below the top level to the bottom level, into the second-stage CSD residual module through the last-stage CSD residual module of the generator; and taking the output of the last-stage CSD residual module as the complete new-view scene image.

4. The reconstruction and completion method of claim 3, characterized in that the CSD residual module comprises a first-stage CSD sub-module, a second-stage CSD sub-module and a third-stage CSD sub-module, and the data processing flow inside the CSD residual module is:
the image results of the corresponding level of the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid, together with the feature map input to the CSD residual module, are input to the first-stage CSD sub-module and the third-stage CSD sub-module respectively; the output of the first-stage CSD sub-module is used as the input of a first ReLU activation function, and a convolution operation is performed on the output of the first ReLU activation function to obtain a first convolution result; the first convolution result is input to the second-stage CSD sub-module, the output of the second-stage CSD sub-module is used as the input of a second ReLU activation function, and a convolution operation is performed on the output of the second ReLU activation function to obtain a second convolution result; the output of the third-stage CSD sub-module is used as the input of a third ReLU activation function, and a convolution operation is performed on the output of the third ReLU activation function to obtain a third convolution result; an element-wise addition is performed on the second convolution result and the third convolution result to obtain the output of the CSD residual module;
wherein the feature map input to the first-stage CSD residual module is the new-view scene rendering image at the scale corresponding to the top-level image results, and the feature map input to each adjacent following-stage CSD residual module is the output of the preceding-stage CSD residual module.

5. The reconstruction and completion method of claim 4, characterized in that the first-stage CSD sub-module, the second-stage CSD sub-module and the third-stage CSD sub-module all have the same data processing flow, and the data processing flow of the CSD sub-module is:
performing similarity matching on the input image result of the corresponding level of the rendered-image pyramid, the image result of the corresponding level of the reference-image pyramid and the image result of the corresponding level of the mask-image pyramid, to obtain contextual association information for the image result of the corresponding level of the rendered-image pyramid;
wherein the contextual association information comprises a similarity map and an associated image of the image result of the corresponding level of the rendered-image pyramid; the associated image is generated based on the plurality of regions of the image result of the corresponding level of the reference-image pyramid that are most similar to the image result of the corresponding level of the rendered-image pyramid, and the similarity map is used to represent the correlation between the image result of the corresponding level of the rendered-image pyramid and the image result of the corresponding level of the reference-image pyramid; the contextual association information is used to provide context matching information for the regions of missing information;
performing a second convolution operation on the associated image and the similarity map respectively to obtain the spatial features of the associated image and the spatial features of the similarity map; performing a second convolution operation on the image result of the corresponding level of the rendered-image pyramid to obtain the spatial features of the new-view scene rendering image at the corresponding level; performing an element-wise multiplication on the spatial features of the similarity map and the spatial features of the new-view scene rendering image to obtain an element-wise multiplication result; adding the element-wise multiplication result to the spatial features of the associated image to obtain a second affine parameter;
wherein the second affine parameter is the sum of the element-wise multiplication result and the spatial features of the associated image, α_{c,h,w} denotes the spatial features of the similarity map, c, h and w are respectively the number of feature channels, the length and the width of the image result of the corresponding level, I_r denotes the new-view scene rendering image, and I_c' denotes the associated image of the image result of the corresponding level;
normalizing the plurality of feature maps input to the CSD residual module to obtain a normalization parameter; and taking the normalization result obtained by adding the normalization parameter and the second affine parameter as the output of the CSD sub-module.

6. The reconstruction and completion method of claim 5, characterized in that performing similarity matching on the input image result of the corresponding level of the rendered-image pyramid, the image result of the corresponding level of the reference-image pyramid and the image result of the corresponding level of the mask-image pyramid, to obtain the contextual association information for the image result of the corresponding level of the rendered-image pyramid, comprises:
dividing the image result of the corresponding level of the rendered-image pyramid into a plurality of rendering sub-blocks of size k×k by moving a dividing window over it with a first stride, and dividing the image result of the corresponding level of the mask-image pyramid into a plurality of mask sub-blocks of size k×k by moving the dividing window over it with the first stride;
determining, in the image result of the corresponding level of the reference-image pyramid, a search domain corresponding to the rendering sub-block, the size of the search domain being k'×k', and dividing the search domain into s×s reference sub-blocks of size k×k by moving the dividing window over it with a second stride, wherein k, k' and s are preset constants;
applying the mask sub-block corresponding to the rendering sub-block to the reference sub-block corresponding to the rendering sub-block to obtain a reference mask sub-block corresponding to the rendering sub-block;
computing the similarities between the rendering sub-block and each of the reference mask sub-blocks within the search domain corresponding to the rendering sub-block, and taking the maximum of these similarities as the element of the similarity map corresponding to the rendering sub-block, wherein the size of the search domain corresponding to the rendering sub-block is larger than the size of the rendering sub-block;
taking the reference sub-blocks respectively corresponding to the elements of the similarity map as best-matching sub-blocks, and generating the associated image using the plurality of best-matching sub-blocks corresponding to the similarity map.

7. The reconstruction and completion method of claim 6, characterized in that each element simi(i) of the similarity map is the maximum of the cosine similarities between the i-th rendering sub-block of the image result of the corresponding level of the rendered-image pyramid and the s×s reference mask sub-blocks within the search domain corresponding to the i-th rendering sub-block in the image result of the corresponding level of the reference-image pyramid, each reference mask sub-block being the element-wise product (⊙) of a reference sub-block and the corresponding mask sub-block; the range of the index i is determined by H, W and the first stride n, where H and W are respectively the height and the width of the new-view scene rendering image and n is the first stride; I_m denotes the mask image, I_c denotes the reference image, and j denotes the number of the reference sub-blocks within the search domain.

8. The reconstruction and completion method of claim 5, characterized in that the normalization parameter is obtained by normalizing the feature maps with their channel-wise mean and standard deviation and scaling the result with a first affine parameter, wherein x_i is the feature map input to the CSD sub-module in the CSD residual module of the i-th stage, x_i ∈ R^(N×C×H×W), C is the number of the corresponding feature channels, μ_c(x_i) and σ_c(x_i) are respectively the mean and the standard deviation of the N feature maps input to the CSD sub-module in the CSD residual module of the i-th stage, γ_c is the first affine parameter, and γ_c ∈ R^C; the modulated affine parameters comprise the first affine parameter γ_c and the second affine parameter.

9. A reconstruction and completion system for autonomous driving scenes, characterized by comprising:
a stereo camera, configured to acquire stereo camera images of the autonomous driving scene;
a semantic depth generation module, configured to generate a depth map and a semantic map of the stereo camera image;
a semantic depth enhancement module, configured to optimize the depth map using the semantic map to obtain an optimized depth map;
a surfel model building module, configured to build a local surfel model of the current frame of the stereo camera image using the optimized depth map, to remove the conflict information of the local surfel model of the current frame that conflicts with the global surfel model corresponding to all image frames before the current frame, and to fuse the local surfel model of the current frame after the conflict information is removed with the global surfel model, until the global surfel model corresponding to all image frames of the autonomous driving scene is built; and further configured to adaptively render the global surfel model corresponding to all image frames of the autonomous driving scene to obtain the adaptively rendered global surfel model, and to obtain from the adaptively rendered global surfel model a new-view scene image sequence corresponding to the autonomous driving scene, wherein the new-view scene image sequence comprises a plurality of new-view scene rendering images;
an adversarial network completion module, configured to obtain a rendered-image pyramid of the new-view scene rendering image, a reference-image pyramid of a reference image corresponding to the new-view scene rendering image, and a mask-image pyramid of a mask image corresponding to the new-view scene rendering image, and to input the rendered-image pyramid, the reference-image pyramid and the mask-image pyramid into a trained adversarial generative network model, so that the regions of missing information in the new-view scene rendering image are completed by the adversarial generative network model and a complete new-view scene image sequence is generated; wherein the complete new-view scene image sequence comprises a plurality of complete new-view scene images, the reference image is the stereo camera image from a viewpoint adjacent to that of the new-view scene rendering image, and the mask image is obtained based on the regions of missing information in the new-view scene rendering image.

10. A computer-readable storage medium, characterized by comprising a program, the program being executable by a processor to implement the method of any one of claims 1 to 8.
CN202311631542.4A 2023-12-01 2023-12-01 A method, system and storage medium for reconstruction and completion of autonomous driving scenes Active CN117333627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311631542.4A CN117333627B (en) 2023-12-01 2023-12-01 A method, system and storage medium for reconstruction and completion of autonomous driving scenes

Publications (2)

Publication Number Publication Date
CN117333627A true CN117333627A (en) 2024-01-02
CN117333627B CN117333627B (en) 2024-04-02

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118470198A (en) * 2024-05-09 2024-08-09 广州镜轩科技有限公司 Improved system for cartoon sketch generation based on deep learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116258817A (en) * 2023-02-16 2023-06-13 浙江大学 A method and system for constructing an autonomous driving digital twin scene based on multi-view 3D reconstruction
CN116485867A (en) * 2023-05-24 2023-07-25 电子科技大学 A Depth Estimation Method for Structured Scenes for Autonomous Driving
WO2023225891A1 (en) * 2022-05-25 2023-11-30 浙江大学 Neural rendering method based on multi-resolution network structure
CN117150755A (en) * 2023-08-28 2023-12-01 西安交通大学 An autonomous driving scene simulation method and system based on neural point rendering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant