CN110223380A

CN110223380A - Fusion is taken photo by plane and the scene modeling method of ground multi-view image, system, device

Info

Publication number: CN110223380A
Application number: CN201910502762.4A
Authority: CN
Inventors: 申抒含; 高翔; 朱灵杰; 胡占义
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2019-06-11
Filing date: 2019-06-11
Publication date: 2019-09-10
Anticipated expiration: 2039-06-11
Also published as: CN110223380B

Abstract

The invention belongs to the field of scene modeling, and specifically relates to a scene modeling method, system, and device for integrating aerial photography and ground perspective images. The problem of imprecise fusion. The method of the present invention includes: S100, acquiring an aerial perspective image of an indoor scene to be modeled, and constructing an aerial map; S200, based on the aerial map, obtaining a composite image by synthesizing a ground perspective reference image from the aerial map; S300, Obtain a set of ground perspective images through the ground perspective images collected by the ground camera; S400, based on the synthesized image, fuse the aerial perspective images and the ground perspective images to obtain an indoor scene model. The present invention can generate a complete and accurate indoor scene model, takes into account both acquisition efficiency and reconstruction accuracy, and has strong robustness.

Description

Scene modeling method, system and device for fusing aerial and ground perspective images

技术领域technical field

本发明属于场景建模领域，具体涉及一种融合航拍与地面视角图像的场景建模方法、系统、装置。The invention belongs to the field of scene modeling, and in particular relates to a scene modeling method, system and device for fusing aerial photography and ground perspective images.

背景技术Background technique

室内场景三维重建在许多现实应用中起到了重要作用，例如室内导航、服务机器人、建筑信息模型(BIM，building information modeling)等。现有的室内场景重建方法可大致分为三类：(1)基于激光雷达(LiDAR，light detection and ranging)的方法，(2)基于RGB-D相机的方法，(3)基于图像的方法。3D reconstruction of indoor scenes plays an important role in many real-world applications, such as indoor navigation, service robots, building information modeling (BIM, building information modeling), etc. Existing indoor scene reconstruction methods can be roughly divided into three categories: (1) methods based on LiDAR (light detection and ranging), (2) methods based on RGB-D cameras, and (3) methods based on images.

尽管基于LiDAR的方法与基于RGB-D相机的方法均有着较高的精度，在重建较大的室内场景时，上述两种方法均存在成本较高，拓展性较差等问题。对于基于LiDAR的方法，由于扫描视角限制导致场景遮挡难以避免，在进行扫描时往往需要多视角的激光扫描与点云对齐。对于基于RGB-D相机的方法，由于传感器有效工作距离受限，需要采集、处理大量的数据。因此，上述方法在进行大规模室内场景重建时，均存在高成本、低效率的不足。Although both the LiDAR-based method and the RGB-D camera-based method have high accuracy, when reconstructing large indoor scenes, the above two methods have problems such as high cost and poor scalability. For LiDAR-based methods, scene occlusion is unavoidable due to the limitation of scanning viewing angles, and multi-view laser scanning is often required to be aligned with the point cloud during scanning. For methods based on RGB-D cameras, due to the limited effective working distance of the sensor, a large amount of data needs to be collected and processed. Therefore, the above methods all have the disadvantages of high cost and low efficiency when performing large-scale indoor scene reconstruction.

相对于基于LiDAR的方法与基于RGB-D相机的方法，尽管基于图像的方法成本更低，灵活性更强，这类方法也存在一些不足，如由于复杂场景、重复结构、缺乏纹理等导致的不完整、不精确的重建结果。即使目前最先进的从运动恢复结构(SfM，structure frommotion)与多视图立体技术(MVS，multiple view stereo)技术，在规模较大，结构较复杂的室内场景中的重建效果仍不能令人满意。另外，一些基于图像的方法利用一些先验假设来处理室内场景重建问题，例如曼哈顿世界假设。尽管这些方法在有些时候能够取得较好的结果，但是，在不符合先验假设的情况下这些方法往往会导致错误的重建结果。Compared with LiDAR-based methods and RGB-D camera-based methods, although image-based methods are cheaper and more flexible, they also have some shortcomings, such as complex scenes, repetitive structures, lack of texture, etc. Incomplete and imprecise reconstruction results. Even with the most advanced structure from motion recovery (SfM, structure from motion) and multi-view stereo technology (MVS, multiple view stereo) technology, the reconstruction effect in large-scale and complex indoor scenes is still unsatisfactory. In addition, some image-based methods utilize some prior assumptions to deal with the indoor scene reconstruction problem, such as the Manhattan world assumption. Although these methods can sometimes achieve better results, they often lead to wrong reconstruction results when the prior assumptions are not met.

发明内容Contents of the invention

为了解决现有技术中的上述问题，即为了解决针对室内场景结构复杂、纹理缺乏，基于图像的建模结果不完整、不精确融合的问题，本发明第一方面，提出了一种融合航拍与地面视角图像的场景建模方法，包括以下步骤：In order to solve the above-mentioned problems in the prior art, that is, in order to solve the problems of complex indoor scene structure, lack of texture, incomplete and inaccurate fusion of image-based modeling results, the first aspect of the present invention proposes a fusion of aerial photography and A scene modeling method for a ground perspective image, comprising the following steps:

步骤S100，获取待建模的室内场景的航拍视角图像，并构建航拍地图；Step S100, obtaining the aerial perspective image of the indoor scene to be modeled, and constructing an aerial map;

步骤S200，基于所述航拍地图，通过由航拍地图合成地面视角参考图像的方法，获取合成图像；Step S200, based on the aerial map, obtain a synthesized image through the method of synthesizing a ground perspective reference image from the aerial map;

步骤S300，通过地面相机采集的地面视角图像，获取地面视角图像集合；Step S300, obtaining a set of ground perspective images through the ground perspective images collected by the ground camera;

步骤S400，基于所述合成图像，将所述航拍视角图像与所述地面视角图像进行融合，获取室内场景模型。Step S400, based on the synthesized image, the aerial perspective image and the ground perspective image are fused to obtain an indoor scene model.

在一些优选实施方式中，步骤S100中“获取待建模的室内场景的航拍视角图像，并构建航拍地图”，其方法为：In some preferred implementations, in step S100 "acquire the aerial perspective image of the indoor scene to be modeled, and construct an aerial map", the method is:

对室内场景的航拍视角视频，采用基于词袋模型的自适应视频抽帧方法抽取图像帧，得到室内场景的航拍视角图像集合；For the aerial perspective video of the indoor scene, the image frame is extracted by using the adaptive video frame extraction method based on the bag-of-words model, and the aerial perspective image collection of the indoor scene is obtained;

基于所述航拍视角图像集合，通过图像建模方法构建航拍地图。Based on the set of aerial perspective images, an aerial map is constructed through an image modeling method.

在一些优选实施方式中，步骤S200中“由航拍地图合成地面视角参考图像的方法”，其方法为：In some preferred implementation manners, in step S200 "the method for synthesizing ground perspective reference images from aerial maps", the method is:

基于航拍地图，计算虚拟相机位姿；Based on the aerial map, calculate the virtual camera pose;

通过图割算法，获取航拍地图获取地面视角参考图像的合成图像；Through the graph cut algorithm, the aerial map is obtained to obtain the synthetic image of the reference image of the ground perspective;

在一些优选实施方式中，“通过图割算法，获取航拍地图获取地面视角参考图像的合成图像”，其方法为：In some preferred implementations, "acquire aerial maps to obtain synthetic images of ground perspective reference images through the graph cut algorithm", the method is:

其中，E(l)为图割过程中的能量函数；为虚拟相机可见的三维空间网格投影得到的二维三角形集合，t_i为其中的第i个三角形；为投影得到的二维三角形集合中三角形的公共边集合；l_i为t_i的航拍图像序号；D_i(l_i)为数据项；V_i(l_i,l_j)为平滑项；Among them, E(l) is the energy function in the graph cut process; is a set of two-dimensional triangles projected from the three-dimensional space grid visible to the virtual camera, and t _i is the i-th triangle among them; is the set of common sides of the triangles in the projected two-dimensional triangle set; l _i is the serial number of the aerial image of t _i ; D _i (l _i ) is the data item; V _i (l _i , l _j ) is the smoothing item;

当对应t_i的空间面片在第l_i个航拍图像中可见时，数据项否则的话D_i(l_i)＝α，其中为第l_i个航拍图像中局部特征的尺度中值，为对应t_i的空间面片在第l_i个航拍图像中的投影面积，α为一个较大的常量；When the spatial patch corresponding to t _i is visible in the l _i aerial image, the data item Otherwise D _i (l _i )=α, where is the scale median value of local features in the l _i -th aerial image, is the projected area of the spatial patch corresponding to t _i in the l _i aerial image, and α is a relatively large constant;

当l_i＝l_j时，平滑项V_i(l_i,l_j)＝0；否则V_i(l_j,l_j)＝1。When l _i =l _j , the smoothing term V _i (l _i , l _j )=0; otherwise V _i (l _j ,l _j )=1.

在一些优选实施方式中，步骤S300中“通过地面相机采集的地面视角图像，获取地面视角图像集合”，其方法为：In some preferred implementation manners, in step S300, "obtain a set of ground perspective images through the ground perspective images collected by the ground camera", the method is:

地面机器人基于规划路径，通过其上设置的地面相机连续采集地面视角视频；Based on the planned path, the ground robot continuously collects ground perspective video through the ground camera set on it;

对室内场景的地面视角视频，采用基于词袋模型的自适应视频抽帧方法抽取图像帧，得到室内场景的地面视角图像集合。For the ground-view video of the indoor scene, the adaptive video frame extraction method based on the bag-of-words model is used to extract image frames, and the ground-view image collection of the indoor scene is obtained.

在一些优选实施方式中，“地面机器人基于规划路径，通过其上设置的地面相机连续采集地面视角视频”过程中，其定位方法包括初始机器人定位、移动机器人定位；In some preferred implementations, in the process of "the ground robot continuously collects the ground perspective video through the ground camera set on it based on the planned path", its positioning method includes initial robot positioning and mobile robot positioning;

所述初始机器人定位，其方法为：获取地面相机采集视频的第一帧，获取机器人在所述航拍地图中的初始位置，并将该位置作为机器人后续运动的起点；The initial robot positioning method is as follows: obtain the first frame of the video collected by the ground camera, obtain the initial position of the robot in the aerial map, and use this position as the starting point of the subsequent movement of the robot;

所述移动机器人定位，其方法为：基于初始位置，以及机器人各时刻行驶数据进行机器人位置粗定位，通过匹配当前时刻所采集视频帧图像与所述合成图像，获取机器人当前时刻在所述航拍地图中的位置，并以该位置修订粗定位的位置信息。The positioning of the mobile robot is as follows: based on the initial position and the driving data of the robot at each time, the rough positioning of the robot position is performed, and by matching the video frame image collected at the current time with the composite image, the current position of the robot on the aerial map is obtained. The position in , and use this position to revise the position information of coarse positioning.

在一些优选实施方式中，步骤S400“基于所述合成图像，将所述航拍视角图像与所述地面视角图像进行融合，获取室内场景模型”，其方法为：In some preferred implementation manners, step S400 "based on the synthesized image, fuse the aerial perspective image with the ground perspective image to obtain an indoor scene model", the method is:

获取地面视角图像集合中每一幅图像对应的地面相机在所述航拍地图中的位置；Obtain the position of the ground camera corresponding to each image in the ground perspective image collection in the aerial map;

将地面视角图像与合成图像匹配点连入原始的航拍与地面特征点轨迹中，生成跨视图的约束；Connect the matching points of the ground perspective image and the synthetic image into the original aerial photography and ground feature point trajectory to generate cross-view constraints;

通过捆绑调整(BA，bundle adjustment)对航拍与地面图像位姿进行优化；Optimize the aerial and ground image poses through bundle adjustment (BA, bundle adjustment);

利用航拍与地面视角图像进行稠密重建，获取室内场景的稠密模型。Dense reconstruction of aerial and ground view images is used to obtain a dense model of the indoor scene.

本发明的第二方面，提出了一种融合航拍与地面视角图像的场景建模系统，该系统包括航拍地图构建模块、合成图像获取模块、视角图像集合获取模块、室内场景模型获取模块；In the second aspect of the present invention, a scene modeling system that integrates aerial photography and ground perspective images is proposed, the system includes an aerial photography map construction module, a synthetic image acquisition module, a perspective image collection acquisition module, and an indoor scene model acquisition module;

所述航拍地图构建模块，配置为获取待建模的室内场景的航拍视角图像，并构建航拍地图；The aerial map construction module is configured to obtain an aerial perspective image of the indoor scene to be modeled, and construct an aerial map;

所述合成图像获取模块，配置为基于所述航拍地图，通过由航拍地图合成地面视角参考图像的方法，获取合成图像；The synthetic image acquisition module is configured to obtain a synthetic image based on the aerial map by synthesizing a ground perspective reference image from the aerial map;

所述视角图像集合获取模块，配置为通过地面相机采集的地面视角图像，获取地面视角图像集合；The viewing angle image set acquisition module is configured to acquire a ground viewing angle image set through ground viewing angle images collected by a ground camera;

所述室内场景模型获取模块，配置为基于所述合成图像，将所述航拍视角图像与所述地面视角图像进行融合，获取室内场景模型。The indoor scene model acquisition module is configured to fuse the aerial perspective image with the ground perspective image based on the synthesized image to acquire an indoor scene model.

本发明的第三方面，提出了一种存储装置，其中存储有多条程序，所述程序适于由处理器加载并执行以实现上述的融合航拍与地面视角图像的场景建模方法。In the third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, and the programs are adapted to be loaded and executed by a processor to implement the above-mentioned scene modeling method for fusing aerial photography and ground perspective images.

本发明的第四方面，提出了一种处理装置，包括处理器、存储装置；处理器，适于执行各条程序；存储装置，适于存储多条程序；所述程序适于由处理器加载并执行以实现上述的融合航拍与地面视角图像的场景建模方法。According to the fourth aspect of the present invention, a processing device is proposed, including a processor and a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing multiple programs; and the program is suitable for being loaded by the processor And execute to realize the above-mentioned scene modeling method of fusing aerial photography and ground perspective images.

本发明的有益效果：Beneficial effects of the present invention:

本发明通过构建一个三维航拍地图引导机器人在室内场景中行进并采集地面视角图像，然后对航拍与地面图像进行融合，并通过融合后的图像生成完整、精确的室内场景模型。本发明室内场景重建流程兼顾采集效率与重建精度，并且，具有较强的鲁棒性。The invention constructs a three-dimensional aerial map to guide the robot to travel in the indoor scene and collects ground perspective images, then fuses the aerial and ground images, and generates a complete and accurate indoor scene model through the fused images. The indoor scene reconstruction process of the present invention takes both acquisition efficiency and reconstruction accuracy into consideration, and has strong robustness.

附图说明Description of drawings

通过阅读参照以下附图所作的对非限制性实施例所作的详细描述，本申请的其它特征、目的和优点将会变得更明显：Other characteristics, objects and advantages of the present application will become more apparent by reading the detailed description of non-limiting embodiments made with reference to the following drawings:

图1是本发明一种实施例的融合航拍与地面视角图像的场景建模方法流程框架示意图；Fig. 1 is a schematic diagram of a flow chart of a scene modeling method for fusing aerial photography and ground perspective images according to an embodiment of the present invention;

图2是本发明一种实施例中一个经271幅抽取的视频帧重建得到的航拍地图的示例图；Fig. 2 is an example diagram of an aerial map reconstructed through 271 extracted video frames in an embodiment of the present invention;

图3是本发明一种实施例中基于网格的图像合成示意图；Fig. 3 is a schematic diagram of grid-based image synthesis in an embodiment of the present invention;

图4是本发明一种实施例中局部特征尺度与图像清晰度之间的关系示例图；Fig. 4 is an exemplary diagram of the relationship between local feature scales and image clarity in an embodiment of the present invention;

图5是本发明一种实施例中不同配置下基于图割的图像合成结果示例图；Fig. 5 is an example diagram of image synthesis results based on graph cuts under different configurations in an embodiment of the present invention;

图6是作为对比的另外一些图像合成结果以及类似视角下的地面图像示例图；Figure 6 is a comparison of some other image synthesis results and an example diagram of ground images under similar viewing angles;

图7是本发明一种实施例中图像匹配结果示例图；Fig. 7 is an example diagram of image matching results in an embodiment of the present invention;

图8是本发明一种实施例中机器人运动过程中候选匹配合成图像查找示意图；Fig. 8 is a schematic diagram of searching for candidate matching composite images during robot movement in an embodiment of the present invention;

图9是本发明一种实施例中批量式相机定位流程示意图；Fig. 9 is a schematic diagram of batch camera positioning process in an embodiment of the present invention;

图10是本发明一种实施例中基于三种特征点轨迹的批量式相机定位结果示例图；Fig. 10 is an example diagram of batch camera positioning results based on three feature point trajectories in an embodiment of the present invention;

图11是本发明一种实施例中批量式相机定位过程示例图；Fig. 11 is an example diagram of batch type camera positioning process in an embodiment of the present invention;

图12是本发明一种实施例中针对航拍视图的航拍与地面特征点轨迹生成示意图；Fig. 12 is a schematic diagram of aerial photography and ground feature point trajectory generation for aerial photography views in an embodiment of the present invention;

图13是本发明一种实施例的测试中用到的数据采集设备；Fig. 13 is the data acquisition equipment used in the test of an embodiment of the present invention;

图14是本发明一种实施例的测试中Hall数据集中的示例航拍图像与生成的三维航拍地图示例图；Fig. 14 is an example aerial image in the Hall data set and an example diagram of the generated three-dimensional aerial map in the test of an embodiment of the present invention;

图15是本发明一种实施例的测试中Hall数据集航拍视频上的本发明抽帧方法与等间隔抽帧方法对比实验结果示例图；Fig. 15 is an example diagram of comparative experimental results between the frame drawing method of the present invention and the frame drawing method at equal intervals on the aerial video of the Hall data set in the test of an embodiment of the present invention;

图16是本发明一种实施例的测试中地面相机定位的定性对比结果示例图；Fig. 16 is an example diagram of qualitative comparison results of ground camera positioning in a test of an embodiment of the present invention;

图17是本发明一种实施例的测试中室内场景重建定性结果示例图。Fig. 17 is an example diagram of qualitative results of indoor scene reconstruction in a test according to an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面将结合附图对本发明实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本发明保护的范围。In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present invention, rather than Full examples. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

下面结合附图和实施例对本申请作进一步的详细说明。可以理解的是，此处所描述的具体实施例仅仅用于解释相关发明，而非对该发明的限定。另外还需要说明的是，为了便于描述，附图中仅示出了与有关发明相关的部分。The application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain related inventions, rather than to limit the invention. It should also be noted that, for the convenience of description, only the parts related to the related invention are shown in the drawings.

需要说明的是，在不冲突的情况下，本申请中的实施例及实施例中的特征可以相互组合。It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other.

由于室内场景的复杂性，对于基于图像的方法实现场景完整重建需考虑如下两个问题。第一个是图像采集过程，即如何采集图像以完整、高效地覆盖室内场景。第二个是场景重建算法，即如何在SfM与MVS过程中融合不同视角的图像以获取完整、精确的重建结果。针对上述两问题，本发明提出了一种新颖的基于图像的室内场景采集与重建流程。该流程用到了迷你飞行器与地面机器人并包含四个主要步骤(如图1所示)：(1)航拍地图构建：采用一个迷你飞行器在室内采集航拍视角图像，然后由航拍视角图像获取表征室内场景的三角形网格，并将其用于为地面机器人定位导航的地图；(2)参考图像合成：在航拍地图中进行平面检测，获取地平面并用于地面机器人路径规划。然后，基于航拍地图合成若干地面视角图像，用于地面机器人的定位；(3)地面机器人定位：地面机器人进入室内场景进行地面视角图像的采集。在机器人边运动边采集图像的同时，通过匹配采集的图像与合成的地面视角图像，实现机器人的定位；(4)室内场景重建：当地面机器人完成图像采集后，通过在基于图像的建模流程中融合迷你飞行器图像与地面机器人图像，实现室内场景的完整与精确建模。Due to the complexity of indoor scenes, the following two issues need to be considered for image-based methods to achieve complete scene reconstruction. The first is the image acquisition process, i.e. how to acquire images to completely and efficiently cover the indoor scene. The second is the scene reconstruction algorithm, that is, how to fuse images from different perspectives in the process of SfM and MVS to obtain complete and accurate reconstruction results. In view of the above two problems, the present invention proposes a novel image-based indoor scene acquisition and reconstruction process. The process uses mini-aircraft and ground robots and includes four main steps (as shown in Figure 1): (1) Aerial map construction: a mini-aircraft is used to collect aerial perspective images indoors, and then the aerial perspective images are used to represent indoor scenes (2) Reference image synthesis: perform plane detection in the aerial map, obtain the ground plane and use it for path planning of the ground robot. Then, a number of ground perspective images are synthesized based on the aerial map for the positioning of the ground robot; (3) Ground robot positioning: the ground robot enters the indoor scene to collect the ground perspective images. While the robot is collecting images while moving, the positioning of the robot is realized by matching the collected images with the synthesized ground perspective images; (4) Indoor scene reconstruction: After the ground robot completes the image acquisition, the image-based modeling process The image of the mini-aircraft and the image of the ground robot are fused to realize the complete and accurate modeling of the indoor scene.

在本发明的建模流程中，只有航拍图像采集过程中需要人工操作，后续的地面图像采集以及室内场景建模过程均为全自动实现，这意味着本发明的流程拓展性强，适用于大规模室内场景的采集与重建。航拍图像的采集也可以通过自主导航按照获取的导航路径自动采集航拍图像，但其增加了算法的复杂性，因此优先选择人工操作，以保证获取图像的灵活性和完整性，以及可拓展性。In the modeling process of the present invention, only the aerial image acquisition process requires manual operation, and the subsequent ground image acquisition and indoor scene modeling processes are all fully automatic, which means that the process of the present invention is highly expandable and suitable for large Acquisition and reconstruction of large-scale indoor scenes. Aerial images can also be collected automatically through autonomous navigation according to the acquired navigation path, but it increases the complexity of the algorithm, so manual operation is preferred to ensure the flexibility, integrity, and scalability of image acquisition.

相比于地面机器人采集的地面图像，迷你飞行器采集的航拍图像拥有更好的视角和更大的视场，这意味着相对于地面图像，航拍图像中的遮挡与误匹配问题会更小。因此，通过航拍图像生成的地图能够更为可靠地用于后续的地面机器人定位过程中。Compared with ground images collected by ground robots, aerial images collected by mini-aircraft have a better viewing angle and larger field of view, which means that compared with ground images, the problems of occlusion and mismatch in aerial images will be smaller. Therefore, the maps generated from aerial images can be more reliably used in the subsequent ground robot positioning process.

迷你飞行器拍摄的航拍图像与地面机器人拍摄的地面图像相互补充并且能够完整覆盖室内场景。因此，通过融合航拍与地面图像，可以获取更为完整、精确的室内场景模型。The aerial images taken by the mini-aircraft complement the ground images taken by the ground robot and can completely cover the indoor scene. Therefore, by fusing aerial and ground images, a more complete and accurate indoor scene model can be obtained.

本发明的一种融合航拍与地面视角图像的场景建模方法，包括以下步骤：A scene modeling method for fusing aerial photography and ground perspective images of the present invention comprises the following steps:

为了更清晰地对本发明融合航拍与地面视角图像的场景建模方法进行说明，下面结合附图对本方发明方法一种实施例中各步骤进行展开详述。In order to more clearly describe the scene modeling method of the present invention that fuses aerial photography and ground perspective images, the steps in an embodiment of the inventive method will be described in detail below in conjunction with the accompanying drawings.

本发明一种实施例的本发明的融合航拍与地面视角图像的场景建模方法，包括步骤S100—S400。According to an embodiment of the present invention, the scene modeling method for fusing aerial photography and ground perspective images of the present invention includes steps S100-S400.

步骤S100，获取待建模的室内场景的航拍视角图像，并构建航拍地图。Step S100, acquiring an aerial perspective image of the indoor scene to be modeled, and constructing an aerial map.

首先采用迷你飞行器采集室内场景的航拍视频，并从视频中抽取一些图像。然后通过基于图像建模的流程对抽取的图像进行重建得到航拍模型，并用将其作地面机器人定位的三维地图。First, a mini-aircraft is used to collect aerial video of indoor scenes, and some images are extracted from the video. Then, the extracted images are reconstructed through an image-based modeling process to obtain an aerial model, which is used as a three-dimensional map for ground robot positioning.

步骤S101，对室内场景的航拍视角视频，采用基于词袋模型的自适应视频抽帧方法抽取图像帧，得到室内场景的航拍视角图像集合。Step S101 , for the aerial view video of the indoor scene, image frames are extracted using an adaptive video frame extraction method based on the bag-of-words model, to obtain an aerial view image set of the indoor scene.

本实施例中通过迷你飞行器在室内场景中采集自顶向下的航拍视角视频，采集的视频分辨率为1080p，帧率为25FPS。由于迷你无人机尺寸小，灵活度高，十分适用于室内场景拍摄。举例说明，本实施例中采用的迷你飞行器为安装了稳定器与4K相机的DJI Spark，其重量仅为300g。另外，相对于地面视角，从航拍视角对室内场景进行拍摄不易受到场景遮挡的影响，因此采用迷你飞行器可以更加高效、完整覆盖场景。In this embodiment, a top-down aerial perspective video is collected by a mini-aircraft in an indoor scene, and the resolution of the collected video is 1080p, and the frame rate is 25FPS. Due to its small size and high flexibility, the mini drone is very suitable for shooting indoor scenes. As an example, the miniature aircraft used in this example is a DJI Spark equipped with a stabilizer and a 4K camera, and its weight is only 300g. In addition, compared with the ground perspective, shooting indoor scenes from an aerial perspective is not easily affected by scene occlusion, so using a mini aircraft can cover the scene more efficiently and completely.

给出采集的航拍视频，可以通过同时定位与构图(SLAM，simultaneouslocalization and mapping)系统构建航拍地图。然而，在本实施例中，采用离线的SfM技术进行航拍地图构建。这是因为：(1)在实施例中航拍地图用于地面机器人定位，因此没必要进行在线构建；(2)与容易产生场景漂移现象的SLAM相比，SfM更加适用于大规模场景建模。可是，如果采用SfM进行航拍地图构建时，显然不需要用到航拍视频中的所有帧。因为航拍视频帧中含有大量的冗余信息，这会严重降低SfM地图构建的效率。为解决上述问题，一个直接的办法就是在视频中每间隔固定的帧数抽取一帧，然后用抽取的视频帧进行地图构建。然而，这种做法仍存在一些缺点：(1)很难通过人工操作迷你飞行器在室内场景中实现稳定、恒速的视频采集，而这个问题在航线拐角处会变得更加困难；(2)由于室内场景中的纹理丰富程度是不一致的，因此对场景进行均匀覆盖也是不恰当的。为解决上述在航拍地图构建过程中存在的问题，本实施例中采用了一种基于词袋(BoW，bag of words)模型的自适应视频抽帧方法，其过程详述如下：Given the collected aerial video, an aerial map can be constructed by a SLAM (simultaneous localization and mapping) system. However, in this embodiment, offline SfM technology is used for aerial map construction. This is because: (1) in the embodiment, the aerial map is used for ground robot positioning, so there is no need for online construction; (2) compared with SLAM, which is prone to scene drift, SfM is more suitable for large-scale scene modeling. However, if SfM is used for aerial map construction, it is obviously not necessary to use all the frames in the aerial video. Because aerial video frames contain a lot of redundant information, this will seriously reduce the efficiency of SfM map construction. To solve the above problems, a direct way is to extract a frame at a fixed interval of frames in the video, and then use the extracted video frames for map construction. However, there are still some disadvantages in this approach: (1) It is difficult to achieve stable and constant-speed video acquisition in indoor scenes by manually operating a miniature aircraft, and this problem will become more difficult at the corner of the route; (2) due to Texture richness in indoor scenes is inconsistent, so uniform coverage of the scene is also inappropriate. In order to solve the above-mentioned problems existing in the aerial map construction process, a kind of adaptive video frame extraction method based on the bag of words (BoW, bag of words) model is adopted in the present embodiment, and its process is described in detail as follows:

在BoW模型中，一幅图像可以表示为一个归一化向量vi，而一对图像相似度可通过对应向量的点乘表示。正如本领域技术人员所知，相邻图像之间过高的相似度会引入过多冗余信息，进而降低构图效率；而相邻图像之间过低的相似度则会导致图像之间连接性较差，构图不完整。因此，在本实施例中提出了一个从全体视频帧中自适应抽取子集的方法，在抽帧时该方法限定每个抽取的视频帧与其相邻的抽取的视频帧之间的相似度在一个合适的范围内。具体来说，先通过libvot库生成每一帧的归一化向量v_i，并将第一帧作为起始点。在抽帧过程中，假设当前第i帧已被抽取，获取该帧与其后续帧之间的相似度的得分：{s_i,j|j＝i+1,i+2,…}，其中然后，将与预设的相似度阈值t进行比较，本实施例中t＝0.1；假设为{s_i,j}中的第一个满足如下不等式：s_i,j<t，则第j^*-1帧(即第一个满足上述不等式的上一帧)为下一个抽取的视频帧。上述过程迭代进行，直至验证完所有视频帧。In the BoW model, an image can be expressed as a normalized vector vi, and the similarity of a pair of images can be expressed by the point product of the corresponding vector express. As known to those skilled in the art, too high similarity between adjacent images will introduce too much redundant information, thereby reducing composition efficiency; and too low similarity between adjacent images will lead to connectivity between images Poor, incomplete composition. Therefore, in this embodiment, a method for adaptively extracting subsets from all video frames is proposed. When extracting frames, the method limits the similarity between each extracted video frame and its adjacent extracted video frames to within within an appropriate range. Specifically, the normalized vector v _i of each frame is first generated through the libvot library, and the first frame is used as a starting point. In the frame extraction process, assuming that the current i-th frame has been extracted, obtain the similarity score between the frame and its subsequent frames: {s _i,j |j=i+1,i+2,...}, where followed by Compared with the preset similarity threshold t, t=0.1 in this embodiment; suppose The first one in {s _{i, j} } satisfies the following inequality: s _{i, j} <t, then the j ^* -1th frame (that is, the first previous frame that satisfies the above inequality) is the next extracted video frame . The above process is performed iteratively until all video frames are verified.

步骤S102，基于所述航拍视角图像集合，通过图像建模方法构建航拍地图。Step S102, based on the set of aerial perspective images, construct an aerial map by using an image modeling method.

基于步骤S101得到的航拍视角图像集合，通过一套标准的基于图像建模流程构建航拍地图，该流程包括：(1)SfM，(2)MVS，(3)表面重建。另外，由于室内接收不到GPS信号，可以通过地面控制点(GCP，ground control point)将航拍地图缩放至其真实物理尺寸。图2为一个经271幅抽取的视频帧重建得到的航拍地图示例，图中前三列为示例航拍图像及其对应的三维航拍地图区域，第四列为整个三维航拍地图，第五列为在航拍地图上的机器人路径规划与虚拟相机位姿计算结果，其中地平面标为浅灰色，规划路径标为图中线段，虚拟相机位姿由棱锥表示。Based on the aerial perspective image set obtained in step S101, an aerial map is constructed through a standard image-based modeling process, which includes: (1) SfM, (2) MVS, and (3) surface reconstruction. In addition, since the GPS signal cannot be received indoors, the aerial map can be scaled to its real physical size through the ground control point (GCP, ground control point). Figure 2 is an example of an aerial map reconstructed from 271 extracted video frames. The first three columns in the figure are the sample aerial image and its corresponding 3D aerial map area, the fourth column is the entire 3D aerial map, and the fifth column is the Robot path planning and virtual camera pose calculation results on the aerial map, where the ground plane is marked in light gray, the planned path is marked as a line segment in the figure, and the virtual camera pose is represented by a pyramid.

步骤S200，基于所述航拍地图，通过由航拍地图合成地面视角参考图像的方法，获取合成图像。Step S200, based on the aerial map, obtain a synthesized image by synthesizing a ground perspective reference image from the aerial map.

本实施例的步骤S100中构建的航拍地图在后续过程中起到了两个作用：第一个是为地面机器人规划路径并在机器人移动过程中进行定位；第二个是在室内场景重建过程中有助于航拍与地面图像的融合。上述两个过程均需要建立地面图像与航拍地图之间的二维到三维的点的对应关系。为获取上述对应点，一个可能有效的解决方案是直接匹配航拍与地面图像。然而，由于这两种图像在视角上差异巨大，直接对其进行匹配是十分困难的。在此，本实施例通过由航拍地图合成地面视角参考图像的方式解决上述问题。参考图像经如下两步进行合成：虚拟相机位姿计算以及基于图割的图像合成。The aerial map constructed in step S100 of this embodiment plays two roles in the follow-up process: the first is to plan the path for the ground robot and perform positioning during the movement of the robot; the second is to play a role in the indoor scene reconstruction process Contribute to the fusion of aerial photography and ground images. Both of the above two processes need to establish a two-dimensional to three-dimensional point correspondence between the ground image and the aerial map. To obtain the above correspondence points, a possible effective solution is to directly match aerial and ground images. However, it is very difficult to directly match the two images due to the huge difference in perspective. Here, in this embodiment, the above-mentioned problem is solved by synthesizing the ground perspective reference image from the aerial map. The reference images are synthesized through the following two steps: virtual camera pose calculation and image synthesis based on graph cut.

步骤S201，基于航拍地图，计算虚拟相机位姿。Step S201, calculating the virtual camera pose based on the aerial map.

用于参考图像合成的虚拟相机位姿基于室内场景的地平面进行计算，本实施例中航拍地图的地平面通过基于随机抽样一致性(RANSAC，random sample consensus)的平面检测方法进行检测(见图2)。虚拟相机位姿分两步进行计算，先计算位置后计算朝向。The virtual camera pose used for reference image synthesis is calculated based on the ground plane of the indoor scene. In this embodiment, the ground plane of the aerial map is detected by a plane detection method based on random sampling consistency (RANSAC, random sample consensus) (see Fig. 2). The virtual camera pose is calculated in two steps, first calculating the position and then calculating the orientation.

步骤S2011，虚拟相机位置计算。Step S2011, calculating the position of the virtual camera.

求取地平面的二维包围盒并将其划分成方形栅格，栅格的大小决定了虚拟相机的数量。为在定位精度与效率上达到平衡，本实施例中将栅格边长设为1m。对于每个栅格，当其中的地平面面积占栅格总面积的比例大于50％时，认为该栅格为放置虚拟相机的有效栅格。虚拟相机位置设为有效栅格的中心并有着高度为h的高程偏移量(见图2)。h的值由地面相机的高度决定，在本实施例中其值设为1m。Find the 2D bounding box of the ground plane and divide it into a square grid. The size of the grid determines the number of virtual cameras. In order to achieve a balance between positioning accuracy and efficiency, the grid side length is set to 1 m in this embodiment. For each grid, when the ground plane area therein accounts for more than 50% of the total area of the grid, the grid is considered to be a valid grid for placing the virtual camera. The virtual camera position is set to the center of the active grid with an elevation offset of height h (see Figure 2). The value of h is determined by the height of the ground camera, and its value is set to 1m in this embodiment.

步骤S2012，虚拟相机朝向设计。Step S2012, the virtual camera faces the design.

在得到虚拟相机位置以后，为实现对场景的全方向观测，需要在每个虚拟相机位置放置多个光心相同、朝向不同的虚拟相机。本实施例中，由于安装在地面机器人上的相机的光轴近似平行于地平面，在此只生成水平朝向的虚拟相机。另外，为消除地面与合成图像之间的透视投影失真，需要将虚拟相机的视场(内参数)设为与地面相机接近。在本实施例中，每个虚拟相机位置上放置6个虚拟相机，虚拟相机之间的偏航角夹角为60°。After obtaining the virtual camera position, in order to realize the omnidirectional observation of the scene, it is necessary to place multiple virtual cameras with the same optical center and different orientations at each virtual camera position. In this embodiment, since the optical axis of the camera installed on the ground robot is approximately parallel to the ground plane, only a horizontally oriented virtual camera is generated here. In addition, in order to eliminate the perspective projection distortion between the ground and the synthetic image, it is necessary to set the field of view (intrinsic parameters) of the virtual camera close to that of the ground camera. In this embodiment, 6 virtual cameras are placed at each virtual camera position, and the yaw angle between the virtual cameras is 60°.

另外，用于地面机器人运动的路径也要通过检查的地平面进行规划。由于本实施例并非聚焦于规划地面机器人的最优路径，在此将检测的地平面的骨架用作机器人路径，该骨架通过中轴变换法进行提取(见图2)。In addition, the path used for ground robot movement is also planned by checking the ground plane. Since this embodiment does not focus on planning the optimal path of the ground robot, the detected skeleton of the ground plane is used as the robot path here, and the skeleton is extracted by the central axis transformation method (see FIG. 2 ).

步骤S202，通过图割算法，获取航拍地图获取地面视角参考图像的合成图像。Step S202, through the graph cut algorithm, obtain an aerial map and obtain a synthesized image of a reference image of a ground perspective.

本实施例借助于空间连续的网格进行图像合成，如图3所示，图中f为一个三维空间面片，其在航拍相机C_a与虚拟的地面相机C_v相机上的二维投影三角形分别记作t_a与t_v，图像合成的原理是将t_a经过f变至t_v。具体来说，先获取每个航拍与虚拟相机的可见网格。然后，对于每个虚拟相机，将其可见网格投影至该相机上形成二维三角形集合。在进行虚拟图像合成时，对于虚拟图像中的一个特定的二维三角形来说，需要基于如下三个因素确定采用哪一幅航拍图像进行变换以填充此区域：(1)可见性，对于此二维三角形对应的三维空间面片，选取的航拍图像应有较好的视角与较近的视距；(2)清晰度，由于从室内航拍视频抽帧得到的图像中有一部分清晰度较差，需要在其中选取足够清晰的航拍图像；(3)一致性，虚拟图像中相邻的三角形应尽可能由同一幅航拍图像进行合成以保持合成图像的一致性。本实施例中，可见性因素通过空间面片在航拍图像上的投影面积衡量(越大越好)，而清晰度因素通过航拍图像局部特征尺度的中值衡量(越小越好)，具体见图4，左边两列为两幅局部特征尺度中值最大的图像，右边两列为两幅局部特征尺度中值最小的图像，第二行为第一行矩形区域的放大图像。基于上述描述，本实施例中的图像合成问题可归结为多标签优化问题，定义如公式(1)所示：In this embodiment, image synthesis is carried out by means of spatially continuous grids, as shown in Figure 3, f in the figure is a three-dimensional space surface, and its two-dimensional projection triangle on the aerial camera C _a and the virtual ground camera C _v camera They are denoted as t _a and t _v respectively. The principle of image synthesis is to change t _a to t _v through f. Specifically, the visible grids of each aerial and virtual camera are obtained first. Then, for each virtual camera, its visible mesh is projected onto the camera to form a 2D triangle set. When performing virtual image synthesis, for a specific two-dimensional triangle in the virtual image, it is necessary to determine which aerial image to use for transformation to fill this area based on the following three factors: (1) Visibility, for the two The three-dimensional space surface corresponding to the three-dimensional triangle, the selected aerial image should have a better viewing angle and a closer viewing distance; (2) clarity, because some of the images obtained from indoor aerial video frame extraction have poor clarity, It is necessary to select a sufficiently clear aerial image; (3) Consistency, the adjacent triangles in the virtual image should be synthesized from the same aerial image as much as possible to maintain the consistency of the synthesized image. In this embodiment, the visibility factor is measured by the projected area of the space patch on the aerial image (the larger the better), and the clarity factor is measured by the median value of the local feature scale of the aerial image (the smaller the better), see Fig. 4. The two columns on the left are images with the largest median value of the two local feature scales, the two columns on the right are images with the smallest median value of the two local feature scales, and the second row is the enlarged image of the rectangular area in the first row. Based on the above description, the image synthesis problem in this embodiment can be attributed to a multi-label optimization problem, defined as shown in formula (1):

其中，E(l)为图割过程中的能量函数；为虚拟相机可见的三维空间网格投影得到的二维三角形集合，t_i为其中的第i个三角形；为投影得到的二维三角形集合中三角形的公共边集合；l_i为t_i的标签，即航拍图像序号。当对应t_i的空间面片在第l_i个航拍图像中可见时，数据项其中为第l_i个航拍图像中局部特征的尺度中值而为对应t_i的空间面片在第l_i个航拍图像中的投影面积；否则的话D_i(l_i)＝α，其中α为一个较大的常量(本实施例中α＝10⁴)以惩罚这种情况。当l_j＝l_j时，平滑项V_i(l_i,l_j)＝0；否则V_i(l_i,l_j)＝1。定义于式(1)的优化问题可通过图割算法进行高效求解。Among them, E(l) is the energy function in the graph cut process; is a set of two-dimensional triangles projected from the three-dimensional space grid visible to the virtual camera, and t _i is the i-th triangle among them; is the set of common sides of the triangles in the projected two-dimensional triangle set; l _i is the label of t _i , that is, the serial number of the aerial image. When the spatial patch corresponding to t _i is visible in the l _i aerial image, the data item in is the scale median value of local features in the l _i -th aerial image and is the projected area of the spatial patch corresponding to t _i in the l _i aerial image; otherwise, D _i (l _i )=α, where α is a relatively large constant (α=10 ⁴ in this embodiment) and punish the situation. When l _j =l _j , the smoothing term V _i (l _i , l _j )=0; otherwise V _i (l _i ,l _j )=1. The optimization problem defined in formula (1) can be efficiently solved by the graph cut algorithm.

为阐明清晰度因素与一致性因素的影响，本实施例中在四种不同配置下在其中一个虚拟相机上进行了图像合成，结果如图5所示，从左到右分别为：既不考虑清晰度因素，又不考虑一致性因素；只考虑一致性因素；只考虑清晰度因素；既考虑清晰度因素，又考虑一致性因素的图像合成结果。每幅图右上角的大矩形为图中小矩形的方大图。由5图可知，清晰度因素使得合成图像更为清楚而一致性因素使得合成图像中孔洞及锐边更少。另外，图6给出了另外的一些图像合成结果以及类似视角下的地面图像。尽管仍有些难以避免的合成错误情况，合成图像与其对应的地面图像在公共可见区域有着较大的相似性，这验证了本实施例中图像合成方法的有效性。本步骤中的合成图像将用作地面机器人定位的参考数据库。In order to clarify the influence of the clarity factor and the consistency factor, in this embodiment, image synthesis is performed on one of the virtual cameras under four different configurations, and the results are shown in Figure 5, from left to right: neither The sharpness factor is not considered, and the consistency factor is not considered; only the consistency factor is considered; only the sharpness factor is considered; the image synthesis result that considers both the sharpness factor and the consistency factor. The large rectangle in the upper right corner of each figure is a square enlargement of the small rectangle in the figure. It can be seen from Figure 5 that the sharpness factor makes the composite image clearer and the consistency factor makes the composite image less holes and sharp edges. In addition, Figure 6 shows some other image synthesis results and ground images under similar viewing angles. Although there are still some unavoidable synthesis errors, the synthetic image and its corresponding ground image have a relatively large similarity in the common visible area, which verifies the effectiveness of the image synthesis method in this embodiment. The synthesized images in this step will be used as a reference database for ground robot localization.

步骤S300，通过地面相机采集的地面视角图像，获取地面视角图像集合。In step S300, a set of ground perspective images is acquired through the ground perspective images collected by the ground camera.

将地面机器人放置于室内场景中时，机器人将沿着规划路径运动并自动采集地面视角视频。如果机器人仅通过其内置传感器，例如轮子编码器与惯性测量单元(IMU，inertial measurement unit)进行定位的话，它将不会严格按照规划的路径运动。这是因为机器人内置传感器存在累积误差的问题，这种问题对于安装在消费级机器人上的低成本传感器来说尤为明显。因此，机器人的位姿需要通过视觉定位的方式进行修正，而在本步骤中通过匹配合成与地面图像实现视觉定位。When the ground robot is placed in the indoor scene, the robot will move along the planned path and automatically collect the ground perspective video. If the robot is only positioned by its built-in sensors, such as wheel encoders and inertial measurement units (IMUs, inertial measurement units), it will not strictly follow the planned path. This is due to the problem of accumulating errors in sensors built into robots, especially for low-cost sensors mounted on consumer-grade robots. Therefore, the pose of the robot needs to be corrected by visual positioning, and in this step, the visual positioning is achieved by matching the synthetic and ground images.

步骤S301，地面机器人基于规划路径，通过其上设置的地面相机连续采集地面视角视频。Step S301, based on the planned path, the ground robot continuously collects ground perspective videos through the ground camera set on it.

本步骤中，定位方法包括初始机器人定位、移动机器人定位。In this step, the positioning method includes initial robot positioning and mobile robot positioning.

(1)初始机器人定位(1) Initial robot positioning

初始机器人定位，其方法为：获取地面相机采集视频的第一帧，获取机器人在所述航拍地图中的初始位置，并将该位置作为机器人后续运动的起点。The initial robot positioning method is as follows: obtain the first frame of the video collected by the ground camera, obtain the initial position of the robot in the aerial map, and use this position as the starting point of the subsequent movement of the robot.

通过对地面相机采集视频的第一帧进行定位，可以获取机器人在航拍地图中的初始位置，并将该位置作为机器人后续运动的起点。上述初始定位可通过匹配第一帧图像与所有合成图像或者通过语义树检索得到的k个最相似的合成图像实现。本步骤中使用的是基于图像检索的方法，且k＝30。需要注意的是，尽管合成了地面视角图像，地面图像与合成图像在光照、视角等方面仍有较大区别，常用的尺度不变特征变换(SIFT，scale-invariantfeature transform)特征不足以应对。本步骤中采用的为ASIFT(affine-SIFT)特征。By locating the first frame of the video collected by the ground camera, the initial position of the robot in the aerial map can be obtained, and this position can be used as the starting point of the subsequent movement of the robot. The above initial positioning can be realized by matching the first frame image with all synthetic images or the k most similar synthetic images obtained through semantic tree retrieval. In this step, the method based on image retrieval is used, and k=30. It should be noted that although the ground perspective image is synthesized, there is still a big difference between the ground image and the synthetic image in terms of illumination and perspective, and the commonly used scale-invariant feature transform (SIFT, scale-invariant feature transform) feature is not enough to deal with it. The ASIFT (affine-SIFT) feature is used in this step.

为验证本步骤图像合成方法的有效性并对SIFT特征与ASIFT特征的性能进行比较，本实施例分别采用SIFT特征与ASIFT特征进行了合成与地面图像匹配以及航拍与地面图像匹配。其中，地面图像也是通过步骤S100中基于词袋模型的自适应视频抽帧方法从地面机器人采集的视频中抽取获得。在进行图像匹配时，通过检索不同数量的与当前地面图像最近似的合成图像与航拍图像，发现当经过基本矩阵验证后的匹配点数仍大于16时，可以认为这对图像是匹配的。图像匹配结果如图7所示(图中x轴为检索图像数量，y轴为匹配图像对数量的对数)。由图7可知，采用ASIFT进行合成与地面图像匹配得到的匹配对数分别大约是采用ASIFT进行航拍与地面图像匹配，采用SIFT进行合成与地面图像匹配以及采用SIFT进行航拍与地面图像匹配的6倍，8倍与19倍。In order to verify the effectiveness of the image synthesis method in this step and compare the performance of SIFT features and ASIFT features, this embodiment uses SIFT features and ASIFT features to perform synthesis and ground image matching and aerial photography and ground image matching. Wherein, the ground image is also extracted from the video collected by the ground robot through the adaptive video frame extraction method based on the bag-of-words model in step S100. When performing image matching, by retrieving different numbers of synthetic images and aerial images that are the closest to the current ground image, it is found that when the number of matching points after the basic matrix verification is still greater than 16, the pair of images can be considered to be matched. The image matching results are shown in Figure 7 (the x-axis in the figure is the number of retrieved images, and the y-axis is the logarithm of the number of matching image pairs). It can be seen from Figure 7 that the number of matching pairs obtained by using ASIFT for synthesis and ground image matching is about 6 times that of using ASIFT for aerial photography and ground image matching, using SIFT for synthesis and ground image matching, and using SIFT for aerial photography and ground image matching , 8 times and 19 times.

给出第一帧地面图像与检索的合成图像之间的二维匹配点，可以通过光线投射的方式在航拍地图上获取对应的三维空间点。这样一来可以采用基于透视n点(PnP，perspective-n-point)的方法实现第一帧地面图像的定位。具体来说，给定二维到三维对应点与地面相机内参数，相机位姿通过RANSAC采用不同的PnP算法进行求解。采用的PnP算法包括P3P，AP3P与EPnP。当上述算法对应的内点数有至少一种超过16个时，可以认为此次位姿估计为一次成功的估计，并将该相机的位姿定为PnP结果中内点数量最多的那一个。在本实施例的RANSAC过程中，一共进行了500次随机抽样，且将距离阈值设为4px。Given the two-dimensional matching points between the first ground image and the retrieved synthetic image, the corresponding three-dimensional space points can be obtained on the aerial map by ray casting. In this way, a method based on perspective-n-point (PnP, perspective-n-point) can be used to realize the positioning of the first frame of the ground image. Specifically, given the corresponding points from 2D to 3D and the internal parameters of the ground camera, the camera pose is solved by RANSAC using different PnP algorithms. The PnP algorithms adopted include P3P, AP3P and EPnP. When at least one of the interior points corresponding to the above algorithm exceeds 16, it can be considered that this pose estimation is a successful estimation, and the pose of the camera is set as the one with the largest number of interior points in the PnP result. In the RANSAC process of this embodiment, a total of 500 random samplings are performed, and the distance threshold is set to 4px.

(2)移动机器人定位(2) Mobile robot positioning

地面机器人在室内场景中运动并采集视频时，可以通过轮子里程计对其粗略定位。在本步骤中，通过匹配地面与合成图像将地面机器人全局式地定位至航拍地图上以修正机器人粗略定位结果。只对抽取的地面视频帧，而非全部视频帧进行位姿修正。这是因为：(1)地面机器人在室内运动相对缓慢，在较短时间内不会严重偏离规划路径；(2)每次进行全局视觉定位需要耗时大约0.5s，且时间主要耗在ASIFT特征提取上。需要注意的是，对于某些抽取的视频帧，由于用于PnP的内点数量不足，视觉定位并不能一直成功。When the ground robot moves in the indoor scene and collects video, it can be roughly positioned by the wheel odometer. In this step, the ground robot is globally positioned on the aerial map by matching the ground and the synthetic image to correct the rough positioning result of the robot. The pose correction is only performed on the extracted ground video frames, not all video frames. This is because: (1) the ground robot moves relatively slowly indoors, and will not seriously deviate from the planned path in a short period of time; (2) it takes about 0.5s to perform global visual positioning each time, and the time is mainly spent on ASIFT features Extract on. It should be noted that for some extracted video frames, the visual localization is not always successful due to the insufficient number of inliers for PnP.

假设上一次成功定位的地面图像的位置与朝向分别记为c_A与n_A，而当前待定位的地面图像通过轮子里程计得到的粗略位置与朝向分别记为c_B与n_B。在此，基于粗略定位结果，而非基于图像检索的方法查找当前地面图像的候选匹配合成图像。该方法的示意图如图8所示，图中c_A与n_A为上一次成功定位的地面图像的位置与朝向，c_B与n_B为当前的地面图像的粗略位置与朝向，图中的圆表示查找范围，该圆圆心为c_B，半径为r_B，图中三角形表示虚拟相机位姿，浅灰色的三角形表示选中的合成图像而深灰色的三角形表示未选中的合成图像。当合成图像满足如下两个条件时，将对其与当前地面图像进行匹配：(1)合成图像位于圆心为c_B，半径为r_B的圆中，其中r_B＝max(‖c_B-c_A‖,β)且β＝2m；(2)合成图像朝向与n_B的夹角小于90°。在此用到一个可变半径r_B的原因是随着机器人的运动，通过机器人内置传感器获取的相对位姿的漂移会越来越严重。在对当前地面图像与得到的候选匹配合成图像进行匹配之后，当前地面图像采用类似初始机器人定位中的方法，通过基于PnP的RANSAC的方法实现定位。如果定位结果在位置和朝向上与粗略定位结果偏差足够小(本实施例中位置偏差小于5m，朝向偏差小于30°)，则当前地面图像定位成功。确认为机器人的位姿已通过当前定位成功的地面图像全局修正，并将轮子里程计中的位姿重置为当前基于视觉的定位结果。本步骤中未定位成功的地面图像将在后续室内场景重建过程中重新定位。Assume that the position and orientation of the ground image successfully positioned last time are recorded as c _A and n _A respectively, and the rough position and orientation obtained by the wheel odometer of the current ground image to be positioned are respectively recorded as c _B and n _B . Here, the candidate matching synthetic image of the current ground image is searched based on the rough positioning results, not based on the image retrieval method. The schematic diagram of this method is shown in Figure 8, in which c _A and n _A are the position and orientation of the ground image successfully positioned last time, c _B and n _B are the rough position and orientation of the current ground image, and the circle in the figure Indicates the search range, the center of the circle is c _B , the radius is r _B , the triangle in the figure represents the virtual camera pose, the light gray triangle represents the selected composite image and the dark gray triangle represents the unselected composite image. When the synthesized image satisfies the following two conditions, it will be matched with the current ground image: (1) The synthesized image is located in a circle with center c _B and radius r _B , where r _B =max(∥c _B -c _A ‖, β) and β=2m; (2) The angle between the composite image orientation and n _B is less than 90°. The reason for using a variable radius r _B here is that as the robot moves, the drift of the relative pose acquired by the robot's built-in sensor will become more and more serious. After matching the current ground image with the obtained candidate matching synthetic image, the current ground image adopts a method similar to the method in the initial robot localization, and achieves localization through the method based on PnP RANSAC. If the position and orientation deviations of the positioning results from the rough positioning results are small enough (in this embodiment, the position deviation is less than 5m, and the orientation deviation is less than 30°), then the current ground image positioning is successful. It is confirmed that the pose of the robot has been globally corrected by the ground image of the current localization success, and the pose in the wheel odometer is reset to the current vision-based localization result. Ground images that are not successfully positioned in this step will be relocated in the subsequent indoor scene reconstruction process.

步骤S302，对室内场景的地面视角视频，采用基于词袋模型的自适应视频抽帧方法抽取图像帧，得到室内场景的地面视角图像集合。Step S302 , for the ground-view video of the indoor scene, image frames are extracted using an adaptive video frame extraction method based on the bag-of-words model, to obtain a ground-view image set of the indoor scene.

本步骤通过步骤S100中基于词袋模型的自适应视频抽帧方法，对所获取的室内场景的地面视角视频进行图像帧的抽取，得到室内场景的地面视角图像集合，由于方法一致，此处不再赘述。In this step, through the adaptive video frame extraction method based on the bag-of-words model in step S100, image frames are extracted from the obtained ground-view video of the indoor scene to obtain a set of ground-view images of the indoor scene. Since the method is consistent, it is not described here Let me repeat.

在机器人定位与视频采集后，并非所有从地面视频抽取的帧均已成功定位至航拍地图。然而，为获取完整的室内场景重建结果，需要定位并融合所有由(航拍与地面)视频中抽取得到的图像。在此，首先提出了一种批量式定位之前未成功定位的地面图像的流程。然后，将地面与合成图像匹配内点连入原始特征点轨迹中，并通过捆绑调整(BA，bundleadjustment)实现航拍与地面点云的融合。最后，通过融合航拍与地面图像以获取完整、稠密的室内场景重建结果。After robot localization and video acquisition, not all frames extracted from the ground video were successfully localized to the aerial map. However, to obtain a complete indoor scene reconstruction result, all images extracted from (aerial and terrestrial) videos need to be localized and fused. Here, a pipeline for batch-wise localization of previously unsuccessfully localized ground images is first proposed. Then, the matching interior points of the ground and the synthetic image are connected to the original feature point trajectory, and the fusion of the aerial photography and the ground point cloud is realized through bundle adjustment (BA, bundle adjustment). Finally, a complete and dense indoor scene reconstruction result is obtained by fusing aerial and ground images.

步骤S401，获取地面视角图像集合中每一幅图像对应的地面相机在所述航拍地图中的位置。Step S401, acquiring the position of the ground camera corresponding to each image in the ground perspective image set in the aerial map.

为定位步骤S301中未成功定位的地面图像，本发明提出了一种批量式相机定位流程。在每个相机定位循环中，尽量定位更多的相机。此处用于相机定位的二维到三维对应点中的三维空间点并不仅包括在SfM过程中重建得到的空间点，还包括通过光线投射与航拍地图(三维网格)相交得到的空间点。每个批量式相机定位循环中包括三个步骤：(1)相机定位、(2)场景扩展与捆绑调整(BA，bundle adjustment)、(3)相机过滤，其流程图如图9所示。在进行批量式相机定位之前，先对从地面视频中抽帧得到的图像进行匹配并将匹配点连接成特征点轨迹。对于至少有两幅已成功定位的可见图像的特征点轨迹，通过三角测量的方式求取其空间坐标。In order to locate the ground images that were not successfully located in step S301, the present invention proposes a batch camera positioning process. In each camera positioning loop, try to position as many cameras as possible. The 3D spatial points in the 2D to 3D corresponding points used for camera positioning here include not only the spatial points reconstructed in the SfM process, but also the spatial points obtained by intersecting the aerial map (3D mesh) through ray casting. Each batch camera positioning cycle includes three steps: (1) camera positioning, (2) scene expansion and bundle adjustment (BA, bundle adjustment), (3) camera filtering, and its flowchart is shown in Figure 9 . Before performing batch camera positioning, the images obtained by extracting frames from the ground video are first matched and the matching points are connected into feature point trajectories. For the feature point trajectories of at least two visible images that have been successfully positioned, their spatial coordinates are calculated by means of triangulation.

步骤S4011，相机定位。Step S4011, camera positioning.

有两种方式获取二维三维对应点以定位当前未定位成功的地面图像：(1)航拍地图，对于当前未成功定位的地面图像中的二维特征点，可以获取其在成功定位的图像中的匹配点。然后从成功定位的相机光心向这些匹配点投射射线，投射的射线与航拍地图的交点即为当前未成功定位的地面图像中的二维特征点对应的三维空间点。(2)地面特征点轨迹，给出当前通过三角测量得到的地面特征点轨迹，可以通过先前地面图像之间的匹配结果获取当前未成功定位的地面图像中对应的二维特征点。当前未定位成功的地面相机可利用上述两种二维三维对应点通过基于PnP的RANSAC的方法实现定位，而定位结果采用两结果中内点数多的那一个。将此处通过两种二维三维对应点实现相机定位的方法与只用其中任意一种的方法进行了比较，结果如图10所示，图中示出了(1)基于航拍地图与地面特征点轨迹、(2)仅基于航拍地图、(3)仅基于地面特征点轨迹的批量式相机定位结果，图中x轴为批量式相机定位循环次数，y轴为成功定位的相机数量；当x＝0时对应的y值为在步骤S300中成功定位的相机数量。由图10可知，经过若干迭代循环，三种方法均可定位同样数量的相机。然而，本实施例中通过两种二维三维对应点实现相机定位的方法所需的迭代循环次数最少(仅需5次，而其他两种方法分别需要6次与8次)。There are two ways to obtain 2D and 3D corresponding points to locate the ground image that is currently not successfully positioned: (1) Aerial map, for the 2D feature points in the ground image that is currently not successfully positioned, you can obtain its position in the successfully positioned image matching point. Then a ray is projected from the optical center of the successfully positioned camera to these matching points, and the intersection point of the projected ray and the aerial map is the 3D space point corresponding to the 2D feature point in the currently unsuccessfully positioned ground image. (2) Ground feature point trajectories. Given the current ground feature point trajectories obtained through triangulation, the corresponding two-dimensional feature points in the current unsuccessful ground image can be obtained through the matching results between previous ground images. Ground cameras that have not been successfully positioned at present can use the above two 2D and 3D corresponding points to achieve positioning through the PnP-based RANSAC method, and the positioning result uses the one with the most internal points among the two results. The method of realizing camera positioning through two 2D and 3D corresponding points is compared with the method using only one of them. point trajectory, (2) only based on the aerial map, (3) the batch camera positioning results based only on the ground feature point trajectory, the x-axis in the figure is the number of batch camera positioning cycles, and the y-axis is the number of successfully positioned cameras; when x =0, the corresponding y value is the number of cameras successfully positioned in step S300. It can be seen from Figure 10 that after several iterative cycles, the three methods can locate the same number of cameras. However, in this embodiment, the method for realizing camera positioning through two 2D and 3D corresponding points requires the least number of iterative cycles (only 5 times, while the other two methods need 6 times and 8 times respectively).

步骤S4012，场景扩展与BA。Step S4012, scene extension and BA.

在相机定位之后，根据新定位的相机对地面特征点轨迹进行三角测量以实现场景扩展。为提高相机位姿与场景点的精度，在三角测量后对已定位的地面相机地位姿与三角测量得到的地面特征点轨迹的空间位置通过BA进行优化。After camera localization, ground feature point trajectories are triangulated according to the newly localized camera to achieve scene expansion. In order to improve the accuracy of the camera pose and scene points, after triangulation, the localized ground camera pose and the spatial position of the ground feature point trajectory obtained by triangulation are optimized by BA.

步骤S4013，相机过滤。Step S4013, camera filtering.

考虑到方法的鲁棒性，在BA后对定位成功的相机加入了一步相机过滤的操作。若在本次迭代循环中新定位成功的相机，经BA优化后的位置或朝向与其粗略定位结果(轮子里程计获取的定位结果)偏差较大(位置偏差大于5m或朝向偏差大于30°)的话，判定此定位结果不可靠并将其滤除。该步骤中需要注意的是，在当前迭代循环中滤除的相机在后续迭代循环中仍可成功定位。Considering the robustness of the method, a camera filtering operation is added to the successfully located cameras after BA. If the position or orientation of the newly positioned camera in this iterative cycle is greatly deviated from its rough positioning result (the positioning result obtained by the wheel odometer) after BA optimization (the position deviation is greater than 5m or the orientation deviation is greater than 30°). , the positioning result is determined to be unreliable and filtered out. It should be noted in this step that the cameras filtered out in the current iterative cycle can still be successfully located in the subsequent iterative cycle.

上述三个步骤迭代进行，直至所有相机均成功定位或者不再有相机可以成功定位。批量式相机定位的过程如图11所示，图中棱锥表示定位成功的相机位姿。第0次迭代表示在步骤S300中的相机定位结果。The above three steps are performed iteratively until all cameras are successfully located or no more cameras can be successfully located. The process of batch camera positioning is shown in Figure 11. The pyramid in the figure represents the pose of the successfully positioned camera. The 0th iteration represents the result of camera positioning in step S300.

步骤S402，将地面视角图像与合成图像匹配点连入原始的航拍与地面特征点轨迹中，生成跨视图的约束。Step S402, connecting the matching points of the ground perspective image and the synthetic image into the original aerial photography and ground feature point trajectory to generate cross-view constraints.

为通过BA融合航拍与地面点云，需要引入航拍与地面图像之间的约束。在此，上述跨视图约束可通过由步骤S300中通过匹配地面与合成图像获取的图像匹配点生成的航拍与地面特征点轨迹提供。匹配的地面图像特征点可通过查询其序号较为便捷地连入原始地面特征点轨迹中。但是，尽管合成图像由航拍图像生成，想要将匹配的合成图像特征点连入原始航拍特征点轨迹中却没那么容易。这是因为用于与地面图像匹配的合成图像特征点是在合成图像上重新提取得到的。在本步骤中，通过光线投射与点投影的方式将地面与合成图像匹配点拓展至航拍视图，该过程的示意图如图12所示，图中C_i(i＝1,2,3)为航拍相机，X_j(j＝1,2,3)为对应于匹配的合成图像特征点的空间点，t_ij为点X_j在相机C_i上的投影，t_1j-t_2j-t_2j(j＝1,2)为第j个跨航拍视图的特征点轨迹。具体来说，先通过光线投射的方式在航拍地图上获取匹配的合成图像特征点对应的空间点，然后将获取的空间点投影至其可见航拍图像上以生产航拍与地面特征点轨迹。In order to fuse aerial photography and ground point clouds through BA, it is necessary to introduce constraints between aerial photography and ground images. Here, the above-mentioned cross-view constraint can be provided by the aerial photography and ground feature point trajectories generated by the image matching points obtained by matching the ground and the synthetic image in step S300. The matched ground image feature points can be easily connected to the original ground feature point trajectory by querying their serial numbers. However, although the synthetic image is generated from the aerial image, it is not so easy to connect the matching synthetic image feature points to the original aerial feature point trajectory. This is because the synthetic image feature points used for matching with the ground image are re-extracted on the synthetic image. In this step, the ground and synthetic image matching points are extended to the _aerial view by means of ray projection and point projection. camera, X _j (j=1,2,3) is the spatial point corresponding to the feature point of the matched synthetic image, t _ij is the projection of point X _j on camera C _i , t _1j -t _2j -t _2j (j =1,2) is the feature point trajectory of the jth inter-aerial view. Specifically, the spatial points corresponding to the matching synthetic image feature points are obtained on the aerial map by ray projection, and then the acquired spatial points are projected onto the visible aerial image to produce aerial and ground feature point trajectories.

步骤S403，通过BA对航拍地图与地面视角图像点云进行优化。Step S403, optimize the point cloud of the aerial map and the ground perspective image through BA.

本步骤中，采用Ceres库，通过最小化反投影误差的方式对连接生成的航拍与地面特征点轨迹，原始的(航拍与地面)特征点轨迹，所有(航拍与地面)相机的内外参数进行全局优化。In this step, the Ceres library is used to minimize the back-projection error to connect the generated aerial and ground feature point trajectories, the original (aerial and ground) feature point trajectories, and the internal and external parameters of all (aerial and ground) cameras. optimization.

步骤S404，利用通过步骤S403优化得到的航拍与地面相机位姿，融合航拍与地面图像进行稠密重建，获取室内场景的稠密模型。Step S404, using the aerial photography and ground camera poses optimized in step S403, fusing aerial photography and ground images to perform dense reconstruction to obtain a dense model of the indoor scene.

由于在优化过程中引入了跨航拍与地面视图的约束，且稠密重建过程中融合了航拍与地面图像，重建得到的模型比仅用单一来源的图像重建得到的模型更加完整、精确。Due to the introduction of cross-aerial and ground view constraints in the optimization process, and the fusion of aerial and ground images in the dense reconstruction process, the reconstructed model is more complete and accurate than the model reconstructed with only a single source of image.

为了对本发明实施例的融合航拍与地面视角图像的场景建模方法进行验证，下面基于如图13所示的采集航拍与地面元数据的实验设备，以及采集到的两组室内场景数据集，在这两组数据集上对本实施例方法进行了测试。In order to verify the scene modeling method for the fusion of aerial photography and ground perspective images in the embodiment of the present invention, based on the experimental equipment for collecting aerial photography and ground metadata as shown in Figure 13, and the collected two sets of indoor scene data sets, in The method of this embodiment is tested on these two sets of data sets.

1、数据集1. Dataset

由于目前几乎没有针对室内场景的航拍与地面图像公开数据集，在此本测试中自行采集了两组用于方法测评的室内场景数据集。具体来说，采用DJI Spark迷你飞行器进行航拍视角场景采集，采用安装在TurtleBot上的GoPro HERO4进行地面视角场景采集，元数据采集设备如图13所示，从左到右分别为地面上的TurtleBot、空中的DJISpark、桌面上的DJI Spark。采集的航拍与地面元数据的形式均为分辨率为1080p，帧率为25FPS的视频。采集的两个室内场景数据集，分别叫做Room与Hall。一些关于Room与Hall数据集的信息如表1所示。Room与Hall数据集中的示例航拍图像与生成的三维航拍地图分别如图2与图14所示。由图2与图14可知，Hall数据集的航拍地图相对于Room数据集的航拍地图质量更差且规模更大。然而，由后续的方法测评内容可知，本发明方法在上述两个数据集上均可取得预期的结果，这说明本发明方法有着较好的鲁棒性与可拓展性。Since there are almost no public datasets of aerial photography and ground images for indoor scenes, two sets of indoor scene datasets were collected for method evaluation in this test. Specifically, the DJI Spark mini-aircraft is used for aerial perspective scene collection, and the GoPro HERO4 installed on the TurtleBot is used for ground perspective scene collection. The metadata collection equipment is shown in Figure 13, from left to right are TurtleBot on the ground, DJISpark in the air, DJI Spark on the desktop. The collected aerial photography and ground metadata are in the form of videos with a resolution of 1080p and a frame rate of 25FPS. The collected two indoor scene data sets are called Room and Hall respectively. Some information about the Room and Hall datasets is shown in Table 1. The sample aerial images and the generated 3D aerial maps in the Room and Hall datasets are shown in Figure 2 and Figure 14, respectively. It can be seen from Figure 2 and Figure 14 that the aerial map of the Hall dataset is worse in quality and larger in scale than the aerial map of the Room dataset. However, it can be seen from the subsequent method evaluation content that the method of the present invention can achieve the expected results on the above two data sets, which shows that the method of the present invention has better robustness and scalability.

表1Table 1

数据集data set RoomRoom HallHall 航拍视频长度/sAerial video length/s 218218 494494 地面视频长度/sGround video length/s 6161 113113 覆盖面积/m<sup>2</sup>Coverage area/m<sup>2</sup> 3030 130130

另外，在Room与Hall数据集上的虚拟相机位姿计算与机器人路径规划结果分别展示于图2与图14的最右侧。如图所示，通过本发明方法，用于虚拟相机位姿计算与机器人路径规划的地平面可以成功检测，且生成的虚拟相机与机器人路径较为均匀地覆盖了室内场景。通过本发明的虚拟相机位姿计算方法在Room与Hall数据集上分别生成了60与384个虚拟相机。图14中前三列为示例航拍图像及其对应的三维航拍地图区域。第四列为整个三维航拍地图。第五列为在航拍地图上的机器人路径规划与虚拟相机位姿计算结果，其中地平面标为浅灰色，规划路径标为线段，虚拟相机位姿由棱锥表示。In addition, the virtual camera pose calculation and robot path planning results on the Room and Hall datasets are shown on the far right of Figure 2 and Figure 14, respectively. As shown in the figure, through the method of the present invention, the ground plane used for virtual camera pose calculation and robot path planning can be successfully detected, and the generated virtual camera and robot path evenly cover the indoor scene. 60 and 384 virtual cameras are respectively generated on the Room and Hall data sets through the virtual camera pose calculation method of the present invention. The first three columns in Fig. 14 are example aerial images and their corresponding three-dimensional aerial map areas. The fourth column is the entire 3D aerial map. The fifth column is the robot path planning and virtual camera pose calculation results on the aerial map, where the ground plane is marked in light gray, the planned path is marked as a line segment, and the virtual camera pose is represented by a pyramid.

2、自适应抽帧结果2. Adaptive frame drawing result

通过本发明中的自适应抽帧方法，分别从Room数据集的航拍与地面视频中抽取了271与112帧图像，从Hall数据集的航拍与地面视频中抽取了721与250帧图像。为验证本发明中抽帧方法的有效性，在Hall数据集的航拍视频上对本发明方法与等间隔抽帧方法进行了对比实验。采用本发明的自适应抽帧方法在长度为494s，帧率为25FPS的视频上抽取得到了721帧图像，对于等间隔抽帧方法，每隔17帧抽取1帧图像(494×25/721≈17)，共计抽取730帧图像。然后，将两种不同抽帧方法得到的视频帧通过开源SfM系统COLMAP进行相机标定，结果如图15所示，左图：自适应抽取的视频帧的COLMAP结果，其中的视频帧成功标定；中图和右图：等间隔抽取的视频帧的COLMAP结果，其中的视频帧成功标定，但断开为两部分。中图与右图分别对应着左图中的两个矩形区域。左图与右图中的圆圈定部分展示了在同一拐角处的对比结果。由图15可知，由于相比等间隔的方法，本发明方法抽取的视频帧连接性更好，因此通过对其进行重建，可获得一致的航拍地图。另外，图15中的黑圆表明，为获取更加完整的航拍地图，需要在拐角处对视频进行更加密集的抽帧操作。Through the adaptive frame extraction method in the present invention, 271 and 112 frames of images are respectively extracted from the aerial photography and ground videos of the Room dataset, and 721 and 250 frames of images are extracted from the aerial photography and ground videos of the Hall dataset. In order to verify the effectiveness of the frame extraction method in the present invention, a comparative experiment was carried out between the method of the present invention and the frame extraction method at equal intervals on the aerial video of the Hall dataset. The self-adaptive frame extraction method of the present invention is used to extract 721 frames of images on a video with a length of 494s and a frame rate of 25FPS. For the equal interval frame extraction method, an image is extracted every 17 frames (494×25/721≈ 17), a total of 730 frames of images were extracted. Then, the video frames obtained by the two different frame extraction methods are calibrated through the open source SfM system COLMAP, and the result is shown in Figure 15. Left: the COLMAP result of the adaptively extracted video frames, where The video frames of were successfully calibrated; the middle and right images: the COLMAP results of video frames extracted at equal intervals, where The video frame is successfully calibrated, but is broken into two parts. The middle image and the right image respectively correspond to the two rectangular areas in the left image. The circled parts in the left and right images show the comparison results at the same corner. It can be seen from Fig. 15 that, compared with the equal interval method, the video frames extracted by the method of the present invention are more connected, so by reconstructing them, a consistent aerial map can be obtained. In addition, the black circles in Figure 15 indicate that in order to obtain a more complete aerial map, it is necessary to perform more intensive frame extraction operations on the video at the corners.

3、地面相机定位结果3. Ground camera positioning results

为验证本发明的批量式相机定位以及航拍与地面图像融合方法，在此对批量式相机定位与航拍与地面图像融合后的相机定位结果以及COLMAP结果进行了定性与定量比较。需要注意的是，对于COLMAP来说，地面相机位姿并未进行初始化，仅通过图像本身进行标定，即在步骤S300中借助航拍地图的相机定位结果并未提供给COLMAP用作先验。In order to verify the batch camera positioning and aerial photography and ground image fusion method of the present invention, the qualitative and quantitative comparisons of batch camera positioning, aerial photography and ground image fusion camera positioning results and COLMAP results are carried out here. It should be noted that for COLMAP, the ground camera pose is not initialized, and only the image itself is calibrated, that is, the camera positioning result of the aerial map in step S300 is not provided to COLMAP as a priori.

定性对比结果如图16所示，第一行：Room数据集结果；第二行：Hall数据集结果；从左到右：航拍与地面图像融合后的结果、地面相机批量式定位后的结果、COLMAP标定结果；图中矩形标示出了错误的相机位姿。由图16可知，对于Room数据集，通过三种对比方法获取的相机位姿较为类似，这是由于Room数据集的场景结构较为简单。而对于Hall数据集来说，通过COLMAP计算得到的相机轨迹在场景的左边部分有着明显的错误。这是由于重复纹理与弱纹理导致地面图像之间的匹配结果包含较多的匹配外点，这样的话会导致增量式SfM系统产生较为明显的场景漂移现象。相比之下，对于批量式相机定位来说，由于部分地面图像已初步定位至航拍地图，其定位结果仅存在一些较为轻微的场景漂移情况。并且，上述错误的相机姿态均在后续的航拍与地面图像融合阶段修正过来。这是由于，在图像融合时的全局优化中引入了连接生成的航拍与地面特征点轨迹。上述结果表明，通过融合航拍与地面图像对地面相机进行定位相比于仅用地面图像来说更为鲁棒。The qualitative comparison results are shown in Figure 16. The first row: the results of the Room dataset; the second row: the results of the Hall dataset; from left to right: the results of the fusion of aerial photography and ground images, the results of batch positioning of ground cameras, COLMAP calibration result; the rectangle in the figure indicates the wrong camera pose. It can be seen from Figure 16 that for the Room dataset, the camera poses obtained by the three comparison methods are relatively similar, because the scene structure of the Room dataset is relatively simple. For the Hall dataset, the camera trajectory calculated by COLMAP has obvious errors in the left part of the scene. This is due to the fact that the matching results between ground images contain more matching outliers due to repeated textures and weak textures, which will lead to obvious scene drift in the incremental SfM system. In contrast, for batch camera positioning, since some ground images have been initially positioned on the aerial map, there are only some slight scene drifts in the positioning results. Moreover, the wrong camera poses mentioned above are all corrected in the subsequent stage of aerial photography and ground image fusion. This is due to the introduction of connection generated aerial photography and ground feature point trajectories in the global optimization of image fusion. The above results show that localization of ground cameras by fusing aerial and ground images is more robust than using ground images alone.

4、室内场景重建结果4. Indoor scene reconstruction results

最后，对本发明中的室内场景重建算法进行了定性与定量测评。本次测试比较了本发明中的室内重建结果与仅采用航拍或地面图像进行重建的结果，定性比较结果如图17所示，第一列：Room数据集结果；第二列：第一列中矩形区域的放大图；第三列：Hall数据集结果；第四列：第三列中矩形区域的放大图。从上到下：仅用地面图像，仅用航拍图像，利用融合的航拍与地面图像的结果。需要注意的是：(1)对于本发明中的室内重建算法，采用的相机位姿为融合航拍与地面图像之后的相机位姿；(2)对于仅采用地面图像的方法，采用的相机位姿为经过批量式相机定位之后的相机位姿；(3)对于仅采用航拍相机的方法，采用的相机位姿为经过SfM估计得到的相机位姿。由图17可知，尽管由于遮挡与弱纹理情况的存在，重建结果中仍不可避免地缺失了部分区域，相对于仅采用单独一种图像进行重建，通过融合航拍与地面图像的室内重建结果更为完整。Finally, qualitative and quantitative evaluations are carried out on the indoor scene reconstruction algorithm in the present invention. In this test, the indoor reconstruction results in the present invention are compared with the reconstruction results using only aerial photography or ground images. The qualitative comparison results are shown in Figure 17, the first column: the results of the Room dataset; the second column: the results in the first column Magnified view of the rectangular area; third column: Hall dataset results; fourth column: enlarged view of the rectangular area in the third column. From top to bottom: Ground imagery only, aerial imagery only, results using fused aerial and ground imagery. It should be noted that: (1) For the indoor reconstruction algorithm in the present invention, the camera pose used is the camera pose after fusing aerial photography and ground images; (2) For the method using only ground images, the camera pose used is the camera pose after batch camera positioning; (3) For the method that only uses aerial cameras, the camera pose used is the camera pose estimated by SfM. It can be seen from Fig. 17 that although due to the existence of occlusion and weak texture, some areas are inevitably missing in the reconstruction results. Compared with only using a single image for reconstruction, the indoor reconstruction results by fusing aerial photography and ground images are more accurate. whole.

本发明第二实施例的一种融合航拍与地面视角图像的场景建模系统，该系统包括航拍地图构建模块、合成图像获取模块、视角图像集合获取模块、室内场景模型获取模块；According to the second embodiment of the present invention, a scene modeling system that integrates aerial photography and ground perspective images, the system includes an aerial photography map construction module, a synthetic image acquisition module, a perspective image collection acquisition module, and an indoor scene model acquisition module;

所述室内场景模型获取模块，配置为基于所述合成图像，将所述航拍视角图像与所述地面视角图像进行融合，获取室内场景模型。，包括The indoor scene model acquisition module is configured to fuse the aerial perspective image with the ground perspective image based on the synthesized image to acquire an indoor scene model. ,include

所属技术领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的系统的具体工作过程及有关说明，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process and related descriptions of the above-described system can refer to the corresponding process in the foregoing method embodiments, and will not be repeated here.

需要说明的是，上述实施例提供的融合航拍与地面视角图像的场景建模系统，仅以上述各功能模块的划分进行举例说明，在实际应用中，可以根据需要而将上述功能分配由不同的功能模块来完成，即将本发明实施例中的模块或者步骤再分解或者组合，例如，上述实施例的模块可以合并为一个模块，也可以进一步拆分成多个子模块，以完成以上描述的全部或者部分功能。对于本发明实施例中涉及的模块、步骤的名称，仅仅是为了区分各个模块或者步骤，不视为对本发明的不当限定。It should be noted that the scene modeling system for fusing aerial photography and ground perspective images provided by the above-mentioned embodiments is only illustrated by the division of the above-mentioned functional modules. In practical applications, the above-mentioned functions can be assigned to different function modules, that is, to decompose or combine the modules or steps in the embodiments of the present invention. For example, the modules in the above embodiments can be combined into one module, or can be further split into multiple sub-modules, so as to complete all or the steps described above. Some functions. The names of the modules and steps involved in the embodiments of the present invention are only used to distinguish each module or step, and are not regarded as improperly limiting the present invention.

本发明第三实施例的一种存储装置，其中存储有多条程序，所述程序适于由处理器加载并执行以实现上述的融合航拍与地面视角图像的场景建模方法。A storage device according to the third embodiment of the present invention stores a plurality of programs, and the programs are adapted to be loaded and executed by a processor to implement the above-mentioned scene modeling method for fusing aerial photography and ground perspective images.

本发明第四实施例的一种处理装置，包括处理器、存储装置；处理器，适于执行各条程序；存储装置，适于存储多条程序；所述程序适于由处理器加载并执行以实现上述的融合航拍与地面视角图像的场景建模方法。A processing device according to the fourth embodiment of the present invention includes a processor and a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing multiple programs; the program is suitable for being loaded and executed by the processor In order to realize the above-mentioned scene modeling method of fusing aerial photography and ground perspective images.

所属技术领域的技术人员可以清楚地了解到，为描述的方便和简洁，上述描述的存储装置、处理装置的具体工作过程及有关说明，可以参考前述方法实施例中的对应过程，在此不再赘述。Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process and related descriptions of the storage device and the processing device described above can refer to the corresponding process in the foregoing method embodiments, and will not be repeated here. repeat.

本领域技术人员应该能够意识到，结合本文中所公开的实施例描述的各示例的模块、方法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，软件模块、方法步骤对应的程序可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。为了清楚地说明电子硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以电子硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本发明的范围。Those skilled in the art should be able to realize that the modules and method steps described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two, and that the programs corresponding to the software modules and method steps Can be placed in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or known in the technical field any other form of storage medium. In order to clearly illustrate the interchangeability of electronic hardware and software, the composition and steps of each example have been generally described in terms of functions in the above description. Whether these functions are performed by electronic hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may implement the described functionality using different methods for each particular application, but such implementation should not be considered as exceeding the scope of the present invention.

术语“第一”、“第二”等是用于区别类似的对象，而不是用于描述或表示特定的顺序或先后次序。The terms "first", "second", etc. are used to distinguish similar items, and are not used to describe or represent a specific order or sequence.

术语“包括”或者任何其它类似用语旨在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备/装置不仅包括那些要素，而且还包括没有明确列出的其它要素，或者还包括这些过程、方法、物品或者设备/装置所固有的要素。The term "comprising" or any other similar term is intended to cover a non-exclusive inclusion such that a process, method, article, or apparatus/apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed, or Also included are elements inherent in these processes, methods, articles, or devices/devices.

至此，已经结合附图所示的优选实施方式描述了本发明的技术方案，但是，本领域技术人员容易理解的是，本发明的保护范围显然不局限于这些具体实施方式。在不偏离本发明的原理的前提下，本领域技术人员可以对相关技术特征作出等同的更改或替换，这些更改或替换之后的技术方案都将落入本发明的保护范围之内。So far, the technical solutions of the present invention have been described in conjunction with the preferred embodiments shown in the accompanying drawings, but those skilled in the art will easily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to relevant technical features, and the technical solutions after these changes or substitutions will all fall within the protection scope of the present invention.

Claims

1. A scene modeling method for merging aerial photography and ground perspective image, is characterized in that, comprises the following steps:

Step S100, obtaining the aerial perspective image of the indoor scene to be modeled, and constructing an aerial map;

Step S200, based on the aerial map, obtain a synthesized image through the method of synthesizing a ground perspective reference image from the aerial map;

Step S300, obtaining a set of ground perspective images through the ground perspective images collected by the ground camera;

Step S400, based on the synthesized image, the aerial perspective image and the ground perspective image are fused to obtain an indoor scene model.

2. The scene modeling method for fusing aerial photography and ground perspective images according to claim 1, characterized in that, in step S100, "acquire the aerial perspective images of the indoor scene to be modeled, and construct an aerial map", the method is :

For the aerial perspective video of the indoor scene, the image frame is extracted by using the adaptive video frame extraction method based on the bag-of-words model, and the aerial perspective image collection of the indoor scene is obtained;

Based on the set of aerial perspective images, an aerial map is constructed through an image modeling method.

3. The scene modeling method for merging aerial photography and ground perspective images according to claim 1, characterized in that, in step S200, "the method for synthesizing ground perspective reference images by aerial photography maps" is as follows:

Based on the aerial map, calculate the virtual camera pose;

Through the graph cut algorithm, the aerial map is obtained to obtain the synthetic image of the reference image of the ground perspective.

4. the scene modeling method of fusion aerial photography and ground perspective image according to claim 3, is characterized in that, "by graph cut algorithm, obtain aerial photography map and obtain the synthetic image of ground perspective reference image", its method is:

Among them, E(l) is the energy function in the graph cut process; is a set of two-dimensional triangles projected from the three-dimensional space grid visible to the virtual camera, and t _i is the i-th triangle among them; is the set of common sides of the triangles in the projected two-dimensional triangle set; l _i is the serial number of the aerial image of t _i ; D _i (l _i ) is the data item; V _i (l _i , l _j ) is the smoothing item;

When the spatial patch corresponding to t _i is visible in the l _i aerial image, the data item Otherwise D _i (l _i )=α, where is the scale median value of local features in the l _i -th aerial image, is the projected area of the spatial patch corresponding to t _i in the l _i aerial image, and α is a relatively large constant;

When l _i =l _j , the smoothing term V _i (l _i , l _j )=0; otherwise V _i (l _i ,l _j )=1.

5. The scene modeling method for fusing aerial photography and ground perspective images according to claim 1, characterized in that, in step S300, "obtain a ground perspective image collection through the ground perspective images collected by the ground camera", the method is:

Based on the planned path, the ground robot continuously collects ground perspective video through the ground camera set on it;

For the ground-view video of the indoor scene, the adaptive video frame extraction method based on the bag-of-words model is used to extract image frames, and the ground-view image collection of the indoor scene is obtained.

6. The scene modeling method of fusion aerial photography and ground perspective images according to claim 5, characterized in that, in the process of "the ground robot continuously collects ground perspective videos through the ground camera provided on it" based on the planned path, its positioning The method includes initial robot positioning, mobile robot positioning;

The initial robot positioning method is as follows: obtain the first frame of the video collected by the ground camera, obtain the initial position of the robot in the aerial map, and use this position as the starting point of the subsequent movement of the robot;

The positioning of the mobile robot is as follows: based on the initial position and the driving data of the robot at each time, the rough positioning of the robot position is performed, and by matching the video frame image collected at the current time with the composite image, the current position of the robot on the aerial map is obtained. The position in , and use this position to revise the position information of coarse positioning.

7. The scene modeling method for fusing aerial photography and ground perspective images according to claim 1, characterized in that step S400 "based on the composite image, fuses the aerial photography perspective image with the ground perspective image to obtain Indoor Scene Model", the method is:

Obtain the position of the ground camera corresponding to each image in the ground perspective image collection in the aerial map;

Connect the matching points of the ground perspective image and the synthetic image into the original aerial photography and ground feature point trajectory to generate cross-view constraints;

Optimize aerial maps and point clouds of ground perspective images through bundled adjustments;

Dense reconstruction of aerial maps and ground perspective images is used to obtain dense models of indoor scenes.

8. A scene modeling system that fuses aerial photography and ground perspective images, characterized in that the system includes an aerial photography map building module, a synthetic image acquisition module, a perspective image collection acquisition module, and an indoor scene model acquisition module;

The aerial map construction module is configured to obtain an aerial perspective image of the indoor scene to be modeled, and construct an aerial map;

The synthetic image acquisition module is configured to obtain a synthetic image based on the aerial map by synthesizing a ground perspective reference image from the aerial map;

The viewing angle image set acquisition module is configured to acquire a ground viewing angle image set through ground viewing angle images collected by a ground camera;

The indoor scene model acquisition module is configured to fuse the aerial perspective image with the ground perspective image based on the synthesized image to acquire an indoor scene model.

9. A storage device, wherein a plurality of programs are stored, wherein the programs are adapted to be loaded and executed by a processor to realize the scene of merging aerial photography and ground perspective images according to any one of claims 1-7 modeling method.

10. A processing device, comprising a processor and a storage device; the processor is suitable for executing various programs; the storage device is suitable for storing a plurality of programs; it is characterized in that the program is suitable for being loaded and executed by the processor to The scene modeling method for fusing aerial photography and ground perspective images described in any one of claims 1-7 is realized.