CN118314370A - Layered motion restoration structure method based on grid common view
- Publication number: CN118314370A
- Application number: CN202410410945.4A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06V10/757—Matching configurations of points or features
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
Abstract
The invention discloses a hierarchical motion restoration structure (Structure from Motion) method based on grid common view. The matched inlier feature points are further processed with a grid common-view method so that each image is assigned to the correct sub-scene during graph cutting; after cutting, each sub-scene is expanded by iteratively adding strongly associated images, raising the overlap rate between sub-scenes while keeping each sub-scene correctly reconstructable; and three-dimensional point information is fused during the final merge, improving the accuracy of the similarity transformations between partitions.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a layered motion restoration structure method based on grid common view.
Background
Computing camera poses and three-dimensional points from images acquired by unordered RGB cameras mainly proceeds by starting from a robust initial seed reconstruction and then registering cameras one by one with incremental Structure from Motion (SfM) via the Perspective-n-Point (PnP) method. After a camera is successfully registered, an additional bundle adjustment step optimizes the camera poses and 3D points, which makes incremental SfM robust and accurate. However, incremental SfM becomes inefficient on large-scale data sets and, because of the repeated optimization steps, can run into memory bottlenecks. Furthermore, the incremental way in which new views are added makes these methods susceptible to drift, despite additional re-triangulation steps. To overcome the inefficiency of incremental SfM while maintaining reconstruction robustness, a natural idea is to reconstruct in a divide-and-conquer manner. A hierarchical SfM algorithm generally first partitions the image set through graph cutting, reconstructs each partition incrementally or globally, then fuses the sub-reconstructions through similarity transformations, merging them by means of their shared images.
The result of the graph cut is therefore critical to the whole reconstruction pipeline, yet the traditional graph-cut method uses only the number of feature-point matches between images as the edge weights; even after geometric verification, edge weights that rely on raw match counts still carry large errors. Prior work infers the merging paths of the sub-scenes with a minimum-spanning-tree method and obtains the similarity transformations between partitions from the shared images, yielding an effective reconstruction result. But it neither further processes the inliers of the image feature matches nor considers anything beyond the common images during sub-scene merging, which can ultimately cause the sub-scene merge to fail.
Disclosure of Invention
The invention aims to provide a layered motion restoration structure method based on grid common view, which aims to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A layered motion restoration structure method based on grid common view specifically comprises the following steps:
S1, input images: input an RGB image set;
S2, image preprocessing: extract feature points from the images and match them;
S3, construct the scene graph: a graph structure formed by all associated images, in which the nodes are images and the edges carry the degree of association between images;
S4, partition the scene graph: divide the large scene graph into several small scene graphs to obtain several sub-scenes;
S5, expand the sub-scenes: expand the boundaries of each sub-scene;
S6, local SfM reconstruction: run an independent SfM reconstruction on the images in each sub-scene;
S7, merge the sub-scenes: merge the results of the independent SfM reconstructions from the previous step to obtain the complete reconstruction result.
As a preferred embodiment of the present invention: before S3 is performed, it must be determined whether two images are associated and their degree of association computed, which specifically comprises the following steps:
S2.1, uniformly divide each image into N² grid cells;
S2.2, take each grid cell of an image as a node of the common-view graph, where the edges between nodes represent the common-view relations between cells; if two cells share at least T feature tracks, they form a pair of common-view cells, and the distance between any two images I_i and I_j is defined on the basis of this grid common view;
S2.3, judge whether the two images are in common view;
S2.4, from low resolution to high resolution, label the layers as the first through the L-th layer and mark each grid cell with one of two states; if a common-view matching pair is found in a grid cell of the l-th layer, update that cell's state to full and increase the association score of the image pair by 2^l;
S2.5, compute the total association score between the two images as the average of the unilateral scores over all grid cells whose state is full.
As a preferred embodiment of the present invention: the distance between any two images in S2.2 is defined by the formula

dist(I_i, I_j) = min over all common-view cell pairs (g_a^i, g_b^j) of d(g_a^i, g_b^j)

where d(g_a^i, g_b^j) denotes the shortest-path distance in the common-view graph between grid cell g_a^i of image I_i and grid cell g_b^j of image I_j; dist(I_i, I_j) is thus the minimum distance over all possible common-view cell pairs between image I_i and image I_j.
As a preferred embodiment of the present invention: whether the two images in S2.3 are in common view is judged by the formula

dist(I_i, I_j) ≤ σ_dist

where σ_dist is a preset distance threshold; if the distance between I_i and I_j does not exceed σ_dist, they are determined to be a common-view pair.
As a preferred embodiment of the present invention: expanding the boundaries of the sub-scenes in S5 follows the conventional graph-cut-and-expand pattern, in which the final merge is enabled by creating overlapping regions between the partitions; the specific method is as follows:
S3.1, collect boundary images: identify, from the already constructed scene graph, those images that lie at the junction of different sub-scenes or are relatively isolated in the whole scene, and collect them into a set S;
S3.2, gradually expand the sub-partitions: perform the following steps for each sub-scene i in the scene graph;
S3.3, among the boundary images in the set S, find the boundary image with the highest association with images outside the partition as the expansion image, and add all of its associated images that do not yet belong to the partition;
S3.4, judge whether the size of the overlapping region meets the requirement: compute the proportion of newly added images in sub-scene i, and if this proportion reaches or exceeds the set threshold τ_overlap, end the expansion process for this sub-scene.
As a preferred embodiment of the present invention: the sub-scene merging operation of S7 specifically comprises the following steps:
S4.1, identify common three-dimensional points: in the two sub-scenes src and tgt to be merged, identify all common three-dimensional points observed by at least three feature points; denote the coordinate set of these points in the src sub-scene as src_xyz and in the tgt sub-scene as tgt_xyz;
S4.2, random sampling: randomly draw m points from src_xyz and tgt_xyz, named src_rand and tgt_rand respectively;
S4.3, estimate the similarity transformation: using src_rand and tgt_rand, estimate the similarity transformation matrix T from the src sub-scene to the tgt sub-scene;
S4.4, compute the reprojection error: with the estimated similarity transformation matrix T, compute the reprojection error E_src of each three-dimensional point of the src sub-scene in the tgt sub-scene, and iterate;
S4.5, iteration judgment: check whether the number of iterations has reached the preset upper limit; if so, end the iteration, otherwise return to S4.3.
As a preferred embodiment of the present invention: the method for estimating the similarity transformation in S4.3 specifically comprises the following steps:
S4.3.1, represent src_rand and tgt_rand as the vectors (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n), and compute their means μ_x and μ_y and variances σ_x² and σ_y²;
S4.3.2, from the μ_x, μ_y, σ_x², and σ_y² obtained in S4.3.1, compute the covariance matrix Σ_xy of the two sets of vectors;
S4.3.3, perform an SVD decomposition of the covariance matrix Σ_xy, then obtain the required similarity transformation between the two point clouds from the decomposed matrices and a constructed diagonal matrix, comprising a rotation matrix, a translation vector, and a scale.
As a preferred embodiment of the present invention: the covariance matrix Σ xy in S4.3.2 has a calculation formula:
SVD decomposition, rotation matrix, translation vector and scale calculation formula in S4.3.3 are:
R=USVT
t=μy-cRμx
As a preferred embodiment of the present invention: computing the reprojection error in S4.4 also requires obtaining the reprojection error E_tgt of the tgt sub-scene; if both E_src and E_tgt are smaller than the preset maximum threshold D_max, the three-dimensional point is regarded as an inlier.
As a preferred embodiment of the present invention: after each iteration in S4.4 ends, src_rand and tgt_rand must be updated, and the inlier count N_cur of the current iteration recorded.
Compared with the prior art, the invention has the beneficial effects that:
Based on the grid common-view method, the matched inlier feature points are further processed so that each image is assigned to the correct sub-scene during graph cutting; after cutting, each sub-scene is expanded by iteratively adding strongly associated images, raising the overlap rate between sub-scenes while keeping each sub-scene correctly reconstructable; and three-dimensional point information is fused during the final merge, improving the accuracy of the similarity transformations between partitions.
Drawings
FIG. 1 is a flow chart of a layered motion restoration structure of the present invention;
FIG. 2 is a schematic diagram of pyramid scoring according to the present invention;
fig. 3 is a sub-scene extension schematic of the present invention.
Detailed Description
The following is a clear and complete description of the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Referring to fig. 1, the present invention provides a technical solution:
A layered motion restoration structure method based on grid common view specifically comprises the following steps:
S1, input images: input an RGB image set;
S2, image preprocessing: extract feature points from the images and match them;
S3, construct the scene graph: a graph structure formed by all associated images, in which the nodes are images and the edges carry the degree of association between images;
S4, partition the scene graph: divide the large scene graph into several small scene graphs to obtain several sub-scenes;
S5, expand the sub-scenes: expand the boundaries of each sub-scene;
S6, local SfM reconstruction: run an independent SfM reconstruction on the images in each sub-scene;
S7, merge the sub-scenes: merge the results of the independent SfM reconstructions from the previous step to obtain the complete reconstruction result.
Further, before S3 is performed, it must be determined whether two images are associated and their degree of association computed, which specifically comprises the following steps:
S2.1, uniformly divide each image into N² grid cells;
S2.2, take each grid cell of an image as a node of the common-view graph, where the edges between nodes represent the common-view relations between cells; if two cells share at least T feature tracks, they form a pair of common-view cells, and the distance between any two images I_i and I_j is defined on the basis of this grid common view;
S2.3, judge whether the two images are in common view;
S2.4, from low resolution to high resolution, label the layers as the first through the L-th layer and mark each grid cell with one of two states; if a common-view matching pair is found in a grid cell of the l-th layer, update that cell's state to full and increase the association score of the image pair by 2^l;
S2.5, compute the total association score between the two images as the average of the unilateral scores over all grid cells whose state is full.
Further, the distance between any two images in S2.2 is defined by the formula

dist(I_i, I_j) = min over all common-view cell pairs (g_a^i, g_b^j) of d(g_a^i, g_b^j)

where d(g_a^i, g_b^j) denotes the shortest-path distance in the common-view graph between grid cell g_a^i of image I_i and grid cell g_b^j of image I_j; dist(I_i, I_j) is thus the minimum distance over all possible common-view cell pairs between image I_i and image I_j.
Further, whether the two images in S2.3 are in common view is judged by the formula

dist(I_i, I_j) ≤ σ_dist

where σ_dist is a preset distance threshold; if the distance between I_i and I_j does not exceed σ_dist, they are determined to be a common-view pair.
Further, expanding the boundaries of the sub-scenes in S5 follows the conventional graph-cut-and-expand pattern, in which the final merge is enabled by creating overlapping regions between the partitions; the specific method is as follows:
S3.1, collect boundary images: identify, from the already constructed scene graph, those images that lie at the junction of different sub-scenes or are relatively isolated in the whole scene, and collect them into a set S;
S3.2, gradually expand the sub-partitions: perform the following steps for each sub-scene i in the scene graph;
S3.3, among the boundary images in the set S, find the boundary image with the highest association with images outside the partition as the expansion image, and add all of its associated images that do not yet belong to the partition;
S3.4, judge whether the size of the overlapping region meets the requirement: compute the proportion of newly added images in sub-scene i, and if this proportion reaches or exceeds the set threshold τ_overlap, end the expansion process for this sub-scene.
Further, the sub-scene merging operation of S7 specifically comprises the following steps:
S4.1, identify common three-dimensional points: in the two sub-scenes src and tgt to be merged, identify all common three-dimensional points observed by at least three feature points; denote the coordinate set of these points in the src sub-scene as src_xyz and in the tgt sub-scene as tgt_xyz;
S4.2, random sampling: randomly draw m points from src_xyz and tgt_xyz, named src_rand and tgt_rand respectively;
S4.3, estimate the similarity transformation: using src_rand and tgt_rand, estimate the similarity transformation matrix T from the src sub-scene to the tgt sub-scene;
S4.4, compute the reprojection error: with the estimated similarity transformation matrix T, compute the reprojection error E_src of each three-dimensional point of the src sub-scene in the tgt sub-scene, and iterate;
S4.5, iteration judgment: check whether the number of iterations has reached the preset upper limit; if so, end the iteration, otherwise return to S4.3.
Further, the method for estimating the similarity transformation in S4.3 specifically comprises the following steps:
S4.3.1, represent src_rand and tgt_rand as the vectors (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n), and compute their means μ_x and μ_y and variances σ_x² and σ_y²;
S4.3.2, from the μ_x, μ_y, σ_x², and σ_y² obtained in S4.3.1, compute the covariance matrix Σ_xy of the two sets of vectors;
S4.3.3, perform an SVD decomposition of the covariance matrix Σ_xy, then obtain the required similarity transformation between the two point clouds from the decomposed matrices and a constructed diagonal matrix, comprising a rotation matrix, a translation vector, and a scale.
Further, the covariance matrix Σ_xy in S4.3.2 is computed as

Σ_xy = (1/n) Σ_{i=1..n} (y_i - μ_y)(x_i - μ_x)^T

The SVD decomposition, rotation matrix, translation vector, and scale in S4.3.3 are computed as

Σ_xy = U D V^T
R = U S V^T
c = tr(DS) / σ_x²
t = μ_y - c R μ_x

where S is the constructed diagonal matrix (the identity when det(U)det(V) = 1, and diag(1, ..., 1, -1) otherwise).
Further, computing the reprojection error in S4.4 also requires obtaining the reprojection error E_tgt of the tgt sub-scene; if both E_src and E_tgt are smaller than the preset maximum threshold D_max, the three-dimensional point is regarded as an inlier.
Further, after each iteration in S4.4 ends, src_rand and tgt_rand must be updated, and the inlier count N_cur of the current iteration recorded.
In order to better demonstrate the effects of the present invention, the following examples are provided.
Example 1: construction of scene graph (common view grid-based scene graph construction method)
First, each image is uniformly divided into N² grid cells. Each grid cell of an image is taken as a node of the common-view graph, and the edges between nodes represent the common-view relations between cells; if two cells share at least T feature tracks, they form a pair of common-view cells. On the basis of this grid common view, the distance between any two images I_i and I_j can be defined as

dist(I_i, I_j) = min over all common-view cell pairs (g_a^i, g_b^j) of d(g_a^i, g_b^j)

where d(g_a^i, g_b^j) is the shortest-path distance in the common-view graph between the two cells. If two cells are not connected in the common-view graph, this distance is defined as infinity.
Constructing the grid-based common view requires considering the transitivity of the common-view relation, but as the length of a common-view chain grows, the overlap between the cells at its two ends may shrink or even vanish entirely. Image pairs that are far apart should therefore not be considered co-visible. Specifically, whether two images are in common view is determined by the formula

dist(I_i, I_j) ≤ σ_dist

If the distance between I_i and I_j does not exceed the preset threshold σ_dist, they are regarded as a common-view pair. This method effectively evaluates the common-view relation between images by dividing the images into grid cells and computing the shortest distances between cells.
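The distance and threshold test above can be sketched as follows. This is an illustrative implementation under assumed data structures (an edge list of common-view cell pairs, with cells identified by arbitrary hashable keys, and a hypothetical default σ_dist of 2), not the patent's own code:

```python
from collections import deque

def grid_distance(covis_edges, src_cell, dst_cell):
    """Shortest-path distance (in edges) between two grid cells in the
    common-view graph; float('inf') if they are not connected."""
    graph = {}
    for a, b in covis_edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen, queue = {src_cell}, deque([(src_cell, 0)])
    while queue:
        cell, d = queue.popleft()
        if cell == dst_cell:
            return d
        for nxt in graph.get(cell, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def image_distance(covis_edges, cells_i, cells_j):
    """dist(I_i, I_j): minimum over all cell pairs of the two images."""
    return min((grid_distance(covis_edges, a, b)
                for a in cells_i for b in cells_j),
               default=float("inf"))

def is_covisible(covis_edges, cells_i, cells_j, sigma_dist=2):
    """Common-view test: dist(I_i, I_j) <= sigma_dist."""
    return image_distance(covis_edges, cells_i, cells_j) <= sigma_dist
```

Because a disconnected pair returns infinity, distant chains are automatically rejected by the threshold, matching the transitivity caveat above.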
Then a novel method is introduced for quantifying the association weights between images and reducing misjudgments of the common-view relation caused by matches on identical objects: a strategy that re-divides the common-view grid over different resolution layers, executed in order from low to high resolution and labeled the first through the L-th layer. In each resolution layer a grid cell may be in one of two states: empty or full. If a common-view matching pair is found in a grid cell of the l-th layer, the cell state is updated to full and the association score of the image pair is increased by 2^l; the aim of this strategy is to enhance the robustness and accuracy of common-view-based image matching by dispersing the distribution of the common-view grid. To compute the total association score between two images, the method averages the unilateral scores over all valid (i.e., full) grid cells. The process can be illustrated visually: fig. 2 shows the pyramid mechanism used to compute the association weight between two images, where matched feature points are marked with red dots and successfully activated grid cells are shown in yellow. Only activated cells participate in the final score; for example, when L equals 3, if the unilateral score of image i is 66 and that of image j is 80, the association strength between image i and image j is computed as 73.
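A minimal sketch of the pyramid scoring just described. It assumes each layer l re-divides the image into a 2^l by 2^l grid (the patent does not fix the per-layer resolution, so this is an illustrative choice) and that matched feature coordinates are normalized to [0, 1); the two unilateral scores are averaged exactly as in the 66/80 example above:

```python
def unilateral_score(points, L):
    """Pyramid score for one image: at layer l the image is split into a
    (2**l x 2**l) grid; every cell containing at least one matched point
    becomes 'full' and contributes 2**l to the score."""
    score = 0
    for l in range(1, L + 1):
        n = 2 ** l
        full_cells = {(int(x * n), int(y * n)) for x, y in points}
        score += len(full_cells) * 2 ** l
    return score

def association_score(points_i, points_j, L=3):
    """Total association score: average of the two unilateral scores."""
    return (unilateral_score(points_i, L) + unilateral_score(points_j, L)) / 2
```

Spreading matches over many cells raises the score faster than piling matches into one cell, which is the dispersal effect the strategy aims for.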
Example 2: expansion sub-scene (sub-scene expansion method based on strong association)
To address the problem that the overlapping regions produced by existing scene-graph partitioning methods are insufficient to support an effective merge, this research proposes a new method. In the traditional graph-cut-and-expand pattern, the final merge is enabled by creating overlapping regions between partitions, and this approach relies on edge selection to determine the overlapping image set. As shown in fig. 3, specific edges, e.g., (3, 10), (6, 10), (9, 26), etc., are collected to identify the image set that can serve as partition separators, so that images {3, 6, 9, 10, 11, 16, 18, 19, 24, 25, 26} are used to create the overlapping regions. However, this approach faces two main challenges: first, when the scene graph is too sparse, the number of boundary images obtained is insufficient to compute the similarity transformations needed to merge all partial reconstructions; second, graph-cut methods tend to separate partitions along weakly associated edges, so the separators are usually weakly constrained during reconstruction, which degrades the accuracy of their pose estimates.
To overcome these limitations, this chapter proposes a sub-scene expansion algorithm based on strong association, which aims to improve the precision of sub-scene merging when the scene graph is sparse. The core idea is to identify and expand strongly associated edges so as to increase the number and quality of the overlapping regions, thereby improving the accuracy and robustness of the reconstruction. The specific steps are as follows:
Step one: collect boundary images. Initially, those images that lie at the junction of different sub-scenes or are relatively isolated in the whole scene must be identified from the already constructed scene graph; these images are regarded as boundary images and are collected into a set S.
Step two: gradually expand the sub-partitions. For each sub-scene i in the scene graph, perform the following steps:
Step three: among the boundary images in the set S, find the boundary image with the highest association with images outside the partition, i.e., the expansion image, and add all of its associated images that do not yet belong to the partition.
Step four: judge whether the size of the overlapping region meets the requirement. Compute the proportion of newly added images in the sub-scene; if this proportion reaches or exceeds the set threshold τ_overlap, terminate the expansion process for this sub-scene.
By the method, not only is the boundary of the sub-scene gradually filled, but also effective association is carried out according to the boundary image in the scene, and finally, a more continuous and accurate scene graph is obtained.
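Under assumed inputs, the expansion loop above might look like the following sketch. The association table `assoc` (each image mapped to its neighbors' association scores), the greedy tie-breaking, and the default threshold are all illustrative choices, not details fixed by the patent:

```python
def expand_subscene(partition, boundary_imgs, assoc, tau_overlap=0.3):
    """Greedy expansion: repeatedly pick the boundary image most strongly
    associated with an image outside the partition, pull in its associated
    images, and stop once the fraction of newly added images reaches
    tau_overlap."""
    partition = set(partition)
    added = set()
    candidates = list(boundary_imgs)
    while candidates:
        def outside_strength(b):
            # strongest association from boundary image b to an outside image
            return max((w for n, w in assoc.get(b, {}).items()
                        if n not in partition), default=0)
        best = max(candidates, key=outside_strength)
        if outside_strength(best) == 0:
            break  # no boundary image reaches outside the partition
        new_imgs = set(assoc.get(best, {})) - partition
        partition |= new_imgs
        added |= new_imgs
        candidates.remove(best)
        if len(added) / len(partition) >= tau_overlap:
            break  # overlap region is large enough
    return partition, added
```

Because expansion follows the strongest association first, the overlap region is built from well-constrained images rather than the weak edges a graph cut tends to sever.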
Example 3: merging sub-scenes (merging method of sub-scenes merging three-dimensional points)
In existing scene reconstruction methods, merging between sub-scenes typically relies on overlapping images, and similarity transformations between sub-scenes are computed from pose estimates of these common images. However, this approach faces significant challenges. First, if the estimation of the image pose is not accurate enough, the similarity transformation calculated based on these poses will also be unreliable, resulting in a bias in the result of sub-scene merging. Second, the camera center point cloud of the image is typically more sparse than the three-dimensional point cloud, which further reduces the number of data points available for estimating the similarity transformation, reducing the accuracy and stability of the estimation process. In order to solve the problems, an innovative sub-scene merging algorithm is introduced, and the algorithm overcomes the defects of the traditional method by fusing three-dimensional point cloud data.
(1) The specific merging process comprises the following steps:
Step one: identification of common three-dimensional points. First, in the two sub-scenes src and tgt to be merged, all common three-dimensional points observed by at least three feature points must be identified; these are valid matching points supported by multiple observations. The coordinate set of the common three-dimensional points in the src sub-scene is denoted src_xyz, and that in the tgt sub-scene tgt_xyz.
Step two: random sampling. Next, m points are randomly drawn from src_xyz and tgt_xyz, named src_rand and tgt_rand respectively; this step prepares for the estimation of the geometric transformation.
Step three: estimation of the similarity transformation. Using src_rand and tgt_rand, a similarity transformation matrix T from the src sub-scene to the tgt sub-scene is estimated; this transformation reflects the spatial relationship between the two sub-scenes.
Step four: calculation of the reprojection error. With the estimated similarity transformation T, the reprojection error E_src of each three-dimensional point of the src sub-scene in the tgt sub-scene is computed; similarly, the reprojection error E_tgt for the tgt sub-scene is computed. If both errors are smaller than the preset maximum threshold D_max, the three-dimensional point is regarded as an inlier; identifying inliers helps improve the accuracy of the transformation estimate. After each iteration, src_rand and tgt_rand are re-sampled, and the inlier count N_cur of the current iteration is recorded.
Step five: iteration judgment. Finally, check whether the number of iterations has reached the preset upper limit; if so, end the iteration process. If not, recompute the required number of iterations from the current inlier count N_cur and return to step three.
Through the coherent and iterative process, sub-scenes can be accurately combined, so that higher-quality data fusion is realized in the whole three-dimensional scene reconstruction. The method not only enhances the consistency among sub-scenes, but also improves the accuracy and reliability of the whole three-dimensional model.
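The sample-estimate-score loop above can be sketched as follows, under stated simplifications: the closed-form similarity estimate (Umeyama-style, matching the SVD formulas given later) is inlined, and the patent's image-space reprojection test is replaced by a 3D transfer error so the sketch stays self-contained; the sample size, threshold, and iteration cap are illustrative, and the adaptive recomputation of the iteration count is omitted:

```python
import numpy as np

def estimate_sim3(x, y):
    """Closed-form similarity (c, R, t) with y ~ c * R @ x + t.
    x, y: (n, 3) arrays of corresponding 3D points."""
    mu_x, mu_y = x.mean(0), y.mean(0)
    var_x = ((x - mu_x) ** 2).sum(1).mean()
    cov = (y - mu_y).T @ (x - mu_x) / len(x)       # Sigma_xy
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                               # avoid a reflection
    R = U @ S @ Vt
    c = np.trace(np.diag(D) @ S) / var_x
    t = mu_y - c * R @ mu_x
    return c, R, t

def ransac_merge(src_xyz, tgt_xyz, m=4, d_max=0.05, max_iter=100, seed=0):
    """RANSAC-style loop: sample m correspondences, fit a similarity,
    count inliers by 3D transfer error, and keep the best model."""
    rng = np.random.default_rng(seed)
    best = (0, None)
    for _ in range(max_iter):
        idx = rng.choice(len(src_xyz), size=m, replace=False)
        c, R, t = estimate_sim3(src_xyz[idx], tgt_xyz[idx])
        err = np.linalg.norm(c * src_xyz @ R.T + t - tgt_xyz, axis=1)
        n_cur = int((err < d_max).sum())
        if n_cur > best[0]:
            best = (n_cur, (c, R, t))
    return best
```

Fitting on three-dimensional points rather than camera centers follows the motivation given above: the denser point cloud gives the estimator many more constraints.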
(2) Method for obtaining the similarity transformation between sub-scenes
The src_rand and tgt_rand are expressed by vectors (x 1,x2,...,xn) and (y 1,y2,...,yn) so that their average values mu x and mu y, and variance can be obtainedAndThe method comprises the following steps of:
covariance matrix of the two sets of vectors can then be obtained
Σxy=UDVT
Through SVD decomposition of the covariance matrix, finally, the required similarity transformation between two point clouds can be obtained through the decomposed matrix and the constructed diagonal matrix, and the method comprises the steps of rotating the matrix, translating the vector and the scale:
R=USVT
t=μy-cRμx
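The closed-form estimation above (means, variance, covariance, SVD, and the R, c, t formulas) can be sketched in Python with NumPy. The function name is illustrative; the reflection-guarding diagonal matrix S follows the construction described above.

```python
import numpy as np

def similarity_from_points(x, y):
    """Compute (c, R, t) with y_i ~= c * R @ x_i + t from matched 3D points,
    following the mean/variance/covariance/SVD formulas above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)      # means
    var_x = ((x - mu_x) ** 2).sum() / n              # sigma_x^2
    cov = (y - mu_y).T @ (x - mu_x) / n              # Sigma_xy
    U, D, Vt = np.linalg.svd(cov)                    # Sigma_xy = U D V^T
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # S = diag(1, 1, det(UV^T))
        S[2, 2] = -1.0
    R = U @ S @ Vt                                   # R = U S V^T
    c = (D * np.diag(S)).sum() / var_x               # c = tr(DS) / sigma_x^2
    t = mu_y - c * R @ mu_x                          # t = mu_y - c R mu_x
    return c, R, t
```

Given exact correspondences, this recovers the generating scale, rotation, and translation to numerical precision.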
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A layered motion restoration structure method based on grid common view, characterized by comprising the following steps:
S1, image input: inputting an RGB image set;
S2, image preprocessing: extracting feature points and matching the feature points of the images;
S3, scene graph construction: building a graph structure over all related images, in which the nodes are images and the edges represent the degree of association between images;
S4, scene graph division: dividing the large scene graph into a plurality of small scene graphs to obtain a plurality of sub-scenes;
S5, sub-scene expansion: expanding the boundaries of the sub-scenes;
S6, local SfM reconstruction: independently running SfM reconstruction on the images in each sub-scene;
S7, sub-scene merging: merging the results of the independent SfM reconstructions of the previous step to obtain a complete reconstruction result.
2. The layered motion restoration structure method based on grid common view according to claim 1, wherein before step S3 is performed, it is judged whether two images are associated and their degree of association is calculated, specifically comprising the following steps:
S2.1, uniformly dividing each image into N^2 grid cells;
S2.2, taking each grid cell of an image as a node of a co-view graph, wherein edges between nodes represent the co-view relation between grid cells; if two grid cells share at least T feature tracks, they form a pair of co-view grid cells, and the distance between any two images I_i and I_j is defined on the basis of the grid co-view;
S2.3, judging whether the two images are co-view;
S2.4, labeling layers 1 through L in order from low resolution to high resolution, each grid cell being marked with one of two states; if a co-view matching pair is found in a grid cell of layer l, updating the cell state to full and increasing the association score of the image pair by 2^l;
S2.5, calculating the total association score between the two images as the average of the single-side scores over all grid cells whose state is full.
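As a rough illustration of steps S2.1 through S2.5, the layered grid scoring could look like the sketch below. This is not the patent's exact bookkeeping: the per-layer grid resolution schedule (doubling per layer), the cell-state handling, and the averaging are all assumptions made for the example.

```python
def association_score(matched_points, width, height, N=4, L=3):
    """Layered grid scoring sketch.  `matched_points` holds (x, y) keypoint
    positions in one image that have a co-view match in the other image.
    Assumed: layer l uses an (N * 2**(l-1))-cell-per-side grid, a cell is
    'full' once it contains a matched point, a full cell at layer l adds 2**l
    to the score, and the total is averaged over all full cells."""
    total, full_cells = 0, 0
    for l in range(1, L + 1):
        n = N * 2 ** (l - 1)                 # layers go from low to high resolution
        full = {(int(x * n / width), int(y * n / height))
                for (x, y) in matched_points}
        total += len(full) * 2 ** l          # each full cell contributes 2^l
        full_cells += len(full)
    return total / full_cells if full_cells else 0.0
```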
3. The layered motion restoration structure method based on grid common view according to claim 2, wherein the distance between any two images in S2.2 is defined by the formula:
dist(I_i, I_j) = min over all co-view grid-cell pairs (G_i^a, G_j^b) of d(G_i^a, G_j^b)
wherein d(G_i^a, G_j^b) represents the shortest-path distance between grid cell G_i^a and grid cell G_j^b in the co-view graph, and dist(I_i, I_j) represents the minimum such distance over all possible co-view grid-cell pairs between image I_i and image I_j.
4. The layered motion restoration structure method based on grid common view according to claim 2, wherein the judgment of whether the two images in S2.3 are co-view is given by:
dist(I_i, I_j) ≤ σ_dist
wherein σ_dist represents a preset distance threshold; if the distance between I_i and I_j does not exceed σ_dist, the two images are determined to be a co-view pair.
5. The layered motion restoration structure method based on grid common view according to claim 1, wherein expanding the boundaries of the sub-scenes in S5 follows the conventional graph-cut-and-expand scheme, in which the final merging is enabled by creating overlapping regions between partitions, specifically as follows:
S3.1, collecting boundary images: identifying, from the already constructed scene graph, those images that lie at the junction of different sub-scenes or that are relatively isolated within the whole scene, and collecting them into a set S;
S3.2, gradually expanding the sub-partitions: performing the following steps for each sub-scene i in the scene graph;
S3.3, searching, among the boundary images of the set S, for the boundary image with the highest degree of association with external images as the expansion image, and adding to the partition all associated images of the expansion image that do not yet belong to the partition;
S3.4, judging whether the size of the overlapping region meets the requirement: calculating the proportion of the newly added images in the sub-scene i, and if the proportion reaches or exceeds a set threshold τ_overlap, ending the expansion process of the sub-scene.
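The expansion loop of S3.1 through S3.4 can be sketched as follows. The association structure (`assoc[i][j]` as a score between images i and j), the stopping ratio, and the greedy selection details are assumptions for illustration, not the patent's exact procedure.

```python
def expand_subscene(partition, boundary_imgs, assoc, tau_overlap=0.3):
    """Grow `partition` (a set of image ids) by repeatedly picking the boundary
    image most strongly associated with images outside the partition and
    absorbing its associated images, until the overlap ratio reaches
    tau_overlap or no boundary image remains."""
    original = len(partition)
    added = set()
    while True:
        # S3.3: boundary image with highest association to external images
        best, best_s = None, 0.0
        for b in boundary_imgs:
            s = sum(w for j, w in assoc.get(b, {}).items() if j not in partition)
            if s > best_s:
                best, best_s = b, s
        if best is None:
            break
        for j in assoc.get(best, {}):        # absorb its associated images
            if j not in partition:
                partition.add(j)
                added.add(j)
        boundary_imgs = [b for b in boundary_imgs if b != best]
        # S3.4: stop once the newly added proportion reaches the threshold
        if len(added) / max(original, 1) >= tau_overlap:
            break
    return partition, added
```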
6. The layered motion restoration structure method based on grid common view according to claim 1, wherein the sub-scene merging operation of S7 specifically comprises the following steps:
S4.1, identifying common three-dimensional points: identifying all common three-dimensional points observed by at least three feature points in the two sub-scenes src and tgt to be merged, and denoting the coordinate set of these common points in the src sub-scene as src_xyz and in the tgt sub-scene as tgt_xyz;
S4.2, random sampling: randomly extracting m points from src_xyz and tgt_xyz, denoted src_rand and tgt_rand respectively;
S4.3, estimating the similarity transformation: estimating a similarity transformation matrix T from the src sub-scene to the tgt sub-scene using src_rand and tgt_rand;
S4.4, calculating the reprojection error: calculating the reprojection error E_src of each three-dimensional point of the src sub-scene in the tgt sub-scene using the estimated similarity transformation matrix T, and iterating;
S4.5, iteration judgment: judging whether the number of iterations has reached a preset upper limit; if so, ending the iteration, and if not, returning to S4.3.
7. The layered motion restoration structure method based on grid common view according to claim 6, wherein the method for estimating the similarity transformation in S4.3 specifically comprises the following steps:
S4.3.1, representing src_rand and tgt_rand by the vectors (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n), and obtaining their means μ_x and μ_y and variances σ_x^2 and σ_y^2;
S4.3.2, calculating the covariance matrix Σ_xy of the two sets of vectors from the μ_x, μ_y, σ_x^2 and σ_y^2 obtained in S4.3.1;
S4.3.3, performing SVD decomposition on the covariance matrix Σ_xy, and then obtaining the required similarity transformation between the two point clouds, comprising a rotation matrix, a translation vector and a scale, from the decomposed matrices and a constructed diagonal matrix.
8. The layered motion restoration structure method based on grid common view according to claim 7, wherein the covariance matrix Σ_xy in S4.3.2 is calculated as:
Σ_xy = (1/n) Σ_{i=1}^{n} (y_i − μ_y)(x_i − μ_x)^T
and the SVD decomposition, rotation matrix, translation vector and scale in S4.3.3 are calculated as:
Σ_xy = U D V^T
R = U S V^T
c = tr(D S) / σ_x^2
t = μ_y − c R μ_x
wherein S is the constructed diagonal matrix, S = diag(1, 1, det(U V^T)).
9. The layered motion restoration structure method based on grid common view according to claim 6, wherein the calculation of the reprojection error in S4.4 further requires the reprojection error E_tgt of the tgt sub-scene; if both E_src and E_tgt are smaller than a preset maximum threshold D_max, the three-dimensional point is regarded as an inlier.
10. The layered motion restoration structure method based on grid common view according to claim 6, wherein after each iteration in S4.4, src_rand and tgt_rand are updated and the inlier count N_cur of the current iteration is recorded.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410410945.4A CN118314370A (en) | 2024-04-07 | 2024-04-07 | Layered motion restoration structure method based on grid common view |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118314370A true CN118314370A (en) | 2024-07-09 |
Family
ID=91721592
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410410945.4A Pending CN118314370A (en) | 2024-04-07 | 2024-04-07 | Layered motion restoration structure method based on grid common view |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118314370A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120014179A (en) * | 2025-04-18 | 2025-05-16 | 武汉轻工大学 | A three-dimensional reconstruction method, device and medium for incremental sub-scene merging |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||