Visual odometer method and system based on image depth prediction and monocular geometry
Technical Field
The invention belongs to the field of computer vision and robot navigation, and particularly relates to a visual odometer method and system based on image depth prediction and monocular geometry.
Background
A visual odometer estimates the motion trajectory of a robot from an input image sequence and is one of the core modules of a simultaneous localization and mapping (SLAM) system. Traditional geometric visual odometer methods based on feature-point matching achieve high precision in controlled environments, but fail easily in complex scenes with poor texture, dynamic objects, and illumination changes. In recent years, end-to-end visual odometer methods based on deep learning have made some progress, but they often lack geometric constraints, and their precision and robustness remain unsatisfactory. Existing hybrid visual odometer methods integrate classical geometric models with deep learning frameworks, but still need improvement when handling complex scenes such as highly dynamic or low-texture environments.
Disclosure of Invention
The invention aims to provide a visual odometer method and a visual odometer system based on image depth prediction and monocular geometry that overcome the defects of the prior art and improve the robustness and accuracy of a monocular visual odometer in dynamic environments.
The technical scheme is that the visual odometer method based on image depth prediction and monocular geometry comprises the following steps:
(1) Inputting two consecutive image frames, detecting and describing local image features with a scale-invariant feature transform (SIFT) algorithm, and then matching corresponding feature point pairs (F_i, F_j) between the two images with a FLANN algorithm;
(2) For the feature matching relation obtained in step (1), solving the essential matrix E using the epipolar geometric constraint to obtain the relative pose transformation [R, t] of the camera, where R is a rotation matrix and t is a translation vector;
(3) Constructing and training a monocular depth prediction model, and predicting dense depth information for each frame of input image;
(4) If the number of valid depth-information pairs formed by the two consecutive image frames is larger than a given threshold, estimating a scale factor s by triangulation to obtain the corrected relative pose transformation [R, st];
(5) Solving the absolute pose transformation [R, t] by combining a perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization.
Further, the implementation process of the step (1) is as follows:
Candidate key points are detected with the difference-of-Gaussian kernel: pixel values are compared across different scale spaces, and extreme points are taken as candidate key points:
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y)
where x, y are the spatial coordinates of the image, I(x, y) is the pixel value of the input image at point (x, y), k is the multiplicative relationship between adjacent scales, and G(x, y, σ) is the Gaussian kernel at scale σ, used to generate the different scale spaces of the input image I(x, y);
A three-dimensional quadratic function is then fitted at each candidate key point by spline interpolation to locate the true extreme point of the fitted function, and points with low contrast are removed using the following criterion:
D(x̂) = D + (1/2)(∂D/∂x)^T x̂
where x̂ is the extreme point of the interpolation function; candidate points whose |D(x̂)| falls below a given contrast threshold are discarded. Next, the principal direction of each key point is determined by statistics of the gradient histogram, giving the feature rotation invariance. Finally, a key-point descriptor is generated: with each key point as the center, gradient direction histograms are computed in its neighborhood to form a 128-dimensional vector as the feature-vector representation of that key point. After the feature vectors of the two images are obtained, the corresponding feature point pairs (F_i, F_j) between the image frames are matched by the FLANN algorithm.
Further, the step (2) is implemented by the following formula:
F_j^T K^{-T} E K^{-1} F_i = 0
where E is the essential matrix and K is the known camera intrinsic matrix; the relative pose transformation [R, t] of the camera is obtained by decomposing E.
Further, the implementation process of the step (3) is as follows:
Given the predicted depth D_t and the camera pose matrix T_{t+1→t}, an image I_{t+1→t} is reconstructed:
I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} D_t[x] K^{-1} x)
where w(·) is a differentiable image-warping function and x denotes the pixel coordinates of a point in the image;
Using the generated I_{t+1→t} and the reference image I_t, the following objective function is constructed:
pe(I_{t+1→t}, I_t) = (α/2)(1 − SSIM(I_{t+1→t}, I_t)) + (1 − α)||I_{t+1→t} − I_t||_1
where SSIM is the structural similarity index (a similarity measure, so the loss uses 1 − SSIM) and α is the weight balancing the SSIM term and the L1 loss;
For the reference view I_t, images I_{t+1→t} and I_{t−1→t} are reconstructed from its neighboring views I_{t−1} and I_{t+1}, and only the photometric error of the reference/synthesized pixel pair with the smaller error is used:
L_p = min(pe(I_t, I_{t−1→t}), pe(I_t, I_{t+1→t}))
and an edge-aware depth smoothing term L_s is introduced:
L_s = |∂_x D_t| e^{−|∂_x I_t|} + |∂_y D_t| e^{−|∂_y I_t|}
where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, which encourage the depth values of adjacent pixels to be close, and e^{−|∂_x I_t|} and e^{−|∂_y I_t|} are weight terms based on the image gradient of the current view: in edge regions with large image gradient the weight is small, allowing discontinuities in the depth map, while in flat regions with small image gradient the weight is large, keeping the depth map smooth. The final loss function is:
L = L_p + λL_s
The monocular depth prediction model is implemented and trained with PyTorch.
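The training objective of step (3) can be sketched numerically. The following NumPy sketch uses a simplified single-window SSIM (real training code would use a local-window SSIM on tensors) and assumes the α(1 − SSIM)/2 form of the photometric term; it shows only how L = L_p + λL_s combines the pieces:

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified global SSIM over one window (an assumption for brevity;
    self-supervised depth training typically uses a local 3x3 window)."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def photometric_error(warped, ref, alpha=0.85):
    # pe = (alpha/2) * (1 - SSIM) + (1 - alpha) * L1
    return 0.5 * alpha * (1.0 - ssim_global(warped, ref)) \
        + (1.0 - alpha) * np.abs(warped - ref).mean()

def smoothness(depth, image):
    # Edge-aware term: depth gradients down-weighted at image edges.
    dx_d = np.abs(np.diff(depth, axis=1))
    dy_d = np.abs(np.diff(depth, axis=0))
    dx_i = np.abs(np.diff(image, axis=1))
    dy_i = np.abs(np.diff(image, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()

def total_loss(warped_prev, warped_next, ref, depth, lam=0.001):
    # L_p is the per-pair minimum over the two reconstructions.
    L_p = min(photometric_error(warped_prev, ref),
              photometric_error(warped_next, ref))
    return L_p + lam * smoothness(depth, ref)
```

In the actual model these operations would be written with differentiable PyTorch ops so gradients flow back to the depth network.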
Further, the implementation process of the step (4) is as follows:
The corresponding feature point pairs (F_i, F_j) of two given consecutive image frames I_i and I_j are normalized to obtain x_i and x_j, and the corresponding three-dimensional point cloud X_j of the second view is reconstructed by triangulation; projecting X_j back to the image plane yields the triangulation-based depth information
D̂_j = e_3^T X_j
where e_3 is the vector [0, 0, 1]^T. Depth information D_j of image I_j is then estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_j to obtain processed depth data D̃_j. The points at which both D̂_j and D̃_j are valid are collected; if the number of such depth-information pairs is greater than a given threshold, depth ratios r_i = D̃_j(i)/D̂_j(i) are computed, and the RANSAC algorithm is introduced to robustly fit the noisy data, the inliers of a hypothesis s being the ratios satisfying
|r_i − s| ≤ ε_1
where ε_1 is the residual threshold and s is the scale factor.
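The robust scale fit of step (4) reduces to a one-dimensional RANSAC over the depth ratios r_i. The sketch below is illustrative: the sampling scheme, threshold, and the choice of the inlier mean as the final estimate are assumptions, not the patent's exact parameters:

```python
import numpy as np

def estimate_scale(ratios, eps1=0.1, iters=100, seed=0):
    """1-D RANSAC: each hypothesis s is a sampled ratio; the inliers of s
    are the r_i with |r_i - s| <= eps1. Returns the mean of the largest
    inlier set as the scale factor."""
    rng = np.random.default_rng(seed)
    ratios = np.asarray(ratios, dtype=float)
    best_inliers = None
    for _ in range(iters):
        s = ratios[rng.integers(len(ratios))]   # hypothesis from one sample
        inliers = ratios[np.abs(ratios - s) <= eps1]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
    return float(best_inliers.mean())
```

The recovered s then rescales the unit-norm translation from step (2), giving [R, st].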
Further, the implementation process of the step (5) is as follows:
Depth information D_i of the current view I_i is estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_i to obtain processed depth data D̃_i. The camera intrinsic matrix K is then used to compute, by back projection, the three-dimensional point-cloud coordinates X_i corresponding to view I_i, i.e. X_i = D̃_i(x) K^{-1} x. From the correspondences between all N pairs of three-dimensional space points and two-dimensional pixel points, an initial pose solution R_0, t_0 is computed with RANSAC by randomly sampling minimal inlier sets:
{R_0, t_0} = argmin_{R,t} Σ_i ρ(||F_j − π(R, t, X_i)||²)
where ρ is a robust kernel function, F_j is the feature point in the corresponding reference view I_j, and π is the projection function. With the initial solution R_0, t_0 as the starting point, the pose solution {R*, t*} is refined by local nonlinear optimization:
{R*, t*} = argmin_{R,t} Σ_i ||F_j − π(R, t, X_i)||²
All N reprojection errors are then computed with the optimized pose solution, and the inlier set is determined:
I = {i | ||F_j − π(R*, t*, X_i)|| < ε_2}
where ε_2 is the inlier threshold. The above process is repeated until the maximum number of iterations is reached or the number of inliers no longer increases, yielding the optimal solution [R, t].
The invention relates to a visual odometer system based on image depth prediction and monocular geometry, which comprises:
The feature detection and matching module is used for detecting key points of the input image and matching corresponding feature points between the continuous image frames;
A 2D pose solver for estimating the 2D pose transformation of the camera based on the correspondences of the feature point pairs;
A monocular depth prediction network for predicting a dense depth map for each frame of the input image;
The scale factor correction module is used for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;
and the 3D pose solver is used for directly solving the 3D pose of the camera, based on the correspondences between the 3D coordinates and the 2D pixel coordinates of the feature point pairs, when the scale factor correction module cannot operate.
The invention combines the advantages of traditional geometric methods with recent deep learning techniques, addressing challenges of the field such as scale estimation and dynamic-environment adaptation while showing excellent performance in practical applications. First, by combining classical feature detection with geometric constraints, the method resists complex scenes such as texture deficiency, dynamic objects, and illumination changes, and the 2D pose-solving strategy based on optical-flow magnitude further improves adaptation to dynamic scenes. Second, the proposed scale factor correction method uses the dense depth information output by the depth prediction network to stably estimate, via the RANSAC algorithm, the inherently uncertain scale of the monocular visual odometer, avoiding long-term trajectory drift. In addition, the monocular depth prediction network of the system is trained in a self-supervised manner and does not depend on extra ground-truth annotations; even when trained only on a conventional dataset, it shows excellent generalization and runs stably in extreme scenes.
Drawings
FIG. 1 is a schematic view of the overall framework of the visual odometer system of the invention;
FIG. 2 is a schematic diagram of the calculation of the fly-out mask of the present invention;
fig. 3 compares experimental results when different sequences of the KITTI dataset are used as verification data, wherein graphs (a) to (k) correspond to sequences 00 to 10 of the KITTI dataset, respectively.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the present invention proposes a visual odometer method based on image depth prediction and monocular geometry, comprising the steps of:
Step 1, running the feature detection and matching module to match corresponding feature points between consecutive image frames. The specific steps are as follows:
Two consecutive image frames are input, local features of the images are detected and described using the scale-invariant feature transform algorithm, and the FLANN algorithm is then used to match corresponding feature point pairs (F_i, F_j) between the two images.
The purpose of scale-space extremum detection is to find key points across different scale spaces. Candidate key points are detected with the difference-of-Gaussian kernel: pixel values are compared across different scale spaces, and extreme points are taken as candidates:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)
and (3) performing three-dimensional quadratic function fitting on the key points by adopting a spline interpolation method, further finding out the true extreme points (key points) of the fitting function, and then removing points with low contrast, wherein the following formula is used:
Wherein, the Is the extreme point of the interpolation function, discardThe method comprises the steps of determining the principal direction of each key point, obtaining the principal direction through statistics of gradient histograms when determining the principal direction of each key point to enable the feature to have rotation invariance, finally generating a key point descriptor, calculating gradient direction histograms in the neighborhood of each key point by taking each key point as a center to form a 128-dimensional vector serving as a feature vector representation of the key point, and matching corresponding feature point pairs between image frames by using a FLANN algorithm after obtaining the feature vectors of two images (F i,Fj).
Step 2, the 2D pose solver solves the essential matrix E based on the epipolar geometry constraint, thereby obtaining the relative pose transformation [R, t] of the camera, which satisfies:
F_j^T K^{-T} E K^{-1} F_i = 0,  E = [t]_× R
where R is the rotation matrix, t is the translation vector, E is the essential matrix, K is the known camera intrinsic matrix, and F_i, F_j are the feature matching points obtained in step 1. The relative pose transformation [R, t] of the camera is thus found by decomposing E. The solver monitors the optical-flow magnitude between consecutive frames in real time, and decomposes the essential matrix to obtain a pose solution only when the magnitude is large enough, which avoids interference from small-scale dynamic points.
Step 3, constructing a monocular depth prediction network and predicting a dense depth map for each frame of the input image. The monocular depth model is first trained: given the predicted depth D_t and the camera pose matrix T_{t+1→t}, an image I_{t+1→t} is reconstructed:
I_{t+1→t} = w(I_{t+1}, K T_{t+1→t} D_t[x] K^{-1} x)
where w(·) is a differentiable image-warping function and x denotes the pixel coordinates of a point in the image.
Using the generated I_{t+1→t} and the reference image I_t, the following objective function can be constructed:
pe(I_{t+1→t}, I_t) = (α/2)(1 − SSIM(I_{t+1→t}, I_t)) + (1 − α)||I_{t+1→t} − I_t||_1
where SSIM is the structural similarity index, a measure of the similarity of two images (so the loss uses 1 − SSIM), and α = 0.85 is the weight balancing the SSIM term and the L1 loss.
Specifically, as shown in fig. 2, for the reference view I_t, images I_{t+1→t} and I_{t−1→t} are reconstructed from its neighboring views I_{t−1} and I_{t+1}. The present invention does not average the photometric errors between the reference pixels and the synthesized pixels over multiple views, but only computes the photometric error of the reference/synthesized pixel pair whose error is minimal:
Lp=min(pe(It,It-1→t),pe(It,It+1→t))
Finally, an edge-aware depth smoothing term L_s is introduced:
L_s = |∂_x D_t| e^{−|∂_x I_t|} + |∂_y D_t| e^{−|∂_y I_t|}
where |∂_x D_t| and |∂_y D_t| are the gradient magnitudes of the depth map D_t in the horizontal and vertical directions, which encourage the depth values of adjacent pixels to be close, and e^{−|∂_x I_t|} and e^{−|∂_y I_t|} are weight terms based on the image gradient of the current view. When the image gradient is large (e.g., in edge regions), the weight is small, allowing discontinuities in the depth map; in flat regions where the image gradient is small, the weight is large, forcing the depth map to remain smooth. This depth smoothing term is introduced because the predicted depth map should be discontinuous at object edges while remaining smooth inside objects, which matches the structure of real scenes. This penalty term enhances the edge preservation and smoothness of the depth prediction, yielding a more reasonable depth estimation result.
The final loss function is L = L_p + λL_s. The depth prediction model is implemented and trained with PyTorch. Training uses the Adam optimization algorithm for 15 epochs in total, with the learning rate set to 0.0001 and the weight coefficient λ of the edge-aware depth smoothing regularization term set to 0.001. After the monocular depth model is trained, a monocular image input to the model yields the dense depth information of the current image.
Step 4, based on the result of step 2, running the scale factor correction module: the scale factor s is estimated by triangulation using the dense depth information of the images, and the corrected relative pose transformation [R, st] is obtained. The specific steps are as follows:
The corresponding feature point pairs (F_i, F_j) of two given consecutive image frames I_i and I_j are normalized to obtain x_i and x_j, and the corresponding three-dimensional point cloud X_j of the second view can be reconstructed by triangulation; projecting X_j back to the image plane yields the triangulation-based depth information
D̂_j = e_3^T X_j
where e_3 is the vector [0, 0, 1]^T. Depth information D_j of image I_j is then estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_j to obtain processed depth data D̃_j. The points at which both D̂_j and D̃_j are valid are collected; if the number of such depth-information pairs is greater than a given threshold, depth ratios r_i = D̃_j(i)/D̂_j(i) are computed, and the RANSAC algorithm is introduced to robustly fit the noisy data, the inliers of a hypothesis s being the ratios satisfying
|r_i − s| ≤ ε_1
where ε_1 is an appropriate residual threshold. In this way, the scale factor estimate s of the visual odometer is obtained, and on the basis of step 2 the scale-corrected relative pose transformation [R, st] follows.
Step 5, running the 3D pose solver: the absolute pose transformation [R, t] is solved accurately by combining the perspective-n-point projection algorithm, the RANSAC algorithm, and local nonlinear optimization.
First, depth information D_i of the current view I_i is estimated with the monocular depth prediction model, and the fly-out mask M_d is applied to D_i to obtain processed depth data D̃_i. The camera intrinsic matrix K is then used to compute, by back projection, the three-dimensional point-cloud coordinates X_i corresponding to view I_i, i.e. X_i = D̃_i(x) K^{-1} x. Next, from the correspondences between all N pairs of three-dimensional space points and two-dimensional pixel points, an initial pose solution R_0, t_0 is computed with RANSAC by randomly sampling minimal inlier sets:
{R_0, t_0} = argmin_{R,t} Σ_i ρ(||F_j − π(R, t, X_i)||²)
where ρ is the robust kernel function, F_j is the two-dimensional pixel coordinate in the reference view I_j, X_i is the three-dimensional space point of the current view I_i, and π is the projection function. With the initial solution R_0, t_0 as the starting point, the pose solution {R*, t*} is refined by local nonlinear optimization:
{R*, t*} = argmin_{R,t} Σ_i ||F_j − π(R, t, X_i)||²
All N reprojection errors are then computed with the optimized pose solution, and the inlier set is determined:
I = {i | ||F_j − π(R*, t*, X_i)|| < ε_2}
where ε_2 is the inlier threshold. The above procedure is repeated until the maximum number of iterations is reached or the number of inliers no longer increases, thereby obtaining the optimal solution [R, t].
The invention also proposes a visual odometer system based on image depth prediction and monocular geometry, comprising:
The feature detection and matching module is used for detecting key points of the input image and matching corresponding feature points between the continuous image frames;
A 2D pose solver for estimating the 2D pose transformation of the camera based on the correspondences of the feature point pairs;
A monocular depth prediction network for predicting a dense depth map for each frame of the input image;
The scale factor correction module is used for correcting the scale factor of the 2D pose transformation using the depth information output by the depth prediction network;
and the 3D pose solver is used for directly solving the 3D pose of the camera, based on the correspondences between the 3D coordinates and the 2D pixel coordinates of the feature point pairs, when the scale factor correction module cannot operate.
In the present embodiment, five evaluation indexes are used: average translational error (t_err), average rotational error (r_err), absolute trajectory error (ATE), translational relative pose error (RPE(m)), and rotational relative pose error (RPE(°)). The KITTI Odometry dataset contains stereo-camera and lidar data for 22 real urban scenes and provides accurate ground-truth poses. As shown in fig. 3, graphs (a) to (k) use the 11 sequences 00 to 10 of the KITTI dataset, respectively, as evaluation data. The visual odometer system provided by the invention is compared with various existing methods, including geometry-based methods (VISO2 and ORB-SLAM2) and end-to-end deep learning methods such as SC-SfM-Learner, Depth-VO-Feat, and DF-VO. The quantitative results are shown in Table 1, where the best results are marked in bold with underlines; it can be seen that the invention is optimal on four of the five indexes.
Table 1 error contrast of monocular vision odometer system and methods in KITTI dataset
It can also be seen from fig. 3 that in long-distance trajectory estimation without route loops (sequences 01, 02, 08), serious drift accumulation occurs in ORB-SLAM2. Although faced with multiple challenges such as moving dynamic obstacles and severe illumination changes, the proposed method still shows excellent trajectory estimation in these highly dynamic, complex scenes and matches the ground-truth trajectory most closely.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solution described in the above embodiments may be modified, or some or all of its technical features may be equivalently replaced, and such modifications or substitutions do not cause the corresponding technical solution to depart from the scope of the technical solutions of the embodiments of the present invention.