Background
Aircraft visual navigation technology mainly uses visual sensors (visible light, infrared and the like) carried on a flight platform to image the ground, and combines a reference image containing geographic position information with an image matching algorithm to estimate navigation parameters such as the pose of the aircraft. Existing visual navigation technology can be divided into two types: relative visual navigation and absolute visual navigation. Relative visual navigation methods, chiefly visual odometry (VO) and simultaneous localization and mapping (SLAM), match image sequences and estimate the relative pose of the aircraft from the geometric relation between consecutive frames. Absolute visual navigation methods match a real-time image shot by the aircraft against a reference image carrying geographic coordinate information, and then solve the absolute pose of the aircraft from the reference image information and an imaging model.
Relative visual navigation methods need to build a map of the unknown flight environment and work well indoors, but the outdoor flight conditions of unmanned aerial vehicles (flight heights above 100 meters) and their observation mode (monocular vision) make it difficult for SLAM to carry out simultaneous three-dimensional map construction and accurate positioning. Absolute navigation methods based on reference-map matching are widely applied in various fields, but these algorithms are mainly suited to high-altitude, near-nadir observation scenes; under low-altitude flight (the range of chief concern to the current low-altitude economy being above 100 meters and below 1 km) and large-inclination oblique observation, existing methods still struggle to achieve high-precision, real-time matching between the real-time image and the reference image, owing to the stereoscopic effect of targets and the viewing-angle difference. In addition, urban scene structures are complex and varied and repeated textures abound; most existing matching methods pay no attention to the semantic features of the matching areas, and many mismatched point pairs arise on non-rigid or non-fixed targets such as trees, water areas and moving objects, making accurate pose solutions difficult to obtain.
For the wide-baseline image matching problem caused by large viewing-angle differences and low overlap rates, traditional image matching methods (such as SIFT, SURF and ORB) struggle to achieve robust matching. In recent years image matching models based on deep learning have made a series of advances on the wide-baseline problem, and models represented by SuperPoint, SuperGlue, LoFTR and DKM show a certain robustness to viewing-angle differences, but still cannot meet the requirements of high-precision visual navigation. To make full use of the advantages of deep learning in image content understanding, researchers have in recent years brought in models from other visual tasks to counter the influence of viewing-angle difference, low overlap rate and similar factors on image matching algorithms. Chen Ying et al. first compute the overlapping area of the images to be matched with a deep learning method and then perform feature point matching within that overlapping area. Zhao Chenhao et al. perform target detection on the image with YOLOv5, extract feature points centered on the target frames, fuse semantic information and position information into a feature encoder, and finally remove mismatched points by checking the semantic consistency of matched point pairs. Zhang Yesheng et al. adopt an area-to-point matching strategy: a Semantic and Geometry Area Matching (SGAM) method first matches image areas using semantic and geometric features, and pixel-level matching is then performed within the matched areas, so that the overall semantic and geometric information of the image areas provides a prior for the subsequent feature point matching. Zhang Yesheng et al. also propose Matching Everything by Segmenting Anything (MESA), a segmentation-driven matching method that first segments the image with the Segment Anything Model (SAM) and then matches on the segmented regions, achieving higher accuracy than direct matching. In addition, the Matching Anything by Segmenting Anything (MASA) framework proposed by Li Siyuan et al., which likewise exploits the powerful segmentation capability of SAM models, obtains robust inter-frame matching and tracking of diverse targets in video sequences by jointly training the image segmentation and instance-level matching-and-tracking tasks. Among these methods, Chen Ying et al. must train an image overlapping-region discrimination model in advance and cannot eliminate the negative effects of viewing-angle differences within the overlapping region; Zhao Chenhao et al., using YOLOv5 for detection, cannot recognize and detect all types of targets in an image and provide no finer-grained region segmentation result, so considerable background interference remains within each target region; and SGAM, MESA and MASA do not use the explicit semantic information of image regions or specific targets, an omission that avoids interference from misrecognized image regions in subsequent matching but also forgoes the strong constraint that explicit semantic information exerts on the matching process, leaving their matching robustness insufficient for aircraft remote sensing images with large viewing-angle differences, low overlap rates and diverse targets.
Disclosure of Invention
Based on the above, it is necessary to provide an aircraft visual navigation method based on consistent semantic constraint instance segmentation matching, which can improve the matching robustness and positioning accuracy of an aircraft under conditions such as low-altitude flight and large-viewing-angle oblique observation.
An aircraft visual navigation method based on consistent semantic constraint instance segmentation matching, the method comprising:
The method comprises the steps of acquiring an aerospace remote sensing image data set and a general image basic model trained according to the aerospace remote sensing image data set, wherein the aerospace remote sensing image data set comprises a plurality of ground scene optical images shot by airborne cameras of an aerospace vehicle;
Establishing a navigation target library mainly comprising rigid targets and plane area targets, taking an aerial optical image with geocoding and ground elevation information as a reference image, taking a ground scene optical image as a real-time image, and respectively extracting the reference image and ground object information in the real-time image according to a general image marking model to obtain target sets simultaneously existing in the real-time image, the reference image and the navigation target library;
Detecting positions of all semantic elements in the target set on the real-time image and the reference image by using an open set target detection model to obtain target frame sets of different elements;
Segmenting the contours in all the target frames in the target frame set by using a general image segmentation model to obtain target region information;
performing feature point matching on the regions with the same semantic meaning in the real-time image and the reference image based on the target region information and an image matching algorithm to obtain matching point pairs between the reference image and the real-time image;
And establishing a relation between the two-dimensional matching points on the real-time image and the corresponding three-dimensional information of the reference image according to the matching point pairs between the reference image and the real-time image, then calculating the position and attitude of the current camera through a PnP algorithm in combination with the onboard camera intrinsics, and then calculating the pose of the aircraft according to the translation-rotation relation between the camera coordinate system and the aircraft coordinate system, so as to realize visual navigation of the aircraft.
According to the aircraft visual navigation method based on consistent semantic constraint instance segmentation matching, a navigation target library consisting mainly of rigid targets and planar area targets is first established; an aerial optical image with geocoding and ground elevation information is taken as the reference image, a ground scene optical image is taken as the real-time image, and the ground object information in the reference image and the real-time image is extracted with the general image marking model to obtain the target set existing simultaneously in the real-time image, the reference image and the navigation target library. Because the navigation database is constructed in advance, the common, fixed, rigid or planar area targets in the flight scene can be screened out in advance; detecting, segmenting and matching these targets further improves the practicability and matching accuracy of the algorithm. The target types are then input into the open set target detection model to obtain the target frame positions, and the target text and spatial prompts are input into the general image segmentation model to obtain fine-grained segmentation results for the corresponding targets. Semantic extraction and instance segmentation of the reference image and the real-time image during visual navigation are thereby realized with general-purpose image intelligent processing models, and fine contour information of all distribution areas of the key targets in the remote sensing images is obtained at the same time. Performing feature point matching separately on regions of the same semantics in the real-time image and the reference image makes full use of the semantic information of image regions as a prior, thereby overcoming interference caused by factors such as viewing-angle change, illumination change, low overlap rate and sensor modality difference, and avoiding mismatching between regions of different semantics. In conclusion, the method can effectively improve the matching robustness and positioning accuracy of image-matching-based visual navigation algorithms under conditions such as low-altitude flight and large-viewing-angle oblique observation. The application can be applied to the monocular visual navigation tasks of various flight platforms such as unmanned aerial vehicles and airships, and has broad application prospects and economic value.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in fig. 1, there is provided an aircraft visual navigation method based on consistent semantic constraint instance segmentation matching, comprising the steps of:
Step 102, acquiring an aerospace remote sensing image dataset and a general image basic model trained according to the aerospace remote sensing image dataset, wherein the aerospace remote sensing image dataset comprises a plurality of ground scene optical images shot by airborne cameras of an aerospace vehicle, and the general image basic model comprises a general image marking model, an open set target detection model and a general image segmentation model.
An aerospace remote sensing image dataset is acquired, comprising ground scene optical images shot by a plurality of aerospace vehicles (unmanned aerial vehicles and satellites) together with labels such as the ground object target types, target frames and target instance segmentation information contained in the image data. After the dataset is acquired, the general image marking model, the open set target detection model and the general image segmentation model are trained or fine-tuned on it; all three are neural network models based on deep learning. Because remote sensing imagery has its own modality characteristics, training or fine-tuning the corresponding models with labeled aerospace remote sensing images improves their generalization performance on remote sensing images.
Step 104, a navigation target library mainly comprising rigid targets and plane area targets is established, an aerial optical image with geocoding and ground elevation information is used as a reference image, a ground scene optical image is used as a real-time image, and ground object information in the reference image and the real-time image is respectively extracted according to a general image marking model to obtain target sets which are simultaneously existing in the real-time image, the reference image and the navigation target library.
The reference map carries absolute geographic coordinate information, such as longitude and latitude, and ground elevation information (a digital elevation model (DEM) or digital surface model (DSM)). The general image marking model includes, but is not limited to, the Recognize Anything Model (RAM); any model capable of identifying all typical object targets in the remote sensing image may be used.
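As a non-limiting sketch of how the reference image's geocoding and elevation data might be queried in practice (the patent does not prescribe a toolchain), the following snippet assumes the reference image and DEM are co-registered GeoTIFFs read with the rasterio library; pix_to_world is a hypothetical helper name:

```python
# Hedged sketch: look up the 3D geographic coordinates of a reference-image
# pixel from its geotransform and a co-registered DEM. Assumes GeoTIFF inputs
# and the rasterio library; not part of the patented method itself.
import rasterio

def pix_to_world(ref_path, dem_path, row, col):
    """Map a reference-image pixel (row, col) to (x, y, z) map coordinates."""
    with rasterio.open(ref_path) as ref, rasterio.open(dem_path) as dem:
        x, y = ref.xy(row, col)   # pixel centre -> map coordinates
        r, c = dem.index(x, y)    # same ground point in the DEM grid
        z = dem.read(1)[r, c]     # ground elevation at that point
    return float(x), float(y), float(z)
```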
Step 106, detecting positions of all semantic elements in the target set on the real-time image and the reference image by using the open set target detection model to obtain target frame sets of different elements.
The open set target detection model includes, but is not limited to, the Grounding-DINO model; any model may be used provided that it can detect the corresponding target locations in the image from an input text prompt and return them in the form of target rectangular frames.
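By way of illustration only, a minimal detection sketch is given below; the function names follow the public GroundingDINO repository's inference utilities (load_model, load_image, predict) and should be verified against that codebase, and the config/checkpoint paths are placeholders:

```python
# Hedged sketch: open-set detection of the target-set semantics with a
# text-prompted detector in the Grounding-DINO style. Paths are placeholders;
# the groundingdino API shown here should be verified against the repository.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")

def detect_targets(image_path, target_texts, box_thr=0.35, text_thr=0.25):
    """Return one set of boxes per semantic element of the target set T."""
    _, image = load_image(image_path)
    boxes_per_type = {}
    for text in target_texts:                # e.g. "building", "football field"
        boxes, logits, phrases = predict(
            model=model, image=image, caption=text,
            box_threshold=box_thr, text_threshold=text_thr)
        boxes_per_type[text] = boxes         # normalized cxcywh; convert to
    return boxes_per_type                    # corner pixels before step 108
```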
Step 108, segmenting the contours in all the target frames in the target frame set by using the general image segmentation model to obtain target region information.
The general image segmentation model includes, but is not limited to, the Segment Anything Model (SAM), which can generate an arbitrary image segmentation mask from text or spatial prompts, can segment images fully automatically when no text or spatial prompt is given, and can adapt to new target segmentation tasks in a zero-shot manner.
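A minimal box-prompted segmentation sketch follows, using the public segment_anything package; the checkpoint path is a placeholder, and the conversion of the four-corner target frames to SAM's xyxy box prompt is an assumption of this illustration:

```python
# Hedged sketch: box-prompted instance segmentation with the Segment Anything
# Model (SAM), one mask per detected target frame.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder
predictor = SamPredictor(sam)

def segment_in_frames(image_rgb, frames_xyxy):
    """image_rgb: HxWx3 uint8; frames_xyxy: list of (x0, y0, x1, y1) boxes."""
    predictor.set_image(image_rgb)
    masks = []
    for frame in frames_xyxy:
        m, scores, _ = predictor.predict(
            box=np.asarray(frame), multimask_output=False)
        masks.append(m[0])       # boolean HxW mask of the target's contour
    return masks
```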
In the visual navigation process, semantic extraction and instance segmentation are carried out on the reference image and the real-time image as follows. Common fixed targets in the flight scene are analyzed in advance, and a navigation target library consisting mainly of rigid targets and planar area targets (targets of this kind are more amenable to detection, segmentation and matching) is established. An aerial optical image with geocoding and ground elevation information is taken as the reference image, and the optical image of the ground shot by the aircraft in real time is taken as the real-time image. The general image marking model extracts the ground object information in the reference image and the real-time image respectively, giving the target set existing simultaneously in the real-time image, the reference image and the navigation target library. For all semantic elements in this ground object target set, the open set target detection model detects the positions of these elements on the real-time image and the reference image, giving the target frame sets of the different elements. Finally, for all target frames, the general image segmentation model segments the contours of the targets within each frame, giving more accurate target area information.
Step 110, carrying out feature point matching on the regions with the same semantics in the real-time image and the reference image based on the target region information and the image matching algorithm to obtain the matching point pairs between the reference image and the real-time image.
Such image matching algorithms include, but are not limited to, traditional feature point matching algorithms (e.g., SIFT, SURF, ORB, etc.) and deep learning models (e.g., SuperPoint+SuperGlue, D2-Net, DKM, LoFTR, etc.).
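To make the region-constrained matching concrete, the sketch below restricts a traditional SIFT matcher to one pair of same-semantic regions by passing the instance masks to OpenCV's detector; this is one possible realization, and any of the algorithms listed above could be substituted:

```python
# Hedged sketch: feature point matching limited to two regions with the same
# semantics, using each target's segmentation mask as the detection mask.
import cv2
import numpy as np

def match_same_semantic_region(gray_rt, mask_rt, gray_ref, mask_ref, ratio=0.75):
    """gray_*: uint8 grayscale images; mask_*: boolean/uint8 instance masks."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(gray_rt, mask_rt.astype(np.uint8))
    kp2, des2 = sift.detectAndCompute(gray_ref, mask_ref.astype(np.uint8))
    if des1 is None or des2 is None:
        return []
    pairs = []
    for knn in cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2):
        if len(knn) == 2 and knn[0].distance < ratio * knn[1].distance:
            m = knn[0]                          # Lowe ratio test passed
            pairs.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return pairs   # 2D-2D matching point pairs for this semantic region
```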
Step 112, establishing the relation between the two-dimensional matching points on the real-time image and the corresponding three-dimensional information of the reference image according to the matching point pairs between the reference image and the real-time image, calculating the position and attitude of the current camera through the PnP algorithm in combination with the onboard camera intrinsics, and calculating the pose of the aircraft according to the translation-rotation relation between the camera coordinate system and the aircraft coordinate system, so as to realize visual navigation of the aircraft.
Feature point matching is performed with the image matching algorithm on the regions of the same semantics in the real-time image and the reference image to obtain the corresponding matching point pairs. With the onboard camera intrinsics and the geographic coordinates and three-dimensional information of the reference image known, the current position and attitude of the onboard camera (the camera shooting the real-time image) are solved from these matching point pairs with a 2D-3D PnP algorithm, and the pose of the aircraft is then computed from the translation-rotation relation between the camera coordinate system and the aircraft coordinate system. PnP pose solving based on 2D-3D matching point pairs is a standard procedure in the visual navigation field, with methods including direct linear transformation (DLT), OPnP and EPnP. To further improve the robustness of the PnP algorithm, the PnP solution can be constrained with the aircraft's (error-bearing) inertial navigation parameters, avoiding the problem of PnP falling into a local extremum due to factors such as the observation geometry.
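The following sketch outlines the 2D-3D pose step with OpenCV's RANSAC-wrapped EPnP solver; lookup_3d (mapping a reference-image match point to geographic 3D coordinates via the geocoding and DEM) and the camera-to-body extrinsics R_cam2body, t_cam2body are assumptions of the illustration, not quantities fixed by the method:

```python
# Hedged sketch: solve the onboard camera pose from 2D-3D correspondences and
# convert it to the aircraft body frame. lookup_3d() is a hypothetical helper.
import cv2
import numpy as np

def solve_aircraft_pose(pairs, lookup_3d, K, R_cam2body, t_cam2body):
    if len(pairs) < 4:                        # PnP needs at least 4 points
        return None
    pts2d = np.array([p_rt for p_rt, _ in pairs], dtype=np.float64)
    pts3d = np.array([lookup_3d(p_ref) for _, p_ref in pairs], dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d, pts2d, K, None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R_cam, _ = cv2.Rodrigues(rvec)            # world -> camera rotation
    cam_pos = (-R_cam.T @ tvec).ravel()       # camera position in world frame
    R_body = R_cam2body @ R_cam               # world -> body rotation
    # t_cam2body: body origin expressed in the camera frame (assumed extrinsic)
    body_pos = cam_pos + R_cam.T @ np.asarray(t_cam2body).ravel()
    return R_body, body_pos
```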
The aircraft visual navigation method based on consistent semantic constraint instance segmentation matching first establishes a navigation target library consisting mainly of rigid targets and planar area targets, takes an aerial optical image with geocoding and ground elevation information as the reference image and a ground scene optical image as the real-time image, and extracts the ground object information in the reference image and the real-time image with the general image marking model to obtain the target set existing simultaneously in the real-time image, the reference image and the navigation target library. By constructing the navigation database in advance, the common, fixed, rigid or planar area targets in the flight scene can be screened out in advance; detecting, segmenting and matching these targets further improves the practicability and matching accuracy of the algorithm. The target types are then input into the open set target detection model to obtain the target frame positions, and finally the target text and spatial prompts are input into the general image segmentation model to obtain fine-grained segmentation results for the corresponding targets. Semantic extraction and instance segmentation of the reference image and the real-time image during visual navigation are thereby realized with general-purpose image intelligent processing models, and fine contour information of all distribution areas of the key targets in the remote sensing images is obtained at the same time. Performing feature point matching separately on regions of the same semantics in the real-time image and the reference image makes full use of the semantic information of image regions as a prior, overcoming interference from factors such as viewing-angle change, illumination change, low overlap rate and sensor modality difference and avoiding mismatching between regions of different semantics. The method can be applied to the monocular visual navigation tasks of various flight platforms such as unmanned aerial vehicles and airships, and has broad application prospects and economic value.
In one embodiment, the general image marking model is a neural network model based on deep learning, and extracting the ground object information in the reference image and the real-time image respectively according to the general image marking model to obtain the target set existing simultaneously in the real-time image, the reference image and the navigation target library comprises the following steps:
All typical object targets in the input image are identified with the general image marking model, returning the target set T_r in the real-time image and the target set T_b in the reference image; the intersection of the sets T_r, T_b and the navigation target library T_d is then taken, giving the target set T existing simultaneously in the real-time image, the reference image and the navigation target library:
T = T_r ∩ T_b ∩ T_d;
where T_r and T_b represent the target texts in the real-time image and the reference image respectively, and T represents the target texts with the same semantics present in the real-time image, the reference image and the navigation target library.
In particular embodiments, the target text is, for example, "building", "road", "football field", and the like.
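For illustration, the set intersection of step 104 reduces to a few lines; tag_image is a hypothetical wrapper around the general image marking model, and the library contents shown are example entries only:

```python
# Hedged sketch: intersect the tags returned for both images with the
# pre-built navigation target library, i.e. T = T_r ∩ T_b ∩ T_d.
NAV_TARGET_LIB = {"building", "road", "football field", "bridge"}  # example T_d

def common_target_set(img_rt, img_ref, tag_image):
    T_r = set(tag_image(img_rt))    # target texts in the real-time image
    T_b = set(tag_image(img_ref))   # target texts in the reference image
    return T_r & T_b & NAV_TARGET_LIB
```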
In one embodiment, the open set target detection model is a neural network model based on deep learning, and detecting the positions of all semantic elements in the target set on the real-time image and the reference image with the open set target detection model to obtain the target frame sets of the different elements comprises the following steps:
The open set target detection model detects the corresponding targets in an image according to the input text prompt and returns their positions in the form of target rectangular frames. For a target type t_i in the real-time image, the position of its j-th target is B_r^{i,j} = {(x_k, y_k), k = 1, ..., 4}, j = 1, ..., N_r^i, where N_r^i denotes the number of targets of type t_i in the real-time image and (x_k, y_k) are the pixel coordinates of the four corner points of the rectangular frame;
For a target type t_i in the reference image, the position of its j-th target is B_b^{i,j} = {(x_k, y_k), k = 1, ..., 4}, j = 1, ..., N_b^i, where N_b^i denotes the number of targets of type t_i in the reference image and (x_k, y_k) are the pixel coordinates of the four corner points of the rectangular frame.
In one embodiment, the general image segmentation model can generate an arbitrary image segmentation mask from text or spatial prompts, can segment images fully automatically when no prompt is given, and can adapt to new target segmentation tasks in a zero-shot manner; segmenting the contours in all target frames in the target frame set with the general image segmentation model to obtain target region information comprises the following steps:
For a target type t_i in the real-time image, the text of the target type is taken as the prompt word, and image segmentation is performed within the rectangular frame B_r^{i,j} with the general image segmentation model, giving the target area mask M_r^{i,j} corresponding to B_r^{i,j}; the same operation is performed for the target type t_i in the reference image, giving the target area mask M_b^{i,j} corresponding to B_b^{i,j}.
In one embodiment, the image matching algorithm comprises, but is not limited to, a traditional feature point matching algorithm or a deep learning model, and the method comprises the steps of performing feature point matching on the regions with the same semantic meaning in the real-time image and the reference image based on the target region information and the image matching algorithm to obtain a matching point pair between the reference image and the real-time image, and comprises the following steps:
For a target t_i in the set T, the corresponding regions in the real-time image are the masks M_r^{i,j} and the corresponding regions in the reference image are the masks M_b^{i,j}. The image matching algorithm performs feature point matching between the regions of target t_i that have the same semantics, giving the matching point correspondences {(p_r^{i,k}, p_b^{i,k})}, where p_r^{i,k} and p_b^{i,k} represent the k-th pair of matched feature points of target t_i on the real-time image and the reference image respectively; the matching point pairs between the reference image and the real-time image are thus obtained.
In one embodiment, establishing the relation between the two-dimensional matching points on the real-time image and the corresponding three-dimensional information of the reference image according to the matching point pairs between the reference image and the real-time image, and then calculating the position and attitude of the current camera through the PnP algorithm in combination with the onboard camera intrinsics, includes:
From the matching point pairs {(p_r^{i,k}, p_b^{i,k})} between the reference image and the real-time image, the relation between the two-dimensional matching points on the real-time image and the corresponding three-dimensional information of the reference image is established, and, with the onboard camera intrinsics and the geographic coordinates and three-dimensional information of the reference image known, the current position and attitude of the onboard camera are solved with the 2D-3D PnP algorithm.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily executed in sequence but may be executed in turn or alternately with at least a portion of the other steps, or of the sub-steps or stages of other steps.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.