
CN116468786B - Semantic SLAM method based on point-line combination and oriented to dynamic environment - Google Patents

Semantic SLAM method based on point-line combination and oriented to dynamic environment

Info

Publication number
CN116468786B
CN116468786B (application CN202211619407.3A)
Authority
CN
China
Prior art keywords
point
matching
line
feature
points
Prior art date
Legal status
Active
Application number
CN202211619407.3A
Other languages
Chinese (zh)
Other versions
CN116468786A (en)
Inventor
杨健
董军宇
范浩
饶源
时正午
杨凯
李丛
刘伊美
Current Assignee
Ocean University of China
Original Assignee
Ocean University of China
Priority date
Filing date
Publication date
Application filed by Ocean University of China
Priority to CN202211619407.3A
Publication of CN116468786A
Application granted
Publication of CN116468786B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 Matching configurations of points or features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a dynamic-environment-oriented semantic SLAM method based on point-line combination, which improves on ORB-SLAM3. The method extracts point and line features and uses them for accurate and robust matching and relocalization in scenes that lack texture or undergo illumination changes, so as to estimate the camera pose. Localization and relocalization errors are reduced, and the algorithm addresses the problems of failed feature-point detection and difficult localization in weak-texture regions and scenes with illumination changes.

Description

Semantic SLAM method based on point-line combination and oriented to dynamic environment
Technical Field
The invention relates to the field of computer vision, in particular to a semantic SLAM method based on point-line combination and oriented to a dynamic environment.
Background
Simultaneous Localization and Mapping (SLAM) refers to a robot, operating in an unknown environment, collecting information about its surroundings with onboard sensors, estimating its own position by algorithm, and building a map of the surrounding environment. Visual SLAM mainly uses cameras, including monocular, binocular and RGB-D cameras, to acquire data. Camera sensors are cost-effective, small and low-power, and can capture rich environmental information, which has made visual SLAM a popular research field in recent years.
Conventional visual SLAM algorithms achieve good feature matching in static scenes, but mismatches occur in dynamic scenes and introduce large errors into the localization and mapping of the SLAM system. To address the reduced localization accuracy and robustness of a SLAM system when moving objects are present in the application scene, a semantic SLAM method and system based on feature points and feature lines is proposed.
Existing semantic SLAM techniques mainly target scenes containing dynamic objects. They typically either delete all pixels belonging to a priori dynamic objects and use the remaining pixels for feature extraction and subsequent localization, or delete all dynamic feature points and use only static feature points for matching and back-end processing. Such methods can improve camera localization accuracy in texture-rich dynamic scenes, but in dynamic scenes with low texture and strong illumination, relying only on feature points and semantic information makes it difficult to obtain enough data; this easily causes the SLAM system to lose tracking and reduces localization accuracy.
Vision-based SLAM research has made great progress, for example ORB-SLAM2 (Oriented FAST and Rotated BRIEF SLAM) and LSD-SLAM (Large-Scale Direct monocular SLAM). However, these algorithms generally rest on a strong assumption: a static working environment with many features and no obvious illumination changes, which strictly limits the application environment. This assumption restricts the applicability of visual SLAM systems in real scenes. When the environment is a dynamic, weak-texture region with illumination changes, feature points are scene-sensitive and hard to detect; the accuracy and robustness of camera pose estimation decrease, vision-based localization becomes erroneous, and the three-dimensional reconstruction results deviate significantly.
The camera is typically in motion while a mobile robot localizes and maps with it, which makes classical motion-segmentation methods such as background subtraction unusable in visual SLAM. Early SLAM systems mostly relied on data-optimization methods to reduce the effect of dynamic objects. A Random Sample Consensus (RANSAC) algorithm is used to roughly estimate the fundamental matrix between two frames; semantic information is combined with motion-consistency detection results to build a two-stage semantic knowledge base, and all feature points inside dynamic contours are removed as noise or outliers. Inter-frame feature-point matches on dynamic objects are also rejected with RANSAC, reducing the influence of dynamic objects on the SLAM system to some extent. These methods all implicitly assume that objects in the image are mostly static, and they fail once the data produced by dynamic objects exceeds a certain threshold.
In the prior art, research on visual localization and robot navigation in feature-rich scenes such as cities and indoor environments has made some progress, but much of it remains insufficient. For low-texture scenes with geometric features and illumination changes, visual localization still has the following problems:
(1) In feature detection, existing methods are affected by problems such as occlusion and missing parts of objects, and complete geometric features are difficult to detect from the image, making it hard to compute the camera pose;
(2) Existing methods are affected by the scarcity of texture and feature points in low-texture images, so image features are difficult to extract or are matched incorrectly, SLAM tracking and relocalization fail, and camera pose estimation degrades;
(3) In regions with obvious illumination changes, feature-point detection is sensitive, and problems such as undetectable or unmatched feature points easily occur, resulting in inaccurate camera poses.
To address this, the method combines Mask R-CNN with multi-view geometry to achieve instance segmentation and rejection of dynamic targets, identifies dynamic feature points, removes the interference of dynamic targets in feature matching, and eliminates their influence on the SLAM system.
Disclosure of Invention
The invention improves on ORB-SLAM3 and provides a semantic SLAM method based on point and line features. Compared with point features, lines provide more geometric structure information about the environment, and jointly optimizing the camera pose with points and lines improves localization accuracy and robustness. The method extracts point and line features and uses them for accurate and robust matching and relocalization in scenes lacking texture or undergoing illumination changes, so as to estimate the camera pose; localization and relocalization errors are reduced, and the algorithm solves the problems of failed feature-point detection and difficult localization in weak-texture regions and illumination-change scenes.
The invention is realized by the following technical scheme: a dynamic-environment-oriented semantic SLAM method based on point-line combination, comprising the following steps:
Step S1: acquire the image stream of the scene and feed it frame by frame into a CNN network; segment objects with a priori dynamic properties pixel by pixel, separate the dynamic objects from the scene to obtain key frame images, and use information from the previous several frames to complete the static scene occluded by dynamic targets;
Step S2: extract feature points and feature lines from the key frame images obtained in step S1, and build a local map around the current frame image, including key frame images sharing a common viewpoint with the current frame and the adjacent frames of those key frames; search these frames for feature points and line segments matching the current frame; then perform a dynamic-consistency check on the a priori dynamic objects, remove feature points and feature lines on dynamic objects, keep those on static objects, and match with the remaining static feature points and static lines;
Step S3: match the feature points and feature lines from step S2 while filtering out incorrectly matched points and lines, obtain correct matching point pairs and line pairs, and use the matching point pairs to obtain the initial camera pose;
Step S4: compute the camera pose of the current frame from the matching point pairs and line pairs obtained in step S3, and obtain an accurate camera pose estimate by minimizing the reprojection error of the point pairs and line pairs;
Step S5: build a local map of the scene from key frame images, perform instance segmentation on every frame, merge the feature points and feature lines within each instance into the corresponding instance, locate the camera pose with the feature points and feature lines, and compute point clouds of objects and the scene to obtain a sparse point cloud map;
Step S6: perform pose optimization with loop-closure detection, correct drift errors, and obtain a more accurate camera pose estimate.
As a preferred scheme, step S1 extracts feature points and feature lines from the static region of the key frame image, specifically as follows: ORB feature points are used to extract features of the static image region, and ORB descriptors are computed at the same time to obtain the feature points and descriptors of the static region; line features are then extracted from the image with dynamic objects removed, using a Transformer network structure that fuses feature information at different scales through a series of up-sampling and down-sampling operations to obtain the line features of the static image region.
Further, the extracted line features use a horizontal distance d_x and a vertical distance d_y to generate a vector v = (d_x, d_y) that predicts the positions of the two endpoints of a single line segment, giving the line feature, where (x_1, y_1) and (x_2, y_2) denote the coordinates of the left and right endpoints of the segment, (x_m, y_m) is the midpoint coordinate of the segment, and v represents the vector relating the right endpoint coordinates (x_2, y_2) to the midpoint coordinates (x_m, y_m). In this method d_x and d_y are expressed as: d_x = x_2 − x_m, d_y = y_2 − y_m.
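For illustration only, the following minimal sketch (Python/NumPy, with the symbol names introduced above) shows how the two endpoints of a segment are recovered from the predicted midpoint and the displacement vector v = (d_x, d_y); it is a sketch of the parameterization, not of the line-detection network itself:

```python
import numpy as np

def endpoints_from_midpoint(x_m, y_m, d_x, d_y):
    """Recover the two endpoints of a line segment from its midpoint (x_m, y_m)
    and the displacement vector v = (d_x, d_y) from the midpoint to the right
    endpoint; the segment is symmetric about its midpoint."""
    right = np.array([x_m + d_x, y_m + d_y])   # (x_2, y_2)
    left = np.array([x_m - d_x, y_m - d_y])    # (x_1, y_1)
    return left, right

# Example: midpoint (100, 50), displacement (30, -10)
p1, p2 = endpoints_from_midpoint(100.0, 50.0, 30.0, -10.0)
print(p1, p2)   # [ 70.  60.] [130.  40.]
```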
As a preferred solution, the matching of feature points and feature lines in step S3 specifically includes the following steps. Feature-point matching: ORB descriptors are generated, and a fast nearest-neighbour search finds, in the current frame, the feature point with the closest descriptor distance as the matching point; mismatched pairs are then rejected. When the matching descriptor distance is greater than a threshold γ, or the ratio of the best matching distance to the second-best matching distance is smaller than 1, meaning the second match is comparable to the first, the matching pair is considered prone to mismatching and is discarded. Feature-line matching: 2D-2D matching line pairs are obtained through geometric constraints, mapped directly into 3D space after outlier rejection, and accurate 2D-3D line matching pairs are then obtained by minimizing the reprojection error.
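For illustration only, the following sketch shows this kind of descriptor matching and rejection with OpenCV's ORB detector and a brute-force Hamming matcher; the concrete values of gamma and the ratio threshold are assumptions made for the sketch, not values fixed by the method:

```python
import cv2

def match_orb(img_prev, img_cur, gamma=64, ratio=0.8):
    """Match ORB descriptors between two frames and reject ambiguous pairs:
    a match is dropped when its descriptor distance exceeds gamma, or when
    its distance is too close to that of the second-best candidate."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_cur, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(des1, des2, k=2)    # two nearest neighbours per query

    good = []
    for pair in knn:
        if len(pair) < 2:
            continue
        best, second = pair
        if best.distance > gamma:                     # absolute distance gate
            continue
        if best.distance > ratio * second.distance:   # ambiguous: best ~ second best
            continue
        good.append(best)
    return kp1, kp2, good
```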
As a preferred solution, the optimization of the camera pose in step S4 by minimizing the reprojection errors of the point pairs and line pairs is implemented as follows:
The pose is jointly optimized with points and lines, and the minimized reprojection error is defined as
E(T) = λ_p Σ_{j=1}^{M} ‖ p_j − f_p(T, P_j) ‖² + λ_l Σ_{i=1}^{N} e_l( l_i, f_l(T, L_i) )²
where N denotes the number of 2D-3D matching line pairs (and M the number of matching point pairs), the function f_l(T, L_i) is the projection of the 3D line L_i onto the 2D plane under camera pose T, the angle error e_l is defined by the two planes π_1 and π_2, the function f_p(T, P_j) is the projection of the 3D point P_j onto the 2D plane, p_j and l_i are the observed 2D point and 2D line, and λ_p and λ_l are given weight values. The camera pose is optimized by minimizing this reprojection error.
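For illustration only, a minimal numeric sketch of such a joint point-line cost is given below (Python/NumPy). The pinhole projection model, the use of the angle between the observed and projected line directions in the image plane as the line error, and the weights lam_p and lam_l are simplifying assumptions made for this sketch; the description above defines the angular error through the two planes π_1 and π_2, and the actual optimization would be carried out with a nonlinear least-squares solver rather than by direct evaluation:

```python
import numpy as np

def project(K, T, P):
    """Project a 3D point P (world frame) into the image with pose T = (R, t)
    and intrinsic matrix K."""
    R, t = T
    p = K @ (R @ np.asarray(P) + t)
    return p[:2] / p[2]

def point_line_cost(T, K, pts_3d, pts_2d, lines_3d, lines_2d,
                    lam_p=1.0, lam_l=1.0):
    """Joint reprojection error: squared pixel error for each point pair plus
    a squared angular error between each observed 2D line and the projection
    of its matched 3D line (both lines given by their two endpoints)."""
    err = 0.0
    for P, p in zip(pts_3d, pts_2d):                    # point term
        err += lam_p * float(np.sum((np.asarray(p) - project(K, T, P)) ** 2))
    for (A, B), (a, b) in zip(lines_3d, lines_2d):      # line term
        a_proj, b_proj = project(K, T, A), project(K, T, B)
        d_obs = np.asarray(b, float) - np.asarray(a, float)
        d_obs = d_obs / np.linalg.norm(d_obs)
        d_prj = (b_proj - a_proj) / np.linalg.norm(b_proj - a_proj)
        angle = np.arccos(np.clip(abs(float(d_obs @ d_prj)), 0.0, 1.0))
        err += lam_l * angle ** 2
    return err
```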
In a preferred scheme, in step S5, point cloud processing is performed through local mapping and the camera pose is optimized by global relocalization to obtain a sparse point cloud reconstruction map, specifically as follows:
The BOW (bag-of-words) vector of each frame of the data stream is computed; the current frame image, including its BOW vector and covisibility information, is computed and inserted into the map, and the covisibility graph is updated. During tracking, each key frame carries information including feature points, feature lines and descriptors, and map points are then created by triangulation. Whether other key frames remain in the key frame queue is checked; if not, the map points are optimized, and local BA (bundle adjustment) is performed using the current frame, the key frame images sharing a common viewpoint with the current frame, and the adjacent frames of those key frames;
Candidate key frames corresponding to the current frame are then found; for each candidate key frame, a BOW dictionary is used to match the current frame with that key frame, initialization uses the matching relation between the current frame and the candidate key frame, and the pose is estimated with EPnP for each candidate key frame.
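For illustration only, the relocalization step described above might be sketched as follows (Python/OpenCV). The BoW ranking and descriptor matching that produce the 2D-3D correspondences for each candidate key frame are assumed to have been done already, and EPnP is invoked through OpenCV's solvePnPRansac; this is a sketch under those assumptions, not ORB-SLAM3's implementation:

```python
import cv2
import numpy as np

def relocalize_epnp(candidate_matches, K, min_inliers=15):
    """Global relocalization sketch. candidate_matches is a list of
    (pts_3d, pts_2d) correspondence sets, one per candidate key frame already
    selected and matched through the BoW dictionary (that part is omitted).
    For each candidate, the current-frame pose is estimated with EPnP inside
    RANSAC; the first candidate with enough inliers is accepted."""
    for pts_3d, pts_2d in candidate_matches:
        if len(pts_3d) < min_inliers:
            continue
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            np.asarray(pts_3d, dtype=np.float64),
            np.asarray(pts_2d, dtype=np.float64),
            K, None, flags=cv2.SOLVEPNP_EPNP)
        if ok and inliers is not None and len(inliers) >= min_inliers:
            return rvec, tvec        # rotation (Rodrigues vector) and translation
    return None                      # relocalization failed
```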
Further, in step S6, optimizing the camera pose through loop-closure detection specifically includes the following steps:
Loop-closure detection is performed on key frames using both point and line features. When three consecutive closed-loop candidate key frames all have high similarity with the current key frame, a loop candidate frame is obtained. For each candidate loop frame, its feature points and feature lines are first matched with those of the current frame; a similarity transformation matrix is then solved from the three-dimensional information corresponding to the feature points and feature lines. If the loop frame contains enough inlier points and inlier lines, Sim(3) optimization is performed; loop correction is carried out with the loop candidate frames, the feature-point constraints and line-segment constraints are optimized, and the camera pose after joint point-line optimization is obtained.
(1) Compared with the prior art, the technical scheme adopted by the invention has the following beneficial effects: the invention improves on ORB-SLAM3 and proposes a SLAM algorithm based on feature points, feature lines and semantic information; it combines Mask R-CNN with multi-view geometry to achieve instance segmentation and rejection of dynamic targets, identifies dynamic feature points and feature lines, removes the interference of dynamic targets in feature matching, eliminates their influence on the SLAM system, and completes the static scene occluded by dynamic targets using information from the previous several frames;
(2) The invention provides a semantic SLAM system based on feature points and feature lines that extracts line features with a Transformer structure; the line features extracted in this way are more accurate than those extracted by traditional methods;
(3) Compared with point features, lines provide more geometric structure information about the environment. By extracting both point and line features, the method achieves more accurate and robust matching in weak-texture and illumination-change scenes, realizes camera pose estimation, and reduces localization and relocalization errors; the algorithm solves the problem of difficult localization in low-texture scenes.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a feature line detection diagram;
FIG. 2 is a flow chart of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.
The semantic SLAM method based on the dotted line combination for the dynamic environment according to the embodiment of the present invention is specifically described below with reference to fig. 1 to 2.
As shown in fig. 1 and fig. 2, the invention provides a dynamic-environment-oriented semantic SLAM method based on point-line combination, characterized by comprising the following steps:
Step S1: acquire the image stream of the scene and feed it frame by frame into a CNN network; segment objects with a priori dynamic properties, such as pedestrians, vehicles and fish, pixel by pixel; separate the dynamic objects in the scene to obtain key frame images, and use information from the previous several frames to complete the static scene occluded by dynamic targets. Feature points and feature lines are then extracted from the static region of the key frame image as follows: ORB feature points are used to extract features of the static image region, and ORB descriptors are computed at the same time to obtain the feature points and descriptors of the static region; line features are extracted from the image with dynamic objects removed, using a Transformer network structure that fuses feature information at different scales through a series of up-sampling and down-sampling operations to obtain the line features of the static image region. If line features were extracted by using the segment length l and angle θ to obtain the two endpoints, a small change of the angle would, for a long segment, strongly shift the endpoint positions and cause large line errors; the method therefore uses a horizontal distance d_x and a vertical distance d_y to generate a vector v = (d_x, d_y) that predicts the positions of the two endpoints of a single segment, where (x_1, y_1) and (x_2, y_2) denote the coordinates of the left and right endpoints, (x_m, y_m) is the midpoint coordinate, and v represents the vector relating the right endpoint coordinates (x_2, y_2) to the midpoint coordinates (x_m, y_m). In this method d_x and d_y are expressed as: d_x = x_2 − x_m, d_y = y_2 − y_m.
Step S2: extract feature points and feature lines from the key frame images obtained in step S1, and build a local map around the current frame image, including key frame images sharing a common viewpoint with the current frame and the adjacent frames of those key frames; search these frames for feature points and line segments matching the current frame; then perform a dynamic-consistency check on the a priori dynamic objects, remove feature points and feature lines on dynamic objects, keep those on static objects, and match with the remaining static feature points and static lines;
step S3: matching the characteristic points and the characteristic lines in the step S2, filtering at the same time, removing the points and the lines which are incorrectly matched to obtain correct matching point pairs and line pairs, and obtaining the initial camera pose by using the matching point pairs; the matching of the feature points and the feature lines specifically comprises the following steps: the feature point matching is to find out a feature point with the closest descriptor distance as a matching point in the current frame through quick nearest neighbor search by generating ORB descriptors, then to reject the mismatching point pair, when the matching descriptor distance is larger than a threshold gamma or the ratio of the optimal matching point distance to the second optimal matching point distance is smaller than 1, the second matching point is equivalent to the first matching point, then the matching point pair is considered to be easy to be mismatched, and the matching point pair is rejected; the matching of the characteristic lines is to obtain 2D-2D matching line pairs through geometric constraint, map the 2D-2D matching line pairs to a 3D space directly through outlier rejection, and then obtain accurate 2D-3D line matching pairs by minimizing the reprojection error. The initial camera pose calculation specifically comprises the following steps: and calculating a basic matrix and an essential matrix through the feature points and the feature lines, and obtaining a relatively accurate pose transformation matrix between cameras through SVD decomposition.
Step S4: compute the camera pose of the current frame from the matching point pairs and line pairs obtained in step S3, and obtain an accurate camera pose estimate by minimizing the reprojection error of the point pairs and line pairs. The camera pose is optimized by minimizing the reprojection error as follows:
The pose is jointly optimized with points and lines, and the minimized reprojection error is defined as
E(T) = λ_p Σ_{j=1}^{M} ‖ p_j − f_p(T, P_j) ‖² + λ_l Σ_{i=1}^{N} e_l( l_i, f_l(T, L_i) )²
where N denotes the number of 2D-3D matching line pairs (and M the number of matching point pairs), the function f_l(T, L_i) is the projection of the 3D line L_i onto the 2D plane under camera pose T, the angle error e_l is defined by the two planes π_1 and π_2, the function f_p(T, P_j) is the projection of the 3D point P_j onto the 2D plane, p_j and l_i are the observed 2D point and 2D line, and λ_p and λ_l are given weight values; the camera pose is optimized by minimizing this reprojection error.
Step S5: build a local map of the scene from key frame images, perform instance segmentation on every frame, merge the feature points and feature lines within each instance into the corresponding instance, locate the camera pose with the feature points and feature lines, and compute point clouds of objects and the scene; point cloud processing is performed through local mapping, and the camera pose is optimized by global relocalization to obtain a sparse point cloud reconstruction map, specifically as follows:
The BOW (bag-of-words) vector of each frame of the data stream is computed; the current frame image, including its BOW vector and covisibility information, is computed and inserted into the map, and the covisibility graph is updated. During tracking, each key frame carries information including feature points, feature lines and descriptors; since not every feature point becomes a 3D map point, unqualified feature points and feature lines are removed first, and map points are then created by triangulation. Whether other key frames remain in the key frame queue is checked; if not, the map points are optimized, and local BA (bundle adjustment) is performed using the current frame, the key frame images sharing a common viewpoint with the current frame, and the adjacent frames of those key frames;
Candidate key frames corresponding to the current frame are then found; for each candidate key frame, a BOW dictionary is used to match the current frame with that key frame, initialization uses the matching relation between the current frame and the candidate key frame, and the pose is estimated with EPnP for each candidate key frame.
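For illustration only, the creation of new map points by triangulation mentioned above could be sketched as follows (Python/OpenCV), given the poses of two key frames and their matched static feature points; parallax and depth checks performed by a full local-mapping thread are omitted, so this is a simplified sketch rather than the method's implementation:

```python
import cv2
import numpy as np

def triangulate_map_points(K, pose1, pose2, pts1, pts2):
    """Create new map points by triangulating matched feature points between
    two key frames. pose = (R, t) maps world coordinates into the camera
    frame (R: 3x3, t: length-3 arrays); pts1 / pts2 are Nx2 pixel matches."""
    P1 = K @ np.hstack([pose1[0], np.asarray(pose1[1]).reshape(3, 1)])
    P2 = K @ np.hstack([pose2[0], np.asarray(pose2[1]).reshape(3, 1)])
    X_h = cv2.triangulatePoints(P1, P2,
                                np.asarray(pts1, dtype=np.float64).T,
                                np.asarray(pts2, dtype=np.float64).T)
    return (X_h[:3] / X_h[3]).T               # Nx3 world points
```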
Step S6: perform pose optimization with loop-closure detection, correct drift errors, and obtain a more accurate camera pose estimate. This specifically includes the following steps:
Loop-closure detection is performed on key frames using both point and line features. When three consecutive closed-loop candidate key frames all have high similarity with the current key frame, a loop candidate frame is obtained. For each candidate loop frame, its feature points and feature lines are first matched with those of the current frame; a similarity transformation matrix is then solved from the three-dimensional information corresponding to the feature points and feature lines. If the loop frame contains enough inlier points and inlier lines, Sim(3) optimization is performed; loop correction is carried out with the loop candidate frames, the feature-point constraints and line-segment constraints are optimized, and the camera pose after joint point-line optimization is obtained.
In the description of the present invention, the term "plurality" means two or more, unless explicitly defined otherwise, the orientation or positional relationship indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationship shown in the drawings, merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention; the terms "coupled," "mounted," "secured," and the like are to be construed broadly, and may be fixedly coupled, detachably coupled, or integrally connected, for example; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present specification, the terms "one embodiment," "some embodiments," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A dynamic-environment-oriented semantic SLAM method based on point-line combination, characterized by comprising the following steps:
Step S1: acquire the image stream of the scene and feed it frame by frame into a CNN network; segment objects with a priori dynamic properties pixel by pixel, separate the dynamic objects in the scene to obtain key frame images, and use information from the previous several frames to complete the static scene occluded by dynamic targets; extract feature points and feature lines from the static region of the key frame image, specifically as follows: ORB feature points are used to extract features of the static image region, and ORB descriptors are computed at the same time to obtain the feature points and descriptors of the static region; line features are extracted from the image with dynamic objects removed, using a Transformer network structure that fuses feature information at different scales through a series of up-sampling and down-sampling operations to obtain the line features of the static image region; the extracted line features use a horizontal distance d_x and a vertical distance d_y to generate a vector v = (d_x, d_y) that predicts the positions of the two endpoints of a single line segment, where (x_1, y_1) and (x_2, y_2) denote the coordinates of the left and right endpoints of the segment, (x_m, y_m) is the midpoint coordinate, and v represents the vector relating the right endpoint coordinates (x_2, y_2) to the midpoint coordinates (x_m, y_m), with d_x = x_2 − x_m and d_y = y_2 − y_m;
Step S2: extract feature points and feature lines from the key frame images obtained in step S1, and build a local map around the current frame image, including key frame images sharing a common viewpoint with the current frame and the adjacent frames of those key frames; search these frames for feature points and line segments matching the current frame; then perform a dynamic-consistency check on the a priori dynamic objects, remove feature points and feature lines on dynamic objects, keep those on static objects, and match with the remaining static feature points and static lines;
Step S3: match the feature points and feature lines from step S2 while filtering out incorrectly matched points and lines, obtain correct matching point pairs and line pairs, and use the matching point pairs to obtain the initial camera pose; the matching of feature points and feature lines specifically includes: feature-point matching generates ORB descriptors and, by fast nearest-neighbour search, finds the feature point in the current frame with the closest descriptor distance as the matching point, then rejects mismatched pairs: when the matching descriptor distance is greater than a threshold γ, or the ratio of the best matching distance to the second-best matching distance is smaller than 1, meaning the second match is comparable to the first, the matching pair is considered prone to mismatching and is discarded; feature-line matching obtains 2D-2D matching line pairs through geometric constraints, maps them directly into 3D space after outlier rejection, and then obtains accurate 2D-3D line matching pairs by minimizing the reprojection error;
Step S4: compute the camera pose of the current frame from the matching point pairs and line pairs obtained in step S3, and obtain an accurate camera pose estimate by minimizing the reprojection error of the point pairs and line pairs; the pose is jointly optimized with points and lines, and the minimized reprojection error is defined as
E(T) = λ_p Σ_{j=1}^{M} ‖ p_j − f_p(T, P_j) ‖² + λ_l Σ_{i=1}^{N} e_l( l_i, f_l(T, L_i) )²
where N denotes the number of 2D-3D matching line pairs (and M the number of matching point pairs), the function f_l(T, L_i) is the projection of the 3D line L_i onto the 2D plane under camera pose T, the angle error e_l is defined by the two planes π_1 and π_2, the function f_p(T, P_j) is the projection of the 3D point P_j onto the 2D plane, and λ_p and λ_l are given weight values; the camera pose is optimized by minimizing this reprojection error;
Step S5: build a local map of the scene from key frame images, perform instance segmentation on every frame, merge the feature points and feature lines within each instance into the corresponding instance, locate the camera pose with the feature points and feature lines, and compute point clouds of objects and the scene to obtain a sparse point cloud map;
Step S6: perform pose optimization with loop-closure detection, correct drift errors, and obtain a more accurate camera pose estimate.
2. The dynamic-environment-oriented semantic SLAM method based on point-line combination according to claim 1, characterized in that in step S5, point cloud processing is performed through local mapping and the camera pose is optimized by global relocalization to obtain a sparse point cloud reconstruction map, specifically as follows:
the BOW vector of each frame of the data stream is computed; the current frame image, including its BOW vector and covisibility information, is computed and inserted into the map, and the covisibility graph is updated; during tracking, each key frame carries information including feature points, feature lines and descriptors, and map points are then created by triangulation; whether other key frames remain in the key frame queue is checked; if not, the map points are optimized, and local BA is performed using the current frame, the key frame images sharing a common viewpoint with the current frame, and the adjacent frames of those key frames;
candidate key frames corresponding to the current frame are found; for each candidate key frame, a BOW dictionary is used to match the current frame with that key frame, initialization uses the matching relation between the current frame and the candidate key frame, and the pose is estimated with EPnP for each candidate key frame.
3. The dynamic-environment-oriented semantic SLAM method based on point-line combination according to claim 2, characterized in that optimizing the camera pose through loop-closure detection in step S6 specifically includes:
loop-closure detection is performed on key frames using both point and line features; when three consecutive closed-loop candidate key frames all have high similarity with the current key frame, a loop candidate frame is obtained; for each candidate loop frame, its feature points and feature lines are first matched with those of the current frame, and a similarity transformation matrix is then solved from the three-dimensional information corresponding to the feature points and feature lines; if the loop frame contains enough inlier points and inlier lines, Sim(3) optimization is performed; loop correction is carried out with the loop candidate frames, the feature-point constraints and line-segment constraints are optimized, and the camera pose after joint point-line optimization is obtained.
CN202211619407.3A 2022-12-16 2022-12-16 Semantic SLAM method based on point-line combination and oriented to dynamic environment Active CN116468786B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211619407.3A CN116468786B (en) 2022-12-16 2022-12-16 Semantic SLAM method based on point-line combination and oriented to dynamic environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211619407.3A CN116468786B (en) 2022-12-16 2022-12-16 Semantic SLAM method based on point-line combination and oriented to dynamic environment

Publications (2)

Publication Number Publication Date
CN116468786A CN116468786A (en) 2023-07-21
CN116468786B true CN116468786B (en) 2023-12-26

Family

ID=87181281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211619407.3A Active CN116468786B (en) 2022-12-16 2022-12-16 Semantic SLAM method based on point-line combination and oriented to dynamic environment

Country Status (1)

Country Link
CN (1) CN116468786B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173342B (en) * 2023-11-02 2024-07-02 中国海洋大学 Underwater monocular and binocular camera-based natural light moving three-dimensional reconstruction device and method
CN117690192B (en) * 2024-02-02 2024-04-26 天度(厦门)科技股份有限公司 Abnormal behavior identification method and equipment for multi-view instance-semantic consensus mining

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489501A (en) * 2019-07-24 2019-11-22 西北工业大学 SLAM system rapid relocation algorithm based on line feature
CN110782494A (en) * 2019-10-16 2020-02-11 北京工业大学 Visual SLAM method based on point-line fusion
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
CN112132897A (en) * 2020-09-17 2020-12-25 中国人民解放军陆军工程大学 A visual SLAM method for semantic segmentation based on deep learning
CN112381890A (en) * 2020-11-27 2021-02-19 上海工程技术大学 RGB-D vision SLAM method based on dotted line characteristics
CN112396595A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Semantic SLAM method based on point-line characteristics in dynamic environment
CN112435262A (en) * 2020-11-27 2021-03-02 广东电网有限责任公司肇庆供电局 Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
CN112446882A (en) * 2020-10-28 2021-03-05 北京工业大学 Robust visual SLAM method based on deep learning in dynamic scene
CN113837277A (en) * 2021-09-24 2021-12-24 东南大学 A multi-source fusion SLAM system based on visual point and line feature optimization
WO2022041596A1 (en) * 2020-08-31 2022-03-03 同济人工智能研究院(苏州)有限公司 Visual slam method applicable to indoor dynamic environment
CN114283199A (en) * 2021-12-29 2022-04-05 北京航空航天大学 A point-line fusion semantic SLAM method for dynamic scenes
CN114627309A (en) * 2022-03-11 2022-06-14 长春工业大学 Visual SLAM method based on dotted line features in low texture environment
CN114708293A (en) * 2022-03-22 2022-07-05 广东工业大学 Robot motion estimation method based on deep learning point-line feature and IMU tight coupling
CN114862949A (en) * 2022-04-02 2022-08-05 华南理工大学 Structured scene vision SLAM method based on point, line and surface characteristics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7096274B2 (en) * 2018-06-07 2022-07-05 馭勢科技(北京)有限公司 Method and device for simultaneous self-position estimation and environment map creation

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489501A (en) * 2019-07-24 2019-11-22 西北工业大学 SLAM system rapid relocation algorithm based on line feature
CN110782494A (en) * 2019-10-16 2020-02-11 北京工业大学 Visual SLAM method based on point-line fusion
CN111402336A (en) * 2020-03-23 2020-07-10 中国科学院自动化研究所 Semantic S L AM-based dynamic environment camera pose estimation and semantic map construction method
WO2022041596A1 (en) * 2020-08-31 2022-03-03 同济人工智能研究院(苏州)有限公司 Visual slam method applicable to indoor dynamic environment
CN112132897A (en) * 2020-09-17 2020-12-25 中国人民解放军陆军工程大学 A visual SLAM method for semantic segmentation based on deep learning
CN112446882A (en) * 2020-10-28 2021-03-05 北京工业大学 Robust visual SLAM method based on deep learning in dynamic scene
CN112396595A (en) * 2020-11-27 2021-02-23 广东电网有限责任公司肇庆供电局 Semantic SLAM method based on point-line characteristics in dynamic environment
CN112435262A (en) * 2020-11-27 2021-03-02 广东电网有限责任公司肇庆供电局 Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
CN112381890A (en) * 2020-11-27 2021-02-19 上海工程技术大学 RGB-D vision SLAM method based on dotted line characteristics
CN113837277A (en) * 2021-09-24 2021-12-24 东南大学 A multi-source fusion SLAM system based on visual point and line feature optimization
CN114283199A (en) * 2021-12-29 2022-04-05 北京航空航天大学 A point-line fusion semantic SLAM method for dynamic scenes
CN114627309A (en) * 2022-03-11 2022-06-14 长春工业大学 Visual SLAM method based on dotted line features in low texture environment
CN114708293A (en) * 2022-03-22 2022-07-05 广东工业大学 Robot motion estimation method based on deep learning point-line feature and IMU tight coupling
CN114862949A (en) * 2022-04-02 2022-08-05 华南理工大学 Structured scene vision SLAM method based on point, line and surface characteristics

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Monocular visual simultaneous localization and mapping algorithm based on point and line features; Wang Dan et al.; Robot (No. 03); full text *

Also Published As

Publication number Publication date
CN116468786A (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN110223348B (en) Adaptive pose estimation method for robot scene based on RGB-D camera
CN112785702B (en) A SLAM method based on tightly coupled 2D lidar and binocular camera
CN110686677B (en) Global positioning method based on geometric information
CN110070615B (en) Multi-camera cooperation-based panoramic vision SLAM method
CN109059895B (en) Multi-mode indoor distance measurement and positioning method based on mobile phone camera and sensor
CN110322511B (en) Semantic SLAM method and system based on object and plane features
WO2024114119A1 (en) Sensor fusion method based on binocular camera guidance
CN104732518B (en) An Improved Method of PTAM Based on Ground Features of Intelligent Robot
CN112258600A (en) A simultaneous localization and map construction method based on vision and lidar
CN108615246B (en) Method for improving robustness of visual odometer system and reducing calculation consumption of algorithm
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
US11788845B2 (en) Systems and methods for robust self-relocalization in a visual map
CN107945265A (en) Real-time dense monocular SLAM method and systems based on on-line study depth prediction network
CN108519102B (en) A binocular visual odometry calculation method based on secondary projection
CN113223045B (en) Vision and IMU sensor fusion positioning system based on dynamic object semantic segmentation
CN108776989B (en) Low-texture planar scene reconstruction method based on sparse SLAM framework
CN111127524A (en) Method, system and device for tracking trajectory and reconstructing three-dimensional image
CN110097584A (en) The method for registering images of combining target detection and semantic segmentation
CN116468786B (en) Semantic SLAM method based on point-line combination and oriented to dynamic environment
CN113506318A (en) A 3D object perception method in vehicle edge scene
CN112419497A (en) Monocular vision-based SLAM method combining feature method and direct method
CN110032965A (en) Vision positioning method based on remote sensing images
CN114140527A (en) Dynamic environment binocular vision SLAM method based on semantic segmentation
CN112101160A (en) A Binocular Semantic SLAM Method for Autonomous Driving Scenarios
CN111998862A (en) Dense binocular SLAM method based on BNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant