Background Art
SLAM (Simultaneous Localization and Mapping) refers to localizing a robot while simultaneously building a map of its surroundings. SLAM is an active research field with wide applications in robotics, navigation, and many other areas. Visual SLAM essentially estimates camera motion from visual sensor information, such as sequences of frames from one or more cameras, while attempting to construct a map of the surrounding environment. Current approaches to the SLAM problem mount sensors on the robot body to estimate the robot's own motion and to gather feature information about the unknown environment, then fuse this information to achieve an accurate estimate of the robot pose together with a spatial model of the scene. Although SLAM systems employ several sensor types, including laser and vision, their processing pipeline generally comprises three parts: a front-end visual odometer, back-end optimization, and loop-closure detection.
A typical visual SLAM algorithm takes camera pose estimation as its main goal and reconstructs a 3D map through multiple-view geometry. To improve data-processing speed, some visual SLAM algorithms first extract sparse image features and realize visual odometry and loop-closure detection by matching feature points, for example visual SLAM based on SIFT (scale-invariant feature transform) features [13] and visual SLAM based on ORB (oriented FAST and rotated BRIEF) features. Owing to their good robustness, discriminative power, and fast processing speed, SIFT and ORB features are widely used in the visual SLAM field. Hand-engineered sparse image features nevertheless have many limitations. On the one hand, how to design sparse image features that optimally represent image information remains an unsolved major problem in computer vision; on the other hand, sparse image features still struggle with illumination changes, moving objects, changes in camera parameters, and environments that lack texture or have uniform texture. Traditional visual odometry (Visual Odometry, VO) essentially estimates motion from visual information, such as sequences of frames from one or more cameras. A common characteristic of most of this work is that motion is estimated through keypoint detection and tracking combined with camera geometry.
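The keypoint matching at the core of such traditional front ends can be sketched as nearest-neighbour descriptor matching with Lowe's ratio test. This is a minimal illustrative sketch, not a procedure specified by the invention; the function name and the ratio threshold of 0.8 are common-practice assumptions.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour descriptor matching with Lowe's ratio test,
    the matching step of keypoint-based VO front ends (e.g. SIFT/ORB).
    Descriptor arrays are (N, D); the ratio value is a common default."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Keep a match only when the best candidate is clearly closer
        # than the runner-up, which rejects ambiguous correspondences.
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

a = np.array([[0.0, 1.0], [5.0, 5.0]])
b = np.array([[0.1, 1.0], [4.0, 4.0], [9.0, 0.0]])
pairs = match_descriptors(a, b)
```

Matched pairs would then feed a geometric pose estimator; ambiguous keypoints are simply dropped rather than mismatched.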
In recent years, learning-based methods have shown desirable results in many fields of computer vision and can overcome the defects of traditional visual SLAM algorithms (sparse image features have difficulty coping with illumination changes, moving objects, camera parameter changes, and environments that lack texture or have uniform texture). Models such as convolutional neural networks (CNNs) have proved highly effective in various visual tasks, for example classification, localization, and depth estimation. Unsupervised feature-learning models have demonstrated the ability to represent local transformations in data learned through multiplicative interactions. Research shows that applying data pre-trained with an unsupervised model to a CNN can filter noise well and prevent overfitting.
Visual odometry (VO) provides the initial estimates of the trajectory and the map. VO, however, considers only the relationship between consecutive frames. Error inevitably accumulates over time, so the whole system drifts and long-term estimates become unreliable; in other words, a globally consistent trajectory and map cannot be constructed. A loop-closure detection module supplies constraints beyond consecutive frames, including constraints between temporally distant ones. The key to loop-closure detection is how to detect effectively that the camera has passed through the same place, which bears on the correctness of long-term estimation and mapping. Loop-closure detection therefore clearly improves the accuracy and robustness of the entire SLAM system.
Appearance-based loop-closure detection is essentially an image-similarity matching task: two images are feature-matched to verify whether they show the same place. The conventional approach to loop-closure detection builds a dictionary with the "bag-of-words" method and performs feature matching against it. CNNs now show excellent performance in a variety of classification tasks. Existing benchmark results show that deep CNN features from different layers consistently outperform SIFT in descriptor matching, suggesting that SIFT or SURF may no longer be the preferred descriptors for matching tasks. Our invention is therefore inspired by the excellent performance of CNNs in image classification and by the evidence of their feasibility for feature matching. It discards the traditional bag-of-words method and instead performs loop-closure detection with a hierarchical image-feature extraction method based on deep learning, represented by CNNs. Deep learning algorithms are the mainstream recognition algorithms in computer vision: they learn hierarchical feature representations of images with multi-layer neural networks and, compared with traditional recognition methods, can achieve higher accuracy in feature extraction and place recognition.
Summary of the Invention
In view of the above problems, the present invention proposes a full-cycle visual SLAM algorithm with CNN feature detection, in which both the front end (VO) and the loop-closure detection of the SLAM pipeline are handled by convolutional neural networks, realizing a SLAM system that applies deep learning over the complete cycle.
The full-cycle visual SLAM algorithm with CNN feature detection of the present invention comprises the following steps:
Step 1: scan the surrounding environment with a binocular camera; use part of the collected video streams as the training data set and part as the test data set.
Step 2: pre-train on the training data set collected in Step 1 with a synchrony-detection method.
Step 3: train a convolutional neural network to obtain local changes in speed and local changes in direction, thereby performing visual odometry.
Step 4: recover the camera motion path from the local speed and direction changes obtained in Step 3.
Step 5: perform loop-closure detection with a convolutional neural network to eliminate the accumulated error of the path prediction.
The invention has the following advantages:
1. The full-cycle visual SLAM algorithm with CNN feature detection of the present invention uses a convolutional neural network in the front end. Compared with conventional front-end algorithms, the learning-based approach replaces cumbersome analytic computation and needs no manual feature extraction or matching, making it concise and intuitive, with faster online operation.
2. The algorithm discards the traditional bag-of-words method of loop-closure detection and obtains good place recognition through accurate feature matching.
3. The algorithm learns deep features of the image through the neural network, so the recognition rate reaches a higher level. Compared with conventional visual SLAM algorithms, it improves loop-closure detection accuracy, represents richer image information, and is more robust to environmental changes such as illumination and season. It can compute the similarity between two image frames, realizing a more concise visual odometer, and while pre-training the neural network on a database completes the feature design, the matching classifier can be designed at the same time.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
The full-cycle visual SLAM algorithm with CNN feature detection of the present invention, as shown in Figure 1, comprises the following steps:
Step 1: scan the surrounding environment.
Move a binocular camera through a square area and collect image information of the real scene; the resulting video stream is transmitted to a host computer in real time. The binocular camera travels 1 to 2 laps, forming a closed loop, which facilitates the subsequent compensation of accumulated error by loop-closure detection. This process is repeated, and part of the collected video streams is used as the training data set and part as the test data set.
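The partition of recorded streams into training and test sets can be sketched as below. The patent does not fix a split ratio; the 80/20 ratio, the seed, and the file names here are illustrative assumptions.

```python
import random

def split_streams(streams, train_ratio=0.8, seed=0):
    """Partition collected video streams into training and test sets.

    `streams` is a list of stream identifiers (e.g. file paths).
    The ratio and seed are illustrative, not fixed by the invention.
    """
    rng = random.Random(seed)
    shuffled = streams[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_streams([f"run_{i:02d}.avi" for i in range(10)])
```

Splitting whole streams rather than individual frames keeps each loop of the square trajectory intact within one set.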
Step 2: pre-train on the training data set collected in Step 1 with a synchrony-detection method.
To obtain a joint representation of camera motion and image depth information, the training data set is pre-trained with an unsupervised learning model (SAE-D) using stochastic gradient descent. The synchrony-based SAE-D is a single-layer model that acquires features from the training data set through local, Hebbian-type learning.
The SAE-D model is trained on local binocular blocks cut at random from the training data set; each block has size 16*16*5 (space*space*time), yielding feature information that jointly represents motion and depth. This joint motion-and-depth feature information is then un-whitened back to image space, completing the pre-training. The pre-trained result is used to initialize the first layer of the CNN (convolutional neural network).
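Cutting the random 16*16*5 local binocular blocks that SAE-D trains on can be sketched as follows. The helper name, the block count, and the stacking of the left and right crops are illustrative assumptions; only the 16*16*5 (space*space*time) block size comes from the description above.

```python
import numpy as np

def sample_binocular_blocks(left, right, n_blocks=4, size=16, depth=5, seed=0):
    """Cut random local binocular blocks of size 16x16x5 (space*space*time).

    `left`/`right` are video volumes shaped (frames, height, width).
    Each sample stacks the left and right crops so that an unsupervised
    model can learn a joint motion-and-depth representation from them.
    """
    rng = np.random.default_rng(seed)
    frames, h, w = left.shape
    blocks = []
    for _ in range(n_blocks):
        t = rng.integers(0, frames - depth + 1)   # random start frame
        y = rng.integers(0, h - size + 1)         # random top-left corner
        x = rng.integers(0, w - size + 1)
        crop = lambda v: v[t:t + depth, y:y + size, x:x + size]
        blocks.append(np.stack([crop(left), crop(right)]))
    return np.stack(blocks)  # (n_blocks, 2 eyes, 5, 16, 16)

left = np.zeros((30, 120, 160))
right = np.zeros((30, 120, 160))
blocks = sample_binocular_blocks(left, right)
```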
Step 3: train a convolutional neural network (CNN) to perform visual odometry.
A convolutional neural network (CNN) is a supervised learning model. The CNN is trained to associate local depth-and-motion representations with local changes in speed and direction, thereby learning to perform visual odometry. Through the CNN architecture, the acquired joint representation of motion and depth is associated by training with the desired labels (changes in direction and speed).
In the present invention, the features obtained by SAE-D training in Step 2 are fed into the first layers of two CNNs of the same architecture to initialize them. The two CNNs output local changes in speed and local changes in direction respectively, and the remaining layers of the two CNNs associate the local speed changes and local direction changes with the desired labels.
The whole CNN has six layers. The first layer is a 5*5 convolutional layer that learns features from the left and right images separately. The following layer multiplies element-wise the features extracted by the left and right convolutions. The third layer is a 1*1 convolutional layer, the fourth layer is a pooling layer, the fifth layer is a fully connected layer, and the final output layer is a Softmax layer.
The input of the two CNNs is a 5-frame subsequence, and the target output is a vector representation of the local speed and direction changes. The effectiveness and accuracy of the local speed and direction information can be evaluated on the binocular KITTI data set.
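A shape-level sketch of the six-layer network above, with random placeholder weights, is given below. It is not the trained model: the input size, the scalar stand-in for the 1*1 convolution, the 2x2 mean pooling, and the eight output bins are all illustrative assumptions; only the layer ordering and the element-wise product of the left and right feature maps follow the description.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of a single-channel image with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def vo_cnn_forward(left, right, seed=0):
    """Trace of the six-layer VO network: 5x5 convolution per eye,
    element-wise product of the two maps, 1x1 convolution (a scalar
    here), 2x2 mean pooling, a fully connected layer, and a softmax
    output. Weights are random placeholders, not learned values."""
    rng = np.random.default_rng(seed)
    k = rng.standard_normal((5, 5)) * 0.1
    fl, fr = conv2d(left, k), conv2d(right, k)  # layer 1: per-eye 5x5 conv
    fused = fl * fr                             # layer 2: element-wise product
    scaled = fused * 0.5                        # layer 3: 1x1 conv (scalar)
    h, w = scaled.shape                         # layer 4: 2x2 mean pooling
    pooled = scaled[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    W = rng.standard_normal((8, pooled.size)) * 0.01
    logits = W @ pooled.ravel()                 # layer 5: fully connected
    return softmax(logits)                      # output layer: Softmax

p = vo_cnn_forward(np.ones((20, 20)), np.ones((20, 20)))
```

The element-wise product of the left and right branches is what lets the network encode stereo (depth) together with temporal motion before classification.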
Step 4: predict the camera motion path.
For the whole video stream, the speed and direction changes of every 5-frame subsequence obtained in Step 3 allow the camera motion path to be recovered discretely.
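The discrete path recovery is plain dead reckoning over the per-subsequence outputs, sketched below. The restriction to a 2-D plane, the units, and the function name are illustrative assumptions.

```python
import math

def integrate_path(changes, x=0.0, y=0.0, heading=0.0):
    """Recover a discrete 2-D camera path from per-subsequence local
    (speed, heading-change) pairs: accumulate the heading change, then
    advance by the local speed along the current heading."""
    path = [(x, y)]
    for speed, dheading in changes:
        heading += dheading
        x += speed * math.cos(heading)
        y += speed * math.sin(heading)
        path.append((x, y))
    return path

# Four equal steps with 90-degree turns trace a closed square loop,
# like the square trajectory described in Step 1.
square = integrate_path([(1.0, 0.0), (1.0, math.pi / 2),
                         (1.0, math.pi / 2), (1.0, math.pi / 2)])
```

With error-free inputs the path closes exactly; in practice each pair carries a small error, which is why Step 5's loop-closure detection is needed.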
Step 5: perform loop-closure detection with a CNN to eliminate the accumulated error of the path prediction.
The local speed and direction changes used in Step 4 cannot be perfectly precise; they carry some error. As this error accumulates, the gap between the predicted path and the true path keeps growing from start to end. Subsequent loop-closure detection is therefore needed to eliminate the accumulated error and reduce the gap between the predicted and true paths. The algorithm consists of two parts: feature extraction with a convolutional neural network, and spatio-temporal filtering of the location-match hypotheses obtained by comparing feature responses.
Loop-closure detection likewise uses a CNN-based algorithm, but the CNN model applied here differs from the aforementioned one; the two are used independently for visual odometry and for loop-closure detection. This process removes the accumulated error of the camera motion path predicted in Step 4 and realizes autonomous loop closure. The specific method is as follows:
Image features are extracted with a pre-trained convolutional neural network. The present invention uses the Overfeat convolutional neural network for image feature extraction.
The Overfeat network is pre-trained on the ImageNet 2012 data set, which consists of 1.2 million images in 1000 classes. The Overfeat network comprises five convolution stages and three fully connected stages. The first two convolution stages consist of a convolutional layer, a max-pooling layer, and a rectification (ReLU) non-linearity. The third and fourth convolution stages consist of a convolutional layer, a zero-padding layer, and a ReLU non-linearity. The fifth stage contains a convolutional layer, a zero-padding layer, a ReLU layer, and a max-pooling layer. Finally, the sixth and seventh, fully connected, stages each contain a fully connected layer and a ReLU layer, and the eighth stage is the output layer, consisting of a fully connected layer only. The whole convolutional neural network has 21 layers in total.
When an image I is input to the network, it produces a series of hierarchical activations. In the present invention, L_k(I), k = 1, ..., 21 denotes the output of the k-th layer for a given input image I. Every element of each layer's output feature vector L_k(I) is a deep-learned representation of image I; place recognition is performed by comparing these feature vectors across different images. The network can process any image of size at least 231 x 231 pixels, so the inputs to the Overfeat network are all resized to 256 x 256 pixels.
Using this pre-trained result, the training data set and test data set collected in Step 1 are fed as input to the pre-trained Overfeat convolutional neural network for feature extraction.
Step 6: match features, generate the confusion matrix, and perform spatio-temporal consistency detection.
As shown in Fig. 2, the features extracted by the Overfeat convolutional neural network from each picture in the test set are matched against the features extracted from each picture in the training set.
To compare how the features of each Overfeat layer perform in scene recognition, a confusion matrix is further constructed from the features of each layer:
M_k(i, j) = d(L_k(I_i), L_k(I_j)), i = 1, ..., R, j = 1, ..., T
where I_i is the image input of the i-th training frame, I_j is the image input of the j-th test frame, L_k(I_i) is the k-th layer output for I_i, and M_k(i, j) is the Euclidean distance between training sample i and test sample j at layer k, describing the degree of matching between the two; R and T are the numbers of training images and test images respectively.
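The construction of M_k and the per-column search for the strongest match can be sketched directly from the formula above. The tiny feature vectors here are illustrative; in the invention they would be layer-k Overfeat outputs.

```python
import numpy as np

def confusion_matrix(train_feats, test_feats):
    """Confusion matrix M(i, j) = Euclidean distance between the layer-k
    feature vectors of training image i and test image j.
    `train_feats` has shape (R, D) and `test_feats` shape (T, D)."""
    diff = train_feats[:, None, :] - test_feats[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))       # shape (R, T)

def best_matches(M):
    """Strongest location-match hypothesis per test frame: the training
    index with the minimum feature distance in each column."""
    return M.argmin(axis=0)

train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
test = np.array([[0.9, 0.1], [0.1, 1.8]])
M = confusion_matrix(train, test)
hyp = best_matches(M)
```

A small distance marks a likely revisit of the same place; the hypotheses in `hyp` are then verified by the spatio-temporal filters described below.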
Each column of the confusion matrix thus stores the average feature-vector difference between the j-th test image and all training images. To find the strongest location-match hypothesis, the element with the minimum feature-vector difference is sought in each column of the confusion matrix.
For the possible location-match hypotheses in the confusion matrix, a spatial-continuity filter and a temporal-continuity filter are further constructed for joint verification, improving matching accuracy. Meanwhile, the feature performance of each trained network layer is explored; preferably, the middle-layer features of the network are found to match well between images of similar viewpoint, while the later layers are more adaptive and robust to changes in the scene viewpoint.
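The temporal-continuity check can be sketched as follows: a hypothesis is trusted only if its neighbours advance through the training sequence roughly in step, since the camera revisits places in order. The window size, tolerance, and rejection-by-None convention are illustrative assumptions, not details fixed by the invention.

```python
def temporal_filter(hypotheses, window=3, tol=2):
    """Temporal-continuity filter on per-frame match hypotheses: keep
    the match for test frame j only if hypotheses in a small window
    around j advance through the training sequence roughly one index
    per frame; otherwise replace it with None."""
    kept = []
    for j in range(len(hypotheses)):
        lo = max(0, j - window // 2)
        hi = min(len(hypotheses), j + window // 2 + 1)
        run = hypotheses[lo:hi]
        consistent = all(abs((run[i + 1] - run[i]) - 1) <= tol
                         for i in range(len(run) - 1))
        kept.append(hypotheses[j] if consistent else None)
    return kept

# Frame 3's hypothesis jumps far from its neighbours and is rejected,
# along with the frames whose windows contain the jump.
filtered = temporal_filter([10, 11, 12, 40, 14, 15])
```

A spatial-continuity filter would apply the analogous check on the predicted camera positions; the two together suppress isolated false matches.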
Once accurate location matches exist, the accumulated error produced by loop-free visual odometry can be compensated and a globally consistent trajectory constructed.