Background Art
SLAM (Simultaneous Localization and Mapping) refers to localizing a robot while simultaneously building a map of its surroundings. SLAM is an active research field with wide applications in robotics, navigation, and many other areas. Visual SLAM essentially estimates camera motion from visual sensor information, such as sequences of frames from one or more cameras, while attempting to construct a map of the surrounding environment. Current approaches to the SLAM problem mount sensors on the robot body to estimate the robot's own motion and to gather feature information about the unknown environment, then fuse this information to achieve an accurate estimate of the robot pose together with a spatial model of the scene. Although SLAM systems employ several sensor types, including laser and vision, their processing pipeline generally comprises three parts: a front-end visual odometer, back-end optimization, and loop-closure detection.
A typical visual SLAM algorithm takes camera pose estimation as its main goal and reconstructs a 3D map through multiple-view geometry. To improve data-processing speed, some visual SLAM algorithms first extract sparse image features and realize visual odometry and loop-closure detection by matching feature points, for example visual SLAM based on SIFT (scale-invariant feature transform) features [13] and visual SLAM based on ORB (oriented FAST and rotated BRIEF) features. Owing to their good robustness, discriminative power, and fast processing speed, SIFT and ORB features are widely used in the visual SLAM field. Hand-engineered sparse image features nevertheless have many limitations. On the one hand, how to design sparse image features that optimally represent image information remains an unsolved major problem in computer vision; on the other hand, sparse image features still struggle with illumination changes, moving objects, changes in camera parameters, and environments that lack texture or have uniform texture. Traditional visual odometry (Visual Odometry, VO) essentially estimates motion from visual information, such as sequences of frames from one or more cameras. A common characteristic of most of this work is that motion is estimated through keypoint detection and tracking combined with camera geometry.
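The keypoint matching at the core of such traditional front ends can be sketched as nearest-neighbour descriptor matching with Lowe's ratio test. This is a minimal illustrative sketch, not a procedure specified by the invention; the function name and the ratio threshold of 0.8 are common-practice assumptions.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour descriptor matching with Lowe's ratio test,
    the matching step of keypoint-based VO front ends (e.g. SIFT/ORB).
    Descriptor arrays are (N, D); the ratio value is a common default."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Keep a match only when the best candidate is clearly closer
        # than the runner-up, which rejects ambiguous correspondences.
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

a = np.array([[0.0, 1.0], [5.0, 5.0]])
b = np.array([[0.1, 1.0], [4.0, 4.0], [9.0, 0.0]])
pairs = match_descriptors(a, b)
```

Matched pairs would then feed a geometric pose estimator; ambiguous keypoints are simply dropped rather than mismatched.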
In recent years, learning-based methods have shown desirable results in many fields of computer vision and can overcome the defects of traditional visual SLAM algorithms (sparse image features have difficulty coping with illumination changes, moving objects, camera parameter changes, and environments that lack texture or have uniform texture). Models such as convolutional neural networks (CNNs) have proved highly effective in various visual tasks, for example classification, localization, and depth estimation. Unsupervised feature-learning models have demonstrated the ability to represent local transformations in data learned through multiplicative interactions. Research shows that applying data pre-trained with an unsupervised model to a CNN can filter noise well and prevent overfitting.
Visual odometry (VO) provides the initial estimates of the trajectory and the map. VO, however, considers only the relationship between consecutive frames. Error inevitably accumulates over time, so the whole system drifts and long-term estimates become unreliable; in other words, a globally consistent trajectory and map cannot be constructed. A loop-closure detection module supplies constraints beyond consecutive frames, including constraints between temporally distant ones. The key to loop-closure detection is how to detect effectively that the camera has passed through the same place, which bears on the correctness of long-term estimation and mapping. Loop-closure detection therefore clearly improves the accuracy and robustness of the entire SLAM system.
Appearance-based loop-closure detection is essentially an image-similarity matching task: two images are feature-matched to verify whether they show the same place. The conventional approach to loop-closure detection builds a dictionary with the "bag-of-words" method and performs feature matching against it. CNNs now show excellent performance in a variety of classification tasks. Existing benchmark results show that deep CNN features from different layers consistently outperform SIFT in descriptor matching, suggesting that SIFT or SURF may no longer be the preferred descriptors for matching tasks. Our invention is therefore inspired by the excellent performance of CNNs in image classification and by the evidence of their feasibility for feature matching. It discards the traditional bag-of-words method and instead performs loop-closure detection with a hierarchical image-feature extraction method based on deep learning, represented by CNNs. Deep learning algorithms are the mainstream recognition algorithms in computer vision: they learn hierarchical feature representations of images with multi-layer neural networks and, compared with traditional recognition methods, can achieve higher accuracy in feature extraction and place recognition.
Summary of the Invention
In view of the above problems, the present invention proposes a full-cycle visual SLAM algorithm with CNN feature detection, in which both the front end (VO) and the loop-closure detection of the SLAM pipeline are handled by convolutional neural networks, realizing a SLAM system that applies deep learning over the complete cycle.
The full-cycle visual SLAM algorithm with CNN feature detection of the present invention comprises the following steps:
Step 1: scan the surrounding environment with a binocular camera; use part of the collected video streams as the training data set and part as the test data set.
Step 2: pre-train on the training data set collected in Step 1 with a synchrony-detection method.
Step 3: train a convolutional neural network to obtain local changes in speed and local changes in direction, thereby performing visual odometry.
Step 4: recover the camera motion path from the local speed and direction changes obtained in Step 3.
Step 5: perform loop-closure detection with a convolutional neural network to eliminate the accumulated error of the path prediction.
The invention has the following advantages:
1. The full-cycle visual SLAM algorithm with CNN feature detection of the present invention uses a convolutional neural network in the front end. Compared with conventional front-end algorithms, the learning-based approach replaces cumbersome analytic computation and needs no manual feature extraction or matching, making it concise and intuitive, with faster online operation.
2. The algorithm discards the traditional bag-of-words method of loop-closure detection and obtains good place recognition through accurate feature matching.
3. The algorithm learns deep features of the image through the neural network, so the recognition rate reaches a higher level. Compared with conventional visual SLAM algorithms, it improves loop-closure detection accuracy, represents richer image information, and is more robust to environmental changes such as illumination and season. It can compute the similarity between two image frames, realizing a more concise visual odometer, and while pre-training the neural network on a database completes the feature design, the matching classifier can be designed at the same time.
Detailed Description of the Embodiments
The present invention is described in further detail below with reference to the accompanying drawings.
The full-cycle visual SLAM algorithm with CNN feature detection of the present invention, as shown in Figure 1, comprises the following steps:
Step 1: scan the surrounding environment.
Move a binocular camera through a square area and collect image information of the real scene; the resulting video stream is transmitted to a host computer in real time. The binocular camera travels 1 to 2 laps, forming a closed loop, which facilitates the subsequent compensation of accumulated error by loop-closure detection. This process is repeated, and part of the collected video streams is used as the training data set and part as the test data set.
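The partition of recorded streams into training and test sets can be sketched as below. The patent does not fix a split ratio; the 80/20 ratio, the seed, and the file names here are illustrative assumptions.

```python
import random

def split_streams(streams, train_ratio=0.8, seed=0):
    """Partition collected video streams into training and test sets.

    `streams` is a list of stream identifiers (e.g. file paths).
    The ratio and seed are illustrative, not fixed by the invention.
    """
    rng = random.Random(seed)
    shuffled = streams[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_streams([f"run_{i:02d}.avi" for i in range(10)])
```

Splitting whole streams rather than individual frames keeps each loop of the square trajectory intact within one set.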
Step 2: pre-train on the training data set collected in Step 1 with a synchrony-detection method.
To obtain a joint representation of camera motion and image depth information, the training data set is pre-trained with an unsupervised learning model (SAE-D) using stochastic gradient descent. The synchrony-based SAE-D is a single-layer model that acquires features from the training data set through local, Hebbian-type learning.
The SAE-D model is trained on local binocular blocks cut at random from the training data set; each block has size 16*16*5 (space*space*time), yielding feature information that jointly represents motion and depth. This joint motion-and-depth feature information is then un-whitened back to image space, completing the pre-training. The pre-trained result is used to initialize the first layer of the CNN (convolutional neural network).
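Cutting the random 16*16*5 local binocular blocks that SAE-D trains on can be sketched as follows. The helper name, the block count, and the stacking of the left and right crops are illustrative assumptions; only the 16*16*5 (space*space*time) block size comes from the description above.

```python
import numpy as np

def sample_binocular_blocks(left, right, n_blocks=4, size=16, depth=5, seed=0):
    """Cut random local binocular blocks of size 16x16x5 (space*space*time).

    `left`/`right` are video volumes shaped (frames, height, width).
    Each sample stacks the left and right crops so that an unsupervised
    model can learn a joint motion-and-depth representation from them.
    """
    rng = np.random.default_rng(seed)
    frames, h, w = left.shape
    blocks = []
    for _ in range(n_blocks):
        t = rng.integers(0, frames - depth + 1)   # random start frame
        y = rng.integers(0, h - size + 1)         # random top-left corner
        x = rng.integers(0, w - size + 1)
        crop = lambda v: v[t:t + depth, y:y + size, x:x + size]
        blocks.append(np.stack([crop(left), crop(right)]))
    return np.stack(blocks)  # (n_blocks, 2 eyes, 5, 16, 16)

left = np.zeros((30, 120, 160))
right = np.zeros((30, 120, 160))
blocks = sample_binocular_blocks(left, right)
```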
Step 3: train a convolutional neural network (CNN) to perform visual odometry.
A convolutional neural network (CNN) is a supervised learning model. The CNN is trained to associate local depth-and-motion representations with local changes in speed and direction, thereby learning to perform visual odometry. Through the CNN architecture, the acquired joint representation of motion and depth is associated by training with the desired labels (changes in direction and speed).
In the present invention, the features obtained by SAE-D training in Step 2 are fed into the first layers of two CNNs of the same architecture to initialize them. The two CNNs output local changes in speed and local changes in direction respectively, and the remaining layers of the two CNNs associate the local speed changes and local direction changes with the desired labels.
The whole CNN has six layers. The first layer is a 5*5 convolutional layer that learns features from the left and right images separately. The following layer multiplies element-wise the features extracted by the left and right convolutions. The third layer is a 1*1 convolutional layer, the fourth layer is a pooling layer, the fifth layer is a fully connected layer, and the final output layer is a Softmax layer.
The input of the two CNNs is a 5-frame subsequence, and the target output is a vector representation of the local speed and direction changes. The effectiveness and accuracy of the local speed and direction information can be evaluated on the binocular KITTI data set.
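A shape-level sketch of the six-layer network above, with random placeholder weights, is given below. It is not the trained model: the input size, the scalar stand-in for the 1*1 convolution, the 2x2 mean pooling, and the eight output bins are all illustrative assumptions; only the layer ordering and the element-wise product of the left and right feature maps follow the description.

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of a single-channel image with kernel k."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def vo_cnn_forward(left, right, seed=0):
    """Trace of the six-layer VO network: 5x5 convolution per eye,
    element-wise product of the two maps, 1x1 convolution (a scalar
    here), 2x2 mean pooling, a fully connected layer, and a softmax
    output. Weights are random placeholders, not learned values."""
    rng = np.random.default_rng(seed)
    k = rng.standard_normal((5, 5)) * 0.1
    fl, fr = conv2d(left, k), conv2d(right, k)  # layer 1: per-eye 5x5 conv
    fused = fl * fr                             # layer 2: element-wise product
    scaled = fused * 0.5                        # layer 3: 1x1 conv (scalar)
    h, w = scaled.shape                         # layer 4: 2x2 mean pooling
    pooled = scaled[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    W = rng.standard_normal((8, pooled.size)) * 0.01
    logits = W @ pooled.ravel()                 # layer 5: fully connected
    return softmax(logits)                      # output layer: Softmax

p = vo_cnn_forward(np.ones((20, 20)), np.ones((20, 20)))
```

The element-wise product of the left and right branches is what lets the network encode stereo (depth) together with temporal motion before classification.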
Step 4: predict the camera motion path.
For the whole video stream, the speed and direction changes of every 5-frame subsequence obtained in Step 3 allow the camera motion path to be recovered discretely.
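The discrete path recovery is plain dead reckoning over the per-subsequence outputs, sketched below. The restriction to a 2-D plane, the units, and the function name are illustrative assumptions.

```python
import math

def integrate_path(changes, x=0.0, y=0.0, heading=0.0):
    """Recover a discrete 2-D camera path from per-subsequence local
    (speed, heading-change) pairs: accumulate the heading change, then
    advance by the local speed along the current heading."""
    path = [(x, y)]
    for speed, dheading in changes:
        heading += dheading
        x += speed * math.cos(heading)
        y += speed * math.sin(heading)
        path.append((x, y))
    return path

# Four equal steps with 90-degree turns trace a closed square loop,
# like the square trajectory described in Step 1.
square = integrate_path([(1.0, 0.0), (1.0, math.pi / 2),
                         (1.0, math.pi / 2), (1.0, math.pi / 2)])
```

With error-free inputs the path closes exactly; in practice each pair carries a small error, which is why Step 5's loop-closure detection is needed.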
Step 5: perform loop-closure detection with a CNN to eliminate the accumulated error of the path prediction.
The local speed and direction changes used in Step 4 cannot be perfectly precise; they carry some error. As this error accumulates, the gap between the predicted path and the true path keeps growing from start to end. Subsequent loop-closure detection is therefore needed to eliminate the accumulated error and reduce the gap between the predicted and true paths. The algorithm consists of two parts: feature extraction with a convolutional neural network, and spatio-temporal filtering of the location-match hypotheses obtained by comparing feature responses.
Loop-closure detection likewise uses a CNN-based algorithm, but the CNN model applied here differs from the aforementioned one; the two are used independently for visual odometry and for loop-closure detection. This process removes the accumulated error of the camera motion path predicted in Step 4 and realizes autonomous loop closure. The specific method is as follows:
Image features are extracted with a pre-trained convolutional neural network. The present invention uses the Overfeat convolutional neural network for image feature extraction.
The Overfeat network is pre-trained on the ImageNet 2012 data set, which consists of 1.2 million images in 1000 classes. The Overfeat network comprises five convolution stages and three fully connected stages. The first two convolution stages consist of a convolutional layer, a max-pooling layer, and a rectification (ReLU) non-linearity. The third and fourth convolution stages consist of a convolutional layer, a zero-padding layer, and a ReLU non-linearity. The fifth stage contains a convolutional layer, a zero-padding layer, a ReLU layer, and a max-pooling layer. Finally, the sixth and seventh, fully connected, stages each contain a fully connected layer and a ReLU layer, and the eighth stage is the output layer, consisting of a fully connected layer only. The whole convolutional neural network has 21 layers in total.
When an image I is input to the network, it produces a series of hierarchical activations. In the present invention, L_k(I), k = 1, ..., 21 denotes the output of the k-th layer for a given input image I. Every element of each layer's output feature vector L_k(I) is a deep-learned representation of image I; place recognition is performed by comparing these feature vectors across different images. The network can process any image of size at least 231 x 231 pixels, so the inputs to the Overfeat network are all resized to 256 x 256 pixels.
Using this pre-trained result, the training data set and test data set collected in Step 1 are fed as input to the pre-trained Overfeat convolutional neural network for feature extraction.
Step 6: match features, generate the confusion matrix, and perform spatio-temporal consistency detection.
As shown in Fig. 2, the features extracted by the Overfeat convolutional neural network from each picture in the test set are matched against the features extracted from each picture in the training set.
To compare how the features of each Overfeat layer perform in scene recognition, a confusion matrix is further constructed from the features of each layer:
M_k(i, j) = d(L_k(I_i), L_k(I_j)), i = 1, ..., R, j = 1, ..., T
where I_i is the image input of the i-th training frame, I_j is the image input of the j-th test frame, L_k(I_i) is the k-th layer output for I_i, and M_k(i, j) is the Euclidean distance between training sample i and test sample j at layer k, describing the degree of matching between the two; R and T are the numbers of training images and test images respectively.
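The construction of M_k and the per-column search for the strongest match can be sketched directly from the formula above. The tiny feature vectors here are illustrative; in the invention they would be layer-k Overfeat outputs.

```python
import numpy as np

def confusion_matrix(train_feats, test_feats):
    """Confusion matrix M(i, j) = Euclidean distance between the layer-k
    feature vectors of training image i and test image j.
    `train_feats` has shape (R, D) and `test_feats` shape (T, D)."""
    diff = train_feats[:, None, :] - test_feats[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))       # shape (R, T)

def best_matches(M):
    """Strongest location-match hypothesis per test frame: the training
    index with the minimum feature distance in each column."""
    return M.argmin(axis=0)

train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
test = np.array([[0.9, 0.1], [0.1, 1.8]])
M = confusion_matrix(train, test)
hyp = best_matches(M)
```

A small distance marks a likely revisit of the same place; the hypotheses in `hyp` are then verified by the spatio-temporal filters described below.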
Each column of the confusion matrix thus stores the average feature-vector difference between the j-th test image and all training images. To find the strongest location-match hypothesis, the element with the minimum feature-vector difference is sought in each column of the confusion matrix.
For the possible location-match hypotheses in the confusion matrix, a spatial-continuity filter and a temporal-continuity filter are further constructed for joint verification, improving matching accuracy. Meanwhile, the feature performance of each trained network layer is explored; preferably, the middle-layer features of the network are found to match well between images of similar viewpoint, while the later layers are more adaptive and robust to changes in the scene viewpoint.
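The temporal-continuity check can be sketched as follows: a hypothesis is trusted only if its neighbours advance through the training sequence roughly in step, since the camera revisits places in order. The window size, tolerance, and rejection-by-None convention are illustrative assumptions, not details fixed by the invention.

```python
def temporal_filter(hypotheses, window=3, tol=2):
    """Temporal-continuity filter on per-frame match hypotheses: keep
    the match for test frame j only if hypotheses in a small window
    around j advance through the training sequence roughly one index
    per frame; otherwise replace it with None."""
    kept = []
    for j in range(len(hypotheses)):
        lo = max(0, j - window // 2)
        hi = min(len(hypotheses), j + window // 2 + 1)
        run = hypotheses[lo:hi]
        consistent = all(abs((run[i + 1] - run[i]) - 1) <= tol
                         for i in range(len(run) - 1))
        kept.append(hypotheses[j] if consistent else None)
    return kept

# Frame 3's hypothesis jumps far from its neighbours and is rejected,
# along with the frames whose windows contain the jump.
filtered = temporal_filter([10, 11, 12, 40, 14, 15])
```

A spatial-continuity filter would apply the analogous check on the predicted camera positions; the two together suppress isolated false matches.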
Once accurate location matches exist, the accumulated error produced by loop-free visual odometry can be compensated and a globally consistent trajectory constructed.