
CN115439743A - Method for accurately extracting visual SLAM static characteristics in parking scene - Google Patents

Method for accurately extracting visual SLAM static characteristics in parking scene

Info

Publication number
CN115439743A
Authority
CN
China
Prior art keywords
network
image
feature
points
feature points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211028947.4A
Other languages
Chinese (zh)
Inventor
崔博非
胡习之
李洪涛
符茂达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202211028947.4A priority Critical patent/CN115439743A/en
Publication of CN115439743A publication Critical patent/CN115439743A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for accurately extracting visual SLAM static features in a parking scene. When dynamic objects such as pedestrians and vehicles are present in the parking scene, an object detection model running in a parallel thread frames the vehicles and pedestrians to form masks. At the same time, the handcrafted features commonly used by current VSLAM systems are abandoned in favour of an improved deep-learning feature extraction model, SuperPoint, which extracts features with higher accuracy and robustness and yields the key points and descriptors of each image frame. Feature points inside the dynamic-object boxes are screened out according to the generated masks, feature matching and camera pose estimation are carried out with the remaining accurate static feature points, and the tracking, mapping and loop-closure detection threads are then executed to complete the whole SLAM pipeline. The method reduces the probability of mismatching in memory parking scenes, effectively overcomes the difficulty that SLAM algorithms have in removing dynamic feature points and their low scene-recognition accuracy, and improves the reliability of memory parking.

Description

A Method for Accurately Extracting Static Features of Visual SLAM in Parking Scenes

Technical Field

The invention belongs to the field of visual SLAM and deep learning, and in particular relates to a method in which visual SLAM uses deep-learning object detection to remove dynamic feature points in a parking scene, so that static visual SLAM features can be extracted accurately and mapping can be completed.

Background Art

Simultaneous localization and mapping (SLAM) technology uses a robot's on-board sensors to capture and process information about the surrounding environment without any prior knowledge of that environment, and on this basis builds a map and localizes the robot. When camera sensors are used to perceive the surroundings, the approach is called visual SLAM; thanks to their low cost and the rich information they collect, visual sensors have become the sensors most commonly used in modern SLAM research.

With the continuous improvement and development of visual SLAM technology, a number of excellent open-source SLAM frameworks such as ORB-SLAM2 and OpenVSLAM have emerged. A classic visual SLAM system mainly consists of sensor data input, a front-end visual odometry module, back-end optimization, loop-closure detection, and mapping. ORB-SLAM2 runs three threads in parallel (tracking, local mapping, and loop-closure detection) and uses the traditional ORB algorithm for feature extraction, which has been shown to be poorly robust to illumination changes. In recent years a number of deep-learning-based feature extraction algorithms have also appeared. Feature extraction plays a pivotal role in the whole visual SLAM system and must guarantee that the extracted feature points are representative of the scene. Most current visual SLAM systems simply extract all feature points of the current scene, but feature points on dynamic objects such as vehicles and pedestrians may not be matched the next time the system localizes, causing localization failure.

Summary of the Invention

In view of the shortcomings of the above traditional visual SLAM algorithms with respect to dynamic feature points and to the accuracy and robustness of feature points, the purpose of the present invention is to provide a method for accurately extracting static features of visual SLAM in a parking scene. The method improves the accuracy of feature point extraction, effectively weakens the influence of dynamic-object feature points on visual SLAM mapping and localization, and improves the reliability of visual SLAM.

To solve the above technical problems, the present invention provides a method for accurately extracting static features of visual SLAM in a parking scene, comprising the following steps:

Step 1: capture an image of the parking-lot scene in front of the vehicle, preprocess the image, and feed it into an object detection network for object detection to obtain the detection boxes of target objects.

Step 2: screen the dynamic-object detection boxes output in Step 1 and form a mask, which is used together with the feature points extracted by SuperPoint to remove the feature points inside the dynamic-object detection boxes and obtain key points and descriptors. The SuperPoint network consists of a shared encoder for key points and descriptors, a key-point decoder, and a descriptor decoder: the shared encoder encodes the image into a feature map, the key-point decoder obtains the coordinates of the key points in the image, and the descriptor decoder obtains the descriptor vector of each key point. The improvement of the SuperPoint network consists in replacing all convolutions in the encoder with depthwise separable convolutions. Object detection and feature extraction use multi-threaded parallelism, so that object detection runs at the same time as feature extraction.

Step 3: if a mask represents a pedestrian, the feature points inside the mask are removed by the SuperPoint network; if it represents a car, the car detection regions of adjacent frames are compared, feature points in the non-overlapping parts of the two adjacent detection regions are kept, and feature points in the overlapping part are removed, yielding the filtered static feature points.

Step 4: use the SuperPoint network to extract key points and descriptors and discard those inside the masks, perform feature matching with the remaining feature points, continue with the tracking module of visual SLAM, compute the camera pose and build the map, completing the whole SLAM pipeline.

In Step 1, the object detection network is YOLOv5, and the YOLOv5-based detection pipeline is as follows: a 608*608*3 RGB image is input, the input image is scaled to the network input size, and Mosaic data augmentation is applied. Mosaic randomly selects four pictures and scales, rotates and arranges them into a new picture, which greatly increases the number of training images, speeds up training, and acts as data augmentation. The Backbone module uses the CSPDarknet53 structure and the Focus structure to extract general features; these features are fed into the Neck network to extract more diverse and robust features, passed through the CSP2_X and CBL structures, upsampled, and concatenated with the features output by the backbone, which strengthens feature fusion. Finally, the output head uses CIoU_LOSS instead of the earlier GIoU_LOSS as the bounding-box loss function. The CIoU formula is as follows:

CIoU(B_pre, B_GT) = IoU(B_pre, B_GT) - ρ²(B_pre, B_GT)/c² - αv

v = (4/π²)(arctan(w_GT/h_GT) - arctan(w/h))²

α = v/((1 - IoU(B_pre, B_GT)) + v)

CIoU takes the size ratio of the ground-truth box and the predicted box into account. In the formulas, v ∈ [0,1] is a normalized measure of the difference between the aspect ratio of the predicted box and that of the corresponding ground-truth box, c is the diagonal length of the smallest box enclosing both boxes, and α is the loss balance factor.
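As an illustration (not part of the patent disclosure), the CIoU term above can be computed for a single predicted/ground-truth box pair with the following minimal Python sketch; the function name `ciou` and the (x1, y1, x2, y2) box convention are assumptions made for this example.

```python
import math

def ciou(box_pred, box_gt):
    """Compute CIoU = IoU - rho^2/c^2 - alpha*v for boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_pred
    gx1, gy1, gx2, gy2 = box_gt

    # Intersection-over-union (IoU).
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter + 1e-9)

    # rho^2: squared distance between the box centers.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
           ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # c^2: squared diagonal of the smallest enclosing box.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + 1e-9

    # v: normalized aspect-ratio difference; alpha: balance factor.
    w_p, h_p = px2 - px1, py2 - py1
    w_g, h_g = gx2 - gx1, gy2 - gy1
    v = (4 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    alpha = v / (1 - iou + v + 1e-9)

    return iou - rho2 / c2 - alpha * v

# The bounding-box regression loss is then taken as 1 - CIoU.
```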

In Step 2, the SuperPoint network is trained in a self-supervised manner before use. First, a fully convolutional network, the Base Detector, is trained using regular geometric shapes as the dataset. The detection results of the Base Detector on unlabeled real images are then used as pseudo ground-truth key points; to make these pseudo ground-truth key points more robust and accurate, Homographic Adaptation is used to extract features from the unlabeled real images at different scales and generate pseudo-labels. Once the pseudo-labels are generated, the real unlabeled images can be fed into the SuperPoint network for training. In the image input stage, data augmentation such as flipping is applied.

The SuperPoint network consists of three parts: a shared encoder for key points and descriptors, a key-point decoder, and a descriptor decoder. Further, in Step 2, the SuperPoint network detects key points and descriptors as follows:

An H*W*3 image frame is input and converted to H*W*1 after grayscaling; the image is then fed into the improved, more lightweight shared encoder, after which the input size is reduced to Hc = H/8, Wc = W/8.

The key-point decoder performs a sub-pixel convolution: through a depth-to-space operation it converts the input tensor from H/8*W/8*65 to H*W and finally outputs, for each pixel, the probability that it is a key point.

The descriptor decoder uses a convolutional network to obtain a semi-dense descriptor map, then uses bicubic interpolation to obtain the remaining descriptors, and finally applies L2 normalization to obtain descriptors of uniform length (H*W*D).
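As an illustration of the depth-to-space step described above, the following PyTorch sketch (not part of the patent disclosure) turns the H/8*W/8*65 detector tensor into a per-pixel key-point probability map; the tensor name `semi` and the probability threshold are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def keypoint_heatmap(semi, conf_thresh=0.015):
    """semi: (N, 65, H/8, W/8) raw detector output from the shared encoder head.

    Returns a (N, H, W) probability map and a boolean key-point mask.
    """
    # Softmax over the 65 channels (64 cells of an 8x8 block + 1 "dustbin").
    prob = F.softmax(semi, dim=1)
    # Drop the dustbin channel, keeping the 64 spatial cells.
    prob = prob[:, :-1, :, :]                          # (N, 64, H/8, W/8)
    # Depth-to-space: rearrange the 64 channels into an 8x8 pixel block.
    heatmap = F.pixel_shuffle(prob, upscale_factor=8)  # (N, 1, H, W)
    heatmap = heatmap.squeeze(1)
    keypoints = heatmap > conf_thresh
    return heatmap, keypoints
```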

In Step 2, the improved shared encoder of SuperPoint is used. The original SuperPoint encoder uses VGG-style convolutional layers, but the amount of computation and the number of training parameters are large, so the present invention replaces all convolutions in the encoder with depthwise separable convolutions. The ordinary convolution process is shown in Fig. 3: for an input image of size H*W*3 and an m-channel output feature map, an ordinary convolution kernel has 3*f*f*m parameters.

A depthwise separable convolution (Fig. 4) consists of two consecutive stages, a depthwise (per-channel) convolution and a pointwise convolution. The depthwise convolution applies a separate kernel to each channel, so the convolution is carried out within a two-dimensional plane and produces an intermediate feature map; the kernel parameters of this stage amount to f*f*3. The intermediate feature map is then processed by a pointwise convolution with 1*1*3 kernels, which fuses data across channels and finally outputs an m-channel feature map; this part has 1*3*m parameters. The total number of parameters of the depthwise and pointwise kernels is therefore 3*(f*f+m), roughly an order of magnitude less than the 3*f*f*m of a direct convolution, which greatly improves time efficiency; although the number of learned parameters is smaller than for an ordinary convolution, accuracy does not drop much.
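A minimal PyTorch sketch of the depthwise separable convolution described above is given below (the module name, activation, and channel arguments are illustrative choices, not part of the patent); the `groups=in_channels` convolution is the per-channel stage and the 1x1 convolution is the pointwise fusion stage.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise stage: one f*f kernel per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels)
        # Pointwise stage: 1x1 convolution fusing channels into out_channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.pointwise(self.depthwise(x)))

# Example: for a 3-channel input and m = 64 output channels, the kernel
# parameters drop from roughly 3*3*3*64 to roughly 3*(3*3 + 64).
layer = DepthwiseSeparableConv(3, 64)
```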

In Step 2, the loss function of the improved SuperPoint consists of two parts, a key-point extraction loss and a descriptor detection loss:

L(χ, χ', D, D'; Y, Y', S) = Lp(χ, Y) + Lp(χ', Y') + λLd(D, D', S)

where Lp(χ, Y) is the key-point loss, computed with a cross-entropy loss function, Ld(D, D', S) is the descriptor loss, and λ is the balance factor.

In Step 2, object detection and feature extraction run as follows: multi-threaded parallelism is used so that object detection is performed at the same time as feature extraction, and the tracking thread does not have to wait for the YOLOv5 detection results, which raises CPU utilization and improves runtime efficiency.
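The parallel arrangement of detection and feature extraction could look like the following sketch (not part of the patent disclosure); `yolo_detect` and `superpoint_extract` are placeholder callables standing in for the two models.

```python
from concurrent.futures import ThreadPoolExecutor

def process_frame(frame, yolo_detect, superpoint_extract):
    """Run object detection and feature extraction on the same frame in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        det_future = pool.submit(yolo_detect, frame)          # dynamic-object boxes
        feat_future = pool.submit(superpoint_extract, frame)  # key points + descriptors
        boxes = det_future.result()
        keypoints, descriptors = feat_future.result()
    return boxes, keypoints, descriptors
```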

In Step 3, the feature points inside the dynamic-object (vehicle and pedestrian) detection boxes obtained in Steps 1 and 2 are removed. If the feature points in every dynamic-object box were removed directly, matching could become difficult because too few feature points would remain. As shown in Fig. 7, the feature points of the dynamic target boxes detected in two adjacent frames are Dn = {P1^n, P2^n, ..., Pp^n} (region A in Fig. 7) and Dn+1 = {P1^(n+1), P2^(n+1), ..., Pq^(n+1)} (region B in Fig. 7), where Pi^n denotes the i-th feature point in frame n. The intersection of the detection boxes of the same dynamic object in the two adjacent frames is taken as the final set of dynamic target feature points, i.e. D = Dn ∩ Dn+1 (region C in Fig. 7), and the points in D are removed as the final dynamic feature point set. This reduces the probability of feature points being mistakenly removed as dynamic and retains some suspected static feature points; the remaining green, blue and yellow regions are kept, which increases the reliability of the tracking thread.
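A minimal sketch of the screening rule above (not part of the patent disclosure): points inside pedestrian boxes are dropped outright, while for vehicle boxes only the points inside the intersection of the boxes from frames n and n+1 are dropped. The helper names and the (x1, y1, x2, y2) box format are assumptions for this example.

```python
def box_intersection(b1, b2):
    """Intersection of two (x1, y1, x2, y2) boxes, or None if they do not overlap."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    return (x1, y1, x2, y2) if x1 < x2 and y1 < y2 else None

def in_box(pt, box):
    return box[0] <= pt[0] <= box[2] and box[1] <= pt[1] <= box[3]

def filter_static_points(points, pedestrian_boxes, car_boxes_n, car_boxes_n1):
    """Keep the points of the current frame that are considered static."""
    # Dynamic regions: every pedestrian box, plus the overlap of the boxes detected
    # for the same car in frames n and n+1 (D = D_n intersected with D_n+1).
    dynamic_regions = list(pedestrian_boxes)
    for bn, bn1 in zip(car_boxes_n, car_boxes_n1):
        inter = box_intersection(bn, bn1)
        if inter is not None:
            dynamic_regions.append(inter)
    return [p for p in points if not any(in_box(p, r) for r in dynamic_regions)]
```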

In Step 4, the camera pose is computed as follows: image matching is performed with the feature points and descriptors selected in the process above, the RANSAC random sample consensus algorithm is used to remove mismatched feature points, and the matching relation is converted into a 2D-to-2D epipolar geometry problem. Assume that x1 and x2 are the normalized coordinates of a pair of corresponding matched points in the two images, R is the camera rotation matrix and t is the translation matrix; then

x2 = Rx1 + t

Left-multiplying both sides by x2^T t^ (t^ denoting the skew-symmetric matrix of t) gives:

x2^T t^ x2 = x2^T t^ R x1

The left-hand side equals 0, so:

x2^T t^ R x1 = 0

which is the epipolar constraint expression; the camera pose is then obtained by minimizing the reprojection error. Let the essential matrix be E = t^R and the fundamental matrix F = K^(-T)EK^(-1), where K is the camera intrinsic matrix. Solving for the camera pose can then be split into two steps: compute the essential matrix E or the fundamental matrix F; then recover R and t from E or F.
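Using OpenCV, the two steps above (estimating E with RANSAC on the matched static points and then recovering R and t) can be sketched as follows; this is an illustrative sketch rather than the patent's implementation, and the variable names and RANSAC threshold are assumptions.

```python
import cv2
import numpy as np

def estimate_pose(pts1, pts2, K):
    """pts1, pts2: (N, 2) arrays of matched pixel coordinates; K: 3x3 intrinsic matrix."""
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    # Essential matrix with RANSAC to reject mismatched feature points.
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K,
                                          method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # Decompose E into R, t, keeping only the inliers consistent with it.
    _, R, t, pose_mask = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t, pose_mask
```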

In Step 4, the selection of keyframes has a considerable influence on information redundancy and on releasing computing resources. If the system is in localization mode, the local map is occupied, or relocalization has just finished, no keyframe is inserted.

Compared with the prior art, the present invention can achieve at least the following beneficial effects:

The present invention uses a deep-learning SuperPoint network combined with an object detection network (such as YOLOv5) to perform key-point and descriptor extraction and dynamic object detection. Traditional solutions mostly use ORB or SURF for feature extraction, but these extract features poorly when the parking-lot scene changes or the illumination intensity changes markedly. The improved SuperPoint used in the present invention first makes the network model more lightweight and ultimately makes feature extraction more robust to scene changes, with more uniform and reasonable feature points.

Brief Description of the Drawings

Fig. 1 is a schematic flowchart of the method for accurately extracting static features of visual SLAM in a parking scene provided by an embodiment of the present invention.

Fig. 2 is the core flowchart of the improved lightweight SuperPoint provided by an embodiment of the present invention.

Fig. 3 is a schematic diagram of an ordinary convolution.

Fig. 4 is a schematic diagram of a depthwise separable convolution, where (a) shows the depthwise (per-channel) convolution and (b) shows the pointwise convolution.

Fig. 5 shows the detection results of YOLOv5 in a parking-lot scene according to an embodiment of the present invention.

Fig. 6 shows the SuperPoint feature extraction results according to an embodiment of the present invention.

Fig. 7 is a schematic diagram of dynamic feature screening provided by an embodiment of the present invention (C is the removed region).

Detailed Description of the Embodiments

In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions are described clearly and completely below in conjunction with the present application.

Traditional visual SLAM feature extraction is based on the assumption of a static environment, but in a parking-lot environment a car is not always parked in every parking space, so it is necessary to remove the feature points of such dynamic objects to ensure that these dynamic features are not kept in the final map. Dynamic feature points in the prior art are generally handled with traditional feature extraction algorithms such as ORB or SURF, but the performance of these feature points varies greatly across scene changes and is not very robust. The present invention uses a deep-learning SuperPoint network combined with the YOLOv5 algorithm to perform key-point and descriptor extraction and dynamic object detection, and improves SuperPoint to make the network model more lightweight, so that feature extraction is ultimately more robust to scene changes and the extracted feature points are more uniform and reasonable. The method provided by the present invention is described in detail below.

As shown in Figs. 1 to 6, the present invention discloses a method for accurately extracting static features of visual SLAM in a parking scene, comprising the following steps:

Step 1: capture an image of the parking-lot scene in front of the vehicle, preprocess the image, and feed it into the object detection network for object detection to obtain the detection boxes of target objects.

In some embodiments of the present invention, the object detection network is the YOLOv5 network model. It can be understood that in other embodiments other object detection networks can also be used.
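As a hedged illustration of this step (not the patent's own implementation), a pretrained YOLOv5 model can be loaded through torch.hub and its detections filtered to the dynamic classes of interest; the confidence threshold and the restriction to the COCO 'person'/'car'/'bus'/'truck' classes are assumptions for this sketch.

```python
import numpy as np
import torch

# Load a pretrained YOLOv5 model (small variant) from the Ultralytics hub.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.4  # confidence threshold (illustrative value)

DYNAMIC_CLASSES = {'person', 'car', 'bus', 'truck'}  # assumed dynamic classes

def detect_dynamic_boxes(frame_bgr):
    """Return (x1, y1, x2, y2, class_name) boxes for pedestrians and vehicles."""
    rgb = np.ascontiguousarray(frame_bgr[..., ::-1])  # the hub model expects RGB input
    results = model(rgb)
    boxes = []
    for *xyxy, conf, cls in results.xyxy[0].tolist():
        name = model.names[int(cls)]
        if name in DYNAMIC_CLASSES:
            boxes.append((*xyxy, name))
    return boxes
```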

A monocular camera collects images of the parking-lot scene in front of the vehicle in real time. After image preprocessing (such as filtering and image enhancement), an H*W*3 RGB image (H and W are the numbers of pixels along the height and width of the image) is input into the YOLOv5 network model; the input image is scaled to the network input size and Mosaic data augmentation is applied. Mosaic randomly selects four pictures and scales, rotates and arranges them into a new picture, which greatly increases the number of training images, speeds up training, and acts as data augmentation. The backbone network extracts image features and generates feature maps; the Backbone module uses the CSPDarknet53 structure and the Focus structure to extract general features; these general features are fed into the Neck network to extract more diverse and robust features, passed through the CSP2_X and CBL structures, upsampled, and concatenated with the features output by the backbone, strengthening feature fusion. Finally, the output head uses CIoU_LOSS instead of the earlier GIoU_LOSS as the bounding-box loss function. The CIoU formula is as follows:

CIoU(B_pre, B_GT) = IoU(B_pre, B_GT) - ρ²(B_pre, B_GT)/c² - αv

v = (4/π²)(arctan(w_GT/h_GT) - arctan(w/h))²

α = v/((1 - IoU(B_pre, B_GT)) + v)

In the formulas, CIoU takes the size ratio of the ground-truth box and the predicted box into account; CIoU(B_pre, B_GT) denotes the CIoU measure between the predicted box and the ground-truth target box; IoU(B_pre, B_GT) denotes the intersection-over-union between the predicted box and the ground-truth target box; B_pre denotes the predicted box; B_GT denotes the ground-truth detection box; ρ(B_pre, B_GT) denotes the distance between the center points of the predicted box and the ground-truth box; c denotes the diagonal length of the smallest box enclosing both; v denotes the aspect-ratio similarity of the predicted box and the ground-truth box, with v ∈ [0,1]; w_GT denotes the width of the ground-truth box; h_GT denotes the height of the ground-truth box; GT denotes the ground-truth box information; w and h denote the width and height of the predicted box, respectively; and α denotes the loss balance factor.

Step 2: screen the dynamic-object detection boxes output in Step 1 and form a mask, which is used together with the feature points extracted by SuperPoint to remove the feature points inside the dynamic-object detection boxes.

In some embodiments of the present invention, the dynamic objects in the dynamic-object detection boxes include vehicles, pedestrians, and occasionally animals.

At the same time, another thread runs the feature extraction process. The captured H*W*3 RGB image is converted to grayscale and fed into the lightweight SuperPoint network. SuperPoint is trained in a self-supervised manner: first a fully convolutional network, the Base Detector, is trained using regular geometric shapes as the dataset; the detection results of the Base Detector on unlabeled real images are then used as pseudo ground-truth key points. To make these pseudo ground-truth key points more robust and accurate, Homographic Adaptation is used to extract features from the unlabeled real images at different scales and generate pseudo-labels; once the pseudo-labels are generated, the real unlabeled images can be fed into the SuperPoint network for training. In the image input stage, data augmentation such as flipping is applied.
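The Homographic Adaptation idea can be sketched as follows (an illustrative sketch under assumed helpers, not the patent's implementation): the base detector is run on several randomly warped copies of an unlabeled image and the responses are warped back and aggregated into a pseudo-label heatmap. `base_detector`, the warp count, and the jitter range are all assumptions.

```python
import cv2
import numpy as np

def random_homography(h, w, max_shift=0.15):
    """Random perspective warp that jitters the image corners by up to max_shift."""
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = (np.random.rand(4, 2) - 0.5) * 2 * max_shift * np.float32([w, h])
    dst = src + jitter.astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def homographic_adaptation(image, base_detector, n_warps=20):
    """Aggregate base-detector responses over random homographies into a pseudo-label heatmap."""
    h, w = image.shape[:2]
    accum = base_detector(image).astype(np.float32)  # response on the original image
    for _ in range(n_warps):
        H = random_homography(h, w)
        warped = cv2.warpPerspective(image, H, (w, h))
        heat = base_detector(warped).astype(np.float32)
        # Warp the response back into the original image frame and accumulate.
        accum += cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
    return accum / (n_warps + 1)
```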

The SuperPoint network consists of a shared encoder for key points and descriptors, a key-point decoder, and a descriptor decoder: the shared encoder encodes the image into a feature map, the key-point decoder obtains the coordinates of the key points in the image, and the descriptor decoder obtains the descriptor vector of each key point.

Specifically, an H*W*3 image frame is input and converted to H*W*1 after grayscaling, and the image is then fed into the improved, more lightweight shared encoder, in which every ordinary convolution is replaced by a depthwise separable convolution. For an ordinary convolution with an m-channel output feature map, the kernel has 3*f*f*m parameters (f is the kernel size and m the number of output channels); with a depthwise separable convolution the parameter count becomes 3*(f*f+m), roughly an order of magnitude less than the 3*f*f*m of a direct convolution, which greatly improves time efficiency; although the number of learned parameters is smaller than for an ordinary convolution, accuracy does not drop much.

In some embodiments of the present invention, the depthwise separable convolution consists of two consecutive stages, a depthwise (per-channel) convolution and a pointwise convolution. The depthwise convolution applies a separate kernel to each channel, carrying out the convolution within a two-dimensional plane and producing an intermediate feature map; the kernel parameters of this stage amount to f*f*3. The intermediate feature map is then processed by a pointwise convolution with 1*1*3 kernels, which fuses data across channels and finally outputs an m-channel feature map; this part has 1*3*m parameters. The total number of parameters of the depthwise and pointwise kernels is therefore 3*(f*f+m), an order of magnitude less than the 3*f*f*m of a direct convolution, greatly improving time efficiency; although the number of learned parameters is smaller than for an ordinary convolution, accuracy does not drop much.

After the encoder, the input image size becomes Hc = H/8, Wc = W/8, reducing the image size. The key-point decoder performs a sub-pixel convolution: a depth-to-space operation (moving data from the depth dimension to the spatial dimensions) converts the input tensor from H/8*W/8*65 to an H*W tensor; after a softmax over the channels and a reshape, the final output is, for each pixel, the probability that it is a key point, expressed in vector form, and the locations whose probability passes the threshold give the key-point coordinates. The descriptor decoder uses a convolutional network to obtain a semi-dense descriptor map, then uses bicubic interpolation to obtain the remaining descriptors, and finally applies L2 normalization to obtain descriptors of uniform length (H*W*D).

The loss function of the improved SuperPoint consists of a key-point extraction loss and a descriptor detection loss:

L(χ, χ', D, D'; Y, Y', S) = Lp(χ, Y) + Lp(χ', Y') + λLd(D, D', S)

where Lp(χ, Y) is the key-point loss and Lp(χ', Y') is the key-point loss of the flipped image, both computed with a cross-entropy loss function; Ld(D, D', S) is the descriptor loss; χ denotes the response of the image to key points after the encoding network model, of size Hc*Wc*65; χ' denotes the response of the flipped original image to key points after the encoding network model; D denotes the response of the image to descriptors after the encoding network model; D' denotes the response of the flipped image to descriptors after the encoding network model; Y denotes the key-point coordinate labels; Y' denotes the key-point coordinate labels of the flipped image; S denotes the image pair consisting of the original image and the flipped image; and λ is the balance factor.
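A simplified PyTorch sketch of the combined objective above is given below (not part of the patent disclosure). It assumes the detector outputs `semi` and `semi_warp` are the 65-channel logit tensors, takes the descriptor term as an externally supplied callable `descriptor_loss`, and uses an illustrative default for λ; all of these names and defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def superpoint_loss(semi, semi_warp, labels, labels_warp,
                    desc, desc_warp, correspondence_mask,
                    descriptor_loss, lam=1e-4):
    """Total loss = Lp(X, Y) + Lp(X', Y') + lambda * Ld(D, D', S).

    semi, semi_warp: (N, 65, Hc, Wc) detector logits for the image and its warped/flipped copy.
    labels, labels_warp: (N, Hc, Wc) integer targets in [0, 64] (64 = "no key point" bin).
    descriptor_loss: callable implementing Ld over the descriptor pair and correspondences.
    """
    # Key-point losses: cross-entropy over the 65 cells of each 8x8 block.
    lp = F.cross_entropy(semi, labels)
    lp_warp = F.cross_entropy(semi_warp, labels_warp)
    # Descriptor loss weighted by the balance factor lambda.
    ld = descriptor_loss(desc, desc_warp, correspondence_mask)
    return lp + lp_warp + lam * ld
```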

Step 3: if a mask represents a pedestrian, the SuperPoint feature points inside the mask are removed; if it represents a car, the car detection regions of adjacent frames are compared, feature points in the non-overlapping parts of the two adjacent detection regions are kept, and feature points in the overlapping part are removed, yielding the filtered static feature points.

Object detection and deep feature extraction use multi-threaded parallelism, so that detection runs at the same time as feature extraction; the tracking thread does not have to wait for the detection results of the object detection network, which raises CPU utilization and improves runtime efficiency.

The feature points inside the dynamic-object (e.g. vehicle and pedestrian) detection boxes obtained in Steps 1 and 2 are removed. If the feature points in every dynamic-object box were removed directly, matching could become difficult because too few feature points would remain. Therefore, in some embodiments of the present invention, the feature points of the dynamic target boxes detected in two adjacent frames are Dn = {P1^n, P2^n, ..., Pp^n} (region A in Fig. 7) and Dn+1 = {P1^(n+1), P2^(n+1), ..., Pq^(n+1)} (region B in Fig. 7), where Pi^n denotes the i-th feature point in frame n, p denotes the total number of such feature points in frame n, and q denotes the total number in frame n+1. The intersection of the detection boxes of the same dynamic object in the two adjacent frames is taken as the final set of dynamic target feature points, i.e. D = Dn ∩ Dn+1 (region C in Fig. 7); the points in D are taken as the final dynamic feature point set and are removed. This reduces the probability of feature points being mistakenly removed as dynamic and retains some suspected static feature points; the remaining regions A and B are kept, which increases the reliability of the tracking thread. The filtered static feature points are saved for subsequent feature matching and pose computation.

Step 4: use the SuperPoint network to extract key points and descriptors and discard those inside the masks, perform feature matching with the remaining feature points, continue with the Tracking module of visual SLAM, compute the camera pose by minimizing the reprojection error, build the map, and complete the whole SLAM pipeline.

Image matching is performed with the feature points and descriptors selected in the above process. In some embodiments of the present invention, the RANSAC random sample consensus algorithm is used to remove mismatched feature points, and the matching relation is converted into a 2D-to-2D epipolar geometry problem. Assume that x1 and x2 are the normalized coordinates of a pair of corresponding matched points in the two images, R is the camera rotation matrix and t is the translation matrix; then

x2 = Rx1 + t

Left-multiplying both sides by x2^T t^ gives:

x2^T t^ x2 = x2^T t^ R x1

The left-hand side equals 0, so:

x2^T t^ R x1 = 0

which is the epipolar constraint expression; the camera pose is then obtained by minimizing the reprojection error. Let the essential matrix be E = t^R and the fundamental matrix F = K^(-T)EK^(-1), where K is the camera intrinsic matrix, T denotes matrix transposition, t the translation matrix and R the rotation matrix. Solving for the camera pose can be split into two steps: compute the essential matrix E or the fundamental matrix F; then recover R and t from E or F. If the system is in localization mode, the local map is occupied, or relocalization has just finished, no keyframe is inserted. When none of these three conditions holds, if the number of inliers matched by the current frame exceeds the set threshold, the current frame is set as a keyframe, the tracking thread finishes, and mapping and loop-closure detection continue until the whole map is finally built.
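The keyframe-insertion rule described above can be summarized in a few lines (an illustrative sketch, not the patent's implementation); `inlier_threshold` and the tracker-state flag names are placeholders.

```python
def should_insert_keyframe(num_inliers, localization_only, local_map_busy,
                           just_relocalized, inlier_threshold=50):
    """Insert a keyframe only when the tracker is free to do so and tracking is strong."""
    if localization_only or local_map_busy or just_relocalized:
        return False
    return num_inliers > inlier_threshold
```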

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for accurately extracting static features of visual SLAM in a parking scene, characterized by comprising the following steps:
Step 1: capturing an image of the parking-lot scene in front of the vehicle, preprocessing the image, and feeding it into an object detection network for object detection to obtain the detection boxes of target objects;
Step 2: screening the dynamic-object detection boxes output in Step 1 and forming a mask, which is used together with the feature points extracted by the SuperPoint network to remove the feature points inside the dynamic-object detection boxes and obtain key points and descriptors, wherein the SuperPoint network comprises a shared encoder for key points and descriptors, a key-point decoder and a descriptor decoder, the shared encoder encodes the image into a feature map, the key-point decoder obtains the coordinates of the key points in the image, and the descriptor decoder obtains the descriptor vectors of the key points; the improvement of the SuperPoint network comprises replacing all convolutions in the encoder with depthwise separable convolutions; and object detection and feature extraction use multi-threaded parallelism, so that object detection is performed at the same time as feature extraction;
Step 3: if the mask represents a pedestrian, the SuperPoint network removes the feature points inside the mask; if it represents a car, the car detection regions of adjacent frames are compared, the feature points in the non-overlapping parts of the two adjacent detection regions are kept, and the feature points in the overlapping part are removed, yielding the filtered static feature points;
Step 4: using the SuperPoint network to extract key points and descriptors and discarding those inside the masks, performing feature matching with the remaining feature points, continuing with the tracking module of visual SLAM, and computing the camera pose and building the map to complete the whole SLAM pipeline.

2. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 1, characterized in that in Step 1 the object detection network is the YOLOv5 network and the object detection process comprises:
inputting an RGB image, scaling the input image to the network input size, and performing data augmentation;
the backbone network extracting image features and generating feature maps, the Backbone module using the CSPDarknet53 structure and the Focus structure to extract general features; the extracted general features being fed into the Neck network to extract more diverse and robust features, passed through the CSP2_X and CBL structures, upsampled, and concatenated with the features output by the backbone; and finally, the output head using CIoU_LOSS as the bounding-box loss function.

3. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 1, characterized in that before training, the SuperPoint network used in Step 2 is trained in a self-supervised manner:
first, a fully convolutional network is trained using regular geometric shapes as the dataset;
the detection results of the fully convolutional network on unlabeled real images are used as pseudo ground-truth key points, and homography techniques are used to extract features from the unlabeled real images at different scales to generate pseudo-labels;
after the pseudo-labels are generated, the real unlabeled images are put into the SuperPoint network for training.

4. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 1, characterized in that in Step 2 the SuperPoint network detects key points and descriptors as follows:
an H*W*3 image frame is input and converted to H*W*1 after grayscaling, and the image is then fed into the improved, more lightweight shared encoder, after which the input image size becomes Hc = H/8, Wc = W/8;
the key-point decoder performs a sub-pixel convolution, converts the input tensor from H/8*W/8*65 to H*W, and finally outputs, for each pixel, the probability that it is a key point;
the descriptor decoder uses a convolutional network to obtain a semi-dense descriptor map, then uses bicubic interpolation to obtain the remaining descriptors, and finally applies L2 normalization to obtain descriptors of uniform length.

5. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 1, characterized in that the depthwise separable convolution in Step 2 comprises two consecutive stages, a depthwise (per-channel) convolution and a pointwise convolution; the depthwise convolution applies a separate kernel to each channel, carrying out the convolution within a two-dimensional plane and producing an intermediate feature map, the kernel parameters of this stage amounting to f*f*3; the intermediate feature map is then processed by a pointwise convolution with 1*1*3 kernels and finally outputs an m-channel feature map, this part having 1*3*m parameters; the total number of parameters of the depthwise and pointwise convolution kernels is therefore 3*(f*f+m).

6. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 1, characterized in that in Step 2 the loss function of the improved SuperPoint network consists of a key-point extraction loss and a descriptor detection loss:
L(χ, χ', D, D'; Y, Y', S) = Lp(χ, Y) + Lp(χ', Y') + λLd(D, D', S)
where Lp(χ, Y) is the key-point loss, Lp(χ', Y') is the key-point loss of the flipped image, and Ld(D, D', S) is the descriptor loss; χ denotes the response of the image to key points after the encoding network model; χ' denotes the response of the flipped original image to key points after the encoding network model; D denotes the response of the image to descriptors after the encoding network model; D' denotes the response of the flipped image to descriptors after the encoding network model; Y denotes the key-point coordinate labels; Y' denotes the key-point coordinate labels of the flipped image; S denotes the image pair consisting of the original image and the flipped image; and λ is the balance factor.

7. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 1, characterized in that the feature points in Step 3 are removed as follows:
the feature points of the dynamic target boxes detected in two adjacent frames are Dn = {P1^n, P2^n, ..., Pp^n} and Dn+1 = {P1^(n+1), P2^(n+1), ..., Pq^(n+1)}, where Pi^n denotes the i-th feature point in frame n; the intersection of the detection boxes of the same dynamic object in the two adjacent frames is taken as the final set of dynamic target feature points, i.e. D = Dn ∩ Dn+1, and the feature points in D are taken as the final dynamic feature point set.

8. The method for accurately extracting static features of visual SLAM in a parking scene according to any one of claims 1-7, characterized in that in Step 4 the camera pose is computed as follows: image matching is performed with the selected feature points and descriptors, mismatched feature points are removed, and the matching relation is converted into a 2D-to-2D epipolar geometry problem; assuming x1 and x2 are the normalized coordinates of the corresponding matched points in the two images, R is the camera rotation matrix, t is the translation matrix and T denotes matrix transposition, then
x2 = Rx1 + t;
left-multiplying by x2^T t^ gives:
x2^T t^ x2 = x2^T t^ R x1;
the left-hand side equals 0, so:
x2^T t^ R x1 = 0,
which is the epipolar constraint expression, and the camera pose is then obtained according to the minimum reprojection error.

9. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 8, characterized in that the RANSAC random sample consensus algorithm is used to remove mismatched feature points.

10. The method for accurately extracting static features of visual SLAM in a parking scene according to claim 8, characterized in that in Step 4, during mapping, no keyframe is inserted if the system is in localization mode, the local map is occupied, or relocalization has just finished.
CN202211028947.4A 2022-08-23 2022-08-23 Method for accurately extracting visual SLAM static characteristics in parking scene Pending CN115439743A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211028947.4A CN115439743A (en) 2022-08-23 2022-08-23 Method for accurately extracting visual SLAM static characteristics in parking scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211028947.4A CN115439743A (en) 2022-08-23 2022-08-23 Method for accurately extracting visual SLAM static characteristics in parking scene

Publications (1)

Publication Number Publication Date
CN115439743A true CN115439743A (en) 2022-12-06

Family

ID=84244129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211028947.4A Pending CN115439743A (en) 2022-08-23 2022-08-23 Method for accurately extracting visual SLAM static characteristics in parking scene

Country Status (1)

Country Link
CN (1) CN115439743A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049731A (en) * 2022-06-17 2022-09-13 感知信息科技(浙江)有限责任公司 Visual mapping and positioning method based on binocular camera
CN116311082A (en) * 2023-05-15 2023-06-23 广东电网有限责任公司湛江供电局 Wearing detection method and system based on matching of key parts and images
CN116630901A (en) * 2023-07-24 2023-08-22 南京师范大学 A Visual Odometry Method Based on Latent Graph Predictive Unsupervised Learning Framework
CN116630901B (en) * 2023-07-24 2023-09-22 南京师范大学 Visual odometer method based on potential diagram prediction non-supervision learning framework
CN117893693A (en) * 2024-03-15 2024-04-16 南昌航空大学 Dense SLAM three-dimensional scene reconstruction method and device
CN117893693B (en) * 2024-03-15 2024-05-28 南昌航空大学 Dense SLAM three-dimensional scene reconstruction method and device
CN118366130A (en) * 2024-06-19 2024-07-19 深圳拜波赫技术有限公司 Pedestrian glare protection and intelligent shadow area generation method and system

Similar Documents

Publication Publication Date Title
CN115439743A (en) Method for accurately extracting visual SLAM static characteristics in parking scene
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN110544251A (en) Dam crack detection method based on multi-transfer learning model fusion
CN108460403A (en) The object detection method and system of multi-scale feature fusion in a kind of image
CN108427924A (en) A kind of text recurrence detection method based on rotational sensitive feature
CN104463241A (en) Vehicle type recognition method in intelligent transportation monitoring system
WO2023082784A1 (en) Person re-identification method and apparatus based on local feature attention
CN113901961B (en) Parking space detection method, device, equipment and storage medium
CN110008900B (en) Method for extracting candidate target from visible light remote sensing image from region to target
CN112801182B (en) A RGBT Target Tracking Method Based on Difficult Sample Perception
CN105225281B (en) A kind of vehicle checking method
CN116279592A (en) Method for dividing travelable area of unmanned logistics vehicle
CN107066916A (en) Scene Semantics dividing method based on deconvolution neutral net
CN110334709A (en) End-to-end multi-task deep learning-based license plate detection method
CN112560852A (en) Single-stage target detection method with rotation adaptive capacity based on YOLOv3 network
CN112070174A (en) Text detection method in natural scene based on deep learning
CN111680580A (en) A recognition method, device, electronic device and storage medium for running a red light
CN113095371A (en) Feature point matching method and system for three-dimensional reconstruction
CN112766056A (en) Method and device for detecting lane line in low-light environment based on deep neural network
CN117058641A (en) Panoramic driving perception method based on deep learning
CN118262090A (en) A lightweight open set remote sensing target detection method based on LMSFA-YOLO
CN116935249A (en) A small target detection method with three-dimensional feature enhancement in drone scenes
CN116883868A (en) UAV intelligent cruise detection method based on adaptive image defogging
CN112446292A (en) 2D image salient target detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination