CN113362363A

CN113362363A - Automatic image annotation method and device based on visual SLAM and storage medium

Info

Publication number: CN113362363A
Application number: CN202110676419.9A
Authority: CN
Inventors: 曾碧; 王俊丰
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2021-06-18
Filing date: 2021-06-18
Publication date: 2021-09-07
Anticipated expiration: 2041-06-18
Also published as: CN113362363B

Abstract

The invention discloses an image automatic labeling method, device and storage medium based on visual SLAM; the method first processes the current frame image to obtain a labeling frame, and then automatically labels other video images of different frames to correct the initial In order to obtain the final more accurate annotation data; this method does not require a priori data set, can realize online automatic annotation while collecting images, and supports multi-object and multi-category annotation. This method only needs a single annotation by the user. It can greatly reduce the manual image annotation work, improve the collection efficiency of image annotation data, and the operation is simple and easy to use. The invention can be widely used in the technical field of image processing.

Description

A kind of image automatic labeling method, device and storage medium based on visual SLAM

技术领域technical field

本发明涉及图像处理技术领域，尤其是一种基于视觉SLAM的图像自动标注方法、装置及存储介质。The invention relates to the technical field of image processing, in particular to a visual SLAM-based image automatic labeling method, device and storage medium.

背景技术Background technique

随着计算机视觉的发展，机器学习和深度学习等人工智能技术在图像分类、目标检测和图像分割等研究中得到广泛的应用并取得突破性的成效，但这类方法的模型训练依赖于大量与任务相关的图像标注数据，在推进模型落地应用前必须以人工方式对业务场景进行图像数据采集与标注，目前这一阶段需时较大且影响整体的研发进度。With the development of computer vision, artificial intelligence technologies such as machine learning and deep learning have been widely used in image classification, target detection and image segmentation and have achieved breakthrough results. However, the model training of such methods relies on a large number of For task-related image annotation data, image data collection and annotation of business scenarios must be performed manually before the model is applied. At present, this stage takes a long time and affects the overall research and development progress.

目前，利用基于深度学习识别待标注图像的方法进行标注，但该方法需要预先收集图像数据进行训练并以此建立对应的候选标签集，因此不能直接用于首次标注的情况；此外，图像识别的准确度依赖数据集规模和各标签的概率分布情况，同时标注对象受限，该方法不具备普遍适用性。At present, the method of identifying images to be labeled based on deep learning is used for labeling, but this method needs to collect image data in advance for training and establish a corresponding candidate label set, so it cannot be directly used for the first labeling; The accuracy depends on the scale of the dataset and the probability distribution of each label, and the labeling objects are limited, so this method does not have universal applicability.

发明内容SUMMARY OF THE INVENTION

本发明旨在至少解决现有技术中存在的技术问题之一。为此，本发明提出一种基于视觉SLAM的图像自动标注方法、装置及存储介质。The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the present invention proposes an automatic image labeling method, device and storage medium based on visual SLAM.

本发明所采取的技术方案是：The technical scheme adopted by the present invention is:

一方面，本发明实施例包括一种基于视觉SLAM的图像自动标注方法，包括：On the one hand, an embodiment of the present invention includes an automatic image labeling method based on visual SLAM, including:

获取实时相机或视频中至少两帧连续图像，利用ORB-SLAM算法构建得到三维点云空间；Obtain at least two consecutive images from a real-time camera or video, and use the ORB-SLAM algorithm to construct a 3D point cloud space;

获取新一帧图像作为当前帧图像，通过PnP问题求解得到相机位姿；Obtain a new frame image as the current frame image, and obtain the camera pose by solving the PnP problem;

以所述当前帧图像为选区，在选区框选标注区域，并在所述标注区域标注相应语义信息；Taking the current frame image as a selection area, frame a marked area in the selection area, and mark corresponding semantic information in the marked area;

根据所述三维点云空间，获取所述标注区域内二维特征点对应的第一三维点云集合；obtaining, according to the three-dimensional point cloud space, a first three-dimensional point cloud set corresponding to the two-dimensional feature points in the marked area;

利用点云聚类算法对所述第一三维点云集合进行处理，得到第二三维点云集合，所述第二三维点云集合为点云数量最多的点云簇；Using a point cloud clustering algorithm to process the first three-dimensional point cloud set to obtain a second three-dimensional point cloud set, the second three-dimensional point cloud set is the point cloud cluster with the largest number of point clouds;

基于主成分分析法计算得到所述第二三维点云集合的最小有向包围盒；Calculate the minimum directed bounding box of the second three-dimensional point cloud set based on principal component analysis;

根据所述相机位姿和相机内参，构建得到相机透视矩阵；According to the camera pose and camera internal parameters, construct a camera perspective matrix;

根据所述相机透视矩阵，将所述最小有向包围盒投影到所述当前帧图像，形成增强现实场景，并得到所述最小有向包围盒对应的矩形框；According to the camera perspective matrix, project the minimum directed bounding box to the current frame image to form an augmented reality scene, and obtain a rectangular frame corresponding to the minimum directed bounding box;

对所述矩形框进行图像分割以识别主体；image segmentation of the rectangular frame to identify the subject;

根据识别的主体对所述矩形框进行调整，得到标注框；Adjust the rectangular frame according to the identified subject to obtain a labeling frame;

比较所述矩形框和所述标注框的面积；comparing the area of the rectangular frame and the callout frame;

若所述标注框的面积与所述矩形框的面积之比小于第一阈值，则以所述标注框为选区，并执行在选区框选标注区域，并在所述标注区域标注相应语义信息的步骤；If the ratio of the area of the callout frame to the area of the rectangular frame is less than the first threshold, the callout frame is used as the selection area, and the selection area is selected in the selection area, and the corresponding semantic information is marked in the callout area. step;

从所述视频中获取目标帧图像，所述目标帧图像为除所述当前帧图像以外的后续任意一帧图像；Obtain a target frame image from the video, where the target frame image is any subsequent frame image except the current frame image;

基于视觉SLAM算法获取所述目标帧图像的第一新增点云数据；Obtain the first newly added point cloud data of the target frame image based on the visual SLAM algorithm;

对所述第一新增点云数据进行过滤，得到第二新增点云数据，所述第二新增点云数据为所述最小有向包围盒中新增的点云数据；Filtering the first newly added point cloud data to obtain second newly added point cloud data, where the second newly added point cloud data is the newly added point cloud data in the minimum directed bounding box;

若所述第二新增点云数据的数量大于第二阈值，则以所述目标帧图像为选区，并执行在选区框选标注区域，并在所述标注区域标注相应语义信息的步骤。If the quantity of the second newly added point cloud data is greater than the second threshold, the target frame image is used as a selection area, and the steps of selecting a marked area in the selection area and marking corresponding semantic information in the marked area are performed.

进一步地，所述获取实时相机或视频中至少两帧图像，利用ORB-SLAM算法构建得到三维点云空间这一步骤，包括：Further, the step of obtaining at least two frames of images in the real-time camera or the video, and constructing the three-dimensional point cloud space by using the ORB-SLAM algorithm, includes:

对相机进行标定以获取相机内参矩阵和畸变参数；Calibrate the camera to obtain the camera internal parameter matrix and distortion parameters;

获取实时相机或视频中至少两帧连续图像；Obtain at least two consecutive images from a live camera or video;

依次对两帧所述连续图像提取ORB特征点并计算相应的二进制描述子；Extracting ORB feature points from the continuous images of the two frames in turn and calculating the corresponding binary descriptors;

根据所述ORB特征点，对两帧所述连续图像进行特征点匹配，得到图像特征点；According to the ORB feature points, feature point matching is performed on the two consecutive images to obtain image feature points;

利用对极几何约束估计两帧所述连续图像间的相机运动，计算得到相机位姿；Using the epipolar geometric constraint to estimate the camera motion between the two consecutive images, the camera pose is calculated;

根据所述相机内参矩阵、畸变参数和相机位姿，利用三角化测量计算所述图像特征点的三维坐标，构建得到三维点云空间。According to the camera internal parameter matrix, distortion parameters and camera pose, the three-dimensional coordinates of the image feature points are calculated by triangulation, and a three-dimensional point cloud space is constructed.

进一步地，所述以所述当前帧图像为选区，在选区框选标注区域，并在所述标注区域标注相应语义信息这一步骤，包括：Further, the step of taking the current frame image as a selection area, selecting a marked area in the selection area, and marking corresponding semantic information in the marked area, includes:

以所述当前帧图像为选区，在选区框选第一区域；Taking the current frame image as a selection area, select the first area in the selection area;

根据所述当前帧图像的窗口大小与所述当前帧图像像素宽高比例，将所述第一区域转化为像素区域；converting the first area into a pixel area according to the window size of the current frame image and the pixel aspect ratio of the current frame image;

将所述像素区域作为标注区域，并在所述标注区域标注相应语义信息。The pixel area is used as an annotation area, and corresponding semantic information is annotated in the annotation area.

进一步地，所述根据所述三维点云空间，获取所述标注区域内二维特征点对应的第一三维点云集合这一步骤，包括：Further, the step of obtaining the first three-dimensional point cloud set corresponding to the two-dimensional feature points in the marked area according to the three-dimensional point cloud space includes:

获取所述标注区域内的二维特征点；acquiring two-dimensional feature points in the marked area;

根据所述三维点云空间，将所述二维特征点进行反投影，得到第一三维点云集合。According to the three-dimensional point cloud space, the two-dimensional feature points are back-projected to obtain a first three-dimensional point cloud set.

进一步地，所述利用点云聚类算法对所述第一三维点云集合进行处理，得到第二三维点云集合这一步骤，包括：Further, the step of using a point cloud clustering algorithm to process the first three-dimensional point cloud set to obtain a second three-dimensional point cloud set includes:

从所述第一三维点云集合中选取第一点云为目标点云，所述第一点云为所述第一三维点云集合中的任意点云；Selecting a first point cloud from the first three-dimensional point cloud set as a target point cloud, and the first point cloud being any point cloud in the first three-dimensional point cloud set;

判断所述目标点云是否为边缘噪点；Determine whether the target point cloud is edge noise;

若所述目标点云为边缘噪点，则重新从所述第一三维点云集合中选取第二点云为目标点云，所述第二点云为所述第一三维点云集合中除所述第一点云以外的任意点云；If the target point cloud is edge noise, then re-select a second point cloud from the first 3D point cloud set as the target point cloud, and the second point cloud is the first 3D point cloud set except Any point cloud other than the first point cloud mentioned above;

若所述目标点云不为边缘噪点，则从所述第一三维点云集合中找出所有从所述目标点云密度可达的点云，构成点云簇；If the target point cloud is not an edge noise, find out all the point clouds that are density-reachable from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;

遍历所述第一三维点云集合中的所有点云，得到多个点云簇；Traverse all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;

选择点云数量最多的点云簇为第二三维点云集合。The point cloud cluster with the largest number of point clouds is selected as the second three-dimensional point cloud set.

进一步地，所述根据所述相机位姿和相机内参，构建得到相机透视矩阵这一步骤，包括：Further, the step of constructing and obtaining the camera perspective matrix according to the camera pose and camera internal parameters includes:

根据第一公式将所述相机位姿的数据从世界坐标系转到OpenGL坐标系，所述第一公式为：R_o＝R_owR_wR_ow,t_o＝R_owt_w；其中，

世界坐标系为(R_w,t_w)，OpenGL坐标系为(R_o,t_o)；The data of the camera pose is transferred from the world coordinate system to the OpenGL coordinate system according to the first formula. The first formula is: R _o =R _ow R _w R _ow , to =R _ow _t _w ; wherein,

The world coordinate system is (R _w ,t _w ), and the OpenGL coordinate system is (R _o ,t _o );

根据所述OpenGL坐标系，结合相机内参，构建得到相机透视矩阵，所述相机透视矩阵为：

According to the OpenGL coordinate system, combined with the camera internal parameters, the camera perspective matrix is constructed and obtained, and the camera perspective matrix is:

其中，

in,

式中，P表示相机透视矩阵，相机内参为(u₀,v₀,f_u,f_v)；near表示相机光心到近截面的距离，far表示相机光心到远截面的距离，right表示近截面中心到右截面的距离，left表示近截面中心到左截面的距离，top表示近截面中心到上截面的距离，bottom表示近截面中心到下截面的距离，w表示图像像素宽度，h表示图像像素高度。In the formula, P represents the perspective matrix of the camera, and the internal parameters of the camera are (u ₀ , v ₀ , f _u , f _v ); near represents the distance from the camera optical center to the near section, far represents the distance from the camera optical center to the far section, and right represents The distance from the near section center to the right section, left indicates the distance from the near section center to the left section, top indicates the distance from the near section center to the upper section, bottom indicates the distance from the near section center to the lower section, w indicates the image pixel width, h indicates Image pixel height.

进一步地，所述对所述第一新增点云数据进行过滤，得到第二新增点云数据这一步骤，包括：Further, the step of filtering the first newly added point cloud data to obtain the second newly added point cloud data includes:

基于随机抽样一致性算法对所述第一新增点云数据进行过滤，除去地面点云数据；Filtering the first newly added point cloud data based on a random sampling consistency algorithm to remove ground point cloud data;

根据所述最小有向包围盒的三维空间范围进行条件滤波，得到第二新增点云数据。Conditional filtering is performed according to the three-dimensional space range of the minimum directional bounding box to obtain second newly added point cloud data.

进一步地，若所述第二新增点云数据的数量不大于第二阈值，则将所述目标帧图像作为当前帧图像，并执行根据所述相机透视矩阵，将所述最小有向包围盒投影到所述当前帧图像，形成增强现实场景，并得到所述最小有向包围盒对应的矩形框的步骤。Further, if the number of the second newly added point cloud data is not greater than the second threshold, the target frame image is taken as the current frame image, and the minimum directed bounding box is executed according to the camera perspective matrix. The steps of projecting to the current frame image, forming an augmented reality scene, and obtaining a rectangular box corresponding to the minimum directed bounding box.

另一方面，本发明实施例还包括一种基于视觉SLAM的图像自动标注装置，包括：On the other hand, the embodiment of the present invention also includes an image automatic labeling device based on visual SLAM, including:

至少一个处理器；at least one processor;

至少一个存储器，用于存储至少一个程序；at least one memory for storing at least one program;

当所述至少一个程序被所述至少一个处理器执行，使得所述至少一个处理器实现所述的基于视觉SLAM的图像自动标注方法。When the at least one program is executed by the at least one processor, the at least one processor implements the automatic image annotation method based on visual SLAM.

另一方面，本发明实施例还包括计算机可读存储介质，其上存储有处理器可执行的程序，所述处理器可执行的程序在被处理器执行时用于实现所述的基于视觉SLAM的图像自动标注方法。On the other hand, an embodiment of the present invention further includes a computer-readable storage medium on which a processor-executable program is stored, and the processor-executable program, when executed by the processor, is used to implement the vision-based SLAM method for automatic image annotation.

本发明的有益效果是：The beneficial effects of the present invention are:

本发明提出一种基于视觉SLAM的图像自动标注方法，包括：获取视频图像，基于利用ORB-SLAM算法构建得到三维点云空间和相机位姿，以当前帧图像为选区，在选区框选标注区域，并在所述标注区域标注相应语义信息；根据三维点云空间，获取标注区域内二维特征点对应的第一三维点云集合；利用点云聚类算法对第一三维点云集合进行处理，得到第二三维点云集合；基于主成分分析法计算得到第二三维点云集合的最小有向包围盒；根据相机位姿和相机内参，构建得到相机透视矩阵；根据相机透视矩阵，将最小有向包围盒投影到当前帧图像，形成增强现实场景，并得到最小有向包围盒对应的矩形框；对矩形框进行图像分割以识别主体；根据识别的主体对矩形框进行调整，得到标注框；比较矩形框和标注框的面积；若标注框的面积与矩形框的面积之比小于第一阈值，则以标注框为选区，并执行在选区框选标注区域，并在标注区域标注相应语义信息的步骤；从而可生成当前帧图像的标注结果。最后，结合不同视频帧的新增点云数据，自动调整标注框，从而自动生成每一帧图像的标注结果。该方法无需先验数据集，能在采集图像的同时实现在线式自动标注，支持多目标和多类别标注，该方法仅需用户单次标注，能极大减少人工图像标注工作，提高图像标注数据的采集效率，且操作简单易用。The present invention proposes an automatic image labeling method based on visual SLAM, which includes: acquiring a video image, obtaining a three-dimensional point cloud space and camera pose based on constructing an ORB-SLAM algorithm, taking the current frame image as a selection area, and selecting a labeling area in the selection area , and label corresponding semantic information in the labeling area; obtain the first 3D point cloud set corresponding to the 2D feature points in the labeling area according to the 3D point cloud space; use the point cloud clustering algorithm to process the first 3D point cloud set , obtain the second three-dimensional point cloud set; calculate the minimum directed bounding box of the second three-dimensional point cloud set based on principal component analysis; construct the camera perspective matrix according to the camera pose and camera internal parameters; The directed bounding box is projected to the current frame image to form an augmented reality scene, and the rectangular frame corresponding to the smallest directed bounding box is obtained; the image segmentation of the rectangular frame is performed to identify the subject; the rectangular frame is adjusted according to the identified subject to obtain an annotation frame ; Compare the area of the rectangle and the callout; if the ratio of the area of the callout to the area of the rectangle is less than the first threshold, take the callout as the selection area, select the callout area in the selection area, and mark the corresponding semantics in the callout area information step; thus, the annotation result of the current frame image can be generated. Finally, combined with the newly added point cloud data of different video frames, the annotation frame is automatically adjusted to automatically generate the annotation results of each frame of image. This method does not require a priori data set, can realize online automatic annotation while collecting images, and supports multi-object and multi-category annotation. This method only needs a single annotation by the user, which can greatly reduce the manual image annotation work and improve the image annotation data. The collection efficiency is high, and the operation is simple and easy to use.

本发明的附加方面和优点将在下面的描述中部分给出，部分将从下面的描述中变得明显，或通过本发明的实践了解到。Additional aspects and advantages of the present invention will be set forth, in part, from the following description, and in part will be apparent from the following description, or may be learned by practice of the invention.

附图说明Description of drawings

本发明的上述和/或附加的方面和优点从结合下面附图对实施例的描述中将变得明显和容易理解，其中：The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, wherein:

图1为本发明实施例所述基于视觉SLAM的图像自动标注方法的步骤流程图；FIG. 1 is a flowchart of the steps of the automatic image labeling method based on visual SLAM according to an embodiment of the present invention;

图2为本发明实施例所述基于视觉SLAM的图像自动标注方法的具体流程图。FIG. 2 is a specific flowchart of an image automatic labeling method based on visual SLAM according to an embodiment of the present invention.

图3为本发明实施例所述获取视频中至少两帧图像，利用ORB-SLAM算法构建得到三维点云空间的流程图；3 is a flowchart of obtaining at least two frames of images in a video according to an embodiment of the present invention, and constructing a three-dimensional point cloud space by using an ORB-SLAM algorithm;

图4为本发明实施例所述利用点云聚类算法对所述第一三维点云集合进行处理，得到第二三维点云集合的流程图；4 is a flowchart of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm according to an embodiment of the present invention to obtain a second three-dimensional point cloud set;

图5为本发明实施例所述利用点云聚类算法对所述第一三维点云集合进行处理，得到第二三维点云集合的具体流程图；5 is a specific flowchart of processing the first three-dimensional point cloud set by using a point cloud clustering algorithm according to an embodiment of the present invention to obtain a second three-dimensional point cloud set;

图6为本发明实施例所述矩形框和标注框的对比示意图；6 is a schematic diagram of a comparison between a rectangular frame and a marked frame according to an embodiment of the present invention;

图7为本发明实施例所述基于视觉SLAM的图像自动标注装置的结构示意图。FIG. 7 is a schematic structural diagram of an apparatus for automatic image annotation based on visual SLAM according to an embodiment of the present invention.

具体实施方式Detailed ways

下面详细描述本发明的实施例，所述实施例的示例在附图中示出，其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的，仅用于解释本发明，而不能理解为对本发明的限制。The following describes in detail the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary, only used to explain the present invention, and should not be construed as a limitation of the present invention.

在本发明的描述中，需要理解的是，涉及到方位描述，例如上、下、前、后、左、右等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。In the description of the present invention, it should be understood that the azimuth description, such as the azimuth or position relationship indicated by up, down, front, rear, left, right, etc., is based on the azimuth or position relationship shown in the drawings, only In order to facilitate the description of the present invention and simplify the description, it is not indicated or implied that the indicated device or element must have a particular orientation, be constructed and operated in a particular orientation, and therefore should not be construed as limiting the present invention.

在本发明的描述中，若干的含义是一个或者多个，多个的含义是两个以上，大于、小于、超过等理解为不包括本数，以上、以下、以内等理解为包括本数。如果有描述到第一、第二只是用于区分技术特征为目的，而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量或者隐含指明所指示的技术特征的先后关系。In the description of the present invention, the meaning of several is one or more, the meaning of multiple is two or more, greater than, less than, exceeding, etc. are understood as not including this number, above, below, within, etc. are understood as including this number. If it is described that the first and the second are only for the purpose of distinguishing technical features, it cannot be understood as indicating or implying relative importance, or indicating the number of the indicated technical features or the order of the indicated technical features. relation.

本发明的描述中，除非另有明确的限定，设置、安装、连接等词语应做广义理解，所属技术领域技术人员可以结合技术方案的具体内容合理确定上述词语在本发明中的具体含义。In the description of the present invention, unless otherwise clearly defined, words such as setting, installation, connection should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above words in the present invention in combination with the specific content of the technical solution.

下面结合附图，对本申请实施例作进一步阐述。The embodiments of the present application will be further described below with reference to the accompanying drawings.

参照图1，本发明实施例提供一种基于视觉SLAM的图像自动标注方法，包括但不限于以下步骤：1, an embodiment of the present invention provides a visual SLAM-based automatic image labeling method, including but not limited to the following steps:

S100.获取实时相机或视频中至少两帧连续图像，利用ORB-SLAM算法构建得到三维点云空间；S100. Acquire at least two consecutive frames of images from a real-time camera or video, and use the ORB-SLAM algorithm to construct a three-dimensional point cloud space;

S200获取新一帧图像作为当前帧图像，通过PnP问题求解得到相机位姿；S200 obtains a new frame image as the current frame image, and obtains the camera pose by solving the PnP problem;

S300.以当前帧图像为选区，在选区框选标注区域，并在标注区域标注相应语义信息；S300. Take the current frame image as a selection area, select a marked area in the selection area, and mark corresponding semantic information in the marked area;

S400.根据三维点云空间，获取标注区域内二维特征点对应的第一三维点云集合；S400. According to the three-dimensional point cloud space, obtain the first three-dimensional point cloud set corresponding to the two-dimensional feature points in the marked area;

S500.利用点云聚类算法对第一三维点云集合进行处理，得到第二三维点云集合，第二三维点云集合为点云数量最多的点云簇；S500. Use a point cloud clustering algorithm to process the first three-dimensional point cloud set to obtain a second three-dimensional point cloud set, and the second three-dimensional point cloud set is the point cloud cluster with the largest number of point clouds;

S600.基于主成分分析法计算得到第二三维点云集合的最小有向包围盒；S600. Calculate the minimum directed bounding box of the second three-dimensional point cloud set based on principal component analysis;

S700.根据相机位姿和相机内参，构建得到相机透视矩阵；S700. According to the camera pose and camera internal parameters, construct the camera perspective matrix;

S800.根据相机透视矩阵，将最小有向包围盒投影到当前帧图像，形成增强现实场景，并得到最小有向包围盒对应的矩形框；S800. Project the minimum directed bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene, and obtain a rectangular frame corresponding to the minimum directed bounding box;

S900.对矩形框进行图像分割以识别主体；S900. Perform image segmentation on the rectangular frame to identify the subject;

S1000.根据识别的主体对矩形框进行调整，得到标注框；S1000. Adjust the rectangular frame according to the identified subject to obtain a label frame;

S1100.比较矩形框和标注框的面积；S1100. Compare the area of the rectangle frame and the callout frame;

S1200.若标注框的面积与矩形框的面积之比小于第一阈值，则以标注框为选区，并执行在选区框选标注区域，并在标注区域标注相应语义信息的步骤；S1200. If the ratio of the area of the marked frame to the area of the rectangular frame is less than the first threshold, the marked frame is used as a selection area, and the steps of selecting a marked area in the selection area and marking the corresponding semantic information in the marked area are performed;

S1300.从视频中获取目标帧图像，目标帧图像为除当前帧图像以外的任意一帧图像；S1300. Obtain the target frame image from the video, and the target frame image is any frame image except the current frame image;

S1400.基于视觉SLAM算法获取目标帧图像的第一新增点云数据；S1400. Obtain the first newly added point cloud data of the target frame image based on the visual SLAM algorithm;

S1500.对第一新增点云数据进行过滤，得到第二新增点云数据，第二新增点云数据为最小有向包围盒中新增的点云数据；S1500. Filter the first newly added point cloud data to obtain the second newly added point cloud data, and the second newly added point cloud data is the newly added point cloud data in the minimum directed bounding box;

S1600.若第二新增点云数据的数量大于第二阈值，则以目标帧图像为选区，并执行在选区框选标注区域，并在标注区域标注相应语义信息的步骤。S1600. If the quantity of the second newly added point cloud data is greater than the second threshold, take the target frame image as a selection area, and perform the steps of selecting a marked area in the selection area, and marking corresponding semantic information in the marked area.

本实施例中，针对人工标注图像数据需时较大且制约研发进度的问题，本发实施例提出一种基于视觉SLAM的图像自动标注方法。主要包括以下步骤：首先，获取视频图像，基于视觉SLAM算法构建三维点云空间和相机位姿，同时结合相机内参矩阵，基于OpenGL坐标系构建增强现实场景；其次，根据用户单次框选区域内的二维特征点，反投影获得对应的三维点云数据并对其进行聚类，再求解最小有向包围盒；结合相机位姿和投影矩阵，可将包裹实体的虚拟包围盒投影到二维图像上并生成初步矩形标注框；然后，基于图像分割算法识别主体，对框选区域进行合理调整，生成本次标注结果。最后，结合不同视角下的新增点云数据以及自动调整的二维框选区域，将进一步调整虚拟包围盒，从而自动生成每一帧图像的标注数据。本实施例的图像自动标注方法，从图像源进行数据标注，能极大减少人工图像标注工作，提高图像标注数据的采集效率，有效推进计算机视觉模型训练的研发进度。In this embodiment, in order to solve the problem that it takes a long time to manually label image data and restricts the progress of research and development, the embodiment of the present invention proposes an automatic image labeling method based on visual SLAM. It mainly includes the following steps: first, acquire video images, construct 3D point cloud space and camera pose based on visual SLAM algorithm, and construct augmented reality scene based on OpenGL coordinate system combined with camera internal parameter matrix; secondly, according to the user's single frame selection area The two-dimensional feature points of , back-project the corresponding three-dimensional point cloud data and cluster them, and then solve the minimum directed bounding box; Combined with the camera pose and projection matrix, the virtual bounding box of the wrapped entity can be projected to the two-dimensional A preliminary rectangular annotation frame is generated on the image; then, the subject is identified based on the image segmentation algorithm, and the frame selection area is adjusted reasonably to generate this annotation result. Finally, combined with the newly added point cloud data from different perspectives and the automatically adjusted two-dimensional frame selection area, the virtual bounding box will be further adjusted to automatically generate annotation data for each frame of image. The automatic image labeling method of the present embodiment performs data labeling from the image source, which can greatly reduce the manual image labeling work, improve the collection efficiency of image labeling data, and effectively promote the research and development progress of computer vision model training.

本实施例中，视觉SLAM算法采用ORB-SLAM算法；点云聚类算法采用基于密度的DBSCAN算法；求解最小有向包围盒算法采用有向包围盒算法(Oriented Bounding Box)；图像分割算法采用区域生长算法。In this embodiment, the visual SLAM algorithm adopts the ORB-SLAM algorithm; the point cloud clustering algorithm adopts the density-based DBSCAN algorithm; the minimum directed bounding box algorithm adopts the oriented bounding box algorithm (Oriented Bounding Box); the image segmentation algorithm adopts the region growth algorithm.

具体地，关于步骤S100，本实施例中，获取实时相机或视频中至少两帧连续图像，就能够利用ORB-SLAM算法构建得到三维点云空间，当然，获取超过两帧以上的连续图像，也能够构建得到三维点云空间，参照图3，其具体的操作过程如下：Specifically, with regard to step S100, in this embodiment, by acquiring at least two consecutive images from a real-time camera or video, the ORB-SLAM algorithm can be used to construct a three-dimensional point cloud space. The three-dimensional point cloud space can be constructed and obtained. Referring to Figure 3, the specific operation process is as follows:

S101.对相机进行标定以获取相机内参矩阵和畸变参数；S101. Calibrate the camera to obtain the camera internal parameter matrix and distortion parameters;

S102.获取实时相机或视频中至少两帧连续图像；S102. Acquire at least two consecutive frames of images from the real-time camera or video;

S103.依次对两帧连续图像提取ORB特征点并计算相应的二进制描述子；S103. Extract ORB feature points from two consecutive frames of images in turn and calculate corresponding binary descriptors;

S104.根据ORB特征点，对两帧连续图像进行特征点匹配，得到图像特征点；S104. According to the ORB feature points, perform feature point matching on two consecutive images to obtain image feature points;

S105.利用对极几何约束估计两帧连续图像间的相机运动，计算得到相机位姿；S105. Use epipolar geometric constraints to estimate the camera motion between two consecutive images, and calculate the camera pose;

S106.根据相机内参矩阵、畸变参数和相机位姿，利用三角化测量计算图像特征点的三维坐标，构建得到三维点云空间。S106. According to the camera internal parameter matrix, the distortion parameters and the camera pose, use triangulation to calculate the three-dimensional coordinates of the image feature points, and construct a three-dimensional point cloud space.

本实施例中，对采集视频图像的相机进行标定以获取该相机的内参矩阵和畸变参数，依次对两帧连续图像提取ORB特征点并获取ORB描述子，然后对两帧连续图像进行特征点匹配，得到图像特征点，最后根据对极点约束求解相机在两帧连续图像之间的运动，利用三角化测量来计算图像特征点的三维坐标，以此构建三维点云空间。如果获取的是两帧以上的视频图像，比如随机截取一段视频，此时，同样需要对相机进行标定以获取相机内参矩阵和畸变参数，然后依时序对视频图像提取ORB特征点并求其描述子，在相邻帧之间进行特征点匹配，根据对极点约束求解相机在相邻帧之间的运动，利用三角化测量来计算图像特征点的三维坐标，以此构建三维点云空间。In this embodiment, the camera that collects video images is calibrated to obtain the internal parameter matrix and distortion parameters of the camera, ORB feature points are sequentially extracted from two consecutive frames of images and ORB descriptors are obtained, and then feature point matching is performed on the two consecutive frames of images. , obtain the image feature points, and finally solve the motion of the camera between two consecutive images according to the pole constraint, and use the triangulation measurement to calculate the three-dimensional coordinates of the image feature points to construct a three-dimensional point cloud space. If the acquired video images are more than two frames, such as randomly intercepting a video, at this time, the camera also needs to be calibrated to obtain the camera internal parameter matrix and distortion parameters, and then extract the ORB feature points from the video image in time sequence and find its descriptors , perform feature point matching between adjacent frames, solve the motion of the camera between adjacent frames according to the extreme point constraints, and use triangulation to calculate the three-dimensional coordinates of image feature points to construct a three-dimensional point cloud space.

本实施例中，对于步骤S200，对于视频中的后续图像帧，将进行PnP问题求解，以此计算相机位姿。此外，ORB-SLAM算法会进行闭环检测、重定位和对关键帧的图优化，进而可以求得较为精确的相机位姿，即相机的旋转矩阵R和平移向量t。In this embodiment, for step S200, for the subsequent image frames in the video, the PnP problem is solved to calculate the camera pose. In addition, the ORB-SLAM algorithm will perform closed-loop detection, relocation, and graph optimization of key frames, so as to obtain a more accurate camera pose, that is, the camera's rotation matrix R and translation vector t.

对于步骤S300，也就是以当前帧图像为选区，在选区框选标注区域，并在标注区域标注相应语义信息这一步骤，具体包括：For step S300, that is, taking the current frame image as a selection area, selecting a marked area in the selection area, and marking the corresponding semantic information in the marked area, the step specifically includes:

S301.以当前帧图像为选区，在选区框选第一区域；S301. Take the current frame image as the selection area, and select the first area in the selection area;

S302.根据当前帧图像的窗口大小与当前帧图像像素宽高比例，将第一区域转化为像素区域；S302. Convert the first area into a pixel area according to the window size of the current frame image and the pixel aspect ratio of the current frame image;

S303.将像素区域作为标注区域，并在标注区域标注相应语义信息。S303. Use the pixel area as the labeling area, and label corresponding semantic information in the labeling area.

本实施例中，用户需要对当前帧图像的目标对象进行首次框选，具体地，以当前帧图像为选区，在选区框选目标对象所在区域为标注区域。然后，根据用户在当前帧图像上所框选的标注区域，按窗口大小与图像像素宽高比例，转化为像素区域，并保存该区域所标注的语义信息。In this embodiment, the user needs to perform the first frame selection on the target object of the current frame image. Specifically, the current frame image is used as the selection area, and the area where the frame selection target object is located in the selection area is the marked area. Then, according to the marked area selected by the user on the current frame image, according to the ratio of the window size and the image pixel width and height, it is converted into a pixel area, and the semantic information marked in the area is saved.

本实施例中，步骤S400，也就是根据三维点云空间，获取标注区域内二维特征点对应的第一三维点云集合这一步骤，具体包括：In this embodiment, step S400, that is, the step of acquiring the first three-dimensional point cloud set corresponding to the two-dimensional feature points in the marked area according to the three-dimensional point cloud space, specifically includes:

S401.获取标注区域内的二维特征点；S401. Obtain two-dimensional feature points in the marked area;

S402.根据三维点云空间，将二维特征点进行反投影，得到第一三维点云集合。S402. Back-project the two-dimensional feature points according to the three-dimensional point cloud space to obtain a first three-dimensional point cloud set.

本实施例中，考虑到在当前帧图像上的标注区域为一个二维平面，所以在该标注区域内的点云均为二维特征点。因此需要根据之前构建得到的三维点云空间，将二维特征点进行反投影，得到二维特征点对应的第一三维点云集合。In this embodiment, considering that the marked area on the current frame image is a two-dimensional plane, the point clouds in the marked area are all two-dimensional feature points. Therefore, it is necessary to back-project the two-dimensional feature points according to the previously constructed three-dimensional point cloud space to obtain the first three-dimensional point cloud set corresponding to the two-dimensional feature points.

本实施例中，得到第一三维点云集合之后，需要进一步利用点云聚类算法对该第一三维点云集合进行处理，得到第二三维点云集合，其中，第二三维点云集合为点云数量最多的点云簇。也就是说，得到第一三维点云集合之后，要执行步骤S500，参照图4，步骤S500具体包括：In this embodiment, after the first three-dimensional point cloud set is obtained, it is necessary to further process the first three-dimensional point cloud set by using a point cloud clustering algorithm to obtain a second three-dimensional point cloud set, wherein the second three-dimensional point cloud set is: The point cloud cluster with the largest number of point clouds. That is to say, after obtaining the first three-dimensional point cloud set, step S500 needs to be performed. Referring to FIG. 4 , step S500 specifically includes:

S501.从第一三维点云集合中选取第一点云为目标点云，第一点云为第一三维点云集合中的任意点云；S501. Select the first point cloud as the target point cloud from the first three-dimensional point cloud set, and the first point cloud is any point cloud in the first three-dimensional point cloud set;

S502.判断目标点云是否为边缘噪点；S502. Determine whether the target point cloud is edge noise;

S503.若目标点云为边缘噪点，则重新从第一三维点云集合中选取第二点云为目标点云，第二点云为第一三维点云集合中除第一点云以外的任意点云；S503. If the target point cloud is edge noise, then re-select the second point cloud from the first three-dimensional point cloud set as the target point cloud, and the second point cloud is any arbitrary point cloud other than the first point cloud in the first three-dimensional point cloud set point cloud;

S504.若目标点云不为边缘噪点，则从第一三维点云集合中找出所有从目标点云密度可达的点云，构成点云簇；S504. If the target point cloud is not edge noise, find out all the point clouds that are density-reachable from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;

S505.遍历第一三维点云集合中的所有点云，得到多个点云簇；S505. Traverse all point clouds in the first three-dimensional point cloud set to obtain multiple point cloud clusters;

S506.选择点云数量最多的点云簇为第二三维点云集合。S506. Select the point cloud cluster with the largest number of point clouds as the second three-dimensional point cloud set.

具体地，参照图5，本实施例中，根据业务场景，设置领域半径和领域内对象数量阈值；然后从步骤S400得到的第一三维点云集合内任意选择一个三维点云作为核心点P，找出所有从点云P密度可达的三维点云，形成一个点云簇；如果选取的点云为边缘噪点则选取其他三维点云作为核心点，重复执行上述行为，直至访问完第一三维点云集合内的所有点云。最后保留三维点云数量最多的点云簇作为第二三维点云集合。Specifically, referring to FIG. 5 , in this embodiment, according to the business scenario, the field radius and the threshold of the number of objects in the field are set; then a three-dimensional point cloud is arbitrarily selected from the first three-dimensional point cloud set obtained in step S400 as the core point P, Find all 3D point clouds that are density-reachable from point cloud P to form a point cloud cluster; if the selected point cloud is edge noise, select other 3D point clouds as core points, and repeat the above behavior until the first 3D point cloud is accessed. All point clouds within the point cloud collection. Finally, the point cloud cluster with the largest number of 3D point clouds is reserved as the second 3D point cloud set.

本实施例中，如果点云P邻域点数量不大于预设的阈值，则将该点云P标记为边缘噪点，此时，从第一三维点云集合内重新选择其他的点云作为核心点。通过选取的不同核心点，会创建得到多个不同的点云簇，最后需要从得到的所有点云簇中选取点云数量最多的点云簇作为第二三维点云集合。本实施例中，得到第二三维点云集合之后，还需要剔除剩余的少量外部噪点。In this embodiment, if the number of points in the neighborhood of the point cloud P is not greater than the preset threshold, the point cloud P is marked as edge noise, and at this time, other point clouds are reselected from the first three-dimensional point cloud set as the core point. By selecting different core points, multiple different point cloud clusters will be created, and finally, the point cloud cluster with the largest number of point clouds needs to be selected from all the obtained point cloud clusters as the second three-dimensional point cloud set. In this embodiment, after the second three-dimensional point cloud set is obtained, it is necessary to remove a small amount of remaining external noise points.

对于步骤S600，本实施例中，获取得到第二三维点云集合之后，基于主成分分析法(Principal Component Analysis)求解该第二三维点云集合的最小有向包围盒，即可求得最小有向包围盒的中心点位置、尺寸以及旋转矩阵。For step S600, in this embodiment, after the second three-dimensional point cloud set is obtained, the minimum directed bounding box of the second three-dimensional point cloud set can be obtained based on principal component analysis (Principal Component Analysis). The center point position, size and rotation matrix to the bounding box.

本实施例中，步骤S700，也就是根据相机位姿和相机内参，构建得到相机透视矩阵这一步骤，具体包括：In this embodiment, step S700, that is, the step of constructing and obtaining the camera perspective matrix according to the camera pose and camera internal parameters, specifically includes:

S701.根据第一公式将相机位姿的数据从世界坐标系转到OpenGL坐标系，第一公式为：R_o＝R_owR_wR_ow,t_o＝R_owt_w；其中，

世界坐标系为(R_w,t_w)，OpenGL坐标系为(R_o,t_o)；S701. Transfer the camera pose data from the world coordinate system to the OpenGL coordinate system according to the first formula. The first formula is: R _o =R _ow R _w R _ow , t _o =R _ow t _w ; wherein,

S702.根据OpenGL坐标系，结合相机内参，构建得到相机透视矩阵，相机透视矩阵为：S702. According to the OpenGL coordinate system, combined with the internal parameters of the camera, the camera perspective matrix is constructed and obtained. The camera perspective matrix is:

其中，

in,

本实施例中，根据第一公式将相机位姿数据从世界坐标系(R_w,t_w)转到OpenGL坐标系(R_o,t_o)，结合相机内参(u₀、v₀、f_u、f_v)可构建得到相机透视矩阵P。In this embodiment, the camera pose data is transferred from the world coordinate system (R _w ,t _w ) to the OpenGL coordinate system (R _o ,t _o ) according to the first formula, combined with the camera internal parameters (u ₀ , v ₀ , f _{u )} , f _v ) can be constructed to obtain the camera perspective matrix P.

本实施例中，构建得到相机透视矩阵P之后，进行步骤S800，即根据相机透视矩阵，将最小有向包围盒投影到当前帧图像，形成增强现实场景，并得到最小有向包围盒对应的矩形框。具体地，将最小有向包围盒投影到当前图像上，形成增强现实场景。对比最小有向包围盒8个顶点投影后的像素坐标，保留最大和最小的u、v坐标值，构成矩形框的左上和右下角坐标。In this embodiment, after the camera perspective matrix P is constructed and obtained, step S800 is performed, that is, according to the camera perspective matrix, the minimum directed bounding box is projected to the current frame image to form an augmented reality scene, and the rectangle corresponding to the minimum directed bounding box is obtained. frame. Specifically, the minimum directed bounding box is projected onto the current image to form an augmented reality scene. Compare the projected pixel coordinates of the 8 vertices of the smallest directed bounding box, retain the largest and smallest u and v coordinate values, and form the coordinates of the upper left and lower right corners of the rectangular box.

本实施例中，得到矩形框之后，考虑到矩形框包含较大范围，可能并没有紧裹标注对象(主体)，因此需要对矩形框作进一步处理，需要进一步执行步骤S1000和步骤S1100，即对矩形框进行图像分割以识别主体，然后根据识别的主体对矩形框进行调整，得到标注框。具体地，先适当扩大矩形框，对矩形框内图像进行图像分割以识别主体，调整矩形框大小以紧凑包裹主体，将调整后的矩形框作为标注框。参照图6，图6为矩形框和标注框的对比示意图。In this embodiment, after the rectangular frame is obtained, considering that the rectangular frame contains a large range, the labeling object (the main body) may not be tightly wrapped, so the rectangular frame needs to be further processed, and steps S1000 and S1100 need to be further performed, that is, to The rectangular frame is image-segmented to identify the subject, and then the rectangular frame is adjusted according to the identified subject to obtain an annotation frame. Specifically, the rectangular frame is appropriately enlarged first, the image in the rectangular frame is segmented to identify the subject, the size of the rectangular frame is adjusted to compactly wrap the subject, and the adjusted rectangular frame is used as a labeling frame. Referring to FIG. 6 , FIG. 6 is a schematic diagram of a comparison between a rectangular frame and a marked frame.

本实施例中，图像分割是采用区域生长算法进行的，具体地，取矩形框中心点为种子点，同时标记为已访问点，在种子点处进行4邻域扩展，如果邻域像素与种子像素灰度值差的绝对值小于阈值，则合并为同一区域，该邻域点加入集合保存。从集合中取出未被访问过的点作为种子点重复上述行为，直至集合为空。以新区域构成的矩形框作为标注框。In this embodiment, the image segmentation is carried out by using the region growing algorithm. Specifically, the center point of the rectangular frame is taken as the seed point, and is marked as the visited point at the same time, and 4 neighborhood expansion is performed at the seed point. If the absolute value of the pixel gray value difference is less than the threshold value, it will be merged into the same area, and the neighborhood point will be added to the collection and saved. Take unvisited points from the set as seed points and repeat the above behavior until the set is empty. The rectangle frame formed by the new area is used as the callout frame.

本实施例中，得到标注框之后，进一步比较矩形框和标注框的面积；若标注框的面积与矩形框的面积之比小于第一阈值，则以标注框为选区，并执行在选区框选标注区域，并在标注区域标注相应语义信息的步骤，也就是以标注框为选区重复执行步骤S300，直到最终得到的标注框的面积与矩形框的面积之比不小于第一阈值为止。若标注框的面积与矩形框的面积之比不小于第一阈值，说明该标注框为当前帧图像所需的标注数据。In this embodiment, after the callout frame is obtained, the areas of the rectangle frame and the callout frame are further compared; if the ratio of the area of the callout frame to the area of the rectangle frame is smaller than the first threshold, the callout frame is used as the selection area, and the selection in the selection area is performed. The step of labeling the area and labeling the corresponding semantic information in the labeling area, that is, repeating step S300 with the labeling frame as the selection area, until the ratio of the area of the labeling frame to the area of the rectangular frame obtained is not less than the first threshold. If the ratio of the area of the labeling frame to the area of the rectangular frame is not less than the first threshold, it means that the labeling frame is the labeling data required for the current frame image.

完成对当前帧图像的标注之后，由于在不同的视角下，或者说对于不同的视频帧图像，其对应的视角不同，因此会新增不同的三维点云数据，因此，对其他不同的视频帧图像进行处理时，首先通过步骤S1300-S1600完成对标注数据的修正。具体地，选取下一帧视频图像，先执行步骤S1400，即基于视觉SLAM算法获取该帧视频图像的第一新增点云数据；然后执行步骤S1500，即对第一新增点云数据进行过滤，得到第二新增点云数据，第二新增点云数据为最小有向包围盒中新增的点云数据；具体地，该步骤具体包括：After completing the annotation of the current frame image, different 3D point cloud data will be added due to different perspectives, or for different video frame images, the corresponding perspectives are different. Therefore, for other different video frames When the image is processed, the correction of the annotation data is first completed through steps S1300-S1600. Specifically, select the next frame of video image, first perform step S1400, that is, obtain the first newly added point cloud data of this frame of video image based on the visual SLAM algorithm; then perform step S1500, that is, filter the first newly added point cloud data , obtain the second newly added point cloud data, and the second newly added point cloud data is the newly added point cloud data in the minimum directed bounding box; specifically, this step specifically includes:

S1501.基于随机抽样一致性算法对第一新增点云数据进行过滤，除去地面点云数据；S1501. Filter the first newly added point cloud data based on the random sampling consistency algorithm to remove the ground point cloud data;

S1502.根据最小有向包围盒的三维空间范围进行条件滤波，得到第二新增点云数据。S1502. Perform conditional filtering according to the three-dimensional space range of the minimum directed bounding box to obtain second newly added point cloud data.

得到第二新增点云数据之后，再执行步骤S1600，即判断第二新增点云数据是否大于第二阈值，若第二新增点云数据的数量大于第二阈值，则以该帧视频图像为选区，并执行在选区框选标注区域，并在标注区域标注相应语义信息的步骤。也就是说，如果第二新增点云数据的数量大于第二阈值，说明该帧视频图像与之前处理的当前帧图像差距较大，此时，需要以该帧视频图像为选区，重新执行步骤S300，以得到该帧视频图像对应的新的标准框。After the second newly added point cloud data is obtained, step S1600 is performed again, that is, it is determined whether the second newly added point cloud data is greater than the second threshold, and if the quantity of the second newly added point cloud data is greater than the second threshold, the frame of video The image is a selection area, and the steps of selecting a marked area in the selection area and marking the corresponding semantic information in the marked area are performed. That is to say, if the number of the second newly added point cloud data is greater than the second threshold, it means that the video image of this frame is far from the current frame image processed before. S300, to obtain a new standard frame corresponding to the frame of video image.

本实施例中，若第二新增点云数据的数量不大于第二阈值，则将目标帧图像作为当前帧图像，并执行根据相机透视矩阵，将最小有向包围盒投影到当前帧图像，形成增强现实场景，并得到最小有向包围盒对应的矩形框的步骤。也就是说，如果第二新增点云数据的数量不大于第二阈值，说明该帧视频图像与之前处理的当前帧图像相差不大，此时，只需要根据该帧视频图像的视角重新计算最小有向包围盒的投影情况，即重新执行步骤S800。本实施例中，基于视觉SLAM的图像自动标注方法的具体流程可参照图2。In this embodiment, if the number of the second newly added point cloud data is not greater than the second threshold, the target frame image is taken as the current frame image, and the minimum directed bounding box is projected to the current frame image according to the camera perspective matrix, The steps of forming an augmented reality scene and obtaining the rectangular box corresponding to the minimum directed bounding box. That is to say, if the amount of the second newly added point cloud data is not greater than the second threshold, it means that the video image of this frame is not much different from the current frame image processed before. In the case of the projection of the minimum directed bounding box, step S800 is re-executed. In this embodiment, the specific process of the automatic image labeling method based on visual SLAM may refer to FIG. 2 .

本发明实施例一种基于视觉SLAM的图像自动标注方法具有以下技术效果：An image automatic labeling method based on visual SLAM in the embodiment of the present invention has the following technical effects:

本发明实施例提出一种基于视觉SLAM的图像自动标注方法，包括：获取视频图像，基于利用ORB-SLAM算法构建得到三维点云空间和相机位姿，以当前帧图像为选区，在选区框选标注区域，并在标注区域标注相应语义信息；根据三维点云空间，获取标注区域内二维特征点对应的第一三维点云集合；利用点云聚类算法对第一三维点云集合进行处理，得到第二三维点云集合；基于主成分分析法计算得到第二三维点云集合的最小有向包围盒；根据相机位姿和相机内参，构建得到相机透视矩阵；根据相机透视矩阵，将最小有向包围盒投影到当前帧图像，形成增强现实场景，并得到最小有向包围盒对应的矩形框；对矩形框进行图像分割以识别主体；根据识别的主体对矩形框进行调整，得到标注框；比较矩形框和标注框的面积；若标注框的面积与矩形框的面积之比小于第一阈值，则以标注框为选区，并执行在选区框选标注区域，并在标注区域标注相应语义信息的步骤；从而可生成当前帧图像的标注结果。最后，结合不同视频帧的新增点云数据，自动调整标注框，从而自动生成每一帧图像的标注结果。该方法无需先验数据集，能在采集图像的同时实现在线式自动标注，支持多目标和多类别标注，该方法仅需用户单次标注，能极大减少人工图像标注工作，提高图像标注数据的采集效率，且操作简单易用。An embodiment of the present invention proposes an automatic image labeling method based on visual SLAM, which includes: acquiring a video image, obtaining a three-dimensional point cloud space and a camera pose based on constructing an ORB-SLAM algorithm, taking the current frame image as a selection area, and selecting a frame in the selection area. Mark the area, and mark the corresponding semantic information in the marked area; obtain the first three-dimensional point cloud set corresponding to the two-dimensional feature points in the marked area according to the three-dimensional point cloud space; use the point cloud clustering algorithm to process the first three-dimensional point cloud set , obtain the second three-dimensional point cloud set; calculate the minimum directed bounding box of the second three-dimensional point cloud set based on principal component analysis; construct the camera perspective matrix according to the camera pose and camera internal parameters; The directed bounding box is projected to the current frame image to form an augmented reality scene, and the rectangular frame corresponding to the smallest directed bounding box is obtained; the image segmentation of the rectangular frame is performed to identify the subject; the rectangular frame is adjusted according to the identified subject to obtain an annotation frame ; Compare the area of the rectangle and the callout; if the ratio of the area of the callout to the area of the rectangle is less than the first threshold, take the callout as the selection area, select the callout area in the selection area, and mark the corresponding semantics in the callout area information step; thus, the annotation result of the current frame image can be generated. Finally, combined with the newly added point cloud data of different video frames, the annotation frame is automatically adjusted to automatically generate the annotation results of each frame of image. This method does not require a priori data set, can realize online automatic annotation while collecting images, and supports multi-object and multi-category annotation. This method only needs a single annotation by the user, which can greatly reduce the manual image annotation work and improve the image annotation data. The collection efficiency is high, and the operation is simple and easy to use.

参照图7，本发明实施例还提供了一种基于视觉SLAM的图像自动标注装置200，具体包括：Referring to FIG. 7 , an embodiment of the present invention further provides a visual SLAM-based image automatic labeling apparatus 200, which specifically includes:

至少一个处理器210；at least one processor 210;

至少一个存储器220，用于存储至少一个程序；at least one memory 220 for storing at least one program;

当至少一个程序被至少一个处理器210执行，使得至少一个处理器210实现如图1所示的方法。When at least one program is executed by at least one processor 210 , at least one processor 210 implements the method shown in FIG. 1 .

其中，存储器220作为一种非暂态计算机可读存储介质，可用于存储非暂态软件程序以及非暂态性计算机可执行程序。存储器220可以包括高速随机存取存储器，还可以包括非暂态存储器，例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施方式中，存储器220可选包括相对于处理器210远程设置的远程存储器，这些远程存储器可以通过网络连接至处理器210。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 220, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Memory 220 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some implementations, the memory 220 may optionally include remote memory located remotely from the processor 210, which may be connected to the processor 210 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

可以理解到，图7中示出的装置结构并不构成对装置200的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。It can be understood that the device structure shown in FIG. 7 does not constitute a limitation to the device 200, and may include more or less components than those shown, or combine some components, or arrange different components.

如图7所示的装置200中，处理器210可以调取存储器220中储存的程序，并执行但不限于图1所示实施例的步骤。In the apparatus 200 shown in FIG. 7 , the processor 210 can call the program stored in the memory 220 and execute but not limited to the steps of the embodiment shown in FIG. 1 .

以上所描述的装置200实施例仅仅是示意性的，其中作为分离部件说明的单元可以是或者也可以不是物理上分开的，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现实施例的目的。The above-described embodiment of the apparatus 200 is only illustrative, and the units described as separate components may or may not be physically separated, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments.

本发明实施例还提供了一种计算机可读存储介质，该计算机可读存储介质存储有处理器可执行的程序，处理器可执行的程序在被处理器执行时用于实现如图1所示的方法。An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a program executable by a processor, and the program executable by the processor is used to implement the program as shown in FIG. 1 when executed by the processor Methods.

本申请实施例还公开了一种计算机程序产品或计算机程序，该计算机程序产品或计算机程序包括计算机指令，该计算机指令存储在计算机可读存介质中。计算机设备的处理器可以从计算机可读存储介质读取该计算机指令，处理器执行该计算机指令，使得该计算机设备执行图1所示的方法。Embodiments of the present application further disclose a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. A processor of the computer device can read the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method shown in FIG. 1 .

可以理解的是，上文中所公开方法中的全部或某些步骤、系统可以被实施为软件、固件、硬件及其适当的组合。某些物理组件或所有物理组件可以被实施为由处理器，如中央处理器、数字信号处理器或微处理器执行的软件，或者被实施为硬件，或者被实施为集成电路，如专用集成电路。这样的软件可以分布在计算机可读介质上，计算机可读介质可以包括计算机存储介质(或非暂时性介质)和通信介质(或暂时性介质)。如本领域普通技术人员公知的，术语计算机存储介质包括在用于存储信息(诸如计算机可读指令、数据结构、程序模块或其他数据)的任何方法或技术中实施的易失性和非易失性、可移除和不可移除介质。计算机存储介质包括但不限于RAM、ROM、EEPROM、闪存或其他存储器技术、CD-ROM、数字多功能盘(DVD)或其他光盘存储、磁盒、磁带、磁盘存储或其他磁存储装置、或者可以用于存储期望的信息并且可以被计算机访问的任何其他的介质。此外，本领域普通技术人员公知的是，通信介质通常包含计算机可读指令、数据结构、程序模块或者诸如载波或其他传输机制之类的调制数据信号中的其他数据，并且可包括任何信息递送介质。It will be understood that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit . Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As known to those of ordinary skill in the art, the term computer storage media includes both volatile and nonvolatile implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules or other data flexible, removable and non-removable media. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or may Any other medium used to store desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery media, as is well known to those of ordinary skill in the art .

上面结合附图对本发明实施例作了详细说明，但是本发明不限于上述实施例，在技术领域普通技术人员所具备的知识范围内，还可以在不脱离本发明宗旨的前提下作出各种变化。The embodiments of the present invention have been described in detail above in conjunction with the accompanying drawings, but the present invention is not limited to the above-mentioned embodiments, and within the scope of knowledge possessed by those of ordinary skill in the technical field, various changes can also be made without departing from the purpose of the present invention. .

Claims

1. an image automatic labeling method based on visual SLAM, is characterized in that, comprises:

Obtain at least two consecutive images from a real-time camera or video, and use the ORB-SLAM algorithm to construct a 3D point cloud space;

Obtain a new frame image as the current frame image, and obtain the camera pose by solving the PnP problem;

Taking the current frame image as a selection area, frame a marked area in the selection area, and mark corresponding semantic information in the marked area;

obtaining, according to the three-dimensional point cloud space, a first three-dimensional point cloud set corresponding to the two-dimensional feature points in the marked area;

Using a point cloud clustering algorithm to process the first three-dimensional point cloud set to obtain a second three-dimensional point cloud set, the second three-dimensional point cloud set is the point cloud cluster with the largest number of point clouds;

Calculate the minimum directed bounding box of the second three-dimensional point cloud set based on principal component analysis;

According to the camera pose and camera internal parameters, construct a camera perspective matrix;

According to the camera perspective matrix, project the minimum directed bounding box to the current frame image to form an augmented reality scene, and obtain a rectangular frame corresponding to the minimum directed bounding box;

image segmentation of the rectangular frame to identify the subject;

Adjust the rectangular frame according to the identified subject to obtain a labeling frame;

comparing the area of the rectangular frame and the callout frame;

If the ratio of the area of the callout frame to the area of the rectangular frame is smaller than the first threshold, the callout frame is used as a selection area, and the selection of the callout area in the selection area is performed, and corresponding semantics are marked in the callout area information steps;

Obtain a target frame image from the video, where the target frame image is any subsequent frame image except the current frame image;

Obtain the first newly added point cloud data of the target frame image based on the visual SLAM algorithm;

Filtering the first newly added point cloud data to obtain second newly added point cloud data, where the second newly added point cloud data is the newly added point cloud data in the minimum directed bounding box;

If the quantity of the second newly added point cloud data is greater than the second threshold, take the target frame image as a selection area, and execute the step of selecting a marked area in the selection area, and marking corresponding semantic information in the marked area .

2. a kind of image automatic labeling method based on visual SLAM according to claim 1, is characterized in that, described acquisition real-time camera or at least two frames of images in the video, utilize ORB-SLAM algorithm to construct and obtain three-dimensional point cloud space this. steps, including:

Calibrate the camera to obtain the camera internal parameter matrix and distortion parameters;

Obtain at least two consecutive images from a live camera or video;

Extracting ORB feature points from the continuous images of the two frames in turn and calculating the corresponding binary descriptors;

According to the ORB feature points, feature point matching is performed on the two consecutive images to obtain image feature points;

Using the epipolar geometric constraint to estimate the camera motion between the two consecutive images, the camera pose is calculated;

According to the camera internal parameter matrix, distortion parameters and camera pose, the three-dimensional coordinates of the image feature points are calculated by triangulation, and a three-dimensional point cloud space is constructed.

3. a kind of image automatic labeling method based on visual SLAM according to claim 1, it is characterized in that, described taking the current frame image as a selection area, in the selection area frame selection label area, and label corresponding semantic information in the described label area This step includes:

Taking the current frame image as a selection area, select the first area in the selection area;

converting the first area into a pixel area according to the window size of the current frame image and the pixel aspect ratio of the current frame image;

The pixel area is used as an annotation area, and corresponding semantic information is annotated in the annotation area.

4. A kind of image automatic labeling method based on visual SLAM according to claim 1, is characterized in that, according to described 3D point cloud space, obtain the first 3D point corresponding to 2D feature point in described labeling area This step of cloud collection includes:

acquiring two-dimensional feature points in the marked area;

According to the three-dimensional point cloud space, the two-dimensional feature points are back-projected to obtain a first three-dimensional point cloud set.

5. a kind of image automatic labeling method based on visual SLAM according to claim 1, is characterized in that, described utilizes point cloud clustering algorithm to process described first three-dimensional point cloud collection, obtains the second three-dimensional point cloud Assemble this step, including:

Selecting a first point cloud from the first three-dimensional point cloud set as a target point cloud, and the first point cloud being any point cloud in the first three-dimensional point cloud set;

Determine whether the target point cloud is edge noise;

If the target point cloud is edge noise, then re-select a second point cloud from the first 3D point cloud set as the target point cloud, and the second point cloud is the first 3D point cloud set except Any point cloud other than the first point cloud mentioned above;

If the target point cloud is not an edge noise, find out all the point clouds that are density-reachable from the target point cloud from the first three-dimensional point cloud set to form a point cloud cluster;

Traverse all point clouds in the first three-dimensional point cloud set to obtain a plurality of point cloud clusters;

The point cloud cluster with the largest number of point clouds is selected as the second three-dimensional point cloud set.

6. a kind of image automatic labeling method based on visual SLAM according to claim 1, is characterized in that, described according to described camera pose and camera internal parameter, construct and obtain the step of camera perspective matrix, comprising:

The data of the camera pose is transferred from the world coordinate system to the OpenGL coordinate system according to the first formula. The first formula is: R _o =R _ow R _w R _ow , to =R _ow _t _w ; wherein,

in,

In the formula, P represents the perspective matrix of the camera, and the internal parameters of the camera are (u ₀ , v ₀ , f _u , f _v ); near represents the distance from the camera optical center to the near section, far represents the distance from the camera optical center to the far section, and right represents The distance from the near section center to the right section, left indicates the distance from the near section center to the left section, top indicates the distance from the near section center to the upper section, bottom indicates the distance from the near section center to the lower section, w indicates the image pixel width, h indicates Image pixel height.

7. a kind of image automatic labeling method based on visual SLAM according to claim 1, it is characterized in that, the described first newly added point cloud data is filtered, the step that obtains the second newly added point cloud data ,include:

Filtering the first newly added point cloud data based on a random sampling consistency algorithm to remove ground point cloud data;

Conditional filtering is performed according to the three-dimensional space range of the minimum directional bounding box to obtain second newly added point cloud data.

8. A kind of image automatic labeling method based on visual SLAM according to claim 1, is characterized in that, if the quantity of described second newly added point cloud data is not greater than the second threshold value, then described target frame image is used as. The current frame image, and executing the projection of the minimum directional bounding box to the current frame image according to the camera perspective matrix to form an augmented reality scene, and obtaining the minimum directional bounding box corresponding to the rectangular box step.

9. An image automatic labeling device based on visual SLAM, characterized in that, comprising:

at least one processor;

at least one memory for storing at least one program;

When the at least one program is executed by the at least one processor, the at least one processor is caused to implement the method of any one of claims 1-8.

10. A computer-readable storage medium, characterized in that a program executable by a processor is stored thereon, and the program executable by the processor, when executed by the processor, is used to implement the program as claimed in any one of claims 1-8. method described.