
CN115082885A - Point cloud target detection method, device, equipment and storage medium - Google Patents

Point cloud target detection method, device, equipment and storage medium

Info

Publication number
CN115082885A
CN115082885A (application CN202210743615.8A)
Authority
CN
China
Prior art keywords
point cloud
target
dimensional
scene
voxel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210743615.8A
Other languages
Chinese (zh)
Inventor
罗磊
宋南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Seed Space Technology Co ltd
Original Assignee
Shenzhen Seed Space Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Seed Space Technology Co ltd filed Critical Shenzhen Seed Space Technology Co ltd
Priority to CN202210743615.8A
Publication of CN115082885A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, at any of those levels, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of autonomous driving and provides a method, device, equipment, and storage medium for detecting point cloud targets. The method acquires local point cloud features and global voxel features, and fuses them based on a preset dual-encoder and fusion-decoder algorithm to generate a three-dimensional target box, thereby completing the detection of point cloud targets in the scene. After the local point cloud features and the global voxel features are obtained through a point cloud network and a three-dimensional sparse convolutional network respectively, the preset dual-encoder and fusion-decoder algorithm generates a three-dimensional target box that fuses the multi-modal features, completing the detection of point cloud targets in the scene to be detected. Determining the three-dimensional target box through multi-modal fusion improves the accuracy and efficiency of point cloud target detection, and solves the technical problems of low accuracy and low efficiency in existing point cloud target detection methods.

Description

Point cloud target detection method, device, equipment and storage medium

Technical Field

The present invention relates to the technical field of autonomous driving, and in particular to a method, device, equipment, and storage medium for detecting point cloud targets.

Background

At present, with the rapid development of 3D sensor technologies such as depth cameras, structured-light cameras, and LiDAR, 3D spatial data such as point clouds are becoming increasingly easy to acquire and are being applied in many real-world scenarios, such as autonomous vehicles, robotics, and virtual reality. 3D object detection is a challenging problem in computer vision and benefits many downstream vision tasks. With the rapid development of Convolutional Neural Networks (CNNs), 2D object detection has achieved great success. Compared with 2D images, however, 3D point clouds are unordered, sparse, and irregular, which makes it difficult to apply 2D detection methods directly to 3D scenes.

Existing 3D object detection methods generally fall into two categories: voxel-based point cloud target detection and detection based on raw point clouds. Voxel-based detection first converts the sparse 3D point cloud into a compact spatial representation such as a bird's-eye view or a voxel grid and extracts the corresponding semantic features for each representation unit; it then efficiently transfers 2D convolutional neural networks into 3D space, using 3D convolution over these compact spatial representations to extract voxel-level scene semantics and perform 3D object detection. These voxel-based methods are well suited to generating fairly accurate 3D position proposals; however, during voxelization they inevitably lose the geometric structure information needed for precise localization. Detection algorithms based on raw point clouds directly take the original point cloud and its features as input, extract geometry-rich point semantic features through a point cloud network built on multilayer perceptrons (MLP), and rely on farthest point sampling (FPS) and set abstraction (SA) modules to ensure uniform spatial sampling and effective local perception. These methods compensate for the information lost through voxelization and describe 3D geometric structure more effectively, but they typically consume more computational resources. Therefore, how to remedy the low accuracy and low efficiency of existing point cloud target detection methods has become an urgent technical problem to be solved.

Summary of the Invention

The main purpose of the present invention is to provide a method, device, equipment, and computer-readable storage medium for detecting point cloud targets, aiming to solve the technical problems of low accuracy and low efficiency in existing point cloud target detection methods.

To achieve the above object, the present invention provides a method for detecting point cloud targets. The method processes the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network to obtain the global voxel features of the scene to be determined and generate target proposal boxes; processes the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected; and, based on a preset dual-encoder and fusion-decoder algorithm, fuses the local point cloud features and the global voxel features, adjusts the target proposal boxes to generate three-dimensional target boxes, and completes the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes.

In addition, to achieve the above object, the present invention also provides a device for detecting point cloud targets, the device comprising: a voxel feature extraction module, configured to process the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network, obtain the global voxel features of the scene to be determined, and generate target proposal boxes; a point cloud feature extraction module, configured to process the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected; and a feature fusion module, configured to fuse the local point cloud features and the global voxel features based on a preset dual-encoder and fusion-decoder algorithm, adjust the target proposal boxes to generate three-dimensional target boxes, and complete the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes.

In addition, to achieve the above object, the present invention also provides a device for detecting point cloud targets, the device comprising a processor, a memory, and a point cloud target detection program stored in the memory and executable by the processor, wherein, when the point cloud target detection program is executed by the processor, the steps of the above method for detecting point cloud targets are implemented.

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium on which a point cloud target detection program is stored, wherein, when the point cloud target detection program is executed by a processor, the steps of the above method for detecting point cloud targets are implemented.

The present invention provides a method for detecting point cloud targets. The method processes the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network to obtain the global voxel features of the scene to be determined and generate target proposal boxes; processes the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected; and, based on a preset dual-encoder and fusion-decoder algorithm, fuses the local point cloud features and the global voxel features, adjusts the target proposal boxes to generate three-dimensional target boxes, and completes the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes. In this way, after obtaining the local point cloud features and the global voxel features through the point cloud network and the three-dimensional sparse convolutional network respectively, the present invention generates, through the preset dual-encoder and fusion-decoder algorithm, a three-dimensional target box that fuses the multi-modal features, and thereby completes the detection of point cloud targets in the scene to be detected. Determining the three-dimensional target box through multi-modal fusion thus improves the accuracy and efficiency of point cloud target detection and solves the technical problems of low accuracy and low efficiency in current point cloud target detection methods.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the hardware structure of a point cloud target detection device involved in an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a first embodiment of a method for detecting point cloud targets according to the present invention;

FIG. 3 is a schematic diagram of the functional modules of a first embodiment of a device for detecting point cloud targets according to the present invention.

The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

The point cloud target detection method involved in the embodiments of the present invention is mainly applied to a point cloud target detection device, which may be a PC, a portable computer, a mobile terminal, or another device with display and processing functions.

Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of the point cloud target detection device involved in the embodiments of the present invention. In an embodiment of the present invention, the point cloud target detection device may include a processor 1001 (e.g., a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 realizes the connection and communication between these components; the user interface 1003 may include a display and an input unit such as a keyboard; the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface); the memory 1005 may be high-speed RAM or stable non-volatile memory such as disk storage, and may optionally be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the hardware structure shown in FIG. 1 does not limit the point cloud target detection device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

Continuing to refer to FIG. 1, the memory 1005 in FIG. 1, as a computer-readable storage medium, may include an operating system, a network communication module, and a point cloud target detection program.

In FIG. 1, the network communication module is mainly used to connect to a server and exchange data with it; the processor 1001 can call the point cloud target detection program stored in the memory 1005 and execute the point cloud target detection method provided by the embodiments of the present invention.

An embodiment of the present invention provides a method for detecting point cloud targets.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the point cloud target detection method of the present invention.

In this embodiment, the method for detecting point cloud targets includes the following steps:

Step S10: processing the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network, obtaining the global voxel features of the scene to be determined, and generating a target proposal box;

In this embodiment, the three-dimensional voxel features of the non-empty voxels of the target to be determined are extracted through the three-dimensional sparse convolutional neural network; the three-dimensional voxel features are then compressed into a two-dimensional space by a second encoder to obtain two-dimensional voxel features, which serve as the global voxel features, wherein the second encoder is a multilayer encoder of the three-dimensional sparse convolutional network.
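
As an illustration of the height-compression step just described, the following is a minimal sketch; the tensor layout, function name, and the simple reshape are assumptions for exposition, not the patent's exact implementation:

```python
# Hedged sketch: collapse the height axis of a dense voxel feature volume
# into the channel axis to obtain a 2D bird's-eye-view (BEV) feature map.
import torch

def voxel_features_to_bev(voxel_feats: torch.Tensor) -> torch.Tensor:
    """voxel_feats: (B, C, H, L, W) dense 3D feature volume.
    Returns a (B, C*H, L, W) BEV map by stacking height slices as channels."""
    b, c, h, l, w = voxel_feats.shape
    return voxel_feats.reshape(b, c * h, l, w)
```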

Further, based on a two-dimensional proposal network, coarse proposal boxes for the target to be determined are generated according to the classification and regression of that target.

Step S20: processing the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected;

In this embodiment, based on a preset sampling rule, at least one point is selected for sampling within the neighborhood of the target to be determined; the features of the sampled points in the neighborhood are extracted through a multilayer perceptron; all the sampled point features are then sampled layer by layer through a first encoder, and the sampling results are used as the local point cloud features of the target to be determined, wherein the first encoder is a multilayer encoder of the point cloud network.
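
The neighborhood feature extraction just described can be sketched as a single PointNet++-style set-abstraction step; the layer sizes and names below are assumptions, since the patent does not fix them:

```python
# Hedged sketch: lift each neighbor's feature with a shared MLP, then
# max-pool over the k sampled neighbors of each center point.
import torch
import torch.nn as nn

class LocalFeatureExtractor(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, neighbor_feats: torch.Tensor) -> torch.Tensor:
        """neighbor_feats: (num_centers, k, in_dim) features of the k
        sampled neighbors of each center; returns (num_centers, out_dim)."""
        per_point = self.mlp(neighbor_feats)   # per-neighbor feature lifting
        return per_point.max(dim=1).values     # max-pool over the neighborhood
```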

Step S30: based on a preset dual-encoder and fusion-decoder algorithm, fusing the local point cloud features and the global voxel features, adjusting the target proposal box to generate a three-dimensional target box, and completing the detection of point cloud targets in the scene to be detected based on the three-dimensional target box;

In this embodiment, based on the original point cloud, the local point cloud features and the global voxel features are input into an attention-based multi-modal feature fusion network, which fuses the two to generate the global features of the target to be determined; through a preset fine-tuning network, the global features are fused with the coarse proposal boxes, and the final target boxes are determined according to the classification and regression of the target to be determined.

In a specific embodiment, the multi-modal feature fusion network gradually propagates the features of the sparser voxel sets to the denser ones and finally maps the global voxel features onto the original point cloud, so as to fuse the local point cloud features and the global voxel features and generate the global features.

A region proposal network, also called a region suggestion network, is a typical fully convolutional network; therefore, during training, the back-propagation and gradient descent algorithms apply to the region proposal network as well.

This embodiment provides a method for detecting point cloud targets. The method processes the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network to obtain the global voxel features of the scene to be determined and generate target proposal boxes; processes the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected; and, based on a preset dual-encoder and fusion-decoder algorithm, fuses the local point cloud features and the global voxel features, adjusts the target proposal boxes to generate three-dimensional target boxes, and completes the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes. In this way, after obtaining the local point cloud features and the global voxel features through the point cloud network and the three-dimensional sparse convolutional network respectively, the preset dual-encoder and fusion-decoder algorithm generates a three-dimensional target box that fuses the multi-modal features, thereby completing the detection of point cloud targets in the scene to be detected. Determining the three-dimensional target box through multi-modal fusion thus improves the accuracy and efficiency of point cloud target detection and solves the technical problems of low accuracy and low efficiency in current point cloud target detection methods.

Based on the embodiment shown in FIG. 2 above, in this embodiment, step S10 includes:

dividing the point cloud space to be detected into a regular voxel representation, in which only the voxel grids containing points are retained, while the remaining voxels are regarded as empty voxels;

for each non-empty voxel, taking the average feature of the points it contains as its initial input feature, and then using the three-dimensional sparse convolutional neural network to perceive the voxel grid and output the coarse proposal boxes for the image to be determined.

A voxel (volume pixel) is the smallest unit of digital data in a three-dimensional space partition; a volume composed of voxels can be rendered by volume rendering or by extracting polygonal isosurfaces at a given threshold. Voxels are used in fields such as 3D imaging, scientific data, and medical imaging, and are conceptually analogous to pixels, the smallest unit of two-dimensional space, which are used in the image data of 2D computer images. Some true 3D displays describe their resolution in voxels, for example a display capable of showing 512×512×512 voxels.
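
To make the voxelization step concrete, here is a minimal sketch, assuming a uniform grid size and a dictionary of non-empty voxels; the function name and parameters are illustrative, not from the patent:

```python
# Hedged sketch: keep only non-empty voxels, using the mean of the
# contained points' features as each voxel's initial input feature.
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.1) -> dict:
    """points: (N, 3 + C) array of xyz coordinates plus C point features.
    Returns a dict mapping integer voxel indices to mean point features."""
    indices = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    voxels: dict = {}
    for idx, feat in zip(map(tuple, indices), points):
        voxels.setdefault(idx, []).append(feat)
    # average the features of all points that fall in the same voxel
    return {idx: np.mean(feats, axis=0) for idx, feats in voxels.items()}
```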

In this embodiment, a multilayer encoder based on a three-dimensional sparse convolutional network is constructed to extract global position information and produce target box proposals. The point cloud space is first divided into a regular voxel representation, in which only the voxel grids containing points are retained, while the remaining voxels are regarded as empty. In particular, for each non-empty voxel, the average feature of the points it contains is used as its initial input feature. Sparse 3D convolution is then used to perceive the voxel space efficiently and produce target proposals. This involves the following two sub-steps:

Three-dimensional sparse convolution: to reduce GPU memory usage and improve running speed, 3D sparse convolution replaces ordinary 3D convolution, i.e., features are extracted only for non-empty voxels while empty voxels are ignored. The Spconv library is used to stack four sparse convolution modules, each containing two submanifold sparse convolution layers and one sparse convolution layer for progressive downsampling. Assuming the input voxel tensor is represented as L×W×H×C, the network output of the 3D sparse convolution can be expressed as

[formula image BDA0003716029290000061]

Two-dimensional region proposal network (RPN) classification and target proposal regression: the three-dimensional output features are compressed along the height dimension into a two-dimensional space, giving a two-dimensional image representation

[formula image BDA0003716029290000062]

which is then fed into a two-dimensional RPN for bird's-eye-view target detection. This network is built from four layers of 2D convolutional neural networks in a U-Net structure, with the layer-by-layer outputs expressed as

[formula images BDA0003716029290000063, BDA0003716029290000064]

Each layer uses 3×3 convolutions to reduce the number of learned parameters, yielding further refined feature maps for subsequent prediction. Finally, the object proposals detected by the RPN are obtained through anchor-based classification and regression: a corresponding anchor box is preset for every pixel and fine-tuned to achieve precise localization, so a feature map of size [formula image BDA0003716029290000065] yields a total of [formula image BDA0003716029290000066] predefined boxes. Notably, in autonomous-driving scenes the three-dimensional box of an object can be expressed as {x, y, z, l, w, h, r}, the parameters denoting in turn the box center position, the object size, and the rotation angle about the x-y plane; no other angular degrees of freedom need to be considered, which greatly reduces the difficulty of prediction.
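
The anchor-based regression over {x, y, z, l, w, h, r} can be sketched as follows; the offset parameterization (additive center shift, log-scaled size, additive rotation) is a common convention assumed here, not spelled out by the patent:

```python
# Hedged sketch: apply predicted residuals to preset anchor boxes.
import torch

def decode_anchors(reg: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """reg, anchors: (num_anchors, 7) tensors holding {x, y, z, l, w, h, r}.
    Returns refined boxes with the residuals applied to the anchors."""
    boxes = anchors.clone()
    boxes[:, :3] += reg[:, :3]           # shift the box center
    boxes[:, 3:6] *= reg[:, 3:6].exp()   # scale length/width/height
    boxes[:, 6] += reg[:, 6]             # rotate about the x-y plane
    return boxes
```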

Based on the embodiment shown in FIG. 2 above, in this embodiment, step S20 includes:

In this embodiment, a point cloud network and a three-dimensional convolutional neural network process the original point cloud and the three-dimensional voxel grid respectively, extracting the scene's local point cloud features and the global, coarse spatial position information of objects. First, given the original point cloud of the scene, a fixed number N of points is uniformly sampled as the input to the initial point cloud branch, and the cloud is voxelized into a sparse voxel grid as the input to the convolution branch. The local point cloud feature extractor and the three-dimensional sparse convolutional neural network then process the scene separately to extract local point cloud features and global coarse localization information, which are fed into the multi-modal feature fusion network that adaptively aggregates the strengths of both to support the subsequent detection task. This involves the following sub-steps:

A multilayer encoder of the point cloud network is constructed to extract spatial geometric structure information. The farthest point sampling strategy uniformly samples a fixed number N of points from the original point cloud data,

[formula image BDA0003716029290000067]

The input of each set abstraction (SA) layer is the output of the previous layer, and the output of the current layer is the point set for the subsequent FPS sampling. The SA module extracts the local point cloud features of each layer's points and expands them layer by layer to obtain global point cloud semantic information. Let p_i be a point in the point cloud, and let S_i be the set of points inside the spherical neighborhood of radius r centered at p_i. The output feature of p_i is computed as follows:

Randomly sample k points from the set S_i to form the sampled set S_i;

Perform feature fusion extraction on the points sampled in this neighborhood through a multilayer perceptron, and compute the output feature of the point p_i by the following formula:

[formula image BDA0003716029290000071]

Specifically, a multilayer perceptron first performs further feature extraction on the features of the k points in the point's local region to obtain the corresponding n-dimensional mapped features; max pooling over the feature dimension then yields the maximal information representation of the neighborhood of radius r centered at p_i; and a multilayer perceptron further abstracts the n-dimensional high-dimensional features of this representation to obtain the output feature of point p_i.

FPS sampling is repeated on the point cloud layer by layer down to the corresponding number of points, and neighborhood features are aggregated at the sampled points, realizing local point cloud feature extraction on the original point cloud of the scene to be detected.

FPS sampling: farthest point sampling is a very commonly used sampling algorithm, widely adopted because it guarantees uniform sampling of the samples. For example, PointNet++ in 3D point cloud deep learning performs FPS sampling on sample points and clusters them as receptive fields; the 3D detection network VoteNet applies FPS sampling to the scattered points obtained by voting before clustering; and the 6D pose estimation algorithm PVN3D uses it to select eight feature points of an object for voting and pose computation.
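
For reference, the following is a minimal, didactic O(N·M) sketch of farthest point sampling; real pipelines use optimized kernels rather than this loop:

```python
# Hedged sketch: greedily pick each next point as the one farthest from
# all points selected so far.
import numpy as np

def farthest_point_sampling(points: np.ndarray, m: int) -> np.ndarray:
    """points: (N, 3) coordinates. Returns indices of m sampled points."""
    n = points.shape[0]
    selected = np.zeros(m, dtype=np.int64)   # start from point 0
    dist = np.full(n, np.inf)
    for i in range(1, m):
        # update each point's distance to its nearest already-selected point
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[i] = int(dist.argmax())
    return selected
```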

In a specific embodiment, the multilayer perceptron (MLP) is a feedforward artificial neural network model that maps multiple input data sets onto a single output data set.

This embodiment provides a method for detecting point cloud targets. The method processes the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network to obtain the global voxel features of the scene to be determined and generate target proposal boxes; processes the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected; and, based on a preset dual-encoder and fusion-decoder algorithm, fuses the local point cloud features and the global voxel features, adjusts the target proposal boxes to generate three-dimensional target boxes, and completes the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes. In this way, after obtaining the local point cloud features and the global voxel features through the point cloud network and the three-dimensional sparse convolutional network respectively, the preset dual-encoder and fusion-decoder algorithm generates a three-dimensional target box that fuses the multi-modal features, thereby completing the detection of point cloud targets in the scene to be detected. Determining the three-dimensional target box through multi-modal fusion thus improves the accuracy and efficiency of point cloud target detection and solves the technical problems of low accuracy and low efficiency in current point cloud target detection methods.

Based on the embodiment shown in FIG. 2 above, in this embodiment, step S30 includes:

Adaptive fusion of the multi-modal perception features and fine adjustment of the target proposal boxes. For the local point cloud information and the global coarse object position information extracted in step 1, the attention mechanism is used to fuse the two kinds of features effectively on the basis of the original point cloud, so that each can contribute its strengths. Then, combined with the preliminarily predicted target proposal boxes, the boxes are finely adjusted using the fused point cloud features of local objects. The specific steps are as follows:

Attention-based adaptive fusion of multi-modal perception features: based on the multi-scale point cloud and voxel features extracted in step 1, the present invention designs a dual encoder-decoder structure that fuses the two on the basis of the point cloud. This involves the following sub-steps:

Mapping voxel features into the point cloud space: for point cloud and voxel features of the same scale, and considering that the point cloud preserves better spatial structure and position information, the present invention takes the point cloud as the basis and maps the voxel features onto it by nearest-neighbor interpolation. Let the outputs of the point cloud branch and the voxel branch in step 1 be

[formula image BDA0003716029290000081] and [formula image BDA0003716029290000082]

respectively. The assignment uses a three-nearest-neighbor interpolation function: let S(v_j) denote the set of the three points closest to v_j in Euclidean space, with p_k one of those points.

[formula image BDA0003716029290000083: by the surrounding description, a weighted sum of the three nearest voxel features]

Here, w_kj is calculated as follows:

[formula image BDA0003716029290000084: presumably an inverse function of the Euclidean distance between p_k and v_j]

Each point-based voxel feature is thus obtained by a Euclidean-distance-weighted summation of the features of the three voxels in the surrounding neighborhood.
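
A minimal sketch of this three-nearest-neighbor, inverse-distance-weighted mapping follows; the exact weight exponent is an assumption, since the formula images are not reproduced here:

```python
# Hedged sketch: interpolate voxel features onto points from the three
# nearest voxel centers, weighted by inverse Euclidean distance.
import numpy as np

def three_nn_interpolate(points, voxel_centers, voxel_feats, eps=1e-8):
    """points: (N, 3); voxel_centers: (M, 3); voxel_feats: (M, C).
    Returns (N, C) per-point features."""
    out = np.empty((points.shape[0], voxel_feats.shape[1]))
    for i, p in enumerate(points):
        d = np.linalg.norm(voxel_centers - p, axis=1)
        nn = np.argsort(d)[:3]              # indices of the three nearest voxels
        w = 1.0 / (d[nn] + eps)             # inverse-distance weights
        out[i] = (w[:, None] * voxel_feats[nn]).sum(axis=0) / w.sum()
    return out
```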

Specifically, nearest-neighbor interpolation assigns to a pixel of the transformed image the gray value of its nearest neighboring input pixel; it is the simplest form of gray-value interpolation, also called zero-order interpolation: the gray value of the transformed pixel is set equal to that of the nearest input pixel.

Attention-based feature fusion: different features play different roles at different positions in the scene. For positions without detailed features, such as the side wall of a car or pillars in the scene, the voxel representation is convenient for quickly localizing objects; for positions with discriminative features, such as a car's tires or a person's pose, point cloud features describe the objects better and help the refinement stage perform fine regression and category judgment. To fuse the different features finely, the decoder structure fuses them progressively from coarse to fine, and through three-nearest-neighbor interpolation the features of the sparser point sets are propagated to the denser point sets and finally restored to the original point cloud, yielding strongly representative multi-modal fusion features for the whole scene, which are fed into the subsequent fine-tuning network to further improve detection accuracy.
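
One simple way to realize such position-dependent weighting is a learned gate over the concatenated branch features; the gating design below is an assumption, as the patent does not spell out the attention layer's exact form:

```python
# Hedged sketch: per-point attention weights decide how much each branch
# (point features vs. voxel features) contributes to the fused feature.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, point_feat: torch.Tensor, voxel_feat: torch.Tensor):
        """point_feat, voxel_feat: (N, dim) per-point features from the two
        branches; returns an (N, dim) adaptively fused feature."""
        a = self.gate(torch.cat([point_feat, voxel_feat], dim=-1))
        return a * point_feat + (1 - a) * voxel_feat
```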

Given the fusion features and the target proposal boxes, a fine-tuning network is designed to regress the spatial information of objects more precisely and to filter out possible false detections. For each target proposal box, the point set G_i inside it is taken, and max pooling is used to extract the global feature, giving the output feature of the object G_i. The formula is as follows:

[formula image BDA0003716029290000091: by the surrounding description, a max pooling over the features of the points in G_i]

The precise information of the object is then obtained through the confidence prediction branch and the target box regression branch, respectively.
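
The refinement head described above can be sketched as a max-pool over the points inside each proposal followed by two linear branches; the layer sizes are assumptions:

```python
# Hedged sketch: pool the fused per-point features inside each proposal,
# then branch into confidence prediction and box regression.
import torch
import torch.nn as nn

class RefinementHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.confidence = nn.Linear(dim, 1)  # objectness score per proposal
        self.regression = nn.Linear(dim, 7)  # residuals for {x, y, z, l, w, h, r}

    def forward(self, proposal_point_feats: torch.Tensor):
        """proposal_point_feats: (num_proposals, num_points, dim) fused
        features of the points inside each proposal box."""
        pooled = proposal_point_feats.max(dim=1).values  # global feature per box
        return self.confidence(pooled), self.regression(pooled)
```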

Confidence prediction is defined as the probability that a predicted value lies within the allowable error range, obtained by the interval estimation method of mathematical statistics when estimating future conditions.

Loss function design: the loss functions supervise the deep neural network so that it learns the features required for the point cloud target detection task; this supervision includes classification supervision and regression supervision. The classification loss adopts the following cross-entropy loss function:

[formula image BDA0003716029290000092: a standard cross-entropy over the anchor scores]

where P(a_i) denotes the predicted score of the i-th preset box and Q(a_i) denotes the true label of that box. The regression loss function of the present invention adopts the Smooth-L1 loss:

[formula image BDA0003716029290000093: the standard Smooth-L1 form applied to each box residual]

where υ denotes one variable of the three-dimensional box {x, y, z, l, w, h, r}. Through the joint supervision of the classification loss and the regression loss, the network ultimately learns the ability to detect point cloud targets.
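
The joint supervision can be sketched as follows; the equal weighting of the two terms and the use of sigmoid probabilities for the anchor scores are assumptions, since the formula images are not reproduced here:

```python
# Hedged sketch: binary cross-entropy for anchor classification plus
# Smooth-L1 over the seven box residuals {x, y, z, l, w, h, r}.
import torch
import torch.nn.functional as F

def detection_loss(cls_pred, cls_target, box_pred, box_target):
    """cls_pred/cls_target: (A,) anchor probabilities (post-sigmoid) and
    0/1 labels; box_pred/box_target: (A, 7) residuals for positive anchors."""
    cls_loss = F.binary_cross_entropy(cls_pred, cls_target)
    reg_loss = F.smooth_l1_loss(box_pred, box_target)
    return cls_loss + reg_loss
```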

This embodiment provides a method for detecting point cloud targets. The method processes the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network to obtain the global voxel features of the scene to be determined and generate target proposal boxes; processes the original point cloud of the scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected; and, based on a preset dual-encoder and fusion-decoder algorithm, fuses the local point cloud features and the global voxel features, adjusts the target proposal boxes to generate three-dimensional target boxes, and completes the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes. In this way, after obtaining the local point cloud features and the global voxel features through the point cloud network and the three-dimensional sparse convolutional network respectively, the preset dual-encoder and fusion-decoder algorithm generates a three-dimensional target box that fuses the multi-modal features, thereby completing the detection of point cloud targets in the scene to be detected. Determining the three-dimensional target box through multi-modal fusion thus improves the accuracy and efficiency of point cloud target detection and solves the technical problems of low accuracy and low efficiency in current point cloud target detection methods.

In addition, an embodiment of the present invention also provides a device for detecting point cloud targets.

Referring to FIG. 3, FIG. 3 is a schematic diagram of the functional modules of a first embodiment of the point cloud target detection device of the present invention.

In this embodiment, the device for detecting point cloud targets includes:

a voxel feature extraction module 10, configured to process the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network, obtain the global voxel features of the scene to be determined, and generate target proposal boxes;

a point cloud feature extraction module 20, configured to process the original point cloud of the scene to be detected through a point cloud network and obtain the local point cloud features of the scene to be detected;

a feature fusion module 30, configured to fuse the local point cloud features and the global voxel features based on a preset dual-encoder and fusion-decoder algorithm, adjust the target proposal boxes to generate three-dimensional target boxes, and complete the detection of point cloud targets in the scene to be detected based on the three-dimensional target boxes.

Further, the voxel feature extraction module 10 specifically includes:

a voxel division unit, configured to divide the point cloud space to be detected into a regular voxel representation, in which only the voxel grids containing points are retained, while the remaining voxels are regarded as empty voxels;

a coarse proposal box generation unit, configured to take, for each non-empty voxel, the average feature of the points it contains as its initial input feature, and then use the three-dimensional sparse convolutional neural network to perceive the voxel grid and output the coarse proposal boxes for the image to be determined.

Further, the point cloud feature extraction module 20 specifically includes:

a sampling unit, configured to uniformly sample a fixed number N of points from the original point cloud using the farthest point sampling strategy,

[formula image BDA0003716029290000101]

where the input of each layer's local point cloud feature extraction module is in turn the output of the previous layer, and the output of the current layer is the point set for the subsequent farthest point sampling;

a local point cloud feature extraction unit, configured to extract, through the local point cloud feature extraction module, the local point cloud features of each layer of the point cloud of the scene to be detected and expand them layer by layer to obtain global point cloud semantic information, where, for a point p_i in the point cloud, S_i is the set of points inside the spherical neighborhood of radius r centered at p_i.

Further, the device for detecting point cloud targets also includes:

a fine-tuning network adjustment unit, configured, given the fusion features and the target proposal boxes, to design a fine-tuning network that further regresses the spatial information of objects precisely and filters out possible false detections.

Each module in the above device for detecting point cloud targets corresponds to a step in the above embodiments of the method for detecting point cloud targets; their functions and implementation processes are not repeated here.

In addition, an embodiment of the present invention also provides a computer-readable storage medium.

The computer-readable storage medium of the present invention stores a point cloud target detection program, and when the program is executed by a processor, the steps of the above method for detecting point cloud targets are implemented.

For the method implemented when the point cloud target detection program is executed, reference may be made to the embodiments of the point cloud target detection method of the present invention, which are not repeated here.

It should be noted that, herein, the terms "comprising", "including", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or system that includes that element.

The above serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

The present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communication network; in such environments, program modules may be located in both local and remote computer storage media, including storage devices.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc), including several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the various embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A method for detecting a point cloud target, characterized by comprising the following steps:
processing the three-dimensional voxel grid of the scene to be determined through a three-dimensional sparse convolutional neural network to obtain the global voxel features of the scene to be determined and generate a target proposal box;
processing an original point cloud of a scene to be detected through a point cloud network to obtain the local point cloud features of the scene to be detected;
and fusing the local point cloud features and the global voxel features based on a preset dual-encoder and fusion-decoder algorithm, adjusting the target proposal box to generate a three-dimensional target box, and completing the detection of the point cloud target in the scene to be detected based on the three-dimensional target box.
2. The method for detecting a point cloud target of claim 1, wherein the processing a three-dimensional voxel grid of a scene to be detected through a three-dimensional sparse convolutional neural network to obtain global voxel features of the scene to be detected and generate a target proposal box comprises:
dividing the point cloud space to be detected into a regular voxel representation, wherein only voxel grids containing points are retained and the remaining voxels are treated as empty voxels;
for each non-empty voxel, adopting the average feature of the points it contains as its initial input feature, and then adopting the three-dimensional sparse convolutional neural network to perceive the voxel grid and give a target proposal box for the scene to be detected.
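A minimal sketch of this voxelization step, assuming NumPy; the grid bounds and voxel sizes are illustrative assumptions, not values from the patent:

```python
import numpy as np

def voxelize_mean(points: np.ndarray, voxel_size=(0.05, 0.05, 0.1),
                  pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """points: (N, C) array whose first three columns are x, y, z."""
    low, high = np.asarray(pc_range[:3]), np.asarray(pc_range[3:])
    size = np.asarray(voxel_size)
    mask = np.all((points[:, :3] >= low) & (points[:, :3] < high), axis=1)
    pts = points[mask]
    coords = ((pts[:, :3] - low) / size).astype(np.int64)   # voxel indices per point
    # Group points by voxel: only occupied voxels are materialized (sparse).
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    feats = np.zeros((len(uniq), pts.shape[1]))
    counts = np.zeros(len(uniq))
    np.add.at(feats, inverse, pts)       # sum point features per occupied voxel
    np.add.at(counts, inverse, 1.0)
    feats /= counts[:, None]             # mean feature = initial input feature
    return uniq, feats  # (M, 3) integer coordinates, (M, C) initial features
```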
3. The method for detecting a point cloud target according to claim 2, wherein the step of adopting, for each non-empty voxel, the average feature of the points it contains as its initial input feature, and then adopting the three-dimensional sparse convolutional neural network to perceive the voxel grid and give a target proposal box of the scene to be detected comprises:
compressing the three-dimensional output features along the height dimension into a two-dimensional space, obtaining a two-dimensional image representation [formula FDA0003716029280000011, not reproduced in the text extraction];
inputting the two-dimensional image representation into a two-dimensional RPN network for bird's-eye-view target detection, the layer-by-layer output being expressed as [formula FDA0003716029280000012, not reproduced in the text extraction], wherein each layer adopts 3 × 3 convolutions to reduce the number of learned parameters and obtain progressively refined feature maps for subsequent prediction, and the two-dimensional RPN network is built from four layers of two-dimensional convolutional neural networks in a U-Net structure;
obtaining the object proposals detected by the RPN through anchor-box-based classification and regression, that is, presetting a corresponding anchor box for each pixel and finely adjusting it to achieve accurate localization, thereby obtaining object proposals of size [formula FDA0003716029280000013, not reproduced in the text extraction], with [formula FDA0003716029280000014, not reproduced in the text extraction] predefined anchor boxes in total.
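The bird's-eye-view head of this claim can be sketched as follows, assuming PyTorch; the claim's four-layer U-Net RPN is abbreviated here to two shared 3 × 3 convolution layers, and the channel counts, anchor count, and 7-value box encoding are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BEVRPNHead(nn.Module):
    def __init__(self, in_channels: int, height_bins: int,
                 num_anchors: int = 2, box_dim: int = 7):
        super().__init__()
        c = in_channels * height_bins                # height folded into channels
        self.shared = nn.Sequential(                 # 3x3 convs keep parameter count small
            nn.Conv2d(c, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
        )
        self.cls = nn.Conv2d(128, num_anchors, 1)            # per-anchor classification
        self.reg = nn.Conv2d(128, num_anchors * box_dim, 1)  # per-anchor fine adjustment

    def forward(self, dense_3d: torch.Tensor):
        # dense_3d: (B, C, H, L, W) -> (B, C*H, L, W), compressing the height dimension
        b, c, h, l, w = dense_3d.shape
        bev = dense_3d.reshape(b, c * h, l, w)
        feat = self.shared(bev)
        return self.cls(feat), self.reg(feat)
```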
4. The method for detecting a point cloud target according to claim 2, wherein the dividing the point cloud space to be detected into a regular voxel representation, wherein only voxel grids containing points are retained and the remaining voxels are treated as empty voxels, comprises:
stacking four of the three-dimensional sparse convolution modules using the Spconv library, wherein each sparse convolution module comprises two submanifold convolution layers and one sparse convolution layer for progressive down-sampling; assuming the input voxel tensor is expressed as L × W × H × C, the network output of the three-dimensional sparse convolution can be expressed as [formula FDA0003716029280000021, not reproduced in the text extraction].
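A sketch of this backbone, assuming the spconv 2.x PyTorch API; the channel widths are illustrative assumptions, and each stage follows the claim's pattern of two submanifold convolutions plus one strided sparse convolution for down-sampling:

```python
import torch.nn as nn
import spconv.pytorch as spconv

def stage(cin: int, cout: int, key: str) -> spconv.SparseSequential:
    # Two submanifold convolutions, then one strided sparse convolution.
    return spconv.SparseSequential(
        spconv.SubMConv3d(cin, cin, 3, padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(cin), nn.ReLU(),
        spconv.SubMConv3d(cin, cin, 3, padding=1, bias=False, indice_key=key),
        nn.BatchNorm1d(cin), nn.ReLU(),
        spconv.SparseConv3d(cin, cout, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm1d(cout), nn.ReLU(),
    )

class SparseBackbone(nn.Module):
    def __init__(self, in_channels: int = 4):
        super().__init__()
        self.net = spconv.SparseSequential(   # four stacked sparse modules
            stage(in_channels, 16, "subm1"),
            stage(16, 32, "subm2"),
            stage(32, 64, "subm3"),
            stage(64, 64, "subm4"),
        )

    def forward(self, x: spconv.SparseConvTensor) -> spconv.SparseConvTensor:
        return self.net(x)  # each spatial axis roughly halved four times
```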
5. The method for detecting a point cloud target according to claim 1, wherein the processing an original point cloud of the scene to be detected through a point cloud network to obtain local point cloud features of the scene to be detected comprises:
uniformly sampling a fixed number of points [formula FDA0003716029280000022, not reproduced in the text extraction] from the original point cloud using a farthest-point sampling strategy, wherein the input of each layer of the local point cloud feature extraction module is, in sequence, the output of the previous layer, and the output of the current layer is the point set sampled by the subsequent farthest-point sampling;
extracting local point cloud features of each point cloud layer of the scene to be detected through the local point cloud feature extraction module, expanding layer by layer to obtain global point cloud semantic information, wherein for a point p_i in the point cloud, S_i denotes the set of points within a spherical neighborhood of radius r centered at p_i.
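The farthest-point sampling strategy named in this claim can be sketched as follows, assuming NumPy; the deterministic seed index is an illustrative simplification. Because each chosen point maximizes its distance to the points already chosen, the samples spread roughly uniformly over the cloud:

```python
import numpy as np

def farthest_point_sampling(xyz: np.ndarray, m: int) -> np.ndarray:
    """xyz: (N, 3) point coordinates; returns indices of m sampled points."""
    n = xyz.shape[0]
    chosen = np.zeros(m, dtype=np.int64)
    dist = np.full(n, np.inf)             # distance to nearest chosen point
    chosen[0] = 0                         # fixed seed for illustration
    for i in range(1, m):
        d = np.sum((xyz - xyz[chosen[i - 1]]) ** 2, axis=1)
        dist = np.minimum(dist, d)        # update nearest-chosen distances
        chosen[i] = int(np.argmax(dist))  # pick the farthest remaining point
    return chosen
```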
6. The method for detecting a point cloud target of claim 5, wherein the extracting local point cloud features of each point cloud layer of the scene to be detected through the local point cloud feature extraction module, expanding layer by layer to obtain global point cloud semantic information, wherein for a point p_i in the point cloud, S_i denotes the set of points within a spherical neighborhood of radius r centered at p_i, comprises:
randomly sampling k points from the set S_i to form the set S'_i;
performing feature fusion and extraction on the sampled points in the neighborhood through multilayer perceptrons, and computing the local feature of the point p_i by the following formula:
f(p_i) = MLP(max_{j=1,...,k}{MLP(f_j)})  (1)
that is, a multilayer perceptron is first adopted to further extract the features of the k points in the local region of the point, obtaining the corresponding n-dimensional mapped features; max pooling is then applied along the feature dimension to obtain the maximal information representation of the neighborhood of radius r centered at p_i; and a multilayer perceptron is further adopted to abstract this n-dimensional high-dimensional feature, obtaining the output feature of the point p_i;
repeatedly sampling the point cloud down to the corresponding points layer by layer, aggregating the neighborhood features of the sampled points, and extracting high-dimensional geometric features of the point cloud information.
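Formula (1) can be sketched as follows, assuming PyTorch; the layer widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocalFeature(nn.Module):
    """f(p_i) = MLP(max_{j=1..k}{MLP(f_j)}): inner MLP lifts neighbor features,
    max pooling aggregates the neighborhood, outer MLP abstracts the result."""
    def __init__(self, c_in: int, c_mid: int, c_out: int):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(c_in, c_mid), nn.ReLU(),
                                 nn.Linear(c_mid, c_mid), nn.ReLU())
        self.post = nn.Sequential(nn.Linear(c_mid, c_out), nn.ReLU())

    def forward(self, neighbor_feats: torch.Tensor) -> torch.Tensor:
        # neighbor_feats: (B, P, k, c_in) — features f_j of the k points
        # randomly sampled from each spherical neighborhood S_i.
        lifted = self.pre(neighbor_feats)   # inner MLP(f_j)
        pooled = lifted.max(dim=2).values   # max over j = 1..k
        return self.post(pooled)            # outer MLP -> f(p_i)
```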
7. The method for detecting a point cloud target according to claim 1, wherein after the fusing the local point cloud features and the global voxel features based on a preset dual-encoder and fusion-decoder algorithm, adjusting the target proposal box to generate a three-dimensional target box, and completing detection of the point cloud target in the scene to be detected based on the three-dimensional target box, the method further comprises:
given the fused features and a target proposal box, designing a fine-tuning network to further accurately regress the spatial information of the object and filter out possible falsely detected objects;
for each target proposal, taking the set of points G_i framed within it, extracting global features by max pooling, and computing the feature of the object G_i by the following formula:
f(G_i) = MLP(max_{j=1,...,k}{MLP(f_j)})  (2)
obtaining the accurate information of the scene to be detected through the confidence-prediction branch and the regression branch of the target proposal box, respectively.
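A sketch of this fine-tuning network, assuming PyTorch; the layer widths and the 7-value box encoding (x, y, z, l, w, h, yaw) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ProposalRefiner(nn.Module):
    """Pools the points G_i inside each proposal via formula (2), then predicts
    a confidence score (to filter false detections) and box residuals."""
    def __init__(self, c_in: int, c_mid: int = 256, box_dim: int = 7):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(c_in, c_mid), nn.ReLU())
        self.post = nn.Sequential(nn.Linear(c_mid, c_mid), nn.ReLU())
        self.confidence = nn.Linear(c_mid, 1)        # confidence-prediction branch
        self.regression = nn.Linear(c_mid, box_dim)  # regression branch (residuals)

    def forward(self, roi_points: torch.Tensor):
        # roi_points: (R, k, c_in) — fused features of points framed by each proposal
        g = self.post(self.pre(roi_points).max(dim=1).values)  # formula (2)
        return self.confidence(g), self.regression(g)
```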
8. A point cloud target detection device, characterized by comprising:
a voxel feature extraction module, configured to process the three-dimensional voxel grid of the scene to be detected through a three-dimensional sparse convolutional neural network to obtain global voxel features of the scene to be detected and generate a target proposal box;
a point cloud feature extraction module, configured to process the original point cloud of the scene to be detected through a point cloud network to obtain local point cloud features of the scene to be detected;
a feature fusion module, configured to fuse the local point cloud features and the global voxel features based on a preset dual-encoder and fusion-decoder algorithm, adjust the target proposal box to generate a three-dimensional target box, and complete detection of the point cloud target in the scene to be detected based on the three-dimensional target box.
9. A point cloud target detection apparatus, characterized in that the apparatus comprises a processor, a memory, and a point cloud target detection program stored on the memory and executable by the processor, wherein the point cloud target detection program, when executed by the processor, implements the steps of the point cloud target detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a point cloud target detection program is stored on the computer-readable storage medium, wherein the point cloud target detection program, when executed by a processor, implements the steps of the point cloud target detection method according to any one of claims 1 to 7.
CN202210743615.8A 2022-06-27 2022-06-27 Point cloud target detection method, device, equipment and storage medium Pending CN115082885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743615.8A CN115082885A (en) 2022-06-27 2022-06-27 Point cloud target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115082885A true CN115082885A (en) 2022-09-20

Family

ID=83256155

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743615.8A Pending CN115082885A (en) 2022-06-27 2022-06-27 Point cloud target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115082885A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200130A (en) * 2020-10-28 2021-01-08 中国人民解放军陆军航空兵学院陆军航空兵研究所 Three-dimensional target detection method and device and terminal equipment
CN113378854A (en) * 2021-06-11 2021-09-10 武汉大学 Point cloud target detection method integrating original point cloud and voxel division
CN113706480A (en) * 2021-08-13 2021-11-26 重庆邮电大学 Point cloud 3D target detection method based on key point multi-scale feature fusion
CN113989797A (en) * 2021-10-26 2022-01-28 清华大学苏州汽车研究院(相城) Three-dimensional dynamic target detection method and device based on voxel point cloud fusion
CN114092780A (en) * 2021-11-12 2022-02-25 天津大学 3D object detection method based on fusion of point cloud and image data
CN114495089A (en) * 2021-12-21 2022-05-13 西安电子科技大学 Three-dimensional target detection method based on multi-scale heterogeneous characteristic self-adaptive fusion

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471513A (en) * 2022-11-01 2022-12-13 小米汽车科技有限公司 Point cloud segmentation method and device
CN115471513B (en) * 2022-11-01 2023-03-31 小米汽车科技有限公司 Point cloud segmentation method and device
CN115578460A (en) * 2022-11-10 2023-01-06 湖南大学 Robot Grasping Method and System Based on Multimodal Feature Extraction and Dense Prediction
CN115578460B (en) * 2022-11-10 2023-04-18 湖南大学 Robot grabbing method and system based on multi-mode feature extraction and dense prediction
CN115578393A (en) * 2022-12-09 2023-01-06 腾讯科技(深圳)有限公司 Key point detection method, training method, device, equipment, medium and product
CN115578393B (en) * 2022-12-09 2023-03-10 腾讯科技(深圳)有限公司 Key point detection method, training method, device, equipment, medium and product
CN116259029A (en) * 2023-05-15 2023-06-13 小米汽车科技有限公司 Target detection method and device and vehicle
CN116259029B (en) * 2023-05-15 2023-08-15 小米汽车科技有限公司 Target detection method and device and vehicle
CN116758517A (en) * 2023-08-16 2023-09-15 之江实验室 Three-dimensional target detection method, device and computer equipment based on multi-view images
CN116758517B (en) * 2023-08-16 2023-11-14 之江实验室 Three-dimensional target detection method and device based on multi-view image and computer equipment
CN119648939A (en) * 2024-12-05 2025-03-18 安徽理工大学 Refined urban 3D reconstruction method and system driven by multimodal remote sensing data

Similar Documents

Publication Publication Date Title
CN115082885A (en) Point cloud target detection method, device, equipment and storage medium
CN112529015B (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN112052860B (en) A three-dimensional target detection method and system
CN110992271B (en) Image processing method, path planning method, device, device and storage medium
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
WO2022017131A1 (en) Point cloud data processing method and device, and intelligent driving control method and device
CN114022858B (en) A semantic segmentation method, system, electronic device and medium for autonomous driving
CN116698051B (en) High-precision vehicle positioning, vectorization map construction and positioning model training method
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN113610172A (en) Neural network model training method and device, and sensing data fusion method and device
WO2022052782A1 (en) Image processing method and related device
CN115170746B (en) Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN112258565A (en) Image processing method and device
CN116486038A (en) A three-dimensional construction network training method, three-dimensional model generation method and device
CN113592015B (en) Method and device for positioning and training feature matching network
US20230401826A1 (en) Perception network and data processing method
WO2023165361A1 (en) Data processing method and related device
WO2024217411A1 (en) Scenario aware method and related device thereof
KR20230078502A (en) Apparatus and method for image processing
CN115249269A (en) Object detection method, computer program product, storage medium and electronic device
JP2024521816A (en) Unrestricted image stabilization
CN115909255B (en) Image generation and image segmentation methods, devices, equipment, vehicle-mounted terminal and medium
CN116883961A (en) Target perception method and device
CN116758214A (en) Three-dimensional modeling method and device for remote sensing image, electronic equipment and storage medium
CN113643348B (en) Face attribute analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination