
CN116758534B - 3D object detection method based on convolutional long short-term memory network - Google Patents

3D object detection method based on convolutional long short-term memory network

Info

Publication number
CN116758534B
CN116758534B (application CN202310719201.6A)
Authority
CN
China
Prior art keywords
convolution, output, layer, point cloud, characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310719201.6A
Other languages
Chinese (zh)
Other versions
CN116758534A (en)
Inventor
何立火
钟彬彬
甘海林
柯俊杰
王笛
高新波
路文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310719201.6A priority Critical patent/CN116758534B/en
Publication of CN116758534A publication Critical patent/CN116758534A/en
Application granted granted Critical
Publication of CN116758534B publication Critical patent/CN116758534B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract


The present invention discloses a 3D object detection method based on a convolutional long short-term memory network, comprising the following steps: Step 1: Using the nuScenes point cloud dataset; Step 2: Converting the input point cloud data to a spherical coordinate system; Step 3: Dividing the point cloud space into voxels according to the spherical coordinates to obtain voxel features, and performing preliminary extraction on the voxel features; Step 4: Performing intermediate feature extraction on the voxel features; Step 5: Performing temporal feature extraction using a convolutional long short-term memory network to obtain its output features; Step 6: Performing multi-scale feature extraction on the output features to obtain a feature map; Step 7: Using the feature map to generate anchor boxes, and performing classification, bounding box regression, and angle regression on the anchor boxes. The present invention uses a convolutional long short-term memory network to extract temporal features from point cloud sequences, addressing the inability of existing deep-learning-based 3D continuous object detection methods to model long-term dependencies over long sequences.

Description

3D object detection method based on a convolutional long short-term memory network
Technical Field
The invention belongs to the technical field of 3D object detection, and particularly relates to a 3D object detection method based on a convolutional long short-term memory network.
Background
3D object detection methods can be broadly classified into image-based, point-cloud-based, and fusion-based methods. Image-based 3D object detection takes a single image or multiple images as input. Point-cloud-based methods detect targets using point clouds acquired by sensors such as lidar and TOF cameras; they provide relatively accurate depth information, achieve higher recognition accuracy on distant targets than image-based methods, and are currently the mainstream choice for autonomous driving. Fusion-based methods use 2D images and point cloud data jointly to detect 3D targets. Existing research has established a number of hand-designed evaluation metrics, such as Average Precision (AP) and Average Orientation Similarity (AOS), inherited from 2D image detection.
When humans recognize a scene, global information such as the target object, foreground, and background can be judged from the image at the current moment, while objects of interest can be captured from continuous video information. A 3D object detection task should therefore exploit information features from both the current moment and preceding moments, so that making full use of 3D measurements at different times becomes the key to improving the performance of a 3D object detection model. Existing 3D object detection methods do not effectively exploit the temporal information provided by the sensor to improve detection accuracy.
Application publication CN115546784A, entitled "3D target detection method based on deep learning", discloses a method that loads the Kitti dataset as training sample images; preprocesses the loaded training images; computes each target's 3D center point, the projection of that center point on the image, the eight corner positions, and the Gaussian distribution of the target center point; constructs a deep convolutional neural network consisting of a backbone network and two branch networks; loads the dataset as a training set, obtains the network output by forward propagation, computes the loss, back-propagates, and updates the network parameters to obtain a trained model; and finally feeds test set images into the pre-trained model to obtain the detected targets and compute the 3D position and class of each target.
The drawback of that method is that the spatial distribution of the Kitti point clouds is uneven in a rectangular coordinate system, and only single-frame point cloud features are extracted, ignoring the temporal dependence between the frame to be detected and the preceding frames; its prediction accuracy and generalization ability are therefore low.
Traditional 3D object detection methods rely only on the spatial information of the current frame, which is inconsistent with human vision, where the current view is interpreted together with information from preceding moments. The failure to integrate features from different moments into the feature information is one of the important factors limiting the performance of 3D object detection methods.
Existing deep-learning-based 3D continuous object detection methods use only the spatial feature information of the current frame during feature extraction, and either ignore the temporal feature information of the frames adjacent to the detection frame or can only exploit a short temporal window.
Disclosure of Invention
To overcome the problems in the prior art, the invention provides a 3D object detection method based on a convolutional long short-term memory network (ConvLSTM), which extracts temporal features from the point cloud sequence through the ConvLSTM and addresses at least one of the problems of low detection accuracy and low robustness caused by the inability of existing deep-learning-based 3D continuous detection methods to model long-term dependencies over long sequences.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A 3D object detection method based on a convolutional long short-term memory network comprises the following steps:
Step 1, acquiring or constructing point cloud data using the nuScenes point cloud dataset, in which the point clouds are stored as coordinates in a three-dimensional rectangular coordinate system;
Step 2, converting a three-dimensional rectangular coordinate system of the point cloud data into a spherical coordinate system to realize the density homogenization of the point cloud in space;
step 3, carrying out voxel division on the point cloud space according to spherical coordinates to obtain voxel characteristics, and carrying out preliminary extraction on the voxel characteristics to unbind the voxel characteristics and the space absolute position;
Step 4, extracting intermediate features of the voxel features obtained in the step 3 through a convolution network;
Step 5, extracting temporal features from the intermediate features through the convolutional long short-term memory network to obtain its output feature H_n;
Step 6, performing multi-scale feature extraction on the output feature H_n obtained in step 5 to obtain a feature map H_f;
Step 7, generating anchor boxes from the feature map H_f obtained in step 6, and performing classification, bounding-box regression, and angle regression on the anchor boxes to obtain the final prediction boxes;
Step 8, setting hyperparameters and training parameters, training the network, and verifying the effectiveness of the algorithm.
The step 1 specifically comprises the following steps:
The point cloud data includes the XYZ coordinates (x, y, z) and the reflection intensity I of each point, in the form [(x_i, y_i, z_i, I_i)], where the subscript i refers to the serial number of the corresponding point.
The step 2 specifically comprises the following steps:
Step 2.1, arrange the point cloud data [(x_i, y_i, z_i, I_i)] obtained in step 1 in time order to obtain continuous point cloud frames [(x_it, y_it, z_it, I_it)], where the subscript t = 0, 1, 2, ..., n-1 indexes the frame sequence in reverse order: t = 0 indicates that the point belongs to the annotated key frame, and each increment of t moves one frame earlier, forming a point cloud input sequence from some moment before the key frame up to the key frame;
Step 2.2, encode the continuous point cloud frames [(x_it, y_it, z_it, I_it)] into an input sequence [(x_i, y_i, z_i, I_i, t)];
Step 2.3, for each point (x_i, y_i, z_i, I_i) in the dataset point cloud, compute:
d_i = sqrt(x_i^2 + y_i^2 + z_i^2)
θ_i = arctan2(y_i, x_i)
φ_i = arcsin(z_i / d_i)
I_i = I_i
obtaining a point cloud sequence [(d_i, θ_i, φ_i, I_i, t)] in the spherical coordinate system, where d denotes the straight-line distance of the point from the origin (the lidar), θ is the azimuth of the point, φ is the pitch angle of the point, and I is the reflection intensity of the point;
Step 2.4, re-decode the point cloud sequence into a list of point cloud frames [(d_i, θ_i, φ_i, I_i)] according to t = 0, 1, 2, ..., n-1, and record the length n of each sequence in the batch.
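The coordinate conversion of step 2.3 can be sketched in a few lines. The function name and the sample frame below are illustrative (not from the patent); the formulas are the standard Cartesian-to-spherical expressions for distance, azimuth, and pitch that the step defines.

```python
import math

def cartesian_to_spherical(x, y, z, intensity):
    """Convert one lidar point (x, y, z, I) to spherical form (d, theta, phi, I)."""
    d = math.sqrt(x * x + y * y + z * z)      # straight-line distance to the lidar
    theta = math.atan2(y, x)                  # azimuth angle
    phi = math.asin(z / d) if d > 0 else 0.0  # pitch (elevation) angle
    return d, theta, phi, intensity

# A toy point cloud frame as a list of (x, y, z, I) tuples.
frame = [(3.0, 4.0, 0.0, 0.7), (0.0, 0.0, 5.0, 0.2)]
spherical_frame = [cartesian_to_spherical(*p) for p in frame]
```

Applying this per frame, with the frame index t carried alongside each point, reproduces the encode/convert/decode flow of steps 2.2 to 2.4.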
The step 3 specifically comprises the following steps:
Step 3.1, in the spherical coordinate system, divide the space into a voxel grid whose side lengths along the d, θ, and φ directions are v_d, v_θ, and v_φ respectively. The division range is not unbounded: the three dimensions are limited to [d_min, d_max], [θ_min, θ_max], and [φ_min, φ_max], where d_min and d_max are the lower and upper limits of the target-to-radar distance, θ_min and θ_max the limits of the azimuth dimension, and φ_min and φ_max the limits of the pitch-angle dimension;
then, grouping the point clouds according to the voxel grids of each point in the point clouds;
Step 3.2, voxelize each frame of the point cloud frame list [(d_i, θ_i, φ_i, I_i)];
Step 3.3, extract a feature for each voxel as follows: for every occupied voxel, compute the mean of the points it contains and the mean reflection intensity Ī over all points in that voxel, and take as the voxel feature the offsets (d_c, θ_c, φ_c) from the point mean to the voxel center together with Ī. The resulting feature matrix, with one row (d_c, θ_c, φ_c, Ī) per voxel of the single-frame point cloud, is denoted F_0; using relative offsets decouples the voxel feature from the absolute spatial position.
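A minimal sketch of the spherical voxelization and per-voxel feature described in step 3. The function names, detection ranges, and voxel sizes below are illustrative assumptions, not values from the patent.

```python
import math

def voxel_index(d, theta, phi, ranges, sizes):
    """Map a spherical point to its (d, theta, phi) voxel grid index,
    or None if it falls outside the configured detection range."""
    for value, (lo, hi) in zip((d, theta, phi), ranges):
        if not (lo <= value < hi):
            return None
    return tuple(int((v - lo) // s)
                 for v, (lo, _), s in zip((d, theta, phi), ranges, sizes))

def voxelize(points, ranges, sizes):
    """Group points by voxel and build the per-voxel feature of step 3.3:
    offsets from the point mean to the voxel center, plus mean intensity."""
    voxels = {}
    for d, theta, phi, inten in points:
        idx = voxel_index(d, theta, phi, ranges, sizes)
        if idx is not None:
            voxels.setdefault(idx, []).append((d, theta, phi, inten))
    features = {}
    for idx, pts in voxels.items():
        n = len(pts)
        mean = [sum(p[k] for p in pts) / n for k in range(4)]
        centre = [ranges[k][0] + (idx[k] + 0.5) * sizes[k] for k in range(3)]
        offsets = [mean[k] - centre[k] for k in range(3)]  # d_c, theta_c, phi_c
        features[idx] = offsets + [mean[3]]                # (d_c, theta_c, phi_c, mean I)
    return features

# Illustrative ranges [lo, hi) and voxel side lengths for d, theta, phi.
ranges = [(0.0, 50.0), (-math.pi, math.pi), (-0.5, 0.5)]
sizes = [1.0, math.pi / 180, 0.05]
feats = voxelize([(10.2, 0.1, 0.12, 0.5), (10.4, 0.1, 0.12, 0.7)], ranges, sizes)
```

Both sample points fall into the same voxel, so the result holds a single feature row for that grid cell.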
The step 4 specifically comprises the following steps:
The network for further extraction of the voxel features obtained in step 3 consists of six stages of convolution layers connected in sequence: 1 input layer, 4 intermediate layers, and 1 output layer, i.e. input layer → first intermediate layer → second intermediate layer → third intermediate layer → fourth intermediate layer → output layer;
Step 4.1, input the feature matrix F_0 obtained in step 3.3 into the input layer and output feature F_1; the input layer consists of 1 SubMConv3d convolution layer;
Step 4.2, input feature F_1 into the first intermediate layer and output feature F_2; the first intermediate layer consists of 1 SubMConv3d convolution layer;
Step 4.3, input feature F_2 into the second intermediate layer and output feature F_3; the second intermediate layer consists of 3 convolution layers connected in sequence: SparseConv3d → SubMConv3d → SubMConv3d;
Step 4.4, input feature F_3 into the third intermediate layer and output feature F_4; the third intermediate layer consists of 3 convolution layers connected in sequence: SparseConv3d → SubMConv3d → SubMConv3d;
Step 4.5, input feature F_4 into the fourth intermediate layer and output feature F_5; the fourth intermediate layer consists of 3 convolution layers connected in sequence: SparseConv3d → SubMConv3d → SubMConv3d;
Step 4.6, input feature F_5 into the output layer and output the intermediate feature F_s; the output layer consists of 1 SubMConv3d convolution layer. F_s is a sequence of feature maps stored in matrix form.
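The stage structure of step 4 can be summarized as a declarative config. The layer type names (submanifold vs. regular sparse 3D convolution) and the channel widths follow the detailed description later in the patent; the `BACKBONE` representation itself is only an illustrative sketch, not the patent's implementation.

```python
# Six-stage intermediate feature extractor of step 4, one entry per stage:
# (stage name, [(layer type, output feature dimension), ...]).
BACKBONE = [
    ("input",  [("SubMConv3d", 16)]),
    ("mid1",   [("SubMConv3d", 16)]),
    ("mid2",   [("SparseConv3d", 32), ("SubMConv3d", 32), ("SubMConv3d", 32)]),
    ("mid3",   [("SparseConv3d", 64), ("SubMConv3d", 64), ("SubMConv3d", 64)]),
    ("mid4",   [("SparseConv3d", 64), ("SubMConv3d", 64), ("SubMConv3d", 64)]),
    ("output", [("SubMConv3d", 128)]),
]

# Total number of convolution layers across all six stages.
n_layers = sum(len(convs) for _, convs in BACKBONE)
```

A real implementation would instantiate each entry as a sparse-convolution layer and chain the stages so that F_0 flows through to F_s.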
In step 5, the intermediate feature F_s obtained in step 4.6 is input into the convolutional long short-term memory network to extract the temporal features of F_s. The network consists of a forget gate, an input gate, candidate memory cells, and an output gate, computed as follows:
Forget gate:
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
where f_t is the output of the forget gate at time t, σ is the sigmoid function, * denotes matrix convolution, ∘ denotes the Hadamard (element-wise) product, W_xf is the convolution weight matrix between the forget gate and the input X_t, W_hf the convolution weight matrix between the forget gate and the hidden state H_{t-1} at time t-1, W_cf the element-wise weight matrix between the forget gate and the memory cell state C_{t-1} at time t-1, and b_f the bias matrix of the forget gate;
Input gate:
I_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
where I_t is the output of the input gate at time t, W_xi is the convolution weight matrix between the input gate and the input X_t, W_hi the convolution weight matrix between the input gate and the hidden state H_{t-1}, W_ci the element-wise weight matrix between the input gate and the memory cell state C_{t-1}, and b_i the bias matrix of the input gate;
Candidate memory cell state:
C̃_t = tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
where C̃_t is the candidate memory cell state at time t, tanh is the hyperbolic tangent activation, W_xc is the convolution weight matrix between the candidate state and the input X_t, W_hc the convolution weight matrix between the candidate state and the hidden state H_{t-1}, and b_c the bias matrix of the candidate state;
Output gate:
O_t = σ(W_xo * X_t + W_ho * H_{t-1} + b_o)
where O_t is the output of the output gate at time t, W_xo is the convolution weight matrix between the output gate and the input X_t, W_ho the convolution weight matrix between the output gate and the hidden state H_{t-1}, and b_o the bias matrix of the output gate;
Memory cell state:
C_t = f_t ∘ C_{t-1} + I_t ∘ C̃_t
where C_t is the memory cell state at time t, C̃_t the candidate state at time t, I_t the input gate output, f_t the forget gate output, and C_{t-1} the memory cell state at time t-1;
Hidden state:
H_t = O_t ∘ tanh(C_t)
where H_t is the hidden-state output at time t.
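A minimal scalar sketch of one ConvLSTM cell step following the gate computations of step 5: with a 1x1 single-channel feature, the convolutions and Hadamard products both degenerate to ordinary multiplications. The weight and input values below are arbitrary illustrative choices, not from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def convlstm_cell_step(x_t, h_prev, c_prev, w):
    """One time step of a scalar ConvLSTM cell: forget gate, input gate,
    candidate state, output gate, then cell and hidden state updates."""
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])
    c_cand = math.tanh(w["xc"] * x_t + w["hc"] * h_prev + w["bc"])
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + w["bo"])
    c_t = f_t * c_prev + i_t * c_cand   # memory cell state C_t
    h_t = o_t * math.tanh(c_t)          # hidden state H_t
    return h_t, c_t

# Run a short sequence oldest-frame-first, as in step 5; all weights are 0.5.
weights = {k: 0.5 for k in
           ["xf", "hf", "cf", "bf", "xi", "hi", "ci", "bi",
            "xc", "hc", "bc", "xo", "ho", "bo"]}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:
    h, c = convlstm_cell_step(x, h, c, weights)
```

After consuming the final (key) frame, `h` plays the role of the network output H_n.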
The step 5 specifically comprises the following steps:
Step 5.1, input the intermediate feature F_s obtained in step 4.6 into the convolutional long short-term memory network frame by frame according to the sequence length n, from the (n-1)-th down to the 0-th feature map of F_s; the input for the i-th frame feature map is denoted X_i;
Step 5.2, after X_i is input, the forget gate is computed from X_i, the hidden state H_{i-1} of the previous frame, and the cell state C_{i-1}, outputting the forget gate information f_i;
Step 5.3, at the same time, the input gate is computed from X_i, the hidden state H_{i-1}, and the cell state C_{i-1}, outputting I_i;
Step 5.4, at the same time, the candidate memory cell state is computed from X_i and the hidden state H_{i-1}, outputting the candidate memory cell state C̃_i of the i-th frame;
Step 5.5, at the same time, the output gate is computed from X_i and the hidden state H_{i-1}, outputting O_i;
Step 5.6, the cell state C_i of the current frame is computed from the forget gate information f_i, the input gate output I_i, the candidate memory cell state C̃_i, and the cell state C_{i-1} of the previous frame;
Step 5.7, the hidden state H_i of the current frame is computed from the cell state C_i and the output gate output O_i;
Step 5.8, the next frame feature map X_{i+1} is input and steps 5.2 to 5.7 are repeated until the key frame has been input, yielding H_n, which is taken as the output feature of the whole convolutional long short-term memory network.
The step 6 specifically comprises the following steps:
Step 6.1, denote the output feature H_n from step 5 as H_1, input H_1 into the first feature extraction layer, and output the highest-resolution feature map H_f1;
Step 6.2, downsample H_f1, denote the result as H_2, input it into the second feature extraction layer, and output feature map H_f2;
Step 6.3, downsample H_f2, denote the result as H_3, input it into the third feature extraction layer, and output feature map H_f3;
Step 6.4, upsample the three feature maps H_f1, H_f2, H_f3 of different scales, denoting the resulting feature maps H_f1_up, H_f2_up, H_f3_up;
Step 6.5, merge the upsampled, same-size feature maps H_f1_up, H_f2_up, H_f3_up into one multi-scale feature map, denoted H_f.
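The multi-scale pyramid of step 6 can be illustrated with a toy 1-D example. Stride-2 averaging for downsampling and nearest-neighbour upsampling are illustrative choices here, since the patent does not specify the exact operators.

```python
def downsample(xs):
    """Stride-2 average pooling: the down-scaling between extraction layers."""
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs) - 1, 2)]

def upsample(xs, length):
    """Nearest-neighbour upsampling back to the shared top resolution."""
    return [xs[min(i * len(xs) // length, len(xs) - 1)] for i in range(length)]

h1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
hf1 = h1                  # highest-resolution feature map (stand-in for H_f1)
hf2 = downsample(hf1)     # second scale (stand-in for H_f2)
hf3 = downsample(hf2)     # third scale (stand-in for H_f3)
# Upsample every scale to the resolution of hf1 and merge channel-wise (step 6.5).
hf = list(zip(hf1, upsample(hf2, len(hf1)), upsample(hf3, len(hf1))))
```

Each position of `hf` now carries one value per scale, mirroring the channel concatenation that forms the multi-scale feature map H_f.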
The step 7 specifically comprises the following steps:
Step 7.1, input the multi-scale feature map H_f obtained in step 6.5 into three fully connected layers; the feature at each point of the feature map predicts several anchor boxes through the fully connected layers, with anchor sizes set according to the target sizes;
Step 7.2, map the generated anchor boxes and the ground-truth bounding boxes onto the d-θ plane and compute their intersection-over-union (IoU) there; set different upper and lower thresholds for different target categories, assign anchors whose IoU exceeds the upper threshold as positive samples, assign anchors whose IoU is below the lower threshold as negative samples, and discard anchors whose IoU lies between the upper and lower thresholds;
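The anchor assignment of step 7.2 can be sketched with axis-aligned boxes on the d-θ plane. The thresholds and box coordinates below are illustrative values, not from the patent.

```python
def iou_2d(a, b):
    """Intersection-over-union of two axis-aligned boxes given as
    (d_min, theta_min, d_max, theta_max) on the d-theta plane."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_anchor(anchor, gt_box, hi_thresh, lo_thresh):
    """Label an anchor positive / negative / ignored by its IoU with a
    ground-truth box, per the threshold rule of step 7.2."""
    iou = iou_2d(anchor, gt_box)
    if iou >= hi_thresh:
        return "positive"
    if iou < lo_thresh:
        return "negative"
    return "ignored"

gt = (0.0, 0.0, 2.0, 2.0)
anchors = [(0.0, 0.0, 2.0, 2.0), (0.3, 0.3, 2.3, 2.3), (5.0, 5.0, 7.0, 7.0)]
labels = [assign_anchor(a, gt, 0.6, 0.45) for a in anchors]
```

The first anchor matches the ground truth exactly, the second overlaps partially and falls between the two thresholds, and the third is disjoint.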
Step 7.3, compute the classification loss function L_cls of the anchor boxes as:
L_cls = -α (1 - p̂_i)^γ · log(p̂_i)
where p̂_i is the predicted probability that the i-th anchor box belongs to a class-c target, and α and γ are two hyperparameters;
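The classification loss of step 7.3 is parameterized by α and γ, which matches the standard focal loss; the sketch below is written under that assumption, with illustrative α, γ, and probability values.

```python
import math

def focal_loss(p_hat, alpha=0.25, gamma=2.0):
    """Focal-style classification loss for one positive anchor: the factor
    (1 - p)^gamma down-weights already well-classified examples."""
    return -alpha * (1.0 - p_hat) ** gamma * math.log(p_hat)

confident = focal_loss(0.9)   # well-classified anchor: tiny loss
hard = focal_loss(0.1)        # badly misclassified anchor: much larger loss
```

The modulating factor makes hard anchors dominate the gradient, which is useful given the extreme foreground/background imbalance of anchor-based detection.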
Step 7.4, compute the angle loss function L_dir of the anchor boxes as a cross-entropy over the discretized rotation angles:
L_dir = -Σ_r p_{i,r} · log(p̂_{i,r})
where p̂_{i,r} is the predicted probability that the rotation angle of the i-th anchor box is r, and p_{i,r} is the corresponding ground-truth probability;
Step 7.5, compute the position loss function L_reg of the anchor boxes with a smooth-L1 penalty:
L_reg = Σ_i SmoothL1(x̂_i - x_i), where SmoothL1(e) = 0.5 e² / β for |e| < β and |e| - 0.5 β otherwise,
x̂_i is the predicted geometric center of the i-th anchor box, x_i is the geometric center of the corresponding ground-truth box, and β is a hyperparameter;
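The position loss of step 7.5 is parameterized by β, consistent with a smooth-L1 penalty on the center residuals; the sketch below is written under that assumption, with illustrative centers and β.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth-L1 penalty on a regression residual: quadratic near zero,
    linear for large errors, switching at the hyperparameter beta."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def box_reg_loss(pred_centers, gt_centers, beta=1.0):
    """Position loss: summed smooth-L1 over predicted-vs-true center residuals."""
    return sum(smooth_l1(p - g, beta) for p, g in zip(pred_centers, gt_centers))

loss = box_reg_loss([1.2, 0.0, 3.0], [1.0, 0.5, 1.0], beta=1.0)
```

The quadratic region keeps gradients small for near-correct boxes, while the linear region prevents outlier boxes from dominating the loss.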
Step 7.6, compute the total loss function of the anchor boxes by combining the losses of the three subtasks:
L_total = β_1 · L_cls + β_2 · L_reg + β_3 · L_dir
where L_cls is the classification task loss, L_reg the regression task loss, L_dir the angle classification task loss, and β_1, β_2, β_3 are constant weight parameters for the three losses.
The step 8 specifically comprises the following steps:
Step 8.1, loading a data set category label, determining an evaluation index and designing an ablation experiment;
Step 8.2, setting a data collection point cloud input range;
Step 8.3, setting a voxelized range;
Step 8.4, setting the number of voxel grids in two groups of experiments;
step 8.5, setting the maximum training voxel number and the maximum test voxel number;
Step 8.6, in the training stage, set the optimizer, the learning-rate schedule, the proportion of warm-up steps, the maximum and minimum learning-rate multipliers, and the number of training epochs;
Step 8.7, perform simulation experiments to demonstrate the technical effects of the invention.
The invention has the beneficial effects that:
The method converts point cloud information from a rectangular coordinate system into a spherical coordinate system in which the point cloud is more uniformly distributed, extracts temporal features from the point cloud with a convolutional long short-term memory network, and captures the dependency of a 3D target between adjacent frames, strengthening the ability of the fused feature map to represent multi-frame point cloud information. Because the temporal feature information of the different point cloud frames is fully exploited, the method improves the average accuracy and robustness of continuous 3D object detection tasks.
The invention transfers the 3D object detection task from the rectangular coordinate system to the spherical coordinate system, mitigating the uneven distribution of the point cloud. A convolutional long short-term memory network is added during point cloud feature extraction, remedying the underuse of adjacent point cloud frame information in other 3D object detection methods. The final experimental results show that, compared with 3D object detection methods that do not use adjacent-frame information, the method achieves higher average accuracy and robustness.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a schematic diagram of the correspondence between coordinates of a spherical coordinate system and coordinates of a three-dimensional rectangular coordinate system.
Fig. 3 is a schematic view of voxel division in a spherical coordinate system.
FIG. 4 is a schematic diagram of a convolutional long and short term memory network.
Fig. 5 is a schematic diagram of multi-scale feature extraction.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the 3D object detection method based on the convolutional long short-term memory network specifically comprises the following steps:
Step1, acquiring nuScenes data sets:
Construct corresponding training and test sets using the nuScenes dataset, which is commonly used in the field of 3D object detection.
The nuScenes dataset is a large autonomous driving dataset. nuScenes data were collected mainly in Singapore and Boston, with driving routes carefully planned to capture challenging scenes. The dataset contains 1000 scenes of 20 s each, covering different environments, times of day, and weather conditions. To balance the differences in category counts, the dataset also adjusts the number of scenes containing rare categories.
In the nuScenes dataset, the point cloud data are stored as coordinates in a three-dimensional rectangular coordinate system; the data include the XYZ coordinates (x, y, z) and the reflection intensity I of each point, as in [(x_i, y_i, z_i, I_i)], where the subscript i refers to the serial number of a data item within a sequence or list; for convenience, this notation is also followed in the remainder of the description.
Step 2, converting the input data into a spherical coordinate system:
The step realizes the density homogenization of point cloud in space by replacing the coordinate system from a three-dimensional rectangular coordinate system to a spherical coordinate system on the whole structure level.
Step 2.1, input a continuous point cloud frame [(x_it, y_it, z_it, I_it)], where the subscript i again refers to the serial number of a data item within the sequence or list, and the subscript t = 0, 1, 2, ..., n-1 indexes the frame sequence in reverse order: t = 0 indicates that the point belongs to the labelled key frame, and time moves back one frame as t increases, forming a point cloud input sequence from a moment before the key frame up to the key frame.
Step 2.2, encode the continuous point cloud frames [(x_it, y_it, z_it, I_it)] into an input sequence [(x_i, y_i, z_i, I_i, t)].
Step 2.3, for each point (x_i, y_i, z_i, I_i) in the dataset point cloud, compute:
d_i = sqrt(x_i^2 + y_i^2 + z_i^2)
θ_i = arctan2(y_i, x_i)
φ_i = arcsin(z_i / d_i)
I_i = I_i
obtaining a point cloud sequence [(d_i, θ_i, φ_i, I_i, t)] in the spherical coordinate system, where d denotes the straight-line distance of the point from the origin (the lidar), θ is the azimuth of the point, φ is the pitch angle of the point, and I is the reflection intensity of the point.
The corresponding relationship between the coordinates of the spherical coordinate system and the coordinates of the three-dimensional rectangular coordinate system is shown in fig. 2.
Step 2.4, re-decode the point cloud sequence into a list of point cloud frames [(d_i, θ_i, φ_i, I_i)] according to t = 0, 1, 2, ..., n-1, and record the length n of each sequence in the batch.
And 3, carrying out voxel division on the point cloud space according to spherical coordinates:
and converting the geometrical form representation of the point cloud into a voxel representation form closest to the point cloud, and performing preliminary extraction on voxel characteristics to unbind the voxel characteristics from the spatial absolute position.
Step 3.1, in the spherical coordinate system, divide the space into a voxel grid whose side lengths along the d, θ, and φ directions are v_d, v_θ, and v_φ respectively. The division range is not unbounded: the three dimensions are limited to [d_min, d_max], [θ_min, θ_max], and [φ_min, φ_max], where d_min and d_max represent the lower and upper limits of the target-to-radar distance, θ_min and θ_max the limits of the azimuth dimension, and φ_min and φ_max the limits of the pitch-angle dimension.
The voxel division is illustrated in Fig. 3.
And then grouping the point clouds according to the voxel grids where each point in the point clouds is located.
Step 3.2, voxelize each frame of the point cloud frame list [(d_i, θ_i, φ_i, I_i)].
Step 3.3, extract a feature for each voxel as follows: for every occupied voxel, compute the mean of the points it contains and the mean reflection intensity Ī over all points in that voxel, and take as the voxel feature the offsets (d_c, θ_c, φ_c) from the point mean to the voxel center together with Ī. The resulting feature matrix, with one row (d_c, θ_c, φ_c, Ī) per voxel of the single-frame point cloud, is denoted F_0.
And 4, further extracting intermediate features of the voxel features through a convolution network:
The intermediate feature extraction is composed of a convolution network formed by sequentially connecting 6 convolution layers, and comprises 1 input layer, 4 intermediate layers and 1 output layer, wherein the specific structure comprises an input layer, a first intermediate layer, a second intermediate layer, a third intermediate layer, a fourth intermediate layer and an output layer.
Step 4.1: input the feature matrix F_0 obtained in step 3.3 into the input layer and output feature F_1. The input layer consists of one SubMConv3d convolution layer (feature dimension: 16, convolution kernel size: 3 × 3, stride: 2).
Step 4.2: input feature F_1 into the first intermediate layer and output feature F_2. The first intermediate layer consists of one SubMConv3d convolution layer (feature dimension: 16, convolution kernel size: 3 × 3, stride: 2).
Step 4.3: input feature F_2 into the second intermediate layer and output feature F_3. The second intermediate layer is formed by sequentially connecting 3 convolution layers; its specific structure is SparseConv3d convolution layer (feature dimension: 32, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 32, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 32, convolution kernel size: 3 × 3, stride: 2).
Step 4.4: input feature F_3 into the third intermediate layer and output feature F_4. The third intermediate layer is formed by sequentially connecting 3 convolution layers; its specific structure is SparseConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2).
Step 4.5: input feature F_4 into the fourth intermediate layer and output feature F_5. The fourth intermediate layer is formed by sequentially connecting 3 convolution layers; its specific structure is SparseConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2).
Step 4.6: input feature F_5 into the output layer and output the intermediate feature F_s. The output layer consists of one SubMConv3d convolution layer (feature dimension: 128, convolution kernel size: 3 × 3, stride: 2). F_s is a sequence of feature maps, stored in matrix form.
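As an illustration, the following shape-level sketch propagates a voxel grid through the six stages above. It assumes that only the three SparseConv3d layers actually change resolution (submanifold convolutions preserve the sparsity pattern and hence the spatial shape, even though the text lists a stride for every layer), each with kernel 3, stride 2 and padding 1; the grid size 1408 × 2048 × 32 is taken from step 8.4 below. The per-stage channel counts follow steps 4.1 to 4.6.

```python
# Shape-level sketch of the 6-stage sparse backbone of step 4.
# Assumption: SubMConv3d preserves spatial shape; SparseConv3d downsamples
# with kernel 3, stride 2, padding 1 (conventional for sparse conv backbones).

def conv_out(size, kernel=3, stride=2, pad=1):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def backbone_shapes(grid):
    """grid = (d, theta, phi) voxel counts; returns (stage, channels, shape)."""
    stages, shape = [], tuple(grid)
    for name, channels, downsample in [
        ("input (SubMConv3d)", 16, False),
        ("middle 1 (SubMConv3d)", 16, False),
        ("middle 2 (SparseConv3d + 2 SubMConv3d)", 32, True),
        ("middle 3 (SparseConv3d + 2 SubMConv3d)", 64, True),
        ("middle 4 (SparseConv3d + 2 SubMConv3d)", 64, True),
        ("output (SubMConv3d)", 128, False),
    ]:
        if downsample:
            shape = tuple(conv_out(s) for s in shape)
        stages.append((name, channels, shape))
    return stages

for name, ch, shape in backbone_shapes((1408, 2048, 32)):
    print(f"{name:42s} C={ch:<4d} spatial={shape}")
```

Under these assumptions the grid shrinks by a factor of 8 per dimension across the three downsampling stages, ending at 176 × 256 × 4 with 128 channels.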
Step 5: extracting temporal features through the convolutional long short-term memory network:
The intermediate feature F_s obtained in step 4.6 is input into the convolutional long short-term memory network to extract the time-dimension features of the feature sequence. The convolutional long short-term memory network is formed by cascading several ConvLSTM layers; each ConvLSTM layer consists of a forget gate, an input gate, candidate memory cells and an output gate, connected as shown in figure 4.
Forget gate:
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
where f_t is the output of the forget gate at time t, σ is the sigmoid function, * denotes matrix convolution, ∘ denotes the matrix Hadamard product, W_xf is the convolution weight matrix between the forget gate and the input X_t, W_hf is the convolution weight matrix between the forget gate and the hidden state H_{t-1} at time t-1, W_cf is the element-wise weight matrix between the forget gate and the memory cell state C_{t-1} at time t-1, and b_f is the bias matrix of the forget gate;
Input gate:
I_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
where I_t is the output of the input gate at time t, W_xi is the convolution weight matrix between the input gate and the input X_t, W_hi is the convolution weight matrix between the input gate and the hidden state H_{t-1} at time t-1, W_ci is the element-wise weight matrix between the input gate and the memory cell state C_{t-1} at time t-1, and b_i is the bias matrix of the input gate;
Candidate memory cell state:
C̃_t = tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
where C̃_t is the output of the candidate memory cell state at time t, tanh is the hyperbolic tangent activation function, W_xc is the convolution weight matrix between the candidate memory cell state and the input X_t, W_hc is the convolution weight matrix between the candidate memory cell state and the hidden state H_{t-1} at time t-1, and b_c is the bias matrix of the candidate memory cell state;
Output gate:
O_t = σ(W_xo * X_t + W_ho * H_{t-1} + b_o)
where O_t is the output of the output gate at time t, W_xo is the convolution weight matrix between the output gate and the input X_t, W_ho is the convolution weight matrix between the output gate and the hidden state H_{t-1} at time t-1, and b_o is the bias matrix of the output gate;
Hidden state:
H_t = O_t ∘ tanh(C_t)
where H_t is the output of the hidden state at time t, O_t is the output of the output gate at time t, and C_t is the memory cell state at time t.
Memory cell state:
C_t = f_t ∘ C_{t-1} + I_t ∘ C̃_t
where C_t is the memory cell state at time t, C̃_t is the candidate memory cell state at time t, I_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, and C_{t-1} is the memory cell state at time t-1.
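The gate equations above can be exercised numerically. The sketch below collapses each convolution (*) and Hadamard product (∘) to a scalar multiplication, i.e., a single spatial location with a single channel; the weight values are purely illustrative, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One ConvLSTM time step, reduced to a single spatial location and single
# channel so that convolutions and Hadamard products become scalar products.
def convlstm_step(x_t, h_prev, c_prev, w):
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])  # forget gate f_t
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])  # input gate I_t
    c_cand = math.tanh(w["xc"] * x_t + w["hc"] * h_prev + w["bc"])                # candidate C~_t
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + w["bo"])                     # output gate O_t
    c_t = f_t * c_prev + i_t * c_cand       # memory cell state C_t
    h_t = o_t * math.tanh(c_t)              # hidden state H_t
    return h_t, c_t

# Illustrative weights (all 0.5), one step on input 1.0 from a zero state.
weights = dict.fromkeys(
    ["xf", "hf", "cf", "bf", "xi", "hi", "ci", "bi",
     "xc", "hc", "bc", "xo", "ho", "bo"], 0.5)
h, c = convlstm_step(1.0, 0.0, 0.0, weights)
print(h, c)
```

Note how the forget and input gates also receive the previous cell state C_{t-1} (the W_cf and W_ci "peephole" terms), while the output gate does not, matching the equations above.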
Step 5.1: according to the sequence length n, input the (n-1)-th, ..., 1st, 0th feature maps of the intermediate feature F_s obtained in step 4.6 into the convolutional long short-term memory network in that order; the input of the i-th frame feature map is denoted X_i.
Step 5.2: after X_i is input, the forget gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting the forget gate information f_i.
Step 5.3: simultaneously, the input gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting I_i.
Step 5.4: simultaneously, the candidate memory cell state is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting the candidate memory cell state C̃_i of the i-th frame.
Step 5.5: simultaneously, the output gate is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting O_i.
Step 5.6: the cell state information C_i of the current frame is computed from the forget gate information f_i, the input gate output I_i, the candidate memory cell state C̃_i and the cell state information C_{i-1} of the previous frame.
Step 5.7: the hidden state H_i of the current frame is computed from the cell state C_i of the current frame and the output gate output O_i.
Step 5.8: input the next frame feature map X_{i+1} and repeat steps 5.2 to 5.7 until the key frame has been input, obtaining H_n, which is finally taken as the output feature of the whole convolutional long short-term memory network.
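The frame-by-frame recurrence described above can be sketched as a loop that threads the hidden and cell states from the oldest along-way frame (index n-1) down to the key frame (index 0). The tiny scalar cell below is a deliberately simplified stand-in for the convolutional gates, with a single shared illustrative gate value:

```python
import math

def tiny_cell(x, h, c):
    """Toy scalar LSTM-like cell: shared illustrative gates, not trained weights."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    g = sig(h + x)                      # one shared value stands in for f, i, o
    c_new = g * c + g * math.tanh(x + h)
    return g * math.tanh(c_new), c_new  # (hidden state, cell state)

def run_sequence(frames):
    """frames[t] holds the frame at offset t (t = 0 is the key frame)."""
    h = c = 0.0                                  # initial hidden/cell state
    for t in range(len(frames) - 1, -1, -1):     # feed n-1, n-2, ..., 1, 0
        h, c = tiny_cell(frames[t], h, c)
    return h                                     # H_n, handed to step 6
```

The loop direction is the point: the key frame is processed last, so its hidden state H_n has already absorbed the temporal context of every preceding along-way frame.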
Step 6: extracting multi-scale features:
As shown in the figure, feature H_n is denoted H_1, and multi-scale features are extracted from H_1 through a network similar to a feature pyramid structure.
Step 6.1: input H_1 into the first feature extraction layer for feature extraction with feature dimension 128, and output the highest-resolution feature map H_f1.
Step 6.2: downsample H_f1 with an interval of 2, denote the sampling result H_2, input it into the second feature extraction layer for feature extraction with feature dimension 256, and output feature map H_f2.
Step 6.3: downsample H_f2 with an interval of 2, denote the sampling result H_3, input it into the third feature extraction layer for feature extraction with feature dimension 256, and output feature map H_f3.
Step 6.4: upsample the three feature maps H_f1, H_f2, H_f3 of different scales with an upsampling dimension of 256, and denote the sampling results H_f1_up, H_f2_up, H_f3_up.
Step 6.5: concatenate the upsampled feature maps H_f1_up, H_f2_up, H_f3_up, which now have the same scale, into the multi-scale feature map H_f, whose feature depth is 256 × 3.
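The scale bookkeeping of steps 6.1 to 6.5 can be checked at the tensor-shape level. In the sketch below the feature-extraction layers are replaced by averaging 1×1 projections and the upsampling by nearest-neighbour repetition, so only the shapes are meaningful; the channel counts (128, then 256, then 256 per scale after upsampling, concatenated to 256 × 3 = 768) follow the text:

```python
import numpy as np

def extract(x, out_ch):
    """Stand-in for a feature-extraction layer: averaging 1x1 projection."""
    w = np.full((out_ch, x.shape[0]), 1.0 / x.shape[0])
    return np.tensordot(w, x, axes=1)

def down2(x):
    """Interval-2 downsampling: keep every second row and column."""
    return x[:, ::2, ::2]

def up_to(x, h, w_):
    """Nearest-neighbour upsampling back to h x w_."""
    return x.repeat(h // x.shape[1], axis=1).repeat(w_ // x.shape[2], axis=2)

def multi_scale(h1):
    hf1 = extract(h1, 128)                       # step 6.1: finest map
    hf2 = extract(down2(hf1), 256)               # step 6.2
    hf3 = extract(down2(hf2), 256)               # step 6.3
    h, w_ = hf1.shape[1], hf1.shape[2]
    ups = [extract(up_to(m, h, w_), 256) for m in (hf1, hf2, hf3)]  # step 6.4
    return np.concatenate(ups, axis=0)           # step 6.5: H_f

hf = multi_scale(np.ones((64, 16, 16)))          # (channels, height, width)
print(hf.shape)
```

With a (64, 16, 16) input the three scales are 16×16, 8×8 and 4×4, and the concatenated H_f has 768 channels at the finest resolution.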
Step 7: generating anchor boxes and performing classification, bounding-box regression and angle regression:
Step 7.1: input the multi-scale feature map H_f obtained in step 6.5 into three fully connected layers; the feature of each point on the feature map predicts several anchor boxes through the fully connected layers, and the anchor sizes are set according to the target sizes. In this example, the vehicle and pedestrian targets of the data set are set to 3.9 m × 1.6 m × 1.65 m and 0.6 m × 0.8 m × 1.73 m respectively, and the rotation angle r of each group of anchor boxes takes the two values 0° and 90°, forming four different boxes.
Step 7.2: map the generated anchor boxes and the ground-truth bounding boxes onto the d-θ plane and compute the intersection-over-union there. Different upper and lower thresholds are set for different target categories: anchor boxes whose intersection-over-union is above the upper threshold are assigned as positive samples, those below the lower threshold are assigned as negative samples, and those between the two thresholds are discarded. In this example, the upper and lower intersection-over-union thresholds of the vehicle target are set to 0.6 and 0.45, and those of the pedestrian target to 0.4 and 0.3.
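Anchor generation and sample assignment can be illustrated with the concrete sizes and thresholds given above. The sketch below generates the four anchors per location and assigns positives/negatives by IoU; the IoU is computed on axis-aligned 2D footprints, a simplification of the d-θ plane computation in the text:

```python
# Anchor footprints (length, width in metres) and rotations from step 7.1.
CAR = (3.9, 1.6)
PED = (0.6, 0.8)
ROTATIONS = (0.0, 90.0)

def anchors_at(cx, cy):
    """Four anchors per location: two sizes x two rotations."""
    return [(cx, cy, l, w, r) for (l, w) in (CAR, PED) for r in ROTATIONS]

def iou_2d(a, b):
    """Axis-aligned 2D IoU; boxes given as (cx, cy, length, width, ...)."""
    ax0, ax1 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay0, ay1 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by0, by1 = b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def assign(anchor, gt, hi=0.6, lo=0.45):
    """Vehicle thresholds from step 7.2: positive / negative / ignored."""
    v = iou_2d(anchor, gt)
    if v >= hi:
        return "positive"
    if v < lo:
        return "negative"
    return "ignored"
```

An anchor exactly on a ground-truth box is positive (IoU 1), a far-away anchor is negative, and a partially overlapping one falls into the discarded band between 0.45 and 0.6.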
Step 7.3: compute the classification loss function L_cls of the anchor boxes as follows:
L_cls = -α(1 - p̂_i^c)^γ log(p̂_i^c)
where p̂_i^c is the predicted probability that the i-th anchor box belongs to the class-c target (two target classes in total), and α and γ are two hyperparameters; in this embodiment, α is 0.25 and γ is 2.0.
Step 7.4: compute the angle loss function L_dir of the anchor boxes as follows:
L_dir = -Σ_r q_i^r log(p̂_i^r)
where p̂_i^r represents the probability that the predicted rotation angle of the i-th anchor box is r, and q_i^r is the corresponding true probability.
Step 7.5: compute the position loss function L_reg of the anchor boxes as follows:
L_reg = Σ_i SmoothL1(x̂_i - x_i), where SmoothL1(x) = 0.5x²/β for |x| < β and |x| - 0.5β otherwise,
where x̂_i is the predicted geometric center of the i-th anchor box, x_i is the geometric center of the real box of the corresponding target, and β is a hyperparameter; β is 1 in this example.
Step 7.6: compute the total loss function of the anchor boxes by combining the losses of the three subtasks:
L_total = β_1 L_cls + β_2 L_reg + β_3 L_dir
where L_cls is the classification task loss, L_reg is the regression task loss, L_dir is the angle classification task loss, and β_1, β_2, β_3 are weight constant parameters of the three losses; the method sets β_1 = 1.0, β_2 = 2.0, β_3 = 0.2, so that the model focuses more on the bounding-box regression and classification tasks.
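The three loss terms and their weighted combination can be sketched numerically. The per-term formulas below (focal loss, smooth-L1, cross-entropy on the rotation bin) are the standard choices implied by the hyperparameters α, γ and β, evaluated here for a single anchor:

```python
import math

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal classification loss for one positive anchor with predicted prob p."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def smooth_l1(x, beta=1.0):
    """Smooth-L1 on one regression residual x (quadratic near 0, linear beyond beta)."""
    return 0.5 * x * x / beta if abs(x) < beta else abs(x) - 0.5 * beta

def direction_loss(p_true):
    """Cross-entropy on the probability assigned to the true rotation bin."""
    return -math.log(p_true)

def total_loss(p_cls, residuals, p_dir, b1=1.0, b2=2.0, b3=0.2):
    """Weighted total loss with beta_1 = 1.0, beta_2 = 2.0, beta_3 = 0.2."""
    l_cls = focal_loss(p_cls)
    l_reg = sum(smooth_l1(r) for r in residuals)
    l_dir = direction_loss(p_dir)
    return b1 * l_cls + b2 * l_reg + b3 * l_dir
```

A perfect prediction (class probability 1, zero residuals, correct rotation bin) yields zero total loss, and the β weights make a given regression error cost ten times more than the same-sized direction error.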
Step 8: setting hyperparameters and training parameters, and training and experimentally verifying the convolutional long short-term memory network:
Ablation experiments verify the guiding effect of the spherical-coordinate 3D target detection method and the model based on the convolutional long short-term memory network.
Step 8.1: in this example, the loaded labels are of the pedestrian and vehicle types, and the evaluation index is the average precision (AP). Because the data set is collected at 20 Hz while annotated key frames occur at 2 Hz, there are on average 10 unlabeled point cloud frames before each key frame; however, since the experiments run on an NVIDIA RTX 2080Ti graphics card whose limited video memory cannot support very long sequence inputs, the ablation experiments load 2 along-way frames and 0 along-way frames respectively.
Step 8.2: limit the input range of the point cloud in the data set to -50 m ≤ x ≤ 50 m, -50 m ≤ y ≤ 50 m and -5 m ≤ z ≤ 3 m.
Step 8.3: set the voxelization range to 0 m ≤ d ≤ 50 m, -180° ≤ θ ≤ 180° and 0 ≤ φ ≤ 31, where φ is the nuScenes data set scanning ring ID and each lidar scanning ring ID corresponds to a fixed pitch angle.
Step 8.4: set the numbers of voxel grids in the two experiments to d_1 = 1408, θ_1 = 2048, φ_1 = 32 and d_2 = 1088, θ_2 = 1088, φ_2 = 32 respectively.
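A quick arithmetic check of the two grid configurations shows how much smaller the second voxelization is:

```python
# Voxel-cell counts of the two grid configurations in step 8.4.
g1 = 1408 * 2048 * 32   # first experiment
g2 = 1088 * 1088 * 32   # second experiment
print(g1, g2, round(g1 / g2, 2))   # the second grid has ~2.44x fewer cells
```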
And 8.5, setting the maximum training voxel number as 20000 and setting the maximum test voxel number as 40000.
Step 8.6: in the training stage, the optimizer is AdamW, the learning rate is set to 0.000144, the learning-rate adjustment policy is cyclic with a rising-step proportion of 0.3, a target maximum multiplier of 10 and a minimum multiplier of 0.0001, and training runs for 40 generations.
Step 8.7: simulation experiments illustrate the technical effects of the invention; the experimental index results are shown in the table below.
TABLE 1 continuous 3D target detection experiment results
According to the experimental results, when 2 along-way frames are loaded, the average precision of both experimental groups is greatly improved over the control group that loads 0 along-way frames (i.e., simulating the absence of the convolutional long short-term memory network), showing that the continuous 3D target detection method provided by the invention can indeed effectively exploit the temporal information in a continuous point cloud sequence. It is further observed that reducing the number of voxel grids from 1408 × 2048 × 32 to 1088 × 1088 × 32 does not substantially reduce the detection accuracy of the model. Combined with an analysis of the sensor parameters, the main reason is considered to be that the lidar generates only about 1080 ± 10 points per scanning cycle, so decreasing the number of division grids in the θ dimension from 2048 to 1088 causes almost no loss of spatial information in that dimension; the slight decrease in detection performance is attributed to the reduced number of divisions in the d dimension. The above is only intended to illustrate the technical idea of the present invention and does not limit its protection scope; any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A 3D target detection method based on a convolutional long short-term memory network, characterized by comprising the following steps:
Step 1, acquiring or constructing point cloud data, wherein the point cloud data are stored in a three-dimensional rectangular coordinate system coordinate form in a point cloud data set;
Step 2, converting a three-dimensional rectangular coordinate system of the point cloud data into a spherical coordinate system to realize the density homogenization of the point cloud in space;
step 3, carrying out voxel division on the point cloud space according to spherical coordinates to obtain voxel characteristics, and carrying out preliminary extraction on the voxel characteristics to unbind the voxel characteristics and the space absolute position;
Step 4, extracting intermediate features of the voxel features through a convolution network;
Step 5, extracting time features of the intermediate features through a convolution long-short-term memory network to obtain output features H n of the convolution long-short-term memory network;
Step 6, extracting multi-scale features of the H n to obtain a feature map H f;
step 7, generating an anchor frame by utilizing H f, and classifying the anchor frame, carrying out bounding box regression and angle regression;
And 8, setting super parameters and training parameters, training the convolution long-short-term memory network, and verifying the algorithm effect.
2. The 3D object detection method based on the convolutional long-term memory network according to claim 1, wherein the step 1 specifically comprises:
A nuScenes point cloud data set is used; the point cloud data comprise the XYZ coordinates (x, y, z) of each point and the reflection intensity I, denoted [(x_i, y_i, z_i, I_i)], where the subscript i refers to the sequence number of the corresponding data point.
3. The 3D object detection method based on the convolutional long-term memory network according to claim 1, wherein the step 2 specifically comprises:
Step 2.1: arrange the point cloud data [(x_i, y_i, z_i, I_i)] obtained in step 1 in time order to obtain continuous point cloud frames [(x_it, y_it, z_it, I_it)] with subscript t = 0, 1, 2, ..., n-1 in reverse frame order, i.e., t = 0 indicates that the point belongs to an annotated key frame, and increasing t steps back to earlier frames, forming a point cloud input sequence from a certain moment before the key frame up to the key frame;
step 2.2, encoding the continuous point cloud frames [ (x it,yit,zit,Iit) ] into an input sequence [ (x i,yi,zi,Ii, t) ];
Step 2.3: for each point (x_i, y_i, z_i, I_i) in the data set point cloud, calculate:
d_i = √(x_i² + y_i² + z_i²)
θ_i = arctan2(y_i, x_i)
φ_i = arcsin(z_i / d_i)
I_i = I_i
obtaining the point cloud sequence [(d_i, θ_i, φ_i, I_i, t)] in the spherical coordinate system, where d denotes the straight-line distance of the point from the origin (the lidar), θ is the azimuth of the point, φ is the pitch angle of the point, and I is the reflection intensity of the point;
step 2.4, re-decoding the point cloud sequence into point cloud frames in list form according to t=0, 1, 2..n-1 And the length n of each sequence in Batch is recorded.
4. The 3D object detection method based on the convolutional long-term memory network according to claim 3, wherein the step 3 is specifically:
Step 3.1: in the spherical coordinate system, divide the space into voxels whose side lengths in the three directions d, θ and φ are v_d, v_θ and v_φ respectively; since the division range is not infinite, the ranges of the three dimensions are [d_min, d_max], [θ_min, θ_max] and [φ_min, φ_max], where d_min and d_max represent the lower and upper limits of the target-to-radar distance, θ_min and θ_max represent the lower and upper limits of the azimuth dimension, and φ_min and φ_max represent the lower and upper limits of the pitch-angle dimension;
then, group the points of the point cloud according to the voxel grid each point falls into;
Step 3.2: voxelize each frame of point cloud in the point cloud frame list;
Step 3.3: extract the feature of each voxel in the following manner: for every voxel grid, take the average reflection intensity of all points in the grid, together with the relative distance from the mean position of the points in the voxel to the voxel center, as the voxel feature;
the resulting feature matrix is denoted feature F_0, in which each element is the feature information of one voxel grid of a single-frame point cloud, comprising the average reflection intensity of all points in the corresponding voxel grid and the relative distance from the mean position of the points in the voxel to the voxel center.
5. The 3D target detection method based on the convolution long-short-term memory network according to claim 4, wherein the network for extracting the voxel characteristics obtained in the step 3 is formed by sequentially connecting 6 convolution layers, and comprises 1 input layer, 4 middle layers and 1 output layer, and the specific structure comprises the input layer, the first middle layer, the second middle layer, the third middle layer, the fourth middle layer and the output layer.
6. The 3D object detection method based on the convolutional long-term memory network according to claim 5, wherein the specific steps of step 4 are as follows:
Step 4.1: input the feature matrix F_0 obtained in step 3.3 into the input layer and output feature F_1, the input layer consisting of one SubMConv3d convolution layer;
Step 4.2: input feature F_1 into the first intermediate layer and output feature F_2, the first intermediate layer comprising one SubMConv3d convolution layer;
Step 4.3: input feature F_2 into the second intermediate layer and output feature F_3, the second intermediate layer being formed by sequentially connecting 3 convolution layers;
its specific structure is SparseConv3d convolution layer → SubMConv3d convolution layer → SubMConv3d convolution layer;
Step 4.4: input feature F_3 into the third intermediate layer and output feature F_4, the third intermediate layer being formed by sequentially connecting 3 convolution layers;
its specific structure is SparseConv3d convolution layer → SubMConv3d convolution layer → SubMConv3d convolution layer;
Step 4.5: input feature F_4 into the fourth intermediate layer and output feature F_5, the fourth intermediate layer being formed by sequentially connecting 3 convolution layers;
its specific structure is SparseConv3d convolution layer → SubMConv3d convolution layer → SubMConv3d convolution layer;
Step 4.6: input feature F_5 into the output layer and output the intermediate feature F_s, the output layer consisting of one SubMConv3d convolution layer; F_s is a sequence of feature maps stored in matrix form.
7. The 3D target detection method based on the convolutional long short-term memory network according to claim 6, wherein in step 5 the intermediate feature F_s obtained in step 4.6 is input into the convolutional long short-term memory network to extract the time-dimension features of F_s; the convolutional long short-term memory network consists of a forget gate, an input gate, candidate memory cells and an output gate, and comprises the following computing operations:
forget gate:
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
wherein f_t is the output of the forget gate at time t, σ is the sigmoid function, * denotes matrix convolution, ∘ denotes the matrix Hadamard product, W_xf is the convolution weight matrix between the forget gate and the input X_t, W_hf is the convolution weight matrix between the forget gate and the hidden state H_{t-1} at time t-1, W_cf is the element-wise weight matrix between the forget gate and the memory cell state C_{t-1} at time t-1, and b_f is the bias matrix of the forget gate;
input gate:
I_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
wherein I_t is the output of the input gate at time t, W_xi is the convolution weight matrix between the input gate and the input X_t, W_hi is the convolution weight matrix between the input gate and the hidden state H_{t-1} at time t-1, W_ci is the element-wise weight matrix between the input gate and the memory cell state C_{t-1} at time t-1, and b_i is the bias matrix of the input gate;
candidate memory cell state:
C̃_t = tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
wherein C̃_t is the output of the candidate memory cell state at time t, tanh is the hyperbolic tangent activation function, W_xc is the convolution weight matrix between the candidate memory cell state and the input X_t, W_hc is the convolution weight matrix between the candidate memory cell state and the hidden state H_{t-1} at time t-1, and b_c is the bias matrix of the candidate memory cell state;
output gate:
O_t = σ(W_xo * X_t + W_ho * H_{t-1} + b_o)
wherein O_t is the output of the output gate at time t, W_xo is the convolution weight matrix between the output gate and the input X_t, W_ho is the convolution weight matrix between the output gate and the hidden state H_{t-1} at time t-1, and b_o is the bias matrix of the output gate;
hidden state:
H_t = O_t ∘ tanh(C_t)
wherein H_t is the output of the hidden state at time t, O_t is the output of the output gate at time t, and C_t is the memory cell state at time t;
memory cell state:
C_t = f_t ∘ C_{t-1} + I_t ∘ C̃_t
wherein C_t is the memory cell state at time t, C̃_t is the candidate memory cell state at time t, I_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, and C_{t-1} is the memory cell state at time t-1.
8. The 3D object detection method based on the convolutional long-term memory network according to claim 7, wherein the step 5 specifically comprises:
Step 5.1: according to the sequence length n, input the (n-1)-th, ..., 1st, 0th feature maps of the intermediate feature F_s obtained in step 4.6 into the convolutional long short-term memory network in that order, the input of the i-th frame feature map being denoted X_i;
Step 5.2: after X_i is input, the forget gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting the forget gate information f_i;
Step 5.3: simultaneously, the input gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting I_i;
Step 5.4: simultaneously, the candidate memory cell state is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting the candidate memory cell state C̃_i of the i-th frame;
Step 5.5: simultaneously, the output gate is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting O_i;
Step 5.6: the cell state information C_i of the current frame is computed from the forget gate information f_i, the input gate output I_i, the candidate memory cell state C̃_i and the cell state information C_{i-1} of the previous frame;
Step 5.7: the hidden state H_i of the current frame is computed from the cell state C_i of the current frame and the output gate output O_i;
Step 5.8: input the next frame feature map X_{i+1} and repeat steps 5.2 to 5.7 until the key frame has been input, obtaining H_n, which is finally taken as the output feature of the whole convolutional long short-term memory network.
9. The 3D object detection method based on the convolutional long-term memory network according to claim 8, wherein the step 6 specifically comprises:
Step 6.1, marking the output characteristic H n in the step 5 as H 1, inputting H 1 into a first characteristic extraction layer for characteristic extraction, and outputting a characteristic diagram H f1 with highest resolution;
Step 6.2, downsampling H f1, marking the sampling result as H 2, inputting the downsampling result into a second feature extraction layer, and outputting a feature map H f2;
Step 6.3: downsample H_f2, denote the sampling result H_3, input it into the third feature extraction layer, and output feature map H_f3;
Step 6.4, up-sampling three feature graphs H f1、Hf2、Hf3 with different scales, and marking the feature graph of the sampling result as H f1_up、Hf2_up、Hf3_up;
and 6.5, merging the feature images H f1_up、Hf2_up、Hf3_up which are subjected to up-sampling and have the same dimension into a multi-scale feature image, and recording as H f.
10. The 3D object detection method based on the convolutional long-term memory network according to claim 9, wherein the step 7 specifically comprises:
Step 7.1: input the multi-scale feature map H_f obtained in step 6.5 into three fully connected layers, the feature of each point on the feature map predicting several anchor boxes through the fully connected layers, the sizes of the anchor boxes being set according to the sizes of the targets;
Step 7.2: map the generated anchor boxes and the ground-truth bounding boxes onto the d-θ plane and perform the intersection-over-union calculation in the d-θ plane, set different upper and lower thresholds for different target categories, assign anchor boxes whose intersection-over-union is above the upper threshold as positive samples, assign anchor boxes whose intersection-over-union is below the lower threshold as negative samples, and discard anchor boxes whose intersection-over-union lies between the upper and lower thresholds;
Step 7.3: compute the classification loss function L_cls of the anchor boxes as follows:
L_cls = -α(1 - p̂_i^c)^γ log(p̂_i^c)
wherein p̂_i^c is the predicted probability that the i-th anchor box belongs to the class-c target, there being two target classes in total, and α and γ are two hyperparameters;
Step 7.4: compute the angle loss function L_dir of the anchor boxes as follows:
L_dir = -Σ_r q_i^r log(p̂_i^r)
wherein p̂_i^r represents the probability that the predicted rotation angle of the i-th anchor box is r, and q_i^r is the corresponding true probability;
Step 7.5: compute the position loss function L_reg of the anchor boxes as follows:
L_reg = Σ_i SmoothL1(x̂_i - x_i), where SmoothL1(x) = 0.5x²/β for |x| < β and |x| - 0.5β otherwise,
wherein x̂_i is the predicted geometric center of the i-th anchor box, x_i is the geometric center of the real box of the corresponding target, and β is a hyperparameter;
Step 7.6: compute the total loss function of the anchor boxes by combining the losses of the three subtasks:
L_total = β_1 L_cls + β_2 L_reg + β_3 L_dir
wherein L_cls is the classification task loss, L_reg is the regression task loss, L_dir is the angle classification task loss, and β_1, β_2, β_3 are weight constant parameters of the three losses.
CN202310719201.6A 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network Active CN116758534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310719201.6A CN116758534B (en) 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310719201.6A CN116758534B (en) 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network

Publications (2)

Publication Number Publication Date
CN116758534A CN116758534A (en) 2023-09-15
CN116758534B true CN116758534B (en) 2025-09-23

Family

ID=87958454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310719201.6A Active CN116758534B (en) 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network

Country Status (1)

Country Link
CN (1) CN116758534B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116794680A (en) * 2023-06-26 2023-09-22 西安电子科技大学 Logarithmic sphere coordinate 3D target detection method based on reflection intensity information guiding mechanism
CN119919929B (en) * 2025-04-03 2025-06-13 杭州曼孚科技有限公司 A method, device and medium for automatic annotation of point cloud based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113219493A (en) * 2021-04-26 2021-08-06 中山大学 End-to-end point cloud data compression method based on three-dimensional laser radar sensor
CN113268916A (en) * 2021-04-07 2021-08-17 浙江工业大学 Traffic accident prediction method based on space-time graph convolutional network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950467B (en) * 2020-08-14 2021-06-25 清华大学 Fusion network lane line detection method and terminal device based on attention mechanism
CN112529944B (en) * 2020-12-05 2022-11-18 东南大学 An End-to-End Unsupervised Optical Flow Estimation Method Based on Event Cameras
CN113378647B (en) * 2021-05-18 2024-03-29 浙江工业大学 Real-time track obstacle detection method based on three-dimensional point cloud
CN114979801A (en) * 2022-05-10 2022-08-30 上海大学 Dynamic video abstraction algorithm and system based on bidirectional convolution long-short term memory network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268916A (en) * 2021-04-07 2021-08-17 浙江工业大学 Traffic accident prediction method based on space-time graph convolutional network
CN113219493A (en) * 2021-04-26 2021-08-06 中山大学 End-to-end point cloud data compression method based on three-dimensional laser radar sensor

Also Published As

Publication number Publication date
CN116758534A (en) 2023-09-15

Similar Documents

Publication Publication Date Title
Wang et al. Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net
CN112101278B (en) Homestead point cloud classification method based on k-nearest neighbor feature extraction and deep learning
CN113705631B (en) A 3D point cloud target detection method based on graph convolution
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN108038445B (en) SAR automatic target identification method based on multi-view deep learning framework
CN108416378B (en) A large-scene SAR target recognition method based on deep neural network
CN116129234B (en) Attention-based 4D millimeter wave radar and vision fusion method
CN110929577A (en) An improved target recognition method based on YOLOv3 lightweight framework
CN116758534B (en) 3D object detection method based on convolutional long short-term memory network
CN107154048A (en) The remote sensing image segmentation method and device of a kind of Pulse-coupled Neural Network Model
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN116740561B (en) SAR target recognition method based on fusion of ASC features and multi-scale depth features
CN116503760A (en) UAV Cruise Detection Method Based on Semantic Segmentation of Adaptive Edge Features
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
CN118941526A (en) A road crack detection method, medium and product
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN115294565A (en) 3D target detection and parameterized radius learning method and system based on key points
CN112801928A (en) Attention mechanism-based millimeter wave radar and visual sensor fusion method
CN107529647B (en) Cloud picture cloud amount calculation method based on multilayer unsupervised sparse learning network
CN107341449A (en) A kind of GMS Calculation of precipitation method based on cloud mass changing features
CN118918482B (en) Natural resource measurement method and system based on remote sensing images
CN114140698A (en) Water system information extraction algorithm based on FasterR-CNN
CN112560799A (en) Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN115424028B (en) Feature-enhanced lightweight SSD-based infrared target detection method
CN117876869A (en) A remote sensing image road extraction method, system, device and medium based on HRNet and multi-scale feature attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant