
CN116758534B - 3D object detection method based on convolutional long short-term memory network - Google Patents

3D object detection method based on convolutional long short-term memory network

Info

Publication number
CN116758534B
CN116758534B (application CN202310719201.6A)
Authority
CN
China
Prior art keywords
convolution, output, layer, point cloud, characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310719201.6A
Other languages
Chinese (zh)
Other versions
CN116758534A (en)
Inventor
何立火
钟彬彬
甘海林
柯俊杰
王笛
高新波
路文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202310719201.6A priority Critical patent/CN116758534B/en
Publication of CN116758534A publication Critical patent/CN116758534A/en
Application granted granted Critical
Publication of CN116758534B publication Critical patent/CN116758534B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract


The present invention discloses a 3D object detection method based on a convolutional long short-term memory network, comprising the following steps: Step 1: Using the nuScenes point cloud dataset; Step 2: Converting the input point cloud data to a spherical coordinate system; Step 3: Dividing the point cloud space into voxels according to the spherical coordinates to obtain voxel features, and performing preliminary extraction on the voxel features; Step 4: Performing intermediate feature extraction on the voxel features; Step 5: Performing temporal feature extraction using a convolutional long short-term memory network to obtain its output features; Step 6: Performing multi-scale feature extraction on the output features to obtain a feature map; Step 7: Using the feature map to generate anchor boxes, and performing classification, bounding box regression, and angle regression on the anchor boxes. The present invention uses a convolutional long short-term memory network to extract temporal features from point cloud sequences, addressing the inability of existing deep-learning-based 3D continuous object detection methods to model long-term dependencies over long sequences.

Description

3D object detection method based on a convolutional long short-term memory network
Technical Field
The invention belongs to the technical field of 3D object detection, and particularly relates to a 3D object detection method based on a convolutional long short-term memory network.
Background
3D object detection methods can be broadly classified into image-based, point-cloud-based, and fusion-based methods. Image-based 3D object detection takes a single image or multiple images as input. Point-cloud-based methods detect targets using point clouds acquired by sensors such as lidar and TOF cameras; they provide relatively accurate depth information, achieve higher recognition accuracy on distant targets than image-based methods, and are currently the mainstream choice for autonomous driving. Fusion-based methods use 2D images and point cloud data jointly to detect 3D targets. Existing research has established a number of hand-designed evaluation metrics, such as Average Precision (AP) and Average Orientation Similarity (AOS), inherited from 2D image detection.
When humans recognize a scene, global information such as the target object, foreground, and background can be judged from the image at the current moment, while objects of interest can be captured from continuous video information. A 3D object detection task should therefore exploit information features from both the current moment and preceding moments, so that making full use of 3D measurements at different times becomes the key to improving the performance of a 3D object detection model. Existing 3D object detection methods do not effectively exploit the temporal information provided by the sensor to improve detection accuracy.
Application publication CN115546784A, entitled "3D target detection method based on deep learning", discloses a method that loads the Kitti dataset as training sample images; preprocesses the loaded training images; computes each target's 3D center point, the projection of that center point on the image, the eight corner positions, and the Gaussian distribution of the target center point; constructs a deep convolutional neural network consisting of a backbone network and two branch networks; loads the dataset as a training set, obtains the network output by forward propagation, computes the loss, back-propagates, and updates the network parameters to obtain a trained model; and finally feeds test set images into the pre-trained model to obtain the detected targets and compute the 3D position and class of each target.
The drawback of that method is that the spatial distribution of the Kitti point clouds is uneven in a rectangular coordinate system, and only single-frame point cloud features are extracted, ignoring the temporal dependence between the frame to be detected and the preceding frames; its prediction accuracy and generalization ability are therefore low.
Traditional 3D object detection methods rely only on the spatial information of the current frame, which is inconsistent with human vision, where the current view is interpreted together with information from preceding moments. The failure to integrate features from different moments into the feature information is one of the important factors limiting the performance of 3D object detection methods.
Existing deep-learning-based 3D continuous object detection methods use only the spatial feature information of the current frame during feature extraction, and either ignore the temporal feature information of the frames adjacent to the detection frame or can only exploit a short temporal window.
Disclosure of Invention
To overcome the problems in the prior art, the invention provides a 3D object detection method based on a convolutional long short-term memory network (ConvLSTM), which extracts temporal features from the point cloud sequence through the ConvLSTM and addresses at least one of the problems of low detection accuracy and low robustness caused by the inability of existing deep-learning-based 3D continuous detection methods to model long-term dependencies over long sequences.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A 3D object detection method based on a convolutional long short-term memory network comprises the following steps:
Step 1, acquiring or constructing point cloud data using the nuScenes point cloud dataset, in which the point clouds are stored as coordinates in a three-dimensional rectangular coordinate system;
Step 2, converting a three-dimensional rectangular coordinate system of the point cloud data into a spherical coordinate system to realize the density homogenization of the point cloud in space;
step 3, carrying out voxel division on the point cloud space according to spherical coordinates to obtain voxel characteristics, and carrying out preliminary extraction on the voxel characteristics to unbind the voxel characteristics and the space absolute position;
Step 4, extracting intermediate features of the voxel features obtained in the step 3 through a convolution network;
Step 5, extracting temporal features from the intermediate features through the convolutional long short-term memory network to obtain its output feature H_n;
Step 6, performing multi-scale feature extraction on the output feature H_n obtained in step 5 to obtain a feature map H_f;
Step 7, generating anchor boxes from the feature map H_f obtained in step 6, and performing classification, bounding-box regression, and angle regression on the anchor boxes to obtain the final prediction boxes;
Step 8, setting hyperparameters and training parameters, training the network, and verifying the effectiveness of the algorithm.
The step 1 specifically comprises the following steps:
The point cloud data includes the XYZ coordinates (x, y, z) and the reflection intensity I of each point, in the form [(x_i, y_i, z_i, I_i)], where the subscript i refers to the serial number of the corresponding point.
The step 2 specifically comprises the following steps:
Step 2.1, arrange the point cloud data [(x_i, y_i, z_i, I_i)] obtained in step 1 in time order to obtain continuous point cloud frames [(x_it, y_it, z_it, I_it)], where the subscript t = 0, 1, 2, ..., n-1 indexes the frame sequence in reverse order: t = 0 indicates that the point belongs to the annotated key frame, and each increment of t moves one frame earlier, forming a point cloud input sequence from some moment before the key frame up to the key frame;
Step 2.2, encode the continuous point cloud frames [(x_it, y_it, z_it, I_it)] into an input sequence [(x_i, y_i, z_i, I_i, t)];
Step 2.3, for each point (x_i, y_i, z_i, I_i) in the dataset point cloud, compute:
d_i = sqrt(x_i^2 + y_i^2 + z_i^2)
θ_i = arctan2(y_i, x_i)
φ_i = arcsin(z_i / d_i)
I_i = I_i
obtaining a point cloud sequence [(d_i, θ_i, φ_i, I_i, t)] in the spherical coordinate system, where d denotes the straight-line distance of the point from the origin (the lidar), θ is the azimuth of the point, φ is the pitch angle of the point, and I is the reflection intensity of the point;
Step 2.4, re-decode the point cloud sequence into a list of point cloud frames [(d_i, θ_i, φ_i, I_i)] according to t = 0, 1, 2, ..., n-1, and record the length n of each sequence in the batch.
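The coordinate conversion of step 2.3 can be sketched in a few lines. The function name and the sample frame below are illustrative (not from the patent); the formulas are the standard Cartesian-to-spherical expressions for distance, azimuth, and pitch that the step defines.

```python
import math

def cartesian_to_spherical(x, y, z, intensity):
    """Convert one lidar point (x, y, z, I) to spherical form (d, theta, phi, I)."""
    d = math.sqrt(x * x + y * y + z * z)      # straight-line distance to the lidar
    theta = math.atan2(y, x)                  # azimuth angle
    phi = math.asin(z / d) if d > 0 else 0.0  # pitch (elevation) angle
    return d, theta, phi, intensity

# A toy point cloud frame as a list of (x, y, z, I) tuples.
frame = [(3.0, 4.0, 0.0, 0.7), (0.0, 0.0, 5.0, 0.2)]
spherical_frame = [cartesian_to_spherical(*p) for p in frame]
```

Applying this per frame, with the frame index t carried alongside each point, reproduces the encode/convert/decode flow of steps 2.2 to 2.4.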
The step 3 specifically comprises the following steps:
Step 3.1, in the spherical coordinate system, divide the space into a voxel grid whose side lengths along the d, θ, and φ directions are v_d, v_θ, and v_φ respectively. The division range is not unbounded: the three dimensions are limited to [d_min, d_max], [θ_min, θ_max], and [φ_min, φ_max], where d_min and d_max are the lower and upper limits of the target-to-radar distance, θ_min and θ_max the limits of the azimuth dimension, and φ_min and φ_max the limits of the pitch-angle dimension;
then, grouping the point clouds according to the voxel grids of each point in the point clouds;
Step 3.2, voxelize each frame of the point cloud frame list [(d_i, θ_i, φ_i, I_i)];
Step 3.3, extract a feature for each voxel as follows: for every occupied voxel, compute the mean of the points it contains and the mean reflection intensity Ī over all points in that voxel, and take as the voxel feature the offsets (d_c, θ_c, φ_c) from the point mean to the voxel center together with Ī. The resulting feature matrix, with one row (d_c, θ_c, φ_c, Ī) per voxel of the single-frame point cloud, is denoted F_0; using relative offsets decouples the voxel feature from the absolute spatial position.
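A minimal sketch of the spherical voxelization and per-voxel feature described in step 3. The function names, detection ranges, and voxel sizes below are illustrative assumptions, not values from the patent.

```python
import math

def voxel_index(d, theta, phi, ranges, sizes):
    """Map a spherical point to its (d, theta, phi) voxel grid index,
    or None if it falls outside the configured detection range."""
    for value, (lo, hi) in zip((d, theta, phi), ranges):
        if not (lo <= value < hi):
            return None
    return tuple(int((v - lo) // s)
                 for v, (lo, _), s in zip((d, theta, phi), ranges, sizes))

def voxelize(points, ranges, sizes):
    """Group points by voxel and build the per-voxel feature of step 3.3:
    offsets from the point mean to the voxel center, plus mean intensity."""
    voxels = {}
    for d, theta, phi, inten in points:
        idx = voxel_index(d, theta, phi, ranges, sizes)
        if idx is not None:
            voxels.setdefault(idx, []).append((d, theta, phi, inten))
    features = {}
    for idx, pts in voxels.items():
        n = len(pts)
        mean = [sum(p[k] for p in pts) / n for k in range(4)]
        centre = [ranges[k][0] + (idx[k] + 0.5) * sizes[k] for k in range(3)]
        offsets = [mean[k] - centre[k] for k in range(3)]  # d_c, theta_c, phi_c
        features[idx] = offsets + [mean[3]]                # (d_c, theta_c, phi_c, mean I)
    return features

# Illustrative ranges [lo, hi) and voxel side lengths for d, theta, phi.
ranges = [(0.0, 50.0), (-math.pi, math.pi), (-0.5, 0.5)]
sizes = [1.0, math.pi / 180, 0.05]
feats = voxelize([(10.2, 0.1, 0.12, 0.5), (10.4, 0.1, 0.12, 0.7)], ranges, sizes)
```

Both sample points fall into the same voxel, so the result holds a single feature row for that grid cell.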
The step 4 specifically comprises the following steps:
The network for further extraction of the voxel features obtained in step 3 consists of six stages of convolution layers connected in sequence: 1 input layer, 4 intermediate layers, and 1 output layer, i.e. input layer → first intermediate layer → second intermediate layer → third intermediate layer → fourth intermediate layer → output layer;
Step 4.1, input the feature matrix F_0 obtained in step 3.3 into the input layer and output feature F_1; the input layer consists of 1 SubMConv3d convolution layer;
Step 4.2, input feature F_1 into the first intermediate layer and output feature F_2; the first intermediate layer consists of 1 SubMConv3d convolution layer;
Step 4.3, input feature F_2 into the second intermediate layer and output feature F_3; the second intermediate layer consists of 3 convolution layers connected in sequence: SparseConv3d → SubMConv3d → SubMConv3d;
Step 4.4, input feature F_3 into the third intermediate layer and output feature F_4; the third intermediate layer consists of 3 convolution layers connected in sequence: SparseConv3d → SubMConv3d → SubMConv3d;
Step 4.5, input feature F_4 into the fourth intermediate layer and output feature F_5; the fourth intermediate layer consists of 3 convolution layers connected in sequence: SparseConv3d → SubMConv3d → SubMConv3d;
Step 4.6, input feature F_5 into the output layer and output the intermediate feature F_s; the output layer consists of 1 SubMConv3d convolution layer. F_s is a sequence of feature maps stored in matrix form.
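The stage structure of step 4 can be summarized as a declarative config. The layer type names (submanifold vs. regular sparse 3D convolution) and the channel widths follow the detailed description later in the patent; the `BACKBONE` representation itself is only an illustrative sketch, not the patent's implementation.

```python
# Six-stage intermediate feature extractor of step 4, one entry per stage:
# (stage name, [(layer type, output feature dimension), ...]).
BACKBONE = [
    ("input",  [("SubMConv3d", 16)]),
    ("mid1",   [("SubMConv3d", 16)]),
    ("mid2",   [("SparseConv3d", 32), ("SubMConv3d", 32), ("SubMConv3d", 32)]),
    ("mid3",   [("SparseConv3d", 64), ("SubMConv3d", 64), ("SubMConv3d", 64)]),
    ("mid4",   [("SparseConv3d", 64), ("SubMConv3d", 64), ("SubMConv3d", 64)]),
    ("output", [("SubMConv3d", 128)]),
]

# Total number of convolution layers across all six stages.
n_layers = sum(len(convs) for _, convs in BACKBONE)
```

A real implementation would instantiate each entry as a sparse-convolution layer and chain the stages so that F_0 flows through to F_s.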
In step 5, the intermediate feature F_s obtained in step 4.6 is input into the convolutional long short-term memory network to extract the temporal features of F_s. The network consists of a forget gate, an input gate, candidate memory cells, and an output gate, computed as follows:
Forget gate:
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
where f_t is the output of the forget gate at time t, σ is the sigmoid function, * denotes matrix convolution, ∘ denotes the Hadamard (element-wise) product, W_xf is the convolution weight matrix between the forget gate and the input X_t, W_hf the convolution weight matrix between the forget gate and the hidden state H_{t-1} at time t-1, W_cf the element-wise weight matrix between the forget gate and the memory cell state C_{t-1} at time t-1, and b_f the bias matrix of the forget gate;
Input gate:
I_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
where I_t is the output of the input gate at time t, W_xi is the convolution weight matrix between the input gate and the input X_t, W_hi the convolution weight matrix between the input gate and the hidden state H_{t-1}, W_ci the element-wise weight matrix between the input gate and the memory cell state C_{t-1}, and b_i the bias matrix of the input gate;
Candidate memory cell state:
C̃_t = tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
where C̃_t is the candidate memory cell state at time t, tanh is the hyperbolic tangent activation, W_xc is the convolution weight matrix between the candidate state and the input X_t, W_hc the convolution weight matrix between the candidate state and the hidden state H_{t-1}, and b_c the bias matrix of the candidate state;
Output gate:
O_t = σ(W_xo * X_t + W_ho * H_{t-1} + b_o)
where O_t is the output of the output gate at time t, W_xo is the convolution weight matrix between the output gate and the input X_t, W_ho the convolution weight matrix between the output gate and the hidden state H_{t-1}, and b_o the bias matrix of the output gate;
Memory cell state:
C_t = f_t ∘ C_{t-1} + I_t ∘ C̃_t
where C_t is the memory cell state at time t, C̃_t the candidate state at time t, I_t the input gate output, f_t the forget gate output, and C_{t-1} the memory cell state at time t-1;
Hidden state:
H_t = O_t ∘ tanh(C_t)
where H_t is the hidden-state output at time t.
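A minimal scalar sketch of one ConvLSTM cell step following the gate computations of step 5: with a 1x1 single-channel feature, the convolutions and Hadamard products both degenerate to ordinary multiplications. The weight and input values below are arbitrary illustrative choices, not from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def convlstm_cell_step(x_t, h_prev, c_prev, w):
    """One time step of a scalar ConvLSTM cell: forget gate, input gate,
    candidate state, output gate, then cell and hidden state updates."""
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])
    c_cand = math.tanh(w["xc"] * x_t + w["hc"] * h_prev + w["bc"])
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + w["bo"])
    c_t = f_t * c_prev + i_t * c_cand   # memory cell state C_t
    h_t = o_t * math.tanh(c_t)          # hidden state H_t
    return h_t, c_t

# Run a short sequence oldest-frame-first, as in step 5; all weights are 0.5.
weights = {k: 0.5 for k in
           ["xf", "hf", "cf", "bf", "xi", "hi", "ci", "bi",
            "xc", "hc", "bc", "xo", "ho", "bo"]}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.6]:
    h, c = convlstm_cell_step(x, h, c, weights)
```

After consuming the final (key) frame, `h` plays the role of the network output H_n.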
The step 5 specifically comprises the following steps:
Step 5.1, input the intermediate feature F_s obtained in step 4.6 into the convolutional long short-term memory network frame by frame according to the sequence length n, from the (n-1)-th down to the 0-th feature map of F_s; the input for the i-th frame feature map is denoted X_i;
Step 5.2, after X_i is input, the forget gate is computed from X_i, the hidden state H_{i-1} of the previous frame, and the cell state C_{i-1}, outputting the forget gate information f_i;
Step 5.3, at the same time, the input gate is computed from X_i, the hidden state H_{i-1}, and the cell state C_{i-1}, outputting I_i;
Step 5.4, at the same time, the candidate memory cell state is computed from X_i and the hidden state H_{i-1}, outputting the candidate memory cell state C̃_i of the i-th frame;
Step 5.5, at the same time, the output gate is computed from X_i and the hidden state H_{i-1}, outputting O_i;
Step 5.6, the cell state C_i of the current frame is computed from the forget gate information f_i, the input gate output I_i, the candidate memory cell state C̃_i, and the cell state C_{i-1} of the previous frame;
Step 5.7, the hidden state H_i of the current frame is computed from the cell state C_i and the output gate output O_i;
Step 5.8, the next frame feature map X_{i+1} is input and steps 5.2 to 5.7 are repeated until the key frame has been input, yielding H_n, which is taken as the output feature of the whole convolutional long short-term memory network.
The step 6 specifically comprises the following steps:
Step 6.1, denote the output feature H_n from step 5 as H_1, input H_1 into the first feature extraction layer, and output the highest-resolution feature map H_f1;
Step 6.2, downsample H_f1, denote the result as H_2, input it into the second feature extraction layer, and output feature map H_f2;
Step 6.3, downsample H_f2, denote the result as H_3, input it into the third feature extraction layer, and output feature map H_f3;
Step 6.4, upsample the three feature maps H_f1, H_f2, H_f3 of different scales, denoting the resulting feature maps H_f1_up, H_f2_up, H_f3_up;
Step 6.5, merge the upsampled, same-size feature maps H_f1_up, H_f2_up, H_f3_up into one multi-scale feature map, denoted H_f.
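The multi-scale pyramid of step 6 can be illustrated with a toy 1-D example. Stride-2 averaging for downsampling and nearest-neighbour upsampling are illustrative choices here, since the patent does not specify the exact operators.

```python
def downsample(xs):
    """Stride-2 average pooling: the down-scaling between extraction layers."""
    return [(xs[i] + xs[i + 1]) / 2 for i in range(0, len(xs) - 1, 2)]

def upsample(xs, length):
    """Nearest-neighbour upsampling back to the shared top resolution."""
    return [xs[min(i * len(xs) // length, len(xs) - 1)] for i in range(length)]

h1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
hf1 = h1                  # highest-resolution feature map (stand-in for H_f1)
hf2 = downsample(hf1)     # second scale (stand-in for H_f2)
hf3 = downsample(hf2)     # third scale (stand-in for H_f3)
# Upsample every scale to the resolution of hf1 and merge channel-wise (step 6.5).
hf = list(zip(hf1, upsample(hf2, len(hf1)), upsample(hf3, len(hf1))))
```

Each position of `hf` now carries one value per scale, mirroring the channel concatenation that forms the multi-scale feature map H_f.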
The step 7 specifically comprises the following steps:
Step 7.1, input the multi-scale feature map H_f obtained in step 6.5 into three fully connected layers; the feature at each point of the feature map predicts several anchor boxes through the fully connected layers, with anchor sizes set according to the target sizes;
Step 7.2, map the generated anchor boxes and the ground-truth bounding boxes onto the d-θ plane and compute their intersection-over-union (IoU) there; set different upper and lower thresholds for different target categories, assign anchors whose IoU exceeds the upper threshold as positive samples, assign anchors whose IoU is below the lower threshold as negative samples, and discard anchors whose IoU lies between the upper and lower thresholds;
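The anchor assignment of step 7.2 can be sketched with axis-aligned boxes on the d-θ plane. The thresholds and box coordinates below are illustrative values, not from the patent.

```python
def iou_2d(a, b):
    """Intersection-over-union of two axis-aligned boxes given as
    (d_min, theta_min, d_max, theta_max) on the d-theta plane."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def assign_anchor(anchor, gt_box, hi_thresh, lo_thresh):
    """Label an anchor positive / negative / ignored by its IoU with a
    ground-truth box, per the threshold rule of step 7.2."""
    iou = iou_2d(anchor, gt_box)
    if iou >= hi_thresh:
        return "positive"
    if iou < lo_thresh:
        return "negative"
    return "ignored"

gt = (0.0, 0.0, 2.0, 2.0)
anchors = [(0.0, 0.0, 2.0, 2.0), (0.3, 0.3, 2.3, 2.3), (5.0, 5.0, 7.0, 7.0)]
labels = [assign_anchor(a, gt, 0.6, 0.45) for a in anchors]
```

The first anchor matches the ground truth exactly, the second overlaps partially and falls between the two thresholds, and the third is disjoint.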
Step 7.3, compute the classification loss function L_cls of the anchor boxes as:
L_cls = -α (1 - p̂_i)^γ · log(p̂_i)
where p̂_i is the predicted probability that the i-th anchor box belongs to a class-c target, and α and γ are two hyperparameters;
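The classification loss of step 7.3 is parameterized by α and γ, which matches the standard focal loss; the sketch below is written under that assumption, with illustrative α, γ, and probability values.

```python
import math

def focal_loss(p_hat, alpha=0.25, gamma=2.0):
    """Focal-style classification loss for one positive anchor: the factor
    (1 - p)^gamma down-weights already well-classified examples."""
    return -alpha * (1.0 - p_hat) ** gamma * math.log(p_hat)

confident = focal_loss(0.9)   # well-classified anchor: tiny loss
hard = focal_loss(0.1)        # badly misclassified anchor: much larger loss
```

The modulating factor makes hard anchors dominate the gradient, which is useful given the extreme foreground/background imbalance of anchor-based detection.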
Step 7.4, compute the angle loss function L_dir of the anchor boxes as a cross-entropy over the discretized rotation angles:
L_dir = -Σ_r p_{i,r} · log(p̂_{i,r})
where p̂_{i,r} is the predicted probability that the rotation angle of the i-th anchor box is r, and p_{i,r} is the corresponding ground-truth probability;
Step 7.5, compute the position loss function L_reg of the anchor boxes with a smooth-L1 penalty:
L_reg = Σ_i SmoothL1(x̂_i - x_i), where SmoothL1(e) = 0.5 e² / β for |e| < β and |e| - 0.5 β otherwise,
x̂_i is the predicted geometric center of the i-th anchor box, x_i is the geometric center of the corresponding ground-truth box, and β is a hyperparameter;
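The position loss of step 7.5 is parameterized by β, consistent with a smooth-L1 penalty on the center residuals; the sketch below is written under that assumption, with illustrative centers and β.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth-L1 penalty on a regression residual: quadratic near zero,
    linear for large errors, switching at the hyperparameter beta."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def box_reg_loss(pred_centers, gt_centers, beta=1.0):
    """Position loss: summed smooth-L1 over predicted-vs-true center residuals."""
    return sum(smooth_l1(p - g, beta) for p, g in zip(pred_centers, gt_centers))

loss = box_reg_loss([1.2, 0.0, 3.0], [1.0, 0.5, 1.0], beta=1.0)
```

The quadratic region keeps gradients small for near-correct boxes, while the linear region prevents outlier boxes from dominating the loss.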
Step 7.6, compute the total loss function of the anchor boxes by combining the losses of the three subtasks:
L_total = β_1 · L_cls + β_2 · L_reg + β_3 · L_dir
where L_cls is the classification task loss, L_reg the regression task loss, L_dir the angle classification task loss, and β_1, β_2, β_3 are constant weight parameters for the three losses.
The step 8 specifically comprises the following steps:
Step 8.1, loading a data set category label, determining an evaluation index and designing an ablation experiment;
Step 8.2, setting a data collection point cloud input range;
Step 8.3, setting a voxelized range;
Step 8.4, setting the number of voxel grids in two groups of experiments;
step 8.5, setting the maximum training voxel number and the maximum test voxel number;
Step 8.6, in the training stage, set the optimizer, the learning-rate schedule, the proportion of warm-up steps, the maximum and minimum learning-rate multipliers, and the number of training epochs;
Step 8.7, perform simulation experiments to demonstrate the technical effects of the invention.
The invention has the beneficial effects that:
The method converts point cloud information from a rectangular coordinate system into a spherical coordinate system in which the point cloud is more uniformly distributed, extracts temporal features from the point cloud with a convolutional long short-term memory network, and captures the dependency of a 3D target between adjacent frames, strengthening the ability of the fused feature map to represent multi-frame point cloud information. Because the temporal feature information of the different point cloud frames is fully exploited, the method improves the average accuracy and robustness of continuous 3D object detection tasks.
The invention transfers the 3D object detection task from the rectangular coordinate system to the spherical coordinate system, mitigating the uneven distribution of the point cloud. A convolutional long short-term memory network is added during point cloud feature extraction, remedying the underuse of adjacent point cloud frame information in other 3D object detection methods. The final experimental results show that, compared with 3D object detection methods that do not use adjacent-frame information, the method achieves higher average accuracy and robustness.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Fig. 2 is a schematic diagram of the correspondence between coordinates of a spherical coordinate system and coordinates of a three-dimensional rectangular coordinate system.
Fig. 3 is a schematic view of voxel division in a spherical coordinate system.
FIG. 4 is a schematic diagram of a convolutional long and short term memory network.
Fig. 5 is a schematic diagram of multi-scale feature extraction.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the 3D object detection method based on the convolutional long short-term memory network specifically comprises the following steps:
Step1, acquiring nuScenes data sets:
Construct corresponding training and test sets using the nuScenes dataset, which is commonly used in the field of 3D object detection.
The nuScenes dataset is a large autonomous driving dataset. nuScenes data were collected mainly in Singapore and Boston, with driving routes carefully planned to capture challenging scenes. The dataset contains 1000 scenes of 20 s each, covering different environments, times of day, and weather conditions. To balance the differences in category counts, the dataset also adjusts the number of scenes containing rare categories.
In the nuScenes dataset, the point cloud data are stored as coordinates in a three-dimensional rectangular coordinate system; the data include the XYZ coordinates (x, y, z) and the reflection intensity I of each point, as in [(x_i, y_i, z_i, I_i)], where the subscript i refers to the serial number of a data item within a sequence or list; for convenience, this notation is also followed in the remainder of the description.
Step 2, converting the input data into a spherical coordinate system:
The step realizes the density homogenization of point cloud in space by replacing the coordinate system from a three-dimensional rectangular coordinate system to a spherical coordinate system on the whole structure level.
Step 2.1, input a continuous point cloud frame [(x_it, y_it, z_it, I_it)], where the subscript i again refers to the serial number of a data item within the sequence or list, and the subscript t = 0, 1, 2, ..., n-1 indexes the frame sequence in reverse order: t = 0 indicates that the point belongs to the labelled key frame, and time moves back one frame as t increases, forming a point cloud input sequence from a moment before the key frame up to the key frame.
Step 2.2, encode the continuous point cloud frames [(x_it, y_it, z_it, I_it)] into an input sequence [(x_i, y_i, z_i, I_i, t)].
Step 2.3, for each point (x_i, y_i, z_i, I_i) in the dataset point cloud, compute:
d_i = sqrt(x_i^2 + y_i^2 + z_i^2)
θ_i = arctan2(y_i, x_i)
φ_i = arcsin(z_i / d_i)
I_i = I_i
obtaining a point cloud sequence [(d_i, θ_i, φ_i, I_i, t)] in the spherical coordinate system, where d denotes the straight-line distance of the point from the origin (the lidar), θ is the azimuth of the point, φ is the pitch angle of the point, and I is the reflection intensity of the point.
The corresponding relationship between the coordinates of the spherical coordinate system and the coordinates of the three-dimensional rectangular coordinate system is shown in fig. 2.
Step 2.4, re-decode the point cloud sequence into a list of point cloud frames [(d_i, θ_i, φ_i, I_i)] according to t = 0, 1, 2, ..., n-1, and record the length n of each sequence in the batch.
And 3, carrying out voxel division on the point cloud space according to spherical coordinates:
and converting the geometrical form representation of the point cloud into a voxel representation form closest to the point cloud, and performing preliminary extraction on voxel characteristics to unbind the voxel characteristics from the spatial absolute position.
Step 3.1, in the spherical coordinate system, divide the space into a voxel grid whose side lengths along the d, θ, and φ directions are v_d, v_θ, and v_φ respectively. The division range is not unbounded: the three dimensions are limited to [d_min, d_max], [θ_min, θ_max], and [φ_min, φ_max], where d_min and d_max represent the lower and upper limits of the target-to-radar distance, θ_min and θ_max the limits of the azimuth dimension, and φ_min and φ_max the limits of the pitch-angle dimension.
The voxel division is illustrated in Fig. 3.
And then grouping the point clouds according to the voxel grids where each point in the point clouds is located.
Step 3.2, voxelize each frame of the point cloud frame list [(d_i, θ_i, φ_i, I_i)].
Step 3.3, extract a feature for each voxel as follows: for every occupied voxel, compute the mean of the points it contains and the mean reflection intensity Ī over all points in that voxel, and take as the voxel feature the offsets (d_c, θ_c, φ_c) from the point mean to the voxel center together with Ī. The resulting feature matrix, with one row (d_c, θ_c, φ_c, Ī) per voxel of the single-frame point cloud, is denoted F_0.
And 4, further extracting intermediate features of the voxel features through a convolution network:
The intermediate feature extraction is composed of a convolution network formed by sequentially connecting 6 convolution layers, and comprises 1 input layer, 4 intermediate layers and 1 output layer, wherein the specific structure comprises an input layer, a first intermediate layer, a second intermediate layer, a third intermediate layer, a fourth intermediate layer and an output layer.
Step 4.1: input the feature matrix F_0 obtained in step 3.3 into the input layer and output feature F_1. The input layer consists of one SubMConv3d convolution layer (feature dimension: 16, convolution kernel size: 3 × 3, stride: 2).
Step 4.2: input feature F_1 into the first intermediate layer and output feature F_2. The first intermediate layer consists of one SubMConv3d convolution layer (feature dimension: 16, convolution kernel size: 3 × 3, stride: 2).
Step 4.3: input feature F_2 into the second intermediate layer and output feature F_3. The second intermediate layer is formed by sequentially connecting 3 convolution layers; its specific structure is SparseConv3d convolution layer (feature dimension: 32, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 32, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 32, convolution kernel size: 3 × 3, stride: 2).
Step 4.4: input feature F_3 into the third intermediate layer and output feature F_4. The third intermediate layer is formed by sequentially connecting 3 convolution layers; its specific structure is SparseConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2).
Step 4.5: input feature F_4 into the fourth intermediate layer and output feature F_5. The fourth intermediate layer is formed by sequentially connecting 3 convolution layers; its specific structure is SparseConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2) → SubMConv3d convolution layer (feature dimension: 64, convolution kernel size: 3 × 3, stride: 2).
Step 4.6: input feature F_5 into the output layer and output the intermediate feature F_s. The output layer consists of one SubMConv3d convolution layer (feature dimension: 128, convolution kernel size: 3 × 3, stride: 2). F_s is a sequence of feature maps, stored in matrix form.
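As an illustration, the following shape-level sketch propagates a voxel grid through the six stages above. It assumes that only the three SparseConv3d layers actually change resolution (submanifold convolutions preserve the sparsity pattern and hence the spatial shape, even though the text lists a stride for every layer), each with kernel 3, stride 2 and padding 1; the grid size 1408 × 2048 × 32 is taken from step 8.4 below. The per-stage channel counts follow steps 4.1 to 4.6.

```python
# Shape-level sketch of the 6-stage sparse backbone of step 4.
# Assumption: SubMConv3d preserves spatial shape; SparseConv3d downsamples
# with kernel 3, stride 2, padding 1 (conventional for sparse conv backbones).

def conv_out(size, kernel=3, stride=2, pad=1):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def backbone_shapes(grid):
    """grid = (d, theta, phi) voxel counts; returns (stage, channels, shape)."""
    stages, shape = [], tuple(grid)
    for name, channels, downsample in [
        ("input (SubMConv3d)", 16, False),
        ("middle 1 (SubMConv3d)", 16, False),
        ("middle 2 (SparseConv3d + 2 SubMConv3d)", 32, True),
        ("middle 3 (SparseConv3d + 2 SubMConv3d)", 64, True),
        ("middle 4 (SparseConv3d + 2 SubMConv3d)", 64, True),
        ("output (SubMConv3d)", 128, False),
    ]:
        if downsample:
            shape = tuple(conv_out(s) for s in shape)
        stages.append((name, channels, shape))
    return stages

for name, ch, shape in backbone_shapes((1408, 2048, 32)):
    print(f"{name:42s} C={ch:<4d} spatial={shape}")
```

Under these assumptions the grid shrinks by a factor of 8 per dimension across the three downsampling stages, ending at 176 × 256 × 4 with 128 channels.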
Step 5: extracting temporal features through the convolutional long short-term memory network:
The intermediate feature F_s obtained in step 4.6 is input into the convolutional long short-term memory network to extract the time-dimension features of the feature sequence. The convolutional long short-term memory network is formed by cascading several ConvLSTM layers; each ConvLSTM layer consists of a forget gate, an input gate, candidate memory cells and an output gate, connected as shown in figure 4.
Forget gate:
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
where f_t is the output of the forget gate at time t, σ is the sigmoid function, * denotes matrix convolution, ∘ denotes the matrix Hadamard product, W_xf is the convolution weight matrix between the forget gate and the input X_t, W_hf is the convolution weight matrix between the forget gate and the hidden state H_{t-1} at time t-1, W_cf is the element-wise weight matrix between the forget gate and the memory cell state C_{t-1} at time t-1, and b_f is the bias matrix of the forget gate;
Input gate:
I_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
where I_t is the output of the input gate at time t, W_xi is the convolution weight matrix between the input gate and the input X_t, W_hi is the convolution weight matrix between the input gate and the hidden state H_{t-1} at time t-1, W_ci is the element-wise weight matrix between the input gate and the memory cell state C_{t-1} at time t-1, and b_i is the bias matrix of the input gate;
Candidate memory cell state:
C̃_t = tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
where C̃_t is the output of the candidate memory cell state at time t, tanh is the hyperbolic tangent activation function, W_xc is the convolution weight matrix between the candidate memory cell state and the input X_t, W_hc is the convolution weight matrix between the candidate memory cell state and the hidden state H_{t-1} at time t-1, and b_c is the bias matrix of the candidate memory cell state;
Output gate:
O_t = σ(W_xo * X_t + W_ho * H_{t-1} + b_o)
where O_t is the output of the output gate at time t, W_xo is the convolution weight matrix between the output gate and the input X_t, W_ho is the convolution weight matrix between the output gate and the hidden state H_{t-1} at time t-1, and b_o is the bias matrix of the output gate;
Hidden state:
H_t = O_t ∘ tanh(C_t)
where H_t is the output of the hidden state at time t, O_t is the output of the output gate at time t, and C_t is the memory cell state at time t.
Memory cell state:
C_t = f_t ∘ C_{t-1} + I_t ∘ C̃_t
where C_t is the memory cell state at time t, C̃_t is the candidate memory cell state at time t, I_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, and C_{t-1} is the memory cell state at time t-1.
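The gate equations above can be exercised numerically. The sketch below collapses each convolution (*) and Hadamard product (∘) to a scalar multiplication, i.e., a single spatial location with a single channel; the weight values are purely illustrative, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One ConvLSTM time step, reduced to a single spatial location and single
# channel so that convolutions and Hadamard products become scalar products.
def convlstm_step(x_t, h_prev, c_prev, w):
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + w["cf"] * c_prev + w["bf"])  # forget gate f_t
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + w["ci"] * c_prev + w["bi"])  # input gate I_t
    c_cand = math.tanh(w["xc"] * x_t + w["hc"] * h_prev + w["bc"])                # candidate C~_t
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + w["bo"])                     # output gate O_t
    c_t = f_t * c_prev + i_t * c_cand       # memory cell state C_t
    h_t = o_t * math.tanh(c_t)              # hidden state H_t
    return h_t, c_t

# Illustrative weights (all 0.5), one step on input 1.0 from a zero state.
weights = dict.fromkeys(
    ["xf", "hf", "cf", "bf", "xi", "hi", "ci", "bi",
     "xc", "hc", "bc", "xo", "ho", "bo"], 0.5)
h, c = convlstm_step(1.0, 0.0, 0.0, weights)
print(h, c)
```

Note how the forget and input gates also receive the previous cell state C_{t-1} (the W_cf and W_ci "peephole" terms), while the output gate does not, matching the equations above.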
Step 5.1: according to the sequence length n, input the (n-1)-th, ..., 1st, 0th feature maps of the intermediate feature F_s obtained in step 4.6 into the convolutional long short-term memory network in that order; the input of the i-th frame feature map is denoted X_i.
Step 5.2: after X_i is input, the forget gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting the forget gate information f_i.
Step 5.3: simultaneously, the input gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting I_i.
Step 5.4: simultaneously, the candidate memory cell state is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting the candidate memory cell state C̃_i of the i-th frame.
Step 5.5: simultaneously, the output gate is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting O_i.
Step 5.6: the cell state information C_i of the current frame is computed from the forget gate information f_i, the input gate output I_i, the candidate memory cell state C̃_i and the cell state information C_{i-1} of the previous frame.
Step 5.7: the hidden state H_i of the current frame is computed from the cell state C_i of the current frame and the output gate output O_i.
Step 5.8: input the next frame feature map X_{i+1} and repeat steps 5.2 to 5.7 until the key frame has been input, obtaining H_n, which is finally taken as the output feature of the whole convolutional long short-term memory network.
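The frame-by-frame recurrence described above can be sketched as a loop that threads the hidden and cell states from the oldest along-way frame (index n-1) down to the key frame (index 0). The tiny scalar cell below is a deliberately simplified stand-in for the convolutional gates, with a single shared illustrative gate value:

```python
import math

def tiny_cell(x, h, c):
    """Toy scalar LSTM-like cell: shared illustrative gates, not trained weights."""
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    g = sig(h + x)                      # one shared value stands in for f, i, o
    c_new = g * c + g * math.tanh(x + h)
    return g * math.tanh(c_new), c_new  # (hidden state, cell state)

def run_sequence(frames):
    """frames[t] holds the frame at offset t (t = 0 is the key frame)."""
    h = c = 0.0                                  # initial hidden/cell state
    for t in range(len(frames) - 1, -1, -1):     # feed n-1, n-2, ..., 1, 0
        h, c = tiny_cell(frames[t], h, c)
    return h                                     # H_n, handed to step 6
```

The loop direction is the point: the key frame is processed last, so its hidden state H_n has already absorbed the temporal context of every preceding along-way frame.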
Step 6: extracting multi-scale features:
As shown in the figure, feature H_n is denoted H_1, and multi-scale features are extracted from H_1 through a network similar to a feature pyramid structure.
Step 6.1: input H_1 into the first feature extraction layer for feature extraction with feature dimension 128, and output the highest-resolution feature map H_f1.
Step 6.2: downsample H_f1 with an interval of 2, denote the sampling result H_2, input it into the second feature extraction layer for feature extraction with feature dimension 256, and output feature map H_f2.
Step 6.3: downsample H_f2 with an interval of 2, denote the sampling result H_3, input it into the third feature extraction layer for feature extraction with feature dimension 256, and output feature map H_f3.
Step 6.4: upsample the three feature maps H_f1, H_f2, H_f3 of different scales with an upsampling dimension of 256, and denote the sampling results H_f1_up, H_f2_up, H_f3_up.
Step 6.5: concatenate the upsampled feature maps H_f1_up, H_f2_up, H_f3_up, which now have the same scale, into the multi-scale feature map H_f, whose feature depth is 256 × 3.
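The scale bookkeeping of steps 6.1 to 6.5 can be checked at the tensor-shape level. In the sketch below the feature-extraction layers are replaced by averaging 1×1 projections and the upsampling by nearest-neighbour repetition, so only the shapes are meaningful; the channel counts (128, then 256, then 256 per scale after upsampling, concatenated to 256 × 3 = 768) follow the text:

```python
import numpy as np

def extract(x, out_ch):
    """Stand-in for a feature-extraction layer: averaging 1x1 projection."""
    w = np.full((out_ch, x.shape[0]), 1.0 / x.shape[0])
    return np.tensordot(w, x, axes=1)

def down2(x):
    """Interval-2 downsampling: keep every second row and column."""
    return x[:, ::2, ::2]

def up_to(x, h, w_):
    """Nearest-neighbour upsampling back to h x w_."""
    return x.repeat(h // x.shape[1], axis=1).repeat(w_ // x.shape[2], axis=2)

def multi_scale(h1):
    hf1 = extract(h1, 128)                       # step 6.1: finest map
    hf2 = extract(down2(hf1), 256)               # step 6.2
    hf3 = extract(down2(hf2), 256)               # step 6.3
    h, w_ = hf1.shape[1], hf1.shape[2]
    ups = [extract(up_to(m, h, w_), 256) for m in (hf1, hf2, hf3)]  # step 6.4
    return np.concatenate(ups, axis=0)           # step 6.5: H_f

hf = multi_scale(np.ones((64, 16, 16)))          # (channels, height, width)
print(hf.shape)
```

With a (64, 16, 16) input the three scales are 16×16, 8×8 and 4×4, and the concatenated H_f has 768 channels at the finest resolution.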
Step 7: generating anchor boxes and performing classification, bounding-box regression and angle regression:
Step 7.1: input the multi-scale feature map H_f obtained in step 6.5 into three fully connected layers; the feature of each point on the feature map predicts several anchor boxes through the fully connected layers, and the anchor sizes are set according to the target sizes. In this example, the vehicle and pedestrian targets of the data set are set to 3.9 m × 1.6 m × 1.65 m and 0.6 m × 0.8 m × 1.73 m respectively, and the rotation angle r of each group of anchor boxes takes the two values 0° and 90°, forming four different boxes.
Step 7.2: map the generated anchor boxes and the ground-truth bounding boxes onto the d-θ plane and compute the intersection-over-union there. Different upper and lower thresholds are set for different target categories: anchor boxes whose intersection-over-union is above the upper threshold are assigned as positive samples, those below the lower threshold are assigned as negative samples, and those between the two thresholds are discarded. In this example, the upper and lower intersection-over-union thresholds of the vehicle target are set to 0.6 and 0.45, and those of the pedestrian target to 0.4 and 0.3.
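Anchor generation and sample assignment can be illustrated with the concrete sizes and thresholds given above. The sketch below generates the four anchors per location and assigns positives/negatives by IoU; the IoU is computed on axis-aligned 2D footprints, a simplification of the d-θ plane computation in the text:

```python
# Anchor footprints (length, width in metres) and rotations from step 7.1.
CAR = (3.9, 1.6)
PED = (0.6, 0.8)
ROTATIONS = (0.0, 90.0)

def anchors_at(cx, cy):
    """Four anchors per location: two sizes x two rotations."""
    return [(cx, cy, l, w, r) for (l, w) in (CAR, PED) for r in ROTATIONS]

def iou_2d(a, b):
    """Axis-aligned 2D IoU; boxes given as (cx, cy, length, width, ...)."""
    ax0, ax1 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay0, ay1 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx0, bx1 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by0, by1 = b[1] - b[3] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def assign(anchor, gt, hi=0.6, lo=0.45):
    """Vehicle thresholds from step 7.2: positive / negative / ignored."""
    v = iou_2d(anchor, gt)
    if v >= hi:
        return "positive"
    if v < lo:
        return "negative"
    return "ignored"
```

An anchor exactly on a ground-truth box is positive (IoU 1), a far-away anchor is negative, and a partially overlapping one falls into the discarded band between 0.45 and 0.6.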
Step 7.3: compute the classification loss function L_cls of the anchor boxes as follows:
L_cls = -α(1 - p̂_i^c)^γ log(p̂_i^c)
where p̂_i^c is the predicted probability that the i-th anchor box belongs to the class-c target (two target classes in total), and α and γ are two hyperparameters; in this embodiment, α is 0.25 and γ is 2.0.
Step 7.4: compute the angle loss function L_dir of the anchor boxes as follows:
L_dir = -Σ_r q_i^r log(p̂_i^r)
where p̂_i^r represents the probability that the predicted rotation angle of the i-th anchor box is r, and q_i^r is the corresponding true probability.
Step 7.5: compute the position loss function L_reg of the anchor boxes as follows:
L_reg = Σ_i SmoothL1(x̂_i - x_i), where SmoothL1(x) = 0.5x²/β for |x| < β and |x| - 0.5β otherwise,
where x̂_i is the predicted geometric center of the i-th anchor box, x_i is the geometric center of the real box of the corresponding target, and β is a hyperparameter; β is 1 in this example.
Step 7.6: compute the total loss function of the anchor boxes by combining the losses of the three subtasks:
L_total = β_1 L_cls + β_2 L_reg + β_3 L_dir
where L_cls is the classification task loss, L_reg is the regression task loss, L_dir is the angle classification task loss, and β_1, β_2, β_3 are weight constant parameters of the three losses; the method sets β_1 = 1.0, β_2 = 2.0, β_3 = 0.2, so that the model focuses more on the bounding-box regression and classification tasks.
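The three loss terms and their weighted combination can be sketched numerically. The per-term formulas below (focal loss, smooth-L1, cross-entropy on the rotation bin) are the standard choices implied by the hyperparameters α, γ and β, evaluated here for a single anchor:

```python
import math

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal classification loss for one positive anchor with predicted prob p."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def smooth_l1(x, beta=1.0):
    """Smooth-L1 on one regression residual x (quadratic near 0, linear beyond beta)."""
    return 0.5 * x * x / beta if abs(x) < beta else abs(x) - 0.5 * beta

def direction_loss(p_true):
    """Cross-entropy on the probability assigned to the true rotation bin."""
    return -math.log(p_true)

def total_loss(p_cls, residuals, p_dir, b1=1.0, b2=2.0, b3=0.2):
    """Weighted total loss with beta_1 = 1.0, beta_2 = 2.0, beta_3 = 0.2."""
    l_cls = focal_loss(p_cls)
    l_reg = sum(smooth_l1(r) for r in residuals)
    l_dir = direction_loss(p_dir)
    return b1 * l_cls + b2 * l_reg + b3 * l_dir
```

A perfect prediction (class probability 1, zero residuals, correct rotation bin) yields zero total loss, and the β weights make a given regression error cost ten times more than the same-sized direction error.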
Step 8: setting hyperparameters and training parameters, and training and experimentally verifying the convolutional long short-term memory network:
Ablation experiments verify the guiding effect of the spherical-coordinate 3D target detection method and the model based on the convolutional long short-term memory network.
Step 8.1: in this example, the loaded labels are of the pedestrian and vehicle types, and the evaluation index is the average precision (AP). Because the data set is collected at 20 Hz while annotated key frames occur at 2 Hz, there are on average 10 unlabeled point cloud frames before each key frame; however, since the experiments run on an NVIDIA RTX 2080Ti graphics card whose limited video memory cannot support very long sequence inputs, the ablation experiments load 2 along-way frames and 0 along-way frames respectively.
Step 8.2: limit the input range of the point cloud in the data set to -50 m ≤ x ≤ 50 m, -50 m ≤ y ≤ 50 m and -5 m ≤ z ≤ 3 m.
Step 8.3: set the voxelization range to 0 m ≤ d ≤ 50 m, -180° ≤ θ ≤ 180° and 0 ≤ φ ≤ 31, where φ is the nuScenes data set scanning ring ID and each lidar scanning ring ID corresponds to a fixed pitch angle.
Step 8.4: set the numbers of voxel grids in the two experiments to d_1 = 1408, θ_1 = 2048, φ_1 = 32 and d_2 = 1088, θ_2 = 1088, φ_2 = 32 respectively.
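A quick arithmetic check of the two grid configurations shows how much smaller the second voxelization is:

```python
# Voxel-cell counts of the two grid configurations in step 8.4.
g1 = 1408 * 2048 * 32   # first experiment
g2 = 1088 * 1088 * 32   # second experiment
print(g1, g2, round(g1 / g2, 2))   # the second grid has ~2.44x fewer cells
```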
And 8.5, setting the maximum training voxel number as 20000 and setting the maximum test voxel number as 40000.
Step 8.6: in the training stage, the optimizer is AdamW, the learning rate is set to 0.000144, the learning-rate adjustment policy is cyclic with a rising-step proportion of 0.3, a target maximum multiplier of 10 and a minimum multiplier of 0.0001, and training runs for 40 generations.
Step 8.7: simulation experiments illustrate the technical effects of the invention; the experimental index results are shown in the table below.
TABLE 1 continuous 3D target detection experiment results
According to the experimental results, when 2 along-way frames are loaded, the average precision of both experimental groups is greatly improved over the control group that loads 0 along-way frames (i.e., simulating the absence of the convolutional long short-term memory network), showing that the continuous 3D target detection method provided by the invention can indeed effectively exploit the temporal information in a continuous point cloud sequence. It is further observed that reducing the number of voxel grids from 1408 × 2048 × 32 to 1088 × 1088 × 32 does not substantially reduce the detection accuracy of the model. Combined with an analysis of the sensor parameters, the main reason is considered to be that the lidar generates only about 1080 ± 10 points per scanning cycle, so decreasing the number of division grids in the θ dimension from 2048 to 1088 causes almost no loss of spatial information in that dimension; the slight decrease in detection performance is attributed to the reduced number of divisions in the d dimension. The above is only intended to illustrate the technical idea of the present invention and does not limit its protection scope; any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A 3D target detection method based on a convolutional long short-term memory network, characterized by comprising the following steps:
Step 1, acquiring or constructing point cloud data, wherein the point cloud data are stored in a three-dimensional rectangular coordinate system coordinate form in a point cloud data set;
Step 2, converting a three-dimensional rectangular coordinate system of the point cloud data into a spherical coordinate system to realize the density homogenization of the point cloud in space;
step 3, carrying out voxel division on the point cloud space according to spherical coordinates to obtain voxel characteristics, and carrying out preliminary extraction on the voxel characteristics to unbind the voxel characteristics and the space absolute position;
Step 4, extracting intermediate features of the voxel features through a convolution network;
Step 5, extracting time features of the intermediate features through a convolution long-short-term memory network to obtain output features H n of the convolution long-short-term memory network;
Step 6, extracting multi-scale features of the H n to obtain a feature map H f;
step 7, generating an anchor frame by utilizing H f, and classifying the anchor frame, carrying out bounding box regression and angle regression;
And 8, setting super parameters and training parameters, training the convolution long-short-term memory network, and verifying the algorithm effect.
2. The 3D object detection method based on the convolutional long-term memory network according to claim 1, wherein the step 1 specifically comprises:
A nuScenes point cloud data set is used; the point cloud data comprise the XYZ coordinates (x, y, z) of each point and the reflection intensity I, denoted [(x_i, y_i, z_i, I_i)], where the subscript i refers to the sequence number of the corresponding data point.
3. The 3D object detection method based on the convolutional long-term memory network according to claim 1, wherein the step 2 specifically comprises:
Step 2.1: arrange the point cloud data [(x_i, y_i, z_i, I_i)] obtained in step 1 in time order to obtain continuous point cloud frames [(x_it, y_it, z_it, I_it)] with subscript t = 0, 1, 2, ..., n-1 in reverse frame order, i.e., t = 0 indicates that the point belongs to an annotated key frame, and increasing t steps back to earlier frames, forming a point cloud input sequence from a certain moment before the key frame up to the key frame;
step 2.2, encoding the continuous point cloud frames [ (x it,yit,zit,Iit) ] into an input sequence [ (x i,yi,zi,Ii, t) ];
Step 2.3: for each point (x_i, y_i, z_i, I_i) in the data set point cloud, calculate:
d_i = √(x_i² + y_i² + z_i²)
θ_i = arctan2(y_i, x_i)
φ_i = arcsin(z_i / d_i)
I_i = I_i
obtaining the point cloud sequence [(d_i, θ_i, φ_i, I_i, t)] in the spherical coordinate system, where d denotes the straight-line distance of the point from the origin (the lidar), θ is the azimuth of the point, φ is the pitch angle of the point, and I is the reflection intensity of the point;
step 2.4, re-decoding the point cloud sequence into point cloud frames in list form according to t=0, 1, 2..n-1 And the length n of each sequence in Batch is recorded.
4. The 3D object detection method based on the convolutional long-term memory network according to claim 3, wherein the step 3 is specifically:
Step 3.1: in the spherical coordinate system, divide the space into voxels whose side lengths in the three directions d, θ and φ are v_d, v_θ and v_φ respectively; since the division range is not infinite, the ranges of the three dimensions are [d_min, d_max], [θ_min, θ_max] and [φ_min, φ_max], where d_min and d_max represent the lower and upper limits of the target-to-radar distance, θ_min and θ_max represent the lower and upper limits of the azimuth dimension, and φ_min and φ_max represent the lower and upper limits of the pitch-angle dimension;
then, group the points of the point cloud according to the voxel grid each point falls into;
Step 3.2: voxelize each frame of point cloud in the point cloud frame list;
Step 3.3: extract the feature of each voxel in the following manner: for every voxel grid, take the average reflection intensity of all points in the grid, together with the relative distance from the mean position of the points in the voxel to the voxel center, as the voxel feature;
the resulting feature matrix is denoted feature F_0, in which each element is the feature information of one voxel grid of a single-frame point cloud, comprising the average reflection intensity of all points in the corresponding voxel grid and the relative distance from the mean position of the points in the voxel to the voxel center.
5. The 3D target detection method based on the convolution long-short-term memory network according to claim 4, wherein the network for extracting the voxel characteristics obtained in the step 3 is formed by sequentially connecting 6 convolution layers, and comprises 1 input layer, 4 middle layers and 1 output layer, and the specific structure comprises the input layer, the first middle layer, the second middle layer, the third middle layer, the fourth middle layer and the output layer.
6. The 3D object detection method based on the convolutional long-term memory network according to claim 5, wherein the specific steps of step 4 are as follows:
Step 4.1: input the feature matrix F_0 obtained in step 3.3 into the input layer and output feature F_1, the input layer consisting of one SubMConv3d convolution layer;
Step 4.2: input feature F_1 into the first intermediate layer and output feature F_2, the first intermediate layer comprising one SubMConv3d convolution layer;
Step 4.3: input feature F_2 into the second intermediate layer and output feature F_3, the second intermediate layer being formed by sequentially connecting 3 convolution layers;
its specific structure is SparseConv3d convolution layer → SubMConv3d convolution layer → SubMConv3d convolution layer;
Step 4.4: input feature F_3 into the third intermediate layer and output feature F_4, the third intermediate layer being formed by sequentially connecting 3 convolution layers;
its specific structure is SparseConv3d convolution layer → SubMConv3d convolution layer → SubMConv3d convolution layer;
Step 4.5: input feature F_4 into the fourth intermediate layer and output feature F_5, the fourth intermediate layer being formed by sequentially connecting 3 convolution layers;
its specific structure is SparseConv3d convolution layer → SubMConv3d convolution layer → SubMConv3d convolution layer;
Step 4.6: input feature F_5 into the output layer and output the intermediate feature F_s, the output layer consisting of one SubMConv3d convolution layer; F_s is a sequence of feature maps stored in matrix form.
7. The 3D target detection method based on the convolutional long short-term memory network according to claim 6, wherein in step 5 the intermediate feature F_s obtained in step 4.6 is input into the convolutional long short-term memory network to extract the time-dimension features of F_s; the convolutional long short-term memory network consists of a forget gate, an input gate, candidate memory cells and an output gate, and comprises the following computing operations:
forget gate:
f_t = σ(W_xf * X_t + W_hf * H_{t-1} + W_cf ∘ C_{t-1} + b_f)
wherein f_t is the output of the forget gate at time t, σ is the sigmoid function, * denotes matrix convolution, ∘ denotes the matrix Hadamard product, W_xf is the convolution weight matrix between the forget gate and the input X_t, W_hf is the convolution weight matrix between the forget gate and the hidden state H_{t-1} at time t-1, W_cf is the element-wise weight matrix between the forget gate and the memory cell state C_{t-1} at time t-1, and b_f is the bias matrix of the forget gate;
input gate:
I_t = σ(W_xi * X_t + W_hi * H_{t-1} + W_ci ∘ C_{t-1} + b_i)
wherein I_t is the output of the input gate at time t, W_xi is the convolution weight matrix between the input gate and the input X_t, W_hi is the convolution weight matrix between the input gate and the hidden state H_{t-1} at time t-1, W_ci is the element-wise weight matrix between the input gate and the memory cell state C_{t-1} at time t-1, and b_i is the bias matrix of the input gate;
candidate memory cell state:
C̃_t = tanh(W_xc * X_t + W_hc * H_{t-1} + b_c)
wherein C̃_t is the output of the candidate memory cell state at time t, tanh is the hyperbolic tangent activation function, W_xc is the convolution weight matrix between the candidate memory cell state and the input X_t, W_hc is the convolution weight matrix between the candidate memory cell state and the hidden state H_{t-1} at time t-1, and b_c is the bias matrix of the candidate memory cell state;
output gate:
O_t = σ(W_xo * X_t + W_ho * H_{t-1} + b_o)
wherein O_t is the output of the output gate at time t, W_xo is the convolution weight matrix between the output gate and the input X_t, W_ho is the convolution weight matrix between the output gate and the hidden state H_{t-1} at time t-1, and b_o is the bias matrix of the output gate;
hidden state:
H_t = O_t ∘ tanh(C_t)
wherein H_t is the output of the hidden state at time t, O_t is the output of the output gate at time t, and C_t is the memory cell state at time t;
memory cell state:
C_t = f_t ∘ C_{t-1} + I_t ∘ C̃_t
wherein C_t is the memory cell state at time t, C̃_t is the candidate memory cell state at time t, I_t is the output of the input gate at time t, f_t is the output of the forget gate at time t, and C_{t-1} is the memory cell state at time t-1.
8. The 3D object detection method based on the convolutional long-term memory network according to claim 7, wherein the step 5 specifically comprises:
Step 5.1: according to the sequence length n, input the (n-1)-th, ..., 1st, 0th feature maps of the intermediate feature F_s obtained in step 4.6 into the convolutional long short-term memory network in that order, the input of the i-th frame feature map being denoted X_i;
Step 5.2: after X_i is input, the forget gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting the forget gate information f_i;
Step 5.3: simultaneously, the input gate is computed from X_i, the hidden state H_{i-1} of the previous frame and the cell state C_{i-1}, outputting I_i;
Step 5.4: simultaneously, the candidate memory cell state is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting the candidate memory cell state C̃_i of the i-th frame;
Step 5.5: simultaneously, the output gate is computed from X_i and the hidden state H_{i-1} of the previous frame, outputting O_i;
Step 5.6: the cell state information C_i of the current frame is computed from the forget gate information f_i, the input gate output I_i, the candidate memory cell state C̃_i and the cell state information C_{i-1} of the previous frame;
Step 5.7: the hidden state H_i of the current frame is computed from the cell state C_i of the current frame and the output gate output O_i;
Step 5.8: input the next frame feature map X_{i+1} and repeat steps 5.2 to 5.7 until the key frame has been input, obtaining H_n, which is finally taken as the output feature of the whole convolutional long short-term memory network.
9. The 3D object detection method based on the convolutional long-term memory network according to claim 8, wherein the step 6 specifically comprises:
Step 6.1, marking the output characteristic H n in the step 5 as H 1, inputting H 1 into a first characteristic extraction layer for characteristic extraction, and outputting a characteristic diagram H f1 with highest resolution;
Step 6.2, downsampling H f1, marking the sampling result as H 2, inputting the downsampling result into a second feature extraction layer, and outputting a feature map H f2;
Step 6.3: downsample H_f2, denote the sampling result H_3, input it into the third feature extraction layer, and output feature map H_f3;
Step 6.4, up-sampling three feature graphs H f1、Hf2、Hf3 with different scales, and marking the feature graph of the sampling result as H f1_up、Hf2_up、Hf3_up;
and 6.5, merging the feature images H f1_up、Hf2_up、Hf3_up which are subjected to up-sampling and have the same dimension into a multi-scale feature image, and recording as H f.
10. The 3D object detection method based on the convolutional long-term memory network according to claim 9, wherein the step 7 specifically comprises:
Step 7.1: input the multi-scale feature map H_f obtained in step 6.5 into three fully connected layers, the feature of each point on the feature map predicting several anchor boxes through the fully connected layers, the sizes of the anchor boxes being set according to the sizes of the targets;
Step 7.2: map the generated anchor boxes and the ground-truth bounding boxes onto the d-θ plane and perform the intersection-over-union calculation in the d-θ plane, set different upper and lower thresholds for different target categories, assign anchor boxes whose intersection-over-union is above the upper threshold as positive samples, assign anchor boxes whose intersection-over-union is below the lower threshold as negative samples, and discard anchor boxes whose intersection-over-union lies between the upper and lower thresholds;
Step 7.3: compute the classification loss function L_cls of the anchor boxes as follows:
L_cls = -α(1 - p̂_i^c)^γ log(p̂_i^c)
wherein p̂_i^c is the predicted probability that the i-th anchor box belongs to the class-c target, there being two target classes in total, and α and γ are two hyperparameters;
Step 7.4: compute the angle loss function L_dir of the anchor boxes as follows:
L_dir = -Σ_r q_i^r log(p̂_i^r)
wherein p̂_i^r represents the probability that the predicted rotation angle of the i-th anchor box is r, and q_i^r is the corresponding true probability;
Step 7.5: compute the position loss function L_reg of the anchor boxes as follows:
L_reg = Σ_i SmoothL1(x̂_i - x_i), where SmoothL1(x) = 0.5x²/β for |x| < β and |x| - 0.5β otherwise,
wherein x̂_i is the predicted geometric center of the i-th anchor box, x_i is the geometric center of the real box of the corresponding target, and β is a hyperparameter;
Step 7.6: compute the total loss function of the anchor boxes by combining the losses of the three subtasks:
L_total = β_1 L_cls + β_2 L_reg + β_3 L_dir
wherein L_cls is the classification task loss, L_reg is the regression task loss, L_dir is the angle classification task loss, and β_1, β_2, β_3 are weight constant parameters of the three losses.
CN202310719201.6A 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network Active CN116758534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310719201.6A CN116758534B (en) 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310719201.6A CN116758534B (en) 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network

Publications (2)

Publication Number Publication Date
CN116758534A CN116758534A (en) 2023-09-15
CN116758534B true CN116758534B (en) 2025-09-23

Family

ID=87958454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310719201.6A Active CN116758534B (en) 2023-06-16 2023-06-16 3D object detection method based on convolutional long short-term memory network

Country Status (1)

Country Link
CN (1) CN116758534B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116794680A (en) * 2023-06-26 2023-09-22 西安电子科技大学 Logarithmic sphere coordinate 3D target detection method based on reflection intensity information guiding mechanism
CN119919929B (en) * 2025-04-03 2025-06-13 杭州曼孚科技有限公司 A method, device and medium for automatic annotation of point cloud based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113219493A (en) * 2021-04-26 2021-08-06 中山大学 End-to-end point cloud data compression method based on three-dimensional laser radar sensor
CN113268916A (en) * 2021-04-07 2021-08-17 浙江工业大学 Traffic accident prediction method based on space-time graph convolutional network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950467B (en) * 2020-08-14 2021-06-25 清华大学 Fusion network lane line detection method and terminal device based on attention mechanism
CN112529944B (en) * 2020-12-05 2022-11-18 东南大学 An End-to-End Unsupervised Optical Flow Estimation Method Based on Event Cameras
CN113378647B (en) * 2021-05-18 2024-03-29 浙江工业大学 Real-time track obstacle detection method based on three-dimensional point cloud
CN114979801A (en) * 2022-05-10 2022-08-30 上海大学 Dynamic video abstraction algorithm and system based on bidirectional convolution long-short term memory network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268916A (en) * 2021-04-07 2021-08-17 浙江工业大学 Traffic accident prediction method based on space-time graph convolutional network
CN113219493A (en) * 2021-04-26 2021-08-06 中山大学 End-to-end point cloud data compression method based on three-dimensional laser radar sensor

Also Published As

Publication number Publication date
CN116758534A (en) 2023-09-15

Similar Documents

Publication Publication Date Title
Wang et al. Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net
CN112101278B (en) Homestead point cloud classification method based on k-nearest neighbor feature extraction and deep learning
CN113705631B (en) A 3D point cloud target detection method based on graph convolution
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN108038445B (en) SAR automatic target identification method based on multi-view deep learning framework
CN108416378B (en) A large-scene SAR target recognition method based on deep neural network
CN116129234B (en) Attention-based 4D millimeter wave radar and vision fusion method
CN110929577A (en) An improved target recognition method based on YOLOv3 lightweight framework
CN116758534B (en) 3D object detection method based on convolutional long short-term memory network
CN107154048A (en) The remote sensing image segmentation method and device of a kind of Pulse-coupled Neural Network Model
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN116740561B (en) SAR target recognition method based on fusion of ASC features and multi-scale depth features
CN116503760A (en) UAV Cruise Detection Method Based on Semantic Segmentation of Adaptive Edge Features
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
CN118941526A (en) A road crack detection method, medium and product
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN115294565A (en) 3D target detection and parameterized radius learning method and system based on key points
CN112801928A (en) Attention mechanism-based millimeter wave radar and visual sensor fusion method
CN107529647B (en) Cloud picture cloud amount calculation method based on multilayer unsupervised sparse learning network
CN107341449A (en) A kind of GMS Calculation of precipitation method based on cloud mass changing features
CN118918482B (en) Natural resource measurement method and system based on remote sensing images
CN114140698A (en) Water system information extraction algorithm based on FasterR-CNN
CN112560799A (en) Unmanned aerial vehicle intelligent vehicle target detection method based on adaptive target area search and game and application
CN115424028B (en) Feature-enhanced lightweight SSD-based infrared target detection method
CN117876869A (en) A remote sensing image road extraction method, system, device and medium based on HRNet and multi-scale feature attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant