CN114842313B - Target detection method, device, electronic device and storage medium based on pseudo point cloud - Google Patents
- Publication number
- CN114842313B CN114842313B CN202210508913.9A CN202210508913A CN114842313B CN 114842313 B CN114842313 B CN 114842313B CN 202210508913 A CN202210508913 A CN 202210508913A CN 114842313 B CN114842313 B CN 114842313B
- Authority
- CN
- China
- Prior art keywords
- point cloud
- candidate frame
- target
- feature
- pseudo
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The disclosure provides a pseudo point cloud based target detection method and apparatus, an electronic device, and a storage medium. The target detection method based on the pseudo point cloud comprises the following steps: acquiring first pseudo point cloud data of a first image; acquiring 3D candidate frame information of the first image; acquiring first pseudo point cloud data of a 3D candidate frame; obtaining second pseudo point cloud data of the 3D candidate frame according to its first pseudo point cloud data; obtaining, through feature encoding according to the second pseudo point cloud data of the 3D candidate frame, a first feature vector of the target corresponding to the 3D candidate frame, the feature encoding making the target features represented by the second pseudo point cloud data consistent in distribution with target features represented by laser point cloud data; and obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame. The method improves the accuracy of pseudo point cloud based target detection and yields more precise detection results.
Description
Technical Field
The disclosure relates to the field of computer vision, and in particular relates to a target detection method, device, electronic equipment and storage medium based on pseudo point cloud.
Background
Pseudo point cloud based three-dimensional (3D) target detection algorithms take a pseudo point cloud as input and perform feature extraction and analysis on it to predict the position, size, and category of objects in 3D space, which plays an important role in fields such as autonomous driving and robotics. A pseudo point cloud is typically converted from the depth map of an RGB image and is consistent in form with a laser point cloud. Because the point cloud data form usually represents the shape of an object in 3D space well, existing laser point cloud based 3D detection algorithms perform well; current vision based 3D detection algorithms therefore typically estimate depth from the image to obtain a depth map, convert the pixels into 3D space according to the depth map to obtain a pseudo point cloud, and finally perform 3D target detection on the pseudo point cloud. At present, pseudo point cloud based 3D target detection often directly applies a laser point cloud 3D detection model; however, the depth information of the pseudo point cloud is not accurate enough, and the pseudo point cloud differs from the laser point cloud in distribution, so pseudo point cloud based target detection has low accuracy and poor precision and cannot meet the obstacle detection requirements of scenarios such as autonomous driving.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a target detection method, apparatus, electronic device and storage medium based on pseudo point cloud.
A first aspect of the present disclosure provides a target detection method based on a pseudo point cloud, including:
acquiring first pseudo point cloud data of a first image, wherein target features represented by the first pseudo point cloud data comprise three-dimensional (3D) features and category features;
acquiring 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
acquiring first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, wherein target features represented by the second pseudo point cloud data comprise 3D features, category features, and internal features of the target corresponding to the 3D candidate frame;
obtaining, through feature encoding according to the second pseudo point cloud data of the 3D candidate frame, a first feature vector of the target corresponding to the 3D candidate frame, the feature encoding making the target features represented by the second pseudo point cloud data consistent in distribution with target features represented by laser point cloud data; and
obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
A second aspect of the present disclosure provides a pseudo point cloud based target detection apparatus, comprising: a pseudo point cloud acquisition unit, configured to acquire first pseudo point cloud data of a first image, wherein target features represented by the first pseudo point cloud data comprise three-dimensional (3D) features and category features; a 3D target candidate frame extraction unit, configured to obtain 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image; a candidate frame pseudo point cloud unit, configured to acquire first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image; a feature association unit, configured to obtain second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, wherein target features represented by the second pseudo point cloud data comprise 3D features, category features, and internal features of the target corresponding to the 3D candidate frame; a feature encoding unit, configured to obtain, through feature encoding according to the second pseudo point cloud data of the 3D candidate frame, a first feature vector of the target corresponding to the 3D candidate frame, the feature encoding making the target features represented by the second pseudo point cloud data consistent in distribution with target features represented by laser point cloud data; and a detection frame acquisition unit, configured to obtain 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
A third aspect of the present disclosure provides an electronic device, comprising: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, such that the processor performs the above pseudo point cloud based target detection method.
A fourth aspect of the present disclosure provides a readable storage medium having stored therein execution instructions which, when executed by a processor, are configured to implement the above-described pseudo-point cloud-based target detection method.
By embedding category information into the pseudo point cloud, the present disclosure avoids the feature alignment problem that multi-modal fusion usually has to address, which effectively reduces computational complexity and improves processing efficiency. In addition, the second pseudo point cloud data builds association relationships among points within the pseudo point cloud, so that the point features inside a 3D candidate frame are no longer isolated. Furthermore, the feature encoding makes the pseudo point cloud target features consistent in distribution with the laser point cloud target features, optimizing away the data source discrepancy at the target level, which is more efficient than optimizing over the whole point cloud scene and improves detection precision. In other words, the present disclosure reduces the complexity of pseudo point cloud target detection, improves processing efficiency and precision, and makes detection results more accurate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a flow diagram of a pseudo point cloud based target detection method according to one embodiment of the present disclosure.
Fig. 2 is a flow diagram of acquiring first pseudo point cloud data of a first image according to one embodiment of the present disclosure.
Fig. 3 is an exemplary architectural diagram of a feature association network of one embodiment of the present disclosure.
Fig. 4 is a schematic flow diagram of obtaining second pseudo point cloud data of a 3D candidate box according to one embodiment of the disclosure.
Fig. 5 is a flow diagram of feature encoding of one embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a model structure and its execution in one embodiment of the present disclosure.
Fig. 7 is a flowchart of a specific implementation of a target detection method based on a pseudo point cloud according to an embodiment of the present disclosure.
Fig. 8 is a schematic block diagram of a pseudo point cloud based target detection apparatus employing a hardware implementation of a processing system according to one embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and the embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant content and not limiting of the present disclosure. It should be further noted that, for convenience of description, only a portion relevant to the present disclosure is shown in the drawings.
In addition, embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict. The technical aspects of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the exemplary implementations/embodiments shown are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Thus, unless otherwise indicated, features of the various implementations/embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concepts of the present disclosure.
A brief analysis of the related art is first performed below.
Related art 1: the Chinese patent with the publication number of CN112149550 discloses a multi-sensor fusion-based 3D target detection method for an automatic driving vehicle, although the technology realizes multi-level depth fusion of laser radar point cloud and optical camera images, the advantages of accurate space information of point cloud data and good target recognition capability of image data can be utilized more efficiently to improve the sensing accuracy of the automatic driving vehicle to the surrounding environment. However, the related technology needs to consider the alignment problem between the input characteristic data of the point cloud and the image in two different modes, and is complex, tedious and low in processing efficiency.
Related art 2: the Chinese patent with the patent publication number of CN112419494 discloses a method for fusing pseudo point cloud and laser to strengthen target point cloud, image data is converted into a pseudo point cloud form of a 3D space according to depth information, and then the pseudo point cloud and the laser point cloud are fused, although the obstacle detection precision is effectively improved, the direct fusion of the point cloud layer is greatly influenced by the accuracy of the position information of the pseudo point cloud, and the target detection is mainly carried out based on the laser point cloud, and the pseudo point cloud mainly plays an auxiliary role and is not suitable for a scene lacking the laser point cloud.
Related art 3: the Chinese patent with the patent publication number of CN113486887 discloses a feature fusion method of a laser point cloud and a pseudo point cloud region of interest, a pseudo point cloud auxiliary network is utilized to generate a stronger laser point cloud region of interest feature representation, but the laser point cloud and the pseudo point cloud have larger difference in distribution, so that effective fusion between the two features is difficult to realize, complex model design is involved, and the operation complexity is high and the processing efficiency is low. In addition, the related technology is also based on laser point cloud for target detection, and pseudo point cloud mainly plays an auxiliary role and is not suitable for a scene lacking laser point cloud.
In summary, the factors affecting pseudo point cloud based 3D detection precision are mainly two: the depth information is not accurate enough, and the laser point cloud and the pseudo point cloud differ in distribution, so the learned feature distributions are inconsistent. Specifically:
1) At present, pseudo point cloud based 3D target detection directly applies a laser point cloud model to the pseudo point cloud, without considering the differences between the two. The two have different sources: a laser point cloud is a sparse point cloud acquired by a lidar and has accurate depth information, whereas a pseudo point cloud is converted from an image depth map, its depth information is not necessarily accurate, and it is denser than a laser point cloud. In addition, the laser point cloud and the pseudo point cloud of the same target differ considerably in distribution: the pseudo point cloud is more divergent and has larger variance. Existing laser point cloud models are suited to target feature encoding of laser point clouds, and directly encoding a pseudo point cloud with such a model works poorly. Therefore, fully accounting for the difference between the laser point cloud and the pseudo point cloud can improve the accuracy of pseudo point cloud based 3D detection algorithms;
2) In the related art, the pseudo point cloud is obtained by converting the depth map into 3D space according to coordinates, and the fourth dimension (reflectivity) is generally filled with 1. This filling only keeps the data format consistent with the laser point cloud and brings no additional benefit; that is, the pseudo point cloud in the related art carries only position information and lacks semantic information. Most multi-modal fusion methods fuse semantic features extracted from the RGB image with point cloud features, but such fusion usually has to handle feature alignment, which is complex and computationally expensive and cannot meet the real-time requirements of scenarios such as autonomous driving.
In view of this, the present disclosure proposes a pseudo point cloud based target detection method and apparatus, an electronic device, and a storage medium, which effectively improve pseudo point cloud based 3D detection precision mainly through two improvements: 1) semantic information is embedded when generating the pseudo point cloud, and a feature association module is added to build association relationships among the pseudo points; 2) a laser point cloud guided first target feature encoding network is designed, in whose training stage the target feature encoding of the laser point cloud guides the generation of the pseudo point cloud target features, so that the pseudo point cloud target features tend to be consistent in distribution with the laser point cloud target features. The data source discrepancy is thus optimized at the target level, which is more efficient than optimizing over the whole point cloud scene.
The present disclosure is applicable to a scenario where 3D object detection needs to be performed. For example, the method and the device can be applied to detection of vehicles in the surrounding environment in the automatic driving field, and position information of other vehicles can be perceived in time, so that effective obstacle avoidance can be performed, and the vehicles can run more safely.
Specific embodiments of the present disclosure are described in detail below with reference to fig. 1 through 8.
Fig. 1 shows a flow diagram of a pseudo point cloud-based target detection method S10 of the present disclosure. Referring to fig. 1, the target detection method S10 based on the pseudo point cloud may include:
step S12, first pseudo point cloud data of a first image are obtained, and target features represented by the first pseudo point cloud data comprise 3D features and category features;
For example, the first image may be acquired by a camera such as an RGB camera, a depth camera, or the like. Taking a vehicle scene as an example, the first image may be, but is not limited to, a front image acquired by a front camera of the vehicle, where the front image includes a front environment of the vehicle, and 3D features such as a position, a size, a category, and the like of an obstacle in front of the vehicle may be obtained by performing object detection on the front image.
The 3D features may include 3D contour features, 3D shape features, and/or other similar features, which may be indicated by 3D space coordinates such as keypoints (e.g., center points, corner points, etc.), sizes in 3D space, and the like.
The first pseudo point cloud data, and the second pseudo point cloud data below, have four dimensions, one of which represents the category of the point. For example, the four dimensions may comprise three dimensions representing the 3D position, expressed as the X-axis, Y-axis, and Z-axis coordinate values in a 3D space coordinate system, and a category dimension represented by semantic information. The information of any point K in the first or second pseudo point cloud data may be expressed as (x, y, z, cls), where x, y, and z are the coordinates of point K in the 3D space coordinate system and cls is the semantic information of point K. The pseudo point cloud data obtained in this way therefore contains not only 3D position information but also rich semantic information, providing more information for pseudo point cloud based 3D target detection and effectively improving its precision and accuracy.
In some embodiments, referring to fig. 2, the process S20 of acquiring the first pseudo point cloud data of the first image in step S12 may include:
Step S22, obtaining depth information of a first image;
In some embodiments, the depth information of the first image may be acquired directly by a depth camera or generated using a depth estimation network. A first image containing depth information is thereby obtained, whose pixel data may be represented as (u, v, depth), where u and v represent the position of the pixel in the camera's pixel coordinate system and depth represents the depth of the pixel in the lidar coordinate system.
Step S24, semantic information of a first image is obtained;
In some embodiments, the semantic information of the first image may be obtained using a semantic segmentation network, thereby obtaining a first image containing semantic information, whose pixel data may be represented as (u, v, cls), where u and v represent the position of the pixel in the camera's pixel coordinate system and cls represents the semantic information of the pixel, which may indicate the pixel's category.
Step S26, generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and semantic information of the first image.
In some embodiments, step S26 may comprise: first, embedding the semantic information into the pixels of the first image with depth information, to obtain a first image containing both depth and semantic information in which any pixel's data may be expressed as (u, v, depth, cls); and then generating first pseudo point cloud data containing semantic information through coordinate conversion according to the camera parameters (for example, camera extrinsics and intrinsics) of the sensor that acquired the first image.
In some embodiments, the coordinate conversion formulas are as follows (1) to (3):
z = D(u, v)  (1)
x = (u − c_u) · z / f_u  (2)
y = (v − c_v) · z / f_v  (3)
where D(u, v) represents the depth value of the pixel (u, v), (c_u, c_v) represents the pixel coordinates corresponding to the camera center, f_u represents the horizontal focal length of the camera, and f_v represents the vertical focal length of the camera. Here, the camera refers to the camera that acquired the first image.
In step S12, fusing the depth map with the semantic segmentation result effectively embeds semantic category information into the pseudo point cloud, so that the pseudo point cloud data contains both the 3D position information (x, y, z) and the category-indicating semantic information cls, avoiding the feature alignment problem that multi-modal fusion usually has to address.
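As an illustration of the conversion in step S26, the back-projection of formulas (1) to (3) combined with the semantic embedding can be sketched as follows. This is a minimal NumPy sketch under the stated pinhole assumptions; the function name and array layout are illustrative, not the patent's:

```python
import numpy as np

def depth_to_pseudo_point_cloud(depth, seg, cu, cv, fu, fv):
    """Back-project a depth map plus a semantic map into a pseudo point cloud.

    Implements the pinhole back-projection of formulas (1)-(3):
        z = D(u, v)
        x = (u - cu) * z / fu
        y = (v - cv) * z / fv
    and appends the per-pixel class label cls as a fourth dimension,
    yielding points of the form (x, y, z, cls).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cu) * z / fu
    y = (v - cv) * z / fv
    cloud = np.stack([x, y, z, seg], axis=-1).reshape(-1, 4)
    return cloud[cloud[:, 2] > 0]  # keep only pixels with valid depth
```

With a principal point (c_u, c_v) = (1, 1) and focal lengths f_u = f_v = 2, a pixel at (0, 0) with depth 2 back-projects to (−1, −1, 2) and keeps its class label.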
Step S14, acquiring 3D candidate frame information of a first image according to first pseudo point cloud data of the first image;
Illustratively, the 3D candidate frame information may be expressed in the form (x_c, y_c, z_c, w_c, l_c, h_c, θ_c), where (x_c, y_c, z_c) represents the coordinates of the center point of the 3D candidate frame in the 3D space coordinate system, w_c represents the width of the 3D candidate frame, l_c its length, h_c its height, and θ_c the anchor orientation angle of the 3D candidate frame.
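For illustration, the seven-parameter candidate frame above can be expanded into its eight corner points. This is a hedged sketch that assumes the orientation angle is a yaw about the Z axis; the patent does not fix the axis convention here:

```python
import numpy as np

def box3d_corners(box):
    """Return the 8 corners of a 3D candidate frame (xc, yc, zc, wc, lc, hc, theta).

    (xc, yc, zc) is the center, (wc, lc, hc) the width/length/height, and theta
    is assumed to be a rotation about the Z axis (an illustrative convention).
    """
    xc, yc, zc, wc, lc, hc, theta = box
    # Axis-aligned corner offsets, then rotate about Z and translate.
    dx = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * lc / 2
    dy = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * wc / 2
    dz = np.array([-1, -1, -1, -1, 1, 1, 1, 1]) * hc / 2
    c, s = np.cos(theta), np.sin(theta)
    x = xc + c * dx - s * dy
    y = yc + s * dx + c * dy
    z = zc + dz
    return np.stack([x, y, z], axis=1)  # shape (8, 3)
```

An unrotated box of size 2 × 4 × 2 centered at the origin then spans ±2 along its length axis, ±1 along its width axis, and ±1 in height.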
In some implementations, the 3D candidate box of the first image may be obtained using the first pseudo point cloud data of the first image based on the pre-trained 3D candidate box detection model. Wherein the 3D candidate frame detection model is used to extract the 3D candidate frame. For example, the 3D candidate box detection model may be, but is not limited to, a region generation network (Region Proposal Network, RPN), any high quality other RPN, or a lightweight 3D detection network.
In some implementations, the 3D candidate box detection model may be an original point cloud-based RPN, a voxel-based RPN, or a point+voxel-based RPN.
For example, when the 3D candidate frame detection model is a raw point cloud based RPN, a PointNet or PointNet++ network may be used to extract features from the first pseudo point cloud data of the first image, and 3D candidate frames are then generated from the features and preset anchors.
For another example, when the 3D candidate frame detection model is a voxel based RPN, the first pseudo point cloud data of the first image is first divided into voxels: the points are distributed in a 3D space of size D × H × W along the Z, Y, and X axes, and the space is divided into small cubes, namely voxels, of length, width, and height v_D, v_H, and v_W, giving a total of (D / v_D) × (H / v_H) × (W / v_W) voxels, with an upper threshold set on the number of points per voxel. A voxel encoder then encodes the divided voxels, features are extracted from the encoded voxels through 3D sparse convolution, and 3D candidate frames are generated from the features and preset anchors.
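The voxel division described above can be sketched roughly as follows. The `extent`/`voxel_size` parameters and the dictionary-based grouping are illustrative assumptions, not the patent's implementation; points are assumed already offset into the range [0, extent):

```python
import numpy as np

def voxelize(points, extent, voxel_size, max_points_per_voxel=32):
    """Divide a pseudo point cloud of (x, y, z, cls) points into voxels.

    extent = (D, H, W) is the spatial range along Z, Y, X; voxel_size =
    (vD, vH, vW). The grid then holds (D/vD) * (H/vH) * (W/vW) voxels, and
    at most max_points_per_voxel points are kept per voxel (the upper
    threshold mentioned above).
    """
    D, H, W = extent
    vD, vH, vW = voxel_size
    grid = (int(D / vD), int(H / vH), int(W / vW))
    voxels = {}
    for p in points:
        # Grid index along (z, y, x); points outside the extent are dropped.
        idx = (int(p[2] // vD), int(p[1] // vH), int(p[0] // vW))
        if all(0 <= i < g for i, g in zip(idx, grid)):
            bucket = voxels.setdefault(idx, [])
            if len(bucket) < max_points_per_voxel:  # per-voxel point cap
                bucket.append(p)
    return grid, voxels
```

Two points falling in the same voxel with a cap of 1 illustrate the truncation: only the first is kept.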
Step S16, acquiring first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
In some embodiments, the first pseudo point cloud data of the 3D candidate frame may be extracted from the first pseudo point cloud data of the first image according to the 3D candidate frame information.
In some embodiments, after extracting the first pseudo point cloud data of the 3D candidate frame, the first pseudo point cloud data of the 3D candidate frame may be further encoded, so that the number of points of each 3D candidate frame is equal, thereby reducing errors existing in the candidate frames.
For example, encoding the first pseudo point cloud data of the 3D candidate frame may comprise: for each 3D candidate frame, in order to reduce the error in the candidate frame and wrap more target point cloud data, the frame may be enlarged into a cylinder of unlimited height whose bottom radius r satisfies formula (4):
r = (α / 2) · √(w² + l²)  (4)
where w and l respectively represent the width and length of the 3D candidate frame and α is a hyperparameter; in the experiments, α = 1.2 and the number N of points in each 3D candidate frame is 256.
In some embodiments, encoding the first pseudo point cloud data of the 3D candidate frame may further comprise: if the actual number M of points in the 3D candidate frame is larger than N, randomly selecting N of the M points and discarding the rest; if M is smaller than N, randomly selecting one point and duplicating its coordinates for the remaining N − M points, so that the number of points in the first pseudo point cloud data of the 3D candidate frame is padded to N.
Step S18, obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
The target features of the second pseudo point cloud data representation comprise 3D features, category features and internal features of the target corresponding to the 3D candidate frame. For example, the internal features may include, but are not limited to, internal geometric features, internal structural features, interrelated features between internal components, and/or other similar internal features.
In some implementations, the second pseudo point cloud data of the 3D candidate box can be obtained through a pre-trained feature association network. As shown in fig. 3, the exemplary structure of the feature association network 300 is shown, where the feature association network 300 includes an inter-point attention module 320, an inter-channel attention module 340, a position coding module 360, and a fusion module 380, the inter-point attention module 320 may be used to obtain a spatial association feature of an object corresponding to the 3D candidate frame, the inter-channel attention module 340 may be used to obtain a channel association feature of an object corresponding to the 3D candidate frame, the position coding module 360 may be used to obtain a 3D relative position feature of an object corresponding to the 3D candidate frame, and the fusion module 380 may be used to obtain second pseudo point cloud data of the 3D candidate frame.
The data of different points in the first pseudo point cloud data of each 3D candidate frame are mutually independent. In some embodiments, the feature association network 300 may be configured to construct association relationships between internal point data of the same 3D candidate frame, so as to characterize internal features of the target corresponding to the 3D candidate frame through the association relationships.
Referring to fig. 3, the input feature X of the feature association network 300 may be expressed as B×N×D, where B represents the number of 3D candidate frames, N represents the number of points per 3D candidate frame, and D represents the dimensionality of the first pseudo point cloud data. As described above, the first pseudo point cloud data may be expressed as (x, y, z, cls), that is, the dimensionality D of the first pseudo point cloud data is 4.
In some embodiments, referring to fig. 4, the process S40 of obtaining the second pseudo point cloud data of the 3D candidate box may include:
step S42, according to the first pseudo point cloud data of the 3D candidate frame, acquiring the spatial correlation characteristic of the target corresponding to the 3D candidate frame;
The first pseudo point cloud data of the 3D candidate frame may characterize a geometry of a target corresponding to the 3D candidate frame, that is, there is a spatial interdependence relationship between points in the 3D candidate frame, so the present disclosure adopts an inter-point attention mechanism to construct a spatial association relationship of the 3D candidate frame.
In some embodiments, referring to fig. 3, the specific implementation procedure of the point-to-point attention module 320, that is, the specific implementation procedure of step S42, may include:
The input feature X passes through three convolution layers to obtain three features A, C and G, respectively, and the output feature F 1 of the inter-point attention module 320 (i.e., the spatial correlation feature of the target corresponding to the 3D candidate frame) can be obtained by the following formulas (5) to (6):

S_ij = exp(A_i · C_j) / Σ_{k=1}^{N} exp(A_i · C_k) (5)

F_1 = S · G + X (6)

In formulas (5) and (6), N represents the number of points in the 3D candidate frame, S_ij represents the element value of the i-th row and j-th column of the attention map S, and A_i and C_j denote the i-th and j-th row vectors of A and C, respectively.
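A minimal PyTorch sketch of the inter-point attention module, assuming 1×1 convolutions produce A, C and G and using the softmax attention map of formula (5) with the residual form of formula (6); layer widths are assumptions, not from the patent:

```python
import torch
import torch.nn as nn

class InterPointAttention(nn.Module):
    """Sketch of the inter-point attention of formulas (5)-(6)."""
    def __init__(self, d=4):
        super().__init__()
        self.to_a = nn.Conv1d(d, d, 1)  # three 1x1 convolutions -> A, C, G
        self.to_c = nn.Conv1d(d, d, 1)
        self.to_g = nn.Conv1d(d, d, 1)

    def forward(self, x):                      # x: (B, N, D)
        xt = x.transpose(1, 2)                 # (B, D, N) for Conv1d
        a = self.to_a(xt).transpose(1, 2)      # (B, N, D)
        c = self.to_c(xt)                      # (B, D, N)
        g = self.to_g(xt).transpose(1, 2)      # (B, N, D)
        s = torch.softmax(a @ c, dim=-1)       # (B, N, N) point-to-point weights, Eq. (5)
        return s @ g + x                       # weighted sum plus residual, Eq. (6)
```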
Step S44, obtaining channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
In addition to the spatial interdependencies between points, the correlation between the internal channels of the point cloud needs to be paid attention to, so the present disclosure adopts a channel attention mechanism to pay attention to the importance of different channel features.
In some implementations, referring to FIG. 3, the inter-channel attention module 340 includes an average pooling layer, a linear layer, a Relu layer, a linear layer, and a sigmoid activation layer. In the inter-channel attention module 340 of fig. 3, from input to output, the first light gray square represents the average pooling layer, the second dark gray square and the third dark gray square represent two linear layers, respectively, and the Relu layers and sigmoid activation layers are directly indicated by arrows.
For example, the specific implementation of step S44 may include: the input feature X is first spatially compressed by the average pooling layer, then processed in sequence by a linear layer, a Relu layer and a linear layer, and then passed to the sigmoid activation layer, which outputs a weight matrix; finally, the weight matrix is multiplied by the input feature X to obtain the output feature F 2 of the inter-channel attention module 340, where F 2 is the channel association feature of the target corresponding to the 3D candidate frame.
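A sketch of this inter-channel attention in PyTorch (the hidden width is an assumption; the pooling-MLP-sigmoid-rescale pattern follows the description above):

```python
import torch
import torch.nn as nn

class InterChannelAttention(nn.Module):
    """Sketch of step S44: average-pool over points, then
    linear -> ReLU -> linear -> sigmoid to get per-channel weights."""
    def __init__(self, d=4, d_hidden=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d), nn.Sigmoid())

    def forward(self, x):              # x: (B, N, D)
        w = self.mlp(x.mean(dim=1))    # spatial compression -> (B, D) channel weights
        return x * w.unsqueeze(1)      # reweight each channel of the input
```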
Step S46, acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to the position information in the first pseudo point cloud data of the 3D candidate frame;
In some implementations, the point cloud data itself may contain 3D location information for location embedded encoding. For the position embedding of the point cloud, the present disclosure introduces 3D space relative coordinates, whose calculation formula is the following formula (7):
E = ψ(p_i − p_j) (7)
Where p i and p j represent the 3D position information (x, y, z) of points i and j, respectively, in the same 3D candidate frame, and ψ represents the operations of one linear layer, one Relu layer, one linear layer and sigmoid activation layer.
Referring to fig. 3, the position encoding module 360 may include a linear layer, a Relu layer, two linear layers and a sigmoid activation layer that are sequentially connected, where the input feature of the position encoding module 360 is 3D position information of any two points in the same 3D candidate frame, the output feature is F 3, and the output feature F 3 is a 3D relative position feature of the first pseudo point cloud data of the 3D candidate frame. Likewise, in the position-coding module 360, three dark grey squares represent three linear layers, respectively, and the Relu layers and the sigmoid activation layer are directly indicated by arrows.
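A sketch of the position encoding of formula (7), computing ψ(p_i − p_j) for all point pairs in a box; the output width and hidden width are assumptions, and ψ here uses the two-linear-layer form stated below formula (7):

```python
import torch
import torch.nn as nn

class RelativePositionEncoding(nn.Module):
    """Sketch of Eq. (7): E = psi(p_i - p_j) over all point pairs,
    with psi = linear -> ReLU -> linear -> sigmoid."""
    def __init__(self, d_out=4, d_hidden=16):
        super().__init__()
        self.psi = nn.Sequential(
            nn.Linear(3, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out), nn.Sigmoid())

    def forward(self, pos):                            # pos: (B, N, 3)
        diff = pos.unsqueeze(2) - pos.unsqueeze(1)     # (B, N, N, 3) pairwise p_i - p_j
        return self.psi(diff)                          # (B, N, N, d_out)
```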
And S48, obtaining second pseudo point cloud data of the 3D candidate frame according to the spatial correlation characteristic, the channel correlation characteristic and the 3D relative position characteristic of the target corresponding to the 3D candidate frame.
In some embodiments, step S48 may include: the 3D relative position feature F 3 of the target corresponding to the 3D candidate frame is concatenated with the spatial correlation feature F 1 and with the channel correlation feature F 2, respectively; the two concatenations are each processed by a multi-layer perceptron (MLP, Multilayer Perceptron), and the two results are multiplied element-wise (pixel by pixel) to obtain the output feature Y, where Y is the second pseudo point cloud data of the 3D candidate frame.
In some embodiments, referring to fig. 3, the fusion module 380 includes two splice layers (concat), two MLP layers, and one network layer for performing multiplication operations, and the input features of the fusion module 380 include a 3D relative position feature F 3, a spatial correlation feature F 1, and a channel correlation feature F 2, and the output feature is Y.
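A sketch of the fusion module in PyTorch. Shapes and MLP widths are assumptions; in particular, F 3 is pairwise (per point pair) while F 1 and F 2 are per point, so this sketch pools F 3 over the pair dimension before concatenation, which is an assumption not stated in the patent:

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of step S48 / fusion module 380: concat F3 with F1 and with F2,
    pass each concatenation through its own MLP, multiply the results."""
    def __init__(self, d=4):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        self.mlp2 = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, f1, f2, f3):                 # f1, f2: (B, N, D); f3: (B, N, N, D)
        f3 = f3.mean(dim=2)                        # pool pairwise features per point (assumption)
        a = self.mlp1(torch.cat([f3, f1], dim=-1)) # splice + MLP, branch with F1
        b = self.mlp2(torch.cat([f3, f2], dim=-1)) # splice + MLP, branch with F2
        return a * b                               # element-wise product -> Y
```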
From the above, the size of the second pseudo point cloud data is consistent with that of the first pseudo point cloud data. The feature representations in the first pseudo point cloud data are independent, whereas the second pseudo point cloud data constructs association relationships among the pseudo points through the attention mechanism: the data of each point in a 3D candidate frame is fused with the spatial association and position information of the other points in that frame, so that the 3D candidate frame has a stronger representation capability.
For example, the second pseudo point cloud data of the 3D candidate frame may be expressed as N = {p_1, …, p_N}, where p_i represents the data of the i-th point in the 3D candidate frame, i = 1, …, N, and N is the number of points in the 3D candidate frame. As previously described, the number of points N in the 3D candidate box may be 256.
In step S14, feature association is performed on the data of the points in the same 3D candidate frame by using a feature association network, so that an association relationship between pseudo point clouds is fully constructed, so that the features of each point in the 3D candidate frame are not isolated any more, and the information of surrounding points is fused.
Step S11, according to second pseudo point cloud data of the 3D candidate frame, a first feature vector of a target corresponding to the 3D candidate frame is obtained through feature coding, and the feature coding can enable target features represented by the second pseudo point cloud data to be consistent with target features represented by the laser point cloud data in distribution;
in some embodiments, referring to fig. 5, step S11 may include:
Step S52, encoding second pseudo point cloud data of the 3D candidate frame into pseudo point cloud target feature data of the 3D candidate frame by adopting a key point relative coordinate encoding mode, wherein the pseudo point cloud target feature data is consistent with the laser point cloud target feature data in distribution, and the laser point cloud target feature data is obtained by encoding the laser point cloud data of the 3D candidate frame;
In step S54, the pseudo point cloud target feature data is decoded into a first feature vector of a fixed size.
In some embodiments, feature encoding may be performed on the second pseudo point cloud data of the 3D candidate frame through the pre-trained first target feature encoding network, so as to obtain a first feature vector of a target corresponding to the 3D candidate frame.
In some implementations, the first target feature encoding network may include an encoder (encoder) and a decoder (decoder). The encoder may be configured to encode the second pseudo point cloud data of the 3D candidate frame into pseudo point cloud target feature data of the 3D candidate frame by using a key point relative coordinate encoding manner, and the decoder may be configured to decode the pseudo point cloud target feature data output by the encoder into a first feature vector with a fixed size.
For example, the encoder may encode the relationship between each point in the 3D candidate frame and each corner of the 3D candidate frame using a key point relative coordinate encoding scheme.
For example, the relative coordinates between the N points in the 3D candidate frame and the 8 corner points of the 3D candidate frame may be expressed as the following formula (8):

Δp_i^j = p_i − p_j (8)

where p_j denotes the coordinates of the j-th corner point, and Δp_i^j represents the relative coordinates of the i-th point in the 3D candidate frame with respect to the j-th corner point.
For example, the relative coordinates of the i-th point in the 3D candidate frame and the center point p_c of the 3D candidate frame may be expressed as Δp_i^c = p_i − p_c.
For example, after encoding, the feature of each point in the 3D candidate frame is expressed as the following formula (9):

f_i = Linear([Δp_i^1, …, Δp_i^8, Δp_i^c, cls_i]) (9)

where cls_i represents the class information of the i-th point, and Linear(·) represents a linear layer that maps the feature of the i-th point to a high-dimensional space. In the experiments, d = 28 in formula (9), which matches the 8 × 3 relative corner coordinates plus the 3 relative center coordinates plus the 1 class channel.
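A sketch of the key-point relative-coordinate encoding of formulas (8)-(9), producing the 28-dimensional per-point feature (8·3 corner offsets + 3 center offsets + 1 class channel); the subsequent linear projection to a higher dimension is omitted, and all names are illustrative:

```python
import numpy as np

def encode_box_points(points, corners, center):
    """Encode each point (x, y, z, cls) of a 3D candidate box by its offsets
    to the 8 box corners and the box center, concatenated with its class."""
    xyz, cls = points[:, :3], points[:, 3:4]
    # (N, 8, 3) offsets to the 8 corners, flattened to (N, 24)
    rel_corners = (xyz[:, None, :] - corners[None, :, :]).reshape(len(points), -1)
    rel_center = xyz - center[None, :]                       # (N, 3) offset to center
    return np.concatenate([rel_corners, rel_center, cls], axis=1)  # (N, 28)
```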
The encoded target point features then pass through a multi-head self-attention mechanism (Multi-Head Self-Attention) so that each candidate frame feature encodes rich context; the encoder in the target encoding network model may be stacked with 3 multi-head self-attention structures.
The decoder decodes the pseudo point cloud target feature data output by the encoder to obtain a global feature vector y of fixed size (for example, 1×D); this fixed-size global feature vector y is the first feature vector of the target corresponding to the 3D candidate frame. In this way, the second pseudo point cloud data of each 3D candidate box may be represented as a single 1×D vector y.
Illustratively, the output characteristic of the decoder, i.e., the first characteristic vector of the object corresponding to the 3D candidate box, may be expressed as the following formula (10):
f_pse = {y_1, y_2, …, y_N} (10)
Wherein N is the number of 3D candidate frames, f pse represents the set of the first feature vectors of the targets corresponding to all the 3D candidate frames in the first image, and y represents the first feature vector of the target corresponding to a single 3D candidate frame.
Consider the differences between the pseudo point cloud and the laser point cloud. For the pseudo point cloud and the laser point cloud in the same 3D candidate frame, the main differences are: 1) the depth information of the laser point cloud is accurate, while the depth information of the pseudo point cloud contains errors; 2) the laser point cloud is acquired by a lidar emitting laser beams at fixed angular intervals, so it is regular and sparse, whereas the pseudo point cloud is obtained by converting image pixels according to a depth map and coordinates, so it is irregular and dense. The two therefore differ considerably in their three-dimensional distribution, especially in the mean and variance along the X, Y and Z directions. In view of this, to reduce the network performance loss caused by such data-source differences and let the first target feature encoding network learn a more effective feature representation, the present disclosure introduces a laser point cloud encoding branch during network training to encode the laser point cloud target features, and uses these features to guide the generation of the pseudo point cloud target features, so that the pseudo point cloud target feature data tends to be consistent with the laser point cloud target features in distribution. In this way, guidance by the laser point cloud can improve the target detection precision of the pseudo point cloud.
In some embodiments, the first target feature encoding network is trained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame, and the second feature vector is obtained by the second target feature encoding network according to laser point cloud data corresponding to the 3D candidate frame.
The second target feature encoding network is obtained through training according to the laser point cloud data and is used for feature encoding of the laser point cloud data of the 3D candidate frame to obtain a second feature vector of a target corresponding to the 3D candidate frame.
The second target feature encoding network has the same structure as the first target feature encoding network. That is, it includes an encoder and a decoder that function identically to the corresponding components in the first target feature encoding network, except that the input data is the laser point cloud data of the 3D candidate frame. The laser point cloud data has four dimensions, three representing the 3D position and one the reflectivity, and may be represented as (x, y, z, r), where x, y, z are the three-dimensional coordinates in the 3D rectangular space coordinate system and r is the reflectivity. The output is the second feature vector of the target corresponding to the 3D candidate frame, which has the same size as the first feature vector (for example, 1×D).
Illustratively, the output feature of the second target feature encoding network, that is, the second feature vector of the target corresponding to the 3D candidate box, may be represented as the following formula (11):

f_lidar = {ŷ_1, ŷ_2, …, ŷ_N} (11)

where f_lidar denotes the set of second feature vectors of the targets corresponding to all 3D candidate frames of the first image, and ŷ represents the second feature vector of the target corresponding to a single 3D candidate frame.
In a specific application, the corresponding environmental laser point cloud data may be acquired while the first image is acquired, and after the 3D candidate frame information is acquired, the laser point cloud data corresponding to the 3D candidate frame may be extracted from the environmental laser point cloud data according to the 3D candidate frame information (i.e., (x c,yc,zc,wc,lc,hc,θc) above).
In some embodiments, the loss function of the first target feature encoding network may include a feature similarity loss, where the feature similarity loss is obtained from the first feature vector and the second feature vector of the target corresponding to the 3D candidate frame. In this way, the present disclosure constrains the generation of the pseudo point cloud target features by adding the feature similarity loss to the loss function of the first target feature encoding network, so that the distribution of the pseudo point cloud target features stays consistent with that of the laser point cloud target features.
Preferably, the feature similarity loss may be a KL divergence loss between the first feature vector and the second feature vector of the object corresponding to the 3D candidate frame. KL divergence, also known as relative entropy, can be used to measure the distance between two distributions. The feature similarity penalty may be obtained by a KL divergence penalty function (KLDivLoss).
In some embodiments, assuming that the output data of the second target feature encoding network is f_lidar = {ŷ_1, ŷ_2, …, ŷ_N} and the output data of the first target feature encoding network is f_pse = {y_1, y_2, …, y_N}, the KL divergence loss between the two can be obtained as follows:

L_KLD = (1/N) · Σ_{j=1}^{N} KL(ŷ_j ‖ y_j) (12)

In formula (12), L_KLD represents the feature similarity loss, N represents the number of 3D candidate frames, and ŷ_j represents the second feature vector of the target corresponding to the j-th 3D candidate frame, j = 1, 2, …, N.
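A hedged PyTorch sketch of the feature similarity loss L_KLD. Treating the feature vectors as distributions via softmax is an assumption of this sketch (the patent only states that a KL divergence loss, e.g. KLDivLoss, is used); the laser branch is detached, matching the frozen, no-gradient laser branch described later:

```python
import torch
import torch.nn.functional as F

def feature_similarity_loss(f_pse, f_lidar):
    """KL divergence between pseudo point cloud target features (f_pse) and
    laser point cloud target features (f_lidar), averaged over boxes."""
    log_p = F.log_softmax(f_pse, dim=-1)        # predicted distribution (log space)
    q = F.softmax(f_lidar.detach(), dim=-1)     # frozen laser branch: no gradient
    return F.kl_div(log_p, q, reduction="batchmean")
```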
And S13, obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, after step S13 or in step S13, the method may further include: and obtaining the confidence coefficient of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, the 3D detection frame information may include information of a center point position, a size, and the like of the 3D detection frame. For example, the 3D detection frame information may be expressed in the form of the foregoing 3D candidate frame information, that is, including coordinates of a center point of the 3D detection frame in a 3D space coordinate system, width, length, and height of the 3D detection frame, orientation of the 3D detection frame.
In some embodiments, the 3D detection frame information and the confidence of the 3D detection frame may be obtained through a Feed-Forward Network (FFN). For example, the 3D detection frame information may be obtained based on a pre-trained first feedforward neural network and/or the confidence of the 3D detection frame may be obtained based on a pre-trained second feedforward neural network. That is, the first feedforward neural network is used to implement 3D detection frame regression, and the second feedforward neural network is used to calculate the confidence of the 3D detection frame.
Fig. 6 illustrates a model structure for implementing a pseudo point cloud-based target detection method and a schematic diagram of an implementation process thereof according to an embodiment of the present disclosure.
As shown in fig. 6, the first pseudo point cloud data of the first image sequentially passes through the 3D candidate frame detection network, the feature association network and the first target feature encoding network to obtain a first feature vector of a target corresponding to the 3D candidate frame, and the first feature vector of the target corresponding to the 3D candidate frame respectively passes through the first feedforward neural network and the second feedforward neural network to obtain the 3D detection frame information and the confidence coefficient of the 3D detection frame.
In some embodiments, the training process of the network includes a separate stage and a joint stage. In the separate stage, the 3D candidate frame detection network, the feature association network, the first target feature encoding network, the first feedforward neural network and the second feedforward neural network may be trained with pseudo point cloud data containing original labeling frames; in the joint stage, on the basis of the training results of the separate stage, these networks are trained with both the pseudo point cloud data and the laser point cloud data.
During training of the separate stage, the loss function of the 3D candidate frame detection network, the feature association network, the first target feature encoding network, the first feedforward neural network, and/or the second feedforward neural network may be represented as formula (13):

L_total = α_1·L_RPN + α_2·L_conf + α_3·L_reg (13)

where α_1, α_2 and α_3 are weight coefficients, L_total represents the function value of the loss function, L_conf represents the detection frame confidence calculation loss, L_reg represents the 3D detection frame regression loss, and L_RPN represents the loss of the 3D candidate frame detection network.
During training of the joint phase, the loss function of the 3D candidate block detection network, the feature correlation network, the first target feature encoding network, the first feedforward neural network, and/or the second feedforward neural network may be represented as formula (14):
L_total = α_1·L_RPN + α_2·L_conf + α_3·L_reg + α_4·L_KLD (14)
Where L total represents the function value of the loss function, α 1、α2、α3、α4 represents the weight coefficient, L KLD represents the feature similarity loss, L conf represents the detection box confidence calculation loss, L reg represents the 3D detection box regression loss, and L RPN represents the loss of the 3D candidate box detection network (e.g., RPN).
In some embodiments, L RPN relates primarily to classification loss L cls, 3D detection box regression loss L reg, i.e., L RPN can be derived by the following formula (15):
L_RPN = β_1·L_cls + β_2·L_reg (15)

where β_1 is the weight coefficient of L_cls and β_2 is the weight coefficient of L_reg.
In some embodiments, the classification loss function used to obtain the classification loss L_cls may be the Focal loss, which can effectively suppress the problem of imbalance between positive and negative samples. Specifically, the classification loss function can be expressed as the following formulas (16) to (17):

p_t = p_i if the i-th category is the true category, otherwise p_t = 1 − p_i (16)

L_cls = −α·(1 − p_t)^γ·log(p_t) (17)

where n represents the number of categories, p_i represents the prediction score for the i-th category, and α and γ are two hyper-parameters.
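A hedged PyTorch sketch of the focal loss described above, in its common binary per-category form; the `alpha` and `gamma` defaults are conventional choices, not values from the patent:

```python
import torch

def focal_loss(scores, labels, alpha=0.25, gamma=2.0):
    """Focal loss: down-weights easy examples by (1 - p_t)^gamma so that
    the many easy negatives do not dominate training."""
    p = torch.sigmoid(scores)
    p_t = torch.where(labels == 1, p, 1 - p)  # p_t as in Eq. (16)
    alpha_t = torch.where(labels == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
```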
In some embodiments, the confidence prediction loss L_conf is calculated using cross entropy, expressed by the following formulas (18) to (19):

L_conf = −c_t·log(c) − (1 − c_t)·log(1 − c) (18)

c_t = min(1, max(0, (IoU − α_B)/(α_F − α_B))) (19)

where c is the confidence score of the predicted frame, c_t is the confidence true value of the predicted frame, IoU is the intersection-over-union of the 3D candidate frame and the labeling frame, and α_F and α_B are the IoU thresholds for distinguishing the foreground and the background, respectively. In the experiments, α_F = 0.75 and α_B = 0.25.
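A sketch of this confidence loss in PyTorch. The linear interpolation of the target c_t between the two IoU thresholds is an assumption of this sketch (the patent states only that α_F and α_B separate foreground from background); the loss itself is the binary cross entropy of formula (18):

```python
import torch

def confidence_loss(c, iou, alpha_f=0.75, alpha_b=0.25):
    """Confidence target c_t: 1 above the foreground IoU threshold, 0 below
    the background one, linearly interpolated in between (assumption)."""
    c_t = ((iou - alpha_b) / (alpha_f - alpha_b)).clamp(0.0, 1.0)
    return -(c_t * torch.log(c.clamp(min=1e-8))
             + (1 - c_t) * torch.log((1 - c).clamp(min=1e-8))).mean()
```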
In some implementations, the 3D detection box regression loss L_reg actually regresses the offset of the anchor relative to the original labeling box (ground truth box).
Considering the mechanism of the anchor, what is actually regressed is the offset relative to the anchor (x_a, y_a, z_a, w_a, l_a, h_a, θ_a). Therefore, in the training process, the original labeling frame (ground truth box) (x_g, y_g, z_g, w_g, l_g, h_g, θ_g) is first encoded into the offset relative to the anchor, as shown in the following formulas (20) to (22):

x_t = (x_g − x_a)/d_a, y_t = (y_g − y_a)/d_a, z_t = (z_g − z_a)/h_a (20)

w_t = log(w_g/w_a), l_t = log(l_g/l_a), h_t = log(h_g/h_a) (21)

θ_t = θ_g − θ_a (22)

where d_a = √(w_a² + l_a²) is the diagonal of the anchor's bottom surface.
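A sketch of this offset encoding, following the standard SECOND/VoxelNet-style scheme that formulas (20)-(22) appear to use (the diagonal normalization is an assumption where the patent's formula bodies were lost):

```python
import numpy as np

def encode_gt_to_anchor_offsets(gt, anchor):
    """Encode a ground-truth box (x, y, z, w, l, h, theta) as offsets relative
    to an anchor: diagonal-normalized centers, log size ratios, angle difference."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(wa ** 2 + la ** 2)          # anchor bottom-surface diagonal
    return np.array([(xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
                     np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
                     tg - ta])
```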
box_t(x_t, y_t, z_t, w_t, l_t, h_t, θ_t) is the target offset to be regressed, that is, it satisfies the following formula (23):

box_t = (x_t, y_t, z_t, w_t, l_t, h_t, θ_t) (23)
the calculation formula for predicting the center point position of the detection frame and the length, width and height dimensions is shown in the following formula (24):
L_reg-loc+dim = (box_prediction − box_t)² (24)
Wherein box prediction represents the predicted offset of the target and box t represents the to-be-regressed offset of the target.
The prediction of the orientation angle mainly includes direction prediction and angle prediction. The direction prediction can be simplified into a two-classification problem, and a loss form of cross entropy is adopted, and the calculation formula is shown as the following formula (25):
Ldir=-ylog(p(x))-(1-y)log(1-p(x)) (25)
Wherein p (x) represents the prediction result of x, and y represents the true classification label.
The angle loss part may take the Smooth L1 form applied to a sine function, with the calculation formula shown as the following formula (26):

L_reg-θ = SmoothL1(sin(θ_p − θ_t)) (26)
wherein θ p represents a predicted value.
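A sketch of the sine-based angle loss of formula (26). Applying Smooth L1 to sin(θ_p − θ_t) makes orientations that differ by π incur no angle penalty, which is why the separate direction classifier of formula (25) is needed to resolve the flip:

```python
import torch
import torch.nn.functional as F

def angle_loss(theta_p, theta_t):
    """Smooth L1 on sin(theta_p - theta_t), per formula (26)."""
    return F.smooth_l1_loss(torch.sin(theta_p - theta_t),
                            torch.zeros_like(theta_p))
```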
Thus, the 3D detection frame regression loss L reg can be obtained by the following formula (27):
L_reg = γ_1·L_reg-loc+dim + γ_2·L_dir + γ_3·L_reg-θ (27)

where γ_1, γ_2 and γ_3 are the weight coefficients of the respective losses.
The 3D detection frame regression loss L_reg regresses the offset of the 3D candidate frame relative to the original labeling frame (ground truth box).
For example, the 3D detection frame regression loss L reg may be calculated by first screening the detection frames, and only calculating the loss of the detection frame with the intersection ratio (Intersection-over-Union, ioU) of the detection frame and the labeling frame being greater than the set threshold, where the threshold may be 0.55.
Fig. 6 also shows the laser point cloud branch and the second target feature encoding network. In the present disclosure, the laser point cloud branch is used only for auxiliary training and is not needed in model inference. Illustratively, the training method of the network is: the pseudo point cloud branch and the laser point cloud branch are each trained independently for 100 epochs with the learning rate set to 0.001; then the trained parameters are loaded for joint training, during which the parameters of the laser point cloud branch are frozen and no gradient back-propagation is performed for them, as they are used only to guide the generation of the pseudo point cloud target features and are no longer updated. The joint training runs for 50 epochs with the learning rate set to 0.0005, and finally the model with the best detection effect on the validation set is selected for detecting the pictures of the test set.
Therefore, the pre-trained laser point cloud branches are adopted to guide the generation of the pseudo point cloud target characteristics, so that the encoding results of the pseudo point cloud target characteristics and the distribution of the laser point cloud target characteristics tend to be consistent, the data source diversity is optimized in the target layer, and compared with the optimization of the whole point cloud scene, the method is more efficient, and meanwhile, the detection precision is improved.
Fig. 7 shows a flowchart of a specific implementation of a pseudo point cloud-based 3D target detection method according to an embodiment of the present disclosure.
As shown in fig. 7, an exemplary implementation flow S70 of the pseudo point cloud-based 3D object detection method may include:
Step S72, generating first pseudo point cloud data containing category information by using depth information of the first image and semantic segmentation results obtained through a semantic segmentation network;
Step S74, the first pseudo point cloud data is processed by a 3D candidate frame detection network (e.g. RPN) to obtain 3D candidate frame information and corresponding category information;
step S76, extracting first pseudo point cloud data of the target, namely, extracting the first pseudo point cloud data of the 3D candidate frame according to the 3D candidate frame information;
Step S78, the first pseudo point cloud data of the 3D candidate frame is processed through a feature association network, and an association relation between the pseudo point clouds is constructed, so that second pseudo point cloud data of the 3D candidate frame is obtained, wherein the second pseudo point cloud data can simultaneously represent 3D features, category features and internal features of the target, and the internal features can be internal geometric features, internal structural features and/or other features related to the interior of the target;
Step S71, second pseudo point cloud data of the 3D candidate frame is processed through a first target feature encoding network to generate a first feature vector of a target corresponding to the 3D candidate frame;
Step S73, the first feature vector of the target corresponding to the 3D candidate frame is processed by the first feedforward neural network, the regression of the 3D detection frame is executed, information such as the length, the width, the height and the coordinates of the center point in the 3D space coordinate system of the 3D detection frame are obtained, meanwhile, the first feature vector of the target corresponding to the 3D candidate frame is processed by the second feedforward neural network, the confidence coefficient calculation is executed, and the confidence coefficient of the 3D detection frame is obtained.
Step S75, judging whether the method is currently in the training stage; if so, continuing with step S77, otherwise ending the current flow.
Step S77, if the training stage is currently in progress, determining a feature similarity loss L KLD according to the first feature vector of the target corresponding to the 3D candidate frame obtained in step S71 and the second feature vector of the target corresponding to the 3D candidate frame obtained in step S714;
Step S79, the 3D detection frame information and the confidence coefficient of the 3D detection frame obtained in step S73 are reversely propagated in the pseudo point cloud branch shown in FIG. 6, and gradient feedback is carried out by using the loss function of the formula (14) in the reverse propagation process, so that parameters of the 3D candidate frame detection network, the feature association network, the first target feature encoding network, the first feedforward neural network and the second feedforward neural network are updated to complete the round of training.
Step S711, determining whether the convergence condition is satisfied, if yes, ending the training, otherwise, returning to step S72, and continuing to execute the next training round.
If currently in the training phase, the following steps may also be executed in parallel, after step S74 and before step S77:
step S712, extracting laser point cloud data of the target, that is, extracting laser point cloud data of the 3D candidate frame, from the environmental laser point cloud data corresponding to the first image, according to the 3D candidate frame information;
For example, the first image may be acquired by a sensor such as an in-vehicle camera, and the laser point cloud data of the surrounding environment may be acquired by a sensor such as an in-vehicle lidar, where the field of view of the in-vehicle lidar at least partially coincides with the field of view of the in-vehicle camera, so that the environmental laser point cloud data corresponding to the first image may be obtained.
Step S714, the laser point cloud data of the 3D candidate frame is processed by the second target feature encoding network to obtain a second feature vector of the target corresponding to the 3D candidate frame, after which the flow jumps to step S77.
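Step S712 amounts to cropping the environmental laser point cloud with the candidate frame geometry. A minimal sketch follows, assuming an axis-aligned box for simplicity (a real candidate frame would also carry a yaw rotation, requiring the points to be transformed into the box frame first):

```python
import numpy as np

def points_in_box(points, center, size):
    """Return the points lying inside an axis-aligned 3D candidate frame.

    points: (N, 3) laser point cloud
    center: (3,) box center; size: (3,) box length/width/height
    """
    half = np.asarray(size) / 2.0
    mask = np.all(np.abs(points - np.asarray(center)) <= half, axis=1)
    return points[mask]

cloud = np.array([[0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0],
                  [5.0, 5.0, 5.0]])
inside = points_in_box(cloud, center=[0.5, 0.5, 0.5], size=[2.0, 2.0, 2.0])
print(inside.shape)  # (2, 3): the first two points fall inside the box
```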
The method and the device provide a more concise semantic information fusion mode. First, a pseudo point cloud containing semantic category information is generated from the depth map and the semantic segmentation result of an image, embedding the semantic information directly in the pseudo point cloud. Second, the association relations of the pseudo point cloud are fully constructed through a feature association network, which builds the spatial association relations between pseudo point cloud points and the channel association relations inside the pseudo point cloud through an attention mechanism combined with 3D position embedding.
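The pseudo point cloud generation above can be sketched as a back-projection of the depth map through a pinhole camera model, with the per-pixel semantic class appended as a fourth dimension. The intrinsics (fx, fy, cx, cy) and the camera-frame output are assumptions for illustration, not the exact coordinate conversion of the disclosure:

```python
import numpy as np

def depth_to_pseudo_cloud(depth, classes, fx, fy, cx, cy):
    """Back-project a depth map into a 4D pseudo point cloud (x, y, z, class).

    depth:   (H, W) per-pixel depth in meters
    classes: (H, W) per-pixel semantic class id from segmentation
    fx, fy, cx, cy: pinhole camera intrinsics (assumed known)
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (us - cx) * z / fx          # standard pinhole back-projection
    y = (vs - cy) * z / fy
    cloud = np.stack([x, y, z, classes.astype(float)], axis=-1)
    return cloud.reshape(-1, 4)     # one 4D point per pixel

depth = np.full((2, 2), 4.0)
classes = np.array([[0, 1], [1, 2]])
cloud = depth_to_pseudo_cloud(depth, classes, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(cloud.shape)                  # (4, 4): four points, each (x, y, z, class)
```

The fourth dimension carries the category of each point, matching the four-dimensional pseudo point cloud described in the embodiments.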
The present disclosure also provides a first target feature encoding network for pseudo point cloud data guided by a laser point cloud, enabling the network to learn a more powerful target feature representation. A laser point cloud is introduced during network training, and two target feature encoding networks each encode their own target candidate point clouds. The second target feature encoding network, for laser point cloud data, loads pre-trained model parameters that are kept fixed during training, with no gradient back-propagation. The laser point cloud feature output by the second target feature encoding network is denoted f_lidar, and the pseudo point cloud target feature output by the first target feature encoding network is denoted f_pse. A feature similarity loss is introduced between the two target feature encoding networks so that the representation of the pseudo point cloud target feature tends, in distribution, toward the target feature representation of the laser point cloud. This optimizes the discrepancy between the data sources at the target level, which is more efficient than optimizing the whole point cloud scene while also improving detection precision.
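The feature similarity loss between f_pse and f_lidar can be sketched as a KL divergence. Normalizing the feature vectors with a softmax is an assumed choice here, since the disclosure only specifies a KL divergence loss between the two feature vectors:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def feature_similarity_loss(f_pse, f_lidar):
    """KL divergence pushing the pseudo point cloud feature distribution
    toward the (frozen) laser point cloud feature distribution.
    """
    p = softmax(f_lidar)            # target: laser features, no gradient flows here
    q = softmax(f_pse)              # pseudo point cloud features being trained
    return float(np.sum(p * np.log(p / (q + 1e-12) + 1e-12)))

f_lidar = np.array([1.0, 2.0, 3.0])
loss_same = feature_similarity_loss(f_lidar, f_lidar)
loss_diff = feature_similarity_loss(np.array([3.0, 2.0, 1.0]), f_lidar)
print(loss_same, loss_diff)         # ~0 when identical, positive otherwise
```

Because only f_pse receives gradients, minimizing this loss pulls the pseudo point cloud representation toward the laser representation without altering the frozen second encoding network, consistent with the training scheme described above.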
Fig. 8 is a schematic block diagram of a pseudo point cloud based target detection apparatus employing a hardware implementation of a processing system according to one embodiment of the present disclosure.
The apparatus may include corresponding modules that perform the steps of the flowcharts described above. Thus, each step or several steps in the flowcharts described above may be performed by respective modules, and the apparatus may include one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective steps, or be implemented by a processor configured to perform the respective steps, or be stored within a computer-readable medium for implementation by a processor, or be implemented by some combination.
As shown in fig. 8, the pseudo point cloud-based object detection apparatus 800 may include:
A pseudo point cloud obtaining unit 802, configured to obtain first pseudo point cloud data of a first image, where target features represented by the first pseudo point cloud data include three-dimensional 3D features and category features;
a 3D target candidate frame extraction unit 804, configured to obtain 3D candidate frame information of the first image according to first pseudo point cloud data of the first image;
A candidate frame pseudo point cloud unit 806, configured to obtain first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
A feature association unit 808, configured to obtain second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, where a target feature represented by the second pseudo point cloud data includes a 3D feature, a category feature, and an internal feature of a corresponding target of the 3D candidate frame;
The feature encoding unit 810 is configured to obtain, according to second pseudo point cloud data of the 3D candidate frame, a first feature vector of a target corresponding to the 3D candidate frame through feature encoding, where the feature encoding can enable a target feature represented by the second pseudo point cloud data to be consistent in distribution with a target feature represented by laser point cloud data;
and a detection frame obtaining unit 812, configured to obtain 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, the pseudo point cloud obtaining unit 802 is specifically configured to: acquiring depth information of the first image; acquiring semantic information of the first image; and generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and the semantic information of the first image.
In some implementations, the first pseudo point cloud data and/or the second pseudo point cloud data has four dimensions, one of which represents a category of points.
In some embodiments, the feature associating unit 808 is specifically configured to: according to the first pseudo point cloud data of the 3D candidate frame, acquiring the spatial correlation characteristic of the target corresponding to the 3D candidate frame; acquiring channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame; acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame; and obtaining second pseudo point cloud data of the 3D candidate frame according to the spatial correlation characteristic, the channel correlation characteristic and the 3D relative position characteristic of the target corresponding to the 3D candidate frame.
In some embodiments, the feature association unit 808 is specifically configured to obtain, through a pre-trained feature association network, second pseudo point cloud data of the 3D candidate frame, where the feature association network includes an inter-point attention module, an inter-channel attention module, a position encoding module, and a fusion module, where the inter-point attention module is configured to obtain spatial association features of a target corresponding to the 3D candidate frame, the inter-channel attention module is configured to obtain channel association features of the target corresponding to the 3D candidate frame, the position encoding module is configured to obtain 3D relative position features of the target corresponding to the 3D candidate frame, and the fusion module is configured to obtain second pseudo point cloud data of the 3D candidate frame.
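A minimal sketch of the four modules of the feature association network follows. The plain dot-product attention, the linear position embedding and the additive fusion are assumptions for illustration, not the patented design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def associate_features(points_feat, xyz):
    """points_feat: (N, C) per-point features of one 3D candidate frame
    xyz:         (N, 3) point positions used for a 3D position embedding
    """
    n, c = points_feat.shape
    # Position encoding module: embed 3D relative positions into C dims.
    rel = xyz - xyz.mean(axis=0)                    # 3D relative positions
    pos = np.tile(rel, (1, c // 3 + 1))[:, :c]      # crude linear embedding
    x = points_feat + pos
    # Inter-point attention: spatial association across the N points.
    attn_pts = softmax(x @ x.T / np.sqrt(c), axis=-1)
    spatial = attn_pts @ x
    # Inter-channel attention: association across the C channels.
    attn_ch = softmax(x.T @ x / np.sqrt(n), axis=-1)
    channel = (attn_ch @ x.T).T
    # Fusion module: combine both association branches.
    return spatial + channel

feats = np.random.default_rng(1).normal(size=(16, 6))
xyz = np.random.default_rng(2).normal(size=(16, 3))
out = associate_features(feats, xyz)
print(out.shape)    # (16, 6): same shape, association-enriched features
```

The output keeps the per-point layout of the input, which is why the result can still be treated as (second) pseudo point cloud data carrying the additional internal features of the target.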
In some embodiments, the feature encoding unit 810 is specifically configured to: encoding second pseudo point cloud data of the 3D candidate frame into pseudo point cloud target feature data of the 3D candidate frame by adopting a key point relative coordinate encoding mode, wherein the pseudo point cloud target feature data are consistent with laser point cloud target feature data in distribution, and the laser point cloud target feature data are obtained by encoding the laser point cloud data of the 3D candidate frame; and decoding the pseudo point cloud target feature data into a first feature vector with a fixed size.
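The key-point relative coordinate encoding above can be sketched as follows; the uniform key-point sampling and the mean/max pooling are assumptions (farthest point sampling and learned pooling would be typical alternatives), chosen only to show how a variable-size point cloud becomes a fixed-size feature vector:

```python
import numpy as np

def keypoint_relative_encoding(points, num_keypoints=4):
    """points: (N, 4) pseudo point cloud of a candidate frame (x, y, z, class).
    Returns a fixed-size feature vector regardless of N."""
    # Key point selection (uniform stride here; FPS would be typical).
    idx = np.linspace(0, len(points) - 1, num_keypoints).astype(int)
    keypoints = points[idx, :3]
    feats = []
    for kp in keypoints:
        rel = points[:, :3] - kp          # coordinates relative to the key point
        # Pool relative coordinates into a small per-keypoint descriptor.
        feats.append(np.concatenate([rel.mean(axis=0), rel.max(axis=0)]))
    return np.concatenate(feats)          # fixed size: num_keypoints * 6

cloud = np.random.default_rng(3).normal(size=(50, 4))
vec_a = keypoint_relative_encoding(cloud)
vec_b = keypoint_relative_encoding(cloud[:30])   # different point count
print(vec_a.shape, vec_b.shape)                   # both (24,) -- fixed size
```

Encoding relative to key points rather than absolute coordinates is what lets the pseudo point cloud features and the laser point cloud features be compared in the same distribution space.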
In some embodiments, the feature encoding unit 810 is specifically configured to perform feature encoding on the second pseudo point cloud data of the 3D candidate frame through a pre-trained first target feature encoding network to obtain a first feature vector of a target corresponding to the 3D candidate frame, where the first target feature encoding network includes an encoder and a decoder.
In some embodiments, the first target feature encoding network is trained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame, the second feature vector is obtained by a second target feature encoding network according to laser point cloud data corresponding to the 3D candidate frame, and the second target feature encoding network has the same structure as the first target feature encoding network.
In some embodiments, the loss function of the first target feature encoding network includes a feature similarity loss, where the feature similarity loss is obtained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame; preferably, the feature similarity loss is a KL divergence loss between a first feature vector and a second feature vector of the object corresponding to the 3D candidate frame.
In some embodiments, the pseudo point cloud based object detection apparatus 800 may further include: and a confidence coefficient calculating unit 814, configured to obtain a confidence coefficient of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, the detection frame acquisition unit 812 is specifically configured to obtain 3D detection frame information based on the pre-trained first feedforward neural network, and/or the confidence calculation unit 814 is specifically configured to obtain the confidence of the 3D detection frame based on the pre-trained second feedforward neural network.
In some implementations, the loss function of the 3D candidate box detection network, the feature correlation network, the first target feature encoding network, the first feedforward neural network, and/or the second feedforward neural network is represented as equation (14).
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. Bus 900 connects together various circuits including one or more processors 1000, memory 1100, and/or hardware modules. Bus 900 may also connect various other circuits 1200 such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Bus 900 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one connection line is shown in the figure, but this does not mean there is only one bus or one type of bus.
Any process or method descriptions in flowcharts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. Further implementations are included within the scope of the preferred embodiments of the present disclosure in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied on a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via the memory and/or a communication interface. When the software program is loaded into the memory and executed by the processor, one or more of the steps of the methods described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
Logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps implementing the method of the above embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
Furthermore, each functional unit in each embodiment of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product. The storage medium may be a read-only memory, a magnetic disk or optical disk, etc.
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and a processor or other hardware module that executes the execution instructions stored in the memory, such that the processor or other hardware module executes the pseudo-point cloud-based target detection method described above.
The disclosure also provides a readable storage medium, in which execution instructions are stored, the execution instructions being used to implement the above-described pseudo-point cloud-based target detection method when executed by a processor.
In the description of the present specification, reference to the terms "one embodiment/mode," "some embodiments/modes," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/mode or example is included in at least one embodiment/mode or example of the present application. In this specification, the schematic representations of the above terms are not necessarily the same embodiments/modes or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. Furthermore, the various embodiments/implementations or examples described in this specification and the features of the various embodiments/implementations or examples may be combined and combined by persons skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
It will be appreciated by those skilled in the art that the above-described embodiments are merely for clarity of illustration of the disclosure, and are not intended to limit the scope of the disclosure. Other variations or modifications will be apparent to persons skilled in the art from the foregoing disclosure, and such variations or modifications are intended to be within the scope of the present disclosure.
Claims (26)
1. The target detection method based on the pseudo point cloud is characterized by comprising the following steps of:
acquiring first pseudo point cloud data of a first image, wherein target features represented by the first pseudo point cloud data comprise three-dimensional 3D features and category features;
Acquiring 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
Acquiring first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, wherein target features represented by the second pseudo point cloud data comprise 3D features, category features and internal features of corresponding targets of the 3D candidate frame, and the internal features comprise internal geometric features or internal structural features;
According to the second pseudo point cloud data of the 3D candidate frame, a first feature vector of the target corresponding to the 3D candidate frame is obtained through feature coding, and the feature coding can enable target features represented by the second pseudo point cloud data to be consistent with target features represented by laser point cloud data in distribution;
and obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
2. The pseudo point cloud based target detection method according to claim 1, wherein acquiring first pseudo point cloud data of the first image comprises:
acquiring depth information of the first image;
acquiring semantic information of the first image;
and generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and the semantic information of the first image.
3. The pseudo-point cloud based target detection method according to claim 1 or 2, wherein the first pseudo-point cloud data and/or the second pseudo-point cloud data has four dimensions, one of which represents a category of a point.
4. The pseudo point cloud based target detection method according to claim 1, wherein obtaining second pseudo point cloud data of the 3D candidate frame from the first pseudo point cloud data of the 3D candidate frame comprises:
according to the first pseudo point cloud data of the 3D candidate frame, acquiring the spatial correlation characteristic of the target corresponding to the 3D candidate frame;
Acquiring channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame;
and obtaining second pseudo point cloud data of the 3D candidate frame according to the spatial correlation characteristic, the channel correlation characteristic and the 3D relative position characteristic of the target corresponding to the 3D candidate frame.
5. The method for detecting the target based on the pseudo point cloud according to claim 4, wherein the second pseudo point cloud data of the 3D candidate frame is obtained through a pre-trained feature association network, the feature association network comprises an inter-point attention module, an inter-channel attention module, a position coding module and a fusion module, the inter-point attention module is used for obtaining the spatial association feature of the target corresponding to the 3D candidate frame, the inter-channel attention module is used for obtaining the channel association feature of the target corresponding to the 3D candidate frame, the position coding module is used for obtaining the 3D relative position feature of the target corresponding to the 3D candidate frame, and the fusion module is used for obtaining the second pseudo point cloud data of the 3D candidate frame.
6. The method for detecting a target based on a pseudo point cloud as claimed in claim 5, wherein,
According to the second pseudo point cloud data of the 3D candidate frame, obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding, wherein the first feature vector comprises the following components:
Encoding second pseudo point cloud data of the 3D candidate frame into pseudo point cloud target feature data of the 3D candidate frame by adopting a key point relative coordinate encoding mode, wherein the pseudo point cloud target feature data are consistent with laser point cloud target feature data in distribution, and the laser point cloud target feature data are obtained by encoding the laser point cloud data of the 3D candidate frame;
And decoding the pseudo point cloud target feature data into a first feature vector with a fixed size.
7. The method for detecting a target based on pseudo point cloud according to claim 6, wherein obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature encoding according to second pseudo point cloud data of the 3D candidate frame comprises: and carrying out feature coding on the second pseudo point cloud data of the 3D candidate frame through a pre-trained first target feature coding network to obtain a first feature vector of a target corresponding to the 3D candidate frame, wherein the first target feature coding network comprises an encoder and a decoder.
8. The method for detecting a target based on pseudo point cloud according to claim 7, wherein the first target feature encoding network is trained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame, the second feature vector is obtained by a second target feature encoding network according to laser point cloud data corresponding to the 3D candidate frame, and the second target feature encoding network has the same structure as the first target feature encoding network.
9. The method for detecting a target based on a pseudo point cloud according to claim 8, wherein the loss function of the first target feature encoding network includes a feature similarity loss, and the feature similarity loss is obtained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame;
The feature similarity loss is a KL divergence loss between a first feature vector and a second feature vector of the target corresponding to the 3D candidate frame.
10. The pseudo point cloud based target detection method of claim 9, further comprising: and obtaining the confidence coefficient of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame.
11. The pseudo point cloud based target detection method of claim 10, wherein the 3D detection box information is obtained based on a pre-trained first feedforward neural network and/or the confidence level of the 3D detection box is obtained based on a pre-trained second feedforward neural network.
12. The pseudo point cloud based target detection method according to claim 10, wherein the loss function of the 3D candidate frame detection network, the feature correlation network, the first target feature encoding network, the first feedforward neural network and/or the second feedforward neural network is expressed as:
L = λ1·L_KLD + λ2·L_conf + λ3·L_reg + λ4·L_rpn
wherein L represents the function value of the loss function; λ1, λ2, λ3 and λ4 represent weight coefficients; L_KLD represents the feature similarity loss; L_conf represents the detection frame confidence calculation loss; L_reg represents the 3D detection frame regression loss; and L_rpn represents the loss of the 3D candidate frame detection network;
the 3D candidate frame detection network is used for acquiring the 3D candidate frame information, the feature correlation network is used for acquiring second pseudo point cloud data of the 3D candidate frame, the first target feature encoding network is used for acquiring a first feature vector of a target corresponding to the 3D candidate frame, the first feedforward neural network is used for acquiring the 3D detection frame information, and the second feedforward neural network is used for acquiring the confidence coefficient of the 3D detection frame.
13. A pseudo point cloud-based object detection apparatus, comprising:
The pseudo point cloud acquisition unit acquires first pseudo point cloud data of a first image, and target features represented by the first pseudo point cloud data comprise three-dimensional 3D features and category features;
a 3D target candidate frame extraction unit, configured to obtain 3D candidate frame information of the first image according to first pseudo point cloud data of the first image;
the candidate frame pseudo point cloud unit is used for acquiring first pseudo point cloud data of the 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
The characteristic association unit is used for obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, wherein target characteristics represented by the second pseudo point cloud data comprise 3D characteristics, category characteristics and internal characteristics of a corresponding target of the 3D candidate frame, and the internal characteristics comprise internal geometric characteristics or internal structural characteristics;
the feature coding unit is used for obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding according to second pseudo point cloud data of the 3D candidate frame, and the feature coding can enable target features represented by the second pseudo point cloud data to be consistent with target features represented by laser point cloud data in distribution;
and the detection frame acquisition unit is used for acquiring 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
14. The target detection device based on the pseudo point cloud according to claim 13, wherein the pseudo point cloud obtaining unit is specifically configured to:
acquiring depth information of the first image;
acquiring semantic information of the first image;
and generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and the semantic information of the first image.
15. The pseudo-point cloud based object detection apparatus according to claim 13 or 14, wherein the first pseudo-point cloud data and/or the second pseudo-point cloud data has four dimensions, one of which represents a category of points.
16. The object detection device based on pseudo point cloud according to claim 13, wherein the feature association unit is specifically configured to:
according to the first pseudo point cloud data of the 3D candidate frame, acquiring the spatial correlation characteristic of the target corresponding to the 3D candidate frame;
Acquiring channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame;
and obtaining second pseudo point cloud data of the 3D candidate frame according to the spatial correlation characteristic, the channel correlation characteristic and the 3D relative position characteristic of the target corresponding to the 3D candidate frame.
17. The target detection device based on pseudo point cloud according to claim 16, wherein the feature association unit is specifically configured to obtain second pseudo point cloud data of the 3D candidate frame through a pre-trained feature association network, the feature association network includes an inter-point attention module, an inter-channel attention module, a position coding module and a fusion module, the inter-point attention module is configured to obtain spatial association features of a target corresponding to the 3D candidate frame, the inter-channel attention module is configured to obtain channel association features of the target corresponding to the 3D candidate frame, the position coding module is configured to obtain 3D relative position features of the target corresponding to the 3D candidate frame, and the fusion module is configured to obtain second pseudo point cloud data of the 3D candidate frame.
18. The object detection device based on pseudo point cloud according to claim 17, wherein the feature encoding unit is specifically configured to:
Encoding second pseudo point cloud data of the 3D candidate frame into pseudo point cloud target feature data of the 3D candidate frame by adopting a key point relative coordinate encoding mode, wherein the pseudo point cloud target feature data are consistent with laser point cloud target feature data in distribution, and the laser point cloud target feature data are obtained by encoding the laser point cloud data of the 3D candidate frame;
And decoding the pseudo point cloud target feature data into a first feature vector with a fixed size.
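The key point relative coordinate encoding of claim 18 can be sketched as below. The key-point selection used here (simply taking the first K points) and every name in the snippet are hypothetical illustrations; the patent does not specify the sampling strategy, and a practical system would likely use farthest point sampling.

```python
import numpy as np

def keypoint_relative_encoding(points, num_keypoints=4):
    # points: (N, 3 + C) pseudo point cloud of one 3D candidate frame
    # (xyz coordinates followed by C feature channels).
    xyz = points[:, :3]
    keys = xyz[:num_keypoints]                   # (K, 3) hypothetical key points
    rel = xyz[:, None, :] - keys[None, :, :]     # (N, K, 3) relative coordinates
    rel = rel.reshape(len(points), -1)           # (N, 3K) flattened per point
    # Keep the original feature channels alongside the relative coordinates.
    return np.concatenate([rel, points[:, 3:]], axis=1)
```

Encoding coordinates relative to shared key points, rather than in absolute frame coordinates, is what lets the pseudo point cloud features match the distribution of laser point cloud features encoded the same way.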
19. The target detection device based on pseudo point cloud according to claim 18, wherein the feature encoding unit is specifically configured to perform feature encoding on the second pseudo point cloud data of the 3D candidate frame through a pre-trained first target feature encoding network to obtain a first feature vector of a target corresponding to the 3D candidate frame, where the first target feature encoding network includes an encoder and a decoder.
20. The pseudo-point cloud based object detection apparatus according to claim 19, wherein the first object feature encoding network is trained according to a first feature vector and a second feature vector of an object corresponding to the 3D candidate frame, the second feature vector is obtained by a second object feature encoding network according to laser point cloud data corresponding to the 3D candidate frame, and the second object feature encoding network has the same structure as the first object feature encoding network.
21. The pseudo point cloud based object detection apparatus according to claim 20, wherein the loss function of the first object feature encoding network includes a feature similarity loss, and the feature similarity loss is obtained according to a first feature vector and a second feature vector of an object corresponding to the 3D candidate frame;
The feature similarity loss is a KL divergence loss between a first feature vector and a second feature vector of the target corresponding to the 3D candidate frame.
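The KL divergence loss of claim 21 can be sketched as follows, assuming (hypothetically, since the patent does not say) that each fixed-size feature vector is turned into a distribution with a softmax before comparison. Here `f_pseudo` comes from the first (pseudo point cloud) encoding network and `f_laser` from the second (laser point cloud) network acting as the target.

```python
import numpy as np

def kl_feature_similarity_loss(f_pseudo, f_laser, eps=1e-8):
    # Convert each feature vector into a probability distribution,
    # then compute KL(p_laser || p_pseudo): the loss is zero when the
    # pseudo point cloud features match the laser point cloud features.
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p, q = softmax(f_pseudo), softmax(f_laser)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))
```

Minimizing this term pulls the pseudo point cloud feature distribution toward the laser point cloud one, which is the stated purpose of training the first encoding network against the second.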
22. The pseudo point cloud based object detection apparatus according to claim 13 or 21, further comprising: a confidence calculation unit, configured to obtain the confidence of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame.
23. The pseudo point cloud based target detection apparatus according to claim 22, wherein the detection frame acquisition unit is specifically configured to obtain 3D detection frame information based on a pre-trained first feedforward neural network, and/or the confidence calculation unit is specifically configured to obtain a confidence level of the 3D detection frame based on a pre-trained second feedforward neural network.
24. The pseudo point cloud based object detection apparatus according to claim 21, wherein the loss function of the 3D candidate box detection network, the feature association network, the first target feature encoding network, the first feedforward neural network and/or the second feedforward neural network is expressed as:

L = λ₁·L_sim + λ₂·L_conf + λ₃·L_reg + λ₄·L_rpn

wherein L represents the function value of the loss function; λ₁, λ₂, λ₃ and λ₄ represent weight coefficients; L_sim represents the feature similarity loss; L_conf represents the detection frame confidence calculation loss; L_reg represents the 3D detection frame regression loss; and L_rpn represents the loss of the 3D candidate box detection network;
the 3D candidate frame detection network is used for acquiring the 3D candidate frame information, the feature association network is used for acquiring second pseudo point cloud data of the 3D candidate frame, the first target feature encoding network is used for acquiring a first feature vector of the target corresponding to the 3D candidate frame, the first feedforward neural network is used for acquiring the 3D detection frame information, and the second feedforward neural network is used for acquiring the confidence of the 3D detection frame.
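Claim 24's weighted sum of the four loss terms can be sketched directly; the weight values below are hypothetical hyperparameters, not taken from the patent.

```python
def total_loss(l_sim, l_conf, l_reg, l_rpn, weights=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of the four loss terms named in claim 24:
    # feature similarity, detection frame confidence, 3D box regression,
    # and the 3D candidate box detection (proposal) network loss.
    w1, w2, w3, w4 = weights
    return w1 * l_sim + w2 * l_conf + w3 * l_reg + w4 * l_rpn
```

In training, the weight coefficients balance how strongly the feature-distillation term competes with the detection terms; the claim leaves their values open.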
25. An electronic device, comprising:
A memory storing execution instructions; and
A processor executing the execution instructions stored in the memory, causing the processor to execute the pseudo point cloud-based target detection method according to any one of claims 1 to 12.
26. A readable storage medium, wherein execution instructions are stored in the readable storage medium, which when executed by a processor are configured to implement the pseudo point cloud based object detection method of any one of claims 1 to 12.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210508913.9A CN114842313B (en) | 2022-05-10 | 2022-05-10 | Target detection method, device, electronic device and storage medium based on pseudo point cloud |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114842313A CN114842313A (en) | 2022-08-02 |
| CN114842313B true CN114842313B (en) | 2024-05-31 |
Family
ID=82570785
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210508913.9A Active CN114842313B (en) | 2022-05-10 | 2022-05-10 | Target detection method, device, electronic device and storage medium based on pseudo point cloud |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114842313B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115641462B (en) * | 2022-12-26 | 2023-03-17 | 电子科技大学 | A Radar Image Target Recognition Method |
| CN116206273A (en) * | 2023-01-18 | 2023-06-02 | 中汽创智科技有限公司 | Lane line detection method, device, equipment and storage medium |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10929694B1 (en) * | 2020-01-22 | 2021-02-23 | Tsinghua University | Lane detection method and system based on vision and lidar multi-level fusion |
| CN112613378A (en) * | 2020-12-17 | 2021-04-06 | 上海交通大学 | 3D target detection method, system, medium and terminal |
| CN113034592A (en) * | 2021-03-08 | 2021-06-25 | 西安电子科技大学 | Three-dimensional scene target detection modeling and detection method based on natural language description |
| CN113569803A (en) * | 2021-08-12 | 2021-10-29 | 中国矿业大学(北京) | Multi-mode data fusion lane target detection method and system based on multi-scale convolution |
| CN113780257A (en) * | 2021-11-12 | 2021-12-10 | 紫东信息科技(苏州)有限公司 | Multi-mode fusion weak supervision vehicle target detection method and system |
| JP2021189917A (en) * | 2020-06-02 | 2021-12-13 | 株式会社Zmp | Object detection system, object detection method, and object detection program |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN112699806A (en) | Three-dimensional point cloud target detection method and device based on three-dimensional heat map | |
| Wang et al. | Generative AI for autonomous driving: Frontiers and opportunities | |
| CN114842313B (en) | Target detection method, device, electronic device and storage medium based on pseudo point cloud | |
| CN118552738B (en) | Multimodal semantic segmentation method based on multi-relational feature collaboration and projection fusion | |
| CN118351307B (en) | Multi-domain attention-enhanced three-dimensional point cloud semantic segmentation method and device | |
| Yao et al. | Vision-based environment perception and autonomous obstacle avoidance for unmanned underwater vehicle | |
| Jia et al. | Diffmap: Enhancing map segmentation with map prior using diffusion model | |
| CN118379588A (en) | Detection method, device, electronic device and storage medium based on feature fusion | |
| CN117710917A (en) | End-to-end multi-modal multi-task autonomous driving perception method and device based on long and short temporal hybrid coding | |
| CN117893691B (en) | Structure intelligent three-dimensional reconstruction method based on three-plane feature representation and visual angle conditional diffusion model | |
| CN118447167A (en) | NeRF three-dimensional reconstruction method and system based on 3D point cloud | |
| Liao et al. | An effective ship detection approach combining lightweight networks with supervised simulation‐to‐reality domain adaptation | |
| TWI826201B (en) | Object detection method, object detection apparatus, and non-transitory storage medium | |
| CN119672036A (en) | A point cloud segmentation method and system based on dual channels | |
| CN119251785A (en) | Target detection method, device, equipment and storage medium | |
| Tan et al. | 3D detection transformer: Set prediction of objects using point clouds | |
| CN118397368A (en) | Data value mining method based on automatic driving world model in data closed loop | |
| CN120452068B (en) | Method and device for identifying abnormal passenger behavior in elevator monitoring night vision mode | |
| CN115311545B (en) | Cross-mode fusion distillation water surface segmentation method and system | |
| Lee et al. | LGD-BEV: Label-Guided Distillation for BEV 3D Object Detection | |
| Wang et al. | On-board optical visual perception and intelligent control of electric locomotive based on Transformer | |
| CN119128615A (en) | A transformer-based multimodal fusion BEV target detection system | |
| Correia et al. | Enhancing efficiency in 3D object detection with knowledge distillation | |
| KR20250165989A (en) | Method for generating high-definition map and apparatus for performing the same | |
| KR20260009607A (en) | Apparatus and method for recognition of 3d information |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: Object detection method, device, electronic equipment, and storage medium based on pseudo point cloud
Granted publication date: 20240531
Pledgee: Industrial Bank Co., Ltd. Beijing Wangjing Nanhu Branch
Pledgor: Beijing Yihangyuan Intelligent Technology Co., Ltd.
Registration number: Y2025980003251