
CN118429764A - Collaborative sensing method based on multi-mode fusion - Google Patents

Collaborative sensing method based on multi-mode fusion

Info

Publication number
CN118429764A
CN118429764A (Application CN202410436656.1A)
Authority
CN
China
Prior art keywords
features
fusion
feature
bev
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410436656.1A
Other languages
Chinese (zh)
Inventor
闵令通
黄丹
张磊
王昭中
何欣欣
王秉路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202410436656.1A priority Critical patent/CN118429764A/en
Publication of CN118429764A publication Critical patent/CN118429764A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a collaborative sensing method based on multi-mode fusion. The method integrates lidar and camera data by fusing the two sensors' data into a common representation space, so that the accurate depth information provided by the lidar and the rich visual features provided by the camera can be exploited to achieve more accurate and more stable environment sensing; several different fusion strategies are considered and analyzed in depth. The weights of the different sensor data can be adjusted adaptively, so that the correlations between the sensor streams are captured better and high-precision sensing of the environment is achieved, and the method shows better performance and stronger robustness in the cooperative sensing task.

Description

Collaborative perception method based on multimodal fusion

Technical Field

The present invention relates to the field of autonomous driving and intelligent transportation systems, and more specifically to a collaborative perception method based on multimodal fusion.

Background

Collaborative perception is a key problem in autonomous driving. It allows autonomous vehicles to collect environmental information with on-board sensors and share it with other vehicles in real time over vehicle-to-vehicle (V2V) wireless communication, thereby achieving a more powerful and comprehensive perception of the environment. The goal of collaborative perception is to use multiple vehicles as a mobile sensor network to build a high-precision, multi-viewpoint representation of the scene, enhancing vehicle perception in applications such as fleet coordination, traffic-flow optimization, driving assistance and autonomous driving. Compared with traditional single-vehicle intelligence, multi-vehicle collaborative perception copes better with complex traffic scenes and improves the decision accuracy and driving safety of autonomous driving systems.

Early research was mainly based on a single sensor modality, using only lidar or only cameras for environmental perception. However, such single-modality approaches fail to fully exploit the complementary advantages of the two sensors, limiting the performance of the perception system. Pure lidar-based methods may miss the fine-grained visual details captured by camera sensors, while pure camera-based methods usually lack the accurate depth information that is critical for precise object localization and may also be affected by low light or bad weather.

In recent years, researchers have begun to explore perception methods based on multimodal fusion in order to overcome the limitations of a single modality. These methods aim to combine the strengths of lidar and cameras to obtain a more comprehensive and accurate perception of the environment, and they have demonstrated the advantages of fusing lidar and camera data, including improved object detection, enhanced scene understanding and robust performance under challenging environmental conditions. However, for multi-vehicle collaborative perception, the exploration of multimodal fusion is still very limited. The recently proposed HM-ViT is an early exploration of multimodal fusion for multi-vehicle collaborative perception; it assumes that each vehicle can acquire only one modality and, under this setting, performs heterogeneous modality fusion between different vehicles. Although the existing literature contains many studies on multimodality, in-depth investigation of multimodal interaction is still insufficient in the field of collaborative perception.

Summary of the Invention

To overcome the shortcomings of the prior art, the present invention provides a collaborative perception method based on multimodal fusion. To exploit the complementary advantages of different sensor types more fully, the invention builds a multimodal fusion baseline system as the basic framework for collaborative perception. The baseline system draws on successful applications of multimodal fusion in single-vehicle perception and integrates lidar and camera data: by fusing the data of the two sensors into the same representation space, it can use the precise depth information provided by the lidar and the rich visual features provided by the camera to achieve more accurate and more robust environmental perception. In building this baseline system, the invention considers several different fusion strategies and analyzes them in depth. First, channel-level concatenation and element-level summation are adopted; these two methods are simple and intuitive and directly combine features from different sensors into a unified feature representation. Although they can achieve good performance in some cases, they also have limitations, mainly that they cannot fully account for the correlations between different sensor data. To address this, the invention further explores a fusion method based on the attention mechanism, which can adaptively adjust the weights of different sensor data and thus better capture the correlations between them. Through multi-head self-attention, the proposed algorithm can establish complex associations between different sensor features and achieve a finer fusion. The proposed collaborative perception method describes how to use two different sensors, lidar and camera, to achieve high-precision perception of the environment through advanced data processing and fusion techniques. The method is not tied to any particular system and can be used in different systems. Experimental results on the OPV2V dataset show that the attention-based fusion method exhibits superior performance and stronger robustness in collaborative perception tasks compared with traditional fusion methods.

The steps of the technical solution adopted by the present invention to solve its technical problem are as follows:

Step 1: multimodal feature extraction: for multi-view image data, the CaDDN (Categorical Depth Distribution Network) architecture is used; CaDDN comprises four modules (encoder, depth estimation, voxel transformation and collapse), ensuring that information as rich and accurate as possible is captured from the input image, so that the extracted image features fully represent the key visual information of the input image, including edges, texture, color and scene depth. For point cloud data, PointPillar is used as the feature extractor. First, for a given 3D point p = (x, y, z), its position in the pillar grid is determined as p = (i, j, l); the 3D space containing the point is divided into a series of evenly spaced pillar structures, converting the complex 3D data into a 2D structure. Then all pillars are flattened along their own height direction to generate a pseudo-image. Finally, further feature encoding and integration are performed through a series of 2D convolutions.

A 2D convolutional network is used to extract spatial features and strengthen the feature correlation between different pillars. Through layer-by-layer convolution and activation functions, the network can capture and encode more complex spatial patterns, improving the expressiveness of the features. After processing by the convolutional network, a set of encoded high-dimensional feature maps is obtained. These feature maps represent the rich spatial information and object features of the original point cloud data and are integrated through pooling and concatenation operations to form the final feature representation used for the subsequent detection task.

Step 2: multimodal feature fusion: each vehicle first compresses and encodes its own BEV features and then sends them to the center vehicle. Once the center vehicle has received the BEV features from all other vehicles, it fuses them to generate a more global and detailed scene representation. The fusion process first performs homogeneous modality feature fusion and then heterogeneous modality feature fusion.

Step 3: detection head network: the detection head network predicts the class and position of targets from the fused features. It contains 3 deconvolution layers that upsample the feature maps, providing finer-grained information for the subsequent class and position prediction. After the feature maps have been upsampled by the 3 deconvolution layers and concatenated together, they are fed to a class prediction branch and a bounding-box regression branch. The class prediction branch outputs a score for each anchor box indicating the vehicle class in that anchor box and the probability that it is present. The bounding-box regression branch refines the position information of the target, predicting 4 values for each anchor box that represent the offset of the center position (x, y) and the changes in the width and height of the box. Combining the outputs of the two branches, the detection head network outputs the target class and the adjusted position information for each anchor box.

The steps of the homogeneous modality feature fusion are as follows:

For image BEV features, an element-level maximization operation is used to fuse the image BEV features of different vehicles. By comparing the image BEV features of each vehicle and taking the element-wise maximum, the finally fused features are guaranteed to contain the most salient characteristics observed by all vehicles. Specifically, when the center vehicle receives the image BEV features B_i^img sent by another vehicle, it compares them element by element with its own image BEV features B_ego^img and selects the maximum value as the fusion result. This process is expressed as

B_fused^img = max(B_i^img, B_ego^img)

where the max(·) operation is performed element by element; this ensures that the fused features inherit the most salient and informative parts from the multiple vehicles.

For point cloud BEV features, a self-attention mechanism is used for fusion. This process requires building a locality graph, a data structure that represents the relationships between feature vectors at the same spatial position in different vehicles. In practice, the BEV feature vectors of each vehicle are first determined and mapped into a unified coordinate system. Then, for each BEV feature vector, a node is created in the locality graph and connections are established between spatially adjacent nodes. For each node (i.e., feature vector) in the graph, its attention scores with neighboring nodes are computed; the attention scores determine how strongly the information of each neighboring node influences the current node during fusion. Each node's feature vector is then updated to the weighted sum of its neighbors' features according to the computed attention scores. After iterative updates by the self-attention mechanism, the feature vector of each node fuses information coming from different vehicles but located at the same or nearby spatial positions. Finally, the updated feature vectors are aggregated to form a fused BEV feature representation used for the subsequent detection task.

The steps of the heterogeneous modality feature fusion are as follows:

In the heterogeneous modality fusion stage, the fused BEV features of the different modalities are integrated to exploit the complementary information between the lidar and camera data. As shown in Figure 2, the invention adopts three common fusion strategies: channel-level concatenation, element-level summation and Transformer fusion. When all feature information of both the image and the point cloud data must be used simultaneously, channel-level concatenation fusion is adopted; this method is simple and intuitive and suits situations where the features are strongly and directly correlated and their dimensions match. When the features of the two data sources need to be combined quickly and simply, element-level summation fusion is adopted; it suits situations where the feature dimensions are the same and the physical meaning of each element is similar. When the environment or scene being processed is highly complex and the latent, non-intuitive associations between the lidar data and the image data need to be mined in depth, Transformer fusion is adopted.

The steps of the channel-level concatenation fusion are:

The image and point cloud features are first concatenated along the channel dimension to obtain a feature tensor, and the concatenated features are then fed into two convolutional layers for further processing. The first convolutional layer captures spatial relationships and extracts valuable features from the concatenated data; the second convolutional layer reduces the channel dimension to further refine the fused features while ensuring that key information is retained after compression. The two convolutional layers need not be identical and can be designed separately according to the characteristics of the features.

The steps of the element-level summation fusion are:

The lidar features are fed into a 1×1 convolutional layer that down-samples the channel dimension; the image features and the down-sampled lidar features are then added element by element to obtain the fused features.

The steps of the Transformer fusion are:

First, the BEV features of the image and the point cloud are concatenated along the channel dimension to form a joint feature tensor, and a positional encoding is added to the feature vectors to give the model awareness of the spatial position of the features. Then, to capture the complex interactions between the two modalities, a multi-head self-attention mechanism is introduced after the positional encoding is added to the joint feature tensor; it assigns different weights to each part of the different modalities and lets the model capture the correlations between the features of the different modalities from multiple perspectives. A feed-forward network then further strengthens the nonlinear processing capability of the model. Finally, residual connections and layer normalization are applied at every step of the Transformer fusion to obtain the final fused features.

The beneficial effect of the present invention is that it investigates in depth the applicability and effectiveness of several advanced fusion strategies for multimodal collaborative perception tasks and introduces an attention-based fusion method that performs well in improving the accuracy and robustness of the perception system. Through multi-head self-attention, the invention establishes complex associations between different sensor features, refines the fusion, demonstrates excellent performance and robustness, and verifies the effectiveness of multimodal fusion, providing an economically feasible solution for practical real-world applications.

Brief Description of the Drawings

Figure 1 is a framework diagram of the present invention;

Figure 2 shows the three strategies for heterogeneous modality feature fusion: Figure 2(a) is the channel-level concatenation fusion strategy, Figure 2(b) is the element-level summation fusion strategy, and Figure 2(c) is the Transformer fusion strategy;

Figure 3 is a detailed diagram of the model architecture;

Figure 4 compares the impact of localization error on model performance: Figure 4(a) compares the three modes on the AP@0.5 metric in the CARLA default towns, Figure 4(b) compares the three modes on the AP@0.7 metric in the CARLA default towns, Figure 4(c) compares the three modes on the AP@0.5 metric in the Culver City digital town, and Figure 4(d) compares the three modes on the AP@0.7 metric in the Culver City digital town.

Figure 5 is a visual comparison of the detection results of different models: Figure 5(a) is the pure image mode, Figure 5(b) is the pure point cloud mode, and Figure 5(c) is the Transformer fusion mode.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and embodiments.

The present invention provides a collaborative perception method based on multimodal fusion. As shown in Figure 1, the core components of the framework are multimodal feature extraction, multimodal feature fusion and a detection head network. These components work together to integrate lidar and camera data effectively, fully exploiting their complementarity to improve perception accuracy in V2V scenarios. The specific steps of the method are as follows:

Step 1: Multimodal feature extraction

To capture and preserve the unique cues from the different modalities, the invention uses separate branches for feature extraction and generates a unified BEV representation.

(1) Multi-view image data

The invention adopts the CaDDN (Categorical Depth Distribution Network) architecture, which comprises four main modules: encoder, depth estimation, voxel transformation and collapse. This ensures that information as rich and accurate as possible is captured from the input image, so that the extracted image features fully represent the key visual information of the input image, such as edges, texture, color and scene depth. To fuse the two heterogeneous feature types, 2D image features and 3D point cloud features, the depth of each pixel in the image features must be predicted explicitly so that the 2D plane can be lifted into 3D space and finally converted into a unified BEV space. Taking vehicle a_i as an example, the encoder module first performs preliminary feature extraction on the raw input image I_ai, generating image features F_ai of dimension X×Y×D. This process is expressed as:

F_ai = Encoder(I_ai)

Then, the depth estimation module predicts a depth probability distribution P for each pixel, expressed as:

P = DepthEstimation(F_ai)

This distribution reflects the depth information of each pixel. The voxel transformation module then projects the previously extracted features from 2D into 3D space; based on all possible depth distributions and the camera calibration matrix, it generates the corresponding 3D voxel features V_ai:

V_ai = VoxelTransform(F_ai, P)

Finally, the collapse module merges the 3D voxel features onto a single height plane:

B_ai^img = Collapse(V_ai)

where B_ai^img ∈ R^(H×W×C) is the generated BEV feature map, H and W are the height and width of the image BEV grid, and C is the number of channels. This process effectively obtains semantically rich visual features.
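A minimal PyTorch sketch of the depth-weighted lift that precedes the voxel transformation may help make the data flow concrete; the module names, channel counts and depth-bin count below are illustrative assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class LiftToFrustum(nn.Module):
    """Sketch of the lift step: per-pixel image features are weighted by the
    predicted categorical depth distribution, producing a frustum of 3D
    features that the voxel-transform and collapse modules (omitted here,
    they need the camera calibration matrix) turn into the H x W x C BEV map."""

    def __init__(self, in_ch=256, feat_ch=64, depth_bins=80):
        super().__init__()
        self.feat_head = nn.Conv2d(in_ch, feat_ch, kernel_size=1)      # image features F
        self.depth_head = nn.Conv2d(in_ch, depth_bins, kernel_size=1)  # depth logits

    def forward(self, encoder_out):                               # (B, in_ch, X, Y)
        feat = self.feat_head(encoder_out)                        # (B, C, X, Y)
        depth_prob = self.depth_head(encoder_out).softmax(dim=1)  # (B, D, X, Y), i.e. P
        # outer product: each pixel's feature is spread over its depth bins
        frustum = feat.unsqueeze(2) * depth_prob.unsqueeze(1)     # (B, C, D, X, Y)
        return frustum
```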

(2) Point cloud data

Because of the inherent sparsity and 3D structure of point cloud data, applying a traditional 3D convolutional network directly would incur huge computational and memory overhead. To extract features from such data effectively and efficiently, the invention uses PointPillar as the feature extractor for point cloud data. First, for a given 3D point p = (x, y, z), its position in the pillar grid is determined as p = (i, j, l), where i and j are the x and y coordinates of the 2D grid and l is its height in the vertical direction. The 3D space containing the point is then divided into a series of evenly spaced pillar structures. Formally, this division is expressed as

i = ⌊x / W_pillar⌋, j = ⌊y / H_pillar⌋

where W_pillar and H_pillar are the width and height of a pillar in the x and y directions, respectively. In this way, the complex 3D data can be converted into a 2D structure, and all points within a pillar share the same height information. Subsequently, all pillars are flattened along their own height direction to generate a pseudo-image. For all 3D points in a pillar, the corresponding 3D features F_3D ∈ R^(N_p×C_p) (where N_p is the number of points in the pillar and C_p is the dimension of the point cloud features) are converted into 2D pillar features F_pillar:

F_pillar = φ(F_3D)

where φ(·) is the transformation function that converts all points within a pillar into the pillar's overall representation. Finally, F_pillar is further encoded and integrated through a series of 2D convolutions to obtain BEV features B_ai^pc of dimension H×W×C, the same dimension as the image BEV features. In this step, the 2D convolutional network is used to extract spatial features and strengthen the feature correlation between different pillars. Through layer-by-layer convolution and activation functions, the network can capture and encode more complex spatial patterns, improving the expressiveness of the features. After processing by the convolutional network, a set of encoded high-dimensional feature maps is obtained. These feature maps represent the rich spatial information and object features of the original point cloud data. Finally, the high-dimensional feature maps are integrated through operations such as pooling and concatenation to form the final feature representation used for the subsequent detection task.
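The pseudo-image step above can be illustrated with a short sketch; the function name, tensor shapes and the assumption that pillar features have already been pooled to one vector per pillar are illustrative, not taken from the patent:

```python
import torch

def scatter_pillars(pillar_feats, coords, H, W):
    """Sketch of the pseudo-image construction: pooled per-pillar features of
    shape (N_pillars, C) are written back to their (i, j) grid cells, giving
    the C x H x W canvas that the 2D convolutions consume.
    coords is assumed to be a (N_pillars, 2) long tensor of (i, j) indices."""
    C = pillar_feats.shape[1]
    canvas = torch.zeros(C, H * W, dtype=pillar_feats.dtype)
    flat_idx = coords[:, 0] * W + coords[:, 1]   # i * W + j for each pillar
    canvas[:, flat_idx] = pillar_feats.t()       # scatter pillar features into the grid
    return canvas.view(C, H, W)                  # pseudo-image (C, H, W)
```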

Step 2: Multimodal feature fusion

With the BEV features extracted from the lidar and the camera in hand, feature fusion becomes the key step. In this stage, each vehicle first compresses and encodes its own BEV features and then sends them to the center vehicle. Once the center vehicle has received the BEV features from all other vehicles, it fuses these received BEV features to generate a more global and detailed scene representation. The fusion process involves two key steps: homogeneous modality feature fusion followed by heterogeneous modality feature fusion. Homogeneous feature fusion mainly handles information from the same sensor type, while heterogeneous feature fusion combines information from different sensor types to fully exploit the complementarity between the modalities.

(1) Homogeneous modality feature fusion

Within multimodal feature fusion, fusing homogeneous features is clearly more intuitive and simpler than fusing heterogeneous features, because they come from the same type of data source and share similar characteristics and distributions. Based on this observation, the invention designs different fusion strategies for image BEV features and point cloud BEV features, integrating multi-source information of the same modality in a more direct way.

First, an element-level maximization operation is used to fuse the image BEV features of different vehicles. The rationale is that, in a multi-vehicle collaborative environment, certain features may be observed more clearly by some vehicles than by others. By comparing the image BEV features of each vehicle and taking the element-wise maximum, the finally fused features are guaranteed to contain the most salient characteristics observed by all vehicles. Specifically, when the center vehicle receives image BEV features B_i^img sent by another vehicle, it compares them element by element with its own image BEV features B_ego^img and selects the maximum value as the fusion result. This process can be expressed as

B_fused^img = max(B_i^img, B_ego^img)

where the max(·) operation is performed element by element; this ensures that the fused features inherit the most salient and informative parts from the multiple vehicles.
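A minimal sketch of this element-level maximization, assuming all maps share the shape (C, H, W) and have already been warped into the ego coordinate frame:

```python
import torch

def fuse_image_bev(ego_bev, received_bevs):
    """Sketch of the element-level maximization: the ego image BEV map is
    compared element by element with every received map and the larger
    value is kept."""
    fused = ego_bev
    for bev in received_bevs:
        fused = torch.maximum(fused, bev)   # element-wise max(·)
    return fused
```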

For point cloud BEV features, the invention uses a self-attention mechanism for fusion. The fusion of point cloud BEV features first involves constructing a locality graph, which connects feature vectors that come from different vehicles but lie at the same spatial position. In this process, each feature vector is treated as a node, and an edge between two feature vectors connects features at the same spatial position from different vehicles. Such connections not only help fuse information from different data sources but also provide the basis for the subsequent self-attention processing.

The self-attention mechanism is then applied to the locality graph to fuse the point cloud BEV features. The core of self-attention is that it can assign different weights to different feature vectors, and these weights reflect the interdependence between the features. In the context of point cloud BEV features, this means that each feature is defined not only by its own attributes but also by the influence of surrounding features. Specifically, for each vehicle's point cloud BEV feature B_ai^pc, the 2D feature map is flattened into a feature matrix of dimension M×C, where M = H×W. The feature vectors are then transformed into query (Q), key (K) and value (V) parts, and the updated feature vectors are computed with the self-attention formula. This process can be described as

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the key vectors, and the softmax(·) function ensures that all weights sum to 1. Finally, the updated feature vectors are reshaped back to the original dimension H×W×C. The BEV features of all vehicles are updated in this way and stacked to obtain the fusion result B_fused^pc.
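One common way to realize the fusion over the locality graph, similar in spirit to the description above, is to stack the N vehicles' maps and let the N feature vectors at each BEV cell attend to each other; the single-head formulation, channel size and projection layout below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LocalSelfAttentionFusion(nn.Module):
    """Single-head sketch: at every BEV cell the N feature vectors coming from
    N vehicles attend to each other and are replaced by the attention-weighted
    sum, mirroring softmax(QK^T / sqrt(d_k))V above."""

    def __init__(self, C=256):
        super().__init__()
        self.q = nn.Linear(C, C)
        self.k = nn.Linear(C, C)
        self.v = nn.Linear(C, C)

    def forward(self, bev_stack):                                 # (N, C, H, W)
        N, C, H, W = bev_stack.shape
        x = bev_stack.permute(2, 3, 0, 1).reshape(H * W, N, C)    # M x N x C graph nodes
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / C ** 0.5, dim=-1)
        out = attn @ v                                            # updated node features
        return out.reshape(H, W, N, C).permute(2, 3, 0, 1)        # back to (N, C, H, W)
```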

By fusing homogeneous modality features from multiple sources, the resulting fused features better integrate and express the observations made by the same type of sensor on multiple vehicles, capturing the richness and diversity of the environment from multiple viewpoints. This also lays a solid foundation for the subsequent heterogeneous feature fusion: deeply fusing the complementary information between different sensing modalities can further enrich the description of the environment and make full use of the value of all sensing resources.

(2) Heterogeneous modality feature fusion

In the heterogeneous modality fusion stage, the invention integrates the fused BEV features of the different modalities to exploit the complementary information between the lidar and camera data. As shown in Figure 2, three common fusion strategies are adopted to combine the modalities effectively and generate a comprehensive representation: channel-level concatenation, element-level summation and Transformer fusion. When all feature information of both the image and the point cloud data must be used simultaneously, channel-level concatenation fusion is adopted; this method is simple and intuitive and suits situations where the features are strongly and directly correlated and their dimensions match. When the features of the two data sources need to be combined quickly and simply, element-level summation fusion is adopted; it suits situations where the feature dimensions are the same and the physical meaning of each element is similar. When the environment or scene is highly complex and the latent, non-intuitive associations between the lidar data and the image data need to be mined in depth, Transformer fusion is particularly suitable. Specifically, given the fused image features B_fused^img and point cloud features B_fused^pc, the fusion process is as follows:

① Channel-level concatenation fusion

Channel-level concatenation is an intuitive and widely used feature fusion method. It relies mainly on the spatial consistency of the features, i.e., features from different modalities at the same spatial position are considered related. In channel-level concatenation fusion, the image and point cloud features are first concatenated along the channel dimension to obtain a feature tensor B_cat of dimension H×W×4C. The concatenated features are then fed into two convolutional layers for further processing. The first convolutional layer captures spatial relationships and extracts valuable features from the concatenated data; the second convolutional layer reduces the channel dimension to further refine the fused features while ensuring that key information is retained after compression. The two convolutional layers need not be identical and can be designed separately according to the characteristics of the features.
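A minimal sketch of this two-convolution refinement, with kernel sizes, activation and channel widths as illustrative assumptions:

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Sketch of channel-level concatenation fusion: the two BEV maps are
    concatenated along the channel axis and refined by two convolutions,
    the second of which compresses the channels."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)  # spatial mixing
        self.conv2 = nn.Conv2d(in_ch, out_ch, kernel_size=1)            # channel reduction
        self.act = nn.ReLU(inplace=True)

    def forward(self, bev_img, bev_pc):
        x = torch.cat([bev_img, bev_pc], dim=1)   # e.g. 4C channels after concatenation
        return self.conv2(self.act(self.conv1(x)))

# e.g. fuse = ConcatFusion(in_ch=4 * C, out_ch=C)
```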

② Element-level summation fusion

Element-level summation aims to fuse the complementary information of the two modalities while preserving the spatial structure. In the element-wise summation fusion module, the lidar features B_fused^pc are first fed into a 1×1 convolutional layer that down-samples the channel dimension. The image features and the down-sampled lidar features are then added element by element to obtain the fused features. This element-wise summation strategy ensures that the two modalities contribute equally to the final fusion result.
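A minimal sketch of this summation path, with the channel counts as illustrative assumptions:

```python
import torch.nn as nn

class SumFusion(nn.Module):
    """Sketch of element-level summation fusion: a 1x1 convolution brings the
    lidar BEV map down to the image map's channel count, then the two maps
    are added element by element."""

    def __init__(self, lidar_ch=256, img_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(lidar_ch, img_ch, kernel_size=1)  # channel down-sampling

    def forward(self, bev_img, bev_lidar):
        return bev_img + self.reduce(bev_lidar)                   # element-wise sum
```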

③ Transformer fusion

In recent years, the Transformer architecture has achieved great success in natural language processing, especially in capturing long-range dependencies in sequential data. In multimodal fusion, the Transformer has attracted wide attention for its excellent performance and its ability to model long-range dependencies. In particular, its internal self-attention mechanism provides a powerful framework for the interaction between features of different modalities. In heterogeneous modality feature fusion, the features of each modality can be regarded as a "sequence" in which every "element" represents a part of the spatial information.

First, the BEV features of the image and the point cloud are concatenated along the channel dimension to form a joint feature tensor B_joint:

B_joint = Concat(B_fused^img, B_fused^pc)

Since the original Transformer structure does not contain any information about the positions of the elements, a positional encoding must be combined with the feature vectors by addition. A positional encoding is therefore added to the joint features so that the model can recognize the spatial position of each feature:

B_pos = B_joint + PositionalEncoding

Next, to capture the complex interactions between the two modalities, the invention introduces a multi-head self-attention mechanism. This mechanism lets the model assign different weights to each part of the different modalities, and the multi-head design allows the model to capture the correlations between the features of the different modalities from multiple perspectives:

B_att = MultiHeadSelfAttention(B_pos)

Then, a feed-forward network further strengthens the nonlinear processing capability of the model:

B_FFN = FeedForwardNetwork(B_att)

Finally, to ensure the stability of the model and maintain the feature scale, residual connections and layer normalization are applied at every step of the Transformer fusion described above to obtain the final fused features B_fused.
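A minimal sketch of one such fusion block is given below; the token and channel sizes, head count and the use of a learned (rather than fixed) positional encoding are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TransformerFusion(nn.Module):
    """Sketch of the Transformer fusion: the joint BEV tensor is flattened
    into a token sequence, a positional encoding is added, and one block of
    multi-head self-attention plus a feed-forward network, each with a
    residual connection and layer normalization, produces the fused features."""

    def __init__(self, channels=128, heads=4, max_tokens=40000):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, channels))
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(channels, 4 * channels), nn.ReLU(),
                                 nn.Linear(4 * channels, channels))
        self.ln1 = nn.LayerNorm(channels)
        self.ln2 = nn.LayerNorm(channels)

    def forward(self, bev_img, bev_pc):                  # each (B, channels/2, H, W)
        B, _, H, W = bev_img.shape
        joint = torch.cat([bev_img, bev_pc], dim=1)      # B_joint
        tokens = joint.flatten(2).transpose(1, 2)        # (B, H*W, channels)
        tokens = tokens + self.pos[:, : H * W]           # positional encoding
        att, _ = self.attn(tokens, tokens, tokens)       # multi-head self-attention
        tokens = self.ln1(tokens + att)                  # residual + layer norm
        tokens = self.ln2(tokens + self.ffn(tokens))     # feed-forward block
        return tokens.transpose(1, 2).reshape(B, -1, H, W)   # B_fused as a BEV map
```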

Overall, this Transformer-based fusion strategy allows the algorithm to fully consider and exploit the complex interactions and dependencies between the two modalities, providing a highly rich and representative feature representation for downstream tasks.

Step 3: Detection head network

The detection head network predicts the class and position of targets from the fused features. It contains 3 deconvolution layers that upsample the feature maps, providing finer-grained information for the subsequent class and position prediction. After the feature maps have been upsampled by the three deconvolution layers and concatenated together, they are fed to the two branches of the detection head: a class prediction branch and a bounding-box regression branch. The class prediction branch outputs a score for each anchor box indicating whether an object of a given vehicle class is present in that anchor box, together with the probability of its presence; after an activation function, these scores can be converted into probability distributions over the classes. The bounding-box regression branch refines the position information of the target. It predicts 4 values for each anchor box, representing the offset of the center position (x, y) and the changes in the width and height of the box; with these predicted values, the anchor box can be corrected so that it encloses the target object more tightly. Combining the outputs of the two branches, the detection head network outputs the target class and the adjusted position information for each anchor box.
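A minimal sketch of such a head is shown below; the channel sizes, strides and anchor count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the detection head: three transposed convolutions upsample
    the fused BEV map, their outputs are concatenated, and 1x1 convolutions
    produce per-anchor class scores and the 4 box regression values
    (dx, dy, dw, dh)."""

    def __init__(self, in_ch=256, num_anchors=2):
        super().__init__()
        self.deconvs = nn.ModuleList(
            [nn.ConvTranspose2d(in_ch, 128, kernel_size=2, stride=2) for _ in range(3)]
        )
        self.cls_head = nn.Conv2d(3 * 128, num_anchors, kernel_size=1)      # score per anchor
        self.reg_head = nn.Conv2d(3 * 128, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh)

    def forward(self, fused_bev):                    # (B, in_ch, H, W)
        ups = [d(fused_bev) for d in self.deconvs]   # upsampled feature maps
        x = torch.cat(ups, dim=1)                    # concatenated features
        return self.cls_head(x), self.reg_head(x)
```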

The effect of the present invention is further described below in conjunction with experiments.

1. Experimental setup:

The experiments of the present invention were conducted on the OPV2V dataset. The dataset contains 11,464 frames in total and is divided into two subsets: the CARLA default town subset and the Culver City digital town subset. The CARLA default town subset accounts for 10,914 frames and is split into training, validation and test parts of 6,764, 1,980 and 2,170 frames, respectively. This subset covers a variety of scenes of different complexity and provides sufficient training and evaluation data for the collaborative perception model. The Culver City subset has only 550 frames, but its purpose is to evaluate the generalization of the model to real-world scenarios, especially urban environments that challenge the model's perception capability.

In terms of implementation details, the proposed algorithm is based on the PyTorch framework and was trained on a PC with an NVIDIA RTX 4090 GPU with 24 GB of memory. During training, a group of vehicles that can establish communication in the scene was randomly selected, and the communication range of each vehicle was set to 70 m. The point cloud range was set to [-140.8, 14.8] × [-40, 40] × [-3, 1] along the x, y and z axes, and the voxel resolution was 0.4 m. To increase the diversity of the training data, a series of data augmentation techniques was applied, such as random flipping, scaling within ±0.05 and rotation within ±45°. The model was trained with the Adam optimizer, an initial learning rate of 0.002 and a batch size of 2. In addition, an early stopping strategy based on the validation loss was used to prevent overfitting. The detailed architecture and other parameters of the proposed algorithm are shown in Figure 3.

For performance evaluation, the invention adopts the standard metric of Average Precision (AP). Average precision is a commonly used object detection metric that measures the accuracy of an algorithm under different Intersection over Union (IoU) thresholds. When computing AP, an IoU threshold must be set in advance; the IoU is the ratio of the overlapping area of the predicted box and the ground-truth box to the area of their union, and a detection is considered correct if its IoU exceeds the threshold. The invention computes the AP of the model at IoU thresholds of 0.5 and 0.7, i.e., AP@0.5 and AP@0.7. These two metrics evaluate the algorithm under different levels of strictness and thus measure its performance on the object detection task more comprehensively.
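A minimal sketch of the IoU test behind AP@0.5 and AP@0.7, written for axis-aligned 2D boxes given as (x1, y1, x2, y2); the box layout is an illustrative assumption:

```python
def iou(box_a, box_b):
    """Intersection area divided by union area for two axis-aligned boxes.
    A detection counts as correct when its IoU with a ground-truth box
    exceeds the chosen threshold (0.5 or 0.7)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# e.g. a prediction is a true positive for AP@0.7 when iou(pred, gt) >= 0.7
```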

According to the different heterogeneous modality fusion strategies, the invention builds three versions of the multimodal fusion model: channel-level concatenation fusion, element-level summation fusion and Transformer fusion. In addition, to verify the effect of the proposed multimodal fusion model more comprehensively, the proposed algorithm is compared with several existing state-of-the-art algorithms, including Cooper, F-Cooper, V2VNet, AttFuse and CoBEVT.

Besides the general scenario in which all vehicles can obtain both point cloud and image data simultaneously, the invention also considers a more realistic and complex heterogeneous modality scenario. In this scenario, each vehicle can obtain only one of the two modalities, point cloud or image, which simulates the performance of the proposed algorithm when various modalities are missing.

2. Analysis of quantitative experimental results:

Table 1 comprehensively compares the performance of various state-of-the-art algorithms on the OPV2V dataset.

Table 1 Comprehensive performance comparison with state-of-the-art algorithms

First, all collaborative perception algorithms significantly outperform single-vehicle perception on all metrics in all scenarios, which shows that cooperation between multiple vehicles helps them perceive the surrounding environment better. Second, the early fusion strategy outperforms late fusion in most cases, indicating that, when communication bandwidth is not a constraint, fusing data at an early stage preserves the original scene information better. Moreover, the three implementations of the proposed algorithm with different fusion strategies (element-level summation fusion, channel-level concatenation fusion and Transformer fusion) are highly competitive among all compared algorithms; in particular, the multimodal fusion model using the Transformer architecture achieves higher AP scores under both IoU thresholds. Although Transformer fusion is slightly below the CoBEVT algorithm on AP@0.7 in the CARLA default town subset, it leads all other compared algorithms on both AP@0.5 and AP@0.7 in the more challenging Culver City digital town subset.

Table 2 shows the performance comparison of the proposed algorithm in heterogeneous modality scenarios.

Table 2 Performance comparison of the proposed algorithm in different heterogeneous modality scenarios

Here, the "center vehicle image modality" setting means that half of all vehicles (including the center vehicle) can obtain only image data while the other half can obtain only point cloud data; in contrast, the "center vehicle point cloud modality" setting means that half of the vehicles (including the center vehicle) can obtain only point cloud data. As Table 2 shows, the performance in the pure image mode is clearly lower, especially AP@0.7 in the Culver City scenario, which is only 8.6%. The pure point cloud mode clearly outperforms the pure image mode, mainly because 3D point clouds provide accurate depth measurements of the scene and are therefore better suited to 3D object detection. Considering the mixed heterogeneous settings further, both the center vehicle image modality and the center vehicle point cloud modality perform relatively well. In particular, when the center vehicle can obtain only point cloud data (center vehicle point cloud modality), the performance is close to the pure point cloud mode, indicating that the data modality of the center vehicle plays a key role in V2V collaborative detection. For the center vehicle image modality, although its performance is slightly lower than that of the center vehicle point cloud modality, it is still clearly better than the pure image mode, showing that even when half of the vehicles can obtain only image data, the presence of point cloud data can still greatly enhance the detection performance of the system. The experimental results in Table 2 verify that the proposed algorithm has relatively robust performance across different heterogeneous modality scenarios, especially when modality data are missing. They also highlight once more the importance of point cloud data in V2V collaborative detection and the influence of the center vehicle on overall detection performance.

Localization error is one of the complicating factors that must be considered in real-world scenes. To analyze the robustness of the model to localization error, coordinate noise and heading-angle noise are sampled from Gaussian distributions and added to the accurate localization data to simulate localization error. Figure 4 compares the proposed multimodal fusion algorithm (Transformer fusion) with the two single-modality schemes (point-cloud-only mode and image-only mode) under different levels of localization error. As the localization noise increases, the performance of all three models declines overall; the point-cloud-only mode is the most sensitive: when the error grows to 0.2 its performance drops by more than 4%, and at an error of 0.4 it drops by more than 10%. In contrast, the image-only mode is less affected by localization error and degrades gently, mainly because the 3D geometric features extracted from lidar data are more sensitive to coordinate-axis offsets than the visual features extracted from images. Furthermore, because the multimodal fusion model (Transformer fusion) carries a certain amount of redundant information and therefore does not depend too heavily on any single modality, it tolerates localization error well while maintaining high performance.
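For reference, the noise-injection step used in this robustness study can be sketched as follows; the (x, y, yaw) pose layout, the function name and the default standard deviations are illustrative assumptions, not part of the original disclosure:

    import numpy as np

    def add_pose_noise(pose, pos_std=0.2, yaw_std=0.2, rng=None):
        # Sample Gaussian noise for the x/y position and the heading angle and
        # add it to the otherwise accurate pose of a cooperating vehicle before
        # its BEV features are projected into the ego frame.
        rng = rng if rng is not None else np.random.default_rng()
        x, y, yaw = pose
        x = x + rng.normal(0.0, pos_std)
        y = y + rng.normal(0.0, pos_std)
        yaw = yaw + rng.normal(0.0, yaw_std)
        return x, y, yaw

    # Example: noisy_pose = add_pose_noise((12.3, -4.5, 90.0), pos_std=0.4)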

2. Analysis of qualitative test results:

Figure 5 shows visualized detection results of the three schemes on the OPV2V dataset. As the figure shows, the camera-only and lidar-only models are affected to varying degrees by factors such as semantic ambiguity and uncertainty, leading to missed and false detections. In contrast, the multi-sensor fusion model proposed in the present invention achieves significant improvements in detection accuracy and robustness. By exploiting the complementary strengths of the lidar and camera modalities, the fusion model effectively compensates for the limitations of the individual sensors and achieves more accurate and reliable object detection.

Claims (6)

1. A collaborative perception method based on multimodal fusion, characterized by comprising the following steps:

Step 1: multimodal feature extraction. For multi-view image data, the CaDDN architecture is used; it comprises four modules, namely an encoder, depth estimation, voxel transformation and collapse, which ensure that information as rich and accurate as possible is captured from the input image, so that the extracted image features comprehensively represent the key visual information of the input image, the key visual information including edges, texture, color and scene depth. For point cloud data, PointPillar is selected as the feature extractor. First, for a given 3D point p, its position p = (i, j, l) in the pillar coordinate system is determined; the three-dimensional space containing the point p is divided into a series of evenly spaced pillar structures, converting the complex 3D data into a 2D structure. Then all pillars are flattened along their own height direction to generate a pseudo-image. Finally, further feature encoding and integration are performed through a series of two-dimensional convolutions.

The two-dimensional convolutional network is used to extract spatial features and to strengthen the feature correlation between different pillars; through layer-by-layer convolutions and activation functions the network captures and encodes more complex spatial patterns, improving the expressiveness of the features. After processing by the convolutional network, a set of encoded high-dimensional feature maps is obtained; the high-dimensional feature maps represent the rich spatial information and object features of the original point cloud and are integrated through pooling and concatenation operations to form the final feature representation used in the subsequent detection task.

Step 2: multimodal feature fusion. Each vehicle first compresses and encodes its own BEV features and then sends them to the center vehicle; once the center vehicle has successfully received the BEV features from all other vehicles, it fuses the received BEV features to generate a more global and detailed scene representation. The fusion first performs homogeneous-modality feature fusion and then heterogeneous-modality feature fusion.

Step 3: detection head network. The detection head network predicts the category and position of targets from the fused features. It contains three deconvolution layers, which upsample the feature maps to provide finer-grained information for the subsequent category and position predictions. After the three deconvolution layers, the upsampled feature maps are concatenated and fed to a category prediction branch and a bounding-box regression branch. The category prediction branch outputs a score for each anchor box, indicating the vehicle category in that anchor box and its probability of existence; the bounding-box regression branch refines the position information of the target, predicting four values for each anchor box that represent the offset of the center position (x, y) and the changes in the width and height of the box. Combining the outputs of the category prediction branch and the bounding-box regression branch, the detection head network outputs the target category and the adjusted position information for each anchor box.

2. The collaborative perception method based on multimodal fusion according to claim 1, characterized in that the homogeneous-modality feature fusion comprises the following steps:

For image BEV features, an element-wise maximization operation is used to fuse the image BEV features of the different vehicles; by comparing the image BEV features of each vehicle and taking the element-wise maximum, the finally fused feature is guaranteed to contain the most salient characteristics observed by all vehicles. When the center vehicle receives the image BEV features sent by the other vehicles, it compares them element by element with its own image BEV features and selects the maximum value as the fusion result; the max(·) operation is performed element-wise, which ensures that the fused feature inherits the most salient and informative parts from the multiple vehicles.

For point-cloud BEV features, fusion is performed with a self-attention mechanism. This process requires building a local graph, a data structure that represents the relationships between feature vectors of the same spatial position in different vehicles. In practice, the BEV feature vectors of each vehicle are first determined and mapped into a unified coordinate system; then, for each BEV feature vector, a node is created in the local graph and connections are established between spatially adjacent nodes. For every node in the local graph, attention scores with respect to its adjacent nodes are computed; the attention scores determine how strongly the information of each adjacent node influences the current node during fusion. According to the computed attention scores, the feature vector of each node is then updated as the weighted sum of the features of its neighboring nodes. After iterative updates by the self-attention mechanism, the feature vector of each node fuses information coming from different vehicles but located at the same or nearby spatial positions. Finally, the updated feature vectors are aggregated to form a fused BEV feature representation for the subsequent detection task.

3. The collaborative perception method based on multimodal fusion according to claim 1, characterized in that the heterogeneous-modality feature fusion comprises the following steps:

In the heterogeneous-modality fusion stage, the fused BEV features of the different modalities are integrated to exploit the complementary information between lidar and camera data, using three common fusion strategies: channel-level concatenation, element-level summation and Transformer fusion. When all feature information of the image and point-cloud data needs to be used simultaneously, channel-level concatenation fusion is adopted; this method is simple and intuitive and suits cases in which the features are strongly and directly correlated and their dimensions match. When the features of the two data sources need to be combined quickly and simply, element-level summation fusion is adopted; it suits cases in which the feature dimensions are the same and corresponding elements have similar physical meaning. When the environment or scene being processed is highly complex and the latent, non-obvious associations between the lidar data and the image data need to be mined in depth, Transformer fusion is adopted.

4. The collaborative perception method based on multimodal fusion according to claim 3, characterized in that the channel-level concatenation fusion comprises the following steps:

The image and point-cloud features are first concatenated along the channel dimension to obtain a feature tensor, and the concatenated features are then fed into two convolutional layers for further processing. The first convolutional layer captures spatial relationships and extracts valuable features from the concatenated data; the second convolutional layer then reduces the channel dimension to further refine the fused features while ensuring that key information is retained after compression. The two convolutional layers need not be identical and can be designed separately according to the characteristics of the features.

5. The collaborative perception method based on multimodal fusion according to claim 3, characterized in that the element-level summation fusion comprises the following steps:

The lidar features are fed into a 1×1 convolutional layer to downsample the channel dimension; the image features and the downsampled lidar features are then added element by element to obtain the fused features.

6. The collaborative perception method based on multimodal fusion according to claim 3, characterized in that the Transformer fusion comprises the following steps:

First, the BEV features of the image and the point cloud are concatenated along the channel dimension to form a joint feature tensor, and a positional encoding is added to the feature vectors to strengthen the model's awareness of the spatial positions of the features. Next, to capture the complex interactions between the two modalities, a multi-head self-attention mechanism is introduced after the positional encoding has been added to the joint feature tensor; it assigns different weights to each part of the different modalities and allows the model to capture the correlations between the features of the different modalities from multiple perspectives. A feed-forward network then further strengthens the nonlinear processing capability of the model. Finally, residual connections and layer normalization are applied at every step of the Transformer fusion to obtain the final fused features.
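The short sketches below are illustrative interpretations of the operations named in the claims, written in PyTorch-style Python; every shape, channel count, class name and default value is an assumption introduced for illustration and is not part of the original disclosure. First, the flattening of pillars into a pseudo-image described in claim 1 (step 1) amounts to scattering the encoded pillar vectors onto a dense BEV canvas, which two-dimensional convolutions can then process:

    import torch

    def pillar_scatter(pillar_features, coords, H, W):
        # pillar_features: (P, C) encoded feature vector of each non-empty pillar
        # coords:          (P, 2) integer (i, j) BEV indices of each pillar
        # returns a (C, H, W) pseudo-image
        C = pillar_features.shape[1]
        canvas = torch.zeros(C, H * W, dtype=pillar_features.dtype)
        flat_idx = coords[:, 0] * W + coords[:, 1]   # linearize (i, j) -> i * W + j
        canvas[:, flat_idx] = pillar_features.t()    # place each pillar vector on the canvas
        return canvas.view(C, H, W)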
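The element-wise maximization that claim 2 applies to the image BEV features of the cooperating vehicles can be sketched, under the same caveats, as:

    import torch

    def fuse_image_bev(ego_feat, received_feats):
        # ego_feat:       (C, H, W) image BEV feature of the center vehicle
        # received_feats: list of (C, H, W) image BEV features from other vehicles,
        #                 already warped into the center vehicle's coordinate frame
        fused = ego_feat
        for f in received_feats:
            fused = torch.maximum(fused, f)   # element-wise max keeps the most salient response
        return fused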
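The self-attention fusion of point-cloud BEV features in claim 2 builds a local graph over spatially adjacent nodes; the sketch below is a deliberate simplification that treats the features of the N cooperating vehicles at each BEV cell as the node set and re-weights them with scaled dot-product attention:

    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, c=256):
            super().__init__()
            self.q = nn.Linear(c, c)   # query projection
            self.k = nn.Linear(c, c)   # key projection
            self.v = nn.Linear(c, c)   # value projection

        def forward(self, feats):                      # (N, C, H, W): N cooperating vehicles
            N, C, H, W = feats.shape
            x = feats.flatten(2).permute(2, 0, 1)      # (H*W, N, C): one small node set per BEV cell
            q, k, v = self.q(x), self.k(x), self.v(x)
            att = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (H*W, N, N) attention scores
            out = att @ v                              # weighted sum of neighboring node features
            fused = out[:, 0]                          # keep the updated ego (center-vehicle) node
            return fused.t().reshape(C, H, W)          # back to a single BEV feature map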
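A sketch of the channel-level concatenation fusion of claim 4, assuming 256-channel BEV maps for both modalities:

    import torch
    import torch.nn as nn

    class ConcatFusion(nn.Module):
        def __init__(self, c_img=256, c_pc=256, c_out=256):
            super().__init__()
            c_cat = c_img + c_pc
            # first convolution: capture spatial relations in the concatenated tensor
            self.conv1 = nn.Sequential(nn.Conv2d(c_cat, c_cat, 3, padding=1), nn.ReLU(inplace=True))
            # second convolution: compress the channel dimension while keeping key information
            self.conv2 = nn.Sequential(nn.Conv2d(c_cat, c_out, 3, padding=1), nn.ReLU(inplace=True))

        def forward(self, f_img, f_pc):          # both (B, C, H, W)
            x = torch.cat([f_img, f_pc], dim=1)  # concatenate along the channel axis
            return self.conv2(self.conv1(x))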
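The element-level summation fusion of claim 5, in which a 1×1 convolution first brings the lidar BEV channels in line with the image BEV channels (channel counts assumed):

    import torch.nn as nn

    class SumFusion(nn.Module):
        def __init__(self, c_pc=256, c_img=256):
            super().__init__()
            self.reduce = nn.Conv2d(c_pc, c_img, kernel_size=1)  # 1x1 conv downsamples the channel dimension

        def forward(self, f_img, f_pc):          # (B, c_img, H, W) and (B, c_pc, H, W)
            return f_img + self.reduce(f_pc)     # element-wise addition of the two BEV maps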
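A compact sketch of the Transformer fusion of claim 6: channel concatenation, a learned positional encoding, multi-head self-attention and a feed-forward network, each step wrapped with a residual connection and layer normalization; the embedding size, head count and maximum BEV resolution are assumptions:

    import torch
    import torch.nn as nn

    class TransformerFusion(nn.Module):
        def __init__(self, c_img=256, c_pc=256, d_model=256, n_heads=8, max_tokens=100 * 252):
            super().__init__()
            self.proj = nn.Linear(c_img + c_pc, d_model)                  # joint feature tensor -> tokens
            self.pos = nn.Parameter(torch.zeros(1, max_tokens, d_model))  # learned positional encoding
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, f_img, f_pc):                        # both (B, C, H, W)
            B, _, H, W = f_img.shape
            x = torch.cat([f_img, f_pc], dim=1)                # concatenate along channels
            x = x.flatten(2).transpose(1, 2)                   # (B, H*W, c_img + c_pc)
            x = self.proj(x) + self.pos[:, :H * W]             # add positional encoding
            attn_out, _ = self.attn(x, x, x)                   # multi-head self-attention
            x = self.norm1(x + attn_out)                       # residual connection + layer norm
            x = self.norm2(x + self.ffn(x))                    # feed-forward + residual + layer norm
            return x.transpose(1, 2).reshape(B, -1, H, W)      # back to a BEV feature map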
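Finally, the detection head of claim 1 (step 3) can be sketched as three deconvolution layers that bring multi-scale BEV maps to a common resolution, followed by concatenation and two 1×1 prediction branches; the input strides, channel counts and number of anchors per location are assumptions:

    import torch
    import torch.nn as nn

    class DetectionHead(nn.Module):
        def __init__(self, in_channels=(64, 128, 256), c_up=128, n_anchors=2):
            super().__init__()
            # one deconvolution per scale, bringing every map back to the finest resolution
            self.deconvs = nn.ModuleList([
                nn.ConvTranspose2d(c, c_up, kernel_size=2 ** i, stride=2 ** i)
                for i, c in enumerate(in_channels)])
            c_cat = len(in_channels) * c_up
            self.cls_head = nn.Conv2d(c_cat, n_anchors, kernel_size=1)      # category / existence score
            self.reg_head = nn.Conv2d(c_cat, 4 * n_anchors, kernel_size=1)  # (dx, dy, dw, dh) per anchor

        def forward(self, feats):   # list of fused BEV maps at relative strides 1, 2 and 4
            up = torch.cat([d(f) for d, f in zip(self.deconvs, feats)], dim=1)
            return self.cls_head(up), self.reg_head(up)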
CN202410436656.1A 2024-04-11 2024-04-11 Collaborative sensing method based on multi-mode fusion Pending CN118429764A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410436656.1A CN118429764A (en) 2024-04-11 2024-04-11 Collaborative sensing method based on multi-mode fusion

Publications (1)

Publication Number Publication Date
CN118429764A true CN118429764A (en) 2024-08-02

Family

ID=92306034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410436656.1A Pending CN118429764A (en) 2024-04-11 2024-04-11 Collaborative sensing method based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN118429764A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118797532A (en) * 2024-09-12 2024-10-18 诺比侃人工智能科技(成都)股份有限公司 Multi-modal fusion intelligent detection system for railway track contact network suspension monitoring
CN118797532B (en) * 2024-09-12 2024-11-15 诺比侃人工智能科技(成都)股份有限公司 Multi-mode fusion intelligent detection system for railway track contact net suspension monitoring


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination